Urdu News Clustering Using K-Mean Algorithm On The Basis Of Jaccard Coefficient And Dice Coefficient Similarity

Zahid Rahman; Altaf Hussain; Hussain Shah; Muhammad Arshad

doi:10.14201/ADCAIJ2021104381399

Zahid Rahman
Institute of Computer Sciences & IT (ICS/IT), The University of Agriculture Peshawar, Pakistan

Altaf Hussain
Institute of Computer Science and IT, The University of Agriculture, Peshawar, Pakistan, altafkfm74[at]gmail.com

Hussain Shah
Shaykh Zayed Islamic Centre, University of Peshawar, Pakistan

Muhammad Arshad
City University of Science and Information Technology Peshawar, Pakistan

Abstract

Clustering is an unsupervised machine learning process that groups data objects into clusters such that objects within the same cluster are highly similar to one another. The quantity of Urdu text on the internet is increasing rapidly every day. Grouping Urdu news manually is almost impossible, so there is a pressing need to devise a mechanism that clusters Urdu news documents based on their similarity. Clustering Urdu news documents accurately is an open research problem, and it can be addressed by combining similarity measures, namely the Jaccard and Dice coefficients, with the k-mean clustering algorithm. In this research, the Jaccard and Dice coefficients were used to compute similarity scores for Urdu news documents in the Python programming language. For clustering, the similarity results were loaded into the Waikato Environment for Knowledge Analysis (WEKA), where the k-mean algorithm grouped the Urdu news documents into five clusters. The clustering results were evaluated in terms of Accuracy and Mean Square Error (MSE). The Jaccard coefficient achieved an Accuracy of 85% and an MSE of 44.4%, while the Dice coefficient achieved an Accuracy of 87% and an MSE of 35.76%. The experimental results show that the Dice coefficient outperforms the Jaccard coefficient in both Accuracy and MSE.
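To make the two similarity measures named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each document is reduced to a set of whitespace-separated tokens; the abstract does not describe the paper's actual preprocessing. The function names `jaccard` and `dice` and the three toy documents are hypothetical and chosen only for illustration. For token sets A and B, Jaccard is |A ∩ B| / |A ∪ B| and Dice is 2|A ∩ B| / (|A| + |B|).

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a: set, b: set) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) (0.0 if both sets are empty)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Hypothetical toy documents, not taken from the paper's corpus.
docs = {
    "sports_1": "پاکستان نے کرکٹ میچ جیت لیا",
    "sports_2": "پاکستان نے ہاکی میچ جیت لیا",
    "business_1": "سٹاک مارکیٹ میں تیزی",
}

# Assumption: plain whitespace tokenisation, no stop-word removal or stemming.
tokens = {name: set(text.split()) for name, text in docs.items()}

# Pairwise similarity scores for every pair of documents.
for x, y in combinations(tokens, 2):
    j = jaccard(tokens[x], tokens[y])
    d = dice(tokens[x], tokens[y])
    print(f"{x} vs {y}: jaccard={j:.3f} dice={d:.3f}")
```

For any pair of sets the two scores are related by Dice = 2J / (1 + J), so Dice is never lower than Jaccard and both rank document pairs in the same order. Scores like these could then be exported, for example as one row of similarities per document, to a clustering tool such as WEKA.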

Rahman, Z., Hussain, A., Shah, H., & Arshad, M. (2022). Urdu News Clustering Using K-Mean Algorithm On The Basis Of Jaccard Coefficient And Dice Coefficient Similarity. ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal, 10(4), 381–399. https://doi.org/10.14201/ADCAIJ2021104381399

Author Biography

Altaf Hussain, Institute of Computer Science and IT, The University of Agriculture, Peshawar, Pakistan

MS Scholar (Computer Networks)

Editorial dates

Submitted: 15-09-2021
Accepted: 15-10-2021
Published: 08-02-2022

Issue

Vol. 10 No. 4 (2021)

Section

Articles

Keywords

Urdu News
Clustering Mechanism
Jaccard Coefficient
Dice coefficient
Python
WEKA
K-mean
MSE

Supporting agencies

This research received no funding.

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.