Comparison of Pre-trained vs Custom-trained Word Embedding Models for Word Sense Disambiguation

Muhammad Farhat Ullah; Ali Saeed; Naveed Hussain

doi:10.14201/adcaij.31084

Muhammad Farhat Ullah
Department of Software Engineering, University of Lahore, Lahore, Pakistan

Ali Saeed
Faculty of Information Technology, Department of Software Engineering, University of Central Punjab, Lahore, Pakistan

Naveed Hussain
Faculty of Information Technology, Department of Software Engineering, University of Central Punjab, Lahore, Pakistan
dr.naveedhussain[at]ucp.edu.pk

Abstract

The primary objective of word sense disambiguation (WSD) is to build systems that can automatically recognize the intended meaning (sense) of an ambiguous word in a sentence. WSD can benefit a range of natural language processing (NLP) and human-computer interaction (HCI) applications. Researchers have explored a wide variety of methods to resolve sense ambiguity, but their focus has mostly been on English and a few other widely studied languages. Urdu, with more than 300 million users and a large amount of electronic text available on the web, remains largely unexplored. In recent years, word embedding methods have proven extremely successful for a variety of NLP tasks. This study applies, evaluates, and compares a variety of word embedding approaches for Urdu WSD (both Lexical Sample and All-Words), including pre-trained models (Word2Vec, GloVe, and FastText) and custom-trained models (Word2Vec, GloVe, and FastText trained on the Ur-Mono corpus). Two benchmark corpora are used for the evaluation: (1) the UAW-WSD-18 corpus and (2) the ULS-WSD-18 corpus. For the Urdu All-Words WSD task, the best results (Accuracy=60.07 and F1=0.45) were achieved using pre-trained FastText. For the Lexical Sample WSD task, the best results (Accuracy=70.93 and F1=0.60) were achieved using the custom-trained GloVe word embedding method.
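To make the evaluation setup concrete, the following is a minimal sketch of one common way to use word embeddings for WSD: average the embeddings of the context words, build one centroid per sense from labelled examples, and assign a test context to the nearest centroid by cosine similarity. This is an illustration only and is not taken from the paper. The toy three-dimensional vectors, the English tokens, the sense labels, the nearest-centroid classifier, and the choice of macro-averaged F1 are all assumptions; the abstract does not state how the embeddings were fed to a classifier or which F1 averaging was used.

```python
import math
from collections import defaultdict

# Hypothetical toy embeddings (assumption). A real experiment would load
# pre-trained or custom-trained Word2Vec, GloVe, or FastText vectors instead.
EMB = {
    "river": [0.9, 0.1, 0.0],
    "water": [0.8, 0.2, 0.1],
    "shore": [0.7, 0.0, 0.2],
    "money": [0.1, 0.9, 0.0],
    "loan": [0.0, 0.8, 0.2],
    "cash": [0.2, 0.9, 0.1],
}
DIM = 3


def context_vector(tokens):
    """Average the embeddings of the context tokens found in the vocabulary."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(examples):
    """Build one centroid per sense from (context tokens, sense) pairs."""
    grouped = defaultdict(list)
    for tokens, sense in examples:
        grouped[sense].append(context_vector(tokens))
    return {
        sense: [sum(col) / len(vecs) for col in zip(*vecs)]
        for sense, vecs in grouped.items()
    }


def predict(centroids, tokens):
    """Return the sense whose centroid is most similar to the context."""
    vec = context_vector(tokens)
    return max(sorted(centroids), key=lambda s: cosine(vec, centroids[s]))


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-sense F1 over the senses present in gold."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labelled contexts for one ambiguous word (assumption).
train = [
    (["river", "water"], "bank/river"),
    (["money", "loan"], "bank/finance"),
]
test = [
    (["shore", "water"], "bank/river"),
    (["cash", "loan"], "bank/finance"),
    (["cash", "money"], "bank/finance"),
]

centroids = train_centroids(train)
gold = [sense for _, sense in test]
predictions = [predict(centroids, tokens) for tokens, _ in test]
print("accuracy:", accuracy(gold, predictions))
print("macro F1:", macro_f1(gold, predictions))
```

Under these assumptions, comparing pre-trained and custom-trained models amounts to swapping the contents of `EMB` while keeping the classifier and the metrics fixed, so any difference in accuracy or F1 can be attributed to the embeddings.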


Farhat Ullah, M., Saeed, A., & Hussain, N. (2023). Comparison of Pre-trained vs Custom-trained Word Embedding Models for Word Sense Disambiguation. ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal, 12(1), e31084. https://doi.org/10.14201/adcaij.31084

Editorial dates

Submitted: 30-11-2022

Acceptance: 11-06-2023

Published: 01-11-2023

Issue

Vol. 12 (2023)

Section

Articles

Keywords

word sense disambiguation
deep learning approaches
word embedding approaches
lexical and all words sample

Supporting agencies

This research received no funding.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.