Influence of Pre-Processing Strategies on the Performance of ML Classifiers Exploiting TF-IDF and BOW Features

  • Amit Purushottam Pimpalkar
    Sathyabama Institute of Science and Technology amit.pimpalkar[at]gmail.com
  • R. Jeberson Retna Raj
    Sathyabama Institute of Science and Technology

Abstract

Data analytics and its associated applications have recently become important fields of study. The concern for researchers nowadays is the massive amount of data produced every minute and second as people constantly share thoughts and opinions about things associated with them. Social media information, however, is still unstructured, scattered and hard to handle, and a strong foundation needs to be developed so that it can be used as valuable information on a particular topic. Processing such unstructured data is challenging because of noise, co-relevance, emoticons, folksonomies and slang, and it therefore requires proper data pre-processing before the right sentiments can be obtained. The datasets are extracted from Kaggle and Twitter, pre-processing is performed using NLTK and Scikit-learn, and feature selection and extraction are done with the Bag of Words (BOW), Term Frequency (TF) and Inverse Document Frequency (IDF) schemes. For polarity identification, we evaluated five Machine Learning (ML) algorithms, viz. Multinomial Naive Bayes (MNB), Logistic Regression (LR), Decision Trees (DT), XGBoost (XGB) and Support Vector Machines (SVM). We performed a comparative analysis of these algorithms to decide which works best for the given dataset in terms of recall, accuracy, F1-score and precision. We assess the effects of various pre-processing techniques on two datasets, one domain-specific and the other not. The SVM classifier outperformed the other classifiers, with evaluations of 73.12% and 94.91% for accuracy and precision respectively. This research also highlights that the selection and representation of features, along with various pre-processing techniques, have a positive impact on classification performance.
Overall, the outcome indicates an improvement in sentiment classification, and we noted that pre-processing approaches improve the efficiency of the classifiers.
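
The feature pipeline summarized above (pre-processing, then BOW counts, then TF-IDF weights) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the stop-word list, the cleaning rules, the sample tweets and the textbook weighting tf × ln(N/df) are all assumptions made for illustration, and the paper itself relies on NLTK and Scikit-learn, whose defaults differ.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only;
# NLTK ships a much longer English list.
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "and", "of", "this", "i"}


def preprocess(text):
    """Lowercase, drop URLs, @mentions and non-letters, remove stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)  # URLs and @mentions
    text = re.sub(r"[^a-z\s]", " ", text)           # punctuation, digits, '#'
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def bag_of_words(docs):
    """Return the sorted vocabulary and one raw-count vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Weight each count as (count / doc length) * ln(N / document frequency).

    Assumed textbook variant; library implementations usually add smoothing.
    """
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    doc_freq = [sum(1 for row in counts if row[j] > 0)
                for j in range(len(vocab))]
    weighted = []
    for doc, row in zip(docs, counts):
        length = len(doc) or 1  # guard against empty documents
        weighted.append([(c / length) * math.log(n_docs / doc_freq[j])
                         for j, c in enumerate(row)])
    return vocab, weighted


# Made-up example tweets (not taken from the paper's datasets).
tweets = [
    "I love this phone, it is great! http://example.com",
    "@user this phone is terrible",
    "Great battery and great screen #happy",
]
docs = [preprocess(t) for t in tweets]
vocab, bow = bag_of_words(docs)
_, weights = tf_idf(docs)
```

In this sketch a term that occurs in only one tweet, such as "love", receives a higher weight than a term shared across tweets, such as "great", which is the property that distinguishes TF-IDF from raw BOW counts. The resulting vectors would then be passed to a classifier such as SVM or MNB.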

How to cite

Pimpalkar, A. P., & Retna Raj, R. J. (2020). Influence of Pre-Processing Strategies on the Performance of ML Classifiers Exploiting TF-IDF and BOW Features. ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal, 9(2), 49–68. https://doi.org/10.14201/ADCAIJ2020924968

Author Biography

Amit Purushottam Pimpalkar, Sathyabama Institute of Science and Technology

Amit Pimpalkar is a Ph.D. research scholar at Sathyabama University, Chennai. He received his master's degree from Shri Ram Institute of Technology, Jabalpur, in 2013 and his B.E. from Nagpur University, Nagpur, in 2005. His research interests include machine learning, NLP and data mining. He has 15 years of academic and industrial experience. He has published more than 70 research articles in the field of computer science and IT applications in international journals and in international and national conference proceedings. He is a lifetime member of ISTE, IAENG and ICSES.