Hybrid Text Embedding and Evolutionary Algorithm Approach for Topic Clustering in Online Discussion Forums
Abstract
The growing use of discussion forums as a medium for information exchange has led to a surge in data, making topic clustering on these platforms essential for understanding user interests, preferences, and concerns. This study introduces a methodology for topic clustering that combines two text embedding techniques, Latent Dirichlet Allocation (LDA) and BERT, trained on a single autoencoder. It also proposes a combination of K-Means and Genetic Algorithms for clustering topics within triadic discussion forum threads. The proposed technique begins with a preprocessing stage that cleans and tokenizes the textual data, which is then transformed into a vector representation using the hybrid text embedding method. The K-Means algorithm then clusters these vectorized data points, and Genetic Algorithms optimize the parameters of the K-Means clustering. We assess the efficacy of our approach by computing cosine similarities between topics and by evaluating performance in terms of coherence and graph visualization. The results confirm that the hybrid text embedding methodology, coupled with evolutionary algorithms, enhances the quality of topic clustering across various discussion forum themes. This investigation contributes to the development of effective methods for clustering discussion forums, with potential applications in diverse domains, including social media analysis, online education, and customer response analysis.
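The clustering stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which K-Means parameters the Genetic Algorithm tunes, so this sketch assumes the GA searches over the choice of initial centroids and scores each candidate by within-cluster squared distance. The toy 2-D points stand in for the hybrid LDA and BERT embeddings, and the function names (`assign`, `kmeans`, `ga_kmeans`, `cosine`) are hypothetical.

```python
import math
import random

def assign(points, centroids):
    """Label each point with the index of its nearest centroid (Euclidean)."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def kmeans(points, centroids, iters=25):
    """Lloyd's K-Means from the given starting centroids."""
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == c]
            # An empty cluster keeps its previous centroid.
            updated.append(tuple(sum(x) / len(members) for x in zip(*members))
                           if members else old)
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    inertia = sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
    return centroids, labels, inertia

def ga_kmeans(points, k, pop_size=30, generations=15, mutation_rate=0.2, seed=0):
    """Assumed GA: each individual is a list of k point indices used as seeds."""
    rng = random.Random(seed)
    n = len(points)

    def fitness(individual):
        # Lower inertia is better, so negate it to get a score to maximise.
        return -kmeans(points, [points[i] for i in individual])[2]

    population = [rng.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]      # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = list(dict.fromkeys(a + b))      # crossover: merge parent seeds
            child = rng.sample(pool, k)
            if rng.random() < mutation_rate:       # mutation: swap in an unused seed
                unused = [i for i in range(n) if i not in child]
                child[rng.randrange(k)] = rng.choice(unused)
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return kmeans(points, [points[i] for i in best])

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))

# Toy data: three tight 2-D blobs standing in for document embeddings.
data_rng = random.Random(42)
blob_centres = [(5.0, 0.0), (0.0, 5.0), (5.0, 5.0)]
points = [(data_rng.gauss(cx, 0.3), data_rng.gauss(cy, 0.3))
          for cx, cy in blob_centres for _ in range(10)]

centroids, labels, inertia = ga_kmeans(points, k=3)
for i in range(3):
    for j in range(i + 1, 3):
        # Pairwise cosine similarity between cluster centroids ("topics").
        print(i, j, round(cosine(centroids[i], centroids[j]), 3))
```

In a real pipeline the input vectors would come from the embedding stage, and a library K-Means could replace the hand-written loop. The GA wrapper only needs a function that maps a candidate configuration to a clustering score.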
Bouabdallaoui, I., Guerouate, F., & Sbihi, M. (2024). Hybrid Text Embedding and Evolutionary Algorithm Approach for Topic Clustering in Online Discussion Forums. ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal, 13(1), e31448. https://doi.org/10.14201/adcaij.31448