An Efficient Approach to Extract and Store Big Semantic Web Data Using Hadoop and Apache Spark GraphX

  • Wria Mohammed Salih Mohammed
    Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani 46001, Kurdistan Region, Iraq wria.mohammedsalih[at]
  • Alaa Khalil Jumaa
    Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani 46001, Kurdistan Region, Iraq


The volume of data is growing at an astonishing rate. Traditional techniques for storing and processing data, such as relational and centralized databases, have become inefficient and time-consuming. Linked data and the Semantic Web make internet data machine-readable. As the volume of linked data and Semantic Web data increases, traditional approaches to storing and processing it are no longer sufficient, because they run up against hardware resource limits. To solve this problem, datasets must be stored using distributed, clustered methods. Hadoop can store large datasets because it distributes data across the hard disks of many cluster nodes; Apache Spark can process data in parallel more efficiently than Hadoop MapReduce because Spark works in memory instead of on the hard disk. In this paper, Semantic Web data is stored and processed using Apache Spark GraphX and the Hadoop Distributed File System (HDFS). Spark's in-memory processing and distributed computing enable efficient analysis of massive datasets stored in HDFS, and Spark GraphX allows graph-based processing of Semantic Web data. The fundamental objective of this work is to provide a way to combine Semantic Web and big data technologies efficiently so that their combined strengths can be used in data analysis and processing. First, the proposed approach uses the SPARQL query language to extract Semantic Web data from DBpedia datasets. DBpedia is a large, publicly available Semantic Web dataset built on Wikipedia. Second, the extracted Semantic Web data is converted to the GraphX data format by generating vertex and edge files; this conversion is implemented using Apache Spark GraphX. Third, the vertex and edge tables are stored in HDFS, where they are available for visualization and analysis operations.
Furthermore, the proposed technique improves data storage efficiency: converting Semantic Web data to GraphX files reduces the required storage space by nearly half (about 44%), from an RDF size of around 133.8 to a GraphX size of 75.3. The parallel data processing provided by Apache Spark also reduces the time required for data processing and analysis. This article concludes that Apache Spark GraphX can enhance Semantic Web and Big Data technologies. Converting Semantic Web data to the GraphX format minimizes data size and processing time, enabling efficient data management and seamless integration.
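
The conversion step described above (extracted triples turned into vertex and edge tables) can be illustrated with a minimal plain-Python sketch. It is not the authors' Spark GraphX implementation: the query text, the sample triples, and the helper names `triples_to_graph_tables` and `to_tsv_lines` are all assumptions made for illustration. It only shows the general idea of assigning integer IDs to RDF terms, which GraphX requires because its vertices are keyed by 64-bit integers.

```python
# Hypothetical SPARQL query of the kind that could be sent to a DBpedia
# endpoint. The paper's actual queries are not given in the abstract, and
# this string is not executed here.
EXAMPLE_QUERY = """
SELECT ?s ?p ?o WHERE {
  ?s a <http://dbpedia.org/ontology/City> ;
     ?p ?o .
} LIMIT 1000
"""

# Made-up sample triples standing in for the result of such a query.
SAMPLE_TRIPLES = [
    ("dbr:Sulaymaniyah", "dbo:country", "dbr:Iraq"),
    ("dbr:Baghdad", "dbo:country", "dbr:Iraq"),
    ("dbr:Baghdad", "rdf:type", "dbo:City"),
]


def triples_to_graph_tables(triples):
    """Map (subject, predicate, object) triples to vertex and edge tables.

    Each distinct subject or object string gets a sequential integer ID.
    Predicates become edge attributes. Literal objects, if present, are
    treated as vertices too (a simplifying assumption).
    """
    ids = {}

    def vertex_id(term):
        # Assign the next free integer the first time a term is seen.
        if term not in ids:
            ids[term] = len(ids)
        return ids[term]

    edges = []
    for s, p, o in triples:
        src = vertex_id(s)
        dst = vertex_id(o)
        edges.append((src, dst, p))

    # Vertex table: (id, original term), ordered by id.
    vertices = sorted((vid, term) for term, vid in ids.items())
    return vertices, edges


def to_tsv_lines(rows):
    """Render rows as tab-separated lines, one plausible on-disk layout."""
    return ["\t".join(str(field) for field in row) for row in rows]


if __name__ == "__main__":
    vertices, edges = triples_to_graph_tables(SAMPLE_TRIPLES)
    print("\n".join(to_tsv_lines(vertices)))
    print("\n".join(to_tsv_lines(edges)))
```

In a cluster setting, tables like these would be written to HDFS and loaded as the vertex and edge collections from which a GraphX graph is built. Repeated URIs are stored once in the vertex table and referenced by ID in the edge table, which is one plausible reason the graph representation is more compact than the RDF serialization.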
Salih Mohammed, W. M., & Jumaa, A. K. (2024). An Efficient Approach to Extract and Store Big Semantic Web Data Using Hadoop and Apache Spark GraphX. ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal, 13(1), e31506.

