Authors
Mohamed Nadjib Mami, Hajira Jabeen, Sören Auer
Description
Recent years have brought a huge leap in data formats, data modalities, and storage capabilities, and dozens of data storage techniques have been created as a result. Today, we can store cluster-wide data and choose a storage technique that suits our application needs, rather than the other way around. If different data stores are interlinked and integrated, their data can generate valuable knowledge and insights. In this article, we present an approach that uses semantic technologies to query heterogeneous Big Data stored in a Data Lake in a unified manner. Our approach is based on equipping the original data stored in the Data Lake with mappings and adding transformations to the SPARQL query syntax to make heterogeneous data joinable across the Data Lake. We devise an implementation, named Sparkall, that uses Apache Spark as the underlying query engine. Our evaluation demonstrates the feasibility and efficiency of Sparkall in querying five popular data sources.
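
The idea of mappings and join-enabling transformations described above can be illustrated with a minimal Python sketch. It is not Sparkall's implementation or API, and it uses neither Apache Spark nor SPARQL. The two in-memory "sources", their attribute names, the ontology terms (ex:productId, ex:price, rdfs:label), and the helper functions are all hypothetical, chosen only to show how mapping attributes to shared terms and transforming a key can make differently shaped data joinable.

```python
# Hypothetical illustration only: plain Python standing in for a query engine.

# Two heterogeneous "sources" with different schemas and key formats.
products_csv = [
    {"pid": "P-001", "label": "Laptop"},
    {"pid": "P-002", "label": "Phone"},
]
offers_docs = [
    {"product": 1, "price": 899.0},
    {"product": 2, "price": 499.0},
    {"product": 3, "price": 10.0},
]

# Mappings: source attribute -> shared ontology term (assumed vocabulary).
MAPPINGS = {
    "products": {"pid": "ex:productId", "label": "rdfs:label"},
    "offers": {"product": "ex:productId", "price": "ex:price"},
}


def apply_mapping(rows, mapping):
    """Rename each mapped attribute to its ontology term; drop unmapped ones."""
    return [{mapping[k]: v for k, v in row.items() if k in mapping} for row in rows]


def transform(rows, term, fn):
    """Apply a value transformation to one term so that join keys line up."""
    return [{**row, term: fn(row[term])} for row in rows]


def join(left, right, term):
    """Inner hash join of two row lists on a shared ontology term."""
    index = {}
    for r in right:
        index.setdefault(r[term], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[term], [])]


def run_example():
    products = apply_mapping(products_csv, MAPPINGS["products"])
    offers = apply_mapping(offers_docs, MAPPINGS["offers"])
    # Transformation: turn "P-001" into the integer 1 to match the offers key.
    products = transform(products, "ex:productId", lambda s: int(s.split("-")[1]))
    return join(products, offers, "ex:productId")


if __name__ == "__main__":
    for row in run_example():
        print(row)
```

Without the transformation step, the string keys of the first source would never match the integer keys of the second, and the join would return no rows.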