Abstract:
Extract, transform, and load (ETL) processes are crucial for building data repositories from a variety of self-contained sources. Despite their complexity and cost, ETL processes have reached a degree of maturity for traditional, XML, and graph data
sources. However, the main challenge for ETL processes is twofold: (1) they do not scale when faced with large and highly varied data sources, including web data, and (2) they do not support deploying the target data warehouse in a polystore. The paper reviews research efforts along this line, then proposes a conceptual model of these processes using BPMN (Business Process Model and Notation). The resulting models are automatically converted into scripts executed within the Spark framework. The solution is packaged according to a new distributed architecture (Open ETL) that supports both batch and stream processing. To make our approach concrete and evaluable, a real case study using the LUBM benchmark, which involves heterogeneous data sources, is considered.