Abstract:
Extract, transform, and load (ETL) processes are crucial for building data repositories from a variety of self-contained sources. Despite their complexity and cost, ETL processes have reached a degree of maturity for traditional, XML, and graph data
sources. However, the main challenge for ETL processes is twofold: (1) they do not scale when faced with large and highly varied data sources, including web data, and (2) they do not support deploying the target data warehouse in a polystore. The paper reviews research efforts along this line, then proposes a conceptual model of these processes using BPMN (Business Process Model and Notation). The resulting models are automatically converted into scripts executed within the Spark framework. The solution is packaged according to a new distributed architecture (Open ETL) that supports both batch and stream processing. To make our approach concrete and evaluable, a real case study using the LUBM benchmark, which involves heterogeneous data sources, is considered.