- Title
Distributed real-time ETL architecture for unstructured big data.
- Authors
Mehmood, Erum; Anees, Tayyaba
- Abstract
Real-time extract-transform-load (ETL) is an integral part of meeting the increasing demand for faster business decisions across a large number of modern applications. Because of the volume and velocity of data, multi-source unstructured data stream extraction and transformation using disk data in a distributed environment are the building blocks of real-time ETL. Designing an architecture for these building blocks therefore remains a major challenge. In this paper, we focus primarily on expediting stream-disk joins during the transformation phase of ETL; this join is considered the most expensive operator in stream processing because of frequent disk access. We propose an architecture for real-time ETL that ingests unstructured data streams from multiple sources, without having to worry about the structure of the data sources, and transforms them after joining with distributed disk data. We also present a novel data pipeline stream-disk join that uses partition-based input and a best-effort in-memory database technique to reduce frequent disk access. The proposed architecture addresses the challenges of stream data loss, ignored unmatched streams, disk overhead, and real-time processing in a distributed environment. Experimental results obtained using a stream generator and real-world datasets on local and distributed machines show that the proposed architecture yields significantly improved throughput, especially for large numbers of stream tuples with large datasets.
- Subjects
BIG data; DISTRIBUTED computing; DATA extraction; ARCHITECTURAL design; PHASE transitions
- Publication
Knowledge & Information Systems, 2022, Vol 64, Issue 12, p3419
- ISSN
0219-1377
- Publication type
Article
- DOI
10.1007/s10115-022-01757-7
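
The abstract names three ideas: partition-based input, a best-effort in-memory store in front of disk data, and keeping unmatched stream tuples instead of ignoring them. The following is a minimal illustrative sketch of those ideas only, not the authors' implementation. Every name in it (`BestEffortCache`, `stream_disk_join`, the `key` field, the dictionary standing in for disk data, the LRU eviction policy) is a hypothetical choice made for this example.

```python
from collections import OrderedDict


class BestEffortCache:
    """Bounded LRU cache; an assumed stand-in for an in-memory database."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = OrderedDict()

    def get(self, key):
        if key not in self.rows:
            return None
        self.rows.move_to_end(key)  # mark as most recently used
        return self.rows[key]

    def put(self, key, row):
        self.rows[key] = row
        self.rows.move_to_end(key)
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)  # evict least recently used


def stream_disk_join(stream, disk, num_partitions=2, cache_capacity=2):
    """Join stream tuples with disk rows on "key" (hypothetical field name).

    `disk` is a plain dict used here to simulate disk-resident data.
    Returns (joined tuples, unmatched tuples, number of simulated disk reads).
    """
    # Partition-based input: route each tuple by the hash of its join key.
    partitions = [[] for _ in range(num_partitions)]
    for tup in stream:
        partitions[hash(tup["key"]) % num_partitions].append(tup)

    joined, unmatched, disk_reads = [], [], 0
    for part in partitions:
        cache = BestEffortCache(cache_capacity)
        for tup in part:
            row = cache.get(tup["key"])
            if row is None:
                disk_reads += 1  # cache miss: fall back to the disk lookup
                row = disk.get(tup["key"])
                if row is not None:
                    cache.put(tup["key"], row)
            if row is None:
                unmatched.append(tup)  # retained rather than dropped
            else:
                joined.append({**tup, **row})
    return joined, unmatched, disk_reads


stream = [
    {"key": 1, "v": "a"},
    {"key": 2, "v": "b"},
    {"key": 1, "v": "c"},
    {"key": 9, "v": "d"},
]
disk = {1: {"name": "x"}, 2: {"name": "y"}}
joined, unmatched, disk_reads = stream_disk_join(stream, disk)
```

In this toy run the second tuple with key 1 is served from the cache, so four stream tuples trigger only three simulated disk reads, and the tuple with key 9 is kept in `unmatched` for later handling.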