Technology Glossary Term

In the context of big data systems, the term data ingestion refers to the process of acquiring or importing large volumes of data by transferring them from various sources to a centralized big data storage or processing infrastructure for further analysis. It is the first step in the data processing pipeline, and it determines whether diverse data sets from different applications reach the big data ecosystem efficiently and reliably.

As part of data ingestion, the data processing pipeline may also apply filtering steps to avoid ingesting unwanted or invalid data. The ingestion process can be broadly split into four stages:

Data Collection: Data is collected from a wide range of sources, including databases, files, streaming platforms, IoT devices, sensors, social media feeds, and external APIs. These sources may produce structured, semi-structured, or unstructured data in different formats and protocols.

Data Extraction: Once collected, data needs to be extracted from its source format and transformed into a standardized format suitable for ingestion into the big data system. This may involve parsing, cleaning, and preprocessing the data to remove duplicates, errors, or inconsistencies.

Data Transport: Extracted data is then transported from the source to the target storage or processing infrastructure using transport mechanisms such as batch transfers, streaming, message queues, or file transfer protocols. The chosen mechanism should deliver data reliably and efficiently, even across distributed or remote environments.

Data Loading: Finally, the ingested data is loaded into the target storage or processing system, such as a data lake, data warehouse, or distributed file system. This may involve partitioning, indexing, or replicating the data to optimize storage and retrieval performance.
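
The four stages above can be illustrated with a minimal Python sketch. It is not a reference implementation: the sample records in RAW_SOURCES, the function names, the in-memory queue used as a transport channel, and the partition-by-id rule are all assumptions made for illustration. A real system would typically rely on dedicated connectors, a message broker, and a persistent store.

```python
import json
from collections import defaultdict
from queue import Queue

# Hypothetical raw inputs standing in for two sources (an API and a CSV file).
RAW_SOURCES = {
    "api": ['{"id": 1, "temp": "21.5"}', '{"id": 2, "temp": "bad"}'],
    "csv": ["1,21.5", "3,19.0"],
}


def collect(sources):
    """Data Collection: yield (source_name, raw_record) pairs from every source."""
    for name, records in sources.items():
        for raw in records:
            yield name, raw


def extract(tagged_records):
    """Data Extraction: parse each source format into one common dict shape,
    skipping records that fail to parse and exact duplicates."""
    seen = set()
    for name, raw in tagged_records:
        try:
            if name == "api":
                obj = json.loads(raw)
                record = {"id": int(obj["id"]), "temp": float(obj["temp"])}
            else:
                rid, temp = raw.split(",")
                record = {"id": int(rid), "temp": float(temp)}
        except (ValueError, KeyError, TypeError):
            continue  # invalid record: filter it out instead of ingesting it
        key = (record["id"], record["temp"])
        if key in seen:
            continue  # duplicate already seen from another source
        seen.add(key)
        yield record


def transport(records, channel):
    """Data Transport: push records onto a queue (a stand-in for a message broker)."""
    for record in records:
        channel.put(record)
    channel.put(None)  # sentinel marking the end of the stream


def load(channel):
    """Data Loading: drain the queue into a store partitioned by id parity."""
    store = defaultdict(list)
    while (record := channel.get()) is not None:
        store[record["id"] % 2].append(record)
    return dict(store)


def run_pipeline(sources):
    """Run the four stages in order and return the partitioned store."""
    channel = Queue()
    transport(extract(collect(sources)), channel)
    return load(channel)
```

Calling run_pipeline(RAW_SOURCES) drops the record whose temperature cannot be parsed and the CSV row that duplicates an API record, then places the two remaining records in the partition for odd ids. The stages are chained as generators so that each record flows through collection and extraction one at a time, which mirrors how streaming ingestion avoids holding the full data set in memory.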