Data Pipeline
Introduction
A series of automated steps that move data from a source to a destination while transforming it along the way.
It mainly involves three stages: Extraction (E), Transformation (T), and Loading (L).
Extraction
The "entry point." Its primary goal is to retrieve data from a source system.
There are three main types of ingestion:
Batch Ingestion: Processing data in large groups at specific intervals (e.g., every night at midnight). Tools like Apache Sqoop or Azure Data Factory are common here.
Streaming Ingestion: Capturing data in real-time as events occur (e.g., credit card transactions or sensor logs). Apache Kafka and Amazon Kinesis are the industry leaders for high-velocity streams.
Change Data Capture (CDC): A technique that tracks only the changes (inserts, updates, deletes) in a source database to keep the destination synchronized efficiently.
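Change Data Capture is often approximated with a watermark: each run pulls only rows modified since the previous run. The sketch below illustrates the idea with an in-memory list standing in for a source table; the `updated_at` column and row fields are hypothetical examples, not part of any specific CDC tool.

```python
from datetime import datetime

# Hypothetical source rows; in practice these come from a database query.
source_rows = [
    {"id": 1, "name": "alice", "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "name": "bob",   "updated_at": datetime(2024, 1, 3)},
]

def extract_changes(rows, watermark):
    """Watermark-based CDC: return only rows modified since the last run."""
    return [r for r in rows if r["updated_at"] > watermark]

# Only bob's row is newer than the watermark from the previous run.
changed = extract_changes(source_rows, datetime(2024, 1, 2))
```

Real CDC tools (e.g., Debezium) read the database's transaction log instead of polling a timestamp column, which also captures deletes; the watermark approach is a lightweight fallback when log access is unavailable.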
Transformation
The "engine room" of the data pipeline. This is where raw, messy data from the extraction phase is converted into a clean, structured, and usable format for analysis or machine learning.
In modern architectures like the Medallion Architecture, this stage typically moves data from the Bronze (raw) layer to the Silver (cleansed) and Gold (aggregated) layers.
Data Cleansing
Handling Nulls: Deciding whether to drop rows with missing values or fill them with defaults (e.g., changing a null "Country" to "Unknown").
Deduplication: Removing identical records that might have been ingested multiple times due to system retries.
Data Typing: Ensuring numbers are treated as integers, dates as timestamps, and prices as decimals rather than plain text.
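The three cleansing steps above can be sketched in a few lines of plain Python; the record shape (`order_id`, `country`, `price`) is a made-up example.

```python
records = [
    {"order_id": "1", "country": None, "price": "19.99"},
    {"order_id": "1", "country": None, "price": "19.99"},  # duplicate from a retry
    {"order_id": "2", "country": "DE", "price": "5.00"},
]

def cleanse(rows):
    seen, out = set(), []
    for r in rows:
        if r["order_id"] in seen:                    # deduplication
            continue
        seen.add(r["order_id"])
        out.append({
            "order_id": int(r["order_id"]),          # data typing: text -> int
            "country": r["country"] or "Unknown",    # null handling: default value
            "price": float(r["price"]),              # data typing: text -> decimal
        })
    return out

cleaned = cleanse(records)
```

At scale the same logic is usually expressed with DataFrame operations (e.g., `dropDuplicates` and `fillna` in Spark) rather than a hand-written loop.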
Structural Transformation
This involves changing the shape of the data to make it easier to query.
Pivoting/Unpivoting: Turning columns into rows or vice versa to fit a specific report layout.
Flattening: If you extract data from a JSON API, it often arrives as "nested" arrays. Transformation "explodes" these nests into flat, readable tables.
Filtering: Stripping away unnecessary columns or rows that aren't needed for the final business use case (e.g., removing internal test accounts from sales data).
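Flattening can be illustrated with a small sketch: a nested order payload (hypothetical field names) is exploded into one flat row per line item.

```python
# A nested JSON-style payload, as it might arrive from an API.
api_response = {
    "order_id": 42,
    "customer": {"id": 7, "name": "Ada"},
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B2", "qty": 1},
    ],
}

def flatten(order):
    """Explode the nested items array into flat rows, one per line item."""
    return [
        {
            "order_id": order["order_id"],
            "customer_id": order["customer"]["id"],
            "sku": item["sku"],
            "qty": item["qty"],
        }
        for item in order["items"]
    ]

rows = flatten(api_response)
```

Note that the customer `name` is dropped here: flattening is often combined with filtering, keeping only the columns the business use case needs.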
Aggregation and Summarization
In the final part of transformation (often the transition to the Gold layer), data is summarized to improve performance.
Joins: Linking a Product_ID in an orders table to a Product_Name in a catalog table.
Lookups: Adding external context, such as converting a zip code into a city name or a GPS coordinate into a region.
Window Functions: Calculating running totals, moving averages, or ranking customers by their lifetime value.
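A minimal sketch of a join followed by a running-total window, using invented tables (`orders`, `catalog`); production pipelines would express this in SQL or a DataFrame API instead.

```python
orders = [
    {"order_id": 1, "product_id": "P1", "amount": 10.0},
    {"order_id": 2, "product_id": "P2", "amount": 25.0},
    {"order_id": 3, "product_id": "P1", "amount": 5.0},
]
catalog = {"P1": "Widget", "P2": "Gadget"}

# Join: attach Product_Name via the Product_ID key.
enriched = [{**o, "product_name": catalog[o["product_id"]]} for o in orders]

# Window function: running total of sales, ordered by order_id.
running = 0.0
for row in sorted(enriched, key=lambda r: r["order_id"]):
    running += row["amount"]
    row["running_total"] = running
```

In SQL the second step is `SUM(amount) OVER (ORDER BY order_id)`, computed by the engine rather than a loop.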
Loading
The process of taking the cleaned, transformed, and structured data and physically moving it into a "permanent" destination where it can be used by data analysts, scientists, or business users.
Loading Strategy
Engineers must decide how the new data interacts with the data that is already in the destination.
Full Load: Every time the pipeline runs, the entire destination table is erased and replaced with a fresh copy. This is simple but becomes too slow and expensive as data grows.
Incremental Load (Append): Only the newest records are added to the bottom of the existing table. This is very fast but can lead to duplicate records if not managed carefully.
Upsert (Update + Insert): The system checks if a record already exists (using a unique ID). If it exists, it updates the information; if it doesn't, it inserts a new row. This is the most common way to keep "Customer" or "Product" tables accurate.
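The upsert logic reduces to "overwrite if the key exists, insert otherwise." A toy sketch with a dict standing in for the destination table (field names are illustrative); real warehouses do this with a `MERGE` statement.

```python
# Destination keyed by the unique ID, as a warehouse table would be.
destination = {
    101: {"customer_id": 101, "email": "old@example.com"},
}

incoming = [
    {"customer_id": 101, "email": "new@example.com"},    # exists -> update
    {"customer_id": 202, "email": "fresh@example.com"},  # new    -> insert
]

def upsert(dest, rows, key="customer_id"):
    """Update the row if its key is already present, otherwise insert it."""
    for row in rows:
        dest[row[key]] = row

upsert(destination, incoming)
```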
Data Indexing and Partitioning
To make the data easy to find later, the loading process "organizes the library."
Partitioning: Dividing a massive table into smaller physical folders (e.g., by Year or Region). When someone queries "Sales in 2026," the system only looks in the 2026 folder, skipping billions of other rows.
Clustering: Grouping similar data together within those folders to further speed up search performance.
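Partitioning usually shows up on disk as a folder hierarchy. The sketch below builds a Hive-style partition path; the bucket name and columns are made-up examples.

```python
def partition_path(base, record):
    """Hive-style partition layout: one folder level per partition column."""
    return f"{base}/year={record['year']}/region={record['region']}/part-0.parquet"

path = partition_path("s3://warehouse/sales", {"year": 2026, "region": "EU"})
```

A query filtered on `year = 2026` can then skip every folder except `year=2026/`, which is what makes partition pruning fast.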
Schema Enforcement and Evolution
Enforcement: If a source system suddenly sends a "Text" string where a "Price" number should be, the loading process will block the data to prevent corrupting the clean destination.
Evolution: If a new column is added legitimately (e.g., adding "Discount Code" to an orders table), modern loading tools can automatically update the destination table to include this new field without breaking the pipeline.
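Both behaviors can be sketched with a simple type map standing in for a table schema; the field names and the `evolve` helper are illustrative, not a real library API.

```python
expected_schema = {"order_id": int, "price": float}

def enforce(row, schema):
    """Enforcement: reject rows whose fields don't match the expected types."""
    for field, typ in schema.items():
        if not isinstance(row.get(field), typ):
            raise TypeError(f"{field} must be {typ.__name__}")
    return row

def evolve(schema, new_field, typ):
    """Evolution: accept a legitimately added column without breaking the pipeline."""
    return {**schema, new_field: typ}

good = enforce({"order_id": 1, "price": 2.5}, expected_schema)
evolved = evolve(expected_schema, "discount_code", str)
```

Tools like Delta Lake implement both natively (schema enforcement on write, plus an opt-in "merge schema" mode for evolution).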
ETL vs ELT
ETL
Imagine a retail chain with an older local database that has limited storage. They use ETL to ensure only the "perfect" data reaches their final reports.
Extract: Every night, the system pulls yesterday's raw CSV logs from 50 different stores.
Transform (The Staging Area): Before the data hits the main warehouse, a separate server processes it. It removes duplicate transactions, converts all currencies (Yen, Euro) into USD, and strips out sensitive customer names for privacy compliance.
Load: Only the cleaned, aggregated "Total Daily Sales per Store" is loaded into the Data Warehouse.
The Result: The warehouse is kept small and tidy, but if the manager later asks to see "Sales by Customer Name," that data is gone because it was "transformed away" before loading.
ELT
Now imagine a modern e-commerce site using a cloud platform like Databricks or Snowflake. They want to keep every detail for future AI models.
Extract: The system pulls every raw click, search, and purchase from the website API.
Load: Immediately, all that raw, messy JSON data is dumped into a "Data Lake" (the Bronze layer). No cleaning has happened yet.
Transform: Once the data is safely in the cloud, the data engineer writes a SQL or Spark script. This script runs inside the cloud environment to create a "Silver" table (cleaned) and a "Gold" table (business-ready).
The Result: The raw data is always available. If the marketing team decides six months later they want to analyze "Searches that didn't lead to a purchase," the data is already there waiting to be transformed.
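The ELT flow above can be sketched end-to-end with SQLite standing in for the cloud warehouse; the table names and columns are invented for illustration, and a real setup would use Spark or warehouse SQL.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Load: dump raw events into the Bronze table, no cleaning yet.
con.execute("CREATE TABLE bronze_events (user_id INT, event TEXT, amount REAL)")
con.executemany(
    "INSERT INTO bronze_events VALUES (?, ?, ?)",
    [(1, "purchase", 9.99), (1, "click", None), (2, "purchase", 4.50)],
)

# Transform (Silver): keep only well-formed purchase events.
con.execute("""
    CREATE TABLE silver_purchases AS
    SELECT user_id, amount FROM bronze_events
    WHERE event = 'purchase' AND amount IS NOT NULL
""")

# Transform (Gold): business-ready aggregate, revenue per user.
gold = con.execute(
    "SELECT user_id, SUM(amount) FROM silver_purchases "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()
```

Because the raw clicks stay in `bronze_events`, a later question ("searches that didn't lead to a purchase") only needs a new Silver query, not a re-extraction.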
Difference

| Aspect | ETL | ELT |
| --- | --- | --- |
| Transformation Timing | Before loading. | After loading. |
| Processing Power | Uses a separate ETL engine/server. | Uses the destination's power (e.g., Spark). |
| Load Speed | Slower (waiting for cleanup). | Faster (immediate ingestion). |
| Flexibility | Low. You must know the format first. | High. Store everything, decide later. |
| Cost Efficiency | Higher (maintenance of ETL servers). | Lower (leverages cheap cloud storage). |
| Raw Data Stored | No | Yes |
| Scenario | Approach | Why |
| --- | --- | --- |
| Old Bank Systems | ETL | High security; need to "scrub" data before it's stored. |
| Small Startup | ETL | Keeps storage costs very low by only saving what's needed. |
| Big Data / AI | ELT | Need all raw data for training models later; cloud scale makes it easy. |
| Real-time Apps | ELT | Speed is king; get the data "in" now, worry about cleaning it later. |