Data Pipeline

Introduction

  • A series of automated steps that move data from a source to a destination while transforming it along the way.

  • Mainly involves three stages: Extraction (E), Transformation (T), and Loading (L)

Extraction

  • The "entry point." Its primary goal is to retrieve data from a source system.

  • Mainly involves three types of ingestion:

  1. Batch Ingestion: Processing data in large groups at specific intervals (e.g., every night at midnight). Tools like Apache Sqoop or Azure Data Factory are common here.

  2. Streaming Ingestion: Capturing data in real-time as events occur (e.g., credit card transactions or sensor logs). Apache Kafka and Amazon Kinesis are the industry leaders for high-velocity streams.

  3. Change Data Capture (CDC): A technique that tracks only the changes (inserts, updates, deletes) in a source database to keep the destination synchronized efficiently.
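A minimal sketch of the CDC idea: diff two snapshots of a source table keyed by primary key and emit only the inserts, updates, and deletes (real CDC tools read the database's transaction log instead; the snapshot diff here is purely illustrative, and all table contents are made up).

```python
# Toy change-data-capture: compare two {pk: row} snapshots and return
# only the changes needed to keep a destination in sync.

def capture_changes(old_rows, new_rows):
    inserts = {k: v for k, v in new_rows.items() if k not in old_rows}
    deletes = {k: v for k, v in old_rows.items() if k not in new_rows}
    updates = {k: v for k, v in new_rows.items()
               if k in old_rows and old_rows[k] != v}
    return {"insert": inserts, "update": updates, "delete": deletes}

yesterday = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
today     = {1: {"name": "Ann"}, 2: {"name": "Bobby"}, 3: {"name": "Cy"}}

changes = capture_changes(yesterday, today)
# Only rows 2 (update) and 3 (insert) travel to the destination,
# instead of re-copying the whole table.
```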

Transformation

  • The "engine room" of the data pipeline. This is where raw, messy data from the extraction phase is converted into a clean, structured, and usable format for analysis or machine learning.

  • In modern architectures such as the Medallion Architecture, this stage typically moves data from the Bronze (raw) layer to the Silver (cleansed) and Gold (aggregated) layers

Data Cleansing

  • Handling Nulls: Deciding whether to drop rows with missing values or fill them with defaults (e.g., changing a null "Country" to "Unknown").

  • Deduplication: Removing identical records that might have been ingested multiple times due to system retries.

  • Data Typing: Ensuring numbers are treated as integers, dates as timestamps, and prices as decimals rather than plain text.
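The three cleansing steps above can be sketched in a few lines of plain Python (the records and field names are invented for illustration):

```python
# One cleansing pass over raw records: deduplicate on the unique id,
# fill null countries with "Unknown", and coerce text fields to types.

raw = [
    {"id": "1", "country": None, "price": "19.99"},
    {"id": "1", "country": None, "price": "19.99"},  # duplicate from a retry
    {"id": "2", "country": "DE", "price": "5.00"},
]

seen, clean = set(), []
for row in raw:
    if row["id"] in seen:          # deduplication
        continue
    seen.add(row["id"])
    clean.append({
        "id": int(row["id"]),                    # data typing: text -> integer
        "country": row["country"] or "Unknown",  # null handling: fill default
        "price": float(row["price"]),            # data typing: text -> decimal
    })
```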

Structural Transformation

This involves changing the shape of the data to make it easier to query.

  • Pivoting/Unpivoting: Turning columns into rows or vice versa to fit a specific report layout.

  • Flattening: If you extract data from a JSON API, it often arrives as "nested" arrays. Transformation "explodes" these nests into flat, readable tables.

  • Filtering: Stripping away unnecessary columns or rows that aren't needed for the final business use case (e.g., removing internal test accounts from sales data).
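Flattening is easiest to see with a concrete payload. The sketch below "explodes" a nested order (as it might arrive from a hypothetical JSON API) into one flat row per line item; the field names are assumptions for illustration:

```python
# Flatten a nested JSON order into flat, table-ready rows.

order = {
    "order_id": 42,
    "customer": {"id": 7, "name": "Ann"},
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B9", "qty": 1},
    ],
}

rows = [
    {
        "order_id": order["order_id"],
        "customer_id": order["customer"]["id"],  # lift nested field up
        "sku": item["sku"],
        "qty": item["qty"],
    }
    for item in order["items"]                   # one flat row per nested item
]
```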

Aggregation and Summarization

In the final part of transformation (often the transition to the Gold layer), data is joined, enriched, and summarized to improve query performance.

  • Joins: Linking a Product_ID in an orders table to a Product_Name in a catalog table.

  • Lookups: Adding external context, such as converting a zip code into a city name or a GPS coordinate into a region.

  • Window Functions: Calculating running totals, moving averages, or ranking customers by their lifetime value.
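A join plus a running total can be sketched in plain Python (in practice these would be SQL `JOIN` and window-function clauses; the tables below are made up):

```python
# Enrich orders with a catalog lookup, then compute a per-customer
# running total -- the same result a SQL window function would give.

orders = [
    {"customer": "ann", "product_id": 1, "amount": 10.0},
    {"customer": "ann", "product_id": 2, "amount": 5.0},
    {"customer": "bob", "product_id": 1, "amount": 7.5},
]
catalog = {1: "Widget", 2: "Gadget"}

running, enriched = {}, []
for o in orders:
    running[o["customer"]] = running.get(o["customer"], 0) + o["amount"]
    enriched.append({
        **o,
        "product_name": catalog[o["product_id"]],  # the join/lookup
        "running_total": running[o["customer"]],   # the window function
    })
```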

Loading

  • The process of taking the cleaned, transformed, and structured data and physically moving it into a "permanent" destination where it can be used by data analysts, scientists, or business users.

Loading Strategy

Engineers must decide how the new data interacts with the data that is already in the destination.

  • Full Load: Every time the pipeline runs, the entire destination table is erased and replaced with a fresh copy. This is simple but becomes too slow and expensive as data grows.

  • Incremental Load (Append): Only the newest records are added to the bottom of the existing table. This is very fast but can lead to duplicate records if not managed carefully.

  • Upsert (Update + Insert): The system checks if a record already exists (using a unique ID). If it exists, it updates the information; if it doesn't, it inserts a new row. This is the most common way to keep "Customer" or "Product" tables accurate.
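The upsert logic reduces to a key check: update on match, insert otherwise. A minimal in-memory sketch (real warehouses do this with a `MERGE` statement; the records here are invented):

```python
# Upsert incoming records into a destination keyed by a unique id.

destination = {1: {"name": "Ann", "tier": "gold"}}

incoming = [
    {"id": 1, "name": "Ann", "tier": "platinum"},  # key exists -> update
    {"id": 2, "name": "Bob", "tier": "silver"},    # new key    -> insert
]

for record in incoming:
    destination[record["id"]] = {
        k: v for k, v in record.items() if k != "id"
    }
```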

Data Indexing and Partitioning

To make the data easy to find later, the loading process "organizes the library."

  • Partitioning: Dividing a massive table into smaller physical folders (e.g., by Year or Region). When someone queries "Sales in 2026," the system only looks in the 2026 folder, skipping billions of other rows.

  • Clustering: Grouping similar data together within those folders to further speed up search performance.
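Partition pruning is the key payoff: a sketch of bucketing rows by year so a query touches only one "folder" (the data is made up; real systems write paths like `/sales/year=2026/`):

```python
# Bucket rows into year-keyed partitions, then answer a query by
# scanning only the relevant partition.

sales = [
    {"year": 2025, "amount": 100},
    {"year": 2026, "amount": 250},
    {"year": 2026, "amount": 50},
]

partitions = {}
for row in sales:
    partitions.setdefault(row["year"], []).append(row)

# "Sales in 2026" reads only the 2026 partition, skipping all others.
total_2026 = sum(r["amount"] for r in partitions[2026])
```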

Schema Enforcement and Evolution

  • Enforcement: If a source system suddenly sends a "Text" string where a "Price" number should be, the loading process will block the data to prevent corrupting the clean destination.

  • Evolution: If a new column is added legitimately (e.g., adding "Discount Code" to an orders table), modern loading tools can automatically update the destination table to include this new field without breaking the pipeline.
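Both behaviors can be sketched with a simple column-to-type map: reject a value of the wrong type (enforcement), and register a legitimately new column on the fly (evolution). The schema and rows are illustrative only:

```python
# Enforce expected column types on load; evolve the schema when a
# previously unseen column arrives.

schema = {"order_id": int, "price": float}

def load_row(row, schema, evolve=True):
    for col, value in row.items():
        if col not in schema:
            if not evolve:
                raise ValueError(f"unknown column {col!r}")
            schema[col] = type(value)  # evolution: accept the new column
        elif not isinstance(value, schema[col]):
            # enforcement: block bad data before it corrupts the table
            raise TypeError(f"{col!r} expects {schema[col].__name__}")
    return row

loaded = load_row({"order_id": 1, "price": 9.99, "discount_code": "SPRING"},
                  schema)
```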

ETL vs ELT

ETL

Imagine a retail chain with an older local database that has limited storage. They use ETL to ensure only the "perfect" data reaches their final reports.

  • Extract: Every night, the system pulls yesterday's raw CSV logs from 50 different stores.

  • Transform (The Staging Area): Before the data hits the main warehouse, a separate server processes it. It removes duplicate transactions, converts all currencies (Yen, Euro) into USD, and strips out sensitive customer names for privacy compliance.

  • Load: Only the cleaned, aggregated "Total Daily Sales per Store" is loaded into the Data Warehouse.

  • The Result: The warehouse is kept small and tidy, but if the manager later asks to see "Sales by Customer Name," that data is gone because it was "transformed away" before loading.

ELT

Now imagine a modern e-commerce site using a cloud platform like Databricks or Snowflake. They want to keep every detail for future AI models.

  • Extract: The system pulls every raw click, search, and purchase from the website API.

  • Load: Immediately, all that raw, messy JSON data is dumped into a "Data Lake" (the Bronze layer). No cleaning has happened yet.

  • Transform: Once the data is safely in the cloud, the data engineer writes a SQL or Spark script. This script runs inside the cloud environment to create a "Silver" table (cleaned) and a "Gold" table (business-ready).

  • The Result: The raw data is always available. If the marketing team decides six months later they want to analyze "Searches that didn't lead to a purchase," the data is already there waiting to be transformed.

Difference

  Aspect                  ETL                                     ELT
  Transformation Timing   Before loading.                         After loading.
  Processing Power        Uses a separate ETL engine/server.      Uses the destination's power (e.g., Spark).
  Load Speed              Slower (waiting for cleanup).           Faster (immediate ingestion).
  Flexibility             Low. You must know the format first.    High. Store everything, decide later.
  Cost                    Higher (maintenance of ETL servers).    Lower (leverages cheap cloud storage).
  Raw Data Stored         No                                      Yes

  Scenario           Approach   Why
  Old Bank Systems   ETL        High security; need to "scrub" data before it's stored.
  Small Startup      ETL        Keeps storage costs very low by only saving what's needed.
  Big Data / AI      ELT        Need all raw data for training models later; cloud scale makes it easy.
  Real-time Apps     ELT        Speed is king; get the data "in" now, worry about cleaning it later.
