Data

Data Integration (Ingestion)

Ingestion is the first touchpoint, where data is moved from a source system into a storage system.

  • Batch Ingestion: Processing data in large groups at specific intervals (e.g., every night at midnight). Tools like Apache Sqoop or Azure Data Factory are common here.

  • Streaming Ingestion: Capturing data in real-time as events occur (e.g., credit card transactions or sensor logs). Apache Kafka and Amazon Kinesis are the industry leaders for high-velocity streams.

  • Change Data Capture (CDC): A technique that tracks only the changes (inserts, updates, deletes) in a source database to keep the destination synchronized efficiently.
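The CDC idea above can be sketched in a few lines. This is a simplified illustration that diffs two table snapshots by primary key; real CDC tools (such as Debezium) read the database's transaction log instead of comparing snapshots, and the table contents here are invented for the example.

```python
# Minimal sketch of Change Data Capture: compute only the inserts,
# updates, and deletes between two snapshots of a table keyed by
# primary key, so the destination can be synchronized incrementally.

def capture_changes(old_snapshot: dict, new_snapshot: dict) -> dict:
    """Return the inserts, updates, and deletes between two snapshots."""
    inserts = {k: v for k, v in new_snapshot.items() if k not in old_snapshot}
    deletes = {k: v for k, v in old_snapshot.items() if k not in new_snapshot}
    updates = {k: v for k, v in new_snapshot.items()
               if k in old_snapshot and old_snapshot[k] != v}
    return {"insert": inserts, "update": updates, "delete": deletes}

old = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
new = {1: {"name": "Ada Lovelace"}, 3: {"name": "Alan"}}

# Only the changed rows are shipped downstream, not the full table.
changes = capture_changes(old, new)
```

The efficiency win is that the destination replays three small change records rather than reloading every row.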

Storage and Data Lakehouses

Data needs a scalable place to live. Engineers choose storage based on the structure of the data:

  • Data Lakes: Store raw data in its native format (JSON, CSV, Parquet).

  • Data Warehouses: Store highly structured, "cleaned" data optimized for fast querying.

  • Delta Lake / Lakehouse: A modern hybrid that brings the reliability and performance of a warehouse directly to a data lake, allowing for features like ACID transactions and versioning.
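The lake-versus-warehouse distinction can be made concrete with a small sketch: the lake keeps the raw payload exactly as received (in a date-partitioned path), while the warehouse stores a parsed, typed row. The paths, field names, and cleaning rules here are illustrative assumptions, not a real system's layout.

```python
# Sketch: the same event landing in a "lake" (raw, native format) and a
# "warehouse" (structured, cleaned). Paths and fields are invented.
import json
import pathlib
import tempfile

root = pathlib.Path(tempfile.mkdtemp())

raw_event = '{"user": "u42", "ts": "2024-01-15T12:00:00Z", "amount": "19.99"}'

# Data lake: store the raw payload untouched, partitioned by ingestion date.
lake_path = root / "lake" / "events" / "dt=2024-01-15" / "event.json"
lake_path.parent.mkdir(parents=True)
lake_path.write_text(raw_event)

# Data warehouse: parse, type, and rename fields before loading.
record = json.loads(raw_event)
warehouse_row = {"user_id": record["user"], "amount": float(record["amount"])}
```

Keeping the untouched raw copy is what lets engineers reprocess history later if the cleaning logic changes.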

Distributed Processing Engines

When data is too large for one machine, it must be processed across a cluster.

  • Apache Spark: The most widely used engine for big data processing. It uses Resilient Distributed Datasets (RDDs) to perform fast, in-memory transformations across a cluster.

  • SQL Engines: Many platforms now allow engineers to perform complex transformations using standard SQL, which the engine then translates into distributed tasks.
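The split-apply-combine pattern these engines rely on can be simulated on a single machine. This toy sketch partitions a dataset, transforms each partition independently (each partition would run on a separate worker in a real cluster), and merges the partial results; it is an illustration of the idea, not Spark itself.

```python
# Toy illustration of distributed processing: partition the data,
# compute a partial result per partition ("map"), then merge ("reduce").
from functools import reduce

data = list(range(1, 101))

# Split the data into partitions, as a cluster would across worker nodes.
partitions = [data[i:i + 25] for i in range(0, len(data), 25)]

# "Map" stage: each worker computes a partial sum of squares.
partial = [sum(x * x for x in part) for part in partitions]

# "Reduce" stage: merge the partial results into the final answer.
total = reduce(lambda a, b: a + b, partial)
```

Because each partition is independent, the map stage parallelizes cleanly; only the small partial results need to be shuffled back together.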

Transformation and Modeling

This is where the raw data is reshaped to provide value.

  • Medallion Architecture: A design pattern that organizes data into layers of progressively improving quality.

    • Bronze: Raw data landing zone.

    • Silver: Filtered, joined, and "cleansed" data.

    • Gold: High-level aggregates ready for business users.

  • dbt (data build tool): A popular framework that allows engineers to write transformations in SQL and manages the dependencies between them.
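The Bronze/Silver/Gold layers described above can be sketched end to end on in-memory records. The field names, duplicate, and bad row are invented for the example; real pipelines apply the same progression to tables in the lakehouse.

```python
# Minimal sketch of the medallion layers: raw -> cleansed -> aggregated.

# Bronze: raw records exactly as ingested, duplicates and bad rows included.
bronze = [
    {"order_id": "1", "region": "EU", "amount": "10.0"},
    {"order_id": "1", "region": "EU", "amount": "10.0"},   # duplicate
    {"order_id": "2", "region": "US", "amount": "bad"},    # unparseable
    {"order_id": "3", "region": "US", "amount": "5.5"},
]

# Silver: deduplicated, typed, and cleansed.
seen, silver = set(), []
for row in bronze:
    try:
        amount = float(row["amount"])
    except ValueError:
        continue  # drop rows that fail validation
    if row["order_id"] not in seen:
        seen.add(row["order_id"])
        silver.append({"order_id": row["order_id"],
                       "region": row["region"], "amount": amount})

# Gold: business-level aggregate (revenue per region).
gold = {}
for row in silver:
    gold[row["region"]] = gold.get(row["region"], 0.0) + row["amount"]
```

Each layer is persisted separately, so analysts query Gold while engineers can always trace a number back through Silver to the raw Bronze records.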
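dbt's dependency management boils down to ordering models so that each one runs after everything it references. That ordering is a topological sort, sketched here over a hypothetical model graph (the model names are made up; dbt itself infers the graph from `ref()` calls in each model's SQL).

```python
# Sketch of how a tool like dbt orders transformations: build a
# dependency graph of models, then run them in topological order.
from graphlib import TopologicalSorter

# model -> the models it depends on (what its SQL would ref()).
deps = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "daily_revenue": {"orders_enriched"},
}

# static_order() yields each model only after all of its dependencies.
run_order = list(TopologicalSorter(deps).static_order())
```

The payoff is that engineers declare only the references; the execution order (and what can safely run in parallel) falls out of the graph.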

Orchestration (The "Brain")

Orchestration manages the timing and dependencies of the entire pipeline.

  • DAGs (Directed Acyclic Graphs): A visual representation of a workflow where each node is a task. The orchestrator ensures that "Task B" only runs if "Task A" completes successfully.

  • Tools: Apache Airflow, Prefect, or Dagster are used to schedule these complex workflows and provide alerts if something breaks.
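The core gating rule — "Task B" only runs if "Task A" succeeds — can be sketched with a tiny runner over a hypothetical three-task DAG. Real orchestrators such as Airflow, Prefect, and Dagster add scheduling, retries, and alerting on top of this; the task names and failure below are invented for the example.

```python
# Sketch of orchestration: run tasks in dependency order and skip any
# task whose upstream dependency did not complete successfully.

def run_dag(tasks: dict, deps: dict) -> dict:
    """Run tasks (listed in dependency order) and return their statuses."""
    status = {}
    for name, task in tasks.items():
        if any(status.get(up) != "success" for up in deps.get(name, [])):
            status[name] = "skipped"  # an upstream task failed or was skipped
            continue
        try:
            task()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status

status = run_dag(
    tasks={"extract": lambda: None,
           "transform": lambda: 1 / 0,   # this task raises, so it fails
           "load": lambda: None},
    deps={"transform": ["extract"], "load": ["transform"]},
)
```

Because the graph is acyclic, the orchestrator can always find such an order; the "skipped" state is what prevents half-finished data from being loaded.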

Data Governance and Security

This component ensures the data is secure, private, and compliant with regulations (like GDPR).

  • Data Cataloging: A searchable inventory of all data assets (e.g., Unity Catalog or Alation).

  • Access Control: Defining exactly who can see which columns or rows of data.

  • Data Lineage: Tracking the "biography" of a piece of data—knowing exactly where it started and how it was changed before it reached a report.
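Column- and row-level access control can be sketched as a policy table consulted on every read. The roles, columns, and filter rule here are invented for illustration; in practice this enforcement lives in the catalog or query engine (e.g. Unity Catalog), not in application code.

```python
# Sketch of access control: each role maps to the columns it may read
# and a predicate selecting the rows it may see.

POLICIES = {
    "analyst": {"columns": {"user_id", "country", "amount"},
                "row_filter": lambda row: True},
    "eu_support": {"columns": {"user_id", "country"},
                   "row_filter": lambda row: row["country"] == "DE"},
}

def read_table(rows: list, role: str) -> list:
    """Return only the rows and columns the role is allowed to see."""
    policy = POLICIES[role]
    return [{k: v for k, v in row.items() if k in policy["columns"]}
            for row in rows if policy["row_filter"](row)]

rows = [
    {"user_id": 1, "country": "DE", "email": "a@example.com", "amount": 10.0},
    {"user_id": 2, "country": "US", "email": "b@example.com", "amount": 7.5},
]
```

Note that neither role can ever see the `email` column — sensitive fields are excluded by policy rather than by convention.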
