Data Lakehouse

Introduction

  • Provides scalable storage and processing for modern organizations that want to avoid maintaining isolated systems for different workloads (e.g. BI, machine learning, streaming)

  • Establishes a single source of truth, eliminates redundant costs, and ensures data freshness

Vs. data warehouse & data lake

Data warehouse

  • Designed for data that changes infrequently; data warehouses prevent conflicts between concurrently running queries. Many rely on proprietary storage formats, which often limits support for machine learning workloads

Data lake

  • Stores and processes data cheaply and efficiently; accepts data of any nature, in any format

  • Not well suited for BI reporting, because the data is unvalidated

Data lakehouse

  • Combines the benefits of data lakes and data warehouses

  • Provides

    • Open, direct access to data stored in standard data formats.

    • Indexing protocols optimized for machine learning and data science.

    • Low query latency and high reliability for BI and advanced analytics.

Delta Lake

Introduction

  • It is an open-source storage framework that brings ACID transactions (reliability), scalability, and high performance to cloud storage, creating a "lakehouse" architecture.

  • When you create a Delta Lake table, it is stored as one or more data files in Parquet format. Alongside these files, Delta stores a transaction log.

  • The Delta Lake transaction log, also known as the Delta Log, is an ordered record of every transaction performed on the table since its creation. It serves as the single source of truth: every time you query the table, Spark checks this transaction log to retrieve the most recent version of the data.

  • Delta Lake creates a transaction log (a "_delta_log" folder) that records every change made to the data. This log makes the data readable and manageable, transforming a data lake into a structured Lakehouse.
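As an illustration, the on-storage layout described above can be sketched in plain Python. This is a toy simulation, not the real Delta implementation; file and folder names below mirror the real convention (`_delta_log`, zero-padded commit numbers, Parquet part files), but the commit contents are simplified:

```python
import json
import tempfile
from pathlib import Path

# Toy sketch of a Delta table directory: Parquet data files next to a
# _delta_log folder holding ordered JSON commit files (contents elided;
# real data files would be actual Parquet).
def sketch_delta_table(root: Path) -> list[str]:
    (root / "_delta_log").mkdir(parents=True)
    (root / "part-00000.snappy.parquet").touch()
    (root / "part-00001.snappy.parquet").touch()
    # Commit 0: the table's creation, recording the files it added.
    commit = [{"add": {"path": "part-00000.snappy.parquet"}},
              {"add": {"path": "part-00001.snappy.parquet"}}]
    log_file = root / "_delta_log" / "00000000000000000000.json"
    log_file.write_text("\n".join(json.dumps(a) for a in commit))
    return sorted(p.relative_to(root).as_posix()
                  for p in root.rglob("*") if p.is_file())

with tempfile.TemporaryDirectory() as d:
    for entry in sketch_delta_table(Path(d)):
        print(entry)
# → _delta_log/00000000000000000000.json
# → part-00000.snappy.parquet
# → part-00001.snappy.parquet
```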

Key Features and Benefits

ACID Transactions:

  • Ensures data integrity by allowing multiple users to read/write concurrently without corruption.

Atomicity

  • All transactions either succeed or fail completely

  • The transaction log controls commit atomicity. During a transaction, data files are written to the file directory backing the table. When the transaction completes, a new entry is committed to the transaction log that includes the paths to all files written during the transaction.

  • Data files are not tracked unless the transaction log records a new version. If a transaction fails after writing data files to a table, those data files do not corrupt the table state.
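The commit mechanism above can be sketched as a toy simulation (an assumed simplification, not Delta's actual code): the table state is defined only by files referenced in committed log entries, so data files staged by a failed transaction never become visible.

```python
import json
import os
import tempfile

# Toy table: data files become "live" only once a commit in the log
# references them.
class ToyTable:
    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_delta_log")
        os.makedirs(self.log_dir)

    def write_data_file(self, name):
        # Stage a data file in the table directory (contents elided).
        open(os.path.join(self.root, name), "w").close()

    def commit(self, version, added_paths):
        # The commit is the act of writing the numbered log entry.
        entry = os.path.join(self.log_dir, f"{version:020d}.json")
        with open(entry, "w") as f:
            json.dump({"add": added_paths}, f)

    def live_files(self):
        # Table state = union of files added by committed versions.
        live = []
        for name in sorted(os.listdir(self.log_dir)):
            with open(os.path.join(self.log_dir, name)) as f:
                live += json.load(f)["add"]
        return live

with tempfile.TemporaryDirectory() as d:
    t = ToyTable(d)
    t.write_data_file("a.parquet")
    t.commit(0, ["a.parquet"])           # transaction 1 completes
    t.write_data_file("orphan.parquet")  # transaction 2 fails before commit
    print(t.live_files())                # → ['a.parquet']
```

The orphan file sits on storage but is invisible to readers, which is exactly why a failed write cannot corrupt the table.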

Consistency

  • Consistency guarantees relate to how a given state of the data is observed by simultaneous operations

  • Delta Lake uses optimistic concurrency control to provide transactional guarantees between writes

  • Validate and commit:

    • Checks whether the proposed changes conflict with any other changes that may have been concurrently committed since the snapshot that was read.

    • If there are no conflicts, all the staged changes are committed as a new versioned snapshot, and the write operation succeeds.

    • If there are conflicts, the write operation fails with a concurrent modification exception. This failure prevents corruption of data.
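The validate-and-commit flow can be sketched as a toy simulation. The key assumption here (a simplification of how real commits are arbitrated) is put-if-absent semantics on the next numbered log file: only one writer can create it, and the loser fails with a concurrent-modification error instead of silently overwriting the winner's commit.

```python
import os
import tempfile

class ConcurrentModificationError(Exception):
    pass

def try_commit(log_dir, version, payload):
    """Attempt to commit `payload` as log version `version`."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_CREAT | O_EXCL fails if the file already exists
        # (put-if-absent), so exactly one writer wins each version.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise ConcurrentModificationError(
            f"version {version} already committed")
    with os.fdopen(fd, "w") as f:
        f.write(payload)

with tempfile.TemporaryDirectory() as log_dir:
    # Both writers read the same snapshot and race to commit version 1.
    try_commit(log_dir, 1, '{"add": ["x.parquet"]}')      # writer A wins
    try:
        try_commit(log_dir, 1, '{"add": ["y.parquet"]}')  # writer B loses
    except ConcurrentModificationError as e:
        print("writer B:", e)
```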

Isolation

  • Uses write serializable isolation by default for all table writes and updates. Snapshot isolation is used for all table reads.

Durability

  • Uses cloud object storage to store all data files and transaction logs. Cloud object storage has high availability and durability

Time Travel (Data Versioning)

  • The transaction log maintains a history of changes, allowing you to query or roll back to older snapshots of data.
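Time travel falls out of the log design: a snapshot at any version is just the log replayed up to that version. A minimal sketch, assuming a simplified log of add/remove actions (the real log has richer entries):

```python
# Toy log: version -> list of (action, data file path).
log = {
    0: [("add", "a.parquet"), ("add", "b.parquet")],
    1: [("remove", "a.parquet"), ("add", "c.parquet")],  # e.g. an UPDATE rewrote a.parquet
}

def snapshot(log, as_of_version):
    """Replay add/remove actions up to as_of_version to get the set of
    live data files at that version."""
    live = set()
    for version in sorted(log):
        if version > as_of_version:
            break
        for action, path in log[version]:
            live.add(path) if action == "add" else live.discard(path)
    return sorted(live)

print(snapshot(log, 0))  # → ['a.parquet', 'b.parquet']  (the table as of v0)
print(snapshot(log, 1))  # → ['b.parquet', 'c.parquet']  (the current table)
```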

Vacuum

  • Cleans up unused data files (from older table versions) once they are past the retention period (7 days by default)
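The retention rule can be sketched with toy bookkeeping (an assumed simplification, not Delta's actual file tracking): only files removed from the table state longer ago than the retention window are safe to delete, so recent time travel keeps working.

```python
def vacuum(removed_files, retention_hours, now):
    """removed_files maps path -> timestamp (seconds) when the file was
    removed from the table state; return the paths past retention that
    are safe to physically delete."""
    cutoff = now - retention_hours * 3600
    return sorted(p for p, t in removed_files.items() if t < cutoff)

now = 1_000_000_000
removed = {
    "old.parquet":    now - 10 * 24 * 3600,  # removed 10 days ago
    "recent.parquet": now - 1 * 3600,        # removed 1 hour ago
}
# With the 7-day (168 h) default retention, only the old file goes.
print(vacuum(removed, retention_hours=168, now=now))  # → ['old.parquet']
```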

Optimization

Partitioning

  • Creates a subdirectory per partition value, giving clear isolation at the directory level

  • Splits data into different subdirectories, so queries that filter on the partition key can skip whole directories

  • Not suitable for high-cardinality columns, e.g. id

  • Changing the partition key is painful, since it requires rewriting the table
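The directory layout behind these points can be sketched as follows (column names and paths are illustrative; this mimics the Hive-style `key=value/` convention Delta uses for partitioned tables):

```python
from collections import defaultdict

rows = [
    {"date": "2024-01-01", "id": 1},
    {"date": "2024-01-01", "id": 2},
    {"date": "2024-01-02", "id": 3},
]

def partition_paths(rows, key):
    """Group rows into Hive-style partition paths: one subdirectory per
    distinct value of the partition column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[f"{key}={row[key]}/part-00000.parquet"].append(row)
    return dict(buckets)

for path in sorted(partition_paths(rows, "date")):
    print(path)
# → date=2024-01-01/part-00000.parquet
# → date=2024-01-02/part-00000.parquet

# Partitioning by the high-cardinality "id" column would instead create
# one tiny directory per row -- which is why id is a bad partition key.
```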

Z-ordering indexing

  • Z-ordering is a data optimization technique used in Delta Lake tables to physically co-locate related information (rows) in the same set of data files on disk, so queries that filter on the Z-ordered columns can skip irrelevant files
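The core idea can be sketched with the Morton (Z-order) code: interleave the bits of the clustered columns into one sort key, so rows close in *both* columns land close together on disk. This is a simplified illustration of the principle, not Delta's actual implementation (which also handles non-integer types and skew):

```python
def z_order_key(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into a Morton code."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions: x
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions: y
    return key

points = [(0, 0), (1, 1), (7, 0), (0, 7), (1, 0)]
print(sorted(points, key=lambda p: z_order_key(*p)))
# → [(0, 0), (1, 0), (1, 1), (7, 0), (0, 7)]
```

Sorting the table by this key and chunking it into files means a filter on either column maps to a small, contiguous set of files, unlike a sort on a single column, which only helps filters on that column.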

Liquid Clustering

  • Liquid Clustering is a Databricks/Delta Lake feature that continuously organizes data files by one or more columns, without requiring fixed partitions.

    Think of it as a flexible, auto-managed replacement for static partitioning:

    • You specify clustering columns (e.g. customer_id, date).

    • Databricks automatically reorganizes data files in the background so that:

      • Rows with similar clustering values end up physically close together.

      • Query engines can skip more files and read less data.

    You get many of the benefits of partitioning (pruning, better IO) but with fewer constraints and less manual maintenance.

  • Flexible: clustering keys can be redefined without rewriting the table

Type of Table

Managed Table

  • Created under the database directory

  • Dropping the table deletes the underlying data files

External Table

  • Created outside the database directory

  • Dropping the table does not delete the underlying data files

Copying & Cloning

Create Table As Select (CTAS)

  • Creates and populates a new table from the results of a SELECT query

Deep Clone

  • Fully copies data and metadata from the source table

  • Can be re-run to incrementally sync changes from the source

Shallow Clone

  • Quickly creates a copy of a table by duplicating only its metadata (the Delta log); the data files are referenced from the source table rather than copied
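The deep vs. shallow distinction can be sketched with toy table metadata (an assumed simplification; paths are illustrative): a deep clone copies both the log and the data files, while a shallow clone copies only the log, whose entries keep pointing at the source table's files, which is why it is fast and cheap to create.

```python
# Toy table: a log of "add" entries plus the data files it owns.
source = {
    "log": [{"add": "s3://source/part-0.parquet"}],
    "data_files": ["s3://source/part-0.parquet"],
}

def shallow_clone(src, target_prefix):
    # Only metadata is written under the target; log entries still
    # reference the source's data files, so nothing is copied.
    return {"log": list(src["log"]), "data_files": []}

def deep_clone(src, target_prefix):
    # Data files are physically copied under the target, and the new
    # log references the copies.
    copied = [p.replace("s3://source/", target_prefix)
              for p in src["data_files"]]
    return {"log": [{"add": p} for p in copied], "data_files": copied}

print(shallow_clone(source, "s3://target/")["data_files"])  # → []
print(deep_clone(source, "s3://target/")["data_files"])
# → ['s3://target/part-0.parquet']
```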

Views
