Data Lakehouse

Introduction

  • Provides scalable storage and processing for modern organizations that want to avoid maintaining isolated systems for different workloads (e.g. BI, machine learning, streaming)

  • Establishes a single source of truth, eliminates redundant costs, and ensures data freshness

Vs. data warehouse & data lake

Data warehouse

  • Designed for data that changes infrequently; data warehouses prevent conflicts between concurrently running queries. Many rely on proprietary storage formats, which often limits support for machine learning workloads

Data lake

  • Stores and processes data cheaply and efficiently; accepts data of any nature, in any format

  • Not well suited for BI reporting, because the data is unvalidated

Data lakehouse

  • Combines the benefits of data lakes and data warehouses

  • Provides

    • Open, direct access to data stored in standard data formats.

    • Indexing protocols optimized for machine learning and data science.

    • Low query latency and high reliability for BI and advanced analytics.

Delta Lake

Introduction

  • It is an open-source storage framework that brings ACID transactions (reliability), scalability, and high performance to cloud storage, creating a "lakehouse" architecture.

  • When you create a Delta Lake table, it is stored as one or more data files in Parquet format. Alongside these files, Delta stores a transaction log.

  • The Delta Lake transaction log, also known as the Delta Log, is an ordered record of every transaction performed on the table since its creation. It serves as the single source of truth: every time you query the table, Spark checks this transaction log to retrieve the most recent version of the data.

  • Delta Lake creates a transaction log (a "_delta_log" folder) that records every change made to the data. This log makes the data readable and manageable, transforming a data lake into a structured Lakehouse.
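As an illustration, the on-storage layout described above can be sketched in plain Python. This is a toy simulation, not the real Delta implementation; file and folder names below mirror the real convention (`_delta_log`, zero-padded commit numbers, Parquet part files), but the commit contents are simplified:

```python
import json
import tempfile
from pathlib import Path

# Toy sketch of a Delta table directory: Parquet data files next to a
# _delta_log folder holding ordered JSON commit files (contents elided;
# real data files would be actual Parquet).
def sketch_delta_table(root: Path) -> list[str]:
    (root / "_delta_log").mkdir(parents=True)
    (root / "part-00000.snappy.parquet").touch()
    (root / "part-00001.snappy.parquet").touch()
    # Commit 0: the table's creation, recording the files it added.
    commit = [{"add": {"path": "part-00000.snappy.parquet"}},
              {"add": {"path": "part-00001.snappy.parquet"}}]
    log_file = root / "_delta_log" / "00000000000000000000.json"
    log_file.write_text("\n".join(json.dumps(a) for a in commit))
    return sorted(p.relative_to(root).as_posix()
                  for p in root.rglob("*") if p.is_file())

with tempfile.TemporaryDirectory() as d:
    for entry in sketch_delta_table(Path(d)):
        print(entry)
# → _delta_log/00000000000000000000.json
# → part-00000.snappy.parquet
# → part-00001.snappy.parquet
```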

Key Features and Benefits

ACID Transactions:

  • Ensures data integrity by allowing multiple users to read/write concurrently without corruption.

Atomicity

  • All transactions either succeed or fail completely

  • The transaction log controls commit atomicity. During a transaction, data files are written to the file directory backing the table. When the transaction completes, a new entry is committed to the transaction log that includes the paths to all files written during the transaction.

  • Data files are not tracked unless the transaction log records a new version. If a transaction fails after writing data files to a table, those data files do not corrupt the table state.
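The commit mechanism above can be sketched as a toy simulation (an assumed simplification, not Delta's actual code): the table state is defined only by files referenced in committed log entries, so data files staged by a failed transaction never become visible.

```python
import json
import os
import tempfile

# Toy table: data files become "live" only once a commit in the log
# references them.
class ToyTable:
    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_delta_log")
        os.makedirs(self.log_dir)

    def write_data_file(self, name):
        # Stage a data file in the table directory (contents elided).
        open(os.path.join(self.root, name), "w").close()

    def commit(self, version, added_paths):
        # The commit is the act of writing the numbered log entry.
        entry = os.path.join(self.log_dir, f"{version:020d}.json")
        with open(entry, "w") as f:
            json.dump({"add": added_paths}, f)

    def live_files(self):
        # Table state = union of files added by committed versions.
        live = []
        for name in sorted(os.listdir(self.log_dir)):
            with open(os.path.join(self.log_dir, name)) as f:
                live += json.load(f)["add"]
        return live

with tempfile.TemporaryDirectory() as d:
    t = ToyTable(d)
    t.write_data_file("a.parquet")
    t.commit(0, ["a.parquet"])           # transaction 1 completes
    t.write_data_file("orphan.parquet")  # transaction 2 fails before commit
    print(t.live_files())                # → ['a.parquet']
```

The orphan file sits on storage but is invisible to readers, which is exactly why a failed write cannot corrupt the table.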

Consistency

  • Consistency guarantees relate to how a given state of the data is observed by simultaneous operations

  • Delta Lake uses optimistic concurrency control to provide transactional guarantees between writes

  • Validate and commit:

    • Checks whether the proposed changes conflict with any other changes that may have been concurrently committed since the snapshot that was read.

    • If there are no conflicts, all the staged changes are committed as a new versioned snapshot, and the write operation succeeds.

    • If there are conflicts, the write operation fails with a concurrent modification exception. This failure prevents corruption of data.
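The validate-and-commit flow can be sketched as a toy simulation. The key assumption here (a simplification of how real commits are arbitrated) is put-if-absent semantics on the next numbered log file: only one writer can create it, and the loser fails with a concurrent-modification error instead of silently overwriting the winner's commit.

```python
import os
import tempfile

class ConcurrentModificationError(Exception):
    pass

def try_commit(log_dir, version, payload):
    """Attempt to commit `payload` as log version `version`."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_CREAT | O_EXCL fails if the file already exists
        # (put-if-absent), so exactly one writer wins each version.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise ConcurrentModificationError(
            f"version {version} already committed")
    with os.fdopen(fd, "w") as f:
        f.write(payload)

with tempfile.TemporaryDirectory() as log_dir:
    # Both writers read the same snapshot and race to commit version 1.
    try_commit(log_dir, 1, '{"add": ["x.parquet"]}')      # writer A wins
    try:
        try_commit(log_dir, 1, '{"add": ["y.parquet"]}')  # writer B loses
    except ConcurrentModificationError as e:
        print("writer B:", e)
```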

Isolation

  • Uses write serializable isolation by default for all table writes and updates. Snapshot isolation is used for all table reads.

Durability

  • Uses cloud object storage to store all data files and transaction logs. Cloud object storage has high availability and durability

Time Travel (Data Versioning)

  • The transaction log maintains a history of changes, allowing you to query or roll back to older snapshots of data.
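Time travel falls out of the log design: a snapshot at any version is just the log replayed up to that version. A minimal sketch, assuming a simplified log of add/remove actions (the real log has richer entries):

```python
# Toy log: version -> list of (action, data file path).
log = {
    0: [("add", "a.parquet"), ("add", "b.parquet")],
    1: [("remove", "a.parquet"), ("add", "c.parquet")],  # e.g. an UPDATE rewrote a.parquet
}

def snapshot(log, as_of_version):
    """Replay add/remove actions up to as_of_version to get the set of
    live data files at that version."""
    live = set()
    for version in sorted(log):
        if version > as_of_version:
            break
        for action, path in log[version]:
            live.add(path) if action == "add" else live.discard(path)
    return sorted(live)

print(snapshot(log, 0))  # → ['a.parquet', 'b.parquet']  (the table as of v0)
print(snapshot(log, 1))  # → ['b.parquet', 'c.parquet']  (the current table)
```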

Vacuum

  • Cleans up unused data files (from older table versions) once they are past the retention period (7 days by default)
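The retention rule can be sketched with toy bookkeeping (an assumed simplification, not Delta's actual file tracking): only files removed from the table state longer ago than the retention window are safe to delete, so recent time travel keeps working.

```python
def vacuum(removed_files, retention_hours, now):
    """removed_files maps path -> timestamp (seconds) when the file was
    removed from the table state; return the paths past retention that
    are safe to physically delete."""
    cutoff = now - retention_hours * 3600
    return sorted(p for p, t in removed_files.items() if t < cutoff)

now = 1_000_000_000
removed = {
    "old.parquet":    now - 10 * 24 * 3600,  # removed 10 days ago
    "recent.parquet": now - 1 * 3600,        # removed 1 hour ago
}
# With the 7-day (168 h) default retention, only the old file goes.
print(vacuum(removed, retention_hours=168, now=now))  # → ['old.parquet']
```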

Optimization

Partitioning

  • Creates a subdirectory per partition value, giving clear isolation at the directory level

  • Splits data into different subdirectories, so queries that filter on the partition key can skip whole directories

  • Not suitable for high-cardinality columns, e.g. id

  • Changing the partition key is painful, since it requires rewriting the table
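The directory layout behind these points can be sketched as follows (column names and paths are illustrative; this mimics the Hive-style `key=value/` convention Delta uses for partitioned tables):

```python
from collections import defaultdict

rows = [
    {"date": "2024-01-01", "id": 1},
    {"date": "2024-01-01", "id": 2},
    {"date": "2024-01-02", "id": 3},
]

def partition_paths(rows, key):
    """Group rows into Hive-style partition paths: one subdirectory per
    distinct value of the partition column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[f"{key}={row[key]}/part-00000.parquet"].append(row)
    return dict(buckets)

for path in sorted(partition_paths(rows, "date")):
    print(path)
# → date=2024-01-01/part-00000.parquet
# → date=2024-01-02/part-00000.parquet

# Partitioning by the high-cardinality "id" column would instead create
# one tiny directory per row -- which is why id is a bad partition key.
```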

Z-ordering indexing

  • Z-ordering is a data optimization technique used in Delta Lake tables to physically co-locate related information (rows) in the same set of data files on disk, so queries that filter on the Z-ordered columns can skip irrelevant files
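The core idea can be sketched with the Morton (Z-order) code: interleave the bits of the clustered columns into one sort key, so rows close in *both* columns land close together on disk. This is a simplified illustration of the principle, not Delta's actual implementation (which also handles non-integer types and skew):

```python
def z_order_key(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into a Morton code."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions: x
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions: y
    return key

points = [(0, 0), (1, 1), (7, 0), (0, 7), (1, 0)]
print(sorted(points, key=lambda p: z_order_key(*p)))
# → [(0, 0), (1, 0), (1, 1), (7, 0), (0, 7)]
```

Sorting the table by this key and chunking it into files means a filter on either column maps to a small, contiguous set of files, unlike a sort on a single column, which only helps filters on that column.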

Liquid Clustering

  • Liquid Clustering is a Databricks/Delta Lake feature that continuously organizes data files by one or more columns, without requiring fixed partitions.

    Think of it as a flexible, auto-managed replacement for static partitioning:

    • You specify clustering columns (e.g. customer_id, date).

    • Databricks automatically reorganizes data files in the background so that:

      • Rows with similar clustering values end up physically close together.

      • Query engines can skip more files and read less data.

    You get many of the benefits of partitioning (pruning, better IO) but with fewer constraints and less manual maintenance.

  • Flexible: clustering keys can be redefined without rewriting the table

Type of Table

Managed Table

  • Created under the database directory

  • Dropping the table deletes the underlying data files

External Table

  • Created outside the database directory

  • Dropping the table does not delete the underlying data files

Copying & Cloning

Create Table As Select (CTAS)

  • Creates and populates a new table from the results of a SELECT query

Deep Clone

  • Fully copies data and metadata from the source table

  • Can be re-run to incrementally sync changes from the source

Shallow Clone

  • Quickly creates a copy of a table by duplicating only its metadata (the Delta log); the data files are referenced from the source table rather than copied
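The deep vs. shallow distinction can be sketched with toy table metadata (an assumed simplification; paths are illustrative): a deep clone copies both the log and the data files, while a shallow clone copies only the log, whose entries keep pointing at the source table's files, which is why it is fast and cheap to create.

```python
# Toy table: a log of "add" entries plus the data files it owns.
source = {
    "log": [{"add": "s3://source/part-0.parquet"}],
    "data_files": ["s3://source/part-0.parquet"],
}

def shallow_clone(src, target_prefix):
    # Only metadata is written under the target; log entries still
    # reference the source's data files, so nothing is copied.
    return {"log": list(src["log"]), "data_files": []}

def deep_clone(src, target_prefix):
    # Data files are physically copied under the target, and the new
    # log references the copies.
    copied = [p.replace("s3://source/", target_prefix)
              for p in src["data_files"]]
    return {"log": [{"add": p} for p in copied], "data_files": copied}

print(shallow_clone(source, "s3://target/")["data_files"])  # → []
print(deep_clone(source, "s3://target/")["data_files"])
# → ['s3://target/part-0.parquet']
```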

Views
