Data Lakehouse
Introduction
Provides scalable storage and processing for modern organizations that want to avoid maintaining isolated systems for different workloads
Goals: establish a single source of truth, eliminate redundant costs, and ensure data freshness
Vs data warehouse & data lake
Data warehouse
Designed for data that changes infrequently; data warehouses seek to prevent conflicts between concurrently running queries. Many data warehouses rely on proprietary formats, which often limit support for machine learning
Data lake
Data lakes store and process data of any nature, in any format, cheaply and efficiently,
but they are not suited to BI reporting because the data is unvalidated.
Data lakehouse
Combines the benefits of data lakes and data warehouses
Provides
Open, direct access to data stored in standard data formats.
Indexing protocols optimized for machine learning and data science.
Low query latency and high reliability for BI and advanced analytics.
Delta Lake
Introduction
It is an open-source storage framework that brings ACID transactions (reliability), scalability, and high performance to cloud storage, creating a "lakehouse" architecture.
When you create a Delta Lake table, it is stored as one or more data files in
Parquet format. Alongside these files, Delta stores a transaction log as well.
The Delta Lake transaction log, also known as the Delta Log, is an ordered record of every transaction
performed on the table since its creation, kept in a "_delta_log" folder alongside the data files. It serves as the single source of truth: every time you query the table, Spark checks this transaction log to retrieve the most recent version of the data.
This log is what makes the data reliable and manageable, transforming a data lake into a structured lakehouse.
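The on-disk layout can be sketched with plain files. This is a minimal illustration, not the real Delta protocol: the file names follow the zero-padded-version convention, but the action records are simplified placeholders.

```python
import json
import os
import tempfile

# Sketch of a Delta-style table directory: data files at the root, and one
# zero-padded JSON commit file per version inside _delta_log/.
table = tempfile.mkdtemp()
log_dir = os.path.join(table, "_delta_log")
os.makedirs(log_dir)

def commit(version, actions):
    # Commit N is recorded as e.g. 00000000000000000000.json
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

commit(0, [{"add": {"path": "part-0000.parquet"}}])
commit(1, [{"add": {"path": "part-0001.parquet"}}])

# Readers replay the commits in version order to reconstruct the table state.
versions = sorted(os.listdir(log_dir))
print(versions)
```

Because the commit files sort lexicographically by version, a reader only needs a directory listing to discover the log's order.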
Key Features and Benefits
ACID Transactions:
Ensures data integrity by allowing multiple users to read/write concurrently without corruption.
Atomicity
All transactions either succeed or fail completely
The transaction log controls commit atomicity. During a transaction, data files are written to the file directory backing the table. When the transaction completes, a new entry is committed to the transaction log that includes the paths to all files written during the transaction.
Data files are not tracked unless the transaction log records a new version. If a transaction fails after writing data files to a table, those files will not corrupt the table state, because the log never references them.
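The write-then-commit order described above can be sketched in a few lines. This is a simplified model (hypothetical helper names, toy JSON commit records), not the real protocol:

```python
import json
import os
import tempfile

# Sketch: data files are written first, but the table "sees" them only once a
# commit file referencing them lands in the transaction log.
table = tempfile.mkdtemp()
log_dir = os.path.join(table, "_delta_log")
os.makedirs(log_dir)

def write_data_file(name):
    with open(os.path.join(table, name), "w") as f:
        f.write("parquet-bytes")

def commit(version, paths):
    with open(os.path.join(log_dir, f"{version:020d}.json"), "w") as f:
        json.dump({"add": paths}, f)

def current_files():
    # Only files referenced by committed log entries are part of the table.
    files = []
    for entry in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, entry)) as f:
            files += json.load(f)["add"]
    return files

write_data_file("part-0000.parquet")
commit(0, ["part-0000.parquet"])

write_data_file("part-0001.parquet")   # a transaction writes a file...
# ...but fails before committing: the orphan file never corrupts table state.
print(current_files())
```

The orphaned `part-0001.parquet` still sits in the directory, but no reader will ever touch it; cleanup is left to VACUUM.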
Consistency
Consistency guarantees relate to how a given state of the data is observed by simultaneous operations.
Delta Lake uses optimistic concurrency control to provide transactional guarantees between writes
Validate and commit:
Checks whether the proposed changes conflict with any other changes that may have been concurrently committed since the snapshot that was read.
If there are no conflicts, all the staged changes are committed as a new versioned snapshot, and the write operation succeeds.
If there are conflicts, the write operation fails with a concurrent modification exception. This failure prevents corruption of data.
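The validate-and-commit steps above can be sketched as a version check at commit time. This is a deliberately coarse model: the real Delta protocol inspects whether concurrent changes actually conflict, whereas this sketch treats any newer version as a conflict.

```python
# Optimistic concurrency control sketch: each writer records the table version
# it read, and a commit succeeds only if no newer version landed in between.
class ConcurrentModificationException(Exception):
    pass

class Table:
    def __init__(self):
        self.version = 0
        self.log = []

    def commit(self, read_version, change):
        # Validate: has anyone committed since the snapshot we read?
        if self.version != read_version:
            raise ConcurrentModificationException(change)
        # No conflict: stage the change as a new versioned snapshot.
        self.log.append(change)
        self.version += 1

t = Table()
v = t.version              # both writers read the snapshot at version 0
t.commit(v, "writer-A")    # A validates and commits: table moves to version 1
try:
    t.commit(v, "writer-B")    # B's snapshot is stale -> conflict
except ConcurrentModificationException:
    print("writer-B must retry against the new snapshot")
```

The losing writer retries by re-reading the latest snapshot and re-validating, which is why no lock is ever held while data files are being written.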
Isolation
Uses write-serializable isolation by default for all table writes and updates. Snapshot isolation is used for all table reads.
Durability
Uses cloud object storage to store all data files and transaction logs. Cloud object storage has high availability and durability




Time Travel (Data Versioning)

The transaction log maintains a history of changes, allowing you to query or roll back to older snapshots of data.
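Because the log is an ordered record of changes, any past snapshot can be reconstructed by replaying the log only up to the requested version. A toy replay, with simplified add/remove actions (not the real Delta action schema):

```python
# Each log entry lists data files added and removed in that commit.
log = [
    {"add": ["part-0.parquet"]},                                # version 0
    {"add": ["part-1.parquet"]},                                # version 1
    {"add": ["part-2.parquet"], "remove": ["part-0.parquet"]},  # version 2
]

def snapshot(as_of_version):
    # Time travel = replay the log, stopping at the requested version.
    files = set()
    for entry in log[: as_of_version + 1]:
        files |= set(entry.get("add", []))
        files -= set(entry.get("remove", []))
    return sorted(files)

print(snapshot(1))  # the older snapshot, before version 2's rewrite
print(snapshot(2))  # the latest snapshot
```

Rolling back is the same operation: restoring version 1 simply makes `snapshot(1)` the new current state.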
Vacuum

Clears up unused data files (from older versions) after the retention period
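The deletion rule can be sketched as a filter: a file is removable only if it is no longer referenced by the current snapshot *and* it is older than the retention period, so readers of recent snapshots are not broken. The 7-day figure below is the common default, used here as an assumption:

```python
import time

RETENTION_SECONDS = 7 * 24 * 3600  # assumed default retention of 7 days

now = time.time()
# Hypothetical files mapped to the time they were last part of a snapshot.
all_files = {
    "part-0.parquet": now - 10 * 24 * 3600,  # dereferenced long ago
    "part-1.parquet": now - 3600,            # dereferenced an hour ago
    "part-2.parquet": now,                   # still live
}
live = {"part-2.parquet"}  # files referenced by the current snapshot

to_delete = [
    f for f, mtime in all_files.items()
    if f not in live and now - mtime > RETENTION_SECONDS
]
print(to_delete)
```

Note that once a file is vacuumed, time travel to the versions that referenced it no longer works, which is the trade-off behind the retention window.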
Optimization
Partitioning


Creates subdirectories, giving clear isolation at the directory level
Splits the data into different subdirectories by partition key
Not suitable for high-cardinality columns, e.g. id
Changing the partition key is painful
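The layout behind these points is the Hive-style `column=value` subdirectory convention, sketched below (paths and column names are illustrative):

```python
# Hive-style partition layout: the partition column's value becomes a
# subdirectory, so a filter on that column can skip whole directories.
def partition_path(table_root, partition_col, value, filename):
    return f"{table_root}/{partition_col}={value}/{filename}"

print(partition_path("/data/events", "date", "2024-01-15", "part-0.parquet"))

# Why a high-cardinality key (e.g. id) is a bad choice: one directory
# (holding tiny files) per distinct value.
dirs = {partition_path("/data/events", "id", i, "part-0.parquet") for i in range(10_000)}
print(len(dirs))
```

The same layout explains why changing the key is painful: every file sits under a directory named after the old key, so repartitioning means rewriting the whole table.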
Z-ordering indexing

Z-ordering is a data optimization technique used in Delta Lake tables to physically co-locate related information (rows) in the same set of data files on disk
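The idea underneath Z-ordering can be shown with the classic bit-interleaving trick: interleaving the bits of two column values yields a single sort key (a Morton / Z-order code) under which rows close in *both* columns sort near each other, so they land in the same files and min/max file statistics can skip more of them. This is a sketch of the concept, not Delta's actual implementation:

```python
# Morton / Z-order code: interleave the bits of x and y into one integer.
def z_order_key(x, y, bits=8):
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # even bit positions from x
        key |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions from y
    return key

points = [(0, 0), (1, 0), (0, 1), (1, 1), (7, 7)]
print(sorted(points, key=lambda p: z_order_key(*p)))
```

Sorting rows by this key before writing files is what "physically co-locates" rows that are similar on both columns at once, which a plain lexicographic sort on (x, y) cannot do.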
Liquid Clustering
Liquid Clustering is a Databricks/Delta Lake feature that continuously organizes data files by one or more columns, without requiring fixed partitions.
Think of it as a flexible, auto-managed replacement for static partitioning:
You specify clustering columns (e.g. customer_id, date). Databricks automatically reorganizes data files in the background so that:
Rows with similar clustering values end up physically close together.
Query engines can skip more files and read less data.
You get many of the benefits of partitioning (pruning, better IO) but with fewer constraints and less manual maintenance.
Flexible: the clustering key can be redefined later
Type of Table
Managed Table
Created under the database directory
Dropping the table deletes the underlying data files
External Table
Created outside the database directory
Dropping the table does not delete the underlying data files
Copying & Cloning
Create Table As Select (CTAS): creates and populates a new table from the results of a query
Deep Clone
Fully copies data + metadata from the source
Can be re-run to sync incremental changes from the source
Shallow Clone
Quickly creates a copy of a table by copying only the metadata (transaction log); data files are referenced from the source rather than copied
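The deep/shallow distinction can be sketched at the file-system level. This is an illustration of the layout difference only (the real clones are created with SQL `CLONE` commands, and the log contents here are placeholder JSON):

```python
import os
import shutil
import tempfile

# Build a toy source table: one data file plus a _delta_log directory.
src = tempfile.mkdtemp()
os.makedirs(os.path.join(src, "_delta_log"))
with open(os.path.join(src, "part-0.parquet"), "w") as f:
    f.write("data")
with open(os.path.join(src, "_delta_log", "00000000000000000000.json"), "w") as f:
    f.write("{}")

def deep_clone(src, dst):
    shutil.copytree(src, dst)  # copies data files AND metadata

def shallow_clone(src, dst):
    # Copies only the metadata; the clone keeps referencing the source's
    # data files until its own writes diverge from the source.
    shutil.copytree(os.path.join(src, "_delta_log"), os.path.join(dst, "_delta_log"))

deep = tempfile.mkdtemp(); os.rmdir(deep)
deep_clone(src, deep)
shallow = tempfile.mkdtemp()
shallow_clone(src, shallow)

print(sorted(os.listdir(deep)))     # data file + log
print(sorted(os.listdir(shallow)))  # log only
```

This is why a shallow clone is near-instant regardless of table size, and why dropping the source's data files can break a shallow clone but never a deep one.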
Views
