Databricks
Introduction

Databricks is a cloud-based "Data Lakehouse" platform that unifies data engineering, analytics, and AI/machine learning workflows. Built by the creators of Apache Spark, it combines the scalability of data lakes with the reliability and performance of data warehouses to process, store, and analyze massive datasets.
Scalability
Databricks' ability to process massive amounts of data (terabytes or even petabytes) comes down to a concept called distributed computing, powered by its underlying engine, Apache Spark.
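The distributed-computing idea can be approximated in plain Python: split the data into partitions, process each partition independently (the "map" step that Spark would schedule across worker nodes), then combine the partial results (the "reduce" step). This is only a toy sketch of the pattern; real Spark handles partitioning, scheduling, and fault tolerance automatically, and the function names here are invented for illustration.

```python
from collections import Counter
from multiprocessing import Pool

def count_partition(lines):
    """Map step: count words in one partition (one 'worker')."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def distributed_word_count(lines, num_partitions=4):
    # Split the dataset into roughly equal partitions.
    partitions = [lines[i::num_partitions] for i in range(num_partitions)]
    # Process partitions in parallel, as Spark would across executors.
    with Pool(num_partitions) as pool:
        partials = pool.map(count_partition, partitions)
    # Reduce step: merge the partial counts into one result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    data = ["spark makes big data fast", "big data needs big compute"]
    print(distributed_word_count(data))
```

The key point is that each partition is processed with no knowledge of the others, which is what lets Spark scale the same computation from one machine to thousands.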
Reliability
Standard data lakes can get "messy" (the "Data Swamp" problem). Databricks uses Delta Lake, which adds ACID transactions. This means if a pipeline fails halfway through, it won't leave your data in a broken, partially-updated state—it rolls back automatically.
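The rollback behaviour can be illustrated with a toy transaction in plain Python: stage all changes first, and only publish them if every step succeeds, so a mid-pipeline failure leaves the original data untouched. Delta Lake achieves this at scale with a transaction log over cloud storage; this sketch (class and method names invented) only mirrors the all-or-nothing contract, not the real implementation.

```python
class ToyTable:
    """A minimal all-or-nothing writer mimicking ACID-style atomicity."""

    def __init__(self, rows):
        self.rows = list(rows)

    def transactional_append(self, new_rows, validate):
        # Stage changes on a copy; the live table is untouched until commit.
        staged = self.rows + list(new_rows)
        for row in new_rows:
            if not validate(row):
                # Abort: the staged copy is discarded, nothing is published.
                raise ValueError(f"validation failed for {row!r}")
        # Commit: publish the staged version in one step.
        self.rows = staged

table = ToyTable([{"id": 1}])
try:
    # The second row is invalid, so the whole append must roll back.
    table.transactional_append([{"id": 2}, {"id": None}],
                               validate=lambda r: r["id"] is not None)
except ValueError:
    pass
print(len(table.rows))  # still 1: the failed batch left no partial state
```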
Flexibility
Runs on top of AWS, Microsoft Azure, and Google Cloud.
Key Components
Data Lakehouse Architecture: Combines data lake storage with data warehouse functionality, enabling BI and AI on the same data.
Apache Spark Integration: Provides a managed platform for fast, distributed processing of large-scale data.
Delta Lake: Enables ACID transactions (reliability) on top of cloud storage.
MLflow: Built-in tool for managing the end-to-end machine learning lifecycle.
Unity Catalog: Provides unified governance, security, and data lineage.
Databricks SQL: Allows analysts to run SQL queries and build dashboards directly on the lakehouse.
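Databricks SQL runs warehouse-style queries directly on lakehouse tables. As a rough stand-in, the snippet below uses Python's built-in sqlite3 to show the kind of aggregate query an analyst might issue from a dashboard; the table name and data are invented for illustration.

```python
import sqlite3

# In-memory database standing in for a lakehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("emea", 100.0), ("emea", 50.0), ("amer", 200.0)])

# A typical dashboard-style aggregation query.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY region ORDER BY total DESC"
).fetchall()

for region, total in rows:
    print(region, total)
conn.close()
```

In Databricks SQL the same `GROUP BY` query would run against Delta tables in cloud storage, with the results feeding dashboards and alerts.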
Architecture

Control Plane
The management layer of Databricks. It is entirely hosted and managed by Databricks in their own cloud account (on AWS, Azure, or GCP).
Hosts the web application (the Databricks UI you log into).
Manages user identity, access control (SSO), and workspace configurations.
Stores your notebooks, jobs/orchestration schedules, and cluster configurations.
Manages the Unity Catalog (the centralized governance and metadata layer).
Orchestrates the creation and termination of compute resources (clusters).
Data Plane
This is where the actual data is stored and where the heavy lifting of data processing happens.
On Azure, for example, a storage account is deployed in the data plane within your own subscription; it backs the Databricks File System (DBFS).
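Cluster definitions like the one below are stored in the control plane, while the compute they describe is launched in the data plane. The field names follow the Databricks Clusters API; the values are illustrative examples, not recommendations.

```json
{
  "cluster_name": "example-etl-cluster",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "autotermination_minutes": 30
}
```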