Databricks

Introduction

  • Databricks is a cloud-based "Data Lakehouse" platform that unifies data engineering, analytics, and AI/machine learning workflows. Built by the creators of Apache Spark, it combines the scalability of data lakes with the reliability and performance of data warehouses to process, store, and analyze massive datasets.

Scalability

  • Databricks' ability to process massive amounts of data—terabytes or even petabytes—comes down to a concept called Distributed Computing, powered by its underlying engine, Apache Spark. The data is split into partitions, each partition is processed in parallel by a worker, and the partial results are combined into a final answer.
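The split–process–combine pattern behind distributed computing can be sketched in plain Python. This is only an analogy for how Spark works, using local processes as stand-ins for cluster workers; Spark handles the partitioning, scheduling, and fault tolerance automatically.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(partition):
    # Each "worker" computes an intermediate result on its own partition.
    return sum(partition)

def distributed_sum(data, num_partitions=4):
    # 1. Split the dataset into partitions (Spark does this automatically).
    size = max(1, len(data) // num_partitions)
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    # 2. Process the partitions in parallel across workers.
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(partial_sum, partitions))
    # 3. Combine the intermediate results into the final answer.
    return sum(partials)

if __name__ == "__main__":
    # Same result as sum(range(1_000_000)), but the work was parallelized.
    print(distributed_sum(list(range(1_000_000))))
```

On a real cluster the partitions live on different machines, so the dataset can be far larger than any single node's memory.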

Reliability

  • Standard data lakes can get "messy" (the "Data Swamp" problem). Databricks uses Delta Lake, which adds ACID transactions. This means if a pipeline fails halfway through, it won't leave your data in a broken, partially-updated state—it rolls back automatically.
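The all-or-nothing behavior described above can be illustrated with a toy transactional table in plain Python. This is not Delta Lake itself (Delta implements atomicity via a transaction log over cloud storage); it only demonstrates the rollback idea: changes are staged, and a half-finished batch never becomes visible.

```python
import copy

class TinyTransactionalTable:
    """Toy sketch of atomic (all-or-nothing) writes.
    Loosely inspired by Delta Lake's commit model; not its implementation."""

    def __init__(self):
        self.rows = []

    def transaction(self, operations):
        # Stage changes on a copy; readers still see the old state.
        staged = copy.deepcopy(self.rows)
        try:
            for op in operations:
                op(staged)
        except Exception:
            # Any failure discards the staged copy -- automatic rollback.
            return False
        # Only a fully successful batch is "committed" (made visible).
        self.rows = staged
        return True

table = TinyTransactionalTable()
# The second operation fails, so the first one is rolled back too:
table.transaction([lambda r: r.append({"id": 1}), lambda r: 1 / 0])
print(table.rows)  # still [] -- no broken, partially-updated state
```

Delta Lake gives the same guarantee for pipelines writing millions of files' worth of data.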

Flexibility

  • Runs on top of AWS, Microsoft Azure, and Google Cloud.

Key Components

  1. Data Lakehouse Architecture: Combines data lake storage with data warehouse functionality, enabling BI and AI on the same data.

  2. Apache Spark Integration: Provides a managed platform for fast, distributed processing of large-scale data.

  3. Delta Lake: Enables ACID transactions (reliability) on top of cloud storage.

  4. MLflow: Built-in tool for managing the end-to-end machine learning lifecycle.

  5. Unity Catalog: Provides unified governance, security, and data lineage.

  6. Databricks SQL: Allows analysts to run SQL queries and build dashboards directly on the lakehouse.
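The lakehouse promise in item 1 — BI and AI on the same data — can be sketched in plain Python. This is a toy stand-in: in Databricks, both workloads would read the same Delta table, the analyst through Databricks SQL and the data scientist through a Spark ML pipeline, with no copy into a separate warehouse.

```python
# One shared dataset (stand-in for a single Delta table in cloud storage).
sales = [
    {"region": "EU", "amount": 120.0},
    {"region": "US", "amount": 80.0},
    {"region": "EU", "amount": 60.0},
]

# BI-style workload: the kind of aggregate an analyst runs in Databricks SQL.
revenue_by_region = {}
for row in sales:
    revenue_by_region[row["region"]] = (
        revenue_by_region.get(row["region"], 0.0) + row["amount"]
    )

# ML-style workload: feature vectors extracted from the *same* rows,
# with no separate ETL copy into a warehouse.
features = [
    [row["amount"], 1.0 if row["region"] == "EU" else 0.0] for row in sales
]

print(revenue_by_region)
print(features)
```

Both results come from a single copy of the data, which is the core argument for the lakehouse architecture.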

Architecture

Control Plane

  • The management layer of Databricks. It is entirely hosted and managed by Databricks in their own cloud account (on AWS, Azure, or GCP).

  • Hosts the web application (the Databricks UI you log into).

  • Manages user identity, access control (SSO), and workspace configurations.

  • Stores your notebooks, jobs/orchestration schedules, and cluster configurations.

  • Manages the Unity Catalog (the centralized governance and metadata layer).

  • Orchestrates the creation and termination of compute resources (clusters).

Data Plane

  • The layer where the actual data is stored and where the heavy lifting of data processing happens, inside your own cloud account.

  • On Azure, a storage account is deployed in the data plane within your own subscription. It backs the Databricks File System (DBFS).
