Data Processing

Apache Spark

  • When you deploy a compute cluster or SQL warehouse on Azure Databricks, Apache Spark is configured and deployed to virtual machines.

Data Frame

  • A dataset organized into named columns. You can think of a DataFrame like a spreadsheet or a SQL table, a two-dimensional labeled data structure of a series of records (similar to rows in a table) and columns of different types. DataFrames provide a rich set of functions (for example, select columns, filter, join, and aggregate) that allow you to perform common data manipulation and analysis tasks efficiently.

  • DataFrames are immutable. Because of this, after performing transformations, a new DataFrame is returned that has to be saved to a variable in order to access it in subsequent operations.
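A minimal PySpark sketch of this behavior (table and column names are illustrative; requires an active Spark session, e.g., on Databricks):

```python
# Hypothetical table name; spark is the active SparkSession.
df = spark.read.table("events")

# filter() does not modify df; it returns a brand-new DataFrame
# that must be captured in a variable.
adults = df.filter(df.age >= 18)

# df itself is unchanged; only the new variable sees the transformation.
adults.select("name", "age").show()
```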

Vs Pandas

  • Both pandas and Apache Spark are used for data manipulation and analysis.

| Feature | Pandas | Apache Spark (PySpark) |
|---|---|---|
| Scale | Single machine (local) | Distributed cluster (multiple machines) |
| Memory | Loads everything into RAM | Can process data larger than RAM |
| Execution | Eager (runs immediately) | Lazy (plans first, runs later) |
| Performance | Faster for small datasets (< 1 GB) | Faster for massive datasets (> 10 GB) |
| Complexity | Simple to set up and debug | Requires cluster management/overhead |

  • Use Pandas if: Your data fits comfortably in your computer's RAM, you are doing exploratory data analysis (EDA), or you are working on a local machine/laptop.

  • Use Spark if: Your data is "Big Data" (multi-gigabyte or terabyte scale), you are building production data pipelines in the cloud (like on Databricks), or you need to process streaming data in real time.
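The eager-vs-lazy distinction can be illustrated with a plain-Python analogy (this is not Spark itself, just a mental model): a list comprehension computes immediately, like pandas, while a generator only records a plan and runs when results are requested, like Spark.

```python
data = [1, 2, 3, 4]

# Eager (pandas-like): the doubled values are computed right away.
eager = [x * 2 for x in data]

# Lazy (Spark-like): nothing is computed yet; this is only a plan.
lazy = (x * 2 for x in data)

# Work happens only when an "action" asks for the results.
result = list(lazy)

print(eager)   # [2, 4, 6, 8]
print(result)  # [2, 4, 6, 8]
```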

Reading File

  • Create a DataFrame from a file.
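A hedged PySpark sketch (the path and options are illustrative; requires a Spark session):

```python
# Read a CSV file into a DataFrame; header/inferSchema are common options.
df = (spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("/path/to/file.csv"))
```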

Write

  • Write a DataFrame to a table or file.
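A sketch of both targets (table and path names are illustrative; assumes `df` from a previous read):

```python
# Write to a table, replacing any existing data.
df.write.mode("overwrite").saveAsTable("my_schema.orders")

# Or write files at a storage path, e.g., Parquet, appending to what exists.
df.write.mode("append").format("parquet").save("/path/to/output")
```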

SQL

Type of Table

Managed Table

  • Created under the database directory

  • Databricks manages the data's location and lifecycle

  • Dropping the table also deletes the underlying data files

External Table

  • Created outside the database directory

  • User-managed, stored outside the default location

  • Dropping the table does not delete the underlying data files
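A SQL sketch of both table types (names and the storage path are illustrative):

```sql
-- Managed: Databricks controls the storage location; DROP deletes the data.
CREATE TABLE sales_managed (id INT, amount DOUBLE);

-- External: the user supplies LOCATION; DROP removes only the metadata.
CREATE TABLE sales_external (id INT, amount DOUBLE)
LOCATION 'abfss://container@account.dfs.core.windows.net/sales';
```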

File Location

  • For Unity Catalog, managed table data is stored in the managed storage location configured for the metastore, catalog, or schema.

  • For the Hive metastore (legacy approach), tables reside in the DBFS default location.

Describe Command

1. Simple Describe

  • Shows the basic schema: column names, data types, and any comments associated with those columns.

Output Includes:

  • col_name: Name of the column.

  • data_type: The type (string, int, array, etc.).

  • comment: Any documentation written during table creation.


2. Describe Extended (Most Common)

  • This is usually what developers want. It provides the basic schema plus detailed metadata about the table's physical storage and configuration.

Output Includes:

  • Database/Schema name

  • Owner (Who created it)

  • Created Time

  • Location (The exact S3/ADLS path where the data lives)

  • Provider (e.g., delta, parquet, csv)

  • Table Properties (Any custom tags or Delta configurations like delta.autoOptimize)

  • Partition Information (If the table is partitioned)


3. Describe Detail (Delta Specific)

  • If you are using Delta Lake, this command provides even deeper technical specifics regarding the current state of the Delta table.

Output Includes:

  • Format: Always delta.

  • SizeInBytes: Total physical size on disk.

  • NumFiles: How many parquet files make up the table (great for checking "small file problems").

  • CreatedAt & LastModified: Timestamps for table lifecycle.

  • MinReaderVersion / MinWriterVersion: Protocol versions for Delta features.


4. Describe History (Audit Trail)

  • While not strictly for "schema," this is a vital related command for Delta tables. It shows the transaction log (who changed what and when).

  • Use case: Finding out when a table was last updated or identifying which user ran a specific DELETE or UPDATE command.
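The four variants side by side (table name is illustrative):

```sql
DESCRIBE TABLE sales;      -- basic schema: columns, types, comments
DESCRIBE EXTENDED sales;   -- schema + location, owner, provider, properties
DESCRIBE DETAIL sales;     -- Delta-specific: sizeInBytes, numFiles, versions
DESCRIBE HISTORY sales;    -- Delta transaction log / audit trail
```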

Array / JSON Operations

Json

| Function | Example input | Usage | Result |
|---|---|---|---|
| get_json_object | {"id": 10, "city": "NY"} | get_json_object(col, '$.city') | "NY" |
| from_json | {"rank": 1, "score": 95} | from_json(col, 'rank INT, score INT') | {rank: 1, score: 95} (struct) |
| to_json | {a: 1, b: 2} (struct) | to_json(col) | "{"a":1,"b":2}" (string) |
| json_tuple | {"user": "A", "age": 30} | json_tuple(col, 'user', 'age') | "A", 30 (multiple columns) |

Array

| Function | Input | Usage | Result |
|---|---|---|---|
| flatten | [[1, 2], [3]] | flatten(col) | [1, 2, 3] |
| array_distinct | [1, 1, 2] | array_distinct(col) | [1, 2] |
| array_sort | [3, 1, 2] | array_sort(col) | [1, 2, 3] |
| array_remove | [1, 2, 3] | array_remove(col, 2) | [1, 3] |
| slice | [A, B, C, D] | slice(col, 2, 2) | [B, C] |
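The same functions with literal arrays:

```sql
SELECT flatten(array(array(1, 2), array(3)));    -- [1, 2, 3]
SELECT array_distinct(array(1, 1, 2));           -- [1, 2]
SELECT array_sort(array(3, 1, 2));               -- [1, 2, 3]
SELECT array_remove(array(1, 2, 3), 2);          -- [1, 3]
SELECT slice(array('A', 'B', 'C', 'D'), 2, 2);   -- ["B", "C"]
```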

Explode

This is the most common array operation. It creates a new row for every element in the array.

  • Input row: user_id: 1, skills: ["SQL", "Spark"]

  • SQL: SELECT user_id, explode(skills) AS skill FROM table

  • Result: two rows, (1, "SQL") and (1, "Spark")

  • Similarly, exploding [1, 2, 3] produces three rows: 1, 2, 3

collect_list

Collects all values from a group into an array, including duplicates.

  • Input Rows: A, A, B

  • SQL: SELECT collect_list(col) FROM table

  • Result: ["A", "A", "B"]

collect_set

Collects only unique values from a group into an array.

  • Input Rows: A, A, B

  • SQL: SELECT collect_set(col) FROM table

  • Result: ["A", "B"]
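collect_list and collect_set in one grouped query (table and column names are illustrative):

```sql
-- Hypothetical orders table: one row per (customer_id, product).
SELECT customer_id,
       collect_list(product) AS all_products,     -- keeps duplicates
       collect_set(product)  AS unique_products   -- de-duplicated
FROM orders
GROUP BY customer_id;
```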

Higher-Order Functions

1. transform (The "Map" function)

This is used to modify every element in an array.

  • Example: Imagine you have an array of temperatures in Celsius and want to convert them to Fahrenheit.

2. filter

This removes elements from an array that do not meet a specific condition.

  • Example: Keeping only the "High Priority" IDs from a list.

3. exists

This returns a boolean (true/false) if at least one element in the array matches your criteria.

  • Example: Checking if a customer's order history contains a specific product category.

4. aggregate (The "Reduce" function)

This is the most powerful (and complex) higher-order function. It reduces an entire array down to a single value by applying a stateful calculation.

  • Example: Calculating a custom weighted sum of an array.
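All four higher-order functions on literal arrays (the aggregate example is a plain sum rather than a weighted one, for brevity):

```sql
-- transform: Celsius to Fahrenheit for every element.
SELECT transform(array(0, 10, 20), c -> c * 9 / 5 + 32);   -- [32.0, 50.0, 68.0]

-- filter: keep only the "high priority" IDs.
SELECT filter(array(1, 55, 120), x -> x > 50);             -- [55, 120]

-- exists: does the array contain a matching element?
SELECT exists(array('books', 'toys'), c -> c = 'toys');    -- true

-- aggregate(array, start, merge): reduce the array to one value.
SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x);  -- 6
```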

Advanced SQL Commands

CTE (Common Table Expression)

  • A CTE is a temporary result set that you define within the execution of a single query. It starts with the WITH keyword and exists only for the duration of that specific command.

  • Purpose: To make complex queries more readable by breaking them into logical steps.

  • Scope: Only the query immediately following it.

  • Storage: In-memory during execution; nothing is saved to the Databricks Catalog.
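A minimal sketch (table and column names are illustrative):

```sql
-- The CTE exists only for the duration of this one statement.
WITH regional_sales AS (
  SELECT region, SUM(amount) AS total
  FROM sales
  GROUP BY region
)
SELECT region, total
FROM regional_sales
WHERE total > 1000;
```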

Create OR Replace

  • Allows you to either create a new table or completely overwrite an existing one in a single atomic operation.

  • If it doesn't exist: It creates the table from scratch.

  • If it does exist: It "drops" the old version and replaces it with the new definition and data.

Insert Overwrite

  • Used to replace the data within a table without changing the table's metadata (like its schema, permissions, or generated columns)

  • Preserves all historical versions of the data, you can time travel to previous versions

Create OR Replace VS Insert Overwrite

Similarity

  • The old rows are gone, replaced with new data.

Difference

  • The new data must match the existing table schema when using INSERT OVERWRITE, while CREATE OR REPLACE TABLE can also redefine the schema.
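Both side by side (names are illustrative):

```sql
-- CREATE OR REPLACE: replaces the table definition and its data;
-- the schema may change.
CREATE OR REPLACE TABLE events AS
SELECT id, ts, payload FROM raw_events;

-- INSERT OVERWRITE: replaces rows only; the metadata stays, and the
-- incoming data must match the target schema.
INSERT OVERWRITE events
SELECT id, ts, payload FROM raw_events WHERE ts >= '2024-01-01';
```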

Merge INTO

  • Mainly used for atomic upserts (insert new rows and update existing ones in a single transaction), which is safe under concurrent writes.
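A typical upsert sketch (table and column names are illustrative):

```sql
MERGE INTO customers AS t
USING updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET t.email = s.email
WHEN NOT MATCHED THEN
  INSERT (customer_id, email) VALUES (s.customer_id, s.email);
```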

Copying & Cloning

  • Create Table As Select (CTAS): creates a new table from the result of a query

  • Deep Clone: fully copies data + metadata from the source; re-running the clone syncs changes from the source

  • Shallow Clone: quickly creates a copy of a table by copying only the metadata; data files are referenced from the source, not copied
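All three approaches (table names are illustrative):

```sql
-- CTAS: new table materialized from a query result.
CREATE TABLE orders_copy AS SELECT * FROM orders;

-- Deep clone: copies data files and metadata; re-running syncs changes.
CREATE OR REPLACE TABLE orders_deep DEEP CLONE orders;

-- Shallow clone: copies only metadata; data files stay with the source.
CREATE OR REPLACE TABLE orders_shallow SHALLOW CLONE orders;
```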

Views

Querying File

  • A table created directly over a CSV file is an external table whose location is the CSV file's own path.

Copy Into

  • The COPY INTO command is a high-performance, retryable SQL command used to load data from a cloud storage location (like S3, ADLS, or GCS) into a Delta Lake table.

  • A simple one-off or incremental way to ingest raw data from a source; files that have already been loaded are skipped when the command is re-run.
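A sketch (catalog, path, and options are illustrative):

```sql
COPY INTO my_catalog.bronze.orders
FROM 'abfss://landing@account.dfs.core.windows.net/orders/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
```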

Delta Live Table (DLT)

  • A declarative framework for building reliable, maintainable, and testable data processing pipelines.

Change Data Capture (CDC)

  • CDC is a design pattern that identifies and captures only the changes (Inserts, Updates, and Deletes) made to a source database. Instead of copying the entire database every night, you just stream a "feed" of the rows that changed.

  • Here is an example of a CDC feed:

| user_id | name | city | op_type | sequence_num |
|---|---|---|---|---|
| 101 | Alice | London | INSERT | 001 |
| 101 | Alice | New York | UPDATE | 002 |
| 102 | Bob | Paris | DELETE | 003 |
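In DLT, a feed like this is typically applied with APPLY CHANGES INTO. A sketch, assuming the feed lands in a streaming source called users_cdc and the target users table has already been declared:

```sql
APPLY CHANGES INTO live.users
FROM stream(live.users_cdc)
KEYS (user_id)
APPLY AS DELETE WHEN op_type = 'DELETE'
SEQUENCE BY sequence_num
COLUMNS * EXCEPT (op_type, sequence_num)
STORED AS SCD TYPE 1;
```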

Validation / On Violation

| Action | SQL syntax | Python syntax | Result |
|---|---|---|---|
| Warn (default) | EXPECT | dp.expect | Invalid records are written to the target. |
| Drop | EXPECT ... ON VIOLATION DROP ROW | dp.expect_or_drop | Invalid records are dropped before data is written to the target. The count of dropped records is logged alongside other dataset metrics. |
| Fail | EXPECT ... ON VIOLATION FAIL UPDATE | dp.expect_or_fail | Invalid records prevent the update from succeeding. Manual intervention is required before reprocessing. This expectation causes a failure of a single flow and does not cause other flows in the pipeline to fail. |
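A SQL sketch of all three behaviors on a hypothetical pipeline table (names and conditions are illustrative):

```sql
CREATE OR REFRESH STREAMING TABLE clean_orders (
  CONSTRAINT valid_id     EXPECT (order_id IS NOT NULL),                           -- warn only
  CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,               -- drop bad rows
  CONSTRAINT valid_date   EXPECT (order_date IS NOT NULL) ON VIOLATION FAIL UPDATE -- fail the flow
)
AS SELECT * FROM STREAM(raw_orders);
```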

Streaming

  • Either a traditional approach, where you reprocess the entire dataset each time you receive new data,

  • or custom logic that captures only the files or records added since the last time an update was run.

Structured Streaming

  • It allows users to interact with an ever-growing data source as if it were just a static table of records.

    New data in the input stream is simply treated as new rows appended to a table, and such a table representing an infinite data source is called an "unbounded" table.

    As noted, an input data stream could be a directory of files, a messaging system like Kafka, or simply a Delta table.

  • spark.readStream queries a Delta table as a stream source, which allows us to process all of the data already present in the table as well as any new data that arrives later.

  • This creates a streaming DataFrame, on which we can apply any transformation as if it were a static DataFrame.

  • To persist the results of a streaming query, we write them out to durable storage using the dataframe.writeStream method, which lets us configure the output.
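A minimal end-to-end sketch (table names and paths are illustrative; requires a Spark session):

```python
# Unbounded source: new rows in the table become new micro-batches.
stream_df = spark.readStream.table("orders_bronze")

# Transformations look identical to static DataFrame code.
clean_df = stream_df.filter("amount > 0")

# Persist results; the checkpoint tracks progress for exactly-once writes.
(clean_df.writeStream
    .option("checkpointLocation", "/path/to/checkpoint")
    .outputMode("append")
    .table("orders_silver"))
```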

Auto Loader

  • Auto Loader can scale to support real-time ingestion of millions of files per hour. Since it is based on Spark Structured Streaming, it uses checkpointing to track the ingestion process and to store metadata about the discovered files.

  • It ensures that data files are processed exactly once, and in case of failure it can resume from where it left off.

  • It uses a specific stream reader format called cloudFiles; the format of the source files is specified with the cloudFiles.format option.

  • It detects new files as they arrive and queues them for ingestion. Once the files are read, the data is written into a target table using the stream writer, where you provide the location to store the checkpointing information.

  • By default, all columns are inferred as STRING: Auto Loader cannot safely determine a specific data type (like integer, boolean, or timestamp), so it chooses the "safest" format, plain text. A schema hint can be provided via the cloudFiles.schemaHints option.
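A sketch of Auto Loader with a schema hint (paths, columns, and table names are illustrative):

```python
# cloudFiles is the Auto Loader source format; schemaHints overrides the
# default STRING inference for selected columns.
df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/path/to/schema")
        .option("cloudFiles.schemaHints", "amount DOUBLE, ts TIMESTAMP")
        .load("/path/to/landing/"))

(df.writeStream
    .option("checkpointLocation", "/path/to/checkpoint")
    .table("bronze_orders"))
```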

Schema Evolution Mode

| Mode | Behavior |
|---|---|
| addNewColumns | (Default) Updates the schema and restarts the stream to include the new column. |
| failOnNewColumns | Stops the stream and throws an error so you can manually review the change. |
| rescue | Keeps the table schema exactly as is, but puts new/mismatched data into a _rescued_data column. |
| none | Ignores new columns and doesn't capture them at all (not recommended). |

Auto Loader vs Read Stream

| Feature | Auto Loader | Read Stream (direct) |
|---|---|---|
| Data source | Files (S3/ADLS/GCS) | Messaging (Kafka/Kinesis) or Delta tables |
| Tracking | Tracks file names via checkpoint | Tracks offsets via checkpoint |
| Schema | Automatic inference & evolution | Usually manual (must parse JSON/Avro) |
| Best for | Initial ingestion of raw data | Real-time event processing |

Multi-Hop Architecture

  • Multi-hop architecture usually consists of three layers: bronze, silver, and gold.

  • The bronze table contains raw data ingested from various sources, like JSON files, operational databases, or a Kafka stream.

  • The silver table provides a more refined view of the data. For example, data can be cleaned and filtered at this level, and we can join fields from various bronze tables to enrich the silver records.

  • The gold table provides business-level aggregations, often used for reporting and dashboarding, or even for machine learning.

Example

Bronze

  • Load the files into the order-raw table and create a view

  • Create a temporary view and insert it into the bronze table

Silver

  • Create a temporary view joining other tables

  • Insert into the silver table

Gold

  • Aggregate the table to form the view

  • Output as a result table
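The steps above can be sketched in SQL (all table and column names are illustrative):

```sql
-- Bronze: raw ingestion (e.g., Auto Loader / COPY INTO fills orders_raw).
CREATE OR REPLACE TEMP VIEW orders_tmp AS
SELECT *, current_timestamp() AS ingest_ts FROM orders_raw;
INSERT INTO orders_bronze SELECT * FROM orders_tmp;

-- Silver: enrich by joining a lookup table.
CREATE OR REPLACE TEMP VIEW orders_enriched AS
SELECT o.*, c.name
FROM orders_bronze o
JOIN customers c ON o.customer_id = c.customer_id;
INSERT INTO orders_silver SELECT * FROM orders_enriched;

-- Gold: business-level aggregation for reporting.
CREATE OR REPLACE TABLE orders_gold AS
SELECT name, SUM(amount) AS total
FROM orders_silver
GROUP BY name;
```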
