Data Processing
Querying Files

SELECT * FROM read_files(
  '${dataset_bookstore}/books-csv/export_*.csv',
  format => 'csv',
  header => 'true',
  delimiter => ';'
);
A table created on top of these files is an external table pointing at the same location as the CSV files; the data is not copied.
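As a sketch of this, the external table can be registered from a notebook with `spark.sql`; the table name `books_csv` is a hypothetical choice, and `${dataset_bookstore}` is the same path placeholder used in the query above:

```python
# Assumed table name (books_csv); the LOCATION is the directory that
# already holds the CSV files. Because this is an external table,
# dropping it later leaves the files in place.
spark.sql("""
    CREATE TABLE IF NOT EXISTS books_csv
    USING CSV
    OPTIONS (header = 'true', delimiter = ';')
    LOCATION '${dataset_bookstore}/books-csv/'
""")
```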
Streaming

Without streaming, you have two options: a traditional approach, where you reprocess the entire dataset each time you receive new data, or custom logic that captures only the files or records added since the last update was run.
Structured Streaming

It allows users to interact with an ever-growing data source as if it were just a static table of records: new data arriving in the input stream is simply treated as new rows appended to a table. Such a table, representing an infinite data source, is called an "unbounded" table.
As we said, an input data stream can be a directory of files, a messaging system like Kafka, or simply a Delta table.
Use spark.readStream to query a Delta table as a stream source; this processes all of the data already present in the table as well as any new data that arrives later.
It creates a streaming DataFrame, to which we can apply any transformation as if it were just a static DataFrame.
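A minimal sketch of reading a Delta table as a stream, assuming a table named `books` with the columns shown (both names are illustrative):

```python
# Read the Delta table as a stream source: existing rows plus any
# rows appended later.
stream_df = spark.readStream.table("books")

# The streaming DataFrame supports the same transformations as a
# static one, e.g. a filter and a projection:
fiction_df = (stream_df
    .filter(stream_df.category == "Fiction")
    .select("book_id", "title", "author"))
```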

To persist the results of a streaming query, we need to write them out to durable storage using the DataFrame's writeStream method, which lets us configure the output.
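For example, assuming `stream_df` is the streaming DataFrame from above, the output configuration could look like this (target table name and checkpoint path are placeholders; `trigger(availableNow=True)` requires Spark 3.3+/Databricks):

```python
(stream_df.writeStream
    .format("delta")
    .outputMode("append")                                    # only append new rows
    .option("checkpointLocation", "/tmp/checkpoints/books")  # tracks stream progress
    .trigger(availableNow=True)                              # process available data, then stop
    .toTable("books_target"))
```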
Auto Loader

Auto Loader can scale to support real-time ingestion of millions of files per hour. Since it is based on Spark Structured Streaming, Auto Loader uses checkpointing to track the ingestion process and to store metadata about the discovered files.
This ensures that data files are processed exactly once, and in case of failure Auto Loader can resume from where it left off.
Auto Loader uses a dedicated StreamReader format called cloudFiles; the format of the source files is specified with the cloudFiles.format option.
It detects new files as they arrive and queues them for ingestion. Once the files are read, we write the data into a target table using the StreamWriter, providing a location to store the checkpointing information.
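The Auto Loader flow above can be sketched as follows; the source path, schema/checkpoint locations, and target table name are all assumptions:

```python
(spark.readStream
    .format("cloudFiles")                                  # Auto Loader source format
    .option("cloudFiles.format", "json")                   # format of the source files
    .option("cloudFiles.schemaLocation",                   # where inferred schema is stored
            "/tmp/checkpoints/orders_updates")
    .load("/data/orders-raw")                              # directory to monitor for new files
  .writeStream
    .option("checkpointLocation",                          # ingestion progress + file metadata
            "/tmp/checkpoints/orders_updates")
    .outputMode("append")
    .toTable("orders_updates"))
```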
Multi-Hop Architecture

A multi-hop architecture usually consists of three layers: bronze, silver, and gold.
Bronze tables contain raw data ingested from various sources, such as JSON files, operational databases, or a Kafka stream.
Silver tables provide a more refined view of the data. For example, records can be cleaned and filtered at this level, and fields from various bronze tables can be joined to enrich the silver records.
Gold tables provide business-level aggregations, often used for reporting and dashboarding, or even for machine learning.
Example
Bronze

Load the files into the order-raw table and create a view over it.


Create a temporary view and insert its records into the bronze table.
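A hedged sketch of this bronze step; all table names, paths, and the added column are assumptions, not the course's exact code:

```python
# Ingest raw order files with Auto Loader (assumed Parquet source).
raw_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "/tmp/checkpoints/orders_raw")
    .load("/data/orders-raw"))

# Expose the stream as a temporary view so it can be enriched with SQL.
raw_df.createOrReplaceTempView("orders_raw_temp")
enriched_df = spark.sql("""
    SELECT *, current_timestamp() AS arrival_time
    FROM orders_raw_temp
""")

# Append the enriched records into the bronze table.
(enriched_df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/orders_bronze")
    .outputMode("append")
    .toTable("orders_bronze"))
```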
Silver

Create a temporary view and join it with another table.

Insert the result into the silver table.
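A sketch of the silver step; `customers` is an assumed static lookup table, and the column names are illustrative:

```python
# Stream the bronze table and expose it as a temporary view.
orders_df = spark.readStream.table("orders_bronze")
orders_df.createOrReplaceTempView("orders_bronze_temp")

# Stream-static join: each micro-batch of orders is joined against the
# current version of the static customers table.
enriched_df = spark.sql("""
    SELECT o.order_id, o.quantity, o.customer_id, c.profile
    FROM orders_bronze_temp o
    INNER JOIN customers c ON o.customer_id = c.customer_id
""")

(enriched_df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/orders_silver")
    .outputMode("append")
    .toTable("orders_silver"))
```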
Gold

Aggregate the table to form the view.

Output the result as a table.
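The gold step could be sketched like this; the grouping column and target table name are assumptions:

```python
from pyspark.sql import functions as F

# Business-level aggregation over the silver table.
counts_df = (spark.readStream.table("orders_silver")
    .groupBy("customer_id")
    .agg(F.sum("quantity").alias("books_count")))

# Aggregations require "complete" output mode: the whole result table
# is rewritten on each update.
(counts_df.writeStream
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/checkpoints/customer_books_gold")
    .trigger(availableNow=True)
    .toTable("customer_books_gold"))
```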