Compute

Cluster

  • All-purpose clusters are interactive, used for development, data exploration, and ad-hoc queries.

  • Jobs clusters are dedicated to automated tasks, typically ephemeral (spin up for a job, terminate after).
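The split above shows up directly in the Databricks Jobs API: a task can point at an existing all-purpose cluster by ID, or declare an ephemeral jobs cluster inline that is created for the run and terminated afterward. A minimal sketch (field names follow the Jobs API; the IDs, paths, and instance types are placeholders):

```python
import json

# Task pinned to an existing all-purpose (interactive) cluster.
# The cluster ID is a placeholder.
task_on_all_purpose = {
    "task_key": "explore",
    "notebook_task": {"notebook_path": "/Users/me/explore"},
    "existing_cluster_id": "0101-123456-abcde123",
}

# Task with an ephemeral jobs cluster: created when the run starts,
# terminated automatically when the run finishes.
task_on_jobs_cluster = {
    "task_key": "nightly_etl",
    "notebook_task": {"notebook_path": "/Repos/etl/run"},
    "new_cluster": {
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
}

job_spec = {"name": "nightly-etl", "tasks": [task_on_all_purpose, task_on_jobs_cluster]}
print(json.dumps(job_spec, indent=2))
```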

Termination

  • Auto-termination helps reduce cost by shutting down clusters that have been inactive for a set period.

  • Terminating a cluster stops any running jobs and releases the compute resources.

Lost

Since clusters use ephemeral (temporary) storage, several things disappear the moment the cluster stops:

  • Spark Cache: Any DataFrames or tables you cached using .cache() or .persist() are cleared. You will have to re-run the code to load them back into memory.

  • Local Disk Data: Any files saved to the cluster's local SSD (e.g., /tmp/ or /local_disk0/) are permanently deleted.

  • Spark Session & Variables: All active Python/Scala/SQL variables and the current Spark session state are gone. You cannot "resume" a notebook mid-cell after termination.

  • Library Installs (Manual): If you installed libraries via %pip or the terminal during a session, they will not be there when you restart. (Libraries installed via the Compute UI will persist).

Save

Databricks separates "compute" from "storage" so your actual work remains safe:

  • Cluster Configuration: The "blueprint" (instance types, autoscaling rules, tags) is saved. You can restart the cluster with one click.

  • Notebooks: Your code, comments, and results (if saved) stay in your Workspace.

  • Data in DBFS/S3/ADLS: Any data written to permanent storage (like Delta tables or CSVs in a Data Lake) is completely unaffected.

  • Metadata: Table definitions in the Hive metastore or Unity Catalog remain intact.
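The saved "blueprint" described above corresponds to a Clusters API payload. A minimal sketch (field names are from the Databricks Clusters API; the name, instance type, and tag values are placeholders):

```python
import json

# The cluster "blueprint": this definition survives termination;
# only the underlying VMs are released.
cluster_blueprint = {
    "cluster_name": "dev-cluster",              # placeholder name
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",                # placeholder instance type
    "autoscale": {"min_workers": 1, "max_workers": 4},
    "autotermination_minutes": 30,              # shut down after 30 idle minutes
    "custom_tags": {"team": "data-eng"},        # placeholder tag
}
print(json.dumps(cluster_blueprint, indent=2))
```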

Restart

  • When libraries or configurations need to be reloaded.

  • When the cluster has become unresponsive or is in an unexpected state.

  • After modifying cluster-level init scripts or changing environment settings.
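A restart can also be triggered through the REST API. A minimal sketch against the /api/2.0/clusters/restart endpoint; the workspace host, token, and cluster ID are placeholders, and the request is only constructed here, not sent:

```python
import json
import urllib.request

host = "https://example.cloud.databricks.com"    # placeholder workspace URL
token = "dapi-XXXX"                              # placeholder access token
payload = {"cluster_id": "0101-123456-abcde123"} # placeholder cluster ID

req = urllib.request.Request(
    url=f"{host}/api/2.0/clusters/restart",
    data=json.dumps(payload).encode(),
    headers={"Authorization": f"Bearer {token}"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually send the restart request
print(req.method, req.full_url)
```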

Instance Family

Spot Instance

  • Spot instances let you use unused cloud VM capacity at a lower price, which can significantly reduce the cost of running your data applications. The downside is that the cloud provider can reclaim these instances at any time, interrupting your jobs. They are therefore ideal for temporary or batch workloads such as ETL jobs or machine learning training, where computing power matters but interruptions are acceptable.

Instance Pool

  • An instance pool is a set of idle, ready-to-use virtual machines. When cluster nodes are created from these idle instances, cluster start time and autoscaling time are reduced. So, if you have an automated job that needs to start as quickly as possible, an instance pool is a good fit.
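A pool definition can be sketched as an Instance Pools API payload. Field names are from the API; the pool name and sizes are placeholders. min_idle_instances is the key setting: it keeps warm VMs waiting, which is what cuts cluster start time.

```python
import json

instance_pool = {
    "instance_pool_name": "warm-pool",          # placeholder name
    "node_type_id": "i3.xlarge",                # placeholder instance type
    "min_idle_instances": 2,                    # VMs kept idle and ready
    "max_capacity": 10,                         # hard cap on pool size
    "idle_instance_autotermination_minutes": 60,  # release extra idle VMs
}
print(json.dumps(instance_pool, indent=2))
```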

Serverless Compute

  • Provides fully managed computing resources with minimal configuration requirements, while classic compute offers greater control and flexibility for customization and configurations.

  • Serverless clusters come with built-in features like the Photon engine, automatic runtime upgrades, automatic instance type selection, and autoscaling, in addition to out-of-the-box performance optimizations.

SQL Warehouse

  • SQL Warehouse is a specialized compute resource specifically optimized for running SQL queries, powering BI dashboards, and supporting data visualization tools like Tableau or Power BI.

Serverless SQL warehouses

  • Rapid startup time (typically between 2 and 6 seconds).

  • Rapid upscaling to acquire more compute when needed for maintaining low latency.

  • Query admittance is limited by the hardware's capacity rather than by virtual machine limits.

  • Quick downscaling to minimize costs when demand is low, providing consistent performance with optimized costs and resources.

Ideal for

  • ETL

  • Business intelligence

  • Exploratory analysis

Pro SQL warehouses

  • Supports Photon and Predictive IO, but does not support Intelligent Workload Management. With a pro SQL warehouse (unlike a serverless SQL warehouse), the compute layer exists in your AWS account rather than in your Databricks account.

Use a pro SQL warehouse when:

  • Serverless SQL warehouses are not available in a region.

  • You have custom-defined networking and want to connect to databases in your network in the cloud or on-premises for federation or a hybrid-type architecture. For example, use a pro SQL warehouse if you want to put other services into your network such as an event bus or databases, or you want to connect your network to your on-premises network.

Classic SQL warehouses

  • A classic SQL warehouse supports Photon but does not support Predictive IO or Intelligent Workload Management. With a classic SQL warehouse (unlike a serverless SQL warehouse), the compute layer exists in your AWS account rather than in your Databricks account.

  • Use a classic SQL warehouse to run interactive queries for data exploration with entry-level performance and Databricks SQL features.
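The three warehouse types above map onto two fields of the SQL Warehouses API: warehouse_type ("PRO" or "CLASSIC") and enable_serverless_compute (serverless warehouses use the PRO type with serverless enabled). A sketch with real field names; the warehouse name and size are placeholders:

```python
import json

def warehouse_spec(kind: str) -> dict:
    """Build a minimal SQL Warehouses API payload for a given warehouse kind."""
    return {
        "name": "bi-warehouse",       # placeholder name
        "cluster_size": "Small",
        "auto_stop_mins": 10,         # stop when idle to save cost
        # PRO supports Photon and Predictive IO; CLASSIC supports Photon only.
        "warehouse_type": "PRO" if kind in ("serverless", "pro") else "CLASSIC",
        # Serverless additionally runs compute in the Databricks account.
        "enable_serverless_compute": kind == "serverless",
    }

for kind in ("serverless", "pro", "classic"):
    print(kind, json.dumps(warehouse_spec(kind)))
```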
