Azure Databricks Cluster Configuration for Decision Support Workloads by MobiLab

Introduction

This document describes best practices for configuring Azure Databricks clusters for the decision support workloads we encounter in our typical customer projects.

A decision support system (DSS) is an application used to improve a company’s decision-making capabilities. It analyzes large amounts of data and presents an organization with the best available options. Decision support systems bring together data and knowledge from different areas and sources to give users information beyond the usual reports and summaries, helping them make informed decisions. A DSS is an informational application as opposed to an operational one: rather than processing day-to-day transactions, it provides relevant information drawn from a variety of data sources to support better-informed decision-making.

When creating a cluster (in the rest of this document, “cluster” refers to an Azure Databricks cluster), there are many parameters to take into account. The following parameters depend on the size and type of the workload and are not covered in this document:

  • Number of nodes

  • Size of each node

  • Node type (General purpose / Memory optimized / Storage optimized / Compute optimized)

Assuming the above parameters are chosen, there are secondary parameters that can yield faster execution and a better price/performance ratio with little additional effort. We discuss the following parameters and try to optimise them for typical decision support workloads; a minimal cluster-creation sketch follows the list:

  • Photon acceleration

  • Disk cache (aka Delta cache)

  • Azure VM SKU/CPU generation

  • Intel® vs AMD
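
For orientation, the sketch below shows where the primary sizing parameters live in a cluster specification submitted to the Databricks Clusters REST API. This is a minimal illustration, not a prescription: the workspace URL, access token and all values are placeholders to adapt per workload.

    # Minimal sketch: creating a cluster through the Databricks Clusters API.
    # Workspace URL, token and all values are placeholders.
    import requests

    cluster_spec = {
        "cluster_name": "dss-baseline",
        "spark_version": "13.3.x-scala2.12",  # Databricks Runtime version
        "num_workers": 8,                     # number of worker nodes
        "node_type_id": "Standard_E16s_v5",   # size and type of each node
    }

    resp = requests.post(
        "https://<workspace-url>/api/2.0/clusters/create",
        headers={"Authorization": "Bearer <personal-access-token>"},
        json=cluster_spec,
    )
    resp.raise_for_status()
    print("Created cluster:", resp.json()["cluster_id"])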

TPC-DS

The TPC Benchmark™ DS (TPC-DS) is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. The benchmark provides a representative evaluation of the System Under Test’s (SUT) performance as a general-purpose decision support system. It illustrates decision support systems that:

  • Examine large volumes of data;

  • Give answers to real-world business questions;

  • Execute queries of various operational requirements and complexities (e.g., ad-hoc, reporting, iterative OLAP, data mining);

  • Are characterized by high CPU and IO load;

  • Are periodically synchronized with source OLTP databases through database maintenance functions;

  • Run on “Big Data” solutions, such as RDBMS as well as Hadoop/Spark based systems.

TPC-DS is intended to provide a fair and honest comparison of various vendor implementations by providing highly comparable, controlled and repeatable tasks for evaluating the performance of decision support systems (DSS). Its workload is expected to test the upper boundaries of hardware system performance in the areas of CPU utilization, memory utilization, I/O subsystem utilization, and the ability of the operating system and database software to perform the complex functions important to DSS: examining large volumes of data, computing and executing the best execution plan for queries of high complexity, efficiently scheduling a large number of user sessions, and answering critical business questions.

For more information about TPC-DS, please refer to the official TPC-DS specification.

Benchmarks

Several benchmarks from Intel® and other sources compare the above parameters for specific dataset sizes and cluster configurations. Even though these benchmarks are useful for identifying best practices, our typical customer use cases differ from them in dataset and cluster size, so we executed similar benchmarks at a smaller scale, more in line with our needs. We do not expect to reproduce the exact results of the mentioned benchmarks due to differences in cluster size, dataset size and the SKUs chosen.

We are using the TPC-DS benchmark and various E16*** Azure VM SKUs.

Our test workload is a 100GB dataset generated by TPC-DS, and we run the standard 99 TPC-DS queries in a single session using the tools found in this project.

We also ran this benchmark for dataset sizes up to 1TB, as that range is more representative of what we see across our customer use cases. The normalized time and cost improvements in the different scenarios were consistent (with negligible variance) across the whole range, so for the rest of the document we simply present results for the 100GB runs.
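
As a rough illustration of the measurement loop, the sketch below runs the 99 queries sequentially in one session and derives a simple cost figure. It assumes the generated query files are available on DBFS and the TPC-DS tables are registered in the metastore; the path and the hourly rates are hypothetical placeholders.

    # Sketch of a single-session TPC-DS run in a Databricks notebook
    # (`spark` is the ambient SparkSession). The query path and the
    # hourly rates below are hypothetical placeholders.
    import glob
    import time

    durations = {}
    for path in sorted(glob.glob("/dbfs/tpcds/queries/q*.sql")):
        with open(path) as f:
            query = f.read()
        start = time.time()
        spark.sql(query).collect()  # force full execution of the query
        durations[path] = time.time() - start

    total_hours = sum(durations.values()) / 3600
    nodes = 9            # e.g. 8 workers + 1 driver
    rate = 1.00 + 0.50   # hypothetical VM $/h + DBU $/h per node
    print(f"Time: {total_hours:.2f} h, approx. cost: ${total_hours * nodes * rate:.2f}")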

Photon acceleration

Photon is the native vectorized query engine on Azure Databricks, written to be directly compatible with Apache Spark APIs so that it works with your existing code. It is developed in C++ to take advantage of modern hardware, and uses the latest techniques in vectorized query processing to capitalize on data- and instruction-level parallelism in CPUs, enhancing performance on real-world data and applications — all natively on your data lake. Photon is part of a high-performance runtime that runs your existing SQL and DataFrame API calls faster and reduces your total cost per workload.

Our tests show that enabling Photon on the same SKU (E16ds_v5) results in a 2.85x speedup and a 45% cost decrease for the same workload.

[Figure: Normalized effect of enabling Photon acceleration]
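
Photon is turned on by selecting a Photon-capable runtime when creating the cluster; for the Clusters API this is the runtime_engine field, set to PHOTON. One way to check that a query actually benefits is to look for Photon operators in its physical plan. A small sketch, using a TPC-DS table as the example:

    # Sketch: verifying Photon coverage of a query. When Photon executes
    # (parts of) a query, operator names in the physical plan typically
    # carry a "Photon" prefix (e.g. PhotonScan, PhotonGroupingAgg).
    df = spark.sql("""
        SELECT ss_store_sk, SUM(ss_net_paid) AS revenue
        FROM store_sales
        GROUP BY ss_store_sk
    """)
    df.explain()  # prints the physical plan; look for Photon* operators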

The following charts from an Intel benchmark show similar results (with bigger datasets) in terms of time and cost (price/performance) savings:

[Figure: Normalized processing time to complete queries on Databricks with vs. without Photon (lower is better)]

[Figure: Normalized cost to run a decision support workload with vs. without Photon (lower is better)]

Disk cache (aka Delta cache)

Azure Databricks uses disk caching to accelerate data reads by creating copies of remote Parquet data files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved reading speed. The cache works for all Parquet data files (including Delta Lake tables).

Our tests show that using a disk-cache-enabled SKU compared to a similar SKU without it (E16ds_v5 vs E16s_v5) results in a 2.21x speedup and a 56% cost decrease for the same workload.

[Figure: Normalized effect of disk cache]
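
On SKUs with local NVMe storage (the “d” variants such as E16ds_v5, shown as “Delta cache accelerated” in the cluster UI) the disk cache is typically enabled by default. It can also be controlled explicitly; a sketch, with placeholder sizes:

    # Sketch: disk cache settings. The enable flag can be toggled from a
    # running notebook; the size budget is normally supplied in the
    # cluster's Spark configuration. Values are illustrative placeholders.
    spark.conf.set("spark.databricks.io.cache.enabled", "true")

    # In the cluster spec (spark_conf), a per-node cache budget could be:
    #   "spark.databricks.io.cache.maxDiskUsage": "50g"

    # Optionally pre-warm the cache for hot columns, if the prewarming
    # command is available in your runtime:
    spark.sql("CACHE SELECT ss_store_sk, ss_net_paid FROM store_sales")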

Azure VM SKU/CPU generation

While the concept that newer hardware leads to better performance is hardly novel, it’s not always obvious how much of an improvement a workload will actually achieve on newer hardware. The improvement may be small enough that it doesn’t seem worth the effort or cost increase of moving to newer hardware. Azure removes a lot of the pain and effort from upgrading hardware.

We could not isolate the performance improvement of the generation upgrade from the effects of Photon and the disk cache. However, considering the little effort required to use the latest generations, and that they provide the option to use Photon and the disk cache, we advise upgrading.

Benchmarks from Intel® likewise show that the performance increase newer hardware delivers yields a net win in terms of value, supporting the choice to upgrade.

Intel® vs AMD

For most Databricks SKU options, there are both Intel® and AMD choices for a given VM size and feature set.

Our tests show that using the Intel® SKU (E16ds_v5) compared to the similar AMD SKU (E16ads_v5) results in a 1.05x speedup and a 3% cost decrease for the same workload. While these numbers may seem small, the improvement accumulates over the long run, so the absolute savings in time and cost are often significant.

[Figure: Intel vs. AMD (normalized)]
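
In cluster configuration terms, the choice comes down to the node_type_id string; in Azure SKU naming, the extra “a” denotes AMD-based VMs. For example:

    # The Intel vs. AMD choice is expressed via node_type_id.
    # In Azure SKU names, "a" denotes AMD-based VMs and "d" a local disk.
    intel_node = "Standard_E16ds_v5"   # Intel-based
    amd_node = "Standard_E16ads_v5"    # AMD-based counterpart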

Conclusion

Based on the above, we arrive at the following general recommendations when configuring clusters (a combined configuration sketch follows the list):

  • Use SKUs that support Photon Acceleration and enable it

  • Use SKUs that support Disk cache (Delta cache accelerated)

  • Use the latest SKU generations (e.g. v4, v5)

  • Use Intel® based SKUs over AMD SKUs unless there is a specific need to do otherwise
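
Putting these together, a cluster specification along the following lines is a reasonable starting point. This is a sketch under the same assumptions as the earlier API example; the sizing values and runtime version are placeholders to adapt per workload.

    # Sketch: a cluster spec reflecting the recommendations above.
    # Sizing values and runtime version are placeholders.
    recommended_cluster = {
        "cluster_name": "dss-recommended",
        "spark_version": "13.3.x-scala2.12",
        "num_workers": 8,
        "node_type_id": "Standard_E16ds_v5",  # Intel, latest generation, disk cache capable
        "runtime_engine": "PHOTON",           # enable Photon acceleration
        "spark_conf": {
            "spark.databricks.io.cache.enabled": "true"  # disk cache on
        },
    }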