8/13/25, 1:52 AM Tutorial: Build your first machine learning model on Azure Databricks - Azure Databricks | Microsoft Learn

Tutorial: Build your first machine learning model on Azure Databricks
07/18/2025

This article shows you how to build a machine learning classification model using the scikit-
learn library on Azure Databricks.

The goal is to create a classification model that predicts whether a wine is considered “high-quality”.
The dataset consists of 11 features of different wines (for example, alcohol content, acidity, and
residual sugar) and a quality ranking from 1 to 10.

This example also illustrates the use of MLflow to track the model development process, and
Hyperopt to automate hyperparameter tuning.

The dataset is from the UCI Machine Learning Repository, presented in Modeling wine
preferences by data mining from physicochemical properties [Cortez et al., 2009].

Before you begin


Your workspace must be enabled for Unity Catalog. See Get started with Unity Catalog.
You must have permission to create a compute resource or access to a compute resource
that uses Databricks Runtime for Machine Learning.
You must have the USE CATALOG privilege on a catalog.
Within that catalog, you must have the following privileges on a schema: USE SCHEMA,
CREATE TABLE, and CREATE MODEL.

Tip

All of the code in this article is available in a notebook that you can import directly into your
workspace. See Example notebook: Build a classification model.

Step 1: Create a Databricks notebook


To create a notebook in your workspace, click New in the sidebar, and then click Notebook.
A blank notebook opens in the workspace.

https://learn.microsoft.com/en-us/azure/databricks/getting-started/ml-get-started 1/11

To learn more about creating and managing notebooks, see Manage notebooks.

Step 2: Connect to compute resources


To do exploratory data analysis and data engineering, you must have access to compute. The
steps in this article require Databricks Runtime for Machine Learning. For more information and
instructions for selecting an ML version of Databricks Runtime, see Databricks Runtime for
Machine Learning.

In your notebook, click the Connect drop-down menu in the top right. If you have access to an
existing resource that uses Databricks Runtime for Machine Learning, then select that resource
from the menu. Otherwise, click Create new resource... to configure a new compute resource.

Step 3: Set up model registry, catalog, and schema


There are two important steps required before you get started. First, you must configure the
MLflow client to use Unity Catalog as the model registry. Enter the following code into a new cell
in your notebook.

Python

import mlflow
mlflow.set_registry_uri("databricks-uc")

You must also set the catalog and schema where the model will be registered. You must have USE
CATALOG privilege on the catalog, and USE SCHEMA, CREATE TABLE, and CREATE MODEL
privileges on the schema.

For more information about how to use Unity Catalog, see What is Unity Catalog?.

Enter the following code into a new cell in your notebook.

Python

# If necessary, replace "main" and "default" with a catalog and schema for which you
# have the required permissions.
CATALOG_NAME = "main"
SCHEMA_NAME = "default"
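
If you administer the catalog yourself, the privileges listed above can be granted with SQL. The following sketch only builds the GRANT statements as strings. The helper name build_grant_statements and the principal are placeholders chosen for illustration, and actually running the statements (for example with spark.sql) assumes you have the rights to grant privileges on that catalog and schema.

```python
def build_grant_statements(catalog: str, schema: str, principal: str) -> list[str]:
    """Return GRANT statements for the privileges this tutorial requires."""
    # Backticks allow the principal to contain characters such as "@" or "."
    grantee = f"`{principal}`"
    statements = [f"GRANT USE CATALOG ON CATALOG {catalog} TO {grantee}"]
    for privilege in ("USE SCHEMA", "CREATE TABLE", "CREATE MODEL"):
        statements.append(
            f"GRANT {privilege} ON SCHEMA {catalog}.{schema} TO {grantee}"
        )
    return statements


# "someone@example.com" is a hypothetical principal used only for illustration.
grants = build_grant_statements("main", "default", "someone@example.com")
for statement in grants:
    print(statement)
    # In a notebook you could run each one with: spark.sql(statement)
```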


Step 4: Load data and create Unity Catalog tables


This example uses two CSV files that are available in databricks-datasets. To learn how to ingest
your own data, see Standard connectors in Lakeflow Connect.

Enter the following code into a new cell in your notebook. This code does the following:

1. Read data from winequality-white.csv and winequality-red.csv into Spark DataFrames.
2. Clean the data by replacing spaces in column names with underscores.
3. Write the DataFrames to white_wine and red_wine tables in Unity Catalog. Saving the data
to Unity Catalog both persists the data and lets you control how to share it with others.

Python

white_wine = spark.read.csv("/databricks-datasets/wine-quality/winequality-white.csv", sep=';', header=True)
red_wine = spark.read.csv("/databricks-datasets/wine-quality/winequality-red.csv", sep=';', header=True)

# Remove the spaces from the column names
for c in white_wine.columns:
    white_wine = white_wine.withColumnRenamed(c, c.replace(" ", "_"))
for c in red_wine.columns:
    red_wine = red_wine.withColumnRenamed(c, c.replace(" ", "_"))

# Define table names
red_wine_table = f"{CATALOG_NAME}.{SCHEMA_NAME}.red_wine"
white_wine_table = f"{CATALOG_NAME}.{SCHEMA_NAME}.white_wine"

# Write to tables in Unity Catalog, replacing any existing tables with the same names
spark.sql(f"DROP TABLE IF EXISTS {red_wine_table}")
spark.sql(f"DROP TABLE IF EXISTS {white_wine_table}")
white_wine.write.saveAsTable(white_wine_table)
red_wine.write.saveAsTable(red_wine_table)
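
To see what the renaming step does without a Spark session, the following sketch applies the same rule to a small semicolon-separated sample using only the standard library. The sample rows are made up for illustration and contain only a few of the real columns. It also shows that a CSV read without schema inference yields every value as a string, which is why the quality column is cast to an integer later in this tutorial.

```python
import csv
import io

# A made-up two-row sample in the same layout as the wine CSV files:
# semicolon-separated, with spaces inside the quoted header names.
sample = io.StringIO(
    '"fixed acidity";"volatile acidity";"residual sugar";"quality"\n'
    "7.4;0.7;1.9;5\n"
    "7.8;0.88;2.6;7\n"
)

reader = csv.reader(sample, delimiter=";")
header = next(reader)

# Same rule as the withColumnRenamed loops: spaces become underscores.
clean_header = [name.replace(" ", "_") for name in header]

# Every value is still a string at this point, for example '5' rather than 5.
rows = [dict(zip(clean_header, row)) for row in reader]
print(clean_header)
print(rows[0])
```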

Step 5: Preprocess and split the data


In this step, you load the data from the Unity Catalog tables you created in Step 4 into pandas
DataFrames and preprocess the data. The code in this section does the following:

1. Loads the data as pandas DataFrames.
2. Adds a Boolean column to each DataFrame to distinguish red and white wines, and then
combines the DataFrames into a new DataFrame, data_df.


3. Transforms the quality column, which rates wines from 1 to 10 with 10 indicating the
highest quality, into two classification values: “True” for a high-quality wine
(quality >= 7) and “False” for a wine that is not high-quality (quality < 7).
4. Splits the DataFrame into separate train and test datasets.

First, import the required libraries:

Python

import numpy as np
import pandas as pd
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import sklearn.ensemble

import matplotlib.pyplot as plt

from hyperopt import fmin, tpe, hp, SparkTrials, Trials, STATUS_OK
from hyperopt.pyll import scope

Now load and preprocess the data:

Python

# Load data from Unity Catalog as pandas DataFrames
white_wine = spark.read.table(f"{CATALOG_NAME}.{SCHEMA_NAME}.white_wine").toPandas()
red_wine = spark.read.table(f"{CATALOG_NAME}.{SCHEMA_NAME}.red_wine").toPandas()

# Add Boolean fields for red and white wine
white_wine['is_red'] = 0.0
red_wine['is_red'] = 1.0
data_df = pd.concat([white_wine, red_wine], axis=0)

# Define classification labels based on the wine quality
data_labels = data_df['quality'].astype('int') >= 7
data_df = data_df.drop(['quality'], axis=1)

# Split 80/20 train-test
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    data_df,
    data_labels,
    test_size=0.2,
    random_state=1
)
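
The labeling and splitting logic can be illustrated on a handful of rows in plain Python. This is a minimal sketch on made-up quality scores. It uses the standard library's shuffle rather than scikit-learn, so it selects different rows than train_test_split would, and all names prefixed with toy_ are illustrative.

```python
import random

# Made-up quality scores standing in for the real quality column.
toy_qualities = [5, 6, 7, 4, 8, 6, 7, 5, 6, 9]

# Same rule as the tutorial: quality >= 7 counts as high quality.
toy_labels = [q >= 7 for q in toy_qualities]

# Shuffle the row indices with a fixed seed, then hold out 20% for testing.
toy_indices = list(range(len(toy_qualities)))
random.Random(1).shuffle(toy_indices)
n_test = round(0.2 * len(toy_indices))
toy_test_idx, toy_train_idx = toy_indices[:n_test], toy_indices[n_test:]

toy_y_train = [toy_labels[i] for i in toy_train_idx]
toy_y_test = [toy_labels[i] for i in toy_test_idx]
print(len(toy_y_train), len(toy_y_test))  # 8 2
print(sum(toy_labels), "of", len(toy_labels), "rows are labeled high quality")
```

Fixing the seed, as random_state=1 does in the tutorial code, makes the split reproducible across runs.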


Step 6: Train the classification model


This step trains a gradient boosting classifier using the default algorithm settings. It then applies
the resulting model to the test dataset and calculates, logs, and displays the area under the
receiver operating characteristic (ROC) curve to evaluate the model's performance.
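
The area under the ROC curve (AUC) can be read as the probability that the model scores a randomly chosen high-quality wine above a randomly chosen wine that is not high-quality. The following sketch computes it that way in plain Python on made-up scores. The function pairwise_auc and the toy data are illustrative and not part of scikit-learn.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


toy_labels = [False, False, True, True]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. The pairwise loop is quadratic in the number of rows, so it is suitable only for illustration.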

First, enable MLflow autologging:

Python

mlflow.autolog()

Now start the model training run:

Python

with mlflow.start_run(run_name='gradient_boost') as run:
    model = sklearn.ensemble.GradientBoostingClassifier(random_state=0)

    # Models, parameters, and training metrics are tracked automatically
    model.fit(X_train, y_train)

    predicted_probs = model.predict_proba(X_test)
    roc_auc = sklearn.metrics.roc_auc_score(y_test, predicted_probs[:,1])
    roc_curve = sklearn.metrics.RocCurveDisplay.from_estimator(model, X_test, y_test)

    # Save the ROC curve plot to a file
    roc_curve.figure_.savefig("roc_curve.png")

    # The AUC score on test data is not automatically logged, so log it manually
    mlflow.log_metric("test_auc", roc_auc)

    # Log the ROC curve image file as an artifact
    mlflow.log_artifact("roc_curve.png")

    print("Test AUC of: {}".format(roc_auc))

The cell output shows the calculated area under the curve and a plot of the ROC curve.


Step 7: View experiment runs in MLflow


MLflow experiment tracking helps you keep track of model development by logging code and
results as you iteratively develop models.

To view the logged results from the training run you just executed, click the link in the cell output,
as shown in the following image.


The experiment page allows you to compare runs and view details for specific runs. Click the
name of a run to see details such as parameter and metric values for that run. See MLflow
experiment tracking.

You can also view your notebook's experiment runs by clicking the Experiment icon in the
upper right of the notebook. This opens the experiment sidebar, which shows a summary of each
run associated with the notebook experiment, including run parameters and metrics. If necessary,
click the refresh icon to fetch and monitor the latest runs.


Step 8: Use Hyperopt for hyperparameter tuning


An important step in developing an ML model is optimizing the model's accuracy by tuning the
parameters that control the algorithm, called hyperparameters.

Databricks Runtime ML includes Hyperopt, a Python library for hyperparameter tuning. You can
use Hyperopt to run hyperparameter sweeps and train multiple models in parallel, reducing the
time required to optimize model performance. MLflow tracking is integrated with Hyperopt to
automatically log models and parameters. For more information about using Hyperopt in
Databricks, see Hyperparameter tuning.

The following code shows an example of using Hyperopt.

Python

# Define the search space to explore
search_space = {
    'n_estimators': scope.int(hp.quniform('n_estimators', 20, 1000, 1)),
    'learning_rate': hp.loguniform('learning_rate', -3, 0),
    'max_depth': scope.int(hp.quniform('max_depth', 2, 5, 1)),
}

def train_model(params):
    # Enable autologging on each worker
    mlflow.autolog()
    with mlflow.start_run(nested=True):
        model_hp = sklearn.ensemble.GradientBoostingClassifier(
            random_state=0,
            **params
        )
        model_hp.fit(X_train, y_train)
        predicted_probs = model_hp.predict_proba(X_test)
        # Tune based on the test AUC
        # In production, you could use a separate validation set instead
        roc_auc = sklearn.metrics.roc_auc_score(y_test, predicted_probs[:,1])
        mlflow.log_metric('test_auc', roc_auc)

        # Set the loss to -1*auc_score so fmin maximizes the auc_score
        return {'status': STATUS_OK, 'loss': -1*roc_auc}

# SparkTrials distributes the tuning using Spark workers
# Greater parallelism speeds processing, but each hyperparameter trial has less
# information from other trials
# On smaller clusters try setting parallelism=2
spark_trials = SparkTrials(
    parallelism=1
)

with mlflow.start_run(run_name='gb_hyperopt') as run:
    # Use hyperopt to find the parameters yielding the highest AUC
    best_params = fmin(
        fn=train_model,
        space=search_space,
        algo=tpe.suggest,
        max_evals=32,
        trials=spark_trials)
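
The search space above relies on two Hyperopt distributions. hp.quniform draws a uniform value and rounds it to a multiple of q, and hp.loguniform draws exp(uniform(low, high)), so learning_rate ranges from about 0.05 (e^-3) to 1. The following sketch mimics those definitions with the standard library and runs a plain random search over them. It is not TPE and does not call Hyperopt, and toy_score is a hypothetical stand-in for the test AUC.

```python
import math
import random

rng = random.Random(0)


def sample_params():
    """Draw one candidate following the hp.quniform and hp.loguniform definitions."""
    return {
        # quniform(low, high, q): round(uniform(low, high) / q) * q
        "n_estimators": int(round(rng.uniform(20, 1000) / 1) * 1),
        # loguniform(low, high): exp(uniform(low, high)), here about 0.05 to 1
        "learning_rate": math.exp(rng.uniform(-3, 0)),
        "max_depth": int(round(rng.uniform(2, 5) / 1) * 1),
    }


def toy_score(params):
    """Hypothetical stand-in for test AUC; it peaks at learning_rate = 0.1."""
    return 1.0 - abs(math.log10(params["learning_rate"]) + 1.0)


best_loss, best_params = float("inf"), None
for _ in range(32):  # same evaluation budget as max_evals=32
    params = sample_params()
    loss = -1 * toy_score(params)  # minimizing -score maximizes the score
    if loss < best_loss:
        best_loss, best_params = loss, params

print(best_params, -best_loss)
```

Sampling the learning rate on a log scale spends as many trials between 0.05 and 0.1 as between 0.5 and 1, which suits a parameter whose useful values span more than an order of magnitude.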

Step 9: Find the best model and register it to Unity Catalog

The following code identifies the run that produced the best results, as measured by the area
under the ROC curve:

Python

# Sort runs by their test auc. In case of ties, use the most recent run.
best_run = mlflow.search_runs(
    order_by=['metrics.test_auc DESC', 'start_time DESC'],
    max_results=10,
).iloc[0]
print('Best Run')
print('AUC: {}'.format(best_run["metrics.test_auc"]))
print('Num Estimators: {}'.format(best_run["params.n_estimators"]))
print('Max Depth: {}'.format(best_run["params.max_depth"]))
print('Learning Rate: {}'.format(best_run["params.learning_rate"]))
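
The ordering used above (highest test AUC first, most recent run as the tie-breaker) can be sketched on plain records. The run records below are hypothetical. Real ones come from mlflow.search_runs, and a run that never logged test_auc has no value for that metric.

```python
# Hypothetical run records used only for illustration.
toy_runs = [
    {"run_id": "a1", "test_auc": 0.91, "start_time": 100},
    {"run_id": "b2", "test_auc": 0.93, "start_time": 200},
    {"run_id": "c3", "test_auc": 0.93, "start_time": 300},
    {"run_id": "d4", "test_auc": None, "start_time": 400},  # metric never logged
]

# Ignore runs without the metric, then pick the highest AUC,
# breaking ties in favor of the most recent start time.
scored = [r for r in toy_runs if r["test_auc"] is not None]
toy_best = max(scored, key=lambda r: (r["test_auc"], r["start_time"]))

# The model URI follows the same "runs:/<run_id>/model" pattern as the tutorial.
toy_model_uri = "runs:/{run_id}/model".format(run_id=toy_best["run_id"])
print(toy_model_uri)  # runs:/c3/model
```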

Using the run_id that you identified for the best model, the following code registers that model
to Unity Catalog.

Python

model_uri = 'runs:/{run_id}/model'.format(
    run_id=best_run.run_id
)

mlflow.register_model(model_uri, f"{CATALOG_NAME}.{SCHEMA_NAME}.wine_quality_model")

Example notebook: Build a classification model


Use the following notebook to perform the steps in this article. For instructions on importing a
notebook to an Azure Databricks workspace, see Import a notebook.

Build your first machine learning model with Databricks


Get notebook

Learn more
Databricks provides a single platform that serves every step of ML development and deployment,
from raw data to inference tables that save every request and response for a served model. Data
scientists, data engineers, ML engineers, and DevOps can do their jobs using the same set of tools
and a single source of truth for the data.

To learn more, see the following:

Machine learning and AI tutorials


Overview of machine learning and AI on Databricks
Overview of training machine learning and AI models on Databricks
MLflow for ML model lifecycle


Related resources
scikit-learn
