8/13/25, 1:52 AM Tutorial: Build your first machine learning model on Azure Databricks - Azure Databricks | Microsoft Learn

Tutorial: Build your first machine learning model on Azure Databricks
07/18/2025

This article shows you how to build a machine learning classification model using the scikit-
learn library on Azure Databricks.

The goal is to create a classification model to predict whether a wine is considered “high-quality”.
The dataset consists of 11 features of different wines (for example, alcohol content, acidity, and
residual sugar) and a quality ranking between 1 and 10.

This example also illustrates the use of MLflow to track the model development process, and
Hyperopt to automate hyperparameter tuning.

The dataset is from the UCI Machine Learning Repository, presented in Modeling wine
preferences by data mining from physicochemical properties [Cortez et al., 2009].

Before you begin

- Your workspace must be enabled for Unity Catalog. See Get started with Unity Catalog.
- You must have permission to create a compute resource or access to a compute resource
  that uses Databricks Runtime for Machine Learning.
- You must have the USE CATALOG privilege on a catalog.
- Within that catalog, you must have the following privileges on a schema: USE SCHEMA,
  CREATE TABLE, and CREATE MODEL.

Tip

All of the code in this article is available in a notebook that you can import directly into your
workspace. See Example notebook: Build a classification model.

Step 1: Create a Databricks notebook

To create a notebook in your workspace, click New in the sidebar, and then click Notebook.
A blank notebook opens in the workspace.

https://learn.microsoft.com/en-us/azure/databricks/getting-started/ml-get-started 1/11

To learn more about creating and managing notebooks, see Manage notebooks.

Step 2: Connect to compute resources

To do exploratory data analysis and data engineering, you must have access to compute. The
steps in this article require Databricks Runtime for Machine Learning. For more information and
instructions for selecting an ML version of Databricks Runtime, see Databricks Runtime for
Machine Learning.

In your notebook, click the Connect drop-down menu in the top right. If you have access to an
existing resource that uses Databricks Runtime for Machine Learning, then select that resource
from the menu. Otherwise, click Create new resource... to configure a new compute resource.

Step 3: Set up model registry, catalog, and schema

There are two important steps required before you get started. First, you must configure the
MLflow client to use Unity Catalog as the model registry. Enter the following code into a new cell
in your notebook.

Python

import mlflow
mlflow.set_registry_uri("databricks-uc")

You must also set the catalog and schema where the model will be registered. You must have USE
CATALOG privilege on the catalog, and USE SCHEMA, CREATE TABLE, and CREATE MODEL
privileges on the schema.

For more information about how to use Unity Catalog, see What is Unity Catalog?.

Enter the following code into a new cell in your notebook.

Python

# If necessary, replace "main" and "default" with a catalog and schema for which you
# have the required permissions.
CATALOG_NAME = "main"
SCHEMA_NAME = "default"
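The catalog and schema set here are later combined with an object name to form three-level names such as `main.default.white_wine`. The following is a minimal sketch of that naming pattern in plain Python. The helper `uc_name` is hypothetical (it is not part of MLflow or Databricks), and its check for empty or dotted parts is a simplifying assumption, not a rule taken from the tutorial.

```python
def uc_name(catalog: str, schema: str, obj: str) -> str:
    """Join catalog, schema, and object name into one dotted name.

    Hypothetical helper for illustration only. Rejecting empty parts and
    parts that contain a dot is a simplifying assumption of this sketch.
    """
    parts = (catalog, schema, obj)
    for part in parts:
        if not part or "." in part:
            raise ValueError(f"invalid name part: {part!r}")
    return ".".join(parts)


CATALOG_NAME = "main"
SCHEMA_NAME = "default"

# The same pattern is used for tables and for the registered model.
white_wine_table = uc_name(CATALOG_NAME, SCHEMA_NAME, "white_wine")
model_name = uc_name(CATALOG_NAME, SCHEMA_NAME, "wine_quality_model")

print(white_wine_table)
print(model_name)
```

Building the names in one place makes it harder for a table to be written under one name and read back under another.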


Step 4: Load data and create Unity Catalog tables

This example uses two CSV files that are available in databricks-datasets. To learn how to ingest
your own data, see Standard connectors in Lakeflow Connect.

Enter the following code into a new cell in your notebook. This code does the following:

1. Read data from winequality-white.csv and winequality-red.csv into Spark DataFrames.

2. Clean the data by replacing spaces in column names with underscores.
3. Write the DataFrames to white_wine and red_wine tables in Unity Catalog. Saving the data
to Unity Catalog both persists the data and lets you control how to share it with others.

Python

white_wine = spark.read.csv(
    "/databricks-datasets/wine-quality/winequality-white.csv", sep=';', header=True
)
red_wine = spark.read.csv(
    "/databricks-datasets/wine-quality/winequality-red.csv", sep=';', header=True
)

# Remove the spaces from the column names
for c in white_wine.columns:
    white_wine = white_wine.withColumnRenamed(c, c.replace(" ", "_"))
for c in red_wine.columns:
    red_wine = red_wine.withColumnRenamed(c, c.replace(" ", "_"))

# Define table names
red_wine_table = f"{CATALOG_NAME}.{SCHEMA_NAME}.red_wine"
white_wine_table = f"{CATALOG_NAME}.{SCHEMA_NAME}.white_wine"

# Write to tables in Unity Catalog
spark.sql(f"DROP TABLE IF EXISTS {red_wine_table}")
spark.sql(f"DROP TABLE IF EXISTS {white_wine_table}")
white_wine.write.saveAsTable(white_wine_table)
red_wine.write.saveAsTable(red_wine_table)
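The column-cleaning step can be checked without a cluster. The sketch below applies the same space-to-underscore replacement to a plain list of names. The function `clean_column_names` and the sample header values are illustrative assumptions; the real CSV files contain more columns than are shown here.

```python
def clean_column_names(columns):
    """Return the names with every space replaced by an underscore."""
    return [name.replace(" ", "_") for name in columns]


# Illustrative header values only; not the full header of the CSV files.
raw_columns = ["fixed acidity", "residual sugar", "alcohol", "quality"]
cleaned = clean_column_names(raw_columns)

print(cleaned)
```

Names that contain no spaces pass through unchanged, so the replacement is safe to apply to every column.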

Step 5: Preprocess and split the data

In this step, you load the data from the Unity Catalog tables you created in Step 4 into Pandas
DataFrames and preprocess the data. The code in this section does the following:

1. Loads the data as Pandas DataFrames.

2. Adds a Boolean column to each DataFrame to distinguish red and white wines, and then
combines the DataFrames into a new DataFrame, data_df.

3. Transforms the quality column, which rates wines from 1 to 10 with 10 indicating the
highest quality, into two classification values: “True” to indicate a high-quality wine
(quality >= 7) and “False” to indicate a wine that is not high-quality (quality < 7).
4. Splits the DataFrame into separate train and test datasets.

First, import the required libraries:

Python

import numpy as np
import pandas as pd
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import sklearn.ensemble

import matplotlib.pyplot as plt

from hyperopt import fmin, tpe, hp, SparkTrials, Trials, STATUS_OK

from hyperopt.pyll import scope

Now load and preprocess the data:

Python

# Load data from Unity Catalog as Pandas dataframes

white_wine = spark.read.table(f"{CATALOG_NAME}.{SCHEMA_NAME}.white_wine").toPandas()
red_wine = spark.read.table(f"{CATALOG_NAME}.{SCHEMA_NAME}.red_wine").toPandas()

# Add Boolean fields for red and white wine

white_wine['is_red'] = 0.0
red_wine['is_red'] = 1.0
data_df = pd.concat([white_wine, red_wine], axis=0)

# Define classification labels based on the wine quality

data_labels = data_df['quality'].astype('int') >= 7
data_df = data_df.drop(['quality'], axis=1)

# Split 80/20 train-test

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    data_df,
    data_labels,
    test_size=0.2,
    random_state=1
)
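To make the labeling rule and the 80/20 split concrete, here is a small standard-library sketch. The quality scores are made up, and `split_indices` is a hypothetical stand-in that shuffles with a fixed seed. It is not scikit-learn's algorithm and does not reproduce the rows that `train_test_split` selects.

```python
import random

# Hypothetical quality scores; the real values come from the quality column.
qualities = [5, 6, 7, 4, 8, 6, 7, 5, 6, 9]

# A wine is labeled high-quality when its score is 7 or higher.
labels = [q >= 7 for q in qualities]
positive_rate = sum(labels) / len(labels)


def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(len(qualities), test_size=0.2, seed=1)

print(labels)
print(positive_rate, len(train_idx), len(test_idx))
```

Checking the share of positive labels before training is useful because a threshold of 7 typically leaves the high-quality class in the minority, which is one reason to evaluate with AUC and not plain accuracy.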


Step 6: Train the classification model

This step trains a gradient boosting classifier using the default algorithm settings. It then applies
the resulting model to the test dataset and calculates, logs, and displays the area under the
receiver operating characteristic (ROC) curve to evaluate the model's performance.
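As background on the metric, the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below computes it that way on made-up labels and scores. The `roc_auc` function is written for illustration and is not the scikit-learn implementation, although it should agree with `sklearn.metrics.roc_auc_score` for binary labels.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    A tie between a positive score and a negative score counts as half.
    """
    positives = [s for y, s in zip(y_true, scores) if y]
    negatives = [s for y, s in zip(y_true, scores) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities for the positive class.
y_true = [True, False, True, False, False]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
auc = roc_auc(y_true, scores)

print(auc)  # 5 of the 6 pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the tuning step later in the tutorial tries to push this number up.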

First, enable MLflow autologging:

Python

mlflow.autolog()

Now start the model training run:

Python

with mlflow.start_run(run_name='gradient_boost') as run:
    model = sklearn.ensemble.GradientBoostingClassifier(random_state=0)

    # Models, parameters, and training metrics are tracked automatically
    model.fit(X_train, y_train)

    predicted_probs = model.predict_proba(X_test)
    roc_auc = sklearn.metrics.roc_auc_score(y_test, predicted_probs[:,1])
    roc_curve = sklearn.metrics.RocCurveDisplay.from_estimator(model, X_test, y_test)

    # Save the ROC curve plot to a file
    roc_curve.figure_.savefig("roc_curve.png")

    # The AUC score on test data is not automatically logged, so log it manually
    mlflow.log_metric("test_auc", roc_auc)

    # Log the ROC curve image file as an artifact
    mlflow.log_artifact("roc_curve.png")

    print("Test AUC of: {}".format(roc_auc))

The cell results show the calculated area under the curve and a plot of the ROC curve:


Step 7: View experiment runs in MLflow

MLflow experiment tracking helps you keep track of model development by logging code and
results as you iteratively develop models.

To view the logged results from the training run you just executed, click the link in the cell output.


The experiment page allows you to compare runs and view details for specific runs. Click the
name of a run to see details such as parameter and metric values for that run. See MLflow
experiment tracking.

You can also view your notebook's experiment runs by clicking the Experiment icon in the
upper right of the notebook. This opens the experiment sidebar, which shows a summary of each
run associated with the notebook experiment, including run parameters and metrics. If necessary,
click the refresh icon to fetch and monitor the latest runs.


Step 8: Use Hyperopt for hyperparameter tuning

An important step in developing an ML model is optimizing the model's accuracy by tuning the
parameters that control the algorithm, called hyperparameters.

Databricks Runtime ML includes Hyperopt, a Python library for hyperparameter tuning. You can
use Hyperopt to run hyperparameter sweeps and train multiple models in parallel, reducing the
time required to optimize model performance. MLflow tracking is integrated with Hyperopt to
automatically log models and parameters. For more information about using Hyperopt in
Databricks, see Hyperparameter tuning.

The following code shows an example of using Hyperopt.

Python

# Define the search space to explore
search_space = {
    'n_estimators': scope.int(hp.quniform('n_estimators', 20, 1000, 1)),
    'learning_rate': hp.loguniform('learning_rate', -3, 0),
    'max_depth': scope.int(hp.quniform('max_depth', 2, 5, 1)),
}

def train_model(params):
    # Enable autologging on each worker
    mlflow.autolog()
    with mlflow.start_run(nested=True):
        model_hp = sklearn.ensemble.GradientBoostingClassifier(
            random_state=0,
            **params
        )
        model_hp.fit(X_train, y_train)
        predicted_probs = model_hp.predict_proba(X_test)
        # Tune based on the test AUC
        # In production, you could use a separate validation set instead
        roc_auc = sklearn.metrics.roc_auc_score(y_test, predicted_probs[:,1])
        mlflow.log_metric('test_auc', roc_auc)

        # Set the loss to -1*auc_score so fmin maximizes the auc_score
        return {'status': STATUS_OK, 'loss': -1*roc_auc}

# SparkTrials distributes the tuning using Spark workers
# Greater parallelism speeds processing, but each hyperparameter trial has less
# information from other trials
# On smaller clusters try setting parallelism=2
spark_trials = SparkTrials(
    parallelism=1
)

with mlflow.start_run(run_name='gb_hyperopt') as run:
    # Use hyperopt to find the parameters yielding the highest AUC
    best_params = fmin(
        fn=train_model,
        space=search_space,
        algo=tpe.suggest,
        max_evals=32,
        trials=spark_trials)
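The search space and the negated loss can be explored without Hyperopt. The sketch below uses only the standard library and follows the definitions in the Hyperopt documentation: `hp.quniform` draws `round(uniform(low, high) / q) * q` and `hp.loguniform` draws `exp(uniform(low, high))`. The local `quniform` and `loguniform` functions are stand-ins, not Hyperopt calls. `toy_score` is a made-up replacement for the test AUC, and the loop is plain random search, not the TPE algorithm used above.

```python
import math
import random

rng = random.Random(0)


def quniform(low, high, q):
    """Stand-in for hp.quniform: round(uniform(low, high) / q) * q."""
    return round(rng.uniform(low, high) / q) * q


def loguniform(low, high):
    """Stand-in for hp.loguniform: exp(uniform(low, high))."""
    return math.exp(rng.uniform(low, high))


def sample_params():
    """Draw one point from a space shaped like the tutorial's search_space."""
    return {
        "n_estimators": int(quniform(20, 1000, 1)),
        "learning_rate": loguniform(-3, 0),  # between e**-3 (about 0.05) and 1
        "max_depth": int(quniform(2, 5, 1)),
    }


def toy_score(params):
    """Made-up stand-in for a test AUC; it peaks at learning_rate = 0.1."""
    return 1.0 - abs(math.log10(params["learning_rate"]) + 1.0) / 2.0


samples = [sample_params() for _ in range(100)]

# A minimizer sees the negated score, so the lowest loss is the highest score.
trials = [{"params": p, "loss": -toy_score(p)} for p in samples]
best = min(trials, key=lambda t: t["loss"])

print(best["params"], -best["loss"])
```

Because the bounds of `loguniform` are exponents, `-3` and `0` give learning rates spread evenly on a log scale between roughly 0.05 and 1, not between -3 and 0.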

Step 9: Find the best model and register it to Unity Catalog

The following code identifies the run that produced the best results, as measured by the area
under the ROC curve:

Python

# Sort runs by their test auc. In case of ties, use the most recent run.
best_run = mlflow.search_runs(
    order_by=['metrics.test_auc DESC', 'start_time DESC'],
    max_results=10,
).iloc[0]
print('Best Run')
print('AUC: {}'.format(best_run["metrics.test_auc"]))
print('Num Estimators: {}'.format(best_run["params.n_estimators"]))
print('Max Depth: {}'.format(best_run["params.max_depth"]))
print('Learning Rate: {}'.format(best_run["params.learning_rate"]))
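The ordering used to pick the best run can be illustrated on plain dictionaries. In this sketch the run records, their IDs, and their timestamps are made up, and dropping runs that have no `test_auc` value is a choice of the sketch, not documented MLflow behavior. It sorts by AUC descending, breaks ties by the most recent start time, and then builds a model URI in the same `runs:/<run_id>/model` form that the tutorial uses.

```python
# Made-up run records standing in for rows returned by a run search.
runs = [
    {"run_id": "a1", "test_auc": 0.91, "start_time": 100},
    {"run_id": "b2", "test_auc": 0.93, "start_time": 110},
    {"run_id": "c3", "test_auc": 0.93, "start_time": 120},
    {"run_id": "d4", "test_auc": None, "start_time": 130},  # no metric logged
]

# Sketch choice: ignore runs that never logged the metric.
scored = [r for r in runs if r["test_auc"] is not None]

# Highest AUC first; among equal AUCs, the most recent run first.
ranked = sorted(
    scored,
    key=lambda r: (r["test_auc"], r["start_time"]),
    reverse=True,
)
best_run = ranked[0]

model_uri = "runs:/{run_id}/model".format(run_id=best_run["run_id"])
print(best_run["run_id"], model_uri)
```

Runs `b2` and `c3` tie on AUC, so the later start time decides and `c3` is selected.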

Using the run_id that you identified for the best model, the following code registers that model
to Unity Catalog.

Python

model_uri = 'runs:/{run_id}/model'.format(
    run_id=best_run.run_id
)

mlflow.register_model(model_uri, f"{CATALOG_NAME}.{SCHEMA_NAME}.wine_quality_model")

Example notebook: Build a classification model

Use the following notebook to perform the steps in this article. For instructions on importing a
notebook to an Azure Databricks workspace, see Import a notebook.

Build your first machine learning model with Databricks


Learn more
Databricks provides a single platform that serves every step of ML development and deployment,
from raw data to inference tables that save every request and response for a served model. Data
scientists, data engineers, ML engineers, and DevOps can do their jobs using the same set of tools
and a single source of truth for the data.

To learn more, see the following:

Machine learning and AI tutorials

Overview of machine learning and AI on Databricks
Overview of training machine learning and AI models on Databricks
MLflow for ML model lifecycle


Related resources
scikit-learn
