Practical Machine Learning
Pipelines with MLlib
Joseph K. Bradley
March 18, 2015
Spark Summit East 2015
About Spark MLlib
Started in UC Berkeley AMPLab
• Shipped with Spark 0.8
Currently (Spark 1.3)
• Contributions from 50+ orgs, 100+ individuals
• Good coverage of algorithms: classification, regression, clustering, recommendation, feature extraction and selection, frequent itemsets, statistics, linear algebra
MLlib's Mission
How can we move beyond this list of algorithms and help users develop real ML workflows?
MLlib's mission is to make practical machine learning easy and scalable.
• Capable of learning from large-scale datasets
• Easy to build machine learning applications
Outline
ML workflows
Pipelines
Roadmap
Example: Text Classification
Goal: Given a text document, predict its topic.
Features (text, image, vector, ...):
    Subject: Re: Lexan Polish?
    Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is.
    McQuires will do something...
Label (CTR, inches of rainfall, ...):
    1: about science
    0: not about science
Dataset: “20 Newsgroups” from the UCI KDD Archive
Training & Testing
Training
Given labeled data: RDD of (features, label)
    Subject: Re: Lexan Polish? Suggest McQuires #1 plastic polish. It will help... (Label 0)
    Subject: RIPEM FAQ. RIPEM is a program which performs Privacy Enhanced... (Label 1)
    ...
Learn a model.
Testing/Production
Given new unlabeled data: RDD of features
    Subject: Apollo Training. The Apollo astronauts also trained at (in) Meteor... (Label 1)
    Subject: A demo of Nonsense. How can you lie about something that no one... (Label 0)
Use model to make predictions.
Example ML Workflow
Training: Load data -> Extract features -> Train model -> Evaluate
Data at each step: labels + plain text -> labels + feature vectors -> labels + predictions
Explicitly unzip & zip RDDs
labels.zip(predictions).map {
  case (label, prediction) => if (label == prediction) ...
}
val features: RDD[Vector]
val predictions: RDD[Double]
Create many RDDs
val labels: RDD[Double] =
data.map(_.label)
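As a rough illustration of this bookkeeping, the plain-Python sketch below keeps labels and predictions in two separate, parallel lists (stand-ins for the separate RDDs) and zips them back together to compute accuracy. The data values and the `accuracy` helper are made up for illustration; nothing here uses Spark.

```python
# Toy stand-ins for RDD[Double]: two parallel lists that must stay aligned.
labels = [1.0, 0.0, 1.0, 0.0]
predictions = [1.0, 0.0, 0.0, 0.0]


def accuracy(labels, predictions):
    """Fraction of positions where label and prediction agree."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    pairs = list(zip(labels, predictions))  # the explicit "zip" step
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)


print(accuracy(labels, predictions))
```

Every extra intermediate result (features, predictions, scores) adds another collection that has to be kept in the same order by hand.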
Pain point
Example ML Workflow
Pain point: Write as a script
• Not modular
• Difficult to re-use workflow
Training: Load data -> Extract features -> Train model -> Evaluate
Data at each step: labels + plain text -> labels + feature vectors -> labels + predictions
Example ML Workflow
Training: Load data -> Extract features -> Train model -> Evaluate
Data at each step: labels + plain text -> labels + feature vectors -> labels + predictions
Testing/Production: Load new data -> Extract features -> Predict using model -> Act on predictions
Data at each step: plain text -> feature vectors -> predictions
Almost identical workflow
Example ML Workflow
Training: Load data -> Extract features -> Train model -> Evaluate
Data at each step: labels + plain text -> labels + feature vectors -> labels + predictions
Pain point: Parameter tuning
• Key part of ML
• Involves training many models
• For different splits of the data
• For different sets of parameters
Pain Points
Create & handle many RDDs and data types
Write as a script
Tune parameters
Enter...
Pipelines! in Spark 1.2 & 1.3
Key Concepts
DataFrame: The ML Dataset
Abstractions: Transformers, Estimators, & Evaluators
Parameters: API & tuning
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
Named columns with types
label: Double
text: String
words: Seq[String]
features: Vector
prediction: Double

label   text          words               features
0       This is ...   [“This”, “is”, …]   [0.5, 1.2, …]
0       When we ...   [“When”, ...]       [1.9, -0.8, …]
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
Named columns with types
Domain-Specific Language
# Select science articles
sciDocs = data.filter(data["label"] == 1)
# Scale labels
data["label"] * 0.5
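To make the idea of named columns concrete without a Spark installation, here is a minimal plain-Python sketch of the same two operations. It assumes a toy "DataFrame" represented as a list of dicts keyed by column name; the rows are invented and this is not the Spark DataFrame API.

```python
# Toy "DataFrame": a list of rows, each row a dict keyed by column name.
data = [
    {"label": 1.0, "text": "RIPEM is a program ..."},
    {"label": 0.0, "text": "Suggest McQuires #1 plastic polish ..."},
]

# Select science articles (rows whose label column equals 1).
sci_docs = [row for row in data if row["label"] == 1.0]

# Scale labels: compute a derived value from a named column.
scaled_labels = [row["label"] * 0.5 for row in data]

print(len(sci_docs), scaled_labels)
```

Because columns are addressed by name, the code does not depend on the position of a field inside a tuple.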
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
• Shipped with Spark 1.3
• APIs for Python, Java & Scala (+R in dev)
• Integration with Spark SQL
• Data import/export
• Internal optimizations
Named columns with types, Domain-Specific Language
Pain point: Create & handle many RDDs and data types
BIG data
Abstractions
Training: Load data -> Extract features -> Train model -> Evaluate
Abstraction: Transformer
Training: Extract features -> Train model -> Evaluate
def transform(DataFrame): DataFrame
Input columns: label: Double, text: String
Output columns: label: Double, text: String, features: Vector
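A minimal sketch of the Transformer contract, assuming a toy dataset represented as a list of dicts. The `LengthFeaturizer` class and its two features (word count and character count) are hypothetical and chosen only to keep the example short; it is not one of MLlib's feature transformers.

```python
class LengthFeaturizer:
    """Hypothetical Transformer: reads `text`, appends a `features` column."""

    def transform(self, rows):
        out = []
        for row in rows:
            words = row["text"].split()
            features = [float(len(words)), float(len(row["text"]))]
            # Copy the row so the input dataset is left unchanged.
            out.append(dict(row, features=features))
        return out


rows = [{"label": 0.0, "text": "This is a test"}]
result = LengthFeaturizer().transform(rows)
print(result[0]["features"])
```

The input columns pass through untouched and one new column is appended, which mirrors the schema change shown on the slide.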
Abstraction: Estimator
Training: Extract features -> Train model -> Evaluate
Input columns: label: Double, text: String, features: Vector
LogisticRegression -> Model
def fit(DataFrame): Model
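The Estimator contract can be sketched in the same toy setting. `ThresholdEstimator` is a hypothetical stand-in for a real learner such as LogisticRegression: it learns a single cut-off on the first feature and returns a fitted model object. The training rows are invented.

```python
class ThresholdModel:
    """Fitted model: adds a `prediction` column based on a learned cut-off."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [
            dict(row, prediction=1.0 if row["features"][0] > self.threshold else 0.0)
            for row in rows
        ]


class ThresholdEstimator:
    """Hypothetical Estimator: fit() learns from labeled rows, returns a model."""

    def fit(self, rows):
        pos = [r["features"][0] for r in rows if r["label"] == 1.0]
        neg = [r["features"][0] for r in rows if r["label"] == 0.0]
        # Cut-off is the midpoint between the two class means.
        threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return ThresholdModel(threshold)


train = [
    {"label": 0.0, "features": [1.0]},
    {"label": 0.0, "features": [2.0]},
    {"label": 1.0, "features": [5.0]},
    {"label": 1.0, "features": [6.0]},
]
model = ThresholdEstimator().fit(train)
print(model.threshold)
print(model.transform([{"features": [4.0]}])[0]["prediction"])
```

In this sketch the estimator holds no learned state itself; everything learned lives in the returned model, so the same estimator can be fit again on different data.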
Abstraction: Evaluator
Training: Extract features -> Train model -> Evaluate
Input columns: label: Double, text: String, features: Vector, prediction: Double
Metric: accuracy, AUC, MSE, ...
def evaluate(DataFrame): Double
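A minimal Evaluator sketch, again over a list of dicts. `AccuracyEvaluator` is a hypothetical name and accuracy is just one of the metrics listed above; the scored rows are invented.

```python
class AccuracyEvaluator:
    """Hypothetical Evaluator: compares the `label` and `prediction` columns."""

    def evaluate(self, rows):
        if not rows:
            raise ValueError("cannot evaluate an empty dataset")
        correct = sum(1 for row in rows if row["label"] == row["prediction"])
        return correct / len(rows)


scored = [
    {"label": 1.0, "prediction": 1.0},
    {"label": 0.0, "prediction": 1.0},
    {"label": 0.0, "prediction": 0.0},
    {"label": 1.0, "prediction": 1.0},
]
print(AccuracyEvaluator().evaluate(scored))
```

Reducing a whole dataset to one number is what lets a tuning procedure compare candidate models automatically.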
Abstraction: Model
Model is a type of Transformer
def transform(DataFrame): DataFrame
Testing/Production: Extract features -> Predict using model -> Act on predictions
Input columns: text: String, features: Vector
Output columns: text: String, features: Vector, prediction: Double
(Recall) Abstraction: Estimator
Training: Load data -> Extract features -> Train model -> Evaluate
Input columns: label: Double, text: String, features: Vector
LogisticRegression -> Model
def fit(DataFrame): Model
Abstraction: Pipeline
Training: Load data -> Extract features -> Train model -> Evaluate
Input columns: label: Double, text: String
Output: PipelineModel
Pipeline is a type of Estimator
def fit(DataFrame): Model
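One way to picture how a Pipeline chains stages is the toy sketch below. It assumes duck typing: a stage with a `fit` method is treated as an Estimator, anything else as a Transformer. All class names, the single word-count feature, and the training rows are hypothetical; this is not Spark's implementation.

```python
class Pipeline:
    """Toy Pipeline: an Estimator built from an ordered list of stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: fit it, keep its model
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)   # output feeds the next stage
        return PipelineModel(fitted)


class PipelineModel:
    """Toy PipelineModel: a Transformer that applies every fitted stage."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class WordCountFeaturizer:
    """Transformer stage: `text` -> `features` (word count only)."""

    def transform(self, rows):
        return [dict(r, features=[float(len(r["text"].split()))]) for r in rows]


class CutoffModel:
    """Fitted stage: `features` -> `prediction`."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def transform(self, rows):
        return [
            dict(r, prediction=1.0 if r["features"][0] > self.cutoff else 0.0)
            for r in rows
        ]


class MeanCutoff:
    """Estimator stage: uses the mean feature value as the cut-off."""

    def fit(self, rows):
        return CutoffModel(sum(r["features"][0] for r in rows) / len(rows))


train = [
    {"label": 0.0, "text": "short note"},
    {"label": 1.0, "text": "a much longer note about space science"},
]
model = Pipeline([WordCountFeaturizer(), MeanCutoff()]).fit(train)
print(model.transform([{"text": "one two three four five six"}])[0]["prediction"])
```

The fitted object takes raw text rows and returns rows with a `prediction` column, so the feature extraction written for training is reused unchanged at prediction time.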
Abstraction: PipelineModel
PipelineModel is a type of Transformer
def transform(DataFrame): DataFrame
Testing/Production: Load data -> Extract features -> Predict using model -> Act on predictions
Input columns: text: String
Output columns: text: String, features: Vector, prediction: Double
Abstractions: Summary
Legend: DataFrame, Transformer, Estimator, Evaluator
Training: Load data (DataFrame) -> Extract features (Transformer) -> Train model (Estimator) -> Evaluate (Evaluator)
Testing: Load data (DataFrame) -> Extract features (Transformer) -> Predict using model (Transformer) -> Evaluate (Evaluator)
Demo
Legend: DataFrame, Transformer, Estimator, Evaluator
Training: Load data -> Tokenizer -> HashingTF -> LogisticRegression -> BinaryClassificationEvaluator
Current data schema:
label: Double
text: String
words: Seq[String]
features: Vector
prediction: Double
Demo
Legend: DataFrame, Transformer, Estimator, Evaluator
Training: Load data -> Tokenizer -> HashingTF -> LogisticRegression -> BinaryClassificationEvaluator
Pain point: Write as a script
Parameters
Standard API
• Typed
• Defaults
• Built-in doc
• Autocomplete
> hashingTF.numFeatures
org.apache.spark.ml.param.IntParam =
numFeatures: number of features
(default: 262144)
> hashingTF.setNumFeatures(1000)
> hashingTF.getNumFeatures
Parameter Tuning
Given:
• Estimator
• Parameter grid
• Evaluator
Find best parameters
lr.regParam {0.01, 0.1, 0.5}
hashingTF.numFeatures {100, 1000, 10000}
Estimator: Tokenizer -> HashingTF -> LogisticRegression
Evaluator: BinaryClassificationEvaluator
CrossValidator
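The search that a cross-validator performs can be sketched in plain Python as a grid expansion plus k-fold scoring. The helper names (`param_grid`, `k_folds`, `cross_validate`), the `CutoffClassifier` toy estimator, and the data are all hypothetical; this shows the general technique, not Spark's CrossValidator.

```python
from itertools import product


def param_grid(grid):
    """Expand {"name": [values]} into a list of {"name": value} settings."""
    names = sorted(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


def k_folds(rows, k):
    """Yield (train, validation) splits; row i goes to fold i % k."""
    for fold in range(k):
        train = [r for i, r in enumerate(rows) if i % k != fold]
        valid = [r for i, r in enumerate(rows) if i % k == fold]
        yield train, valid


def cross_validate(make_estimator, grid, evaluate, rows, k=3):
    """Return (best_params, best_score) by mean validation score."""
    best = None
    for params in param_grid(grid):
        scores = []
        for train, valid in k_folds(rows, k):
            model = make_estimator(**params).fit(train)
            scores.append(evaluate(model.transform(valid)))
        mean = sum(scores) / len(scores)
        if best is None or mean > best[1]:
            best = (params, mean)
    return best


class CutoffClassifier:
    """Toy estimator whose only parameter is a fixed cut-off on column `x`."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def fit(self, rows):
        return self  # nothing to learn in this toy; the model is the estimator

    def transform(self, rows):
        return [dict(r, prediction=1.0 if r["x"] > self.cutoff else 0.0) for r in rows]


def accuracy(rows):
    return sum(r["label"] == r["prediction"] for r in rows) / len(rows)


data = [{"x": float(i), "label": 1.0 if i >= 5 else 0.0} for i in range(10)]
best_params, best_score = cross_validate(
    CutoffClassifier, {"cutoff": [1.5, 4.5, 7.5]}, accuracy, data
)
print(best_params, best_score)

# The grid on the slide (3 values x 3 values) expands to 9 settings.
slide_grid = {"regParam": [0.01, 0.1, 0.5], "numFeatures": [100, 1000, 10000]}
print(len(param_grid(slide_grid)))
```

With k folds and a grid of n settings this trains k times n models, which is why tuning is called out as a pain point when done by hand.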
Parameter Tuning
Pain point: Tune parameters
Pipelines: Recap
Pain points and what addresses them:
Create & handle many RDDs and data types -> DataFrame
Write as a script -> Abstractions
Tune parameters -> Parameter API
Also
• Python, Scala, Java APIs
• Schema validation
• User-Defined Types*
• Feature metadata*
• Multi-model training optimizations*
* Groundwork done; full support WIP.
Inspirations
scikit-learn
    + Spark DataFrame, Param API
MLBase (Berkeley AMPLab)
    Ongoing collaborations
Roadmap
spark.mllib: Primary ML package
spark.ml: High-level Pipelines API for algorithms in spark.mllib (experimental in Spark 1.2-1.3)
Near future
• Feature attributes
• Feature transformers
• More algorithms under Pipeline API
Farther ahead
• Ideas from AMPLab MLBase (auto-tuning models)
• SparkR integration
Thank you!
Outline
• ML workflows
• Pipelines
• DataFrame
• Abstractions
• Parameter tuning
• Roadmap
Spark documentation
    http://spark.apache.org/
Pipelines blog post
    https://databricks.com/blog/2015/01/07
