Simpler Machine Learning with SKLL

Dan Blanchard
Educational Testing Service
dblanchard@ets.org

PyData NYC 2013
Survived
  first class, female, 1 sibling, 35 years old

Perished
  third class, female, 2 siblings, 18 years old
  second class, male, 0 siblings, 50 years old

Can we predict survival from data?
SciKit-Learn Laboratory (SKLL)

It's where the learning happens.
Learning to Predict Survival

1. Split up the given training set: train (80%) and dev (20%)

$ ./make_titanic_example_data.py

Creating titanic/train directory
Creating titanic/dev directory
Creating titanic/test directory
Loading train.csv............done
Loading test.csv........done
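A minimal sketch of what such an 80/20 split could look like with pandas (hypothetical; the real splitting logic, including the separate per-feature files, lives in make_titanic_example_data.py in the SKLL examples directory, and the output file names below are made up):

import pandas as pd

# Load the Kaggle training file and shuffle it (fixed seed for repeatability)
df = pd.read_csv('train.csv').sample(frac=1, random_state=42)

# First 80% becomes the training split, the rest the dev split
cutoff = int(len(df) * 0.8)
df.iloc[:cutoff].to_csv('titanic/train/all.csv', index=False)  # hypothetical file name
df.iloc[cutoff:].to_csv('titanic/dev/all.csv', index=False)    # hypothetical file name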
Learning to Predict Survival

2. Pick classifiers to try:
   1. Random forest
   2. Support Vector Machine (SVM)
   3. Naive Bayes
Learning to Predict Survival

3. Create a configuration file for SKLL

[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_location = train
test_location = dev
featuresets = [["family.csv", "misc.csv",
                "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Output]
results = output
models = output

Key to the settings and feature files:
  train_location     directory with feature files for training the learner
  test_location      directory with feature files for evaluating performance
  family.csv         # of siblings, spouses, parents, and children
  misc.csv           departure port
  socioeconomic.csv  fare & passenger class
  vitals.csv         sex & age
  results            directory to store evaluation results
  models             directory to store trained models
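For reference, a hypothetical few rows of one of these feature files (vitals.csv), assuming each file carries an ID column plus the label column named by label_col; the exact columns in the generated example files may differ:

id,Survived,Sex,Age
1,0,male,22
2,1,female,38
3,1,female,26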
Learning to Predict Survival
4. Run the configuration file with run_experiment
$ run_experiment evaluate.cfg

Loading train/family.csv...........done
Loading train/misc.csv...........done
Loading train/socioeconomic.csv...........done
Loading train/vitals.csv...........done
Loading dev/family.csv.....done
Loading dev/misc.csv.....done
Loading dev/socioeconomic.csv.....done
Loading dev/vitals.csv.....done
Loading train/family.csv...........done
Loading train/misc.csv...........done
Loading train/socioeconomic.csv...........done
Loading train/vitals.csv...........done
Loading dev/family.csv.....done
...
Learning to Predict Survival

5. Examine results

Experiment Name: Titanic_Evaluate
Training Set: train
Test Set: dev
Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
Learner: RandomForestClassifier
Task: evaluate

+-------+------+------+-----------+--------+-----------+
|       |  0.0 |  1.0 | Precision | Recall | F-measure |
+-------+------+------+-----------+--------+-----------+
| 0.000 | [97] |   18 |     0.874 |  0.843 |     0.858 |
+-------+------+------+-----------+--------+-----------+
| 1.000 |   14 | [50] |     0.735 |  0.781 |     0.758 |
+-------+------+------+-----------+--------+-----------+
(row = reference; column = predicted)
Accuracy = 0.8212290502793296
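Reading the matrix: rows are the true labels and columns the predictions, so 97 passengers who perished were labelled correctly and 18 were mislabelled as survivors, while 50 survivors were labelled correctly and 14 were not. The summary columns follow directly: precision for class 0.0 is 97 / (97 + 14) ≈ 0.874, recall is 97 / (97 + 18) ≈ 0.843, and overall accuracy is (97 + 50) / 179 ≈ 0.821.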
Aggregate Evaluation Results

Learner                   Dev. Accuracy
RandomForestClassifier    0.821
SVC                       0.771
MultinomialNB             0.709
Tuning the learner

• Can we do better than the default hyperparameters?

[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_location = train
test_location = dev
featuresets = [["family.csv", "misc.csv",
                "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
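Conceptually, grid_search = true with objective = accuracy is similar to running scikit-learn's grid search for each learner over SKLL's built-in parameter grids. A rough sketch of the equivalent for the SVC learner (the parameter grid below is made up for illustration, not SKLL's actual default, and train_features / train_labels are assumed to be already loaded):

from sklearn.model_selection import GridSearchCV  # lived in sklearn.grid_search in 2013-era releases
from sklearn.svm import SVC

param_grid = {'C': [0.01, 0.1, 1.0, 10.0, 100.0]}            # hypothetical grid
search = GridSearchCV(SVC(), param_grid, scoring='accuracy')
search.fit(train_features, train_labels)                      # assumed feature matrix and labels
best_svc = search.best_estimator_                              # the tuned model that gets evaluated on dev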
Tuned Evaluation Results

Learner                   Untuned Accuracy    Tuned Accuracy
RandomForestClassifier    0.821               0.849
SVC                       0.771               0.737
MultinomialNB             0.709               0.709
Using All Available Data

• Use training and dev to generate predictions on test

[General]
experiment_name = Titanic_Predict
task = predict

[Input]
train_location = train+dev
test_location = test
featuresets = [["family.csv", "misc.csv",
                "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
Test Set Performance

Learner                   Untuned Accuracy    Tuned Accuracy    Untuned Accuracy    Tuned Accuracy
                          (Train only)        (Train only)      (Train + Dev)       (Train + Dev)
RandomForestClassifier    0.732               0.746             0.746               0.756
SVC                       0.608               0.617             0.612               0.641
MultinomialNB             0.627               0.623             0.622               0.622
Advanced SKLL Features

• Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data (see the format sketch after this list)
• Parameter grids for all supported classifiers/regressors
• Parallelize experiments on DRMAA clusters
• Ablation experiments
• Collapse/rename classes from config file
• Rescale predictions to be closer to observed data
• Feature scaling
• Python API
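As a rough illustration of the first bullet, the .jsonlines / .ndj format holds one JSON object per line; the field names below ("id", "y" for the label, "x" for the feature dict) follow the layout used by the write_feature_file example later in this deck, but treat them as an assumption rather than a spec:

{"id": "1", "y": 0, "x": {"Sex": "female", "Age": 35.0, "SibSp": 1}}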
Currently Supported Learners

Classifiers                      Regressors
Linear Support Vector Machine    Elastic Net
Logistic Regression              Lasso
Multinomial Naive Bayes          Linear

Both:
Decision Tree
Gradient Boosting
Random Forest
Support Vector Machine
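The learner names used in the config files mirror scikit-learn estimator classes. A partial, assumed mapping for the learners named on this slide (illustrative only; SKLL wraps these estimators rather than exposing them directly, and the regressor variants in the last two imports are my inference from the shared column above):

from sklearn.svm import LinearSVC, SVC, SVR
from sklearn.linear_model import LogisticRegression, LinearRegression, ElasticNet, Lasso
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import (RandomForestClassifier, RandomForestRegressor,
                              GradientBoostingClassifier, GradientBoostingRegressor)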
Coming Soon

Classifiers & Regressors:
AdaBoost
K-Nearest Neighbors
Stochastic Gradient Descent
Acknowledgements

• Mike Heilman
• Nitin Madnani
• Aoife Cahill
References

• Dataset: kaggle.com/c/titanic-gettingStarted
• SKLL GitHub: github.com/EducationalTestingService/skll
• SKLL Docs: skll.readthedocs.org
• Titanic configs and data splitting script in examples dir on GitHub

@Dan_S_Blanchard
dan-blanchard
Bonus Slides
Cross-validation

[General]
experiment_name = Titanic_CV
task = cross_validate

[Input]
train_location = train+dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv",
                "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
Cross-validation Results

Learner                   Avg. CV Accuracy
RandomForestClassifier    0.815
SVC                       0.717
MultinomialNB             0.681
SKLL API

from skll import Learner, load_examples

# Load training examples
train_examples = load_examples('myexamples.megam')

# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)

# Load test examples and evaluate
test_examples = load_examples('test.tsv')
(conf_matrix,       # confusion matrix
 accuracy,
 prf_dict,          # precision, recall, f-score for each class
 model_params,      # tuned model parameters
 obj_score          # objective function score on test set
 ) = learner.evaluate(test_examples)

# Generate predictions from trained model
predictions = learner.predict(test_examples)

# Perform 10-fold cross-validation with a radial SVM
learner = Learner('SVC')
(fold_result_list,         # per-fold evaluation results
 grid_search_scores        # per-fold training set objective scores
 ) = learner.cross_validate(train_examples)
SKLL API

import numpy as np
import os
from skll import write_feature_file

# Create some training examples
num_train_examples = 100  # assumed value; not defined on the original slide
_my_dir = '.'             # assumed base directory; not defined on the original slide
classes = []
ids = []
features = []
for i in range(num_train_examples):
    y = "dog" if i % 2 == 0 else "cat"
    ex_id = "{}{}".format(y, i)
    x = {"f1": np.random.randint(1, 4),
         "f2": np.random.randint(1, 4),
         "f3": np.random.randint(1, 4)}
    classes.append(y)
    ids.append(ex_id)
    features.append(x)

# Write them to a file
train_path = os.path.join(_my_dir, 'train', 'test_summary.jsonlines')
write_feature_file(train_path, ids, classes, features)
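The file written above can then be fed back through the same API shown on the earlier slides, e.g. (a small usage sketch reusing those calls; predicting on the training data here is only for illustration):

from skll import Learner, load_examples

train_examples = load_examples(train_path)   # the .jsonlines file written above
learner = Learner('MultinomialNB')
learner.train(train_examples)
predictions = learner.predict(train_examples)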
