
Commit 122ded8

FIX address some feedback from forum (INRIA#337)
1 parent e5d424b commit 122ded8

5 files changed: +55 −17 lines

python_scripts/02_numerical_pipeline_ex_01.py

Lines changed: 15 additions & 0 deletions

```diff
@@ -62,8 +62,23 @@
 data_numeric_train, data_numeric_test, target_train, target_test = \
     train_test_split(data_numeric, target, random_state=0)
 
+# %% [markdown]
+# Split the dataset into a train and test sets.
 # %%
 from sklearn.model_selection import train_test_split
+
+# Write your code here.
+
+
+# %% [markdown]
+# Use a `DummyClassifier` such that the resulting classifier will always
+# predict the class `' >50K'`. What is the accuracy score on the test set?
+# Repeat the experiment by always predicting the class `' <=50K'`.
+#
+# Hint: you can refer to the parameter `strategy` of the `DummyClassifier`
+# to achieve the desired behaviour.
+
+# %%
 from sklearn.dummy import DummyClassifier
 
 # Write your code here.
```
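A sketch of what the exercise is after, assuming `strategy="constant"` with `constant=' >50K'` (toy data stands in for the adult census dataset; this is an illustration, not the course's official solution):

```python
# A DummyClassifier with strategy="constant" always predicts the given class;
# its accuracy on the test set is then the fraction of that class in y_test.
from sklearn.dummy import DummyClassifier

X_train = [[0], [1], [2], [3]]
y_train = [" >50K", " <=50K", " <=50K", " <=50K"]
X_test = [[4], [5]]
y_test = [" <=50K", " >50K"]

clf = DummyClassifier(strategy="constant", constant=" >50K")
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # 0.5: one of the two test labels is ' >50K'
```

Repeating with `constant=' <=50K'` gives the complementary accuracy, which is the point of the exercise.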

python_scripts/02_numerical_pipeline_hands_on.py

Lines changed: 5 additions & 0 deletions

```diff
@@ -190,6 +190,11 @@
 #
 # To create a logistic regression model in scikit-learn you can do:
 
+# %%
+# to display nice model diagram
+from sklearn import set_config
+set_config(display='diagram')
+
 # %%
 from sklearn.linear_model import LogisticRegression
```
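The added cell flips a global scikit-learn option: with `display='diagram'`, estimators render as interactive HTML diagrams in notebooks. A minimal sketch, verifiable outside a notebook through `get_config`:

```python
from sklearn import get_config, set_config

set_config(display="diagram")  # estimators now render as HTML diagrams
# Outside a notebook, the effect is visible in the global config:
print(get_config()["display"])  # diagram
```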

python_scripts/02_numerical_pipeline_introduction.py

Lines changed: 12 additions & 7 deletions

```diff
@@ -83,14 +83,19 @@
 # into account its `k` closest samples in the training set and predicts the
 # majority target of these samples.
 #
-# The `fit` method is called to train the model from the input
-# (features) and target data.
-#
 # ```{caution}
 # We use a K-nearest neighbors here. However, be aware that it is seldom useful
 # in practice. We use it because it is an intuitive algorithm. In the next
 # notebook, we will introduce better models.
 # ```
+#
+# The `fit` method is called to train the model from the input (features) and
+# target data.
+
+# %%
+# to display nice model diagram
+from sklearn import set_config
+set_config(display='diagram')
 
 # %%
 from sklearn.neighbors import KNeighborsClassifier
@@ -105,12 +110,12 @@
 #
 # The method `fit` is composed of two elements: (i) a **learning algorithm**
 # and (ii) some **model states**. The learning algorithm takes the training
-# data and training target as input and set the model states. These model
-# states will be used later to either predict (for classifier and regressor) or
-# transform data (for transformers).
+# data and training target as input and sets the model states. These model
+# states will be used later to either predict (for classifiers and regressors)
+# or transform data (for transformers).
 #
 # Both the learning algorithm and the type of model states are specific to each
-# type of models.
+# type of model.
 
 # %% [markdown]
 # ```{note}
```

python_scripts/02_numerical_pipeline_scaling.py

Lines changed: 22 additions & 8 deletions

```diff
@@ -34,6 +34,11 @@
 
 adult_census = pd.read_csv("../datasets/adult-census.csv")
 
+# %%
+# to display nice model diagram
+from sklearn import set_config
+set_config(display='diagram')
+
 # %% [markdown]
 # We will now drop the target from the data we will use to train our
 # predictive model.
@@ -191,15 +196,24 @@
 # `Pipeline`, which chains together operations and is used as any other
 # classifier or regressor. The helper function `make_pipeline` will create a
 # `Pipeline`: it takes as arguments the successive transformations to perform,
-# followed by the classifier or regressor model, and will assign automatically
-# a name at steps based on the name of the classes.
+# followed by the classifier or regressor model.
 
 # %%
 import time
 from sklearn.linear_model import LogisticRegression
 from sklearn.pipeline import make_pipeline
 
 model = make_pipeline(StandardScaler(), LogisticRegression())
+model
+
+# %% [markdown]
+# The `make_pipeline` function did not require us to give a name to each step.
+# Indeed, it was automatically assigned based on the name of the classes
+# provided; a `StandardScaler` will be a step named `"standardscaler"` in the
+# resulting pipeline. We can check the name of each steps of our model:
+
+# %%
+model.named_steps
 
 # %% [markdown]
 # This predictive pipeline exposes the same methods as the final predictor:
```
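The naming behaviour the added markdown cell describes can be checked directly; a small sketch with the same `StandardScaler`/`LogisticRegression` pipeline as the notebook:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(StandardScaler(), LogisticRegression())
# make_pipeline derives each step's name from its lowercased class name
print(list(model.named_steps))  # ['standardscaler', 'logisticregression']
```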
```diff
@@ -278,11 +292,11 @@
 #
 # ```{warning}
 # Working with non-scaled data will potentially force the algorithm to iterate
-# more as we showed in the example above. There is also a catastrophic scenario
-# where the number of required iterations are more than the maximum number of
-# iterations allowed by the predictor (controlled by the `max_iter`) parameter.
-# Therefore, before increasing `max_iter`, make sure that the data are well
-# scaled.
+# more as we showed in the example above. There is also the catastrophic
+# scenario where the number of required iterations are more than the maximum
+# number of iterations allowed by the predictor (controlled by the `max_iter`)
+# parameter. Therefore, before increasing `max_iter`, make sure that the data
+# are well scaled.
 # ```
 
 # %% [markdown]
```
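The reworded warning can be illustrated by comparing the solver's `n_iter_` attribute with and without scaling; a sketch under our own toy setup (synthetic data with one feature blown up in scale, not the course's dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)
# make one feature three orders of magnitude larger than the rest
X_bad = X * np.array([1e3] + [1.0] * (X.shape[1] - 1))

scaled = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_bad, y)
raw = LogisticRegression(max_iter=10_000).fit(X_bad, y)
# on badly scaled data the solver typically needs many more iterations,
# which is why simply raising max_iter is the wrong first move
print(scaled[-1].n_iter_[0], raw.n_iter_[0])
```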
```diff
@@ -298,7 +312,7 @@
 # the procedure such that the training and testing sets are different each
 # time. Statistical performance metrics are collected for each repetition and
 # then aggregated. As a result we can get an estimate of the variability of the
-# model statistical performance.
+# model's statistical performance.
 #
 # Note that there exists several cross-validation strategies, each of them
 # defines how to repeat the `fit`/`score` procedure. In this section, we will
```
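The repeated fit/score procedure this paragraph describes can be sketched with `cross_validate` (synthetic data of our own; the pipeline mirrors the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())

# 5-fold cross-validation: 5 fit/score repetitions on different splits
cv_results = cross_validate(model, X, y, cv=5)
scores = cv_results["test_score"]
print(scores.mean(), scores.std())  # point estimate and its variability
```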

python_scripts/03_categorical_pipeline_column_transformer.py

Lines changed: 1 addition & 2 deletions

```diff
@@ -118,8 +118,7 @@
 model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
 
 # %% [markdown]
-# Starting from `scikit-learn 0.23`, the notebooks can display an interactive
-# view of the pipelines.
+# We can display an interactive diagram with the following command:
 
 # %%
 from sklearn import set_config
```
