Integrate Sklearn OneHotEncoder #830


Merged
merged 25 commits into from Jun 16, 2020

Conversation

eccabay
Contributor

@eccabay eccabay commented Jun 3, 2020

Closes #776.

I made several choices in how to handle unknown and missing values in the dataset. Both now have the options "ignore" or "error":

  • scikit-learn's handle_unknown is incompatible with our implementation of top_n, since the categories not in the top n are removed before the data is given to the sklearn implementation. Therefore, I added the option to set top_n=None, which is required if handle_unknown="error" is used.
  • the scikit-learn OneHotEncoder has no way to handle missing values in its current stable release. Therefore, if missing values need to be exposed to the encoder and handle_missing='ignore' is set, the code replaces any np.nan values with the string "nan" to match the previous implementation's output (see the sketch below).
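A minimal sketch of that replacement, assuming a pandas DataFrame input (the helper name here is illustrative, not the PR's exact code):

import numpy as np
import pandas as pd

# Replace missing values in categorical columns with the literal string
# "nan" so sklearn's OneHotEncoder treats them as an ordinary category.
def replace_nan_with_string(X, categorical_cols):
    X = X.copy()
    X[categorical_cols] = X[categorical_cols].replace(np.nan, "nan")
    return X

X = pd.DataFrame({"col_1": ["a", "b", np.nan, "a"]})
print(replace_nan_with_string(X, ["col_1"]))  # NaN becomes the string "nan"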

Any feedback on these choices, or on how they were implemented and tested, would be greatly appreciated.

@codecov

codecov bot commented Jun 3, 2020

Codecov Report

Merging #830 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #830   +/-   ##
=======================================
  Coverage   99.69%   99.69%           
=======================================
  Files         195      195           
  Lines        7846     7930   +84     
=======================================
+ Hits         7822     7906   +84     
  Misses         24       24           
Impacted Files Coverage Δ
...ification_pipeline_tests/test_en_classification.py 100.00% <ø> (ø)
...ification_pipeline_tests/test_et_classification.py 100.00% <ø> (ø)
...ts/regression_pipeline_tests/test_en_regression.py 100.00% <ø> (ø)
...ts/regression_pipeline_tests/test_et_regression.py 100.00% <ø> (ø)
...egression_pipeline_tests/test_linear_regression.py 100.00% <ø> (ø)
...ts/regression_pipeline_tests/test_rf_regression.py 100.00% <ø> (ø)
...gression_pipeline_tests/test_xgboost_regression.py 100.00% <ø> (ø)
...lines/classification/logistic_regression_binary.py 100.00% <100.00%> (ø)
...l/pipelines/classification/random_forest_binary.py 100.00% <100.00%> (ø)
evalml/pipelines/classification/xgboost_binary.py 100.00% <100.00%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6a9815...b7c3e56.

@eccabay eccabay marked this pull request as ready for review June 4, 2020 13:32
@eccabay eccabay requested review from angela97lin and dsherry June 4, 2020 13:32
@eccabay eccabay requested a review from dsherry June 9, 2020 12:11
Contributor

@dsherry dsherry left a comment


Left a bunch of comments on the docs, the impl and the tests.

Great job overall, particularly on the docs which read clearly! This is a solid first draft. I bet the next revision will be ready to merge.

Missing test coverage:

  • Init test: check that parameters dict is as expected after init.
  • Coverage for the drop parameter.
  • Use both top_n and categories inputs together.
  • Are there other parameter combos which could cause problems?

I noticed some of the old tests set the top_n parameter outside of the constructor, e.g. encoder.parameters['top_n'] = 5. Could you please update those to use the constructor?
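A sketch of the requested init test (the import path and the keys in parameters are assumptions based on this PR's discussion):

from evalml.pipelines.components import OneHotEncoder

def test_one_hot_encoder_init():
    # Check that the parameters dict reflects what was passed to the constructor.
    encoder = OneHotEncoder(top_n=5, handle_unknown="ignore", handle_missing="error")
    assert encoder.parameters["top_n"] == 5
    assert encoder.parameters["handle_unknown"] == "ignore"
    assert encoder.parameters["handle_missing"] == "error"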

drop (string): Method ("first" or "if_binary") to use to drop one category per feature. Can also be
a list specifying which category to drop for each feature. Defaults to None.
handle_unknown (string): Option to "ignore" or "error" for unknown categories for a feature encountered
during `fit` or `transform`. If `top_n` is used to limit the number of categories, this must be "ignore".
Contributor

If top_n is used to limit the number of categories, this must be "ignore".

Why?

Contributor Author

My original answer was that using top_n would remove certain categories from the data before giving it to the scikit encoder to fit, meaning any categories removed by top_n in fit would throw an error in transform, as that's what scikit's documentation seems to imply. However, that doesn't make any sense, since the categories are not removed from the data before fitting.

So, some more experimenting has shown that limiting the categories so that not all of them end up being encoded (either through top_n or sklearn's own categories argument) will throw an error in fit when handle_unknown='error'. This is not in the documentation or any resource I can find online, so it's a bit baffling. I'm probably going to update this documentation to match my findings.
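A small standalone reproduction of that finding against sklearn directly (sklearn ~0.23; the exact error text may vary by version):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# If the supplied categories don't cover every value in the data and
# handle_unknown="error" (the default), sklearn raises during fit,
# not only during transform.
X = np.array([["a"], ["b"], ["c"]])
encoder = OneHotEncoder(categories=[["a", "b"]], handle_unknown="error")
try:
    encoder.fit(X)
except ValueError as err:
    print(err)  # e.g. "Found unknown categories ['c'] in column 0 during fit"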

Contributor

Just talked to @eccabay. Decision was to document that sklearn doesn't do what its documentation says.

Just realized, something we could do down the road to support this is to drop the categories which aren't allowed before fitting the sklearn one-hot encoder. Documenting here for future reference.
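A rough sketch of that idea, with assumed names (allowed maps each column to its permitted categories):

def drop_disallowed_categories(X, allowed):
    # Hypothetical pre-processing step: mask values outside the allowed
    # categories (turning them into NaN) before fitting sklearn's encoder,
    # so handle_unknown="error" wouldn't trip over them at fit time.
    X = X.copy()
    for col, cats in allowed.items():
        X[col] = X[col].where(X[col].isin(cats))
    return X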

@eccabay eccabay requested a review from dsherry June 12, 2020 15:58
Contributor

@dsherry dsherry left a comment

Looks great! Left some suggestions but looks ready to 🚢 👏

handle_missing (string): Options for how to handle missing (NaN) values encountered during
`fit` or `transform`. If this is set to "as_category" and NaN values are within the `n` most frequent,
"nan" values will be encoded as their own column. If this is set to "error", any missing
values encountered will raise an error. Defaults to "error".
Contributor

Nice!

if handle_missing not in missing_input_options:
    raise ValueError("Invalid input {} for handle_missing".format(handle_missing))
if top_n is not None and categories is not None:
    raise ValueError("Cannot use categories and top_n arguments simultaneously")
Contributor

Love it, this is very clear

elif self.parameters['handle_missing'] == "error" and X.isnull().any().any():
    raise ValueError("Input contains NaN")

if len(cols_to_encode) == 0:
Contributor

Do we have unit test coverage of this case?

Contributor Author

Yep, test_all_numerical_dtype starting at line 284

    categories = 'auto'

elif self.parameters['categories'] is not None:
    categories = self.parameters['categories']
Contributor

Could things break if this comes back as an empty list?

Contributor

I'm assuming not, since we start with an empty list in the else block below, but just checking

Contributor Author

This is a good catch, scikit-learn throws an error in that case. I'll add our own catch to provide more useful feedback.
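Presumably something along these lines (the message wording here is assumed, not the final implementation):

def _check_categories(categories):
    # Hypothetical guard: fail early with a clearer message than sklearn's
    # if an empty list of categories is passed.
    if categories is not None and len(categories) == 0:
        raise ValueError("Categories argument must contain a list of categories for each categorical feature")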

try:
    col_values = self.col_unique_values
except AttributeError:
    if self._encoder is None:
Contributor

👍 good call!

I'm working on a plan to standardize this, using decorators and a metaclass, but in the meantime it's good we're adding this logic to more components.


if self.parameters['handle_missing'] == "as_category":
    X[cat_cols] = X[cat_cols].replace(np.nan, "nan")
elif self.parameters['handle_missing'] == "error" and X.isnull().any().any():
Contributor

I suggest you change this to an if instead of elif. Using elif is not incorrect, but using if whenever possible helps keep the code simple, in my opinion. And since the logic in these two blocks isn't mutually exclusive (one is a dead end with the exception), that's doable in this case.
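Applied to the snippet above, the suggestion would look like this (behavior is unchanged, since the two conditions can't both be true):

if self.parameters['handle_missing'] == "as_category":
    X[cat_cols] = X[cat_cols].replace(np.nan, "nan")
# This branch raises, so making it an independent `if` changes nothing.
if self.parameters['handle_missing'] == "error" and X.isnull().any().any():
    raise ValueError("Input contains NaN")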

encoder = OneHotEncoder()
error_msg = "Invalid input {} for handle_missing".format("peanut butter")
with pytest.raises(ValueError, match=error_msg):
    encoder = OneHotEncoder(handle_missing="peanut butter")
Contributor

It could be helpful to move this and similar checks into a separate test test_one_hot_encoder_invalid_inputs. That way, the other tests can just check the positive cases, i.e. intended usages.
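A sketch of that consolidated test, reusing the error messages quoted elsewhere in this PR (import path assumed):

import pytest
from evalml.pipelines.components import OneHotEncoder

def test_one_hot_encoder_invalid_inputs():
    with pytest.raises(ValueError, match="Invalid input peanut butter for handle_missing"):
        OneHotEncoder(handle_missing="peanut butter")
    with pytest.raises(ValueError, match="Cannot use categories and top_n arguments simultaneously"):
        OneHotEncoder(top_n=10, categories=[["a", "b"]])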

    expected_col_names.add("col_1_" + val)
for val in X["col_2"]:
    expected_col_names.add("col_2_" + val)
col_names = set(X_t.columns)
Contributor

Nit-pick: helpful to first declare expected values, then have all the functional testing together in one block, i.e.

expected_col_names = ...
...
encoder = OneHotEncoder(top_n=None, handle_unknown="error", random_state=2)
encoder.fit(X)
X_t = encoder.transform(X)
col_names = set(X_t.columns)
assert (X_t.shape == (11, 20))
assert (col_names == expected_col_names)

Development

Successfully merging this pull request may close these issues.

Update our one-hot encoder to use sklearn's implementation
3 participants