Integrate Sklearn OneHotEncoder #830


Merged
merged 25 commits into from Jun 16, 2020

Conversation

eccabay
Contributor

@eccabay eccabay commented Jun 3, 2020

Closes #776.

I made several choices in how to handle unknown and missing values in the dataset. Both now have the options "ignore" or "error":

  • scikit-learn's handle_unknown is incompatible with our implementation of top_n, since the categories not in the top n are removed before the data is given to the sklearn implementation. Therefore, I added the option to set top_n=None, which is required if handle_unknown="error" is used.
  • the scikit-learn OneHotEncoder has no way to handle missing values in its current stable release. Therefore, if missing values need to be exposed to the encoder and handle_missing='ignore' is set, the code replaces any np.nan values with the string "nan" to match the previous implementation's output (see the sketch below).
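A minimal sketch of that replacement, assuming a pandas DataFrame input (the helper name here is illustrative, not the PR's exact code):

import numpy as np
import pandas as pd

# Replace missing values in categorical columns with the literal string
# "nan" so sklearn's OneHotEncoder treats them as an ordinary category.
def replace_nan_with_string(X, categorical_cols):
    X = X.copy()
    X[categorical_cols] = X[categorical_cols].replace(np.nan, "nan")
    return X

X = pd.DataFrame({"col_1": ["a", "b", np.nan, "a"]})
print(replace_nan_with_string(X, ["col_1"]))  # NaN becomes the string "nan"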

Any feedback on these choices, or on how they were implemented and tested, would be greatly appreciated.

@codecov

codecov bot commented Jun 3, 2020

Codecov Report

Merging #830 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #830   +/-   ##
=======================================
  Coverage   99.69%   99.69%           
=======================================
  Files         195      195           
  Lines        7846     7930   +84     
=======================================
+ Hits         7822     7906   +84     
  Misses         24       24           
Impacted Files Coverage Δ
...ification_pipeline_tests/test_en_classification.py 100.00% <ø> (ø)
...ification_pipeline_tests/test_et_classification.py 100.00% <ø> (ø)
...ts/regression_pipeline_tests/test_en_regression.py 100.00% <ø> (ø)
...ts/regression_pipeline_tests/test_et_regression.py 100.00% <ø> (ø)
...egression_pipeline_tests/test_linear_regression.py 100.00% <ø> (ø)
...ts/regression_pipeline_tests/test_rf_regression.py 100.00% <ø> (ø)
...gression_pipeline_tests/test_xgboost_regression.py 100.00% <ø> (ø)
...lines/classification/logistic_regression_binary.py 100.00% <100.00%> (ø)
...l/pipelines/classification/random_forest_binary.py 100.00% <100.00%> (ø)
evalml/pipelines/classification/xgboost_binary.py 100.00% <100.00%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6a9815...b7c3e56.

@eccabay eccabay marked this pull request as ready for review June 4, 2020 13:32
@eccabay eccabay requested review from angela97lin and dsherry June 4, 2020 13:32
@eccabay eccabay requested a review from dsherry June 9, 2020 12:11
Contributor

@dsherry dsherry left a comment


Left a bunch of comments on the docs, the impl and the tests.

Great job overall, particularly on the docs which read clearly! This is a solid first draft. I bet the next revision will be ready to merge.

Missing test coverage:

  • Init test: check that parameters dict is as expected after init.
  • Coverage for the drop parameter.
  • Use both top_n and categories inputs together.
  • Are there other parameter combos which could cause problems?

I noticed some of the old tests set the top_n parameter outside of the constructor, e.g. encoder.parameters['top_n'] = 5. Could you please update those to use the constructor?
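A sketch of the requested init test (the import path and the keys in parameters are assumptions based on this PR's discussion):

from evalml.pipelines.components import OneHotEncoder

def test_one_hot_encoder_init():
    # Check that the parameters dict reflects what was passed to the constructor.
    encoder = OneHotEncoder(top_n=5, handle_unknown="ignore", handle_missing="error")
    assert encoder.parameters["top_n"] == 5
    assert encoder.parameters["handle_unknown"] == "ignore"
    assert encoder.parameters["handle_missing"] == "error"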

drop (string): Method ("first" or "if_binary") to use to drop one category per feature. Can also be
a list specifying which category to drop for each feature. Defaults to None.
handle_unknown (string): Option to "ignore" or "error" for unknown categories for a feature encountered
during `fit` or `transform`. If `top_n` is used to limit the number of categories, this must be "ignore".
Contributor

If top_n is used to limit the number of categories, this must be "ignore".

Why?

Contributor Author

My original answer was that using top_n would remove certain categories from the data before giving it to the scikit encoder to fit, meaning any categories removed by top_n in fit would throw an error in transform, as that's what scikit's documentation seems to imply. However, that doesn't make any sense, since the categories are not removed from the data before fitting.

So, some more experimenting has shown that limiting the categories so that not all of them end up being encoded (either through top_n or sklearn's own categories argument) will throw an error in fit when handle_unknown='error'. This is not in the documentation or any resource I can find online, so it's a bit baffling. I'm probably going to update this documentation to match my findings.
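A small standalone reproduction of that finding against sklearn directly (sklearn ~0.23; the exact error text may vary by version):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# If the supplied categories don't cover every value in the data and
# handle_unknown="error" (the default), sklearn raises during fit,
# not only during transform.
X = np.array([["a"], ["b"], ["c"]])
encoder = OneHotEncoder(categories=[["a", "b"]], handle_unknown="error")
try:
    encoder.fit(X)
except ValueError as err:
    print(err)  # e.g. "Found unknown categories ['c'] in column 0 during fit"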

Contributor

Just talked to @eccabay. Decision was to document that sklearn doesn't do what its documentation says.

Just realized, something we could do down the road to support this is to drop the categories which aren't allowed before fitting the sklearn one-hot encoder. Documenting here for future reference.
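A rough sketch of that idea, with assumed names (allowed maps each column to its permitted categories):

def drop_disallowed_categories(X, allowed):
    # Hypothetical pre-processing step: mask values outside the allowed
    # categories (turning them into NaN) before fitting sklearn's encoder,
    # so handle_unknown="error" wouldn't trip over them at fit time.
    X = X.copy()
    for col, cats in allowed.items():
        X[col] = X[col].where(X[col].isin(cats))
    return X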

@eccabay eccabay requested a review from dsherry June 12, 2020 15:58
Contributor

@dsherry dsherry left a comment

Looks great! Left some suggestions but looks ready to 🚢 👏

handle_missing (string): Options for how to handle missing (NaN) values encountered during
`fit` or `transform`. If this is set to "as_category" and NaN values are within the `n` most frequent,
"nan" values will be encoded as their own column. If this is set to "error", any missing
values encountered will raise an error. Defaults to "error".
Contributor

Nice!

if handle_missing not in missing_input_options:
    raise ValueError("Invalid input {} for handle_missing".format(handle_missing))
if top_n is not None and categories is not None:
    raise ValueError("Cannot use categories and top_n arguments simultaneously")
Contributor

Love it, this is very clear

elif self.parameters['handle_missing'] == "error" and X.isnull().any().any():
    raise ValueError("Input contains NaN")

if len(cols_to_encode) == 0:
Contributor

Do we have unit test coverage of this case?

Contributor Author

Yep, test_all_numerical_dtype starting at line 284

    categories = 'auto'

elif self.parameters['categories'] is not None:
    categories = self.parameters['categories']
Contributor

Could things break if this comes back as an empty list?

Contributor

I'm assuming not, since we start with an empty list in the else block below, but just checking

Contributor Author

This is a good catch, scikit-learn throws an error in that case. I'll add our own catch to provide more useful feedback.
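Presumably something along these lines (the message wording here is assumed, not the final implementation):

def _check_categories(categories):
    # Hypothetical guard: fail early with a clearer message than sklearn's
    # if an empty list of categories is passed.
    if categories is not None and len(categories) == 0:
        raise ValueError("Categories argument must contain a list of categories for each categorical feature")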

try:
    col_values = self.col_unique_values
except AttributeError:
    if self._encoder is None:
Contributor

👍 good call!

I'm working on a plan to standardize this, using decorators and a metaclass, but in the meantime it's good we're adding this logic to more components.


if self.parameters['handle_missing'] == "as_category":
    X[cat_cols] = X[cat_cols].replace(np.nan, "nan")
elif self.parameters['handle_missing'] == "error" and X.isnull().any().any():
Contributor

I suggest you change this to an if instead of elif. Using elif is not incorrect, but using if whenever possible helps keep the code simple, in my opinion. And since the logic in these two blocks isn't mutually exclusive (one is a dead end with the exception), that's doable in this case.
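Applied to the snippet above, the suggestion would look like this (behavior is unchanged, since the two conditions can't both be true):

if self.parameters['handle_missing'] == "as_category":
    X[cat_cols] = X[cat_cols].replace(np.nan, "nan")
# This branch raises, so making it an independent `if` changes nothing.
if self.parameters['handle_missing'] == "error" and X.isnull().any().any():
    raise ValueError("Input contains NaN")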

encoder = OneHotEncoder()
error_msg = "Invalid input {} for handle_missing".format("peanut butter")
with pytest.raises(ValueError, match=error_msg):
    encoder = OneHotEncoder(handle_missing="peanut butter")
Contributor

It could be helpful to move this and similar checks into a separate test test_one_hot_encoder_invalid_inputs. That way, the other tests can just check the positive cases, i.e. intended usages.
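A sketch of that consolidated test, reusing the error messages quoted elsewhere in this PR (import path assumed):

import pytest
from evalml.pipelines.components import OneHotEncoder

def test_one_hot_encoder_invalid_inputs():
    with pytest.raises(ValueError, match="Invalid input peanut butter for handle_missing"):
        OneHotEncoder(handle_missing="peanut butter")
    with pytest.raises(ValueError, match="Cannot use categories and top_n arguments simultaneously"):
        OneHotEncoder(top_n=10, categories=[["a", "b"]])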

    expected_col_names.add("col_1_" + val)
for val in X["col_2"]:
    expected_col_names.add("col_2_" + val)
col_names = set(X_t.columns)
Contributor

Nit-pick: helpful to first declare expected values, then have all the functional testing together in one block, i.e.

expected_col_names = ...
...
encoder = OneHotEncoder(top_n=None, handle_unknown="error", random_state=2)
encoder.fit(X)
X_t = encoder.transform(X)
col_names = set(X_t.columns)
assert (X_t.shape == (11, 20))
assert (col_names == expected_col_names)

Development

Successfully merging this pull request may close these issues.

Update our one-hot encoder to use sklearn's implementation
3 participants