Integrate Sklearn OneHotEncoder #830
Codecov Report
```
@@           Coverage Diff           @@
##           master     #830   +/-   ##
=======================================
  Coverage   99.69%   99.69%
=======================================
  Files         195      195
  Lines        7846     7930    +84
=======================================
+ Hits         7822     7906    +84
  Misses         24       24
```
Left a bunch of comments on the docs, the impl and the tests.
Great job overall, particularly on the docs which read clearly! This is a solid first draft. I bet the next revision will be ready to merge.
Missing test coverage:

- Init test: check that the `parameters` dict is as expected after init (see the sketch after this comment).
- Coverage for the `drop` parameter.
- Use both `top_n` and `categories` inputs together.
- Are there other parameter combos which could cause problems?

I noticed some of the old tests set the `top_n` parameter outside of the constructor, i.e. `encoder.parameters['top_n'] = 5`. Could you please update those to use the constructor?
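A minimal sketch of the suggested init test (the parameter names and defaults here are assumed from this PR's discussion, not taken from the final implementation):

```python
from evalml.pipelines.components import OneHotEncoder

def test_init():
    # Assumed defaults; adjust to match the component's actual signature.
    encoder = OneHotEncoder(top_n=10, drop=None, handle_unknown="ignore",
                            handle_missing="error")
    assert encoder.parameters == {"top_n": 10, "categories": None, "drop": None,
                                  "handle_unknown": "ignore",
                                  "handle_missing": "error"}
```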
```
drop (string): Method ("first" or "if_binary") to use to drop one category per feature. Can also be
    a list specifying which method to use for each feature. Defaults to None.
handle_unknown (string): Option to "ignore" or "error" for unknown categories for a feature encountered
    during `fit` or `transform`. If `top_n` is used to limit the number of categories, this must be
```
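For reference, a quick sketch of what the two `drop` modes do in scikit-learn itself (the `sparse` flag and exact behavior may vary by sklearn version):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"binary": ["yes", "no", "yes"], "multi": ["a", "b", "c"]})

# "first" drops the first category of every feature; "if_binary" drops a
# category only for features with exactly two categories.
first = OneHotEncoder(drop="first", sparse=False).fit_transform(X)
if_binary = OneHotEncoder(drop="if_binary", sparse=False).fit_transform(X)
print(first.shape)      # (3, 3): 1 column for "binary", 2 for "multi"
print(if_binary.shape)  # (3, 4): 1 column for "binary", 3 for "multi"
```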
> If `top_n` is used to limit the number of categories, this must be "ignore".

Why?
My original answer was that using `top_n` would remove certain categories from the data before giving it to the scikit encoder to fit, meaning any categories removed by `top_n` in fit would throw an error in transform, as that's what scikit's documentation seems to imply. However, that doesn't make any sense, since the categories are not removed from the data before fitting.

So, some more experimenting has shown that limiting the categories so that not all of them end up being encoded (either through `top_n` or sklearn's own `categories` argument) will throw an error in `fit` when `handle_unknown='error'`. This is not in the documentation or any resource I can find online, so it's a bit baffling. I'm probably going to update this documentation to match my findings.
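A minimal reproduction of the finding (the error text below is paraphrased from what sklearn raises, not quoted from its docs):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"col": ["a", "a", "b", "c"]})

# "c" is excluded from the allowed categories, and handle_unknown is "error".
# sklearn raises during fit, not just transform, along the lines of:
# ValueError: Found unknown categories ['c'] in column 0 during fit
OneHotEncoder(categories=[["a", "b"]], handle_unknown="error").fit(X)
```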
Just talked to @eccabay. Decision was to document that sklearn doesn't do what its docs say.

Just realized: something we could do down the road to support this is to drop the categories which aren't allowed before fitting the sklearn one-hot encoder. Documenting here for future reference.
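A rough sketch of that idea, under one possible interpretation (drop the rows carrying disallowed values; the helper here is entirely hypothetical):

```python
import pandas as pd

def limit_categories(X, allowed):
    # Hypothetical helper: remove rows containing disallowed values so the
    # categories cut by top_n never reach sklearn's fit().
    mask = pd.Series(True, index=X.index)
    for col, values in allowed.items():
        mask &= X[col].isin(values)
    return X[mask]

X = pd.DataFrame({"col": ["a", "a", "b", "c"]})
X_limited = limit_categories(X, {"col": ["a", "b"]})  # the "c" row is removed
```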
Looks great! Left some suggestions but looks ready to 🚢 👏
```
handle_missing (string): Options for how to handle missing (NaN) values encountered during
    `fit` or `transform`. If this is set to "as_category" and NaN values are within the `n` most frequent,
    "nan" values will be encoded as their own column. If this is set to "error", any missing
    values encountered will raise an error. Defaults to "error".
```
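A quick sketch of the two modes in use (import path and exact output column naming assumed):

```python
import numpy as np
import pandas as pd
from evalml.pipelines.components import OneHotEncoder

X = pd.DataFrame({"col_1": ["a", np.nan, "b", "a"]})

# "as_category": NaN becomes its own "nan" column alongside "a" and "b".
X_t = OneHotEncoder(handle_missing="as_category").fit_transform(X)

# "error" (the default): the same input raises ValueError("Input contains NaN").
OneHotEncoder(handle_missing="error").fit(X)
```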
Nice!
```python
if handle_missing not in missing_input_options:
    raise ValueError("Invalid input {} for handle_missing".format(handle_missing))
if top_n is not None and categories is not None:
    raise ValueError("Cannot use categories and top_n arguments simultaneously")
```
Love it, this is very clear
```python
elif self.parameters['handle_missing'] == "error" and X.isnull().any().any():
    raise ValueError("Input contains NaN")

if len(cols_to_encode) == 0:
```
Do we have unit test coverage of this case?
Yep, `test_all_numerical_dtype`, starting at line 284.
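Roughly, that test does something like the following (a paraphrase, not the actual test body):

```python
import pandas as pd
from evalml.pipelines.components import OneHotEncoder

def test_all_numerical_dtype():
    # No categorical columns, so cols_to_encode is empty and the data
    # should pass through unchanged.
    X = pd.DataFrame({"num_1": [1, 2, 3], "num_2": [4.0, 5.0, 6.0]})
    X_t = OneHotEncoder().fit_transform(X)
    assert X_t.equals(X)
```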
```python
categories = 'auto'

elif self.parameters['categories'] is not None:
    categories = self.parameters['categories']
```
Could things break if this comes back as an empty list?
I'm assuming not, since we start with an empty list in the `else` block below, but just checking.
This is a good catch, scikit-learn throws an error in that case. I'll add our own catch to provide more useful feedback.
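Something like the following guard could give the clearer feedback (a sketch; the helper name and message are placeholders):

```python
def _validate_categories(categories):
    # Hypothetical guard: an empty categories list makes scikit-learn fail
    # with an opaque error, so reject it up front with a clearer message.
    if categories is not None and len(categories) == 0:
        raise ValueError("Invalid input {} for categories".format(categories))
```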
```python
try:
    col_values = self.col_unique_values
except AttributeError:
    if self._encoder is None:
```
👍 good call!
I'm working on a plan to standardize this, using decorators and a metaclass, but in the meantime it's good we're adding this logic to more components.
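For reference, a toy sketch of the decorator half of that idea (entirely hypothetical, not the planned API):

```python
import functools

def require_fit(method):
    # Hypothetical decorator: raise if a method runs before fit() has set
    # the underlying encoder.
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if getattr(self, "_encoder", None) is None:
            raise RuntimeError("Component must be fit before calling {}()"
                               .format(method.__name__))
        return method(self, *args, **kwargs)
    return wrapper
```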
```python
if self.parameters['handle_missing'] == "as_category":
    X[cat_cols] = X[cat_cols].replace(np.nan, "nan")
elif self.parameters['handle_missing'] == "error" and X.isnull().any().any():
```
I suggest you change this to an `if` instead of an `elif`. Using `elif` is not incorrect, but using `if` whenever possible helps keep the code simple, in my opinion. And since the logic in these two blocks isn't mutually exclusive (one is a dead end with the exception), that's doable in this case.
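Concretely, the suggested shape (the same logic as the diff above, with independent `if` statements):

```python
# The two branches test different handle_missing values, so they can never
# both fire; plain `if` statements keep the control flow flat.
if self.parameters['handle_missing'] == "as_category":
    X[cat_cols] = X[cat_cols].replace(np.nan, "nan")
if self.parameters['handle_missing'] == "error" and X.isnull().any().any():
    raise ValueError("Input contains NaN")
```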
```python
encoder = OneHotEncoder()
error_msg = "Invalid input {} for handle_missing".format("peanut butter")
with pytest.raises(ValueError, match=error_msg):
    encoder = OneHotEncoder(handle_missing="peanut butter")
```
It could be helpful to move this and similar checks into a separate test, `test_one_hot_encoder_invalid_inputs`. That way, the other tests can just check the positive cases, i.e. intended usages.
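A sketch of that consolidated test (error strings taken from the guards earlier in this diff):

```python
import pytest
from evalml.pipelines.components import OneHotEncoder

def test_one_hot_encoder_invalid_inputs():
    with pytest.raises(ValueError, match="Invalid input peanut butter for handle_missing"):
        OneHotEncoder(handle_missing="peanut butter")
    with pytest.raises(ValueError, match="Cannot use categories and top_n arguments simultaneously"):
        OneHotEncoder(top_n=5, categories=[["a", "b"]])
```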
```python
    expected_col_names.add("col_1_" + val)
for val in X["col_2"]:
    expected_col_names.add("col_2_" + val)
col_names = set(X_t.columns)
```
Nit-pick: it's helpful to first declare the expected values, then have all the functional testing together in one block, i.e.

```python
expected_col_names = ...
...
encoder = OneHotEncoder(top_n=None, handle_unknown="error", random_state=2)
encoder.fit(X)
X_t = encoder.transform(X)
col_names = set(X_t.columns)
assert X_t.shape == (11, 20)
assert col_names == expected_col_names
```
Closes #776.

I made several choices in how to handle unknown and missing values in the dataset. Both now have options `ignore` or `error`:

- `handle_unknown` is incompatible with our implementation of `top_n`, since the categories not in the top n are removed before the data is given to the sklearn implementation. Therefore, I added the option to set `top_n=None` if `handle_unknown="error"` is used.
- With `handle_missing='ignore'`, the code replaces any `np.nan` values with the string `"nan"` to reflect the previous implementation's output.

Any feedback on these choices, or on how they were implemented and tested, would be greatly appreciated.
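Putting those choices together, a usage sketch (import path assumed; note that per the docstring quoted above, the missing-value option ended up named "as_category"):

```python
from evalml.pipelines.components import OneHotEncoder

# Encode every category so unseen values at transform time can genuinely
# be treated as errors: handle_unknown="error" requires top_n=None.
strict = OneHotEncoder(top_n=None, handle_unknown="error")

# Encode missing values as their own "nan" category instead of raising.
lenient = OneHotEncoder(handle_missing="as_category")
```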