Skip to content

Integrate Sklearn OneHotEncoder #830

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
Jun 16, 2020
Merged
Changes from 1 commit
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
5e2d533
Preliminary attempt to add sk's onehot
eccabay May 29, 2020
a407776
Integrated top_n into caategories (some test failures)
eccabay Jun 1, 2020
6ebdebf
SKlearn integrated fully except for missing and unknown value handling
eccabay Jun 2, 2020
db8b9f0
add support for top_n being None
eccabay Jun 2, 2020
ec1d143
Fake-out NaN values
eccabay Jun 2, 2020
c4d692e
Fake NaN values only for categorical columns
eccabay Jun 2, 2020
8659ef0
Add increased testing
eccabay Jun 3, 2020
331be4e
Merge branch 'master' into 776_sklearn_onehot
eccabay Jun 3, 2020
104dddc
Update changelog
eccabay Jun 3, 2020
1f4059b
Fix lint and unit test issues
eccabay Jun 3, 2020
24c75b7
Address PR comments
eccabay Jun 4, 2020
6511d28
Add missing support for categories argument (oops)
eccabay Jun 4, 2020
be1942b
Fix codecov by adding a test
eccabay Jun 4, 2020
d09c3a2
Merge branch 'master' into 776_sklearn_onehot
eccabay Jun 4, 2020
ab89776
fix master merge error
eccabay Jun 4, 2020
14dc0fd
Rework encoder to remove self.col_unique_vals (redundant w/categories)
eccabay Jun 5, 2020
6505d72
Address PR comments
eccabay Jun 11, 2020
64cc83f
Add missing tests
eccabay Jun 11, 2020
a7f5088
Merge branch 'master' into 776_sklearn_onehot
eccabay Jun 11, 2020
3813863
Remove comments and fix ValueError matching
eccabay Jun 11, 2020
ed36917
Merge master into branch
eccabay Jun 12, 2020
bbd8357
Fix test failing caused by merge
eccabay Jun 12, 2020
490a04f
Merge branch 'master' into 776_sklearn_onehot
eccabay Jun 15, 2020
f6e6d67
Final PR fixes
eccabay Jun 16, 2020
b7c3e56
Merge branch 'master' into 776_sklearn_onehot
eccabay Jun 16, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Rework encoder to remove self.col_unique_vals (redundant w/categories)
  • Loading branch information
eccabay committed Jun 5, 2020
commit 14dc0fda79da5cda2b2d1cdcbfa54c601be3e5df
Original file line number Diff line number Diff line change
Expand Up @@ -70,17 +70,16 @@ def fit(self, X, y=None):
X = pd.DataFrame(X)
X_t = X
cols_to_encode = self._get_cat_cols(X_t)
self.col_unique_values = {}

# If there are no categorical columns, nothing needs to happen
if len(cols_to_encode) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have unit test coverage of this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, test_all_numerical_dtype starting at line 284

categories = 'auto'

# Use the categories parameter
if isinstance(self.parameters['categories'], list):
elif isinstance(self.parameters['categories'], list):
if top_n is not None:
raise ValueError("Cannot use categories and top_n arguments simultaneously")
categories = self.parameters['categories']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could things break if this comes back as an empty list?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming not, since we start with an empty list in the else block below, but just checking

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch, scikit-learn throws an error in that case. I'll add our own catch to provide more useful feedback.

for col in X_t[cols_to_encode]:
value_counts = X_t[col].value_counts(dropna=False).to_frame()
unique_values = value_counts.index.tolist()
self.col_unique_values[col] = np.sort(unique_values)

# Use the top_n parameter
else:
Expand All @@ -98,12 +97,8 @@ def fit(self, X, y=None):
value_counts = value_counts.sort_values([col], ascending=False, kind='mergesort')
unique_values = value_counts.head(top_n).index.tolist()
unique_values = np.sort(unique_values)
self.col_unique_values[col] = unique_values
categories.append(unique_values)

if len(cols_to_encode) == 0:
categories = 'auto'

# Create an encoder to pass off the rest of the computation to
encoder = SKOneHotEncoder(categories=categories,
drop=self.drop,
Expand All @@ -120,9 +115,7 @@ def transform(self, X, y=None):
Returns:
Transformed dataframe, where each categorical feature has been encoded into numerical columns using one-hot encoding.
"""
try:
col_values = self.col_unique_values
except AttributeError:
if self.encoder is None:
raise RuntimeError("You must fit one hot encoder before calling transform!")
if not isinstance(X, pd.DataFrame):
X = pd.DataFrame(X)
Expand All @@ -134,11 +127,11 @@ def transform(self, X, y=None):
X_t = pd.DataFrame()
# Add the non-categorical columns, untouched
for col in X.columns:
if col not in col_values:
if col not in cat_cols:
X_t = pd.concat([X_t, X[col]], axis=1)

# Call sklearn's transform on the categorical columns
if len(col_values) != 0:
if len(cat_cols) != 0:
X_cat = pd.DataFrame(self.encoder.transform(X[cat_cols]).toarray())
X_cat.columns = self.encoder.get_feature_names(input_features=cat_cols)
X_t = pd.concat([X_t.reindex(X_cat.index), X_cat], axis=1)
Expand Down