Added pandas support #889

franchuterivera · 2020-07-04T13:24:26Z

Sklearn's check_array method already provides support for numerical dataframes.

Nevertheless, when a categorical value is given, we want to automatically encode it.

Also, we want to better guide the user when a dataframe has a problematic type.

Added tests to make sure dataframes can be handled.
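To illustrate the description above, here is a minimal sketch and not the code in this PR. It shows `sklearn.utils.check_array` accepting a purely numerical dataframe, plus a hypothetical helper, `problematic_columns`, that flags columns which are neither numerical nor categorical so that a clearer message could be shown to the user. The helper name and its rule are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

# A purely numerical dataframe: check_array converts it to a 2-D numpy array.
X_num = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4, 5, 6]})
X_checked = check_array(X_num)


def problematic_columns(df):
    """Hypothetical helper: names of columns that are neither numerical
    nor of pandas 'category' dtype (for example, raw string columns)."""
    bad = []
    for name, dtype in df.dtypes.items():
        if isinstance(dtype, pd.CategoricalDtype):
            continue  # categorical columns can be encoded automatically
        if not pd.api.types.is_numeric_dtype(dtype):
            bad.append(name)
    return bad


X_mixed = pd.DataFrame({
    "num": [1.0, 2.0],
    "cat": pd.Categorical(["u", "v"]),
    "text": ["x", "y"],  # a plain string column, not declared as category
})
flagged = problematic_columns(X_mixed)
```

In this sketch only the `text` column is flagged, since it would need to be cast to `category` (or to a numerical type) before it could be handled.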

franchuterivera · 2020-07-04T13:27:32Z

Hi Matthias, I do want to ask you something about the encoding of a dataframe, when a categorical value is given.
By default it is set up to be:
feature_encoder_type: str = 'OrdinalEncoder',
target_encoder_type: str = 'LabelEncoder',

Please let me know if this should be changed. I found an issue with the LabelEncoder and column transformer, so I had to move to ordinal encoder (something similar to https://stackoverflow.com/questions/46162855/fit-transform-takes-2-positional-arguments-but-3-were-given-with-labelbinarize).
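For readers following the encoder discussion, the sketch below shows one way the two defaults can fit together. It is an assumption about usage, not the PR's implementation: `OrdinalEncoder` is applied to the categorical feature columns through a `ColumnTransformer`, while `LabelEncoder` is applied to the target separately, because `LabelEncoder.fit_transform` only accepts `y` and so does not fit the `(X, y)` calling convention of a column transformer. The toy data and variable names are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy data (illustrative only): one categorical and one numerical feature.
X = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "red", "green"]),
    "size": [1.0, 2.5, 3.0, 0.5],
})
y = pd.Series(["yes", "no", "yes", "no"])

# Encode only the columns with 'category' dtype; pass the rest through.
cat_cols = list(X.select_dtypes(include="category").columns)
feature_encoder = ColumnTransformer(
    [("ordinal", OrdinalEncoder(), cat_cols)],
    remainder="passthrough",
)
X_enc = feature_encoder.fit_transform(X)

# The target is encoded on its own, outside the column transformer.
target_encoder = LabelEncoder()
y_enc = target_encoder.fit_transform(y)
```

The encoded categorical column comes first in `X_enc`, followed by the passed-through numerical column.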

mfeurer

Great job, that'll improve a lot of things in Auto-sklearn! I left a lot of small notes throughout the PR with a few requests for additional tests and clarifications.

autosklearn/data/validation.py

test/test_data/test_validation.py

franchuterivera · 2020-07-08T17:42:02Z

I am wondering if BaseAutoML is still needed?

Really, we can inherit from automl directly, on each estimator. Please let me know your thoughts!

mfeurer · 2020-07-10T11:34:17Z

It appears that there's an issue with the unit test for pandas: https://travis-ci.org/github/automl/auto-sklearn/jobs/706628514 Could you please check and re-request a review?

franchuterivera · 2020-07-10T12:23:04Z

With the requested dataset (2) the following behavior is observed:

No problems when running: python -m pytest test -v -k test_classification_pandas_support
Random hangs when running python -m pytest test -v, with the following last message in the log file:
[DEBUG] [2020-07-10 13:47:51,619:AutoML(1):3d08f2ce6767ae31511106cfc2ae94da] /home/chico/work/auto- sklearn_fork_pandas/venv/lib/python3.7/site-packages/sklearn/feature_selection/_univariate_selection.py:115: RuntimeWarning:invalid value encountered in true_divide

For the purpose of the test, I believe using the Australian dataset, which has the following column dtypes, should be sufficient.

X.dtypes
A1 category
A2 float64
A3 float64
A4 category
A5 category
A6 category
A7 float64
A8 category
A9 category
A10 float64
A11 category
A12 category
A13 float64
A14 float64
dtype: object
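As a rough illustration of how such a dtype listing could be turned into per-column feature types, here is a small sketch. The helper `infer_feat_types` and the labels `"categorical"` and `"numerical"` are hypothetical and chosen for this example; they are not taken from the PR.

```python
import pandas as pd


def infer_feat_types(df):
    """Hypothetical helper: label each column by its pandas dtype."""
    return [
        "categorical" if isinstance(dtype, pd.CategoricalDtype) else "numerical"
        for dtype in df.dtypes
    ]


# A tiny stand-in frame with the same kind of dtype mix as listed above.
X_demo = pd.DataFrame({
    "A1": pd.Categorical(["0", "1", "1"]),
    "A2": [22.08, 22.67, 29.58],
    "A3": [11.46, 7.0, 1.75],
    "A4": pd.Categorical(["2", "2", "1"]),
})
feat_types = infer_feat_types(X_demo)
```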

I will add investigating the random crash with dataset 2 to my to-do list.

mfeurer

This looks good, I just have a few more minor remarks.

autosklearn/automl.py

autosklearn/data/validation.py

test/test_automl/test_estimators.py

test/test_data/test_validation.py

mfeurer · 2020-07-14T16:16:56Z

I am wondering if BaseAutoML is still needed?

I currently don't see a reason to keep it.

codecov-commenter · 2020-07-15T17:50:30Z

Codecov Report

Merging #889 into development will increase coverage by 0.08%.
The diff coverage is 90.41%.

@@               Coverage Diff               @@
##           development     #889      +/-   ##
===============================================
+ Coverage        85.14%   85.22%   +0.08%     
===============================================
  Files              129      130       +1     
  Lines             9531     9603      +72     
===============================================
+ Hits              8115     8184      +69     
- Misses            1416     1419       +3

Impacted Files	Coverage Δ
autosklearn/data/validation.py	`89.43% <89.43%> (ø)`
autosklearn/automl.py	`81.35% <95.65%> (-1.21%)`	⬇️
autosklearn/estimators.py	`91.17% <100.00%> (+0.58%)`	⬆️
autosklearn/smbo.py	`73.81% <0.00%> (+1.09%)`	⬆️
...omponents/data_preprocessing/data_preprocessing.py	`91.01% <0.00%> (+2.24%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0d80ac0...8b50a28.

franchuterivera · 2020-07-16T22:20:15Z

Added dataset 2 after removing SVM as a candidate estimator.
Before that change, the run consistently ended up using this configuration:
'classifier:libsvm_svc:C': 1007.8868860667042
'classifier:libsvm_svc:gamma': 0.0009693320195457126
'classifier:libsvm_svc:kernel': 'poly'
'classifier:libsvm_svc:max_iter': -1
'classifier:libsvm_svc:shrinking': 'True'
'classifier:libsvm_svc:tol': 0.00048384544670559135

During a re-fit of this model, scikit-learn gets stuck in fit. For this particular dataset, this estimator class should not be picked. We can circumvent this by limiting the number of iterations, but this might be dataset-specific.
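The iteration-limit workaround mentioned above could look roughly like the sketch below. It uses scikit-learn's `SVC` directly with the hyperparameters from the comment and replaces `max_iter=-1` (no limit) with a finite cap. The cap of 1000 and the synthetic data are assumptions for illustration; this is not how the PR resolved the problem.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import SVC

# Synthetic stand-in data (illustrative only, not dataset 2).
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SVC(
    C=1007.8868860667042,
    gamma=0.0009693320195457126,
    kernel="poly",
    tol=0.00048384544670559135,
    shrinking=True,
    max_iter=1000,  # assumed cap; -1 would mean no iteration limit
)

# The solver may stop early at the cap, which scikit-learn reports as a
# ConvergenceWarning; it is silenced here to keep the sketch quiet.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    clf.fit(X, y)

preds = clf.predict(X)
```

A finite cap trades a possible hang for a possibly under-fitted model, which is why a suitable value is likely to depend on the dataset.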

test/test_data/test_validation.py

autosklearn/data/validation.py

autosklearn/automl.py

mfeurer · 2020-07-22T16:02:23Z

I just played with this PR and found the following issue with scikit-learn and pandas: the OrdinalEncoder cannot yet handle pandas categorical columns with NaNs and cannot handle new categories: scikit-learn/scikit-learn#17123 and scikit-learn/scikit-learn#15796

Therefore, I'm afraid we need a few more tests to deal with NaN values:

numpy, categorical column without NaN
numpy, categorical column with NaN
numpy, numerical column without NaN
numpy, numerical column with NaN
pandas, categorical column without NaN
pandas, categorical column with NaN -> this will fail. If we encode it as a unit test we'll realize immediately when scikit-learn supports this
pandas, numerical column without NaN
pandas, numerical column with NaN
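The eight cases listed above can be generated in one place, for instance to feed a parametrized test. The sketch below is only an assumption about how such fixtures might be built. The names `build_cases` and `has_nan` are hypothetical, and the check shown is limited to detecting missing values with `pd.isna`, which works for both numpy and pandas inputs.

```python
import numpy as np
import pandas as pd


def build_cases():
    """Hypothetical fixture builder: (container, kind, has NaN) -> data."""
    cat = ["a", "b", "a", "c"]
    cat_nan = ["a", np.nan, "a", "c"]
    num = [1.0, 2.0, 3.0, 4.0]
    num_nan = [1.0, np.nan, 3.0, 4.0]
    return {
        ("numpy", "categorical", False): np.array(cat, dtype=object).reshape(-1, 1),
        ("numpy", "categorical", True): np.array(cat_nan, dtype=object).reshape(-1, 1),
        ("numpy", "numerical", False): np.array(num).reshape(-1, 1),
        ("numpy", "numerical", True): np.array(num_nan).reshape(-1, 1),
        ("pandas", "categorical", False): pd.DataFrame({"x": pd.Categorical(cat)}),
        ("pandas", "categorical", True): pd.DataFrame({"x": pd.Categorical(cat_nan)}),
        ("pandas", "numerical", False): pd.DataFrame({"x": num}),
        ("pandas", "numerical", True): pd.DataFrame({"x": num_nan}),
    }


def has_nan(data):
    """True if a numpy array or a dataframe contains any missing value."""
    return bool(np.asarray(pd.isna(data)).any())


cases = build_cases()
nan_report = {key: has_nan(data) for key, data in cases.items()}
```

Each entry could then be passed to the validation code under test, with the expected outcome recorded per case.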

I triggered this behavior by loading dataset 2 with X, y = sklearn.datasets.fetch_openml(data_id=2, return_X_y=True, as_frame=True)

We also need new tests for what happens when new categories arrive at test time. I triggered the second error by loading dataset 3 with X, y = sklearn.datasets.fetch_openml(data_id=3, return_X_y=True, as_frame=True) inside example_holdout.py.
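The unseen-category failure can be reproduced on toy data without downloading anything. The sketch below is an illustration under stated assumptions, not the fix adopted in this PR. The second half uses `handle_unknown="use_encoded_value"`, an `OrdinalEncoder` option that only exists in scikit-learn 0.24 and later, so it was not available when this thread was written. The toy frames and the choice of -1 as the placeholder code are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X_train = pd.DataFrame({"color": ["red", "blue", "red"]})
X_test = pd.DataFrame({"color": ["blue", "purple"]})  # "purple" is unseen

# Default behaviour: an unseen category at transform time raises ValueError.
strict = OrdinalEncoder().fit(X_train)
try:
    strict.transform(X_test)
    raised = False
except ValueError:
    raised = True

# scikit-learn >= 0.24 only: map unseen categories to a placeholder code.
tolerant = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
).fit(X_train)
encoded = tolerant.transform(X_test)
```

A test along these lines would make the behaviour on new categories explicit for both the strict and the tolerant setting.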

This is unfortunately way more complicated than anticipated.

autosklearn/data/validation.py

test/test_data/test_validation.py

* Added pandas support
* Cleanup and incorporate comments feedback
* Remove old comments
* Fix fit ensemble and add more checks
* Move to diff dataset
* Remove print message
* Incorporate feedback comments
* Add dataset 2 to pandas check
* Improved messaging and more testing
* More test for NaN
* NaN columns not changed
* Update validation.py
* Update validation.py
* Update test_validation.py
* Update automl.py

mfeurer reviewed Jul 7, 2020

franchuterivera force-pushed the pandas_support branch from a261edd to 73322f2 on July 8, 2020 17:43

franchuterivera requested a review from mfeurer July 10, 2020 10:26

mfeurer requested changes Jul 14, 2020

franchuterivera requested a review from mfeurer July 17, 2020 11:42

mfeurer reviewed Jul 20, 2020

test/test_data/test_validation.py

autosklearn/data/validation.py (outdated)

autosklearn/automl.py

franchuterivera added 10 commits July 26, 2020 01:19

Added pandas support

c8d74a8

Cleanup and incorporate comments feedback

e5117a7

Remove old comments

f4db4dc

Fix fit ensemble and add more checks

d61d986

Move to diff dataset

d4c386e

Remove print message

e301299

Incorporate feedback comments

dce7d38

Add dataset 2 to pandas check

a9b8647

Improved messaging and more testing

e259833

More test for NaN

1afe35f

franchuterivera force-pushed the pandas_support branch from 519bddc to 1afe35f on July 25, 2020 23:20

franchuterivera requested a review from mfeurer July 26, 2020 13:08

mfeurer requested changes Jul 27, 2020

franchuterivera and others added 3 commits July 27, 2020 23:23

NaN columns not changed

5de723d

Update validation.py

065c779

Update validation.py

186838f

mfeurer and others added 2 commits July 28, 2020 11:25

Update test_validation.py

2e08970

Merge branch 'development' into pandas_support

75984b7

mfeurer approved these changes Jul 28, 2020

Update automl.py

8b50a28

mfeurer merged commit f6b358c into automl:development Jul 28, 2020

This was referenced Jul 28, 2020

AutoSklearnRegressor ignores feat_type #722

Closed

Feature request: handle pandas #157

Closed

rabsr mentioned this pull request Nov 2, 2020

Execution fails when data contains missing values #990

Closed
