Skip to content

Conversation

franchuterivera
Copy link
Contributor

Sklearn check-array method is used to give numerical dataframe support already.

Nevertheless, when a categorical value is given, we want to automatically encode it.

Also, we want to better guide the user when a dataframe is having a problematic type.

Added test to make sure dataframes can be handled.

@franchuterivera
Copy link
Contributor Author

Hi Matthias, I do want to ask you something about the encoding of a dataframe, when a categorical value is given.
By default it is setup to be:
feature_encoder_type: str = 'OrdinalEncoder',
target_encoder_type: str = 'LabelEncoder',

Please let me know if this should be changed. I found an issue with the LabelEncoder and column transformer, so I had to move to ordinal encoder (something similar to https://stackoverflow.com/questions/46162855/fit-transform-takes-2-positional-arguments-but-3-were-given-with-labelbinarize).

Copy link
Contributor

@mfeurer mfeurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job, that'll improve a lot of things in Auto-sklearn! I left a lot of small notes throughout the PR with a few requests for additional tests and clarifications.

@franchuterivera
Copy link
Contributor Author

I am wondering if BaseAutoML is still needed?

Really, we can inherit from automl directly, on each estimator. Please let me know your thoughts!

@mfeurer
Copy link
Contributor

mfeurer commented Jul 10, 2020

It appears that there's an issue with the unit test for pandas: https://travis-ci.org/github/automl/auto-sklearn/jobs/706628514 Could you please check and re-request a review?

@franchuterivera
Copy link
Contributor Author

With the requested dataset (2) the following behavior is observed:

  • No problems when doing: python -m pytest test -v -k test_classification_pandas_support
  • Random stuck when doing python -m pytest test -v, with the last message in the log file:
    [DEBUG] [2020-07-10 13:47:51,619:AutoML(1):3d08f2ce6767ae31511106cfc2ae94da] /home/chico/work/auto-  sklearn_fork_pandas/venv/lib/python3.7/site-packages/sklearn/feature_selection/_univariate_selection.py:115: RuntimeWarning:invalid value encountered in true_divide

For the purpose of the test, I believe using Australian, that has following categories, should be sufficient.

X.dtypes
A1 category
A2 float64
A3 float64
A4 category
A5 category
A6 category
A7 float64
A8 category
A9 category
A10 float64
A11 category
A12 category
A13 float64
A14 float64
dtype: object

I will add to my list of things understanding the random crash when using dataset02.

Copy link
Contributor

@mfeurer mfeurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, I just have a few more minor remarks.

@mfeurer
Copy link
Contributor

mfeurer commented Jul 14, 2020

I am wondering if BaseAutoML is still needed?

I currently don't see a reason to keep it.

@codecov-commenter
Copy link

codecov-commenter commented Jul 15, 2020

Codecov Report

Merging #889 into development will increase coverage by 0.08%.
The diff coverage is 90.41%.

Impacted file tree graph

@@               Coverage Diff               @@
##           development     #889      +/-   ##
===============================================
+ Coverage        85.14%   85.22%   +0.08%     
===============================================
  Files              129      130       +1     
  Lines             9531     9603      +72     
===============================================
+ Hits              8115     8184      +69     
- Misses            1416     1419       +3     
Impacted Files Coverage Δ
autosklearn/data/validation.py 89.43% <89.43%> (ø)
autosklearn/automl.py 81.35% <95.65%> (-1.21%) ⬇️
autosklearn/estimators.py 91.17% <100.00%> (+0.58%) ⬆️
autosklearn/smbo.py 73.81% <0.00%> (+1.09%) ⬆️
...omponents/data_preprocessing/data_preprocessing.py 91.01% <0.00%> (+2.24%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0d80ac0...8b50a28. Read the comment docs.

@franchuterivera
Copy link
Contributor Author

Added dataset 2 after removing svm as a candidate for an estimator.
Consistently, the code runs into using:
'classifier:libsvm_svc:C': 1007.8868860667042
'classifier:libsvm_svc:gamma': 0.0009693320195457126
'classifier:libsvm_svc:kernel': 'poly'
'classifier:libsvm_svc:max_iter': -1
'classifier:libsvm_svc:shrinking': 'True'
'classifier:libsvm_svc:tol': 0.00048384544670559135

During a re-fit of this model, sklearn is stuck in fit. For this particular dataset, this estimator class should not be picked. We can circumvent this by changing the number of iterations, but this might be dataset specific.

@franchuterivera franchuterivera requested a review from mfeurer July 17, 2020 11:42
@mfeurer
Copy link
Contributor

mfeurer commented Jul 22, 2020

I just played with this PR and found the following issue with scikit-learn and pandas: the OrdinalEncoder cannot yet handle pandas categorical columns with NaNs and cannot handle new categories: scikit-learn/scikit-learn#17123 and scikit-learn/scikit-learn#15796

Therefore, I'm afraid we need a few more tests to deal with NaN values:

  • numpy, categorical column without NaN
  • numpy, categorical column with NaN
  • numpy, numerical column without NaN
  • numpy, numerical column with NaN
  • pandas, categorical column without NaN
  • pandas, categorical column with NaN -> this will fail. If we encode it as a unit test we'll realize immediately when scikit-learn supports this
  • pandas, numerical column without NaN
  • pandas, numerical column with NaN

I triggered this behavior by loading dataset 2 with X, y = sklearn.datasets.fetch_openml(data_id=2, return_X_y=True, as_frame=True)

and also new tests for what's happening when new categories arrive at test time. I triggered the 2nd error by loading dataset 3 with X, y = sklearn.datasets.fetch_openml(data_id=3, return_X_y=True, as_frame=True) inside example_holdout.py.

This is unfortunately way more complicated than anticipated.

@mfeurer mfeurer merged commit f6b358c into automl:development Jul 28, 2020
franchuterivera added a commit to franchuterivera/auto-sklearn that referenced this pull request Aug 21, 2020
* Added pandas support

* Cleanup and incorporate comments feedback

* Remove old comments

* Fix fit ensemble and add more checks

* Move to diff dataset

* Remove print message

* Incorporate feedback comments

* Add dataset 2 to pandas check

* Improved messaging and more testing

* More test for NaN

* NaN columns not changed

* Update validation.py

* Update validation.py

* Update test_validation.py

* Update automl.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants