-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Added pandas support #889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added pandas support #889
Conversation
Hi Matthias, I do want to ask you something about the encoding of a dataframe, when a categorical value is given. Please let me know if this should be changed. I found an issue with the LabelEncoder and column transformer, so I had to move to ordinal encoder (something similar to https://stackoverflow.com/questions/46162855/fit-transform-takes-2-positional-arguments-but-3-were-given-with-labelbinarize). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job, that'll improve a lot of things in Auto-sklearn! I left a lot of small notes throughout the PR with a few requests for additional tests and clarifications.
I am wondering if BaseAutoML is still needed? Really, we can inherit from automl directly, on each estimator. Please let me know your thoughts! |
a261edd
to
73322f2
Compare
It appears that there's an issue with the unit test for pandas: https://travis-ci.org/github/automl/auto-sklearn/jobs/706628514 Could you please check and re-request a review? |
With the requested dataset (2) the following behavior is observed:
For the purpose of the test, I believe using Australian, that has following categories, should be sufficient.
I will add to my list of things understanding the random crash when using dataset02. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, I just have a few more minor remarks.
I currently don't see a reason to keep it. |
Codecov Report
@@ Coverage Diff @@
## development #889 +/- ##
===============================================
+ Coverage 85.14% 85.22% +0.08%
===============================================
Files 129 130 +1
Lines 9531 9603 +72
===============================================
+ Hits 8115 8184 +69
- Misses 1416 1419 +3
Continue to review full report at Codecov.
|
Added dataset 2 after removing svm as a candidate for an estimator. During a re-fit of this model, sklearn is stuck in fit. For this particular dataset, this estimator class should not be picked. We can circumvent this by changing the number of iterations, but this might be dataset specific. |
I just played with this PR and found the following issue with scikit-learn and pandas: the OrdinalEncoder cannot yet handle pandas categorical columns with NaNs and cannot handle new categories: scikit-learn/scikit-learn#17123 and scikit-learn/scikit-learn#15796 Therefore, I'm afraid we need a few more tests to deal with NaN values:
I triggered this behavior by loading dataset 2 with and also new tests for what's happening when new categories arrive at test time. I triggered the 2nd error by loading dataset 3 with This is unfortunately way more complicated than anticipated. |
519bddc
to
1afe35f
Compare
* Added pandas support * Cleanup and incorporate comments feedback * Remove old comments * Fix fit ensemble and add more checks * Move to diff dataset * Remove print message * Incorporate feedback comments * Add dataset 2 to pandas check * Improved messaging and more testing * More test for NaN * NaN columns not changed * Update validation.py * Update validation.py * Update test_validation.py * Update automl.py
Sklearn check-array method is used to give numerical dataframe support already.
Nevertheless, when a categorical value is given, we want to automatically encode it.
Also, we want to better guide the user when a dataframe is having a problematic type.
Added test to make sure dataframes can be handled.