Classification stratificationColumn not supported for boolean column #1204

Closed
CESARDELATORRE opened this issue Oct 9, 2018 · 3 comments

@CESARDELATORRE
Contributor

For imbalanced datasets, stratified splitting divides the data so that a proportion of each target column value ends up in both the training and test datasets.

However, the following line of code throws an error if the column 'Label' is Boolean, which is very common for binary classification.

(trainData, testData) = classification.TrainTestSplit(data, testFraction: 0.2, stratificationColumn: "Label");

It works if the Label column is a float or one of the other supported types.

I might be missing something, but why is Boolean not supported for the stratificationColumn?
Can we support it since it can be a common scenario for binary classifications?

@Zruty0
Contributor

Zruty0 commented Oct 10, 2018

Stratification column is not what you think. Actually, if you read the documentation, it states that

If two examples share the same value of the stratificationColumn (if provided), they are guaranteed to appear in the same subset (train or test). Use this to make sure there is no label leakage from train to the test set

As you can see, it is ALWAYS a bad idea to use Label as a stratification column, and ALWAYS a bad idea to use a Boolean column as a stratification column.

That said, we currently support only float, key and string values for stratification (and I'm not sure about string). We should expand the coverage of the Hash estimator (see #1031).
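
To make the quoted guarantee concrete, here is a minimal sketch of a key-preserving split in plain C#. It is not the ML.NET implementation: `GroupSplitSketch`, `Split` and the FNV-1a hashing are hypothetical names and choices made for illustration only. The point it shows is that every row is assigned by a hash of its key, so rows that share a key can never end up on different sides.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only; not part of ML.NET.
public static class GroupSplitSketch
{
    // Deterministic FNV-1a hash over the key's characters.
    // (string.GetHashCode() may differ between processes, so it is avoided here.)
    private static uint Fnv1a(string key)
    {
        unchecked
        {
            uint hash = 2166136261u;
            foreach (char c in key)
            {
                hash ^= c;
                hash *= 16777619u;
            }
            return hash;
        }
    }

    // Assigns each row to train or test based only on the hash of its key,
    // so all rows with the same key land in the same subset.
    public static (List<T> Train, List<T> Test) Split<T>(
        IEnumerable<T> rows, Func<T, string> keySelector, double testFraction)
    {
        var train = new List<T>();
        var test = new List<T>();
        foreach (var row in rows)
        {
            // Map the 32-bit hash to a number in [0, 1).
            double u = Fnv1a(keySelector(row)) / 4294967296.0;
            (u < testFraction ? test : train).Add(row);
        }
        return (train, test);
    }
}
```

With a key that has only two distinct values, such as a Boolean label, this kind of split sends all rows with one value to a single subset, which is why such a column is a poor choice of key.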

@CESARDELATORRE
Contributor Author

CESARDELATORRE commented Mar 3, 2019

It would be good to be able to specify the "stratification column" as a parameter in TrainTestSplit().
The 'samplingKeyColumn' parameter is almost the opposite concept.
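
Until something like that exists, a label-stratified split can be done by hand. The following is a rough sketch under stated assumptions, not an ML.NET API: `StratifiedSplitSketch` and `Split` are hypothetical names, and the rows are assumed to fit in memory as an ordinary collection. It groups rows by label, shuffles each group with a seeded generator, and moves the requested fraction of every group into the test set.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only; not part of ML.NET.
public static class StratifiedSplitSketch
{
    // Splits rows so that each label value contributes roughly
    // testFraction of its rows to the test set.
    public static (List<T> Train, List<T> Test) Split<T, TLabel>(
        IEnumerable<T> rows, Func<T, TLabel> labelSelector,
        double testFraction, int seed = 0)
    {
        var rng = new Random(seed);
        var train = new List<T>();
        var test = new List<T>();

        foreach (var group in rows.GroupBy(labelSelector))
        {
            // Fisher-Yates shuffle within the label group.
            var items = group.ToList();
            for (int i = items.Count - 1; i > 0; i--)
            {
                int j = rng.Next(i + 1);
                (items[i], items[j]) = (items[j], items[i]);
            }

            // Take the rounded fraction of this group for the test set.
            int testCount = (int)Math.Round(items.Count * testFraction);
            test.AddRange(items.Take(testCount));
            train.AddRange(items.Skip(testCount));
        }
        return (train, test);
    }
}
```

For example, with 90 rows labeled false, 10 rows labeled true and a test fraction of 0.2, the test set receives 18 false rows and 2 true rows, so the class ratio is kept on both sides.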

@codemzs
Member

codemzs commented Jun 30, 2019

I agree with Pete. Closing this.

codemzs closed this as completed Jun 30, 2019
ghost locked as resolved and limited conversation to collaborators Mar 28, 2022