LightGBM trainer filters out rows with NaN features #4681
Comments
Hi @rauhs, good suggestion, thanks!
Hi @rauhs , it is true that LightGBM filters NaN values by default. We have |
Update: When the requested change from
3 binary classification LightGbm tests fail: LightGBMClassificationTest, GossLightGBMTest, and DartLightGBMTest. They use breast-cancer.txt as both the test and the train file, as shown here:
machinelearning/test/Microsoft.ML.Predictor.Tests/TestPredictors.cs
Lines 476 to 508 in bc7d86e
The exact failures are due to calculated results not quite matching the baseline results. For example:
Again, these failing tests pass when |
Update: I investigated the IDataView that is passed to LightGbm right before training. This particular IDataView contains 3 columns: Attributes ("Attr"), Label, and Features. The Label and Features columns are correctly mapped, but the Attributes column is not. It seems like this Attributes column is taking the 7th column in the This might indicate a bug that exists within |
https://github.com/dotnet/machinelearning/blob/master/test/data/breast-cancer.txt#L58
https://github.com/dotnet/machinelearning/blob/master/test/data/breast-cancer-withheader.txt#L1
Note: The first line in that file, "// Number of Instances: 699 (as of 15 July 1992)", is a lie. There are only 682 instances. Not sure if this is all on purpose or some accident. I don't see any comments. |
Hi @rauhs, thank you, I also saw the "?" missing values in |
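For anyone who wants to check the instance count and the "?" entries themselves, here is a rough Python sketch (not ML.NET code). It assumes a tab-separated file where lines starting with "//" are comments and "?" marks a missing value. The inline sample is made up for illustration and is not the real breast-cancer.txt, and the helper names parse_value and load_rows are hypothetical.

```python
import io
import math


def parse_value(token):
    """Map the '?' placeholder to NaN; parse everything else as a float."""
    return float("nan") if token.strip() == "?" else float(token)


def load_rows(stream, sep="\t"):
    """Read data rows, skipping blank lines and '//' comment lines."""
    rows = []
    for line in stream:
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        rows.append([parse_value(t) for t in line.split(sep)])
    return rows


# Made-up sample standing in for the real data file.
sample = io.StringIO(
    "// Number of Instances: 4 (made-up header for this sketch)\n"
    "0\t5\t1\t1\n"
    "1\t8\t?\t3\n"
    "0\t2\t1\t?\n"
    "1\t7\t4\t6\n"
)

rows = load_rows(sample)
# Count rows that contain at least one missing (NaN) value.
n_missing = sum(any(math.isnan(v) for v in row) for row in rows)
print(len(rows), n_missing)
```

Running the same two counts against the actual file (opened with open(...) instead of io.StringIO) would show how many rows a NaN filter would drop before training.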
I'd expect Column 6 of So all cases of Also, the next one should not change tldr: This test output should change. Update |
LightGBM (binary trainer) filters out rows with NaN feature values, even though we have the HandleMissingValue option, which allows LightGBM to deal with missing values properly: https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html
machinelearning/src/Microsoft.ML.LightGbm/LightGbmTrainerBase.cs
Line 178 in 7a4372e
I think this:
machinelearning/src/Microsoft.ML.LightGbm/LightGbmTrainerBase.cs
Line 439 in 7a4372e
should also specify the "AllFeatures" flag so that rows with NaN features are allowed through.
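To make the suggestion concrete, here is a minimal Python sketch (not the ML.NET implementation) of a row cursor whose flags decide whether rows with NaN features are skipped. The names CursorOptions, ALLOW_BAD_FEATURES, ALL_FEATURES, and iter_rows are hypothetical and only assumed to mirror the behaviour described in this issue.

```python
import math
from enum import Flag


class CursorOptions(Flag):
    """Hypothetical flags for this sketch; not the real ML.NET enum."""
    LABEL = 1
    FEATURES = 2
    ALLOW_BAD_FEATURES = 4
    # Assumed meaning: read features AND let NaN features through.
    ALL_FEATURES = FEATURES | ALLOW_BAD_FEATURES


def iter_rows(rows, options):
    """Yield (label, features); skip NaN-feature rows unless allowed."""
    allow_bad = CursorOptions.ALLOW_BAD_FEATURES in options
    for label, features in rows:
        if not allow_bad and any(math.isnan(v) for v in features):
            continue  # this is the filtering the issue complains about
        yield label, features


rows = [
    (0, [5.0, 1.0, 1.0]),
    (1, [8.0, float("nan"), 3.0]),
    (0, [2.0, 1.0, float("nan")]),
    (1, [7.0, 4.0, 6.0]),
]

# Without the "allow bad features" bit, rows with NaN are dropped.
filtered = list(iter_rows(rows, CursorOptions.LABEL | CursorOptions.FEATURES))
# With it, every row reaches the trainer, which can then handle NaN itself.
kept = list(iter_rows(rows, CursorOptions.LABEL | CursorOptions.ALL_FEATURES))
print(len(filtered), len(kept))
```

The point of the sketch is only that the filtering decision lives in the cursor flags, so a trainer that can handle missing values never sees them unless the flag lets them through.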