Skip to content

LightGBM trainer filters out rows with NaN features #4681

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rauhs opened this issue Jan 21, 2020 · 7 comments · Fixed by #4695
Closed

LightGBM trainer filters out rows with NaN features #4681

rauhs opened this issue Jan 21, 2020 · 7 comments · Fixed by #4695
Assignees
Labels
P1 Priority of the issue for triage purpose: Needs to be fixed soon.

Comments

@rauhs
Copy link
Contributor

rauhs commented Jan 21, 2020

LightGBM (binary trainer) will filter out NaN (feature)-values even though we have the option of HandleMissingValue which allows LightGBM to properly deal with missing values:

https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html

[TlcModule.SweepableDiscreteParam("UseMissing", new object[] { true, false })]

I think this:

var loadFlags = CursOpt.AllLabels | CursOpt.Features;

should also specify the flag "AllFeatures" so they're allowed through.

@yaeldekel yaeldekel added the P1 Priority of the issue for triage purpose: Needs to be fixed soon. label Jan 21, 2020
@yaeldekel
Copy link

Hi @rauhs , good suggestion, thanks!

@mstfbl
Copy link
Contributor

mstfbl commented Jan 23, 2020

Hi @rauhs , it is true that LightGBM filters NaN values by default. We have HandleMissingValue so that one can disable this by setting HandleMissingValue=false.
In addition, there is also the zero_as_missing flag in LightGBM, which allows the missing value to be representated by the numerical value 0. I have made a PR to address your suggestion and to add support for the zero_as_missing flag in LightGBM.

@mstfbl
Copy link
Contributor

mstfbl commented Feb 3, 2020

Update: When the requested change from Features to AllFeatures is made here: ...

var loadFlags = CursOpt.AllLabels | CursOpt.Features;
,

3 binary classification LightGbm tests fail: LightGBMClassificationTest, GossLightGBMTest, DartLightGBMTest. They use breast-cancer.txt as test and train file, as shown here:

public void LightGBMClassificationTest()
{
var learners = new[] { TestLearners.LightGBMClassifier };
var binaryClassificationDatasets = new List<TestDataset> { TestDatasets.breastCancerPipe };
foreach (var learner in learners)
{
foreach (TestDataset dataset in binaryClassificationDatasets)
Run_TrainTest(learner, dataset);
}
Done();
}
[LightGBMFact]
[TestCategory("Binary"), TestCategory("LightGBM")]
public void GossLightGBMTest()
{
var binaryPredictors = new[] { TestLearners.LightGBMGoss };
var binaryClassificationDatasets = new List<TestDataset> { TestDatasets.breastCancerPipe };
RunAllTests(binaryPredictors, binaryClassificationDatasets, extraTag: "goss");
Done();
}
[LightGBMFact]
[TestCategory("Binary")]
[TestCategory("LightGBM")]
public void DartLightGBMTest()
{
var binaryPredictors = new[] { TestLearners.LightGBMDart };
var binaryClassificationDatasets = new List<TestDataset> { TestDatasets.breastCancerPipe };
RunAllTests(binaryPredictors, binaryClassificationDatasets, extraTag: "dart");
Done();
}

The exact failures are due to calculated results not quite matching the baseline results. For example:

Message:
System.AggregateException : One or more errors occurred. (Values to compare are 240 and 238
AllowedVariance: 1E-07
delta: 2
delta2: 2
) (Output and baseline mismatch at line 11, expected ' positive || 240 | 1 | 0.9959' but got ' positive || 238 | 3 | 0.9876' : 'LightGBMBinary\LightGBM-TrainTest-breast-cancer-out.txt')

Again, these failing tests pass when AllFeatures is replaced by Features. @rauhs , would you happen to have any idea why this is happening?

@mstfbl
Copy link
Contributor

mstfbl commented Feb 4, 2020

Update: I investigated the IDataView data that is being passed to LightGbm right before its training. This particular IDataView contains 3 colums: Attributes "Attr", Label, and Features.

p1

The Label and Features columns are correctly mapped, but the Attributes column is not. It seems like this Attributes column is taking the 7th column in the breast-cancer.txt file.

p2

This might indicate a bug that exists within RoleMappedData and/or TextLoader.

@rauhs
Copy link
Contributor Author

rauhs commented Feb 5, 2020

  1. There are a bunch of "?" in that column in the original data set:

https://github.com/dotnet/machinelearning/blob/master/test/data/breast-cancer.txt#L58

  1. They're also completely filtered out in:

https://github.com/dotnet/machinelearning/blob/master/test/data/breast-cancer-withheader.txt#L1

Note: The first line in that file "// Number of Instances: 699 (as of 15 July 1992)" is a lie. There are only 682 instances.

Not sure if this is all on purpose or some accident. I don't see any comments.

@mstfbl
Copy link
Contributor

mstfbl commented Feb 5, 2020

Hi @rauhs, thank you, I also saw the "?" missing values in breast-cancer.txt. It looks like these missing values are not being imputed with the CursOpt.AllFeatures change. The TestDatasets.breastCancerPipe dataset has missing values, whereas TestDatasets.breastCancerPipeMissing does not have missing values. Thus if the failing tests use TestDatasets.breastCancerPipeMissing, these builds will pass with these changes.

@justinormont
Copy link
Contributor

@mstfbl commented 2020-02-05T17:30:49Z
Hi @rauhs, thank you, I also saw the "?" missing values in breast-cancer.txt. It looks like these missing values are not being imputed with the CursOpt.AllFeatures change. The TestDatasets.breastCancerPipe dataset has missing values, whereas TestDatasets.breastCancerPipeMissing does not have missing values. Thus if the failing tests use TestDatasets.breastCancerPipeMissing, these builds will pass with these changes.

I'd expect breastCancerPipe to change, and breastCancerPipeMissing to not change due to adding the AllFeatures flag.

Column 6 of breast-cancer.txt has the only NA values, which is read into both Mixed and More columns, which the NAHandle does run on for breastCancerPipeMissing.

So all cases of NA values are filled for breastCancerPipeMissing before LightGBM would see it (so no changes should be seen). And breastCancerPipe lets thru the NA values (so a baseline change should be seen).

Also, the next one should not change breastCancerPipeMissingFilter, as is drops the rows with with missing (NA) values.

tldr: This test output should change. Update breastCancerPipe baseline for LightGBM.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
P1 Priority of the issue for triage purpose: Needs to be fixed soon.
Projects
None yet
4 participants