Skip to content

Updated handling of missing values with LightGBM, and added ability to use (0) as missing value #4695

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
Feb 10, 2020

Conversation

mstfbl
Copy link
Contributor

@mstfbl mstfbl commented Jan 23, 2020

Fixes #4681 , and also adds the ability to use the numerical value (0) as missing value with LightGBM. In addition, I've also updated the baseline results for LightGbm as it is normal for them to change with CursOpt.AllFeatures, and all added a unit test checking that a previously trained LightGBM model with the CursOpt.Features flag produces the baselines it did before.

@mstfbl mstfbl requested a review from a team as a code owner January 23, 2020 12:34
@mstfbl mstfbl requested a review from a team as a code owner January 24, 2020 12:52
@mstfbl mstfbl changed the title Update LightGbmTrainerBase.cs Updated handling of missing values with LightGBM, and added ability to use (0) as missing value Jan 28, 2020
@mstfbl
Copy link
Contributor Author

mstfbl commented Jan 28, 2020

The current CI tests are failing due to the current generated netcoreapp core_manifest.json not exactly matching with the generated baseline core_manifest.json. The difference here is caused by the newly added UseZeroAsMissingValue variable. I am currently investigating how this baseline is generated, so that I can update the generated file to match the correct core_manifest.json.

Output matches baseline: '..\Common\EntryPoints\core_ep-list.tsv'
*** Failure: Output and baseline mismatch at line 11652, expected ' "Name": "MinimumExampleCountPerGroup",' but got ' "Name": "UseZeroAsMissingValue",' : '..\Common\EntryPoints\core_manifest.json'
Test EntryPointCatalog: completed normally: failed

@justinormont
Copy link
Contributor

@mstfbl: We may want to document how to update the core_manifest.json and core_ep-list.tsv in the developer-guide.md.

I think the method is to unskip the RegenerateEntryPointCatalog test, run it, and push changes to the two files.

[Fact(Skip = "Execute this test if you want to regenerate the core_manifest and core_ep_list files")]
public void RegenerateEntryPointCatalog()
{
var (epListContents, jObj) = BuildManifests();
var buildPrefix = GetBuildPrefix();
var epListFile = buildPrefix + "_ep-list.tsv";
var entryPointsSubDir = Path.Combine("..", "Common", "EntryPoints");
var catalog = Env.ComponentCatalog;
var epListPath = GetBaselinePath(entryPointsSubDir, epListFile);
DeleteOutputPath(epListPath);
File.WriteAllLines(epListPath, epListContents);
var manifestFile = buildPrefix + "_manifest.json";
var manifestPath = GetBaselinePath(entryPointsSubDir, manifestFile);
DeleteOutputPath(manifestPath);
using (var file = File.OpenWrite(manifestPath))
using (var writer = new StreamWriter(file))
using (var jw = new JsonTextWriter(writer))
{
jw.Formatting = Formatting.Indented;
jObj.WriteTo(jw);
}
}

@yaeldekel
Copy link

Another option is to run EntryPointCatalog locally, and then copy the files created in the bin directory to test\BaselineOutput\Common\EntryPoints.


In reply to: 579335552 [](ancestors = 579335552)

Copy link
Contributor

@justinormont justinormont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@ganik ganik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls try to run old LightGBM model (trained without your changes) with your new code. Baseline numbers should be the same.

Copy link
Member

@ganik ganik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@mstfbl mstfbl merged commit 9d2b198 into dotnet:master Feb 10, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LightGBM trainer filters out rows with NaN features
5 participants