Does GetFeatureWeights support categorical splits? #3766

Open
rauhs opened this issue May 22, 2019 · 6 comments
Labels
bug (Something isn't working), lightgbm (Bugs related lightgbm), P1 (Priority of the issue for triage purpose: Needs to be fixed soon.)

Comments

@rauhs
Contributor

rauhs commented May 22, 2019

Version: 1.0

I'm training a multiclass LightGBM model with mostly categorical features (which works great). But when I call GetFeatureWeights on the binary predictors, I get huge values for the feature at index -1, even though inspecting the models in the debugger shows that the trees use categorical features almost exclusively for their splits. It seems that the GainMap doesn't consider categorical splits at all and assigns all of their gains to index -1, which makes the feature weights vector completely useless in my case.

Is this something that will be supported? Or am I wrong here?

@wschin
Member

wschin commented May 22, 2019

@codemzs, do you have any idea, as you were working on categorical splits?

ganik added the P2 (Needs to be fixed at some point) and question (Further information is requested) labels on May 23, 2019
antoniovs1029 added the bug (Something isn't working) and P1 (Needs to be fixed soon) labels and removed the P2 and question labels on Apr 13, 2020
@antoniovs1029
Member

antoniovs1029 commented Apr 13, 2020

While working on issue #3272 (and PR #5018), it looks to me like that issue is similar in nature to this one.

The issue here is that the gains are obtained with the following code, which takes each feature's index from SplitFeatures[node].

for (int n = 0; n < numNonLeaves; ++n)
    result[SplitFeatures[n]] += _splitGain[n];

As explained here, the SplitFeatures[] array holds "-1" for categorical splits and shouldn't be used for such splits (CategoricalSplitFeatures[][] should be used instead). So to solve this issue we should also add code to support categorical features when calculating the GainMap. The problem is that I don't know what the "mathematically correct" way to do it is. I will ask around to find the correct way, and open a PR with that code as well.
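
To make the symptom concrete, here is a minimal, self-contained sketch on toy arrays. It is not ML.NET code: the class GainMapSketch, its method names, and the dictionary-based gain map are hypothetical, and only the array names mirror the ones mentioned above. The first method repeats the quoted loop and shows how categorical nodes end up under key -1. The second shows one possible attribution, which shares a categorical node's gain equally among the feature indices in CategoricalSplitFeatures[n]. That equal split is purely an assumption for illustration, since the "mathematically correct" rule is still an open question in this thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not the ML.NET implementation.
public static class GainMapSketch
{
    // Mirrors the quoted loop: the feature index is read straight from
    // splitFeatures, so categorical nodes (marked -1) pile up under key -1.
    public static Dictionary<int, double> NaiveGainMap(
        int[] splitFeatures, double[] splitGain)
    {
        var result = new Dictionary<int, double>();
        for (int n = 0; n < splitFeatures.Length; ++n)
            Add(result, splitFeatures[n], splitGain[n]);
        return result;
    }

    // ASSUMPTION: the gain of a categorical node is shared equally among the
    // feature indices listed for that node. Other attribution rules are
    // possible; this one is chosen only to keep the example simple.
    public static Dictionary<int, double> CategoricalAwareGainMap(
        int[] splitFeatures, int[][] categoricalSplitFeatures, double[] splitGain)
    {
        var result = new Dictionary<int, double>();
        for (int n = 0; n < splitFeatures.Length; ++n)
        {
            int[] cats = categoricalSplitFeatures[n];
            bool isCategorical =
                splitFeatures[n] < 0 && cats != null && cats.Length > 0;

            if (!isCategorical)
            {
                // Numerical split (or a -1 node with no category list):
                // same behavior as the naive loop.
                Add(result, splitFeatures[n], splitGain[n]);
                continue;
            }

            foreach (int feature in cats)
                Add(result, feature, splitGain[n] / cats.Length);
        }
        return result;
    }

    // Adds gain to the entry for key, treating a missing key as 0.
    private static void Add(Dictionary<int, double> map, int key, double gain)
    {
        map.TryGetValue(key, out double current);
        map[key] = current + gain;
    }
}
```

For a toy tree with splitFeatures = { 0, -1, -1 }, categoricalSplitFeatures = { null, { 1, 2 }, { 3 } } and splitGain = { 1.0, 4.0, 2.0 }, the naive map puts a gain of 6.0 under key -1, while the categorical-aware sketch has no -1 entry and assigns 2.0 to each of features 1, 2 and 3.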

@antoniovs1029
Member

Hi, @rauhs. It seems the solution to this should be very straightforward. Do you happen to have a repro I could use to test the solution I'm working on and check that the results are what you would expect? Thanks!

@rauhs
Contributor Author

rauhs commented Apr 15, 2020

Thanks for working on this. I don't have a repro right now as I'm on vacation. If absolutely necessary, I can provide one next week.

@antoniovs1029
Member

Thanks for answering, @rauhs. It's not strictly necessary, as creating a model that uses categorical splits is easy. Still, if possible, I would really like to have a repro from your side, since it always helps to know how users are using ML.NET 😄 So I would still like to wait until you can provide a repro to test the solution. Thanks!

antoniovs1029 self-assigned this on Apr 15, 2020
@rauhs
Contributor Author

rauhs commented Apr 20, 2020

With the code below I get an index-out-of-bounds exception. I can't reproduce the "-1" right now using synthetic data.

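    // Needs: System, System.Linq, Microsoft.ML, Microsoft.ML.Data,
    // Microsoft.ML.Trainers.LightGbm (from the Microsoft.ML.LightGbm package).
    public class GenericBinaryInstance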
    {
      public string A { get; set; }
      public float Num { get; set; }
      public bool Label { get; set; }
    }


    public static string[] FeatureVector(int card, int total)
    {
      var inner  = Enumerable.Range(1, card).Select(x => x.ToString());
      var repeatCount = (int)Math.Ceiling((double)total / card) + 1;
      return Enumerable.Repeat(inner, repeatCount).SelectMany(x => x).ToArray();
    }

    public static void ReproduceLightGbmGetFeatureWeights()
    {
      var numInstances = 10_000;
      var axs = FeatureVector(4, numInstances);
      var labels = FeatureVector(2, numInstances);
      var rnd = new Random(1);
      var data = Enumerable.Range(1, numInstances).Select(x => new GenericBinaryInstance { A = axs[x], Num = (float)rnd.NextDouble(), Label = labels[x] == "1" });
      var ctx = new MLContext(1);
      var options = new LightGbmBinaryTrainer.Options
      {
        UseCategoricalSplit = true,
        MinimumExampleCountPerLeaf = 1,
        MinimumExampleCountPerGroup = 1,
      };
      options.Booster = new GradientBooster.Options();

      var pipe = ctx.Transforms.Conversion.MapValueToKey("A")
        .Append(ctx.Transforms.Conversion.MapKeyToVector("A"))
        .Append(ctx.Transforms.Concatenate("Features", "Num", "A"));
      var dataView = ctx.Data.LoadFromEnumerable(data);
      var trainer = ctx.BinaryClassification.Trainers.LightGbm(options);
      var encoder = pipe.Fit(dataView);
      var trainEncoded = encoder.Transform(dataView);
      var model = trainer.Fit(trainEncoded);

      var weightsBuf = new VBuffer<float>();
      model.Model.SubModel.GetFeatureWeights(ref weightsBuf);
      var weights = weightsBuf.GetValues().ToArray();
    }

harishsk added the lightgbm (Bugs related lightgbm) label on Apr 29, 2020
antoniovs1029 removed their assignment on Jul 13, 2020
5 participants