Does GetFeatureWeights support categorical splits? #3766
Comments
@codemzs, do you have any idea about this, since you were working on categorical splits?
While working on issue #3272 (and PR #5018), it looks to me like that issue is similar in nature to this one. The problem here is in how the gains are obtained by the following code, which reads each feature's index from SplitFeatures[node].
machinelearning/src/Microsoft.ML.FastTree/TreeEnsemble/InternalRegressionTree.cs Lines 1334 to 1335 in 8660ecc
As explained here the |
Hi, @rauhs. It seems the solution to this should be very straightforward. Do you happen to have a repro I could use to test the solution I'm working on, to see whether the results are what you would expect? Thanks!
Thanks for working on this. I don't have a repro right now as I'm on vacation. If absolutely necessary, I can provide one next week.
Thanks for answering, @rauhs. It's not strictly necessary, as creating a model that uses categorical splits is easy. Still, if possible, I would really like to have a repro from your side, since it always helps to learn how users are using ML.NET 😄 So I would still like to wait until you can provide a repro to test the solution. Thanks!
With this code I get an index-out-of-bounds exception. I can't reproduce the "-1" right now using the synthetic data.
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers.LightGbm;

public class GenericBinaryInstance
{
    public string A { get; set; }
    public float Num { get; set; }
    public bool Label { get; set; }
}

// Builds a repeating sequence "1", "2", ..., card that is at least total + 1 entries long.
public static string[] FeatureVector(int card, int total)
{
    var inner = Enumerable.Range(1, card).Select(x => x.ToString());
    var repeatCount = (int)Math.Ceiling((double)total / card) + 1;
    return Enumerable.Repeat(inner, repeatCount).SelectMany(x => x).ToArray();
}

public static void ReproduceLightGbmGetFeatureWeights()
{
    // Synthetic data: one categorical feature (A), one numeric feature (Num), and a binary label.
    var numInstances = 10_000;
    var axs = FeatureVector(4, numInstances);
    var labels = FeatureVector(2, numInstances);
    var rnd = new Random(1);
    var data = Enumerable.Range(1, numInstances).Select(x => new GenericBinaryInstance
    {
        A = axs[x],
        Num = (float)rnd.NextDouble(),
        Label = labels[x] == "1"
    });

    var ctx = new MLContext(1);
    var options = new LightGbmBinaryTrainer.Options
    {
        UseCategoricalSplit = true,
        MinimumExampleCountPerLeaf = 1,
        MinimumExampleCountPerGroup = 1,
    };
    options.Booster = new GradientBooster.Options();

    // Encode A as a key, expand it into an indicator vector, and concatenate it with Num.
    var pipe = ctx.Transforms.Conversion.MapValueToKey("A")
        .Append(ctx.Transforms.Conversion.MapKeyToVector("A"))
        .Append(ctx.Transforms.Concatenate("Features", "Num", "A"));

    var dataView = ctx.Data.LoadFromEnumerable(data);
    var trainer = ctx.BinaryClassification.Trainers.LightGbm(options);
    var encoder = pipe.Fit(dataView);
    var trainEncoded = encoder.Transform(dataView);
    var model = trainer.Fit(trainEncoded);

    // Read the per-feature weights from the trained model.
    var weightsBuf = new VBuffer<float>();
    model.Model.SubModel.GetFeatureWeights(ref weightsBuf);
    var weights = weightsBuf.GetValues().ToArray();
}
Version: 1.0
I'm training my multiclass LightGBM model with mostly categorical features (which works great). But when I call GetFeatureWeights on the binary predictors, I get huge values for the feature at index -1. Yet when I inspect the models in the debugger, the trees almost exclusively use categorical features for their splits. It seems that the GainMap doesn't actually consider any categorical splits and just assigns all of those gains to index -1, which makes the feature weights vector completely useless in my case.
Is this something that will be supported? Or am I wrong here?
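To make the reported behaviour concrete, here is a minimal, self-contained sketch of how a gain map keyed only on a per-node split feature index could lump every categorical split under -1, and what a categorical-aware alternative might look like. The SketchTree type and its fields are hypothetical stand-ins, not ML.NET's internal tree type, and splitting a node's gain evenly across its categorical slots is an assumption made for illustration, not necessarily what ML.NET does.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for a trained tree (not ML.NET's internal type).
public sealed class SketchTree
{
    // Per-node split feature index; assumed to be -1 for categorical nodes.
    public int[] SplitFeatures;
    // Per-node split gain.
    public double[] SplitGains;
    // Per-node list of feature slots used by a categorical split; null for numeric nodes.
    public int[][] CategoricalSplitFeatures;
}

public static class GainMapSketch
{
    private static void Add(Dictionary<int, double> map, int feature, double gain)
    {
        map.TryGetValue(feature, out var current); // current is 0 when the key is missing
        map[feature] = current + gain;
    }

    // Keys every gain by SplitFeatures[node], so categorical gains pile up under -1.
    public static Dictionary<int, double> Naive(SketchTree tree)
    {
        var map = new Dictionary<int, double>();
        for (int node = 0; node < tree.SplitFeatures.Length; node++)
            Add(map, tree.SplitFeatures[node], tree.SplitGains[node]);
        return map;
    }

    // Credits categorical gains to the slots involved (evenly split: an assumption).
    public static Dictionary<int, double> CategoricalAware(SketchTree tree)
    {
        var map = new Dictionary<int, double>();
        for (int node = 0; node < tree.SplitFeatures.Length; node++)
        {
            var slots = tree.CategoricalSplitFeatures[node];
            if (slots == null || slots.Length == 0)
            {
                Add(map, tree.SplitFeatures[node], tree.SplitGains[node]);
                continue;
            }
            foreach (var slot in slots)
                Add(map, slot, tree.SplitGains[node] / slots.Length);
        }
        return map;
    }
}
```

With one numeric node on feature 0 (gain 2.0) and one categorical node over slots 1 and 3 (gain 4.0), the naive map holds entries for 0 and -1, while the categorical-aware map holds entries for 0, 1, and 3 and never produces a negative index.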