
Support for Categorical features in CalculateFeatureContribution of LightGBM #5018

Merged
merged 4 commits into from
Apr 21, 2020

Conversation

antoniovs1029
Member

Fixes #3272

As explained here, CalculateFeatureContribution threw an exception when used on LightGBM models that had UseCategoricalSplit enabled, because there was no support for calculating feature contributions for categorical features. This PR adds that support, along with one test that replicates the original scenario where the exception was thrown.

@antoniovs1029 antoniovs1029 requested a review from a team as a code owner April 12, 2020 08:29
Comment on lines 1525 to 1543
foreach (var index in CategoricalSplitFeatures[node])
{
    float fv = GetFeatureValue(src.GetItemOrDefault(index), node);
    if (fv > 0.0f)
    {
        newNode = GtChild[node];
        otherWay = LteChild[node];
        break;
    }
}

// What if we went the other way?
var ghostLeaf = GetLeafFrom(in src, otherWay);
var ghostOutput = GetOutput(ghostLeaf);

// If the ghost got a smaller output, the contribution of the categorical features is positive, so
// the contribution is true minus ghost.
foreach (var ifeat in CategoricalSplitFeatures[node])
    contributions.AddFeature(ifeat, (float)(trueOutput - ghostOutput));
Member Author

@antoniovs1029 antoniovs1029 Apr 12, 2020

Code-wise, I think this is the correct way to find which features are involved in the categorical split of this node (it is done in a similar way here). I also tried to make this analogous to how feature contribution is already calculated for non-categorical features (here).

But I don't know if this is the "mathematically correct" way of calculating feature contributions for categorical features. I can think of a couple of alternatives, but I wouldn't know which one to choose.

Member Author

So I've updated this. I am still not sure the updated version is correct, but I think it's the closest to how feature contribution is calculated for numerical feature splits. It is also consistent with the other cases (FastTree, Gam, etc.), where categorical features are treated the same as any other feature (ignoring the fact that they're categorical) when calculating feature contribution.
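To make the "true minus ghost" idea concrete, here is a minimal, self-contained sketch on a toy tree. It is not the ML.NET implementation: the class name, the arrays, and the convention that a negative child value `c` denotes leaf `~c` are all assumptions made for illustration. Only the routing rule (go to the greater-than child if any categorical slot of the node is above zero) and the contribution rule (true output minus ghost output, credited to every categorical feature of the node) are taken from the diff under review.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical toy model; none of these members are ML.NET APIs.
public static class CategoricalContributionSketch
{
    // One internal node (index 0). Negative child c means leaf ~c (assumed encoding).
    static readonly int[] LteChild = { ~0 };
    static readonly int[] GtChild = { ~1 };
    // One-hot slots that take part in the categorical split of each node.
    static readonly int[][] CategoricalSplitFeatures = { new[] { 1, 2 } };
    static readonly double[] LeafValues = { 1.0, 4.0 };

    // Picks the child the input takes at this node and reports the child it did not take.
    static int Route(float[] src, int node, out int otherWay)
    {
        int next = LteChild[node];
        otherWay = GtChild[node];
        foreach (int index in CategoricalSplitFeatures[node])
        {
            if (src[index] > 0.0f)
            {
                next = GtChild[node];
                otherWay = LteChild[node];
                break;
            }
        }
        return next;
    }

    // Follows the tree from the given node (or leaf) down to a leaf index.
    static int GetLeafFrom(float[] src, int node)
    {
        while (node >= 0)
            node = Route(src, node, out _);
        return ~node;
    }

    public static Dictionary<int, double> Contributions(float[] src)
    {
        var contributions = new Dictionary<int, double>();
        double trueOutput = LeafValues[GetLeafFrom(src, 0)];
        int node = 0;
        while (node >= 0)
        {
            int next = Route(src, node, out int otherWay);
            // What if we went the other way at this node?
            double ghostOutput = LeafValues[GetLeafFrom(src, otherWay)];
            foreach (int ifeat in CategoricalSplitFeatures[node])
            {
                contributions.TryGetValue(ifeat, out double sum);
                contributions[ifeat] = sum + (trueOutput - ghostOutput);
            }
            node = next;
        }
        return contributions;
    }
}
```

With the input `{ 0, 0, 1, 0 }`, slot 2 is set, so the true path ends in the leaf with value 4.0 and the ghost path in the leaf with value 1.0. Both categorical slots (1 and 2) are credited with the same difference, which mirrors the open question in this thread: every feature of the split receives the full contribution, whether or not its slot was the one that fired.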

@rauhs
Contributor

rauhs commented Apr 12, 2020

Does this also affect #3766 ?

@antoniovs1029
Member Author

antoniovs1029 commented Apr 13, 2020

Hi, @rauhs. This PR won't affect issue #3766. Nonetheless, from a quick look at that issue, it seems similar in nature to issue #3272, which is fixed in this PR.

So the fix here won't resolve that other issue. I've left some comments on that issue explaining what needs to be done to solve it.

Thanks for pointing to that issue!



@@ -57,7 +57,7 @@ public abstract class RegressionTreeBase
/// (2) the categorical features indexed by <see cref="GetCategoricalCategoricalSplitFeatureRangeAt(int)"/>'s
/// returned value with nodeIndex=i is NOT a sub-set of <see cref="GetCategoricalSplitFeaturesAt(int)"/> with
/// nodeIndex=i.
-/// Note that the case (1) happens only when <see cref="CategoricalSplitFlags"/>[i] is true and otherwise (2)
+/// Note that the case (1) happens only when <see cref="CategoricalSplitFlags"/>[i] is false and otherwise (2)
Member Author

It looks to me like this doc was wrong, as it's inconsistent with what these other lines below say:

/// <summary>
/// <see cref="NumericalSplitFeatureIndexes"/>[i] is the feature index used by the splitting function of the
/// i-th node. This value is valid only if <see cref="CategoricalSplitFlags"/>[i] is false.
/// </summary>
public IReadOnlyList<int> NumericalSplitFeatureIndexes => _numericalSplitFeatureIndexes;

/// <summary>
/// Return categorical thresholds used at node indexed by nodeIndex. If the considered input feature does NOT
/// match any of the values returned by <see cref="GetCategoricalSplitFeaturesAt(int)"/>, we call it a
/// less-than-threshold event and therefore <see cref="LeftChild"/>[nodeIndex] is the child node that input
/// should go next. The returned value is valid only if <see cref="CategoricalSplitFlags"/>[nodeIndex] is true.
/// </summary>
public IReadOnlyList<int> GetCategoricalSplitFeaturesAt(int nodeIndex)
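Read together, the two doc comments above imply the following dispatch at a node. This is only a sketch under stated assumptions: `SplitDispatchSketch` and `GoesLeft` are hypothetical names, the numerical rule is assumed to be "less than or equal to the threshold goes left", and "the input matches a categorical feature" is interpreted as the corresponding slot being greater than zero, as in the code added by this PR.

```csharp
using System;

// Hypothetical helper, not part of RegressionTreeBase.
public static class SplitDispatchSketch
{
    // Returns true when the input should continue to the left child of the node.
    public static bool GoesLeft(
        bool categoricalSplitFlag,
        float[] input,
        int numericalSplitFeatureIndex,
        float numericalSplitThreshold,
        int[] categoricalSplitFeatures)
    {
        if (!categoricalSplitFlag)
        {
            // Case (1): numerical split, only valid when the flag is false.
            return input[numericalSplitFeatureIndex] <= numericalSplitThreshold;
        }

        // Case (2): categorical split, only valid when the flag is true.
        // The input goes left when it matches none of the node's categorical features.
        foreach (int index in categoricalSplitFeatures)
        {
            if (input[index] > 0.0f)
                return false;
        }
        return true;
    }
}
```

Under this reading, case (1) can only apply when the flag is false, which is what the corrected doc line states.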

@antoniovs1029
Member Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@antoniovs1029 antoniovs1029 merged commit 37edde9 into dotnet:master Apr 21, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022
Successfully merging this pull request may close these issues.

Exception on using LightGBM trainer with FeatureContributionCalculation and OneHotEncoding
3 participants