-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Tree-based featurization #3812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tree-based featurization #3812
Conversation
@@ -437,5 +436,45 @@ public static class TreeExtensions | |||
var env = CatalogUtils.GetEnvironment(catalog); | |||
return new FastForestBinaryTrainer(env, options); | |||
} | |||
|
|||
public static PretrainedTreeFeaturizationEstimator PretrainTreeEnsembleFeaturizing(this TransformsCatalog catalog, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
XML Doc on all public classes and APIs. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem. I am working on them. #Resolved
return new PretrainedTreeFeaturizationEstimator(env, options); | ||
} | ||
|
||
public static FastForestRegressionFeaturizationEstimator FastForestRegressionFeaturizing(this TransformsCatalog catalog, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This naming pattern reads a little funny. How about turning it into FeaturizeXXX
? Like we have with FeaturizeText
. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do FeaturizeBy...
. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds better (to me at least). #Resolved
/// "Leaves" + <see cref="OutputColumnsSuffix"/>, and "Paths" + <see cref="OutputColumnsSuffix"/>. If <see cref="OutputColumnsSuffix"/> | ||
/// is <see langword="null"/>, the output names would be "Trees", "Leaves", and "Paths". | ||
/// </summary> | ||
public string OutputColumnsSuffix; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We went away from magic strings in the TextTransform. Previously with tokens=+
, we produced a new column named {OutputColName}_TokenizedText
.
For the estimators API we have users directly enter the column name for the tokens. We may want to do the same for the Trees/Leaves/Paths of the TreeFeat.
Perhaps:
OutputColumnTreeName
, OutputColumnLeavesName
, OutputColumnPathsName
.
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And don't add that column if they empty or equal to null.
That way you can actually configure which parts of tree structure do you want. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As further background on the PR I was referencing...
Conversation about TextTransform:
via @rogancarr in #2957 PR
When using
OutputTokens=true
,FeaturizeText
creates a new column called${OutputColumnName}_TransformedText
. This isn't really well documented anywhere, and it's odd behavior. I suggest that we make the tokenized text column name explicit in the API.My suggestion would be the following:
- Change
OutputTokens = [bool]
toOutputTokensColumn = [string]
, and astring.NullOrWhitespace(OutputTokensColumn)
signifies that this column will not be created. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we can do optional output columns and custom output column names. Please see tests for examples (or wait for formal API samples). #Resolved
/// <param name="catalog">The context <see cref="TransformsCatalog"/> to create <see cref="FastTreeTweedieFeaturizationEstimator"/>.</param> | ||
/// <param name="options">The options to configure <see cref="FastTreeTweedieFeaturizationEstimator"/>. See <see cref="FastTreeTweedieFeaturizationEstimator.Options"/> and | ||
/// <see cref="TreeEnsembleFeaturizationEstimatorBase.CommonOptions"/> for available settings.</param> | ||
public static FastTreeTweedieFeaturizationEstimator FeaturizeByFastTreeTweedie(this TransformsCatalog catalog, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May want to note in its name that FastTreeTweedie is regression: (the naming of the others list their task types)
public static FastTreeTweedieFeaturizationEstimator FeaturizeByFastTreeTweedie(this TransformsCatalog catalog, | |
public static FastTreeTweedieRegressionFeaturizationEstimator FeaturizeByFastTreeTweedieRegression(this TransformsCatalog catalog, | |
``` #ByDesign |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes if the model name doesn't tell the task. Given that Tweedie
somehow implies a regression case, we don't have Regression
appended to any of public Tweedie
modules. This pattern can be seen in FastTreeTweedieTrainer
and FastTreeTweedieModelParameters
. #Resolved
TrainerOptions = trainerOptions | ||
}; | ||
|
||
var pipeline = ML.Transforms.FeaturizeByFastTreeBinary(options). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating (8) seperate ML.Transforms.FeaturizeBy{ FastTreeBinary, FastForestRegression, FastTreeRegression, FastTreeTweedie, ... }
featurizers under ML.Transforms.*
seems a bit large. Seems to clutter up the namespace.
In the future, this list should grow as I think we should have LightGBM variants too (and CatBoost if we take it in). The number of independent featurizers will be:
{ FastTree, FastTreeTweedie, FastForest, LightGBM, CatBoost } x { BinaryClassification, Multiclass, Regression, Ranking }`. (with some combinations missing)
Would it be more clean to have one ML.Transforms.TreeFeaturizer()
, and put the specific instance type as a parameter? Would it be doable to have one return type? #ByDesign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two reasons that I don't like a single TreeFeaturizer
.
- It goes toward an opposite direction of the C# API's design. Most of them are strongly typed to the underlying data structures. You can see we have a lot of
FastTree...
andLightGbm...
, which is intended. - It may requires user to specify
TreeFeaturizer<TTrainer>(options)
and the user manually needs to make sure the type ofoptions
isTTrainer.Options
, which is easy to make mistakes. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, the current pattern will likely have less mistakes by the user.
For the plethora of FastTree...
and LightGBM...
, those are namespaced under the task mlContext.Regression.Trainers.FastTree()
, hence the duplication isn't visible.
TrainerOptions = trainerOptions | ||
}; | ||
|
||
var pipeline = ML.Transforms.FeaturizeByFastForestRegression(options). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test of ML.Transforms.FeaturizeByFastForestRegression()
using FastForest's ShuffleLabels
option? I believe this is the only way to use TreeFeat w/ multi-class classification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make it in another PR. This PR has been too large.. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the UnitTests should be in the main PR. Specifically, I think we should ensure the multi-class case can work. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding.
#Resolved
test/Microsoft.ML.Tests/TrainerEstimators/TreeEnsembleFeaturizerTest.cs
Outdated
Show resolved
Hide resolved
test/Microsoft.ML.Tests/TrainerEstimators/TreeEnsembleFeaturizerTest.cs
Outdated
Show resolved
Hide resolved
…erTest.cs Co-Authored-By: Justin Ormont <[email protected]>
} | ||
|
||
[Fact] | ||
public void TreeEnsembleFeaturizingPipelineMulticlass() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daholste and I were able to map the Key
to a float
using a CustomMapping()
.
I'd recommend the multiclass unit test be:
public void TreeEnsembleFeaturizingPipelineMulticlass() | |
[Fact] | |
public void TreeEnsembleFeaturizingPipelineMulticlass() | |
{ | |
int dataPointCount = 1000; | |
var data = SamplesUtils.DatasetUtils.GenerateRandomMulticlassClassificationExamples(dataPointCount).ToList(); | |
var dataView = ML.Data.LoadFromEnumerable(data); | |
dataView = ML.Data.Cache(dataView); | |
var trainerOptions = new FastForestRegressionTrainer.Options | |
{ | |
NumberOfThreads = 1, | |
NumberOfTrees = 10, | |
NumberOfLeaves = 4, | |
MinimumExampleCountPerLeaf = 10, | |
FeatureColumnName = "Features", | |
LabelColumnName = "FloatLabel", | |
ShuffleLabels = true | |
}; | |
var options = new FastForestRegressionFeaturizationEstimator.Options() | |
{ | |
InputColumnName = "Features", | |
TreesColumnName = "Trees", | |
LeavesColumnName = "Leaves", | |
PathsColumnName = "Paths", | |
TrainerOptions = trainerOptions | |
}; | |
Action<RowWithKey, RowWithFloat> actionConvertKeyToFloat = (RowWithKey rowWithKey, RowWithFloat rowWithFloat) => | |
{ | |
rowWithFloat.FloatLabel = rowWithKey.KeyLabel == 0 ? float.NaN : rowWithKey.KeyLabel - 1; | |
}; | |
var split = ML.Data.TrainTestSplit(dataView, 0.5); | |
var trainData = split.TrainSet; | |
var testData = split.TestSet; | |
var pipeline = ML.Transforms.Conversion.MapValueToKey("KeyLabel", "Label") | |
.Append(ML.Transforms.CustomMapping(actionConvertKeyToFloat, "KeyLabel")) | |
.Append(ML.Transforms.FeaturizeByFastForestRegression(options)) | |
.Append(ML.Transforms.Concatenate("CombinedFeatures", "Trees", "Leaves", "Paths")) | |
.Append(ML.MulticlassClassification.Trainers.SdcaMaximumEntropy("KeyLabel", "CombinedFeatures")); | |
var model = pipeline.Fit(trainData); | |
var prediction = model.Transform(testData); | |
var metrics = ML.MulticlassClassification.Evaluate(prediction, labelColumnName: "KeyLabel"); | |
Assert.True(metrics.MacroAccuracy > 0.6); | |
Assert.True(metrics.MicroAccuracy > 0.6); | |
} | |
class RowWithKey | |
{ | |
[KeyType()] | |
public uint KeyLabel { get; set; } | |
} | |
class RowWithFloat | |
{ | |
public float FloatLabel { get; set; } | |
} |
Specifically, this is using a CustomMapping()
to convert the Key
to a float
for use in the TreeFeat's FastForest regression. The current method in the unit test requires a user to know/list all of the values in their Label (and their type). The CustomMapping()
style is easier for a user to replicate for their dataset.
We also added a TrainTestSplit()
so we're not testing on the training set, and we removed the original features from the Concatenate()
to ensure the TreeFeat's output features are useful.
public TreeEnsembleModelParameters ModelParameters; | ||
}; | ||
|
||
private TreeEnsembleModelParameters _modelParameters; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TreeEnsembleModelParameters [](start = 16, length = 27)
Should this be readonly or something similar to make sure it is not altered?
Since you have already built all this infrastructure why are we not providing the featurizers for |
// The 0-1 encoding of leaves the input feature vector falls into. | ||
public float[] Leaves { get; set; } | ||
// The 0-1 encoding of paths the input feature vector reaches the leaves. | ||
public float[] Paths { get; set; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit (same in other places if you are doing a revision of the PR):
// The 0-1 encoding of paths the input feature vector follows to reach the leaves.
/// and the i-th vector element is the prediction value predicted by the i-th tree. | ||
/// If <see cref="TreesColumnName"/> is <see langword="null"/>, this output column may not be generated. | ||
/// </summary> | ||
public string TreesColumnName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TreesColumnName [](start = 26, length = 15)
Suggested renaming:
TreesColumnName -> TreeOutputsColumnName
I think it would be easier to understand, but this is not necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Implement transformer * Initial draft of porting tree-based featurization * Internalize something * Add Tweedie and Ranking cases * Some small docs * Customize output column names * Fix save and load * Optional output columns * Fix a test and add some XML docs * Add samples * Add a sample * API docs * Fix one line * Add MC test * Extend a test further * Address some comments * Address some comments * Address comments * Comment * Add cache points * Update test/Microsoft.ML.Tests/TrainerEstimators/TreeEnsembleFeaturizerTest.cs Co-Authored-By: Justin Ormont <[email protected]> * Address comment * Add Justin's test * Reduce sample size * Update sample output
Fix #2482. Generating features using tree structure has been a popular technique in data mining. This PR exposes this internal-only feature to the public.
Since I don't have enough time to handle multiple different assignments at the same time, please don't put nit comments and create new issues instead. Thanks a lot.