Skip to content

Tree-based featurization #3812

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Jun 26, 2019
Merged

Tree-based featurization #3812

merged 27 commits into from
Jun 26, 2019

Conversation

wschin
Copy link
Member

@wschin wschin commented Jun 3, 2019

Fix #2482. Generating features using tree structure has been a popular technique in data mining. This PR exposes this internal-only feature to the public.

Since I don't have enough time to handle multiple different assignments at the same time, please don't put nit comments and create new issues instead. Thanks a lot.

@wschin wschin requested review from justinormont and codemzs June 3, 2019 22:13
@wschin wschin self-assigned this Jun 3, 2019
@@ -437,5 +436,45 @@ public static class TreeExtensions
var env = CatalogUtils.GetEnvironment(catalog);
return new FastForestBinaryTrainer(env, options);
}

public static PretrainedTreeFeaturizationEstimator PretrainTreeEnsembleFeaturizing(this TransformsCatalog catalog,
Copy link
Member

@eerhardt eerhardt Jun 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XML Doc on all public classes and APIs. #Resolved

Copy link
Member Author

@wschin wschin Jun 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. I am working on them. #Resolved

return new PretrainedTreeFeaturizationEstimator(env, options);
}

public static FastForestRegressionFeaturizationEstimator FastForestRegressionFeaturizing(this TransformsCatalog catalog,
Copy link
Member

@eerhardt eerhardt Jun 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This naming pattern reads a little funny. How about turning it into FeaturizeXXX? Like we have with FeaturizeText. #Resolved

Copy link
Member Author

@wschin wschin Jun 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do FeaturizeBy.... #Resolved

Copy link
Member

@eerhardt eerhardt Jun 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds better (to me at least). #Resolved

/// "Leaves" + <see cref="OutputColumnsSuffix"/>, and "Paths" + <see cref="OutputColumnsSuffix"/>. If <see cref="OutputColumnsSuffix"/>
/// is <see langword="null"/>, the output names would be "Trees", "Leaves", and "Paths".
/// </summary>
public string OutputColumnsSuffix;
Copy link
Contributor

@justinormont justinormont Jun 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We went away from magic strings in the TextTransform. Previously with tokens=+, we produced a new column named {OutputColName}_TokenizedText.

For the estimators API we have users directly enter the column name for the tokens. We may want to do the same for the Trees/Leaves/Paths of the TreeFeat.

Perhaps:
OutputColumnTreeName, OutputColumnLeavesName, OutputColumnPathsName.

#Resolved

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jun 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And don't add that column if they empty or equal to null.
That way you can actually configure which parts of tree structure do you want. #Resolved

Copy link
Contributor

@justinormont justinormont Jun 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As further background on the PR I was referencing...

Conversation about TextTransform:
via @rogancarr in #2957 PR

When using OutputTokens=true, FeaturizeText creates a new column called ${OutputColumnName}_TransformedText. This isn't really well documented anywhere, and it's odd behavior. I suggest that we make the tokenized text column name explicit in the API.

My suggestion would be the following:

  • Change OutputTokens = [bool] to OutputTokensColumn = [string], and a string.NullOrWhitespace(OutputTokensColumn) signifies that this column will not be created. #Resolved

Copy link
Member Author

@wschin wschin Jun 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we can do optional output columns and custom output column names. Please see tests for examples (or wait for formal API samples). #Resolved

/// <param name="catalog">The context <see cref="TransformsCatalog"/> to create <see cref="FastTreeTweedieFeaturizationEstimator"/>.</param>
/// <param name="options">The options to configure <see cref="FastTreeTweedieFeaturizationEstimator"/>. See <see cref="FastTreeTweedieFeaturizationEstimator.Options"/> and
/// <see cref="TreeEnsembleFeaturizationEstimatorBase.CommonOptions"/> for available settings.</param>
public static FastTreeTweedieFeaturizationEstimator FeaturizeByFastTreeTweedie(this TransformsCatalog catalog,
Copy link
Contributor

@justinormont justinormont Jun 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to note in its name that FastTreeTweedie is regression: (the naming of the others list their task types)

Suggested change
public static FastTreeTweedieFeaturizationEstimator FeaturizeByFastTreeTweedie(this TransformsCatalog catalog,
public static FastTreeTweedieRegressionFeaturizationEstimator FeaturizeByFastTreeTweedieRegression(this TransformsCatalog catalog,
``` #ByDesign

Copy link
Member Author

@wschin wschin Jun 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes if the model name doesn't tell the task. Given that Tweedie somehow implies a regression case, we don't have Regression appended to any of public Tweedie modules. This pattern can be seen in FastTreeTweedieTrainer and FastTreeTweedieModelParameters. #Resolved

TrainerOptions = trainerOptions
};

var pipeline = ML.Transforms.FeaturizeByFastTreeBinary(options).
Copy link
Contributor

@justinormont justinormont Jun 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating (8) seperate ML.Transforms.FeaturizeBy{ FastTreeBinary, FastForestRegression, FastTreeRegression, FastTreeTweedie, ... } featurizers under ML.Transforms.* seems a bit large. Seems to clutter up the namespace.

In the future, this list should grow as I think we should have LightGBM variants too (and CatBoost if we take it in). The number of independent featurizers will be:
{ FastTree, FastTreeTweedie, FastForest, LightGBM, CatBoost } x { BinaryClassification, Multiclass, Regression, Ranking }`. (with some combinations missing)

Would it be more clean to have one ML.Transforms.TreeFeaturizer(), and put the specific instance type as a parameter? Would it be doable to have one return type? #ByDesign

Copy link
Member Author

@wschin wschin Jun 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two reasons that I don't like a single TreeFeaturizer.

  1. It goes toward an opposite direction of the C# API's design. Most of them are strongly typed to the underlying data structures. You can see we have a lot of FastTree... and LightGbm..., which is intended.
  2. It may requires user to specify TreeFeaturizer<TTrainer>(options) and the user manually needs to make sure the type of options is TTrainer.Options, which is easy to make mistakes. #Resolved

Copy link
Contributor

@justinormont justinormont Jun 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, the current pattern will likely have less mistakes by the user.

For the plethora of FastTree... and LightGBM..., those are namespaced under the task mlContext.Regression.Trainers.FastTree(), hence the duplication isn't visible.

TrainerOptions = trainerOptions
};

var pipeline = ML.Transforms.FeaturizeByFastForestRegression(options).
Copy link
Contributor

@justinormont justinormont Jun 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test of ML.Transforms.FeaturizeByFastForestRegression() using FastForest's ShuffleLabels option? I believe this is the only way to use TreeFeat w/ multi-class classification.

Copy link
Member Author

@wschin wschin Jun 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make it in another PR. This PR has been too large.. #Resolved

Copy link
Contributor

@justinormont justinormont Jun 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the UnitTests should be in the main PR. Specifically, I think we should ensure the multi-class case can work. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding.
#Resolved

@wschin wschin changed the title [WIP] Tree-based featurization Tree-based featurization Jun 6, 2019
@wschin wschin requested review from Ivanidzo4ka, justinormont and eerhardt and removed request for Ivanidzo4ka and justinormont June 6, 2019 17:58
@wschin wschin requested a review from eerhardt June 6, 2019 22:10
}

[Fact]
public void TreeEnsembleFeaturizingPipelineMulticlass()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daholste and I were able to map the Key to a float using a CustomMapping().

I'd recommend the multiclass unit test be:

Suggested change
public void TreeEnsembleFeaturizingPipelineMulticlass()
[Fact]
public void TreeEnsembleFeaturizingPipelineMulticlass()
{
int dataPointCount = 1000;
var data = SamplesUtils.DatasetUtils.GenerateRandomMulticlassClassificationExamples(dataPointCount).ToList();
var dataView = ML.Data.LoadFromEnumerable(data);
dataView = ML.Data.Cache(dataView);
var trainerOptions = new FastForestRegressionTrainer.Options
{
NumberOfThreads = 1,
NumberOfTrees = 10,
NumberOfLeaves = 4,
MinimumExampleCountPerLeaf = 10,
FeatureColumnName = "Features",
LabelColumnName = "FloatLabel",
ShuffleLabels = true
};
var options = new FastForestRegressionFeaturizationEstimator.Options()
{
InputColumnName = "Features",
TreesColumnName = "Trees",
LeavesColumnName = "Leaves",
PathsColumnName = "Paths",
TrainerOptions = trainerOptions
};
Action<RowWithKey, RowWithFloat> actionConvertKeyToFloat = (RowWithKey rowWithKey, RowWithFloat rowWithFloat) =>
{
rowWithFloat.FloatLabel = rowWithKey.KeyLabel == 0 ? float.NaN : rowWithKey.KeyLabel - 1;
};
var split = ML.Data.TrainTestSplit(dataView, 0.5);
var trainData = split.TrainSet;
var testData = split.TestSet;
var pipeline = ML.Transforms.Conversion.MapValueToKey("KeyLabel", "Label")
.Append(ML.Transforms.CustomMapping(actionConvertKeyToFloat, "KeyLabel"))
.Append(ML.Transforms.FeaturizeByFastForestRegression(options))
.Append(ML.Transforms.Concatenate("CombinedFeatures", "Trees", "Leaves", "Paths"))
.Append(ML.MulticlassClassification.Trainers.SdcaMaximumEntropy("KeyLabel", "CombinedFeatures"));
var model = pipeline.Fit(trainData);
var prediction = model.Transform(testData);
var metrics = ML.MulticlassClassification.Evaluate(prediction, labelColumnName: "KeyLabel");
Assert.True(metrics.MacroAccuracy > 0.6);
Assert.True(metrics.MicroAccuracy > 0.6);
}
class RowWithKey
{
[KeyType()]
public uint KeyLabel { get; set; }
}
class RowWithFloat
{
public float FloatLabel { get; set; }
}

Specifically, this is using a CustomMapping() to convert the Key to a float for use in the TreeFeat's FastForest regression. The current method in the unit test requires a user to know/list all of the values in their Label (and their type). The CustomMapping() style is easier for a user to replicate for their dataset.

We also added a TrainTestSplit() so we're not testing on the training set, and we removed the original features from the Concatenate() to ensure the TreeFeat's output features are useful.

@codemzs codemzs requested review from artidoro and ganik June 17, 2019 17:11
public TreeEnsembleModelParameters ModelParameters;
};

private TreeEnsembleModelParameters _modelParameters;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TreeEnsembleModelParameters [](start = 16, length = 27)

Should this be readonly or something similar to make sure it is not altered?

@artidoro
Copy link
Contributor

artidoro commented Jun 20, 2019

Since you have already built all this infrastructure why are we not providing the featurizers for LightGbm trainers? I think they still use the same base class for the tree ensemble.
I guess this could come in another PR.

// The 0-1 encoding of leaves the input feature vector falls into.
public float[] Leaves { get; set; }
// The 0-1 encoding of paths the input feature vector reaches the leaves.
public float[] Paths { get; set; }
Copy link
Contributor

@artidoro artidoro Jun 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (same in other places if you are doing a revision of the PR):
// The 0-1 encoding of paths the input feature vector follows to reach the leaves.

/// and the i-th vector element is the prediction value predicted by the i-th tree.
/// If <see cref="TreesColumnName"/> is <see langword="null"/>, this output column may not be generated.
/// </summary>
public string TreesColumnName;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TreesColumnName [](start = 26, length = 15)

Suggested renaming:

TreesColumnName -> TreeOutputsColumnName

I think it would be easier to understand, but this is not necessary.

Copy link
Contributor

@artidoro artidoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Member

@abgoswam abgoswam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@wschin wschin merged commit 9d29111 into dotnet:master Jun 26, 2019
@wschin wschin deleted the tree-feat branch June 26, 2019 23:15
Dmitry-A pushed a commit to Dmitry-A/machinelearning that referenced this pull request Jul 24, 2019
* Implement transformer

* Initial draft of porting tree-based featurization

* Internalize something

* Add Tweedie and Ranking cases

* Some small docs

* Customize output column names

* Fix save and load

* Optional output columns

* Fix a test and add some XML docs

* Add samples

* Add a sample

* API docs

* Fix one line

* Add MC test

* Extend a test further

* Address some comments

* Address some comments

* Address comments

* Comment

* Add cache points

* Update test/Microsoft.ML.Tests/TrainerEstimators/TreeEnsembleFeaturizerTest.cs

Co-Authored-By: Justin Ormont <[email protected]>

* Address comment

* Add Justin's test

* Reduce sample size

* Update sample output
@ghost ghost locked as resolved and limited conversation to collaborators Mar 21, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TreeEnsembleFeaturizer is not a Transformer yet
6 participants