Tree-based featurization #3812

wschin · 2019-06-03T22:13:25Z

Fixes #2482. Generating features from tree structures is a popular technique in data mining. This PR exposes this internal-only feature to the public.

Since I don't have enough time to handle multiple assignments at once, please don't leave nit comments; create new issues instead. Thanks a lot.

eerhardt · 2019-06-03T23:20:02Z

src/Microsoft.ML.FastTree/TreeTrainersCatalog.cs

@@ -437,5 +436,45 @@ public static class TreeExtensions
            var env = CatalogUtils.GetEnvironment(catalog);
            return new FastForestBinaryTrainer(env, options);
        }
+
+        public static PretrainedTreeFeaturizationEstimator PretrainTreeEnsembleFeaturizing(this TransformsCatalog catalog,


XML Doc on all public classes and APIs. #Resolved

No problem. I am working on them. #Resolved

eerhardt · 2019-06-03T23:21:10Z

src/Microsoft.ML.FastTree/TreeTrainersCatalog.cs

+            return new PretrainedTreeFeaturizationEstimator(env, options);
+        }
+
+        public static FastForestRegressionFeaturizationEstimator FastForestRegressionFeaturizing(this TransformsCatalog catalog,


This naming pattern reads a little funny. How about turning it into FeaturizeXXX? Like we have with FeaturizeText. #Resolved

I can do FeaturizeBy.... #Resolved

That sounds better (to me at least). #Resolved

justinormont · 2019-06-03T23:45:31Z

src/Microsoft.ML.FastTree/TreeEnsembleFeaturizationEstimator.cs

+            /// "Leaves" + <see cref="OutputColumnsSuffix"/>, and "Paths" + <see cref="OutputColumnsSuffix"/>. If <see cref="OutputColumnsSuffix"/>
+            /// is <see langword="null"/>, the output names would be "Trees", "Leaves", and "Paths".
+            /// </summary>
+            public string OutputColumnsSuffix;


We moved away from magic strings in the TextTransform. Previously, with tokens=+, we produced a new column named {OutputColName}_TokenizedText.

For the estimators API we have users directly enter the column name for the tokens. We may want to do the same for the Trees/Leaves/Paths of the TreeFeat.

Perhaps:
OutputColumnTreeName, OutputColumnLeavesName, OutputColumnPathsName.

#Resolved

And don't add a column if its name is empty or null.
That way you can actually configure which parts of the tree structure you want. #Resolved

As further background on the PR I was referencing...

Conversation about TextTransform:
via @rogancarr in #2957 PR

When using OutputTokens=true, FeaturizeText creates a new column called ${OutputColumnName}_TransformedText. This isn't really well documented anywhere, and it's odd behavior. I suggest that we make the tokenized text column name explicit in the API.

My suggestion would be the following:

Change OutputTokens = [bool] to OutputTokensColumn = [string], where string.IsNullOrWhiteSpace(OutputTokensColumn) signifies that this column will not be created. #Resolved

Now we can do optional output columns and custom output column names. Please see tests for examples (or wait for formal API samples). #Resolved
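As a rough illustration of the optional-column behavior discussed above, the sketch below picks which output columns would be produced from three column-name settings. The types `TreeFeaturizerColumnOptions` and `OutputColumnSelector` are hypothetical and are not ML.NET classes; only the three field names mirror the options in this PR. Treating empty or whitespace-only names the same as null is an assumption borrowed from the TextTransform suggestion quoted above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the featurizer options (not an ML.NET type).
// Only the three field names mirror the options discussed in this PR.
public sealed class TreeFeaturizerColumnOptions
{
    public string TreesColumnName;
    public string LeavesColumnName;
    public string PathsColumnName;
}

public static class OutputColumnSelector
{
    // Returns the names of the output columns that would be created.
    // Assumption: a null, empty, or whitespace-only name means "skip this column".
    public static List<string> SelectOutputColumns(TreeFeaturizerColumnOptions options)
    {
        var columns = new List<string>();
        var candidates = new[]
        {
            options.TreesColumnName,
            options.LeavesColumnName,
            options.PathsColumnName
        };

        foreach (var name in candidates)
        {
            if (!string.IsNullOrWhiteSpace(name))
                columns.Add(name);
        }

        return columns;
    }
}
```

With `TreesColumnName = "Trees"`, `PathsColumnName = "Paths"`, and `LeavesColumnName` left null, `SelectOutputColumns` returns only the "Trees" and "Paths" entries, so the leaves encoding is skipped.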

justinormont · 2019-06-05T19:45:37Z

src/Microsoft.ML.FastTree/TreeTrainersCatalog.cs

+        /// <param name="catalog">The context <see cref="TransformsCatalog"/> to create <see cref="FastTreeTweedieFeaturizationEstimator"/>.</param>
+        /// <param name="options">The options to configure <see cref="FastTreeTweedieFeaturizationEstimator"/>. See <see cref="FastTreeTweedieFeaturizationEstimator.Options"/> and
+        /// <see cref="TreeEnsembleFeaturizationEstimatorBase.CommonOptions"/> for available settings.</param>
+        public static FastTreeTweedieFeaturizationEstimator FeaturizeByFastTreeTweedie(this TransformsCatalog catalog,


May want to note in its name that FastTreeTweedie is regression: (the naming of the others list their task types)

Suggested change

-        public static FastTreeTweedieFeaturizationEstimator FeaturizeByFastTreeTweedie(this TransformsCatalog catalog,
+        public static FastTreeTweedieRegressionFeaturizationEstimator FeaturizeByFastTreeTweedieRegression(this TransformsCatalog catalog,

#ByDesign

Yes, if the model name doesn't indicate the task. Since Tweedie already implies regression, we don't append Regression to any of the public Tweedie modules. This pattern can be seen in FastTreeTweedieTrainer and FastTreeTweedieModelParameters. #Resolved

justinormont · 2019-06-05T19:48:21Z

test/Microsoft.ML.Tests/TrainerEstimators/TreeEnsembleFeaturizerTest.cs

+                TrainerOptions = trainerOptions
+            };
+
+            var pipeline = ML.Transforms.FeaturizeByFastTreeBinary(options).


Creating eight separate ML.Transforms.FeaturizeBy{ FastTreeBinary, FastForestRegression, FastTreeRegression, FastTreeTweedie, ... } featurizers under ML.Transforms.* seems like a lot, and it clutters up the namespace.

In the future, this list should grow, as I think we should have LightGBM variants too (and CatBoost if we take it in). The number of independent featurizers will be:
{ FastTree, FastTreeTweedie, FastForest, LightGBM, CatBoost } x { BinaryClassification, Multiclass, Regression, Ranking } (with some combinations missing).

Would it be cleaner to have one ML.Transforms.TreeFeaturizer() and pass the specific instance type as a parameter? Would it be doable to have one return type? #ByDesign

There are two reasons that I don't like a single TreeFeaturizer.

1. It goes against the direction of the C# API's design. Most APIs are strongly typed to the underlying data structures. You can see we have a lot of FastTree... and LightGbm..., which is intended.

2. It may require the user to specify TreeFeaturizer<TTrainer>(options) and manually make sure the type of options is TTrainer.Options, which is error-prone. #Resolved

You're right, the current pattern will likely lead to fewer user mistakes.

For the plethora of FastTree... and LightGBM..., those are namespaced under the task mlContext.Regression.Trainers.FastTree(), hence the duplication isn't visible.

justinormont · 2019-06-05T19:55:33Z

test/Microsoft.ML.Tests/TrainerEstimators/TreeEnsembleFeaturizerTest.cs

+                TrainerOptions = trainerOptions
+            };
+
+            var pipeline = ML.Transforms.FeaturizeByFastForestRegression(options).


Can you add a test of ML.Transforms.FeaturizeByFastForestRegression() using FastForest's ShuffleLabels option? I believe this is the only way to use TreeFeat w/ multi-class classification.

Let's do it in another PR. This PR is already too large. #Resolved

Seems the UnitTests should be in the main PR. Specifically, I think we should ensure the multi-class case can work. #Resolved

Thanks for adding.
#Resolved

justinormont · 2019-06-11T01:02:53Z

test/Microsoft.ML.Tests/TrainerEstimators/TreeEnsembleFeaturizerTest.cs

+        }
+
+        [Fact]
+        public void TreeEnsembleFeaturizingPipelineMulticlass()


@daholste and I were able to map the Key to a float using a CustomMapping().

I'd recommend the multiclass unit test be:

Suggested change

-        public void TreeEnsembleFeaturizingPipelineMulticlass()
+        [Fact]
+        public void TreeEnsembleFeaturizingPipelineMulticlass()
+        {
+            int dataPointCount = 1000;
+            var data = SamplesUtils.DatasetUtils.GenerateRandomMulticlassClassificationExamples(dataPointCount).ToList();
+            var dataView = ML.Data.LoadFromEnumerable(data);
+            dataView = ML.Data.Cache(dataView);
+            var trainerOptions = new FastForestRegressionTrainer.Options
+            {
+                NumberOfThreads = 1,
+                NumberOfTrees = 10,
+                NumberOfLeaves = 4,
+                MinimumExampleCountPerLeaf = 10,
+                FeatureColumnName = "Features",
+                LabelColumnName = "FloatLabel",
+                ShuffleLabels = true
+            };
+            var options = new FastForestRegressionFeaturizationEstimator.Options()
+            {
+                InputColumnName = "Features",
+                TreesColumnName = "Trees",
+                LeavesColumnName = "Leaves",
+                PathsColumnName = "Paths",
+                TrainerOptions = trainerOptions
+            };
+            Action<RowWithKey, RowWithFloat> actionConvertKeyToFloat = (RowWithKey rowWithKey, RowWithFloat rowWithFloat) =>
+            {
+                rowWithFloat.FloatLabel = rowWithKey.KeyLabel == 0 ? float.NaN : rowWithKey.KeyLabel - 1;
+            };
+            var split = ML.Data.TrainTestSplit(dataView, 0.5);
+            var trainData = split.TrainSet;
+            var testData = split.TestSet;
+            var pipeline = ML.Transforms.Conversion.MapValueToKey("KeyLabel", "Label")
+                .Append(ML.Transforms.CustomMapping(actionConvertKeyToFloat, "KeyLabel"))
+                .Append(ML.Transforms.FeaturizeByFastForestRegression(options))
+                .Append(ML.Transforms.Concatenate("CombinedFeatures", "Trees", "Leaves", "Paths"))
+                .Append(ML.MulticlassClassification.Trainers.SdcaMaximumEntropy("KeyLabel", "CombinedFeatures"));
+            var model = pipeline.Fit(trainData);
+            var prediction = model.Transform(testData);
+            var metrics = ML.MulticlassClassification.Evaluate(prediction, labelColumnName: "KeyLabel");
+            Assert.True(metrics.MacroAccuracy > 0.6);
+            Assert.True(metrics.MicroAccuracy > 0.6);
+        }
+        class RowWithKey
+        {
+            [KeyType()]
+            public uint KeyLabel { get; set; }
+        }
+        class RowWithFloat
+        {
+            public float FloatLabel { get; set; }
+        }

Specifically, this is using a CustomMapping() to convert the Key to a float for use in the TreeFeat's FastForest regression. The current method in the unit test requires a user to know/list all of the values in their Label (and their type). The CustomMapping() style is easier for a user to replicate for their dataset.

We also added a TrainTestSplit() so we're not testing on the training set, and we removed the original features from the Concatenate() to ensure the TreeFeat's output features are useful.
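For readers who want to see the key-to-float conversion on its own, here is a minimal sketch of the expression used inside the CustomMapping() above, pulled out into a standalone helper. The class name `KeyLabelConversion` is hypothetical. The sketch relies on the ML.NET convention that key values are 1-based with 0 reserved for a missing value.

```csharp
using System;

// Hypothetical helper (not part of ML.NET) isolating the conversion used in the
// suggested test's CustomMapping() lambda.
public static class KeyLabelConversion
{
    // Key values are 1-based, and 0 denotes a missing value.
    // A missing key becomes NaN; key k becomes the zero-based class index k - 1.
    public static float KeyToFloat(uint key)
    {
        return key == 0 ? float.NaN : key - 1;
    }
}
```

For example, `KeyToFloat(1)` yields 0 and `KeyToFloat(3)` yields 2, while `KeyToFloat(0)` yields NaN. The user never has to enumerate the label values by hand, which is the point made above about this style being easier to replicate.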

artidoro · 2019-06-20T23:14:19Z

src/Microsoft.ML.FastTree/TreeEnsembleFeaturizationEstimator.cs

+            public TreeEnsembleModelParameters ModelParameters;
+        };
+
+        private TreeEnsembleModelParameters _modelParameters;


Should this be readonly or something similar to make sure it is not altered?

artidoro · 2019-06-20T23:31:58Z

Since you have already built all this infrastructure, why are we not providing the featurizers for LightGbm trainers? I think they still use the same base class for the tree ensemble.
I guess this could come in another PR.

artidoro · 2019-06-20T23:35:27Z

....ML.Samples/Dynamic/Transforms/TreeFeaturization/FastForestBinaryFeaturizationWithOptions.cs

+            // The 0-1 encoding of leaves the input feature vector falls into.
+            public float[] Leaves { get; set; }
+            // The 0-1 encoding of paths the input feature vector reaches the leaves.
+            public float[] Paths { get; set; }


nit (same in other places if you are doing a revision of the PR):
// The 0-1 encoding of paths the input feature vector follows to reach the leaves.
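To make the three outputs concrete, the sketch below featurizes one input with a single toy decision tree. `ToyTree` is a hypothetical type written for this illustration and is not an ML.NET class; the exact vector layout ML.NET uses across a whole ensemble is not shown in this thread, so treat the per-tree layout here as an assumption. It produces the tree's output value, a 0-1 vector marking the leaf the input falls into, and a 0-1 vector marking the internal nodes visited on the way to that leaf.

```csharp
using System;

// Hypothetical toy tree (not an ML.NET type). Internal nodes are indexed from 0.
// A child index c >= 0 refers to internal node c; c < 0 refers to leaf ~c.
// Assumes the tree has at least one internal node, with node 0 as the root.
public sealed class ToyTree
{
    public int[] SplitFeature;   // feature index tested at each internal node
    public float[] Threshold;    // go left when feature value <= threshold
    public int[] LeftChild;
    public int[] RightChild;
    public float[] LeafValue;    // prediction stored at each leaf

    public void Featurize(float[] features, out float treeOutput, out float[] leaves, out float[] paths)
    {
        leaves = new float[LeafValue.Length];    // one slot per leaf
        paths = new float[SplitFeature.Length];  // one slot per internal node

        int node = 0;
        while (node >= 0)
        {
            paths[node] = 1f; // this internal node lies on the path
            node = features[SplitFeature[node]] <= Threshold[node]
                ? LeftChild[node]
                : RightChild[node];
        }

        int leaf = ~node;     // decode the leaf index
        leaves[leaf] = 1f;    // one-hot encoding of the reached leaf
        treeOutput = LeafValue[leaf];
    }
}
```

Take a tree with two internal nodes and three leaves, where node 0 sends feature 0 values at or below 0.5 to leaf 0 and everything else to node 1, and node 1 splits feature 1 at 2.0 between leaves 1 and 2. The input (0.9, 3.0) then gives leaves = [0, 0, 1] and paths = [1, 1], while the input (0.1, 3.0) gives leaves = [1, 0, 0] and paths = [1, 0].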

artidoro · 2019-06-20T23:38:41Z

src/Microsoft.ML.FastTree/TreeEnsembleFeaturizationEstimator.cs

+            /// and the i-th vector element is the prediction value predicted by the i-th tree.
+            /// If <see cref="TreesColumnName"/> is <see langword="null"/>, this output column may not be generated.
+            /// </summary>
+            public string TreesColumnName;


Suggested renaming:

TreesColumnName -> TreeOutputsColumnName

I think it would be easier to understand, but this is not necessary.

* Implement transformer
* Initial draft of porting tree-based featurization
* Internalize something
* Add Tweedie and Ranking cases
* Some small docs
* Customize output column names
* Fix save and load
* Optional output columns
* Fix a test and add some XML docs
* Add samples
* Add a sample
* API docs
* Fix one line
* Add MC test
* Extend a test further
* Address some comments
* Address some comments
* Address comments
* Comment
* Add cache points
* Update test/Microsoft.ML.Tests/TrainerEstimators/TreeEnsembleFeaturizerTest.cs Co-Authored-By: Justin Ormont <[email protected]>
* Address comment
* Add Justin's test
* Reduce sample size
* Update sample output

wschin added 2 commits May 31, 2019 16:55

Implement transformer

17248d3

Initial draft of porting tree-based featurization

a2f1d6c

wschin requested review from justinormont and codemzs June 3, 2019 22:13

wschin self-assigned this Jun 3, 2019

wschin added 2 commits June 3, 2019 15:20

Internalize something

33d0ee0

Add Tweedie and Ranking cases

9658991

eerhardt reviewed Jun 3, 2019

justinormont reviewed Jun 3, 2019

wschin added 5 commits June 3, 2019 17:55

Some small docs

f529f1d

Customize output column names

9c4d801

Fix save and load

e7b84dd

Optional output columns

5d8215a

Fix a test and add some XML docs

ce4378f

justinormont reviewed Jun 5, 2019

wschin added 3 commits June 5, 2019 17:53

Add samples

618179d

Add a sample

49fe1d7

API docs

2197391

wschin changed the title from "[WIP] Tree-based featurization" to "Tree-based featurization" Jun 6, 2019

wschin requested review from Ivanidzo4ka, justinormont and eerhardt and removed request for Ivanidzo4ka and justinormont June 6, 2019 17:58

wschin added 2 commits June 6, 2019 10:59

Fix one line

b00be93

Add MC test

bbeb17f

Address comments

dbd5dac

wschin requested review from Ivanidzo4ka, justinormont and eerhardt June 6, 2019 21:25

eerhardt reviewed Jun 6, 2019

src/Microsoft.ML.FastTree/TreeEnsembleFeaturizer.cs (resolved)

Comment

1f261c5

wschin requested a review from eerhardt June 6, 2019 22:10

Add cache points

241b3ad

justinormont reviewed Jun 7, 2019

test/Microsoft.ML.Tests/TrainerEstimators/TreeEnsembleFeaturizerTest.cs (outdated, resolved)

justinormont reviewed Jun 7, 2019

test/Microsoft.ML.Tests/TrainerEstimators/TreeEnsembleFeaturizerTest.cs (outdated, resolved)

wschin and others added 3 commits June 7, 2019 08:15

Update test/Microsoft.ML.Tests/TrainerEstimators/TreeEnsembleFeaturiz…

6850b8e

…erTest.cs Co-Authored-By: Justin Ormont <[email protected]>

Address comment

d337aa5

Merge branch 'tree-feat' of github.com:wschin/machinelearning into tr…

4ea7bf6

…ee-feat

justinormont reviewed Jun 11, 2019

wschin added 4 commits June 11, 2019 08:40

Add Justin's test

7b2d654

Merge branch 'tree-feat' of github.com:wschin/machinelearning into tr…

b8a3ba8

…ee-feat

Reduce sample size

d1d6813

Update sample output

cc2d531

codemzs requested review from artidoro and ganik June 17, 2019 17:11

artidoro reviewed Jun 20, 2019

artidoro approved these changes Jun 20, 2019

abgoswam approved these changes Jun 26, 2019

wschin merged commit 9d29111 into dotnet:master Jun 26, 2019

wschin deleted the tree-feat branch June 26, 2019 23:15

ghost locked as resolved and limited conversation to collaborators Mar 21, 2022
