Changed default NGram length from 1 to 2. #5248
Conversation
…nt with sparse data than before, slightly less performant if it's dense data
@@ -1012,7 +1012,7 @@ private protected bool UsingMaxLabel
 }

 private FeatureFlockBase CreateOneHotFlock(IChannel ch,
-    List<int> features, int[] binnedValues, int[] lastOn, ValuesList[] instanceList,
+    List<int> features, int[] binnedValues, int[] lastOn, Dictionary<int, ValuesList> instanceList,
Nice fix! Since FastTree is one of the primary trainers in ML.NET, I'd recommend testing the speed & memory on a variety of datasets.
I see that you're reporting speed/memory on the FastTree ranking Test (not TrainTest): the Test_Ranking_MSLRWeb10K_RawNumericFeatures_FastTreeRanking benchmark. For MAML, Test only runs the prediction step. Running the training (TrainTest) benchmark, TrainTest_Ranking_MSLRWeb10K_RawNumericFeatures_FastTreeRanking, might be more telling.
The MSLR dataset is purely numeric, with no text columns in it, so it shouldn't be affected by the ngram length change.
Adding a FastTree text benchmark
You could add FastTree to the text benchmark in Benchmarks/Text/MultiClassClassification.cs. Code for adding a FastTree benchmark on WikiDetox would be:
[Benchmark]
public void CV_Multiclass_WikiDetox_BigramsAndTrichar_OVAFastTree()
{
    string cmd = @"CV k=5 data=" + _dataPathWiki +
        " loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+}" +
        " xf=Convert{col=logged_in type=R4}" +
        " xf=CategoricalTransform{col=ns}" +
        " xf=TextTransform{col=FeaturesText:comment wordExtractor=NGramExtractorTransform{ngram=2}}" +
        " xf=Concat{col=Features:FeaturesText,logged_in,ns}" +
        " tr=OVA{p=FastTree}";

    var environment = EnvironmentFactory.CreateClassificationEnvironment<TextLoader, OneHotEncodingTransformer, FastTreeTrainer, LinearBinaryModelParameters>();
    cmd.ExecuteMamlCommand(environment);
}
Heuristic
After testing various datasets, you may find the dictionary is beneficial for small (or large) sparse datasets; if we find that, we could use a heuristic like useDictionary = sparseness > 0.95 && rowsTimesSlots < 1E6;
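As a rough illustration, a minimal sketch of how that heuristic might be computed (all names here are hypothetical, taken from the comment above, not existing FastTree fields):

static bool UseDictionary(long rowCount, long slotCount, long nonZeroCount)
{
    long rowsTimesSlots = rowCount * slotCount;                       // size a dense array would need
    double sparseness = 1.0 - (double)nonZeroCount / rowsTimesSlots;  // fraction of empty slots
    return sparseness > 0.95 && rowsTimesSlots < 1E6;                 // thresholds from the suggestion above
}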
Do you have any example datasets I can use, or could you point me to where I can find some? I'm not sure where I would find datasets for something like this.
I was talking with Harish this morning about the sparseness and choosing between a list and a dictionary. We are going to discuss it again, but I couldn't find a way to know ahead of time whether the data would be sparse or dense without the user letting us know.
The test FastForestBinaryClassificationTestSummary uses a small dataset. It has about 1300 unique words, and the total word count is about 9000. The current FastTree trainer multiplies those two values together (about 11.5 million) and allocates an array of that size right off the bat, which is where the huge memory usage comes from. It ends up using a very small fraction of that amount during training, but I couldn't figure out how to tell ahead of time how much it would actually use.
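To make the allocation difference concrete, here is a minimal sketch contrasting the two strategies; the types and numbers are illustrative stand-ins, not FastTree's actual internals:

using System.Collections.Generic;

// Illustrative stand-in for FastTree's ValuesList type.
class ValuesList { public List<int> Values = new List<int>(); }

class AllocationSketch
{
    static void Main()
    {
        int uniqueWords = 1300, totalWords = 9000;

        // Old behavior: ~11.7M slots reserved up front, mostly never touched.
        var dense = new ValuesList[uniqueWords * totalWords];

        // Behavior after the diff above: an entry is created only when a feature is actually seen.
        var sparse = new Dictionary<int, ValuesList>();
        int featureIndex = 42; // hypothetical feature being binned
        if (!sparse.TryGetValue(featureIndex, out var values))
            sparse[featureIndex] = values = new ValuesList();
        values.Values.Add(7); // record an observed value
    }
}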
The benchmark I was suggesting uses WikiDetox, which is a 70MB text dataset. See my recommended code for CV_Multiclass_WikiDetox_BigramsAndTrichar_OVAFastTree, above. You should be able to just drop that code into Benchmarks/Text/MultiClassClassification.cs.
The test FastForestBinaryClassificationTestSummary uses a small dataset. It has about 1300 unique words and total word count is about 9000.
Tests run on micro-datasets to check whether something has changed. The benchmarks are meant to use real-world datasets, like MSLR and WikiDetox.
Would it be possible to measure the sparsity at runtime? Either on a subsample before you choose the array/dictionary, or after N% of the data has passed.
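A sketch of that idea, deciding the representation from a leading subsample (the shape of this helper is an assumption for illustration, not the trainer's real API):

using System.Collections.Generic;
using System.Linq;

static class SparsitySampler
{
    // Inspect the first sampleSize rows, then decide array vs. dictionary.
    public static bool ShouldUseDictionary(IEnumerable<float[]> rows, int sampleSize = 1000)
    {
        long slotsSeen = 0, nonZero = 0;
        foreach (var row in rows.Take(sampleSize))
        {
            slotsSeen += row.Length;
            nonZero += row.Count(v => v != 0);
        }
        if (slotsSeen == 0)
            return false; // nothing sampled; fall back to the dense default
        double sparseness = 1.0 - (double)nonZero / slotsSeen;
        return sparseness > 0.95; // same threshold as the heuristic suggested earlier
    }
}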
OK: after all the investigation, testing, and syncing you and I did, I was only ever able to reproduce this issue when the feature column has been one-hot encoded. Per our offline discussion, this should never happen (and when it does, an error is usually thrown for trying to make an array that is too large), so the test case appears to be wrong and my changes are not necessary. Even when the input data was sparse, this situation did not repro.
Due to this, I am going to revert my changes to FastTree and just fix the test instead.
Codecov Report
@@            Coverage Diff             @@
##           master    #5248      +/-   ##
==========================================
- Coverage   73.58%   73.54%    -0.04%
==========================================
  Files        1013     1014        +1
  Lines      188446   188682      +236
  Branches    20289    20330       +41
==========================================
+ Hits       138659   138768      +109
- Misses      44307    44412      +105
- Partials     5480     5502       +22
@@ -977,7 +977,7 @@ public void FastTreeBinaryClassificationTestSummary()
 [Fact]
 public void FastForestBinaryClassificationTestSummary()
 {
-    var (pipeline, dataView) = GetOneHotBinaryClassificationPipeline();
+    var (pipeline, dataView) = GetBinaryClassificationPipeline();
     var estimator = pipeline.Append(ML.BinaryClassification.Trainers.FastForest(
         new FastForestBinaryTrainer.Options { NumberOfTrees = 2, NumberOfThreads = 1, NumberOfLeaves = 4, CategoricalSplit = true }));
Should this be changed too?
Hi, @michaelgsharp. So I believe that in the end you didn't modify the inner workings of FastTree or how it deals with sparse data; instead, this PR updates the NgramLength default to 2 and updates the tests accordingly, right? If so, can you please update the PR name and description? Thanks!
This is part of the work for #4749; other PRs will follow to split the work up. When the default value for NGrams was changed from 1 to 2, we discovered that memory was exploding during FastTree training, causing test failures in some x86 tests. This PR changes the default value for NGramLength from 1 to 2 and also changes FastTree so it handles sparse data better.

The main portion of the sparse data change is switching from an array to a dictionary, so that memory is only allocated when it is needed instead of all up front. When running the previously failing test, it now passes with less memory usage and is actually faster due to less GC running.

The slowdown for dense data appears to be very small: comparing the ranking benchmark before and after the change shows about a 10% slowdown. In return, the memory used is lower. In one NGram test with NGramLength = 2, memory before the change was 3.4 GB; after this change it was 400 MB.

Edit (leaving the original text above for context): after doing more testing and benchmarking, it was discovered that the test case was wrong and the FastTree implementation was fine. This PR now just fixes the test case while updating the default NGramLength to 2.
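Since the user-visible change here is the NgramLength default moving from 1 to 2, a pipeline can pin the length explicitly so it is unaffected by the default. A minimal sketch using ML.NET's public text catalog (column names are placeholders):

using Microsoft.ML;

var mlContext = new MLContext();

// Tokenize, key-encode, then request bigrams explicitly instead of
// relying on the default this PR changes from 1 to 2.
var pipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text")
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("Ngrams", "Tokens", ngramLength: 2));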