
Changed default NGram length from 1 to 2. #5248


Merged: 3 commits merged into dotnet:master from fast-tree-memory-fix on Jun 26, 2020

Conversation

@michaelgsharp (Member) commented Jun 18, 2020

This is part of the work for #4749; other PRs will follow to split the work up. When the default value for NGrams was changed from 1 to 2, we discovered that memory was exploding during FastTree training, causing test failures in some x86 tests. This PR changes the default value for NGramLength from 1 to 2 and also changes FastTree so it handles sparse data better.
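For reference, a minimal sketch of pinning this value explicitly rather than relying on the default (ProduceWordBags is the public surface of the WordBagEstimator defined in WrappedTextTransformers.cs, one of the files touched by this PR per the Codecov report below; the column names here are made up):

    using Microsoft.ML;

    // Minimal sketch: setting the n-gram length explicitly instead of relying on
    // the default, which this PR moves from 1 (unigrams) to 2 (unigrams + bigrams).
    // Column names ("Text", "Features") are illustrative.
    var mlContext = new MLContext();
    var pipeline = mlContext.Transforms.Text.ProduceWordBags(
        outputColumnName: "Features",
        inputColumnName: "Text",
        ngramLength: 2);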

The main portion of the sparse-data change is switching from an array to a dictionary, so that memory is allocated only when it's needed instead of all up front. When running the previously failing test, it now passes with less memory usage and is actually faster due to less GC pressure.
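As a rough sketch of the idea (illustrative only, not the actual FastTree internals; ValuesList is stubbed out here):

    using System.Collections.Generic;

    internal sealed class ValuesList { }  // stub for the real Microsoft.ML.FastTree type

    internal static class SparseStorageSketch
    {
        // Before: one slot per feature, allocated eagerly up front.
        public static ValuesList[] CreateDense(int featureCount)
            => new ValuesList[featureCount];

        // After: a slot materializes only the first time a feature appears,
        // so memory scales with the features actually seen, not the total count.
        public static ValuesList GetOrAdd(Dictionary<int, ValuesList> store, int featureIndex)
        {
            if (!store.TryGetValue(featureIndex, out var values))
            {
                values = new ValuesList();
                store[featureIndex] = values;
            }
            return values;
        }
    }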

The slowdown to dense data appears to be very small. Running the benchmark for ranking before the change gives this result:

| Method | Mean | Error | StdDev | Extra Metric |
|---|---|---|---|---|
| Test_Ranking_MSLRWeb10K_RawNumericFeatures_FastTreeRanking | 1.263 s | 0.0517 s | 0.0595 s | - |

Running it after the change gives this result:

| Method | Mean | Error | StdDev | Extra Metric |
|---|---|---|---|---|
| Test_Ranking_MSLRWeb10K_RawNumericFeatures_FastTreeRanking | 1.381 s | 0.0397 s | 0.0457 s | - |

This is about a 10% slowdown. In return, the memory used is lower: in one example NGram test with NGramLength = 2, memory usage before the change was 3.4 GB; after the change it was 400 MB.

Edit (leaving the original text above for context): after doing more testing and benchmarking, we discovered that the test case was wrong and the FastTree implementation was fine. This PR now just fixes the test case while updating the default NGramLength to 2.

…nt with sparse data than before, slightly less performant if it's dense data
@michaelgsharp michaelgsharp requested review from harishsk and a team June 18, 2020 16:45
@michaelgsharp michaelgsharp self-assigned this Jun 18, 2020
@@ -1012,7 +1012,7 @@ private protected bool UsingMaxLabel
}

private FeatureFlockBase CreateOneHotFlock(IChannel ch,
-    List<int> features, int[] binnedValues, int[] lastOn, ValuesList[] instanceList,
+    List<int> features, int[] binnedValues, int[] lastOn, Dictionary<int, ValuesList> instanceList,
Contributor commented:

Nice fix! Since FastTree is one of the primary trainers in ML.NET, I'd recommend testing the speed & memory on a variety of datasets.

@justinormont (Contributor) commented Jun 18, 2020:

I see that you're reporting speed/memory on the FastTree ranking Test benchmark (not TrainTest), Test_Ranking_MSLRWeb10K_RawNumericFeatures_FastTreeRanking. For MAML, Test only runs the prediction step.

Running the training benchmark (TrainTest), TrainTest_Ranking_MSLRWeb10K_RawNumericFeatures_FastTreeRanking, might be more telling.

The MSLR dataset is purely numeric, with no text columns in it, so it shouldn't be affected by the ngram length change.

Adding a FastTree text benchmark

You could add FastTree to the text benchmark in (Benchmarks/Text/MultiClassClassification.cs).

Code for adding a FastTree benchmark on WikiDetox would be:

        [Benchmark]
        public void CV_Multiclass_WikiDetox_BigramsAndTrichar_OVAFastTree()
        {
            string cmd = @"CV k=5 data=" + _dataPathWiki +
                        " loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+}" +
                        " xf=Convert{col=logged_in type=R4}" +
                        " xf=CategoricalTransform{col=ns}" +
                        " xf=TextTransform{col=FeaturesText:comment wordExtractor=NGramExtractorTransform{ngram=2}}" +
                        " xf=Concat{col=Features:FeaturesText,logged_in,ns}" +
                        " tr=OVA{p=FastTree}";

            var environment = EnvironmentFactory.CreateClassificationEnvironment<TextLoader, OneHotEncodingTransformer, FastTreeTrainer, LinearBinaryModelParameters>();
            cmd.ExecuteMamlCommand(environment);
        }

Heuristic

After testing various datasets, you may find the dictionary is beneficial for small (or large) sparse datasets. If we find that, we could use a heuristic such as useDictionary = sparseness > 0.95 && rowsTimesSlots < 1E6;
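A sketch of how that heuristic might read in code (thresholds as suggested above, untested; all names here are hypothetical):

    // Sketch of the suggested heuristic; the 0.95 and 1E6 thresholds come from the
    // comment above and would need validation against real datasets.
    internal static class StorageHeuristic
    {
        public static bool UseDictionary(long rows, long slots, long nonZeroValues)
        {
            double rowsTimesSlots = (double)rows * slots;
            double sparseness = 1.0 - nonZeroValues / rowsTimesSlots;
            return sparseness > 0.95 && rowsTimesSlots < 1E6;
        }
    }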

@michaelgsharp (Member, Author) commented:

Do you have any example datasets I can use? Or point me to where I can find some? I am not sure where I would find datasets for something like this.

@michaelgsharp (Member, Author) commented:

I was talking with Harish this morning about the sparseness and choosing between a list and a dictionary. We are going to discuss it again, but I couldn't find a way to know ahead of time whether the data would be sparse or dense without the user letting us know.

The test FastForestBinaryClassificationTestSummary uses a small dataset. It has about 1300 unique words, and the total word count is about 9000. The current FastTree trainer multiplies those two values together (which is about 11.5 million) and allocates an array of that size right off the bat, which is where the huge memory usage comes from. It ends up using a very small fraction of that amount during training, but I couldn't figure out how to tell ahead of time how much it was actually using.
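A back-of-envelope sketch of that allocation (the 8-byte reference size assumes a 64-bit process; counts are the rounded figures quoted above):

    using System;

    // ~1,300 unique words × ~9,000 total word count, as described above.
    const long uniqueWords = 1_300;
    const long totalWordCount = 9_000;
    long slots = uniqueWords * totalWordCount;  // ≈ 11.7 million with these rounded
                                                // counts (~11.5M with the exact ones)
    long refBytes = slots * 8;                  // ≈ 89 MB of references alone,
                                                // before any per-entry payloads
    Console.WriteLine($"{slots:N0} slots ≈ {refBytes / (1 << 20):N0} MB of references");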

Contributor commented:

The benchmark I was suggesting to be added uses WikiDetox, which is a 70 MB text dataset.

See my recommended code for CV_Multiclass_WikiDetox_BigramsAndTrichar_OVAFastTree, above.

You should be able to just drop that code into (Benchmarks/Text/MultiClassClassification.cs).

> The test FastForestBinaryClassificationTestSummary uses a small dataset. It has about 1300 unique words and total word count is about 9000.

Tests are on micro-datasets to test if something has changed. The benchmarks are meant to be real-world datasets, like MSLR and WikiDetox.

Would it be possible to measure the sparsity at runtime? Either on a subsample before you commit to the array or dictionary, or after N% of the data has passed.
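A sketch of the subsampling idea (hypothetical shape; real code would cursor over an IDataView rather than float arrays):

    using System.Collections.Generic;
    using System.Linq;

    internal static class SparsitySampler
    {
        // Estimates sparsity from the first sampleRows rows so the trainer could
        // pick array vs. dictionary storage before touching the full dataset.
        // IEnumerable<float[]> stands in for ML.NET's real row cursoring.
        public static double Estimate(IEnumerable<float[]> rows, int sampleRows)
        {
            long seen = 0, zeros = 0;
            foreach (var row in rows.Take(sampleRows))
            {
                seen += row.Length;
                zeros += row.Count(v => v == 0f);
            }
            return seen == 0 ? 0.0 : (double)zeros / seen;
        }
    }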

@michaelgsharp (Member, Author) commented:

OK, after all the investigation, testing, and syncing you and I did, I was only ever able to reproduce this issue when the feature column has been one-hot encoded. Per our offline discussion, this should never happen (and when it does, an error is usually thrown for trying to make an array that is too large), so the test case appears to be wrong and my changes are not necessary. Even when the input data was sparse, this situation did not reproduce.

Due to this, I am going to revert my changes to FastTree and just fix the test instead.

@codecov bot commented Jun 23, 2020:

Codecov Report

Merging #5248 into master will decrease coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #5248      +/-   ##
==========================================
- Coverage   73.58%   73.54%   -0.04%     
==========================================
  Files        1013     1014       +1     
  Lines      188446   188682     +236     
  Branches    20289    20330      +41     
==========================================
+ Hits       138659   138768     +109     
- Misses      44307    44412     +105     
- Partials     5480     5502      +22     
| Flag | Coverage Δ |
|---|---|
| #Debug | 73.54% <100.00%> (-0.04%) ⬇️ |
| #production | 69.36% <100.00%> (-0.04%) ⬇️ |
| #test | 87.42% <100.00%> (-0.01%) ⬇️ |
| Impacted Files | Coverage Δ |
|---|---|
| ...soft.ML.Transforms/Text/WrappedTextTransformers.cs | 99.10% <100.00%> (ø) |
| ...ML.Tests/TrainerEstimators/CalibratorEstimators.cs | 95.08% <100.00%> (ø) |
| ...Microsoft.ML.Tests/TrainerEstimators/LbfgsTests.cs | 97.45% <100.00%> (ø) |
| ...ft.ML.Tests/TrainerEstimators/TrainerEstimators.cs | 93.61% <100.00%> (+0.06%) ⬆️ |
| ...osoft.ML.Tests/Transformers/TextFeaturizerTests.cs | 99.64% <100.00%> (ø) |
| src/Microsoft.ML.FastTree/RegressionTree.cs | 75.51% <0.00%> (-8.17%) ⬇️ |
| src/Microsoft.ML.LightGbm/LightGbmTrainerBase.cs | 78.92% <0.00%> (-6.07%) ⬇️ |
| src/Microsoft.ML.FastTree/Training/StepSearch.cs | 57.42% <0.00%> (-4.96%) ⬇️ |
| ...rosoft.ML.AutoML/ColumnInference/TextFileSample.cs | 59.60% <0.00%> (-2.65%) ⬇️ |
| src/Microsoft.ML.TimeSeries/RootCauseAnalyzer.cs | 54.20% <0.00%> (-2.25%) ⬇️ |
| ... and 12 more | |

@@ -977,7 +977,7 @@ public void FastTreeBinaryClassificationTestSummary()
[Fact]
public void FastForestBinaryClassificationTestSummary()
{
-    var (pipeline, dataView) = GetOneHotBinaryClassificationPipeline();
+    var (pipeline, dataView) = GetBinaryClassificationPipeline();
var estimator = pipeline.Append(ML.BinaryClassification.Trainers.FastForest(
new FastForestBinaryTrainer.Options { NumberOfTrees = 2, NumberOfThreads = 1, NumberOfLeaves = 4, CategoricalSplit = true }));

Contributor commented:

Should this be changed too?

@harishsk (Contributor) left a comment:

:shipit:

@antoniovs1029 (Member) commented:

Hi @michaelgsharp. So I believe that in the end you didn't modify the inner workings of FastTree or how it deals with sparse data; instead, this PR updates the NGramLength default to 2 and updates the tests accordingly, right?

If so, can you please update the PR name and the description of the PR? Thanks!

@michaelgsharp changed the title from "Made FastTree work better with sparse data." to "Changed default NGram length from 1 to 2." on Jun 26, 2020
@michaelgsharp michaelgsharp merged commit 091bddf into dotnet:master Jun 26, 2020
@michaelgsharp michaelgsharp deleted the fast-tree-memory-fix branch June 26, 2020 16:30
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022