Changed default NGram length from 1 to 2. #5248
Conversation
…nt with sparse data than before, slightly less performant if it's dense data
@@ -1012,7 +1012,7 @@ private protected bool UsingMaxLabel
 }

 private FeatureFlockBase CreateOneHotFlock(IChannel ch,
-    List<int> features, int[] binnedValues, int[] lastOn, ValuesList[] instanceList,
+    List<int> features, int[] binnedValues, int[] lastOn, Dictionary<int, ValuesList> instanceList,
Nice fix! Since FastTree is one of the primary trainers in ML.NET, I'd recommend testing the speed & memory on a variety of datasets.
I see that you're reporting speed/memory on the FastTree ranking Test (not TrainTest): the Test_Ranking_MSLRWeb10K_RawNumericFeatures_FastTreeRanking benchmark. For MAML, Test only runs the prediction step. Running the training (TrainTest) benchmark, TrainTest_Ranking_MSLRWeb10K_RawNumericFeatures_FastTreeRanking, might be more telling.
The MSLR dataset is purely numeric, with no text columns in it, so it shouldn't be affected by the ngram length change.
Adding a FastTree text benchmark
You could add FastTree to the text benchmark in Benchmarks/Text/MultiClassClassification.cs. Code for adding a FastTree benchmark on WikiDetox would be:
[Benchmark]
public void CV_Multiclass_WikiDetox_BigramsAndTrichar_OVAFastTree()
{
    string cmd = @"CV k=5 data=" + _dataPathWiki +
        " loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+}" +
        " xf=Convert{col=logged_in type=R4}" +
        " xf=CategoricalTransform{col=ns}" +
        " xf=TextTransform{col=FeaturesText:comment wordExtractor=NGramExtractorTransform{ngram=2}}" +
        " xf=Concat{col=Features:FeaturesText,logged_in,ns}" +
        " tr=OVA{p=FastTree}";

    var environment = EnvironmentFactory.CreateClassificationEnvironment<TextLoader, OneHotEncodingTransformer, FastTreeTrainer, LinearBinaryModelParameters>();
    cmd.ExecuteMamlCommand(environment);
}
Heuristic
After testing various datasets, you may find the dictionary is beneficial for small (or large) sparse datasets; if we find that, we could use a heuristic like useDictionary = sparseness > 0.95 && rowsTimesSlots < 1E6;
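As a rough illustration, a minimal sketch of how that heuristic might be computed (all names here are hypothetical, taken from the comment above, not existing FastTree fields):

static bool UseDictionary(long rowCount, long slotCount, long nonZeroCount)
{
    long rowsTimesSlots = rowCount * slotCount;                       // size a dense array would need
    double sparseness = 1.0 - (double)nonZeroCount / rowsTimesSlots;  // fraction of empty slots
    return sparseness > 0.95 && rowsTimesSlots < 1E6;                 // thresholds from the suggestion above
}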
Do you have any example datasets I can use, or could you point me to where I can find some? I'm not sure where I would find datasets for something like this.
I was talking with Harish this morning about the sparseness and choosing between a list and a dictionary. We are going to discuss it again, but I couldn't find a way to know ahead of time whether the data would be sparse or dense without the user letting us know.
The test FastForestBinaryClassificationTestSummary uses a small dataset. It has about 1300 unique words, and the total word count is about 9000. The current FastTree trainer multiplies those two values together (about 11.5 million) and allocates an array of that size right off the bat, which is where the huge memory usage comes from. It ends up using a very small fraction of that amount during training, but I couldn't figure out how to tell ahead of time how much it would actually use.
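To make the allocation difference concrete, here is a minimal sketch contrasting the two strategies; the types and numbers are illustrative stand-ins, not FastTree's actual internals:

using System.Collections.Generic;

// Illustrative stand-in for FastTree's ValuesList type.
class ValuesList { public List<int> Values = new List<int>(); }

class AllocationSketch
{
    static void Main()
    {
        int uniqueWords = 1300, totalWords = 9000;

        // Old behavior: ~11.7M slots reserved up front, mostly never touched.
        var dense = new ValuesList[uniqueWords * totalWords];

        // Behavior after the diff above: an entry is created only when a feature is actually seen.
        var sparse = new Dictionary<int, ValuesList>();
        int featureIndex = 42; // hypothetical feature being binned
        if (!sparse.TryGetValue(featureIndex, out var values))
            sparse[featureIndex] = values = new ValuesList();
        values.Values.Add(7); // record an observed value
    }
}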
The benchmark I was suggesting uses WikiDetox, which is a 70MB text dataset. See my recommended code for CV_Multiclass_WikiDetox_BigramsAndTrichar_OVAFastTree, above. You should be able to just drop that code into Benchmarks/Text/MultiClassClassification.cs.
The test FastForestBinaryClassificationTestSummary uses a small dataset. It has about 1300 unique words and total word count is about 9000.
Tests run on micro-datasets to check whether something has changed. The benchmarks are meant to use real-world datasets, like MSLR and WikiDetox.
Would it be possible to measure the sparsity at runtime? Either on a subsample before you choose the array/dictionary, or after N% of the data has passed.
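A sketch of that idea, deciding the representation from a leading subsample (the shape of this helper is an assumption for illustration, not the trainer's real API):

using System.Collections.Generic;
using System.Linq;

static class SparsitySampler
{
    // Inspect the first sampleSize rows, then decide array vs. dictionary.
    public static bool ShouldUseDictionary(IEnumerable<float[]> rows, int sampleSize = 1000)
    {
        long slotsSeen = 0, nonZero = 0;
        foreach (var row in rows.Take(sampleSize))
        {
            slotsSeen += row.Length;
            nonZero += row.Count(v => v != 0);
        }
        if (slotsSeen == 0)
            return false; // nothing sampled; fall back to the dense default
        double sparseness = 1.0 - (double)nonZero / slotsSeen;
        return sparseness > 0.95; // same threshold as the heuristic suggested earlier
    }
}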
OK: after all the investigation, testing, and syncing you and I did, I was only ever able to reproduce this issue when the feature column has been one-hot encoded. Per our offline discussion, this should never happen (and when it does, an error is usually thrown for trying to make an array that is too large), so the test case appears to be wrong and my changes are not necessary. Even when the input data was sparse, this situation did not repro.
Due to this, I am going to revert my changes to FastTree and just fix the test instead.
Codecov Report
@@            Coverage Diff             @@
##           master    #5248      +/-   ##
==========================================
- Coverage   73.58%   73.54%    -0.04%
==========================================
  Files        1013     1014        +1
  Lines      188446   188682      +236
  Branches    20289    20330       +41
==========================================
+ Hits       138659   138768      +109
- Misses      44307    44412      +105
- Partials     5480     5502       +22
@@ -977,7 +977,7 @@ public void FastTreeBinaryClassificationTestSummary()
 [Fact]
 public void FastForestBinaryClassificationTestSummary()
 {
-    var (pipeline, dataView) = GetOneHotBinaryClassificationPipeline();
+    var (pipeline, dataView) = GetBinaryClassificationPipeline();
     var estimator = pipeline.Append(ML.BinaryClassification.Trainers.FastForest(
         new FastForestBinaryTrainer.Options { NumberOfTrees = 2, NumberOfThreads = 1, NumberOfLeaves = 4, CategoricalSplit = true }));
Should this be changed too?
Hi, @michaelgsharp. So I believe that in the end you didn't modify the inner workings of FastTree or how it deals with sparse data; instead, this PR updates the NgramLength default to 2 and updates the tests accordingly, right? If so, can you please update the PR name and description? Thanks!
This is part of the work for #4749; other PRs will follow to split the work up. When the default value for NGrams was changed from 1 to 2, we discovered that memory was exploding during FastTree training, causing test failures in some x86 tests. This PR changes the default value for NGramLength from 1 to 2 and also changes FastTree so it handles sparse data better.

The main portion of the sparse data change is switching from an array to a dictionary, so that memory is only allocated when it is needed instead of all up front. When running the previously failing test, it now passes with less memory usage and is actually faster due to less GC running.

The slowdown for dense data appears to be very small: comparing the ranking benchmark before and after the change shows about a 10% slowdown. In return, the memory used is lower. In one NGram test with NGramLength = 2, memory before the change was 3.4 GB; after this change it was 400 MB.

Edit (leaving the original text above for context): after doing more testing and benchmarking, it was discovered that the test case was wrong and the FastTree implementation was fine. This PR now just fixes the test case while updating the default NGramLength to 2.
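Since the user-visible change here is the NgramLength default moving from 1 to 2, a pipeline can pin the length explicitly so it is unaffected by the default. A minimal sketch using ML.NET's public text catalog (column names are placeholders):

using Microsoft.ML;

var mlContext = new MLContext();

// Tokenize, key-encode, then request bigrams explicitly instead of
// relying on the default this PR changes from 1 to 2.
var pipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text")
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("Ngrams", "Tokens", ngramLength: 2));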