Skip to content

Add in ability to have pre-defined weights for ngrams. #6458

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Nov 23, 2022

Conversation

michaelgsharp
Copy link
Member

Add in ability to have pre-defined weights for ngrams.

@michaelgsharp michaelgsharp requested a review from a team November 17, 2022 20:49
@michaelgsharp michaelgsharp self-assigned this Nov 17, 2022
@codecov
Copy link

codecov bot commented Nov 17, 2022

Codecov Report

Merging #6458 (61044f5) into main (17c061a) will decrease coverage by 0.03%.
The diff coverage is 95.21%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6458      +/-   ##
==========================================
- Coverage   68.55%   68.51%   -0.04%     
==========================================
  Files        1170     1171       +1     
  Lines      246977   247159     +182     
  Branches    25792    25798       +6     
==========================================
+ Hits       169304   169337      +33     
- Misses      70950    71079     +129     
- Partials     6723     6743      +20     
Flag Coverage Δ
Debug 68.51% <95.21%> (-0.04%) ⬇️
production 62.89% <92.98%> (-0.06%) ⬇️
test 88.97% <98.64%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...c/Microsoft.ML.Transforms/Text/WordBagTransform.cs 82.50% <92.15%> (+3.90%) ⬆️
...t.ML.Tests/Transformers/WordBagTransformerTests.cs 98.64% <98.64%> (ø)
src/Microsoft.ML.Transforms/Text/TextCatalog.cs 68.62% <100.00%> (+1.28%) ⬆️
...soft.ML.Transforms/Text/WrappedTextTransformers.cs 99.15% <100.00%> (+0.04%) ⬆️
...osoft.ML.KMeansClustering/KMeansPlusPlusTrainer.cs 83.92% <0.00%> (-7.16%) ⬇️
src/Microsoft.ML.FastTree/Training/StepSearch.cs 57.42% <0.00%> (-4.96%) ⬇️
src/Microsoft.ML.Data/Training/TrainerUtils.cs 66.26% <0.00%> (-3.82%) ⬇️
...crosoft.ML.StandardTrainers/Standard/SdcaBinary.cs 85.40% <0.00%> (-3.42%) ⬇️
src/Microsoft.ML.Sweeper/AsyncSweeper.cs 71.42% <0.00%> (-1.37%) ⬇️
...oft.ML.StandardTrainers/Standard/SdcaMulticlass.cs 91.46% <0.00%> (-1.03%) ⬇️
... and 3 more

/// <param name="termSeparator">Separator used to separate terms/frequency pairs.</param>
/// <param name="freqSeparator">Separator used to separate terms from their frequency.</param>
/// This estimator operates over vector of text.</param>
public static WordBagEstimator ProduceWordBagsPreDefinedWeight(this TransformsCatalog.TextTransforms catalog,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericstj @luisquintanilla @tannergooding thoughts on this api name? The normal one is just called ProduceWordBags, but when naming this one that it can potentially cause ambiguity issues due to the default parameters not needing to be specified.

I had 2 options, either name it something different, or not have default parameters and make the users specify the term separator and frequency separator manually each time. I went with the first approach, but want to hear your opinions on it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tarekgh your thoughts as well if you have time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thoughts:

In main scenarios is it expected to be called once or twice in the code? If yes, then I prefer using ProduceWordBags without any optional parameter. Will be just overload. If the answer is no, then using a different name would be better, I guess. We may try to suggest a better name than ProduceWordBagsPreDefinedWeight. May be ``ProduceWordBagsEstimator` is simpler?

Copy link
Member

@bartonjs bartonjs Nov 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current name, I think a "With" should be in there (ProduceWordBagsWithPreDefinedWeight). Though since the caller doesn't get to specify it "PreDefined" seems more like "Default".

I assume the real reason a caller wants this method is to specify the term and/or freq separators. Can this not just be added as a true extra optional parameter overload using the established patterns?

+       [EditorBrowsable(EditorBrowsableState.Never)]
        public static WordBagEstimator ProduceWordBags(this TransformsCatalog.TextTransforms catalog,
            string outputColumnName,
            string inputColumnName = null,
-           int ngramLength = NgramExtractingEstimator.Defaults.NgramLength,
+           int ngramLength,
-           int skipLength = NgramExtractingEstimator.Defaults.SkipLength,
+           int skipLength,
-           bool useAllLengths = NgramExtractingEstimator.Defaults.UseAllLengths,
+           bool useAllLengths,
-           int maximumNgramsCount = NgramExtractingEstimator.Defaults.MaximumNgramsCount,
+           int maximumNgramsCount,
-           NgramExtractingEstimator.WeightingCriteria weighting = NgramExtractingEstimator.WeightingCriteria.Tf)
+           NgramExtractingEstimator.WeightingCriteria weighting)
            => new WordBagEstimator(Contracts.CheckRef(catalog, nameof(catalog)).GetEnvironment(),
                outputColumnName, inputColumnName, ngramLength, skipLength, useAllLengths, maximumNgramsCount, weighting);

+       public static WordBagEstimator ProduceWordBags(this TransformsCatalog.TextTransforms catalog,
+           string outputColumnName,
+           string inputColumnName = null,
+           int ngramLength = NgramExtractingEstimator.Defaults.NgramLength,
+           int skipLength = NgramExtractingEstimator.Defaults.SkipLength,
+           bool useAllLengths = NgramExtractingEstimator.Defaults.UseAllLengths,
+           int maximumNgramsCount = NgramExtractingEstimator.Defaults.MaximumNgramsCount,
+           NgramExtractingEstimator.WeightingCriteria weighting = NgramExtractingEstimator.WeightingCriteria.Tf,
+           char termSeparator = ';',
+           char freqSeparator = ':')
+           => new WordBagEstimator(parameters go here);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which overload will be chosen if the caller is passing only the first parameter?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which overload will be chosen if the caller is passing only the first parameter?

If the compiled in the past, the old one. If they compile after, the new one. (The old method, now marked as [EB(Never)] is only called by a) legacy callers or b) someone who happened to specify all of those parameters and none of the new ones... the compiler will still (and only) call it if the callsite has an exact match to the signature.

Since both methods are creating a WordBagEstimator I'm assuming that "specifying the weighting" and "specifying the freq/term-separator" aren't actually conflicting options. If they are conflicting options, then some sort of differentiated name is warranted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would changing the existing method though by adding new default parameters count as a breaking change @ericstj?

Adding the parameters to the existing method is a breaking change (a really, really bad one). Adding it the way I suggested is the way to make it look like all you did was add new default parameters.

From Framework Design Guidelines, 3rd edition, sec 5.1 (General Member Design Guidelines):

DO move all default parameters to the new, longer overload when adding optional parameters to an existing method.

[prose that says to do what I said to do above]

and down in Appendix D (breaking changes) (emphasis mine):

D.11.2 Adding or Removing a Method Parameter

Methods in the .NET CLR are identified by their signature, which is composed
of the name, return type, and ordered list of parameters. Removing
a parameter, adding a required parameter, or adding an optional parameter
all count as changing the signature and, therefore, are logically the
same as deleting the original method. See section D.9.3 for information on
the runtime impact of removing a member.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bartonjs the separators and the weight aren't conflicting options. The default should be just to use the frequency of the words, but there isn't a problem if they are both specified together (though I would bet that most people don't want to specify both together and probably would give them unexpected, though not incorrect, behavior if they do that).

I think making the name the same with a new overload where the user has to specify the separators is probably the best based on this convo. We don't want the separators specified unless the user really needs them, since thats the flag that tells word bag that it needs to handle the input data differently.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bartonjs I have made the changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That new catalog overload looks a lot nicer to me 😄

@michaelgsharp michaelgsharp merged commit bb563da into dotnet:main Nov 23, 2022
@michaelgsharp michaelgsharp deleted the ngram branch November 23, 2022 06:11
michaelgsharp added a commit to michaelgsharp/machinelearning that referenced this pull request Nov 23, 2022
Merging on red since it was approved but approver doesn't have write access.
@ghost ghost locked as resolved and limited conversation to collaborators Dec 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants