-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Add in ability to have pre-defined weights for ngrams. #6458
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #6458 +/- ##
==========================================
- Coverage 68.55% 68.51% -0.04%
==========================================
Files 1170 1171 +1
Lines 246977 247159 +182
Branches 25792 25798 +6
==========================================
+ Hits 169304 169337 +33
- Misses 70950 71079 +129
- Partials 6723 6743 +20
Flags with carried forward coverage won't be shown. Click here to find out more.
|
/// <param name="termSeparator">Separator used to separate terms/frequency pairs.</param> | ||
/// <param name="freqSeparator">Separator used to separate terms from their frequency.</param> | ||
/// This estimator operates over vector of text.</param> | ||
public static WordBagEstimator ProduceWordBagsPreDefinedWeight(this TransformsCatalog.TextTransforms catalog, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ericstj @luisquintanilla @tannergooding thoughts on this api name? The normal one is just called ProduceWordBags
, but when naming this one that it can potentially cause ambiguity issues due to the default parameters not needing to be specified.
I had 2 options, either name it something different, or not have default parameters and make the users specify the term separator and frequency separator manually each time. I went with the first approach, but want to hear your opinions on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tarekgh your thoughts as well if you have time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thoughts:
In main scenarios is it expected to be called once or twice in the code? If yes, then I prefer using ProduceWordBags
without any optional parameter. Will be just overload. If the answer is no, then using a different name would be better, I guess. We may try to suggest a better name than ProduceWordBagsPreDefinedWeight
. May be ``ProduceWordBagsEstimator` is simpler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the current name, I think a "With" should be in there (ProduceWordBagsWithPreDefinedWeight
). Though since the caller doesn't get to specify it "PreDefined" seems more like "Default".
I assume the real reason a caller wants this method is to specify the term and/or freq separators. Can this not just be added as a true extra optional parameter overload using the established patterns?
+ [EditorBrowsable(EditorBrowsableState.Never)]
public static WordBagEstimator ProduceWordBags(this TransformsCatalog.TextTransforms catalog,
string outputColumnName,
string inputColumnName = null,
- int ngramLength = NgramExtractingEstimator.Defaults.NgramLength,
+ int ngramLength,
- int skipLength = NgramExtractingEstimator.Defaults.SkipLength,
+ int skipLength,
- bool useAllLengths = NgramExtractingEstimator.Defaults.UseAllLengths,
+ bool useAllLengths,
- int maximumNgramsCount = NgramExtractingEstimator.Defaults.MaximumNgramsCount,
+ int maximumNgramsCount,
- NgramExtractingEstimator.WeightingCriteria weighting = NgramExtractingEstimator.WeightingCriteria.Tf)
+ NgramExtractingEstimator.WeightingCriteria weighting)
=> new WordBagEstimator(Contracts.CheckRef(catalog, nameof(catalog)).GetEnvironment(),
outputColumnName, inputColumnName, ngramLength, skipLength, useAllLengths, maximumNgramsCount, weighting);
+ public static WordBagEstimator ProduceWordBags(this TransformsCatalog.TextTransforms catalog,
+ string outputColumnName,
+ string inputColumnName = null,
+ int ngramLength = NgramExtractingEstimator.Defaults.NgramLength,
+ int skipLength = NgramExtractingEstimator.Defaults.SkipLength,
+ bool useAllLengths = NgramExtractingEstimator.Defaults.UseAllLengths,
+ int maximumNgramsCount = NgramExtractingEstimator.Defaults.MaximumNgramsCount,
+ NgramExtractingEstimator.WeightingCriteria weighting = NgramExtractingEstimator.WeightingCriteria.Tf,
+ char termSeparator = ';',
+ char freqSeparator = ':')
+ => new WordBagEstimator(parameters go here);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which overload will be chosen if the caller is passing only the first parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which overload will be chosen if the caller is passing only the first parameter?
If the compiled in the past, the old one. If they compile after, the new one. (The old method, now marked as [EB(Never)] is only called by a) legacy callers or b) someone who happened to specify all of those parameters and none of the new ones... the compiler will still (and only) call it if the callsite has an exact match to the signature.
Since both methods are creating a WordBagEstimator
I'm assuming that "specifying the weighting" and "specifying the freq/term-separator" aren't actually conflicting options. If they are conflicting options, then some sort of differentiated name is warranted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would changing the existing method though by adding new default parameters count as a breaking change @ericstj?
Adding the parameters to the existing method is a breaking change (a really, really bad one). Adding it the way I suggested is the way to make it look like all you did was add new default parameters.
From Framework Design Guidelines, 3rd edition, sec 5.1 (General Member Design Guidelines):
DO move all default parameters to the new, longer overload when adding optional parameters to an existing method.
[prose that says to do what I said to do above]
and down in Appendix D (breaking changes) (emphasis mine):
D.11.2 Adding or Removing a Method Parameter
Methods in the .NET CLR are identified by their signature, which is composed
of the name, return type, and ordered list of parameters. Removing
a parameter, adding a required parameter, or adding an optional parameter
all count as changing the signature and, therefore, are logically the
same as deleting the original method. See section D.9.3 for information on
the runtime impact of removing a member.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bartonjs the separators and the weight aren't conflicting options. The default should be just to use the frequency of the words, but there isn't a problem if they are both specified together (though I would bet that most people don't want to specify both together and probably would give them unexpected, though not incorrect, behavior if they do that).
I think making the name the same with a new overload where the user has to specify the separators is probably the best based on this convo. We don't want the separators specified unless the user really needs them, since thats the flag that tells word bag that it needs to handle the input data differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bartonjs I have made the changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That new catalog overload looks a lot nicer to me 😄
Merging on red since it was approved but approver doesn't have write access.
Add in ability to have pre-defined weights for ngrams.