-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Rename CV and TrainTest "stratification" parameter #2537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rename CV and TrainTest "stratification" parameter #2537
Conversation
Something has gone weird with my master branch, and I'm seeing every merge from upstream as a commit. I'll look into this, but let's work with it for now ;) |
Codecov Report
@@ Coverage Diff @@
## master #2537 +/- ##
==========================================
+ Coverage 71.26% 71.44% +0.17%
==========================================
Files 797 801 +4
Lines 141292 141855 +563
Branches 16118 16141 +23
==========================================
+ Hits 100692 101346 +654
+ Misses 36138 36041 -97
- Partials 4462 4468 +6
|
/// If the <paramref name="stratificationColumn"/> is not provided, the random numbers generated to create it, will use this seed as value. | ||
/// And if it is not provided, the default value will be used.</param> | ||
public TrainTestData TrainTestSplit(IDataView data, double testFraction = 0.1, string stratificationColumn = null, uint? seed = null) | ||
/// <param name="groupPreservationColumn">Name of a column to use as an ID for grouping rows. If two examples share the same value of the <paramref name="groupPreservationColumn"/>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you left that description from previous iteration with idColumn
.
Do you want to rephrase it, since ID
looks weird. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also ID is already a thing on IDs, naming something different that would be super confusing.
public abstract ValueGetter<DataViewRowId> GetIdGetter(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I @rogancarr , thanks for looking at this.
Group column is already an established name used to describe actual things that are intended to form a group. This is more general than this. So I do not like the name "group preservation." It's far too close. "Hey I specified this group column, how come my ranking metrics are all funky?" "Ah, you did not specify the group column, but the group preservation column!" Looks silly.
I'd almost prefer a random name drawn out of a dictionary to this. That would just look weird, but this will just be consistently confusing. I'll post suggestions in issue.
You commited directly into Once you're ready to restore your master branch, the easiest way to do this would be to |
Yes. I fixed that a couple days ago, but this PR still has the remnants :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @rogancarr , this makes me a bit more comfortable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR changes the
CrossValidation
andTrainTest
parameterStratificationColumn
to beGroupPreservationColumn
and updates the docstrings to give a clearer explanation.Fixes #2536
See conversation in #2536
Related to #1204, but that issue might be asking for further coverage.