-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Combined methods related to splitting data into one single method. Also fixed related issues. #5227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5227 +/- ##
==========================================
+ Coverage 73.49% 73.59% +0.10%
==========================================
Files 1014 1014
Lines 188680 188963 +283
Branches 20330 20326 -4
==========================================
+ Hits 138677 139074 +397
+ Misses 44493 44396 -97
+ Partials 5510 5493 -17
|
new NormalizingEstimator.MinMaxColumnOptions(stratCol, stratificationColumn, ensureZeroUntouched: true)) | ||
.Fit(data).Transform(data); | ||
{ | ||
data = new ColumnCopyingEstimator(host,(stratCol,stratificationColumn)).Fit(data).Transform(data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ColumnCopyingEstimator [](start = 35, length = 22)
Should this be added to EnsureGroupPreservationColumn
as well?
And more generally, would it make sense to unify the two methods? There is also a method called GetSplitColumn
in line 285 in CrossValidationCommand.cs that does more or less the same thing. I think the same utility method should be called for maml, entry points and the C# API. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I took a quick look into the EnsureGroupPreservationColumn
callers, and it doesn't seem that they drop the created column, so this wouldn't be an issue there. I'll look into it more later to confirm this.
About merging these methods, I think it would be a good idea to maintain consistency across the codebase, although it wouldn't be necessary to fix the issue opened by @ganik to unblock NimbusML update. I'll look into this idea.
In reply to: 438552643 [](ancestors = 438552643)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirmed that when EnsureGroupPreservationColumn
was called, the samplingKeyColumn
was never dropped, so the issue didn't apply on that case.
Also, I've merged the methods you've mentioned, so now only CreateGroupPreservationColumn
exists
In reply to: 438969558 [](ancestors = 438969558,438552643)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: I've renamed CreateGroupPreservationColumn
to CreateSplitColumn
and made similar renames elsewhere to make the code easier to read. Now "splitColumn" is the name of the temporary column we create for splitting, regardless if it's based-off a "samplingKeyColumn" (ML.NET) or a "stratificationColumn" (NimbusML, Maml, legacy TLC naming conventions...)
Also now I do drop the new "splitColumn" when it's created while splitting through ML.NET's API, because not dropping it was causing other issues... see comment:
#5227 (comment)
Added 2 tests to cheked this is dropped, and updated my PR description to explain all the issues it's now fixing.
In reply to: 441992076 [](ancestors = 441992076,438969558,438552643)
Can you add a unit test that has a similar case to the NimbusML test that exposed this bug? #Resolved |
…xNormalizer for this assertion
var split = mlContext.Data.TrainTestSplit(input, 0.5, nameof(Input.TextStrat)); | ||
var ids = split.TestSet.GetColumn<int>(split.TestSet.Schema[nameof(Input.Id)]); | ||
Assert.Contains(1, ids); | ||
Assert.Contains(5, ids); | ||
split = mlContext.Data.TrainTestSplit(input, 0.5, nameof(Input.FloatStrat)); | ||
ids = split.TrainSet.GetColumn<int>(split.TrainSet.Schema[nameof(Input.Id)]); | ||
ids = split.TestSet.GetColumn<int>(split.TestSet.Schema[nameof(Input.Id)]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestSet [](start = 24, length = 7)
why? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of the change to ensureZeroUntouched = false
In reply to: 443437139 [](ancestors = 443437139)
** Fix issue with AutoML ** Enforce having the same schema before and after splitting, avoiding future issues * Added tests for these * Enforce that samplingKeyColumnName shouldn't be null by the time we split stuff, since anyway it will throw in the RangeFilter if its null. Also change its name to tempSamplingKeyColumnName since it's going to be dropped.
public void AutoFitWithPresplittedData() | ||
{ | ||
// Models created in AutoML should work over the same data, | ||
// no matter how that data is splitted before passing it to the experiment execution | ||
// or to the model for prediction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test abstracts the issue in #5256 (which is also the issue in #4048 and dotnet/machinelearning-modelbuilder#694 )
This test fails with a
System.ArgumentOutOfRangeException : Could not find input column 'SamplingKeyColumn'
without the changes on this PR.
The particular change that fixed this issue is dropping the (temporary version of) SamplingKeyColumn
(now called SplitColumn
), on DataOperationsCatalog.CrossValidationSplit
and DataOperationsCatalog.TrainTestSplit
in ML.NET (no change was needed in AutoML, but I added this AutoML test since this issue was found repeatedly by AutoML users).
…r renames for consistency. Internally we'll call the new column SplitColumn, regardless if it's based off a "samplingKeyColumn" or a "stratificationColumn"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approve on behalf of AutoML test change
{ | ||
env.CheckValue(splitColumn, nameof(splitColumn)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
splitColumn [](start = 27, length = 11)
Should we check/assert that this column exists in the data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it because if the column doesn't exist (or if splitColumn was null) it is going to throw an exception anyway inside the RangeFilter. Since it's necessary that the column exists for the RangeFilter to work, I thought it was sensible to check for this here and don't let splitColumn
to be null.
In reply to: 447414598 [](ancestors = 447414598)
Min = (double)fold / numberOfFolds, | ||
Max = (double)(fold + 1) / numberOfFolds, | ||
Complement = false, | ||
IncludeMin = true, | ||
IncludeMax = true | ||
}, data); | ||
|
||
yield return new TrainTestData(trainFilter, testFilter); | ||
var trainDV = ColumnSelectingTransformer.CreateDrop(env, trainFilter, splitColumn); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CreateDrop [](start = 57, length = 10)
Is this a new behavior? Or is it needed now because of the other changes?
If it is a new behavior, why is it needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropping the SplitColumn is something new to fix the situation I mentioned here:
#5227 (comment)
Not dropping it was causing an issue with AutoML. And also it didn't make sense that we didn't drop it, because if we didn't do it then the schema of a DataView was changed by splitting it, which I think should be considered unexpected behavior.
In reply to: 447415004 [](ancestors = 447415004)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR combines
CreateStratificationColumn
andEnsureGroupPreservationColumn
, and some code onGetSplitColumn
, into a new method calledCreateSplitColumn
. This method receives asamplingKeyColumnName
parameter and always creates a new column namedSplitColumn
based on thesamplingKeyColumn
... the new column is elsewhere in the code used for splitting datasets for Train-Test splits or Cross-Validation Splits, and it's always safe to drop the new column as well. Note: thesamplingKeyColumn
thatCreateSplitColumn
receives as parameter, or the column it creates, are calledsamplingKeyColumn
,stratificationColumn
,groupPreservationColumn
, orsplitColumn
(or maybe evenrowGroupColumn
?) elsewhere in the code or docs, depending on where on the code you're / which API you're calling (ML.NET's TrainTestSplit, CrossValidationSplit, Evaluate or CrossValidate... AutoML, MAML, or NimbusML/Entrypoints). The inconsistency of the naming for this column is a long-standing problem, but since they're names used both in public APIs and for historical reasons, I guess there's not much we can do now (see #2536, and related issues and PR's).This PR also solves multiple issues related with the "samplingKeyColumn" when splitting data:
Issue: When
samplingKeyColumn
is KeyType, dropping the original samplingKeyColumn when calling from entrypoints, makes it impossible to reuse that columnSolution: Use a copy column transformer to copy the original column, so that it's safe to drop the new column elsewhere without loosing the original column
Fixes #5221 : Issue with NimbusML test
Issue : a "samplingKeyColumn" is created, but not dropped, when users split data with ML.NET's API before training a model with AutoML, causing some problems when reusing the AutoML model
Solution: Always drop the new column when calling from ML.NET's APIs. Note: to distinguish between the "samplingKeyColumn" provided by the user, and the one we internally create, the new column is now called "SplitColumn".
Fixes #5256
Fixes #4048
Fixes dotnet/machinelearning-modelbuilder#694
Issue: If the "samplingKeyColumn" is float/double type, and includes negative values, or is normalized and outside the (0,1) range, then it might not work as expected.
Solution: Use
ensureZeroUntouched = false
in the normalizing estimator, and always normalize if it's float/doubleFixes #5227 (comment)