-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Ignore hidden columns in AutoML schema checks of validation data #4490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignore hidden columns in AutoML schema checks of validation data #4490
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
@@ -183,14 +183,19 @@ private static void ValidateValidationData(IDataView trainData, IDataView valida | |||
|
|||
const string schemaMismatchError = "Training data and validation data schemas do not match."; | |||
|
|||
if (trainData.Schema.Count != validationData.Schema.Count) | |||
if (trainData.Schema.Count(c => !c.IsHidden) != validationData.Schema.Count(c => !c.IsHidden)) | |||
{ | |||
throw new ArgumentException($"{schemaMismatchError} Train data has '{trainData.Schema.Count}' columns," + | |||
$"and validation data has '{validationData.Schema.Count}' columns.", nameof(validationData)); | |||
} | |||
|
|||
foreach (var trainCol in trainData.Schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use a comment...
foreach (var trainCol in trainData.Schema) | |
// Also indirectly checks for new columns in the validation datasets as we above enforce the column counts are equal | |
foreach (var trainCol in trainData.Schema) |
I was otherwise going to suggest we check the reverse direction. A comment helps future readers to not have to think through it.
Codecov Report
@@ Coverage Diff @@
## master #4490 +/- ##
=========================================
Coverage ? 74.85%
=========================================
Files ? 908
Lines ? 159880
Branches ? 17215
=========================================
Hits ? 119678
Misses ? 35384
Partials ? 4818
|
Closes #4491
When the AutoML API consumes data, it validates schema consistency between the train and validation data.
There are two bugs in this logic:
The API asserts that the count of columns in the train and validation data must be equal. This throws an exception if the two data views have the same number of active columns but a different number of hidden columns. This PR updates to assert that the # of active (not hidden) columns in the train and validation data are equal.
If either the train or validation data has a hidden column with a type that differs from an active column of the same name, an exception is thrown. This PR restricts type consistency checks to active columns only.