-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Allow TextLoader to load empty float/double fields as NaN instead of 0 #5198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/// <summary> | ||
/// If true, empty float fields will be loaded as NaN. If false, they'll be loaded as 0. Default is false. | ||
/// </summary> | ||
[Argument(ArgumentType.AtMostOnce, HelpText = "If true, empty float fields will be loaded as NaN. If false, they'll be loaded as 0. Default is false.", ShortName = "imputefloat")] | ||
public bool ImputeEmptyFloats = Defaults.ImputeEmptyFloats; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not happy with the name I chose for this option, and for text I wrote as its doc. I think the name "field" can mean many things in the C# world (but inside the code of the TextLoader a "field" is the value found on a given column on a given row). Also, this option is supposed to be related to both Single and Doubles, so "float" is misleading. And "empty" would actually mean empty fields or fields with only whitespace. So I'm not sure how to name this and what to say on the public docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure 'Impute' fits the terminology here. A few options to consider: ParseEmptyFloatsAsNan, ParseEmptyRealsAsNan, EmptyFieldsToNan, EmptyValuesToNan, etc.
In reply to: 434309670 [](ancestors = 434309670)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the word "impute" as that was the original name @justinormont suggested here #4132 (comment)
I wouldn't know if that's a common term in which our users could understand the feature by reading it.
As for the names you provided, perhaps EmptyRealsAsNan would fit well, as it avoids the word Field, and "Reals" could imply Floats or Doubles.
I will leave unaddressed right now, and see if we get any suggestion on this. It's only a variable name change, and I'll be able to make the change before merging the PR.
In reply to: 434886717 [](ancestors = 434886717,434309670)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've decided to go for MissingRealsAsNan , since it's not only the Empty fields that get mapped as NaNs, but also the ones that only have whitespace, and now also the ones were there are missing columns in a given row.
In reply to: 435163479 [](ancestors = 435163479,434886717,434309670)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case you should fix the documentation above to read "If true, empty or missing float and double fields will be loaded as NaN"
In reply to: 437077933 [](ancestors = 437077933,435163479,434886717,434309670)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the documentation reads:
/// <summary>
/// If true, missing real fields (i.e. double or single fields) will be loaded as NaN.
/// If false, they'll be loaded as 0. Default is false.
/// A field is considered "missing" if it's empty, if it only has whitespace, or if there are missing columns
/// at the end of a given row.
/// </summary>
In reply to: 437170658 [](ancestors = 437170658,437077933,435163479,434886717,434309670)
// The next two rows map the missing columns as 0, as it was decided not to impute with NaN | ||
// in this case | ||
6,"this has no date, num3 or num4 (the separator corresponding to them is also missing)",6.6,66.66,1/1/0001,0,0 | ||
7,"this has no num4 (the separator corresponding to it is missing)",7.7,77.77,7/7/2007,777.777,0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(appended to example to allow for threaded discussion, this is how it will actually load missing columns at the end of the row)
The feature I implement on this PR means that "1,,dog" will be loaded as "1,NaN,dog" if the second column is a float/double. And to load "1,dog," as "1,dog,NaN" if the third column is a float/double. This was the behavior requested on #4132
What to do with incomplete rows was undefined in the request. I.e. what to do if there are supposed to be 6 columns but on one row there's only 2? (I.e. 1,dog
, without the last separator)
In such a case, I took the decision to simply load the missing columns with default values (i.e. 0 for Float and Double) even when using the new feature of this PR.
The reason behind this is that the implementation required to load those missing columns as NaNs only for float/doubles would be somewhat hacky (and I didn't figure out how to make it work for vector types which included fields on missing columns on a given row). Besides, since it might be a fringe case. it's perhaps better not to support it.
Any thoughts on this, @justinormont @harishsk ? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the current behavior of the Textloader when it encounters incomplete rows? Does it thrown an error or does it load default values for missing columns?
If it is the latter, then I think we should fix it to set those columns to NaN because that is the spirit of this PR.
In reply to: 434313813 [](ancestors = 434313813)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of incomplete rows it returns default values, and doesn't throw, both with or without the new ImputeEmptyFloats option.
I'll update this PR to have it return NaNs, but I won't close this thread because I'd still want to leave this discussion on. Once I update this PR with that fix, we can decide if this case is worth it the changes that would need to be introduced to cover it.
In reply to: 434896173 [](ancestors = 434896173,434313813)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To fix this case I'll also need to do some checks like typeof(TValue) == typeof(float)
and casts like (TValue)(obj) Single.NaN)
similar to the ones described here:
#5198 (comment)
The same concerns I've mentioned there apply to this, although in here I think I can change the code to do the cast only when there are missing float/doubles columns for a given row and the ImputeEmptyFloats option is enabled.
In reply to: 435167058 [](ancestors = 435167058,434896173,434313813)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turned out fixing this case was much simpler than what I had originally expected, and no need to the (TValue)(obj)Single.NaN
cast I thought about doing. So I've updated mi changes to fix this case 😄
In reply to: 435219809 [](ancestors = 435219809,435167058,434896173,434313813)
Codecov Report
@@ Coverage Diff @@
## master #5198 +/- ##
==========================================
+ Coverage 73.31% 73.33% +0.02%
==========================================
Files 1007 1007
Lines 187969 188123 +154
Branches 20241 20245 +4
==========================================
+ Hits 137812 137967 +155
Misses 44630 44630
+ Partials 5527 5526 -1
|
While you are here, is it possible to combine these two Parse functions with a generic implementation that covers float and double? The one obstacle I can see is the Single.NaN and Double.NaN. Can that be factored in by extending the Result enum to have a Result.NaN or some other similar technique? #Resolved Refers to: src/Microsoft.ML.Core/Utilities/DoubleParser.cs:109 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False) |
So I found that this worked, and I guess I can use this to refactor as per your suggestion: static private void ReturnNaN<TValue>(out TValue val)
{
if(typeof(TValue) == typeof(Single))
{
Console.WriteLine("this was float!");
// val = (TValue)Convert.ChangeType(Single.NaN, typeof(TValue)); // this also works
val = (TValue)(object)Single.NaN; // this also works
return;
}
else if(typeof(TValue) == typeof(Double))
{
Console.WriteLine("this was double!");
// val = (TValue)Convert.ChangeType(Double.NaN, typeof(TValue)); // this also works
val = (TValue)(object)Double.NaN; // this also works
return;
}
throw new Exception();
} Still, I'm not sure if it's ok to do the above... In general it's commented that checking But on the other hand, in our code base there are already multiple generic methods where we check for machinelearning/src/Microsoft.ML.Data/Data/BufferBuilder.cs Lines 92 to 99 in c0eeea9
machinelearning/src/Microsoft.ML.TensorFlow/TensorflowTransform.cs Lines 798 to 807 in c0eeea9
Although I notice that in those 2 examples, we're actually returning a reference type, whereas in the refactor needed to return NaN it would be returning a value type... so I'm not sure if there's another workaround, although in the stack overflow link I posted above (and other places I checked) it does look that the (object) cast would be necessary even when returning Float/Double value type. Also, regarding performance, I don't know if there's any cost in casting from value to object to value... but I guess it would negligible as it would only occur when returning NaN? In reply to: 638777783 [](ancestors = 638777783,638494349) Refers to: src/Microsoft.ML.Core/Utilities/DoubleParser.cs:109 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False) |
Codecov Report
@@ Coverage Diff @@
## master #5198 +/- ##
==========================================
+ Coverage 73.31% 73.55% +0.23%
==========================================
Files 1007 1016 +9
Lines 187969 190040 +2071
Branches 20241 20438 +197
==========================================
+ Hits 137812 139777 +1965
- Misses 44630 44693 +63
- Partials 5527 5570 +43
|
Never mind. This seems a more invasive change. And even if we were to do this, this is a big enough change that it shouldn't be part of this PR. In reply to: 638818549 [](ancestors = 638818549,638777783,638494349) Refers to: src/Microsoft.ML.Core/Utilities/DoubleParser.cs:109 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False) |
Yeah. You're right. If I don't check for version then the CheckDecode in here is pretty much being ignored. Will fix this. EDIT: DONE. In reply to: 640419819 [](ancestors = 640419819) Refers to: src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs:1419 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False) |
And I'll add a MAML test to test that this is being saved and loaded correctly (I believe saving and loading a textloader can only be tested through MAML). In reply to: 640708134 [](ancestors = 640708134,640419819) Refers to: src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs:1419 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixes #4132 by implementing a new option on TextLoader that will allow it to impute missing float/doubles as NaN instead of the default
0
.Builds on #5154