Allow TextLoader to load empty float/double fields as NaN instead of 0 #5198

antoniovs1029 · 2020-06-03T05:02:07Z

Fixes #4132 by implementing a new option on TextLoader that will allow it to impute missing float/doubles as NaN instead of the default 0.
Builds on #5154

…Parser. Made changes to make them work.

This reverts commit c8cace2.

This reverts commit 684d359.

antoniovs1029 · 2020-06-03T05:05:04Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

+            /// <summary>
+            /// If true, empty float fields will be loaded as NaN. If false, they'll be loaded as 0. Default is false.
+            /// </summary>
+            [Argument(ArgumentType.AtMostOnce, HelpText = "If true, empty float fields will be loaded as NaN. If false, they'll be loaded as 0. Default is false.", ShortName = "imputefloat")]
+            public bool ImputeEmptyFloats = Defaults.ImputeEmptyFloats;


I'm not happy with the name I chose for this option, and for text I wrote as its doc. I think the name "field" can mean many things in the C# world (but inside the code of the TextLoader a "field" is the value found on a given column on a given row). Also, this option is supposed to be related to both Single and Doubles, so "float" is misleading. And "empty" would actually mean empty fields or fields with only whitespace. So I'm not sure how to name this and what to say on the public docs.

I am not sure 'Impute' fits the terminology here. A few options to consider: ParseEmptyFloatsAsNan, ParseEmptyRealsAsNan, EmptyFieldsToNan, EmptyValuesToNan, etc.

In reply to: 434309670 [](ancestors = 434309670)

I used the word "impute" as that was the original name @justinormont suggested here #4132 (comment)
I wouldn't know if that's a common term in which our users could understand the feature by reading it.

As for the names you provided, perhaps EmptyRealsAsNan would fit well, as it avoids the word Field, and "Reals" could imply Floats or Doubles.

I will leave unaddressed right now, and see if we get any suggestion on this. It's only a variable name change, and I'll be able to make the change before merging the PR.

In reply to: 434886717 [](ancestors = 434886717,434309670)

I've decided to go for MissingRealsAsNan , since it's not only the Empty fields that get mapped as NaNs, but also the ones that only have whitespace, and now also the ones were there are missing columns in a given row.

In reply to: 435163479 [](ancestors = 435163479,434886717,434309670)

In that case you should fix the documentation above to read "If true, empty or missing float and double fields will be loaded as NaN"

In reply to: 437077933 [](ancestors = 437077933,435163479,434886717,434309670)

Yeah, the documentation reads:

/// <summary> /// If true, missing real fields (i.e. double or single fields) will be loaded as NaN. /// If false, they'll be loaded as 0. Default is false. /// A field is considered "missing" if it's empty, if it only has whitespace, or if there are missing columns /// at the end of a given row. /// </summary>

In reply to: 437170658 [](ancestors = 437170658,437077933,435163479,434886717,434309670)

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

antoniovs1029 · 2020-06-03T05:20:41Z

test/BaselineOutput/Common/TextLoader/missing_fields-with-impute.csv

+// The next two rows map the missing columns as 0, as it was decided not to impute with NaN
+// in this case
+6,"this has no date, num3 or num4 (the separator corresponding to them is also missing)",6.6,66.66,1/1/0001,0,0
+7,"this has no num4 (the separator corresponding to it is missing)",7.7,77.77,7/7/2007,777.777,0


(appended to example to allow for threaded discussion, this is how it will actually load missing columns at the end of the row)

The feature I implement on this PR means that "1,,dog" will be loaded as "1,NaN,dog" if the second column is a float/double. And to load "1,dog," as "1,dog,NaN" if the third column is a float/double. This was the behavior requested on #4132

What to do with incomplete rows was undefined in the request. I.e. what to do if there are supposed to be 6 columns but on one row there's only 2? (I.e. 1,dog, without the last separator)

In such a case, I took the decision to simply load the missing columns with default values (i.e. 0 for Float and Double) even when using the new feature of this PR.

The reason behind this is that the implementation required to load those missing columns as NaNs only for float/doubles would be somewhat hacky (and I didn't figure out how to make it work for vector types which included fields on missing columns on a given row). Besides, since it might be a fringe case. it's perhaps better not to support it.

Any thoughts on this, @justinormont @harishsk ? #Resolved

What is the current behavior of the Textloader when it encounters incomplete rows? Does it thrown an error or does it load default values for missing columns?
If it is the latter, then I think we should fix it to set those columns to NaN because that is the spirit of this PR.

In reply to: 434313813 [](ancestors = 434313813)

In the case of incomplete rows it returns default values, and doesn't throw, both with or without the new ImputeEmptyFloats option.

I'll update this PR to have it return NaNs, but I won't close this thread because I'd still want to leave this discussion on. Once I update this PR with that fix, we can decide if this case is worth it the changes that would need to be introduced to cover it.

In reply to: 434896173 [](ancestors = 434896173,434313813)

To fix this case I'll also need to do some checks like typeof(TValue) == typeof(float) and casts like (TValue)(obj) Single.NaN) similar to the ones described here:
#5198 (comment)

The same concerns I've mentioned there apply to this, although in here I think I can change the code to do the cast only when there are missing float/doubles columns for a given row and the ImputeEmptyFloats option is enabled.

In reply to: 435167058 [](ancestors = 435167058,434896173,434313813)

It turned out fixing this case was much simpler than what I had originally expected, and no need to the (TValue)(obj)Single.NaN cast I thought about doing. So I've updated mi changes to fix this case 😄

In reply to: 435219809 [](ancestors = 435219809,435167058,434896173,434313813)

This reverts commit 65fcb59.

…s if it runs after the new test for the Impute option.

This reverts commit df32ba1.

test/Microsoft.ML.Tests/TextLoaderTests.cs

codecov · 2020-06-03T06:46:34Z

Codecov Report

Merging #5198 into master will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #5198      +/-   ##
==========================================
+ Coverage   73.31%   73.33%   +0.02%     
==========================================
  Files        1007     1007              
  Lines      187969   188123     +154     
  Branches    20241    20245       +4     
==========================================
+ Hits       137812   137967     +155     
  Misses      44630    44630              
+ Partials     5527     5526       -1

Flag	Coverage Δ
#Debug	`73.33% <100.00%> (+0.02%)`	⬆️
#production	`69.07% <100.00%> (+<0.01%)`	⬆️
#test	`87.49% <100.00%> (+0.04%)`	⬆️

Impacted Files	Coverage Δ
src/Microsoft.ML.Data/Data/Conversion.cs	`79.06% <ø> (+0.37%)`	⬆️
.../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs	`82.13% <100.00%> (+0.06%)`	⬆️
...soft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs	`87.15% <100.00%> (+0.13%)`	⬆️
test/Microsoft.ML.Tests/TextLoaderTests.cs	`97.02% <100.00%> (+0.45%)`	⬆️
...c/Microsoft.ML.FastTree/Utils/ThreadTaskManager.cs	`79.48% <0.00%> (-20.52%)`	⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs	`89.45% <0.00%> (+0.15%)`	⬆️
....ML.AutoML/PipelineSuggesters/PipelineSuggester.cs	`83.19% <0.00%> (+3.36%)`	⬆️

harishsk · 2020-06-03T22:28:56Z

While you are here, is it possible to combine these two Parse functions with a generic implementation that covers float and double? The one obstacle I can see is the Single.NaN and Double.NaN. Can that be factored in by extending the Result enum to have a Result.NaN or some other similar technique? #Resolved

Refers to: src/Microsoft.ML.Core/Utilities/DoubleParser.cs:109 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False)

antoniovs1029 · 2020-06-04T10:58:50Z

I'm not sure it's doable, but I'll explore the possibilities.

In reply to: 638494349 [](ancestors = 638494349)

Refers to: src/Microsoft.ML.Core/Utilities/DoubleParser.cs:109 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False)

antoniovs1029 · 2020-06-04T12:32:52Z

So I found that this worked, and I guess I can use this to refactor as per your suggestion:

    static private void ReturnNaN<TValue>(out TValue val)
    {
        if(typeof(TValue) == typeof(Single))
        {
            Console.WriteLine("this was float!");
            // val = (TValue)Convert.ChangeType(Single.NaN, typeof(TValue)); // this also works
            val = (TValue)(object)Single.NaN; // this also works
            return;
        }
        else if(typeof(TValue) == typeof(Double))
        {
            Console.WriteLine("this was double!");
            // val = (TValue)Convert.ChangeType(Double.NaN, typeof(TValue)); // this also works
            val = (TValue)(object)Double.NaN; // this also works
            return;
        }

        throw new Exception();
    }

Still, I'm not sure if it's ok to do the above... In general it's commented that checking typeof(TValue) == typeof(othertype) and doing the `(TValue)(object) is considered bad design:
https://stackoverflow.com/questions/15958830/c-sharp-generics-cast-generic-type-to-value-type

But on the other hand, in our code base there are already multiple generic methods where we check for typeof(TValue) == typeof(othertype), and I found that we actually also use something in similar to the return (TValue)(object) trick in places like:

machinelearning/src/Microsoft.ML.Data/Data/BufferBuilder.cs

Lines 92 to 99 in c0eeea9

    
           public static BufferBuilder<T> CreateDefault() 
        
           { 
        
               if (typeof(T) == typeof(ReadOnlyMemory<char>)) 
        
                   return (BufferBuilder<T>)(object)new BufferBuilder<ReadOnlyMemory<char>>(TextCombiner.Instance); 
        
               if (typeof(T) == typeof(float)) 
        
                   return (BufferBuilder<T>)(object)new BufferBuilder<float>(FloatAdder.Instance); 
        
               throw Contracts.Except($"Unrecognized type '{typeof(T)}' for default {nameof(BufferBuilder<T>)}"); 
        
           }

machinelearning/src/Microsoft.ML.TensorFlow/TensorflowTransform.cs

Lines 798 to 807 in c0eeea9

    
           private Tensor CastDataAndReturnAsTensor(T[] data) 
        
           { 
        
               if (typeof(T) == typeof(sbyte)) 
        
                   return new Tensor((sbyte[])(object)data, _dims, TF_DataType.TF_INT8); 
        
               else if (typeof(T) == typeof(long)) 
        
                   return new Tensor((long[])(object)data, _dims, TF_DataType.TF_INT64); 
        
               else if (typeof(T) == typeof(Int32)) 
        
                   return new Tensor((Int32[])(object)data, _dims, TF_DataType.TF_INT32); 
        
               else if (typeof(T) == typeof(Int16)) 
        
                   return new Tensor((Int16[])(object)data, _dims, TF_DataType.TF_INT16);

Although I notice that in those 2 examples, we're actually returning a reference type, whereas in the refactor needed to return NaN it would be returning a value type... so I'm not sure if there's another workaround, although in the stack overflow link I posted above (and other places I checked) it does look that the (object) cast would be necessary even when returning Float/Double value type. Also, regarding performance, I don't know if there's any cost in casting from value to object to value... but I guess it would negligible as it would only occur when returning NaN?

In reply to: 638777783 [](ancestors = 638777783,638494349)

Refers to: src/Microsoft.ML.Core/Utilities/DoubleParser.cs:109 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False)

codecov · 2020-06-04T13:26:39Z

Codecov Report

Merging #5198 into master will increase coverage by 0.23%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #5198      +/-   ##
==========================================
+ Coverage   73.31%   73.55%   +0.23%     
==========================================
  Files        1007     1016       +9     
  Lines      187969   190040    +2071     
  Branches    20241    20438     +197     
==========================================
+ Hits       137812   139777    +1965     
- Misses      44630    44693      +63     
- Partials     5527     5570      +43

Flag	Coverage Δ
#Debug	`73.55% <100.00%> (+0.23%)`	⬆️
#production	`69.35% <100.00%> (+0.28%)`	⬆️
#test	`87.52% <100.00%> (+0.07%)`	⬆️

Impacted Files	Coverage Δ
src/Microsoft.ML.Data/Data/Conversion.cs	`79.06% <ø> (+0.37%)`	⬆️
.../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs	`82.41% <100.00%> (+0.34%)`	⬆️
...soft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs	`87.39% <100.00%> (+0.37%)`	⬆️
test/Microsoft.ML.TestFramework/TestCommandBase.cs	`44.96% <100.00%> (+0.39%)`	⬆️
test/Microsoft.ML.Tests/TextLoaderTests.cs	`97.02% <100.00%> (+0.45%)`	⬆️
test/Microsoft.ML.Tests/OnnxConversionTest.cs	`96.81% <0.00%> (-2.16%)`	⬇️
...c/Microsoft.ML.SamplesUtils/SamplesDatasetUtils.cs	`40.00% <0.00%> (-0.68%)`	⬇️
src/Microsoft.ML.Data/Model/Onnx/OnnxContext.cs	`100.00% <0.00%> (ø)`
...Microsoft.ML.OnnxConverter/OnnxExportExtensions.cs	`100.00% <0.00%> (ø)`
.../Microsoft.ML.Transforms/Dracula/DictCountTable.cs	`100.00% <0.00%> (ø)`
... and 52 more

harishsk · 2020-06-04T17:40:01Z

Never mind. This seems a more invasive change. And even if we were to do this, this is a big enough change that it shouldn't be part of this PR.

In reply to: 638818549 [](ancestors = 638818549,638777783,638494349)

Refers to: src/Microsoft.ML.Core/Utilities/DoubleParser.cs:109 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False)

yaeldekel · 2020-06-08T07:27:23Z

        host.CheckDecode((_flags & ~OptionFlags.All) == 0);

Do we need to have a different check here based on the version of the model? #Resolved

Refers to: src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs:1419 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False)

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

antoniovs1029 · 2020-06-08T15:39:34Z

        host.CheckDecode((_flags & ~OptionFlags.All) == 0);

Yeah. You're right. If I don't check for version then the CheckDecode in here is pretty much being ignored. Will fix this. EDIT: DONE.

In reply to: 640419819 [](ancestors = 640419819)

Refers to: src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs:1419 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False)

antoniovs1029 · 2020-06-08T16:12:31Z

        host.CheckDecode((_flags & ~OptionFlags.All) == 0);

And I'll add a MAML test to test that this is being saved and loaded correctly (I believe saving and loading a textloader can only be tested through MAML).

In reply to: 640708134 [](ancestors = 640708134,640419819)

Refers to: src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs:1419 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False)

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

antoniovs1029 · 2020-06-09T00:51:20Z

        host.CheckDecode((_flags & ~OptionFlags.All) == 0);

Done.

In reply to: 640726768 [](ancestors = 640726768,640708134,640419819)

Refers to: src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs:1419 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False)

harishsk

antoniovs1029 added 12 commits May 28, 2020 00:00

Updated MAML HelpText

a5ac022

Added ImputeEmptyFloats option in TextLoader and EmptyAsNaN in Double…

e7e7da7

…Parser. Made changes to make them work.

Small changes related to _keepEmpty

a7a454f

Minor change

7d10358

Updated docs

4dd081c

Added Test

b952e71

Typo in test

c0cdbec

Merge remote-tracking branch 'upstream/master' into ftNaN

66c6c24

Revert "new code coverage (dotnet#5169)"

684d359

This reverts commit c8cace2.

Updated manifest

b1cc176

Revert "Revert "new code coverage (dotnet#5169)""

65fcb59

This reverts commit 684d359.

Minor fix in docs

a68da25

antoniovs1029 requested a review from a team as a code owner June 3, 2020 05:02

antoniovs1029 requested review from harishsk, justinormont and mstfbl June 3, 2020 05:02

antoniovs1029 commented Jun 3, 2020

View reviewed changes

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs Show resolved Hide resolved

antoniovs1029 commented Jun 3, 2020

View reviewed changes

antoniovs1029 added 3 commits June 2, 2020 22:34

Revert "Revert "Revert "new code coverage (dotnet#5169)"""

df32ba1

This reverts commit 65fcb59.

Removed checking static instance in another test, because now it fail…

f6912b1

…s if it runs after the new test for the Impute option.

Revert "Revert "Revert "Revert "new code coverage (dotnet#5169)""""

3dd66ff

This reverts commit df32ba1.

antoniovs1029 commented Jun 3, 2020

View reviewed changes

test/Microsoft.ML.Tests/TextLoaderTests.cs Show resolved Hide resolved

yaeldekel reviewed Jun 8, 2020

View reviewed changes

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs Outdated Show resolved Hide resolved

Bumped version to include new flag

280a35e

wangyems reviewed Jun 8, 2020

View reviewed changes

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs Outdated Show resolved Hide resolved

antoniovs1029 added 5 commits June 8, 2020 16:23

Load missing columns also as NaNs

8e05dff

Change name to MissingRealsAsNaNs

1cc7eff

Added MAML Test

b074412

Updated test

19241ef

Updated manifest and fixed typo

00a972b

Removed unused variable

a417228

harishsk approved these changes Jun 9, 2020

View reviewed changes

antoniovs1029 merged commit c664748 into dotnet:master Jun 9, 2020

ghost locked as resolved and limited conversation to collaborators Mar 18, 2022

Allow TextLoader to load empty float/double fields as NaN instead of 0 #5198

Allow TextLoader to load empty float/double fields as NaN instead of 0 #5198

Uh oh!

Conversation

antoniovs1029 commented Jun 3, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

antoniovs1029 Jun 3, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harishsk Jun 3, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Jun 4, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Jun 9, 2020

Choose a reason for hiding this comment

Uh oh!

harishsk Jun 9, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Jun 9, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

antoniovs1029 Jun 3, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harishsk Jun 3, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Jun 4, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Jun 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Jun 9, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov bot commented Jun 3, 2020

Codecov Report

Uh oh!

harishsk commented Jun 3, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

antoniovs1029 commented Jun 4, 2020

Uh oh!

antoniovs1029 commented Jun 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Jun 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

harishsk commented Jun 4, 2020

Uh oh!

yaeldekel commented Jun 8, 2020 • edited by antoniovs1029 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

antoniovs1029 commented Jun 8, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

antoniovs1029 commented Jun 8, 2020

Uh oh!

Uh oh!

antoniovs1029 commented Jun 9, 2020

Uh oh!

harishsk left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

antoniovs1029 commented Jun 3, 2020 •

edited

Loading

antoniovs1029 Jun 3, 2020 •

edited

Loading

antoniovs1029 Jun 3, 2020 •

edited

Loading

antoniovs1029 Jun 4, 2020 •

edited

Loading

antoniovs1029 Jun 9, 2020 •

edited

Loading

harishsk commented Jun 3, 2020 •

edited

Loading

antoniovs1029 commented Jun 4, 2020 •

edited

Loading

codecov bot commented Jun 4, 2020 •

edited

Loading

yaeldekel commented Jun 8, 2020 •

edited by antoniovs1029

Loading

antoniovs1029 commented Jun 8, 2020 •

edited

Loading