Added decimal marker option in TextLoader #5145

mstfbl · 2020-05-20T02:52:27Z

This PR adds a decimal marker option to TextLoader, so that datasets from cultures where a comma is the decimal marker (as in 3,5 meaning 3.5) can be loaded correctly. It also updates verWrittenCur, since decimalMarker is now written during serialization. In addition, it adds a unit test that checks whether a dataset with ',' as its decimal marker is read in and processed correctly.
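For readers unfamiliar with decimal markers, here is a minimal sketch of the idea using only the .NET base class library. It does not show TextLoader or DoubleParser internals; `DecimalMarkerSketch` and `ParseWithMarker` are hypothetical names introduced purely for illustration.

```csharp
using System;
using System.Globalization;

// Hypothetical illustration only: this is NOT how TextLoader parses values.
public static class DecimalMarkerSketch
{
    // Parses a number whose decimal separator is the given marker character.
    public static double ParseWithMarker(string text, char decimalMarker)
    {
        // A fresh NumberFormatInfo starts out culture-invariant ('.' as separator).
        var format = new NumberFormatInfo
        {
            NumberDecimalSeparator = decimalMarker.ToString()
        };
        // NumberStyles.Float allows a sign, a decimal point, and an exponent,
        // but no thousands separators, so ',' is unambiguous here.
        return double.Parse(text, NumberStyles.Float, format);
    }

    public static void Main()
    {
        Console.WriteLine(ParseWithMarker("3,5", ','));
        Console.WriteLine(ParseWithMarker("3.5", '.'));
    }
}
```

Both calls should yield the same numeric value; only the textual representation of the input differs.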

Fix #4910

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs

test/data/iris_decimal_marker_as_comma.txt

test/Microsoft.ML.Tests/TextLoaderTests.cs

codecov · 2020-05-21T02:10:44Z

Codecov Report

Merging #5145 into master will increase coverage by 0.01%.
The diff coverage is 92.35%.

@@            Coverage Diff             @@
##           master    #5145      +/-   ##
==========================================
+ Coverage   75.76%   75.78%   +0.01%     
==========================================
  Files         993      993              
  Lines      180746   180915     +169     
  Branches    19463    19474      +11     
==========================================
+ Hits       136944   137108     +164     
- Misses      38516    38518       +2     
- Partials     5286     5289       +3

Flag	Coverage Δ
#Debug	`75.78% <92.35%> (+0.01%)`	⬆️
#production	`71.71% <61.53%> (+<0.01%)`	⬆️
#test	`88.87% <94.90%> (+0.02%)`	⬆️

Impacted Files	Coverage Δ
.../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs	`82.23% <61.53%> (-0.36%)`	⬇️
test/Microsoft.ML.Tests/TextLoaderTests.cs	`95.60% <94.90%> (-0.15%)`	⬇️
src/Microsoft.ML.Featurizers/CategoricalImputer.cs	`74.73% <0.00%> (-0.27%)`	⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs	`89.58% <0.00%> (+0.16%)`	⬆️
...c/Microsoft.ML.FastTree/Utils/ThreadTaskManager.cs	`100.00% <0.00%> (+20.51%)`	⬆️
#Resolved

src/Microsoft.ML.Core/Utilities/DoubleParser.cs

test/Microsoft.ML.Tests/TextLoaderTests.cs

test/data/iris_decimal_marker_as_comma.txt

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

mstfbl · 2020-05-21T18:00:37Z

This PR also opens up a possible issue when datasets with different decimal separators are read at the same time. For example, if two separate threads load two datasets, one with '.' as the decimal separator and the other with ',', the static decimal marker can hold only one value, so one of the datasets will not be read accurately. Loading datasets with different decimal markers at once is very unlikely. After this PR is merged, I can open a new issue on this topic.

Edit: I have added a note in DoubleParser.cs to denote this case. #Resolved
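To make the hazard concrete, here is a small sketch, assuming a parser that keeps its decimal marker in a static field. `SharedMarkerParser` is a hypothetical stand-in, not the real DoubleParser, and the two "loaders" are simulated by interleaving calls on one thread so that the outcome is deterministic.

```csharp
using System;
using System.Globalization;

// Hypothetical stand-in for a parser with a static decimal marker.
public static class SharedMarkerParser
{
    // Shared by every caller in the process: this is the source of the hazard.
    public static char DecimalMarker = '.';

    public static bool TryParse(string text, out double value)
    {
        value = 0;
        string normalized = text;
        if (DecimalMarker != '.')
        {
            // With a non-period marker, a literal '.' is not a valid separator.
            if (text.IndexOf('.') >= 0)
                return false;
            normalized = text.Replace(DecimalMarker, '.');
        }
        return double.TryParse(normalized, NumberStyles.Float,
            CultureInfo.InvariantCulture, out value);
    }
}

public static class SharedMarkerDemo
{
    public static void Main()
    {
        // "Loader A" reads a comma-marked dataset.
        SharedMarkerParser.DecimalMarker = ',';
        bool first = SharedMarkerParser.TryParse("3,5", out double a1);

        // "Loader B" starts on a period-marked dataset and overwrites the marker.
        SharedMarkerParser.DecimalMarker = '.';
        bool other = SharedMarkerParser.TryParse("3.5", out double b1);

        // Loader A continues, but the marker no longer matches its data.
        bool second = SharedMarkerParser.TryParse("3,5", out double a2);

        Console.WriteLine($"{first} {other} {second}");
    }
}
```

In this sketch the last parse fails because the shared marker was changed underneath the first loader. Keeping the marker as per-loader state rather than static state would avoid the interference, at the cost of threading it through every parse call.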

mstfbl · 2020-05-22T01:44:19Z

test/Microsoft.ML.Tests/TextLoaderTests.cs

+        }
+
+        [Fact]
+        public void TestCommaAsDecimalMarkerDouble()


This test is very similar to TestCommaAsDecimalMarkerFloat() above, because decimal markers need to be tested with both floats and doubles. I wanted to use InlineData to pass a bool that selects doubles or floats, but the float/double type must be specified to instantiate the VBuffer and ValueGetters, and I don't know of a better way to vary the type.

With floats:

machinelearning/test/Microsoft.ML.Tests/TextLoaderTests.cs

Lines 1022 to 1023 in 6837f64

VBuffer<Single> featuresPeriod = default;

ValueGetter<VBuffer<Single>> featuresDelegatePeriod = cursorPeriod.GetGetter<VBuffer<Single>>(columnsPeriod[1]);

With doubles:

machinelearning/test/Microsoft.ML.Tests/TextLoaderTests.cs

Lines 1087 to 1088 in 6837f64

VBuffer<Double> featuresCsv = default;

ValueGetter<VBuffer<Double>> featuresDelegatePeriod = cursorCsv.GetGetter<VBuffer<Double>>(columnsCsv[1]);

If anyone has a better way of preventing code repetition, I'm all eyes and ears! #Resolved

As discussed offline, I agree with your approach of having 2 separate tests for each case, for the reasons you've mentioned above. #Resolved

You can do this a bit better by refactoring the core test as TestCommaAsDecimalMarker<T>() and then calling TestCommaAsDecimalMarker<float> from within TestCommaAsDecimalMarkerFloat(). You will still have two tests, but it will be easier to see that this is the same test run on two different types.

@harishsk now that TestCommaAsDecimalMarkerFloat() tests both the .csv version (with comma as both separator and decimal marker) and the .txt version of the same dataset by using InlineDataAttribute, I think both tests would become harder to follow if I added a generic type parameter and called TestCommaAsDecimalMarker<float> from within TestCommaAsDecimalMarkerFloat(). #Resolved

Please try it out. This is the pattern used in a lot of onnx conversion tests and it has improved readability there. #Resolved

@harishsk thank you for your suggestion. I have now compressed the two tests: a theory, TestCommaAsDecimalMarker(bool useCsvVersion), calls a generic helper, TestCommaAsDecimalMarkerHelper<T>(bool useCsvVersion). If only I knew a way to pass the floating-point types float and double as inputs to the helper, I would have written it this way to have only one unit test:

[Theory]
[InlineData(true, typeof(float))]
[InlineData(true, typeof(double))]
[InlineData(false, typeof(float))]
[InlineData(false, typeof(double))]
public void TestCommaAsDecimalMarkerHelper(bool useCsvVersion, GENERIC_TYPE featureType)
{
    ...
    // Who knew compressing tests can be so beautiful 😄
}

#Resolved
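One possible way to get close to the single-test shape wished for above is to accept a `System.Type` argument and dispatch to the generic helper by hand. The sketch below is self-contained and leaves out xUnit; in a real test, `Run` would be the `[Theory]` method receiving `typeof(float)` or `typeof(double)` from InlineData as a `Type` parameter. `GenericTestDispatchSketch`, `Helper<T>`, and `Run` are hypothetical names, and the helper body is a placeholder rather than the PR's actual test logic.

```csharp
using System;

public static class GenericTestDispatchSketch
{
    // Placeholder for the generic helper; the real one would build
    // VBuffer<T> and ValueGetter<VBuffer<T>> and compare loaded values.
    public static string Helper<T>(bool useCsvVersion) where T : struct
    {
        string version = useCsvVersion ? "csv" : "txt";
        return typeof(T).Name + ":" + version;
    }

    // Stand-in for the theory method: maps a runtime Type to a generic call.
    public static string Run(bool useCsvVersion, Type featureType)
    {
        if (featureType == typeof(float))
            return Helper<float>(useCsvVersion);
        if (featureType == typeof(double))
            return Helper<double>(useCsvVersion);
        throw new ArgumentException(
            "Unsupported feature type: " + featureType, nameof(featureType));
    }

    public static void Main()
    {
        foreach (bool useCsv in new[] { true, false })
        {
            Console.WriteLine(Run(useCsv, typeof(float)));
            Console.WriteLine(Run(useCsv, typeof(double)));
        }
    }
}
```

The explicit `if` chain keeps the dispatch readable for two types; reflection via `MethodInfo.MakeGenericMethod` would be an alternative if the list of types grew.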

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

test/Microsoft.ML.Tests/TextLoaderTests.cs

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

test/Microsoft.ML.Tests/TextLoaderTests.cs

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

antoniovs1029 · 2020-05-22T04:10:07Z

LGTM. There's only one minor nit comment I left.

Also, there's the potential issue you've mentioned here. As discussed offline with @harishsk, we all agree that the scenario where it could be a problem (i.e. different TextLoaders with different decimal marker options reading different datasets at the same time) is fringe enough to treat as an edge case that shouldn't stop us from adding the new feature you've worked on. Still, I will soon try to fix issue #4132, and as part of that work I will explore other ways to better connect the options in TextLoader to the DoubleParser without running into the potential issue mentioned above.

Aside from that, your tests do show that the feature works for the general cases where it would be used. So thanks for the thorough testing, @mstfbl 😄 #Resolved

harishsk

mstfbl · 2020-05-22T20:05:09Z

/azp run

azure-pipelines · 2020-05-22T20:05:22Z

Azure Pipelines successfully started running 2 pipeline(s).

mstfbl added 3 commits May 19, 2020 19:49

Added decimal marker option in TextLoader

d375d90

Added decimalChar to more TextLoader constructors

544dab6

Removed decimalMarker from TextLoader constructors due to API breaking

a8a9b54

mstfbl commented May 20, 2020

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs (outdated)

antoniovs1029 reviewed May 20, 2020

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs (outdated)

antoniovs1029 reviewed May 20, 2020

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs (outdated)

antoniovs1029 reviewed May 20, 2020

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs (outdated)

antoniovs1029 mentioned this pull request May 20, 2020

support sweeping multiline option in AutoML #5148

Merged

Added unit test for ',' as a decimal marker, and added decimalMarker to TextLoaderCursor and TextLoaderParser

7658a70

antoniovs1029 reviewed May 20, 2020

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs (outdated)

justinormont reviewed May 20, 2020

test/data/iris_decimal_marker_as_comma.txt (outdated)

mstfbl added 2 commits May 20, 2020 17:33

Added DecimalMarker in DoubleParser

ece5518

Added decimal marker check and removed decimalMarker from CreateTextLoader's constructor

a663f21

mstfbl commented May 21, 2020

test/Microsoft.ML.Tests/TextLoaderTests.cs

mstfbl commented May 21, 2020

src/Microsoft.ML.Core/Utilities/DoubleParser.cs

mstfbl marked this pull request as ready for review May 21, 2020 03:49

mstfbl requested a review from a team as a code owner May 21, 2020 03:49

harishsk reviewed May 21, 2020

src/Microsoft.ML.Core/Utilities/DoubleParser.cs

harishsk reviewed May 21, 2020

test/Microsoft.ML.Tests/TextLoaderTests.cs

antoniovs1029 reviewed May 21, 2020

test/Microsoft.ML.Tests/TextLoaderTests.cs (outdated)

antoniovs1029 reviewed May 21, 2020

test/data/iris_decimal_marker_as_comma.txt (outdated)

antoniovs1029 reviewed May 21, 2020

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs (outdated)

jaredpar mentioned this pull request May 21, 2020

OSX machines are de-provisioned during CI / PR runs leading to failures dotnet/runtime#34472

Closed

mstfbl marked this pull request as draft May 22, 2020 01:29

mstfbl added 2 commits May 21, 2020 18:30

Added TextLoader decimalMarker unit tests, and refined logic in DoubleParser

141fa7b

Merge remote-tracking branch 'upstream/master' into addCommaDecimalOptionTextLoader

6837f64

mstfbl commented May 22, 2020