Skip to content

Added decimal marker option in TextLoader #5145

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
May 22, 2020

Conversation

mstfbl
Copy link
Contributor

@mstfbl mstfbl commented May 20, 2020

This PR adds the decimal marker option in TextLoader, so that cultures where a comma is the decimal marker (as in 3,5 = 3.5 * 10^1) can use their appropriate datasets. This also updates verWrittenCur as it is now writing decimalMarker during serialization as well. In addition, this PR also adds in a unit test to check whether or not a dataset with ',' as its decimal marker is read in and processed correctly.

Fix #4910

@codecov
Copy link

codecov bot commented May 21, 2020

Codecov Report

Merging #5145 into master will increase coverage by 0.01%.
The diff coverage is 92.35%.

@@            Coverage Diff             @@
##           master    #5145      +/-   ##
==========================================
+ Coverage   75.76%   75.78%   +0.01%     
==========================================
  Files         993      993              
  Lines      180746   180915     +169     
  Branches    19463    19474      +11     
==========================================
+ Hits       136944   137108     +164     
- Misses      38516    38518       +2     
- Partials     5286     5289       +3     
Flag Coverage Δ
#Debug 75.78% <92.35%> (+0.01%) ⬆️
#production 71.71% <61.53%> (+<0.01%) ⬆️
#test 88.87% <94.90%> (+0.02%) ⬆️
Impacted Files Coverage Δ
.../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs 82.23% <61.53%> (-0.36%) ⬇️
test/Microsoft.ML.Tests/TextLoaderTests.cs 95.60% <94.90%> (-0.15%) ⬇️
src/Microsoft.ML.Featurizers/CategoricalImputer.cs 74.73% <0.00%> (-0.27%) ⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 89.58% <0.00%> (+0.16%) ⬆️
...c/Microsoft.ML.FastTree/Utils/ThreadTaskManager.cs 100.00% <0.00%> (+20.51%) ⬆️
#Resolved

@mstfbl mstfbl marked this pull request as ready for review May 21, 2020 03:49
@mstfbl mstfbl requested a review from a team as a code owner May 21, 2020 03:49
@mstfbl
Copy link
Contributor Author

mstfbl commented May 21, 2020

This PR also opens up a possible case where datasets having different decimal separators being read in at the same time can result in an issue. For example, when on 2 separate threads we're loading two datasets where one has '.' as the decimal separator and the other has ',' as a decimal separator, as the static decimal marker can only have one value, it won't be accurately able to read one of the datasets. This case is very unlikely to happen, where we attempt to load in datasets with different decimal markers at once. After this PR is merged, I can make a new issue on this topic.

Edit: I have added a note in DoubleParser.cs to denote this case. #Resolved

@mstfbl mstfbl marked this pull request as draft May 22, 2020 01:29
}

[Fact]
public void TestCommaAsDecimalMarkerDouble()
Copy link
Contributor Author

@mstfbl mstfbl May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is very similar to TestCommaAsDecimalMarkerFloat() above. This is because it is necessary to test decimal markers with floats and doubles. I was wishing to use InlineData to input a bool for using doubles and floats, however the specification of float/double is necessary to instantiate the VBuffer and ValueGetters, and there is no better way to specify varying types that I know.

With floats:

VBuffer<Single> featuresPeriod = default;
ValueGetter<VBuffer<Single>> featuresDelegatePeriod = cursorPeriod.GetGetter<VBuffer<Single>>(columnsPeriod[1]);

With doubles:
VBuffer<Double> featuresCsv = default;
ValueGetter<VBuffer<Double>> featuresDelegatePeriod = cursorCsv.GetGetter<VBuffer<Double>>(columnsCsv[1]);

If anyone has a better way of prevention code repetition, I'm all eyes and ears! #Resolved

Copy link
Member

@antoniovs1029 antoniovs1029 May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, I agree with your approach of having 2 separate tests for each case, for the reasons you've mentioned above. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do this a bit better by refactoring the core test as TestCommaAsDecimalMarker<T>() and then calling TestCommaAsDecimalMarker<float> from within TestCommaAsDecimalMarkerFloat(). You will still have two tests, but the code will be easier to follow that this is the same test on two different types.


In reply to: 429001199 [](ancestors = 429001199)

Copy link
Contributor Author

@mstfbl mstfbl May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harishsk now that in TestCommaAsDecimalMarkerFloat() I am testing both the .csv (with comma as both separator and decimal marker) and the .txt versions of the same dataset by using InlineDataAttribute, I think both tests would become harder to follow with the addition of a generic type parameter and calling TestCommaAsDecimalMarker<float> from within TestCommaAsDecimalMarkerFloat() #Resolved

Copy link
Contributor

@harishsk harishsk May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please try it out. This is the pattern used in a lot of onnx conversion tests and it has improved readability there. #Resolved

Copy link
Contributor Author

@mstfbl mstfbl May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harishsk thank you for your suggestion. I have now compressed the two tests, where I have a theory TestCommaAsDecimalMarker(bool useCsvVersion) test that calls a TestCommaAsDecimalMarkerHelper<T>(bool useCsvVersion) helper test. If only I knew a way to pass in the floating point value types float and double as inputs to TestCommaAsDecimalMarkerHelper<T>(bool useCsvVersion), I would have done the helper test this way to only have 1 unit test:

    [Theory]
    [InlineData(true, typeof(float))]
    [InlineData(true, typeof(double))]
    [InlineData(false, typeof(float))]
    [InlineData(false, typeof(double))]
    public void TestCommaAsDecimalMarkerHelper(bool useCsvVersion, GENERIC_TYPE featureType)
    {
        ...
        // Who knew compressing tests can be so beautiful 😄
    }

 #Resolved

@antoniovs1029
Copy link
Member

antoniovs1029 commented May 22, 2020

LGTM. There's only one minor nit comment I left.

Also, there's the potential issue you've mentioned here. As discussed offline with @harishsk , we all agree that the scenario where it could be a problem (i.e. having different textloaders with different Decimal Marker options reading, at the same time, different datasets) would be fringe enough, to consider it an edge case that shouldn't stop us from adding this new feature you've worked in. Still, I will soon try to fix issue #4132 and as part of my work there I will try to explore other ways to better connect the options in textloader to the DoubleParser, without running with the potential issue I've mentioned.

Aside from that, your tests do show that the feature works for the general cases where it would be used. So thanks for the thorough testing, @mstfbl 😄 #Resolved

@mstfbl mstfbl marked this pull request as ready for review May 22, 2020 18:02
Copy link
Contributor

@harishsk harishsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@mstfbl
Copy link
Contributor Author

mstfbl commented May 22, 2020

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@mstfbl mstfbl merged commit 27141e3 into dotnet:master May 22, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Poor accuracy with non-US regional settings
4 participants