-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Added decimal marker option in TextLoader #5145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added decimal marker option in TextLoader #5145
Conversation
…to TextLoaderCursor and TextLoaderParser
Codecov Report
@@ Coverage Diff @@
## master #5145 +/- ##
==========================================
+ Coverage 75.76% 75.78% +0.01%
==========================================
Files 993 993
Lines 180746 180915 +169
Branches 19463 19474 +11
==========================================
+ Hits 136944 137108 +164
- Misses 38516 38518 +2
- Partials 5286 5289 +3
|
This PR also opens up a possible case where datasets having different decimal separators being read in at the same time can result in an issue. For example, when on 2 separate threads we're loading two datasets where one has Edit: I have added a note in |
} | ||
|
||
[Fact] | ||
public void TestCommaAsDecimalMarkerDouble() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is very similar to TestCommaAsDecimalMarkerFloat()
above. This is because it is necessary to test decimal markers with floats and doubles. I was wishing to use InlineData
to input a bool for using doubles and floats, however the specification of float/double is necessary to instantiate the VBuffer and ValueGetters, and there is no better way to specify varying types that I know.
With floats:
machinelearning/test/Microsoft.ML.Tests/TextLoaderTests.cs
Lines 1022 to 1023 in 6837f64
VBuffer<Single> featuresPeriod = default; | |
ValueGetter<VBuffer<Single>> featuresDelegatePeriod = cursorPeriod.GetGetter<VBuffer<Single>>(columnsPeriod[1]); |
With doubles:
machinelearning/test/Microsoft.ML.Tests/TextLoaderTests.cs
Lines 1087 to 1088 in 6837f64
VBuffer<Double> featuresCsv = default; | |
ValueGetter<VBuffer<Double>> featuresDelegatePeriod = cursorCsv.GetGetter<VBuffer<Double>>(columnsCsv[1]); |
If anyone has a better way of prevention code repetition, I'm all eyes and ears! #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, I agree with your approach of having 2 separate tests for each case, for the reasons you've mentioned above. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do this a bit better by refactoring the core test as TestCommaAsDecimalMarker<T>()
and then calling TestCommaAsDecimalMarker<float>
from within TestCommaAsDecimalMarkerFloat()
. You will still have two tests, but the code will be easier to follow that this is the same test on two different types.
In reply to: 429001199 [](ancestors = 429001199)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@harishsk now that in TestCommaAsDecimalMarkerFloat()
I am testing both the .csv (with comma as both separator and decimal marker) and the .txt versions of the same dataset by using InlineDataAttribute
, I think both tests would become harder to follow with the addition of a generic type parameter and calling TestCommaAsDecimalMarker<float>
from within TestCommaAsDecimalMarkerFloat()
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please try it out. This is the pattern used in a lot of onnx conversion tests and it has improved readability there. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@harishsk thank you for your suggestion. I have now compressed the two tests, where I have a theory TestCommaAsDecimalMarker(bool useCsvVersion)
test that calls a TestCommaAsDecimalMarkerHelper<T>(bool useCsvVersion)
helper test. If only I knew a way to pass in the floating point value types float
and double
as inputs to TestCommaAsDecimalMarkerHelper<T>(bool useCsvVersion)
, I would have done the helper test this way to only have 1 unit test:
[Theory]
[InlineData(true, typeof(float))]
[InlineData(true, typeof(double))]
[InlineData(false, typeof(float))]
[InlineData(false, typeof(double))]
public void TestCommaAsDecimalMarkerHelper(bool useCsvVersion, GENERIC_TYPE featureType)
{
...
// Who knew compressing tests can be so beautiful 😄
}
#Resolved
LGTM. There's only one minor nit comment I left. Also, there's the potential issue you've mentioned here. As discussed offline with @harishsk , we all agree that the scenario where it could be a problem (i.e. having different textloaders with different Decimal Marker options reading, at the same time, different datasets) would be fringe enough, to consider it an edge case that shouldn't stop us from adding this new feature you've worked in. Still, I will soon try to fix issue #4132 and as part of my work there I will try to explore other ways to better connect the options in textloader to the DoubleParser, without running with the potential issue I've mentioned. Aside from that, your tests do show that the feature works for the general cases where it would be used. So thanks for the thorough testing, @mstfbl 😄 #Resolved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/azp run |
Azure Pipelines successfully started running 2 pipeline(s). |
This PR adds the decimal marker option in TextLoader, so that cultures where a comma is the decimal marker (as in 3,5 = 3.5 * 10^1) can use their appropriate datasets. This also updates
verWrittenCur
as it is now writingdecimalMarker
during serialization as well. In addition, this PR also adds in a unit test to check whether or not a dataset with','
as its decimal marker is read in and processed correctly.Fix #4910