Enable TextLoader to accept new lines in quoted fields #5125

antoniovs1029 · 2020-05-14T10:18:58Z

Fixes https://github.com/dotnet/machinelearning-automl/issues/193

Addresses these issues (see below):
#4460
dotnet/machinelearning-modelbuilder#702

Based on @LittleLittleCloud 's PR:
#4584

This PR enables TextLoader to load rows which have new line characters inside quoted fields. So, for example, if the input CSV file has:

id,description,animal
11,"this is a
quoted field with
newlines",cat

then all of that will actually be loaded into a single row. Previously, if the TextLoader was presented with such a case, it would typically throw exceptions because it wasn't able to parse it correctly.

I took @LittleLittleCloud 's PR which had most of the work done, and added some corner cases, public interfaces to this feature, and made and ran different tests, addressing the comments on that other PR.

Complying with RFC 4180

I was tasked to take care of this issue, while also making the TextLoader compliant with RFC 4180 (on CSV format). Turns out the TextLoader already accepted pretty much everything that the RFC states, except for accepting new lines inside quoted fields, which is pretty much the only thing fixed in this PR.

Notice that this means that TextLoader was already able to escape 2 double quotes ("") inside quoted fields; i.e. "this is ""an"" example" will correctly be loaded as this is "an" example.
When loading CSV files TextLoader also accepts and do other things that the RFC doesn't mention. For example, filtering lines that start with "//" as if they were comments, or accepting 1 double quote (") in the middle of a field when allowQuoting is false.
It seems to me the only thing that the RFC states and that ML.NET doesn't follow, is that the RFC says that spaces should also be considered part of the fields, and not be ignored... but in general ML.NET ignores spaces at the beginning and at the end of each field, and line.

Modelbuilder and AutoML issues

There are different issues in Modelbuilder and AutoML that were supposed to be caused by TextLoader and I was supposed to fix them on this PR. But after looking into them, it doesn't seem that they are caused by the TextLoader. I have opened this issue in ModelBuilder explaining this in more detail, but in general, for some reason, if the first line (after the header) of a CSV file contains a comma inside a quoted field, then ModelBuilder shows an error saying it can't be loaded; it was said that the problem was on TextLoader, but it isn't, particularly because TextLoader doesn't have any problem escaping commas inside quoted fields.

So, regarding issue dotnet/machinelearning-modelbuilder#702

ML.NET's TextLoader was able to load most of the datasets pointed there, despite the fact that ModelBuilder shows an error message when the user selects them on the GUI.
Only some of the jigsaw datasets (such as jigsaw-toxic-comment-train.csv) and the Airbnb datasets included new lines inside quoted fields so they couldn't be loaded by ML.NET. This is fixed on this PR.

Regarding issue #4460

@LittleLittleCloud mentioned that there were problems using InferColumns() with CSV files that contained new lines inside quotes, so this is fixed on this PR and I add a test reproducing that.
@LittleLittleCloud also provided a listings.csv dataset, which doesn't include new lines inside quotes, and so this PR doesn't really affect that. Regarding this dataset, he didn't mentioned problems loading it, but that there were some problems with how the InferColumns method worked with double quotes. It looks to me that that dataset is loaded correctly (i.e. scaping "" as expected), so I further discussions would be needed about this to see where's the real root cause, as it's likely not in TextLoader.

Testing this PR

To test this PR I:

Added 2 new tests
Ran locally all existing ML.NET tests, using ReadMultilines = true as default. And it all worked as expected. I did this both using 1 thread or multiple threads for the text loader, to check the effect of multithreading.
Loaded all the datasets provided by ModelBuilder team successfully in ML.NET.

…o is4460TextLoaderQuoting

…tiline doesn't close quoting

antoniovs1029 · 2020-05-14T10:25:58Z

Regarding @justinormont 's concern about this working with multithreading, there's nothing to worry about.

I did ran locally all the existing ML.NET tests using ReadMultilines = true, with both 1 or multiple threads, and everything worked as expected.

Reason for this is that TextLoader.Cursor.LineReader always works the same, regardless of the number of threads. It reads each line in the input files, and put them inside its BlockQueue<LineBatch> _queue; all cursors share the same LineReader object. Then the cursor (or cursors) will try to get lines to parse them by getting batches from this queue. So reading the lines in the file is always done sequentially in only one background thread, whereas getting the line batches from the queue is what can be done from multiple threads. #Resolved

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs

antoniovs1029 · 2020-05-14T10:30:48Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs

+                                break; // MYTODO: This could happen if we have an invalid open quote which never closes so we reach the end of the file without properly closing the field, should we throw instead in this case?
+
+                            if(line.Length != 0)
+                                sb.Append(" "); // MYTODO: should we use instead a "\n" in here to separate lines?


@justinormont recommended adding an actual new line here instead of a space. But I'm not sure if it would matter much. Also I wouldn't know when to append "\n" instead of "\r\n", or if it would make any difference. #Resolved

Yep. We'll want newlines instead of spaces.

The same as newlines are different to human readers, the ML can just as well make use of this information. We also have no control over how the user featurizes their dataset in the ML pipeline; the newlines could be important to their processing.

We also would want to maintain round trip read/write of the dataset. Which means we'd have to modify the TextSaver to support (or verify current support) for newlines. #Resolved

I've done this in a more recent iteration. To continue discussing about this, it'd be better to do it here:
#5125 (comment)

In reply to: 425757137 [](ancestors = 425757137)

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

antoniovs1029 · 2020-05-14T10:36:58Z

src/Microsoft.ML.AutoML/ColumnInference/ColumnInferenceApi.cs

+                TrimWhitespace = trimWhitespace,
+                ReadMultilines = true // MYTODO: is it ok to hardcode this? it is necessary for my test to pass


So I am not familiar at all with AutoML code, but in order to add a test reproducing the scenario in #4460 I had to add this in here for it to work. Problem is that these methods in AutoML and several more methods there would require to either hardcode this or add more parameters to several methods to have a readMultilines parameter.

This is a problem because ReadMultilines is set as false by default, and we would need to manually set it as true anywhere where we'd want AutoML to have this behavior. So I am not sure what would be the best way to handle this from AutoML side. #Resolved

I expect it's better to sweep over the ReadMultilines flag. The same as we do with the other flags for TextLoader (src).

Are there any datasets (or rows) which would be unreadable with ReadMultilines = true? #Resolved

As I mention here: #5125 (comment) , if the CSV file contains a quoted field which has a " at the beginning of the field, but it never closes it (which would technically be invalid format), then the file won't be read correctly and some problems may arise from it.

For example, our "wikipedia-detox-250-line-data.tsv" has this error in line 83. The sentiment text field there opens with " but it never closes that quotation, so in that dataset the Multilinereader doesn't load that file correctly.

In reply to: 425713757 [](ancestors = 425713757)

Regarding your suggestion of simply sweeping over the flags in TrySplitColumns, it doesn't look that would affect the InferColumns method I'm mentioning in this thread (it looks to me the InferColumns method isn't called when calling TrySplitColumns). As you can see in the ColumnInferenceApi, there are several methods which have multiple parameters related to TextLoader, and it seems I'd need to add a new parameter to all of them. But my question is that I wouldn't know if it's necessary to add it in all of them to make the readMultiline parameter accessible to ModelBuilder. Furthermore, it seems these methods are called by the AutoCatalog.cs which also exposes some public methods with parameters related to TextLoader. And it would again seem that I'd need to add a new readMultiline parameter there... but I guess won't be able to do that since that is a public API, and adding parameters is considered a breaking API change.

So, in general, I don't know which criteria to use in order to know where to add a readMultiline parameter in order to make it accessible for ModelBuilder. Is there any other place besides ColumnInferenceAPI and TrySplitColumns?

In reply to: 425713757 [](ancestors = 425713757)

As discussed in here: I won't change anything in AutoML.NET code, and will leave this up to ModelBuilder team.
#5125 (comment)

In reply to: 425726377 [](ancestors = 425726377,425713757)

antoniovs1029 · 2020-05-14T10:42:11Z

Currently the CI is failing because apparently I can't add a new parameter to the public methods that create TextLoaders, because it's considered a breaking API change. Anyway, I will fix this later, in the meantime, please feel free to review. Thanks! #Resolved

antoniovs1029 · 2020-05-14T11:02:24Z

test/data/multiline.csv

+// this file should be loaded with quoting enabled
+// and it should load without problems by the TextLoader
+id,description,animal


So here are several examples that the TextLoader should be able to load according to the RFC and to how it has worked in the past for ML.NET's purposes. Please, let me know if I should add more cases.

Also, from this thread, I think it's better not to support these:

Escaping \" inside quoted fields. There are 3 main reasons to not support this:

The RFC 4180 doesn't support it. Quotes should be escaped using ""

Excel doesn't really support it consistently. If I have a CSV with 19,"this has an \"scaped quote\" right" the first one will not be escaped but the second one does.

Among the datasets provided by ModelBuilder team some of them have the characters \" but they're not meant to escape the ". E.g., there are quoted fields that end with those characters "blabla\" and so the last " is supposed to be the closing of the field and the \ is supposed to be part of the field. So escaping this would cause errors for those datasets.

Accept True, "asdf,123 even when quoting is enabled There are 2 main reasons for not doing this:

The RFC doesn't accept this.

In ML.NET we filter out the spaces at the beginning of the field, so there's no difference between that and True,"asdf,123 which would be an error. So code-wise it would be weird to accept this particular case.

In general, I wouldn't accept these cases unless there were a lot of people asking for them. #Resolved

Would adding support for alternative escaping methods in the future be difficult? Slash escaping is common enough.

Pandas (generally used by scikit-learn) exposes an escapechar parameter to allow for this: https://github.com/pandas-dev/pandas/blob/7456fc52e1546d263b88ba5772c0324f91ff0804/doc/source/user_guide/io.rst#quoting-and-escape-characters

Databricks exposes an escape parameter for this: https://docs.databricks.com/data/data-sources/read-csv.html

Note that both of these recommend/default to slash escaping instead of double-double-quotes for escaping. The escaping applies to the quote character and the delimitation character. #Resolved

As discussed offline, I'll do this in an upcoming PR.

In reply to: 425735487 [](ancestors = 425735487)

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

harishsk · 2020-05-14T19:52:22Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderSaverCatalog.cs

@@ -41,6 +41,8 @@ public static class TextLoaderSaverCatalog
        /// A column may also have dense values followed by sparse values represented in this fashion. For example,
        /// a row containing "1 2 5 2:6 4:3" represents two dense columns with values 1 and 2, followed by 5 sparsely represented
        /// columns with values 0, 0, 6, 0, and 3. The indices of the sparse columns start from 0, even though 0 represents the third column.</param>
+        /// <param name="readMultilines">If true, it will accept new line characters inside quoted fields. Thus, enabling that a single field contain
+        /// multiple lines of text. If allowQuoting is false, this parameter is ignored.</param>


Are there scenarios where we would want this to be false?
If allowQuoting is true, don't we always want readMultiLines = true?

If that is the desired behavior then we don't need an API change. #Resolved

You're right in that we wouldn't need an API change in that case. I'm hesitant to always set readMultiLines as true if allowQuoting is enabled because of the reasons I've pointed out here:
#5125 (comment)

But yeah, in general it would be more convenient to simply use allowQuoting to decide if we want to read multilines, so that we can skip the part of having to create the readMultilines parameter.

So I'm not sure if we should do it or not. Any thoughts, @justinormont @harishsk ? #Resolved

How would we expect ModelBuilder to decide the values for allowQuoting and readMultilines? If it always sets them to true, then we are going to run into the bad case you mention in the discussion where the reader can potentially read the whole file. If they always set them to false, they won't be able to read a lot of files correctly. Do we expect them to try one different permutations till it succeeds?

In reply to: 425403877 [](ancestors = 425403877)

Yeah, it's certainly a dilemma, an I wouldn't know what the ModelBuilder team would prefer, so perhaps @LittleLittleCloud has something to say about this?

My recommendation is that ModelBuilder always uses allowQuoting and readMultilines. readMultilines should always work if the input file is correctly formatted, so in case an exception occurs in the case I pointed to, then it would be a problem with the dataset, and not with the tool. Perhaps adding an option for readMultiline to ModelBuilder's GUI could also work, but I don't know if it's possible to do it or convenient for the end user.

In reply to: 425464413 [](ancestors = 425464413,425403877)

@LittleLittleCloud @JakeRadMSFT, @justinormont We would greatly appreciate your input on this one.

In reply to: 425473921 [](ancestors = 425473921,425464413,425403877)

It's very quick to try and open the dataset w/ various parameters for the TextLoader. It tries to read the 1st 1000 rows of data; no model training/eval.

I believe we currently try up to 16 combination, adding this flag will move it to 32. The 16 is currently trivially fast, and should remain so.

One corner case to perf test is the run-away quoted option, where the close-quote is never provided. I expect this won't impact actual users as the run-away is stopped by the next double-quote, and the further lines will be misread causing that option set to not be successful (which is good).

We test the given TextLoader options set in TryParseFile(): (no training/eval)

machinelearning/src/Microsoft.ML.AutoML/ColumnInference/TextFileContents.cs

Lines 81 to 122 in e50c4d2

private static bool TryParseFile(MLContext context, TextLoader.Options options, IMultiStreamSource source,

out ColumnSplitResult result)

{

result = null;

// try to instantiate data view with swept arguments

try

{

var textLoader = context.Data.CreateTextLoader(options, source);

var idv = context.Data.TakeRows(textLoader.Load(source), 1000);

var columnCounts = new List<int>();

var column = idv.Schema["C"];

using (var cursor = idv.GetRowCursor(new[] { column }))

{

var getter = cursor.GetGetter<VBuffer<ReadOnlyMemory<char>>>(column);

VBuffer<ReadOnlyMemory<char>> line = default;

while (cursor.MoveNext())

{

getter(ref line);

columnCounts.Add(line.Length);

}

}

var mostCommon = columnCounts.GroupBy(x => x).OrderByDescending(x => x.Count()).First();

if (mostCommon.Count() < UniformColumnCountThreshold * columnCounts.Count)

{

return false;

}

// disallow single-column case

if (mostCommon.Key <= 1) { return false; }

result = new ColumnSplitResult(true, options.Separators.First(), options.AllowQuoting, options.AllowSparse, mostCommon.Key);

return true;

}

// fail gracefully if unable to instantiate data view with swept arguments

catch(Exception)

{

return false;

}

}

#Resolved

It's very quick to try and open the dataset w/ various parameters for the TextLoader. It tries to read the 1st 1000 rows of data; no model training/eval.

I believe we currently try up to 16 combination, adding this flag will move it to 32. The 16 is currently trivially fast, and should remain so.

One corner case to perf test is the run-away quoted option, where the close-quote is never provided. I expect this won't impact actual users the run-away is stopped by the next double-quote, and the further lines will be misread causing that option set to not be successful (which is good).

We test the given TextLoader options set in TryParseFile(): (no training/eval)

machinelearning/src/Microsoft.ML.AutoML/ColumnInference/TextFileContents.cs

Lines 81 to 122 in e50c4d2

private static bool TryParseFile(MLContext context, TextLoader.Options options, IMultiStreamSource source,

out ColumnSplitResult result)

{

result = null;

// try to instantiate data view with swept arguments

try

{

var textLoader = context.Data.CreateTextLoader(options, source);

var idv = context.Data.TakeRows(textLoader.Load(source), 1000);

var columnCounts = new List<int>();

var column = idv.Schema["C"];

using (var cursor = idv.GetRowCursor(new[] { column }))

{

var getter = cursor.GetGetter<VBuffer<ReadOnlyMemory<char>>>(column);

VBuffer<ReadOnlyMemory<char>> line = default;

while (cursor.MoveNext())

{

getter(ref line);

columnCounts.Add(line.Length);

}

}

var mostCommon = columnCounts.GroupBy(x => x).OrderByDescending(x => x.Count()).First();

if (mostCommon.Count() < UniformColumnCountThreshold * columnCounts.Count)

{

return false;

}

// disallow single-column case

if (mostCommon.Key <= 1) { return false; }

result = new ColumnSplitResult(true, options.Separators.First(), options.AllowQuoting, options.AllowSparse, mostCommon.Key);

return true;

}

// fail gracefully if unable to instantiate data view with swept arguments

catch(Exception)

{

return false;

}

}

I agree with Justin, just adding another sweep option should be enough to solve parsing error.

Besides, I would not make any changes to AutoML.Net ColumnInference API, to me readMultiline is a trivial and confusing detail that should be concealed from AutoML users
#Resolved

Ok, so after discussing it offline with @LittleLittleCloud , we decided that model builder team will take care of changing anything necessary in AutoML.NET to make this work for them , once this PR is merged. 😄

About the ColumnInference API, I thought it was important for you guys since the original issue of all of these ( #4460 ) explicitly talks about having problems with using ColumnInference() with new line characters inside quoted fields. But if it's not important, then only changing TryParseFile() might be enough and straightforward. In any case, I'll leave this discussion and fix up to the model builder team. Thanks. #Resolved

BTW, Just out of curiosity, does ModelBuilder uses the ColumnInference() method @LittleLittleCloud ? #Resolved

BTW, Just out of curiosity, does ModelBuilder uses the ColumnInference() method @LittleLittleCloud ?
Yes, we use it in AutoMLService
#Resolved

Co-authored-by: Justin Ormont <[email protected]>

… that checked ColumnInference

…s1029/machinelearning into is4460TextLoaderQuoting

…uoting

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

test/Microsoft.ML.TestFramework/TestCommandBase.cs

test/Microsoft.ML.Tests/TextLoaderTests.cs

justinormont · 2020-05-19T02:52:57Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs

+
+                        sb.Clear();
+                        sb.Append(line);
+                        while (numOfQuotes % 2 != 0)


How would you handle the case

Label,Size,Age true,5.4",5

I think we need to check that a quoted field begins with the quote character (vs only includes ones). If the field isn't quoted, it can include non-escaped quote chars. If it's quoted, all quotes must be escaped within the field. #Resolved

According to the RFC your row isn't valid, as " should always be either escaped, or be used to indicate a quoted field. So I assumed that datasets should comply with RFC when using ReadMultiline, to make the MultilineReader code less complicated (and faster).

Should we actually accept this even when ReadMultiline is true? #Resolved

Mmm ok, after thinking about this a bit, I will anyways have to change the logic in this method for the escapeChar feature, so I might just as well allow some subset of cases similar to the one you pointed out (i.e. using quotes inside an unquoted field, and not counting them as invalid quotes). I guess it's worth it to avoid users hitting the EOF exception and/or loading much of the text into memory which are things that would occur with invalid CSVs such as this one if I don't take special care of this kind of case.

In reply to: 427003836 [](ancestors = 427003836)

I've done this. Let us follow this thread here:
#5125 (comment)

In reply to: 427010189 [](ancestors = 427010189,427003836)

…ccepting invalid formats

antoniovs1029 · 2020-05-19T09:51:01Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs

+                        if(startsInsideQuoted || line[ichCur] == '"')
+                        {
+                            // Quoted Field Case
+
+                            if (!startsInsideQuoted)
+                                ichCur++;
+
+                            for (; ; ichCur++)
+                            {
+                                if (ichCur >= ichLim)
+                                    // We've reached the end of the line without finding the closing quote,
+                                    // so next line will start on this quoted field
+                                    return true;


I've rewritten the logic in Multilinereader, in order to be more flexible in what we accept (following this comment: #5125 (comment) )

In principle, it's the same as in previous iterations of the PR: simply read every character in the read line, and decide if we should read more lines as part of the current row. But now this code mimics the one found in TextLoader.Parser.HelperImpl.FetchNextField() to actually make a distinction between quoted and non-quoted fields.

So now the behavior has changed for the following scenarios, which are now accepted by Multilinereader without much problem (although the parser might throw or log an error when processing these). Notice that these cases are actually not accepted by the RFC and can be considered invalid formats, so we actually have liberty on how we handle this, and I think the way I've handled it will cause the least negative impact for the users:

We'll accept " inside unquoted fields, and treat those quotes as if they were any other character. This one used to cause the multilinereader to think it was opening a quoted field, and potentially read the rest of the file into this line. Now it knows it's not opening a quoted field.

Label,Size,Age true,5.4",5

We'll accept characters between a closing quote in a quoted field and the next separator. This case is actually always going to cause a QuotingError on the Parser when parsing the row which will make the parser to return an empty row instead (notice that quoting errors don't cause exceptions). Because of this, I set quotingError as true in here in the Multilinereader, which means that the rest of the row after the invalid closing quote, will be ignored:

Label,Size,Age true,"5."4,5 // this will be read as a single line, and then the parser will return it as empty false,"2.3",8// this will be read correctly

A positive side effect of point number 2, including the quotingError mechanism, is that now it's much harder for the multilinereader to read a complete file when there's a quoting error. Now if there's a quoted field without closing quote, it will read it until the next quote, and will consider the rest of that line and all the read lines as a single row (which will produce a QuotingError on the parser). And so the following lines will remain unaffected. Previous behavior for this particular case would actually have been to load the rest of the file as part of the first row:

// the first 3 rows will be read as single row, and later the parser will have a quoting error there // after that all rows will be read correctly. Label,Size,Age true,"5.4,5 true,2,4 false,"2.3",8 true,"3.5",67 ...

// Same behavior as above: Label,Size,Age true,"5.4,5 true,2,4 false,"2.3,8 true,"3.5",67 ...

harishsk

antoniovs1029 · 2020-05-19T21:33:20Z

For the record,

I ran all existing tests with readMultilines set as true, and everything worked as expected, and I didn't noticed a difference in running time. I also testes the final commit on this PR with some of ModelBuilder's datasets, and everything worked, and I also didn't noticed a difference in running time when using readMultilines set as true, or set as false.

I also ran the FeaturizeTextBench benchmark test (although I modified it to randomly decide if a field will be quoted or not. and never added new line characters inside the quoted fields), but I'm afraid I'm not sure how to interpret the results, which are as follows:

Without this PR:

|             Method |    Mean |   Error |  StdDev | Extra Metric |
|------------------- |--------:|--------:|--------:|-------------:|
| TrainFeaturizeText | 20.65 s | 23.31 s | 1.278 s |            - |

With this PR, with readMultilines = false

|             Method |    Mean |   Error |  StdDev | Extra Metric |
|------------------- |--------:|--------:|--------:|-------------:|
| TrainFeaturizeText | 18.79 s | 7.632 s | 0.418 s |            - |

With this PR, with readMultilines = true

|             Method |    Mean |   Error |  StdDev | Extra Metric |
|------------------- |--------:|--------:|--------:|-------------:|
| TrainFeaturizeText | 27.89 s | 13.39 s | 0.734 s |            - |

Where the columns are said to be:

// * Legends *
  Mean         : Arithmetic mean of all measurements
  Error        : Half of 99.9% confidence interval
  StdDev       : Standard deviation of all measurements
  Extra Metric : Value of the provided extra metric
  1 s          : 1 Second (1 sec)

As it can be noticed, mean time went 50% up when accepting readMultiline, but the Error doubled (which to be honest, I'm not sure what "error" actually is). Since while testing manually with ModelBuilder's dataset I certainly didn't noticed a change in training time, I think that in general doubling the time isn't expected. (I ran a full training pipeline on each of the datasets, and although I didn't use an stopwatch, I did took notice of the time reported by the debugger when hitting breakpoints, and it was simply the same time).

Also, strangely, times were better with this PR with readMultiline = false, than without it (which doesn't makes much sense since on this PR I did add extra checks which should have some little negative effect on performance)... and well, error in the case without this PR is very high... so I'm not entirely sure if this benchmark test is actually measuring accurately whatever needs to be measured.

LittleLittleCloud and others added 14 commits December 17, 2019 15:52

fix textloader bug on quotes

e03b0e4

use static method instead of extension method

3ddb3b0

better name

9b20ead

Merge branch 'master' into u/xiaoyun/fixTextLoaderBug

34d9efa

Merge remote-tracking branch 'xiaoyun/u/xiaoyun/fixTextLoaderBug' int…

ac6f971

…o is4460TextLoaderQuoting

Moved multiline reader to TextLoaderCursor, and added return when mul…

8304d62

…tiline doesn't close quoting

Fix issue with empty unquoted new lines

a9e91e2

Make ReadMultiLine a little bit more efficient

9cfaee1

Add clarifying comment in FetchNextField

2827c61

Added test

a427662

Added test for column inference

1cafbf0

Make multilinereader a little bit more efficient

6116d97

Create read multilines parameter

df2ca25

Make tests run with readMultilines parameter

530f41e

antoniovs1029 requested review from a team as code owners May 14, 2020 10:18

antoniovs1029 requested review from justinormont, harishsk and LittleLittleCloud May 14, 2020 10:19

antoniovs1029 commented May 14, 2020

View reviewed changes

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs Outdated Show resolved Hide resolved

antoniovs1029 commented May 14, 2020

View reviewed changes

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs Outdated Show resolved Hide resolved

antoniovs1029 commented May 14, 2020

View reviewed changes

antoniovs1029 mentioned this pull request May 14, 2020

ModelBuilder fails to use a CSV dataset that contains commas inside a quoted field on the first line dotnet/machinelearning-modelbuilder#747

Closed

LittleLittleCloud reviewed May 14, 2020

View reviewed changes

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs Show resolved Hide resolved

harishsk reviewed May 14, 2020

View reviewed changes

antoniovs1029 mentioned this pull request May 14, 2020

[AutoML v0.16.0] InferColumn doesn't work on tricky csv file #4460

Closed

antoniovs1029 and others added 6 commits May 15, 2020 11:49

Update src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs

fb6ab28

Co-authored-by: Justin Ormont <[email protected]>

Update src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs

f592a7f

Co-authored-by: Justin Ormont <[email protected]>

Removed new readMultiline option from AutoML.NET and removed the test…

a13b803

… that checked ColumnInference

Merge branch 'is4460TextLoaderQuoting' of https://github.com/antoniov…

c2d2ac7

…s1029/machinelearning into is4460TextLoaderQuoting

Actually add new line characters to loaded strings

13033bf

Actually include new line characters in loaded strings

2983b06

LittleLittleCloud approved these changes May 18, 2020

View reviewed changes

antoniovs1029 added 6 commits May 18, 2020 13:01

Let the TextSaver also correctly save new lines inside quoted fields

9789479

Fixed bug when calling GetSomeLines

60c4169

Throw exception when reaching EOF without ending quote on a given field.

fa28ddd

Refactored readMultilines into OptionsFlags

56279b4

Added test to check new readMultiline option from MAML.

51d9390

Merge remote-tracking branch 'upstream/master' into is4460TextLoaderQ…

5cc7512

…uoting

antoniovs1029 commented May 19, 2020

View reviewed changes

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs Show resolved Hide resolved

Fixed mistake on an unrelated test

5be5f6f

antoniovs1029 commented May 19, 2020

View reviewed changes

test/Microsoft.ML.TestFramework/TestCommandBase.cs Outdated Show resolved Hide resolved

antoniovs1029 commented May 19, 2020

View reviewed changes

test/Microsoft.ML.Tests/TextLoaderTests.cs Show resolved Hide resolved

justinormont reviewed May 19, 2020

View reviewed changes

antoniovs1029 added 3 commits May 18, 2020 23:01

Added internal default to ReadMultilines

b4e3029

Do more checking in MultilineReader in order to be more flexible in a…

7e16fc7

…ccepting invalid formats

Updated tests

f0652a5

antoniovs1029 commented May 19, 2020

View reviewed changes

harishsk approved these changes May 19, 2020

View reviewed changes

antoniovs1029 merged commit 4dcc6a8 into dotnet:master May 19, 2020

antoniovs1029 mentioned this pull request May 19, 2020

Add line limit to readMultilines in TextLoader #5144

Open

This was referenced May 20, 2020

Added decimal marker option in TextLoader #5145

Merged

Add escapeChar support to TextLoader and added benchmark for TextLoader #5147

Merged

justinormont mentioned this pull request Jun 6, 2020

Error in ML.net training #4464

Closed

ghost locked as resolved and limited conversation to collaborators Mar 18, 2022

		TrimWhitespace = trimWhitespace,
		ReadMultilines = true // MYTODO: is it ok to hardcode this? it is necessary for my test to pass

	private static bool TryParseFile(MLContext context, TextLoader.Options options, IMultiStreamSource source,
	out ColumnSplitResult result)
	{
	result = null;
	// try to instantiate data view with swept arguments
	try
	{
	var textLoader = context.Data.CreateTextLoader(options, source);
	var idv = context.Data.TakeRows(textLoader.Load(source), 1000);
	var columnCounts = new List<int>();
	var column = idv.Schema["C"];

	using (var cursor = idv.GetRowCursor(new[] { column }))
	{
	var getter = cursor.GetGetter<VBuffer<ReadOnlyMemory<char>>>(column);

	VBuffer<ReadOnlyMemory<char>> line = default;
	while (cursor.MoveNext())
	{
	getter(ref line);
	columnCounts.Add(line.Length);
	}
	}

	var mostCommon = columnCounts.GroupBy(x => x).OrderByDescending(x => x.Count()).First();
	if (mostCommon.Count() < UniformColumnCountThreshold * columnCounts.Count)
	{
	return false;
	}

	// disallow single-column case
	if (mostCommon.Key <= 1) { return false; }

	result = new ColumnSplitResult(true, options.Separators.First(), options.AllowQuoting, options.AllowSparse, mostCommon.Key);
	return true;
	}
	// fail gracefully if unable to instantiate data view with swept arguments
	catch(Exception)
	{
	return false;
	}
	}

Enable TextLoader to accept new lines in quoted fields #5125

Enable TextLoader to accept new lines in quoted fields #5125

Uh oh!

Conversation

antoniovs1029 commented May 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Complying with RFC 4180

Modelbuilder and AutoML issues

Testing this PR

Uh oh!

antoniovs1029 commented May 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

antoniovs1029 May 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

justinormont May 15, 2020 • edited by antoniovs1029 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 May 19, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

antoniovs1029 May 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

justinormont May 15, 2020 • edited by antoniovs1029 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 May 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 May 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 May 19, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 commented May 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

antoniovs1029 May 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

justinormont May 15, 2020 • edited by antoniovs1029 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 May 19, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

harishsk May 14, 2020 • edited by antoniovs1029 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 May 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harishsk May 14, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 May 14, 2020

Choose a reason for hiding this comment

Uh oh!

harishsk May 15, 2020

Choose a reason for hiding this comment

Uh oh!

justinormont May 15, 2020 • edited by antoniovs1029 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

antoniovs1029 commented May 14, 2020 •

edited

Loading

antoniovs1029 commented May 14, 2020 •

edited

Loading

antoniovs1029 May 14, 2020 •

edited

Loading

justinormont May 15, 2020 •

edited by antoniovs1029

Loading

antoniovs1029 May 14, 2020 •

edited

Loading

justinormont May 15, 2020 •

edited by antoniovs1029

Loading

antoniovs1029 May 15, 2020 •

edited

Loading

antoniovs1029 May 15, 2020 •

edited

Loading

antoniovs1029 commented May 14, 2020 •

edited

Loading

antoniovs1029 May 14, 2020 •

edited

Loading

justinormont May 15, 2020 •

edited by antoniovs1029

Loading

harishsk May 14, 2020 •

edited by antoniovs1029

Loading

antoniovs1029 May 14, 2020 •

edited

Loading

justinormont May 15, 2020 •

edited by antoniovs1029

Loading

LittleLittleCloud May 15, 2020 •

edited by antoniovs1029

Loading

antoniovs1029 May 15, 2020 •

edited

Loading

antoniovs1029 May 15, 2020 •

edited

Loading

LittleLittleCloud May 18, 2020 •

edited by antoniovs1029

Loading

justinormont May 19, 2020 •

edited by antoniovs1029

Loading

antoniovs1029 May 19, 2020 •

edited

Loading

antoniovs1029 May 19, 2020 •

edited

Loading

antoniovs1029 May 19, 2020 •

edited

Loading

antoniovs1029 commented May 19, 2020 •

edited

Loading