-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Enable TextLoader to accept new lines in quoted fields #5125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable TextLoader to accept new lines in quoted fields #5125
Conversation
…o is4460TextLoaderQuoting
…tiline doesn't close quoting
Regarding @justinormont 's concern about this working with multithreading, there's nothing to worry about. I did ran locally all the existing ML.NET tests using ReadMultilines = true, with both 1 or multiple threads, and everything worked as expected. Reason for this is that |
break; // MYTODO: This could happen if we have an invalid open quote which never closes so we reach the end of the file without properly closing the field, should we throw instead in this case? | ||
|
||
if(line.Length != 0) | ||
sb.Append(" "); // MYTODO: should we use instead a "\n" in here to separate lines? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@justinormont recommended adding an actual new line here instead of a space. But I'm not sure if it would matter much. Also I wouldn't know when to append "\n" instead of "\r\n", or if it would make any difference. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. We'll want newlines instead of spaces.
The same as newlines are different to human readers, the ML can just as well make use of this information. We also have no control over how the user featurizes their dataset in the ML pipeline; the newlines could be important to their processing.
We also would want to maintain round trip read/write of the dataset. Which means we'd have to modify the TextSaver
to support (or verify current support) for newlines. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done this in a more recent iteration. To continue discussing about this, it'd be better to do it here:
#5125 (comment)
In reply to: 425757137 [](ancestors = 425757137)
TrimWhitespace = trimWhitespace, | ||
ReadMultilines = true // MYTODO: is it ok to hardcode this? it is necessary for my test to pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I am not familiar at all with AutoML code, but in order to add a test reproducing the scenario in #4460 I had to add this in here for it to work. Problem is that these methods in AutoML and several more methods there would require to either hardcode this or add more parameters to several methods to have a readMultilines parameter.
This is a problem because ReadMultilines is set as false by default, and we would need to manually set it as true anywhere where we'd want AutoML to have this behavior. So I am not sure what would be the best way to handle this from AutoML side. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect it's better to sweep over the ReadMultilines
flag. The same as we do with the other flags for TextLoader
(src).
Are there any datasets (or rows) which would be unreadable with ReadMultilines = true
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mention here: #5125 (comment) , if the CSV file contains a quoted field which has a " at the beginning of the field, but it never closes it (which would technically be invalid format), then the file won't be read correctly and some problems may arise from it.
For example, our "wikipedia-detox-250-line-data.tsv" has this error in line 83. The sentiment text field there opens with " but it never closes that quotation, so in that dataset the Multilinereader doesn't load that file correctly.
In reply to: 425713757 [](ancestors = 425713757)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding your suggestion of simply sweeping over the flags in TrySplitColumns, it doesn't look that would affect the InferColumns method I'm mentioning in this thread (it looks to me the InferColumns method isn't called when calling TrySplitColumns). As you can see in the ColumnInferenceApi, there are several methods which have multiple parameters related to TextLoader, and it seems I'd need to add a new parameter to all of them. But my question is that I wouldn't know if it's necessary to add it in all of them to make the readMultiline parameter accessible to ModelBuilder. Furthermore, it seems these methods are called by the AutoCatalog.cs which also exposes some public methods with parameters related to TextLoader. And it would again seem that I'd need to add a new readMultiline parameter there... but I guess won't be able to do that since that is a public API, and adding parameters is considered a breaking API change.
So, in general, I don't know which criteria to use in order to know where to add a readMultiline parameter in order to make it accessible for ModelBuilder. Is there any other place besides ColumnInferenceAPI and TrySplitColumns?
In reply to: 425713757 [](ancestors = 425713757)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed in here: I won't change anything in AutoML.NET code, and will leave this up to ModelBuilder team.
#5125 (comment)
In reply to: 425726377 [](ancestors = 425726377,425713757)
Currently the CI is failing because apparently I can't add a new parameter to the public methods that create TextLoaders, because it's considered a breaking API change. Anyway, I will fix this later, in the meantime, please feel free to review. Thanks! #Resolved |
test/data/multiline.csv
Outdated
// this file should be loaded with quoting enabled | ||
// and it should load without problems by the TextLoader | ||
id,description,animal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So here are several examples that the TextLoader should be able to load according to the RFC and to how it has worked in the past for ML.NET's purposes. Please, let me know if I should add more cases.
Also, from this thread, I think it's better not to support these:
Escaping \"
inside quoted fields. There are 3 main reasons to not support this:
- The RFC 4180 doesn't support it. Quotes should be escaped using ""
- Excel doesn't really support it consistently. If I have a CSV with
19,"this has an \"scaped quote\" right"
the first one will not be escaped but the second one does. - Among the datasets provided by ModelBuilder team some of them have the characters
\"
but they're not meant to escape the"
. E.g., there are quoted fields that end with those characters"blabla\"
and so the last"
is supposed to be the closing of the field and the\
is supposed to be part of the field. So escaping this would cause errors for those datasets.
Accept True, "asdf,123
even when quoting is enabled There are 2 main reasons for not doing this:
- The RFC doesn't accept this.
- In ML.NET we filter out the spaces at the beginning of the field, so there's no difference between that and
True,"asdf,123
which would be an error. So code-wise it would be weird to accept this particular case.
In general, I wouldn't accept these cases unless there were a lot of people asking for them. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would adding support for alternative escaping methods in the future be difficult? Slash escaping is common enough.
Pandas (generally used by scikit-learn) exposes an escapechar
parameter to allow for this: https://github.com/pandas-dev/pandas/blob/7456fc52e1546d263b88ba5772c0324f91ff0804/doc/source/user_guide/io.rst#quoting-and-escape-characters
Databricks exposes an escape
parameter for this: https://docs.databricks.com/data/data-sources/read-csv.html
Note that both of these recommend/default to slash escaping instead of double-double-quotes for escaping. The escaping applies to the quote character and the delimitation character. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, I'll do this in an upcoming PR.
In reply to: 425735487 [](ancestors = 425735487)
@@ -41,6 +41,8 @@ public static class TextLoaderSaverCatalog | |||
/// A column may also have dense values followed by sparse values represented in this fashion. For example, | |||
/// a row containing "1 2 5 2:6 4:3" represents two dense columns with values 1 and 2, followed by 5 sparsely represented | |||
/// columns with values 0, 0, 6, 0, and 3. The indices of the sparse columns start from 0, even though 0 represents the third column.</param> | |||
/// <param name="readMultilines">If true, it will accept new line characters inside quoted fields. Thus, enabling that a single field contain | |||
/// multiple lines of text. If allowQuoting is false, this parameter is ignored.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there scenarios where we would want this to be false?
If allowQuoting is true, don't we always want readMultiLines = true?
If that is the desired behavior then we don't need an API change. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right in that we wouldn't need an API change in that case. I'm hesitant to always set readMultiLines as true if allowQuoting is enabled because of the reasons I've pointed out here:
#5125 (comment)
But yeah, in general it would be more convenient to simply use allowQuoting to decide if we want to read multilines, so that we can skip the part of having to create the readMultilines parameter.
So I'm not sure if we should do it or not. Any thoughts, @justinormont @harishsk ? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would we expect ModelBuilder to decide the values for allowQuoting and readMultilines? If it always sets them to true, then we are going to run into the bad case you mention in the discussion where the reader can potentially read the whole file. If they always set them to false, they won't be able to read a lot of files correctly. Do we expect them to try one different permutations till it succeeds?
In reply to: 425403877 [](ancestors = 425403877)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's certainly a dilemma, an I wouldn't know what the ModelBuilder team would prefer, so perhaps @LittleLittleCloud has something to say about this?
My recommendation is that ModelBuilder always uses allowQuoting and readMultilines. readMultilines should always work if the input file is correctly formatted, so in case an exception occurs in the case I pointed to, then it would be a problem with the dataset, and not with the tool. Perhaps adding an option for readMultiline to ModelBuilder's GUI could also work, but I don't know if it's possible to do it or convenient for the end user.
In reply to: 425464413 [](ancestors = 425464413,425403877)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LittleLittleCloud @JakeRadMSFT, @justinormont We would greatly appreciate your input on this one.
In reply to: 425473921 [](ancestors = 425473921,425464413,425403877)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's very quick to try and open the dataset w/ various parameters for the TextLoader
. It tries to read the 1st 1000 rows of data; no model training/eval.
I believe we currently try up to 16 combination, adding this flag will move it to 32. The 16 is currently trivially fast, and should remain so.
One corner case to perf test is the run-away quoted option, where the close-quote is never provided. I expect this won't impact actual users as the run-away is stopped by the next double-quote, and the further lines will be misread causing that option set to not be successful (which is good).
We test the given TextLoader
options set in TryParseFile()
: (no training/eval)
machinelearning/src/Microsoft.ML.AutoML/ColumnInference/TextFileContents.cs
Lines 81 to 122 in e50c4d2
private static bool TryParseFile(MLContext context, TextLoader.Options options, IMultiStreamSource source, | |
out ColumnSplitResult result) | |
{ | |
result = null; | |
// try to instantiate data view with swept arguments | |
try | |
{ | |
var textLoader = context.Data.CreateTextLoader(options, source); | |
var idv = context.Data.TakeRows(textLoader.Load(source), 1000); | |
var columnCounts = new List<int>(); | |
var column = idv.Schema["C"]; | |
using (var cursor = idv.GetRowCursor(new[] { column })) | |
{ | |
var getter = cursor.GetGetter<VBuffer<ReadOnlyMemory<char>>>(column); | |
VBuffer<ReadOnlyMemory<char>> line = default; | |
while (cursor.MoveNext()) | |
{ | |
getter(ref line); | |
columnCounts.Add(line.Length); | |
} | |
} | |
var mostCommon = columnCounts.GroupBy(x => x).OrderByDescending(x => x.Count()).First(); | |
if (mostCommon.Count() < UniformColumnCountThreshold * columnCounts.Count) | |
{ | |
return false; | |
} | |
// disallow single-column case | |
if (mostCommon.Key <= 1) { return false; } | |
result = new ColumnSplitResult(true, options.Separators.First(), options.AllowQuoting, options.AllowSparse, mostCommon.Key); | |
return true; | |
} | |
// fail gracefully if unable to instantiate data view with swept arguments | |
catch(Exception) | |
{ | |
return false; | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's very quick to try and open the dataset w/ various parameters for the
TextLoader
. It tries to read the 1st 1000 rows of data; no model training/eval.I believe we currently try up to 16 combination, adding this flag will move it to 32. The 16 is currently trivially fast, and should remain so.
One corner case to perf test is the run-away quoted option, where the close-quote is never provided. I expect this won't impact actual users the run-away is stopped by the next double-quote, and the further lines will be misread causing that option set to not be successful (which is good).
We test the given
TextLoader
options set inTryParseFile()
: (no training/eval)
machinelearning/src/Microsoft.ML.AutoML/ColumnInference/TextFileContents.cs
Lines 81 to 122 in e50c4d2
private static bool TryParseFile(MLContext context, TextLoader.Options options, IMultiStreamSource source, out ColumnSplitResult result) { result = null; // try to instantiate data view with swept arguments try { var textLoader = context.Data.CreateTextLoader(options, source); var idv = context.Data.TakeRows(textLoader.Load(source), 1000); var columnCounts = new List<int>(); var column = idv.Schema["C"]; using (var cursor = idv.GetRowCursor(new[] { column })) { var getter = cursor.GetGetter<VBuffer<ReadOnlyMemory<char>>>(column); VBuffer<ReadOnlyMemory<char>> line = default; while (cursor.MoveNext()) { getter(ref line); columnCounts.Add(line.Length); } } var mostCommon = columnCounts.GroupBy(x => x).OrderByDescending(x => x.Count()).First(); if (mostCommon.Count() < UniformColumnCountThreshold * columnCounts.Count) { return false; } // disallow single-column case if (mostCommon.Key <= 1) { return false; } result = new ColumnSplitResult(true, options.Separators.First(), options.AllowQuoting, options.AllowSparse, mostCommon.Key); return true; } // fail gracefully if unable to instantiate data view with swept arguments catch(Exception) { return false; } }
I agree with Justin, just adding another sweep option should be enough to solve parsing error.
Besides, I would not make any changes to AutoML.Net ColumnInference
API, to me readMultiline
is a trivial and confusing detail that should be concealed from AutoML users
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so after discussing it offline with @LittleLittleCloud , we decided that model builder team will take care of changing anything necessary in AutoML.NET to make this work for them , once this PR is merged. 😄
About the ColumnInference API, I thought it was important for you guys since the original issue of all of these ( #4460 ) explicitly talks about having problems with using ColumnInference() with new line characters inside quoted fields. But if it's not important, then only changing TryParseFile() might be enough and straightforward. In any case, I'll leave this discussion and fix up to the model builder team. Thanks. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, Just out of curiosity, does ModelBuilder uses the ColumnInference() method @LittleLittleCloud ? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, Just out of curiosity, does ModelBuilder uses the ColumnInference() method @LittleLittleCloud ?
Yes, we use it inAutoMLService
#Resolved
Co-authored-by: Justin Ormont <[email protected]>
Co-authored-by: Justin Ormont <[email protected]>
… that checked ColumnInference
…s1029/machinelearning into is4460TextLoaderQuoting
|
||
sb.Clear(); | ||
sb.Append(line); | ||
while (numOfQuotes % 2 != 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you handle the case
Label,Size,Age
true,5.4",5
I think we need to check that a quoted field begins with the quote character (vs only includes ones). If the field isn't quoted, it can include non-escaped quote chars. If it's quoted, all quotes must be escaped within the field. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the RFC your row isn't valid, as " should always be either escaped, or be used to indicate a quoted field. So I assumed that datasets should comply with RFC when using ReadMultiline
, to make the MultilineReader code less complicated (and faster).
Should we actually accept this even when ReadMultiline is true? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm ok, after thinking about this a bit, I will anyways have to change the logic in this method for the escapeChar feature, so I might just as well allow some subset of cases similar to the one you pointed out (i.e. using quotes inside an unquoted field, and not counting them as invalid quotes). I guess it's worth it to avoid users hitting the EOF exception and/or loading much of the text into memory which are things that would occur with invalid CSVs such as this one if I don't take special care of this kind of case.
In reply to: 427003836 [](ancestors = 427003836)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done this. Let us follow this thread here:
#5125 (comment)
In reply to: 427010189 [](ancestors = 427010189,427003836)
if(startsInsideQuoted || line[ichCur] == '"') | ||
{ | ||
// Quoted Field Case | ||
|
||
if (!startsInsideQuoted) | ||
ichCur++; | ||
|
||
for (; ; ichCur++) | ||
{ | ||
if (ichCur >= ichLim) | ||
// We've reached the end of the line without finding the closing quote, | ||
// so next line will start on this quoted field | ||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've rewritten the logic in Multilinereader, in order to be more flexible in what we accept (following this comment: #5125 (comment) )
In principle, it's the same as in previous iterations of the PR: simply read every character in the read line, and decide if we should read more lines as part of the current row. But now this code mimics the one found in TextLoader.Parser.HelperImpl.FetchNextField() to actually make a distinction between quoted and non-quoted fields.
So now the behavior has changed for the following scenarios, which are now accepted by Multilinereader without much problem (although the parser might throw or log an error when processing these). Notice that these cases are actually not accepted by the RFC and can be considered invalid formats, so we actually have liberty on how we handle this, and I think the way I've handled it will cause the least negative impact for the users:
- We'll accept " inside unquoted fields, and treat those quotes as if they were any other character. This one used to cause the multilinereader to think it was opening a quoted field, and potentially read the rest of the file into this line. Now it knows it's not opening a quoted field.
Label,Size,Age
true,5.4",5
- We'll accept characters between a closing quote in a quoted field and the next separator. This case is actually always going to cause a QuotingError on the Parser when parsing the row which will make the parser to return an empty row instead (notice that quoting errors don't cause exceptions). Because of this, I set quotingError as true in here in the Multilinereader, which means that the rest of the row after the invalid closing quote, will be ignored:
Label,Size,Age
true,"5."4,5 // this will be read as a single line, and then the parser will return it as empty
false,"2.3",8// this will be read correctly
- A positive side effect of point number 2, including the quotingError mechanism, is that now it's much harder for the multilinereader to read a complete file when there's a quoting error. Now if there's a quoted field without closing quote, it will read it until the next quote, and will consider the rest of that line and all the read lines as a single row (which will produce a QuotingError on the parser). And so the following lines will remain unaffected. Previous behavior for this particular case would actually have been to load the rest of the file as part of the first row:
// the first 3 rows will be read as single row, and later the parser will have a quoting error there
// after that all rows will be read correctly.
Label,Size,Age
true,"5.4,5
true,2,4
false,"2.3",8
true,"3.5",67
...
// Same behavior as above:
Label,Size,Age
true,"5.4,5
true,2,4
false,"2.3,8
true,"3.5",67
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, I ran all existing tests with readMultilines set as true, and everything worked as expected, and I didn't noticed a difference in running time. I also testes the final commit on this PR with some of ModelBuilder's datasets, and everything worked, and I also didn't noticed a difference in running time when using readMultilines set as true, or set as false. I also ran the Without this PR:
With this PR, with readMultilines = false
With this PR, with readMultilines = true
Where the columns are said to be:
As it can be noticed, mean time went 50% up when accepting readMultiline, but the Error doubled (which to be honest, I'm not sure what "error" actually is). Since while testing manually with ModelBuilder's dataset I certainly didn't noticed a change in training time, I think that in general doubling the time isn't expected. (I ran a full training pipeline on each of the datasets, and although I didn't use an stopwatch, I did took notice of the time reported by the debugger when hitting breakpoints, and it was simply the same time). Also, strangely, times were better with this PR with readMultiline = false, than without it (which doesn't makes much sense since on this PR I did add extra checks which should have some little negative effect on performance)... and well, error in the case without this PR is very high... so I'm not entirely sure if this benchmark test is actually measuring accurately whatever needs to be measured. |
Fixes https://github.com/dotnet/machinelearning-automl/issues/193
Addresses these issues (see below):
#4460
dotnet/machinelearning-modelbuilder#702
Based on @LittleLittleCloud 's PR:
#4584
This PR enables TextLoader to load rows which have new line characters inside quoted fields. So, for example, if the input CSV file has:
then all of that will actually be loaded into a single row. Previously, if the TextLoader was presented with such a case, it would typically throw exceptions because it wasn't able to parse it correctly.
I took @LittleLittleCloud 's PR which had most of the work done, and added some corner cases, public interfaces to this feature, and made and ran different tests, addressing the comments on that other PR.
Complying with RFC 4180
I was tasked to take care of this issue, while also making the TextLoader compliant with RFC 4180 (on CSV format). Turns out the TextLoader already accepted pretty much everything that the RFC states, except for accepting new lines inside quoted fields, which is pretty much the only thing fixed in this PR.
"this is ""an"" example"
will correctly be loaded asthis is "an" example
.Modelbuilder and AutoML issues
There are different issues in Modelbuilder and AutoML that were supposed to be caused by TextLoader and I was supposed to fix them on this PR. But after looking into them, it doesn't seem that they are caused by the TextLoader. I have opened this issue in ModelBuilder explaining this in more detail, but in general, for some reason, if the first line (after the header) of a CSV file contains a comma inside a quoted field, then ModelBuilder shows an error saying it can't be loaded; it was said that the problem was on TextLoader, but it isn't, particularly because TextLoader doesn't have any problem escaping commas inside quoted fields.
So, regarding issue dotnet/machinelearning-modelbuilder#702
Regarding issue #4460
Testing this PR
To test this PR I: