
Enable TextLoader to accept new lines in quoted fields #5125


Merged: 32 commits, May 19, 2020

Conversation

@antoniovs1029 (Member) commented May 14, 2020

Fixes https://github.com/dotnet/machinelearning-automl/issues/193

Addresses these issues (see below):
#4460
dotnet/machinelearning-modelbuilder#702

Based on @LittleLittleCloud's PR:
#4584

This PR enables TextLoader to load rows which have new line characters inside quoted fields. So, for example, if the input CSV file has:

id,description,animal
11,"this is a
quoted field with
newlines",cat

then all of that will actually be loaded into a single row. Previously, if the TextLoader was presented with such a case, it would typically throw exceptions because it wasn't able to parse it correctly.
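For reference, loading such a file with the option added by this PR might look like the following sketch (ReadMultilines is the new TextLoader.Options field; the column definitions and the "animals.csv" file name are just for illustration):

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

var options = new TextLoader.Options
{
    HasHeader = true,
    Separators = new[] { ',' },
    AllowQuoting = true,     // required; ReadMultilines is ignored otherwise
    ReadMultilines = true,   // the new option introduced by this PR
    Columns = new[]
    {
        new TextLoader.Column("id", DataKind.Int32, 0),
        new TextLoader.Column("description", DataKind.String, 1),
        new TextLoader.Column("animal", DataKind.String, 2)
    }
};

// "animals.csv" is a hypothetical file containing the rows shown above.
IDataView data = mlContext.Data.LoadFromTextFile("animals.csv", options);
```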

I took @LittleLittleCloud's PR, which had most of the work done, and added handling for some corner cases, public interfaces for this feature, and new tests, addressing the comments on that other PR.

Complying with RFC 4180

I was tasked with taking care of this issue, while also making the TextLoader compliant with RFC 4180 (the CSV format spec). It turns out the TextLoader already accepted pretty much everything the RFC specifies, except new lines inside quoted fields, which is essentially the only thing fixed in this PR.

  1. Notice that this means the TextLoader was already able to handle escaped double quotes ("") inside quoted fields; i.e., "this is ""an"" example" will correctly be loaded as this is "an" example (see the example after this list).
  2. When loading CSV files the TextLoader also accepts and does other things that the RFC doesn't mention, for example filtering lines that start with "//" as if they were comments, or accepting a single double quote (") in the middle of a field when allowQuoting is false.
  3. It seems to me the only thing the RFC states that ML.NET doesn't follow is that spaces should be considered part of the fields and not be ignored... but in general ML.NET trims spaces at the beginning and at the end of each field and line.
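For example, a file combining these behaviors (doubled quotes and commas inside a quoted field, plus, with this PR, newlines inside a quoted field) would load as two data rows:

id,description,animal
1,"says ""meow"", often
at night",cat
2,"says ""woof""",dog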

Modelbuilder and AutoML issues

There are different issues in ModelBuilder and AutoML that were believed to be caused by the TextLoader, and I was supposed to fix them in this PR. But after looking into them, it doesn't seem that they are caused by the TextLoader. I have opened this issue in ModelBuilder explaining this in more detail, but in short: for some reason, if the first line (after the header) of a CSV file contains a comma inside a quoted field, then ModelBuilder shows an error saying the file can't be loaded; it was said that the problem was in the TextLoader, but it isn't, particularly because the TextLoader has no problem handling commas inside quoted fields.

So, regarding issue dotnet/machinelearning-modelbuilder#702

  1. ML.NET's TextLoader was able to load most of the datasets pointed there, despite the fact that ModelBuilder shows an error message when the user selects them on the GUI.
  2. Only some of the jigsaw datasets (such as jigsaw-toxic-comment-train.csv) and the Airbnb datasets included new lines inside quoted fields, so they couldn't be loaded by ML.NET. This is fixed in this PR.

Regarding issue #4460

  1. @LittleLittleCloud mentioned that there were problems using InferColumns() with CSV files that contained new lines inside quotes; this is fixed in this PR and I added a test reproducing it.
  2. @LittleLittleCloud also provided a listings.csv dataset, which doesn't include new lines inside quotes, so this PR doesn't really affect it. Regarding this dataset, he didn't mention problems loading it, but rather problems with how the InferColumns method handled double quotes. It looks to me that that dataset is loaded correctly (i.e., escaping "" as expected), so further discussion would be needed to find the real root cause, which is likely not in the TextLoader.

Testing this PR

To test this PR I:

  1. Added 2 new tests
  2. Ran all existing ML.NET tests locally, using ReadMultilines = true as the default, and everything worked as expected. I did this using both 1 thread and multiple threads for the text loader, to check the effect of multithreading.
  3. Successfully loaded all the datasets provided by the ModelBuilder team in ML.NET.

@antoniovs1029 (Member, Author) commented May 14, 2020

Regarding @justinormont 's concern about this working with multithreading, there's nothing to worry about.

I ran all the existing ML.NET tests locally using ReadMultilines = true, with both 1 and multiple threads, and everything worked as expected.

The reason is that TextLoader.Cursor.LineReader always works the same way, regardless of the number of threads. It reads each line in the input files and puts them into its BlockQueue<LineBatch> _queue; all cursors share the same LineReader object. The cursor (or cursors) then get lines to parse by taking batches from this queue. So reading the lines in the file is always done sequentially in a single background thread, whereas taking line batches from the queue is what can be done from multiple threads. #Resolved
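The pattern is roughly the following (a minimal, self-contained sketch with hypothetical types and file names, not ML.NET's internal classes):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class LineBatchSketch
{
    // Group raw lines into fixed-size batches (a stand-in for LineBatch).
    static IEnumerable<string[]> ReadLineBatches(string path, int batchSize = 64)
    {
        var batch = new List<string>(batchSize);
        foreach (var line in File.ReadLines(path))
        {
            batch.Add(line);
            if (batch.Count == batchSize) { yield return batch.ToArray(); batch.Clear(); }
        }
        if (batch.Count > 0) yield return batch.ToArray();
    }

    static void Main()
    {
        var queue = new BlockingCollection<string[]>(boundedCapacity: 8);

        // Single background reader: the file is always read sequentially here
        // (and in the real LineReader this is also where quoted multiline rows
        // would be stitched together), regardless of how many cursors exist.
        var reader = Task.Run(() =>
        {
            foreach (var batch in ReadLineBatches("input.csv")) // hypothetical input file
                queue.Add(batch);
            queue.CompleteAdding();
        });

        // Multiple consumers (the "cursors"): each parses whatever batch it dequeues.
        var parsers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
        {
            foreach (var batch in queue.GetConsumingEnumerable())
                foreach (var line in batch) { /* parse the fields of this line */ }
        })).ToArray();

        Task.WaitAll(parsers.Concat(new[] { reader }).ToArray());
    }
}
```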

break; // MYTODO: This could happen if we have an invalid open quote which never closes so we reach the end of the file without properly closing the field, should we throw instead in this case?

if(line.Length != 0)
sb.Append(" "); // MYTODO: should we use instead a "\n" in here to separate lines?
@antoniovs1029 (Member, Author), May 14, 2020

@justinormont recommended adding an actual new line here instead of a space. But I'm not sure if it would matter much. Also I wouldn't know when to append "\n" instead of "\r\n", or if it would make any difference. #Resolved

@justinormont (Contributor), May 15, 2020

Yep. We'll want newlines instead of spaces.

Just as newlines are meaningful to human readers, the ML can make use of this information as well. We also have no control over how the user featurizes their dataset in the ML pipeline; the newlines could be important to their processing.

We would also want to maintain round-trip read/write of the dataset, which means we'd have to modify the TextSaver to support newlines (or verify current support). #Resolved
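A quick sketch of how one might exercise that round trip (hypothetical file names; this assumes the ReadMultilines option from this PR and that the saver quotes fields containing embedded newlines):

```csharp
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

var options = new TextLoader.Options
{
    HasHeader = true,
    Separators = new[] { ',' },
    AllowQuoting = true,
    ReadMultilines = true,
    Columns = new[] { new TextLoader.Column("description", DataKind.String, 1) }
};

// Load a file that contains newlines inside quoted fields (hypothetical path).
IDataView data = mlContext.Data.LoadFromTextFile("animals.csv", options);

// Write it back out; for the round trip to hold, the saver must quote fields so
// that the embedded newlines survive being written and read back.
using (var stream = File.Create("roundtrip.tsv"))
    mlContext.Data.SaveAsText(data, stream, separatorChar: '\t');

// Reloading "roundtrip.tsv" with quoting enabled and comparing its rows against
// `data` would then confirm round-trip behavior.
```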

@antoniovs1029 (Member, Author)

I've done this in a more recent iteration. To continue discussing this, it'd be better to do it here:
#5125 (comment)



Comment on lines 59 to 60
TrimWhitespace = trimWhitespace,
ReadMultilines = true // MYTODO: is it ok to hardcode this? it is necessary for my test to pass
@antoniovs1029 (Member, Author), May 14, 2020

So I am not familiar at all with the AutoML code, but in order to add a test reproducing the scenario in #4460 I had to add this here for it to work. The problem is that this method, and several other methods in AutoML, would require either hardcoding this or adding a readMultilines parameter to several methods.

This is a problem because ReadMultilines is set to false by default, and we would need to manually set it to true anywhere we'd want AutoML to have this behavior. So I am not sure what would be the best way to handle this on the AutoML side. #Resolved

@justinormont (Contributor), May 15, 2020

I expect it's better to sweep over the ReadMultilines flag, the same as we do with the other flags for TextLoader (src).

Are there any datasets (or rows) which would be unreadable with ReadMultilines = true? #Resolved

@antoniovs1029 (Member, Author), May 15, 2020

As I mention here: #5125 (comment), if the CSV file contains a quoted field that opens with a " at the beginning of the field but never closes it (which would technically be an invalid format), then the file won't be read correctly and some problems may arise from it.

For example, our "wikipedia-detox-250-line-data.tsv" has this error in line 83. The sentiment text field there opens with " but never closes that quotation, so the Multilinereader doesn't load that file correctly.



@antoniovs1029 (Member, Author), May 15, 2020

Regarding your suggestion of simply sweeping over the flags in TrySplitColumns: it doesn't look like that would affect the InferColumns method I'm mentioning in this thread (it looks to me like the InferColumns method isn't called when calling TrySplitColumns). As you can see in the ColumnInferenceApi, there are several methods that have multiple parameters related to TextLoader, and it seems I'd need to add a new parameter to all of them. My question is whether it's necessary to add it to all of them to make the readMultiline parameter accessible to ModelBuilder. Furthermore, these methods are called by AutoCatalog.cs, which also exposes some public methods with parameters related to TextLoader. It would again seem that I'd need to add a new readMultiline parameter there... but I guess I won't be able to do that, since that is a public API and adding parameters is considered a breaking API change.

So, in general, I don't know which criteria to use to decide where to add a readMultiline parameter in order to make it accessible to ModelBuilder. Is there any other place besides ColumnInferenceApi and TrySplitColumns?



@antoniovs1029 (Member, Author)

As discussed here: #5125 (comment). I won't change anything in the AutoML.NET code, and will leave this up to the ModelBuilder team.



@antoniovs1029 (Member, Author) commented May 14, 2020

Currently the CI is failing because apparently I can't add a new parameter to the public methods that create TextLoaders, since that's considered a breaking API change. I will fix this later; in the meantime, please feel free to review. Thanks! #Resolved

Comment on lines 1 to 3
// this file should be loaded with quoting enabled
// and it should load without problems by the TextLoader
id,description,animal
@antoniovs1029 (Member, Author), May 14, 2020

So here are several examples that the TextLoader should be able to load according to the RFC and to how it has worked in the past for ML.NET's purposes. Please, let me know if I should add more cases.

Also, from this thread, I think it's better not to support these:

Escaping \" inside quoted fields. There are 3 main reasons to not support this:

  1. The RFC 4180 doesn't support it. Quotes should be escaped using ""
  2. Excel doesn't really support it consistently. If I have a CSV with 19,"this has an \"scaped quote\" right" the first one will not be escaped but the second one does.
  3. Among the datasets provided by ModelBuilder team some of them have the characters \" but they're not meant to escape the ". E.g., there are quoted fields that end with those characters "blabla\" and so the last " is supposed to be the closing of the field and the \ is supposed to be part of the field. So escaping this would cause errors for those datasets.

Accepting True, "asdf,123 even when quoting is enabled. There are 2 main reasons for not doing this:

  1. The RFC doesn't accept this.
  2. In ML.NET we filter out the spaces at the beginning of the field, so there's no difference between that and True,"asdf,123 which would be an error. So code-wise it would be weird to accept this particular case.

In general, I wouldn't accept these cases unless there were a lot of people asking for them. #Resolved

@justinormont (Contributor), May 15, 2020

Would adding support for alternative escaping methods in the future be difficult? Slash escaping is common enough.

Pandas (generally used by scikit-learn) exposes an escapechar parameter to allow for this: https://github.com/pandas-dev/pandas/blob/7456fc52e1546d263b88ba5772c0324f91ff0804/doc/source/user_guide/io.rst#quoting-and-escape-characters

Databricks exposes an escape parameter for this: https://docs.databricks.com/data/data-sources/read-csv.html

Note that both of these recommend/default to slash escaping instead of double-double-quotes for escaping. The escaping applies to the quote character and the delimitation character. #Resolved

@antoniovs1029 (Member, Author)

As discussed offline, I'll do this in an upcoming PR.



@@ -41,6 +41,8 @@ public static class TextLoaderSaverCatalog
/// A column may also have dense values followed by sparse values represented in this fashion. For example,
/// a row containing "1 2 5 2:6 4:3" represents two dense columns with values 1 and 2, followed by 5 sparsely represented
/// columns with values 0, 0, 6, 0, and 3. The indices of the sparse columns start from 0, even though 0 represents the third column.</param>
/// <param name="readMultilines">If true, new line characters are accepted inside quoted fields, thus allowing a single field to contain
/// multiple lines of text. If allowQuoting is false, this parameter is ignored.</param>
@harishsk (Contributor), May 14, 2020

Are there scenarios where we would want this to be false?
If allowQuoting is true, don't we always want readMultiLines = true?

If that is the desired behavior then we don't need an API change. #Resolved

@antoniovs1029 (Member, Author), May 14, 2020

You're right in that we wouldn't need an API change in that case. I'm hesitant to always set readMultiLines as true if allowQuoting is enabled because of the reasons I've pointed out here:
#5125 (comment)

But yeah, in general it would be more convenient to simply use allowQuoting to decide if we want to read multilines, so that we could skip creating the readMultilines parameter.

So I'm not sure if we should do it or not. Any thoughts, @justinormont @harishsk ? #Resolved

Contributor

How would we expect ModelBuilder to decide the values for allowQuoting and readMultilines? If it always sets them to true, then we are going to run into the bad case you mention in the discussion, where the reader can potentially read the whole file. If it always sets them to false, it won't be able to read a lot of files correctly. Do we expect it to try different permutations until one succeeds?



@antoniovs1029 (Member, Author)

Yeah, it's certainly a dilemma, and I wouldn't know what the ModelBuilder team would prefer, so perhaps @LittleLittleCloud has something to say about this?

My recommendation is that ModelBuilder always uses allowQuoting and readMultilines. readMultilines should always work if the input file is correctly formatted, so if an exception occurs in the case I pointed to, it would be a problem with the dataset, not with the tool. Perhaps adding a readMultiline option to ModelBuilder's GUI could also work, but I don't know if that's possible or convenient for the end user.



Contributor

@LittleLittleCloud @JakeRadMSFT, @justinormont We would greatly appreciate your input on this one.



@justinormont (Contributor), May 15, 2020

It's very quick to try and open the dataset w/ various parameters for the TextLoader. It tries to read the 1st 1000 rows of data; no model training/eval.

I believe we currently try up to 16 combinations; adding this flag will move it to 32. The 16 is currently trivially fast, and should remain so.

One corner case to perf test is the run-away quoted option, where the close-quote is never provided. I expect this won't impact actual users as the run-away is stopped by the next double-quote, and the further lines will be misread causing that option set to not be successful (which is good).

We test the given TextLoader options set in TryParseFile(): (no training/eval)

private static bool TryParseFile(MLContext context, TextLoader.Options options, IMultiStreamSource source,
    out ColumnSplitResult result)
{
    result = null;
    // try to instantiate data view with swept arguments
    try
    {
        var textLoader = context.Data.CreateTextLoader(options, source);
        var idv = context.Data.TakeRows(textLoader.Load(source), 1000);
        var columnCounts = new List<int>();
        var column = idv.Schema["C"];
        using (var cursor = idv.GetRowCursor(new[] { column }))
        {
            var getter = cursor.GetGetter<VBuffer<ReadOnlyMemory<char>>>(column);
            VBuffer<ReadOnlyMemory<char>> line = default;
            while (cursor.MoveNext())
            {
                getter(ref line);
                columnCounts.Add(line.Length);
            }
        }
        var mostCommon = columnCounts.GroupBy(x => x).OrderByDescending(x => x.Count()).First();
        if (mostCommon.Count() < UniformColumnCountThreshold * columnCounts.Count)
        {
            return false;
        }
        // disallow single-column case
        if (mostCommon.Key <= 1) { return false; }
        result = new ColumnSplitResult(true, options.Separators.First(), options.AllowQuoting, options.AllowSparse, mostCommon.Key);
        return true;
    }
    // fail gracefully if unable to instantiate data view with swept arguments
    catch (Exception)
    {
        return false;
    }
}
#Resolved

@LittleLittleCloud (Contributor), May 15, 2020

> [quotes @justinormont's comment above about sweeping the flag and the TryParseFile() snippet]
I agree with Justin; just adding another sweep option should be enough to solve the parsing error.

Besides, I would not make any changes to the AutoML.NET ColumnInference API; to me readMultiline is a trivial and confusing detail that should be concealed from AutoML users. #Resolved

@antoniovs1029 (Member, Author), May 15, 2020

Ok, so after discussing it offline with @LittleLittleCloud, we decided that the ModelBuilder team will take care of changing anything necessary in AutoML.NET to make this work for them, once this PR is merged. 😄

About the ColumnInference API, I thought it was important for you since the original issue behind all of this (#4460) explicitly talks about problems using ColumnInference() with new line characters inside quoted fields. But if it's not important, then only changing TryParseFile() might be enough and straightforward. In any case, I'll leave this discussion and fix up to the ModelBuilder team. Thanks. #Resolved

@antoniovs1029 (Member, Author), May 15, 2020

BTW, just out of curiosity, does ModelBuilder use the ColumnInference() method, @LittleLittleCloud? #Resolved

@LittleLittleCloud (Contributor), May 18, 2020

> BTW, just out of curiosity, does ModelBuilder use the ColumnInference() method?

Yes, we use it in AutoMLService. #Resolved


sb.Clear();
sb.Append(line);
while (numOfQuotes % 2 != 0)
@justinormont (Contributor), May 19, 2020

How would you handle the case

Label,Size,Age
true,5.4",5

I think we need to check that a quoted field begins with the quote character (vs. merely containing one). If the field isn't quoted, it can include non-escaped quote chars. If it's quoted, all quotes within the field must be escaped. #Resolved

@antoniovs1029 (Member, Author), May 19, 2020

According to the RFC your row isn't valid, as " should always be either escaped or used to indicate a quoted field. So I assumed that datasets should comply with the RFC when using ReadMultiline, to make the MultilineReader code less complicated (and faster).

Should we actually accept this even when ReadMultiline is true? #Resolved

@antoniovs1029 (Member, Author), May 19, 2020

Mmm, ok. After thinking about this a bit: I will have to change the logic in this method anyway for the escapeChar feature, so I might as well allow some subset of cases similar to the one you pointed out (i.e., quotes inside an unquoted field not being counted as invalid quotes). I think it's worth it, to avoid users hitting the EOF exception and/or loading much of the file into memory, which are things that would happen with invalid CSVs such as this one if I don't take special care of this kind of case.


In reply to: 427003836 [](ancestors = 427003836)

@antoniovs1029 (Member, Author)

I've done this. Let us follow this thread here:
#5125 (comment)


In reply to: 427010189 [](ancestors = 427010189,427003836)

Comment on lines +572 to +584
if (startsInsideQuoted || line[ichCur] == '"')
{
    // Quoted Field Case

    if (!startsInsideQuoted)
        ichCur++;

    for (; ; ichCur++)
    {
        if (ichCur >= ichLim)
            // We've reached the end of the line without finding the closing quote,
            // so next line will start on this quoted field
            return true;
@antoniovs1029 (Member, Author), May 19, 2020

I've rewritten the logic in Multilinereader, in order to be more flexible in what we accept (following this comment: #5125 (comment) )

In principle, it's the same as in previous iterations of the PR: simply read every character in the read line, and decide if we should read more lines as part of the current row. But now this code mimics the one found in TextLoader.Parser.HelperImpl.FetchNextField() to actually make a distinction between quoted and non-quoted fields.

So now the behavior has changed for the following scenarios, which are now accepted by the Multilinereader without much problem (although the parser might throw or log an error when processing them). Notice that these cases are not accepted by the RFC and can be considered invalid formats, so we have some liberty in how we handle them, and I think the way I've handled them will cause the least negative impact for users (a simplified sketch of the per-line decision is included after the examples below):

  1. We'll accept " inside unquoted fields, and treat those quotes as if they were any other character. This case used to make the multilinereader think it was opening a quoted field, and potentially read the rest of the file into this line. Now it knows it's not opening a quoted field.
Label,Size,Age
true,5.4",5
  2. We'll accept characters between a closing quote in a quoted field and the next separator. This case is always going to cause a QuotingError on the Parser when parsing the row, which will make the parser return an empty row instead (notice that quoting errors don't cause exceptions). Because of this, I set quotingError to true here in the Multilinereader, which means that the rest of the row after the invalid closing quote will be ignored:
Label,Size,Age
true,"5."4,5 // this will be read as a single line, and then the parser will return it as empty
false,"2.3",8 // this will be read correctly
  3. A positive side effect of point 2, including the quotingError mechanism, is that now it's much harder for the multilinereader to read a complete file into one row when there's a quoting error. Now if there's a quoted field without a closing quote, it will read until the next quote, and will consider the rest of that line and all the lines read so far as a single row (which will produce a QuotingError on the parser). The following lines will remain unaffected. The previous behavior for this particular case would have been to load the rest of the file as part of the first row:
// the first 3 rows will be read as a single row, and later the parser will have a quoting error there
// after that all rows will be read correctly.
Label,Size,Age
true,"5.4,5 
true,2,4
false,"2.3",8
true,"3.5",67
...
// Same behavior as above:
Label,Size,Age
true,"5.4,5 
true,2,4
false,"2.3,8
true,"3.5",67
...
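To make the per-line decision concrete, here is a minimal, simplified sketch (a hypothetical helper, not the PR's actual code; it assumes ',' as the separator and ignores whitespace trimming) of how one can track whether a line ends inside an open quoted field, which is what tells the reader to pull the next physical line into the same row:

```csharp
// Returns true if, after scanning this physical line, we are still inside an
// open quoted field, i.e. the next line should be appended to the current row.
static bool EndsInsideQuotedField(string line, bool startsInsideQuoted)
{
    bool insideQuoted = startsInsideQuoted;
    for (int i = 0; i < line.Length; i++)
    {
        if (!insideQuoted)
        {
            // Only a quote at the start of a field opens a quoted field;
            // a stray '"' in the middle of an unquoted field is plain text.
            if (line[i] == '"' && (i == 0 || line[i - 1] == ','))
                insideQuoted = true;
        }
        else if (line[i] == '"')
        {
            if (i + 1 < line.Length && line[i + 1] == '"')
                i++;                  // "" is an escaped quote; stay inside the field
            else
                insideQuoted = false; // a lone quote closes the quoted field
        }
    }
    return insideQuoted;
}
```

For instance, this returns false for true,5.4",5 (case 1 above) and true for true,"5.4,5 (case 3), matching the behaviors described.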

@harishsk (Contributor) left a comment

:shipit:

@antoniovs1029 (Member, Author) commented May 19, 2020

For the record,

I ran all existing tests with readMultilines set to true, and everything worked as expected; I didn't notice a difference in running time. I also tested the final commit on this PR with some of ModelBuilder's datasets, and everything worked; again, I didn't notice a difference in running time with readMultilines set to true or to false.

I also ran the FeaturizeTextBench benchmark test (although I modified it to randomly decide whether a field will be quoted or not, and never added new line characters inside the quoted fields), but I'm afraid I'm not sure how to interpret the results, which are as follows:

Without this PR:

|             Method |    Mean |   Error |  StdDev | Extra Metric |
|------------------- |--------:|--------:|--------:|-------------:|
| TrainFeaturizeText | 20.65 s | 23.31 s | 1.278 s |            - |

With this PR, with readMultilines = false

|             Method |    Mean |   Error |  StdDev | Extra Metric |
|------------------- |--------:|--------:|--------:|-------------:|
| TrainFeaturizeText | 18.79 s | 7.632 s | 0.418 s |            - |

With this PR, with readMultilines = true

|             Method |    Mean |   Error |  StdDev | Extra Metric |
|------------------- |--------:|--------:|--------:|-------------:|
| TrainFeaturizeText | 27.89 s | 13.39 s | 0.734 s |            - |

Where the columns are said to be:

// * Legends *
  Mean         : Arithmetic mean of all measurements
  Error        : Half of 99.9% confidence interval
  StdDev       : Standard deviation of all measurements
  Extra Metric : Value of the provided extra metric
  1 s          : 1 Second (1 sec)

As can be noticed, the mean time went up by about 50% with readMultiline enabled, but the Error also roughly doubled (which, to be honest, I'm not sure what "Error" actually is). Since I certainly didn't notice a change in training time while testing manually with ModelBuilder's datasets, I don't think that kind of slowdown is expected in general. (I ran a full training pipeline on each of the datasets, and although I didn't use a stopwatch, I did take note of the time reported by the debugger when hitting breakpoints, and it was essentially the same.)

Also, strangely, times were better with this PR and readMultiline = false than without the PR at all (which doesn't make much sense, since this PR adds extra checks that should have a small negative effect on performance)... and the Error in the case without this PR is very high... so I'm not entirely sure this benchmark is accurately measuring what needs to be measured.
