Description
(This issue tracks @justinormont's suggestion here)
Recent PR #5125 added a `readMultilines` option to `TextLoader` to allow newlines inside quoted fields.
A problem with this is that if the input file isn't correctly formatted (i.e., if it has a quote that opens a quoted field but is never closed), the multiline reader will keep loading lines until it finds another quote. Depending on the dataset (and on how many incorrectly formatted rows it has), it could load the whole dataset into memory (or as much as the `StringBuilder` supports, which is at most `Int32.MaxValue`, i.e. 2^31 - 1, chars).
For example:
id,description,animal
0,"this quoted field isnt closed,cat
1,this field doesnt include quotes,dog
... // no quoted fields in here
2555,"it is until this quoted field that the multilinereader actually stops reading row 0",bird
2556,"this row will be read correctly",dog
@justinormont's suggestion here: #5125 (comment) is to add another option to `TextLoader` that lets the user set the maximum length of a row. If that threshold is exceeded, the loader would simply ignore the line and continue reading the input file without loading everything into memory.
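To make the proposal concrete, here is a minimal sketch of the idea in Python. It is not ML.NET code and not the actual multiline reader: `read_rows` and `max_row_chars` are hypothetical names, and the quote handling is simplified to counting `"` characters (a row is considered complete when the count is even), so the exact point where a runaway row ends may differ from what the real reader does. The assumption made here is that "ignore the line" means dropping the physical line that opened the unclosed quote and resuming at the next physical line.

```python
from collections import deque
from typing import Iterable, Iterator, Optional


def read_rows(lines: Iterable[str], max_row_chars: Optional[int] = None) -> Iterator[str]:
    """Yield logical rows, joining physical lines while a quoted field is open.

    Hypothetical sketch: if max_row_chars is set and an unfinished row grows
    beyond it, the physical line that opened the row is dropped and reading
    resumes at the next physical line.
    """
    source = iter(lines)
    replay = deque()  # lines to re-read after a drop; bounded by the cap

    while True:
        row = []        # physical lines of the current logical row
        length = 0      # characters accumulated so far
        quotes = 0      # number of '"' seen so far
        overflow = False

        while True:
            line = replay.popleft() if replay else next(source, None)
            if line is None:
                break  # end of input
            row.append(line)
            length += len(line)
            quotes += line.count('"')
            if quotes % 2 == 0:
                break  # quotes are balanced, the row is complete
            if max_row_chars is not None and length > max_row_chars:
                overflow = True
                break

        if not row:
            return
        if overflow:
            # Drop the line that opened the quote, re-read the rest in order.
            replay.extendleft(reversed(row[1:]))
            continue
        # Note: an unterminated row that reaches end of input is still emitted.
        yield "\n".join(row)


SAMPLE = [
    'id,description,animal',
    '0,"this quoted field isnt closed,cat',
    '1,this field doesnt include quotes,dog',
    '2,still no quotes,fish',
    '3,"this row is quoted correctly",bird',
]
```

Without a cap, `list(read_rows(SAMPLE))` returns the header plus one row that swallows everything from row 0 to the end of the input. With `max_row_chars=50`, row 0 is dropped and rows 1 to 3 are read normally, while the buffer never holds much more than the cap.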
I think that before introducing more options to `TextLoader`, it's better to see if users actually hit this problem when using `readMultilines`.