Add line limit to readMultilines in TextLoader #5144
Labels
- enhancement: New feature or request
- loadsave: Bugs related to loading and saving data or models
- P2: Priority of the issue for triage purposes; needs to be fixed at some point
- usability: Smoothing user interaction or experience
(This issue tracks @justinormont's suggestion in #5125.)
Recent PR #5125 added a `readMultilines` option to `TextLoader` to enable the possibility of including newlines inside quoted fields.

A problem with this is that if the input file isn't correctly formatted (i.e., if it has a quote that opens a quoted field and is never closed), then the multiline reader will keep loading lines until it finds another quote. Depending on the dataset (and on how many incorrectly formatted rows it has), it could load the whole dataset into memory (or as much as the `StringBuilder` supports, which is typically about 2^31 chars). For example:
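A minimal sketch of this failure mode, written in Python rather than against the actual ML.NET reader. The hypothetical `read_records` function keeps appending physical lines until the quote characters balance, so a single unclosed quote makes it fold the rest of the input into one record. The function name, the sample rows, and the quote-parity rule are all assumptions made for illustration.

```python
def read_records(lines):
    """Naive multiline reader (hypothetical): a record ends only when quotes balance."""
    records, buffer = [], []
    in_quotes = False
    for line in lines:
        buffer.append(line)
        # An odd number of quote characters flips the "inside a quoted field" state.
        # Escaped quotes ("") come in pairs, so they do not change the parity.
        if line.count('"') % 2 == 1:
            in_quotes = not in_quotes
        if not in_quotes:
            records.append("\n".join(buffer))
            buffer = []
    if buffer:
        # End of input reached while still inside a quoted field.
        records.append("\n".join(buffer))
    return records


# A well-formed file: the first record spans two physical lines.
well_formed = ['1,"two', 'lines",x', '2,"b",y']

# A malformed file: the quote opened on the first line is never closed.
broken = ['1,"never closed,x', '2,b,y', '3,c,z', '4,d,w']

print(len(read_records(well_formed)))  # two records
print(len(read_records(broken)))       # one record holding all four lines
```

With the malformed input, every remaining line ends up in the same buffer, which is the unbounded growth described above.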
@justinormont's suggestion in #5125 (comment) is to add another option to the `TextLoader` that lets the user set the maximum length of a row; if that threshold is exceeded, the loader simply ignores the line and continues reading the input file without loading the rest of it into memory.

I think that before introducing more options to the `TextLoader`, it's better to see if users actually hit this problem when using `readMultilines`.
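One way the suggested limit could behave is sketched below in Python. This is not the ML.NET implementation: `read_records_capped` and its `max_lines_per_record` parameter are hypothetical names, and measuring the limit in physical lines (rather than characters) is an assumption taken from the issue title. When a record does not close within the limit, the sketch skips only the line that opened it and resumes parsing from the next line.

```python
def read_records_capped(lines, max_lines_per_record):
    """Hypothetical capped multiline reader: returns (records, skipped_lines)."""
    # The input is materialized as a list for brevity; a streaming version
    # would only need a buffer of at most max_lines_per_record lines.
    lines = list(lines)
    records, skipped = [], []
    i = 0
    while i < len(lines):
        in_quotes = False
        closed = False
        # Look ahead at most max_lines_per_record lines for the end of the record.
        for j in range(i, min(i + max_lines_per_record, len(lines))):
            if lines[j].count('"') % 2 == 1:
                in_quotes = not in_quotes
            if not in_quotes:
                records.append("\n".join(lines[i:j + 1]))
                i = j + 1
                closed = True
                break
        if not closed:
            # The record starting at line i never closed within the limit:
            # ignore that one line and continue with the next.
            skipped.append(lines[i])
            i += 1
    return records, skipped


# The quote opened on the first line is never closed.
broken_rows = ['1,"never closed,x', '2,b,y', '3,c,z', '4,d,w']
records, skipped = read_records_capped(broken_rows, max_lines_per_record=2)
print(records)
print(skipped)
```

Skipping only the offending line keeps the well-formed rows that follow it. Discarding the whole buffered span would be simpler, but it would also drop up to `max_lines_per_record - 1` valid rows per malformed one.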