Add line limit to readMultilines in TextLoader #5144

Open
@antoniovs1029

(This issue tracks @justinormont's suggestion here)

Recent PR #5125 added a readMultilines option to TextLoader that makes it possible to include newlines inside quoted fields.

A problem with this is that if the input file isn't correctly formatted (i.e., if a quote opens a quoted field that is never closed), the Multilinereader will keep loading lines until it finds another quote. Depending on the dataset (and on how many incorrectly formatted rows it has), it could load the whole dataset into memory (or as much as the StringBuilder supports, which is typically about 2^31 chars, since its maximum capacity is Int32.MaxValue).

For example:

id,description,animal
0,"this quoted field isnt closed,cat
1,this field doesnt include quotes,dog
... // no quoted fields in here
2555,"it is until this quoted field that the multilinereader actually stops reading row 0",bird
2556,"this row will be read correctly",dog
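The same failure mode can be reproduced outside ML.NET. The sketch below uses Python's standard `csv` module purely as an analogy; it is not the TextLoader implementation, and the shortened sample data is made up for illustration. With the default dialect, the unclosed quote on row 0 should cause the reader to swallow the following lines into a single field until the next quote appears.

```python
import csv
import io

# Shortened, illustrative version of the malformed file above.
text = (
    'id,description,animal\n'
    '0,"this quoted field isnt closed,cat\n'
    '1,this field doesnt include quotes,dog\n'
    '2,neither does this one,fish\n'
    '3,"this quote finally closes row 0",bird\n'
    '4,"this row will be read correctly",dog\n'
)

# csv.reader keeps consuming physical lines while a quoted field is open.
rows = list(csv.reader(io.StringIO(text)))

for row in rows:
    # Show the id, the number of fields, and how many newlines ended up
    # embedded inside fields of this logical row.
    embedded_newlines = sum(field.count("\n") for field in row)
    print(row[0], len(row), embedded_newlines)
```

Rows 1 and 2 never show up as separate records here: they end up inside the description field of row 0, which is the behavior the issue describes, only on a much smaller scale.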

@justinormont's suggestion (here: #5125 (comment)) is to add another option to the TextLoader that lets the user set the maximum length of a row. If that threshold is exceeded, the loader would simply ignore the line and continue reading the input file without loading everything into memory.
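One way such a limit could work is sketched below in Python. This is only an illustration under stated assumptions, not the ML.NET code: `scan_line`, `read_rows`, and `max_row_chars` are hypothetical names, quotes are assumed to open a field only at the start of a field, and the input is a list of lines for simplicity (a streaming reader would need a bounded look-ahead buffer instead).

```python
def scan_line(line, in_quoted, sep=","):
    """Return True if a quoted field is still open at the end of `line`."""
    at_field_start = not in_quoted
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quoted:
            if ch == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    i += 1  # doubled quote: escaped, field stays open
                else:
                    in_quoted = False  # closing quote
        elif ch == sep:
            at_field_start = True
            i += 1
            continue
        elif ch == '"' and at_field_start:
            in_quoted = True  # a quote only opens a field at field start
        at_field_start = False
        i += 1
    return in_quoted


def read_rows(lines, max_row_chars):
    """Group physical lines into logical rows, skipping runaway rows."""
    rows, skipped = [], []
    i = 0
    while i < len(lines):
        buffer, size, open_quote, j = [], 0, False, i
        while j < len(lines):
            buffer.append(lines[j])
            size += len(lines[j])
            open_quote = scan_line(lines[j], open_quote)
            j += 1
            if not open_quote or size > max_row_chars:
                break
        if open_quote:
            # Limit exceeded (or end of input) with a quote still open:
            # drop only the offending line and resume right after it.
            skipped.append(lines[i])
            i += 1
        else:
            rows.append("\n".join(buffer))
            i = j
    return rows, skipped


sample = [
    'id,description,animal',
    '0,"this quoted field isnt closed,cat',
    '1,this field doesnt include quotes,dog',
    '2,neither does this one,fish',
    '3,"this quote finally closes row 0",bird',
    '4,"this row will be read correctly",dog',
]
rows, skipped = read_rows(sample, max_row_chars=80)
unlimited_rows, _ = read_rows(sample, max_row_chars=10**9)
```

With the small limit, only the malformed line is dropped and the rows after it are read normally; with an effectively unlimited threshold, the rows after the bad quote are merged into row 0. The cost of this design is that the lines buffered before the limit was hit have to be rescanned.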

I think that before introducing more options to the TextLoader, it's better to see if users actually hit this problem when using readMultilines.

Labels: P2, enhancement, loadsave, usability
