Add line limit to readMultilines in TextLoader #5144

Open
antoniovs1029 opened this issue May 19, 2020 · 0 comments
Labels
enhancement: New feature or request
loadsave: Bugs related loading and saving data or models
P2: Priority of the issue for triage purpose: Needs to be fixed at some point.
usability: Smoothing user interaction or experience

Comments

@antoniovs1029
Member

(This issue tracks @justinormont's suggestion in #5125.)

Recent PR #5125 added a readMultilines option to TextLoader, which allows newlines inside quoted fields.

A problem with this is that if the input file isn't correctly formatted (i.e., if a quote opens a quoted field that is never closed), the multiline reader will keep loading lines until it finds another quote. Depending on the dataset (and on how many incorrectly formatted rows it has), it could load the whole dataset into memory (or as much as the StringBuilder supports, which is typically about 2^31 chars, since its default maximum capacity is Int32.MaxValue).

For example:

id,description,animal
0,"this quoted field isnt closed,cat
1,this field doesnt include quotes,dog
... // no quoted fields in here
2555,"it is until this quoted field that the multilinereader actually stops reading row 0",bird
2556,"this row will be read correctly",dog
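
To make the failure mode concrete, here is a minimal Python sketch of a multiline reader. It is not ML.NET's actual reader: the names `ends_inside_quotes` and `read_rows` are hypothetical, and the quoting rule (a quote opens a field only at the start of a field, and closes it only when followed by a separator or the end of the line) is an assumption chosen because it reproduces the behavior described above.

```python
def ends_inside_quotes(line, in_quotes, sep=","):
    """Return True if a quoted field is still open once `line` has been scanned."""
    i, n = 0, len(line)
    at_field_start = not in_quotes
    while i < n:
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and line[i + 1] == '"':
                    i += 2  # doubled quote: an escaped quote inside the field
                    continue
                if i + 1 == n or line[i + 1] == sep:
                    in_quotes = False  # closing quote ends the field
        else:
            if ch == '"' and at_field_start:
                in_quotes = True  # opening quote at the start of a field
            at_field_start = (ch == sep)
        i += 1
    return in_quotes


def read_rows(lines):
    """Join physical lines into logical rows while a quoted field stays open."""
    rows, buffer, in_quotes = [], [], False
    for line in lines:
        buffer.append(line)
        in_quotes = ends_inside_quotes(line, in_quotes)
        if not in_quotes:
            rows.append("\n".join(buffer))
            buffer = []
    if buffer:  # input ended while a quoted field was still open
        rows.append("\n".join(buffer))
    return rows


# The example from this issue, with filler rows standing in for the "..." line.
sample = [
    'id,description,animal',
    '0,"this quoted field isnt closed,cat',
    '1,this field doesnt include quotes,dog',
]
sample += [f"{i},filler row without quotes,fish" for i in range(2, 2555)]
sample += [
    '2555,"it is until this quoted field that the reader stops reading row 0",bird',
    '2556,"this row will be read correctly",dog',
]

rows = read_rows(sample)
print(len(rows))                  # header, one giant merged row, row 2556
print(rows[1].count("\n") + 1)    # physical lines swallowed into "row 0"
```

Under these assumptions, everything from row 0 through row 2555 ends up buffered as a single logical row, which is the memory problem this issue is about.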

@justinormont's suggestion (#5125 (comment)) is to add another option to TextLoader that lets the user set the maximum length of a row; if that threshold is exceeded, the reader simply ignores the line and continues reading the input file without loading everything into memory.
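
One possible reading of that suggestion, sketched in Python rather than as a TextLoader change: the names `read_rows_limited` and `max_row_chars` are hypothetical, the quoting rule in `ends_inside_quotes` is an assumption, and "ignore the line" is interpreted here as discarding the buffered row and resuming with the next physical line.

```python
def ends_inside_quotes(line, in_quotes, sep=","):
    """Return True if a quoted field is still open once `line` has been scanned."""
    i, n = 0, len(line)
    at_field_start = not in_quotes
    while i < n:
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and line[i + 1] == '"':
                    i += 2  # doubled quote: an escaped quote inside the field
                    continue
                if i + 1 == n or line[i + 1] == sep:
                    in_quotes = False  # closing quote ends the field
        else:
            if ch == '"' and at_field_start:
                in_quotes = True  # opening quote at the start of a field
            at_field_start = (ch == sep)
        i += 1
    return in_quotes


def read_rows_limited(lines, max_row_chars):
    """Like a multiline reader, but drop any logical row longer than the limit."""
    rows, buffer, size, in_quotes, dropped = [], [], 0, False, 0
    for line in lines:
        buffer.append(line)
        size += len(line)
        in_quotes = ends_inside_quotes(line, in_quotes)
        if size > max_row_chars:
            # Threshold exceeded: discard the buffered row, reset the quote
            # state, and resume with the next physical line.
            buffer, size, in_quotes = [], 0, False
            dropped += 1
        elif not in_quotes:
            rows.append("\n".join(buffer))
            buffer, size = [], 0
    if buffer:  # input ended while a quoted field was still open
        rows.append("\n".join(buffer))
    return rows, dropped


sample = [
    'id,description,animal',
    '0,"unclosed field,cat',
    '1,no quotes here,dog',
    '2,still no quotes,fish',
    '3,"closed field",bird',
]

rows, dropped = read_rows_limited(sample, max_row_chars=30)
print(rows)     # header plus rows 2 and 3
print(dropped)  # number of oversized logical rows that were discarded
```

In this sketch the rows buffered before the threshold is hit (rows 0 and 1 here) are lost along with the malformed one, so the limit bounds memory use at the cost of dropping a few valid rows; a real implementation might choose a different recovery strategy.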

I think that before introducing more options to the TextLoader, it's better to see if users actually hit this problem when using readMultilines.

@antoniovs1029 added the enhancement, usability, P2, and loadsave labels on May 19, 2020