-
Notifications
You must be signed in to change notification settings - Fork 1.9k
fix TextLoader bug when there's newline between quotes #4584
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix TextLoader bug when there's newline between quotes #4584
Conversation
|
||
public static string ReadEntry(this TextReader sr) | ||
{ | ||
string strReturn = string.Empty; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string entry
// And get more lines until the number of quotes is even | ||
while (strReturn.GetNumberOf("\"").IsOdd()) | ||
{ | ||
string strNow = sr.ReadLine(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string line
} | ||
} | ||
|
||
public static int GetNumberOf(this string s, string strSearchString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just leave this as a static method instead of an extension method. It's easier to read that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would agree, especially with completion for unimported extension methods coming from Roslyn. Specialized extension methods for common types will be uncomfortably noisy during development.
|
||
public static int GetNumberOf(this string s, string strSearchString) | ||
{ | ||
return s.Length - s.Replace(strSearchString, string.Empty).Length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting approach :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually there's a bug.... It should divided by the length of strSearchString
to get the correct result. also if I do that, it won't work on string.Empty (div by 0)... so hard to write the correct code!
entry += sr.ReadLine(); | ||
|
||
// And get more lines until the number of quotes is even | ||
while (GetNumberOf(entry, "\"") % 2 != 0 ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to first check that that quoting is enabled, and need to check for escaped quotes. May need a flag/field to define the escaping type, defaulting to allow either none or both.
Some tests:
True,"asdf,123
is perfectly ok without quoting enabledTrue,"12\" in a foot",123
has an escaped quote w/ slash escapingTrue,"12"" in a foot",123
has an escaped quote w/ double-double quotes (Excel style)True, "asdf,123
is ok, if following Excel style as it ignores quoted fields with a space before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also need to not append to lines at the start of the file which begin with //
or #
:
if (entry[0] == '#' || entry[0] == '/' && entry[1] == '/') { ... }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So try to repeat the logic here
read line
if 'line' starts with '#' or starts with '\' => return 'line' # comments
if quote is disabled: => use the existing logic in TextLoader
else
if ( 'line' has only odd# real quote ) => read next 'line' until sum of real quote is even and return 'line' # real quote means "
and it's not one of [ ""
, \"
"
]
else return 'line'
And BTW it seems the existing enable quote logic doesn't check if it's real quote or not. ( for example in lines: "blablabla""blablabla" will cause an error in FetchNextField
function, but in Excel it's legal) Do we have a plan for adding support for that?
|
||
public static int GetNumberOf(string s, string strSearchString) | ||
{ | ||
if(strSearchString.Length == 0 || s.Length == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if(strSearchString.Length == 0 || s.Length == 0) | |
if (strSearchString.Length == 0 || s.Length == 0) |
{ | ||
return 0; | ||
} | ||
return (s.Length - s.Replace(strSearchString, string.Empty).Length) / strSearchString.Length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String.Replace()
is not efficient as it reallocates.
@@ -514,7 +514,7 @@ private void ThreadProc() | |||
if (_abort) | |||
return; | |||
|
|||
text = rdr.ReadLine(); | |||
text = rdr.ReadEntry(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should check if moving to multi-line reading will interfere with multi-threaded reading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to check that, could you expand a bit?
@@ -487,7 +487,7 @@ private void ThreadProc() | |||
// REVIEW: Avoid allocating a string for every line. This would probably require | |||
// introducing a CharSpan type (similar to ReadOnlyMemory but based on char[] or StringBuilder) | |||
// and implementing all the necessary conversion functionality on it. See task 3871. | |||
text = rdr.ReadLine(); | |||
text = rdr.ReadEntry(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend a flag for multi-line support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it can share the same flag with EnableQuote, If we allow quote, then multi-line support is necessary to make sure TextLoader function well
while (GetNumberOf(entry, "\"") % 2 != 0 ) | ||
{ | ||
string line = sr.ReadLine(); | ||
entry += line; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likely want to begin a string builder instead of a +=
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll want to push thru a newline into the string builder. ReadLine()
doesn't include the \n
or \r
or \r\n
[ref].
Otherwise:
a,b,"c my missing
space",e,f
Would be read as:
a,b,"c my missingspace",e,f
Instead of:
a,b,"c my missing\nspace",e,f
It likely is not important, but it may be better to match the \n
or \r
or \r\n
of the last line read. This would only be important if the user read the file w/ TextLoader
but fed the model with an IEnumerable.
A quick workaround for this Issue
#4460
perhaps also this one
#4555