-
Notifications
You must be signed in to change notification settings - Fork 1.9k
support sweeping multiline option in AutoML #5148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
c073e6b
8f0fc1a
e96d716
8c17bbe
b2947f5
ff1c909
ecaaaf0
e7cb2b5
2c04d09
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -23,14 +23,16 @@ public class ColumnSplitResult | |||||
|
||||||
public bool AllowQuote { get; set; } | ||||||
public bool AllowSparse { get; set; } | ||||||
public bool ReadMultilines { get; set; } | ||||||
|
||||||
public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool allowSparse, int columnCount) | ||||||
public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool readMultilines, bool allowSparse, int columnCount) | ||||||
{ | ||||||
IsSuccess = isSuccess; | ||||||
Separator = separator; | ||||||
AllowQuote = allowQuote; | ||||||
AllowSparse = allowSparse; | ||||||
ColumnCount = columnCount; | ||||||
ReadMultilines = readMultilines; | ||||||
} | ||||||
} | ||||||
|
||||||
|
@@ -50,12 +52,14 @@ public static ColumnSplitResult TrySplitColumns(MLContext context, IMultiStreamS | |||||
{ | ||||||
var sparse = new[] { false, true }; | ||||||
var quote = new[] { true, false }; | ||||||
var tryMultiline = new[] { false, true }; | ||||||
var foundAny = false; | ||||||
var result = default(ColumnSplitResult); | ||||||
foreach (var perm in (from _allowSparse in sparse | ||||||
from _allowQuote in quote | ||||||
from _sep in separatorCandidates | ||||||
select new { _allowSparse, _allowQuote, _sep })) | ||||||
from _tryMultiline in tryMultiline | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. naming pattern would be:
Suggested change
|
||||||
select new { _allowSparse, _allowQuote, _sep, _tryMultiline })) | ||||||
{ | ||||||
var options = new TextLoader.Options | ||||||
{ | ||||||
|
@@ -66,7 +70,8 @@ from _sep in separatorCandidates | |||||
} }, | ||||||
Separators = new[] { perm._sep }, | ||||||
AllowQuoting = perm._allowQuote, | ||||||
AllowSparse = perm._allowSparse | ||||||
AllowSparse = perm._allowSparse, | ||||||
ReadMultilines = perm._tryMultiline, | ||||||
}; | ||||||
|
||||||
if (TryParseFile(context, options, source, out result)) | ||||||
|
@@ -75,7 +80,7 @@ from _sep in separatorCandidates | |||||
break; | ||||||
} | ||||||
} | ||||||
return foundAny ? result : new ColumnSplitResult(false, null, true, true, 0); | ||||||
return foundAny ? result : new ColumnSplitResult(false, null, true, true, true, 0); | ||||||
} | ||||||
|
||||||
private static bool TryParseFile(MLContext context, TextLoader.Options options, IMultiStreamSource source, | ||||||
|
@@ -111,7 +116,7 @@ private static bool TryParseFile(MLContext context, TextLoader.Options options, | |||||
// disallow single-column case | ||||||
if (mostCommon.Key <= 1) { return false; } | ||||||
|
||||||
result = new ColumnSplitResult(true, options.Separators.First(), options.AllowQuoting, options.AllowSparse, mostCommon.Key); | ||||||
result = new ColumnSplitResult(true, options.Separators.First(), options.AllowQuoting, options.ReadMultilines, options.AllowSparse, mostCommon.Key); | ||||||
return true; | ||||||
} | ||||||
// fail gracefully if unable to instantiate data view with swept arguments | ||||||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,36 @@ | ||||||
id,Column1,Column2,Column3 | ||||||
1,this is a description, 1,2 | ||||||
2,"this is a quote description",1,2 | ||||||
Comment on lines
+1
to
+3
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (pinned to unrelated code to allow threaded discussion) @harishsk and I were under the impression that in order for ModelBuilder to support the new readMultilines option, then some changes might be needed on the CodeGenerator, as now the code generated by ModelBuilder should also include Are these changes not necessary? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes and no... CodeGen project should only refer to ML.Net packages that are already published. That's why we want to pin CodeGen to a specific version of ML.Net to avoid any API changes breaking CodeGen. So from this aspect we need to upgrade CodeGen. Meanwhile we can't change CodeGen before the release. Because if we update CodeGen, we need to refer to a nightly build ML.Net for generating project, which is not available externally. My suggestion is to not update CodeGen this time, becausse it won't affect a lot. CodeGenerated project uses LoadFromTextFile API with user-defined ModelInput Class, which should work even if there's newline between quotes if I recall correctly. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see, then it's ok.
Using the generic LoadFromTextFile API you mention will still use the default value for readMultilines, which is set to false. It would be needed to actually set it explicitly to true if it was necessary. Notice that up until recently LoadFromTextFile only accepted some parameters on its signature, and I couldn't add the new readMultilines parameter there because that would be a breaking API change (which we are not allowed to do until ML.NET version 2). So instead, what I did was to add a new LoadFromTextFile(path, options) overload, where it accepts a TextLoader.Options object (see #5134). So if users wanted to use the generic LoadFromTextFile<> and also use the new options (such as readMultilines, or the ones being introduced in #5147 or #5145) then they will have to use the new public overload. In this sense, I think that CodeGenerator should also start using this other overload to have access to new options. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Unless I'm missing part of it, if the CLI / Model Builder calls this new version of AutoML, you will have to update CodeGen. Otherwise the multi-line files will be read by the AutoML code, but the CodeGen'd code will not read the user's file.
All of When I import
What's stopping CodeGen from being updated before the release? Why would a change to CodeGen cause CLI/ModelBuilder to depend on a nightly build instead of the released version? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It's not CLI/ModelBuilder depend on nightly build, in short, it's the project generated from CLI/ModelBuilder needs to depend on a released build of ML.Net
I don't see why it's related, if you want to call CodeGen, you need to call CodeGen that pin to In fact, if we can make sure before each release, CodeGen will be the final package to update, and during two releases, we don't make any changes to CodeGen, then consistent release works. But if during two releases there's any breaking change in API, then we must and mustn't update CodeGenerator because of the reason I mention above, either change or not change CodeGen will make it in a break state. One solution to solve this dilemma is to pin CodeGen to a released version, but that will break consistently release and we need to publish twice for each release cycle. One for ML.Net packages and one for CodeGen. Another solution is to make sure ModelBuilder pins to a released version of AutoML/CodeGen, but that means between two releases we can't fix bugs resulted by CodeGen and that's again no guarantee. Meanwhile, we have the same dilemma for AutoML too, but unlike codegen, we can use internal build for AutoML, and most of it's bug is not breaking bug, so that's not big deal. A few more words, sorry to be verbose here. |
||||||
3,"this is a quote description with double quote("")",1,2 | ||||||
4,"this is a quote description with ""a pair of double quote""",1,2 | ||||||
5,"this is a quote description with new line | ||||||
quote",1,2 | ||||||
6,"this is a quote description with | ||||||
new line1 and | ||||||
new line2 and empty line | ||||||
|
||||||
and double quote""",1,2 | ||||||
7, this is a description with single quote("),1,2 | ||||||
// empty line between quotes | ||||||
8,"",1,2 | ||||||
// single quote between quotes | ||||||
9,"""",1,2 | ||||||
// simply newline between quotes | ||||||
10," | ||||||
|
||||||
|
||||||
|
||||||
",1,2 | ||||||
// simply signle quote and newline between quotes | ||||||
11," | ||||||
|
||||||
"""" | ||||||
|
||||||
"" | ||||||
|
||||||
"" | ||||||
|
||||||
",1,2 | ||||||
|
||||||
|
||||||
|
||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is a variety of other places the multi-line needs to be added. Basically anywhere "sparse" is found within AutoML/CodeGen:
(pinned to line of code to allow for discussion) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ResolvedThere was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There's still some missing, like: machinelearning/src/Microsoft.ML.AutoML/ColumnInference/ColumnInferenceApi.cs Lines 31 to 32 in c025270
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I prefer not exposing readMultiline option in these APIs, Simply sweeping it in Infersplit should be enough |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following naming convention would be:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TextLoader Option uses ReadMultilines (So is allowQuote and allowSparse), and we'd better follow their naming style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved