support sweeping multiline option in AutoML #5148

Merged: 9 commits, May 21, 2020
3 changes: 3 additions & 0 deletions src/Microsoft.ML.AutoML/ColumnInference/ColumnInferenceApi.cs
@@ -55,6 +55,7 @@ public static ColumnInferenceResults InferColumns(MLContext context, string path
Separators = new[] { splitInference.Separator.Value },
AllowSparse = splitInference.AllowSparse,
AllowQuoting = splitInference.AllowQuote,
ReadMultilines = splitInference.ReadMultilines,
HasHeader = hasHeader,
TrimWhitespace = trimWhitespace
};
@@ -91,6 +92,7 @@ public static ColumnInferenceResults InferColumns(MLContext context, string path
AllowQuoting = splitInference.AllowQuote,
AllowSparse = splitInference.AllowSparse,
Separators = new char[] { splitInference.Separator.Value },
ReadMultilines = splitInference.ReadMultilines,
HasHeader = hasHeader,
TrimWhitespace = trimWhitespace
};
@@ -139,6 +141,7 @@ private static ColumnTypeInference.InferenceResult InferColumnTypes(MLContext co
Separator = splitInference.Separator.Value,
AllowSparse = splitInference.AllowSparse,
AllowQuote = splitInference.AllowQuote,
ReadMultilines = splitInference.ReadMultilines,
HasHeader = hasHeader,
LabelColumnIndex = labelColumnIndex,
Label = label
@@ -32,6 +32,7 @@ internal sealed class Arguments
public int MaxRowsToRead;
public uint? LabelColumnIndex;
public string Label;
public bool ReadMultilines;

public Arguments()
{
@@ -262,6 +263,7 @@ private static InferenceResult InferTextFileColumnTypesCore(MLContext context, I
Separators = new[] { args.Separator },
AllowSparse = args.AllowSparse,
AllowQuoting = args.AllowQuote,
ReadMultilines = args.ReadMultilines,
};
var textLoader = context.Data.CreateTextLoader(textLoaderOptions);
var idv = textLoader.Load(fileSource);
15 changes: 10 additions & 5 deletions src/Microsoft.ML.AutoML/ColumnInference/TextFileContents.cs
@@ -23,14 +23,16 @@ public class ColumnSplitResult

public bool AllowQuote { get; set; }
public bool AllowSparse { get; set; }
public bool ReadMultilines { get; set; }

public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool allowSparse, int columnCount)
public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool readMultilines, bool allowSparse, int columnCount)
Contributor:

Following the naming convention, this would be:

Suggested change
public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool readMultilines, bool allowSparse, int columnCount)
public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool allowMultiline, bool allowSparse, int columnCount)

Contributor Author:

TextLoader.Options uses ReadMultilines (as it does allowQuote and allowSparse), and we'd better follow its naming style.

Contributor Author:

Resolved

{
IsSuccess = isSuccess;
Separator = separator;
AllowQuote = allowQuote;
AllowSparse = allowSparse;
ColumnCount = columnCount;
ReadMultilines = readMultilines;
}
}

@@ -50,12 +52,14 @@ public static ColumnSplitResult TrySplitColumns(MLContext context, IMultiStreamS
{
var sparse = new[] { false, true };
var quote = new[] { true, false };
var tryMultiline = new[] { false, true };
var foundAny = false;
var result = default(ColumnSplitResult);
foreach (var perm in (from _allowSparse in sparse
from _allowQuote in quote
from _sep in separatorCandidates
select new { _allowSparse, _allowQuote, _sep }))
from _tryMultiline in tryMultiline
Contributor:

The naming pattern would be:

Suggested change
from _tryMultiline in tryMultiline
from _allowMultiline in multiline

select new { _allowSparse, _allowQuote, _sep, _tryMultiline }))
{
var options = new TextLoader.Options
{
@@ -66,7 +70,8 @@ from _sep in separatorCandidates
} },
Separators = new[] { perm._sep },
AllowQuoting = perm._allowQuote,
AllowSparse = perm._allowSparse
AllowSparse = perm._allowSparse,
ReadMultilines = perm._tryMultiline,
};

if (TryParseFile(context, options, source, out result))
Expand All @@ -75,7 +80,7 @@ from _sep in separatorCandidates
break;
}
}
return foundAny ? result : new ColumnSplitResult(false, null, true, true, 0);
return foundAny ? result : new ColumnSplitResult(false, null, true, true, true, 0);
}

private static bool TryParseFile(MLContext context, TextLoader.Options options, IMultiStreamSource source,
@@ -111,7 +116,7 @@ private static bool TryParseFile(MLContext context, TextLoader.Options options,
// disallow single-column case
if (mostCommon.Key <= 1) { return false; }

result = new ColumnSplitResult(true, options.Separators.First(), options.AllowQuoting, options.AllowSparse, mostCommon.Key);
result = new ColumnSplitResult(true, options.Separators.First(), options.AllowQuoting, options.ReadMultilines, options.AllowSparse, mostCommon.Key);
return true;
}
// fail gracefully if unable to instantiate data view with swept arguments
37 changes: 36 additions & 1 deletion test/Microsoft.ML.AutoML.Tests/ColumnInferenceTests.cs
@@ -2,6 +2,7 @@
using System.Collections.Generic;
using System.IO;
using System.Linq;
using FluentAssertions;
using Microsoft.ML.Data;
using Microsoft.ML.TestFramework;
using Xunit;
@@ -186,5 +187,39 @@ public void InferColumnsColumnInfoParam()
Assert.Equal(DefaultColumnNames.Features, result.ColumnInformation.NumericColumnNames.First());
Assert.Null(result.ColumnInformation.ExampleWeightColumnName);
}

[Fact]
public void TrySplitColumns_should_split_on_dataset_with_newline_between_double_quotes()
{
var context = new MLContext();
var dataset = Path.Combine("TestData", "DatasetWithNewlineBetweenQuotes.txt");
var sample = TextFileSample.CreateFromFullFile(dataset);
var result = TextFileContents.TrySplitColumns(context, sample, TextFileContents.DefaultSeparators);

result.ColumnCount.Should().Be(4);
result.Separator.Should().Be(',');
result.IsSuccess.Should().BeTrue();
}

[Fact]
public void InferColumnsFromMultilineInputFile()
{
// Check if we can infer the column information
// from an input file which has escaped newlines inside quotes
var dataPath = GetDataPath("multiline.csv");
MLContext mlContext = new MLContext();
var inputColumnInformation = new ColumnInformation();
inputColumnInformation.LabelColumnName = @"id";
var result = mlContext.Auto().InferColumns(dataPath, inputColumnInformation);

// File has 3 columns: "id", "description" and "animal"
Assert.NotNull(result.ColumnInformation.LabelColumnName);
Assert.Equal(1, result.ColumnInformation.TextColumnNames.Count);
Assert.Equal(1, result.ColumnInformation.CategoricalColumnNames.Count);

Assert.Equal("id", result.ColumnInformation.LabelColumnName);
Assert.Equal("description", result.ColumnInformation.TextColumnNames.First());
Assert.Equal("animal", result.ColumnInformation.CategoricalColumnNames.First());
}
}
}
}
@@ -7,10 +7,14 @@
</ItemGroup>

<ItemGroup>
<PackageReference Include="FluentAssertions" Version="5.10.3" />
<PackageReference Include="SciSharp.TensorFlow.Redist" Version="$(TensorFlowVersion)" />
</ItemGroup>

<ItemGroup>
<None Update="TestData\DatasetWithNewlineBetweenQuotes.txt">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Update="TestData\DatasetWithDefaultColumnNames.txt">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
@@ -0,0 +1,36 @@
id,Column1,Column2,Column3
1,this is a description, 1,2
2,"this is a quote description",1,2
Comment on lines +1 to +3

Member:

(pinned to unrelated code to allow threaded discussion)

@harishsk and I were under the impression that, in order for ModelBuilder to support the new readMultilines option, some changes might be needed in the CodeGenerator, since the code generated by ModelBuilder should also include readMultilines = true whenever it was found necessary to use that parameter.

Are these changes not necessary?

Contributor Author:

Yes and no... The CodeGen project should only refer to ML.NET packages that are already published; that's why we want to pin CodeGen to a specific version of ML.NET, to avoid API changes breaking it. So from that standpoint we would need to update CodeGen. But we can't change CodeGen before the release, because if we updated it now, it would need to refer to a nightly build of ML.NET to generate projects, which is not available externally.

My suggestion is to not update CodeGen this time, because it won't affect much. The generated project uses the LoadFromTextFile API with a user-defined ModelInput class, which should work even if there are newlines between quotes, if I recall correctly.

Member:

I see, then it's ok.

"The generated project uses the LoadFromTextFile API with a user-defined ModelInput class, which should work even if there are newlines between quotes, if I recall correctly."

Using the generic LoadFromTextFile API you mention will still use the default value for readMultilines, which is false. It would need to be set explicitly to true where necessary.

Notice that until recently LoadFromTextFile only accepted a fixed set of parameters in its signature, and I couldn't add the new readMultilines parameter there because that would be a breaking API change (which we are not allowed to do until ML.NET version 2). So instead, what I did was add a new LoadFromTextFile(path, options) overload that accepts a TextLoader.Options object (see #5134).

So if users want to use the generic LoadFromTextFile<> and also use the new options (such as readMultilines, or the ones being introduced in #5147 or #5145), they will have to use the new public overload. In this sense, I think the CodeGenerator should also start using this overload to have access to the new options.
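
To make that concrete, here is a minimal sketch of a loading call that goes through the options-based overload described above. The ModelInput class and its column layout are hypothetical (they are not part of this PR); the overload and the AllowQuoting/ReadMultilines flags are the ones discussed here.

using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input schema for a file shaped like multiline.csv; column indices are illustrative.
public class ModelInput
{
    [LoadColumn(0)] public float Id { get; set; }
    [LoadColumn(1)] public string Description { get; set; }
    [LoadColumn(2)] public string Animal { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // The options-based overload (see #5134) is what exposes the newer loader flags.
        var loaderOptions = new TextLoader.Options
        {
            Separators = new[] { ',' },
            HasHeader = true,
            AllowQuoting = true,
            ReadMultilines = true   // quoted fields may span multiple lines
        };

        IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>("multiline.csv", loaderOptions);
    }
}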

Contributor (@justinormont, May 20, 2020):

Unless I'm missing part of it, if the CLI / Model Builder calls this new version of AutoML, you will have to update CodeGen.

Otherwise the multi-line files will be read by the AutoML code, but the CodeGen'd code will not read the user's file.
(update: I see @antoniovs1029 added a comment before me which also explains this)

"The CodeGen project should only refer to ML.NET packages that are already published."

All of Microsoft.ML.* should be self-consistent with each other. No package pinning needed. Almost all changes to the AutoML code should have a mirroring change in the CodeGen code.

When I import [email protected] into my project, I'm expecting it to reference [email protected] & [email protected]. Let's say my project has [email protected] to do prediction, and then I want to call CodeGen (yes, only an internal call currently). Then my project needs both [email protected] and [email protected], causing a version conflict. Consistent releases are better.

"But we can't change CodeGen before the release, because if we updated it now, it would need to refer to a nightly build of ML.NET to generate projects, which is not available externally."

What's stopping CodeGen from being updated before the release? Why would a change to CodeGen cause CLI/ModelBuilder to depend on a nightly build instead of the released version?

Contributor Author (@LittleLittleCloud, May 21, 2020):

"What's stopping CodeGen from being updated before the release? Why would a change to CodeGen cause CLI/ModelBuilder to depend on a nightly build instead of the released version?"

It's not that CLI/ModelBuilder depends on a nightly build; in short, it's that the project generated by CLI/ModelBuilder needs to depend on a released build of ML.NET.

"All of Microsoft.ML.* should be self-consistent with each other. No package pinning needed. Almost all changes to the AutoML code should have a mirroring change in the CodeGen code."

I can hardly agree: there's no guarantee that every change in AutoML/ML.NET will be released, while projects generated by CodeGen should refer to a released version of ML.NET. If we mirror every change in AutoML/ML.NET into CodeGen, CodeGen will be left in an unusable state between releases, because the ML.NET/AutoML that CodeGen refers to differs from what its generated project refers to.

"When I import [email protected] into my project, I'm expecting it to reference [email protected] & [email protected]. Let's say my project has [email protected] to do prediction, and then I want to call CodeGen (yes, only an internal call currently). Then my project needs both [email protected] and [email protected], causing a version conflict. Consistent releases are better."

I don't see how that's related: if you want to call CodeGen, you need to call the CodeGen that is pinned to [email protected] instead of [email protected].

In fact, if we can make sure that before each release CodeGen is the last package to update, and that between two releases we make no changes to CodeGen, then consistent releases work. But if between two releases there is any breaking API change, then we both must and must not update the CodeGenerator, for the reason I mention above: either changing or not changing CodeGen leaves it in a broken state.

One way out of this dilemma is to pin CodeGen to a released version, but that breaks consistent releases and we would need to publish twice per release cycle: once for the ML.NET packages and once for CodeGen.

Another solution is to make sure ModelBuilder pins to a released version of AutoML/CodeGen, but that means between two releases we can't fix bugs caused by CodeGen, and again there's no guarantee.

Meanwhile, we have the same dilemma for AutoML, but unlike CodeGen we can use an internal build of AutoML, and most of its bugs are not breaking bugs, so that's not a big deal.

A few more words, sorry to be verbose here. I don't see why CodeGen has to refer to AutoML/ML.NET at all; the only relationship between CodeGen and AutoML is the Pipeline class (maybe also the TextLoaderOptions class), which should be defined in CodeGen as a contract class. On the contrary, AutoML should refer to CodeGen to tell it how to generate projects. By doing that we can make CodeGen independent of ML.NET and resolve the dilemma we have now.
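
To make that last suggestion concrete, here is a purely hypothetical sketch of what such contract types might look like if they were owned by CodeGen; none of these type names exist in the repository, they only illustrate the direction being proposed.

using System.Collections.Generic;

// Hypothetical contract types owned by CodeGen. AutoML would populate these and hand
// them to CodeGen, instead of CodeGen referencing AutoML/ML.NET types directly.
public sealed class TextLoaderOptionsContract
{
    public char Separator { get; set; }
    public bool HasHeader { get; set; }
    public bool AllowQuoting { get; set; }
    public bool ReadMultilines { get; set; }   // the option being swept in this PR
}

public sealed class PipelineNodeContract
{
    public string EstimatorName { get; set; }  // e.g. "OneHotEncoding"
    public IDictionary<string, object> Properties { get; set; } = new Dictionary<string, object>();
}

public sealed class PipelineContract
{
    public TextLoaderOptionsContract LoaderOptions { get; set; }
    public IList<PipelineNodeContract> Nodes { get; set; } = new List<PipelineNodeContract>();
}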

3,"this is a quote description with double quote("")",1,2
4,"this is a quote description with ""a pair of double quote""",1,2
5,"this is a quote description with new line
quote",1,2
6,"this is a quote description with
new line1 and
new line2 and empty line

and double quote""",1,2
7, this is a description with single quote("),1,2
// empty line between quotes
8,"",1,2
// single quote between quotes
9,"""",1,2
// simply newline between quotes
10,"



",1,2
// simply single quote and newline between quotes
11,"

""""

""

""

",1,2



Contributor Author:

Resolved

Contributor:

There are still some missing, like:

public static ColumnInferenceResults InferColumns(MLContext context, string path, string labelColumn,
char? separatorChar, bool? allowQuotedStrings, bool? supportSparse, bool trimWhitespace, bool groupColumns)

Contributor Author:

I prefer not exposing the readMultiline option in these APIs; simply sweeping it in Infersplit should be enough.
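
For context on why the sweep alone may suffice, here is a minimal sketch of the consumer-side flow, using the file name and label column from the tests above and assuming (as elsewhere in AutoML) that the inferred settings are exposed through ColumnInferenceResults.TextLoaderOptions: the swept ReadMultilines value surfaces through the inferred loader options rather than through the InferColumns signature.

using System;
using Microsoft.ML;
using Microsoft.ML.AutoML;

public static class InferenceSketch
{
    public static void Run()
    {
        var mlContext = new MLContext();

        // Column/split inference sweeps AllowQuoting, AllowSparse and, with this PR, ReadMultilines.
        var columnInfo = new ColumnInformation { LabelColumnName = "id" };
        ColumnInferenceResults inference = mlContext.Auto().InferColumns("multiline.csv", columnInfo);

        // Whatever combination the sweep settled on is carried in the inferred loader options...
        Console.WriteLine($"ReadMultilines inferred as: {inference.TextLoaderOptions.ReadMultilines}");

        // ...so the caller loads the data with those options and needs no explicit multiline switch.
        var loader = mlContext.Data.CreateTextLoader(inference.TextLoaderOptions);
        IDataView data = loader.Load("multiline.csv");
    }
}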