Skip to content

Add support for stateful custom mappers, and for custom filters. #4676

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 10, 2020

Conversation

yaeldekel
Copy link

Fixes #4675 .

@yaeldekel yaeldekel requested a review from a team as a code owner January 19, 2020 13:04
@codecov
Copy link

codecov bot commented Jan 19, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@f94f359). Click here to learn what that means.
The diff coverage is 82.41%.

@@            Coverage Diff            @@
##             master    #4676   +/-   ##
=========================================
  Coverage          ?   75.84%           
=========================================
  Files             ?      953           
  Lines             ?   172684           
  Branches          ?    18612           
=========================================
  Hits              ?   130976           
  Misses            ?    36523           
  Partials          ?     5185
Flag Coverage Δ
#Debug 75.84% <82.41%> (?)
#production 71.44% <78.24%> (?)
#test 90.63% <100%> (?)
Impacted Files Coverage Δ
src/Microsoft.ML.Transforms/LambdaTransform.cs 64.42% <0%> (ø)
...rc/Microsoft.ML.Transforms/CustomMappingFactory.cs 100% <100%> (ø)
...rc/Microsoft.ML.Transforms/CustomMappingCatalog.cs 100% <100%> (ø)
...rosoft.ML.Tests/Transformers/CustomMappingTests.cs 94.92% <100%> (ø)
....ML.Transforms/StatefulCustomMappingTransformer.cs 75.63% <75.63%> (ø)
src/Microsoft.ML.Transforms/CustomMappingFilter.cs 82.35% <82.35%> (ø)

var filteredData = ML.Data.StatefulCustomFilter<MyFilterInput, MyFilterState>(data,
(input, state) =>
{
if (state.Count++ % 2 == 0)
Copy link
Contributor

@justinormont justinormont Mar 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this run on multiple threads? If so, I'm not sure how to get the row number deterministically. #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stateful mapping transformer and filter don't use threads, exactly for this reason.


In reply to: 397438593 [](ancestors = 397438593)

/// <typeparam name="TState">The type that describes the state object the mapping uses.</typeparam>
/// <typeparam name="TDst">The type that describes what new columns are added by this transform.</typeparam>
public abstract class StatefulCustomMappingFactory<TSrc, TState, TDst> : ICustomMappingFactory
where TSrc : class, new()
Copy link
Contributor

@harishsk harishsk Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit: CustomMappingFactory uses <TSrc, TDst>. Would it be more consistent to order these as <TSrc, TDst, TState> or <TState, TSrc, TDst> ?
#Resolved


/// <summary>
/// Create a <see cref="StatefulCustomMappingEstimator{TSrc, TState, TDst}"/>, which applies a custom mapping of input columns to output columns,
/// while allowing a per-cursor state.
Copy link
Contributor

@harishsk harishsk Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we adding this more for TLC parity or is there a customer ask for this? (Not opposed to the former, just curious) #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for parity, I don't know about customer asks.


In reply to: 399043928 [](ancestors = 399043928)

/// <param name="inputSchemaDefinition">Additional parameters for schema mapping between <typeparamref name="TSrc"/> and input data. Useful
/// when dealing with annotations.</param>
public static IDataView CustomFilter<TSrc>(this DataOperationsCatalog catalog, IDataView input, Func<TSrc, bool> filterPredicate, SchemaDefinition inputSchemaDefinition = null)
where TSrc : class, new()
Copy link
Contributor

@harishsk harishsk Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need samples illustrating these? #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good idea.


In reply to: 399050773 [](ancestors = 399050773)

/// <param name="outputSchemaDefinition">Additional parameters for schema mapping between <typeparamref name="TDst"/> and output data.
/// Useful when dealing with annotations.</param>
public static StatefulCustomMappingEstimator<TSrc, TState, TDst> StatefulCustomMapping<TSrc, TState, TDst>(this TransformsCatalog catalog, Action<TSrc, TState, TDst> mapAction,
Action<TState> stateInitAction, string contractName, SchemaDefinition inputSchemaDefinition = null, SchemaDefinition outputSchemaDefinition = null)
Copy link
Contributor

@harishsk harishsk Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add Estimator characteristics in the docs (like the others) and add Exportable to Onnx: No?
(I am guessing this will never be exportable to onnx...is that correct?) #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, it cannot be exportable to ONNX, since it is based on custom C# code specified by the user. The estimator documentation regarding this is in StatefulCustomMappingTransform.cs in line 332.


In reply to: 399067853 [](ancestors = 399067853)

Comment on lines 178 to 183
// We create a temporary environment to instantiate the custom transformer. This is to ensure that we don't need the same
// environment for saving and loading.
var tempoEnv = new MLContext();
var customEst = tempoEnv.Transforms.StatefulCustomMapping<MyStatefulInput, MyStatefulOutput, MyState>(MyStatefulLambda.MyStatefulAction, MyStatefulLambda.MyStateInit, nameof(MyStatefulLambda));

ML.ComponentCatalog.RegisterAssembly(typeof(MyStatefulLambda).Assembly);
Copy link
Member

@antoniovs1029 antoniovs1029 Apr 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a heads up that registering the assembly won't be necessary if this PR gets merged:
#4989

Since it's modifying the LambdaTransform Save and Create methods, it will also be necessary to update StatefulCustomMappingTransformer in here to also save the assembly name. My understanding is that there shouldn't be any problem in that PR or in this PR to do this change. #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads up. I guess if PR 4989 gets merged before this one, the code of StatefulCustomMappingTransformer won't build, so that will remind me to update this :).


In reply to: 401497684 [](ancestors = 401497684)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR 4989 is merged.


In reply to: 401643916 [](ancestors = 401643916,401497684)

Comment on lines 71 to 72
public static StatefulCustomMappingEstimator<TSrc, TDst, TState> StatefulCustomMapping<TSrc, TDst, TState>(this TransformsCatalog catalog, Action<TSrc, TDst, TState> mapAction,
Action<TState> stateInitAction, string contractName, SchemaDefinition inputSchemaDefinition = null, SchemaDefinition outputSchemaDefinition = null)
Copy link
Member

@antoniovs1029 antoniovs1029 Apr 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I guess this is right, but there's something that concerns me about letting the user to change the input and output SchemaDefinition of the transformer.

Some time ago there was this issue (#3988) where the behavior of CustomMappingTransformer was different before saving the model to disk, and after loading it. My conclusion on the issue was that when saving the model, the information of the SchemaDefinition wasn't saved into disk, and so when loading back the model that information got lost.

The issue remains unfixed, but I think it was closed because it's not in our plans to fix it soon and there existed a workaround for that particular issue.

In fact, I am not 100% sure if my conclusion there was correct (so perhaps having input from @yaeldekel would help), but if I was right then I'd think it's better to not let the user manipulate the SchemaDefinitions of this transformer so that they don't encounter this issue when loading back their models.

On the other hand, I am also not completely sure in which case would the end user would necessarily require to modify the SchemaDefinitions. There's actually only 1 test (SchemaDefinitionForCustomMapping) and no samples where the SchemaDefinitions of a CustomMappingTransformer are modified, and I am still not clear about what use case would that test correspond to.

Anyway, it's just a minor concern I have. #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. It seems to me that the fix for this would be to serialize both the input and the output SchemaDefinitions in LambdaTransform.SaveCustomTransformer, and then have the ICustomMappingFactory.CreateTransformer method take an optional input/output SchemaDefinition.

Since the issue has been closed (because of low priority I guess?), should I disallow these input parameters in this new class? We cannot remove them from the existing CustomMappingEstimator class since that would be a breaking change, but we can at least not introduce the new class with a bug (even if it's just an edge case usage). What do you think? I'm good either way.


In reply to: 401514473 [](ancestors = 401514473)

Copy link
Member

@antoniovs1029 antoniovs1029 Apr 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd think it's better not to expose in here the input and output SchemaDefinition to the user. And only expose them if in the future someone requests it or if we actually get to solve the problem of serializing/deserializing SchemaDefinitions.

As for this issue on CustomMappingTransformer, I understand that we can't remove the SchemaDefinition parameters from the API. So I will reopen that issue, and tag it as 'breaking-api' so that it gets reviewed in the future if ML.NET 2.0 ever gets planned. #Resolved

Comment on lines 136 to 156
public override DataViewRowCursor[] GetRowCursorSet(IEnumerable<DataViewSchema.Column> columnsNeeded, int n, Random rand = null)
{
return new[] { GetRowCursorCore(columnsNeeded, rand) };
}
Copy link
Member

@antoniovs1029 antoniovs1029 Apr 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I have a question.

My impression, is that this code in here is like this so to not let this transform to use more than one thread, is that right?

Still, on the ShouldUseParallelCursors of this transformer it returns null if the predicate there only wants columns that were already on the input of this transformer.

So my question is, shouldn't this method here actually return multiple cursors if ShouldUseParallelCursors returns null??

To be honest, I don't know if what I am asking makes sense, but for instance there's the case of GenerateNumberTransform, where its GetRowCursorSet actually checks the ShouldUseParallelCursors to know if it returns 1 or more cursors.

if (n > 1 && ShouldUseParallelCursors(predicate) != false)
{
var inputs = Source.GetRowCursorSet(inputCols, n);
Host.AssertNonEmpty(inputs);
if (inputs.Length != 1)
{
var cursors = new DataViewRowCursor[inputs.Length];
for (int i = 0; i < inputs.Length; i++)
cursors[i] = new Cursor(Host, _bindings, inputs[i], active);
return cursors;
}
input = inputs[0];
}
else
input = Source.GetRowCursor(inputCols);
return new DataViewRowCursor[] { new Cursor(Host, _bindings, input, active) };
#Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right. ShouldUseParallelCursors is used in the base class TransformBase, inside the IDataView.GetRowCursor method. This method gets the cursor recursively from the Source of each IDataTransform, and in order to know whether to do that using GetRowCursor or GetRowCursorSet it calls ShouldUseParallelCursors. If this transform (StatefulCustomMappingTransformer.RowToRowMapper) is somewhere in the middle of the pipeline, and none of the transform's added columns are active, then it can safely produce multiple cursors from its Source.


In reply to: 401983989 [](ancestors = 401983989)

Copy link
Member

@antoniovs1029 antoniovs1029 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general it LGTM, I just left one question that might or might not require changing the code. And then there's my suggestion to remove the ScemaDefinition parameters from the public API.

Since #4989 is now merged, please update this PR before merging. 😄

/// Whether a call to <see cref="ITransformer.GetRowToRowMapper(DataViewSchema)"/> should succeed, on an
/// appropriate schema.
/// </summary>
bool ITransformer.IsRowToRowMapper => true;
Copy link
Member

@antoniovs1029 antoniovs1029 Apr 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one question: If this is RowToRowMapper, then, according to what we chatted the other day, this can actually be used with PredictionEngine, right? But my understanding was that PredictionEngine process 1 row at a time. So I wonder: what will happen to the "state" on this transform? Will it actually update the state accross rows or will it restart the state on every row with the init function? #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The init function gets called once when the getter is created, and since it is created only once when PredictionEngine is created, then it will use the same state object, that changes from prediction to prediction.


In reply to: 401998841 [](ancestors = 401998841)

@harishsk
Copy link
Contributor

harishsk commented Apr 4, 2020

@yaeldekel I think this is almost ready for merging. Can you please address the pending comments?

@harishsk
Copy link
Contributor

harishsk commented Apr 6, 2020

@yaeldekel Thanks for the update. It appears only the minor addition of Onnx exportable is left before we can merge. Please update it and let me know.

@harishsk
Copy link
Contributor

harishsk commented Apr 6, 2020

I was looking in the wrong place. You have already added it. All the comments have been addressed. Please merge the PR if you think it is ready from your end as well. Thanks.

@yaeldekel yaeldekel merged commit 8fb8420 into dotnet:master Apr 10, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for stateful custom mappers, and for custom filters
5 participants