Add support for stateful custom mappers, and for custom filters. #4676

yaeldekel · 2020-01-19T13:04:36Z

Fixes #4675 .

codecov · 2020-01-19T14:47:02Z

Codecov Report

❗ No coverage uploaded for pull request base (master@f94f359). Click here to learn what that means.
The diff coverage is 82.41%.

@@            Coverage Diff            @@
##             master    #4676   +/-   ##
=========================================
  Coverage          ?   75.84%           
=========================================
  Files             ?      953           
  Lines             ?   172684           
  Branches          ?    18612           
=========================================
  Hits              ?   130976           
  Misses            ?    36523           
  Partials          ?     5185

Flag	Coverage Δ
#Debug	`75.84% <82.41%> (?)`
#production	`71.44% <78.24%> (?)`
#test	`90.63% <100%> (?)`

Impacted Files	Coverage Δ
src/Microsoft.ML.Transforms/LambdaTransform.cs	`64.42% <0%> (ø)`
...rc/Microsoft.ML.Transforms/CustomMappingFactory.cs	`100% <100%> (ø)`
...rc/Microsoft.ML.Transforms/CustomMappingCatalog.cs	`100% <100%> (ø)`
...rosoft.ML.Tests/Transformers/CustomMappingTests.cs	`94.92% <100%> (ø)`
....ML.Transforms/StatefulCustomMappingTransformer.cs	`75.63% <75.63%> (ø)`
src/Microsoft.ML.Transforms/CustomMappingFilter.cs	`82.35% <82.35%> (ø)`

justinormont · 2020-03-24T20:26:03Z

test/Microsoft.ML.Tests/Transformers/CustomMappingTests.cs

+            var filteredData = ML.Data.StatefulCustomFilter<MyFilterInput, MyFilterState>(data,
+                (input, state) =>
+                {
+                    if (state.Count++ % 2 == 0)


Will this run on multiple threads? If so, I'm not sure how to get the row number deterministically. #Resolved

The stateful mapping transformer and filter don't use threads, exactly for this reason.

In reply to: 397438593 [](ancestors = 397438593)

harishsk · 2020-03-27T05:30:32Z

src/Microsoft.ML.Transforms/CustomMappingFactory.cs

+    /// <typeparam name="TState">The type that describes the state object the mapping uses.</typeparam>
+    /// <typeparam name="TDst">The type that describes what new columns are added by this transform.</typeparam>
+    public abstract class StatefulCustomMappingFactory<TSrc, TState, TDst> : ICustomMappingFactory
+        where TSrc : class, new()


Tiny nit: CustomMappingFactory uses <TSrc, TDst>. Would it be more consistent to order these as <TSrc, TDst, TState> or <TState, TSrc, TDst> ?
#Resolved

harishsk · 2020-03-27T05:32:27Z

src/Microsoft.ML.Transforms/CustomMappingCatalog.cs

+
+        /// <summary>
+        /// Create a <see cref="StatefulCustomMappingEstimator{TSrc, TState, TDst}"/>, which applies a custom mapping of input columns to output columns,
+        /// while allowing a per-cursor state.


Are we adding this more for TLC parity or is there a customer ask for this? (Not opposed to the former, just curious) #Resolved

This is for parity, I don't know about customer asks.

In reply to: 399043928 [](ancestors = 399043928)

harishsk · 2020-03-27T06:00:19Z

src/Microsoft.ML.Transforms/CustomMappingCatalog.cs

+        /// <param name="inputSchemaDefinition">Additional parameters for schema mapping between <typeparamref name="TSrc"/> and input data. Useful
+        /// when dealing with annotations.</param>
+        public static IDataView CustomFilter<TSrc>(this DataOperationsCatalog catalog, IDataView input, Func<TSrc, bool> filterPredicate, SchemaDefinition inputSchemaDefinition = null)
+            where TSrc : class, new()


Do we need samples illustrating these? #Resolved

Yes, good idea.

In reply to: 399050773 [](ancestors = 399050773)

harishsk · 2020-03-27T06:59:11Z

src/Microsoft.ML.Transforms/CustomMappingCatalog.cs

+        /// <param name="outputSchemaDefinition">Additional parameters for schema mapping between <typeparamref name="TDst"/> and output data.
+        /// Useful when dealing with annotations.</param>
+        public static StatefulCustomMappingEstimator<TSrc, TState, TDst> StatefulCustomMapping<TSrc, TState, TDst>(this TransformsCatalog catalog, Action<TSrc, TState, TDst> mapAction,
+            Action<TState> stateInitAction, string contractName, SchemaDefinition inputSchemaDefinition = null, SchemaDefinition outputSchemaDefinition = null)


Can you please add Estimator characteristics in the docs (like the others) and add Exportable to Onnx: No?
(I am guessing this will never be exportable to onnx...is that correct?) #Resolved

Correct, it cannot be exportable to ONNX, since it is based on custom C# code specified by the user. The estimator documentation regarding this is in StatefulCustomMappingTransform.cs in line 332.

In reply to: 399067853 [](ancestors = 399067853)

antoniovs1029 · 2020-04-01T10:00:22Z

test/Microsoft.ML.Tests/Transformers/CustomMappingTests.cs

+            // We create a temporary environment to instantiate the custom transformer. This is to ensure that we don't need the same
+            // environment for saving and loading.
+            var tempoEnv = new MLContext();
+            var customEst = tempoEnv.Transforms.StatefulCustomMapping<MyStatefulInput, MyStatefulOutput, MyState>(MyStatefulLambda.MyStatefulAction, MyStatefulLambda.MyStateInit, nameof(MyStatefulLambda));
+
+            ML.ComponentCatalog.RegisterAssembly(typeof(MyStatefulLambda).Assembly);


Just a heads up that registering the assembly won't be necessary if this PR gets merged:
#4989

Since it's modifying the LambdaTransform Save and Create methods, it will also be necessary to update StatefulCustomMappingTransformer in here to also save the assembly name. My understanding is that there shouldn't be any problem in that PR or in this PR to do this change. #Resolved

Thanks for the heads up. I guess if PR 4989 gets merged before this one, the code of StatefulCustomMappingTransformer won't build, so that will remind me to update this :).

In reply to: 401497684 [](ancestors = 401497684)

PR 4989 is merged.

In reply to: 401643916 [](ancestors = 401643916,401497684)

antoniovs1029 · 2020-04-01T10:29:17Z

src/Microsoft.ML.Transforms/CustomMappingCatalog.cs

+        public static StatefulCustomMappingEstimator<TSrc, TDst, TState> StatefulCustomMapping<TSrc, TDst, TState>(this TransformsCatalog catalog, Action<TSrc, TDst, TState> mapAction,
+            Action<TState> stateInitAction, string contractName, SchemaDefinition inputSchemaDefinition = null, SchemaDefinition outputSchemaDefinition = null)


So I guess this is right, but there's something that concerns me about letting the user to change the input and output SchemaDefinition of the transformer.

Some time ago there was this issue (#3988) where the behavior of CustomMappingTransformer was different before saving the model to disk, and after loading it. My conclusion on the issue was that when saving the model, the information of the SchemaDefinition wasn't saved into disk, and so when loading back the model that information got lost.

The issue remains unfixed, but I think it was closed because it's not in our plans to fix it soon and there existed a workaround for that particular issue.

In fact, I am not 100% sure if my conclusion there was correct (so perhaps having input from @yaeldekel would help), but if I was right then I'd think it's better to not let the user manipulate the SchemaDefinitions of this transformer so that they don't encounter this issue when loading back their models.

On the other hand, I am also not completely sure in which case would the end user would necessarily require to modify the SchemaDefinitions. There's actually only 1 test (SchemaDefinitionForCustomMapping) and no samples where the SchemaDefinitions of a CustomMappingTransformer are modified, and I am still not clear about what use case would that test correspond to.

Anyway, it's just a minor concern I have. #Resolved

Interesting. It seems to me that the fix for this would be to serialize both the input and the output SchemaDefinitions in LambdaTransform.SaveCustomTransformer, and then have the ICustomMappingFactory.CreateTransformer method take an optional input/output SchemaDefinition.

Since the issue has been closed (because of low priority I guess?), should I disallow these input parameters in this new class? We cannot remove them from the existing CustomMappingEstimator class since that would be a breaking change, but we can at least not introduce the new class with a bug (even if it's just an edge case usage). What do you think? I'm good either way.

In reply to: 401514473 [](ancestors = 401514473)

I'd think it's better not to expose in here the input and output SchemaDefinition to the user. And only expose them if in the future someone requests it or if we actually get to solve the problem of serializing/deserializing SchemaDefinitions.

As for this issue on CustomMappingTransformer, I understand that we can't remove the SchemaDefinition parameters from the API. So I will reopen that issue, and tag it as 'breaking-api' so that it gets reviewed in the future if ML.NET 2.0 ever gets planned. #Resolved

antoniovs1029 · 2020-04-02T00:20:15Z

src/Microsoft.ML.Transforms/StatefulCustomMappingTransformer.cs

+            public override DataViewRowCursor[] GetRowCursorSet(IEnumerable<DataViewSchema.Column> columnsNeeded, int n, Random rand = null)
+            {
+                return new[] { GetRowCursorCore(columnsNeeded, rand) };
+            }


So I have a question.

My impression, is that this code in here is like this so to not let this transform to use more than one thread, is that right?

Still, on the ShouldUseParallelCursors of this transformer it returns null if the predicate there only wants columns that were already on the input of this transformer.

So my question is, shouldn't this method here actually return multiple cursors if ShouldUseParallelCursors returns null??

To be honest, I don't know if what I am asking makes sense, but for instance there's the case of GenerateNumberTransform, where its GetRowCursorSet actually checks the ShouldUseParallelCursors to know if it returns 1 or more cursors.

machinelearning/src/Microsoft.ML.Data/Transforms/GenerateNumberTransform.cs

Lines 361 to 378 in 4b3cc7e

if (n > 1 && ShouldUseParallelCursors(predicate) != false)

{

var inputs = Source.GetRowCursorSet(inputCols, n);

Host.AssertNonEmpty(inputs);

if (inputs.Length != 1)

{

var cursors = new DataViewRowCursor[inputs.Length];

for (int i = 0; i < inputs.Length; i++)

cursors[i] = new Cursor(Host, _bindings, inputs[i], active);

return cursors;

}

input = inputs[0];

}

else

input = Source.GetRowCursor(inputCols);

return new DataViewRowCursor[] { new Cursor(Host, _bindings, input, active) };

#Resolved

Yes, you're right. ShouldUseParallelCursors is used in the base class TransformBase, inside the IDataView.GetRowCursor method. This method gets the cursor recursively from the Source of each IDataTransform, and in order to know whether to do that using GetRowCursor or GetRowCursorSet it calls ShouldUseParallelCursors. If this transform (StatefulCustomMappingTransformer.RowToRowMapper) is somewhere in the middle of the pipeline, and none of the transform's added columns are active, then it can safely produce multiple cursors from its Source.

In reply to: 401983989 [](ancestors = 401983989)

antoniovs1029

In general it LGTM, I just left one question that might or might not require changing the code. And then there's my suggestion to remove the ScemaDefinition parameters from the public API.

Since #4989 is now merged, please update this PR before merging. 😄

antoniovs1029 · 2020-04-02T01:14:56Z

src/Microsoft.ML.Transforms/StatefulCustomMappingTransformer.cs

+        /// Whether a call to <see cref="ITransformer.GetRowToRowMapper(DataViewSchema)"/> should succeed, on an
+        /// appropriate schema.
+        /// </summary>
+        bool ITransformer.IsRowToRowMapper => true;


I have one question: If this is RowToRowMapper, then, according to what we chatted the other day, this can actually be used with PredictionEngine, right? But my understanding was that PredictionEngine process 1 row at a time. So I wonder: what will happen to the "state" on this transform? Will it actually update the state accross rows or will it restart the state on every row with the init function? #Resolved

The init function gets called once when the getter is created, and since it is created only once when PredictionEngine is created, then it will use the same state object, that changes from prediction to prediction.

In reply to: 401998841 [](ancestors = 401998841)

harishsk · 2020-04-04T00:43:58Z

@yaeldekel I think this is almost ready for merging. Can you please address the pending comments?

Adding the Exportable to onnx entry for the estimator
Pull, rebase and test now that Added the assembly name of the custom transform to the model file #4989 is merged
Disallow SchemaDefinition parameters in StatefulCustomMapping #Resolved

harishsk · 2020-04-06T22:34:09Z

@yaeldekel Thanks for the update. It appears only the minor addition of Onnx exportable is left before we can merge. Please update it and let me know.

harishsk · 2020-04-06T22:53:16Z

I was looking in the wrong place. You have already added it. All the comments have been addressed. Please merge the PR if you think it is ready from your end as well. Thanks.

yaeldekel requested a review from a team as a code owner January 19, 2020 13:04

yaeldekel force-pushed the custom branch from b123941 to 9bb3178 Compare January 29, 2020 07:32

yaeldekel force-pushed the custom branch from 9bb3178 to ba40b0a Compare February 28, 2020 23:29

justinormont reviewed Mar 24, 2020

View reviewed changes

harishsk reviewed Mar 27, 2020

View reviewed changes

antoniovs1029 reviewed Apr 1, 2020

View reviewed changes

antoniovs1029 mentioned this pull request Apr 1, 2020

Troubles with CustomMappingEstimator after save/load because InputSchemaDefinition is not being saved. #3988

Open

antoniovs1029 reviewed Apr 2, 2020

View reviewed changes

antoniovs1029 approved these changes Apr 2, 2020

View reviewed changes

antoniovs1029 reviewed Apr 2, 2020

View reviewed changes

yaeldMS added 5 commits April 5, 2020 10:01

Stateful custom mapping transformer

78661c6

Custom filters

ef3b075

Add the filter file, and add documentation

d91be8a

Address code review comments.

76f7f3f

Code review comments.

7ddd3fd

yaeldekel force-pushed the custom branch from 1c7b8f1 to 7ddd3fd Compare April 5, 2020 08:52

yaeldekel merged commit 8fb8420 into dotnet:master Apr 10, 2020

ghost locked as resolved and limited conversation to collaborators Mar 19, 2022

		public static StatefulCustomMappingEstimator<TSrc, TDst, TState> StatefulCustomMapping<TSrc, TDst, TState>(this TransformsCatalog catalog, Action<TSrc, TDst, TState> mapAction,
		Action<TState> stateInitAction, string contractName, SchemaDefinition inputSchemaDefinition = null, SchemaDefinition outputSchemaDefinition = null)

	if (n > 1 && ShouldUseParallelCursors(predicate) != false)
	{
	var inputs = Source.GetRowCursorSet(inputCols, n);
	Host.AssertNonEmpty(inputs);

	if (inputs.Length != 1)
	{
	var cursors = new DataViewRowCursor[inputs.Length];
	for (int i = 0; i < inputs.Length; i++)
	cursors[i] = new Cursor(Host, _bindings, inputs[i], active);
	return cursors;
	}
	input = inputs[0];
	}
	else
	input = Source.GetRowCursor(inputCols);

	return new DataViewRowCursor[] { new Cursor(Host, _bindings, input, active) };

Add support for stateful custom mappers, and for custom filters. #4676

Add support for stateful custom mappers, and for custom filters. #4676

Uh oh!

Conversation

yaeldekel commented Jan 19, 2020

Uh oh!

codecov bot commented Jan 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

justinormont Mar 24, 2020 • edited by yaeldekel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harishsk Mar 27, 2020 • edited by yaeldekel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harishsk Mar 27, 2020 • edited by yaeldekel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harishsk Mar 27, 2020 • edited by yaeldekel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harishsk Mar 27, 2020 • edited by yaeldekel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Apr 1, 2020 • edited by yaeldekel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Apr 1, 2020 • edited by yaeldekel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Apr 1, 2020 • edited by yaeldekel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Apr 2, 2020 • edited by yaeldekel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 Apr 2, 2020 • edited by yaeldekel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harishsk commented Apr 4, 2020 • edited by yaeldekel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

harishsk commented Apr 6, 2020

Uh oh!

harishsk commented Apr 6, 2020

Uh oh!

Uh oh!

codecov bot commented Jan 19, 2020 •

edited

Loading

justinormont Mar 24, 2020 •

edited by yaeldekel

Loading

harishsk Mar 27, 2020 •

edited by yaeldekel

Loading

harishsk Mar 27, 2020 •

edited by yaeldekel

Loading

harishsk Mar 27, 2020 •

edited by yaeldekel

Loading

harishsk Mar 27, 2020 •

edited by yaeldekel

Loading

antoniovs1029 Apr 1, 2020 •

edited by yaeldekel

Loading

antoniovs1029 Apr 1, 2020 •

edited by yaeldekel

Loading

antoniovs1029 Apr 1, 2020 •

edited by yaeldekel

Loading

antoniovs1029 Apr 2, 2020 •

edited by yaeldekel

Loading

antoniovs1029 left a comment •

edited

Loading

antoniovs1029 Apr 2, 2020 •

edited by yaeldekel

Loading

harishsk commented Apr 4, 2020 •

edited by yaeldekel

Loading