Skip to content

CreatePredictionEngine overrides inputSchemaDefinition by calling into DataViewConstructionUtils.GetSchemaDefinition #3234

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
codemzs opened this issue Apr 8, 2019 · 3 comments
Assignees
Labels
bug Something isn't working loadsave Bugs related loading and saving data or models P1 Priority of the issue for triage purpose: Needs to be fixed soon.

Comments

@codemzs
Copy link
Member

codemzs commented Apr 8, 2019

It seems CreatePredictionEngine overrides inputSchemaDefinition by calling into DataViewConstructionUtils.GetSchemaDefinition. This is problematic if some of the features of input schema is determined at runtime, example, length of feature vector.

            ITransformer trainedModel;
            using (var stream = new FileStream(modelFileFullPath, FileMode.Open, FileAccess.Read, FileShare.Read))
            {
               var trainedModel = mlContext.Model.Load(stream, out var inputSchema);
               var = mlContext.Model.CreatePredictionEngine<TInput, TOutput>(trainedModel, inputSchemaDefinition: inputSchema); 

//Throws an exception features<vector<single>> does not match features<vector<single, 432>> because CreatePredictionEngine returns transformer.CreatePredictionEngine<TSrc, TDst>(_env, false, DataViewConstructionUtils.GetSchemaDefinition<TSrc>(_env, inputSchema)); and `DataViewConstructionUtils.GetSchemaDefinition` overrides inputSchema.

            }

//Below workaround that overrides the override!

using (var stream = new FileStream(modelFileFullPath", FileMode.Open, FileAccess.Read, FileShare.Read))
            {

                var trainedModel = mlContext.Model.Load(stream, out var inputSchema);
                var outputSchemaDefinition = SchemaDefinition.Create(typeof(TInput));
                outputSchemaDefinition["Features"].ColumnType = new VectorDataViewType(NumberDataViewType.Single, Convert.ToInt32(432));
                var p = mlContext.Model.CreatePredictionEngine<TInput, TOutput>(trainedModel, inputSchemaDefinition: outputSchemaDefinition);
            }

CC: @TomFinley

@codemzs codemzs self-assigned this Apr 8, 2019
@codemzs codemzs added the bug Something isn't working label Apr 8, 2019
@TomFinley
Copy link
Contributor

TomFinley commented Apr 8, 2019

Thanks for writing this up @codemzs, but I still have a few questions till I understand the issue fully, specifically, I do not yet understand your first example, which I quote here:

var trainedModel = mlContext.Model.Load(stream, out var inputSchema);
var = mlContext.Model.CreatePredictionEngine<TInput, TOutput>(trainedModel, inputSchemaDefinition: inputSchema); 

The first line appears to be a usage of this method:

public ITransformer Load(Stream stream, out DataViewSchema inputSchema)

The second line appears to be a usage of this method:

public PredictionEngine<TSrc, TDst> CreatePredictionEngine<TSrc, TDst>(ITransformer transformer,
bool ignoreMissingColumns = true, SchemaDefinition inputSchemaDefinition = null, SchemaDefinition outputSchemaDefinition = null)

I can tell it is that specifically, rather than the overload a few lines below, due to the explicit usage of the parameter inputSchemaDefinition, which only exists on this parameter. Yet, inputSchema. But that doesn't make sense, since these are two completely different types, and there is no implicit conversion from DataViewSchema (which is what inputSchema is), and the schema definition objects.

My guess is that you meant to use the one with inputSchema. If I'm right, could you edit the issue, to correct it?

I'm going to move forward under the assumption that my guess is correct, since I'm about 90% sure about that. Anyway, if I suppose you really meant the second overload, I see this method.

internal static SchemaDefinition GetSchemaDefinition<TRow>(IHostEnvironment env, DataViewSchema schema)

I'm also guessing that your desire is to make the input type defined from the input. If you want to have the types of the input schema definition taken from the input schema that seems like a positive change to me. There is an additional subtlety you'll need to take care of, and that is inconsistencies w.r.t. the schema types in the input schema, as well as checking for consistency. For example, imagine the situation where we have columns A, B, C, D, E, each having type in the schema of vector of float of length 10.

Then imagine we have this input type:

public sealed class Instance
{
    public float[] A;

    [VectorType]
    public float[] B;

    [VectorType(5)]
    public float[] C;

    [VectorType(10)]
    public float[] D;

    public double[] E;
}

I would expect if you are inferring types from the input schema as well, it should succeed in remapping column A, either fail on columns B and C with a suitably descriptive method or prefer the attributes in the column, of course do nothing on column D, and fail (at some point) with an exception in case E.

You'll want to do something similar with keys, though I'd mark that as less urgent.

There is another additional problem that I notice just by reading that that method, and that is that in the loop, hidden columns are not skipped, and they kind of have to be... otherwise annotation and type information will be layered on top of each other.

Hopefully the above discussion clear of what a desirable design for this would be, let me (or whoever works on it) know if you have questions, and thanks for bringing this up.

@yaeldekel
Copy link

@TomFinley's assumption is correct, the problem is with the overload of CreatePredictionEngine that takes a DataViewSchema, and the bug is in DataViewConstructionUtils.GetSchemaDefinition.

To reproduce, change the test LoadModelWithOptionalColumnTransform to use the CreatePredictionEngine that takes a DataViewSchema.

@yaeldekel yaeldekel added the P0 Priority of the issue for triage purpose: IMPORTANT, needs to be fixed right away. label Jan 9, 2020
@harishsk harishsk added P1 Priority of the issue for triage purpose: Needs to be fixed soon. and removed P0 Priority of the issue for triage purpose: IMPORTANT, needs to be fixed right away. labels Mar 26, 2020
@harishsk harishsk added the loadsave Bugs related loading and saving data or models label Apr 29, 2020
@frank-dong-ms-zz frank-dong-ms-zz self-assigned this Jun 10, 2020
@frank-dong-ms-zz
Copy link
Contributor

frank-dong-ms-zz commented Jun 10, 2020

I can reproduce this issue on latest code base with below extra error:

System.InvalidOperationException : Type should contain a member named Label

after ignore column "Label" on purpose, get below exception:

Incompatible features column type: 'Vector<Single>' vs 'Vector<Single, 38>'

so 2 issues here, one is the schema override issue, the other is we are loading extra column "Label" from model.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working loadsave Bugs related loading and saving data or models P1 Priority of the issue for triage purpose: Needs to be fixed soon.
Projects
None yet
Development

No branches or pull requests

5 participants