-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Adding KeyToValueTransformers before finishing exporting to ONNX #4841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
||
var xf = new KeyToValueMappingTransformer(Host, "PredictedLabel").Transform(scorePipe); | ||
var output = new CommonOutputs.TransformOutput { Model = new TransformModelImpl(Host, xf, scorePipe), OutputData = xf }; | ||
end = output.OutputData; | ||
transforms.AddLast(end as ITransformCanSaveOnnx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far, ignore everything in this PR except for this lines where I add the KeyToValueMapping at the end of the transforms list.
Here I had hardcoded the name of the PredictedLabel column, but I guess that simply retrieving the KeyDataViewType columns of the scorePipe dataview could be possible, so to then add KeyToValueMappingTransformers for each KeyDataViewType column we might desire.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, reading it again now. I like this idea because I think this code path runs only in the Nimbus scenario. (We should verify that, is is the case). If that is true, then this may unblock us for now. But I would like to be able to solve this for the non-Nimbus scenario as well. If this works for now, we can create a bug for the non-Nimbus scenario and fix it later.
In reply to: 379220940 [](ancestors = 379220940)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, my understanding is that the Run() method where I added my changes only gets executed when running the following entrypoint:
machinelearning/src/Microsoft.ML.OnnxConverter/SaveOnnxCommand.cs
Lines 324 to 329 in 0b4b15b
[TlcModule.EntryPoint(Name = "Models.OnnxConverter", Desc = "Converts the model to ONNX format.", UserName = "ONNX Converter.")] | |
public static Output Apply(IHostEnvironment env, Arguments input) | |
{ | |
new SaveOnnxCommand(env, input).Run(); | |
return new Output(); | |
} |
So any tool that has access to that entrypoint will be able to run that. I am unsure if NimbusML is the only one who has it (I wonder if MAML, and AutoML/ModelBuilder have access to it). If other tools have access to this, then perhaps adding an overload to Run() with a flag such as "addMappingAtTheEnd = false" might be a good idea.
On the other hand I am not sure how to confirm that this Run() method gets only called through that entrypoint. I might want to ask you about this offline about how to do it.
In reply to: 379256936 [](ancestors = 379256936,379220940)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The command is runnable from MAML.
I added a comment in the NimbusML issue, I am not sure that we should be adding transforms to the ONNX export, that do not exist in the model. If the goal is consistency between the results when scoring in NimbusML or in ONNX, to me it makes more sense to change NimbusML to make the KeyToValue
transform a part of the model (otherwise, you will get consistency between NimbusML and ONNX, but not with ML.NET...).
In reply to: 379566192 [](ancestors = 379566192,379256936,379220940)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did try to add the KeyToValue transform to the pipeline from NimbusML. The problem is that the way NimbusML creates pipelines and run predictions in them, doesn't allow me to add a transform after the predictor. I tried a couple of approaches to make it work, but I couldn't manage to do it only in NimusML side. My conclusion of those attempts was that to add the KeyToValue transform from NimbusML side would also require to change the way that ML.NET executes graphs created in NimbusML.
I believe that @ganik agreed that trying to change it from NimbusML (with the changes that would be required in ML.NET) would require major changes in the codebase, and so it was better to go with the approach I am using here in this PR as a temporary solution.
If you have something in mind to change it in NimbusML side without requiring major changes, please let me know. Thanks!
In reply to: 381234684 [](ancestors = 381234684,379566192,379256936,379220940)
|
||
if(rawPred.PredictionKind == PredictionKind.BinaryClassification || rawPred.PredictionKind == PredictionKind.MulticlassClassification) | ||
{ | ||
if(scorePipe.Schema.GetColumnOrNull("PredictedLabel")?.Type is KeyDataViewType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is KeyDataViewType [](start = 84, length = 18)
This still doesn't mean that the KeyToValueMappingTransformer
will work - there can be key type columns that don't have key value annotations. I think there is a utility method called something like HasKeyNames
that you can use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HasKeyValues
method requires me to specify the keyValueItemType
. Does anyone know if there's a way to avoid having to specify it, since in here I don't care about the keyValueItemType, as long as it has key value annotations.? @ganik @harishsk (if I leave it null, it defaults to be TextDataViewType)
public static bool HasKeyValues(this DataViewSchema.Column column, PrimitiveDataViewType keyValueItemType = null) | |
{ | |
// False if type is neither a key type, or a vector of key types. | |
if (!(column.Type.GetItemType() is KeyDataViewType keyType)) | |
return false; | |
if (keyValueItemType == null) | |
keyValueItemType = TextDataViewType.Instance; | |
var metaColumn = column.Annotations.Schema.GetColumnOrNull(AnnotationUtils.Kinds.KeyValues); | |
return | |
metaColumn != null | |
&& metaColumn.Value.Type is VectorDataViewType vectorType | |
&& keyType.Count == (ulong)vectorType.Size | |
&& keyValueItemType.Equals(vectorType.ItemType); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about you do something like:
ReadOnlyMemory tmp = default;
if (input.Data.Schema.TryGetAnnotation(TextDataViewType.Instance, AnnotationUtils.Kinds.ScoreValueKind, i, ref tmp) && ReadOnlyMemoryUtils.EqualsStr(AnnotationUtils.Const.ScoreValueKind.PredictedLabel, tmp))
{
In reply to: 382339571 [](ancestors = 382339571)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More general advice: search for "ScoreValueKind.PredictedLabel" and see how in other places we check for such column
In reply to: 382745605 [](ancestors = 382745605,382339571)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can look at the implementation of HasKeyValues and filter out the checks that are relying on keyValueItemType . e.g.
if (!(column.Type.GetItemType() is KeyDataViewType keyType))
return false;
var metaColumn = column.Annotations.Schema.GetColumnOrNull(AnnotationUtils.Kinds.KeyValues);
return
metaColumn != null
&& metaColumn.Value.Type is VectorDataViewType vectorType
&& keyType.Count == (ulong)vectorType.Size)
In reply to: 382339571 [](ancestors = 382339571)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've decided to go with @harishsk suggestion. Please, @yaeldekel let me know if you know about another way of doing this.
In reply to: 382804341 [](ancestors = 382804341,382339571)
I am in the process of adding SlotNames annotations to the exported onnx model. If that works, we should try to use the same technique to add KeyValues annotations also. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM pending the discussion on key type columns having key values.
Plus one comment on using the default column name.
…eyValues annotations and added explanatory comments
I am running with some problems when using this fix with BinaryClassification models coming from NimbusML. For some reason the exported ONNX model can not be loaded back, as it throws an exception when attempting to do that. It seems that, for some reason, exporting KeyToValueTransformers to ONNX, after a binary classification pipeline, is the problem. The error message looks as follows:
Notice that I get a similar error message when exporting to ONNX an ML.NET OVA model (without NimbusML) that is followed by a KeyToValueTransformer. Wonder if there's a connection with the fact that OVA uses Binary classifiers. On the other hand, I haven't been able to reproduce the above error message using only ML.NET (without NimbusML), because apparently there's no way of training Binary classification trainers with a Label column that is of KeyDataViewType (exceptions are thrown if I attempt to do this)... it seems they only work with Boolean Label columns. I don't know why is the reason for this, since NimbusML actually converts non-Boolean labels to KeyDataViewType behind the scenes, and the BinaryClassificationTrainer works with them as expected. Apparently different code paths are followed on each case that stop ML.NET from being capable of doing the same as NimbusML. I wonder if this difference in design has anything to do with the fact that the resulting ONNX model is, for some reason, not well-formed causing the exception above. |
This problem, along with problems with OVA models using this PR, has been fixed by #4878 😃 |
|
||
passThroughColumnNames.IntersectWith(inputKeyDataViewTypeColumnsNames); // Only count those columns that were in the input of the pipeline | ||
|
||
foreach (var name in passThroughColumnNames) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
passThroughColumnNames [](start = 33, length = 22)
Here is another edge case: If the input has a categorical column, say A
, and the pipeline has a CopyTransform
that copies A
into a column named B
.
The way it is written now, we will not apply key-to-value on B
. Should we?
On one hand, we should, because A
and B
should be copies of each other, so it does not make sense if the ONNX model has different types and values for them... On the other hand, I don't see an easy rule that would help decide when we do need to apply key-to-value to B
(when it is the output of Copy), versus when we don't (for example, when it is the output of ToKey).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar issue with Concat: say the input has two categorical columns A
and B
, and the pipeline has a concat transform that concatenates them to produce a column C
...
In reply to: 383580868 [](ancestors = 383580868)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, that is a bit strange side-effect. Perhaps in next iterations we can get more details on how AutoML prefers this. Maybe we wont need to add KeyToValue for passthrough columns at all and let them surface as keys. For now we can document this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When NimbusML deals with category Pandas columns, it automatically convert them into
KeyDataViewType
columns when sending the inputIDataView
to ML.NET. On the way back, it convert those columns back to category columns. This is done without usingKeyToValueTransformer
orValueToKeyTransformer
, so those transformers aren't actually part of the pipeline of the model that's used by NimbusML. Another behavior that NimbusML has, is that for some classifiers it actually adds aKeyToValueTransformer
in the pipeline, but it doesn't add a ValueToKeyTransformer at the end, because it relies on its mechanism of converting the outputKeyDataViewType
columns into the original type of columns.In general these behaviors aren't a problem, but it causes troubles when exporting models to ONNX. Since the models created by nimbusml don't have the appropriate
ValueToKeyTransformers
inside them to map values back from keys, then the generated ONNX models won't have the appropriateLabelEncoders
to do that mapping.In this PR I add a
KeyToValueTransformers
at the end of thetransforms
list that is used bySaveOnnxCommand
, so that the resulting ONNX model can map values back from the keys, even when the NimbusML pipeline doesn't include the transformers to map them back. This is done for two main cases that were identified to affect NimbusML:PredictedLabel
KeyDataViewType
column.KeyDataViewTypes
after NimbusML sent them to ML.NET. In this scenarios, those input columns are expected to be in the output and remained untouched so that NimbusML can automatically map them back. But when using ONNX models generated by NimbusML without the changes in this PR, then the output of this scenario for ONNX are columns that are no longerKeyDataViewTypes
, and that contain only keys. So for this scenario it is also necessary to add the correspondingKeyToValueTransformers
. That way the user will get back the values that they're expecting, although this conversion wouldn't be done by NimbusML itself, but by the added transformers. The columns that were categorical in NimbusML, and then KeyDataViewTypes, and that remained untouched by the pipeline, are called "pass through" columns inside the changes of this PR, and in other related discussions.This is thought to be a temporary solution, to unblock Azure AutoML to work with NimbusML. Ideally these transformers would be added in NimbusML pipeline, but currently it's not possible to add transforms after predictors in NimbusML, and doing these changes would also need to change how NimbusML converts between IDataView and Pandas DataFrames. So for the time being the solution in here was found to be the best.
Fixes microsoft/NimbusML#431 and microsoft/NimbusML#426