What did you do?
I was testing the inference speed of some custom TensorFlow object detection models.
The tests were run on an i7 with a GPU and CUDA (all correctly installed and working).
What happened?
Inference on a TensorFlow saved_model through Microsoft.ML.TensorFlow.TensorFlowTransformer was about 5-6 times slower than inference on the same model converted to ONNX (using Microsoft.ML.Transforms.Onnx.OnnxTransformer).
Of course, I'm not talking about the first inference, which is notoriously slower.
I get the same result when I run inference on these models from Python: 5-6 times faster than Microsoft.ML.TensorFlow.TensorFlowTransformer.
Python inference: ~65ms
The results can vary by a few milliseconds, depending on CPU/GPU load and the input image, but the ratio is always roughly the same.
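Timings of this kind can be taken with a simple stopwatch loop along these lines (a minimal sketch, not the original code; predictionEngine and sampleInput are hypothetical placeholders for a prediction engine and an input row):

// Minimal timing sketch (hypothetical): measure the average inference time,
// excluding the first, notoriously slower, warm-up inference.
var sw = new System.Diagnostics.Stopwatch();
predictionEngine.Predict(sampleInput); // warm-up, excluded from the measurement
const int runs = 20;
sw.Start();
for (var i = 0; i < runs; i++)
    predictionEngine.Predict(sampleInput);
sw.Stop();
Console.WriteLine($"Average inference: {sw.Elapsed.TotalMilliseconds / runs:F1} ms");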
What did you expect?
I expected the inference times to be comparable across the three inference systems. But...
Especially because, Python aside, I use the same pipeline for the saved_model and the ONNX model (the estimators pipeline is attached at the bottom of this issue).
So, being sure that my system was correctly configured, I investigated the problem in the ML.NET source code and found the catch.
On my system, an i7, Microsoft.ML.Data.DataViewUtils.TryCreateConsolidatingCursor creates up to 7 threads (processor count - 1), and the MakeGetter of TensorFlowTransformer.Mapper is called multiple times by these threads:
Microsoft.ML.TensorFlow.dll!Microsoft.ML.Transforms.TensorFlowTransformer.Mapper.MakeGetter(Microsoft.ML.DataViewRow input, int iinfo, System.Func<int, bool> activeOutput, out System.Action disposer) Line 657 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowTransformerBase.MapperBase.CreateGetters(Microsoft.ML.DataViewRow input, System.Func<int, bool> activeOutput, out System.Action disposer) Line 92 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.Cursor.Cursor(Microsoft.ML.Runtime.IChannelProvider provider, Microsoft.ML.DataViewRowCursor input, Microsoft.ML.Data.RowToRowMapperTransform parent, bool[] active) Line 372 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 213 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.DataViewUtils.TryCreateConsolidatingCursor(out Microsoft.ML.DataViewRowCursor curs, Microsoft.ML.IDataView view, System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, Microsoft.ML.Runtime.IHost host, System.Random rand) Line 127 C#
This is not an error in principle, for sure; it's just the multithreading infrastructure. The real problem is that the same inference, with the same input data and therefore the same output data, is run every time MakeGetter is called by a cursor on the same data row!
I checked this by passing just one row of data to the transformer, and it really happens: 7 cursors at the same row position (0) request the same inference on the same image.
I can see there is already a caching logic meant to avoid this, but...
The outputCache instance is local to the MakeGetter method, so it is created for each cursor. This leads to a useless inference run for each MakeGetter. On my system, with 8 cores and 7 data cursors generated in parallel, the inference is run 7 times on the same data. Now it's clear why the total inference time is about 6 times slower than with ONNX or Python 😉.
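To make the effect concrete, here is a small self-contained analogy in plain C# (not the ML.NET code): when every worker builds its own local cache, the expensive call is repeated once per worker, exactly like one outputCache per cursor.

using System;
using System.Threading;
using System.Threading.Tasks;

// Analogy of the per-cursor cache problem (not the actual TensorFlowTransformer code).
class PerWorkerCacheDemo
{
    static int _inferenceRuns;

    // Stand-in for the expensive model inference on a given row.
    static string RunInference(long position)
    {
        Interlocked.Increment(ref _inferenceRuns);
        Thread.Sleep(50); // simulate the model latency
        return $"result for row {position}";
    }

    static void Main()
    {
        const int workers = 7;      // like the 7 consolidating cursors on an 8-core CPU
        const long rowPosition = 0; // all of them positioned on the same row
        Parallel.For(0, workers, _ =>
        {
            // Each worker creates its own cache (like the local outputCache in MakeGetter),
            // so each one re-runs the inference for the very same row.
            (long Position, string Result)? localCache = null;
            if (localCache == null || localCache.Value.Position != rowPosition)
                localCache = (rowPosition, RunInference(rowPosition));
        });
        Console.WriteLine($"Inference runs with per-worker caches: {_inferenceRuns}"); // prints 7
    }
}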
My estimators pipeline (just for reference)
/// <summary>
/// Return the three estimator pipes of the model
/// </summary>
/// <returns>The pipes</returns>
public sealed override ModelPipes GetPipes()
{
    // Check if it's a Yolo model
    var isYolo = Config.ModelType.ToLower().Contains("yolo");
    // Sizes to use for the input tensor in case they are not specified in the model definition
    var resize = new Size(
        Config.ImageSize.Width > 0 ? Config.ImageSize.Width : _owner.DefaultImageResize.Width > 0 ? _owner.DefaultImageResize.Width : 640,
        Config.ImageSize.Height > 0 ? Config.ImageSize.Height : _owner.DefaultImageResize.Height > 0 ? _owner.DefaultImageResize.Height : 640);
    var shapes = new Queue<int>(!isYolo ? new[] { resize.Width, resize.Height } : new[] { resize.Height, resize.Width });
    // Create the output pipeline
    var outputEstimators = new EstimatorList();
    var dropColumns = new HashSet<string>();
    // Custom mapping of the Yolo models' output
    if (isYolo) {
        // Transform the Yolov5 output into a standard format
        outputEstimators.Add(Context.Transforms.ScoreYolov5());
        // Remove unused Yolo output columns from the output data
        (from c in Config.Outputs select c.ColumnName).ToList().ForEach(c => dropColumns.Add(c));
    }
    // Columns to rename when the tensor name doesn't match the ML.NET column name
    var columnNameTransform = (
        from g in new[] { Config.Inputs, Config.Outputs }
        from c in g
        where c.Name != c.ColumnName
        select c).ToArray();
    // Rename columns
    foreach (var c in columnNameTransform)
        outputEstimators.Add(Context.Transforms.CopyColumns(inputColumnName: c.ColumnName, outputColumnName: c.Name));
    // List of columns to drop
    columnNameTransform.ToList().ForEach(c => dropColumns.Add(c.ColumnName));
    dropColumns.Add("Image");
    dropColumns.Add("ResizedImage");
    Config.Inputs.ToList().ForEach(c => dropColumns.Add(c.ColumnName));
    // Drop the columns
    if (dropColumns.Count > 0)
        outputEstimators.Add(Context.Transforms.DropColumns(dropColumns.ToArray()));
    // Return the three estimator pipes
    return _pipes ??= new()
    {
        // Data input pipe
        Input =
            Context.Transforms.LoadImages(
                inputColumnName: "ImagePath",
                outputColumnName: "Image",
                imageFolder: "")
            .Append(Context.Transforms.ResizeImages(
                inputColumnName: "Image",
                outputColumnName: "ResizedImage",
                imageWidth: resize.Width,
                imageHeight: resize.Height,
                resizing: ImageResizingEstimator.ResizingKind.Fill))
            .Append(Context.Transforms.ExtractPixels(
                inputColumnName: "ResizedImage",
                outputColumnName: Config.Inputs[0].ColumnName,
                scaleImage: !isYolo ? 1f : 1f / 255f,
                interleavePixelColors: !isYolo,
                outputAsFloatArray: Config.Inputs[0].DataType == typeof(float))),
        // Inference pipe
        Trainer =
            Config.Format switch
            {
                // Onnx
                ODModelConfig.ModelFormat.Onnx =>
                    Context.Transforms.ApplyOnnxModel(
                        inputColumnNames: new[] { Config.Inputs[0].ColumnName },
                        outputColumnNames: (from c in Config.Outputs select c.ColumnName).ToArray(),
                        modelFile: Config.ModelFilePath,
                        shapeDictionary: new Dictionary<string, int[]>()
                        {
                            {
                                Config.Inputs[0].ColumnName,
                                Config.Inputs[0].Dim.Select(d => d > 0 ? d : shapes.Dequeue()).ToArray()
                            }
                        }),
                // TensorFlow saved_model or frozen graph
                var tf when tf == ODModelConfig.ModelFormat.TF2SavedModel || tf == ODModelConfig.ModelFormat.TFFrozenGraph =>
                    Context.Model.LoadTensorFlowModel(Config.ModelFilePath).ScoreTensorFlowModel(
                        inputColumnNames: new[] { Config.Inputs[0].ColumnName },
                        outputColumnNames: (from c in Config.Outputs select c.ColumnName).ToArray()),
                _ => throw new FormatException("Unknown model format")
            },
        // Output pipe
        Output = outputEstimators.GetPipe()
    };
}
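For completeness, a hypothetical sketch of how the three pipes are meant to be chained and used (ModelPipes, Context, GetPipes() and the InputData row type are assumptions about my own code, not ML.NET APIs; the rest is standard ML.NET usage):

// Hypothetical usage sketch: chain the three estimator pipes, fit them on an empty data view
// (no real training is involved, just building the transformer chain) and score an image.
// GetPipes(), Context and InputData are assumptions, not part of ML.NET.
var pipes = GetPipes();
var pipeline = pipes.Input.Append(pipes.Trainer).Append(pipes.Output);
var emptyData = Context.Data.LoadFromEnumerable(Array.Empty<InputData>());
var model = pipeline.Fit(emptyData);
var scored = model.Transform(Context.Data.LoadFromEnumerable(new[]
{
    new InputData { ImagePath = "image.jpg" } // hypothetical input row with the ImagePath column
}));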
My proposal
Make just a small change, without revolutionizing everything: turn the outputCache into a field of the TensorFlowTransformer.Mapper class and pass it to the UpdateCacheIfNeeded method (similarly to the OnnxTransformer, where the cache instance is created only once, in the CreateGetters method).
The current logic will keep working and will reuse the already-inferred cached data while processing the same data row. It will still detect when the row changes (via the Position property stored in the cache) and run a new inference when needed.
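Here is a minimal, self-contained sketch of that caching pattern (an analogy in plain C#, not the actual patch): the cache is created once where the getters are created, and UpdateCacheIfNeeded re-runs the model only when the row position changes.

using System;
using System.Collections.Generic;

// Analogy of the proposed shared cache (not the actual TensorFlowTransformer/OnnxTransformer code).
class SharedCacheDemo
{
    static int _inferenceRuns;

    // Stand-in for one run of the TensorFlow session on a given row:
    // a single run produces all the output tensors of the model.
    static IReadOnlyList<string> RunInference(long position)
    {
        _inferenceRuns++;
        return new[] { $"boxes@{position}", $"scores@{position}", $"classes@{position}" };
    }

    // The cache remembers the row position and the outputs computed for it.
    class OutputCache
    {
        public long Position = -1;
        public IReadOnlyList<string> Outputs = Array.Empty<string>();
    }

    // Re-run the model only when the cursor has moved to a new row.
    static void UpdateCacheIfNeeded(long position, OutputCache cache)
    {
        if (cache.Position != position)
        {
            cache.Outputs = RunInference(position);
            cache.Position = position;
        }
    }

    static void Main()
    {
        var cache = new OutputCache(); // created once, shared by all the getters
        for (long row = 0; row < 2; row++)
        {
            for (var outputColumn = 0; outputColumn < 3; outputColumn++)
            {
                UpdateCacheIfNeeded(row, cache);
                Console.WriteLine(cache.Outputs[outputColumn]);
            }
        }
        Console.WriteLine($"Inference runs: {_inferenceRuns}"); // prints 2 (one per row), not 6
    }
}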
See my pull request #5848 for this change.
With this change, the inference times for the ONNX and the saved_model versions are comparable in my build.
Commits in the pull request:
* Speed up of the inference of saved_model(s).
  Signed-off-by: darth-vader-lg <[email protected]>
* Fixed TensorFlowTransform fitting problem.
  - Fixed the exception while fitting data with more than one input tensor. Followed the OnnxTransformer schema for the data view getters creation.
  Signed-off-by: darth-vader-lg <[email protected]>
* Dispose of the cached tensors in the TensorFlowTransformer.
  - The cached tensors are disposed at the end of the inference operations.
  Signed-off-by: darth-vader-lg <[email protected]>
Source code / logs
Please see:
machinelearning/src/Microsoft.ML.TensorFlow/TensorflowTransform.cs, lines 653 to 665 at ff01708
machinelearning/src/Microsoft.ML.TensorFlow/TensorflowTransform.cs, line 658 at ff01708
machinelearning/src/Microsoft.ML.TensorFlow/TensorflowTransform.cs, line 664 at ff01708
machinelearning/src/Microsoft.ML.TensorFlow/TensorflowTransform.cs, lines 716 to 743 at ff01708
machinelearning/src/Microsoft.ML.OnnxTransformer/OnnxTransform.cs, lines 512 to 531 at ff01708