
Speed-up TensorFlow models inference #5847

Closed
darth-vader-lg opened this issue Jun 17, 2021 · 0 comments · Fixed by #5848

darth-vader-lg (Contributor) commented Jun 17, 2021

System information

  • OS version/distro:

OS Name: Windows
OS Version: 10.0.19043
OS Platform: Windows

  • .NET Version (e.g., dotnet --info):

Version: 5.0.204
Commit: 84d1fe1bb7

Runtime environment:

OS Name: Windows
OS Version: 10.0.19043
OS Platform: Windows
RID: win10-x64
Base Path: C:\Program Files\dotnet\sdk\5.0.204\

Host (useful for support):
Version: 5.0.7
Commit: 556582d964

.NET SDKs installed:
2.1.202 [C:\Program Files\dotnet\sdk]
3.1.301 [C:\Program Files\dotnet\sdk]
3.1.410 [C:\Program Files\dotnet\sdk]
5.0.104 [C:\Program Files\dotnet\sdk]
5.0.202 [C:\Program Files\dotnet\sdk]
5.0.203 [C:\Program Files\dotnet\sdk]
5.0.204 [C:\Program Files\dotnet\sdk]

.NET runtimes installed:
Microsoft.AspNetCore.All 2.1.19 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.1.27 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.1.28 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.1.19 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.1.27 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.1.28 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.5 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.14 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.15 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.16 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 5.0.4 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 5.0.5 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 5.0.6 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 5.0.7 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.NETCore.App 2.0.9 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.19 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.27 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.28 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.5 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.14 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.15 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.16 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 5.0.4 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 5.0.5 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 5.0.6 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 5.0.7 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.WindowsDesktop.App 3.1.5 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 3.1.14 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 3.1.15 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 3.1.16 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 5.0.4 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 5.0.5 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 5.0.6 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 5.0.7 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

Issue

  • What did you do?
    I was testing the inference speed of some custom TensorFlow object detection models.
    The tests were done on an i7 with a GPU and CUDA (all correctly installed and working).
  • What happened?
    Inference on a TensorFlow saved_model via Microsoft.ML.TensorFlow.TensorFlowTransformer was about 5 to 6 times slower than inference on the same model converted to ONNX (using Microsoft.ML.Transforms.Onnx.OnnxTransformer for the inference).
    Of course, I'm not talking about the first inference, which is notoriously slower.
    Same result if I run inference on these models from Python: 5 to 6 times faster than Microsoft.ML.TensorFlow.TensorFlowTransformer.
  • Microsoft.ML.TensorFlow.TensorFlowTransformer inference time: ~380ms
  • Microsoft.ML.Transforms.Onnx.OnnxTransformer inference time: ~80ms
  • Python inference time: ~65ms
    The results can vary by a few ms, depending on the CPU/GPU load and the input image, but the ratio is always roughly the same (see the timing sketch after this list).
  • What did you expect?
    I expected the inference times to be comparable across the three systems. But...
    Especially since, Python aside, I use the same pipeline for the saved_model and the ONNX model (the estimator pipeline is attached at the bottom of this issue).
    So, being sure that my system was correctly configured, I investigated the problem in the ML.NET source code and found the catch.
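
For reference, a minimal sketch of how the timings quoted above can be taken, assuming model is the fitted transformer and data is the input IDataView built from the pipes shown at the bottom of this issue:

using System.Diagnostics;
using Microsoft.ML;

static double MeasureAverageInferenceMs(ITransformer model, IDataView data, int runs = 10)
{
    // Warm-up run: the first inference is notoriously slower (graph initialization).
    model.Transform(data).Preview();
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < runs; i++)
        model.Transform(data).Preview(); // Preview() forces the lazy pipeline to execute
    sw.Stop();
    return sw.ElapsedMilliseconds / (double)runs;
}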

Source code / logs

Please see here:

protected override Delegate MakeGetter(DataViewRow input, int iinfo, Func<int, bool> activeOutput, out Action disposer)
{
    disposer = null;
    Host.AssertValue(input);
    var outputCache = new OutputCache();
    var activeOutputColNames = _parent.Outputs.Where((x, i) => activeOutput(i)).ToArray();
    var type = Tf2MlNetType(_parent.TFOutputTypes[iinfo]).RawType;
    Host.Assert(type == _parent.OutputTypes[iinfo].GetItemType().RawType);
    var srcTensorGetters = GetTensorValueGetters(input, _inputColIndices, _isInputVector, _parent.TFInputTypes, _fullySpecifiedShapes);
    return Utils.MarshalInvoke(MakeGetter<int>, type, input, iinfo, srcTensorGetters, activeOutputColNames, outputCache);
}

In my system, which has an i7, Microsoft.ML.Data.DataViewUtils.TryCreateConsolidatingCursor creates up to 7 threads (processor count - 1), and the MakeGetter of the TensorFlowTransformer.Mapper is called multiple times by these threads:

Microsoft.ML.TensorFlow.dll!Microsoft.ML.Transforms.TensorFlowTransformer.Mapper.MakeGetter(Microsoft.ML.DataViewRow input, int iinfo, System.Func<int, bool> activeOutput, out System.Action disposer) Line 657 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowTransformerBase.MapperBase.CreateGetters(Microsoft.ML.DataViewRow input, System.Func<int, bool> activeOutput, out System.Action disposer) Line 92 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.Cursor.Cursor(Microsoft.ML.Runtime.IChannelProvider provider, Microsoft.ML.DataViewRowCursor input, Microsoft.ML.Data.RowToRowMapperTransform parent, bool[] active) Line 372 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 213 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.DataViewUtils.TryCreateConsolidatingCursor(out Microsoft.ML.DataViewRowCursor curs, Microsoft.ML.IDataView view, System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, Microsoft.ML.Runtime.IHost host, System.Random rand) Line 127 C#
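
As a side note, until a fix lands, a possible mitigation (my assumption, not something I have benchmarked exhaustively) is to score single rows through a PredictionEngine, which maps one row at a time instead of spinning up a cursor set:

using Microsoft.ML;

// Hypothetical input/output row types; the property names must match the
// pipeline's column names (ImagePath feeds the LoadImages step shown below).
public class ModelInput
{
    public string ImagePath { get; set; }
}

public class ModelOutput
{
    public float[] detection_scores { get; set; }
}

// `mlContext` and the fitted `model` (an ITransformer) are assumed to exist.
// A PredictionEngine maps a single row at a time, so the parallel cursor
// fan-out shown in the stack above never kicks in. Note, however, that each
// output column still gets its own cache, so multiple active output columns
// still trigger redundant graph runs until the bug is fixed.
var engine = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(model);
var result = engine.Predict(new ModelInput { ImagePath = "test.jpg" });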

This is not an error in principle, for sure; it's the multithreading infrastructure at work. The real problem is that the same inference, with the same input data and therefore the same output data, is executed every time MakeGetter is called by a cursor on the same data row!
I checked this by passing just a single row of data to the transformer, and it really happens: 7 cursors at the same row position (0) requesting the same inference on the same image.
I can see there is already a cache logic to avoid this, but...

var outputCache = new OutputCache();
return Utils.MarshalInvoke(MakeGetter<int>, type, input, iinfo, srcTensorGetters, activeOutputColNames, outputCache);
and
private void UpdateCacheIfNeeded(long position, ITensorValueGetter[] srcTensorGetters, string[] activeOutputColNames, OutputCache outputCache)
{
    if (outputCache.Position != position)
    {
        if (_parent.Graph.graph_key != tf.get_default_graph().graph_key)
            _parent.Session.graph.as_default();
        Runner runner = new Runner(_parent.Session, _parent.Inputs.ToArray(), _parent.Outputs.ToArray());

        // Feed inputs to the graph.
        for (int i = 0; i < _parent.Inputs.Length; i++)
            runner.AddInput(srcTensorGetters[i].GetTensor(), i);

        // Execute the graph.
        var tensors = runner.Run();
        runner.Dispose();

        Contracts.Assert(tensors.Length > 0);
        for (int j = 0; j < activeOutputColNames.Length; j++)
        {
            if (outputCache.Outputs.TryGetValue(activeOutputColNames[j], out Tensor outTensor))
                outTensor.Dispose();
            outputCache.Outputs[activeOutputColNames[j]] = tensors[j];
        }
        outputCache.Position = position;
    }
}

But the outputCache instance is local to the MakeGetter method, so it's created for each cursor. This leads to a useless extra run of the inference for each MakeGetter. On my system, with 8 cores and 7 parallel data cursors, the inference runs 7 times on the same data. Now it's clear why the total inference time is about 6 times slower than with ONNX or Python 😉.
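
The redundancy is easy to observe directly. A sketch, reusing the hypothetical ModelInput row type from the mitigation sketch above:

using Microsoft.ML;

// A one-element input: every cursor of the set can only ever sit on row 0,
// so any graph run beyond the first one is redundant work.
var data = mlContext.Data.LoadFromEnumerable(new[]
{
    new ModelInput { ImagePath = "single-image.jpg" }
});

// Materializing the scored view spins up the cursor set; a breakpoint in
// UpdateCacheIfNeeded is hit once per cursor (7 times here) for the same Position.
var scored = model.Transform(data);
scored.Preview();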

My estimators pipeline (just for reference)

/// <summary>
/// Return the three estimators pipes of the model
/// </summary>
/// <returns>The pipes</returns>
public sealed override ModelPipes GetPipes()
{
   // Check if it's a Yolo model
   var isYolo = Config.ModelType.ToLower().Contains("yolo");
   // Sizes to use for the input tensor in the case that they are not specified in the model definition
   var resize = new Size(
      Config.ImageSize.Width > 0 ? Config.ImageSize.Width : _owner.DefaultImageResize.Width > 0 ? _owner.DefaultImageResize.Width : 640,
      Config.ImageSize.Height > 0 ? Config.ImageSize.Height : _owner.DefaultImageResize.Height > 0 ? _owner.DefaultImageResize.Height : 640);
   var shapes = new Queue<int>(!isYolo ? new[] { resize.Width, resize.Height } : new[] { resize.Height, resize.Width });
   // Create the output pipeline
   var outputEstimators = new EstimatorList();
   var dropColumns = new HashSet<string>();
   // Custom mapping of the Yolo models' output
   if (isYolo) {
      // Transform the Yolov5 output in a standard way
      outputEstimators.Add(Context.Transforms.ScoreYolov5());
      // Remove unused Yolo's output columns from the output data
      (from c in Config.Outputs select c.ColumnName).ToList().ForEach(c => dropColumns.Add(c));
   }
   // Columns to transform in the case that the name of the tensor doesn't match to the ML.NET column name
   var columnNameTransform = (
      from g in new[] { Config.Inputs, Config.Outputs }
      from c in g
      where c.Name != c.ColumnName
      select c).ToArray();
   // Rename columns
   foreach (var c in columnNameTransform)
      outputEstimators.Add(Context.Transforms.CopyColumns(inputColumnName: c.ColumnName, outputColumnName: c.Name));
   // List of columns to drop
   columnNameTransform.ToList().ForEach(c => dropColumns.Add(c.ColumnName));
   dropColumns.Add("Image");
   dropColumns.Add("ResizedImage");
   Config.Inputs.ToList().ForEach(c => dropColumns.Add(c.ColumnName));
   // Drop the columns
   if (dropColumns.Count > 0)
      outputEstimators.Add(Context.Transforms.DropColumns(dropColumns.ToArray()));
   // Return the three estimators pipes
   return _pipes ??= new()
   {
      // Data input pipe
      Input =
         Context.Transforms.LoadImages(
            inputColumnName: "ImagePath",
            outputColumnName: "Image",
            imageFolder: "")
         .Append(Context.Transforms.ResizeImages(
            inputColumnName: "Image",
            outputColumnName: "ResizedImage",
            imageWidth: resize.Width,
            imageHeight: resize.Height,
            resizing: ImageResizingEstimator.ResizingKind.Fill))
         .Append(Context.Transforms.ExtractPixels(
            inputColumnName: "ResizedImage",
            outputColumnName: Config.Inputs[0].ColumnName,
            scaleImage: !isYolo ? 1f : 1f / 255f,
            interleavePixelColors: !isYolo,
            outputAsFloatArray: Config.Inputs[0].DataType == typeof(float))),
      // Inference pipe
      Trainer =
         Config.Format switch
         {
            // Onnx
            ODModelConfig.ModelFormat.Onnx =>
               Context.Transforms.ApplyOnnxModel(
                  inputColumnNames: new[] { Config.Inputs[0].ColumnName },
                  outputColumnNames: (from c in Config.Outputs select c.ColumnName).ToArray(),
                  modelFile: Config.ModelFilePath,
                  shapeDictionary: new Dictionary<string, int[]>()
                  {
                     {
                        Config.Inputs[0].ColumnName,
                        Config.Inputs[0].Dim.Select(d => d > 0 ? d : shapes.Dequeue()).ToArray()
                     }
                  }),
            // TensorFlow saved_model or frozen graph
            var tf when tf == ODModelConfig.ModelFormat.TF2SavedModel || tf == ODModelConfig.ModelFormat.TFFrozenGraph =>
               Context.Model.LoadTensorFlowModel(Config.ModelFilePath).ScoreTensorFlowModel(
                  inputColumnNames: new[] { Config.Inputs[0].ColumnName },
                  outputColumnNames: (from c in Config.Outputs select c.ColumnName).ToArray()),
            _ => throw new FormatException("Unknown model format")
         },
      // Output pipe
      Output = outputEstimators.GetPipe()
   };
}
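
For context, the three pipes above are then composed and fitted roughly like this (a sketch; ModelPipes, GetPipes and the data variable come from my own code, not from ML.NET):

// Compose the three estimator pipes into a single chain, fit it and score.
var pipes = GetPipes();
var pipeline = pipes.Input
   .Append(pipes.Trainer)
   .Append(pipes.Output);
var model = pipeline.Fit(data);
var scored = model.Transform(data);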

My proposal

Make just a small change, without overhauling everything: turn the outputCache into a field of the TensorFlowTransformer.Mapper class and pass it to the UpdateCacheIfNeeded method, doing similarly to the OnnxTransformer, where the cache instance is created only once, in the CreateGetters method:

public override Delegate[] CreateGetters(DataViewRow input, Func<int, bool> activeOutput, out Action disposer)
{
    Contracts.Assert(input.Schema == InputSchema);
    OnnxRuntimeOutputCacher outputCacher = new OnnxRuntimeOutputCacher();
    int n = OutputColumns.Value.Length;
    var result = new Delegate[n];
    for (int i = 0; i < n; i++)
    {
        if (!activeOutput(i))
            continue;
        result[i] = CreateGetter(input, i, activeOutput, outputCacher);
    }
    disposer = () =>
    {
        outputCacher.Dispose();
    };
    return result;
}
The current logic will continue to work and to reuse the already-inferred cached data while processing the same data row. It will still catch the row-changed condition (via the Position property of the cache) and run a new inference when needed.
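
In code, the change amounts to something like the following. This is just a sketch following the OnnxTransformer pattern above, with the member names taken from the Mapper code shown earlier; the actual implementation is in the pull request:

// Sketch: create the OutputCache once per set of getters (i.e. once per cursor),
// instead of once per MakeGetter call, and dispose the cached tensors at the end.
public override Delegate[] CreateGetters(DataViewRow input, Func<int, bool> activeOutput, out Action disposer)
{
    Contracts.Assert(input.Schema == InputSchema);
    var outputCache = new OutputCache();
    var activeOutputColNames = _parent.Outputs.Where((x, i) => activeOutput(i)).ToArray();
    var srcTensorGetters = GetTensorValueGetters(input, _inputColIndices, _isInputVector, _parent.TFInputTypes, _fullySpecifiedShapes);
    int n = OutputColumns.Value.Length;
    var result = new Delegate[n];
    for (int i = 0; i < n; i++)
    {
        if (!activeOutput(i))
            continue;
        var type = Tf2MlNetType(_parent.TFOutputTypes[i]).RawType;
        result[i] = Utils.MarshalInvoke(MakeGetter<int>, type, input, i, srcTensorGetters, activeOutputColNames, outputCache);
    }
    disposer = () =>
    {
        // Dispose the cached output tensors when the cursor is disposed.
        foreach (var tensor in outputCache.Outputs.Values)
            tensor.Dispose();
    };
    return result;
}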
See my pull request #5848 for this change.
Now the inference times with both ONNX and saved_model are comparable in my build.

michaelgsharp pushed a commit that referenced this issue Jun 23, 2021
* Speed up of the inference of saved_model(s).

Signed-off-by: darth-vader-lg <[email protected]>

* Fixed TensorFlowTransform fitting problem.
- Fixed the exception while fitting data with more than one input tensor. Followed the OnnxTransformer schema for the data view getters creation.

Signed-off-by: darth-vader-lg <[email protected]>

* Dispose of the cached tensors in the TensorFlowTransformer.
- The cached tensors are disposed at the end of inference operations.

Signed-off-by: darth-vader-lg <[email protected]>
darth-vader-lg added a commit to darth-vader-lg/ML-NET that referenced this issue Jun 26, 2021
* remotes/official/main:
  Update lgbm to v2.3.1 (dotnet#5851)
  Speed-up bitmap operations on images. Fixes dotnet#5856 (dotnet#5857)
  Onnx recursion limit (dotnet#5840)
  Speed up the inference of the saved_model(s). Fixes dotnet#5847 (dotnet#5848)

Signed-off-by: darth-vader-lg <[email protected]>
ghost locked as resolved and limited conversation to collaborators Mar 17, 2022