
Speed-up TensorFlow models inference #5847

Closed
darth-vader-lg opened this issue Jun 17, 2021 · 0 comments · Fixed by #5848

darth-vader-lg (Contributor) commented Jun 17, 2021

System information

  • OS version/distro:

OS Name: Windows
OS Version: 10.0.19043
OS Platform: Windows

  • .NET Version (e.g., dotnet --info):

Version: 5.0.204
Commit: 84d1fe1bb7

Runtime environment:

OS Name: Windows
OS Version: 10.0.19043
OS Platform: Windows
RID: win10-x64
Base Path: C:\Program Files\dotnet\sdk\5.0.204\

Host (useful for support):
Version: 5.0.7
Commit: 556582d964

.NET SDKs installed:
2.1.202 [C:\Program Files\dotnet\sdk]
3.1.301 [C:\Program Files\dotnet\sdk]
3.1.410 [C:\Program Files\dotnet\sdk]
5.0.104 [C:\Program Files\dotnet\sdk]
5.0.202 [C:\Program Files\dotnet\sdk]
5.0.203 [C:\Program Files\dotnet\sdk]
5.0.204 [C:\Program Files\dotnet\sdk]

.NET runtimes installed:
Microsoft.AspNetCore.All 2.1.19 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.1.27 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.1.28 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.1.19 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.1.27 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.1.28 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.5 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.14 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.15 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.16 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 5.0.4 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 5.0.5 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 5.0.6 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 5.0.7 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.NETCore.App 2.0.9 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.19 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.27 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.28 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.5 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.14 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.15 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.16 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 5.0.4 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 5.0.5 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 5.0.6 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 5.0.7 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.WindowsDesktop.App 3.1.5 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 3.1.14 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 3.1.15 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 3.1.16 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 5.0.4 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 5.0.5 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 5.0.6 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 5.0.7 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

Issue

  • What did you do?
    I was testing the inference speed of some custom TensorFlow object detection models.
    The tests were done on an i7 with a GPU and CUDA (all correctly installed and working).
  • What happened?
    Inference on a TensorFlow saved_model via Microsoft.ML.TensorFlow.TensorFlowTransformer was about 5 to 6 times slower than inference on the same model converted to ONNX (using Microsoft.ML.Transforms.Onnx.OnnxTransformer for the inference).
    Of course, I'm not talking about the first inference, which is notoriously slower.
    Same result if I run inference on these models from Python: 5 to 6 times faster than Microsoft.ML.TensorFlow.TensorFlowTransformer.
  • Microsoft.ML.TensorFlow.TensorFlowTransformer inference time: ~380ms
  • Microsoft.ML.Transforms.Onnx.OnnxTransformer inference time: ~80ms
  • Python inference time: ~65ms
    The results can vary by a few ms, depending on the CPU/GPU load and the input image, but the ratio is always roughly the same (see the timing sketch after this list).
  • What did you expect?
    I expected the inference times to be comparable across the three systems. But...
    Especially since, Python aside, I use the same pipeline for the saved_model and the ONNX model (the estimator pipeline is attached at the bottom of this issue).
    So, being sure that my system was correctly configured, I investigated the problem in the ML.NET source code and found the catch.
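
For reference, a minimal sketch of how the timings quoted above can be taken, assuming model is the fitted transformer and data is the input IDataView built from the pipes shown at the bottom of this issue:

using System.Diagnostics;
using Microsoft.ML;

static double MeasureAverageInferenceMs(ITransformer model, IDataView data, int runs = 10)
{
    // Warm-up run: the first inference is notoriously slower (graph initialization).
    model.Transform(data).Preview();
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < runs; i++)
        model.Transform(data).Preview(); // Preview() forces the lazy pipeline to execute
    sw.Stop();
    return sw.ElapsedMilliseconds / (double)runs;
}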

Source code / logs

Please see here:

protected override Delegate MakeGetter(DataViewRow input, int iinfo, Func<int, bool> activeOutput, out Action disposer)
{
    disposer = null;
    Host.AssertValue(input);
    var outputCache = new OutputCache();
    var activeOutputColNames = _parent.Outputs.Where((x, i) => activeOutput(i)).ToArray();
    var type = Tf2MlNetType(_parent.TFOutputTypes[iinfo]).RawType;
    Host.Assert(type == _parent.OutputTypes[iinfo].GetItemType().RawType);
    var srcTensorGetters = GetTensorValueGetters(input, _inputColIndices, _isInputVector, _parent.TFInputTypes, _fullySpecifiedShapes);
    return Utils.MarshalInvoke(MakeGetter<int>, type, input, iinfo, srcTensorGetters, activeOutputColNames, outputCache);
}

In my system, which has an i7, Microsoft.ML.Data.DataViewUtils.TryCreateConsolidatingCursor creates up to 7 threads (processor count - 1), and the MakeGetter of the TensorFlowTransformer.Mapper is called multiple times by these threads:

Microsoft.ML.TensorFlow.dll!Microsoft.ML.Transforms.TensorFlowTransformer.Mapper.MakeGetter(Microsoft.ML.DataViewRow input, int iinfo, System.Func<int, bool> activeOutput, out System.Action disposer) Line 657 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowTransformerBase.MapperBase.CreateGetters(Microsoft.ML.DataViewRow input, System.Func<int, bool> activeOutput, out System.Action disposer) Line 92 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.Cursor.Cursor(Microsoft.ML.Runtime.IChannelProvider provider, Microsoft.ML.DataViewRowCursor input, Microsoft.ML.Data.RowToRowMapperTransform parent, bool[] active) Line 372 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 213 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.RowToRowMapperTransform.GetRowCursorSet(System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, int n, System.Random rand) Line 204 C#
Microsoft.ML.Data.dll!Microsoft.ML.Data.DataViewUtils.TryCreateConsolidatingCursor(out Microsoft.ML.DataViewRowCursor curs, Microsoft.ML.IDataView view, System.Collections.Generic.IEnumerable<Microsoft.ML.DataViewSchema.Column> columnsNeeded, Microsoft.ML.Runtime.IHost host, System.Random rand) Line 127 C#
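
As a side note, until a fix lands, a possible mitigation (my assumption, not something I have benchmarked exhaustively) is to score single rows through a PredictionEngine, which maps one row at a time instead of spinning up a cursor set:

using Microsoft.ML;

// Hypothetical input/output row types; the property names must match the
// pipeline's column names (ImagePath feeds the LoadImages step shown below).
public class ModelInput
{
    public string ImagePath { get; set; }
}

public class ModelOutput
{
    public float[] detection_scores { get; set; }
}

// `mlContext` and the fitted `model` (an ITransformer) are assumed to exist.
// A PredictionEngine maps a single row at a time, so the parallel cursor
// fan-out shown in the stack above never kicks in. Note, however, that each
// output column still gets its own cache, so multiple active output columns
// still trigger redundant graph runs until the bug is fixed.
var engine = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(model);
var result = engine.Predict(new ModelInput { ImagePath = "test.jpg" });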

This is not an error in principle, for sure; it's the multithreading infrastructure at work. The real problem is that the same inference, with the same input data and therefore the same output data, is executed every time MakeGetter is called by a cursor on the same data row!
I checked this by passing just a single row of data to the transformer, and it really happens: 7 cursors at the same row position (0) requesting the same inference on the same image.
I can see there is already a cache logic to avoid this, but...

var outputCache = new OutputCache();
return Utils.MarshalInvoke(MakeGetter<int>, type, input, iinfo, srcTensorGetters, activeOutputColNames, outputCache);
and
private void UpdateCacheIfNeeded(long position, ITensorValueGetter[] srcTensorGetters, string[] activeOutputColNames, OutputCache outputCache)
{
    if (outputCache.Position != position)
    {
        if (_parent.Graph.graph_key != tf.get_default_graph().graph_key)
            _parent.Session.graph.as_default();
        Runner runner = new Runner(_parent.Session, _parent.Inputs.ToArray(), _parent.Outputs.ToArray());

        // Feed inputs to the graph.
        for (int i = 0; i < _parent.Inputs.Length; i++)
            runner.AddInput(srcTensorGetters[i].GetTensor(), i);

        // Execute the graph.
        var tensors = runner.Run();
        runner.Dispose();

        Contracts.Assert(tensors.Length > 0);
        for (int j = 0; j < activeOutputColNames.Length; j++)
        {
            if (outputCache.Outputs.TryGetValue(activeOutputColNames[j], out Tensor outTensor))
                outTensor.Dispose();
            outputCache.Outputs[activeOutputColNames[j]] = tensors[j];
        }
        outputCache.Position = position;
    }
}

But the outputCache instance is local to the MakeGetter method, so it's created for each cursor. This leads to a useless extra run of the inference for each MakeGetter. On my system, with 8 cores and 7 parallel data cursors, the inference runs 7 times on the same data. Now it's clear why the total inference time is about 6 times slower than with ONNX or Python 😉.
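
The redundancy is easy to observe directly. A sketch, reusing the hypothetical ModelInput row type from the mitigation sketch above:

using Microsoft.ML;

// A one-element input: every cursor of the set can only ever sit on row 0,
// so any graph run beyond the first one is redundant work.
var data = mlContext.Data.LoadFromEnumerable(new[]
{
    new ModelInput { ImagePath = "single-image.jpg" }
});

// Materializing the scored view spins up the cursor set; a breakpoint in
// UpdateCacheIfNeeded is hit once per cursor (7 times here) for the same Position.
var scored = model.Transform(data);
scored.Preview();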

My estimators pipeline (just for reference)

/// <summary>
/// Return the three estimators pipes of the model
/// </summary>
/// <returns>The pipes</returns>
public sealed override ModelPipes GetPipes()
{
   // Check if it's a Yolo model
   var isYolo = Config.ModelType.ToLower().Contains("yolo");
   // Sizes to use for the input tensor in the case that they are not specified in the model definition
   var resize = new Size(
      Config.ImageSize.Width > 0 ? Config.ImageSize.Width : _owner.DefaultImageResize.Width > 0 ? _owner.DefaultImageResize.Width : 640,
      Config.ImageSize.Height > 0 ? Config.ImageSize.Height : _owner.DefaultImageResize.Height > 0 ? _owner.DefaultImageResize.Height : 640);
   var shapes = new Queue<int>(!isYolo ? new[] { resize.Width, resize.Height } : new[] { resize.Height, resize.Width });
   // Create the output pipeline
   var outputEstimators = new EstimatorList();
   var dropColumns = new HashSet<string>();
   // Custom mapping of the Yolo models' output
   if (isYolo) {
      // Transform the Yolov5 output in a standard way
      outputEstimators.Add(Context.Transforms.ScoreYolov5());
      // Remove unused Yolo's output columns from the output data
      (from c in Config.Outputs select c.ColumnName).ToList().ForEach(c => dropColumns.Add(c));
   }
   // Columns to transform in the case that the name of the tensor doesn't match to the ML.NET column name
   var columnNameTransform = (
      from g in new[] { Config.Inputs, Config.Outputs }
      from c in g
      where c.Name != c.ColumnName
      select c).ToArray();
   // Rename columns
   foreach (var c in columnNameTransform)
      outputEstimators.Add(Context.Transforms.CopyColumns(inputColumnName: c.ColumnName, outputColumnName: c.Name));
   // List of columns to drop
   columnNameTransform.ToList().ForEach(c => dropColumns.Add(c.ColumnName));
   dropColumns.Add("Image");
   dropColumns.Add("ResizedImage");
   Config.Inputs.ToList().ForEach(c => dropColumns.Add(c.ColumnName));
   // Drop the columns
   if (dropColumns.Count > 0)
      outputEstimators.Add(Context.Transforms.DropColumns(dropColumns.ToArray()));
   // Return the three estimators pipes
   return _pipes ??= new()
   {
      // Data input pipe
      Input =
         Context.Transforms.LoadImages(
            inputColumnName: "ImagePath",
            outputColumnName: "Image",
            imageFolder: "")
         .Append(Context.Transforms.ResizeImages(
            inputColumnName: "Image",
            outputColumnName: "ResizedImage",
            imageWidth: resize.Width,
            imageHeight: resize.Height,
            resizing: ImageResizingEstimator.ResizingKind.Fill))
         .Append(Context.Transforms.ExtractPixels(
            inputColumnName: "ResizedImage",
            outputColumnName: Config.Inputs[0].ColumnName,
            scaleImage: !isYolo ? 1f : 1f / 255f,
            interleavePixelColors: !isYolo,
            outputAsFloatArray: Config.Inputs[0].DataType == typeof(float))),
      // Inference pipe
      Trainer =
         Config.Format switch
         {
            // Onnx
            ODModelConfig.ModelFormat.Onnx =>
               Context.Transforms.ApplyOnnxModel(
                  inputColumnNames: new[] { Config.Inputs[0].ColumnName },
                  outputColumnNames: (from c in Config.Outputs select c.ColumnName).ToArray(),
                  modelFile: Config.ModelFilePath,
                  shapeDictionary: new Dictionary<string, int[]>()
                  {
                     {
                        Config.Inputs[0].ColumnName,
                        Config.Inputs[0].Dim.Select(d => d > 0 ? d : shapes.Dequeue()).ToArray()
                     }
                  }),
            // TensorFlow saved_model or frozen graph
            var tf when tf == ODModelConfig.ModelFormat.TF2SavedModel || tf == ODModelConfig.ModelFormat.TFFrozenGraph =>
               Context.Model.LoadTensorFlowModel(Config.ModelFilePath).ScoreTensorFlowModel(
                  inputColumnNames: new[] { Config.Inputs[0].ColumnName },
                  outputColumnNames: (from c in Config.Outputs select c.ColumnName).ToArray()),
            _ => throw new FormatException("Unknown model format")
         },
      // Output pipe
      Output = outputEstimators.GetPipe()
   };
}
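
For context, the three pipes above are then composed and fitted roughly like this (a sketch; ModelPipes, GetPipes and the data variable come from my own code, not from ML.NET):

// Compose the three estimator pipes into a single chain, fit it and score.
var pipes = GetPipes();
var pipeline = pipes.Input
   .Append(pipes.Trainer)
   .Append(pipes.Output);
var model = pipeline.Fit(data);
var scored = model.Transform(data);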

My proposal

Make just a small change, without overhauling everything: turn the outputCache into a field of the TensorFlowTransformer.Mapper class and pass it to the UpdateCacheIfNeeded method, doing similarly to the OnnxTransformer, where the cache instance is created only once, in the CreateGetters method:

public override Delegate[] CreateGetters(DataViewRow input, Func<int, bool> activeOutput, out Action disposer)
{
    Contracts.Assert(input.Schema == InputSchema);
    OnnxRuntimeOutputCacher outputCacher = new OnnxRuntimeOutputCacher();
    int n = OutputColumns.Value.Length;
    var result = new Delegate[n];
    for (int i = 0; i < n; i++)
    {
        if (!activeOutput(i))
            continue;
        result[i] = CreateGetter(input, i, activeOutput, outputCacher);
    }
    disposer = () =>
    {
        outputCacher.Dispose();
    };
    return result;
}
The current logic will continue to work and to reuse the already-inferred cached data while processing the same data row. It will still catch the row-changed condition (via the Position property of the cache) and run a new inference when needed.
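
In code, the change amounts to something like the following. This is just a sketch following the OnnxTransformer pattern above, with the member names taken from the Mapper code shown earlier; the actual implementation is in the pull request:

// Sketch: create the OutputCache once per set of getters (i.e. once per cursor),
// instead of once per MakeGetter call, and dispose the cached tensors at the end.
public override Delegate[] CreateGetters(DataViewRow input, Func<int, bool> activeOutput, out Action disposer)
{
    Contracts.Assert(input.Schema == InputSchema);
    var outputCache = new OutputCache();
    var activeOutputColNames = _parent.Outputs.Where((x, i) => activeOutput(i)).ToArray();
    var srcTensorGetters = GetTensorValueGetters(input, _inputColIndices, _isInputVector, _parent.TFInputTypes, _fullySpecifiedShapes);
    int n = OutputColumns.Value.Length;
    var result = new Delegate[n];
    for (int i = 0; i < n; i++)
    {
        if (!activeOutput(i))
            continue;
        var type = Tf2MlNetType(_parent.TFOutputTypes[i]).RawType;
        result[i] = Utils.MarshalInvoke(MakeGetter<int>, type, input, i, srcTensorGetters, activeOutputColNames, outputCache);
    }
    disposer = () =>
    {
        // Dispose the cached output tensors when the cursor is disposed.
        foreach (var tensor in outputCache.Outputs.Values)
            tensor.Dispose();
    };
    return result;
}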
See my pull request #5848 for this change.
Now the inference times with both ONNX and saved_model are comparable in my build.

michaelgsharp pushed a commit that referenced this issue Jun 23, 2021
* Speed up of the inference of saved_model(s).

Signed-off-by: darth-vader-lg <[email protected]>

* Fixed TensorFlowTransform fitting problem.
- Fixed the exception while fitting data with more than one input tensor. Followed the OnnxTransformer schema for the data view getters creation.

Signed-off-by: darth-vader-lg <[email protected]>

* Dispose of the cached tensors in the TensorFlowTransformer.
- The cached tensors are disposed at the end of inference operations.

Signed-off-by: darth-vader-lg <[email protected]>
darth-vader-lg added a commit to darth-vader-lg/ML-NET that referenced this issue Jun 26, 2021
* remotes/official/main:
  Update lgbm to v2.3.1 (dotnet#5851)
  Speed-up bitmap operations on images. Fixes dotnet#5856 (dotnet#5857)
  Onnx recursion limit (dotnet#5840)
  Speed up the inference of the saved_model(s). Fixes dotnet#5847 (dotnet#5848)

Signed-off-by: darth-vader-lg <[email protected]>
ghost locked as resolved and limited conversation to collaborators Mar 17, 2022