StopWordsRemovingEstimator export to Onnx #5279

Lynx1820 · 2020-07-02T18:57:32Z

Exports StopWordsRemovingEstimator and CustomStopWordsRemovingEstimator to ONNX.
One test currently fails: the edge case where every word in the text is removed. ML.NET produces an empty array, while ONNX produces an array of length 1 containing the empty string. Any suggestions for how to handle this case?

Note: The decision was to address the issue mentioned above in another PR.
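
To make the edge case concrete, here is a minimal sketch in plain C#. It does not call ML.NET or ONNX Runtime: `RemoveStopWords` is a hypothetical helper that stands in for the transformer, and the `onnxStyle` array is a hand-written stand-in for the output described above, not a captured result.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordEdgeCase
{
    // Hypothetical helper (not the ML.NET implementation):
    // keeps only the tokens that are not in the stop-word set.
    public static string[] RemoveStopWords(IEnumerable<string> tokens, ISet<string> stopWords) =>
        tokens.Where(t => !stopWords.Contains(t)).ToArray();

    public static void Main()
    {
        var stopWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "the", "a", "is" };

        // Every token is a stop word, so nothing survives: an empty array.
        string[] mlnetStyle = RemoveStopWords(new[] { "the", "a" }, stopWords);

        // Assumed stand-in for the exported model's output in the same case:
        // a single element holding the empty string.
        string[] onnxStyle = { "" };

        Console.WriteLine(mlnetStyle.Length);                   // 0
        Console.WriteLine(onnxStyle.Length);                    // 1
        Console.WriteLine(mlnetStyle.SequenceEqual(onnxStyle)); // False
    }
}
```

An element-wise comparison of the two outputs therefore fails on length alone, before any values are compared.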

harishsk · 2020-07-02T19:49:40Z

src/Microsoft.ML.Transforms/Text/StopWordsRemovingTransformer.cs

+            {
+                var opType = "Squeeze";
+                var squeezeOutput = ctx.AddIntermediateVariable(null, "SqueezeOutput", true);
+                var node = ctx.CreateNode(opType, srcVariableName, squeezeOutput, ctx.GetNodeName(opType), "");


Is it possible to avoid skipping the shape and type addition? #Resolved

I skipped the shape because the shape of the tokenized word vector is not known before inference time; the text may tokenize into any number of words. #Resolved

Other text transformers that also infer shape:

https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Transforms/Text/WordTokenizing.cs

https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Transforms/Text/TokenizingByCharacters.cs #Resolved

harishsk · 2020-07-02T19:52:01Z

src/Microsoft.ML.Transforms/Text/StopWordsRemovingTransformer.cs

+                var opType = "Squeeze";
+                var squeezeOutput = ctx.AddIntermediateVariable(null, "SqueezeOutput", true);
+                var node = ctx.CreateNode(opType, srcVariableName, squeezeOutput, ctx.GetNodeName(opType), "");
+                node.AddAttribute("axes", new long[] { 0 });


Not in this PR, but in a different PR it may be worth considering changing the default domain for CreateNode to "ai.onnx" instead of "ai.onnx.ml". The latter has very few ops and we mostly use operators from "ai.onnx", so it makes sense for that to be the default.

src/Microsoft.ML.Transforms/Text/StopWordsRemovingTransformer.cs

harishsk · 2020-07-02T19:56:47Z

Can you check whether the number of elements returned is the same?

codecov · 2020-07-02T20:20:52Z

Codecov Report

Merging #5279 into master will increase coverage by 0.00%.
The diff coverage is 96.80%.

@@           Coverage Diff           @@
##           master    #5279   +/-   ##
=======================================
  Coverage   73.68%   73.68%           
=======================================
  Files        1022     1022           
  Lines      190366   190348   -18     
  Branches    20474    20472    -2     
=======================================
+ Hits       140265   140267    +2     
+ Misses      44568    44548   -20     
  Partials     5533     5533

Flag	Coverage Δ
#Debug	`73.68% <96.80%> (+<0.01%)`	⬆️
#production	`69.42% <94.73%> (+<0.01%)`	⬆️
#test	`87.65% <100.00%> (+0.03%)`	⬆️

Impacted Files	Coverage Δ
...ML.Transforms/Text/StopWordsRemovingTransformer.cs	`86.67% <94.73%> (+0.89%)`	⬆️
test/Microsoft.ML.Tests/OnnxConversionTest.cs	`96.66% <100.00%> (+0.08%)`	⬆️
...c/Microsoft.ML.SamplesUtils/SamplesDatasetUtils.cs	`40.00% <0.00%> (-3.46%)`	⬇️
src/Microsoft.ML.CodeGenerator/Utils.cs	`59.20% <0.00%> (-2.12%)`	⬇️
src/Microsoft.ML.Data/Training/TrainerUtils.cs	`65.86% <0.00%> (-1.01%)`	⬇️
...dardTrainers/Standard/Online/AveragedPerceptron.cs	`89.70% <0.00%> (-0.58%)`	⬇️
...rc/Microsoft.ML.LightGbm/LightGbmRankingTrainer.cs	`88.00% <0.00%> (-0.38%)`	⬇️
...c/Microsoft.ML.Data/DataLoadSave/EstimatorChain.cs	`89.65% <0.00%> (-0.35%)`	⬇️
src/Microsoft.ML.Data/Prediction/Calibrator.cs	`80.45% <0.00%> (-0.27%)`	⬇️
src/Microsoft.ML.Sweeper/AsyncSweeper.cs	`71.23% <0.00%> (-0.20%)`	⬇️
... and 29 more

Lynx1820 · 2020-07-02T21:54:03Z

Can you check whether the number of elements returned is the same?

I think this is already done as part of the ONNX testing framework:
https://github.com/dotnet/machinelearning/blob/master/test/Microsoft.ML.TestFramework/BaseTestBaseline.cs#L751

harishsk · 2020-07-02T22:19:57Z

Can you reshape the returned vector in SaveAsOnnx to be compatible with ML.NET?

In reply to: 653235748

Lynx1820 requested a review from a team as a code owner July 2, 2020 18:57

Lynx1820 requested review from wangyems, antoniovs1029 and harishsk July 2, 2020 19:08

harishsk reviewed Jul 2, 2020

Lynx1820 requested a review from harishsk July 10, 2020 22:47

Lynx1820 added 3 commits July 10, 2020 16:12

StopWordsRemoving transformer export to onnx

98d9f5b

format changes

809d77f

adding types

f55873c

Lynx1820 force-pushed the onnx_stop_wrds branch from 86621de to f55873c Compare July 10, 2020 23:13

harishsk approved these changes Jul 10, 2020

Lynx1820 merged commit 7879849 into dotnet:master Jul 11, 2020

ghost locked as resolved and limited conversation to collaborators Mar 18, 2022
