Skip to content

StopWordsRemovingEstimator export to Onnx #5279

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 11, 2020

Conversation

Lynx1820
Copy link
Contributor

@Lynx1820 Lynx1820 commented Jul 2, 2020

  • Exporting StopWordsRemovingEstimator/CustomStopWordsRemovingEstimator to Onnx.
  • A test which currently fails is the edge case where all the words in the text get removed. While Ml.net produces an empty array, Onnx produces an array of length 1 with the empty string. Suggestions for how handle this case?

Note: The decision was address the issue mentioned above in another PR.

@Lynx1820 Lynx1820 requested a review from a team as a code owner July 2, 2020 18:57
{
var opType = "Squeeze";
var squeezeOutput = ctx.AddIntermediateVariable(null, "SqueezeOutput", true);
var node = ctx.CreateNode(opType, srcVariableName, squeezeOutput, ctx.GetNodeName(opType), "");
Copy link
Contributor

@harishsk harishsk Jul 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to avoid skipping the shape and type addition? #Resolved

Copy link
Contributor Author

@Lynx1820 Lynx1820 Jul 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I skipped the shape is because the shape of the tokenized word vector is not known prior to inference time. The number of words tokenized may be any number. #Resolved

Copy link
Contributor Author

@Lynx1820 Lynx1820 Jul 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var opType = "Squeeze";
var squeezeOutput = ctx.AddIntermediateVariable(null, "SqueezeOutput", true);
var node = ctx.CreateNode(opType, srcVariableName, squeezeOutput, ctx.GetNodeName(opType), "");
node.AddAttribute("axes", new long[] { 0 });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR, but in a different PR it maybe worth considering changing the default domain for CreateNode to be "ai.onnx" and not "ai.onnx.ml". The latter has very few ops and we mostly use operators from "ai.onnx" and it makes sense to retain that as the default.

@harishsk
Copy link
Contributor

harishsk commented Jul 2, 2020

Can you check whether the number of elements returned is the same?

@codecov
Copy link

codecov bot commented Jul 2, 2020

Codecov Report

Merging #5279 into master will increase coverage by 0.00%.
The diff coverage is 96.80%.

@@           Coverage Diff           @@
##           master    #5279   +/-   ##
=======================================
  Coverage   73.68%   73.68%           
=======================================
  Files        1022     1022           
  Lines      190366   190348   -18     
  Branches    20474    20472    -2     
=======================================
+ Hits       140265   140267    +2     
+ Misses      44568    44548   -20     
  Partials     5533     5533           
Flag Coverage Δ
#Debug 73.68% <96.80%> (+<0.01%) ⬆️
#production 69.42% <94.73%> (+<0.01%) ⬆️
#test 87.65% <100.00%> (+0.03%) ⬆️
Impacted Files Coverage Δ
...ML.Transforms/Text/StopWordsRemovingTransformer.cs 86.67% <94.73%> (+0.89%) ⬆️
test/Microsoft.ML.Tests/OnnxConversionTest.cs 96.66% <100.00%> (+0.08%) ⬆️
...c/Microsoft.ML.SamplesUtils/SamplesDatasetUtils.cs 40.00% <0.00%> (-3.46%) ⬇️
src/Microsoft.ML.CodeGenerator/Utils.cs 59.20% <0.00%> (-2.12%) ⬇️
src/Microsoft.ML.Data/Training/TrainerUtils.cs 65.86% <0.00%> (-1.01%) ⬇️
...dardTrainers/Standard/Online/AveragedPerceptron.cs 89.70% <0.00%> (-0.58%) ⬇️
...rc/Microsoft.ML.LightGbm/LightGbmRankingTrainer.cs 88.00% <0.00%> (-0.38%) ⬇️
...c/Microsoft.ML.Data/DataLoadSave/EstimatorChain.cs 89.65% <0.00%> (-0.35%) ⬇️
src/Microsoft.ML.Data/Prediction/Calibrator.cs 80.45% <0.00%> (-0.27%) ⬇️
src/Microsoft.ML.Sweeper/AsyncSweeper.cs 71.23% <0.00%> (-0.20%) ⬇️
... and 29 more

@Lynx1820
Copy link
Contributor Author

Lynx1820 commented Jul 2, 2020

Can you check whether the number of elements returned is the same?

I think this already done as part of the onnx testing framework:
https://github.com/dotnet/machinelearning/blob/master/test/Microsoft.ML.TestFramework/BaseTestBaseline.cs#L751

@harishsk
Copy link
Contributor

harishsk commented Jul 2, 2020

Can you reshape the returned vector in SaveAsOnnx to be compatible with ML.NET?


In reply to: 653235748 [](ancestors = 653235748)

@Lynx1820 Lynx1820 requested a review from harishsk July 10, 2020 22:47
Copy link
Contributor

@harishsk harishsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@Lynx1820 Lynx1820 merged commit 7879849 into dotnet:master Jul 11, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants