Cancellation in Image Classification (fixes #4632) #4650

antoniovs1029 · 2020-01-13T23:09:54Z

Adds support for cancellation to the Image Classification trainer, in the same manner as #3062 (and other PRs), by adding cancellation checkpoints to the train method.

I've tested it by running the sample related to this trainer. Since the other PRs that added cancellation checkpoints don't include unit tests, I didn't include any here either.

Fixes #4632.

antoniovs1029 · 2020-01-13T23:13:50Z

I don't know if I should also add a .CheckAlive() checkpoint inside the CacheFeaturizedImagesToDisk method of the Image Classification trainer. That method can take a couple of minutes, but once it finishes, the trainer will hit the checkpoint I've already added in TrainAndEvaluateClassificationLayer anyway.

Also, if anyone has other opinions as to where to put more checkpoints, please, let me know!

codemzs · 2020-01-14T05:06:42Z

src/Microsoft.ML.Vision/ImageClassificationTrainer.cs

@@ -992,6 +995,7 @@ public Tensor ProcessImage(in VBuffer<byte> imageBuffer)

                for (int epoch = 0; epoch < epochs; epoch += 1)
                {
+                    Host.CheckAlive();


I would put the check only in this loop and in CreateFeaturizedCacheFile. Please also report the perf numbers before and after the change. Please remove CheckAlive from everywhere else, as it's not very significant there and only pollutes the code. You also need to call TryCleanupTemporaryWorkspace for a graceful termination.

antoniovs1029 · 2020-01-14

I have added a new "CheckAlive" method to the ImageClassification trainer. It wraps the host check in a try...catch so that TryCleanupTemporaryWorkspace is called when the check fails.

I have also changed the places where the checkpoints are added.

I will now look into how to measure the perf difference.
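For readers following along, the wrapper described above might look roughly like the sketch below. This is not the code from the PR: `CancellableTrainerSketch`, `HostCheckAlive`, `Train`, and the two public properties are hypothetical stand-ins, and the sketch assumes the host check signals cancellation by throwing `OperationCanceledException`. Only the names `CheckAlive` and `TryCleanupTemporaryWorkspace` come from the discussion.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the trainer. Only the names CheckAlive and
// TryCleanupTemporaryWorkspace are taken from the PR discussion.
public sealed class CancellableTrainerSketch
{
    private readonly CancellationToken _token;

    public bool WorkspaceCleaned { get; private set; }
    public int EpochsCompleted { get; private set; }

    public CancellableTrainerSketch(CancellationToken token) => _token = token;

    // Assumption: plays the role of Host.CheckAlive() by throwing
    // OperationCanceledException once cancellation has been requested.
    private void HostCheckAlive() => _token.ThrowIfCancellationRequested();

    // Assumption: the real method deletes temporary files; here it only sets a flag.
    private void TryCleanupTemporaryWorkspace() => WorkspaceCleaned = true;

    // Wrapper: run the host check, clean up on cancellation, then rethrow
    // so the caller still observes the cancellation.
    private void CheckAlive()
    {
        try
        {
            HostCheckAlive();
        }
        catch (OperationCanceledException)
        {
            TryCleanupTemporaryWorkspace();
            throw;
        }
    }

    // One checkpoint per epoch, mirroring the reviewed loop.
    public void Train(int epochs, Action<int> runEpoch)
    {
        for (int epoch = 0; epoch < epochs; epoch += 1)
        {
            CheckAlive();
            runEpoch(epoch);
            EpochsCompleted += 1;
        }
    }
}
```

With this shape, cancelling the token during an epoch lets that epoch finish; the next checkpoint then cleans up the workspace and rethrows, so the cost per epoch is a single check.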

antoniovs1029 · 2020-01-14

I ran the ImageClassificationBench.TrainResnetV250 benchmark with and without the changes in this PR, and both runs behaved in much the same way.

Without the changes this was the summary output of the benchmark:

| Method          |    Mean |   Error |   StdDev | Extra Metric |
|---------------- |--------:|--------:|---------:|-------------:|
| TrainResnetV250 | 41.55 s | 5.580 s | 0.3058 s |            - |

And with the changes, the summary was:

| Method          |    Mean |   Error |   StdDev | Extra Metric |
|---------------- |--------:|--------:|---------:|-------------:|
| TrainResnetV250 | 40.10 s | 2.723 s | 0.1493 s |            - |

So on average the version with the changes ran slightly faster, although the 1.45 s difference is smaller than the reported Error of either run.

In any case, the CheckAlive() method only evaluates a few if-statements, so I don't think it can introduce a meaningful performance difference (given that image classification training is expected to take a considerable amount of time anyway).

antoniovs1029 · 2020-01-17

As suggested online by @codemzs, I have rerun the benchmarks using the CIFAR-10 dataset.

Without the changes introduced in the PR the summary is as follows:

| Method          |    Mean |   Error |   StdDev | Extra Metric |
|---------------- |--------:|--------:|---------:|-------------:|
| TrainResnetV250 | 79.29 m | 4.850 m | 0.2658 m |            - |

With the changes:

| Method          |    Mean |   Error |  StdDev | Extra Metric |
|---------------- |--------:|--------:|--------:|-------------:|
| TrainResnetV250 | 78.82 m | 21.71 m | 1.190 m |            - |

So, again, my understanding is that the training time for this model varies somewhat from run to run (which is why the benchmark with the changes ran a little faster), and that introducing the CheckAlive() method has no real impact on performance.
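The "within run-to-run variability" reading of both benchmark pairs can be checked with a little arithmetic. The sketch below assumes the Error column is BenchmarkDotNet's half-width of the confidence interval around the mean, and treats two results as indistinguishable when their intervals overlap. `BenchmarkComparison` and `IntervalsOverlap` are hypothetical names introduced for illustration, and this is a rough sanity check rather than a formal statistical test.

```csharp
using System;

// Hypothetical helper for comparing two benchmark summaries.
public static class BenchmarkComparison
{
    // True when [meanA ± errorA] and [meanB ± errorB] overlap, i.e. the
    // gap between the means is no larger than the combined half-widths.
    public static bool IntervalsOverlap(double meanA, double errorA, double meanB, double errorB)
        => Math.Abs(meanA - meanB) <= errorA + errorB;
}
```

For the first pair, `IntervalsOverlap(41.55, 5.580, 40.10, 2.723)` compares a 1.45 s gap against a combined half-width of 8.303 s. For the CIFAR-10 pair, `IntervalsOverlap(79.29, 4.850, 78.82, 21.71)` compares a 0.47 m gap against 26.56 m. Both calls return true, which is consistent with the conclusion that the checkpoints have no measurable effect.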

codemzs · 2020-01-14T05:07:46Z

CacheFeaturizedImagesToDisk can take significant time, so we must add a check there.


antoniovs1029 added 7 commits January 10, 2020 17:22

Modified Program.cs

fc139f7

Added other sample for Image Classification

7147772

Added sample to test cancelling method

3a7a0bb

Added CheckAlive() Checkpoints

9a5afbd

Restore Samples' Program.cs back to original and delete test sample

052a6e4

Samples back to original state

8785b8f

Reset to original state of samples

f6042ed

antoniovs1029 requested a review from a team as a code owner January 13, 2020 23:09

antoniovs1029 requested a review from codemzs January 13, 2020 23:15

codemzs reviewed Jan 14, 2020

antoniovs1029 added 4 commits January 14, 2020 11:42

Removed checkpoints from unnecessary places

eef112c

Adding CheckAlive method with exception handling

54d0309

Added checkpoints with new CheckAlive method

53b737b

Removed unused exception variable "e"

681c13c

codemzs approved these changes Jan 17, 2020

antoniovs1029 merged commit 6210c38 into dotnet:master Jan 17, 2020

ghost locked as resolved and limited conversation to collaborators Mar 19, 2022
