-
Notifications
You must be signed in to change notification settings - Fork 1.9k
fix tensorflow test hanging issue #4997
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix tensorflow test hanging issue #4997
Conversation
string fullImagesetFolderPath = Path.Combine( | ||
imagesDownloadFolderPath, finalImagesFolderName); | ||
// set timeout to 3 minutes, download sometimes will stuck so set smaller timeout to retry download | ||
string timoOutOldValue = Environment.GetEnvironmentVariable(ResourceManagerUtils.TimeoutEnvVariable); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want do this only for these tests and not for all tests? We can choose a smaller timeout and higher retry count for all downloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this setting (timeout) is hardcoded in our source code as 10 minutes but not a option that user can set at https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Vision/ImageClassificationTrainer.cs#L1321
this setting can be override by set the env variable I'm doing this as if everything is fine 3 minutes is enough if something go wrong then we hope this to fail fast as we are retrying 5 times.
we are not seeing other tests have similar issue so I only do this in affected tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are already specifying a timeout in code, why not fix it there directly in one place instead of many places here? Would changing the timeout to 3 minutes in ImageClassificationTrainer have the same effect?
In reply to: 403157318 [](ancestors = 403157318)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our CI environment 3 minutes is enough but if I don't want to change the 10 minutes to 3 minutes as customer's network maybe not that good then 3 minutes maybe not enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if we start running tests in parallel and different tests are using different timeout values?
In reply to: 403230917 [](ancestors = 403230917)
imagesDownloadFolderPath, finalImagesFolderName); | ||
// set timeout to 3 minutes, download sometimes will stuck so set smaller timeout to retry download | ||
string timoOutOldValue = Environment.GetEnvironmentVariable(ResourceManagerUtils.TimeoutEnvVariable); | ||
Environment.SetEnvironmentVariable(ResourceManagerUtils.TimeoutEnvVariable, (3 * 60 * 1000).ToString()); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will setting the environment variables here and resetting it at the end of the test have any impact?
It appears that you have moved the download code to the constructor. But the timeout is not being set there. So by the time the code gets here wouldn't the download already have completed with a different timeout value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are downloading different things.
In constructor we are downloading some image set and all tests use same images so no need to download duplicate images and I move this to constructor. this is refine the code performance and remove unnecessary downloads.
this env variable is to set timeout for download some metadata for test and metadata download is causing the hanging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It this environment variable is affecting the download of metadata by ImageClassificationTrainer, then can we fix this directly there in the timeout that ImageClassificationTrainer uses?
In reply to: 403159407 [](ancestors = 403159407)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as previous, in source code I want to make sure timeout is enough which is 10 minutes as before (this metadata file is nearly 100MB). But in our test code as in CI 3 minutes is enough per our experience so I want to make this timeout as 3 minutes to fail fast if we met download issue.
{ | ||
// ignore any Exception and retrying download | ||
ch.Warning($"{i+1} - th try: Dowload {fileName} from {url} fail with exception {ex.Message}"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have lots of trouble with failed downloads. Can we make the exceptions more visible?
That is, we should log all the failures as you have done. But if in the end the download still fails after the retries, how about throwing an exception that download failed even after retries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we indeed throw exception if retry still fails at line: https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.TensorFlow/TensorflowUtils.cs#L190
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this download code is used for things other than Tensorflow as well...right? I meant should we throw here? If not make we have to make sure we are throwing whereever this is called from to make sure that download failures are propagated as errors.
In reply to: 403162052 [](ancestors = 403162052)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can see one other code is also using this. We already have this "downloadResult" returned if download failed after retry, and ResourceDownloadResults will treat returned "downloadResult" as error message and take care the DownloadResults at caller method.
see https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Transforms/Text/WordEmbeddingsExtractor.cs#L631-L638 for the other caller of this ensure resource.
|
||
var task = s.ReadAsync(buffer, 0, blockSize); | ||
task.Wait(ct); | ||
int count = task.Result; | ||
// REVIEW: use a progress channel instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you need to check the result after the first call, you can change the loop to a do{} loop instead of a while() loop. If you still want to have a while loop, you can change the loop to a while(true) loop and move this inside the while loop. In either case you won't need to have these three lines repeated twice.
int count; | ||
|
||
var task = s.ReadAsync(buffer, 0, blockSize); | ||
task.Wait(ct); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReadAsync allows you to pass a CancellationToken directly. You can choose to use that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed 2 issue of tensorflow tests: