fix hang with multiple envs #2600


Merged 1 commit into develop from develop-debug-multienv-hang on Sep 20, 2019

Conversation

@chriselion (Contributor) commented Sep 20, 2019

Fixes a hang/deadlock when closing the environment. Before this fix, it was easy to reproduce with 3 or 4 environments.

From adding debug statements, it was clear that the worker process reached the end, but join() from the main process never finished, e.g.

DEBUG:mlagents.envs:SubprocessEnvManager closing.
DEBUG:mlagents.envs:UnityEnvWorker 0 joining process.
DEBUG:mlagents.envs:Worker 0 closing.
DEBUG:mlagents.envs:Worker 0 done.

I believe the cause of the hang was:

  1. Env manager would retrieve steps from the step queue
  2. Another subprocess worker would enqueue to the step queue
  3. Trainer would reach the max steps so the environment manager would start to close
  4. UnityEnvWorker.close() would hang on self.process.join()

In this case, the worker from step 2 can't exit until its item is removed from the queue. It turns out this is intended behavior. From https://docs.python.org/3/library/multiprocessing.html#programming-guidelines:

Bear in mind that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the “feeder” thread to the underlying pipe. (The child process can call the Queue.cancel_join_thread method of the queue to avoid this behaviour.)

This means that whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined. Otherwise you cannot be sure that processes which have put items on the queue will terminate. Remember also that non-daemonic processes will be joined automatically.

This situation is impossible to hit with 1 env and rare with 2, but pretty likely with 3, and the likelihood only increases from there.
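
For illustration, here is a minimal standalone sketch (not the ml-agents code) of the behavior described in the docs excerpt: the child puts a payload on a queue, the parent joins without ever draining it, and join() blocks on the child's feeder thread.

```python
import multiprocessing as mp


def worker(step_queue):
    # Enqueue a payload large enough that it stays buffered in the queue's
    # feeder thread instead of fitting entirely into the pipe's OS buffer.
    step_queue.put(b"x" * 1_000_000)
    # The function returns here, but the process cannot fully terminate
    # until the buffered item is fed to the pipe, i.e. until someone get()s it.


if __name__ == "__main__":
    q = mp.Queue()
    proc = mp.Process(target=worker, args=(q,))
    proc.start()
    # The parent joins without draining the queue, mirroring steps 1-4 above.
    proc.join()  # hangs: the child's feeder thread is still blocked
```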

The fix, per the guidance quoted above, is to call Queue.cancel_join_thread() from the child process. I tested this with up to 10 envs locally without a deadlock.
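
Applied to the standalone repro above (the actual change lands in ml-agents' subprocess worker; names here are illustrative), the fix looks like:

```python
import multiprocessing as mp


def worker(step_queue):
    try:
        # ... env-stepping loop would go here, calling step_queue.put(...) ...
        step_queue.put(b"x" * 1_000_000)
    finally:
        # The fix: detach this process's exit from the queue's feeder thread.
        # Items still buffered at exit may be dropped, which is acceptable
        # here because everything is shutting down anyway.
        step_queue.cancel_join_thread()


if __name__ == "__main__":
    q = mp.Queue()
    proc = mp.Process(target=worker, args=(q,))
    proc.start()
    proc.join()  # now returns even though the item was never consumed
```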

Another option (which I initially tried) was to "flush" the step queue by calling step_queue.get_nowait() until it was empty. However, I think this is still vulnerable to the same race condition: a subprocess can enqueue after the queue has been flushed (especially if the environment is very slow).
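
A sketch of that rejected approach, with an illustrative helper name:

```python
import queue


def flush_step_queue(step_queue):
    # Drain everything currently in the queue. The Empty check and the
    # later process.join() are not atomic, so a worker can still put()
    # in between and re-create the original hang.
    while True:
        try:
            step_queue.get_nowait()
        except queue.Empty:
            break
```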

Any ideas how we can test this in practice?

@xiaomaogy (Contributor) commented:

We can test this using the training automation. I've actually seen this happening since last week, when we started running a daily job on parallel envs, but I wasn't sure whether it was specific to Linux.

@ervteng (Contributor) left a comment:

Code looks great - the only remaining thing might be to test it on Cloud Build. But it's definitely better than it was before 👍

@chriselion chriselion merged commit 97b9951 into develop Sep 20, 2019
@chriselion chriselion deleted the develop-debug-multienv-hang branch September 24, 2019 22:46
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 18, 2021