fix hang with multiple envs #2600


Merged 1 commit into develop from develop-debug-multienv-hang on Sep 20, 2019

Conversation

@chriselion (Contributor) commented Sep 20, 2019

Fixes a hang/deadlock when closing the environment. Before this fix, it was easy to reproduce with 3 or 4 environments.

From adding debug statements, it was clear that the worker process reached the end, but join() from the main process never finished, e.g.

DEBUG:mlagents.envs:SubprocessEnvManager closing.
DEBUG:mlagents.envs:UnityEnvWorker 0 joining process.
DEBUG:mlagents.envs:Worker 0 closing.
DEBUG:mlagents.envs:Worker 0 done.

I believe the cause of the hang was:

  1. Env manager would retrieve steps from the step queue
  2. Another subprocess worker would enqueue to the step queue
  3. Trainer would reach the max steps so the environment manager would start to close
  4. UnityEnvWorker.close() would hang on self.process.join()

In this case, the worker from step 2 can't exit until its item is removed from the queue. It turns out this is intended behavior. From https://docs.python.org/3/library/multiprocessing.html#programming-guidelines:

Bear in mind that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the “feeder” thread to the underlying pipe. (The child process can call the Queue.cancel_join_thread method of the queue to avoid this behaviour.)

This means that whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined. Otherwise you cannot be sure that processes which have put items on the queue will terminate. Remember also that non-daemonic processes will be joined automatically.

This situation is impossible to hit with 1 env and rare with 2, but pretty likely with 3, and the likelihood only increases from there.
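
For illustration, here is a minimal standalone sketch (not the ml-agents code) of the behavior described in the docs excerpt: the child puts a payload on a queue, the parent joins without ever draining it, and join() blocks on the child's feeder thread.

```python
import multiprocessing as mp


def worker(step_queue):
    # Enqueue a payload large enough that it stays buffered in the queue's
    # feeder thread instead of fitting entirely into the pipe's OS buffer.
    step_queue.put(b"x" * 1_000_000)
    # The function returns here, but the process cannot fully terminate
    # until the buffered item is fed to the pipe, i.e. until someone get()s it.


if __name__ == "__main__":
    q = mp.Queue()
    proc = mp.Process(target=worker, args=(q,))
    proc.start()
    # The parent joins without draining the queue, mirroring steps 1-4 above.
    proc.join()  # hangs: the child's feeder thread is still blocked
```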

The fix, per the guidance quoted above, is to call Queue.cancel_join_thread() from the child process. I tested this with up to 10 envs locally without a deadlock.
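
Applied to the standalone repro above (the actual change lands in ml-agents' subprocess worker; names here are illustrative), the fix looks like:

```python
import multiprocessing as mp


def worker(step_queue):
    try:
        # ... env-stepping loop would go here, calling step_queue.put(...) ...
        step_queue.put(b"x" * 1_000_000)
    finally:
        # The fix: detach this process's exit from the queue's feeder thread.
        # Items still buffered at exit may be dropped, which is acceptable
        # here because everything is shutting down anyway.
        step_queue.cancel_join_thread()


if __name__ == "__main__":
    q = mp.Queue()
    proc = mp.Process(target=worker, args=(q,))
    proc.start()
    proc.join()  # now returns even though the item was never consumed
```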

Another option (which I initially tried) was to "flush" the step queue by calling step_queue.get_nowait() until it was empty. However, I think this is still vulnerable to the same race condition: a subprocess can enqueue after the queue has been flushed (especially if the environment is very slow).
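
A sketch of that rejected approach, with an illustrative helper name:

```python
import queue


def flush_step_queue(step_queue):
    # Drain everything currently in the queue. The Empty check and the
    # later process.join() are not atomic, so a worker can still put()
    # in between and re-create the original hang.
    while True:
        try:
            step_queue.get_nowait()
        except queue.Empty:
            break
```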

Any ideas how we can test this in practice?

@xiaomaogy (Contributor) commented:

We can test this using the training automation. I've actually seen this happening since last week, when we started running a daily job on parallel envs, but I wasn't sure whether it was specific to Linux.

@ervteng (Contributor) left a comment:

Code looks great - the only remaining thing might be to test it on Cloud Build. But it's definitely better than it was before 👍

@chriselion chriselion merged commit 97b9951 into develop Sep 20, 2019
@chriselion chriselion deleted the develop-debug-multienv-hang branch September 24, 2019 22:46
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 18, 2021