fix hang with multiple envs #2600
Merged
Fixes a hang/deadlock when closing the environment. This was easy to reproduce with 3 or 4 environments.
From adding debug statements, it was clear that the worker process reached the end, but the join() from the main process never finished.
I believe the cause of the hang was:
`self.process.join()`
In this case, the worker from step 2 can't exit until its item is removed from the queue. It turns out this is intended behavior. From https://docs.python.org/3/library/multiprocessing.html#programming-guidelines: a process that has put items on a queue will wait before terminating until all buffered items have been fed to the underlying pipe, so joining it before those items are consumed can deadlock.
This situation is impossible to hit with 1 env and rare with 2, but pretty likely with 3, and the likelihood increases from there.
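To illustrate the mechanism, here's a minimal, self-contained sketch (not the actual env manager code; the worker and queue names are made up). The child enqueues a payload too large for the pipe buffer, so its queue feeder thread blocks and the parent's `join()` times out until the queue is drained:

```python
import multiprocessing as mp

def _worker(step_queue):
    # Stand-in for an env worker: enqueue a payload too large for the
    # queue's feeder thread to flush into the pipe buffer in one go.
    step_queue.put(b"x" * (1 << 22))  # ~4 MB
    # The function returns here, but the process can't exit: it waits
    # for the feeder thread, which is blocked because nobody reads the pipe.

def demo_hang():
    ctx = mp.get_context("fork")  # POSIX-only; keeps the sketch guard-free
    q = ctx.Queue()
    p = ctx.Process(target=_worker, args=(q,))
    p.start()
    p.join(timeout=2)
    hung = p.is_alive()   # True: join() timed out, worker is stuck on exit
    q.get()               # draining the queue unblocks the worker
    p.join(timeout=10)
    return hung, not p.is_alive()
```

Here `demo_hang()` returns `(True, True)`: the first join times out, and the worker only exits once the parent drains the queue.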
The fix (as above) is to call
`Queue.cancel_join_thread()`
from the child process. I tested this with up to 10 envs locally without a deadlock.

Another option (which I initially tried) was to "flush" the step queue by calling
`step_queue.get_nowait()`
until the queue was empty. However, I think this is still vulnerable to the same race condition, i.e. the subprocess enqueues after the queue is flushed (especially if the environment is very slow).

Any ideas on how we can test this in practice?
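For reference, a minimal sketch of the `cancel_join_thread()` approach, with the same made-up names as before (not the actual env manager code). The same oversized payload no longer blocks shutdown, because the child tells the queue's feeder thread not to block process exit:

```python
import multiprocessing as mp

def _env_worker(step_queue):
    # Same oversized payload as before -- too big for the pipe buffer.
    step_queue.put(b"x" * (1 << 22))
    # The fix: don't let the feeder thread block process exit. Any
    # unflushed items are dropped, which is acceptable during shutdown.
    step_queue.cancel_join_thread()

def close_cleanly():
    ctx = mp.get_context("fork")  # POSIX-only; keeps the sketch guard-free
    q = ctx.Queue()
    p = ctx.Process(target=_env_worker, args=(q,))
    p.start()
    p.join(timeout=10)  # returns even though nothing consumed the item
    return not p.is_alive()
```

The trade-off is that unconsumed items on the queue may be lost, but during teardown the main process no longer cares about pending steps.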