fix: add callback to map before sending request to fix race condition #3146
+22
−10
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Last week, I thought I found the smoking gun in our flaky integration tests where holding these locks could have led to potential deadlock:
Yet even after those PRs went in, we continued to see flakinees in our integration tests! Though with the additional logging added as part of debugging those tests, I now saw things like:
That is, a response was sent for a request, but no callback was in place to handle the response!
This time, I think I may have found the underlying issue (though the fixes for holding locks for too long may have also been part of it), which is I found cases where we were sending the request:
codex/codex-rs/core/src/codex.rs
Line 597 in 234c0a0
before inserting the
Sender
into thepending_approvals
map (which has to wait on acquiring a mutex):codex/codex-rs/core/src/codex.rs
Lines 598 to 601 in 234c0a0
so it is possible the request could go out and the client could respond before
pending_approvals
was updated!Note this was happening in both
request_command_approval()
andrequest_patch_approval()
, which maps to the sorts of errors we have been seeing when these integration tests have been flaking on us.While here, I am also adding some extra logging that prints if inserting into
pending_approvals
overwrites an entry as opposed to purely inserting one. Today, a conversation can have only one pending request at a time, but as we are planning to support parallel tool calls, this invariant may not continue to hold, in which case we need to revisit this abstraction.