Skip to content

Conversation

bolinfest
Copy link
Collaborator

@bolinfest bolinfest commented Sep 4, 2025

Last week, I thought I found the smoking gun in our flaky integration tests where holding these locks could have led to potential deadlock:

Yet even after those PRs went in, we continued to see flakinees in our integration tests! Though with the additional logging added as part of debugging those tests, I now saw things like:

read message from stdout: Notification(JSONRPCNotification { jsonrpc: "2.0", method: "codex/event/exec_approval_request", params: Some(Object {"id": String("0"), "msg": Object {"type": String("exec_approval_request"), "call_id": String("call1"), "command": Array [String("python3"), String("-c"), String("print(42)")], "cwd": String("/tmp/.tmpFj2zwi/workdir")}, "conversationId": String("c67b32c5-9475-41bf-8680-f4b4834ebcc6")}) })
notification: Notification(JSONRPCNotification { jsonrpc: "2.0", method: "codex/event/exec_approval_request", params: Some(Object {"id": String("0"), "msg": Object {"type": String("exec_approval_request"), "call_id": String("call1"), "command": Array [String("python3"), String("-c"), String("print(42)")], "cwd": String("/tmp/.tmpFj2zwi/workdir")}, "conversationId": String("c67b32c5-9475-41bf-8680-f4b4834ebcc6")}) })
read message from stdout: Request(JSONRPCRequest { id: Integer(0), jsonrpc: "2.0", method: "execCommandApproval", params: Some(Object {"conversation_id": String("c67b32c5-9475-41bf-8680-f4b4834ebcc6"), "call_id": String("call1"), "command": Array [String("python3"), String("-c"), String("print(42)")], "cwd": String("/tmp/.tmpFj2zwi/workdir")}) })
writing message to stdin: Response(JSONRPCResponse { id: Integer(0), jsonrpc: "2.0", result: Object {"decision": String("approved")} })
in read_stream_until_notification_message(codex/event/task_complete)
[mcp stderr] 2025-09-04T00:00:59.738585Z  INFO codex_mcp_server::message_processor: <- response: JSONRPCResponse { id: Integer(0), jsonrpc: "2.0", result: Object {"decision": String("approved")} }
[mcp stderr] 2025-09-04T00:00:59.738740Z DEBUG codex_core::codex: Submission sub=Submission { id: "1", op: ExecApproval { id: "0", decision: Approved } }
[mcp stderr] 2025-09-04T00:00:59.738832Z  WARN codex_core::codex: No pending approval found for sub_id: 0

That is, a response was sent for a request, but no callback was in place to handle the response!

This time, I think I may have found the underlying issue (though the fixes for holding locks for too long may have also been part of it), which is I found cases where we were sending the request:

let _ = self.tx_event.send(event).await;

before inserting the Sender into the pending_approvals map (which has to wait on acquiring a mutex):

{
let mut state = self.state.lock_unchecked();
state.pending_approvals.insert(sub_id, tx_approve);
}

so it is possible the request could go out and the client could respond before pending_approvals was updated!

Note this was happening in both request_command_approval() and request_patch_approval(), which maps to the sorts of errors we have been seeing when these integration tests have been flaking on us.

While here, I am also adding some extra logging that prints if inserting into pending_approvals overwrites an entry as opposed to purely inserting one. Today, a conversation can have only one pending request at a time, but as we are planning to support parallel tool calls, this invariant may not continue to hold, in which case we need to revisit this abstraction.

@bolinfest bolinfest merged commit bd4fa85 into main Sep 4, 2025
36 checks passed
@bolinfest bolinfest deleted the pr3146 branch September 4, 2025 14:38
@github-actions github-actions bot locked and limited conversation to collaborators Sep 4, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants