-
Notifications
You must be signed in to change notification settings - Fork 71
fix: turn atlas-connect-cluster async #343
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Pull Request Test Coverage Report for Build 16145285047Details
💛 - Coveralls |
Pull Request Test Coverage Report for Build 16123662219Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the atlas-connect-cluster
tool into an asynchronous background process and updates integration tests to poll until the cluster connection is established.
- Tests now loop with retries to wait for the async connection to succeed.
ConnectClusterTool
has been split into query, prepare, and background connect phases, returning immediately with an “Attempting…” message.- Added new log IDs for connect attempts and successes in
logger.ts
.
Comments suppressed due to low confidence (1)
tests/integration/tools/atlas/clusters.test.ts:195
- Add an assertion after the loop (or inside the success branch) to fail the test if the cluster never reports "Cluster is already connected.", otherwise the test may silently pass without verifying a successful connection.
for (let i = 0; i < 600; i++) {
await this.session.serviceProvider.runCommand("admin", { | ||
ping: 1, | ||
}); | ||
return "connected"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrap the runCommand('admin', { ping: 1 })
call in a try/catch so transient ping errors don’t bubble up and trigger a full reconnection flow prematurely.
await this.session.serviceProvider.runCommand("admin", { | |
ping: 1, | |
}); | |
return "connected"; | |
try { | |
await this.session.serviceProvider.runCommand("admin", { | |
ping: 1, | |
}); | |
return "connected"; | |
} catch (error) { | |
logger.warn(LogId.ConnectionPingError, `Ping command failed: ${error.message}`); | |
return "connecting"; | |
} |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need, I'm bubbling up the error
for (let i = 0; i < 600; i++) { | ||
// try for 5 minutes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extract the retry count (600) and delay (500ms) into named constants to improve readability and ease future adjustments.
for (let i = 0; i < 600; i++) { | |
// try for 5 minutes | |
for (let i = 0; i < RETRY_COUNT; i++) { | |
// try for RETRY_COUNT attempts |
Copilot uses AI. Check for mistakes.
src/logger.ts
Outdated
@@ -17,6 +17,8 @@ export const LogId = { | |||
atlasDeleteDatabaseUserFailure: mongoLogId(1_001_002), | |||
atlasConnectFailure: mongoLogId(1_001_003), | |||
atlasInspectFailure: mongoLogId(1_001_004), | |||
atlasConnectAttempt: mongoLogId(1_001_005), | |||
atlasConnectSuccessed: mongoLogId(1_001_006), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The log ID key atlasConnectSuccessed
is misspelled; consider renaming it to atlasConnectSucceeded
or atlasConnectSuccess
for clarity.
atlasConnectSuccessed: mongoLogId(1_001_006), | |
atlasConnectSucceeded: mongoLogId(1_001_006), |
Copilot uses AI. Check for mistakes.
const connectionString = cn.toString(); | ||
|
||
return connectionString; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const connectionString = cn.toString(); | |
return connectionString; | |
return cn.toString(); |
groupId: this.session.connectedAtlasCluster?.projectId || "", | ||
username: this.session.connectedAtlasCluster?.username || "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If those are not set, does it make sense to make that call at all?
); | ||
} | ||
|
||
protected async execute({ projectId, clusterName }: ToolArgs<typeof this.argsShape>): Promise<CallToolResult> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so sure about the DX changes here - any reason not to do the polling internally and only return a result once we're connected? Assuming the common use case for connecting to a cluster is to do something with it (e.g. insert/query documents) - the user will be waiting anyway, so might as well just notify them when we're done. Note that once we add support for streamable http, we'd be able to use the progress notifications mechanism to notify the user that something is going on, so they won't be just starting at a spinner for 10-15 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timeouts, client-side controls them no extended timeouts per tool, even our CI is intermittent, notice I added a very generous timeout (5 minutes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is coming from #321
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
poking around I've found that typescripts default timeout for tool calling is 1 minute (https://github.com/modelcontextprotocol/typescript-sdk/blob/main/src/shared/protocol.ts#L53C45-L53C50)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm... looking at #321, it doesn't appear like the issue is a timeout, but rather that the original 10-ish second window we gave the db user to be connected isn't enough (assuming that the auth error is due to the user still not being created after 20 retries with 500 ms delay each).
How would you feel if we went a middle ground where we polled internally for e.g. 45 seconds and if we still didn't have a valid user, we would return something along the lines of "provisioning the user and creating to the cluster is taking a bit longer, so check back in 15-20 seconds"? That way for the majority of users, we'd still connect within the first tool call, while still providing a graceful fallback for the few where it takes a lot longer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I think that is acceptable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll limit it to 30 secs given we have two extra API calls that can also eat some time
@nirinchev this is ready for another look |
fixes #321
Proposed changes
turn
atlas-connect-cluster
tool into an async tool. Unfortunately, rely on atlas user creation api to connect, sometimes the propagation time of the user from control plane to the data plane can take time, there is no state to query other than check if user has access to DB.Checklist