
fix: turn atlas-connect-cluster async #343


Open
wants to merge 13 commits into main

Conversation

@fmenezes (Collaborator) commented Jul 7, 2025

fixes #321

Proposed changes

Turn the atlas-connect-cluster tool into an async tool. Unfortunately, we rely on the Atlas user creation API to connect, and propagation of the user from the control plane to the data plane can take time; there is no state to query other than checking whether the user has access to the database.

Checklist

@coveralls (Collaborator) commented Jul 7, 2025

Pull Request Test Coverage Report for Build 16145285047

Details

  • 22 of 48 (45.83%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-1.9%) to 73.688%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
src/session.ts | 3 | 5 | 60.0%
src/tools/atlas/metadata/connectCluster.ts | 19 | 43 | 44.19%

Files with Coverage Reduction | New Missed Lines | %
src/tools/atlas/metadata/connectCluster.ts | 1 | 44.34%

Totals Coverage Status:
Change from base Build 16135914646: -1.9%
Covered Lines: 853
Relevant Lines: 1065

💛 - Coveralls

@coveralls (Collaborator) commented:

Pull Request Test Coverage Report for Build 16123662219

Details

  • 0 of 34 (0.0%) changed or added relevant lines in 1 file are covered.
  • 129 unchanged lines in 15 files lost coverage.
  • Overall coverage decreased (-13.3%) to 60.879%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
src/tools/atlas/metadata/connectCluster.ts | 0 | 34 | 0.0%

Files with Coverage Reduction | New Missed Lines | %
src/common/atlas/generatePassword.ts | 3 | 25.0%
src/tools/atlas/create/createFreeCluster.ts | 3 | 57.14%
src/tools/atlas/read/inspectCluster.ts | 3 | 23.53%
src/tools/atlas/read/listAlerts.ts | 3 | 20.0%
src/tools/atlas/read/listOrgs.ts | 3 | 72.73%
src/tools/atlas/read/inspectAccessList.ts | 4 | 33.33%
src/tools/atlas/create/createDBUser.ts | 6 | 25.0%
src/tools/atlas/create/createProject.ts | 8 | 46.67%
src/common/atlas/cluster.ts | 9 | 0.0%
src/tools/atlas/create/createAccessList.ts | 10 | 16.13%

Totals Coverage Status:
Change from base Build 16077040616: -13.3%
Covered Lines: 704
Relevant Lines: 1046

💛 - Coveralls

@fmenezes fmenezes marked this pull request as ready for review July 8, 2025 06:54
@Copilot Copilot AI review requested due to automatic review settings July 8, 2025 06:54
@fmenezes fmenezes requested a review from a team as a code owner July 8, 2025 06:54
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This PR refactors the atlas-connect-cluster tool into an asynchronous background process and updates integration tests to poll until the cluster connection is established.

  • Tests now loop with retries to wait for the async connection to succeed.
  • ConnectClusterTool has been split into query, prepare, and background connect phases, returning immediately with an “Attempting…” message.
  • Added new log IDs for connect attempts and successes in logger.ts.
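To make the shape of that split concrete, here is a minimal TypeScript sketch. It is an illustration only: the interfaces, method names, and placeholder connection string are assumptions, not the actual code in src/tools/atlas/metadata/connectCluster.ts.

// Illustrative sketch of the query / prepare / background-connect split.
interface ServiceProvider {
    runCommand(db: string, command: Record<string, unknown>): Promise<unknown>;
}

interface Session {
    serviceProvider?: ServiceProvider;
    connectToMongoDB(connectionString: string): Promise<void>;
}

class ConnectClusterToolSketch {
    constructor(private readonly session: Session) {}

    // Query phase: is the current connection already usable?
    private async queryConnection(): Promise<"connected" | "disconnected"> {
        if (!this.session.serviceProvider) {
            return "disconnected";
        }
        await this.session.serviceProvider.runCommand("admin", { ping: 1 });
        return "connected";
    }

    // Prepare phase: create the temporary DB user and build the connection
    // string; details omitted, placeholder value only.
    private async prepareConnection(clusterName: string): Promise<string> {
        return `mongodb+srv://user:password@${clusterName}.example.mongodb.net/`;
    }

    // Connect phase: kick off the connection in the background and return
    // immediately so the tool call stays fast.
    async execute(clusterName: string): Promise<string> {
        const state = await this.queryConnection().catch(() => "disconnected");
        if (state === "connected") {
            return "Cluster is already connected.";
        }
        const connectionString = await this.prepareConnection(clusterName);
        void this.session.connectToMongoDB(connectionString); // intentionally not awaited
        return "Attempting to connect to the cluster...";
    }
}

Returning immediately keeps the tool call well inside the client's default timeout; the integration tests then poll the tool until it reports "Cluster is already connected."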
Comments suppressed due to low confidence (1)

tests/integration/tools/atlas/clusters.test.ts:195

  • Add an assertion after the loop (or inside the success branch) to fail the test if the cluster never reports "Cluster is already connected.", otherwise the test may silently pass without verifying a successful connection.
                for (let i = 0; i < 600; i++) {
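For illustration, a sketch of the suggested pattern with both a post-loop assertion and the retry count and delay pulled into named constants; the helper name and constants are placeholders, not the test's actual identifiers.

// Sketch only: names are assumptions, not the actual test code.
const RETRY_COUNT = 600;
const RETRY_DELAY_MS = 500; // 600 * 500 ms = 5 minutes

async function waitForClusterConnection(callConnectTool: () => Promise<string>): Promise<void> {
    for (let i = 0; i < RETRY_COUNT; i++) {
        const response = await callConnectTool();
        if (response.includes("Cluster is already connected.")) {
            return; // connection confirmed
        }
        await new Promise((resolve) => setTimeout(resolve, RETRY_DELAY_MS));
    }
    // Fail loudly instead of letting the test pass without ever connecting.
    throw new Error("Cluster never reported a successful connection");
}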

Comment on lines +43 to +46
await this.session.serviceProvider.runCommand("admin", {
    ping: 1,
});
return "connected";
Copilot AI commented Jul 8, 2025

Wrap the runCommand('admin', { ping: 1 }) call in a try/catch so transient ping errors don’t bubble up and trigger a full reconnection flow prematurely.

Suggested change
-await this.session.serviceProvider.runCommand("admin", {
-    ping: 1,
-});
-return "connected";
+try {
+    await this.session.serviceProvider.runCommand("admin", {
+        ping: 1,
+    });
+    return "connected";
+} catch (error) {
+    logger.warn(LogId.ConnectionPingError, `Ping command failed: ${error.message}`);
+    return "connecting";
+}


@fmenezes (Collaborator, Author) replied:

no need, I'm bubbling up the error

Comment on lines 122 to 123
for (let i = 0; i < 600; i++) {
    // try for 5 minutes
Copilot AI commented Jul 8, 2025

Extract the retry count (600) and delay (500ms) into named constants to improve readability and ease future adjustments.

Suggested change
-for (let i = 0; i < 600; i++) {
-    // try for 5 minutes
+for (let i = 0; i < RETRY_COUNT; i++) {
+    // try for RETRY_COUNT attempts


src/logger.ts Outdated
@@ -17,6 +17,8 @@ export const LogId = {
     atlasDeleteDatabaseUserFailure: mongoLogId(1_001_002),
     atlasConnectFailure: mongoLogId(1_001_003),
     atlasInspectFailure: mongoLogId(1_001_004),
+    atlasConnectAttempt: mongoLogId(1_001_005),
+    atlasConnectSuccessed: mongoLogId(1_001_006),
Copilot AI commented Jul 8, 2025

The log ID key atlasConnectSuccessed is misspelled; consider renaming it to atlasConnectSucceeded or atlasConnectSuccess for clarity.

Suggested change
-atlasConnectSuccessed: mongoLogId(1_001_006),
+atlasConnectSucceeded: mongoLogId(1_001_006),


Comment on lines 108 to 110
const connectionString = cn.toString();

return connectionString;
Collaborator commented:
Suggested change
-const connectionString = cn.toString();
-
-return connectionString;
+return cn.toString();

Comment on lines 148 to 149
groupId: this.session.connectedAtlasCluster?.projectId || "",
username: this.session.connectedAtlasCluster?.username || "",
Collaborator commented:

If those are not set, does it make sense to make that call at all?
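For illustration only, a sketch of such a guard. The projectId and username fields come from the quoted snippet; someAtlasApiCall is a hypothetical placeholder for whichever Atlas API call this feeds, and the early return is the point of the sketch.

interface ConnectedAtlasCluster {
    projectId?: string;
    username?: string;
}

// Skip the API call entirely when the identifiers are missing, instead of
// sending empty strings as fallback values.
async function callIfUserKnown(
    connectedAtlasCluster: ConnectedAtlasCluster | undefined,
    someAtlasApiCall: (params: { groupId: string; username: string }) => Promise<void>
): Promise<void> {
    const projectId = connectedAtlasCluster?.projectId;
    const username = connectedAtlasCluster?.username;
    if (!projectId || !username) {
        return; // nothing to act on, so don't make the call at all
    }
    await someAtlasApiCall({ groupId: projectId, username });
}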

);
}

protected async execute({ projectId, clusterName }: ToolArgs<typeof this.argsShape>): Promise<CallToolResult> {
Collaborator commented:

I'm not so sure about the DX changes here - any reason not to do the polling internally and only return a result once we're connected? Assuming the common use case for connecting to a cluster is to do something with it (e.g. insert/query documents), the user will be waiting anyway, so we might as well just notify them when we're done. Note that once we add support for streamable http, we'd be able to use the progress notifications mechanism to notify the user that something is going on, so they won't just be staring at a spinner for 10-15 seconds.

@fmenezes (Collaborator, Author) replied Jul 8, 2025:

Timeouts: the client side controls them, and there is no way to extend the timeout per tool. Even our CI is intermittent; notice I added a very generous timeout (5 minutes).

@fmenezes (Collaborator, Author) replied:

this is coming from #321

@fmenezes (Collaborator, Author) replied:

Poking around, I've found that the TypeScript SDK's default timeout for tool calls is 1 minute (https://github.com/modelcontextprotocol/typescript-sdk/blob/main/src/shared/protocol.ts#L53C45-L53C50).

Collaborator commented:

Hm... looking at #321, it doesn't appear like the issue is a timeout, but rather that the original 10-ish second window we gave the db user to be connected isn't enough (assuming that the auth error is due to the user still not being created after 20 retries with 500 ms delay each).

How would you feel about a middle ground where we poll internally for e.g. 45 seconds and, if we still don't have a valid user, return something along the lines of "provisioning the user and connecting to the cluster is taking a bit longer, so check back in 15-20 seconds"? That way, for the majority of users we'd still connect within the first tool call, while still providing a graceful fallback for the few where it takes a lot longer.

@fmenezes (Collaborator, Author) replied:

Right, I think that is acceptable

@fmenezes (Collaborator, Author) replied:

I'll limit it to 30 seconds, given we have two extra API calls that can also eat some time.
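A rough sketch of that middle ground, for illustration only: poll internally within a fixed budget (30 seconds per the comment above, comfortably under the client's 1-minute default), and fall back to a "check back shortly" message if the user still hasn't propagated. Function names and messages are placeholders, not the PR's actual code.

// Sketch only: illustrates the agreed behaviour, not the actual implementation.
const INTERNAL_POLL_BUDGET_MS = 30_000; // stay well under the client's 1-minute default timeout
const POLL_INTERVAL_MS = 500;

async function connectWithinBudget(tryConnectOnce: () => Promise<boolean>): Promise<string> {
    const deadline = Date.now() + INTERNAL_POLL_BUDGET_MS;
    while (Date.now() < deadline) {
        if (await tryConnectOnce()) {
            return "Connected to the cluster.";
        }
        await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
    }
    // The user may still be propagating from the control plane to the data
    // plane; keep the attempt going in the background and tell the caller to
    // check back shortly.
    return "Connecting is taking a bit longer than usual; check back in 15-20 seconds.";
}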

@fmenezes fmenezes requested a review from nirinchev July 8, 2025 13:44
@fmenezes (Collaborator, Author) commented Jul 8, 2025

@nirinchev this is ready for another look


Successfully merging this pull request may close these issues.

atlas-connect-cluster command randomly fails to connect
3 participants