Improve reporting of unexpected network disconnects #125290

Open
DaveCTurner opened this issue Mar 20, 2025 · 2 comments · May be fixed by #127736
Labels
:Distributed Coordination/Network (Http and internode communication implementations), >enhancement, Team:Distributed Coordination (Meta label for Distributed Coordination team)

Comments

@DaveCTurner
Contributor

Today if a node-to-node connection drops we log this message:

logger.info(
    """
        transport connection to [{}] closed by remote; \
        if unexpected, see [{}] for troubleshooting guidance""",
    node.descriptionWithoutAttributes(),
    ReferenceDocs.NETWORK_DISCONNECT_TROUBLESHOOTING
);

The "if unexpected" bit is tricksy: it's actually pretty hard to tell from the logs whether a disconnect was expected (e.g. the node shut down) or not (e.g. network disruption). Yet we should be able to work out for ourselves whether a disconnect was unexpected, and log a message that says so unambiguously.

In particular, if the org.elasticsearch.cluster.NodeConnectionsService finds it is disconnected from a peer and then successfully reconnects to that same peer (i.e. its DiscoveryNode#ephemeralId did not change) then the disconnect was definitely not due to the node shutting down. We should emit a WARN log in this case. Moreover, it'd be incredibly useful to capture the exception (if any) that TcpTransport reported as causing the disconnect so we can repeat it in such a log message.
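
To make the proposed check concrete, here is a minimal, self-contained sketch of the idea. It is not the actual NodeConnectionsService or TcpTransport code: `DisconnectTracker`, its method names, the plain-string node and ephemeral IDs, and the message wording are all hypothetical and chosen only for illustration. It assumes the caller can report each disconnect (with an optional cause) and each successful reconnect.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, NOT an Elasticsearch class: remembers disconnects so that
// a later reconnect can be classified as expected or unexpected.
final class DisconnectTracker {

    // What we knew about a peer at the moment its connection dropped.
    private static final class Disconnect {
        final String ephemeralId;
        final Exception cause; // may be null if no cause was reported

        Disconnect(String ephemeralId, Exception cause) {
            this.ephemeralId = ephemeralId;
            this.cause = cause;
        }
    }

    // Keyed by persistent node ID; the ephemeral ID changes when the node restarts.
    private final Map<String, Disconnect> pending = new ConcurrentHashMap<>();

    // Record that the connection to a peer dropped.
    void onDisconnect(String nodeId, String ephemeralId, Exception cause) {
        pending.put(nodeId, new Disconnect(ephemeralId, cause));
    }

    // Call after a successful reconnect. Returns a WARN-worthy message only if the
    // peer still has the ephemeral ID it had before the disconnect, i.e. the node
    // did not restart, so the disconnect cannot be explained by a shutdown.
    Optional<String> onReconnect(String nodeId, String ephemeralId) {
        Disconnect previous = pending.remove(nodeId);
        if (previous == null || previous.ephemeralId.equals(ephemeralId) == false) {
            return Optional.empty(); // no prior disconnect, or the node restarted
        }
        String reason = previous.cause == null ? "unknown cause" : previous.cause.toString();
        return Optional.of(
            "reconnected to node [" + nodeId + "] after an unexpected disconnect (" + reason + ")"
        );
    }
}
```

A caller would log the returned message at WARN when it is present. Carrying the cause in the same record is only one possible design; in practice getting that exception out of the transport layer is the hard part.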

@DaveCTurner DaveCTurner added :Distributed Coordination/Network Http and internode communication implementations >enhancement labels Mar 20, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Mar 20, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@DaveCTurner
Contributor Author

> Moreover, it'd be incredibly useful to capture the exception (if any) that TcpTransport reported as causing the disconnect so we can repeat it in such a log message.

This is harder than the other bit. I'd suggest resolving this in a separate PR from the other bits.

@schase-es schase-es linked a pull request May 6, 2025 that will close this issue