Improve reporting of unexpected network disconnects #125290

Closed
@DaveCTurner

Description

Today if a node-to-node connection drops we log this message:

logger.info(
    """
        transport connection to [{}] closed by remote; \
        if unexpected, see [{}] for troubleshooting guidance""",
    node.descriptionWithoutAttributes(),
    ReferenceDocs.NETWORK_DISCONNECT_TROUBLESHOOTING
);

The "if unexpected" bit is tricksy: it's actually pretty hard to tell from the logs whether a disconnect was expected (e.g. the node shut down) or not (e.g. network disruption). Yet we should be able to work out for ourselves whether a disconnect was unexpected, and log a message that says so unambiguously.

In particular, if the org.elasticsearch.cluster.NodeConnectionsService finds it is disconnected from a peer and then successfully reconnects to that same peer (i.e. its DiscoveryNode#ephemeralId did not change) then the disconnect was definitely not due to the node shutting down, because a restarted node comes back with a new ephemeral ID. We should be emitting a WARN log in this case. Moreover, it'd be incredibly useful to capture the exception (if any) that TcpTransport reported as causing the disconnect so we can repeat it in such a log message.
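
To make the proposal concrete, here is a rough sketch of the bookkeeping it implies. This is not Elasticsearch code: `DisconnectTracker`, `onDisconnect` and `onReconnect` are hypothetical names, nodes are represented by plain strings rather than `DiscoveryNode`, and the returned message stands in for whatever the real WARN log would say.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, for illustration only; not part of Elasticsearch.
final class DisconnectTracker {

    // What we remember about a dropped connection until the peer reconnects.
    private static final class Disconnect {
        final String ephemeralId;
        final Exception cause; // may be null if the transport reported no exception

        Disconnect(String ephemeralId, Exception cause) {
            this.ephemeralId = ephemeralId;
            this.cause = cause;
        }
    }

    // Keyed by the persistent node ID (assumed stable across restarts).
    private final Map<String, Disconnect> pending = new ConcurrentHashMap<>();

    // Record that the connection to this node dropped, along with the cause if known.
    void onDisconnect(String nodeId, String ephemeralId, Exception cause) {
        pending.put(nodeId, new Disconnect(ephemeralId, cause));
    }

    // Returns the text for a WARN log if the reconnect shows the earlier
    // disconnect was unexpected, or empty if nothing needs to be reported.
    Optional<String> onReconnect(String nodeId, String ephemeralId) {
        Disconnect previous = pending.remove(nodeId);
        if (previous == null || !previous.ephemeralId.equals(ephemeralId)) {
            // No recorded disconnect, or the node restarted (new ephemeral ID).
            return Optional.empty();
        }
        String reason = previous.cause == null ? "no exception recorded" : previous.cause.toString();
        return Optional.of(
            "transport connection to [" + nodeId + "] closed unexpectedly (" + reason + ") and was re-established"
        );
    }
}
```

A caller would invoke `onDisconnect` from wherever the connection-closed event is observed and `onReconnect` after a successful reconnect, logging at WARN whenever a message comes back. Where exactly those hooks would live in `NodeConnectionsService` and `TcpTransport` is left open here.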
