Improve reporting of unexpected network disconnects #125290
Labels
:Distributed Coordination/Network
Http and internode communication implementations
>enhancement
Team:Distributed Coordination
Meta label for Distributed Coordination team
Today if a node-to-node connection drops we log this message:
elasticsearch/server/src/main/java/org/elasticsearch/transport/ClusterConnectionManager.java
Lines 242 to 248 in a59c182
The "if unexpected" bit is tricksy: it's actually pretty hard to tell from the logs whether a disconnect was expected (e.g. the node shut down) or not (e.g. network disruption). Yet we should be able to work out ourselves whether a disconnect was unexpected, and log a message that unambiguously indicates that we saw an unexpected disconnect.
In particular, if the org.elasticsearch.cluster.NodeConnectionsService finds it is disconnected from a peer and then successfully reconnects to that same peer again (its DiscoveryNode#ephemeralId did not change) then that's definitely not due to the node shutting down. We should be emitting a WARN log in this case. Moreover, it'd be incredibly useful to capture the exception (if any) that TcpTransport reported as causing the disconnect so we can repeat it in such a log message.
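
To make the proposal concrete, here is a minimal sketch of the detection logic, not the actual Elasticsearch implementation. The class DisconnectTracker, its method names, and the wording of the warning are all hypothetical. It assumes the caller can supply the node id, the ephemeral id, and the exception (if any) at the time of the disconnect, and it returns the warning text rather than calling a real logger so that it stays self-contained.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: remembers the last disconnect per node and decides on
// reconnect whether that disconnect was unexpected.
class DisconnectTracker {

    // What we knew about the peer when the connection dropped.
    private static final class Disconnect {
        final String ephemeralId;
        final Exception cause; // may be null if no exception was reported

        Disconnect(String ephemeralId, Exception cause) {
            this.ephemeralId = ephemeralId;
            this.cause = cause;
        }
    }

    // Keyed by the persistent node id; values are the most recent disconnect.
    private final Map<String, Disconnect> pending = new ConcurrentHashMap<>();

    // Record a dropped connection together with the exception (if any) that caused it.
    public void onDisconnect(String nodeId, String ephemeralId, Exception cause) {
        pending.put(nodeId, new Disconnect(ephemeralId, cause));
    }

    // Called after a successful reconnect. Returns a WARN-level message if the peer
    // has the same ephemeral id as before, i.e. the process did not restart, so the
    // earlier disconnect cannot have been a node shutdown.
    public Optional<String> onReconnect(String nodeId, String ephemeralId) {
        Disconnect last = pending.remove(nodeId);
        if (last == null || !last.ephemeralId.equals(ephemeralId)) {
            // No recorded disconnect, or the node restarted: nothing to warn about.
            return Optional.empty();
        }
        String cause = last.cause == null ? "no exception recorded" : last.cause.toString();
        return Optional.of("unexpected disconnect from node [" + nodeId
            + "]: reconnected to the same process (ephemeral id [" + ephemeralId
            + "]); cause: " + cause);
    }
}
```

In a real change the bookkeeping would presumably live in or near NodeConnectionsService, with the returned text emitted through the existing logger at WARN level. A disconnect that is never followed by a reconnect to the same ephemeral id produces no warning in this sketch, which matches the "node shut down" case described above.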