
15290: Log Unexpected Disconnects #127736


Draft
wants to merge 2 commits into base: main

Conversation

schase-es

This change makes network errors visible at higher levels (TcpChannel, Transport.Connection) by passing the exceptions that cause channels to close to the close listener's onFailure method.

This pull request has two commits:

  • the first enables exception visibility in the close listener's onFailure() method and fixes areas of the code and tests that assume onFailure never runs. TcpChannels now have an onException() method for reporting an exception, and this reporting has been added in several places where exceptions are received (the inbound handler, initialization). Some test code (FakeTcpChannel) was updated to match the real implementation and to update tested assumptions. Some production code asserted that the onFailure method would never fire; these assertions have been changed to ignore the exception or log it.

  • the second tracks disconnect and reconnect events inside the ClusterConnectionManager and logs cases where a reconnecting node has the same ephemeralId it had at a recent disconnect, which implies the process did not restart but the network connection dropped. Network exceptions are logged, and the connection history is garbage-collected every hour (a sketch of this idea follows below).
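The following is a minimal sketch of that second commit's idea. The names (DisconnectRecord, lastDisconnects, recordDisconnect/recordConnect) are illustrative only and do not reflect the actual ClusterConnectionManager code:

    // Illustrative sketch only; real ClusterConnectionManager wiring and names differ.
    private final Map<String, DisconnectRecord> lastDisconnects = new ConcurrentHashMap<>();

    // ephemeralId is per-process, so it only changes when the remote node restarts.
    record DisconnectRecord(String ephemeralId, long timestampMillis, Exception cause) {}

    void recordDisconnect(DiscoveryNode node, Exception cause) {
        lastDisconnects.put(node.getId(), new DisconnectRecord(node.getEphemeralId(), System.currentTimeMillis(), cause));
    }

    void recordConnect(DiscoveryNode node) {
        final DisconnectRecord previous = lastDisconnects.remove(node.getId());
        if (previous != null && previous.ephemeralId().equals(node.getEphemeralId())) {
            // Same process reconnected without restarting, so the earlier disconnect was a network failure.
            logger.warn("reconnected to node " + node + " with the same ephemeral id; previous disconnect was likely a network issue", previous.cause());
        }
    }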

Still to do:

  • write a test
  • do some refactoring to pass the ThreadPool into ClusterConnectionManager so it has a time source (see the sketch below).
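For the time-source refactoring, a hedged sketch of the idea (the constructor shape and field names are hypothetical, not the actual signature); ThreadPool already exposes relative/absolute time methods that tests can stub:

    // Hypothetical constructor change: accept the ThreadPool so the manager has a mockable time source.
    private final ThreadPool threadPool;

    ClusterConnectionManager(Settings settings, Transport transport, ThreadPool threadPool) {
        this.threadPool = threadPool;
        this.transport = transport;   // illustrative field; real wiring differs
    }

    private long nowMillis() {
        return threadPool.relativeTimeInMillis();   // monotonic clock, controllable in unit tests
    }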

Closes #125290

schase-es added 2 commits May 5, 2025 16:41
Previously, exceptions encountered on a Netty channel were caught and logged at
some level but not passed to the TcpChannel or Transport.Connection close
listeners, which limited observability. This change propagates those exceptions:
TcpChannel.onException and NodeChannels.closeAndFail report them, and the close
listeners receive them via onFailure. Some test infrastructure (FakeTcpChannel)
and assertions in close-listener onFailure methods have been updated.
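As a hedged illustration of this mechanism (the field and method names here are hypothetical, not the PR's actual TcpChannel code): the channel remembers the exception that caused it to close and hands it to close listeners via onFailure instead of completing them with onResponse(null):

    // Hypothetical sketch of a channel implementation; the PR's real classes differ.
    private final List<ActionListener<Void>> closeListeners = new CopyOnWriteArrayList<>();
    private volatile Exception closeException;

    public void onException(Exception e) {
        closeException = e;    // remember why the channel is closing
    }

    void notifyClosed() {
        for (ActionListener<Void> listener : closeListeners) {
            if (closeException != null) {
                listener.onFailure(closeException);   // the network error is now visible to close listeners
            } else {
                listener.onResponse(null);            // normal close, unchanged behaviour
            }
        }
    }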
ClusterConnectionManager now caches the previous ephemeralId (which is process-scoped) of
peer nodes when they disconnect. On reconnect, when a peer has the same
ephemeralId as it did in the previous connection, this is logged to indicate a
network failure. The ephemeralId table is garbage-collected every hour, with
entries older than an hour removed.
@schase-es schase-es requested a review from nicktindall May 6, 2025 00:16
@schase-es schase-es self-assigned this May 6, 2025
@schase-es schase-es added the >non-issue and :Distributed Coordination/Network labels May 6, 2025
Contributor

@DaveCTurner DaveCTurner left a comment


Nice idea, I like it.

I suggest breaking out the change to the close listener into a separate PR. It deserves its own tests, and in the event of a bug it would help to be able to bisect between these two parts of the change.

        assert ch instanceof Netty4NioSocketChannel;
        NetUtils.tryEnsureReasonableKeepAliveConfig(((Netty4NioSocketChannel) ch).javaChannel());
        setupPipeline(ch, false);
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
        Netty4TcpChannel channel = ctx.channel().attr(CHANNEL_KEY).get();
        if (cause instanceof Error) {


We should also use org.elasticsearch.ExceptionsHelper#maybeDieOnAnotherThread here - if an Error is thrown then we cannot continue and must not just quietly suppress it.
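A hedged sketch of how the handler could apply that suggestion (the onException call is the hook this PR describes adding to TcpChannel; the exact handler code and parameter types may differ):

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
        // If the cause contains an Error, rethrow it on another thread so it is never silently suppressed.
        ExceptionsHelper.maybeDieOnAnotherThread(cause);
        final Netty4TcpChannel channel = ctx.channel().attr(CHANNEL_KEY).get();
        if (cause instanceof Exception exception) {
            channel.onException(exception);   // report to the channel so the close listener's onFailure sees it
        }
        ctx.close();
    }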

Comment on lines +351 to +353
    private void collectHistoryGarbage() {
        final long now = System.currentTimeMillis();
        final long hour = 60 * 60 * 1000;

I don't think this should be time-based. I'd rather we dropped the entry when the node is removed from the cluster membership (unexpected membership changes are already logged appropriately - see JoinReasonService). If the node remains in the cluster membership then we should report an unexpected reconnection no matter how long it's been disconnected.

(If we were to make this time-based then we should make the timeout configurable via a setting rather than hard-coded at 1 hour).

In principle we could also just make it size-based, expiring the oldest entries to limit the size of the map to no more than (say) double the size of the cluster. That's what JoinReasonService does, because there we cannot fall back on something stable like the cluster membership. But here we can, so I think we should use that precision.
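For comparison, a hedged sketch of the size-bounded alternative (the constant and method names are hypothetical, and DisconnectRecord stands for whatever entry type the history holds):

    // Size-bounded history: evict the oldest entry once the map exceeds roughly double the cluster size.
    private static final int MAX_HISTORY_ENTRIES = 200;   // illustrative bound, e.g. ~2x expected cluster size

    private final Map<String, DisconnectRecord> disconnectHistory = Collections.synchronizedMap(
        new LinkedHashMap<>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, DisconnectRecord> eldest) {
                return size() > MAX_HISTORY_ENTRIES;
            }
        }
    );

    // Or, tied to cluster membership: drop entries for nodes that have left the cluster.
    void onClusterMembershipChange(Set<String> currentNodeIds) {
        disconnectHistory.keySet().retainAll(currentNodeIds);
    }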

Labels
:Distributed Coordination/Network >non-issue v9.1.0
Development

Successfully merging this pull request may close these issues.

Improve reporting of unexpected network disconnects
3 participants