
15290: Log Unexpected Disconnects #127736


Draft
wants to merge 2 commits into base: main

Conversation

schase-es

This change makes network errors visible at higher levels (TcpChannel, Transport.Connection) by passing the exceptions that cause channels to close to the close listener's onFailure method.

This pull request has two commits:

  • the first enables exception visibility in the close listener's onFailure() method and fixes areas of the code and tests that assume onFailure never runs. TcpChannels now have an onException() method for reporting an exception, and this reporting has been added in several places where exceptions are received (the inbound handler, initialization). Some test code (FakeTcpChannel) was updated to match the real implementation and to update tested assumptions. Some production code asserted that the onFailure method would never fire; these assertions have been changed to ignore the exception or log it.

  • the second tracks disconnect and reconnect events inside the ClusterConnectionManager and logs cases where a reconnecting node has the same ephemeralId it had at a recent disconnect, which implies the process did not restart but the network connection dropped. Network exceptions are logged, and the connection history is garbage-collected every hour (a sketch of this idea follows below).
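The following is a minimal sketch of that second commit's idea. The names (DisconnectRecord, lastDisconnects, recordDisconnect/recordConnect) are illustrative only and do not reflect the actual ClusterConnectionManager code:

    // Illustrative sketch only; real ClusterConnectionManager wiring and names differ.
    private final Map<String, DisconnectRecord> lastDisconnects = new ConcurrentHashMap<>();

    // ephemeralId is per-process, so it only changes when the remote node restarts.
    record DisconnectRecord(String ephemeralId, long timestampMillis, Exception cause) {}

    void recordDisconnect(DiscoveryNode node, Exception cause) {
        lastDisconnects.put(node.getId(), new DisconnectRecord(node.getEphemeralId(), System.currentTimeMillis(), cause));
    }

    void recordConnect(DiscoveryNode node) {
        final DisconnectRecord previous = lastDisconnects.remove(node.getId());
        if (previous != null && previous.ephemeralId().equals(node.getEphemeralId())) {
            // Same process reconnected without restarting, so the earlier disconnect was a network failure.
            logger.warn("reconnected to node " + node + " with the same ephemeral id; previous disconnect was likely a network issue", previous.cause());
        }
    }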

Still to do:

  • write a test
  • do some refactoring to pass the ThreadPool into ClusterConnectionManager so it has a time source (see the sketch below).
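For the time-source refactoring, a hedged sketch of the idea (the constructor shape and field names are hypothetical, not the actual signature); ThreadPool already exposes relative/absolute time methods that tests can stub:

    // Hypothetical constructor change: accept the ThreadPool so the manager has a mockable time source.
    private final ThreadPool threadPool;

    ClusterConnectionManager(Settings settings, Transport transport, ThreadPool threadPool) {
        this.threadPool = threadPool;
        this.transport = transport;   // illustrative field; real wiring differs
    }

    private long nowMillis() {
        return threadPool.relativeTimeInMillis();   // monotonic clock, controllable in unit tests
    }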

Closes #125290

schase-es added 2 commits May 5, 2025 16:41
Previously, exceptions encountered on a Netty channel were caught and logged at
some level but not passed to the TcpChannel or Transport.Connection close
listeners, which limited observability. This change propagates those exceptions:
TcpChannel.onException and NodeChannels.closeAndFail report them, and the close
listeners receive them via onFailure. Some test infrastructure (FakeTcpChannel)
and assertions in close-listener onFailure methods have been updated.
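As a hedged illustration of this mechanism (the field and method names here are hypothetical, not the PR's actual TcpChannel code): the channel remembers the exception that caused it to close and hands it to close listeners via onFailure instead of completing them with onResponse(null):

    // Hypothetical sketch of a channel implementation; the PR's real classes differ.
    private final List<ActionListener<Void>> closeListeners = new CopyOnWriteArrayList<>();
    private volatile Exception closeException;

    public void onException(Exception e) {
        closeException = e;    // remember why the channel is closing
    }

    void notifyClosed() {
        for (ActionListener<Void> listener : closeListeners) {
            if (closeException != null) {
                listener.onFailure(closeException);   // the network error is now visible to close listeners
            } else {
                listener.onResponse(null);            // normal close, unchanged behaviour
            }
        }
    }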
ClusterConnectionManager now caches the previous ephemeralId (which is process-scoped) of
peer nodes when they disconnect. On reconnect, when a peer has the same
ephemeralId as it did in the previous connection, this is logged to indicate a
network failure. The ephemeralId table is garbage-collected every hour, with
entries older than an hour removed.
@schase-es schase-es requested a review from nicktindall May 6, 2025 00:16
@schase-es schase-es self-assigned this May 6, 2025
@schase-es schase-es added the >non-issue and :Distributed Coordination/Network labels May 6, 2025
Contributor

@DaveCTurner DaveCTurner left a comment


Nice idea, I like it.

I suggest breaking out the change to the close listener into a separate PR. It deserves its own tests, and in the event of a bug it would help to be able to bisect between these two parts of the change.

        assert ch instanceof Netty4NioSocketChannel;
        NetUtils.tryEnsureReasonableKeepAliveConfig(((Netty4NioSocketChannel) ch).javaChannel());
        setupPipeline(ch, false);
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
        Netty4TcpChannel channel = ctx.channel().attr(CHANNEL_KEY).get();
        if (cause instanceof Error) {


We should also use org.elasticsearch.ExceptionsHelper#maybeDieOnAnotherThread here - if an Error is thrown then we cannot continue and must not just quietly suppress it.
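A hedged sketch of how the handler could apply that suggestion (the onException call is the hook this PR describes adding to TcpChannel; the exact handler code and parameter types may differ):

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
        // If the cause contains an Error, rethrow it on another thread so it is never silently suppressed.
        ExceptionsHelper.maybeDieOnAnotherThread(cause);
        final Netty4TcpChannel channel = ctx.channel().attr(CHANNEL_KEY).get();
        if (cause instanceof Exception exception) {
            channel.onException(exception);   // report to the channel so the close listener's onFailure sees it
        }
        ctx.close();
    }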

Comment on lines +351 to +353
    private void collectHistoryGarbage() {
        final long now = System.currentTimeMillis();
        final long hour = 60 * 60 * 1000;

I don't think this should be time-based. I'd rather we dropped the entry when the node is removed from the cluster membership (unexpected membership changes are already logged appropriately - see JoinReasonService). If the node remains in the cluster membership then we should report an unexpected reconnection no matter how long it's been disconnected.

(If we were to make this time-based then we should make the timeout configurable via a setting rather than hard-coded at 1 hour).

In principle we could also just make it size-based, expiring the oldest entries to limit the size of the map to no more than (say) double the size of the cluster. That's what JoinReasonService does, because there we cannot fall back on something stable like the cluster membership. But here we can, so I think we should use that precision.
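For comparison, a hedged sketch of the size-bounded alternative (the constant and method names are hypothetical, and DisconnectRecord stands for whatever entry type the history holds):

    // Size-bounded history: evict the oldest entry once the map exceeds roughly double the cluster size.
    private static final int MAX_HISTORY_ENTRIES = 200;   // illustrative bound, e.g. ~2x expected cluster size

    private final Map<String, DisconnectRecord> disconnectHistory = Collections.synchronizedMap(
        new LinkedHashMap<>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, DisconnectRecord> eldest) {
                return size() > MAX_HISTORY_ENTRIES;
            }
        }
    );

    // Or, tied to cluster membership: drop entries for nodes that have left the cluster.
    void onClusterMembershipChange(Set<String> currentNodeIds) {
        disconnectHistory.keySet().retainAll(currentNodeIds);
    }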

Labels
:Distributed Coordination/Network >non-issue v9.1.0
Development

Successfully merging this pull request may close these issues.

Improve reporting of unexpected network disconnects
3 participants