-
Notifications
You must be signed in to change notification settings - Fork 25.2k
15290: Log Unexpected Disconnects #127736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Previously, exceptions encountered on a netty channel were caught and logged at some level, but not passed to the TcpChannel or Transport.Connection close listeners. This limited observability. This change implements this exception reporting and passing, with TcpChannel.onException and NodeChannels.closeAndFail reporting exceptions and their close listeners receiving them. Some test infrastructure (FakeTcpChannel) and assertions in close listener onFailure methods have been updated.
ClusterConnectionManager now caches the previous ephemeralId (process-scope) of peer nodes when they disconnect. On reconnect, when a peer has the same ephemeralId as it did in the previous connection, this is logged to indicate a network failure. The ephemeralId table is garbage-collected every hour, with entries older than an hour removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice idea, I like it.
I suggest breaking out the change to the close-listener into a separate PR. It deserves its own tests etc and in case of a bug it'd help to be able to bisect between these two parts of this change.
assert ch instanceof Netty4NioSocketChannel; | ||
NetUtils.tryEnsureReasonableKeepAliveConfig(((Netty4NioSocketChannel) ch).javaChannel()); | ||
setupPipeline(ch, false); | ||
} | ||
|
||
@Override | ||
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception { | ||
Netty4TcpChannel channel = ctx.channel().attr(CHANNEL_KEY).get(); | ||
if (cause instanceof Error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also use org.elasticsearch.ExceptionsHelper#maybeDieOnAnotherThread
here - if an Error
is thrown then we cannot continue and must not just quietly suppress it.
private void collectHistoryGarbage() { | ||
final long now = System.currentTimeMillis(); | ||
final long hour = 60 * 60 * 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be time-based. I'd rather we dropped the entry when the node is removed from the cluster membership (unexpected membership changes are already logged appropriately - see JoinReasonService
). If the node remains in the cluster membership then we should report an unexpected reconnection no matter how long it's been disconnected.
(If we were to make this time-based then we should make the timeout configurable via a setting rather than hard-coded at 1 hour).
In principle we could also just make it size-based, expiring the oldest entries to limit the size of the map to no more than (say) double the size of the cluster. That's what JoinReasonService
does, because there we cannot fall back on something stable like the cluster membership. But here we can, so I think we should use that precision.
This change makes network errors visible at higher levels (TcpChannel, Transport.Connection), by passing exceptions that cause channels to close onto the close listener's onFailure method.
This pull request has two commits:
the first enables exception visibility in the close listener onFailure() method, and fixes areas of the code/tests that anticipate the onFailure to never run. TcpChannels now have an onException() method for reporting an exception; this report has been added to several places where exceptions are received (the inbound handler, initialization). Some test code (FakeTcpChannel) was updated to match the real one, and update tested assumptions. Some production code asserted that the onFailure method should never fire; these have been changed to ignore it, or log the exception.
the second notes disconnect and reconnect events inside the ClusterConnectionManager, and logs situations where a reconnect has the same ephemeralId as a recent disconnect. This implies that the process did not roll, but the network cut out. Network exceptions are logged, and the connection history is GCed every hour.
Still to do:
Closes #125290