
Watch hangs forever #2432


Open
tokaplan opened this issue May 13, 2025 · 10 comments

@tokaplan commented May 13, 2025

Can anyone confirm what the new behavior of watch() is in 1.2.0 given this fix: #2367? I'm on 1.0.0 and I'm seeing my watches randomly hang forever, ignoring timeoutInSeconds and everything else. I've seen multiple pods do this at the same time, so I assume it's triggered by networking issues. What is the new behavior in that scenario in 1.2.0? Is there now a guarantee that the watch will fail within 30 seconds of such a connection breakage? Can I trust it to work now? I almost shipped this to all my customers today and only noticed it at the last moment by pure chance; it happens rarely.
@cjihrig @rossanthony @brendandburns
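
For reference, a minimal sketch of the call pattern being discussed, assuming the 1.x Watch API and that "timeoutInSeconds" refers to the upstream `timeoutSeconds` watch query parameter; the path and values are placeholders, not anyone's actual code:

```ts
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const watch = new k8s.Watch(kc);

const controller = await watch.watch(
  '/api/v1/namespaces/default/pods',    // placeholder path
  { timeoutSeconds: 240 },               // ask the API server to end the watch after 240s
  (type, obj) => { /* ADDED / MODIFIED / DELETED events arrive here */ },
  (err) => { /* done callback: the watch ended or the connection broke */ },
);
// In 1.x the resolved value can be used to cancel the request, e.g. controller.abort().
```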

@brendandburns (Contributor) commented May 13, 2025

It should break the connection based on #2367, but you should definitely test and validate. Watch behavior on long-hanging connections is very tricky to simulate and test well in general.

Please let us know if you find any issues.

@rossanthony (Contributor)

@tokaplan this sounds a bit different from the behavior we were seeing with 1.0.0. What we observed was that the informer callback would keep disconnecting and reconnecting; I think it was timing out after the default fetch timeout of 30 secs. This might not sound like a big deal, but at the time we had it running in various apps across a few of our non-prod clusters, during a period when we were doing some heavy load testing, so we had things scaled out with lots of pods.

One of our systems engineers happened to notice in the EKS control plane dashboard, under a section called "Top Talkers", that the number of watch requests was off the charts. We're talking millions per day - notice the counts in the right-hand column...

[Image: EKS "Top Talkers" dashboard showing watch request counts in the right-hand column]

Since 1.2.0 was cut, I've had it running in an ephemeral environment (scaled out to just 3 pods) and have no issues to report based on what I'm seeing thus far. Based on some debug logging I added to our implementation of an informer that watches for changes to the k8s secrets, I can see each pod is making a new watch call once every 30 mins, which aligns with the change in #2367, specifically the socket keep-alive setting here: https://github.com/kubernetes-client/javascript/pull/2367/files#diff-8af1000c89b03ced37f439c61c5696c45e1e83a70cc07182feef6595123f0badR60
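
For reference, a rough sketch of an informer set up this way - not the actual implementation - assuming the 1.x object-style API client and the informer's 'error'/'connect' events; the namespace and the 5s restart delay are placeholders:

```ts
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const coreApi = kc.makeApiClient(k8s.CoreV1Api);

// Informer over Secrets in a single namespace.
const path = '/api/v1/namespaces/default/secrets';
const informer = k8s.makeInformer(
  kc,
  path,
  () => coreApi.listNamespacedSecret({ namespace: 'default' }),
);

informer.on('update', (secret) => {
  console.log('secret updated:', secret.metadata?.name);
});

// Restart the informer when it errors out, after a short delay.
informer.on('error', (err) => {
  console.error('informer error, restarting in 5s', err);
  setTimeout(() => informer.start(), 5000);
});

// Log each (re)connect so the reconnect cadence is visible in the logs.
informer.on('connect', () => console.log('informer (re)connected', new Date().toISOString()));

informer.start();
```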

@rossanthony (Contributor) commented May 14, 2025

Actually, the timeout setting added in #2367 was 30000ms (30 secs), with the keepAlive probe set with an initial delay also of 30 secs, so I'm not sure what is now causing the watch to reconnect every 30 mins. In any case it's much better, although the socket settings could potentially be tweaked to keep connections open longer. Or maybe it's a max idle timeout on the kube API side that is set to 30 mins.
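
For context, these are plain Node.js socket/agent knobs rather than anything specific to this client. A generic sketch of a 30s idle timeout plus TCP keep-alive probes with a 30s initial delay (the values mentioned above) could look like this; the API server URL and the destroy-on-timeout choice are placeholders:

```ts
import * as https from 'node:https';

// Reuse sockets, send the first TCP keep-alive probe after 30s of idleness,
// and have the socket emit 'timeout' after 30s with no traffic.
const agent = new https.Agent({
  keepAlive: true,
  keepAliveMsecs: 30_000,
  timeout: 30_000,
});

const req = https.request(
  'https://example-apiserver:6443/api/v1/namespaces/default/pods?watch=true', // placeholder
  { agent },
  (res) => res.on('data', () => { /* each chunk is a watch event */ }),
);

// 'timeout' by itself does not close anything; the owner of the request has to
// decide whether to abort and retry, which is where client behavior differs.
req.on('timeout', () => req.destroy(new Error('watch socket idle for 30s')));
req.end();
```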

@brendandburns (Contributor)

FWIW, often it's some sort of timeout on the long-running TCP connection, or a NAT/load balancer somewhere in the middle recycling and terminating the TCP connection. Long-running connections and networking are really straightforward in theory and ridiculously convoluted and bespoke in practice.

@tokaplan (Author) commented May 14, 2025

@rossanthony my watches were correctly disconnecting at the timeout (if no timeout is specified, the watch fairly randomly disconnects around 5 minutes, which I assume is the default set on the Kube API side), but the problem was that very rarely the call would just hang forever - not receiving any callbacks from the API and not returning. That was completely breaking us, of course. I'm just trying to ensure this can't happen anymore, and unless someone can tell me that, I'm either moving away from watches or introducing a watchdog that keeps track of the time since the last call and crashes the process (so k8s recreates the pod) if it detects the zombie state. I don't see any other way; I can't have this zombie state happen randomly.
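
A watchdog along those lines could be as small as the sketch below; the helper name and thresholds are hypothetical, and it assumes the watch code calls markProgress() whenever a watch starts or delivers an event:

```ts
// Timestamp of the last sign of life from the watch.
let lastProgress = Date.now();

// Call this from the watch event callback and whenever a new watch starts.
export function markProgress(): void {
  lastProgress = Date.now();
}

const WATCH_TIMEOUT_MS = 240_000;                 // the timeoutSeconds used on the watch
const ZOMBIE_THRESHOLD_MS = 2 * WATCH_TIMEOUT_MS; // generous margin before declaring a hang

setInterval(() => {
  const silentFor = Date.now() - lastProgress;
  if (silentFor > ZOMBIE_THRESHOLD_MS) {
    console.error(`watch silent for ${silentFor}ms, assuming zombie state, exiting`);
    process.exit(1); // the pod's restart policy brings up a fresh process
  }
}, 10_000).unref();   // don't keep the process alive just for the watchdog
```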

@rossanthony (Contributor) commented May 14, 2025

@tokaplan interesting, I've not observed this behavior of it hanging like that. Does it eventually time out the connection and log an error? I'm curious how I would tell if we have the same issue. The particular implementation I've been testing in our non-prod env with 1.2.0 is not listening to events very often, because it's a mechanism for caching auth0 tokens in etcd. They get refreshed once per hour, and then all the pods running in that particular namespace listen to update events and set the refreshed token in their memory cache. I can try running a sustained load test on it tomorrow and see if I can reproduce what you're describing.

@tokaplan (Author) commented May 14, 2025

@rossanthony It never times out; it hangs forever. The way I noticed is that we have a watchdog running in a background loop that logs the seconds elapsed since our most recent watch started, and that number just started climbing.
I've seen two pods go into that state at exactly the same time, so there had to be some common cause.

@tokaplan (Author)

Here's what I'm seeing (1.0.0):

  • with 300 pods running (each constantly watching with a 240-second timeout on the watch: one watch ends or fails and the next one immediately begins - a loop like the sketch after this list), I saw 5 pods (out of 300) hang forever within a span of 48 hours.
  • if I change the watch timeout from 240 seconds to 5 seconds, I don't see any pods hang - no repro within 48 hours.
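
The restart loop in the first bullet might look roughly like this sketch, assuming the 1.x Watch API where the done callback fires when the watch ends or errors; the path, timeout value, and logging are placeholders:

```ts
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const watch = new k8s.Watch(kc);

async function watchLoop(path: string): Promise<void> {
  for (;;) {
    try {
      await new Promise<void>((resolve, reject) => {
        watch.watch(
          path,
          { timeoutSeconds: 240 },                  // server-side watch timeout (placeholder)
          (type, obj) => { /* handle ADDED / MODIFIED / DELETED */ },
          (err) => (err ? reject(err) : resolve()), // done: the watch ended
        ).catch(reject);                            // the request failed to start at all
      });
    } catch (err) {
      console.error('watch ended with error, restarting immediately', err);
    }
    // loop around and start the next watch right away
  }
}

void watchLoop('/api/v1/namespaces/default/pods'); // placeholder path
```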

@rossanthony (Contributor)

This could be why I've not run into this infinite hanging issue, if it's only happening to ~5 pods out of 300 in a 48hr window. I've been testing it with 3 pods; we don't really have the capacity in our cluster, or the need, to scale up that much - even in prod we're running ~10 pods in each region.

Have you tried with 1.2.0?

@tokaplan (Author) commented May 16, 2025

@rossanthony not yet, I need to wait a few days to establish baselines. I will soon and report back. Keep in mind though that I did see both pods get into the bad state at the same time - and we only had 2 pods there. So it looks like, under the right conditions, the repro rate could be much higher.
