Occasional NullPointerException in io.grpc.internal.RetriableStream.drain(RetriableStream.java:279) when using hedgingPolicy #9185
Comments
Interesting. I think this is a regression from cbda32a. It seems...
This is a race, and it seems it could only be avoided by avoiding hedging. Retries would still be safe.
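For anyone needing a stopgap, here is a minimal sketch of what replacing the hedgingPolicy with a plain retryPolicy in the service config could look like; the service name, attempt count, backoff values, and status codes below are illustrative placeholders, not values from this thread:

```kotlin
// Sketch only: the service name and all values are placeholders.
// The map is passed to ManagedChannelBuilder.defaultServiceConfig(...) in place of a hedging config.
val retryServiceConfig: Map<String, Any> = mapOf(
    "methodConfig" to listOf(
        mapOf(
            "name" to listOf(mapOf("service" to "my.package.MyService")),
            "retryPolicy" to mapOf(
                "maxAttempts" to 3.0,            // service-config numbers must be doubles
                "initialBackoff" to "0.1s",
                "maxBackoff" to "1s",
                "backoffMultiplier" to 2.0,
                "retryableStatusCodes" to listOf("UNAVAILABLE")
            )
        )
    )
)
```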
I see. Thanks @ejona86 for triaging this so quickly. Hedging is working really well for us to reduce tail latency. I'm hoping we can continue using it and help resolve the race condition.
Actually... what was the impact to the RPC in this case? It looks like this might just be logged but the RPC could still complete successfully. Which means things may mostly work if you just ignore the log. There is a risk of a memory leak developing, but since it didn't end up starting the stream, I think gRPC wouldn't leak any memory when using the Netty transport and a connection is available. If there is no connection available, it might leak.
This was in the context of a workaround for the problem.
From what I can tell, all requests are succeeding. This is just logged on the side. That said, I did see a single instance of a buffer leak message whose stack trace was in the hedging draining code.
Since writing a message has to be done after start(), it feels like that buffer leak should be unrelated. But that stack trace is in the same hedging draining code, which does make it suspiciously related.
Yeah! I do see several TODOs in the class ("TODO(b/145386688)" - d3c77f2) that may be related to this race condition 🤷? Examples:
grpc-java/core/src/main/java/io/grpc/internal/RetriableStream.java, lines 451 to 453 in 5bb721e
grpc-java/core/src/main/java/io/grpc/internal/RetriableStream.java, lines 168 to 169 in 5bb721e
grpc-java/core/src/main/java/io/grpc/internal/RetriableStream.java, lines 178 to 179 in 5bb721e
grpc-java/core/src/main/java/io/grpc/internal/RetriableStream.java, lines 412 to 413 in 5bb721e
I don't think those guarded-by annotations are that important. All those locks are the same instance, just aliases. So it is more just that the code is written in a way that isn't able to be verified by tooling.
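For illustration only (this is not the RetriableStream source), a sketch of the lock-aliasing pattern being described, assuming javax.annotation.concurrent.GuardedBy is on the classpath: the locking is correct at runtime because every name refers to the same lock object, but a static checker that matches guards by name cannot prove it.

```kotlin
import javax.annotation.concurrent.GuardedBy

// Hypothetical example, not grpc-java code.
class DrainState {
    private val lock = Any()

    @field:GuardedBy("lock")
    private var drained = false

    fun markDrained() {
        val alias = lock          // same lock instance, different name
        synchronized(alias) {     // holds the right lock at runtime...
            drained = true        // ...but tooling keyed on the literal "lock" cannot verify it
        }
    }
}
```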
The problem was one hedge was committed before another had drained start(). This was not testable because HedgingRunnable checks whether scheduledHedgingRef is cancelled, which is racy, but there's no way to deterministically trigger either race.

The same problem couldn't be triggered with retries because only one attempt will be draining at a time. Retries with cancellation also couldn't trigger it, for the surprising reason that the noop stream used in cancel() wasn't considered drained.

This commit marks the noop stream as drained with cancel(), which allows memory to be garbage collected sooner and exposes the race for tests. That then showed the stream as hanging, because inFlightSubStreams wasn't being decremented.

Fixes grpc#9185
I believe this issue became more serious in 1.53. There was a change in 1.53 that fixed some lifecycle issues, but when this is triggered I think it would cause the RPC to hang and never receive its final status. So if you are seeing this issue, you may need to avoid 1.53 and 1.54 until a patch is out.
What version of gRPC-Java are you using?
Version 1.42.2
What is your environment?
Linux
What did you expect to see?
I'm using the hedging retry policy via configuration and occasionally see a NullPointerException pop up.
Here's a snippet of the Kotlin code that configures the hedging policy with an 85ms hedging delay:
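The snippet itself isn't reproduced above, so as an illustration, here is a minimal sketch of such a configuration; only the 85 ms hedgingDelay comes from the description, while the service name, maxAttempts, nonFatalStatusCodes, and endpoint are assumed placeholders:

```kotlin
import io.grpc.ManagedChannelBuilder

// Illustrative sketch only; everything except the 85 ms hedgingDelay is a placeholder.
val hedgingServiceConfig: Map<String, Any> = mapOf(
    "methodConfig" to listOf(
        mapOf(
            "name" to listOf(mapOf("service" to "my.package.MyService")),
            "hedgingPolicy" to mapOf(
                "maxAttempts" to 3.0,               // service-config numbers must be doubles
                "hedgingDelay" to "0.085s",         // 85 ms between hedged attempts
                "nonFatalStatusCodes" to listOf("UNAVAILABLE")
            )
        )
    )
)

val channel = ManagedChannelBuilder.forAddress("example.com", 443)
    .defaultServiceConfig(hedgingServiceConfig)
    .enableRetry()   // retry/hedging support is on by default in recent releases; this makes it explicit
    .build()
```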
What did you see instead?
An occasional NullPointerException thrown from io.grpc.internal.RetriableStream.drain(RetriableStream.java:279).
Steps to reproduce the bug
Enable the hedging policy configured above.