Spanner: TransactionContext hangs thread indefinitely #799
Comments
@msk-midas I transferred the issue to the Spanner-specific repo as we no longer use the other repo. @olavloite, if you could please take a look at this issue, that would be great. FYI @thiagotnunes
@msk-midas Thank you so much for the very detailed report. Your pseudocode example indicates that `transaction.readRow(tableName, Key.of(getHashForId(id), getTableId(), id, sid), Collections.singleton("json"));` is the statement that gets stuck. I have a couple of additional questions regarding this:
(Note: My questions above should not in any way be interpreted as an indication that any of the above is not supported; I'm just trying to figure out a way to reproduce it.)
I see that it's closed and you may well have found the solution! Just to answer your questions:
@msk-midas Thanks for your response, that is actually very interesting information. I would have expected that the statement that gets stuck would not be the first statement in the transaction. I will do some additional investigations based on that. |
…stuck (#807) If the first query or read operation of a read/write transaction would return UNAVAILABLE for the first element of the result stream, the transaction could get stuck. This was caused by the internal retry mechanism that would wait for the initial attempt to return a transaction, which was never returned as the UNAVAILABLE exception was internally handled by the result stream iterator. Fixes #799
I updated to 3.3.2 about 20 hours ago, and the error has not recurred (due to its random nature it's not guaranteed fixed, but previously I'd encountered it at least a couple of times in the same timespan). I believe the UNAVAILABLE fix has solved our problems.
@msk-midas Thank you for the update. It is very much appreciated.
If the initial read from a streaming operation fails when also trying to implicitly start a transaction, ensure the library properly recovers and completes the operation (using an explicit `BeginTransaction` call). This was definitely a gap in our test coverage; @thiagotnunes brought a Java customer issue (googleapis/java-spanner#799) to my attention, so I wanted to ensure C++ users were not susceptible.
…sts (#5718) If the initial read from a streaming operation fails when also trying to implicitly start a transaction, ensure the library properly recovers and completes the operation. Test both permanent and transient failures. They behave slightly differently: a permanent failure will cause an explicit `BeginTransaction`, whereas a transient one causes the RPC to be retried with the `begin` selector still set. This was definitely a gap in our test coverage; @thiagotnunes brought a Java customer issue (googleapis/java-spanner#799) to my attention, so I wanted to ensure C++ users were not susceptible.
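For readers unfamiliar with the protocol detail that commit message refers to, the sketch below contrasts the two request shapes involved. It uses the generated `com.google.spanner.v1` protobuf classes (shown in Java here for consistency with the rest of the thread); the session name and SQL are placeholders. The first shape starts the transaction inline via the `begin` selector on the first streaming request; the second falls back to an explicit `BeginTransaction` call and then references the returned transaction id.

```java
import com.google.protobuf.ByteString;
import com.google.spanner.v1.BeginTransactionRequest;
import com.google.spanner.v1.ExecuteSqlRequest;
import com.google.spanner.v1.TransactionOptions;
import com.google.spanner.v1.TransactionSelector;

public class BeginSelectorShapes {

  // Placeholder session name; a real one comes from CreateSession.
  private static final String SESSION =
      "projects/p/instances/i/databases/d/sessions/s";

  // Shape 1: the first streaming request of the transaction carries an inlined
  // `begin` selector, so the transaction is started as a side effect of the read.
  static ExecuteSqlRequest inlinedBeginRequest(String sql) {
    return ExecuteSqlRequest.newBuilder()
        .setSession(SESSION)
        .setSql(sql)
        .setTransaction(TransactionSelector.newBuilder()
            .setBegin(TransactionOptions.newBuilder()
                .setReadWrite(TransactionOptions.ReadWrite.getDefaultInstance())))
        .build();
  }

  // Shape 2: start the transaction with an explicit BeginTransaction call...
  static BeginTransactionRequest explicitBeginRequest() {
    return BeginTransactionRequest.newBuilder()
        .setSession(SESSION)
        .setOptions(TransactionOptions.newBuilder()
            .setReadWrite(TransactionOptions.ReadWrite.getDefaultInstance()))
        .build();
  }

  // ...and reference the returned transaction id in subsequent requests.
  static ExecuteSqlRequest requestWithTransactionId(String sql, ByteString transactionId) {
    return ExecuteSqlRequest.newBuilder()
        .setSession(SESSION)
        .setSql(sql)
        .setTransaction(TransactionSelector.newBuilder().setId(transactionId))
        .build();
  }
}
```

As the commit message above puts it, a transient failure is handled by retrying the first shape with the `begin` selector still set, while a permanent failure makes the library switch to the second shape with an explicit `BeginTransaction`.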
…to get stuck (#856)
* fix: UNAVAILABLE error on first query could cause transaction to get stuck (#807). If the first query or read operation of a read/write transaction would return UNAVAILABLE for the first element of the result stream, the transaction could get stuck. This was caused by the internal retry mechanism that would wait for the initial attempt to return a transaction, which was never returned as the UNAVAILABLE exception was internally handled by the result stream iterator. Fixes #799
* chore: re-formats source files to fix lint errors
* fix: removes unrelated changes
Co-authored-by: Knut Olav Løite <[email protected]>
This has been discussed with Google Cloud Support (case 26471342), and they have requested that I open an issue here.
Environment details
Steps to reproduce
We have a loop running in one of our corporate test VMs that updates a "last check-in made by machine" row every ten seconds. Let it run and it will randomly encounter non-recoverable freezes (the longest one we let run was over the holidays: the thread stayed frozen for 20 days). We have reproduced this on two separate machines multiple times; the average number of executions before encountering a freeze is around 3000 (sometimes 800, sometimes 6000). On a third machine we have not encountered it at all (14000 successful attempts and counting). The data written is minimal, essentially just a name and a timestamp. Normal communication time is a matter of milliseconds, so ten seconds is ample time between calls.
Code example
Rough pseudocode (this is part of a very large framework, but this is the item it happens on: a regular scheduled executor service ticking every 10 seconds, at a fixed rate rather than a fixed delay).
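The reporter's actual pseudocode is not reproduced above, so the following is only a minimal sketch of the loop as described: a single-threaded scheduled executor ticking at a fixed 10-second rate, each tick running a read/write transaction that reads the row with the `readRow` call quoted earlier and buffers an update. The table name, column names, and helper bodies (`CHECKIN_TABLE`, `getHashForId`, `getTableId`, `updateTimestamp`) are assumptions for illustration only, not the reporter's code.

```java
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.Key;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.Struct;

import java.util.Collections;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CheckinLoop {

  // Hypothetical table and column names; the real schema is not given in the issue.
  private static final String CHECKIN_TABLE = "machine_checkins";

  private final DatabaseClient client;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public CheckinLoop(DatabaseClient client) {
    this.client = client;
  }

  public void start(String id, String sid) {
    // Fixed rate (not fixed delay), every 10 seconds, as described in the report.
    scheduler.scheduleAtFixedRate(() -> checkin(id, sid), 0, 10, TimeUnit.SECONDS);
  }

  private void checkin(String id, String sid) {
    client.readWriteTransaction().run(transaction -> {
      // This is the statement reported to hang indefinitely:
      // the first read of the read/write transaction.
      Struct row = transaction.readRow(
          CHECKIN_TABLE,
          Key.of(getHashForId(id), getTableId(), id, sid),
          Collections.singleton("json"));

      String json = (row == null) ? "{}" : row.getString("json");

      // Buffer an update with the new check-in timestamp (illustrative only;
      // the key column names are hypothetical).
      transaction.buffer(Mutation.newInsertOrUpdateBuilder(CHECKIN_TABLE)
          .set("hash").to(getHashForId(id))
          .set("table_id").to(getTableId())
          .set("id").to(id)
          .set("sid").to(sid)
          .set("json").to(updateTimestamp(json))
          .build());
      return null;
    });
  }

  // Hypothetical helpers standing in for the framework code referenced in the thread.
  private long getHashForId(String id) { return id.hashCode(); }
  private long getTableId() { return 1L; }
  private String updateTimestamp(String json) { return json; }
}
```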
Thread dump
Any additional information below
We have repeatedly reproduced this against the regular public APIs, but we have the impression that it happens more often (for reproduction purposes) if you add