Spanner: TransactionContext hangs thread indefinitely #799
Comments
@msk-midas I transferred the issue to the Spanner-specific repo as we no longer use the other repo. @olavloite, if you could please take a look at this issue, that would be great. FYI @thiagotnunes
@msk-midas Thank you so much for the very detailed report. Your pseudocode example indicates that `transaction.readRow(tableName, Key.of(getHashForId(id), getTableId(), id, sid), Collections.singleton("json"));` is the statement that gets stuck. I have a couple of additional questions regarding this:
(Note: My questions above should not in any way be interpreted as an indication that any of the above is not supported; I'm just trying to figure out a way to reproduce it.)
I see that it's closed and you may well have found the solution! Just to answer your questions:
@msk-midas Thanks for your response, that is actually very interesting information. I would have expected that the statement that gets stuck would not be the first statement in the transaction. I will do some additional investigations based on that. |
…stuck (#807) If the first query or read operation of a read/write transaction would return UNAVAILABLE for the first element of the result stream, the transaction could get stuck. This was caused by the internal retry mechanism that would wait for the initial attempt to return a transaction, which was never returned as the UNAVAILABLE exception was internally handled by the result stream iterator. Fixes #799
I updated to 3.3.2 about 20 hours ago, and the error has not recurred (due to its random nature it's not guaranteed fixed, but previously I'd encountered it at least a couple of times in the same timespan). I believe the UNAVAILABLE fix has solved our problems.
@msk-midas Thank you for the update. It is very much appreciated.
If the initial read from a streaming operation fails when also trying to implicitly start a transaction, ensure the library properly recovers and completes the operation (using an explicit `BeginTransaction` call). This was definitely a gap in our test coverage; @thiagotnunes brought a Java customer issue (googleapis/java-spanner#799) to my attention, so I wanted to ensure C++ users were not susceptible.
…sts (#5718) If the initial read from a streaming operation fails when also trying to implicitly start a transaction, ensure the library properly recovers and completes the operation. Test both permanent and transient failures. They behave slightly differently: a permanent failure will cause an explicit `BeginTransaction`, whereas a transient one causes the RPC to be retried with the `begin` selector still set. This was definitely a gap in our test coverage; @thiagotnunes brought a Java customer issue (googleapis/java-spanner#799) to my attention, so I wanted to ensure C++ users were not susceptible.
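For readers unfamiliar with the protocol detail that commit message refers to, the sketch below contrasts the two request shapes involved. It uses the generated `com.google.spanner.v1` protobuf classes (shown in Java here for consistency with the rest of the thread); the session name and SQL are placeholders. The first shape starts the transaction inline via the `begin` selector on the first streaming request; the second falls back to an explicit `BeginTransaction` call and then references the returned transaction id.

```java
import com.google.protobuf.ByteString;
import com.google.spanner.v1.BeginTransactionRequest;
import com.google.spanner.v1.ExecuteSqlRequest;
import com.google.spanner.v1.TransactionOptions;
import com.google.spanner.v1.TransactionSelector;

public class BeginSelectorShapes {

  // Placeholder session name; a real one comes from CreateSession.
  private static final String SESSION =
      "projects/p/instances/i/databases/d/sessions/s";

  // Shape 1: the first streaming request of the transaction carries an inlined
  // `begin` selector, so the transaction is started as a side effect of the read.
  static ExecuteSqlRequest inlinedBeginRequest(String sql) {
    return ExecuteSqlRequest.newBuilder()
        .setSession(SESSION)
        .setSql(sql)
        .setTransaction(TransactionSelector.newBuilder()
            .setBegin(TransactionOptions.newBuilder()
                .setReadWrite(TransactionOptions.ReadWrite.getDefaultInstance())))
        .build();
  }

  // Shape 2: start the transaction with an explicit BeginTransaction call...
  static BeginTransactionRequest explicitBeginRequest() {
    return BeginTransactionRequest.newBuilder()
        .setSession(SESSION)
        .setOptions(TransactionOptions.newBuilder()
            .setReadWrite(TransactionOptions.ReadWrite.getDefaultInstance()))
        .build();
  }

  // ...and reference the returned transaction id in subsequent requests.
  static ExecuteSqlRequest requestWithTransactionId(String sql, ByteString transactionId) {
    return ExecuteSqlRequest.newBuilder()
        .setSession(SESSION)
        .setSql(sql)
        .setTransaction(TransactionSelector.newBuilder().setId(transactionId))
        .build();
  }
}
```

As the commit message above puts it, a transient failure is handled by retrying the first shape with the `begin` selector still set, while a permanent failure makes the library switch to the second shape with an explicit `BeginTransaction`.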
…to get stuck (#856)
* fix: UNAVAILABLE error on first query could cause transaction to get stuck (#807). If the first query or read operation of a read/write transaction would return UNAVAILABLE for the first element of the result stream, the transaction could get stuck. This was caused by the internal retry mechanism that would wait for the initial attempt to return a transaction, which was never returned as the UNAVAILABLE exception was internally handled by the result stream iterator. Fixes #799
* chore: re-formats source files to fix lint errors
* fix: removes unrelated changes
Co-authored-by: Knut Olav Løite <[email protected]>
This has been discussed with Google Cloud Support (case 26471342), and they have requested that I open an issue here.
Environment details
Steps to reproduce
We have a loop running in one of our corporate test VMs that updates a "last check-in made by machine" row every ten seconds. Let it run and it will randomly encounter non-recoverable freezes (the longest one we let run was over the holidays: the thread stayed frozen for 20 days). We have reproduced this on two separate machines multiple times; the average number of executions before encountering a freeze is around 3000 (sometimes 800, sometimes 6000). On a third machine we have not encountered it at all (14000 successful attempts and counting). The data written is minimal, essentially just a name and a timestamp. Normal communication time is a matter of milliseconds, so ten seconds is ample time between calls.
Code example
Rough pseudocode (this is part of a very large framework, but this is the item it happens on: a regular scheduled executor service ticking every 10 seconds, at a fixed rate rather than a fixed delay).
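The reporter's actual pseudocode is not reproduced above, so the following is only a minimal sketch of the loop as described: a single-threaded scheduled executor ticking at a fixed 10-second rate, each tick running a read/write transaction that reads the row with the `readRow` call quoted earlier and buffers an update. The table name, column names, and helper bodies (`CHECKIN_TABLE`, `getHashForId`, `getTableId`, `updateTimestamp`) are assumptions for illustration only, not the reporter's code.

```java
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.Key;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.Struct;

import java.util.Collections;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CheckinLoop {

  // Hypothetical table and column names; the real schema is not given in the issue.
  private static final String CHECKIN_TABLE = "machine_checkins";

  private final DatabaseClient client;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public CheckinLoop(DatabaseClient client) {
    this.client = client;
  }

  public void start(String id, String sid) {
    // Fixed rate (not fixed delay), every 10 seconds, as described in the report.
    scheduler.scheduleAtFixedRate(() -> checkin(id, sid), 0, 10, TimeUnit.SECONDS);
  }

  private void checkin(String id, String sid) {
    client.readWriteTransaction().run(transaction -> {
      // This is the statement reported to hang indefinitely:
      // the first read of the read/write transaction.
      Struct row = transaction.readRow(
          CHECKIN_TABLE,
          Key.of(getHashForId(id), getTableId(), id, sid),
          Collections.singleton("json"));

      String json = (row == null) ? "{}" : row.getString("json");

      // Buffer an update with the new check-in timestamp (illustrative only;
      // the key column names are hypothetical).
      transaction.buffer(Mutation.newInsertOrUpdateBuilder(CHECKIN_TABLE)
          .set("hash").to(getHashForId(id))
          .set("table_id").to(getTableId())
          .set("id").to(id)
          .set("sid").to(sid)
          .set("json").to(updateTimestamp(json))
          .build());
      return null;
    });
  }

  // Hypothetical helpers standing in for the framework code referenced in the thread.
  private long getHashForId(String id) { return id.hashCode(); }
  private long getTableId() { return 1L; }
  private String updateTimestamp(String json) { return json; }
}
```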
Thread dump
Any additional information below
We have repeatedly reproduced this against the regular public APIs, but we have the impression that it happens more often (for reproduction purposes) if you add