-
Notifications
You must be signed in to change notification settings - Fork 3.9k
concurrency: mark some unreplicated locks as ineligible for export #145532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
concurrency: mark some unreplicated locks as ineligible for export #145532
Conversation
@miraradeva @arulajmani I think we are leaning away from this towards a more fundamental fix. But, I figured I'd put this up in a slightly cleaner form just so we could consider it since, I think something like this might still be required along some of the implementation paths we've discussed. |
2ecf9dd
to
272c1fb
Compare
272c1fb
to
86b31cd
Compare
86b31cd
to
4d3b62f
Compare
Fixed up the race condition in the test here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me overall. But I wonder if it's possible for the marking-as-ineligible mechanism to race with the lock-export mechanism. E.g. QueryIntent
finds a missing intent and is just about to mark the lock in the lock table, while a lease transfer kicks off a lock export. If it's not possible, we should mention it somewhere.
Reviewed 2 of 2 files at r1, 8 of 8 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)
-- commits
line 42 at r3:
I assume you'll squash this commit into one of the previous ones?
pkg/kv/kvserver/batcheval/cmd_query_intent.go
line 162 at r3 (raw file):
res := result.Result{} if !reply.FoundIntent && args.ErrorIfMissing { l := roachpb.MakeLockAcquisition(args.Txn, args.Key, lock.Replicated, args.Strength, args.IgnoredSeqNums)
It's a bit unintuitive why we're returning a replicated lock acquisition here, given that we'll use this response to mark unreplicated locks in the lock table. Wouldn't it make more sense for this acquisition to be lock.Intent
?
pkg/kv/kvclient/kvcoord/dist_sender_server_test.go
line 4791 at r2 (raw file):
return err case 2: // My our second retry, txrecovery should have correctly aborted this
nit: "On our second retry ..."
pkg/kv/kvserver/concurrency/concurrency_control.go
line 706 at r2 (raw file):
// MarkIneligibleForExport marks any locks held by this transaction on the // same key at greater or equal strength as ineligible for exported from the
nit: "ineligible for export"
pkg/kv/kvclient/kvcoord/dist_sender_server_test.go
line 4707 at r1 (raw file):
getReq, ok := fArgs.Req.Requests[0].GetInner().(*kvpb.GetRequest) // Only fail replication on the first retry. if ok && getReq.KeyLockingDurability == lock.Replicated && retryNum.Load() == 1 {
Is there a reason we need to keep track of these retries separately? Wouldn't the txn epoch tell us that?
pkg/kv/kvclient/kvcoord/dist_sender_server_test.go
line 4733 at r1 (raw file):
tc.TransferRangeLeaseOrFatal(t, desc, tc.Target(idx)) } // Make a db with transaction heartbeating disabled. This ensures that we
Is it important that heartbeating is disabled only for this db? If it was ok to disable it in general, we could just do this, right?
kvcoord.TxnHeartbeatInterval.Override(ctx, &s.ClusterSettings().SV, -1)
pkg/kv/kvclient/kvcoord/dist_sender_server_test.go
line 4786 at r1 (raw file):
err = txn.Commit(ctx) require.Error(t, err) //require.ErrorContains(t, err, "IntentMissingError")
Seems like we should be able to uncomment this?
Good question. What I think makes this safe is that both a lease transfer and subsume acquire non-MVCC write latches over the entire range's span, so when they run they really are running alone. I wonder if we could even assert that this is the case somewhere? |
Yeah, I see. And |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @miraradeva)
Previously, miraradeva (Mira Radeva) wrote…
I assume you'll squash this commit into one of the previous ones?
Yeah, I'll squash everything into a single commit after review.
pkg/kv/kvserver/batcheval/cmd_query_intent.go
line 162 at r3 (raw file):
Previously, miraradeva (Mira Radeva) wrote…
It's a bit unintuitive why we're returning a replicated lock acquisition here, given that we'll use this response to mark unreplicated locks in the lock table. Wouldn't it make more sense for this acquisition to be
lock.Intent
?
This represents the missing lock which only could have been replicated. The missing lock might not be an intent, it could be a shared or exclusive lock as well.
pkg/kv/kvclient/kvcoord/dist_sender_server_test.go
line 4707 at r1 (raw file):
Previously, miraradeva (Mira Radeva) wrote…
Is there a reason we need to keep track of these retries separately? Wouldn't the txn epoch tell us that?
Good call. Done. I think this was probably originally done because when the txn gets correctly aborted then on retry you are actually on a new txn, but this easily detected via the changed txnID.
pkg/kv/kvclient/kvcoord/dist_sender_server_test.go
line 4733 at r1 (raw file):
Previously, miraradeva (Mira Radeva) wrote…
Is it important that heartbeating is disabled only for this db? If it was ok to disable it in general, we could just do this, right?
kvcoord.TxnHeartbeatInterval.Override(ctx, &s.ClusterSettings().SV, -1)
I suppose without heartbeating gnerally, if this test ends up running for a long time somehow, some internal transactions could be aborted. Everything should handle that correctly (since it can presumably happen in production) but it could cause log noise that would complicated debugging.
pkg/kv/kvclient/kvcoord/dist_sender_server_test.go
line 4786 at r1 (raw file):
Previously, miraradeva (Mira Radeva) wrote…
Seems like we should be able to uncomment this?
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @arulajmani and @stevendanna)
pkg/kv/kvserver/batcheval/cmd_query_intent.go
line 162 at r3 (raw file):
Previously, stevendanna (Steven Danna) wrote…
This represents the missing lock which only could have been replicated. The missing lock might not be an intent, it could be a shared or exclusive lock as well.
Got it.
pkg/kv/kvclient/kvcoord/dist_sender_server_test.go
line 4733 at r1 (raw file):
Previously, stevendanna (Steven Danna) wrote…
I suppose without heartbeating gnerally, if this test ends up running for a long time somehow, some internal transactions could be aborted. Everything should handle that correctly (since it can presumably happen in production) but it could cause log noise that would complicated debugging.
Makes sense.
ab9c485
to
6deea66
Compare
@arulajmani Not sure if you wanted a second look at this. But I'll probably merge this after a rebase tomorrow. If you do spot any follow-ups post merge, let me know. |
6deea66
to
5510302
Compare
Once a lock has been reported as missing via QueryIntent, it is important that it stays missing. As described in cockroachdb#145458, if a lock that has been previously reported as missing is somehow then found via a subsequent QueryIntent request, this can result in a transaction being erroneously committed by the transaction recovery process. When unreplicated lock reliability is enabled, Lease and Merge use ExportUnreplicatedLocks to move unreplicated locks from the lock table into durable storage. However, these locks may include unreplicated locks that "cover" a replicated lock that was previously reported as missing, exactly what we must avoid. Here, we solve that by extending QueryIntent's evaluation to mark any locks it reports as missing as ineligible for export. Fixes cockroachdb#145458 Release note: None Co-authored-by: Arul Ajmani <[email protected]>
Release note: None Epic: none
5510302
to
e390c70
Compare
bors r=miraradeva |
145532: concurrency: mark some unreplicated locks as ineligible for export r=miraradeva a=stevendanna Once a lock has been reported as missing via QueryIntent, it is important that it stays missing. As described in #145458, if a lock that has been previously reported as missing is somehow then found via a subsequent QueryIntent request, this can result in a transaction being erroneously committed by the transaction recovery process. When unreplicated lock reliability is enabled, Lease and Merge use ExportUnreplicatedLocks to move unreplicated locks from the lock table into durable storage. However, these locks may include unreplicated locks that "cover" a replicated lock that was previously reported as missing, exactly what we must avoid. Here, we solve that by extending QueryIntent's evaluation to mark any locks it reports as missing as ineligible for export. Fixes #145458 Release note: None Co-authored-by: Steven Danna <[email protected]>
Build failed: |
145532: concurrency: mark some unreplicated locks as ineligible for export r=miraradeva a=stevendanna Once a lock has been reported as missing via QueryIntent, it is important that it stays missing. As described in #145458, if a lock that has been previously reported as missing is somehow then found via a subsequent QueryIntent request, this can result in a transaction being erroneously committed by the transaction recovery process. When unreplicated lock reliability is enabled, Lease and Merge use ExportUnreplicatedLocks to move unreplicated locks from the lock table into durable storage. However, these locks may include unreplicated locks that "cover" a replicated lock that was previously reported as missing, exactly what we must avoid. Here, we solve that by extending QueryIntent's evaluation to mark any locks it reports as missing as ineligible for export. Fixes #145458 Release note: None Co-authored-by: Steven Danna <[email protected]>
Build failed: |
The universe doesn't want this fix. And in some sense, the universe is correct. Internal build system failure:
bors retry |
Once a lock has been reported as missing via QueryIntent, it is
important that it stays missing. As described in #145458, if a lock
that has been previously reported as missing is somehow then found via
a subsequent QueryIntent request, this can result in a transaction
being erroneously committed by the transaction recovery process.
When unreplicated lock reliability is enabled, Lease and Merge use
ExportUnreplicatedLocks to move unreplicated locks from the lock table
into durable storage. However, these locks may include unreplicated
locks that "cover" a replicated lock that was previously reported as
missing, exactly what we must avoid.
Here, we solve that by extending QueryIntent's evaluation to mark any
locks it reports as missing as ineligible for export.
Fixes #145458
Release note: None