Minimize the blast radius of lock-counter bugs #7

corhere · 2025-05-23T16:08:46Z

Recycling objects with a sync.Pool risks amplifying the impact of bugs. If the object is not reset to be functionally indistinguishable from a newly-constructed one before being returned to the pool, the unlucky recipient of the recycled object could misbehave through no fault of its own. And what's worse, such bugs would likely manifest as Heisenbugs as they would only happen when objects returned to the pool are being reused, i.e. a program with lots of concurrency and lots of lock churn.

Make the pool optimization functionally transparent so the impact of counter bugs does not extend beyond the scope it would have had without the pool optimization. Defensively reset the lock counters before returning them to the pool and after taking them from the pool. And add testing-only code which asserts that the invariants are upheld to ensure that the defensive coding is not papering over real bugs.

Copilot

Pull Request Overview

This PR adds defensive logic around recycling lock counter objects via sync.Pool to ensure their internal counters are always zeroed and to panic in test mode if invariants are violated.

Introduce rwPoolPut to validate and reset rwlockCtr counters before putting back into the pool.
Wrap pool-Get in both get and Lock methods with test-mode checks and explicit zeroing of counters.
Replace direct sync.Pool.Put calls in Unlock/RUnlock (and in MutexMap) with the new defensive helper.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
rwmutexmap.go	Added `rwPoolPut` wrapper, replaced direct `Put` calls, and added test-only counter invariant checks around pool `Get`/`Put`.
mutexmap.go	Added test-only panic checks and defensive `waiters` resets in `Lock`/`Unlock`, and replaced direct pool `Put` with defensive cleanup.

Comments suppressed due to low confidence (3)

rwmutexmap.go:28

There are no tests exercising the new rwPoolPut behavior (e.g., confirming counters are zeroed). Add unit tests to explicitly cover both the normal path and panic path in test mode.

func rwPoolPut(nameLock *rwlockCtr) {

rwmutexmap.go:34

The call to testing.Testing() will fail to compile because testing.Testing is not part of Go's testing API. Consider using a build tag or an injectable flag to enable test-only checks instead of importing testing in production code.

if testing.Testing() {

mutexmap.go:43

Similar to the above, testing.Testing() does not exist in the testing package. Remove this import and guard or replace it with a valid mechanism for test-only assertions.

if testing.Testing() {

rwmutexmap.go

Recycling objects with a sync.Pool risks amplifying the impact of bugs. If the object is not reset to be functionally indistinguishable from a newly-constructed one before being returned to the pool, the unlucky recipient of the recycled object could misbehave through no fault of its own. And what's worse, such bugs would likely manifest as Heisenbugs as they would only happen when objects returned to the pool are being reused, i.e. a program with lots of concurrency and lots of lock churn. Make the pool optimization functionally transparent so the impact of counter bugs does not extend beyond the scope it would have had without the pool optimization. Defensively reset the lock counters before returning them to the pool and after taking them from the pool. And add testing-only code which asserts that the invariants are upheld to ensure that the defensive coding is not papering over real bugs. Signed-off-by: Cory Snider <[email protected]>

cpuguy83 · 2025-05-23T17:12:45Z

istesting.go

+import "testing"
+
+func isTesting() bool {
+	return testing.Testing()


I'm not sure I like importing testing like this, though admittedly I'd not seen this function before.
Are we expecting users of this library to have the different behavior under test?

Perhaps some package-level bool that's set as part of tests in this repository (so both "testing" to be true, and the bool to be set?)

admittedly I'd not seen this function before.

It's new as of go1.21. The proposal was https://go.dev/issue/52600

Are we expecting users of this library to have the different behavior under test?

If this library is free of bugs there will be no observed difference in behaviour in testing vs. application code. But if this library does contain latent bugs, I figure this strikes the best compromise between stability in production and discoverability of bugs.

There are a few ways to go about defending against lock counter / concurrency bugs:

Defensively reset the structs unconditionally. Heisenbugs in production may never get caught, or could get misattributed to the application. This is closest to the v1 behaviour.

Unconditionally assert the invariants. Bugs will get noticed and correctly attributed to the library, but at the risk of programs panicking in production.

Defensively reset the structs in applications, assert the invariants in test binaries. Test coverage of the library expands to the test suites of our users, without causing any grief in production.

Assert the invariants in moby/locker CI; defensive resetting everywhere else, including importer test suites. Only marginally better than option 1 as bugs will only be caught if we get really lucky in a CI run of the moby/locker tests.

I would rather go with option 2 if we can't come to a consensus on 3. Better to fail fast and definitively than send application maintainers on a wild goose chase for "impossible" unbounded concurrency as a consequence of bugs in moby/locker.

Perhaps some package-level bool that's set as part of tests in this repository (so both "testing" to be true, and the bool to be set?)

There's no point to call testing.Testing() in _test.go files as execution of code in those files already implies that a go test program is being executed.

corhere requested review from tonistiigi, cpuguy83, thaJeztah, vvoland and Copilot May 23, 2025 16:08

Copilot AI reviewed May 23, 2025

View reviewed changes

rwmutexmap.go Outdated Show resolved Hide resolved

corhere force-pushed the pool-paranoia branch 2 times, most recently from faa2054 to b058b63 Compare May 23, 2025 16:27

corhere force-pushed the pool-paranoia branch from b058b63 to 86733db Compare May 23, 2025 16:31

austinvazquez approved these changes May 23, 2025

View reviewed changes

cpuguy83 reviewed May 23, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Minimize the blast radius of lock-counter bugs #7

Minimize the blast radius of lock-counter bugs #7

Uh oh!

corhere commented May 23, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

cpuguy83 May 23, 2025

Uh oh!

thaJeztah May 23, 2025

Uh oh!

corhere May 23, 2025

Uh oh!

Uh oh!

Minimize the blast radius of lock-counter bugs #7

Are you sure you want to change the base?

Minimize the blast radius of lock-counter bugs #7

Uh oh!

Conversation

corhere commented May 23, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

cpuguy83 May 23, 2025

Choose a reason for hiding this comment

Uh oh!

thaJeztah May 23, 2025

Choose a reason for hiding this comment

Uh oh!

corhere May 23, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!