Skip to content

Minimize the blast radius of lock-counter bugs #7

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

corhere
Copy link
Contributor

@corhere corhere commented May 23, 2025

Recycling objects with a sync.Pool risks amplifying the impact of bugs. If the object is not reset to be functionally indistinguishable from a newly-constructed one before being returned to the pool, the unlucky recipient of the recycled object could misbehave through no fault of its own. And what's worse, such bugs would likely manifest as Heisenbugs as they would only happen when objects returned to the pool are being reused, i.e. a program with lots of concurrency and lots of lock churn.

Make the pool optimization functionally transparent so the impact of counter bugs does not extend beyond the scope it would have had without the pool optimization. Defensively reset the lock counters before returning them to the pool and after taking them from the pool. And add testing-only code which asserts that the invariants are upheld to ensure that the defensive coding is not papering over real bugs.

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds defensive logic around recycling lock counter objects via sync.Pool to ensure their internal counters are always zeroed and to panic in test mode if invariants are violated.

  • Introduce rwPoolPut to validate and reset rwlockCtr counters before putting back into the pool.
  • Wrap pool-Get in both get and Lock methods with test-mode checks and explicit zeroing of counters.
  • Replace direct sync.Pool.Put calls in Unlock/RUnlock (and in MutexMap) with the new defensive helper.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
rwmutexmap.go Added rwPoolPut wrapper, replaced direct Put calls, and added test-only counter invariant checks around pool Get/Put.
mutexmap.go Added test-only panic checks and defensive waiters resets in Lock/Unlock, and replaced direct pool Put with defensive cleanup.
Comments suppressed due to low confidence (3)

rwmutexmap.go:28

  • There are no tests exercising the new rwPoolPut behavior (e.g., confirming counters are zeroed). Add unit tests to explicitly cover both the normal path and panic path in test mode.
func rwPoolPut(nameLock *rwlockCtr) {

rwmutexmap.go:34

  • The call to testing.Testing() will fail to compile because testing.Testing is not part of Go's testing API. Consider using a build tag or an injectable flag to enable test-only checks instead of importing testing in production code.
if testing.Testing() {

mutexmap.go:43

  • Similar to the above, testing.Testing() does not exist in the testing package. Remove this import and guard or replace it with a valid mechanism for test-only assertions.
if testing.Testing() {

@corhere corhere force-pushed the pool-paranoia branch 2 times, most recently from faa2054 to b058b63 Compare May 23, 2025 16:27
Recycling objects with a sync.Pool risks amplifying the impact of bugs.
If the object is not reset to be functionally indistinguishable from a
newly-constructed one before being returned to the pool, the unlucky
recipient of the recycled object could misbehave through no fault of its
own. And what's worse, such bugs would likely manifest as Heisenbugs as
they would only happen when objects returned to the pool are being
reused, i.e. a program with lots of concurrency and lots of lock churn.

Make the pool optimization functionally transparent so the impact of
counter bugs does not extend beyond the scope it would have had without
the pool optimization. Defensively reset the lock counters before
returning them to the pool and after taking them from the pool. And add
testing-only code which asserts that the invariants are upheld to ensure
that the defensive coding is not papering over real bugs.

Signed-off-by: Cory Snider <[email protected]>
import "testing"

func isTesting() bool {
return testing.Testing()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I like importing testing like this, though admittedly I'd not seen this function before.
Are we expecting users of this library to have the different behavior under test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps some package-level bool that's set as part of tests in this repository (so both "testing" to be true, and the bool to be set?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

admittedly I'd not seen this function before.

It's new as of go1.21. The proposal was https://go.dev/issue/52600

Are we expecting users of this library to have the different behavior under test?

If this library is free of bugs there will be no observed difference in behaviour in testing vs. application code. But if this library does contain latent bugs, I figure this strikes the best compromise between stability in production and discoverability of bugs.

There are a few ways to go about defending against lock counter / concurrency bugs:

  1. Defensively reset the structs unconditionally. Heisenbugs in production may never get caught, or could get misattributed to the application. This is closest to the v1 behaviour.
  2. Unconditionally assert the invariants. Bugs will get noticed and correctly attributed to the library, but at the risk of programs panicking in production.
  3. Defensively reset the structs in applications, assert the invariants in test binaries. Test coverage of the library expands to the test suites of our users, without causing any grief in production.
  4. Assert the invariants in moby/locker CI; defensive resetting everywhere else, including importer test suites. Only marginally better than option 1 as bugs will only be caught if we get really lucky in a CI run of the moby/locker tests.

I would rather go with option 2 if we can't come to a consensus on 3. Better to fail fast and definitively than send application maintainers on a wild goose chase for "impossible" unbounded concurrency as a consequence of bugs in moby/locker.

Perhaps some package-level bool that's set as part of tests in this repository (so both "testing" to be true, and the bool to be set?)

There's no point to call testing.Testing() in _test.go files as execution of code in those files already implies that a go test program is being executed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants