Skip to content

Conversation

daltairwalter
Copy link
Contributor

@daltairwalter daltairwalter commented Oct 14, 2025

Description

fix #2888

This fixes undefined behavior in Gil handling associated with storing a dangling pointer. Saving this pointer is not functionally needed, because if the TLS tstate lookup returns a nullptr in gil_scoped_acquire(), the next bit of code does tstate=PyGILState_GetThisThreadState(). In the cases where the tstate wasn't dangling the results will be the same with one extra TLS lookup (minor penalty for correctness), in the cases where the saved tstate was dangling the correct one will be loaded from PyGILState_GetThisThreadState().

Suggested changelog entry:

Fix undefined behavior that occurred when importing pybind11 modules from non-main threads created by C API modules or embedded python interpreters.

@rwgk
Copy link
Collaborator

rwgk commented Oct 15, 2025

@b-pass @oremanj This is essentially a one-line change, could you please look? — I looked myself, but I'm not sufficiently familiar with the thread handling mechanisms to be sure this is correct.

Where I'm getting lost:

  • Aren't the internals initialized only once per extension module (when the module is imported)?
  • It looks like we're already initializing the tstate key_ here:

if (!PYBIND11_TLS_KEY_CREATE(key_)) {

  • So what does the new tstate.set(nullptr); actually do? Replace the value set in L88 with nullptr?

@daltairwalter how did you convince yourself that this change actually helps? Asking the same question from a different angle: Do you have a unit test we could add here?

I also don't understand: What is the exact sequence of events that 1. assigns to tstate but 2. makes the value dangling?

@daltairwalter
Copy link
Contributor Author

Without my change, python is getting the current PyThreadState from whatever it currently is when a module is loaded. This thread state is not guaranteed to be valid in the future. If you want an exact sequence, it would be:

  1. non-pybind11 C API uses PyGILState_Ensure() to acquire GIL in a non-main thread
  2. pybind11 module is loaded in this non-main thread and does something to initialize tstate. This will save the PyThreadState created by PyGILState_Ensure() in step 1.
  3. non-pybind11 C API uses PyGILState_Release which deletes the PyThreadState that was created in step 1 leaving the saved state in step 2 dangling.

The simplest test case requires an embedded interpreter that is not started by pybind11. In the past, the pybind11 community has been very opposed to this being a supported use case, I am not sure where the community is on this now. Either way, this will also fail in a complex use case with a well written/valid non-pybind11 pyd and a pybind11 pyd being used together. There is a code sample that fails in the ticket #2888 - I updated the test case slightly for newer pybind11 at the end of the ticket, but it is essentially the same.

I am not sure if setting tstate to nullptr is actually meaningful or would otherwise be the default, the primary point of my change is to stop setting tstate = cur_tstate;

@rwgk
Copy link
Collaborator

rwgk commented Oct 15, 2025

Thanks for the explanation. That's making things a little clearer for me.

Did you consider building with PYBIND11_SIMPLE_GIL_MANAGEMENT? I believe that would side-step all issues around the internals tstate completely.

In the past, the pybind11 community has been very opposed to this being a supported use case, I am not sure where the community is on this now.

My vote: If it's wrong (like an actual dangling pointer), we should fix it.

But:

  • I only have a sketchy understanding of the thread mechanisms in general.
  • The fix should be well understood by the experts.

@oremanj
Copy link
Collaborator

oremanj commented Oct 15, 2025

Despite the small code footprint of the fix, this is extremely subtle and I will need more time to review it. There are some comments in subinterpreter.h suggesting that the logic there is relying on the internals to be created with the current tstate initialized.

@b-pass
Copy link
Contributor

b-pass commented Oct 15, 2025

Yes, this makes sense to me. I think this line was added accidentally when cleaning up the internals constructor, the code pre-cleanup used a local with the same name as the member.

This field is only used by gil.h.

@oremanj
Copy link
Collaborator

oremanj commented Oct 15, 2025

It's used in subinterpreter_scoped_activate also.

@b-pass
Copy link
Contributor

b-pass commented Oct 15, 2025

Ah I see, it's setting it for the benefit of gil.h. It properly unsets it when the raii object is destructed, so it's probably fine to leave those there. But they could also be removed and gil.h will fill populate itself as needed.

@oremanj
Copy link
Collaborator

oremanj commented Oct 15, 2025

OK, I stared at this a bunch and cross-referenced with CPython and I believe it's safe.

We could go further: On 3.12 and later, I believe we don't need internals.tstate at all, because on those interpreter versions, attaching a thread state causes it to become the PyGILState_GetThisThreadState() for the current OS thread -- even if it is later detached, as long as it still exists. internals.tstate is only read in gil_scoped_acquire, and if it's not set then we fall back to PyGILState_GetThisThreadState(). If we look at the places where internals.tstate is set:

  • Internals constructor -- this one is invalid, that's the subject of this PR
  • gil_scoped_acquire constructor/destructor for the benefit of nested gil_scoped_acquire and gil_scoped_release if a thread state was created -- this is unnecessary because the new thread state is attached so it will be returned by PyGILState_GetThisThreadState()
  • gil_scoped_release constructor/destructor in disassoc mode -- this is clearing/restoring the internals.tstate thread-local so that we don't cache a stale linkage between tstate and OS thread, which is unnecessary if we don't cache any linkage between tstate and OS thread
  • subinterpreter_scoped_activate -- the inner interpreter's tstate is attached during the lifetime of this object, so PyGILState_GetThisThreadState() should work; if there is a nested switch to yet a different interpreter, its tstate will be the attached one

TL;DR CPython is now smart enough to handle the caching that we once needed to do to work around the previous limitations of the GILState APIs.

Copy link
Collaborator

@rwgk rwgk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @b-pass and @oremanj for the feedback!


// The PyThreadState that is returned by PyThreadState_Get is not owned by us and may be
// destroyed by the owner at some point in the future. Saving this PyThreadState here can
// lead to dangling pointer undefined behavior in complex multi-threaded scenarios.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment only makes sense in the context of this PR, but will be very confusing in the future, because we're not saving "this PyThreadState" here anymore.

WDYT about:

  • Deleting this comment entirely.
  • Adding a comment above like this:
        tstate.set(nullptr); // See PR #5870

While at it, could you please also change this line in pybind11/subinterpreter.h?

    // save this in internals for scoped_gil calls (see also: PR #5870)
    detail::get_internals().tstate = tstate_;

(super-easy way to make this context more discoverable for future maintainers)

@rwgk
Copy link
Collaborator

rwgk commented Oct 16, 2025

Thanks @daltairwalter @b-pass @oremanj! Merging.

@rwgk rwgk merged commit 1594396 into pybind:master Oct 16, 2025
85 checks passed
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Oct 16, 2025
@rwgk
Copy link
Collaborator

rwgk commented Oct 16, 2025

@daltairwalter I just noticed: Could you please fill in the "Suggested changelog entry" in the PR description? (I just put back the template; usually we just have a sentence or two.)

@daltairwalter
Copy link
Contributor Author

Thank you all for making this process go so smoothly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs changelog Possibly needs a changelog entry

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] GIL hang in multi-threaded situation

4 participants