[BUG]: Fatal SIGSEGV in ddtrace profiler (Python 3.11, Gunicorn) – unsafe use of sys._current_frames() / _PyThread_CurrentFrames causes use-after-free #13567

@lattwood

Tracer Version(s)

2.14.4, 2.21.8

Python Version(s)

Python 3.11.12

Pip Version(s)

pip 25.1.1

Bug Report

Hi!

We're seeing regular WORKER TIMEOUTs under gunicorn, and have tracked the cause to workers being killed by SIGSEGV originating from Datadog's ddtrace profiler. (See GDB backtrace and detailed analysis at the end.)


1 Summary

Our production Gunicorn workers (Python 3.11, Debian Bookworm base, ddtrace profiler enabled) routinely exit with WORKER TIMEOUT. Post-mortem analysis shows the workers are dying from a segmentation fault raised inside the Datadog in-process profiler, specifically in the Cython function that walks thread frame chains (pyframe_to_frames).
The crash reproduces reliably under load, including on fresh containers, and is not an application error: the profiler’s sampling thread dereferences a stale PyFrameObject * obtained from sys._current_frames() / _PyThread_CurrentFrames, an API whose CPython documentation warns that a returned frame may no longer reflect its thread’s activity by the time it is examined.

2 Observed behaviour

  1. Gunicorn logs a WORKER TIMEOUT.
  2. systemd-coredump captures SIGSEGV (11) in the worker process.
  3. Every core shows the same stack: the segfault originates in
    ddtrace/profiling/collector/_traceback.cpython-311-x86_64-linux-gnu.so : pyframe_to_frames.

3 Back-trace (abbreviated)

#0  __pthread_kill_implementation               ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal                     ./nptl/pthread_kill.c:78
#2  __GI_raise                                  ../sysdeps/posix/raise.c:26
#3  libdd_wrapper-glibc-x86_64.so               close_stderr_chainer()
#4  libdd_wrapper-glibc-x86_64.so               (signal handler wrapper)
#5  <signal handler>
#6  libpython3.11.so.1.0                        (generic getattr path)
#7  libpython3.11.so.1.0                        _PyObject_GenericGetAttrWithDict
#8  _traceback.cpython-311-x86_64-linux-gnu.so  __pyx_pw...pyframe_to_frames
#9  stack.cpython-311-x86_64-linux-gnu.so       StackCollector.collect
#10 libpython3.11.so.1.0                        PyObject_Vectorcall
#16 ddtrace/internal/_threads.so                PeriodicThread::_M_run
#18 start_thread                                ./nptl/pthread_create.c:442
#19 clone                                       ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

4 Signal info

(gdb) print $_siginfo
{si_signo = 11, si_errno = 0, si_code = -6, _sifields = {_pad = {921, 0 <repeats 27 times>}, _kill = {si_pid = 921, si_uid = 0}, _timer = {si_tid = 921, si_overrun = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 921, si_uid = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 921, si_uid = 0, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x399, _addr_lsb = 0, _addr_bnd = {_lower = 0x0, _upper = 0x0}}, _sigpoll = {si_band = 921, si_fd = 0}, _sigsys = {_call_addr = 0x399, _syscall = 0, _arch = 0}}}
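
For anyone reading the dump above, the si_code field indicates how the signal was delivered. Below is a minimal sketch of a decoder, assuming the Linux constant values listed in sigaction(2); the helper name describe_sigsegv is hypothetical. With si_code = -6 it reports SI_TKILL, which would be consistent with frames #0-#2 of the back-trace (the signal being re-raised through pthread_kill). If that reading is right, si_addr = 0x399 is simply si_pid = 921 viewed through the union, not the original faulting address.

```python
# Linux si_code values for signals sent by software (values <= 0),
# as listed in sigaction(2). Assumed here, not read from the core.
SOFTWARE_SI_CODES = {
    0: "SI_USER (kill/raise)",
    -1: "SI_QUEUE (sigqueue)",
    -2: "SI_TIMER",
    -3: "SI_MESGQ",
    -4: "SI_ASYNCIO",
    -5: "SI_SIGIO",
    -6: "SI_TKILL (tkill/tgkill)",
}

# si_code values the kernel uses for a genuine SIGSEGV fault.
SEGV_SI_CODES = {1: "SEGV_MAPERR", 2: "SEGV_ACCERR"}


def describe_sigsegv(si_code: int) -> str:
    """Hypothetical helper: explain what a SIGSEGV's si_code implies."""
    if si_code in SEGV_SI_CODES:
        return f"kernel fault {SEGV_SI_CODES[si_code]}: si_addr is the faulting address"
    if si_code in SOFTWARE_SI_CODES:
        return f"sent by software {SOFTWARE_SI_CODES[si_code]}: si_pid identifies the sender"
    return f"unrecognised si_code {si_code}"


print(describe_sigsegv(-6))
print(hex(921))  # the same bits gdb shows as si_addr
```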

5 Registers and disassembly at crash site

rip = 0x7f8693026761  (__pyx_pw...pyframe_to_frames+1281)
rax = 0x0             (function pointer resolved to NULL)
rbp = 0x7f85ff4ae5c0  (supposed PyFrameObject *)
...
       mov    0x8(%rbp),%rax        ; frame->f_code
       mov    0x90(%rax),%rax       ; slot from code/type object
       call   *%rax                 ; ---> segfault (NULL / garbage)

6 Memory inspection

(gdb) x/gx $rbp
0x7f85ff4ae5c0: 0x0000000000000002       # suspiciously low refcount

(gdb) x/gx $rbp+0x8                      # supposed f_code
0x7f85ff4ae5c8: 0x00007f869405ad60

(gdb) x/32gx 0x7f869405ad60
0x7f869405ad60 <PyFrame_Type>: ...       # pointer lands on the *type* struct,

7 Profiler source responsible

The Cython collector calls CPython’s internal API every sampling tick:

cdef dict running_threads = <dict>_PyThread_CurrentFrames()
...
frames, nframes = _traceback.pyframe_to_frames(frame, max_nframes)

CPython implementation of that API:

PyObject *_PyThread_CurrentFrames(void) { ... }   // exposes raw frame ptrs
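
For readers unfamiliar with the pattern, a pure-Python analogue of this sampling step might look like the sketch below. It is only an illustration: snapshot_stacks is a hypothetical name, not ddtrace's API, and the real collector is Cython code calling the C API directly rather than going through sys._current_frames().

```python
import sys
import threading


def snapshot_stacks(max_nframes=64):
    """Hypothetical sampler: map thread id -> list of (file, function, line)."""
    stacks = {}
    # One frame per thread: the topmost frame at the moment of the call.
    for thread_id, frame in sys._current_frames().items():
        frames = []
        # Walk towards the outermost caller, bounded by max_nframes.
        while frame is not None and len(frames) < max_nframes:
            code = frame.f_code
            frames.append((code.co_filename, code.co_name, frame.f_lineno))
            frame = frame.f_back
        stacks[thread_id] = frames
    return stacks


if __name__ == "__main__":
    for tid, frames in snapshot_stacks().items():
        print(tid, [name for _, name, _ in frames])
```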

8 Why this must crash

  • sys._current_frames() / _PyThread_CurrentFrames returns raw PyFrameObject * for every thread.

  • CPython docs (quoted verbatim):

    "The frame returned for a non-deadlocked thread may bear no relationship to that thread’s current activity by the time calling code examines the frame."

  • Frames are reference-counted; another thread can return from a function and free its frame immediately after the mapping is created.

  • The profiler thread, still holding those pointers, dereferences them -> use-after-free.

  • The GIL does not prevent this: a context switch can occur after the mapping is built but before each frame is examined.
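
The staleness described in the quoted documentation can be shown from pure Python. This sketch takes a snapshot while a thread is blocked, lets the thread finish, and then inspects the frame. It is an illustration under stated assumptions, not a reproduction of the crash: capture_then_finish is a hypothetical name, and at the Python level the snapshot holds references to the frame objects, so this code does not fault. It says nothing about pointer lifetimes inside the Cython path.

```python
import sys
import threading


def capture_then_finish():
    """Snapshot a thread's frame, let the thread exit, then inspect the frame."""
    started = threading.Event()
    release = threading.Event()

    def worker():
        started.set()
        release.wait()

    t = threading.Thread(target=worker)
    t.start()
    started.wait()

    # Snapshot taken while the thread is still running worker().
    frame = sys._current_frames()[t.ident]

    release.set()
    t.join()  # the thread has now returned from every function

    # The frame still describes worker(), although the thread is gone.
    names = []
    while frame is not None:
        names.append(frame.f_code.co_name)
        frame = frame.f_back
    return t.is_alive(), names


if __name__ == "__main__":
    alive, names = capture_then_finish()
    print("thread alive:", alive)
    print("stale stack:", names)
```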

9 Impact

Any high-concurrency workload with the profiler enabled will eventually dereference freed memory and bring down the entire worker process. In our ECS service this manifests as 502/504s, disrupted sessions, and cascading retries.

10 Expected behaviour

Profiling should never terminate the target process. If the design cannot be made safe, the risk must be explicitly documented and the feature disabled (or opt-in) for Python ≥3.11.

Error Logs

No response

Libraries in Use

No response

Operating System

python:3.11-slim-bookworm
