Description
Tracer Version(s)
2.14.4, 2.21.8
Python Version(s)
Python 3.11.12
Pip Version(s)
pip 25.1.1
Bug Report
Hi!
We're seeing regular WORKER TIMEOUTs under gunicorn, and have tracked the cause to workers being killed by SIGSEGV originating from Datadog's ddtrace profiler. (See GDB backtrace and detailed analysis at the end.)
1 Summary
Our production Gunicorn workers (Python 3.11, Debian Bookworm base, ddtrace profiler enabled) routinely exit with WORKER TIMEOUT. Post-mortem analysis shows the workers are dying from a segmentation fault raised inside the Datadog in-process profiler, specifically in the Cython function that walks thread frame chains (pyframe_to_frames).
The crash is deterministic under load, reproducible on fresh containers, and is not an application error: the profiler’s sampling thread dereferences a stale PyFrameObject * obtained from sys._current_frames() / _PyThread_CurrentFrames, an API explicitly documented by CPython as unsafe for concurrent inspection.
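For reference, the sampling pattern at issue looks roughly like this when sketched in pure Python (this is a sketch, not ddtrace's actual code; note that at the Python level the dict returned by `sys._current_frames()` holds strong references to the frames, so this version cannot crash — the Cython collector operates on the underlying pointers, where no such protection exists):

```python
import sys
import threading

def walk_frames(frame, max_nframes=64):
    """Walk a frame's f_back chain, analogous to what pyframe_to_frames does."""
    out = []
    while frame is not None and len(out) < max_nframes:
        out.append((frame.f_code.co_filename, frame.f_lineno))
        frame = frame.f_back
    return out

def busy_worker(stop):
    # Churn through function calls so frames are created and destroyed constantly.
    def inner():
        return sum(range(100))
    while not stop.is_set():
        inner()

stop = threading.Event()
workers = [threading.Thread(target=busy_worker, args=(stop,)) for _ in range(4)]
for w in workers:
    w.start()

# One sampling tick, as the profiler's stack collector performs periodically:
samples = {tid: walk_frames(f) for tid, f in sys._current_frames().items()}

stop.set()
for w in workers:
    w.join()

print(len(samples))  # 4 workers + main thread, at minimum
```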
2 Observed behaviour
- Gunicorn logs a `WORKER TIMEOUT`.
- systemd-coredump captures `SIGSEGV (11)` in the worker process.
- Every core shows the same stack: the segfault originates in `ddtrace/profiling/collector/_traceback.cpython-311-x86_64-linux-gnu.so`, in `pyframe_to_frames`.
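For anyone wanting to repeat the analysis: we pulled the cores with systemd-coredump's tooling (exact match arguments depend on your setup):

```
coredumpctl list gunicorn     # find the crashed worker
coredumpctl gdb <PID>         # open the matching core in gdb
(gdb) bt
(gdb) print $_siginfo
```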
3 Back-trace (abbreviated)
```
#0  __pthread_kill_implementation  ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal        ./nptl/pthread_kill.c:78
#2  __GI_raise                     ../sysdeps/posix/raise.c:26
#3  libdd_wrapper-glibc-x86_64.so  close_stderr_chainer()
#4  libdd_wrapper-glibc-x86_64.so  (signal handler wrapper)
#5  <signal handler>
#6  libpython3.11.so.1.0           (generic getattr path)
#7  libpython3.11.so.1.0           _PyObject_GenericGetAttrWithDict
#8  _traceback.cpython-311-x86_64-linux-gnu.so  __pyx_pw...pyframe_to_frames
#9  stack.cpython-311-x86_64-linux-gnu.so       StackCollector.collect
#10 libpython3.11.so.1.0           PyObject_Vectorcall
#16 ddtrace/internal/_threads.so   PeriodicThread::_M_run
#18 start_thread                   ./nptl/pthread_create.c:442
#19 clone                          ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
```
4 Signal info
```
(gdb) print $_siginfo
{si_signo = 11, si_errno = 0, si_code = -6, _sifields = {_pad = {921, 0 <repeats 27 times>}, _kill = {si_pid = 921, si_uid = 0}, _timer = {si_tid = 921, si_overrun = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 921, si_uid = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 921, si_uid = 0, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x399, _addr_lsb = 0, _addr_bnd = {_lower = 0x0, _upper = 0x0}}, _sigpoll = {si_band = 921, si_fd = 0}, _sigsys = {_call_addr = 0x399, _syscall = 0, _arch = 0}}}
```
5 Registers and disassembly at crash site
```
rip = 0x7f8693026761  (__pyx_pw...pyframe_to_frames+1281)
rax = 0x0             (function pointer resolved to NULL)
rbp = 0x7f85ff4ae5c0  (supposed PyFrameObject *)
...
mov 0x8(%rbp),%rax    ; frame->f_code
mov 0x90(%rax),%rax   ; slot from code/type object
call *%rax            ; ---> segfault (NULL / garbage)
```
6 Memory inspection
```
(gdb) x/gx $rbp
0x7f85ff4ae5c0: 0x0000000000000002   # suspiciously low refcount
(gdb) x/gx $rbp+0x8                  # supposed f_code
0x7f85ff4ae5c8: 0x00007f869405ad60
(gdb) x/32gx 0x7f869405ad60
0x7f869405ad60 <PyFrame_Type>: ...   # pointer lands on the *type* struct,
                                     # not a code object: the frame's memory
                                     # has been freed and reused
```
7 Profiler source responsible
The Cython collector calls CPython's internal API on every sampling tick:

```cython
cdef dict running_threads = <dict>_PyThread_CurrentFrames()
...
frames, nframes = _traceback.pyframe_to_frames(frame, max_nframes)
```

CPython's implementation of that API:

```c
PyObject *_PyThread_CurrentFrames(void) { ... }  /* exposes raw frame ptrs */
```

8 Why this must crash
- `sys._current_frames()` / `_PyThread_CurrentFrames` returns a raw `PyFrameObject *` for every thread.
- CPython docs (quoted verbatim): "The frame returned for a non-deadlocked thread may bear no relationship to that thread’s current activity by the time calling code examines the frame."
- Frames are reference-counted; another thread can return from a function and free its frame immediately after the mapping is created.
- The profiler thread, still holding those pointers, dereferences them -> use-after-free.
- This is unaffected by the GIL (a context switch may occur after the mapping is built but before each frame is examined).
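The staleness half of this is easy to demonstrate from pure Python (again a sketch: in Python the sampled frame object stays alive because *we* hold a reference, so this only shows the documented "bear no relationship" behaviour — the Cython code, holding a borrowed pointer instead of a reference, gets the use-after-free):

```python
import sys
import threading
import time

ready = threading.Event()
release = threading.Event()

def stage_one():
    # The worker parks here; this is the frame the "sampler" captures.
    ready.set()
    release.wait()

def worker():
    stage_one()
    time.sleep(1.0)  # by now the thread is somewhere else entirely

t = threading.Thread(target=worker)
t.start()
ready.wait()
time.sleep(0.05)                       # let the worker settle into release.wait()

f1 = sys._current_frames()[t.ident]    # sampling tick #1: frame inside wait()
release.set()
time.sleep(0.2)                        # the worker moves on

f2 = sys._current_frames()[t.ident]    # sampling tick #2: a different frame
# f1 still exists only because we hold a reference; the thread abandoned it
# long ago. A C caller that kept only the raw pointer would now be
# dereferencing freed memory.
print(f1 is f2, f1.f_code.co_name, f2.f_code.co_name)
t.join()
```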
9 Impact
Any high-concurrency workload with the profiler enabled will eventually dereference freed memory and bring down the entire worker process. In our ECS service this manifests as 502/504s, disrupted sessions, and cascading retries.
10 Expected behaviour
Profiling should never terminate the target process. If the design cannot be made safe, the risk must be explicitly documented and the feature disabled (or opt-in) for Python ≥3.11.
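Until this is fixed, the workaround on our side is to turn the sampler off. The environment variables below reflect our reading of ddtrace's profiling settings (flagged as an assumption — please correct if the setting names are wrong):

```shell
# Disable profiling entirely:
export DD_PROFILING_ENABLED=false
# Or, if only the stack sampler needs to go (assumed setting name):
export DD_PROFILING_STACK_ENABLED=false
```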
Error Logs
No response
Libraries in Use
No response
Operating System
python:3.11-slim-bookworm