3.14t vs 3.13t cuts IOCP performance in half #134637

Closed
SolsticeProjekt opened this issue May 24, 2025 · 5 comments · Fixed by #134702
Labels
3.13, 3.14, interpreter-core, OS-windows, performance, topic-ctypes, topic-free-threading, type-bug

Comments

@SolsticeProjekt

SolsticeProjekt commented May 24, 2025

Bug report

Bug description:

This is about 3.14.0b1 vs 3.13.1, free-threaded in both cases.
Microsoft Windows [Version 10.0.19045.4529]

I run and maintain an IOCP server on Python 3.13.1t.
No third-party libraries are used.

The problem: running the exact same code on 3.14t instead of 3.13t cuts throughput in half.

I've put together a rough but working benchmark, extracted and simplified from my IOCP server.

Server:

from ctypes import windll, create_string_buffer, c_void_p, c_ulong, Structure, byref
from ctypes.wintypes import DWORD, HANDLE

kernel32 = windll.kernel32

CreateNamedPipeW = kernel32.CreateNamedPipeW
CreateIOCompletionPort = kernel32.CreateIoCompletionPort
ConnectNamedPipe = kernel32.ConnectNamedPipe
GetQueuedCompletionStatusEx = kernel32.GetQueuedCompletionStatusEx
ReadFile = kernel32.ReadFile

GLE = kernel32.GetLastError

# Oversized stand-in for the Win32 OVERLAPPED structure; only its size matters here.
class OVERLAPPED(Structure):
	_fields_ = (("0", c_void_p), ("1", c_void_p), ("2", DWORD), ("3", DWORD), ("4", c_void_p),
				("5", c_void_p), ("6", c_void_p), ("7", c_void_p), ("8", c_void_p))


Overlapped = (OVERLAPPED*10)()
__Overlapped = byref(Overlapped)

IOCP = CreateIOCompletionPort(HANDLE(-1), None, 0, 4)   # new completion port, at most 4 concurrently running threads

flag1 = 1 | 1073741824   # PIPE_ACCESS_INBOUND | FILE_FLAG_OVERLAPPED
flag2 = 4 | 2 | 0 | 8    # PIPE_TYPE_MESSAGE | PIPE_READMODE_MESSAGE | PIPE_WAIT | PIPE_REJECT_REMOTE_CLIENTS
Pipe = CreateNamedPipeW("\\\\.\\pipe\\IOCPBenchMark", flag1, flag2, 255, 32, 0, 0, None)

if not CreateIOCompletionPort(Pipe, IOCP, 1, 0): print("ERROR!")   # associate the pipe with the port, completion key 1

ReadBuffer = create_string_buffer(1024)
__ReadBuffer = byref(ReadBuffer)

OverlapEntries = create_string_buffer(32*128)   # room for 128 OVERLAPPED_ENTRY records (32 bytes each on x64)
ove = byref(OverlapEntries)

Completed = c_ulong(0)
__Completed = byref(Completed)

def __IOCPThread():
	while True:
		# Poll for completions; the entry buffer holds 128 records, so don't ask for more.
		while not GetQueuedCompletionStatusEx(IOCP, ove, 128, __Completed, 0, False): continue
		ReadFile(Pipe, __ReadBuffer, 32, None, __Overlapped)   # post the next overlapped read


from threading import Thread

Threads = []
for t in range(4): Threads.append(Thread(target=__IOCPThread))


success = ConnectNamedPipe(Pipe, __Overlapped)
if not success:
	if GLE() != 997:   # 997 == ERROR_IO_PENDING: the overlapped connect is still in progress
		print("ERROR 2")


while not GetQueuedCompletionStatusEx(IOCP, ove, 128, __Completed, 1, False): continue   # wait (1 ms timeout per poll) for the client to connect
print("Connected.")

ReadFile(Pipe, __ReadBuffer,32,None,__Overlapped)
for t in Threads: t.start()


from time import sleep
while True:
	sleep(1)

Client:

from ctypes import windll, byref
from ctypes.wintypes import DWORD

from time import perf_counter as pfc

kernel32 = windll.kernel32

CreateFileW = kernel32.CreateFileW
WriteFile = kernel32.WriteFile
GLE = kernel32.GetLastError

written = DWORD()
__written = byref(written)


print(GLE())

GENERIC_WRITE = 1073741824   # 0x40000000
Pipe = kernel32.CreateFileW("\\\\.\\pipe\\IOCPBenchMark", GENERIC_WRITE, 0, None, 3, 0, None)   # 3 == OPEN_EXISTING

if GLE() == 0: print("Connected.")

test = b"test"

t = pfc()+1
while True:

	for Count in range(1000000):

		if not WriteFile(Pipe, test, 4,__written, None):
			print("ERROR ",GLE())

		if not WriteFile(Pipe, test, 4,__written, None):
			print("ERROR ",GLE())

		if not WriteFile(Pipe, test, 4,__written, None):
			print("ERROR ",GLE())

		if not WriteFile(Pipe, test, 4,__written, None):
			print("ERROR ",GLE())

		if pfc() >= t:
			t = pfc()+1
			print(Count*4)   # WriteFile calls completed in the last ~second (4 per loop iteration)
			break

The server uses 4 threads. If you don't see any output, try reducing the number of threads.
With 8 threads (on my 8-core machine) I get no output at all. SMT threads don't count for anything here.

Each script runs in its own cmd.exe window.
Please be aware that you'll have to kill the server process manually.

I wanted to add a call to "taskkill/F /IM:pytho*", but then realized I might cause someone real trouble with that.
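
If you'd rather not kill it by hand, a simple tweak (mine, not in the original script) is to replace the final sleep loop with a bounded run and a hard exit; os._exit() is needed because the worker threads poll GetQueuedCompletionStatusEx forever and never return.

# Optional: let the server terminate itself after ~60 seconds.
import os
from time import sleep
sleep(60)
os._exit(0)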

>python3.13t server.py:
Client output:
205536
207128
206764
206504
204768

>python3.14t server.py:
Client output:
107468
105516
106032
107492
108472

Perplexity suggested I post this here, because this is a use case you might be interested in.

Thank you.

CPython versions tested on:

3.14

Operating systems tested on:

Windows

Linked PRs

#134702
@SolsticeProjekt added the type-bug label May 24, 2025
@picnixz added the performance, interpreter-core, topic-free-threading, 3.13, 3.14, and OS-windows labels May 24, 2025
@ZeroIntensity
Member

In 3.14, we made ctypes thread-safe, so this is probably the result of lock contention. I'm honestly surprised it's not crashing in 3.13. Would you mind benchmarking to find which lock is causing the problem?
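
Something along these lines (an untested sketch of mine, no IOCP involved) should isolate the raw ctypes call path: several threads hammering one shared function-pointer object in a tight loop, timed on 3.13t vs 3.14t.

# Sketch: measure raw ctypes call throughput with threads sharing one function-pointer object.
from ctypes import windll
from threading import Thread
from time import perf_counter

GetLastError = windll.kernel32.GetLastError   # one shared function-pointer object
N_THREADS = 4
CALLS = 200_000

def worker():
	for _ in range(CALLS):
		GetLastError()   # every call goes through PyCFuncPtr_call

threads = [Thread(target=worker) for _ in range(N_THREADS)]
start = perf_counter()
for t in threads: t.start()
for t in threads: t.join()
print(f"{N_THREADS * CALLS / (perf_counter() - start):,.0f} calls/sec")

If the 3.14t number drops by roughly half compared to 3.13t, the regression is in the ctypes call path rather than in the IOCP plumbing.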

cc @kumaraditya303

@kumaraditya303
Contributor

I think it is because of the critical section around PyCFuncPtr_call: restype is mutable, so it needs locking in PyCFuncPtr_call. I'll look into making it lock-free.
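
To illustrate (my example, not part of any fix): restype and argtypes are mutable per-object state, and windll caches one function-pointer object per name, so every worker thread in the benchmark above calls through the same PyCFuncPtr instance.

from ctypes import windll, c_ulong

f1 = windll.kernel32.GetLastError
f2 = windll.kernel32.GetLastError
print(f1 is f2)        # True: the attribute is cached, so all threads share this one object
f1.restype = c_ulong   # mutable per-object state; this is what the call path has to guard
print(f2.restype)      # the change is visible through every reference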

@SolsticeProjekt
Author

In 3.14, we made ctypes thread-safe, so this is probably the result of lock contention. I'm honestly surprised it's not crashing in 3.13. Would you mind benchmarking to find which lock is causing the problem?

cc @kumaraditya303

Are you saying that my IOCP server, using 3.13t, shouldn't actually be working? Because it does. Flawlessly, even under load heavy enough to throttle the rest of the system.

Is there anything I can do to help?

@ZeroIntensity
Member

I think it is because of the critical section around PyCFuncPtr_call: restype is mutable, so it needs locking in PyCFuncPtr_call. I'll look into making it lock-free.

Oh, yeah, that sounds problematic. I'd be ok with removing that critical section entirely (or only holding it in places where we actually access per-object state), because it should be up to the user to serialize their own C calls.
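
In the meantime, a possible mitigation for affected users (a sketch of mine, assuming the contention really is the per-object lock) is to stop sharing a single function-pointer object across threads; each WinDLL instance resolves its own objects:

from ctypes import WinDLL, windll

shared = windll.kernel32.GetLastError          # cached: every thread using windll gets this same object
private_a = WinDLL("kernel32").GetLastError    # a fresh WinDLL resolves its own function-pointer objects
private_b = WinDLL("kernel32").GetLastError
print(shared is private_a, private_a is private_b)   # False False

Applied to the server above, that would mean constructing WinDLL("kernel32") inside __IOCPThread and resolving GetQueuedCompletionStatusEx and ReadFile from it, so each worker holds its own function pointers and its own lock.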

@kumaraditya303
Contributor

I have a fix at #134702 which makes it lock-free in the general case and fixes the performance regression.

@kumaraditya303 linked a pull request May 26, 2025 that will close this issue
kumaraditya303 added a commit that referenced this issue May 26, 2025
…inter in `free threading`. (#134702)

Fix performance regression in calling `ctypes` function pointer in `free threading`.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 26, 2025
…ion pointer in `free threading`. (pythonGH-134702)

Fix performance regression in calling `ctypes` function pointer in `free threading`.
(cherry picked from commit 3c05251)

Co-authored-by: Kumar Aditya <[email protected]>
kumaraditya303 added a commit that referenced this issue May 26, 2025
…tion pointer in `free threading`. (GH-134702) (#134742)

gh-134637: Fix performance regression in calling `ctypes` function pointer in `free threading`. (GH-134702)

Fix performance regression in calling `ctypes` function pointer in `free threading`.
(cherry picked from commit 3c05251)

Co-authored-by: Kumar Aditya <[email protected]>