GH-135379: Top of stack caching for the JIT. #135465

Open
wants to merge 18 commits into
base: main
from

Conversation

markshannon
Member

@markshannon markshannon commented Jun 13, 2025

The stats need fixing and the generated tables could be more compact, but it works.

Member

@Fidget-Spinner Fidget-Spinner left a comment

This is really cool. I'll do a full review soon enough.

@markshannon
Member Author

The overall performance change is in the noise, but we would need a really big speedup of jitted code for it to be more than noise overall.

The nbody benchmark, which spends a lot of time in the JIT, shows a 13-18% speedup, except on Mac, where it shows no speedup.
I don't know why that would be, as I think we are using stock LLVM for Mac, not the Apple compiler.

@Fidget-Spinner
Member

> The nbody benchmark, which spends a lot of time in the JIT, shows a 13-18% speedup, except on Mac, where it shows no speedup. I don't know why that would be, as I think we are using stock LLVM for Mac, not the Apple compiler.

Nice. We use Apple's compiler for the interpreter, though the JIT uses stock LLVM. Thomas previously showed that the version of the Apple compiler we use is subject to huge fluctuations in performance due to a PGO bug.

@markshannon markshannon marked this pull request as ready for review June 20, 2025 15:04
Comment on lines +1 to +6
Implement a limited form of register allocation know as "top of stack
caching" in the JIT. It works by keeping 0-3 of the top items in the stack
in registers. The code generator generates multiple versions of thos uops
that do not escape and are relatively small. During JIT compilation, the
copy that produces the least memory traffic is selected, spilling or
reloading values when needed.
Member

Suggested change
- Implement a limited form of register allocation know as "top of stack
- caching" in the JIT. It works by keeping 0-3 of the top items in the stack
- in registers. The code generator generates multiple versions of thos uops
- that do not escape and are relatively small. During JIT compilation, the
- copy that produces the least memory traffic is selected, spilling or
- reloading values when needed.
+ Implement a limited form of register allocation known as "top of stack
+ caching" in the JIT. It works by keeping 0-3 of the top items in the stack
+ in registers. The code generator generates multiple versions of those uops
+ that do not escape and are relatively small. During JIT compilation, the
+ copy that produces the least memory traffic is selected, spilling or
+ reloading values when needed.
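
For readers new to the technique, the general idea of top-of-stack caching can be sketched with a toy model. This is not the PR's implementation: the class `CachedStack`, the helper `run_adds`, and the traffic counter are hypothetical names made up for illustration, and the model ignores the per-uop variant selection the news entry describes. It only shows why keeping the top few stack items in "registers" reduces loads and stores.

```python
MAX_CACHED = 3  # the news entry says 0-3 top-of-stack items are kept in registers


class CachedStack:
    """Toy operand stack: the top few items live in 'registers', the rest in 'memory'."""

    def __init__(self, max_cached=MAX_CACHED):
        self.max_cached = max_cached
        self.memory = []         # lower part of the stack, bottom first
        self.registers = []      # cached top of stack, deepest cached item first
        self.memory_traffic = 0  # loads from plus stores to 'memory'

    def push(self, value):
        self.registers.append(value)
        if len(self.registers) > self.max_cached:
            # Too many cached items: spill the deepest one to memory.
            self.memory.append(self.registers.pop(0))
            self.memory_traffic += 1

    def pop(self):
        if not self.registers:
            # Nothing cached: reload the top item from memory.
            self.registers.append(self.memory.pop())
            self.memory_traffic += 1
        return self.registers.pop()


def run_adds(stack, values):
    """Push all values, then reduce them with repeated binary adds."""
    for v in values:
        stack.push(v)
    for _ in range(len(values) - 1):
        right = stack.pop()
        left = stack.pop()
        stack.push(left + right)
    return stack.pop()


if __name__ == "__main__":
    for limit in (0, MAX_CACHED):
        s = CachedStack(max_cached=limit)
        total = run_adds(s, [1, 2, 3, 4])
        print(f"max_cached={limit}: result={total}, memory traffic={s.memory_traffic}")
```

Running the same four-value reduction with `max_cached=0` and `max_cached=3` gives the same result but far fewer memory accesses in the cached case, which is the effect the change is aiming for in jitted code.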

Member

@Fidget-Spinner Fidget-Spinner left a comment

I need to review the cases generator later.

@@ -0,0 +1,3 @@
Implement top-of-stack caching for the JIT (and tier 2 interpreter). Reduces
Member

Why is there a second news file?

Comment on lines +1008 to +1009
static int
get_exit_depth(_PyUOpInstruction *inst)
Member

Can you write a short snippet on what this does? It's rather confusing otherwise. IIUC, it finds the number of "used" registers on exit, right?

Comment on lines +1028 to +1029
    if (_PyUop_Caching[base_opcode].exit_depth_is_output) {
        return input + _PyUop_Caching[base_opcode].delta;
Member

What does this do?

static int
stack_allocate(_PyUOpInstruction *buffer, int length)
{
    for (int i = length-1; i >= 0; i--) {
Member

As I understand it, this is because a spill might need to be inserted between every pair of instructions, so you need to reserve space for 2N instructions, right?
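
The 2N bound in the question above can be illustrated with a small sketch. It assumes, as the reviewer does, that at most one spill-or-reload fixup is needed in front of each instruction. The function `insert_fixups`, the tuple layout, and the `_FIXUP_` naming are all hypothetical and are not taken from the PR; the sketch also walks forwards for readability, whereas the quoted C loop walks the buffer backwards.

```python
def insert_fixups(uops):
    """Insert a fixup before any uop whose required cached depth differs
    from the depth left by the previous uop.

    Each uop is a (name, input_depth, output_depth) tuple, where the depths
    are the number of stack items held in registers (hypothetical layout).
    """
    result = []
    depth = 0  # assume nothing is cached on entry
    for name, input_depth, output_depth in uops:
        if depth != input_depth:
            # One spill or reload step brings the cache to the required depth.
            result.append((f"_FIXUP_{depth}_TO_{input_depth}", depth, input_depth))
        result.append((name, input_depth, output_depth))
        depth = output_depth
    # At most one fixup per original uop, so the output never exceeds 2N.
    assert len(result) <= 2 * len(uops)
    return result


if __name__ == "__main__":
    # Worst case: every uop needs a fixup, so the trace doubles in length.
    trace = [("X", 1, 0), ("Y", 1, 0)]
    for entry in insert_fixups(trace):
        print(entry)
```

Under that one-fixup-per-instruction assumption, a buffer of 2N entries is always enough, which matches the reservation the reviewer is asking about.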

Comment on lines +1259 to +1262
if ideal_inputs > 3:
    ideal_inputs = 3
if ideal_outputs > 3:
    ideal_outputs = 3
Member

Can you move the value 3 into a named global constant so that we can play around with increasing or decreasing the register count in the future?
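
A minimal sketch of what that request could look like in the generator. The constant name `MAX_CACHED_REGISTERS` and the helper `clamp_ideal` are hypothetical, chosen only for illustration; the PR may name or structure this differently.

```python
# Hypothetical module-level constant replacing the literal 3.
MAX_CACHED_REGISTERS = 3


def clamp_ideal(ideal_inputs, ideal_outputs, limit=MAX_CACHED_REGISTERS):
    """Clamp the ideal cached input and output counts to the register limit."""
    return min(ideal_inputs, limit), min(ideal_outputs, limit)


if __name__ == "__main__":
    print(clamp_ideal(5, 2))
    print(clamp_ideal(1, 4, limit=2))
```

With the limit in one place, experimenting with a different register count means changing a single line rather than hunting for every literal.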

Comment on lines +1268 to +1271
#if has_exit and ideal_inputs != ideal_outputs:
# n = min(ideal_inputs, ideal_outputs)
# yield n, n
# return
Member

Let's remove this.


2 participants