GH-135379: Top of stack caching for the JIT. #135465
base: main
This is really cool. I'll do a full review soon enough.
78489ea to 2850d72
Performance is in the noise, but we would need a really big speedup of jitted code for it to be more than noise overall. The nbody benchmark, which spends a lot of time in the JIT, shows a 13-18% speedup, except on Mac, where it shows no speedup.
Nice. We use Apple's compiler for the interpreter, though the JIT uses stock LLVM. Thomas previously showed that the version of the Apple compiler we use is subject to huge fluctuations in performance due to a PGO bug.
Implement a limited form of register allocation know as "top of stack
caching" in the JIT. It works by keeping 0-3 of the top items in the stack
in registers. The code generator generates multiple versions of thos uops
that do not escape and are relatively small. During JIT compilation, the
copy that produces the least memory traffic is selected, spilling or
reloading values when needed.
- Implement a limited form of register allocation know as "top of stack
- caching" in the JIT. It works by keeping 0-3 of the top items in the stack
- in registers. The code generator generates multiple versions of thos uops
- that do not escape and are relatively small. During JIT compilation, the
- copy that produces the least memory traffic is selected, spilling or
- reloading values when needed.
+ Implement a limited form of register allocation known as "top of stack
+ caching" in the JIT. It works by keeping 0-3 of the top items in the stack
+ in registers. The code generator generates multiple versions of those uops
+ that do not escape and are relatively small. During JIT compilation, the
+ copy that produces the least memory traffic is selected, spilling or
+ reloading values when needed.
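To make the news entry concrete, here is a rough sketch of the selection idea it describes: each uop comes in several variants that expect a given number of stack items in registers, and the variant needing the fewest spills or reloads is chosen. This is a simplified forward greedy pass written for illustration only. The names `transition_cost`, `pick_variant` and `allocate` are hypothetical, the cost model is an assumption, and the PR's own `stack_allocate` walks the buffer in reverse.

```python
MAX_CACHED = 3  # the description says 0-3 stack items are kept in registers


def transition_cost(cached: int, wanted: int) -> int:
    """Memory operations needed to go from `cached` to `wanted` items in
    registers: spills when cached > wanted, reloads when cached < wanted."""
    return abs(cached - wanted)


def pick_variant(cached: int, variants: list[tuple[int, int]]) -> tuple[int, int]:
    """Pick the (inputs_cached, outputs_cached) variant of a uop that needs
    the fewest spills or reloads given the current cache depth."""
    return min(variants, key=lambda v: transition_cost(cached, v[0]))


def allocate(trace: list[list[tuple[int, int]]], cached: int = 0):
    """Walk a trace (one list of variants per uop) and return the chosen
    variants plus the total number of spills and reloads."""
    chosen, traffic = [], 0
    for variants in trace:
        inputs, outputs = pick_variant(cached, variants)
        traffic += transition_cost(cached, inputs)
        chosen.append((inputs, outputs))
        cached = outputs  # the next uop starts from this cache depth
    return chosen, traffic


# Hypothetical trace: a push-like uop followed by a pop-like uop.
push_like = [(0, 1), (1, 2), (2, 3)]
pop_like = [(1, 0), (2, 1), (3, 2)]
chosen, traffic = allocate([push_like, pop_like])
```

In this toy trace the pushed value stays in a register and is consumed directly by the next uop, so no spill or reload is needed.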
I need to review the cases generator later.
@@ -0,0 +1,3 @@
Implement top-of-stack caching for the JIT (and tier 2 interpreter). Reduces
Why is there a second news file?
static int
get_exit_depth(_PyUOpInstruction *inst)
Can you write a short comment explaining what this does? It's rather confusing otherwise. IIUC, it finds the number of "used" registers on exit, right?
if (_PyUop_Caching[base_opcode].exit_depth_is_output) {
    return input + _PyUop_Caching[base_opcode].delta;
What does this do?
static int
stack_allocate(_PyUOpInstruction *buffer, int length)
{
    for (int i = length-1; i >= 0; i--) {
To my understanding, this is because a spill may need to be inserted between every pair of instructions, so you need to reserve space for 2N instructions, right?
if ideal_inputs > 3:
    ideal_inputs = 3
if ideal_outputs > 3:
    ideal_outputs = 3
Can you move the value 3 to a global magic number so that we can play around with increasing/decreasing register counts in the future?
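One possible shape for this, as a sketch only: hoist the limit into a module-level constant and clamp through a small helper. `MAX_CACHED_ITEMS` and `clamp_ideal` are hypothetical names, not identifiers from the PR.

```python
# Hypothetical module-level constant: the number of stack items that may be
# cached in registers. Changing it here would change every clamp below.
MAX_CACHED_ITEMS = 3


def clamp_ideal(
    ideal_inputs: int, ideal_outputs: int, limit: int = MAX_CACHED_ITEMS
) -> tuple[int, int]:
    """Clamp the ideal cached input/output counts to the register limit."""
    return min(ideal_inputs, limit), min(ideal_outputs, limit)
```

With the limit passed as a parameter, experimenting with a different register count is a one-line change, for example `clamp_ideal(5, 2, limit=4)`.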
#if has_exit and ideal_inputs != ideal_outputs:
#    n = min(ideal_inputs, ideal_outputs)
#    yield n, n
#    return
Let's remove this.
The stats need fixing and the generated tables could be more compact, but it works.