cmd/compile: play better with perf #73753
Comments
@golang/compiler @golang/runtime

Why arm64 works: it parses the function prolog (& epilog?) instructions and decides at what point a new entry in the frame pointer linked list has been set up. If it hasn't done that yet, it knows it is a leaf function and X30 has an address in the parent frame. At least, if I'm reading the code right. Gory details at https://github.com/libunwind/libunwind/blob/master/src/aarch64/Gstep.c
Ok, so arm64 actually has the opposite problem. The detector in libunwind for the "made a frame record" case doesn't trigger for the code our compiler generates. It is looking for
but we generate two separate store instructions to do this:
Since the detector never sees a frame record setup, it treats every function as a leaf. So when the frame record is actually set up, we get a duplicate frame during backtracing.

Since everything is thought to be a leaf, after the sampled pc we use the contents of x30, and then walk the frame pointer list. But when the function has indeed set up its frame record, x30 can be junk. Typically it will either be the function's return address, in which case the parent gets reported twice, or a return address into the sampled function (from the last call that the sampled function made), in which case the sampled function will be reported twice. Atypically, x30 will be completely junk. Not sure what happens then.

I was wondering why I was seeing duplicate entries in tracebacks using perf. Now I know.
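The duplicate-frame behavior described in this comment can be sketched with a toy model. This is a simplified simulation, not libunwind's actual code: the addresses, the function names (`main.f`, `main.parent`), and the helper `unwindAsLeaf` are all hypothetical, and the frame record layout (caller's frame pointer at `fp`, return address at `fp+8`) is an assumption made for illustration.

```go
package main

import "fmt"

// Hypothetical addresses, chosen only for illustration.
const (
	pcRuntimeMain = 0x401000 // return address into runtime.main
	pcMain        = 0x402000 // return address into main.main
	pcParent      = 0x403000 // return address into main.parent
	pcFSample     = 0x404010 // sampled pc inside main.f
	pcFRet        = 0x404020 // return address into main.f (after a call f made)

	mainFP   = 0x1000 // frame record of main.main
	parentFP = 0x0f00 // frame record of main.parent
	fFP      = 0x0e00 // frame record of main.f
)

var names = map[uint64]string{
	pcRuntimeMain: "runtime.main",
	pcMain:        "main.main",
	pcParent:      "main.parent",
	pcFSample:     "main.f",
	pcFRet:        "main.f",
}

// mem models the stack: a frame record at address fp holds the caller's
// frame pointer at fp and the return address at fp+8.
var mem = map[uint64]uint64{
	mainFP: 0, mainFP + 8: pcRuntimeMain,
	parentFP: mainFP, parentFP + 8: pcMain,
	fFP: parentFP, fFP + 8: pcParent,
}

// unwindAsLeaf treats the sampled function as a leaf: it reports the
// sampled pc, then x30, then walks the frame record list from x29.
func unwindAsLeaf(pc, x30, x29 uint64) []string {
	stack := []string{names[pc], names[x30]}
	for fp := x29; fp != 0; fp = mem[fp] {
		stack = append(stack, names[mem[fp+8]])
	}
	return stack
}

func main() {
	// True leaf: x29 still points at the parent's record.
	fmt.Println(unwindAsLeaf(pcFSample, pcParent, parentFP))
	// f has its own record, no call made yet: x30 still returns into parent.
	fmt.Println(unwindAsLeaf(pcFSample, pcParent, fFP))
	// f has its own record and has returned from a call: x30 points into f.
	fmt.Println(unwindAsLeaf(pcFSample, pcFRet, fFP))
}
```

In this model the first call yields a correct stack, the second reports `main.parent` twice, and the third reports `main.f` twice, matching the two typical cases above.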
Change https://go.dev/cl/674615 mentions this issue.
`perf` is a sampling-based analysis tool on Linux. It's kind of a swiss-army knife tool, but the basic usage just samples PCs periodically and reports CPU usage by function.

For this issue, I'm interested in how perf gets call stacks, which is the `-g` option to `perf record`. Currently the default for perf is to do `--call-graph=fp`, which means use frame pointers to unwind stacks.

Example program:

Example usage:
Go's `pprof` seems to always get call stacks perfectly correct. `perf`, on the other hand, has some issues. Because `perf` uses frame pointers, it can sometimes get stack backtraces wrong. In particular, currently it has the following problems:

Both of these problems relate to the fact that `perf` uses frame pointers to unwind the stack. Because the frame pointer has not been set up in both of the above situations, perf unwinds incorrectly. To get the parent frame, it does `pc = *(fp+8); fp = *fp`. When `fp` is from the parent frame, a pc from the parent frame itself is never found, after the current sample point the next pc is from the grandparent.

It seems that this is not a problem on `arm64`. Not sure how exactly, but it does not suffer from this problem. TODO: how about other architectures? Is this related to link-register vs stack push of the return address?

We have a hack to solve this problem (CL 7728) when the callee is `runtime.duffzero` or `runtime.duffcopy`. The caller sets up a dummy frame pointer before calling either of those functions. When `perf` samples inside those two functions, it correctly finds the parent frame. This hack was added because in `perf` profiles we see a fair amount of these two functions, and it helps to see the immediate caller (these functions are called from lots of places, unlike a typical frameless leaf function). But for all the other cases in 1 and 2, we are out of luck.

The `runtime.duffzero`/`runtime.duffcopy` hack was also ported to `arm64`, but probably that was not needed. It is also causing problems, see #73748. Probably we should remove it, although I don't yet understand how `perf` solves this problem on `arm64`.

So, with all that said, how might we proceed here?
`perf` is not important. Remove the hack above, and just live with the fact that `perf` backtraces might be missing the parent. Not the end of the world.

`perf` is really important. We should add frame pointer setup and teardown to frameless leaf functions.

`perf` to do stack walking without using frame pointers. Modern `perf` has some other ways of finding stacks, including `--call-graph=lbr` (last branch record) and `--call-graph=dwarf` (using dwarf info in a `.eh_frame` section).

Only 4 would in principle handle the prolog/epilog problem. Just adding frame pointers everywhere would not.
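To make the missing-parent problem concrete, here is a minimal sketch of the frame-pointer walk (`pc = *(fp+8); fp = *fp`) over a simulated stack. It is a toy model, not `perf`'s implementation: the addresses, the function names (`main.leaf`, `main.parent`), and the helpers `stackMem` and `unwind` are hypothetical.

```go
package main

import "fmt"

// Hypothetical addresses, chosen only for illustration.
const (
	pcRuntimeMain = 0x401000 // return address into runtime.main
	pcMain        = 0x402000 // return address into main.main
	pcParent      = 0x403000 // return address into main.parent
	pcLeaf        = 0x404000 // sampled pc inside main.leaf

	mainFP   = 0x1000 // frame of main.main
	parentFP = 0x0f00 // frame of main.parent
	leafFP   = 0x0e00 // frame of main.leaf, if it sets one up
)

var names = map[uint64]string{
	pcRuntimeMain: "runtime.main",
	pcMain:        "main.main",
	pcParent:      "main.parent",
	pcLeaf:        "main.leaf",
}

// stackMem models the stack as address -> 8-byte word. Each frame stores
// the caller's frame pointer at fp and the return address at fp+8.
func stackMem(leafHasFrame bool) map[uint64]uint64 {
	mem := map[uint64]uint64{
		mainFP:       0, // end of the chain
		mainFP + 8:   pcRuntimeMain,
		parentFP:     mainFP,
		parentFP + 8: pcMain,
	}
	if leafHasFrame {
		mem[leafFP] = parentFP
		mem[leafFP+8] = pcParent
	}
	return mem
}

// unwind reports the sampled pc, then repeatedly applies
// pc = *(fp+8); fp = *fp until the chain ends.
func unwind(mem map[uint64]uint64, pc, fp uint64) []string {
	var stack []string
	for {
		stack = append(stack, names[pc])
		if fp == 0 {
			break
		}
		pc, fp = mem[fp+8], mem[fp]
	}
	return stack
}

func main() {
	// Frameless leaf: the frame pointer register still holds parent's frame.
	fmt.Println(unwind(stackMem(false), pcLeaf, parentFP))
	// prints [main.leaf main.main runtime.main]

	// Leaf with its own frame: the parent is found.
	fmt.Println(unwind(stackMem(true), pcLeaf, leafFP))
	// prints [main.leaf main.parent main.main runtime.main]
}
```

In the frameless case the first load through `fp` already yields the return address into the grandparent, so `main.parent` never appears.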
As mentioned above, maybe this only matters for `amd64`?