8355003: Implement Ahead-of-Time Method Profiling #24210
Conversation
👋 Welcome back iveresov! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request.
❗ This change is not yet ready to be integrated.
CDS changes look good. Just a few nits.
I understand this is still a draft; here is a first sweep through it anyway :)
if (k != nullptr && k->class_loader_data() != nullptr &&
    (!k->is_instance_klass() || InstanceKlass::cast(k)->is_loaded())) {
So conditions like these are essentially is_live? It is confusing that some blocks test for klass->is_loader_alive, and some don't. Maybe this suggests we really need a helper method if we cannot use is_live directly.
It seems that conditions are slightly different. In one case we accept non-instance klasses. In the other we don't. @iwanowww, could you comment on this? I think you wrote this piece?
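For illustration, here is a minimal sketch of the kind of helper being suggested; the name is made up, and whether non-instance klasses are accepted (or whether is_loader_alive() must also hold) is exactly the open question above:
// Hypothetical helper, not part of this PR: names the repeated condition
// so its intent ("is this klass usable for recording?") is explicit.
static bool is_recordable_klass(Klass* k) {
  if (k == nullptr || k->class_loader_data() == nullptr) {
    return false;  // not yet registered with a class loader
  }
  // Non-instance klasses (e.g. arrays) have no separate "loaded" state to wait for.
  return !k->is_instance_klass() || InstanceKlass::cast(k)->is_loaded();
}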
src/hotspot/share/cds/cdsConfig.cpp (outdated)
FLAG_SET_ERGO(ReplayTraining, false);
FLAG_SET_ERGO(RecordTraining, false);
I think you also need to check the case when these flags are set on the command line but AOT caching is disabled. You need to issue a warning.
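A sketch of what that check might look like (the is_using_aot_cache() predicate and the warning text are assumptions for illustration, not code from this PR):
// Hypothetical placement in CDSConfig argument checking: if training flags were
// given explicitly on the command line but AOT caching is disabled, warn the user
// before ergonomically clearing them.
if (!is_using_aot_cache() /* assumed "AOT caching enabled" predicate */) {
  if (FLAG_IS_CMDLINE(ReplayTraining) || FLAG_IS_CMDLINE(RecordTraining)) {
    log_warning(cds)("ReplayTraining/RecordTraining are ignored because AOT caching is disabled");
  }
  FLAG_SET_ERGO(ReplayTraining, false);
  FLAG_SET_ERGO(RecordTraining, false);
}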
Consider renaming files and classes to include the word AOT. But I will not insist.
@@ -546,6 +548,32 @@ bool CDSConfig::check_vm_args_consistency(bool patch_mod_javabase, bool mode_fla
  return true;
}

void CDSConfig::setup_compiler_args() {
  // AOT profiles are supported only in the JEP 483 workflow.
  bool can_dump_profiles = AOTClassLinking && new_aot_flags_used();
Why check new_aot_flags_used()? As I understand, AOTClassLinking should be enough in mainline. Am I missing something?
@iklam, do I need it?
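To make the question concrete, here is a rough sketch (not the PR's actual body) of how setup_compiler_args() presumably ties the two quoted fragments together, with the FLAG_SET_ERGO calls from above acting as the fallback:
void CDSConfig::setup_compiler_args() {
  // AOT profiles are supported only in the JEP 483 workflow.
  bool can_dump_profiles = AOTClassLinking && new_aot_flags_used();
  if (!can_dump_profiles) {
    // Outside that workflow, training is ergonomically switched off.
    FLAG_SET_ERGO(ReplayTraining, false);
    FLAG_SET_ERGO(RecordTraining, false);
  }
  // If AOTClassLinking alone were sufficient in mainline, the
  // new_aot_flags_used() term above could be dropped -- that is the question.
}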
Some performance experiments:
I think they show good improvements in the good cases, but also regressions. I think this is all caused by triggering compilations early. So we get a performance benefit from starting compilation earlier in the good case. But conversely, we also get a performance regression -- sometimes massive in the over-trained runs, look at the difference in "User" time -- because the training data forces us to compile more than we need. Just doing the following seems to recuperate the perf losses, this is
In this PR we're actually targeting warmup improvements, not startup improvements per se. So I think we should craft a benchmark that shows how the performance changes over time rather than measuring the total. This should allow us to measure when the app reaches its peak performance.
Right, but I think that distinction would be lost when users enable AOTCache and see startup regressions. Let me see if I understand this right: I think the PR in its current form only stacks well with AOT? That is, when training data loads, we force compilation, which would likely be handled by an AOT-ed code load? For a free-standing PR, just loading the profiles without triggering the compilation seems like a good incremental win without any particular losses, apart from spending a few CPU cycles on installing the profiles.
@veresov please take openjdk/leyden@293edad from the Leyden repo as well.
Maybe we should still compile at start-up but somehow limit the amount of excessive compilation? E.g., only the methods that were compiled during the first N ms of the training run?
We should discuss it, but JITing is free on many-core systems. Maybe we should have a heuristic based on the number of cores? Or a command-line option, -XX:AOTMode=warmup?
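As a strawman for the core-count idea (the function, the threshold, and where it would be consulted are all hypothetical):
// Hypothetical heuristic: only let AOT-loaded profiles trigger eager JIT
// compilation when there appear to be spare cores for the compiler threads.
static bool should_eagerly_compile_from_aot_profiles() {
  int cpus = os::active_processor_count();  // number of CPUs available to the VM
  return cpus >= 8;                         // arbitrary threshold, for illustration only
}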
This is how I think about it. Profiling essentially answers two related, but not exactly the same questions: Q1, is the method hot enough that we should compile it right now; and Q2, do we have enough profile data to compile it well.

The mere existence of a profile answers Q2. In pure JIT-profiled ("JITP") and JIT-compiled ("JITC") mode, through the way we collect the profile, the existence of the profile also answers Q1 automatically. That is, if JITP says the profile is ready, it means the method is hot right now, and we have a strong preference to throw it at JITC and compile it. This is a great JIT dynamism: we are able to compile the things that matter now, leaving other, less important things for later.

With AOT-profiled ("AOTP") mode, the answers to these two questions finally depart from each other. Now AOTP can provide us with a ready profile for a still-cold method. Does this answer Q1 all that well? We surely lose the dynamism of "what is hot right now". As we can see in the perf data above, in the case of over-trained runs and with only JITC present, AOTP gives us a wildly inefficient answer to Q1. JIT compilation is unfortunately not free even on multi-cores, since the resources we spend redundantly compiling are resources we don't spend on the workload itself, even if compilation runs on an ostensibly "separate" core. See the diff between

This situation is alleviated when we have AOT-compiled ("AOTC") mode, like in current Leyden. Then the bad answer to Q1 does not make us pay the price for extra JIT compilation. We would load some unnecessary AOT code in tens of microseconds per method, but that is not as bad as spending hundreds of milliseconds JIT compiling.

One can make an argument that the core problem here is over-training: if you don't want AOTP to trigger extra JITC, then don't over-train? I used to think the same a year ago. Unfortunately, my tests show that in AOTP+AOTC(+JITC) modes it is useful to over-train: to collect better profiles, make sure all the long tails are compiled well, etc.

Without AOTC, I think it is much more sensible not to allow AOTP to trigger JITC at all. I don't think heuristics on core counts, number of compilations triggered, etc. solve the fundamental problem of the AOTP -> JITC trigger. Without the AOTP -> JITC trigger, persistent profiles help both startup and warmup: the basic invocation profiling answers Q1, and AOTP assists the invocation/MDO profiling in answering Q2. Much fewer surprises in compiler dynamics, plus all the fun of not spending time collecting the profiles. I think that also aligns better with the current JEP writeup.
I mentioned this at the Leyden meeting. Some methods in JavacBench are stuck at level=0 with
@veresov by the looks of it, it should be possible for
Since I see no changes to C1 or C2 here, I assume the training-run profile info comes purely from the interpreter?
Thanks, I ported it here.
@shipilev Now that we have all the patches that fix the perf problems in this PR, could you please rerun your experiments that turn off/on the compilation after clinit? I wonder if it's still beneficial to turn it off and what happens to the warmup curve...?
There are CI hooks that collect some of the information, although it's mostly about snapshotting MDOs.
Yes, planning to retest in the morning. Thanks!
A quick one-shot experiment shows that disabling the eager compilation trigger is still beneficial in the early phases: I'll do fuller runs overnight / in the morning, but I think this still highlights the need for a feature flag that guards the compilation trigger.
@shipilev keep in mind that there's still an elephant in the room when it comes to startup caused by contention on
W/o compilation replay (
And, somewhat surprisingly, remedies in leyden/premain (
What caught my eye is that C1 is much slower with the lock-free queue:
@veresov this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request, you can run the following commands in a local clone of your personal fork of the openjdk/jdk repository:
git checkout pp
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
@veresov This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a
Implement persistent profiles, see JDK-8325147
Progress
Integration blocker
Warnings
Issues
Backport <hash> with the hash of the original commit. See Backports.
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24210/head:pull/24210
$ git checkout pull/24210
Update a local copy of the PR:
$ git checkout pull/24210
$ git pull https://git.openjdk.org/jdk.git pull/24210/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 24210
View PR using the GUI difftool:
$ git pr show -t 24210
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24210.diff