
8355003: Implement Ahead-of-Time Method Profiling #24210


Closed
wants to merge 28 commits

Conversation


@veresov veresov commented Mar 24, 2025

Implement persistent profiles, see JDK-8325147


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Change requires CSR request JDK-8356012 to be approved
  • Commit message must refer to an issue

Integration blocker

 ⚠️ Title mismatch between PR and JBS for issue JDK-8355003

Warnings

 ⚠️ Patch contains a binary file (src/java.desktop/share/classes/java/awt/doc-files/BorderLayout-1.gif)
 ⚠️ Patch contains a binary file (src/java.desktop/share/classes/java/awt/doc-files/FlowLayout-1.gif)
 ⚠️ Patch contains a binary file (src/java.desktop/share/classes/java/awt/doc-files/GridBagLayout-1.gif)
 ⚠️ Patch contains a binary file (src/java.desktop/share/classes/java/awt/doc-files/GridBagLayout-2.gif)
 ⚠️ Patch contains a binary file (src/java.desktop/share/classes/java/awt/doc-files/GridLayout-1.gif)
 ⚠️ Patch contains a binary file (src/java.desktop/share/classes/java/awt/doc-files/GridLayout-2.gif)
 ⚠️ Patch contains a binary file (test/hotspot/jtreg/runtime/ClassFile/JsrRewritingTestCase.jar)
 ⚠️ Patch contains a binary file (test/hotspot/jtreg/runtime/ClassFile/testcase.jar)

Issues

  • JDK-8355003: Implement JEP 515: Ahead-of-Time Method Profiling (Sub-task - P4) ⚠️ Title mismatch between PR and JBS. ⚠️ Issue is already resolved. Consider making this a "backport pull request" by setting the PR title to Backport <hash> with the hash of the original commit. See Backports.
  • JDK-8356012: Implement JEP 515: Ahead-of-Time Method Profiling (CSR)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24210/head:pull/24210
$ git checkout pull/24210

Update a local copy of the PR:
$ git checkout pull/24210
$ git pull https://git.openjdk.org/jdk.git pull/24210/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 24210

View PR using the GUI difftool:
$ git pr show -t 24210

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24210.diff


bridgekeeper bot commented Mar 24, 2025

👋 Welcome back iveresov! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


openjdk bot commented Mar 24, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.


openjdk bot commented Mar 24, 2025

@veresov The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.


@iklam iklam left a comment


CDS changes look good. Just a few nits.


@shipilev shipilev left a comment


I understand this is still a draft; here is a first sweep through it anyway :)

Comment on lines +310 to +311
if (k != nullptr && k->class_loader_data() != nullptr &&
(!k->is_instance_klass() || InstanceKlass::cast(k)->is_loaded())) {
Member


So the conditions like these are essentially is_live? It is confusing that some blocks test for klass->is_loader_alive, and some don't. Maybe this suggests we really need a helper method if we cannot use is_live directly.

Contributor Author


It seems that conditions are slightly different. In one case we accept non-instance klasses. In the other we don't. @iwanowww, could you comment on this? I think you wrote this piece?

Comment on lines 572 to 573
FLAG_SET_ERGO(ReplayTraining, false);
FLAG_SET_ERGO(RecordTraining, false);
Contributor


I think you also need to check the case when these flags are set on the command line but AOT caching is disabled. You need to issue a warning.


@vnkozlov vnkozlov left a comment


Consider renaming files and classes to include the word AOT. But I will not insist.

@@ -546,6 +548,32 @@ bool CDSConfig::check_vm_args_consistency(bool patch_mod_javabase, bool mode_fla
return true;
}

void CDSConfig::setup_compiler_args() {
// AOT profiles are supported only in the JEP 483 workflow.
bool can_dump_profiles = AOTClassLinking && new_aot_flags_used();
Contributor


Why check new_aot_flags_used()? As I understand AOTClassLinking should be enough in mainline. Am I missing something?

Contributor Author


@iklam, do I need it?


shipilev commented Apr 11, 2025

Some performance experiments:

$ build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC -cp JavacBenchApp.jar -XX:AOTCache=app.aot JavacBenchApp ...

=== Trained with 50 classes

# --- Running with 1 class
Baseline: 233.5 ms ±   3.1 ms  [User: 810.7 ms, System: 83.9 ms]
This PR:  204.3 ms ±   2.4 ms  [User: 990.4 ms, System: 101.9 ms]
No comp:  215.5 ms ±   1.7 ms  [User: 993.5 ms, System: 106.5 ms]

# --- Running with 50 classes
Baseline: 710.3 ms ±   6.4 ms  [User: 3736.8 ms, System: 190.7 ms]
This PR:  590.3 ms ±  14.8 ms  [User: 4019.2 ms, System: 221.4 ms]
No comp:  553.4 ms ±  12.3 ms  [User: 3706.8 ms, System: 227.9 ms]

# --- Running with 1000 classes
Baseline: 3.344 s ±  0.042 s   [User: 18.499 s, System: 0.441 s]
This PR:  3.155 s ±  0.038 s   [User:  9.915 s, System: 0.362 s]
No comp:  3.177 s ±  0.023 s    [User: 9.773 s, System: 0.377 s]

=== Trained with 1000 classes

# --- Running with 1 class
Baseline: 232.9 ms ±   2.4 ms  [User:  834.5 ms, System:  84.5 ms]
This PR:  314.2 ms ±   6.0 ms  [User: 1550.2 ms, System: 121.1 ms]
No comp:  217.1 ms ±   1.9 ms  [User: 1022.5 ms, System: 107.1 ms]

# --- Running with 50 classes
Baseline: 712.4 ms ±   5.5 ms  [User: 3721.8 ms, System: 194.1 ms]
This PR:  798.3 ms ±   8.9 ms  [User: 6042.8 ms, System: 236.9 ms]
No comp:  577.5 ms ±  10.3 ms  [User: 4192.0 ms, System: 249.5 ms]

# --- Running with 1000 classes
Baseline: 3.361 s ±  0.037 s   [User: 18.473 s, System: 0.444 s]
This PR:  2.719 s ±  0.031 s   [User: 17.716 s, System: 0.397 s]
No comp:  2.655 s ±  0.035 s   [User: 16.771 s, System: 0.401 s]

I think they show there are good improvements in the good cases, but also regressions. I think this is all caused by triggering compilations early. We get a performance benefit from starting compilation earlier in the good case. But conversely, we also get a performance regression -- sometimes massive in the over-trained runs; look at the difference in "User" time -- because the training data forces us to compile more than we need?

Just doing the following seems to recover the perf losses; this is "No comp" in the results above:

diff --git a/src/hotspot/share/compiler/compilationPolicy.cpp b/src/hotspot/share/compiler/compilationPolicy.cpp
index 6cb60b41cce..2daad56afc7 100644
--- a/src/hotspot/share/compiler/compilationPolicy.cpp
+++ b/src/hotspot/share/compiler/compilationPolicy.cpp
@@ -157,8 +157,8 @@ void CompilationPolicy::replay_training_at_init_impl(InstanceKlass* klass, TRAPS
         if (ctd->init_deps_left() == 0) {
           MethodTrainingData* mtd = ctd->method();
           if (mtd->has_holder()) {
-            const methodHandle mh(THREAD, const_cast<Method*>(mtd->holder()));
-            CompilationPolicy::maybe_compile_early(mh, THREAD);
+            // const methodHandle mh(THREAD, const_cast<Method*>(mtd->holder()));
+            // CompilationPolicy::maybe_compile_early(mh, THREAD);
           }
         }
       });


veresov commented Apr 12, 2025

In this PR we're actually targeting warmup improvements not the startup improvements per se. So I think we should craft a benchmark that shows how the performance changes over time rather than measuring the total. This should allow us to measure when the app reaches its peak performance.

@shipilev

In this PR we're actually targeting warmup improvements not the startup improvements per se.

Right, but I think that distinction would be lost when users enable AOTCache and see startup regressions.

Let me see if I understand this right. I think PR in current form only stacks well with AOT? That is, when training data loads, we force compilation, which would likely be handled by AOT-ed code load? For a free-standing PR, just loading the profiles without triggering the compilation seems like a good incremental win without any particular losses, apart from spending a few CPU cycles on installing the profiles.


iklam commented Apr 14, 2025

@veresov please take openjdk/leyden@293edad from the Leyden repo as well.


iklam commented Apr 14, 2025

In this PR we're actually targeting warmup improvements not the startup improvements per se.

Right, but I think that distinction would be lost when users enable AOTCache and see startup regressions.

Let me see if I understand this right. I think PR in current form only stacks well with AOT? That is, when training data loads, we force compilation, which would likely be handled by AOT-ed code load? For a free-standing PR, just loading the profiles without triggering the compilation seems like a good incremental win without any particular losses, apart from spending a few CPU cycles on installing the profiles.

Maybe we should still compile at start-up but somehow limit the amount of excessive compilation? E.g., only the methods that were compiled during the first N ms of the training run?


veresov commented Apr 15, 2025

In this PR we're actually targeting warmup improvements not the startup improvements per se.

Right, but I think that distinction would be lost when users enable AOTCache and see startup regressions.

Let me see if I understand this right. I think PR in current form only stacks well with AOT? That is, when training data loads, we force compilation, which would likely be handled by AOT-ed code load? For a free-standing PR, just loading the profiles without triggering the compilation seems like a good incremental win without any particular losses, apart from spending a few CPU cycles on installing the profiles.

Maybe we should still compile at start-up but somehow limit the amount of excessive compilation? E.g., only the methods that were compiled during the first N ms of the training run?

We should discuss it, but JITing is free on many-core systems. Maybe we should have a heuristic based on the number of cores? Or a command-line option, -XX:AOTMode=warmup?
Optimizing for startup and for time-to-peak performance are different things. One could probably construct a benchmark that would win with the interpreter vs. all of our stuff.

@shipilev

This is how I think about it.

Profiling essentially answers two related, but not exactly the same questions:
Q1. What and when to compile: which methods are actually hot in the nearest past
Q2. How to compile: what are branch frequencies, receiver types, etc

The mere existence of profile answers Q2.

In pure JIT-profiled ("JITP") and JIT-compiled ("JITC") mode, through the way we collect the profile, the existence of profile also answers Q1 automatically. That is, if JITP says the profile is ready, it means that the method is hot right now, and we have a strong preference to throw it to JITC and compile it. This is a great JIT dynamism: we are able to compile things that matter now, leaving other, less important things for later.

With AOT-profiled ("AOTP") mode, the answers to these two questions finally depart from each other. Now, AOTP can provide us with the ready profile for a still-cold method. Does this answer Q1 that well? We surely lost the dynamism of "what is hot right now". As we can see in the perf data above, in the case of over-trained runs and with only JITC present, AOTP gives us a wildly inefficient answer to Q1. JIT compilation is unfortunately not free even on multi-cores, since the resources we spend redundantly compiling are the resources we don't spend on the workload itself, even if compilation runs on an ostensibly "separate" core. See the diff between This PR and No comp, on a fairly large machine.

This situation is alleviated when we have AOT-compiled ("AOTC") mode, like in current Leyden. Then, the bad answer to Q1 does not make us pay the price for extra JIT compilation. We would load some unnecessary AOT code in tens of microseconds per method, but that is not as bad as spending hundreds of milliseconds JIT compiling.

One can make an argument that the core problem here is over-training: if you don't want AOTP to trigger extra JITC, then don't overtrain? I used to think the same a year ago. Unfortunately, my tests show that in AOTP+AOTC(+JITC) modes it is useful to over-train: to collect better profiles, make sure all the long tails are compiled well, etc.

Without AOTC, I think it is much more sensible not to allow AOTP to trigger JITC at all. I don't think heuristics for core counts, number of compilations triggered, etc. help solve the fundamental problem of the AOTP -> JITC trigger. Without that trigger, persistent profiles help both startup and warmup: the basic invocation profiling answers Q1, and AOTP assists the invocation/MDO profiling in answering Q2. Far fewer surprises in compiler dynamics, plus all of the fun of not spending time collecting the profiles. I think that also aligns better with the current JEP writeup.

@shipilev

I mentioned this at Leyden meeting. Some methods in JavacBench are stuck at level=0 with -XX:-TieredCompilation, and I have a report and fix here: JDK-8355296. That patch should be applicable to this PR as well. We can discuss how much of that makes sense in those PRs.

@shipilev

Some new graphs that capture the warmup curve more exactly (X axis = real time; Y axis = "iteration" time):
[plot-t500: warmup curve graph]

This shows persistent profiles are excellent with C2 only, and there are still incremental improvements in tiered and C1 only modes.


dougxc commented Apr 23, 2025

@veresov by the looks of it, it should be possible for jdk.vm.ci.hotspot.HotSpotProfilingInfo to also be augmented with the training data profile info. What's the plan for that?


dougxc commented Apr 23, 2025

Since I see no changes to C1 or C2 here, I assume training run profile info comes purely from the interpreter?


veresov commented Apr 23, 2025

I mentioned this at Leyden meeting. Some methods in JavacBench are stuck at level=0 with -XX:-TieredCompilation, and I have a report and fix here: JDK-8355296. That patch should be applicable to this PR as well. We can discuss how much of that makes sense in those PRs.

Thanks, I ported it here.


veresov commented Apr 23, 2025

@shipilev Now that we have all the patches that fix the perf problems in this PR, could you please rerun your experiments that turn off/on the compilation after clinit? I wonder if it's still beneficial to turn it off and what happens to the warmup curve...?


veresov commented Apr 23, 2025

Since I see no changes to C1 or C2 here, I assume training run profile info comes purely from the interpreter?

There are CI hooks that collect some of the information, although it's mostly about snapshotting MDOs.

@shipilev

@shipilev Now that we have all the patches that fix the perf problems in this PR, could you please rerun your experiments that turn off/on the compilation after clinit? I wonder if it's still beneficial to turn it off and what happens to the warmup curve...?

Yes, planning to retest in the morning. Thanks!


shipilev commented Apr 23, 2025

@shipilev Now that we have all the patches that fix the perf problems in this PR, could you please rerun your experiments that turn off/on the compilation after clinit? I wonder if it's still beneficial to turn it off and what happens to the warmup curve...?

A quick one-shot experiment shows that disabling the eager compilation trigger is still beneficial in the early phases:
[plot-t200: warmup curve graph]

I'll do fuller runs overnight / in the morning, but I think this still highlights the need for a feature flag that guards the compilation trigger.

@iwanowww

@shipilev keep in mind that there's still an elephant in the room when it comes to startup: contention on MethodCompileQueue_lock. The problem becomes extremely acute the more compiler threads are present:

linux-x64 (CLX, 96 cores), leyden/premain, JavacBench

VM flags: -XX:-LoadCachedCode -XX:+AOTReplayTraining -XX:-UseLockFreeCompileQueues -XX:+UseGlobalCompileQueueLock -XX:-UseDynamicNumberOfCompilerThreads -Xlog:init

-XX:CICompilerCount=2 (1+1)

Setup: 39.649 ms
Iteration      0: 694.126 ms /   744 ms
Iteration      1: 353.071 ms /  1098 ms
Iteration      2: 213.609 ms /  1313 ms
Iteration      3: 170.167 ms /  1483 ms
Iteration      4: 142.025 ms /  1626 ms
Iteration      5: 149.483 ms /  1775 ms
Iteration      6: 116.326 ms /  1892 ms
Iteration      7: 100.457 ms /  1993 ms
Iteration      8: 93.768 ms /  2087 ms
Iteration      9: 79.593 ms /  2167 ms

MethodCompileQueue_lock =  3126us (101155us) /      1024 events

vs

-XX:CICompilerCount=18 (6+12)

Setup: 26.341 ms
Iteration      0: 691.520 ms /   731 ms
Iteration      1: 92.919 ms /   824 ms
Iteration      2: 59.493 ms /   884 ms
Iteration      3: 47.337 ms /   931 ms
Iteration      4: 42.826 ms /   974 ms
Iteration      5: 54.629 ms /  1029 ms
Iteration      6: 36.257 ms /  1065 ms
Iteration      7: 38.009 ms /  1103 ms
Iteration      8: 35.152 ms /  1139 ms
Iteration      9: 32.922 ms /  1172 ms

MethodCompileQueue_lock = 10083us (460996us) /      1419 events

W/o compilation replay (-XX:-AOTReplayTraining) it's much more modest:

-XX:CICompilerCount=2 (1+1)

Setup: 37.928 ms
Iteration      0: 451.696 ms /   501 ms
Iteration      1: 203.250 ms /   705 ms
Iteration      2: 161.545 ms /   867 ms
Iteration      3: 138.514 ms /  1006 ms
Iteration      4: 118.381 ms /  1125 ms
Iteration      5: 121.460 ms /  1247 ms
Iteration      6: 87.302 ms /  1335 ms
Iteration      7: 75.985 ms /  1411 ms
Iteration      8: 72.188 ms /  1484 ms
Iteration      9: 72.722 ms /  1557 ms

MethodCompileQueue_lock =  9440us (16982us) /      5067 events


-XX:CICompilerCount=18 (6+12)

Setup: 30.336 ms
Iteration      0: 442.384 ms /   485 ms
Iteration      1: 172.386 ms /   658 ms
Iteration      2: 137.177 ms /   796 ms
Iteration      3: 81.599 ms /   878 ms
Iteration      4: 86.394 ms /   965 ms
Iteration      5: 72.938 ms /  1038 ms
Iteration      6: 71.333 ms /  1112 ms
Iteration      7: 56.150 ms /  1169 ms
Iteration      8: 52.959 ms /  1222 ms
Iteration      9: 48.328 ms /  1271 ms

MethodCompileQueue_lock = 59048us (42354us) /      3897 events

@iwanowww

And, somewhat surprisingly, remedies in leyden/premain (-XX:-UseGlobalCompileQueueLock -XX:+UseLockFreeCompileQueues) don't give those 500ms back yet:

-XX:-LoadCachedCode -XX:+AOTReplayTraining -XX:+UseLockFreeCompileQueues -XX:-UseGlobalCompileQueueLock -XX:-UseDynamicNumberOfCompilerThreads -Xlog:init

-XX:CICompilerCount=18 (6+12)

Setup: 22.865 ms
Iteration      0: 542.214 ms /   576 ms
Iteration      1: 246.702 ms /   823 ms
Iteration      2: 154.868 ms /   978 ms
Iteration      3: 117.547 ms /  1096 ms
Iteration      4: 90.245 ms /  1187 ms
Iteration      5: 88.173 ms /  1275 ms
Iteration      6: 57.447 ms /  1333 ms
Iteration      7: 48.204 ms /  1381 ms
Iteration      8: 45.394 ms /  1427 ms
Iteration      9: 39.042 ms /  1466 ms

MethodCompileQueueC1_lock =    36us (   45us) /         8 events
MethodCompileQueueC2_lock =  4437us ( 1569us) /       544 events

What caught my eye is that C1 is much slower with the lock-free queue:

-XX:-LoadCachedCode -XX:+AOTReplayTraining -XX:-UseDynamicNumberOfCompilerThreads -XX:CICompilerCount=18

-XX:-UseLockFreeCompileQueues -XX:+UseGlobalCompileQueueLock

[1.206s][info   ][init] Compilation statistics:   Total: 6121 methods; 0 bailouts, 0 invalidated, 864 non_entrant
[1.206s][info   ][init]     Tier1:   245 methods (in 0.082s)
[1.206s][info   ][init]     Tier2:  4846 methods (in 3.176s); osr:   30 methods (in 0.035s); not_entrant:  828 methods
[1.206s][info   ][init]     Tier3:   145 methods (in 0.050s); not_entrant:   20 methods
[1.206s][info   ][init]     Tier4:   824 methods (in 8.603s); osr:   31 methods (in 0.154s); not_entrant:   16 methods
[1.206s][info   ][init] 
[1.206s][info   ][init]   C1 compile queue (0 active / 6 total threads): 0 tasks
[1.206s][info   ][init]   C2 compile queue (12 active / 12 total threads): 446 tasks: T4: 446 tasks;

vs

-XX:+UseLockFreeCompileQueues -XX:-UseGlobalCompileQueueLock:

[1.553s][info   ][init] Compilation statistics:   Total: 4583 methods; 0 bailouts, 0 invalidated, 933 non_entrant
[1.553s][info   ][init]     Tier1:    65 methods (in 0.075s)
[1.553s][info   ][init]     Tier2:  3306 methods (in 4.725s); osr:    7 methods (in 0.014s); not_entrant:  879 methods
[1.553s][info   ][init]     Tier3:   229 methods (in 0.342s); not_entrant:   43 methods
[1.553s][info   ][init]     Tier4:   938 methods (in 11.718s); osr:   38 methods (in 0.199s); not_entrant:   11 methods
[1.553s][info   ][init] 
[1.554s][info   ][init]   C1 compile queue (2 active / 6 total threads): 1684 tasks: T1: 130 tasks; T2: 1506 tasks; T3: 48 tasks;
[1.554s][info   ][init]   C2 compile queue (5 active / 12 total threads): 0 tasks


openjdk bot commented Apr 30, 2025

@veresov this pull request cannot be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request, you can run the following commands in the local repository for your personal fork:

git checkout pp
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added csr Pull request needs approved CSR before integration merge-conflict Pull request has merge conflict with target branch labels Apr 30, 2025
@openjdk openjdk bot removed the csr Pull request needs approved CSR before integration label Jun 3, 2025

bridgekeeper bot commented Jun 18, 2025

@veresov This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@veresov veresov closed this Jun 19, 2025
Labels
hotspot, merge-conflict (Pull request has merge conflict with target branch)
6 participants