8350621: Code cache stops scheduling GC #26189

tschatzl · 2025-07-08T15:03:27Z

Hi all,

please review this change to avoid CodeCache triggered GCs temporarily being ignored.

In particular, G1 does not make sure when its collect() method is called during a concurrent cycle, that a Remark pause that does code unloading etc. actually occurs after that request. This makes it so that some internal flag is not reset appropriately (via some callback to the caller). After this event, there will never be another attempt by the compiler threads to issue a garbage collection. The compiler threads are stuck until the next code unloading (caused by e.g. a regular concurrent cycle being triggered).

The original PR also stated a similar with Parallel GC - however due to recent changes (https://bugs.openjdk.org/browse/JDK-8192647 and others) I do not think it is possible any more that the collect()call can silently be ignored. (I.e. gclocker could abort a full gc that is the only reason why code cache unloading and that callback will not occur as expected).

So for G1, the change makes the caller of collect() (i.e. the compiler thread) wait until that situation passes and retry. With that the compiler thread is sure that the callback the compiler threads are waiting for occurs. This does have the disadvantage that that compiler thread may not be available for compilation any more for a bit.

Testing: tier1-5, test case passing, failing before

Thanks,
Thomas

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8350621: Code cache stops scheduling GC (Bug - P2)(⚠️ The fixVersion in this issue is [25] but the fixVersion in .jcheck/conf is 26, a new backport will be created when this pr is integrated.)

Contributors

Thomas Schatzl <[email protected]>
Alexandre Jacob <[email protected]>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26189/head:pull/26189
$ git checkout pull/26189

Update a local copy of the PR:
$ git checkout pull/26189
$ git pull https://git.openjdk.org/jdk.git pull/26189/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26189

View PR using the GUI difftool:
$ git pr show -t 26189

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26189.diff

Using Webrev

Link to Webrev Comment

Hi all, please review this change to avoid CodeCache triggered GCs temporarily being ignored. In particular, G1 does not make sure when its `collect()` method is called during a concurrent cycle, that a `Remark` pause that does code unloading etc. actually occurs after that request. This makes it so that some internal flag is not reset appropriately, stuck until the next code unloading (caused by e.g. a regular concurrent cycle being triggered). Testing: tier1-5 Thanks, Thomas

bridgekeeper · 2025-07-08T15:03:52Z

👋 Welcome back tschatzl! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-07-08T15:04:28Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

tschatzl · 2025-07-08T15:04:44Z

/contributor add @tschatzl
/contributor add @ajacob
/label add hotspot-gc

openjdk · 2025-07-08T15:05:15Z

@tschatzl
Contributor Thomas Schatzl <[email protected]> successfully added.

openjdk · 2025-07-08T15:05:42Z

@tschatzl @ajacob was not found in the census.

Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:

/contributor add @openjdk-bot
/contributor add duke
/contributor add J. Duke <[email protected]>

User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.

openjdk · 2025-07-08T15:05:53Z

@tschatzl
The hotspot-gc label was successfully added.

tschatzl · 2025-07-08T15:10:21Z

/contributor add Alexandre Jacob [email protected]

openjdk · 2025-07-08T15:10:48Z

@tschatzl
Contributor Alexandre Jacob <[email protected]> successfully added.

mlbridge · 2025-07-08T15:13:58Z

Webrevs

01: Full - Incremental (36b023fb)
00: Full (348a7896)

TheRealMDoerr · 2025-07-09T15:33:44Z

Good finding! I think it looks good, but it should get reviewed by G1 experts.
The test doesn't always pass. I've seen this once:

[12.943s][info ][codecache] Triggering aggressive GC due to having only 49.991% free memory
[12.944s][trace][gc       ] C1 CompilerThread3: Try Collect Concurrently (CodeCache GC Aggressive): attempt 1
[12.984s][debug][gc       ] GC(22) Allocated 2 survivor 0 old percent total 6.67% (10%)
[12.984s][info ][gc       ] GC(22) Pause Young (Concurrent Start) (CodeCache GC Aggressive) 4M->2M(32M) 39.712ms
[12.984s][info ][gc       ] GC(23) Concurrent Mark Cycle
[12.984s][trace][gc       ] C1 CompilerThread3: Try Collect Concurrently (CodeCache GC Aggressive): wait
[13.040s][info ][codecache] Code cache critically low; use aggressive aging
[13.040s][info ][gc       ] GC(23) Pause Remark 3M->3M(32M) 6.383ms
[13.052s][info ][gc       ] GC(23) Pause Cleanup 3M->3M(32M) 0.078ms
[13.052s][info ][gc       ] GC(23) Concurrent Mark Cycle 67.851ms
[13.052s][trace][gc       ] C1 CompilerThread3: Try Collect Concurrently (CodeCache GC Aggressive): complete after wait
[13.056s][info ][codecache] Triggering aggressive GC due to having only 49.854% free memory
[13.056s][trace][gc       ] C1 CompilerThread2: Try Collect Concurrently (CodeCache GC Aggressive): attempt 1
[13.094s][debug][gc       ] GC(24) Allocated 2 survivor 1 old percent total 10.34% (10%)
[13.095s][info ][gc       ] GC(24) Pause Young (Concurrent Start) (CodeCache GC Aggressive) 3M->2M(32M) 38.594ms
[13.095s][trace][gc       ] C1 CompilerThread2: Try Collect Concurrently (CodeCache GC Aggressive): wait
[13.095s][info ][gc       ] GC(25) Concurrent Mark Cycle
[13.149s][info ][codecache] No code cache pressure; don't age code
[13.149s][info ][gc       ] GC(25) Pause Remark 3M->3M(32M) 11.108ms
[13.163s][info ][gc       ] GC(25) Pause Cleanup 3M->3M(32M) 0.072ms
[13.163s][info ][gc       ] GC(25) Concurrent Mark Cycle 68.023ms
[13.163s][trace][gc       ] C1 CompilerThread2: Try Collect Concurrently (CodeCache GC Aggressive): complete after wait
Compiled 180 classes
Compilation done, compiled 200 classes

STDERR:
java.lang.RuntimeException: Could not find a CodeCache GC Threshold GC after finishing the concurrent cycle: expected true, was false

Seems like there are several ways the VM can take.

tschatzl · 2025-07-09T15:49:09Z

Good finding! I think it looks good, but it should get reviewed by G1 experts. The test doesn't always pass. I've seen this once:

[12.943s][info ][codecache] Triggering aggressive GC due to having only 49.991% free memory
[12.944s][trace][gc       ] C1 CompilerThread3: Try Collect Concurrently (CodeCache GC Aggressive): attempt 1
[12.984s][debug][gc       ] GC(22) Allocated 2 survivor 0 old percent total 6.67% (10%)
[12.984s][info ][gc       ] GC(22) Pause Young (Concurrent Start) (CodeCache GC Aggressive) 4M->2M(32M) 39.712ms
[12.984s][info ][gc       ] GC(23) Concurrent Mark Cycle
[12.984s][trace][gc       ] C1 CompilerThread3: Try Collect Concurrently (CodeCache GC Aggressive): wait
[13.040s][info ][codecache] Code cache critically low; use aggressive aging
[13.040s][info ][gc       ] GC(23) Pause Remark 3M->3M(32M) 6.383ms
[13.052s][info ][gc       ] GC(23) Pause Cleanup 3M->3M(32M) 0.078ms
[13.052s][info ][gc       ] GC(23) Concurrent Mark Cycle 67.851ms
[13.052s][trace][gc       ] C1 CompilerThread3: Try Collect Concurrently (CodeCache GC Aggressive): complete after wait
[13.056s][info ][codecache] Triggering aggressive GC due to having only 49.854% free memory
[13.056s][trace][gc       ] C1 CompilerThread2: Try Collect Concurrently (CodeCache GC Aggressive): attempt 1
[13.094s][debug][gc       ] GC(24) Allocated 2 survivor 1 old percent total 10.34% (10%)
[13.095s][info ][gc       ] GC(24) Pause Young (Concurrent Start) (CodeCache GC Aggressive) 3M->2M(32M) 38.594ms
[13.095s][trace][gc       ] C1 CompilerThread2: Try Collect Concurrently (CodeCache GC Aggressive): wait
[13.095s][info ][gc       ] GC(25) Concurrent Mark Cycle
[13.149s][info ][codecache] No code cache pressure; don't age code
[13.149s][info ][gc       ] GC(25) Pause Remark 3M->3M(32M) 11.108ms
[13.163s][info ][gc       ] GC(25) Pause Cleanup 3M->3M(32M) 0.072ms
[13.163s][info ][gc       ] GC(25) Concurrent Mark Cycle 68.023ms
[13.163s][trace][gc       ] C1 CompilerThread2: Try Collect Concurrently (CodeCache GC Aggressive): complete after wait
Compiled 180 classes
Compilation done, compiled 200 classes

STDERR:
java.lang.RuntimeException: Could not find a CodeCache GC Threshold GC after finishing the concurrent cycle: expected true, was false

Seems like there are several ways the VM can take.

This is a test bug - for some reason the compiler triggered "Code Cache Aggressive" instead of "Code Cache Threshold" gcs. The former are a more urgent version of the latter, so for that test this is fine.

The test only checks for the latter.

src/hotspot/share/gc/g1/g1CollectedHeap.cpp

- restructure code in `try_collect_concurrently` - fix asserts - they only worked in the test because of whitebox being active - fix comments * threalmdoerr review: - fix test to be more lenient

kimbarrett · 2025-07-10T17:33:42Z

src/hotspot/share/gc/g1/g1CollectedHeap.cpp

-      // ensure that progress is made.
+    } else if (GCCause::is_codecache_requested_gc(cause) && op.marking_in_progress()) {
+        // For a CodeCache requested GC, before marking, progress is ensured as the
+        // following Remark pause unloads code (and signals the requestr such).


spelling: requestr

kimbarrett · 2025-07-10T17:36:23Z

src/hotspot/share/gc/g1/g1CollectedHeap.cpp

      //
      // Request is finished if any of
      // (1) the VMOp successfully performed a GC,
-      // (2) a concurrent cycle was already in progress,
+      // (2) a concurrent cycle was already in progress and we were
+      //     before the Remark pause for CodeCache requested GCs,


I think this comment change is no longer needed. Which is good, because I find it confusingly
ambiguous, but had trouble figuring out what it should say instead. I think the original is fine,
because we're in a non-codecache request context.

kimbarrett · 2025-07-10T20:53:17Z

src/hotspot/share/gc/g1/g1CollectedHeap.cpp

-    } else if (!GCCause::is_user_requested_gc(cause)) {
-      // For an "automatic" (not user-requested) collection, we just need to
-      // ensure that progress is made.
+    } else if (GCCause::is_codecache_requested_gc(cause) && op.marking_in_progress()) {


I wonder if this might be simpler if the codecache-request case were
completely handled by it's own clause. This would involve "duplicating" some
of the user-request case. But I think that could be done by chunking that into
a couple helper functions, bounded by the existing assertions in the
user-request case. So something like this (I don't love the names of the
helpers here):

if (is_codecache_requested_gc(cause)) { if (op.marking_in_progress()) { LOG... ; return true; } // is there some assert similar to the user-request assert that makes sense here? if (wait_for_completion(old_marking_started_before, old_marking_started_after, old_marking_completed_after)) { return true; } // is there some assert similar to the user-request assert that makes sense here? if (need_normal_retry(op)) { continue; } }

And similarly in the user-request case. And obviously with more comments :)

While looking at that, I noticed that the whitebox control stall in the
user-request case is doing exactly what the comment about it says not to do.
It is waiting for control to be released, while the comment says not to do
that. This of course means there could be multiple complete collections (and
not just STW full collections) before it can proceed. This seems to be in the
initial published webrev for JDK-8240239. The immediately preceding (draft,
non-public) webrev has that comment but different code, so it looks like the
code was changed but the comment wasn't updated. Bad Kim!

I will file a bug for that and assign it to myself. If you decide to adopt
something like the above suggested refactoring, just retain that code and
comment as-is. I'll see if I can recall what happened there, and update the
comment accordingly. Unfortunately, the (non-public) email discussion between
those versions doesn't make any mention of this change.

openjdk bot changed the title ~~8350621~~ 8350621: Code cache stops scheduling GC Jul 8, 2025

openjdk bot added the hotspot-gc [email protected] label Jul 8, 2025

* some whitespace fixes

348a789

openjdk bot added the rfr Pull request is ready for review label Jul 8, 2025

kimbarrett suggested changes Jul 9, 2025

View reviewed changes

src/hotspot/share/gc/g1/g1CollectedHeap.cpp Outdated Show resolved Hide resolved

src/hotspot/share/gc/g1/g1CollectedHeap.cpp Show resolved Hide resolved

* kbarret review:

36b023f

- restructure code in `try_collect_concurrently` - fix asserts - they only worked in the test because of whitebox being active - fix comments * threalmdoerr review: - fix test to be more lenient

kimbarrett suggested changes Jul 10, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

8350621: Code cache stops scheduling GC #26189

8350621: Code cache stops scheduling GC #26189

tschatzl commented Jul 8, 2025 •

edited by openjdk bot

Loading

Uh oh!

bridgekeeper bot commented Jul 8, 2025

Uh oh!

openjdk bot commented Jul 8, 2025

Uh oh!

tschatzl commented Jul 8, 2025

Uh oh!

openjdk bot commented Jul 8, 2025

Uh oh!

openjdk bot commented Jul 8, 2025

Uh oh!

openjdk bot commented Jul 8, 2025

Uh oh!

tschatzl commented Jul 8, 2025

Uh oh!

openjdk bot commented Jul 8, 2025

Uh oh!

mlbridge bot commented Jul 8, 2025 •

edited

Loading

Uh oh!

TheRealMDoerr commented Jul 9, 2025

Uh oh!

tschatzl commented Jul 9, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

kimbarrett Jul 10, 2025

Uh oh!

kimbarrett Jul 10, 2025

Uh oh!

kimbarrett Jul 10, 2025 •

edited

Loading

Uh oh!

Uh oh!

8350621: Code cache stops scheduling GC #26189

Are you sure you want to change the base?

8350621: Code cache stops scheduling GC #26189

Conversation

tschatzl commented Jul 8, 2025 • edited by openjdk bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress

Issue

Contributors

Reviewing

Uh oh!

bridgekeeper bot commented Jul 8, 2025

Uh oh!

openjdk bot commented Jul 8, 2025

Uh oh!

tschatzl commented Jul 8, 2025

Uh oh!

openjdk bot commented Jul 8, 2025

Uh oh!

openjdk bot commented Jul 8, 2025

Uh oh!

openjdk bot commented Jul 8, 2025

Uh oh!

tschatzl commented Jul 8, 2025

Uh oh!

openjdk bot commented Jul 8, 2025

Uh oh!

mlbridge bot commented Jul 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Webrevs

Uh oh!

TheRealMDoerr commented Jul 9, 2025

Uh oh!

tschatzl commented Jul 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kimbarrett Jul 10, 2025

Choose a reason for hiding this comment

Uh oh!

kimbarrett Jul 10, 2025

Choose a reason for hiding this comment

Uh oh!

kimbarrett Jul 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

tschatzl commented Jul 8, 2025 •

edited by openjdk bot

Loading

mlbridge bot commented Jul 8, 2025 •

edited

Loading

tschatzl commented Jul 9, 2025 •

edited

Loading

kimbarrett Jul 10, 2025 •

edited

Loading