
8350621: Code cache stops scheduling GC #23656

Closed
wants to merge 6 commits

Conversation


@ajacob ajacob commented Feb 16, 2025

The purpose of this PR is to fix a bug where we can end up in a situation where the GC is no longer scheduled by CodeCache.

This situation is possible because the _unloading_threshold_gc_requested flag is set to true when triggering the GC, and we expect the GC to call CodeCache::on_gc_marking_cycle_finish, which in turn calls CodeCache::update_cold_gc_count, which resets _unloading_threshold_gc_requested and allows further GC scheduling.

Unfortunately, this can't work properly under certain circumstances.
For example, with G1GC, calling G1CollectedHeap::collect does not guarantee that the GC will actually run, as a collection can already be in progress.
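
To make the failure mode concrete, here is a minimal standalone sketch (plain C++ with std::atomic, not the actual HotSpot code; the function names are invented) of a one-shot request flag whose reset depends entirely on the collector acknowledging the request:

    #include <atomic>
    #include <cstdio>

    static std::atomic<bool> gc_requested{false};

    // Models CodeCache requesting a GC: only the first caller actually requests.
    void request_gc_once() {
        bool expected = false;
        if (gc_requested.compare_exchange_strong(expected, true)) {
            std::puts("triggering GC");
            // In the real code this is Universe::heap()->collect(...), and that
            // request may be silently dropped if a collection is already running.
        }
    }

    // Models the collector resetting the flag at the end of its cycle.
    // If the request above was dropped, this never runs for that request,
    // so every later call to request_gc_once() becomes a no-op.
    void acknowledge_gc_finished() {
        gc_requested.store(false);
    }

    int main() {
        request_gc_once();   // the request goes through, the flag is now true
        // acknowledge_gc_finished() is never called because the GC ignored us...
        request_gc_once();   // ...so this request, and every later one, is dropped
    }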

I have observed this behavior on JVMs running version 21 that were recently migrated from Java 17.
Those JVMs have some pressure on the code cache and a heap that is quite large compared to the allocation rate, which means that objects are mostly collected by young collections and full GCs take a long time to happen.

I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GCs can be affected as well.

In order to reproduce this issue, I found a very simple and convenient way:

public class CodeCacheMain {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            Thread.sleep(100);
        }
    }
}

Run this simple app with the following JVM flags:

-Xlog:gc*=info,codecache=info -Xmx512m -XX:ReservedCodeCacheSize=2496k -XX:StartAggressiveSweepingAt=15
  • 512m for the heap, just to make the intent clear that we don't want to be bothered by a full GC
  • low ReservedCodeCacheSize to put pressure on code cache quickly
  • StartAggressiveSweepingAt can be set to 20 or 15 for faster bug reproduction

By itself, the program will hardly put any pressure on the code cache, but the good news is that attaching a jconsole to it is sufficient, which will:

  • allow us to monitor the code cache
  • indirectly generate activity in the code cache, which is just what we need to reproduce the bug

At some point, some code-cache-related log messages will show up along with GC activity:

[648.733s][info][codecache      ] Triggering aggressive GC due to having only 14.970% free memory

And then it will stop and we'll end up with the following message:

[672.714s][info][codecache      ] Code cache is full - disabling compilation

This leaves the JVM in an unstable state.

I considered a few different options before making this change:

  1. Always call Universe::heap()->collect(...) without making any check (the GC impl should handle the situation)
  2. Fix all GC implementations to ensure _unloading_threshold_gc_requested gets back to false at some point (probably what is supposed to happen today)
  3. Change CollectedHeap::collect to return a bool instead of void to indicate if GC was run or scheduled

But I discarded them:

  1. A dumb option that I used to check that the bug would be fixed, but it would probably put some pressure on resources when allocations need to be performed at the code cache level (as it would be called at each allocation attempt). In addition, the log indicating that we trigger a GC is spammed, and it is not easy to decide how to handle that log correctly.
  2. This option is possible and was my favorite up to a point. GC implementations can have quite a lot of branches, and it can be difficult to ensure we don't forget a case where the flag should be reset. This could eventually be explored in addition to the solution I propose in this PR. For example, we could introduce a static method in CodeCache that would let a GC implementation reset the flag when it knows the GC will not actually run (to be discussed; a hypothetical sketch of such a hook follows this list).
  3. I explored this solution, but it adds quite a lot of changes and is risky in the long term (in my opinion). G1GC already has a G1CollectedHeap::try_collect method that returns a bool, but this bool is true even when the GC is not run.
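
The static-method hook mentioned in option 2 could look roughly like this, in the same illustrative non-HotSpot style as the sketch earlier in this description (the name is purely hypothetical):

    #include <atomic>

    extern std::atomic<bool> gc_requested;   // the request flag from the earlier sketch

    // Hypothetical hook a GC implementation would call whenever it decides not to
    // honor a code-cache-triggered collection request, so the flag cannot stay stuck.
    void gc_request_dropped() {
        gc_requested.store(false);
    }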

As a result, I decided to simply add a way for CodeCache to recover from this situation. The idea is to leave the GC code as-is but remember the time of the last GC request and reset the flag to false if it has not been reset within a certain amount of time (250ms in my PR). This should only kick in in corner cases where the GC implementation has not reset the flag by itself.
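
Here is a rough standalone sketch of that timeout-based auto-reset (again plain C++ rather than the HotSpot patch itself; only the 250ms grace period comes from the PR, while the names and logging are made up for illustration):

    #include <atomic>
    #include <chrono>
    #include <cstdio>

    static std::atomic<bool>   gc_requested{false};
    static std::atomic<double> last_request_time_s{0.0};

    static double now_s() {
        using namespace std::chrono;
        return duration<double>(steady_clock::now().time_since_epoch()).count();
    }

    void maybe_request_gc() {
        // If a previous request was never acknowledged within the grace period,
        // assume it was dropped and clear the flag so a new request can be made.
        double requested_at = last_request_time_s.load();
        if (gc_requested.load() && now_s() - requested_at > 0.250) {
            std::printf("Previous GC request has not been reset after %.6fs, force auto-reset\n",
                        now_s() - requested_at);
            gc_requested.store(false);
        }
        bool expected = false;
        if (gc_requested.compare_exchange_strong(expected, true)) {
            last_request_time_s.store(now_s());
            // Real code: log, then Universe::heap()->collect(GCCause::_codecache_GC_aggressive).
        }
    }

    int main() {
        maybe_request_gc();   // first call requests and records the timestamp
        maybe_request_gc();   // within 250ms of an unacknowledged request: no-op
    }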

Among the advantages of this solution: it provides a safety net to recover from situations that may be introduced by future changes in a GC implementation, should someone forget to take care of the code cache.

I spent a lot of time investigating this issue and exploring solutions, and I welcome any input on it, as this is my first PR on the project.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8350621: Code cache stops scheduling GC (Bug - P2)(⚠️ The fixVersion in this issue is [25] but the fixVersion in .jcheck/conf is 26, a new backport will be created when this pr is integrated.)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23656/head:pull/23656
$ git checkout pull/23656

Update a local copy of the PR:
$ git checkout pull/23656
$ git pull https://git.openjdk.org/jdk.git pull/23656/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 23656

View PR using the GUI difftool:
$ git pr show -t 23656

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23656.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper bridgekeeper bot added the oca Needs verification of OCA signatory status label Feb 16, 2025
@bridgekeeper

bridgekeeper bot commented Feb 16, 2025

Hi @ajacob, welcome to this OpenJDK project and thanks for contributing!

We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment in this pull request.

If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user ajacob" as summary for the issue.

If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment in this pull request.

@openjdk

openjdk bot commented Feb 16, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk

openjdk bot commented Feb 16, 2025

@ajacob The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@ajacob
Author

ajacob commented Feb 16, 2025

Here is a log sample that shows how it behaves when the bug occurs. Logs starting with >>> are some logs I added when working on steps to reproduce.
The bug occurs at ~648.762s: the GC has reset the flag to false but is still running, and CodeCache has called Universe::heap()->collect(...), which was discarded because of the collection already in progress.
Note that during this test -XX:StartAggressiveSweepingAt was set to 25 instead of 15, but I confirm I can reproduce with 15 as well (as explained in the description of the PR).

[648.733s][info][codecache      ] >>> should start GC (_unloading_threshold_gc_requested = 0)
[648.733s][info][codecache      ] Triggering aggressive GC due to having only 24.970% free memory
[648.733s][info][gc,start       ] GC(6210) Pause Young (CodeCache GC Aggressive)
[648.733s][info][gc,heap        ] GC(6210) PSYoungGen: 2851K(132096K)->224K(132096K) Eden: 2851K(131584K)->0K(131584K) From: 0K(512K)->224K(512K)
[648.733s][info][gc,heap        ] GC(6210) ParOldGen: 11691K(349696K)->11691K(349696K)
[648.733s][info][gc,metaspace   ] GC(6210) Metaspace: 7690K(8192K)->7690K(8192K) NonClass: 6904K(7168K)->6904K(7168K) Class: 786K(1024K)->786K(1024K)
[648.733s][info][gc             ] GC(6210) Pause Young (CodeCache GC Aggressive) 14M->11M(470M) 0.238ms
[648.733s][info][gc,cpu         ] GC(6210) User=0.00s Sys=0.00s Real=0.00s
[648.733s][info][gc,start       ] GC(6211) Pause Full (CodeCache GC Aggressive)
[648.733s][info][gc,phases,start] GC(6211) Marking Phase
[648.742s][info][gc,phases      ] GC(6211) Marking Phase 8.585ms
[648.742s][info][gc,phases,start] GC(6211) Summary Phase
[648.742s][info][gc,phases      ] GC(6211) Summary Phase 0.009ms
[648.742s][info][gc,phases,start] GC(6211) Adjust Roots
[648.742s][info][gc,phases      ] GC(6211) Adjust Roots 0.311ms
[648.742s][info][gc,phases,start] GC(6211) Compaction Phase
[648.747s][info][gc,phases      ] GC(6211) Compaction Phase 4.701ms
[648.747s][info][gc,phases,start] GC(6211) Post Compact
[648.747s][info][codecache      ] >>> CodeCache::update_cold_gc_count _unloading_threshold_gc_requested = false
[648.747s][info][codecache      ] Code cache critically low; use aggressive aging
[648.747s][info][gc,phases      ] GC(6211) Post Compact 0.106ms
[648.747s][info][gc,heap        ] GC(6211) PSYoungGen: 224K(132096K)->0K(132096K) Eden: 0K(131584K)->0K(131584K) From: 224K(512K)->0K(512K)
[648.747s][info][gc,heap        ] GC(6211) ParOldGen: 11691K(349696K)->11688K(349696K)
[648.747s][info][gc,metaspace   ] GC(6211) Metaspace: 7690K(8192K)->7690K(8192K) NonClass: 6904K(7168K)->6904K(7168K) Class: 786K(1024K)->786K(1024K)
[648.747s][info][gc             ] GC(6211) Pause Full (CodeCache GC Aggressive) 11M->11M(470M) 13.799ms
[648.747s][info][gc,cpu         ] GC(6211) User=0.11s Sys=0.00s Real=0.01s
[648.747s][info][codecache      ] >>> should start GC (_unloading_threshold_gc_requested = 0)
[648.747s][info][codecache      ] Triggering aggressive GC due to having only 24.865% free memory
[648.747s][info][gc,start       ] GC(6212) Pause Young (CodeCache GC Aggressive)
[648.748s][info][gc,heap        ] GC(6212) PSYoungGen: 2851K(132096K)->224K(132096K) Eden: 2851K(131584K)->0K(131584K) From: 0K(512K)->224K(512K)
[648.748s][info][gc,heap        ] GC(6212) ParOldGen: 11688K(349696K)->11688K(349696K)
[648.748s][info][gc,metaspace   ] GC(6212) Metaspace: 7690K(8192K)->7690K(8192K) NonClass: 6904K(7168K)->6904K(7168K) Class: 786K(1024K)->786K(1024K)
[648.748s][info][gc             ] GC(6212) Pause Young (CodeCache GC Aggressive) 14M->11M(470M) 0.257ms
[648.748s][info][gc,cpu         ] GC(6212) User=0.00s Sys=0.00s Real=0.00s
[648.748s][info][gc,start       ] GC(6213) Pause Full (CodeCache GC Aggressive)
[648.748s][info][gc,phases,start] GC(6213) Marking Phase
[648.756s][info][gc,phases      ] GC(6213) Marking Phase 8.512ms
[648.756s][info][gc,phases,start] GC(6213) Summary Phase
[648.756s][info][gc,phases      ] GC(6213) Summary Phase 0.007ms
[648.756s][info][gc,phases,start] GC(6213) Adjust Roots
[648.757s][info][gc,phases      ] GC(6213) Adjust Roots 0.331ms
[648.757s][info][gc,phases,start] GC(6213) Compaction Phase
[648.761s][info][gc,phases      ] GC(6213) Compaction Phase 4.734ms
[648.761s][info][gc,phases,start] GC(6213) Post Compact
[648.761s][info][codecache      ] >>> CodeCache::update_cold_gc_count _unloading_threshold_gc_requested = false
[648.761s][info][codecache      ] Code cache critically low; use aggressive aging
[648.761s][info][gc,phases      ] GC(6213) Post Compact 0.059ms
[648.761s][info][gc,heap        ] GC(6213) PSYoungGen: 224K(132096K)->0K(132096K) Eden: 0K(131584K)->0K(131584K) From: 224K(512K)->0K(512K)
[648.761s][info][gc,heap        ] GC(6213) ParOldGen: 11688K(349696K)->11689K(349696K)
[648.761s][info][gc,metaspace   ] GC(6213) Metaspace: 7690K(8192K)->7690K(8192K) NonClass: 6904K(7168K)->6904K(7168K) Class: 786K(1024K)->786K(1024K)
[648.761s][info][gc             ] GC(6213) Pause Full (CodeCache GC Aggressive) 11M->11M(470M) 13.725ms
[648.761s][info][gc,cpu         ] GC(6213) User=0.09s Sys=0.02s Real=0.01s
[648.762s][info][codecache      ] >>> should start GC (_unloading_threshold_gc_requested = 0)
[648.762s][info][codecache      ] Triggering aggressive GC due to having only 24.895% free memory
[648.762s][info][codecache      ] >>> should start GC (_unloading_threshold_gc_requested = 1)
[648.762s][info][gc,start       ] GC(6214) Pause Young (GCLocker Initiated GC)
[648.762s][info][gc,heap        ] GC(6214) PSYoungGen: 1973K(132096K)->224K(132096K) Eden: 1973K(131584K)->0K(131584K) From: 0K(512K)->224K(512K)
[648.762s][info][gc,heap        ] GC(6214) ParOldGen: 11689K(349696K)->11689K(349696K)
[648.762s][info][gc,metaspace   ] GC(6214) Metaspace: 7691K(8192K)->7691K(8192K) NonClass: 6905K(7168K)->6905K(7168K) Class: 786K(1024K)->786K(1024K)
[648.762s][info][gc             ] GC(6214) Pause Young (GCLocker Initiated GC) 13M->11M(470M) 0.310ms
[648.762s][info][gc,cpu         ] GC(6214) User=0.00s Sys=0.00s Real=0.00s
[648.762s][info][codecache      ] >>> should start GC (_unloading_threshold_gc_requested = 1)
** removed 278 occurrences of the same log **
[672.714s][info][codecache      ] >>> should start GC (_unloading_threshold_gc_requested = 1)
[672.714s][info][codecache      ] Code cache is full - disabling compilation
[672.714s][warning][codecache      ] CodeCache is full. Compiler has been disabled.
[672.714s][warning][codecache      ] Try increasing the code cache size using -XX:ReservedCodeCacheSize=
OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
CodeCache: size=2496Kb used=2479Kb max_used=2479Kb free=16Kb
 bounds [0x00007923ed490000, 0x00007923ed700000, 0x00007923ed700000]
 total_blobs=1127 nmethods=640 adapters=399
 compilation: disabled (not enough contiguous free space left)
              stopped_count=1, restarted_count=0
 full_count=1

@ajacob
Author

ajacob commented Feb 16, 2025

Both JVMs had been running for ~20 minutes.

jconsole (reproducing the bug)


Started to misbehave at ~315.181s

jconsole (with the fix from the PR)


[13.078s][debug][codecache   ] Previous GC request has not been reset after 13.018797s, force auto-reset
[412.985s][debug][codecache   ] Previous GC request has not been reset after 23.985252s, force auto-reset
[464.974s][debug][codecache   ] Previous GC request has not been reset after 7.970082s, force auto-reset
[524.953s][debug][codecache   ] Previous GC request has not been reset after 3.937477s, force auto-reset

@openjdk

openjdk bot commented Feb 17, 2025

@ajacob Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.

@ajacob ajacob marked this pull request as draft February 17, 2025 08:50
@ajacob
Author

ajacob commented Feb 17, 2025

Converted to draft: I would like to change it to ensure we log before calling Universe::heap()->collect(...) (same way as before)

@ajacob ajacob marked this pull request as ready for review February 17, 2025 20:41
@ajacob ajacob changed the title from "XXXXXXX: Fix code cache GC" to "XXXXXXX: Fix CodeCache becoming unable to run GC" Feb 17, 2025
@ajacob
Author

ajacob commented Feb 18, 2025

Performing more tests on this (different configurations, different GCs, ...), I noticed that I had a race condition when multiple threads enter the try_to_gc method I introduced.

The impact of the race condition was:

  • an unwanted auto-reset of the flag
  • an invalid "duration since last GC request" log
  • an unneeded GC request

Possible in the following conditions:

  • thread1: reads _unloading_gc_requested_time (with elapsed_since_last_gc_request > 250ms)
  • thread2: has _unloading_gc_requested == false → it requests GC
  • thread1: has _unloading_gc_requested == true → it resets _unloading_gc_requested + log + request GC 💥

In order to avoid that, I propose to simply ensure we don't have multiple threads performing these checks in gc_on_allocation concurrently.
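
Roughly, the idea is to take a lock around the whole check-and-request sequence (same standalone, non-HotSpot style as the sketches in the description; the lock and names are made up for illustration):

    #include <atomic>
    #include <chrono>
    #include <mutex>

    static std::mutex        gc_check_lock;              // serializes the check-and-request path
    static std::atomic<bool> gc_requested{false};
    static double            last_request_time_s = 0.0;  // only read/written under gc_check_lock

    static double now_s() {
        using namespace std::chrono;
        return duration<double>(steady_clock::now().time_since_epoch()).count();
    }

    void gc_on_allocation_checks() {
        // Without this lock, one thread can observe a stale timestamp while another
        // issues a fresh request, then wrongly auto-reset the flag, log a bogus
        // "duration since last GC request" and trigger a second, unneeded GC.
        std::lock_guard<std::mutex> guard(gc_check_lock);
        if (gc_requested.load() && now_s() - last_request_time_s > 0.250) {
            gc_requested.store(false);                   // auto-reset a dropped request
        }
        if (!gc_requested.exchange(true)) {
            last_request_time_s = now_s();
            // Real code: log, then Universe::heap()->collect(...).
        }
    }

    int main() {
        gc_on_allocation_checks();
    }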

@ajacob ajacob changed the title from "XXXXXXX: Fix CodeCache becoming unable to run GC" to "JDK-8350621: Fix CodeCache becoming unable to run GC" Feb 25, 2025
@ajacob ajacob changed the title from "JDK-8350621: Fix CodeCache becoming unable to run GC" to "8350621: Code cache stops scheduling GC" Feb 25, 2025
@dean-long
Member

/label hotspot-gc

@openjdk

openjdk bot commented Feb 27, 2025

@dean-long
The hotspot-gc label was successfully added.

@bridgekeeper

bridgekeeper bot commented Mar 27, 2025

@ajacob This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper

bridgekeeper bot commented Apr 24, 2025

@ajacob This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Apr 24, 2025
@ajacob
Author

ajacob commented Apr 28, 2025

/open

@openjdk openjdk bot reopened this Apr 28, 2025
@openjdk

openjdk bot commented Apr 28, 2025

@ajacob This pull request is now open

@bridgekeeper bridgekeeper bot removed the oca Needs verification of OCA signatory status label Apr 28, 2025
@ajacob
Author

ajacob commented Apr 28, 2025

/signed

@bridgekeeper

bridgekeeper bot commented Apr 28, 2025

You are already a known contributor!

@openjdk openjdk bot added the rfr Pull request is ready for review label Apr 28, 2025
@mlbridge

mlbridge bot commented Apr 28, 2025

Webrevs

@albertnetymk
Member

I have a question regarding the existing code/logic.

    // In case the GC is concurrent, we make sure only one thread requests the GC.
    if (Atomic::cmpxchg(&_unloading_threshold_gc_requested, false, true) == false) {
      log_info(codecache)("Triggering aggressive GC due to having only %.3f%% free memory", free_ratio * 100.0);
      Universe::heap()->collect(GCCause::_codecache_GC_aggressive);
    }

Why make sure only one thread calls collect(...)? I believe this API can be invoked concurrently.

Would removing _unloading_threshold_gc_requested resolve this problem?

I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GC can be impacted as well.

For ParallelGC, ParallelScavengeHeap::collect contains the following check to ensure that the System.gc GC cause and similar ones guarantee a full GC.

    if (!GCCause::is_explicit_full_gc(cause)) {
      return;
    }

However, the current logic that a young-gc can cancel a full-gc (_codecache_GC_aggressive in this case) also seems surprising.

@bridgekeeper
Copy link

bridgekeeper bot commented May 30, 2025

@ajacob This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@tschatzl
Contributor

tschatzl commented Jun 24, 2025

Why make sure only one thread calls collect(...)? I believe this API can be invoked concurrently.

I have a question regarding the existing code/logic.

    // In case the GC is concurrent, we make sure only one thread requests the GC.
    if (Atomic::cmpxchg(&_unloading_threshold_gc_requested, false, true) == false) {
      log_info(codecache)("Triggering aggressive GC due to having only %.3f%% free memory", free_ratio * 100.0);
      Universe::heap()->collect(GCCause::_codecache_GC_aggressive);
    }

Why make sure only one thread calls collect(...)? I believe this API can be invoked concurrently.

Would removing _unloading_threshold_gc_requested resolve this problem?

It does, at the cost of many log messages:

[0.047s][info][gc          ] GC(0) Pause Young (Concurrent Start) (CodeCache GC Threshold) 2M->1M(512M) 4.087ms
[0.047s][info][gc,cpu      ] GC(0) User=0.01s Sys=0.00s Real=0.00s
[0.047s][info][gc          ] GC(1) Concurrent Mark Cycle
[0.047s][info][gc,marking  ] GC(1) Concurrent Scan Root Regions
[0.048s][info][codecache   ] Triggering threshold (7.654%) GC due to allocating 48.973% since last unloading (0.000% used -> 48.973% used)
[0.048s][info][gc,marking  ] GC(1) Concurrent Scan Root Regions 0.147ms
[0.048s][info][gc,marking  ] GC(1) Concurrent Mark
[0.048s][info][gc,marking  ] GC(1) Concurrent Mark From Roots
[0.048s][info][codecache   ] Triggering threshold (7.646%) GC due to allocating 49.028% since last unloading (0.000% used -> 49.028% used)
[0.048s][info][codecache   ] Triggering threshold (7.646%) GC due to allocating 49.028% since last unloading (0.000% used -> 49.028% used)
[0.048s][info][codecache   ] Triggering threshold (7.633%) GC due to allocating 49.114% since last unloading (0.000% used -> 49.114% used)
[0.049s][info][gc,task     ] GC(1) Using 6 workers of 6 for marking
[0.049s][info][codecache   ] Triggering threshold (7.625%) GC due to allocating 49.169% since last unloading (0.000% used -> 49.169% used)
[0.049s][info][codecache   ] Triggering threshold (7.616%) GC due to allocating 49.224% since last unloading (0.000% used -> 49.224% used)

[...repeated 15 times...]

[0.063s][info][codecache   ] Triggering threshold (7.527%) GC due to allocating 49.820% since last unloading (0.000% used -> 49.820% used)
[0.065s][info][codecache   ] Triggering threshold (7.519%) GC due to allocating 49.875% since last unloading (0.000% used -> 49.875% used)
[0.067s][info][codecache   ] Triggering threshold (7.511%) GC due to allocating 49.930% since last unloading (0.000% used -> 49.930% used)
[0.068s][info][gc,marking  ] GC(1) Concurrent Mark From Roots 20.256ms
[0.068s][info][gc,marking  ] GC(1) Concurrent Preclean
[0.068s][info][gc,marking  ] GC(1) Concurrent Preclean 0.016ms
[0.068s][info][gc,start    ] GC(1) Pause Remark

As you can see this is very annoying, particularly if the marking takes seconds all the while compiling is in progress.

I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GC can be impacted as well.

For ParallelGC, ParallelScavengeHeap::collect contains the following check to ensure that the System.gc GC cause and similar ones guarantee a full GC.

    if (!GCCause::is_explicit_full_gc(cause)) {
      return;
    }

However, the current logic that a young-gc can cancel a full-gc (_codecache_GC_aggressive in this case) also seems surprising.

That's a different issue.

@tschatzl
Contributor

tschatzl commented Jul 4, 2025

However, the current logic that a young-gc can cancel a full-gc (_codecache_GC_aggressive in this case) also seems surprising.

That's a different issue.

Actually, this is most likely the issue for Parallel GC; that code is only present in older JDK versions before 25 (although other reasons, like the GCLocker, may also prevent these GCs), i.e. there should be no such issue for Parallel GC in JDK 25.

The situation for Parallel GC is different for earlier versions, i.e. for backporting: it would require the changes for JDK-8192647 and at least one other fix. There needs to be a cost/benefit analysis, as these are rather intrusive changes.

@ajacob:

I considered a few different options before making this change:

  1. Always call Universe::heap()->collect(...) without making any check (the GC impl should handle the situation)
  2. Fix all GCs implementation to ensure _unloading_threshold_gc_requested gets back to false at some point (probably what is supposed to happen today)
  3. Change CollectedHeap::collect to return a bool instead of void to indicate if GC was run or scheduled

I took a stab at the (imo correct) fix for option 2 - fixing the G1 collect() logic.

Here's a diff: https://github.com/openjdk/jdk/compare/master...tschatzl:jdk:submit/8350621-code-cache-mgmt-hang?expand=1

What do you think?

Thanks,
Thomas

@ajacob
Author

ajacob commented Jul 7, 2025

Hello, I'm sorry I didn't get back to you sooner on this PR.

Indeed, I considered the first option (do not try to prevent calls to Universe::heap()->collect(...)) but wanted to have something more elaborate instead.

@tschatzl I like your proposal of fixing the GC implementation directly; as mentioned in my PR description, it was my favorite option, but because I found that this bug existed for at least Parallel GC and G1, I wanted to have something in CodeCache directly to ensure we never hit an issue caused by a GC implementation.
I had a look at your commit and feel it is the right direction for G1. Thank you for having a look at it.

@tschatzl
Contributor

tschatzl commented Jul 8, 2025

Hello, I'm sorry I didn't get back to you sooner on this PR.

No worries, it should rather be me apologizing for not getting to this earlier....

Indeed, I considered the first option (do not try to prevent calls to Universe::heap()->collect(...)) but wanted to have something more elaborate instead.

@tschatzl I like your proposal of fixing the GC implementation directly; as mentioned in my PR description, it was my favorite option, but because I found that this bug existed for at least Parallel GC and G1, I wanted to have something in CodeCache directly to ensure we never hit an issue caused by a GC implementation. I had a look at your commit and feel it is the right direction for G1. Thank you for having a look at it.

First, I assume you verified my change ;)

How do we proceed from here? Do you want to reuse this PR or should we (I, you?) open a new one for the new suggestion?

What do you prefer? I am fine with either option.

Thanks,
Thomas

@tschatzl
Contributor

tschatzl commented Jul 8, 2025

We discussed this question internally a bit, and the consensus has been to use a new PR to avoid confusion due to two different approaches being discussed in the same thread. Would you mind closing this one out, and I'll create a new PR?

Thanks,
Thomas

@ajacob
Author

ajacob commented Jul 8, 2025

Sure, I can close this PR; this makes things simpler for everybody, I guess!

@ajacob ajacob closed this Jul 8, 2025
@ajacob ajacob deleted the fix-code-cache-gc branch July 8, 2025 19:21
@tschatzl
Contributor

tschatzl commented Jul 9, 2025

Thank you.
