8350621: Code cache stops scheduling GC #23656
Conversation
Hi @ajacob, welcome to this OpenJDK project and thanks for contributing! We do not recognize you as a Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment. If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user ajacob" as the summary for the issue. If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know.
❗ This change is not yet ready to be integrated.
Here is a log sample that shows how it behaves when the bug occurs. Logs starting with …

Both JVMs were started for ~20 minutes:
- jconsole (reproducing the bug): started to misbehave at ~315.181s
- jconsole (with the fix from the PR)
Force-pushed from eb9e415 to 9817ad3
@ajacob Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See the OpenJDK Developers' Guide for more information.
Converted to draft: I would like to change it to ensure we log before calling …
Performing more tests on this (different configuration, different GC, ...), I noticed that I had a race condition when multiple threads enter the … The impact of the race condition was:

Possible in the following conditions:

In order to avoid that, I propose to simply ensure we don't have multiple threads performing the checks in …
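The single-checker idea above can be sketched with a compare-and-swap guard. This is a minimal illustration, not the PR's actual code; the names `checking` and `try_schedule_cleanup` are invented for the example (HotSpot would use its own `Atomic` primitives rather than `std::atomic`):

```cpp
#include <atomic>

// Hypothetical sketch: guard the threshold checks so that only one
// thread at a time evaluates whether a code-cache GC should be
// requested. Threads that lose the CAS simply skip the checks.
static std::atomic<bool> checking{false};
static int gc_requests = 0;

bool try_schedule_cleanup() {
  bool expected = false;
  // Only the thread that wins this CAS performs the checks.
  if (!checking.compare_exchange_strong(expected, true)) {
    return false;  // another thread is already checking
  }
  ++gc_requests;  // stand-in for the real threshold checks + GC request
  checking.store(false);
  return true;
}
```

With this guard, concurrent callers cannot both observe the "no GC requested yet" state and trigger duplicate requests.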
/label hotspot-gc
@dean-long
@ajacob This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
@ajacob This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open command.
/open |
@ajacob This pull request is now open |
/signed |
You are already a known contributor! |
I have a question regarding the existing code/logic.
Why make sure only one thread calls …? Would removing …
For ParallelGC, …
However, the current logic that a young GC can cancel a full GC (…)
@ajacob This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a …
It does, at the cost of many log messages:
As you can see this is very annoying, particularly if the marking takes seconds all the while compiling is in progress.
That's a different issue.
Actually, most likely this is the issue for Parallel GC; that code is present only in older JDK versions before 25 (however, other reasons like the …). The situation for Parallel GC is different for earlier versions, i.e. for backporting: it would require the changes for JDK-8192647 and at least one other fix. There needs to be a cost/benefit analysis; these are rather intrusive changes.
I had a spin at the (imo correct) fix for option 2 (fix G1). Here's a diff: https://github.com/openjdk/jdk/compare/master...tschatzl:jdk:submit/8350621-code-cache-mgmt-hang?expand=1 What do you think? Thanks,
Hello, I'm sorry I didn't get back to you sooner on this PR. Indeed, I considered the first option (do not try to prevent calls to …). @tschatzl I like your proposal of fixing the GC implementation directly; as mentioned in my PR description, it was my favorite option, but because I found that this bug existed for at least Parallel GC and G1, I wanted to have something in CodeCache directly to ensure we never have an issue related to a GC implementation.
No worries, it should rather be me apologizing for not getting to this earlier...
First, I assume you verified my change ;) How do we proceed from here? Do you want to reuse this PR or should we (I, you?) open a new one for the new suggestion? What do you prefer? I am fine with either option. Thanks, |
We discussed this question internally a bit, and the consensus has been to use a new PR to avoid confusion due to two different approaches being discussed in the same thread. Would you mind closing this one out, and I'll create a new PR? Thanks,
Sure I can close this PR, this makes things simpler for everybody I guess! |
Thank you. |
The purpose of this PR is to fix a bug where we can end up in a situation where the GC is no longer scheduled by `CodeCache`.

This situation is possible because the `_unloading_threshold_gc_requested` flag is set to `true` when triggering the GC, and we expect the GC to call `CodeCache::on_gc_marking_cycle_finish`, which in turn calls `CodeCache::update_cold_gc_count`, which resets the `_unloading_threshold_gc_requested` flag, allowing further GC scheduling. Unfortunately this can't work properly under certain circumstances.
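The handshake described above can be modeled in a few lines. This is a simplified, single-threaded sketch with hypothetical names (the real code lives in HotSpot's `CodeCache`); it shows how the flag stays stuck forever once the GC request is dropped without the marking-cycle callback running:

```cpp
// Minimal model of the request/reset handshake: CodeCache sets a
// "GC requested" flag and relies on the GC clearing it at the end of
// a marking cycle. If the collect() call turns out to be a no-op
// (e.g. a GC is already in progress), the flag is never cleared and
// no further GC is ever requested.
struct Model {
  bool unloading_threshold_gc_requested = false;
  bool gc_will_actually_run = true;  // simulate collect() being dropped

  bool request_gc() {
    if (unloading_threshold_gc_requested) return false;  // already pending
    unloading_threshold_gc_requested = true;
    if (gc_will_actually_run) on_gc_marking_cycle_finish();
    return true;
  }

  // In HotSpot this happens via CodeCache::on_gc_marking_cycle_finish()
  // -> CodeCache::update_cold_gc_count(), which resets the flag.
  void on_gc_marking_cycle_finish() {
    unloading_threshold_gc_requested = false;
  }
};
```

When `gc_will_actually_run` is false, the first request succeeds but is never acknowledged, and every later request is refused: exactly the hang this PR describes.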
For example, with G1GC, calling `G1CollectedHeap::collect` does not guarantee that the GC will actually run, as it can already be running (see here).

I have observed this behavior on JVMs running version 21 that were recently migrated from Java 17. Those JVMs have some pressure on the code cache and a quite large heap compared to the allocation rate, which means objects are mostly collected by young collections and full GCs take a long time to happen.

I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GCs can be impacted as well.
To reproduce this issue, I found a very simple and convenient way. Run this simple app with the following JVM flags:
- `ReservedCodeCacheSize` to put pressure on the code cache quickly
- `StartAggressiveSweepingAt` can be set to 20 or 15 for faster bug reproduction

By itself, the program will hardly put pressure on the code cache, but the good news is that it is sufficient to attach a jconsole to it, which will:
Some logs related to code cache will show up at some point with GC activity:
And then it will stop, and we'll end up with the following message, leaving the JVM in an unstable situation.
I considered a few different options before making this change:
1. Call `Universe::heap()->collect(...)` without making any check (the GC impl should handle the situation)
2. Make sure `_unloading_threshold_gc_requested` gets back to `false` at some point (probably what is supposed to happen today)
3. Change `CollectedHeap::collect` to return a `bool` instead of `void` to indicate whether the GC was run or scheduled

But I discarded them:
- … `CodeCache` that would let a GC implementation just reset the flag in case the GC will not actually run, for example (to be discussed)
- … `bool`, but this bool is `true` even when the GC is not run.
even when the GC is not run.As a result, I decided to simply add a way for
CodeCache
to recover from this situation. The idea is to let the GC code as-is but keep in memory the time of the last GC request and reset the flag tofalse
if it was not reset in a certain amount of time (250ms in my PR). This should only be helpful in corner cases where the GC impl has not reset the flag by itself.Among the advantages of this solution: it gives a security to recover from a situation that may be created by changes in GC implementation, because someone forgot to take care about code cache.
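The timeout-based recovery can be sketched as follows. Again this is an illustrative model with invented names and an explicit clock parameter, not the PR's actual implementation; the 250ms constant is the value mentioned above:

```cpp
#include <cstdint>

// Sketch of the recovery idea: remember when the GC was requested, and
// let a later caller clear a request that the GC never acknowledged
// within a timeout (250ms in the PR), so scheduling can resume even if
// the GC implementation dropped the request.
struct RecoveringModel {
  bool gc_requested = false;
  uint64_t request_time_ms = 0;
  static constexpr uint64_t kTimeoutMs = 250;

  // now_ms is passed in explicitly to keep the sketch deterministic.
  bool try_request(uint64_t now_ms) {
    if (gc_requested && now_ms - request_time_ms > kTimeoutMs) {
      gc_requested = false;  // assume the GC dropped our request; recover
    }
    if (gc_requested) return false;  // request still pending
    gc_requested = true;
    request_time_ms = now_ms;
    return true;
  }
};
```

A request made at t=0 that is never acknowledged blocks further requests at t=100, but a request at t=300 succeeds because the stale flag is cleared first. The trade-off is that a genuinely slow GC cycle could have its flag cleared early, which is why the flag reset is only meant as a corner-case safety net.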
I spent a lot of time investigating this issue and exploring solutions, and I'm willing to take any input on it, as this is my first PR on the project.
Progress
Issue
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23656/head:pull/23656
$ git checkout pull/23656
Update a local copy of the PR:
$ git checkout pull/23656
$ git pull https://git.openjdk.org/jdk.git pull/23656/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 23656
View PR using the GUI difftool:
$ git pr show -t 23656
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23656.diff
Using Webrev
Link to Webrev