Skip to content

org.elasticsearch.xpack.ilm.PolicyStepsRegistry Uses Not-Threadsafe Maps Concurently #78440

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
original-brownbear opened this issue Sep 29, 2021 · 3 comments
Labels
>bug :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team

Comments

@original-brownbear
Copy link
Member

original-brownbear commented Sep 29, 2021

Straight forward bug:

These 3 maps:

    private final SortedMap<String, LifecyclePolicyMetadata> lifecyclePolicyMap;
    // keeps track of what the first step in a policy is, the key is policy name
    private final Map<String, Step> firstStepMap;
    // keeps track of a mapping from policy/step-name to respective Step, the key is policy name
    private final Map<String, Map<Step.StepKey, Step>> stepMap;

are all plain hash or tree maps and get updated on every cluster state update on the cluster state applier thread. but are accessed from at least the CS threads and the ILM scheduler threads concurrently.

We need to do something to synchronize them correctly or else at least the period execution will break in subtle ways here and there.

@original-brownbear original-brownbear added >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Sep 29, 2021
@danhermann danhermann added :Data Management/ILM+SLM Index and Snapshot lifecycle management and removed :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Sep 29, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Sep 29, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@nielsbauman
Copy link
Contributor

I just learned this issue exists. We already encountered a bug related to this, which was fixed in #126840. I'm not sure how many real bugs can still be caused by this, because ILM is quite fault tolarent nowadays - meaning we may fail once because of a mismatch in the maps, but a following run will pass when the maps are aligned again. But I agree that doing some kind of synchronization and/or atomic updates to the maps here is valuable to rule out any issues.

@original-brownbear
Copy link
Member Author

++ yes please :) it's not like you're just gonna run into an outdated view of these maps here iether, NPE and friends might also show up at some rate. Generally, for this kind of thing the copy-on-write pattern tends to be a a good fit but I don't remember all the details here 4 years later :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team
Projects
None yet
Development

No branches or pull requests

5 participants