Maintenance mode stacking support #3044

Open: wants to merge 7 commits into base: master

LZD-PratyushBhatt

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

(#200 - Link your issue number here: You can write "Fixes #XXX". Please use the proper keyword so that the issue gets closed automatically. See https://docs.github.com/en/github/managing-your-work-on-github/linking-a-pull-request-to-an-issue
Any of the following keywords can be used: close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved)
This PR closes #3041 and adds support for maintenance mode stacking.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

(Write a concise description including what, why, how)
The current implementation of maintenance mode for clusters supports only a single reason at a time, tracked using the simpleFields.REASON key. This restricts the functionality to a single actor and reason, which limits flexibility and coordination.

This proposal extends the maintenance mode design so that multiple actors can independently place a cluster into maintenance mode for different reasons. Each actor can add or remove only its own maintenance reason, and the cluster remains in maintenance mode as long as at least one active reason exists. Each reason carries metadata: the actor, the reason text, and a timestamp. For backward compatibility, the existing simpleFields.REASON is retained and always reflects the most recent active reason; if that reason is removed, it is replaced with the next most recent one. We cannot fully prevent legacy clients from removing the entire znode, so we handle that case gracefully and recommend migrating to the updated API, which supports proper multi-actor maintenance handling.
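
The stacking behavior described above can be sketched in isolation as follows. This is only an illustrative in-memory model, not the code in this PR and not the Helix API: the class `MaintenanceReasonStack`, its methods, and the `Entry` type are hypothetical names, and the sketch assumes each actor owns at most one active reason.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical in-memory model of stacked maintenance reasons (not the Helix API). */
final class MaintenanceReasonStack {

  /** One actor's entry: who requested maintenance, why, and when (epoch millis). */
  static final class Entry {
    final String actor;
    final String reason;
    final long timestampMs;

    Entry(String actor, String reason, long timestampMs) {
      this.actor = actor;
      this.reason = reason;
      this.timestampMs = timestampMs;
    }
  }

  private final List<Entry> entries = new ArrayList<>();

  /** Adds this actor's reason, replacing any reason the same actor already holds. */
  void enter(String actor, String reason, long timestampMs) {
    entries.removeIf(e -> e.actor.equals(actor));
    entries.add(new Entry(actor, reason, timestampMs));
  }

  /** Removes only this actor's reason; returns false if the actor had none. */
  boolean exit(String actor) {
    return entries.removeIf(e -> e.actor.equals(actor));
  }

  /** The cluster stays in maintenance while at least one reason is active. */
  boolean inMaintenance() {
    return !entries.isEmpty();
  }

  /** Value a legacy reader would expect in simpleFields.REASON: the most recent active reason. */
  Optional<String> legacyReason() {
    return entries.stream()
        .max(Comparator.comparingLong((Entry e) -> e.timestampMs))
        .map(e -> e.reason);
  }
}
```

In this model, if actor A enters at time 100 and actor B at time 200, `legacyReason()` yields B's reason; after `exit("B")` it falls back to A's reason, and the cluster leaves maintenance only once A has also exited.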

Tests

  • The following tests are written for this issue:
    testAutomationMaintenanceMode, testRemoveMaintenanceReasonNoDuplicates, testLegacyClientCompatibility, testMaintenanceHistoryAfterOperationFlag, testMultiActorMaintenanceModeExitSequence, testMultiActorMaintenanceModeReconciliation, testMultiActorMaintenanceModeOldClientExit, testMultiActorMaintenanceModeOldClientOverride, testMultiActorMaintenanceModeInvalidExit

(List the names of added unit/integration tests)

  • The following is the result of the "mvn test" command on the appropriate module:

(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@LZD-PratyushBhatt
Author

Hi @junkaixue , @GrantPSpencer , @zpinto , @xyuanlu
Can you please review this PR? It's for supporting maintenance mode (MM) stacking.

@LZD-PratyushBhatt LZD-PratyushBhatt force-pushed the maintenance_mode_stacking branch from 9cb7f35 to b6a081b Compare May 19, 2025 07:47
return false;
}

return signal.hasMaintenanceReasons();
Contributor

This can be dangerous. What if the new version reads an old ZNode?
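
To make the concern concrete, here is a minimal sketch of a check that tolerates an old-format ZNode. It is not the code under review: `MaintenanceSignalCheck`, `isInMaintenance`, and the `reasons` list-field key are hypothetical, and it assumes that a legacy ZNode has no reasons list at all and that its mere existence means the cluster is in maintenance.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a maintenance check that handles old-format ZNodes. */
final class MaintenanceSignalCheck {

  // Hypothetical list-field key; the real key used by the PR is not shown in this thread.
  static final String REASONS_KEY = "reasons";

  /**
   * @param listFields list fields of the maintenance ZNode, or null if the ZNode does not exist
   */
  static boolean isInMaintenance(Map<String, List<String>> listFields) {
    if (listFields == null) {
      return false; // no ZNode: not in maintenance
    }
    List<String> reasons = listFields.get(REASONS_KEY);
    if (reasons == null) {
      // Assumed legacy format: the ZNode exists but predates the reasons list,
      // so existence alone is treated as "in maintenance".
      return true;
    }
    return !reasons.isEmpty(); // new format: at least one active reason
  }
}
```

Returning only "has stacked reasons" would report `false` for the legacy case in the middle branch, which appears to be the risk raised here.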

@junkaixue
Contributor

I highly recommend that a new contributor start by stabilizing the tests instead of touching the very core of the system. That is very, very dangerous: a one-line logging change has taken down an entire server before.

If you still believe your change is solid, we can help review. At the same time, please never lower the bar.
