[release-4.20] OCPNODE-3722: Enforce OCP 4.20 and earlier cluster to have AutoSizingReserved disabled by default #5387
base: release-4.20
Conversation
Skipping CI for Draft Pull Request.
(force-pushed: 1229ca6 to 69555b0)
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: ngopalak-redhat. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing …
(force-pushed: 2507d91 to e8e4d53)
/test all
(force-pushed: e8e4d53 to c9f9e79)
@ngopalak-redhat: This pull request references OCPNODE-3718, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.20.z" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test unit
@ngopalak-redhat: This pull request references OCPNODE-3722, which is a valid Jira issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test bootstrap-unit
(force-pushed: a958166 to b12addd)
/test all
@haircommander @sairameshv Can you please review?
// configureAutoSizingMachineConfig creates an auto-sizing MachineConfig for a given pool if it doesn't exist
func (ctrl *Controller) configureAutoSizingMachineConfig(pool *mcfgv1.MachineConfigPool) error {
	autoSizingKey := fmt.Sprintf("01-%s-auto-sizing-disabled", pool.Name)
we may want the prio to be higher. maybe 50?
also: can we make a function of this? it's used twice
Thank you for the suggestion about increasing the priority to 50! I'm curious to understand the reasoning behind this - could you help me understand why priority 50 would be more appropriate than the current "01-" prefix? I want to make sure I'm following the right convention for this use case. Is there a specific ordering requirement or interaction with other MachineConfigs that I should be aware of?
Also, I've refactored the repeated code by extracting the Ignition config creation logic into a helper function, createAutoSizingIgnConfig().
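To illustrate the refactor being discussed, here is a minimal, self-contained sketch of extracting the shared Ignition-payload construction into one helper. The types and constant here are simplified stand-ins (the real code uses the MCO's Ignition helpers), so treat names and shapes as assumptions, not the repository's actual implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// defaultAutoSizingEnvContent mirrors the env-file content described in the PR.
// The constant name is illustrative, not the repository's actual identifier.
const defaultAutoSizingEnvContent = "NODE_SIZING_ENABLED=false\n"

// ignFile is a simplified stand-in for an Ignition storage file entry.
type ignFile struct {
	Path     string `json:"path"`
	Contents string `json:"contents"`
}

// createAutoSizingIgnConfig builds the shared Ignition payload once, so both
// call sites (bootstrap rendering and controller sync) can reuse it instead
// of duplicating the file definition.
func createAutoSizingIgnConfig() ([]byte, error) {
	var cfg struct {
		Storage struct {
			Files []ignFile `json:"files"`
		} `json:"storage"`
	}
	cfg.Storage.Files = []ignFile{{
		Path:     "/etc/node-sizing-enabled.env",
		Contents: defaultAutoSizingEnvContent,
	}}
	return json.Marshal(cfg)
}

func main() {
	raw, err := createAutoSizingIgnConfig()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```

Both callers then marshal the same payload, so a future change to the env file contents happens in exactly one place.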
Run Levels 0 and 1 are usually reserved.
We also need to test the bootstrap phase - installing 4.20 with a kubelet config at install - then upgrading to 4.21.
Thanks @rphillips. I can choose the middle ground of "50-". I tried searching for docs that give recommendations along these lines, but couldn't find any.
For testing the bootstrap phase, I have included the test results in the 4.21 PR: #5390
// newAutoSizingMachineConfig creates an empty auto-sizing MachineConfig for a given pool
func newAutoSizingMachineConfig(pool *mcfgv1.MachineConfigPool) (*mcfgv1.MachineConfig, error) {
I may be missing it: I'm not seeing where we check if the user already set auto node sizing and skip making this MC if they did.
I've added comprehensive E2E tests in openshift/origin#30467 (Will submit for review)
Manual Testing on 4.20 Cluster:
- Verified MachineConfigs were created:
$ oc get mc -o jsonpath='{.items[?(@.metadata.annotations.openshift-patch-reference == "added auto-sizing MachineConfig to disable node sizing")].metadata.name}'
01-master-auto-sizing-disabled 01-worker-auto-sizing-disabled
- Verified default state (auto-sizing disabled):
$ oc debug node/ip-10-0-45-86.ec2.internal
sh-5.1# chroot /host
sh-5.1# cat /etc/node-sizing-enabled.env
NODE_SIZING_ENABLED=false
SYSTEM_RESERVED_MEMORY=1Gi
SYSTEM_RESERVED_CPU=500m
SYSTEM_RESERVED_ES=1Gi
- Applied KubeletConfig to enable auto-sizing:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: 80-worker-auto
spec:
  autoSizingReserved: true
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
- Verified user configuration overrides the default (after node restart):
$ oc debug node/ip-10-0-45-86.ec2.internal
sh-5.1# chroot /host
sh-5.1# cat /etc/node-sizing-enabled.env
NODE_SIZING_ENABLED=true
SYSTEM_RESERVED_MEMORY=1Gi
SYSTEM_RESERVED_CPU=500m
SYSTEM_RESERVED_ES=1Gi
This confirms that:
- The default MachineConfig correctly disables auto-sizing for 4.20 clusters
- User-defined KubeletConfigs successfully override the default setting
- The precedence order works as expected
@ngopalak-redhat: The following test failed.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
sairameshv left a comment:
I agree with the pattern you followed to add a default behavior via the creation of a MachineConfig object, but I would strongly recommend moving this functionality to the KubeletConfigController instead of the KubeletNodeConfigController.
Although KubeletNodeConfigController deals with a few of the upstream kubelet config updates, it is not an ideal place for functionality related to the downstream KubeletConfig API, i.e. the AutoSizingReserved field.
// createAutoSizingIgnConfig creates the Ignition config for auto-sizing disabled
func createAutoSizingIgnConfig() ([]byte, error) {
	autoSizingFile := ctrlcommon.NewIgnFileBytes("/etc/node-sizing-enabled.env", []byte(DefaultAutoSizingEnvContent))
Could you define the file "/etc/node-sizing-enabled.env" as a constant?
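A minimal sketch of the constants the review asks for, covering both the env-file path and the MachineConfig name prefix mentioned elsewhere in this thread. The identifier names below are hypothetical; only the string values come from the PR:

```go
package main

import "fmt"

// Hypothetical constant names; the PR's actual identifiers may differ.
const (
	// nodeSizingEnvFile is the path written by the auto-sizing MachineConfig.
	nodeSizingEnvFile = "/etc/node-sizing-enabled.env"
	// autoSizingMCNameFmt is the naming pattern for the per-pool MachineConfig.
	autoSizingMCNameFmt = "01-%s-auto-sizing-disabled"
)

// autoSizingMCName derives the per-pool MachineConfig name from the pattern,
// so every call site agrees on the object name.
func autoSizingMCName(pool string) string {
	return fmt.Sprintf(autoSizingMCNameFmt, pool)
}

func main() {
	fmt.Println(nodeSizingEnvFile)
	fmt.Println(autoSizingMCName("worker"))
}
```

Centralizing both strings means a future priority change (e.g. "01-" to "50-") touches one line instead of several scattered format strings.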
ctrl.nodeConfigQueue.AddAfter(key, 1*time.Minute)
}

// createAutoSizingIgnConfig creates the Ignition config for auto-sizing disabled
Suggested change:
- // createAutoSizingIgnConfig creates the Ignition config for auto-sizing disabled
+ // createAutoSizingIgnConfig creates the Ignition config that would be written to a file containing the environment variables to enable/disable the `AutoSizingReserved` field of the KubeletConfig
// newAutoSizingMachineConfig creates an empty auto-sizing MachineConfig for a given pool
func newAutoSizingMachineConfig(pool *mcfgv1.MachineConfigPool) (*mcfgv1.MachineConfig, error) {
	autoSizingDisabledName := fmt.Sprintf("01-%s-auto-sizing-disabled", pool.Name)
nit: Just to improve the code readability
Suggested change:
- autoSizingDisabledName := fmt.Sprintf("01-%s-auto-sizing-disabled", pool.Name)
+ autoSizingDisabledMCName := fmt.Sprintf("01-%s-auto-sizing-disabled", pool.Name)
return rawAutoSizingIgn, nil
}

// newAutoSizingMachineConfig creates an empty auto-sizing MachineConfig for a given pool
It doesn't create an empty MachineConfig, does it?
// newAutoSizingMachineConfig creates an empty auto-sizing MachineConfig for a given pool
func newAutoSizingMachineConfig(pool *mcfgv1.MachineConfigPool) (*mcfgv1.MachineConfig, error) {
	autoSizingDisabledName := fmt.Sprintf("01-%s-auto-sizing-disabled", pool.Name)
	ignConfig := ctrlcommon.NewIgnConfig()
This is not required, as we create an Ignition config using createAutoSizingIgnConfig() anyway and populate the MachineConfig with it.
Ref: https://github.com/openshift/machine-config-operator/pull/5387/files#diff-f37515e1ebe0812972d628b0f81dd29fc25fd8289fbef268b1391a8a3273e0beR89
}

// configureAutoSizingMachineConfig creates an auto-sizing MachineConfig for a given pool if it doesn't exist
func (ctrl *Controller) configureAutoSizingMachineConfig(pool *mcfgv1.MachineConfigPool) error {
We aren't configuring any MachineConfig here. AFAIU, this function handles the actual creation of the MC object, while newAutoSizingMachineConfig is the constructor that builds the MachineConfig object.
Could you update the method names/comments accordingly?
// configureAutoSizingMachineConfig creates an auto-sizing MachineConfig for a given pool if it doesn't exist
func (ctrl *Controller) configureAutoSizingMachineConfig(pool *mcfgv1.MachineConfigPool) error {
	autoSizingKey := fmt.Sprintf("01-%s-auto-sizing-disabled", pool.Name)
Better to define the name/prefix of the MC object as a constant
for _, pool := range mcpPools {
	// First, create the auto-sizing MachineConfig for this pool if it doesn't exist
	if err := ctrl.configureAutoSizingMachineConfig(pool); err != nil {
		return err
I know that syncNodeConfigHandler will always get hit, since there is always a nodes.config object in the cluster.
IMHO, the creation of this MC is not related to the nodes.config.openshift.io object in any way, so this is not the correct place for this operation.
For example, encountering an issue creating an Ignition file or a MachineConfig should not interrupt the functionality of the nodes.config API, i.e. updating the workerLatencyProfiles, the minKubeletVersion, etc.
The only place we should deal with this is the kubeletConfigController. We can have an extra sync/work item which is always hit, without depending on the KubeletConfig object's presence.
Could you explore that direction?
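The reviewer's suggestion amounts to enqueueing a sentinel work item that the kubeletconfig controller always processes, independent of any KubeletConfig object. A minimal, dependency-free sketch of that dispatch pattern (all names hypothetical; the real controller uses a client-go workqueue):

```go
package main

import "fmt"

// defaultsSyncKey is a sentinel work item (hypothetical name) enqueued
// unconditionally at controller startup, so the default auto-sizing
// MachineConfigs are reconciled even when no KubeletConfig object exists.
const defaultsSyncKey = "default-auto-sizing"

// controller models a queue-driven controller with a plain slice standing in
// for a client-go workqueue.
type controller struct {
	queue []string
}

// enqueueDefaults adds the standing work item; in the real controller this
// would run once at startup and possibly on a resync timer.
func (c *controller) enqueueDefaults() {
	c.queue = append(c.queue, defaultsSyncKey)
}

// processNextItem pops one key and dispatches: the sentinel key triggers the
// defaults sync, any other key follows the normal KubeletConfig sync path.
func (c *controller) processNextItem() string {
	key := c.queue[0]
	c.queue = c.queue[1:]
	if key == defaultsSyncKey {
		return "syncAutoSizingDefaults"
	}
	return "syncKubeletConfig"
}

func main() {
	c := &controller{}
	c.enqueueDefaults()
	c.queue = append(c.queue, "cluster/80-worker-auto")
	fmt.Println(c.processNextItem())
	fmt.Println(c.processNextItem())
}
```

This keeps a failure in the defaults path isolated from nodes.config handling, which is the reviewer's main concern.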
// Create auto-sizing MachineConfigs for each pool
for _, pool := range mcpPools {
	autoSizingMC, err := newAutoSizingMachineConfig(pool)
This is not the place per this
require.Len(t, mcList.Items, 1)
require.NotEqual(t, nodeKeyCustom, mcList.Items[0].Name)
// Now expecting 2 MachineConfigs:
// 1. Auto-sizing MC for worker (01-worker-auto-sizing-disabled)
This is not a nodes.config feature and we shouldn't be expecting an additional MC object here
Ref: https://github.com/openshift/machine-config-operator/pull/5387/files#r2517233985
Fixes: #OCPNODE-3722
- What I did
This patch introduces the 01-master-auto-sizing-disabled MachineConfig to OpenShift 4.20 clusters, setting the NODE_SIZING_ENABLED flag to false by default on master and worker nodes. This change is required because auto sizing becomes enabled by default for clusters created with 4.21 and above.
Summary of changes
Enforce Default Autosizing: Ensures that clusters created in 4.20 will retain the pre-4.21 behavior of having auto node sizing disabled by default.
Upgrade Pre-requisite: This patch is a mandatory requirement for upgrading 4.20 clusters to 4.21. Changes to Cincinnati (OCPNODE-3722: Set minimum version of 4.20 required to upgrade to 4.21 cincinnati-graph-data#8277) will enforce that this patch must be present before the upgrade path to 4.21 is started.
User Override (Priority): The MachineConfig uses the prefix 01- to ensure it sets the initial default. If a user has already created a KubeletConfig to explicitly enable autoSizing (as per the KubeletConfig documentation), that explicit user configuration will take precedence (override this default) and will be retained when upgrading to 4.21.
Reference: This change addresses the shift in default behavior introduced in OpenShift 4.21, where NODE_SIZING_ENABLED is set to true for all new clusters: #5390
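For illustration, the per-pool MachineConfig produced by this change would look roughly like the following. This is a sketch only: the Ignition version and the encoded file contents are assumptions, since the actual object is generated by the controller:

```yaml
# Illustrative shape of the generated object; field values such as the
# Ignition version and data URL encoding are assumptions.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 01-worker-auto-sizing-disabled
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.4.0
    storage:
      files:
        - path: /etc/node-sizing-enabled.env
          contents:
            source: data:,NODE_SIZING_ENABLED%3Dfalse
```

Because MachineConfigs merge in lexical order, the 01- prefix ensures this default is applied first and any user-created KubeletConfig-derived MachineConfig overrides it.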
Additional Notes for Developers
The approach taken in this PR is patterned after the change implemented in #4715, which was used to modify the default container runtime.
Rejected Alternatives
We explored several alternative solutions, but they were not feasible:
In-Place Upgrade Handling: We found that direct handling during the 4.21 upgrade was unreliable. After multiple upgrade cycles, there was no consistent mechanism to identify clusters originally provisioned before 4.21.
Changing the Default File: Switching the default configuration file (e.g., away from /etc/node-sizing-enabled) was overly complex, requiring us to manually manage legacy configuration paths for existing clusters.
Installer-Created KubeletConfig: Since OpenShift clusters do not contain a default KubeletConfig resource, one option was to have the installer create it. This was rejected because Hypershift deployments may bypass the standard OCP installer.
Adding a Default KubeletConfig Resource: This approach was dismissed because OpenShift allows only a single KubeletConfig per cluster. Introducing a default resource risks a user's explicit KubeletConfig unintentionally overriding the system default, leading to confusion.
- How to verify it
Verified the patch on a 4.20 cluster: Created a cluster using ClusterBot, applied the patch via oc adm upgrade, confirmed the new MachineConfig was created, and ensured auto node sizing was disabled.
Direct Patch Verification: Created a cluster using ClusterBot with the patch applied and confirmed auto node sizing was disabled.
User Override Test: Created a KubeletConfig to explicitly enable auto sizing and verified that the setting was correctly enabled (overriding the default).
Upgrade Path Validation: Successfully upgraded the patched cluster to 4.21 (using the above referenced 4.21 PR changes). Confirmed that auto node sizing remained disabled for upgraded clusters that had not been explicitly configured otherwise.
- Description for the changelog
Introduces the auto-sizing MachineConfig, ensuring the feature remains disabled by default during upgrades.