Skip to content

Introducing Module-NMC controller #503

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

yevgeny-shnaidman
Copy link
Contributor

This PR introduces a new controller, that is responsible for reconciliation of NMC based on the data in Module and Node The following are the main points:

  1. It watches for changes in Modules, Nodes (labels, becoming
    schedulable) and Jobs belonging owned by Modules (in order to
    reschedule once Build/Sign job are finished)
  2. Per reconciled Module, go over all the nodes in the cluster
    and reconcile their NMCs based on the Module/Node state

Changes in the PR:

  1. adding functions to filter package that are resposible for filtering
    the events (nodes. modules, jobs) and finding out which modules
    should be reconciled
  2. changing the NMC handler logic a little, adding verification for
    container image presence in case module should be enable.
  3. Module-NMC Reconciliation is being done both in case Kernel Mapping
    exists and in case it does not exists ( solves the garbage
    collection flow)

This PR does not include Module status updates based on NMC updates

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yevgeny-shnaidman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 3, 2023
@netlify
Copy link

netlify bot commented Aug 3, 2023

Deploy Preview for kubernetes-sigs-kmm ready!

Name Link
🔨 Latest commit 70b755e
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kmm/deploys/64d0eab8559ebb0008d58675
😎 Deploy Preview https://deploy-preview-503--kubernetes-sigs-kmm.netlify.app/
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 3, 2023
@yevgeny-shnaidman
Copy link
Contributor Author

/assign @ybettan

@yevgeny-shnaidman
Copy link
Contributor Author

/assign @qbarrand

mnc := controllers.NewModuleNMCReconciler(
client,
kernelAPI,
nmcHandler,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this one be renamed to nmcAPI to be consistent with the rest of the API passed to it (or to the Module reconciler)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a controller, we are not creating/returning an Interface ( API), but the struct itself, as the rest of the reconcilers/controllers, that's why there is no API in the name

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nmcHandler is a public interface. Why don't we call it nmcAPI instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved the NMC-Handler API into the controller, since it implements controller logic

"sigs.k8s.io/controller-runtime/pkg/log"
)

//+kubebuilder:rbac:groups="core",resources=nodes,verbs=get;watch;patch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to patch nodes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need, removing

)

//+kubebuilder:rbac:groups="core",resources=nodes,verbs=get;watch;patch
//+kubebuilder:rbac:groups=kmm.sigs.x-k8s.io,resources=nodemodulesconfigs,verbs=get;list;watch;update;patch;create
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between update and patch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that depends on the action that you execute on the object. I prefer to mention both of them, so that we can execute them in the code without a need to update the rules

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the problem with updating the rules? A new version of the code will come with a new bundle anyway. I would restrict the verbs here to the bare minimum.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the "update"

// get the reconciled module
logger := log.FromContext(ctx)

logger.Info("Starting Module-NMS reconcilation", "module name and namespace", req.NamespacedName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who says that the reconciled object is a Module? Can't it be a Node? a Job?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because this is how this reconciler is defined. It is defined For Module, and for every "watch" it finds the appropriate Module to reconcile

logger.Info("Module deleted")
return ctrl.Result{}, nil
}
return ctrl.Result{}, fmt.Errorf("failed to get the requested %s KMMO CR: %w", req.NamespacedName, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return ctrl.Result{}, fmt.Errorf("failed to get the requested %s KMMO CR: %w", req.NamespacedName, err)
return ctrl.Result{}, fmt.Errorf("failed to get the requested %s KMMO CR: %v", req.NamespacedName, err)

Are we using the wrapped error anywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. We do use wrapped errors , when needed. For example in the KernelMapper, to get the appropriate MLD, if exists

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant, "do we use the wrapped error in this case". If you fixed it I assume we were not using it :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, in this case it was a mistake

errs := make([]error, 0, len(nodes))
for _, node := range nodes {
kernelVersion := strings.TrimSuffix(node.Status.NodeInfo.KernelVersion, "+")
mld, err := mnr.kernelAPI.GetModuleLoaderDataForKernel(mod, kernelVersion)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to do that for every node? Can we extract it outside of the loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the kernel version of the nodes may differ

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I don't see node being passed to this function or the implementation depending on the nodes in any way. Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the node is used to determine the kernelVersion , and to determine kernel mapping (using kernel version). Then both kernel mapping(mld) and node are passed to the function that determines whether the module needs to run on the node or not

}

err = mnr.nmcHandler.HandleModuleNMC(ctx, mod, mld, node)
errs = append(errs, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't you appending nil errors here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but errors.Join knows how to handle them and ignore

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't it be clearer to only append real errors instead of delegating it to Join?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why to add another "if", if there is already code in the Join that takes care of it? This is the purpose of Join function: decide of there are errors in the array of errors , and create the appropriate error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why to add another "if", if there is already code in the Join that takes care of it?

Because it is not trivial for someone who didn't read the Join doc to understand those are ignored. Everywhere else in the code we are checking if an error is != nil before doing something with it except here.

If you think it is better this way, let's keep it, I just though it is a bit confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am keeping it: if we have a Golang library function in the code, i would expect that developer to read about it and understand

Comment on lines +97 to +110
getRequestedModule(ctx context.Context, namespacedName types.NamespacedName) (*kmmv1beta1.Module, error)
getNodesList(ctx context.Context) ([]v1.Node, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering why getRequestedModule isn't a method in the module package and why we don't have a node package that will hold getNodeList as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in Slack, i am leaving it here and we can decide on it and the rest of functions in packages in v2.1

}
}

func (mnrh *moduleNMCReconcilerHelper) getRequestedModule(ctx context.Context, namespacedName types.NamespacedName) (*kmmv1beta1.Module, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this function a dup of getRequestedModule in the module_reconciler.go file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in Slack, i am leaving it here and we can decide on it and the rest of functions in packages in v2.1

return &mod, nil
}

func (mnrh *moduleNMCReconcilerHelper) getNodesList(ctx context.Context) ([]v1.Node, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, this function exist (almost) in module_reconciler.go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in Slack, i am leaving it here and we can decide on it and the rest of functions in packages in v2.1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we need this function at all? In tests, we replace one EXPECT() with another - not sure what is the benefit. I guess that remark is also cvalid for getRequestedModule above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to set such function in a helper interface, makes it easier to write the unit-tests for the main reconciliation loop, and to check all the corner case in a dedicated unit-test for that specific function.

@yevgeny-shnaidman yevgeny-shnaidman force-pushed the yevgeny/module-nmc-controller branch from d65402f to b1c21c6 Compare August 6, 2023 12:33
)

//+kubebuilder:rbac:groups="core",resources=nodes,verbs=get;watch;patch
//+kubebuilder:rbac:groups=kmm.sigs.x-k8s.io,resources=nodemodulesconfigs,verbs=get;list;watch;update;patch;create
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the problem with updating the rules? A new version of the code will come with a new bundle anyway. I would restrict the verbs here to the bare minimum.

return &mod, nil
}

func (mnrh *moduleNMCReconcilerHelper) getNodesList(ctx context.Context) ([]v1.Node, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we need this function at all? In tests, we replace one EXPECT() with another - not sure what is the benefit. I guess that remark is also cvalid for getRequestedModule above.

@yevgeny-shnaidman yevgeny-shnaidman force-pushed the yevgeny/module-nmc-controller branch from b1c21c6 to 2b37582 Compare August 7, 2023 09:43
@qbarrand
Copy link
Contributor

qbarrand commented Aug 7, 2023

/test pull-kernel-module-management-unit-tests

@yevgeny-shnaidman
Copy link
Contributor Author

/test pull-kernel-module-management-unit-tests

@yevgeny-shnaidman yevgeny-shnaidman force-pushed the yevgeny/module-nmc-controller branch 2 times, most recently from c6e518a to 09f9ade Compare August 7, 2023 11:37
Comment on lines 232 to 238
Watches(
&batchv1.Job{},
handler.EnqueueRequestForOwner(mgr.GetScheme(), mgr.GetRESTMapper(), &kmmv1beta1.Module{}, handler.OnlyControllerOwner()),
builder.WithPredicates(
filter.ModuleNMCReconcileJobPredicate(),
),
).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use Owns instead?

Suggested change
Watches(
&batchv1.Job{},
handler.EnqueueRequestForOwner(mgr.GetScheme(), mgr.GetRESTMapper(), &kmmv1beta1.Module{}, handler.OnlyControllerOwner()),
builder.WithPredicates(
filter.ModuleNMCReconcileJobPredicate(),
),
).
Owns(
&batchv1.Job{},
builder.WithPredicates(
filter.ModuleNMCReconcileJobPredicate(),
),
).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, Owns defines it own filter and handler. Since i would like to keep the filter, i am still using Watches

@yevgeny-shnaidman yevgeny-shnaidman force-pushed the yevgeny/module-nmc-controller branch 3 times, most recently from ebc5e47 to fcb00c9 Compare August 7, 2023 12:40
@qbarrand
Copy link
Contributor

qbarrand commented Aug 7, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 7, 2023
This PR introduces a new controller, that is responsible
for reconciliation of NMC based on the data in Module and Node
The following are the main points:
1) It watches for changes in Modules, Nodes (labels, becoming
   schedulable) and Jobs belonging owned by Modules (in order to
   reschedule once Build/Sign job are finished)
2) Per reconciled Module, go over all the nodes in the cluster
   and reconcile their NMCs based on the Module/Node state

Changes in the PR:
1) adding functions to filter package that are resposible for filtering
    the events (nodes. modules, jobs) and finding out which modules
    should be reconciled
2)  changing the NMC handler logic a little, adding verification for
    container image presence in case module should be enable.
3)  Module-NMC Reconciliation is being done both in case Kernel Mapping
    exists and in case it does not exists ( solves the garbage
    collection flow)

This PR does not include Module status updates based on NMC updates
@yevgeny-shnaidman yevgeny-shnaidman force-pushed the yevgeny/module-nmc-controller branch from fcb00c9 to 70b755e Compare August 7, 2023 12:59
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 7, 2023
@qbarrand
Copy link
Contributor

qbarrand commented Aug 7, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 7, 2023
@k8s-ci-robot k8s-ci-robot merged commit 9bd2ba4 into kubernetes-sigs:main Aug 7, 2023
qbarrand added a commit to qbarrand/kernel-module-management that referenced this pull request Nov 8, 2023
Build using Go 1.18.

Upstream-Commit: 4c76f4d
qbarrand pushed a commit to qbarrand/kernel-module-management that referenced this pull request Nov 8, 2023
…igs#694)

This PR introduces a new controller, that is responsible
for reconciliation of NMC based on the data in Module and Node
The following are the main points:
1) It watches for changes in Modules, Nodes (labels, becoming
   schedulable) and Jobs belonging owned by Modules (in order to
   reschedule once Build/Sign job are finished)
2) Per reconciled Module, go over all the nodes in the cluster
   and reconcile their NMCs based on the Module/Node state

Changes in the PR:
1) adding functions to filter package that are resposible for filtering
    the events (nodes. modules, jobs) and finding out which modules
    should be reconciled
2)  changing the NMC handler logic a little, adding verification for
    container image presence in case module should be enable.
3)  Module-NMC Reconciliation is being done both in case Kernel Mapping
    exists and in case it does not exists ( solves the garbage
    collection flow)

This PR does not include Module status updates based on NMC updates
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants