Proposal for Karpenter integration #439
Is this ready for review, @carlory?
No, it needs more information. Once it is ready, I will drop the WIP from the PR's title and ping you again.
cc @jwcesign, would you like to take a look as well? I would like to have more input from the Karpenter side. Thanks.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: karpenter.k8s.aws/instance-gpu-name
              operator: In
              values: ["t4g"]
        - matchExpressions:
            - key: karpenter.k8s.aws/instance-gpu-name
              operator: In
              values: ["t4"]
In the user stories, it mentions:
Instead of locking my manifests to a single GPU type, I want to express a preference-ordered list of compatible GPU types (e.g., prefer A100, fall back to A10 or L4).
How can we control the preference order of multiple GPU types? We cannot control the order within nodeSelectorTerms: the terms are ORed, so both t4g and t4 nodes are eligible to be selected.
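For comparison, stock Kubernetes expresses an ordered preference with weighted `preferredDuringSchedulingIgnoredDuringExecution` terms rather than with the order of `nodeSelectorTerms`. The manifest below is only a sketch of that mechanism, not something the proposal uses: it reuses the label key and values from the snippet above, while the Pod name, the image, and the weights are illustrative assumptions. Note that preferences are soft, so a Pod with only these terms may still land on a node that matches neither.

```yaml
# Hypothetical Pod; the name and image are placeholders, not taken from the proposal.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-preference-demo
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
  affinity:
    nodeAffinity:
      # Weighted preferences: matching nodes receive a higher score, so a
      # larger weight (allowed range 1-100) expresses a stronger preference.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100            # assumed weight: most preferred GPU type
          preference:
            matchExpressions:
              - key: karpenter.k8s.aws/instance-gpu-name
                operator: In
                values: ["t4g"]
        - weight: 50             # assumed weight: fallback GPU type
          preference:
            matchExpressions:
              - key: karpenter.k8s.aws/instance-gpu-name
                operator: In
                values: ["t4"]
```

Whether soft preferences like these are acceptable depends on how strictly the workload must stay on the listed GPU types; combining them with a required term that lists both values would keep the hard constraint.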
Please describe how this differs from Karpenter's handling of nodeSelectorTerms. If we have links, that would be great.
Added a link to explain it.
Why does this work for llmaz's resource fungibility? Please refer to the Karpenter Scheduling documentation for more details.
Thanks for your explanation.
I will take a look this weekend.
Thanks, @carlory. This is exciting to me!
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
Could you also provide a comparison with an implementation based on multiple NodePools?
Will add it later.
Signed-off-by: carlory <[email protected]>
FYI: InftyAI/karpenter#2
Let's add a workflow that explains how Karpenter and llmaz work together.
What this PR does / why we need it
Support scaling with Spot instances through Karpenter to save cost.
Which issue(s) this PR fixes
xref #106
Special notes for your reviewer
Does this PR introduce a user-facing change?