-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Feature: Add support for GPU with KVM hosts #11143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enables GPU support for KVM hosts by updating both backend utilities and the compute offering UI to attach and configure GPU cards and vGPU profiles.
- Removed a stray comment in the script utility header.
- Updated
AddComputeOffering.vue
to let users select GPU cards, vGPU profiles, GPU count, and display options.
Reviewed Changes
Copilot reviewed 153 out of 153 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
utils/src/main/java/com/cloud/utils/script/Script.java | Removed an extraneous comment line at the top of the file. |
ui/src/views/offering/AddComputeOffering.vue | Renamed form fields for GPU card and profile selection, added count/display controls and data-fetch methods. |
Comments suppressed due to low confidence (1)
ui/src/views/offering/AddComputeOffering.vue:262
- The form field name 'vgpuprofile' may conflict with the API parameter 'vgpuprofileid'. Consider renaming it to 'vgpuprofileid' to maintain consistency and avoid confusion when mapping form values to request parameters.
<a-form-item name="vgpuprofile" ref="vgpuprofile" :label="$t('label.vgpu.profile')" v-if="!isSystem && form.gpucardid && vgpuProfiles.length > 0">
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #11143 +/- ##
============================================
+ Coverage 16.57% 16.66% +0.08%
- Complexity 13988 14123 +135
============================================
Files 5745 5782 +37
Lines 510847 514381 +3534
Branches 62140 62572 +432
============================================
+ Hits 84696 85702 +1006
- Misses 416677 419123 +2446
- Partials 9474 9556 +82
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Great initiative @vishesh92; do you have any spec or documentation about it? |
I am still working on it. |
@blueorangutan package |
Just to clarify, you have the spec/documentation and are working on the PR or you still do not have it and will create it? |
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 14034 |
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch. |
I am working on the docs PR. I have added the spec here: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Support+for+GPU+with+KVM+hosts |
@blueorangutan package |
@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 14066 |
@blueorangutan test matrix |
@vishesh92 a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2 ) has been kicked to run smoke tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly reviewed before submission, but still some remarks.
@@ -40,19 +40,19 @@ public class VGPUTypesVO implements InternalIdentity { | |||
private String vgpuType; | |||
|
|||
@Column(name="video_ram") | |||
private long videoRam; | |||
private Long videoRam; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why should these no longer be basic types?
public interface HostGpuGroupsDao extends GenericDao<HostGpuGroupsVO, Long> { | ||
|
||
/** | ||
* Find host device by hostId and groupName | ||
* @param hostId the host | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
javadoc is ignoring this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
none of the changes in this file are of consequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
none of the changes in this file are of consequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
none of the changes in this file are of consequence.
GetGPUStatsAnswer gpuStatsAnswer = (GetGPUStatsAnswer) answer; | ||
HashMap<String, HashMap<String, VgpuTypesInfo>> groupDetails; | ||
gpuService.addGpuDevicesToHost(host, gpuStatsAnswer.getGpuDevices()); | ||
if (CollectionUtils.isNotEmpty(gpuStatsAnswer.getGpuDevices())) { | ||
groupDetails = gpuService.getGpuGroupDetailsFromGpuDevicesOnHost(host); | ||
} else { | ||
groupDetails = gpuStatsAnswer.getGroupDetails(); | ||
} | ||
|
||
return groupDetails; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return getGroupDetails()
?
if (newGpu - currentGpu > 0) { | ||
incrementResourceCountWithTag(accountId, ResourceType.gpu, tag, newGpu - currentGpu); | ||
} else if (newGpu - currentGpu < 0) { | ||
decrementResourceCountWithTag(accountId, ResourceType.gpu, tag, currentGpu - newGpu); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adjustResourceCount()
?
ServiceOffering serviceOffering = vmProfile.getServiceOffering(); | ||
if (serviceOffering.getVgpuProfileId() != null) { | ||
VgpuProfileVO vgpuProfile = vgpuProfileDao.findById(serviceOffering.getVgpuProfileId()); | ||
if (vgpuProfile == null || "passthrough".equals(vgpuProfile.getName())) { | ||
throw new InvalidParameterValueException("Unsupported operation, VM uses host passthrough, cannot migrate"); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idemDito(..)
Long currentGpu = currentServiceOffering.getGpuCount() != null ? Long.valueOf(currentServiceOffering.getGpuCount()) : 0L; | ||
Long newGpu = svcOffering.getGpuCount() != null ? Long.valueOf(svcOffering.getGpuCount()) : 0L; | ||
if (newGpu > currentGpu) { | ||
_resourceLimitMgr.checkVmGpuResourceLimit(owner, vmInstance.isDisplay(), svcOffering, template, newGpu - currentGpu); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
method
Long currentGpu = currentServiceOffering.getGpuCount() != null ? Long.valueOf(currentServiceOffering.getGpuCount()) : 0L; | ||
Long newGpu = svcOffering.getGpuCount() != null ? Long.valueOf(svcOffering.getGpuCount()) : 0L; | ||
if (newGpu > currentGpu) { | ||
_resourceLimitMgr.incrementVmGpuResourceCount(owner.getAccountId(), vmInstance.isDisplay(), svcOffering, template, newGpu - currentGpu); | ||
} else if (newGpu > 0 && currentGpu > newGpu){ | ||
_resourceLimitMgr.decrementVmGpuResourceCount(owner.getAccountId(), vmInstance.isDisplay(), svcOffering, template, currentGpu - newGpu); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
method
Due to lack of DC-grade GPU hardware we wouldn't be able to fully test this. However, we've tested this against consumer-grade RTX card for basic full GPU-passthrough use-case. I propose we review and assess based on regression testing and ship as technical preview in 4.21 (if it makes it). And, hopefully between now and ACS 4.22 - this could find more interest and testing by the wider community and users who have access to datacenter-grade GPU cards. |
Description
This PR allows attaching of GPU devices via PCI, mdev or VF to an Instance for KVM.
CWiki Design doc: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Support+for+GPU+with+KVM+hosts
Doc PR: apache/cloudstack-documentation#526
Generated summary
This pull request introduces several changes across multiple files, focusing on enhancing GPU-related functionality, adding new properties for VM hooks, and updating resource management capabilities. The most significant updates include the addition of GPU properties and event types, the introduction of new VM shell script properties, and modifications to resource limits and types to support GPU devices.
GPU-related enhancements:
api/src/main/java/com/cloud/agent/api/VgpuTypesInfo.java
: Added new fields such asdeviceType
,busAddress
,vendorId
, andvmName
to support detailed GPU device information. Also included getter and setter methods for these fields and updated constructors to accommodate the new properties. [1] [2] [3]api/src/main/java/com/cloud/agent/api/to/GPUDeviceTO.java
: Introduced new fields likegpuCount
andgpuDevices
to manage GPU device details and added corresponding getter/setter methods. Updated constructors to handle the new fields. [1] [2] [3]api/src/main/java/com/cloud/event/EventTypes.java
: Added new GPU-related event types (EVENT_GPU_CARD_CREATE
,EVENT_VGPU_PROFILE_CREATE
, etc.) and mapped them to corresponding entities such asGpuCard
andVgpuProfile
. [1] [2]VM hook properties:
agent/src/main/java/com/cloud/agent/properties/AgentProperties.java
: Added new shell script properties (AGENT_HOOKS_LIBVIRT_VM_XML_TRANSFORMER_SHELL_SCRIPT
,AGENT_HOOKS_LIBVIRT_VM_ON_START_SHELL_SCRIPT
, etc.) for VM lifecycle hooks, enabling execution of shell scripts for VM state changes. [1] [2] [3]Resource management updates:
api/src/main/java/com/cloud/capacity/Capacity.java
: Updated GPU capacity type ID from19
to11
.api/src/main/java/com/cloud/configuration/Resource.java
: Added a new resource type for GPUs (gpu
).api/src/main/java/com/cloud/user/ResourceLimitService.java
: Introduced new configuration keys for GPU limits at the account, domain, and project levels (DefaultMaxAccountGpus
,DefaultMaxDomainGpus
, etc.). Added methods to check, increment, and decrement GPU resource limits. [1] [2]Miscellaneous updates:
.github/workflows/ci.yml
: Added a new smoke test for deploying VMs with vGPU enabled (smoke/test_deploy_vgpu_enabled_vm
).api/src/main/java/org/apache/cloudstack/api/ApiConstants.java
: Added constants for GPU-related attributes such asBUS_ADDRESS
andDEVICE_NAME
. [1] [2]Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Screenshots (if appropriate):
How Has This Been Tested?
This was tested locally on my laptop with passthrough of a consumer graphics card. Due to unavailability of actual hardware, I wasn't able to test with vGPU profiles or mdev.
How did you try to break this feature and the system with this change?