Feature: Add support for GPU with KVM hosts #11143

vishesh92 · 2025-07-04T09:11:40Z

Description

This PR allows GPU devices to be attached to an Instance on KVM hosts via PCI, mdev, or VF.
CWiki Design doc: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Support+for+GPU+with+KVM+hosts
Doc PR: apache/cloudstack-documentation#526

Generated summary

This pull request introduces several changes across multiple files, focusing on enhancing GPU-related functionality, adding new properties for VM hooks, and updating resource management capabilities. The most significant updates include the addition of GPU properties and event types, the introduction of new VM shell script properties, and modifications to resource limits and types to support GPU devices.

GPU-related enhancements:

api/src/main/java/com/cloud/agent/api/VgpuTypesInfo.java: Added new fields such as deviceType, busAddress, vendorId, and vmName to support detailed GPU device information. Also included getter and setter methods for these fields and updated constructors to accommodate the new properties.
api/src/main/java/com/cloud/agent/api/to/GPUDeviceTO.java: Introduced new fields like gpuCount and gpuDevices to manage GPU device details and added corresponding getter/setter methods. Updated constructors to handle the new fields.
api/src/main/java/com/cloud/event/EventTypes.java: Added new GPU-related event types (EVENT_GPU_CARD_CREATE, EVENT_VGPU_PROFILE_CREATE, etc.) and mapped them to corresponding entities such as GpuCard and VgpuProfile.

VM hook properties:

agent/src/main/java/com/cloud/agent/properties/AgentProperties.java: Added new shell script properties (AGENT_HOOKS_LIBVIRT_VM_XML_TRANSFORMER_SHELL_SCRIPT, AGENT_HOOKS_LIBVIRT_VM_ON_START_SHELL_SCRIPT, etc.) for VM lifecycle hooks, enabling execution of shell scripts for VM state changes.

Resource management updates:

api/src/main/java/com/cloud/capacity/Capacity.java: Updated GPU capacity type ID from 19 to 11.
api/src/main/java/com/cloud/configuration/Resource.java: Added a new resource type for GPUs (gpu).
api/src/main/java/com/cloud/user/ResourceLimitService.java: Introduced new configuration keys for GPU limits at the account, domain, and project levels (DefaultMaxAccountGpus, DefaultMaxDomainGpus, etc.). Added methods to check, increment, and decrement GPU resource limits.

Miscellaneous updates:

.github/workflows/ci.yml: Added a new smoke test for deploying VMs with vGPU enabled (smoke/test_deploy_vgpu_enabled_vm).
api/src/main/java/org/apache/cloudstack/api/ApiConstants.java: Added constants for GPU-related attributes such as BUS_ADDRESS and DEVICE_NAME.

Types of changes

Breaking change (fix or feature that would cause existing functionality to change)
New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Enhancement (improves an existing feature and functionality)
Cleanup (Code refactoring and cleanup, that may add test cases)
build/CI
test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

Major
Minor

Screenshots (if appropriate):

How Has This Been Tested?

This was tested locally on my laptop with passthrough of a consumer graphics card. Due to unavailability of actual hardware, I wasn't able to test with vGPU profiles or mdev.

How did you try to break this feature and the system with this change?

Copilot

Pull Request Overview

This PR enables GPU support for KVM hosts by updating both backend utilities and the compute offering UI to attach and configure GPU cards and vGPU profiles.

Removed a stray comment in the script utility header.
Updated AddComputeOffering.vue to let users select GPU cards, vGPU profiles, GPU count, and display options.

Reviewed Changes

Copilot reviewed 153 out of 153 changed files in this pull request and generated 1 comment.

File	Description
utils/src/main/java/com/cloud/utils/script/Script.java	Removed an extraneous comment line at the top of the file.
ui/src/views/offering/AddComputeOffering.vue	Renamed form fields for GPU card and profile selection, added count/display controls and data-fetch methods.

Comments suppressed due to low confidence (1)

ui/src/views/offering/AddComputeOffering.vue:262

The form field name 'vgpuprofile' may conflict with the API parameter 'vgpuprofileid'. Consider renaming it to 'vgpuprofileid' to maintain consistency and avoid confusion when mapping form values to request parameters.

        <a-form-item name="vgpuprofile" ref="vgpuprofile" :label="$t('label.vgpu.profile')" v-if="!isSystem && form.gpucardid && vgpuProfiles.length > 0">

ui/src/views/offering/AddComputeOffering.vue

codecov · 2025-07-04T09:24:08Z

Codecov Report

Attention: Patch coverage is 32.51613% with 2092 lines in your changes missing coverage. Please review.

Project coverage is 16.66%. Comparing base (8e4fe1c) to head (f20d940).

Files with missing lines	Patch %	Lines
...java/org/apache/cloudstack/gpu/GpuServiceImpl.java	78.13%	109 Missing and 67 partials ⚠️
.../main/java/com/cloud/gpu/dao/GpuDeviceDaoImpl.java	0.00%	147 Missing ⚠️
.../com/cloud/agent/manager/MockAgentManagerImpl.java	0.00%	134 Missing ⚠️
...main/java/com/cloud/simulator/MockGpuDeviceVO.java	0.00%	89 Missing ⚠️
.../cloud/resourcelimit/ResourceLimitManagerImpl.java	3.84%	72 Missing and 3 partials ⚠️
...rc/main/java/com/cloud/gpu/dao/GpuCardDaoImpl.java	0.00%	69 Missing ⚠️
...che/cloudstack/api/response/GpuDeviceResponse.java	31.31%	68 Missing ⚠️
...chema/src/main/java/com/cloud/gpu/GpuDeviceVO.java	31.52%	63 Missing ⚠️
...c/main/java/com/cloud/agent/api/VgpuTypesInfo.java	47.32%	59 Missing ⚠️
...n/java/com/cloud/resource/ResourceManagerImpl.java	0.00%	58 Missing ⚠️
... and 70 more

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #11143      +/-   ##
============================================
+ Coverage     16.57%   16.66%   +0.08%     
- Complexity    13988    14123     +135     
============================================
  Files          5745     5782      +37     
  Lines        510847   514381    +3534     
  Branches      62140    62572     +432     
============================================
+ Hits          84696    85702    +1006     
- Misses       416677   419123    +2446     
- Partials       9474     9556      +82

Flag	Coverage Δ
uitests	`3.85% <ø> (-0.06%)`	⬇️
unittests	`17.57% <32.51%> (+0.09%)`	⬆️


GutoVeronezi · 2025-07-04T11:48:13Z

Great initiative @vishesh92; do you have any spec or documentation about it?

vishesh92 · 2025-07-04T12:02:58Z

> Great initiative @vishesh92; do you have any spec or documentation about it?

I am still working on it.

vishesh92 · 2025-07-04T12:21:00Z

@blueorangutan package

GutoVeronezi · 2025-07-04T12:40:56Z

> > Great initiative @vishesh92; do you have any spec or documentation about it?
>
> I am still working on it.

Just to clarify, you have the spec/documentation and are working on the PR or you still do not have it and will create it?

blueorangutan · 2025-07-04T15:00:53Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 14034

github-actions · 2025-07-04T22:21:52Z

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

vishesh92 · 2025-07-07T07:27:59Z

> > > Great initiative @vishesh92; do you have any spec or documentation about it?
> >
> > I am still working on it.
>
> Just to clarify, you have the spec/documentation and are working on the PR or you still do not have it and will create it?

I am working on the docs PR. I have added the spec here: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Support+for+GPU+with+KVM+hosts

vishesh92 · 2025-07-07T10:25:34Z

@blueorangutan package

blueorangutan · 2025-07-07T10:26:03Z

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2025-07-07T12:17:10Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 14066

vishesh92 · 2025-07-07T12:26:12Z

@blueorangutan test matrix

blueorangutan · 2025-07-07T12:28:04Z

@vishesh92 a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2 ) has been kicked to run smoke tests

DaanHoogland

Mostly reviewed before submission, but I still have some remarks.

DaanHoogland · 2025-07-07T12:23:44Z

engine/schema/src/main/java/com/cloud/gpu/VGPUTypesVO.java

@@ -40,19 +40,19 @@ public class VGPUTypesVO implements InternalIdentity {
    private String vgpuType;

    @Column(name="video_ram")
-    private long videoRam;
+    private Long videoRam;


why should these no longer be basic types?
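
For context on the question above, here is a minimal sketch of how `long` and `Long` differ for a value that may be absent. The class and member names are hypothetical stand-ins, not the actual `VGPUTypesVO`. The sketch only illustrates that a boxed `Long` can represent "no value" (for example a NULL column), whereas a primitive always reads as a number, and that callers of the boxed getter need a null-safe read to avoid a `NullPointerException` on unboxing.

```java
import java.util.Optional;

// Hypothetical stand-in for an entity with an optional numeric column.
final class VideoRamExample {
    // Primitive: always holds a value; an unset field reads as 0.
    private long primitiveVideoRam;
    // Boxed: null can mean "not reported" (e.g. a NULL database column).
    private Long boxedVideoRam;

    long getPrimitiveVideoRam() { return primitiveVideoRam; }

    Long getBoxedVideoRam() { return boxedVideoRam; }

    void setBoxedVideoRam(Long value) { this.boxedVideoRam = value; }

    // Null-safe read that avoids an NPE from auto-unboxing a null Long.
    long videoRamOrDefault(long fallback) {
        return Optional.ofNullable(boxedVideoRam).orElse(fallback);
    }

    public static void main(String[] args) {
        VideoRamExample e = new VideoRamExample();
        System.out.println(e.getPrimitiveVideoRam());
        System.out.println(e.getBoxedVideoRam());
        System.out.println(e.videoRamOrDefault(0L));
    }
}
```

Whether the PR needs the nullable form depends on whether the `video_ram` column can legitimately be NULL for the new device types, which the thread does not state.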

DaanHoogland · 2025-07-07T12:25:31Z

engine/schema/src/main/java/com/cloud/gpu/dao/HostGpuGroupsDao.java

 public interface HostGpuGroupsDao extends GenericDao<HostGpuGroupsVO, Long> {

    /**
     * Find host device by hostId and groupName
-     * @param hostId the host
+     *


javadoc is ignoring this

DaanHoogland · 2025-07-07T12:26:09Z

engine/schema/src/main/java/com/cloud/gpu/dao/HostGpuGroupsDao.java

none of the changes in this file are of consequence.

DaanHoogland · 2025-07-07T12:26:53Z

engine/schema/src/main/java/com/cloud/gpu/dao/HostGpuGroupsDaoImpl.java

none of the changes in this file are of consequence.

DaanHoogland · 2025-07-07T12:27:21Z

engine/schema/src/main/java/com/cloud/gpu/dao/VGPUTypesDao.java

none of the changes in this file are of consequence.

DaanHoogland · 2025-07-07T12:58:30Z

server/src/main/java/com/cloud/resource/ResourceManagerImpl.java

+                GetGPUStatsAnswer gpuStatsAnswer = (GetGPUStatsAnswer) answer;
+                HashMap<String, HashMap<String, VgpuTypesInfo>> groupDetails;
+                gpuService.addGpuDevicesToHost(host, gpuStatsAnswer.getGpuDevices());
+                if (CollectionUtils.isNotEmpty(gpuStatsAnswer.getGpuDevices())) {
+                    groupDetails = gpuService.getGpuGroupDetailsFromGpuDevicesOnHost(host);
+                } else {
+                    groupDetails = gpuStatsAnswer.getGroupDetails();
+                }
+
+                return groupDetails;


return getGroupDetails()?
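
One way to read the suggestion above is to pull the if/else into a small helper that returns the chosen details directly. The sketch below is an assumption about what such a helper could look like. `GroupDetailsSelector` and `selectGroupDetails` are hypothetical names, and generic type parameters stand in for the real `VgpuTypesInfo` maps so the example is self-contained. It covers only the selection step, not the preceding `addGpuDevicesToHost` call.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper: picks the source of group details in one place.
final class GroupDetailsSelector {
    // If the answer reported any devices, derive the details from them;
    // otherwise fall back to the details carried by the answer itself.
    static <D, G> G selectGroupDetails(List<D> devices,
                                       Supplier<G> fromDevices,
                                       Supplier<G> fromAnswer) {
        boolean hasDevices = devices != null && !devices.isEmpty();
        return hasDevices ? fromDevices.get() : fromAnswer.get();
    }

    public static void main(String[] args) {
        String chosen = selectGroupDetails(
                List.of("gpu0"), () -> "from devices", () -> "from answer");
        System.out.println(chosen);
    }
}
```

Using suppliers keeps the fallback lazy, so only the branch that is actually selected is evaluated, matching the original control flow.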

DaanHoogland · 2025-07-07T12:59:58Z

server/src/main/java/com/cloud/resourcelimit/ResourceLimitManagerImpl.java

+                if (newGpu - currentGpu > 0) {
+                    incrementResourceCountWithTag(accountId, ResourceType.gpu, tag, newGpu - currentGpu);
+                } else if (newGpu - currentGpu < 0) {
+                    decrementResourceCountWithTag(accountId, ResourceType.gpu, tag, currentGpu - newGpu);
+                }


adjustResourceCount()?
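
A rough sketch of what an `adjustResourceCount()` extraction might look like, assuming the goal is to compute the signed difference once and dispatch to increment or decrement. The `Counter` interface is a hypothetical stand-in for the `incrementResourceCountWithTag` / `decrementResourceCountWithTag` calls, with the account, resource type, and tag arguments omitted for brevity.

```java
// Hypothetical extraction of the increment/decrement branching.
final class ResourceCountAdjuster {
    // Stand-in for the real increment/decrement calls on the limit manager.
    interface Counter {
        void increment(long amount);
        void decrement(long amount);
    }

    // Applies the difference between the new and current counts;
    // does nothing when the two values are equal.
    static void adjustResourceCount(long current, long updated, Counter counter) {
        long delta = updated - current;
        if (delta > 0) {
            counter.increment(delta);
        } else if (delta < 0) {
            counter.decrement(-delta);
        }
    }

    public static void main(String[] args) {
        long[] total = {0};
        Counter counter = new Counter() {
            @Override public void increment(long amount) { total[0] += amount; }
            @Override public void decrement(long amount) { total[0] -= amount; }
        };
        adjustResourceCount(1, 4, counter); // increments by 3
        adjustResourceCount(4, 2, counter); // decrements by 2
        System.out.println(total[0]);
    }
}
```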

DaanHoogland · 2025-07-07T13:00:58Z

server/src/main/java/com/cloud/server/ManagementServerImpl.java

+        ServiceOffering serviceOffering = vmProfile.getServiceOffering();
+        if (serviceOffering.getVgpuProfileId() != null) {
+            VgpuProfileVO vgpuProfile = vgpuProfileDao.findById(serviceOffering.getVgpuProfileId());
+            if (vgpuProfile == null || "passthrough".equals(vgpuProfile.getName())) {
+                throw new InvalidParameterValueException("Unsupported operation, VM uses host passthrough, cannot migrate");
+            }
+        }


idemDito(..)
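
Reading the remark above as another request to extract a method, the migration check could be isolated roughly as follows. This is a sketch under assumptions: `PassthroughMigrationGuard`, `Profile`, and `blocksMigration` are hypothetical names, and the lookup function stands in for `vgpuProfileDao.findById`. The predicate mirrors the condition in the hunk: an offering without a vGPU profile is unaffected, while a missing profile or one named `passthrough` blocks migration.

```java
import java.util.function.Function;

// Hypothetical extraction of the passthrough migration check.
final class PassthroughMigrationGuard {
    // Minimal stand-in for the vGPU profile entity.
    interface Profile {
        String getName();
    }

    // Returns true when migration should be rejected.
    static boolean blocksMigration(Long vgpuProfileId, Function<Long, Profile> findById) {
        if (vgpuProfileId == null) {
            return false; // offering has no vGPU profile
        }
        Profile profile = findById.apply(vgpuProfileId);
        return profile == null || "passthrough".equals(profile.getName());
    }

    public static void main(String[] args) {
        Profile passthrough = () -> "passthrough";
        System.out.println(blocksMigration(1L, id -> passthrough));
        System.out.println(blocksMigration(null, id -> passthrough));
    }
}
```

The caller would then throw the `InvalidParameterValueException` when the predicate returns true, keeping the exception message at the call site.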

DaanHoogland · 2025-07-07T13:03:01Z

server/src/main/java/com/cloud/vm/UserVmManagerImpl.java

+            Long currentGpu = currentServiceOffering.getGpuCount() != null ? Long.valueOf(currentServiceOffering.getGpuCount()) : 0L;
+            Long newGpu = svcOffering.getGpuCount() != null ? Long.valueOf(svcOffering.getGpuCount()) : 0L;
+            if (newGpu > currentGpu) {
+                _resourceLimitMgr.checkVmGpuResourceLimit(owner, vmInstance.isDisplay(), svcOffering, template, newGpu - currentGpu);
+            }


DaanHoogland · 2025-07-07T13:03:13Z

server/src/main/java/com/cloud/vm/UserVmManagerImpl.java

+        Long currentGpu = currentServiceOffering.getGpuCount() != null ? Long.valueOf(currentServiceOffering.getGpuCount()) : 0L;
+        Long newGpu = svcOffering.getGpuCount() != null ? Long.valueOf(svcOffering.getGpuCount()) : 0L;
+        if (newGpu > currentGpu) {
+            _resourceLimitMgr.incrementVmGpuResourceCount(owner.getAccountId(), vmInstance.isDisplay(), svcOffering, template, newGpu - currentGpu);
+        } else if (newGpu > 0 && currentGpu > newGpu){
+            _resourceLimitMgr.decrementVmGpuResourceCount(owner.getAccountId(), vmInstance.isDisplay(), svcOffering, template, currentGpu - newGpu);
+        }
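
Both `UserVmManagerImpl` hunks above repeat the same null-to-zero conversion of the GPU count. A small helper could centralize it, as in the sketch below. `GpuCounts`, `orZero`, and `delta` are hypothetical names, and the sketch assumes `getGpuCount()` returns a nullable `Integer`, which the thread does not confirm.

```java
// Hypothetical helper for the repeated null-to-zero GPU count handling.
final class GpuCounts {
    // Treats a missing GPU count as zero.
    static long orZero(Integer count) {
        return count == null ? 0L : count.longValue();
    }

    // Signed change from the current offering to the new one:
    // positive means more GPUs are needed, negative means GPUs are released.
    static long delta(Integer current, Integer updated) {
        return orZero(updated) - orZero(current);
    }

    public static void main(String[] args) {
        System.out.println(delta(null, 2));
        System.out.println(delta(2, null));
    }
}
```

Note that the second hunk only decrements when `newGpu > 0`; the thread does not say whether that guard is intentional, and this sketch does not reproduce it.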


rohityadavcloud · 2025-07-07T15:26:57Z

Due to the lack of DC-grade GPU hardware, we won't be able to fully test this. However, we've tested it against a consumer-grade RTX card for the basic full GPU-passthrough use case. I propose we review and assess based on regression testing and ship it as a technical preview in 4.21 (if it makes it). Hopefully, between now and ACS 4.22, this will attract more interest and testing from the wider community and from users who have access to datacenter-grade GPU cards.

vishesh92 added 2 commits June 24, 2025 11:54

GPU support for KVM

0a1f69a

Fix capacity & UI enhancements

58cb165

boring-cyborg bot added component:agent component:api component:integration labels Jul 4, 2025

vishesh92 changed the title ~~Integrate gpu~~ Feature: Add GPU support for KVM Jul 4, 2025

vishesh92 changed the title ~~Feature: Add GPU support for KVM~~ Feature: Add support for GPU with KVM hosts Jul 4, 2025

vishesh92 added type:new-feature type:experimental-feature labels Jul 4, 2025

vishesh92 requested a review from Copilot July 4, 2025 09:12


Allow managing GPU inventory and mapping a gpu device to gpu offering

59651bd

vishesh92 force-pushed the integrate-gpu branch from 9e3b7a4 to ab17e0b Compare July 4, 2025 09:15

vishesh92 requested a review from Copilot July 4, 2025 09:16

Copilot AI reviewed Jul 4, 2025


vishesh92 force-pushed the integrate-gpu branch from ab17e0b to 330289f Compare July 4, 2025 09:27

apache deleted a comment from blueorangutan Jul 4, 2025

sureshanaparti added this to the 4.21.0 milestone Jul 4, 2025

sureshanaparti added this to Apache CloudStack 4.21.0 Jul 4, 2025

sureshanaparti moved this to In Progress in Apache CloudStack 4.21.0 Jul 4, 2025

apache deleted a comment from blueorangutan Jul 4, 2025

vishesh92 force-pushed the integrate-gpu branch from 330289f to f6945ef Compare July 4, 2025 12:10

Setup limits for GPU

9bc8518

vishesh92 force-pushed the integrate-gpu branch from f6945ef to 9bc8518 Compare July 4, 2025 12:19

apache deleted a comment from blueorangutan Jul 4, 2025

github-actions bot added the status:has-conflicts label Jul 4, 2025

vishesh92 added 2 commits July 7, 2025 10:54

minor ui fixups

bccd469

Update GPU devices on adding a new host

9fa9af7

rohityadavcloud requested review from shwstppr, harikrishna-patnala, nvazquez, sureshanaparti and weizhouapache July 7, 2025 06:27

vishesh92 mentioned this pull request Jul 7, 2025

Update docs for GPU support with KVM apache/cloudstack-documentation#526

Open

Merge branch 'main' into integrate-gpu

8083a48

github-actions bot removed the status:has-conflicts label Jul 7, 2025

apache deleted a comment from blueorangutan Jul 7, 2025

vishesh92 force-pushed the integrate-gpu branch from 4794b51 to 3d5e5f0 Compare July 7, 2025 10:15

fixup

f20d940

vishesh92 force-pushed the integrate-gpu branch from 3d5e5f0 to f20d940 Compare July 7, 2025 10:25

DaanHoogland reviewed Jul 7, 2025


rohityadavcloud assigned kiranchavala, vladimirpetrov and borisstoyanov Jul 7, 2025
