Skip to content

Feature: Add support for GPU with KVM hosts #11143

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

vishesh92
Copy link
Member

@vishesh92 vishesh92 commented Jul 4, 2025

Description

This PR allows attaching of GPU devices via PCI, mdev or VF to an Instance for KVM.
CWiki Design doc: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Support+for+GPU+with+KVM+hosts
Doc PR: apache/cloudstack-documentation#526

Generated summary

This pull request introduces several changes across multiple files, focusing on enhancing GPU-related functionality, adding new properties for VM hooks, and updating resource management capabilities. The most significant updates include the addition of GPU properties and event types, the introduction of new VM shell script properties, and modifications to resource limits and types to support GPU devices.

GPU-related enhancements:

VM hook properties:

Resource management updates:

Miscellaneous updates:

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Screenshots (if appropriate):

How Has This Been Tested?

This was tested locally on my laptop with passthrough of a consumer graphics card. Due to unavailability of actual hardware, I wasn't able to test with vGPU profiles or mdev.

How did you try to break this feature and the system with this change?

@vishesh92 vishesh92 changed the title Integrate gpu Feature: Add GPU support for KVM Jul 4, 2025
@vishesh92 vishesh92 changed the title Feature: Add GPU support for KVM Feature: Add support for GPU with KVM hosts Jul 4, 2025
@vishesh92 vishesh92 requested a review from Copilot July 4, 2025 09:12
Copilot

This comment was marked as outdated.

@vishesh92 vishesh92 requested a review from Copilot July 4, 2025 09:16
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enables GPU support for KVM hosts by updating both backend utilities and the compute offering UI to attach and configure GPU cards and vGPU profiles.

  • Removed a stray comment in the script utility header.
  • Updated AddComputeOffering.vue to let users select GPU cards, vGPU profiles, GPU count, and display options.

Reviewed Changes

Copilot reviewed 153 out of 153 changed files in this pull request and generated 1 comment.

File Description
utils/src/main/java/com/cloud/utils/script/Script.java Removed an extraneous comment line at the top of the file.
ui/src/views/offering/AddComputeOffering.vue Renamed form fields for GPU card and profile selection, added count/display controls and data-fetch methods.
Comments suppressed due to low confidence (1)

ui/src/views/offering/AddComputeOffering.vue:262

  • The form field name 'vgpuprofile' may conflict with the API parameter 'vgpuprofileid'. Consider renaming it to 'vgpuprofileid' to maintain consistency and avoid confusion when mapping form values to request parameters.
        <a-form-item name="vgpuprofile" ref="vgpuprofile" :label="$t('label.vgpu.profile')" v-if="!isSystem && form.gpucardid && vgpuProfiles.length > 0">

Copy link

codecov bot commented Jul 4, 2025

Codecov Report

Attention: Patch coverage is 32.51613% with 2092 lines in your changes missing coverage. Please review.

Project coverage is 16.66%. Comparing base (8e4fe1c) to head (f20d940).

Files with missing lines Patch % Lines
...java/org/apache/cloudstack/gpu/GpuServiceImpl.java 78.13% 109 Missing and 67 partials ⚠️
.../main/java/com/cloud/gpu/dao/GpuDeviceDaoImpl.java 0.00% 147 Missing ⚠️
.../com/cloud/agent/manager/MockAgentManagerImpl.java 0.00% 134 Missing ⚠️
...main/java/com/cloud/simulator/MockGpuDeviceVO.java 0.00% 89 Missing ⚠️
.../cloud/resourcelimit/ResourceLimitManagerImpl.java 3.84% 72 Missing and 3 partials ⚠️
...rc/main/java/com/cloud/gpu/dao/GpuCardDaoImpl.java 0.00% 69 Missing ⚠️
...che/cloudstack/api/response/GpuDeviceResponse.java 31.31% 68 Missing ⚠️
...chema/src/main/java/com/cloud/gpu/GpuDeviceVO.java 31.52% 63 Missing ⚠️
...c/main/java/com/cloud/agent/api/VgpuTypesInfo.java 47.32% 59 Missing ⚠️
...n/java/com/cloud/resource/ResourceManagerImpl.java 0.00% 58 Missing ⚠️
... and 70 more
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #11143      +/-   ##
============================================
+ Coverage     16.57%   16.66%   +0.08%     
- Complexity    13988    14123     +135     
============================================
  Files          5745     5782      +37     
  Lines        510847   514381    +3534     
  Branches      62140    62572     +432     
============================================
+ Hits          84696    85702    +1006     
- Misses       416677   419123    +2446     
- Partials       9474     9556      +82     
Flag Coverage Δ
uitests 3.85% <ø> (-0.06%) ⬇️
unittests 17.57% <32.51%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@apache apache deleted a comment from blueorangutan Jul 4, 2025
@sureshanaparti sureshanaparti added this to the 4.21.0 milestone Jul 4, 2025
@sureshanaparti sureshanaparti moved this to In Progress in Apache CloudStack 4.21.0 Jul 4, 2025
@apache apache deleted a comment from blueorangutan Jul 4, 2025
@apache apache deleted a comment from blueorangutan Jul 4, 2025
@GutoVeronezi
Copy link
Contributor

Great initiative @vishesh92; do you have any spec or documentation about it?

@vishesh92
Copy link
Member Author

Great initiative @vishesh92; do you have any spec or documentation about it?

I am still working on it.

@vishesh92
Copy link
Member Author

@blueorangutan package

@apache apache deleted a comment from blueorangutan Jul 4, 2025
@GutoVeronezi
Copy link
Contributor

Great initiative @vishesh92; do you have any spec or documentation about it?

I am still working on it.

Just to clarify, you have the spec/documentation and are working on the PR or you still do not have it and will create it?

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 14034

Copy link

github-actions bot commented Jul 4, 2025

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@vishesh92
Copy link
Member Author

Great initiative @vishesh92; do you have any spec or documentation about it?

I am still working on it.

Just to clarify, you have the spec/documentation and are working on the PR or you still do not have it and will create it?

I am working on the docs PR. I have added the spec here: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Support+for+GPU+with+KVM+hosts

@apache apache deleted a comment from blueorangutan Jul 7, 2025
@apache apache deleted a comment from blueorangutan Jul 7, 2025
@vishesh92
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 14066

@vishesh92
Copy link
Member Author

@blueorangutan test matrix

@blueorangutan
Copy link

@vishesh92 a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2 ) has been kicked to run smoke tests

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly reviewed before submission, but still some remarks.

@@ -40,19 +40,19 @@ public class VGPUTypesVO implements InternalIdentity {
private String vgpuType;

@Column(name="video_ram")
private long videoRam;
private Long videoRam;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why should these no longer be basic types?

public interface HostGpuGroupsDao extends GenericDao<HostGpuGroupsVO, Long> {

/**
* Find host device by hostId and groupName
* @param hostId the host
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

javadoc is ignoring this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

none of the changes in this file are of consequence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

none of the changes in this file are of consequence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

none of the changes in this file are of consequence.

Comment on lines +4320 to +4329
GetGPUStatsAnswer gpuStatsAnswer = (GetGPUStatsAnswer) answer;
HashMap<String, HashMap<String, VgpuTypesInfo>> groupDetails;
gpuService.addGpuDevicesToHost(host, gpuStatsAnswer.getGpuDevices());
if (CollectionUtils.isNotEmpty(gpuStatsAnswer.getGpuDevices())) {
groupDetails = gpuService.getGpuGroupDetailsFromGpuDevicesOnHost(host);
} else {
groupDetails = gpuStatsAnswer.getGroupDetails();
}

return groupDetails;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return getGroupDetails()?

Comment on lines +1911 to +1915
if (newGpu - currentGpu > 0) {
incrementResourceCountWithTag(accountId, ResourceType.gpu, tag, newGpu - currentGpu);
} else if (newGpu - currentGpu < 0) {
decrementResourceCountWithTag(accountId, ResourceType.gpu, tag, currentGpu - newGpu);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adjustResourceCount()?

Comment on lines +1545 to +1551
ServiceOffering serviceOffering = vmProfile.getServiceOffering();
if (serviceOffering.getVgpuProfileId() != null) {
VgpuProfileVO vgpuProfile = vgpuProfileDao.findById(serviceOffering.getVgpuProfileId());
if (vgpuProfile == null || "passthrough".equals(vgpuProfile.getName())) {
throw new InvalidParameterValueException("Unsupported operation, VM uses host passthrough, cannot migrate");
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idemDito(..)

Comment on lines +2791 to +2795
Long currentGpu = currentServiceOffering.getGpuCount() != null ? Long.valueOf(currentServiceOffering.getGpuCount()) : 0L;
Long newGpu = svcOffering.getGpuCount() != null ? Long.valueOf(svcOffering.getGpuCount()) : 0L;
if (newGpu > currentGpu) {
_resourceLimitMgr.checkVmGpuResourceLimit(owner, vmInstance.isDisplay(), svcOffering, template, newGpu - currentGpu);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

method

Comment on lines +2811 to +2817
Long currentGpu = currentServiceOffering.getGpuCount() != null ? Long.valueOf(currentServiceOffering.getGpuCount()) : 0L;
Long newGpu = svcOffering.getGpuCount() != null ? Long.valueOf(svcOffering.getGpuCount()) : 0L;
if (newGpu > currentGpu) {
_resourceLimitMgr.incrementVmGpuResourceCount(owner.getAccountId(), vmInstance.isDisplay(), svcOffering, template, newGpu - currentGpu);
} else if (newGpu > 0 && currentGpu > newGpu){
_resourceLimitMgr.decrementVmGpuResourceCount(owner.getAccountId(), vmInstance.isDisplay(), svcOffering, template, currentGpu - newGpu);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

method

@rohityadavcloud
Copy link
Member

Due to lack of DC-grade GPU hardware we wouldn't be able to fully test this. However, we've tested this against consumer-grade RTX card for basic full GPU-passthrough use-case. I propose we review and assess based on regression testing and ship as technical preview in 4.21 (if it makes it). And, hopefully between now and ACS 4.22 - this could find more interest and testing by the wider community and users who have access to datacenter-grade GPU cards.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

9 participants