[ET-VK] Split up prepack command buffer #12442

SS-JIA · 2025-07-14T15:33:13Z

Stack from ghstack (oldest at bottom):

[ET-VK][qlinear] Faster weight only quantized linear gemv kernel #12444
[ET-VK][ez] Rename run_prepack() to prepack() and replace encode_prepack() + prepack() with just prepack() #12443
-> [ET-VK] Split up prepack command buffer #12442

Changes

* Introduce `run_prepack()` API which combines the functionality of `encode_prepack()` and `prepack()`, but submits prepacking shaders incrementally rather than all at once.
* Introduce graph config options to control command buffer submission behaviour during prepacking.

Note that the current default values for the prepack submission thresholds were determined through experimentation. I will leave determining optimal values for specific devices as a later exercise. The goal of this diff is simply to introduce this mechanism to fix the Llama model loading crash on Samsung S24 (described below).
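To make the threshold idea concrete, here is a minimal sketch of what a submission-threshold config and its check could look like. Every name in it (`PrepackSubmitConfig`, `threshold_nbytes`, `initial_threshold_nbytes`, `should_submit`) is hypothetical and not taken from the ET-VK sources. The default values are arbitrary placeholders, not the defaults chosen in this diff, and the use of a separate, smaller first-submission threshold is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical config: these are NOT the actual ET-VK graph config fields.
struct PrepackSubmitConfig {
  // Submit once the staged prepack data reaches this many bytes.
  // Placeholder value, not the default used in this diff.
  std::size_t threshold_nbytes = 10 * 1024 * 1024;
  // Assumed smaller threshold for the first submission, so that the GPU
  // can start working early. Placeholder value.
  std::size_t initial_threshold_nbytes = 1 * 1024 * 1024;
};

// Returns true when the commands encoded so far should be submitted.
inline bool should_submit(
    const PrepackSubmitConfig& config,
    std::size_t staged_nbytes,
    bool first_submission_done) {
  assert(config.threshold_nbytes > 0 && config.initial_threshold_nbytes > 0);
  const std::size_t limit = first_submission_done
      ? config.threshold_nbytes
      : config.initial_threshold_nbytes;
  return staged_nbytes >= limit;
}
```

Keeping the thresholds in a config struct rather than hard-coding them would let per-device tuning happen later without touching the submission logic.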

Context

Currently, ET-VK will encode all prepacking shaders, and then perform prepacking by submitting only one command buffer.

However, this approach has some drawbacks:

* CPU/GPU parallelism is decreased, since the command buffer is submitted only after all commands have been encoded.
* There can be performance issues at the Vulkan API level when processing a single "large" command buffer.

Splitting prepacking across multiple command buffers avoids both of these issues and improves performance.
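The difference between one large submission and incremental submission can be sketched with a stand-in command buffer that only records batch sizes. This is not ET-VK code: `FakeCommandBuffer` and `prepack_incrementally` are hypothetical names, no Vulkan calls are made, and using a byte count as the trigger is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a GPU command buffer (hypothetical). It records how many
// ops went into each submission instead of calling Vulkan.
struct FakeCommandBuffer {
  std::size_t pending_ops = 0;
  std::size_t pending_nbytes = 0;
  std::vector<std::size_t> submitted_batches; // ops per submission

  void encode(std::size_t nbytes) {
    ++pending_ops;
    pending_nbytes += nbytes;
  }

  void submit() {
    if (pending_ops == 0) {
      return; // nothing staged, nothing to submit
    }
    submitted_batches.push_back(pending_ops);
    pending_ops = 0;
    pending_nbytes = 0;
  }
};

// Encodes one prepack op per tensor and submits whenever the staged byte
// count reaches threshold_nbytes, then flushes whatever remains.
inline std::vector<std::size_t> prepack_incrementally(
    const std::vector<std::size_t>& tensor_nbytes,
    std::size_t threshold_nbytes) {
  assert(threshold_nbytes > 0);
  FakeCommandBuffer cmd;
  for (std::size_t nbytes : tensor_nbytes) {
    cmd.encode(nbytes);
    if (cmd.pending_nbytes >= threshold_nbytes) {
      cmd.submit();
    }
  }
  cmd.submit(); // flush the final partial batch
  return cmd.submitted_batches;
}
```

With a threshold larger than the total staged size, the loop degenerates to the single-submission behaviour described above. A smaller threshold lets earlier batches be handed to the GPU while the CPU is still encoding later ones.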

Llama 3.2 1B crash on Samsung S24

I have also noticed that when running large models (e.g. Llama 3.2 1B) on the Samsung S24 with ET-VK, the device's display crashes (the screen goes black and becomes unresponsive), and sometimes the device shuts down entirely.

Fortunately, this change also fixes this behaviour, in addition to providing a significant performance boost to model load time for Llama models (from 9s to 3s).

Performance Impact

Improves model load time, especially on larger models.

Future Work

Deprecate the encode_prepack() + prepack() pattern in favor of the run_prepack() pattern

Differential Revision: D78275586


pytorch-bot · 2025-07-14T15:33:17Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12442

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit e447656 with merge base dd4488d:

NEW FAILURE - The following job has failed:

pull / test-llava-runner-linux / linux-job (gh)
RuntimeError: Command docker exec -t 12777b4d61e3c2fbbb3c1cdd8e5a772783a7d0e12e94f8bcf5fdb85eba05db6c /exec failed with exit code 139

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-07-14T15:33:31Z

This pull request was exported from Phabricator. Differential Revision: D78275586

github-actions · 2025-07-14T15:34:05Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


facebook-github-bot · 2025-07-14T19:44:40Z

This pull request was exported from Phabricator. Differential Revision: D78275586

SS-JIA requested review from jackzhxng, larryliu0820, swolchok and mergennachin as code owners July 14, 2025 15:33

This was referenced Jul 14, 2025

[ET-VK] Fix caching mechanism to account for included files #12441

Merged

[ET-VK][ez] Rename run_prepack() to prepack() and replace encode_prepack() + prepack() with just prepack() #12443

Open

facebook-github-bot added the CLA Signed label Jul 14, 2025

SS-JIA mentioned this pull request Jul 14, 2025

[ET-VK][qlinear] Faster weight only quantized linear gemv kernel #12444

Open

facebook-github-bot added the fb-exported label Jul 14, 2025
