feat: perf opt part3 #42
Conversation
This reverts commit efa88deb0e8265614fd91db3c3dba777c00e858b.
This reverts commit bb419f89ca4599470d61d636fe6fa1e033d62748.
Pull Request Overview
This PR improves performance and refactors several components related to Hexagon NPU operations and memory management. Key changes include:
- API updates for device and tensor initialization (e.g., removing unused parameters, renaming `get_data()` to `get_data_buffer()`, and adding new buffer management functions)
- Enhancements to the op implementation and thread pool synchronization across graph compute functions
- Updated build system support for quantized tensor operations
Reviewed Changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.
File | Description
--- | ---
ggml/src/ggml-qnn/npu/host/host.cpp | Removed extra parameters from device initialization, updated logging |
ggml/src/ggml-qnn/npu/host/graph.cpp | Added support for a new f16-to-f32 table and enhanced debug logging |
ggml/src/ggml-qnn/npu/host/buffer.{hpp,cpp} | Added clear_tensors and backend_buffer_reset implementations |
ggml/src/ggml-qnn/npu/device/{op_impl.cpp,graph.cpp,device.cpp} | Updated API design for compute functions, thread synchronization and f16 table usage |
ggml/src/ggml-qnn/npu/CMakeLists.txt | Extended build configuration to support quantized tensors |
Comments suppressed due to low confidence (4)
ggml/src/ggml-qnn/npu/device/op_impl.cpp:280
- The new `compute_funcs` array indexed by tensor data type requires careful maintenance. Confirm that the mapping from `npu_device_tensor_data_type` to the corresponding compute function is exhaustive and correctly documented.
return kOpCapabilities[op].compute_funcs[type];
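One way to make that mapping self-checking is to tie the table width to the enum so a new data type cannot be added without extending every row. The sketch below is hypothetical: the enum values, kernel names, and `op_capabilities` layout are illustrative stand-ins, not the actual NPU headers.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative stand-in for the device-side data-type enum; the trailing
// COUNT sentinel is what lets the table size be checked at compile time.
enum npu_device_tensor_data_type { NPU_DATA_TYPE_F32, NPU_DATA_TYPE_F16, NPU_DATA_TYPE_Q4_0, NPU_DATA_TYPE_COUNT };

typedef void (*compute_func_t)(const void * src, void * dst, size_t n);

static void compute_f32(const void *, void *, size_t) { /* f32 kernel body */ }
static void compute_f16(const void *, void *, size_t) { /* f16 kernel body */ }

struct op_capabilities {
    compute_func_t compute_funcs[NPU_DATA_TYPE_COUNT];  // one entry per data type
};

// A null entry documents "unsupported" explicitly instead of leaving a gap.
static constexpr op_capabilities kOpCapabilities[] = {
    { { compute_f32, compute_f16, nullptr } },  // op 0: quantized path not implemented
};

// Adding a data type without widening every row now fails to compile.
static_assert(sizeof(kOpCapabilities[0].compute_funcs) / sizeof(compute_func_t) == NPU_DATA_TYPE_COUNT,
              "compute_funcs must cover every npu_device_tensor_data_type");

compute_func_t get_compute_func(size_t op, npu_device_tensor_data_type type) {
    return kOpCapabilities[op].compute_funcs[type];
}
```

Callers can then treat a `nullptr` result as "type not supported for this op" rather than risking an out-of-bounds read.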
ggml/src/ggml-qnn/npu/device/graph.cpp:74
- [nitpick] Synchronizing the thread pool after processing each tensor operation may introduce overhead. Consider evaluating whether batching synchronization calls could further improve performance if profiling indicates a bottleneck.
pool->sync_thread();
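The batching idea can be sketched as follows. This is a toy model, not the real `thread_pool`: the `sync_calls` counter and the `reads_previous_output` flag are invented for illustration, and the real dependency test would come from the graph structure.

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for the device-side types; only the sync pattern matters here.
struct thread_pool {
    int sync_calls = 0;
    void sync_thread() { ++sync_calls; }  // real impl waits on a barrier
};

struct op { bool reads_previous_output; };

// Current behavior: one barrier per op, regardless of dependencies.
int run_sync_every_op(thread_pool & pool, const std::vector<op> & ops) {
    for (const op & o : ops) { (void)o; pool.sync_thread(); }
    return pool.sync_calls;
}

// Batched alternative: only synchronize before an op that actually consumes
// another op's output, plus one final barrier so results reach the host.
int run_batched_sync(thread_pool & pool, const std::vector<op> & ops) {
    for (const op & o : ops) {
        if (o.reads_previous_output) pool.sync_thread();
    }
    pool.sync_thread();
    return pool.sync_calls;
}
```

Whether this helps depends on how often ops are truly independent; as the comment says, it is only worth pursuing if profiling shows the per-op barrier is a bottleneck.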
ggml/src/ggml-qnn/npu/device/device.cpp:222
- Passing the `f16_to_f32_table` pointer into `graph::compute` is a notable API change. Please verify that all callers of `compute` are updated consistently to handle this new parameter.
if (!graph->compute(dev_ctx->thread_pool.get(), dev_ctx->f16_to_f32_table.get())) {
ggml/src/ggml-qnn/npu/device/tensor.hpp:82
- Renaming `get_data()` to `get_data_buffer()` clarifies usage, but ensure accompanying documentation and clients of the tensor API are updated to prevent confusion with `flush()` and `invalidate()`.
uint8_t * get_data_buffer() const { return _data + _info.offset; }
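The later read/write split (see the second review below) makes the cache-maintenance contract explicit in the accessor names. The sketch here is illustrative only: the constructor, member names, and empty cache hooks are assumptions, not the real `device/tensor.hpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative sketch of a read/write accessor split with explicit cache hooks.
class tensor {
  public:
    tensor(uint8_t * data, size_t offset) : _data(data), _offset(offset) {}

    // Read path: the data may have been written by another core, so the
    // cached copy is invalidated before the caller reads through it.
    const uint8_t * get_read_buffer() {
        invalidate();
        return _data + _offset;
    }

    // Write path: the caller produces data for another core and is expected
    // to call flush() once it is done writing.
    uint8_t * get_write_buffer() { return _data + _offset; }

    void flush()      { /* real impl would clean the data cache here */ }
    void invalidate() { /* real impl would invalidate the data cache here */ }

  private:
    uint8_t * _data   = nullptr;
    size_t    _offset = 0;
};
```

With separate accessors, a caller can no longer obtain a writable pointer without it being obvious that a `flush()` is owed afterwards.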
Pull Request Overview
This PR adds performance optimizations and infrastructure for Hexagon NPU backend, including:
- Introduces VTCM memory management, scoped performance timers, and an f16-to-f32 lookup table.
- Refactors tensor I/O (split read/write buffers), graph execution (the f16-to-f32 table is now passed through compute), and buffer reset functionality.
- Extends the build scripts (CMake) to enable quantized tensor support and performance-tracking flags.
Reviewed Changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated no comments.
File | Description
--- | ---
host/host.cpp | Removed unused params from init_device call |
host/graph.cpp | Skip RESHAPE ops and update debug log to include graph handle |
host/buffer.hpp/.cpp | Added clear_tensors() and backend_buffer_reset interface |
device/vtcm_mem.hpp | New VTCM allocation RAII class |
device/util.hpp | Added performance timer class and tracking macros |
device/thread_pool.hpp | Exposed sync_thread to wait on barrier |
device/tensor.hpp | Split get_data() into get_read_buffer()/get_write_buffer()
device/quants.{hpp,cpp} | New quantization/dequantization utilities |
device/op_*.{hpp,cpp} | Refactored op implementations for vectorized and quantized paths |
device/graph.{hpp,cpp} | Pass F16-to-F32 table into graph compute and invalidate cache after ops |
device/device.cpp | Initialize f16-to-f32 table in device context |
npu/CMakeLists.txt | Added options for quantized tensors & performance tracking, consolidated target logic |
qnn/CMakeLists.txt | Enabled copying of dynamic libs under new BACKEND_RUNTIME_LIBS |
Comments suppressed due to low confidence (3)
ggml/src/ggml-qnn/npu/device/util.hpp:7
- The scoped-timer code uses `va_list` and `vsnprintf` but we haven't included `<cstdarg>` or `<cstdio>`. Add these headers to avoid compilation errors.
#include <cstring>
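A minimal sketch of the formatting path in question, with the missing headers included explicitly rather than relying on transitive includes. The class name and buffer size are illustrative; the real `npu_scoped_timer` also reads cycle counters.

```cpp
#include <cassert>
#include <cstdarg>  // va_list, va_start, va_end
#include <cstdio>   // vsnprintf
#include <cstring>

// Illustrative scoped-timer label formatter; only the varargs plumbing that
// the review comment concerns is shown.
class npu_scoped_timer_sketch {
  public:
    explicit npu_scoped_timer_sketch(const char * fmt, ...) {
        va_list args;
        va_start(args, fmt);
        vsnprintf(_label, sizeof(_label), fmt, args);
        va_end(args);
    }
    const char * label() const { return _label; }

  private:
    char _label[64] = {};
};
```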
ggml/src/ggml-qnn/npu/device/quants.cpp:88
- The inner loop increments `dst++` and then does `dst += 32`, which may misplace the pointer on subsequent iterations. Consider using a separate index or computing offsets rather than mutating `dst` directly.
for (int j = 0; j < QUANT_K_BLOCK_SIZE; j += 64) {
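The offset-based rewrite the comment suggests could look like this. The nibble layout and scaling below are placeholders (a q4-style "low/high nibbles land 32 elements apart" pattern), not the actual layout in `quants.cpp`; the point is only that `dst` stays immutable and every write position is computed from `j` and `i`.

```cpp
#include <cassert>

// Placeholder block size; the real constant lives in the quantization headers.
constexpr int QUANT_K_BLOCK_SIZE = 256;

// Each 64-element chunk unpacks 32 source bytes: low nibbles fill the first
// 32 outputs, high nibbles the next 32. No pointer mutation anywhere.
void dequant_block_sketch(const unsigned char * src, float * dst) {
    for (int j = 0; j < QUANT_K_BLOCK_SIZE; j += 64) {
        for (int i = 0; i < 32; ++i) {
            const unsigned char b = src[j / 2 + i];
            dst[j + i]      = static_cast<float>(b & 0x0F) - 8.0f;  // low nibble
            dst[j + i + 32] = static_cast<float>(b >> 4)   - 8.0f;  // high nibble
        }
    }
}
```

Because nothing mutates `dst`, the final pointer position cannot drift between outer-loop iterations, which is exactly the failure mode the review flags.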
ggml/src/ggml-qnn/npu/CMakeLists.txt:29
- [nitpick] The CMake logic for enabling quantized tensors and performance tracking is duplicated in the host and DSP build sections. Consider extracting the common definitions (`GGML_HEXAGON_ENABLE_QUANTIZED_TENSORS`, `GGML_HEXAGON_ENABLE_PERFORMANCE_TRACKING`) into a shared macro or include to reduce duplication.
function(add_device_target target_name DSP_ARCH IS_SIMULATOR BUILD_CPU_COUNT)
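One way to factor out the duplication is a shared macro that both build sections call. This is a hypothetical sketch: the macro and target names are illustrative, and only the two option names quoted in the comment are taken from the PR.

```cmake
# Hypothetical shared macro: applies the feature flags to any target so the
# host and DSP sections no longer repeat the same if() blocks.
macro(hexagon_apply_feature_flags target)
    if(GGML_HEXAGON_ENABLE_QUANTIZED_TENSORS)
        target_compile_definitions(${target} PRIVATE GGML_HEXAGON_ENABLE_QUANTIZED_TENSORS)
    endif()
    if(GGML_HEXAGON_ENABLE_PERFORMANCE_TRACKING)
        target_compile_definitions(${target} PRIVATE GGML_HEXAGON_ENABLE_PERFORMANCE_TRACKING)
    endif()
endmacro()

# Usage in both sections (target names illustrative):
# hexagon_apply_feature_flags(hexagon_npu_host)
# hexagon_apply_feature_flags(hexagon_npu_dsp)
```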
Pull Request Overview
This PR adds performance optimizations for the Hexagon NPU backend, including FP16→FP32 dequantization support, VTCM memory management, L2 prefetching, and scoped performance timers.
- Introduce an FP16→FP32 lookup table in the device context and pass it through graph compute
- Implement `vtcm_mem` for on-chip memory allocation and integrate L2 prefetch into the matrix-mul and element-wise kernels
- Add `npu_scoped_timer` utilities and macros to measure and log execution cycles
Reviewed Changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 5 comments.
File | Description
--- | ---
ggml/src/ggml-qnn/npu/host/host.cpp | Update init_device call signature |
ggml/src/ggml-qnn/npu/device/vtcm_mem.hpp | New vtcm_mem class for VTCM allocation |
ggml/src/ggml-qnn/npu/device/quants.cpp | FP16→FP32 table init and dequantization routines |
ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp | Matrix‐multiply optimization with caching and prefetch |
ggml/src/ggml-qnn/npu/CMakeLists.txt | New build flags for quantization, performance tracking, DSP targets |
Comments suppressed due to low confidence (2)
ggml/src/ggml-qnn/npu/device/quants.cpp:132
- [nitpick] This function and the associated dequantization routines introduce core behavior; consider adding unit tests to cover boundary cases and verify correctness.
bool init_f16_f32_table(float * table, size_t count) {
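The kind of boundary test the comment asks for could look like this. `fill_f16_table` below is a hedged stand-in for the real `init_f16_f32_table` (whose implementation is not shown in this review): it decodes every IEEE 754 half-precision bit pattern in software, so the assertions exercise zero, exact powers of two, and infinity.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Reference fp16 -> fp32 decode covering subnormal, normal, and inf/NaN cases.
static float f16_to_f32_ref(uint16_t h) {
    const uint32_t sign = (h >> 15) & 1, exp = (h >> 10) & 0x1F, mant = h & 0x3FF;
    float v;
    if (exp == 0)       v = std::ldexp(static_cast<float>(mant), -24);           // subnormal
    else if (exp == 31) v = mant ? NAN : INFINITY;                               // inf / NaN
    else                v = std::ldexp(static_cast<float>(mant | 0x400), static_cast<int>(exp) - 25);
    return sign ? -v : v;
}

// Stand-in for init_f16_f32_table: fills one entry per 16-bit pattern and
// rejects tables too small to hold all 2^16 of them.
static bool fill_f16_table(float * table, size_t count) {
    if (count < 65536) return false;
    for (size_t i = 0; i < 65536; ++i) table[i] = f16_to_f32_ref(static_cast<uint16_t>(i));
    return true;
}
```

A unit test against the real function would compare its table entry-by-entry with a reference decoder like this one, paying particular attention to the subnormal range and the inf/NaN patterns.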
ggml/src/ggml-qnn/npu/host/host.cpp:60
- The call site was updated to omit the `params` argument, but verify that the `init_device` overload without parameters matches the expected behavior, or update the callers and signature consistently.
if (!dev_obj->init_device()) {
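One way to avoid overload drift between parameterless and parameterized call sites is a single signature with a defaulted argument, sketched below. The `device_params` struct, its fields, and the default values are all hypothetical.

```cpp
#include <cassert>

// Hypothetical parameter bundle; fields and defaults are illustrative.
struct device_params {
    int num_threads = 4;
};

class device {
  public:
    // One signature serves both init_device() and init_device(custom_params),
    // so old and new call sites cannot diverge onto different overloads.
    bool init_device(const device_params & params = device_params{}) {
        _num_threads = params.num_threads;
        _initialized = true;
        return true;
    }

    bool initialized() const { return _initialized; }
    int  num_threads() const { return _num_threads; }

  private:
    bool _initialized = false;
    int  _num_threads = 0;
};
```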
Related to #34
Passed `test-backend-ops` on a Snapdragon 8 Gen 2.

Full log:
test-backend-ops_all.debug.hexagon.9bbcef491.log
test-backend-ops_all.debug.hexagon.9bbcef491_logcat.log