
feat: perf opt part3 #42


Merged
merged 104 commits into dev-refactoring on May 16, 2025

Conversation


@chraac (Owner) commented on May 8, 2025

Related to #34

Passed test-backend-ops on Snapdragon 8 Gen 2

  5519/5519 tests passed
  Backend hexagon-npu: OK

Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

Full log:
test-backend-ops_all.debug.hexagon.9bbcef491.log
test-backend-ops_all.debug.hexagon.9bbcef491_logcat.log

@chraac requested a review from Copilot on May 15, 2025 05:41

Copilot AI left a comment


Pull Request Overview

This PR improves performance and refactors several components related to Hexagon NPU operations and memory management. Key changes include:

  • API updates for device and tensor initialization (e.g. removal of unused parameters, renaming get_data() to get_data_buffer(), and new buffer management functions)
  • Enhancements to the op implementation and thread pool synchronization across graph compute functions
  • Updated build system support for quantized tensor operations

Reviewed Changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.

Summary per file:

  • ggml/src/ggml-qnn/npu/host/host.cpp: Removed extra parameters from device initialization, updated logging
  • ggml/src/ggml-qnn/npu/host/graph.cpp: Added support for a new f16-to-f32 table and enhanced debug logging
  • ggml/src/ggml-qnn/npu/host/buffer.{hpp,cpp}: Added clear_tensors and backend_buffer_reset implementations
  • ggml/src/ggml-qnn/npu/device/{op_impl.cpp,graph.cpp,device.cpp}: Updated API design for compute functions, thread synchronization, and f16 table usage
  • ggml/src/ggml-qnn/npu/CMakeLists.txt: Extended build configuration to support quantized tensors
Comments suppressed due to low confidence (4)

ggml/src/ggml-qnn/npu/device/op_impl.cpp:280

  • The new compute_funcs array indexed by tensor data type requires careful maintenance. Confirm that the mapping from npu_device_tensor_data_type to the corresponding compute function is exhaustive and correctly documented.
return kOpCapabilities[op].compute_funcs[type];
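
A minimal sketch of the type-indexed dispatch pattern this comment refers to, using hypothetical names (op_capability, compute_func_t, and the enum values) rather than the repository's actual declarations. A static_assert over the table extent is one inexpensive way to keep the mapping exhaustive as ops and data types are added:

```cpp
#include <cstddef>

// Hypothetical stand-ins; not the actual declarations from op_impl.cpp.
struct npu_device_tensor;
using compute_func_t = bool (*)(npu_device_tensor * dst);

enum npu_device_tensor_data_type { NPU_DATA_TYPE_F32, NPU_DATA_TYPE_F16, NPU_DATA_TYPE_COUNT };
enum npu_device_op { NPU_OP_ADD, NPU_OP_MUL_MAT, NPU_OP_COUNT };

struct op_capability {
    // One slot per tensor data type; nullptr means "unsupported for this type".
    compute_func_t compute_funcs[NPU_DATA_TYPE_COUNT];
};

// One row per op, one column per data type.
constexpr op_capability kExampleCapabilities[NPU_OP_COUNT] = {
    /* NPU_OP_ADD     */ { { nullptr /* F32 */, nullptr /* F16 */ } },
    /* NPU_OP_MUL_MAT */ { { nullptr /* F32 */, nullptr /* F16 */ } },
};

// Compile-time guard: adding an op without extending the table fails the build
// instead of silently indexing past the end of the array.
static_assert(sizeof(kExampleCapabilities) / sizeof(kExampleCapabilities[0]) == NPU_OP_COUNT,
              "capability table must have exactly one row per op");

inline compute_func_t get_compute_func(npu_device_op op, npu_device_tensor_data_type type) {
    return kExampleCapabilities[op].compute_funcs[type];  // same lookup shape as the quoted line
}
```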

ggml/src/ggml-qnn/npu/device/graph.cpp:74

  • [nitpick] Synchronizing the thread pool after processing each tensor operation may introduce overhead. Consider evaluating whether batching synchronization calls could further improve performance if profiling indicates a bottleneck.
pool->sync_thread();
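
To make the suggested trade-off concrete, here is a small sketch contrasting a barrier after every op with synchronizing only when an op consumes pending output; thread_pool, tensor_op, and reads_previous_output are stand-ins invented for this example, not the actual device code:

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-ins for the device-side types.
struct tensor_op {
    std::function<void()> run;
    bool reads_previous_output = false;  // does this op consume an earlier op's result?
};

struct thread_pool {
    void submit(const tensor_op & op) { if (op.run) op.run(); }  // stub: runs inline
    void sync_thread() { /* barrier: wait until all workers are idle */ }
};

// Current pattern flagged above: a barrier after every op.
void compute_per_op(thread_pool & pool, const std::vector<tensor_op> & ops) {
    for (const auto & op : ops) {
        pool.submit(op);
        pool.sync_thread();
    }
}

// Batched alternative: synchronize only before an op that depends on pending
// work, plus one final barrier before results are read back.
void compute_batched(thread_pool & pool, const std::vector<tensor_op> & ops) {
    for (const auto & op : ops) {
        if (op.reads_previous_output) {
            pool.sync_thread();
        }
        pool.submit(op);
    }
    pool.sync_thread();
}
```

Whether the batched form actually pays off is a profiling question, as the comment notes.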

ggml/src/ggml-qnn/npu/device/device.cpp:222

  • Passing the f16_to_f32_table pointer into graph::compute is a notable API change. Please verify that all callers of compute are updated consistently to handle this new parameter.
if (!graph->compute(dev_ctx->thread_pool.get(), dev_ctx->f16_to_f32_table.get())) {

ggml/src/ggml-qnn/npu/device/tensor.hpp:82

  • Renaming from get_data() to get_data_buffer() clarifies usage, but ensure accompanying documentation and clients of the tensor API are updated to prevent confusion with flush() and invalidate().
uint8_t * get_data_buffer() const { return _data + _info.offset; }
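
One way the renamed accessor can be tied to the cache-maintenance calls it is easy to confuse with, sketched with invented names and assumed semantics; the real tensor.hpp may order flush()/invalidate() differently:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: the read path invalidates before handing out a pointer,
// the write path expects a flush afterwards. Not the actual implementation.
class tensor_view {
  public:
    tensor_view(uint8_t * data, size_t offset) : _data(data), _offset(offset) {}

    // Reader path: discard stale cache lines before the DSP reads data the
    // host may have written.
    const uint8_t * get_read_buffer() {
        invalidate();
        return _data + _offset;
    }

    // Writer path: hand out a mutable pointer; the caller flushes once the
    // write is complete so the host sees the update.
    uint8_t * get_write_buffer() { return _data + _offset; }

    void flush() { /* cache clean over [_data + _offset, end of tensor) */ }
    void invalidate() { /* cache invalidate over the same range */ }

  private:
    uint8_t * _data;
    size_t    _offset;
};
```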

@chraac requested a review from Copilot on May 15, 2025 16:58

Copilot AI left a comment


Pull Request Overview

This PR adds performance optimizations and infrastructure for Hexagon NPU backend, including:

  • Introduces VTCM memory management, scoped performance timers, and an f16-to-f32 lookup table (see the VTCM sketch after this list).
  • Refactors tensor I/O (read/write buffers), graph execution to pass performance table, and buffer reset functionality.
  • Extends build scripts (CMake) to enable quantized tensor support and performance tracking flags.
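
A minimal sketch of the VTCM RAII idea mentioned in the first bullet; vtcm_acquire/vtcm_release are placeholders (stubbed with the regular heap so the example compiles), not the Hexagon compute-resource API the real vtcm_mem.hpp would call:

```cpp
#include <cstddef>
#include <cstdlib>

// Placeholder allocation hooks; a real implementation would request VTCM from
// the Hexagon runtime instead of the heap.
static void * vtcm_acquire(size_t size_bytes) { return std::malloc(size_bytes); }
static void   vtcm_release(void * ptr)        { std::free(ptr); }

class vtcm_scratch {
  public:
    explicit vtcm_scratch(size_t size_bytes)
        : _ptr(vtcm_acquire(size_bytes)), _size(_ptr ? size_bytes : 0) {}

    ~vtcm_scratch() {
        if (_ptr) {
            vtcm_release(_ptr);  // released automatically when the scope ends
        }
    }

    // VTCM is a scarce resource, so copies are simply disallowed in this sketch.
    vtcm_scratch(const vtcm_scratch &) = delete;
    vtcm_scratch & operator=(const vtcm_scratch &) = delete;

    bool   is_valid() const { return _ptr != nullptr; }
    void * get() const { return _ptr; }
    size_t size() const { return _size; }

  private:
    void * _ptr  = nullptr;
    size_t _size = 0;
};
```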

Reviewed Changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated no comments.

Summary per file:

  • host/host.cpp: Removed unused params from init_device call
  • host/graph.cpp: Skip RESHAPE ops and update debug log to include graph handle
  • host/buffer.hpp/.cpp: Added clear_tensors() and backend_buffer_reset interface
  • device/vtcm_mem.hpp: New VTCM allocation RAII class
  • device/util.hpp: Added performance timer class and tracking macros
  • device/thread_pool.hpp: Exposed sync_thread to wait on barrier
  • device/tensor.hpp: Split get_data() into get_read_buffer()/get_write_buffer()
  • device/quants.{hpp,cpp}: New quantization/dequantization utilities
  • device/op_*.{hpp,cpp}: Refactored op implementations for vectorized and quantized paths
  • device/graph.{hpp,cpp}: Pass F16-to-F32 table into graph compute and invalidate cache after ops
  • device/device.cpp: Initialize f16-to-f32 table in device context
  • npu/CMakeLists.txt: Added options for quantized tensors & performance tracking, consolidated target logic
  • qnn/CMakeLists.txt: Enabled copying of dynamic libs under new BACKEND_RUNTIME_LIBS
Comments suppressed due to low confidence (3)

ggml/src/ggml-qnn/npu/device/util.hpp:7

  • The scoped-timer code uses va_list and vsnprintf but we haven’t included <cstdarg> or <cstdio>. Add these headers to avoid compilation errors.
#include <cstring>
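
For reference, a self-contained scoped-timer sketch that carries the two headers the comment asks for; it uses std::chrono and printf for portability, whereas the real util.hpp presumably reads a DSP cycle counter:

```cpp
#include <chrono>
#include <cstdarg>  // va_list, va_start, va_end
#include <cstdio>   // vsnprintf, printf
#include <cstring>

class scoped_timer {
  public:
    scoped_timer(const char * fmt, ...) : _start(std::chrono::steady_clock::now()) {
        va_list args;
        va_start(args, fmt);
        vsnprintf(_label, sizeof(_label), fmt, args);
        va_end(args);
    }

    ~scoped_timer() {
        const auto elapsed = std::chrono::steady_clock::now() - _start;
        const auto usec    = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
        printf("[perf] %s: %lld us\n", _label, static_cast<long long>(usec));
    }

  private:
    char                                  _label[128] = {};
    std::chrono::steady_clock::time_point _start;
};

// Usage: scoped_timer timer("mul_mat %d x %d", rows, cols);
```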

ggml/src/ggml-qnn/npu/device/quants.cpp:88

  • The inner loop increments dst++ and then does dst += 32, which may misplace the pointer on subsequent iterations. Consider using a separate index or computing offsets rather than mutating dst directly.
            for (int j = 0; j < QUANT_K_BLOCK_SIZE; j += 64) {
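
A sketch of the offset-based indexing the comment suggests, written over an invented 4-bit block layout; the constants and packing here are assumptions, not the actual quants.cpp format, and only the indexing style is the point:

```cpp
#include <cstdint>

// Hypothetical layout: each 64-element chunk stores two interleaved 32-element
// halves packed as 4-bit values.
constexpr int QUANT_K_BLOCK_SIZE = 256;

void dequantize_block_offsets(const uint8_t * src, float * dst) {
    for (int j = 0; j < QUANT_K_BLOCK_SIZE; j += 64) {
        for (int i = 0; i < 32; ++i) {
            const uint8_t packed = src[j / 2 + i];
            // Writing through computed offsets keeps dst itself immutable, so
            // there is no dst++ / dst += 32 bookkeeping to get wrong between
            // iterations.
            dst[j + i]      = static_cast<float>(packed & 0x0F);
            dst[j + i + 32] = static_cast<float>(packed >> 4);
        }
    }
}
```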

ggml/src/ggml-qnn/npu/CMakeLists.txt:29

  • [nitpick] The CMake logic for enabling quantized tensors and performance tracking is duplicated in host and DSP build sections. Consider extracting common definitions (GGML_HEXAGON_ENABLE_QUANTIZED_TENSORS, GGML_HEXAGON_ENABLE_PERFORMANCE_TRACKING) into a shared macro or include to reduce duplication.
function(add_device_target target_name DSP_ARCH IS_SIMULATOR BUILD_CPU_COUNT)

@chraac requested a review from Copilot on May 16, 2025 05:02

Copilot AI left a comment


Pull Request Overview

This PR adds performance optimizations for the Hexagon NPU backend, including FP16→FP32 conversion and quantized-tensor support, VTCM memory management, L2 prefetching, and scoped performance timers.

  • Introduce an FP16→FP32 lookup table in the device context and pass it through graph compute
  • Implement vtcm_mem for on‐chip memory allocation and integrate L2 prefetch into matrix‐mul and element‐wise kernels (see the prefetch sketch after this list)
  • Add npu_scoped_timer utilities and macros to measure and log execution cycles
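
A rough sketch of the prefetch-ahead pattern from the second bullet, using the portable __builtin_prefetch; the actual kernels presumably use Hexagon's own L2 fetch mechanism, and the row layout and processing here are assumed for illustration:

```cpp
#include <cstddef>

// Stub per-row work so the sketch is self-contained: reduce one row to a sum.
static void process_row(const float * row, size_t n, float * out) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += row[i];
    }
    *out = acc;
}

// Start pulling the next row toward the cache while the current one is being
// processed, so memory latency overlaps with compute.
void process_all_rows(const float * rows, size_t row_count, size_t row_stride,
                      size_t n, float * out) {
    for (size_t r = 0; r < row_count; ++r) {
        if (r + 1 < row_count) {
            __builtin_prefetch(rows + (r + 1) * row_stride, /*rw=*/0, /*locality=*/1);
        }
        process_row(rows + r * row_stride, n, out + r);
    }
}
```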

Reviewed Changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 5 comments.

Summary per file:

  • ggml/src/ggml-qnn/npu/host/host.cpp: Update init_device call signature
  • ggml/src/ggml-qnn/npu/device/vtcm_mem.hpp: New vtcm_mem class for VTCM allocation
  • ggml/src/ggml-qnn/npu/device/quants.cpp: FP16→FP32 table init and dequantization routines
  • ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp: Matrix‐multiply optimization with caching and prefetch
  • ggml/src/ggml-qnn/npu/CMakeLists.txt: New build flags for quantization, performance tracking, DSP targets
Comments suppressed due to low confidence (2)

ggml/src/ggml-qnn/npu/device/quants.cpp:132

  • [nitpick] This function and the associated dequantization routines introduce core behavior; consider adding unit tests to cover boundary cases and verify correctness.
bool init_f16_f32_table(float * table, size_t count) {
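
Since the comment asks for tests, here is a self-contained example of the kind of boundary check that could back init_f16_f32_table; the half-to-float helper below is a generic scalar conversion written for this sketch, not the repository's implementation:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Generic scalar FP16 -> FP32 conversion used only to fill the example table.
static float fp16_bits_to_fp32(uint16_t h) {
    const uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    const uint32_t mant = h & 0x3FFu;
    float value;
    if (exp == 0) {
        value = std::ldexp(static_cast<float>(mant), -24);  // zero / subnormal
    } else if (exp == 31) {
        value = mant ? std::numeric_limits<float>::quiet_NaN()
                     : std::numeric_limits<float>::infinity();
    } else {
        value = std::ldexp(static_cast<float>(mant + 1024), static_cast<int>(exp) - 25);
    }
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    bits |= sign;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

int main() {
    std::vector<float> table(1u << 16);  // one entry per FP16 bit pattern
    for (size_t i = 0; i < table.size(); ++i) {
        table[i] = fp16_bits_to_fp32(static_cast<uint16_t>(i));
    }
    // Boundary cases: +0, 1.0, a negative value, and +infinity.
    assert(table[0x0000] == 0.0f);
    assert(table[0x3C00] == 1.0f);
    assert(table[0xC000] == -2.0f);
    assert(std::isinf(table[0x7C00]));
    return 0;
}
```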

ggml/src/ggml-qnn/npu/host/host.cpp:60

  • The call site was updated to omit the params argument, but verify that the init_device overload without parameters matches the expected behavior or update callers and signature consistently.
if (!dev_obj->init_device()) {

@chraac moved this to In progress in qnn backend on May 16, 2025
@chraac merged commit 295f7f5 into dev-refactoring on May 16, 2025
github-project-automation bot moved this from In progress to Done in qnn backend on May 16, 2025
Labels: enhancement (New feature or request)
Projects: qnn backend (Status: Done)