feat: perf opt part3 #42
Conversation
This reverts commit efa88deb0e8265614fd91db3c3dba777c00e858b.
This reverts commit bb419f89ca4599470d61d636fe6fa1e033d62748.
Pull Request Overview
This PR improves performance and refactors several components related to Hexagon NPU operations and memory management. Key changes include:
- API updates for device and tensor initialization (e.g., removing unused parameters, renaming `get_data()` to `get_data_buffer()`, and adding new buffer management functions)
- Enhancements to the op implementation and thread pool synchronization across graph compute functions
- Updated build system support for quantized tensor operations
Reviewed Changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.
File | Description
--- | ---
ggml/src/ggml-qnn/npu/host/host.cpp | Removed extra parameters from device initialization, updated logging |
ggml/src/ggml-qnn/npu/host/graph.cpp | Added support for a new f16-to-f32 table and enhanced debug logging |
ggml/src/ggml-qnn/npu/host/buffer.{hpp,cpp} | Added clear_tensors and backend_buffer_reset implementations |
ggml/src/ggml-qnn/npu/device/{op_impl.cpp,graph.cpp,device.cpp} | Updated API design for compute functions, thread synchronization and f16 table usage |
ggml/src/ggml-qnn/npu/CMakeLists.txt | Extended build configuration to support quantized tensors |
Comments suppressed due to low confidence (4)
ggml/src/ggml-qnn/npu/device/op_impl.cpp:280
- The new `compute_funcs` array indexed by tensor data type requires careful maintenance. Confirm that the mapping from `npu_device_tensor_data_type` to the corresponding compute function is exhaustive and correctly documented.
return kOpCapabilities[op].compute_funcs[type];
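One way to make that mapping self-checking is to tie the table width to the enum so a new data type cannot be added without extending every row. The sketch below is hypothetical: the enum values, kernel names, and `op_capabilities` layout are illustrative stand-ins, not the actual NPU headers.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative stand-in for the device-side data-type enum; the trailing
// COUNT sentinel is what lets the table size be checked at compile time.
enum npu_device_tensor_data_type { NPU_DATA_TYPE_F32, NPU_DATA_TYPE_F16, NPU_DATA_TYPE_Q4_0, NPU_DATA_TYPE_COUNT };

typedef void (*compute_func_t)(const void * src, void * dst, size_t n);

static void compute_f32(const void *, void *, size_t) { /* f32 kernel body */ }
static void compute_f16(const void *, void *, size_t) { /* f16 kernel body */ }

struct op_capabilities {
    compute_func_t compute_funcs[NPU_DATA_TYPE_COUNT];  // one entry per data type
};

// A null entry documents "unsupported" explicitly instead of leaving a gap.
static constexpr op_capabilities kOpCapabilities[] = {
    { { compute_f32, compute_f16, nullptr } },  // op 0: quantized path not implemented
};

// Adding a data type without widening every row now fails to compile.
static_assert(sizeof(kOpCapabilities[0].compute_funcs) / sizeof(compute_func_t) == NPU_DATA_TYPE_COUNT,
              "compute_funcs must cover every npu_device_tensor_data_type");

compute_func_t get_compute_func(size_t op, npu_device_tensor_data_type type) {
    return kOpCapabilities[op].compute_funcs[type];
}
```

Callers can then treat a `nullptr` result as "type not supported for this op" rather than risking an out-of-bounds read.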
ggml/src/ggml-qnn/npu/device/graph.cpp:74
- [nitpick] Synchronizing the thread pool after processing each tensor operation may introduce overhead. Consider evaluating whether batching synchronization calls could further improve performance if profiling indicates a bottleneck.
pool->sync_thread();
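The batching idea can be sketched as follows. This is a toy model, not the real `thread_pool`: the `sync_calls` counter and the `reads_previous_output` flag are invented for illustration, and the real dependency test would come from the graph structure.

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for the device-side types; only the sync pattern matters here.
struct thread_pool {
    int sync_calls = 0;
    void sync_thread() { ++sync_calls; }  // real impl waits on a barrier
};

struct op { bool reads_previous_output; };

// Current behavior: one barrier per op, regardless of dependencies.
int run_sync_every_op(thread_pool & pool, const std::vector<op> & ops) {
    for (const op & o : ops) { (void)o; pool.sync_thread(); }
    return pool.sync_calls;
}

// Batched alternative: only synchronize before an op that actually consumes
// another op's output, plus one final barrier so results reach the host.
int run_batched_sync(thread_pool & pool, const std::vector<op> & ops) {
    for (const op & o : ops) {
        if (o.reads_previous_output) pool.sync_thread();
    }
    pool.sync_thread();
    return pool.sync_calls;
}
```

Whether this helps depends on how often ops are truly independent; as the comment says, it is only worth pursuing if profiling shows the per-op barrier is a bottleneck.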
ggml/src/ggml-qnn/npu/device/device.cpp:222
- Passing the `f16_to_f32_table` pointer into `graph::compute` is a notable API change. Please verify that all callers of `compute` are updated consistently to handle this new parameter.
if (!graph->compute(dev_ctx->thread_pool.get(), dev_ctx->f16_to_f32_table.get())) {
ggml/src/ggml-qnn/npu/device/tensor.hpp:82
- Renaming `get_data()` to `get_data_buffer()` clarifies usage, but ensure accompanying documentation and clients of the tensor API are updated to prevent confusion with `flush()` and `invalidate()`.
uint8_t * get_data_buffer() const { return _data + _info.offset; }
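The later read/write split (see the second review below) makes the cache-maintenance contract explicit in the accessor names. The sketch here is illustrative only: the constructor, member names, and empty cache hooks are assumptions, not the real `device/tensor.hpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative sketch of a read/write accessor split with explicit cache hooks.
class tensor {
  public:
    tensor(uint8_t * data, size_t offset) : _data(data), _offset(offset) {}

    // Read path: the data may have been written by another core, so the
    // cached copy is invalidated before the caller reads through it.
    const uint8_t * get_read_buffer() {
        invalidate();
        return _data + _offset;
    }

    // Write path: the caller produces data for another core and is expected
    // to call flush() once it is done writing.
    uint8_t * get_write_buffer() { return _data + _offset; }

    void flush()      { /* real impl would clean the data cache here */ }
    void invalidate() { /* real impl would invalidate the data cache here */ }

  private:
    uint8_t * _data   = nullptr;
    size_t    _offset = 0;
};
```

With separate accessors, a caller can no longer obtain a writable pointer without it being obvious that a `flush()` is owed afterwards.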
Pull Request Overview
This PR adds performance optimizations and infrastructure for Hexagon NPU backend, including:
- Introduces VTCM memory management, scoped performance timers, and an f16-to-f32 lookup table.
- Refactors tensor I/O (split read/write buffers), graph execution (the f16-to-f32 table is now passed through compute), and buffer reset functionality.
- Extends the build scripts (CMake) to enable quantized tensor support and performance-tracking flags.
Reviewed Changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated no comments.
File | Description
--- | ---
host/host.cpp | Removed unused params from init_device call |
host/graph.cpp | Skip RESHAPE ops and update debug log to include graph handle |
host/buffer.hpp/.cpp | Added clear_tensors() and backend_buffer_reset interface |
device/vtcm_mem.hpp | New VTCM allocation RAII class |
device/util.hpp | Added performance timer class and tracking macros |
device/thread_pool.hpp | Exposed sync_thread to wait on barrier |
device/tensor.hpp | Split get_data() into get_read_buffer()/get_write_buffer()
device/quants.{hpp,cpp} | New quantization/dequantization utilities |
device/op_*.{hpp,cpp} | Refactored op implementations for vectorized and quantized paths |
device/graph.{hpp,cpp} | Pass F16-to-F32 table into graph compute and invalidate cache after ops |
device/device.cpp | Initialize f16-to-f32 table in device context |
npu/CMakeLists.txt | Added options for quantized tensors & performance tracking, consolidated target logic |
qnn/CMakeLists.txt | Enabled copying of dynamic libs under new BACKEND_RUNTIME_LIBS |
Comments suppressed due to low confidence (3)
ggml/src/ggml-qnn/npu/device/util.hpp:7
- The scoped-timer code uses `va_list` and `vsnprintf` but we haven't included `<cstdarg>` or `<cstdio>`. Add these headers to avoid compilation errors.
#include <cstring>
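A minimal sketch of the formatting path in question, with the missing headers included explicitly rather than relying on transitive includes. The class name and buffer size are illustrative; the real `npu_scoped_timer` also reads cycle counters.

```cpp
#include <cassert>
#include <cstdarg>  // va_list, va_start, va_end
#include <cstdio>   // vsnprintf
#include <cstring>

// Illustrative scoped-timer label formatter; only the varargs plumbing that
// the review comment concerns is shown.
class npu_scoped_timer_sketch {
  public:
    explicit npu_scoped_timer_sketch(const char * fmt, ...) {
        va_list args;
        va_start(args, fmt);
        vsnprintf(_label, sizeof(_label), fmt, args);
        va_end(args);
    }
    const char * label() const { return _label; }

  private:
    char _label[64] = {};
};
```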
ggml/src/ggml-qnn/npu/device/quants.cpp:88
- The inner loop increments `dst++` and then does `dst += 32`, which may misplace the pointer on subsequent iterations. Consider using a separate index or computing offsets rather than mutating `dst` directly.
for (int j = 0; j < QUANT_K_BLOCK_SIZE; j += 64) {
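The offset-based rewrite the comment suggests could look like this. The nibble layout and scaling below are placeholders (a q4-style "low/high nibbles land 32 elements apart" pattern), not the actual layout in `quants.cpp`; the point is only that `dst` stays immutable and every write position is computed from `j` and `i`.

```cpp
#include <cassert>

// Placeholder block size; the real constant lives in the quantization headers.
constexpr int QUANT_K_BLOCK_SIZE = 256;

// Each 64-element chunk unpacks 32 source bytes: low nibbles fill the first
// 32 outputs, high nibbles the next 32. No pointer mutation anywhere.
void dequant_block_sketch(const unsigned char * src, float * dst) {
    for (int j = 0; j < QUANT_K_BLOCK_SIZE; j += 64) {
        for (int i = 0; i < 32; ++i) {
            const unsigned char b = src[j / 2 + i];
            dst[j + i]      = static_cast<float>(b & 0x0F) - 8.0f;  // low nibble
            dst[j + i + 32] = static_cast<float>(b >> 4)   - 8.0f;  // high nibble
        }
    }
}
```

Because nothing mutates `dst`, the final pointer position cannot drift between outer-loop iterations, which is exactly the failure mode the review flags.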
ggml/src/ggml-qnn/npu/CMakeLists.txt:29
- [nitpick] The CMake logic for enabling quantized tensors and performance tracking is duplicated in the host and DSP build sections. Consider extracting the common definitions (`GGML_HEXAGON_ENABLE_QUANTIZED_TENSORS`, `GGML_HEXAGON_ENABLE_PERFORMANCE_TRACKING`) into a shared macro or include to reduce duplication.
function(add_device_target target_name DSP_ARCH IS_SIMULATOR BUILD_CPU_COUNT)
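One way to factor out the duplication is a shared macro that both build sections call. This is a hypothetical sketch: the macro and target names are illustrative, and only the two option names quoted in the comment are taken from the PR.

```cmake
# Hypothetical shared macro: applies the feature flags to any target so the
# host and DSP sections no longer repeat the same if() blocks.
macro(hexagon_apply_feature_flags target)
    if(GGML_HEXAGON_ENABLE_QUANTIZED_TENSORS)
        target_compile_definitions(${target} PRIVATE GGML_HEXAGON_ENABLE_QUANTIZED_TENSORS)
    endif()
    if(GGML_HEXAGON_ENABLE_PERFORMANCE_TRACKING)
        target_compile_definitions(${target} PRIVATE GGML_HEXAGON_ENABLE_PERFORMANCE_TRACKING)
    endif()
endmacro()

# Usage in both sections (target names illustrative):
# hexagon_apply_feature_flags(hexagon_npu_host)
# hexagon_apply_feature_flags(hexagon_npu_dsp)
```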
Pull Request Overview
This PR adds performance optimizations for the Hexagon NPU backend, including FP16→FP32 dequantization support, VTCM memory management, L2 prefetching, and scoped performance timers.
- Introduce an FP16→FP32 lookup table in the device context and pass it through graph compute
- Implement `vtcm_mem` for on-chip memory allocation and integrate L2 prefetch into the matrix-mul and element-wise kernels
- Add `npu_scoped_timer` utilities and macros to measure and log execution cycles
Reviewed Changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 5 comments.
File | Description
--- | ---
ggml/src/ggml-qnn/npu/host/host.cpp | Update init_device call signature |
ggml/src/ggml-qnn/npu/device/vtcm_mem.hpp | New vtcm_mem class for VTCM allocation |
ggml/src/ggml-qnn/npu/device/quants.cpp | FP16→FP32 table init and dequantization routines |
ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp | Matrix‐multiply optimization with caching and prefetch |
ggml/src/ggml-qnn/npu/CMakeLists.txt | New build flags for quantization, performance tracking, DSP targets |
Comments suppressed due to low confidence (2)
ggml/src/ggml-qnn/npu/device/quants.cpp:132
- [nitpick] This function and the associated dequantization routines introduce core behavior; consider adding unit tests to cover boundary cases and verify correctness.
bool init_f16_f32_table(float * table, size_t count) {
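The kind of boundary test the comment asks for could look like this. `fill_f16_table` below is a hedged stand-in for the real `init_f16_f32_table` (whose implementation is not shown in this review): it decodes every IEEE 754 half-precision bit pattern in software, so the assertions exercise zero, exact powers of two, and infinity.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Reference fp16 -> fp32 decode covering subnormal, normal, and inf/NaN cases.
static float f16_to_f32_ref(uint16_t h) {
    const uint32_t sign = (h >> 15) & 1, exp = (h >> 10) & 0x1F, mant = h & 0x3FF;
    float v;
    if (exp == 0)       v = std::ldexp(static_cast<float>(mant), -24);           // subnormal
    else if (exp == 31) v = mant ? NAN : INFINITY;                               // inf / NaN
    else                v = std::ldexp(static_cast<float>(mant | 0x400), static_cast<int>(exp) - 25);
    return sign ? -v : v;
}

// Stand-in for init_f16_f32_table: fills one entry per 16-bit pattern and
// rejects tables too small to hold all 2^16 of them.
static bool fill_f16_table(float * table, size_t count) {
    if (count < 65536) return false;
    for (size_t i = 0; i < 65536; ++i) table[i] = f16_to_f32_ref(static_cast<uint16_t>(i));
    return true;
}
```

A unit test against the real function would compare its table entry-by-entry with a reference decoder like this one, paying particular attention to the subnormal range and the inf/NaN patterns.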
ggml/src/ggml-qnn/npu/host/host.cpp:60
- The call site was updated to omit the `params` argument, but verify that the `init_device` overload without parameters matches the expected behavior, or update the callers and signature consistently.
if (!dev_obj->init_device()) {
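One way to avoid overload drift between parameterless and parameterized call sites is a single signature with a defaulted argument, sketched below. The `device_params` struct, its fields, and the default values are all hypothetical.

```cpp
#include <cassert>

// Hypothetical parameter bundle; fields and defaults are illustrative.
struct device_params {
    int num_threads = 4;
};

class device {
  public:
    // One signature serves both init_device() and init_device(custom_params),
    // so old and new call sites cannot diverge onto different overloads.
    bool init_device(const device_params & params = device_params{}) {
        _num_threads = params.num_threads;
        _initialized = true;
        return true;
    }

    bool initialized() const { return _initialized; }
    int  num_threads() const { return _num_threads; }

  private:
    bool _initialized = false;
    int  _num_threads = 0;
};
```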
Related to #34
Passed `test-backend-ops` on a Snapdragon 8 Gen 2.

Full log:
test-backend-ops_all.debug.hexagon.9bbcef491.log
test-backend-ops_all.debug.hexagon.9bbcef491_logcat.log