feat: perf opt part3 #42

Merged
merged 104 commits into from
May 16, 2025
f3aef56
add f16 support to etl wise op
chraac Apr 24, 2025
b05a731
wip
chraac Apr 24, 2025
85beaf1
Revert "wip"
chraac Apr 24, 2025
835cba3
qf32 for mul
chraac Apr 24, 2025
4a17986
wip
chraac Apr 25, 2025
d0c2cd0
Revert "wip"
chraac Apr 25, 2025
e62de91
disable fp16 add/sub
chraac Apr 25, 2025
86ea79a
tempate trick
chraac Apr 25, 2025
9fe4f54
wip
chraac Apr 25, 2025
7eccc54
add f16 mulmat
chraac Apr 25, 2025
497fd85
add log
chraac Apr 25, 2025
805d21e
fix view liked op
chraac Apr 25, 2025
eb6abbe
add log
chraac Apr 25, 2025
6459cf7
fix f16 mulmat
chraac Apr 25, 2025
6e538a2
add quant type
chraac Apr 25, 2025
f0d006d
wip
chraac Apr 26, 2025
544084a
add l2fetch
chraac Apr 26, 2025
e809454
add vtcm_mem
chraac Apr 26, 2025
604d69a
wip
chraac Apr 27, 2025
cdb966e
fix fetch
chraac Apr 27, 2025
2d1bf66
use vtcm cache in mulmat
chraac Apr 27, 2025
7b90ddd
revert vtcm cache
chraac Apr 27, 2025
18a89f7
cache plane
chraac Apr 27, 2025
97a6cc6
small opt for plane cache
chraac Apr 27, 2025
e602c73
cache plane for some element wise op
chraac Apr 27, 2025
07eb21b
wip
chraac Apr 28, 2025
aef40f9
enable fetch even on vtcm
chraac Apr 28, 2025
1caff04
wip
chraac Apr 29, 2025
bdc4ec5
copy sysMonApp
chraac Apr 29, 2025
0e109ba
small opt
chraac Apr 30, 2025
73f05de
init ltu
chraac Apr 30, 2025
ebc1531
add compute_params
chraac Apr 30, 2025
9a7eb17
add op common header
chraac May 1, 2025
428662a
move vtcm_mem allocation to compute_param
chraac May 1, 2025
deb62ea
fallback to memcache when vtcm allocate failed
chraac May 1, 2025
c2b304b
pre-calculate quantize type
chraac May 1, 2025
77573c2
wip
chraac May 1, 2025
3c2a9a6
try fix test failure
chraac May 2, 2025
b77cba7
try fix mulmat nan
chraac May 3, 2025
d2ec5af
fix inf in mulmat
chraac May 5, 2025
0498c12
remove debug logs
chraac May 5, 2025
a6c35cd
wip
chraac May 5, 2025
c12e516
small refactoring on the dequant row func
chraac May 6, 2025
bdf582e
fix typo
chraac May 6, 2025
9f8e83a
improve logging
chraac May 6, 2025
9d50510
Merge branch 'dev-refactoring' into dev-perf-opt-part3
chraac May 7, 2025
38ac1cc
Merge branch 'dev-refactoring' into dev-perf-opt-part3
chraac May 8, 2025
eaaabcb
add q4_0 and q8_0
chraac May 8, 2025
d66042d
Merge branch 'dev-refactoring' into dev-perf-opt-part3
chraac May 8, 2025
c156325
wip
chraac May 8, 2025
5d2efb7
wip
chraac May 8, 2025
8d70e5d
build hexagon libs in cmake
chraac May 8, 2025
e11d1d5
wip
chraac May 8, 2025
f360110
fix qnn only build flag
chraac May 8, 2025
744aed0
fix typo
chraac May 8, 2025
5e83078
fix todo
chraac May 8, 2025
9e8b095
wip
chraac May 9, 2025
e91dfe9
wip
chraac May 9, 2025
a19bec3
add to_float
chraac May 9, 2025
07bafe2
use to)float directly instead of ltu
chraac May 9, 2025
26e6e01
wip
chraac May 9, 2025
4b28f72
cache f16_to_f32 table into vtcm
chraac May 9, 2025
b3b1f92
print tensor dims at log
chraac May 9, 2025
65d0621
init device in supports_op_impl
chraac May 9, 2025
51dc50d
revert cache ltu
chraac May 9, 2025
018d531
wip
chraac May 9, 2025
c107105
wip
chraac May 9, 2025
11003ee
fix graph calc issues by validate cache manually after each op
chraac May 10, 2025
ca6bf62
add cache invalidate func
chraac May 10, 2025
4fa6445
enable cache fallback only in quantize tensors
chraac May 10, 2025
ad3e05b
add option to disable quantized tensors
chraac May 10, 2025
bfabb1a
propagate the asan flag to npu build
chraac May 11, 2025
8514312
fix asan option
chraac May 11, 2025
f3e32da
wip
chraac May 13, 2025
631351b
invalidate tensors after finished
chraac May 13, 2025
d81fe11
Merge branch 'dev-refactoring' into dev-perf-opt-part3
chraac May 13, 2025
8c4efe3
implement backend_buffer_reset
chraac May 13, 2025
261b12f
wip
chraac May 13, 2025
8a1ab68
wip
chraac May 13, 2025
6d526d0
refactoring plane cache mechanism
chraac May 14, 2025
820359f
wip
chraac May 14, 2025
8402cc5
split row elements across thread
chraac May 14, 2025
1713e2a
use table for f16 to f32 conversion
chraac May 14, 2025
7175394
sync after each op
chraac May 15, 2025
9bbcef4
small refactoring to invalidate l2 cahce
chraac May 15, 2025
d2e833e
wip
chraac May 15, 2025
eeee96c
opt on float fetching
chraac May 15, 2025
95ecabc
unroll for loop manually
chraac May 15, 2025
a54330b
reduce vtcm usage
chraac May 15, 2025
8a18192
add perf tracking for npu
chraac May 15, 2025
56644b0
print dimensions for profiler log
chraac May 15, 2025
d5f026d
wip
chraac May 15, 2025
2eba52e
wip
chraac May 15, 2025
46bf491
wip
chraac May 15, 2025
2b4d5ad
add sub proc tracker
chraac May 16, 2025
185a095
fix typo
chraac May 16, 2025
703add6
print pcycles
chraac May 16, 2025
4e0b369
wip
chraac May 16, 2025
89c6b0b
wip
chraac May 16, 2025
33c8f6c
prefetch rows
chraac May 16, 2025
d23bf4b
add l2fetch_row
chraac May 16, 2025
56913cf
small tweak based on perf tracer
chraac May 16, 2025
d1f78f8
opt l2 fetching
chraac May 16, 2025
faadba0
wip
chraac May 16, 2025
cache plane for some element wise op
chraac committed Apr 28, 2025
commit e602c737cf96011a71ab5d2851eee1fe6d62d793
59 changes: 42 additions & 17 deletions ggml/src/ggml-qnn/npu/device/op_impl.cpp
@@ -7,6 +7,7 @@

#include "op_mul_mat.hpp"
#include "util.hpp"
+ #include "vtcm_mem.hpp"

namespace {

@@ -140,30 +141,54 @@ template <auto _RowFunc> bool element_wise_op(hexagon::tensor * out, size_t tidx
return false;
}

- const auto * src0_ptr = reinterpret_cast<const uint8_t *>(src0->get_data());
- const auto * src1_ptr = reinterpret_cast<const uint8_t *>(src1->get_data());
- auto * dst_ptr = reinterpret_cast<uint8_t *>(out->get_data());
- auto total_rows = out->get_ne(3) * out->get_ne(2) * out->get_ne(1);
- const auto rows_per_box = out->get_ne(2) * out->get_ne(1);
- const auto start_end = hexagon::get_thread_work_slice(total_rows, tidx, tcnt);
+ const auto * src0_ptr = reinterpret_cast<const uint8_t *>(src0->get_data());
+ const auto * src1_ptr = reinterpret_cast<const uint8_t *>(src1->get_data());
+ auto * dst_ptr = reinterpret_cast<uint8_t *>(out->get_data());
+ auto total_rows = out->get_ne(3) * out->get_ne(2) * out->get_ne(1);
+ const auto rows_per_cube = out->get_ne(2) * out->get_ne(1);
+ const auto start_end = hexagon::get_thread_work_slice(total_rows, tidx, tcnt);

+ if (start_end.first >= start_end.second) {
+ return true;
+ }
+
+ std::unique_ptr<hexagon::vtcm_mem> src1_plane_cache;
+ uint8_t * src1_plane_cache_ptr = nullptr;
+ if (src0->get_ne(1) / src1->get_ne(1) > 1) {
+ src1_plane_cache = std::make_unique<hexagon::vtcm_mem>(src1->get_nb(1) * src1->get_ne(1), false);
+ src1_plane_cache_ptr = src1_plane_cache->get_mem();
+ DEVICE_LOG_DEBUG("element_wise_op vtcm_mem allocated");
+ }
+
for (int64_t ir = start_end.first; ir < start_end.second; ++ir) {
- const auto i03 = ir / rows_per_box;
- const auto i02 = ir / out->get_ne(1) - i03 * out->get_ne(2);
- const auto i01 = ir % out->get_ne(1);
- const auto i13 = i03 % src1->get_ne(3);
- const auto i12 = i02 % src1->get_ne(2);
- const auto i11 = i01 % src1->get_ne(1);
- auto * src0_row = src0_ptr + i03 * src0->get_nb(3) + i02 * src0->get_nb(2) + i01 * src0->get_nb(1);
- auto * src1_row = src1_ptr + i13 * src1->get_nb(3) + i12 * src1->get_nb(2) + i11 * src1->get_nb(1);
- auto * dst_row = dst_ptr + i03 * out->get_nb(3) + i02 * out->get_nb(2) + i01 * out->get_nb(1);
+ const auto i03 = ir / rows_per_cube;
+ const auto i02 = ir / out->get_ne(1) - i03 * out->get_ne(2);
+ const auto i01 = ir % out->get_ne(1);
+ const auto i13 = i03 % src1->get_ne(3);
+ const auto i12 = i02 % src1->get_ne(2);
+ const auto i11 = i01 % src1->get_ne(1);
+
+ auto * src1_plane = src1_ptr + i13 * src1->get_nb(3) + i12 * src1->get_nb(2);
+ if (src1_plane_cache_ptr) {
+ if (i01 == 0 || ir == start_end.first) {
+ memcpy(src1_plane_cache_ptr, src1_plane, src1_plane_cache->get_size());
+ }
+
+ src1_plane = src1_plane_cache_ptr;
+ }
+
+ auto * src0_row = src0_ptr + i03 * src0->get_nb(3) + i02 * src0->get_nb(2) + i01 * src0->get_nb(1);
+ auto * src1_row = src1_plane + i11 * src1->get_nb(1);
+ auto * dst_row = dst_ptr + i03 * out->get_nb(3) + i02 * out->get_nb(2) + i01 * out->get_nb(1);
if (ir + 1 < start_end.second) {
int32_t l2fetch_vectors = Q6_R_min_RR(src0->get_nb(1) / kElementsPerVector, hexagon::kL2FetchAheadVectors);
// TODO: should we use small kL2FetchAheadVectors?
hexagon::l2fetch(src0_row + src0->get_nb(1), hexagon::kBytesPerVector, hexagon::kBytesPerVector,
l2fetch_vectors, 0);
- hexagon::l2fetch(src1_row + src1->get_nb(1), hexagon::kBytesPerVector, hexagon::kBytesPerVector,
- l2fetch_vectors, 0);
+ if (!src1_plane_cache_ptr) {
+ hexagon::l2fetch(src1_row + src1->get_nb(1), hexagon::kBytesPerVector, hexagon::kBytesPerVector,
+ l2fetch_vectors, 0);
+ }
}

_RowFunc(reinterpret_cast<const data_type *>(src0_row), reinterpret_cast<const data_type *>(src1_row),
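
The hunk above splits the output rows across threads, decomposes each flat row index into (i01, i02, i03), maps those onto src1 by modulo for broadcasting, and copies the current src1 plane into a scratch buffer whenever a new plane starts. A minimal host-side sketch of that pattern follows. Everything in it is an assumption made for illustration: `tensor4`, `make_tensor`, `work_slice` and `add_broadcast` are hypothetical names, a `std::vector` stands in for the VTCM buffer, tensors are dense floats with no broadcast along dimension 0, and the scratch buffer is always used, whereas the diff only allocates it when `src0->get_ne(1) / src1->get_ne(1) > 1`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical dense 4-D float tensor; ne[0] is the row length.
struct tensor4 {
    int64_t            ne[4];
    std::vector<float> data;

    // Element offset of row (i1, i2, i3) in contiguous storage.
    size_t row_offset(int64_t i1, int64_t i2, int64_t i3) const {
        return size_t(((i3 * ne[2] + i2) * ne[1] + i1) * ne[0]);
    }
};

// Builds a tensor whose elements are 0, 1, 2, ... in storage order.
tensor4 make_tensor(int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3) {
    tensor4 t{ { ne0, ne1, ne2, ne3 }, std::vector<float>(size_t(ne0 * ne1 * ne2 * ne3)) };
    std::iota(t.data.begin(), t.data.end(), 0.0f);
    return t;
}

// Assumed behaviour of a work-slice helper: split [0, total) into tcnt
// contiguous chunks and return the half-open range owned by thread tidx.
std::pair<int64_t, int64_t> work_slice(int64_t total, size_t tidx, size_t tcnt) {
    const int64_t per_thread = (total + int64_t(tcnt) - 1) / int64_t(tcnt);
    const int64_t start      = std::min<int64_t>(total, per_thread * int64_t(tidx));
    const int64_t end        = std::min<int64_t>(total, start + per_thread);
    return { start, end };
}

// out = src0 + src1, with src1 broadcast over dims 1..3.
// Returns how many times the src1 plane was copied into the scratch buffer.
int add_broadcast(const tensor4 & src0, const tensor4 & src1, tensor4 & out, size_t tidx, size_t tcnt) {
    const int64_t rows_per_cube = out.ne[2] * out.ne[1];
    const int64_t total_rows    = out.ne[3] * rows_per_cube;
    const auto    start_end     = work_slice(total_rows, tidx, tcnt);
    if (start_end.first >= start_end.second) {
        return 0;  // this thread has no rows to process
    }

    // Scratch buffer holding one src1 plane (stand-in for VTCM).
    std::vector<float> plane_cache(size_t(src1.ne[1] * src1.ne[0]));
    int                copies = 0;

    for (int64_t ir = start_end.first; ir < start_end.second; ++ir) {
        const int64_t i03 = ir / rows_per_cube;
        const int64_t i02 = ir / out.ne[1] - i03 * out.ne[2];
        const int64_t i01 = ir % out.ne[1];
        const int64_t i13 = i03 % src1.ne[3];
        const int64_t i12 = i02 % src1.ne[2];
        const int64_t i11 = i01 % src1.ne[1];

        // Refill the cache at the start of each plane and at the slice start.
        if (i01 == 0 || ir == start_end.first) {
            const float * plane = src1.data.data() + src1.row_offset(0, i12, i13);
            std::memcpy(plane_cache.data(), plane, plane_cache.size() * sizeof(float));
            ++copies;
        }

        const float * a = src0.data.data() + src0.row_offset(i01, i02, i03);
        const float * b = plane_cache.data() + i11 * src1.ne[0];
        float *       d = out.data.data() + out.row_offset(i01, i02, i03);
        for (int64_t i = 0; i < out.ne[0]; ++i) {
            d[i] = a[i] + b[i];
        }
    }
    return copies;
}
```

With a 4x6x2x1 `src0` and a 4x2x1x1 `src1`, a single thread refills the cache twice (once per output plane) while processing twelve rows, which is the reuse the plane cache is meant to buy. The `ir == start_end.first` check matters because a thread's slice can begin in the middle of a plane.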
5 changes: 3 additions & 2 deletions ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp
@@ -188,8 +188,9 @@ void mul_mat_impl(hexagon::tensor * src0, hexagon::tensor * src1, hexagon::tenso

if (start_end_row.second - start_end_row.first > 1) {
// cache the src0 plane in VTCM
- src0_plane_cache = std::make_unique<hexagon::vtcm_mem>(src0->get_nb(2), false);
+ src0_plane_cache = std::make_unique<hexagon::vtcm_mem>(src0->get_nb(1) * src0->get_ne(1), false);
src0_plane_cache_ptr = src0_plane_cache->get_mem();
+ DEVICE_LOG_DEBUG("mul_mat_impl vtcm_mem allocated");
}

for (int64_t ip = start_end_plane.first; ip < start_end_plane.second; ip++) {
@@ -200,7 +201,7 @@ void mul_mat_impl(hexagon::tensor * src0, hexagon::tensor * src1, hexagon::tenso
auto * dst_plane = dst_ptr + i3 * dst->get_nb(3) + i2 * dst->get_nb(2);

if (src0_plane_cache_ptr) {
- memcpy(src0_plane_cache_ptr, src0_plane, src0->get_nb(1) * src0->get_ne(1));
+ memcpy(src0_plane_cache_ptr, src0_plane, src0_plane_cache->get_size());
src0_plane = src0_plane_cache_ptr;
}
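
Before this change the cache was allocated with `src0->get_nb(2)` bytes but filled with `src0->get_nb(1) * src0->get_ne(1)` bytes; afterwards both the allocation and the `memcpy` use the same quantity. The commit does not say why, so the following is only one possible reading: the two expressions agree for a contiguous tensor but can differ when the tensor is a view that inherits its parent's strides. The sketch below illustrates that difference with a hypothetical `layout` struct and helper names that are not part of the repository, and it ignores quantized block types.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical shape/stride pair in the ne/nb style used by the diff.
struct layout {
    std::array<int64_t, 4> ne;  // element count per dimension
    std::array<size_t, 4>  nb;  // byte stride per dimension
};

// Strides of a densely packed tensor with the given element size.
layout contiguous(std::array<int64_t, 4> ne, size_t elem_size) {
    layout l{ ne, {} };
    l.nb[0] = elem_size;
    for (int i = 1; i < 4; ++i) {
        l.nb[i] = l.nb[i - 1] * size_t(ne[i - 1]);
    }
    return l;
}

// A view keeping only the first `rows` rows of each plane; the strides are
// inherited from the parent, so nb[2] still spans the parent's full plane.
layout first_rows_view(const layout & parent, int64_t rows) {
    layout v = parent;
    v.ne[1]  = rows;
    return v;
}

// Bytes covered by the rows of one plane, as computed in the new code.
size_t plane_bytes(const layout & l) {
    return l.nb[1] * size_t(l.ne[1]);
}
```

For a contiguous 64x32x4x1 float tensor, `plane_bytes` equals `nb[2]` (8192 bytes). For a view of its first 8 rows, `plane_bytes` drops to 2048 while `nb[2]` stays at 8192, so sizing the buffer from `nb(1) * ne(1)` requests only what the copy actually touches.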
