
Commit 180ab5f

committed: follow project's style

1 parent b0c3013 · commit 180ab5f

File tree: 5 files changed, +165 −80 lines

5 files changed

+165
-80
lines changed

README-qnn.md

+130
@@ -0,0 +1,130 @@

# llama.cpp for QNN

- [Background](#background)
- [News](#news)
- [OS](#os)
- [Hardware](#hardware)
- [Android](#android)
- [Windows](#windows)
- [Q&A](#qa)
- [TODO](#todo)

## Background

Android maintained its position as the leading mobile operating system worldwide in the fourth quarter of 2023 with [a market share of 70.1 percent](https://www.statista.com/statistics/272698/global-market-share-held-by-mobile-operating-systems-since-2009/), and Qualcomm is currently the No. 1 mobile SoC semiconductor company worldwide.

The **QNN** (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK is verified to work with the following versions of the ML frameworks:

- TensorFlow: tf-1.15.0 or tf-2.10.1
- TFLite: tflite-2.3.0
- PyTorch: torch-1.13.1
- ONNX: onnx-1.11.0

The Qualcomm® AI Engine Direct architecture is designed to be modular, allowing a clean separation in software between the different hardware cores/accelerators (CPU, GPU, DSP), which are designated as backends.

![Screenshot from 2024-04-14 11-42-14](https://github.com/zhouwg/kantv/assets/6889919/5d8de93a-7b02-4d6b-8b7f-19d2f829dd4d)

The Qualcomm® AI Engine Direct backends for the different hardware cores/accelerators are compiled into individual core-specific libraries that come packaged with the SDK.

One of the key highlights of Qualcomm® AI Engine Direct is that it provides a unified API to delegate operations such as graph creation and execution across all hardware accelerator backends. This allows users to treat Qualcomm® AI Engine Direct as a hardware abstraction API and port applications easily to different cores.

The Qualcomm® AI Engine Direct API is designed to support an efficient execution model, with capabilities such as graph optimization handled internally. At the same time, it leaves broader functionality, such as model parsing and network partitioning, to higher-level frameworks.

The Qualcomm® AI Engine Direct API and the associated software stack provide all the constructs required by an application to construct, optimize and execute network models on the desired hardware accelerator core. Key constructs are illustrated in the Qualcomm AI Engine Direct Components - High Level View diagram.

![qnn-arch](https://github.com/zhouwg/kantv/assets/6889919/4f4881a6-9a91-4477-aeb2-193591375d75)
### Llama.cpp + QNN

The llama.cpp QNN backend is intended to support **Qualcomm mobile SoCs** first.

## News

- 2024.4.24
  - PR submitted to the ggml community
  - the data path works as expected with whisper.cpp and llama.cpp using the QNN backend, verified on both low-end and high-end Android phones based on Qualcomm mobile SoCs
  - supported ops:
    - GGML_OP_ADD
    - GGML_OP_MUL
    - GGML_OP_MUL_MAT
- 2024.3.29
  - launched "PoC: add QNN backend for Qualcomm mobile SoC"
## OS

| OS               | Status  | Verified               |
|------------------|---------|------------------------|
| Android          | Support | Android 10, Android 14 |
| Windows over ARM | TBD     | TBD                    |

## Hardware

### Qualcomm mobile SoC based Android phone

**Verified devices**

| Qualcomm mobile SoC                   | Status  | Verified Vendor |
|---------------------------------------|---------|-----------------|
| Qualcomm SM8650-AB Snapdragon 8 Gen 3 | Support | Xiaomi 14       |
| Qualcomm low-end mobile SoC series    | Support | Vivo            |

### Qualcomm SoC based Windows

TBD
## Android

### I. Setup Environment

Any **mainstream** Android phone based on a Qualcomm mobile SoC should be supported by llama.cpp + QNN; a Qualcomm SM8650-AB Snapdragon 8 Gen 3 based Android phone is preferred.

### II. Build llama.cpp + QNN backend

Please refer to [project kantv](https://github.com/zhouwg/kantv) first.

A small, standalone Android example (or re-use of [the existing Android example in llama.cpp](https://github.com/ggerganov/llama.cpp/tree/master/examples/llama.android)) would make it easier for community developers to participate in developing and verifying the QNN backend.

### III. Run the inference on a Qualcomm mobile SoC based Android phone

![504893116](https://github.com/zhouwg/kantv/assets/6889919/51f0b277-eca4-4938-86f5-415dbf5897e7)
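For orientation, here is a minimal, hypothetical calling sequence based only on the signatures declared in ggml-qnn.h in this commit. The device indices (0: QNN_CPU, 1: QNN_GPU, 2: QNN_HTP) and the example data path come from the doc comments in the diff below; `ggml_backend_free` is the standard ggml teardown call:

```cpp
// hypothetical usage sketch, based only on the declarations in ggml-qnn.h;
// error handling and the surrounding llama.cpp plumbing are omitted
#include "ggml-qnn.h"
#include "ggml-backend.h"

int main() {
    // device: 0 = QNN_CPU, 1 = QNN_GPU, 2 = QNN_HTP (aka DSP)
    // the second argument is the Android app's data path, typically obtained via JNI
    ggml_backend_t backend = ggml_backend_qnn_init(2, "/data/data/com.ggml.llamacpp/");
    if (backend == nullptr) {
        return 1; // QNN backend unavailable on this device
    }

    // ... build a ggml cgraph and compute it on this backend ...

    ggml_backend_free(backend); // standard ggml backend teardown
    return 0;
}
```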
## Windows

TBD

## Q&A

TBD

### GitHub contribution

Please add the **[ggml-qnn]** prefix/tag in issue/PR titles to help the community check and address them without delay.

## TODO

- only FP32 / FP16 are supported, and the input and output tensors must be of the **same data type** (see the sketch after this list)

- [implementations of the other GGML OPs using the QNN API](https://github.com/zhouwg/llama.cpp/blob/qualcomm_qnn_backend_for_ggml/ggml-qnn.cpp#L3452) are still missing; this work is very similar to [GGML_OP_ADD / GGML_OP_MUL / GGML_OP_MUL_MAT](https://github.com/zhouwg/llama.cpp/blob/qualcomm_qnn_backend_for_ggml/ggml-qnn.cpp#L2983) in ggml-qnn.cpp

- multithreading does not work with the QNN GPU and HTP (aka DSP) backends

- QNN's RPC feature (useful for the QNN HTP (aka DSP) backend) is not used yet

- running multiple QNN backends (CPU/GPU/DSP) simultaneously is not supported
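The first item above implies a capability check before a node is offloaded to QNN. A minimal sketch of such a guard, using ggml's standard tensor fields; the helper name is invented for illustration and is not part of this commit:

```cpp
// hypothetical guard illustrating the FP32/FP16, same-dtype constraint;
// the function name is invented and does not appear in this commit
#include "ggml.h"

static bool ggml_qnn_dtype_supported(const struct ggml_tensor * dst) {
    // only FP32 / FP16 are handled by the current QNN backend
    if (dst->type != GGML_TYPE_F32 && dst->type != GGML_TYPE_F16) {
        return false;
    }
    // input and output tensors must share the same data type
    for (int i = 0; i < 2; i++) {
        const struct ggml_tensor * src = dst->src[i];
        if (src != nullptr && src->type != dst->type) {
            return false;
        }
    }
    return true;
}
```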

ggml-qnn.cpp

+29-57
@@ -1,33 +1,3 @@
-/*
- * MIT license
- * Copyright (C) 2024 GGML Authors
- * SPDX-License-Identifier: MIT
- *
- * this is implementation of ggml QNN(Qualcomm Neural Network, aka AI Engine Direct) backend
- *
- * status:
- *
- * 1. core implementation(data path works fine as expected with whisper.cpp using QNN CPU/GPU backend on Qualcomm's SoC based low-end phone
- *
- * 2. core implementation(data path works fine as expected with whisper.cpp using QNN HTP(aka DSP) backend on Qualcomm's soC based high-end phone
- *
- * 3. core implementation(data path works fine as expected with llama.cpp using QNN CPU/GPU/HTP(aka DSP) backend on Qualcomm's soC based high-end phone
- *
- * 4. GGML_OP_MUL_MAT & GGML_OP_MUL & GGML_OP_ADD using QNN API has been completed
- *
- * todo:
- *
- * 1. lack of implementation of other GGML-OPs using QNN API
- *
- * 2. only support FP32 / FP16 and the input and output tensors must be of the same data type
- *
- * 3. QNN's RPC feature(which useful for QNN HTP(aka DSP) backend) not used
- *
- * 4. multi QNN backend(CPU/GPU/DSP) simultaneously not support
- *
- * 5. multithreading not work with QNN GPU/HTP(aka DSP) backend
- *
- */
 #include <stdio.h>
 #include <stdlib.h>
 #include <stdint.h>
@@ -89,6 +59,19 @@
 class qnn_instance;

 //TODO: should be removed because this is a workaround method during development stage
+//a minor modification is required during development stage to validate the QNN backend on an Android phone:
+//
+//modify from
+//
+//static void ggml_compute_forward(struct ggml_compute_params * params, struct ggml_tensor * tensor)
+//
+//to
+//
+//void ggml_compute_forward(struct ggml_compute_params * params, struct ggml_tensor * tensor)
+//
+//in source file ggml.c#L16156
+//
+//this workaround will not be needed when the final QNN backend is complete
 extern "C" void ggml_compute_forward(struct ggml_compute_params * params, struct ggml_tensor * tensor);

 #if (defined __ANDROID__) || (defined ANDROID) //Qualcomm's QNN could running on Windows over ARM(aka WoA)
@@ -838,7 +821,7 @@ static inline void set_qnn_tensor_memhandle(Qnn_Tensor_t * tensor, Qnn_MemHandle


 static size_t memscpy(void * dst, size_t dstSize, const void * src, size_t copySize) {
-    if (!dst || !src || !dstSize || !copySize)
+    if (!dst || !src || !dstSize || !copySize)
         return 0;

     size_t minSize = dstSize < copySize ? dstSize : copySize;
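For context, memscpy continues past the end of this hunk. A sketch of the complete helper, assuming the usual bounded-copy pattern; the memcpy call and return value are assumptions, not shown in this diff:

```cpp
// sketch of the full helper; only the signature, the guard and the minSize
// line are visible in the hunk above -- the memcpy and return are assumed
#include <cstring>

static size_t memscpy(void * dst, size_t dstSize, const void * src, size_t copySize) {
    if (!dst || !src || !dstSize || !copySize)
        return 0;

    size_t minSize = dstSize < copySize ? dstSize : copySize; // clamp to destination capacity
    memcpy(dst, src, minSize);
    return minSize; // number of bytes actually copied
}
```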
@@ -946,7 +929,7 @@ static int free_qnn_tensor(Qnn_Tensor_t & tensor) {
         QNN_LOG_INFO("it should not happen, pls check");
     } else {
         //TODO:why crash in here? why pointer changed with mul_mat?
-        //memory leak after comment above line
+        //memory leak if the line below stays commented out
         //free(QNN_TENSOR_GET_DIMENSIONS(tensor));
     }

@@ -1043,7 +1026,7 @@ static Qnn_DataType_t qnn_datatype_from_ggml_datatype(enum ggml_type ggmltype) {
 }


-//TODO:
+//TODO: only support GGML_OP_ADD/GGML_OP_MUL/GGML_OP_MUL_MAT
 static const char * qnn_opname_from_ggmlop(enum ggml_op ggmlop) {
     switch (ggmlop) {
         case GGML_OP_ADD:
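The hunk stops at the first case label. A hedged sketch of how the mapping presumably continues for the three supported ops; the QNN_OP_* operator-name constants come from the QNN SDK's QnnOpDef.h, and their exact use here is an assumption rather than something shown in this commit:

```cpp
// illustrative completion of the switch; the QNN_OP_* names are from the
// QNN SDK (QnnOpDef.h) and this mapping is an assumption
static const char * qnn_opname_from_ggmlop(enum ggml_op ggmlop) {
    switch (ggmlop) {
        case GGML_OP_ADD:     return QNN_OP_ELEMENT_WISE_ADD;
        case GGML_OP_MUL:     return QNN_OP_ELEMENT_WISE_MULTIPLY;
        case GGML_OP_MUL_MAT: return QNN_OP_MAT_MUL;
        default:              return nullptr; // op not implemented yet
    }
}
```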
@@ -1204,16 +1187,10 @@ static buf_element_t * qnn_buf_buffer_get (qnn_buf_t * fifo) {
     buf_element_t * buf = nullptr;

     pthread_mutex_lock (&fifo->mutex);
-#if 0
-    while (fifo->first == nullptr) {
-        pthread_cond_wait (&fifo->not_empty, &fifo->mutex);
-    }
-#else
     if (fifo->first == nullptr) {
         pthread_mutex_unlock (&fifo->mutex);
         return nullptr;
     }
-#endif

     buf = fifo->first;
@@ -1449,9 +1426,9 @@ static void ggml_qnn_log_internal(ggml_log_level level, const char * file, const
     int len = vsnprintf(s_ggml_qnn_log_internal_buf + len_prefix, GGML_QNN_LOGBUF_LEN - len_prefix, format, args);
     if (len < (GGML_QNN_LOGBUF_LEN - len_prefix)) {
 #if (defined __ANDROID__) || (defined ANDROID)
-        __android_log_print(level, "llamacpp", "%s", s_ggml_qnn_log_internal_buf);
+        __android_log_print(level, "ggml-qnn", "%s", s_ggml_qnn_log_internal_buf);
 #else
-        printf("%s", buffer); //Qualcomm's QNN could running on Window over ARM
+        printf("%s", s_ggml_qnn_log_internal_buf); //Qualcomm's QNN can run on Windows over ARM (aka WoA)
 #endif
     }
     va_end(args);
@@ -2095,11 +2072,11 @@ int qnn_instance::load_system() {

     _system_lib_handle = dlopen(system_lib_path.c_str(), RTLD_NOW | RTLD_LOCAL);
     if (nullptr == _system_lib_handle) {
-        QNN_LOG_WARN("can not pen QNN library %s, error: %s\n", system_lib_path.c_str(), dlerror());
+        QNN_LOG_WARN("can not open QNN library %s, error: %s\n", system_lib_path.c_str(), dlerror());
         return 1;
     }

-    auto *get_providers = reinterpret_cast<_pfn_QnnSystemInterface_getProviders *>(dlsym(
+    auto * get_providers = reinterpret_cast<_pfn_QnnSystemInterface_getProviders *>(dlsym(
         _system_lib_handle, "QnnSystemInterface_getProviders"));
     if (nullptr == get_providers) {
         QNN_LOG_WARN("can not load QNN symbol QnnSystemInterface_getProviders: %s\n", dlerror());
@@ -2223,7 +2200,7 @@ static void ggml_qnn_logcallback(const char * fmt,
         int len_content = 0;
         memset(s_ggml_qnn_logbuf, 0, GGML_QNN_LOGBUF_LEN);
         len_content = vsnprintf(reinterpret_cast<char *const>(s_ggml_qnn_logbuf), GGML_QNN_LOGBUF_LEN, fmt, argp);
-        //QNN_LOG_DEBUG("%8.1fms [%-7s] %s ", ms, levelStr, s_ggml_qnn_logbuf);
+        QNN_LOG_DEBUG("%8.1fms [%-7s] %s ", ms, levelStr, s_ggml_qnn_logbuf);
     }
 }

@@ -2303,15 +2280,6 @@ int qnn_instance::qnn_init(const QnnSaver_Config_t ** saver_config) {
         QNN_LOG_INFO("create device successfully\n");
     }

-    /*
-    std::vector<const QnnDevice_Config_t*> temp_device_config;
-    _qnn_interface.qnn_device_create(_qnn_log_handle, temp_device_config.empty() ? nullptr : temp_device_config.data(), &_qnn_device_handle);
-    if (nullptr == _qnn_device_handle) {
-        QNN_LOG_WARN("why failed to initialize qnn device\n");
-        //return 6;
-    }
-    */
-
     if (ggml_qnn_profile_level::profile_off != _profile_level) {
         QNN_LOG_INFO("profiling turned on; level = %d", _profile_level);
         if (ggml_qnn_profile_level::profile_basic == _profile_level) {
@@ -2377,7 +2345,7 @@ int qnn_instance::qnn_init(const QnnSaver_Config_t ** saver_config) {
 }


-//QNN SDK would/might/should release all allocated resource in SDK's internal
+//the QNN SDK is expected to release all of its internally allocated resources
 int qnn_instance::qnn_finalize() {
     int ret_status = 0;
     Qnn_ErrorHandle_t error = QNN_SUCCESS;
@@ -3592,7 +3560,6 @@ bool ggml_qnn_compute_forward(struct ggml_compute_params * params, struct ggml_t
     }


-    //ok, real show time in Qualcomm's QNN internal
     if (nullptr != func)
         func(tensor->src[0], tensor->src[1], tensor);
     if (nullptr != func_common)
@@ -3832,7 +3799,7 @@ static size_t ggml_backend_qnn_buffer_type_get_alignment(ggml_backend_buffer_typ
 static size_t ggml_backend_qnn_buffer_type_get_max_size(ggml_backend_buffer_type_t buft) {
     GGML_UNUSED(buft);

-    return (38 * 1024 * 1024);
+    return (96 * 1024 * 1024);
 }

@@ -4429,6 +4396,7 @@ static int ggml_get_n_tasks(struct ggml_tensor * node, int n_threads, int n_cur_
 }


+#if 0 //replaced with ggml_status ggml_backend_qnn_graph_compute_multithread
 static void * ggml_graph_compute_thread(void * data) {
     struct ggml_compute_state * state = (struct ggml_compute_state *) data;

@@ -4563,6 +4531,7 @@ static void * ggml_graph_compute_thread(void * data) {

     return 0;
 }
+#endif


 static ggml_status ggml_backend_qnn_graph_compute_multithread(ggml_backend_t backend, ggml_cgraph * cgraph) {
@@ -4579,6 +4548,7 @@ static ggml_status ggml_backend_qnn_graph_compute_multithread(ggml_backend_t bac

     if (plan.work_size > 0) {
         //QNN_LOG_INFO("work size %d(%d MB)", plan.work_size, plan.work_size / (1 << 20));
+        //TODO: use a memory pool to avoid dynamic memory allocation/free
         plan.work_data = static_cast<uint8_t *>(malloc(plan.work_size));
         if (plan.work_data == nullptr) {
             QNN_LOG_ERROR("malloc failed");
@@ -4650,6 +4620,7 @@ static ggml_status ggml_backend_qnn_graph_compute_multithread(ggml_backend_t bac
     }

     if (plan.work_data != nullptr) {
+        //TODO: use a memory pool to avoid dynamic memory allocation/free
         free(plan.work_data);
     }
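Both TODOs in this function point at the same fix: keep one grow-only scratch buffer across graph computes instead of a malloc/free pair per call. A minimal sketch of the idea; all names are invented for illustration, and a real version would hold this state per backend instance rather than in globals:

```cpp
// illustrative sketch of the memory-pool TODO; not part of this commit.
// a grow-only scratch buffer reused across calls (not thread-safe as written)
#include <cstdlib>
#include <cstdint>

static uint8_t * s_qnn_work_data = nullptr;
static size_t    s_qnn_work_size = 0;

static uint8_t * qnn_get_work_buffer(size_t required) {
    if (required > s_qnn_work_size) {
        uint8_t * p = static_cast<uint8_t *>(realloc(s_qnn_work_data, required));
        if (p == nullptr) {
            return nullptr; // caller logs the failure; the old buffer stays valid
        }
        s_qnn_work_data = p;
        s_qnn_work_size = required;
    }
    return s_qnn_work_data; // reused on subsequent graph computes
}
```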

@@ -4766,7 +4737,8 @@ ggml_backend_buffer_type_t ggml_backend_qnn_buffer_type(size_t device_index) {
 /**
  *
  * @param device 0: QNN_CPU 1: QNN_GPU 2: QNN_HTP(aka DSP)
- * @param qnn_lib_path qnn library path, such as "/data/data/com.ggml.llamacpp/" on Android which can got by JNI from Java layer
+ * @param qnn_lib_path the Android app's data path, such as "/data/data/com.ggml.llamacpp/",
+ *                     which can be obtained through JNI from the Java layer
  * @return
  */
 ggml_backend_t ggml_backend_qnn_init(size_t device, const char * qnn_lib_path) {

ggml-qnn.h

+4-11
@@ -1,10 +1,3 @@
-/*
- * MIT license
- * Copyright (C) 2024 GGML Authors
- * SPDX-License-Identifier: MIT
- *
- * this is implementation of ggml QNN(Qualcomm Nerual Network, aka AI Engine Direct) backend
- */
 #pragma once

 #include "ggml.h"
@@ -30,7 +23,8 @@ GGML_API int ggml_backend_qnn_reg_devices();
 /**
  *
  * @param device 0: QNN_CPU 1: QNN_GPU 2: QNN_HTP(aka DSP)
- * @param qnn_lib_path qnn library path, such as "/data/data/com.ggml.llamacpp/" on Android which can got by JNI from Java layer
+ * @param qnn_lib_path qnn library path, such as "/data/data/com.ggml.llamacpp/",
+ *                     which can be obtained through JNI from the Java layer
  * @return
  */
 GGML_API ggml_backend_t ggml_backend_qnn_init(size_t dev_num, const char * qnn_lib_path);
@@ -45,9 +39,8 @@ GGML_API void ggml_backend_qnn_get_device_description(int device, char

 GGML_API ggml_backend_buffer_type_t ggml_backend_qnn_buffer_type(size_t dev_num);

-
-//temporary API, should be removed in the future
-GGML_API bool ggml_qnn_compute_forward(struct ggml_compute_params * params, struct ggml_tensor * tensor);
+// TODO: this is a temporary API, should be removed in the future
+GGML_API bool ggml_qnn_compute_forward(struct ggml_compute_params * params, struct ggml_tensor * tensor);


 #ifdef __cplusplus

ggml.c

+1-2
@@ -16153,8 +16153,7 @@ static void ggml_compute_forward_cross_entropy_loss_back(

 /////////////////////////////////

-//workaround for Qualcomm QNN backend
-void ggml_compute_forward(struct ggml_compute_params * params, struct ggml_tensor * tensor) {
+static void ggml_compute_forward(struct ggml_compute_params * params, struct ggml_tensor * tensor) {
     GGML_ASSERT(params);

     if (tensor->op == GGML_OP_NONE || ggml_is_empty(tensor)) {
