Hello,
For a text that turns out to be ten tokens long, I get 10 vectors even though I have --pooling enabled. Am I missing something obvious?
It's driving me nuts, and pairing this with JSON addressing in code and HTTP/SQL batching doesn't help... :-) Any help welcome, thanks in advance.
./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 750 Ti, compute capability 5.0, VMM: yes
version: 5797 (de56944)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
curl -s -X POST http://localhost:8081/embedding \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Embedding-0.6B-Q8_0.gguf",
    "input": "The quick brown fox jumps over the lazy dog."
  }'
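For comparison, it may also be worth querying the OpenAI-compatible route, which returns a single pooled vector per input. A minimal sketch, assuming this build also exposes /v1/embeddings (not shown above):

curl -s -X POST http://localhost:8081/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Embedding-0.6B-Q8_0.gguf",
    "input": "The quick brown fox jumps over the lazy dog."
  }' | jq '.data[0].embedding | length'
# expected: a single number (the embedding dimension), not a token count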
ls -lh
212K Jul 5 03:51 q-test-embedding.txt
jq '.[].embedding | length' ~/tmp/q-test-embedding.txt
10
grep -o ',' q-test-embedding.txt | wc -l
10240
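A quick way to tell whether the response holds one pooled vector or a per-token matrix is to check the type of the first element inside embedding. A diagnostic sketch, reusing the file name from above:

jq '.[0].embedding | {outer_length: length, first_element: (.[0] | type)}' ~/tmp/q-test-embedding.txt
# "first_element": "number" -> a single pooled vector of outer_length dimensions
# "first_element": "array"  -> per-token vectors, i.e. pooling was not applied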
Server script
#!/bin/bash
LLAMA_MODEL="Qwen3-Embedding-0.6B-Q8_0.gguf"
LLAMA_MODEL_PATH="/home/DATA/GGUF/embed"
# sampling flags (--temp, --top-k, --top-p, --n-predict) have no effect on embeddings
LLAMA_OPTS="-c 1024 --temp 0.3 --top-k 40 --top-p 0.9 --n-predict 60 --no-warmup --port 8081 --embedding"
LLAMA_PERF_OPTS="-ngl 99 --mlock --pooling last"
llama-server ${LLAMA_PERF_OPTS} ${LLAMA_OPTS} -m "${LLAMA_MODEL_PATH}/${LLAMA_MODEL}" "${@}"
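A smoke test for the pooling setting, assuming the server started by the script above is running on port 8081 and jq is available as used earlier:

curl -s -X POST http://localhost:8081/embedding \
  -H "Content-Type: application/json" \
  -d '{"input": "pooling smoke test"}' \
  | jq '.[0].embedding | (.[0] | type)'
# prints "number" when --pooling last yields one vector per input,
# "array" when the response still contains per-token vectors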