Hello,
For a text that turns out to be ten tokens long, I get 10 vectors even though I have --pooling enabled. Am I missing something obvious?
It's driving me nuts, and pairing this with JSON addressing in code and HTTP/SQL batching doesn't help... :-) Any help welcome, thanks in advance.
./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 750 Ti, compute capability 5.0, VMM: yes
version: 5797 (de56944)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
curl -s -X POST http://localhost:8081/embedding \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Embedding-0.6B-Q8_0.gguf",
    "input": "The quick brown fox jumps over the lazy dog."
  }'
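For comparison, it may also be worth querying the OpenAI-compatible route, which returns a single pooled vector per input. A minimal sketch, assuming this build also exposes /v1/embeddings (not shown above):

curl -s -X POST http://localhost:8081/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Embedding-0.6B-Q8_0.gguf",
    "input": "The quick brown fox jumps over the lazy dog."
  }' | jq '.data[0].embedding | length'
# expected: a single number (the embedding dimension), not a token count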
ls -lh
212K Jul 5 03:51 q-test-embedding.txt
jq '.[].embedding | length' ~/tmp/q-test-embedding.txt
10
grep -o ',' q-test-embedding.txt | wc -l
10240
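A quick way to tell whether the response holds one pooled vector or a per-token matrix is to check the type of the first element inside embedding. A diagnostic sketch, reusing the file name from above:

jq '.[0].embedding | {outer_length: length, first_element: (.[0] | type)}' ~/tmp/q-test-embedding.txt
# "first_element": "number" -> a single pooled vector of outer_length dimensions
# "first_element": "array"  -> per-token vectors, i.e. pooling was not applied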
Server script
#!/bin/bash
LLAMA_MODEL="Qwen3-Embedding-0.6B-Q8_0.gguf"
LLAMA_MODEL_PATH="/home/DATA/GGUF/embed"
# sampling flags (--temp, --top-k, --top-p, --n-predict) have no effect on embeddings
LLAMA_OPTS="-c 1024 --temp 0.3 --top-k 40 --top-p 0.9 --n-predict 60 --no-warmup --port 8081 --embedding"
LLAMA_PERF_OPTS="-ngl 99 --mlock --pooling last"
llama-server ${LLAMA_PERF_OPTS} ${LLAMA_OPTS} -m "${LLAMA_MODEL_PATH}/${LLAMA_MODEL}" "${@}"
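A smoke test for the pooling setting, assuming the server started by the script above is running on port 8081 and jq is available as used earlier:

curl -s -X POST http://localhost:8081/embedding \
  -H "Content-Type: application/json" \
  -d '{"input": "pooling smoke test"}' \
  | jq '.[0].embedding | (.[0] | type)'
# prints "number" when --pooling last yields one vector per input,
# "array" when the response still contains per-token vectors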