server: Describing pictures with multi models seems to crash the model #13480

Closed
domasofan opened this issue May 12, 2025 · 5 comments · Fixed by #13478

Comments

@domasofan

Hi all,

Tried to describe a picture with these two models in separate runs:
https://huggingface.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF
https://huggingface.co/bartowski/Qwen_Qwen2.5-VL-7B-Instruct-GGUF

The llama.cpp build used was b5351 (CPU, x64) on Windows 11.
In both runs the model crashed, but no errors were thrown.
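
A minimal sketch of this kind of request, assuming the server is started with --mmproj and exposes the OpenAI-compatible /v1/chat/completions endpoint (the image path and prompt below are placeholders, not the exact ones from the test):

```python
# Minimal sketch: send one image plus a text prompt to llama-server.
# Assumes the server was started with -m <model>.gguf --mmproj <mmproj>.gguf
# and listens on localhost:8080. Image path and prompt are placeholders.
import base64
import json
import urllib.request

with open("picture.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this picture."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
    "max_tokens": 256,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```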

Greetings,
Simon

@zandera007

I have the same issue with the RolmOCR model, which is a fine-tuned Qwen2.5 model. Text-only prompts work fine, but as soon as I send an image with text there is a crash...

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
build: 5350 (c104023) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 4, n_threads_batch = 4, total_threads = 8
system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 7
main: loading model
srv load_model: loading model '/models/RolmOCR.Q8_0.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3060) - 11796 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3060) - 11796 MiB free
llama_model_loader: loaded meta data with 42 key-value pairs and 339 tensors from /models/RolmOCR.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2vl
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = RolmOCR
llama_model_loader: - kv 3: general.size_label str = 7.6B
llama_model_loader: - kv 4: general.license str = apache-2.0
llama_model_loader: - kv 5: general.base_model.count u32 = 1
llama_model_loader: - kv 6: general.base_model.0.name str = Qwen2.5 VL 7B Instruct
llama_model_loader: - kv 7: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 8: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-V...
llama_model_loader: - kv 9: general.dataset.count u32 = 1
llama_model_loader: - kv 10: general.dataset.0.name str = olmOCR Mix 0225
llama_model_loader: - kv 11: general.dataset.0.version str = 0225
llama_model_loader: - kv 12: general.dataset.0.organization str = Allenai
llama_model_loader: - kv 13: general.dataset.0.repo_url str = https://huggingface.co/allenai/olmOCR...
llama_model_loader: - kv 14: qwen2vl.block_count u32 = 28
llama_model_loader: - kv 15: qwen2vl.context_length u32 = 128000
llama_model_loader: - kv 16: qwen2vl.embedding_length u32 = 3584
llama_model_loader: - kv 17: qwen2vl.feed_forward_length u32 = 18944
llama_model_loader: - kv 18: qwen2vl.attention.head_count u32 = 28
llama_model_loader: - kv 19: qwen2vl.attention.head_count_kv u32 = 4
llama_model_loader: - kv 20: qwen2vl.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 21: qwen2vl.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 22: qwen2vl.rope.dimension_sections arr[i32,4] = [16, 24, 24, 0]
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: general.file_type u32 = 7
llama_model_loader: - kv 35: general.url str = https://huggingface.co/mradermacher/R...
llama_model_loader: - kv 36: mradermacher.quantize_version str = 2
llama_model_loader: - kv 37: mradermacher.quantized_by str = mradermacher
llama_model_loader: - kv 38: mradermacher.quantized_at str = 2025-04-12T22:50:21+02:00
llama_model_loader: - kv 39: mradermacher.quantized_on str = nico1
llama_model_loader: - kv 40: general.source.url str = https://huggingface.co/reducto/RolmOCR
llama_model_loader: - kv 41: mradermacher.convert_type str = hf
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q8_0: 198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 7.54 GiB (8.50 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2vl
print_info: vocab_only = 0
print_info: n_ctx_train = 128000
print_info: n_embd = 3584
print_info: n_layer = 28
print_info: n_head = 28
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 7
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18944
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 8
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 128000
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 7B
print_info: model params = 7.62 B
print_info: general.name = RolmOCR
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CUDA0 model buffer size = 3542.78 MiB
load_tensors: CUDA1 model buffer size = 3622.66 MiB
load_tensors: CPU_Mapped model buffer size = 552.23 MiB
..............................................srv log_server_r: request: GET /health 127.0.0.1 503
........................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1, padding = 32
llama_kv_cache_unified: CUDA0 KV buffer size = 120.00 MiB
llama_kv_cache_unified: CUDA1 KV buffer size = 104.00 MiB
llama_kv_cache_unified: KV self size = 224.00 MiB, K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: CUDA0 compute buffer size = 312.03 MiB
llama_context: CUDA1 compute buffer size = 364.04 MiB
llama_context: CUDA_Host compute buffer size = 39.05 MiB
llama_context: graph nodes = 1042
llama_context: graph splits = 3
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
clip_ctx: CLIP using CUDA0 backend
clip_model_loader: model name: Qwen2.5-VL-7B-Instruct
clip_model_loader: description: image encoder for Qwen2VL
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 520
clip_model_loader: n_kv: 21
load_hparams: projector: qwen2.5vl_merger
load_hparams: n_embd: 1280
load_hparams: n_head: 16
load_hparams: n_ff: 0
load_hparams: n_layer: 32
load_hparams: projection_dim: 3584
load_hparams: image_size: 3584
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: proj_scale_factor: 0
load_hparams: n_wa_pattern: 8
load_hparams: ffn_op: silu
load_hparams: model size: 1291.40 MiB
load_hparams: metadata size: 0.18 MiB
alloc_compute_meta: CUDA0 compute buffer size = 2.77 MiB
alloc_compute_meta: CPU compute buffer size = 0.16 MiB
srv load_model: loaded multimodal model, '/models/Qwen2.5-VL-7B-Instruct.mmproj-fp16.gguf'
srv load_model: ctx_shift is not supported by multimodal, it will be disabled
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 4096
main: model loaded
main: chat template, chat_template: {%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0]['role'] == 'system' %}
{{- messages[0]['content'] }}
{%- else %}
{{- 'You are a helpful assistant.' }}
{%- endif %}
{{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within XML tags:\n" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": , "arguments": }\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0]['role'] == 'system' %}
{{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
{%- else %}
{{- '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}
{%- endif %}

{%- endif %}
{%- for message in messages %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{{- '<|im_start|>' + message.role }}
{%- if message.content %}
{{- '\n' + message.content }}
{%- endif %}
{%- for tool_call in message.tool_calls %}
{%- if tool_call.function is defined %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '\n<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{{- tool_call.arguments | tojson }}
{{- '}\n</tool_call>' }}
{%- endfor %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- message.content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}

{%- endif %}
, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET /health 127.0.0.1 200
srv log_server_r: request: GET /health 127.0.0.1 200
srv log_server_r: request: GET /health 127.0.0.1 200
srv log_server_r: request: GET /health 127.0.0.1 200
srv log_server_r: request: GET /health 127.0.0.1 200
srv log_server_r: request: GET /health 127.0.0.1 200
srv log_server_r: request: GET /health 127.0.0.1 200
srv log_server_r: request: GET /health 127.0.0.1 200
srv log_server_r: request: GET /v1/api/tags 172.23.0.1 404
srv log_server_r: request: GET /v1/api/version 172.23.0.1 404
srv log_server_r: request: GET /health 127.0.0.1 200
srv log_server_r: request: GET / 192.168.2.103 200
srv log_server_r: request: GET /v1/api/tags 172.23.0.1 404
srv log_server_r: request: GET /favicon.ico 192.168.2.103 404
srv log_server_r: request: GET /props 192.168.2.103 200
srv log_server_r: request: GET /health 127.0.0.1 200
srv log_server_r: request: GET /health 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 14
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 4, n_tokens = 4, progress = 0.285714
slot update_slots: id 0 | task 0 | kv cache rm [4, end)
srv process_chun: processing image...
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8641.89 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 9061680896
/app/ggml/src/ggml-backend.cpp:1666: GGML_ASSERT((char *)addr + ggml_backend_buffer_get_alloc_size(buffer, tensor) <= (char *)ggml_backend_buffer_get_base(buffer) + ggml_backend_buffer_get_size(buffer)) failed
encoding image or slice...

@domasofan
Author

Just tried at home with build 5359 and this model:
https://huggingface.co/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF
Seems to work again. Will try tomorrow in the office, where I have a different model.

@domasofan
Author

Works as well here in the office.
Used build: b5361

@zandera007

zandera007 commented May 13, 2025

OK,
I will try this evening with the latest build, b5361 (I'm using Docker).

I don't understand why so much memory is allocated in the buffer when the model receives the image... I hope the latest build fixes this issue.
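
For a rough sense of scale (my own back-of-envelope, not the actual ggml allocation logic): with the image_size = 3584 and patch_size = 14 reported in the log above, a full-resolution image becomes (3584/14)² = 65,536 patch tokens, and a single f16 self-attention score matrix over that many tokens is already about 8 GiB, the same order as the 8641.89 MiB allocation that failed.

```python
# Back-of-envelope only: size of one full self-attention score matrix in the
# vision encoder at full image resolution; not the real ggml buffer planning.
image_size = 3584  # load_hparams: image_size (from the log above)
patch_size = 14    # load_hparams: patch_size (from the log above)

n_patches = (image_size // patch_size) ** 2      # 256 * 256 = 65,536 patch tokens
attn_matrix_bytes = n_patches * n_patches * 2    # f16 = 2 bytes per element

print(f"patch tokens: {n_patches}")
print(f"one f16 attention matrix: {attn_matrix_bytes / 2**30:.1f} GiB")
# patch tokens: 65536
# one f16 attention matrix: 8.0 GiB
```

The real failed buffer covers the whole encoder graph (and the n_wa_pattern: 8 in the log suggests most layers use windowed attention), so the exact figure differs, but it illustrates why encoding one large image can exceed what is left on a 12 GB card.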

@zandera007

I confirm, it's working with the latest build.

@CISC linked a pull request May 14, 2025 that will close this issue
@CISC closed this as completed May 14, 2025