Eval bug: allocating 114296.55 MiB on device 0: cudaMalloc failed: out of memory #12586

Closed
shuyuan-wang opened this issue Mar 26, 2025 · 1 comment

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
version: 4893 (f4c3dd5)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

8 x A100

Models

Qwen2-VL-7B-Instruct-Q4_K_M.gguf

Problem description & steps to reproduce

When I run the following command:

CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-qwen2vl-cli -m weights/qwen2-vl/Qwen2-VL-7B-Instruct/Qwen2-VL-7B-Instruct-Q4_K_M.gguf --mmproj ./qwen2-vl-7b-instruct-vision.gguf -p "Describe this image." --image 1737108984550867_1737107783460076_30_FE84002453CE30630892A76B6348E418D3E8B187_55_F100_front_wide.jpg

It fails with:

llama_context: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_context:      CUDA0 compute buffer size =   744.25 MiB
llama_context:  CUDA_Host compute buffer size =    15.02 MiB
llama_context: graph nodes  = 986
llama_context: graph splits = 396 (with bs=512), 7 (with bs=1)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 114296.55 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 119848614912
/gemini/code/llama.cpp/ggml/src/ggml-backend.cpp:1662: GGML_ASSERT((char *)addr + ggml_backend_buffer_get_alloc_size(buffer, tensor) <= (char *)ggml_backend_buffer_get_base(buffer) + ggml_backend_buffer_get_size(buffer)) failed

The quantized decoder model is 5.3 GB and the vision encoder is, I believe, about 1.8 GB, so an allocation of ~114 GiB on a single 40 GB A100 should not be necessary.
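
For reference, a quick arithmetic check of the failed allocation reported in the log (a minimal sketch, assuming only that MiB = 2^20 bytes and GiB = 2^30 bytes; the byte count is taken from the ggml_gallocr_reserve_n line above):

failed_alloc = 119_848_614_912              # bytes, from "failed to allocate CUDA0 buffer of size 119848614912"
print(round(failed_alloc / 2**20, 2))       # 114296.55 MiB, matching the cudaMalloc message
print(round(failed_alloc / 2**30, 2))       # 111.62 GiB, roughly three times the 40 GB of a single A100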

First Bad Commit

No response

Relevant log output

clip_init: loaded meta data with 20 key-value pairs and 521 tensors from ./qwen2-vl-7b-instruct-vision.gguf
clip_init: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_init: - kv   0:                       general.architecture str              = clip
clip_init: - kv   1:                        general.description str              = image encoder for Qwen2VL
clip_init: - kv   2:                          general.file_type u32              = 0
clip_init: - kv   3:                      clip.has_text_encoder bool             = false
clip_init: - kv   4:                    clip.has_vision_encoder bool             = true
clip_init: - kv   5:                    clip.has_qwen2vl_merger bool             = true
clip_init: - kv   6:                        clip.projector_type str              = qwen2vl_merger
clip_init: - kv   7:                              clip.use_silu bool             = false
clip_init: - kv   8:                              clip.use_gelu bool             = false
clip_init: - kv   9:                     clip.vision.patch_size u32              = 14
clip_init: - kv  10:                     clip.vision.image_size u32              = 560
clip_init: - kv  11:               clip.vision.embedding_length u32              = 1280
clip_init: - kv  12:                 clip.vision.projection_dim u32              = 3584
clip_init: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_init: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_init: - kv  15:                    clip.vision.block_count u32              = 32
clip_init: - kv  16:            clip.vision.feed_forward_length u32              = 0
clip_init: - kv  17:                               general.name str              = Qwen2-VL-7B-Instruct
clip_init: - kv  18:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_init: - kv  19:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_init: - type  f32:  521 tensors
clip_ctx: CLIP using CUDA0 backend
clip_init: text_encoder:   0
clip_init: vision_encoder: 1
clip_init: llava_projector:  0
clip_init: minicpmv_projector:  0
clip_init: minicpmv_version:  2
clip_init: glm_projector:  0
clip_init: model size:     2577.82 MB
clip_init: metadata size:  0.18 MB
clip_init: params backend buffer size =  2577.82 MB (521 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.feature_layer not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_init:      CUDA0 compute buffer size =   198.93 MiB
clip_init:        CPU compute buffer size =     3.61 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.58 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
init:        CPU KV buffer size =   224.00 MiB
llama_context: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_context:      CUDA0 compute buffer size =   744.25 MiB
llama_context:  CUDA_Host compute buffer size =    15.02 MiB
llama_context: graph nodes  = 986
llama_context: graph splits = 396 (with bs=512), 7 (with bs=1)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 114296.55 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 119848614912
/gemini/code/llama.cpp/ggml/src/ggml-backend.cpp:1662: GGML_ASSERT((char *)addr + ggml_backend_buffer_get_alloc_size(buffer, tensor) <= (char *)ggml_backend_buffer_get_base(buffer) + ggml_backend_buffer_get_size(buffer)) failed

This issue was closed because it has been inactive for 14 days since being marked as stale.
