Eval bug: allocating 114296.55 MiB on device 0: cudaMalloc failed: out of memory #12586

Closed
shuyuan-wang opened this issue Mar 26, 2025 · 1 comment

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
version: 4893 (f4c3dd5)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

8 x A100

Models

Qwen2-VL-7B-Instruct-Q4_K_M.gguf

Problem description & steps to reproduce

When I run the following command:

CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-qwen2vl-cli -m weights/qwen2-vl/Qwen2-VL-7B-Instruct/Qwen2-VL-7B-Instruct-Q4_K_M.gguf --mmproj ./qwen2-vl-7b-instruct-vision.gguf -p "Describe this image." --image 1737108984550867_1737107783460076_30_FE84002453CE30630892A76B6348E418D3E8B187_55_F100_front_wide.jpg

It fails with:

llama_context: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_context:      CUDA0 compute buffer size =   744.25 MiB
llama_context:  CUDA_Host compute buffer size =    15.02 MiB
llama_context: graph nodes  = 986
llama_context: graph splits = 396 (with bs=512), 7 (with bs=1)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 114296.55 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 119848614912
/gemini/code/llama.cpp/ggml/src/ggml-backend.cpp:1662: GGML_ASSERT((char *)addr + ggml_backend_buffer_get_alloc_size(buffer, tensor) <= (char *)ggml_backend_buffer_get_base(buffer) + ggml_backend_buffer_get_size(buffer)) failed

The quantized decoder model is 5.3 GB and the vision encoder is, I believe, about 1.8 GB, so an allocation of ~114 GiB on a single 40 GB A100 should not be necessary.
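
For reference, a quick arithmetic check of the failed allocation reported in the log (a minimal sketch, assuming only that MiB = 2^20 bytes and GiB = 2^30 bytes; the byte count is taken from the ggml_gallocr_reserve_n line above):

failed_alloc = 119_848_614_912              # bytes, from "failed to allocate CUDA0 buffer of size 119848614912"
print(round(failed_alloc / 2**20, 2))       # 114296.55 MiB, matching the cudaMalloc message
print(round(failed_alloc / 2**30, 2))       # 111.62 GiB, roughly three times the 40 GB of a single A100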

First Bad Commit

No response

Relevant log output

clip_init: loaded meta data with 20 key-value pairs and 521 tensors from ./qwen2-vl-7b-instruct-vision.gguf
clip_init: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_init: - kv   0:                       general.architecture str              = clip
clip_init: - kv   1:                        general.description str              = image encoder for Qwen2VL
clip_init: - kv   2:                          general.file_type u32              = 0
clip_init: - kv   3:                      clip.has_text_encoder bool             = false
clip_init: - kv   4:                    clip.has_vision_encoder bool             = true
clip_init: - kv   5:                    clip.has_qwen2vl_merger bool             = true
clip_init: - kv   6:                        clip.projector_type str              = qwen2vl_merger
clip_init: - kv   7:                              clip.use_silu bool             = false
clip_init: - kv   8:                              clip.use_gelu bool             = false
clip_init: - kv   9:                     clip.vision.patch_size u32              = 14
clip_init: - kv  10:                     clip.vision.image_size u32              = 560
clip_init: - kv  11:               clip.vision.embedding_length u32              = 1280
clip_init: - kv  12:                 clip.vision.projection_dim u32              = 3584
clip_init: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_init: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_init: - kv  15:                    clip.vision.block_count u32              = 32
clip_init: - kv  16:            clip.vision.feed_forward_length u32              = 0
clip_init: - kv  17:                               general.name str              = Qwen2-VL-7B-Instruct
clip_init: - kv  18:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_init: - kv  19:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_init: - type  f32:  521 tensors
clip_ctx: CLIP using CUDA0 backend
clip_init: text_encoder:   0
clip_init: vision_encoder: 1
clip_init: llava_projector:  0
clip_init: minicpmv_projector:  0
clip_init: minicpmv_version:  2
clip_init: glm_projector:  0
clip_init: model size:     2577.82 MB
clip_init: metadata size:  0.18 MB
clip_init: params backend buffer size =  2577.82 MB (521 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.feature_layer not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_init:      CUDA0 compute buffer size =   198.93 MiB
clip_init:        CPU compute buffer size =     3.61 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.58 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
init:        CPU KV buffer size =   224.00 MiB
llama_context: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_context:      CUDA0 compute buffer size =   744.25 MiB
llama_context:  CUDA_Host compute buffer size =    15.02 MiB
llama_context: graph nodes  = 986
llama_context: graph splits = 396 (with bs=512), 7 (with bs=1)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 114296.55 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 119848614912
/gemini/code/llama.cpp/ggml/src/ggml-backend.cpp:1662: GGML_ASSERT((char *)addr + ggml_backend_buffer_get_alloc_size(buffer, tensor) <= (char *)ggml_backend_buffer_get_base(buffer) + ggml_backend_buffer_get_size(buffer)) failed

This issue was closed because it has been inactive for 14 days since being marked as stale.
