Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
version: 4893 (f4c3dd5)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
8 x A100
Models
Qwen2-VL-7B-Instruct-Q4_K_M.gguf
Problem description & steps to reproduce
When I tried to run:

CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-qwen2vl-cli -m weights/qwen2-vl/Qwen2-VL-7B-Instruct/Qwen2-VL-7B-Instruct-Q4_K_M.gguf --mmproj ./qwen2-vl-7b-instruct-vision.gguf -p "Describe this image." --image 1737108984550867_1737107783460076_30_FE84002453CE30630892A76B6348E418D3E8B187_55_F100_front_wide.jpg

it shows:

llama_context: KV self size = 224.00 MiB, K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_context: CUDA0 compute buffer size = 744.25 MiB
llama_context: CUDA_Host compute buffer size = 15.02 MiB
llama_context: graph nodes = 986
llama_context: graph splits = 396 (with bs=512), 7 (with bs=1)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 114296.55 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 119848614912
/gemini/code/llama.cpp/ggml/src/ggml-backend.cpp:1662: GGML_ASSERT((char *)addr + ggml_backend_buffer_get_alloc_size(buffer, tensor) <= (char *)ggml_backend_buffer_get_base(buffer) + ggml_backend_buffer_get_size(buffer)) failed

The quantized decoder model is 5.3 GB and the vision encoder is, I believe, 1.8 GB, so both together should fit comfortably on a single 40 GB A100.
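For scale, the failed allocation can be sanity-checked with a quick back-of-the-envelope sketch (plain Python; every number below is copied from the log and the issue text, and the 40 GB figure is just the memory of one A100-PCIE-40GB):

```python
# Convert the failed ggml allocation request into MiB/GiB and compare it
# against the visible GPU and the model files. All inputs come from the
# log lines and model sizes quoted in this report.
requested_bytes = 119_848_614_912   # ggml_gallocr_reserve_n: failed to allocate
mib = requested_bytes / 1024**2     # ~114296.55 MiB, matching the cudaMalloc line
gib = requested_bytes / 1024**3     # ~111.6 GiB
gpu_gib = 40                        # single A100-PCIE-40GB (CUDA_VISIBLE_DEVICES=0)
models_gib = 5.3 + 1.8              # decoder + vision encoder on disk
print(f"requested: {mib:,.2f} MiB ({gib:.1f} GiB); "
      f"GPU: {gpu_gib} GiB; model files: {models_gib:.1f} GiB")
```

In other words, llama-qwen2vl-cli is asking CUDA for roughly 112 GiB of compute buffer on device 0, nearly 3x the memory of the GPU and over 15x the combined size of the two model files.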
First Bad Commit
No response
Relevant log output
clip_init: loaded meta data with 20 key-value pairs and 521 tensors from ./qwen2-vl-7b-instruct-vision.gguf
clip_init: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_init: - kv 0: general.architecture str = clip
clip_init: - kv 1: general.description str = image encoder for Qwen2VL
clip_init: - kv 2: general.file_type u32 = 0
clip_init: - kv 3: clip.has_text_encoder bool = false
clip_init: - kv 4: clip.has_vision_encoder bool = true
clip_init: - kv 5: clip.has_qwen2vl_merger bool = true
clip_init: - kv 6: clip.projector_type str = qwen2vl_merger
clip_init: - kv 7: clip.use_silu bool = false
clip_init: - kv 8: clip.use_gelu bool = false
clip_init: - kv 9: clip.vision.patch_size u32 = 14
clip_init: - kv 10: clip.vision.image_size u32 = 560
clip_init: - kv 11: clip.vision.embedding_length u32 = 1280
clip_init: - kv 12: clip.vision.projection_dim u32 = 3584
clip_init: - kv 13: clip.vision.attention.head_count u32 = 16
clip_init: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_init: - kv 15: clip.vision.block_count u32 = 32
clip_init: - kv 16: clip.vision.feed_forward_length u32 = 0
clip_init: - kv 17: general.name str = Qwen2-VL-7B-Instruct
clip_init: - kv 18: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_init: - kv 19: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_init: - type f32: 521 tensors
clip_ctx: CLIP using CUDA0 backend
clip_init: text_encoder: 0
clip_init: vision_encoder: 1
clip_init: llava_projector: 0
clip_init: minicpmv_projector: 0
clip_init: minicpmv_version: 2
clip_init: glm_projector: 0
clip_init: model size: 2577.82 MB
clip_init: metadata size: 0.18 MB
clip_init: params backend buffer size = 2577.82 MB (521 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.feature_layer not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_init: CUDA0 compute buffer size = 198.93 MiB
clip_init: CPU compute buffer size = 3.61 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
init: CPU KV buffer size = 224.00 MiB
llama_context: KV self size = 224.00 MiB, K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_context: CUDA0 compute buffer size = 744.25 MiB
llama_context: CUDA_Host compute buffer size = 15.02 MiB
llama_context: graph nodes = 986
llama_context: graph splits = 396 (with bs=512), 7 (with bs=1)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 114296.55 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 119848614912
/gemini/code/llama.cpp/ggml/src/ggml-backend.cpp:1662: GGML_ASSERT((char *)addr + ggml_backend_buffer_get_alloc_size(buffer, tensor) <= (char *)ggml_backend_buffer_get_base(buffer) + ggml_backend_buffer_get_size(buffer)) failed