clip : cap max image size 1024 for qwen vl model #13478
Merged
Fix #13445, #13467
An image of size 2592x1944 is equivalent to over 20k tokens, which is an unreasonable amount of computation (equivalent to 41 GB of VRAM).
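For intuition, a back-of-the-envelope estimate (my own arithmetic, assuming the usual 14x14 ViT patch size for the Qwen2-VL vision encoder, before any token merging): (2592 / 14) x (1944 / 14) ≈ 185 x 138 ≈ 25.5k patch embeddings, which matches the "over 20k tokens" figure.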
It turns out the problem is not isolated to llama.cpp; someone has already discussed it with the HF implementation: https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct/discussions/10
The solution is to cap the image size at a maximum of 1024x1024, which brings the max memory usage down to 1830 MB. A sketch of the capping logic is below.
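A minimal sketch of that logic, not the exact patch: `clip_image_u8` is loosely modeled on the struct in `clip.cpp`, and the nearest-neighbor resize is only a self-contained stand-in for the real, higher-quality resampler.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct clip_image_u8 {
    int nx = 0, ny = 0;       // width, height in pixels
    std::vector<uint8_t> buf; // RGB, 3 bytes per pixel
};

// Trivial nearest-neighbor resize to keep the sketch self-contained;
// the real code would use a bilinear/bicubic resampler.
static void image_resize_nn(const clip_image_u8 & src, clip_image_u8 & dst, int nx, int ny) {
    dst.nx = nx;
    dst.ny = ny;
    dst.buf.resize((size_t) 3 * nx * ny);
    for (int y = 0; y < ny; y++) {
        const int sy = std::min(src.ny - 1, y * src.ny / ny);
        for (int x = 0; x < nx; x++) {
            const int sx = std::min(src.nx - 1, x * src.nx / nx);
            for (int c = 0; c < 3; c++) {
                dst.buf[3 * ((size_t) y * nx + x) + c] = src.buf[3 * ((size_t) sy * src.nx + sx) + c];
            }
        }
    }
}

// If either side exceeds max_size, scale the image down proportionally so
// the longest side becomes max_size; smaller images pass through unchanged.
static void cap_image_size(const clip_image_u8 & src, clip_image_u8 & dst, int max_size = 1024) {
    const int longest = std::max(src.nx, src.ny);
    if (longest <= max_size) {
        dst = src;
        return;
    }
    const float scale = (float) max_size / (float) longest;
    const int nx = std::max(1, (int) (src.nx * scale));
    const int ny = std::max(1, (int) (src.ny * scale));
    image_resize_nn(src, dst, nx, ny);
}
```

Scaling by the longest side keeps the aspect ratio intact, so a 2592x1944 input becomes 1024x768 rather than being squashed to a square.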
We may be able to use less memory with flash attn, but given the recent discussion about why it's not enabled by default for text models, I'm not sure what the risks of adding flash attn to vision support are. WDYT @ggerganov ?