clip : cap max image size 1024 for qwen vl model #13478


Merged
merged 1 commit into ggml-org:master on May 12, 2025

Conversation

Collaborator

@ngxson ngxson commented May 12, 2025

Fix #13445 #13467

An image of size 2592x1944 is equivalent to over 20k tokens, which is an unreasonable amount of computation (equivalent to 41GB of VRAM).
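(For a rough sense of scale, and assuming Qwen2-VL's 14-pixel ViT patch size, a 2592x1944 image yields about (2592/14) × (1944/14) ≈ 185 × 139 ≈ 25.7k patch positions in the vision encoder before any spatial merging, consistent with the >20k figure above.)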

It turns out the problem is not isolated to llama.cpp; the same issue has already been discussed for the HF implementation: https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct/discussions/10

The solution is to cap the image size at a maximum of 1024x1024, which brings the max memory usage down to 1830M.
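A minimal sketch of the idea (not the actual patch; `cap_image_size` and the struct name are hypothetical), capping the longest side at 1024 while preserving aspect ratio:

```cpp
#include <algorithm>

// hypothetical illustration of the capping logic, not the code from this PR
struct img_size { int width; int height; };

static img_size cap_image_size(img_size in, int max_dim = 1024) {
    int longest = std::max(in.width, in.height);
    if (longest <= max_dim) {
        return in; // small enough already, keep as-is
    }
    // scale both sides by the same factor so the longest side becomes max_dim
    float scale = (float) max_dim / (float) longest;
    return {
        std::max(1, (int) (in.width  * scale)),
        std::max(1, (int) (in.height * scale)),
    };
}

// example: 2592x1944 -> 1024x768, a large reduction in patches/tokens
```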

We may be able to use less memory with flash attn, but given the recent discussion about why it's not enabled by default for text models, I'm not sure what the risks are of adding flash attn to vision support. WDYT @ggerganov?

Member

@ggerganov ggerganov left a comment


I think enabling FA in the vision encoder will reduce the memory usage significantly, which seems to be important for vision usage. The main problem is that currently the computation will fall back to the CPU when the tensor shapes are not padded correctly (32 for Metal, 256 for CUDA). There are 2 approaches that come to mind:

  • Change the backend implementations to auto-pad the input tensors when necessary
  • Change the clip implementation to do the padding (similar to how the KV cache is padded)

Probably the former is the better approach, but it will need work in all the backends that implement FA.
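For the second option, a rough sketch of the padding idea (illustrative only; `pad_to_multiple` is a hypothetical helper that uses the same rounding as ggml's `GGML_PAD` macro):

```cpp
#include <cstdint>

// round n up to the next multiple of pad, like GGML_PAD(x, n) in ggml.h
static inline int64_t pad_to_multiple(int64_t n, int64_t pad) {
    return ((n + pad - 1) / pad) * pad;
}

// e.g. a 37x37-patch image has 1369 positions; with a 256-element CUDA FA
// requirement, the clip graph would allocate pad_to_multiple(1369, 256) == 1536
// positions and mask out the padded ones, instead of falling back to the CPU.
```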

@ngxson ngxson merged commit de4c07f into ggml-org:master May 12, 2025
44 checks passed
@CISC CISC linked an issue May 14, 2025 that may be closed by this pull request