Research Stage
Previous existing literature and research
I'm trying to deploy a multi-modal model based on VITA-1.5, where:
- The text backbone is the same as Qwen2.
- The vision tower is InternViT-300M-448px from OpenGVLab.
Yesterday I noticed that convert_hf_to_gguf.py added a new class, `class InternVisionModel(VisionModel)`, which is the same vision encoder used in VITA's vision part.
However:
- There's no corresponding tensor name mapping in constants.py under MODEL_TENSORS (the sketch below shows the kind of entry I mean).
- There's no build function in llama_model.cpp (e.g., no build_internvit()).
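To make the first point concrete, this is roughly the kind of HF-to-GGUF tensor-name mapping that appears to be missing. Every name on both sides of this dict is my own guess for illustration, not the actual naming llama.cpp uses:

```python
# Purely illustrative: both the HF-side patterns and the GGUF-side names here
# are guesses meant to show the shape of the missing mapping, not real entries.
INTERNVIT_TENSOR_MAP = {
    "vision_model.embeddings.patch_embedding": "v.patch_embd",
    "vision_model.embeddings.position_embedding": "v.position_embd",
    "vision_model.encoder.layers.{bid}.attn.qkv": "v.blk.{bid}.attn_qkv",
    "vision_model.encoder.layers.{bid}.attn.proj": "v.blk.{bid}.attn_out",
    "vision_model.encoder.layers.{bid}.mlp.fc1": "v.blk.{bid}.ffn_up",
    "vision_model.encoder.layers.{bid}.mlp.fc2": "v.blk.{bid}.ffn_down",
    "mlp1.{bid}": "mm.{bid}",  # vision-to-text projector layers
}
```

If the real mapping belongs in gguf-py/gguf/tensor_mapping.py or inside the converter class itself rather than in constants.py, a pointer to the right place would already help a lot.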
I'm not sure how to combine the vision and text parts into a single GGUF model so that llama.cpp can run inference with both modalities.
My goal:
To deploy VITA-1.5 via llama.cpp and run image+text inference (similar to LLaVA / MobileVLM).
Questions:
1. What is the recommended way to combine Qwen2 text + InternViT vision into one GGUF model? (A sketch of the flow I was expecting is below.)
2. Will InternVisionModel get inference support in llama.cpp soon, or should I write the corresponding GGML graph manually?
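For the first question, this is the flow I'd naively expect, based purely on how LLaVA-style models are converted and run; the VITA-1.5 paths and output file names are placeholders, and I'm not sure any of this currently applies to VITA-1.5:

```sh
# Placeholder paths/names; modeled on the LLaVA-style two-file setup
# (text-model GGUF + separate mmproj GGUF), not on anything VITA-specific.

# 1) Convert the Qwen2 text backbone to GGUF
python convert_hf_to_gguf.py ./VITA-1.5 --outfile vita-1.5-text.gguf

# 2) Convert the InternViT vision tower + projector into a separate mmproj GGUF
python convert_hf_to_gguf.py ./VITA-1.5 --mmproj --outfile vita-1.5-mmproj.gguf

# 3) Run image+text inference with the multimodal CLI
./llama-mtmd-cli -m vita-1.5-text.gguf --mmproj vita-1.5-mmproj.gguf \
    --image example.jpg -p "Describe this image."
```

As far as I can tell, other vision models keep the language model and the vision encoder in two GGUF files rather than one, which is why I framed question 1 this way; if a single combined GGUF is the recommended route instead, an example would be very welcome.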
Hypothesis
No response
Implementation
No response
Analysis
No response
Relevant log output