Research Stage
Previous existing literature and research
I'm trying to deploy a multi-modal model based on VITA-1.5, where:
- The text backbone is the same as Qwen2.
- The vision tower is InternViT-300M-448px from OpenGVLab.
Yesterday I noticed that convert_hf_to_gguf.py added a new class, `class InternVisionModel(VisionModel)`, which is the same vision encoder used in VITA's vision part.
However:
- There's no corresponding tensor name mapping in constants.py under MODEL_TENSORS (the sketch below shows the kind of entry I mean).
- There's no build function in llama_model.cpp (e.g., no build_internvit()).
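To make the first point concrete, this is roughly the kind of HF-to-GGUF tensor-name mapping that appears to be missing. Every name on both sides of this dict is my own guess for illustration, not the actual naming llama.cpp uses:

```python
# Purely illustrative: both the HF-side patterns and the GGUF-side names here
# are guesses meant to show the shape of the missing mapping, not real entries.
INTERNVIT_TENSOR_MAP = {
    "vision_model.embeddings.patch_embedding": "v.patch_embd",
    "vision_model.embeddings.position_embedding": "v.position_embd",
    "vision_model.encoder.layers.{bid}.attn.qkv": "v.blk.{bid}.attn_qkv",
    "vision_model.encoder.layers.{bid}.attn.proj": "v.blk.{bid}.attn_out",
    "vision_model.encoder.layers.{bid}.mlp.fc1": "v.blk.{bid}.ffn_up",
    "vision_model.encoder.layers.{bid}.mlp.fc2": "v.blk.{bid}.ffn_down",
    "mlp1.{bid}": "mm.{bid}",  # vision-to-text projector layers
}
```

If the real mapping belongs in gguf-py/gguf/tensor_mapping.py or inside the converter class itself rather than in constants.py, a pointer to the right place would already help a lot.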
I'm not sure how to combine the vision and text parts into a single GGUF model so that llama.cpp can run inference with both modalities.
My goal:
To deploy VITA-1.5 via llama.cpp and run image+text inference (similar to LLaVA / MobileVLM).
Questions:
1. What is the recommended way to combine Qwen2 text + InternViT vision into one GGUF model? (A sketch of the flow I was expecting is below.)
2. Will InternVisionModel get inference support in llama.cpp soon, or should I write the corresponding GGML graph manually?
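For the first question, this is the flow I'd naively expect, based purely on how LLaVA-style models are converted and run; the VITA-1.5 paths and output file names are placeholders, and I'm not sure any of this currently applies to VITA-1.5:

```sh
# Placeholder paths/names; modeled on the LLaVA-style two-file setup
# (text-model GGUF + separate mmproj GGUF), not on anything VITA-specific.

# 1) Convert the Qwen2 text backbone to GGUF
python convert_hf_to_gguf.py ./VITA-1.5 --outfile vita-1.5-text.gguf

# 2) Convert the InternViT vision tower + projector into a separate mmproj GGUF
python convert_hf_to_gguf.py ./VITA-1.5 --mmproj --outfile vita-1.5-mmproj.gguf

# 3) Run image+text inference with the multimodal CLI
./llama-mtmd-cli -m vita-1.5-text.gguf --mmproj vita-1.5-mmproj.gguf \
    --image example.jpg -p "Describe this image."
```

As far as I can tell, other vision models keep the language model and the vision encoder in two GGUF files rather than one, which is why I framed question 1 this way; if a single combined GGUF is the recommended route instead, an example would be very welcome.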
Hypothesis
No response
Implementation
No response
Analysis
No response
Relevant log output