
Deploying VITA-1.5 Multimodal Model with ExecuTorch #10757


Open
jordanqi opened this issue May 7, 2025 · 7 comments
Labels
module: llm Issues related to LLM examples and apps, and to the extensions/llm/ code · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@jordanqi

jordanqi commented May 7, 2025

🚀 The feature, motivation and pitch

I’m trying to deploy a VITA-1.5 multimodal model (supports audio, vision, and text) using ExecuTorch.

The tokenizer is in Hugging Face tokenizer.json format, and I’d like to ask:

  1. Is there any suggested way to convert the model into .pte format for ExecuTorch?
  2. Since this is a new architecture, is there any guidance or examples for adding custom models?
  3. Can I still use the LlamaDemo Android app with this multimodal model?

Alternatives

No response

Additional context

No response

RFC (Optional)

No response

cc @larryliu0820 @mergennachin @cccclai @helunwencser @jackzhxng

@Jack-Khuu
Contributor

cc: @larryliu0820 for MM
@kirklandsign for Android

@Jack-Khuu Jack-Khuu added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module module: llm Issues related to LLM examples and apps, and to the extensions/llm/ code labels May 9, 2025
@github-project-automation github-project-automation bot moved this to To triage in ExecuTorch Core May 9, 2025
@Jack-Khuu
Contributor

Hi @jordanqi, if you haven't joined our discord channel, we would love to have you on there :)

@jordanqi
Author

jordanqi commented May 9, 2025

Hi @jordanqi, if you haven't joined our discord channel, we would love to have you on there :)

I haven't joined the Discord yet; please send me the channel link. Thanks!

@kirklandsign
Contributor

kirklandsign commented May 12, 2025

Is there any suggested way to convert the model into .pte format for ExecuTorch?
Since this is a new architecture, is there any guidance or examples for adding custom models?

Is this model from HF? @guangy10 may know.

Can I still use the LlamaDemo Android app with this multimodal?

If it can run with the desktop llama_runner out of the box, then it can run with the LlamaDemo Android app, but I'm not sure about the image processing and prompt format parts.

@kirklandsign
Contributor

We support the Hugging Face tokenizer.json format right now.
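For reference, a quick way to sanity-check that a tokenizer is in the expected tokenizer.json format is to round-trip it through the Hugging Face `tokenizers` library. This sketch builds a toy word-level tokenizer as a stand-in (the real file would come from the VITA-1.5 repo on Hugging Face), saves it, and reloads it:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy vocabulary; the real VITA-1.5 tokenizer.json replaces this file.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Save in the same tokenizer.json format the runtime's loader consumes.
tok.save("toy_tokenizer.json")

# Reload and encode to confirm the file round-trips cleanly.
reloaded = Tokenizer.from_file("toy_tokenizer.json")
ids = reloaded.encode("hello world").ids
print(ids)  # → [1, 2]
```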

@larryliu0820
Contributor

larryliu0820 commented May 12, 2025

Hi @jordanqi, thanks for your interest. A lot of the features you are asking about are under development, so there may not be convenient ways to do them yet, but I believe that with a bit of work you can make these work!

Is there any suggested way to convert the model into .pte format for ExecuTorch?

Yes, I would suggest taking a look at the Llava example for multimodal LLMs: https://github.com/pytorch/executorch/blob/main/examples/models/llava/export_llava.py

We are actively working on a more general API that works on models with the same architecture.

Since this is a new architecture, is there any guidance or examples for adding custom models?

Again, please refer to the Llava example. We are open to taking a new model under the examples directory, as long as it works.

Can I still use the LlamaDemo Android app with this multimodal?

Most likely you can use the demo app for image prefill, but supporting audio will need a lot of work. Also keep in mind that right now the model and the runner are coupled, so it's really a case-by-case situation.
