mtmd : support InternVL 2.5 and 3 #13422


Merged: 9 commits into ggml-org:master on May 10, 2025

Conversation

ngxson (Collaborator) commented May 10, 2025

WIP

Tested with:

  • InternVL 3: 1B, 2B, 8B, 14B
  • InternVL 2.5: 1B, 4B (note: for certain sizes, conversion fails due to a broken tokenizer)
# InternVL 2.5 and 3
(tool_name) -hf ggml-org/InternVL2_5-1B-GGUF
(tool_name) -hf ggml-org/InternVL2_5-2B-GGUF
(tool_name) -hf ggml-org/InternVL3-1B-Instruct-GGUF
(tool_name) -hf ggml-org/InternVL3-2B-Instruct-GGUF
(tool_name) -hf ggml-org/InternVL3-4B-Instruct-GGUF
(tool_name) -hf ggml-org/InternVL3-14B-Instruct-GGUF
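
For reference, a typical way to run one of the downloaded models against an image looks something like this (the image path and prompt are placeholders):

llama-mtmd-cli -hf ggml-org/InternVL3-2B-Instruct-GGUF --image ./test.jpg -p "Describe this image."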

Test result:

(NOTE: the MobileVLM test has been removed, see comment below)

OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K
OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/pixtral-12b-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
OK:   llama-mtmd-cli ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/InternVL3-8B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/InternVL3-14B-Instruct-GGUF:Q4_K_M

github-actions bot added the examples and python (python script changes) labels on May 10, 2025
ngxson changed the title from mtmd : support InternVL 2 and 3 to mtmd : support InternVL 2.5 and 3 on May 10, 2025
github-actions bot added the documentation (Improvements or additions to documentation) label on May 10, 2025
Comment on lines +3081 to +3083
if (ctx->vision_model.mm_glm_tok_boi) {
n_patches += 2; // for BOI and EOI token embeddings
}
ngxson (Collaborator, Author)

Note: this change breaks MobileVLM in the test. But after closer inspection, I think MobileVLM has had a broken chat template from the beginning, which made it pass the test only by coincidence.

I decided to remove the test for MobileVLM because it seems no one uses that model anymore; it is simply not practically usable without a proper chat template.

ngxson marked this pull request as ready for review on May 10, 2025 at 13:27
ngxson requested a review from ggerganov on May 10, 2025 at 13:28
Comment on lines +3529 to +3536
// sanity check (only support batch size of 1 for now)
const int n_tokens_out = embeddings->ne[1];
const int expected_n_tokens_out = clip_n_output_tokens(ctx, imgs.entries[0].get());
if (n_tokens_out != expected_n_tokens_out) {
LOG_ERR("%s: expected %d tokens, got %d\n", __func__, expected_n_tokens_out, n_tokens_out);
GGML_ABORT("Invalid number of output tokens");
}

ngxson (Collaborator, Author)

This sanity check should prevent problems similar to #13381


// projector (always using GELU activation)
{
cur = build_norm(cur, model.mm_0_w, model.mm_0_b, NORM_TYPE_NORMAL, 1e-5, -1);

Member

Is this norm epsilon hardcoded to 1e-5, or is it a parameter from the model config?

ngxson (Collaborator, Author)

The original code uses the LayerNorm default value, which is 1e-5 according to PyTorch's docs.

I'm adding a comment now:

Suggested change
cur = build_norm(cur, model.mm_0_w, model.mm_0_b, NORM_TYPE_NORMAL, 1e-5, -1);
// projector LayerNorm uses pytorch's default eps = 1e-5
// ref: https://huggingface.co/OpenGVLab/InternVL3-8B-Instruct/blob/a34d3e4e129a5856abfd6aa6de79776484caa14e/modeling_internvl_chat.py#L79
cur = build_norm(cur, model.mm_0_w, model.mm_0_b, NORM_TYPE_NORMAL, 1e-5, -1);
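
As a quick cross-check of the claim above (a minimal sketch, assuming PyTorch is installed), nn.LayerNorm created without an explicit eps reports the 1e-5 default:

import torch

ln = torch.nn.LayerNorm(1024)  # no eps passed, so the library default applies
print(ln.eps)                  # prints 1e-05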

ngxson merged commit 053367d into ggml-org:master on May 10, 2025
46 of 47 checks passed
city96 (Contributor) commented May 10, 2025

Are there plans to support the larger 38B and 78B models as well? According to the Hugging Face model card, the larger models use InternViT-6B-448px-V2_5 instead of InternViT-300M-448px-V2_5.

The actual keys for the vision model are slightly different, so the current conversion script fails with:

ValueError: Can not map tensor 'vision_tower.vision_model.encoder.layers.0.attn.k_norm.weight'


city96 (Contributor) commented May 10, 2025

Actually, I think I got it working, will post a PR in a bit.

mingyi456

Hi @ngxson, there is also a 9B version of InternVL 3, could you please test that as well? I find it interesting because the text model is based on InternLM 3 instead of Qwen 2.5.

nicoboss (Contributor) commented May 13, 2025

Hi @ngxson, there is also a 9B version of InternVL 3, could you please test that as well? I find it interesting because the text model is based on InternLM 3 instead of Qwen 2.5.

I just tested all the mainline InternVL models. @mingyi456 All 9B versions of InternVL 3 failed. At first they failed due to a missing preprocessor_config.json, but you can just take that file from any other InternLM 3 model. However, even after doing so, conversion still fails with IndexError: piece id is out of range because of a broken tokenizer. @ngxson mentioned in the initial post that some InternVL 2.5 based models have this issue, but apparently it affects InternVL3-9B, InternVL3-9B-Instruct and InternVL3-9B-Pretrained as well.
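
For illustration, the preprocessor_config.json workaround can be scripted roughly like this (a minimal sketch, not the exact steps used; the donor repo and local directory names are just examples):

# minimal sketch: borrow preprocessor_config.json from a model repo that ships it
from huggingface_hub import hf_hub_download
import shutil

src = hf_hub_download(repo_id="OpenGVLab/InternVL3-8B-Instruct",  # example donor repo
                      filename="preprocessor_config.json")
shutil.copy(src, "InternVL3-9B-Instruct/preprocessor_config.json")  # hypothetical local model dir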

Models with missing preprocessor_config.json (can be fixed by using the preprocessor_config.json from a similar model):

  • InternVL3-9B
  • InternVL3-9B-Instruct
  • InternVL3-9B-Pretrained
  • InternVL2_5-78B
  • InternVL2_5-78B-MPO
  • InternVL2-40B

Models with broken tokenizer (unfixable):

  • InternVL3-9B
  • InternVL3-9B-Instruct
  • InternVL3-9B-Pretrained
  • InternVL2_5-2B
  • InternVL2_5-8B
  • InternVL2_5-26B
  • InternVL2_5-2B-MPO
  • InternVL2_5-8B-MPO
  • InternVL2_5-26B-MPO
  • InternVL2-8B-MPO
  • InternVL2-2B
  • InternVL2-8B
  • InternVL2-26B
  • Mini-InternVL-Chat-2B-V1-5
  • InternVL-Chat-V1-5

Incompatible V1, V1.5 and V2 models:
Note: none of these are officially supported by this PR; all models not listed here either work or fail because of the issues above.

  • InternVL2-4B
  • InternVL2-40B
  • InternVL2-Llama3-76B
  • Mini-InternVL-Chat-4B-V1-5
  • InternVL-Chat-V1-1
  • InternVL-Chat-V1-2
  • InternVL-Chat-V1-2-Plus

All InternVL2_5 and InternLM 3 main series models not mentioned above worked without any issues.

Labels: documentation, examples, python