(research) experiment with phi-4-multimodal vision support #12274
Conversation
I strongly agree with your feeling (MS's VS2022 IDE and VS2022 toolchain is also a xxxx of xxxx, #12215), but I don't dare to use those words. If I were you (also a true AI expert), I would use them directly to express my true feelings. Thanks for your PR and post!
Yes, that's why I use emoji instead of words; it's a good way to express emotion without being accused of using offensive language. Also, this is no more than Linus saying "f**k you Nvidia". Basically, nowadays many companies rush to release a product without actually caring about the community aspect, and sadly Microsoft is slowly shifting in this direction.

And to add to the context of my frustration: this is not the first time. For phi-3, I already got a small frustration where they use 4 different chat templates to express literally the same thing, see phi-3 family. The sliding window config is also a mess, which makes me wonder whether phi-3 was actually made by many different teams instead of one single team.

And for phi-4, let me explain why the code is bad. Look at this:

[screenshot]

There is a variable named …

Then that's not all, see:

[screenshot]

So …
If you've ever worked with one of their APIs, you'd know this has been their modus operandi for quite a while. But hey, good luck, I see you've made good progress already! :)
```py
SIGLIP_MODEL = {
    "model_id": "google/siglip-base-patch16-224",
    "image_size": 448,
    "patch_size": 14,  # I had a very hard time finding this number
    "do_normalize": True,
    "do_rescale": True,
    "do_resize": True,
    "image_mean": [0.5, 0.5, 0.5],
    "image_processor_type": "SiglipImageProcessor",
    "image_std": [0.5, 0.5, 0.5],
    "processor_class": "SiglipProcessor",
    "resample": 3,
    "rescale_factor": 0.00392156862745098,
    "size": {
        "height": 224,
        "width": 224
    }
}
N_LAYERS = 27
FEATURE_LAYER = -2
HEAD_COUNT = 16
```
It would be nice if this whole thing could be pulled from a JSON config instead of staying hard-coded here, but the original model provides no such thing.
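For illustration, a minimal sketch of what that could look like: read the preprocessing parameters from a JSON file when one exists and fall back to the hard-coded `SIGLIP_MODEL` defaults otherwise. The helper below and the file name it looks for are assumptions, not something this branch or the original model provides.

```py
import json
from pathlib import Path

def load_siglip_config(model_dir: str) -> dict:
    """Hypothetical helper: prefer a JSON config if present, else use SIGLIP_MODEL."""
    cfg = dict(SIGLIP_MODEL)  # hard-coded defaults from above
    cfg_path = Path(model_dir) / "preprocessor_config.json"  # assumed file name
    if cfg_path.is_file():
        with cfg_path.open("r", encoding="utf-8") as f:
            cfg.update(json.load(f))  # values from the JSON override the defaults
    return cfg
```

Usage would then be something like `cfg = load_siglip_config("path/to/Phi-4-multimodal-instruct")` instead of reading the module-level constants directly.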
What is this?
This is an experiment that I made around `llava` and `llama_adapter_lora` in order to support the vision part of phi-4-multimodal.

Important

I have no intention of merging this PR. Please do NOT share the GGUF generated from this branch; this is a WIP and may break at any time in the future.
UPDATE: It's now working! See this comment
My goals for creating this PR are:
Technical details
The vision part of phi-4:
- there is an `image_token_compression_cls` in the python code
- `vision-lora` will be applied

So my plan is to create a new `phi4mm-cli.cpp` that will handle applying LoRA before decoding image tokens, then remove it after decoding is done (see the sketch below).
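To make that sequencing concrete, here is a rough structural sketch. It is plain Python with stub helpers; every name in it is a hypothetical placeholder (not the real llama.cpp, clip, or phi4mm-cli API), and the only point is the ordering: the vision LoRA is attached for the image pass and detached again before normal text decoding.

```py
# Structural sketch only: each function is a stand-in for what phi4mm-cli.cpp
# would eventually do through the llama.cpp C API.

def load_text_model(path: str) -> dict:
    # stub: load the base text model (the GGUF from step 1 below)
    return {"path": path, "active_loras": []}

def attach_lora(model: dict, lora_path: str) -> None:
    # stub: apply the vision LoRA (the GGUF from step 3 below)
    model["active_loras"].append(lora_path)

def detach_lora(model: dict, lora_path: str) -> None:
    # stub: remove the vision LoRA once the image tokens are decoded
    model["active_loras"].remove(lora_path)

def decode_image_tokens(model: dict, image_path: str) -> None:
    # stub: SigLIP encoder (step 2) -> projector -> decode the image embeddings
    print(f"decoding {image_path} with LoRAs {model['active_loras']}")

def decode_text(model: dict, prompt: str) -> None:
    # stub: ordinary text decoding, which must run without the vision LoRA
    print(f"decoding {prompt!r} with LoRAs {model['active_loras']}")

model = load_text_model("phi-4-text.gguf")
attach_lora(model, "vision-lora.gguf")    # LoRA active only for the image pass
decode_image_tokens(model, "image.png")
detach_lora(model, "vision-lora.gguf")    # back to the plain text model
decode_text(model, "Describe the image.")
```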
Current problems (potentially you can help!)

UPDATE: it can already run without `hd_transform_order`, but it would be nice to have it.

The main problem is that there is an extra step before the projection, referred to as `hd_transform_order` in the python code, but I still have no idea how it works after literally spending 30 minutes staring at the screen.

It could be linked to something called "a new dynamic multi-crop strategy" as described in their paper, but there is no good source of documentation.
How can I try this?
Important
This is an advanced guide; only try it if you understand it.
Clone https://huggingface.co/microsoft/Phi-4-multimodal-instruct onto your computer.

Step 1: Get the text model
Step 2: Get the vision encoder:
Step 3: Get the vision LoRA:
Step 4: Compile `llama-phi4mm-cli`

```sh
cmake -B build
cmake --build build -j --target llama-phi4mm-cli
# output: build/bin/llama-phi4mm-cli
```
Step 5: Run it
In my case, I use the Windows XP wallpaper as input:
Output:
I see a vast, rolling green hill under a clear blue sky dotted with fluffy white clouds. The grass appears lush and vibrant, suggesting a serene and picturesque landscape. The scene evokes a sense of tranquility and natural beauty.<|end|>