
(research) experiment with phi-4-multimodal vision support #12274


Draft · ngxson wants to merge 3 commits into master

Conversation

@ngxson (Collaborator) commented Mar 8, 2025

What is this?

This is an experiment I built around llava and llama_adapter_lora in order to support the vision part of phi-4-multimodal.

Important

I have no intention of merging this PR. Please do NOT share the gguf generated from this branch; this is a WIP and may break at any time in the future.

UPDATE: It's now working! See this comment

My goals for creating this PR are:

  • Seek help from the community
  • Express my frustration when working with the python code written by Microsoft. Their code is terrible to look at (I am serious, see the explanation in the comments below)
  • Prepare for llama : second attempt to refactor vision API #11292, so I can simply copy the compute graph over to the new llama-vision infrastructure once it's ready

Technical details

The vision part of phi-4:

  • Is based on SigLIP-400M, which should be similar to clip (so we can re-use the same compute graph from llava)
  • Uses a 2-matrix MLP projector, exactly the same as llava
  • Before projection, they do a 2D average pool to reduce the number of image tokens (in this case, from 4096 tokens down to 256). This is referred to as image_token_compression_cls in the python code; a minimal sketch follows this list
  • Before decoding image tokens with the text model, vision-lora is applied
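
For intuition, here is a minimal sketch of that token compression, assuming the 4096 tokens form a 64x64 patch grid pooled with a stride of 4 and a SigLIP-400M hidden size of 1152 (both are assumptions on my side; the real grid size and pooling stride depend on the HD multi-crop settings):

import torch
import torch.nn.functional as F

# Assumed shapes: 4096 patch tokens arranged as a 64x64 grid, hidden size 1152.
n_patches, hidden = 4096, 1152
side = int(n_patches ** 0.5)  # 64

tokens = torch.randn(1, n_patches, hidden)                    # [batch, tokens, hidden]
grid = tokens.transpose(1, 2).reshape(1, hidden, side, side)  # [batch, hidden, 64, 64]

pooled = F.avg_pool2d(grid, kernel_size=4, stride=4)          # [batch, hidden, 16, 16]
compressed = pooled.flatten(2).transpose(1, 2)                # [batch, 256, hidden]

print(compressed.shape)  # torch.Size([1, 256, 1152])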

So my plan is to create a new phi4mm-cli.cpp that will handle applying LoRA before decoding image tokens, then remove it after decoding is done.
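
Roughly, the flow I have in mind for phi4mm-cli.cpp looks like this, written here as python-style pseudocode; every helper name below is a placeholder for illustration, not an actual llama.cpp function:

# Pseudocode sketch of the planned flow; all helpers are hypothetical placeholders.
def run_phi4mm(text_model, mmproj, vision_lora, image, prompt):
    # 1. Encode the image with the SigLIP-based encoder + MLP projector (mmproj).
    image_embd = encode_image(mmproj, image)        # 256 projected embeddings

    # 2. Apply vision-lora only while the image embeddings are decoded.
    apply_lora(text_model, vision_lora)
    decode_embeddings(text_model, image_embd)

    # 3. Remove the adapter again once decoding of the image tokens is done.
    remove_lora(text_model, vision_lora)

    # 4. Decode the text prompt and generate the answer as usual.
    return generate(text_model, prompt)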

Current problems (potentially you can help!)

UPDATE: it already runs without hd_transform_order, but it would be nice to have it.

The main problem is an extra step before projection, referred to as hd_transform_order in the python code; I still have no idea how it works after literally spending 30 minutes staring at the screen.

It could be linked to something called "a new dynamic multi-crop strategy" as described in their paper, but there is no good documentation for it.

How can I try this?

Important

This is an advanced guide; only try it if you understand what you are doing.

Clone https://huggingface.co/microsoft/Phi-4-multimodal-instruct onto your computer.

Step 1: Get the text model

python ~/work/llama.cpp-ngxson/convert_hf_to_gguf.py --outfile text_model.gguf Phi-4-multimodal-instruct
# output file: text_model.gguf

Step 2: Get the vision encoder:

python ~/work/llama.cpp-ngxson/examples/llava/phi4mm_convert_encoder_to_gguf.py --outfile mmproj.gguf Phi-4-multimodal-instruct
# output file: mmproj.gguf

Step 3: Get the vision LoRA:

vi Phi-4-multimodal-instruct/vision-lora/adapter_config.json
# Change the line:
#   "base_model_name_or_path": "TBA",
# to:
#   "base_model_name_or_path": "microsoft/Phi-4-multimodal-instruct",

python convert_lora_to_gguf.py --outfile vision_lora.gguf Phi-4-multimodal-instruct/vision-lora
# output file: vision_lora.gguf
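
If you prefer not to edit the file by hand, the same adapter_config.json change can be scripted (a small helper of mine, not part of this branch):

import json

path = "Phi-4-multimodal-instruct/vision-lora/adapter_config.json"

with open(path) as f:
    config = json.load(f)

# convert_lora_to_gguf.py needs a real base model id instead of the placeholder "TBA".
config["base_model_name_or_path"] = "microsoft/Phi-4-multimodal-instruct"

with open(path, "w") as f:
    json.dump(config, f, indent=2)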

Step 4: Compile llama-phi4mm-cli

cmake -B build
cmake --build build -j --target llama-phi4mm-cli
# output: build/bin/llama-phi4mm-cli

Step 5: Run it

./build/bin/llama-phi4mm-cli -m text_model.gguf \
  --mmproj mmproj.gguf \
  --lora vision_lora.gguf \
  --image ./bliss.png

In my case, I use the Windows XP wallpaper as input:

[image: the Windows XP "Bliss" wallpaper used as input]

Output:

I see a vast, rolling green hill under a clear blue sky dotted with fluffy white clouds. The grass appears lush and vibrant, suggesting a serene and picturesque landscape. The scene evokes a sense of tranquility and natural beauty.<|end|>

@zhouwg (Contributor) commented Mar 9, 2025

* Express my frustration when working with the [python code](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/modeling_phi4mm.py) written by microsoft. Their code is pretty much 💩

I strongly agree with your feeling (MS's VS2022 IDE and toolchain are also a xxxx of xxxx, #12215), but I don't dare to use those words; I would use them directly to express my true feelings if I were you (also a true AI expert). Thanks for your PR and post!

@ngxson (Collaborator, Author) commented Mar 9, 2025

I strongly agree with your feeling (MS's VS2022 IDE and toolchain are also a xxxx of xxxx, #12215), but I don't dare to use those words; I would use them directly to express my true feelings if I were you (also a true AI expert). Thanks for your PR and post!

Yes, that's why I use an emoji instead of words; it's a good way to express emotion without being accused of using offensive language.

Also, this is no more than Linus saying "f**k you, Nvidia". Basically, nowadays many companies rush to release a product without actually caring about the community aspect, and sadly Microsoft is slowly shifting in this direction.

And to add context to my frustration: this is not the first time. For phi-3, I was already frustrated that they use 4 different chat templates to express literally the same thing, see the phi-3 family. The sliding window config is also a mess, which makes me wonder if phi-3 was actually made by many different teams instead of one single team.

And for phi-4, let me explain why the code is bad. Look at this:

[screenshot of modeling_phi4mm.py showing audio_projection_mode being set to "vision"]

There is a variable named audio_projection_mode, and they set it to "vision". WTF?

And that's not all, see:

[screenshot of modeling_phi4mm.py showing self.audio_projection built as either an nn.Sequential or an nn.ModuleDict]

So self.audio_projection can be either an nn.Sequential or an nn.ModuleDict?? WTF x2 man??
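
For readers who cannot see the screenshots, the pattern being criticized looks roughly like this (paraphrased from memory with made-up dimensions, not copied from modeling_phi4mm.py):

import torch.nn as nn

# Paraphrased illustration only; names mimic modeling_phi4mm.py, dimensions are made up.
audio_projection_mode = "vision"  # an *audio* variable set to "vision"

# Depending on the config, self.audio_projection ends up as either a single
# nn.Sequential or an nn.ModuleDict of per-mode projectors:
audio_projection = nn.Sequential(nn.Linear(1024, 3072), nn.GELU(), nn.Linear(3072, 3072))
# ...or...
audio_projection = nn.ModuleDict({
    "speech": nn.Sequential(nn.Linear(1024, 3072), nn.GELU(), nn.Linear(3072, 3072)),
    "vision": nn.Sequential(nn.Linear(1024, 3072), nn.GELU(), nn.Linear(3072, 3072)),
})
# Every consumer of audio_projection then has to branch on isinstance(...).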

@ngxson (Collaborator, Author) commented Mar 9, 2025

Firstly, they don't take the last layer of the vision tower; they take n_layers - 2 instead. Normally this would be found in config.json, but they decided to bury it in the code instead:

[screenshot of modeling_phi4mm.py showing the feature layer index hard-coded in the code]
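
For reference, selecting that intermediate layer with transformers looks roughly like this (a sketch only, using the standalone SigLIP checkpoint and output_hidden_states; the real vision tower is loaded from the phi-4-multimodal checkpoint itself):

import torch
from PIL import Image
from transformers import SiglipImageProcessor, SiglipVisionModel

# Sketch only: the standalone SigLIP repo stands in for the phi-4 vision tower.
model = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224")
processor = SiglipImageProcessor.from_pretrained("google/siglip-base-patch16-224")

inputs = processor(images=Image.open("bliss.png"), return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states = [embeddings, layer 1, ..., layer N]; index -2 picks the
# second-to-last hidden state instead of the final layer output.
patch_features = out.hidden_states[-2]
print(patch_features.shape)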

Secondly, this piece of code is misleading. Since patch_feature is the output tensor (a 2D tensor), there is no longer any notion of "channel". Yet they still write "NCHW", with "C" referring to "channel", if I understand correctly:

[screenshot of modeling_phi4mm.py showing the "NCHW" comment on patch_feature]

@ngxson (Collaborator, Author) commented Mar 9, 2025

💩, now it works!

Input image:

[image: the Windows XP "Bliss" wallpaper used as input]

I see a vast, rolling green hill under a clear blue sky dotted with fluffy white clouds. The grass appears lush and vibrant, suggesting a serene and picturesque landscape. The scene evokes a sense of tranquility and natural beauty.<|end|>

@CISC (Collaborator) commented Mar 9, 2025

sadly Microsoft is slowly shifting in this direction.

If you've ever worked with one of their APIs, you'd know this has been their modus operandi for quite a while. But hey, good luck, I see you've made good progress already! :)

Comment on lines +19 to +47
SIGLIP_MODEL = {
    "model_id": "google/siglip-base-patch16-224",
    "image_size": 448,
    "patch_size": 14,  # I had a very hard time finding this number
    "do_normalize": True,
    "do_rescale": True,
    "do_resize": True,
    "image_mean": [
        0.5,
        0.5,
        0.5
    ],
    "image_processor_type": "SiglipImageProcessor",
    "image_std": [
        0.5,
        0.5,
        0.5
    ],
    "processor_class": "SiglipProcessor",
    "resample": 3,
    "rescale_factor": 0.00392156862745098,
    "size": {
        "height": 224,
        "width": 224
    }
}
N_LAYERS = 27
FEATURE_LAYER = -2
HEAD_COUNT = 16
@ngxson (Collaborator, Author) commented:

It would be nice if this whole thing could be pulled from a JSON config instead of staying hard-coded here, but the original model provides no such thing.
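
If such a config ever ships with the model, the hard-coded dict could become a fallback instead (a sketch; the preprocessor_config.json filename is my assumption):

import json
from pathlib import Path

def load_siglip_config(model_dir: str) -> dict:
    """Prefer a JSON config shipped with the model; fall back to the
    hard-coded SIGLIP_MODEL dict above when it is missing."""
    cfg_path = Path(model_dir) / "preprocessor_config.json"  # assumed filename
    if cfg_path.exists():
        with open(cfg_path) as f:
            return {**SIGLIP_MODEL, **json.load(f)}
    return SIGLIP_MODEL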

Labels: examples, python · 3 participants