(research) experiment with phi-4-multimodal vision support #12274
Conversation
I strongly agree with your feeling (MS's VS2022 IDE and VS2022 toolchain is also a xxxx of xxxx, #12215), but I don't dare to use those words. If I were you (also a true AI expert), I would use them directly to express my true feelings. Thanks for your PR and post!
Yes, that's why I use emoji instead of words; it's a good way to express emotion without being accused of using offensive language. Also, this is no more than Linus saying "f**k you Nvidia". Basically, nowadays many companies rush to release a product without actually caring about the community aspect, and sadly Microsoft is slowly shifting in this direction.

And to add to the context of my frustration: this is not the first time. For phi-3, I already got a small frustration where they use 4 different chat templates to express literally the same thing, see phi-3 family. The sliding window config is also a mess, which makes me wonder whether phi-3 was actually made by many different teams instead of one single team.

And for phi-4, let me explain why the code is bad. Look at this:

[screenshot]

There is a variable named …

Then that's not all, see:

[screenshot]

So …
If you've ever worked with one of their APIs, you'd know this has been their modus operandi for quite a while. But hey, good luck, I see you've made good progress already! :)
```py
SIGLIP_MODEL = {
    "model_id": "google/siglip-base-patch16-224",
    "image_size": 448,
    "patch_size": 14,  # I had a very hard time finding this number
    "do_normalize": True,
    "do_rescale": True,
    "do_resize": True,
    "image_mean": [0.5, 0.5, 0.5],
    "image_processor_type": "SiglipImageProcessor",
    "image_std": [0.5, 0.5, 0.5],
    "processor_class": "SiglipProcessor",
    "resample": 3,
    "rescale_factor": 0.00392156862745098,
    "size": {
        "height": 224,
        "width": 224
    }
}
N_LAYERS = 27
FEATURE_LAYER = -2
HEAD_COUNT = 16
```
It would be nice if this whole thing could be pulled from a JSON config instead of staying hard-coded here, but the original model provides no such thing.
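For illustration, a minimal sketch of what that could look like: read the preprocessing parameters from a JSON file when one exists and fall back to the hard-coded `SIGLIP_MODEL` defaults otherwise. The helper below and the file name it looks for are assumptions, not something this branch or the original model provides.

```py
import json
from pathlib import Path

def load_siglip_config(model_dir: str) -> dict:
    """Hypothetical helper: prefer a JSON config if present, else use SIGLIP_MODEL."""
    cfg = dict(SIGLIP_MODEL)  # hard-coded defaults from above
    cfg_path = Path(model_dir) / "preprocessor_config.json"  # assumed file name
    if cfg_path.is_file():
        with cfg_path.open("r", encoding="utf-8") as f:
            cfg.update(json.load(f))  # values from the JSON override the defaults
    return cfg
```

Usage would then be something like `cfg = load_siglip_config("path/to/Phi-4-multimodal-instruct")` instead of reading the module-level constants directly.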
What is this?
This is an experiment that I made around `llava` and `llama_adapter_lora` in order to support the vision part of phi-4-multimodal.

Important

I have no intention of merging this PR. Please do NOT share the GGUF generated from this branch; this is a WIP and may break at any time in the future.
UPDATE: It's now working! See this comment
My goals for creating this PR are:
Technical details
The vision part of phi-4:
- there is an `image_token_compression_cls` in the python code
- `vision-lora` will be applied

So my plan is to create a new `phi4mm-cli.cpp` that will handle applying LoRA before decoding image tokens, then remove it after decoding is done (see the sketch below).
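To make that sequencing concrete, here is a rough structural sketch. It is plain Python with stub helpers; every name in it is a hypothetical placeholder (not the real llama.cpp, clip, or phi4mm-cli API), and the only point is the ordering: the vision LoRA is attached for the image pass and detached again before normal text decoding.

```py
# Structural sketch only: each function is a stand-in for what phi4mm-cli.cpp
# would eventually do through the llama.cpp C API.

def load_text_model(path: str) -> dict:
    # stub: load the base text model (the GGUF from step 1 below)
    return {"path": path, "active_loras": []}

def attach_lora(model: dict, lora_path: str) -> None:
    # stub: apply the vision LoRA (the GGUF from step 3 below)
    model["active_loras"].append(lora_path)

def detach_lora(model: dict, lora_path: str) -> None:
    # stub: remove the vision LoRA once the image tokens are decoded
    model["active_loras"].remove(lora_path)

def decode_image_tokens(model: dict, image_path: str) -> None:
    # stub: SigLIP encoder (step 2) -> projector -> decode the image embeddings
    print(f"decoding {image_path} with LoRAs {model['active_loras']}")

def decode_text(model: dict, prompt: str) -> None:
    # stub: ordinary text decoding, which must run without the vision LoRA
    print(f"decoding {prompt!r} with LoRAs {model['active_loras']}")

model = load_text_model("phi-4-text.gguf")
attach_lora(model, "vision-lora.gguf")    # LoRA active only for the image pass
decode_image_tokens(model, "image.png")
detach_lora(model, "vision-lora.gguf")    # back to the plain text model
decode_text(model, "Describe the image.")
```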
Current problems (potentially you can help!)

UPDATE: it can already run without `hd_transform_order`, but it would be nice to have it.

The main problem is that there is an extra step before the projection, referred to as `hd_transform_order` in the python code, but I still have no idea how it works after literally spending 30 minutes staring at the screen.

It could be linked to something called "a new dynamic multi-crop strategy" as described in their paper, but there is no good source of documentation.
How can I try this?
Important
This is an advanced guide; only try it if you understand it.
Clone https://huggingface.co/microsoft/Phi-4-multimodal-instruct onto your computer.

Step 1: Get the text model
Step 2: Get the vision encoder:
Step 3: Get the vision LoRA:
Step 4: Compile `llama-phi4mm-cli`

```sh
cmake -B build
cmake --build build -j --target llama-phi4mm-cli
# output: build/bin/llama-phi4mm-cli
```
Step 5: Run it
In my case, I use the Windows XP wallpaper as input:
Output:
I see a vast, rolling green hill under a clear blue sky dotted with fluffy white clouds. The grass appears lush and vibrant, suggesting a serene and picturesque landscape. The scene evokes a sense of tranquility and natural beauty.<|end|>