model : jina-embeddings-v3 support #13693
base: master
Conversation
There are a few minor differences (unsure why) in the tokenizer test, where a handful of tokens differ between the reference tokenizer and ours.

Edit: Fixed in #13743
@ngxson @slaren When you have the time I would appreciate some feedback on how best to tackle the task LoRAs of this model. I think the best user experience would probably be to keep them embedded and add extra metadata for their names so they can be easily chosen via an option. However, that increases the scope of this PR quite a bit, as new mechanisms would need to be added to load and apply the right LoRA tensors at runtime. This seems a little excessive for just one model, but maybe it can be useful for others as well, I don't know? The less intrusive route would be to extract each LoRA into its own separate GGUF (albeit a more complicated conversion process) and make the user responsible for applying the correct one (and using the correct prompt), but that seems like fairly bad UX. The PR as-is works great and produces embeddings identical to the original.
I was thinking about supporting built-in LoRAs lately, as this is required for phi-4-multimodal. We can extend the current LoRA API to support this case, but eventually end users need a way to select it (for example via llama-server). For multimodal, it can be done easily via libmtmd. Another approach could be to add an enum of pre-defined LoRA types, which user code can switch at runtime. This is based on an earlier suggestion from @ggerganov about having multiple models in the same gguf. If I have time this weekend, I can push a draft PR showing how this can be done.
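Roughly something like this (just a sketch, not an existing llama.cpp API; the names are made up, based on jina-embeddings-v3's five tasks):

```cpp
// Hypothetical sketch: an enum of pre-defined LoRA task types that user
// code could switch at runtime. The task names mirror jina-embeddings-v3.
enum llama_lora_task {
    LLAMA_LORA_TASK_NONE = 0,          // base model, no adapter
    LLAMA_LORA_TASK_RETRIEVAL_QUERY,
    LLAMA_LORA_TASK_RETRIEVAL_PASSAGE,
    LLAMA_LORA_TASK_SEPARATION,
    LLAMA_LORA_TASK_CLASSIFICATION,
    LLAMA_LORA_TASK_TEXT_MATCHING,
};

// Hypothetical entry point: activate one of the adapters embedded in the GGUF.
// int32_t llama_set_lora_task(struct llama_context * ctx, enum llama_lora_task task);
```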
Why can't we use the existing LoRA mechanism that is already supported?

Btw, did you resolve the tokenization differences?
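i.e. something along these lines (a sketch assuming each task LoRA were extracted into its own GGUF; the file name is made up, and the signatures are from memory, so check llama.h):

```cpp
// Minimal sketch of the existing llama.cpp adapter flow.
llama_adapter_lora * adapter =
    llama_adapter_lora_init(model, "jina-v3-retrieval.query-lora.gguf");
llama_set_adapter_lora(ctx, adapter, 1.0f);  // apply with scale 1.0
// ... tokenize (with the task's prompt prefix) and embed as usual ...
llama_rm_adapter_lora(ctx, adapter);         // detach before switching tasks
```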
No, it seems like a bug/difference in the UGM tokenizer...
The LoRA API on the server is quite low-level; downstream apps would also have to explicitly set the LoRA according to the use case, which may not be good UX overall, especially when the LoRA provides commonly known tasks like embedding or reranking.
Another option could be to consider it as an extension to the embedding pooling selection.
Ok I understand - the adapters are embedded inside the GGUF file together with the model and we don't have a mechanism to load them.
Even if the adapters were embedded, the user still has to use the correct prompt. So the UX seems to be the same regardless of how the LoRAs are stored?
The thinking was that the prompt could be prefixed depending on the task selection (easily stored as metadata).
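For example, roughly like this (the prefix strings below are placeholders for whatever the metadata would store; the task names are jina-embeddings-v3's five tasks):

```cpp
#include <map>
#include <string>

// Illustrative only: prepend a task-specific prefix before embedding.
static const std::map<std::string, std::string> k_task_prefix = {
    { "retrieval.query",   "<query prefix from metadata>"   },
    { "retrieval.passage", "<passage prefix from metadata>" },
    { "separation",        "" },
    { "classification",    "" },
    { "text-matching",     "" },
};

static std::string build_input(const std::string & task, const std::string & text) {
    auto it = k_task_prefix.find(task);
    return it == k_task_prefix.end() ? text : it->second + text;
}
```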
@ggerganov I can confirm that it's an issue with the UGM tokenizer; the same thing happens with nomic-embed-text-v2-moe, for example.
I looked deeper into the jina model. It is a bit confusing to me, though.

If the non-LoRA use case is not practical, maybe it's simpler to just merge the LoRA into the weights.
No, there are 5 LoRAs, but they are all packed into the same tensors.
Not sure, just for reference I guess?
See here for how the correct task LoRA is loaded.
Ok I see, I hadn't looked at the tensor shapes. So if I understand correctly, it seems like the first 2 tasks share the same LoRA?
No, there are 5 adapters; the tensors are shaped like this: [tasks (5), rank (4), N]
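For reference, slicing one task out of the stacked tensor at load time could look roughly like this (a sketch assuming ggml's reversed dimension order, where the PyTorch shape [tasks, rank, N] becomes ne = {N, rank, tasks}):

```cpp
#include "ggml.h"

// Sketch: view the per-task [N, rank] slice of a stacked LoRA tensor.
// Task i starts at byte offset i * nb[2] along the tasks dimension.
static struct ggml_tensor * lora_slice_for_task(struct ggml_context * ctx,
                                                struct ggml_tensor * stacked,
                                                int64_t task) {
    return ggml_view_2d(ctx, stacked,
                        stacked->ne[0], stacked->ne[1], // N x rank slice
                        stacked->nb[1],                 // row stride
                        task * stacked->nb[2]);         // offset of this task
}
```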
@ngxson Had time to look at built-in LoRAs?
For reference, jina-embeddings-v4 was just released, and this time the LoRAs are in a separate file with separate weights for each task.

Edit: Oh, and it's multimodal (Qwen2.5 VL).
Added LoRA embedding and (non-functional) loading, need help figuring out how to actually load and apply the tensors. @slaren @ggerganov
@slaren Yes, I'm already separating them in 9a39ccb (and storing them as separately named tensors, inspired by jina-embeddings-v4). However, storing each task as a separate GGUF is not a good solution IMO (hence I'm now embedding everything in the model), as I'd like to make it as simple as possible to activate a task (i.e., set the correct LoRA and prepend the prompt prefix).
It may be very hard to justify making changes to the interfaces just for this one model.
TBF 2 models, but yeah. :) They are the best multi-lingual embedding models available though.
I understand the desire to make it as easy to use as possible, but I really don't think that it is too much to ask the users to apply the correct LoRA and use the correct prompt.
I don't think you've ever met a user. :) Anyway, I suppose this is something that can be solved by 3rd parties, I'll see if I can at least figure out a simple conversion process.
Ok, I think I found a nice compromise...
Sorry I missed the message. Indeed, I had an idea but haven't had time to work on it: the LoRA-handling code in llama.cpp currently validates that the LoRA GGUF contains only LoRA A/B tensors, and throws an exception if the file contains non-A/B tensors. However, allowing A/B and original tensors to cohabit in the same GGUF can be useful in the current case. We can modify that check to permit this. Going a bit further, we could also prefix the A/B tensor names per task. This of course deserves a dedicated PR, but I don't have any plan for it yet. Feel free to take over this task if you like.
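Roughly (a sketch; the ".lora_a"/".lora_b" suffix convention is what current LoRA GGUFs use, the rest is just an illustration of where the relaxed check could go):

```cpp
#include <string>

// Sketch: tolerate base-model tensors in the same GGUF instead of throwing.
static bool is_lora_ab_tensor(const std::string & name) {
    auto ends_with = [&](const char * suffix) {
        const std::string s(suffix);
        return name.size() >= s.size() &&
               name.compare(name.size() - s.size(), s.size(), s) == 0;
    };
    return ends_with(".lora_a") || ends_with(".lora_b");
}

// during adapter load:
//   if (is_lora_ab_tensor(name)) { /* collect into the adapter's A/B map */ }
//   else                         { /* base tensor: skip instead of throwing */ }
```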
Thanks for getting back to me on this. For now I think the current implementation is a good compromise; let me know if you think otherwise!
Let's get back to this in the future. I had a look into …
Support for jina-embeddings-v3
Work checklist
Task selection (using/enabling the correct LoRA) and prompt prefix is left to the user, but is made simpler by providing the task name and prompt prefix in the LoRA GGUF metadata (available from llama-server via the /lora-adapters endpoint).

Fixes #12327
Fixes #9585
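For illustration, a response from /lora-adapters with the proposed metadata could look roughly like this (the task_name and prompt_prefix field names, the file path, and the prefix string are assumptions for the example, not final):

```json
[
  {
    "id": 0,
    "path": "jina-embeddings-v3-retrieval.query-lora.gguf",
    "scale": 0.0,
    "task_name": "retrieval.query",
    "prompt_prefix": "Represent the query for retrieving evidence documents: "
  }
]
```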