model : jina-embeddings-v3 support #13693
base: master
Conversation
There are a few minor differences (unsure why) in the tokenizer test, where a handful of tokens differ between the reference tokenizer and ours.

Edit: Fixed in #13743
@ngxson @slaren When you have the time I would appreciate some feedback on how best to tackle the task LoRAs of this model. I think the best user experience would probably be to keep them embedded and add extra metadata for their names so they can be easily chosen via an option. However, that increases the scope of this PR quite a bit, as new mechanisms would need to be added to load and apply the right LoRA tensors at runtime. This seems a little excessive for just one model, but maybe it can be useful for others as well, I don't know? The less intrusive route would be to extract each LoRA into its own separate GGUF (albeit a more complicated conversion process) and make the user responsible for applying the correct one (and using the correct prompt), but that seems like fairly bad UX. The PR as-is works great and produces embeddings identical to the original.
I was thinking about supporting built-in LoRAs lately, as this is required for phi-4-multimodal. We can extend the current LoRA API to support this case, but eventually end users need a way to select it (for example via llama-server). For multimodal, it can be done easily via libmtmd. Another approach could be to add an enum of pre-defined LoRA types, which user code can switch at runtime. This is based on an earlier suggestion from @ggerganov about having multiple models in the same gguf. If I have time this weekend, I can push a draft PR showing how this can be done.
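Roughly something like this (just a sketch, not an existing llama.cpp API; the names are made up, based on jina-embeddings-v3's five tasks):

```cpp
// Hypothetical sketch: an enum of pre-defined LoRA task types that user
// code could switch at runtime. The task names mirror jina-embeddings-v3.
enum llama_lora_task {
    LLAMA_LORA_TASK_NONE = 0,          // base model, no adapter
    LLAMA_LORA_TASK_RETRIEVAL_QUERY,
    LLAMA_LORA_TASK_RETRIEVAL_PASSAGE,
    LLAMA_LORA_TASK_SEPARATION,
    LLAMA_LORA_TASK_CLASSIFICATION,
    LLAMA_LORA_TASK_TEXT_MATCHING,
};

// Hypothetical entry point: activate one of the adapters embedded in the GGUF.
// int32_t llama_set_lora_task(struct llama_context * ctx, enum llama_lora_task task);
```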
Why can't we use the existing LoRA mechanism that is already supported?

Btw, did you resolve the tokenization differences?
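i.e. something along these lines (a sketch assuming each task LoRA were extracted into its own GGUF; the file name is made up, and the signatures are from memory, so check llama.h):

```cpp
// Minimal sketch of the existing llama.cpp adapter flow.
llama_adapter_lora * adapter =
    llama_adapter_lora_init(model, "jina-v3-retrieval.query-lora.gguf");
llama_set_adapter_lora(ctx, adapter, 1.0f);  // apply with scale 1.0
// ... tokenize (with the task's prompt prefix) and embed as usual ...
llama_rm_adapter_lora(ctx, adapter);         // detach before switching tasks
```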
No, it seems like a bug/difference in the UGM tokenizer...
The LoRA API on the server is quite low-level; downstream apps would also have to explicitly set the LoRA according to the use case, which may not be good UX overall, especially when the LoRA provides commonly known tasks like embedding or reranking.
Another option could be to consider it as an extension to the embedding pooling selection.
Ok I understand - the adapters are embedded inside the GGUF file together with the model and we don't have a mechanism to load them.
Even if the adapters were embedded, the user still has to use the correct prompt. So the UX seems to be the same regardless of how the LoRAs are stored?
The thinking was that the prompt could be prefixed depending on the task selection (easily stored as metadata).
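For example, roughly like this (the prefix strings below are placeholders for whatever the metadata would store; the task names are jina-embeddings-v3's five tasks):

```cpp
#include <map>
#include <string>

// Illustrative only: prepend a task-specific prefix before embedding.
static const std::map<std::string, std::string> k_task_prefix = {
    { "retrieval.query",   "<query prefix from metadata>"   },
    { "retrieval.passage", "<passage prefix from metadata>" },
    { "separation",        "" },
    { "classification",    "" },
    { "text-matching",     "" },
};

static std::string build_input(const std::string & task, const std::string & text) {
    auto it = k_task_prefix.find(task);
    return it == k_task_prefix.end() ? text : it->second + text;
}
```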
@ggerganov I can confirm that it's an issue with the UGM tokenizer; the same thing happens with nomic-embed-text-v2-moe, for example.
I looked deeper into the jina model. It is a bit confusing to me, though.

If the non-LoRA use case is not practical, maybe it's simpler to just merge the LoRA into the weights.
No, there are 5 LoRAs, but they are all packed into the same tensors.
Not sure, just for reference I guess?
See here for how the correct task LoRA is loaded.
Ok I see, I hadn't looked at the tensor shapes. So if I understand correctly, it seems like the first 2 tasks share the same LoRA?
No, there are 5 adapters; the tensors are shaped like this: [tasks (5), rank (4), N]
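For reference, slicing one task out of the stacked tensor at load time could look roughly like this (a sketch assuming ggml's reversed dimension order, where the PyTorch shape [tasks, rank, N] becomes ne = {N, rank, tasks}):

```cpp
#include "ggml.h"

// Sketch: view the per-task [N, rank] slice of a stacked LoRA tensor.
// Task i starts at byte offset i * nb[2] along the tasks dimension.
static struct ggml_tensor * lora_slice_for_task(struct ggml_context * ctx,
                                                struct ggml_tensor * stacked,
                                                int64_t task) {
    return ggml_view_2d(ctx, stacked,
                        stacked->ne[0], stacked->ne[1], // N x rank slice
                        stacked->nb[1],                 // row stride
                        task * stacked->nb[2]);         // offset of this task
}
```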
@ngxson Had time to look at built-in LoRAs?
For reference, jina-embeddings-v4 was just released, and this time the LoRAs are in a separate file with separate weights for each task.

Edit: Oh, and it's multimodal (Qwen2.5 VL).
Added LoRA embedding and (non-functional) loading, need help figuring out how to actually load and apply the tensors. @slaren @ggerganov
@slaren Yes, I'm already separating them in 9a39ccb (and storing them as separately named tensors, inspired by jina-embeddings-v4). However, storing each task as a separate GGUF is not a good solution IMO (hence I'm now embedding everything in the model), as I'd like to make it as simple as possible to activate a task (i.e., set the correct LoRA and prepend the prompt prefix).
It may be very hard to justify making changes to the interfaces just for this one model.
TBF 2 models, but yeah. :) They are the best multi-lingual embedding models available though.
I understand the desire to make it as easy to use as possible, but I really don't think that it is too much to ask the users to apply the correct LoRA and use the correct prompt.
I don't think you've ever met a user. :) Anyway, I suppose this is something that can be solved by 3rd parties, I'll see if I can at least figure out a simple conversion process.
Ok, I think I found a nice compromise...
Sorry I missed the message. Indeed, I had an idea but haven't had time to work on it: the LoRA-handling code in llama.cpp currently validates that the LoRA GGUF contains only LoRA A/B tensors, and throws an exception if the file contains non-A/B tensors. However, allowing A/B and original tensors to cohabit in the same GGUF can be useful in the current case. We can modify that check to permit this. Going a bit further, we could also prefix the A/B tensor names per task. This of course deserves a dedicated PR, but I don't have any plan for it yet. Feel free to take over this task if you like.
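Roughly (a sketch; the ".lora_a"/".lora_b" suffix convention is what current LoRA GGUFs use, the rest is just an illustration of where the relaxed check could go):

```cpp
#include <string>

// Sketch: tolerate base-model tensors in the same GGUF instead of throwing.
static bool is_lora_ab_tensor(const std::string & name) {
    auto ends_with = [&](const char * suffix) {
        const std::string s(suffix);
        return name.size() >= s.size() &&
               name.compare(name.size() - s.size(), s.size(), s) == 0;
    };
    return ends_with(".lora_a") || ends_with(".lora_b");
}

// during adapter load:
//   if (is_lora_ab_tensor(name)) { /* collect into the adapter's A/B map */ }
//   else                         { /* base tensor: skip instead of throwing */ }
```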
Thanks for getting back to me on this. For now I think the current implementation is a good compromise; let me know if you think otherwise!
Let's get back to this in the future. I had a look into …
Support for jina-embeddings-v3
Work checklist
Task selection (using/enabling the correct LoRA) and prompt prefix is left to the user, but is made simpler by providing the task name and prompt prefix in the LoRA GGUF metadata (available from llama-server via the /lora-adapters endpoint).

Fixes #12327
Fixes #9585
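For illustration, a response from /lora-adapters with the proposed metadata could look roughly like this (the task_name and prompt_prefix field names, the file path, and the prefix string are assumptions for the example, not final):

```json
[
  {
    "id": 0,
    "path": "jina-embeddings-v3-retrieval.query-lora.gguf",
    "scale": 0.0,
    "task_name": "retrieval.query",
    "prompt_prefix": "Represent the query for retrieving evidence documents: "
  }
]
```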