
Youtu-Embedding:
Advancing Unified Text Representation with Collaborative-Distinct Learning


🔖 Chinese Version · 🤗 Model Download · 🚀 Quick Start: Inference · 🛠️ How to Train

🎯 Brief Introduction


Youtu-Embedding is an industry-leading, general-purpose text representation model developed by Tencent Youtu Lab. It demonstrates state-of-the-art performance across a wide range of natural language processing tasks, including Information Retrieval (IR), Semantic Textual Similarity (STS), Clustering, Reranking, and Classification.

The core advantages of Youtu-Embedding can be summarized as follows:

  • 🏆 State-of-the-Art Performance: Achieved a top score of 77.58 on the authoritative Chinese text embedding benchmark CMTEB (as of Sep 2025), proving its powerful representation capabilities.

  • 🧠 Sophisticated Three-Stage Training: We pioneered a "LLM-based Pre-training → Weakly-supervised Alignment → Collaborative-Discriminative Fine-tuning" pipeline, which systematically distills the broad knowledge of large language models into the specialized discriminative power required for embedding tasks.

  • ⭐ Innovative Fine-tuning Framework: We designed a unique Collaborative-Discriminative Fine-tuning Framework that effectively resolves the "negative transfer" problem in multi-task learning through a unified data format, task-differentiated loss functions, and a dynamic single-task sampling mechanism. (The framework has been verified on a variety of base encoders, confirming its versatility and effectiveness.)

  • 🛠️ Meticulous Data Engineering: We combined high-quality, LLM-based data synthesis with efficient hard-negative mining strategies to provide a robust data foundation for model training.

We are open-sourcing the model weights, inference code, and the training framework. We hope this will help developers in the community create greater value.

🤗 Model Download

We have released our first model version on Hugging Face: a 2-billion-parameter (2B) model designed for general-purpose semantic representation.

| Model Name | Parameters | Dimensions | Sequence Length | Download |
|---|---|---|---|---|
| Youtu-Embedding-V1 | 2B | 2048 | 8K | Model |

🚀 Quick Start: Inference

You can generate embeddings in two ways: via our official API for ease of use or by running the model locally for full control.

Option 1: ☁️ Using the Official API

📦 Install the SDK

pip install --upgrade tencentcloud-sdk-python

⚙️ Usage
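
The snippet below is a minimal, illustrative sketch of the generic tencentcloud-sdk-python calling pattern (credential → client → JSON request). The service name, API version, region, action, and request fields are placeholders, not the documented embedding API; substitute the values from the official API reference for Youtu-Embedding.

import json
from tencentcloud.common import credential
from tencentcloud.common.common_client import CommonClient

# Placeholders: fill in the service/version/action documented for the
# embedding API, plus your own credentials.
cred = credential.Credential("YOUR_SECRET_ID", "YOUR_SECRET_KEY")
client = CommonClient("SERVICE_NAME", "API_VERSION", cred, "ap-guangzhou")
resp = client.call_json("ACTION_NAME", {"Input": "What's the weather like?"})
print(json.dumps(resp, ensure_ascii=False))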

Option 2: 💻 Locally with Self-Hosted Inference

Running the model on your own machine gives you full control, making it perfect for offline use, customization, or when data privacy is a priority. Below is a practical, step-by-step guide using this repository’s prebuilt scripts.

1) Quick Start (Repo + Environment)

# Clone this repo
git clone https://github.com/TencentCloudADP/youtu-embedding.git
cd youtu-embedding

# Create and activate a virtual environment
python -m venv youtu-env
source youtu-env/bin/activate  # Windows: youtu-env\Scripts\activate

# Install dependencies
pip install -U pip
pip install "transformers==4.51.3" torch numpy scipy scikit-learn huggingface_hub

2) Get the Model Weights (choose one)

  • Option A: Download from Hugging Face into a local folder (the local test script expects ./Youtu-Embedding)
huggingface-cli download tencent/Youtu-Embedding --local-dir ./Youtu-Embedding
  • Option B: Clone the model repo
git clone https://huggingface.co/tencent/Youtu-Embedding ./Youtu-Embedding

3) Run the Prebuilt Test Scripts (recommended)

Pick one that matches your environment. All scripts are included in this repo.

  • CUDA systems:
python test_transformers_online_cuda.py
  • macOS (Apple Silicon with MPS or CPU fallback):
python test_transformers_online_macos.py
  • Local-only (use locally downloaded model in ./Youtu-Embedding):
python test_transformers_local.py

Each script loads the model, encodes a demo query and passages, and prints similarity scores sorted so the best match is obvious.

Sample Output

nv/bin/python /Users/pro/Desktop/youtu-embedding/test_transformers_local.py
Loading checkpoint shards: 100%|███████████████████████████████| 2/2 [00:00<00:00, 28.64it/s]
Model loaded: ./Youtu-Embedding
Device: mps

================================================================================
🔍 Query: What's the weather like?
================================================================================

🥇 BEST MATCH
   Score: 0.4465 | ⚡ Moderately Relevant
   Visual: [█████████████░░░░░░░░░░░░░░░░░] 44.7%
   Content: "The weather is lovely today."

🥈 2nd BEST
   Score: 0.3124 | ⚡ Moderately Relevant
   Visual: [█████████░░░░░░░░░░░░░░░░░░░░░] 31.2%
   Content: "It's so sunny outside!"

🥉 3rd BEST
   Score: 0.0688 | ❌ Not Relevant
   Visual: [██░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 6.9%
   Content: "Would you want to play a game?"

#4
   Score: 0.0304 | ❌ Not Relevant
   Visual: [░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 3.0%
   Content: "He drove to the stadium."

================================================================================

Raw scores: [[0.4465198516845703, 0.31240472197532654, 0.03040437400341034, 0.06884326785802841]]

4) Using the Custom LLMEmbeddingModel Class

For a more specialized implementation or to see our direct wrapper, you can use the LLMEmbeddingModel class.
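
To make the flow concrete, here is a minimal wrapper sketch built directly on transformers. It is an assumption-based illustration: the class name, pooling strategy, and method names below are ours, not necessarily those of the repo's LLMEmbeddingModel.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class SimpleEmbeddingWrapper:
    """Hypothetical minimal wrapper; see the repo's LLMEmbeddingModel for the real one."""
    def __init__(self, model_path="./Youtu-Embedding", device="cpu"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModel.from_pretrained(model_path).to(device).eval()
        self.device = device

    @torch.no_grad()
    def encode(self, texts, max_length=8192):
        batch = self.tokenizer(texts, padding=True, truncation=True,
                               max_length=max_length, return_tensors="pt").to(self.device)
        hidden = self.model(**batch).last_hidden_state  # (B, T, H)
        # Mean pooling over non-padding tokens (an assumption; the official
        # class may pool differently, e.g. last-token pooling).
        mask = batch["attention_mask"].unsqueeze(-1).float()
        emb = (hidden * mask).sum(1) / mask.sum(1)
        return F.normalize(emb, dim=-1)

model = SimpleEmbeddingWrapper()
q = model.encode(["What's the weather like?"])
p = model.encode(["The weather is lovely today.", "He drove to the stadium."])
print(q @ p.T)  # cosine similarities, since embeddings are L2-normalized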

5) Using sentence-transformers

If you prefer sentence-transformers, you can load the same model by ID or from a local folder.

📦 Installation

pip install sentence-transformers==5.1.0

⚙️ Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("tencent/Youtu-Embedding")
queries = ["What's the weather like?"]
passages = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.'
]
queries_embeddings = model.encode_query(queries)
passages_embeddings = model.encode_document(passages)

similarities = model.similarity(queries_embeddings, passages_embeddings)
print(similarities)

6) Using LangChain 🦜

Easily integrate the model into your LangChain applications, such as RAG pipelines.

📦 Installation

pip install langchain==0.3.27 langchain-community==0.3.29 langchain-huggingface==0.3.1 sentence-transformers==5.1.0 faiss-cpu==1.11.0

⚙️ Usage
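
A minimal sketch, assuming the pinned packages above: wrap the model with HuggingFaceEmbeddings and build a small FAISS index for retrieval.

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Load the embedding model (downloads from Hugging Face on first use).
embeddings = HuggingFaceEmbeddings(model_name="tencent/Youtu-Embedding")

texts = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# Build an in-memory FAISS index and retrieve the top-2 matches.
db = FAISS.from_texts(texts, embeddings)
docs = db.similarity_search("What's the weather like?", k=2)
for doc in docs:
    print(doc.page_content)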

7) Using LlamaIndex 🦙

This is perfect for integrating the model into your LlamaIndex search and retrieval systems.

📦 Installation

pip install llama-index==0.14.2 llama-index-embeddings-huggingface==0.6.1 sentence-transformers==5.1.0 llama-index-vector-stores-faiss==0.5.1

⚙️ Usage
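
A minimal sketch, assuming the pinned packages above. The default in-memory vector store is used here to keep the example short; swap in the faiss vector store for larger corpora.

from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Use the model for all embedding calls; no LLM is needed for retrieval.
Settings.embed_model = HuggingFaceEmbedding(model_name="tencent/Youtu-Embedding")

documents = [
    Document(text="The weather is lovely today."),
    Document(text="He drove to the stadium."),
]

index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=1)
for node in retriever.retrieve("What's the weather like?"):
    print(node.score, node.text)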

💡 Fine-tuning Framework

We provide our novel Collaborative-Discriminative Fine-tuning Framework, designed to overcome the challenges of jointly optimizing different text embedding tasks. By systematically decoupling tasks, we introduce several key innovations to achieve highly efficient unified representation learning.

🌐 1. Unified & Extensible Data Format

Our unified data structure seamlessly handles heterogeneous data from IR, STS, classification, and reranking tasks, offering excellent extensibility for incorporating new tasks in the future.
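
As a purely hypothetical illustration (the field names are ours, not the framework's exact schema), a unified record shape could carry IR and STS samples like this:

# Hypothetical examples of a unified record shape; the framework's real
# schema may use different field names.
ir_sample = {
    "task": "ir",
    "query": "What's the weather like?",
    "positives": ["The weather is lovely today."],
    "negatives": ["He drove to the stadium."],
}
sts_sample = {
    "task": "sts",
    "query": "A man is playing a guitar.",
    "positives": ["A person plays a guitar."],
    "scores": [4.8],  # graded similarity labels enable ranking-aware losses
}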

🎯 2. Task-Differentiated Loss Functions

We moved beyond a "one-size-fits-all" loss function and designed specialized optimization objectives for different tasks.

  • For IR (Information Retrieval) tasks: We use a powerful InfoNCE contrastive loss that supports multiple positives, hard negatives, and in-batch cross-device negative sampling for superior discriminative ability (a minimal sketch follows this list).

  • For STS (Semantic Textual Similarity) tasks: We go beyond simple contrastive learning by adopting ranking-aware objectives (e.g., Pearson loss, KL divergence loss L_RankKL) to directly optimize for ranking consistency.
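
To make the IR objective concrete, here is a minimal single-device InfoNCE sketch with in-batch negatives. The temperature value and function name are illustrative; the actual implementation also supports multiple positives, mined hard negatives, and cross-device negative sharing.

import torch
import torch.nn.functional as F

def info_nce(query_emb, passage_emb, temperature=0.05):
    """query_emb: (B, H); passage_emb: (B, H), row i is the positive for query i.
    All other rows in the batch serve as in-batch negatives."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature             # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)     # diagonal entries are positives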

🔄 3. Dynamic Single-Task Sampling

To prevent gradient interference from mixed-task batches, we implemented a custom dynamic sampler. It ensures that within a single training iteration, all GPUs process non-overlapping shards of the same dataset, providing the model with a pure and stable gradient signal.
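
A minimal sketch of the idea, not the repo's actual sampler: each iteration draws a batch from a single dataset and splits it into disjoint per-device shards.

import random

def single_task_batches(datasets, batch_size, num_gpus, seed=0):
    """datasets: dict mapping task name -> list of samples.
    Yields (task_name, per_gpu_shards): every iteration draws from ONE dataset
    and deals the batch into non-overlapping per-device shards."""
    rng = random.Random(seed)
    names = list(datasets)
    while True:
        name = rng.choice(names)                                # one task per iteration
        batch = rng.sample(datasets[name], batch_size * num_gpus)
        shards = [batch[i::num_gpus] for i in range(num_gpus)]  # disjoint shards
        yield name, shards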

🛠️ How to Train

The code for our training framework is located in the training/ directory.

1. Installation

Clone the repository and install the required dependencies:

git clone https://github.com/TencentCloudADP/youtu-embedding.git
cd youtu-embedding/training/CoDiEmb
pip install -r requirements.txt

2. Training

cd scripts
bash train_youtuemb.sh

3. Evaluation

The code for reproducing the following results is available in evaluation/.

📊 CMTEB

Youtu-Embedding achieves the highest overall average score on the CMTEB benchmark, with strong results across all seven task categories. Results for the latest model version are shown below:

| Model | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS |
|---|---|---|---|---|---|---|---|---|---|
| bge-multilingual-gemma2 | 9B | 67.64 | 68.52 | 75.31 | 59.30 | 79.30 | 68.28 | 73.73 | 55.19 |
| ritrieve_zh_v1 | 326M | 72.71 | 73.85 | 76.88 | 66.50 | 85.98 | 72.86 | 76.97 | 63.92 |
| Qwen3-Embedding-4B | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
| Qwen3-Embedding-8B | 8B | 73.84 | 75.00 | 76.97 | 80.08 | 84.23 | 66.99 | 78.21 | 63.53 |
| Conan-embedding-v2 | 1.4B | 74.24 | 75.99 | 76.47 | 68.84 | 92.44 | 74.41 | 78.31 | 65.48 |
| Seed1.6-embedding | - | 75.63 | 76.68 | 77.98 | 73.11 | 88.71 | 71.65 | 79.69 | 68.94 |
| QZhou-Embedding | 7B | 76.99 | 78.58 | 79.99 | 70.91 | 95.07 | 74.85 | 78.80 | 71.89 |
| Youtu-Embedding-V1 | 2B | 77.58 | 78.86 | 78.65 | 84.27 | 86.12 | 75.10 | 80.21 | 68.82 |

Note: Comparative scores are taken from the MTEB leaderboard, recorded on September 28, 2025.

🤝 Contributing

We welcome contributions from the community! Here's how you can help:

💻 Code Contribution

  1. 🍴 Fork the project
  2. 🌿 Create a feature branch (git checkout -b feature/AmazingFeature)
  3. 💾 Commit your changes (git commit -m 'Add some AmazingFeature')
  4. 📤 Push to the branch (git push origin feature/AmazingFeature)
  5. 🔄 Create a Pull Request

🎉 Citation

If you find our work useful in your research, please consider citing our paper:

@misc{zhang2025codiemb,
  title={CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity},
  author={Zhang, Bowen and Song, Zixin and Chen, Chunquan and Zhang, Qian-Wen and Yin, Di and Sun, Xing},
  year={2025},
  eprint={2508.11442},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2508.11442},
}
