
Youtu-Embedding:
Advancing Unified Text Representation with Collaborative-Distinct Learning


🔖 Chinese Version · 🤗 Model Download · 🚀 Quick Start: Inference · 🛠️ How to Train

🎯 Brief Introduction


Youtu-Embedding is an industry-leading, general-purpose text representation model developed by Tencent Youtu Lab. It demonstrates state-of-the-art performance across a wide range of natural language processing tasks, including Information Retrieval (IR), Semantic Textual Similarity (STS), Clustering, Reranking, and Classification.

The core advantages of Youtu-Embedding can be summarized as follows:

  • 🏆 State-of-the-Art Performance: Achieved a top score of 77.58 on the authoritative Chinese text embedding benchmark CMTEB (as of Sep 2025), proving its powerful representation capabilities.

  • 🧠 Sophisticated Three-Stage Training: We pioneered a "LLM-based Pre-training → Weakly-supervised Alignment → Collaborative-Discriminative Fine-tuning" pipeline, which systematically distills the broad knowledge of large language models into the specialized discriminative power required for embedding tasks.

  • ⭐ Innovative Fine-tuning Framework: We designed a unique Collaborative-Discriminative Fine-tuning Framework that effectively resolves the "negative transfer" problem in multi-task learning through a unified data format, task-differentiated loss functions, and a dynamic single-task sampling mechanism. (The framework has been verified on a variety of base encoders, confirming its versatility and effectiveness.)

  • 🛠️ Meticulous Data Engineering: We combined high-quality, LLM-based data synthesis with efficient hard-negative mining strategies to provide a robust data foundation for model training.

We are open-sourcing the model weights, inference code, and the training framework. We hope this will help developers in the community create greater value.

🤗 Model Download

We have released our first model version on Hugging Face: a 2-billion-parameter (2B) model designed for general-purpose semantic representation.

| Model Name | Parameters | Dimensions | Sequence Length | Download |
|---|---|---|---|---|
| Youtu-Embedding-V1 | 2B | 2048 | 8K | Model |

🚀 Quick Start: Inference

You can generate embeddings in two ways: via our official API for ease of use or by running the model locally for full control.

Option 1: ☁️ Using the Official API

📦 Install the SDK

pip install --upgrade tencentcloud-sdk-python

⚙️ Usage
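
The snippet below is a minimal, illustrative sketch of the generic tencentcloud-sdk-python calling pattern (credential → client → JSON request). The service name, API version, region, action, and request fields are placeholders, not the documented embedding API; substitute the values from the official API reference for Youtu-Embedding.

import json
from tencentcloud.common import credential
from tencentcloud.common.common_client import CommonClient

# Placeholders: fill in the service/version/action documented for the
# embedding API, plus your own credentials.
cred = credential.Credential("YOUR_SECRET_ID", "YOUR_SECRET_KEY")
client = CommonClient("SERVICE_NAME", "API_VERSION", cred, "ap-guangzhou")
resp = client.call_json("ACTION_NAME", {"Input": "What's the weather like?"})
print(json.dumps(resp, ensure_ascii=False))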

Option 2: 💻 Locally with Self-Hosted Inference

Running the model on your own machine gives you full control, making it perfect for offline use, customization, or when data privacy is a priority. Below is a practical, step-by-step guide using this repository’s prebuilt scripts.

1) Quick Start (Repo + Environment)

# Clone this repo
git clone https://github.com/TencentCloudADP/youtu-embedding.git
cd youtu-embedding

# Create and activate a virtual environment
python -m venv youtu-env
source youtu-env/bin/activate  # Windows: youtu-env\Scripts\activate

# Install dependencies
pip install -U pip
pip install "transformers==4.51.3" torch numpy scipy scikit-learn huggingface_hub

2) Get the Model Weights (choose one)

  • Option A: Download from Hugging Face into a local folder (the local test script expects ./Youtu-Embedding)
huggingface-cli download tencent/Youtu-Embedding --local-dir ./Youtu-Embedding
  • Option B: Clone the model repo
git clone https://huggingface.co/tencent/Youtu-Embedding ./Youtu-Embedding

3) Run the Prebuilt Test Scripts (recommended)

Pick one that matches your environment. All scripts are included in this repo.

  • CUDA systems:
python test_transformers_online_cuda.py
  • macOS (Apple Silicon with MPS or CPU fallback):
python test_transformers_online_macos.py
  • Local-only (use locally downloaded model in ./Youtu-Embedding):
python test_transformers_local.py

Each script loads the model, encodes a demo query and passages, and prints similarity scores sorted so the best match is obvious.

Sample Output

nv/bin/python /Users/pro/Desktop/youtu-embedding/test_transformers_local.py
Loading checkpoint shards: 100%|███████████████████████████████| 2/2 [00:00<00:00, 28.64it/s]
Model loaded: ./Youtu-Embedding
Device: mps

================================================================================
🔍 Query: What's the weather like?
================================================================================

🥇 BEST MATCH
   Score: 0.4465 | ⚡ Moderately Relevant
   Visual: [█████████████░░░░░░░░░░░░░░░░░] 44.7%
   Content: "The weather is lovely today."

🥈 2nd BEST
   Score: 0.3124 | ⚡ Moderately Relevant
   Visual: [█████████░░░░░░░░░░░░░░░░░░░░░] 31.2%
   Content: "It's so sunny outside!"

🥉 3rd BEST
   Score: 0.0688 | ❌ Not Relevant
   Visual: [██░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 6.9%
   Content: "Would you want to play a game?"

#4
   Score: 0.0304 | ❌ Not Relevant
   Visual: [░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 3.0%
   Content: "He drove to the stadium."

================================================================================

Raw scores: [[0.4465198516845703, 0.31240472197532654, 0.03040437400341034, 0.06884326785802841]]

4) Using the Custom LLMEmbeddingModel Class

For a more specialized implementation or to see our direct wrapper, you can use the LLMEmbeddingModel class.
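
To make the flow concrete, here is a minimal wrapper sketch built directly on transformers. It is an assumption-based illustration: the class name, pooling strategy, and method names below are ours, not necessarily those of the repo's LLMEmbeddingModel.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class SimpleEmbeddingWrapper:
    """Hypothetical minimal wrapper; see the repo's LLMEmbeddingModel for the real one."""
    def __init__(self, model_path="./Youtu-Embedding", device="cpu"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModel.from_pretrained(model_path).to(device).eval()
        self.device = device

    @torch.no_grad()
    def encode(self, texts, max_length=8192):
        batch = self.tokenizer(texts, padding=True, truncation=True,
                               max_length=max_length, return_tensors="pt").to(self.device)
        hidden = self.model(**batch).last_hidden_state  # (B, T, H)
        # Mean pooling over non-padding tokens (an assumption; the official
        # class may pool differently, e.g. last-token pooling).
        mask = batch["attention_mask"].unsqueeze(-1).float()
        emb = (hidden * mask).sum(1) / mask.sum(1)
        return F.normalize(emb, dim=-1)

model = SimpleEmbeddingWrapper()
q = model.encode(["What's the weather like?"])
p = model.encode(["The weather is lovely today.", "He drove to the stadium."])
print(q @ p.T)  # cosine similarities, since embeddings are L2-normalized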

5) Using sentence-transformers

If you prefer sentence-transformers, you can load the same model by ID or from a local folder.

📦 Installation

pip install sentence-transformers==5.1.0

⚙️ Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("tencent/Youtu-Embedding")
queries = ["What's the weather like?"]
passages = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.'
]
queries_embeddings = model.encode_query(queries)
passages_embeddings = model.encode_document(passages)

similarities = model.similarity(queries_embeddings, passages_embeddings)
print(similarities)

6) Using LangChain 🦜

Easily integrate the model into your LangChain applications, such as RAG pipelines.

📦 Installation

pip install langchain==0.3.27 langchain-community==0.3.29 langchain-huggingface==0.3.1 sentence-transformers==5.1.0 faiss-cpu==1.11.0

⚙️ Usage
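
A minimal sketch, assuming the pinned packages above: wrap the model with HuggingFaceEmbeddings and build a small FAISS index for retrieval.

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Load the embedding model (downloads from Hugging Face on first use).
embeddings = HuggingFaceEmbeddings(model_name="tencent/Youtu-Embedding")

texts = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# Build an in-memory FAISS index and retrieve the top-2 matches.
db = FAISS.from_texts(texts, embeddings)
docs = db.similarity_search("What's the weather like?", k=2)
for doc in docs:
    print(doc.page_content)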

7) Using LlamaIndex 🦙

This is perfect for integrating the model into your LlamaIndex search and retrieval systems.

📦 Installation

pip install llama-index==0.14.2 llama-index-embeddings-huggingface==0.6.1 sentence-transformers==5.1.0 llama-index-vector-stores-faiss==0.5.1

⚙️ Usage
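
A minimal sketch, assuming the pinned packages above. The default in-memory vector store is used here to keep the example short; swap in the faiss vector store for larger corpora.

from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Use the model for all embedding calls; no LLM is needed for retrieval.
Settings.embed_model = HuggingFaceEmbedding(model_name="tencent/Youtu-Embedding")

documents = [
    Document(text="The weather is lovely today."),
    Document(text="He drove to the stadium."),
]

index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=1)
for node in retriever.retrieve("What's the weather like?"):
    print(node.score, node.text)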

💡 Fine-tuning Framework

We provide our novel Collaborative-Discriminative Fine-tuning Framework, designed to overcome the challenges of jointly optimizing different text embedding tasks. By systematically decoupling tasks, we introduce several key innovations to achieve highly efficient unified representation learning.

🌐 1. Unified & Extensible Data Format

Our unified data structure seamlessly handles heterogeneous data from IR, STS, classification, and reranking tasks, offering excellent extensibility for incorporating new tasks in the future.
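
As a purely hypothetical illustration (the field names are ours, not the framework's exact schema), a unified record shape could carry IR and STS samples like this:

# Hypothetical examples of a unified record shape; the framework's real
# schema may use different field names.
ir_sample = {
    "task": "ir",
    "query": "What's the weather like?",
    "positives": ["The weather is lovely today."],
    "negatives": ["He drove to the stadium."],
}
sts_sample = {
    "task": "sts",
    "query": "A man is playing a guitar.",
    "positives": ["A person plays a guitar."],
    "scores": [4.8],  # graded similarity labels enable ranking-aware losses
}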

🎯 2. Task-Differentiated Loss Functions

We moved beyond a "one-size-fits-all" loss function and designed specialized optimization objectives for different tasks.

  • For IR (Information Retrieval) tasks: We use a powerful InfoNCE contrastive loss that supports multiple positives, hard negatives, and in-batch cross-device negative sampling for superior discriminative ability (a minimal sketch follows this list).

  • For STS (Semantic Textual Similarity) tasks: We go beyond simple contrastive learning by adopting ranking-aware objectives (e.g., Pearson loss, KL divergence loss L_RankKL) to directly optimize for ranking consistency.
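
To make the IR objective concrete, here is a minimal single-device InfoNCE sketch with in-batch negatives. The temperature value and function name are illustrative; the actual implementation also supports multiple positives, mined hard negatives, and cross-device negative sharing.

import torch
import torch.nn.functional as F

def info_nce(query_emb, passage_emb, temperature=0.05):
    """query_emb: (B, H); passage_emb: (B, H), row i is the positive for query i.
    All other rows in the batch serve as in-batch negatives."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature             # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)     # diagonal entries are positives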

🔄 3. Dynamic Single-Task Sampling

To prevent gradient interference from mixed-task batches, we implemented a custom dynamic sampler. It ensures that within a single training iteration, all GPUs process non-overlapping shards of the same dataset, providing the model with a pure and stable gradient signal.
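
A minimal sketch of the idea, not the repo's actual sampler: each iteration draws a batch from a single dataset and splits it into disjoint per-device shards.

import random

def single_task_batches(datasets, batch_size, num_gpus, seed=0):
    """datasets: dict mapping task name -> list of samples.
    Yields (task_name, per_gpu_shards): every iteration draws from ONE dataset
    and deals the batch into non-overlapping per-device shards."""
    rng = random.Random(seed)
    names = list(datasets)
    while True:
        name = rng.choice(names)                                # one task per iteration
        batch = rng.sample(datasets[name], batch_size * num_gpus)
        shards = [batch[i::num_gpus] for i in range(num_gpus)]  # disjoint shards
        yield name, shards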

🛠️ How to Train

The code for our training framework is located in the training/ directory.

1. Installation

Clone the repository and install the required dependencies:

git clone https://github.com/TencentCloudADP/youtu-embedding.git
cd youtu-embedding/training/CoDiEmb
pip install -r requirements.txt

2. Training

cd scripts
bash train_youtuemb.sh

3. Evaluation

The code for reproducing the following results is available in evaluation/.

📊 CMTEB

Youtu-Embedding achieves the highest overall average score on the CMTEB benchmark, with strong results across all seven task categories. Results for the latest model version are shown below:

| Model | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS |
|---|---|---|---|---|---|---|---|---|---|
| bge-multilingual-gemma2 | 9B | 67.64 | 68.52 | 75.31 | 59.30 | 79.30 | 68.28 | 73.73 | 55.19 |
| ritrieve_zh_v1 | 326M | 72.71 | 73.85 | 76.88 | 66.50 | 85.98 | 72.86 | 76.97 | 63.92 |
| Qwen3-Embedding-4B | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
| Qwen3-Embedding-8B | 8B | 73.84 | 75.00 | 76.97 | 80.08 | 84.23 | 66.99 | 78.21 | 63.53 |
| Conan-embedding-v2 | 1.4B | 74.24 | 75.99 | 76.47 | 68.84 | 92.44 | 74.41 | 78.31 | 65.48 |
| Seed1.6-embedding | - | 75.63 | 76.68 | 77.98 | 73.11 | 88.71 | 71.65 | 79.69 | 68.94 |
| QZhou-Embedding | 7B | 76.99 | 78.58 | 79.99 | 70.91 | 95.07 | 74.85 | 78.80 | 71.89 |
| Youtu-Embedding-V1 | 2B | 77.58 | 78.86 | 78.65 | 84.27 | 86.12 | 75.10 | 80.21 | 68.82 |

Note: Comparative scores are taken from the MTEB leaderboard, recorded on September 28, 2025.

🤝 Contributing

We welcome contributions from the community! Here's how you can help:

💻 Code Contribution

  1. 🍴 Fork the project
  2. 🌿 Create a feature branch (git checkout -b feature/AmazingFeature)
  3. 💾 Commit your changes (git commit -m 'Add some AmazingFeature')
  4. 📤 Push to the branch (git push origin feature/AmazingFeature)
  5. 🔄 Create a Pull Request

🎉 Citation

If you find our work useful in your research, please consider citing our paper:

@misc{zhang2025codiemb,
  title={CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity},
  author={Zhang, Bowen and Song, Zixin and Chen, Chunquan and Zhang, Qian-Wen and Yin, Di and Sun, Xing},
  year={2025},
  eprint={2508.11442},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2508.11442},
}
