
Commit d1125af

Commit message: minor
1 parent 27458f5 commit d1125af

2 files changed (+29, -28 lines)

README.md

Lines changed: 27 additions & 27 deletions
@@ -1,58 +1,60 @@
- # LLM2CLIP: Powerful Language Model Unlocking Richer Visual Representations
+ # LLM2CLIP: Unlocking Richer Visual Representations with a Powerful Language Model

- Welcome to the official repository for **LLM2CLIP**! This project leverages large language models (LLMs) as powerful textual teachers for CLIPs visual encoder, enabling more nuanced and comprehensive multimodal learning.
+ Welcome to the official repository for **LLM2CLIP**! This project leverages large language models (LLMs) as powerful textual teachers for CLIP's visual encoder, enabling more nuanced and comprehensive multimodal learning.

[![Paper](https://img.shields.io/badge/Paper-arXiv-red)](https://arxiv.org/abs/2411.04997) [![Project Homepage](https://img.shields.io/badge/Project-Homepage-blue)](https://aka.ms/llm2clip) [![HuggingFace Collection](https://img.shields.io/badge/HuggingFace-Collection-orange)](https://huggingface.co/collections/microsoft/llm2clip-672323a266173cfa40b32d4c)
- Paper: Accepted to NeurIPS 2024 Workshop SSL
+ **Paper:** Preprinted and under-review now. Accepted to NeurIPS 2024 Workshop: Self-Supervised Learning - Theory and Practice

---
<img src="docs/static/images/radar_paper(4).png" style="max-width: 800px;">

## Challenges with Existing CLIP

Current versions of CLIP face several limitations:
- - The text encoder has a short context window of only 77 tokens, limiting its ability to understand lengthy inputs.
- - The text encoder is relatively weak, often criticized for its inability to comprehend complex text, functioning nearly as a bag-of-words model.
+ - **Limited Context Window**: The text encoder has a short context window of only 77 tokens, which restricts its understanding of lengthy inputs.
+ - **Weak Text Comprehension**: The text encoder is relatively limited in its ability to comprehend complex text, often functioning as a bag-of-words model with limited depth.

## Why Integrate LLM with CLIP?

- Providing unimaginable cross-language capabilities. Our LLM2CLIP fine-tuned on purely English corpus even outperforms Chinese CLIP.
+ LLM2CLIP brings the unimaginable power of large language models to CLIP, even surpassing native language capabilities. Our LLM2CLIP, fine-tuned purely on an English corpus, outperforms standard Chinese CLIP models:

- 1. **Extended Input Window**: The LLM greatly expands CLIP's input window, allowing richer textual context.
- 2. **Enhanced Understanding**: With LLM's help, CLIP can better comprehend dense and complex captions, improving text-image alignment.
- 3. **Open-World Knowledge**: LLM supplements open-world knowledge, allowing CLIP to align multimodal features more globally, enhancing training efficiency.
+ 1. **Extended Input Window**: The LLM expands CLIP's input window, allowing richer textual context and better comprehension of long inputs.
+ 2. **Enhanced Understanding**: With LLM's help, CLIP gains a deeper understanding of dense, complex captions, leading to improved text-image alignment.
+ 3. **Open-World Knowledge**: The LLM provides open-world knowledge, enabling more globally informed multimodal feature alignment and boosting training efficiency.

## Key Challenges

- LLMs have strong text encoding capabilities hidden within the model, but their output space is often not highly separable for contrastive learning.
+ While LLMs have strong inherent text encoding capabilities, the output space is often not highly separable, which limits their effectiveness for contrastive learning.
+ ![coco_score.svg](docs%2Fstatic%2Fimages%2Fcoco_score.svg)
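
The coco_score figure added just above probes how separable a text encoder's output space is via caption-to-caption retrieval. Below is a minimal sketch of one way such a probe could be computed; it is an illustration under assumptions (any encoder producing one embedding per caption, two captions per image), not the paper's exact evaluation script.

```python
# Hypothetical separability probe: given two embeddings per image (one per
# caption), how often is the nearest neighbour of caption A the matching
# caption B? Higher top-1 accuracy suggests a more separable output space.
import torch
import torch.nn.functional as F


def caption_retrieval_top1(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    """emb_a[i] and emb_b[i] embed two captions of the same image."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    sims = emb_a @ emb_b.t()                       # cosine similarity matrix
    hits = sims.argmax(dim=1) == torch.arange(emb_a.size(0))
    return hits.float().mean().item()


# Placeholder embeddings standing in for caption features from CLIP's text
# encoder or from a raw / fine-tuned LLM.
score = caption_retrieval_top1(torch.randn(100, 768), torch.randn(100, 768))
print(f"caption-to-caption top-1 retrieval: {score:.2%}")
```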

## Our Approach

- We designed a Caption-to-Caption contrastive learning strategy, training the LLM to better differentiate between captions of the same or different images. This This caption-caption discrimination enhances the output space's separability enhances the output space's separability.
- The LLM gradients were frozen while efficiently training CLIP's visual encoder on limited data, resulting in substantial performance improvements.
+ To overcome these challenges, we designed a **Caption-to-Caption Contrastive Learning** strategy. We trained the LLM to better differentiate between captions of the same or different images, enhancing the separability of the LLM's output space. During training, the LLM gradients were frozen while CLIP's visual encoder was fine-tuned on limited data, resulting in significant performance gains.
+
+ Through this strategy, we better utilized the LLM's power to comprehend and process **long and dense captions**, improving the overall representation capabilities.
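
As a reading aid for the approach described in the added paragraph above, here is a minimal sketch of the two ideas: caption-to-caption contrastive training of the text side, then freezing it while fine-tuning CLIP's visual encoder against it. This is not the repository's training code (that is driven by `sh run.sh` on top of EVA-CLIP); the tiny linear layers, feature dimensions, and random batches are placeholder assumptions.

```python
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: row i of `a` should match row i of `b`."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Stage 1: caption-to-caption discrimination. `cap_a` / `cap_b` stand in for
# embeddings of two different captions of the SAME image (positives); captions
# of other images in the batch act as negatives.
text_encoder = torch.nn.Linear(512, 256)             # placeholder for the LLM text branch
cap_a = text_encoder(torch.randn(32, 512))
cap_b = text_encoder(torch.randn(32, 512))
stage1_loss = info_nce(cap_a, cap_b)                 # pulls same-image captions together

# Stage 2: freeze the text side, fine-tune only the visual encoder against it.
for p in text_encoder.parameters():
    p.requires_grad = False
visual_encoder = torch.nn.Linear(768, 256)           # placeholder for CLIP's ViT
img_feat = visual_encoder(torch.randn(32, 768))
with torch.no_grad():
    txt_feat = text_encoder(torch.randn(32, 512))    # frozen caption features
stage2_loss = info_nce(img_feat, txt_feat)           # image-text alignment objective
```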

## What Can You Achieve with LLM2CLIP?

- 1. **Enhanced CLIP Models**: Use our code to fine-tune pretrained CLIP models with representative dense captions or task-specific image-text datasets, making CLIP stronger for various tasks.
- 2. **Out-of-the-Box Power**: Directly use our enhanced CLIP models, which have been made significantly more powerful with LLM guidance.
+ 1. **Enhanced CLIP Models**: Fine-tune pretrained CLIP models with dense captions or task-specific image-text datasets, making CLIP stronger for various use cases.
+ 2. **Out-of-the-Box Power**: Directly use our enhanced CLIP models, significantly upgraded with LLM guidance for superior performance in multimodal tasks.
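
For the "Out-of-the-Box Power" route, the released checkpoints live in the HuggingFace collection linked above. The snippet below is only a hypothetical loading sketch: the model id and the feature-extraction method shown are assumptions, and the authoritative usage snippet is on each model card in the collection.

```python
# Hypothetical loading sketch; the checkpoint id and the feature-extraction
# method are assumptions, so defer to the model card for the real interface.
from PIL import Image
import torch
from transformers import AutoModel, CLIPImageProcessor

model_id = "microsoft/LLM2CLIP-Openai-L-14-336"      # assumed id from the collection

model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()
processor = CLIPImageProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")                    # any local test image
pixels = processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    # Method name assumed by analogy with transformers' CLIPModel; the custom
    # remote code may expose a different entry point.
    image_features = model.get_image_features(pixels)
print(image_features.shape)
```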

---

## News 🚀🚀🚀
- - **[2024-11-06]** OpenAI's CLIP and EVA02's ViT base and large models are now available on [HuggingFace](https://huggingface.co/collections/microsoft/llm2clip-672323a266173cfa40b32d4c). More model versions and datasets will be added to HuggingFace shortly.
- - **[2024-11-01]** Our paper has been accepted to the NeurIPS 2024 SSL Workshop!
+ - **[2024-11-06]** OpenAI's CLIP and EVA02's ViT base and large models are now available on [HuggingFace](https://huggingface.co/collections/microsoft/llm2clip-672323a266173cfa40b32d4c). More model versions and datasets coming soon!
+ - **[2024-11-01]** Our paper was accepted to the NeurIPS 2024 SSL Workshop!

---
- ![main.svg](docs%2Fstatic%2Fimages%2Fmain.svg)
-
- ## Model Zoo (Keep Updating)
+ ![main.svg](docs/static/images/main.svg)

+ ## Model Zoo (Continuously Updated)

Stay tuned for updates on pretrained models and datasets, which will be made available in the [HuggingFace Model Zoo](https://huggingface.co/collections/microsoft/llm2clip-672323a266173cfa40b32d4c).

---

- ## 💻 How to Install
+ ## 💻 Installation Guide

1. **Create the environment**:

@@ -63,17 +65,16 @@ Stay tuned for updates on pretrained models and datasets, which will be made ava
```
2. **Data Preparation**:

- ### Data Preparation (Coming Soon)
+ *(Coming Soon)*
+
+ 3. **🔥 Training**:

- ### 🔥 Training
-
```bash
sh run.sh
```
+ ## ❤️ Acknowledgements

- ## ❤️ Acknowledgement
-
- Currently, our code is built on top of [eva-clip](https://github.com/baaivision/EVA/tree/master/EVA-CLIP) and .
+ Our code is built on top of [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP). We would like to thank the EVA team for their foundational work.

## Citation

@@ -89,4 +90,3 @@ If you use our work, please cite:
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.04997},
}
- ```

docs/index.html

Lines changed: 2 additions & 1 deletion
@@ -283,7 +283,8 @@ <h2 class="title is-3"><span class="dvima">Why Integrate LLM with CLIP?</span></

<h2 class="title is-3"><span class="dvima">Our Approach</span></h2>
<p style="font-size: 100%">
- We designed a Caption-to-Caption contrastive learning strategy, training the LLM to better differentiate between captions of the same or different images. This caption-caption discrimination enhances the output space's separability. The LLM gradients were frozen while efficiently training CLIP's visual encoder on limited data, resulting in substantial performance improvements.
+ We designed a Caption-to-Caption contrastive learning strategy, training the LLM to better differentiate between captions of the same or different images. This caption-caption discrimination enhances LLM output space's separability. The LLM gradients were frozen while efficiently training CLIP's visual encoder on limited data, resulting in substantial performance improvements. Through this strategy, we better utilized the LLM's power to comprehend and process **long and dense captions**, improving the overall representation capabilities.
+
</p>

<h2 class="title is-3"><span class="dvima">What Can You Achieve with LLM2CLIP?</span></h2>
