# LLM2CLIP: Unlocking Richer Visual Representations with a Powerful Language Model
Welcome to the official repository for **LLM2CLIP**! This project leverages large language models (LLMs) as powerful textual teachers for CLIP's visual encoder, enabling more nuanced and comprehensive multimodal learning.

Current versions of CLIP face several limitations:
- **Limited Context Window**: The text encoder has a short context window of only 77 tokens, which restricts its understanding of lengthy inputs (illustrated in the snippet after this list).
- **Weak Text Comprehension**: The text encoder is relatively limited in its ability to comprehend complex text, often functioning as a bag-of-words model with limited depth.
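
As a quick illustration of the limited context window, the snippet below (a minimal sketch; `openai/clip-vit-base-patch32` is used only as a representative CLIP tokenizer) shows a long caption being clipped to the 77-token window:

```python
from transformers import CLIPTokenizer

# Any standard CLIP tokenizer enforces the same 77-token cap; this checkpoint
# is just a representative example.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

long_caption = " ".join(["a highly detailed description of the scene"] * 40)

# Tokenized without truncation, the caption is far longer than CLIP can accept...
full_length = len(tokenizer(long_caption)["input_ids"])
# ...so in practice it must be clipped to the text encoder's 77-token window.
clipped_length = len(tokenizer(long_caption, truncation=True, max_length=77)["input_ids"])

print(full_length, clipped_length)  # the second value is capped at 77
```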
## Why Integrate LLM with CLIP?
LLM2CLIP brings the unimaginable power of large language models to CLIP, even across languages: our model, fine-tuned purely on an English corpus, outperforms standard Chinese CLIP models. Integrating an LLM offers several further benefits:
1. **Extended Input Window**: The LLM expands CLIP's input window, allowing richer textual context and better comprehension of long inputs.
2. **Enhanced Understanding**: With LLM's help, CLIP gains a deeper understanding of dense, complex captions, leading to improved text-image alignment.
3. **Open-World Knowledge**: The LLM provides open-world knowledge, enabling more globally informed multimodal feature alignment and boosting training efficiency.
## Key Challenges
While LLMs have strong inherent text encoding capabilities, the output space is often not highly separable, which limits their effectiveness for contrastive learning.
To overcome these challenges, we designed a **Caption-to-Caption Contrastive Learning** strategy. We trained the LLM to better differentiate between captions of the same or different images, enhancing the separability of the LLM's output space. During training, the LLM gradients were frozen while CLIP's visual encoder was fine-tuned on limited data, resulting in significant performance gains.
Through this strategy, we better utilized the LLM's power to comprehend and process **long and dense captions**, improving the overall representation capabilities.
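
The sketch below shows the general shape of such a caption-to-caption objective (a minimal, generic sketch rather than the repository's exact implementation; the embedding inputs, batch layout, and temperature are assumptions):

```python
import torch
import torch.nn.functional as F

def caption_to_caption_loss(anchor_emb: torch.Tensor,
                            positive_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over two caption embeddings per image.

    anchor_emb / positive_emb: (batch, dim) embeddings produced by the LLM text
    encoder (plus any projection head) for two different captions of the same
    image. Row i of both tensors describes the same image, so diagonal pairs
    are positives and every other pair is a negative.
    """
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)

    logits = anchor @ positive.t() / temperature
    targets = torch.arange(anchor.size(0), device=anchor.device)

    # Pull captions of the same image together, push captions of other images apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

The key point is that the LLM is first trained to separate captions in its output space; in the subsequent stage it then serves as a frozen text encoder while only CLIP's visual encoder is updated.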
## What Can You Achieve with LLM2CLIP?
1. **Enhanced CLIP Models**: Fine-tune pretrained CLIP models with dense captions or task-specific image-text datasets, making CLIP stronger for various use cases.
2. **Out-of-the-Box Power**: Directly use our enhanced CLIP models, significantly upgraded with LLM guidance for superior performance in multimodal tasks (see the loading sketch below).
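
For example, a released checkpoint can be loaded from the HuggingFace collection roughly as follows (a sketch only: the checkpoint name, the preprocessing choice, and the `get_image_features` call are assumptions; consult the individual model cards for the exact usage):

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Assumed checkpoint name from the LLM2CLIP HuggingFace collection; the exact
# identifier and loading recipe may differ, so check the model card first.
checkpoint = "microsoft/LLM2CLIP-EVA02-L-14-336"

# Assumes a standard CLIP-style image preprocessor matches this ViT backbone.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).eval()

image = Image.open("example.jpg")  # placeholder image path
pixels = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    # Assumes the remote code exposes a CLIP-style get_image_features() method.
    image_features = model.get_image_features(pixels)

print(image_features.shape)
```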
---
## News 🚀🚀🚀
- **[2024-11-06]** OpenAI's CLIP and EVA02's ViT base and large models are now available on [HuggingFace](https://huggingface.co/collections/microsoft/llm2clip-672323a266173cfa40b32d4c). More model versions and datasets coming soon!
- **[2024-11-01]** Our paper was accepted to the NeurIPS 2024 SSL Workshop!
---
## Model Zoo (Continuously Updated)
Stay tuned for updates on pretrained models and datasets, which will be made available in the [HuggingFace Model Zoo](https://huggingface.co/collections/microsoft/llm2clip-672323a266173cfa40b32d4c).

---
## 💻 Installation Guide
1. **Create the environment**:

```bash
# Example environment setup. The environment name, Python version, and the
# presence of a requirements.txt are assumptions; adjust to the repository's
# actual instructions.
conda create -n llm2clip python=3.8
conda activate llm2clip
pip install -r requirements.txt
```
2. **Data Preparation**:
*(Coming Soon)*
3. **🔥 Training**:
```bash
sh run.sh
```
## ❤️ Acknowledgements
Our code is built on top of [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP). We would like to thank the EVA team for their foundational work.
## Citation
If you use our work, please cite: