- [Use SimCSE with Huggingface](#use-simcse-with-huggingface)
26
+
- [Train SimCSE](#train-simcse)
27
+
  - [Requirements](#requirements)
28
+
  - [Evaluation](#evaluation)
29
+
  - [Training](#training)
26
30
- [Bugs or Questions?](#bugs-or-questions)
27
31
- [Citation](#citation)
28
32
- [SimCSE Elsewhere](#simcse-elsewhere)
@@ -33,22 +37,71 @@ We propose a simple contrastive learning framework that works with both unlabele
33
37
34
38

35
39
36
-
## Use our models out of the box
37
-
Our pre-trained models are now publicly available with [HuggingFace's Transformers](https://github.com/huggingface/transformers). Models and their performance are presented as follows:
40
+
## Getting Started
41
+
42
+
We provide an easy-to-use sentence embedding tool based on our SimCSE model. To use the tool, first install the `simcse` package from PyPI:
43
+
```bash
44
+
pip install simcse
45
+
```
46
+
47
+
Or install it directly from our code:
48
+
```bash
49
+
python setup.py install
50
+
```
51
+
52
+
Note that if you want to enable GPU encoding, you should install a version of PyTorch that supports CUDA. See the [official PyTorch website](https://pytorch.org) for instructions.
53
+
54
+
After installing the package, you can load our model with just two lines of code:
55
+
```python
56
+
from simcse import SimCSE
57
+
model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")
58
+
```
59
+
See [model list](#model-list) for a full list of available models.
60
+
61
+
Then you can use our model for **encoding sentences into embeddings**:
62
+
```python
63
+
embeddings = model.encode("A woman is reading.")
64
+
```
65
+
66
+
**Compute the cosine similarities** between two groups of sentences:
67
+
```python
68
+
sentences_a = ['A woman is reading.', 'A man is playing a guitar.']
69
+
sentences_b = ['He plays guitar.', 'A woman is making a photo.']
```

Or build an index for a group of sentences and **search** among them:
74
+
```python
75
+
sentences = ['A woman is reading.', 'A man is playing a guitar.']
76
+
model.build_index(sentences)
77
+
results = model.search("He plays guitar.")
78
+
```
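
To make the similarity and search steps more concrete, here is a minimal pure-Python sketch of cosine similarity between two groups of embeddings and a brute-force search over an index. It is not the `simcse` implementation: the toy 3-dimensional vectors stand in for real embeddings, and the helper names `cosine`, `similarity_matrix`, and `search` (including its `threshold` and `top_k` parameters) are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(group_a, group_b):
    """Entry [i][j] is the cosine similarity of group_a[i] and group_b[j]."""
    return [[cosine(u, v) for v in group_b] for u in group_a]

def search(query_vec, index, threshold=0.6, top_k=5):
    """Brute-force search over `index`, a list of (sentence, vector) pairs.

    Returns up to `top_k` (sentence, score) pairs whose score is at least
    `threshold`, best match first.
    """
    scored = [(sentence, cosine(query_vec, vec)) for sentence, vec in index]
    hits = [(sentence, score) for sentence, score in scored if score >= threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:top_k]

# Toy vectors standing in for sentence embeddings (hypothetical values).
index = [
    ("A woman is reading.", [1.0, 0.0, 0.0]),
    ("A man is playing a guitar.", [0.0, 1.0, 0.2]),
]
query = [0.0, 0.9, 0.1]  # pretend embedding of "He plays guitar."
results = search(query, index)
print(results)
```

A dedicated library such as `faiss` replaces the linear scan in `search` with an optimized index, which matters once the corpus grows beyond a few thousand sentences.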
79
+
80
+
We also support [faiss](https://github.com/facebookresearch/faiss), an efficient similarity search library. Just install the package by following [these instructions](https://github.com/facebookresearch/faiss/blob/master/INSTALL.md), and `simcse` will automatically use `faiss` for efficient search.
81
+
82
+
**WARNING**: We have found that `faiss` does not support Nvidia Ampere GPUs (3090 and A100) well. In that case, you should switch to other GPUs or install the CPU version of the `faiss` package.
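
The automatic fallback described above can be pictured with the sketch below. This is not the package's actual code: `pick_search_backend` and the backend labels are hypothetical names, and the sketch only checks whether `faiss` is importable, without testing GPU compatibility.

```python
import importlib.util

def pick_search_backend(prefer_faiss=True):
    """Return "faiss" when the package is importable, else "brute-force"."""
    if prefer_faiss and importlib.util.find_spec("faiss") is not None:
        return "faiss"
    return "brute-force"

backend = pick_search_backend()
print(f"Using {backend} search")
```

Passing `prefer_faiss=False` forces the plain fallback, which is one way to sidestep a GPU build that misbehaves on your hardware.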
83
+
84
+
We also provide an easy-to-build [demo website](./demo) to show how SimCSE can be used in sentence retrieval.
85
+
86
+
## Model List
87
+
88
+
Our released models are listed below. You can import these models by using the `simcse` package or using [HuggingFace's Transformers](https://github.com/huggingface/transformers).
**Naming rules**: `unsup` and `sup` represent "unsupervised" (trained on Wikipedia corpus) and "supervised" (trained on NLI datasets) respectively.
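
As a small illustration of the naming rule, the sketch below splits a model name into its training mode and backbone. It assumes names follow the `{sup|unsup}-simcse-{backbone}` pattern seen in `princeton-nlp/sup-simcse-bert-base-uncased`; the helper `describe_model` is hypothetical and not part of the `simcse` package.

```python
def describe_model(name):
    """Split a name such as 'princeton-nlp/sup-simcse-bert-base-uncased'."""
    short = name.split("/")[-1]  # drop the organization prefix, if any
    mode, sep, backbone = short.partition("-simcse-")
    if not sep or mode not in ("sup", "unsup"):
        raise ValueError(f"not a SimCSE-style model name: {name!r}")
    training = "supervised (NLI)" if mode == "sup" else "unsupervised (Wikipedia)"
    return {"training": training, "backbone": backbone}

info = describe_model("princeton-nlp/sup-simcse-bert-base-uncased")
print(info)
```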
50
101
51
-
You can easily import our model in an out-of-the-box way with HuggingFace's API:
102
+
## Use SimCSE with Huggingface
103
+
104
+
Besides using our provided sentence embedding tool, you can also easily import our models with HuggingFace's `transformers`:
52
105
```python
53
106
import torch
54
107
from scipy.spatial.distance import cosine
@@ -81,9 +134,11 @@ print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[
81
134
82
135
If you encounter any problem when directly loading the models with HuggingFace's API, you can also download the models manually from the above table and use `model = AutoModel.from_pretrained({PATH TO THE DOWNLOADED MODEL})`.
83
136
84
-
If you only want to use our models in an out-of-the-box way, just installing the latest version of `torch`, `transformers` and `scipy` is enough. If you want to use our training or evaluation code, see the requirement section below.
137
+
## Train SimCSE
85
138
86
-
## Requirements
139
+
In the following section, we describe how to train a SimCSE model by using our code.
140
+
141
+
### Requirements
87
142
88
143
First, install PyTorch by following the instructions from [the official website](https://pytorch.org). To faithfully reproduce our results, please use the correct `1.7.1` version corresponding to your platform and CUDA version. PyTorch versions higher than `1.7.1` should also work. For example, if you use Linux and **CUDA11** ([how to check CUDA version](https://varhowto.com/check-cuda-version/)), install PyTorch with the following command,
89
144
@@ -104,7 +159,7 @@ Then run the following script to install the remaining dependencies,
104
159
pip install -r requirements.txt
105
160
```
106
161
107
-
## Evaluation
162
+
### Evaluation
108
163
Our evaluation code for sentence embeddings is based on a modified version of [SentEval](https://github.com/facebookresearch/SentEval). It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks. For STS tasks, our evaluation uses the "all" setting and reports Spearman's correlation. See [our paper](https://arxiv.org/pdf/2104.08821.pdf) (Appendix B) for evaluation details.
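
For readers unfamiliar with the metric, here is a minimal pure-Python sketch of Spearman's correlation between predicted similarities and gold scores. It is not the SentEval code, it does not reproduce the "all" aggregation described in the paper, and the numbers are made up for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

predicted = [0.91, 0.15, 0.62, 0.40]  # hypothetical cosine similarities
gold = [4.8, 0.5, 3.9, 2.0]           # hypothetical human scores (0-5)
rho = spearman(predicted, gold)
print(f"Spearman's rho: {rho:.3f}")
```

Because only the ranks matter, the metric rewards embeddings that order sentence pairs the way annotators do, regardless of the absolute scale of the cosine scores.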
109
164
110
165
Before evaluation, please download the evaluation datasets by running
@@ -151,13 +206,13 @@ Arguments for the evaluation script are as follows,
151
206
* `na`: Manually set tasks by `--tasks`.
152
207
* `--tasks`: Specify which dataset(s) to evaluate on. Will be overridden if `--task_set` is not `na`. See the code for a full list of tasks.
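
The interaction between `--task_set` and `--tasks` can be sketched as follows. This is an illustration, not the evaluation script itself: the preset names other than `na`, the task lists, and the default value are assumptions, so check the code for the real ones.

```python
import argparse

# Hypothetical preset names and task lists, for illustration only.
PRESETS = {
    "sts": ["STS12", "STS13", "STSBenchmark"],
    "transfer": ["MR", "CR", "SST2"],
}

def resolve_tasks(argv):
    """Return the tasks to run for the given command-line arguments."""
    parser = argparse.ArgumentParser()
    # The default of "sts" is an assumption made for this sketch.
    parser.add_argument("--task_set", default="sts", choices=[*PRESETS, "na"])
    parser.add_argument("--tasks", nargs="+", default=[])
    args = parser.parse_args(argv)
    if args.task_set != "na":
        return PRESETS[args.task_set]  # a preset wins; --tasks is ignored
    return args.tasks

print(resolve_tasks(["--task_set", "na", "--tasks", "STSBenchmark"]))
```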
153
208
154
-
## Training
209
+
### Training
155
210
156
-
### Data
211
+
#### Data
157
212
158
213
For unsupervised SimCSE, we sample 1 million sentences from English Wikipedia; for supervised SimCSE, we use the SNLI and MNLI datasets. You can run `data/download_wiki.sh` and `data/download_nli.sh` to download the two datasets.
159
214
160
-
### Training scripts
215
+
#### Training scripts
161
216
162
217
We provide example training scripts for both unsupervised and supervised SimCSE. In `run_unsup_example.sh`, we provide a single-GPU (or CPU) example for the unsupervised version, and in `run_sup_example.sh` we give a **multiple-GPU** example for the supervised version. Both scripts call `train.py` for training. We explain the arguments below:
163
218
* `--train_file`: Training file path. We support "txt" files (one line for one sentence) and "csv" files (2-column: pair data with no hard negative; 3-column: pair data with one corresponding hard negative instance). You can use our provided Wikipedia or NLI data, or you can use your own data with the same format.
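
If you prepare your own data, a quick sanity check of the file format can save a failed run. The sketch below classifies training data according to the three layouts described for `--train_file`. It is not part of `train.py`: the helper `describe_training_data` is hypothetical, and it assumes every CSV row is an example (no header row), which may not match the provided files.

```python
import csv
import io

def describe_training_data(text, extension):
    """Classify training data from its contents and extension ("txt" or "csv")."""
    if extension == "txt":
        # One sentence per line; blank lines are skipped.
        sentences = [line.strip() for line in text.splitlines() if line.strip()]
        return {"kind": "sentences", "examples": len(sentences)}
    if extension == "csv":
        rows = [row for row in csv.reader(io.StringIO(text)) if row]
        widths = {len(row) for row in rows}
        if widths == {2}:
            kind = "pairs"
        elif widths == {3}:
            kind = "pairs with hard negatives"
        else:
            raise ValueError(f"expected 2 or 3 columns in every row, got {sorted(widths)}")
        return {"kind": kind, "examples": len(rows)}
    raise ValueError(f"unsupported extension: {extension!r}")

triplets = "A man plays guitar.,Someone makes music.,A man is asleep.\n"
print(describe_training_data(triplets, "csv"))
```

Sentences that contain commas need to be quoted in the CSV file, otherwise the column count check above will reject the row.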
@@ -173,10 +228,12 @@ All the other arguments are standard Huggingface's `transformers` training argum
173
228
174
229
**REPRODUCTION**: For results in the paper, we use Nvidia 3090 GPUs with CUDA 11. Using different types of devices or different versions of CUDA or other software may lead to slightly different performance.
175
230
176
-
### Convert models
231
+
#### Convert models
177
232
178
233
**IMPORTANT**: Our saved checkpoints are slightly different from Huggingface's pre-trained checkpoints. Run `python simcse_to_huggingface.py --path {PATH_TO_CHECKPOINT_FOLDER}` to convert them. After that, you can evaluate them with our [evaluation](#evaluation) code or directly use them [out of the box](#use-simcse-with-huggingface).
179
234
235
+
236
+
180
237
## Bugs or questions?
181
238
182
239
If you have any questions related to the code or the paper, feel free to email Tianyu (`[email protected]`) and Xingcheng (`[email protected]`). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!
Several demos are available for people to play with our pre-trained SimCSE.
3
+
4
+
### Flask Demo
5
+
<div align="center">
6
+
<img src="../figure/demo.gif" width="750">
7
+
</div>
8
+
9
+
We provide a simple Web demo based on [flask](https://github.com/pallets/flask) to show how SimCSE can be directly used for information retrieval. To run this flask demo locally, make sure the SimCSE inference interfaces are set up:
10
+
```bash
11
+
git clone https://github.com/princeton-nlp/SimCSE
12
+
cd SimCSE
13
+
python setup.py develop
14
+
```
15
+
Then you can use `run_demo_example.sh` to launch the demo. By default, we build the index for 1000 sentences sampled from the STS-B dataset. Feel free to build the index from your own corpora. You can also install [faiss](https://github.com/facebookresearch/faiss) to speed up the retrieval process.
16
+
17
+
### Gradio Demo
18
+
[AK391](https://github.com/AK391) has provided a [Gradio Web Demo](https://gradio.app/g/AK391/SimCSE) of SimCSE to show how the pre-trained models can predict the semantic similarity between two sentences.