Using CLIP in transfer learning for multilabel classification #334
Coming back to give an update on this: the strategy I described worked very well. I did some fine-tuning using the original CLIP objective, which led to a large improvement on multilabel classification (ignoring the problem of in-batch false negatives). Regarding the positive logits, here is a paper explaining the behaviour.
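For reference, here is a minimal sketch of the "original CLIP objective" mentioned above, i.e. the symmetric in-batch contrastive loss from the CLIP paper. The function name and signature are my own; adapt it to however your training loop produces embeddings.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # Symmetric cross-entropy over in-batch image-text pairs, as in the CLIP paper.
    # Matching pairs sit on the diagonal of the similarity matrix; every other pair
    # is treated as a negative, which is where the in-batch false negatives
    # mentioned above come from.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = logit_scale * image_features @ text_features.t()
    targets = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```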
This is a great approach; do you have any code/learnings to share?
I am also interested in knowing how CLIP can be used for multilabel classification.
@EachOneChew could you share a code sample to show how you achieved this?
@talrejanikhil @fm1320 Hi, I unfortunately do not have access to the code I wrote. Here are some points that may help:
That's all I remember. Good luck with your projects 👍
Hi @EachOneChew, thanks for helping with the training code, but when you say the following:
Please, can you point me to the training code that you used to freeze the layers of the CLIP model? Is it mentioned in some official paper? Thanks
@abhijitherekar there is no official reference to freezing layers, because they used a distributed loss when training. Freezing the layers was an adjustment I made myself to account for memory limitations on individual devices, and I saw good results with it. To freeze a layer, set `requires_grad = False` on its parameters.
Look through the model parameters yourself to determine which to freeze.
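As an illustration, a minimal sketch of inspecting and freezing parameters, assuming the Hugging Face `transformers` CLIP implementation (which layers to keep trainable is a placeholder choice; check the parameter names of whichever CLIP implementation you actually use):

```python
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Inspect parameter names to decide which layers to freeze.
for name, _ in model.named_parameters():
    print(name)

# Example: freeze everything except the projection heads and the logit scale.
# This split is illustrative only; pick the subset that fits your memory budget.
for name, param in model.named_parameters():
    if not any(k in name for k in ("visual_projection", "text_projection", "logit_scale")):
        param.requires_grad = False
```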
@EachOneChew How did you determine the threshold for similarities? I noticed that CLIP similarities are generally very low between image-text pairs but relatively high between similar text-text and image-image pairs, even after fine-tuning on massive data. To see this, take a pre-trained model from HF, an image of a tiger, and the caption "a photo of a tiger"; the cosine similarity will be around 0.3. If you have fine-tuned CLIP, have you checked during training how the similarity between images and their corresponding texts increases? For me it hasn't gone above 0.31-0.32 when fine-tuning with a distributed loss on massive fashion data, so I stopped paying attention to it and only monitor zero-shot accuracy like most people online (that does increase). For multi-label I take the top-k predictions, and if the difference between the largest and second-largest probability is small I output both.
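A minimal sketch of the top-k-with-margin heuristic described above; the function name, `k`, and `margin` values are placeholders to be tuned, not something from the thread.

```python
import torch

def multilabel_from_logits(logits_per_image, k=2, margin=0.05):
    # logits_per_image: (num_images, num_prompts) scaled similarities from CLIP.
    # Keep the top-1 prediction, plus any of the remaining top-k predictions
    # whose probability is within `margin` of the top-1 probability.
    probs = logits_per_image.softmax(dim=-1)
    top_probs, top_idx = probs.topk(k, dim=-1)
    labels = []
    for p, idx in zip(top_probs, top_idx):
        keep = [idx[0].item()]
        for j in range(1, k):
            if (p[0] - p[j]).item() < margin:
                keep.append(idx[j].item())
        labels.append(keep)
    return labels
```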
First of all I apologize if this is the wrong place to be posting a question like this. If that is the case, please let me know (and point me towards where I should go).
When applied to zero-shot learning, CLIP considers an image and assigns to it the most relevant text prompt. As stated in the paper, this is a multinomial logistic regression problem, i.e. a type of multiclass classification.
I am hoping to adapt CLIP to perform multilabel classification, wherein CLIP assigns all relevant text prompts when given an image and a set of text prompts. Another way of framing the problem is to treat it as multiple binary classification problems: for each text prompt, decide whether or not it applies to the image.
To this end I am simply taking the unscaled logits output by CLIP (produced by taking the cosine similarity between image and text embeddings). Instead of applying softmax over them, I set a threshold and assign all text prompts whose similarity score exceeds it.
Doing so has gotten me decent results so far, but I would like to ask for the input of people more knowledgeable than me. Is it safe to assume that the cosine similarity between image and text embeddings is a meaningful indicator of their relatedness?
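For concreteness, here is a sketch of the thresholding idea described above, using the OpenAI `clip` package. The prompts, image path, and the 0.25 threshold are placeholders, not values from my experiments; the threshold would need to be tuned on held-out labelled data.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = ["a photo of a dog", "a photo of a ball", "a photo of a beach"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product is a cosine similarity in [-1, 1].
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarities = (image_features @ text_features.T).squeeze(0)

threshold = 0.25  # placeholder; tune on validation data
assigned = [p for p, s in zip(prompts, similarities.tolist()) if s > threshold]
print(assigned)
```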
A slightly related question: I have only ever observed positive cosine similarities (or `logits_per_image`) output by CLIP. I was under the impression that the range of logit scores should be (-1, 1), or (-100, 100) after multiplying by `logit_scale`. Why is this? Thank you!