
Using CLIP in transfer learning for multilabel classification #334


Closed
EachOneChew opened this issue Mar 10, 2023 · 8 comments



EachOneChew commented Mar 10, 2023

First of all I apologize if this is the wrong place to be posting a question like this. If that is the case, please let me know (and point me towards where I should go).

When applied in zero-shot learning, CLIP considers an image and assigns to it the most relevant text prompt. As stated in the paper, this is a multinomial logistic regression problem, i.e. a type of multiclass classification.

I am hoping to adapt CLIP to perform multilabel classification, wherein CLIP assigns all relevant text prompts when given an image and a set of text prompts. Another way of framing the problem is to treat it as multiple binary classification problems: for each text prompt, decide whether or not to assign it to the image.

To this end I am simply taking the unscaled logits (produced by taking the cosine similarity between image and text embeddings) output by CLIP. Instead of applying softmax over them, I assign all text prompts whose similarity score exceeds a fixed threshold.
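
To make this concrete, here is a minimal sketch of what I am doing with the openai/CLIP package; the prompt list, image path, and threshold value are just illustrative placeholders:

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = ["a photo of a cat", "a photo of a dog", "an outdoor scene"]  # placeholder label prompts
threshold = 0.22  # placeholder; tuned on a validation set

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# normalize so the dot product is the cosine similarity (the "unscaled logits")
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarities = (image_features @ text_features.T).squeeze(0)  # shape: (num_prompts,)

# multilabel decision: keep every prompt above the threshold instead of softmax + argmax
assigned = [p for p, s in zip(prompts, similarities.tolist()) if s > threshold]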

Doing so has gotten me decent results so far, but I would like to ask for the input of people more knowledgeable than me. Is it safe to assume that the cosine similarities between image and text embeddings are meaningful indicators of their relatedness?

A slightly related question: I have only ever observed positive cosine similarities (or logits_per_image) output by CLIP. I was under the impression that the range of logit scores should be (-1, 1), or (-100, 100) after multiplying by logit_scale. Why is this?

Thank you!


EachOneChew commented Oct 31, 2023

Coming back to give an update on this: the strategy I described worked very well. I did some fine-tuning using the original CLIP objective, which gave a large improvement on multilabel classification (ignoring the problem of in-batch false negatives).
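
By "original CLIP objective" I mean the symmetric cross-entropy over the in-batch similarity matrix described in the CLIP paper; a minimal sketch (the function name and signature are my own):

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # the i-th image and i-th text are treated as the only positive pair,
    # which is where the in-batch false-negative problem comes from
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    labels = torch.arange(image_features.size(0), device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)
    return (loss_i + loss_t) / 2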

Regarding the positive logits, here is a paper explaining the behaviour.
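
As for the (-100, 100) range: as far as I can tell, the logits are just the cosine similarities multiplied by logit_scale.exp(), which is roughly 100 in the released models, so the theoretical range really is about (-100, 100). The observed values stay positive simply because the learned embeddings rarely produce negative cosine similarities. Illustrative numbers only:

import torch

logit_scale = torch.tensor(4.6052)          # ln(100), approximately the learned value
cosine_sim = torch.tensor([[0.31, 0.18]])   # made-up image-text cosine similarities
logits_per_image = logit_scale.exp() * cosine_sim   # ~ tensor([[31., 18.]])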


fm1320 commented Nov 14, 2023

This is a great approach; do you have any code or learnings to share?

@talrejanikhil

I am also interested in knowing how CLIP can be used for multilabel classification

@talrejanikhil

@EachOneChew could you share some code sample to show how you achieved this?


EachOneChew commented Dec 11, 2023

@talrejanikhil @fm1320 Hi, I unfortunately do not have access to the code I wrote. Here are some points that may help:

  • You can find many samples of CLIP training code online; you can use them with minimal changes.
  • For hyperparameters, good learning rates vary case by case, but in general they should be much lower during fine-tuning than what the CLIP team used for pre-training. Consider reducing weight decay. Finally, a linear warmup on the learning rate anecdotally let me get away with higher learning rates for faster training. I used a batch size of 100.
  • Simple image augmentation can be used; in my case flipping pure white backgrounds to black, rotating, varying image dimensions, etc. No need to go overboard: I think it helped me a little, but the difference was small and could just be chance.
  • I used CLIP ViT-L/14 and trained with the bottom half of the layers of both encoders frozen (IIRC one of the two transformers has more layers, so you freeze more layers on that one). If you don't do this you run into memory issues unless you use a distributed loss.
  • When training and doing inference I saw a large performance gain by averaging text-text and image-text similarity, since my images had titles (rough sketch at the end of this comment).
  • WiSE-FT is simple to implement and gave me 5%+ performance gains consistently (sketched just below).
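
A minimal sketch of WiSE-FT, i.e. weight-space interpolation between the zero-shot and fine-tuned checkpoints; the function name and alpha value here are my own, and alpha is normally tuned on a validation set:

import copy

def wise_ft(zero_shot_model, finetuned_model, alpha=0.5):
    # interpolate every weight: alpha=0 gives the zero-shot model, alpha=1 the fine-tuned one
    zs_state = zero_shot_model.state_dict()
    ft_state = finetuned_model.state_dict()
    merged = {k: (1 - alpha) * zs_state[k] + alpha * ft_state[k] for k in zs_state}
    merged_model = copy.deepcopy(zero_shot_model)
    merged_model.load_state_dict(merged)
    return merged_model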

That's all I remember, good luck with your projects 👍
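
To make the similarity-averaging bullet above a bit more concrete, here is a rough sketch of what I mean; the prompts, title, and image path are just illustrative placeholders:

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

prompts = ["a red dress", "a leather jacket"]   # placeholder label prompts
title = "Vintage red evening dress"             # placeholder image title

image = preprocess(Image.open("product.jpg")).unsqueeze(0).to(device)
prompt_tokens = clip.tokenize(prompts).to(device)
title_tokens = clip.tokenize([title]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    prompt_emb = model.encode_text(prompt_tokens)
    title_emb = model.encode_text(title_tokens)

# normalize so dot products are cosine similarities
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
prompt_emb = prompt_emb / prompt_emb.norm(dim=-1, keepdim=True)
title_emb = title_emb / title_emb.norm(dim=-1, keepdim=True)

image_text_sim = (image_emb @ prompt_emb.T).squeeze(0)   # image vs. each prompt
text_text_sim = (title_emb @ prompt_emb.T).squeeze(0)    # title vs. each prompt
combined = (image_text_sim + text_text_sim) / 2          # averaged score used for thresholding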

@abhijitherekar

Hi @EachOneChew, thanks for helping with the training advice, but when you say the following:

I used CLIP ViT-L/14 and trained with the bottom half of the layers of both encoders frozen (IIRC one of the two transformers has more layers, so you freeze more layers on that one). If you don't do this you run into memory issues unless you use a distributed loss.

Please, can you point me to the training code that you used to freeze the layers of the CLIP model?
Also, can you explain how you came to the conclusion of freezing the layers?

Is it mentioned in some official paper?

Thanks

@EachOneChew

@abhijitherekar there is no official reference to freezing layers because the CLIP team used a distributed loss when training. Freezing the layers was an adjustment I made myself to account for memory limitations on individual devices, and I saw good results with it.

To freeze a layer, set requires_grad to False on its parameters. For example, to freeze every parameter in the model:

for param in model.parameters():
    param.requires_grad = False

Look through the model parameters yourself to determine which to freeze.
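
As an example of the more targeted freezing mentioned above, this is roughly what freezing the bottom half of both encoders looks like for ViT-L/14 with this repo's package; the exact split, learning rate, and weight decay below are illustrative rather than the values I used:

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def freeze_bottom_half(blocks):
    # freeze the lower half of a stack of residual attention blocks
    n_frozen = len(blocks) // 2
    for block in blocks[:n_frozen]:
        for param in block.parameters():
            param.requires_grad = False

freeze_bottom_half(model.visual.transformer.resblocks)  # vision tower (24 blocks in ViT-L/14)
freeze_bottom_half(model.transformer.resblocks)         # text tower (12 blocks)

# only hand the still-trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-6, weight_decay=0.05)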


miguelalba96 commented Apr 19, 2024

@EachOneChew How did you determine the threshold for the similarities? I noticed that in general CLIP similarities are very low between image-text pairs but relatively high between similar text-text and image-image pairs. This happens even after fine-tuning on massive data. Just to try it, you can take a pre-trained model from HF and an image of a tiger with the caption "a photo of a tiger"; the cosine similarity will be around 0.3.

If you have fine-tuned CLIP, have you checked during training how the similarity between images and their corresponding texts increases? For me it hasn't gone above 0.31-0.32 when fine-tuning with a distributed loss on massive fashion data, so I stopped paying attention to it and only monitor zero-shot accuracy like most people online (that does increase).

For multi-label I am taking the top-k predictions, and if the difference between the largest probability and the second largest is small I output both.
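
Something like the following (a hypothetical helper; the margin and probabilities are made up):

import torch

def topk_with_margin(probs, labels, k=2, margin=0.05):
    # take the top-k probabilities and also output runner-up labels
    # that land within `margin` of the best one
    values, indices = probs.topk(k)
    selected = [labels[indices[0]]]
    for v, i in zip(values[1:], indices[1:]):
        if values[0] - v < margin:
            selected.append(labels[i])
    return selected

labels = ["t-shirt", "polo shirt", "jacket"]
probs = torch.tensor([0.44, 0.41, 0.15])     # e.g. softmax over scaled similarities
print(topk_with_margin(probs, labels))       # ['t-shirt', 'polo shirt']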
