Thanks for sharing the nice work. But I didn't fully understand how the labels are used:
```python
labels = np.arange(n)  # (0, 1, 2, 3, 4, ...)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t) / 2
```
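If I translate this into PyTorch (my own translation, not from the paper; in particular, the mapping of `axis=0`/`axis=1` onto the two directions is my guess), I believe it looks something like this:

```python
import torch
import torch.nn.functional as F

# My attempt at the same computation in PyTorch terms. Note that
# F.cross_entropy treats `labels` as class *indices* into the columns
# of its input, not as target values for the logits themselves.
n, d = 4, 8
image_features = F.normalize(torch.randn(n, d), dim=-1)
text_features = F.normalize(torch.randn(n, d), dim=-1)

logits = image_features @ text_features.T  # n x n cosine similarities
labels = torch.arange(n)                   # (0, 1, 2, 3, ...)

loss_i = F.cross_entropy(logits, labels)    # image -> text direction (my guess for axis=0)
loss_t = F.cross_entropy(logits.T, labels)  # text -> image direction (my guess for axis=1)
loss = (loss_i + loss_t) / 2
```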
`logits` is the cosine similarity matrix for the `n` image-text pairs. But won't cross-entropy loss push the rows/columns toward `[0, 1, 2, 3, ...]` (`np.arange(n)`), driving `logits[0][0]` to 0, i.e. driving the cosine similarity of `image[0]` and `text[0]` to 0? Our goal is to enlarge the cosine similarity of `image[0]` and `text[0]`.

So, could someone help me understand why `np.arange(n)` is used as the labels rather than one-hot vectors?
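To make the one-hot comparison concrete, here is a small numpy sketch (my own addition, using a row-wise softmax) of what the two label conventions compute:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n = 4
rng = np.random.default_rng(0)
logits = rng.standard_normal((n, n))
probs = softmax(logits, axis=1)  # row-wise softmax (image -> text)

# Convention 1: integer labels, where labels[i] is the index of the
# matching text for image i.
labels = np.arange(n)
loss_indices = -np.log(probs[np.arange(n), labels]).mean()

# Convention 2: the same targets written as one-hot rows
# (i.e. the identity matrix).
one_hot = np.eye(n)
loss_one_hot = -(one_hot * np.log(probs)).sum(axis=1).mean()

print(loss_indices, loss_one_hot)
```

In this sketch the two losses come out identical, so maybe I'm missing something about what `cross_entropy_loss` does with the integer labels.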