Self-Labelling via simultaneous clustering and representation learning
Yuki M Asano & Christian Rupprecht
Summary: We have developed a self-supervised learning formulation that simultaneously learns feature representations and useful dataset labels by optimizing the common cross-entropy loss for features and labels, while maximizing information. This method can be used to generate labels for any image dataset.
Learning from unlabelled data can dramatically reduce the cost of deploying algorithms to new applications, thus amplifying the impact of machine learning in the real world. Self-supervision is an increasingly popular framework for learning without labels. The idea is to define pretext learning tasks that can be constructed from raw data alone, but that still result in neural networks that transfer well to useful applications. Much of the research in self-supervision has focused on designing new pretext tasks. However, given supervised data such as ImageNet, the standard classification objective of minimizing the cross-entropy loss still results in better pre-training than any of these methods (for a given amount of data and model complexity). This suggests that the task of classification may be sufficient for pre-training networks, provided that suitable data labels are available. In this paper, we develop a method to obtain such labels automatically by designing a self-labelling algorithm.
Why do we want to train with labels?

A method which extracts labels from an unlabelled dataset is desirable for three reasons:
- We know that training a neural network using image labels (e.g. from ImageNet) works extremely well and CNNs trained this way can transfer well to other tasks and datasets.
- Even noisy labels such as Instagram hashtags work well, provided there is enough data.
- Labels are a way of understanding datasets and grouping data into abstract categories.
With the interactive tool above you can browse the clusters/labels that our method has discovered automatically in the ImageNet dataset without using any ground-truth labels. Almost all of them are visually coherent and correspond to an intuitive concept.
Representation learning through clustering
The basic idea of our method is to simultaneously label the images and train a network using these labels. This is a chicken-and-egg problem: we require the labels to train the network, and we require the network to predict the labels.
In our method, the network comes first. By using the faint signal generated by a randomly-initialized network, we can bootstrap a first set of image labels that are subsequently refined. By adding various image transformations such as random crops and color jitter, we can further enforce invariance of the labelling against these kinds of non-semantic transformations and let the network learn to extract more meaningful clusters.
Compared to previous approaches such as DeepCluster [1], we do not introduce a separate clustering loss, as this leads to degenerate solutions and requires ad-hoc fixes. Instead, the novelty of our method lies in using a clustering approach that minimizes the same cross-entropy loss that the network itself optimizes. We do this by adding the constraint that the classes should partition the data equally. While this sounds potentially limiting, simply choosing a large enough number of classes lets us account for even highly skewed datasets, since one "true" class can then occupy several of our classes.
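To make this idea concrete, below is a minimal PyTorch-style sketch of one way such a balanced label assignment can be combined with standard cross-entropy training. It is an illustrative simplification, not our released implementation: the names (`balanced_pseudo_labels`, `num_iters`, `model`) are placeholders, the balancing step is a Sinkhorn-Knopp-style normalization, and in practice the assignment is computed over the whole dataset at intervals rather than on every batch.

```python
import torch
import torch.nn.functional as F

def balanced_pseudo_labels(logits, num_iters=3):
    """Turn network predictions into pseudo-labels while pushing the label
    distribution towards an equal partition of the data (Sinkhorn-Knopp-style
    alternating normalization; illustrative sketch)."""
    P = torch.softmax(logits, dim=1)            # N x K prediction matrix
    for _ in range(num_iters):
        P = P / P.sum(dim=0, keepdim=True)      # balance cluster sizes (columns)
        P = P / P.sum(dim=1, keepdim=True)      # renormalize each image (rows)
    return P.argmax(dim=1)                      # hard labels, roughly equally sized

def training_step(model, images, optimizer):
    """One self-labelling step: assign balanced labels, then minimize the
    ordinary cross-entropy loss on (augmented) views of the same images."""
    with torch.no_grad():
        labels = balanced_pseudo_labels(model(images))
    # Random crops, color jitter, etc. would be applied here, so the network
    # has to predict the same label for transformed views of an image.
    loss = F.cross_entropy(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key point the sketch illustrates is that labelling and learning optimize the same objective: the balancing step only redistributes the network's own predictions so that no degenerate all-in-one-cluster solution exists.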
Performance
To test the quality of our learned representation, we have performed extensive experiments on various small and large datasets and tasks.
Small-scale datasets
Below we show the performance of our method on small-scale datasets: CIFAR-10/100 and SVHN. We see that we outperform the previous state of the art, AND [2], by a significant margin.
We show the linear separability performance of the last convolutional layer on various smaller datasets.
Large-scale datasets
We also evaluate our method when training on the training set of ImageNet. For the evaluation, the network's weights are frozen and only a linear layer is trained to assess the performance of the feature maps at various depths of the network. Since a linear layer is a relatively weak classifier, this is indicative of how good the CNN is as a feature extractor.
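A linear probe of this kind is straightforward to set up. The sketch below is illustrative and not our evaluation code: `backbone` stands for the network truncated at the layer of interest, and `feature_dim`, `loader`, and the optimizer settings are assumptions.

```python
import torch
import torch.nn as nn

def linear_probe(backbone, feature_dim, num_classes, loader, epochs=10, lr=0.01):
    """Freeze the backbone and train only a linear classifier on top of its
    (flattened) feature maps -- a standard linear-probe evaluation."""
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()

    classifier = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = backbone(images)          # features from the frozen CNN
            feats = torch.flatten(feats, start_dim=1)
            loss = criterion(classifier(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```

Because only the linear layer is trained, the resulting accuracy directly reflects how linearly separable the frozen features are.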
AlexNet
We show the linear separability performances of the last two convolutional layers of our method and various other methods.
ResNet-50
We also show how we compare against more recent approaches that mostly use contrastive losses. While some of these approaches either do not work on AlexNet (e.g. CPC [5]) or have not been shown to work on AlexNet, we outperform the one that does report AlexNet numbers, the Contrastive Multi-View method [6]. On the ResNet-50 benchmark itself, we outperform MoCo [7] and CPCv2 [8], which use a similar amount of augmentation to ours, but fall below approaches that either use AutoAugment (CPCv2.1 [8], CMC [6]) or heavier augmentation (PIRL [9]). Note that AutoAugment uses manual supervision, so it indirectly injects a small amount of supervision into the training process.
We show the linear separability performances of the average pooled feature map of our method and various other methods.
Transfer performance
Finally, since pre-training is usually aimed at improving downstream tasks, we evaluate the quality of the learned features by fine-tuning the model for three distinct tasks on the PASCAL VOC benchmark: multi-label classification, object detection and semantic segmentation. Results are compared in the table below. Besides running our model the usual way, we can also take advantage of the fact that we are generating labels: a set of labels, once computed, can be repurposed. We thus further improve our performance by computing labels with a powerful ResNet-50 and then using these labels to train an AlexNet on a quicker training schedule. Additionally, we can add an auxiliary RotNet [3] loss, which gives even further performance gains by combining multiple tasks. This hybrid approach is similar to another recent state-of-the-art method which combines rotation with a retrieval task [4]. The resulting model (SeLa* [3k x 10] + Rot) achieves state of the art in unsupervised representation learning for AlexNet, with a gap of 1.3% to the previous best performance on ImageNet, and surpasses the ImageNet-supervised baseline transferred to Places by 1.7%.
We show the performance of our method on downstream tasks using the smaller PASCAL VOC dataset.
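To give a sense of how the auxiliary rotation loss mentioned above can be combined with self-labelling, here is a rough sketch. It is an assumption-laden illustration rather than our exact training recipe: `trunk`, `cls_head`, `rot_head`, and the equal weighting of the two losses are placeholders.

```python
import torch
import torch.nn.functional as F

def rotate_batch(images):
    """Create 0/90/180/270-degree rotated copies of an NCHW batch together
    with their rotation labels (the RotNet pretext task)."""
    rotations, rot_labels = [], []
    for k in range(4):
        rotations.append(torch.rot90(images, k, dims=(2, 3)))
        rot_labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotations), torch.cat(rot_labels)

def hybrid_loss(trunk, cls_head, rot_head, images, pseudo_labels, rot_weight=1.0):
    """Combine the self-labelling cross-entropy with an auxiliary
    rotation-prediction loss on a shared trunk (illustrative sketch)."""
    cls_loss = F.cross_entropy(cls_head(trunk(images)), pseudo_labels)
    rot_images, rot_labels = rotate_batch(images)
    rot_loss = F.cross_entropy(rot_head(trunk(rot_images)), rot_labels)
    return cls_loss + rot_weight * rot_loss
```

Both heads share the same trunk, so the rotation task acts as an additional source of training signal on top of the self-generated classification labels.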
Useful tasks
While recent contrastive losses have shown great leaps in performance over the last months, our method shows that clustering-based approaches yield state-of-the-art representation learning performance while also producing meaningful labels and working across a variety of datasets. Finally, just like the image colorization paper [10], we believe that working on self-supervised learning approaches that also provide a useful side-effect, such as labels, is meaningful in its own right, and hence clustering-based approaches continue to offer a promising avenue for future research.
References
[1] Deep Clustering for Unsupervised Learning of Visual Features. Caron, Bojanowski, Joulin, Douze. Proc. ECCV, 2018.
[2] Unsupervised deep learning by neighbourhood discovery. Huang, Dong, Gong, Zhu. Proc. ICML, 2019.
[3] Unsupervised representation learning by predicting image rotations. Gidaris, Singh, Komodakis. Proc. ICLR, 2018.
[4] Self-supervised representation learning by rotation feature decoupling. Feng, Xu, Tao. Proc. CVPR, 2019.
[5] Representation learning with contrastive predictive coding. Oord, Li, Vinyals. arXiv preprint arXiv:1807.03748, 2018.
[6] Contrastive multiview coding. Tian, Krishnan, Isola. arXiv preprint arXiv:1906.05849, 2019.
[7] Momentum contrast for unsupervised visual representation learning. He et al. arXiv preprint arXiv:1911.05722, 2019.
[8] Data-efficient image recognition with contrastive predictive coding. Hénaff et al. arXiv preprint arXiv:1905.09272, 2019.
[9] Self-supervised learning of pretext-invariant representations. Misra, van der Maaten. arXiv preprint arXiv:1912.01991, 2019.
[10] Colorful image colorization. Zhang, Isola, Efros. Proc. ECCV, 2016.
Authors' webpages: Yuki & Christian