
Parametric Classification for Generalized Category Discovery: A Baseline Study

Xin Wen1*   Bingchen Zhao2*   Xiaojuan Qi1
1 The University of Hong Kong   2 University of Edinburgh
{wenxin,xjqi}@eee.hku.hk   [email protected]
* Equal contribution.

arXiv:2211.11727v4 [cs.CV] 15 Dec 2023

Figure 1. Left: building blocks for representation learning objectives (1: classification-based learning, 2: supervised contrastive learning, 3: self-supervised contrastive learning) and classification objectives (4: non-parametric classification, 5: parametric classification), illustrated with example bird classes such as Mangrove Cuckoo and Green Jay; Right: overall abstraction of current works (RankStat+: 3 → 1 5; UNO+: 1 5; GCD: 2 3 → 4; Ours, SimGCD: 2 3 5), where '→' separates different stages of the method. Our work builds on GCD [43], and jointly trains a parametric classifier.

Abstract

Generalized Category Discovery (GCD) aims to discover novel categories in unlabelled datasets using knowledge learned from labelled samples. Previous studies argued that parametric classifiers are prone to overfitting to seen categories, and endorsed using a non-parametric classifier formed with semi-supervised k-means. However, in this study, we investigate the failure of parametric classifiers, verify the effectiveness of previous design choices when high-quality supervision is available, and identify unreliable pseudo-labels as a key problem. We demonstrate that two prediction biases exist: the classifier tends to predict seen classes more often, and produces an imbalanced distribution across seen and novel categories. Based on these findings, we propose a simple yet effective parametric classification method that benefits from entropy regularisation, achieves state-of-the-art performance on multiple GCD benchmarks, and shows strong robustness to unknown class numbers. We hope the investigation and the proposed simple framework can serve as a strong baseline to facilitate future studies in this field. Our code is available at: https://github.com/CVMI-Lab/SimGCD.

1. Introduction

With large-scale labelled datasets, deep learning methods can surpass humans in recognising images [25]. However, it is not always possible to collect large-scale human annotations for training deep learning models. Therefore, there is a rich body of recognition models that focus on learning from large amounts of unlabelled data. Among them, semi-supervised learning (SSL) [33, 5, 38] is regarded as a promising approach, yet with the assumption that labelled instances are provided for each of the categories the model needs to classify. Generalized category discovery (GCD) [43] was recently formalised to relax this assumption by assuming the unlabelled data can also contain similar yet distinct categories from the labelled data. The goal of GCD is to learn a model that is able to classify the already-seen categories in the labelled data and, more importantly, jointly discover the new categories in the unlabelled data and make correct classifications. Developing a strong method for this problem could help us better utilise the easily available large-scale unlabelled datasets.

Previous works [43, 22, 17, 6] approach this problem from two perspectives: learning generic feature representations to facilitate the discovery of novel categories, and generating pseudo clusters/labels for unlabelled data to guide the learning of a classifier. The former is often achieved by using self-supervised learning methods [22, 52, 18, 24, 9, 54] to improve the generalisation ability of features to novel categories. For constructing the classifier, earlier works [22, 52, 57, 6, 17] adopt a parametric approach that builds a learnable classifier on top of the extracted features. The classifier is jointly optimised with the backbone using labelled data and pseudo-labelled data.

Figure 2. Performance overview. The prior parametric classification method (UNO+ [17]) shows highly degraded performance on 'New' classes. The non-parametric classification work (GCD [43]) performs better, but at the sacrifice of 'Old' class performance and with high inference cost. Our method shows that parametric classification can work well on both metrics.

However, recent research [43, 16] shows that parametric classifiers are prone to overfitting to seen categories (see Fig. 2) and thus promotes using a non-parametric classifier such as k-means clustering. Albeit obtaining promising results, non-parametric classifiers suffer from heavy computation costs on large-scale datasets due to the quadratic complexity of the clustering algorithm. Besides, unlike a learnable parametric classifier, the non-parametric method loses the ability to jointly optimise the separating hyperplanes of all categories in a learnable manner, potentially being sub-optimal.

This motivates us to revisit the reason that makes previous parametric classifiers fail to recognise novel classes. In a series of investigations (Sec. 3) from the view of supervision quality, we verify the effectiveness of prior design choices in feature representations and training paradigms when strong supervision is available, and conclude that the key to previous parametric classifiers' degraded performance is unreliable pseudo labels. By diagnosing the statistics of their predictions, we identify severe prediction biases within the model, i.e., the bias towards predicting more 'Old' classes than 'New' classes (Fig. 5) and the bias of producing imbalanced pseudo-labels across all classes (Fig. 6).

Based on these findings, we thus present a simple parametric classification baseline for generalized category discovery (see Figs. 1 and 7). The representation learning objective follows GCD [43], and the classification objective is simply cross-entropy for labelled samples and self-distillation [9, 3] for unlabelled samples. Besides, an entropy regularisation term is also adopted to overcome biased predictions by enforcing the model to predict more uniformly distributed labels across all possible categories. Empirically, we indeed observe that our method produces more balanced pseudo-labels (Figs. 9 and 10) and achieves a large performance gain on multiple GCD benchmarks (Tabs. 2 to 4), indicating that the two types of biases we identified are the core reason why the parametric-classifier-based approach performs poorly for GCD. Additionally, we observe that the entropy regulariser can also be used to enforce robustness towards an unknown number of categories (Figs. 11 and 12), which could further ease the deployment of parametric classifiers for GCD in real-world scenarios.

Our contributions are summarised as follows: (1) We revisit the design choices of parametric classification and conclude the key factors that make it fail for GCD. (2) Based on the analysis, we propose a simple yet effective parametric classification method. (3) Our method achieves SOTA on multiple popular GCD benchmarks, challenging the recent promotion of non-parametric classification for this task.

2. Related Works

Semi-Supervised Learning (SSL) has been an important research topic where a number of methods have been proposed [5, 38, 41]. SSL assumes that labelled instances are available for all possible categories in the unlabelled dataset; the objective is to learn a model to perform classification using both the labelled samples and the large-scale available unlabelled data. One of the most effective methods for SSL is the consistency-based method, where the model is forced to learn consistent representations of two different augmentations of the same image [38, 5, 41]. Furthermore, it is also shown that self-supervised representation learning is helpful for the task of SSL [51, 34], as it can provide a strong representation for the task.

Open-Set Semi-Supervised Learning considers the case where the unlabelled data may contain outlier data points that do not belong to any of the categories in the labelled training set. The goal is to learn a classifier for the labelled categories from a noisy unlabelled dataset [50, 37, 11, 20]. As this problem only focuses on the performance of the labelled categories, the outliers from novel categories are simply rejected and no further classification is needed.

Generalized Category Discovery (GCD) is a relatively new problem recently formalised in Vaze et al. [43], and is also studied in a parallel line of work termed open-world semi-supervised learning [6, 39]. Different from the common assumption of SSL [33], GCD does not assume the unlabelled dataset comes from the same class set as the labelled dataset, posing a greater challenge for designing an effective model.
GCD can be seen as a natural extension of the novel category discovery (NCD) problem [23], where it is assumed that the unlabelled dataset and the labelled dataset do not have any class overlap; thus, baselines for NCD [22, 52, 57, 56, 17] can be adopted for the GCD problem by extending the classification head to have more outputs [43]. The incremental setting of GCD is also explored [53, 36]. It is pointed out in [43] that a non-parametric classifier formed using semi-supervised k-means can outperform strong parametric classification baselines from NCD [22, 17] because it can alleviate the overfitting to seen categories in the labelled set. In this paper, we revisit this claim and show that parametric classifiers can reach stronger performance than non-parametric classifiers.

Deep Clustering aims at learning a set of semantic prototypes from unlabelled images with deep neural networks. Considering that no label information is available, the focus is on how to obtain reliable pseudo-labels. While early attempts rely on hard labels produced by k-means [7], there has been a shift towards soft labels produced by optimal transport [2, 8], and more recently sharpened predictions from an exponential-moving-average-updated teacher model [9, 3]. Deep clustering has shown strong potential for unsupervised representation learning [7, 2, 8, 9, 3], unsupervised semantic segmentation [12, 49], semi-supervised learning [4], and novel category discovery [17]. In this work, we study the techniques that make strong parametric classifiers for GCD with inspirations from deep clustering.
3. On the Failure of Parametric Classification

In order to explore the reason that makes previous parametric classifiers fail to recognise 'New' classes for generalized category discovery, this section presents preliminary studies to reveal the role of two major components: representation learning (Sec. 3.2) and pseudo-label quality on unseen classes (Sec. 3.3). These have led to conflicting choices in previous works, but why? We show a unified viewpoint (Figs. 3 and 4), and emphasise that taking pseudo-label quality into account is important for selecting the suitable design choice. This then leads to our diagnosis of what causes the degenerated pseudo-labels (Sec. 3.4), and motivates our de-biased pseudo-labelling strategy.

3.1. Investigation Setting

Generalized category discovery. Given an unlabelled dataset D_u = {(x_i^u, y_i^u)} ⊂ X × Y_u, where Y_u is the label space of the unlabelled samples, the goal of GCD is to learn a model to categorise the samples in D_u using the knowledge from a labelled dataset D_l = {(x_i^l, y_i^l)} ⊂ X × Y_l, where Y_l is the label space of the labelled samples and Y_l ⊂ Y_u. We denote the number of categories in Y_u as K_u = |Y_u|; it is common to assume the number of categories is known a priori [22, 52, 57, 17], or can be estimated using off-the-shelf methods [23, 43].

Representation learning. For representation learning, we follow GCD [43], which applies supervised contrastive learning [27] on labelled samples, and self-supervised contrastive learning [10] on all samples (detailed in Sec. 4.1).

Classifier. We follow UNO [17] to adopt a prototypical classifier. Taking f(x) as the feature vector of an image x extracted by the backbone f, the procedure for producing logits is l = (1/τ)(w/‖w‖)^⊤(f(x)/‖f(x)‖). Here τ is the temperature value that scales up the norm of l and facilitates optimisation of the cross-entropy loss [45].
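For concreteness, this cosine-similarity prototypical classifier can be sketched in a few lines of PyTorch. This is a minimal sketch, not the released implementation; the class name and the default temperature (0.1, matching τ_s in Sec. 5.1) are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypicalClassifier(nn.Module):
    """Cosine-similarity classifier: l = (1/tau) * (w/||w||)^T (f(x)/||f(x)||)."""

    def __init__(self, feat_dim: int, num_classes: int, tau: float = 0.1):
        super().__init__()
        # One learnable prototype (weight vector w) per category.
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.tau = tau

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # L2-normalise both features and prototypes, then scale by 1/tau.
        feats = F.normalize(feats, dim=-1)
        protos = F.normalize(self.prototypes, dim=-1)
        return feats @ protos.t() / self.tau

# Example: 768-d DINO ViT-B/16 [CLS] features, 100 categories.
logits = PrototypicalClassifier(768, 100)(torch.randn(4, 768))  # shape (4, 100)
```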
Training settings. We train with varying supervision qualities. The minimal supervision setting utilises only the labels in D_l, while the oracle supervision setting assumes all samples are labelled (both D_l and D_u). Besides, we study two practical settings that adopt pseudo labels for unlabelled samples in D_u: self-label, which predicts pseudo-labels with the Sinkhorn-Knopp algorithm following [17], and self-distil, which depicts another pseudo-labelling strategy as in Fig. 7 and will be introduced in detail in Sec. 4.2. For all settings, we only employ a cross-entropy loss on the (pseudo-)labelled samples at hand for classification. Note that unless otherwise stated, this is done on decoupled features, thus representation learning is unaffected.
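For reference, the self-label setting's pseudo-labelling step can be sketched as below. This follows the common Sinkhorn-Knopp formulation used by UNO-style methods [17]; the epsilon and iteration count are illustrative defaults, not values reported in this paper.

```python
import torch

@torch.no_grad()
def sinkhorn_knopp(logits: torch.Tensor, eps: float = 0.05, iters: int = 3) -> torch.Tensor:
    """Balanced soft pseudo-labels via Sinkhorn-Knopp.
    logits: (B, K) classifier outputs for a batch; returns (B, K) soft labels."""
    q = torch.exp(logits / eps).t()          # (K, B) transport matrix
    q /= q.sum()
    num_classes, batch_size = q.shape
    for _ in range(iters):
        q /= q.sum(dim=1, keepdim=True)      # rows: equal mass per class
        q /= num_classes
        q /= q.sum(dim=0, keepdim=True)      # cols: equal mass per sample
        q /= batch_size
    return (q * batch_size).t()              # each row now sums to 1
```

The row/column normalisations are what enforce the balanced-assignment constraint that plain argmax pseudo-labelling lacks.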
3.2. Which Representation to Build Your Classifier?

Motivation. Following the trend of deep clustering that focuses on self-supervised representation learning [8], the previous parametric classification work UNO [17] fed the classifier with representations taken from the projector. In GCD [43], by contrast, significantly stronger performance is achieved with a non-parametric classifier built upon representations taken from the backbone. We revisit this choice as follows.

Setting. Consider f as the feature backbone, and g as a multi-layer perceptron (MLP) projection head. Given an input image x_i, the representation from the backbone can be written as f(x_i), and that from the projector is g(f(x_i)).

Figure 3. Results with different representations. We build the classifier on post-backbone or post-projector representations, and train with varying supervision quality. Results on 'Old' classes consistently benefit from the post-backbone representations regardless of the supervision quality, while unleashing their potential on 'New' classes requires stronger pseudo labels.
Result & discussion. As in Fig. 3, the post-backbone feature space has a clearly higher upper bound for learning prototypical classifiers than the post-projector feature space. Using a projector in self-supervised learning lets the projector focus on solving pretext tasks and allows the backbone to keep as much information as possible (which facilitates downstream tasks) [13]. But when good classification performance is all you need, our results suggest that the classification objective should build on post-backbone representations directly. The features after the projector might focus more on solving the pretext task and not necessarily be useful for the classification objective. Note that high-quality pseudo labels are necessary to unleash the post-backbone representations' potential to recognise novel categories.

3.3. Decoupled or Joint Representation Learning?

Motivation. Previous parametric classification methods, e.g., UNO [17], commonly tune the representations jointly with the classification objective. On the contrary, in the two-stage non-parametric method GCD [43], where the performance on 'New' classes is notably higher, classification/clustering is fully decoupled from representation learning, and the representations can be viewed as unaltered by classification. In this part, we study whether the joint learning strategy contributes to previous parametric classifiers' degraded performance in recognising 'New' classes.

Setting. Consider f(x) as the representation fed to the classifier. Decoupled training, as the previous settings adopted, indicates f(x) is decoupled (i.e., detached) when computing the logits l; thus, the classification objective won't supervise representation learning. For joint training, the representations are jointly optimised by classification.
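In implementation terms, the two paradigms differ only in a stop-gradient on the feature; a minimal sketch (the helper name and flag are ours):

```python
import torch

def classifier_input(feats: torch.Tensor, decoupled: bool) -> torch.Tensor:
    # Decoupled: stop-gradient, so the classification loss cannot alter
    # the representation; joint: gradients flow back into the backbone.
    return feats.detach() if decoupled else feats

# Usage with the prototypical classifier sketched earlier:
#   logits = classifier(classifier_input(f(x), decoupled=True))
```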

Result & discussion. The results are illustrated in Fig. 4. When adopting the self-labelling strategy, there is a sharp drop in 'Old' class performance on both datasets, while for the 'New' classes, joint training can improve by 13 points on CIFAR100, and drops by a small margin on CUB. In contrast, when a stronger pseudo-labelling strategy (self-distillation) or even oracle labels are utilised, we observe consistent gains from joint training. This means that the joint training strategy does not necessarily result in UNO [17]'s low performance on 'New' classes; on the contrary, it can even boost 'New' class performance by a notable margin. Our overall explanation is that UNO's framework could not make reliable pseudo-labels, thus restricting its ability to benefit from joint training. The joint training strategy is not to blame and is, in fact, helpful. When switching to a more advanced pseudo-labelling paradigm that produces higher-quality pseudo-labels, the help from joint training can be even more significant.

Figure 4. Results with different training paradigms. Decouple denotes that the classifier adopts decoupled features, while joint indicates that the classification objective can affect representation learning. Joint training is helpful when high-quality supervision is available; otherwise, it could lead to degraded representations.

3.4. The Devil Is in the Biased Predictions

Motivation. In Secs. 3.2 and 3.3, we verified the effectiveness of two design choices when high-quality pseudo labels are available, and concluded that the key to previous work's degraded performance is unreliable pseudo labels. We then further diagnose the statistics of its predictions as follows.

Setting. We categorise the model's errors into four types: "True Old", "False New", "False Old", and "True New", according to the relationship between the predicted and ground-truth class. E.g., "True New" refers to predicting a 'New' class sample as another 'New' class, while "False Old" indicates predicting a 'New' class sample as some 'Old' class.

Figure 5. Prediction bias between 'Old'/'New' classes. We simplify the results to binary classification and categorise errors in 'All' ACC into four types. Both works, especially UNO+, are prone to make "False Old" predictions, and many samples corresponding to 'New' classes are misclassified as an 'Old' class.

Figure 6. Prediction bias across 'Old'/'New' classes. We show the per-class prediction distributions. Both works, especially UNO+, are prone to make biased predictions. Across all classes, the predictions are unexpectedly biased towards the head classes.
Result & discussion. We observe two types of prediction bias. In Fig. 5, both works, especially UNO+ [17], are prone to make "False Old" predictions. In other words, their predictions are biased towards 'Old' classes. Besides, the "True New" errors are also notable, indicating that misclassification within 'New' classes is also common. We then depict the predictions' overall distribution across 'Old'/'New' classes in Fig. 6, and both works show highly biased predictions. This double-bias phenomenon motivated the prediction entropy regularisation design in our method.

4. Method

In this section, we present the whole picture of this simple yet effective method (see Fig. 7), a one-stage framework that builds on GCD [43] and jointly trains a parametric classifier with self-distillation and entropy regularisation. In Sec. 5.3, we discuss the step-by-step changes that lead a simple baseline to our solution.

Figure 7. The overall framework of our method. For unlabelled samples, the pseudo-labels are from sharpened predictions of another random augmented view. For labelled samples, we simply adopt the ground truth. Details of representation learning and the mean-entropy-maximisation regulariser are omitted for simplicity; please refer to the text. (Also see Fig. 1 for a high-level comparison with previous works.)

4.1. Representation Learning

Our representation learning objective follows GCD [43], which is supervised contrastive learning [27] on labelled samples, and self-supervised contrastive learning [10] on all samples. Formally, given two views (random augmentations) x_i and x'_i of the same image in a mini-batch B, the self-supervised contrastive loss is written as:

\mathcal{L}^u_\text{rep}=\frac{1}{|B|}\sum_{i \in B}-\log\frac{\exp\left(\boldsymbol{z}_i^\top\boldsymbol{z}_i^{\prime}/\tau_u\right)}{\sum_{n\neq i}\exp\left(\boldsymbol{z}_i^\top\boldsymbol{z}_n^{\prime}/\tau_u\right)}\,, \qquad (1)

where the feature z_i = g(f(x_i)) is ℓ2-normalised, f and g denote the backbone and the projection head, and τ_u is a temperature value. The supervised contrastive loss is similar; the major difference is that positive samples are matched by their labels, formally written as:

\mathcal{L}^s_\text{rep}=\frac{1}{|B^l|}\sum_{i \in B^l}\frac{1}{|\mathcal{N}_i|}\sum_{q\in\mathcal{N}_i}-\log\frac{\exp\left(\boldsymbol{z}_i^\top\boldsymbol{z}_q^{\prime}/\tau_c\right)}{\sum_{n\neq i}\exp\left(\boldsymbol{z}_i^\top\boldsymbol{z}_n^{\prime}/\tau_c\right)}\,, \qquad (2)

where N_i indexes all other images in the same batch that hold the same label as x_i. The overall representation learning loss is balanced with λ: L_rep = (1 − λ)L^u_rep + λL^s_rep, where B^l corresponds to the labelled subset of B.
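A compact sketch of these two losses follows, assuming z and z_prime are the projector outputs of the two views. The denominators here include the positive pair, a standard InfoNCE simplification; the function names are ours, and the temperature defaults follow Sec. 5.1.

```python
import torch
import torch.nn.functional as F

def self_sup_con(z, z_prime, tau_u=0.07):
    """Eq. (1): the positive for z_i is the other view z'_i of the same
    image; all other images' views in the batch act as negatives."""
    z, z_prime = F.normalize(z, dim=1), F.normalize(z_prime, dim=1)
    logits = z @ z_prime.t() / tau_u              # (B, B) pairwise similarities
    targets = torch.arange(z.size(0), device=z.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)

def sup_con(z, z_prime, labels, tau_c=1.0):
    """Eq. (2): every same-label image in the labelled batch is a positive."""
    z, z_prime = F.normalize(z, dim=1), F.normalize(z_prime, dim=1)
    logits = z @ z_prime.t() / tau_c
    pos = (labels[:, None] == labels[None, :]).float()  # positive mask N_i
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -((pos * log_prob).sum(1) / pos.sum(1)).mean()

# Overall: L_rep = (1 - lam) * self_sup_con(...) on all samples
#               + lam * sup_con(...) on the labelled subset, lam = 0.35.
```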
4.2. Parametric Classification

Our parametric classification paradigm follows the self-distillation [9, 3] fashion. Formally, with K = |Y_l ∪ Y_u| denoting the total number of categories, we randomly initialise a set of prototypes C = {c_1, ..., c_K}, each standing for one category. During training, we calculate the soft label for each augmented view x_i by softmax on the cosine similarity between the hidden feature h_i = f(x_i) and the prototypes C, scaled by 1/τ_s:

\boldsymbol{p}_{i}^{(k)}=\frac{\exp\left(\frac{1}{\tau_s}(\boldsymbol{h}_i/\|\boldsymbol{h}_i\|_2)^\top(\boldsymbol{c}_k/\|\boldsymbol{c}_k\|_2)\right)}{\sum_{k^\prime}\exp\left(\frac{1}{\tau_s}(\boldsymbol{h}_i/\|\boldsymbol{h}_i\|_2)^\top(\boldsymbol{c}_{k^\prime}/\|\boldsymbol{c}_{k^\prime}\|_2)\right)}\,, \qquad (3)

and the soft pseudo-label q'_i is produced by the other view x'_i with a sharper temperature τ_t in a similar fashion. The classification objectives are then simply the cross-entropy loss ℓ(q', p) = −Σ_k q'^(k) log p^(k) between the predictions and the pseudo-labels or ground-truth labels:

\mathcal{L}^u_\text{cls}=\frac{1}{|B|}\sum_{i\in B}\ell(\boldsymbol{q}_i^\prime,\boldsymbol{p}_i)-\varepsilon H(\overline{\boldsymbol{p}})\,,\qquad \mathcal{L}^s_\text{cls}=\frac{1}{|B^l|}\sum_{i\in B^l}\ell(\boldsymbol{y}_i,\boldsymbol{p}_i)\,, \qquad (4)

where y_i denotes the one-hot label of x_i. We also adopt a mean-entropy maximisation regulariser [3] for the unsupervised objective. Here \overline{p} = \frac{1}{2|B|}\sum_{i\in B}(p_i + p'_i) denotes the mean prediction of a batch, and the entropy is H(\overline{p}) = −Σ_k \overline{p}^(k) log \overline{p}^(k). Then the classification objective is L_cls = (1 − λ)L^u_cls + λL^s_cls, and the overall objective is simply L_rep + L_cls.
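A minimal sketch of Eq. (4) is given below. For readability we symmetrise the two views (each acts once as student and once as teacher), which is a common implementation choice rather than something stated in the text; the default eps is one of the weights ablated in Figs. 9 to 11 (ε ∈ {0, 1, 2, 4}).

```python
import torch
import torch.nn.functional as F

def unsup_cls_loss(sim_v1, sim_v2, tau_s=0.1, tau_t=0.04, eps=1.0):
    """Unsupervised part of Eq. (4): cross-entropy against the sharpened
    prediction of the other view, minus eps * H(mean prediction).
    sim_v1, sim_v2: raw cosine similarities (B, K) to the prototypes."""
    p1 = F.softmax(sim_v1 / tau_s, dim=1)
    p2 = F.softmax(sim_v2 / tau_s, dim=1)
    # Teacher targets q': the *other* view, sharper temperature, stop-gradient.
    q1 = F.softmax(sim_v2.detach() / tau_t, dim=1)
    q2 = F.softmax(sim_v1.detach() / tau_t, dim=1)
    l1 = -(q1 * torch.log(p1 + 1e-8)).sum(dim=1).mean()
    l2 = -(q2 * torch.log(p2 + 1e-8)).sum(dim=1).mean()
    p_bar = torch.cat([p1, p2]).mean(dim=0)            # mean prediction of the batch
    entropy = -(p_bar * torch.log(p_bar + 1e-8)).sum()  # H(p_bar)
    return 0.5 * (l1 + l2) - eps * entropy

def sup_cls_loss(sim, labels, tau_s=0.1):
    """Supervised part of Eq. (4): plain cross-entropy on labelled samples."""
    return F.cross_entropy(sim / tau_s, labels)

# L_cls = (1 - lam) * unsup_cls_loss(...) + lam * sup_cls_loss(...), lam = 0.35.
```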
Discussions. Please note that this work doesn't aim to promote new methods but to examine existing solutions, provide insights into their failures, and build a simple yet strong baseline solution. The paradigm of producing pseudo-labels from sharpened predictions of another augmented view appears to resemble consistency-based methods [38, 5, 41] in the SSL community. However, despite differences in augmentation strategies and soft/hard pseudo-labels, our approach jointly performs category discovery and self-training-style learning, while the SSL methods purely focus on bootstrapping with unlabelled data and do not discover novel categories. Besides, entropy regularisation is also explored in deep clustering to avoid trivial solutions [3]. In contrast, our method shows its help in overcoming the prediction bias between and within seen/novel classes (Figs. 9 and 10), and in enforcing robustness to unknown numbers of categories (Fig. 11).

5. Experiments

5.1. Experimental Setup

Datasets. We validate the effectiveness of our method on the generic image recognition benchmark (including CIFAR10/100 [29] and ImageNet-100 [14]), the recently proposed Semantic Shift Benchmark [44] (SSB, including CUB [48], Stanford Cars [28], and FGVC-Aircraft [31]), and the harder Herbarium 19 [40] and ImageNet-1K [14]. For each dataset, we follow [43] to sample a subset of all classes as the labelled ('Old') classes Y_l; 50% of the images from these labelled classes are used to construct D_l, and the remaining images are regarded as the unlabelled data D_u. See Tab. 1 for statistics of the datasets we evaluate on.

                              Labelled          Unlabelled
Dataset              Balance  #Image  #Class    #Image  #Class
CIFAR10 [29]            ✓     12.5K      5      37.5K      10
CIFAR100 [29]           ✓     20.0K     80      30.0K     100
ImageNet-100 [14]       ✓     31.9K     50      95.3K     100
CUB [48]                ✓      1.5K    100       4.5K     200
Stanford Cars [28]      ✓      2.0K     98       6.1K     196
FGVC-Aircraft [31]      ✓      1.7K     50       5.0K     100
Herbarium 19 [40]       ✗      8.9K    341      25.4K     683
ImageNet-1K [14]        ✓      321K    500       960K    1000

Table 1. Statistics of the datasets we evaluate on.

Evaluation protocol. We evaluate the model performance with clustering accuracy (ACC) following standard practice [43]. During evaluation, given the ground truth y* and the predicted labels ŷ, the ACC is calculated as ACC = (1/M) Σ_{i=1}^{M} 1(y*_i = p(ŷ_i)), where M = |D_u| and p is the optimal permutation that matches the predicted cluster assignments to the ground-truth class labels.
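This protocol is standard in the clustering literature; a minimal sketch using SciPy's Hungarian solver is below, as an illustration of the metric rather than the authors' exact evaluation script.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_acc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Clustering ACC: find the best one-to-one mapping p between predicted
    clusters and ground-truth classes, then measure accuracy under it."""
    k = int(max(y_pred.max(), y_true.max())) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                                 # co-occurrence counts
    rows, cols = linear_sum_assignment(cost.max() - cost)  # maximise matches
    mapping = dict(zip(rows, cols))                     # predicted -> true class
    return float(np.mean([mapping[p] == t for t, p in zip(y_true, y_pred)]))
```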
Implementation details. Following GCD [43], we train all methods with a ViT-B/16 backbone [15] pre-trained with DINO [9]. We use the output of the [CLS] token, with a dimension of 768, as the feature for an image, and only fine-tune the last block of the backbone. We train with a batch size of 128 for 200 epochs with an initial learning rate of 0.1, decayed with a cosine schedule on each dataset. Aligning with [43], the balancing factor λ is set to 0.35, and the temperature values τ_u, τ_c to 0.07 and 1.0, respectively. For the classification objective, we set τ_s to 0.1, while τ_t is initialised to 0.07 and then warmed up to 0.04 with a cosine schedule in the starting 30 epochs. All experiments are done with an NVIDIA GeForce RTX 3090 GPU.
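The teacher temperature schedule can be written as a cosine interpolation; the sketch below is consistent with the description above, though the exact interpolation in the released code may differ.

```python
import math

def teacher_temp(epoch: int, warmup_epochs: int = 30,
                 tau_start: float = 0.07, tau_end: float = 0.04) -> float:
    """Cosine schedule for tau_t: 0.07 at epoch 0, annealed to 0.04 by
    epoch 30, then held constant."""
    if epoch >= warmup_epochs:
        return tau_end
    progress = epoch / warmup_epochs
    return tau_end + 0.5 * (tau_start - tau_end) * (1 + math.cos(math.pi * progress))
```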

5.2. Comparison With the State of the Art

We compare with state-of-the-art methods in generalized category discovery (ORCA [6] and GCD [43]), strong baselines derived from novel category discovery (RS+ [22] and UNO+ [17]), and k-means [30] on DINO [9] features. On both the fine-grained SSB benchmark (Tab. 2) and the generic image recognition datasets (Tab. 3), our method achieves notable improvements in recognising 'New' classes (the instances in D_u that belong to classes in Y_u \ Y_l), outperforming the SOTAs by around 10%. The results on old classes are also competitive against the best-performing baselines. Given that the ability to discover 'New' classes is the more desirable ability, the results are quite encouraging.

In Tab. 4, we also report the results on Herbarium 19 [40], a naturally long-tailed fine-grained dataset that is closer to the real-world application of generalized category discovery, and ImageNet-1K [14], a large-scale generic classification dataset. Still, our method shows consistent improvements in all metrics.

                   CUB              Stanford Cars        FGVC-Aircraft
Methods        All   Old   New    All    Old    New     All   Old   New
k-means [30]  34.3  38.9  32.1   12.8   10.6   13.8    16.0  14.4  16.8
RS+ [22]      33.3  51.6  24.2   28.3   61.8   12.1    26.9  36.4  22.2
UNO+ [17]     35.1  49.0  28.1   35.5   70.5   18.6    40.3  56.4  32.2
ORCA [6]      35.3  45.6  30.2   23.5   50.1   10.7    22.0  31.8  17.1
GCD [43]      51.3  56.6  48.7   39.0   57.6   29.9    45.0  41.1  46.9
SimGCD        60.3  65.6  57.7   53.8   71.9   45.0    54.2  59.1  51.8
∆             +9.0  +9.0  +9.0  +14.8  +14.3  +15.1    +9.2 +18.0  +4.9

Table 2. Results on the Semantic Shift Benchmark [44].

                 CIFAR10           CIFAR100         ImageNet-100
Methods        All   Old   New   All   Old   New    All   Old   New
k-means [30]  83.6  85.7  82.5  52.0  52.2  50.8   72.7  75.5  71.3
RS+ [22]      46.8  19.2  60.5  58.2  77.6  19.3   37.1  61.6  24.8
UNO+ [17]     68.6  98.3  53.8  69.5  80.6  47.2   70.3  95.0  57.9
ORCA [6]      81.8  86.2  79.6  69.0  77.4  52.0   73.5  92.6  63.9
GCD [43]      91.5  97.9  88.2  73.0  76.2  66.5   74.1  89.8  66.3
SimGCD        97.1  95.1  98.1  80.1  81.2  77.8   83.0  93.1  77.9
∆             +5.6  -2.8  +9.9  +7.1  +5.0 +11.3   +8.9  +3.3 +11.6

Table 3. Results on generic image recognition datasets.

               Herbarium 19       ImageNet-1K
Methods        All   Old   New   All   Old   New
k-means [30]  13.0  12.2  13.4     -     -     -
RS+ [22]      27.9  55.8  12.8     -     -     -
UNO+ [17]     28.3  53.7  14.7     -     -     -
ORCA [6]      20.9  30.9  15.5     -     -     -
GCD [43]      35.4  51.0  27.0  52.5  72.5  42.2
SimGCD        44.0  58.0  36.4  57.1  77.3  46.9
∆             +8.6  +7.0  +9.4  +4.6  +4.8  +4.7

Table 4. Results on more challenging datasets.
Figure 8. Step-by-step differences from GCD [43] to SimGCD on CIFAR100, ImageNet-100, CUB, Stanford Cars, and Herbarium 19 (SL: self-labelling, BR: post-backbone representation, SD: self-distillation, TW: teacher temperature warmup, JT: joint training).

Methods     CF100   CUB   Herb19   IN-100   IN-1K
GCD [43]    7.5m    9m    2.5h     36m      7.7h
SimGCD      1m      18s   3.5m     9.5m     0.6h

Table 5. Inference time over the unlabelled split.

In Tab. 5, we compare the inference time with GCD [43], one iconic non-parametric classification method. Let the numbers of all samples and unlabelled samples be N and N_u, the number of classes K, the feature dimension d, and the number of k-means iterations t; the time complexity of GCD is O(N²d + NKdt) (including the k-means++ initialisation), while our method only requires a nearest-neighbour prototype search for each instance, with time complexity O(N_u Kd). All methods adopt GPU implementations.
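At inference, the parametric classifier therefore reduces to one similarity computation per sample; a minimal sketch of this O(N_u·K·d) nearest-prototype assignment:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def assign_labels(feats: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Each unlabelled sample takes the class of its most similar
    L2-normalised prototype; no iterative clustering is required.
    feats: (N_u, d) features; prototypes: (K, d) learned prototypes."""
    feats = F.normalize(feats, dim=1)
    protos = F.normalize(prototypes, dim=1)
    return (feats @ protos.t()).argmax(dim=1)   # (N_u,) predicted labels
```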
5.3. Ablation Study

In Fig. 8, we ablate the key components that bring the baseline method step-by-step to a new SOTA.

Baseline. We start from GCD [43], a non-parametric classification framework. We keep its representation learning objectives unchanged, and first impose the UNO [17]-style self-labelling classification objectives (+SL) on it, thus transforming it into a parametric classifier. The classifier is built on the projector, and detached from representation learning. Results on 'Old' classes generally improve, while results on 'New' classes see a sharp drop. This is expected due to UNO's strong bias toward 'Old' classes (Fig. 5).

Improving the representations. As suggested in Sec. 3.2, we build the classifier on the backbone (+BR). This further makes notable improvements on 'Old' classes, while changes on 'New' classes vary across datasets. This indicates that the pseudo labels' quality is insufficient to benefit from the post-backbone representations (Fig. 3).

Improving the pseudo labels. We start by replacing the self-labelling strategy with our self-distillation paradigm. As shown in column (+SD), we achieve consistent improvements across all datasets by a large margin (e.g., 26% on CIFAR100, 13% on CUB) in 'New' classes. We then further adopt a teacher temperature warmup strategy (+TW) to lower the confidence of the pseudo-labels at an earlier stage. The intuition is that at the beginning, both the classifier and the representation are not well fitted to the target data, thus the pseudo-labels are not quite reliable. This is shown to be helpful for fine-grained classification datasets, while for generic classification datasets, which are similar to the pre-training data (ImageNet), the unreliable pseudo labels are not a problem, thus lowering the confidence does not help. For simplicity, we keep the training strategy consistent.

Jointly training the representation. The previous settings adopt a decoupled training strategy for consistent representations with GCD [43] and a fair comparison. Finally, as confirmed in Sec. 3.3, we jointly supervise the representation with the classification objective (+JT). This results in a consistent improvement in 'New' classes for all datasets. Changes in 'Old' classes are mostly neutral or positive, with a notable drop on CIFAR100. Our intuition is that the original representations are already good enough for the 'Old' classes in this dataset, and some incorrect pseudo labels lead to slight degradation in this case.

Figure 9. Effect of entropy regularisation on four types of classification errors. Appropriate entropy regularisation helps overcome the bias between 'Old'/'New' classes (see "False New" and "False Old"; lower is better).

Figure 10. Per-class prediction distributions with different entropy regularisation weights. Proper entropy regularisation helps overcome the bias across 'Old'/'New' classes, and approach the GT class distribution.
Figure 11. Results with different numbers of categories (CIFAR100, ImageNet-100, CUB, Stanford Cars, and Herbarium 19; the assumed class number ranges from 80% to 200% of the ground truth). Stronger entropy regularisation effectively enforces the model's robustness to unknown numbers of categories, but over-regularisation may limit the ability to recognise 'New' classes under ground-truth class numbers.

5.4. Analyses and Discussions

Entropy regularisation helps overcome prediction bias. We verify the effectiveness of entropy regularisation in overcoming prediction bias by diagnosing the model's classification errors and class-wise prediction distributions. Fig. 9 shows that this term consistently helps reduce "False New" and "False Old" errors, which refer to predicting an 'Old' class sample as a 'New' class, and vice versa. Besides, Fig. 10 shows that proper entropy regularisation helps overcome the imbalanced pseudo labels across all classes, and approach the ground-truth (GT) class distribution.

Entropy regularisation enforces robustness to unknown numbers of categories. The main text assumed the category number is known a priori following prior works [22, 52, 57, 17], which is impractical [55]. In Fig. 11, we present the results with different numbers of categories on five representative datasets. A category number lower than the ground truth significantly limits the ability to discover 'New' categories, and the model tends to focus more on the 'Old' classes. On the other hand, increasing the category number results in less harm on the generic image recognition datasets and can even be helpful for some datasets. When a stronger entropy penalty is imposed, the model shows strong robustness to the category number. Interestingly, further analysis in Fig. 12 shows the network prefers to keep the number of active prototypes low and close to the real category number. This finding is inspiring and could ease the deployment of GCD in real-world scenarios.

Figure 12. Per-class prediction distributions with different numbers of categories. Our method effectively identifies the criterion for 'New' classes, thus keeping the number of active prototypes close to the ground-truth class number.

Figure 13. Prediction analysis against GCD [43]. Left: based on identical representations, the non-parametric classifier (semi-supervised k-means) adopted by GCD produces highly imbalanced predictions, while our method better fits the true distribution. Right: our method significantly improves GCD's tail classes.

What makes for the significant improvements over GCD given identical representations? One interesting message from Fig. 8 is that, even with the same representations (col. +TW), we can already improve over GCD by a large margin. We thus study the classification predictions and the major components that lead to the performance gap. As shown in Fig. 13, the non-parametric classifier (semi-supervised k-means) adopted by GCD [43] produces highly imbalanced predictions, while our method better fits the true distribution. Further analysis (right part) shows that our method significantly improves over the tail classes of GCD.

How does the classification objective change the representations? In Fig. 8, we have shown that jointly training the representations with the classification objective can lead to a ~15% boost in 'New' classes on CIFAR100. We study this difference by visualising the representations before and after tuning with t-SNE [42]. As in Fig. 14, jointly tuning the features leads to less ambiguity, larger margins, and more compact clusters. Concerning why this is not as helpful for CUB: we hypothesise that one important factor lies in how transferable the features learned on 'Old' classes are to 'New' classes. While it may be easy for a cat classifier to be adapted to dogs, things can be different for fine-grained bird recognition. Besides, the small scale of CUB, which contains only 6k images while holding a large class split (200), might also make it hard to learn transferable features.
Figure 14. t-SNE [42] visualisation of the representations of 10 classes randomly sampled from CIFAR100 [29] (decoupled, i.e., the same representation as GCD, vs. jointly tuned with the classification objective). Jointly supervising representation learning with a classification objective helps disambiguate similar classes (e.g., bed & table) and forms more compact clusters.

Figure 15. Performance evolution throughout the model learning process. We observe a trade-off between the performance on 'Old' and 'New' categories, which is common across datasets.

Trade-off between 'Old' and 'New' categories. We plot the performance evolution throughout the model learning process in Fig. 15. It can be observed that the performance on the 'Old' categories first climbs to its highest point at the early stage of training and then slowly degrades as the performance on the 'New' categories improves. We believe this demonstrates an important aspect of the design of models for the GCD problem: the performance on the 'Old' categories may be at odds with the performance on the 'New' categories; how to achieve a better trade-off between the two could be an interesting investigation for future works.

6. Limitations and Potential Future Works

Representation learning. This paper mainly targets improving the classification ability for generalized category discovery. The representation learning, however, follows the prior work GCD [43]. It is expectable that the quality of representation learning can be improved, for instance, by using more advanced geometric and photometric data augmentations [19], and even multiple local crops [8]. Further, can the design of data augmentations be better aligned with the classification criteria of the target data? For another example, using a large batch size has been shown to be critical to the performance of contrastive learning-based frameworks [10]. However, the batch size adopted by GCD [43] is only 128, which might limit the quality of the learned representations. Moreover, is the supervised contrastive learning plus self-supervised contrastive learning paradigm the ultimate answer to form the feature manifold? We believe that advances in representation learning can lead to further gains.

Alignment to human-defined categories. This paper follows the common practice of previous works, where human labels in seen categories implicitly define the metric for unseen ones, which can be viewed as an effort to align algorithm-discovered categories with human-defined ones. However, labels in seen categories may not be good guidance when there is a gap between the seen categories and the novel categories we want to discover, e.g., how to use the labelled images in ImageNet to discover novel categories in CUB? For another example, when we use a very big class vocabulary (e.g., the full ImageNet-22K [14]), categories could overlap with each other and be of different granularities. Further, assigning text names to the discovered categories still requires a matching process; what if we further utilise the relationship between class names, and directly predict the novel categories in the text space? We believe the alignment between algorithm-discovered categories and human-defined categories is of high research value for future works.

Ethical considerations. Current methods commonly suffer in low-data or long-tailed scenarios. Depending on the data and classification criteria of specific tasks, discrimination against minority categories or instances is possible.

7. Conclusion

This study investigates the reasons behind the failure of previous parametric classifiers in recognizing novel classes in GCD and uncovers that unreliable pseudo-labels, which exhibit significant biases, are the crucial factor. We propose a simple yet effective parametric classification method that addresses these issues and achieves state-of-the-art performance on multiple GCD benchmarks. Our findings provide insights into the design of robust classifiers for discovering novel categories, and we hope our proposed framework will serve as a strong baseline to facilitate future studies in this field and contribute to the development of more accurate and reliable methods for category discovery.

Acknowledgements

This work has been supported by the Hong Kong Research Grant Council - Early Career Scheme (Grant No. 27209621), General Research Fund Scheme (Grant No. 17202422), and RGC Matching Fund Scheme (RMGS). Part of the described research work is conducted in the JC STEM Lab of Robotics for Soft Materials funded by The Hong Kong Jockey Club Charities Trust. The authors acknowledge SmartMore and MEGVII for partial computing support, and Zhisheng Zhong for professional suggestions.
References

[1] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, 2007.
[2] Yuki M. Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.
[3] Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Mike Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. In ECCV, 2022.
[4] Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Armand Joulin, Nicolas Ballas, and Michael Rabbat. Semi-supervised learning of visual features by non-parametrically predicting view assignments with support samples. In ICCV, 2021.
[5] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019.
[6] Kaidi Cao, Maria Brbić, and Jure Leskovec. Open-world semi-supervised learning. In ICLR, 2022.
[7] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
[8] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.
[9] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
[10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[11] Yanbei Chen, Xiatian Zhu, Wei Li, and Shaogang Gong. Semi-supervised learning under class distribution mismatch. In AAAI, 2020.
[12] Jang Hyun Cho, Utkarsh Mall, Kavita Bala, and Bharath Hariharan. Picie: Unsupervised semantic segmentation using invariance and equivariance in clustering. In CVPR, 2021.
[13] Quan Cui, Bingchen Zhao, Zhao-Min Chen, Borui Zhao, Renjie Song, Jiajun Liang, Boyan Zhou, and Osamu Yoshie. Discriminability-transferability trade-off: An information-theoretic perspective. In ECCV, 2022.
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[16] Yixin Fei, Zhongkai Zhao, Siwei Yang, and Bingchen Zhao. Xcon: Learning with experts for fine-grained category discovery. In BMVC, 2022.
[17] Enrico Fini, Enver Sangineto, Stéphane Lathuilière, Zhun Zhong, Moin Nabi, and Elisa Ricci. A unified objective for novel class discovery. In ICCV, 2021.
[18] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
[19] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.
[20] Lan-Zhe Guo, Zhen-Yu Zhang, Yuan Jiang, Yu-Feng Li, and Zhi-Hua Zhou. Safe deep semi-supervised learning for unseen-class unlabeled data. In ICML, 2020.
[21] Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, and Andrew Zisserman. Automatically discovering and learning new visual categories with ranking statistics. In ICLR, 2020.
[22] Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, and Andrew Zisserman. Autonovel: Automatically discovering and learning novel visual categories. IEEE TPAMI, 2021.
[23] Kai Han, Andrea Vedaldi, and Andrew Zisserman. Learning to discover novel visual categories via deep transfer clustering. In ICCV, 2019.
[24] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[26] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 2019.
[27] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In NeurIPS, 2020.
[28] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV Workshops, 2013.
[29] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[30] James MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[31] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
[32] Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. Long-tail learning via logit adjustment. In ICLR, 2021.
[33] Avital Oliver, Augustus Odena, Colin Raffel, Ekin D. Cubuk, and Ian J. Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In NeurIPS, 2018.
[34] Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Semi-supervised learning with scarce annotations. In CVPR Workshops, 2020.
[35] Jiawei Ren, Cunjun Yu, Xiao Ma, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Balanced meta-softmax for long-tailed visual recognition. In NeurIPS, 2020.
[36] Subhankar Roy, Mingxuan Liu, Zhun Zhong, Nicu Sebe, and Elisa Ricci. Class-incremental novel class discovery. In ECCV, 2022.
[37] Kuniaki Saito, Donghyun Kim, and Kate Saenko. Openmatch: Open-set semi-supervised learning with open-set consistency regularization. In NeurIPS, 2021.
[38] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020.
[39] Yiyou Sun and Yixuan Li. Opencon: Open-world contrastive learning. TMLR, 2023.
[40] Kiat Chuan Tan, Yulong Liu, Barbara Ambrose, Melissa Tulig, and Serge Belongie. The herbarium challenge 2019 dataset. arXiv preprint arXiv:1906.05372, 2019.
[41] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, 2017.
[42] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 2008.
[43] Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Generalized category discovery. In CVPR, 2022.
[44] Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Open-set recognition: A good closed-set classifier is all you need? In ICLR, 2022.
[45] Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. Normface: L2 hypersphere embedding for face verification. In ACM MM, 2017.
[46] Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu, and Stella X. Yu. Long-tailed recognition by routing diverse distribution-aware experts. In ICLR, 2021.
[47] Xudong Wang, Zhirong Wu, Long Lian, and Stella X. Yu. Debiased learning from naturally imbalanced pseudo-labels. In CVPR, 2022.
[48] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-201, Caltech, 2010.
[49] Xin Wen, Bingchen Zhao, Anlin Zheng, Xiangyu Zhang, and Xiaojuan Qi. Self-supervised visual representation learning with semantic grouping. In NeurIPS, 2022.
[50] Qing Yu, Daiki Ikami, Go Irie, and Kiyoharu Aizawa. Multi-task curriculum framework for open-set semi-supervised learning. In ECCV, 2020.
[51] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4l: Self-supervised semi-supervised learning. In ICCV, 2019.
[52] Bingchen Zhao and Kai Han. Novel visual category discovery with dual ranking statistics and mutual knowledge distillation. In NeurIPS, 2021.
[53] Bingchen Zhao and Oisin Mac Aodha. Incremental generalized category discovery. In ICCV, 2023.
[54] Bingchen Zhao and Xin Wen. Distilling visual priors from self-supervised learning. In ECCV Workshops, 2020.
[55] Bingchen Zhao, Xin Wen, and Kai Han. Learning semi-supervised gaussian mixture models for generalized category discovery. In ICCV, 2023.
[56] Zhun Zhong, Enrico Fini, Subhankar Roy, Zhiming Luo, Elisa Ricci, and Nicu Sebe. Neighborhood contrastive learning for novel class discovery. In CVPR, 2021.
[57] Zhun Zhong, Linchao Zhu, Zhiming Luo, Shaozi Li, Yi Yang, and Nicu Sebe. Openmix: Reviving known knowledge for discovering novel visual categories in an open world. In CVPR, 2021.
Parametric Classification for Generalized Category Discovery: A Baseline Study
Supplementary Material

Contents

A. Implementation Details . . . . . . . . . . . . . 12
   A.1. Experiment Setting Details . . . . . . . . . 12
   A.2. Re-implementing Previous Works . . . . . . . 12
   A.3. Error Analysis Details . . . . . . . . . . . 12
B. Extended Experiments And Discussions . . . . . . 13
   B.1. Main Results . . . . . . . . . . . . . . . . 13
   B.2. Unknown Category Number . . . . . . . . . . . 13
   B.3. Extended Analyses . . . . . . . . . . . . . . 13
   B.4. Relationship to Imbalanced Recognition . . . 15

A. Implementation Details

A.1. Experiment Setting Details

The split of labelled (‘Old’) and unlabelled (‘New’) categories follows GCD [43]. That is, 50% of all classes are sampled as ‘Old’ classes (Yl), and the rest are regarded as ‘New’ classes (Yu \ Yl). The exception is CIFAR100, for which 80% of classes are sampled as ‘Old’, following the novel category discovery (NCD) literature. Regarding the sampling process, for generic object recognition datasets, the labelled classes are selected by their class index (the first |Yl| ones). For the Semantic Shift Benchmark, the data splits provided in [44] are adopted. For Herbarium 19 [40], the labelled classes are sampled randomly. Additionally, for ImageNet-1K [14], which is not used in [43], we follow the same fashion and select the first 500 classes sorted by class id as the labelled classes. Then, for all datasets, following [43], 50% of the images from the labelled classes are randomly sampled to form the labelled dataset Dl, and all remaining images are regarded as the unlabelled dataset Du. All experiments are done with a batch size of 128 on a single GPU, except for ImageNet-1K, on which we train with eight GPUs, scale the learning rate with the linear scaling rule, and keep the per-GPU batch size unchanged. The inference time on ImageNet-1K is still evaluated with one GPU.

A.2. Re-implementing Previous Works

Results of GCD [43] are taken from the original paper (if available), and otherwise re-implemented with the official codebase. One exception is ImageNet-1K [14], which was not evaluated by the authors. Naively adopting their official codebase on ImageNet-1K fails, as the semi-supervised k-means procedure requires too much GPU memory and cannot be run on available hardware; we therefore drop the k-means++ initialisation [1], which takes the most memory, and re-implement the method with faiss [26] for speed (otherwise the evaluation takes more than one day). The results are in the main paper: compared to our proposed strong baseline SimGCD, GCD requires significantly more time to run and more engineering effort, and yet achieves lower performance, which demonstrates the effectiveness of our proposed method. Results of UNO+ [17] and RS+ [21], which are adaptations of the original works to the GCD task, are directly taken from the GCD [43] paper. Also note that, unlike UNO [17], our method does not adopt the over-clustering trick, for simplicity. Results of ORCA [6] are re-implemented with the official codebase. We align the details of the dataset split and backbone (ViT-B/16 [15] pre-trained with DINO [9]) with GCD [43] for a fair comparison.

A.3. Error Analysis Details

We briefly clarify the details of obtaining the four kinds of prediction errors in the main paper: we first rank the category indexes in consecutive order, such that, by index, all ‘Old’ classes are followed by all ‘New’ classes. We then compute the full confusion matrix, with each element summarising how many times images of one specific class (row index) are predicted as one class (column index). All elements are divided by the number of testing samples to account for the percentage. We then reduce the diagonal terms to zero (they represent correct predictions), so that all remaining elements represent different kinds of prediction errors (i.e., absolute contributions to the error of ‘All’ ACC). Finally, we slice the confusion matrix into four sub-matrices at the boundaries between the ‘Old’ and ‘New’ classes, and add all elements in each sub-matrix together, thus obtaining the final error matrix standing for the four kinds of prediction errors. Such a way of classifying errors helps distinguish the prediction bias between and within seen and novel categories, and thus facilitates the design of new solutions. Note that the diagonal elements of this error matrix, e.g., ‘True Old’, do not stand for correct predictions, but for cases where samples of one specific ‘Old’ class are incorrectly predicted as another ‘Old’ class.
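
This procedure is short enough to sketch directly. Below is a minimal numpy version; the function and argument names are ours:

```python
import numpy as np

def error_matrix(y_true, y_pred, num_old, num_classes):
    """Slice the confusion matrix into the four error types described
    above (a sketch; names are ours). Classes are indexed so that all
    'Old' classes precede all 'New' ones."""
    conf = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        conf[t, p] += 1              # row: true class, column: prediction
    conf /= len(y_true)              # fractions of all testing samples
    np.fill_diagonal(conf, 0.0)      # zero out correct predictions
    return {
        "True Old":  conf[:num_old, :num_old].sum(),  # Old -> wrong Old
        "False New": conf[:num_old, num_old:].sum(),  # Old -> New
        "False Old": conf[num_old:, :num_old].sum(),  # New -> Old
        "True New":  conf[num_old:, num_old:].sum(),  # New -> wrong New
    }
```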
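Similarly, the data split described in Sec. A.1 can be summarised in a few lines. The following is a sketch under our own naming, with the ratios following the text (generic datasets; the Semantic Shift Benchmark and Herbarium 19 use the splits noted above instead):

```python
import numpy as np

def gcd_split(targets, num_classes, old_ratio=0.5, label_ratio=0.5, seed=0):
    rng = np.random.default_rng(seed)
    targets = np.asarray(targets)
    # 'Old' classes: the first |Yl| class indexes (generic datasets)
    num_old = int(old_ratio * num_classes)
    labelled_idx = []
    for c in range(num_old):
        idx = np.flatnonzero(targets == c)
        # 50% of each 'Old' class's images form the labelled set D_l
        labelled_idx.append(rng.permutation(idx)[: int(label_ratio * len(idx))])
    labelled_idx = np.concatenate(labelled_idx)
    # everything else (rest of 'Old' images + all 'New' images) is D_u
    unlabelled_idx = np.setdiff1d(np.arange(len(targets)), labelled_idx)
    return labelled_idx, unlabelled_idx
```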

[Figure 16: bar charts of the absolute ‘All’ ACC error (in %) for each of the four error types (True Old, False New, False Old, True New) on CIFAR100, ImageNet-100, CUB, Stanford Cars, and Herbarium 19, under entropy-regularisation coefficients 0, 1, 2, and 4.]

Figure 16. Complete error analysis results of SimGCD on five representative datasets. With appropriate entropy regularisation, the bias between ‘Old’/‘New’ classes (see “False New” and “False Old” errors) is generally effectively alleviated, except in the long-tailed Herbarium 19, where the effect varies. Also notably, “True New” errors are consistently penalised to a considerable extent, confirming entropy regularisation’s ability to help recognise and distinguish between novel categories.

B. Extended Experiments And Discussions

B.1. Main Results

We present the full results of SimGCD in the main paper with error bars in Tab. 6. The results are obtained from three independent runs, thus mitigating the effect of randomness.

Dataset               All        Old        New
CIFAR10 [29]        97.1±0.0   95.1±0.1   98.1±0.1
CIFAR100 [29]       80.1±0.9   81.2±0.4   77.8±2.0
ImageNet-100 [14]   83.0±1.2   93.1±0.2   77.9±1.9
ImageNet-1K [14]    57.1±0.1   77.3±0.1   46.9±0.2
CUB [48]            60.3±0.1   65.6±0.9   57.7±0.4
Stanford Cars [28]  53.8±2.2   71.9±1.7   45.0±2.4
FGVC-Aircraft [31]  54.2±1.9   59.1±1.2   51.8±2.3
Herbarium 19 [40]   44.0±0.4   58.0±0.4   36.4±0.8

Table 6. Complete results of SimGCD in three independent runs.

B.2. Unknown Category Number

In the main text, we showed that the performance of SimGCD is robust to a wide range of estimated unknown category numbers. In this section, we report the results with the number of categories estimated using an off-the-shelf method [43] (Tab. 7) or with a roughly estimated, relatively big number (two times the ground-truth K), and compare with the baseline method GCD [43].

            CIFAR100  ImageNet-100   CUB   SCars  Herb19
GT K           100         100       200    196     683
Est. K         100         109       231    230     520

Table 7. Number of categories K estimated using [43].

The results on CIFAR100 [29], ImageNet-100 [14], CUB [48], and Stanford Cars [28] are available in Tabs. 8 and 9. Our method shows consistent improvements on the four representative datasets when K is unknown, no matter whether K is set to the category number estimated with a specialised algorithm (w/ Est.) or simply to a loose estimate that is two times the ground truth (w/ 2K; other values are also applicable, since our method is robust to a wide range of estimations). This property could ease the deployment of parametric classifiers for GCD in real-world scenarios.

                          CIFAR100            ImageNet-100
Methods      Known K   All   Old   New     All   Old   New
GCD [43]        ✓      73.0  76.2  66.5    74.1  89.8  66.3
SimGCD          ✓      80.1  81.2  77.8    83.0  93.1  77.9
GCD [43]   ✗ (w/ Est.) 73.0  76.2  66.5    72.7  91.8  63.8
SimGCD     ✗ (w/ Est.) 80.1  81.2  77.8    81.7  91.2  76.8
SimGCD     ✗ (w/ 2K)   77.7  79.5  74.0    80.9  93.4  74.8

Table 8. Results on generic image recognition datasets.

                            CUB              Stanford Cars
Methods      Known K   All   Old   New     All   Old   New
GCD [43]        ✓      51.3  56.6  48.7    39.0  57.6  29.9
SimGCD          ✓      60.3  65.6  57.7    53.8  71.9  45.0
GCD [43]   ✗ (w/ Est.) 47.1  55.1  44.8    35.0  56.0  24.8
SimGCD     ✗ (w/ Est.) 61.5  66.4  59.1    49.1  65.1  41.3
SimGCD     ✗ (w/ 2K)   63.6  68.9  61.1    48.2  64.6  40.2

Table 9. Results on the Semantic Shift Benchmark [44].
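
For reference, the ACC reported throughout these tables is the standard clustering accuracy of the GCD protocol [43]: predictions are matched one-to-one to ground-truth classes with the Hungarian algorithm before computing accuracy, and the ‘Old’/‘New’ scores are read off the same matching. A minimal sketch of the metric (the function name is ours, and this is our paraphrase of the protocol):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_acc(y_true, y_pred, num_classes):
    # Count co-occurrences of predicted vs. ground-truth classes.
    w = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1
    # Hungarian matching: the one-to-one assignment of predicted classes
    # to ground-truth classes that maximises total agreement.
    row, col = linear_sum_assignment(w.max() - w)
    return w[row, col].sum() / len(y_true)
```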

B.3. Extended Analyses

As a supplement to the main paper, we present a more complete version of the analytical experiments.

In Fig. 16, we show the error analysis results of SimGCD over five representative datasets that cover coarse-grained, fine-grained, and long-tailed classification tasks. Overall, it shows that the entropy regulariser mainly helps in overcoming two types of errors: misclassification between ‘Old’/‘New’ categories, and misclassification within ‘New’ categories. One exception is the long-tailed Herbarium 19 dataset, on which the models’ “False Old” errors also increased; our intuition is that the long-tailed distribution adds to the difficulty of discriminating between ‘Old’ and ‘New’ categories. Still, the gain in distinguishing between novel categories is consistent, and we provide a further analysis via per-class prediction distributions in the next paragraph.

In Fig. 17, we show the complete per-class prediction results of SimGCD to further analyse the entropy regulariser’s effect in overcoming the classification errors within ‘Old’ and ‘New’ classes. The results consistently verify its help in alleviating the prediction bias within ‘Old’ and ‘New’ classes, and in better fitting the ground-truth class distribution.

[Figure 17: per-class prediction distributions (#Instance / Class over sorted class indexes) for ‘New’ (top row) and ‘Old’ (bottom row) classes on CIFAR100, ImageNet-100, CUB, Stanford Cars, and Herbarium 19, under entropy-regularisation coefficients 0, 1, 2, and 4, against the ground truth (GT).]

Figure 17. Complete per-class prediction distribution results of SimGCD on five representative datasets. Proper entropy regularisation helps overcome the prediction bias in both ‘Old’ classes and ‘New’ classes, and fits the ground-truth distribution. The conclusion is consistent across generic classification datasets, fine-grained classification datasets, and naturally long-tailed datasets.

In Fig. 18, we present a closer look at ImageNet-100 and Herbarium 19. The entropy regularisation term is formulated to make the model’s predictions closer to the uniform distribution. Interestingly, however, we empirically found that it could make the model’s predictions more biased on the class-balanced ImageNet-100 dataset when the regularisation is too strong. And when the dataset itself is long-tailed (Herbarium 19), it could also help fit the ground-truth distribution. We also note that the self-labelling strategy adopted by UNO [17] forces the predictions in a batch to be strictly uniform, which may account for its inferior performance.

[Figure 18: a zoomed-in view of the per-class prediction distributions on ImageNet-100 and Herbarium 19, for both ‘New’ and ‘Old’ classes, under entropy-regularisation coefficients 1, 2, and 4, against the ground truth (GT).]

Figure 18. A closer look at the per-class distributions. Notably, although the entropy regularisation term is formulated to approach a uniform distribution, it could make the models’ predictions more biased on the class-balanced ImageNet-100 dataset when the regularisation is too strong. Interestingly, it also could help fit the distribution of the long-tailed Herbarium 19 dataset.

In Fig. 19, we also show the per-class prediction distributions obtained with different category numbers. The results on the class-balanced ImageNet-100 are consistent with those on CIFAR100 and CUB in the main paper: using a loose category number greater than the ground truth may harm fitting the ground-truth class distribution, yet the model still manages to find the ground-truth category number. Interestingly, we also find that for the long-tailed Herbarium 19 dataset, using a greater category number could in fact help fit the ground-truth distribution.

[Figure 19: per-class prediction distributions with different category numbers, K=100 vs. K=200 on ImageNet-100 and K=683 vs. K=1366 on Herbarium 19, against the ground truth (GT).]

Figure 19. Per-class prediction distributions using different category numbers on ImageNet-100 and Herbarium 19. Our method effectively identifies the criterion for ‘New’ classes, thus keeping the number of active prototypes close to the ground-truth class number. Notably, a loose category number greater than the ground truth may harm fitting the class-balanced ImageNet-100 dataset, but could help fit the distribution of the long-tailed Herbarium 19 dataset.
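
The notion of “active prototypes” in Fig. 19 can be made concrete with a small helper. The following is our reading of the figure rather than code from the paper: a prototype counts as active when at least one unlabelled sample is assigned to it.

```python
import numpy as np

def num_active_prototypes(y_pred, k):
    # A prototype is 'active' if at least one sample is assigned to it;
    # k is the (possibly over-estimated) classifier size in use.
    return int((np.bincount(y_pred, minlength=k) > 0).sum())
```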

                            CIFAR100          ImageNet-100           CUB           Stanford Cars      Herbarium 19
Method        Logit Adjust  All  Old  New    All  Old  New     All  Old  New     All  Old  New     All  Old  New
ORCA [6]           ✓       69.0 77.4 52.0   73.5 92.6 63.9    35.3 45.6 30.2    23.5 50.1 10.7    20.9 30.9 15.5
DebiasPL [47]      ✓       60.9 69.8 43.1   43.5 59.1 35.6    38.1 44.2 35.0    31.1 49.6 22.1    30.1 39.1 25.3
UNO+ [17]          ✗       69.5 80.6 47.2   70.3 95.0 57.9    35.1 49.0 28.1    35.5 70.5 18.6    28.3 53.7 14.7
GCD [43]           ✗       73.0 76.2 66.5   74.1 89.8 66.3    51.3 56.6 48.7    39.0 57.6 29.9    35.4 51.0 27.0
SimGCD             ✗       80.1 81.2 77.8   83.0 93.1 77.9    60.3 65.6 57.7    53.8 71.9 45.0    44.0 58.0 36.4

Table 10. Comparison to imbalanced recognition-inspired methods.

B.4. Relationship to Imbalanced Recognition

Our work also shares motivation with the literature on long-tailed/imbalanced recognition [32, 46, 35], in which resolving the imbalance in models’ predictions is also an important issue. Technically, these methods commonly depend on a prior class distribution to adjust the classifier’s output, which is not accessible in GCD, since labels for novel classes are unknown. One could also estimate this distribution online from predictions, but this is inaccurate due to the open-world nature of the task. We note that one baseline compared in the paper, ORCA [6], also shares a key intuition with these works (an adaptive margin). We also re-implement one closely related work that operates on imbalanced semi-supervised learning, DebiasPL [47], aligning its representation learning with GCD, and show a comparison in Tab. 10. DebiasPL surpasses UNO+ on fine-grained classification of novel classes, verifying that it can overcome the prediction imbalance to some extent. It also outperforms ORCA, but still lags behind GCD and ours. We hypothesise that manually altering logits may not be suitable for open-world settings. Instead, a more natural and general solution could be to regularise the prediction statistics and let the model adjust itself via optimisation.
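
A minimal sketch contrasting the two options in PyTorch; the function names are ours, and the exact regulariser used in the main paper may differ in detail:

```python
import torch
import torch.nn.functional as F

def adjust_logits(logits, class_prior, tau=1.0):
    # Logit adjustment as used in long-tailed recognition: requires a
    # known class prior, which GCD lacks for the unlabelled novel classes.
    return logits - tau * torch.log(class_prior)

def mean_prediction_entropy(logits):
    # Regularising prediction statistics instead: the entropy of the
    # batch-averaged prediction. Maximising it discourages collapsed,
    # imbalanced predictions without assuming any known prior.
    p_bar = F.softmax(logits, dim=-1).mean(dim=0)
    return -(p_bar * torch.log(p_bar + 1e-8)).sum()
```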

