[Figure 1 graphic: building blocks ①–⑤ for representation learning and classifier learning (including classification objectives, non-parametric classification, and parametric classification), and the block compositions of GCD, UNO+, and Ours (SimGCD)]
Figure 1. Left: building blocks for representation learning or classifier learning; Right: overall abstraction of current works, where ‘→’
separates different stages of the method. Our work builds on GCD [43], and jointly trains a parametric classifier.
Abstract

Generalized Category Discovery (GCD) aims to discover novel categories in unlabelled datasets using knowledge learned from labelled samples. Previous studies argued that parametric classifiers are prone to overfitting to seen categories, and endorsed using a non-parametric classifier formed with semi-supervised k-means. However, in this study, we investigate the failure of parametric classifiers, verify the effectiveness of previous design choices when high-quality supervision is available, and identify unreliable pseudo-labels as a key problem. We demonstrate that two prediction biases exist: the classifier tends to predict seen classes more often, and produces an imbalanced distribution across seen and novel categories. Based on these findings, we propose a simple yet effective parametric classification method that benefits from entropy regularisation, achieves state-of-the-art performance on multiple GCD benchmarks, and shows strong robustness to unknown class numbers. We hope the investigation and the proposed simple framework can serve as a strong baseline to facilitate future studies in this field. Our code is available at: https://github.com/CVMI-Lab/SimGCD.

* Equal contribution.

1. Introduction

With large-scale labelled datasets, deep learning methods can surpass humans in recognising images [25]. However, it is not always possible to collect large-scale human annotations for training deep learning models. Therefore, there is a rich body of recognition models that focus on learning with large amounts of unlabelled data. Among them, semi-supervised learning (SSL) [33, 5, 38] is regarded as a promising approach, yet it assumes that labelled instances are provided for each of the categories the model needs to classify. Generalized category discovery (GCD) [43] was recently formalised to relax this assumption by allowing the unlabelled data to also contain similar yet distinct categories from the labelled data. The goal of GCD is to learn a model that is able to classify the already-seen categories in the labelled data and, more importantly, jointly discover the new categories in the unlabelled data and classify them correctly. Developing a strong method for this problem could help us better utilise the easily available large-scale unlabelled datasets.

Previous works [43, 22, 17, 6] approach this problem from two perspectives: learning generic feature representations to facilitate the discovery of novel categories, and generating pseudo clusters/labels for unlabelled data to guide the learning of a classifier. The former is often achieved by using self-supervised learning methods [22, 52, 18, 24,

[Figure 2 graphic: ‘Old’ and ‘New’ class ACC of GCD, UNO+, and Ours on CUB, Aircraft, and IN-100]
2. Related Work

…where it is assumed that the unlabelled dataset and the labelled dataset do not have any class overlap; thus, baselines for NCD [22, 52, 57, 56, 17] can be adopted for the GCD problem by extending the classification head to have more outputs [43]. The incremental setting of GCD has also been explored [53, 36]. It is pointed out in [43] that a non-parametric classifier formed using semi-supervised k-means can outperform strong parametric classification baselines from NCD [22, 17] because it can alleviate the overfitting to seen categories in the labelled set. In this paper, we revisit this claim and show that parametric classifiers can reach stronger performance than non-parametric classifiers.

Deep Clustering aims at learning a set of semantic prototypes from unlabelled images with deep neural networks. Considering that no label information is available, the focus is on how to obtain reliable pseudo-labels. While early attempts rely on hard labels produced by k-means [7], there has been a shift towards soft labels produced by optimal transport [2, 8], and more recently sharpened predictions from an exponential-moving-average-updated teacher model [9, 3]. Deep clustering has shown strong potential for unsupervised representation learning [7, 2, 8, 9, 3], unsupervised semantic segmentation [12, 49], semi-supervised learning [4], and novel category discovery [17]. In this work, we study the techniques that make strong parametric classifiers for GCD, drawing inspiration from deep clustering.
3. On the Failure of Parametric Classification

In order to explore the reasons that make previous parametric classifiers fail to recognise ‘New’ classes for generalized category discovery, this section presents preliminary studies to reveal the role of two major components: representation learning (Sec. 3.2) and pseudo-label quality on unseen classes (Sec. 3.3). These have led to conflicting choices in previous works, but why? We show a unified viewpoint (Figs. 3 and 4), and emphasise that taking pseudo-label quality into account is important for selecting the suitable design choice. This then led to our diagnosis of what makes the degenerated pseudo-labels (Sec. 3.4), and motivated the design of our method (Sec. 4).

3.1. Investigation Setting

The goal of GCD is to learn a model to categorise the samples in an unlabelled dataset Du = {(x_i^u, y_i^u)} ⊂ X × Yu using the knowledge from a labelled dataset Dl = {(x_i^l, y_i^l)} ⊂ X × Yl, where Yl is the label space of the labelled samples and Yl ⊂ Yu. We denote the number of categories in Yu as Ku = |Yu|; it is common to assume the number of categories is known a priori [22, 52, 57, 17], or can be estimated using off-the-shelf methods [23, 43].

Representation learning. For representation learning, we follow GCD [43], which applies supervised contrastive learning [27] on labelled samples, and self-supervised contrastive learning [10] on all samples (detailed in Sec. 4.1).

Classifier. We follow UNO [17] to adopt a prototypical classifier. Taking f(x) as the feature vector of an image x extracted from the backbone f, the procedure for producing logits is l = (1/τ)(w/||w||)^⊤(f(x)/||f(x)||). Here τ is the temperature value that scales up the norm of l and facilitates optimisation of the cross-entropy loss [45].
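For concreteness, here is a minimal PyTorch sketch of such a prototypical cosine classifier; the module and argument names are ours for illustration, not taken from the released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Prototypical classifier: logits are scaled cosine similarities
    between an L2-normalised feature and L2-normalised class prototypes."""

    def __init__(self, feat_dim: int, num_classes: int, tau: float = 0.1):
        super().__init__()
        # One learnable prototype (weight vector w) per class.
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.tau = tau

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        w = F.normalize(self.prototypes, dim=-1)  # w / ||w||
        h = F.normalize(feats, dim=-1)            # f(x) / ||f(x)||
        return h @ w.t() / self.tau               # l = (1/tau) * cosine similarity
```

Because both sides are L2-normalised, the raw similarity is bounded in [-1, 1], and 1/τ alone controls the logit scale.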
Training settings. We train with varying supervision qualities. The minimal supervision setting utilises only the labels in Dl, while the oracle supervision setting assumes all samples are labelled (both Dl and Du). Besides, we study two practical settings that adopt pseudo-labels for the unlabelled samples in Du: self-label, which predicts pseudo-labels with the Sinkhorn-Knopp algorithm following [17], and self-distil, which depicts another pseudo-labelling strategy, as in Fig. 7, to be introduced in detail in Sec. 4.2. For all settings, we only employ a cross-entropy loss on the (pseudo-)labelled samples at hand for classification. Note that, unless otherwise stated, this is done on decoupled features, so representation learning is unaffected.
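As a reference for the self-label setting, here is a minimal sketch of Sinkhorn-Knopp-style pseudo-labelling; the iteration count and `eps` are illustrative defaults, not necessarily the values used in [17]:

```python
import torch

@torch.no_grad()
def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 3, eps: float = 0.05) -> torch.Tensor:
    """Turn a batch of logits into soft pseudo-labels whose marginal over
    classes is (approximately) uniform, as in self-labelling methods.

    logits: (batch, num_classes); returns soft assignments of the same shape.
    """
    q = torch.exp(logits / eps).t()  # (num_classes, batch)
    q /= q.sum()
    K, B = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True)  # rows: equal total mass per class
        q /= K
        q /= q.sum(dim=0, keepdim=True)  # columns: one unit of mass per sample
        q /= B
    return (q * B).t()  # (batch, num_classes), each row sums to 1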
3.2. Which Representation to Build Your Classifier?

Motivation. Following the trend of deep clustering that focuses on self-supervised representation learning [8], the previous parametric classification work UNO [17] fed the classifier with representations taken from the projector. In GCD [43], by contrast, significantly stronger performance is achieved with a non-parametric classifier built upon representations taken from the backbone. We revisit this choice as follows.

Setting. Consider f as the feature backbone and g as a multi-layer perceptron (MLP) projection head. Given an input image xi, the representation from the backbone can be written as f(xi), and that from the projector is g(f(xi)).

[Figure 3 graphic: ‘Old’ and ‘New’ ACC on CIFAR100 and CUB with post-backbone vs. post-projector representations under Minimal / Self-label / Self-distil / Oracle supervision quality]
Figure 3. Results with different representations. We build the classifier on post-backbone or post-projector representations, and train with varying supervision quality. Results on the ‘Old’ classes consistently benefit from the post-backbone representations regardless of the supervision quality, while unleashing their potential on the ‘New’ classes requires stronger pseudo-labels.
Result & discussion. As shown in Fig. 3, the post-backbone feature space has a clearly higher upper bound for learning prototypical classifiers than the post-projector feature space. Using a projector in self-supervised learning lets the projector focus on solving pretext tasks and allows the backbone to keep as much information as possible (which facilitates downstream tasks) [13]. But when good classification performance is all you need, our results suggest that the classification objective should build on post-backbone representations directly. The features after the projector may focus more on solving the pretext task and are not necessarily useful for the classification objective. Note that high-quality pseudo-labels are necessary to unleash the post-backbone representations’ potential to recognise novel categories.

3.3. Decoupled or Joint Representation Learning?

Motivation. Previous parametric classification methods, e.g., UNO [17], commonly tune the representations jointly with the classification objective. On the contrary, in the two-stage non-parametric method GCD [43], where the performance on ‘New’ classes is notably higher, classification/clustering is fully decoupled from representation learning, and the representations can be viewed as unaltered by classification. In this part, we study whether the joint learning strategy contributes to previous parametric classifiers’ degraded performance in recognising ‘New’ classes.

Setting. Consider f(x) as the representation fed to the classifier. In decoupled training, as adopted in the previous settings, the classification objective does not supervise representation learning, while in joint training the representations are also optimised by the classification objective.

Result & discussion. The results are illustrated in Fig. 4. When adopting the self-labelling strategy, there is a sharp drop in ‘Old’ class performance on both datasets, while for the ‘New’ classes, it improves by 13 points on CIFAR100 and drops by a small margin on CUB. In contrast, when a stronger pseudo-labelling strategy (self-distillation) or even oracle labels are utilised, we observe consistent gains from joint training. This means that the joint training strategy does not necessarily cause UNO [17]’s low performance on ‘New’ classes; on the contrary, it can even boost ‘New’ class performance by a notable margin. Our overall explanation is that UNO’s framework could not make reliable pseudo-labels, which restricted its ability to benefit from joint training. The joint training strategy is not to blame and is, in fact, helpful. When switching to a more advanced pseudo-labelling paradigm that produces higher-quality pseudo-labels, the help from joint training can be even more significant.

3.4. The Devil Is in the Biased Predictions

Motivation. In Secs. 3.2 and 3.3, we verified the effectiveness of two design choices when high-quality pseudo-labels are available, and concluded that the key to previous works’ degraded performance is unreliable pseudo-labels. We then further diagnose the statistics of their predictions as follows.

Setting. We categorise the model’s errors into four types: “True Old”, “False New”, “False Old”, and “True New”, according to the relationship between the predicted and ground-truth class. For example, “True New” refers to predicting a ‘New’ class sample as another ‘New’ class, while “False Old” indicates predicting a ‘New’ class sample as some ‘Old’ class.

[Figure 5 graphic: 2×2 GT-class vs. predicted-class (‘Old’/‘New’) error matrices for UNO+ and GCD on CIFAR100 and CUB]
Figure 5. Prediction bias between ‘Old’/‘New’ classes. We simplify the results to binary classification and categorise the errors in ‘All’ ACC into four types. Both works, especially UNO+, are prone to make “False Old” predictions, and many samples corresponding to ‘New’ classes are misclassified as an ‘Old’ class.

[Figure 6 graphic: per-class prediction distributions (#instances per class) of both works across ‘Old’/‘New’ classes on CIFAR100 and CUB]
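A sketch of how these four error types can be counted, assuming the predictions have already been aligned to the ground-truth label space (e.g., via Hungarian matching) and that classes [0, num_old) are ‘Old’; the function name is ours:

```python
import numpy as np

def old_new_error_matrix(y_true, y_pred, num_old: int):
    """Collapse per-class predictions into a 2x2 'Old'/'New' error matrix.
    Only mistakes are counted, so e.g. the (Old, Old) cell is 'True Old':
    an 'Old' sample predicted as a *different* 'Old' class.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    errors = y_true != y_pred
    true_is_old = y_true < num_old
    pred_is_old = y_pred < num_old
    mat = np.zeros((2, 2), dtype=int)  # rows: GT Old/New; cols: predicted Old/New
    for gt_old in (True, False):
        for pr_old in (True, False):
            mask = errors & (true_is_old == gt_old) & (pred_is_old == pr_old)
            mat[int(not gt_old), int(not pr_old)] = mask.sum()
    return mat  # [[True Old, False New], [False Old, True New]]
```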
Result & discussion. We observe two types of prediction bias. In Fig. 5, both works, especially UNO+ [17], are prone to make “False Old” predictions; in other words, their predictions are biased towards ‘Old’ classes. Besides, the “True New” errors are also notable, indicating that misclassification within ‘New’ classes is also common. We then depict the predictions’ overall distribution across ‘Old’/‘New’ classes in Fig. 6, and both works show highly biased predictions. This double-bias phenomenon then motivated the prediction entropy regularisation design in our method.

4. Method

In this section, we present the whole picture of this simple yet effective method (see Fig. 7), a one-stage framework that builds on GCD [43] and jointly trains a parametric classifier with self-distillation and entropy regularisation. In Sec. 5.3, we discuss the step-by-step changes that lead a simple baseline to our solution.

[Figure 7 graphic: the framework, combining a cross-entropy classification loss on labelled samples, a self-distillation loss with stop-gradient (sg) and shared weights on unlabelled samples, a mean-entropy maximisation regulariser, and GCD-style representation learning (SupCL + SelfSupCL)]
Figure 7. The overall framework of our method. For unlabelled samples, the pseudo-labels come from the sharpened predictions of another randomly augmented view. For labelled samples, we simply adopt the ground truth. Details of representation learning and the mean-entropy-maximisation regulariser are omitted for simplicity; please refer to the text. (Also see Fig. 1 for a high-level comparison with previous works.)

4.1. Representation Learning

Our representation learning objective follows GCD [43], which is supervised contrastive learning [27] on labelled samples, and self-supervised contrastive learning [10] on all samples. Formally, given two views (random augmentations) x_i and x'_i of the same image in a mini-batch B, the self-supervised contrastive loss is written as:

\mathcal{L}^u_\text{rep} = \frac{1}{|B|} \sum_{i \in B} -\log \frac{\exp\left(\boldsymbol{z}_i^\top \boldsymbol{z}_i^{\prime} / \tau_u\right)}{\sum_{n \neq i} \exp\left(\boldsymbol{z}_i^\top \boldsymbol{z}_n^{\prime} / \tau_u\right)} \,, (1)
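A compact PyTorch rendering of Eq. (1); note that, as in many common implementations, the denominator below also includes the positive pair, a standard simplification of the n ≠ i sum:

```python
import torch
import torch.nn.functional as F

def info_nce(z: torch.Tensor, z_prime: torch.Tensor, tau_u: float = 0.07) -> torch.Tensor:
    """Self-supervised contrastive loss over two views, in the spirit of Eq. (1).
    z, z_prime: (batch, dim) projected features of two augmented views, assumed
    L2-normalised; for each i, z'_i is the positive and the other z'_n are negatives.
    """
    sim = z @ z_prime.t() / tau_u          # (batch, batch) similarity matrix
    targets = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(sim, targets)   # -log softmax of row i at column i
```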
The soft label p_i^(k) for class k is then a softmax over the cosine similarities between the feature h_i and the class prototypes C, scaled by 1/τ_s:

\boldsymbol{p}_{i}^{(k)}=\frac{\exp\left(\frac{1}{\tau_s}(\boldsymbol{h}_i/||\boldsymbol{h}_i||_2)^\top (\boldsymbol{c}_k / ||\boldsymbol{c}_k||_2)\right)}{\sum_{k^\prime} \exp\left(\frac{1}{\tau_s}(\boldsymbol{h}_i / ||\boldsymbol{h}_i||_2)^\top (\boldsymbol{c}_{k^\prime} / ||\boldsymbol{c}_{k^\prime}||_2)\right)} \,, (3)
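To make the self-distillation and entropy-regularisation design concrete, here is a sketch of the unlabelled-data classification loss as we understand it from Fig. 7 and the text. Variable names and the ε weighting are illustrative, and the full objective also includes supervised cross-entropy on labelled samples plus the representation losses of Sec. 4.1:

```python
import torch
import torch.nn.functional as F

def self_distil_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04, eps=1.0):
    """Self-distillation with a mean-entropy-maximisation regulariser.
    student_logits / teacher_logits: (batch, K) unscaled cosine logits of two
    augmented views of the same images.
    """
    p_student = F.log_softmax(student_logits / tau_s, dim=-1)
    with torch.no_grad():  # stop-gradient: the sharper view acts as the target
        q_teacher = F.softmax(teacher_logits / tau_t, dim=-1)
    distill = -(q_teacher * p_student).sum(dim=-1).mean()  # cross-entropy H(q, p)

    # Mean-entropy maximisation: push the *average* prediction towards uniform,
    # counteracting the 'Old'-class and within-'New' prediction biases.
    mean_p = F.softmax(student_logits / tau_s, dim=-1).mean(dim=0)
    entropy = -(mean_p * torch.log(mean_p + 1e-8)).sum()
    return distill - eps * entropy
```

Note that regularising the entropy of the batch-averaged prediction, rather than each sample's prediction, discourages collapse onto few prototypes without forcing individual predictions to be uncertain.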
…mentation strategies and soft/hard pseudo-labels, our approach jointly performs category discovery and self-training-style learning, while the SSL methods purely focus on bootstrapping from unlabelled data and do not discover novel categories. Besides, entropy regularisation is also explored in deep clustering to avoid trivial solutions [3]. In contrast, our method shows its help in overcoming the prediction bias between and within seen/novel classes (Figs. 9 and 10), and in enforcing robustness to unknown numbers of categories (Fig. 11).

5. Experiments

5.1. Experimental Setup

Datasets. We validate the effectiveness of our method on the generic image recognition benchmark (including CIFAR10/100 [29] and ImageNet-100 [14]), the recently proposed Semantic Shift Benchmark [44] (SSB, including CUB [48], Stanford Cars [28], and FGVC-Aircraft [31]), and the harder Herbarium 19 [40] and ImageNet-1K [14]. For each dataset, we follow [43] to sample a subset of all classes as the labelled (‘Old’) classes Yl; 50% of the images from these labelled classes are used to construct Dl, and the remaining images are regarded as the unlabelled data Du, as sketched after Tab. 1. See Tab. 1 for statistics of the datasets we evaluate on.

                     Labelled          Unlabelled
Dataset              Balance  #Image  #Class  #Image  #Class
CIFAR10 [29]           ✓      12.5K      5    37.5K     10
CIFAR100 [29]          ✓      20.0K     80    30.0K    100
ImageNet-100 [14]      ✓      31.9K     50    95.3K    100
CUB [48]               ✓       1.5K    100     4.5K    200
Stanford Cars [28]     ✓       2.0K     98     6.1K    196
FGVC-Aircraft [31]     ✓       1.7K     50     5.0K    100
Herbarium 19 [40]      ✗       8.9K    341    25.4K    683
ImageNet-1K [14]       ✓       321K    500     960K   1000

Table 1. Statistics of the datasets we evaluate on.
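A sketch of this split construction; function and variable names are ours for illustration:

```python
import numpy as np

def make_gcd_split(labels: np.ndarray, old_classes, seed: int = 0):
    """For each 'Old' class, 50% of its images go to the labelled set D_l;
    everything else (the other 50% plus all 'New'-class images) forms D_u."""
    rng = np.random.default_rng(seed)
    labelled_idx = []
    for c in old_classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        labelled_idx.extend(idx[: len(idx) // 2])
    labelled = np.array(sorted(labelled_idx))
    unlabelled = np.setdiff1d(np.arange(len(labels)), labelled)
    return labelled, unlabelled
```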
Evaluation protocol. We evaluate the model performance with clustering accuracy (ACC) following standard practice [43]. During evaluation, given the ground truth y* and the predicted labels ŷ, the ACC is calculated as ACC = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}(y^*_i = p(\hat{y}_i)), where M = |Du| and p is the optimal permutation that matches the predicted cluster assignments to the ground-truth class labels.
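A standard implementation of this protocol finds the optimal permutation p with the Hungarian algorithm; a sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_acc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Clustering accuracy: match predicted cluster ids to ground-truth
    classes so that agreement is maximised, then report mean accuracy."""
    D = max(y_pred.max(), y_true.max()) + 1
    cost = np.zeros((D, D), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1  # co-occurrence counts (rows: predicted, cols: true)
    row, col = linear_sum_assignment(cost.max() - cost)  # maximise matches
    mapping = dict(zip(row, col))
    return float(np.mean([mapping[p] == t for t, p in zip(y_true, y_pred)]))
```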
Implementation details. Following GCD [43], we train all methods with a ViT-B/16 backbone [15] pre-trained with DINO [9]. We use the output of the [CLS] token, with a dimension of 768, as the feature for an image, and only fine-tune the last block of the backbone. We train with a batch size of 128 for 200 epochs with an initial learning rate of 0.1, decayed with a cosine schedule on each dataset. Aligning with [43], the balancing factor λ is set to 0.35, and the temperature values τ_u and τ_c to 0.07 and 1.0, respectively. For the classification objective, we set τ_s to 0.1, while τ_t is initialised to 0.07 and then warmed up to 0.04 with a cosine schedule in the starting 30 epochs. All experiments are done with an NVIDIA GeForce RTX 3090 GPU.
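The τ_t warmup can be implemented as a simple per-epoch cosine interpolation; a sketch under the stated values:

```python
import math

def teacher_temperature(epoch: int, warmup_epochs: int = 30,
                        tau_start: float = 0.07, tau_end: float = 0.04) -> float:
    """Cosine schedule for the teacher temperature: start soft (0.07) and
    sharpen to 0.04 over the first `warmup_epochs` epochs."""
    if epoch >= warmup_epochs:
        return tau_end
    progress = epoch / warmup_epochs
    return tau_end + 0.5 * (tau_start - tau_end) * (1 + math.cos(math.pi * progress))
```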
5.2. Comparison With the State of the Arts

We compare with state-of-the-art methods in generalized category discovery (ORCA [6] and GCD [43]), strong baselines derived from novel category discovery (RS+ [22] and UNO+ [17]), and k-means [30] on DINO [9] features. On both the fine-grained SSB benchmark (Tab. 2) and the generic image recognition datasets (Tab. 3), our method achieves notable improvements in recognising ‘New’ classes (the instances in Du that belong to classes in Yu\Yl), outperforming the previous state of the art by around 10%. The results on old classes are also competitive with the best-performing baselines. Given that the ability to discover ‘New’ classes is the more desirable one, the results are quite encouraging.

In Tab. 4, we also report results on Herbarium 19 [40], a naturally long-tailed fine-grained dataset that is closer to the real-world application of generalized category discovery, and on ImageNet-1K [14], a large-scale generic classification dataset. Still, our method shows consistent improvements in all metrics.

                  CUB               Stanford Cars      FGVC-Aircraft
Methods       All  Old  New      All   Old   New     All  Old  New
k-means [30]  34.3 38.9 32.1    12.8  10.6  13.8    16.0 14.4 16.8
RS+ [22]      33.3 51.6 24.2    28.3  61.8  12.1    26.9 36.4 22.2
UNO+ [17]     35.1 49.0 28.1    35.5  70.5  18.6    40.3 56.4 32.2
ORCA [6]      35.3 45.6 30.2    23.5  50.1  10.7    22.0 31.8 17.1
GCD [43]      51.3 56.6 48.7    39.0  57.6  29.9    45.0 41.1 46.9
SimGCD        60.3 65.6 57.7    53.8  71.9  45.0    54.2 59.1 51.8
∆             +9.0 +9.0 +9.0   +14.8 +14.3 +15.1    +9.2 +18.0 +4.9

Table 2. Results on the Semantic Shift Benchmark [44].

                  CIFAR10           CIFAR100          ImageNet-100
Methods       All  Old  New      All  Old  New      All  Old  New
k-means [30]  83.6 85.7 82.5    52.0 52.2 50.8     72.7 75.5 71.3
RS+ [22]      46.8 19.2 60.5    58.2 77.6 19.3     37.1 61.6 24.8
UNO+ [17]     68.6 98.3 53.8    69.5 80.6 47.2     70.3 95.0 57.9
ORCA [6]      81.8 86.2 79.6    69.0 77.4 52.0     73.5 92.6 63.9
GCD [43]      91.5 97.9 88.2    73.0 76.2 66.5     74.1 89.8 66.3
SimGCD        97.1 95.1 98.1    80.1 81.2 77.8     83.0 93.1 77.9
∆             +5.6 -2.8 +9.9    +7.1 +5.0 +11.3    +8.9 +3.3 +11.6

Table 3. Results on generic image recognition datasets.

                  Herbarium 19      ImageNet-1K
Methods       All  Old  New      All  Old  New
k-means [30]  13.0 12.2 13.4      -    -    -
RS+ [22]      27.9 55.8 12.8      -    -    -
UNO+ [17]     28.3 53.7 14.7      -    -    -
ORCA [6]      20.9 30.9 15.5      -    -    -
GCD [43]      35.4 51.0 27.0     52.5 72.5 42.2
SimGCD        44.0 58.0 36.4     57.1 77.3 46.9
∆             +8.6 +7.0 +9.4     +4.6 +4.8 +4.7

Table 4. Results on more challenging datasets.
[Figure 8 graphic: ‘All’/‘Old’/‘New’ ACC on CIFAR100, ImageNet-100, CUB, Stanford Cars, and Herbarium 19 for each step from GCD to SimGCD (GCD → +SL → +BR → +SD → +TW → +JT)]
Figure 8. Step-by-step differences from GCD [43] to SimGCD. (SL: self-labelling, BR: post-backbone representation, SD: self-distillation, TW: teacher temperature warmup, JT: joint training)
Methods     CF100   CUB    Herb19   IN-100   IN-1K
GCD [43]    7.5m    9m     2.5h     36m      7.7h
SimGCD      1m      18s    3.5m     9.5m     0.6h

Table 5. Inference time over the unlabelled split.

In Tab. 5, we compare the inference time with GCD [43], one iconic non-parametric classification method. Let the numbers of all samples and unlabelled samples be N and Nu, the number of classes K, the feature dimension d, and the number of k-means iterations t; the time complexity of GCD is O(N²d + NKdt) (including the k-means++ initialisation), while our method only requires a nearest-neighbour prototype search for each instance, with time complexity O(NuKd). All methods adopt GPU implementations.
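This inference step reduces to a single matrix multiplication; a minimal sketch:

```python
import torch

@torch.no_grad()
def assign_to_prototypes(feats: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """O(Nu * K * d) inference: each unlabelled feature is assigned to its
    most similar class prototype, with no k-means iterations needed.
    feats: (Nu, d); prototypes: (K, d)."""
    feats = torch.nn.functional.normalize(feats, dim=-1)
    protos = torch.nn.functional.normalize(prototypes, dim=-1)
    return (feats @ protos.t()).argmax(dim=-1)  # predicted class index per sample
```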
5.3. Ablation Study

In Fig. 8, we ablate the key components that bring the baseline method step-by-step to a new SOTA.

We keep GCD’s representation learning objectives unchanged, and first impose the UNO [17]-style self-labelling classification objective (+SL), thus transforming it into a parametric classifier. The classifier is built on the projector and detached from representation learning. Results on ‘Old’ classes generally improve, while results on ‘New’ classes see a sharp drop. This is expected due to UNO’s strong bias toward ‘Old’ classes (Fig. 5).

Improving the representations. As suggested in Sec. 3.2, […] be helpful for fine-grained classification datasets, while for generic classification datasets, which are similar to the pre-training data (ImageNet), the unreliable pseudo-labels are not a problem, so lowering the confidence does not help. For simplicity, we keep the training strategy consistent.

Jointly training the representation. The previous settings adopt a decoupled training strategy, for representations consistent with GCD [43] and a fair comparison. Finally, as confirmed in Sec. 3.3, we jointly supervise the representation with the classification objective (+JT). This results in a consistent improvement on ‘New’ classes for all datasets. Changes on ‘Old’ classes are mostly neutral or positive, with a notable drop on CIFAR100. Our intuition is that the original representations are already good enough for ‘Old’ classes in this dataset, and some incorrect pseudo-labels lead to slight degradation in this case.

[Figure 9 graphic: absolute ‘All’ ACC error by error type (True Old, False New, False Old, True New) on CIFAR100 and CUB]
Figure 9. Effect of entropy regularisation on four types of classification errors. Appropriate entropy regularisation helps overcome the bias between ‘Old’/‘New’ classes (see “False New” and “False Old”; lower is better).

[Figure 10 graphic: per-class prediction distributions on the ‘New’ classes of CIFAR100 and CUB for ε = 0 vs. ε = 4]
[Figure 11 graphic: ‘Old’ and ‘New’ ACC on CIFAR100 (K=100), ImageNet-100 (K=100), CUB (K=200), Stanford Cars (K=196), and Herbarium 19 (K=683) when training with different assumed numbers of categories]
Figure 11. Results with different numbers of categories. Stronger entropy regularisation effectively enforces the model’s robustness to unknown numbers of categories, but over-regularisation may limit the ability to recognise ‘New’ classes under ground-truth class numbers.

[Figure 12 graphic: per-class prediction distributions (#instances per sorted class index) with K = 200 and K = 400 against the ground truth]
Figure 12. Per-class prediction distributions with different numbers of categories. Our method effectively identifies the criterion for ‘New’ classes, thus keeping the number of active prototypes close to the ground-truth class number.

…ing the feature leads to less ambiguity, larger margins, and more compact clusters. Concerning why this is not as helpful for CUB: we hypothesise that one important factor lies in how transferable the features learned on ‘Old’ classes are to ‘New’ classes. While it may be easy for a cat classifier to be adapted to dogs, things can be different for fine-grained bird recognition. Besides, the small scale of CUB, which contains only 6k images while holding a large class split (200), might also make it hard to learn transferable features.
[Figure 14 graphic: t-SNE embeddings and prototypes of ten CIFAR100 classes (beaver, bed, bridge, bear, butterfly, table, tank, tulip, sweet_pepper, wardrobe), comparing decoupled training (same representation as GCD) with joint training (representation tuned with the classification objective)]
Figure 14. t-SNE [42] visualisation of the representations of 10 classes randomly sampled from CIFAR100 [29]. Jointly supervising representation learning with a classification objective helps disambiguate (e.g., bed & table) and forms more compact clusters.
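Such a visualisation can be reproduced with off-the-shelf t-SNE; a minimal sketch with scikit-learn, where the perplexity and plotting settings are our assumptions rather than the paper's:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(feats: np.ndarray, labels: np.ndarray, out_path: str = "tsne.png"):
    """Project (N, d) features to 2-D with t-SNE and colour points by class."""
    emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(feats)
    plt.figure(figsize=(5, 5))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap="tab10")
    plt.axis("off")
    plt.savefig(out_path, dpi=200, bbox_inches="tight")
```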
[Figure 15 graphic: ‘All’/‘Old’/‘New’ ACC versus training epoch on CIFAR100 and CUB]
Figure 15. Performance evolution throughout the model learning process. We observe a trade-off between the performance on ‘Old’ and ‘New’ categories, which is common across datasets.

Trade-off between ‘Old’ and ‘New’ categories. We plot the performance evolution throughout the model learning process in Fig. 15. It can be observed that the performance on the ‘Old’ categories first climbs to its highest point at the early stage of training and then slowly degrades as the performance on the ‘New’ categories improves. We believe this demonstrates an important aspect of the design of models for the GCD problem: the performance on the ‘Old’ categories may be at odds with the performance on the ‘New’ categories, and how to achieve a better trade-off between the two could be an interesting investigation for future works.

6. Limitations and Potential Future Works

Representation learning. This paper mainly targets improving the classification ability for generalized category discovery. The representation learning, however, follows the prior work GCD [43]. It can be expected that the quality of representation learning will improve, for instance, through more advanced geometric and photometric data augmentations [19], or even multiple local crops [8]. Further, can the design of data augmentations be better aligned with the classification criteria of the target data? As another example, using a large batch size has been shown to be critical to the performance of contrastive learning-based frameworks [10]; however, the batch size adopted by GCD [43] is only 128, which might limit the quality of the learned representations. Moreover, is the supervised-contrastive plus self-supervised-contrastive learning paradigm the ultimate answer for forming the feature manifold? We believe that advances in representation learning can lead to further gains.

Alignment to human-defined categories. This paper follows the common practice of previous works, where human labels on seen categories implicitly define the metric for unseen ones, which can be viewed as an effort to align algorithm-discovered categories with human-defined ones. However, labels on seen categories may not be good guidance when there is a gap between the seen categories and the novel categories we want to discover; e.g., how can we use the labelled images in ImageNet to discover novel categories in CUB? As another example, when we use a very big class vocabulary (e.g., the full ImageNet-22K [14]), categories could overlap with each other and be of different granularities. Further, assigning text names to the discovered categories still requires a matching process; what if we further utilise the relationships between class names and directly predict the novel categories in the text space? We believe the alignment between algorithm-discovered categories and human-defined categories is of high research value for future works.

Ethical considerations. Current methods commonly suffer in low-data or long-tailed scenarios. Depending on the data and classification criteria of specific tasks, discrimination against minority categories or instances is possible.

7. Conclusion

This study investigates the reasons behind the failure of previous parametric classifiers in recognising novel classes in GCD, and uncovers that unreliable pseudo-labels, which exhibit significant biases, are the crucial factor. We propose a simple yet effective parametric classification method that addresses these issues and achieves state-of-the-art performance on multiple GCD benchmarks. Our findings provide insights into the design of robust classifiers for discovering novel categories, and we hope our proposed framework will serve as a strong baseline that facilitates future studies in this field and contributes to the development of more accurate and reliable methods for category discovery.

Acknowledgements

This work has been supported by the Hong Kong Research Grant Council - Early Career Scheme (Grant No. 27209621), General Research Fund Scheme (Grant No. 17202422), and RGC Matching Fund Scheme (RMGS). Part of the described research work was conducted in the JC STEM Lab of Robotics for Soft Materials funded by The Hong Kong Jockey Club Charities Trust. The authors acknowledge SmartMore and MEGVII for partial computing support, and Zhisheng Zhong for professional suggestions.
References

[1] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, 2007.
[2] Yuki M. Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.
[3] Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Mike Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. In ECCV, 2022.
[4] Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Armand Joulin, Nicolas Ballas, and Michael Rabbat. Semi-supervised learning of visual features by non-parametrically predicting view assignments with support samples. In ICCV, 2021.
[5] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019.
[6] Kaidi Cao, Maria Brbić, and Jure Leskovec. Open-world semi-supervised learning. In ICLR, 2022.
[7] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
[8] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.
[9] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
[10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[11] Yanbei Chen, Xiatian Zhu, Wei Li, and Shaogang Gong. Semi-supervised learning under class distribution mismatch. In AAAI, 2020.
[12] Jang Hyun Cho, Utkarsh Mall, Kavita Bala, and Bharath Hariharan. PiCIE: Unsupervised semantic segmentation using invariance and equivariance in clustering. In CVPR, 2021.
[13] Quan Cui, Bingchen Zhao, Zhao-Min Chen, Borui Zhao, Renjie Song, Jiajun Liang, Boyan Zhou, and Osamu Yoshie. Discriminability-transferability trade-off: An information-theoretic perspective. In ECCV, 2022.
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[16] Yixin Fei, Zhongkai Zhao, Siwei Yang, and Bingchen Zhao. XCon: Learning with experts for fine-grained category discovery. In BMVC, 2022.
[17] Enrico Fini, Enver Sangineto, Stéphane Lathuilière, Zhun Zhong, Moin Nabi, and Elisa Ricci. A unified objective for novel class discovery. In ICCV, 2021.
[18] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
[19] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: a new approach to self-supervised learning. In NeurIPS, 2020.
[20] Lan-Zhe Guo, Zhen-Yu Zhang, Yuan Jiang, Yu-Feng Li, and Zhi-Hua Zhou. Safe deep semi-supervised learning for unseen-class unlabeled data. In ICML, 2020.
[21] Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, and Andrew Zisserman. Automatically discovering and learning new visual categories with ranking statistics. In ICLR, 2020.
[22] Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, and Andrew Zisserman. AutoNovel: Automatically discovering and learning novel visual categories. IEEE TPAMI, 2021.
[23] Kai Han, Andrea Vedaldi, and Andrew Zisserman. Learning to discover novel visual categories via deep transfer clustering. In ICCV, 2019.
[24] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[26] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 2019.
[27] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In NeurIPS, 2020.
[28] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In ICCV Workshops, 2013.
[29] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[30] James MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[31] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
[32] Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. Long-tail learning via logit adjustment. In ICLR, 2021.
[33] Avital Oliver, Augustus Odena, Colin Raffel, Ekin D. Cubuk, and Ian J. Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In NeurIPS, 2018.
[34] Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Semi-supervised learning with scarce annotations. In CVPR Workshops, 2020.
[35] Jiawei Ren, Cunjun Yu, Xiao Ma, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Balanced meta-softmax for long-tailed visual recognition. In NeurIPS, 2020.
[36] Subhankar Roy, Mingxuan Liu, Zhun Zhong, Nicu Sebe, and Elisa Ricci. Class-incremental novel class discovery. In ECCV, 2022.
[37] Kuniaki Saito, Donghyun Kim, and Kate Saenko. OpenMatch: Open-set semi-supervised learning with open-set consistency regularization. In NeurIPS, 2021.
[38] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. FixMatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020.
[39] Yiyou Sun and Yixuan Li. OpenCon: Open-world contrastive learning. TMLR, 2023.
[40] Kiat Chuan Tan, Yulong Liu, Barbara Ambrose, Melissa Tulig, and Serge Belongie. The Herbarium Challenge 2019 dataset. arXiv preprint arXiv:1906.05372, 2019.
[41] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, 2017.
[42] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 2008.
[43] Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Generalized category discovery. In CVPR, 2022.
[44] Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Open-set recognition: A good closed-set classifier is all you need? In ICLR, 2022.
[45] Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. NormFace: L2 hypersphere embedding for face verification. In ACM MM, 2017.
[46] Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu, and Stella X. Yu. Long-tailed recognition by routing diverse distribution-aware experts. In ICLR, 2021.
[47] Xudong Wang, Zhirong Wu, Long Lian, and Stella X. Yu. Debiased learning from naturally imbalanced pseudo-labels. In CVPR, 2022.
[48] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-201, Caltech, 2010.
[49] Xin Wen, Bingchen Zhao, Anlin Zheng, Xiangyu Zhang, and Xiaojuan Qi. Self-supervised visual representation learning with semantic grouping. In NeurIPS, 2022.
[50] Qing Yu, Daiki Ikami, Go Irie, and Kiyoharu Aizawa. Multi-task curriculum framework for open-set semi-supervised learning. In ECCV, 2020.
[51] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4L: Self-supervised semi-supervised learning. In ICCV, 2019.
[52] Bingchen Zhao and Kai Han. Novel visual category discovery with dual ranking statistics and mutual knowledge distillation. In NeurIPS, 2021.
[53] Bingchen Zhao and Oisin Mac Aodha. Incremental generalized category discovery. In ICCV, 2023.
[54] Bingchen Zhao and Xin Wen. Distilling visual priors from self-supervised learning. In ECCV Workshops, 2020.
[55] Bingchen Zhao, Xin Wen, and Kai Han. Learning semi-supervised Gaussian mixture models for generalized category discovery. In ICCV, 2023.
[56] Zhun Zhong, Enrico Fini, Subhankar Roy, Zhiming Luo, Elisa Ricci, and Nicu Sebe. Neighborhood contrastive learning for novel class discovery. In CVPR, 2021.
[57] Zhun Zhong, Linchao Zhu, Zhiming Luo, Shaozi Li, Yi Yang, and Nicu Sebe. OpenMix: Reviving known knowledge for discovering novel visual categories in an open world. In CVPR, 2021.
Parametric Classification for Generalized Category Discovery: A Baseline Study
Supplementary Material
…at the boundaries between the ‘Old’ and ‘New’ classes, and add all elements in each sub-matrix together, thus obtaining the final error matrix standing for the four kinds of prediction errors. Such a way of classifying errors helps distinguish the prediction bias between and within seen and novel categories, and thus facilitates the design of new solutions. Note that the diagonal elements, e.g., ‘True Old’ predictions, do not stand for correct predictions, but for cases of incorrectly predicting samples of one specific ‘Old’ class as another, wrong ‘Old’ class.

B. Extended Experiments And Discussions

B.1. Main Results

We present the full results of SimGCD in the main paper with error bars in Tab. 6. The results are obtained from three independent runs to account for randomness.

Dataset              All        Old        New
CIFAR10 [29]         97.1±0.0   95.1±0.1   98.1±0.1
CIFAR100 [29]        80.1±0.9   81.2±0.4   77.8±2.0
ImageNet-100 [14]    83.0±1.2   93.1±0.2   77.9±1.9
ImageNet-1K [14]     57.1±0.1   77.3±0.1   46.9±0.2
CUB [48]             60.3±0.1   65.6±0.9   57.7±0.4
Stanford Cars [28]   53.8±2.2   71.9±1.7   45.0±2.4
FGVC-Aircraft [31]   54.2±1.9   59.1±1.2   51.8±2.3
Herbarium 19 [40]    44.0±0.4   58.0±0.4   36.4±0.8

Table 6. Complete results of SimGCD over three independent runs.

B.2. Unknown Category Number

In the main text, we showed that the performance of SimGCD is robust to a wide range of estimated unknown category numbers. In this section, we report the results with the number of categories estimated using an off-the-shelf method [43] (Tab. 7), or with a roughly estimated, relatively big number (two times the ground-truth K), and compare with the baseline method GCD [43].

          CIFAR100   ImageNet-100   CUB   SCars   Herb19
GT K        100          100        200    196     683
Est. K      100          109        231    230     520

Table 7. Number of categories K estimated using [43].

The results on CIFAR100 [29], ImageNet-100 [14], CUB [48], and Stanford Cars [28] are available in Tabs. 8 and 9. Our method shows consistent improvements on four representative datasets when K is unknown, whether with the category number estimated by a specialised algorithm (w/ Est.) or simply with a loose estimate that is two times the ground truth (w/ 2K; other values are also applicable, since our method is robust to a wide range of estimates). This property could ease the deployment of parametric classifiers for GCD in real-world scenarios.

                              CIFAR100            ImageNet-100
Methods      Known K      All   Old   New      All   Old   New
GCD [43]     ✓            73.0  76.2  66.5     74.1  89.8  66.3
SimGCD       ✓            80.1  81.2  77.8     83.0  93.1  77.9
GCD [43]     ✗ (w/ Est.)  73.0  76.2  66.5     72.7  91.8  63.8
SimGCD       ✗ (w/ Est.)  80.1  81.2  77.8     81.7  91.2  76.8
SimGCD       ✗ (w/ 2K)    77.7  79.5  74.0     80.9  93.4  74.8

Table 8. Results on generic image recognition datasets.

                              CUB                 Stanford Cars
Methods      Known K      All   Old   New      All   Old   New
GCD [43]     ✓            51.3  56.6  48.7     39.0  57.6  29.9
SimGCD       ✓            60.3  65.6  57.7     53.8  71.9  45.0
GCD [43]     ✗ (w/ Est.)  47.1  55.1  44.8     35.0  56.0  24.8
SimGCD       ✗ (w/ Est.)  61.5  66.4  59.1     49.1  65.1  41.3
SimGCD       ✗ (w/ 2K)    63.6  68.9  61.1     48.2  64.6  40.2

Table 9. Results on the Semantic Shift Benchmark [44].

B.3. Extended Analyses

In supplement to the main paper, we present a more complete version of the analytical experiments.

In Fig. 16, we show the error analysis results of SimGCD over five representative datasets that cover coarse-grained, fine-grained, and long-tailed classification tasks. Overall, it shows that the entropy regulariser mainly helps in overcoming two types of errors: misclassification between ‘Old’/‘New’ categories, and misclassification within ‘New’ categories.

[Figure 16 graphic: absolute ‘All’ ACC error by error type on CIFAR100, ImageNet-100, CUB, Stanford Cars, and Herbarium 19 for ε = 0 vs. ε = 2]
[Figure 17 graphic: per-class prediction distributions (#instances per sorted class index) on the ‘New’ and ‘Old’ splits of CIFAR100, ImageNet-100, CUB, Stanford Cars, and Herbarium 19 for ε = 0, 1, 2, 4 and the ground truth]
Figure 17. Complete per-class prediction distribution results of SimGCD on five representative datasets. Proper entropy regularisation helps overcome the prediction bias in both ‘Old’ classes and ‘New’ classes, and fits the ground-truth distribution. The conclusion is consistent across generic classification datasets, fine-grained classification datasets, and naturally long-tailed datasets.
One exception is the long-tailed Herbarium 19 dataset, on which the models’ “False Old” errors also increase; our intuition is that the long-tailed distribution adds to the difficulty of discriminating between ‘Old’ and ‘New’ categories. Still, the gain in distinguishing between novel categories is consistent, and we provide a further analysis via per-class prediction distributions in the next paragraph.

In Fig. 17, we show the complete per-class prediction results of SimGCD to further analyse the entropy regulariser’s effect in overcoming the classification errors within ‘Old’ and ‘New’ classes. The results consistently verify its help in alleviating the prediction bias within ‘Old’ and ‘New’ classes, and in better fitting the ground-truth class distribution.

In Fig. 18, we present a closer look at ImageNet-100 and Herbarium 19. The entropy regularisation term is formulated to push the model’s predictions closer to the uniform distribution. But interestingly, we empirically found that it can make the model’s predictions more biased on the class-balanced ImageNet-100 dataset when the regularisation is too strong. And when the dataset itself is long-tailed (Herbarium 19), it can also help fit the ground-truth distribution. We also note that the self-labelling strategy adopted by UNO [17] forces the predictions in a batch to be strictly uniform, which may account for its inferior performance.

[Figure 18 graphic: per-class prediction distributions on the ‘New’ and ‘Old’ splits of ImageNet-100 and Herbarium 19 for ε = 1, 2, 4 and the ground truth]
Figure 18. A closer look at the per-class distributions. Notably, although the entropy regularisation term is formulated to approach the uniform distribution, it can make the model’s predictions more biased on the class-balanced ImageNet-100 dataset when the regularisation is too strong. Interestingly, it can also help fit the distribution of the long-tailed Herbarium 19 dataset.

In Fig. 19, we also show the per-class prediction distributions using different category numbers. The results on the class-balanced ImageNet-100 are consistent with the results on CIFAR100 and CUB in the main paper: using a loose category number greater than the ground truth may harm fitting the ground-truth class distribution, yet the model still manages to find the ground-truth category number. Interestingly, we also find that for the long-tailed Herbarium 19 dataset, using a greater category number can in fact help fit the ground-truth distribution.

[Figure 19 graphic: per-class prediction distributions on ImageNet-100 (K = 100 vs. K = 200) and Herbarium 19 (K = 683 vs. K = 1366) against the ground truth]
Figure 19. Per-class prediction distributions using different category numbers on ImageNet-100 and Herbarium 19. Our method effectively identifies the criterion for ‘New’ classes, thus keeping the number of active prototypes close to the ground-truth class number. Notably, a loose category number greater than the ground truth may harm fitting the class-balanced ImageNet-100 dataset, but could help fit the distribution of the long-tailed Herbarium 19 dataset.
                            CIFAR100         ImageNet-100        CUB             Stanford Cars    Herbarium 19
Method         Logit Adjust All  Old  New   All  Old  New   All  Old  New   All  Old  New   All  Old  New
ORCA [6]       ✓            69.0 77.4 52.0  73.5 92.6 63.9  35.3 45.6 30.2  23.5 50.1 10.7  20.9 30.9 15.5
DebiasPL [47]  ✓            60.9 69.8 43.1  43.5 59.1 35.6  38.1 44.2 35.0  31.1 49.6 22.1  30.1 39.1 25.3
UNO+ [17]      ✗            69.5 80.6 47.2  70.3 95.0 57.9  35.1 49.0 28.1  35.5 70.5 18.6  28.3 53.7 14.7
GCD [43]       ✗            73.0 76.2 66.5  74.1 89.8 66.3  51.3 56.6 48.7  39.0 57.6 29.9  35.4 51.0 27.0
SimGCD         ✗            80.1 81.2 77.8  83.0 93.1 77.9  60.3 65.6 57.7  53.8 71.9 45.0  44.0 58.0 36.4