Deep Learning For Scene Classification: A Survey
Abstract—Scene classification, aiming at classifying a scene image to one of the predefined scene categories by comprehending the entire image, is a longstanding, fundamental and challenging problem in computer vision. The rise of large-scale datasets, which provide a dense sampling of diverse real-world scenes, and the renaissance of deep learning techniques, which learn powerful feature representations directly from big raw data, have brought remarkable progress in the field of scene representation and classification. To help researchers master the needed advances in this field, the goal of this paper is to provide a comprehensive survey of recent achievements in scene classification using deep learning. More than 200 major publications are included in this survey, covering different aspects of scene classification, including challenges, benchmark datasets, taxonomy, and quantitative performance comparisons of the reviewed methods. In retrospect of what has been achieved so far, the paper also concludes with a list of promising research opportunities.
Index Terms—Scene classification, Deep learning, Convolutional neural network, Scene representation, Literature survey.
1 INTRODUCTION
The goal of scene classification is to classify a scene image to one of the predefined scene categories (such as beach, kitchen, and bakery), based on the image's ambient content, objects, and their layout. Visual scene understanding requires reasoning about the diverse and complicated environments that we encounter in our daily life. Recognizing visual categories such as objects, actions and events is no doubt indispensable.

[Figure: classification accuracy (%) on MIT67 and SUN397.]
Fig. 2. A taxonomy of deep learning for scene classification. With the rise of large-scale datasets, powerful features are learned from pre-trained CNNs, fine-tuned CNNs, or specific CNNs, which has led to remarkable progress. The major features include global CNN features, spatially invariant features, semantic features, multi-layer features, and multi-view features. Meanwhile, many methods are improved via effective strategies, like encoding, attention learning, and context modeling. As a newer direction, methods using RGB-D datasets focus on learning depth-specific features and fusing multiple modalities.
Recently, several surveys for scene classification have also become available, such as [59], [60], [61]. Cheng et al. [60] provided a comprehensive review of the recent progress in remote sensing image scene classification. Wei et al. [59] carried out an experimental study of 14 scene descriptors, mainly in the handcrafted feature engineering way, for scene classification. Xie et al. [61] reviewed scene recognition approaches in the past two decades, and most of the methods discussed in their survey follow this handcrafted way. As opposed to these existing reviews [59], [60], [61], this work summarizes the striking success and dominance of deep learning and its related methods in indoor/outdoor scene classification, but does not include other scene classification tasks, e.g., remote sensing scene classification [60], [62], [63], acoustic scene classification [64], [65], place classification [66], [67], etc. The major contributions of this work can be summarized as follows:
• As far as we know, this paper is the first to specifically focus on deep learning methods for indoor/outdoor scene classification, including RGB scene classification as well as RGB-D scene classification.
• We present a taxonomy (see Fig. 2), covering the most recent and advanced progress of deep learning for scene representation.
• Comprehensive comparisons of existing methods on several public datasets are provided; meanwhile, we also present brief summaries and insightful discussions.
The remainder of this paper is organized as follows: Challenges and benchmark datasets are summarized in Section 2. In Section 3, we present a taxonomy of the existing deep learning based methods. Then, in Section 4, we provide an overall discussion of their performance (Tables 2, 4). Finally, Section 5 concludes with an outlook on important future research.

2 BACKGROUND

2.1 The Problem and Challenge
Scene classification can be further dissected through analyzing its strong ties with related vision tasks, such as object classification and texture classification. As typical pattern recognition problems, these tasks all consist of feature representation and classification. However, in contrast to object classification (images are object-centric) or texture classification (images include only textures), scene images are more complicated, and it is essential to further explore the content of a scene, e.g., what the semantic parts (e.g., objects, textures, background) are, in what way they are organized together, and what their semantic connections with each other are. Despite several decades of development in scene classification (shown in the appendix due to space limit), most methods are still not capable of performing at a level sufficient for various real-world scenes. The inherent difficulty is due to the complexity and high variance of scenes. Overall, significant challenges in scene classification stem from large intraclass variations, semantic ambiguity, and computational efficiency.

Fig. 3. Illustrations of large intraclass variation and semantic ambiguity. Top (large intraclass variation): three shopping malls look quite different due to lighting and overall content. Bottom (semantic ambiguity): the general layout and uniformly arranged objects are similar across archive, bookstore, and library.

Large intraclass variation. Intraclass variation mainly originates from intrinsic factors of the scene itself and imaging conditions. In terms of intrinsic factors, each scene can have many different example images, possibly varying with large variations among various objects, background, or human activities. Imaging conditions like changes in illumination, viewpoint, scale and heavy occlusion, clutter, shading, blur, motion, etc. contribute to large intraclass variations. Further
challenges may be added by digitization artifacts, noise corruption, poor resolution, and filtering distortion. For instance, three shopping malls (top row of Fig. 3) are shown with different lighting conditions, viewing angles, and objects.
Semantic ambiguity. Since images of different classes may share similar objects, textures, background, etc., they look very similar in visual appearance, which causes ambiguity among them [43], [68]. The bottom row of Fig. 3 depicts the strong visual correlation between three different indoor scenes, i.e., archive, bookstore, and library. With the emergence of new scene categories, the problem of semantic ambiguity becomes more serious. In addition, scene category annotation is subjective, relying on the experience of the annotators; therefore, a scene image may belong to multiple semantic categories [68], [69].
Computational efficiency. The prevalence of social media networks and mobile/wearable devices has led to increasing demands for various computer vision tasks including scene recognition. However, mobile/wearable devices have constrained computing resources, making efficient scene recognition a pressing requirement.

2.2 Datasets
This section reviews publicly available datasets for scene classification. The scene datasets (see Fig. 4) are broadly divided into two main categories based on the image type: RGB and RGB-D datasets. The datasets can further be divided into two categories in terms of their size. Small-size datasets (e.g., Scene15 [14], MIT67 [70], SUN397 [57], NYUD2 [71], SUN RGBD [72]) are usually used for evaluation, while large-scale datasets, e.g., ImageNet [56] and Places [18], [25], are essential for pre-training and developing deep learning models. Table 1 summarizes the characteristics of these datasets for scene classification.

Fig. 4. Example images (RGB, depth, and HHA rows) from the benchmark datasets Scene15, MIT67, SUN397, ImageNet, Places, NYUD2, and SUN RGBD. RGB-D images consist of RGB and a depth map. Moreover, Gupta et al. [73] proposed to convert a depth image into three-channel feature maps, i.e., Horizontal disparity, Height above the ground, and Angle of the surface normal (HHA). Such HHA encoding is useful for the visualization of depth data.

TABLE 1
Popular datasets for scene classification. "#" denotes the number of.

Type  | Dataset           | #Images     | #Class | Resolution  | Class label
RGB   | Scene15 [14]      | 4,485       | 15     | ≈ 300×250   | Indoor/outdoor scene
RGB   | MIT67 [70]        | 15,620      | 67     | ≥ 200×200   | Indoor scene
RGB   | SUN397 [57]       | 108,754     | 397    | ≈ 500×300   | Indoor/outdoor scene
RGB   | ImageNet [17]     | 14 million+ | 21,841 | ≈ 500×400   | Object
RGB   | Places205 [18]    | 7,076,580   | 205    | ≥ 200×200   | Indoor/outdoor scene
RGB   | Places88 [18]     | −           | 88     | ≥ 200×200   | Indoor/outdoor scene
RGB   | Places365-S [25]  | 1,803,460   | 365    | ≥ 200×200   | Indoor/outdoor scene
RGB   | Places365-C [25]  | 8 million+  | 365    | ≥ 200×200   | Indoor/outdoor scene
RGB-D | NYUD2 [71]        | 1,449       | 10     | ≈ 640×480   | Indoor scene
RGB-D | SUN RGBD [72]     | 10,335      | 19     | ≥ 512×424   | Indoor scene

Scene15 dataset [14] is a small scene dataset containing 4,485 grayscale images of 15 scene categories, i.e., 5 indoor scene classes (e.g., office, store, and kitchen) along with 10 outdoor scene classes (like suburb, forest, and tall building). Each class contains 210−410 scene images, and the image size is around 300×250. The dataset is divided into two splits; there are at least 100 images per class in the training set, and the rest are for testing.
MIT Indoor 67 (MIT67) dataset [70] covers a wide range of indoor scenes, e.g., store, public space, and leisure. MIT67 comprises 15,620 scene images from 67 indoor categories, where each category has about 100 images. Moreover, all images have a minimum resolution of 200×200 pixels on the smallest axis. Because of the shared similarities among objects in this dataset, the classification of its images is challenging. There are 80 and 20 images per class in the training and testing set, respectively.
Scene UNderstanding 397 (SUN397) dataset [57] consists of 397 scene categories, in which each category has more than 100 images. The dataset contains 108,754 images with an image size of about 500×300 pixels. SUN397 spans 175 indoor and 220 outdoor scene classes, plus two classes with mixed indoor and outdoor images, e.g., a promenade deck with a ticket booth. There are several train/test split settings with 50 images per category for testing.
ImageNet dataset [56] is one of the most famous large-scale image databases particularly used for visual tasks. It is organized in terms of the WordNet [74] hierarchy, each node of which is depicted by hundreds and thousands of images. Up to now, there are more than 14 million images and about 20 thousand nodes in ImageNet. Usually, a subset of the ImageNet dataset (about 1,000 categories with a total of 1.2 million images [17]) is used to pre-train the CNN for scene classification.
Places dataset [18], [25] is a large-scale scene dataset with 434 scene categories, which provides an exhaustive list of the classes of environments encountered in the real world. The Places dataset has inherited the same list of scene categories from SUN397 [57]. Four benchmark subsets of Places are as follows: 1) Places205 [18] has 2.5 million images from scene categories. The image number per class varies from 5,000 to 15,000. The training set has 2,448,873 images, with 100 images per category for validation and 200 images per category for testing. 2) Places88 [18] contains the 88 common scene categories among the ImageNet [56], SUN397 [57], and Places205 datasets. Places88 includes only the images obtained in the second round of annotation from Places. 3) Places365-Standard [25] has 1,803,460 training images, with the image number per class varying from 3,068 to 5,000. The validation set has 50 images/class, while the testing set has 900 images/class. 4) Places365-Challenge contains the same categories as Places365-Standard, but its training set is significantly larger, with a total of 8 million images. This subset was released for the Places Challenge held in conjunction with ECCV, as part of the ILSVRC 2016 Challenge.
NYU-Depth V2 (NYUD2) dataset [71] is comprised of video sequences from a variety of indoor scenes as recorded by both the RGB and depth cameras. The dataset consists of 1,449 densely labeled pairs of aligned RGB and depth images from 27 indoor scene categories. It features 464 scenes taken from 3 cities and 407,024 unlabeled frames. With the publicly available split, NYUD2 for scene classification offers 795 images for training and 654 images for testing.
SUN RGBD dataset [72] consists of 10,335 RGB-D images with dense annotations in both 2D and 3D, for both objects and rooms. The dataset is collected by four different sensors at a similar scale as PASCAL VOC [75]. The whole dataset
is densely annotated and includes 146,617 2D polygons and 58,657 3D bounding boxes with accurate object orientations, as well as a 3D room layout and category for scenes.

Fig. 5. Generic pipeline of deep learning for scene classification. An entire pipeline consists of a module in each of the three stages (local feature extraction, feature encoding and pooling, and category prediction). The common pipelines are shown with arrows in different colors, including the global CNN feature based pipeline (blue arrows), the spatially invariant feature based pipeline (green arrows), and the semantic feature based pipeline (red arrows). Although the pipelines of some methods (like [31], [32]) are unified and trained in an end-to-end manner, they are virtually composed of these three stages.

3 DEEP LEARNING BASED METHODS

In this section, we present a comprehensive review of deep learning methods for scene classification. A brief introduction to deep learning is provided in the appendix due to limited space. The most common deep learning architecture is the Convolutional Neural Network (CNN) [76]. With a CNN as feature extractor, Fig. 5 shows the generic pipeline of most CNN based methods for scene classification. Almost without exception, given an input scene image, the first stage is to use CNN extractors to obtain local features. The second stage is to aggregate these features into an image-level representation via encoding, concatenating, or pooling. Finally, with the representation as input, the classification stage is to produce a predicted category.
The taxonomy, shown in Fig. 2, covers different aspects of deep learning for scene classification. In the following investigation, we first study the main CNN frameworks for scene classification. Then, we review existing CNN based scene representations. Furthermore, we explore various techniques for improving the obtained representations. Finally, as a supplement, we investigate scene classification using RGB-D data.

3.1 Main CNN Framework
Convolutional Neural Networks (CNNs) are common deep learning models to extract high quality representations. At the beginning, limited by computing resources and labeled data, scene features were extracted from pre-trained CNNs, usually combined with the BoVW pipeline [6]. Then, fine-tuned CNN models were used to keep the last layers more data-specific. Alternatively, specific CNN models have emerged to adapt to scene attributes.

3.1.1 Pre-trained CNN Model
The network architecture plays a pivotal role in the performance of deep models. In the beginning, AlexNet [17] served as the mainstream CNN model for feature representation and classification purposes. Later, Simonyan et al. [77] developed VGGNet and showed that, for a given receptive field, using multiple stacked small kernels is better than using a large convolution kernel, because applying non-linearity on multiple feature maps yields more discriminative representations. On the other hand, the reduction of the kernels' receptive field size decreases the number of parameters for bigger networks. Therefore, VGGNet has 3×3 convolution kernels instead of the large convolution kernels (i.e., 11×11, 7×7, and 5×5) in AlexNet. Motivated by the idea that only a handful of neurons have an effective role in feature representation, Szegedy et al. [78] proposed an Inception module to make a sparse approximation of CNNs. The deeper the model, the more descriptive the representations; this is the advantage of hierarchical feature extraction using CNNs. However, constantly increasing a CNN's depth could result in gradient vanishing. To address this issue, He et al. [79] included skip connections in the hierarchical structure of the CNN and proposed Residual Networks (ResNets), which are easier to optimize and can gain accuracy from considerably increased depth.
In addition to the network architecture, the performance of a CNN is intertwined with a sufficiently large amount of training data. However, training data are scarce in certain applications, which results in under-fitting of the model during the training process. To overcome this issue, pre-trained models can be employed to effectively extract feature representations of small datasets [80]. Training CNNs on large-scale datasets, such as ImageNet [56] and Places [18], [25], makes them learn enriched visual representations. Such models can further be used as pre-trained models for other tasks. However, the effectiveness of employing pre-trained models largely depends on the similarity between the source and target domains. Yosinski et al. [81] documented that the transferability of pre-trained CNN models decreases as the similarity of the target task and original source task decreases. Nevertheless, pre-trained models still perform better than randomly initialized models [81].
Pre-trained CNNs, as fixed feature extractors, are divided into two categories: object-centric and scene-centric CNNs. Object-centric CNNs refer to models pre-trained on object
datasets, e.g., ImageNet [56], and deployed for scene classification. Since object images do not contain the diversity provided by the scene [18], object-centric CNNs have limited performance for scene classification. Hence, scene-centric CNNs, pre-trained on scene images, like Places [18], [25], are more effective for extracting scene-related features.
Object-centric CNNs. Cimpoi et al. [82] asserted that the feature representations obtained from object-centric CNNs are object descriptors, since they likely have more object descriptive properties. The scene image is represented as a bag of semantics [30], and object-centric CNNs are sensitive to the overall shape of objects, so many methods [28], [30], [31], [82], [83] used object-centric CNNs to extract local features from different regions of the scene image. Another important factor in the effective deployment of object-centric CNNs is the relative size of images in the source and target datasets. Although CNNs are generally robust against size and scale, the performance of object-centric CNNs is influenced by scaling, because such models are originally pre-trained on datasets to detect and/or recognize objects. Therefore, the shift to describing scenes, which have multiple objects with different scales, would drastically affect their performance [19]. For instance, if the image size of the target dataset is smaller than that of the source dataset to a certain degree, the accuracy of the model would be compromised.
Scene-centric CNNs. Zhou et al. [18], [25] demonstrated that the classification performance of scene-centric CNNs is better than that of object-centric CNNs, since the former use the prior knowledge of the scene. Herranz et al. [19] found that Places-CNNs [23] achieve better performance at larger scales; therefore, scene-centric CNNs generally extract representations over the whole range of scales. Guo et al. [40] noticed that the CONV layers of scene-centric CNNs capture more detailed information of a scene, such as local semantic regions and fine-scale objects, which is crucial to discriminate ambiguous scenes, while the feature representations obtained from the FC layers do not convey such perceptive quality. Zhou et al. [84] showed that scene-centric CNNs may also perform as object detectors without explicitly being trained on object datasets.

3.1.2 Fine-tuned CNN Model
Pre-trained CNNs, described in Section 3.1.1, perform as feature extractors with prior knowledge of the training data [6], [85]. However, using only the pre-training strategy would prevent exploiting the full capability of the deep models in describing the target scenes adaptively. Hence, fine-tuning the pre-trained CNNs using the target scene dataset improves their performance by reducing the possible domain shift between the two datasets [85]. Notably, a suitable weight initialization becomes very important, because it is quite difficult to train a model with many adjustable parameters and non-convex loss functions [86]. Therefore, fine-tuning the pre-trained CNN contributes to an effective training process [29], [30], [34], [87].
For CNNs, a common fine-tuning technique is the freeze strategy. In this method, the last FC layer of a pretrained model is replaced with a new FC layer with the same number of neurons as the classes in the target dataset (e.g., MIT67, SUN397), while the previous CONV layers' parameters are frozen, i.e., they are not updated during fine-tuning. Then, this modified CNN is fine-tuned by training on the target dataset. Herein, the back-propagation is stopped after the last FC layers, which allows these layers to extract discriminative features from the previously learned layers. Through updating few parameters, training a complex model using small datasets becomes affordable. Optionally, it is also possible to gradually unfreeze some layers to further enhance the learning quality, as the earlier layers would adapt new representations from the target dataset. Alternatively, different learning rates could be assigned to different layers of the CNN, in which the early layers of the model have a very low learning rate and the last layers have higher learning rates. In this way, the early CONV layers that carry more generic representations are less affected, while the specialized FC layers are fine-tuned with higher speed.
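A minimal PyTorch sketch of this freeze strategy (an illustrative assumption, not the exact recipe of any surveyed method): it starts from a torchvision ImageNet-pretrained ResNet-50 (a Places-pretrained checkpoint could be loaded the same way) and targets a 67-class set such as MIT67.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a CNN pre-trained on a large-scale dataset (ImageNet weights here).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Freeze strategy: keep the previously learned CONV layers fixed.
for param in model.parameters():
    param.requires_grad = False

# Replace the last FC layer with one matching the target classes (e.g., 67 for MIT67).
model.fc = nn.Linear(model.fc.in_features, 67)   # the new layer is trainable by default

# Train only the new FC layer on the target scene dataset.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)

# Alternative: unfreeze everything but assign lower learning rates to earlier layers,
# so the generic early CONV layers change slowly while the FC layer adapts quickly.
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.SGD(
    [{"params": model.layer1.parameters(), "lr": 1e-4},
     {"params": model.layer2.parameters(), "lr": 1e-4},
     {"params": model.layer3.parameters(), "lr": 1e-3},
     {"params": model.layer4.parameters(), "lr": 1e-3},
     {"params": model.fc.parameters(), "lr": 1e-2}],
    momentum=0.9)
```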
A small training dataset limits the effectiveness of fine-tuning, and data augmentation is one alternative to deal with this issue [20], [21], [48], [88]. Liu et al. [85] indicated that deep models may not benefit from fine-tuning on a small target dataset. In addition, fine-tuning may have negative effects, since the specialized FC layers are changed while inadequate training data are provided for fine-tuning. To this end, Khan et al. [20] augmented the scene image dataset with flipped, cropped, and rotated versions to increase the size of the dataset and further improve the robustness of the learned representations. Liu et al. [21] proposed a method to select representative image patches of the original image.
There remains a problem with using data augmentation to fine-tune CNNs for scene classification. Herranz et al. [19] asserted that fine-tuning a CNN model has a certain "equalizing" effect between the input patch scale and the final accuracy, i.e., to some extent, with too small patches as CNN inputs, the final classification accuracy is worse. This is because small patch inputs contain insufficient image information, while the final labels indicate scene categories [32], [89]. Moreover, the number of cropped patches is huge, so just a tiny part of these patches is used to fine-tune CNN models, rendering limited overall improvement [19]. On the other hand, Herranz et al. [19] also explored the effect of fine-tuning CNNs on different scales, i.e., with different scale patches as inputs. From the practical results, there is a moderate accuracy gain in the range of scale patches where the original CNNs perform poorly, e.g., in the cases of global scales for ImageNet-CNN and local scales for Places-CNN. However, there is marginal or no gain in ranges where the CNN already has strong performance. For example, since Places-CNN has the best performance over the whole range of scale patches, fine-tuning on the target dataset leads to negligible performance improvement in this case.

3.1.3 Specific CNN Model
In addition to the generic CNN models, i.e., the pre-trained CNN models and the fine-tuned CNN models, another group of deep models are specifically designed for scene classification. These models are developed to extract effective scene representations from the input by introducing new network architectures. As shown in Fig. 6, we only show four typical specific models [22], [23], [24], [26].
To capture discriminative information from regions of interest, Zhou et al. [23] replaced the FC layers in a CNN model with a Global Average Pooling (GAP) layer [90] followed by a Softmax layer, i.e., GAP-CNN. As shown in Fig. 6 (a), by a simple combination of the original GAP layer and the 1×1 convolution operation to form a class activation map (CAM), GAP-CNN can focus on class-specific regions and perform scene classification well. Although the GAP layer has a lower number of parameters than the FC layer [23], [48], GAP-CNN can obtain comparable classification accuracy.
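To make the CAM computation concrete, the following sketch (hypothetical tensor shapes; not the original implementation of [23]) forms a class activation map as the classifier-weighted sum of the last CONV maps produced before global average pooling:

```python
import torch
import torch.nn.functional as F

def class_activation_map(conv_maps, fc_weights, class_idx):
    """conv_maps: (C, H, W) last CONV feature maps (before GAP);
    fc_weights: (num_classes, C) weights of the classifier that follows GAP."""
    weights = fc_weights[class_idx]                      # (C,) channel weights of the target class
    cam = torch.einsum("c,chw->hw", weights, conv_maps)  # weighted sum over channels
    cam = F.relu(cam)                                    # keep positive class evidence only
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam  # upsample to the input resolution for visualization
```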
Hypothesizing that a certain amount of sparsity improves the discriminability of the feature representations [91],
Fig. 7. Five typical architectures to extract CNN based scene representations (see Section 3.2), respectively. Hourglass architectures are backbone
networks, such as AlexNet or VGGNet. (a) HLSTM [27], a global CNN feature based method, extracts deep feature from the whole image. Spatial LSTM
is used to model 2D characteristics among the spatial layout of image regions. Moreover, Zuo et al. captured cross-scale contextual dependencies via
multiple LSTM layers. (b) SFV [30], a spatially invariant feature based method, extract local features from dense patches. The highlight of SFV is to
add a natural parameterization to transform the semantic space into a natural parameter space. (c) WELDON [34], a semantic feature based method,
extracts deep features from top evidence (red) and negative instances (yellow). In WSP scheme, Durand et al. used the max layer and min layer to
select positive and negative instances, respectively. (d) FTOTLM [21], a typical multi-layer feature based method, extracts deep feature from each residual
block. (e) Scale-specific network [19], a multi-view feature based architecture, used scene-centric CNN extract deep features from coarse versions, while
object-centric CNN is used to extract features from fine patches. Two types of deep features complement each other.
Convolution (RI-Conv) layer so that they can obtain an identical representation for an image and its left-right reversed copy. Nevertheless, global CNN feature based methods have not fully exploited the underlying geometric and appearance variability of scene images.
The performance of global CNN features is greatly affected by the content of the input image. CNN models can extract generic global feature representations once trained on a sufficiently large and rich training dataset, as opposed to handcrafted feature extraction methods. It is noteworthy that global representations obtained by scene-centric CNN models yield more enriched spatial information than those obtained using object-centric CNN models, arguably since global representations from scene-centric CNNs contain spatial correlations between objects and global scene properties [18], [19], [25]. In addition, Herranz et al. [19] showed that the performance of a scene recognition system depends on the entities in the scene image, i.e., when global features are extracted from images with a chaotic background, the model's performance is degraded compared to the cases where the object is isolated from the background or the image has a plain background. This suggests that the background may introduce some noise into the feature that weakens the performance. Since contour symmetry provides a perceptual advantage when human observers recognize complex real-world scenes, Rezanejad et al. [101] studied global CNN features from the full image and from contour information only, and showed that the performance with the full image as input is better, because the CNN captures potential information from images. Nevertheless, they still concluded that contour is an auxiliary clue to improve recognition accuracy.

3.2.2 Spatially Invariant Feature based Method
To alleviate the problems caused by sequential operations in the standard CNN, a body of alternatives [28], [40], [82] proposed spatially invariant feature based methods to maintain spatial robustness. "Spatially invariant" means that the output features are robust against the geometrical variations of the input image [96].
As shown in Fig. 7 (b), spatially invariant features are usually extracted from multiple local patches. The visualization of such a feature extraction process is shown in Fig. 5 (marked in green arrows). The entire process can be decomposed into five basic steps: 1) Local patch extraction: a given input image is divided into smaller local patches, which are used as the input to a CNN model; 2) Local feature extraction: deep features are extracted from either the CONV or FC layers of the model; 3) Codebook generation: a codebook with multiple codewords is generated based on the extracted deep features from different regions of the image, where the codewords are usually learned in an unsupervised way (e.g., using GMM); 4) Spatially invariant feature generation: given the generated codebook, deep features are encoded into a spatially invariant representation; and 5) Class prediction: the representation is classified into a predefined scene category.
As opposed to patch-based local feature extraction (each local feature is extracted from an original patch by independently using the CNN extractor), local features can also be extracted from the semantic CONV maps of a standard CNN [29], [44], [102], [103]. Specifically, since each cell (deep descriptor) of the feature map corresponds to one local image patch in the input image, each cell is regarded as a local feature. In this approach, the computation time is decreased compared to independently processing multiple spatial patches to obtain local features. For instance, Yoo et al. [29] replaced the FC layers with CONV layers to obtain a large amount of local spatial features. They also used multi-scale CNN activations to achieve geometric robustness. Gao et al. [102] used a spatial pyramid to directly divide the activations into multi-level pyramids, which contain more discriminative spatial information.
The feature encoding technique, which aggregates the local features, is crucial in relating local features with the final feature representation, and it directly influences the accuracy and efficiency of scene classification algorithms [85]. Improved Fisher Vector (IFV) [104], Vector of Locally Aggregated Descriptors (VLAD) [105], and Bag-of-Visual-Words (BoVW) [106] are among the popular and effective encoding techniques used in deep learning based methods. For instance, many methods, like FV-CNN [82], MFA-FS [42], and MFAFVNet [31], apply IFV encoding to obtain the image embedding as spatially invariant representations, while MOP-CNN [28], SDO [35], etc. utilize VLAD to cluster local
features. Noteworthily, the codebook selection and encoding procedures result in disjoint training of the model. To this end, some works proposed networks that are trained in an end-to-end manner, e.g., NetVLAD [67], MFAFVNet [31], and VSAD [32].
Spatially invariant feature based methods are efficient for achieving geometric robustness. Nevertheless, the sliding window based paradigm requires multi-resolution scanning with fixed aspect ratios, which is not suitable for arbitrary objects with variable sizes or aspect ratios in the scene image. Moreover, using dense patches may introduce noise into the final representation, which decreases the classification accuracy. Therefore, extracting semantic features from salient regions of the scene image can circumvent these drawbacks.

3.2.3 Semantic Feature based Method
Processing all patches of the input image incurs computational cost while yielding redundant information. Object detection determines whether or not any instance of the salient regions is present in an image [107]. Inspired by this, object detector based approaches allow identifying salient regions of the scene, which provide distinctive information about the context of the image.
Different methods have been put forward for effective saliency detection, such as selective search [108], unsupervised discovery [109], Multi-scale Combinatorial Grouping (MCG) [110], and object detection networks (e.g., Fast RCNN [51], Faster RCNN [52], SSD [111], Yolo [112], [113]). For instance, since selective search combines the strengths of exhaustive search and segmentation, Liu et al. [87] used it to capture all possible semantic regions, and then used a pre-trained CNN to extract the feature maps of each region, followed by a spatial pyramid to reduce map dimensions. Because common objects or characteristics in different scenes lead to the commonality of different scenes, Cheng et al. [35] used a region proposal network [52] to extract the discriminative regions while discarding non-discriminative regions. These semantic feature based methods [35], [87] harvest many semantic regions, so encoding technology is adopted to aggregate key features; the corresponding pipeline is shown in Fig. 5 (red arrows).
On the other hand, some semantic feature based methods [33], [34] are based on weakly supervised learning, which directly predicts categories from several semantic features of the scene. For instance, Wu et al. [33] generated high-quality proposal regions by using MCG [110], and then used an SVM on each scene category to prune outliers and redundant regions. Semantic features from different scale patches supply complementary cues, since the coarser scales deal with larger objects, while the finer levels provide smaller objects or object parts. In practice, they found two semantic features sufficient to represent the whole scene, comparable to multiple semantic features. Training a deep model using only a single salient region may result in suboptimal performance due to the possible existence of outliers in the training set. Hence, multiple regions can be selected to train the model together [34]. As shown in Fig. 7 (c), Durand et al. [34] designed a Max layer to select the attention regions to enhance the discrimination. To provide a more robust strategy, they also designed a Min layer to capture the regions with the most negative evidence to further improve the model.
Although better performance can be obtained by using more semantic local features, semantic feature based methods deeply rely on the performance of object detection. Weak supervision settings (i.e., without patch labels of scene images) make it difficult to accurately identify the scene by the key information of an image [34]. Moreover, the error accumulation problem and extra computation cost also limit the development of semantic feature based methods [103].

3.2.4 Multi-layer Feature based Method
Global feature based methods usually extract the high-layer CNN features and feed them into a classifier to achieve the classification task. Due to the compactness of such high-layer features, it is easy to miss some important subtle clues [40], [114]. Features from different layers are complementary [33], [115]. Low-layer features generally capture small objects, while high-layer features capture big objects [33]. Moreover, the semantic information of low-layer features is relatively limited, but the object location is accurate [115]. To take full advantage of features from different layers, many methods [21], [38], [39], [47] used the high resolution features from the early layers along with the high semantic information of the features from the latest layers of hierarchical models (e.g., CNNs).
As shown in Fig. 7 (d), the typical multi-layer feature formation process includes: 1) Feature extraction: the outputs (feature maps) of certain layers are extracted as deep features; 2) Feature vectorization: the extracted feature maps are vectorized; 3) Multi-layer feature combination: multiple features from different layers are combined into a single feature vector; and 4) Feature classification: the given scene image is classified based on the obtained combined feature.
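As an illustration of steps 1)–3) (early fusion by concatenation; the chosen blocks and pooling are assumptions for the sketch, not a specific surveyed method), feature maps from several ResNet blocks can be pooled and concatenated with forward hooks:

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

features = {}
def make_hook(name):
    def _hook(module, inputs, output):
        # 1)-2) Extract the block's feature map and vectorize it by global average pooling.
        features[name] = torch.flatten(nn.functional.adaptive_avg_pool2d(output, 1), 1)
    return _hook

for name in ["layer2", "layer3", "layer4"]:
    getattr(backbone, name).register_forward_hook(make_hook(name))

image = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed scene image
with torch.no_grad():
    backbone(image)

# 3) Combine the multi-layer features into a single vector (512 + 1024 + 2048 dims here),
# which is then fed to a classifier (e.g., a linear layer or an SVM).
multi_layer_feature = torch.cat([features[n] for n in ["layer2", "layer3", "layer4"]], dim=1)
```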
Although using all features from different layers seems to improve the final representation, it likely increases the chance of overfitting, and thus hurts performance [37]. Therefore, many methods [21], [37], [38], [39], [47] only extract features from certain layers. For instance, Xie et al. [38] constructed two dictionary-based representations, the Convolution Fisher Vector (CFV) and the Mid-level Local Discriminative Representation (MLR), to classify scene images subsidiarily. Tang et al. [39] divided GoogLeNet layers into three parts from bottom to top and extracted the final feature maps of each part. Liu et al. [21] captured feature maps from each residual block of ResNet independently. Song et al. [47] selected discriminative combinations from different layers and different network branches via minimizing a weighted sum of the probability of error and the average correlation coefficient. Yang et al. [37] used greedy selection to explore the best layer combinations.
Feature fusion in multi-layer feature based methods is another important direction. Feature fusion techniques are mainly divided into two groups [116], [117], [118]: 1) Early fusion: extracting multi-layer features and merging them into a comprehensive feature for scene classification, and 2) Late fusion: directly learning each multi-layer feature via a supervised learner, which enforces the features to be directly sensitive to the category label, and then merging them into a final feature. Although the performance of late fusion is better, it is more complex and time-consuming, so early fusion is more popular [21], [38], [39], [47]. In addition, addition and product rules are usually applied to combine multiple features [39]. Since the feature spaces in different layers are disparate, the product rule is better than the addition rule for fusing features, and empirical experiments in [39] also support this statement. Moreover, Tang et al. [39] proposed two strategies to fuse multi-layer features, i.e., 'fusion with score' and 'fusion with features'. The fusion with score technique has obtained a
better performance than fusion with features, thanks to the end-to-end training.

3.2.5 Multiple-view Feature based Method
Describing a complex scene using just a single and compact feature representation is a non-trivial task. Hence, there has been extensive effort to compute a comprehensive representation of a scene by integrating multiple features generated from complementary CNN models [24], [32], [41], [119], [120], [121]. Features generated from networks trained on different datasets are usually complementary. As shown in Fig. 7 (e), Herranz et al. [19] found the best scale response of object-centric CNNs and scene-centric CNNs, and they combine the knowledge in a scale-adaptive way via either object-centric CNNs or scene-centric CNNs. This finding is widely used [121], [122]. For instance, the authors in [121] used an object-centric CNN to carry information about the objects depicted in the image, while a scene-centric CNN was used to capture global scene information. Along this way, Wang et al. [32] designed PatchNet, a weakly supervised learning method, which uses image-level supervision information as the supervision signal for effective extraction of patch-level features. To enhance the recognition performance, Scene-PatchNet and Object-PatchNet are jointly used to extract features for each patch.
Employing complementary CNN architectures is essential for obtaining discriminative multi-view feature representations. Wang et al. [41] proposed a multi-resolution CNN (MR-CNN) architecture to capture visual content in multiple scale images. In their work, a normal BN-Inception [123] is used to extract coarse resolution features, while a deeper BN-Inception is employed to extract fine resolution features. Jin et al. [124] used global features and spatially invariant features to account for both the coarse layout of the scene and the transient objects. Sun et al. [24] separately extracted three representations, i.e., object semantics representation, contextual information, and global appearance, from discriminative views, which are complementary to each other. Specifically, the object semantic features of the scene image are extracted by a CNN followed by spatial Fisher vectors, the deep feature of a multi-direction LSTM-based model represents contextual information, and the FC feature represents global appearance. Li et al. [119] used ResNet18 [79] to generate discriminative attention maps, which are used as an explicit input of the CNN together with the original image. Using global features extracted by ResNet18 and attention map features extracted from the spatial feature transformer network, the attention map features are multiplied with the global features for adaptive feature refinement so that the network focuses on the most discriminative parts. Later, a multi-modal architecture was proposed in [43], composed of a deep branch and a semantic branch. The deep branch extracts global CNN features, while the semantic branch aims to extract meaningful scene objects and their relations from superpixels.

3.3 Strategies for Improving Scene Representation
To obtain more discriminative representations for scene classification, a range of strategies has been proposed. Four major categories (i.e., encoding strategy, attention strategy, contextual strategy, and regularization strategy) are discussed below.

3.3.1 Encoding strategy
Although the current driving force has been the incorporation of CNNs, the encoding technologies of the first generation methods have also been adapted in deep learning based methods. Fisher Vector (FV) coding [104], [125] is an encoding technique commonly used in scene classification. The Fisher vector stores the mean and the covariance deviation vectors per GMM component for the elements of the local features. Thanks to the covariance deviation vectors, FV encoding leads to excellent results. Moreover, it has been empirically proven that Fisher vectors are complementary to global CNN features [30], [31], [38], [40], [42], [46]. Therefore, this survey takes FV-based approaches as the major cue and discusses the adapted combination of encoding technology and deep learning.
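A compact sketch of GMM-based Fisher vector encoding of deep local features (the improved FV with power and L2 normalization; it assumes a diagonal-covariance GMM fit beforehand on training descriptors, e.g., sklearn's GaussianMixture(covariance_type="diag")):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(local_features, gmm):
    """Improved Fisher vector of (M, D) local deep features under a K-component
    diagonal-covariance GMM; returns a 2*K*D vector (gradients w.r.t. means and variances)."""
    X = local_features
    M, D = X.shape
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_   # (K,), (K, D), (K, D)
    gamma = gmm.predict_proba(X)                               # (M, K) soft assignments
    diff = (X[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]  # normalized deviations
    g_mu = (gamma[:, :, None] * diff).sum(0) / (M * np.sqrt(w)[:, None])            # mean part
    g_var = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (M * np.sqrt(2 * w)[:, None])  # variance part
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))         # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)       # L2 normalization
```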
Fig. 8. Structure comparisons of (a) the basic Fisher vector [82] and its variations. BoF denotes bag of features, while BoS represents bag of semantic probabilities. (b) In the semantic FV [30], a natural parameterization is added to map the multinomial distribution (i.e., π) to its natural parameter space (i.e., ν). (c) In MFAFVNet [31], the GMM is replaced with MFA to build the codebook. (d) In VSAD [32], the codebook is constructed via exploiting semantics (i.e., BoS) to aggregate local features (i.e., BoF).

Generally, CONV features and FC features are regarded as Bags of Features (BoF); they can be readily modeled by the Gaussian Mixture Model followed by the Fisher Vector (GMM-FV) [30], [31]. To avoid the computation of the FC layers, Cimpoi et al. [82] utilized GMM-FV to aggregate BoF from different CONV layers, respectively. Comparing their experimental results, they asserted that the last CONV features can more effectively represent scenes. To rescue the fine-grained information of early/middle layers, Guo et al. [40] proposed the Fisher Convolutional Vector (FCV) to encode the feature maps from multiple CONV layers. Wang et al. [46] extracted the feature maps from RGB, HHA, and surface normal images, and then directly encoded these maps by FV coding. In addition, through performance comparisons of GMM-FV encoding on CONV features and FC features, respectively, Dixit et al. [30] asserted that the FC features are more effective for scene classification. However, since the CONV features and FC features do not derive from a semantic probability space, they are likely to be both less discriminant and less abstract than truly semantic features [30], [82]. The activations of the Softmax layer are probability vectors, inhabiting the probability simplex, which are more abstract and semantic, but it is difficult to implement an effective invariant coding (e.g., GMM-FV) on them [30], [126]. To this end, Dixit et al. [30] proposed an indirect FV implementation to aggregate these semantic probability features, i.e., adding a step to convert semantic multinomials from the probability space to the natural parameter space, as shown in Fig. 8 (b). Inspired by FV and VLAD, Wang et al. [32] proposed the Vector of Semantically Aggregated Descriptors (VSAD) to encode the probability features, as shown in Fig. 8 (d). Comparing the
discriminant probability learned by the weakly-supervised method (PatchNet) with the generative probability from an unsupervised method (GMM), the results show that the discriminant probability is more expressive in aggregating local features. From the above discussion, representations encoding local features in the probability space outperform those in the non-probability space.
Deep features are usually high dimensional. Therefore, more Gaussian kernels are needed to accurately model the feature space [83]. However, this would add a lot of overhead to the computations and, hence, is not efficient. Liu et al. [83] empirically proved that the discriminative power of FV features increases slowly as the number of Gaussian kernels increases. Therefore, dimensionality reduction of the features is very important, as it directly affects the computational efficiency. A wide range of approaches [29], [30], [32], [40], [42], [58], [83] used popular dimensionality reduction techniques, such as Principal Component Analysis (PCA), for pre-processing of the local features. Moreover, Liu et al. [83] drew local features from a Gaussian distribution with a nearly zero mean, which ensures the sparsity of the resulting FV. Wang et al. [46] enforced inter-component sparsity of GMM-FV features via component regularization to discount unnecessary components.
Due to the non-linear property of deep features and the limited ability of the covariance of the GMM, a large number of diagonal GMM components are required to model deep features, so the FV has very high dimensions [31], [42]. To address this issue, Dixit et al. [42] proposed MFA-FS, in which the GMM is replaced by Mixtures of Factor Analysis (MFA) [127], [128], i.e., a set of local linear subspaces is used to capture non-linear features. MFA-FS performs well but does not support end-to-end training. However, end-to-end training is more efficient than any disjoint training process [31]. Therefore, Li et al. [31] proposed MFAFVNet, an improved variant of MFA-FS [42], which is conveniently embedded into state-of-the-art network architectures. Fig. 8 (c) shows the MFA-FV layer of MFAFVNet, compared with the other two structures.
In FV coding, local features are assumed to be independent and identically distributed (iid), which violates the intrinsic image attribute that these patches are not iid. To this end, Cinbis et al. [58] introduced a non-iid model via treating the model parameters as latent variables, rendering features locally related. Later, Wei et al. [129] proposed a correlated topic vector, treated as an evolution of the Fisher kernel framework, to explore latent semantics and consider semantic correlation.

3.3.2 Attention strategy
As opposed to semantic feature based methods (focusing on key cues generally from original images), attention mechanisms aim to capture distinguishing cues from the extracted feature space [43], [44], [96], [122]. The attention maps are learned without any explicit training signal; rather, the task-related loss function alone provides the training signal for the attention weights. Generally, attention policies mainly include channel attention and spatial attention.
Channel attention policy. Channel activation maps (ChAMs) generated from an attention policy refer to the weighted activation maps, which highlight the class-specific discriminative regions. For instance, the class activation map [23] is a simple ChAM, widely used in many works [130], [131]. Since the same semantic cue has different roles for different types of scenes in some cases, Li et al. [96] designed class-aware attentive pooling, including intra-modality attentive pooling and cross-aware attentive pooling, to learn the contributions of the RGB and depth modalities, respectively. Here the attention strategies are also used to further fuse the learned discriminative semantic cues across the RGB and depth modalities. Moreover, they also designed a class-agnostic attentive pooling to ignore some salient regions that may mislead classification. Inspired by the idea that specific objects are associated with a scene, Seong et al. [132] proposed correlative context gating to activate scene-specific object features.
Channel attention maps can also be computed from different sources of information. With multiple salient regions on different scales as input, Xia et al. [122] designed a Weakly Supervised Attention Map (WS-AM) by proposing a gradient-weighted class activation mapping technique and privileging weakly supervised information. In another work [43], the input of the semantic branch is a semantic segmentation score map, and semantic features are extracted from the semantic branch via three channel attention modules, shown in Fig. 9 (a). Moreover, semantic features are used to gate global CNN features via another attention module.

Fig. 9. Illustrations of two typical attentions. (a) In channel attention [43], the channel attention map is used to weight the input by a Hadamard product. (b) Spatial attention in [44] is used to enhance the local feature selection.
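A generic channel-attention sketch in the spirit of Fig. 9 (a) (a simplified stand-in, not the exact module of [43]): spatial average and max pooling squeeze the maps, a shared FC scores the channels, and the input is re-weighted by a Hadamard product.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared FC layers that map pooled channel statistics to channel scores.
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                      # x: (B, C, H, W) feature maps
        b, c, _, _ = x.shape
        avg = self.fc(x.mean(dim=(2, 3)))      # squeeze by average pooling -> (B, C)
        mx = self.fc(x.amax(dim=(2, 3)))       # squeeze by max pooling -> (B, C)
        weights = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * weights                     # Hadamard re-weighting of the channels
```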
Spatial attention policy. A spatial attention policy infers attention maps along the height and width of the input feature maps; the attention maps are then combined with the original maps for adaptive feature refinement. Joseph et al. [133] proposed a layer-spatial attention model, including a hard attention to select a CNN layer and a soft attention to achieve spatial localization within the selected layer. Attention maps are obtained from a Conv-LSTM architecture, where the layer attention uses the previous hidden states, and the spatial attention uses both the selected layer and the previous hidden states. To enhance local feature selection, Xiong et al. [44], [103] designed a spatial attention module, shown in Fig. 9 (b), to generate attention masks. The attention masks of the RGB and depth modalities are encouraged to be similar, in order to learn modal-consistent features.

3.3.3 Contextual strategy
Contextual information (the correlations among image regions, local features, and objects/scenes) may provide beneficial information in disambiguating visual words [134]. However, convolution and pooling kernels are performed locally on separate image regions, and encoding technologies usually integrate multiple local features into an unstructured feature. As a result, contextual correlations among different regions have
11
not been taken into account [135]. To this end, contextual correlations have been further explored to focus on the global layout or local region coherence [136].

The contextual relations can broadly be grouped into two major categories: 1) spatial contextual relations, i.e., the correlations of neighboring regions, where capturing spatial context usually encounters the problem of incomplete regions or noise caused by predefined grid patches; and 2) semantic contextual relations, i.e., the relations of salient regions, where the network used to extract semantic relations is often a two-stage framework (i.e., detecting objects and then classifying scenes), so accuracy is also influenced by object detection. Generally, there are two types of algorithms to capture contextual relations: 1) sequential models, like RNN [137] and LSTM [95], and 2) graph-related models, such as Markov Random Field (MRF) [138], [139], Correlated Topic Model (CTM) [129], [140] and graph convolution [141], [142], [143].

Fig. 10. Illustrations of a sequence model and a graph model for relating contextual information. (a) Local features are extracted from each salient region via a CNN, and a bidirectional LSTM is used to synchronously model many-to-many local feature relations [36]. (b) A graph is constructed by assigning selected key features to graph nodes (including the center nodes, sub-center nodes and other nodes) [144].

Sequential Model. With the success of sequential models, such as RNN and LSTM, capturing the sequential information among local regions has shown promising performance for scene classification [36]. Spatial dependencies are captured from direct or indirect connections between each region and its surrounding neighbors. Zuo et al. [27] stacked multi-directional LSTM layers on top of the CONV feature maps to encode spatial contextual information in scene images. Furthermore, a hierarchical strategy was adopted to capture cross-scale contextual information. Like the work [27], Sun et al. [24] bypassed two sets of multi-directional LSTM layers on the CONV feature maps. In their framework, the outputs of all LSTM layers are concatenated to form a contextual representation. In addition, the works [36], [145] captured semantic contextual knowledge from variable salient regions. In [145], two types of representations, i.e., COOR and SOOR, are proposed to describe object-to-object relations. Herein, COOR adopts the co-occurring frequency to represent the object-to-object relations, while SOOR is encoded with a sequential model by regarding object sequences as sentences. Rooted in the work of Javed et al. [146], Laranjeira et al. [36] proposed a bidirectional LSTM to capture the contextual relations of regions of interest, as shown in Fig. 10 (a). Their model supports variable-length sequences, because the number of object parts differs from image to image.
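To make the sequential-context idea concrete, the following is a minimal PyTorch sketch, not the exact architecture of [27], [24], or [36]: per-region CNN features are treated as a sequence, passed through a bidirectional LSTM, and the pooled hidden states serve as a contextual scene descriptor. The feature dimension, hidden size, and the 67-way classifier are illustrative values.

```python
import torch
import torch.nn as nn

class BiLSTMContext(nn.Module):
    """Illustrative sequential context model: a bidirectional LSTM run over
    per-region CNN features; the pooled hidden states give a scene descriptor."""
    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=67):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, region_feats):
        # region_feats: (batch, num_regions, feat_dim); the number of regions
        # may differ per image, so sequences would be padded/packed in practice.
        outputs, _ = self.lstm(region_feats)   # (B, N, 2*hidden_dim)
        context = outputs.mean(dim=1)          # pool over regions
        return self.classifier(context)

# Toy usage: 4 images, each with 9 salient-region features of dimension 512.
model = BiLSTMContext()
logits = model(torch.randn(4, 9, 512))
print(logits.shape)  # torch.Size([4, 67])
```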
Graph-related Model. Sequential models often simplify the contextual relations, while graph-related models can explore more complicated structural layouts. Song et al. [147] proposed a joint context model that uses MRFs to combine multiple scales, spatial relations, and multiple features among neighboring semantic multinomials, showing that this method can discover consistent co-occurrence patterns and filter out noisy ones. Based on CTM, Wei et al. [129] captured relations among latent themes as a semantic feature, i.e., the corrected topic vector (CTV). Later, with the development of Graph Neural Networks (GNNs) [142], [143], graph convolution has become increasingly popular for modeling contextual information in scene classification. Yuan et al. [144] used spectral graph convolution to mine the relations among selected local CNN features, as shown in Fig. 10 (b). To use the complementary cues of multiple modalities, Yuan et al. also considered the inter-modality correlations of the RGB and depth modalities through a cross-modal graph. Chen et al. [45] used graph convolution [148] to model more complex spatial structural layouts by pre-defining the features of discriminative regions as graph nodes. However, the spatial relation overlooks the semantic meanings of regions. To address this issue, Chen et al. also defined a similarity subgraph as a complement to the spatial subgraph.
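The graph-based alternative can be sketched in the same spirit. The snippet below applies a single, generic graph-convolution layer to a similarity-based region graph; it only illustrates the general mechanism, not the spectral formulation of [144] or the spatial/similarity subgraphs of [45], and the 0.5 similarity threshold and feature sizes are arbitrary choices.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution layer, H' = ReLU(D^{-1/2}(A+I)D^{-1/2} H W),
    applied to a graph whose nodes are selected local CNN features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, feats, adj):
        # feats: (N, in_dim) node features; adj: (N, N) adjacency without self-loops.
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        norm_adj = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(self.weight(norm_adj @ feats))

# Toy usage: 16 region features; connect regions whose cosine similarity > 0.5
# (a stand-in for the spatial/similarity subgraphs discussed above).
feats = torch.randn(16, 512)
sim = torch.nn.functional.cosine_similarity(
    feats.unsqueeze(1), feats.unsqueeze(0), dim=-1)
adj = (sim > 0.5).float()
gcn = SimpleGCNLayer(512, 256)
node_out = gcn(feats, adj)          # (16, 256) relation-aware node features
scene_desc = node_out.mean(dim=0)   # pooled into a scene-level descriptor
print(scene_desc.shape)             # torch.Size([256])
```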
3.3.4 Regularization strategy

Training a classifier not only requires a classification loss function; it may also need multi-task learning with different regularization terms to reduce generalization error. The regularization strategies for scene classification mainly include sparse regularization, structured regularization, and supervised regularization.

Sparse Regularization. Sparse regularization is a technique to reduce the complexity of the model to prevent overfitting and even improve generalization ability. Many works [22], [40], [87], [149], [150] add $\ell_0$, $\ell_1$, or $\ell_2$ norms to the base loss function for learning sparse features. For example, the sparse reconstruction term in [87] encourages the learned representations to be significantly informative. The loss in [22] combines the strength of the Mahalanobis and Euclidean distances to balance accuracy and generalization ability.

Structured Regularization. Minimizing the triplet loss function minimizes the distance between the anchor and positive features with the same class labels while maximizing the distance between the anchor and negative features with different class labels. In addition, according to the maximum margin theory in learning [151], the hinge distance focuses on the hard training samples. Hence, many research efforts [44], [50], [103], [149], [152] have utilized structured regularization of the triplet loss with hinge distance to learn robust feature representations. The structured regularization term is $\sum_{a,p,n} \max(d(x_a, x_p) - d(x_a, x_n) + \alpha, 0)$, where $x_a$, $x_p$, $x_n$ are the anchor, positive, and negative features, $\alpha$ is an adjustable margin, and $d(x, y)$ denotes a distance between $x$ and $y$. The structured regularization term promotes exemplar selection, while it also ignores noisy training examples that might overwhelm the useful discriminative patterns [103], [149].

Supervised Regularization. Supervised regularization uses the label information for tuning the intermediate feature maps. It is generally expressed in terms of $\sum_i d(y_i, f(x_i))$, where $x_i$ and $y_i$ denote the middle-layer activated features and the real label of image $i$, respectively, and $f(x_i)$ is a predicted label. For example, Guo et al. [40] utilized an auxiliary loss to directly propagate the label information to the CONV layers, and thus accurately capture the information of local objects and fine structures in those layers. Similarly, the alternatives [44], [103], [144] used supervised regularization to learn modal-specific features.
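The regularization terms above can be combined into one multi-task objective. The sketch below, a minimal PyTorch example rather than the loss of any specific paper discussed here, adds an $\ell_1$ sparsity term, the triplet term with hinge, and an auxiliary supervised term to a base cross-entropy loss; the weights and margin are illustrative.

```python
import torch
import torch.nn.functional as F

def multi_task_loss(logits, aux_logits, labels, features, anchors, positives,
                    negatives, l1_weight=1e-4, margin=0.3, aux_weight=0.1):
    """Illustrative combination of the regularizers discussed above:
    a base cross-entropy loss, an l1 sparsity term on learned features,
    the structured term sum max(d(xa, xp) - d(xa, xn) + alpha, 0),
    and a supervised auxiliary term on intermediate features."""
    base = F.cross_entropy(logits, labels)                         # classification loss
    sparse = l1_weight * features.abs().mean()                     # sparse regularization
    d_ap = F.pairwise_distance(anchors, positives)                 # d(x_a, x_p)
    d_an = F.pairwise_distance(anchors, negatives)                 # d(x_a, x_n)
    structured = F.relu(d_ap - d_an + margin).mean()               # triplet with hinge
    supervised = aux_weight * F.cross_entropy(aux_logits, labels)  # auxiliary supervision
    return base + sparse + structured + supervised

# Toy usage with random tensors standing in for network outputs.
B, D, C = 8, 256, 67
loss = multi_task_loss(torch.randn(B, C), torch.randn(B, C),
                       torch.randint(0, C, (B,)),
                       torch.randn(B, D), torch.randn(B, D),
                       torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```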
Others. Extracting discriminative features by incorporating different regularization techniques has always been a mainstream topic in scene classification. For example, label-consistent regularization [87] guarantees that inputs from different categories have discriminative responses. The shareable constraint in [149] can learn a flexible number of filters to represent common patterns across different categories. The clustering loss in [124] is utilized to further fine-tune confusing clusters to overcome the inherent intra-class variation issues. Since assigning soft labels to the samples causes a degree of ambiguity, which reaps high benefits when increasing the number of scene categories [153], Wang et al. [41] improved generalization ability by exploiting soft labels contained in knowledge networks as a bias term of the loss function. Notably, optimizing a proper loss function can pick out effective patches for image classification. In Fast R-CNN [51] and Faster R-CNN [52], a regression loss is used to learn effective region proposals. Wu et al. [33] adopted one-class SVMs [154] as discriminative models to obtain meta-objects. Inspired by MANTRA [155], the main intuition in [34] is to equip each possible output with pairs of latent variables, i.e., top positive and negative patches, via optimizing a max+min prediction problem.

Nearly all multi-task learning approaches using regularization aim to find a trade-off among conflicting requirements, e.g., accuracy, generalization robustness, and efficiency. Researchers apply completely different supervision information to a variety of auxiliary tasks in an effort to facilitate the convergence of the major scene classification task [40].

3.4 RGB-D Scene Classification

The RGB modality provides the intensity of the colors and texture cues, while the depth modality carries information regarding the distance of the scene surfaces from a viewpoint. The depth information is invariant to lighting and color variations, and contains geometrical and shape cues, which is useful for scene representation [156], [157], [158]. Moreover, HHA data [73], an encoding of the depth image, presents depth information as a color-like modality that is somewhat similar to an RGB image. Hence, some CNNs trained on RGB images can transfer their knowledge and be used on HHA data.

The depth information of an RGB-D image can further improve the performance of CNN models compared to RGB images alone [89]. For the task of RGB-D scene classification, besides exploring suitable RGB features, described in Section 3.2, there exist two other main problems, i.e., 1) how to extract depth-specific features and 2) how to properly fuse the features of the RGB and depth modalities.

3.4.1 Depth-specific feature learning

Depth data are usually scarce compared to RGB data. Therefore, it is non-trivial to train CNNs only on limited depth data to obtain depth-specific models [89], i.e., depth-CNNs. Hence, different training strategies are employed to train CNNs using a limited amount of depth images.

Fine-tuning RGB-CNNs for depth images. Due to the availability of RGB data, many models [46], [49], [50] are first pre-trained on large-scale RGB datasets, such as ImageNet and Places, followed by fine-tuning on depth data. Fine-tuning only updates the last few FC layers, while the parameters of the previous layers are not adjusted. Therefore, the fine-tuned model's layers do not fully leverage depth data [89]. However, abstract representations of early CONV layers play a crucial role in computing deep features using different modalities. Weakly-supervised learning and semi-supervised learning can enforce explicit adaptation in the previous layers.

Weakly-supervised learning with patches of depth images. Song et al. [47], [89] proposed to learn depth features from scratch using weakly supervised learning. Song et al. [89] pointed out that the diversity and complexity of patterns in depth images are significantly lower than those in RGB images. Therefore, they designed a Depth-CNN (DCNN) with fewer layers for depth feature extraction. They also trained the DCNN with three strategies, namely freezing, fine-tuning, and training from scratch, to adequately capture depth information. Nevertheless, weakly-supervised learning is sensitive to noise in the training data. As a result, the extracted features may not have good discriminative quality for classification.

Semi-supervised learning with unlabeled images. Due to the convenient collection of unlabeled RGB-D data, semi-supervised learning can also be employed in the training of CNNs with a limited number of labeled samples compared to the very large amount of unlabeled data [48], [158]. Cheng et al. [158] trained a CNN using a very limited number of labeled RGB-D images together with a massive amount of unlabeled RGB-D images via a co-training algorithm to preserve diversity. Subsequently, Du et al. [48] developed an encoder-decoder model to construct paired complementary-modal data of the input. In particular, the encoder is used as a modality-specific network to extract specific features for the subsequent classification task.
3.4.2 Multiple modality fusion

Various modality fusion methods [44], [152], [159] have been put forward to combine the information of different modalities and further enhance the performance of the classification model. The fusion strategies are mainly divided into three categories, i.e., feature-level modal combination, consistent-feature based fusion, and distinctive-feature based fusion. Fig. 11 illustrates these three categories. Despite the existence of different fusion categories, some works [44], [46], [152] combine multiple fusion strategies to achieve better performance for scene classification.

Fig. 11. Illustrations of multi-modal feature learning, where $x_i$ and $z_i$ denote the RGB and HHA (depth) features extracted by the two CNN streams. (a) Feature-level modal combination, with three popular ways to combine features: direct concatenation, $F_i = \mathrm{concat}(x_i, z_i)$; weighted combination, $F_i = \lambda_{rgb} x_i + \lambda_{depth} z_i$; and combination after a linear transform, $F_i = W_{rgb} x_i + W_{depth} z_i$, where $\lambda$ is a scalar, $W$ a matrix, and $F_i$ the combined feature. (b) Consistent-feature based modal fusion, with typical consistency terms such as $L_c = \sum_{i,j} [d(f_{rgb}(x_i), f_{rgb}(x_j)) - d(f_{depth}(z_i), f_{depth}(z_j))]$, $L_c = d(f_{rgb}, f_{depth})$, and $L_c = \sum_i d(f_{rgb}(x_i), f_{depth}(z_i))$: minimize the pairwise distances between modalities [50]; encourage the attention maps of the modalities to be similar [103]; minimize the distances between modalities [44], [152]. (c) Distinctive-feature based modal fusion, with typical distinction terms such as $L_{rgb\_d} = \sum_{a,p,n} [d(f_{rgb}(x_a), f_{rgb}(x_p)) - d(f_{rgb}(x_a), f_{rgb}(x_n))]$, $L_{rgb\_d} = \sum_i d(f_{rgb}(x_i), y_i)$, and $L_d = \sum_i \cos(f_{rgb}(x_i), f_{depth}(z_i))$, where $(a, p, n)$ denotes an (anchor, positive, negative) triplet and $y_i$ is the true label of image $i$: learn the model structure via a triplet loss [50], [152]; use label information to guide modal-specific learning [103], [144]; minimize the cosine similarity between modalities [44].

Feature-level modal combination. Song et al. [47] proposed a multi-modal combination approach to select discriminative combinations of layers from different source models. They concatenated RGB and depth features so as not to lose the correlation between the RGB and depth data. Reducing the redundancy of features can significantly improve the performance when RGB and depth features are correlated, especially in the case of extracting depth features merely via RGB-CNNs [89]. Because of such correlation, direct concatenation of features may result in redundant information. To avoid this issue, Du et al. [48] performed global average pooling to reduce the feature dimensions after concatenating modality-specific features. Wang et al. [46] used a modality regularization based on exclusive group lasso to ensure feature sparsity and co-existence, while features within a modality are encouraged to compete. Li et al. [96] used an attention module to discern discriminative semantic cues from intra- and cross-modalities. Moreover, Cheng et al. [49] proposed a gated fusion layer to adjust the RGB and depth contributions on image pixels.

Consistent-feature based modal fusion. Images may suffer from missing information or noise pollution so that multi-modal features are not consistent; hence it is essential to exploit the correlation between different modalities to alleviate this issue. To drive feature consistency of different modalities, Zhu et al. [50] introduced an inter-modality correlation term to minimize pairwise distances of the two modalities from the same class, while maximizing pairwise distances from different classes. Zheng et al. [160] used multi-task metric learning to learn linear transformations of RGB and depth features, making full use of inter-modal relations. Li et al. [152] learned a correlative embedding module between the RGB and depth features inspired by Canonical Correlation Analysis [161], [162]. Xiong et al. [44], [103] proposed a learning approach to encourage two modal-specific networks to focus on features with similar spatial positions, so as to learn more discriminative modal-consistent features.

Distinctive-feature based modal fusion. In addition to constructing multimodal consistent features, features can also be processed separately to increase discriminative capability. For instance, Li et al. [152] and Zhu et al. [50] adopted structured regularization in the triplet loss function, which encourages the model to learn modal-specific features under the supervision of this regularization. Li et al. [152] designed a distinctive embedding module for each modality to learn distinctive features. Using labels for separate supervision of modal-specific representation learning for each modality is another technique of individual processing [44], [103], [144]. Moreover, by minimizing the feature correlation, Xiong et al. [44] learned modal-distinctive features, as the RGB and depth modalities have different characteristics.
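A minimal sketch of the fusion strategies summarized in Fig. 11 is given below, assuming PyTorch: the two modality features are combined either by concatenation or by learned linear projections, and a simple consistency term penalizes the distance between paired modality features. The dimensions, class count, and loss weight are illustrative, and the sketch is not the fusion layer of any specific method cited above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamFusion(nn.Module):
    """Illustrative two-stream fusion of an RGB feature x and a depth/HHA feature z,
    roughly following F_i = concat(x_i, z_i) or F_i = W_rgb x_i + W_depth z_i."""
    def __init__(self, dim=512, num_classes=19, mode="linear"):
        super().__init__()
        self.mode = mode
        self.w_rgb = nn.Linear(dim, dim, bias=False)
        self.w_depth = nn.Linear(dim, dim, bias=False)
        in_dim = 2 * dim if mode == "concat" else dim
        self.classifier = nn.Linear(in_dim, num_classes)

    def forward(self, x, z):
        if self.mode == "concat":
            fused = torch.cat([x, z], dim=1)            # feature-level combination
        else:
            fused = self.w_rgb(x) + self.w_depth(z)     # learned linear combination
        return self.classifier(fused), fused

def consistency_loss(x, z):
    # One of the consistency terms sketched in Fig. 11: sum_i d(f_rgb(x_i), f_depth(z_i)).
    return F.pairwise_distance(x, z).mean()

# Toy usage with random modality features.
x, z = torch.randn(4, 512), torch.randn(4, 512)
model = TwoStreamFusion()
logits, fused = model(x, z)
labels = torch.randint(0, 19, (4,))
loss = F.cross_entropy(logits, labels) + 0.1 * consistency_loss(x, z)
print(logits.shape, loss.item())
```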
4 PERFORMANCE COMPARISON

4.1 Performance on RGB scene data

In contrast to handcrafted approaches, CNN-based methods have quickly demonstrated their strengths in scene classification. Table 2 compares the performance of deep models for scene classification on RGB datasets. To gain insight into the performance of the presented methods, we also provide the input information, feature information, and architecture of each method. The results show that a simple deep model (i.e., AlexNet), trained on ImageNet, achieves 84.23%, 56.79%, and 42.61% accuracy on the Scene15, MIT67, and SUN397 datasets, respectively. This accuracy is comparable with the best non-deep learning methods. Starting from the generic deep models [17], [18], [25], CNN-based methods improve steadily as more effective strategies are introduced. As a result, nearly all the approaches yield an accuracy of around 90% on the Scene15 dataset. Moreover, FTOTLM [21] combined with a novel data augmentation outperforms other state-of-the-art models and achieves the best accuracy on the three benchmark datasets so far.

Extracting global CNN features, which are computed using a pre-trained model, is faster than other deep feature representation techniques, but their quality is not good when there are large differences between the source and target datasets. Comparing the performances of [17], [18], [25] demonstrates that the expressive power of global CNN features improves as richer scene datasets appear. In GAP-CNN [23] and DL-CNN [22], new layers with a small number of parameters substitute for FC layers, but they can still achieve considerable results compared with benchmark CNNs [17], [18].

Spatially invariant feature based methods are usually time-consuming, especially regarding the computational time of sampling local patches, extracting local features individually, and building the codebook. However, these methods are robust against geometrical variance, and thus improve the accuracy of benchmark CNNs, like SFV [30] vs. ImageNet-CNN [17], and MFA-FS [42] vs. PL1-CNN [18]. Encoding technologies generally involve a more complicated training procedure, so some architectures (e.g., MFAFVNet [31] and VSAD [32]) are designed in a unified pipeline to reduce the operation complexity.

Semantic feature based methods [35], [36], [45], [165] demonstrate very competitive performance, due to the discriminative information lying in the salient regions, compared to global CNN feature based and spatially invariant feature based methods. Salient regions are generally generated by region selection algorithms, which may cause a two-stage training procedure and require more time and computations [61]. In addition, even though contextual analysis demands more computational power, methods [36], [45], which explore the contextual relations among salient regions, can significantly improve the classification accuracy.
TABLE 2
Performance (%) summarization of some representative methods on popular benchmark datasets. All scores are quoted directly from the original papers. For each dataset, the highest three classification scores are highlighted. Abbreviations: Column "Scheme": Whole Image (WI), Dense Patches (DP), Regional Proposals (RP); Column "Init." (Initialization): ImageNet (IN), Places205 (PL1), Places365 (PL2).

TABLE 3
Performance (%) comparison of related methods with/without concatenating the global CNN feature on benchmark scene datasets.
Method           | DSFL [149]   | SFV [30]    | MFA-FS [42] | MFAFVNet [31] | VSAD [32]   | SOSF+CFA [24] | SDO [35]     | LGN [45]
MIT67  Baseline  | 52.2         | 72.8        | 81.4        | 82.6          | 84.9        | 84.1          | 68.1         | 85.2
MIT67  +Global   | 76.2 (↑24.0) | 79.0 (↑6.2) | 87.2 (↑5.8) | 87.9 (↑5.3)   | 85.3 (↑0.4) | 89.5 (↑5.4)   | 84.0 (↑15.9) | 85.4 (↑0.2)
SUN397 Baseline  | −            | 54.4        | 63.3        | 64.6          | 71.7        | 66.5          | 54.8         | −
SUN397 +Global   | −            | 61.7 (↑7.3) | 71.1 (↑7.8) | 72.0 (↑7.4)   | 72.5 (↑0.8) | 78.9 (↑12.4)  | 67.0 (↑12.2) | −

Multi-layer feature based methods employ the complementary features from different layers to improve performance. This is a simple way to use more feature cues, and it does not require adding any extra layers. However, these methods are structurally complicated and have high-dimensional features, which make training the models difficult and prone to overfitting [37]. Nevertheless, owing to a novel data augmentation, FTOTLM [21] yields a gain of 19.5% and 19.7% on MIT67 and SUN397, respectively, and has achieved the best results so far.

Multi-view feature based methods take full advantage of features extracted from various CNNs to achieve high classification accuracy. For instance, Table 3 shows that combining global features with other baselines significantly improves their original classification accuracy, e.g., the baseline model "SFV" [30] achieves 72.8% on MIT67, while "SFV+global feature" yields 79%. Moreover, there are two aspects to emphasize: 1) Herranz et al. [19] empirically showed that combining too many uninformative features is only marginally helpful, while it significantly increases computation and introduces noise into the final feature. 2) It is essential to improve the expressive ability of each view feature, and thus enhance the overall ability of the multi-view features.

In summary, the scene classification performance can be boosted by adopting more sophisticated deep models [77], [79] and large-scale datasets [18], [25]. Meanwhile, deep learning based methods can obtain relatively satisfactory accuracy on public datasets via combining multiple features [24], focusing on semantic regions [35], augmenting data [21], and training in a unified pipeline [132]. In addition, many methods also improve their accuracy via adopting different strategies, i.e., improved encoding [31], [32], contextual modeling [36], [45], attention policies [23], [43], and multi-task learning [22], [40].
4.2 Performance on RGB-D scene datasets

TABLE 4
Performance (%) comparison of representative methods on benchmark RGB-D scene datasets. For each dataset, the top three scores are highlighted.

TABLE 5
Ablation study on benchmark datasets to validate the performance (%) contribution of depth information.
Method             | MaCAFF [46]  | MSMM [47]    | DCNN [89]    | DF2Net [152] | TRecgNet [48] | ACM [144]    | CBCL [170]  | KFS [103]    | MSN [44]
NYUD2    Baseline  | 53.5         | −            | 53.4         | 61.1         | 53.7          | 55.4         | 66.4        | 53.5         | 53.5
NYUD2    +Depth    | 63.9 (↑10.4) | −            | 67.5 (↑14.1) | 65.4 (↑4.3)  | 67.5 (↑13.8)  | 67.4 (↑12.0) | 69.7 (↑3.3) | 67.8 (↑14.3) | 68.1 (↑14.6)
SUN RGBD Baseline  | 40.4         | 41.5         | 44.3         | 46.3         | 42.6          | 45.7         | 48.8        | 36.1         | −
SUN RGBD +Depth    | 48.1 (↑7.7)  | 52.3 (↑10.8) | 53.8 (↑9.5)  | 54.6 (↑8.3)  | 53.3 (↑10.7)  | 55.1 (↑9.4)  | 57.8 (↑9.0) | 41.3 (↑5.2)  | −

The accuracy of different methods on RGB-D datasets is summarized in Table 4. By adding depth information with different fusion strategies, accuracy results (see Table 5) are improved by 10.8% and 8.8% on average on the NYUD2 and SUN RGBD datasets, respectively. Since depth data provide extra information to train the classification model, this observation is within expectation. Notably, it is more difficult to improve the performance on a large dataset (SUN RGBD) than on a small dataset (NYUD2). Moreover, the best results on the NYUD2 and SUN RGBD datasets, achieved by CBCL [170], are as high as 69.7% and 57.84%, respectively.

RGB-D scene data for training are relatively limited, while the dimension of scene features is high. Hence, Support Vector Machines (SVMs) were commonly used for RGB-D scene classification [47], [72], [89], [160] in the early stages. Thanks to data augmentation and back-propagation, the Softmax classifier has become progressively popular, and it is an important reason for the comparable performance of [44], [48], [96], [170].

Many methods, such as [46], [50], fine-tune RGB-CNNs to extract deep features of the depth modality, where the training process is simple and the computational cost is low. To adapt to depth data, Song et al. [89], [171] used weakly-supervised learning to train depth-specific models from scratch, which achieves a gain of 3.5% accuracy compared to the fine-tuned RGB-CNN. TRecgNet [48], which is based on semi-supervised learning, requires a complicated training process and high computational cost, but it obtains comparable results (69.2% on NYUD2 and 56.7% on SUN RGBD).

Feature-level fusion based methods are commonly used due to their high cost-effectiveness, e.g., [96], [144], [170]. In comparison, consistent-feature based and distinctive-feature based modal fusion use more complex fusion layers with higher cost in terms of inference speed and training complexity, but they generally yield more effective features [44], [103], [152].

We can observe that the field of RGB-D scene classification has been constantly improving. Weakly-supervised and semi-supervised learning are useful for learning depth-specific features [47], [48], [89]. Moreover, multi-modal feature fusion is a major issue for improving performance on public datasets [44], [152], [170]. In addition, effective strategies (like the contextual strategy [144], [145] and attention mechanisms [44], [96]) are also popular for RGB-D scene classification. Nevertheless, the accuracy achieved by current methods is far from expectation and there remains significant room for future improvement.

5 CONCLUSION AND OUTLOOK

As a contemporary survey of scene classification using deep learning, this paper has highlighted the recent achievements, provided a structural taxonomy of various methods according to their roles in scene representation for scene classification, analyzed their advantages and limitations, summarized existing popular scene datasets, and discussed the performance of the most representative approaches. Despite great progress, there are still many unsolved problems. Thus, in this section, we point out these problems and introduce some promising trends for future research. We hope that this survey not only provides a better understanding of scene classification for researchers but also stimulates future research activities.

Develop more advanced network frameworks. With the development of deep CNN architectures, from generic CNNs [17], [77], [78], [79] to scene-specific CNNs [22], [23], the accuracy of scene classification has been improving steadily. Nevertheless, there is still much work to be done on the theoretical foundations of deep learning [172]. Solidifying this theoretical basis is a promising direction for obtaining more advanced network frameworks. In particular, it is essential to design specific frameworks for scene classification, such as using automated Neural Architecture Search (NAS) [173], [174], or designing according to scene attributes.
Release rich scene datasets. Deep learning based models require enormous amounts of data to initialize their parameters so that they can learn scene knowledge well [18], [25]. However, compared to the scenes of the real world, the publicly available datasets are not large or rich enough, so it is essential to release datasets that encompass rich and highly diverse environmental scenes [175], especially large-scale RGB-D scene datasets. As opposed to object/texture datasets, scene appearance may change dramatically over time, and new functional scenes emerge as humans develop new activity places. Therefore, original scene data need to be updated and new scene datasets released regularly.

Reduce the dependence on labeled scene images. The success of deep learning heavily relies on gargantuan amounts of labeled images. However, the labeled training images are always very limited, so supervised learning is not scalable in the absence of fully labeled training data, and its generalization ability to classify scenes frequently deteriorates. Therefore, it is desirable to reduce the dependence on large amounts of labeled data. To alleviate this difficulty, when large amounts of unlabeled data are available, it is necessary to further study semi-supervised learning [176], unsupervised learning [177], or self-supervised learning [178]. In the even more constrained setting without any unlabeled training data, the ability to learn from only a few labeled images, i.e., small-sample learning [179], is also appealing.

Few shot scene classification. The success of generic CNNs for scene classification relies heavily on gargantuan amounts of labeled training data [107]. Due to the large intra-class variation among scenes, scene datasets cannot cover all classes, so the performance of CNNs frequently deteriorates and fails to generalize well. In contrast, humans can learn a visual concept quickly from very few given examples and often generalize well [180], [181]. Inspired by this, domain adaptation approaches utilize the knowledge of labeled data in task-relevant domains to execute new tasks in the target domain [182], [183]. Furthermore, domain generalization methods aim at learning generic representations from multiple task-irrelevant domains to generalize to unseen scenarios [184], [185].

Robust scene classification. Once scene classification developed in a laboratory environment is deployed in real application scenarios, a variety of unacceptable failures will still occur; that is, robustness in open environments is a bottleneck that restricts pattern recognition technology. The main reasons why pattern recognition systems are not robust are their basic assumptions, e.g., the closed-world assumption, the independent and identically distributed assumption, and the big data assumption [186], which are main differences between machine learning and human intelligence; hence, it is a fundamental challenge to improve robustness by breaking these assumptions. It is natural to consider utilizing adversarial training and optimization [187], [188], [189], which have been applied to pattern recognition [190], [191].

Realtime scene classification. Many methods for scene classification, trained in a multiple-stage manner, are computationally expensive for current mobile/wearable devices, which have limited storage and computational capability; therefore researchers have begun to develop convenient and efficient unified networks (encapsulating all computation in a one-stage network) [22], [31], [44]. Moreover, it is also a challenge to keep the model scalable and efficient when big data from smart wearables and mobile applications grows rapidly in size, temporally or spatially [192].

Imbalanced scene classification. The Places365 challenge dataset [25] has more than 8M training images, and the numbers of images in different classes range from 4,000 to 30,000 per class. This shows that scene categories are imbalanced, i.e., some categories are abundant while others have scarce examples. Generally, the minority class samples are poorly predicted by the learned model [193], [194]. Therefore, learning a model which respects both types of categories and performs equally well on frequent and infrequent ones remains a great challenge and needs further exploration [194], [195], [196].

Continuous scene classification. The ultimate goal is to develop methods capable of accurately and efficiently classifying samples in thousands or more unseen scene categories in open environments [107]. The classic deep learning paradigm learns in isolation, i.e., it needs many training examples and is only suitable for well-defined tasks in closed environments [76], [197]. In contrast, "human learning" is a continuous process of learning and adapting to new environments: humans accumulate the knowledge gained in the past and use this knowledge to help future learning and problem solving with possible adaptations [198]. Ideally, a system should also be capable of discovering unknown scenarios and learning new tasks in a self-supervised manner. Inspired by this, it is necessary to pursue lifelong machine learning by developing versatile systems that continually accumulate and refine their knowledge over time [199], [200]. Such lifelong machine learning has represented a long-standing challenge for deep learning and, consequently, for artificial intelligence systems.

Multi-label scene classification. Many scenes have semantic multiplicity [68], [69], i.e., a scene image may belong to multiple semantic classes. Such a problem poses a challenge to the classic pattern recognition paradigm and requires developing multi-label learning methods [69], [201]. Moreover, when constructing scene datasets, most researchers either avoid labeling multi-label images or use the most obvious class (single label) to subjectively annotate each image [68]. Hence, it is hard to improve the generalization ability of a model trained on single-label datasets, which also brings problems to the classification task.

Other-modal scene classification. RGB images provide key features such as the color, texture, and spectrum of objects. Nevertheless, the scenes reproduced by RGB images may suffer from uneven lighting, target occlusion, etc. Therefore, the robustness of RGB scene classification is poor, and it is difficult to accurately extract key information such as target contours and spatial positions. In contrast, the rapid development of sensors has made the acquisition of other modal data easier and easier, such as RGB-D [72], video [202], and 3D point clouds [203]. Recently, research on recognizing and understanding various modalities has attracted increasing attention [89], [204], [205].

APPENDIX A
A ROAD MAP OF SCENE CLASSIFICATION IN 20 YEARS

Scene representation, or scene feature extraction, the process of converting a scene image into feature vectors, plays a critical role in scene classification, and thus is the focus of research in this field.
In the past two decades, remarkable progress has been witnessed in scene representation, which mainly consists of two important generations: handcrafted feature engineering, and deep learning (feature learning). The milestones of scene classification in the past two decades are presented in Fig. 12, in which the two main stages (SIFT vs. DNN) are highlighted.

Fig. 12. Milestones of scene classification. Handcrafted features gained tremendous popularity, starting from SIFT [206] and GIST [207]. Then, HoG [208] and CENTRIST [209] were proposed by Dalal et al. and Wu et al., respectively, further promoting the development of scene classification. In 2003, Sivic et al. [210] proposed the BoVW model, marking the beginning of codebook learning. Along this way, more effective BoVW based methods, SPM [14], IFV [104] and VLAD [105], also emerged to deal with larger-scale tasks. In 2010, Object Bank [15], [211] represented the scene with object attributes, marking the beginning of more semantic representations. Then, Juneja et al. [212] proposed Bags of Parts to learn distinctive parts of scenes automatically. In 2012, AlexNet [17] reignited the field of artificial neural networks. Since then, CNN-based methods, VGGNet [77], GoogLeNet [78] and ResNet [79], have begun to take over from handcrafted methods. Additionally, Razavian et al. [213] highlighted the effectiveness and generality of CNN representations for different tasks. Along this way, in 2014, Gong et al. [28] proposed MOP-CNN, the first deep learning method for scene classification. Later, FV-CNN [82], Semantic FV [30] and GAP-CNN [23] were proposed one after another to learn more effective representations. For datasets, ImageNet [56] triggered the breakthrough of deep learning. Then, Xiao et al. [57] proposed the SUN database to evaluate numerous algorithms for scene classification. Later, Places [18], [25], currently the largest scene database, emerged to satisfy the needs of deep learning training. Additionally, SUN RGB-D [72] was introduced, marking the beginning of deep learning for RGB-D scene classification.

Handcrafted feature engineering era. From 1995 to 2012, the field was dominated by the Bag of Visual Words (BoVW) model [106], [210], [214], [215], borrowed from document classification, which represents a document as a vector of word occurrence counts over a global word vocabulary. In the image domain, BoVW first probes an image with local feature descriptors such as the Scale Invariant Feature Transform (SIFT) [206], [216], and then represents the image statistically as an orderless histogram over a pre-trained visual vocabulary, in a similar form to a document. Some important variants of BoVW, such as Bag of Semantics [15], [126], [140], [212] and the Improved Fisher Vector (IFV) [104], have also been proposed.

Local invariant feature descriptors play an important role in BoVW because they are discriminative, yet less sensitive to image variations such as illumination, scale, rotation, viewpoint, etc., and thus have been widely studied. Representative local descriptors for scene classification started from SIFT [206], [216] and GIST [207], [217]. Other local descriptors, such as Local Binary Patterns (LBP) [218], the Deformable Part Model (DPM) [219], [220], [221], and the CENsus TRansform hISTogram (CENTRIST) [209], also contributed to the development of scene classification. To improve performance, the research focus shifted to feature encoding and aggregation, mainly including Bag-of-Visual-Words (BoVW) [106], Latent Dirichlet Allocation (LDA) [222], Histogram of Gradients (HoG) [208], Spatial Pyramid Matching (SPM) [14], [223], the Vector of Locally Aggregated Descriptors (VLAD) [105], Fisher kernel coding [104], [125], Multi-Resolution BoVW (MR-BoVW) [224], and Orientational Pyramid Matching (OPM) [225]. The quality of the learned codebook has a great impact on the coding procedure. Generic codebooks mainly include Fisher kernels [104], [125], sparse codebooks [226], [227], Locality-constrained Linear Codes (LLC) [228], Histogram Intersection Kernels (HIK) [229], contextual visual words [230], Efficient Match Kernels (EMK) [231] and Supervised Kernel Descriptors (SKDES) [232]. In particular, semantic codebooks are generated from salient regions, like Object Bank [15], [211], [233], object-to-class [234], Latent Pyramidal Regions (LPR) [235], Bags of Parts (BoP) [212] and Pairwise Constraints based Multiview Subspace Learning (PC-MSL) [236], capturing more discriminative features for scene classification.

Deep learning era. In 2012, Krizhevsky et al. [17] introduced a DNN, commonly referred to as "AlexNet", for the object classification task, and achieved breakthrough performance surpassing the best result of hand-engineered features by a large margin, thus triggering the recent revolution in AI. Since then, deep learning has started to dominate various tasks (like computer vision [80], [107], [237], [238], speech recognition [239], autonomous driving [240], cancer detection [241], [242], machine translation [243], playing complex games [244], [245], [246], [247], earthquake forecasting [248], and medicine discovery [249], [250]), and scene classification is no exception, leading to a new generation of scene representation methods with remarkable performance improvements. Such substantial progress can mainly be attributed to advances in deep models including VGGNet [77], GoogLeNet [78], ResNet [79], etc., the availability of large-scale image datasets like ImageNet [56] and Places [18], [25], and more powerful computational resources.

Deep learning networks have gradually replaced the local feature descriptors of the first generation methods and are certainly the engine of scene classification. Although the major driving force of progress in scene classification has been the incorporation of deep learning networks, the general pipelines like BoVW and the feature encoding and aggregation methods of the first generation, like the Fisher Vector and VLAD, have also been adapted in current deep learning based scene methods, e.g., MOP-CNN [28], SCFVC [83], MPP-CNN [29], DSP [102], Semantic FV [30], LatMoG [58], MFA-FS [42] and DUCA [20]. To take full advantage of back-propagation, scene representations are extracted from end-to-end trainable CNNs, like DAG-CNN [37], MFAFVNet [31], VSAD [32], G-MS2F [39], and DL-CNN [22]. To focus on the main content of the scene, object detection is used to capture salient regions, such as in MetaObject-CNN [33], WELDON [34], SDO [35], and BiLSTM [36]. Since features from multiple CNN layers or multiple views are complementary, many works [19], [21], [24], [40], [41] have also explored their complementarity to improve performance. In addition, there exist many strategies (like attention mechanisms, contextual modeling, and multi-task learning with regularization terms) to enhance representation ability, such as CFA [24], BiLSTM [36], MAPNet [96], MSN [44], and LGN [45]. For datasets, because depth images from RGB-D cameras are not vulnerable to illumination changes, researchers have started to explore RGB-D scene recognition since 2015. Some works [47], [48], [89] focus on depth-specific feature learning, while other alternatives, like DMMF [50], ACM [144], and MSN [44], focus on multi-modal feature fusion.
APPENDIX B
A BRIEF INTRODUCTION TO DEEP LEARNING

In 2012, the breakthrough object classification result on the large-scale ImageNet dataset [56], achieved by a deep learning network [17], is arguably what reignited the field of Artificial Neural Networks (ANNs) and triggered the recent revolution in AI. Since then, deep learning, or Deep Neural Networks (DNNs) [251], has shone in a broad range of areas, including computer vision [17], [80], [107], [237], [238], speech recognition [239], autonomous driving [240], cancer detection [241], [242], machine translation [243], playing complex games [244], [245], [246], [247], earthquake forecasting [248], and medicine discovery [249], [250]. In many of these domains, DNNs have reached breakthrough levels of performance, often approaching and sometimes exceeding the abilities of human experts. Thanks to the growth of big data and more powerful computational resources, deep learning and AI for scientific research are evolving quickly, with new developments appearing continually for analyzing datasets, discovering patterns, and predicting behaviors in almost all fields of science and technology [98].

Fig. 13. (a) A typical CNN architecture, VGGNet [77] with 16 weight layers. The network has 13 convolutional layers, 5 max pooling layers (the last one is a global max pooling) and 3 fully-connected layers (the last one uses the Softmax function as its nonlinear activation). The whole network can be learned from labeled training data by optimizing a loss function (e.g., the cross-entropy loss). (b) Illustration of the basic operations (i.e., convolution, nonlinearity and downsampling) that are repeatedly applied in a typical CNN. (b1) The outputs (called feature maps) of each layer of a typical CNN applied to a scene image; each feature map corresponds to the output for one of the learned 3D filters (see (b2)).

In computer vision, the most commonly used type of DNN is the Convolutional Neural Network (CNN) [76]. As illustrated in Figure 13 (a), the frontend of a typical CNN is a stack of CONV layers and pooling layers that learn generic-level features, and these features are further transformed into class-specific discriminative representations by training multiple layers on target datasets. As we slide a convolution filter over the width and height of an input with 3 color channels, we produce a two-dimensional (2-D) activation map, as shown in Figure 13 (b1), giving the responses of that filter at every spatial position. As shown in Figure 13 (b2), a 2-D convolution $x^{l-1} * w^l$ describes a weighted average of an input map $x^{l-1}$ from the previous layer $l-1$, where the corresponding weighting is given by $w^l$; since every filter extends through the full depth of the input maps with $C$ channels, the sum of 2-D convolutions gives the result of a 3-D convolution, i.e.,

$\sum_{i=1}^{C} x_i^{l-1} * w_i^l.$   (1)

During the forward pass, the $j$-th neuron in the $l$-th CONV layer operates a 3-D convolution with $N^{l-1}$ channels between the input maps $x^{l-1}$ and the corresponding filter $w_j^l$, adds a bias term $b_j^l$, and passes the result through a nonlinear function $\sigma(\cdot)$ to obtain the final output $x_j^l$, i.e.,

$x_j^l = \sigma\Big(\sum_{i=1}^{N^{l-1}} x_i^{l-1} * w_{i,j}^l + b_j^l\Big).$   (2)
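For readers less familiar with these operations, the following NumPy sketch implements Eqs. (1) and (2) directly, with $\sigma$ taken to be ReLU and convolution implemented as the cross-correlation used in CNN practice; the input and filter sizes are arbitrary toy values.

```python
import numpy as np

def conv2d_single(x, w):
    """Valid 2-D convolution (implemented, as in CNN practice, as cross-correlation)
    of one input map x (H x W) with one filter w (k x k)."""
    H, W = x.shape
    k = w.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + k, c:c + k] * w)
    return out

def conv_layer_neuron(x_prev, w_j, b_j):
    """Eq. (2): x_j^l = sigma(sum_i x_i^{l-1} * w_{i,j}^l + b_j^l), with sigma = ReLU.
    x_prev: (C, H, W) input maps; w_j: (C, k, k) one 3-D filter; b_j: scalar bias."""
    acc = sum(conv2d_single(x_prev[i], w_j[i]) for i in range(x_prev.shape[0]))  # Eq. (1)
    return np.maximum(acc + b_j, 0.0)  # ReLU nonlinearity

# Toy usage: a 3-channel 8x8 input and one 3x3x3 filter produce a 6x6 output map.
x = np.random.randn(3, 8, 8)
w = np.random.randn(3, 3, 3)
print(conv_layer_neuron(x, w, 0.1).shape)  # (6, 6)
```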
recurrent neural networks,” IEEE TIP, vol. 25, no. 7, pp. 2983–2996, [55] S. Cai, J. Huang, D. Zeng, X. Ding, and J. Paisley, “MEnet: A metric
2016. expression network for salient object segmentation,” in IJCAI, 2018,
[28] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, “Multi-scale orderless pp. 598–605.
pooling of deep convolutional activation features,” in ECCV, 2014, [56] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet:
pp. 392–407. A large-scale hierarchical image database,” in CVPR, 2009, pp. 248–
[29] D. Yoo, S. Park, J.-Y. Lee, and I. So Kweon, “Multi-scale pyramid 255, http://image-net.org/download.
pooling for deep convolutional representation,” in CVPRW, 2015, [57] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun
pp. 71–80. database: Large-scale scene recognition from abbey to zoo,” in CVPR,
[30] M. Dixit, S. Chen, D. Gao, N. Rasiwasia, and N. Vasconcelos, “Scene 2010, pp. 3485–3492, http://places2.csail.mit.edu/download.html.
classification with semantic fisher vectors,” in CVPR, 2015, pp. 2974– [58] R. G. Cinbis, J. Verbeek, and C. Schmid, “Approximate fisher kernels
2983. of non-iid image models for image categorization,” IEEE TPAMI,
[31] Y. Li, M. Dixit, and N. Vasconcelos, “Deep scene image classification vol. 38, no. 6, pp. 1084–1098, 2015.
with the MFAFVNet,” in ICCV, 2017, pp. 5746–5754. [59] X. Wei, S. L. Phung, and A. Bouzerdoum, “Visual descriptors for
[32] Z. Wang, L. Wang, Y. Wang, B. Zhang, and Y. Qiao, “Weakly scene categorization: Experimental evaluation,” AI Review, vol. 45,
supervised patchnets: Describing and aggregating local patches for no. 3, pp. 333–368, 2016.
scene recognition,” IEEE TIP, vol. 26, no. 4, pp. 2028–2041, 2017. [60] G. Cheng, J. Han, and X. Lu, “Remote sensing image scene
classification: Benchmark and state of the art,” Proceedings of the IEEE,
[33] R. Wu, B. Wang, W. Wang, and Y. Yu, “Harvesting discriminative
vol. 105, no. 10, pp. 1865–1883, 2017.
meta objects with deep CNN features for scene classification,” in
ICCV, 2015, pp. 1287–1295. [61] L. Xie, F. Lee, L. Liu, K. Kotani, and Q. Chen, “Scene recognition: A
comprehensive survey,” Pattern Recognition, vol. 102, p. 107205, 2020.
[34] T. Durand, N. Thome, and M. Cord, “WELDON: Weakly supervised
[62] K. Nogueira, O. A. Penatti, and J. A. Dos Santos, “Towards better
learning of deep convolutional neural networks,” in CVPR, 2016, pp.
exploiting convolutional neural networks for remote sensing scene
4743–4752.
classification,” Pattern Recognition, vol. 61, pp. 539–556, 2017.
[35] X. Cheng, J. Lu, J. Feng, B. Yuan, and J. Zhou, “Scene recognition [63] F. Hu, G.-S. Xia, J. Hu, and L. Zhang, “Transferring deep
with objectness,” Pattern Recognition, vol. 74, pp. 474–487, 2018. convolutional neural networks for the scene classification of high-
[36] C. Laranjeira, A. Lacerda, and E. R. Nascimento, “On modeling resolution remote sensing imagery,” Remote Sensing, vol. 7, no. 11,
context from objects with a long short-term memory for indoor scene pp. 14 680–14 707, 2015.
recognition,” in SIBGRAPI, 2019, pp. 249–256. [64] A. Mesaros, T. Heittola, and T. Virtanen, “Tut database for acoustic
[37] S. Yang and D. Ramanan, “Multi-scale recognition with DAG- scene classification and sound event detection,” in EUSIPCO, 2016,
CNNs,” in ICCV, 2015, pp. 1215–1223. pp. 1128–1132.
[38] G.-S. Xie, X.-Y. Zhang, S. Yan, and C.-L. Liu, “Hybrid CNN [65] Z. Ren, K. Qian, Y. Wang, Z. Zhang, V. Pandit, A. Baird, and
and dictionary-based models for scene recognition and domain B. Schuller, “Deep scalogram representations for acoustic scene
adaptation,” IEEE TCSVT, vol. 27, no. 6, pp. 1263–1274, 2015. classification,” JAS, vol. 5, no. 3, pp. 662–669, 2018.
[39] P. Tang, H. Wang, and S. Kwong, “G-MS2F: Googlenet based [66] S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke,
multi-stage feature fusion of deep CNN for scene recognition,” and M. J. Milford, “Visual place recognition: A survey,” IEEE T-RO,
Neurocomputing, vol. 225, pp. 188–197, 2017. vol. 32, no. 1, pp. 1–19, 2015.
[40] S. Guo, W. Huang, L. Wang, and Y. Qiao, “Locally supervised deep [67] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic,
hybrid model for scene recognition,” IEEE TIP, vol. 26, no. 2, pp. “NetVLAD: CNN architecture for weakly supervised place
808–820, 2016. recognition,” in CVPR, 2016, pp. 5297–5307.
[41] L. Wang, S. Guo, W. Huang, Y. Xiong, and Y. Qiao, “Knowledge [68] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-
guided disambiguation for large-scale scene classification with label scene classification,” Pattern Recognition, vol. 37, no. 9, pp. 1757–
multi-resolution CNNs,” IEEE TIP, vol. 26, no. 4, pp. 2055–2068, 1771, 2004.
2017. [69] M.-L. Zhang and Z.-H. Zhou, “Multi-label learning by instance
[42] M. D. Dixit and N. Vasconcelos, “Object based scene representations differentiation,” in AAAI, vol. 7, 2007, pp. 669–674.
using fisher scores of local subspace projections,” in NeurIPS, 2016, [70] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in
pp. 2811–2819. CVPR, 2009, pp. 413–420, http://web.mit.edu/torralba/www/
[43] A. López-Cifuentes, M. Escudero-Viñolo, J. Bescós, and Á. Garcı́a- indoor.html.
Martı́n, “Semantic-aware scene recognition,” Pattern Recognition, vol. [71] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor
102, p. 107256, 2020. segmentation and support inference from RGB-D images,” in ECCV,
[44] Z. Xiong, Y. Yuan, and Q. Wang, “MSN: Modality separation 2012, pp. 746–760, https://cs.nyu.edu/∼silberman/datasets/nyu
networks for RGB-D scene recognition,” Neurocomputing, vol. 373, depth v2.html.
pp. 81–89, 2020. [72] S. Song, S. P. Lichtenberg, and J. Xiao, “SUN RGB-D: A RGB-D scene
[45] G. Chen, X. Song, H. Zeng, and S. Jiang, “Scene recognition with understanding benchmark suite,” in CVPR, 2015, pp. 567–576, https:
prototype-agnostic scene layout,” IEEE TIP, vol. 29, pp. 5877–5888, //github.com/ankurhanda/sunrgbd-meta-data.
2020. [73] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich
features from RGB-D images for object detection and segmentation,”
[46] A. Wang, J. Cai, J. Lu, and T.-J. Cham, “Modality and component
in ECCV, 2014, pp. 345–360.
aware feature fusion for RGB-D scene classification,” in CVPR, 2016,
pp. 5995–6004. [74] G. A. Miller, “Wordnet: A lexical database for english,” Communica-
tions of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[47] X. Song, S. Jiang, and L. Herranz, “Combining models from multiple
[75] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and
sources for RGB-D scene recognition.” in IJCAI, 2017, pp. 4523–4529.
A. Zisserman, “The pascal visual object classes (voc) challenge,”
[48] D. Du, L. Wang, H. Wang, K. Zhao, and G. Wu, “Translate-to- IJCV, vol. 88, no. 2, pp. 303–338, 2010.
recognize networks for RGB-D scene recognition,” in CVPR, 2019, [76] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol.
pp. 11 836–11 845. 521, no. 7553, pp. 436–444, 2015.
[49] Y. Cheng, R. Cai, Z. Li, X. Zhao, and K. Huang, “Locality- [77] K. Simonyan and A. Zisserman, “Very deep convolutional networks
sensitive deconvolution networks with gated fusion for RGB-D for large-scale image recognition,” ICLR, 2015.
indoor semantic segmentation,” in CVPR, 2017, pp. 3029–3037. [78] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
[50] H. Zhu, J.-B. Weibel, and S. Lu, “Discriminative multi-modal feature D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with
fusion for RGB-D indoor scene recognition,” in CVPR, 2016, pp. convolutions,” in CVPR, 2015, pp. 1–9.
2969–2976. [79] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
[51] R. Girshick, “Fast R-CNN,” in ICCV, 2015, pp. 1440–1448. image recognition,” in CVPR, 2016, pp. 770–778.
[52] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real- [80] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich
time object detection with region proposal networks,” IEEE TPAMI, feature hierarchies for accurate object detection and semantic
vol. 39, no. 6, pp. 1137–1149, 2016. segmentation,” in CVPR, 2014, pp. 580–587.
[53] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep con- [81] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are
volutional encoder-decoder architecture for image segmentation,” features in deep neural networks?” in NeurIPS, 2014, pp. 3320–3328.
IEEE TPAMI, vol. 39, no. 12, pp. 2481–2495, 2017. [82] M. Cimpoi, S. Maji, and A. Vedaldi, “Deep filter banks for texture
[54] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, recognition and segmentation,” in CVPR, 2015, pp. 3828–3836.
“Deeplab: Semantic image segmentation with deep convolutional [83] L. Liu, C. Shen, L. Wang, A. Van Den Hengel, and C. Wang,
nets, atrous convolution, and fully connected crfs,” IEEE TPAMI, “Encoding high dimensional local features by sparse coding based
vol. 40, no. 4, pp. 834–848, 2017. fisher vectors,” in NeurIPS, 2014, pp. 1143–1151.
[84] Z. Bolei, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object [115] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie,
detectors emerge in deep scene CNNs,” ICLR, 2015. “Feature pyramid networks for object detection,” in CVPR, 2017, pp.
[85] L. Liu, J. Chen, P. Fieguth, G. Zhao, R. Chellappa, and M. Pietikäinen, 2117–2125.
“From BoW to CNN: Two decades of texture representation for [116] C. G. Snoek, M. Worring, and A. W. Smeulders, “Early versus late
Delu Zeng received his Ph.D. degree in electronic and information engineering from South China University of Technology, China, in 2009. He is now a full professor in the School of Mathematics at South China University of Technology, China. He has been a visiting scholar at Columbia University, the University of Oulu, and the University of Waterloo. His research focuses on applied mathematics and its interdisciplinary applications; his interests include numerical computation, applications of partial differential equations, optimization, machine learning, and their applications in image processing and data analysis.
Minyu Liao received her B.S. degree in applied mathematics from Shantou
University, China, in 2018. She is currently pursuing the master’s degree
in computational mathematics at South China University of Technology,
China. Her research interests include computer vision, scene recognition,
and deep learning.
Yulan Guo received his Ph.D. degree from the National University of De-
fense Technology (NUDT) in 2015, where he is currently an associate pro-
fessor. He was a visiting Ph.D. student with the University of Western Aus-
tralia from 2011 to 2014. He has authored over 90 articles in journals and
conferences, such as the IEEE TPAMI and IJCV. His current research inter-
ests focus on 3D vision, particularly on 3D feature learning, 3D modeling,
3D object recognition, and scene understanding.
Dewen Hu received the B.S. and M.S. degrees from Xi’an Jiaotong Uni-
versity, China, in 1983 and 1986, respectively, and the Ph.D. degree from
the National University of Defense Technology in 1999. He is currently a
Professor at the School of Intelligent Science, National University of Defense
Technology. From October 1995 to October 1996, he was a Visiting Scholar
with the University of Sheffield, U.K. His research interests include image
processing, system identification and control, neural networks, and cogni-
tive science.
Matti Pietikäinen received his Ph.D. degree from the University of Oulu,
Finland. He is now emeritus professor with the Center for Machine Vision
and Signal Analysis, University of Oulu. He is a fellow of the IEEE for fun-
damental contributions, e.g., to Local Binary Pattern (LBP) methodology,
texture based image and video analysis, and facial image analysis. He has
authored more than 350 refereed papers in international journals, books,
and conferences. His papers have nearly 68,700 citations in Google Scholar
(h-index 92). He was the recipient of the IAPR King-Sun Fu Prize 2018 for
fundamental contributions to texture analysis and facial image analysis.