Robust One-Class Classification Using Deep Kernel Spectral Regression
Salman Mohammad, Shervin Rahimzadeh Arashloo
Neurocomputing 573 (2024) 127246
Abstract

The existing one-class classification (OCC) methods typically presume the existence of a pure target training set and generally face difficulties when the training set is contaminated with non-target objects. This work addresses this aspect of the OCC problem and formulates an effective method that leverages the advantages of kernel-based methods to achieve robustness against training label noise while enabling direct deep learning of features from the data to optimise a Fisher-based loss function in the Hilbert space. As such, the proposed OCC approach can be trained in an end-to-end fashion while, by virtue of a Tikhonov regularisation in the Hilbert space, it provides high robustness against the training set contamination. Extensive experiments conducted on multiple datasets in different application scenarios demonstrate that the proposed methodology is robust and performs better than the state-of-the-art algorithms for OCC when the training set is corrupted by contamination.

Keywords: One-class classification; Label contamination; Fisher null transformation; Deep convolutional learning; Reproducing kernel Hilbert space (RKHS); Tikhonov regularisation
∗ Corresponding author. E-mail address: [email protected] (S. Rahimzadeh Arashloo).
https://doi.org/10.1016/j.neucom.2024.127246
Received 1 June 2023; Received in revised form 2 October 2023; Accepted 7 January 2024; Available online 11 January 2024.
© 2024 Elsevier B.V. All rights reserved.
as counter-examples to enhance the robustness of the classifier against noise. In spite of considerable advancements in one-class learning over the last couple of years [22,23], the majority of existing methods are not specifically designed to operate on a contaminated training set where some abnormal objects may exist, and implicitly presume the existence of a pure and uncontaminated normal training set. Although there exist a number of non-deep studies [21,24] that are well-suited for such an application scenario, they require substantial feature engineering. On the other hand, the recent state-of-the-art deep learning-based approaches [23,25] are not specifically equipped to provide robustness against noisy samples and may face a deterioration in the performance when the training data is contaminated with noisy objects. Among others, the work in [21] presents a non-deep method that operates in a reproducing kernel Hilbert space and provides robustness against noise through a Tikhonov regularisation on the discriminant in the Hilbert space. A comparison of this method against other approaches confirms its superior performance on contaminated datasets where the training set includes incorrectly labelled abnormal objects. Although this approach is quite robust to contamination and can also be used for observation ranking effectively, it has certain limitations. In particular, as the method in [21] is unable to learn representations directly from the data, substantial feature engineering is required to extract discriminative features to effectively utilise this approach. Consequently, the results are highly dependent on the features fed to the one-class learner. Moreover, at its core, the method in [21] operates on an alternating optimisation approach that requires a matrix inversion operation, which may not only be computationally expensive when the number of training observations is large but may also be very demanding in terms of the system memory requirements during the training stage, which impedes its applicability to large-scale datasets.

1.1. Contributions

The current study presents a deep one-class approach which is robust against contamination in the training set. For this purpose, the proposed method builds on the concepts from kernel machines [21] combined with those of deep convolutional structures to enable end-to-end learning while providing robustness against the training set contamination. To this end, the proposed approach draws on the method presented in [21] and addresses the shortcomings of the current OCC methods by combining a regularised kernel-based learning formalism with a convolutional network learning approach in a way that it is trainable in an end-to-end fashion. Furthermore, the proposed method enables a ranking of samples in a dataset according to how well each object matches the bulk of observations. The main contributions of the proposed approach may be summarised as:

• Combining a kernel-based learning formalism with a deep-learning-based methodology to enable learning a one-class classifier in an end-to-end fashion;
• A Tikhonov regularisation of the discriminant in the Hilbert space to improve the robustness of the proposed deep kernel-based model;
• Learning the discriminant in the Hilbert space and the network parameters jointly via a gradient-based approach to address memory and computational constraints;
• An extensive evaluation of the proposed approach on different datasets and a comparison against existing methods in different application scenarios.

1.2. Outline

The remainder of the article is organised as follows: Section 2 reviews related work on OCC with an emphasis on robust techniques developed in the literature. In Section 3, a brief background on the one-class kernel spectral regression technique [21] is provided. The proposed deep robust one-class kernel spectral regression method is presented in Section 4, where we discuss the proposed loss function, the optimisation algorithm, and the decision strategy for classification. Section 5 presents and discusses the results of an evaluation of the proposed approach on multiple datasets and compares its performance against state-of-the-art techniques. Finally, Section 7 provides concluding remarks.

2. Related work

Although other categorisations exist [26], OCC methods may be broadly categorised into generative methods and discriminative approaches. Generative methods focus on modelling the underlying generative process of the data, for example, by learning the underlying distribution, whereas discriminative methods try to directly learn the optimal decision boundary that best separates the positive samples from anomalies. Within each category, the one-class description may be learned following either a deep or non-deep learning methodology. An example of the generative non-deep methods is the kernel PCA [27] approach that projects the training samples onto the reproducing kernel Hilbert space. The novelty score is then calculated using the test sample's reconstruction error in the considered subspace. An alternative robust PCA introduced in [28] uses the 𝓁₁-norm instead of the commonly used 𝓁₂-norm to handle anomalies in the training set. A more recent approach that can potentially tolerate outliers by up to the square of the number of inliers is that of the Dual Principal Component Pursuit [24], which solves a non-convex 𝓁₁-norm optimisation problem and provides an improvement in the performance compared to some other robust PCA methods. A fast and non-convex algorithm known as fast median subspace (FMS) was proposed in [29] that robustly determines the underlying subspace while being computationally less complex compared to other similar methods. The work in [30] presents a weighted graph-based method that defines a Markov chain and identifies outliers via random walks. Instances of the generative deep OCC methods include approaches based on deep autoencoders [31] which aim to learn salient features from the normal class by trying to minimise the reconstruction error. Such methods are used either in conjunction with classical anomaly detection methods by providing the learned embedding as input [32,33] to a one-class learner, or by directly utilising the reconstruction error as the anomaly score [34]. Denoising autoencoders [35] and sparse [36] and deep convolutional autoencoders [37] are some examples of the well-known autoencoder-based anomaly detection methods. The ICS method [25], which may be considered a successful generative method in this category, trains a CNN for one-class classification based on splitting the training data into two parts of typical and non-typical normal subsets. One of the drawbacks of autoencoder-based approaches is the difficulty in selecting a suitable degree of compression, as estimating the inherent dimensionality of the data typically poses a significant challenge [38]. Some other generative deep OCC approaches are based on Generative Adversarial Networks (GANs) that have shown promise for anomaly detection. The work in [13] is one of the recent GAN-based methods that learns a manifold of normal anatomical variability. Among others, the approach presented in [39] is a novel hybrid method that combines deep autoencoders and GAN-style adversarial learning. A common difficulty regarding the GAN-based methods is a lack of knowledge with respect to the regularisation of the generator to achieve compactness [38].

In contrast to the generative approaches, discriminative OCC methods aim to infer a decision boundary that separates the normal class from anomalies. One of the most prominent discriminative non-deep methods is the One-Class SVM (OCSVM) approach [40] that tries to learn a hyper-plane which separates the normal data from the origin. Another well-known and closely related method is that of the support vector data description (SVDD) approach [41] where the objective is to infer the smallest hyper-sphere surrounding the training objects. Among other kernel-based approaches is the one-class kernel spectral regression method [21] that tries to infer a robust one-class description
in the presence of contamination in the training set by imposing a Tikhonov regularisation on the Fisher null discriminant in the Hilbert space. Determination of the optimal regularisation parameter is realised by posing the one-class learning problem as a sensitivity analysis task. An evaluation of this method in the presence of contaminated training data has verified its superior robustness as compared with other alternatives. Although the aforementioned non-deep discriminative methods have their own merits, they require a careful feature engineering step to yield satisfactory performance in real-world scenarios. Recently, different attempts have been made to compensate for the shortcomings of the discriminative methods. As an instance, deep SVDD (DSVDD) [38] may be considered one of the earliest discriminative deep OCC methods which try to combine a deep learning formalism with an OCC-based objective function. For this purpose, similar to the conventional SVDD method, it tries to learn a minimum volume hyper-sphere enclosing the embedding of the positive data. Yet, the representations of the data are learned through a deep network. One-Class CNN (OCCNN) [22] serves as another deep discriminative approach which consists of two sub-networks serving as the classifier and the feature extractor. Inspired by the OCSVM method, the networks are trained so that the learned features are different from a zero-centred Gaussian distribution, subject to a binary cross-entropy objective function. The work in [42] presents an effective OCC approach that adds small perturbations to the input and uses temperature scaling to detect anomalies. Among others, a successful OCC method corresponds to that of [23] which uses a classification network based on deep learning trained with a log-likelihood loss in combination with a holistic regularisation.

Although successful OCC methods exist in the literature, the functionality of these methods may be seriously compromised when exposed to contamination and noise in the training set. Compared to the existing methods, the current work, exploiting ideas from both kernel machines and convolutional structures, presents a one-class learning approach that provides a high degree of robustness against training set impurities while enabling direct end-to-end learning from the data. Next, we present a brief review of the method in [21] on which we build the proposed approach and then present the proposed method in this work.

3. Background

The one-class kernel spectral regression [21] operates by tailoring the Fisher null classification principle [43] to an OCC setting by using the origin as a hypothetical object of the non-target class and only employing target objects for training. By formulating the learning problem as regression in the Hilbert space, the work in [21] requires the response of all normal data items to be close to each other subject to some regularisation on the projection function, that is

𝜙(⋅)_opt = arg min_{𝜙(⋅)} ∑_{i=1}^{n} (𝜙(𝐱_i) − y_i)² + 𝛿‖𝜙(⋅)‖²,  (1)

where 𝐱_i represents a sample from the training set and ‖⋅‖² represents the squared norm in the Hilbert space. By operating in a reproducing kernel Hilbert space, it is shown that the projection function 𝜙(⋅) may be written as [21]

𝜙(⋅) = ∑_{i=1}^{n} 𝛼_i 𝜅(⋅, 𝐱_i),  (2)

where 𝜅(⋅, ⋅) denotes the kernel function and the 𝛼_i's are model coefficients (Lagrange multipliers) to be determined. Using a matrix notation, Eq. (1) in the dual space is equivalently written as [21]

𝐿(𝜶, 𝐲) = ‖𝐊𝜶 − 𝐲‖₂² + 𝛿𝜶⊤𝐊𝜶,  (3)

where 𝐊 denotes the kernel matrix and 𝐲 is a vector collection of the (soft) labels. By formulating the one-class classification problem in the presence of a contaminated training set as a sensitivity analysis task, the work in [21] maximises its robustness against noise. Learning the regularised one-class model above entails inferring the optimal parameter 𝜶 along with the labels 𝐲 that minimise the loss function 𝐿(𝜶, 𝐲). Minimisation in 𝜶 is performed by requiring the derivative of the loss w.r.t. 𝜶 to vanish, yielding

𝜶_opt = (𝐊 + 𝛿𝐈_n)⁻¹𝐲,  (4)

where 𝐈_n denotes an n × n identity matrix. For optimisation in 𝐲, one may equate the derivative of the loss with respect to 𝐲 to zero, which results in 𝐲 = 𝐊𝜶. When no knowledge regarding the contamination in the training set is available, 𝐲 is initialised to a vector of 1's.
4. Deep Robust One-Class Kernel Spectral Regression

In this section, the proposed Deep Robust Kernel Spectral Regression (DRKSR) approach, an extension of the method of [21] to a deep end-to-end learning formalism, is introduced. To this end, we first present the proposed objective function and then discuss its optimisation, followed by formulating the corresponding decision-making process for classification in the context of the proposed method.

4.1. Objective function

The proposed objective function is designed to enable joint learning of appropriate representations of the data while optimising a Fisher null-space one-class classification criterion over the thus obtained representations, subject to a Tikhonov regularisation on the discriminant in the Hilbert space. For this purpose, for some input space 𝒳 ⊆ ℝ^d and the real output space ℝ, let 𝛹(⋅; 𝝎; 𝜶, 𝐊) : 𝒳 → ℝ denote a neural network with 𝐿 layers, with the set of weights 𝝎 = {𝝎₁, …, 𝝎_𝐿} where 𝝎_𝑙 are the weights associated with layer 𝑙 ∈ {1, 2, …, 𝐿}. Moreover, let 𝐊 be a positive semi-definite kernel matrix placed after the 𝐿 layers of the network and 𝜶 = [𝛼₁, …, 𝛼_n]⊤ be the coefficients associated with the training samples for a given training set. The objective of the proposed approach is to find an optimal feature subspace such that samples from the target/normal class are mapped onto nearby points to minimise the within-class scatter, while imposing certain regularisations on the network parameters so that it is able to handle label noise and contamination in the training set. For this purpose, the proposed objective function in this study is defined as

𝐿(𝜶, 𝐲, 𝝎) = ‖𝐊𝜶 − 𝐲‖₂² + 𝛿𝜶⊤𝐊𝜶 + 𝜆‖𝝎‖₂²,  (5)

where the first term on the RHS of the equation measures the total discrepancy (in a Euclidean sense) between the expected (i.e. 𝐲) and the inferred (i.e. 𝐊𝜶) labels of dataset items. The second term on the RHS of the equation above imposes a Tikhonov regularisation on the discriminant, i.e. 𝜶, in the Hilbert space. The last term imposes a regularisation on the parameters of the network. 𝛿 is the Tikhonov regularisation parameter while 𝜆 is the regularisation coefficient for the weights of the network. In the proposed method, the network weights 𝝎, the coefficients 𝜶 associated with the training samples for the kernel-space classifier, and the soft labels 𝐲 should be determined jointly by optimising the objective function defined in Eq. (5). Note that when the number of training samples is fewer than the model parameters, the problem above with no regularisation becomes ill-posed. In such cases, a Tikhonov regularisation is deployed to stabilise the decision boundary and create a unique solution. In our case, the Tikhonov regularisation is used to maintain a balance between data fidelity and constraining the norm of the solution function, i.e. to control its smoothness. It penalises large magnitudes of the parameters to yield a small variance, which is advantageous for one-class learning in the presence of a contaminated dataset. These merits of a Tikhonov regularisation are instrumental in enhancing the generalisation capabilities of the network, as will be demonstrated through experiments in different application settings. Note that a fundamental difference between the proposed approach in this work and that of [21] is that the kernel matrix in this study is built on the learned representations associated with the training samples, which are updated during each training iteration, whereas in [21], raw or handcrafted features are used to build the kernel matrix. In Eq. (5), 𝐲 denotes the vector of labels associated with the training samples, which is initialised as 𝐲 = [1, 1, …, 1]⊤ when no prior knowledge is available with regards to the conformity of each positive object with the normal class, or no information regarding the existence of negative samples is available. Note that, during the training stage of the network, a soft label is inferred for each training sample by updating 𝐲 at each iteration to reflect the degree of normality of each object within the target class, as discussed in the next section. Furthermore, learning 𝐲 allows the network to infer a soft label for each training object, which facilitates a ranking of training objects according to their conformity with the majority of observations in the training set while improving the generalisation capability of the network in the presence of a contaminated dataset.
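To make Eq. (5) concrete, the sketch below evaluates the proposed loss in PyTorch from a batch of penultimate-layer features. The Gaussian kernel construction anticipates Eq. (7) of the next subsection, and the argument names are illustrative rather than part of the published implementation.

```python
import torch

def gaussian_kernel(O, gamma):
    """K_ij = exp(-gamma * ||o_i - o_j||^2), built from the learned representations O (n x m)."""
    sq = (O * O).sum(dim=1, keepdim=True)        # diagonal of O O^T as a column vector
    W = sq + sq.t() - 2.0 * O @ O.t()            # pairwise squared Euclidean distances
    return torch.exp(-gamma * W)

def drksr_objective(O, alpha, y, delta, lam, weights, gamma):
    """Eq. (5): ||K alpha - y||^2 + delta * alpha^T K alpha + lambda * ||omega||^2."""
    K = gaussian_kernel(O, gamma)
    fidelity = (K @ alpha - y).pow(2).sum()                  # data-fidelity term
    tikhonov = delta * alpha @ (K @ alpha)                   # Tikhonov term on the discriminant
    weight_reg = lam * sum(w.pow(2).sum() for w in weights)  # regularisation on the network weights
    return fidelity + tikhonov + weight_reg
```

Because the kernel matrix is a differentiable function of the features, the loss can be backpropagated into the network weights, which is the property the end-to-end training of Section 4.2 relies on.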
4.2. Optimisation

In order to fully characterise the proposed model, the objective function should be optimised w.r.t. 𝜶, 𝐲, and 𝝎. For the optimisation of the proposed objective function w.r.t. 𝜶, we follow a gradient-based approach. The derivative of the loss function given in Eq. (5) with respect to 𝜶 is

𝜕𝐿/𝜕𝜶 = 2𝐊(𝐊𝜶 − 𝐲) + 2𝛿𝐊𝜶.  (6)

In order to update the network weights 𝝎, since the kernel matrix 𝐊 depends on the most recent representations of the objects during training, in addition to the regularisation term ‖𝝎‖₂², the derivative of the loss function w.r.t. the kernel matrix 𝐊 needs to be computed too. As computing the gradient of ‖𝝎‖₂² is straightforward, we shall focus on computing the gradient of 𝐊 w.r.t. 𝝎. For this purpose, let 𝐎_{n×m} denote the representations associated with the penultimate layer of the network, where n stands for the number of training set samples and m stands for the dimensionality of the representations obtained from the penultimate layer. Also, let 𝐈_{n×n} stand for an identity matrix and 𝟏_{n×n} represent an n × n matrix of 1's. Let also "◦" stand for the element-wise product between two matrices, and let "exp" represent an exponential function operable on a matrix in an element-wise fashion. Although other alternatives exist, the most widely used non-linear kernel function in the context of kernel-based methods is that of a Gaussian kernel. Using the definitions above, the Gaussian (RBF) kernel function can be represented as

𝐊 = exp(−𝛾((𝐈◦𝐎𝐎⊤)𝟏 + 𝟏⊤(𝐈◦𝐎𝐎⊤)⊤ − 2𝐎𝐎⊤)).  (7)

The derivative of the loss function with respect to 𝐊 may be calculated as

𝜕𝐿/𝜕𝐊 = 2(𝐊𝜶 − 𝐲)𝜶⊤ + 𝛿𝜶𝜶⊤.  (8)

Next, in order to update the weights, one needs to compute the gradient of the kernel matrix 𝐊 w.r.t. 𝐎, where 𝐎 represents the representations obtained from the network's penultimate layer. To this end, let 𝐿 be viewed as a function of the extracted features 𝐎. 𝐿(𝐎) may be expressed as a composition of functions as

𝐿(𝐎) = 𝐿₁(𝐐) = 𝐿₂(𝐖) = 𝐿₃(𝐊),  (9)

where 𝐐, 𝐖, and 𝐊 are defined as

𝐐 = 𝐎𝐎⊤,
𝐖 = (𝐈◦𝐐)𝟏 + 𝟏⊤(𝐈◦𝐐)⊤ − 2𝐐,
𝐊 = exp[−𝛾𝐖].  (10)

Following the rules of differentiation, one obtains

𝜕𝐐 = (𝜕𝐎)𝐎⊤ + 𝐎(𝜕𝐎⊤),
𝜕𝐊 = −𝛾𝐊◦𝜕𝐖.  (11)

Furthermore, the differentiation of a scalar-valued matrix function yields

𝜕𝐿 = trace((𝜕𝐿₃/𝜕𝐊)⊤ 𝜕𝐊)
    = trace((−𝛾𝐊◦(𝜕𝐿₃/𝜕𝐊))⊤ 𝜕𝐖)
    = trace((𝜕𝐿₂/𝜕𝐖)⊤ 𝜕𝐖).  (12)

Using the equation above, one obtains

𝜕𝐿₂/𝜕𝐖 = −𝛾𝐊◦(𝜕𝐿₃/𝜕𝐊).  (13)

𝜕𝐿₃/𝜕𝐊 is nothing but the differential of the loss function w.r.t. the kernel matrix 𝐊, which is derived in Eq. (8). In a similar fashion, one may derive

𝜕𝐿₁/𝜕𝐐 = 𝐈◦((𝜕𝐿₂/𝜕𝐖 + (𝜕𝐿₂/𝜕𝐖)⊤)𝟏⊤) − 2(𝜕𝐿₂/𝜕𝐖),
𝜕𝐿/𝜕𝐎 = (𝜕𝐿₁/𝜕𝐐 + (𝜕𝐿₁/𝜕𝐐)⊤)𝐎.  (14)

Once the derivatives are computed, one may apply backpropagation through the entire network to update the weights. After updating the network parameters, similar to [21], optimisation w.r.t. 𝐲 requires the gradient of the objective function with respect to 𝐲 to vanish, followed by an 𝓁₂ normalisation, that is

𝐲 = 𝐊𝜶/‖𝐊𝜶‖₂.  (15)

4.2.1. Memory considerations

As typically a large number of samples are utilised for training, manipulating a large kernel matrix may lead to large requirements in terms of system memory. In order to alleviate this problem, in this study, we follow a batch learning scheme and calculate the kernel matrix separately for each batch. Let nb be the total number of batches in the training set. In each batch b = 1, …, nb, the size of the kernel matrix 𝐊_b is (bs, bs), where bs stands for the number of batch samples. We use a mini-batch gradient descent approach that allows one to benefit from the desirable properties of both stochastic and batch gradient descent updates to alleviate memory requirements for large training sets while minimising the chances of local minima. More specifically, in the proposed approach, the outputs 𝐊_b𝜶_b are calculated for each batch by using the corresponding coefficients 𝜶_b associated with the samples belonging to that batch. The weights of the network and the coefficients used are subsequently updated via a gradient descent method. After updating the coefficients 𝜶 and the weights 𝝎 of the network, the representations associated with all the training samples are extracted and stored as the matrix 𝐎_{n×m}. The coefficients 𝜶 are then utilised to derive the soft labels corresponding to each batch as

𝐲_b = 𝐊(𝐎_b, 𝐎)𝜶,  (16)

which are subsequently normalised as

𝐲 = 𝐲/‖𝐲‖₂,  (17)

where 𝐲 = [𝐲₁, 𝐲₂, …, 𝐲_nb]⊤. The process is repeated for a number of epochs until convergence. The final feature matrix 𝐎_{n×m} is utilised at the time of inference. Algorithm 1 summarises the training stage of the proposed DRKSR approach.
Algorithm 1: Training of the proposed DRKSR approach
Input: Training batches 𝐗_b and their labels 𝐲_b, b = 1, …, nb
Output: Soft labels 𝐲_b, b = 1, …, nb; penultimate-layer features 𝐎
Parameters: 𝛾, 𝛿, 𝜆, 𝜶 = [𝜶_1, …, 𝜶_nb], 𝝎 = [𝝎_1, …, 𝝎_L]
1:  for b = 1, …, nb do                                  // initialisation
2:      𝜶_b = [1, …, 1]⊤
3:      𝐲_b = [1, …, 1]⊤
4:  for t = 1, …, T do                                   // epochs
5:      for b = 1, …, nb do                              // iterate over batches
6:          𝐎_b = 𝛹(𝝎; 𝐗_b)                              // forward pass through the network
7:          𝐊_b = exp[−𝛾((𝐈◦𝐎_b𝐎_b⊤)𝟏 + 𝟏⊤(𝐈◦𝐎_b𝐎_b⊤)⊤ − 2𝐎_b𝐎_b⊤)]   // due to Eq. (7)
8:          𝜶_b = 𝜶_b − 𝜂(2𝐊_b(𝐊_b𝜶_b − 𝐲_b) + 2𝛿𝐊_b𝜶_b)              // due to Eq. (6)
9:          𝝎 = 𝝎 − 𝜂 𝜕𝐿(𝜶_b, 𝐲_b, 𝝎)/𝜕𝝎                              // due to Section 4.2
10:     𝐎 = [𝐎_1, 𝐎_2, …, 𝐎_nb]
11:     𝜶 = [𝜶_1, 𝜶_2, …, 𝜶_nb]
12:     for b = 1, …, nb do                              // compute batch labels due to Eq. (16)
13:         𝐲_b = 𝐊(𝐎_b, 𝐎)𝜶
14:     𝐲 = [𝐲_1, 𝐲_2, …, 𝐲_nb]
15:     𝐲 = 𝐲/‖𝐲‖₂                                       // due to Eq. (17)
our method can be used with an arbitrary CNN structure.
where 𝐳̄ denotes the representation derived from the penultimate layer 5.2. Datasets
of the network for the test sample 𝐳 and 𝐨𝑖 ’s stand for the representa-
tions obtained for the training samples. The projections of the normal In this study, we use the most widely used datasets for one-class
samples are expected to be near point ‘‘1’’ in the feature subspace classification to enable comparison with a larger body of existing work
whereas anomalous samples are expected to be projected further away in the literature. These datasets are briefly explained next.
and closer to the origin. A threshold 𝜏 is ultimately applied to 𝑦(𝐳) to
classify 𝐳.
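As an illustration, the normality score and the thresholding step above may be written as follows; `net`, the stored training features `O` and coefficients `alpha`, and the threshold `tau` are assumed to be available from the training stage.

```python
import torch

def normality_score(net, z, O, alpha, gamma):
    """y(z) = sum_i alpha_i * kappa(z_bar, o_i) for a single test sample z."""
    z_bar = net(z.unsqueeze(0))                     # penultimate-layer representation of z
    d2 = ((z_bar - O) ** 2).sum(dim=1)              # squared distances to the training features
    return (alpha * torch.exp(-gamma * d2)).sum()

def classify(score, tau):
    """Normal samples are expected to score near 1, anomalies closer to the origin."""
    return "normal" if score >= tau else "anomaly"
```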
5. Experiments

In this section, we present an experimental evaluation and analysis of the proposed approach along with a comparison to the baseline and state-of-the-art methods on different datasets. We evaluate the proposed approach in two different scenarios: (1) one-class classification subject to the training set contamination; and (2) unsupervised observation ranking where the dataset includes both normal and anomalous samples. The remainder of this section is structured as follows.

• In Section 5.1, the implementation details of the proposed approach including hyper-parameter tuning, data pre-processing and the backbone CNN network used are outlined;
• Section 5.2 discusses the datasets used in the experiments;
• Section 5.3 analyses the convergence behaviour of the proposed method;
• In Section 5.7, we discuss the computational complexity of the proposed approach;
• Section 5.4 presents details of implementation for the methods included in the comparison;
• In Section 5.5, the proposed method is evaluated for OCC in the presence of training set contamination and compared against other methods;
• In Section 5.6, we evaluate the proposed DRKSR approach in an unsupervised observation ranking scenario and compare its performance against other approaches.

5.1. Implementation details

In the following experiments, a Gaussian kernel is used, i.e. the (i, j)th element of the kernel matrix is defined as 𝐊_{ij} = exp(−𝛾‖𝐨_i − 𝐨_j‖²), where 𝛾 controls the width of the Gaussian kernel and 𝐨_i and 𝐨_j stand for the features associated with the training samples 𝐱_i and 𝐱_j obtained from the penultimate layer of the network. The kernel width parameter 𝛾 is selected from {0.5, 1, 2, 3} and the regularisation parameter 𝛿, which controls the weight of the Tikhonov regularisation applied to 𝜶, is chosen from {0.01, 0.05, 0.1, 0.5, 1}. The regularisation parameter 𝜆 that weights the regularisation applied to the network weights 𝝎 is chosen from {10⁻⁴, 10⁻³, 10⁻²}. The parameters 𝛾, 𝛿, and 𝜆 are all set using validation data. We fix the batch size at 2500 samples for all datasets. 𝜶 is initialised to a vector of 1's. For learning the network parameters, the Adam optimiser operating with a learning rate of 5 × 10⁻⁴ is used. We also tune the Adam optimiser using a one-cycle policy [44] with a maximum learning rate of 5 × 10⁻³. The data in all datasets are pre-processed by dividing each image by its 𝓁₂-norm followed by removing the mean. For each experiment, the network is trained for a maximum of 100 epochs and then evaluated on the test set of each dataset using the best configuration of parameters obtained on the validation set.

For injecting contamination into the training set, we use randomly selected samples from all classes other than the target class (a sketch of this protocol is given after the dataset descriptions in Section 5.2). The process above is repeated five times and we report the average performance over these five repetitions. In order to facilitate a fair comparison against the existing methods, we use a LeNet-based architecture [45] as the convolutional backbone architecture, as it serves as a widely used baseline and was used in similar studies [38]. Nevertheless, note that our method can be used with an arbitrary CNN structure.

5.2. Datasets

In this study, we use the most widely used datasets for one-class classification to enable comparison with a larger body of existing work in the literature. These datasets are briefly explained next.

5.2.1. The CIFAR10 dataset [46]
This database comprises 60,000 colour images of 32 × 32 pixels from 10 classes of animals, birds, and vehicles. The dataset is split into a training set comprising 50,000 images in addition to a test set containing 10,000 images. Each class includes 6000 images and is divided into 5000 images for training and 1000 for testing. Similar to the MNIST dataset, ten one-class classification problems are created for this dataset where, in each round, one class is assumed as the positive/target class while the others serve as outliers. In order to determine the optimal parameters of the proposed approach, 10% of the training set is randomly selected and used as validation data, accounting for 500 normal samples and 4500 anomalous samples. The test set contains 1000 images per class.

5.2.2. The MNIST dataset [45]
This dataset includes images of 28 × 28-pixel handwritten digits 0–9. There are 60,000 images in the training set and 10,000 in the test set. In our analysis, one digit is considered as the normal class and all other digits represent anomalous samples. Therefore, ten one-class classification tasks are considered. There are approximately 6000 samples in the training set for each class. We use a randomly selected 10% of the data as a validation set to tune the (hyper)parameters of the proposed approach, leading to approximately 600 positive and 5400 anomalous samples per class. The test set contains about 1000 samples of the normal class and 9000 samples corresponding to anomalies. The original test-train split provided by the MNIST dataset is followed in our experiments.

5.2.3. The Fashion-MNIST dataset [47]
The Fashion-MNIST dataset incorporates a training set of 60,000 samples along with a test set of 10,000 samples. Each sample corresponds to a 28 × 28 gray-scale image, from one of 10 possible classes. The dataset shares the same image dimensions, structure, and train and test splits as those of the MNIST.
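The contamination protocol of Section 5.1, also used in Section 5.5, amounts to replacing a fraction of the target-class training samples with samples drawn evenly from the remaining classes. A NumPy sketch under these assumptions is given below; the array layout and function name are hypothetical.

```python
import numpy as np

def contaminate(target_x, other_x_by_class, rate, rng=np.random.default_rng(0)):
    """Replace a fraction `rate` of the target training samples with non-target samples,
    drawing the same number of samples from each negative class (uniform contamination)."""
    n = len(target_x)
    n_noise = int(round(rate * n))
    classes = list(other_x_by_class)
    per_class = [n_noise // len(classes)] * len(classes)
    for i in range(n_noise % len(classes)):                 # spread any remainder evenly
        per_class[i] += 1
    noise = np.concatenate([
        other_x_by_class[c][rng.choice(len(other_x_by_class[c]), k, replace=False)]
        for c, k in zip(classes, per_class)
    ])
    keep = rng.choice(n, n - n_noise, replace=False)        # positives that are retained
    x = np.concatenate([target_x[keep], noise])             # training set size stays unchanged
    return x[rng.permutation(len(x))]
```

All retained and injected samples are then presented to the one-class learner with the same positive label, which is the label-noise setting the proposed method is designed to tolerate.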
Fig. 1. Loss functions (left column) and AUC curves (right column) for two sample classes from the validation set of the MNIST dataset.
5.3. Convergence characteristics

This section analyses the convergence characteristics of the proposed approach. For this purpose, the loss function and the AUC on the validation set versus the number of epochs are recorded for two sample classes from the MNIST dataset. Fig. 1 depicts the behaviour of the proposed approach in terms of the training loss and the AUC for digit 0 and digit 1. As may be observed from the figure, the loss corresponding to both classes drops significantly after the first iteration, partly due to the fact that the labels are initialised away from their optimal values. As the optimisation proceeds and the labels are updated, the loss function decreases further until convergence. Furthermore, from the figure, it can be seen that both the AUCs and the loss function values typically converge within 40 epochs. Similar behaviour is observed for other classes too.

5.4. Compared methods

In the following experiments, we present a comparison between the proposed approach and the state-of-the-art methods. The methods included in the comparison correspond to both deep and non-deep approaches. For these methods, we use the implementations of the corresponding algorithms provided by the corresponding authors, with the parameters tuned on the validation sets for all methods in all experiments. The methods and the corresponding implementation details are as follows.

• OCSVM [40]: as a widely used baseline method, we include the One-Class SVM approach in the comparisons. For this purpose, we use a Gaussian kernel with the width parameter 𝛾 ∈ {2⁻¹⁰, 2⁻⁹, …, 2⁻¹}, and run all experiments with 𝜈 selected from {0.01, 0.1}, where 𝜈 reflects one's assumptions with respect to the training set outlier fraction. The dimensionality of the data is initially reduced via PCA such that 95% of the variance is retained (a configuration sketch is given after this list).
• For the baseline method of [21], designated as "Robust-kernel" in the subsequent sections, following the setting of [21], we use raw pixels as features and normalise them by dividing each image by its 𝓁₂-norm, and use the average of the Euclidean distances between the features as the Gaussian width of the kernel.
• For the Deep SVDD [38] and DCAE [38] methods, the data is pre-processed based on a normalisation of the global contrast using an 𝓁₁-norm and then re-scaled to [0, 1] via a max–min scaling.
• For HRN [23], the data is pre-processed using 𝓁₂-norm instance-level data normalisation. The best parameter settings suggested in [23] are used in our experiments.
• For ICS [25], we use the min–max normalisation for pre-processing and employ the best parameter settings reported in [25].
• Regarding SRO [30], the 𝜆 parameter controlling the balance between sparseness and connectivity is set as 𝜆 = 0.95 and the coefficient of self-representation errors is selected from [0, 20] using the validation sets of each dataset.
• For DASVDD [48], the data is pre-processed based on global contrast normalisation using the 𝓁₂-norm as proposed in their paper. Moreover, the hyper-sphere centre 𝑐 was initialised as the average of the latent representation of the whole dataset for stability.
• For MAF-IAD [49], we directly report the results from the corresponding paper.
• For the CA-AAE approach [50], we report the results presented in the corresponding paper.
• Regarding the CC approach [51], the results are reported from the relevant paper.
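For the OC-SVM baseline listed first above, a scikit-learn sketch with the stated settings would look roughly as follows; the grid search over gamma and nu on the validation set is omitted and the variable names, as well as the placeholder data, are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

def fit_ocsvm_baseline(train_x, gamma_exp, nu):
    """OC-SVM with a Gaussian kernel: gamma in {2^-10, ..., 2^-1}, nu in {0.01, 0.1}."""
    pca = PCA(n_components=0.95)                            # retain 95% of the variance
    z = pca.fit_transform(train_x.reshape(len(train_x), -1))
    model = OneClassSVM(kernel="rbf", gamma=2.0 ** gamma_exp, nu=nu).fit(z)
    return pca, model

# one point of the parameter grid, applied to placeholder data for illustration
pca, model = fit_ocsvm_baseline(np.random.rand(200, 28 * 28), gamma_exp=-5, nu=0.1)
scores = model.decision_function(pca.transform(np.random.rand(10, 28 * 28)))
```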
Table 1
AUC's (mean±std%) of different approaches over all classes for one-class classification on the CIFAR-10 dataset in the presence of contamination.
Cont. (%) OC-SVM DSVDD HRN Robust-kernel DCAE ICS DASVDD CC DRKSR
5 64.1 ± 0.7 61.8 ± 2.5 70.9 ± 0.6 56.5 ± 0.0 59.4 ± 3.2 66.1 ± 1.7 64.7 ± 2.9 66.7 ± 𝑛𝑎 𝟕𝟐.𝟏 ± 𝟏.𝟖
10 63.6 ± 0.7 60.7 ± 2.6 70.5 ± 0.6 56.2 ± 0.0 58.0 ± 3.3 65.5 ± 1.8 63.3 ± 2.9 69.1 ± 𝑛𝑎 𝟕𝟏.𝟔 ± 𝟏.𝟗
15 62.7 ± 0.8 59.2 ± 2.8 70.3 ± 0.7 56.0 ± 0.0 57.1 ± 3.5 64.8 ± 1.6 62.5 ± 3.0 71.6 ± 𝑛𝑎 𝟕𝟏.𝟐 ± 𝟏.𝟗
20 61.9 ± 0.9 57.5 ± 3.0 70.1 ± 0.7 55.7 ± 0.0 56.2 ± 3.6 64.4 ± 1.5 61.7 ± 3.1 𝑛𝑎 𝟕𝟎.𝟗 ± 𝟏.𝟗
25 61.0 ± 0.9 56.9 ± 3.0 69.7 ± 0.7 55.4 ± 0.0 55.3 ± 3.7 63.8 ± 1.5 60.3 ± 3.2 𝑛𝑎 𝟕𝟎.𝟓 ± 𝟐.𝟎
30 59.8 ± 1.0 55.8 ± 3.2 69.4 ± 0.8 55.2 ± 0.0 54.4 ± 3.9 63.1 ± 1.5 59.6 ± 3.4 𝑛𝑎 𝟕𝟎.𝟐 ± 𝟐.𝟎
35 58.9 ± 1.0 54.4 ± 3.3 68.6 ± 0.9 55.0 ± 0.0 53.9 ± 3.9 62.8 ± 1.5 58.7 ± 3.7 𝑛𝑎 𝟔𝟗.𝟕 ± 𝟐.𝟏
40 57.8 ± 1.1 53.5 ± 3.5 67.9 ± 1.0 54.8 ± 0.0 52.8 ± 4.0 62.5 ± 1.5 58.1 ± 3.8 𝑛𝑎 𝟔𝟗.𝟔 ± 𝟐.𝟐
45 56.9 ± 1.2 52.7 ± 3.6 66.8 ± 1.3 54.6 ± 0.0 51.6 ± 4.2 62.1 ± 1.6 57.5 ± 3.7 𝑛𝑎 𝟔𝟗.𝟑 ± 𝟐.𝟐
50 56.1 ± 1.2 52.3 ± 3.8 65.6 ± 1.7 54.5 ± 0.0 50.7 ± 4.3 61.7 ± 1.7 57.2 ± 3.6 𝑛𝑎 𝟔𝟖.𝟗 ± 𝟐.𝟑
Average 60.28 56.48 68.98 55.39 54.94 63.68 60.36 𝑛𝑎 𝟕𝟎.𝟑𝟗

5.5. One-class classification

In our first set of experiments, different OCC approaches are examined for one-class classification subject to training set contamination. Following a real-world and realistic setting, it is assumed that the fraction of contamination in the training set is not known. For the purpose of this experiment, on all datasets, each model is trained subject to different training set contamination levels ranging from 5% to 50% in steps of 5% and then assessed on the corresponding test set. A contaminated dataset, in this case, is created by randomly removing some positive samples from the target training set and replacing them with random samples from other classes. The distribution of the contamination is uniform in the sense that the same number of samples are randomly selected from each negative class. For computational efficiency, we determine the optimal parameters on the training set without any contamination and only optimise the Tikhonov regularisation parameter 𝛿 using the validation set. Nevertheless, some improvement in the performance is expected if all the parameters are tuned for each level of contamination for each class separately.

Tables 1–3 summarise the performances of different methods on the CIFAR-10, MNIST and Fashion-MNIST datasets, respectively, where each row in the tables reports the average AUC of all classes for a particular level of contamination in the training set along with the corresponding standard deviations. As may be seen from Table 1, the proposed approach is the best-performing method on average over levels of contamination with an average AUC of 70.39%. The second best-performing method on the CIFAR-10 dataset is that of HRN, which seems to be moderately robust to contamination. However, with an increasing level of contamination, one observes a sudden drop in the performance of the HRN approach whereas the performance of the proposed DRKSR is not affected as much. The ICS method also appears as another relatively robust approach against the training set contamination. The method in [21], on the other hand, has a relatively lower performance but seems to be quite robust against contamination with only a slight deterioration in the performance as the contamination percentage increases. The Deep SVDD and DCAE methods, however, provide quite unstable results with the standard deviation being relatively high for both low and high levels of contamination. It is interesting that the OC-SVM method outperforms DSVDD in the presence of contamination in the training set. For DASVDD [48] on this particular dataset, we observed that the centre 𝑐 exhibited significant instability during the training process when there were anomalies present in the dataset. It failed to converge to a stable representation. To solve the issue, the centre 𝑐 was updated every epoch instead of every batch. This instability underscores the sensitivity of the model's latent space to outliers, highlighting the need for robust mechanisms to ensure consistent convergence in the presence of contaminated data.

On the MNIST dataset (Table 2), one observes that the proposed DRKSR method outperforms other approaches when the training set is contaminated with non-target observations. In particular, compared to the CIFAR10 dataset, it can be seen that the performances of all other methods except for the proposed DRKSR drop significantly. The HRN approach performs better only for the case of 5% training set contamination. Nevertheless, the proposed DRKSR yields superior performance for all the other cases. Overall, the proposed approach performs approximately 4% better than HRN on average on this dataset. The robust kernel-based approach of [21] appears to be quite robust, with minimal decrease in the performance when the training set contamination is increased. The ICS, DCAE, and DSVDD approaches all perform poorly on this dataset as their performances degrade rapidly with increased contamination. The standard deviations of the DSVDD as well as the DCAE methods are observed to be the highest for the MNIST dataset as well. From the results, it can be observed that DRKSR outperforms all other methods for all the digits, by a margin as high as 8% AUC on average, except digit 1 where ICS performs better than DRKSR by a very small margin. HRN yields slightly better results for low levels of contamination. However, the performances of the HRN and ICS methods deteriorate quite drastically in the presence of high levels of contamination. The HRN approach yields the second-best results for the majority of classes on average. The DASVDD method's performance drops significantly with an increase in contamination and it is interesting to note that DSVDD outperforms DASVDD by almost 3%.

On the Fashion-MNIST dataset (Table 3), it may be observed that the proposed approach achieves the best overall average performance compared to other methods. In particular, while the HRN approach seems to perform well for lower levels of contamination, when the contamination degree is increased, the proposed method outperforms other approaches including HRN.

Table 4 provides an average ranking of different methods based on their performances in this set of experiments using Friedman's test. Inspecting Table 4, one can observe that on all datasets, the proposed DRKSR approach is the best-performing method. The second-best performing method is HRN [23].
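The average rankings reported in Table 4 (and later in Table 8) can, in principle, be reproduced from a matrix of AUC scores; a SciPy sketch is given below, where `auc` is a hypothetical array with one row per experimental condition and one column per compared method.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def average_ranking(auc):
    """auc: (n_conditions, n_methods) array of AUC scores, higher is better.
    Returns the per-method average rank (1 = best) and the Friedman test p-value."""
    ranks = rankdata(-auc, axis=1)               # rank the methods within each condition
    avg_rank = ranks.mean(axis=0)
    _, p_value = friedmanchisquare(*auc.T)       # one sequence of scores per method
    return avg_rank, p_value
```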
5.6. Unsupervised observation ranking

In this set of experiments, different methods are evaluated in an unsupervised observation ranking scenario. For this purpose, the training set is contaminated with varied levels of contamination ranging from 5% to 50%, and the observations are ranked based on their conformity with the learned model. The techniques used in this setting implicitly assume that normal samples are in a majority and appear more frequently compared to anomalies. We follow the same protocol as in the previous experiments. That is, the number of samples in the training set remains the same and some positive samples are randomly replaced with some negative samples from other classes which are considered as contamination/noise. In this experiment, in addition to the approaches included in the earlier experiments, we also include the SRO method [30] which is considered an effective unsupervised approach for observation ranking. Similar to the previous experiments, the performance metric employed in this set of experiments is that
Table 2
AUC’s (mean±std%) of different approaches over all classes for one-class classification on the MNIST dataset in the presence of contamination.
Cont. (%) OC-SVM DSVDD HRN Robust-kernel DCAE ICS DASVDD MAF-IAD CA-AAE CC DRKSR
5 87.8 ± 0.1 90.9 ± 2.0 𝟗𝟔.𝟎 ± 𝟎.𝟑 90.4 ± 0.0 85.4 ± 2.1 91.7 ± 1.6 89.9 ± 0.5 89.8 ± 1.2 92.1 ± 𝑛𝑎 58.2 ± 𝑛𝑎 95.7 ± 0.5
10 85.7 ± 0.1 89.5 ± 2.3 94.9 ± 0.4 89.9 ± 0.0 83.8 ± 2.3 90.4 ± 1.8 86.8 ± 0.6 91.0 ± 2.2 89.2 ± 𝑛𝑎 60.1 ± 𝑛𝑎 𝟗𝟓.𝟎 ± 𝟎.𝟔
15 83.6 ± 0.1 87.6 ± 2.7 93.8 ± 0.4 89.5 ± 0.0 82.4 ± 2.8 89.1 ± 2.0 84.7 ± 0.7 na 86.9 ± 𝑛𝑎 66.0 ± 𝑛𝑎 𝟗𝟒.𝟕 ± 𝟎.𝟔
20 81.4 ± 0.2 85.4 ± 3.1 92.5 ± 0.4 89.1 ± 0.0 80.9 ± 3.1 88.2 ± 2.3 82.1 ± 0.9 91.6 ± 1.9 84.5 ± 𝑛𝑎 na 𝟗𝟒.𝟓 ± 𝟎.𝟕
25 79.4 ± 0.2 83.6 ± 3.6 90.9 ± 0.4 88.8 ± 0.0 79.3 ± 3.5 85.9 ± 2.0 80.2 ± 1.0 na 83.1 ± 𝑛𝑎 na 𝟗𝟒.𝟑 ± 𝟎.𝟕
30 77.3 ± 0.2 81.6 ± 4.0 89.7 ± 0.5 88.3 ± 0.0 77.6 ± 4.0 84.3 ± 2.1 77.8 ± 1.0 na 82.0 ± 𝑛𝑎 na 𝟗𝟒.𝟏 ± 𝟎.𝟕
35 75.6 ± 0.2 80.2 ± 4.2 88.1 ± 0.5 87.9 ± 0.0 76.1 ± 4.2 83.4 ± 2.4 75.5 ± 1.0 na 79.1 ± 𝑛𝑎 na 𝟗𝟑.𝟖 ± 𝟎.𝟖
40 74.3 ± 0.3 78.9 ± 4.4 86.3 ± 0.5 87.5 ± 0.0 74.5 ± 4.4 84.0 ± 2.6 74.6 ± 1.0 na 76.1 ± 𝑛𝑎 na 𝟗𝟑.𝟔 ± 𝟎.𝟖
45 72.6 ± 0.3 77.4 ± 4.7 84.6 ± 0.6 87.1 ± 0.0 72.5 ± 4.7 82.7 ± 2.9 72.8 ± 1.0 na 74.9 ± 𝑛𝑎 na 𝟗𝟑.𝟓 ± 𝟎.𝟖
50 70.8 ± 0.3 75.9 ± 5.0 83.1 ± 0.7 86.2 ± 0.0 71.0 ± 5.0 81.3 ± 3.2 71.2 ± 1.1 na 74.0 ± 𝑛𝑎 na 𝟗𝟑.𝟑 ± 𝟎.𝟗
Average 78.85 83.10 89.99 88.47 78.35 86.10 79.56 na 82.1 na 94.25
Table 3
AUC’s (mean±std%) of different approaches over all classes for one-class classification on the FMNIST dataset in the presence of contamination.
Cont. (%) OC-SVM DSVDD HRN Robust-kernel DCAE ICS DASVDD MAF-IAD CC DRKSR
5 85.6 ± 0.3 84.1 ± 2.1 𝟗𝟐.𝟏 ± 𝟎.𝟐 87.0 ± 0.0 82.7 ± 2.0 91.1 ± 1.0 85.3 ± 0.9 90.0 ± 0.7 60.4 ± 𝑛𝑎 91.8 ± 0.7
10 83.9 ± 0.3 83.4 ± 2.4 𝟗𝟏.𝟕 ± 𝟎.𝟐 86.2 ± 0.0 81.1 ± 2.3 90.6 ± 1.8 82.5 ± 1.3 89.3 ± 0.5 61.5 ± 𝑛𝑎 91.2 ± 0.6
15 82.1 ± 0.4 82.5 ± 2.8 𝟗𝟏.𝟐 ± 𝟎.𝟐 85.4 ± 0.0 80.2 ± 2.5 90.1 ± 1.0 80.1 ± 1.6 na 62.0 ± 𝑛𝑎 90.8 ± 0.7
20 80.9 ± 0.5 81.1 ± 3.0 𝟗𝟎.𝟖 ± 𝟎.𝟑 84.5 ± 0.0 78.9 ± 2.9 89.5 ± 1.1 78.2 ± 1.8 88.0 ± 0.4 na 90.4 ± 0.8
25 79.3 ± 0.5 79.8 ± 3.4 88.6 ± 0.4 83.5 ± 0.0 77.6 ± 3.4 88.7 ± 1.3 76.5 ± 2.0 na na 𝟖𝟗.𝟕 ± 𝟎.𝟖
30 77.7 ± 0.5 76.4 ± 3.7 87.1 ± 0.4 82.6 ± 0.0 75.2 ± 3.7 86.3 ± 1.7 74.9 ± 2.1 na na 𝟖𝟗.𝟎 ± 𝟎.𝟖
35 75.4 ± 0.7 74.2 ± 4.1 85.9 ± 0.5 81.5 ± 0.0 73.1 ± 4.0 84.8 ± 2.0 73.4 ± 2.2 na na 𝟖𝟖.𝟑 ± 𝟎.𝟗
40 73.7 ± 0.8 72.3 ± 4.6 84.1 ± 0.5 80.3 ± 0.0 71.9 ± 4.4 82.7 ± 2.1 72.3 ± 2.0 na na 𝟖𝟕.𝟑 ± 𝟎.𝟗
45 72.1 ± 0.9 71.1 ± 5.1 82.6 ± 0.6 78.9 ± 0.0 70.2 ± 4.7 80.1 ± 2.4 70.2 ± 2.0 na na 𝟖𝟔.𝟗 ± 𝟏.𝟎
50 71.4 ± 0.9 70.3 ± 5.5 81.0 ± 0.7 77.1 ± 0.0 68.4 ± 5.1 77.8 ± 2.9 68.7 ± 2.1 na na 𝟖𝟓.𝟗 ± 𝟏.𝟎
Average 78.21 77.50 87.52 82.69 75.90 86.20 76.21 na na 89.13
Table 5
AUC’s (mean±std%) of different approaches over all classes for observation ranking on the Cifar10 dataset in the presence of contamination.
Cont. (%) OC-SVM DSVDD HRN Robust-kernel DCAE ICS SRO DASVDD DRKSR
5 63.8 ± 0.8 61.5 ± 2.7 71.1 ± 0.5 57.0 ± 0.0 58.4 ± 3.3 65.2 ± 1.6 58.8 ± 0.0 63.9 ± 3.0 𝟖𝟒.𝟏 ± 𝟏.𝟎
10 63.3 ± 0.8 60.4 ± 2.8 70.8 ± 0.6 56.5 ± 0.0 57.4 ± 3.4 64.8 ± 1.7 58.3 ± 0.0 63.1 ± 3.0 𝟖𝟑.𝟕 ± 𝟏.𝟏
15 62.6 ± 0.9 59.1 ± 3.0 70.5 ± 0.6 56.2 ± 0.0 56.4 ± 3.5 64.5 ± 1.3 57.6 ± 0.0 62.0 ± 3.1 𝟖𝟑.𝟒 ± 𝟏.𝟐
20 61.7 ± 1.0 57.5 ± 3.3 70.2 ± 0.6 56.0 ± 0.0 55.5 ± 3.7 64.1 ± 1.4 57.0 ± 0.0 61.3 ± 3.2 𝟖𝟑.𝟎 ± 𝟏.𝟐
25 60.7 ± 1.0 56.5 ± 3.4 69.8 ± 0.7 55.8 ± 0.0 54.9 ± 3.8 63.7 ± 1.4 57.0 ± 0.0 60.2 ± 3.2 𝟖𝟐.𝟑 ± 𝟏.𝟑
30 59.6 ± 1.1 55.4 ± 3.6 69.4 ± 0.8 55.5 ± 0.0 53.8 ± 3.9 63.4 ± 1.5 57.0 ± 0.0 59.5 ± 3.4 𝟖𝟏.𝟕 ± 𝟏.𝟑
35 58.7 ± 1.1 54.4 ± 3.7 68.7 ± 0.8 55.2 ± 0.0 53.5 ± 4.0 63.2 ± 1.5 56.7 ± 0.0 58.7 ± 3.3 𝟖𝟏.𝟎 ± 𝟏.𝟒
40 57.7 ± 1.2 53.4 ± 3.9 68.1 ± 0.9 54.8 ± 0.0 52.2 ± 4.2 62.8 ± 1.6 56.4 ± 0.0 57.7 ± 3.6 𝟖𝟎.𝟑 ± 𝟏.𝟓
45 56.9 ± 1.2 52.7 ± 4.0 67.4 ± 1.2 54.7 ± 0.0 51.6 ± 4.4 62.6 ± 1.7 56.2 ± 0.0 57.1 ± 3.7 𝟕𝟗.𝟕 ± 𝟏.𝟔
50 56.2 ± 1.3 52.0 ± 4.3 66.6 ± 1.6 54.5 ± 0.0 50.3 ± 4.7 62.4 ± 1.7 56.0 ± 0.0 56.3 ± 3.7 𝟕𝟗.𝟏 ± 𝟏.𝟖
Average 60.14 56.27 69.29 55.65 54.40 63.70 57.14 59.98 𝟖𝟏.𝟖𝟐
Table 6
AUC’s (mean±std%) of different approaches over all classes for observation ranking on the MNIST dataset in the presence of contamination.
Cont. (%) OC-SVM DSVDD HRN Robust-kernel DCAE ICS SRO DASVDD DRKSR
5 87.6 ± 0.1 90.2 ± 2.0 𝟗𝟓.𝟎 ± 𝟎.𝟑 89.3 ± 0.0 84.3 ± 1.9 90.6 ± 1.6 73.8 ± 0.0 89.1 ± 0.5 94.8 ± 0.4
10 85.7 ± 0.1 89.4 ± 2.3 𝟗𝟒.𝟎 ± 𝟎.𝟒 89.0 ± 0.0 83.3 ± 1.8 89.7 ± 1.8 72.8 ± 0.0 85.7 ± 0.7 94.3 ± 0.4
15 83.6 ± 0.1 87.7 ± 2.6 92.9 ± 0.4 88.6 ± 0.0 81.7 ± 2.1 88.2 ± 1.9 71.6 ± 0.0 83.9 ± 0.7 𝟗𝟑.𝟔 ± 𝟎.𝟔
20 81.5 ± 0.2 85.6 ± 2.9 91.6 ± 0.4 88.3 ± 0.0 80.4 ± 2.4 87.0 ± 2.1 70.1 ± 0.0 81.4 ± 1.0 𝟗𝟑.𝟏 ± 𝟎.𝟕
25 79.7 ± 0.2 83.9 ± 3.2 90.3 ± 0.4 87.8 ± 0.0 78.8 ± 2.7 85.7 ± 2.0 68.6 ± 0.0 79.7 ± 1.0 𝟗𝟐.𝟕 ± 𝟎.𝟕
30 77.6 ± 0.2 82.6 ± 3.5 88.9 ± 0.4 87.3 ± 0.0 77.6 ± 2.9 83.8 ± 2.1 66.9 ± 0.0 77.3 ± 1.0 𝟗𝟐.𝟑 ± 𝟎.𝟕
35 75.9 ± 0.2 80.8 ± 3.9 87.4 ± 0.5 87.1 ± 0.0 75.8 ± 3.3 83.4 ± 2.3 66.1 ± 0.0 75.8 ± 1.0 𝟗𝟏.𝟗 ± 𝟎.𝟖
40 74.5 ± 0.2 79.4 ± 4.1 85.6 ± 0.5 86.6 ± 0.0 74.2 ± 3.5 83.2 ± 2.5 65.4 ± 0.0 74.2 ± 1.0 𝟗𝟏.𝟓 ± 𝟎.𝟖
45 72.7 ± 0.2 77.9 ± 4.4 84.0 ± 0.6 86.1 ± 0.0 72.2 ± 3.7 82.3 ± 2.7 64.7 ± 0.0 72.1 ± 1.1 𝟗𝟏.𝟑 ± 𝟎.𝟖
50 71.0 ± 0.2 76.5 ± 4.7 82.1 ± 0.7 85.6 ± 0.0 70.7 ± 3.9 81.4 ± 3.0 63.9 ± 0.0 70.9 ± 1.2 𝟗𝟎.𝟖 ± 𝟎.𝟕
Average 78.98 83.39 89.18 87.55 77.90 85.53 68.48 79.01 𝟗𝟐.𝟔𝟓
Table 7
AUC’s (mean±std%) of different approaches over all classes for observation ranking on the Fashion MNIST dataset in the presence of contamination.
Cont. (%) OC-SVM DSVDD HRN Robust-kernel DCAE ICS SRO DASVDD DRKSR
5 84.4 ± 0.4 83.1 ± 2.3 𝟗𝟏.𝟕 ± 𝟎.𝟐 86.8 ± 0.0 81.9 ± 2.2 90.5 ± 1.0 62.3 ± 0.0 84.2 ± 1.0 90.9 ± 0.7
10 82.5 ± 0.4 82.4 ± 2.5 𝟗𝟏.𝟏 ± 𝟎.𝟐 85.9 ± 0.0 80.4 ± 2.5 90.1 ± 1.1 61.8 ± 0.0 81.9 ± 1.2 90.5 ± 0.8
15 81.0 ± 0.5 81.1 ± 3.1 𝟗𝟎.𝟓 ± 𝟎.𝟐 85.1 ± 0.0 79.1 ± 2.8 89.4 ± 1.3 60.1 ± 0.0 79.6 ± 1.7 90.1 ± 0.8
20 79.8 ± 0.5 80.2 ± 2.9 89.5 ± 0.4 83.9 ± 0.0 77.4 ± 3.2 88.4 ± 2.1 59.2 ± 0.0 77.4 ± 2.0 𝟖𝟗.𝟕 ± 𝟎.𝟖
25 78.3 ± 0.5 78.6 ± 3.2 88.2 ± 0.4 83.0 ± 0.0 75.8 ± 3.6 86.7 ± 2.0 57.8 ± 0.0 76.1 ± 2.0 𝟖𝟖.𝟗 ± 𝟎.𝟗
30 76.5 ± 0.6 74.9 ± 3.5 86.7 ± 0.4 82.2 ± 0.0 74.1 ± 3.9 85.3 ± 2.1 55.4 ± 0.0 74.4 ± 1.9 𝟖𝟖.𝟐 ± 𝟎.𝟗
35 73.8 ± 0.7 72.6 ± 3.9 85.3 ± 0.5 81.1 ± 0.0 72.4 ± 4.1 83.6 ± 2.3 53.9 ± 0.0 72.6 ± 2.2 𝟖𝟕.𝟖 ± 𝟎.𝟗
40 72.2 ± 0.7 71.0 ± 4.1 83.2 ± 0.5 79.8 ± 0.0 70.9 ± 4.6 81.2 ± 2.5 52.6 ± 0.0 71.7 ± 2.0 𝟖𝟔.𝟖 ± 𝟏.𝟎
45 70.2 ± 0.9 69.9 ± 4.4 81.4 ± 0.7 78.5 ± 0.0 69.2 ± 5.1 79.3 ± 2.7 51.9 ± 0.0 69.6 ± 2.2 𝟖𝟓.𝟕 ± 𝟏.𝟏
50 68.7 ± 0.9 67.8 ± 4.7 80.2 ± 0.8 76.7 ± 0.0 67.2 ± 5.5 76.4 ± 3.0 51.2 ± 0.0 67.1 ± 2.2 𝟖𝟓.𝟏 ± 𝟏.𝟏
Average 76.74 76.16 86.78 82.30 74.81 85.10 56.62 75.46 𝟖𝟖.𝟑𝟕
Table 8
Average ranking of different approaches based on Friedman's test for observation ranking on the CIFAR10 (p-value = 2.31e−13), MNIST (p-value = 2.71e−13) and FMNIST (p-value = 1.37e−13) datasets.
Method Cifar10 MNIST FMNIST
OC-SVM 4.3 6.35 5.3
DSVDD 7.1 4.8 5.95
HRN 2.0 2.2 1.8
Robust-kernel 7.8 3.1 3.9
SRO 6.4 9.0 9.0
DCAE 8.7 7.65 7.85
ICS 3.0 3.8 3.1
DASVDD 4.7 7.0 6.9
DRKSR (this work) 𝟏.𝟎 𝟏.𝟏 𝟏.𝟐

Table 9
Average AUC over all different levels of contamination with and without Tikhonov regularisation.
Digit 𝛿 Average AUC
9 0 90.38
9 0.01 94.03
8 0 86.23
8 0.01 92.38

experiment is to analyse how the Tikhonov regularisation affects performance, especially when there are varying levels of contamination in the training set. To this end, for different values of the trade-off parameter 𝛿 (see Eq. (5)) we record the performance.

The results of this experiment are reported in Table 9. As may be observed from the table, utilising Tikhonov regularisation with a 𝛿 of 0.01 results in an improvement of approximately 4% in average AUC compared with the case of eliminating the Tikhonov regulariser altogether, i.e. 𝛿 = 0. In essence, when Tikhonov regularisation is applied, the average AUC over all different contamination levels for digit 9 increases by 4%, leading to an average AUC of 94.03%, compared to 90.38% when no regularisation is applied. We can see similar behaviour for digit 8 where there is an improvement of approximately 6% in the average. Similar behaviours are observed for all other classes which underscores the importance of a Tikhonov regularisation, especially when dealing with contaminated datasets.

6.3. Time complexity vs. Batch size

Our primary objective in this experiment is to discern the relationship between the number of batches and the computational efficiency, especially in the context of kernel matrix computation. To this end, we evaluate the proposed approach by executing the training and
testing processes on a CPU and meticulously tracking the time taken for each phase. The results of the study are as follows. With the entire dataset processed in a single batch, the training time was the largest, approximately 4.4 s. The inference time, however, was clocked at a relatively swift 2.9 s. Splitting the data across 2 batches led to a reduction in training time by about 33%, taking it down to roughly 2.95 s. Increasing the number of batches to 3 further reduced the training time to around 2.2 s. Four batches resulted in a training duration of close to 1.9 s. Processing the data across 6 batches reduced the training time even further to 1.47 s. The highest division, with the data divided across 12 batches, yielded the shortest training time of just 0.98 s.

Throughout the variation in the number of batches, the inference time remained relatively consistent, averaging around the three-second mark. This consistency in inference time suggests that, despite the computational challenges of kernel matrix calculations, the overhead during inference remains steady. However, training times exhibit fluctuations, decreasing with an increase in the number of batches. These observations suggest that increasing the number of batches can reduce training times while the inference duration remains more or less invariant. This characteristic suggests that the computational burden during training differs fundamentally from that during inference, especially in the context of kernel matrix operations.

7. Conclusion

In this study, we addressed the one-class classification problem and presented a novel one-class kernel regression-based deep learning approach that is trainable end-to-end. The proposed method utilises the Fisher null concept for classification and benefits from regularisation theories in kernel-based algorithms to achieve robustness against contamination while at the same time enjoying deep end-to-end convolutional learning. Through experiments on multiple datasets, it was illustrated that the proposed approach yields a substantially superior performance when the training set is contaminated with noisy observations. Additionally, it was illustrated that the proposed method may provide state-of-the-art results for unsupervised observation ranking. As potential future directions for investigation, one may consider extending the proposed approach into an open-set classification problem where multiple one-class classifiers are learned concurrently.

CRediT authorship contribution statement

Salman Mohammad: Software implementation, Validation, Formal analysis, Writing – original draft, Methodology. Shervin Rahimzadeh Arashloo: Conceptualization, Methodology, Writing – review & editing, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Standard publicly available datasets are used in this study.

Acknowledgements

This research is supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK) under grant no 121E465.

References

[1] J.d.J. Rubio, Stability analysis of the modified Levenberg–Marquardt algorithm for the artificial neural network training, IEEE Trans. Neural Netw. Learn. Syst. 32 (8) (2021) 3510–3524.
[2] J. de Jesús Rubio, Bat algorithm based control to decrease the control energy consumption and modified bat algorithm based control to increase the trajectory tracking accuracy in robots, Neural Netw. 161 (2023) 437–448.
[3] H.-S. Chiang, M.-Y. Chen, Y.-J. Huang, Wavelet-based EEG processing for epilepsy detection using fuzzy entropy and associative Petri net, IEEE Access 7 (2019) 103255–103262.
[4] J. de Jesús Rubio, D. Garcia, H. Sossa, I. Garcia, A. Zacarias, D. Mujica-Vargas, Energy processes prediction by a convolutional radial basis function network, Energy 284 (2023) 128470.
[5] A. López-González, J. Meda Campaña, E. Hernández Martínez, P.P. Contro, Multi robot distance based formation using parallel genetic algorithm, Appl. Soft Comput. 86 (2020) 105929.
[6] W.J. Scheirer, A. de Rezende Rocha, A. Sapkota, T.E. Boult, Toward open set recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (7) (2012) 1757–1772.
[7] S. Rahimzadeh Arashloo, 𝓁𝑝-Norm support vector data description, Pattern Recognit. 132 (2022) 108930.
[8] Y. Zheng, S. Wang, B. Chen, Multikernel correntropy based robust least squares one-class support vector machine, Neurocomputing 545 (2023) 126324.
[9] W. Wang, F. Chang, H. Mi, Intermediate fused network with multiple timescales for anomaly detection, Neurocomputing 433 (2021) 37–49.
[10] X. Xia, X. Pan, N. Li, X. He, L. Ma, X. Zhang, N. Ding, GAN-based anomaly detection: A review, Neurocomputing 493 (2022) 497–535.
[11] P. Bergmann, M. Fauser, D. Sattlegger, C. Steger, MVTec AD: A comprehensive real-world dataset for unsupervised anomaly detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9592–9600.
[12] S. Fatemifar, S.R. Arashloo, M. Awais, J. Kittler, Client-specific anomaly detection for face presentation attack detection, Pattern Recognit. 112 (2021) 107696.
[13] T. Schlegl, P. Seeböck, S.M. Waldstein, U. Schmidt-Erfurth, G. Langs, Unsupervised anomaly detection with generative adversarial networks to guide marker discovery, in: International Conference on Information Processing in Medical Imaging, Springer, 2017, pp. 146–157.
[14] R. Chaker, Z.A. Aghbari, I.N. Junejo, Social network model for crowd anomaly detection and localization, Pattern Recognit. 61 (2017) 266–281.
[15] X. Zhang, S. Yang, J. Zhang, W. Zhang, Video anomaly detection and localization using motion-field shape description and homogeneity testing, Pattern Recognit. 105 (2020) 107394.
[16] C. Yan, B. Gong, Y. Wei, Y. Gao, Deep multi-view enhancement hashing for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 43 (4) (2021) 1445–1451.
[17] C. Yan, Z. Li, Y. Zhang, Y. Liu, X. Ji, Y. Zhang, Depth image denoising using nuclear norm and learning graph model, ACM Trans. Multimed. Comput. Commun. Appl. 16 (4) (2020).
[18] C. Yan, Y. Hao, L. Li, J. Yin, A. Liu, Z. Mao, Z. Chen, X. Gao, Task-adaptive attention for image captioning, IEEE Trans. Circuits Syst. Video Technol. 32 (1) (2022) 43–51.
[19] C. Yan, T. Teng, Y. Liu, Y. Zhang, H. Wang, X. Ji, Precise no-reference image quality evaluation based on distortion identification, ACM Trans. Multimed. Comput. Commun. Appl. 17 (3s) (2021).
[20] C. Yan, L. Meng, L. Li, J. Zhang, Z. Wang, J. Yin, J. Zhang, Y. Sun, B. Zheng, Age-invariant face recognition by multi-feature fusion and decomposition with self-attention, ACM Trans. Multimed. Comput. Commun. Appl. 18 (1s) (2022).
[21] S.R. Arashloo, J. Kittler, Robust one-class kernel spectral regression, IEEE Trans. Neural Netw. Learn. Syst. 32 (3) (2020) 999–1013.
[22] P. Oza, V.M. Patel, One-class convolutional neural network, IEEE Signal Process. Lett. 26 (2) (2018) 277–281.
[23] W. Hu, M. Wang, Q. Qin, J. Ma, B. Liu, HRN: A holistic approach to one class learning, Adv. Neural Inf. Process. Syst. 33 (2020) 19111–19124.
[24] Z. Zhu, Y. Wang, D. Robinson, D. Naiman, R. Vidal, M. Tsakiris, Dual principal component pursuit: Improved analysis and efficient algorithms, in: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 31, Curran Associates, Inc., 2018.
[25] P. Schlachter, Y. Liao, B. Yang, Deep one-class classification using intra-class splitting, in: 2019 IEEE Data Science Workshop, DSW, IEEE, 2019, pp. 100–104.
[26] S.S. Khan, M.G. Madden, One-class classification: taxonomy of study and review of techniques, Knowl. Eng. Rev. 29 (3) (2014) 345–374.
[27] H. Hoffmann, Kernel PCA for novelty detection, Pattern Recognit. 40 (3) (2007) 863–874.
[28] N. Kwak, Principal component analysis based on L1-norm maximization, IEEE Trans. Pattern Anal. Mach. Intell. 30 (9) (2008) 1672–1680.
[29] G. Lerman, T. Maunu, Fast, robust and non-convex subspace recovery, Inf. Inference J. IMA 7 (2) (2018) 277–336.
[30] C. You, D.P. Robinson, R. Vidal, Provable self-representation based outlier detection in a union of subspaces, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3395–3404.
[31] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.
[32] J.T. Andrews, E.J. Morton, L.D. Griffin, Detecting anomalous data using auto-encoders, Int. J. Mach. Learn. Comput. 6 (1) (2016) 21.
[33] M. Sabokrou, M. Fayyaz, M. Fathy, Z. Moayed, R. Klette, Deep-anomaly: Fully convolutional neural network for fast anomaly detection in crowded scenes, Comput. Vis. Image Underst. 172 (2018) 88–97.
[34] J. Chen, S. Sathe, C. Aggarwal, D. Turaga, Outlier detection with autoencoder ensembles, in: Proceedings of the 2017 SIAM International Conference on Data Mining, SIAM, 2017, pp. 90–98.
[35] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, L. Bottou, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, J. Mach. Learn. Res. 11 (12) (2010).
[36] A. Makhzani, B. Frey, K-sparse autoencoders, 2014, arXiv:1312.5663.
[37] A. Makhzani, B.J. Frey, Winner-take-all autoencoders, in: C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 28, Curran Associates, Inc., 2015.
[38] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S.A. Siddiqui, A. Binder, E. Müller, M. Kloft, Deep one-class classification, in: International Conference on Machine Learning, PMLR, 2018, pp. 4393–4402.
[39] P. Perera, R. Nallapati, B. Xiang, OCGAN: One-class novelty detection using GANs with constrained latent representations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2898–2906.
[40] B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, R.C. Williamson, Estimating the support of a high-dimensional distribution, Neural Comput. 13 (7) (2001) 1443–1471.
[41] D.M. Tax, R.P. Duin, Support vector data description, Mach. Learn. 54 (1) (2004) 45–66.
[42] S. Liang, Y. Li, R. Srikant, Enhancing the reliability of out-of-distribution image detection in neural networks, in: International Conference on Learning Representations, 2018.
[43] L.-F. Chen, H.-Y.M. Liao, M.-T. Ko, J.-C. Lin, G.-J. Yu, A new LDA-based face recognition system which can solve the small sample size problem, Pattern Recognit. 33 (10) (2000) 1713–1726.
[44] L.N. Smith, N. Topin, Super-convergence: Very fast training of neural networks using large learning rates, in: Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, Vol. 11006, International Society for Optics and Photonics, 2019, 1100612.
[45] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[46] A. Krizhevsky, G. Hinton, Learning Multiple Layers of Features from Tiny Images, Tech. Rep., Department of Computer Science, University of Toronto, Toronto, Ontario, 2009, online: https://www.cs.toronto.edu/~kriz/cifar.html.
[47] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017, arXiv:cs.LG/1708.07747.
[48] H. Hojjati, N. Armanfard, DASVDD: Deep autoencoding support vector data descriptor for anomaly detection, 2021, CoRR abs/2106.05410, arXiv:2106.05410.
[49] M. Kim, J. Yu, J. Kim, T.-H. Oh, J.K. Choi, An iterative method for unsupervised robust anomaly detection under data contamination, IEEE Trans. Neural Netw. Learn. Syst. (2023) 1–13.
[50] D. Li, Q. Tao, J. Liu, H. Wang, Center-aware adversarial autoencoder for anomaly detection, IEEE Trans. Neural Netw. Learn. Syst. 33 (6) (2022) 2480–2493.
[51] Y. Li, P. Hu, Z. Liu, D. Peng, J.T. Zhou, X. Peng, Contrastive clustering, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, No. 10, 2021, pp. 8547–8555.

Salman Mohammad received the B.Sc. and M.Sc. degrees in electrical engineering and computer engineering, respectively, from Bilkent University, Ankara, Turkey. He is currently a Ph.D. candidate at the University of Zurich.

Shervin Rahimzadeh Arashloo received the Ph.D. degree from the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, U.K., in 2010. He is currently an Assistant Professor with the Department of Computer Engineering, Bilkent University, Ankara, Turkey. His research interests include pattern recognition, machine learning, and signal processing.