A simple and efficient architecture for trainable activation functions
A. Apicella, F. Isgrò, R. Prevete
Neurocomputing
Article history: Received 12 February 2019; Revised 17 June 2019; Accepted 14 August 2019; Available online xxx. Communicated by Dr Ding Wang.

Keywords: Neural networks; Machine learning; Activation functions; Trainable activation functions

Abstract: Automatically learning the best activation function for the task is an active topic in neural network research. At the moment, despite promising results, it is still challenging to determine a method for learning an activation function that is, at the same time, theoretically simple and easy to implement. Moreover, most of the methods proposed so far introduce new parameters or adopt different learning techniques. In this work, we propose a simple method to obtain a trained activation function by adding to the neural network local sub-networks with a small number of neurons. Experiments show that this approach can lead to better results than using a pre-defined activation function, without introducing the need to learn a large number of additional parameters.

https://doi.org/10.1016/j.neucom.2019.08.065
© 2019 Elsevier B.V. All rights reserved.
training process. To clarify this aspect, let us briefly summarise the training process, which corresponds to the minimisation of an error function with respect to the network parameters. An expression for the error function is (although many other forms are possible [3]):

E(\theta) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{c} \left[ y_k(x_n; \theta) - t_{kn} \right]^2

where y_k(x_n; θ) denotes the output of neuron k in the output layer as a function of both the input x_n and the network parameters θ, and t_{kn} represents the target value for output neuron k when the input is x_n. It is common to determine the solution for the network parameter values at the global minimum of the error function by iterating a gradient-based algorithm with the gradient computed by backpropagation [3]. Because the error function for MLFF networks is typically a highly non-linear function of the parameters (a non-convex surface), there may exist many local minima or saddle-points.
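As a concrete illustration of the quantities involved, the sum-of-squares error above and a single gradient-based update can be written directly in code; network, grad_E and the data arrays are placeholder names for whatever model and backpropagation routine are being trained (a sketch, not code from the paper).

import numpy as np

def sum_of_squares_error(network, theta, X, T):
    """E(theta) = 1/2 * sum_n sum_k [y_k(x_n; theta) - t_kn]^2."""
    residuals = network(X, theta) - T   # shape (N, c): one row of outputs per example
    return 0.5 * np.sum(residuals ** 2)

def gradient_step(theta, grad_E, X, T, lr=0.01):
    """One gradient-descent iteration; in practice grad_E is computed by backpropagation."""
    return theta - lr * grad_E(theta, X, T)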
In a learning process, the main problem is to avoid these stationary points or regions of the error function. Consequently, the choice of the activation functions has an essential impact on the shape of the error function and on the performance of the learning process. Besides, this choice can affect the number of hidden neurons and layers necessary to reach the desired precision of approximation [12,18]. For these reasons, there is a rich recent literature proposing activation functions that differ from the standard ones (e.g., sigmoid and tanh). In particular, the introduction of activation functions such as ReLU [38] and related functions, such as Leaky ReLU [34] and parametric ReLU [20], has contributed to reviving the interest of the scientific community in MLFF networks. The use of these new activation functions has been shown to improve the performance of the networks significantly in terms of accuracy and training speed, thanks to properties such as non-saturation (e.g. [15]). Another significant advance can be found in [7], where the learning is sped up by introducing the ELU activation function, and more recently in [27] with the introduction of SELU units.

Finding alternative functions that are likely to improve the results is still an open field of research. In this perspective, many recent papers compare neural architectures with different activation functions, as in [39], or propose to search for appropriate activation functions within a finite set of potentially interesting activation functions, as in [42].

A current field of research goes further and concentrates on learning proper activation functions from data, obtaining adaptable (or trainable) activation functions which are adjusted during the learning stage, allowing the network to exploit the data better (see, for example, [41]). Furthermore, any layer of the network could potentially have its own best activation function, increasing the number of choices to make at the design stage. On the other hand, it is not guaranteed that fixing the same function for each layer is the best decision. A way to tackle the problem is to learn the activation functions from the data, together with the other parameters of the network; the idea is to find proper activation functions that, together with the other network parameters, provide a good model for the data.

In this paper, we propose a novel method for learning activation functions in the context of fully connected and convolutional MLFF neural networks, and we assess the impact of this method on the performance of the network empirically.

The idea builds upon the possibility to obtain adaptable activation functions in terms of sub-networks with only one hidden layer. In a nutshell, we can replace each neuron having a non-linear activation function f with a neuron with an Identity activation function which sends its output to a one-hidden-layer sub-network with just one output neuron. This replacement enables us to obtain "any" activation function f, since a one-hidden-layer neural network can approximate arbitrarily well any continuous functional mapping from one finite-dimensional space to another. This approximation is feasible provided that the number of hidden neurons is sufficiently large, and the activation functions of the hidden neurons satisfy suitable properties, for example, being sigmoidal or, more in general, non-polynomial functions [3,40,49]. Thus, our neural network architecture with trainable activation functions is, again, an MLFF neural network, so that any classical approach applicable to MLFF networks can also be applied directly to our architecture with trainable activation functions. It is worth pointing out that, as we discuss in Sections 2 and 3, our architecture can be interpreted as a general framework which generalises several approaches recently proposed in the literature.
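The replacement described above rests on the classical form of a one-hidden-layer network with a scalar input and a scalar output, f(a) ≈ β_0 + Σ_j β_j g(w_j a + b_j). A minimal NumPy sketch of this form, applied element-wise to an array of pre-activations, is given below (the parameter names are illustrative, not taken from the paper):

import numpy as np

def one_hidden_layer_activation(a, w, b, beta, beta0, g=np.tanh):
    """Scalar one-hidden-layer network f(a) = beta0 + sum_j beta_j * g(w_j * a + b_j),
    applied element-wise to an array `a` of pre-activations."""
    hidden = g(np.multiply.outer(a, w) + b)   # shape a.shape + (k,)
    return beta0 + hidden @ beta

With a sufficiently large number k of hidden units and a suitable g, this form can reproduce the usual fixed activation functions as well as many others.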
The main contribution of our paper can be summarised in two main points:

• we propose a new type of trainable activation function which, in addition to producing encouraging results, generalises several approaches already present in the literature. The trainable function we propose holds several attractive properties: p1) it can approximate arbitrarily well any continuous one-to-one mapping defined on a compact domain; p2) any standard learning mechanism for neural networks can be directly and easily applied to the resulting network; p3) it does not add any further learning process in addition to those classically used for neural networks; p4) the added parameters are network weights or biases, therefore any classical regularisation method can be used, including the possibility of imposing sparsity using norms such as l1;
• we give a taxonomy of the existing studies on trainable activation functions, dividing them into three main groups and highlighting some properties which should be considered to evaluate whether a trainable activation function is suitable for a given task.

The paper is structured as follows. In the next section, we critically discuss our approach with respect to the current literature. In Section 3 we introduce our architecture. Section 4 is dedicated to the experimental assessment. Finally, Section 5 is left to the conclusions.

2. Related work

Over the last years, ReLU has become the standard activation function for deep neural models, surpassing classic functions such as sigmoid and tanh, which were preferred in the literature thanks to some useful properties, such as the ability to avoid saturation issues [16,38]. Different variations of the ReLU have been proposed [11,34,36]. All these functions are somehow different from ReLU, but once chosen, they stay fixed, and the choice of which one to use must be taken during the design stage, typically following some heuristic. A partial attempt to overcome this drawback moves in the direction of searching for the best activation function from a predefined set [32,42,59]. A limit to these techniques lies in the fact that there is no learning process for determining the activation functions: they are selected from a collection of standard functions. Approaches using trainable activation functions propose a more general framework. In this direction one can isolate three basic classes of approaches:

• parameterised standard activation functions;
• linear combination of one-variable functions;
• ensemble of standard activation functions.

In Sections 2.1, 2.2 and 2.3 we discuss these three types of approaches. In Section 2.4, we present other types of solutions that do not clearly fall into any of these classes.
Our discussion is based mainly on three characteristics: (1) how many new parameters are added to the network model; (2) the possibility or not of using standard techniques, within the neural network context, for learning the new parameters, such as backpropagation for computing the error function gradient or sparse methods; (3) the expressive power of the class of the trainable activation functions.

2.1. Parameterised standard activation functions

With the expression parameterised standard activation functions we refer to all the functions with a shape that is very similar to a given standard activation function, but whose diversity from the latter comes from a set of trainable parameters. The addition of these parameters, therefore, requires changes, even minimal ones, in the learning algorithm; for example, in the case of gradient-based methods with backpropagation, the partial derivatives of the error function with respect to these new parameters are needed. A first attempt to obtain a parameterised activation function is described in [24], where the proposed activation function uses two trainable parameters α, β to rule the shape of a classic sigmoidal function. Similar works on sigmoidal and hyperbolic tangent functions are discussed in [5,6,48,57,58].

More recently, the authors in [20] introduce PReLU, a parametric version of ReLU, which modifies the function shape when the argument is negative. However, the resulting function remains a modified version of the ReLU function that can change its shape only in a restricted domain. In [7] the ELU function is proposed, which outperforms the results obtained by ReLU on the CIFAR100 dataset, becoming one of the best activation functions known for the time being. However, it requires an external parameter α to be set. In [55] the PELU unit is proposed, where the need to manually set the α parameter is eliminated using two additional trainable parameters.

In all these approaches, although the number of added parameters for each node is low, the expressive power of the trainable activation functions is limited.
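For reference, the two parameterised families just mentioned have simple closed forms; a short NumPy sketch of PReLU (trainable slope α for negative arguments) and ELU (with its externally set parameter α):

import numpy as np

def prelu(x, alpha):
    """Parametric ReLU [20]: identity for x > 0, trainable slope alpha for x <= 0."""
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    """ELU [7]: identity for x > 0, alpha * (exp(x) - 1) for x <= 0; alpha is fixed by hand."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))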
2.2. Linear combination of one-variable functions

In this case, activation functions are modelled as linear combinations of one-variable functions. These one-variable functions can, in turn, have additional parameters. For example, in [1] each activation function is represented as a linear combination of S hinge-shaped functions. Each hinge-shaped function has just one parameter which regulates the location of the hinge. The number of additional parameters that must be learned when using this approach is 2SM, where M is the total number of hidden units in the neural network. During the learning phase, the network can be trained using standard methods based on backpropagation. Any continuous piecewise-linear function can be approximated arbitrarily well, provided that the number S of hinge-shaped functions is sufficiently large.
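A sketch of one possible hinge-shaped basis of this kind, a linear combination of S shifted rectifiers, each with a trainable location and coefficient (an illustration consistent with the description above, not necessarily the exact parameterisation of [1]):

import numpy as np

def hinge_combination(x, coeff, loc):
    """f(x) = sum_s coeff_s * max(0, x - loc_s): a linear combination of S hinge-shaped
    functions, each with a trainable location loc_s and coefficient coeff_s."""
    hinges = np.maximum(0.0, np.subtract.outer(x, loc))   # shape x.shape + (S,)
    return hinges @ coeff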
A similar approach has been recently proposed in [46]. In this case, the activation function is modelled as a linear combination of S fixed functions, where the S fixed functions are defined in terms of parametric kernel functions. The parameters of the kernel functions are computed before the network learning phase by some heuristic procedure applied to the data set. During the network learning phase, the number of additional parameters is SM; however, for the kernel functions, kS parameters must be computed in a prior stage (where k is the number of parameters of the kernel functions). In case of a correct choice of the parameters of the kernel functions, any continuous one-to-one function defined on a compact set can be approximated arbitrarily well, provided that the number of kernel functions is sufficiently large.

In [13], in the context of random weight artificial neural networks, a trainable activation function is proposed in terms of a polynomial function of degree p. The coefficients of the polynomial function are computed by linear regression. The number of added parameters corresponds to the number p + 1 of coefficients for each neuron.

2.3. Ensemble of standard activation functions

In this class of methods, activation functions are defined as an ensemble of a predetermined number of standard activation functions. For example, the authors of [25] designed an S-shaped activation function composed of three linear functions, taking inspiration from the Weber–Fechner [14] and Stevens [50] laws, while in [41] a mixture of ELU and ReLU is presented. In [52], the authors propose a trainable activation function in terms of a linear combination of n different, predefined and fixed functions such as the hyperbolic tangent (tanh), ReLU and ELU. The added parameters are the n coefficients of the linear combination for each hidden neuron. A similar strategy is proposed in [19], where the authors model the trainable activation function as a linear combination of a predefined set of n normalised fixed activation functions. The added parameters are the coefficients of the linear combination and a set of offset parameters, η and δ, which are used to dynamically offset the normalisation range of each predefined function. Moreover, to force the network to choose amongst the predefined activation functions during the learning process, it is required that all the coefficients of the linear combination sum to one. This last fact gives rise to another optimisation process unrelated to the standard learning procedure for neural networks.
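As an illustration of this class, the following sketch blends fixed activation functions with mixing coefficients forced to sum to one through a softmax over trainable logits (a generic sketch of the ensemble idea, not the exact schemes of [19] or [52]):

import numpy as np

def blended_activation(x, logits, basis=(np.tanh, lambda v: np.maximum(0.0, v))):
    """Ensemble activation: a convex combination of predefined functions (here tanh and ReLU);
    the softmax over `logits` makes the coefficients non-negative and sum to one."""
    w = np.exp(logits - np.max(logits))
    w = w / w.sum()
    return sum(wi * f(x) for wi, f in zip(w, basis))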
2.4. Other approaches

Two interesting and successful methods are Maxout [17,51] and NIN [31]. However, despite the excellent performances, both approaches move away from the concept of trainable activation function, insofar as the adaptable function does not correspond to the neuron activation function by which the neuron output is computed from a scalar value (the neuron input), according to the standard two-stage process. In fact, in Maxout, instead of computing the input a_i of a neuron i and then assigning it as input to a trainable activation function, n input values a_{ij} are computed, with j = 1, ..., n, by n trainable linear functions, and then the maximum is taken over the outputs of these linear functions. NIN instead represents an approach explicitly used in the case of convolutional neural networks, wherein the non-linear parts of the filters are replaced with a fully connected neural network acting on all channels simultaneously.

Another interesting way to tackle the problem is to use interpolating functions, as in [45,54]. For instance, in [45], the authors propose an adaptable activation function by using a cubic spline interpolation, whose q control points for each neuron are adapted in the optimisation phase. Methods external to the classic approaches used in neural networks are needed to train the q · m added parameters, where m is the number of hidden neurons.

3. System architecture

MLFF networks are composed of N elementary computing units (neurons), organised in L > 1 layers. The first layer of an MLFF network is composed of d input variables. Each neuron i belonging to a layer l, with l > 1, receives connections from the neurons (or input variables, in case l = 2) of the previous layer l − 1. Each connection is associated with a real value called weight, and the flow of computation proceeds from the first layer to the last layer (forward propagation). The last neuron layer is called the output layer, and the remaining neuron layers are called hidden layers.
Fig. 2. An example of VAF in a fully connected network (left) and in a convolutional layer (right).
shared weights principle for every network layer, the only added hyper-parameters are:

• the number k of hidden neurons of the VAF subnetwork;
• the activation function g(·) of the VAF hidden neurons.

It is worth emphasising the fact that, in our approach, we have a neural network architecture which is still an MLFF network with fixed activation functions, without adding any external structure or parameters. Let us clarify this aspect (see also Fig. 2). Given a neuron i belonging to the l-th layer of an MLFF network Net, its output is computed as z_i^l = f(a_i^l). In our approach, we replace the activation function f with the Identity function, thus obtaining z_i^l = a_i^l. Then, we add a VAF sub-network which receives as input variable the output z_i^l of the neuron i and computes its output as defined in Eq. (1). This output is then sent as input to the next layer l + 1 of Net. This procedure is performed on all the neurons of the MLFF network Net, except for the output layer. Hence, we obtain a new neural network VafNet, which is still an MLFF network with fixed activation functions; however, it behaves as Net equipped with trainable activation functions expressed in terms of Eq. (1). Consequently, we can use any standard training procedure (e.g., Stochastic Gradient Descent).

Fig. 2 shows how a VAF network can be integrated into a typical multilayer fully connected neural network (on the left) and in a convolutional neural network (on the right).

It is worth pointing out that, since a VAF subnetwork performs a linear combination of one-variable functions, any approach discussed in Section 2.2 can be considered as a particular case of this scheme, if the activation function g and the parameters α and β are appropriately chosen.
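A minimal PyTorch-style sketch of this replacement, assuming a dense layer and a single VAF whose few weights are shared by all the neurons of the layer (an illustration of the scheme, not the authors' implementation; class and variable names are ours):

import torch
import torch.nn as nn

class VAF(nn.Module):
    """One-hidden-layer scalar sub-network used as a trainable activation function.
    It is applied element-wise, so the same k hidden weights are shared by every
    neuron of the layer it follows."""
    def __init__(self, k=3, g=None):
        super().__init__()
        self.hidden = nn.Linear(1, k)      # inner weights and biases of the VAF
        self.out = nn.Linear(k, 1)         # output combination (weights and bias)
        self.g = g if g is not None else nn.ReLU()

    def forward(self, a):
        shape = a.shape
        z = self.g(self.hidden(a.reshape(-1, 1)))   # each pre-activation treated as a scalar input
        return self.out(z).reshape(shape)

# Usage sketch: replace a fixed activation with a VAF, e.g.
# net = nn.Sequential(nn.Linear(d, 50), VAF(k=3), nn.Linear(50, c))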
3.2. VAF network learning

As discussed above, our neural architecture is an MLFF network; consequently, it can be trained using any learning algorithm used to train MLFF networks. If the same VAF is used for all the neurons of the same layer, then there is the constraint that the weights of the VAF networks must be treated as shared weights. From an implementation point of view, this allows interpreting a VAF network as a convolution operator applied to the values a_i^l [31]. The values of the few weights of the VAF, being connected to each unit, influence the behaviour of the whole network. Therefore their behaviour must be taken into account during the training phase and, in particular, the initial value of the VAF weights can be crucial.

The training of a neural network usually starts by initialising the weights and biases randomly [3], or using some initialisation rule, as in [15]. Although it is feasible to follow these approaches in our case, it is also possible to choose different solutions for the initialisation of the VAF weights. A possible alternative is to select the initial weights of the VAF so that the VAF networks approximate a fixed function at the beginning of the learning process. For example, we can select a classic activation function such as ReLU or Sigmoid. In this way, the function would start from an already known valid form, and the training process should modify it just enough to improve the performance of the network. However, it must be kept in mind that this choice can have a negative effect on the solution produced by the learning process, since the resulting VAF can be too similar to the initial function.

If we want to initialise a VAF subnetwork to a given function f, we can consider two different cases:

• the target function f is the same as the output function of the VAF hidden nodes, that is f = g. In this case, we initialise to zero all the VAF weights and biases, except the weights associated to the connections of just one hidden neuron of the VAF subnetwork. More specifically, let us call h this neuron; then we set to one the weights associated to both the input and the output connection of the neuron h;
• f is any given function. We can use a regression algorithm to pretrain the VAF weights to approximate f (see the sketch after this list). More in detail, if we want to pretrain a VAF unit composed of k hidden nodes to approximate a function f, we can build a single-hidden-layer network T with the same VAF architecture. T is then formed by a single input neuron, a hidden layer of k nodes and a final level composed of a single output neuron whose activation function corresponds to the Identity function. Next, we sample a set of points R = {(x, y) : y = f(x)} and use it to train this network. The resulting model provides the initial weight set for the VAF units.
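A sketch of the second case: fitting a one-hidden-layer scalar network with k hidden nodes to samples (x, f(x)) by plain gradient descent on the squared error. It assumes g = tanh; the sampling range, learning rate and number of iterations are illustrative choices, not values from the paper.

import numpy as np

def pretrain_vaf(f, k=3, n_points=1000, lr=0.01, epochs=2000, seed=0):
    """Fit a one-hidden-layer scalar network (k tanh hidden nodes, linear output)
    to a target activation f, and return its weights as the initial VAF weight set."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3.0, 3.0, n_points)
    y = f(x)
    w, b = rng.normal(size=k), np.zeros(k)
    beta, beta0 = 0.1 * rng.normal(size=k), 0.0
    for _ in range(epochs):
        h = np.tanh(np.outer(x, w) + b)            # (n_points, k)
        err = beta0 + h @ beta - y                 # residuals of the current fit
        # gradients of the mean squared error
        g_beta, g_beta0 = h.T @ err / n_points, err.mean()
        dh = (1.0 - h ** 2) * np.outer(err, beta)  # backpropagate through tanh
        g_w, g_b = (dh * x[:, None]).mean(axis=0), dh.mean(axis=0)
        w, b = w - lr * g_w, b - lr * g_b
        beta, beta0 = beta - lr * g_beta, beta0 - lr * g_beta0
    return w, b, beta, beta0

# Example: initialise a VAF close to the logistic sigmoid
# w, b, beta, beta0 = pretrain_vaf(lambda x: 1.0 / (1.0 + np.exp(-x)))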
3.3. Comparison with other approaches

To the best of our knowledge, all the approaches proposed so far in the literature either have a limited expressive power of the trainable activation functions or add new learning mechanisms, constraints and categories of parameters. Instead, the approach proposed in this paper produces a feed-forward neural network with a trainable activation function using a feed-forward neural network with fixed activation functions, thus leaving the classic learning mechanisms and categories of parameters unaltered.

Our approach holds several attractive properties:

p1) the trainable activation function can approximate arbitrarily well any continuous one-to-one mapping defined on a compact domain;
p2) any standard learning mechanism for neural networks can be directly and easily applied;
p3) it does not add any further learning process in addition to those classically used for neural networks;
p4) the added parameters are network weights or biases; therefore, any classical regularisation method can be used, including the possibility of imposing sparsity using norms such as l1.

None of the known approaches holds all these properties. For example, property p1 is not satisfied by any of the approaches discussed in Sections 2.1 and 2.3; the methods discussed in Section 2.2 either do not satisfy property p1, as in [1], or property p3, as in [46].

Interestingly, as discussed in Section 3.1, our architecture represents a general framework including all the approaches described in Section 2.2, and some of the approaches in Section 2.3 can be included, insofar as any linear combination of m one-variable functions can be represented by a sub-network with m hidden neurons.

4. Experimental results

In this Section, we provide an experimental evaluation of the proposed trainable activation function architecture. With the aim of achieving a first evaluation of the validity of our approach, and some heuristic indications for the initialisation strategies of VAF networks, in Section 4.1 we report some preliminary experiments on Sensorless, a relatively small dataset used as a standard benchmark for supervised classification techniques. More in detail, Sensorless is composed of 49 features extracted from electric current drive signals. The drive has intact and defective components, which result in 11 different classes. This dataset is also used as a benchmark in recent studies, such as [46].

Based on the results of these experiments, we performed two different series of experiments to test our approach on MLFF networks. In the former, we consider standard MLFF networks (Section 4.2), and in the latter convolutional MLFF networks
Table 2
Properties of the datasets used for the experiments, and architectures of the neural networks applied to the data.
Name | Instances | Input Dim. | N. classes | Task | Neural Network Arch. | Ref.
Fig. 3. Accuracy of networks with different VAF subnets on each layer. Using the Sensorless dataset, we trained three small shallow networks composed of 5, 10 and 20 hidden neurons with fixed activation functions corresponding to either tanh or ReLU. In the figure, such networks are referred to as noVaf. Then we ran the same experiments substituting the fixed activation functions with VAF subnets. The number of VAF hidden neurons ranges over k ∈ {3, 5, 7, 9, 11, 15}, and the possible activation functions for the VAF hidden neurons are tanh and ReLU. Weight initialisation of the VAF subnets is either a classic random initialisation or a weight initialisation by which the VAF subnets behave very similarly to the activation functions of the VAF hidden neurons. VAF subnets on the same layer can have different weights.
functions are often profoundly different from the classic tanh and ReLU, and that they still exhibit a high degree of non-linearity.

4.2. Full-connected MLFF networks: classification and regression

In this experimental scenario we focus on evaluating the impact of both VAF subnetworks and VAF weight initialisation, using fully connected MLFF networks with 1 or 2 hidden layers trained on 20 publicly available datasets (see Table 2). 10 of these datasets are suited to classification problems, and 10 to regression problems. The number of hidden neurons varies in the set {10, 25, 50, 100}, but for neural networks with 2 hidden layers we only selected neural networks with a number of hidden neurons in the first layer larger than the number of hidden neurons in the second layer. We chose ReLU as the activation function g of the hidden neurons of the VAF sub-networks. Thus, for each dataset, we obtained 4 network models with 1 hidden layer and 6 with 2 hidden layers. Let us call net_{m1} and net_{m1,m2} the 1-hidden-layer and 2-hidden-layer networks, respectively, with m1, m2 ∈ {10, 25, 50, 100}. Based on what is discussed in Section 3, it is possible to associate with each network net_{m1} (net_{m1,m2}) a neural network vnet^k_{m1} (vnet^k_{m1,m2}) equipped with VAF subnetworks, where k is the number of hidden neurons of the VAF subnetworks.

On the basis of the results discussed in Section 4.1, we considered VAF subnets shared on each layer, and k = 3. In Table 3, we specify the neural network architectures used in this group of experiments. Neural network architectures were sorted in ascending order according to their complexity. We trained the networks following a usual learning method, described in Algorithm 1. More precisely, we used a batch approach, RProp [43], with "small" datasets, i.e., when the number of examples was less than 5 · 10^3; otherwise, we used a mini-batch approach, RMSProp [53].
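The optimiser choice just described can be expressed as a simple rule; a sketch using the corresponding PyTorch optimisers (the threshold is the one stated in the text; the function name is ours):

import torch

def make_optimizer(params, n_examples):
    """Batch RProp for 'small' datasets (fewer than 5*10^3 examples),
    mini-batch RMSProp otherwise, as described in the text."""
    if n_examples < 5_000:
        return torch.optim.Rprop(params)
    return torch.optim.RMSprop(params)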
Moreover, networks with VAF subnetworks were trained using
Fig. 4. Accuracy of networks with shared VAF subnets on each layer. Using the Sensorless dataset, we trained three small shallow networks composed of 5, 10 and 20 hidden neurons with fixed activation functions corresponding to either tanh or ReLU. In the figure, such networks are referred to as noVaf. Then we ran the same experiments substituting the fixed activation functions with VAF subnets. The number of VAF hidden neurons ranges over k ∈ {3, 5, 7, 9, 11, 15}, and the possible activation functions for the VAF hidden neurons are tanh and ReLU. Weight initialisation of the VAF subnets is either a classic random initialisation or a weight initialisation by which the VAF subnets behave very similarly to the activation functions of the VAF hidden neurons. VAF subnets on the same layer share the weights.
Table 3
Neural network architectures used in the first experimental scenario. See text for further details.
       #1         #2         #3         #4          #5             #6             #7              #8             #9              #10
Stand  net_10     net_25     net_50     net_100     net_{25,10}    net_{50,10}    net_{100,10}    net_{50,25}    net_{100,25}    net_{100,50}
VAF    vnet^3_10  vnet^3_25  vnet^3_50  vnet^3_100  vnet^3_{25,10} vnet^3_{50,10} vnet^3_{100,10} vnet^3_{50,25} vnet^3_{100,25} vnet^3_{100,50}
Fig. 5. Examples of trained VAF subnetworks. The y-axis shows the output value of the VAF. (5a) shows trained VAF subnetworks when a random weight initialisation is chosen; (5b) when a specific weight initialisation is chosen.
Table 4
Parameters of the first experimental scenario. See text for further details.
{10, 25, 50, 100} {3} {Random, ReLU} {RMSProp, RProp} 300 10
Table 5
RMSE for the experiments on the regression datasets. We used a K-Fold Cross-validation evaluation. In bold the best results.
The best neural architecture for each case is between parentheses.
Table 6
Accuracies for the experiments on the classification datasets. We used a K-Fold Cross-validation evaluation. In bold the best
results. The best neural architecture for each case is between parentheses.
Table 7
Parameters of the second experimental scenario. See text for further details.
Details of the CNN architectures used and of the learning process can be found in Table 7.

Results

In Table 8, we show the mean and standard deviation of the accuracy obtained on the three datasets Cifar10, MNIST and Fashion MNIST, using 10-fold cross-validation for the neural architectures in the first two rows of Table 7. The best results are in bold. The VAF approach outperforms the standard approach, especially when using a random initialisation scheme. Also in this experimental scenario, the standard deviations obtained by networks with VAF remain comparable to or lower than those without VAF subnetworks. We obtain a considerable improvement especially on the CIFAR10 dataset.

In Figs. 7 and 8, we show some examples of trained activation functions, respectively in vcnet_{A2} and vcnet_{A3}. It seems that, in case of initialisation as ReLU, the original shape remains mostly unchanged, giving a resulting function that looks like a PReLU/Leaky ReLU. A more interesting behaviour is given by random initialisation, where every VAF unit seems to exhibit more significant changes with respect to the original function. This more significant variability given by random initialisation with respect to ReLU initialisation
Fig. 6. Plots of some VAF behaviours at the end of the learning process. In (6a) for regression datasets, in (6b) for classification datasets.
Fig. 7. Examples of changes in a VAF in a 2 layer conv. network using random (top) and ReLU initialisation (bottom). The blue line is the start function, the orange line is
the learned function.
Table 8
Results of the convolutional networks with a 10-fold cross-validation with cnet_A.
Cifar10 0.857 ± 0.002 (cnet35 ) 0.875 ± 0.003 (vcnetA53 ) 0.860 ± 0.002 (vcnetA53 )
MNIST 0.991 ± 0.001 (cnetA52 ) 0.994 ± 0.001 (vcnetA52 ) 0.993 ± 0.002 (vcnetA52 )
Fashion MNIST 0.923 ± 0.001 (cnetA52 ) 0.935 ± 0.002 (vcnetA52 ) 0.934 ± 0.001 (vcnetA52 )
In the end, also with respect to the other two approaches with trainable activation functions known in the literature, our method shows better performance.

Table 9
Comparison between different activation functions on cnet_B.
standard ReLU (Accuracy) | VAF (k=5) (Accuracy) | KAF (D=20) (Accuracy) | NIN (Accuracy)
Fig. 8. Examples of a resulting VAF in a 3-layer conv. network using random (top) and ReLU initialisation (bottom). The blue line is the start function, the orange line is the learned function.
It is worth remarking that our approach distinguishes itself from other approaches proposed in the literature as it simultaneously satisfies the properties p1–p4 discussed in Section 3.3. These properties include a high expressive power of the trainable activation functions, no external parameter or learning process in addition to the classical ones for neural networks, and the possibility to use classical regularisation methods.

Interestingly, as we discussed in Section 3, our architecture represents a general framework which includes all the approaches described in Section 2.2 and some of the approaches in Section 2.3.

We evaluated our architecture empirically on three different groups of experiments. In the first one (see Section 4.1), we tested our approach using small shallow networks in order to set some heuristic choices for the VAF subnets. All the models provided with VAF subnets outperform the corresponding shallow networks, and the results support the possibility of using a shared VAF approach with a relatively low number of VAF hidden neurons. In the second series of experiments (see Section 4.2), we considered fully connected multi-layer feed-forward (MLFF) networks. More specifically, we selected 10 networks with 1 or 2 hidden layers. For each one of these 10 networks we built a corresponding network with VAF subnetworks (see Sections 3 and 4.2). We obtained a total of 20 distinct neural network architectures. These neural architectures were assessed and compared using a K-Fold Cross-Validation procedure (see Algorithm 2) on 20 different datasets (see Table 2), either for classification or regression tasks. The results reveal that the networks with VAF subnetworks perform better than the ones without VAF networks. More precisely, our approach outperforms that without VAF networks on 85% of the datasets; our approach produced worse results only on three of the datasets considered.

In the last set of experiments, we considered Convolutional Neural Networks with 2 and 3 layers and corresponding networks with VAF units, and we evaluated them using 3 image datasets (MNIST, Fashion MNIST and CIFAR10) in classification tasks. Again, the VAF subnetworks outperform networks with static units and selected state-of-the-art neural architectures (KAFNet and NIN) equipped with trainable activation functions.

In conclusion, VAF units have been tested using traditional MLFF networks and CNNs with various datasets and give better results compared with networks of similar design, both with traditional ReLU functions and with trainable activation functions. We showed that it is possible to obtain encouraging results without the need for complex designs, particular initialisation schemes or learning processes in addition to those classically used for neural networks. These results were achieved on datasets commonly used in machine learning research.

In massive and complex models, e.g. deep neural networks with a large number of layers, the choice of the appropriate activation functions can be a critical aspect. For example, deep networks improve their performances when one shifts from the sigmoid to the ReLU function family. We have seen that parameter changes during the learning phase in VAF subnetworks can be interpreted as activation function changes in the original neural network. However, we suspect that in the case of deep nets the positive effect of the VAF can be slightly weakened, insofar as the search for the appropriate values in such a wide parameter space could fail. We think better behaviour is possible in the case of transfer learning approaches [60], using VAF subnetworks in the last layers of a target network. In fact, in transfer learning, first a base network is trained on a given dataset and task, and then its weights are used to initialise a second target network to be trained on a target dataset and task. The first-layer features are general, and the last-layer features are specific to the target task; consequently, during the learning phase, the main changes occur on the last layers of the target network, thus limiting the search in the parameter space.

It would be interesting to test these hypotheses experimentally and to see how our approach behaves on more complex and massive datasets such as Imagenet [8]. Unfortunately, we currently do not have the computing power needed to perform experiments on more complex and more significant state-of-the-art neural networks and datasets. We plan to do so in the near future.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The work has been partially supported by the Italian national project Perception, Performativity and Cognitive Sciences - PRIN2015 Cod. 2015TM24JS_009, funded by MIUR (Ministero dell'Istruzione, dell'Università e della Ricerca).

References

[1] F. Agostinelli, M. Hoffman, P.J. Sadowski, P. Baldi, Learning activation functions to improve deep neural networks, Workshop of the International Conference on Learning Representations (2015), also available as arXiv:1412.6830.
[2] B. Antal, A. Hajdu, An ensemble-based system for automatic screening of diabetic retinopathy, Knowl. Based Syst. 60 (2014) 20–27.
[3] C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[4] F. Cao, T. Xie, Z. Xu, The estimate for approximation error of neural networks: a constructive approach, Neurocomputing 71 (4–6) (2008) 626–630.
[5] P. Chandra, Y. Singh, An activation function adapting training algorithm for sigmoidal feedforward networks, Neurocomputing 61 (2004) 429–437.
[6] C.-T. Chen, W.-D. Chang, A feedforward neural network with function shape autotuning, Neural Netw. 9 (4) (1996) 627–641.
[7] D.-A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), International Conference on Learning Representations (2016), also available as arXiv:1511.07289.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: a large-scale hierarchical image database, in: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[9] R.A. DeVore, K.I. Oskolkov, P.P. Petrushev, Approximation by feed-forward neural networks, Ann. Numer. Math. 4 (1996) 261–288.
[10] D. Dheeru, E. Karra Taniskidou, UCI machine learning repository, 2017.
[11] C. Dugas, Y. Bengio, F. Blisle, C. Nadeau, R. Garcia, Incorporating second-order functional knowledge for better option pricing, in: Proceedings of the NIPS, 2000, pp. 472–478.
[12] R. Eldan, O. Shamir, The power of depth for feedforward neural networks, in: Proceedings of the Conference on Learning Theory, 2016, pp. 907–940.
[13] Ö.F. Ertuğrul, A novel type of activation function in artificial neural networks: trained activation function, Neural Netw. 99 (2018) 148–157.
[14] G. Fechner, Elements of Psychophysics, Holt, Rinehart and Winston, New York, 1966.
[15] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the AISTATS, 9, 2010, pp. 249–256.
[16] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
[17] I.J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, Y. Bengio, Maxout networks, in: Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML'13, 2013, pp. III-1319–III-1327.
[18] N.J. Guliyev, V.E. Ismailov, A single hidden layer feedforward network with only one neuron in the hidden layer can approximate any univariate function, Neural Comput. 28 (7) (2016) 1289–1304.
[19] M. Harmon, D. Klabjan, Activation ensembles for deep neural networks, arXiv:1702.07790v1 (2017).
[20] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: surpassing human-level performance on imagenet classification, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[21] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Netw. 4 (2) (1991) 251–257.
[22] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Netw. 2 (5) (1989) 359–366, doi:10.1016/0893-6080(89)90020-8.
[23] https://www.dcc.fc.up.pt, 2009.
[24] Z. Hu, H. Shao, The study of neural network adaptive control systems, Control Decis. 7 (2) (1992) 361–366.
[25] X. Jin, C. Xu, J. Feng, Y. Wei, J. Xiong, S. Yan, Deep learning with s-shaped rectified linear activation units, in: Proceedings of the 30th AAAI Conference on Artificial Intelligence, AAAI Press, 2016, pp. 1737–1743.
[26] D. Kingma, J. Ba, Adam: a method for stochastic optimization, in: Proceedings of the International Conference on Learning Representations, 2014.
[27] G. Klambauer, T. Unterthiner, A. Mayr, S. Hochreiter, Self-normalizing neural networks, in: Proceedings of the Conference on Neural Information Processing Systems, 2017, arXiv:1706.02515v5.
[28] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images, Computer Science Department, University of Toronto, Technical Report, volume 1, 2009.
[29] Y. LeCun, C. Cortes, MNIST handwritten digit database, 2010, http://yann.lecun.com/exdb/mnist/.
[30] M. Leshno, V.Y. Lin, A. Pinkus, S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Netw. 6 (6) (1993) 861–867.
[31] M. Lin, Q. Chen, S. Yan, Network in network, arXiv:1312.4400v3 (2014).
[32] Y. Liu, X. Yao, Evolutionary design of artificial neural networks with different nodes, in: Proceedings of the International Conference on Evolutionary Computation, 1996, pp. 670–675.
[33] D.D. Lucas, R. Klein, J. Tannahill, D. Ivanova, S. Brandon, D. Domyancic, Y. Zhang, Failure analysis of parameter-induced simulation crashes in climate models, Geosci. Model Dev. 6 (4) (2013) 1157–1171.
[34] A.L. Maas, A.Y. Hannun, A.Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in: Proceedings of the ICML, 2013.
[35] K. Mansouri, T. Ringsted, D. Ballabio, R. Todeschini, V. Consonni, Quantitative structure-activity relationship models for ready biodegradability of chemicals, J. Chem. Inf. Model. 53 (4) (2013) 867–878.
[36] R. Memisevic, K.R. Konda, D. Krueger, Zero-bias autoencoders and the benefits of co-adapting features, arXiv:1402.3337v5 (2015).
[37] A. Montalto, S. Stramaglia, L. Faes, G. Tessitore, R. Prevete, D. Marinazzo, Neural networks with non-uniform embedding and explicit validation phase to assess Granger causality, Neural Networks 71 (2015) 159–171.
[38] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: Proceedings of the ICML, 2010, pp. 807–814.
[39] D. Pedamonti, Comparison of non-linear activation functions for deep neural networks on MNIST classification task, arXiv:1804.02763v1 (2018).
[40] A. Pinkus, Approximation theory of the MLP model in neural networks, Acta Numer. 8 (January) (1999) 143–195.
[41] S. Qian, H. Liu, C. Liu, S. Wu, H.-S. Wong, Adaptive activation functions in convolutional neural networks, Neurocomputing 272 (2018) 204–212.
[42] P. Ramachandran, B. Zoph, Q.V. Le, Searching for activation functions, in: Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018.
[43] M. Riedmiller, H. Braun, RPROP - a fast adaptive learning algorithm, Technical Report, Proceedings of the ISCIS VII, Universitat, 1992.
[44] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 2007.
[45] S. Scardapane, M. Scarpiniti, D. Comminiello, A. Uncini, Learning activation functions from data using cubic spline interpolation, in: Proceedings of the Italian Workshop on Neural Nets, Springer, 2017, pp. 73–83.
[46] S. Scardapane, S. Van Vaerenbergh, S. Totaro, A. Uncini, Kafnets: kernel-based non-parametric activation functions for neural networks, Neural Netw. (2018).
[47] M. Sikora, Ł. Wróbel, Application of rule induction algorithms for analysis of data collected by seismic hazard monitoring systems in coal mines, Arch. Min. Sci. 55 (1) (2010) 91–114.
[48] Y. Singh, P. Chandra, A class +1 sigmoidal activation functions for FFANNs, J. Econ. Dyn. Control 28 (1) (2003) 183–187.
[49] S. Sonoda, N. Murata, Neural network with unbounded activation functions is universal approximator, Appl. Comput. Harmon. Anal. 43 (2) (2017) 233–268.
[50] S.S. Stevens, On the psychophysical law, Psychol. Rev. 64 (3) (1957) 153.
[51] W. Sun, F. Su, L. Wang, Improving deep neural networks with multi-layer maxout networks and a novel initialization method, Neurocomputing 278 (2018) 34–40.
[52] L.R. Sütfeld, F. Brieger, H. Finger, S. Füllhase, G. Pipa, Adaptive blending units: trainable activation functions for deep neural networks, arXiv:1806.10064 (2018).
[53] T. Tieleman, G. Hinton, Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude, COURSERA: Neural Netw. Mach. Learn. 4 (2) (2012) 26–31.
[54] E. Trentin, Networks with trainable amplitude of activation functions, Neural Netw. 14 (4–5) (2001) 471–493.
[55] L. Trottier, P. Gigu, B. Chaib-draa, et al., Parametric exponential linear unit for deep convolutional neural networks, in: Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2017, pp. 207–214.
[56] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, arXiv:1708.07747v2 (2017).
[57] T. Yamada, T. Yabuta, Neural network controller using autotuning method for nonlinear functions, IEEE Trans. Neural Netw. 3 (4) (1992) 595–601.
[58] T. Yamada, T. Yabuta, Remarks on a neural network controller which uses an auto-tuning method for nonlinear functions, in: Proceedings of the IJCNN International Joint Conference on Neural Networks, vol. 2, 1992, pp. 775–780.
[59] X. Yao, Evolving artificial neural networks, Proc. IEEE 87 (1999) 1423–1447.
[60] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks? in: Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
[61] S. Yu, S. Jia, C. Xu, Convolutional neural networks for hyperspectral image classification, Neurocomputing 219 (2017) 88–98.
[62] H. Zhang, Z. Wang, D. Liu, A comprehensive review of stability analysis of continuous-time recurrent neural networks, IEEE Trans. Neural Netw. Learn. Syst. 25 (7) (2014) 1229–1262.
[63] X.-M. Zhang, Q.-L. Han, X. Ge, D. Ding, An overview of recent developments in Lyapunov–Krasovskii functionals and stability criteria for recurrent neural networks with time-varying delays, Neurocomputing 313 (2018) 392–401.

Andrea Apicella received the M.Sc. degree in Computer Science and the Ph.D. degree in Mathematics and Computer Science from the University of Naples Federico II, Italy, in 2014 and 2019, respectively. He is currently a Post-Doc in the Department of Information Technology and Electrical Engineering of the Federico II University of Naples. His research interests include computer vision, neural networks and biometric applications.

Francesco Isgrò was awarded a Master degree in Mathematics from Università di Palermo (1994), and a Ph.D. in Computer Science from Heriot-Watt University in Edinburgh (UK) in 2001. He worked as Research Assistant at Heriot-Watt University and Università di Genova (Italy), and since 2006 he has been a permanent staff member at Università di Napoli Federico II (Italy), where he teaches the courses of Computer Vision and Programming. His research interests cover various areas of image processing, computer vision and machine learning methods. He has co-authored more than 80 scientific papers. He has served on the technical and organising committees of several conferences, and has refereed papers for various journals.

Roberto Prevete (M.Sc. in Physics, Ph.D. in Mathematics and Computer Science) is an Assistant Professor of Computer Science at the Department of Electrical Engineering and Information Technologies (DIETI), University of Naples Federico II, Italy. His current research interests include computational models of brain mechanisms, machine learning and artificial neural networks and their applications. His research has been published in international journals such as Biological Cybernetics, Experimental Brain Research, Neurocomputing, Neural Networks and Behavioral and Brain Sciences.