A simple and efficient architecture for trainable activation functions
A. Apicella, F. Isgrò, R. Prevete
Neurocomputing
Article history: Received 12 February 2019; Revised 17 June 2019; Accepted 14 August 2019; Available online xxx. Communicated by Dr Ding Wang.

Keywords: Neural networks; Machine learning; Activation functions; Trainable activation functions

Abstract: Automatically learning the best activation function for the task is an active topic in neural network research. At the moment, despite promising results, it is still challenging to determine a method for learning an activation function that is, at the same time, theoretically simple and easy to implement. Moreover, most of the methods proposed so far introduce new parameters or adopt different learning techniques. In this work, we propose a simple method to obtain a trained activation function by adding to the neural network local sub-networks with a small number of neurons. Experiments show that this approach can lead to better results than using a pre-defined activation function, without introducing the need to learn a large number of additional parameters.

https://doi.org/10.1016/j.neucom.2019.08.065
© 2019 Elsevier B.V. All rights reserved.
training process. To clarify this aspect, let us briefly summarise the training process, which corresponds to the minimisation of an error function with respect to the network parameters. An expression for the error function is (although many other forms are possible [3]):

E(\theta) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{c} \left[ y_k(x_n; \theta) - t_{kn} \right]^2

where y_k(x_n; θ) denotes the output of neuron k in the output layer as a function of both the input x_n and the network parameters θ, and t_{kn} represents the target value for output neuron k when the input is x_n. It is common to determine the solution for the network parameter values at the global minimum of the error function by iterating a gradient-based algorithm with the gradient computed by backpropagation [3]. Because the error function for MLFF networks is typically a highly non-linear function of the parameters (a non-convex surface), there may exist many local minima or saddle-points.
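As a concrete illustration of the quantities involved, the sum-of-squares error above and a single gradient-based update can be written directly in code; network, grad_E and the data arrays are placeholder names for whatever model and backpropagation routine are being trained (a sketch, not code from the paper).

import numpy as np

def sum_of_squares_error(network, theta, X, T):
    """E(theta) = 1/2 * sum_n sum_k [y_k(x_n; theta) - t_kn]^2."""
    residuals = network(X, theta) - T   # shape (N, c): one row of outputs per example
    return 0.5 * np.sum(residuals ** 2)

def gradient_step(theta, grad_E, X, T, lr=0.01):
    """One gradient-descent iteration; in practice grad_E is computed by backpropagation."""
    return theta - lr * grad_E(theta, X, T)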
In a learning process, the main problem is to avoid these stationary points or regions of the error function. Consequently, the choice of the activation functions has an essential impact on the shape of the error function and on the performance of the learning process. Besides, this choice can affect the number of hidden neurons and layers necessary to reach the desired precision of approximation [12,18]. For these reasons, there is a rich recent literature proposing activation functions that differ from the standard ones (e.g., sigmoid and tanh). In particular, the introduction of activation functions such as ReLU [38] and related functions, such as Leaky ReLU [34] and parametric ReLU [20], has contributed to reviving the interest of the scientific community in MLFF networks. The use of these new activation functions has been shown to improve the performance of the networks significantly in terms of accuracy and training speed, thanks to properties such as non-saturation (e.g. [15]). Another significant advance can be found in [7], where the learning is sped up by introducing the ELU activation function, and more recently in [27] with the introduction of SELU units.

Finding alternative functions that are likely to improve the results is still an open field of research. In this perspective, many recent papers compare neural architectures with different activation functions, as in [39], or propose to search for appropriate activation functions within a finite set of potentially interesting activation functions, as in [42].

A current field of research goes further and concentrates on learning proper activation functions from data, obtaining adaptable (or trainable) activation functions which are adjusted during the learning stage, allowing the network to exploit the data better (see, for example, [41]). Furthermore, any layer of the network could potentially have its own best activation function, increasing the number of choices to make at the design stage. On the other hand, it is not guaranteed that fixing the same function for each layer is the best decision. A way to tackle the problem is to learn the activation functions from the data, together with the other parameters of the network; the idea is to find proper activation functions that, together with the other network parameters, provide a good model for the data.

In this paper, we propose a novel method for learning activation functions in the context of fully connected and convolutional MLFF neural networks, and we assess the impact of this method on the performance of the network empirically.

The idea builds upon the possibility to obtain adaptable activation functions in terms of sub-networks with only one hidden layer. In a nutshell, we can replace each neuron having a non-linear activation function f with a neuron with an Identity activation function which sends its output to a one-hidden-layer sub-network with just one output neuron. This replacement enables us to obtain "any" activation function f, since a one-hidden-layer neural network can approximate arbitrarily well any continuous functional mapping from one finite-dimensional space to another. This approximation is feasible provided that the number of hidden neurons is sufficiently large, and the activation functions of the hidden neurons satisfy suitable properties, for example, being sigmoidal or, more in general, non-polynomial functions [3,40,49]. Thus, our neural network architecture with trainable activation functions is, again, an MLFF neural network, so that any classical approach applicable to MLFF networks can also be applied directly to our architecture with trainable activation functions. It is worth pointing out that, as we discuss in Sections 2 and 3, our architecture can be interpreted as a general framework which generalises several approaches recently proposed in the literature.
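The replacement described above rests on the classical form of a one-hidden-layer network with a scalar input and a scalar output, f(a) ≈ β_0 + Σ_j β_j g(w_j a + b_j). A minimal NumPy sketch of this form, applied element-wise to an array of pre-activations, is given below (the parameter names are illustrative, not taken from the paper):

import numpy as np

def one_hidden_layer_activation(a, w, b, beta, beta0, g=np.tanh):
    """Scalar one-hidden-layer network f(a) = beta0 + sum_j beta_j * g(w_j * a + b_j),
    applied element-wise to an array `a` of pre-activations."""
    hidden = g(np.multiply.outer(a, w) + b)   # shape a.shape + (k,)
    return beta0 + hidden @ beta

With a sufficiently large number k of hidden units and a suitable g, this form can reproduce the usual fixed activation functions as well as many others.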
The main contribution of our paper can be summarised in two main points:

• we propose a new type of trainable activation function which, in addition to producing encouraging results, generalises several approaches already present in the literature. The trainable function we propose holds several attractive properties: p1) it can approximate arbitrarily well any continuous one-to-one mapping defined on a compact domain; p2) any standard learning mechanism for neural networks can be directly and easily applied to the resulting network; p3) it does not add any further learning process in addition to those classically used for neural networks; p4) the added parameters are network weights or biases, therefore any classical regularisation method can be used, including the possibility of imposing sparsity using norms such as l1;
• we give a taxonomy of the existing studies on trainable activation functions, dividing them into three main groups and highlighting some properties which should be considered to evaluate whether a trainable activation function is suitable for a given task.

The paper is structured as follows. In the next section, we critically discuss our approach with respect to the current literature. In Section 3 we introduce our architecture. Section 4 is dedicated to the experimental assessment. Finally, Section 5 is left to the conclusions.

2. Related work

Over the last years, ReLU has become the standard activation function for deep neural models, surpassing classic functions such as sigmoid and tanh, which were preferred in the literature thanks to some useful properties, such as the ability to avoid saturation issues [16,38]. Different variations of the ReLU have been proposed [11,34,36]. All these functions are somehow different from ReLU, but once chosen, they stay fixed, and the choice of which one to use must be taken during the design stage, typically following some heuristic. A partial attempt to overcome this drawback moves in the direction of searching for the best activation function from a predefined set [32,42,59]. A limit to these techniques lies in the fact that there is no learning process for determining the activation functions: they are selected from a collection of standard functions. Approaches using trainable activation functions propose a more general framework. In this direction one can isolate three basic classes of approaches:

• parameterised standard activation functions;
• linear combination of one-variable functions;
• ensemble of standard activation functions.

In Sections 2.1, 2.2 and 2.3 we discuss these three types of approaches. In Section 2.4, we present other types of solutions that do not clearly fall into any of these classes.
Our discussion is based mainly on three characteristics: (1) how many new parameters are added to the network model; (2) the possibility or not of using standard techniques, within the neural network context, for learning the new parameters, such as backpropagation for computing the error function gradient or sparse methods; (3) the expressive power of the class of the trainable activation functions.

2.1. Parameterised standard activation functions

With the expression parameterised standard activation functions we refer to all the functions with a shape that is very similar to a given standard activation function, but whose diversity from the latter comes from a set of trainable parameters. The addition of these parameters, therefore, requires changes, even minimal ones, in the learning algorithm; for example, in the case of gradient-based methods with backpropagation, the partial derivatives of the error function with respect to these new parameters are needed. A first attempt to obtain a parameterised activation function is described in [24], where the proposed activation function uses two trainable parameters α, β to rule the shape of a classic sigmoidal function. Similar works on sigmoidal and hyperbolic tangent functions are discussed in [5,6,48,57,58].

More recently, the authors in [20] introduce PReLU, a parametric version of ReLU, which modifies the function shape when the argument is negative. However, the resulting function remains a modified version of the ReLU function that can change its shape only in a restricted domain. In [7] the ELU function is proposed, which outperforms the results obtained by ReLU on the CIFAR100 dataset, becoming one of the best activation functions known for the time being. However, it requires an external parameter α to be set. In [55] the PELU unit is proposed, where the need to manually set the α parameter is eliminated using two additional trainable parameters.

In all these approaches, although the number of added parameters for each node is low, the expressive power of the trainable activation functions is limited.
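For reference, the two parameterised families just mentioned have simple closed forms; a short NumPy sketch of PReLU (trainable slope α for negative arguments) and ELU (with its externally set parameter α):

import numpy as np

def prelu(x, alpha):
    """Parametric ReLU [20]: identity for x > 0, trainable slope alpha for x <= 0."""
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    """ELU [7]: identity for x > 0, alpha * (exp(x) - 1) for x <= 0; alpha is fixed by hand."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))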
2.2. Linear combination of one-variable functions

In this case, activation functions are modelled as linear combinations of one-variable functions. These one-variable functions can, in turn, have additional parameters. For example, in [1] each activation function is represented as a linear combination of S hinge-shaped functions. Each hinge-shaped function has just one parameter which regulates the location of the hinge. The number of additional parameters that must be learned when using this approach is 2SM, where M is the total number of hidden units in the neural network. During the learning phase, the network can be trained using standard methods based on backpropagation. Any continuous piecewise-linear function can be approximated arbitrarily well, provided that the number S of hinge-shaped functions is sufficiently large.
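A sketch of one possible hinge-shaped basis of this kind, a linear combination of S shifted rectifiers, each with a trainable location and coefficient (an illustration consistent with the description above, not necessarily the exact parameterisation of [1]):

import numpy as np

def hinge_combination(x, coeff, loc):
    """f(x) = sum_s coeff_s * max(0, x - loc_s): a linear combination of S hinge-shaped
    functions, each with a trainable location loc_s and coefficient coeff_s."""
    hinges = np.maximum(0.0, np.subtract.outer(x, loc))   # shape x.shape + (S,)
    return hinges @ coeff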
A similar approach has been recently proposed in [46]. In this case, the activation function is modelled as a linear combination of S fixed functions, where the S fixed functions are defined in terms of parametric kernel functions. The parameters of the kernel functions are computed before the network learning phase by some heuristic procedure applied to the data set. During the network learning phase, the number of additional parameters is SM; however, for the kernel functions, kS parameters must be computed in a prior stage (where k is the number of parameters of the kernel functions). In case of a correct choice of the parameters of the kernel functions, any continuous one-to-one function defined on a compact set can be approximated arbitrarily well, provided that the number of kernel functions is sufficiently large.

In [13], in the context of random weight artificial neural networks, a trainable activation function is proposed in terms of a polynomial function of degree p. The coefficients of the polynomial function are computed by linear regression. The number of added parameters corresponds to the number p + 1 of coefficients for each neuron.

2.3. Ensemble of standard activation functions

In this class of methods, activation functions are defined as an ensemble of a predetermined number of standard activation functions. For example, the authors of [25] designed an S-shaped activation function composed of three linear functions, taking inspiration from the Weber–Fechner [14] and Stevens [50] laws, while in [41] a mixture of ELU and ReLU is presented. In [52], the authors propose a trainable activation function in terms of a linear combination of n different, predefined and fixed functions such as the hyperbolic tangent (tanh), ReLU and ELU. The added parameters are the n coefficients of the linear combination for each hidden neuron. A similar strategy is proposed in [19], where the authors model the trainable activation function as a linear combination of a predefined set of n normalised fixed activation functions. The added parameters are the coefficients of the linear combination and a set of offset parameters, η and δ, which are used to dynamically offset the normalisation range of each predefined function. Moreover, to force the network to choose amongst the predefined activation functions during the learning process, it is required that all the coefficients of the linear combination sum to one. This last fact gives rise to another optimisation process unrelated to the standard learning procedure for neural networks.
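As an illustration of this class, the following sketch blends fixed activation functions with mixing coefficients forced to sum to one through a softmax over trainable logits (a generic sketch of the ensemble idea, not the exact schemes of [19] or [52]):

import numpy as np

def blended_activation(x, logits, basis=(np.tanh, lambda v: np.maximum(0.0, v))):
    """Ensemble activation: a convex combination of predefined functions (here tanh and ReLU);
    the softmax over `logits` makes the coefficients non-negative and sum to one."""
    w = np.exp(logits - np.max(logits))
    w = w / w.sum()
    return sum(wi * f(x) for wi, f in zip(w, basis))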
2.4. Other approaches

Two interesting and successful methods are Maxout [17,51] and NIN [31]. However, despite the excellent performances, both approaches move away from the concept of trainable activation function, insofar as the adaptable function does not correspond to the neuron activation function by which the neuron output is computed from a scalar value (the neuron input), according to the standard two-stage process. In fact, in Maxout, instead of computing the input a_i of a neuron i and then assigning it as input to a trainable activation function, n input values a_{ij} are computed, with j = 1, ..., n, by n trainable linear functions, and then the maximum is taken over the outputs of these linear functions. NIN instead represents an approach explicitly used in the case of convolutional neural networks, wherein the non-linear parts of the filters are replaced with a fully connected neural network acting on all channels simultaneously.

Another interesting way to tackle the problem is to use interpolating functions, as in [45,54]. For instance, in [45], the authors propose an adaptable activation function by using a cubic spline interpolation, whose q control points for each neuron are adapted in the optimisation phase. Methods external to the classic approaches used in neural networks are needed to train the q · m added parameters, where m is the number of hidden neurons.

3. System architecture

MLFF networks are composed of N elementary computing units (neurons), organised in L > 1 layers. The first layer of an MLFF network is composed of d input variables. Each neuron i belonging to a layer l, with l > 1, receives connections from the neurons (or input variables, in case l = 2) of the previous layer l − 1. Each connection is associated with a real value called weight, and the flow of computation proceeds from the first layer to the last layer (forward propagation). The last neuron layer is called the output layer, and the remaining neuron layers are called hidden layers.
Fig. 2. An example of VAF in a fully connected network (left) and in a convolutional layer (right).
shared weights principle for every network layer, the only added hyper-parameters are:

• the number k of hidden neurons of the VAF subnetwork;
• the activation function g(·) of the VAF hidden neurons.

It is worth emphasising the fact that, in our approach, we have a neural network architecture which is still an MLFF network with fixed activation functions, without adding any external structure or parameters. Let us clarify this aspect (see also Fig. 2). Given a neuron i belonging to the l-th layer of an MLFF network Net, its output is computed as z_i^l = f(a_i^l). In our approach, we replace the activation function f with the Identity function, thus obtaining z_i^l = a_i^l. Then, we add a VAF sub-network which receives as input variable the output z_i^l of the neuron i and computes its output as defined in Eq. (1). This output is then sent as input to the next layer l + 1 of Net. This procedure is performed on all the neurons of the MLFF network Net, except for the output layer. Hence, we obtain a new neural network VafNet, which is still an MLFF network with fixed activation functions; however, it behaves as Net equipped with trainable activation functions expressed in terms of Eq. (1). Consequently, we can use any standard training procedure (e.g., Stochastic Gradient Descent).

Fig. 2 shows how a VAF network can be integrated into a typical multilayer fully connected neural network (on the left) and in a convolutional neural network (on the right).

It is worth pointing out that, since a VAF subnetwork performs a linear combination of one-variable functions, any approach discussed in Section 2.2 can be considered as a particular case of this scheme, if the activation function g and the parameters α and β are appropriately chosen.
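A minimal PyTorch-style sketch of this replacement, assuming a dense layer and a single VAF whose few weights are shared by all the neurons of the layer (an illustration of the scheme, not the authors' implementation; class and variable names are ours):

import torch
import torch.nn as nn

class VAF(nn.Module):
    """One-hidden-layer scalar sub-network used as a trainable activation function.
    It is applied element-wise, so the same k hidden weights are shared by every
    neuron of the layer it follows."""
    def __init__(self, k=3, g=None):
        super().__init__()
        self.hidden = nn.Linear(1, k)      # inner weights and biases of the VAF
        self.out = nn.Linear(k, 1)         # output combination (weights and bias)
        self.g = g if g is not None else nn.ReLU()

    def forward(self, a):
        shape = a.shape
        z = self.g(self.hidden(a.reshape(-1, 1)))   # each pre-activation treated as a scalar input
        return self.out(z).reshape(shape)

# Usage sketch: replace a fixed activation with a VAF, e.g.
# net = nn.Sequential(nn.Linear(d, 50), VAF(k=3), nn.Linear(50, c))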
3.2. VAF network learning

As discussed above, our neural architecture is an MLFF network; consequently, it can be trained using any learning algorithm used to train MLFF networks. If the same VAF is used for all the neurons of the same layer, then there is the constraint that the weights of the VAF networks must be treated as shared weights. From an implementation point of view, this allows interpreting a VAF network as a convolution operator applied to the values a_i^l [31]. The values of the few weights of the VAF, being connected to each unit, influence the behaviour of the whole network. Therefore their behaviour must be taken into account during the training phase and, in particular, the initial value of the VAF weights can be crucial.

The training of a neural network usually starts by initialising the weights and biases randomly [3], or using some initialisation rule, as in [15]. Although it is feasible to follow these approaches in our case, it is also possible to choose different solutions for the initialisation of the VAF weights. A possible alternative is to select the initial weights of the VAF so that the VAF networks approximate a fixed function at the beginning of the learning process. For example, we can select a classic activation function such as ReLU or Sigmoid. In this way, the function would start from an already known valid form, and the training process should modify it just enough to improve the performance of the network. However, it must be kept in mind that this choice can have a negative effect on the solution produced by the learning process, since the resulting VAF can be too similar to the initial function.

If we want to initialise a VAF subnetwork to a given function f, we can consider two different cases:

• the target function f is the same as the output function of the VAF hidden nodes, that is f = g. In this case, we initialise to zero all the VAF weights and biases, except the weights associated to the connections of just one hidden neuron of the VAF subnetwork. More specifically, let us call h this neuron; then we set to one the weights associated to both the input and the output connection of the neuron h;
• f is any given function. We can use a regression algorithm to pretrain the VAF weights to approximate f (see the sketch after this list). More in detail, if we want to pretrain a VAF unit composed of k hidden nodes to approximate a function f, we can build a single-hidden-layer network T with the same VAF architecture. T is then formed by a single input neuron, a hidden layer of k nodes and a final level composed of a single output neuron whose activation function corresponds to the Identity function. Next, we sample a set of points R = {(x, y) : y = f(x)} and use it to train this network. The resulting model provides the initial weight set for the VAF units.
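A sketch of the second case: fitting a one-hidden-layer scalar network with k hidden nodes to samples (x, f(x)) by plain gradient descent on the squared error. It assumes g = tanh; the sampling range, learning rate and number of iterations are illustrative choices, not values from the paper.

import numpy as np

def pretrain_vaf(f, k=3, n_points=1000, lr=0.01, epochs=2000, seed=0):
    """Fit a one-hidden-layer scalar network (k tanh hidden nodes, linear output)
    to a target activation f, and return its weights as the initial VAF weight set."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3.0, 3.0, n_points)
    y = f(x)
    w, b = rng.normal(size=k), np.zeros(k)
    beta, beta0 = 0.1 * rng.normal(size=k), 0.0
    for _ in range(epochs):
        h = np.tanh(np.outer(x, w) + b)            # (n_points, k)
        err = beta0 + h @ beta - y                 # residuals of the current fit
        # gradients of the mean squared error
        g_beta, g_beta0 = h.T @ err / n_points, err.mean()
        dh = (1.0 - h ** 2) * np.outer(err, beta)  # backpropagate through tanh
        g_w, g_b = (dh * x[:, None]).mean(axis=0), dh.mean(axis=0)
        w, b = w - lr * g_w, b - lr * g_b
        beta, beta0 = beta - lr * g_beta, beta0 - lr * g_beta0
    return w, b, beta, beta0

# Example: initialise a VAF close to the logistic sigmoid
# w, b, beta, beta0 = pretrain_vaf(lambda x: 1.0 / (1.0 + np.exp(-x)))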
3.3. Comparison with other approaches

To the best of our knowledge, all the approaches proposed so far in the literature either have a limited expressive power of the trainable activation functions or add new learning mechanisms, constraints and categories of parameters. Instead, the approach proposed in this paper produces a feed-forward neural network with a trainable activation function using a feed-forward neural network with fixed activation functions, thus leaving the classic learning mechanisms and categories of parameters unaltered.

Our approach holds several attractive properties:

p1) the trainable activation function can approximate arbitrarily well any continuous one-to-one mapping defined on a compact domain;
p2) any standard learning mechanism for neural networks can be directly and easily applied;
p3) it does not add any further learning process in addition to those classically used for neural networks;
p4) the added parameters are network weights or biases; therefore, any classical regularisation method can be used, including the possibility of imposing sparsity using norms such as l1.

None of the known approaches holds all these properties. For example, property p1 is not satisfied by any of the approaches discussed in Sections 2.1 and 2.3; the methods discussed in Section 2.2 either do not satisfy property p1, as in [1], or property p3, as in [46].

Interestingly, as discussed in Section 3.1, our architecture represents a general framework including all the approaches described in Section 2.2, and some of the approaches in Section 2.3 can be included, insofar as any linear combination of m one-variable functions can be represented by a sub-network with m hidden neurons.

4. Experimental results

In this Section, we provide an experimental evaluation of the proposed trainable activation function architecture. With the aim of achieving a first evaluation of the validity of our approach, and some heuristic indications for the initialisation strategies of VAF networks, in Section 4.1 we report some preliminary experiments on Sensorless, a relatively small dataset used as a standard benchmark for supervised classification techniques. More in detail, Sensorless is composed of 49 features extracted from electric current drive signals. The drive has intact and defective components, which result in 11 different classes. This dataset is also used as a benchmark in recent studies, such as [46].

Based on the results of these experiments, we performed two different series of experiments to test our approach on MLFF networks. In the former, we consider standard MLFF networks (Section 4.2), and in the latter convolutional MLFF networks
Table 2
Properties of the datasets used for the experiments, and architectures of the neural networks applied to the data.
Name | Instances | Input Dim. | N. classes | Task | Neural Network Arch. | Ref.
Fig. 3. Accuracy of networks with different VAF subnets on each layer. Using the Sensorless dataset, we trained three small shallow networks composed of 5, 10 and 20 hidden neurons with fixed activation functions corresponding to either tanh or ReLU. In the figure, such networks are referred to as noVaf. Then we ran the same experiments substituting the fixed activation functions with VAF subnets. The number of VAF hidden neurons ranges over k ∈ {3, 5, 7, 9, 11, 15}, and the possible activation functions for the VAF hidden neurons are tanh and ReLU. Weight initialisation of the VAF subnets is either a classic random initialisation or a weight initialisation by which the VAF subnets behave very similarly to the activation functions of the VAF hidden neurons. VAF subnets on the same layer can have different weights.
functions are often profoundly different from the classic tanh and ReLU, and that they still exhibit a high degree of non-linearity.

4.2. Full-connected MLFF networks: classification and regression

In this experimental scenario we focus on evaluating the impact of both VAF subnetworks and VAF weight initialisation, using fully connected MLFF networks with 1 or 2 hidden layers trained on 20 publicly available datasets (see Table 2). 10 of these datasets are suited to classification problems, and 10 to regression problems. The number of hidden neurons varies in the set {10, 25, 50, 100}, but for neural networks with 2 hidden layers we only selected neural networks with a number of hidden neurons in the first layer larger than the number of hidden neurons in the second layer. We chose ReLU as the activation function g of the hidden neurons of the VAF sub-networks. Thus, for each dataset, we obtained 4 network models with 1 hidden layer and 6 with 2 hidden layers. Let us call net_{m1} and net_{m1,m2} the 1-hidden-layer and 2-hidden-layer networks, respectively, with m1, m2 ∈ {10, 25, 50, 100}. Based on what is discussed in Section 3, it is possible to associate with each network net_{m1} (net_{m1,m2}) a neural network vnet^k_{m1} (vnet^k_{m1,m2}) equipped with VAF subnetworks, where k is the number of hidden neurons of the VAF subnetworks.

On the basis of the results discussed in Section 4.1, we considered VAF subnets shared on each layer, and k = 3. In Table 3, we specify the neural network architectures used in this group of experiments. Neural network architectures were sorted in ascending order according to their complexity. We trained the networks following a usual learning method, described in Algorithm 1. More precisely, we used a batch approach, RProp [43], with "small" datasets, i.e., when the number of examples was less than 5 · 10^3; otherwise, we used a mini-batch approach, RMSProp [53].
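The optimiser choice just described can be expressed as a simple rule; a sketch using the corresponding PyTorch optimisers (the threshold is the one stated in the text; the function name is ours):

import torch

def make_optimizer(params, n_examples):
    """Batch RProp for 'small' datasets (fewer than 5*10^3 examples),
    mini-batch RMSProp otherwise, as described in the text."""
    if n_examples < 5_000:
        return torch.optim.Rprop(params)
    return torch.optim.RMSprop(params)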
Moreover, networks with VAF subnetworks were trained using
Fig. 4. Accuracy of networks with shared VAF subnets on each layer. Using the Sensorless dataset, we trained three small shallow networks composed of 5, 10 and 20 hidden neurons with fixed activation functions corresponding to either tanh or ReLU. In the figure, such networks are referred to as noVaf. Then we ran the same experiments substituting the fixed activation functions with VAF subnets. The number of VAF hidden neurons ranges over k ∈ {3, 5, 7, 9, 11, 15}, and the possible activation functions for the VAF hidden neurons are tanh and ReLU. Weight initialisation of the VAF subnets is either a classic random initialisation or a weight initialisation by which the VAF subnets behave very similarly to the activation functions of the VAF hidden neurons. VAF subnets on the same layer share the weights.
Table 3
Neural network architectures used in the first experimental scenario. See text for further details.
       #1         #2         #3         #4          #5             #6             #7              #8             #9              #10
Stand  net_10     net_25     net_50     net_100     net_{25,10}    net_{50,10}    net_{100,10}    net_{50,25}    net_{100,25}    net_{100,50}
VAF    vnet^3_10  vnet^3_25  vnet^3_50  vnet^3_100  vnet^3_{25,10} vnet^3_{50,10} vnet^3_{100,10} vnet^3_{50,25} vnet^3_{100,25} vnet^3_{100,50}
Fig. 5. Examples of trained VAF subnetworks. The y-axis shows the output value of the VAF. (5a) shows trained VAF subnetworks when a random weight initialisation is chosen; (5b) when a specific weight initialisation is chosen.
Table 4
Parameters of the first experimental scenario. See text for further details.
{10, 25, 50, 100} {3} {Random, ReLU} {RMSProp, RProp} 300 10
Table 5
RMSE for the experiments on the regression datasets. We used a K-Fold Cross-validation evaluation. In bold the best results.
The best neural architecture for each case is between parentheses.
Table 6
Accuracies for the experiments on the classification datasets. We used a K-Fold Cross-validation evaluation. In bold the best
results. The best neural architecture for each case is between parentheses.
Table 7
Parameters of the second experimental scenario. See text for further details.
Details of the CNN architectures used and of the learning process can be found in Table 7.

Results

In Table 8, we show the mean and standard deviation of the accuracy obtained on the three datasets Cifar10, MNIST and Fashion MNIST, using 10-fold cross-validation for the neural architectures in the first two rows of Table 7. The best results are in bold. The VAF approach outperforms the standard approach, especially when using a random initialisation scheme. Also in this experimental scenario, the standard deviations obtained by networks with VAF remain comparable to or lower than those without VAF subnetworks. We obtain a considerable improvement especially on the CIFAR10 dataset.

In Figs. 7 and 8, we show some examples of trained activation functions, respectively in vcnet_{A2} and vcnet_{A3}. It seems that, in case of initialisation as ReLU, the original shape remains mostly unchanged, giving a resulting function that looks like a PReLU/Leaky ReLU. A more interesting behaviour is given by random initialisation, where every VAF unit seems to exhibit more significant changes with respect to the original function. This more significant variability given by random initialisation with respect to ReLU initialisation
Fig. 6. Plots of some VAF behaviours at the end of the learning process. In (6a) for regression datasets, in (6b) for classification datasets.
Fig. 7. Examples of changes in a VAF in a 2 layer conv. network using random (top) and ReLU initialisation (bottom). The blue line is the start function, the orange line is
the learned function.
Table 8
Results of the convolutional networks with a 10-fold cross-validation with cnet_A.
Cifar10 0.857 ± 0.002 (cnet35 ) 0.875 ± 0.003 (vcnetA53 ) 0.860 ± 0.002 (vcnetA53 )
MNIST 0.991 ± 0.001 (cnetA52 ) 0.994 ± 0.001 (vcnetA52 ) 0.993 ± 0.002 (vcnetA52 )
Fashion MNIST 0.923 ± 0.001 (cnetA52 ) 0.935 ± 0.002 (vcnetA52 ) 0.934 ± 0.001 (vcnetA52 )
In the end, also with respect to the other two approaches with trainable activation functions known in the literature, our method shows better performance.

Table 9
Comparison between different activation functions on cnet_B.
standard ReLU (Accuracy) | VAF (k=5) (Accuracy) | KAF (D=20) (Accuracy) | NIN (Accuracy)
Fig. 8. Examples of a resulting VAF in a 3-layer conv. network using random (top) and ReLU initialisation (bottom). The blue line is the start function, the orange line is the learned function.
It is worth remarking that our approach distinguishes itself from other approaches proposed in the literature as it simultaneously satisfies the properties p1–p4 discussed in Section 3.3. These properties include a high expressive power of the trainable activation functions, no external parameter or learning process in addition to the classical ones for neural networks, and the possibility to use classical regularisation methods.

Interestingly, as we discussed in Section 3, our architecture represents a general framework which includes all the approaches described in Section 2.2 and some of the approaches in Section 2.3.

We evaluated our architecture empirically on three different groups of experiments. In the first one (see Section 4.1), we tested our approach using small shallow networks in order to set some heuristic choices for the VAF subnets. All the models provided with VAF subnets outperform the corresponding shallow networks, and the results support the possibility of using a shared VAF approach with a relatively low number of VAF hidden neurons. In the second series of experiments (see Section 4.2), we considered fully connected multi-layer feed-forward (MLFF) networks. More specifically, we selected 10 networks with 1 or 2 hidden layers. For each one of these 10 networks we built a corresponding network with VAF subnetworks (see Sections 3 and 4.2). We obtained a total of 20 distinct neural network architectures. These neural architectures were assessed and compared using a K-Fold Cross-Validation procedure (see Algorithm 2) on 20 different datasets (see Table 2), either for classification or regression tasks. The results reveal that the networks with VAF subnetworks perform better than the ones without VAF networks. More precisely, our approach outperforms that without VAF networks on 85% of the datasets; our approach produced worse results only on three of the datasets considered.

In the last set of experiments, we considered Convolutional Neural Networks with 2 and 3 layers and corresponding networks with VAF units, and we evaluated them using 3 image datasets (MNIST, Fashion MNIST and CIFAR10) in classification tasks. Again, the VAF subnetworks outperform networks with static units and selected state-of-the-art neural architectures (KAFNet and NIN) equipped with trainable activation functions.

In conclusion, VAF units have been tested using traditional MLFF networks and CNNs with various datasets and give better results compared with networks of similar design, both with traditional ReLU functions and with trainable activation functions. We showed that it is possible to obtain encouraging results without the need for complex designs, particular initialisation schemes or learning processes in addition to those classically used for neural networks. These results were achieved on datasets commonly used in machine learning research.

In massive and complex models, e.g. deep neural networks with a large number of layers, the choice of the appropriate activation functions can be a critical aspect. For example, deep networks improve their performances when one shifts from the sigmoid to the ReLU function family. We have seen that parameter changes during the learning phase in VAF subnetworks can be interpreted as activation function changes in the original neural network. However, we suspect that in the case of deep nets the positive effect of the VAF can be slightly weakened, insofar as the search for the appropriate values in such a wide parameter space could fail. We think better behaviour is possible in the case of transfer learning approaches [60], using VAF subnetworks in the last layers of a target network. In fact, in transfer learning, first a base network is trained on a given dataset and task, and then its weights are used to initialise a second target network to be trained on a target dataset and task. The first-layer features are general, and the last-layer features are specific to the target task; consequently, during the learning phase, the main changes occur on the last layers of the target network, thus limiting the search in the parameter space.

It would be interesting to test these hypotheses experimentally and to see how our approach behaves on more complex and massive datasets such as Imagenet [8]. Unfortunately, we currently do not have the computing power needed to perform experiments on more complex and more significant state-of-the-art neural networks and datasets. We plan to do so in the near future.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The work has been partially supported by the Italian national project Perception, Performativity and Cognitive Sciences - PRIN2015 Cod. 2015TM24JS_009, funded by MIUR (Ministero dell'Istruzione, dell'Università e della Ricerca).

References

[1] F. Agostinelli, M. Hoffman, P.J. Sadowski, P. Baldi, Learning activation functions to improve deep neural networks, Workshop of the International Conference on Learning Representations (2015), also available as arXiv:1412.6830.
[2] B. Antal, A. Hajdu, An ensemble-based system for automatic screening of diabetic retinopathy, Knowl. Based Syst. 60 (2014) 20–27.
[3] C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[4] F. Cao, T. Xie, Z. Xu, The estimate for approximation error of neural networks: a constructive approach, Neurocomputing 71 (4–6) (2008) 626–630.
[5] P. Chandra, Y. Singh, An activation function adapting training algorithm for sigmoidal feedforward networks, Neurocomputing 61 (2004) 429–437.
[6] C.-T. Chen, W.-D. Chang, A feedforward neural network with function shape autotuning, Neural Netw. 9 (4) (1996) 627–641.
[7] D.-A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), International Conference on Learning Representations (2016), also available as arXiv:1511.07289.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: a large-scale hierarchical image database, in: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[9] R.A. DeVore, K.I. Oskolkov, P.P. Petrushev, Approximation by feed-forward neural networks, Ann. Numer. Math. 4 (1996) 261–288.
[10] D. Dheeru, E. Karra Taniskidou, UCI machine learning repository, 2017.
[11] C. Dugas, Y. Bengio, F. Blisle, C. Nadeau, R. Garcia, Incorporating second-order functional knowledge for better option pricing, in: Proceedings of the NIPS, 2000, pp. 472–478.
[12] R. Eldan, O. Shamir, The power of depth for feedforward neural networks, in: Proceedings of the Conference on Learning Theory, 2016, pp. 907–940.
[13] Ö.F. Ertuğrul, A novel type of activation function in artificial neural networks: trained activation function, Neural Netw. 99 (2018) 148–157.
[14] G. Fechner, Elements of Psychophysics, Holt, Rinehart and Winston, New York, 1966.
[15] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the AISTATS, 9, 2010, pp. 249–256.
[16] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
[17] I.J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, Y. Bengio, Maxout networks, in: Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML'13, 2013, pp. III-1319–III-1327.
[18] N.J. Guliyev, V.E. Ismailov, A single hidden layer feedforward network with only one neuron in the hidden layer can approximate any univariate function, Neural Comput. 28 (7) (2016) 1289–1304.
[19] M. Harmon, D. Klabjan, Activation ensembles for deep neural networks, arXiv:1702.07790v1 (2017).
[20] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: surpassing human-level performance on imagenet classification, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[21] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Netw. 4 (2) (1991) 251–257.
[22] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Netw. 2 (5) (1989) 359–366, doi:10.1016/0893-6080(89)90020-8.
[23] https://www.dcc.fc.up.pt, 2009.
[24] Z. Hu, H. Shao, The study of neural network adaptive control systems, Control Decis. 7 (2) (1992) 361–366.
[25] X. Jin, C. Xu, J. Feng, Y. Wei, J. Xiong, S. Yan, Deep learning with s-shaped rectified linear activation units, in: Proceedings of the 30th AAAI Conference on Artificial Intelligence, AAAI Press, 2016, pp. 1737–1743.
[26] D. Kingma, J. Ba, Adam: a method for stochastic optimization, in: Proceedings of the International Conference on Learning Representations, 2014.
[27] G. Klambauer, T. Unterthiner, A. Mayr, S. Hochreiter, Self-normalizing neural networks, in: Proceedings of the Conference on Neural Information Processing Systems, 2017, arXiv:1706.02515v5.
[28] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images, Computer Science Department, University of Toronto, Technical Report, volume 1, 2009.
[29] Y. LeCun, C. Cortes, MNIST handwritten digit database, 2010, http://yann.lecun.com/exdb/mnist/.
[30] M. Leshno, V.Y. Lin, A. Pinkus, S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Netw. 6 (6) (1993) 861–867.
[31] M. Lin, Q. Chen, S. Yan, Network in network, arXiv:1312.4400v3 (2014).
[32] Y. Liu, X. Yao, Evolutionary design of artificial neural networks with different nodes, in: Proceedings of the International Conference on Evolutionary Computation, 1996, pp. 670–675.
[33] D.D. Lucas, R. Klein, J. Tannahill, D. Ivanova, S. Brandon, D. Domyancic, Y. Zhang, Failure analysis of parameter-induced simulation crashes in climate models, Geosci. Model Dev. 6 (4) (2013) 1157–1171.
[34] A.L. Maas, A.Y. Hannun, A.Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in: Proceedings of the ICML, 2013.
[35] K. Mansouri, T. Ringsted, D. Ballabio, R. Todeschini, V. Consonni, Quantitative structure-activity relationship models for ready biodegradability of chemicals, J. Chem. Inf. Model. 53 (4) (2013) 867–878.
[36] R. Memisevic, K.R. Konda, D. Krueger, Zero-bias autoencoders and the benefits of co-adapting features, arXiv:1402.3337v5 (2015).
[37] A. Montalto, S. Stramaglia, L. Faes, G. Tessitore, R. Prevete, D. Marinazzo, Neural networks with non-uniform embedding and explicit validation phase to assess Granger causality, Neural Networks 71 (2015) 159–171.
[38] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: Proceedings of the ICML, 2010, pp. 807–814.
[39] D. Pedamonti, Comparison of non-linear activation functions for deep neural networks on MNIST classification task, arXiv:1804.02763v1 (2018).
[40] A. Pinkus, Approximation theory of the MLP model in neural networks, Acta Numer. 8 (January) (1999) 143–195.
[41] S. Qian, H. Liu, C. Liu, S. Wu, H.-S. Wong, Adaptive activation functions in convolutional neural networks, Neurocomputing 272 (2018) 204–212.
[42] P. Ramachandran, B. Zoph, Q.V. Le, Searching for activation functions, in: Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018.
[43] M. Riedmiller, H. Braun, RPROP - a fast adaptive learning algorithm, Technical Report, Proceedings of the ISCIS VII, Universitat, 1992.
[44] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 2007.
[45] S. Scardapane, M. Scarpiniti, D. Comminiello, A. Uncini, Learning activation functions from data using cubic spline interpolation, in: Proceedings of the Italian Workshop on Neural Nets, Springer, 2017, pp. 73–83.
[46] S. Scardapane, S. Van Vaerenbergh, S. Totaro, A. Uncini, Kafnets: kernel-based non-parametric activation functions for neural networks, Neural Netw. (2018).
[47] M. Sikora, Ł. Wróbel, Application of rule induction algorithms for analysis of data collected by seismic hazard monitoring systems in coal mines, Arch. Min. Sci. 55 (1) (2010) 91–114.
[48] Y. Singh, P. Chandra, A class +1 sigmoidal activation functions for FFANNs, J. Econ. Dyn. Control 28 (1) (2003) 183–187.
[49] S. Sonoda, N. Murata, Neural network with unbounded activation functions is universal approximator, Appl. Comput. Harmon. Anal. 43 (2) (2017) 233–268.
[50] S.S. Stevens, On the psychophysical law, Psychol. Rev. 64 (3) (1957) 153.
[51] W. Sun, F. Su, L. Wang, Improving deep neural networks with multi-layer maxout networks and a novel initialization method, Neurocomputing 278 (2018) 34–40.
[52] L.R. Sütfeld, F. Brieger, H. Finger, S. Füllhase, G. Pipa, Adaptive blending units: trainable activation functions for deep neural networks, arXiv:1806.10064 (2018).
[53] T. Tieleman, G. Hinton, Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude, COURSERA: Neural Netw. Mach. Learn. 4 (2) (2012) 26–31.
[54] E. Trentin, Networks with trainable amplitude of activation functions, Neural Netw. 14 (4–5) (2001) 471–493.
[55] L. Trottier, P. Gigu, B. Chaib-draa, et al., Parametric exponential linear unit for deep convolutional neural networks, in: Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2017, pp. 207–214.
[56] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, arXiv:1708.07747v2 (2017).
[57] T. Yamada, T. Yabuta, Neural network controller using autotuning method for nonlinear functions, IEEE Trans. Neural Netw. 3 (4) (1992) 595–601.
[58] T. Yamada, T. Yabuta, Remarks on a neural network controller which uses an auto-tuning method for nonlinear functions, in: Proceedings of the IJCNN International Joint Conference on Neural Networks, vol. 2, 1992, pp. 775–780.
[59] X. Yao, Evolving artificial neural networks, Proc. IEEE 87 (1999) 1423–1447.
[60] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks? in: Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
[61] S. Yu, S. Jia, C. Xu, Convolutional neural networks for hyperspectral image classification, Neurocomputing 219 (2017) 88–98.
[62] H. Zhang, Z. Wang, D. Liu, A comprehensive review of stability analysis of continuous-time recurrent neural networks, IEEE Trans. Neural Netw. Learn. Syst. 25 (7) (2014) 1229–1262.
[63] X.-M. Zhang, Q.-L. Han, X. Ge, D. Ding, An overview of recent developments in Lyapunov–Krasovskii functionals and stability criteria for recurrent neural networks with time-varying delays, Neurocomputing 313 (2018) 392–401.

Andrea Apicella received the M.Sc. degree in Computer Science and the Ph.D. degree in Mathematics and Computer Science from the University of Naples Federico II, Italy, in 2014 and 2019, respectively. He is currently a Post-Doc in the Department of Information Technology and Electrical Engineering of the Federico II University of Naples. His research interests include computer vision, neural networks and biometric applications.

Francesco Isgrò was awarded a Master degree in Mathematics from Università di Palermo (1994), and a Ph.D. in Computer Science from Heriot-Watt University in Edinburgh (UK) in 2001. He worked as Research Assistant at Heriot-Watt University and Università di Genova (Italy), and since 2006 he has been a permanent staff member at Università di Napoli Federico II (Italy), where he teaches the courses of Computer Vision and Programming. His research interests cover various areas of image processing, computer vision and machine learning methods. He has co-authored more than 80 scientific papers. He has served on the technical and organising committees of several conferences, and has refereed papers for various journals.

Roberto Prevete (M.Sc. in Physics, Ph.D. in Mathematics and Computer Science) is an Assistant Professor of Computer Science at the Department of Electrical Engineering and Information Technologies (DIETI), University of Naples Federico II, Italy. His current research interests include computational models of brain mechanisms, machine learning and artificial neural networks and their applications. His research has been published in international journals such as Biological Cybernetics, Experimental Brain Research, Neurocomputing, Neural Networks and Behavioral and Brain Sciences.