
Knowledge-Based Systems 218 (2021) 106779


Squashing activation functions in benchmark tests: Towards a more eXplainable Artificial Intelligence using continuous-valued logic

Daniel Zeltner a, Benedikt Schmid a, Gábor Csiszár c,b, Orsolya Csiszár a,b,∗

a Faculty of Basic Sciences, University of Applied Sciences Esslingen, Esslingen, Germany
b Physiological Controls Research Center, Óbuda University, Budapest, Hungary
c Institute of Materials Physics, University of Stuttgart, Stuttgart, Germany

Article history:
Received 10 October 2020
Received in revised form 19 November 2020
Accepted 12 January 2021
Available online 20 February 2021

Keywords:
XAI
Neural networks
Squashing function
Continuous logic
Fuzzy logic

Abstract

Over the past few years, deep neural networks have shown excellent results in multiple tasks; however, there is still an increasing need to address the problem of interpretability to improve model transparency, performance, and safety. Logical reasoning is a vital aspect of human intelligence. However, traditional symbolic reasoning methods are mostly based on hard rules, which may only have limited generalization capability. Achieving eXplainable Artificial Intelligence (XAI) by combining neural networks with soft, continuous-valued logic and multi-criteria decision-making tools is one of the most promising ways to approach this problem: by this combination, the black-box nature of neural models can be reduced. The continuous logic-based neural model uses so-called Squashing activation functions, a parametric family of functions that satisfy natural invariance requirements and contain rectified linear units as a particular case. This work demonstrates the first benchmark tests that measure the performance of Squashing functions in neural networks. Three experiments were carried out to examine their usability, and a comparison with the most popular activation functions was made for five different network types. The performance was determined by measuring accuracy, loss, and time per epoch. These experiments and the conducted benchmarks have proven that the use of Squashing functions is possible and similar in performance to conventional activation functions. Moreover, a further experiment was conducted by implementing nilpotent logical gates to demonstrate how simple classification tasks can be solved successfully and with high performance. The results indicate that due to the embedded nilpotent logical operators and the differentiability of the Squashing function, it is possible to solve classification problems where other commonly used activation functions fail.

© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

∗ Corresponding author at: Faculty of Basic Sciences, University of Applied Sciences Esslingen, Esslingen, Germany.
E-mail addresses: [email protected] (D. Zeltner), [email protected] (B. Schmid), [email protected] (G. Csiszár), [email protected] (O. Csiszár).

1. Introduction

While AI techniques, especially deep learning techniques, are revolutionizing the business and technology world, there is an increasing need to address the problem of interpretability and to improve model transparency, performance, and safety: a problem that is of vital importance to all our research community [1]. This challenge is closely related to the fact that although deep neural networks have achieved impressive experimental results, especially in image classification, they have shown to be surprisingly unstable when it comes to adversarial perturbations: minimal changes to the input image may cause the network to misclassify it. Moreover, although machine learning algorithms are capable of learning from a set of data and of producing a model that can be used to solve different problems, the values of the accuracy or the prediction error are not enough, since these numbers only provide an incomplete description of most real-world problems. The interpretability of a machine learning model gives insight into its internal functionality to explain the reasons why it suggests making certain decisions. In low-risk environments, such as film recommendation, only the predictive performance of the model counts. However, in high-risk environments, such as health care or the insurance sector, it is important to be able to explain why a decision was made. In this case, we need to have some reasonable explanations behind our decisions to be more convincing and also to avoid lawsuits claiming race-based, gender-based, or age-based bias [2]. Understandability means that we are able to describe the computations by using words from natural human language. One of the main challenges here is that natural language is often imprecise (fuzzy), making it difficult to find the relation between imprecise words and

mathematical algorithms. This experience led to the design of fuzzy logic by Zadeh; see, e.g., [3–7]. The reason why human-led control often leads to much better results than even the optimal automatic control is that humans use additional knowledge.

The basic idea of continuous logic is the replacement of the space of truth values {T, F} by a compact interval such as [0, 1]. This means that the inputs and the outputs of the extended logical gates are real numbers of the unit interval, representing truth values of inequalities. Quantifiers ∀x and ∃x are replaced by sup_x and inf_x, and logical connectives are continuous functions. Based on this idea, human thinking and natural language can be modeled in a more sophisticated way. Among other families of fuzzy logics, nilpotent fuzzy logic is beneficial from several perspectives. The fulfillment of the law of contradiction and the excluded middle, and the coincidence of the residual and the S-implication [8,9] make the application of nilpotent operators in logical systems promising. In [10–15], an abundant asset of operators was examined thoroughly: in [11], negations, conjunctions and disjunctions, in [12] implications, and in [13] equivalence operators. In [14], the aggregative operators were studied and a parametric form of a general operator oν was given by using a shifting transformation of the generator function. Varying the parameters, nilpotent conjunctive, disjunctive, aggregative (where a high input can compensate for a lower one) and negation operators can all be obtained. It was also demonstrated how the nilpotent generated operator can be applied for preference modeling. Moreover, as shown in [15], membership functions, which play a substantial role in the overall performance of fuzzy representation, can also be defined using a generator function.

In [16,17], the authors introduced an idea of achieving eXplainable Artificial Intelligence (XAI) by combining neural networks with nilpotent fuzzy logic as a promising way to approach the problem: by this combination, the black-box nature of neural models can be reduced, and the neural network-based models can become more interpretable, transparent, and safe [18,19]. In [16], the authors showed that in the field of continuous logic, nilpotent logical systems are the most suitable for neural computation. To achieve transparency using logical operators, it is desirable to choose an activation function that fits the theoretical background the best. In the formulae of the nilpotent operators, the cutting function plays a crucial role. Although piecewise linear functions are easy to handle, there are areas where the parameters are learned by a gradient-based optimization method. In this case, the lack of continuous derivatives makes the application impossible. To address this problem, a continuously differentiable approximation of the cutting function, the Squashing function, introduced in [20], was used in the nilpotent neural model [16,17]. In [21], the authors explain the empirical success of Squashing functions by showing that the formulas describing this family (that contain rectified linear units as a particular case) follow from natural invariance requirements.

This study provides the first benchmark tests that measure the performance of the Squashing functions in neural networks and also demonstrates the first steps towards the implementation of nilpotent logical gates. The article is organized as follows. After recalling the most important preliminaries in Section 2, Section 3 provides three experiments to demonstrate the usability of squashing activation functions together with a comparison with the most popular activation functions for five different network types. The performance of these functions was determined by measuring accuracy, loss and time per epoch. These experiments and the conducted benchmarks have proven that the use of the Squashing function is possible and similar in performance to the conventional activation functions. In Section 4, a further experiment was conducted by implementing nilpotent logical gates to demonstrate how simple classification tasks can be performed successfully and with high performance. Due to their low complexity, these networks are easy to interpret and analyze. The results indicate that due to nilpotent logical operators and the differentiability of the Squashing function, it is possible to solve classification problems, where other commonly used activation functions fail. Finally, in Section 5, the main results are summarized.

2. Preliminaries

First, we recall some important preliminaries regarding nilpotent logical systems and Squashing functions.

2.1. Nilpotent logical systems

As mentioned in the Introduction, in the field of continuous logic, nilpotent logical systems are the most suitable for neural computation. For more details about nilpotent systems see [10–16]. In [14], the authors examined a general parametric operator, o_ν(x), of nilpotent systems.

Definition 1 ([14]). Let f : [0, 1] → [0, 1] be an increasing bijection, ν ∈ [0, 1], and x = (x_1, ..., x_n), where x_i ∈ [0, 1], and let us define the general operator by

o_ν(x) = f^{-1}[ Σ_{i=1}^{n} (f(x_i) − f(ν)) + f(ν) ] = f^{-1}[ Σ_{i=1}^{n} f(x_i) − (n − 1) f(ν) ].    (1)

Remark 1. Note that the general operator for ν = 1 is conjunctive, for ν = 0 it is disjunctive, and for ν = ν∗ = f^{-1}(1/2) it is self-dual.

As a benefit of using this general operator, a conjunction, a disjunction and an aggregative operator differ only in one parameter of the general operator in Eq. (1). Additionally, the parameter ν has the semantic meaning of the level of expectation: maximal for the conjunction, neutral for the aggregation, and minimal for the disjunction.

Next, let us recall the weighted form of the general operator:

Definition 2 ([14]). Let w ∈ R^n, w_i > 0, f : [0, 1] → [0, 1] an increasing bijection with ν ∈ [0, 1], x = (x_1, ..., x_n), where x_i ∈ [0, 1]. The weighted general operator is defined by

o_{ν,w}(x) := f^{-1}[ Σ_{i=1}^{n} w_i (f(x_i) − f(ν)) + f(ν) ].    (2)

Note that if the weight vector is normalized, i.e. Σ_{i=1}^{n} w_i = 1, then

o_{ν,w}(x) = f^{-1}( Σ_{i=1}^{n} w_i f(x_i) ).    (3)

For future application, we introduce a threshold-based operator in the following way.

Definition 3 ([14]). Let w ∈ R^n, w_i > 0, x = (x_1, ..., x_n) ∈ [0, 1]^n, ν = (ν_1, ..., ν_n) ∈ [0, 1]^n and let f : [0, 1] → [0, 1] be a strictly increasing bijection. Let us define the threshold-based nilpotent operator by

o_{ν,w}(x) = f^{-1}[ Σ_{i=1}^{n} w_i (f(x_i) − f(ν_i)) + f(ν) ] = f^{-1}[ Σ_{i=1}^{n} w_i f(x_i) + C ],    (4)

Table 1
The most important two-variable operators o_w(x, y).

                          w_1    w_2    C      o_w(x, y)                        for f(x) = x       Notation
Logical operators
  Disjunction             1      1      0      f^{-1}[f(x) + f(y)]              [x + y]            d(x, y)
  Conjunction             1      1      −1     f^{-1}[f(x) + f(y) − 1]          [x + y − 1]        c(x, y)
  Implication             −1     1      1      f^{-1}[f(y) − f(x) + 1]          [y − x + 1]        i(x, y)
Multi-criteria decision tools
  Arithmetic mean         0.5    0.5    0      f^{-1}[(f(x) + f(y))/2]          [(x + y)/2]        m(x, y)
  Preference              −0.5   0.5    0.5    f^{-1}[(f(y) − f(x) + 1)/2]      [(y − x + 1)/2]    p(x, y)
  Aggregative operator    1      1      −0.5   f^{-1}[f(x) + f(y) − 1/2]        [x + y − 1/2]      a(x, y)

where

C = f(ν) − Σ_{i=1}^{n} w_i f(ν_i).    (5)

Remark 2. Note that the Equation in (4) describes the perceptron model in neural computation. Here, the parameters all have semantic meanings as importance (weights), decision level and level of expectancy. Table 1 shows how the logical operators and some multi-criteria decision tools, like the preference operator, can be implemented in neural models.

The most commonly used operators for n = 2 and for special values of w_i and C, also for f(x) = x, are listed in Table 1.

2.2. Squashing function as a differentiable parametric approximation of the cutting function

As highlighted in the Introduction, in the formulae of the nilpotent operators, the cutting function plays a critical role (see Table 1). To address the problem of the lack of differentiability, the following approximation, the so-called Squashing function (introduced in [20]), was used in the nilpotent neural model [16,17].

Definition 4. The Squashing function [15,20] is defined as

S_{a,λ}^{(β)}(x) = (1/(λβ)) ln[ (1 + e^{β(x−(a−λ/2))}) / (1 + e^{β(x−(a+λ/2))}) ] = (1/(λβ)) ln[ σ_{a+λ/2}^{(−β)}(x) / σ_{a−λ/2}^{(−β)}(x) ],    (6)

where x, a, λ, β ∈ R, λ, β ≠ 0, and σ_d^{(β)}(x) denotes the logistic function:

σ_d^{(β)}(x) = 1 / (1 + e^{−β·(x−d)}).    (7)

Fig. 1. Squashing functions for a = 0.5, λ = 1, for different β values (β = 0.5, 1, 2, 5, and 50).

The Squashing function given in Definition 4 is a continuously differentiable approximation of the generalized cutting function by means of sigmoid functions (see Fig. 1). By increasing the value of β, the Squashing function approaches the generalized cutting function. In other words, β drives the accuracy of the approximation, while the parameters a and λ determine the center and the width. The error of the approximation can be upper bounded by constant/β, which means that by increasing the parameter β, the error decreases by the same order of magnitude. The derivatives of the Squashing function are easy to calculate and can be expressed by sigmoid functions and the function itself:

∂S_{a,λ}^{(β)}(x)/∂x = (1/λ) ( σ_{a−λ/2}^{(β)}(x) − σ_{a+λ/2}^{(β)}(x) ).    (8)

In [21], it is shown that the formulas describing the squashing functions follow from natural symmetry requirements and contain linear units as a particular case.

3. Implementation and benchmark tests

This section describes the exact implementation of the Squashing function in the PyTorch framework (GitHub Repository [22]). To verify this implementation, three experiments were conducted. First, the basics of these experiments are introduced and the datasets are presented. The results obtained by using Squashing functions in different benchmark tasks, including tests in different neural network architectures and a comparison with commonly used activation functions, indicate that the Squashing function is capable of performing similarly to other popular activation functions. As a starting point, in Section 3.1, the behavior of the Squashing function with a = 0.5, λ = 1, and a learnable β parameter is investigated. Choosing a suitable initial value for β is not trivial and is sensitive to the environment. The greater the value of β is, the better the approximation of the cutting function gets. However, choosing a greater |β| value also means ignoring a wider spectrum of inputs and can lead to the vanishing gradient problem (see Fig. 1). For future consideration, it would be interesting to set different β parameters for the input layers and for the logical operators.

3.1. Testing of the squashing function

The test phase is divided into three experiments. The goal is to see if the Squashing function could solve simple classification problems. Each of these experiments consisted of classifying a set of data points that are distributed in different shapes. The dataset is composed of two balanced classes, each containing 250 points. In the first experiment, two point clouds are to be separated by a straight line. In the second experiment, these point clouds are arranged in circular configurations, as seen in Fig. 2b. In the last experiment, the point sets formed two intertwined spirals.
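To make the construction concrete, the sketch below implements Definition 4 / Eq. (6) as a PyTorch module with an optionally learnable β. The class name, the default parameter values, and the use of softplus for the logarithmic terms are our illustrative choices and not necessarily the code of the authors' repository [22].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquashingActivation(nn.Module):
    """Illustrative sketch of Eq. (6): S_{a,lambda}^{(beta)}(x)."""

    def __init__(self, a: float = 0.5, lam: float = 1.0,
                 beta_init: float = 0.1, learnable_beta: bool = True):
        super().__init__()
        self.a, self.lam = a, lam
        # beta controls how closely the function approximates the cutting function
        self.beta = nn.Parameter(torch.tensor(float(beta_init)),
                                 requires_grad=learnable_beta)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, lam, beta = self.a, self.lam, self.beta
        # ln(1 + e^z) written with softplus for numerical stability:
        # S = [softplus(beta*(x - a + lam/2)) - softplus(beta*(x - a - lam/2))] / (lam*beta)
        upper = F.softplus(beta * (x - (a - lam / 2)))
        lower = F.softplus(beta * (x - (a + lam / 2)))
        return (upper - lower) / (lam * beta)
```

Since the module is differentiable, autograd reproduces the derivative of Eq. (8) automatically; with f(x) = x, the Table 1 operators then reduce to applying this activation to affine combinations such as x + y − 1 for the conjunction.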

Fig. 2. Experimental datasets - (a) Gaussian data, (b) circle data, (c) spiral data.

Table 2
Experiment 1 - Hyperparameters.
Type of layer      Number of input features    Number of output features
Fully-connected    2                           2

Table 3
Experiment 2 - Hyperparameters.
Type of layer      Number of input features    Number of output features
Fully-connected    2                           8
Fully-connected    8                           2

Table 4
Experiment 3 - Hyperparameters.
Type of layer      Number of input features    Number of output features
Fully-connected    2                           64
Fully-connected    64                          128
Fully-connected    128                         2

3.1.1. Experiment 1: Classification of Gaussian data

The task of the first experiment is solved with a one-layer feedforward network. The model architecture shown in Table 2 uses a fully-connected layer with two input and two output features. As a cost function, the cross-entropy function is applied, while the Adam optimization algorithm is utilized for the optimization procedure. The training process takes 10 epochs with a learning rate of η = 0.1.

Fig. 3 shows the visualization of the optimization process for 10 epochs.

Fig. 3. Results of Experiment 1 - Classification of Gaussian data.
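A minimal training sketch for this setup is given below, reusing the hypothetical SquashingActivation module from Section 2.2. The synthetic Gaussian point clouds only stand in for the data of Fig. 2a, whose exact generation is not specified in the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Two synthetic Gaussian clouds with 250 points each (placeholder data).
X = torch.cat([torch.randn(250, 2) + 2.0, torch.randn(250, 2) - 2.0])
y = torch.cat([torch.zeros(250, dtype=torch.long), torch.ones(250, dtype=torch.long)])

# One fully-connected layer (Table 2) followed by the Squashing activation;
# cross-entropy loss, Adam optimizer, 10 epochs, learning rate 0.1.
model = nn.Sequential(nn.Linear(2, 2), SquashingActivation(beta_init=0.1))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
```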

3.1.2. Experiment 2: Classification of circle data


The problem of the second experiment is solved with a two-
layer feedforward network. The model architecture shown in
Table 3 uses an input layer with two input features and one out-
put layer with eight input features. Similarly to experiment one,
a cross-entropy function is applied with the Adam optimization
algorithm. The training process takes 150 epochs, with a learning
rate of η = 0.1. The visualization of the optimization process for
150 epochs can be seen in Fig. 4.

Fig. 4. Results of Experiment 2 - Classification of circle data.

3.1.3. Experiment 3: Classification of spiral data

In the third experiment, a three-layer feedforward network is employed. The model architecture shown in Table 4 uses an input layer with two input features, one hidden layer with 64 input features, and an output layer with 128 input features. Similarly to the first two experiments, a cross-entropy function is applied with the Adam optimization algorithm. The training process takes 2000 epochs, with a learning rate of η = 0.001. The visualization of the optimization process for 2000 epochs can be seen in Fig. 5.
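For the deeper variants, the same pattern scales up; the sketch below shows one possible wiring of the three-layer spiral model of Table 4, where placing a Squashing activation after each hidden layer is our assumption.

```python
import torch.nn as nn

# Hypothetical Experiment 3 model (Table 4): 2 -> 64 -> 128 -> 2,
# with a Squashing activation after each hidden layer.
spiral_model = nn.Sequential(
    nn.Linear(2, 64), SquashingActivation(beta_init=0.1),
    nn.Linear(64, 128), SquashingActivation(beta_init=0.1),
    nn.Linear(128, 2),
)
```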

3.1.4. Results
Fig. 6 shows the learning curves obtained in the experiments,
which illustrate the evolution of the cost function for the train-
ing set. By observing the loss curves, we can conclude that the
Squashing function is capable of solving the tasks of classifying
Gaussian, circle, and spiral data. The optimization process of
the three experiments clearly shows success in separating both
classes. For more computational details see Table 5.

Fig. 5. Results of Experiment 3 - Classification of spiral data.

3.2. Benchmarking on FASHION-MNIST

In this section, a benchmark test of various activation functions is presented to compare the performance of the Squashing function with other popular activation functions in a classification
problem on FASHION-MNIST, a dataset consisting of 60000 training images and 10000 test images. Each of them is a grayscale image of 28 by 28 pixels in size, showing a piece of clothing from Zalando, distributed in 10 different categories. In the benchmarks, it should be determined whether the Squashing function can deliver similar performance results as conventional activation functions. The architectures used to solve the classification of the benchmark tests are: LeNet, Inception-v3, ShuffleNet-v2, SqueezeNet, and DenseNet-121. A more detailed description of the individual networks can be found in Section 3.3. For each network, separate runs for the following activation functions were performed: Rectified Linear Unit (ReLU), Sigmoid function, Hyperbolic tangent (Tanh), and Squashing function. Because of the learnable parameter in the Squashing function, the run with this function was performed twice: first with a dynamic, learnable β, and then with a static value for β. Following the same strategy as in the experiments presented in Section 3.1, a cross-entropy function is applied with the Adam optimization algorithm. The value of the learning rate is set to 0.0001 and the size of the batches to 32. The total training process for each network architecture is 50 epochs.

Table 5
Determination of the β parameter in the Gaussian, circle and spiral spatial configurations.
Experiment    β_init / β_final    Epochs    Loss    Train accuracy    Test accuracy
Gaussian      0.1 / 1.4           10        0.58    1                 1
Circle        10^{-6} / 0.565     150       0.33    1                 1
Spiral        0.1 / 1.4924        2000      0.36    0.96              0.94

Fig. 6. Results Cross-Entropy - (a) Gaussian, (b) circle, (c) spiral.

3.3. Networks

3.3.1. LeNet

The prototype of the LeNet model was introduced in the year 1989 by Yann LeCun et al. [23]. They combined a Convolutional Neural Network trained by the backpropagation algorithm to learn the convolution kernel coefficients directly from images. This prototype was able to recognize handwritten ZIP code numbers for the United States Postal Service and became the foundation of Convolutional Neural Networks. A few years later, in 1998, LeCun et al. published a paper about gradient-based learning applied to document recognition, in which they reviewed different methods of recognizing handwritten characters on paper and used standard handwritten digits to identify benchmark tasks [24]. The results showed that the network exceeded all other models. The most common form of the LeNet model is the LeNet-5 architecture. The LeNet-5 is a seven-layer neural network architecture (excluding inputs) that consists of two alternate convolutional and pooling layers followed by three fully connected layers (dense layers) at the end [25]. This network was successfully used in ATM check readers which could automatically read the check amount by recognizing hand-written numbers on checks.

3.3.2. Inception-v3

The Inception-v3 network was proposed by a research group at Google in 2015 and is a 42-layer deep learning network with higher computational efficiency and fewer parameters included compared to other state-of-the-art CNN networks [26]. With about 24 million parameters, this network is one of the largest and most computationally intensive during the benchmarks. Inception-v3 uses so-called Inception Modules. These act as multiple filters that are applied to the same input value by means of convolution layers and pooling layers. By using different filter sizes, different patterns can be extracted from the input images, which increases the number of trainable parameters. This procedure increases memory consumption and computing time considerably, but leads to a significant increase in accuracy.

3.3.3. ShuffleNet-v2

ShuffleNet, published in 2018 by Ma et al. [27], also seeks to improve efficiency but is designed for mobile devices with limited computing capabilities. The improvement in efficiency is given by the introduction of two new operations: point-wise group convolution and channel shuffle. The main drawback of 1x1 convolutions, also known as point-wise convolutions, is the relatively high computational cost, which can be reduced by using group convolutions. The channel shuffle operation has shown to be able to mitigate some unintended side effects that may evolve. In general, the group-wise convolution divides the input feature maps into two or more groups in the channel dimension and performs convolution separately on each group. It is the same as slicing the input into several feature maps of smaller depth and then running a different convolution on each. After the grouped convolution, the channel shuffle operation rearranges the output feature map along the channel dimension.

3.3.4. SqueezeNet

SqueezeNet, which was developed in 2016 within the cooperation of DeepScale, the University of California, Berkeley, and Stanford University, is a convolutional neural network architecture proposed by Iandola et al. [28] that seeks to achieve levels of accuracy similar to previous architectures, while significantly reducing the number of parameters in the model. SqueezeNet relies primarily on reducing the size of the filters by combining channels to decrease the inputs of each layer and to handle larger feature maps. This yields better feature extraction despite the reduction in the number of parameters. This optimization of the feature extraction is done by applying subsampling to these maps at the final network layers, rather than after each layer. The basic building block of SqueezeNet is called the Fire module. It is composed of a squeeze layer that is in charge of input compression, consisting of 1x1 filters. These combine all channels of each input pixel into one. It also has an expand layer which combines 3x3 and 1x1 filters for feature extraction.

3.3.5. DenseNet-121

The main goal of the DenseNet-121, which was released in 2015 by Facebook AI Research, is to reduce the model size and complexity [29]. In Dense convolution networks, each layer of the feature map is concatenated with the input of each successive layer within a dense block. This allows later layers within the network to directly leverage the features from earlier layers, encouraging feature reuse within the network [30]. Concatenating feature maps learned by different layers increases the variation in the input of subsequent layers, improving efficiency. As the network is able to use any previous feature map directly, the number of parameters required can be reduced considerably [31].
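As an illustration of how such a benchmark run could be wired up, the sketch below loads FASHION-MNIST with torchvision and swaps every ReLU of a standard architecture for the Squashing module sketched earlier. The helper function, the resizing to 3-channel 224x224 inputs, and the choice of one fresh module (and hence one learnable β) per replaced ReLU are our assumptions rather than the authors' published pipeline [22].

```python
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# FASHION-MNIST as 3-channel 224x224 tensors so that ImageNet-style
# architectures accept it; the authors' exact preprocessing is not given.
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.Resize(224),
    transforms.ToTensor(),
])
train_set = datasets.FashionMNIST("data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

def replace_relu(module: nn.Module, make_act):
    """Recursively swap every nn.ReLU in `module` for a freshly built activation."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, make_act())
        else:
            replace_relu(child, make_act)

# Example: a DenseNet-121 run with one Squashing module per replaced ReLU.
net = models.densenet121(num_classes=10)
replace_relu(net, lambda: SquashingActivation(beta_init=0.1))
```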

Fig. 7. Line plot showing learning curves of accuracies for different activation functions applied in the LeNet-5 architecture.

Fig. 8. Line plot showing learning curves of loss for different activation functions applied in the LeNet-5 architecture.

3.4. Results and discussion

In this section, the results of the benchmarking on FASHION-MNIST are presented. For each network listed in Section 3.3, separate runs for the following activation functions were performed: ReLU, Sigmoid function, Tanh, and the Squashing function, both with a static β (squashing-nl, β_initial = 0.1) and with a dynamic, learnable β parameter.

3.4.1. LeNet

The accuracy over a period of 50 epochs is shown in Fig. 7. The Squashing function with an adjustable β parameter has an accuracy of 10% until epoch 7, then rises steeply and settles at 81%. Note that the training of the Squashing function with a learnable β parameter needs more initial steps to approximate the appropriate β parameter value. The inset of Fig. 7 displays the course and adjustment of the β value for the Squashing function with dynamic and static β values. Despite the larger computational cost, this additional procedure strengthens the veracity of the applied method. The accuracy curves of both Squashing functions (with dynamic β and static β) and the sigmoid function settle at about 81%. In contrast, the accuracy of the activation functions ReLU and Tanh reaches 91%. The trends for the test and training process are similar.

Fig. 9. Line plot showing the time performance for different activation functions applied in the LeNet-5 architecture.

Fig. 8 illustrates the course of the loss value for the different activation functions. The value of the loss converges towards 0. The deviation between the training and the test loss is negligible for all activation functions. Consequently, the network is able to make predictions even for unseen datasets. No overfitting or underfitting takes place here.

Fig. 9 demonstrates the runtime in seconds for the different activation functions, as a function of the number of epochs. It is noticeable that the Squashing function with an adjustable β value takes between 15.5 and 17 s per epoch. In comparison, the other activation functions (squashing-nl included) perform somewhat better. However, this difference can be compensated by the fact that the Squashing function has the potential of modeling nilpotent logic.

3.4.2. Inception-v3

Similar to Fig. 7, Fig. 10 provides information about the accuracy of the investigated activation functions over a time period of 50 epochs for the network Inception-v3. A special characteristic that stands out is the significant fluctuation of the test accuracy in the case of the Squashing, the Squashing-nl and the sigmoid function. This indicates difficulties in making predictions, although the train accuracy for all activation functions is above 90%. However, the amplitude of this waving effect decreases after a couple of tens of epochs, landing at above 85% at epoch 50. Note here that with about 24 million parameters, this network is one of the largest and most computationally intensive during the benchmarking. This can explain the initially fluctuating behavior. As a consequence, the development of the loss behaves similarly, as shown in Fig. 11. Note the performance of Squashing-nl being close to that of the other activation functions.

The graphs ‘‘time per epoch’’ and ‘‘Beta per epoch’’ can be found in the Appendix.

3.4.3. ShuffleNet-v2

Similar to Figs. 7 and 10, Fig. 12 provides information about the accuracy of the investigated activation functions over a time period of 50 epochs for the network ShuffleNet-v2. The accuracy for the train and the test set of the different activation functions shows a steady development. Compared to the ReLU, Sigmoid and Tanh functions, the train accuracy of the Squashing and Squashing-nl function increases more slowly but settles above 98% accuracy like the other functions. Surprisingly, the different activation functions also show very high test accuracy values of about 90% at epoch 50.

Fig. 10. Learning curves of accuracies for different activation functions applied in the Inception-v3 architecture.

Fig. 11. Learning curves of loss for different activation functions applied in the Inception-v3 architecture.

Fig. 12. Learning curves of accuracies for different activation functions applied in the ShuffleNet-v2 architecture.

Fig. 13. Learning curves of loss for different activation functions applied in the ShuffleNet-v2 architecture.

Fig. 14. Learning curves of accuracies for different activation functions applied in the SqueezeNet architecture.

In Fig. 13, with respect to the loss, the network overfits for each activation function. This is indicated by the fact that the test and the train loss show an increasing deviation.

The graphs ‘‘time per epoch’’ and ‘‘Beta per epoch’’ can be found in the Appendix.

3.4.4. SqueezeNet

The SqueezeNet accuracy diagram given in Fig. 14 illustrates that the progression of the training and test set curves is similar to that of ShuffleNet-v2.

The diagram for the losses given in Fig. 15 clearly illustrates that the network is overfitting for all of the examined activation functions.

The graphs ‘‘time per epoch’’ and ‘‘Beta per epoch’’ can be found in the Appendix.

3.4.5. DenseNet-121

The accuracy diagram of the DenseNet-121 plotted in Fig. 16 demonstrates the development of the accuracy over 50 epochs. Notably, there is no difference in the train accuracies of the different activation functions. The same characteristics apply to

the test set for all the activation functions used in this network. The train accuracy for all activation functions lies at about 99% and the test accuracy at about 94%.

Fig. 15. Learning curves of loss for different activation functions applied in the SqueezeNet architecture.

Fig. 16. Learning curves of accuracies for different activation functions applied in the DenseNet-121 architecture.

Fig. 17. Learning curves of loss for different activation functions applied in the DenseNet-121 architecture.

In the losses diagram of the DenseNet-121 in Fig. 17, the large deviation between test and train loss is particularly visible. This deviation causes the network to overfit.

The graphs ‘‘time per epoch’’ and ‘‘Beta per epoch’’ can be found in the Appendix.

3.4.6. Evaluation in terms of confusion matrices

A confusion matrix is a tool that allows one to see the performance of a model in a general way, where each column of the matrix represents the identification class that the model predicts, while each row represents the expected class, the true input. The diagonal indicates which images were correctly predicted. One of the advantages of a confusion matrix is that it makes it easier to see which categories the network is confusing with one another. It is usually used in supervised learning. The prediction accuracy and classification error can be calculated as follows [32]:

Accuracy = (total correct predictions / total predictions made) · 100    (9)

Error = (total incorrect predictions / total predictions made) · 100    (10)

Figs. 18 and 19 display an example of a confusion matrix for the train and test sets of the DenseNet-121. The distribution of the total dataset consists of 10 classes. The training accuracy of the Squashing function in the DenseNet-121 is 99.8%, while the test accuracy is 94%.

Train Accuracy = (59882 / 60000) · 100 = 99.803%    (11)

Test Accuracy = (9402 / 10000) · 100 = 94.02%    (12)

Fig. 18. Confusion matrix for the training set of the DenseNet-121.
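As a minimal illustration of Eqs. (9) and (10), the function below computes accuracy and error from a confusion matrix whose rows are the expected classes and whose columns are the predicted ones; the function name is ours and not part of the authors' code.

```python
import numpy as np

def accuracy_and_error(cm: np.ndarray):
    """Accuracy and error in percent from a confusion matrix
    (rows: expected class, columns: predicted class), as in Eqs. (9)-(10)."""
    correct = np.trace(cm)   # diagonal entries are the correctly predicted samples
    total = cm.sum()
    accuracy = 100.0 * correct / total
    return accuracy, 100.0 - accuracy

# With 59882 correct out of 60000 training images this reproduces the
# 99.803% of Eq. (11); 9402 / 10000 gives the 94.02% of Eq. (12).
```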

Fig. 19. Confusion matrix for the test set of the DenseNet-121.

The confusion matrices for each network and the corresponding activation functions can be found in the Appendix.

4. Implementing nilpotent logical gates in neural networks

As we have seen, nilpotent logical systems provide a suitable mathematical background for the combination of continuous nilpotent logic and neural networks, contributing to the improvement of interpretability and safety of machine learning. The following sections describe the implementation of the conjunction, one of the most important continuous logical operators. According to Table 1, the conjunction can be modeled by [x + y − 1]; i.e. by a perceptron with fixed weights (w_i = 1), fixed bias (C = −1) and the cutting function or its differentiable approximation, the Squashing function, as an activation function.

As a first experiment, shown in Fig. 22, we define a classification problem where two intersecting straight lines delineate a segment of the plane to be found by a shallow network. This segment is defined by

b_1 y ≥ m_1 x + c_1   AND   b_2 y ≤ m_2 x + c_2.    (13)

Here, not only the AND operator but also the inequalities can be modeled by perceptrons. The output values should be the truth values of the inequalities. A perceptron with weights −m_1, b_1, and bias −c_1 in the case of the first inequality, and m_2, −b_2, and bias c_2 in the case of the second one, using the cutting (or its approximation, the Squashing) activation function, can model the soft inequalities well:

[b_1 y − m_1 x − c_1 ]  AND  [m_2 x − b_2 y + c_2 ].    (14)

This means that to model the problem described in Eq. (14), a shallow network with only two layers needs to be set up (see Fig. 20). The weights and biases in the first layer are to be learned, while the parameters of the hidden layer are frozen (modeling the conjunction). This architecture is similar to that of Extreme Learning Machines (ELM) introduced by Huang et al. in [33], where the parameters of hidden nodes need not be tuned. ELMs are able to produce good generalization performance and learn thousands of times faster than networks trained using backpropagation. The model suggested here can combine extreme learning machines with the continuous logical background, being a promising direction towards a more interpretable, transparent, and safe machine learning.

Fig. 20. Schematic representation of a shallow network with two neurons.

After this implementation, the number of straight lines is increased to four. If more straight lines are connected by AND gates, more AND gates with two inputs are required. However, these can be reduced to one gate after the bias has been adapted to it. This connection is shown in Fig. 21.

Out = [ [g_1 + g_2 − 1] + [g_3 + g_4 − 1] − 1 ]    (15)
    = [ g_1 + g_2 + g_3 + g_4 − 3 ],    (16)

where each bracketed expression [·] corresponds to an AND gate.

Fig. 21. AND gates: connecting four neurons.

4.1. Experiment 1: Two lines

For the first experiment, a dataset is divided into two categories. An open angular shape is labeled with 1 (blue) at the edge of the data field. The remaining data points are labeled 0 (orange). The goal is to separate the two datasets with two straight lines. The network architecture is predesigned according to the nilpotent model described in Section 4. The activation function is the Squashing function with a learnable β parameter, different for the AND gate in the hidden layer and for the first layer. The learning rate is set to 0.02. After 750 epochs, with a runtime of a few seconds, the network is able to separate the
D. Zeltner, B. Schmid, G. Csiszár et al. Knowledge-Based Systems 218 (2021) 106779

two datasets. A longer runtime further reduces the error. The results can be found in Fig. 22. It is important to note that the processing time of these networks is extremely fast due to their low complexity.

Fig. 22. Results of 750 training epochs with a two-line shallow network.
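A possible PyTorch realization of this predesigned architecture is sketched below, reusing the hypothetical SquashingActivation module from above; the class name and the way the logical layer is frozen are illustrative choices. The first layer learns the two line parameters of Eq. (14), while the hidden AND gate is a frozen perceptron with weights (1, 1) and bias −1, i.e. the nilpotent conjunction [g_1 + g_2 − 1] of Table 1; for the four-line case of Eqs. (15)-(16), the gate would simply take four inputs with bias −3.

```python
import torch
import torch.nn as nn

class TwoLineConjunctionNet(nn.Module):
    """Sketch of the shallow network of Fig. 20: a learnable line layer
    followed by a frozen perceptron that models the nilpotent AND."""

    def __init__(self):
        super().__init__()
        self.lines = nn.Linear(2, 2)                       # learnable: b*y - m*x - c per line
        self.act_lines = SquashingActivation(beta_init=0.1)
        self.and_gate = nn.Linear(2, 1)                    # frozen conjunction [g1 + g2 - 1]
        with torch.no_grad():
            self.and_gate.weight.fill_(1.0)
            self.and_gate.bias.fill_(-1.0)
        self.and_gate.weight.requires_grad_(False)
        self.and_gate.bias.requires_grad_(False)
        self.act_and = SquashingActivation(beta_init=0.1)  # separate learnable beta

    def forward(self, xy: torch.Tensor) -> torch.Tensor:
        g = self.act_lines(self.lines(xy))        # truth values of the two inequalities
        return self.act_and(self.and_gate(g))     # conjunction, a value in [0, 1]
```

Training only the line layer and the two β values, e.g. with a binary cross-entropy loss on the [0, 1] output, keeps the frozen hidden layer interpretable as a logical gate, mirroring the ELM-like setup described in Section 4.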
4.2. Experiment 2: Four lines

In the second experiment, the generated dataset is similar to the first. A trapezoidal area lies in the middle of the dataset and the data points are labeled with 1 (blue) and 0 (orange). This area can now be separated by four straight lines. With about 4000 epochs, the training lasts disproportionately longer than the training with two neurons, although the number of parameters only slightly more than doubled. The activation function is the Squashing function with a learnable β parameter. We allow β in the AND gate (hidden layer) to be different from that in the first layer. The learning rate is set to 0.02. After about 1700 epochs, the network is able to align the four straight lines to the record. Between 2000 and 4000 epochs, the network improves accuracy significantly, adjusting the parameters of the straight lines to obtain a more accurate classification (see Fig. 23). During the development of the β parameters of the Squashing functions, the values for the first and for the second layer develop in different directions. Note that allowing β to be negative leads to a decreasing activation function (see Fig. 1). For the interpretation of the hidden layer as a logical gate, a negative β value means that in Eq. (14), the cutting function is replaced by its decreasing counterpart (a step function with value 1 for negative inputs and value 0 for non-negative ones), which corresponds to finding the complement of the intersection. Clearly, for a binary classifier, finding the intersection is equivalent to finding its complement. The development of the β parameters is illustrated in Fig. 23. With the corresponding development of the β parameter, the error in the network decreases. The development of the network loss is displayed in Fig. 23.

4.2.1. Other activation functions

Looking at the other usual activation functions for this application, it stands out that no sufficient results could be achieved in this experiment. For the behavior during training with ReLU, sigmoid, and TanH, see Fig. 24.

Considering the loss of the individual activation functions, the ReLU function does not improve the accuracy. The error remains constant during the entire training period. Using sigmoid or TanH, the accuracy improves and the error initially decreases, but this value settles down after a few epochs and then remains almost constant. This development is reflected in Fig. 25.

5. Conclusion

As recent research shows, the idea of achieving eXplainable Artificial Intelligence (XAI) by combining neural networks with continuous logic is a promising way to approach the problem of interpretability of machine learning: by this combination, the black-box nature of neural models can be reduced, and the neural network-based models can become more interpretable, transparent, and safe. This hybrid approach suggests using Squashing functions (continuously differentiable approximations of cutting functions) as activation functions. To the best of our knowledge, there has been no attempt in the literature to test the performance of these functions so far. The goal of this study was to implement Squashing functions in neural networks and to test them by conducting benchmark tests. Additionally, we also

conducted the first experiments implementing continuous logical gates using the Squashing function.

Fig. 23. Results of 4000 training epochs with a four-line shallow network, together with the β and loss values over the 4000 training epochs.

The implementation of the Squashing function was successfully performed with the framework PyTorch and tested with a series of selected experiments and benchmark tests. The aim of the benchmark tests was:

1. to compare the Squashing function with other activation functions,
2. to test the performance of the activation functions under different conditions, i.e. to measure the performance for different architectures of neural networks.

The benchmark tests showed that the performance of the Squashing function is comparable to conventional activation functions. The following activation functions were considered: the Rectified Linear Unit (ReLU), the sigmoid function, the hyperbolic tangent (TanH), and the Squashing function, both with a static and with a learnable β parameter. The measured values were

determined for the following network architectures: LeNet-5, Inception-v3, ShuffleNet-v2, SqueezeNet and DenseNet-121.

Fig. 24. Performance of other common activation functions trying to find a rectangular area using the nilpotent neural model.

Fig. 25. Loss of the different activation functions.

Another focus of this study was the implementation of continuous logic using the Squashing function. The experiments have proven that by utilizing the differentiability of the Squashing function, there is a possible way to implement continuous logic into neural networks, as a crucial step towards more transparent machine learning.

As a next step, we are working on a comparison with extreme learning machines (ELM) introduced in [33], where, similarly to the model suggested in this study, the parameters of hidden nodes are frozen and need not be tuned. ELMs are able to produce good generalization performance and learn thousands of times faster than networks trained using backpropagation. Combining extreme learning machines with the continuous logical background can be a very promising direction towards more interpretable, transparent, and safe machine learning. Supplemental research is also in progress aiming to investigate which ‘‘And’’- and ‘‘Or’’-operations can be represented by the fastest (i.e., 1-layer) neural networks, and which activation functions allow such representations [2]. Furthermore, we are working on implementing squashing activation functions and soft logical rules into medical recommender systems to increase transparency and safety.

CRediT authorship contribution statement

Daniel Zeltner: Software, Coding, Visualisation, Validation, Writing, Editing. Benedikt Schmid: Software, Coding, Visualisation, Validation, Writing, Editing. Gábor Csiszár: Methodology, Software, Visualisation, Validation, Writing, Editing. Orsolya Csiszár: Methodology, Visualisation, Writing, Editing, Conceptualization, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Appendix


References

[1] A. Barredo Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. Garcia, S. Gil-Lopez, D. Molina, R. Benjamins, R. Chatila, F. Herrera, Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI, Inf. Fusion 58 (2020) 82–115, https://doi.org/10.1016/j.inffus.2019.12.012, URL http://www.sciencedirect.com/science/article/pii/S1566253519308103.
[2] K. Alvarez, J.C. Urenda, O. Csiszar, G. Csiszar, J. Dombi, G. Eigner, V. Kreinovich, Towards fast and understandable computations: Which ‘‘and’’- and ‘‘or’’-operations can be represented by the fastest (i.e., 1-layer) neural networks? Which activations functions allow such representations? Acta Polytech. Hung. 8 (2021) 27–45, URL https://scholarworks.utep.edu/cs_techrep/1443/.
[3] R. Belohlavek, J.W. Dauben, G.J. Klir, Fuzzy Logic and Mathematics: A Historical Perspective, Oxford University Press, New York, 2017.
[4] G. Klir, B. Yuan, Fuzzy Sets and Fuzzy Logic, Prentice Hall, Upper Saddle River, New Jersey, 1995.
[5] J. Mendel, Uncertain Rule-Based Fuzzy Systems, Springer, Cham, Switzerland, 2017.
[6] H. Nguyen, C.L. Walker, E.A. Walker, A First Course in Fuzzy Logic, Chapman and Hall/CRC, Boca Raton, Florida, 2017.
[7] L. Zadeh, Fuzzy sets, Inf. Control 8 (1965) 338–353.
[8] D. Dubois, H. Prade, Fuzzy sets in approximate reasoning, Fuzzy Sets and Systems 40 (1991) 143–202.
[9] E. Trillas, L. Valverde, On some functionally expressable implications for fuzzy set theory, in: Proceedings of the 3rd International Seminar on Fuzzy Set Theory, Linz, Austria, 1981, pp. 173–190.
[10] O. Csiszár, J. Dombi, Generator-based modifiers and membership functions in nilpotent operator systems, in: IEEE International Work Conference on Bioinspired Intelligence (IWOBI 2019), 2019, pp. 99–106.
[11] J. Dombi, O. Csiszár, The general nilpotent operator system, Fuzzy Sets and Systems 261 (2015) 1–19.
[12] J. Dombi, O. Csiszár, Implications in bounded systems, Inform. Sci. 283 (2014) 229–240.
[13] J. Dombi, O. Csiszár, Equivalence operators in nilpotent systems, Fuzzy Sets and Systems 299 (2016) 113–129.
[14] J. Dombi, O. Csiszár, Self-dual operators and a general framework for weighted nilpotent operators, Internat. J. Approx. Reason. 81 (2017) 115–127.
[15] J. Dombi, O. Csiszár, Operator-dependent modifiers in nilpotent logical systems, in: Proceedings of the 10th International Joint Conference on Computational Intelligence - Volume 1: IJCCI, INSTICC, SciTePress, 2018, pp. 126–134.
[16] O. Csiszár, G. Csiszár, J. Dombi, Interpretable neural networks based on continuous-valued logic and multicriterion decision operators, Knowl.-Based Syst. 199 (2020), https://doi.org/10.1016/j.knosys.2020.105972.
[17] O. Csiszár, G. Csiszár, J. Dombi, How to implement MCDM tools and continuous logic into neural computation? Towards better interpretability of neural networks, Knowl.-Based Syst. 210 (2020).
[18] R. Riegel, A. Gray, F. Luus, N. Khan, N. Makondo, I.Y. Akhalwaya, H. Qian, R. Fagin, F. Barahona, U. Sharma, S. Ikbal, H. Karanam, S. Neelam, A. Likhyani, S. Srivastava, Logical neural networks, 2020, arXiv:2006.13155.
[19] S. Shi, H. Chen, M. Zhang, Y. Zhang, Neural logic networks, 2019, arXiv:1910.08629.
[20] J. Dombi, Z. Gera, The approximation of piecewise linear membership functions and Łukasiewicz operators, Fuzzy Sets and Systems 154 (2005) 275–286.
[21] J.C. Urenda, O. Csiszár, G. Csiszár, J. Dombi, O. Kosheleva, V. Kreinovich, G. Eigner, Why squashing functions in multi-layer neural networks, in: IEEE International Conference on Systems, Man, and Cybernetics, 2020, URL https://scholarworks.utep.edu/cs_techrep/1398/.
[22] D. Zeltner, B. Schmid, A study of activation functions, 2020, URL https://github.com/TeamCoffein/A-Study-of-Activation-Functions.
[23] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Comput. 1 (4) (1989) 541–551.
[24] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[25] A. Zhang, Z.C. Lipton, M. Li, A.J. Smola, Dive into Deep Learning, 2020, URL https://d2l.ai/.
[26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, 2015, arXiv:1512.00567.
[27] N. Ma, X. Zhang, H.-T. Zheng, J. Sun, ShuffleNet V2: Practical guidelines for efficient CNN architecture design, 2018, CoRR, http://arxiv.org/abs/1807.11164.
[28] F.N. Iandola, M.W. Moskewicz, K. Ashraf, S. Han, W.J. Dally, K. Keutzer, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size, 2016, CoRR, http://arxiv.org/abs/1602.07360.
[29] M. Chablani, DenseNet, 2017, URL https://towardsdatascience.com/densenet-2810936aeebb.
[30] J. Jordan, Common architectures in convolutional neural networks, 2018, URL https://www.jeremyjordan.me/convnet-architectures/.
[31] G. Huang, Z. Liu, L. van der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269.
[32] S. Visa, B. Ramsay, A. Ralescu, E. Knaap, Confusion matrix-based feature selection, in: Proceedings of the 22nd Midwest Artificial Intelligence and Cognitive Science Conference, vol. 710, 2011, pp. 120–127.
[33] Q. Huang, C. Siew, Extreme learning machine: Theory and applications, Neurocomputing 70 (2006) 489–501.
