Squashing Activation Functions in Benchmark Tests

Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys
Article history: Received 10 October 2020; Received in revised form 19 November 2020; Accepted 12 January 2021; Available online 20 February 2021.

Keywords: XAI; Neural networks; Squashing function; Continuous logic; Fuzzy logic

Abstract

Over the past few years, deep neural networks have shown excellent results in multiple tasks; however, there is still an increasing need to address the problem of interpretability to improve model transparency, performance, and safety. Logical reasoning is a vital aspect of human intelligence. However, traditional symbolic reasoning methods are mostly based on hard rules, which may only have limited generalization capability. Achieving eXplainable Artificial Intelligence (XAI) by combining neural networks with soft, continuous-valued logic and multi-criteria decision-making tools is one of the most promising ways to approach this problem: by this combination, the black-box nature of neural models can be reduced. The continuous logic-based neural model uses so-called Squashing activation functions, a parametric family of functions that satisfy natural invariance requirements and contain rectified linear units as a particular case. This work demonstrates the first benchmark tests that measure the performance of Squashing functions in neural networks. Three experiments were carried out to examine their usability, and a comparison with the most popular activation functions was made for five different network types. The performance was determined by measuring the accuracy, loss, and time per epoch. These experiments and the conducted benchmarks have proven that the use of Squashing functions is possible and similar in performance to conventional activation functions. Moreover, a further experiment was conducted by implementing nilpotent logical gates to demonstrate how simple classification tasks can be solved successfully and with high performance. The results indicate that due to the embedded nilpotent logical operators and the differentiability of the Squashing function, it is possible to solve classification problems where other commonly used activation functions fail.

© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
https://doi.org/10.1016/j.knosys.2021.106779
D. Zeltner, B. Schmid, G. Csiszár et al. Knowledge-Based Systems 218 (2021) 106779
mathematical algorithms. This experience led to the design of fuzzy logic by Zadeh; see, e.g., [3–7]. The reason why human-led control often leads to much better results than even optimal automatic control is that humans use additional knowledge.

The basic idea of continuous logic is the replacement of the space of truth values {T, F} by a compact interval such as [0, 1]. This means that the inputs and the outputs of the extended logical gates are real numbers of the unit interval, representing truth values of inequalities. Quantifiers ∀x and ∃x are replaced by sup_x and inf_x, and logical connectives are continuous functions. Based on this idea, human thinking and natural language can be modeled in a more sophisticated way. Among other families of fuzzy logics, nilpotent fuzzy logic is beneficial from several perspectives. The fulfillment of the law of contradiction and the excluded middle, and the coincidence of the residual and the S-implication [8,9] make the application of nilpotent operators in logical systems promising. In [10–15], an abundant set of operators was examined thoroughly: in [11], negations, conjunctions and disjunctions, in [12] implications, and in [13] equivalence operators. In [14], the aggregative operators were studied and a parametric form of a general operator o_ν was given by using a shifting transformation of the generator function. Varying the parameters, nilpotent conjunctive, disjunctive, aggregative (where a high input can compensate for a lower one) and negation operators can all be obtained. It was also demonstrated how the nilpotent generated operator can be applied for preference modeling. Moreover, as shown in [15], membership functions, which play a substantial role in the overall performance of fuzzy representation, can also be defined using a generator function.

complexity, these networks are easy to interpret and analyze. The results indicate that due to nilpotent logical operators and the differentiability of the Squashing function, it is possible to solve classification problems where other commonly used activation functions fail. Finally, in Section 5, the main results are summarized.

2. Preliminaries

First, we recall some important preliminaries regarding nilpotent logical systems and Squashing functions.

2.1. Nilpotent logical systems

As mentioned in the Introduction, in the field of continuous logic, nilpotent logical systems are the most suitable for neural computation. For more details about nilpotent systems see [10–16]. In [14], the authors examined a general parametric operator, o_ν(x), of nilpotent systems.

Definition 1 ([14]). Let f : [0, 1] → [0, 1] be an increasing bijection, ν ∈ [0, 1], and x = (x_1, ..., x_n), where x_i ∈ [0, 1], and let us define the general operator by

    o_ν(x) = f^{-1}[ Σ_{i=1}^{n} (f(x_i) − f(ν)) + f(ν) ] = f^{-1}[ Σ_{i=1}^{n} f(x_i) − (n − 1) f(ν) ],    (1)

where [·] denotes the cutting operation [x] = max(0, min(1, x)).
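To make Definition 1 concrete, the following minimal sketch (ours, not from the paper's code base; function names are illustrative) evaluates the general operator for the case of the identity generator f(x) = x, where ν = 1 yields the nilpotent (Łukasiewicz) conjunction and ν = 0 the disjunction:

```python
from typing import Callable, Sequence

def cut(value: float) -> float:
    """Cutting operation [x]: clamp the argument to the unit interval."""
    return max(0.0, min(1.0, value))

def general_operator(xs: Sequence[float], nu: float,
                     f: Callable[[float], float] = lambda t: t,
                     f_inv: Callable[[float], float] = lambda t: t) -> float:
    """General nilpotent operator of Eq. (1):
    o_nu(x) = f^{-1}[ sum_i f(x_i) - (n - 1) * f(nu) ]."""
    n = len(xs)
    total = sum(f(x) for x in xs) - (n - 1) * f(nu)
    return f_inv(cut(total))
```

For example, with f the identity, `general_operator([0.7, 0.8], nu=1.0)` gives the Łukasiewicz conjunction max(0, 0.7 + 0.8 − 1) = 0.5, while `nu=0.0` gives the disjunction min(1, 0.7 + 0.8) = 1.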
Table 1
The most important two-variable operators o_w(x).

                          w1     w2    C      o_w(x, y)                      for f(x) = x       Notation
Logical operators
  Disjunction             1      1     0      f^{-1}[f(x) + f(y)]            [x + y]            d(x, y)
  Conjunction             1      1     −1     f^{-1}[f(x) + f(y) − 1]        [x + y − 1]        c(x, y)
  Implication             −1     1     1      f^{-1}[f(y) − f(x) + 1]        [y − x + 1]        i(x, y)
Multi-criteria decision tools
  Arithmetic mean         0.5    0.5   0      f^{-1}[(f(x) + f(y))/2]        [(x + y)/2]        m(x, y)
  Preference              −0.5   0.5   0.5    f^{-1}[(f(y) − f(x) + 1)/2]    [(y − x + 1)/2]    p(x, y)
  Aggregative operator    1      1     −0.5   f^{-1}[f(x) + f(y) − 1/2]      [x + y − 1/2]      a(x, y)
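With the identity generator f(x) = x (the third column of Table 1), the listed operators reduce to clamped linear expressions. A small sketch (our naming, following the Notation column):

```python
def clamp01(v: float) -> float:
    """Cutting operation [v] = max(0, min(1, v))."""
    return max(0.0, min(1.0, v))

# Two-variable operators of Table 1 for the generator f(x) = x.
def d(x: float, y: float) -> float:  # disjunction
    return clamp01(x + y)

def c(x: float, y: float) -> float:  # conjunction
    return clamp01(x + y - 1)

def i(x: float, y: float) -> float:  # implication
    return clamp01(y - x + 1)

def m(x: float, y: float) -> float:  # arithmetic mean
    return clamp01(0.5 * (x + y))

def p(x: float, y: float) -> float:  # preference
    return clamp01(0.5 * (y - x + 1))

def a(x: float, y: float) -> float:  # aggregative operator
    return clamp01(x + y - 0.5)
```

In the aggregative operator, a high input can compensate for a lower one: a(0.9, 0.3) = 0.7, whereas the conjunction c(0.9, 0.3) is only 0.2.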
Definition 4. The Squashing function [15,20] is defined as

    S_{a,λ}^{(β)}(x) = (1/(λβ)) ln[ (1 + e^{β(x−(a−λ/2))}) / (1 + e^{β(x−(a+λ/2))}) ] = (1/(λβ)) ln[ σ_{a+λ/2}^{(−β)}(x) / σ_{a−λ/2}^{(−β)}(x) ],    (6)

where x, a, λ, β ∈ R, λ, β ≠ 0, and σ_d^{(β)}(x) denotes the logistic function:

    σ_d^{(β)}(x) = 1 / (1 + e^{−β·(x−d)}).    (7)

The test phase is divided into three experiments. The goal is to see if the Squashing function can solve simple classification problems. Each of these experiments consists of classifying a set of data points that are distributed in different shapes. The dataset is composed of two balanced classes, each containing 250 points. In the first experiment, two point clouds are to be separated by a straight line. In the second experiment, these point clouds are arranged in circular configurations, as seen in Fig. 2b. In the last experiment, the point sets form two intertwined spirals.
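Since ln(1 + e^z) is the softplus function, Eq. (6) can be evaluated stably as a difference of two softplus terms. A minimal sketch (our helper names; the default a, λ, β values are illustrative, not the paper's settings):

```python
import math

def softplus(z: float) -> float:
    """Numerically stable ln(1 + e^z)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def squashing(x: float, a: float = 0.5, lam: float = 1.0,
              beta: float = 10.0) -> float:
    """Squashing function of Eq. (6), rewritten as
    S = [softplus(beta*(x - (a - lam/2))) - softplus(beta*(x - (a + lam/2)))]
        / (lam * beta)."""
    hi = softplus(beta * (x - (a - lam / 2.0)))
    lo = softplus(beta * (x - (a + lam / 2.0)))
    return (hi - lo) / (lam * beta)
```

For large |β| the function approaches the cutting function on the window [a − λ/2, a + λ/2]: with the defaults above, squashing(0.5) = 0.5, and the outputs approach 0 and 1 far below and above the transition window.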
Table 2
Experiment 1 - Hyperparameters.
  Type of layer      Number of input features    Number of output features
  Fully-connected    2                           2

Table 3
Experiment 2 - Hyperparameters.
  Type of layer      Number of input features    Number of output features
  Fully-connected    2                           8
  Fully-connected    8                           2

Table 4
Experiment 3 - Hyperparameters.
  Type of layer      Number of input features    Number of output features
  Fully-connected    2                           64
  Fully-connected    64                          128
  Fully-connected    128                         2

Fig. 2. Experimental datasets - (a) Gaussian data, (b) circle data, (c) spiral data.
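The shallow fully-connected architectures of Tables 2-4 can be sketched in PyTorch (the framework the paper reports using); the class and function names are ours, and the default a, λ, β values are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquashingActivation(nn.Module):
    """Squashing activation of Eq. (6) with an optionally learnable beta."""
    def __init__(self, a: float = 0.5, lam: float = 1.0,
                 beta: float = 5.0, learnable: bool = True):
        super().__init__()
        beta0 = torch.tensor(float(beta))
        self.beta = nn.Parameter(beta0) if learnable else beta0
        self.a, self.lam = a, lam

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = self.beta
        hi = F.softplus(b * (x - (self.a - self.lam / 2)))
        lo = F.softplus(b * (x - (self.a + self.lam / 2)))
        return (hi - lo) / (self.lam * b)

def experiment_net(hidden: tuple = ()) -> nn.Sequential:
    """Build the nets of Tables 2-4:
    () -> Experiment 1, (8,) -> Experiment 2, (64, 128) -> Experiment 3."""
    sizes = (2, *hidden, 2)
    layers = []
    for idx, (fan_in, fan_out) in enumerate(zip(sizes, sizes[1:])):
        layers.append(nn.Linear(fan_in, fan_out))
        if idx < len(sizes) - 2:  # activation after each hidden layer
            layers.append(SquashingActivation())
    return nn.Sequential(*layers)
```

For instance, `experiment_net((64, 128))` reproduces the layer sizes of Table 4 with a Squashing activation between the fully-connected layers.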
3.1.4. Results
Fig. 6 shows the learning curves obtained in the experiments, which illustrate the evolution of the cost function for the training set. By observing the loss curves, we can conclude that the Squashing function is capable of solving the tasks of classifying Gaussian, circle, and spiral data. The optimization process of the three experiments clearly shows success in separating both classes. For more computational details see Table 5.

Fig. 6. Results Cross-Entropy - (a) Gaussian, (b) circle, (c) spiral.

problem on FASHION-MNIST, a dataset consisting of 60000 training images and 10000 test images. Each of them is a grayscale image of 28 by 28 pixels in size, showing a piece of clothing from Zalando, distributed over 10 different categories. In the benchmarks, it should be determined whether the Squashing function can deliver similar performance results as conventional activation functions. The architectures used to solve the classification task of the benchmark tests are: LeNet, Inception-v3, ShuffleNet-v2, SqueezeNet, and DenseNet-121. A more detailed description of the individual networks can be found in Section 3.3. For each network, separate runs were performed for the following activation functions: Rectified Linear Unit (ReLU), sigmoid function, hyperbolic tangent (Tanh), and Squashing function. Because of the learnable parameter in the Squashing function, the run with this function was performed twice: first with a dynamic, learnable β, and then with a static value for β. Following the same strategy as in the experiments presented in Section 3.1, a cross-entropy loss function is applied with the Adam optimization algorithm. The value of the learning rate is set to 0.0001 and the size of the batches to 32. The total amount of training for each network architecture is 50 epochs.

3.3. Networks

3.3.1. LeNet
The prototype of the LeNet model was introduced in the year 1989 by Yann LeCun et al. [23]. They trained a Convolutional Neural Network with the backpropagation algorithm to learn the convolution kernel coefficients directly from images. This prototype was able to recognize handwritten ZIP code numbers for the United States Postal Service and became the foundation of Convolutional Neural Networks. A few years later, in 1998, LeCun et al. published a paper about gradient-based learning applied to document recognition, in which they reviewed different methods of recognizing handwritten characters on paper and used standard handwritten digits to identify benchmark tasks [24]. The results showed that the network exceeded all other models. The most common form of the LeNet model is the LeNet-5 architecture. The LeNet-5 is a seven-layer neural network architecture (excluding inputs) that consists of two alternating convolutional and pooling layers followed by three fully connected layers (dense layers) at the end [25]. This network was successfully used in ATM check readers, which could automatically read the check amount by recognizing hand-written numbers on checks.

3.3.3. ShuffleNet-v2
ShuffleNet, published in 2018 by Ma et al. [27], also seeks to improve efficiency but is designed for mobile devices with limited computing capabilities. The improvement in efficiency is given by the introduction of two new operations: point-wise group convolution and channel shuffle. The main drawback of 1x1 convolutions, also known as point-wise convolutions, is the relatively high computational cost, which can be reduced by using group convolutions. The channel shuffle operation has been shown to mitigate some unintended side effects that may evolve. In general, the group-wise convolution divides the input feature maps into two or more groups in the channel dimension and performs convolution separately on each group. It is the same as slicing the input into several feature maps of smaller depth and then running a different convolution on each. After the grouped convolution, the channel shuffle operation rearranges the output feature map along the channel dimension.

3.3.4. SqueezeNet
SqueezeNet, which was developed in 2016 within the cooperation of DeepScale, the University of California, Berkeley, and Stanford University, is a convolutional neural network architecture proposed by Iandola et al. [28] that seeks to achieve levels of accuracy similar to previous architectures, while significantly reducing the number of parameters in the model. SqueezeNet relies primarily on reducing the size of the filters by combining channels to decrease the inputs of each layer and to handle larger feature maps. This yields better feature extraction despite the reduction in the number of parameters. This optimization of the feature extraction is done by applying subsampling to these maps at the final network layers, rather than after each layer. The basic building block of SqueezeNet is called the Fire module. It is composed of a squeeze layer that is in charge of input compression and consists of 1x1 filters; these combine all channels of each input pixel into one. It also has an expand layer, which combines 3x3 and 1x1 filters for feature extraction.

3.3.5. DenseNet-121
The main goal of the DenseNet-121, which was released in 2015 by Facebook AI Research, is to reduce the model size and complexity [29]. In Dense convolution networks, each layer's feature map is concatenated with the input of each successive layer within a dense block. This allows later layers within the network to directly leverage the features from earlier layers, encouraging feature reuse within the network [30]. Concatenating feature maps learned by different layers increases the variation in the input of subsequent layers, improving efficiency. As the network is able to use any previous feature map directly, the number of parameters required can be reduced considerably [31].
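The benchmark training protocol described in Section 3.2 (cross-entropy loss, Adam with learning rate 0.0001, 50 epochs; the batch size of 32 is fixed by the data loader) can be sketched as follows; the function name and the returned loss history are our choices:

```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, epochs: int = 50, lr: float = 1e-4):
    """Train a classifier with cross-entropy loss and Adam, as in the
    benchmark protocol. Returns the mean training loss per epoch."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = []
    for _ in range(epochs):
        total, batches = 0.0, 0
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
            total, batches = total + loss.item(), batches + 1
        history.append(total / batches)
    return history
```

In the benchmarks, `loader` would be a `torch.utils.data.DataLoader` over FASHION-MNIST with `batch_size=32`, and `model` one of the five architectures with the activation under test swapped in.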
Fig. 7. Line plot showing learning curves of accuracies for different activation functions applied in the LeNet-5 architecture.

Fig. 8. Line plot showing learning curves of loss for different activation functions applied in the LeNet-5 architecture.
3.4.1. LeNet
The accuracy over a period of 50 epochs is shown in Fig. 7. The Squashing function with an adjustable β parameter has an accuracy of 10% until epoch 7, then rises steeply and settles at 81%. Note that the training of the Squashing function with a learnable β parameter needs more initial steps to approximate the appropriate β parameter value. The inset of Fig. 7 displays the course and adjustment of the β value for the Squashing function with dynamic and static β values. Despite the larger computational cost, this additional procedure strengthens the veracity of the applied method. The accuracy curves of both Squashing functions (with dynamic β and static β) and the sigmoid function settle at about 81%. In contrast, the accuracy of the activation functions ReLU and Tanh reaches 91%. The trends for the test and training process are similar.

Fig. 9. Line plot showing the time performance for different activation functions applied in the LeNet-5 architecture.

Fig. 8 illustrates the course of the loss value for the different activation functions. The value of the loss converges towards 0. The deviation between the training and the test loss is negligible for all activation functions. Consequently, the network is able to make predictions even for unseen datasets. No overfitting or underfitting takes place here.

Fig. 9 demonstrates the runtime in seconds for the different activation functions, as a function of the number of epochs. It is noticeable that the Squashing function with adjustable β value takes between 15.5 and 17 s per epoch. In comparison, the other activation functions (Squashing-nl included) perform somewhat better. However, this difference can be compensated by the fact that the Squashing function has the potential of modeling nilpotent logic.

3.4.2. Inception-v3
Similar to Fig. 7, Fig. 10 provides information about the accuracy of the investigated activation functions over a time period of 50 epochs for the network Inception-v3. A special characteristic that stands out is the significant fluctuation of the test accuracy in the case of the Squashing, the Squashing-nl and the sigmoid function. This indicates difficulties in making predictions, although the train accuracy for all activation functions is above 90%. However, the amplitude of this waving effect decreases after a couple of tens of epochs, landing above 85% at epoch 50. Note here that with about 24 million parameters, this network is one of the largest and most computationally intensive during the benchmarking. This can explain the initially fluctuating behavior. As a consequence, the development of the loss behaves similarly, as shown in Fig. 11. Note the performance of Squashing-nl being close to that of the other activation functions.

The graphs ''time per epoch'' and ''Beta per epoch'' can be found in Appendix.

3.4.3. ShuffleNet-v2
Similar to Figs. 7 and 10, Fig. 12 provides information about the accuracy of the investigated activation functions over a time period of 50 epochs for the network ShuffleNet-v2. The accuracy for the train and the test set of the different activation functions shows a steady development. Compared to the ReLU, sigmoid and Tanh functions, the train accuracy of the Squashing and Squashing-nl functions increases more slowly but settles above 98% accuracy like the other functions. Surprisingly, the different activation functions also show very high test accuracy values of about 90% at epoch 50.
Fig. 10. Learning curves of accuracies for different activation functions applied in the Inception-v3 architecture.

Fig. 11. Learning curves of loss for different activation functions applied in the Inception-v3 architecture.

Fig. 13. Learning curves of loss for different activation functions applied in the ShuffleNet-v2 architecture.

Fig. 14. Learning curves of accuracies for different activation functions applied in the SqueezeNet architecture.
In Fig. 13, with respect to the loss, the network overfits for
each activation function. This is indicated by the fact that the test
and the train loss show an increasing deviation.
The graphs ‘‘time per epoch’’ and ‘‘Beta per epoch’’ can be
found in Appendix.
3.4.4. SqueezeNet
The SqueezeNet accuracy diagram given in Fig. 14 illustrates
that the progression of the training and test set curves is similar
to that of ShuffleNet-v2.
The diagram for the losses given in Fig. 15 clearly illustrates
that the network is overfitting for all of the examined activation
functions.
The graphs ‘‘time per epoch’’ and ‘‘Beta per epoch’’ can be
found in Appendix.
3.4.5. DenseNet-121
The accuracy diagram of the Densenet-121 plotted in Fig. 16
demonstrates the development of the accuracy over 50 epochs.
Fig. 12. Learning curves of accuracies for different activation functions applied in the ShuffleNet-v2 architecture.

Notably, there is no difference in the train accuracies of the different activation functions. The same characteristics apply to
Fig. 15. Learning curves of loss for different activation functions applied in the SqueezeNet architecture.

Fig. 17. Learning curves of loss for different activation functions applied in the DenseNet-121 architecture.
Fig. 16. Learning curves of accuracies for different activation functions applied
in the DenseNet-121 architecture.
Fig. 18. Confusion matrix for the training set of the DenseNet-121.
the test set for all the activation functions used in this network. The train accuracy for all activation functions lies at about 99% and the test accuracy at about 94%.

In the losses diagram of the DenseNet-121 in Fig. 17, the large deviation between test and train loss is particularly visible. This deviation causes the network to overfit.

The graphs ''time per epoch'' and ''Beta per epoch'' can be found in Appendix.

3.4.6. Evaluation in terms of confusion matrices
A confusion matrix is a tool that allows one to see the performance of a model in a general way, where each column of the matrix represents the class that the model predicts, while each row represents the expected class, the true input. The diagonal indicates which images were correctly predicted. One of the advantages of a confusion matrix is that it makes it easier to see which categories the network is confusing with one another. It is usually used in supervised learning. The prediction accuracy and classification error can be calculated as follows [32]:

    Accuracy = (total correct predictions / total predictions made) · 100    (9)

    Error = (total incorrect predictions / total predictions made) · 100    (10)

Figs. 18 and 19 display an example of a confusion matrix for the train and test sets of the DenseNet-121. The total dataset is distributed over 10 classes. The training accuracy of the Squashing function in the DenseNet-121 is 99.8%, while the test accuracy is 94%.

    Train Accuracy = (59882 / 60000) · 100 = 99.803%    (11)

    Test Accuracy = (9402 / 10000) · 100 = 94.02%    (12)
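Eqs. (9) and (10) correspond to the following computation on a confusion matrix (rows = true classes, columns = predicted classes); the function name is ours:

```python
def accuracy_and_error(confusion):
    """Accuracy (Eq. 9) and error (Eq. 10), in percent, from a confusion
    matrix given as a list of rows; correct predictions lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    accuracy = correct / total * 100.0
    return accuracy, 100.0 - accuracy
```

Any matrix whose diagonal sums to 59882 out of 60000 entries reproduces the train accuracy of Eq. (11), 99.803%.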
Fig. 19. Confusion matrix for the test set of the DenseNet-121.
The confusion matrices for each network and the corresponding activation functions can be found in Appendix.

Fig. 21. AND gates: connecting four neurons.

Fig. 22. Results of 750 training epochs with a two-line shallow network.

two datasets. A longer runtime further reduces the error. The results can be found in Fig. 22. It is important to note that the processing time of these networks is extremely fast due to their low complexity.

4.2. Experiment 2: Four lines

In the second experiment, the generated dataset is similar to the first. A trapezoidal area lies in the middle of the dataset, and the data points are labeled with 1 (blue) and 0 (orange). This area can now be separated by four straight lines. With about 4000 epochs, the training lasts disproportionately longer than the training with two neurons, although the number of parameters only slightly more than doubled. The activation function is the Squashing function with a learnable β parameter. We allow β in the AND gate (hidden layer) to be different from that in the first layer. The learning rate is set to 0.02. After about 1700 epochs, the network is able to align the four straight lines to the record. Between 2000 and 4000 epochs, the network improves accuracy significantly, adjusting the parameters of the straight lines to obtain a more accurate classification (see Fig. 23). During the development of the β parameters of the Squashing functions, the values for the first and for the second layer develop in different directions. Note that allowing β to be negative leads to a decreasing activation function (see Fig. 1). For the interpretation of the hidden layer as a logical gate, a negative β value means that in Eq. (14), the cutting function is replaced by its decreasing counterpart (a step function with value 1 for negative inputs and value 0 for non-negative ones), which corresponds to finding the complement of the intersection. Clearly, for a binary classifier, finding the intersection is equivalent to finding its complement. The development of the β parameters is illustrated in Fig. 23. With the corresponding development of the β parameter, the error in the network decreases. The development of the network loss is displayed in Fig. 23.

4.2.1. Other activation functions

Looking at the other usual activation functions for this application, it stands out that no sufficient results could be achieved in this experiment. For the behavior during training with ReLU, sigmoid, and TanH, see Fig. 24. Considering the loss of the individual activation functions, the ReLU function does not improve accuracy. The error remains constant during the entire training period. With sigmoid or TanH, accuracy improves and the error initially decreases, but this value settles down after a few epochs and then remains almost constant. This development is reflected in Fig. 25.

5. Conclusion

As recent research shows, the idea of achieving eXplainable Artificial Intelligence (XAI) by combining neural networks with continuous logic is a promising way to approach the problem of interpretability of machine learning: by this combination, the black-box nature of neural models can be reduced, and the neural network-based models can become more interpretable, transparent, and safe. This hybrid approach suggests using Squashing functions (continuously differentiable approximations of cutting functions) as activation functions. To the best of our knowledge, there has been no attempt in the literature to test the performance of these functions so far. The goal of this study was to implement Squashing functions in neural networks and to test them by conducting benchmark tests. Additionally, we also
Fig. 23. Results, β values, and loss of 4000 training epochs with a four-line shallow network.

conducted the first experiments implementing continuous logical gates using the Squashing function.

The implementation of the Squashing function was successfully performed with the framework PyTorch and tested with a series of selected experiments and benchmark tests. The aim of the benchmark tests was:

1. to compare the Squashing function with other activation functions,
2. to test the performance of the activation functions under different conditions, i.e. to measure the performance for different architectures of neural networks.

The benchmark tests showed that the performance of the Squashing function is comparable to conventional activation functions. The following activation functions were considered: the Rectified Linear Unit (ReLU), the sigmoid function, the hyperbolic tangent (TanH), and the Squashing function, both with a static and with a learnable β parameter. The measured values were
Fig. 24. Performance of other common activation functions trying to find a rectangular area using the nilpotent neural model.

determined for the following network architectures: LeNet-5, Inception-v3, ShuffleNet-v2, SqueezeNet and DenseNet-121.

Another focus of this study was the implementation of continuous logic using the Squashing function. The experiments have proven that by utilizing the differentiability of the Squashing function, there is a possible way to implement continuous logic into neural networks, as a crucial step towards more transparent machine learning.

As a next step, we are working on a comparison with extreme learning machines (ELMs), introduced in [33], where, similarly to the model suggested in this study, the parameters of hidden nodes are frozen and need not be tuned. ELMs are able to produce good generalization performance and learn thousands of times faster than networks trained using backpropagation. Combining extreme learning machines with the continuous logical background can be a very promising direction towards more interpretable, transparent, and safe machine learning. Supplemental research is also in progress aiming to investigate which ''And''- and ''Or''-operations can be represented by the fastest (i.e., 1-layer) neural networks, and which activation functions allow such representations [2]. Furthermore, we are working on implementing Squashing activation functions and soft logical rules into medical recommender systems to increase transparency and safety.

CRediT authorship contribution statement

Daniel Zeltner: Software, Coding, Visualisation, Validation, Writing, Editing. Benedikt Schmid: Software, Coding, Visualisation, Validation, Writing, Editing. Gábor Csiszár: Methodology, Software, Visualisation, Validation, Writing, Editing. Orsolya Csiszár: Methodology, Visualisation, Writing, Editing, Conceptualization, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Appendix
References

[1] A. Barredo Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. Garcia, S. Gil-Lopez, D. Molina, R. Benjamins, R. Chatila, F. Herrera, Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI, Inf. Fusion 58 (2020) 82–115, https://doi.org/10.1016/j.inffus.2019.12.012.
[2] K. Alvarez, J.C. Urenda, O. Csiszar, G. Csiszar, J. Dombi, G. Eigner, V. Kreinovich, Towards fast and understandable computations: Which ''and''- and ''or''-operations can be represented by the fastest (i.e., 1-layer) neural networks? Which activations functions allow such representations? Acta Polytech. Hung. 8 (2021) 27–45, URL https://scholarworks.utep.edu/cs_techrep/1443/.
[3] R. Belohlavek, J.W. Dauben, G.J. Klir, Fuzzy Logic and Mathematics: A Historical Perspective, Oxford University Press, New York, 2017.
[4] G. Klir, B. Yuan, Fuzzy Sets and Fuzzy Logic, Prentice Hall, Upper Saddle River, New Jersey, 1995.
[5] J. Mendel, Uncertain Rule-Based Fuzzy Systems, Springer, Cham, Switzerland, 2017.
[6] H. Nguyen, C.L. Walker, E.A. Walker, A First Course in Fuzzy Logic, Chapman and Hall/CRC, Boca Raton, Florida, 2017.
[7] L. Zadeh, Fuzzy sets, Inf. Control 8 (1965) 338–353.
[8] D. Dubois, H. Prade, Fuzzy sets in approximate reasoning, Fuzzy Sets and Systems 40 (1991) 143–202.
[9] E. Trillas, L. Valverde, On some functionally expressable implications for fuzzy set theory, in: Proceedings of the 3rd International Seminar on Fuzzy Set Theory, Linz, Austria, 1981, pp. 173–190.
[10] O. Csiszár, J. Dombi, Generator-based modifiers and membership functions in nilpotent operator systems, in: IEEE International Work Conference on Bioinspired Intelligence (IWOBI 2019), 2019, pp. 99–106.
[11] J. Dombi, O. Csiszár, The general nilpotent operator system, Fuzzy Sets and Systems 261 (2015) 1–19.
[12] J. Dombi, O. Csiszár, Implications in bounded systems, Inform. Sci. 283 (2014) 229–240.
[13] J. Dombi, O. Csiszár, Equivalence operators in nilpotent systems, Fuzzy Sets and Systems 299 (2016) 113–129.
[14] J. Dombi, O. Csiszár, Self-dual operators and a general framework for weighted nilpotent operators, Internat. J. Approx. Reason. 81 (2017) 115–127.
[15] J. Dombi, O. Csiszár, Operator-dependent modifiers in nilpotent logical systems, in: Proceedings of the 10th International Joint Conference on Computational Intelligence - Volume 1: IJCCI, INSTICC, SciTePress, 2018, pp. 126–134.
[16] O. Csiszár, G. Csiszár, J. Dombi, Interpretable neural networks based on continuous-valued logic and multicriterion decision operators, Knowl.-Based Syst. 199 (2020), https://doi.org/10.1016/j.knosys.2020.105972.
[17] O. Csiszár, G. Csiszár, J. Dombi, How to implement MCDM tools and continuous logic into neural computation? Towards better interpretability of neural networks, Knowl.-Based Syst. 210 (2020).
[18] R. Riegel, A. Gray, F. Luus, N. Khan, N. Makondo, I.Y. Akhalwaya, H. Qian, R. Fagin, F. Barahona, U. Sharma, S. Ikbal, H. Karanam, S. Neelam, A. Likhyani, S. Srivastava, Logical neural networks, 2020, arXiv:2006.13155.
[19] S. Shi, H. Chen, M. Zhang, Y. Zhang, Neural logic networks, 2019, arXiv:1910.08629.
[20] J. Dombi, Z. Gera, The approximation of piecewise linear membership functions and Łukasiewicz operators, Fuzzy Sets and Systems 154 (2005) 275–286.
[21] J.C. Urenda, O. Csiszár, G. Csiszár, J. Dombi, O. Kosheleva, V. Kreinovich, G. Eigner, Why squashing functions in multi-layer neural networks, in: IEEE International Conference on Systems, Man, and Cybernetics, 2020, URL https://scholarworks.utep.edu/cs_techrep/1398/.
[22] D. Zeltner, B. Schmid, A study of activation functions, 2020, URL https://github.com/TeamCoffein/A-Study-of-Activation-Functions.
[23] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Comput. 1 (4) (1989) 541–551.
[24] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[25] A. Zhang, Z.C. Lipton, M. Li, A.J. Smola, Dive into Deep Learning, 2020, URL https://d2l.ai/.
[26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, 2015, arXiv:1512.00567.
[27] N. Ma, X. Zhang, H.-T. Zheng, J. Sun, ShuffleNet V2: Practical guidelines for efficient CNN architecture design, 2018, arXiv:1807.11164.
[28] F.N. Iandola, M.W. Moskewicz, K. Ashraf, S. Han, W.J. Dally, K. Keutzer, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size, 2016, arXiv:1602.07360.
[29] M. Chablani, DenseNet, 2017, URL https://towardsdatascience.com/densenet-2810936aeebb.
[30] J. Jordan, Common architectures in convolutional neural networks, 2018, URL https://www.jeremyjordan.me/convnet-architectures/.
[31] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269.
[32] S. Visa, B. Ramsay, A. Ralescu, E. Knaap, Confusion matrix-based feature selection, in: Proceedings of the 22nd Midwest Artificial Intelligence and Cognitive Science Conference, vol. 710, 2011, pp. 120–127.
[33] Q. Huang, C. Siew, Extreme learning machine: Theory and applications, Neurocomputing 70 (2006) 489–501.