Practical Options For Adopting Recurrent Neural Network and Its Variants On Remaining Useful Life Prediction
(2021) 34:69
https://doi.org/10.1186/s10033-021-00588-x Chinese Journal of Mechanical
Engineering
Abstract
The remaining useful life (RUL) of a system is generally predicted from data collected by sensors that
continuously monitor different indicators. Recently, various deep learning (DL) techniques have been applied to RUL
prediction with great success. Because the data are often time-sequential, the recurrent neural network (RNN) has
attracted significant interest due to its efficiency in dealing with such data. This paper systematically reviews RNN
and its variants for RUL prediction, with a specific focus on understanding how different components (e.g., types of
optimisers and activation functions) or parameters (e.g., sequence length, neuron quantities) affect their performance.
After that, a case study using the well-studied NASA C-MAPSS dataset is presented to quantitatively evaluate the
influence of various state-of-the-art RNN structures on RUL prediction performance. The results suggest that the
variants usually perform better than the original RNN; among them, Bi-directional Long Short-Term
Memory generally has the best performance in terms of stability, precision and accuracy. Certain model structures
may fail to produce a valid RUL prediction due to the gradient vanishing or gradient exploding problem if the
parameters are not chosen appropriately. It is concluded that parameter tuning is a crucial step in achieving optimal
prediction performance.
Keywords: Remaining useful life prediction, Deep learning, Recurrent neural network, Long short-term memory,
Bi-directional long short-term memory, Gated recurrent unit
© The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material
in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material
is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/.
Wang et al. Chin. J. Mech. Eng. (2021) 34:69 Page 2 of 20
working condition impedes the construction of the physical systems, it results in difficulties in developing the modelling of complex dynamic systems [6]. In addition, the difficulty of updating physics-based models with online measured data limits their effectiveness and flexibility. In contrast, data-driven approaches are gaining popularity due to their quick implementation and the widespread deployment of low-cost sensors and their connection to the internet. In these approaches, RUL is computed through statistical and probabilistic methods by utilising historic information and routinely monitored data of the system [7]. The precondition for setting up data-driven models for RUL prediction is the availability of multivariate historical data about the system behaviour, which must encompass all phases of the system operation and degradation scenarios under certain operating conditions. In recent years, artificial intelligence (AI) techniques, particularly deep learning (DL) techniques, have become more and more attractive because of the rapid growth in the industrial Internet of Things (IoT), Big Data and increasing computing power [8]. Researchers have exploited applications of AI techniques for RUL prediction as well.

Deep learning is a sub-branch of machine learning that originated from the Artificial Neural Network (ANN) and features multiple nonlinear processing layers. It aims to model hierarchical representations and predict patterns behind data by stacking multiple layers of information processing modules in hierarchical architectures. With the rapid development of computational infrastructure and the availability of a large volume of data, DL has become one of the main research topics in the field of prognostics, given its capability to capture the hierarchical relationship embedded in deep structures [9]. The published literature on DL approaches for RUL prediction mainly focuses on four representative deep architectures: Auto-encoder (AE), Deep Belief Network (DBN), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) [10]. AE and DBN are often used for the pre-training of networks. For instance, Jia et al. [11] developed a stacked denoising autoencoder (SDA), fed with the frequency spectra of time-series data, for rotating machinery diagnosis. Chen et al. [12] proposed an SDA to identify the health state of certain systems with signals containing ambient noise and working condition fluctuations. Shao et al. [13] developed a deep AE based method to diagnose rotating machinery faults. Liao et al. [14] proposed to combine an enhanced Restricted Boltzmann Machine (RBM) with a novel regularisation term to automatically extract the features which are suitable for RUL prediction. Gan et al. [15] presented a hierarchical diagnosis network that combines a wavelet packet transform (WPT) and DBN for consecutive identification of bearing fault location and severity. In contrast, research suggests that CNN and RNN are generally used as predictive models and have proved to outperform traditional prognosis algorithms in RUL prediction. CNN based approaches are used more in fault diagnosis and surface integration inspection [16]. RNN, on the other hand, has gained much more attention and achieved more in the research of RUL prediction because of its ability to accommodate time sequence data [17]. Therefore, this paper systematically reviews the applications of RNN and its variants for RUL prediction in recent years. Many novel RNN based methods have been proposed, and the performance of RUL prediction has been greatly improved. However, most of these works focused only on how to achieve a better prediction performance using a certain approach. Very few researchers paid attention to other factors that also affect the prediction result, such as the optimizer, activation function, neuron number and sequence length. Taking the optimizers as an example, they are used to shape the model into its most accurate form by adjusting the weights. To the best of our knowledge, there is no research discussing how different optimizers affect the performance of RUL prediction using DL based approaches, or what the underlying principle for optimising the selection is. To fill these research gaps, this paper not only presents an evaluation of the basic RNN and its variants on RUL prediction based on a case study using a publicly available dataset, but also specifically investigates how different components (e.g., types of optimisers and activation functions) or parameters (e.g., sequence length, neuron quantities) of these approaches affect the overall performance (e.g., stability, precision, accuracy) of the RUL prediction.

The remainder of this paper is organized as follows. Section 2 briefly introduces the basic concept of RNN and its variants. Section 3 presents the different optimizers that are normally used in DL. Section 4 explains how activation functions affect the training of the network and demonstrates the advantages and drawbacks of different activation functions. Section 5 presents a case study that aims to evaluate the factors that influence the performance of RUL prediction based on a publicly available dataset.

2 RNN and Its Variants
2.1 RNN
In a traditional neural network, inputs are independent, while in RNN, the front neurons pass the information to the following neurons. As illustrated in Figure 1, in contrast to a traditional feed-forward neural network, an RNN can be regarded as numerous copies of the same neural network cell, in which each cell passes the message
which is constrained by the output gate. The existence of the gates enables LSTM to capture the long-term dependencies in the sequence, and by learning the gate parameters, the network can find the appropriate internal storage. Therefore, LSTMs are naturally suited for RUL prediction tasks using sensor data with an inherently sequential nature, owing to their capability of remembering information over long periods. Yuan et al. [19] proposed an LSTM approach for different types of faults, where the C-MAPSS dataset was used as the case study. Compared to the traditional RNN, Gated Recurrent Unit LSTM (GRU-LSTM) and AdaBoost-LSTM showed improved performance in all cases. They developed a vanilla LSTM approach two years later which further improved the prediction performance significantly [20]. A multi-layer LSTM approach provided by Zheng et al. [17] investigated the hidden patterns from sensors and operational data with multiple operating conditions, fault and degradation models by combining multiple layers of LSTM cells with standard feed-forward layers. The superiority of this approach in RUL prediction was validated by three widely used data
sets, C-MAPSS Data Set, PHM08 Challenge Data Set and the Milling Data Set.

$$r_t = \sigma\left(U^{r} x_t + W^{r} h_{t-1} + b^{r}\right), \tag{10}$$

$$\tilde{h}_t = \tanh\left(U^{\tilde{h}} x_t + W^{\tilde{h}} \left(h_{t-1} \cdot r_t\right) + b^{\tilde{h}}\right), \tag{11}$$

$$h_t = (1 - z_t) \cdot h_{t-1} + z_t \cdot \tilde{h}_t, \tag{12}$$

Since there are fewer tensor operations in GRU, it trains relatively faster than LSTM. However, its accuracy falls behind that of LSTM due to the fewer gates. Thus, when the computational resource is limited, or fast training is required, GRU could be a good option. For instance, Chen et al. [22] adopted a GRU network to predict the RUL for a complex system featuring multiple components, multiple states and a large number of parameters.

where Eq. (13) refers to the forward path and Eq. (14) refers to the backward path, and $y_i$ is the output of the Bi-directional LSTM obtained by fusing the results from both directional paths.

As for the application, Zhao et al. [23] presented an integrated approach of CNN and bi-directional LSTM for machining tool wear prediction named Convolutional Bi-directional Long Short-Term Memory (CBLSTM) networks. CNN was first used to extract local robust features from the sequential input. Then, Bi-directional LSTM was utilised to encode temporal information. The proposed CBLSTM's capability of predicting the RUL of actual tool wear based on raw sensory data was verified with a real-life tool wear test. Zhang et al. [24] presented a Bi-directional LSTM network to discover the
underlying patterns embedded in time-series to track the system degradation. The Bi-directional LSTM network was implemented to track the variation of the health index, and the RUL was predicted by the recursive one-step ahead method. Elsheikh et al. [25] built a Bidirectional Handshaking LSTM (BHLSTM) network for RUL prediction, where short sequences of monitored observations were given with random initial wear. This method was able to predict the RUL with a random start, which makes it more suitable for real-world application as the initial condition of physical systems is usually unknown, especially in terms of its manufacturing deficiencies.

3 Optimizer
Gradient descent is by far the most commonly used way to optimise neural networks [26]. It is an iterative optimization algorithm used to find the values of parameters or coefficients of a function that minimize a cost function. Although various algorithms have been developed to optimize gradient descent, they are usually used as black-box optimizers because it is hard to find practical explanations of their strengths and weaknesses.

Depending on how much data is used to compute the gradient of the objective function, the gradient descent variants are classified into two categories: batch gradient descent (BGD) and stochastic gradient descent (SGD). BGD is guaranteed to converge to a global minimum for convex error surfaces and a local minimum for non-convex surfaces. However, BGD can be very time-consuming because it needs to calculate the gradients for the whole dataset to perform just one update, and thus it is intractable for datasets that do not fit in memory. In addition, BGD cannot be used to update the model online. In contrast, SGD performs one update at a time, and thus it does not have the redundant computations for large datasets that BGD does. As a result, SGD is usually much faster than BGD. Meanwhile, it can be used to learn the model online. The drawback of SGD is that the frequent updates with a high variance lead to heavy fluctuation of the objective function. However, if the learning rate is slowly decreased over time, SGD shows the same convergence behaviour as BGD, almost certainly converging to a local minimum for non-convex optimization or the global minimum for convex optimization.

Although SGD can often lead to good convergence, a few challenges need to be addressed. For instance, it is difficult to determine a proper learning rate and an annealing schedule, and it is hard to update different features to different extents or to avoid suboptimal minima. Ruder [26] outlines some algorithms that are widely used by the deep learning community and can deal with these challenges, including Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, AdaMax and Nadam. Ruder also stated that Adagrad, Adadelta, RMSprop and Adam can all significantly improve the robustness of SGD and do not need much manual tuning of the learning rate. These four optimizers are therefore selected and discussed in more detail in this paper.

3.1 Adagrad
Adagrad is a gradient-based optimizer that adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent ones. Thus, it is very suitable for sparse data. It uses a different learning rate for every parameter $\theta_i$ at every time step $t$, so the gradient of the objective function $g_{t,i}$ regarding the parameter $\theta_i$ at time step $t$ is written as:

$$g_{t,i} = \nabla_{\theta_t} J\left(\theta_{t,i}\right), \tag{16}$$

The SGD update for every parameter $\theta_i$ at each time step $t$ follows the equation:

$$\theta_{t+1,i} = \theta_{t,i} - \eta \cdot g_{t,i}. \tag{17}$$

Adagrad modifies the general learning rate $\eta$ at each time step $t$ for every parameter $\theta_i$ based on the past gradients:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}, \tag{18}$$

where $G_t \in \mathbb{R}^{d \times d}$ is a diagonal matrix in which each diagonal element $i$ is the sum of the squares of the gradients regarding the parameter $\theta_i$ up to time step $t$, and $\epsilon$ is a smoothing term used to avoid division by zero.

One of the main advantages of Adagrad is that the learning rate does not need to be tuned manually. Its default value is commonly set to 0.01. The main drawback of this optimizer is that the accumulation of the squared gradients in the denominator causes the learning rate to shrink and become infinitesimally small, which means that at a certain point, the algorithm can no longer acquire additional knowledge.

3.2 Adadelta
To mitigate the monotonically decreasing learning rate, an extension of Adagrad named Adadelta has been proposed. It uses a fixed-size window of accumulated past gradients instead of accumulating all past squared gradients. The sum of the gradients is recursively defined as a decaying average of all past squared gradients. Thus, the running average of the squared gradients of the objective function at time step $t$ depends on the previous average and the current gradient:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2, \tag{19}$$

where $\gamma$ is the decay fraction (analogous to the momentum term) that weights the previous average, normally set to 0.9 [26].
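The Adagrad update of Eq. (18) and the decaying average of Eq. (19) can be illustrated with a minimal sketch in plain Python. The toy objective, the learning rate of 0.5 and the function names below are assumptions chosen for illustration only and are not taken from the case study; the sketch simply shows how the accumulated squared gradients shrink the effective learning rate over time.

```python
import math


def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
    """Apply one Adagrad update (Eq. 18) to a list of parameters.

    accum holds the diagonal entries G_{t,ii}, i.e. the running sum of
    squared gradients for each parameter.
    """
    new_theta, new_accum = [], []
    for th, g, G in zip(theta, grad, accum):
        G = G + g * g                          # accumulate squared gradient
        th = th - lr / math.sqrt(G + eps) * g  # per-parameter scaled step
        new_theta.append(th)
        new_accum.append(G)
    return new_theta, new_accum


def decaying_average(prev_avg, grad, gamma=0.9):
    """Running average of squared gradients used by Adadelta (Eq. 19)."""
    return gamma * prev_avg + (1.0 - gamma) * grad * grad


def grad_J(theta):
    """Gradient of a hypothetical toy objective J = sum(theta_i ** 2)."""
    return [2.0 * th for th in theta]


theta = [1.0, -2.0]
accum = [0.0, 0.0]
effective_lr = []  # eta / sqrt(G + eps) for the first parameter
for _ in range(50):
    theta, accum = adagrad_step(theta, grad_J(theta), accum, lr=0.5)
    effective_lr.append(0.5 / math.sqrt(accum[0] + 1e-8))

# Adadelta-style average after three unit gradients, starting from zero.
avg = 0.0
for g in (1.0, 1.0, 1.0):
    avg = decaying_average(avg, g)
```

In this sketch `effective_lr` never increases, which mirrors the drawback described for Adagrad, whereas `avg` in the Adadelta-style average stays bounded because old gradients are progressively discounted.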
form data and complex function mappings that represent nonlinearity between input and output.

There are three types of activation functions normally used in the deep learning area: tanh & sigmoid, ReLU and swish. In this section, the basic mathematical expressions of these three types of activation functions are reviewed with their advantages and drawbacks. The expressions of these activation functions and their variants are demonstrated in Figure 5.

4.1 Sigmoid & Tanh
The Sigmoid function, expressed in Eq. (34) and also known as the Logistic function, is normally used for the output of the hidden layer neurons:

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{34}$$

The Tanh function is a translation and contraction of the sigmoid function: $\tanh(x) = 2 \cdot \sigma(2x) - 1$. The Tanh function often outperforms sigmoid in practice because its output is zero mean. Nevertheless, it still suffers from gradient saturation and computational complexity.

4.2 ReLU
The rectified linear unit (ReLU) is the most commonly used activation function in deep learning neural networks. It is the default activation function for most feed-forward neural networks. The ReLU function is written as:

$$f(x) = \max(0, x) \tag{36}$$

$$f(x) = \begin{cases} \alpha\left(e^{x} - 1\right), & x \le 0, \\ x, & x > 0. \end{cases} \tag{39}$$
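As a minimal sketch, the activation functions discussed in this section can be written in plain Python as follows. The logistic form of the sigmoid is the standard definition, `relu` follows Eq. (36), and `elu` follows the piecewise form of Eq. (39), which is the standard ELU expression. The default `alpha=1.0` and the function names are assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function; fine for moderate x (exp overflows for x < -709)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x):
    """Tanh expressed as a shifted and rescaled sigmoid: 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def relu(x):
    """ReLU, Eq. (36)."""
    return max(0.0, x)


def elu(x, alpha=1.0):
    """Piecewise exponential form of Eq. (39); alpha=1.0 is an assumed default."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def sigmoid_grad(x):
    """Derivative of the sigmoid, sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Gradient saturation: the sigmoid derivative collapses for large |x|.
saturated = sigmoid_grad(10.0)
centre = sigmoid_grad(0.0)
```

Here `centre` equals 0.25, the largest value the sigmoid derivative can take, while `saturated` is several orders of magnitude smaller. Multiplying many such small factors across time steps is one way to picture the gradient vanishing behaviour mentioned for sigmoid and tanh.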
5 Case Study
5.1 Benchmark Dataset Overview
The case study investigates the influence of various practical options of optimizers, activation functions and other parameters, such as sequence length and neuron number, when adopting RNN and its variants for RUL prediction. We selected NASA's Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset, which models the damage propagation of aircraft gas turbine engines [29]. This engine simulator produced four datasets, each of which includes three operational condition indicators. Each subset has a different number of engines with varied operational cycles.

In the dataset, engine profiles were simulated with different initial degradation conditions. Maintenance was not considered during the simulation. The dataset includes one training set and one testing set for each engine. The training set consists of the historical run-to-failure measurement records of the engines from 21 on-board sensors. The objective is to predict the RUL of each engine based on the given sensor measurements. The information of the four subsets is listed in Table 1. Specifically, FD001 refers to the engine failure arising from the high-pressure compressor under a single operating condition. FD002 refers to the engine failure from the high-pressure compressor under six operating conditions. FD003 refers to the engine failure from both high-pressure compressor and fan under a single operating

5.3 Performance Evaluation
In this case study, the mean square error (MSE) was used to evaluate the performance of the trained neural networks. The mathematical expression is:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} d_i^2, \tag{41}$$

where $n$ is the total number of true RUL targets in the related test set and $d_i$ refers to the difference between the true RUL and the predicted RUL.

The RNN algorithm and its variants were tested with the dataset FD001, and three different layer structures were used for each method. Each algorithm and structure was tested five times to obtain a statistical result, which is illustrated in the form of a box chart. The results are presented and discussed according to four main factors (optimizers, activation functions, neuron numbers and sequence lengths) against three assessment criteria: stability, precision and accuracy. The ranks for precision and accuracy for these four factors are presented. As for stability, if the network can produce a reliable result, it is marked as 1; otherwise, it is marked as 0.

5.3.1 Optimizers
Different neural network structures were tested with a fixed activation function (ReLU), neuron number (128) and sequence length (50), and four different optimizers: RMSprop, Adam, AdaGrad and AdaDelta.
The prediction results are displayed using box plots so that the stability, precision and accuracy of these optimizers can be evaluated.

As indicated in Figure 6, gradient exploding or gradient vanishing took place when adopting AdaGrad in RNN_2LAYERS, RNN_3LAYERS, LSTM_2LAYERS, LSTM_3LAYERS and GRU_3LAYERS and AdaDelta in RNN_LAYERS, Bi_LSTM_3LAYERS. This observation suggests that RMSprop and Adam are less sensitive to the parameters than AdaGrad and AdaDelta, which means they are more workable in this case. More specifically, AdaGrad and AdaDelta are more likely to lose their stability when the network gets more complicated. In terms of accuracy, generally speaking, AdaGrad can help to achieve the most accurate prediction result in most network structures, regardless of the stability. As for RMSprop, a change in the number of layers makes a great difference to the prediction performance. In contrast, this influence can hardly be seen when adopting Adam and AdaDelta as the optimizers. As for the precision, RMSprop has the worst performance among these four optimizers, whereas the other three can all produce relatively precise outcomes.

The assessment of the four optimizers has been made for all network structures, as in the example given in Table 2, and all the optimal optimizers are summarized in Table 3. In this case, AdaGrad can be regarded as the optimal optimizer for most of the network structures.

Table 2 Assessment of different optimizers using network structure RNN-1LAYER
Optimizers  Stability  Precision  Accuracy  Assessment
RMSprop     1          4          4
Adam        1          3          3
AdaGrad     1          1          1         √
AdaDelta    1          2          2

Table 3 Optimal optimizers for different algorithms
Algorithms         Optimal optimizers
RNN_1LAYER         AdaGrad
RNN_2LAYERS        Adam
RNN_3LAYERS        Adam
LSTM_1LAYER        AdaGrad
LSTM_2LAYERS       Adam
LSTM_3LAYERS       Adam, AdaDelta
Bi_LSTM_1LAYER     Adam, AdaGrad, AdaDelta
Bi_LSTM_2LAYERS    AdaGrad
Bi_LSTM_3LAYERS    AdaGrad
GRU_1LAYER         Adam, AdaGrad
GRU_2LAYERS        AdaGrad
GRU_3LAYERS        AdaGrad

5.3.2 Activation Functions
In this section, the evaluation of five activation functions is performed with a fixed optimizer (Adam), neuron number (128) and sequence length (50). Both Sigmoid and Tanh functions have also been tested, but these two activation functions were found to be greatly affected by the gradient vanishing and gradient exploding problem. Therefore, these two functions are not discussed in this section. As demonstrated in Figure 7, the performance of ReLU, Leaky_ReLU, PReLU and ELU is quite similar, and they are generally better than Swish in both precision and accuracy. However, gradient exploding or gradient vanishing occurred when adopting ReLU, PReLU and ELU in RNN_1LAYER and GRU_3LAYERS, which suggests that Swish and Leaky_ReLU are more stable than these three activation functions. The optimal activation functions in this case for different algorithms are listed in Table 4.

5.3.3 Sequence Length
In this section, the impact of different sequence lengths on the prediction result is compared with a fixed optimizer (Adam), activation function (ReLU) and neuron number (128). As indicated in Figure 8, generally the longer the sequence length, the better the performance the algorithms achieved. In this case, gradient vanishing happened when adopting the GRU_3LAYERS network structure, which suggests that the choice of the sequence length may also affect the workability of GRU. The optimal sequence length for different algorithms is listed in Table 5 considering the workability, precision and accuracy.
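To make the role of the sequence length concrete, the following minimal sketch slices one engine's run-to-failure record into overlapping windows of a fixed length. The paper does not spell out its windowing procedure in this section, so the function name `make_windows`, the choice of labelling each window with the RUL at its last cycle, and the toy data are all assumptions made for illustration.

```python
def make_windows(series, rul, seq_len):
    """Slice one engine's record into overlapping windows of length seq_len.

    series: list of per-cycle feature vectors (e.g. sensor readings).
    rul:    list of per-cycle RUL labels, same length as series.
    Each window is paired with the RUL at its final cycle (an assumed,
    commonly used labelling convention).
    """
    if len(series) != len(rul):
        raise ValueError("series and rul must have the same length")
    if seq_len < 1:
        raise ValueError("seq_len must be positive")
    windows, targets = [], []
    for end in range(seq_len, len(series) + 1):
        windows.append(series[end - seq_len:end])
        targets.append(rul[end - 1])
    return windows, targets


# Hypothetical toy engine: 6 cycles, 2 sensor readings per cycle.
series = [[float(c), 10.0 - c] for c in range(6)]
rul = [5, 4, 3, 2, 1, 0]

X, y = make_windows(series, rul, seq_len=4)
X_long, y_long = make_windows(series, rul, seq_len=7)
```

With `seq_len=4` the six-cycle record yields three windows, whereas `seq_len=7` yields none because the record is shorter than the window. Under this assumed scheme, a longer sequence gives each sample more degradation history but reduces the number of training samples, which is one trade-off to keep in mind when reading Table 5.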
Table 4 Optimal activation functions for different algorithms
Algorithms         Optimal activation functions
RNN_1LAYER         Leaky_ReLU
RNN_2LAYERS        Leaky_ReLU, ELU
RNN_3LAYERS        ELU
LSTM_1LAYER        ReLU
LSTM_2LAYERS       PReLU
LSTM_3LAYERS       ELU
Bi_LSTM_1LAYER     PReLU
Bi_LSTM_2LAYERS    ReLU, Leaky_ReLU
Bi_LSTM_3LAYERS    PReLU
GRU_1LAYER         ReLU
GRU_2LAYERS        ReLU
GRU_3LAYERS        Swish

5.3.4 Neuron Number
Figure 9 shows the influence that different neuron numbers have on the performance of RUL prediction with a fixed optimizer (Adam), activation function (ReLU) and sequence length (50). Gradient exploding or gradient vanishing occurs in this case when using network structure RNN_1LAYER and all GRU structures, which may suggest that neuron number is a sensitive parameter for RNN and GRU in terms of stability. The effect of different neuron numbers on performance varies significantly between algorithms. Taking the LSTM network structure as an example, the influence of different neuron numbers is smaller when using LSTM_1LAYER than the other two. In addition, for LSTM_3LAYERS, the observation shows that the more neurons are used, the less accurate the result turns out to be, while this tendency cannot be found in the other two network structures. As only three different neuron numbers were tested, the optimal neuron number for each network structure could not be determined. Nevertheless, the optimal neuron number for different algorithms in this case is listed in Table 6 for reference.

5.3.5 Overall Performance of Each Algorithm
Figure 10 demonstrates the performance of different algorithms using a certain group of parameters. In this case, gradient exploding or gradient vanishing occurred when using RNN_1LAYER and GRU_3LAYERS. Generally, the performance of the LSTM, Bi_LSTM and GRU network structures seems to be relatively close and significantly better than RNN. The accuracy of LSTM is close to Bi_LSTM and GRU, but the precision is relatively poor. GRU turns out to be very accurate and precise, but it suffers from stability problems. Thus, a Bi_LSTM structure might be a better option in this case.
Table 5 Optimal sequence length for different algorithms
Algorithms         Optimal sequence length

Table 6 Optimal neuron number for different algorithms
Algorithms         Optimal neuron number

A more detailed comparison of different algorithms is displayed in Table 7. The optimal parameters (with the base parameters) for this subset are highlighted using a yellow hatch for every network structure. Although the global optimal parameters cannot be selected for the dataset based on this table, since it has not considered all combinations, it provides useful options with some degree of certainty.
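The MSE criterion of Eq. (41), on which the comparisons in Section 5.3 are based, can be sketched in a few lines of plain Python. The `stability_flag` helper is an assumption about how the 1/0 stability mark might be derived, namely by treating a non-finite score as a failed run; the paper only states that a reliable result is marked 1 and an unreliable one 0. The RUL values used are hypothetical.

```python
import math


def mse(true_rul, pred_rul):
    """Mean square error between true and predicted RUL values (Eq. 41)."""
    if len(true_rul) != len(pred_rul) or not true_rul:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(true_rul, pred_rul)) / len(true_rul)


def stability_flag(score):
    """Assumed rule: 1 if the run produced a finite score, otherwise 0."""
    return 1 if math.isfinite(score) else 0


# Hypothetical true and predicted RUL values for three test engines.
true_rul = [10.0, 20.0, 30.0]
pred_rul = [12.0, 18.0, 30.0]
score = mse(true_rul, pred_rul)

# A run whose training diverged would typically report NaN.
flags = [stability_flag(score), stability_flag(float("nan"))]
```

Because the errors are squared, a few large misses dominate the score, which is worth remembering when comparing the per-structure values reported in Table 7.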
Figure 10 Performance of different algorithms on Dataset FD001, where Optimizer: Adam; Activation function: ReLU; Neuron Number: 128; Sequence Length: 50

6 Conclusions
A systematic review of the applications of RNN and its variants for RUL prediction in recent years is presented in this paper. An evaluation of these algorithms has been conducted using NASA's C-MAPSS dataset, where different parameters, such as optimisers, activation functions, neuron quantities and sequence length, are discussed using a sensitivity analysis. The evaluation shows that some of the network structures are very sensitive to the parameters. The influence of these parameters on the performance of RUL prediction differs between network structures. Instead of giving the optimal parameters and network structures for the dataset, the result of this case study offers some practical choices of parameters for different network structures. Although the conclusions from this case study cannot be applied to other cases directly, they at least indicate the influence of these factors on RUL prediction. Moreover, they provide some options for researchers who consider adopting DL to carry out similar prediction tasks for their own cases.
Algorithms ReLU L_ReLU PReLU ELU Swish Rmsprop Adam AdaGrad AdaDelta 128 64 32 25 50 75 100
RNN_1L 569 871 569 487 286 478 533 553 994 478 596
RNN_2L 651 491 710 488 646 494 406 476 651 494 444 930 476 565 425
RNN_3L 547 563 539 440 831 325 572 547 547 531 493 670 547 570 431
LSTM_1L 310 443 439 382 725 566 397 260 464 310 419 379 836 464 196 181
LSTM_2L 432 526 391 464 670 398 358 396 432 522 369 632 396 221 199
LSTM_3L 515 460 480 531 872 411 375 397 515 458 436 617 397 284 209
Bi_LSTM_1L 526 458 423 516 562 718 422 362 428 526 473 385 714 428 294 185
Bi_LSTM_2L 423 431 538 507 672 441 413 228 513 423 465 467 557 513 243 286
Bi_LSTM_3L 508 492 494 611 673 473 434 285 508 508 435 450 512 508 253 198
GRU_1L 363 393 356 392 450 457 383 389 540 363 360 658 540 220 166
GRU_2L 336 415 383 423 486 417 365 535 405 336 459 570 405 179 231
RNN-1LAYER
Optimizers  Workability  Precision  Accuracy  Assessment
Rmsprop     1            4          4
Adam        1            3          3
AdaGrad     1            1          1         Recommended
AdaDelta    1            2          2

RNN-2LAYERS
Optimizers  Workability  Precision  Accuracy  Assessment
Rmsprop     1            3          2
Adam        1            1          1         Recommended
AdaGrad     0            0          0
AdaDelta    1            2          3

RNN-3LAYERS
Optimizers  Workability  Precision  Accuracy  Assessment
Rmsprop     1            1          1         Recommended
Adam        1            2          2
AdaGrad     0
AdaDelta    0

LSTM-1LAYER
Optimizers  Workability  Precision  Accuracy  Assessment
Rmsprop     1            4          4
Adam        1            1          2
AdaGrad     1            2          1         Recommended
AdaDelta    1            3          3

LSTM-2LAYERS
Optimizers  Workability  Precision  Accuracy  Assessment
Rmsprop     1            3          3
Adam        1            1          1         Recommended
AdaGrad     0            0          0
AdaDelta    1            2          2

LSTM-3LAYERS
Optimizers  Workability  Precision  Accuracy  Assessment
Rmsprop     1            3          3
Adam        1            2          1         Recommended
AdaGrad     0            0          0
AdaDelta    1            1          2         Recommended

Bi_LSTM-1LAYER
Optimizers  Workability  Precision  Accuracy  Assessment
Rmsprop     1            4          4
Adam        1            1          3         Recommended
AdaGrad     1            3          1         Recommended
AdaDelta    1            2          2         Recommended

Bi_LSTM-2LAYERS
Optimizers  Workability  Precision  Accuracy  Assessment
Rmsprop     1            2          2
Adam        1            4          3
AdaGrad     1            1          1         Recommended
AdaDelta    1            3          4

Bi_LSTM-3LAYERS
Optimizers  Workability  Precision  Accuracy  Assessment
Rmsprop     1            3          3
Adam        1            1          2
AdaGrad     1            2          1         Recommended
AdaDelta    0            0          0

GRU_1LAYER
Optimizers  Workability  Precision  Accuracy  Assessment
Rmsprop     1            3          3
Adam        1            1          2         Recommended
AdaGrad     1            2          1         Recommended
AdaDelta    1            4          4

GRU_2LAYERS
Optimizers  Workability  Precision  Accuracy  Assessment
Rmsprop     1            2          2
Adam        1            1          1         Recommended
AdaGrad     1            3          4
AdaDelta    1            4          3

GRU_3LAYERS
Optimizers  Workability  Precision  Accuracy  Assessment
Rmsprop     1            2          2
Adam        1            1          1         Recommended
AdaGrad     0            0          0
AdaDelta    0            0          0
RNN-1LAYER
Activation functions  Workability  Precision  Accuracy  Assessment
ReLU                  0            0          0
Leaky_ReLU            1            1          1         Recommended
PReLU                 0            0          0
ELU                   0            0          0
Swish                 1            2          2

RNN-2LAYERS
Activation functions  Workability  Precision  Accuracy  Assessment
ReLU                  1            5          5
Leaky_ReLU            1            1          1         Recommended
PReLU                 1            4          4
ELU                   1            3          2         Recommended
Swish                 1            2          3

RNN-3LAYERS
Activation functions  Workability  Precision  Accuracy  Assessment
ReLU                  1            4          3
Leaky_ReLU            1            2          4
PReLU                 1            3          2
ELU                   1            1          1         Recommended
Swish                 1            5          5

LSTM-1LAYER
Activation functions  Workability  Precision  Accuracy  Assessment
ReLU                  1            2          1         Recommended
Leaky_ReLU            1            4          3
PReLU                 1            1          2
ELU                   1            3          4
Swish                 1            5          5

LSTM-2LAYERS
Activation functions  Workability  Precision  Accuracy  Assessment
ReLU                  1            1          2
Leaky_ReLU            1            3          4
PReLU                 1            2          1         Recommended
ELU                   1            5          3
Swish                 1            4          5

LSTM-3LAYERS
Activation functions  Workability  Precision  Accuracy  Assessment
ReLU                  1            4          4
Leaky_ReLU            1            3          2
PReLU                 1            2          3
ELU                   1            1          1         Recommended
Swish                 1            5          5

Bi_LSTM-1LAYER
Activation functions  Workability  Precision  Accuracy  Assessment
ReLU                  1            4          3
Leaky_ReLU            1            3          2
PReLU                 1            2          1         Recommended
ELU                   1            1          4
Swish                 1            5          5

Bi_LSTM-2LAYERS
Activation functions  Workability  Precision  Accuracy  Assessment
ReLU                  1            2          1         Recommended
Leaky_ReLU            1            1          2         Recommended
PReLU                 1            4          4
ELU                   1            3          3
Swish                 1            5          5

Bi_LSTM-3LAYERS
Activation functions  Workability  Precision  Accuracy  Assessment
ReLU                  1            3          3
Leaky_ReLU            1            4          1
PReLU                 1            1          2         Recommended
ELU                   1            5          5
Swish                 1            2          4

GRU_1LAYER
Activation functions  Workability  Precision  Accuracy  Assessment
ReLU                  1            1          1         Recommended
Leaky_ReLU            1            4          3
PReLU                 1            2          2
ELU                   1            5          4
Swish                 1            3          5
GRU_2LAYERS
Activation functions  Workability  Precision  Accuracy  Assessment
ReLU                  1            1          1         Recommended
Leaky_ReLU            1            4          3
PReLU                 1            3          2
ELU                   1            5          5
Swish                 1            2          4

GRU_3LAYERS
Activation functions  Workability  Precision  Accuracy  Assessment
ReLU                  0            0          0
Leaky_ReLU            1            1          2
PReLU                 0            0          0
ELU                   0            0          0
Swish                 1            2          1         Recommended

3. Performance evaluation: Sequence length
Recommended sequence length for different structures.

RNN-1LAYER
Sequence length  Workability  Precision  Accuracy  Assessment
25               1            3          3
50               1            1          1         Recommended
75               0            0          0
100              1            2          2

RNN-2LAYERS
Sequence length  Workability  Precision  Accuracy  Assessment
25               1            4          4
50               1            2          2         Recommended
75               1            1          3         Recommended
100              1            3          1         Recommended

RNN-3LAYERS
Sequence length  Workability  Precision  Accuracy  Assessment
25               1            2          4
50               1            3          3
75               1            1          2         Recommended
100              1            4          1

LSTM-1LAYER
Sequence length  Workability  Precision  Accuracy  Assessment
25               1            4          4
50               1            3          3
75               1            2          2
100              1            1          1         Recommended

LSTM-2LAYERS
Sequence length  Workability  Precision  Accuracy  Assessment
25               1            4          4
50               1            3          3
75               1            2          2
100              1            1          1         Recommended

LSTM-3LAYERS
Sequence length  Workability  Precision  Accuracy  Assessment
25               1            4          4
50               1            1          3
75               1            2          2
100              1            3          1         Recommended

Bi_LSTM-1LAYER
Sequence length  Workability  Precision  Accuracy  Assessment
25               1            4          4
50               1            3          3
75               1            2          2
100              1            1          1         Recommended

Bi_LSTM-2LAYERS
Sequence length  Workability  Precision  Accuracy  Assessment
25               1            4          4
50               1            2          3
75               1            1          1         Recommended
100              1            3          2
Bi_LSTM-3LAYERS
Sequence length  Workability  Precision  Accuracy  Assessment
25               1            4          3
50               1            3          4
75               1            1          2
100              1            2          1         Recommended

GRU_1LAYER
Sequence length  Workability  Precision  Accuracy  Assessment
25               1            4          4
50               1            3          3
75               1            1          2         Recommended
100              1            2          1         Recommended

GRU_2LAYERS
Sequence length  Workability  Precision  Accuracy  Assessment
25               1            4          4
50               1            3          3
75               1            2          2
100              1            1          1         Recommended

GRU_3LAYERS
Sequence length  Workability  Precision  Accuracy  Assessment
25               0            0          0
50               0            0          0
75               0            0          0
100              1            1          1         Recommended

4. Performance evaluation—Neuron number
Recommended neuron number for different structures.

RNN-1LAYER
Neuron number  Workability  Precision  Accuracy  Assessment
128            0            0          0
64             1            2          1         Recommended
32             1            1          2         Recommended

RNN-2LAYERS
Neuron number  Workability  Precision  Accuracy  Assessment
128            1            3          3
64             1            1          2
32             1            2          1         Recommended

RNN-3LAYERS
Neuron number  Workability  Precision  Accuracy  Assessment
128            1            2          3
64             1            3          2
32             1            1          1         Recommended

LSTM-1LAYER
Neuron number  Workability  Precision  Accuracy  Assessment
128            1            3          1         Recommended
64             1            2          3
32             1            1          2

LSTM-2LAYERS
Neuron number  Workability  Precision  Accuracy  Assessment
128            1            1          2
64             1            2          3
32             1            3          1         Recommended

LSTM-3LAYERS
Neuron number  Workability  Precision  Accuracy  Assessment
128            1            3          3
64             1            2          2
32             1            1          1         Recommended

Bi_LSTM-1LAYER
Neuron number  Workability  Precision  Accuracy  Assessment
128            1            3          3
64             1            2          2
32             1            1          1         Recommended
Bi_LSTM-2LAYERS
Neuron number  Workability  Precision  Accuracy  Assessment
128            1            3          2
64             1            2          1         Recommended
32             1            1          3

Bi_LSTM-3LAYERS
Neuron number  Workability  Precision  Accuracy  Assessment
128            1            3          3
64             1            1          1         Recommended
32             1            2          2

GRU_1LAYER
Neuron number  Workability  Precision  Accuracy  Assessment
128            1            2          1         Recommended
64             0            0          0
32             1            1          2

GRU_2LAYERS
Neuron number  Workability  Precision  Accuracy  Assessment
128            1            1          1         Recommended
64             1            2          2
32             0            0          0

GRU_3LAYERS
Neuron number  Workability  Precision  Accuracy  Assessment
128            0            0          0
64             1            1          1         Recommended
32             0            0          0

Acknowledgements
Not applicable.

Authors’ Information
Youdao Wang, born in 1992, is currently a PhD candidate at Cranfield University, UK. He received his MSc degree from The University of Manchester, UK, in 2017. His research interests focus on adopting DL algorithms into the RUL prediction of industrial systems.
Yifan Zhao was born in Zhejiang, China. He received the PhD degree in Automatic Control and System Engineering from the University of Sheffield, UK, in 2007. He is currently a Senior Lecturer in Data Science at Cranfield University, UK. His research interests include computer vision, signal processing, non-destructive testing, active thermography, and nonlinear system identification.
Sri Addepalli is a Lecturer in Degradation Assessment at Cranfield University, UK. His research interests include passive and active thermography and material component damage characterisation with specific expertise in composite materials. Sri holds an MPhil in Materials Technology from Swansea University, UK and a BEng in Mechanical Engineering from Anna University, India. He is also a member of the Institute of Engineering Technology.

Funding
Supported by U.K. EPSRC Platform Grant (Grant No. EP/P027121/1).

Competing Interests
The authors declare no competing financial interests.

Received: 6 October 2020 Revised: 17 June 2021 Accepted: 22 June 2021