Joint Selective State Space Model and Detrending for Robust Time Series Anomaly Detection

Junqi Chen, Xu Tan, Sylwan Rahardja, ,
Jiawei Yang, , and Susanto Rahardja
Junqi Chen, Xu Tan and Susanto Rahardja are with the School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, 710072, China. (e-mail: [email protected]; [email protected]; [email protected]).Sylwan Rahardja is with the School of Computing, University of Eastern Finland, FI-80101 Joensuu, Finland. (e-mail: [email protected]).Jiawei Yang is with the Department of Computing University of Turku, 20014, Turku, Finland (e-mail: [email protected]).Susanto Rahardja is also with the Engineering Cluster, Singapore Institute of Technology, 10 Dover Drive, Singapore 138683.(Corresponding author: Susanto Rahardja.)
Abstract

Deep learning-based sequence models are extensively employed in Time Series Anomaly Detection (TSAD) tasks due to their effective sequential modeling capabilities. However, the ability of TSAD is limited by two key challenges: (i) the ability to model long-range dependency and (ii) the generalization issue in the presence of non-stationary data. To tackle these challenges, an anomaly detector that leverages the selective state space model known for its proficiency in capturing long-term dependencies across various domains is proposed. Additionally, a multi-stage detrending mechanism is introduced to mitigate the prominent trend component in non-stationary data to address the generalization issue. Extensive experiments conducted on real-world public datasets demonstrate that the proposed methods surpass all 12121212 compared baseline methods.

Index Terms:
Time series anomaly detection, Selective state space model, Time series detrending

I Introduction

As the volume of data generated continues to grow exponentially, Time Series Anomaly Detection (TSAD) has garnered significant attention due to its increasing demand in various real-world applications such as intrusion detection, disaster warning, and medical diagnosis [1, 2, 3, 4, 5]. TSAD aims to identify irregular points or subsequences collectively referred to as anomalies. Typically, samples with high reconstruction errors of deep neural networks (DNNs) trained exclusively on normal data are detected as anomalies. Given the temporal correlations inherent in time series [6], DNN-based sequence models are considered the most suitable approach for TSAD tasks [7, 8, 9, 10, 11, 12, 13, 14]. However, two key challenges remain in these sequence model-based TSAD: (i) the ability to model long-range dependencies and (ii) the generalization issue for non-stationary data.

Normal behavior in time-series data typically involves long-term dependencies[1]. To capture these dependencies and enhance the modeling of normal behavior, various sequence models including Temporal Convolutional Networks (TCNs) [15], Recurrent Neural Networks (RNNs) [8, 9] and Transformers [11, 12] have been explored for TSAD. However, existing methods still encounter difficulties due to their intrinsic characteristics, such as limited context window, high memory costs, or unstable gradient flow. Recent research on the selective state space model (S6) [16] has addressed the shortcomings of the aforementioned sequence models and demonstrated its excellent long-term modeling capabilities in other domains. Moreover, the selective nature of the S6 model allows it to discard abnormal information and generate reliable reconstructed output for TSAD. Despite offering a potential solution to model long-range dependencies, there has been no research exploring its application in TSAD.

The generalization issue for non-stationary data stems from trends in non-stationary data. The erratic distribution of trends can result in significant fluctuations in data magnitude, leading to erroneously high reconstruction errors in regions with previously unseen trend patterns in the training set, ultimately causing false alarms [14]. Traditional time series decomposition methods such as Seasonal-Trend decomposition using LOESS (STL) [17] and Hodrick–Prescott (HP) trend filter [18] are frequently used to mitigate the impact of these trends. However, achieving optimal decomposed results may not yield the best detection performance. To address this, several approaches have tried to integrate decomposition methods into detectors for joint optimization. Nevertheless, most of them rely on Moving Average (MA) with a fixed kernel size [19, 20, 14], lacking the flexibility for broader scenarios. Other methods rely on DNN models for detrending [21, 22]. Though they allow end-to-end optimization, their dependence on training data reintroduces generalization issues.

To tackle the challenges outlined above, an innovative detector constructed using the S6 model and integrated with a multi-stage detrending mechanism is proposed. The contributions of this paper is as follows: (i) Introduce a novel detector that utilizes the S6 model to effectively model long-range dependencies in TSAD. (ii) Propose a multi-stage detrending mechanism that can generate reliable decomposed results. (iii) Evaluate the proposed method on three benchmark datasets and demonstrate the superiority of the proposed method over recent State-of-the-art (SOTA) methods.

II Related Work

Sequence models such as TCN, RNN, and Transformer [23, 24, 25] had greatly improved TSAD tasks. However, these methods struggled with long-range dependency modeling, limiting detection performance. Fig. 1 illustrates how different sequence models gather context information. As shown in Fig. 1(a), TCN’s context was constrained by the kernel size k𝑘kitalic_k, which made it unsuitable for capturing long-term dependencies. Conversely, the Transformer could capture long context using the Self-Attention (SA) mechanism as shown in Fig. 1(b). However, SA was memory and time-intensive during training and inference because it did not compress context information, which made it impractical for modeling long dependencies. Although RNN could efficiently compress context information using a finite state shown in Fig. 1(c), it suffered from unstable gradient issues with long sequences [26]. The State Space Model (SSM), similar to RNN, resolves gradient issues by employing solely linear transformations, as shown in Fig. 1(d).

The Structured State Space Sequence model (S4) [27] was a well-known SSM that combines the benefits of RNN with the stability of linear transformations. This allows S4 to compress context states into a finite size while ensuring stable gradients, making it suitable for modeling long-range dependencies. However, its parameters remain constant through time, forming a Linear Time Invariance (LTI) model, limiting its effectiveness in TSAD tasks. To remove the LTI constraint, Gu et al. [16] proposed a selective SSM called S6 by introducing input-dependent parameters. S6 exhibited superior performance in modeling long-range dependencies in other domains, including natural language processing [16], computer vision [28], and speech [29]. Additionally, its selection mechanism has the potential to enhance TSAD performance by adaptively selecting the appropriate context information for producing reliable anomaly scores. Despite its effectiveness across various domains, its application in TSAD remains to be unexplored.

Refer to caption
Figure 1: Illustration of different sequence models gathering context information, where k𝑘kitalic_k denotes the kernel size, t𝑡titalic_t denotes the time-step, 𝒙1:t1subscript𝒙:1𝑡1\bm{x}_{1:t-1}bold_italic_x start_POSTSUBSCRIPT 1 : italic_t - 1 end_POSTSUBSCRIPT represents the context information, 𝒙tsubscript𝒙𝑡\bm{x}_{t}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and 𝒚tsubscript𝒚𝑡\bm{y}_{t}bold_italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT represent the input and output in current time-step, respectively. (a) TCN’s context was constrained by the kernel size k𝑘kitalic_k. (b) SA in the Transformer could capture long-term context but lacked information compression. (c-d) RNN and SSM could compress context information by a finite state.

III Preliminaries

III-A Problem Statement

Given a multivariate time series 𝒙T×D𝒙superscript𝑇𝐷\bm{x}\in\mathbb{R}^{T\times D}bold_italic_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_T × italic_D end_POSTSUPERSCRIPT with length T𝑇Titalic_T for training, where each observation 𝒙tDsubscript𝒙𝑡superscript𝐷\bm{x}_{t}\in\mathbb{R}^{D}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT contains D𝐷Ditalic_D features. TSAD aims to predict anomalous labels 𝒚=[y1,,yT^]T𝒚superscriptsubscript𝑦1subscript𝑦^𝑇𝑇\bm{y}=[y_{1},\cdots,y_{\hat{T}}]^{T}bold_italic_y = [ italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , italic_y start_POSTSUBSCRIPT over^ start_ARG italic_T end_ARG end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT for unseen test time series 𝒙^^𝒙\hat{\bm{x}}over^ start_ARG bold_italic_x end_ARG of length T^^𝑇\hat{T}over^ start_ARG italic_T end_ARG, where yt{0,1}subscript𝑦𝑡01y_{t}\in\{0,1\}italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ { 0 , 1 }. This prediction is based on anomaly scores 𝒂=[a1,,aT^]T𝒂superscriptsubscript𝑎1subscript𝑎^𝑇𝑇\bm{a}=[a_{1},\cdots,a_{\hat{T}}]^{T}bold_italic_a = [ italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , italic_a start_POSTSUBSCRIPT over^ start_ARG italic_T end_ARG end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT generated by the detector and a predefined threshold athsubscript𝑎tha_{\rm th}italic_a start_POSTSUBSCRIPT roman_th end_POSTSUBSCRIPT, where yt=1subscript𝑦𝑡1y_{t}=1italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = 1 if at>athsubscript𝑎𝑡subscript𝑎tha_{t}>a_{\rm th}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT > italic_a start_POSTSUBSCRIPT roman_th end_POSTSUBSCRIPT, otherwise 00.

III-B S6 Model

A typical SSM model maps a univariate sequence xtsubscript𝑥𝑡x_{t}\in\mathbb{R}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R to the output ytsubscript𝑦𝑡y_{t}\in\mathbb{R}italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R through a hidden state 𝒉tNsubscript𝒉𝑡superscript𝑁\bm{h}_{t}\in\mathbb{R}^{N}bold_italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT with four parameters ΔΔ\Delta\in\mathbb{R}roman_Δ ∈ blackboard_R, 𝑨,𝑩,𝑪N𝑨𝑩𝑪superscript𝑁\bm{A},\bm{B},\bm{C}\in\mathbb{R}^{N}bold_italic_A , bold_italic_B , bold_italic_C ∈ blackboard_R start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT, where t𝑡titalic_t denotes the t𝑡titalic_t-th time-step and N𝑁Nitalic_N is the number of states.

To remove the LTI constraint, three of the above parameters in S6 model were adjusted to vary with time, namely 𝚫T𝚫superscript𝑇\bm{\Delta}\in\mathbb{R}^{T}bold_Δ ∈ blackboard_R start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT, 𝑩,𝑪T×N𝑩𝑪superscript𝑇𝑁\bm{B},\bm{C}\in\mathbb{R}^{T\times N}bold_italic_B , bold_italic_C ∈ blackboard_R start_POSTSUPERSCRIPT italic_T × italic_N end_POSTSUPERSCRIPT, where T𝑇Titalic_T denotes the number of time-steps. The operation of the S6 model could be represented by a general formula applicable to multivariate time series data 𝒙,𝒚T×D𝒙𝒚superscript𝑇𝐷\bm{x},\bm{y}\in\mathbb{R}^{T\times D}bold_italic_x , bold_italic_y ∈ blackboard_R start_POSTSUPERSCRIPT italic_T × italic_D end_POSTSUPERSCRIPT:

𝒉t,dsubscript𝒉𝑡𝑑\displaystyle\bm{h}_{t,d}bold_italic_h start_POSTSUBSCRIPT italic_t , italic_d end_POSTSUBSCRIPT =𝑨¯t,d𝒉t1,d+𝑩¯t,dxt,d,absentdirect-productsubscript¯𝑨𝑡𝑑subscript𝒉𝑡1𝑑subscript¯𝑩𝑡𝑑subscript𝑥𝑡𝑑\displaystyle=\overline{\bm{A}}_{t,d}\odot\bm{h}_{t-1,d}+\overline{\bm{B}}_{t,% d}\cdot{x}_{t,d},= over¯ start_ARG bold_italic_A end_ARG start_POSTSUBSCRIPT italic_t , italic_d end_POSTSUBSCRIPT ⊙ bold_italic_h start_POSTSUBSCRIPT italic_t - 1 , italic_d end_POSTSUBSCRIPT + over¯ start_ARG bold_italic_B end_ARG start_POSTSUBSCRIPT italic_t , italic_d end_POSTSUBSCRIPT ⋅ italic_x start_POSTSUBSCRIPT italic_t , italic_d end_POSTSUBSCRIPT , (1)
yt,dsubscript𝑦𝑡𝑑\displaystyle{y}_{t,d}italic_y start_POSTSUBSCRIPT italic_t , italic_d end_POSTSUBSCRIPT =𝑪tT𝒉t,d,absentsubscriptsuperscript𝑪𝑇𝑡subscript𝒉𝑡𝑑\displaystyle=\bm{C}^{T}_{t}\cdot\bm{h}_{t,d},= bold_italic_C start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ⋅ bold_italic_h start_POSTSUBSCRIPT italic_t , italic_d end_POSTSUBSCRIPT ,

where d𝑑ditalic_d denotes the d𝑑ditalic_d-th feature, direct-product\odot represents the Hadamard product, 𝑨¯t,d=fA(Δt,d,𝑨d)subscript¯𝑨𝑡𝑑subscript𝑓𝐴subscriptΔ𝑡𝑑subscript𝑨𝑑\overline{\bm{A}}_{t,d}=f_{A}({\Delta}_{t,d},\bm{A}_{d})over¯ start_ARG bold_italic_A end_ARG start_POSTSUBSCRIPT italic_t , italic_d end_POSTSUBSCRIPT = italic_f start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ( roman_Δ start_POSTSUBSCRIPT italic_t , italic_d end_POSTSUBSCRIPT , bold_italic_A start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) and 𝑩¯t,d=fB(Δt,d,𝑩t)subscript¯𝑩𝑡𝑑subscript𝑓𝐵subscriptΔ𝑡𝑑subscript𝑩𝑡\overline{\bm{B}}_{t,d}=f_{B}({\Delta}_{t,d},\bm{B}_{t})over¯ start_ARG bold_italic_B end_ARG start_POSTSUBSCRIPT italic_t , italic_d end_POSTSUBSCRIPT = italic_f start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ( roman_Δ start_POSTSUBSCRIPT italic_t , italic_d end_POSTSUBSCRIPT , bold_italic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ). Here, fA()subscript𝑓𝐴f_{A}(\cdot)italic_f start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ( ⋅ ) and fB()subscript𝑓𝐵f_{B}(\cdot)italic_f start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ( ⋅ ) represent discretization rules for 𝑨𝑨\bm{A}bold_italic_A and 𝑩𝑩\bm{B}bold_italic_B. It is suggested that fA()subscript𝑓𝐴f_{A}(\cdot)italic_f start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ( ⋅ ) followed the zero-order hold rule, while fB()subscript𝑓𝐵f_{B}(\cdot)italic_f start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ( ⋅ ) follows the Euler rule [16], expressed as:

fA(Δt,d,𝑨d)subscript𝑓𝐴subscriptΔ𝑡𝑑subscript𝑨𝑑\displaystyle f_{A}({\Delta}_{t,d},\bm{A}_{d})italic_f start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ( roman_Δ start_POSTSUBSCRIPT italic_t , italic_d end_POSTSUBSCRIPT , bold_italic_A start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) =exp(Δt,d𝑨d),absentsubscriptΔ𝑡𝑑subscript𝑨𝑑\displaystyle=\exp({\Delta}_{t,d}\cdot\bm{A}_{d}),= roman_exp ( roman_Δ start_POSTSUBSCRIPT italic_t , italic_d end_POSTSUBSCRIPT ⋅ bold_italic_A start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) , (2)
fB(Δt,d,𝑩t)subscript𝑓𝐵subscriptΔ𝑡𝑑subscript𝑩𝑡\displaystyle f_{B}({\Delta}_{t,d},\bm{B}_{t})italic_f start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ( roman_Δ start_POSTSUBSCRIPT italic_t , italic_d end_POSTSUBSCRIPT , bold_italic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) =Δt,d𝑩t.absentsubscriptΔ𝑡𝑑subscript𝑩𝑡\displaystyle={\Delta}_{t,d}\cdot\bm{B}_{t}.= roman_Δ start_POSTSUBSCRIPT italic_t , italic_d end_POSTSUBSCRIPT ⋅ bold_italic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT .

III-C Time Series Detrending

A common time series model with trend and seasonality could be formulated as 𝒙t=𝝉t+𝒔t+𝒓t,t=1,,T,formulae-sequencesubscript𝒙𝑡subscript𝝉𝑡subscript𝒔𝑡subscript𝒓𝑡𝑡1𝑇\bm{x}_{t}=\bm{\tau}_{t}+\bm{s}_{t}+\bm{r}_{t},\ t=1,\cdots,T,bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + bold_italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t = 1 , ⋯ , italic_T , where 𝒙tsubscript𝒙𝑡\bm{x}_{t}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT denotes the observation at the t𝑡titalic_t-th time-step, 𝝉tsubscript𝝉𝑡\bm{\tau}_{t}bold_italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is the trend component, 𝒔tsubscript𝒔𝑡\bm{s}_{t}bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is the seasonal component with a period of k𝑘kitalic_k, and 𝒓tsubscript𝒓𝑡\bm{r}_{t}bold_italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT represents the residual component. In this paper, emphasis was placed on trend removal and HP trend filter was adopted as the detrending method, expressed as:

𝝉^t=argmin𝝉tt=1T(𝒙t𝝉t)2+λt=2T1(𝝉t+12𝝉t+𝝉t1)2,subscript^𝝉𝑡subscriptsubscript𝝉𝑡superscriptsubscript𝑡1𝑇superscriptsubscript𝒙𝑡subscript𝝉𝑡2𝜆superscriptsubscript𝑡2𝑇1superscriptsubscript𝝉𝑡12subscript𝝉𝑡subscript𝝉𝑡12\hat{\bm{\tau}}_{t}=\mathop{\arg\min}\limits_{\bm{\tau}_{t}}\sum_{t=1}^{T}(\bm% {x}_{t}-\bm{\tau}_{t})^{2}+\lambda\sum_{t=2}^{T-1}(\bm{\tau}_{t+1}-2\bm{\tau}_% {t}+\bm{\tau}_{t-1})^{2},over^ start_ARG bold_italic_τ end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = start_BIGOP roman_arg roman_min end_BIGOP start_POSTSUBSCRIPT bold_italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_λ ∑ start_POSTSUBSCRIPT italic_t = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT ( bold_italic_τ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT - 2 bold_italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + bold_italic_τ start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (3)

where λ𝜆\lambdaitalic_λ is a hyper-parameter controlling the smoothness.

IV Method

Fig. 2 illustrates the overall framework of the proposed method, comprising a HP trend filter, multiple Decomposition-based Mamba (DMamba) blocks, and a final output module. Initially, input data was processed by the trend filter to extract initial trends and seasonality components. The initial seasonality was refined through DMamba blocks, while the initial trend was merged with trends from DMamba blocks. Finally, the fused trend and refined seasonality were summed to form the reconstructed output.

Refer to caption
Figure 2: The structure of the proposed method includes a HP trend filter, multiple DMamba blocks, and an output module.

IV-A HP Trend Filter

The input data was initially segmented into subsequences using sliding windows of length W𝑊Witalic_W. Each input window 𝒙inW×Dsuperscript𝒙insuperscript𝑊𝐷\bm{x}^{\rm in}\in\mathbb{R}^{W\times D}bold_italic_x start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_W × italic_D end_POSTSUPERSCRIPT underwent detrending using HP trend filter. To ensure the extraction of global trends, the current input window was first combined with historical windows, and then the HP-based trend component 𝝉HPsubscript𝝉HP\bm{\tau}_{\rm HP}bold_italic_τ start_POSTSUBSCRIPT roman_HP end_POSTSUBSCRIPT was computed via Eq. (3). The HP-based seasonality was further derived as 𝒔HP=𝒙in𝝉HPsubscript𝒔HPsuperscript𝒙insubscript𝝉HP\bm{s}_{\rm HP}=\bm{x}^{\rm in}-\bm{\tau}_{\rm HP}bold_italic_s start_POSTSUBSCRIPT roman_HP end_POSTSUBSCRIPT = bold_italic_x start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT - bold_italic_τ start_POSTSUBSCRIPT roman_HP end_POSTSUBSCRIPT.

IV-B DMamba Blocks

Refer to caption
Figure 3: The structure of Mamba block.

The HP-based seasonality 𝒔HPsubscript𝒔HP\bm{s}_{\rm HP}bold_italic_s start_POSTSUBSCRIPT roman_HP end_POSTSUBSCRIPT was first projected to a Dmsubscript𝐷𝑚D_{m}italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT-dimensional space by an embedding network Emb()Emb{\rm Emb(\cdot)}roman_Emb ( ⋅ ), and then it underwent L𝐿Litalic_L DMamba blocks. Each block comprised a Mamba block [16] and an Adaptive MA (AMA) module. The input of each block was the output of the previous block, denoted as 𝒙l=𝒚l1superscript𝒙𝑙superscript𝒚𝑙1\bm{x}^{l}=\bm{y}^{l-1}bold_italic_x start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT = bold_italic_y start_POSTSUPERSCRIPT italic_l - 1 end_POSTSUPERSCRIPT with 𝒚0=Emb(𝒔HP)superscript𝒚0Embsubscript𝒔HP\bm{y}^{0}={\rm Emb}(\bm{s}_{\rm HP})bold_italic_y start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT = roman_Emb ( bold_italic_s start_POSTSUBSCRIPT roman_HP end_POSTSUBSCRIPT ). For clarity, the superscript l𝑙litalic_l was omitted in the following content. The input 𝒙𝒙\bm{x}bold_italic_x first entered the Mamba block shown in Fig. 3 and could be formulated by Eq. (4):

𝒖,𝒈𝒖𝒈\displaystyle\bm{u},\ \bm{g}bold_italic_u , bold_italic_g =σ(Conv(𝒙𝑾2)),σ(𝒙𝑾1),absent𝜎Conv𝒙subscript𝑾2𝜎𝒙subscript𝑾1\displaystyle=\sigma({\rm Conv}(\bm{x}\cdot\bm{W}_{2})),\ \sigma(\bm{x}\cdot% \bm{W}_{1}),= italic_σ ( roman_Conv ( bold_italic_x ⋅ bold_italic_W start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) , italic_σ ( bold_italic_x ⋅ bold_italic_W start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , (4)
𝒛𝒛\displaystyle\bm{z}bold_italic_z =S6(𝒖),absentS6𝒖\displaystyle={\rm S6}(\bm{u}),= S6 ( bold_italic_u ) ,
𝒙SSMsubscript𝒙SSM\displaystyle\bm{x}_{\rm SSM}bold_italic_x start_POSTSUBSCRIPT roman_SSM end_POSTSUBSCRIPT =(𝒈𝒛)𝑾3,absentdirect-product𝒈𝒛subscript𝑾3\displaystyle=(\bm{g}\odot\bm{z})\cdot\bm{W}_{3},= ( bold_italic_g ⊙ bold_italic_z ) ⋅ bold_italic_W start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ,

where 𝑾1,𝑾2Dm×2Dm,𝑾32Dm×Dmformulae-sequencesubscript𝑾1subscript𝑾2superscriptsubscript𝐷𝑚2subscript𝐷𝑚subscript𝑾3superscript2subscript𝐷𝑚subscript𝐷𝑚\bm{W}_{1},\bm{W}_{2}\in\mathbb{R}^{D_{m}\times 2D_{m}},\bm{W}_{3}\in\mathbb{R% }^{2D_{m}\times D_{m}}bold_italic_W start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_W start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT × 2 italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , bold_italic_W start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 2 italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT × italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUPERSCRIPT are trainable projection matrices, Conv()Conv{\rm Conv}(\cdot)roman_Conv ( ⋅ ) represents a convolution layer with a kernel size of 4444, σ𝜎\sigmaitalic_σ denotes the SiLU[30] activation function, and S6()S6{\rm S6(\cdot)}S6 ( ⋅ ) denotes the S6 operation detailed in algorithm 1. The three parameters 𝚫,𝑩,𝑪𝚫𝑩𝑪\bm{\Delta},\bm{B},\bm{C}bold_Δ , bold_italic_B , bold_italic_C in Algorithm 1 could be computed as:

𝚫𝚫\displaystyle\bm{\Delta}bold_Δ =Softplus(𝒖𝑾Δ),absentSoftplus𝒖subscript𝑾Δ\displaystyle={\rm Softplus}(\bm{u}\cdot\bm{W}_{\Delta}),= roman_Softplus ( bold_italic_u ⋅ bold_italic_W start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT ) , (5)
𝑩𝑩\displaystyle\bm{B}bold_italic_B =𝒖𝐖B,absent𝒖subscript𝐖𝐵\displaystyle=\bm{u}\cdot\mathbf{W}_{B},= bold_italic_u ⋅ bold_W start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT , (6)
𝑪𝑪\displaystyle\bm{C}bold_italic_C =𝒖𝐖C,absent𝒖subscript𝐖𝐶\displaystyle=\bm{u}\cdot\mathbf{W}_{C},= bold_italic_u ⋅ bold_W start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT , (7)

where Softplus()Softplus{\rm Softplus}(\cdot)roman_Softplus ( ⋅ ) denotes the Softplus activation function, 𝑾B,𝑾C2Dm×Nsubscript𝑾𝐵subscript𝑾𝐶superscript2subscript𝐷𝑚𝑁\bm{W}_{B},\bm{W}_{C}\in\mathbb{R}^{2D_{m}\times N}bold_italic_W start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT , bold_italic_W start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 2 italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT × italic_N end_POSTSUPERSCRIPT and 𝑾Δ2Dm×2Dmsubscript𝑾Δsuperscript2subscript𝐷𝑚2subscript𝐷𝑚\bm{W}_{\Delta}\in\mathbb{R}^{2D_{m}\times 2D_{m}}bold_italic_W start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 2 italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT × 2 italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUPERSCRIPT are trainable projection matrices.

Algorithm 1 S6 operation
0:  𝒖:(W,D):𝒖𝑊𝐷\bm{u}:(W,D)bold_italic_u : ( italic_W , italic_D ) with W𝑊Witalic_W time-steps and D𝐷Ditalic_D features ;
0:  𝒙SSM:(W,D):subscript𝒙SSM𝑊𝐷\bm{x}_{\rm SSM}:(W,D)bold_italic_x start_POSTSUBSCRIPT roman_SSM end_POSTSUBSCRIPT : ( italic_W , italic_D );
1:  𝑨:(D,N):𝑨𝐷𝑁absent\bm{A}:(D,N)\leftarrowbold_italic_A : ( italic_D , italic_N ) ← Parameter with N𝑁Nitalic_N states
2:  𝚫:(W,D)fΔ(𝒖):𝚫𝑊𝐷subscript𝑓Δ𝒖\bm{\Delta}:(W,D)\leftarrow f_{\Delta}(\bm{u})bold_Δ : ( italic_W , italic_D ) ← italic_f start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT ( bold_italic_u ) using Eq. (5)
3:  𝑩:(W,N)fB(𝒖):𝑩𝑊𝑁subscript𝑓𝐵𝒖\bm{B}:(W,N)\leftarrow f_{B}(\bm{u})bold_italic_B : ( italic_W , italic_N ) ← italic_f start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ( bold_italic_u ) using Eq. (6)
4:  𝑪:(W,N)fC(𝒖):𝑪𝑊𝑁subscript𝑓𝐶𝒖\bm{C}:(W,N)\leftarrow f_{C}(\bm{u})bold_italic_C : ( italic_W , italic_N ) ← italic_f start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( bold_italic_u ) using Eq. (7)
5:  𝑨¯,𝑩¯:(W,D,N)discretize(𝚫,𝑨,𝑩):¯𝑨¯𝑩𝑊𝐷𝑁discretize𝚫𝑨𝑩\overline{\bm{A}},\overline{\bm{B}}:(W,D,N)\leftarrow{\rm discretize}(\bm{% \Delta},\bm{A},\bm{B})over¯ start_ARG bold_italic_A end_ARG , over¯ start_ARG bold_italic_B end_ARG : ( italic_W , italic_D , italic_N ) ← roman_discretize ( bold_Δ , bold_italic_A , bold_italic_B ) using Eq. (2)
6:  𝒚SSM(𝑨¯,𝑩¯,𝑪)(𝒖)𝒚SSM¯𝑨¯𝑩𝑪𝒖\bm{y}\leftarrow{\rm SSM}(\overline{\bm{A}},\overline{\bm{B}},\bm{C})(\bm{u})bold_italic_y ← roman_SSM ( over¯ start_ARG bold_italic_A end_ARG , over¯ start_ARG bold_italic_B end_ARG , bold_italic_C ) ( bold_italic_u ) using Eq. (1)
7:  return 𝒙SSMsubscript𝒙SSM\bm{x}_{\rm SSM}bold_italic_x start_POSTSUBSCRIPT roman_SSM end_POSTSUBSCRIPT.

The AMA module then took 𝒙SSMsubscript𝒙SSM\bm{x}_{\rm SSM}bold_italic_x start_POSTSUBSCRIPT roman_SSM end_POSTSUBSCRIPT as input and performed trend removal through average pooling:

𝝉MA=AvgPool(𝒙SSM,k)subscript𝝉𝑀𝐴AvgPoolsubscript𝒙SSM𝑘\bm{\tau}_{MA}={\rm AvgPool}(\bm{x}_{\rm SSM},k)bold_italic_τ start_POSTSUBSCRIPT italic_M italic_A end_POSTSUBSCRIPT = roman_AvgPool ( bold_italic_x start_POSTSUBSCRIPT roman_SSM end_POSTSUBSCRIPT , italic_k ) (8)

with a kernel size of k𝑘kitalic_k. The kernel size is adaptively estimated by k=W/n𝑘𝑊𝑛k=\left\lfloor W/n\right\rflooritalic_k = ⌊ italic_W / italic_n ⌋, where \left\lfloor\cdot\right\rfloor⌊ ⋅ ⌋ denotes the floor function, n=argmax(𝒙fH𝒙f)𝑛direct-productsubscriptsuperscript𝒙𝐻𝑓subscript𝒙𝑓n=\arg\max(\bm{x}^{H}_{f}\odot\bm{x}_{f})italic_n = roman_arg roman_max ( bold_italic_x start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ⊙ bold_italic_x start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ) indicates the peak position of the power density function and 𝒙fsubscript𝒙𝑓\bm{x}_{f}bold_italic_x start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT is the Fourier transformation of 𝒙SSMsubscript𝒙SSM\bm{x}_{\rm SSM}bold_italic_x start_POSTSUBSCRIPT roman_SSM end_POSTSUBSCRIPT. The output of the current block is then obtained by 𝒚=𝒙ssm𝝉MA𝒚subscript𝒙ssmsubscript𝝉MA\bm{y}=\bm{x}_{\rm ssm}-\bm{\tau}_{\rm MA}bold_italic_y = bold_italic_x start_POSTSUBSCRIPT roman_ssm end_POSTSUBSCRIPT - bold_italic_τ start_POSTSUBSCRIPT roman_MA end_POSTSUBSCRIPT.

IV-C Final Reconstructed Output

The output of the final DMamba block served as the reconstructed seasonal component i.e., 𝒔=𝒚L𝑾s𝒔superscript𝒚𝐿subscript𝑾𝑠\bm{s}=\bm{y}^{L}\cdot\bm{W}_{s}bold_italic_s = bold_italic_y start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT ⋅ bold_italic_W start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT, where 𝑾sDm×Dsubscript𝑾𝑠superscriptsubscript𝐷𝑚𝐷\bm{W}_{s}\in\mathbb{R}^{D_{m}\times D}bold_italic_W start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT × italic_D end_POSTSUPERSCRIPT is a trainable projection matrix. The final trend is a combination of all extracted trends:

𝝉=𝝉HP+Conv(l=1L𝝉MAl),𝝉subscript𝝉HPConvsuperscriptsubscript𝑙1𝐿subscriptsuperscript𝝉𝑙MA\bm{\tau}=\bm{\tau}_{\rm HP}+{\rm Conv}(\sum_{l=1}^{L}\bm{\tau}^{l}_{\rm MA}),bold_italic_τ = bold_italic_τ start_POSTSUBSCRIPT roman_HP end_POSTSUBSCRIPT + roman_Conv ( ∑ start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT bold_italic_τ start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_MA end_POSTSUBSCRIPT ) , (9)

where Conv()Conv{\rm Conv}(\cdot)roman_Conv ( ⋅ ) is a CNN layer with an input channel of Dmsubscript𝐷𝑚D_{m}italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, an output channel of D𝐷Ditalic_D, and a kernel size of 3333. In the training phase, the objective was to minimize the Mean Square Error (MSE) between the original input and the reconstructed output:

=𝒙in(𝒔+𝝉)22.subscriptsuperscriptnormsuperscript𝒙in𝒔𝝉22\mathcal{L}=||\bm{x}^{\rm in}-(\bm{s}+\bm{\tau})||^{2}_{2}.caligraphic_L = | | bold_italic_x start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT - ( bold_italic_s + bold_italic_τ ) | | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . (10)

During inference, the MSE serves as the anomaly score.

V Experiments

V-A Setup

V-A1 Datasets

TABLE I: Datasets description.
Dataset Train Test Dimensions Contamination (%)
NASA 6,32963296,3296 , 329 18,7431874318,74318 , 743 25/55255525/5525 / 55 9.529.529.529.52
SMD 128,267128267128,267128 , 267 128,270128270128,270128 , 270 38383838 7.117.117.117.11
SWaT 6,84068406,8406 , 840 7,50075007,5007 , 500 25252525 12.6312.6312.6312.63
TABLE II: Comparison results on three datasets. The best results were bolded while the second best were underlined.
Method NASA SWaT SMD Avg.
P-AF R-AF F1-AF P-AF R-AF F1-AF P-AF R-AF F1-AF P-AF R-AF F1-AF
LOF [31] 0.40200.40200.40200.4020 0.63130.63130.63130.6313 0.48860.48860.48860.4886 0.84500.84500.84500.8450 0.28550.28550.28550.2855 0.42680.42680.42680.4268 0.70020.70020.70020.7002 0.73760.73760.73760.7376 0.69590.69590.69590.6959 0.64910.64910.64910.6491 0.55150.55150.55150.5515 0.53710.53710.53710.5371
CBLOF [32] 0.40780.40780.40780.4078 0.60360.60360.60360.6036 0.48180.48180.48180.4818 0.68610.68610.68610.6861 0.05100.05100.05100.0510 0.09490.09490.09490.0949 0.82270.82270.82270.8227 0.71310.71310.71310.7131 0.66980.66980.66980.6698 0.63890.63890.63890.6389 0.45590.45590.45590.4559 0.41550.41550.41550.4155
OCSVM [33] 0.47350.47350.47350.4735 0.47390.47390.47390.4739 0.43410.43410.43410.4341 0.45580.45580.45580.4558 0.43160.43160.43160.4316 0.44340.44340.44340.4434 0.79230.79230.79230.7923 0.50030.50030.50030.5003 0.49000.49000.49000.4900 0.57390.57390.57390.5739 0.46860.46860.46860.4686 0.45580.45580.45580.4558
IForest [34] 0.38110.38110.38110.3811 0.63530.63530.63530.6353 0.47440.47440.47440.4744 0.99770.99770.99770.9977 0.14240.14240.14240.1424 0.24920.24920.24920.2492 0.79050.79050.79050.7905 0.65890.65890.65890.6589 0.63590.63590.63590.6359 0.72310.72310.72310.7231 0.47890.47890.47890.4789 0.45320.45320.45320.4532
HBOS [35] 0.50140.50140.50140.5014 0.99260.99260.99260.9926 0.66610.66610.66610.6661 0.98350.98350.98350.9835 0.14760.14760.14760.1476 0.25670.25670.25670.2567 0.78480.78480.78480.7848 0.53290.53290.53290.5329 0.54250.54250.54250.5425 0.75660.75660.75660.7566 0.55770.55770.55770.5577 0.48840.48840.48840.4884
LODA [36] 0.47360.47360.47360.4736 0.95510.95510.95510.9551 0.63260.63260.63260.6326 0.97770.97770.97770.9777 0.02630.02630.02630.0263 0.05130.05130.05130.0513 0.73790.73790.73790.7379 0.60640.60640.60640.6064 0.56740.56740.56740.5674 0.72970.72970.72970.7297 0.52930.52930.52930.5293 0.41710.41710.41710.4171
ECOD [37] 0.38060.38060.38060.3806 0.62380.62380.62380.6238 0.47200.47200.47200.4720 0.99400.99400.99400.9940 0.11220.11220.11220.1122 0.20160.20160.20160.2016 0.63580.63580.63580.6358 0.55510.55510.55510.5551 0.53490.53490.53490.5349 0.67010.67010.67010.6701 0.43040.43040.43040.4304 0.40280.40280.40280.4028
Omnianomaly [9] 0.50740.50740.50740.5074 0.99990.99990.99990.9999 0.67320.67320.67320.6732 0.50850.50850.50850.5085 0.97380.97380.97380.9738 0.66810.66810.66810.6681 0.49570.49570.49570.4957 0.92210.92210.92210.9221 0.64380.64380.64380.6438 0.50390.50390.50390.5039 0.96530.96530.96530.9653 0.66170.66170.66170.6617
FGANomaly [10] 0.51010.51010.51010.5101 1.0001.0001.0001.000 0.6756¯¯0.6756\underline{0.6756}under¯ start_ARG 0.6756 end_ARG 0.53380.53380.53380.5338 0.96390.96390.96390.9639 0.68710.68710.68710.6871 0.62830.62830.62830.6283 0.99370.99370.99370.9937 0.76590.76590.76590.7659 0.55740.55740.55740.5574 0.98590.98590.98590.9859 0.7095¯¯0.7095\underline{0.7095}under¯ start_ARG 0.7095 end_ARG
Anomalytrans [11] 0.50240.50240.50240.5024 0.97920.97920.97920.9792 0.66390.66390.66390.6639 0.59650.59650.59650.5965 0.59290.59290.59290.5929 0.59470.59470.59470.5947 0.53980.53980.53980.5398 0.94150.94150.94150.9415 0.68470.68470.68470.6847 0.54620.54620.54620.5462 0.83790.83790.83790.8379 0.64780.64780.64780.6478
DCdetector [13] 0.48480.48480.48480.4848 0.97770.97770.97770.9777 0.64710.64710.64710.6471 0.47530.47530.47530.4753 0.51750.51750.51750.5175 0.49550.49550.49550.4955 0.50190.50190.50190.5019 0.53670.53670.53670.5367 0.64120.64120.64120.6412 0.48730.48730.48730.4873 0.67730.67730.67730.6773 0.59460.59460.59460.5946
D3R [14] 0.43340.43340.43340.4334 0.97760.97760.97760.9776 0.58680.58680.58680.5868 0.62410.62410.62410.6241 0.79700.79700.79700.7970 0.7000¯¯0.7000\underline{0.7000}under¯ start_ARG 0.7000 end_ARG 0.75060.75060.75060.7506 0.93280.93280.93280.9328 0.8295¯¯0.8295\underline{0.8295}under¯ start_ARG 0.8295 end_ARG 0.60270.60270.60270.6027 0.90250.90250.90250.9025 0.70540.70540.70540.7054
Proposed method 0.52820.52820.52820.5282 0.97660.97660.97660.9766 0.68290.6829\bm{0.6829}bold_0.6829 0.64410.64410.64410.6441 0.86410.86410.86410.8641 0.73800.7380\bm{0.7380}bold_0.7380 0.80610.80610.80610.8061 0.90400.90400.90400.9040 0.84030.8403\bm{0.8403}bold_0.8403 0.65950.65950.65950.6595 0.91490.91490.91490.9149 0.75370.7537\bm{0.7537}bold_0.7537

Experiments were conducted on three publicly available datasets. The NASA dataset [8] contained sensor metrics recorded by Soil Moisture Active Passive satellite (SMAP) and Mars Science Laboratory rover (MSL). Three non-trivial subsets were utilized, namely A-4, T-1, and C-2. as suggested in [12]. The Server Machine Dataset (SMD) [9] includes multiple server machine metrics from a large Internet company. Emphasis was placed on non-trivial subsets 1-1, 1-6, 2-1, 3-2, and 3-7, as suggested [12]. Lastly, the Secure Water Treatment (SWaT) [38] was a dataset that spans 11 days of continuous operation. In the experiments, SWaT was downsampled to the minute level and binary features were discarded. Detailed information is listed in Tab. I.

V-A2 Baselines

The proposed method was compared to traditional and DNN-based methods. Traditional methods include LOF [31], CBLOF [32], OCSVM [33], IForest [34], HBOS [35], LODA [36] and ECOD [37]. DNN-based methods comprised two RNN-based methods: Omnianomaly [9] and FGANomaly [10], two Transformer-based methods: AnomalyTrans [11] and DCdetector [13], and a time series decomposition-based method: D3R [14].

V-A3 Implementation details

All baselines were implemented based on PyOD [39] or their publicly available codes. We utilized Peak-Over-Threshold (POT) [40] for threshold selection. The smoothness parameter λ𝜆\lambdaitalic_λ was set to 104superscript10410^{4}10 start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT. The block size L𝐿Litalic_L was 3333, the model size Dmsubscript𝐷𝑚D_{m}italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT was 32323232, the state size was 16161616 and the window size W𝑊Witalic_W was 100100100100. The training was conducted for 7777 epochs with a batch size of 128128128128, using the AdamW [41] optimizer with a learning rate of 104superscript10410^{-4}10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT. Given that point-wise metrics can be influenced by lengthy anomalies and point-adjustment strategy [9] could lead to misguided rankings [42], the recently proposed affiliation-based precision (P-AF), recall (R-AF), and F1-score (F1-AF) [43] were adopted as evaluation metrics. These metrics offer evaluations at the level of abnormal events rather than individual points, making them suitable for TSAD. Since F1-AF represents the harmonic mean of P-AF and R-AF, it was employed to evaluate the overall performance, a common practice in the TSAD community.

V-B Results

The proposed method was compared to 12121212 baselines, as listed in Tab. II. Generally, sequence model-based methods outperformed traditional methods due to their ability to model temporal dependencies. The proposed method surpassed all compared methods on the three datasets, achieving a 6.2%percent6.26.2\%6.2 % relative improvement in average F1-AF compared to the best baseline method. This demonstrated that the proposed S6 model-based detector was suitable for TSAD tasks and superior to previous SOTA detectors with RNN or Transformer backbones. Meanwhile, for the SWaT and SMD datasets containing significant trend variation, D3R and the proposed method achieved notable performance improvement by incorporating time series decomposition approaches compared to other methods. Overall, the experimental results showcased the robustness of the proposed method in non-stationary scenarios and its ability to detect complex time series anomalies.

TABLE III: Ablation study of the HP trend filter and AMA. The best results were bolded while the second best were underlined.
Modules F1-AF
HP Trend Filter AMA NASA SWaT SMD Avg.
0.67170.67170.67170.6717 0.69680.69680.69680.6968 0.77800.77800.77800.7780 0.71550.71550.71550.7155
0.6777¯¯0.6777\underline{0.6777}under¯ start_ARG 0.6777 end_ARG 0.66620.66620.66620.6662 0.77820.77820.77820.7782 0.70740.70740.70740.7074
0.67000.67000.67000.6700 0.7302¯¯0.7302\underline{0.7302}under¯ start_ARG 0.7302 end_ARG 0.8110¯¯0.8110\underline{0.8110}under¯ start_ARG 0.8110 end_ARG 0.7371¯¯0.7371\underline{0.7371}under¯ start_ARG 0.7371 end_ARG
0.68290.6829\bm{0.6829}bold_0.6829 0.73800.7380\bm{0.7380}bold_0.7380 0.84030.8403\bm{0.8403}bold_0.8403 0.75370.7537\bm{0.7537}bold_0.7537

To further verify the necessity of the HP trend filter and the AMA mechanism, an ablation analysis was conducted and the results were shown in Tab. III. The results demonstrated that applying only one might not bring improvement. For instance, solely applying the HP trend filter in the NASA dataset or only utilizing AMA in the SWaT dataset led to a performance drop. However, when both were applied, the HP trend filter provided stable input for AMA’s period estimation, where AMA and the detector could symbiotically optimize performance and capture more complex patterns, resulting in the best performance over all datasets. Additionally, a visualization in Fig. 4 clearly showcased the effectiveness of the proposed detrending mechanisms.

Refer to caption
Figure 4: A visualization from the SWaT dataset, which contains non-stationary data. The first row shows the input signal and extracted trend. The second row displays anomaly scores from various detectors. Baseline methods were influenced by trends, leading to false alarms, while our proposed method generated reliable scores by incorporating detrending mechanisms.

VI Conclusion

This paper proposed solutions to effectively tackle two critical challenges in TSAD: modeling long-range dependency and generalizing to non-stationary data, which are bottlenecks for sequence models in TSAD. Our solutions involve leveraging a recently published S6 model to capture long-term context and introducing a novel multi-stage detrending mechanism to provide stable input for the sequence model. Experimental results underscored the suitability of the S6 model for TSAD and highlighted the importance of the proposed detrending mechanism. Furthermore, this study laid the foundation for developing more advanced S6 model-based detectors for TSAD.

References

  • [1] Z. Z. Darban, G. I. Webb, S. Pan, C. C. Aggarwal, and M. Salehi, “Deep learning for time series anomaly detection: A survey,” arXiv preprint arXiv:2211.05244, 2022.
  • [2] J. Yang, G. I. Choudhary, S. Rahardja, and P. Fränti, “Classification of interbeat interval time-series using attention entropy,” IEEE Transactions on Affective Computing, vol. 14, no. 1, pp. 321–330, 2020.
  • [3] J. Yang, X. Tan, and S. Rahardja, “Mipo: How to detect trajectory outliers with tabular outlier detectors,” Remote sensing, vol. 14, no. 21, p. 5394, 2022.
  • [4] N. LaRosa, J. Farber, P. Venkitasubramaniam, R. Blum, and A. Al Rashdan, “Separating sensor anomalies from process anomalies in data-driven anomaly detection,” IEEE Signal Processing Letters, vol. 29, pp. 1704–1708, 2022.
  • [5] Z. Wang, Y. Zhang, G. Wang, and P. Xie, “Main-auxiliary aggregation strategy for video anomaly detection,” IEEE Signal Processing Letters, vol. 28, pp. 1794–1798, 2021.
  • [6] K.-H. Lai, D. Zha, J. Xu, Y. Zhao, G. Wang, and X. Hu, “Revisiting time series outlier detection: Definitions and benchmarks,” in Thirty-fifth conference on neural information processing systems datasets and benchmarks track (round 1), 2021.
  • [7] K. Cheng, Y. Liu, and X. Zeng, “Learning graph enhanced spatial-temporal coherence for video anomaly detection,” IEEE Signal Processing Letters, vol. 30, pp. 314–318, 2023.
  • [8] K. Hundman, V. Constantinou, C. Laporte, I. Colwell, and T. Soderstrom, “Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding,” in Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, 2018, pp. 387–395.
  • [9] Y. Su, Y. Zhao, C. Niu, R. Liu, W. Sun, and D. Pei, “Robust anomaly detection for multivariate time series through stochastic recurrent neural network,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 2828–2837.
  • [10] B. Du, X. Sun, J. Ye, K. Cheng, J. Wang, and L. Sun, “Gan-based anomaly detection for multivariate time series using polluted training set,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 12, pp. 12 208–12 219, 2021.
  • [11] J. Xu, H. Wu, J. Wang, and M. Long, “Anomaly transformer: Time series anomaly detection with association discrepancy,” in International Conference on Learning Representations, 2021.
  • [12] S. Tuli, G. Casale, and N. R. Jennings, “Tranad: deep transformer networks for anomaly detection in multivariate time series data,” Proceedings of the VLDB Endowment, vol. 15, no. 6, pp. 1201–1214, 2022.
  • [13] Y. Yang, C. Zhang, T. Zhou, Q. Wen, and L. Sun, “Dcdetector: Dual attention contrastive representation learning for time series anomaly detection,” in Proceedings of the 29th ACM SIGKDD international conference on knowledge discovery & data mining, 2023, pp. 3033–3045.
  • [14] C. Wang, Z. Zhuang, Q. Qi, J. Wang, X. Wang, H. Sun, and J. Liao, “Drift doesn’t matter: Dynamic decomposition with diffusion reconstruction for unstable multivariate time series anomaly detection,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [15] Y. He and J. Zhao, “Temporal convolutional networks for anomaly detection in time series,” in Journal of Physics: Conference Series, vol. 1213, no. 4.   IOP Publishing, 2019, p. 042050.
  • [16] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
  • [17] R. B. Cleveland, W. S. Cleveland, J. E. McRae, I. Terpenning et al., “Stl: A seasonal-trend decomposition,” J. Off. Stat, vol. 6, no. 1, pp. 3–73, 1990.
  • [18] R. J. Hodrick and E. C. Prescott, “Postwar us business cycles: an empirical investigation,” Journal of Money, credit, and Banking, pp. 1–16, 1997.
  • [19] H. Wu, J. Xu, J. Wang, and M. Long, “Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting,” Advances in neural information processing systems, vol. 34, pp. 22 419–22 430, 2021.
  • [20] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin, “Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting,” in International conference on machine learning.   PMLR, 2022, pp. 27 268–27 286.
  • [21] Z. Zhang, R. Wang, R. Ding, and Y. Gu, “Unravel anomalies: an end-to-end seasonal-trend decomposition approach for time series anomaly detection,” in 2024 International Conference on Acoustics, Speech and Signal Processing.   IEEE, 2024, pp. 5415–5419.
  • [22] S. Qin, J. Zhu, D. Wang, L. Ou, H. Gui, and G. Tao, “Decomposed transformer with frequency attention for multivariate time series anomaly detection,” in 2022 International Conference on Big Data.   IEEE, 2022, pp. 1090–1098.
  • [23] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
  • [24] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [26] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International conference on machine learning.   PMLR, 2013, pp. 1310–1318.
  • [27] A. Gu, K. Goel, and C. Re, “Efficiently modeling long sequences with structured state spaces,” in International Conference on Learning Representations, 2021.
  • [28] Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, and Y. Liu, “Vmamba: Visual state space model,” arXiv preprint arXiv:2401.10166, 2024.
  • [29] X. Jiang, C. Han, and N. Mesgarani, “Dual-path mamba: Short and long-term bidirectional selective structured state space models for speech separation,” arXiv preprint arXiv:2403.18257, 2024.
  • [30] S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning,” Neural networks, vol. 107, pp. 3–11, 2018.
  • [31] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “Lof: identifying density-based local outliers,” in Proceedings of the 2000 ACM SIGMOD international conference on Management of data, 2000, pp. 93–104.
  • [32] Z. He, X. Xu, and S. Deng, “Discovering cluster-based local outliers,” Pattern recognition letters, vol. 24, no. 9-10, pp. 1641–1650, 2003.
  • [33] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, “Estimating the support of a high-dimensional distribution,” Neural computation, vol. 13, no. 7, pp. 1443–1471, 2001.
  • [34] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in 2008 eighth ieee international conference on data mining.   IEEE, 2008, pp. 413–422.
  • [35] M. Goldstein and A. Dengel, “Histogram-based outlier score (hbos): A fast unsupervised anomaly detection algorithm,” KI-2012: poster and demo track, vol. 1, pp. 59–63, 2012.
  • [36] T. Pevnỳ, “Loda: Lightweight on-line detector of anomalies,” Machine Learning, vol. 102, pp. 275–304, 2016.
  • [37] Z. Li, Y. Zhao, X. Hu, N. Botta, C. Ionescu, and G. H. Chen, “Ecod: Unsupervised outlier detection using empirical cumulative distribution functions,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 12, pp. 12 181–12 193, 2022.
  • [38] A. P. Mathur and N. O. Tippenhauer, “Swat: A water treatment testbed for research and training on ics security,” in 2016 international workshop on cyber-physical systems for smart water networks (CySWater).   IEEE, 2016, pp. 31–36.
  • [39] Y. Zhao, Z. Nasrullah, and Z. Li, “Pyod: A python toolbox for scalable outlier detection,” Journal of machine learning research, vol. 20, no. 96, pp. 1–7, 2019.
  • [40] A. Siffer, P.-A. Fouque, A. Termier, and C. Largouet, “Anomaly detection in streams with extreme value theory,” in Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery & data mining, 2017, pp. 1067–1075.
  • [41] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  • [42] S. Kim, K. Choi, H.-S. Choi, B. Lee, and S. Yoon, “Towards a rigorous evaluation of time-series anomaly detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 7, 2022, pp. 7194–7201.
  • [43] A. Huet, J. M. Navarro, and D. Rossi, “Local evaluation of time series anomaly detection algorithms,” in Proceedings of the 28th ACM SIGKDD international conference on knowledge discovery & data mining, 2022, pp. 635–645.