Position Encoding Based Convolutional Neural Networks for Machine Remaining Useful Life Prediction

Ruibing Jin, Min Wu, Keyu Wu, Kaizhou Gao, Zhenghua Chen, and Xiaoli Li

IEEE/CAA Journal of Automatica Sinica, 2022, Issue 8 (2022-08-13)

Abstract—Accurate remaining useful life (RUL) prediction is important in industrial systems. It prevents machines from working under failure conditions, and ensures that the industrial system works reliably and efficiently. Recently, many deep learning based methods have been proposed to predict RUL. Among these methods, recurrent neural network (RNN) based approaches show a strong capability of capturing sequential information. This allows RNN based methods to perform better than convolutional neural network (CNN) based approaches on the RUL prediction task. In this paper, we question this common paradigm and argue that existing CNN based approaches are not designed according to the classic principles of CNN, which reduces their performance. Additionally, the capacity of capturing sequential information is highly affected by the receptive field of the CNN, which is neglected by existing CNN based methods. To solve these problems, we propose a series of new CNNs, which show competitive results compared with RNN based methods. Compared with an RNN, a CNN processes the input signals in parallel, so the temporal order is not easily determined. To alleviate this issue, a position encoding scheme is developed to enhance the sequential information encoded by a CNN. With this scheme, our proposed position encoding based CNN, called PE-Net, is further improved and even performs better than RNN based methods. Extensive experiments are conducted on the C-MAPSS dataset, where our PE-Net shows state-of-the-art performance.

I. INTRODUCTION

Prognostic health management (PHM) plays an important role in industry, and improves the stability and reliability of industrial equipment. In PHM, remaining useful life (RUL) prediction is often used to prevent unnecessary downtime in industrial systems, and thus reduces maintenance costs [1]. Recently, neural networks have been widely studied, and they have shown impressive performance in different applications, such as computer vision [2], [3], natural language processing [4], [5] and optimization [6]–[8]. Without expert knowledge of a specific field, neural networks are able to learn an effective model given enough training data. For example, in industry, the control system may be complex, which often makes it challenging to formulate a model for optimization. To solve this problem, neural networks [6]–[8] are often proposed to simplify complex control systems and achieve better performance. With the remarkable progress of deep learning, neural networks have become deeper. Many deep learning based approaches built on these neural networks have been proposed [1], [9]–[22] for RUL prediction, and achieve notable performance on this task. Currently, existing deep learning based methods fall into three categories: convolutional neural network (CNN) based methods, recurrent neural network (RNN) based methods, and hybrid CNN and RNN based methods.

In CNN based methods [10], [13], [14], several convolutional layers and pooling layers are stacked together as a deep neural network, which directly extracts feature representation from raw time series signals. Then, the fully connected layers are used as a regressor to predict the final results. Benefiting from this deep neural network architecture, CNN based methods [10], [13], [14], [23] are able to provide much useful information for RUL prediction. With the features automatically learned from a CNN, the regressor can then accurately predict the machine RUL. Time series signals are usually the inputs for the RUL prediction. It is thus important to capture the sequential information in time series signals for predicting the RUL. Since CNN is originally designed to process data in a parallel way, several studies [1], [11], [15]–[17], [21] argue that CNN may not be suitable to exploit the sequential information compared with RNN.

With this viewpoint, some RNN based approaches [1], [11], [15]–[17], [21], [22], [24], [25] are proposed. Usually, an RNN receives the signal in a recurrent way, which enables it to capture sequential information and provide more accurate prediction results. However, limited by the recurrent forward scheme, RNN based methods cannot be trained in parallel, so training a deep RNN becomes time-consuming. In practice, most RNN based methods only stack two or three RNN layers. Since shallow networks may not be able to generate strong feature representations, the performance of RNN based methods is hindered. To address this issue, some hybrid CNN and RNN based methods [12], [18], [19] are proposed, where a deep CNN is used as a feature extractor, and a shallow RNN is applied to exploit the sequential information. In these hybrid approaches, although a deep CNN is adopted for feature generation, sequential information is still captured by a shallow RNN. As a result, sequential information may not be fully exploited, which makes these hybrid methods sub-optimal.

To alleviate this problem, we propose a new CNN based method called the position encoding based network (PE-Net) in this paper. Firstly, we rethink the common viewpoint held in RUL prediction, and investigate how to improve the CNN’s capacity of capturing sequential information. Through a series of analyses, we find that existing CNN based approaches [10], [13], [14], [23] are limited by three factors. 1) Existing CNN based methods neglect the relationship between the receptive field and the capability of capturing sequential information. In fact, the sequential information captured by a CNN can be improved by enlarging the receptive field of its last convolutional layer. 2) The architecture of existing CNN based methods does not follow the classic principle which is widely adopted by famous CNNs like VGG16 [2], GoogleNet [26], ResNet [3] and MobileNet [27]. According to this principle, the kernel size should become smaller and the channel number should become larger as the CNN depth increases. Extensive experiments have demonstrated that this principle can enhance the CNN’s capability of feature learning. However, existing CNN based RUL prediction methods are not designed according to this principle [10], [13], [14], [23]. They simply stack the same convolutional layer repeatedly, which increases the difficulty of training and degrades the quality of the feature map produced by the CNN. 3) Feature normalization is not considered by existing CNN based methods. Without feature normalization, overfitting easily happens during the CNN’s training process. This further limits the performance of CNN based methods.

Motivated by these observations, we propose a series of novel CNNs with carefully designed network architectures. Our new CNNs differ from existing ones in the following three aspects. Firstly, we enlarge the CNN’s receptive field for the final convolutional layer. This enables our CNN to capture more sequential information compared with existing CNN based methods. Additionally, our proposed CNNs are designed according to the classic principles, which improves the quality of the feature map produced by our CNN. Furthermore, our proposed CNN is a feature normalization based network, which adopts the batch normalization technique to improve the RUL prediction accuracy. To the best of our knowledge, we are the first to demonstrate the importance of the above aspects in designing CNNs for RUL prediction. With these modifications, our proposed CNNs show satisfactory performance for RUL prediction, and are able to perform comparably to RNN based methods.

In addition, to enhance the sequential information encoded by the CNN, we propose a position encoding scheme based on [4]. The original position encoding scheme [4] is proposed for natural language processing, where a pre-defined position vector is directly added to a feature vector produced by an off-the-shelf word embedding. However, there is no off-the-shelf word embedding in the RUL field. To adapt the position encoding scheme to the RUL task, we first propose two transformations which transform both the raw input signals and the pre-defined position vectors into the same latent space. Then, we propose two different fusion methods to effectively combine the transformed position information with the transformed input signals. With our proposed position encoding scheme, our PE-Net can fully exploit the sequential information in time series signals. In experiments, our proposed PE-Net surpasses existing methods and sets new state-of-the-art performance for RUL prediction on the C-MAPSS dataset.

Overall, our contributions can be summarized as follows:

1) We first investigate various CNN architectures for capturing sequential information. According to our analysis, a series of novel CNNs are proposed for RUL prediction.

2) To further enhance the capability of capturing sequential information, we propose a position encoding based CNN called PE-Net for more accurate RUL prediction.

3) Extensive experiments are conducted, and our PE-Net shows state-of-the-art performances on the C-MAPSS dataset.

The rest of the paper is organized as follows: Section II discusses related work. Our PE-Net method is presented in Section III. Experimental results on public datasets and related discussions are provided in Section IV. We present conclusions and future work in Section V.

II. LITERATURE REVIEW

Methods in RUL prediction can be roughly divided into two categories: model based [28], [29] and data driven methods [30]–[32].

Model Based Methods: Model based methods are proposed according to the physical properties of the industrial systems [28], [29]. These approaches explicitly investigate the relationship between the sensor data and the target output. However, model based methods become impractical when the industrial systems are complex.

Data Driven Methods: Compared with model based methods, data driven methods do not require physical knowledge of the industrial systems. They can be easily applied to industrial systems by collecting enough sensor data. Data driven methods can be further divided into two classes: conventional machine learning based methods [33] and deep learning based methods. Since our work is based on deep learning, we focus on discussing deep learning based methods in this paper.

Among deep learning based methods, a CNN based method [14] is first proposed to explore how to apply CNN to RUL prediction. In this method, the proposed CNN consists of two 2D convolutional layers, two pooling layers and one fully connected layer. To adapt the 2D convolutional layer to the time series signal, the original signal is segmented by a sliding window strategy into two-dimensional data matrices, where each row indicates the time series signal from one sensor, and each column represents the time step. Instead of using 2D convolutional layers, Li et al. [23] utilize 1D convolutional layers to extract features from the original signals. A deeper CNN is proposed in [23], where five 1D convolutional layers are used to produce the feature representation and one fully connected layer is used to predict the final results. In [13], a double-convolutional neural network (D-CNN) is proposed. D-CNN is composed of two convolutional neural networks. It divides the RUL prediction into two stages: fault point detection and RUL prediction. One CNN first predicts the fault point, and another CNN is used to produce the RUL according to the fault point estimation.

Although CNN based methods achieve notable results, their performance is still not satisfactory. A CNN processes input signals in a parallel way. In contrast, an RNN receives the input signal recursively, which suggests that an RNN may be more suitable for processing time series signals. Deep long short-term memory (LSTM) [15] proposes a network consisting of two LSTM layers and two fully connected layers. By using LSTM, Deep LSTM [15] shows better performance than CNN based methods. An LSTM only receives information in a forward manner. To further improve the captured sequential information, [16] proposes a bi-directional LSTM based approach. The bi-directional LSTM (Bi-LSTM) is able to receive time series signals in both a forward and a backward manner, which improves the exploited sequential information and enhances RUL prediction accuracy. Following the method in [16], Huang et al. [17] combine operational conditions with the original input signals in a bi-directional LSTM network. The method in [17] effectively uses operational conditions to improve the performance of RUL prediction. ATS2S [24] proposes a self-supervised method based on a sequence-to-sequence approach to improve performance. Chen et al. [1] propose an attention-based LSTM method for RUL prediction, where handcrafted features are integrated with features learned by the LSTM. Based on the fused features, RUL prediction accuracy is improved. In [11], an attention-based gated recurrent unit (GRU) network is proposed for RUL prediction.

RNN based methods (including LSTM and GRU) achieve better performance than CNN based methods. However, since an RNN receives input signals recursively, it is impractical to train a very deep RNN. To utilize deep network features, some hybrid CNN and LSTM based methods [12], [18], [19] are proposed. Although these approaches show impressive performance, their RNN parts are shallow modules, which limits their performance. To solve this problem, we investigate a series of effective CNNs for RUL prediction in this paper. With novel CNN architectures, our CNNs can be easily trained and capture the sequential information effectively. Moreover, based on [4], we develop a position encoding scheme in our CNNs to better model the sequential information and thus achieve better performance.

III. METHOD

In this section, we first investigate the relationship between the CNN architecture and its capability of modelling sequential information. Based on our analysis, a series of new convolutional neural networks are proposed. Furthermore, a position encoding scheme is developed to further enhance the encoded sequential information.

Following the protocol of existing methods [10], [14], a sliding window technique is applied to the original time series signals. These signals are segmented into an input matrix denoted as $X \in \mathbb{R}^{T \times C}$, where $T$ is the segmented temporal length and $C$ represents the number of sensors. Given the segmented input $X$, a designed neural network $N$ is expected to produce the final output $y_p$. This can be formulated as

$$y_p = N(X)$$

In the following parts, we discuss how to design a CNN which can effectively capture sequential information, and provide an accurate estimation for RUL.

A. Network Architecture Analysis

The performance of a CNN is affected by many factors. In this section, we discuss the CNN’s architecture with respect to convolutional kernel selection, layer type, receptive field and network depth.

1) Convolutional Kernel Selection: There are three conventional convolutions in CNN: 1D convolution, 2D convolution and 3D convolution. 3D convolution is often used in video related tasks, such as action recognition [34] and video object detection [35], [36]. Since time series signals in RUL prediction are segmented into 2D matrices, 3D convolution is clearly not suitable for RUL prediction.

For RUL prediction, existing CNN based methods can be divided into two categories in view of the convolutional layer type: 1D convolution based methods [10], [13], [23] and 2D convolution based methods [12], [14]. In this part, we compare 1D convolution and 2D convolution according to the characteristics of time series signals, and investigate which convolution operation is more suitable for RUL prediction.

In the 1D convolution operation, $K$ is the kernel size, $b$ denotes a learnable bias and $W$ is a learnable kernel weight matrix in the convolutional layer.

In Fig. 1, the 1D convolution operation is applied on time series signals, where each sensor signal is indicated by a unique color. The vertical axis indicates the sensor index, and the horizontal axis represents the time step.

As shown at the left side of Fig. 1, the kernel is moved along the time axis in the single sensor case. When the number of sensors increases, different types of sensors are put along the channel axis. According to the convolution definition, the number of channels of the kernel should be the same as that of the input. As displayed at the right side of Fig. 1, we extend the kernel along the channel axis and still move the kernel along the time axis.
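The multi-sensor case described above can be illustrated with a minimal sketch, assuming a single output channel, no padding and a stride of 1. The function name `conv1d` and the toy values are hypothetical; `K`, `w` and `b` play the roles of the kernel size, kernel weight matrix and bias. The kernel spans every sensor channel and slides along the time axis only.

```python
def conv1d(x, w, b):
    """Valid 1D convolution (cross-correlation) over the time axis.

    x: T rows, each a list of C sensor readings (shape T x C).
    w: K rows, each a list of C kernel weights (shape K x C).
    b: scalar bias.
    Returns a list of T - K + 1 outputs, one per kernel position.
    """
    T, K, C = len(x), len(w), len(x[0])
    out = []
    for t in range(T - K + 1):      # slide along the time axis only
        acc = b
        for k in range(K):          # kernel taps in time
            for c in range(C):      # kernel spans all sensor channels
                acc += w[k][c] * x[t + k][c]
        out.append(acc)
    return out


# Toy input (illustrative values): T = 5 time steps, C = 2 sensors.
x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
# Kernel of size K = 3 that averages sensor 0 and ignores sensor 1.
w = [[1 / 3, 0.0], [1 / 3, 0.0], [1 / 3, 0.0]]
y = conv1d(x, w, b=0.0)
```

A convolutional layer with several output channels would simply apply one such kernel per output channel.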

Fig. 1. 1D convolution operation applied on time series. In the single sensor case, as shown at the left side, a convolution kernel is moved along the time axis. When the number of sensors increases, according to the convolution definition, the number of channels of the kernel is extended to the same as that of input, which is illustrated at the right side.

We visualize the 2D convolution operation applied to time series signals in Fig. 2. According to the left side of Fig. 2, in the single sensor case, since there is only one type of sensor, the 2D convolution is actually equal to 1D convolution. In the multi-sensor case, different from 1D convolution, the 2D convolution kernel is moved along two directions.

Fig. 2. 2D convolution operation applied on time series. Different from the 1D convolution operation, the 2D convolution operation moves the convolution kernel in two directions: the time step direction and the sensor index direction. In the single sensor case, since only one type of sensor exists, as displayed at the left side, the 2D convolution is equal to 1D convolution. In the multi-sensor case, as illustrated at the right side, the 2D convolution moves the kernel over a plane.

To investigate the suitable convolution operation for time series signals, we need to analyze the characteristics of time series signals. Some example time series signals are shown in Fig. 3. As shown in Fig. 3, signals from different sensors show significantly diverse properties. The convolutional layer shares parameters and aims to learn translation invariant features. When 2D convolution is applied to time series signals, different sensor signals are arranged into a two-dimensional matrix. It is difficult to exploit the invariant information because the kernel is shared across sensor signals with diverse properties. Compared with 2D convolution, 1D convolution moves the convolution kernel along the time axis only, instead of both the sensor axis and the time axis. It is therefore much easier for 1D convolution kernels to learn generic features from time series signals.

Fig. 3. Visualization for different time series signals. It can be found that different signals show different properties.

2) Receptive Field for Modeling Sequential Information:

Although an RNN’s recursive forward scheme enables it to capture sequential information, it is impractical to build a deep RNN. This limits the feature representation for RUL prediction. Compared with an RNN, a deep CNN is much easier to train. In this part, we investigate how to design a deep CNN which can effectively encode sequential information from time series signals.

Since the kernel size is often much smaller than the temporal length of the time series input, it is commonly believed that a CNN cannot fully exploit sequential information. However, the CNN’s capability of modelling sequential information is not determined by the kernel size. This capability is actually determined by the receptive field of the CNN’s final feature map.

Fig. 4. Visualization for the receptive field in a CNN. It shows that the receptive field can be enlarged by stacking several convolutional layers.

As shown in Fig. 4, three consecutive feature maps are illustrated, where the kernel size in the corresponding CNN is set as three. For convenience, we use $r_n$ to represent the receptive field at the $n$th feature map $F_n$. In Fig. 4, $F_1$ indicates the raw input signal, and $r_1$ is 1. The second feature map $F_2$ is generated by a convolutional layer with a kernel of size $3 \times 1$, and $r_2$ is 3. At the final feature map $F_3$, the receptive field $r_3$ becomes 5, though the convolutional kernel size is 3. This demonstrates that each feature vector at $F_3$ is able to capture information across 5 time steps in the raw input signal. Furthermore, we can stack more convolutional layers to encode more temporal information. This indicates that a CNN is able to capture long-range dependencies or sequential information by stacking more convolutional layers.

In a CNN, the receptive field of the final produced feature map, $r_f$, can be computed recursively from the kernel sizes and strides of the stacked convolutional layers.
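As a minimal sketch, the receptive field can be computed layer by layer under the commonly used recursion $r_n = r_{n-1} + (k_n - 1)\prod_{i<n} s_i$, where $k_n$ and $s_n$ are the kernel size and stride of the $n$th layer. This form is an assumption and may differ in notation from the authors' own formula; the function name `receptive_field` is hypothetical.

```python
def receptive_field(layers):
    """Receptive field of the last feature map on the raw input.

    layers: list of (kernel_size, stride) pairs, first layer to last layer.
    """
    r = 1      # the raw input sees exactly one time step
    jump = 1   # distance, in input steps, between adjacent feature positions
    for kernel_size, stride in layers:
        r += (kernel_size - 1) * jump
        jump *= stride
    return r


# The setting of Fig. 4: two stacked layers with kernel size 3 and stride 1.
rf_fig4 = receptive_field([(3, 1), (3, 1)])

# Hypothetical configuration (not taken from Table I): a stride-2 first layer
# widens the receptive field of every later layer.
rf_strided = receptive_field([(5, 2), (3, 1)])
```

With the Fig. 4 setting the function returns 5, matching $r_3$ in the text. The strided example shows why down-sampling early in the network enlarges the receptive field faster than stacking stride-1 layers alone.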

3) Network Depth and Kernel Size: To ensure the CNN’s capacity for sequential information, a large receptive field is required. To fulfill this requirement, we have two solutions: a large kernel size with a shallow network, and a small kernel size with a deep network. In VGG-Net [2], more non-linear layers make the decision function of a CNN more discriminative. According to ZF-Net [37], the feature quality is improved when the kernel size and stride become small. Based on these observations, we design a new CNN with small kernel size and increased depth.

B. Proposed Convolutional Neural Networks

According to the discussions above, we propose a series of deep CNNs for the RUL prediction in this part.

Various famous CNNs, such as VGG16 [2], GoogleNet [26], ResNet [3] and MobileNet [27], follow a common principle in designing their network architecture, i.e., the kernel size should become smaller and the channel number should become larger as the CNN depth increases. However, existing CNN based methods for RUL prediction [10], [13], [14], [23] do not follow this classic principle, which limits their performance. In comparison, our proposed CNNs can produce a high-quality feature map by following this design principle.

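Additionally, existing CNN based methods [10], [13], [14], [23] only design a shallow CNN for RUL prediction. Considering that the internal covariate shift becomes serious in a deep CNN [38], we adopt a batch normalization (BN) layer [38] in our proposed CNN. The BN layer can be formulated as

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$

where $\mu$ and $\sigma^2$ are the mean and variance of the input feature $x$ computed over a mini-batch, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are learnable scale and shift parameters [38].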
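A minimal sketch of batch normalization for a single channel, assuming the standard training-mode formulation of [38]: the values are normalized with the mini-batch mean and variance, then scaled and shifted by learnable parameters. The function names and default values are hypothetical, and a ReLU is appended to show the order used in a convolution unit.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel over a mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


# Toy activations of one channel across a mini-batch (illustrative values).
activations = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(activations)
activated = relu(normalized)
```

At inference time, BN implementations typically replace the mini-batch statistics with running averages collected during training; that part is omitted here.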

Furthermore, to improve the non-linearity of our proposed CNN, we choose the rectified linear unit (ReLU) as the activation function, which is defined as

$$\mathrm{ReLU}(x) = \max(0, x)$$

In our proposed CNN, each convolutional layer is followed by a BN layer and a ReLU layer, which is illustrated in Fig. 5. We regard this three-layer combination as a convolution unit and use it to design our CNNs.

Fig. 5. The convolution unit in our proposed CNN. This unit consists of a convolutional layer, a batch normalization layer and a rectified linear layer.

In Table I, a series of CNNs are proposed for RUL prediction. Convolutional parameters are indicated as $(k_1 \times k_2, C, s)$, where $k_1 \times k_2$ is the kernel size, $C$ denotes the channel number and $s$ represents the stride value. Each convolutional layer is followed by a BN layer and a ReLU layer. Instead of using a pooling layer, we down-sample a feature map by setting the stride value as 2 in some convolutional layers. We also list the corresponding receptive field $r_f$ for the feature map produced by the final convolutional layer. After that, two fully connected (denoted as FC) layers are applied as a decoder, where the number of hidden units for each layer is 256. Finally, we use an FC layer to predict the final results.

TABLE I OUR PROPOSED CNN FOR THE RUL PREDICTION. THE PARAMETER FOR A CONVOLUTIONAL LAYER IS REPRESENTED AS (KERNEL SIZE, CHANNEL NUMBER, STRIDE VALUE)

Fig. 6. The work flow of our proposed CNNs. With a careful design, in our proposed CNN, each convolutional layer is followed by a batch normalization layer, and the variation of the produced feature map follows the classic principle: the channel number increases, and the height and width decrease with the CNN depth increasing.

As listed in Table I, we propose four different CNNs. It can be seen that our proposed CNNs are designed following the principle discussed above: we set the kernel size as 5 only at the first layer, and the kernel sizes for other layers are set as 3. For the channel number, we gradually increase it from 16 to 256. CNN-A consists of two convolutional units. We set the stride value for the conv1 layer as 2 to down-sample the feature map. In CNN-B, two more convolutional units are added. The stride values for these two additional convolutional layers are set as 2. The receptive field for CNN-B is increased by stacking more layers. To further enlarge the receptive field, we stack more layers, and propose CNN-C and CNN-D. To present our proposed CNN clearly, the work flow is visualized in Fig. 6. As shown in Fig. 6, the channel number of the feature map gradually increases, and the height and width gradually decrease, since the feature map is down-sampled as the CNN depth increases. In our CNN, the BN layer is integrated with the 1D convolutional layer, which addresses the issue of internal covariate shift in the CNN. In our experiments, we use these four CNNs with different configurations to verify our hypothesis above.

C. Position Encoding Scheme

where $i$ represents the dimension index in our proposed position encoding scheme. When we use sine and cosine functions, $p_b(i)$ is modified as $p_b(i) = 10000^{2\lfloor i/2 \rfloor / p_d}$.
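A minimal sketch of a sine and cosine position encoding, assuming the convention of [4]: even dimensions use sine, odd dimensions use cosine, and the time step is divided by $10000^{2\lfloor i/2 \rfloor / p_d}$. The function name, the window length of 30 and the encoding dimension of 16 are illustrative assumptions, not values confirmed by the text for this step.

```python
import math


def position_encoding(num_steps, p_d):
    """Return a num_steps x p_d table of sinusoidal position vectors."""
    table = []
    for t in range(num_steps):
        row = []
        for i in range(p_d):
            p_b = 10000 ** (2 * (i // 2) / p_d)  # wavelength base per dimension
            angle = t / p_b
            # Assumption following [4]: sine on even dims, cosine on odd dims.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# One position vector per time step of a segmented window.
P = position_encoding(num_steps=30, p_d=16)
```

Each row depends only on the time step, so the same table can be reused for every segmented window of the same length.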

With the above position encoding scheme, there may be a domain gap between our encoded position information and the input signals. To address this issue, we propose a novel position encoding scheme, where two transformations, $\Phi(\cdot)$ and $\Psi(\cdot)$, are proposed to translate the encoded position information and the input signals into the same high-dimensional space, and then these two translated features are fused together. This fusion process can be formulated as

where $F$ indicates the fused feature vector, and the fusion function combines the two transformed features. By fusing the position information with the input signals, our proposed CNN can effectively capture the sequential information in time series signals.

Fig. 7 shows the proposed fusion module, which combines the position encoding vectors and the segmented input signals. In particular, two convolutional layers are used as the two transformations $\Phi(P)$ and $\Psi(X)$. Our proposed position encoding scheme is therefore different from the original version [4] in three aspects. 1) There is no off-the-shelf word embedding in RUL prediction. To address this issue, we propose a transformation $\Psi(X)$ to adaptively convert the raw time series signals into feature vectors for the position embedding fusion. 2) Instead of directly adding the produced feature vector to the position vector, we propose another transformation $\Phi(P)$ to project the position vector into a latent space. 3) We propose a fusion module, represented by the fusion function in (12), to combine these two transformed feature vectors. In the fusion module, we propose two types of fusion methods: concatenation and element-wise addition. After the fusion operation, a batch normalization layer is added to improve the generalization of the encoded output.
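The fusion module can be sketched as follows, assuming that each transformation is a 1D convolution with kernel size 1, which is the same linear map applied at every time step. The sizes, random weights and helper names are hypothetical stand-ins: 14 input sensors, a position dimension of 8 and 16 output channels. The final batch normalization layer is omitted.

```python
import random


def pointwise_conv(x, weight, bias):
    """Kernel-size-1 convolution: one linear map shared by all time steps.

    x: T x C_in, weight: C_out x C_in, bias: C_out. Returns T x C_out.
    """
    return [
        [sum(w * v for w, v in zip(w_row, step)) + b
         for w_row, b in zip(weight, bias)]
        for step in x
    ]


def fuse(phi_p, psi_x, mode="add"):
    """Combine transformed position vectors and transformed input signals."""
    if mode == "add":      # element-wise addition keeps the channel count
        return [[a + b for a, b in zip(rp, rx)] for rp, rx in zip(phi_p, psi_x)]
    if mode == "concat":   # concatenation doubles the channel count
        return [rp + rx for rp, rx in zip(phi_p, psi_x)]
    raise ValueError(f"unknown fusion mode: {mode}")


def random_layer(c_in, c_out, rng):
    """Hypothetical initializer standing in for learned weights."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(c_in)] for _ in range(c_out)]
    return weight, [0.0] * c_out


rng = random.Random(0)
T, C, P_D, D = 30, 14, 8, 16                                # assumed sizes
X = [[rng.random() for _ in range(C)] for _ in range(T)]    # segmented signals
P = [[rng.random() for _ in range(P_D)] for _ in range(T)]  # stand-in position vectors

w_phi, b_phi = random_layer(P_D, D, rng)   # Phi acts on position vectors
w_psi, b_psi = random_layer(C, D, rng)     # Psi acts on input signals
phi_p = pointwise_conv(P, w_phi, b_phi)
psi_x = pointwise_conv(X, w_psi, b_psi)

F_add = fuse(phi_p, psi_x, mode="add")
F_cat = fuse(phi_p, psi_x, mode="concat")
```

Element-wise addition requires both transformations to produce the same number of channels, while concatenation does not, at the cost of a wider input to the first convolutional layer.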

D. Optimization

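We regard RUL prediction as a regression problem and use the mean square error (MSE) loss to optimize our proposed neural networks. MSE is defined as

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_p^{(i)} - y^{(i)}\right)^2$$

where $n$ is the number of samples, and $y_p^{(i)}$ and $y^{(i)}$ denote the predicted and ground-truth RUL of the $i$th sample, respectively.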

IV. ExPERIMENTS

Fig. 7. Fusion module for combining position encoding vectors and segmented input signals. With the sliding window method, the raw signals are firstly segmented as the input signals. Then, the segmented input signals and the position encoding vectors are forwarded to two transformations, respectively. After that, we fuse these two transformed features via our proposed fusion module to produced the fused feature vector. Finally, we forward this fused feature vector into our proposed CNN to predict the RUL result.

In this section, extensive experiments are conducted to verify the performance of the proposed CNNs and the position encoding scheme in our PE-Net.

A. Experimental Dataset and Setting

1) Dataset: The C-MAPSS dataset [39] is widely used in RUL prediction to evaluate the performance of models. Following previous studies [1], [14], [17], we also select the C-MAPSS dataset to evaluate our proposed method. It records the signals from 21 sensors installed on aircraft engines, as illustrated in Fig. 8. These time series signals describe the degradation process of the aircraft engine. There are four subsets in this dataset: FD001, FD002, FD003 and FD004, where each subset is divided into a training set and a test set. The details of this dataset are listed in Table II. The maximum time step differs across trajectories. It ranges from 128 to 362 in FD001, from 126 to 378 in FD002, from 137 to 525 in FD003, and from 126 to 554 in FD004.

Fig. 8. Illustration for an aircraft engine. Multiple sensors are installed on an engine to measure different signals, such as temperature and speed.

As listed in Table II, the C-MAPSS dataset provides operation condition information and fault types. There are hundreds of training and testing engine data trajectories. Training trajectories record all the run-to-failure sensor data. Testing trajectories only include a certain period of sensor data during the degradation process. Among the four subsets, FD001 is the simplest, which only includes one operation condition, one fault mode, 100 training trajectories and 100 testing trajectories. FD004 is the most difficult subset, and involves 6 operation conditions, 2 fault modes, 248 training trajectories and 249 testing trajectories.

2) Data Preprocessing: The sensor data provided in the C-MAPSS dataset is redundant. Following [1], [14], [17], we remove the data from sensors indexed 1, 5, 6, 10, 16, 18, and 19. The sliding window is widely used for data segmentation [1], [10]. We set the window length as 30 and the step as 1. The piece-wise linear RUL model [1], [10] is also adopted to generate the RUL labels, where the maximal RUL is set as 125. The learning rate is 0.001 and Adam is chosen to optimize our proposed neural network.
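The segmentation and labelling steps can be sketched as below, using the window length of 30, step of 1 and maximal RUL of 125 stated above. The helper names, the toy trajectory and the convention that each window takes the label of its last time step (with RUL 0 at the final recorded step) are assumptions for illustration.

```python
def piecewise_rul(num_steps, max_rul=125):
    """Piece-wise linear RUL label for each step of a run-to-failure trajectory.

    The label is capped at max_rul early in life and then decreases linearly
    to 0 at the last recorded step (assumed convention).
    """
    return [min(max_rul, num_steps - 1 - t) for t in range(num_steps)]


def sliding_windows(signal, labels, window=30, step=1):
    """Segment a T x C trajectory into (window x C, label) samples.

    Each sample is labelled with the RUL of its last time step (assumption).
    """
    samples = []
    for start in range(0, len(signal) - window + 1, step):
        end = start + window
        samples.append((signal[start:end], labels[end - 1]))
    return samples


# Toy trajectory (illustrative): 200 time steps from 14 retained sensors.
num_steps = 200
trajectory = [[float(t)] * 14 for t in range(num_steps)]
labels = piecewise_rul(num_steps)
samples = sliding_windows(trajectory, labels)
```

A trajectory of 200 steps yields 171 windows here; trajectories shorter than the window length produce no samples and would need separate handling such as padding.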

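3) Evaluation Metrics: According to previous studies [1], [14], [17], two metrics, namely the root mean square error (RMSE) and the scoring function [14], are used to evaluate the performance of our method. The RMSE is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_p^{(i)} - y^{(i)}\right)^2}$$

where $n$ is the number of test samples, and $y_p^{(i)}$ and $y^{(i)}$ denote the predicted and ground-truth RUL of the $i$th sample, respectively.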
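Both metrics can be sketched in a few lines. The scoring function below assumes the asymmetric exponential form commonly used with C-MAPSS, with error $d$ equal to the predicted RUL minus the true RUL, penalty $e^{-d/13} - 1$ for early predictions and $e^{d/10} - 1$ for late ones. The function names and toy values are hypothetical.

```python
import math


def rmse(y_pred, y_true):
    """Root mean square error over paired predictions and targets."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n)


def score(y_pred, y_true):
    """Asymmetric score: late predictions cost more than early ones."""
    total = 0.0
    for p, t in zip(y_pred, y_true):
        d = p - t  # positive when the predicted RUL is too long (late)
        total += math.exp(-d / 13) - 1 if d < 0 else math.exp(d / 10) - 1
    return total


# Toy values (illustrative): the same absolute error, early versus late.
early = score([0.0], [10.0])   # predicted failure sooner than it happens
late = score([20.0], [10.0])   # predicted failure later than it happens
error = rmse([13.0, 7.0], [10.0, 10.0])
```

Lower is better for both metrics. RMSE treats early and late errors equally, whereas the score penalizes overestimating the remaining life more heavily, since a late prediction risks an unexpected failure.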

TABLE II DETAILS FOR C-MAPSS DATASET

B. Comparison With Other Methods

Our experimental results show that our CNN-C and CNN-D achieve similar performance. In this subsection, we compare our CNN-C with other state-of-the-art methods in Table III, where PE-Net indicates CNN-C with our position encoding scheme. Both $\Phi$ and $\Psi$ are 1D convolution operations with a 1×1×16 kernel, and the fusion function is the element-wise addition operation. To clarify the differences among the compared methods, the structures of the different types of approaches are illustrated in Fig. 9. These methods fall into three types: the LSTM based method, the CNN based method and the hybrid method. As presented in Section II, these methods utilize different neural networks to exploit the temporal dependencies for RUL prediction. For the CNN based method, a CNN is used as a feature encoder to capture the temporal information. In the LSTM based method, the LSTM or Bi-LSTM is used to exploit long-term dependencies. For the hybrid method, a CNN is first used to extract feature vectors. Then, an LSTM is used to capture the temporal information for RUL prediction.

TABLE III COMPARISON WITH OTHER METHODS

Fig. 9. Illustration for different types of methods. These methods can be roughly divided into three categories: LSTM based method, CNN based method and hybrid method.

Among these methods, Deep LSTM [15] utilizes an LSTM to extract a feature map for the RUL prediction. Huang’s BLSTM [17] adopts a bi-directional LSTM for the RUL prediction. Benefiting from the bi-directional LSTM, Huang’s BLSTM [17] performs better than Deep LSTM [15] on FD004. Chen’s Hybrid method [1] is based on a hybrid of CNN and LSTM. It achieves better performance than Deep LSTM [15] on FD001, and performs comparably to Deep LSTM [15] on FD004. KDnet [10] utilizes knowledge distillation to transfer the knowledge in a well-trained LSTM to a CNN. Although KDnet [10] achieves remarkable performance, its training process is complex and time-consuming. Compared with other methods, our proposed CNN-C achieves comparable performance without requiring a complex training process. After adding our proposed position encoding scheme, the performance of our PE-Net is further improved on nearly all subsets, which sets new state-of-the-art performance on the C-MAPSS dataset.

In Table III, we also compare our methods with three CNN based methods: Babu’s CNN [14], Li’s CNN [23] and KDnet [10]. It can be found that our methods perform better than other CNN based methods on nearly all subsets. Even on FD001, our PE-Net achieves the second best performance in terms of Score. These comparison results verify our analysis, showing our novel CNN’s strong capability of capturing sequential information for the RUL prediction.

C. Analysis Experiments

In this subsection, we analyze the performance of our method under different network configurations and different position encoding parameters. Since FD004 is the most complicated of the four subsets in the C-MAPSS dataset, we choose to carry out experiments on FD004.

1) Analysis for Different Network Configurations: We believe that the receptive field affects the CNN’s capability of capturing sequential information. To verify this hypothesis, several experiments are carried out, which are listed in Table IV.

TABLE IV THE PERFORMANCES OF OUR PROPOSED CNNS ON FD004

In Table IV, for each proposed CNN, the number of convolution units and the receptive field on the final feature map are listed. Since the temporal length of the segmented input signal is 30, a CNN whose receptive field is larger than 30 can theoretically capture the sequential information of the whole window. As shown in Table IV, CNN-D performs similarly to CNN-C. This may be because CNN-C, with a receptive field of 33, can already fully exploit the sequential information, so the improvement from further increasing the receptive field is limited. Among CNN-A, CNN-B and CNN-C, the performance of the CNN improves as its receptive field is enlarged. This verifies our analysis of the relationship between the CNN’s receptive field and its capacity for sequential information.

Since the performance of a CNN may be affected by other factors, we conduct another three comparison experiments (CNN-S1, CNN-S2 and CNN-S3) to further verify our hypothesis. To distinguish the CNN-S variants from CNN-C, we illustrate these four networks in Fig. 10, where the input tensor, the produced feature maps and the receptive fields for the last convolutional layer are shown. The major differences between the CNN-S variants and CNN-C are highlighted in red and pink. The depth of a CNN may also affect its performance. To exclude the influence of depth, an experiment called CNN-S1 is conducted. In CNN-S1, we only change the stride of the first convolutional layer from 2 to 1 and preserve the other configurations. As shown in Fig. 10, after changing the network configuration, the receptive field of CNN-S1 is reduced. According to Table IV, the depth of CNN-S1, represented by the number of convolution units, is the same as that of CNN-C, while the receptive field decreases from 33 to 19. The performance of CNN-S1 is inferior to that of CNN-C. This demonstrates that the receptive field is important for the CNN to capture the sequential information, which further verifies our hypothesis in Section III-A-2).

To enlarge the receptive field of the CNN, an alternative method is to apply a convolutional kernel with a large size. However, it is difficult to optimize such a large kernel, which may degrade the performance of the CNN. To verify this assumption, we conduct another experiment denoted as CNN-S2. CNN-S2 consists of two convolution units, whose convolutional layer parameters are 15×1 with 16 channels and stride 4, and 5×1 with 64 channels and stride 5, respectively. For the decoder component, CNN-S2 adopts the same parameters as the other networks, where two FC layers with 256 hidden units are used. In Fig. 10, it can be found that the receptive field of CNN-S2 is nearly unchanged, while the network depth is decreased from 6 to 2. As listed in Table IV, although the receptive field of CNN-S2 is similar to that of CNN-C, CNN-S2 performs worse than CNN-C. This supports our hypothesis in Section III-A-3) and demonstrates that the kernel size of the CNN should be small.
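The receptive field figures discussed here follow from the kernel sizes and strides of the stacked layers. The sketch below uses the standard recurrence for plain convolutions (no dilation or pooling); the function name is hypothetical, and only the CNN-S2 configuration is taken from the text, since the layer-by-layer settings of the other networks are not given in this section.

```python
def receptive_field(layers):
    """Receptive field along the temporal axis of a stack of 1D convolutions.

    layers: list of (kernel_size, stride) pairs, first layer first.
    Each layer widens the field by (kernel_size - 1) times the spacing
    ("jump") between neighbouring outputs of the previous layer.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# CNN-S2 as described in the text: 15x1 with stride 4, then 5x1 with stride 5.
cnn_s2 = [(15, 4), (5, 5)]
print(receptive_field(cnn_s2))  # 31

# Hypothetical two-layer stacks showing why a smaller first-layer stride
# shrinks the receptive field even though the depth is unchanged.
print(receptive_field([(3, 2), (3, 1)]))  # 7
print(receptive_field([(3, 1), (3, 1)]))  # 5
```

Under this recurrence, CNN-S2 reaches a receptive field of 31 with only two layers, which is consistent with the remark that it is nearly unchanged from the value of 33 for CNN-C.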

Apart from the receptive field and kernel size, the channel number may also affect the performance of a CNN. To verify this hypothesis, we conduct another experiment called CNN-S3. In CNN-S3, all kernels are set as 9×1 and the variation of the channel number is opposite to that in CNN-C: it decreases from 512 to 64. As shown in Fig. 10, the feature depth in CNN-S3 gradually decreases, which is opposite to that in the other networks. The receptive field in CNN-S3 is greatly enlarged and is even larger than the size of the input signals. From Table IV, it can be found that although the receptive field of CNN-S3 is much larger than that of CNN-C, CNN-S3 is still surpassed by CNN-C. This indicates that the configuration of the channel number is important for a CNN to capture sequential information, and that the channel number should increase gradually, which also verifies our conclusion in Section III-B.

2) Analysis for Position Encoding Scheme: In this part, we conduct experiments to analyze the performance of our method under different position encoding parameters.

Parameter pd: The position encoding dimension pd determines the size of the position encoding vector. We conduct three experiments with different pd values, where Ψ is a 1×3×16 convolutional layer and Φ is a 1×1×16 convolutional layer. As shown in Fig. 11, our method performs best (15.83 in RMSE and 1153.18 in Score) when pd is set as 16. When we increase the position encoding dimension, our method shows relatively stable performance in Score.

Parameter Ψ: Ψ is the transformation for input signals. As shown in Fig. 12, three experiments are conducted to show the impact of the Ψ parameter on performance, where pd is set as 16 and Φ is a 1×1×16 convolutional layer. When we increase the kernel size in Ψ, performance decreases. This indicates that the context information among the input signals may affect the encoded position information. Our method performs best with the 1×1×16 convolutional layer.

Parameter Φ: Φ is the transformation for position encoding information. In Table V, two experiments are conducted. In the first, a 1×1×16 convolutional layer is used to transform the position encoding information. In the second, for comparison, we remove this convolutional layer and directly fuse the position encoding information with the transformed input signals. In both cases, Ψ is a 1×1×16 convolutional layer. Table V demonstrates that our proposed transformation Φ can improve the RUL prediction accuracy.

Fig. 10. Comparison of CNN-C and CNN-S variants. The input, the produced feature maps and the corresponding receptive field are illustrated for these four neural networks.

Fig. 11. Results of parameter pd analysis experiments. It can be found that our method achieves the best performance when pd is 16.

Fig. 12. Results of parameter Ψ analysis experiments. Our approach shows the best performance with the 1×1×16 convolutional layer.

TABLE V ANALYSIS EXPERIMENTS FOR Φ

Fusion Method ⊕: We propose the fusion method ⊕ to combine the transformed input signals and position encoding information. In this part, we compare two fusion methods, element-wise addition fusion and concatenation fusion, in Table VI. It can be seen that the addition fusion performs better than the concatenation fusion. This indicates that element-wise addition can more effectively fuse position encoding information with input signals.
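The roles of the two transformations and the two fusion options can be sketched as below. This is an illustration under several assumptions rather than the authors' implementation: the position encoding is assumed to be a Transformer-style sinusoidal table (its exact form is not given in this section), a 1×1×16 layer is read as a kernel of size 1 along time with 16 output channels, biases are omitted, the weights are random stand-ins for learned parameters, and all function names are hypothetical.

```python
import math
import random


def sinusoidal_encoding(length, pd):
    """Assumed sinusoidal encoding: one pd-dimensional vector per time step."""
    table = []
    for t in range(length):
        row = []
        for i in range(pd):
            angle = t / (10000 ** (2 * (i // 2) / pd))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def pointwise_conv(x, weight):
    """Kernel-size-1 convolution: the same linear map at every time step.

    x: [T][C_in], weight: [C_in][C_out], returns [T][C_out].
    """
    columns = list(zip(*weight))
    return [[sum(v * w for v, w in zip(step, col)) for col in columns] for step in x]


def random_weight(rng, c_in, c_out):
    """Random stand-in for a learned weight matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(c_out)] for _ in range(c_in)]


rng = random.Random(0)
T, sensors, pd, channels = 30, 14, 16, 16
signal = [[rng.random() for _ in range(sensors)] for _ in range(T)]

psi = random_weight(rng, sensors, channels)  # stand-in for the input transform
phi = random_weight(rng, pd, channels)       # stand-in for the encoding transform

x = pointwise_conv(signal, psi)
p = pointwise_conv(sinusoidal_encoding(T, pd), phi)

# Element-wise addition keeps 16 channels; concatenation doubles them to 32.
added = [[a + b for a, b in zip(xs, ps)] for xs, ps in zip(x, p)]
concatenated = [xs + ps for xs, ps in zip(x, p)]
```

Addition requires both branches to produce the same number of channels, which is one reason to project the position encoding to the width of the transformed input before fusing. Concatenation has no such constraint but widens the input of the following layer.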

TABLE VI ANALYSIS EXPERIMENTS FOR FUSION METHOD

D. Position Encoding Influence on RNN

Our proposed position encoding scheme has shown effectiveness on the RUL prediction in our PE-Net. In this subsection, we investigate whether the position encoding scheme can benefit the RNN based method.

We first implement an RNN based method according to [17], where a Bi-LSTM is used to learn the long-term dependencies. After that, we apply our proposed position encoding scheme to this RNN based method, which is denoted as RNN+PE. The corresponding experimental results are listed in Table VII. As shown in Table VII, RNN+PE performs better than RNN, indicating that our proposed position encoding scheme is also effective for RNN based methods. Although this Bi-LSTM based method is improved with our position encoding scheme, it is still inferior to our PE-Net. This may be because our novel CNN in the PE-Net can provide high-quality feature representations for the RUL prediction. Overall, our proposed position encoding scheme is effective for both RNN and CNN based methods.

E. Prediction Result Analysis on Test Data

In the C-MAPSS dataset [39], the test trajectories contain various scenarios. To analyze the performance of our proposed PE-Net on different scenarios, we show its predictions on the FD004 subset in this subsection.

We first illustrate four examples of life-time RUL predictions on the test engines. As shown in Fig. 13, these four test engines have different RUL degradation characteristics. In Figs. 13(a) and 13(c), the true RUL is near the predefined maximum RUL, and our predictions are very close to the true RUL. In Figs. 13(b) and 13(d), the true RUL varies over a wide range and the scenarios become complex. Our PE-Net can still accurately predict the true RUL.

We also visualize the prediction results of all test engines on the FD004 subset in Fig. 14. It can be found that our PE-Net is able to predict the RUL accurately, which demonstrates that our proposed PE-Net is effective.

TABLE VII ANALYSIS EXPERIMENTS ON THE POSITION ENCODING SCHEME FOR RNN

Fig. 13. Illustration for the RUL predictions on four test engines. It can be found that our PE-Net is able to predict the RUL accurately in both simple and complex scenarios.

Fig. 14. Our PE-Net predictions on FD004. This shows that our PE-Net can predict the RUL accurately.

V. CONCLUSION AND FUTURE WORK

In this paper, we have investigated the relationship between a CNN’s architecture and its capability of capturing sequential information. Based on our analysis, we have proposed a series of CNNs with an improved capability of capturing sequential information. Extensive experiments have been carried out. According to our experimental results, CNN based approaches are able to perform comparably to RNN based methods and hybrid methods. Additionally, we have proposed a position encoding scheme to further improve this capability. Experimental results have shown that our proposed position encoding scheme effectively improves the CNN’s performance. With our proposed position encoding scheme, our PE-Net surpasses nearly all methods and sets new state-of-the-art performance on the C-MAPSS dataset.

The inputs for the RUL prediction are generally taken from multiple sensors and can be regarded as a high-dimensional and sparse (HiDS) matrix. Our current method, PE-Net, processes the input signals directly, which is time-consuming and redundant. In the future, to deal with this data more efficiently, we will investigate how to apply technologies related to HiDS matrix processing [42]–[44].