1. Introduction
High-speed trains have emerged as one of the most crucial components within intelligent transportation systems. Traction control systems (TCSs), serving as the core power systems for high-speed trains, are intricately linked to trains’ reliability and safety. However, they also represent a major source of faults in both long-term operation and harsh operating environments. Consequently, fault detection and diagnosis (FDD) has become an active area of research over the past few decades [
1,
2,
3].
Currently, FDD methods for high-speed trains can be broadly categorized into three groups: model-based approaches, signal-based approaches, and data-driven approaches. Although model-based methods are accessible and efficient in producing FDD results, establishing them is challenging due to practical uncertainties and complex designs. Signal-based methods exhibit limited effectiveness in detecting minor symptoms, particularly in dynamic scenarios [
4].
Meanwhile, owing to the widespread deployment of sensors in complex systems, data-driven methods have been widely advocated for FDD tasks because they can effectively process massive volumes of data [
1,
5,
6,
7]. In [
8], the authors proposed a discriminative stacked autoencoder (D-SAE) network based on feature integration boosting for bearing fault diagnosis. This method mitigated the performance degradation and enhanced the generalization ability in various scenarios. Ref. [
9] proposed an innovative fault detection (FD) method for bogies. In that study, a Monte Carlo-based perturbation technique was employed to amplify the distinction between unexpected faults and known ones, so that the FD outcome for unexpected faults could be obtained using dropout-based Bayesian deep learning. The authors in [
10] proposed a fault diagnosis method for braking friction based on a one-dimensional convolutional neural network (1DCNN) and the GraphSAGE network. This approach effectively addresses the challenge of imbalanced fault samples by considering the correlation between different fault features. In addition, ref. [
11] presented an incipient FDD method for running gear systems that leveraged Hellinger distance and slow feature analysis.
The aforementioned FDD methods primarily address permanent faults (PFs) in mechanical components or systems. However, transient faults (TFs), as a type of incipient fault, have the potential to develop into PFs and are responsible for most failures observed in electronic devices such as power electronics, sensors, and traction control units (TCUs) within TCSs.
In the context of complex industrial systems, multiple FDD methods have been developed specifically for transient faults (TFs) [
12,
13,
14,
15,
16,
17,
18]. Ref. [
12] assesses and demonstrates the ability of a bulk built-in current sensor’s (BBICS) architecture to detect multiple and simultaneous TFs for integrated circuits. Ref. [
14] studies fault tolerance in switching reconfigurable nano-crossbar arrays, considering both TFs and PFs. In [
15], an innovative ontology-based fault propagation analysis approach (ontologyFPA) is proposed to analyze transient fault propagation effects in networked control systems (NCSs). Ref. [
17] presents a TF detection and classification approach in power transmission lines based on graph convolutional neural networks. In [
18], an optimal fractional-order method is proposed for TF diagnosis, which suppresses background noise and amplifies the faulty part of the signal. Afterward, kurtosis and the fault duration time are applied to locate the fault component.
However, the methods mentioned above operate under static or single fixed operation conditions and do not address dynamic cases [
4]. Different operation conditions may lead to significant distribution differences, which means that an intelligent FDD model trained on data under a certain operation condition is usually not applicable to other operation conditions [
19,
20]. Traditional deep learning approaches require abundant samples from diverse operational conditions for effective model training. However, a TCS typically operates under steady-state conditions, resulting in imbalanced sample distributions across operation conditions [
21].
The primary challenges in the field of TF detection encompass the following:
TFs exhibit sporadic and stochastic behavior, leading to impermanent damage that disappears unpredictably.
The distribution of samples across different operational conditions is imbalanced, particularly for faulty samples which are significantly underrepresented.
The features of TFs are inherently weak and can easily be overshadowed by background noise, especially in dynamic scenarios.
These characteristics make TFs challenging to detect.
In this context, transfer learning (TL) has been extensively discussed for extracting latent feature information and achieving precise fault detection under dynamic operation conditions. TL aims to enhance the performance of target domains by leveraging the knowledge embedded in diverse but related source domains, thereby reducing the reliance on a substantial amount of target domain data for constructing target learners [
21,
22].
Several FDD methods with TL have been developed for electrical systems. Ref. [
23] proposes an FD method for traction converter faults in traction drive systems. This method consists of a federal neural network based on a variational autoencoder (VAE), which can perform the FD task under performance degradation. The authors of ref. [
24] developed a hierarchical method for transformer rectifier unit (TRU) fault diagnosis and a transfer learning-based fault diagnosis method without training new models for different TRUs. In [
25], a novel transferable open-circuit fault diagnosis method is proposed for insulated gate bipolar transistors (IGBTs) in three-phase inverters, which can be applied to different systems with the same topology but different parameters. The authors of ref. [
26] developed an adversarial-based deep TL model that can detect and classify short-circuit faults in DC microgrids without using historical fault data. Ref. [
27] proposes a transfer learning-based fault location method for voltage source converter-based high-voltage direct current (VSC-HVDC) transmission lines. This method can locate faults with small training datasets. However, transient FD for TCSs under dynamic operation conditions remains an open problem.
Motivated by the discussions above, we propose a TL strategy to detect the transient faults of TCS under various operation conditions. In the proposed method, a Cycle-Flow adversarial network (CFAN) is first constructed for latent feature extraction and data reconstruction in steady operation conditions. Secondly, a TL framework with federated CFANs jointly adjusts for the changed information caused by varied operation conditions. These two steps learn and preserve knowledge under normal cases. Finally, the designed federated CFANs produce residuals on faulty data for transient FD under dynamic operation conditions.
The contributions of the proposed method are summarized as follows:
A CFAN is proposed for latent variable extraction and data reconstruction, which consists of an invertible flow model and two discriminative networks; the loss function is designed as well. Specifically, bidirectional optimization can enhance the quality of reconstruction while mitigating interference caused by background noise through adversarial training and flexible inference.
The proposed federated CFAN-based TL is divided into two stages. Initially, the first CFAN model is trained using normal data in steady operation conditions. Subsequently, the second CFAN calibrates the changed information caused by varied operation conditions utilizing limited data. As a result, the federated CFANs can jointly learn latent knowledge in a steady state and be applied to transient fault detection in various operation conditions.
Simulation experiments are conducted on various transient faults using the normal steady state of TCS as the source domain and the dynamic operation condition as the target domain. The simulation results show that the federated CFAN-based TL method can improve the performance of transient fault detection.
The remainder of this paper is organized as follows:
Section 2 states the transient fault detection problems and flow basics.
Section 3 details the proposed transfer learning fault detection strategy based on federated CFANs. In
Section 4, the experiment results and data sources are briefly described. Finally, the conclusions and prospects are given in
Section 5.
3. The Proposed Federated CFAN-Based Transfer Learning Strategy
Motivated by the research on image processing based on NF [
35], and by the reversibility and flexibility of NF in modeling various distributions, a federated CFAN-based TL strategy is proposed to detect transient faults in TCSs.
In this work, the source domain is represented by
, where
represents data, which denotes the measurements under the steady operation of TCSs. Similarly, target domain data are represented by
, which represents the measurements in dynamic operation. The data distributions of the source and target domains generally differ, but new knowledge can be acquired through reasonable adjustments of previous knowledge. This transfer learning approach can achieve better FD performance than using only the target domain data [
36].
3.1. Principle of CFAN
The framework of the proposed CFAN model is shown in
Figure 3. In this work, the forward process of CFAN can be defined as
, and the reverse process is expressed as
. Consider source domain sample
. The target of the defined model
is to learn the potential features of the source domain in which
represents the trainable parameters. Model
contains two mapping functions, the forward process
, and the reverse process
. In addition, two adversarial discriminative networks,
and
, are introduced, where
aims to differentiate between
and the generated data
. Similarly,
aims to distinguish between
and
.
encourages
to transform
into an output (itself) that is indistinguishable from
and vice versa for
and
.
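Given a $D$-dimensional input $x$, $x$ is split into $x_{1:d}$ and $x_{d+1:D}$, which are given as follows:
$$x_{1:d} = (x_1, \dots, x_d), \qquad x_{d+1:D} = (x_{d+1}, \dots, x_D)$$
The affine coupling layer is presented in
Figure 4; the output $y$ of an affine coupling layer follows Equations (16) and (17):
$$y_{1:d} = x_{1:d} \tag{16}$$
$$y_{d+1:D} = x_{d+1:D} \odot \exp\left(s(x_{1:d})\right) + t(x_{1:d}) \tag{17}$$
Finally, $y_{1:d}$ and $y_{d+1:D}$ are merged into one group $y$.
With $y$ as the reverse input and $x$ as the output, the reverse process can be expressed as follows:
$$x_{1:d} = y_{1:d}, \qquad x_{d+1:D} = \left(y_{d+1:D} - t(y_{1:d})\right) \odot \exp\left(-s(y_{1:d})\right)$$
where $s(\cdot)$ represents the scaling function, $t(\cdot)$ represents the translation function, and $\odot$ is the Hadamard or element-wise product.
Considering the forward process, the Jacobian matrix of the transformation from $x$ to $y$ can be expressed as follows:
$$\frac{\partial y}{\partial x^{\top}} = \begin{bmatrix} I_d & 0 \\ \dfrac{\partial y_{d+1:D}}{\partial x_{1:d}^{\top}} & \operatorname{diag}\left(\exp\left(s(x_{1:d})\right)\right) \end{bmatrix}$$
The upper left area of the Jacobian matrix is an identity matrix $I_d$. Since $y_{1:d}$ is irrelevant to $x_{d+1:D}$, the upper right area of the Jacobian matrix is a zero matrix, 0. The lower right area of the Jacobian matrix is a diagonal matrix with the diagonal elements $\exp\left(s(x_{1:d})\right)$. Therefore, the Jacobian matrix is lower triangular, and the calculation of its lower left area can be ignored when computing the determinant. Because the Jacobian of $s$ or $t$ is not necessary for computing the Jacobian determinant of the coupling layer, $s$ or $t$ can be arbitrarily complex for various network designs.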
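The coupling-layer properties described above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the dimensions `D` and `d`, the random weights `W_s` and `W_t`, and the one-layer stand-ins `s` and `t` are all illustrative assumptions. It shows that the layer is exactly invertible, that the first `d` components pass through unchanged, and that the log-determinant of the Jacobian equals the sum of the scaling outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 6, 3  # input dimension and split point (illustrative values)

# Hypothetical stand-ins for the scaling and translation networks.
W_s = rng.normal(scale=0.5, size=(d, D - d))
W_t = rng.normal(scale=0.5, size=(d, D - d))

def s(x1):
    """Scaling function (assumed form: linear map followed by tanh)."""
    return np.tanh(x1 @ W_s)

def t(x1):
    """Translation function (assumed form: linear map)."""
    return x1 @ W_t

def coupling_forward(x):
    """Map a batch x of shape (n, D) to y and return log|det J| per sample."""
    x1, x2 = x[:, :d], x[:, d:]
    log_scale = s(x1)
    y2 = x2 * np.exp(log_scale) + t(x1)
    return np.concatenate([x1, y2], axis=1), log_scale.sum(axis=1)

def coupling_inverse(y):
    """Recover x from y without inverting s or t themselves."""
    y1, y2 = y[:, :d], y[:, d:]
    x2 = (y2 - t(y1)) * np.exp(-s(y1))
    return np.concatenate([y1, x2], axis=1)

def numerical_log_det(x_row, eps=1e-6):
    """Central-difference Jacobian of one sample, then log|det|."""
    J = np.zeros((D, D))
    for j in range(D):
        step = np.zeros(D)
        step[j] = eps
        y_plus, _ = coupling_forward((x_row + step)[None, :])
        y_minus, _ = coupling_forward((x_row - step)[None, :])
        J[:, j] = (y_plus[0] - y_minus[0]) / (2 * eps)
    _, logabsdet = np.linalg.slogdet(J)
    return logabsdet

x = rng.normal(size=(4, D))
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)
```

Because the determinant only needs the output of `s`, the stand-in functions could be replaced by deeper networks without changing `coupling_forward` or `coupling_inverse`.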
Although the coupling layer may be powerful, the distribution is often very complex in practice. Moreover, it is challenging to transform a complex distribution into another; one transformation is often insufficient. In addition, the forward transformation leaves some components unchanged, with the first
d dimensions being identical to the initial data.
Figure 5 illustrates the composition of the coupling layer in an alternating pattern. This structure allows different parts of the data to be passed through different transformation paths. It ensures that the final generated data do not contain components originating from the initial data [
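30]. Combining $K$ coupling layers $f_1, \dots, f_K$ that map an input $x$ to an output $y$ is carried out as follows:
$$y = f_K \circ f_{K-1} \circ \cdots \circ f_1(x)$$
Then, its reverse process can be expressed as follows:
$$x = f_1^{-1} \circ f_2^{-1} \circ \cdots \circ f_K^{-1}(y)$$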
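The alternating composition can be illustrated with a short NumPy sketch. It is a sketch under stated assumptions rather than the paper's network: the dimensions, the random weights, and the helper names `make_layer`, `layer_forward`, and `layer_inverse` are hypothetical, and each layer uses a single linear map where the paper uses fully connected networks. Odd-numbered layers reverse the column order so that the half left untouched by the previous layer is the one transformed next, and the inverse applies the layer inverses in reverse order.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 6, 3  # illustrative dimensions

def make_layer(swap):
    """One affine coupling layer; `swap` selects which half passes through."""
    return {
        "W_s": rng.normal(scale=0.5, size=(d, D - d)),
        "W_t": rng.normal(scale=0.5, size=(d, D - d)),
        "swap": swap,
    }

def layer_forward(layer, x):
    if layer["swap"]:
        x = x[:, ::-1]  # reversed columns: the other half now passes through
    x1, x2 = x[:, :d], x[:, d:]
    y2 = x2 * np.exp(np.tanh(x1 @ layer["W_s"])) + x1 @ layer["W_t"]
    y = np.concatenate([x1, y2], axis=1)
    return y[:, ::-1] if layer["swap"] else y

def layer_inverse(layer, y):
    if layer["swap"]:
        y = y[:, ::-1]
    y1, y2 = y[:, :d], y[:, d:]
    x2 = (y2 - y1 @ layer["W_t"]) * np.exp(-np.tanh(y1 @ layer["W_s"]))
    x = np.concatenate([y1, x2], axis=1)
    return x[:, ::-1] if layer["swap"] else x

# Four layers with alternating pass-through halves.
layers = [make_layer(swap=(k % 2 == 1)) for k in range(4)]

def flow_forward(x):
    for layer in layers:
        x = layer_forward(layer, x)
    return x

def flow_inverse(y):
    for layer in reversed(layers):  # inverses are applied in reverse order
        y = layer_inverse(layer, y)
    return y

x = rng.normal(size=(5, D))
y_single = layer_forward(layers[0], x)
y = flow_forward(x)
x_rec = flow_inverse(y)
```

After a single layer the first `d` columns of `y_single` still equal those of `x`, whereas after the full stack no column of `y` is a copy of the input.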
As mentioned above, to minimize the error between the input and reconstructed output, the expectation of
can be estimated by Monte Carlo as follows:
Similarly, for the reverse process,
For the forward process, the loss function
of the model
in this work can be expressed as follows:
where
is a discriminative network and
is a hyperparameter, and then the loss function of
can be expressed as follows:
Similarly, the loss function
of the reverse process can be expressed as follows:
where
is a discriminative network, and
is a hyperparameter, and then the loss function of
can be expressed as follows:
The total loss
of the proposed CFAN is presented as follows:
The overall optimization objective of the model can be written as follows:
In summary, the CFAN learns knowledge in the source domain through adversarial training, which yields the trained model parameters.
The trained
model can perform FD tasks under steady operation conditions (the training process is detailed in Algorithm 1). However, the distribution discrepancies arising from diverse operational conditions degrade its overall performance. To mitigate this issue, the model must be fine-tuned through TL to attain optimal FD performance.
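Algorithm 1: Offline training of the first CFAN
Loop
  for number of training iterations do
    Forward process: compute the forward loss and update the corresponding parameters by the Adam optimizer [36]
    Reverse process: compute the reverse loss and update the corresponding parameters by the Adam optimizer
    Total loss: compute the total loss and update all parameters by the Adam optimizer
  end for
end loop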
3.2. Fault Detection with Transfer Learning Based on Federated CFANs
This work aims to establish an FD model under dynamic operation with TL. The first CFAN, which was trained in the previous step, reflects the information on steady-state operation in the system. The second CFAN learns the performance changes caused by domain changes. This design concept involves neural model-aided learning to identify changing and unchanging crucial parameters. The framework of the proposed TL strategy is illustrated in
Figure 6.
The data of target domain
are input into the
after training. Due to the different data distribution between the
and
, their performance will also change. Consider target domain sample
from
, where the residual signal
can be expressed as follows:
From the above formula,
is the path between the source and the target domain, which retains the information when the operation conditions change. The
has the ability to calibrate the knowledge changes caused by the varied operation conditions. The construction of
is similar to
, and
represents its trainable parameters. In addition, it also includes the discriminative networks
and
, in which
and
are their trainable parameters. The loss function
of
can be expressed as follows:
The loss function
of
can be expressed as follows:
The loss function of the reverse process
can be expressed as follows:
The loss function
of
can be expressed as follows:
In summary, the total loss
of
is formulated as follows:
The overall optimization objective of the proposed TL model is provided as follows:
The
learns the performance variation of the
due to varied operation conditions; the training process of
is detailed in Algorithm 2. The change information
is obtained using the following formula:
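Algorithm 2: Offline training of the second CFAN
Loop
  for number of training iterations do
    Forward process: compute the forward loss and update the corresponding parameters by the Adam optimizer
    Reverse process: compute the reverse loss and update the corresponding parameters by the Adam optimizer
    Total loss: compute the total loss and update all parameters by the Adam optimizer
  end for
end loop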
Based on the above analysis, the residual signal
used for the final FD decision is defined as follows:
According to the final decision signal
,
represents the dimension of
. The framework of the proposed federated CFANs is depicted in
Figure 7.
This work utilizes the root mean square (RMS) norm to maintain satisfactory false alarm rates (FARs) in high-dimensional situations. The RMS measures the average energy of a signal
and is defined by the following formula:
The threshold is set to be
Then, the fault detection logic becomes
The flowchart of the proposed method is illustrated in
Figure 8, comprising an offline training phase and an online fault detection (FD) phase. The first CFAN-based model
is trained by using the normal data
obtained during steady operation conditions to extract latent variables and reconstruct data. Subsequently, the model
undergoes federated training based on dynamic operation condition data
. The trained federated neural networks
and
enable feature extraction and the reconstruction of the healthy data. Thus, the residual
is calculated using the federated CFANs. Finally, with the FD threshold
being determined by the RMS statistics of the residual
, the
of the testing data is compared with
to realize the FD of the TCS.
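The offline and online phases summarized above can be sketched end to end on synthetic data. This is only an illustration of the two-stage residual idea: each CFAN is replaced by a plain PCA reconstructor, the seven-channel data generator (`A`, `b`, `steady`, `dynamic`) is invented for the example, and the threshold rule (maximum RMS over fault-free target-domain training data) is an assumption, since the threshold formula is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 7-channel data; all values are illustrative assumptions.
A = np.array([[1, 1, 1, 0, 0, 0, 0],
              [0, 0, 0, 1, 1, 0, 0]], dtype=float)  # steady-state structure
b = np.array([0, 0, 0, 0, 0, 1, 1], dtype=float)    # excited only in dynamic operation

def steady(n):
    return rng.normal(size=(n, 2)) @ A + 0.05 * rng.normal(size=(n, 7))

def dynamic(n):
    g = rng.normal(size=(n, 1))  # operating-condition variation
    return 1.5 * rng.normal(size=(n, 2)) @ A + g * b + 0.05 * rng.normal(size=(n, 7))

def fit_pca(X, k):
    """PCA reconstructor standing in for a CFAN (swapped-in technique)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def reconstruct(model, X):
    mean, V = model
    return mean + (X - mean) @ V.T @ V

def rms(R):
    """Root mean square of each residual vector (one value per sample)."""
    return np.sqrt(np.mean(R ** 2, axis=1))

# Stage 1 (offline): learn steady-state knowledge from source-domain normal data.
model1 = fit_pca(steady(2000), k=2)

# Stage 2 (offline): learn what stage 1 misses under dynamic operation.
X_t = dynamic(2000)
model2 = fit_pca(X_t - reconstruct(model1, X_t), k=1)

def final_residual(X):
    r1 = X - reconstruct(model1, X)   # first-stage residual
    change = reconstruct(model2, r1)  # estimated change information
    return r1 - change

# Assumed threshold rule: maximum RMS over fault-free target-domain training data.
J_th = rms(final_residual(X_t)).max()

# Online phase: fault-free test data and data with an injected offset on channel 0.
X_normal = dynamic(500)
X_fault = dynamic(500)
X_fault[:, 0] += 2.0

far = np.mean(rms(final_residual(X_normal)) > J_th)
fdr = np.mean(rms(final_residual(X_fault)) > J_th)
```

The second stage removes the variation that appears only in dynamic operation, so the threshold is set by the remaining noise instead of by the operating-condition change.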
4. Experiment Results and Analysis
In this section, the data source and experimental platform are briefly described. To verify the effectiveness of the proposed method, FD tasks with different methods were performed on the TCS under dynamic operation conditions. The experimental results are then discussed.
4.1. Data Description
In this case, a TCS is adopted to demonstrate the effectiveness of the proposed FD method. A simulation platform of traction drive control systems named “TDCS-FIB” is presented in [
37,
38]. TDCS-FIB develops fault injection benchmarks based on simulation models. TDCS-FIB provides a variety of fault injection types for the main components in TCS, which provides reliable data support for fault detection and diagnosis.
To verify the proposed method, a TCS with different TFs is adopted. As depicted in
Figure 9, the onboard TCS serves as the experimental system, with its specifications presented in
Table 1. The sensor data were collected under traction operation conditions.
In practice, transient faults will lead to abnormal data from multiple sensors. Multi-sensor FD can reduce interference and improve detection efficiency [
39,
40]. Therefore, multi-sensor data are used to detect transient faults, which include the three-phase current output
of an inverter, the voltage output
of the upper and lower support capacitors in the DC link, and the transformer secondary voltage and current
. The FD model of TCS is trained based on the sensor signals as follows:
where
. The collected data can be expressed as
for the transient faults under dynamic operation conditions.
Since the waveforms of the seven groups of sensors tend to be stable after the -th step, samples in the normal steady state of the TCS are obtained as the source domain training dataset , and samples in the dynamic condition and in steady are used as the target domain training dataset .
The test dataset in the dynamic state contains four transient faults and fault-free scenarios. Each fault scenario contains
samples, and the fault-free scenario contains
samples. The evaluation of the experimental results is completed using the false alarm rate, fault detection rate (FDR), recall, and accuracy rate (ACR), which are defined as follows:
Fault samples are defined as positive samples and normal samples as negative samples. The number of fault samples correctly detected as faulty is the true positive count (TP), and the number of fault samples incorrectly classified as normal is the false negative count (FN). The number of normal samples correctly classified as normal is the true negative count (TN), and the number of normal samples incorrectly flagged as faulty is the false positive count (FP). Each quantity is counted separately for each type of fault.
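As a worked example of these counts, the sketch below computes the rates from binary labels. The formulas are the conventional ones (FDR = TP / (TP + FN), FAR = FP / (FP + TN), ACR = (TP + TN) / total) and are assumed here because the paper's own formulas are not reproduced above; under this convention recall coincides with FDR. The function name `detection_metrics` and the label vectors are hypothetical.

```python
def detection_metrics(y_true, y_pred):
    """Compute FDR, FAR and ACR from 0 (normal) / 1 (fault) labels.

    Assumes both classes are present in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # faults detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # faults missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal kept normal
    return {
        "FDR": tp / (tp + fn),
        "FAR": fp / (fp + tn),
        "ACR": (tp + tn) / len(pairs),
    }

# Four fault samples followed by six normal samples (made-up predictions).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = detection_metrics(y_true, y_pred)
# TP = 3, FN = 1, FP = 1, TN = 5, so FDR = 0.75, FAR = 1/6, ACR = 0.8
```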
The proposed model was built with PyTorch 1.13.1. The
and
models have the same structure and contain four affine coupling layers.
includes two fully connected layers with
neurons.
includes two fully connected layers with
neurons. The two discriminators
use the same fully connected structure with
. According to the loss function
defined in (29) and
defined in (35), the best weights and biases can be obtained via the Adam optimizer. The details of the CFANs and methods for comparison are given in
Table 2 and
Table 3.
4.2. Analysis and Discussion
The proposed method was compared with other methods on four types of FD tasks and on fault-free detection tasks.
Figure 10,
Figure 11,
Figure 12 and
Figure 13 illustrate the FD results obtained using both the proposed method and VAE (including transfer and non-transfer learning). The traditional VAE refers to the VAE method without TL, while the federated VAE, which incorporates a TL strategy similar to that of the proposed method, is adapted for dynamic operating conditions. As shown in
Figure 10a,
Figure 11a,
Figure 12a and
Figure 13a, the blue curve represents the a-phase current waveform
, and the orange dotted line represents the fault injection time. For (b), (c), and (d) in
Figure 10,
Figure 11,
Figure 12 and
Figure 13, the blue curve represents the detection results using three methods, and the red dotted line in the figures represents the FD threshold
.
The first transient fault is attributed to damage to the shielding layer of communication cables caused by manufacturing processes, overstress, and other contributing factors. The transmission of external pulses in combinational logic circuits induces variations in both the pulse width and amplitude, which leads to a TF in the TCS.
The second transient fault arises when the sensor chip pins and wiring are loose or improperly connected. The sensor signal is instantaneously disturbed by vibration, thereby inducing the transient fault.
The third fault is a transient shock fault, which may arise from improper sensor installation and the degradation of insulating materials triggered by power and ground wire surges.
The fourth transient fault can be attributed to IGBT damage resulting from internal structural defects, manufacturing processes, and other contributing factors. Furthermore, excessive stress induced by high temperatures may lead to gate driver circuit failure, such as a TF caused by erroneous pulse control signals originating from the control circuit.
The comparison results of the three methods are illustrated in
Figure 14, and
Table 4 shows the ACR and average fault detection delay. The proposed method consistently achieves better performance across the different FD tasks. Specifically,
Figure 14a shows the FDR, and
Figure 14b shows the FAR of different methods under four types of faults. The FDR of the other two FD methods is lower than that of the method described in this article. In
Figure 14b and
Table 3, the other two methods show a higher FAR and lower recall and ACR than the method proposed in this work. The traditional VAE does not include the TL process and cannot adaptively adjust the changing knowledge based on the target domain data, which causes poor FD performance.
The data distributions vary across different operation conditions of TCSs, leading to a degradation in FD performance. However, common knowledge exists among the various operation conditions, which makes it worthwhile to acquire knowledge from the steady-state operation of a TCS. As depicted in
Figure 10,
Figure 11,
Figure 12 and
Figure 13, the federated VAE outperforms the traditional VAE because the TL strategy leverages prior knowledge and mitigates the impact of operational variations. The proposed TL strategy based on federated CFANs effectively transfers and adapts knowledge between steady-state and dynamic operation conditions while ensuring the accurate extraction of latent variables and data reconstruction. By leveraging the adversarial training and reversibility properties of CFANs, a precise description of the data distribution is achieved through bidirectional optimization, resulting in significant performance improvements as demonstrated in
Figure 14 and
Table 4. Especially for weak TFs (case studies
,
, and
), this proposed method exhibits superior fault detection capabilities under dynamic operating conditions.
In addition, FD experiments are also conducted under steady operating conditions. The test dataset in the steady state also contains four transient faults and fault-free scenarios which are similar to
,
,
, and
in dynamic operation conditions. The performance comparison of different methods is shown in
Table 5. Each fault scenario contains
samples, and the fault-free scenario contains
samples. The comparison results of the three methods are illustrated in
Figure 15 and
Table 6 and
Table 7.
The comparison results in steady operation conditions are illustrated in
Figure 15, and
Table 7 shows the FAR, recall, ACR, and average FD delay. It can be concluded that the proposed second CFAN achieves better performance for different FD tasks under steady operation conditions because the knowledge of steady states has been learned from a small amount of healthy-condition data.
The training loss curve is defined by the Mean Squared Error (MSE) for evaluating the reconstruction accuracy. As illustrated in
Figure 16a, during the training of the proposed method, the training loss stabilizes at a lower level than the other two methods, indicating the superior data reconstruction capabilities of the proposed method.
Figure 16b displays the loss of
and federated VAE network2. The losses of two methods converge to a similar value, which illustrates that both networks have the ability to achieve performance adjustments.
The receiver operating characteristic (ROC) curves of the three methods are shown in
Figure 17. The area under the curve (AUC) of the proposed method is 0.953, while the AUC values of the traditional VAE and the federated VAE are 0.826 and 0.907, respectively. The proposed method has the largest area under the curve, indicating superior performance in terms of FD.
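For readers who want to reproduce this kind of comparison, the AUC can be computed directly from detection scores (for example, the RMS of the residual) and fault labels. The sketch below uses the pairwise-ranking form of the AUC: the fraction of (fault, normal) pairs in which the fault sample receives the higher score, with ties counted as one half. The function name `roc_auc` and the toy scores are hypothetical and are not the paper's data.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC as the probability that a fault sample outscores a normal one."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

# Toy example: two normal samples (label 0) and two fault samples (label 1).
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
# Three of the four (fault, normal) pairs are ranked correctly, so auc = 0.75.
```

An AUC of 1.0 means that some threshold separates all fault samples from all normal ones, while 0.5 corresponds to random ranking.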
Generally, to ensure the safety of the system, the TCS typically works in normal states. As a result, faults occur far less often than healthy instances [
21]. The proposed unsupervised method learns normal patterns only from fault-free data, which is a feasible solution to the problem of imbalanced data. Unsupervised learning therefore improves robustness without the cost of labeling. The FD method is not limited to the TCS of trains and is expected to be applicable to transient FD in other electrical systems.
5. Conclusions
In this work, we present a transient fault detection method under dynamic operation conditions. For latent variable extraction and data reconstruction, a CFAN is established from an invertible flow model and two discriminative networks, and the corresponding loss function is designed. Moreover, adversarial training and bidirectional optimization enhance the reconstruction quality and suppress interference caused by background noise.
Then, an unsupervised transfer learning strategy based on federated CFANs is proposed for transient fault detection under various operation conditions, which is divided into two stages. Initially, the first CFAN model is trained using the normal data in steady operation conditions. Subsequently, the second CFAN calibrates the changed information caused by varied operation conditions utilizing only a few samples. The federated CFANs can jointly learn latent knowledge in steady states and be applied to transient fault detection in various operation conditions.
The effectiveness of the method is verified through comparative experiments against other data-driven fault detection methods.
Several directions are available for future work. The first is to develop fault diagnosis technology and further locate faulty components. In addition, the FD method employed in this work is based on the CRH2 type, and data related to high-speed trains with different topological structures have not been explored. Such out-of-distribution (OOD) data, as mentioned in [
41], may negatively impact FD performance. Future work will therefore develop fault diagnosis methods for multiple types of high-speed trains.