Mar 26, 2025
Integrating vibration signal analysis and image embedding for enhanced bearing fault diagnosis in manufacturing
Scientific Reports volume 15, Article number: 10398 (2025)
Bearing fault diagnosis in mechanical systems is an imperative task across various industries, including manufacturing, energy, and transportation. Although recent advances in deep learning have enabled automated approaches for fault diagnosis, these approaches often fail to incorporate vital domain-specific knowledge and operating conditions into the models. To address this limitation, we propose ZERONE, a novel image embedding method that simplifies the representation of time-domain features, frequency-domain features, and operating conditions of vibration signals by integrating them into a single image. In this representation, these features are expressed as colored numbers of either zero or one, while categorical variables are represented as text in their original form. Subsequently, these images are processed by a convolutional neural network model for fault diagnosis. Experimental results demonstrate the superior performance of our approach, achieving 98.24% accuracy on the Paderborn University bearing dataset and 99.64% accuracy on the Jiangnan University dataset, surpassing other methods. Moreover, it achieved an average performance improvement of 7.24% compared to existing image embedding methods. Furthermore, the proposed method employs gradient-weighted class activation mapping to identify key frequency and statistical variables, offering interpretability and setting a new standard for diagnosing failures in mechanical systems.
In modern manufacturing industries, the proper functionality of bearings and other rotating parts is crucial for the efficient operation of various machines. Defects in these components can significantly diminish efficiency, leading to operational breakdowns. Vibration analysis is a crucial technique for identifying defects in rotating machinery. This method is instrumental in the early detection of issues, thereby averting major breakdowns, prolonging equipment lifespan, and reducing unexpected downtime1,2.
A traditional approach for identifying malfunctions in rotating machinery using vibration analysis involves data-driven fault diagnosis techniques. These methods typically encompass three steps2: (1) signal feature extraction, (2) feature selection, and (3) fault classification. The process starts with the extraction of signal features from vibration data, a crucial step for isolating and analyzing key characteristics for accurate fault diagnosis. Although these extracted features are effective, they often contain redundant information. Consequently, the next step involves feature selection to reduce unnecessary information. This reduction not only enhances classification efficiency but also ensures the retention of only the most significant and representative features3. In the final step, these selected features are used in various fault-classification methods, which utilize both traditional statistical models and modern machine learning algorithms.
In recent years, fault diagnosis of rotating machinery has evolved remarkably, driven mainly by the integration of deep learning technologies. This advancement enables an end-to-end deep learning approach that does not require handcrafted features4,5. In particular, convolutional neural networks (CNNs), initially developed for image analysis6, have been adeptly employed to analyze vibration signal patterns.
In some scenarios, the vibration data can be presented in two-dimensional (2D) format to use 2D CNNs. Studies such as7 have demonstrated the efficacy of using the raw vibration signals, envelope spectrums, and spectrograms as distinct inputs to 2D CNNs. In8, by integrating three channels, namely, raw signal, mean, and median, higher performance was achieved compared to using only raw signals with a single channel.
In contrast to traditional 2D CNNs, one-dimensional (1D) CNNs use 1D filters to process input data, making them highly suitable for sequential data. This is particularly beneficial in fault diagnosis, where temporal signal patterns often signify underlying issues. The approach has been validated in applications such as patient-specific electrocardiogram (ECG) classification9. In10, a fast and accurate motor state monitoring and early failure detection system using 1D CNNs was proposed. In11, a deep CNN with wide first-layer kernels (WDCNN) was introduced to diagnose bearing defects.
Despite advancements in deep learning technologies, domain expertise continues to be crucial in machine fault diagnosis. Experienced professionals possess extensive practical knowledge, including effective methods for understanding the underlying mechanisms of faults, key features that signify problems, and algorithms for isolating these features. This blend of empirical knowledge underpins the reliability and effectiveness of diagnostic processes12.
Acknowledging the significance of domain expertise, recent studies have explored the synergy between handcrafted features and deep learning in bearing fault diagnosis. Approaches such as13 have leveraged both empirical and adaptively extracted deep learning features as inputs to machine learning algorithms like XGBoost. Similarly, a self-supervised learning method in12 combined handcrafted and general features with a deep convolutional autoencoder, highlighting its utility in mechanical failure diagnosis using small samples.
Entropy-based feature extraction techniques have also gained significant attention. For example, the method in14 employs the Gramian angular summation field (GASF) transformation to convert raw signals into images, followed by a multi-scale perception and multi-level feature fusion strategy to effectively capture fault-related information, thereby significantly enhancing diagnostic accuracy. The work in15 proposed multiple time-shift multi-scale decomposition subsignals to extract subtle abnormal patterns that single-entropy-based approaches may overlook, enabling an efficient analysis of the complex operational characteristics of rotating machinery.
In recent years, short-time Fourier transform (STFT)-based deep learning techniques have shown excellent performance in bearing fault diagnosis16,17,18. STFT converts vibration signals into a spectrogram, providing a rich time-frequency signal representation. This representation is highly informative for capturing signal patterns and anomalies in the vibration data. The output matrix from STFT, i.e., the spectrogram, is used as input for deep learning models such as CNNs, which process and analyze the complex information in vibration signals.
However, the field of deep learning-based fault diagnosis has not extensively explored the integration of diverse textual information such as process operating conditions, engineering logs, and vibration signals. Comprehending the relationship between vibration data and textual information is crucial for identifying issues such as bearing defects in machinery. This integration has the potential to enhance the efficiency and reliability of manufacturing processes through advanced predictive models and diagnostic tools. However, this approach faces challenges in handling data complexity and achieving interdisciplinary integration. The limited research in this domain presents a substantial opportunity for significant advancements in manufacturing technology.
Meanwhile, an increasing number of studies are exploring deep learning tasks that involve superimposing letters and numbers on images. SuperTML19 embeds features from tabular data into images. These images are then processed using pre-trained CNN models for classification tasks. Research has shown that SuperTML achieves state-of-the-art results on both large and small datasets, highlighting its efficacy in classifying tabular data.
Unlike CLIP20, CLIPPO21 introduces a single encoder for processing both images and textual content visualized as images, facilitating effective handling of tasks across both image-centric and linguistic domains. Significantly, CLIPPO performs well in tasks such as image retrieval and zero-shot classification, achieving results comparable to CLIP-style models but with enhanced efficiency attributed to fewer parameters and the elimination of text-specific embedding. PIXEL22 employs a unique approach by transforming textual content into visual formats. This method facilitates cross-linguistic representation transfer, leveraging orthographic similarities and pixel co-activation. Notably, PIXEL excels in syntactic and semantic tasks across diverse languages, including those with non-Latin alphabets. These advancements highlight the potential of integrating various artificial intelligence domains to address complex challenges.
Inspired by the aforementioned studies, we present ZERONE, a novel deep learning-based methodology for fault diagnosis by expressing both vibration signals and text information within a single image. This approach is simple but effective. The vibration signal is converted into handcrafted features of the time-domain and STFT features of the time-frequency domain, which are then integrated with textual data, such as operating conditions of a machine, within an image.
The key contributions of this study are as follows:
We introduce ZERONE, a simple technique for identifying malfunctions in mechanical systems, effectively utilizing the computational abilities of deep learning and integrating expert knowledge. The uniqueness of ZERONE is that it directly integrates operational condition data into images, an approach rarely seen in existing methodologies. This integration enhances the ability of the model to provide a comprehensive interpretation of machine states, improving accuracy and reliability of fault detection.
ZERONE is designed to be user-friendly, particularly for engineering professionals, allowing them to customize variables to improve its performance. Even with the basic default settings, ZERONE outperforms other existing machine learning and deep learning methods, making it a practical tool in various industrial applications.
ZERONE leverages domain knowledge features to achieve superior performance even in limited-data scenarios. This aspect is particularly crucial in areas where extensive data collections are unavailable. In addition, ZERONE facilitates the interpretation of the results by identifying the key frequencies and statistical variables that influence the classification using gradient-weighted class activation mapping (Grad-CAM).
The remainder of this paper is organized as follows. Section “Procedure of the proposed method” introduces the proposed method, ZERONE. Section “Case study” evaluates the effectiveness of the proposed method using the Paderborn University (PU) and Jiangnan University (JNU) bearing datasets. Section “Discussion” discusses the results of ablation experiments and interpretability. Finally, Section “Conclusion” presents the conclusion and future work.
This section outlines the process of the proposed method, as shown in Fig. 1. The process begins with the extraction of both time-domain and time-frequency domain features from the vibration signals. These features are embedded into a 2D image with a white background and subsequently utilized as the input for a deep learning model. To improve the model’s generalization performance, we introduce a feature-masking technique that is selectively applied to remove specific features during the learning process.
Fig. 1. Framework of the proposed method.
Fig. 2. Illustration of the ZERONE image: (a) Scaling individual features between 0 and 1. (b) Binarization of features using a threshold of 0.5. (c) Representation of features and operating conditions in a 150 \(\times\) 150 \(\times\) 3 image using the jet colormap. (d) Feature masking to enhance model generalization.
As illustrated in Fig. 1, the initial step in the ZERONE framework is the application of STFT to vibration signals. This process aims to extract features within the time-frequency domain. STFT is mathematically expressed as follows:
\[ \mathrm{STFT}(\tau , f) = \int _{-\infty }^{\infty } x(t)\, h(t-\tau )\, e^{-j 2 \pi f t}\, dt \]
where \(x(t)\) represents the time-domain signal, \(f\) indicates frequency, and \(h(t-\tau )\) denotes the window function centered at time \(\tau\).
The precision of time and frequency resolution in the STFT is influenced by three key parameters: the length of the time series \(N_x\), the width of the window \(N_w\), and the degree of window overlap \(N_0\). This interrelation is articulated as:
\[ T = \frac{N_x - N_0}{N_w - N_0}, \qquad F = \frac{N_w}{2} + 1 \]
where \(T\) and \(F\) represent the time and frequency resolutions, respectively.
In this work, we applied STFT to a vibration signal comprising 2,048 points, employing a window size of 128 points with an overlap of 64 points. This procedure yielded a spectrogram of dimensions 65 \(\times\) 31, effectively capturing the signal’s frequency content over time. The spectrogram, with time on the x-axis and frequency on the y-axis, graphically illustrates the dynamic intensity of various frequencies that change over time. This representation is pivotal for interpreting the frequency characteristics of the signal. To extract meaningful statistics from the spectrogram, the average magnitude for each frequency band was calculated by averaging across the temporal dimension, yielding 65 distinct features.
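The dimensions quoted above can be checked with a minimal NumPy sketch. It is an illustration only: the Hann window, the absence of edge padding, and the helper name `stft_magnitude` are assumptions, since the text does not specify the window type or the implementation used.

```python
import numpy as np

def stft_magnitude(x, n_window=128, n_overlap=64):
    """Magnitude spectrogram of shape (n_window // 2 + 1, n_frames), no edge padding."""
    hop = n_window - n_overlap
    n_frames = (len(x) - n_overlap) // hop
    window = np.hanning(n_window)  # assumed window; the text does not name one
    frames = np.stack(
        [x[i * hop : i * hop + n_window] * window for i in range(n_frames)]
    )
    # rfft keeps the non-negative frequencies: n_window // 2 + 1 bins per frame
    return np.abs(np.fft.rfft(frames, axis=1)).T

rng = np.random.default_rng(0)
signal = rng.standard_normal(2048)      # stand-in for one 2,048-point segment
spec = stft_magnitude(signal)
freq_features = spec.mean(axis=1)       # average each frequency band over time
print(spec.shape, freq_features.shape)  # (65, 31) (65,)
```

With 2,048 points, a 128-point window, and a 64-point overlap, the frame count is (2048 - 64) / (128 - 64) = 31 and the bin count is 128 / 2 + 1 = 65, matching the 65 \(\times\) 31 spectrogram and the 65 averaged features.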
Following the STFT process, we extracted 13 time-domain features from the vibration signal, as outlined in Table 1. This selection of features is supported by literature23,24, ensuring their relevance and utility. In addition, we integrated two widely used features in signal processing and pattern recognition tasks: zero crossing (ZC)25 and mean absolute value (MAV)26.
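As a rough illustration of this kind of feature extraction, the sketch below computes a handful of common time-domain statistics together with ZC and MAV. The selection and the exact definitions are assumptions made for illustration; the features actually used are those listed in Table 1, which is not reproduced here.

```python
import numpy as np

def time_domain_features(x):
    """A few common time-domain statistics (illustrative subset, not Table 1)."""
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    std = x.std()
    centered = x - x.mean()
    peak = np.max(np.abs(x))
    return {
        "mean": x.mean(),
        "std": std,
        "rms": rms,
        "peak": peak,
        "crest_factor": peak / rms,
        "kurtosis": np.mean(centered ** 4) / std ** 4,
        # MAV: mean of the absolute sample values
        "mav": np.mean(np.abs(x)),
        # ZC: number of sign changes between consecutive samples (assumed definition)
        "zc": int(np.sum(x[:-1] * x[1:] < 0)),
    }

feats = time_domain_features([1.0, -1.0, 1.0, -1.0])
print(feats["zc"], feats["mav"])  # 3 1.0
```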
The proposed ZERONE is introduced in Algorithm 1. Specifically, we integrate 80 distinctive features extracted from the vibration signals into a 150 \(\times\) 150 \(\times\) 3 image with a white background. These features are methodically organized into groups of five, vertically aligned from the top to the bottom of the image. Our method extends beyond simply extracting characteristics from signal data. ZERONE can incorporate both categorical variables and varying operational conditions directly into the visual representation. For example, the top three features in the image, as shown in Fig. 2c, represent the operating conditions of the process, showcasing the method’s adaptability. A pivotal element of our method is the color normalization technique. Initially, as shown in Fig. 2a, each feature is individually scaled to a range between 0 and 1. This scaling is based on the minimum and maximum values of each column of the training dataset. After normalization, each scaled value is mapped to a color using the jet colormap. To enhance generalizability and mitigate overfitting, we then binarize each feature using a threshold of 0.5. This step transforms each feature value into a binary state, represented as either 0 or 1, as depicted in Fig. 2b. These binary values are then drawn in the previously assigned colors, as shown in Fig. 2c, so that each feature in the ZERONE image is distinctly expressed through RGB color values, with its relative magnitude and importance visually discernible. As a result, the ZERONE framework enables the comprehensive analysis of machinery states by examining not only vibration signals but also external variables and conditions. For the analytical phase, we leverage ResNet1827 to process these feature-enriched images, ensuring a robust and accurate analysis.
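The scaling, binarization, and color assignment described above can be sketched as follows. This is one reading of the text rather than the authors' Algorithm 1: `jet_rgb` is a piecewise-linear approximation of the jet colormap, `zerone_encode` is a hypothetical helper name, the `>=` comparison at the threshold is an assumption, and drawing the colored digits onto the 150 \(\times\) 150 \(\times\) 3 canvas is left out.

```python
import numpy as np

def jet_rgb(v):
    """Piecewise-linear approximation of the jet colormap for v in [0, 1]."""
    v = np.clip(np.asarray(v, dtype=float), 0.0, 1.0)
    r = np.clip(1.5 - np.abs(4.0 * v - 3.0), 0.0, 1.0)
    g = np.clip(1.5 - np.abs(4.0 * v - 2.0), 0.0, 1.0)
    b = np.clip(1.5 - np.abs(4.0 * v - 1.0), 0.0, 1.0)
    return (np.stack([r, g, b], axis=-1) * 255).astype(np.uint8)

def zerone_encode(features, train_min, train_max, threshold=0.5):
    """Return the 0/1 digit and the RGB color to draw for each feature."""
    # Min-max scaling with statistics taken from the training columns only
    span = np.where(train_max > train_min, train_max - train_min, 1.0)
    scaled = np.clip((features - train_min) / span, 0.0, 1.0)
    digits = (scaled >= threshold).astype(int)  # binarization at 0.5
    colors = jet_rgb(scaled)                    # color still carries the magnitude
    return digits, colors

train = np.array([[0.0, 10.0], [2.0, 30.0], [4.0, 20.0]])  # toy training columns
digits, colors = zerone_encode(
    np.array([3.0, 12.0]), train.min(axis=0), train.max(axis=0)
)
print(digits)  # [1 0]
```

Keeping the color tied to the scaled value while the glyph is only 0 or 1 is what lets the image stay visually simple without discarding magnitude information.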
ResNet18 is a deep convolutional neural network with 18 layers that employs residual connections to mitigate the vanishing gradient problem, facilitating the training of deeper architectures. Its structure begins with an initial convolutional layer, followed by four groups of residual blocks, and ends with global average pooling and a fully connected classification layer.
Training of ZERONE
Augmentation techniques are known to improve the generalization capabilities of deep learning models28. Among various augmentation techniques, random erasing29 has been shown to enhance performance in various image-classification tasks. This method involves the random selection of n \(\times\) m patches of an image during the training process and replacing their values with random ones within the [0, 255] pixel value range. Building on this concept, our approach incorporates a tailored augmentation technique called feature masking. This method selectively masks a certain number of features within the 2D image by setting their pixel values to 255, as illustrated in Fig. 2d. The primary objective of this technique is to prevent the model’s overreliance on particular features or patterns within the training dataset. With feature masking, the model is encouraged to learn from a more diverse set of characteristics, thereby enhancing its robustness and adaptability to new data.
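A minimal sketch of feature masking is given below. The grid layout (16 rows of five cells, each 9 \(\times\) 30 pixels) and the helper name `mask_features` are assumptions chosen only so that 80 cells fit into a 150 \(\times\) 150 image; the actual placement of features follows Algorithm 1.

```python
import numpy as np

ROWS, COLS = 16, 5        # assumed layout: 80 features in rows of five
CELL_H, CELL_W = 9, 30    # assumed cell size in pixels

def mask_features(image, n_mask, rng):
    """Paint n_mask randomly chosen feature cells white (pixel value 255)."""
    masked = image.copy()
    chosen = rng.choice(ROWS * COLS, size=n_mask, replace=False)
    for idx in chosen:
        r, c = divmod(int(idx), COLS)
        masked[r * CELL_H:(r + 1) * CELL_H, c * CELL_W:(c + 1) * CELL_W, :] = 255
    return masked, chosen

rng = np.random.default_rng(0)
image = np.zeros((150, 150, 3), dtype=np.uint8)  # dark stand-in so masked cells are visible
masked, chosen = mask_features(image, n_mask=16, rng=rng)
print(len(chosen), int((masked == 255).all(axis=-1).sum()))  # 16 4320
```

Unlike random erasing, which fills arbitrary patches with random values, the masked regions here coincide with whole features and are set to the background color, so a masked feature simply disappears from the image.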
To evaluate the effectiveness of the proposed method, we selected a traditional machine learning model and four deep learning models for comparison.
Random Forest30 is a well-established machine learning algorithm for classification tasks. The 65 average frequency magnitudes extracted from the STFT results and the 15 time-domain features serve as input to the random forest model for defect classification. In an additional configuration, the model also receives a one-hot encoding of the operating conditions as input.
FaultNet8 has a strong capability of extracting features from vibration signals. It transforms 1D raw signals into a 2D format using moving-average and median filtering techniques. This transformation allows FaultNet to efficiently identify the specific characteristics of defects.
WDCNN11 processes 1D raw vibration signals to efficiently detect complex fault patterns and characteristics. The architecture of this model is designed to handle the complex nature of vibration signals, making it a reliable solution for identifying and classifying diverse mechanical defects. To further improve generalization performance, we applied time-series augmentation techniques such as jittering, scaling, stretching, and cropping, based on empirical studies31.
DCA-BiGRU32 has been introduced as a method for fault diagnosis in small sample scenarios, utilizing a dual path convolution integrated with an attention mechanism (DCA) alongside a bidirectional gated recurrent unit (BiGRU) to process 1D input signals. This framework extracts crucial features from vibration signals through the fusion of spatiotemporal characteristics and performs dimensionality reduction and fault diagnosis via global average pooling (GAP).
STFT-based ResNet18 leverages STFT to transform raw vibration signals into time-frequency representations. These data are then fed into the ResNet18 deep learning architecture. The method combines the ability of STFT to capture both time and frequency information with the advanced feature-extraction capabilities of ResNet18 to effectively identify subtle and complex fault signals within the machine system. In the preprocessing step, the images are resized to 150 \(\times\) 150. After resizing, the final input format for the model is set to dimensions of 150 \(\times\) 150 \(\times\) 1.
The PU dataset originates from a test rig comprising several components, including an electric motor, a torque-measurement shaft, a rolling bearing test module, a flywheel, and a load motor33, as illustrated in Fig. 3. Data were sampled at 64 kHz under four operating conditions, which varied in rotational speed (rpm), load torque (Nm), and radial force (N). These conditions are detailed in Table 2. For each condition, the rotational speed is denoted by ‘N’, the load torque by ‘M’, and the radial force by ‘F’. For example, N15_M07_F10 means that the rotational speed is 1500 rpm, the load torque is 0.7 Nm, and the radial force is 1000 N.
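The naming convention can be decoded with a few lines of Python. The scaling factors below are inferred from the single example in the text (N15 as 1500 rpm, M07 as 0.7 Nm, F10 as 1000 N), and `parse_condition` is a hypothetical helper, so treat this as a sketch rather than an official parser for the dataset.

```python
import re

def parse_condition(code):
    """Decode a PU operating-condition code such as 'N15_M07_F10'."""
    match = re.fullmatch(r"N(\d+)_M(\d+)_F(\d+)", code)
    if match is None:
        raise ValueError(f"unrecognized condition code: {code!r}")
    speed, torque, force = (int(group) for group in match.groups())
    return {
        "speed_rpm": speed * 100,        # N15 -> 1500 rpm
        "torque_nm": torque / 10,        # M07 -> 0.7 Nm
        "radial_force_n": force * 100,   # F10 -> 1000 N
    }

cond = parse_condition("N15_M07_F10")
print(cond)  # {'speed_rpm': 1500, 'torque_nm': 0.7, 'radial_force_n': 1000}
```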
Fig. 3. Experimental platform and corresponding signal waveforms for the PU dataset.
Table 3 presents an analysis of bearings subjected to a lifetime test. The “Bearing code” column lists unique identifiers for each bearing examined. The “Damage” column specifies the primary type of damage observed as either fatigue pitting or plastic deformation. The “Location” column indicates the location of the damage on the bearing, whether it is on the outer race or inner race. The “Combination” column classifies the occurrence of damage into three types: Single for an isolated instance of damage, Repetitive for repeated damage of the same type, and Multiple for various damages or the same damage occurring on multiple components. Lastly, the “Severity” column ranks the severity of the damage on a scale from 1 to 3. This study utilizes bearing fault data from Table 3 and normal data from K001 to K006. Each bearing code was subjected to twenty measurements. Given that each test lasted 4 seconds, the initial 2 seconds of data were used for training, while the remaining data served as the test dataset. The training data were segmented into signals of length 2,048, with an overlap of 80 points, whereas the test data were utilized without overlapping. Finally, we constructed the training and test data by extracting data for each operating condition, as given in Table 4.
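The segmentation scheme can be illustrated with a short sketch. It takes the stated overlap of 80 points literally and assumes exactly 2 seconds of data per half at 64 kHz; the resulting counts are per measurement and are not the totals reported in Table 4. The helper name `segment_starts` is hypothetical.

```python
def segment_starts(n_samples, length=2048, overlap=0):
    """Start indices of fixed-length segments with the given overlap."""
    step = length - overlap
    return list(range(0, n_samples - length + 1, step))

fs = 64_000
half = 2 * fs  # 2 seconds of samples: first half for training, second for testing

train_starts = segment_starts(half, overlap=80)  # overlapping training segments
test_starts = segment_starts(half, overlap=0)    # non-overlapping test segments
print(len(train_starts), len(test_starts))       # 65 62
```

Splitting each recording in time before segmenting keeps overlapping training windows from sharing samples with any test window.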
The hyperparameters of ResNet18 are listed in Table 5. Adam was chosen as the optimizer, the batch size was set to 32, and the initial learning rate was set to 0.001 and reduced by 30% every 10 epochs. To enhance the training process, 20% of the features (16 of 80) were randomly masked. The performance metric of the experiment was the diagnostic accuracy, averaged over three trials to assess the model’s stability.
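The learning-rate schedule and masking count stated above amount to the following arithmetic. This is a framework-independent sketch with a hypothetical helper name; in PyTorch, a step scheduler with a step size of 10 and a decay factor of 0.7 would be one way to obtain the same schedule, although the text does not say which implementation was used.

```python
def learning_rate(epoch, base_lr=1e-3, drop=0.30, every=10):
    """Step decay: reduce the rate by `drop` after every `every` epochs."""
    return base_lr * (1.0 - drop) ** (epoch // every)

n_features = 80
n_masked = round(0.20 * n_features)  # 20% of the features are masked

for epoch in (0, 9, 10, 25):
    print(epoch, learning_rate(epoch))
print(n_masked)  # 16
```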
The experimental results are listed in Table 6. The random forest model, which leveraged prior features (time- and frequency-domain features), showed an accuracy of 97.17%, which was higher than that of the other three deep learning models. This result indicates the effectiveness of feature extraction based on domain knowledge in fault diagnosis.
When the random forest also learned the operating conditions alongside the prior features, its accuracy improved by 0.24%. Although it did not use the operating conditions, the STFT-based ResNet18 model achieved the highest performance among all methods except the proposed method. ZERONE without operating conditions outperformed the STFT-based ResNet18. Moreover, the integration of operating conditions into the proposed method led to an additional performance boost of 0.26%, affirming the efficacy of incorporating comprehensive operational insights into the diagnostic process.
The JNU bearing dataset34, derived from induction motors and bearing systems, provided the basis for Case 2. Data acquisition was conducted over 20-second intervals for each operating condition (600, 800, and 1000 rpm) with a sampling rate of 50 kHz, as illustrated in Fig. 4.
Fig. 4. Experimental platform and corresponding signal waveforms for the JNU dataset.
The inner race fault, outer race fault, rolling element fault, and normal classes, listed in Table 7, were handled in the same way as for the PU dataset. A notable difference in this case was that we evaluated the model performance using fewer training samples than for the PU dataset. To ensure a consistent basis for performance evaluation, we maintained the same hyperparameter settings across all models as in the PU dataset experiments, except that the batch size was set to 16.
The experimental results, summarized in Table 8, reveal that FaultNet performed significantly worse than the other models, with an accuracy of 86.94%. This indicates that deep learning models operating on raw signals struggle to recognize patterns when training data are limited. In contrast, because WDCNN used data augmentation, it achieved a competitive accuracy of 95.00% even with a small number of training samples. The random forest, which utilized both STFT and time-domain features, achieved an accuracy of 99.28%, outperforming these deep learning models. This outcome underscores the continued relevance and effectiveness of domain-specific feature extraction in fault diagnosis, consistent with findings from the PU dataset. Unlike in the PU dataset results, the random forest model showed similar performance regardless of whether operating conditions were incorporated. Whereas the STFT-based ResNet18 performed almost on par with the random forest, our proposed model achieved higher performance. In addition, adding the operating conditions yielded a slight further improvement.
Figure 5 presents the performance depending on whether color normalization is applied in each dataset. In the PU dataset, implementing color normalization significantly enhanced accuracy by 11.51%, from 86.43% to 97.94%. Similarly, in the JNU dataset, color normalization led to a remarkable performance boost of 16.47%, increasing the accuracy from 83.36% to 99.83%. This dramatic improvement is attributed to the preservation of feature information through color normalization, unlike binary representations where most feature details are reduced to 0 or 1, resulting in information loss.
Fig. 5. Impact of color normalization on PU and JNU datasets.
This implies that representing magnitude through color, as done in color normalization, is more advantageous than binary representations alone. As given in Table 9, further investigation into the effect of numerical precision on performance revealed that, for both datasets, displaying data to the first decimal place yielded the most favorable outcomes among the settings without color normalization. However, extending precision to two decimal places led to a performance decline due to overfitting. Notably, the application of color normalization consistently enhanced performance across varying levels of numerical precision. The experiments demonstrated that, with color normalization, the simplest form of data representation, using binary values as shown in Fig. 2c, consistently delivered the highest performance across all tests, underscoring the efficacy of straightforward data representation methods. Furthermore, the existing SuperTML method for embedding tabular data into images uses neither the color normalization nor the 0 or 1 binarization proposed in ZERONE. This corresponds to the third row in Table 9. In contrast, ZERONE, corresponding to the fourth row, achieved a test accuracy improvement of 2.77% on the PU bearing dataset and 11.7% on the JNU bearing dataset compared to the SuperTML method, leading to an average test accuracy improvement of 7.24%.
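For reference, the reported average follows from the two per-dataset gains over SuperTML, assuming it is the plain arithmetic mean of the two figures:

\[ \frac{2.77 + 11.70}{2} = 7.235 \approx 7.24 \]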
To validate the impact of feature masking, Fig. 6 shows the performance on the two datasets for masking ratios from 0% to 50%. The blue line represents the PU dataset, which achieved its highest performance at a masking ratio of 20%; its performance dropped at 30% and then remained at a similar level up to 50%. By contrast, the JNU dataset, denoted by the green line, showed the best performance without feature masking, and the performance progressively deteriorated with increasing masking ratios, with a slight exception at 30%. Nonetheless, when feature masking was absent or applied at a masking ratio of up to 30%, the proposed method outperformed the comparative models on both datasets.
Fig. 6. Performance comparison of PU and JNU datasets across different feature masking ratios.
To evaluate how different extracted features affect bearing fault diagnosis performance, the experiment compared classification accuracy using time-domain features, frequency-domain features, and a combination of both. Table 10 shows that using only time-domain features resulted in an average accuracy of 91.47% for the PU dataset and 84.92% for the JNU dataset. This suggests that while statistical features from time-domain signals provide valuable information for fault diagnosis, they are limited in capturing the complex frequency characteristics of vibration signals. On the other hand, when only frequency-domain features were used, the accuracy reached 97.67% for the PU dataset and 99.28% for the JNU dataset. Frequency-domain features effectively represent the spectral information of vibration signals and are particularly useful for capturing subtle fault-related changes. Combining time-domain and frequency-domain features yielded the best results, with classification accuracies of 98.24% for the PU dataset and 99.64% for the JNU dataset. This result indicates that features from both domains complement each other. By integrating these two types of information, a more comprehensive identification of various aspects crucial for fault diagnosis becomes possible. This demonstrates that the ZERONE framework generates a rich representation essential for fault diagnosis, effectively capturing subtle fault differences and the complex nature of vibration signals.
The interpretability of deep learning models is becoming increasingly important, particularly in critical applications where understanding the decision-making process of the model is crucial for reliability. Therefore, we utilized Grad-CAM35, which is widely used in deep learning classification studies36,37, to visualize the areas of the input image that were significant in making predictions. This method overlays a heatmap on the image, effectively highlighting the regions on which the model focuses. The heatmap is generated by analyzing the contribution of each feature map to the final classification score. It offers valuable insights into the internal workings of the model and helps verify that the model is concentrating on the right features for its predictions. Figure 7 shows the Grad-CAM visualization results on the JNU dataset when the ResNet18 model correctly predicts each class. In these images, the first variable corresponds to rpm, and the frequency bins are numbered from 0 to 64. The number 0 represents the DC component, 1 represents 390.625 Hz, and 64 represents 25,000 Hz. The rest are time-domain features. As shown in Fig. 7a, the model focuses on the low-frequency band, as there is no obvious structural resonance in the case of normal bearings. On the other hand, all defect types show a greater number of frequency components marked as ‘1’ compared to the normal signal, which largely consists of ‘0’ components. This observation suggests that the frequency components in defect classes have higher amplitudes than those in the normal class. More specifically, the inner race defect signal exhibits relatively higher frequency amplitudes than the outer race defect, with noticeable sidebands as shown in Fig. 7b and c. Meanwhile, the ball defect, characterized by irregular and unstable vibration patterns, displays stronger amplitudes across various frequency spectra, as shown in Fig. 7d.
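The heatmap computation at the core of Grad-CAM can be sketched in NumPy as follows. The sketch assumes that the activations of the last convolutional layer and the gradients of the class score with respect to them have already been obtained from the network; the array shapes, the toy values, and the helper name `grad_cam` are illustrative.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM map from (K, H, W) feature maps and their (K, H, W) gradients."""
    # Channel weights: global average of the gradients over the spatial axes
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps, then ReLU to keep positive evidence only
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example: channel 0 supports the class, channel 1 argues against it
activations = np.zeros((2, 2, 2))
activations[0, 0, 0] = 1.0
activations[1, 1, 1] = 1.0
gradients = np.stack([np.ones((2, 2)), -np.ones((2, 2))])

cam = grad_cam(activations, gradients)
print(cam.tolist())  # [[1.0, 0.0], [0.0, 0.0]]
```

In practice the low-resolution map is upsampled to the input size (here 150 \(\times\) 150) before being overlaid on the ZERONE image.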
The model’s significant attention to operating conditions also implies that, in addition to vibration signals, the process’s operating conditions play a crucial role. Overall, these findings indicate that the proposed method effectively learns the bearing fault mechanism and functions as an interpretable model, offering deeper insights into its analytical process.
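The frequency-bin numbering used in the Grad-CAM discussion follows from the 50 kHz sampling rate of the JNU data and a 128-point STFT window. The sketch below reproduces the quoted values; the window length is taken from the STFT settings described earlier and is assumed to apply to the JNU data as well, and `bin_to_hz` is a hypothetical helper.

```python
def bin_to_hz(k, fs=50_000, n_window=128):
    """Center frequency in Hz of STFT bin k for sampling rate fs."""
    return k * fs / n_window

print(bin_to_hz(0), bin_to_hz(1), bin_to_hz(64))  # 0.0 390.625 25000.0
```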
Fig. 7. Grad-CAM results for the JNU dataset: (a) Normal, (b) Inner Race, (c) Outer Race, and (d) Ball Element Fault. For each class, the top image shows the heatmap, the middle image shows the ZERONE image overlaid with the heatmap, and the bottom image shows the variable name of each data column overlaid with the heatmap.
This study introduces ZERONE, a novel method for bearing fault diagnosis that integrates vibration signal features and operating conditions into an image-based representation for analysis. A key strength of ZERONE is its ability to seamlessly incorporate categorical variables, such as operating conditions obtained from the manufacturing process, along with vibration signal features as input. Experimental evaluations using Paderborn University (PU) and Jiangnan University (JNU) bearing datasets demonstrated that ZERONE outperformed existing methods. Notably, the proposed color normalization led to an improvement of over 10% in both datasets, while feature masking contributed to enhancing generalization performance in the PU dataset. One of the key findings is the consistent performance improvement achieved by incorporating operating conditions, with accuracy increasing from 97.98% to 98.24% on the PU dataset and from 99.47% to 99.64% on the JNU dataset, highlighting the importance of contextual data in improving diagnostic precision. Furthermore, ZERONE exhibited exceptional robustness by maintaining high accuracy even with limited training samples, which is a significant advantage in real-world scenarios where data collection is constrained. Additionally, Grad-CAM analysis validated ZERONE’s interpretability by visually confirming that the model focuses on key frequency bands and time-domain features, providing transparent insights into fault mechanisms. These results establish ZERONE as an effective and interpretable tool for fault diagnosis in mechanical systems.
Future research will focus on developing preprocessing methods to incorporate more contextual information into the image representation. Additionally, we plan to further refine feature masking to enhance the model’s generalization performance. Finally, we aim to explore the integration of advanced feature extraction methods to further improve the performance of ZERONE.
The data used in this study are publicly accessible. For the Paderborn University dataset, please refer to https://groups.uni-paderborn.de/kat/BearingDataCenter/. For the Jiangnan University dataset, detailed access instructions are available at https://github.com/ClarkGableWang/JNU-Bearing-Dataset. For further assistance, please contact the corresponding author, Prof. Hyunsoo Yoon.
Tama, B. A., Vania, M., Lee, S. & Lim, S. Recent advances in the application of deep learning for fault diagnosis of rotating machinery using vibration signals. Artif. Intell. Rev. 56, 4667–4709 (2023).
Lei, Y. Intelligent Fault Diagnosis and Remaining Useful Life Prediction of Rotating Machinery (Butterworth-Heinemann, 2016).
Chen, R.-C., Dewi, C., Huang, S.-W. & Caraka, R. E. Selecting critical features for data classification based on machine learning methods. J. Big Data 7, 52 (2020).
Liu, X.-M., Zhang, R.-M., Li, J.-P., Xu, Y.-F. & Li, K. A motor bearing fault diagnosis model based on multi-adversarial domain adaptation. Sci. Rep. 14, 29078 (2024).
Liu, W., Zhang, Z., Ye, Z. & He, Q. A novel intelligent fault diagnosis method for gearbox based on multi-dimensional attention denoising convolution. Sci. Rep. 14, 24688 (2024).
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
Pandhare, V., Singh, J. & Lee, J. Convolutional neural network based rolling-element bearing fault diagnosis for naturally occurring and progressing defects using time-frequency domain features. In Proc. Prognostics Syst. Health Manage. Conf. (PHM-Paris), 320–326 (2019).
Magar, R., Ghule, L., Li, J., Zhao, Y. & Farimani, A. B. Faultnet: A deep convolutional neural network for bearing fault classification. IEEE Access 9, 25189–25199 (2021).
Kiranyaz, S., Ince, T., Hamila, R. & Gabbouj, M. Convolutional neural networks for patient-specific ECG classification. In Proc. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., 2608–2611 (2015).
Ince, T., Kiranyaz, S., Eren, L., Askar, M. & Gabbouj, M. Real-time motor fault detection by 1-d convolutional neural networks. IEEE Trans. Ind. Electron. 63, 7067–7075 (2016).
Zhang, W., Peng, G., Li, C., Chen, Y. & Zhang, Z. A new deep learning model for fault diagnosis with good anti-noise and domain adaptation ability on raw vibration signals. Sensors 17, 425 (2017).
Zhang, T., Chen, J., He, S. & Zhou, Z. Prior knowledge-augmented self-supervised feature learning for few-shot intelligent fault diagnosis of machines. IEEE Trans. Ind. Electron. 69, 10573–10584 (2022).
Xie, J., Li, Z., Zhou, Z. & Liu, S. A novel bearing fault classification method based on xgboost: The fusion of deep learning-based features and empirical features. IEEE Trans. Instrum. Meas. 70, 1–9 (2020).
Wang, Z. et al. Few-shot fault diagnosis for machinery using multi-scale perception multi-level feature fusion image quadrant entropy. Adv. Eng. Inform. 63, 102972 (2025).
Wang, Z. et al. A generalized fault diagnosis framework for rotating machinery based on phase entropy. Reliab. Eng. Syst. Saf. 256, 110745 (2025).
Fei, S.-W. & Liu, Y.-Z. Fault diagnosis method of bearing utilizing GLCM and MBASA-based KELM. Sci. Rep. 12, 17368 (2022).
Xin, G. et al. Fault diagnosis of wheelset bearings in high-speed trains using logarithmic short-time Fourier transform and modified self-calibrated residual network. IEEE Trans. Ind. Inform. 18, 7285–7295 (2021).
Ribeiro Junior, R. F. et al. Fault detection and diagnosis in electric motors using convolution neural network and short-time Fourier transform. J. Vibrat. Eng. Technol. 10, 2531–2542 (2022).
Sun, B. et al. Supertml: Two-dimensional word embedding for the precognition on structured tabular data. In Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognit. Workshops, 1–9 (2019).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. Int. Conf. Mach. Learn., 8748–8763 (2021).
Tschannen, M., Mustafa, B. & Houlsby, N. CLIPPO: Image-and-language understanding from pixels only. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 11006–11017 (2023).
Rust, P. et al. Language modelling with pixels. arXiv:2207.06991 (2022).
Pan, T., Chen, J., Zhou, Z., Wang, C. & He, S. A novel deep learning network via multiscale inner product with locally connected feature extraction for intelligent fault detection. IEEE Trans. Ind. Inform. 15, 5119–5128 (2019).
Chen, J., Wang, C., Wang, B. & Zhou, Z. A visualized classification method via t-distributed stochastic neighbor embedding and various diagnostic parameters for planetary gearbox fault identification from raw mechanical data. Sensors Actuators A Phys. 284, 52–65 (2018).
William, P. E. & Hoffman, M. W. Identification of bearing faults using time domain zero-crossings. Mech. Syst. Signal Process 25, 3078–3088 (2011).
Purushothaman, G. & Ray, K. EMG based man-machine interaction—a pattern recognition research platform. Robot. Auton. Syst. 62, 864–870 (2014).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 770–778 (2016).
Shorten, C. & Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. J. Big Data 6, 1–48 (2019).
Zhong, Z., Zheng, L., Kang, G., Li, S. & Yang, Y. Random erasing data augmentation. Proc. AAAI Conf. Artif. Intell. 34, 13001–13008 (2020).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Iwana, B. K. & Uchida, S. An empirical survey of data augmentation for time series classification with neural networks. PLoS One 16, e0254841 (2021).
Zhang, X. et al. Fault diagnosis for small samples based on attention mechanism. Measurement 187, 110242 (2022).
Lessmeier, C., Kimotho, J. K., Zimmer, D. & Sextro, W. Condition monitoring of bearing damage in electromechanical drive systems by using motor current signals of electric motors: A benchmark data set for data-driven classification. In Proc. Euro. Conf. Prognostics Health Manage. Soc., 5–08 (2016).
Li, K., Ping, X., Wang, H., Chen, P. & Cao, Y. Sequential fuzzy diagnosis method for motor roller bearing in variable operating conditions based on vibration analysis. Sensors 13, 8013–8041 (2013).
Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proc. Int. Conf. Comput. Vis., 618–626 (2017).
Kim, S. H., Park, J. S., Lee, H. S., Yoo, S. H. & Oh, K. J. Combining CNN and grad-cam for profitability and explainability of investment strategy: Application to the KOSPI 200 futures. Expert Syst. Appl. 225, 120086 (2023).
Guan, Y. et al. A novel diagnostic framework based on vibration image encoding and multi-scale neural network. Expert Syst. Appl. 251, 124054 (2024).
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2023-11-0450).
Department of Industrial Engineering, Yonsei University, Seoul, 03722, Republic of Korea
Yongmin Kim & Hyunsoo Yoon
Yongmin Kim contributed to the establishment of methodology, conducting research, formal analysis, validation, and drafting of the paper; Hyunsoo Yoon contributed to conceptualization, review and editing, and overall research guidance.
Correspondence to Hyunsoo Yoon.
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Kim, Y., Yoon, H. Integrating vibration signal analysis and image embedding for enhanced bearing fault diagnosis in manufacturing. Sci Rep 15, 10398 (2025). https://doi.org/10.1038/s41598-025-94351-0
Received: 24 December 2024
Accepted: 13 March 2025
Published: 26 March 2025
DOI: https://doi.org/10.1038/s41598-025-94351-0

