Introduction
Background
Audio signal classification (ASC) involves extracting features from audio signals to identify their respective classes and differentiate them from others. Humans can quickly recognize distinct sounds, such as speech, phone alerts, and animal noises, but the classification process becomes challenging in noisy environments or when audio differences are subtle, like detecting a patient's breathing or heart sounds for medical diagnosis. While experienced professionals can efficiently perform these tasks, training expert systems for sound classification is often costly and not always feasible. ASC systems aim to automate these classification tasks through two main steps: feature extraction from audio signals and the design of classification models. These systems have made significant strides in areas such as speech and music recognition, demonstrating their potential to replace human classifiers.
Fig 1.1 The main framework of an ASC system
Audio Signal Classification (ASC) encompasses the processing of audio data across various domains, including entertainment, media, education, digital libraries, and supervisory systems. It has tackled classical challenges like speech and speaker recognition for decades. In speech recognition, the focus is on differentiating phonemes to form words, phrases, and sentences, with the primary challenge being continuous speech recognition that is independent of grammar and speaker identity. Additionally, ASC involves music recognition and transcription, where the acoustic signal represents music and musical notes are layered to produce a track. While several systems have been developed for these tasks, complexities arise with orchestral sounds. An effective ASC system can differentiate between speech and music, utilizing speech recognition for voice signals and music transcription for musical inputs. This dual approach allows for the optimization of both recognizers, enhancing the overall system's efficiency and robustness. Furthermore, research in ASC extends to language recognition, broadening its applications.
Audio context recognition, video segmentation based on audio, and sound effects retrieval are distinct applications within the ASC field. Each of these areas is researched and developed independently, targeting specific ASC challenges. Consequently, numerous issues remain that require further exploration and integration into the ASC domain.
Literature review
In recent years, numerous studies have focused on advanced techniques for audio signal classification (ASC). Lin et al. (2005) utilized a support vector machine (SVM) to classify audio based on features like subband power and pitch. Umapathy et al. (2007) introduced a multigroup classification system using local discriminant bases for time-frequency analysis. Xu et al. developed a clustering algorithm leveraging linear prediction and cepstral coefficients for music content organization. Ajmera et al. employed an artificial neural network (ANN) and hidden Markov model (HMM) to enhance speech/music distinction in automatic broadcast news transcription. Additionally, a method for speech/music discrimination based on root mean square and zero-crossings was proposed. Honda et al. introduced a technique for assessing single-channel audio signal distances through phase interference analysis.
Artificial Neural Networks (ANNs) have garnered significant attention in recent years, particularly following the introduction of the perceptron algorithm in 1957 and the backpropagation algorithm in 1986. The breakthrough in deep learning (DL) for image classification and speech recognition in 2012 further propelled their popularity.
Deep learning (DL) models consist of multiple layers that connect input and output layers, utilizing a significant number of parameters trained on large datasets. Common architectures employed in automatic speech recognition (ASR) include multilayer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs).
Deep learning (DL) initially gained popularity in image processing but has since been extensively applied in various fields, including speech recognition, music processing, environmental noise analysis, and localization and tracking. The success of DL models in speech recognition has led to their adoption in related tasks such as language recognition, speech translation, and voice activity detection. In the music domain, DL facilitates automatic descriptions for large catalogs, content-based music recommendations, and automatically derived chords for songs. However, its application to environmental sound is less common, resulting in a limited database compared to speech and music, with most open datasets emerging from the annual Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. In localization and tracking, DL aids in estimating sound source distances and localizing sound sources. Additionally, DL is utilized in heart and lung sound classification and fault diagnosis. Most studies leverage spectrograms and waveforms as inputs to DL models such as CNNs, RNNs, and CRNNs, showcasing superior performance over traditional methods such as Gaussian mixture models and HMMs. These findings highlight the significant potential of DL while addressing challenges in ASC.
Deep learning (DL) has proven effective in advancing autonomous systems and industrial applications. However, many studies focus on single tasks with limited datasets, highlighting the need for richer databases and improved feature extraction methods. Additionally, selecting and developing optimal models remains a challenge for achieving superior results. Expanding the range of engineering applications is also essential for further progress in this field.
Objectives
The primary objective of this work is to create optimal deep learning (DL) models by refining their architecture and parameter settings, which will be utilized to tackle engineering challenges in sound classification. Additionally, specific issues related to this field have been systematically addressed.
First, the two main processes in ASC, feature extraction and machine learning (ML) model design, are summarized, with the features and models used in DL analyzed and emphasized.
A novel method for estimating sound receiver locations using a convolutional neural network (CNN) has been developed, demonstrating high accuracy through extensive simulations and experiments. This research not only enhances sound localization but also contributes to optimizing sound quality and the design of audio rooms.
A novel method utilizing sound analysis and deep learning techniques was proposed to identify abnormalities in water pumps. Experiments were carried out on three distinct water pumps during both suction and discharge operations, assessing their performance under normal and abnormal conditions. The findings from this research can contribute to the development of automated systems for detecting pump faults efficiently.
Two deep learning models were developed for classifying heart sounds using log-mel spectrograms of heart sound signals. These models demonstrated superior performance compared to earlier research, potentially aiding cardiologists in the diagnosis of cardiovascular diseases.
Structure of this thesis
This thesis is structured into six chapters: Chapter 2 provides a detailed overview of the methodologies employed, while Chapters 3, 4, and 5 present three application studies focused on sound classification using deep learning techniques. Lastly, Chapter 6 offers a summary of the current work, highlighting its limitations and outlining prospects for future research.
Chapter 2 presents the methodologies used in this thesis with respect to the challenges considered in Section 1.2. In detail, a comprehensive study of audio features and classification models for ASC is presented, in which DL features and models are emphasized and analyzed.
Chapter 3 presents the proposed method and experiment for sound receiver location estimation using a CNN. In Chapter 4, a method to detect abnormalities in a machine based on sound analysis using a DL technique is proposed and applied to water pump fault detection. In Chapter 5, two DL models for classifying heart sounds are proposed and implemented to diagnose five heart valve diseases.
Chapter 6 summarizes and concludes the current work. The limitations are also discussed to point out some potential directions for future work.
Methodology
Audio features for ASC
Short-term energy is the average energy per window/frame [31], computed by Eq. (2.1):

E_i = (1/N) ∑_{n=1}^{N} [x_i(n)]^2   (2.1)

where x_i(n), n = 1, 2, ..., N is the sequence of sound samples of the ith frame, and N is the frame's length.
Loudness refers to the intensity of an auditory sensation in an audio signal It is mathematically represented in decibels (dB) and is approximately proportional to the logarithm of sound intensity.
L = 10 log_10(I / I_0)   (2.2)

where I is the intensity of the sound signal and I_0 = 10^{-12} W/m² is the minimum intensity detectable by the human ear.
Loudness has been employed in speech/music discrimination [32] and speech segmentation [33]
The temporal centroid serves as a balancing point for sound energy over time, calculated from the signal's envelope across audio samples This concept has been effectively utilized in various applications, including acoustic scene classification and environmental sound recognition.
Zero-crossing rate (ZCR) is the number of zero-crossings of an audio signal within a frame [31]. It can be expressed as:

ZCR = (1/(2L)) ∑_{k=1}^{L} |sgn(x(k)) − sgn(x(k−1))|   (2.3)

where x(k) is a discrete signal, k = 1, ..., L, L is the frame's length, and the sign function of Eq. (2.4) is

sgn(x) = { 1, if x ≥ 0; −1, if x < 0 }   (2.4)
The Zero Crossing Rate (ZCR) is a crucial metric for analyzing the frequency content of audio signals, particularly in differentiating between voiced and unvoiced audio classes Unlike music, which typically lacks the alternating patterns of voiced and unvoiced signals found in speech, ZCR values tend to fluctuate more significantly in speech This characteristic makes ZCR valuable for various applications, including speech and music discrimination, musical genre classification, speech analysis, and singing voice detection, due to its effectiveness in distinguishing between speech, music, and different audio effects.
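As an illustration of Eqs. (2.1) and (2.3)-(2.4), the following minimal Python sketch computes short-term energy and ZCR per frame with NumPy; the frame length, hop length, and the random stand-in signal are placeholder choices, not values taken from this thesis.

```python
import numpy as np

def frame_signal(x, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    return np.stack([x[i * hop_len: i * hop_len + frame_len] for i in range(n_frames)])

def short_term_energy(frames):
    """Eq. (2.1): average energy of each frame."""
    return np.mean(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Eqs. (2.3)-(2.4): average rate of sign changes in each frame."""
    signs = np.sign(frames)
    signs[signs == 0] = 1                      # sgn(0) treated as +1, per Eq. (2.4)
    return 0.5 * np.mean(np.abs(np.diff(signs, axis=1)), axis=1)

# Example: 20 ms frames with 50% overlap at fs = 16 kHz on a stand-in signal
fs = 16000
x = np.random.randn(fs)
frames = frame_signal(x, frame_len=320, hop_len=160)
print(short_term_energy(frames).shape, zero_crossing_rate(frames).shape)
```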
The entropy of energy (e) indicates sudden variations in the energy levels of an audio signal To calculate e, a short-term frame is divided into L fixed-length sub-frames, and the energy for each sub-frame is determined This energy is then normalized by the total energy of the frame Finally, the entropy H(i) of the energy sequence is computed, providing insight into the audio signal's dynamics.
Several researchers have used the entropy of energy in detecting the onset of abrupt sounds, e.g [41, 42]
The spectral centroid is a key metric that indicates the center of gravity of a sound's spectrum, reflecting its spectral position and shape This measurement is closely linked to the perception of brightness in sounds The spectral centroid, denoted as C, can be calculated to analyze these characteristics effectively.
C = ∑_k f(k) x(k) / ∑_k x(k)   (2.7)

where x(k) is the weighted frequency value (spectral magnitude) and f(k) is the center frequency of bin number k.
The spectral centroid has been applied in digital audio and music processing as an automatic measure of musical timbre [43]
Spectral entropy is calculated in the frequency domain, similar to energy entropy To compute spectral energy, the spectrum of a short-term frame is divided into L sub-bands (bins) The energy of each bin, indexed by f from 0 to L - 1, is normalized using the cumulative spectral energy Finally, the entropy of the normalized spectral energy is determined based on the established formula.
Spectral entropy was applied for efficiently discriminating between speech and music in [44, 45]
Spectral flux measures the temporal variability of the spectrum, calculated as the squared difference between the normalized magnitudes of two continuous short-term window spectra To determine spectral flux, we first compute the kth normalized discrete Fourier transform (DFT) coefficient EN (k) at the ith frame The spectral flux Fl is subsequently derived from this coefficient.
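In its commonly used form, consistent with the description above, the spectral flux can be written as

Fl_{i,i-1} = \sum_{k=1}^{N} \left( EN_i(k) - EN_{i-1}(k) \right)^2

where EN_i(k) denotes the kth normalized DFT coefficient of the ith frame.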
Spectral flux was applied in ASC such as speech [46], music genre [47], and environmental sound [48]
Spectral roll-off refers to the frequency threshold below which a significant portion of audio signal energy, typically between 85% and 95%, is concentrated This concept is mathematically represented by Equation (2.12), where the mth DFT coefficient is linked to the spectral roll-off In this equation, 's' denotes the spectral value at bin 'k', while 'f' represents the band edges, and 'C' indicates the chosen percentage.
Spectral roll-off has been applied in ASC such as speech/music [49] and music genre [47, 50]
Mel-Frequency Cepstrum Coefficients (MFCCs) are derived from the cepstral representation of a signal and capture its short-term power spectrum. They are obtained from the discrete cosine transform of the log power spectrum on a non-linear mel scale, whose equally spaced bands approximate the human auditory system more closely than linearly spaced frequency bands. This characteristic makes MFCCs essential features in various signal processing applications. The mel-scale approximation is expressed as

f_mel = 2595 log_10(1 + f/700)   (2.13)

where f is the physical frequency in Hz and f_mel is the perceived frequency on the mel scale.
MFCCs were employed in ASC such as speech recognition [52, 53], speech enhancement [54], music genre classification [55], music information retrieval [56], audio similarity measurement [57], vowel detection [58], etc
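As a brief, hedged example of how such features can be extracted in practice, the snippet below uses the librosa library (not necessarily the toolchain of this thesis; the file path and the choice of 13 coefficients are illustrative assumptions):

```python
import librosa

# Load an audio file at its native sampling rate (path is a placeholder)
y, sr = librosa.load("example.wav", sr=None)

# 13 MFCCs per frame, computed internally from a log-mel power spectrum
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)   # (13, number_of_frames)
```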
A spectrogram is generated using the fast Fourier transform (FFT) applied to overlapping windows of a sound signal This process, known as the Short-time Fourier Transform (STFT), involves dividing the sound signal into short-term sequences of fixed size before applying the FFT The resulting spectrogram represents the complex magnitude of the STFT, providing a visual representation of the sound signal's frequency content over time.
The STFT is defined as

X(mL, ω) = ∑_n x(n) w(mL − n) e^{−jωn}   (2.14a)

where w(n) is the analysis window and L is an integer indicating the time separation between adjacent short-time sections. For a fixed value of m, X(mL, ω) is the Fourier transform with respect to n of the short-time section f_m(n) = x(n) w(mL − n).
In addition, a discrete STFT is defined as
X(mL, k) = X(mL, ω)|_{ω = 2πk/N}   (2.14b)

where N is the number of discrete frequencies. Finally, the spectrogram in logarithmic scale is defined as

S(mL, k) = 20 log_10 |X(mL, k)|
Spectrogram has been applied widely in ASC, such as music classification [60], language identification [61], and acoustic scene classification [62], etc.
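A minimal sketch of the discrete STFT and the log-magnitude spectrogram described above, using SciPy; the window length, overlap, and the stand-in signal are assumptions for illustration only.

```python
import numpy as np
from scipy import signal

fs = 16000
x = np.random.randn(5 * fs)                     # stand-in for a 5 s recording

# Discrete STFT (Eq. 2.14b): Hann-windowed frames with 50% overlap
f, t, Zxx = signal.stft(x, fs=fs, window="hann", nperseg=512, noverlap=256)

# Log-magnitude spectrogram; the small constant avoids log(0)
spec_db = 20 * np.log10(np.abs(Zxx) + 1e-10)
print(spec_db.shape)                            # (frequency_bins, time_frames)
```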
Classification models
In this section, the traditional models in ML are introduced These models are based on a simple analytical approach and can be adequate for classification assignments with limited data sizes
Naïve Bayes (NB) is an algorithm that applies Bayes' theorem with the "naive" assumption of conditional independence, which posits that each feature is independent of every other feature when the class variable is known. According to Bayes' theorem, this relationship for the class variable y and the dependent feature vector (x_1, ..., x_n) can be written as

P(y | x_1, ..., x_n) ∝ P(y) ∏_{i=1}^{n} P(x_i | y)
Naive Bayes (NB) algorithms require minimal training data to identify key parameters, thanks to their straightforward assumptions Despite their simplicity, these algorithms have proven effective in real-world applications, including document classification and spam filtering.
K-Nearest Neighbor (KNN) is a simple supervised learning algorithm applied in regression and classification tasks but is mostly used for classification [66]. KNN categorizes a new data point by comparing it with existing archived data and assigns the new case to the most similar available category. Usually, the nearest neighbor is determined based on the Euclidean distance, computed as the root of the squared differences between the n-dimensional coordinates of two data points:

d(p, q) = sqrt(∑_{i=1}^{n} (p_i − q_i)^2)
K-Nearest Neighbors (KNN) offers several advantages, including simplicity, robustness to noisy training data, and high efficiency when handling large datasets However, its main drawback is the significant computational cost associated with calculating distances for all training data samples.
A Decision Tree (DT) is a tree-structured model primarily utilized for classification tasks, although it can also be applied in regression In a DT, decision nodes represent the features of the dataset, branches indicate decision rules, and leaf nodes signify the outcomes The prediction process begins at the root node, where the algorithm compares the root attribute values with the data record's attributes Based on this comparison, the algorithm follows a branch to the next node, repeating the comparison with subsequent sub-nodes until it arrives at a leaf node, thus determining the class of the data sample.
Compared with other algorithms, DT has the advantages of being simple and easy to understand, and it places lower requirements on data cleaning. The disadvantage of DT, however, is that it can have many layers, making it complicated and prone to overfitting, an issue that can be mitigated by using the Random Forest (RF) algorithm.
RF is a classification model that contains many DTs on different subsets of the provided dataset and takes the average to increase the dataset's predictive efficiency
The RF prediction process, as shown in Fig 2.2, enhances accuracy by aggregating predictions from multiple decision trees rather than relying on a single tree This ensemble method determines the final outcome based on the majority vote from the trees, resulting in improved efficiency and reduced risk of overfitting as the number of trees increases.
Random Forest (RF) offers higher accuracy and mitigates overfitting issues when a sufficient number of trees are utilized, making it a robust choice compared to Decision Trees (DT) However, the primary drawback of RF is that an increased number of trees can lead to longer computation times, rendering it less efficient for real-time predictions.
Fig 2.2 RF prediction process example
Support Vector Machine (SVM) is designed to identify the optimal hyperplane that effectively separates n-dimensional space into distinct categories for accurate classification of new data points This hyperplane is determined by support vectors, which are the closest points from both classes SVM focuses on maximizing the margin, defined as the distance between the support vectors and the hyperplane, with the goal of finding the optimal hyperplane that provides the largest margin There are two main types of SVM: linear SVM, suitable for linearly separable data, and nonlinear SVM, used for data that cannot be separated linearly.
Fig 2.3 An SVM model in a binary classification problem
Support Vector Machines (SVM) excel in scenarios where there is a clear margin of separation between classes, particularly in high-dimensional spaces where the number of dimensions exceeds the number of samples However, SVMs are less effective for large datasets and struggle with noisy datasets, limiting their applicability in certain contexts.
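The five traditional classifiers above are available off the shelf in scikit-learn. The sketch below fits each of them on a toy feature matrix; the data, hyperparameters, and the use of scikit-learn itself are illustrative assumptions rather than the setup used in this thesis.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Toy data: e.g., 13 MFCC statistics per clip with binary labels
X = np.random.randn(200, 13)
y = np.random.randint(0, 2, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "NB":  GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),   # Euclidean distance by default
    "DT":  DecisionTreeClassifier(random_state=0),
    "RF":  RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(kernel="rbf"),                     # nonlinear SVM
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, model.score(X_te, y_te))
```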
Deep Neural Networks (DNNs) are structured like interconnected neurons in the human brain, operating in a cascade format that facilitates data flow through connected layers Utilizing a back-propagation technique, DNNs adjust the weights between nodes to ensure accurate outputs from input data Compared to traditional machine learning models, Deep Learning (DL) models demonstrate superior accuracy and performance, leveraging extensive datasets and complex multi-layered architectures This thesis introduces three types of DL models: Multi-Layer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs).
A Multi-Layer Perceptron (MLP) is a type of feed-forward neural network characterized by interconnected neurons that transmit data from one layer to the next It comprises an input layer for receiving data, one or more hidden layers that serve as the core processing units, and an output layer that delivers predictions or classification results Data flows in a forward direction from the input to the output layer, with weights being adjusted through the back-propagation learning algorithm during training MLPs are capable of approximating continuous functions and effectively solving nonlinear problems, making them particularly useful in applications such as pattern classification, recognition, prediction, and approximation in various fields.
The architecture of a multi-layer perceptron (MLP) can be represented mathematically: the output of an n-layer network with input x is expressed as

y = f(W_n f(W_{n−1} ⋯ f(W_1 x + b_1) ⋯ + b_{n−1}) + b_n)

where W_i is the weight matrix of the ith layer, b_i is the vector of bias values for each node in the ith layer, and f denotes the nonlinear activation function. Common choices include the sigmoid function, the hyperbolic tangent function (tanh), and the rectified linear unit (ReLU), defined in Eqs. (2.19a)-(2.19c):

sigmoid(x) = 1 / (1 + e^{−x})   (2.19a)
tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})   (2.19b)
ReLU(x) = max(0, x)   (2.19c)

Their graphical representations are illustrated in Fig 2.5.
Fig 2.5 The plot of the common activation functions
MLP has been widely applied in tasks such as data compression [72] and time-series prediction.
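A minimal NumPy sketch of the layer-by-layer MLP computation and the three activation functions of Eqs. (2.19a)-(2.19c); the layer sizes and random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):            # Eq. (2.19a)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):               # Eq. (2.19b)
    return np.tanh(x)

def relu(x):               # Eq. (2.19c)
    return np.maximum(0.0, x)

def mlp_forward(x, layers, activation=relu):
    """Forward pass y = f(W_n ... f(W_1 x + b_1) ... + b_n) over a list of (W, b) pairs."""
    h = x
    for W, b in layers:
        h = activation(W @ h + b)
    return h

# A small 3-4-2 network with random weights (illustration only)
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),
          (rng.standard_normal((2, 4)), np.zeros(2))]
print(mlp_forward(rng.standard_normal(3), layers))
```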
CNN is a DL algorithm inspired by the construction and functions of the visual cortex and designed to imitate the connectivity pattern of neurons in the human brain
Convolutional Neural Networks (CNNs) are extensively used for identifying and classifying features in images, and their application extends to acoustic signal processing, including speech recognition and sound classification A typical CNN comprises two primary components: the convolution layer and the fully connected (FC) layer The convolution layer extracts different features from the analyzed image, while the FC layer processes this output to create the most accurate representation of the image.
The convolution layer is the initial component of a Convolutional Neural Network (CNN), responsible for extracting features from input images. It utilizes learnable filters to capture pixel correlations by sliding over the image both vertically and horizontally, a process defined by the stride, which determines the movement of the filter. The output size of the feature map is influenced by several factors, including the original image dimensions (W × H), the kernel size (K), the stride value (S), and the amount of zero padding (P). For each spatial dimension, this relationship can be mathematically represented as

output size = (W − K + 2P)/S + 1
The output matrix, denoted as g(x, y), is derived from the input matrix A by applying a filter kernel W. This relationship is mathematically expressed as

g(x, y) = (W ∗ A)(x, y) = ∑_{s=−a}^{a} ∑_{t=−b}^{b} W(s, t) A(x − s, y − t)

where the indices s and t of the filter kernel satisfy −a ≤ s ≤ a and −b ≤ t ≤ b.
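To make the two relations above concrete, the following sketch computes the feature-map size along one spatial dimension and applies a 2-D convolution with SciPy; the input image, kernel values, and stride are illustrative assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_output_size(size, kernel, stride, padding):
    """Feature-map size along one dimension: (W - K + 2P) / S + 1."""
    return (size - kernel + 2 * padding) // stride + 1

# A 224x224 input, 3x3 kernel, stride 1, no zero padding -> 222x222 feature map
print(conv_output_size(224, 3, 1, 0))

# 2-D convolution g = W * A on a toy input with a simple vertical-edge kernel
A = np.random.rand(8, 8)
W = np.array([[-1, 0, 1],
              [-1, 0, 1],
              [-1, 0, 1]])
g = convolve2d(A, W, mode="valid")
print(g.shape)                                   # (6, 6)
```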
Evaluation metrics
This study employs evaluation criteria including Precision (Pre), Sensitivity (Sen), F1_score (F1_s), and Overall accuracy (Acc), as outlined in Equations (2.25a-2.25d) To calculate these metrics, it is essential to determine the counts of true positives (TP), false positives (FP), and false negatives (FN) for each class.
TP: the true positives of a class are the number of samples labeled with this class that are correctly predicted as this class.
FP: the false positives of a class are the number of samples of other classes that are incorrectly predicted as this class.
FN: the false negatives of a class are the number of samples labeled with this class that are incorrectly predicted as other classes.
Pre = TP / (TP + FP)   (2.25a)
Sen = TP / (TP + FN)   (2.25b)
F1_s = 2 × Pre × Sen / (Pre + Sen)   (2.25c)
Acc = Total number of correct predictions / Total number of testing entries   (2.25d)
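The metrics of Eqs. (2.25a)-(2.25d) can be computed directly from the predicted and true labels, as in the short sketch below; the example labels are arbitrary.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes):
    """Precision, sensitivity, and F1_score per class, following Eqs. (2.25a)-(2.25c)."""
    results = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        pre = tp / (tp + fp) if (tp + fp) else 0.0
        sen = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * pre * sen / (pre + sen) if (pre + sen) else 0.0
        results.append((pre, sen, f1))
    return results

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(per_class_metrics(y_true, y_pred, 3))
print("Acc:", np.mean(y_true == y_pred))         # Eq. (2.25d)
```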
Summary
Sound signal classification relies on the careful extraction and analysis of its features, which serve as essential descriptors for accurate classification The initial step involves selecting and optimizing these features to minimize computational costs while maintaining accuracy Once the features are extracted, the classification model employs them to accurately label unknown signals based on established classes An effective classification model should strike a balance between simplicity and high performance to achieve reliable results.
Mel-frequency cepstral coefficients (MFCCs) are widely utilized in traditional ASC models, but deep learning (DL) approaches favor spectrograms due to their compact representation and reduced data requirements. While waveforms eliminate the need for handcrafted features, they demand higher computational resources and data. Some research indicates that waveforms can be highly effective with DL algorithms. However, questions remain regarding the suitability of mel spectrograms for sound classification in DL models and the circumstances under which alternative features should be employed.
Deep learning (DL) models have increasingly supplanted traditional machine learning (ML) models in audio signal processing tasks These DL models excel at modeling time sequences, effectively addressing sequence labeling and transducing challenges While convolutional neural networks (CNNs) utilize a fixed receptive field that limits the temporal context for predictions, they allow for easy adjustments to the context size In contrast, recurrent neural networks (RNNs) can leverage an unlimited time context for predictions but require model adaptations, such as long short-term memory (LSTM) units, to achieve this, which complicates direct control over context size Additionally, RNNs necessitate sequential input processing, resulting in slower training and evaluation times on modern hardware compared to CNNs.
Location estimation of receiver in an audio room
Introduction
The intensity of sound from a fixed source is influenced by the position of the receivers, as sound intensity decreases with the square of the distance in a lossless medium. Additionally, factors such as the directional sensitivity of the receiver and its physical presence in the sound field can affect the perceived intensity of the sound. Consequently, accurately estimating the location of the sound receiver is essential for capturing specific audio signals, which can also support further research in ASC.
Previous studies on sound source distance estimation have predominantly utilized handcrafted features Bronkhorst et al introduced a method leveraging room impulse response for distance assessment, while Lu et al developed a binaural evaluation technique based on the direct-to-reverberant ratio, which involves identifying the sound source's direction and isolating the reverberant signal Rodemann et al explored various audio features, including interaural intensity and temporal variations, discovering that mean signal amplitude and binaural cues can yield reliable distance estimations in specific contexts Additionally, Huang et al proposed a technique for estimating distances in single-channel audio by analyzing phase interference between observed and pseudo-observed signals Despite these advancements, reliance on handcrafted features can complicate the extraction process, and the accuracy of these methods remains inadequate, necessitating further improvement.
Research on the spatial relationship between sound sources and receivers has predominantly concentrated on sound source localization, focusing on aspects like azimuth, elevation, and distance However, studies examining the receiver's location are relatively scarce Given that many sound systems feature fixed sound sources, the audio signal received is influenced primarily by the acoustic environment and the receiver's position Consequently, accurately estimating the receiver's location is essential for achieving optimal audio quality and desired sound signals.
In recent years, deep neural networks (DNNs) have gained prominence in sound source localization methods Huang et al successfully utilized DNNs to determine sound pressure ranges, while Yiwere et al proposed a CRNN to assess source locations in known environments Yalta et al demonstrated the effectiveness of DNNs in localizing sound sources using a microphone array in reverberant settings Takeda et al addressed sound source localization through DNNs and discriminative training, showcasing the impressive performance of deep learning models in audio signal processing A literature survey highlights the ongoing challenges in sound source localization, particularly in identifying suitable audio signal features and enhancing accuracy.
In this chapter, three rectangular audio rooms were modeled using the image source model (ISM), simulating their shape, dimensions, and surface materials. A fixed sound source within the room emits audio signals, which are captured by a receiver. The received audio is transformed into a spectrogram through the Short-Time Fourier Transform (STFT), enabling the application of a Convolutional Neural Network (CNN) to estimate the receiver's location as an audio signal classification (ASC) problem. This method eliminates the need for complex feature extraction techniques, allowing the CNN to automatically identify essential features for classification. The proposed approach demonstrates high performance, achieving over 93% testing accuracy in both simulations and experiments.
Methodology
The sound receiver's location estimation framework consists of two main phases: training and testing During the training phase, data is input as a signal and label, followed by segmentation and feature extraction through a spectrogram, culminating in the training of a CNN model In the testing phase, signals are segmented, and the trained CNN model predicts the receiver's location The accuracy of the prediction model is then assessed using specific evaluation parameters.
Fig 3.1 The main framework of the sound receiver's location estimation
A classification model is designed with reference to the models in previous studies
The classification model features two convolutional layers equipped with learnable filters and a fully connected (FC) layer The first convolutional layer utilizes 16 filters of size 3×3 pixels, while the second layer employs 32 filters of the same size Each layer's output is activated using the ReLU function, followed by max-pooling, which reduces dimensions by selecting the maximum value from a 5×5-pixel window After max-pooling, the feature maps are reshaped, and a dropout layer randomly sets input elements to zero based on a specified probability The output from the dropout layer is then fed into the FC layer, which consists of 100 neurons, culminating in classification through the softmax function.
Fig 3.2 Description of the proposed CNN architecture
The loss function the network used for training is the cross-entropy (CE), which can be expressed as follows:

CE = − ∑_{i=1}^{C} t_i log(p_i)   (3.1)

where C is the number of classes in the dataset, and t_i and p_i are the true and predicted labels (class probabilities), respectively.
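The architecture described above was implemented in MATLAB in this thesis; as a rough, hedged equivalent, a Keras sketch is given below. The dropout rate, the activation of the 100-neuron FC layer, and the number of output classes are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

num_classes = 27                                   # e.g., Room C with 3 x 3 x 3 location classes

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),             # spectrogram image input
    layers.Conv2D(16, (3, 3), activation="relu"),  # first convolution layer, 16 filters
    layers.MaxPooling2D(pool_size=(5, 5)),
    layers.Conv2D(32, (3, 3), activation="relu"),  # second convolution layer, 32 filters
    layers.MaxPooling2D(pool_size=(5, 5)),
    layers.Flatten(),
    layers.Dropout(0.2),                           # dropout probability is an assumption
    layers.Dense(100, activation="relu"),          # FC layer with 100 neurons
    layers.Dense(num_classes, activation="softmax"),
])

# Cross-entropy loss of Eq. (3.1) with the Adam optimizer and the learning rate of Section 3.3
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-6),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```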
The CNN model was employed to classify audio signals based on the receiver's location within three simulated rooms Through practical experimentation, a suitable set of hyperparameters was identified and utilized during the training process, ensuring stability and high accuracy This same set of hyperparameters was subsequently applied in an experimental room.
Simulation
The simulation room is represented as a three-dimensional space defined by six rectangular faces with dimensions Lx × Ly × Lz. Within this room, a sound source is positioned at p_s = [x_s, y_s, z_s]^T, while an acoustic receiver is located at p_r = [x_r, y_r, z_r]^T, oriented in the direction d_r = [a_r, e_r]^T. Here, [x_s, y_s, z_s] and [x_r, y_r, z_r] denote the coordinates of the sound source and receiver, respectively, and [a_r, e_r] indicates the receiver's azimuth and elevation angles.
Fig 3.3 Configuration of the simulation room
In the current study, three rectangular simulation rooms were generated with different dimensions and face materials based on ISM; they are denoted as Rooms A,
B, and C, respectively. Room A is the largest with dimensions of 20 m × 10 m × 5 m, like a hall; Room B is 6 m × 5 m × 4 m, like a classroom; Room C is 4 m × 3 m × 3 m, representing a small discussion room. The dimensions, materials, location of the sound source, and receivers of these simulation rooms are listed in Table 3.1.
The faces are characterized by frequency-dependent absorption coefficients that can be selected from Hall [94] Table 3.2 lists the absorption coefficients of the face materials for each frequency band
Binaural room impulse responses (BRIRs) and audio signals can be acquired from modeled simulation rooms that feature specific sound source and receiver placements Figure 3.4 displays the BRIR outcomes for the receiver positioned at three distinct locations within Rooms A, B, and C.
The received audio signal in a room varies based on the receiver's location, as illustrated in Fig 3.4 In Room A, for instance, distances of 12.04 m, 8.08 m, and 5.10 m from the sound source result in differing amplitudes, with greater distances yielding lower sound amplitudes due to sound attenuation over distance Factors such as air absorption and surface interactions further reduce sound intensity, while the reverberation time remains constant regardless of distance Room size also influences sound characteristics; larger rooms, like Room A, exhibit smaller sound amplitudes and longer reverberation times, whereas smaller rooms, such as Room C, demonstrate larger amplitudes and shorter reverberation times.
Table 3.1 Dimensions, face materials, the sound source’s location, and the receiver’s location of the simulation rooms
(a_r, e_r) = (−90°, 90°); sampling frequency f_s = 44,100 Hz
Table 3.2 Absorption coefficients of face materials depend on the frequency band
Fig 3.4 BRIR results of the receiver at the three different locations in (a) Room A, (b) Room B, and (c) Room C
The simulation rooms, as illustrated in Fig 3.5, were segmented into m×n×k smaller rectangles to represent various receiver locations Each class contains 50 audio signals in WAV format, collected from 50 randomly chosen receiver positions Consequently, the dataset for each simulation room comprises 50×m×n×k audio signals Table 3.3 details the number of classes and corresponding audio signals for the three simulation rooms.
Fig 3.5 Receiver's location division classes in the simulation rooms
Table 3.3 The number of classes and audio signals for each simulation room
The audio signals were segmented into ten 5-second intervals with a 50% overlap, and corresponding labels were assigned to each segment. These segments were then analyzed using the Short-Time Fourier Transform (STFT). The number of frequency points for calculating the Discrete Fourier Transform (DFT) is the greater of 256 or the next power of two exceeding the segment length. The results of the STFT analysis are visually represented as a spectrogram in logarithmic scale, expressed as

s(m, k) = 20 log_10 |X(m, k)|   (3.2)
Feature extraction involves setting a threshold to eliminate data below a specified level, enhancing the clarity of the spectrogram's features. This process removes small-amplitude values that would otherwise act as noise, thereby emphasizing the signal's characteristics. In conditions akin to an office environment, where noise levels range from 40 dB to 60 dB, we tested threshold values of 40 dB, 50 dB, and 60 dB. The results indicated that a threshold of 50 dB yielded high and stable accuracy, leading to its selection for this study. The thresholding algorithm is defined as

s' = { s,  for s ≥ s_th
       0,  otherwise }   (3.3)

where s is the spectrogram amplitude and s_th = 50 dB is the threshold value.
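A brief Python sketch of this feature-extraction step, combining the STFT of Eq. (3.2) with the thresholding of Eq. (3.3); the window length and the stand-in signal are illustrative assumptions.

```python
import numpy as np
from scipy import signal

fs = 44100
x = np.random.randn(5 * fs)                 # stand-in for one 5-second audio segment

# STFT; DFT length is the greater of 256 and the next power of two above the window length
nperseg = 1024                              # window length (assumption)
nfft = max(256, 2 ** int(np.ceil(np.log2(nperseg))))
f, t, Zxx = signal.stft(x, fs=fs, nperseg=nperseg, nfft=nfft)
s = 20 * np.log10(np.abs(Zxx) + 1e-10)      # spectrogram amplitude in dB, Eq. (3.2)

# Thresholding of Eq. (3.3): keep values at or above 50 dB, zero the rest
s_thresholded = np.where(s >= 50, s, 0)
```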
Figure 3.6 illustrates the spectrograms of three simulation rooms, comparing the results with and without a thresholding algorithm The application of this thresholding technique significantly improves the performance of the spectrograms, leading to enhanced accuracy in the CNN model.
Fig 3.6 Spectrogram without and with the threshold in the feature extraction of an audio signal in (a) Room A, (b) Room B, and (c) Room C
The dataset images, each with a resolution of 224×224×3 pixels, were divided into training and testing sets, with 70% allocated for training and the remaining 30% for testing. The adaptive moment estimation (Adam) optimizer was utilized, and various training tests were conducted to identify optimal hyperparameters for achieving high accuracy and stable training. The selected hyperparameters included a learning rate of 10^-6, 30 epochs, and a batch size of 100. The convolutional neural networks (CNNs) were developed and executed using MATLAB R2020a on a computer featuring an Intel Core i9-9000K 3.6 GHz CPU and an NVIDIA GTX 1080Ti GPU.
In each simulation room, the input dataset consists of 500×m×n×k images, categorized into m×n×k classes based on the receiver's location, labeled from "1-1-1" to "m-n-k" in the directions of Rooms x, y, and z The details regarding the number of input images, class distribution, and training outcomes for the simulation rooms are illustrated in Fig 3.7 and Table 3.4 The audio signal identification accuracies achieved are 98.74%, 99.28%, and 99.38% for Rooms A, B, and C, respectively Additionally, the accuracy and loss curves indicate a stable training process across all three rooms.
Fig 3.7 Accuracy and loss curves of the training progress of the simulation rooms: (a) Room A, (b) Room B, (c) Room C
Table 3.4 Overall accuracy and training time of the simulation rooms
Room A has the longest training time, requiring 57 minutes and 5 seconds due to its extensive input data and classes, while Room C boasts the shortest training duration at just 25 minutes and 49 seconds.
Tables 3.5-3.7 detail the Precisions, Sensitivities, and F1_scores for the simulation rooms, indicating high classification accuracy across all classes In Room A, Precisions range from 92.59% to 100%, Sensitivities from 84.67% to 100%, and F1_scores from 91.70% to 100% Room B exhibits Precisions between 93.67% and 100%, Sensitivities from 94.67% to 100%, and F1_scores ranging from 95.95% to 100% Room C shows Precisions from 93.59% to 100%, Sensitivities between 90.67% and 100%, and F1_scores from 93.79% to 100% Overall, these findings demonstrate that all three simulation rooms achieve high accuracy, with closely aligned class performance.
Table 3.5 Precision, Sensitivity, and F1_score of Room A
Class Precision Sesitivity F1_score Class Precision Sesitivity F1_score 1-1-1 100.00% 100.00% 100.00% 3-3-1 98.04% 100.00% 99.01% 1-1-2 97.24% 94.00% 95.59% 3-3-2 100.00% 100.00% 100.00% 1-1-3 93.42% 94.67% 94.04% 3-3-3 100.00% 100.00% 100.00% 1-2-1 100.00% 100.00% 100.00% 3-4-1 97.30% 96.00% 96.64% 1-2-2 93.71% 99.33% 96.44% 3-4-2 100.00% 100.00% 100.00% 1-2-3 100.00% 94.67% 97.26% 3-4-3 100.00% 100.00% 100.00% 1-3-1 100.00% 98.67% 99.33% 4-1-1 98.68% 100.00% 99.34% 1-3-2 99.32% 97.33% 98.32% 4-1-2 98.04% 100.00% 99.01% 1-3-3 92.59% 100.00% 96.15% 4-1-3 100.00% 100.00% 100.00% 1-4-1 98.68% 100.00% 99.34% 4-2-1 97.37% 98.67% 98.01% 1-4-2 100.00% 100.00% 100.00% 4-2-2 100.00% 98.00% 98.99% 1-4-3 100.00% 98.67% 99.33% 4-2-3 100.00% 100.00% 100.00% 2-1-1 100.00% 100.00% 100.00% 4-3-1 98.68% 100.00% 99.34% 2-1-2 99.34% 100.00% 99.67% 4-3-2 98.67% 98.67% 98.67% 2-1-3 100.00% 98.67% 99.33% 4-3-3 100.00% 100.00% 100.00% 2-2-1 95.54% 100.00% 97.72% 4-4-1 100.00% 100.00% 100.00% 2-2-2 100.00% 95.33% 97.61% 4-4-2 100.00% 100.00% 100.00% 2-2-3 97.40% 100.00% 98.68% 4-4-3 100.00% 100.00% 100.00% 2-3-1 98.60% 94.00% 96.25% 5-1-1 100.00% 97.33% 98.65% 2-3-2 97.40% 100.00% 98.68% 5-1-2 100.00% 98.00% 98.99% 2-3-3 100.00% 97.33% 98.65% 5-1-3 98.04% 100.00% 99.01% 2-4-1 96.05% 97.33% 96.69% 5-2-1 100.00% 100.00% 100.00% 2-4-2 100.00% 98.67% 99.33% 5-2-2 93.75% 100.00% 96.77% 2-4-3 100.00% 100.00% 100.00% 5-2-3 100.00% 100.00% 100.00% 3-1-1 100.00% 98.67% 99.33% 5-3-1 100.00% 100.00% 100.00% 3-1-2 100.00% 100.00% 100.00% 5-3-2 100.00% 84.67% 91.70% 3-1-3 100.00% 100.00% 100.00% 5-3-3 100.00% 99.33% 99.67% 3-2-1 100.00% 96.67% 98.31% 5-4-1 98.68% 100.00% 99.34% 3-2-2 98.04% 100.00% 99.01% 5-4-2 92.59% 100.00% 96.15% 3-2-3 100.00% 100.00% 100.00% 5-4-3 100.00% 100.00% 100.00%
Table 3.6 Precision, Sensitivity, and F1_score of Room B Class Precision Sesitivity F1_score Class Precision Sesitivity F1_score 1-1-1 100.00% 98.67% 99.33% 3-1-1 99.33% 98.67% 99.00% 1-1-2 98.68% 100.00% 99.34% 3-1-2 98.68% 99.33% 99.00% 1-1-3 97.26% 94.67% 95.95% 3-1-3 100.00% 100.00% 100.00% 1-2-1 98.68% 100.00% 99.34% 3-2-1 100.00% 99.33% 99.67% 1-2-2 97.37% 98.67% 98.01% 3-2-2 100.00% 100.00% 100.00% 1-2-3 98.65% 97.33% 97.99% 3-2-3 99.34% 100.00% 99.67% 1-3-1 100.00% 98.67% 99.33% 3-3-1 100.00% 100.00% 100.00% 1-3-2 100.00% 100.00% 100.00% 3-3-2 100.00% 100.00% 100.00% 1-3-3 98.04% 100.00% 99.01% 3-3-3 100.00% 100.00% 100.00% 2-1-1 100.00% 100.00% 100.00% 4-1-1 100.00% 100.00% 100.00% 2-1-2 100.00% 95.33% 97.61% 4-1-2 100.00% 100.00% 100.00% 2-1-3 93.67% 98.67% 96.10% 4-1-3 100.00% 100.00% 100.00% 2-2-1 100.00% 100.00% 100.00% 4-2-1 100.00% 100.00% 100.00% 2-2-2 100.00% 100.00% 100.00% 4-2-2 98.68% 100.00% 99.34% 2-2-3 100.00% 97.33% 98.65% 4-2-3 100.00% 100.00% 100.00% 2-3-1 100.00% 100.00% 100.00% 4-3-1 100.00% 100.00% 100.00% 2-3-2 100.00% 98.67% 99.33% 4-3-2 100.00% 98.67% 99.33% 2-3-3 96.15% 100.00% 98.04% 4-3-3 100.00% 100.00% 100.00%
Table 3.7 Precision, Sensitivity, and F1_score of Room C Class Precision Sesitivity F1_score Class Precision Sesitivity F1_score 1-1-1 100.00% 100.00% 100.00% 2-2-3 100.00% 100.00% 100.00% 1-1-2 100.00% 100.00% 100.00% 2-3-1 100.00% 100.00% 100.00% 1-1-3 98.68% 100.00% 99.34% 2-3-2 97.40% 100.00% 98.68% 1-2-1 100.00% 100.00% 100.00% 2-3-3 100.00% 97.33% 98.65% 1-2-2 93.59% 97.33% 95.42% 3-1-1 100.00% 100.00% 100.00% 1-2-3 97.14% 90.67% 93.79% 3-1-2 100.00% 100.00% 100.00% 1-3-1 100.00% 100.00% 100.00% 3-1-3 100.00% 100.00% 100.00% 1-3-2 98.68% 100.00% 99.34% 3-2-1 100.00% 100.00% 100.00% 1-3-3 100.00% 100.00% 100.00% 3-2-2 100.00% 100.00% 100.00% 2-1-1 100.00% 100.00% 100.00% 3-2-3 100.00% 100.00% 100.00% 2-1-2 98.04% 100.00% 99.01% 3-3-1 100.00% 100.00% 100.00% 2-1-3 100.00% 98.67% 99.33% 3-3-2 100.00% 100.00% 100.00% 2-2-1 100.00% 100.00% 100.00% 3-3-3 100.00% 100.00% 100.00% 2-2-2 100.00% 99.33% 99.67%
The confusion matrix for Room C, illustrated in Fig 3.8, displays the relationship between actual and predicted class instances, with valid predictions indicated along the diagonal and invalid ones outside of it The model achieves an impressive overall accuracy of 99.38%, with sensitivity levels for the 27 classes being notably high; 22 classes reach a perfect sensitivity of 100%, while the lowest sensitivity recorded is 90.67% for Classes 1-2-3.
Adjacent classes exhibit a higher degree of similarity, as evidenced by confusion samples within them For instance, in class 1-2-2, four samples out of 150 were misclassified into the adjacent class 1-2-3 Similarly, class 1-2-3 contained 14 confusion samples, with two misclassified as 1-1-3, ten as 1-2-2, and two as 1-3-2 This increased confusion occurs because receiver locations that are close together tend to capture more similar audio signals compared to those that are farther apart.
The classification model demonstrates exceptional stability and accuracy in estimating receiver locations across all three simulation rooms, thanks to the optimized hyperparameters The next phase involves applying this model in experimental settings.
Experiment
The experiment was conducted in a facility at Feng Chia University, Taiwan. The sound source is a loudspeaker, and the receiver is a Zoom H6 handy recorder using an XY microphone with rotating mics positioned at 120 degrees. As illustrated in Fig 3.9, the experimental setup includes the source and receiver within a defined room. Table 3.8 details the dimensions and face materials of the room, and the specific locations of both the source and receiver. Audio signals from the source were recorded and categorized into 18 distinct classes, corresponding to the receiver's locations, labeled from class 1-1-1 to class 3-3-2.
Fig 3.9 Experiment room with the sound source and receiver
Table 3.8 Parameters of the experiment room
The dataset comprises 5,148 segments of 5-second audio, categorized into 18 classes based on the receiver's locations, ranging from 1-1-1 to 3-3-2 The audio signals, sampled at a frequency of 44,100 Hz, are analyzed using spectrograms for feature extraction to facilitate classification Figure 3.10 illustrates the spectrograms of the audio signals, both with and without a threshold applied.
Fig 3.10 Spectrogram of an audio signal of the experiment room
The CNN model was trained using a dataset divided into 70% for training and 30% for testing Figure 3.11 illustrates the training trend, along with the accuracy and loss curves observed during the training phase prior to hyperparameter adjustments.
The audio signal identification accuracy achieved is 93.67%, with smooth accuracy and loss curves indicating a stable training process The experiment room's performance metrics, detailed in Table 3.9, show Precisions between 86.60% and 98.77%, Sensitivities ranging from 88.37% to 100%, and F1_scores from 90.06% to 97.73% These findings demonstrate a high overall accuracy in the experiment room, with closely aligned class accuracies across different rooms.
Fig 3.11 Accuracy and loss curves of the training progress of the experiment room
Table 3.9 Precision, Sensitivity, and F1_score of the experiment room Class Precision Sesitivity F1_score Class Precision Sesitivity F1_score 1-1-1 95.35% 95.35% 95.35% 2-2-2 90.59% 89.53% 90.06% 1-1-2 95.56% 100.00% 97.73% 2-3-1 90.91% 93.02% 91.95% 1-2-1 95.00% 88.37% 91.57% 2-3-2 95.12% 90.70% 92.86% 1-2-2 93.90% 89.53% 91.67% 3-1-1 96.39% 93.02% 94.67% 1-3-1 96.25% 89.53% 92.77% 3-1-2 95.29% 94.19% 94.74% 1-3-2 86.60% 97.67% 91.80% 3-2-1 95.40% 96.51% 95.95% 2-1-1 92.31% 97.67% 94.92% 3-2-2 95.51% 98.84% 97.14% 2-1-2 87.23% 95.35% 91.11% 3-3-1 96.20% 88.37% 92.12% 2-2-1 92.13% 95.35% 93.71% 3-3-2 98.77% 93.02% 95.81%
The confusion matrix presented in Fig 3.12 demonstrates that the sensitivity across different classes is relatively uniform, indicating strong performance of the model Similar to findings in simulation rooms, adjacent classes show a higher degree of correlation For instance, among 86 samples in class 1-1-1, four were misclassified as the adjacent class 1-2-1 This outcome reinforces the effectiveness of the proposed CNN model with optimal hyperparameters for sound classification, utilizing spectrogram feature extraction Consequently, the receiver's location within the audio room was estimated with high accuracy.
Summary
This chapter presents a method for estimating a receiver's location within an audio room using a designed Convolutional Neural Network (CNN). Three rectangular audio simulation rooms, varying in size and surface materials, were constructed based on the image source model (ISM).
In our study, we recorded audio signals from a fixed source at various locations within simulation rooms, utilizing a CNN model with optimized hyperparameters to accurately estimate the receiver's location The CNN architecture included two convolutional layers, a fully connected layer, and an output layer, achieving impressive audio signal identification accuracies of 98.74%, 99.28%, and 99.38% across three simulation setups with 60, 36, and 27 classes, respectively Additionally, when dividing the receiver's location into 18 classes, the model attained a high identification accuracy of 93.67% Notably, we observed that audio signals varied significantly at different receiver locations, with signals farther from the source exhibiting lower amplitudes, while reverberation time remained consistent This differentiation in audio signals facilitates effective classification based on the receiver's location, and the use of a CNN for location estimation, leveraging spectrogram feature extraction, resulted in remarkably high accuracy.
The findings can aid in determining the location of a sound receiver within an audio system, enhancing the design of sound systems Future research will focus on estimating the receiver's position in more intricate sound environments, including multisource and multiroom settings, utilizing advanced deep learning techniques like recurrent neural networks (RNN).
Abnormality detection in water pumps based on sound analysis
Introduction
Methodology
Fig 4.1 The main framework of the CNN model for machine abnormality detection using sound signals
The machine abnormality detection method using sound consists of two key phases: training and testing During the training phase, raw signals are labeled and pre-processed to reduce noise, followed by feature extraction using Short-Time Fourier Transform (STFT) to create a mel-spectrogram, resulting in a trained Convolutional Neural Network (CNN) model In the testing phase, raw signals undergo similar pre-processing to eliminate noise, and features are extracted from the processed signals The trained CNN model is then utilized to classify the sound signals effectively.
The MIMII dataset, utilized for investigating industrial machinery and fault detection, comprises sound signals recorded from three individual water pumps These 16-bit sound signals were sampled at 16 kHz in a reverberant environment, capturing both normal sounds (ranging from 5,000 to 10,000 seconds) and abnormal sounds (approximately 1,000 seconds) To accurately reflect real-world conditions, various fault sounds were recorded, and the sound signals were segmented into 5-second intervals for dataset construction and labeling as normal or abnormal Detailed information on the datasets and their respective running conditions can be found in Table 4.1.
Table 4.1 Machine type and number of audio signals
This study focused on enhancing sound signals through denoising and smoothing techniques Initially, a low-pass filter was utilized to remove high-frequency noise, followed by the application of a Savitzky–Golay (S–G) filter for further smoothing The processed sound signals were subsequently divided into 5-second audio segments to create a comprehensive dataset.
In sound signals, low-frequency elements generally possess higher energy compared to high-frequency elements, leading to a power spectrum density where noise is minimal at low frequencies and more pronounced at high frequencies This disparity results in a higher signal-to-noise ratio (SNR) in the low-frequency range and a lower SNR in the high-frequency range To achieve a more consistent SNR across all frequencies and reduce the negative impacts of SNR, high-frequency elements can be eliminated In this study, a sound signal sampled at 16,000 Hz was processed using a low-pass filter with a cut-off frequency of 800 Hz to effectively remove noise from the signal.
The S–G filter, developed by Savitzky and Golay, is a smoothing technique that enhances signal-to-noise ratio (SNR) while preserving the original signal's characteristics This method involves dividing data points into consecutive subsets, each fitted with a low-order polynomial The final smoothed values are derived from the convolution of these polynomials, utilizing a set of M = 2m + 1 convolution coefficients C, where x serves as the independent variable and y represents the observation values within a dataset of N points.
The implementation of the S–G filter requires three key inputs: the original signal, the polynomial order k, and the frame size f. Typically, the optimal values of k and f are determined through experience. In this study, tests were performed to identify the most suitable input parameters, resulting in the selection of k = 3 and f = 27 for effectively smoothing the audio signal of the machine.
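A minimal SciPy sketch of this pre-processing chain; the low-pass filter order is an assumption (the thesis specifies only the 800 Hz cut-off), while k = 3 and f = 27 follow the values selected above.

```python
import numpy as np
from scipy.signal import butter, filtfilt, savgol_filter

fs = 16000                                  # MIMII sampling rate
x = np.random.randn(5 * fs)                 # stand-in for a raw pump recording

# Low-pass filtering with an 800 Hz cut-off to remove high-frequency noise
b, a = butter(N=4, Wn=800, btype="low", fs=fs)   # filter order N = 4 is an assumption
x_lp = filtfilt(b, a, x)

# Savitzky-Golay smoothing with polynomial order k = 3 and frame size f = 27
x_sg = savgol_filter(x_lp, window_length=27, polyorder=3)
```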
Fig 4.2 The samples of a normal and abnormal signals of three pumps with and without pre-processing
Figure 4.2 illustrates the comparison of normal and abnormal signals from three pumps before and after pre-processing, highlighting that the pre-processed signals exhibit a clearer trend with reduced noise In Pump 1, there is a notable distinction between normal and abnormal signals, where the normal signal is smaller and more stable Conversely, Pumps 2 and 3 show minimal differences between their normal and abnormal signals To address this issue, a CNN model will be developed.
The sound signals are analyzed through Short-Time Fourier Transform (STFT), utilizing 256 frequency points for accurate calculations A mel-spectrogram serves as a visual representation of the STFT, which is subsequently converted into an RGB image to be used as input for the Convolutional Neural Network (CNN) Figure 4.3 illustrates a mel-spectrogram image depicting both normal and abnormal sound signals randomly selected from three different pumps.
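The mel-spectrogram input can be produced, for example, with librosa (not necessarily the toolchain used in this thesis; the file path is a placeholder, and only the 256 frequency points follow the description above):

```python
import numpy as np
import librosa

y, sr = librosa.load("pump_segment.wav", sr=16000)    # placeholder path, one 5 s segment

# Mel spectrogram from an STFT with 256 frequency points, converted to dB
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=256)
S_db = librosa.power_to_db(S, ref=np.max)
print(S_db.shape)   # (n_mels, time_frames); rendered as an RGB image for the CNN input
```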
Fig 4.3 Mel-spectrograms of the normal and abnormal sound signals from the three pumps: (a) Pump 1, (b) Pump 2, (c) Pump 3
AlexNet, a prominent convolutional neural network (CNN) architecture developed by Alex Krizhevsky in 2012, consists of 25 layers, including five convolutional layers and three fully connected layers In this study, we modified the input layer and the final fully connected layer to align with our specific dataset and classification classes, enabling a comparison between the performance of the AlexNet model and our custom CNN architecture.
Fig 4.4 The architecture of AlexNet
This study presents three CNN models designed for machine fault detection, aiming to create an optimal model that balances simplicity and high accuracy Each model features one to three convolution layers with a kernel size of 3 × 3 and an incremental number of filters (8, 16, and 32), which helps reduce computational costs and facilitates efficient backpropagation The architecture of one CNN model is illustrated in Fig 4.5, showcasing each layer's specific function, with input data represented as mel-spectrograms To mitigate overfitting, a dropout layer is incorporated to randomly deactivate units during training Detailed model specifications are provided in Table 4.2, while the hyperparameters governing the CNN, including learning rate, batch size, and number of epochs, are optimized through experience to ensure stable training and high accuracy, as detailed in Table 4.3.
Table 4.2 Parameters of the CNN architecture
Layer      Model 1      Model 2       Model 3
conv1      k(3×3)/c8    k(3×3)/c8     k(3×3)/c8
maxpool1   k(5×5)       k(5×5)        k(5×5)
conv2      –            k(3×3)/c16    k(3×3)/c16
maxpool2   –            k(5×5)        k(5×5)
conv3      –            –             k(3×3)/c32
maxpool3   –            –             k(5×5)
dropout    0.2          0.2           0.2
fc1        100          100           100
fc2        –            50            50
k is the kernel size, and c is the channel number
Table 4.3 The setting hyperparameters of the CNN models
Hyperparameter       Value
Learning rate        1e-5
Number of epochs     60
Batch size           100
Fig 4.5 The architecture of one of the designed CNN models
In classification tasks, datasets are divided into training and testing sets, often resulting in an imbalance in the number of samples across different classes This imbalance can negatively impact the accuracy of minority classes during training To address this issue, the random oversampling technique is employed, which involves adding copies of samples from minority classes to the training set, ensuring that the number of samples in each group is balanced This approach is illustrated in Fig 4.6.
Fig 4.6 Data balancing using the random oversampling technique
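A simple NumPy sketch of random oversampling as described above; the toy class sizes are illustrative.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate random samples of minority classes until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for c, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == c)
            extra = rng.choice(idx, size=target - count, replace=True)
            X_parts.append(X[extra])
            y_parts.append(y[extra])
    return np.concatenate(X_parts), np.concatenate(y_parts)

# Example: 100 normal vs. 20 abnormal feature vectors
X = np.random.randn(120, 8)
y = np.array([0] * 100 + [1] * 20)
X_bal, y_bal = random_oversample(X, y)
print(np.unique(y_bal, return_counts=True))   # both classes now contain 100 samples
```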
Results and discussion
The input images were resized to a resolution of 100×100×3 for processing The convolutional neural networks (CNNs) were programmed and implemented using MATLAB R2020a on a computer with an Intel Core i9-9000K 3.6 GHz CPU and an NVIDIA GTX 1080Ti GPU The optimizer employed for the training was Adam.
4.3.1 Abnormality detection in a known machine
In the initial experiment, both the training and testing data were sourced from a single machine's dataset or a combination of datasets from three different machines The training set comprised 70% of the total dataset, with the remaining 30% allocated for testing Each CNN model was trained and evaluated individually for each pump, followed by an analysis of combined datasets from the same class across all three pumps The classification results for all four CNN models are detailed in Tables 4.4-4.7.
Table 4.4 Classification results of AlexNet for abnormality detection in a known machine
Accuracy Abnormal Normal Abnormal Normal Abnormal Normal
Table 4.5 Classification results of Model 1 for abnormality detection in a known machine
Accuracy Abnormal Normal Abnormal Normal Abnormal Normal
Table 4.6 Classification results of Model 2 for abnormality detection in a known machine
Accuracy Abnormal Normal Abnormal Normal Abnormal Normal
Table 4.7 Classification results of Model 3 for abnormality detection in a known machine
Accuracy Abnormal Normal Abnormal Normal Abnormal Normal
The study reveals that all CNN models achieved impressive overall accuracies ranging from 96.51% to 100%, with Model 3 standing out as the most effective, reaching accuracies between 99.32% and 100%. The models exhibited high Precision, Sensitivity, and F1 scores, reflecting balanced performance across both normal and abnormal classes. Conversely, Model 1 demonstrated the lowest performance, with accuracies between 96.51% and 99.52%, and showed an imbalance in class accuracy; for instance, Pump 2 had a Sensitivity of 100% for the normal class but only 65.00% for the abnormal class. Overall, the high accuracy rates underscore the effectiveness of the proposed method in detecting abnormalities in known machines, particularly when the training and testing datasets originate from the same machine, which increases signal similarity and improves model performance.
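For reference, the per-class Precision, Sensitivity, and F1 scores reported in these tables can be computed from a confusion matrix as in the sketch below, where yTrue and yPred are hypothetical vectors of true and predicted labels.

```matlab
% Sketch: Precision, Sensitivity (recall), and F1 score from a confusion matrix.
C  = confusionmat(yTrue, yPred);         % rows: true classes, columns: predicted classes
tp = diag(C);
precision   = tp ./ sum(C, 1)';          % column sums: samples predicted as each class
sensitivity = tp ./ sum(C, 2);           % row sums: samples truly in each class
f1 = 2 * (precision .* sensitivity) ./ (precision + sensitivity);
accuracy = sum(tp) / sum(C(:));
```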
Figure 4.7 illustrates the confusion matrices for each classification task with the highest accuracy achieved. The matrices reveal that the sensitivity of the normal class surpasses that of the abnormal class in most cases. This discrepancy arises because the prevalence of abnormal signals in the datasets is significantly lower than that of normal signals, or because, when faults occur, they may not be sufficiently distinct from the normal operating state of the pump.
Fig 4.7 Confusion matrices of each trained and tested model (panels include (b) Pump 3 and (d) data mixing of all three pumps)
4.3.2 Abnormality detection in an unknown machine
The operational principles of various pump models are fundamentally similar, producing a quiet sound with stable amplitude and frequency during normal conditions. However, when faults occur, the sound becomes noisier with larger, unstable amplitudes, revealing common acoustic features across different models. CNN models can therefore learn these sound characteristics from one pump to detect abnormalities in another. In this experiment, CNN models were trained using data from one pump and tested on another, utilizing the separate datasets of the three pumps as shown in Table 4.8, which outlines nine distinct pump combinations. For instance, in the Pump 1–Pump 2 combination, the model was trained on Pump 1's data and tested on Pump 2's dataset. Other combinations involved training on the datasets of two pumps and testing on the third. The classification results of all four CNN models applied to the unknown machine are detailed in Tables 4.9-4.12.
Table 4.8 The nine different pump combinations
Item Training data Testing data
Table 4.9 Classification results of AlexNet for abnormality detection in an unknown machine
Accuracy Abnormal Normal Abnormal Normal Abnormal Normal
Table 4.10 Classification results of Model 1 for abnormality detection in an unknown machine
Accuracy Abnormal Normal Abnormal Normal Abnormal Normal
Table 4.11 Classification results of Model 2 for abnormality detection in an unknown machine
Accuracy Abnormal Normal Abnormal Normal Abnormal Normal
Table 4.12 Classification results of Model 3 for abnormality detection in an unknown machine
Accuracy Abnormal Normal Abnormal Normal Abnormal Normal
This experiment revealed that the variations among the three pumps within the same class resulted in lower accuracy compared to the known-machine fault detection scenario. Nevertheless, the classification models designed in this study demonstrated higher and more consistent accuracy than the AlexNet model.
Item 6 achieved accuracies of 94.69% and 94.09%, which are lower than those of the other items in most CNN models, primarily because it used the smallest training dataset, Pump 3. In contrast, the other items employed larger datasets, resulting in higher accuracy rates. Notably, Items 7-9 demonstrated the highest accuracy across most CNN models, with a more balanced sensitivity between the two classes. These items benefited from a larger combined training dataset sourced from two different pumps, unlike the other items, which relied solely on data from a single pump.
In a comparison of the CNN models' performance, Model 3 achieved the highest accuracy, ranging from 94.09% to 99.48%, while Model 1 exhibited the lowest accuracy, between 81.84% and 98.30%. Consequently, Model 3 is identified as the optimal choice for detecting abnormalities in unknown machines.
Figure 4.8 illustrates the confusion matrices of the items with the highest accuracy, revealing that the sensitivity of the normal class typically surpasses that of the abnormal class across most classification tasks. This finding aligns with the trends observed in the known-machine classification scenario.
The study revealed that models with a simple structure of one to three convolution layers achieved high accuracy, surpassing AlexNet in most classification tasks. A basic CNN model featuring a single convolution layer achieved accuracy rates between 96.51% and 99.52% for abnormality detection when both the training and testing datasets originated from the same machine. However, when datasets from different machines were used, additional convolution layers were necessary for effective classification. Among the models tested, Model 3, which incorporates three convolution layers, proved to be the optimal choice, exhibiting the highest and most consistent accuracy across all classification tasks.
Fig 4.8 Confusion matrices of each trained and tested model in the unknown-machine case (panels include (i) Item 9)
Summary
This study introduces a deep learning method utilizing convolutional neural networks (CNN) for detecting machine abnormalities through sound signal analysis. Sound signals from three different machines were preprocessed to reduce noise, and features were extracted as mel-spectrograms. The CNN model effectively learned the features of these mel-spectrogram images, enabling accurate classification and detection of abnormalities in water pumps. An optimal CNN model was identified, comprising 15 layers, including one input layer, three convolutional layers, one dropout layer, three fully connected layers, and one output layer, achieving an accuracy of over 94% in all fault detection tasks.
The proposed method offers a reliable solution for detecting abnormalities in machines, enabling operators and manufacturers to identify and rectify faults promptly, thereby preventing system failures. With its high detection accuracy, this approach is applicable across a diverse range of machinery. The findings from this study can facilitate the development of an automatic pump abnormality detection system, which processes raw sound signals from pumps by reducing noise and extracting features to create RGB images. These images are then analyzed by a trained CNN to classify the signals as normal or abnormal, ultimately determining the presence of faults in the pump.
Fig 4.9 The system automatically detects the fault of a pump through the sound signal
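A compact sketch of such an automatic detection pipeline for a single recording is given below; the Savitzky-Golay smoothing step, the mel-spectrogram settings, and the colormap-based RGB conversion are assumptions, and trainedNet stands for a hypothetical network trained as described above.

```matlab
function label = classifyPumpSound(wavFile, trainedNet)
% Sketch of the pipeline in Fig 4.9: raw sound -> noise reduction ->
% mel spectrogram -> RGB image -> CNN classification (normal / abnormal).
    [x, fs] = audioread(wavFile);
    x = x(:, 1);                                   % use the first channel if multichannel
    x = sgolayfilt(x, 3, 11);                      % assumed smoothing-based noise reduction

    S = melSpectrogram(x, fs, 'NumBands', 64);     % assumed number of mel bands
    S = 10 * log10(S + eps);                       % convert power to dB scale

    % Map the spectrogram to an RGB image and resize to the CNN input size.
    rgb = ind2rgb(gray2ind(mat2gray(S), 256), jet(256));
    rgb = imresize(rgb, [100 100]);

    label = classify(trainedNet, rgb);
end
```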
Further research can extend the diagnosis of faults in various machines by identifying their type and location. In addition to the convolutional neural network (CNN) model, other classification techniques, such as Hidden Markov Models (HMM) and Recurrent Neural Networks (RNN), can also be applied to fault detection through sound signal analysis.
Sound classification for diagnosis of heart valve diseases
Introduction
Cardiovascular disease is a leading global health issue, claiming millions of lives each year. It affects heart health by narrowing, hardening, and clogging blood vessels, which impairs oxygen delivery to the brain and other organs and can ultimately lead to organ failure and death. Early and accurate diagnosis is therefore crucial for timely treatment and for reducing the risks associated with cardiovascular disease.
To assess heart health, doctors consider family history and risk factors such as smoking, diabetes, obesity, and stress, and perform essential tests. Key diagnostic tests for cardiovascular disease include echocardiography, imaging techniques, and electrocardiograms. Echocardiography utilizes ultrasound waves to create real-time images of the heart, allowing the evaluation of heart wall movement, function, and valve integrity. Imaging diagnostics, including cardiac MRI, CT scans, and myocardial perfusion imaging, provide detailed images of coronary artery anatomy, calcification levels, and stenosis. Both echocardiography and imaging methods offer non-invasive, accurate results but require advanced equipment and skilled professionals and can be time-consuming. In contrast, the electrocardiogram is a simple and quick test that detects the electrical impulses generated during heart activity, identifying signs of coronary artery disease, myocardial ischemia, and other complications. This technique is cost-effective, non-invasive, and delivers rapid results.
Evaluating heart sounds is a traditional technique for diagnosing heart disease. In healthy individuals, heart sounds are characterized by the "lub" and "dub," where the "lub" signifies the closure of the atrioventricular valves at the beginning of systole, and the "dub" indicates the closure of the semilunar valves at the end of systole. The first heart sound is low and prolonged, while the second is sharper and more distinct at the heart's base. Occasionally, a third sound may be heard in children and young adults, resulting from blood flow from the atria to the ventricles during early diastole. Rarely, a fourth heart sound, known as the atrial sound, can also be detected. Heart sounds can vary based on several factors, including the chest wall, the heart's environment, and the condition of the heart muscle and valves. Increased heart sounds may occur during physical exertion or illness, while weakened heart sounds can indicate underlying issues such as mitral stenosis or cardiomyopathy. The traditional method of diagnosing heart conditions through auscultation with a stethoscope remains prevalent because of its simplicity and quick results; however, its accuracy relies heavily on the examiner's expertise, highlighting the need for automated systems capable of accurately diagnosing heart disease from heart sounds.
This chapter presents two deep learning models, LSTM and CNN, designed to diagnose heart valve diseases from heart sound signals represented by log-mel spectrogram features. These models classify heart sounds into five distinct categories, achieving accuracy rates ranging from 94% to 99.33%. The findings of this study can significantly aid cardiologists in the rapid and precise diagnosis of heart disease.
Related works
Heart sound classification involves three key steps: first, the framing procedure, which transforms each heart sound signal into a fixed duration; second, the extraction of relevant features from the heart sound signals for use by the classifier; and finally, the design of a classification model to perform the classification tasks. The primary framework of this study is illustrated in Fig 5.1.
Fig 5.1 The main framework of a heart sound classifier system
Heart sound classification utilizes supervised learning to enable classifiers to learn from input data and categorize new data accordingly. This process can involve bi-class categories, such as normal and abnormal sounds, or multi-class categories, including normal, murmur, and extrasystole. A summary of research employing machine learning techniques for heart sound classification is provided in Table 5.1.
Table 5.1 Summarized studies on the classification of heart sound using DL techniques
Study | Dataset | Extracted feature | Classification model | Overall accuracy
- | 40 normal, 40 pulmonary, and 40 mitral stenosis signals | Wavelet entropies based on 6 features | - | -
- | 614 normal and abnormal cardiac cycles | Spectral analysis with a time-growing window | - | -
- | - | Time, frequency, and time-frequency domains based on 18 features | - | -
- | - | Gram polynomial and the Fourier transform | - | -
- | Innocent murmur: 336 recordings and abnormal murmur: 130 recordings | - | - | -
Chen et al | 200 normal and 49 extra-systole samples | Wavelet decomposition and Hilbert-Huang transform | - | -
- | 18,179 normal and abnormal cardiac cycles | 29 time domain, 66 frequency domain, and ... | - | -
- | The public dataset [147] | MFCCs and DWT | DNN | 97%
- | The public dataset [147] | Normalized signals | CNN | 97.0%
- | Datasets A and B [138] | Six power characteristics | MLP | 98.63%
Abbreviation: ANFIS: adaptive neuro-fuzzy inference system, FFNN: feed-forward neural network, PNN: probabilistic neural network, DWT: discrete wavelet transform, GANs: generative adversarial networks
Despite a large number of studies on heart sound classification, many problems remain unresolved, and improvements are needed:
- Some of the studies used training datasets that are not large enough [129, 137, 142, 144, 148, 151]; therefore, it is hard to evaluate the effectiveness of these methods.
- The performance of some studies is not high [133, 134, 136, 137, 142, 143, 148].
- Some studies achieved high accuracy; however, they are limited to classifying heart sounds into two classes (normal and abnormal) [130, 140].
- Some studies have classified heart sounds into multiple classes, but the accuracy still needs to be improved [146, 150].
Classifying heart sounds thus presents ongoing challenges due to the limitations outlined above. The goal of this study is to identify suitable classification models that can categorize heart sounds into multiple classes while achieving high accuracy.
Methodology
The heart sound dataset, sourced from a public repository, comprises 1,000 wav signal samples with a sampling frequency of 8 kHz. It is categorized into five distinct classes: one normal class (N) and four anomalous classes, namely Aortic Stenosis (AS), Mitral Regurgitation (MR), Mitral Stenosis (MS), and Mitral Valve Prolapse (MVP). Detailed information about the dataset is presented in Table 5.2.
Table 5.2 Detail of the dataset
To effectively apply classification algorithms to sound signals, data framing is essential because of the varying lengths of these signals. This process standardizes the duration of each record, ensuring that each signal sample captures at least one complete cardiac cycle without being excessively long, which would increase data size and processing time. An adult's heart rate typically ranges from 65 to 75 beats per minute, corresponding to a cardiac cycle of approximately 0.8 seconds. Consequently, the signal samples were cropped into segments of 2.0 seconds, 1.5 seconds, and 1.0 seconds for analysis.
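A minimal sketch of this framing step is shown below; it assumes non-overlapping segments and simply discards the trailing partial segment, which may differ from the exact cropping used in this study.

```matlab
% Sketch: crop a heart-sound recording into fixed-duration segments.
[x, fs] = audioread('heart_sound.wav');    % fs = 8000 Hz for this dataset
x = x(:, 1);                               % use the first channel if multichannel
segDur = 1.0;                              % segment duration in seconds (2.0, 1.5, or 1.0)
segLen = round(segDur * fs);

numSeg   = floor(numel(x) / segLen);       % drop the trailing partial segment (assumption)
segments = reshape(x(1:numSeg*segLen), segLen, numSeg);   % one segment per column
```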
In order to prepare the input data for the classification models, the raw waveforms of the heart sound signals were converted into log-mel spectrograms based on the Discrete Fourier Transform (DFT). The DFT of a sound signal $x(n)$ of length $N$ is $y(k) = \sum_{n=1}^{N} x(n)\, e^{-j 2\pi (k-1)(n-1)/N}$, where $k$ ranges from 1 to $N$. To achieve a smoother data distribution, the logarithm of the mel spectrogram is computed with a small offset $\varepsilon$, giving $s(k) = \log\big(y(k) + \varepsilon\big)$, where $\varepsilon$ is a small positive constant. This normalization and smoothing of the input data facilitate easier training of the network. The effectiveness of this smooth distribution is illustrated by the histogram of pixel values shown in Figure 5.2.
Fig 5.2 The histogram of the pixel values of the training data
The waveform and log-mel spectrogram of some heart sound samples are depicted in Fig 5.3
Fig 5.3 Waveform and log-mel spectrogram of some heart sound samples
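The feature-extraction step described above can be sketched in MATLAB (Audio Toolbox) as follows; the window length, overlap, number of mel bands, the value of the offset ε, and the output image size are assumed values.

```matlab
% Sketch: convert a heart-sound segment into a log-mel spectrogram image.
[x, fs] = audioread('heart_sound_segment.wav');      % fs = 8000 Hz in this dataset
x = x(:, 1);
S = melSpectrogram(x, fs, ...
    'Window', hamming(256, 'periodic'), ...
    'OverlapLength', 128, ...
    'NumBands', 64);

epsOffset = 1e-6;                 % small offset to avoid log(0)
logMel = log(S + epsOffset);      % s(k) = log(y(k) + eps)

% Normalize to [0, 1] and save as an image for the classifiers.
logMel = (logMel - min(logMel(:))) / (max(logMel(:)) - min(logMel(:)));
imwrite(imresize(logMel, [100 100]), 'logmel_segment.png');
```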
In this section, two DL models, LSTM and CNN, were used to classify heart sounds.
This study utilized an LSTM model featuring a single LSTM layer, a dropout layer, and three fully connected (FC) layers. The architecture and parameter settings of the LSTM network are illustrated in Figure 5.4 and detailed in Table 5.3.
Fig 5.4 The architecture of the proposed LSTM model
Table 5.3 Setting parameters of proposed LSTM model
Layer Name Activations Learnables Total learnables
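A sketch of an LSTM classifier with this layer arrangement is given below; since Table 5.3 is not reproduced here, the feature dimension, number of hidden units, dropout rate, and FC layer sizes are assumptions.

```matlab
% Sketch of the LSTM classifier: sequence input -> LSTM -> dropout -> 3 FC layers.
numFeatures = 64;    % assumed: one feature vector per spectrogram frame (64 mel bands)
numHidden   = 100;   % assumed number of hidden units
numClasses  = 5;     % N, AS, MR, MS, MVP

layers = [
    sequenceInputLayer(numFeatures)
    lstmLayer(numHidden, 'OutputMode', 'last')   % keep only the final hidden state
    dropoutLayer(0.2)                            % assumed dropout rate
    fullyConnectedLayer(64)                      % fc1 (assumed size)
    fullyConnectedLayer(32)                      % fc2 (assumed size)
    fullyConnectedLayer(numClasses)              % fc3 -> class scores
    softmaxLayer
    classificationLayer];
```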
Fig 5.5 The architecture of the proposed CNN model
Table 5.4 Setting parameters of proposed CNN model
Layer Name Activations Learnables Total learnables
2×2 max pooling with stride [2 2] and padding "same"
2×2 max pooling with stride [2 2] and padding "same"
This study also presents a CNN model specifically developed for classifying heart sounds, featuring three convolutional layers and two fully connected layers, as illustrated in Fig 5.5. The parameters of the proposed CNN model are detailed in Table 5.4.
Results and discussion
In the experiment, the dataset was divided into two subsets: the training set, comprising 70% of the total data, and the testing set, containing the remaining 30%. The implementation was carried out in MATLAB R2020a on a system featuring an Intel Core i9-9900K 3.6 GHz CPU and an NVIDIA GTX 1080Ti GPU. The Adam optimizer was used, and preliminary training tests were conducted to determine the optimal hyperparameters, ensuring high accuracy and a stable training process. The hyperparameter values used during training are detailed in Table 5.5.
Table 5.5 The hyperparameters of the training processes
The classification results presented in Tables 5.6-5.8 indicate that the CNN model achieved the highest accuracy of 99.33% with a segment duration of 1.0 seconds, while the LSTM model recorded the lowest accuracy of 94.00% with a segment duration of 2.0 seconds. Among the five classes, class N demonstrated the highest sensitivity, whereas class MVP showed the lowest. This trend is further illustrated in the confusion matrices (Figs 5.6, 5.7). In the six classification scenarios, each class was tested with 60 samples; class N achieved perfect predictions in five cases, while class MVP had the fewest correct predictions. These findings align with previous studies utilizing the same dataset [146, 150].
Table 5.6 Classification results of 2.0 s-segment duration
Pre Sen F1_s Acc Pre Sen F1_s Acc
Table 5.7 Classification results of 1.5 s-segment duration
Pre Sen F1_s Acc Pre Sen F1_s Acc
Table 5.8 Classification results of 1.0 s-segment duration
Pre Sen F1_s Acc Pre Sen F1_s Acc
(a) 2.0 s-segment duration (b) 1.5 s-segment duration (c) 1.0 s-segment duration
Fig 5.6 Confusion matrices of LSTM models
(a) 2.0 s-segment duration (b) 1.5 s-segment duration (c) 1.0 s-segment duration
Fig 5.7 Confusion matrices of CNN models
The segment duration significantly impacts classification accuracy in heart sound analysis. Short segment durations may omit essential features, while excessively long durations increase the sample size and processing time. In this study, all segment durations achieved high accuracy, with the CNN model exceeding 98.67% and the LSTM model reaching 97.00%. A segment duration of approximately 1.0 to 2.0 seconds is therefore a good choice, as it captures at least one complete heartbeat cycle and ensures efficient processing without sacrificing feature extraction.
Table 5.9 presents the single-sample prediction times of the classification models, revealing that the LSTM model has the longest prediction time of 4.56 ms at a segment duration of 2.0 seconds, while the CNN model achieves the shortest time of 2.21 ms with a 1.0-second segment. This indicates that, for the same classification model, longer segment durations lead to larger data sizes and therefore longer single-sample prediction times. Notably, the CNN model is approximately twice as fast as the LSTM model in single-sample prediction.
Table 5.9 Single sample prediction time (ms)
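The single-sample prediction times in Table 5.9 can be measured with a simple timing loop such as the sketch below, where net is a trained network and Xtest is a hypothetical array of test images.

```matlab
% Sketch: average single-sample prediction time in milliseconds.
numSamples = size(Xtest, 4);            % Xtest: H x W x C x N image array (hypothetical)
tic;
for i = 1:numSamples
    classify(net, Xtest(:, :, :, i));   % classify one sample at a time
end
avgTimeMs = 1000 * toc / numSamples;
```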
In our comparison of the two proposed models, as illustrated in Fig 5.8, both achieved accuracy rates exceeding 94.00%. The CNN model demonstrated superior performance, with overall accuracies of 98.67%, 99.33%, and 99.33% for segment durations of 2.0 s, 1.5 s, and 1.0 s, respectively. In contrast, the LSTM model recorded accuracies of 94.00%, 94.00%, and 97.00% for the same durations. These findings indicate that the CNN model outperforms the LSTM model in heart sound classification, showing higher accuracy and faster processing times.
Fig 5.8 Performance comparison of LSTM and CNN models
Fig 5.9 Performance comparison of previous studies and proposed models
This study compares the performance of the two classification models with previous research, as illustrated in Fig 5.9. It successfully classified five heart sound classes, achieving an accuracy of 99.33% with the CNN model, while the LSTM model also performed well, reaching an accuracy of 97.00%. Notably, some prior studies have reported similarly high accuracy rates.
While previous studies [129, 141, 151] focused on classifying only two categories (normal and abnormal), our research advances the field by classifying five distinct classes. In comparison to other studies utilizing the same dataset [146, 150], which achieved an accuracy of 97.00%, our findings demonstrate improved classification capabilities.
Overall accuracy has improved significantly, reaching the highest value of 99.33%. In addition, the effect of segment duration on the classification performance is also considered in our study.
Summary and future works
In this study, deep learning (DL) techniques were employed to classify heart sounds into five distinct categories using two models: Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN). The models were trained on features extracted as log-mel spectrograms and yielded strong results. Notably, the CNN model achieved an accuracy of 99.33%, marking the highest performance reported in the literature for this classification task. This research offers valuable insights for cardiologists in diagnosing cardiovascular diseases.
We aim to develop an automated system for diagnosing cardiovascular diseases using heart sounds, leveraging a larger database. This will involve integrating existing datasets and gathering new data from cardiology clinics. The resulting system will serve as a valuable resource for cardiologists, particularly those with limited experience in clinical cardiovascular examination.
Conclusion and future works
Conclusion
After introducing the general research background, challenges, and objectives in Chapter 1, the ASC methodologies were presented in Chapter 2. Three studies applying DL techniques to ASC were conducted in Chapters 3, 4, and 5. The work in this thesis is summarized as follows:
Chapter 2 provided a comprehensive investigation of the methodology used for ASC. The extracted features, including time-domain and frequency-domain features, are presented and analyzed in Section 2.1. Section 2.2 gives a complete introduction to the ML models used in ASC, including traditional ML models and DL models. Compared to traditional ML models, DL models have many advantages and are highly effective in ASC; however, their disadvantages are the need for a sufficiently large training dataset and the requirement for powerful computing hardware. Two DL models, RNN and CNN, were selected for this study. The evaluation criteria are also presented in this chapter.
In Chapter 3, a novel method utilizing a designed CNN was introduced to accurately estimate a receiver's location within a room from audio signals. This approach was tested in three simulated environments and one experimental room, yielding high accuracy in all scenarios. The findings of this research are instrumental in determining a sound receiver's position in audio systems, ultimately enhancing sound system design.
Chapter 4 proposed a method using a DL technique with CNN for machine fault detection in water pumps based on sound signal analysis. The experiments were conducted in two cases: fault detection in a known machine and in an unknown machine. An optimal CNN model was designed with a respective hyperparameter set that achieved high accuracy in all fault detection tasks. The proposed method can help operators and manufacturers detect and correct faults, and the results can also be used to develop an automatic pump fault detection system.
In Chapter 5, heart sound classification was conducted using DL techniques. Two deep learning models, LSTM and CNN, were introduced to classify heart sounds, achieving superior performance compared to previous research. Notably, this is among the few studies to categorize heart sounds into five distinct classes. These findings are valuable for cardiologists in diagnosing cardiovascular diseases effectively.
The proposed methods have been effectively assessed using public datasets, offering valuable tools for the development of ASC models. Additionally, the results demonstrate that deep learning techniques significantly enhance the effectiveness of ASC.
Future works
This thesis demonstrates notable successes using various public databases, yet these accomplishments represent just a fraction of the extensive field of audio signal classification (ASC). Future research directions are proposed to expand upon these findings.
Audio databases remain relatively small compared to image databases, and some datasets are limited in size, which affects classification accuracy. To improve deep learning applications, it is essential to develop more extensive audio databases with larger datasets.
This thesis utilizes a limited number of deep learning (DL) models, each tailored to specific datasets. As larger datasets become available, more advanced DL models and broader applications will emerge. These models are expected to serve as versatile tools in the field of audio signal classification (ASC).
[1] G Potamianos, C Neti, J Luettin, and I Matthews, "Audio-visual automatic speech recognition: An overview," Issues in visual and audio-visual speech processing, vol 22, p 23, 2004
[2] E Benetos, S Dixon, Z Duan, and S Ewert, "Automatic music transcription:
An overview," IEEE Signal Processing Magazine, vol 36, pp 20-30, 2018
[3] Y E Kim, E M Schmidt, R Migneco, B G Morton, P Richardson, J Scott, et al., "Music emotion recognition: A state of the art review," in Proc ismir,
[4] W M Campbell, J P Campbell, D A Reynolds, E Singer, and P A Torres-
Carrasquillo, "Support vector machines for speaker and language recognition,"
Computer Speech & Language, vol 20, pp 210-229, 2006
[5] N Dehak, P A Torres-Carrasquillo, D Reynolds, and R Dehak, "Language recognition via i-vectors and dimensionality reduction," in Twelfth annual conference of the international speech communication association, 2011
[6] H.-D Yang, "Sign language recognition with the kinect sensor based on conditional random fields," Sensors, vol 15, pp 135-147, 2015
[7] T Heittola, A Mesaros, A Eronen, and T Virtanen, "Audio context recognition using audio event histograms," in 2010 18th European Signal Processing Conference, 2010, pp 1272-1276
[8] J S Boreczky and L D Wilcox, "A hidden Markov model framework for video segmentation using audio and image features," in Proceedings of the 1998 IEEE
International Conference on Acoustics, Speech and Signal Processing, ICASSP'98 (Cat No 98CH36181), 1998, pp 3741-3744
[9] T Zhang and C.-C Kuo, "Classification and retrieval of sound effects in audiovisual data management," in Conference Record of the Thirty-Third Asilomar Conference on Signals, Systems, and Computers, 1999
[10] C.-C Lin, S.-H Chen, T.-K Truong, and Y Chang, "Audio classification and categorization based on wavelets and support vector machine," IEEE Transactions on Speech and Audio Processing, vol 13, pp 644-651, 2005
[11] K Umapathy, S Krishnan, and R K Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol 15, pp 1236-1246, 2007
[12] C Xu, N C Maddage, and X Shao, "Automatic music classification and summarization," IEEE transactions on speech and audio processing, vol 13, pp 441-450, 2005
[13] J Ajmera, I McCowan, and H Bourlard, "Speech/music segmentation using entropy and dynamism features in a HMM classification framework," Speech communication, vol 40, pp 351-363, 2003
[14] C Panagiotakis and G Tziritas, "A speech/music discriminator based on RMS and zero-crossings," IEEE Transactions on multimedia, vol 7, pp 155-166,
[15] S Honda, T Shinohara, T Uebo, and N Nakasako, "Estimating the Distance to a Sound Source using Single-Channel Cross-Spectral Method between Observed and Pseudo-Observed Waves based on Phase Interference," in
Proceedings of the 23rd International Congress on Sound & Vibration, Athens, Greece, 2016, pp 10-14
[16] F Rosenblatt, "The perceptron: a probabilistic model for information storage and organization in the brain," Psychological review, vol 65, p 386, 1958
[17] D E Rumelhart, G E Hinton, and R J Williams, "Learning representations by back-propagating errors," nature, vol 323, pp 533-536, 1986
[18] A Krizhevsky, I Sutskever, and G E Hinton, "Imagenet classification with deep convolutional neural networks," Advances in neural information processing systems, vol 25, pp 1097-1105, 2012
[19] G Hinton, L Deng, D Yu, G E Dahl, A Mohamed, N Jaitly, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol 29, pp 82-97, 2012
[20] H Purwins, B Li, T Virtanen, J Schlüter, S.-Y Chang, and T Sainath, "Deep learning for audio signal processing," IEEE Journal of Selected Topics in Signal
[21] F Richardson, D Reynolds, and N Dehak, "Deep neural network approaches to speaker and language recognition," IEEE signal processing letters, vol 22, pp 1671-1675, 2015
[22] S Bansal, H Kamper, A Lopez, and S Goldwater, "Towards speech-to-text translation without speech recognition," arXiv preprint arXiv:1702.03856,
[23] A Sehgal and N Kehtarnavaz, "A convolutional neural network smartphone app for real-time voice activity detection," IEEE Access, vol 6, pp 9017-9026,
[24] Detection and Classification of Acoustic Scenes and Events Available: http://dcase.community/
[25] M Yiwere and E J Rhee, "Sound Source Distance Estimation Using Deep
Learning: An Image Classification Approach," Sensors, vol 20, p 172, 2020
[26] N Yalta, K Nakadai, and T Ogata, "Sound source localization using deep learning models," Journal of Robotics and Mechatronics, vol 29, pp 37-48,
[27] P Bentley, G Nordehn, M Coimbra, S Mannor, and R Getz Classifying Heart
Sounds Challenge Available: http://www.peterjbentley.com/heartchallenge/
[28] Q Chen, W Zhang, X Tian, X Zhang, S Chen, and W Lei, "Automatic heart and lung sounds classification using convolutional neural networks," in 2016
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016, pp 1-4
[29] H Liu, L Li, and J Ma, "Rolling bearing fault diagnosis based on STFT-deep learning and sound signals," Shock and Vibration, vol 2016, 2016
[30] D Yu and L Deng, "Deep learning and its applications to signal and information processing [exploratory dsp]," IEEE Signal Processing Magazine, vol 28, pp 145-154, 2010
[31] T Giannakopoulos and A Pikrakis, Introduction to Audio Analysis: a
[32] Z Fu, G Lu, K M Ting, and D Zhang, "A survey of audio-based music classification and annotation," IEEE transactions on multimedia, vol 13, pp 303-319, 2010
[33] P Mermelstein, "Automatic segmentation of speech into syllabic units," The
Journal of the Acoustical Society of America, vol 58, pp 880-883, 1975
[34] G Peeters, "A large set of audio features for sound description (similarity and classification) in the CUIDADO project," CUIDADO Ist Project Report, vol
[35] H Jiang, J Bai, S Zhang, and B Xu, "SVM-based audio scene classification," in 2005 International Conference on Natural Language Processing and Knowledge Engineering, 2005, pp 131-136
[36] V Peltonen, J Tuomi, A Klapuri, J Huopaniemi, and T Sorsa, "Computational auditory scene recognition," in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002, pp II-1941-II-1944
[37] J Saunders, "Real-time discrimination of broadcast speech/music," in 1996
IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, 1996, pp 993-996
[38] P Ahrendt, A Meng, and J Larsen, "Decision time horizon for music genre classification using short time features," in 2004 12th European Signal Processing Conference, 2004, pp 1293-1296
[39] R Bachu, S Kopparthi, B Adapa, and B Barkana, "Separation of voiced and unvoiced using zero crossing rate and energy of the speech signal," in American
Society for Engineering Education (ASEE) zone conference proceedings, 2008, pp 1-7
[40] M Ramona, G Richard, and B David, "Vocal detection in music with support vector machines," in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp 1885-1888
[41] A Pikrakis, T Giannakopoulos, and S Theodoridis, "Gunshot detection in audio streams from movies by means of dynamic programming and bayesian networks," in 2008 IEEE International Conference on Acoustics, Speech and
[42] T Giannakopoulos, A Pikrakis, and S Theodoridis, "A multi-class audio classification method with respect to violent content in movies using bayesian networks," in 2007 IEEE 9th Workshop on Multimedia Signal Processing, 2007, pp 90-93
[43] E Schubert, J Wolfe, and A Tarnopolsky, "Spectral centroid and timbre in complex, multiple instrumental textures," in Proceedings of the international conference on music perception and cognition, North Western University, Illinois, 2004, pp 112-116
[44] A Pikrakis, T Giannakopoulos, and S Theodoridis, "A computationally efficient speech/music discriminator for radio recordings," in ISMIR, 2006, pp 107-110
[45] A Pikrakis, T Giannakopoulos, and S Theodoridis, "A speech/music discriminator of radio recordings based on dynamic programming and bayesian networks," IEEE Transactions on Multimedia, vol 10, pp 846-857, 2008
[46] S Lee, J Kim, and I Lee, "Speech/audio signal classification using spectral flux pattern recognition," in 2012 IEEE Workshop on Signal Processing Systems, 2012, pp 232-236
[47] T Li, M Ogihara, and Q Li, "A comparative study on content-based music genre classification," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp 282-289
[48] X Valero and F Alías, "Applicability of MPEG-7 low level descriptors to environmental sound source recognition," in Proceedings 1st Euroregio Conference, Ljubjana, 2010
[49] A I Al-Shoshan, "Speech and music classification and separation: a review,"
Journal of King Saud University-Engineering Sciences, vol 19, pp 95-132,
[50] L Lu, D Liu, and H.-J Zhang, "Automatic mood detection and tracking of music audio signals," IEEE Transactions on audio, speech, and language processing, vol 14, pp 5-18, 2005
[51] D O’Saughnessy, "Speech communication-human and machine," Reading, PA:
[52] C Ittichaichareon, S Suksri, and T Yingthawornsuk, "Speech recognition using MFCC," in International conference on computer graphics, simulation and modeling, 2012, pp 135-138
[53] V Tiwari, "MFCC and its applications in speaker recognition," International journal on emerging technologies, vol 1, pp 19-22, 2010
[54] S O Sadjadi and J H Hansen, "Assessment of single-channel speech enhancement techniques for speaker identification under mismatched conditions," in Eleventh Annual Conference of the International Speech Communication Association, 2010
[55] G Kour and N Mehan, "Music genre classification using MFCC, SVM and
BPNN," International Journal of Computer Applications, vol 112, 2015
[56] O Lartillot, P Toiviainen, and T Eerola, "A matlab toolbox for music information retrieval," in Data analysis, machine learning and applications, ed: Springer, 2008, pp 261-268
[57] J H Jensen, M G Christensen, D P Ellis, and S H Jensen, "Quantitative analysis of a common audio similarity measure," IEEE Transactions on Audio,
Speech, and Language Processing, vol 17, pp 693-703, 2009
[58] Z Ali, M Alsulaiman, G Muhammad, I Elamvazuthi, and T A Mesallam,
"Vocal fold disorder detection based on continuous speech by using MFCC and GMM," in 2013 7th IEEE GCC Conference and Exhibition (GCC), 2013, pp 292-297
[59] A Şengür, Y Guo, and Y Akbulut, "Time–frequency texture descriptors of
EEG signals for efficient detection of epileptic seizure," Brain Informatics, vol
[60] Y M Costa, L S Oliveira, and C N Silla Jr, "An evaluation of convolutional neural networks for music classification using spectrograms," Applied soft computing, vol 52, pp 28-38, 2017
[61] A Montalvo, Y M Costa, and J R Calvo, "Language identification using spectrogram texture," in Iberoamerican Congress on Pattern Recognition,
[62] L Pham, H Phan, T Nguyen, R Palaniappan, A Mertins, and I McLoughlin,
"Robust acoustic scene classification using a multi-spectrogram encoder- decoder framework," Digital Signal Processing, vol 110, p 102943, 2021
[63] H Zhang, "The Optimality of Naive Bayes, 2004," American Association for
Artificial Intelligence (www aaai org), 2004
[64] A McCallum and K Nigam, "A comparison of event models for naive bayes text classification," in AAAI-98 workshop on learning for text categorization,
[65] V Metsis, I Androutsopoulos, and G Paliouras, "Spam filtering with naive bayes-which naive bayes?," in CEAS, 2006, pp 28-69
[66] L E Peterson, "K-nearest neighbor," Scholarpedia, vol 4, p 1883, 2009
[67] S R Safavian and D Landgrebe, "A survey of decision tree classifier methodology," IEEE transactions on systems, man, and cybernetics, vol 21, pp 660-674, 1991
[68] A Parmar, R Katariya, and V Patel, "A review on random forest: An ensemble classifier," in International Conference on Intelligent Data Communication Technologies and Internet of Things, 2018, pp 758-763
[69] T M Oshiro, P S Perez, and J A Baranauskas, "How many trees in a random forest?," in International workshop on machine learning and data mining in pattern recognition, 2012, pp 154-168
[70] Y LeCun, Y Bengio, and G Hinton, "Deep learning," nature, vol 521, pp
[71] L B Almeida, "C1 2 Multilayer perceptrons," Handbook of Neural
[72] I Vilovic, "An experience in image compression using neural networks," in
[73] T Koskela, M Lehtokangas, J Saarinen, and K Kaski, "Time series prediction with multilayer perceptron, FIR and Elman neural networks," in Proceedings of the World Congress on Neural Networks, 1996, pp 491-496
[74] T.-h Kim, "Pattern recognition using artificial neural network: a review," in
International Conference on Information Security and Assurance, 2010, pp
[75] D H Hubel and T N Wiesel, "Receptive fields of single neurones in the cat's striate cortex," The Journal of physiology, vol 148, pp 574-591, 1959
[76] A A M Al-Saffar, H Tao, and M A Talab, "Review of deep convolution neural network in image classification," in 2017 International Conference on
Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET), 2017, pp 26-31
[77] A B Nassif, I Shahin, I Attili, M Azzeh, and K Shaalan, "Speech recognition using deep neural networks: A systematic review," IEEE access, vol 7, pp 19143-19165, 2019
[78] A Khamparia, D Gupta, N G Nguyen, A Khanna, B Pandey, and P Tiwari,
"Sound classification using convolutional neural network and tensor deep stacking network," IEEE Access, vol 7, pp 7717-7727, 2019
[79] K Parikh (2019) Understanding the Convolution function and CNN
Available: https://medium.com/@parikhkadam/article-1-understanding-the- convolution-function-and-cnn-21dca53e2c27
[80] S Hochreiter and J Schmidhuber, "Long short-term memory," Neural computation, vol 9, pp 1735-1780, 1997
[81] A W Bronkhorst, "Modeling auditory distance perception in rooms," presented at the Proceedings of the AAE Forum Acusticum, Sevilla, Spain, 2002
[82] Y.-C Lu and M Cooke, "Binaural estimation of sound source distance via the direct-to-reverberant energy ratio for static and moving sources," IEEE Transactions on Audio, Speech, and Language Processing, 2010
[83] T Rodemann, "A study on distance estimation in binaural sound localization," in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems,
[84] L Wang and A Cavallaro, "Time-frequency processing for sound source localization from a micro aerial vehicle," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp
[85] S Chakrabarty and E A Habets, "Broadband DOA estimation using convolutional neural networks trained with noise signals," in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp 136-140
[86] S Chakrabarty and E A Habets, "Multi-speaker localization using convolutional neural network trained with noise," arXiv preprint arXiv:1712.04276, 2017
[87] T Rodemann, G Ince, F Joublin, and C Goerick, "Using binaural and spectral cues for azimuth and elevation localization," in 2008 IEEE/RSJ International
Conference on Intelligent Robots and Systems, 2008, pp 2185-2190
[88] L Perotin, R Serizel, E Vincent, and A Guérin, "CRNN-based joint azimuth and elevation localization with the Ambisonics intensity vector," in 2018 16th
International Workshop on Acoustic Signal Enhancement (IWAENC), 2018, pp
[89] N D Gaubitch, W B Kleijn, and R Heusdens, "Auto-localization in ad-hoc microphone arrays," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp 106-110
[90] R Parhizkar, I Dokmanić, and M Vetterli, "Single-channel indoor microphone localization," in 2014 IEEE International Conference on Acoustics, Speech and
[91] Z Huang, J Xu, Z Gong, H Wang, and Y Yan, "Multiple source localization in a shallow water waveguide exploiting subarray beamforming and deep neural networks," Sensors, vol 19, p 4768, 2019
[92] R Takeda and K Komatani, "Sound source localization based on deep neural networks with directional activate function exploiting phase information," in
2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), 2016, pp 405-409
[93] J B Allen and D A Berkley, "Image method for efficiently simulating small‐ room acoustics," The Journal of the Acoustical Society of America, vol 65, pp 943-950, 1979
[95] (2020) Office Noise and Acoustics Available: https://canadasafetycouncil.org/office-noise-and-acoustics/
[96] P Henriquez, J B Alonso, M A Ferrer, and C M Travieso, "Review of automatic fault diagnosis systems using audio and vibration signals," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol 44, pp 642-652,
[97] N Tandon and A Choudhury, "A review of vibration and acoustic measurement methods for the detection of defects in rolling element bearings," Tribology international, vol 32, pp 469-480, 1999
[98] Z Zhong, J Chen, P Zhong, and J Wu, "Application of the blind source separation method to feature extraction of machine sound signals," The International Journal of Advanced Manufacturing Technology, vol 28, pp 855-
[99] Y Yao, H Wang, S Li, Z Liu, G Gui, Y Dan, et al., "End-to-end convolutional neural network model for gear fault diagnosis based on sound signals," Applied Sciences, vol 8, p 1584, 2018
[100] W Li, Y Tsai, and C Chiu, "The experimental study of the expert system for diagnosing unbalances by ANN and acoustic signals," Journal of Sound and
[101] U Benko, J Petrovčič, Đ Juričić, J Tavčar, J Rejec, and A Stefanovska,
"Fault diagnosis of a vacuum cleaner motor by means of sound analysis,"
Journal of Sound and Vibration, vol 276, pp 781-806, 2004
[102] U Benko, J Petrovc̆ic̆, Đ Juričić, J Tavčar, and J Rejec, "An approach to fault diagnosis of vacuum cleaner motors based on sound analysis," Mechanical Systems and Signal Processing, vol 19, pp 427-445, 2005
[103] J Lin, "Feature extraction of machine sound using wavelet and its application in fault diagnosis," NDT & e International, vol 34, pp 25-30, 2001
[104] H Kumar, V Sugumaran, and M Amarnath, "Fault diagnosis of bearings through sound signal using statistical features and Bayes classifier," 2016
[105] A Khazaee, et al., "Classifier fusion of vibration and acoustic signals for fault diagnosis and classification of planetary gears based on Dempster-Shafer evidence theory," Proceedings of the Institution of Mechanical Engineers, Part E: Journal of Process Mechanical Engineering, vol 228, pp 21-32, 2014
[106] J Lee, H Choi, D Park, Y Chung, H.-Y Kim, and S Yoon, "Fault detection and diagnosis of railway point machines by sound analysis," Sensors, vol 16, p 549, 2016
[107] M Gan and C Wang, "Construction of hierarchical diagnosis network based on deep learning and its application in the fault pattern recognition of rolling element bearings," Mechanical Systems and Signal Processing, 2016
[108] M He and D He, "Deep learning based approach for bearing fault diagnosis,"
IEEE Transactions on Industry Applications, vol 53, pp 3057-3065, 2017
[109] S Haidong, J Hongkai, L Xingqiu, and W Shuaipeng, "Intelligent fault diagnosis of rolling bearing using deep wavelet auto-encoder with extreme learning machine," Knowledge-Based Systems, vol 140, pp 1-14, 2018
[110] F Jia, Y Lei, L Guo, J Lin, and S Xing, "A neural network constructed by deep learning technique and its application to intelligent fault diagnosis of machines," Neurocomputing, vol 272, pp 619-628, 2018
[111] C Li, R.-V Sánchez, G Zurita, M Cerrada, and D Cabrera, "Fault diagnosis for rotating machinery using vibration measurement deep statistical feature learning," Sensors, vol 16, p 895, 2016
[112] W Zhang, et al., "A deep convolutional neural network with new training methods for bearing fault diagnosis under noisy environment and different working load," Mechanical Systems and Signal Processing, vol 100, pp 439-453, 2018
[113] K Liang, N Qin, D Huang, and Y Fu, "Convolutional recurrent neural network for fault diagnosis of high-speed train bogie," Complexity, vol 2018,
[114] A deep convolutional neural network with information fusion for bearing fault diagnosis under different working conditions, Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science
[115] R.-Y Yang and R Rai, "Machine auscultation: enabling machine diagnostics using convolutional neural networks and large-scale machine audio data,"
Advances in Manufacturing, vol 7, pp 174-187, 2019
[116] H Purohit, R Tanabe, K Ichige, T Endo, Y Nikaido, K Suefusa, et al.,
"MIMII Dataset: Sound dataset for malfunctioning industrial machine investigation and inspection," arXiv preprint arXiv:1909.09347, 2019
[117] S V Vaseghi, Advanced digital signal processing and noise reduction: John
[118] A O M Salih, "Audio Noise Reduction Using Low Pass Filters," Open Access
[119] A Savitzky and M J Golay, "Smoothing and differentiation of data by simplified least squares procedures," Analytical chemistry, vol 36, pp 1627-
[120] J Chen and Y Shen, "The effect of kernel size of CNNs for lung nodule classification," in 2017 9th international conference on advanced infocomm technology (ICAIT), 2017, pp 340-344
[121] N Srivastava, G Hinton, A Krizhevsky, I Sutskever, and R Salakhutdinov,
"Dropout: a simple way to prevent neural networks from overfitting," The journal of machine learning research, vol 15, pp 1929-1958, 2014
[122] (2021) Cardiovascular Diseases Available: https://www.who.int/health- topics/cardiovascular-diseases#tab=tab_1
[123] C M Otto, Textbook of clinical echocardiography: Elsevier Health Sciences,
[124] S A Morris and T C Slesnick, "Magnetic resonance imaging," Visual Guide to Neonatal Cardiology, pp 104-108, 2018
[125] M Ter-Pogossian, E Weiss, R Coleman, and B Sobel, "Computed tomography of the heart," American Journal of Roentgenology, vol 127, pp 79-
[126] A Varga-Szemes, F G Meinel, C N De Cecco, S R Fuller, R R Bayer, and
U J Schoepf, "CT myocardial perfusion imaging," American Journal of Roentgenology, vol 204, pp 487-497, 2015
[127] S Karpagachelvi, M Arthanari, and M Sivakumar, "ECG feature extraction techniques-a survey approach," arXiv preprint arXiv:1005.0957, 2010
[128] B S Emmanuel, "A review of signal processing techniques for heart sound analysis in clinical diagnosis," Journal of medical engineering & technology, vol 36, pp 303-307, 2012
[129] H Uğuz, "Adaptive neuro-fuzzy inference system for diagnosis of the heart valve diseases using wavelet transform with entropy," Neural Computing and applications, vol 21, pp 1617-1628, 2012
[130] A Gharehbaghi, T Dutoit, P Ask, and L Sửrnmo, "Detection of systolic ejection click using time growing neural network," Medical engineering & physics, vol 36, pp 477-483, 2014
[131] M Zabihi, A B Rad, S Kiranyaz, M Gabbouj, and A K Katsaggelos, "Heart sound anomaly and quality detection using ensemble of neural networks without segmentation," in 2016 Computing in Cardiology Conference (CinC), 2016, pp 613-616
[132] C Liu, D Springer, Q Li, B Moody, R A Juan, F J Chorro, et al., "An open access database for the evaluation of heart sound algorithms," Physiological Measurement, vol 37, p 2181, 2016
[133] C Potes, S Parvaneh, A Rahman, and B Conroy, "Ensemble of feature-based and deep learning-based classifiers for detection of abnormal heart sounds," in
2016 Computing in Cardiology Conference (CinC), 2016, pp 621-624
[134] H.-l Her and H.-W Chiu, "Using time-frequency features to recognize abnormal heart sounds," in 2016 Computing in Cardiology Conference (CinC),
[135] G D Clifford, C Liu, B Moody, D Springer, I Silva, Q Li, et al.,
"Classification of normal/abnormal heart sound recordings: The PhysioNet/Computing in Cardiology Challenge 2016," in 2016 Computing in
[136] M Tschannen, T Kramer, G Marti, M Heinzmann, and T Wiatowski, "Heart sound classification using deep structured features," in 2016 Computing in Cardiology Conference (CinC), 2016, pp 565-568
[137] W Zhang and J Han, "Towards heart sound classification without segmentation using convolutional neural network," in 2017 Computing in Cardiology (CinC),
[138] P Bentley, G Nordehn, M Coimbra, S Mannor, and R Getz, "Classifying heart sounds challenge," Retrieved from Classifying Heart Sounds Challenge: http://www peterjbentley com/heartchallenge, 2011
[139] M Nassralla, Z El Zein, and H Hajj, "Classification of normal and abnormal heart sounds," in 2017 Fourth International Conference on Advances in Biomedical Engineering (ICABME), 2017, pp 1-4
[140] F Beritelli, G Capizzi, G L Sciuto, C Napoli, and F Scaglione, "Automatic heart activity diagnosis based on Gram polynomials and probabilistic neural networks," Biomedical engineering letters, vol 8, pp 77-85, 2018
[141] S Latif, M Usman, R Rana, and J Qadir, "Phonocardiographic sensing using deep learning for abnormal heartbeat detection," IEEE Sensors Journal, vol 18, pp 9393-9400, 2018
[142] V Sujadevi, K Soman, R Vinayakumar, and A P Sankar, "Deep models for phonocardiography (PCG) classification," in 2017 International Conference on
Intelligent Communication and Computational Techniques (ICCT), 2017, pp
[143] B Bozkurt, I Germanakis, and Y Stylianou, "A study of time-frequency features for CNN-based automatic heart sound classification for pathology detection," Computers in biology and medicine, vol 100, pp 132-143, 2018
[144] L Chen, J Ren, Y Hao, and X Hu, "The diagnosis for the extrasystole heart sound signals based on the deep learning," Journal of Medical Imaging and Health Informatics, vol 8, pp 959-968, 2018
[145] M Sotaquirá, D Alvear, and M Mondragón, "Phonocardiogram classification using deep neural networks and weighted probability comparisons," Journal of medical engineering & technology, vol 42, pp 510-517, 2018
[146] G.-Y Son and S Kwon, "Classification of heart sound signal using multiple features," Applied Sciences, vol 8, p 2344, 2018
[147] (2018) Available: https://github.com/yaseen21khan/Classification-of-Heart-
Sound-Signal-Using-Multiple-Features-/find/master
[148] A Raza, A Mehmood, S Ullah, M Ahmad, G S Choi, and B.-W On,
"Heartbeat sound signal classification using deep learning," Sensors, vol 19, p
[149] J M.-T Wu, M.-H Tsai, Y Z Huang, S H Islam, M M Hassan, A Alelaiwi, et al., "Applying an ensemble convolutional neural network with Savitzky–
Golay filter to construct a phonocardiogram prediction model," Applied Soft Computing, vol 78, pp 29-40, 2019
[150] S L Oh, V Jahmunah, C P Ooi, R.-S Tan, E J Ciaccio, T Yamakawa, et al., "Classification of heart sound signals using a novel deep WaveNet model," Computer Methods and Programs in Biomedicine, p 105604, 2020
[151] P Narváez, S Gutierrez, and W S Percybrooks, "Automatic Segmentation and
Classification of Heart Sounds Using Modified Empirical Wavelet Transform and Power Features," Applied Sciences, vol 10, p 4791, 2020
[152] B J Gersh, Mayo Clinic heart book: W Morrow, 2000
[153] J O Smith, Mathematics of the discrete Fourier transform (DFT): with audio applications: Julius Smith, 2007
Minh-Tuan Nguyen received his B.E. and M.E. degrees in Mechanical Engineering from Hanoi University of Science and Technology, Hanoi, Vietnam, in 2008 and 2013, respectively. He has worked as a lecturer at the Faculty of Mechanical Engineering, Hung Yen University of Technology and Education, Hung Yen, Vietnam, since 2009. At present, he is pursuing a Ph.D. degree in Mechanical Engineering at Feng Chia University, Taiwan (ROC). His research interest focuses on audio signal processing using DL techniques.