Chapter 3 NOISE REDUCTION BY SPECTRAL SUBTRACTION WITH SCALAR
3.2. Scalar Kalman Filter for reducing residual noise
3.2.1. Model for magnitude of both residual noise and clean speech
As stated in the previous section, the residual noise is the estimation error of the spectral subtraction method in the parts without speech activity. It is given by:
$$\epsilon[n, k] = \left(|V[n, k]| - \mu[k]\right) e^{i\theta_V[n, k]}$$
Because 𝜇[𝑘] is the mean of |𝑉[𝑛, 𝑘]|, the signed amplitude of the residual noise has zero mean: 𝐸[|𝑉[𝑛, 𝑘]| − 𝜇[𝑘]] = 0. Note that we are talking about the residual noise before Half-wave Rectification; after Half-wave Rectification the negative values are clipped to zero, so the expectation of the residual noise is larger than zero.
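The zero mean before rectification and the positive bias after it can be checked numerically. The sketch below is only an illustration: it assumes complex Gaussian noise in a single frequency bin (so the magnitude is Rayleigh distributed), and the names `mag`, `residual` and `rectified` are introduced here, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: one frequency bin of complex Gaussian noise over many frames.
V = rng.normal(size=100_000) + 1j * rng.normal(size=100_000)
mag = np.abs(V)
mu = mag.mean()                        # sample estimate of mu[k]

residual = mag - mu                    # signed residual amplitude, before rectification
rectified = np.maximum(residual, 0.0)  # half-wave rectification clips negatives to zero

print(residual.mean())   # essentially zero by construction
print(rectified.mean())  # strictly positive
```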
In Chapter 2, we saw that the Kalman Filter can be used to estimate a signal corrupted by white noise. This suggests using the Kalman Filter to filter out the residual noise, provided we can approximate the residual noise as white noise.
[Figure: block diagram of the spectral subtraction method. Input 𝑥[𝑛]; blocks: STFT, Noise estimation, Subtract Bias, Half-Wave Rectify, Reduce Noise Residual, ISTFT; output 𝑠̂[𝑛].]
For the sake of simplicity, let us analyse only one particular frequency component 𝑘 instead of the full frequency spectrum. In the parts without speech activity, the relation between the magnitude of the residual noise |𝜖𝑘[𝑛]|, the magnitude of the clean speech |𝑆𝑘[𝑛]| and the post-subtraction magnitude |𝑆̂𝑘[𝑛]| is:
|𝑆̂𝑘[𝑛]| = 1 ∙ |𝑆𝑘[𝑛]| + |𝜖𝑘[𝑛]|
Because:
𝜖𝑘[𝑛] = 𝑆̂𝑘[𝑛] − 𝑆𝑘[𝑛]
⇒ 𝑆̂𝑘[𝑛] = 𝑆𝑘[𝑛] + 𝜖𝑘[𝑛]
⇒ |𝑆̂𝑘[𝑛]| = |𝑆𝑘[𝑛]| + |𝜖𝑘[𝑛]| (𝑤𝑖𝑡ℎ |𝑆𝑘[𝑛]| = 0)
That means the relation between |𝑆̂𝑘[𝑛]| and |𝑆𝑘[𝑛]| during the times without speech activity is linear. Furthermore, we can model |𝑆𝑘[𝑛]| as a random walk process whose increment has zero mean and zero variance:
|𝑆𝑘[𝑛]| = 1 ∙ |𝑆𝑘[𝑛 − 1]| + 𝑤[𝑛 − 1]
(𝑤 is a zero-mean, zero-variance random variable)
|𝑆̂𝑘[𝑛]| = 1 ∙ |𝑆𝑘[𝑛]| + |𝜖𝑘[𝑛]|
This is exactly the type of model to which we can apply a scalar Kalman Filter in order to filter out |𝜖𝑘[𝑛]|.
|𝑆𝑘[𝑛]| can be considered as the unknown signal, |𝑆̂𝑘[𝑛]| is the measurement, |𝜖𝑘[𝑛]| is the measurement noise, and 𝑤[𝑛] is the process noise.
This model can also be used for the parts of the noisy signal with speech activity. In that case, the variance of 𝑤[𝑛] is non-zero. However, the linear relation between |𝑆̂𝑘[𝑛]| and |𝑆𝑘[𝑛]| no longer holds exactly, because we must take into account the phase difference between 𝑆𝑘[𝑛] and 𝜖𝑘[𝑛].
Actually, if |𝑆𝑘[𝑛]| ≫ |𝜖𝑘[𝑛]|, we can still assume |𝑆̂𝑘[𝑛]| = |𝑆𝑘[𝑛]| + |𝜖𝑘[𝑛]|. According to the addition rule for two sinusoidal signals, the value of |𝑆̂𝑘[𝑛]| varies from |𝑆𝑘[𝑛]| − |𝜖𝑘[𝑛]| (phase difference π) to |𝑆𝑘[𝑛]| + |𝜖𝑘[𝑛]| (phase difference 0). So the maximum error is only 2 ∙ |𝜖𝑘[𝑛]|, which is insignificant compared to |𝑆𝑘[𝑛]|. Moreover, by applying spectral subtraction we have already ignored the phase difference between the speech and the noise (Assumption 3). Therefore, we will use the above model for the parts of the noisy signal both with and without speech activity.
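The range quoted above follows from the law of cosines for the sum of two complex numbers. In the short derivation below, the symbol φ for the phase difference between 𝑆𝑘[𝑛] and 𝜖𝑘[𝑛] is introduced only for illustration and is not part of the notation used elsewhere.

```latex
|\hat{S}_k[n]|^2 = |S_k[n]|^2 + |\epsilon_k[n]|^2 + 2\,|S_k[n]|\,|\epsilon_k[n]|\cos\varphi
% \varphi = 0:    |\hat{S}_k[n]| = |S_k[n]| + |\epsilon_k[n]|
% \varphi = \pi:  |\hat{S}_k[n]| = \bigl|\,|S_k[n]| - |\epsilon_k[n]|\,\bigr|
\frac{\bigl(|S_k[n]| + |\epsilon_k[n]|\bigr) - |\hat{S}_k[n]|}{|S_k[n]|}
  \le \frac{2\,|\epsilon_k[n]|}{|S_k[n]|}
  \qquad \text{for } |S_k[n]| \ge |\epsilon_k[n]|
```

For instance, if |𝑆𝑘[𝑛]| is 100 times larger than |𝜖𝑘[𝑛]|, the additive model is off by at most 2% of |𝑆𝑘[𝑛]|.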
Another problem we must take into account is that the Kalman Filter works best with Gaussian white noise. The more Gaussian and white the noises are, the more effective the Kalman Filter is. Normally, this condition is not fulfilled, because the magnitude of the residual noise is in reality usually not white, and the motion of the speech magnitude spectrum is not a Gaussian random walk process either. However, this does not mean that we cannot use the Kalman Filter; it only means that the filter results will not be optimal.
3.2.2. Scalar Kalman Filter
Before proceeding to the determination of the model's parameters, we state the equations of the Kalman Filter used to reduce the residual noise. This scalar Kalman Filter is applied to each frequency component 𝑘 separately. For convenience, let 𝑦𝑛 denote |𝑆𝑘[𝑛]|, 𝑧𝑛 denote |𝑆̂𝑘[𝑛]| and 𝜀𝑛 denote |𝜖𝑘[𝑛]|. We also denote the variance of |𝜖𝑘[𝑛]| by 𝑅𝑛 and the variance of 𝑤[𝑛] by 𝑄𝑛. So the model:
|𝑆𝑘[𝑛]| = |𝑆𝑘[𝑛 − 1]| + 𝑤[𝑛 − 1]
|𝑆̂𝑘[𝑛]| = 1 ∙ |𝑆𝑘[𝑛]| + |𝜖𝑘[𝑛]|
will be rewritten as:
$$y_n = y_{n-1} + w_{n-1}$$
$$z_n = y_n + \varepsilon_n$$
Our Kalman Filter equations are:
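Predict step:
$$\hat{y}_{n|n-1} = \hat{y}_{n-1|n-1}$$
$$P_{n|n-1} = P_{n-1|n-1} + Q_n$$
Update step:
$$K_n = \frac{P_{n|n-1}}{P_{n|n-1} + R_n}$$
$$\hat{y}_{n|n} = \hat{y}_{n|n-1} + K_n \cdot \left(z_n - \hat{y}_{n|n-1}\right)$$
$$P_{n|n} = (1 - K_n) \cdot P_{n|n-1}$$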
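The predict and update steps translate almost line for line into code. The following is a minimal sketch for a single frequency component; the function name `kalman_step` and its argument names are chosen here for illustration, and R is assumed to be strictly positive so that the gain is well defined.

```python
def kalman_step(y_prev, P_prev, z, Q, R):
    """One predict/update cycle of the scalar random-walk Kalman filter.

    y_prev, P_prev: previous filtered estimate and its error variance
    z: current measurement; Q, R: process and measurement noise variances
    """
    # Predict step: the random-walk model carries the estimate over unchanged.
    y_pred = y_prev
    P_pred = P_prev + Q
    # Update step: blend prediction and measurement according to the gain K.
    K = P_pred / (P_pred + R)
    y_new = y_pred + K * (z - y_pred)
    P_new = (1.0 - K) * P_pred
    return y_new, P_new, K


y, P, K = kalman_step(y_prev=0.0, P_prev=1.0, z=2.0, Q=0.0, R=1.0)
print(y, P, K)  # 1.0 0.5 0.5
```

With equal prediction and measurement variances the gain is 0.5, so the new estimate lies halfway between the prediction and the measurement.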
There are two parameters we need to determine: the measurement noise variance 𝑅𝑛 and the process noise variance 𝑄𝑛.
3.2.3. Measurement noise variance R
The variance 𝑅𝑛 is easy to determine, because it is the variance of the residual noise magnitude. With the stationary noise assumption (Assumption 2), 𝑅𝑛 is constant over time. Therefore, just like the expectation of the noise magnitude used for spectral subtraction, we estimate 𝑅 by calculating the sample variance of the noise magnitude taken from a part without speech activity:
$$R \approx \frac{1}{N} \sum_{n=1}^{N} \left(|V[n, k]| - \mu[k]\right)^2$$
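As a rough sketch, 𝜇[𝑘] and 𝑅 can be estimated together from a noise-only stretch of the STFT. The function name `estimate_noise_stats` and the array layout (frames along the first axis, frequency bins along the second) are assumptions made for this example.

```python
import numpy as np


def estimate_noise_stats(V_noise):
    """Estimate mu[k] and R per frequency bin from noise-only STFT frames.

    V_noise: complex array of shape (frames, bins), assumed to contain no speech.
    """
    mag = np.abs(V_noise)
    mu = mag.mean(axis=0)               # mean magnitude per bin
    R = ((mag - mu) ** 2).mean(axis=0)  # sample variance with the 1/N normalisation
    return mu, R


# Tiny hand-checkable input: two frames, two bins.
V_noise = np.array([[3 + 4j, 1 + 0j],
                    [0 + 1j, 0 + 3j]])
mu, R = estimate_noise_stats(V_noise)
print(mu)  # [3. 2.]
print(R)   # [4. 1.]
```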
3.2.4. Process noise variance Q
Determining 𝑄𝑛 is more complicated than determining 𝑅 because, unlike the background noise, the speech magnitude spectrum is not stationary over time. In the parts of the noisy signal without speech activity, 𝑄𝑛 = 0, and in the parts with speech activity, 𝑄𝑛 > 0. Instead of setting 𝑄𝑛 to zero during non-speech parts, we use the following formula to estimate 𝑄𝑛 in all cases:
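$$Q_n \approx \frac{1}{M} \sum_{m=n-M+1}^{n} \left(\Delta\hat{y}_m - \mathrm{Mean}(\Delta\hat{y})\right)^2$$
$$\Delta\hat{y}_n = \hat{y}_{n|n} - \hat{y}_{n-1|n-1}$$
(Mean(∆𝑦̂) is the mean of ∆𝑦̂𝑚 over the same window of 𝑀 values)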
That means we estimate 𝑄𝑛 by calculating the sample variance of ∆𝑦̂𝑛, which is the difference between consecutive filtered values of 𝑦. The reason is:
𝐸[∆𝑦̂𝑛] = 𝐸[𝑦̂𝑛|𝑛− 𝑦̂𝑛−1|𝑛−1] ≈ 𝐸[𝑦𝑛 − 𝑦𝑛−1] = 𝐸[𝑤𝑛−1]
⇒ 𝐸[𝑉𝑎𝑟(∆𝑦̂𝑛)] ≈ 𝐸[𝑉𝑎𝑟(𝑤𝑛)] = 𝐸[𝑄𝑛]
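The sliding-window estimate of 𝑄𝑛 can be sketched as follows. The function name `estimate_Q` is hypothetical, the window length M is assumed to be at least 1, and the mean is taken over the same window as the variance.

```python
import numpy as np


def estimate_Q(y_filtered, M):
    """Sample variance of the last M first differences of the filtered values."""
    diffs = np.diff(np.asarray(y_filtered, dtype=float))
    window = diffs[-M:]          # fewer than M values early in the signal
    if window.size == 0:
        return 0.0
    return float(np.mean((window - window.mean()) ** 2))


Q_flat = estimate_Q([1.0, 1.0, 1.0, 1.0], M=3)   # constant track, no motion
Q_burst = estimate_Q([0.0, 0.0, 4.0, 4.0], M=3)  # one sudden jump in the window
print(Q_flat)   # 0.0
print(Q_burst)  # larger than zero because of the jump
```

A constant track gives an estimate of exactly zero, which is the situation during speech pauses described in the text.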
While workable for the parts without speech activity, this method cannot estimate 𝑄𝑛 well in the parts with speech activity, because speech arrives in short bursts. For example, during a part without speech activity, the estimated 𝑄𝑛 is close to zero. This in turn leads to a small 𝑃𝑛|𝑛−1 compared to 𝑅, which means the filter trusts the prediction 𝑦̂𝑛|𝑛−1 more than the measurement 𝑧𝑛. At the transition from a part without speech activity to a part with speech activity, the real dynamic process changes, but the filter is still using the old model parameters, which leads to a bad estimate of 𝑦𝑛. The more sudden the change, the worse the estimate. A bad estimate of 𝑦𝑛 in turn leads to a bad estimate of 𝑄𝑛, and the vicious cycle continues. If the dynamic process changed gradually, the filter could keep up with the change. However, speech signals are short bursts, not gradually changing processes.
The solution is to increase 𝑃𝑛|𝑛−1 whenever the dynamic model no longer fits reality, so that the filter trusts the measurement 𝑧𝑛 more than the prediction based on the old model. To detect when the model no longer fits reality, we compare the difference |𝑧𝑛 − 𝑦̂𝑛|𝑛−1| with a threshold value:
$$\mathrm{Threshold} = \tau \sqrt{R + P_{n|n-1}}$$
(𝜏 is a chosen factor)
If the difference |𝑧𝑛 − 𝑦̂𝑛|𝑛−1| is greater than the threshold, we conclude that the dynamic model does not fit reality. The justification is as follows. 𝑧𝑛 is a Gaussian random variable with 𝐸[𝑧𝑛] = 𝑦𝑛 and 𝑉𝑎𝑟[𝑧𝑛] = 𝑅, and 𝑦̂𝑛|𝑛−1 is also a Gaussian random variable with 𝐸[𝑦̂𝑛|𝑛−1] = 𝑦𝑛 and 𝑉𝑎𝑟[𝑦̂𝑛|𝑛−1] = 𝑃𝑛|𝑛−1. Assuming the two are independent, 𝑧𝑛 − 𝑦̂𝑛|𝑛−1 is a Gaussian random variable with 𝐸[𝑧𝑛 − 𝑦̂𝑛|𝑛−1] = 0 and 𝑉𝑎𝑟[𝑧𝑛 − 𝑦̂𝑛|𝑛−1] = 𝑅 + 𝑃𝑛|𝑛−1. Hence, as long as the model holds, the probability that |𝑧𝑛 − 𝑦̂𝑛|𝑛−1| > 𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 is small when 𝜏 is large.
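To give a feel for how 𝜏 controls the test, the sketch below implements the comparison and evaluates the Gaussian tail probability. It takes the threshold as 𝜏 times the standard deviation of 𝑧𝑛 − 𝑦̂𝑛|𝑛−1, that is 𝜏 times the square root of 𝑅 + 𝑃𝑛|𝑛−1; the function names are hypothetical.

```python
import math


def exceeds_threshold(z, y_pred, P_pred, R, tau):
    """True when |z - y_pred| is larger than tau * sqrt(R + P_pred)."""
    return abs(z - y_pred) > tau * math.sqrt(R + P_pred)


def false_alarm_probability(tau):
    """P(|x| > tau * sigma) for a zero-mean Gaussian x with standard deviation sigma."""
    return math.erfc(tau / math.sqrt(2.0))


print(exceeds_threshold(z=5.0, y_pred=0.0, P_pred=0.0, R=1.0, tau=3.0))  # True
print(exceeds_threshold(z=2.0, y_pred=0.0, P_pred=0.0, R=1.0, tau=3.0))  # False
print(false_alarm_probability(3.0))  # about 0.0027
```

Under the Gaussian assumption, 𝜏 = 3 therefore flags a frame by mistake roughly 3 times in 1000 while the model still holds.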
3.2.5. Algorithm
The scalar Kalman Filter for each frequency component is described by the flowchart below:
Figure 7: Flowchart of Kalman Filter with each frequency component
[Flowchart contents, in order of appearance:
Start: 𝑄 ≔ 0; 𝑃 ≔ 0; 𝑦[0] ≔ 0; 𝑛 ≔ 1
Decision: 𝑛 > 𝑙𝑒𝑛(𝑆𝑘)? If true, output 𝑦 and end.
𝑦[𝑛] ≔ 𝑦[𝑛 − 1]; 𝑃 ≔ 𝑃 + 𝑄; 𝑧[𝑛] ≔ |𝑆̂𝑘[𝑛]|
Decision: |𝑧[𝑛] − 𝑦[𝑛]| > 𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑?
Decision: Threshold was exceeded in other frequencies?
𝑃 ≔ (𝑧[𝑛] − 𝑦[𝑛])²
𝐾 ≔ 𝑃 ⁄ (𝑃 + 𝑅); 𝑦[𝑛] ≔ 𝑦[𝑛] + 𝐾(𝑧[𝑛] − 𝑦[𝑛]); 𝑃 ≔ 𝑃(1 − 𝐾)
𝐷𝑖𝑓𝑓[𝑛] ≔ 𝑦[𝑛] − 𝑦[𝑛 − 1]; estimate 𝑄 from 𝐷𝑖𝑓𝑓; 𝑛 ≔ 𝑛 + 1]
Figure 8: Block diagram of the presented method
[Block diagram contents: input 𝑥[𝑛]; blocks: STFT, Noise estimation, Subtract Bias, Kalman Filter, Half-Wave Rectify, ISTFT; output 𝑠̂[𝑛].]
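The steps of the flowchart can be put together in a short sketch for one frequency component. Several details are assumptions rather than part of the source: the function name `kalman_track`, the window length `M`, the default `tau`, and the reading of the two decisions, namely that 𝑃 is reset to the squared difference only when the threshold is exceeded both at this frequency and at other frequencies. The cross-frequency information is passed in as the hypothetical boolean array `exceeded_elsewhere`.

```python
import numpy as np


def kalman_track(z, R, tau=3.0, M=10, exceeded_elsewhere=None):
    """Filter the post-subtraction magnitudes z of one frequency component.

    R must be strictly positive. exceeded_elsewhere[n] says whether the
    threshold was also exceeded at other frequencies in frame n.
    """
    z = np.asarray(z, dtype=float)
    if exceeded_elsewhere is None:
        # Assumption: without cross-frequency information, treat the check as passed.
        exceeded_elsewhere = np.ones(len(z), dtype=bool)
    y = np.zeros(len(z) + 1)      # y[0] := 0 is the initial estimate
    diffs = []
    Q, P = 0.0, 0.0
    for n in range(1, len(z) + 1):
        y[n] = y[n - 1]           # predict step
        P = P + Q
        innovation = z[n - 1] - y[n]
        if abs(innovation) > tau * np.sqrt(R + P) and exceeded_elsewhere[n - 1]:
            P = innovation ** 2   # model mismatch: make the filter trust z
        K = P / (P + R)           # update step
        y[n] = y[n] + K * innovation
        P = P * (1.0 - K)
        diffs.append(y[n] - y[n - 1])
        Q = float(np.var(diffs[-M:]))  # sample variance of the recent differences
    return y[1:]


# Five silent frames followed by a sudden burst of magnitude 10.
z = [0.0] * 5 + [10.0] * 5
y = kalman_track(z, R=1.0)
print(np.round(y, 2))
```

In this toy input the estimate stays at zero during the silent frames and jumps close to 10 in the first frame of the burst, which is the behaviour the threshold test is meant to provide.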