J. Software Engineering & Applications, 2010, 3: 341-346
doi:10.4236/jsea.2010.34039 Published Online April 2010 (http://www.SciRP.org/journal/jsea)

Sudden Noise Reduction Based on GMM with Noise Power Estimation

Nobuyuki Miyake, Tetsuya Takiguchi, Yasuo Ariki

Graduate School of Engineering, Kobe University, Kobe, Japan.
Email: takigu@kobe-u.ac.jp

Received January 5th, 2010; revised February 27th, 2010; accepted March 2nd, 2010.

ABSTRACT

This paper describes a method for reducing sudden noise using noise detection and classification methods together with noise power estimation. Sudden noise detection and classification were dealt with in our previous study. In this paper, GMM-based noise reduction is performed using the detection and classification results. The classification tells us what kind of noise we are dealing with, but its power remains unknown. We solve this problem by combining an estimation of noise power with the noise reduction method. In our experiments, the proposed method achieved good performance in the recognition of utterances overlapped by sudden noises.

Keywords: Sudden Noise, Model-Based Noise Reduction, Speech Recognition

1. Introduction

Sudden and short-term noises often degrade the performance of a speech recognition system. To recognize the speech data correctly, noise reduction or model adaptation to the sudden noise is required. However, it is difficult to remove such noises because we do not know where the noise overlapped the speech or what the noise was.

There have been many studies on non-stationary noise reduction in a single channel [1-4]; among these non-stationary noises, the target of our study is mainly sudden noise. There have also been many studies on model-based noise reduction [5-7]. These methods are effective for additive noises, but they are difficult to apply directly to sudden noise reduction because they require information about the noise in order to be carried out.

In our previous study [8], we proposed detecting and classifying these noises before removing them. A problem remains, however: the noise power cannot be obtained from the classification results, although the kind of noise can be estimated. In this paper, we propose a noise reduction method that uses the results of noise detection and classification, and that integrates noise power estimation with GMM-based noise reduction to solve the aforementioned problem.

2. System Overview

Figure 1 shows an overview of the noise reduction system. The speech waveform is split into small segments using a window function, and each segment is converted to a feature vector, a log Mel-filter bank vector. Next, the system identifies whether or not the feature vector is noisy speech overlapped by a sudden noise, using a non-linear classifier based on AdaBoost. For each frame detected as noisy, the system then identifies the type of sudden noise using a multi-class classifier. Finally, a noise reduction method based on GMM is applied. Although we apply the proposed technique to the output of AdaBoost, it can equally be applied to the output of another binary classification technique, such as an SVM.

Figure 1. System overview of sudden noise reduction

3. Clustering Noise

There are many kinds of noises in a real environment. The smaller the difference between the noise seen in training and the noise overlapping the test data, the better the performance of the noise reduction method in Section 5. Since potential noises are so varied, they need to be grouped by noise type in some way. We therefore build a tree of noise types based on the k-means method, using the log Mel-filter bank as the noise feature.

3.1 K-Means Clustering Limited by Distance to Center

K-means clustering usually fixes the number of classes in advance. In our method, the number of classes is decided automatically by adding classes until the distance d between each data point and the center of its class is smaller than an upper limit $\theta$ decided beforehand.

First, all data are clustered using the k-means clustering method. Next, we calculate the distance d between each data point and the center of the class to which it belongs. If the distance exceeds the limit (d > $\theta$), that class is divided into two classes and k-means clustering is performed again. This step is repeated until all the distances are less than $\theta$. The noise data used for noise reduction is given as the mean value of each class. Thus, the smaller the upper limit $\theta$, the higher the expected noise reduction performance, because the variance within each class becomes smaller.

3.2 Tree of Noise Types

One problem with the above k-means algorithm is that too many classes may be created when $\theta$ is set small. We solve this problem by building a tree with the above k-means clustering: $\theta$ is first set to a large value and all the data are clustered, and the deeper the tree level, the smaller $\theta$ becomes. In this paper, $\theta$ is halved with each level increment on the noise tree. Figure 2 shows an example of one such tree. The clustering is performed using the mean vectors of each type of noise.

Figure 2. An example of a tree of noise types
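To make the clustering procedure in Sections 3.1 and 3.2 concrete, the following is a minimal sketch in Python. It is our illustration rather than the authors' code: scikit-learn's KMeans stands in for the inner clustering step, and the function names and the nested-dict tree representation are assumptions of the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_limited(X, theta, seed=0):
    """K-means in which classes are split until every point lies within
    distance theta of its class center (Section 3.1).
    X: (n_samples, n_features) log Mel-filter bank noise vectors."""
    clusters = [X]                      # start from a single class
    done = False
    while not done:
        done = True
        refined = []
        for C in clusters:
            d = np.linalg.norm(C - C.mean(axis=0), axis=1)
            if len(C) > 1 and d.max() > theta:
                # this class violates the limit: divide it in two and re-cluster
                lab = KMeans(n_clusters=2, n_init=10,
                             random_state=seed).fit_predict(C)
                refined += [C[lab == 0], C[lab == 1]]
                done = False
            else:
                refined.append(C)
        clusters = refined
    return clusters

def build_noise_tree(X, theta, depth):
    """Tree of noise types (Section 3.2): theta is halved at each level.
    Each node stores the class mean used as the noise data for reduction."""
    node = {"mean": X.mean(axis=0), "children": []}
    if depth > 1 and len(X) > 1:
        for C in kmeans_limited(X, theta):
            node["children"].append(build_noise_tree(C, theta / 2.0, depth - 1))
    return node
```

With the settings of Table 1, this would be called as build_noise_tree(noise_means, 50, 5), where noise_means holds the mean log Mel-filter bank vector of each noise type.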
4. Noise Detection and Classification

4.1 Noise Detection

Noise detection and classification are described in [8]. A non-linear classifier H(x), which separates clean speech features from noisy speech features, is learned using AdaBoost. Boosting is a voting method that combines weighted weak classifiers, and AdaBoost is one boosting method [9]. The AdaBoost algorithm is as follows.

Input: n examples $Z = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $y_i \in \{-1, 1\}$ is the label of $x_i$.

Initialize:

$$w_1(z_i) = \begin{cases} \dfrac{1}{2m}, & \text{if } y_i = 1 \\[4pt] \dfrac{1}{2l}, & \text{if } y_i = -1 \end{cases}$$

where m is the number of positive examples and l is the number of negative examples.

Do for t = 1, ..., T:

1) Train a base learner with respect to the weighted example distribution $w_t$ and obtain the hypothesis $h_t: x \rightarrow \{-1, 1\}$.

2) Calculate the training error $e_t$ of $h_t$:

$$e_t = \sum_{i=1}^{n} w_t(z_i)\,\frac{1 - y_i h_t(x_i)}{2}$$

3) Set

$$\alpha_t = \log\frac{1 - e_t}{e_t}$$

4) Update the example distribution:

$$w_{t+1}(z_i) = \frac{w_t(z_i)\exp\{-\alpha_t y_i h_t(x_i)\}}{\sum_{j=1}^{n} w_t(z_j)\exp\{-\alpha_t y_j h_t(x_j)\}}$$

Output: final hypothesis

$$f(x) = \sum_t \alpha_t h_t(x)$$

The AdaBoost algorithm thus uses a set of training data $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $x_i$ is the i-th feature vector of the observed signal and $y_i$ is its label. For noise detection, we consider just two possible labels, $Y = \{-1, 1\}$, where label 1 means noisy speech and label -1 means speech only. In this paper, single-level decision trees (also known as decision stumps) are used as the weak classifiers, and the threshold of f(x) is 0:

$$H(x) = \begin{cases} \text{noisy speech}, & \text{if } f(x) \ge 0 \\ \text{clean speech}, & \text{if } f(x) < 0 \end{cases} \qquad (1)$$

Using this classifier, we determine whether or not each frame is noisy.
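The detection algorithm above maps directly to code. The following is a minimal numpy sketch of discrete AdaBoost with decision stumps, mirroring steps 1)-4) and the decision rule (1); it is our illustration, not the authors' implementation, and the exhaustive stump search is written for clarity rather than speed.

```python
import numpy as np

def train_stump(X, y, w):
    """Pick the single-feature threshold classifier (decision stump)
    with the smallest weighted error under the distribution w."""
    best = (0, 0.0, 1, np.inf)            # (feature, threshold, polarity, error)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (1, -1):             # polarity of the stump
                pred = np.where(s * (X[:, j] - thr) >= 0, 1, -1)
                err = np.sum(w * (pred != y))
                if err < best[3]:
                    best = (j, thr, s, err)
    return best

def adaboost(X, y, T=200):
    """Discrete AdaBoost of Section 4.1; y must take values in {-1, +1}."""
    m, l = np.sum(y == 1), np.sum(y == -1)
    w = np.where(y == 1, 1.0 / (2 * m), 1.0 / (2 * l))  # initial distribution
    ensemble = []
    for _ in range(T):
        j, thr, s, e = train_stump(X, y, w)
        e = min(max(e, 1e-10), 1 - 1e-10)  # guard the log below
        alpha = np.log((1 - e) / e)        # step 3)
        pred = np.where(s * (X[:, j] - thr) >= 0, 1, -1)
        w = w * np.exp(-alpha * y * pred)  # step 4): re-weight examples
        w = w / w.sum()
        ensemble.append((alpha, j, thr, s))
    return ensemble

def detect(ensemble, x):
    """H(x) of (1): +1 for noisy speech, -1 for clean speech."""
    f = sum(a * (1 if s * (x[j] - thr) >= 0 else -1)
            for a, j, thr, s in ensemble)
    return 1 if f >= 0 else -1
```

In practice, an off-the-shelf implementation such as scikit-learn's AdaBoostClassifier over depth-1 decision trees would serve the same purpose, with T = 200 weak learners as in Table 1.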
4.2 Noise Classification

Noise classification is performed on each frame detected as noisy speech. If a frame contained noise only, it could be classified by calculating its distance from templates; however, the frame is assumed to contain speech as well. In this paper, we therefore use AdaBoost for noise classification too: AdaBoost is extended to multi-class classification using the one-vs-rest method, creating a multi-class classifier as follows.

Input: m examples $\{(x_1, y_1), \ldots, (x_m, y_m)\}$, $y_i \in \{1, \ldots, K\}$

Do for k = 1, ..., K:

1) Set the labels
$$y_i^k = \begin{cases} 1, & \text{if } y_i = k \\ -1, & \text{otherwise} \end{cases}$$

2) Learn the k-th classifier $f^k(x)$ using AdaBoost on the data set $Z^k = \{(x_1, y_1^k), \ldots, (x_m, y_m^k)\}$.

Final classifier:
$$\hat{k} = \arg\max_k f^k(x)$$

This classifier is built at each node of the noise tree, where K is the number of noise classes in that node. In this paper, each node has from 2 to 5 classes.

5. Noise Reduction Method

5.1 Noisy Speech

The observed signal feature $X_b(t)$, the energy in filter b of the Mel-filter bank at frame t, can be written as follows using the clean speech $S_b(t)$ and the additive noise $N_b(t)$:

$$X_b(t) = S_b(t) + N_b(t) \qquad (2)$$

In this paper, we suppose that noises are detected and classified but that the SNR is unknown; in other words, the kind of additive noise is estimated but its power is not. Therefore, a parameter $\alpha$, which adjusts the noise power, is introduced as follows:

$$X_b(t) = S_b(t) + \alpha N_b(t) \qquad (3)$$

In this case, the log Mel-filter bank feature $x_b(t) = \log X_b(t)$ is

$$x_b(t) = \log\{\exp(s_b(t)) + \alpha \exp(n_b(t))\} = s_b(t) + \log\{1 + \alpha \exp(n_b(t) - s_b(t))\} = s_b(t) + G_b(s(t), n(t), \alpha) \qquad (4)$$

The clean speech feature $s_b(t)$ can be obtained by estimating $G_b(s(t), n(t), \alpha)$ and subtracting it from $x_b(t)$.

5.2 Speech Feature Estimation Based on GMM

The GMM-based noise reduction method is performed to estimate s(t) [5,6]. (In [5,6], the noise power parameter is not considered.) The algorithm estimates the value of the noise term using a clean speech GMM in the log Mel-filter bank domain. The statistical model of clean speech is given as an M-component Gaussian mixture model:

$$p(s) = \sum_{m=1}^{M} \Pr(m)\, N(s; \mu_{s,m}, \Sigma_{s,m}) \qquad (5)$$

Here, N(·) denotes the normal distribution, and $\mu_{s,m}$ and $\Sigma_{s,m}$ are the mean vector and the covariance matrix of the clean speech s(t) in mixture m. The noisy speech model is derived from this model as follows:

$$p(x) = \sum_{m=1}^{M} \Pr(m)\, N(x; \mu_{x,m}, \Sigma_{x,m}) \qquad (6)$$

$$\mu_{x,m} = \mu_{s,m} + G(\mu_{s,m}, \mu_n, \alpha) \qquad (7)$$

$$\Sigma_{x,m} = \Sigma_{s,m} \qquad (8)$$

where $\mu_n$ is the mean vector of one of the noise classes, chosen according to the result of the noise classification. The estimated value of $G(s, n, \alpha)$ is then given as

$$\hat{G}(s, n, \alpha) = \sum_m p(m|x)\, G(\mu_{s,m}, \mu_n, \alpha) \qquad (9)$$

where

$$p(m|x) = \frac{\Pr(m)\, N(x; \mu_{x,m}, \Sigma_{x,m})}{\sum_{m'} \Pr(m')\, N(x; \mu_{x,m'}, \Sigma_{x,m'})} \qquad (10)$$

The clean speech feature s is estimated by subtracting $\hat{G}$ from the feature x of the observed signal:

$$\hat{s} = x - \hat{G}(s, n, \alpha) \qquad (11)$$

5.3 Noise Power Estimation Based on EM Algorithm

The parameter $\alpha$, which adjusts the noise power, is unknown, so (9) cannot be used directly: both $\mu_{x,m}$ and p(m|x) depend on $\alpha$. In this paper, this parameter is calculated by the EM algorithm, which estimates the noise power by maximizing p(x), the likelihood of a noisy speech feature. p(x) is written as (6), in which $\mu_{x,m}$ depends on $\alpha$, so we replace p(x) with p(x|$\alpha$) and calculate the noise power parameter by maximizing the likelihood p(x|$\alpha$) using the EM algorithm.

E-step:
$$Q(\alpha, \alpha^{(k)}) = \sum_m p(m, x \mid \alpha^{(k)}) \log p(m, x \mid \alpha) \qquad (12)$$

M-step:
$$\alpha^{(k+1)} = \arg\max_{\alpha} Q(\alpha, \alpha^{(k)}) \qquad (13)$$

where k is the iteration index.
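The reduction step of (9)-(11) is straightforward once $\alpha$ is known. The following is a minimal sketch assuming a diagonal-covariance clean-speech GMM and an already-estimated $\alpha$; the GMM parameters, the noise class mean mu_n, and the function name are assumptions of this illustration, not the authors' code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def reduce_noise(x, weights, mu_s, var_s, mu_n, alpha):
    """GMM-based noise reduction in the log Mel-filter bank domain.
    x      : (B,) observed noisy feature vector
    weights: (M,) mixture priors Pr(m)
    mu_s   : (M, B) clean speech means; var_s: (M, B) diagonal variances
    mu_n   : (B,) mean of the classified noise class
    alpha  : noise power parameter (Section 5.3)
    Returns the clean speech estimate s_hat of (11)."""
    # G(mu_{s,m}, mu_n, alpha) of (4)/(7), evaluated at each mixture mean
    G = np.log1p(alpha * np.exp(mu_n - mu_s))            # (M, B)
    mu_x = mu_s + G                                      # shifted means, (7)
    # Posterior p(m|x) of (10); covariances are shared with clean speech, (8)
    lik = np.array([multivariate_normal.pdf(x, mu_x[m], np.diag(var_s[m]))
                    for m in range(len(weights))])
    post = weights * lik
    post = post / post.sum()
    # G_hat of (9), then the subtraction of (11)
    return x - post @ G
```

For long feature sequences one would use multivariate_normal.logpdf with a log-sum-exp normalization to avoid underflow; the direct form above keeps the correspondence with (9)-(11) visible.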
The above two steps are repeated until $\alpha^{(k)}$ converges to the optimum solution. In the M-step, the solution is found by solving the following equation:

$$\frac{\partial Q(\alpha, \alpha^{(k)})}{\partial \alpha} = 0 \qquad (14)$$

This equation can be expanded as follows:

$$\frac{\partial Q(\alpha, \alpha^{(k)})}{\partial \alpha} = \sum_m p(m, x \mid \alpha^{(k)})\,\frac{\partial \log p(m, x \mid \alpha)}{\partial \alpha} = \sum_m p(m, x \mid \alpha^{(k)}) \sum_b \frac{x_b - \mu_{s,m,b} - \log\{1 + \alpha \exp(\mu_{n,b} - \mu_{s,m,b})\}}{\sigma_{s,m,b}^2} \cdot \frac{\exp(\mu_{n,b} - \mu_{s,m,b})}{1 + \alpha \exp(\mu_{n,b} - \mu_{s,m,b})} \qquad (15)$$

However, it is difficult to solve this equation analytically, so Newton's method is applied: an approximation of the optimum solution is refined repeatedly as

$$f_1 = \frac{\partial Q(\alpha, \alpha^{(k)})}{\partial \alpha}\bigg|_{\alpha=\alpha^{(l)}}, \quad f_2 = \frac{\partial^2 Q(\alpha, \alpha^{(k)})}{\partial \alpha^2}\bigg|_{\alpha=\alpha^{(l)}}, \quad \alpha^{(l+1)} = \alpha^{(l)} - \frac{f_1}{f_2} \qquad (16)$$

Equation (16) is iterated until $\alpha$ converges. The initial value of Newton's method was set to 0.
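The overall estimation loop of (12)-(16) can be sketched as follows. To keep the sketch short, the first and second derivatives of Q are taken by central differences rather than by coding (15) and its derivative analytically; that substitution, the non-negativity clamp on alpha, and all names are our own choices, not the authors'.

```python
import numpy as np

def log_joint(x, weights, mu_s, var_s, mu_n, alpha):
    """log p(m, x | alpha) for every mixture m, under the model (6)-(8)."""
    # Floor the argument of log1p so that finite differencing near
    # alpha = 0 cannot produce an invalid logarithm.
    shift = np.log1p(np.maximum(alpha * np.exp(mu_n - mu_s), -1 + 1e-12))
    mu_x = mu_s + shift
    return (np.log(weights)
            - 0.5 * np.sum(np.log(2 * np.pi * var_s), axis=1)
            - 0.5 * np.sum((x - mu_x) ** 2 / var_s, axis=1))

def estimate_alpha(x, weights, mu_s, var_s, mu_n,
                   em_iters=10, newton_iters=10, h=1e-4):
    """EM estimation of the noise power parameter alpha (Section 5.3)."""
    alpha = 0.0                               # initial value, as in the paper
    for _ in range(em_iters):
        # E-step: weights p(m, x | alpha^(k)); normalizing them does not
        # change the maximizer of Q in the M-step.
        w_k = np.exp(log_joint(x, weights, mu_s, var_s, mu_n, alpha))
        w_k = w_k / w_k.sum()

        def Q(a):                             # Q(alpha, alpha^(k)) of (12)
            return np.sum(w_k * log_joint(x, weights, mu_s, var_s, mu_n, a))

        # M-step: Newton iterations (16) with numerical derivatives
        a = alpha
        for _ in range(newton_iters):
            f1 = (Q(a + h) - Q(a - h)) / (2 * h)
            f2 = (Q(a + h) - 2 * Q(a) + Q(a - h)) / h ** 2
            if abs(f2) < 1e-12:
                break
            a_next = max(a - f1 / f2, 0.0)    # keep the power non-negative
            if abs(a_next - a) < 1e-6:
                a = a_next
                break
            a = a_next
        alpha = a
    return alpha
```

The estimate would then feed the reduction step of Section 5.2, e.g. s_hat = reduce_noise(x, weights, mu_s, var_s, mu_n, estimate_alpha(x, weights, mu_s, var_s, mu_n)).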
6. Experiments

In order to evaluate the proposed method, we carried out isolated word recognition experiments using the ATR database for speech data and the RWCP corpus for noise data [10].

6.1 Experimental Conditions

The experimental conditions are shown in Table 1. All features were extracted using a 20-ms window with a 10-ms frame shift. The ATR database contains word utterances recorded from ten different people. The RWCP corpus contains 105 types of noises [10], including, for example, telephone sounds, the beating of wood and the tearing of paper. Each kind of noise consists of 100 data samples, divided into 50 samples for testing and 50 samples for training. The noise tree was made using the mean vectors of the training samples, and these vectors were divided into 37 classes (the total number of leaves).

The classifiers for detection and classification were trained on noisy speech features, so we made noisy utterances for each class by adding noises to 2,000 × 10 clean utterances of 10 persons (five men, five women), taken from the Japanese word utterances in the ATR database, with the SNR adjusted to between −5 dB and 5 dB. The GMM for noise reduction and the HMM for recognition were trained on the same 2,000 × 10 clean utterances. For test data, we used 500 × 10 different word utterances by the same 10 persons; noises overlapped each test utterance, with the SNR adjusted to −5, 0 or 5 dB and the duration of each noise set between 10 and 200 ms. Figure 3 shows an example of noisy speech.

Table 1. Experimental conditions

Making tree
  Feature parameters: 24-log Mel-filter bank
  Tree depth: 5
  Upper limit θ (in order of depth level): 50, 25, 12, 6

Detection and classification
  Feature parameters: 24-log Mel-filter bank
  Number of weak learners: 200

Noise reduction
  Feature parameters: 24-log Mel-filter bank
  Number of components of GMM: 16, 32, 64

Speech recognition
  Feature parameters: 12-MFCC + Δ + ΔΔ
  Acoustic models: phoneme HMM (5 states, 12 mixtures)
  Lexicon: 500 words

Figure 3. An example of noisy speech

6.2 Experimental Results

Table 2 shows the results of detection and classification. "Recall" is the ratio of correctly detected noisy frames to all noisy frames, "Precision" is the ratio of correctly detected noisy frames to all detected frames, and "Classification" is the rate of correctly classified frames among the detected noisy frames. The Recall and Precision values in this table are high, which means the noise is detected well. The classification rate was low, however. Even when the classification result differs from the real noise label, though, if the noise is classified into a class close to the real one, the negative effect on noise reduction may be negligible.

Table 2. Results of detection and classification

                  5 dB    0 dB    −5 dB
  Recall          0.850   0.908   0.942
  Precision       0.861   0.868   0.871
  Classification  0.290   0.382   0.406

Figure 4 shows the recognition rate for each SNR as a function of the number of GMM components used for noise reduction. In Figure 4, "Baseline" means that noise reduction was not applied; "No estimation of noise power" means that power estimation was not performed in the GMM-based noise reduction (α = 1 in (11)); "EM algorithm" means that the noise power was estimated using the method of Section 5.3; and "Oracle label" means that correct detection and classification results were given (in this case, 64 Gaussian components were used). When no noises were present, the recognition rate was 97.4%. As shown in Figure 4, the recognition rate was improved by using the proposed method; furthermore, the proposed method outperforms the variant without power estimation.

Figure 4. Recognition results at SNRs of −5 dB, 0 dB and 5 dB (recognition rate [%] versus the number of GMM components used for noise reduction)

6.3 Experiments for Unknown Noise

We examined the effectiveness of the proposed method for unknown noises using 10-fold cross-validation over noise types: the 105 types of noise were divided into 10 sets, with 9 sets used for training and 1 set for testing. The noise tree and the classifiers were created from the training sets, and test data were made from the test sets. The experimental conditions were the same as in Table 1, except that only 64 Gaussian mixture components were examined for noise reduction. Table 3 shows the detection results; the classification rate cannot be evaluated here because the classes of the noises that overlapped the utterances are not defined. Figure 5 shows the recognition rate for unknown noises on the test sets. As shown in Figure 5, the proposed method improved the word recognition rate for unknown noises. In comparison with the "Oracle label" results, however, the speech recognition performance degraded due to the differences between the training and test noise data.

Table 3. Results of detection for unknown noises

              5 dB    0 dB    −5 dB
  Recall      0.831   0.886   0.926
  Precision   0.849   0.856   0.860

Figure 5. Recognition results for word utterances mixed with unknown noises (baseline versus EM algorithm at each SNR)
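As an aside on the data preparation in Section 6.1, overlapping a noise burst on a clean utterance at a target SNR reduces to scaling the noise by the ratio of the measured powers. A minimal sketch, assuming 1-D numpy sample arrays at a common sampling rate; the function name and the random placement of the burst are our own choices.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng=None):
    """Overlap a short noise burst on a clean utterance at a given SNR.
    speech, noise: 1-D float arrays; the burst must be shorter than the
    utterance (10-200 ms bursts in Section 6.1)."""
    rng = rng or np.random.default_rng(0)
    start = rng.integers(0, len(speech) - len(noise))    # burst position
    segment = speech[start:start + len(noise)]
    p_s = np.mean(segment ** 2)                          # local speech power
    p_n = np.mean(noise ** 2)                            # noise power
    # Choose gain g so that 10*log10(p_s / (g**2 * p_n)) == snr_db
    g = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    mixed = speech.copy()
    mixed[start:start + len(noise)] += g * noise
    return mixed
```

Whether the paper measures the speech power over the whole utterance or only over the overlapped segment is not stated; the sketch uses the overlapped segment.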
7. Conclusions

In this paper, we have described a sudden noise reduction method. Noise detection and classification are performed using AdaBoost, and GMM-based noise reduction is performed using the detection and classification results. By combining an estimation of noise power with the noise reduction method, we solved the problem of word recognition when the noise power is unknown. Our proposed method improved the word recognition rate, although admittedly the classification accuracy was not high. Furthermore, although this method was effective for unknown noises, it will need to be combined with noise adaptation, tracking techniques and so on. In future research, we will attempt to verify the effectiveness of this new method in dealing with sudden noise when a large vocabulary is used.

REFERENCES

[1] M. Fujimoto, et al., "Particle Filter Based Non-Stationary Noise Tracking for Robust Speech Recognition," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005, pp. 257-260.

[2] M. Kotta, et al., "Speech Enhancement in Non-Stationary Noise Environments Using Noise Properties," Speech Communication, Vol. 48, No. 11, 2006, pp. 96-109.

[3] T. Jitsuhiro, et al., "Robust Speech Recognition Using Noise Suppression Based on Multiple Composite Models and Multi-Pass Search," Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2007, pp. 53-58.

[4] T. Hirai, S. Kuroiwa, S. Tsuge, F. Ren and M. A. Fattah, "A Speech Emphasis Method for Noise-Robust Speech Recognition by Using Repetitive Phrase," Proceedings of the International Conference on Chemical Thermodynamics (ICCT), 2006, pp. 1-4.

[5] P. J. Moreno, B. Raj and R. M. Stern, "A Vector Taylor Series Approach for Environment Independent Speech Recognition," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1996, pp. 733-736.

[6] J. C. Segura, et al., "Model-Based Compensation of the Additive Noise for Continuous Speech Recognition. Experiments Using the AURORA II Database and Tasks," Proceedings of Eurospeech, 2001, pp. 221-224.

[7] L. Deng, et al., "Enhancement of Log Mel Power Spectra of Speech Using a Phase-Sensitive Model of the Acoustic Environment and Sequential Estimation of the Corrupting Noise," IEEE Transactions on Speech and Audio Processing, Vol. 12, 2004, pp. 133-143.

[8] N. Miyake, T. Takiguchi and Y. Ariki, "Noise Detection and Classification in Speech Signals with Boosting," Proceedings of the IEEE Workshop on Statistical Signal Processing (SSP), 2007, pp. 778-782.

[9] Y. Freund, et al., "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," Journal of Computer and System Sciences, Vol. 55, 1997, pp. 119-139.

[10] S. Nakamura, et al., "Acoustical Sound Database in Real Environments for Sound Scene Understanding and Hands-Free Speech Recognition," Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC), 2000, pp. 965-968.