J. Software Engineering & Applications, 2010, 3: 341-346
doi:10.4236/jsea.2010.34039 Published Online April 2010 (http://www.SciRP.org/journal/jsea)
Copyright © 2010 SciRes JSEA

Sudden Noise Reduction Based on GMM with Noise Power Estimation

Nobuyuki Miyake, Tetsuya Takiguchi, Yasuo Ariki
Graduate School of Engineering, Kobe University, Kobe, Japan.
Email: takigu@kobe-u.ac.jp

Received January 5th, 2010; revised February 27th, 2010; accepted March 2nd, 2010.

ABSTRACT

This paper describes a method for reducing sudden noise using noise detection, noise classification, and noise power estimation. Sudden noise detection and classification were dealt with in our previous study. In this paper, GMM-based noise reduction is performed using the detection and classification results. Classification tells us which kind of noise we are dealing with, but not its power. We solve this problem by combining noise power estimation with the noise reduction method. In our experiments, the proposed method achieved good performance in recognizing utterances overlapped by sudden noises.

Keywords: Sudden Noise, Model-Based Noise Reduction, Speech Recognition

1. Introduction

Sudden and short-term noises often degrade the performance of a speech recognition system. To recognize the speech data correctly, noise reduction or model adaptation to the sudden noise is required. However, it is difficult to remove such noises because we do not know where the noise occurred or what the noise was. There have been many studies on non-stationary noise reduction in a single channel [1-4]; the main target of our study is sudden noise from among these non-stationary noises. There have also been many studies on model-based noise reduction [5-7], and these methods are effective for additive noises.
However, these reduction methods are difficult to apply directly to sudden noise reduction, since they require noise information in order to be carried out. In our previous study [8], we proposed detecting and classifying these noises before removing them. But there is a problem: the noise power is unknown from the classification results, although the kind of noise can be estimated. In this paper, we propose a noise reduction method that uses the results of noise detection and classification to accomplish the noise reduction. The proposed method integrates noise power estimation with GMM-based noise reduction to solve the aforementioned problem.

2. System Overview

Figure 1 shows an overview of the noise reduction system. The speech waveform is split into small segments using a window function. Each segment is converted to a feature vector, a log Mel-filter bank feature. Next, the system identifies whether or not the feature vector is noisy speech overlapped by sudden noise, using a non-linear classifier based on AdaBoost. The system then classifies the sudden noise type, only for the detected noisy frames, using a multi-class classifier. Finally, a noise reduction method based on GMM is applied. Although we apply the proposed technique to the output of AdaBoost, it can equally be applied to the output of another binary classifier such as an SVM.

Figure 1. System overview of sudden noise reduction

3. Clustering Noise

There are many kinds of noises in a real environment. The smaller the difference between the noise seen in training and the overlapping noise in the test, the better the performance of the noise reduction method in Section 5. But because there are many kinds of noises, potential noises need to be grouped by noise type in some way.
Therefore, we build a tree of noise types based on the k-means method, using the log Mel-filter bank as the noise feature.

3.1 K-Means Clustering Limited by Distance to Center

K-means clustering usually requires the number of classes to be set in advance. In our method, the number of classes is decided automatically by increasing the number of classes until the distance d between each data point and the center of its class is smaller than an upper limit θ decided beforehand. First, all data are clustered using the k-means clustering method. Next, we calculate the distance d between each data point and the center of the class to which it belongs. If the distance exceeds the limit (d > θ), the class is divided into two classes and k-means clustering is performed again. This step is repeated until all the distances are less than θ. The noise data used for noise reduction is the mean vector of each class. So, the smaller the upper limit θ is, the higher the expected noise reduction performance, because the variance within each class becomes smaller.

3.2 Tree of Noise Types

One problem with the above k-means algorithm is that too many classes may be created when θ is set small. This problem is solved by building a tree with the above k-means clustering: θ is first set to a larger value and all the data are clustered, and the deeper the tree level, the smaller θ becomes. In this paper, θ is halved with each increment of the level on the noise tree. Figure 2 shows an example of one such tree. In this paper, the clustering is performed using the mean vectors of each type of noise.

Figure 2. An example of a tree of noise types

4. Noise Detection and Classification

4.1 Noise Detection

Noise detection and classification are described in [8]. A non-linear classifier H(x), which separates clean speech features from noisy speech features, is learned using AdaBoost. Boosting is a voting method using weighted weak classifiers, and AdaBoost is one boosting method [9].
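Before turning to the classifiers, the distance-limited k-means of Section 3.1 can be sketched in code. This is a minimal NumPy sketch under our own naming and design choices (the paper does not specify the initialization or the order in which over-large classes are split); it recursively splits any class whose member-to-center distance exceeds θ and returns the class means as noise templates.

```python
import numpy as np

def kmeans(data, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep old center if cluster is empty
                centers[j] = data[labels == j].mean(axis=0)
    return labels, centers

def cluster_limited(data, theta, seed=0):
    """Split any class whose member-to-center distance exceeds theta
    into two classes, until every distance is below theta.
    Returns the mean vector (noise template) of each final class."""
    pending, done = [data], []
    while pending:
        c = pending.pop()
        center = c.mean(axis=0)
        if len(c) > 1 and np.linalg.norm(c - center, axis=1).max() > theta:
            labels, _ = kmeans(c, 2, seed=seed)
            if labels.min() == labels.max():  # degenerate split: stop here
                done.append(c)
            else:
                pending += [c[labels == 0], c[labels == 1]]
        else:
            done.append(c)
    return [c.mean(axis=0) for c in done]
```

On two well-separated groups of noise features, this stops with one template per group once the within-class spread falls below θ.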
The AdaBoost algorithm is as follows.

Input: n examples Z = {(x_1, y_1), ..., (x_n, y_n)}, where y_i ∈ {−1, 1} is the label of x_i.

Initialize:
w_1(z_i) = 1/(2m) if y_i = 1, and 1/(2l) if y_i = −1,
where m is the number of positive examples and l is the number of negative examples.

Do for t = 1, ..., T:
1) Train a base learner with respect to the weighted example distribution w_t and obtain a hypothesis h_t : x → {−1, 1}.
2) Calculate the training error of h_t:
e_t = Σ_{i=1}^{n} w_t(z_i) (1 − y_i h_t(x_i)) / 2
3) Set
α_t = log((1 − e_t) / e_t)
4) Update the example distribution:
w_{t+1}(z_i) = w_t(z_i) exp{−α_t y_i h_t(x_i)} / Σ_{j=1}^{n} w_t(z_j) exp{−α_t y_j h_t(x_j)}

Output: final hypothesis f(x) = Σ_t α_t h_t(x)

The AdaBoost algorithm uses a set of training data {(x_1, y_1), ..., (x_N, y_N)}, where x_i is the i-th feature vector of the observed signal and y_i is a label from a set of possible labels. For noise detection, we consider just two labels, Y = {−1, 1}, where label 1 means noisy speech and label −1 means speech only. In this paper, single-level decision trees (also known as decision stumps) are used as weak classifiers, and the threshold of f(x) is 0:

H(x) = noisy speech if f(x) ≥ 0; clean speech if f(x) < 0   (1)

Using this classifier, we determine whether or not a frame is noisy.

4.2 Noise Classification

Noise classification is performed on the frames detected as noisy speech. If a frame contained noise only, it could be classified by calculating the distance from templates; however, the frame is assumed to contain speech as well. In this paper, we therefore use AdaBoost for noise classification too. AdaBoost is extended to multi-class classification using the one-vs-rest method, giving a multi-class classifier. The following shows this algorithm.
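Before that multi-class listing, here is a sketch of the binary AdaBoost training loop of Section 4.1 with decision stumps as weak learners. It is an illustration under our own naming (exhaustive stump search over feature/threshold/polarity; α = log((1−e)/e) as in the algorithm above), not the authors' implementation.

```python
import numpy as np

def train_stump(X, y, w):
    """Pick the (feature, threshold, polarity) stump with the lowest
    weighted error on labels y in {-1, +1}."""
    best = (np.inf, 0, 0.0, 1)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, f] - thr) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, f, thr, pol)
    return best

def adaboost(X, y, T=20):
    """Returns a scoring function f(x) = sum_t alpha_t h_t(x);
    the frame is declared noisy when f(x) >= 0."""
    m, l = (y == 1).sum(), (y == -1).sum()
    # initial weights: 1/(2m) for positives, 1/(2l) for negatives
    w = np.where(y == 1, 1.0 / (2 * m), 1.0 / (2 * l))
    stumps = []
    for _ in range(T):
        err, f, thr, pol = train_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # guard the log
        alpha = np.log((1 - err) / err)
        pred = np.where(pol * (X[:, f] - thr) >= 0, 1, -1)
        w = w * np.exp(-alpha * y * pred)      # reweight examples
        w /= w.sum()
        stumps.append((alpha, f, thr, pol))
    def score(x):
        return sum(a * (1 if p * (x[f] - t) >= 0 else -1)
                   for a, f, t, p in stumps)
    return score
```

On linearly separable toy data the first stump already separates the classes, and the vote f(x) keeps the correct sign.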
Input: m examples {(x_1, y_1), ..., (x_m, y_m)}, y_i ∈ {1, ..., K}.

Do for k = 1, ..., K:
1) Set labels
y_i^k = 1 if y_i = k, −1 otherwise.
2) Learn the k-th classifier f^k(x) using AdaBoost on the data set Z^k = {(x_1, y_1^k), ..., (x_m, y_m^k)}.

Final classifier: k̂ = argmax_k f^k(x)

This classifier is built at each node in the tree, where K is the number of noise classes at that node. In this paper, each node has from 2 to 5 classes.

5. Noise Reduction Method

5.1 Noisy Speech

The observed signal feature X_b(t), which is the energy of filter b of the Mel-filter bank at frame t, can be written as follows using the clean speech S_b(t) and the additive noise N_b(t):

X_b(t) = S_b(t) + N_b(t)   (2)

In this paper, we suppose that noises are detected and classified but the SNR is unknown. In other words, the kind of additive noise is estimated but its power is not. Therefore, a parameter α, used to adjust the noise power, is introduced as follows:

X_b(t) = S_b(t) + α N_b(t)   (3)

In this case, the log Mel-filter bank feature x_b(t) (= log X_b(t)) is

x_b(t) = log{exp(s_b(t)) + α exp(n_b(t))}
       = s_b(t) + log{1 + exp(n_b(t) + log α − s_b(t))}
       = s_b(t) + G_b(s(t), n(t), α)   (4)

The clean speech feature s_b(t) can be obtained by estimating G_b(s(t), n(t), α) and subtracting it from x_b(t).

5.2 Speech Feature Estimation Based on GMM

The GMM-based noise reduction method is performed to estimate s(t) [5,6]. (In [5,6], the noise power parameter is not considered.) The algorithm estimates the value of the noise using a clean speech GMM in the log Mel-filter bank domain. The statistical model of clean speech is given as an M-component Gaussian mixture model:

p(s) = Σ_{m=1}^{M} Pr(m) N(s; μ_{s,m}, Σ_{s,m})   (5)

Here, N(·) denotes the normal distribution, and μ_{s,m} and Σ_{s,m} are the mean vector and the covariance matrix of the clean speech s(t) for mixture m. The noisy speech model is derived from this model as follows:

p(x) = Σ_{m=1}^{M} Pr(m) N(x; μ_{x,m}, Σ_{x,m})   (6)

μ_{x,m} = μ_{s,m} + G(μ_{s,m}, μ_n, α)   (7)

Σ_{x,m} = Σ_{s,m}   (8)

where μ_n is the mean vector of one of the noise classes, decided by the result of the noise classification.
At this time, the estimated value Ĝ(s, n, α) of G is given as follows:

Ĝ(s, n, α) = Σ_m p(m|x) G(μ_{s,m}, μ_n, α)   (9)

where

p(m|x) = Pr(m) N(x; μ_{x,m}, Σ_{x,m}) / Σ_{m'} Pr(m') N(x; μ_{x,m'}, Σ_{x,m'})   (10)

The clean speech feature s is estimated by subtracting Ĝ(s, n, α) from the feature x of the observed signal:

ŝ = x − Ĝ(s, n, α)   (11)

5.3 Noise Power Estimation Based on EM Algorithm

The parameter α, which adjusts the noise power, is unknown. Therefore, (9) cannot be used directly, because μ_{x,m} and p(m|x) depend on α. In this paper, this parameter is calculated by the EM algorithm, which estimates the noise power by maximizing p(x), the likelihood of the noisy speech feature. p(x) is written as (6), in which μ_{x,m} depends on α. So, we replace p(x) with p(x|α), and the noise power parameter α is calculated by maximizing the likelihood p(x|α) using the EM algorithm.

E-step:
Q(α, α^(k)) = Σ_m p(x, m | α^(k)) log p(x, m | α)   (12)

M-step:
α^(k+1) = argmax_α Q(α, α^(k))   (13)

where k is the iteration index. The above two steps are repeated until α^(k) converges to the optimum solution. In the M-step, the solution is found by solving the following equation:

∂Q(α, α^(k)) / ∂α = 0   (14)

This equation can be expanded as follows:

∂Q(α, α^(k)) / ∂α
= Σ_m p(x, m | α^(k)) ∂ log p(x, m | α) / ∂α
= Σ_m p(x, m | α^(k)) Σ_b [ (x_b − μ_{s,b,m} − log(1 + exp(μ_{n,b} + log α − μ_{s,b,m}))) / σ²_{s,b,m} ] · [ exp(μ_{n,b} − μ_{s,b,m}) / (1 + exp(μ_{n,b} + log α − μ_{s,b,m})) ]   (15)

However, it is difficult to solve this equation analytically, so Newton's method is used. An approximation of the optimum solution is calculated iteratively as follows:

f_1 = ∂Q(α^(l), α^(k)) / ∂α
f_2 = ∂²Q(α^(l), α^(k)) / ∂α²
α^(l+1) = α^(l) − f_1 / f_2   (16)

Equation (16) is iterated until α^(l) converges. The initial value for Newton's method was set at 0.

6. Experiments

In order to evaluate the proposed method, we carried out isolated word recognition experiments using the ATR database for speech data and the RWCP corpus for noise data [10].
6.1 Experimental Conditions

The experimental conditions are shown in Table 1. All features were extracted using a 20 ms window with a 10 ms frame shift. The ATR database contains word utterances from ten different speakers. There are 105 types of noises in the RWCP corpus [10]; the kinds of noises include, for example, telephone sounds, the beating of wood, tearing paper, and so on. Each kind of noise consists of 100 data samples, which are divided into 50 samples for testing and 50 samples for training. The noise tree was made using the mean vectors of the training samples, and these vectors were divided into 37 classes (the total number of leaves). The classifiers for detection and classification were trained on noisy speech features. We therefore made noisy utterances for each class by adding noises to 2,000 × 10 clean utterances of 10 speakers (five men, five women) as training data. The clean utterances were Japanese word utterances of the 10 speakers in the ATR database. In this case, the SNR was adjusted between −5 dB and 5 dB. The GMM for noise reduction and the HMMs for recognition were trained on the same 2,000 × 10 clean utterances. As test data, we used 500 × 10 different word utterances by the same 10 speakers. Noises overlapped each test utterance, with the SNR adjusted to −5, 0 and 5 dB and the duration of each noise set to 10 to 200 ms. Figure 3 shows an example of noisy speech.

Table 1. Experimental conditions

Making the tree:
  Feature parameters: 24-log Mel-filter bank
  Tree depth: 5
  Upper limit θ (in order of depth level): 50, 25, 12, 6

Detection and classification:
  Feature parameters: 24-log Mel-filter bank
  Number of weak learners: 200

Noise reduction:
  Feature parameters: 24-log Mel-filter bank
  Number of GMM components: 16, 32, 64

Speech recognition:
  Feature parameters: 12-MFCC + Δ + ΔΔ
  Acoustic models: phoneme HMM, 5 states, 12 mixtures
  Lexicon: 500 words

Figure 3. An example of noisy speech
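Noisy test utterances of the kind described above (a noise segment overlapped on clean speech at a target SNR) can be generated with a sketch like the following. This is our own helper, not the authors' tooling; it scales the noise so that the SNR over the overlapped segment equals the requested value.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, start):
    """Overlay `noise` on `speech` beginning at sample `start`,
    scaling the noise so the segment SNR equals `snr_db`."""
    seg = speech[start:start + len(noise)]
    noise = noise[:len(seg)]                 # clip noise at utterance end
    p_s = np.mean(seg ** 2)                  # speech power in the segment
    p_n = np.mean(noise ** 2)                # noise power before scaling
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    out = speech.copy()
    out[start:start + len(noise)] += gain * noise
    return out
```

For a 16 kHz signal, a 1,600-sample noise corresponds to the 100 ms sudden noises considered in the experiments; measuring the SNR of the added segment afterwards returns the requested value.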
6.2 Experimental Results

Table 2 shows the results of detection and classification. "Recall" is the ratio of correctly detected noisy frames to all noisy frames, "Precision" is the ratio of correctly detected noisy frames to all detected frames, and "Classification" is the rate of correctly classified frames among the detected noisy frames. In this table, the Recall and Precision rates are high, which means the noise is detected well. The classification rate was low, however. Even when the classification result differs from the true noise label, though, if the noise is classified into a class near the true one, the negative effect on noise reduction may be negligible.

Table 2. Results of detection and classification

                 5 dB    0 dB    −5 dB
Recall           0.850   0.908   0.942
Precision        0.861   0.868   0.871
Classification   0.290   0.382   0.406

Figure 4 shows the recognition rate for each SNR, where the horizontal axis is the number of GMM components used for noise reduction. In Figure 4, "Baseline" means that noise reduction is not applied, and "No estimation of noise power" means that power estimation was not performed in the GMM-based noise reduction (i.e., (11) is calculated with α = 1). "EM algorithm" means that the noise power is estimated using the method described in Section 5.3. "Oracle label" means that correct detection and classification results were given; in this case, 64 Gaussian components were used.

Figure 4. Recognition results at SNRs of −5 dB, 0 dB and 5 dB (recognition rate [%] for the baseline, 16/32/64-component GMMs, and the oracle label, with and without noise power estimation)
In cases where no noise was added, the recognition rate was 97.4%. As shown in Figure 4, the recognition rate was improved by the proposed method. Furthermore, the proposed method outperforms the version without noise power estimation.

6.3 Experiments for Unknown Noise

We examined the effectiveness of the proposed method for unknown noises using 10-fold cross validation over noise types. The 105 types of noise were divided into 10 sets, with 9 sets for training and 1 set for testing. The noise tree and classifiers were created using the training sets, and the test data were made using the test sets. The experimental conditions were similar to those in Table 1, but we examined only 64 Gaussian mixture components for noise reduction. Table 3 shows the detection results. The classification rate cannot be evaluated here because the classes of the noises that overlapped the utterances are not defined. Figure 5 shows the recognition rates for unknown noises on the test sets. As shown in Figure 5, the proposed method improved the word recognition rate for unknown noises. But in comparison with the "Oracle label" case, speech recognition performance degraded due to differences between the training and test noise data.

Table 3. Results of detection for unknown noises

            5 dB    0 dB    −5 dB
Recall      0.831   0.886   0.926
Precision   0.849   0.856   0.860

Figure 5. Recognition results for word utterances mixed with unknown noises (recognition rate [%]: Baseline 69.0, 57.0, 45.3 vs. EM algorithm 77.7, 73.8, 67.1 at 5, 0, and −5 dB)

7. Conclusions

In this paper, we have described a sudden noise reduction method. Noise detection and classification are performed using AdaBoost, and GMM-based noise reduction is performed using the detection and classification results. By combining noise power estimation with the noise reduction method, we solved the problem of word recognition when the noise power is unknown. Our proposed method improved the word recognition rate, although admittedly the classification accuracy was not high. Furthermore, although this method was effective for unknown noises, it will need to be combined with noise adaptation, tracking techniques, and so on. In future research, we will attempt to verify the effectiveness of this new method in dealing with sudden noise when a large vocabulary is used.

REFERENCES

[1] M. Fujimoto, et al., "Particle Filter Based Non-Stationary Noise Tracking for Robust Speech Recognition," Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005, pp. 257-260.
[2] M. Kotta, et al., "Speech Enhancement in Non-Stationary Noise Environments Using Noise Properties," Speech Communication, Vol. 48, No. 11, 2006, pp. 96-109.
[3] T. Jitsuhiro, et al., "Robust Speech Recognition Using Noise Suppression Based on Multiple Composite Models and Multi-Pass Search," Proceedings of Automatic Speech Recognition and Understanding (ASRU), 2007, pp. 53-58.
[4] T. Hirai, S. Kuroiwa, S. Tsuge, F. Ren and M. A. Fattah, "A Speech Emphasis Method for Noise-Robust Speech Recognition by Using Repetitive Phrase," Proceedings of International Conference on Chemical Thermodynamics (ICCT), 2006, pp. 1-4.
[5] P. J. Moreno, B. Raj and R. M. Stern, "A Vector Taylor Series Approach for Environment Independent Speech Recognition," Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1996, pp. 733-736.
[6] J. C. Segura, et al., "Model-Based Compensation of the Additive Noise for Continuous Speech Recognition. Experiments Using the AURORA II Database and Tasks," Proceedings of Eurospeech, 2001, pp. 221-224.
[7] L. Deng, et al., "Enhancement of Log Mel Power Spectra of Speech Using a Phase-Sensitive Model of the Acoustic Environment and Sequential Estimation of the Corrupting Noise," IEEE Transactions on Speech and Audio Processing, Vol. 12, 2004, pp. 133-143.
[8] N. Miyake, T. Takiguchi and Y. Ariki, "Noise Detection and Classification in Speech Signals with Boosting," IEEE Workshop on Statistical Signal Processing (SSP), 2007, pp. 778-782.
[9] Y. Freund, et al., "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," Journal of Computer and System Sciences, Vol. 55, 1997, pp. 119-139.
[10] S. Nakamura, et al., "Acoustical Sound Database in Real Environments for Sound Scene Understanding and Hands-Free Speech Recognition," Proceedings of 2nd ICLRE, 2000, pp. 965-968.