Journal of Signal and Information Processing, 2011, 2, 165-169
doi:10.4236/jsip.2011.23021 Published Online August 2011 (http://www.SciRP.org/journal/jsip)
Copyright © 2011 SciRes. JSIP
MAP-Based Audio Coding Compensation for
Speaker Recognition*
Tao Jiang, Jiqing Han
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
Email: {taojiang, jqhan}@hit.edu.cn
Received March 8th, 2011; revised April 20th, 2011; accepted April 27th, 2011.
ABSTRACT
The performance of a speaker recognition system declines when the training and testing audio codecs are mismatched. In
this paper, based on an analysis of the effect of mismatched audio codecs on the linear prediction cepstrum coefficients, a
method of MAP-based audio coding compensation for speaker recognition is proposed. The proposed method first
sets a standard codec as a reference and trains the speaker models in this codec format, then learns the deviation
distributions between the standard codec format and the other ones, next obtains the current bias using a small amount of
adaptation data and the MAP-based adaptation technique, and finally adjusts the model parameters according to the type of the incoming
audio codec format and its related bias. During the test, the features of the incoming speaker are matched against the
adjusted model. The experimental results show that the accuracy reached 82.4% with just one second of adaptation data,
which is 5.5% higher than that of the baseline system.
Keywords: Audio Coding Compensation, Speaker Recognition, MAP-Based
1. Introduction
Speaker recognition is a technology that extracts
speaker information from speech signals to identify the
speaker's identity. Most previous speaker recognition
systems were designed for speech coded at high bit
rates [1,2]. In recent years, with the rapid development
of network technology, speech is compressed for
efficient transmission at relatively low bit rates, which
results in distortion of the speech signal and a decline in
speaker recognition performance. In particular, when the
training and testing codecs are mismatched, the performance
is even worse [3-5]. Compensation methods for the influence
of audio coding have therefore attracted increasing attention,
and many techniques for compensating the degradation
caused by this mismatch have been developed.
They are roughly grouped into two categories, namely 1)
feature compensation, in which the process of feature
extraction is modified, and 2) model adaptation, in which
the parameters of the recognition models are adjusted. In [6],
four standard speech coding algorithms, i.e. GSM (12.2
kbps), G.729 (8 kbps), G.723 (5.3 kbps) and MELP (2.4
kbps), were used to test the influence of the mismatch on
speaker recognition, and the effect of score normalization
was also discussed. In [4], two approaches were proposed
to improve the performance of Gaussian mixture
model (GMM) speaker recognition on G.729 resynthesized
speech. The first explicitly uses G.729 spectral parameters
as a feature vector, and the second calculates Mel-filter
bank energies of speech spectra built up from G.729
parameters. In [7], the effect of the codec in GSM
cellular telephone networks was investigated: the
performance of a text-dependent speaker verification
system trained with A-law coded speech and tested with
GSM coded speech was compared with that of a system
trained and tested with GSM coded speech. Several
parameter representations derived from fast Fourier
transform and linear prediction cepstrum coefficient
(LPCC) [8] estimates were also compared.
Although various studies of the mismatch effects
caused by different codecs in speaker recognition have
been conducted, most of them were related to speech
codecs. So far, there has been little work on mismatch effects
caused by streaming media codecs in speaker recognition. In
this paper, we study speaker recognition under streaming
media codecs, and select four popular coding algorithms,
with known or unknown internals, used in streaming media on
*This work was supported by the National Basic Research Program of
China (973 Program, No. 2007CB311100).
the Internet: mp3 (192 kbps, known coding algorithm),
rm (64 kbps, unknown coding algorithm), wma (128 kbps,
unknown coding algorithm) and ogg (128 kbps, known
coding algorithm). We analyze the influence of these
codecs on the parameters, and compensate the distortions
in the feature domain. We propose a method of
MAP-based audio coding compensation for speaker
recognition, which is a model adaptation method. The
proposed method first sets a standard codec as a reference
and trains the speaker models in this codec format, then
learns the deviation distributions between the standard
codec format and the other ones, next obtains the current
bias using a small amount of adaptation data and the
MAP-based adaptation technique, and finally adjusts the model
parameters according to the type of the incoming audio coding format
and the related bias. During the test, the features of the incoming
speaker are matched against the adjusted model, so
as to effectively solve the codec mismatch problem.
2. Influence Analysis of Audio Codecs in
LPCC Domain
LPCC is a dominant feature which is frequently used in
speaker recognition and speech recognition. In order to
implement the compensation of the audio codec influence, a
statistical analysis of the audio codec influence is first
conducted in the LPCC domain.
Under a variety of audio codecs, the LPCC distortion deviation h_i is defined as

h_i = o_i − o_0,   i = 1, 2, 3, 4,

where o_0 is the feature under the standard codec, o_i is the feature under the ith codec, and h_i is the bias between the ith codec and the standard one.
The 12-dimensional average characteristic parameters
are extracted from four types of coded corpora, each of
which includes 20 males and 20 females, with two minutes
of speech per speaker. Figure 1 gives the average deviation of
the LPCC parameters.
From Figure 1, it can be seen that the low-dimensional deviations are larger, while the high-dimensional ones are smaller.
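This per-dimension deviation analysis can be sketched in a few lines of NumPy. The feature matrices below are synthetic stand-ins for real LPCC features, since the paper's corpus and LPCC extractor are not available; the function name is ours.

```python
import numpy as np

# Sketch of the Section 2 analysis: per-dimension average deviation
# h_i = mean(o_i - o_0) between coded and standard-codec LPCC features.
# The feature matrices below are synthetic stand-ins (frames x 12).

def average_deviation(lpcc_std, lpcc_coded):
    """Average per-dimension LPCC deviation between two codec versions."""
    return (lpcc_coded - lpcc_std).mean(axis=0)

rng = np.random.default_rng(0)
lpcc_std = rng.normal(size=(500, 12))        # stand-in for standard-codec frames
bias = np.linspace(0.6, 0.05, 12)            # larger bias in low dimensions
lpcc_mp3 = lpcc_std + bias + rng.normal(scale=0.05, size=(500, 12))

h = average_deviation(lpcc_std, lpcc_mp3)
print(np.round(h, 2))   # low dimensions deviate more than high ones
```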
Figure 1. The average deviation of LPCC parameters.
The average values of the 12-dimensional characteristics
are calculated and their deviations are obtained
from a speech set. In this set, there are 200 speakers
(100 males, 100 females), the durations range from 2
seconds to 30 minutes, and there are four types of codec.
Figure 2 shows the third-dimensional deviation for the mp3
coding, and Figure 3 shows the eleventh-dimensional deviation
for the wma coding.
From Figures 2 and 3, it can be seen that the deviations
vary in the vicinity of a certain value. The
deviations of the low-dimensional characteristics are scattered,
while those of the high-dimensional ones are small and
relatively concentrated. In short, the deviation distribution of
the LPCC can be described with a single-Gaussian distribution.
3. MAP-Based Coding Compensation
Maximum a posteriori (MAP) estimation [9] is an adaptation
approach which has been widely adopted and successfully
applied to speaker adaptation. In this technique, the
parameters of the model are regarded as random variables
with an assumed joint prior probability density
function (p.d.f.). The MAP estimate of the parameter
vectors is defined as the mode of the posterior p.d.f.
given the adaptation data.

Figure 2. The third-dimensional deviation of mp3.

Figure 3. The eleventh-dimensional deviation of wma.

The MAP-based compensation method is based on two assumptions [6]:
1) There is a codec bias h between the testing and training codecs.
2) The deviation distribution can be described with a single-Gaussian distribution N(μ_h, Σ_h), in which μ_h is the mean vector and Σ_h is the covariance matrix.
The derivation of the MAP codec-bias estimation is as follows. Let h denote the codec bias between the testing and training feature vectors; the codec bias is characterized by a multivariate Gaussian.
Based on the MAP criterion, the codec bias h is estimated by maximizing the posterior probability p(h | X, λ), i.e.

h_MAP = argmax_h p(h | X, λ)    (1)

where λ is the speaker model and X = (x_1, …, x_T) is the vector sequence of the testing speech features.
This equation is also equivalent to

h_MAP = argmax_h [log p(X | h, λ) + log p(h)]    (2)

where p(h) is the prior knowledge of the codec bias.
Thus, the maximization problem is transformed into maximizing the sum of the log likelihood log p(X | h, λ) and the logarithm of the prior p.d.f. log p(h). This motivates us to introduce a scale factor α into Equation (2) to weight these two terms. Thus, we generalize the MAP estimation as follows:

h_MAP = argmax_h [α log p(X | h, λ) + (1 − α) log p(h)]    (3)

where p(X | h, λ) is a Gaussian mixture distribution, i.e.

p(X | h, λ) = Σ_{i=1}^{M} p(X, i | h, λ) = Σ_{i=1}^{M} c_i p(X | i, h, λ)    (4)

where M is the number of mixture components and c_i is the ith mixture weight.
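The mixture density in Equation (4) is usually evaluated in the log domain for numerical stability. A minimal sketch for a single frame with a toy diagonal-covariance GMM follows; the function name and parameter values are illustrative, not the authors' implementation.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | lambda) = log sum_i c_i N(x; mu_i, diag(sigma_i^2))."""
    diff = x[None, :] - means                                  # (M, D)
    log_comp = -0.5 * (np.log(2 * np.pi * variances)
                       + diff ** 2 / variances).sum(axis=1)    # (M,)
    # log-sum-exp over mixture components avoids underflow
    return np.logaddexp.reduce(np.log(weights) + log_comp)

# Toy 2-component model in 2 dimensions
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [2.0, 2.0]])
variances = np.ones((2, 2))
ll = gmm_log_likelihood(np.array([0.1, -0.2]), weights, means, variances)
```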
For Equation (3), the current codec bias is estimated with the expectation-maximization (EM) [10] algorithm over the T-frame data. The auxiliary function Q can be written as follows:

Q(h, h′) = α Σ_{t=1}^{T} Σ_{i=1}^{M} [p(x_t, i | h′, λ) / p(x_t | h′, λ)] log p(x_t, i | h, λ) + T (1 − α) log p(h)    (5)

where h′ is the previous iteration result, h is the current iteration result, p(x_t | h, λ) is the Gaussian mixture density after the adjustment of vector x_t by the bias h, and p(x_t, i | h, λ) is the ith component density.
Assuming that the covariance is a diagonal matrix, setting ∂Q/∂h_j = 0 gives

h_j = [ α Σ_{t=1}^{T} Σ_{i=1}^{M} (c_i p(x_t | i, h′, λ) / p(x_t | h′, λ)) (x_tj − μ_ij) / σ_ij² + (1 − α) T μ_hj / σ_hj² ]
      / [ α Σ_{t=1}^{T} Σ_{i=1}^{M} (c_i p(x_t | i, h′, λ) / p(x_t | h′, λ)) (1 / σ_ij²) + (1 − α) T (1 / σ_hj²) ]    (6)

where h_j is the jth component of the current iteration result, x_tj is the jth component of vector x_t, μ_ij is the mean and σ_ij² is the variance of the speaker model, and μ_hj is the mean and σ_hj² is the variance of the codec bias.
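One EM iteration of the update in Equation (6) can be sketched as follows, assuming a diagonal-covariance GMM; the function name and toy data are ours, not the authors' implementation. With α = 1 the same code yields the ML estimate of Equation (7).

```python
import numpy as np

def map_bias_update(X, weights, means, variances, mu_h, var_h,
                    h_prev, alpha=0.5):
    """One EM iteration of the MAP codec-bias update (Equation (6)).

    X: (T, D) test features; h_prev: (D,) bias from the previous iteration.
    weights: (M,) mixture weights; means, variances: (M, D) diagonal GMM.
    mu_h, var_h: (D,) mean and variance of the codec-bias prior.
    """
    T = X.shape[0]
    # Posteriors gamma_ti = c_i p(x_t | i, h', lambda) / p(x_t | h', lambda),
    # evaluated on the bias-adjusted vectors x_t - h'.
    diff_adj = (X - h_prev)[:, None, :] - means[None, :, :]      # (T, M, D)
    log_comp = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                       + diff_adj ** 2 / variances[None, :, :]).sum(axis=2)
    log_post = np.log(weights)[None, :] + log_comp
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    gamma = np.exp(log_post)                                     # (T, M)
    # Equation (6): data term plus prior term, per dimension j.
    diff_raw = X[:, None, :] - means[None, :, :]                 # x_tj - mu_ij
    num = alpha * (gamma[:, :, None] * diff_raw / variances[None]).sum(axis=(0, 1))
    num += (1.0 - alpha) * T * mu_h / var_h
    den = alpha * (gamma[:, :, None] / variances[None]).sum(axis=(0, 1))
    den += (1.0 - alpha) * T / var_h
    return num / den

# Toy usage: a single-Gaussian "model" and features shifted by a known bias.
rng = np.random.default_rng(1)
true_bias = np.array([0.5, -0.3])
X = rng.normal(size=(2000, 2)) + true_bias
h = map_bias_update(X, np.array([1.0]), np.zeros((1, 2)), np.ones((1, 2)),
                    mu_h=np.zeros(2), var_h=np.ones(2),
                    h_prev=np.zeros(2), alpha=0.5)
# the estimate blends the data-driven bias with the prior mean
```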
In Equation (6), μ_hj and σ_hj² are unknown, so the codec bias must first be obtained. Given the scale factor α = 1, Equation (6) reduces to the maximum likelihood (ML) estimate, i.e.

h_j = [ Σ_{t=1}^{T} Σ_{i=1}^{M} (c_i p(x_t | i, h′, λ) / p(x_t | h′, λ)) (x_tj − μ_ij) / σ_ij² ]
      / [ Σ_{t=1}^{T} Σ_{i=1}^{M} (c_i p(x_t | i, h′, λ) / p(x_t | h′, λ)) (1 / σ_ij²) ]    (7)
If there are H types of audio codec, we can get a set of prior codec-bias ML statistics {h_M1, h_M2, …, h_MH}; the mean μ_h and variance σ_h² can then be estimated using Equations (8) and (9):

μ_h = (1/H) Σ_{k=1}^{H} h_Mk    (8)

σ_h² = (1/H) Σ_{k=1}^{H} (h_Mk − μ_h)²    (9)
In Equation (7), the initial bias h_0 is

h_0 = (1/T) Σ_{t=1}^{T} Σ_{i=1}^{M} c_i (x_t − μ_i)    (10)

where x_t is the feature vector in the non-standard codec and μ_i is the mean of the speaker model in the standard codec.
The codec-compensated vectors are given by

X̂ = X − h_MAP    (11)

We obtain the final recognition result by finding the maximum posterior probability for the compensated sequence over the standard speaker models.
The overall framework with MAP coding compensation
has the following steps: first, select a standard codec
and assume that the deviation between each of the other codecs
and the standard codec follows a Gaussian distribution;
next, estimate the specific distribution from a coding-changed
corpus as the codec-bias prior knowledge; then, estimate the
current codec bias from a small amount of testing
data with the MAP algorithm and adjust the testing data;
finally, recognize against the standard speaker models and obtain
the final result.
4. Experiments and Discussions
In order to evaluate the proposed method, several experiments
were designed. The related issues are also
discussed in detail. The corpus for the experiments was
collected from the Internet and includes data of a variety
of codec types and speakers. There were 200 speakers,
including 100 males and 100 females. The corpus
contained news, talks, recitations, interviews and so on. The
duration of the speech ranged from 2 seconds to 20 minutes.
The speech from the Internet was used as the original,
which is in the standard codec. We then obtained the mp3,
rm, wma and ogg coded speech, labeled as follows:
0 - no codec, 1 - mp3 codec, 2 - rm codec, 3 - wma
codec, 4 - ogg codec.
The standard speaker GMMs have 128 mixtures; the
speech used for training is 5 minutes per speaker, and that
used for testing is from 1 second to 6 seconds. The
contents of the training and testing data are different. We
select the 12-dimensional LPCC characteristics and their
differences as the features. In the MAP estimation formula,
the scale factor α needs to be determined. We conducted a series of
experiments to compare the performance of different
α values. Figure 4 shows the accuracy comparisons
when the value of α changes from 0.0 to 0.9
and the testing speech is 5 seconds long.
From Figure 4, we can see that the recognition rate
using MAP coding compensation is highest when the
adjustment factor α is around 0.5. Under this condition,
considering Equation (6), we find that the proportions
of the adaptation data and the prior knowledge are close. In
the following experiments, the adjustment factor α
is set to 0.5.
Figure 4. The recognition rate when the value of α is between 0.0 and 0.9.

We compared the performance of the MAP-based method
when using different lengths of adaptation data. Table 1
gives the results of the baseline system and of the system
using the MAP-based method when the adaptation data
duration is from 1 second to 6 seconds.

Table 1. The recognition rates of the MAP method when using different amounts of adaptation data.

Duration of adaptive speech    Baseline    MAP
0 s                            76.9%       -
1 s                            -           82.4%
2 s                            -           82.3%
3 s                            -           82.6%
4 s                            -           82.4%
5 s                            -           83.1%
6 s                            -           82.7%
From the above experimental results, it can be seen
that the influence of the codec mismatch is very large:
the recognition rate of the baseline system is only 76.9%.
Using the MAP-based compensation method, the system
performance was improved effectively under the codec-mismatch
condition. With just 1 second of adaptation data, the
accuracy reaches 82.4%, which is 5.5% higher than that
of the baseline system; with 5 seconds of adaptation data,
the recognition rate reaches 83.1%. As the amount of
adaptation data increases, the MAP-based estimate
gradually approaches the ML estimate. The performance
of the system with 6 seconds of adaptation data decreases
slightly compared with that of the system with 5 seconds.
This shows that the codec prior knowledge is useful for
improving system performance when the adaptation speech
is short; as the adaptation speech increases, the effect of
the codec prior knowledge gradually diminishes.
5. Conclusions
This paper analyzes the effect of audio coding on
the LPCC parameters for speaker recognition, and introduces
a MAP technique to compensate for the codec mismatch.
Experimental results show that, with one second of
adaptation data, the proposed method obtains an increase
of 5.5% in accuracy compared with the baseline system.
Thus, the proposed method can effectively reduce the
influence of the training and testing codec mismatch.
REFERENCES
[1] F. Bimbot, J. F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-García, D. P. Delacrétaz and D. A. Reynolds, "A Tutorial on Text-Independent Speaker Verification," EURASIP Journal on Applied Signal Processing, Vol. 4, 2004, pp. 430-451. doi:10.1155/S1110865704310024
[2] T. Kinnunen and H. Li, "An Overview of Text-Independent Speaker Recognition: From Features to Supervectors," Speech Communication, Vol. 52, No. 1, 2010, pp. 12-40. doi:10.1016/j.specom.2009.08.009
[3] M. Phythian, J. Ingram and S. Sridharan, "Effects of Speech Coding on Text-Dependent Speaker Recognition," Proceedings of IEEE Conference on Speech and Image Technologies for Computing and Telecommunications, Vol. 1, Brisbane, December 1997, pp. 137-140.
[4] R. B. Dunn, T. F. Quatieri, D. A. Reynolds and J. P. Campbell, "Speaker Recognition from Coded Speech and the Effects of Score Normalization," 35th Asilomar Conference on Signals, Systems and Computers, Vol. 2, Pacific Grove, November 2001, pp. 1562-1567.
[5] T. Jiang, B. Y. Gao and J. Q. Han, "Speaker Identification and Verification from Audio Coded Speech in Matched and Mismatched Conditions," IEEE International Conference on Robotics and Biomimetics, Guilin, December 2009, pp. 2199-2204.
[6] T. F. Quatieri, R. B. Dunn, D. A. Reynolds, J. P. Campbell and E. Singer, "Speaker Recognition Using G.729 Speech Codec Parameters," IEEE International Conference on Acoustics, Speech, and Signal Processing, Istanbul, Vol. 2, June 2000, pp. 1089-1092.
[7] M. G. Kuitert and L. Boves, "Speaker Verification with GSM Coded Telephone Speech," Proceedings of EUROSPEECH 1997, Vol. 2, Rhodes, September 1997, pp. 975-978.
[8] B. S. Atal, "Effectiveness of Linear Prediction Characteristics of the Speech Wave for Automatic Speaker Identification and Verification," Journal of the Acoustical Society of America, Vol. 55, No. 6, 1974, pp. 1304-1312. doi:10.1121/1.1914702
[9] J. L. Gauvain and C. H. Lee, "Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 2, 1994, pp. 291-298. doi:10.1109/89.279278
[10] J. Bilmes, "A Gentle Tutorial of the EM Algorithm and Its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," Technical Report ICSI-TR-97-021, 1997.