Q. HE ET AL.
Copyright © 2013 SciRes. CN
4.4. Speaker Recognition System of Score
Reliability Fusion Based on the Baseband
Correlation
Speaker recognition model training is the same as the
traditional speaker recognition. For the collection of N
registered speaker, we train one speaker model λi for
every speaker, 1
i
N. In the test, we should determine
the gender information of testing speech X
{xt ǀ t
1,2,
Λ,T}, then calculate the reliability weighting coeff icients
wg(t), t
1,2,Λ,T. At last, we have a weight for X about
the score for each frame of Λ
{λ1,λ2,L,λN}, and the
testing speech is determined the speaker i·corresponded
the probability value of the maximum model.
∑
=
≤≤≤≤
==
T
titg
Ni
i
Ni
xptwXPW
1
11
*
)|(log)(maxarg)|(logmaxargi
λλ
(8)
5. Experimental Analysis and Discussion
Experimental corpus base Mandarin Affective Speech
Corpus (MASC) and Emotional Prosody Speech and
Transcripts (EPST). MASC has 23 female and 45 male
speakers’ utterance in Chinese mandarin with 5 emotion-
al classifications (neutral, angry, happy, scared, and sad
classifications). Every speaker has 5 phrases and 60 sen-
tences in every emotional state. Each phrase lasts 0.8
second averagely, while each sentence lasts 2 seconds
averagely. Besides, there are 2 short passages with aver-
age duration of 15 seconds per passage in neutral state.
EPST is the first emotional speech corpus released by
Linguistic Data Consortium (LDC). It includes 8 actors
(3 male, 5 female). 7 speakers of them provide their Eng-
lish speech in 14 emotional classifications and their neu-
tral speech with different distance. The corpus used in
the experiment is split into 3 parts: Speeches of the first
18 people (7 female and 11 male) in MASC were taken
as development data to obtain fitting parameters; Speeches
of the remains in MASC were test data. 2 short passages
of every speaker were used to train speaker model, and
the other 15,000 sentences were used as testing speeches;
In addition, speeches of 7 speakers in EPST correspond-
ing with 5 same emotional classifications as MASC w ere
treated as extended test data. About 5 minutes neutral
speeches of each speaker in normal distance were used to
train speaker model. 5 kinds of emotional sentences with
total count 670 were taken as testing speech.
In the experiment, UBM is adopted 1024 orders and
characteristics are 13-dimensional MFCC and its delta.
The length of window for MFCC, energy and pitch are
32 ms uniformly, and step sizes are 16 ms uniformly. All
neutral speeches of the first 18 people in MASC were
used as training speeches for UBM, and UBM was ob-
tained by expectation maximization algorithm (EM). For
every speaker, his speaker model was obtained using his
neutral speech from UBM by MAP. In addition, GMM is
adopted 1024 ord e rs in gender recognition.
For verifying the validity of two kinds fusion weight
estimating strategy based on the score reliability assess-
ment, this part will compare the four methods of recogni-
tion performance on the MASC corpus and EPST corpus.
The four methods are: the bi-model method fusion weight
estimating strategy based on the score reliability assess-
ment (score difference), the bi-model method based on
the weight strategy of recognition rate (recognition rate),
the bi-model method based on the equal weight (equal
weight) and the traditional GMM-UBM me thod (datum).
5.1. Experimental results on the MASC
For each speaker , the two-stage neutral parag raphs speech
are used to each speaker models from the UBM adaptive.
The specific experimental results of the three methods
are shown in Tab le 1. From the table, we first found this
method as opposed to elimination in anger, pleasure and
panic on the recognition rate improved, 0.7%, 2.00% and
1.53% respectively. This suggests that high mismatch
part which is excluded of your tests to correctly identify
the voice still has a role, and recognition of the base-
band-weighted approach can effectively measure the role
of that part of the speech. For two emotion of LD type,
neutral and sad, due to difference testing had been wrong-
fully convicted as a high difference in emotional, leading
to more significant decrease in the elimination. The LD
type emotional statement which is wrongfully convicted,
identified as highly mismatched parts of voice baseband
is relatively low. Weights in weighted method in this
paper is much higher, thereby avoiding removal method
to ignore the part of speech, negative impact on system
performance. Relative to the elimination method in this
article as a whole has increased 1.1%, relative to the base
has increased 2.48%.
5.2. Experimental Results on the EPST
The EPST corpus and MFC corpus are the speech data-
base under two differ ent cultur al backgr ounds. To furth er
validate the effectiveness of the proposed method, we
Table 1. Experimental res ul t s on the MASC.
Method IR (%)
Datum Reject Baseband weight
Neutral 96.23 95.47 96.13
Angry 31.50 35.80 36.50
Happy 33.57 36.40 38.40
Scared 35.00 36.10 37.63
Sad 61.43 60.87 61.50
Average 51.55 52.93 54.03