Emotional Speech Synthesis Based on Prosodic Feature Modification

doi:10.4236/eng.2013.510B015

Engineering, 2013, 5, 73-77

http://dx.doi.org/10.4236/eng.2013.510B015 Published Online October 2013 (http://www.scirp.org/journal/eng)

Copyright © 2013 SciRes.

Ling He, Hua Huang, Margaret Lech

School of Electrical Engineering and Information, Sichuan University, Chengdu, China

School of Electrical and Computer Engineering, RMIT University, Melbourne, Australia

Email: ling.he@scu.edu.cn, margaret.lech@rmit.edu.au

Received November 20…

ABSTRACT

The synthesis of emotional speech has wide applications in the fields of human-computer interaction, medicine, industry and so on. In this work, an emotional speech synthesis system is proposed based on prosodic feature modification and the Time Domain Pitch Synchronous OverLap Add (TD-PSOLA) waveform concatenative algorithm. The system produces synthesized speech with four types of emotion: angry, happy, sad and bored. The experiment results show that the proposed emotional speech synthesis system achieves a good performance. The produced utterances present clear emotional expression. The subjective test reaches high classification accuracy for the different types of synthesized emotional speech utterances.

Keywords: Emotional Speech Synthesis; Prosodic Features; Time Domain Pitch Synchronous Overlap Add

1. Introduction

The modern speech synthesis system has a wide variety of applications. In call centers, the speech synthesizer could conduct dialogues with customers. Intelligent virtual agent devices could read aloud to users using speech synthesis techniques, such as in video games or children’s toys. In the medical field, the speech synthesizer could even be used to speak for sufferers who have lost the use of their voice. The majority of modern speech synthesizers can produce voice (an acoustic waveform) from text. However, few machines can “speak” in a totally natural way as a human being does. One of the major existing drawbacks is that machines cannot speak with emotion. Emotion expression is a vital part of human communication; effective human-to-human communication is virtually impossible if speakers cannot express or understand affections. Emotional speech synthesis aims to add human emotions to synthesized speech in order to produce more natural, affective speech.

Two major approaches to emotional speech synthesis dominate the literature: formant synthesis and concatenative synthesis [1]. Formant synthesis generates acoustic speech data entirely based on rules surrounding the acoustic correlates of the speech and does not utilize human speech recordings. Acoustic profiles for each emotion category are derived from the literature and manually adapted [2] to create a signal. In 1989, Janet Cahn [3] first implemented synthesized emotional speech using a formant synthesis system. Since then, several studies have been carried out using the formant synthesizer [4]. Despite the high degree of control over the acoustic parameters provided by this technique, formant synthesis is not widely applied, since the resulting speech has an unnatural, mechanical sound. In contrast, concatenative synthesis [5] joins recordings of a human speaker to generate the synthetic speech. The generated utterances are more natural. However, in order to produce a variety of emotions, the system requires a larger speech database to build a pool of units for selection [6-9]. To solve this problem, several researchers incorporate prosodic strategies into unit selection [10,11]. In this way, a smaller speech corpus is required. Different types of emotion can be added to the synthesized speech by modifying the corresponding acoustic parameters (such as the fundamental frequency or the duration of the speech contour) and then applying waveform concatenative approaches, such as the PSOLA (Pitch Synchronous OverLap Add) algorithm [12].

In this work, an emotional speech synthesis system is proposed based on prosodic feature modification and the TD-PSOLA concatenative synthesis method.

2. Emotional Speech Synthesis System

In this work, a prosodic modification based emotional speech synthesis system is proposed. The block diagram is illustrated in Figure 1.

In the experiment, speech utterances selected from an emotional speech database with four different types of emotion (angry, happy, sad and bored) are divided into a set of units. For concatenative speech synthesis, there are various possible choices for the type of unit; the most commonly used types include words, syllables, phonemes, demi-syllables and diphones. The speech unit is the basic synthesis building block. It is known that as the unit becomes longer, the effect of context decreases and the quality of the resulting synthesized speech increases. Considering the size of the speech database, the concatenative unit chosen in this work is the word. For the segmented units, prosodic analysis is applied to calculate the pitch, energy and duration rules. The length of silence is also calculated to decide the pause assignment. The prosodic feature templates for the different types of emotion are then built using the estimated parameters.

In the next step, the neutral input utterances are first segmented into units (words). The prosodic features are then extracted for each unit and modified according to the prosodic feature template, built in the previous step, of the corresponding emotion type. Finally, the time-domain pitch synchronous overlap add (TD-PSOLA) concatenative synthesis method is used to smooth and modify the unit boundaries and to produce the final synthesized emotional speech.
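As a rough illustration of how a prosodic template could drive the modification step, the sketch below derives per-emotion scaling factors from the unit averages reported in Table 1 of this paper. Treating the template as a simple ratio to the neutral averages is an assumption made here for illustration only; the paper does not spell out the exact mapping, and the names `TABLE1` and `modification_factors` are hypothetical.

```python
# Average unit values copied from Table 1 (pitch in Hz, duration in s, energy in dB).
TABLE1 = {
    "neutral": {"pitch_hz": 149.0, "duration_s": 0.24, "energy_db": 56.4},
    "angry":   {"pitch_hz": 251.0, "duration_s": 0.16, "energy_db": 73.2},
    "happy":   {"pitch_hz": 203.0, "duration_s": 0.18, "energy_db": 64.8},
    "sad":     {"pitch_hz": 126.0, "duration_s": 0.26, "energy_db": 52.1},
    "bored":   {"pitch_hz": 169.0, "duration_s": 0.25, "energy_db": 49.3},
}


def modification_factors(target, source="neutral"):
    """Return hypothetical scaling factors that move `source` prosody toward `target`.

    Assumption: pitch and duration are scaled multiplicatively, and the energy
    difference in dB is converted to a linear amplitude gain.
    """
    t, s = TABLE1[target], TABLE1[source]
    gain_db = t["energy_db"] - s["energy_db"]
    return {
        "pitch_scale": t["pitch_hz"] / s["pitch_hz"],
        "duration_scale": t["duration_s"] / s["duration_s"],
        "gain_db": gain_db,
        "amplitude_scale": 10.0 ** (gain_db / 20.0),
    }


if __name__ == "__main__":
    for emotion in ("angry", "happy", "sad", "bored"):
        print(emotion, modification_factors(emotion))
```

Under this assumption, the "angry" template raises the pitch by roughly two thirds and shortens each unit to about two thirds of its neutral duration, which matches the qualitative description of angry speech as short and clipped.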

3. Speech Database

The publicly available Berlin emotional speech (BES) database [13] is used in this work, for the purpose of comparison with other experiments. In the construction of the database, the text materials were carefully chosen to achieve natural emotion arousal. Ten sentences (5 short and 5 long) frequently used in everyday communication were selected. The speech utterances were produced by ten actors (5 female and 5 male). A total of 248 emotional recordings are selected in this work, with four different types of emotion (angry, happy, sad and bored), to build the emotional prosodic feature templates. “Angry” speech is short and clipped, with one word or syllable being more strongly stressed. For “happy”, the voice tone is high pitched, and the speech is faster or louder than usual. In the “sad” emotional state, a person tends to speak slowly and use a low voice tone. For “bored”, the voice tone is cold and dull. For each emotion, the number of recordings is 62. The sampling frequency of the data is 16 kHz.

Figure 1. Block diagram of emotional speech synthesis.

In addition, 40 speech recordings in the neutral state are used for testing. A neutral voice tone is even and relaxed, without marked stress on individual syllables. For each neutral recording, four synthesized utterances are produced, one for each emotion type: angry, happy, sad and bored.

4. Prosodic Features Calculation

In this work, three prosodic features are extracted: fundamental frequency, energy and time duration.

4.1. Calculation of Fundamental Frequency

As illustrated in Figure 2, the fundamental frequency F0 of vocal fold vibration is estimated simultaneously in the time domain using the autocorrelation method [13] and in the frequency domain using the cepstral method [13]. The average value of these two measurements provides the final estimate of F0.

The frequency-domain cepstrum method of F0 estimation looks for periodicity in the log spectrum of the signal: if the log amplitude spectrum contains many regularly spaced harmonics, then the Fourier analysis of the spectrum is expected to show a peak corresponding to the spacing between the harmonics, i.e. the fundamental frequency.

The time-domain autocorrelation method, on the other hand, estimates the fundamental frequency directly from the waveform using the autocorrelation function, which is expected to show peaks at delays corresponding to multiples of the glottal wave period (1/F0). The autocorrelation is calculated as:

$$R_m(k) = \sum_{j=m}^{m+N-k-1} s(j)\, s(j+k) \tag{1}$$

where s is the speech signal, m is the index of the first sample of the analysis frame, k is the lag and N is the frame length.
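The two-branch F0 estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 60-400 Hz search range, the Hamming window in the cepstral branch and the function names are all choices made here, and the frame is assumed to be voiced and at least `fs / fmin` samples long.

```python
import numpy as np


def f0_autocorr(frame, fs, fmin=60.0, fmax=400.0):
    """Time-domain estimate: lag of the largest autocorrelation peak."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - np.mean(frame)
    # Keep non-negative lags only: r[k] = sum_j s(j) s(j+k).
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(r[lo:hi + 1]))
    return fs / lag


def f0_cepstrum(frame, fs, fmin=60.0, fmax=400.0):
    """Frequency-domain estimate: quefrency of the largest cepstral peak."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)
    cepstrum = np.fft.irfft(log_mag)  # Fourier analysis of the log spectrum
    lo, hi = int(fs / fmax), int(fs / fmin)
    quefrency = lo + int(np.argmax(cepstrum[lo:hi + 1]))
    return fs / quefrency


def estimate_f0(frame, fs):
    """Average of the two estimates, as in Figure 2."""
    return 0.5 * (f0_autocorr(frame, fs) + f0_cepstrum(frame, fs))


if __name__ == "__main__":
    fs = 16000
    n = np.arange(512)  # one 32 ms frame
    # Synthetic voiced frame: harmonics of 200 Hz with 1/h amplitudes.
    frame = sum(np.cos(2 * np.pi * 200 * h * n / fs) / h for h in range(1, 40))
    print(estimate_f0(frame, fs))
```

Restricting the peak search to a plausible pitch range keeps the zero-lag autocorrelation peak and the low-quefrency spectral envelope from being picked up as the pitch period.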

4.2. Calculation of Energy

Speech is a non-stationary signal. However, it can be viewed as stationary over a short time interval, roughly between 16 ms and 32 ms.

In the experiment, the short-time energy is calculated for speech frames with a length of 16 ms and 50% overlap. The short-time energy of a speech signal s[n] is calculated as:

$$E_{\hat{n}} = \sum_{m=\hat{n}-L+1}^{\hat{n}} \left( s[m]\, w[\hat{n}-m] \right)^2 \tag{2}$$

where s[m] is the speech signal, $w[\hat{n}-m]$ is the applied window of length L, and $\hat{n} = rR$, where R represents the frame shift and r is an integer.

Figure 2. A flowchart of the fundamental frequency estimation method. [Flowchart: the voiced frame feeds two branches, Autocorrelation → Peak Location (F0 estimation in the time domain) and Spectrum → Log-Spectrum → DFT of Log-Spectrum → Peak Location (F0 estimation in the frequency domain); the two estimates are combined by Averaging.]
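The short-time energy of Equation (2) can be computed frame by frame as in the sketch below. The 16 ms frame length and 50% overlap follow the text; the Hamming window is an assumption, since the paper does not name the window it applies, and `short_time_energy` is a hypothetical helper name.

```python
import numpy as np


def short_time_energy(s, fs, frame_ms=16.0, overlap=0.5):
    """Return the windowed energy of each frame of the signal `s`."""
    s = np.asarray(s, dtype=float)
    L = int(round(fs * frame_ms / 1000.0))       # window length in samples
    R = max(1, int(round(L * (1.0 - overlap))))  # frame shift in samples
    w = np.hamming(L)                            # assumed window shape
    n_frames = 1 + (len(s) - L) // R if len(s) >= L else 0
    energy = np.empty(n_frames)
    for r in range(n_frames):
        frame = s[r * R: r * R + L]
        # The window is symmetric, so w[n - m] and w[m] give the same sum.
        energy[r] = np.sum((frame * w) ** 2)
    return energy


if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    quiet_then_loud = np.sin(2 * np.pi * 200 * t) * np.where(t < 0.5, 0.1, 1.0)
    e = short_time_energy(quiet_then_loud, fs)
    print(len(e), e[0], e[-1])
```

At the 16 kHz sampling rate of the database, this gives 256-sample frames shifted by 128 samples.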

4.3. Calculation of Time Duration

The time duration of each unit under the different emotional states is calculated to obtain the prosodic characteristics of the speech signals.

Moreover, the duration of silence in each sentence is estimated, in order to obtain the pause assignment for each type of emotion.

The duration of silence is calculated using a speech endpoint detection method. The endpoint detection algorithm aims to separate the speech signal from background noise. The short-time energy is calculated to detect voiced speech, and the short-time zero-crossing rate is estimated to detect voiceless speech. The length of silence is then obtained by removing the speech parts.
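A minimal sketch of the silence measurement described above is given below. It marks a frame as voiced when its energy is high, as voiceless when its energy is low but its zero-crossing rate is high, and as silence otherwise. The thresholds (`energy_ratio`, `zcr_thresh`, and the lower energy floor for voiceless frames) are illustrative assumptions rather than values from the paper, and the function names are hypothetical.

```python
import numpy as np


def frame_signal(s, L, R):
    """Split `s` into overlapping frames of length L with shift R."""
    n_frames = 1 + (len(s) - L) // R
    idx = np.arange(L)[None, :] + R * np.arange(n_frames)[:, None]
    return s[idx]


def silence_duration(s, fs, frame_ms=16.0, energy_ratio=0.05, zcr_thresh=0.25):
    """Estimate the total duration of silence in seconds."""
    s = np.asarray(s, dtype=float)
    L = int(round(fs * frame_ms / 1000.0))
    R = L // 2                                   # 50% overlap
    frames = frame_signal(s, L, R)
    energy = np.sum(frames ** 2, axis=1)
    # Zero-crossing rate: fraction of adjacent sample pairs that change sign.
    signs = np.signbit(frames).astype(int)
    zcr = np.mean(np.abs(np.diff(signs, axis=1)), axis=1)
    e_max = energy.max()
    voiced = energy > energy_ratio * e_max
    # Voiceless speech: weak but above a small floor, with many zero crossings.
    voiceless = (~voiced) & (zcr > zcr_thresh) & (energy > 0.1 * energy_ratio * e_max)
    silent = ~(voiced | voiceless)
    return silent.sum() * R / fs


if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs // 2) / fs
    tone = np.sin(2 * np.pi * 200 * t)
    s = np.concatenate([np.zeros(fs // 2), tone, np.zeros(fs // 2)])
    print(silence_duration(s, fs))  # close to 1 second of silence
```

In practice the thresholds would need to be tuned to the recording conditions, since background noise also has a high zero-crossing rate.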

Table 1 shows the average value of prosodic features

for the units under five different types of emotion.

Table 1. Average prosodic feature values of units under five emotional states.

Prosodic feature     Neutral   Angry   Happy   Sad     Bored
Pitch (Hz)           149       251     203     126     169
Time duration (s)    0.24      0.16    0.18    0.26    0.25
Energy (dB)          56.4      73.2    64.8    52.1    49.3

5. TD-PSOLA Method

Time Domain Pitch Synchronous Overlap Add (TD-PSOLA) is a widely used concatenative synthesis method. The basic contribution of the TD-PSOLA technique is to modify the pitch directly on the speech waveform. There are three steps in TD-PSOLA: pitch synchronization analysis, pitch synchronization modification and pitch synchronization synthesis.

Pitch synchronization analysis is the core of the TD-PSOLA method. It performs two tasks: fundamental frequency detection and pitch marking. Let $x_m(n)$ denote the windowed short-time signal:

$$x_m(n) = h_m(t_m - n)\, x(n) \tag{3}$$

where $t_m$ is the pitch mark and $h_m$ is the window sequence.

Pitch synchronization modification adapts the pitch marks by changing the duration (inserting or deleting sequences one pitch period long) and the tone (increasing or decreasing the fundamental frequency).

Pitch synchronization synthesis adds the new sequence signals produced in the previous step. In this work, the Least-Square Overlap-Added Scheme is used to obtain the synthesized signal:

$$\tilde{x}(n) = \frac{\sum_q a_q\, x_q(n)\, h_q(t_q - n)}{\sum_q h_q^2(t_q - n)} \tag{4}$$

where $x_q(n)$ is the q-th modified short-time signal, $t_q$ is the new pitch mark, $h_q$ is the synthesis window sequence, and $a_q$ is the weight that compensates for the energy loss when modifying the pitch value.
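The three TD-PSOLA steps can be sketched in a few lines. This is a simplified illustration under several assumptions that are not taken from the paper: the analysis pitch marks are given (at least two of them), the same Hann window is used for analysis and synthesis, each synthesis mark reuses the nearest analysis segment, and the weight is a single constant `gain`. The function name `td_psola` and its parameters are hypothetical.

```python
import numpy as np


def td_psola(x, marks, pitch_scale=1.0, time_scale=1.0, gain=1.0):
    """Simplified TD-PSOLA with least-squares overlap-add normalisation.

    x           : 1-D speech signal
    marks       : increasing sample indices of the analysis pitch marks
    pitch_scale : > 1 raises the pitch, < 1 lowers it
    time_scale  : > 1 lengthens the signal, < 1 shortens it
    """
    x = np.asarray(x, dtype=float)
    marks = np.asarray(marks, dtype=int)
    out_len = int(round(len(x) * time_scale))
    num = np.zeros(out_len)  # numerator of the overlap-add sum
    den = np.zeros(out_len)  # denominator: sum of squared windows
    t_syn = marks[0] * time_scale
    while t_syn <= marks[-1] * time_scale:
        # Analysis mark closest to this synthesis position on the original time axis.
        m = int(np.argmin(np.abs(marks - t_syn / time_scale)))
        left = marks[m] - marks[m - 1] if m > 0 else marks[1] - marks[0]
        right = marks[m + 1] - marks[m] if m < len(marks) - 1 else left
        T = int(min(left, right))            # local pitch period
        k = np.arange(-T, T + 1)
        h = np.hanning(2 * T + 1)            # two-period window
        src = marks[m] + k                   # samples read from x
        dst = int(round(t_syn)) + k          # samples written to the output
        ok = (src >= 0) & (src < len(x)) & (dst >= 0) & (dst < out_len)
        x_q = h[ok] * x[src[ok]]             # windowed short-time signal
        num[dst[ok]] += gain * x_q * h[ok]
        den[dst[ok]] += h[ok] ** 2
        t_syn += T / pitch_scale             # spacing of the new pitch marks
    y = np.zeros(out_len)
    covered = den > 1e-8
    y[covered] = num[covered] / den[covered]
    return y


if __name__ == "__main__":
    n = np.arange(1600)
    x = np.sin(2 * np.pi * n / 80)           # 200 Hz tone at 16 kHz
    marks = np.arange(80, 1521, 80)          # one mark per period
    higher = td_psola(x, marks, pitch_scale=2.0)
    longer = td_psola(x, marks, time_scale=1.5)
    print(len(higher), len(longer))
```

With both scale factors equal to 1, the synthesis marks coincide with the analysis marks and the normalisation returns the input unchanged wherever the windows overlap, which is a convenient sanity check for the overlap-add formula.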

6. Experiments and Results

Figure 3 illustrates an example of emotional speech synthesis using the method proposed in this work. It shows the waveforms of the utterance “Das schwarze Stück Papier befindet sich da oben neben dem Holzstück” produced in the neutral state and in four emotional states: angry, happy, sad and bored. Figure 3 also shows the waveforms of the emotional speech synthesized for the same four emotional states by the prosodic feature modification algorithm.

In order to evaluate the performance of the proposed emotional speech synthesis system, a subjective test was carried out. Six participants listened to the synthesized emotional speech utterances and selected the type of emotion each one expressed. The subjective test results (confusion matrix) are listed in Table 2.
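To make the confusion matrix easier to read, the short sketch below computes the per-emotion accuracy, the unweighted average accuracy and the most frequent confusion for a given emotion from the percentages in Table 2. The averaging is an illustration added here; the paper itself reports only the matrix, and the helper names are hypothetical.

```python
LABELS = ["angry", "happy", "sad", "bored"]

# Rows: synthesized emotion; columns: emotion chosen by listeners (percent, Table 2).
CONFUSION = [
    [90.3, 8.1, 1.6, 0.0],
    [8.1, 87.1, 3.2, 1.6],
    [6.5, 6.5, 80.5, 6.5],
    [3.2, 8.1, 11.3, 77.4],
]


def per_class_accuracy(matrix, labels):
    """Diagonal entries: how often each emotion was recognised correctly."""
    return {label: matrix[i][i] for i, label in enumerate(labels)}


def average_accuracy(matrix):
    """Unweighted mean of the diagonal entries."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


def most_confused_with(matrix, labels, emotion):
    """Label that listeners most often chose instead of `emotion`."""
    i = labels.index(emotion)
    others = [(matrix[i][j], labels[j]) for j in range(len(labels)) if j != i]
    return max(others)[1]


if __name__ == "__main__":
    print(per_class_accuracy(CONFUSION, LABELS))
    print(average_accuracy(CONFUSION))
    print(most_confused_with(CONFUSION, LABELS, "bored"))
```

Read this way, the synthesized "bored" utterances are most often mistaken for "sad", the two emotions with the lowest pitch and energy in Table 1 apart from neutral speech.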

7. Conclusions and Discussion

In this work, a prosodic feature modification method combined with the PSOLA algorithm is proposed in order to add emotional color to neutral speech. Figure 3 shows the waveforms of the natural and synthesized speech signals for four different types of emotion. It can be seen that the waveforms of the synthesized speech differ among the types of emotion, and that they are similar to the waveforms of natural speech pronounced by human beings. The subjective test reported in Table 2 indicates that the synthesized speech signals contain clear emotional color: it is easy for human listeners to classify the emotion types of the synthesized utterances. For the synthesized speech, the emotion “angry” is the easiest to classify. This is because natural “angry” speech contains strong emotional arousal, resulting in distinctive prosodic characteristics. The emotion “bored” obtains the lowest subjective classification accuracy. This is probably because the acoustic characteristics of “bored” are not clear; this kind of emotion is mainly expressed through linguistic information.

One of the shortcomings of this work is that the emotional speech data size is limited. In order to meet the needs of natural conversation with rich emotional expression, a much larger emotional speech unit database is required.

In order to produce more natural emotional speech, the concatenative unit selected in this work is the “word”, because the speech database provides corresponding words for each type of emotion. However, in real-life applications, there is essentially an infinite number of words. Therefore, some words in the speech to be synthesized will not be in the “dictionary” database, and only the best available word can be found through the unit selection algorithm, resulting in lower quality of the final speech. One solution is to choose a shorter unit type, such as syllables or phonemes. In this way, a smaller data set will be required.

Figure 3. Waveforms of neutral speech (a), emotional speech (angry (b), happy (d), sad (f) and bored (h)) and synthesized emotional speech (angry (c), happy (e), sad (g) and bored (i)).

Table 2. Subjective test results.

Synthesized      Classified emotion type
emotion type     Angry    Happy    Sad      Bored
Angry            90.3%    8.1%     1.6%     0%
Happy            8.1%     87.1%    3.2%     1.6%
Sad              6.5%     6.5%     80.5%    6.5%
Bored            3.2%     8.1%     11.3%    77.4%

REFERENCES

[1] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz and J. G. Taylor, “Emotion Recognition in Human-Computer Interaction,” Signal Processing Magazine, IEEE, Vol. 18, No. 1, 2001, pp. 32-80. http://dx.doi.org/10.1109/79.911197

[2] M. Schröder, R. Cowie and E. Cowie, “Emotional Speech Synthesis: A Review,” Eurospeech-2001, 2001.

[3] J. E. Cahn, “The Generation of Affect in Synthesized Speech,” Journal of the American Voice I/O Society, Vol. 9, 1990, pp. 1-19.

[4] F. Burkhardt and F. Sendlmeier, “Verification of Acoustical Correlates of Emotional Speech Using Formant-Synthesis,” ISCA Workshop on Speech & Emotion, Northern Ireland, 2000, pp. 151-156.

[5] M. Bulut, S. Narayanan and A. Syrdal, “Expressive Speech Synthesis Using a Concatenative Synthesizer,” Proceedings of ICSLP, 2002, pp. 1265-1268.

[6] E. Eide, “Preservation, Identification, and Use of Emotion in a Text-to-Speech System,” Proceedings of IEEE Workshop on Speech Synthesis, 2002, pp. 127-130.

[7] A. W. Black and N. Campbell, “Optimising Selection of Units from Speech Database for Concatenative Synthesis,” Proceedings of EUROSPEECH-95, 1995, pp. 581-584.

[8] J. Pitrelli, R. Bakis, E. Eide, R. Fernandez, W. Hamza and M. Picheny, “The IBM Expressive Text-to-Speech Synthesis System for American English,” IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 4, 2006, pp. 1099-1108. http://dx.doi.org/10.1109/TASL.2006.876123

[9] W. Hamza, R. Bakis, E. Eide, M. Picheny and J. Pitrelli, “The IBM Expressive Speech Synthesis System,” Proceedings of ICSLP, 2004.

[10] G. Hofer, K. Richmond and R. Clark, “Informed Blending of Databases for Emotional Speech Synthesis,” Proceedings of Interspeech, 2005, pp. 501-504.

[11] M. Schröder, “Speech and Emotion Research: An Overview of Research Frameworks and a Dimensional Approach to Emotional Speech Synthesis,” Ph.D. Thesis, Saarland University, Saarland, 2004.

[12] L. R. Rabiner and R. W. Schafer, “Digital Processing of Speech Signals,” Prentice-Hall, Inc., Englewood Cliffs, 1978.

[13] F. Burkhardt, A. Paeschke, M. Rolfes, et al., “A Database of German Emotional Speech,” Proceedings of Interspeech, 2005.