Intelligent Information Management, 2010, 2, 58-64
doi:10.4236/iim.2010.21008 Published Online January 2010 (http://www.scirp.org/journal/iim)
Copyright © 2010 SciRes IIM
Design of Multilingual Speech Synthesis System
S. SARASWATHI¹, R. VISHALAKSHY²
¹Assistant Professor, Department of Information Technology, Pondicherry Engineering College, Pondicherry, India
²Department of Information Technology, Pondicherry Engineering College, Pondicherry, India
Email: swathimuk@yahoo.com
Abstract
The main objective of this paper is to convert written multilingual text into machine-generated synthetic
speech. The paper proposes a complete multilingual speech synthesizer for three languages: Indian English,
Tamil and Telugu. A TTS system is helpful for blind and mute people, who can have text read to them by a
computer; it also helps in retrieving information from sites whose content is in different languages, and it
can be used in educational institutions for teaching the pronunciation of different languages. We use
concatenative speech synthesis, in which segments of recorded speech are concatenated to produce the desired
output. We apply prosody, which makes the synthesized speech sound more like human speech, and smoothing,
which smooths the transitions between segments to produce continuous output. The Optimal Coupling algorithm
is enhanced to improve the performance of the speech synthesis system.
Keywords: Prosody, Smoothing, Optimal Coupling
1. Introduction
Speech synthesis is the artificial production of human
speech. A computer system used for this purpose is call-
ed a speech synthesizer, and can be implemented in soft-
ware or hardware. A text-to-speech (TTS) system conve-
rts normal language text into speech; other systems rend-
er symbolic linguistic representations like phonetic tran-
scriptions into speech. Synthesized speech can be created
by concatenating pieces of recorded speech that are stor-
ed in a database. Systems differ in the size of the stored
speech units. The speech units can be phonemes, dipho-
nes, syllables etc. For specific usage domains, the storage
of entire words or sentences allows for high-quality output.
A text-to-speech system is composed of two parts: a
front-end and a backend. The front-end has two major
tasks. First, it converts raw text containing symbols like
numbers and abbreviations into the equivalent words.
This process is often called text normalization, preprocessing, or tokenization. The front-end then assigns pho-
netic transcriptions to each word, and divides and marks
the text into prosodic units, like phrases, clauses, and
sentences. The process of assigning phonetic transcrip-
tions to words is called text-to-phoneme conversion.
Phonetic transcriptions and prosody information together
make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred
to as the synthesizer, converts the symbolic linguistic
representation into sound. Although text-to-speech sys-
tems have improved over the past few years, some
shortcomings still exist. For instance, many text-to-
speech systems are designed for only a single language.
However, there are many applications that need a system
that can provide speech synthesis of words from multiple
languages, and in particular, speech synthesis where
words from two or more languages are contained in the
same sentence. In fact when we look at newspapers we
can find many sentences with embedded foreign lan-
guage expressions or proper names. It would not be ap-
propriate to have these sentences pronounced with a
speech engine developed for a single language. This paper specifically addresses the issues of building
synthetic voices for three major languages, namely English, Tamil and Telugu.
The techniques employed for synthesizing speech
from text may be broadly classified into three categories:
1) Formant-based, 2) Parameter-based and 3) Concatena-
tion-based. The three sub-categories of Concatenation-
based are 1) Unit Selection Synthesis, 2) Diphone Syn-
thesis and 3) Domain-Specific Synthesis. We use unit
selection synthesis [1], in which the prerecorded words are split into syllables, stored with the
corresponding syllable names, and maintained in a database. Based on
the input, the syllable units are selected from the data-
base and concatenated to generate the speech.
To make the speech sound more natural, prosody and smoothing are applied at the right places. Various techniques
available for smoothing include Spectral Smoothing,
Optimal Coupling, Waveform Interpolation, LP Tech-
niques and Pitch Synchronous Overlap Add (PSOLA) [2].
We use a technique based on Optimal Coupling algo-
rithm.
The most important qualities of a speech synthesis
system are naturalness and intelligibility. Naturalness
describes how closely the output sounds like human
speech, while intelligibility is the ease with which the
output is understood. The ideal speech synthesizer is
both natural and intelligible. Speech synthesis systems
usually try to maximize both characteristics. The tech-
nologies for generating synthetic speech waveforms are
listed below.
1) Concatenative synthesis
a) Unit selection synthesis
b) Diphone synthesis
c) Domain-specific synthesis
2) Formant synthesis
3) Articulatory synthesis
4) HMM-based synthesis
5) Sinewave synthesis
Each technology has its own strengths and weaknesses,
and the intended uses of a synthesis system will typically
determine which approach is used. In the Thirukkural text-to-speech synthesis system [3], a concatenative
synthesis approach is used, where natural speech is concatenated to give the resulting speech output. It
involves two phases,
namely, the offline phase and the online phase. Offline
phase includes pre-processing, segmentation and pitch
marking. Online phase includes text analysis and synthe-
sis. Here, syllables have been used as basic units. The
size of the database used is large. Further work is needed to make the speech sound more natural, to add
emotions, and to provide good synthesis of foreign words.
A formant-based speech synthesizer employing demi-syllable concatenation [4] involves identifying and ex-
tracting the formants from an actual speech signal (la-
beled to identify approximate demi-syllable areas) and
then using this information to construct demi-syllable
segments each represented by a set of filter parameters
and a source signal waveform. With the demi-syllable as the basic unit, numerous concatenation points are
required, and hence the speech lacks a continuous flow.
In the Hidden Markov Model (HMM) based speech synthesis system [5], the speech waveform is generated from
the HMMs themselves; the approach is applied to English speech synthesis using the general speech synthesis
architecture of the Festival framework. Although the synthesized speech has a
typical quality of “vocoded speech,” it has been shown
that the mixed excitation model based on Mixed Excita-
tion LPC Vocoder technique (MELP) and post filtering
can improve the speech quality significantly.
The Dhvani TTS system [6] uses phoneme concatenation, with an attempt to cover all Indian languages under a
single framework. The system detects the language and dispatches the text to the corresponding phonetic
synthesizer. Here, prosody is not given much importance, and the quality of the speech could be improved to
some extent. We propose a multilingual speech synthesizer to overcome the drawbacks of these systems to the
extent possible.
The rest of the paper is organized as follows. Section 2 describes the proposed system and its modules,
Section 3 discusses the results, and Section 4 concludes the paper.
2. Multilingual TTS System
The TTS system comprises modules for Preprocessing,
Unicode Conversion, Segmentation, Concatenation, Pro-
sody and Smoothing. Figure 1 illustrates the overall ar-
chitecture of TTS system.
2.1. Preprocessing
The first stage of a TTS system is the pre-processing
module, called tokenization. It converts the input text
into a sequence of words and symbols to be processed by
the rest of the system. It identifies and makes decisions
on what to do with punctuation marks and other non-alphabetic textual symbols (e.g., parentheses), and it
identifies and expands abbreviations, acronyms, and numbers into full orthographic strings. Each input line is scanned
and each recognized construct (word, number, symbol,
etc) is converted into an appropriate word or sequence of
words as shown in Figure 2.
Figure 1. Overall block diagram of the TTS system
Figure 2. Preprocessing
Preprocessing includes the following steps:
1) The removal of single and double quotes from the
start and end of the token.
2) Expansion of abbreviations, acronyms, and num-
bers.
3) In the case of Tamil and Telugu input, if there is any
English word, it is replaced by an equivalent word in the
corresponding language.
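A minimal Python sketch of the preprocessing steps above is given below. It is not the system's actual implementation (which is not published); the abbreviation table, the digit-by-digit number expansion and the English-to-Tamil replacement pair are illustrative assumptions.

ABBREVIATIONS = {"Dr.": "Doctor", "No.": "Number", "etc.": "et cetera"}   # illustrative entries
ENGLISH_TO_TAMIL = {"computer": "kanini"}                                 # hypothetical replacement pair

def expand_number(token):
    """Spell out an integer digit by digit (placeholder number expansion)."""
    digits = ["zero", "one", "two", "three", "four",
              "five", "six", "seven", "eight", "nine"]
    return " ".join(digits[int(d)] for d in token if d.isdigit())

def preprocess(line, language="english"):
    words = []
    for token in line.split():
        token = token.strip("'\"")            # 1) remove quotes at the start/end of the token
        if token in ABBREVIATIONS:            # 2) expand abbreviations and acronyms
            token = ABBREVIATIONS[token]
        elif token.isdigit():                 #    ... and numbers
            token = expand_number(token)
        if language in ("tamil", "telugu"):   # 3) replace embedded English words
            token = ENGLISH_TO_TAMIL.get(token.lower(), token)
        words.append(token)
    return words

print(preprocess('"Dr. Smith signed deal 123"'))
# -> ['Doctor', 'Smith', 'signed', 'deal', 'one two three']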
2.2. Unicode Conversion
Unicode is an industry standard allowing computers to
consistently represent and manipulate text expressed in
most of the world's writing systems. In our work, we use Unicode conversion to accept input text in multiple
fonts. The idea behind Unicode conversion is to identify the font in which the given text is encoded and to
apply the corresponding Unicode mapping to it.
For Tamil, since the input may be in TAM, TAB or TSCII, the encoding schemes of these fonts are examined and,
for each character in the input, the corresponding Unicode code point is produced, converting the given text
into Unicode. Similarly, for Telugu, the encoding scheme of the Shree font is examined and Unicode encoding is
applied to the input. In the case of multilingual input, the text is converted to Unicode beforehand and saved
as a text file containing input from all three languages. If the input is in a single language, the Unicode
conversion is carried out internally, as shown in Figure 3.
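The Python sketch below illustrates the idea of font identification followed by code-point mapping. The byte values in the tables are tiny illustrative stand-ins, not the real TAM/TAB/TSCII or Shree code tables, and the coverage-based font detection is an assumption rather than the method used in the system.

FONT_TABLES = {
    "TSCII": {0xAB: "\u0b85", 0xAC: "\u0b86"},   # e.g. two Tamil letters (illustrative values)
    "TAM":   {0xC5: "\u0b85"},
    "SHREE": {0xA1: "\u0c05"},                   # e.g. a Telugu letter (illustrative value)
}

def detect_font(data):
    """Rough font identification (an assumption): choose the table that
    explains the largest number of non-ASCII bytes in the input."""
    def coverage(table):
        return sum(1 for b in data if b in table)
    return max(FONT_TABLES, key=lambda name: coverage(FONT_TABLES[name]))

def to_unicode(data):
    """Map every font-specific byte to its Unicode character; ASCII passes through."""
    table = FONT_TABLES[detect_font(data)]
    return "".join(table.get(b, chr(b)) if b >= 0x80 else chr(b) for b in data)

print(to_unicode(bytes([0xAB, 0xAC])))   # two Tamil characters under the toy TSCII table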
2.3. Segmentation
Speech in Indian language is based on basic sound units
which are inherently syllable units made from C, CV,
CCV, VC and CVC combinations, where C is a conso-
nant and V is a vowel. From perceptual results, it is ob-
served that from four different choices of speech units
like syllable, diphone, phone and half phone, the syllable
unit performs better than all the rest and is a better rep-
resentation for Indian languages.
In our work, for English, we segment the text into syl-
lables using the basic syllable pattern (C) (C) (C) V (C)
(C) (C) (C). Hence, the possible syllable patterns are one
among V, CV, CCV, CCCV, VC, VCC, VCCC, CVC, VVC,
CCVC, CCVCC, CCCVC.
2.3.1. Rules for Segmentation of English
1) If a word follows a basic syllable pattern then it is not
split into syllables.
2) If a word contains more than one syllable pattern, we frame rules based on the pronunciation, for example:
3) If CV is followed by CCV then split the word as
CVC and CV.
4) If CV is followed by VC then split the word as CV
and VC.
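The English rules above can be sketched as follows. Mapping letters directly to C/V tags is a simplification of English grapheme-to-phoneme behaviour, and the function only implements rules 3 and 4 on top of the basic pattern; it is a sketch, not the system's actual segmenter.

VOWELS = set("aeiou")

def cv_pattern(word):
    """Tag each letter as C or V (a simplification of English spelling)."""
    return "".join("V" if ch.lower() in VOWELS else "C" for ch in word)

def split_english(word):
    """Split a word at CV|CCV and CV|VC junctions (rules 3 and 4 above)."""
    pattern = cv_pattern(word)
    syllables, start, i = [], 0, 0
    while i < len(pattern) - 1:
        if pattern[i:i + 2] == "CV" and pattern[i + 2:i + 5] == "CCV":
            syllables.append(word[start:i + 3])   # rule 3: ...CVC / CV...
            start = i = i + 3
        elif pattern[i:i + 2] == "CV" and pattern[i + 2:i + 4] == "VC":
            syllables.append(word[start:i + 2])   # rule 4: ...CV / VC...
            start = i = i + 2
        else:
            i += 1
    syllables.append(word[start:])
    return syllables

print(split_english("public"))   # -> ['pub', 'lic']  (rule 3)
print(split_english("being"))    # -> ['be', 'ing']   (rule 4)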
For Tamil and Telugu, the basic syllable pattern is
Figure 3. Unicode conversion module
Figure 4. Segmentation of text into syllables
Table 1. Word count for all the three languages
Language No. of Words Syllable count
English 4200 2225
Tamil 4213 1120
Telugu 4118 1950
represented in the form of (C)V(C)(C), and the possible syllable patterns are V, CV, CVC, CVCC, VC and VCC [7].
2.3.2. Rules for Segmentation of Tamil and Telugu
1) If characters after CV pattern are of type CV then the
syllables are split as CV and CV.
2) If the CV pattern is followed by CCV then syllables
are split as CVC and CV.
3) If the CV pattern is followed by CCCV then sylla-
bles are split as CVCC and CV.
4) If the VC pattern is followed by V then syllables are
split as V and CV.
5) If the VC pattern is followed by CVC then syllables
are split as VC and CVC.
Based on these rules, the text is split into syllables, and the locations of the corresponding sound files for
the syllables are written to a text file.
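A compact way to express rules 1–5 is: of the consonants between two vowel nuclei, all but the last remain as the coda of the earlier syllable and the last one becomes the onset of the next, while a word-final cluster stays with the last syllable. The sketch below works on a C/V pattern string; producing that pattern from Tamil or Telugu script (a grapheme-to-phone step) is assumed and not shown.

def syllabify(pattern):
    """Split a C/V pattern string into syllables following rules 1-5 above."""
    syllables, i, n = [], 0, len(pattern)
    while i < n:
        j = i
        if j < n and pattern[j] == "C":        # optional onset consonant
            j += 1
        if j < n and pattern[j] == "V":        # vowel nucleus
            j += 1
        k = j
        while k < n and pattern[k] == "C":     # consonant cluster after the nucleus
            k += 1
        if k == n:                             # word-final cluster stays as coda
            j = k
        elif k - j >= 2:                       # rules 2, 3 and 5: keep all but one C as coda
            j = k - 1
        # a single following C (rules 1 and 4) becomes the onset of the next syllable
        syllables.append(pattern[i:j])
        i = j
    return syllables

# Pattern of Telugu "alochinchadam" (a/lo/chin/cha/dam), treating 'ch' as one consonant:
print(syllabify("VCVCVCCVCVC"))   # -> ['V', 'CV', 'CVC', 'CV', 'CVC']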
2.3.3. Collection of Syllables (Speech)
After segmentation, recording is done at the word level and the recorded words are segmented into syllables
manually. The recording is done at a sampling frequency of 48 kHz and the channel type is set to stereo. The
syllable count is reduced by finding the syllables that are common to Tamil and Telugu. For example, consider
the words "alochinchadam" in Telugu and "alosanai" in Tamil. They are syllabified as alochinchadam –
a/lo/chin/cha/dam and alosanai – a/lo/sa/nai. In this case, the syllables 'a' and 'lo' are common to both
languages. Hence these syllables are stored only once in the database, which can be accessed by both
languages, thereby reducing the size of the database.
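A sketch of how such a shared syllable database can be organized is shown below. The directory layout and the convention of naming each WAV file after the syllable's transliteration are assumptions for illustration, not details taken from the paper.

import os
import shutil

DB_DIR = "syllable_db"   # hypothetical location of the shared syllable database

def add_word(syllables, recordings_dir):
    """Copy each syllable recording into the shared database unless a syllable
    with the same name (e.g. 'a' or 'lo') is already stored, so units common
    to Tamil and Telugu are kept only once."""
    os.makedirs(DB_DIR, exist_ok=True)
    for syl in syllables:
        target = os.path.join(DB_DIR, syl + ".wav")
        if not os.path.exists(target):
            shutil.copy(os.path.join(recordings_dir, syl + ".wav"), target)

# Telugu "alochinchadam" and Tamil "alosanai" share the syllables 'a' and 'lo':
# add_word(["a", "lo", "chin", "cha", "dam"], "telugu_recordings")
# add_word(["a", "lo", "sa", "nai"], "tamil_recordings")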
2.3.4. Collection of Input Text
We collected the input text from the following sites:
Telugu: www.saakshi.com
Tamil: www.dinamani.com
English: www.hindu.com
Domain: Nuclear deal
Table 1 gives the word and syllable counts for the three languages. The number of syllables common to Tamil
and Telugu is 323.
2.4. Concatenation
In concatenative speech synthesis, the segments of re-
corded speech are concatenated to produce the desired
output [8]. Generally, this technique produces the most natural-sounding synthesized speech, since the number
of concatenation points is small. There are three main subtypes of concatenative speech synthesis, namely unit
selection synthesis, diphone synthesis and domain-specific synthesis. In our work, we use unit selection
synthesis, in which the prerecorded words are split into syllables, stored with the corresponding syllable
names, and maintained in a database.
In the case of Tamil and Telugu, the syllable units are named using English letters based on their
pronunciation in the corresponding language; for example, syllable units are stored under transliterated names
such as "veeinn" and "duim".
Based on the input, the syllable units are selected from the database and concatenated using a MATLAB
simulator integrated with Java to create complete utterances, as shown in Figure 5.
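The actual system performs this step in MATLAB integrated with Java; the Python sketch below shows the equivalent unit-selection lookup and plain concatenation, assuming one WAV file per syllable named as in Section 2.3.3.

import numpy as np
from scipy.io import wavfile

def synthesize(syllable_names, db_dir="syllable_db", out_path="utterance.wav"):
    """Look up each syllable unit in the database and concatenate the waveforms
    in order; prosody (Section 2.5.1) and smoothing (Section 2.5.2) would be
    applied on top of this plain concatenation."""
    pieces, rate = [], None
    for name in syllable_names:
        rate, samples = wavfile.read(f"{db_dir}/{name}.wav")
        pieces.append(samples)
    utterance = np.concatenate(pieces)
    wavfile.write(out_path, rate, utterance)
    return utterance

# e.g. the syllables produced by segmentation for the Tamil word "alosanai":
# synthesize(["a", "lo", "sa", "nai"])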
2.5. Prosody and Smoothing
2.5.1. Prosody
Prosody is the rhythm, stress, and intonation of speech.
Prosody may reflect the emotional state of a speaker;
whether an utterance is a statement, a question, or a
command; whether the speaker is being ironic or sarcas-
tic; emphasis, contrast and focus; and other elements of
language which may not be encoded by grammar. Hence,
applying the concept of prosody in the TTS system makes the synthesized speech sound more like human speech.
There are several techniques used for prosody
[9,10]. We make use of the punctuation marks like ‘!’, ‘?’,
‘;’ to give stress. In this method, if the words in the input
text are followed by any of these special symbols they
are read with intonation. For this method to work, the syllables that are to be read with intonation are
maintained in a separate database, and these syllable units are concatenated when intonation is to be given.
Figure 5. Concatenation using unit selection synthesis
Figure 6. Smoothing
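A minimal sketch of this punctuation-driven unit selection is given below. The "_stress" suffix used to address the separate intonation database is a hypothetical naming convention, not the one used in the system.

PROSODY_MARKS = {"!", "?", ";"}

def choose_units(words_with_syllables):
    """Pick syllable units from the normal database, or from the separate
    intonation database (marked here with a hypothetical '_stress' suffix)
    when the word ends with '!', '?' or ';'."""
    units = []
    for word, syllables in words_with_syllables:
        stressed = bool(word) and word[-1] in PROSODY_MARKS
        for syl in syllables:
            units.append(syl + "_stress" if stressed else syl)
    return units

print(choose_units([("nuclear", ["nu", "cle", "ar"]),
                    ("deal?", ["deal"])]))
# -> ['nu', 'cle', 'ar', 'deal_stress']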
2.5.2. Smoothing
Smoothing is used to smooth the transitions between segments in order to produce continuous output resembling
that of human speech, as shown in Figure 6.
There are several techniques used for smoothing like
Spectral Smoothing, Optimal Coupling, Waveform In-
terpolation, LP Techniques and Pitch Synchronous Over-
lap Add (PSOLA). Among these techniques, the optimal coupling algorithm is easy to implement at low cost.
It is common in concatenative synthesis that the boun-
daries of speech segments are fixed, but the optimal cou-
pling technique allows the boundaries to move to provide
the best fit with adjacent segments. A measure of mis-
match is tested at a number of possible segment bounda-
ries until the closest match is found. While any form of
measure may be used, for the sake of improving spectral
quality, using a spectral discontinuity measure is appro-
priate. Measures considered include mel-frequency cepstral coefficients (MFCC) [11] and the auditory-neural
based measure (ANBM) [12]. It is not necessary to im-
plement optimal coupling to perform spectral smoothing,
but it does provide some improvement at a small cost.
We propose a new technique for smoothing based on the optimal coupling algorithm. As shown in Figure 7, we
initially set the boundaries within which to search for the optimal concatenation point. If the given syllable
unit is the beginning of a word, the search region is set from the 5n/6-th position to n, where n is the
length of the syllable unit; otherwise, the search region is set from the beginning to the n/3-rd position.
Hamming windows are then formed over the selected portions to find the concatenation point.
Figure 7. Smoothing using optimal coupling
The zero-crossing counts are found for all the Hamming windows, and a point in the window with the minimum
count is chosen as the optimal concatenation point. Finally, the syllables are concatenated at the chosen
points to generate the smoothed speech output.
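The procedure can be sketched as follows. The Hamming-window length and hop (10 ms and 5 ms at 48 kHz) and the choice of the window centre as the returned point are assumptions; the paper does not specify these values.

import numpy as np

def best_cut(region, start_offset, win_len=480, hop=240):
    """Slide Hamming windows over 'region' and return the sample index (relative
    to the whole syllable) at the centre of the window with the fewest zero
    crossings; 480/240 samples correspond to 10 ms / 5 ms at 48 kHz."""
    if len(region) <= win_len:
        return start_offset + len(region) // 2
    window = np.hamming(win_len)
    best_pos, best_count = start_offset, np.inf
    for pos in range(0, len(region) - win_len, hop):
        frame = region[pos:pos + win_len] * window
        crossings = np.count_nonzero(np.signbit(frame[:-1]) != np.signbit(frame[1:]))
        if crossings < best_count:
            best_count = crossings
            best_pos = start_offset + pos + win_len // 2
    return best_pos

def smooth_join(left, right):
    """Join two syllable waveforms at low zero-crossing points: search the last
    sixth of the left (word-initial) unit and the first third of the right unit."""
    n_l = len(left)
    cut_l = best_cut(left[5 * n_l // 6:], 5 * n_l // 6)
    cut_r = best_cut(right[:len(right) // 3], 0)
    return np.concatenate([left[:cut_l], right[cut_r:]])

# joined = smooth_join(left_syllable.astype(float), right_syllable.astype(float))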
3. Results and Analysis
The results are analyzed using two parameters, namely
Mel-Frequency Cepstral Coefficients (MFCCs) and
Mean Opinion Score (MOS).
3.1. Mel-Frequency Cepstral Coefficient (MFCC)
Mel-frequency cepstral coefficients (MFCCs) are coeffi-
cients that collectively make up an MFC. They are de-
rived from a type of cepstral representation of the audio
clip. The difference between the cepstrum and the Mel-
Frequency Cepstrum is that in the MFC, the frequency
bands are equally spaced on the mel scale, which ap-
proximates the human auditory system's response more
closely than the linearly-spaced frequency bands used in
the normal cepstrum. This frequency warping can allow
for better representation of sound, for example, in audio
compression. In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power
spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of
frequency.
MFCCs are commonly derived as follows:
1) Take the Fourier transform of (a windowed excerpt
of) a signal.
2) Map the powers of the spectrum obtained above
onto the mel scale, using triangular overlapping win-
dows.
3) Take the logs of the powers at each of the mel fre-
quencies.
4) Take the discrete cosine transform of the list of mel
log powers, as if it were a signal.
5) The MFCCs are the amplitudes of the resulting
spectrum.
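The five steps above can be written directly in numpy/scipy as a sketch. The frame windowing, the number of mel filters (26) and the number of retained coefficients (13) are common defaults assumed here, not values taken from the paper.

import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, rate, n_filters=26, n_coeffs=13):
    """Compute MFCCs for one frame, following steps 1-5 above."""
    # 1) Fourier transform of the windowed frame, then the power spectrum.
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    # 2) Triangular windows, equally spaced on the mel scale, applied to the powers.
    freqs = np.linspace(0.0, rate / 2.0, len(power))
    hz_points = mel_to_hz(np.linspace(0.0, hz_to_mel(rate / 2.0), n_filters + 2))
    energies = np.zeros(n_filters)
    for i in range(n_filters):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = np.clip((freqs - lo) / (mid - lo), 0.0, None)
        falling = np.clip((hi - freqs) / (hi - mid), 0.0, None)
        energies[i] = np.sum(power * np.minimum(rising, falling).clip(0.0, 1.0))
    # 3) Log of each mel-band energy;  4) DCT of the log energies.
    cepstrum = dct(np.log(energies + 1e-10), type=2, norm="ortho")
    # 5) The MFCCs are the leading amplitudes of the resulting spectrum.
    return cepstrum[:n_coeffs]

# coeffs = mfcc(signal_frame, 48000)   # e.g. a 1024-sample frame from a 48 kHz recording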
MFCCs are commonly used as features in speech rec-
ognition systems, such as the systems which can auto-
matically recognize numbers spoken into a telephone.
They are also common in speaker recognition, which
is the task of recognizing people from their voices. The
MFCC values are calculated for a set of original words, say P, and for the corresponding concatenated words,
say Q. The Euclidean distance between the values of P and Q is calculated using the formula
d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}    (1)
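Equation (1) in code form, applied to two MFCC vectors, is simply the following; how the per-word MFCC vector is obtained (for example by averaging frame-level MFCCs over the word) is an assumption not stated in the paper.

import numpy as np

def mfcc_deviation(p, q):
    """Euclidean distance of Equation (1) between two MFCC vectors p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))

print(mfcc_deviation([1.0, 2.0, 3.0], [2.0, 2.0, 1.0]))   # sqrt(1 + 0 + 4) ≈ 2.236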
The results for English, Tamil and Telugu are tabulated in Tables 2, 3 and 4, respectively.
The deviations of the concatenated words from the original words in English, Tamil and Telugu, based on their
MFCC features, are plotted in Figures 8, 9 and 10, respectively. The average deviation of the MFCC values for
concatenated English words is 1.398, for Tamil 1.463, and for Telugu 1.492. Though the smoothing technique
works well for all three languages, the minimum deviation is obtained for English.
3.2. Mean Opinion Score (MOS)
The Mean Opinion Score (MOS) provides a numerical indication of the perceived quality of received media after
compression and/or transmission. The MOS is expressed as a single number in the range 1 to 5, where 1 is the
lowest and 5 the highest perceived audio quality.
Figure 8. Words (English) vs deviation based on MFCC
Figure 9. Words (Tamil) vs deviation based on MFCC
Figure 10. Words (Telugu) vs deviation
Table 2. Result analysis for English using MFCC
Words Deviation from the original word
President 1.13
Agreement 1.26
Prevent 1.33
India 1.35
Critics 1.40
Sufficient 1.41
Overly 1.46
Nuclear 1.51
Lacking 1.56
Safeguards 1.57
Table 3. Result analysis for Tamil using MFCC
Table 4. Result analysis for Telugu using MFCC
Table 5. Rating of MOS
Mean opinion score (MOS)
MOS Quality Impairment
5 Excellent Imperceptible
4 Good Perceptible but not annoying
3 Fair Slightly annoying
2 Poor Annoying
1 Bad Very annoying
Table 6. Results of mean opinion score
MOS
LISTENERS English Tamil Telugu
Male (28) 3.9 3.7 3.5
Female (40) 4.0 3.9 3.6
The MOS is generated by
averaging the results of a set of standard, subjective tests
where a number of listeners rate the perceived audio quality of test sentences read aloud by both male and
female speakers over the communications medium being tested. A listener is required to give each sentence a
rating using the scheme shown in Table 5, and the MOS is the arithmetic mean of all the individual scores,
ranging from 1 (worst) to 5 (best). We took measurements from listener tests: sixty expert listeners were
asked to indicate their preferences in terms of naturalness and intelligibility for different words and
phrases. Table 6 shows the preliminary results of the Mean Opinion Score (MOS).
The listeners rated the smoothed speech produced after concatenation in comparison with natural speech. The
results obtained using these two parameters indicate that the speech is generated with minimal distortion.
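For completeness, the MOS value itself is just the arithmetic mean of the listeners' 1–5 ratings on the scale of Table 5; the ratings in the snippet below are illustrative, not the study's raw data.

def mean_opinion_score(ratings):
    """MOS is the arithmetic mean of the individual 1-5 ratings (Table 5)."""
    return sum(ratings) / len(ratings)

print(round(mean_opinion_score([4, 4, 5, 3, 4]), 2))   # -> 4.0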
4. Conclusions
The system has been successfully developed for the three
languages: Indian English, Tamil and Telugu. Syllable-based concatenation was used, which reduced the number
of concatenation points and hence the distortion. The Optimal Coupling technique was used for smoothing, which
resulted in natural-sounding speech. The quality of the synthesizer was evaluated by expert listeners. In the
proposed work we process only domain-restricted text. In future, the synthesizer has to be enhanced to process
unrestricted text, the smoothing technique needs to be explored further to find the optimal concatenation
point, and more Indian languages have to be included.
5. References
[1] S. P. Kishore, R. Kumar, and R. Sangal, “A data-driven
synthesis approach for Indian languages using syllable as
basic unit,” in Intl. Conf. on Natural Language Processing
(ICON), pp. 311–316, 2002.
[2] D. T. Chappell and J. H. L. Hansen, “Spectral smoothing
for speech segment concatenation,” Speech Communica-
tion, Vol. 36, No. 3–4, 2002.
[3] G. L. Jayavardhana Rama, A. G.Ramakrishnan, M. Vijay
Venkatesh and R. Murali Shankar, “Thirukkural–A text-
to-speech synthesis system,” paper presented at the
Tamil Internet 2001 Conference and Exhibition (TI2001),
2001.
[4] S. Schötz, “Data-driven formant synthesis of speaker
age,” In G. Ambrazaitis and S. Schötz (eds.). Lund Work-
ing Papers 52, Proceedings of Fonetik, Lund, pp.
105–108, 2006.
[5] K. Tokuda, H. Zen, and A. W. Black, “An HMM-based speech synthesis system applied to English,” paper
presented at the IEEE Speech Synthesis Workshop,
2002.
[6] http://dhvani.sourceforge.net (Dhvani–TTS System for
Indian Languages), 2001.
[7] S. Saraswathi and T. V. Geetha, “Language models for
Tamil Speech Recognition,” Publication in IETE Special
Issue on Spoken Language Processing, Vol. 24, No. 5, pp.
375–383, 2007.
[8] Céu Viana, “Concatenative speech synthesis for European
Portuguese,” paper presented at the Third ESCA/COCOSDA International Workshop on Speech Synthesis,
Australia, 1998.
[9] N. Sridhar Krishna and H. A. Murthy, “Duration model-
ing of Indian languages Hindi and Telugu,” Paper Pre-
sented in the proceedings of 5th ISCA Speech Synthesis
Workshop, 2004.
[10] N. Sridhar Krishna and H. A. Murthy, “A new prosodic
phrasing model for Indian language Telugu”, Paper Pre-
sented in the Proceedings of Interspeech-2004, ICSLP 8th
International Conference on Spoken Language Processing,
pp. 793–796, 2004.
[11] S. Imai, “Cepstral analysis synthesis on the mel frequen-
cy Scale,” Paper Presented in Proceedings of ICASSP,
Vol. 8, pp. 93–96, 1983.
[12] J. H. L. Hansen and D. T. Chappell, “An auditory based
distortion measure with application to concatenative
speech synthesis,” IEEE Transactions on Speech and Au-
dio Processing, Vol. 6, No. 5, pp. 489–495, 1998.