Journal of Software Engineering and Applications, 2012, 5, 124-127
doi:10.4236/jsea.2012.512b024 Published Online December 2012 (http://www.scirp.org/journal/jsea)
A HMM-Based System To Diacritize Arabic Text
M S Khorsheed
National Center for Robotics & Intelligent Systems, King Abdulaziz City for Science & Technology (KACST), POB 6086, Riyadh,
Saudi Arabia.
Email: mkhorshd@kacst.edu.sa
Received 2012.
ABSTRACT
The Arabic language belongs to the family of Semitic languages, whose sentence structure is entirely different from the point of view of Natural Language Processing. In such languages, two different words may have identical spellings while their pronunciations and meanings are totally different. To remove this ambiguity, special marks are placed above or below the spelling characters to determine the correct pronunciation. These marks are called diacritics, and a language that uses them is called a diacritized language. This paper presents a system for Arabic language diacritization using Hidden Markov Models (HMMs). The system employs the well-known HMM Tool Kit (HTK). Each single diacritic is represented as a separate model. The concatenation of output models is coupled with the input character sequence to form the fully diacritized text. The performance of the proposed system is assessed using a data corpus that includes more than 24000 sentences.
Keywords: Arabic; Hidden Markov Models; Text-to-speech; Diacritization
1. Introduction
The pronunciation of a word in some languages, such as English, is almost always fully determined by its constituent characters. In these languages, the sequence of consonants and vowels determines the correct corresponding sound when a word is pronounced. Such languages are called non-diacritized languages. On the other hand, there are languages, such as Arabic, where the pronunciation of a word cannot be fully determined by its spelling characters alone.
The Arabic alphabet consists of 28 basic letters, which are composed of strokes and dots. Ten of them have one dot, three have two dots and two have three dots. Dots can appear above, below or in the middle of the letter, as shown in Figure 1. The shape of a letter is context sensitive, depending on its location within a word. A letter can have up to four different shapes: isolated (Is); beginning (In), connected from the left; middle (M), connected from the left and right; and end (E), connected from the right, as shown in Figure 2.
Arabic characters may carry diacritics, which are written as strokes and can change the pronunciation and the meaning of a word. Diacritics may appear as strokes above the character, as in Fatha, Dhamma, Sukun, Shadda or Maddah, or below the character, as in Kasra. Tanween is a form of diacritizing Arabic script with a double Fatha, double Dhamma or double Kasra; it occurs only at the end of a word. Figure 3 shows the various diacritics which may appear with Arabic characters.
Diacritics perform an essential function in pronouncing a given word. To illustrate this with an example, consider the word shown in Figure 4, which can be pronounced as either (1) college or (2) kidney. Thus, an undiacritized Arabic word may have a large number of possible pronunciations, whereas for a diacritized Arabic word pronunciation is usually a one-to-one table matching process. Despite this importance, text is often left undiacritized, and readers of Arabic are accustomed to inferring the meaning from the context. When it comes to computer processing, however, the computer still needs to be provided with algorithms that mimic the human ability to identify the proper vocalization of the text.
Figure 1. Arabic character samples showing dots above, below and in the middle of a character.
Figure 2. A sample character in four different shapes.
Such a tool is an essential piece of infrastructure for applications such as text-to-speech, transliteration and automatic translation. The problem of automatically generating Arabic diacritic marks is known in the literature under various names, though in this report it is referred to as diacritization.
RDI of Egypt [1] has developed a system called ArabDiac 2.0 based on a proprietary morphological analyzer as well as syntactic rules and statistical information. The system is claimed to achieve 95% accuracy at the word level; however, the details of their technical approach have not been published. Among other systems are Al-Alamia of Kuwait [2] and CIMOS of France [3]. The formal approach to the problem of restoring the diacritical marks of Arabic text involves a complex integration of Arabic morphological, syntactic and semantic rules [4]. A morphological rule matches the undiacritized word to known patterns or templates and recognizes prefixes and suffixes. Syntax applies specific syntactic rules, implemented as finite state automata, to determine the final diacritical marks. Semantics help to resolve ambiguous cases and to filter out hypotheses.
2. System Overview
Figure 5 shows the block diagram of the proposed system. The global model is a network of interconnected small models, each of which represents a diacritic. Table 1 lists the fourteen possible diacritics plus the no-diacritic sign.
Figure 3. Various diacritics for Arabic characters.
Figure 4. A sample word with two pronunciations.
Table 1. A list of possible diacritics and the corresponding model names.

Diacritic                  Model name
Fatha                      Mfa
Shadda + Fatha             Msf
Tanween Fatha              Mtf
Shadda + Tanween Fatha     Stf
Damma                      Mda
Shadda + Damma             Msd
Tanween Damma              Mtd
Shadda + Tanween Damma     Std
Kasra                      Mka
Shadda + Kasra             Msk
Tanween Kasra              Mtk
Shadda + Tanween Kasra     Stk
Madda                      Mmd
Sukun                      Mso
No diacritic               Non
The system may be divided into two stages. The first stage is performed prior to HTK and includes feature extraction. The objective is to transform the Arabic text into a sequence of discrete symbols. Each character is coded according to its location within the word: start, middle or end. The feature space includes 110 features: 84 features for the 28 basic characters (one per character and position), 24 features for the additional characters mentioned earlier, one feature for the space character and one more to represent the start/end of a sentence.
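To make the coding concrete, the sketch below shows one way this step could be realized in Python. The exact symbol table is not published here, so the basic-character list, the index layout and the sentence-boundary marker are assumptions chosen only to reproduce the 110-symbol feature space described above.

# Hypothetical character-to-symbol coding for stage one (illustrative only).
BASIC = list("ابتثجحخدذرزسشصضطظعغفقكلمنهوي")   # the 28 basic Arabic letters
POSITIONS = ("start", "middle", "end")

def code_symbol(char, position):
    """Map a (character, position-in-word) pair to a discrete symbol id.

    28 basic letters x 3 positions give symbols 0-83; ids 84-107 are assumed
    to be reserved for the additional character forms, 108 for the space
    character and 109 for the start/end-of-sentence marker (110 in total).
    """
    if char == " ":
        return 108
    if char == "<s>":                        # sentence boundary marker (assumed)
        return 109
    if char in BASIC:
        return BASIC.index(char) * 3 + POSITIONS.index(position)
    raise ValueError("additional character forms would map to ids 84-107")

def code_sentence(words):
    """Turn a list of undiacritized words into a sequence of symbol ids."""
    symbols = [code_symbol("<s>", "start")]
    for word in words:
        for i, ch in enumerate(word):
            pos = "start" if i == 0 else ("end" if i == len(word) - 1 else "middle")
            symbols.append(code_symbol(ch, pos))
        symbols.append(code_symbol(" ", "middle"))
    symbols[-1] = code_symbol("<s>", "start")  # close the sentence instead of a trailing space
    return symbols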
Stage two is performed within HTK. It couples the extracted features with the corresponding ground truth to estimate the parameters of the diacritic models. The final output of this stage is a lexicon-free system for diacritizing Arabic text. During recognition, an input pattern of discrete symbols representing the text line is injected into the global model, which outputs a stream of diacritics that is combined with the input character sequence to form the diacritized text.
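A minimal sketch of this recombination step, assuming a mapping from the model names of Table 1 to the corresponding Unicode combining marks, is given below; the mapping and function are illustrative and not the system's published code.

# Recombining the decoded diacritic stream with the input characters (sketch).
MODEL_TO_MARKS = {
    "Mfa": "\u064E",          # Fatha
    "Msf": "\u0651\u064E",    # Shadda + Fatha
    "Mtf": "\u064B",          # Tanween Fatha
    "Mda": "\u064F",          # Damma
    "Mka": "\u0650",          # Kasra
    "Msk": "\u0651\u0650",    # Shadda + Kasra
    "Mso": "\u0652",          # Sukun
    "Non": "",                # no diacritic
    # remaining models of Table 1 omitted here for brevity
}

def diacritize(characters, model_stream):
    """Couple the input character sequence with the decoded diacritic stream."""
    assert len(characters) == len(model_stream)
    return "".join(c + MODEL_TO_MARKS.get(m, "") for c, m in zip(characters, model_stream))

# e.g. diacritize(list("كلية"), ["Mda", "Msk", "Msf", "Non"]) yields one
# vocalization ("college") of the word in Figure 4 (vocalization assumed).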
3. HMM Tool Kit
The Hidden Markov Model Toolkit (HTK) [5] is a portable toolkit for building and manipulating hidden Markov models. It is primarily designed for building HMM-based speech recognition systems. HTK was originally developed at the Speech Vision and Robotics Group of the Cambridge University Engineering Department (CUED).
Much of the functionality of HTK is built into the library
A HMM-Based System To Diacr itize Arabi c Text
Copyright © 2012 S ciRes. JSEA
126
modules, which are available in C source code. These modules are designed to run with a traditional command-line interface, so it is simple to write scripts to control the execution of the HTK tools. The HTK tools are categorized into four phases: data preparation, training, testing and result analysis.
The data preparation tools are designed to obtain speech data from databases or CD-ROM, or to record speech manually. These tools parameterize the speech data and generate the associated labels. For the current work, the task of those tools is performed prior to HTK, as previously explained, and the result is then converted to the HTK data format.
HTK allows HMMs to be built with any desired topology using simple text files. The training tools adjust the HMM parameters using the prepared training data, representing text lines, coupled with the data transcription. These tools apply the Baum-Welch re-estimation procedure [6] to maximize the likelihood of the training data given the model.
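For readers unfamiliar with the procedure, the toy function below performs one Baum-Welch re-estimation step for a single discrete-observation HMM. It is only a didactic sketch of the underlying mathematics; HTK's HERest carries out the equivalent re-estimation over the whole model set with embedded training, scaling and pruning.

import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One Baum-Welch re-estimation step for a discrete-observation HMM.

    obs : sequence of observation symbol ids
    A   : (S, S) transition matrix, B : (S, V) emission matrix, pi : (S,)
    Returns re-estimated (A, B, pi).
    """
    obs = np.asarray(obs)
    T, S = len(obs), A.shape[0]
    # Forward pass with per-step scaling to avoid numerical underflow
    alpha = np.zeros((T, S))
    c = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    # Backward pass using the same scaling factors
    beta = np.zeros((T, S))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
    # State occupation and transition posteriors
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((S, S))
    for t in range(T - 1):
        x = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi += x / x.sum()
    # Re-estimation formulas
    new_pi = gamma[0]
    new_A = xi / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi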
HTK provides a recognition tool to decode the sequence of observations and output the associated state sequence. The recognition tool requires a network describing the transition probabilities from one model to another. A dictionary and a language model can be supplied to the tool to help the recognizer output the correct state sequence.
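The core computation inside such a decoder is the Viterbi algorithm. The sketch below decodes a single discrete-observation HMM in log space and is meant only to illustrate what the recognizer computes; HTK's HVite applies the same principle with token passing over the full model network, dictionary and language model.

import numpy as np

def viterbi(obs, log_A, log_B, log_pi):
    """Most likely state sequence for a discrete-observation HMM (log domain).

    obs    : sequence of observation symbol ids
    log_A  : (S, S) log transition probabilities
    log_B  : (S, V) log emission probabilities
    log_pi : (S,)   log initial-state probabilities
    """
    S, T = log_A.shape[0], len(obs)
    delta = np.empty((T, S))                      # best log score ending in each state
    psi = np.zeros((T, S), dtype=int)             # back-pointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A    # scores[i, j]: come from i, land in j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    states = [int(delta[-1].argmax())]            # best final state
    for t in range(T - 1, 0, -1):                 # trace the back-pointers
        states.append(int(psi[t][states[-1]]))
    return states[::-1]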
The result analysis tool evaluates the performance of the recognition system by matching the recognizer output against the original reference transcription. This comparison uses dynamic programming to align the two transcriptions, the output and the ground truth, and then counts the number of substitutions (S) and deletions (D).
The optimal string match calculates a score for matching the output sample against the reference line. Identical labels match with a score of 0, a substitution carries a score of 10 and a label deletion carries a score of 7. The optimal string match is the label alignment with the lowest possible score. Once the optimal alignment has been found, the correction rate (CR) is computed as in Eq. (1) below.
Figure 5. The HTK-based system.
Figure 6. Optimal string matching: the upper line shows the reference text while the lower line is an output sample.
Table 2. CR using two different training sets.

                                            10000 sentences   20000 sentences
Test set similar to the training set            72.76%            72.80%
Test set different from the training set        72.50%            72.67%
CR = (H / N) × 100%    (1)

where N is the total number of labels in the reference transcription and H is the number of correct (hit) labels:

H = N − S − D    (2)

Accordingly,

CR = ((N − D − S) / N) × 100%    (3)
Figure 6 illustrates an Arabic reference line and a possible system output. The sentence includes 25 labels, and the system output contains one substitution and one deletion, so the correction rate equals 92%.
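The scoring can be reproduced with a short dynamic-programming routine, sketched below under the costs stated above (match 0, substitution 10, deletion 7). The insertion cost of 7 is an assumption, since only substitution and deletion scores are listed here, and N is taken as the number of reference labels so that the Figure 6 example yields 92%.

def optimal_match(ref, out, sub_cost=10, del_cost=7, ins_cost=7):
    """Optimal string match by dynamic programming.

    Returns (score, n_sub, n_del, n_ins) for the lowest-scoring alignment of
    the output label sequence against the reference label sequence.
    """
    R, O = len(ref), len(out)
    dp = [[None] * (O + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0)                      # (score, subs, dels, ins)
    for i in range(R + 1):
        for j in range(O + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0 and j > 0:
                sc, ns, nd, ni = dp[i - 1][j - 1]
                if ref[i - 1] == out[j - 1]:
                    cands.append((sc, ns, nd, ni))                  # identical labels: score 0
                else:
                    cands.append((sc + sub_cost, ns + 1, nd, ni))   # substitution: score 10
            if i > 0:
                sc, ns, nd, ni = dp[i - 1][j]
                cands.append((sc + del_cost, ns, nd + 1, ni))       # deletion: score 7
            if j > 0:
                sc, ns, nd, ni = dp[i][j - 1]
                cands.append((sc + ins_cost, ns, nd, ni + 1))       # insertion (cost assumed)
            dp[i][j] = min(cands)                                   # lowest score wins
    return dp[R][O]

def correction_rate(ref, out):
    """CR = (N - D - S) / N x 100 with N the number of reference labels (Eq. 3)."""
    _, n_sub, n_del, _ = optimal_match(ref, out)
    return (len(ref) - n_del - n_sub) / len(ref) * 100.0

# Worked example in the spirit of Figure 6: 25 reference labels,
# one of them substituted and one deleted in the output.
ref = ["Mfa"] * 23 + ["Mka", "Mso"]
out = ["Mfa"] * 23 + ["Mda"]
print(correction_rate(ref, out))    # 92.0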
4. Recognition Results
To assess the performance of the proposed system, we built a data corpus which includes more than 24000 Arabic sentences with more than 200000 words. The corpus contains Arabic text covering different subjects in order to avoid biasing the likelihood probabilities of the language model. The global model consists of the 15 models stated in Table 1.
The first set of experiments was performed with a training data set of 10000 sentences. The test data set was either similar to or different from the training data set, and ranged from 3000 to 9000 sentences.
In both cases CR ranged between 72.29% and 72.79%. The same experiments were performed again, this time with 20000 sentences as the training data set and test data sets ranging between 4000 and 16000 sentences. CR scored almost the same result, with an average value of 72.83%. Table 2 summarizes the average correction rates of those experiments. This indicates that 10000 sentences are quite enough to train the global model.
5. Acknowledgement
This research project is completely funded by King Abdulaziz City for Science & Technology.
REFERENCES
[1] RDI, “ArabDiac,” http://www.rdi-eg.com/rdi/Research.
[2] SAKHR, http://www.sakhr.com/.
[3] CIMOS, http://www.cimos.com/.
[4] A. Farghaly and J. Senellart, “Intuitive Coding of the Arabic Lexicon,” Louisiana, USA, 23 September 2003.
[5] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, V. Valtchev and P. Woodland, “The HTK Book,” Cambridge University Engineering Department, 2001.
[6] L. Rabiner and B. Juang, “Fundamentals of Speech Recognition,” Prentice Hall, 1993.