Journal of Software Engineering and Applications, 2012, 5, 124-127
doi:10.4236/jsea.2012.512b024 Published Online December 2012 (http://www.scirp.org/journal/jsea)
A HMM-Based System To Diacritize Arabic Text
M S Khorsheed
National Center for Robotics & Intelligent Systems, King Abdulaziz City for Science & Technology (KACST), POB 6086, Riyadh,
Saudi Arabia.
Email: mkhorshd@kacst.edu.sa
Received 2012.
ABSTRACT
The Arabic language belongs to the family of Semitic languages, whose sentence structure is entirely different from the point of view of Natural Language Processing. In such languages, two different words may have identical spellings while their pronunciations and meanings are totally different. To remove this ambiguity, special marks are placed above or below the spelling characters to determine the correct pronunciation. These marks are called diacritics, and a language that uses them is called a diacritized language. This paper presents a system for Arabic language diacritization using Hidden Markov Models (HMMs). The system employs the well-known HMM Tool Kit (HTK). Each single diacritic is represented as a separate model. The concatenation of output models is coupled with the input character sequence to form the fully diacritized text. The performance of the proposed system is assessed using a data corpus that includes more than 24000 sentences.
Keywords: Arabic; Hidden Markov Models; Text-to-speech; Diacritization
1. Introduction
The pronunciation of a word in some languages, such as English, is almost always fully determined by its constituent characters. In these languages, the sequence of consonants and vowels determines the correct corresponding sound when a word is pronounced. Such languages are called non-diacritized languages. On the other hand, there are languages, such as Arabic, where the pronunciation of a word cannot be fully determined by its spelling characters alone.
The Arabic alphabet consists of 28 basic letters, which are composed of strokes and dots. Ten of them have one dot, three have two dots and two have three dots. Dots can appear above, below or in the middle of the letter, as shown in Figure 1. The shape of a letter is context sensitive, depending on its location within a word. A letter can have up to four different shapes: isolated (Is); beginning (In), connected from the left; middle (M), connected from the left and right; and end (E), connected from the right, as shown in Figure 2.
Arabic characters may carry diacritics, which are written as strokes and can change the pronunciation and the meaning of a word. Diacritics may appear as strokes above the character, as in Fatha, Dhamma, Sukun, Shadda or Maddah, or below the character, as in Kasra. Tanween is a form of diacritizing Arabic script with a double Fatha, double Dhamma or double Kasra; it occurs only at the end of a word. Figure 3 shows the various diacritics which may appear with Arabic characters.
Diacritics perform an essential function in pronouncing a given word. To illustrate this with an example, consider the word shown in Figure 4, which can be pronounced as either (1) college or (2) kidney. Thus, an undiacritized Arabic word may have a large number of possible pronunciations, whereas for a diacritized Arabic word pronunciation is usually a one-to-one table matching process. Despite this importance, text is often left undiacritized, and readers of Arabic are accustomed to inferring the meaning from the context. When it comes to computer processing, however, the computer still needs to be provided with algorithms that mimic the human ability to identify the proper vocalization of the text.
Figure 1. Arabic character samples showing dots above, below and in the middle of a character.
Figure 2. A sample character in four different shapes.
Such a tool is an essential piece of infrastructure for applications such as text-to-speech, transliteration and automatic translation. The problem of automatically generating Arabic diacritic marks is known in the literature under various names, though in this report it is referred to as diacritization.
RDI of Egypt [1] has developed a system called ArabDiac 2.0 based on a proprietary morphological analyzer as well as syntactic rules and statistical information. The system is claimed to achieve 95% accuracy at the word level; however, the details of their technical approach have not been published. Among other systems are Al-Alamia of Kuwait [2] and CIMOS of France [3]. The formal approach to the problem of restoring the diacritical marks of Arabic text involves a complex integration of Arabic morphological, syntactic and semantic rules [4]. A morphological rule matches the undiacritized word to known patterns or templates and recognizes prefixes and suffixes. Syntax applies specific syntactic rules, implemented as finite state automata, to determine the final diacritical marks. Semantics help to resolve ambiguous cases and to filter out hypotheses.
2. System Overview
Figure 5 shows the block diagram of the proposed system. The global model is a network of interconnected small models, each of which represents a diacritic. Table 1 lists the fourteen possible diacritics plus the no-diacritic sign.
Figure 3. Various diacritics for Arabic characters.
Figure 4. A sample word with two pronunciations.
Table 1. A list of possible diacritics and the corresponding model names.

Diacritic                  Model name
Fatha                      Mfa
Shadda + Fatha             Msf
Tanween Fatha              Mtf
Shadda + Tanween Fatha     Stf
Damma                      Mda
Shadda + Damma             Msd
Tanween Damma              Mtd
Shadda + Tanween Damma     Std
Kasra                      Mka
Shadda + Kasra             Msk
Tanween Kasra              Mtk
Shadda + Tanween Kasra     Stk
Madda                      Mmd
Sukun                      Mso
No diacritic               Non
The system may be divided into two stages. The first stage is performed prior to HTK and includes feature extraction. The objective is to transform the Arabic text into a sequence of discrete symbols. Each character is coded according to its location within the word: start, middle or end. The feature space includes 110 features: 84 features for the 28 basic characters (one per character and position), 24 features for the additional characters mentioned earlier, one feature for the space character and one more to represent the start/end of a sentence.
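To make the coding concrete, the sketch below shows one way this step could be realized in Python. The exact symbol table is not published here, so the basic-character list, the index layout and the sentence-boundary marker are assumptions chosen only to reproduce the 110-symbol feature space described above.

# Hypothetical character-to-symbol coding for stage one (illustrative only).
BASIC = list("ابتثجحخدذرزسشصضطظعغفقكلمنهوي")   # the 28 basic Arabic letters
POSITIONS = ("start", "middle", "end")

def code_symbol(char, position):
    """Map a (character, position-in-word) pair to a discrete symbol id.

    28 basic letters x 3 positions give symbols 0-83; ids 84-107 are assumed
    to be reserved for the additional character forms, 108 for the space
    character and 109 for the start/end-of-sentence marker (110 in total).
    """
    if char == " ":
        return 108
    if char == "<s>":                        # sentence boundary marker (assumed)
        return 109
    if char in BASIC:
        return BASIC.index(char) * 3 + POSITIONS.index(position)
    raise ValueError("additional character forms would map to ids 84-107")

def code_sentence(words):
    """Turn a list of undiacritized words into a sequence of symbol ids."""
    symbols = [code_symbol("<s>", "start")]
    for word in words:
        for i, ch in enumerate(word):
            pos = "start" if i == 0 else ("end" if i == len(word) - 1 else "middle")
            symbols.append(code_symbol(ch, pos))
        symbols.append(code_symbol(" ", "middle"))
    symbols[-1] = code_symbol("<s>", "start")  # close the sentence instead of a trailing space
    return symbols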
Stage two is performed within HTK. It couples the extracted features with the corresponding ground truth to estimate the parameters of the diacritic models. The final output of this stage is a lexicon-free system for diacritizing Arabic text. During recognition, an input pattern of discrete symbols representing the text line is injected into the global model, which outputs a stream of diacritics that is combined with the input character sequence to form the diacritized text.
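A minimal sketch of this recombination step, assuming a mapping from the model names of Table 1 to the corresponding Unicode combining marks, is given below; the mapping and function are illustrative and not the system's published code.

# Recombining the decoded diacritic stream with the input characters (sketch).
MODEL_TO_MARKS = {
    "Mfa": "\u064E",          # Fatha
    "Msf": "\u0651\u064E",    # Shadda + Fatha
    "Mtf": "\u064B",          # Tanween Fatha
    "Mda": "\u064F",          # Damma
    "Mka": "\u0650",          # Kasra
    "Msk": "\u0651\u0650",    # Shadda + Kasra
    "Mso": "\u0652",          # Sukun
    "Non": "",                # no diacritic
    # remaining models of Table 1 omitted here for brevity
}

def diacritize(characters, model_stream):
    """Couple the input character sequence with the decoded diacritic stream."""
    assert len(characters) == len(model_stream)
    return "".join(c + MODEL_TO_MARKS.get(m, "") for c, m in zip(characters, model_stream))

# e.g. diacritize(list("كلية"), ["Mda", "Msk", "Msf", "Non"]) yields one
# vocalization ("college") of the word in Figure 4 (vocalization assumed).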
3. HMM Tool Kit
The Hidden Markov Model Toolkit (HTK) [5] is a portable toolkit for building and manipulating hidden Markov models. It is primarily designed for building HMM-based speech recognition systems. HTK was originally developed at the Speech Vision and Robotics Group of the Cambridge University Engineering Department (CUED).
Much of the functionality of HTK is built into the library
A HMM-Based System To Diacr itize Arabi c Text
Copyright © 2012 S ciRes. JSEA
126
modules, which are available in C source code. These modules are designed to run with a traditional command-line interface, so it is simple to write scripts to control the execution of the HTK tools. The HTK tools are categorized into four phases: data preparation, training, testing and result analysis.
The data preparation tools are designed to obtain speech data from databases or CD-ROM, or to record speech manually. These tools parameterize the speech data and generate the associated labels. For the current work, the task of those tools is performed prior to HTK, as previously explained, and the result is then converted to the HTK data format.
HTK allows HMMs to be built with any desired topology using simple text files. The training tools adjust the HMM parameters using the prepared training data, representing text lines, coupled with the data transcription. These tools apply the Baum-Welch re-estimation procedure [6] to maximize the likelihood of the training data given the model.
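For readers unfamiliar with the procedure, the toy function below performs one Baum-Welch re-estimation step for a single discrete-observation HMM. It is only a didactic sketch of the underlying mathematics; HTK's HERest carries out the equivalent re-estimation over the whole model set with embedded training, scaling and pruning.

import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One Baum-Welch re-estimation step for a discrete-observation HMM.

    obs : sequence of observation symbol ids
    A   : (S, S) transition matrix, B : (S, V) emission matrix, pi : (S,)
    Returns re-estimated (A, B, pi).
    """
    obs = np.asarray(obs)
    T, S = len(obs), A.shape[0]
    # Forward pass with per-step scaling to avoid numerical underflow
    alpha = np.zeros((T, S))
    c = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    # Backward pass using the same scaling factors
    beta = np.zeros((T, S))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
    # State occupation and transition posteriors
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((S, S))
    for t in range(T - 1):
        x = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi += x / x.sum()
    # Re-estimation formulas
    new_pi = gamma[0]
    new_A = xi / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi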
HTK provides a recognition tool to decode the sequence of observations and output the associated state sequence. The recognition tool requires a network describing the transition probabilities from one model to another. A dictionary and a language model can be supplied to the tool to help the recognizer output the correct state sequence.
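The core computation inside such a decoder is the Viterbi algorithm. The sketch below decodes a single discrete-observation HMM in log space and is meant only to illustrate what the recognizer computes; HTK's HVite applies the same principle with token passing over the full model network, dictionary and language model.

import numpy as np

def viterbi(obs, log_A, log_B, log_pi):
    """Most likely state sequence for a discrete-observation HMM (log domain).

    obs    : sequence of observation symbol ids
    log_A  : (S, S) log transition probabilities
    log_B  : (S, V) log emission probabilities
    log_pi : (S,)   log initial-state probabilities
    """
    S, T = log_A.shape[0], len(obs)
    delta = np.empty((T, S))                      # best log score ending in each state
    psi = np.zeros((T, S), dtype=int)             # back-pointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A    # scores[i, j]: come from i, land in j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    states = [int(delta[-1].argmax())]            # best final state
    for t in range(T - 1, 0, -1):                 # trace the back-pointers
        states.append(int(psi[t][states[-1]]))
    return states[::-1]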
The result analysis tool evaluates the performance of the recognition system by matching the recognizer output against the original reference transcription. This comparison uses dynamic programming to align the two transcriptions, the output and the ground truth, and then counts the number of substitutions (S) and deletions (D).
The optimal string match calculates a score for matching the output sample against the reference line. Identical labels match with a score of 0, a substitution carries a score of 10 and a label deletion carries a score of 7. The optimal string match is the label alignment with the lowest possible score. Once the optimal alignment has been found, the correction rate (CR) is computed as in Eq. (1) below.
Figure 5. The HTK-based system.
Figure 6. Optimal string matching: the upper line shows the reference text while the lower line is an output sample.
Table 2. CR using two different training sets.

                                            10000 sentences   20000 sentences
Test set similar to the training set            72.76%            72.80%
Test set different from the training set        72.50%            72.67%
CR = (H / N) × 100%    (1)

where N is the total number of labels in the reference transcription and H is the number of correct (hit) labels:

H = N − S − D    (2)

Accordingly,

CR = ((N − D − S) / N) × 100%    (3)
Figure 6 illustrates an Arabic reference line and a possible system output. The sentence includes 25 labels, and the system output contains one substitution and one deletion, so the correction rate equals 92%.
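The scoring can be reproduced with a short dynamic-programming routine, sketched below under the costs stated above (match 0, substitution 10, deletion 7). The insertion cost of 7 is an assumption, since only substitution and deletion scores are listed here, and N is taken as the number of reference labels so that the Figure 6 example yields 92%.

def optimal_match(ref, out, sub_cost=10, del_cost=7, ins_cost=7):
    """Optimal string match by dynamic programming.

    Returns (score, n_sub, n_del, n_ins) for the lowest-scoring alignment of
    the output label sequence against the reference label sequence.
    """
    R, O = len(ref), len(out)
    dp = [[None] * (O + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0)                      # (score, subs, dels, ins)
    for i in range(R + 1):
        for j in range(O + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0 and j > 0:
                sc, ns, nd, ni = dp[i - 1][j - 1]
                if ref[i - 1] == out[j - 1]:
                    cands.append((sc, ns, nd, ni))                  # identical labels: score 0
                else:
                    cands.append((sc + sub_cost, ns + 1, nd, ni))   # substitution: score 10
            if i > 0:
                sc, ns, nd, ni = dp[i - 1][j]
                cands.append((sc + del_cost, ns, nd + 1, ni))       # deletion: score 7
            if j > 0:
                sc, ns, nd, ni = dp[i][j - 1]
                cands.append((sc + ins_cost, ns, nd, ni + 1))       # insertion (cost assumed)
            dp[i][j] = min(cands)                                   # lowest score wins
    return dp[R][O]

def correction_rate(ref, out):
    """CR = (N - D - S) / N x 100 with N the number of reference labels (Eq. 3)."""
    _, n_sub, n_del, _ = optimal_match(ref, out)
    return (len(ref) - n_del - n_sub) / len(ref) * 100.0

# Worked example in the spirit of Figure 6: 25 reference labels,
# one of them substituted and one deleted in the output.
ref = ["Mfa"] * 23 + ["Mka", "Mso"]
out = ["Mfa"] * 23 + ["Mda"]
print(correction_rate(ref, out))    # 92.0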
4. Recognition Results
To assess the performance of the proposed system, we built a data corpus which includes more than 24000 Arabic sentences with more than 200000 words. The corpus contains Arabic text covering different subjects in order to avoid biasing the likelihood probabilities of the language model. The global model consists of the 15 models stated in Table 1.
The first set of experiments was performed with a training data set of 10000 sentences. The test data set was either similar to or different from the training data set, and ranged from 3000 to 9000 sentences.
In both cases CR ranged between 72.29% and 72.79%. The same experiments were performed again, this time with 20000 sentences as the training data set and test data sets ranging between 4000 and 16000 sentences. CR scored almost the same result, with an average value of 72.83%. Table 2 summarizes the average correction rates of those experiments. This indicates that 10000 sentences are quite enough to train the global model.
5. Acknowledgement
This research project is completely funded by King Abdulaziz City for Science & Technology.
REFERENCES
[1] RDI, “ArabDiac,” http://www.rdi-eg.com/rdi/Research.
[2] SAKHR, http://www.sakhr.com/.
[3] CIMOS, http://www.cimos.com/.
[4] A. Farghaly and J. Senellart, “Intuitive Coding of the Arabic Lexicon,” Louisiana, USA, 23 September 2003.
[5] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, V. Valtchev and P. Woodland, “The HTK Book,” Cambridge University Engineering Department, 2001.
[6] L. Rabiner and B. Juang, “Fundamentals of Speech Recognition,” Prentice Hall, 1993.