small models. Each one of those models represents a
diacritic. Table 1 lists the fourteen possible diacritics
in addition to the no-diacritic sign.
Figure 3. Various diacritics for Arabic characters.
Figure 4. A sample word with two pronunciations.
Table 1. A list of possible diacritics and the corresponding
model names.

Diacritic                 Model name
Fatha                     Mfa
Shadda+Fatha              Msf
Tanween Fatha             Mtf
Shadda+Tanween Fatha      Stf
Damma                     Mda
Shadda+Damma              Msd
Tanween Damma             Mtd
Shadda+Tanween Damma      Std
Kasra                     Mka
Shadda+Kasra              Msk
Tanween Kasra             Mtk
Shadda+Tanween Kasra      Stk
Madda                     Mmd
Sukun                     Mso
No diacritic              Non
The system may be divided into two stages. The first
stage is performed prior to HTK and consists of feature
extraction. The objective is to convert the Arabic text
into a sequence of discrete symbols. Each character is
coded according to its location within the word: start,
middle or end. The feature space includes 110 features:
84 features for the 28 basic characters (three positions
each), 24 features for the additional characters mentioned
earlier, one feature for the space character and one feature
to represent the start/end of a sentence.
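The position-dependent coding can be sketched as follows. This is only an illustration under stated assumptions: the symbol numbering, the names `encode_word` and `encode_sentence`, and the treatment of a one-letter word as a "start" position are all hypothetical, since the text fixes only the feature counts. The 24 features for the additional characters are reserved but not modelled.

```python
# Hypothetical symbol numbering: the paper fixes the counts, not the ids.
BASIC_LETTERS = [chr(c) for c in
                 [0x0627, 0x0628, *range(0x062A, 0x063B),
                  *range(0x0641, 0x0649), 0x064A]]      # 28 basic letters
START, MIDDLE, END = 0, 1, 2

NUM_BASIC_FEATURES = 3 * len(BASIC_LETTERS)             # 28 x 3 = 84
NUM_EXTRA_FEATURES = 24                                 # reserved, not modelled here
SPACE_ID = NUM_BASIC_FEATURES + NUM_EXTRA_FEATURES      # 108
BOUNDARY_ID = SPACE_ID + 1                              # 109
NUM_FEATURES = BOUNDARY_ID + 1                          # 110


def encode_word(word):
    """Map each basic letter to a symbol id that depends on its position."""
    ids = []
    for i, ch in enumerate(word):
        if ch not in BASIC_LETTERS:
            raise ValueError(f"character {ch!r} is outside this sketch")
        if i == 0:
            pos = START          # assumption: a one-letter word counts as start
        elif i == len(word) - 1:
            pos = END
        else:
            pos = MIDDLE
        ids.append(3 * BASIC_LETTERS.index(ch) + pos)
    return ids


def encode_sentence(sentence):
    """Wrap the coded words with boundary symbols and code the spaces."""
    ids = [BOUNDARY_ID]
    for k, word in enumerate(sentence.split(" ")):
        if k > 0:
            ids.append(SPACE_ID)
        ids.extend(encode_word(word))
    ids.append(BOUNDARY_ID)
    return ids
```

The constants reproduce the arithmetic in the text: 84 + 24 + 1 + 1 = 110 features.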
Stage two is performed within HTK. It couples the extracted
features with the corresponding ground truth to
estimate the diacritic model parameters. The final output
of this stage is a lexicon-free system to diacritize Arabic
text. During recognition, an input pattern of discrete
symbols representing the text line is fed into the global
model, which outputs a stream of diacritics that is
combined with the input character sequence to form the
diacritized text.
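The final combination step can be illustrated with a short sketch. The model names come from Table 1. The mapping to Unicode combining marks, the function name `diacritize`, and the assumption of exactly one output label per input character are illustrative choices, not details given in the text.

```python
# Model names from Table 1, mapped to Unicode combining marks.
# The mapping is an assumption made for illustration.
SHADDA = "\u0651"
DIACRITIC_MARKS = {
    "Mfa": "\u064E",            # Fatha
    "Msf": SHADDA + "\u064E",   # Shadda + Fatha
    "Mtf": "\u064B",            # Tanween Fatha
    "Stf": SHADDA + "\u064B",   # Shadda + Tanween Fatha
    "Mda": "\u064F",            # Damma
    "Msd": SHADDA + "\u064F",   # Shadda + Damma
    "Mtd": "\u064C",            # Tanween Damma
    "Std": SHADDA + "\u064C",   # Shadda + Tanween Damma
    "Mka": "\u0650",            # Kasra
    "Msk": SHADDA + "\u0650",   # Shadda + Kasra
    "Mtk": "\u064D",            # Tanween Kasra
    "Stk": SHADDA + "\u064D",   # Shadda + Tanween Kasra
    "Mmd": "\u0653",            # Madda
    "Mso": "\u0652",            # Sukun
    "Non": "",                  # no diacritic
}


def diacritize(chars, labels):
    """Attach one recognized diacritic label to each input character."""
    if len(chars) != len(labels):
        raise ValueError("expected exactly one label per character")
    return "".join(ch + DIACRITIC_MARKS[lab] for ch, lab in zip(chars, labels))
```

For example, `diacritize("\u0643\u062A\u0628", ["Mfa", "Mfa", "Mfa"])` places a Fatha after each of the three letters.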
3. HMM Tool Kit
The hidden Markov model toolkit (HTK) [5] is a portable
toolkit for building and manipulating hidden Markov
models. It is primarily designed for building
HMM-based speech recognition systems. HTK was
originally developed at the Speech, Vision and Robotics
Group of the Cambridge University Engineering
Department (CUED).
A HMM-Based System To Diacritize Arabic Text
Copyright © 2012 SciRes. JSEA
Much of the functionality of HTK is built into the library
modules, which are available as C source code. These
modules are designed to run with a traditional
command-line interface, so it is simple to write scripts that
control the execution of the HTK tools. The HTK tools are
grouped into four phases: data preparation, training,
testing and result analysis.
The data preparation tools are designed to obtain the
speech data from databases or CD-ROM, or to record the
speech manually. These tools parameterize the speech
data and generate the associated speech labels. In the
current work, the task of those tools is performed prior to
HTK, as previously explained, and the result is then
converted to the HTK data format.
HTK allows HMMs to be built with any desired
topology using simple text files. The training tools adjust
the HMM parameters using the prepared training data,
representing text lines, coupled with the data
transcription. These tools apply the Baum-Welch re-estimation
procedure [6] to maximize the likelihood of
the training data given the model.
HTK provides a recognition tool to decode the
sequence of observations and output the associated state
sequence. The recognition tool requires a network that
describes the transition probabilities from one model to
another. A dictionary and a language model can be supplied
to the tool to help the recognizer output the correct
state sequence.
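Decoding an observation sequence into its most likely state sequence is commonly done with the Viterbi algorithm. The sketch below is a generic textbook version for a discrete HMM in the log domain, not HTK's own implementation. The function name, the dictionary-based model layout, and the assumption of a non-empty observation list are illustrative.

```python
def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence for a non-empty observation list.

    log_start[s], log_trans[p][s] and log_emit[s][o] hold log-probabilities.
    """
    # best[t][s]: best log score of any path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []  # back[t - 1][s]: best predecessor of state s at time t
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][o]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state to recover the whole path.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

In the setting described here, the observations would be the discrete character symbols and the decoded labels would correspond to the diacritic models. That correspondence is a simplification made for this sketch.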
The result analysis tool evaluates the performance of
the recognition system by matching the recognizer output
with the original reference transcription. This
comparison uses dynamic programming to
align the two transcriptions, the output and the ground
truth, and then counts the number of substitutions (S) and
deletions (D).
The optimal string match calculates a score for
matching the output sample against the reference
line. Identical labels match with a score of 0, a
substitution carries a score of 10 and a label deletion
carries a score of 7. The optimal string match is the label
alignment with the lowest possible score. Once the
optimal alignment has been found, the correction rate (CR) is:
$$ \mathrm{CR} = \frac{H}{N} \times 100\% \qquad (1) $$

where N is the total number of labels in the reference
transcription and H is the number of correct (hit) labels,

$$ H = N - D - S \qquad (2) $$

Accordingly,

$$ \mathrm{CR} = \frac{N - D - S}{N} \times 100\% \qquad (3) $$

Figure 5. The HTK based system.
Figure 6. Optimal string matching: the upper line shows the
reference text, whereas the lower line is an output sample.

Table 2. CR using two different training sets.

                          1000 sentences   2000 sentences
Test set Training set     72.76%           72.80%
Test set Training set     72.50%           72.67%
Figure 6 illustrates an Arabic reference line and a
possible system output. The sentence includes 25 labels,
and the system output contains one substitution and one
deletion, so the correction rate is
(25 − 1 − 1)/25 × 100% = 92%.
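The scoring procedure can be sketched as a small dynamic-programming alignment. The substitution and deletion scores (10 and 7) are taken from the text. The insertion score of 7 is an assumption added so that outputs longer than the reference can still be aligned, and the function names are hypothetical.

```python
def align_counts(reference, output, sub_cost=10, del_cost=7, ins_cost=7):
    """Count substitutions, deletions and insertions on a lowest-score alignment."""
    n, m = len(reference), len(output)
    # cost[i][j]: lowest score for aligning reference[:i] with output[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * del_cost
    for j in range(1, m + 1):
        cost[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if reference[i - 1] == output[j - 1] else sub_cost
            cost[i][j] = min(cost[i - 1][j - 1] + match,   # hit or substitution
                             cost[i - 1][j] + del_cost,    # deletion
                             cost[i][j - 1] + ins_cost)    # insertion (assumed score)
    # Walk back along one optimal path and count each error type.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            match = 0 if reference[i - 1] == output[j - 1] else sub_cost
            if cost[i][j] == cost[i - 1][j - 1] + match:
                if match:
                    subs += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + del_cost:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins


def correction_rate(reference, output):
    """CR = (N - D - S) / N * 100, with N the number of reference labels."""
    subs, dels, _ = align_counts(reference, output)
    n = len(reference)
    return 100.0 * (n - dels - subs) / n
```

With a 25-label reference and an output that contains one substituted and one missing label, `correction_rate` gives the 92% quoted for Figure 6.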
4. Recognition Results
To assess the performance of the proposed system, we
built a data corpus of more than 24000 Arabic
sentences containing more than 200000 words. The
corpus contains Arabic text that covers different subjects
to avoid biasing the likelihood probability of the
language model. The global model consists of the 15 models
listed in Table 1.
The first set of experiments was performed with a
training data set of 10000 sentences. The test data set
was either similar to or different from the training data
set, and its size ranged from 3000 to 9000 sentences. In both
cases CR ranged between 72.29% and 72.79%. The same
experiments were then repeated with
20000 sentences as the training data set and a test data set
of between 4000 and 16000 sentences. CR
was almost unchanged, with an average value of
72.83%. Table 2 summarizes the average correction
rates of those experiments. These results suggest that 10000
sentences are sufficient to train the global model.
5. Acknowledgement
This research project is fully funded by King
Abdulaziz City for Science & Technology.
REFERENCES
[1] RDI, “ArabDiac” http://www.rdi-eg.com/rdi/ Research.
[2] SAKHR, http://www.sakhr.com/.
[3] CIMOS, http://www.cimos.com/.
[4] A. Farghaly and J. Snellart, “Intuitive coding of the
Arabic lexicon,” Louisiana-USA, 23 September 2003.
[5] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell,
D. Ollason, V. Valtchev, and P. Woodland, “The HTK
Book,” Cambridge University Engineering Dept., 2001.
[6] L. Rabiner and B. Juang, “Fundamentals Of Speech
Recognition,” Prentice Hall, 1993.