
The Enhancement of Arabic Stemming by Using Light Stemming and Dictionary-Based Stemming
Copyright © 2011 SciRes. JSEA
in the corpus, with accuracy ranging from 87.11% in the
fifth document to 90.22% in the second document. In the
evaluation of stemmers, the accuracy of a stemmer is
affected by the following factors:
1) The type of approach: different stemming approaches
yield different precision values on the same data.
2) The corpus: the size and composition of the corpus
used for evaluation play an important role in increasing
or decreasing the precision values of the stemmers.
3) The pre-processing: this includes linguistic tools
such as tokenization, identification of Arabic
stop-words, named entity recognition, and handling of
Arabic multi-word expressions. These tools are used to
reduce the ambiguity of words in order to increase the
accuracy and effectiveness of the stemmer.
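The per-document and average accuracy figures discussed here can be illustrated with a short sketch. It assumes that accuracy is the percentage of words whose predicted stem matches a manually assigned gold stem, and that the corpus-level figure is the unweighted mean over documents. Both are assumptions, since the exact formula is not given in this section, and the function names are hypothetical.

```python
def accuracy(predicted, gold):
    """Percentage of words whose predicted stem equals the gold stem.

    Assumption: exact string match against a hand-annotated gold standard.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty document")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


def average_accuracy(documents):
    """Unweighted mean of per-document accuracies.

    `documents` is a list of (predicted_stems, gold_stems) pairs,
    one pair per document. Assumption: every document counts equally,
    regardless of its length.
    """
    scores = [accuracy(predicted, gold) for predicted, gold in documents]
    return sum(scores) / len(scores)
```

Under these assumptions, a change in corpus composition (factor 2) alters the gold lists, and a change in pre-processing (factor 3) alters which tokens reach the stemmer, so both shift the resulting score.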
7. Conclusions
In this study, we have presented an enhanced stemmer for
extracting the stem and root of Arabic words. It was
designed to overcome the disadvantages of light stemming
and dictionary-based stemming. Broken (irregular) plurals
of nouns and irregular verbs, which the light stemmer
cannot handle, are resolved by the dictionary-based
stemmer. Conversely, words that the dictionary-based
stemmer cannot stem because they are not found in the
lexicon of Arabic stems are handled by the light stemmer.
To evaluate the enhanced stemmer, we applied our method
to an in-house corpus collected from Arabic newspaper
archives. In our experiment, the average accuracy of the
enhanced stemmer on the corpus is 96.29%. Its accuracy is
higher in every document in the corpus than that of the
light stemmer (85.5%) and the dictionary-based stemmer
(88.63%). The accuracy of a stemmer depends on many
factors, including the type of stemming approach, the
size and composition of the corpus, and pre-processing
(such as tokenization, identification of Arabic
stop-words, named entity recognition, and handling of
Arabic multi-word expressions). The enhanced stemming
method demonstrated here for extracting the root and stem
of Arabic words can be straightforwardly extended to
identify the linguistic category of the word.
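The complementary roles of the two stemmers described above can be sketched as a lexicon lookup with a light-stemming fallback. This is a minimal illustration and not the implementation evaluated in the paper. The order of the two steps, the toy lexicon, the affix lists, and the minimum stem length are all assumptions made for the example.

```python
# Hypothetical toy lexicon: surface form -> stem. A real system would use
# a full lexicon of Arabic stems.
LEXICON = {
    "كتب": "كتاب",  # broken plural "books" -> singular "book"
    "رجال": "رجل",  # broken plural "men" -> singular "man"
}

# Illustrative affix lists, ordered longest first (assumed, not from the paper).
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")
SUFFIXES = ("ات", "ون", "ين", "ان", "ها", "ية", "ة", "ه", "ي")

MIN_STEM_LENGTH = 3  # assumed guard against over-stripping short words


def light_stem(word):
    """Strip at most one prefix and one suffix from the affix lists."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LENGTH:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            word = word[:-len(suffix)]
            break
    return word


def enhanced_stem(word, lexicon=LEXICON):
    """Use the lexicon when the word is listed; otherwise fall back to light stemming."""
    if word in lexicon:
        return lexicon[word]
    return light_stem(word)
```

With this toy data, `enhanced_stem("كتب")` returns "كتاب" through the lexicon, which affix stripping alone could not produce. The word "المعلمون" is absent from the lexicon, so it falls through to the light stemmer and becomes "معلم".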