Arabic, as one of the Semitic languages, has a very rich and complex morphology, which is radically different from that of the European and East Asian languages. The derivational system of Arabic is based on roots, which are inflected to compose words using a relatively large set of Arabic affixes, e.g., antefixes, prefixes, suffixes, etc. Stemming is the process of rendering all the inflected forms of a word into a common canonical form. Stemming is one of the early and major phases in natural language processing, machine translation and information retrieval tasks. A number of Arabic language stemmers have been proposed. Examples include light stemming, morphological analysis, statistical-based stemming, N-grams and parallel corpora (collections). Motivated by the results reported in the literature, this paper attempts to exhaustively review current achievements in stemming Arabic texts. A variety of algorithms are discussed. The main contribution of the paper is to provide a better understanding of existing approaches, with the hope of building an error-free and effective Arabic stemmer in the near future.
The major task of an Information Retrieval (IR) system is to match a searchable document representation (documents) against a user need, which is always expressed in terms of queries. The process of representing documents, in which keywords or terms are extracted, is called indexing. Indexing often goes through several operations, most of which are language-dependent. Among these operations, stemming stands as one of the major steps that every IR system must handle. Since documents and/or queries may contain several forms of a particular word, stemming maps all the inflected forms of that word into a common, shared canonical form, which is then the most appropriate form for both indexing and searching. In other words, stemming renders different inflected and variant forms of a certain word to a single word stem. In monolingual IR, stemming appears to have a positive impact on recall more than precision [
Over the last decades, Arabic has become one of the popular areas of research in IR, especially with the explosive growth of the language on the Web, which shows the need to develop good techniques for the increasing contents of the language. This increasing interest in Arabic, however, is caused by its complex morphology, which is radically different from the European and the East Asian languages [
A number of studies have been devoted to stemming for a wide range of languages, including Arabic. Different approaches were proposed. For Arabic stemming [
Among these techniques, two major approaches are the most dominant for Arabic stemming. These are light stemming (also known as affix removal stemming) and heavy stemming (morphological analysis stemming). Light stemming lightly chops off some affixes from words, much like removing plural endings in English, whereas heavy stemming performs heuristic and linguistic processes so as to extract the root of the word, the possible roots or the stem of the word. The stem in Arabic IR is the smallest form of the word without any prefixes and suffixes, whereas the root of the surface form is the basic unit, which often consists of three letters. Technically, root-based stemmers always attempt to analyze words and to produce their roots.
Other techniques, such as the use of corpus-based statistics and lexicons (to determine the most frequent affixes) and the employment of genetic algorithms and neural networks, have also been reported in the literature. Approaches like co-occurrence techniques for clustering words together and the use of parallel corpora have also been investigated.
However, in spite of the significant achievements and developments of these Arabic stemming techniques, each of the proposed approaches has some pros and cons, and it is still unclear which technique should be adopted for indexing and/or stemming Arabic texts.
This paper reviews current techniques for the Arabic stemming problem. It first provides a comprehensive examination of the features of Arabic that make the language challenging for Natural Language Processing (NLP) and Information Retrieval (IR). The paper also compares a considerable number of stemmers and explains how each of them works and produces the stem and/or root from Arabic text. The strengths and weaknesses of each technique are also provided.
The rest of this paper is organized as follows. Section two introduces the characteristics of the Arabic language which make it challenging for the Arabic IR task. Section three provides in-depth coverage of the existing approaches to Arabic stemming. Several studies are presented in this section. Section four discusses the current approaches and their limitations. Section five concludes the paper.
Arabic is one of the Semitic languages, which also include Hebrew, Aramaic and Amharic. It is the lingua franca of a large group of people. It is estimated that there are approximately four hundred million first-language speakers of Arabic [
Sentences in Arabic are delimited by periods, dashes and commas, while words are separated by white spaces and other punctuation marks. Arabic script is written from right to left, while Arabic numbers are written and read from left to right. The Arabic script consists of two types of symbols [
Arabic words are classified into three main parts of speech: nouns (including adjectives and adverbs), verbs and particles. Particles in Arabic are attached to verbs and nouns. Words in Arabic are either masculine or feminine. The feminine is often formed differently from the masculine, e.g., مُبرمج and مُبرمجـة (meaning: single masculine programmer and single feminine programmer, respectively). The same kind of marking also appears on both nouns and verbs in literary Arabic in order to indicate number (singular, dual “for describing two entities” and plural), as in مُبرمج, مُبرمجان and مُبرمجون (meaning: singular programmer, two programmers and more than two programmers, respectively).
Arabic has a complex morphology. Its derivational system is based on 10,000 independent roots [
Words and morphological variations are derived from roots using patterns. Grammatically, the main pattern, which corresponds to the tri-literal root, is the pattern فعَل (transliterated as f-à-l). More regular patterns, adhering to well-known morphological rules, can be derived from the main pattern فعل (f-à-l). Examples of some patterns are فَعَل، فِعَال and أَفَاعِيل, transliterated as f-à-l, f-i-à-l and a-f-à-i-l, respectively.
Different kinds of affixes can be added to the derived patterned words to construct a more complex structure. Definite articles―like ال (its counterpart is the definite “the”), conjunctions, particles and other prefixes can be affixed to the beginning of a word, whereas suffixes can be added to the end. For example, the word لنجْمَعنّهم (meaning: we will surely gather them) can be decomposed as follows: (antefix: ل, prefix: ن, root: جمع, suffix: ن and postfix: هم). For the purpose of understanding stemming, all Arabic affixes are listed in
Antefixes, whether they are separated or not, are usually prepositions added to the beginning of words before prefixes. Prefixes are attached to mark the present tense and imperative forms of verbs and usually consist of one, two or three letters. Suffixes are added to denote gender and number, for example in the dual feminine and plural masculine. Postfixes are used to indicate pronouns and to represent, for example, the absent person (third person). Usually this morphology is used to create verbal and nominal phrases.
Word |
---|
أخلاء |
أخلائه، أخلاؤه، أخلاءه، أخلائهم، أخلاءهم، أخلاؤهم، أخلائهم، أخلائهن، أخلائهما، أخلاؤهما، أخلاءنا، أخلائنا، أخلاؤنا، أخلائكم، أخلائك، أخلاءك، أخلاؤها، أخلاؤها، أخلائها،أخلائي، وأخلائي، الأخلاء، بالأخلاء، بأخلاء، بأخلائهم ... إلخ |
Antefixes | Prefixes | Suffixes | Postfixes |
---|---|---|---|
وبال، وال، بال، فال، كال، ولل، ال، وب، ول، لل، فس، فب، فل، وس، ك، ف، ب، ل | ا، ن، ي، ت | تا، وا، ين، ون، ان، ات، تان، تين، يون، تما، تم، و، ي، ا، ن، ت، نا، تن | ي، ه، ك، كم، هم، نا، ها، تي، هن، كن، هما، كما |
Prepositions meaning respectively: and with the, and the, with the, then the, as the, and to (for) the, the, and with, and to (for), then will, then with, then to (for), and will, as, then, and, with, to (for) | Letters meaning the conjugation person of verbs in the present tense | Terminations of conjugation for verbs and dual/plural/female/male marks for nouns | Pronouns meaning respectively: my, his, your, your, their, our, her, my, their, your, their, your |
Arabic Word | Pattern Transliterated | Meaning |
---|---|---|
حسب | f-à- l | Compute (a tri-literal root) |
يحسب | y- f-à- l | He computes |
حسبنا | f-à- l-n-a | We compute |
حسبن | f-à- l-n | They compute (plural feminine) |
يحسبون | y- f-à- l-o-n | They compute (plural masculine) |
حسبا | f-à- l-a | They compute (dual masculine) |
حاسوب | f-a-à-o- l | Computer (Machine name) |
حسّب | f-à- à- l | He computes (for intensifying verbs) |
فعل (f-à-l), according to some different patterns, in which some letters are added to the main pattern.
Affixes in Arabic may include also some clitics. Clitics, which have been used in the proposed stemmers and can be proclitics or enclitics according to their locations in words, are morphemes that have the syntactic characteristics of a word but are morphologically bound to other words [
Arabic also has three grammatical cases: nominative, accusative and genitive. For example, if the noun is a subject, then it will have the nominative case; if it is an object, the noun will be in the accusative case; and the noun will be in the genitive case if it is the object of a preposition. These grammatical cases cause Arabic to derive many words from a single noun (e.g., an adjective) because each case often results in a different form of the word. Note that adjectives in Arabic are nouns. For example, the different forms that can be derived from the adjective مزارع (meaning: farmer) according to their grammatical forms may include words like: مزارعة (for singular feminine in nominative, accusative and genitive cases), مزارعان (for dual masculine in nominative case), مزارعَين (for dual masculine in accusative and genitive cases), مزارعتان (for dual feminine in nominative case), مزارعتين (for dual feminine in accusative and genitive cases), مزارعون (for plural masculine in nominative case), مزارعِين (for plural masculine in accusative and genitive cases) and مزارعات (for plural feminine in nominative, accusative and genitive cases).
Morphology adds a level of ambiguity that makes exact keyword matching inadequate for retrieval. Morphological ambiguity can appear in several cases. For example, clitics may accidentally produce a form that is homographic or homonymous (the same word with two or more different meanings) with another full word [
Besides the complex morphology, Arabic also has a very complex type of plurals known as broken plurals. Broken plurals do not obey regular morphological rules. They are similar to cases like corpus and corpora, or mouse and mice, in English, but differ in that there is no rule-based morphological pattern for forming them. Broken plurals constitute 10% of Arabic texts and 41% of plurals [
Diversity in broken plurals makes them highly unpredictable. In most cases, knowing the singular form does not help to deduce the plural, and vice versa. This shows how broken plurals lead to a mismatch problem in Arabic IR.
Arabic also has very diverse types of orthographic variations. They are very common and present real challenges for both Arabic IR and NLP systems. Examples include, but are not limited to, typographical variations, which are merely caused by the Arabic letters ALIF with its different glyphs (أ, إ, آ and ا), YAA with its dotted and un-dotted forms (ي and ى) and HAA with the forms ه and ة. In most cases, one of the glyphs of a certain letter is replaced with another glyph of the same letter, or dropped, initially, medially or finally, when writing text [
Since Arabic is an inflectional language, a large number of studies have been devoted to the analysis of the best approach to index Arabic words. The process of producing index terms often goes through several operations, most of which are language-dependent. Normalization and stemming are among these major processes.
Normalization is the process of producing the canonical form of a token and/or a word in order to maximize matching between a query token and document collection tokens. In its simple form normalization pre-processes tokens to a single form, but very lightly. This is often done in several pre-processing stages so as to render different forms of a particular letter to a single Unicode representation, e.g., replacing the Arabic letter un-dotted ى with a final dotted ي, when this letter appears at the end of an Arabic word.
In its complex forms, normalization is used to handle morphological variation and inflection of words [
Since documents and/or queries may have several forms of a particular word, stemming should map and transform all the inflected forms of a word into a common shared form and, thereby, this shared form would be the most appropriate form for indexing the representations of documents and for searching as well. In monolingual IR, stemming appears to have a positive impact on recall more than precision [
MSA | Variant | Gloss | Typographical Occurrence |
---|---|---|---|
امتحان | إمتحان | Exam | The initial bare ALIF is changed to ALIF HAMZA below |
صفاء | صفا | Purity | The final HAMZA is dropped |
قرآن | قران | The Quran | ALIF MADDA in the middle is altered to bare ALIF |
علاء | علا | A proper noun | The final HAMZA is dropped |
نافذه | نافذة | Window | The final letter HAA is altered to a different letter, which is TAA MARBOOTA |
زراعي | زراعى | Agricultural | The final dotted YAA is changed to un-dotted YAA |
stemming is that it also reduces the size of the index since many words are grouped together in a single canonical form.
In Arabic IR, the word is the surface form, which is often obtained by tokenizing the text (i.e. splitting text on white space and punctuation). Thus, the word in Arabic in its complete structure is a concatenated form of letters consisting of prefixes, a morpheme and suffixes, e.g., وألعابهم (meaning: and their games or their toys). From that perspective, the issue of whether Arabic index terms should be roots or stems has always been a major question. Cited in [
On the other hand, it is known in the Arabic linguistics community that the root of the Arabic surface form is the basic unit, which is usually patterned on فعَل, as described earlier. Accordingly, if an Arabic root is to be extracted from a surface form, all the affixes that appear in that word, even those written medially, should be stripped off.
Accordingly indexing Arabic words has two different paradigms [
In order to achieve the goal of indexing the most adequate Arabic term (stem or root) from a word/token, several approaches have been investigated, from the use of lexicons and dictionaries to morphological analysis and combinations of different techniques. Each method has its pros and cons, and the studies have exhaustively investigated which technique is best for indexing Arabic words.
Due to the large number of studies in this specific area, researchers have attempted to classify the techniques according to their algorithmic behaviors. Larkey, et al., [
・ Manually constructed dictionaries, in which words with their roots and their possible segmentations are stored in a large lookup table.
・ Affix truncation techniques which often attempt to stem the words lightly by removing common suffixes and prefixes.
・ Morphological analyzers, in which the root is extracted using morphological analysis.
・ Statistical stemming which is based on clustering similar words in documents together.
This classification is useful, but in the opinion of the authors it needs to be extended so as to include newer techniques. The new extended classification is shown in
Before delving into the details of each of the employed techniques, it is important first to cover simple normalization. This is because stemming is in fact a complex normalization technique, as illustrated earlier. In addition, the majority of the techniques first perform some normalization. The next sections explain normalization and stemming techniques in detail.
Before stemming, the majority of the Arabic stemming techniques preprocess texts. Preprocessing in Arabic includes removal of non-characters, normalization of letters and removal of stopwords. Removal of non-characters [
As shown earlier, normalization in Arabic is used to render different forms of a letter with a single Unicode representation. This is important to moderate the orthographic variations. Since only a few Arabic letters are the sources of orthographic variations of words, most stemming approaches handle them in a similar way. Accordingly, the majority of the stemming techniques normalize documents and queries using some or all of the following normalization steps [
・ Replacing ALIF in HAMZA forms (ALIF combined with HAMZA that is written above or below the ALIF like in “أ and إ”) and ALIF MADDA (آ) with bare ALIF (ا).
・ Replacing final un-dotted YAA (ى) with dotted YAA (ي).
・ Replacing final TAA MARBOOTA (ة) with HAA (ه).
・ Replacing the sequence ءى with ئ.
・ Replacing the sequence يء with ئ.
・ Replacing ؤ with bare ALIF (ا).
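The six replacements listed above can be sketched as a single function. This is only a minimal illustration: the function name `normalize` and the order in which the rules are applied are assumptions, not taken from any of the surveyed stemmers, and the code points are written as escapes so that each letter is unambiguous.

```python
import re

# Code points are spelled out so each mapping is unambiguous.
ALIF = "\u0627"                         # bare ALIF
ALIF_VARIANTS = "\u0623\u0625\u0622"    # ALIF HAMZA above, ALIF HAMZA below, ALIF MADDA
YAA = "\u064A"                          # dotted YAA
ALIF_MAKSURA = "\u0649"                 # un-dotted YAA
TAA_MARBOOTA = "\u0629"
HAA = "\u0647"
HAMZA = "\u0621"
YAA_HAMZA = "\u0626"
WAW_HAMZA = "\u0624"


def normalize(word: str) -> str:
    """Apply the letter normalizations listed above (hypothetical ordering)."""
    # ALIF in HAMZA forms and ALIF MADDA -> bare ALIF
    word = re.sub(f"[{ALIF_VARIANTS}]", ALIF, word)
    # The two HAMZA sequences -> YAA with HAMZA
    word = word.replace(HAMZA + ALIF_MAKSURA, YAA_HAMZA)
    word = word.replace(YAA + HAMZA, YAA_HAMZA)
    # WAW with HAMZA -> bare ALIF
    word = word.replace(WAW_HAMZA, ALIF)
    # Final un-dotted YAA -> dotted YAA
    if word.endswith(ALIF_MAKSURA):
        word = word[:-1] + YAA
    # Final TAA MARBOOTA -> HAA
    if word.endswith(TAA_MARBOOTA):
        word = word[:-1] + HAA
    return word
```

Under these rules the variant إمتحان and the standard form امتحان are mapped to the same string, so a query written one way can match a document written the other way.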
In spite of the wide use of these normalization steps, Abdelali, et al., [
To address the impact of Arabic challenges on both monolingual and cross-lingual retrieval and the problem of orthographic resolution errors, such as changing the letter YAA (ي) to the letter ALIF MAKSURA (ى) at the end of a word, the studies in Xu, et al. [
The use of normalization techniques is almost uniform across Arabic studies, and it seems that, in order to increase matching, the price to be paid is to normalize Arabic letters before stemming the words in which they occur.
As illustrated earlier, we extended the classification of the approaches employed for stemming Arabic texts. The next section describes the techniques in this classification in detail.
With the premise that the basic unit in Arabic is the root, the root-based stemming technique attempts to perform heuristic and linguistic morphological analysis so as to extract the root of a word. For example, root-based algorithms produce the root عمل for the word وأعمالهم (meaning: and their works) because all affixes are removed. To achieve this goal of obtaining roots, researchers employ Arabic morphological analyzers.
Khoja stemmer [
One advantage of Khoja stemmer is that it has the ability to detect letters that were deleted during the derivational process of words. For instance, the last letter YAA is removed in a word like امشي (meaning: go), resulting in امش, if it appears in an imperative form. As another example, the last letter ALIF in the root نما (meaning: grew) will be modified to WAW in the present form of this root and thus it will be نٌمو instead of نٌما. Using Khoja stemmer, it is possible to handle such cases.
However, in spite of its superiority and wide use, the algorithm has a major drawback, namely over-stemming, in which the stemmer may erroneously cluster some semantically different words into a single root. This is because a tremendous number of Arabic words may have different meanings although they share the same root, leading to low precision and a high level of ambiguity. For example, both the words مقاتلات (meaning: fighters) and يتقاتلون (meaning: they are fighting each other) originate from the canonical root قتل (meaning: to kill). Examples also include words like طفيليات (meaning: parasites) and لعوب (meaning: irresponsible), for which the roots produced by the Khoja stemmer are طفل and لعب. Both roots are semantically different from the original words.
Additionally, the algorithm sometimes removes some affixes that are parts of words (known as mis-stemming), such as in the word مدرسه (meaning: school), which will be stemmed to the root درس (meaning: lesson, or learn in the past tense). The Khoja stemmer may also truncate some letters that are parts of the word. It is clear that blind removal of prefixes and suffixes causes the stemmer to erroneously remove some original letters of the root. For instance, blindly chopping off suffixes and prefixes from a word like بالغات (meaning: feminine adults) will remove the letters بال, which the algorithm handles as a prefix although they are original letters of the root بلغ (meaning: to attain or to accomplish).
In his study for the Holy Quran, Hammo [
Darwish [
For the root detection phase, Sebawai takes the input word and produces all the possible combinations of prefix, suffix and template that could form that word. Once a possible combination is obtained, its product probability (under the independence assumption) is computed from the previously estimated probabilities. The root of the combination with the highest computed probability is detected and matched against the 10,000 roots to check its validity.
Sebawai has some limitations stated by its developer. First, it cannot stem transliterated words such as entity names (e.g., انجلترا, which means England) because it binds the choice of roots to a fixed set. Second, Sebawai cannot deal with some individual words that constitute complete sentences, like لنَهْدِيَنّهُم (meaning: we will surely guide them), because such words appear very rarely and thus low probabilities are assigned to them. Additionally, since Sebawai is a root-based stemmer, it suffers from the same over-stemming problem as Khoja.
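The scoring idea described here can be illustrated with a toy sketch. Everything in it is hypothetical: the probability tables, the three templates, the three-entry root list and the function names are invented for illustration and are not Sebawai's actual data or code. Templates are written with the traditional letters ف, ع and ل as root slots, and, as a simplification, candidates whose root is not in the list are discarded before ranking rather than after.

```python
# Hypothetical, hand-picked probabilities; a real system estimates them from data.
PREFIX_P = {"": 0.6, "ال": 0.3, "و": 0.1}
SUFFIX_P = {"": 0.6, "ون": 0.2, "ات": 0.2}
TEMPLATE_P = {"فعل": 0.5, "فاعل": 0.3, "مفعول": 0.2}
ROOTS = {"كتب", "درس", "علم"}   # stand-in for the list of valid roots
SLOTS = "فعل"                    # letters that mark root positions in a template


def match_template(stem, template):
    """Return the root letters if the stem fits the template, else None."""
    if len(stem) != len(template):
        return None
    root = []
    for letter, slot in zip(stem, template):
        if slot in SLOTS:
            root.append(letter)      # slot position: take the stem letter
        elif slot != letter:
            return None              # literal template letter must match
    return "".join(root)


def analyze(word):
    """Rank (probability, root, prefix, template, suffix) candidates."""
    candidates = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix not in PREFIX_P or suffix not in SUFFIX_P:
                continue
            for template, t_prob in TEMPLATE_P.items():
                root = match_template(stem, template)
                if root is not None and root in ROOTS:
                    # Product of probabilities under the independence assumption
                    score = PREFIX_P[prefix] * t_prob * SUFFIX_P[suffix]
                    candidates.append((score, root, prefix, template, suffix))
    return sorted(candidates, reverse=True)
```

With these toy tables, `analyze("الكاتبون")` yields a single candidate whose root is كتب, obtained from the prefix ال, the template فاعل and the suffix ون.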
Buckwalter [
Unlike Khoja, for example, Buckwalter produces a single stem or all the possible stems of the input word. The basic idea is similar to the one presented by Sebawai. At first, manually constructed tables are collected. The tables are based on three groups (prefixes, possible stems and suffixes). In addition, the valid combinations of each pair of the three groups (prefix/stem pairs, prefix/suffix pairs and stem/suffix pairs) are also stored in the form of truth tables. Thus, during the root detection phase, the Buckwalter algorithm, which is coded in a program called AraMorph, divides the input word into three sub-strings (potential prefix, stem and suffix) in all possible ways. The produced sub-strings are generated according to the pre-constructed tables. Following this, a matching process is performed for each possible combination of prefix, stem and suffix that could yield the input word. Using the truth tables, if the first sub-string is a correct prefix, the second sub-string is a legitimate stem, the third sub-string is a legitimate suffix and the combination of all of them is valid, then the second sub-string is extracted as a stem for the input word. If more than one stem is obtained, then all of them are listed.
Buckwalter is not just a stemmer. Instead, it also tags each word with its possible POS tags and provides all the possible English translations for that word. For example, for the word تعمل (tEml in Buckwalter transliteration), a version of the Buckwalter analyzer provided many solutions, two of which are presented in
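The table-driven segmentation described above can be sketched as follows. The tiny tables and the function name `candidate_stems` are hypothetical and exist only to illustrate the lookup logic; they are not the analyzer's real lexicon, and the pairwise tables here are keyed on the strings themselves, which is an assumption made to keep the example short.

```python
# Toy lexicon tables (hypothetical entries, for illustration only).
PREFIXES = {"", "ال", "و", "وال"}
STEMS = {"كتاب", "ولد"}
SUFFIXES = {"", "ان", "ها"}

# Pairwise compatibility ("truth") tables. Here every pair is allowed except a
# definite article combined with the possessive pronoun suffix.
PREFIX_STEM_OK = {(p, s) for p in PREFIXES for s in STEMS}
STEM_SUFFIX_OK = {(s, x) for s in STEMS for x in SUFFIXES}
PREFIX_SUFFIX_OK = {(p, x) for p in PREFIXES for x in SUFFIXES} - {
    ("ال", "ها"),
    ("وال", "ها"),
}


def candidate_stems(word):
    """Return every stem licensed by the tables for some split of the word."""
    stems = []
    for i in range(len(word)):
        for j in range(i + 1, len(word) + 1):
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if (
                prefix in PREFIXES
                and stem in STEMS
                and suffix in SUFFIXES
                and (prefix, stem) in PREFIX_STEM_OK
                and (prefix, suffix) in PREFIX_SUFFIX_OK
                and (stem, suffix) in STEM_SUFFIX_OK
            ):
                if stem not in stems:
                    stems.append(stem)
    return stems
```

In this sketch الكتابان is segmented into ال + كتاب + ان and the stem كتاب is returned, while الكتابها is rejected because the prefix/suffix table does not allow that pair. A word whose stem is missing from `STEMS` returns no analysis at all, regardless of its affixes.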
One deficiency of Buckwalter’s analyzer is that some words may not be stemmed because they may not be included in the stem table. In addition, broken plurals are not managed by the Buckwalter stemmer [
Based on Buckwalter analyzer and the fact that the analyzer lists all the possible stems, Xu, et al., [
Ghwanmeh, et al., [
Recently, Al-Kabi, et al., [
Results reported in the Al-Kabi study showed that the proposed algorithm yields higher accuracy than the Khoja stemmer. One of the cons of the developed stemmer, however, is that it fails to extract roots from words whose lengths are less than 4 letters. In addition, the dataset used in the study is extremely small. It contains only 6081 Arabic words. Therefore, the results of the study can be considered indicative rather than conclusive.
To mitigate the impact of the major drawback of root-based algorithms, which is losing stem semantics, light stemming for Arabic was also proposed. Light stemmers lightly chop off some affixes from words, much like removing plural endings in English, without performing deep linguistic analysis. From that perspective, the majority of the approaches attempt to strip off the most frequent prefixes (e.g., definite articles), suffixes (e.g., possessive pronouns) and any antefixes or postfixes that can be attached to the beginnings or endings of words. For example, for the word وأعمالهم, light stemmers generate أعمال (meaning: works) because only prefixes (including antefixes) and suffixes (including postfixes) are removed. The decision to remove any affix, however, is usually controlled by some heuristic rules derived from the common use of these affixes. Examples of such stemmers include, but are not limited to, Al-stem by Darwish and Oard [
Al-stem is a light stemmer, presented by Darwish and Oard [
Based on the assumption that light stemming preserves the meaning of words, unlike root-based techniques, Aljlayl and Frieder [
After the input word is fed to the algorithm, the stemmer truncates the letter و (pronounced WAW, meaning: and) only if the length of the word is greater than or equal to 3. Following this, articles are truncated from the beginning of words. If the length of the input word is still greater than or equal to 3, the longest suffixes are removed if and only if the remaining letters number 3 or more. Next, the algorithm truncates prefixes from the word produced in the previous step, but if and only if the remaining letters also number 3 or more. The last step is performed repeatedly until the stem is obtained. In some cases the algorithm also normalizes words and removes all the diacritical marks except shadda. This is because shadda marks the doubling of a consonant and thus represents a letter that would be lost if shadda were removed. One advantage of the algorithm is that it can deal with some arabicized words according to a predefined list. Arabicization refers to words that are transliterated, rather than translated, into Arabic from other languages, e.g., كمبيوتر (meaning: computer). Arabicized words in Arabic are often nouns and terminology derived from other languages. However, such an arabicized list would probably be limited in its coverage. Aljlayl and Frieder concluded that their light stemming algorithm outperforms root-based algorithms, in particular the Khoja stemmer.
Larkey, Ballesteros and Connell [
・ Peel away the letter و (meaning: and) from the beginning of words for light 2, light 3, and light 8 only if there are 3 or more remaining letters after removing the و. This condition attempts to avoid truncating words that genuinely start with the letter و.
・ Truncate definite articles if this leaves 2 letters or more.
・ Remove suffixes, listed in the table below from right to left, from the end of words if this leaves 2 letters or more.
In monolingual and cross-lingual experiments, the developers of light 8 concluded that it outperforms the Khoja stemmer, especially after removing stopwords, with or without query expansion. In fact, Larkey, Ballesteros and Connell concluded that removing stopwords results in a small increase in average precision, which is statistically significant for light 2 and light 8, but not for raw (the case of no stemming or normalization) and normalized words. In the same experiments, Larkey, Ballesteros and Connell used co-occurrence analysis, based on a string similarity metric, to refine some simple stemmers, namely light stemmers followed by removal of vowel letters plus HAMZA (ء). From the experiment, it is concluded that a repartitioning process consisting of vowel removal followed by refinement using co-occurrence analysis performed better than no stemming or very light stemming. In contrast, light8 stemming followed by vowel removal and co-occurrence analysis is not better than light8 with stopword removal.
Larkey, Ballesteros and Connell [
Light stemmer type | Removing from front | Removing from end |
---|---|---|
Light1 | ال، وال، بال، كال، فال | none |
Light2 | ال، وال، بال، كال، فال، و | none |
Light3 | ال، وال، بال، كال، فال، و | هـ، ة |
Light8 | ال، وال، بال، كال، فال، و | ها، ان، ات، ون، ين، يه، ية، هـ، ة، ي |
1) Peel away the letter و (meaning: and) from the beginning of words if there are 3 or more remaining letters after removing the و.
2) Truncate definite articles if this leaves 2 letters or more.
3) Remove suffixes, starting from right to left, from the end of words if this leaves 2 letters or more.
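The three steps, combined with the affix lists tabulated above for Light1 through Light8, can be sketched as follows. This is an illustrative reading of the description, not the developers' code: the dictionary layout and the function name `light_stem` are assumptions, the suffix list is scanned once in the order in which it appears in the table, and the tatweel attached to هـ in the table is dropped so that the plain letter ه is matched.

```python
# Definite-article prefixes from the table, longest first so that e.g. وال is
# tried before ال.
DEFINITE_ARTICLES = ["وال", "بال", "كال", "فال", "ال"]

# Per-variant settings read off the table (layout is an assumption).
LIGHT_STEMMERS = {
    "light1": {"strip_waw": False, "suffixes": []},
    "light2": {"strip_waw": True, "suffixes": []},
    "light3": {"strip_waw": True, "suffixes": ["ه", "ة"]},
    "light8": {
        "strip_waw": True,
        "suffixes": ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"],
    },
}


def light_stem(word, variant="light8"):
    config = LIGHT_STEMMERS[variant]
    # Step 1: drop a leading WAW only if 3 or more letters remain.
    if config["strip_waw"] and word.startswith("و") and len(word) - 1 >= 3:
        word = word[1:]
    # Step 2: drop one definite article if 2 or more letters remain.
    for article in DEFINITE_ARTICLES:
        if word.startswith(article) and len(word) - len(article) >= 2:
            word = word[len(article):]
            break
    # Step 3: walk the suffix list once, dropping each suffix that is found
    # at the end of the word if 2 or more letters remain.
    for suffix in config["suffixes"]:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[: -len(suffix)]
    return word
```

For instance, `light_stem("الطفيليات")` strips ال, then ات, then the trailing ي, giving طفيل, whereas `light_stem("ولد")` is left unchanged because removing the و would leave fewer than three letters.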
The robust feature of light 10, and of light stemming approaches in general, is that the stemmer minimizes the impact of the over-stemming problem. Since only a few prefixes and suffixes are removed, the semantic meanings of words are preserved. Consider the word الطفيليات. If the word is lightly stemmed, then the resulting stem is طفيل (as only the definite article prefix ال and the plural feminine suffix ات will be eliminated according to the algorithm). Note that both the word and the stem have the same semantic meaning. In general, this is a very strong feature of light stemming approaches. In the experiments, the developers of light 10 showed that it outperforms the Khoja stemmer and that the difference is statistically significant.
In the same study, the stems produced using light10 were also compared to the stems generated after words were processed using both the Buckwalter and Diab analyzers [
In spite of the above conclusion about light 10, the stemmer still has major drawbacks. The obvious one is the under-stemming problem, in which words with the same meaning may be clustered into different groups. For instance, the stemmer fails to group the words اقتتل (meaning: they are fighting fiercely with each other), which is stemmed to اقتتل, and القاتل (meaning: the killer), which is stemmed to قاتل, although both words are semantically similar. As a result, the stemmer may yield low recall, as many relevant documents will not be retrieved. Under-stemming is not limited to light 10; it appears in every light stemmer in Arabic studies.
Inspired by the drawbacks of both light and heavy stemming techniques, Kadri and Nie [
Chen and Gey [
Affix removal is tightly controlled in the Berkeley team's algorithm.
For the removal of prefixes, the algorithm checks the length of the input word and applies a specific rule according to that length. If the length is:
・ At least 5 characters, truncate the first 3 characters if and only if the first three letters of the word are in the list (وال, بال, كال, ولل, مال, اال, سال, لال).
・ At least 4 characters, truncate the first 2 characters if and only if the first two letters of the word are in the list (با, لل, وم, وت, لا, سي, وس, وي, ول, كا, فا). However, if the word begins with the letter و, remove that letter.
・ At least 4 characters and the initial letter of the word is ب or ل, then truncate that letter if and only if the produced word (after the letter ب or ل has been truncated) is in the TREC 2002 Arabic collection.
For stripping off suffixes, the algorithm also checks the length of the word (before removing the suffix but after the removal of the prefix). If the length is:
・ At least 4 characters, truncate recursively the following two-character suffixes according to their occurrences in the list: (ها, ية, هم, وا, نا, ما, وا, يا, ني, هن, كم, كن, تم, تن, ين, ان, ات, ون).
・ At least 3 characters, truncate recursively the following one-character suffixes that appear in the list: (ة, ه, ي, ت).
The results reported in the Chen and Gey study showed that the algorithm was very beneficial to retrieval performance.
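The prefix and suffix rules described above can be sketched in a few lines of Python. This is only an illustrative reading of the description, not the Berkeley implementation: the affix lists are copied from the text, while the function names, the order in which the rules are tried, the choice of ب and ل as the single-letter prefixes, and the `vocabulary` argument (a stand-in for the TREC 2002 Arabic collection lookup) are assumptions.

```python
# Affix lists copied from the description above; everything else is assumed.
PREFIXES_3 = ("وال", "بال", "كال", "ولل", "مال", "اال", "سال", "لال")
PREFIXES_2 = ("با", "لل", "وم", "وت", "لا", "سي", "وس", "وي", "ول", "كا", "فا")
SUFFIXES_2 = ("ها", "ية", "هم", "وا", "نا", "ما", "يا", "ني", "هن",
              "كم", "كن", "تم", "تن", "ين", "ان", "ات", "ون")
SUFFIXES_1 = ("ة", "ه", "ي", "ت")


def strip_prefix(word, vocabulary=frozenset()):
    """Apply at most one prefix rule, longest prefix first (assumed order)."""
    if len(word) >= 5 and word[:3] in PREFIXES_3:
        return word[3:]
    if len(word) >= 4 and word[:2] in PREFIXES_2:
        return word[2:]
    if len(word) >= 4 and word[0] == "و":
        return word[1:]
    # Single-letter prefix: strip only if the remainder is an attested word.
    # `vocabulary` is a hypothetical stand-in for the collection lookup.
    if len(word) >= 4 and word[0] in ("ب", "ل") and word[1:] in vocabulary:
        return word[1:]
    return word


def strip_suffixes(word):
    """Repeatedly strip suffixes while the length conditions still hold."""
    changed = True
    while changed:
        changed = False
        if len(word) >= 4 and word[-2:] in SUFFIXES_2:
            word = word[:-2]
            changed = True
        elif len(word) >= 3 and word[-1] in SUFFIXES_1:
            word = word[:-1]
            changed = True
    return word


def stem(word, vocabulary=frozenset()):
    """Prefix removal first, then suffix removal, as the text describes."""
    return strip_suffixes(strip_prefix(word, vocabulary))
```

Under these assumptions, `stem("والكتاب")` removes the three-letter prefix and returns كتاب, while `stem("بكتاب")` leaves the word untouched unless كتاب is present in the supplied vocabulary. The length thresholds are what keep short words from being reduced to one or two letters.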
Inspired by the fact that some light stemmers may remove affixes that are part of the original words, Nwesri, et al., [
Ababneh, et al. [
Very recently, Sameer [
Morphological analyzers for dialectal Arabic have also been proposed. It is known that for the Arabic language, there is a continuum of spoken dialects, which are native languages and vary geographically as well as by social class. These dialects differ phonologically, lexically, morphologically and syntactically from one another [
Motivated by the fact that stemming is a language-dependent process, statistical stemmers, which act as language-independent conflation techniques, were also proposed for Arabic IR. Examples of statistical stemmers are those based on corpus analysis [
From that perspective, using co-occurrence statistics and association measures (e.g., Mutual Information) between word pairs makes it possible to determine which words are semantically different and which are similar, even if the words have the same letters. For instance, consider the Arabic word ذهب. The word is polysemous, as it has two meanings: go and gold. Accordingly, by making use of co-occurrence statistics, the two senses should be assigned to different clusters if the contexts in which they appear indicate such distinct meanings. Reported results by Xu and Croft [
Association similarity measures at the word level have also been used for Arabic IR stemming. The premise here is that segmenting each word into a set of N-grams (3-grams, for example) and computing a similarity measure (the Dice coefficient, for example) between the set of 3-grams of that word and the set of 3-grams of the query word yields a similarity value that might allow the word to be clustered in a specific class. The major advantage of N-gram models is that they are able to capture broken plurals. Affix removal does not conflate broken plurals with their singular forms, because they preserve some affixes and internal differences; yet the singular and broken plural forms of a word have some letters in common in many cases. Accordingly, segmenting the singular and plural forms of a word would probably capture the shared letters, and hence the plural and other inflected forms of the word are clustered together.
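The N-gram premise can be made concrete with a short sketch. The code below computes the Dice coefficient, 2|A ∩ B| / (|A| + |B|), over the sets of character N-grams of two words. The function names, the default of n = 3, and the example pair are illustrative assumptions and are not taken from any of the surveyed systems.

```python
def char_ngrams(word, n=3):
    """Return the set of contiguous character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(word_a, word_b, n=3):
    """Dice coefficient between the n-gram sets of two words."""
    grams_a = char_ngrams(word_a, n)
    grams_b = char_ngrams(word_b, n)
    if not grams_a and not grams_b:
        # Both words are shorter than n, so there is nothing to compare.
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))
```

For instance, with n = 2 the singular مدرسة (school) and its broken plural مدارس share two of their four bigrams each, so `dice("مدرسة", "مدارس", n=2)` evaluates to 0.5. A word would then be assigned to a cluster when its score against a cluster member exceeds some threshold; both the threshold and the most suitable value of n are empirical choices.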
Using these arguments, Mustafa and Al-Radaideh [
The same technique had been also used by Khreisat [
Hmeidi, et al., [
Xu, et al. [
The same authors extended their study to include spelling normalization [
Chen and Gey [
Stemming based on light/simple tagging has also been utilized for Arabic texts. The idea is to lightly tag words with different tags so that different types of stemming techniques can be applied. Al-Shammari and Lin [
Mansour et al. [
The methods of both Mansour and Al-Shammari seem reasonable, but the experiments were not comprehensive, as small and non-standard sets were used (only 24 texts were used by Mansour).
Inspired by the fact that Neural Networks (NN) can be applied to a large number of applications, Alserhan and Ayesh [
Boubas, et al., [
It is apparent that in highly morphological languages such as Arabic, stemming could have a significant impact on retrieval. This is evident in the majority of the studies reviewed in this paper. It is also clear that heavy and light stemming are the dominant approaches for stemming Arabic, and it can be concluded from the reported studies that light stemming is better than heavy stemming. However, each of the two paradigms has its pros and cons. On one hand, heavy stemming often results in over-stemming, leading to low precision. This is especially true in morphologically rich languages such as Arabic, which are rich in polysemous words, in which a single word could have multiple meanings. So, rendering the inflected forms of words into a single root would probably result in returning a large number of irrelevant documents.
On the other hand, light stemming may fail to cluster semantically similar words together (under-stemming), resulting in low recall. However, in spite of this major drawback, light 10 is the best known algorithm for indexing Arabic texts. It has become a popular solution to the Arabic stemming problem. Accordingly, light 10 has been added to well-known IR systems such as Lucene and the Lemur toolkit.
It is true that light stemming preserves the meaning of words, unlike root-based techniques, and achieves the goal of retrieving the most pertinent documents. In the opinion of the authors of this paper, however, the major reason behind the success of light stemming over root-based stemming is that the strippable affixes in the former approach are those appearing in Arabic nouns and adjectives (e.g., وال، بال، كال، فال), and the belief is that Arabic nouns and adjectives far outnumber verbs. In fact, there are only a few morphological rules, known as “ten-verb-addition”, for forming verbs from roots. In contrast, root-based stemming techniques, which often attempt to produce the root, focus on verbs and handle even nouns and adjectives in the same way. It is evident that it is not always correct to produce the root of proper nouns, or of nouns in general. Consider the following nouns: السعودية, المكاني, المهرجان, باراك أوباما and الستائر (meanings respectively: Saudi Arabia, spatial, the festival, the US leader Barack Obama, and the curtains). Using a root-based stemmer like Khoja, the stems are سعد, كني, هرج, برا وبا and ستر, respectively. All of the stems are either chaotic or do not have semantic meanings similar to their original words. These two facts are the main reasons why light stemming techniques outperform root-based stemming.
However, in IR, the more words are clustered semantically, the better the retrieval. Therefore, in spite of the achievements of light stemming techniques in general (and light 10 in particular), the belief is that they are not the best paradigm for indexing Arabic texts. In addition, it is obvious that in a rich language like Arabic, the relatively blind removal of affixes (i.e., prefixes and suffixes) could have a significant impact on words and may lead to mis-stemming and ambiguity problems, in which some letters that are original in words are erroneously stripped off. In fact, light stemming techniques are really simple, as they depend solely on the removal of affixes and on some controlled rules devised experimentally.
When it comes to Arabic-English CLIR, in which the query language is different from the language of the document collection, the belief is that light stemming may result in a relatively high number of OOV (Out-of-Vocabulary) words. This is because the majority of Arabic-to-English dictionaries (not the opposite) list their entries by root. In fact, whenever native Arabic speakers need to translate an Arabic word into English, they render the word to its root to increase the chance of capturing the translation senses. This is a point in favor of root-based techniques.
In the opinion of the authors, the only way to avoid the ambiguity that may occur due to blind removal of affixes (when a letter or some consecutive letters are in fact part of the word) is to use statistical data extracted from corpora. It is evident (as in the Kadri and Nie study above) that such corpus statistics are very useful for handling this ambiguity: the decision to remove an affix depends on the distribution of that affix in the corpus and on how frequent it is, so the certainty of the removal can be estimated, which helps considerably in the final decision.
Nevertheless, the use of corpus statistics merely shifts the burden of removal ambiguity to the domain and size of the corpus. The belief of the authors is that there is always a possibility of undesirable behavior and/or poor performance when moving from one domain to another and/or when the corpus size changes. This is not surprising, because statistics depend solely on the size and the domain from which the data is sampled. Consider, for example, Arabic technical words in computer science. A considerable number of these words are borrowed from other languages (e.g., English). So, a technique that relies on corpus statistics gathered from one genre to avoid removal ambiguity may see its performance drop radically on a computer science corpus, because news collections, for example, may have unique features that are not found in other genres.
It is also noticeable in the reported studies on Arabic stemming that a considerable number of the developed approaches were tested using small collections rather than standard corpora (e.g., TREC 2001 and 2002). This is a major drawback, because their results must be deemed indicative rather than conclusive in this case. There is no guarantee that the same results will be achieved when such techniques are tested on standard test beds.
The majority of the developed techniques did not show how they handle broken plurals in Arabic. In fact, only a few studies address the problem, in terms of statistical stemming and/or using clustering approaches or translation techniques based on aligned corpora, as described earlier in this paper. As illustrated earlier, broken plurals represent 10% of Arabic texts. The problem is unavoidable, and the belief is that it should be handled within stemming techniques.
Simple part-of-speech tagging, statistical stemming and artificial intelligence are very useful for the Arabic stemming task. Statistical stemming (e.g., N-gram models) and artificial intelligence techniques, for example, have the ability to detect broken plurals of Arabic words, unlike root-based and heavy stemming techniques. They are also able to cluster related words together and to minimize the impact of polysemy (i.e., when a word has multiple meanings, clustering could distinguish between these meanings). Nevertheless, the use of such techniques incurs a performance penalty for identifying clusters and/or detecting the part-of-speech tags of words. In addition, simple POS tagging techniques rely on a hypothesis that does not always hold, namely that the tag of a word can be determined from the preceding words. In fact, the tags of the majority of Arabic words cannot be determined from the preceding words alone. This may be the major reason why only small text collections were used in the reported experiments on such techniques.
Orthographic variation in Arabic and the various ways of writing some letters could also have a significant impact, because incorrect normalization may yield a stemmed word that does not share the meaning of the original word. Such orthographic differences should be handled carefully within stemming techniques. In the belief of the authors of this paper, the Arabic stemming task should not depend on a single approach. Combining approaches is the only way to develop an accurate and error-free Arabic stemmer.
The Arabic language is extremely rich in its morphology, derivational system and grammatical rules. For such a language, stemming techniques could have a significant impact on improving retrieval performance. This paper reviews a considerable number of studies that have been conducted to resolve the Arabic stemming problem. Several studies are presented, and the reasons why the Arabic language is challenging, along with the implications for NLP and IR, have been analyzed.
However, in spite of the achievements, it is not yet apparent which approach is the best for indexing Arabic texts. It is true that in the NLP and IR research community, light stemming techniques have been widely adopted for their simplicity and their ability to preserve word meanings. Yet they are still far from optimal accuracy. Root-based stemmers may result in higher recall, but many irrelevant documents may be retrieved because they cluster words with different meanings into the same class. Additionally, light stemmers focus on nouns and hence perform relatively poorly on verbs. Morphological analyzers, such as the Khoja stemmer, which first tokenizes the input text, could produce incorrect tokenization and consequently incorrect stemming. Additionally, they have been designed to focus on verbs rather than nouns. Tagging techniques could improve performance, but in an ad-hoc task like IR they are not optimal. Statistical techniques could contribute to resolving the broken plural problem, but they depend solely on corpus statistics, which may change as we move from one domain to another. We can conclude that there is no optimal solution yet to the problem of how to index Arabic terms.
Mustafa, M., Eldeen, A.S., Bani-Ahmad, S. and Elfaki, A.O. (2017) A Comparative Survey on Arabic Stemming: Approaches and Challenges. Intelligent Information Management, 9, 39-67. https://doi.org/10.4236/iim.2017.92003