Documents contain a large amount of valuable knowledge on various subjects and, more recently, documents on the Internet have become available from various sources. Automatic, rapid, and accurate classification of these documents with little human interaction has therefore become necessary. In this paper, we introduce a new algorithm, called the highest repetition of words in a text document (HRWiTD), to classify Arabic text automatically. The corpus is divided into a train set and a test set, to which the proposed classification technique is applied. The train set is analyzed for learning, and the learning data is stored in the Learning Dataset file. For each word, the category in which the word has the highest repetition is assigned as the category of that word in the Learning Dataset file. This file contains the non-duplicate words from all texts in the train set, each with its highest repetition value and its category. For each text in the test set, every word is assigned a category by using the Learning Dataset file, and the category that contains the largest number of words is assigned as the predicted category of the text. The confusion matrix method is used to evaluate the classification accuracy of the HRWiTD algorithm. The HRWiTD algorithm has been applied to convergent samples from six categories of Arabic news from SPA (Saudi Press Agency), where its accuracy is 86.84%. In addition, we applied the most popular machine learning algorithms, C5.0, KNN, SVM, NB and C4.5, to the same corpus; their classification accuracies are 52.86%, 52.38%, 51.90%, 51.90% and 30%, respectively. Thus, the HRWiTD algorithm gives better classification accuracy than the most popular machine learning algorithms on the selected domain.
The internet is a very effective medium for obtaining a huge amount of information in different forms, such as documents. Millions of documents are now available from various sources, and most of them contain valuable information. Manual classification of documents is time-consuming and very difficult, especially when people must estimate the category from the information a document contains. Therefore, automatic text classification is used to discover the basic information of text documents automatically while saving human effort and time [
Automatic text categorization is assigning and categorizing texts by using a set of predetermined categories based on the contents of the text. Specifically, it is filtering and routing, clustering information in related texts, and then classifying the texts into specified topics [
The classification of Arabic texts has received great attention in recent research because of the importance of the Arabic language and the large population that speaks it. In this paper, we introduce the HRWiTD algorithm, which automatically analyzes Arabic texts to estimate their classifications (categories). The abbreviation refers to the highest repetition of words in a text document. The proposed technique for classifying text is built on three main stages. The pre-processing stage removes noisy data. The feature extraction stage learns from the dataset and builds the Learning Dataset file based on the features extracted from the train set; this file includes the non-duplicate words with their highest repetition values and categories. The classification stage estimates the classification of texts by using the HRWiTD algorithm (the expected classification of a text is the category with the largest number of words). If the average of the total repetition for all words in a text that have a predetermined classification (category) is less than 33.33%, the proposed classification of the text is the “General” category.
The HRWiTD algorithm has been applied to convergent samples of six categories, namely culture, economic, general, political, social, and sports, to obtain the best classification accuracy. The selected corpus was obtained from SPA (Saudi Press Agency) and contains 1421 Arabic texts (Newswire). It was divided into two sets, a 70% train set and a 30% test set; this division gives the best classification accuracy based on [
Based on recent research, various automated learning algorithms have been successfully applied to Arabic text. The most famous techniques to classify Arabic text from the best to the worst are C5.0 classifier, Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (C4.5), and K-nearest neighbor (KNN) [
The second section presents relevant work. The third section introduces the proposed work in detail, including the HRWiTD algorithm and the evaluation method used. The fourth section presents the experimental results of the proposed algorithm and of the most popular machine learning algorithms, together with their comparison. The last section is the conclusion.
Text classification (TC) in data mining field is the process of extracting useful knowledge from text by analyzing complex and textual data [
In many text mining algorithms, pre-processing is one of the main components of text classification. Typically, the TC framework begins with pre-processing, continues with feature extraction, and ends with the classification steps [
The automatic text classification is used to classify texts in many languages such as Arabic. Arabic is the native language of more than 300 million people and is widely spread in the world [
Recently, many types of research have been published in machine learning algorithms for the classification of Arabic text. Naïve Bayes is used to automatically classify Arabic documents in El-Kourdi et al. [
Different machine learning algorithms applied to Arabic texts have produced different classification accuracies, which are presented in [
In this paper, there are three main phases to classify Arabic texts: pre-processing, feature extraction, and classification. In the pre-processing stage, feature selection is used to remove noisy data such as numbers, punctuation, kashida, stop words, and diacritics [
The accuracy of using the HRWiTD algorithm for classifying is evaluated through the confusion matrix. This method evaluates the predicted classification of the texts with the actual classification (from six categories) in the Arabic news (SPA).
This section describes the main stages of classifying Arabic texts in detail: data collection, pre-processing, data division, feature extraction from the train set, filtering, feature extraction, data representation, applying the HRWiTD algorithm, and performance evaluation.
Data collection is the first and a very important stage in the classification of Arabic texts. We chose an Arabic source (Newswire) from the Saudi Press Agency (SPA), which includes convergent samples of six categories. We chose SPA as a source for two reasons: the availability of the actual classification (category) for each text in the corpus and the availability of SPA texts on the Web. SPA statistics are shown in
Pre-processing improves the classification of text documents by removing worthless data. Such data may include numbers, punctuation, kashida, Hamza, diacritics, and stop words. Some words, such as prepositions and pronouns, do not belong to any classification, so we append them to a stop word list, see
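The clean-up described above can be illustrated with a minimal Python sketch. It is not the authors' ATC tool: the function name `preprocess`, the tiny `STOP_WORDS` set, and the choice to normalize Hamza forms to a bare alef are all assumptions made for illustration.

```python
import re

# Hypothetical, tiny stop-word list; the list used in the paper is far larger.
STOP_WORDS = {"هذا", "هذه", "من", "في", "على", "الى"}

DIACRITICS = re.compile(r"[\u064B-\u0652]")       # fathatan .. sukun
KASHIDA = "\u0640"                                 # tatweel (elongation mark)
# Assumed Hamza handling: map alef-with-hamza/madda forms to bare alef.
HAMZA_FORMS = str.maketrans({"\u0623": "\u0627",   # alef with hamza above
                             "\u0625": "\u0627",   # alef with hamza below
                             "\u0622": "\u0627"})  # alef with madda
# Anything that is not an Arabic letter or whitespace (digits, punctuation, ...).
NON_LETTERS = re.compile(r"[^\u0621-\u064A\s]")


def preprocess(text: str) -> list[str]:
    """Return the cleaned words of one text, with stop words removed."""
    text = DIACRITICS.sub("", text)       # strip diacritics before splitting
    text = text.replace(KASHIDA, "")      # kashida sits inside the letter range
    text = text.translate(HAMZA_FORMS)
    text = NON_LETTERS.sub(" ", text)
    return [word for word in text.split() if word not in STOP_WORDS]
```

Diacritics and kashida are removed before the final filter so that a word is not split where those marks occurred.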
At this stage, the ATC tool is used to divide the corpus into two partitions, the train set and the test set. The train set contains 70% of the selected corpus and the test set contains 30%; this division gives the best classification performance based on [
Source | Classes | No. of Texts | No. of Words | No. of Unique Words |
---|---|---|---|---|
Saudi Press Agency | Cultural | 251 | 47,499 | 9993 |
Saudi Press Agency | Economic | 248 | 40,065 | 7780 |
Saudi Press Agency | General | 171 | 32,395 | 8592 |
Saudi Press Agency | Political | 250 | 35,350 | 7430 |
Saudi Press Agency | Social | 251 | 49,615 | 10,124 |
Saudi Press Agency | Sports | 250 | 41,657 | 7332 |
Total | 6 | 1421 | 246,581 | 51,251 |
Class | Stop Words |
---|---|
Demonstrative pronouns | هذا, هذه, ذلك, تلك, هذان, هذين, هتان, هتين, هؤلاء, أولائك, .... |
Relative pronouns | الذي, التي, اللذا, اللذين, اللتان, اللتين, الذين, اللاتي, اللواتي, .... |
Subject pronouns | انا, انتَ, انتِ, هو, هي, نحن, أنتما, هما, نحن, أنتم, أنتن, هم, هن, ... |
Possessive pronouns | عند, مع, ل, لها, .... |
Numbers | واحد, اثنين ,ثلاثه , .... |
Special converters to accusative | كان وأخواتها, إنّ وأخواتها, ظنّ وأخواتها |
Prepositions of time | حين, صباح, ظُهر, ساعة, سنة , أمس, .... |
Prepositions of place | فوق, تحت, أمام, وراء, حيث , دون, .... |
Prepositions | الواو, الفاء, ثم, حتى, أَو, أَم, بل, لا, لكن, .... |
Conjunction | من, عن, على, في, الباء, إلى, اللام, الكاف, حتى, رُبَّ, مذ, منذ, التاء, الواو, .... |
Countries & Cities | اليمن, امريكا, المانيا,...... & صنعاء, نيورك , برلين, ..... |
Proper Noun | احمد, علي , خالد, محمد, عمر, عبدلله , .... |
Nationalities | يمني, امريكي, الماني,.... |
Others | بنت ,بن , ابن, ام, اب, اخ’ اخت, جد, جده, حفيد, حفيدة, عم, عمه, خال, خالة, اليوم, غدا |
(Suffix or prefix of singular/dual/plural/feminine/masculine with any Stop Words mentioned above in this table) or (Article with any Stop Words mentioned also above in this table).
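The 70/30 division described above could be sketched in Python as follows. The paper does not state how the ATC tool selects texts, so splitting each category separately (a stratified split), the fixed random seed, and the name `split_corpus` are assumptions for illustration only.

```python
import random
from collections import defaultdict


def split_corpus(texts, train_ratio=0.7, seed=0):
    """Split (category, text) pairs into train and test lists.

    Assumption: the split is made per category so that every category
    keeps roughly the same 70/30 proportion.
    """
    by_category = defaultdict(list)
    for category, text in texts:
        by_category[category].append((category, text))

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for items in by_category.values():
        rng.shuffle(items)
        cut = round(len(items) * train_ratio)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test
```

With ten texts in each of two categories, this yields 14 train texts and 6 test texts, three test texts per category.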
In this stage, we use data from the train set and the test set, which may come from an internal or external source. The ATC tool extracts the features and generates the repetition list of words. It lists and saves the repetitions of each word in all texts of the train set in a train list file, and the repetitions of each word in all texts of the test set in a test list file. In addition, a field is added to the train list file and the test list file to label the category of each word. The category of a word in the train list is its actual category. On the other hand, the word categories in the test list are taken from the Learning Dataset file for the same words.
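A minimal Python sketch of the repetition counting for the train set is shown below. The in-memory structure and the name `count_repetitions` are hypothetical stand-ins for the train list file; the paper does not describe the file layout used by the ATC tool.

```python
from collections import Counter, defaultdict


def count_repetitions(train_texts):
    """Count how often each word occurs under each actual category.

    train_texts: iterable of (category, list_of_words) pairs.
    Returns {word: Counter({category: repetitions})}.
    """
    counts = defaultdict(Counter)
    for category, words in train_texts:
        for word in words:
            counts[word][category] += 1
    return counts
```

A word that appears in texts of several categories keeps one count per category at this point; choosing a single category for it is left to the filtering stage.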
At this stage, the train list file is filtered by removing duplicate words together with their classifications. For each word, the entry with the highest repetition is kept with its related data (repetition number and category), and the entries for the same word with lower repetition are deleted along with their related data.
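The filtering rule can be sketched in Python as below, assuming the per-category repetition counts are already available as a dictionary. The name `build_learning_dataset` and the tie-breaking behaviour (the first category encountered wins) are assumptions; the paper does not say how ties are resolved.

```python
def build_learning_dataset(counts):
    """Keep, for every word, only the category with the highest repetition.

    counts: {word: {category: repetitions}}.
    Returns {word: (category, repetitions)}.
    """
    learning = {}
    for word, per_category in counts.items():
        # max() returns the first maximal item, so ties keep insertion order.
        category, repetitions = max(per_category.items(), key=lambda kv: kv[1])
        learning[word] = (category, repetitions)
    return learning
```

For example, a word counted five times under "Sport" and once under "Social" is stored once, as ("Sport", 5).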
At this stage, the train list file produced by the filter stage is formatted into the Learning Dataset file. The test list file produced by the feature extraction stage is used for classifying text with the HRWiTD algorithm. The data is represented as an array with n rows and m columns, where rows correspond to the words in a text and columns correspond to repetition and category.
In this step, the Learning Dataset file produced by the data representation stage and the test list file are used in the classification algorithm (HRWiTD). The test list file stores the predicted classification (taken from the Learning Dataset file) for all words in each text. The predicted classification file stores the predicted classification of all test texts. Details of the HRWiTD algorithm process are given in
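The classification step can be sketched in Python as follows. This is one reading of the description, not the authors' implementation. In particular, the 33.33% rule is interpreted here as "fewer than one third of the words of the text have a category in the Learning Dataset", which is an assumption, since the paper's wording allows other readings. The names `classify` and `GENERAL` are hypothetical.

```python
from collections import Counter

GENERAL = "General"


def classify(words, learning, threshold=1 / 3):
    """Predict the category of one pre-processed test text.

    words: list of words in the text.
    learning: {word: category} taken from the Learning Dataset.
    """
    if not words:
        return GENERAL
    # Each known word votes for the category stored in the Learning Dataset.
    votes = Counter(learning[w] for w in words if w in learning)
    known = sum(votes.values())
    # Assumed reading of the 33.33% rule: too few words with a known category.
    if known / len(words) < threshold:
        return GENERAL
    # Otherwise the category with the largest number of words wins.
    return votes.most_common(1)[0][0]
```

With a Learning Dataset that maps two words to "Sport" and one to "Economic", a text made of exactly those three words is predicted as "Sport", while a text where only one word in four is known falls back to "General".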
The performance of using the HRWiTD algorithm for classifying texts has been evaluated using the confusion matrix [
It allows easy identification of confusion between classes, e.g., when one class is commonly mislabeled as another. Most performance measures are computed from the confusion matrix. The predicted information (classification) is assigned by using the HRWiTD algorithm, and the confusion matrix evaluates the performance by comparing the actual and predicted information, see
n = 420 (30% of data set) | Predicted Negative | Predicted Positive | Total |
---|---|---|---|
Actual Negative | TN | FP | TN + FP |
Actual Positive | FN | TP | FN + TP |
Total | TN + FN | FP + TP | Total |
Entries in the confusion matrix have the following meaning in the context of our study:
・ True negative (TN) is the number of correct predictions that an instance is negative.
・ False positive (FP) is the number of incorrect predictions that an instance is positive.
・ False negative (FN) is the number of incorrect predictions that an instance is negative.
・ True positive (TP) is the number of correct predictions that an instance is positive.
・ Total is the summation of all the above variables. See Equation (1).
Total = TN + FP + FN + TP (1)
Overall, the accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined by using the Equation (2):
AC = ( TN + TP ) / Total (2)
・ There are two possible predicted classifications: “Positive” and “Negative”. If we are predicting a target classification of a text (e.g., “Sport”), “Positive” means that the text belongs to the target classification, and “Negative” means that it does not.
・ For each of the six categories, the classifier (HRWiTD algorithm) makes a total of 420 predictions (the test data, out of 1421 texts), with 70 texts per category.
・ Out of those 420 cases, the classifier predicted “Positive” FP + TP times and “Negative” TN + FN times.
・ In reality, FN + TP texts in the table belong to the target classification, and TN + FP texts do not.
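The one-against-the-rest counting described in the points above, together with Equations (1) and (2), can be sketched in Python as follows. The function name `one_vs_rest_accuracy` and the list-based inputs are assumptions for illustration.

```python
def one_vs_rest_accuracy(actual, predicted, target):
    """Accuracy for one target category, treating all others as Negative.

    actual, predicted: equal-length lists of category labels.
    Returns (tn, fp, fn, tp, accuracy).
    """
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == target and p == target:
            tp += 1          # correctly predicted Positive
        elif a == target:
            fn += 1          # Positive text predicted as Negative
        elif p == target:
            fp += 1          # Negative text predicted as Positive
        else:
            tn += 1          # correctly predicted Negative
    total = tn + fp + fn + tp            # Equation (1)
    accuracy = (tn + tp) / total         # Equation (2)
    return tn, fp, fn, tp, accuracy
```

For instance, with actual labels Sport, Sport, Social, Social and predicted labels Sport, Social, Social, Social, the target “Sport” gives TP = 1, FN = 1, FP = 0, TN = 2 and an accuracy of 0.75.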
The HRWiTD algorithm is used to classify Arabic texts. The confusion matrix method was used to determine the classification accuracy of the HRWiTD algorithm, which is 86.84% in this experiment, see
The data in
See
Classifier | Accuracy (%) | Dataset |
---|---|---|
C5.0 | 52.86 | Frequency, 70, DF, CHI, 30 |
KNN | 52.38 | Frequency, 70, DF, CHI, 30 |
NB | 51.90 | TFiDF, 70, DF, IG, 30 |
SVM | 51.90 | LTC, 70, TF, CHI, 30 |
C4.5 | 30 | Boolean, Frequency, TFiDF, 70 |
Category | C5.0 Classifier (Frequency, 70%, CHI, DF, 30) | HRWiTD Algorithm (Train 70%) |
---|---|---|
Culture | 40.00% | 85.24% |
Economic | 47.14% | 91.43% |
General | 80.00% | 64.58% |
Political | 32.86% | 94.29% |
Social | 35.71% | 86.19% |
Sport | 81.43% | 99.29% |
Total | 52.86% | 86.84% |
Total = [ ( AC ( Culture ) + AC ( Economic ) + AC ( General ) + AC ( Political ) + AC ( Social ) + AC ( Sport ) ) * 100 ] / 6 (3)
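As a check of Equation (3), the short Python sketch below averages the per-category accuracies reported in the comparison table. The values are already percentages, so the multiplication by 100 in Equation (3) is not repeated; the name `overall_accuracy` is hypothetical.

```python
def overall_accuracy(per_category_percent):
    """Unweighted mean of the per-category accuracies, in percent."""
    return sum(per_category_percent.values()) / len(per_category_percent)


# Per-category accuracies (%) as reported in the comparison table.
hrwitd = {"Culture": 85.24, "Economic": 91.43, "General": 64.58,
          "Political": 94.29, "Social": 86.19, "Sport": 99.29}
c50 = {"Culture": 40.00, "Economic": 47.14, "General": 80.00,
       "Political": 32.86, "Social": 35.71, "Sport": 81.43}

print(round(overall_accuracy(hrwitd), 2))  # 86.84
print(round(overall_accuracy(c50), 2))     # 52.86
```

Both results match the “Total” row of the table, which confirms that the overall figure is an unweighted average over the six categories.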
In summary, this paper classified Arabic texts automatically using the HRWiTD algorithm. We applied it to 1421 Arabic Newswire texts from the Saudi Press Agency (SPA). The corpus includes convergent samples of six categories (culture, economic, general, political, social, and sports). The average overall classification accuracy for the six categories is 86.84%; the confusion matrix method is used to evaluate the classification accuracy. The classification technique in this paper is built on three main phases: pre-processing, feature extraction, and classification by using the HRWiTD algorithm. The repetition for a predetermined category of each word in the text is calculated. If the average of the total of those words is less than 33.33%, the expected classification of the text is the “General” category; otherwise, the expected classification of the text is the category with the largest number of words. We compared the accuracy of the proposed algorithm (HRWiTD) with the accuracy of the most popular techniques: the accuracies of the C5.0, KNN, SVM, NB and C4.5 classifiers are 52.86%, 52.38%, 51.90%, 51.90% and 30%, respectively. These techniques performed best when they used advanced methods for term selection (CHI, IG, None), different weighting methods (Boolean, Entropy, Frequency, LTC, Relative Frequency, TFC and TFiDF), and two simple methods for term selection (TF and DF). Thus, we conclude that the best technique for classifying Arabic texts in the selected domain is the HRWiTD algorithm. In addition, the HRWiTD algorithm gives the best classification accuracy for each individual category except the “General” category.
In future work, first, the HRWiTD algorithm needs to be improved to give better results for all text categories; here we cover only six categories, and texts from other categories were assigned to the “General” category. Second, the experimental corpus needs to be extended with different resources to demonstrate efficiency. In this research, we applied the proposed algorithm to 1421 texts, and the categories of a number of words in the texts are unknown, which can lead to poor classification of texts. Therefore, the corpus must be much larger to get the best learning.
This research is supported by the German Academic Exchange Service (DAAD).
Othman, E. and Al-Hamadi, A. (2018) Automatic Arabic Document Classification Based on the HRWiTD Algorithm. Journal of Software Engineering and Applications, 11, 167-179. https://doi.org/10.4236/jsea.2018.114011