International Journal of Intelligence Science, 2014, 4, 24-28
Published Online January 2014 (http://www.scirp.org/journal/ijis)
http://dx.doi.org/10.4236/ijis.2014.41004
Mobile SMS Spam Filtering for Nepali Text Using Naïve
Bayesian and Support Vector Machine
Tej Bahadur Shahi1, Abhimanu Yadav2
1,2Central Department of Computer Science and Information Technology, Kathmandu, Nepal
Email: tejshahi1984@yahoo.com
Received October 16, 2013; revised November 16, 2013; accepted November 25, 2013
Copyright © 2014 Tej Bahadur Shahi, Abhimanu Yadav. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited. In accordance with the Creative Commons Attribution License, all Copyrights © 2014 are reserved for SCIRP and the
owner of the intellectual property Tej Bahadur Shahi, Abhimanu Yadav. All Copyright © 2014 are guarded by law and by SCIRP as a
guardian.
ABSTRACT
Spam is a universal problem with which everyone is familiar. A number of approaches are used for Spam filtering.
The most common technique is content-based filtering, which uses the actual text of a message to determine
whether it is Spam or not. Message content is very dynamic, and it is challenging to represent all of its
information in a mathematical classification model. For instance, in content-based Spam filtering, the
characteristics used by the filter to identify Spam messages change constantly over time. The Naïve Bayes method
represents the changing nature of messages using probability theory, whereas the support vector machine (SVM)
represents it using different features. These two methods are efficient in different domains, but they have not
yet been applied to Nepali SMS or text classification, so it is interesting to evaluate the performance of both
methods on this problem. In this paper, Naïve Bayes and SVM-based classification techniques are implemented to
classify Nepali SMS as Spam and non-Spam. An empirical analysis over various test cases has been done to
evaluate the accuracy of the classification methodologies used in this study. The accuracy is found to be 87.15%
for SVM and 92.74% for Naïve Bayes.
KEYWORDS
SMS Spam Filtering; Classification; Support Vector Machine; Naïve Bayes; Preprocessing; Feature Extraction;
Nepali SMS Datasets
1. Introduction
Spam can be defined as unsolicited (unwanted, junk) email for a recipient, or any email that users do not
want to have in their inboxes. Spam filtering is a special problem in the field of document classification
and machine learning. In recent years, mobile devices have increased in computational power, and other
powerful systems have become capable of connecting to mobile phone networks. This has also increased
communication through SMS. Nobody wants unwanted SMS in their cell phone's inbox; users want their inboxes
to be free from such annoying messages. SMS has certain characteristics that differ from those of mail. A
mail consists of structured information such as the subject, mail header, salutation and sender's address,
but SMS lacks such structured information. This makes the SMS classification task much more difficult and
creates the need for an efficient SMS filtering method. The basic principle of Spam filtering is shown in
Figure 1.
2. Related Work
Before 1990, some Spam prevention tools began to emerge in response to Spammers who had started to
automate the process of sending Spam email. The first Spam prevention tools used a simple approach based
on language analysis, scanning emails for suspicious senders or phrases such as “click here to buy” and
“free of charge”. In the late 1990s, blacklisting and whitelisting methods were implemented at the
Internet Service
OPEN ACCESS IJIS
T. B. SHAHI, A. YADAV
Figure 1. The basic idea of Spam filtering.
Provider (ISP) level. However, these methods suffered from some maintenance problems.
There are many efforts underway to stop the increase of Spam that plagues almost every user on the mobile
network. Various techniques have been used to filter Spam messages. The Naïve Bayes [1] classifier is a
simple probabilistic classifier. Its main advantage is that it can be trained very efficiently in a
supervised learning setting. Naïve Bayesian classifiers are used for parameter estimation in numerous
practical applications. In supervised learning, the parameters are estimated by the Maximum Likelihood
Estimation (MLE) method. The Decision Tree [2] is one of the most famous tools of decision-making theory.
A decision tree is a classifier in the form of a tree structure that shows the reasoning process. The
Support Vector Machine [3] is a linear maximal margin binary classifier. It can be interpreted as finding
a hyper-plane in a linearly separable feature space that separates the two classes with maximum margin.
The instances closest to the hyper-plane are known as the “support vectors” because they support the
hyper-plane on both sides of the margin. Using these techniques, different software has been developed to
filter Spam emails. The basic concept of these techniques is the classification of SMS or email using a
trained classifier that can automatically predict whether an incoming SMS or email is Spam or legitimate.
This automatic process increases filtering performance and provides better usability than manual
classification.
Some more complex approaches have also been proposed against the Spam problem. Most of them were
implemented using machine learning methods. The Naïve Bayes algorithm is used frequently and has shown
considerable success in filtering Spam e-mails in English [4]. Knowledge-based and rule-based systems
have also been used by researchers for English Spam filters [5,6]. SVM is used for text classification
[7] and can also be applied to Spam filtering.
No work has yet been done on Spam filtering for Nepali text SMS, and it is necessary to start this
work. Resources such as a training SMS corpus are also not available for the Nepali language, so the
corpus used in this work was created manually. The training corpus developed during this study can be
made available for research purposes.
3. Methodology: A Proposed Framework for
Spam SMS Filtering
The flowchart of the Spam filtering engine is given in Figure 2.
It describes the top-level data flow of the Spam classification problem addressed in this research
work. The proposed system framework contains three steps: preprocessing, feature extraction and
classification.
3.1. Preprocessing
The purpose of preprocessing is to transform SMS messages into a uniform format that can be
understood by the learning algorithm. Text preprocessing is the first step of the text mining
process, in which the collection of documents is analysed syntactically or semantically.
The text message document is treated as a bag of words because the words and their occurrences
are used to represent the document. The algorithms applied in this stage are stemming, stop word
removal, number removal and whitespace stripping.
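The preprocessing steps listed above can be sketched as follows. This is a minimal Python illustration, not the paper's Java implementation: the stop-word set, the suffix list and the names `STOP_WORDS`, `SUFFIXES`, `stem` and `preprocess` are assumptions introduced here, and a real system would need curated Nepali resources.

```python
import re

# Hypothetical, tiny stop-word and suffix lists, for illustration only.
STOP_WORDS = {"र", "को", "मा", "छ"}
SUFFIXES = ("हरू", "लाई", "ले")

# ASCII digits and Devanagari digits (U+0966 to U+096F).
DIGITS = re.compile(r"[0-9\u0966-\u096F]+")


def stem(token):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


def preprocess(message):
    """Return the bag-of-words token list for one SMS."""
    text = DIGITS.sub(" ", message)                      # number removal
    tokens = text.split()                                # split() also drops extra whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [stem(t) for t in tokens]                     # stemming
```

Here stop words are removed before stemming so that the stop-word list can hold surface forms; the opposite order would also be a reasonable choice.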
3.2. TF-IDF Calculation and Feature Vector
Construction
In this work, each message is represented as a vector in a vector space model using TF-IDF,
the most widely adopted feature weighting scheme in Information Retrieval (IR). The weight is
calculated as in Equation (1):
$$a_{ij} = \frac{tf_{ij}\,\log\left(\frac{|D|}{DF_i}\right)}{\sqrt{\sum_{k}\left[tf_{kj}\,\log\left(\frac{|D|}{DF_k}\right)\right]^{2}}} \qquad (1)$$
Figure 2. Framework for Spam filtering.
where tf_ij is the frequency of term i in SMS j, |D| is the number of SMS in the training set
and DF_i is the number of SMS containing the term i. The importance of a term in an SMS is
measured by its frequency and its inverse document frequency.
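A minimal Python sketch of this weighting is given below, assuming raw term counts for the term frequency, the natural logarithm, and the usual length-normalized form of TF-IDF. The function name `tfidf_vectors` and the sparse dictionary representation are choices made here for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # DF_i: number of documents that contain term i at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: raw count of term i in document j
        raw = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # If every term of a document occurs in all documents, the norm is 0.
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors
```

A term that occurs in every training message receives weight zero, since log(|D| / DF_i) = log 1 = 0, which is why very common words contribute nothing to the vector.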
3.3. Classification
Consider the problem of classifying documents or messages (SMS) by their content, for example,
into Spam and Non-Spam messages. A document is drawn from a set of documents (Spam and Non-Spam),
each of which can be modeled as a set of words.
The (independent) probability that the ith word of a given document occurs in a document from
class C can be written as $p(w_i \mid C)$.
Then the probability that a given document D contains all of the words w_i, given a class C, is

$$p(D \mid C) = \prod_{i} p(w_i \mid C) \qquad (2)$$
Now by definition

$$p(D \mid C) = \frac{p(D \cap C)}{p(C)} \qquad (3)$$

and

$$p(C \mid D) = \frac{p(D \cap C)}{p(D)} \qquad (4)$$

Bayes’ theorem manipulates these into a statement of probability in terms of likelihood:

$$p(C \mid D) = \frac{p(C)}{p(D)}\, p(D \mid C) \qquad (5)$$
Assume for the moment that there are only two mutually exclusive classes, S and ¬S (i.e., Spam
and not Spam), such that every element (message) is in either one or the other:

$$p(D \mid S) = \prod_{i} p(w_i \mid S) \qquad (6)$$

and

$$p(D \mid \neg S) = \prod_{i} p(w_i \mid \neg S) \qquad (7)$$
Using the Bayesian result above,

$$p(S \mid D) = \frac{p(S)}{p(D)} \prod_{i} p(w_i \mid S) \qquad (8)$$

and

$$p(\neg S \mid D) = \frac{p(\neg S)}{p(D)} \prod_{i} p(w_i \mid \neg S) \qquad (9)$$
Finally, the document can be classified as follows. It is Spam if

$$p(S \mid D) > p(\neg S \mid D)$$
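The decision rule above can be sketched in Python as follows. This is an illustration rather than the paper's Java implementation. Two choices here are assumptions not stated in the paper: the word probabilities use add-one (Laplace) smoothing so that an unseen word does not produce a zero probability, and the products are evaluated as sums of logarithms to avoid numerical underflow. The class name, the method names and the label strings "spam" and "ham" are hypothetical.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Two-class Naive Bayes over bag-of-words tokens (illustrative sketch)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, tokens, label):
        """Add one labeled message; label must be 'spam' or 'ham'."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def log_score(self, tokens, label):
        """log p(C) + sum_i log p(w_i | C); p(D) is dropped because it is
        the same for both classes and does not affect the comparison."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for w in tokens:
            # Add-one smoothing (an assumption, not taken from the paper).
            p = (self.word_counts[label][w] + 1) / (total_words + len(self.vocab))
            score += math.log(p)
        return score

    def is_spam(self, tokens):
        """Spam if p(S | D) > p(not S | D), compared in log space."""
        return self.log_score(tokens, "spam") > self.log_score(tokens, "ham")
```

The sketch requires at least one training message per class, since otherwise the class prior is zero and its logarithm is undefined.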
In its basic form, shown in Figure 3, an SVM constructs the hyper-plane in input space that
correctly separates the example data into two classes. Hence, SVM is a binary classifier. This
hyper-plane can be used to predict the class of unseen data. Such a hyper-plane always exists for
linearly separable data [8]. Each SMS is converted into a feature vector on a bag-of-words basis,
and the length of the feature vector is equal to the number of words in the Dictionary. The
Dictionary consists of feature words from the training corpus. Some frequent Spam words are also
included in the Dictionary.
Figure 3. Support vector machine.
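As an illustration of the dictionary-based representation, the Python sketch below builds a word-to-index dictionary and writes one message in the sparse text format read by SVMlight, `<target> <index>:<value> ...`, with indices starting at 1 and listed in increasing order. The helper names are hypothetical, raw counts are used as feature values for brevity where the paper uses TF-IDF weights, and the paper does not describe how its own input files were produced.

```python
def build_dictionary(token_lists):
    """Map each distinct word to a 1-based feature index."""
    words = sorted({w for tokens in token_lists for w in tokens})
    return {w: i for i, w in enumerate(words, start=1)}


def to_svmlight_line(tokens, label, dictionary):
    """Encode one SMS as '<target> <index>:<value> ...' with ascending indices."""
    counts = {}
    for w in tokens:
        idx = dictionary.get(w)
        if idx is not None:  # words outside the dictionary are ignored
            counts[idx] = counts.get(idx, 0) + 1
    features = " ".join(f"{i}:{counts[i]}" for i in sorted(counts))
    target = "+1" if label == "spam" else "-1"
    return f"{target} {features}".rstrip()
```

Only non-zero entries are written, so the line stays short even though the conceptual feature vector has one position per dictionary word.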
4. Experimental Setup and Results
The Java programming language is used for the implementation of the proposed framework.
SVMlight [9] is used as the classification tool for SVM, and Naïve Bayes is
implemented in Java.
Naïve Bayes and Support Vector Machine algorithms have been implemented for the Spam filtering
task. The study empirically analysed the performance of both Spam filters (SVM and Naïve Bayes)
on Nepali SMS. The experiment shows that the Spam filter based on Naïve Bayes outperforms the
Spam filter based on SVM. Extensive tests have been performed with varying data set sizes. The
success rates reach their maximum when all the messages and all the words in the training corpus
are used.
Tables 1-3 show the results of the experiments; the learning methods perform well when they are
trained using more examples.
5. Conclusions and Future Work
The main concern for this study was to examine the effi-
ciency of Naïve Bayesian and SVM Spam filters. The
comparison of efficiency between these Spam filters was
done on the basis of the accuracy, precision and recall.
This comparison helps to find the best algorithm for
Spam filtering.
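For reference, the three measures can be computed from the four confusion-matrix counts as in the Python sketch below. The paper does not report confusion matrices, so the counts in the example (7 true positives, 2 false positives, 0 false negatives, 1 true negative) are hypothetical values chosen only because they reproduce the percentages in the first row of Table 1.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all messages classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Fraction of messages flagged as Spam that really are Spam."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual Spam messages that were flagged as Spam."""
    return tp / (tp + fn)


# Hypothetical counts consistent with a 10-message test (8 correct, 2 incorrect).
tp, fp, fn, tn = 7, 2, 0, 1
print(f"accuracy  = {accuracy(tp, fp, fn, tn):.2%}")   # 80.00%
print(f"precision = {precision(tp, fp):.2%}")          # 77.78%
print(f"recall    = {recall(tp, fn):.2%}")             # 100.00%
```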
Table 1. SVM classification results.
Test No.   Test messages (Spam/Non-Spam)   SVM (Correct/Incorrect)   Accuracy   Precision   Recall
1          10                              (8/2)                     80%        77.78%      100%
2          30                              (26/4)                    86.67%     83.33%      100%
3          50                              (42/8)                    84%        78.95%      100%
4          68                              (59/9)                    86.76%     80.85%      100%
5          90                              (81/9)                    90%        86.96%      100%
6          110                             (99/11)                   90%        86.42%      100%
7          150                             (139/11)                  92.67%     89.11%      100%
Table 2. Naïve Bayes classification results.
Test No.   Test messages (Spam/Non-Spam)   Naïve Bayes (Correct/Incorrect)   Accuracy   Precision   Recall
1          10                              (9/1)                             90%        100%        75%
2          30                              (28/2)                            93.33%     100%        83.33%
3          50                              (45/5)                            90%        90%         85.71%
4          68                              (63/5)                            92.65%     93.33%      90.32%
5          90                              (84/6)                            93.33%     93.33%      87.5%
6          110                             (104/6)                           94.54%     95%         90.47%
7          150                             (143/7)                           95.33%     96.67%      92.06%
Table 3. Comparative results of SVM and Naïve Bayes.
Test No.   Test messages (Spam and non-Spam)   SVM accuracy   Naïve Bayes accuracy
1          10                                  80%            90%
2          30                                  86.67%         93.33%
3          50                                  84%            90%
4          68                                  86.76%         92.65%
5          90                                  90%            93.33%
6          110                                 90%            94.54%
7          150                                 92.67%         95.33%
A classification accuracy of 92.74% was obtained for the Naïve Bayes classifier and 87.15% for
the SVM classifier on the Nepali Spam dataset. On the basis of accuracy, Naïve Bayes is a better
classification technique than the SVM-based classifier.
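The paper does not say how these two overall figures were derived. As an observation only, they coincide with the unweighted mean of the seven per-test accuracies in Table 3, which the short Python check below reproduces (about 87.157 for SVM, which matches 87.15% if truncated rather than rounded, and 92.74 for Naïve Bayes). The list and function names are introduced here for illustration.

```python
# Per-test accuracies (percent) copied from Table 3.
svm_accuracy = [80.0, 86.67, 84.0, 86.76, 90.0, 90.0, 92.67]
nb_accuracy = [90.0, 93.33, 90.0, 92.65, 93.33, 94.54, 95.33]


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


print(f"SVM mean accuracy:         {mean(svm_accuracy):.3f}%")
print(f"Naive Bayes mean accuracy: {mean(nb_accuracy):.3f}%")
```

An unweighted mean gives the 10-message test the same influence as the 150-message test; a mean weighted by test size would give somewhat different values.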
No Spam filtering system with one hundred percent accuracy has been invented so far. The
classification accuracy of the Naïve Bayesian and SVM filters proposed in this research work can,
however, be further improved. Here, the TF-IDF scheme was used to build the feature vector; it
considers only the weighted words and not the context in which an individual word appears in the
SMS. Techniques that use context-based features can be used instead.
The features used to convert a given message into a vector can be enriched so that higher
accuracy can be achieved. Due to the small SMS corpus size, there is an unknown word problem in
the Naïve Bayes classifier. Hence, other techniques to handle unknown words can be used. The size
of the SMS corpus can be increased by collecting more real SMS in the future.
Acknowledgements
The authors would like to thank Dr. Shashidhar Ram Joshi, Professor of Computer Science at the
Institute of Engineering, Pulchowk, and Asst. Prof. Dr. Sanjib Pandey for their supervision
during the completion of this work.
REFERENCES
[1] M. Sahami, S. Dumais, D. Heckerman and E. Horvitz, “A
Bayesian Approach to Filtering Junk E-mail,” Learning
for Text Categorization: Papers from the AAAI Workshop,
1998, pp. 55-62.
[2] X. Carreras and L. Marquez, “Boosting Trees for Anti-Spam
Email Filtering,” Proceedings of RANLP, Tzigov Chark,
5-7 September 2001, pp. 58-64.
[3] H. Drucker, D. H. Wu and V. N. Vapnik, “Support Vector
Machines for Spam Categorization,” IEEE Transactions
on Neural Networks, Vol. 10, No. 5, 1999, pp. 1048-1054.
http://dx.doi.org/10.1109/72.788645
[4] I. Androutsopoulos and J. Koutsias, “An Evaluation of
Naive Bayesian Networks,” Machine Learning in the
New Information Age, Barcelona, 2000, pp. 9-17.
[5] W. Cohen, “Learning Rules That Classify E-Mail,” AAAI
Spring Symposium on Machine Learning in Information
Access, Stanford, 25-27 March 1996, pp. 18-25.
[6] C. Apte, F. Damerau and S. M. Weiss, “Automated
Learning of Decision Rules for Text Categorization,”
ACM Transactions on Information Systems, Vol. 12, No.
3, 1994, pp. 233-251.
http://dx.doi.org/10.1145/183422.183423
[7] T. A. Almeida, J. M. G. Hidalgo and A. Yamakami,
“Contributions to the Study of SMS Spam Filtering: New
Collection and Results,” ACM Transaction on Information
System, Mountain View, 19-22 September 2011, pp.
20-25.
[8] P. Graham, “A Plan for Spam,” 2002.
http://www.paulgraham.com/Spam.html
[9] V. T. Joachims, “Making Large-Scale SVM Learning
Practical,” In: B. Schölkopf, C. Burges and A. Smola,
Eds., Advances in Kernel Methods Support Vector Learning,
MIT Press, Cambridge, 1999.