Intelligent Information Management, 2013, 5, 171-174
Published Online November 2013 (http://www.scirp.org/journal/iim)
http://dx.doi.org/10.4236/iim.2013.56018
Open Access IIM
Semantic Sentence Similarity Using Finite State Machine
Chiranjibi Sitaula1, Yadav Raj Ojha2
1Central Department of Computer Science and Information Technology,
Tribhuvan University, Kathmandu, Nepal
2Nepal KC Consultancy, Software Company, Kathmandu, Nepal
Email: candsbro@gmail.com
Received September 27, 2013; revised October 25, 2013; accepted November 5, 2013
Copyright © 2013 Chiranjibi Sitaula, Yadav Raj Ojha. This is an open access article distributed under the Creative Commons Attri-
bution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
ABSTRACT
In this paper, a finite state machine appro ach is followed in order to find the semantic similarity of two sentences. The
approach exploits the concept of bi-directional logic along with a semantic ordering approach. The core part of this ap-
proach is bi-directional logic of artificial intelligence. The bi-directional logic is implemented using Finite State Ma-
chine algorithm with slight modification . For finding the semantic similarity, keyword has played climactic i mportance.
With the help of the keyword approach, it can be found easily at the sentence level according to this algorithm. The al-
gorithm is proposed especially for Nepali texts. With the polarity of the individual keywords, the finite state machine is
made and its final state determines its polarity. If two sentences are negatively polarized, they are said to be coherent,
otherwise not. Similarly, if two sentences are of a positive nature, they are said to be coheren ce. For measuring the co-
herence (similarity), contextual concept is taken into consideration. The semantic approach , in this research, is a totally
contextual based method. Two sentences are said to be semantically similar if they bear the same context. The total ac-
curacy obtained in this algorithm is 90.16%.
Keywords: Artificial Intelligence; Natural Language Processing; Text Mining; Semantic Similarity; Finite State
Machine
1. Introduction
Semantic similarity is a relationship between common
and different features (meaning) of two compared words.
It is a fundamental and widely used concept in recent
applications of Natural Language Processing. Sentence
semantic similarity is main conc ern to similarity between
concepts according to their presumed natural relation-
ships.
In order to compute semantic similarity on the sen-
tence level, we ideally should compare some kind of
meaning representation of the sentences. Finding such
meaning in Nepali texts is more difficult than English
text because the structure matters a lot. The structure
used in Nepali sentences is not same as English texts. So,
a novel approach is pro posed in order to tackle the prob-
lem of such texts.
Lots of algorithms and strategies have been developed
in natural language processing in order to find the se-
mantic similarity of English texts. But few works have
been done in Nepali. And morphologically, processing
Nepali text is also of great challenge because it does not
match so easily with English sentences. Using the con-
cept of finite state machine, the semantically similar sen-
tences can be retrieved in Nepali texts. The position of
subject, verb and object is different from English texts.
The concepts of theory of computation and artificial in-
telligence are major here in this model.
Finite State Automata is an abstract machine which
can be in one of the finite number of state. Machine can
be in only one state at a time. It can be changed from one
state to another by triggering condition for each transi-
tion. State machine has been used to describe linguistics
—to describe the grammars of natural languages and to
learn the corpus patterns.
First Order Logic or Prepositional Logic (also called
sentential logic) is a formal language which is used for
expressing statements. Prepositional logic is used to de-
termine when a statement is true in a structure. Here,
proposition is a statement which is either true or false.
Prepositions are connected to give truth or falsity of the
compound propositions. Ex: Bi-conditional (or equiva-
lence
).
C. SITAULA, Y. R. OJHA
Open Access IIM
172
Bi-conditional logic is true whenever both of its com-
ponents have the same truth value, either both true or
both false.
Finite State Machine to recognize the bi-conditional
logic is as shown in Figure 1.
But we can modify it to recognize the single transition
state (single word) or double transition states (two words)
as shown in the Figure 2.
Nepali language has the word order and language writ-
ing scripts are different from English language. Nepali
language is Subject Object Verb (SOV) language.
2. Literature Review
[1] proposed sentence similarity find method based on
Semantic Nets and Corpus Statistics. This method found
best for the shortest length sentences. The semantic si-
milarity of two sentences is calculated based on cosine
similarity, word order similarity and overall sentence si-
milarity by using information from a structured lexical
database and from corpus statistics.
Sentence semantic similarity calculating method based
on segmented semantic comparison was proposed in [2].
Best results could be achieved and the calculating proc-
ess would more fit to the semantic logic in the shortest
sentence.
Figure 1. FSA to recognize the bi-conditional logic.
Figure 2. Finite State Machine representation of bi-condi-
tional logic.
[3] described a metric method for computing sentence
level semantic textual similarity, which is based on a
probabilistic finite state machine model that computes
weighted edit distance.
Estimating of the con text similarity based on closeness
of the semantic load of two comparing sentences is re-
searched by [4].
A FSA is built through a systematic analysis of the
patterns of meaning and use for each verb [5]. Also, the
algorithm for learning regular grammars from examples
to recognize the corpus patterns is given.
Similarly, [6] proposed method based on different as-
pects like Objects-Specified similarity, Objects-Property
similarity, Objects-Behavior similarity and Overall sen-
tence similarity. Sentences would be divided into the
trunk and the other segments were set different weights,
and the grammatical and semantic structure of the sen-
tences would be analyzed.
The work done by [7] was to produce a measure of
semantic similarity, which is a good predictor of “relat-
edness” between sentences, with the ultimate goal of as-
sessing the coherence of an essay.
Two approaches are proposed by [8], the first ap-
proach proposed used lexical similarity & the second
used semantic similarity by means of term expansion
with synonyms. [9] proposed semantic similarity based
on WORDNET dictionary with some sequential process
like tokenization, POS tagging, word distance calculation
etc.
More importantly, [10] did research on structure of
Nepali Grammar. This paper has mainly explained parts
of speech of Nepali Language, special characteristics in
Nepali Language & overview of the sentential structure
of the Nepali Language.
Almost all above methods are based on lexical data-
base WORDNET. Nepali Computational Grammar (NCG)
structural detail is done by [11], which essentially in-
volves the development of the intermediate modules like
the Parts-of-Speech (POS) Tagger, Chunker and the
Parser. [12] built Nepali WORDNET, the rich lexical
resource for the Nepali Language for effective machine
translation. They did this task inspiration from English
WORDNET and Hindi WORDNET. This dictionary can
be helpful to get word distance of any two words in the
Nepali Language. [13] employed the concept of prob-
abilistic concept fo r finding the semantic similarity of the
sentence.
3. Proposed Model
Almost all existing methods for computing sentence si-
milarity have been adopted from approaches used for
long text documents. These methods process sentences in
a very high-dimensional space and are inefficient for
short text application domain. In order to get accuracy,
C. SITAULA, Y. R. OJHA
Open Access IIM
173
computing the similarity between very short texts of sen-
tence length has been proposed. The proposed model
considers the first order logic [14] with the finite state
machine approach. The research work also is influenced
by concept [15].
The semantic concept of contextual [16] is taken into
consideration for proposing the new algorithm.
This paper focuses directly on computing the Sentence
Semantic Similarity in Nepali Language for short texts.
The sample similar sentences are shown in Tables 1 and 2.
The similar sentences are those which are contextually
similar in this research. For the whole research work,
particular context is taken into consideration.
The proposed model is shown in Figure 3. It shows
the different steps employed for the research activities.
According to Figure 3, firstly the document is sepa-
rated into individual sentence. The individual keyword is
taken from each sentence with its semantic orientation.
After calculating the semantic orientation of each token,
with the help of dictionary, Finite State Machine is made.
With the help of finite state machine, the polarity of each
sentence is determined. The sentences having same po-
larity is taken as the same orientation depending on par-
ticular context.
4. Evaluation and Output
For the evaluation purpose, around 1200 sentences are
taken for the experiment. The datasets used in this ex-
periment are of different type. It was implemented under
Visual Studio 2008. Programming language was C#. For
measuring the performance, recall is used.
The comparison of the hybrid algorithms with tradi-
tional rule based algorithm is made. The output is listed
in the below Table 3.
The result obtained in Table 3 showed that the recall
of the given algorithm. This algorithm gave different re-
call values depending of the nature of datasets used.
When the datasets of simple nature is used, it gave the
recall value of highest i.e. 95. Similarly, if the dataset of
medium type is used , it gave the recall value of 90.5 and
finally with the complex and ambiguous type of senten-
ces, the recall value dropped to 85.
Table 1. First similar sentences.
First Similar Sentences


 

 

Table 2. Second similar sentences.
Second Similar Sentences
   
   
Breakdown document into
sentences
Load dicti onar y(co r pus) as
hashmap.
Key: hash key (+ve/-ve form)
Value: Array of similar w ords
Check f inal state to verify the
S imilarity of two sentence s
Ch eck the w ord order of
sentences
Replace every words hash
key(+ve/-ve form) in to
True/False form
Assign hash key(+ve/-ve key) to
every wo rds of sent ences .
(Same key for similar w ords)
Im pl ement FSA w hi ch recogniz e
the Bi-conditional Logic
Perf orm State Transi t ion on each
sentence
Figure 3. Operational flow of the proposed methodology.
Table 3. Output of proposed algorithm.
Group# of Sentences Recall Average Recall
1 400 90.5
2 400 85
3 400 95
90.16
Similarly, the graph is plotted with x-coordin ate as the
# of sentences with y-coordinate as the recall obtained
from the experiment. From the Figure 4, it can be seen
that the recall value gets increased for simplicity of sen-
tences and has degraded as the co mplexity of the data set
increases. The ambiguity of the sentences degraded the
performances.
C. SITAULA, Y. R. OJHA
Open Access IIM
174
Figure 4. Graph of # of group and recall.
5. Conclusion and Limitation
The implemented algorithm gave recall of 90.15%. The
algorithm is the exploitation of the finite state machine
with bi-directional logic. The bi-directional logic is
modified here in order to perform the semantic similarity
of the sentences. The semantic ordering works have to be
performed carefully otherwise the performance may be
degraded. The sentences used in this experiment are con-
text based. It can be compared with uni-directional logic
of artificial intelligence for further works. Although the
algorithm is designed for Nepali texts, it can also be used
in other languages.
6. Acknowledgements
I would like to thank Prof. Dr. Sashidhar Ram Joshi from
IOE, Tribhuvan University, Nepal for providing me im-
plausible support.
REFERENCES
[1] Y. H. Li, et al., “Sentence Similarity Based on Semantic
Nets and Corpus Statistics,” IEEE Transactions on Knowl-
edge and Data Engineering, Vol. 18, No. 8, 2006, pp.
1138-1150. http://dx.doi.org/10.1109/TKDE.2006.130
[2] Y. T. Liu an d Y. J. Lian g, “A Sente nce Sema ntic Simi lar-
ity Calculating Method Based on Segmented Semantic-
Comparision,” Journal of Theoretical and Applied In-
formation Technology, Vol. 48, No. 1, 2013, pp. 231-235.
[3] M. Q. Wang and D. Cer, “Stanford: Probabilistic Edit Dis-
tance Metrics for STS,” Unpublished.
[4] M. Mehdi and S. M. Fakhrahmad, “Effective Estimation
of Context Similarity: A Proposed Matching Model Bas-
ed on Weighted Semantic Load,” International Journal of
Artificial Intelligence & Applications, Vol. 3, No. 3 , 2012 , pp
1-10.
[5] O. Popescu, “Learning Corpus Patterns Using Finite State
Automata,” FBK-irst, Trento, 2013.
[6] L. Li, et al., “Measuring Sentence Similarity from Dif-
ferent Aspects,” Proceeding of the 8th International Con-
ference on Machine Learning and Cybernetics, Baoding,
12-15 July 2009, pp. 2244-2248.
[7] D. Higgins and J. Burstein, “Sentence Similarity Meas-
ures for Essay Coherence,” Proceedings of the 7th Inter-
national Workshop on Computational Semantics (IWCS),
Tilburg, 2007, pp. 1-12.
[8] D. Vilarino, et al., “BUAP: Lexical and Semantic Simi-
larity for Cross-Lingual Textual Entailment,” Proceed-
ings of the 1st Joint Conference on Lexical and Computa-
tional Semantics, Montreal, 7-8 June 2012, pp. 706-7 09.
[9] T. N. Dao and T. Simpson, “Measuring Similarity between
Sentences,” Unpublished.
[10] B. K. Bal, “Structure of Nepali Grammar,” PAN Local-
ization, Madan Puraskar Pustakalaya, Kathmandu, Nepal,
2004, pp. 332-396.
[11] P. Rupakheti, L. P. Khatiwada and B. K. Bal, “Report on
Nepali Computational Grammar,” Unpublished, pp. 1-25.
[12] A. Chakrabarty, B. Purkayastha and A. Roy, “Experi-
ences in Building the Nepali Wordnet-Insights and Chal-
lenges,” The 5th Global Wordnet Conference at CFILT,
IIT Bombay, Mumbai.
[13] I. Beltagy, et al., “Montague Meets Markov: Deep Se-
mantics with Probabilistic Logical Form,” 2nd Joint
Conference on Lexical and Computational Semantics:
Proceeding of the Main Conference and the Shared Task,
Atlanta, 13-14 June 2013, pp. 11-21.
[14] S. Ferilli, et al., “Plugging Taxonomic Similarity in First-
Order Logic Horn Clauses Comparison,” AI*IA 2009:
Emergent Perspectives in Artificial Intelligence, Lecture
Notes in Computer Science, Vol. 5883, 2009, pp. 131-140.
http://dx.doi.org/10.1007/978-3-642-10291-2_14
[15] D. K. Lin, “An Information-Theoretic Definition of Simi-
larity,” ICML, Vol. 98, 1998, pp. 296-304.
[16] C. Sitaula, “Se mantic Text Clustering Using Enhan ced Vec-
tor Space Model using Nepali Language,” GESJ, Vol. 36,
No. 4, 2012, pp. 41-46.