J. Software Engineering & Applications, 2009, 2:55-59
Published Online April 2009 in SciRes (www.SciRP.org/journal/jsea)
A Semantic Vector Retrieval Model for Desktop
Documents
Sheng Li
School of Information, Zhongnan University of Economics and Law, Wuhan, China
Email: kinglisheng@163.com
Received December 3rd, 2008; revised January 29th, 2009; accepted February 18th, 2009.
ABSTRACT
This paper presents a semantic vector retrieval model for desktop documents based on ontology. Compared with the
traditional vector space model, the semantic model uses semantic and ontology technology to solve several problems
that the traditional model could not overcome, such as the shortcomings of statistics-based weight computing, the
expression of semantic relations between different keywords, the description of document semantic vectors, and
similarity calculation. Experimental results show that the retrieval ability of the new model improves significantly
on both recall and precision.
Keywords: Semantic Desktop, Information Retrieval, Ontology, Vector Retrieval Model
1. Introduction
As an important branch of semantic Web [1] technology, the semantic desktop points to the future direction of desktop management technology [2]. Implementing semantic desktop retrieval requires an information retrieval model, which is itself an important research topic in information retrieval. At present, researchers have proposed a variety of information retrieval models from different angles, such as the probabilistic retrieval model, the fuzzy retrieval model, and the vector space model (VSM) [3]. Among them, the vector space model is the most effective at expressing the structure of documents.
The main advantage of the traditional vector space model is its simplicity: it describes unstructured documents in the form of vectors, making it possible to process them with various mathematical methods. We therefore use ontology-based semantic information management methods to improve the traditional vector space model, creating a semantic vector space model.
2. Traditional Vector Space Model
In the vector space model, a characteristic item $t_k$ (also known as an index item) is a basic language unit appearing in document $d_i$ that represents some character of the document. The weight of a characteristic item is $\omega_{ik}$, which reflects how well $t_k$ describes document $d_i$. The characteristic item frequency $tf_{ik}$ and the inverse document frequency $idf_k$ are used to calculate $\omega_{ik}$:

$$\omega_{ik} = tf_{ik} \times idf_k = tf_{ik} \times \log_2(N/n_k + 1)$$

where $tf_{ik}$ is the frequency of characteristic item $t_k$ in document $d_i$, $N$ is the number of documents, and $n_k$ is the number of documents containing $t_k$. From this formula we can see that $\omega_{ik}$ increases with $tf_{ik}$ and decreases with $n_k$.
The distance between two document vectors is represented by their similarity. The similarity between documents $d_i$ and $d_j$ is defined as the cosine of the angle between the two vectors:

$$Sim(d_i, d_j) = \cos\theta = \frac{\sum_{k=1}^{m}\omega_{ik}\,\omega_{jk}}{\sqrt{\left(\sum_{k=1}^{m}\omega_{ik}^2\right)\left(\sum_{k=1}^{m}\omega_{jk}^2\right)}} \qquad (1)$$
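To make these computations concrete, the following Python sketch (our own illustration, not code from the paper) implements the weight formula and Formula (1); all function and variable names are ours:

import math

def tfidf_weight(tf_ik: float, N: int, n_k: int) -> float:
    """Weight of characteristic item t_k in document d_i:
    w_ik = tf_ik * log2(N / n_k + 1)."""
    return tf_ik * math.log2(N / n_k + 1)

def cosine_sim(di: list[float], dj: list[float]) -> float:
    """Formula (1): cosine of the angle between two document vectors."""
    dot = sum(a * b for a, b in zip(di, dj))
    norm = math.sqrt(sum(a * a for a in di)) * math.sqrt(sum(b * b for b in dj))
    return dot / norm if norm else 0.0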
During query matching, the Boolean model can be used to convert the query condition $QS$ into a vector:

$$\omega_{qj} = \begin{cases} 1, & \text{if } t_j \in QS, \\ 0, & \text{else.} \end{cases}$$
The information retrieval algorithm based on the knowledge above is as follows:
1) Create the characteristic item database: input the characteristic items of the document set and build the characteristic item database;
2) Create the document information base: input the content of the documents and build the document information database;
3) Create the document vector database: for each record in the document information base, compute its characteristic item weights with the formula above and build the corresponding document vector;
4) Document query: the user inputs a query condition; the eligible document vectors are selected with the Boolean model, and the similarity between the query condition and each document is computed with Formula (1);
5) Output the ranked result: output the query results ordered by the similarities computed in step 4).
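A minimal end-to-end sketch of steps 1)-5) in Python, reusing tfidf_weight and cosine_sim from the sketch above; the document representation (pre-tokenized lists of characteristic items) is our assumption:

from collections import Counter

def build_vectors(docs: list[list[str]]):
    vocab = sorted({t for d in docs for t in d})           # step 1: item database
    N = len(docs)
    n_k = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []                                           # steps 2-3: document vectors
    for d in docs:
        tf = Counter(d)
        vectors.append([tfidf_weight(tf[t], N, n_k[t]) for t in vocab])
    return vocab, vectors

def search(qs: set[str], vocab, vectors, top_n: int = 10):
    q = [1.0 if t in qs else 0.0 for t in vocab]           # Boolean query vector
    scored = [(i, cosine_sim(q, v)) for i, v in enumerate(vectors)
              if any(qw and vw for qw, vw in zip(q, v))]   # step 4: eligible vectors
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_n]   # step 5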
3. The Features of the New Model
Though the semantic vector space model draws on ideas from the traditional vector space model, it makes several useful improvements based on the specific features of semantic information expression. The main features of the semantic vector space model include:
1) The elements and dimension of the semantic vector space differ from the traditional one. In the semantic vector space model, the document characteristic item sequence is represented not by keywords, as usual, but by concepts extracted from the documents. These concepts carry rich meaning in the ontology. At the same time, each concept in the concept space has a corresponding descriptive list, which represents a vector in the property space. Each semantic vector in this model is therefore a two-level (2D) vector, which gives the semantic model better descriptive capacity than the traditional one (see the sketch after this list).
2) The method for determining each item's weight differs between the semantic vector space model and the traditional one. In the semantic model, the weight of an item is related not only to the frequency of a keyword but also to the description of the corresponding concept in the document. In addition, the TF-IDF function in the traditional model cannot accurately reflect the distribution of items across the document set. In the semantic vector space model, items in different positions of a document are given different weights; for example, items appearing in the title of a document are weighted more heavily than those appearing in the abstract.
3) The two models use different algorithms to compute similarity. The semantic vector space model fully takes into account the comparability and relatedness between two concepts. For example, in the traditional vector space model the words "People", "Person", and "Human" are entirely different items, but they can be unified as one concept according to the corresponding ontology structures and relationships.
4) Besides the differences introduced above, the most important feature of the semantic model is its use of ontology as a carrier of information. Compared with traditional text retrieval methods, the new model draws on the semantic information in the ontology.
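As announced in feature 1, one possible shape for the two-level semantic vector is sketched below in Python; the class and field names, and the sample values, are our own illustration, not the paper's implementation:

from dataclasses import dataclass, field

@dataclass
class ConceptEntity:
    """One characteristic item: a concept plus its property vector."""
    concept: str                                  # concept label/URI from the ontology
    properties: dict[str, str] = field(default_factory=dict)  # property name -> value

# A document's semantic vector V = (A1[P1], ..., Am[Pm]), cf. Section 5.3:
doc_vector = [
    ConceptEntity("Information Search and Retrieval", {"topic": "Indexing"}),
    ConceptEntity("Author", {"name": "T. Berners-Lee"}),
]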
4. Ontology Creation
As noted in the previous section, an important characteristic of the semantic vector space model is its use of ontology as an information carrier.
An ontology can be seen as a specification of a conceptualization; it defines a group of concepts. Commonly, ontologies are divided into general ontologies, such as WordNet [4], and domain ontologies that describe concepts in a particular domain. In this paper we focus on ontologies in the computer science domain.
4.1 The Relationships in the Ontology
In an ontology, concepts are linked to other concepts through relationships. In the hierarchical structure graph of an ontology, each edge represents a relationship. The three most common relationships are "Is-A", "Part-Of", and "Entity Relationship" [5]:
1) Is-A relationship: describes generalization between concepts. For example, "Entity Extraction" Is-A "Information Extraction";
2) Part-Of relationship: describes containment between concepts. For example, "CPU" is a Part-Of "Computer";
3) Entity relationship: describes membership between a concept and an individual object. For example, "T. Berners-Lee" is an entity of the concept "author".
4.2 The Structure in the Ontology
According to the basic principles of ontology and the ACM Topic Hierarchy [6], we create an ontology, called "CmpOnto", to describe the terms of computer science. We then create the ontology "SwetoDblp_2" by extending SwetoDblp [7] with research fields and keywords. A segment of CmpOnto is as follows:
<owl:Class rdf:about="http://www.acm.org/class/1998/acm#H.3">
  <rdfs:label>INFORMATION STORAGE AND RETRIEVAL</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://www.acm.org/class/1998/acm#H"/>
</owl:Class>
...
<owl:Class rdf:about="http://www.acm.org/class/1998/acm#H.3.3">
  <rdfs:label>Information Search and Retrieval</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://www.acm.org/class/1998/acm#H.3"/>
  <owl:disjointWith>
    <owl:Class rdf:about="http://www.acm.org/class/1998/acm#H.3.1"/>
  </owl:disjointWith>
  <owl:disjointWith>
    <owl:Class rdf:about="http://www.acm.org/class/1998/acm#H.3.2"/>
  </owl:disjointWith>
  ...
</owl:Class>
A segment of the ontology SwetoDblp_2 is as follows:
<owl:Class rdf:about="http://lsdis.cs.uga.edu/projects/semdis/opus#Article">
  <rdfs:label>Article</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://lsdis.cs.uga.edu/projects/semdis/opus#Publication"/>
  <rdfs:comment>An article from a journal or magazine.</rdfs:comment>
  <owl:equivalentClass rdf:resource="http://knowledgeweb.semanticweb.org/semanticportal/OWL/Documentation_Ontology.owl#Article_in_Journal"/>
  <owl:equivalentClass rdf:resource="http://sw-portal.deri.org/ontologies/swportal#Article"/>
  <owl:equivalentClass rdf:resource="http://purl.org/net/nknouf/ns/bibtex#Article"/>
</owl:Class>
...
<owl:ObjectProperty rdf:about="http://lsdis.cs.uga.edu/projects/semdis/opus#at_university">
  <rdfs:comment>Indicates that a publication originates or is related to a specific University.</rdfs:comment>
  <rdfs:label>at university</rdfs:label>
  <rdfs:range rdf:resource="http://lsdis.cs.uga.edu/projects/semdis/opus#University"/>
  <rdfs:domain rdf:resource="http://lsdis.cs.uga.edu/projects/semdis/opus#Publication"/>
</owl:ObjectProperty>
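As a hedged sketch of how such fragments can be consumed programmatically (not part of the paper's system), the Python snippet below loads a local copy of CmpOnto with the rdflib library and lists the direct subclasses of ACM category H.3; the file name is hypothetical:

from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

g = Graph()
g.parse("CmpOnto.owl", format="xml")         # assumed local copy of the ontology

h3 = URIRef("http://www.acm.org/class/1998/acm#H.3")
for cls in g.subjects(RDFS.subClassOf, h3):  # e.g. acm#H.3.3 from the segment above
    print(cls, g.value(cls, RDFS.label))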
5. Computing the Semantic Similarity
During information retrieval based on semantic similarity, the concepts and the properties in a vector are processed separately. Considering the relatedness between different conceptual entities and the comparability of their properties, we introduce methods for measuring concept similarity and property similarity, and then give the semantic similarity algorithm.
5.1 The Concept Similarity
An ontology uses a hierarchical tree structure to describe the logical relationships between concepts, which is the semantic basis for our retrieval algorithm. Since there is a certain relatedness between different concepts, we use concept similarity to describe and measure it in order to improve retrieval precision. Before computing concept similarity, we give three definitions for the kinds of relationships between concepts:
Definition 1 (homology concepts): In the hierarchical tree structure of the ontology, concepts A and B are homology concepts if the node of concept A is an ancestor of the node of concept B. A is called the nearest root concept of B, written R(A, B). The distance between A and B is $d(A,B) = dep(B) - dep(A)$, where $dep(C)$ is the depth of node C in the hierarchical tree.
Definition 2 (non-homology concepts): In the hierarchical tree structure of the ontology, concepts A and B are non-homology concepts if A is neither an ancestor nor a descendant of B. If R is the nearest common ancestor of A and B, R is called the nearest root concept of A and B, written R(A, B). The distance between A and B is $d(A,B) = d(A,R) + d(B,R)$.
Definition 3 (semantic related concepts): Concept C is a semantic related concept of A and B if and only if C satisfies the following conditions: if A and B are homology concepts, C exists in the subtree rooted at A but not in the subtree rooted at B; if A and B are non-homology concepts, C exists in the subtree rooted at R but not in the subtrees rooted at A or B.
Figure 1 shows the relationships described above in detail. According to these definitions, the structure similarity between concept A and concept B is:

$$Sim'(A,B)=\begin{cases}\left(\alpha \times dep(R(A,B)) + \beta \times \dfrac{son(B)}{son(A)}\right)\times\dfrac{1}{d(A,B)+1}, & \text{if } d(A,B)\neq 0 \text{ and } A, B \text{ are homology concepts;}\\ \left(\alpha \times dep(R(A,B)) + \beta \times \dfrac{son(A)+son(B)}{son(R)}\right)\times\dfrac{1}{d(A,B)+1}, & \text{if } d(A,B)\neq 0 \text{ and } A, B \text{ are non-homology concepts;}\\ 1, & \text{if } d(A,B)=0,\end{cases} \qquad (2)$$

where $son(C)$ denotes the total number of nodes in the subtree rooted at concept C. The parameters $\alpha$ and $\beta$ are used to adjust the weights of $dep(R(A,B))$ and the distance term $d(A,B)$; their range is (0, 1), and they are set by field experts.
According to the formula above, concept similarity decreases as the distance between concepts grows. At the same time, the deeper the nearest root of two concepts, the more common properties they should share, and the more similar they should be. Furthermore, the number of nodes in the subtrees and the semantic related concepts are also important factors in the similarity computation.
Figure 1. Three patterns of concepts (panels: homology concepts, non-homology concepts, and semantic related concepts; the tree diagrams mark dep(A), dep(B), d(A,B), the nearest root R, and a related concept C)
Finally, the formula defines the similarity between identical concepts as 1, their distance being 0.
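A Python sketch of Formula (2) follows, under our own encoding assumptions: the tree is stored as child-to-parent links (the root maps to None), son[c] is the precomputed subtree size of c, dep[c] is the depth of c, and alpha, beta take the values used in Section 6:

def _is_ancestor(a, b, parent):
    """True if a is a strict ancestor of b in the concept tree."""
    x = parent.get(b)
    while x is not None:
        if x == a:
            return True
        x = parent.get(x)
    return False

def concept_sim(a, b, parent, son, dep, alpha=0.5, beta=0.5):
    if a == b:
        return 1.0                                   # d(A,B) = 0
    if _is_ancestor(a, b, parent) or _is_ancestor(b, a, parent):
        hi, lo = (a, b) if _is_ancestor(a, b, parent) else (b, a)
        d = dep[lo] - dep[hi]                        # Definition 1; R(A,B) = hi
        return (alpha * dep[hi] + beta * son[lo] / son[hi]) / (d + 1)
    ancestors_a = set()                              # Definition 2: find nearest root R
    x = a
    while x is not None:
        ancestors_a.add(x)
        x = parent.get(x)
    r = parent[b]
    while r not in ancestors_a:
        r = parent[r]
    d = (dep[a] - dep[r]) + (dep[b] - dep[r])
    return (alpha * dep[r] + beta * (son[a] + son[b]) / son[r]) / (d + 1)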
5.2 The Property Similarity
Each concept in the ontology may have several different entities; the main difference among these entities lies in their property values. Furthermore, different concepts may share the same properties. Therefore, not only concept similarity but also property similarity should be considered when computing the similarity between two entities. For measuring property similarity, we give the following definition:
Definition 4: Suppose I is an entity of concept C and the value of its property $P_i$ is $p_i$, $i = 1, 2, \ldots, n$. We write this entity as $I = C[P]$, where $P$ is the property vector $(p_1, p_2, \ldots, p_n)$.
Only the common properties need to be processed when computing the similarity between property vectors $P = (p_1, p_2, \ldots, p_m)$ and $Q = (q_1, q_2, \ldots, q_n)$. First, transform P and Q into common property vectors $P' = (p'_1, p'_2, \ldots, p'_r)$ and $Q' = (q'_1, q'_2, \ldots, q'_r)$. Then, according to the properties defined in the ontology and the similarity of property values, the property similarity of P and Q is:
$$Sim_P(P,Q) = Sim(P',Q') = \sum_{i=1}^{r}\frac{\mu_i+\gamma_i}{2}\,Sim(p'_i, q'_i) \qquad (3)$$

where $\mu_i$ and $\gamma_i$ are the weights of properties $p'_i$ and $q'_i$ in their respective property vectors, preset in the ontology, and $Sim(p'_i, q'_i)$ is the similarity of the property values, also preset by field experts in the ontology. For example, the similarity between the property values "Data Mining" and "Information Retrieval" is 0.7, and that between "Data Mining" and "Network" is 0.1. The range of $Sim_P(P,Q)$ is [0, 1].
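A minimal sketch of Formula (3) in Python; the weight tables and the property-value similarity function are assumed to be preset in the ontology, as the text describes, and all names are illustrative:

def property_sim(P, Q, mu, gamma, val_sim):
    """Formula (3). P, Q: dicts property -> value; mu, gamma: property -> weight;
    val_sim: preset similarity of property values, e.g.
    val_sim("Data Mining", "Information Retrieval") -> 0.7."""
    common = sorted(P.keys() & Q.keys())   # only the common properties are compared
    return sum((mu[p] + gamma[p]) / 2 * val_sim(P[p], Q[p]) for p in common)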
5.3 The Semantic Similarity
After computing the concept similarity of the semantic vectors and the property similarity of the conceptual entities, we can obtain the final semantic similarity of two semantic vectors. Suppose $V_1 = (A_1[P_1], \ldots, A_m[P_m])$ and $V_2 = (B_1[Q_1], \ldots, B_n[Q_n])$ are two semantic vectors. The semantic similarity between $V_1$ and $V_2$ is:

$$Sim_V(V_1,V_2) = \frac{1}{m}\sum_{i=1}^{m}\max_{j}\Bigl(\omega \cdot Sim_C(A_i,B_j) + (1-\omega)\cdot Sim_P(P_i,Q_j)\Bigr) \qquad (4)$$

where $\omega$ is the weight of the concept similarity, with range [0, 1].
Now, the main retrieval algorithm is as follows:
Begin
1) Initialize the document set, load the user query vector $V_1$, and decide its document cluster;
2) Load the semantic index file of the documents and initialize the semantic vector $V_2$;
3) For each vector in the document cluster that includes $V_1$: if the current vector has not yet been processed, proceed with the following steps; otherwise, move to the next vector;
4) Compute all concept similarities between the concepts in $V_1$ and $V_2$;
5) Compute all property similarities between the concepts in $V_1$ and $V_2$;
6) Compute the semantic similarity between $V_1$ and $V_2$ with Formula (4), and insert $V_2$ into list S in descending order;
7) Output the top n items of list S as the retrieval results;
End.
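A sketch of Formula (4) and the ranking loop of steps 1)-7) in Python; V1 and V2 are lists of ConceptEntity (Section 3), and sim_c and sim_p are assumed to be wrappers around the concept_sim and property_sim sketches above. The data layout is our assumption, not the paper's code:

def semantic_sim(V1, V2, sim_c, sim_p, omega=0.8):
    """Formula (4): average over entities A_i[P_i] of V1 of the best match in V2."""
    total = 0.0
    for a in V1:
        total += max(omega * sim_c(a.concept, b.concept)
                     + (1 - omega) * sim_p(a.properties, b.properties)
                     for b in V2)
    return total / len(V1)

def retrieve(query_vec, doc_vectors, sim_c, sim_p, top_n=10):
    """doc_vectors: dict doc_id -> semantic vector; returns the top n (steps 6-7)."""
    scored = [(doc, semantic_sim(query_vec, vec, sim_c, sim_p))
              for doc, vec in doc_vectors.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_n]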
6. Experiment and Analysis
In order to verify the effectiveness of our method, we designed a prototype system and chose 100 abstracts downloaded from DBLP as the retrieval target documents. The prototype system uses the ontologies CmpOnto and SwetoDblp_2 introduced in Section 4.
In the experiment, the depth of the ontology concept tree is 5, the range of $dep(R(A,B))$ in Formula (2) is [1, 5], and the value of $d(A,B)$ is an integer from 1 to 10. The weights $\alpha$ and $\beta$ are both 0.5. The parameters $\mu_i$ and $\gamma_i$ are preset in the ontology and can be obtained by statistical methods. The value of $\omega$ in Formula (4) influences the ranking of the retrieval results; in order to choose a proper $\omega$, we ran a preliminary experiment and selected 0.8 as the optimal value.
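For reference, these settings can be collected in one place as follows (the dictionary is our own packaging, not code from the paper):

params = {
    "tree_depth": 5,    # depth of the ontology concept tree
    "alpha": 0.5,       # weight of dep(R(A,B)) in Formula (2)
    "beta": 0.5,        # weight of the subtree term in Formula (2)
    "omega": 0.8,       # weight of concept similarity in Formula (4)
}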
The first step of the experiment is document pretreatment. Each document is described by a semantic eigenvector $V_2$ containing 1 to 4 conceptual entities. We find that the average retrieval precision increases from about 60% to 80% as the number of concepts in $V_2$ grows. The corresponding results are shown in Table 1.
Table 1. The influence of concepts in V2 on the precision (rows: concepts in V1; columns: concepts in V2)

           1C       2C       3C       4C
1C         0.619    0.711    0.759    0.802
2C         0.625    0.716    0.752    0.796
3C         0.630    0.711    0.771    0.803
4C         0.633    0.724    0.763    0.797

Figure 2. The influence of concepts in V2 on the precision (line chart: precision from 0.6 to 0.85 against the number of concepts in V2, 1C-4C, with one curve per number of concepts in V1)
Table 2. The influence of properties in V1 on the precision (rows: properties in V1; columns: concepts in V2)

           1C       2C       3C       4C
1P         0.513    0.621    0.675    0.721
2P         0.662    0.764    0.804    0.859
3P         0.649    0.741    0.785    0.821
4P         0.637    0.739    0.781    0.811

Figure 3. The influence of properties in V1 on the precision (line chart: precision from 0.5 to 0.9 against the number of properties, 1P-4P, with one curve per number of concepts, 1C-4C)
Table 3. Comparison of different retrieval models (precision)

Documents    Keyword Retrieval    Semantic Retrieval
5            74.2%                88.3%
10           66.5%                82.5%
15           58.4%                75.5%
20           50.3%                71.2%
25           43.9%                65.8%
30           37.7%                58.5%
35           34.3%                50.4%
40           27.4%                47.2%
45           21.6%                42.9%
50           19.4%                36.3%
Average      43.37%               61.86%
Figure 2 reflects the relationship between the number of concepts in $V_2$ and the query precision more directly.
Furthermore, statistical results show that the number of properties in a conceptual entity also influences precision. The results are best when the number of properties is 2; if a concept has too many properties, some correct targets are missed because of the excessive restrictive conditions. The corresponding results are shown in Table 2.
Figure 3 corresponds to Table 2.
In addition, we compared our new model with the traditional keyword-based VSM. The number of documents and the retrieval precision are shown in Table 3. The average precision of semantic retrieval is 61.86%, versus only 43.37% for the traditional method on the same document set. According to the experimental data and the analysis above, the ontology plays a positive role in improving retrieval precision.
7. Conclusions
This paper presents a semantic retrieval model based on ontology for desktop documents. Compared with the traditional vector space model, the new model uses semantic and ontology technology to solve a series of problems that the traditional model could not overcome. The experimental results demonstrate the effectiveness of the new model.
In addition, individual analyses of the retrieval results show little distinction in result ranking between the different retrieval methods. The main reason for the precision improvement is that the semantic retrieval method reduces the similarity of incorrect results, so that correct results are ranked closer to the front. Therefore, how to re-rank and optimize the retrieval results is an important task, and it is the main focus of our next stage of work.
REFERENCES
[1] T. Berners-Lee, J. Hendler, and O. Lassila, "The semantic Web," Scientific American, Vol. 284, No. 5, pp. 34-43, 2001.
[2] S. Decker and M. Frank, "The social semantic desktop," WWW 2004 Workshop on Application Design, Development and Implementation Issues in the Semantic Web, 2004.
[3] I. R. Silva, J. N. Souza, and K. S. Santos, "Dependence among terms in vector space model," International Database Engineering and Applications Symposium, pp. 97-102, 2004.
[4] G. A. Miller, "WordNet: A lexical database for English," Communications of the ACM, Vol. 38, No. 11, pp. 39-41, 1995.
[5] G. Aslan and D. McLeod, "Semantic heterogeneity resolution in federated databases by metadata implantation and stepwise evolution," The VLDB Journal, Vol. 8, No. 2, pp. 120-132, 1999.
[6] ACM Computing Classification System: http://www.acm.org/class/.
[7] B. Aleman-Meza, F. Hakimpour, I. B. Arpinar, and A. P. Sheth, "SwetoDblp ontology of computer science publications," Web Semantics: Science, Services and Agents on the World Wide Web, Vol. 5, No. 3, pp. 151-155, 2007.