A Semantic Vector Retrieval Model for Desktop Documents

doi:10.4236/jsea.2009.21009


J. Software Engineering & Applications, 2009, 2:55-59

Published Online April 2009 in SciRes (www.SciRP.org/journal/jsea)


Sheng Li 1

1School of Information, Zhongnan University of Economic and Law, Wuhan, China

Email: kinglisheng@163.com

Received December 3rd, 2008; revised January 29th, 2009; accepted February 18th, 2009.

ABSTRACT

This paper presents an ontology-based semantic vector retrieval model for desktop documents. Compared with the traditional vector space model, the semantic model uses semantic and ontology technology to address several problems that the traditional model cannot overcome, such as the shortcomings of purely statistical weight computation, the expression of semantic relations between different keywords, the description of document semantic vectors, and the similarity calculation. Experimental results show that the new model significantly improves both recall and precision.

Keywords: Semantic Desktop, Information Retrieval, Ontology, Vector Retrieval Model

1. Introduction

As an important branch of semantic Web [1] technology, the semantic desktop indicates the future direction of desktop management technology [2]. Implementing semantic desktop retrieval requires an information retrieval model, which is itself an important research topic in information retrieval. Researchers have proposed a variety of information retrieval models from different angles, such as the probabilistic retrieval model, the fuzzy retrieval model, and the vector space retrieval model (VSM) [3]. Among them, the vector space model is the most effective at expressing the structure of documents.

The main advantage of the traditional vector space model is its simplicity: it describes unstructured documents as vectors, which makes it possible to process them with various mathematical methods. We therefore use ontology-based semantic information management methods to improve the traditional vector space model, creating a semantic vector space model.

2. Traditional Vector Space Model

In the vector space model, the characteristic item $t$ (also known as the index item) is the basic language unit appearing in document $d$ that can represent some characteristic of the document. The weight of a characteristic item, $\omega_{ik}$, reflects the ability of characteristic item $t_k$ to describe document $d_i$. The characteristic item frequency $tf_{ik}$ and the inverse document frequency $idf_k$ are used to calculate $\omega_{ik}$:

$$\omega_{ik} = tf_{ik} \times idf_k = tf_{ik} \times \left(\log_2 (N / n_k) + 1\right),$$

where $tf_{ik}$ is the frequency of characteristic item $t_k$ in document $d_i$, $N$ is the number of documents, and $n_k$ is the number of documents that contain characteristic item $t_k$. From this formula, we can see that $\omega_{ik}$ increases with $tf_{ik}$ and decreases with $n_k$.
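As a rough illustration of the weight formula, the following Python sketch computes $tf_{ik} \times (\log_2(N/n_k) + 1)$ for a tiny document set. The function name `tfidf_weights` and the four sample documents are hypothetical and are not part of the prototype system described in this paper.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document.

    Each weight is tf_ik * (log2(N / n_k) + 1), where tf_ik is the raw
    count of term k in document i, N is the number of documents, and
    n_k is the number of documents that contain term k.
    """
    n_docs = len(docs)
    # n_k: count each term once per document it appears in
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: count * (math.log2(n_docs / doc_freq[term]) + 1)
            for term, count in tf.items()
        })
    return weights


# Hypothetical documents, already split into characteristic items.
docs = [
    ["ontology", "retrieval", "ontology"],
    ["retrieval", "vector"],
    ["vector", "space", "retrieval"],
    ["desktop", "ontology"],
]
weights = tfidf_weights(docs)
print(weights[0])
```

In this toy set, "retrieval" occurs in three of the four documents and so receives a lower weight per occurrence than "desktop", which occurs in only one.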

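The closeness of two document vectors is measured by their similarity. The similarity between documents $d_i$ and $d_j$ is defined as the cosine of the angle $\theta$ between the two vectors:

$$Sim(d_i, d_j) = \cos\theta = \frac{\sum_{k=1}^{m} \omega_{ik}\,\omega_{jk}}{\sqrt{\left(\sum_{k=1}^{m} \omega_{ik}^{2}\right)\left(\sum_{k=1}^{m} \omega_{jk}^{2}\right)}} \qquad (1)$$

where the sums run over the $m$ characteristic items.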

During query matching, the Boolean model can be used to convert the query condition $QS$ into a vector. The component of the query vector for characteristic item $t_k$ is

$$\begin{cases} 1, & \text{if } t_k \in QS, \\ 0, & \text{else.} \end{cases}$$

The information retrieval algorithm based on these definitions is as follows:

1) Create the characteristic item database: input the characteristic items of the document set and create the characteristic item database;

2) Create the document information base: input the content of the documents into the database and create the document information database;

3) Create the document vector database: for each record in the document information base, compute its characteristic item weights with the weight formula introduced above and build its corresponding document vector;

4) Document query: the user inputs a query condition. Then obtain the eligible vector with the Boolean model and compute the similarity between the query condition and each document by Formula (1);

5) Output the ranking result: according to the similarities computed in step 4), output the query result.
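Steps 4) and 5) can be sketched in Python as follows. This is only a minimal sketch: it assumes the document vectors from step 3) are already available as lists of weights over a fixed item order, and the names `cosine`, `query_vector`, `rank`, and the sample vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Formula (1): cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def query_vector(terms, query_terms):
    """Boolean model: 1 if the characteristic item is in the query, else 0."""
    return [1.0 if t in query_terms else 0.0 for t in terms]


def rank(terms, doc_vectors, query_terms):
    """Return (doc_id, similarity) pairs, most similar document first."""
    q = query_vector(terms, query_terms)
    scored = [(doc_id, cosine(vec, q)) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical item order and precomputed document vectors.
terms = ["ontology", "retrieval", "vector"]
doc_vectors = {
    "d1": [4.0, 1.0, 0.0],
    "d2": [0.0, 1.0, 2.0],
    "d3": [0.0, 0.0, 3.0],
}
ranking = rank(terms, doc_vectors, {"ontology", "retrieval"})
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

A document that shares no characteristic item with the query, such as "d3" here, gets similarity 0 and falls to the end of the ranking.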

3. The Features of New Model

Although the semantic vector space model draws on some ideas of the traditional vector space model, it makes useful improvements based on the specific features of semantic information expression. The main features of the semantic vector space model include:

1) The elements and dimensions of the semantic vector space differ from those of the traditional one. In the semantic vector space model, the document characteristic item sequence is represented not by keywords, as usual, but by concepts extracted from the documents. These concepts carry rich meaning in the ontology. In addition, each concept in the concept space has a corresponding list that describes it, and this list represents a vector in the property space. Each semantic vector in this model is therefore a 2D vector, made up of concepts and their property vectors, so the descriptive capacity of the semantic model is better than that of the traditional one.

2) The two models determine each item's weight differently. In the semantic model, the weight of an item depends not only on the frequency of a keyword but also on the description of the corresponding concept in the document. In addition, the TFIDF function in the traditional model cannot accurately reflect the distribution of items in the document set. In the semantic vector space model, items in different positions of a document are given different weights. For example, items appearing in the title of a document are weighted more heavily than those appearing in the abstract.

3) The two models use different algorithms to compute similarity. The semantic vector space model takes the similarity and relatedness between two concepts fully into account. For example, in the traditional vector space model, the words “People”, “Person”, and “Human” are totally different items, whereas in the semantic model they can be treated as one concept according to the corresponding structures or relationships.

4) Besides the differences introduced above, the most important feature of the semantic model is its use of the ontology as a carrier of information. Compared with traditional text retrieval methods, the new model incorporates the semantic information in the ontology.

4. Ontology Creating

Beyond the differences introduced in the last section, an important characteristic of the semantic vector space model is its use of the ontology as an information carrier.

An ontology can be seen as a specification of a conceptualization; it defines a group of concepts. Commonly, ontologies are divided into general ontologies, such as WordNet [4], and domain ontologies, which describe the concepts of a specific domain. In this paper, we focus only on ontologies in the computer science domain.

4.1 The Relationships in the Ontology

In the ontology, concepts are linked to other concepts through relationships. In the hierarchical structure graph of the ontology, each edge represents a relationship. The three most common relationships are “Is-A”, “Part-Of”, and “Entity Relationship” [5].

1) Is-A Relationship: it describes generalization. For example, “Entity Extraction” Is-A “Information Extraction”;

2) Part-Of Relationship: it describes containment between concepts. For example, the “CPU” is a Part-Of “Computer”;

3) Entity Relationship: it describes the membership relationship between a concept and its individual objects. For example, “T. Berners-Lee” is an entity of the concept “author”.

4.2 The Structure in the Ontology

Following the basic principles of ontology and the ACM Topic Hierarchy [6], we create an ontology, called “CmpOnto”, to describe the terms of computer science. The ontology “SwetoDblp_2” is then created by extending SwetoDblp [7] with research fields and keywords. A segment of the ontology CmpOnto is as follows:

<owl:Class rdf:about="http://www.acm.org/class/1998/acm#H.3">
  <rdfs:label>INFORMATION STORAGE AND RETRIEVAL</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://www.acm.org/class/1998/acm#H"/>
</owl:Class>
...
<owl:Class rdf:about="http://www.acm.org/class/1998/acm#H.3.3">
  <rdfs:label>Information Search and Retrieval</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://www.acm.org/class/1998/acm#H.3"/>
  <owl:disjointWith rdf:resource="http://www.acm.org/class/1998/acm#H.3.1"/>
  <owl:disjointWith rdf:resource="http://www.acm.org/class/1998/acm#H.3.2"/>
  ...
</owl:Class>
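To show how such a fragment yields the concept hierarchy, the following Python sketch reads the `rdfs:subClassOf` links with the standard library XML parser. It is only an illustration: the fragment is wrapped in an `rdf:RDF` root with the standard RDF, RDFS, and OWL namespace declarations so that it parses on its own, the `owl:disjointWith` entries are left out, and the helper name `parent_map` is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
}
RDF = "{%s}" % NS["rdf"]  # ElementTree spells namespaced attributes as {uri}name

# Two classes from the CmpOnto segment, wrapped in a root element.
OWL_FRAGMENT = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://www.acm.org/class/1998/acm#H.3">
    <rdfs:label>INFORMATION STORAGE AND RETRIEVAL</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://www.acm.org/class/1998/acm#H"/>
  </owl:Class>
  <owl:Class rdf:about="http://www.acm.org/class/1998/acm#H.3.3">
    <rdfs:label>Information Search and Retrieval</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://www.acm.org/class/1998/acm#H.3"/>
  </owl:Class>
</rdf:RDF>
"""


def parent_map(owl_xml):
    """Map each class URI to the URI of its direct superclass (Is-A edge)."""
    root = ET.fromstring(owl_xml.strip())
    parents = {}
    for cls in root.findall("owl:Class", NS):
        child = cls.get(RDF + "about")
        sup = cls.find("rdfs:subClassOf", NS)
        if child is not None and sup is not None:
            parents[child] = sup.get(RDF + "resource")
    return parents


parents = parent_map(OWL_FRAGMENT)
for child, parent in parents.items():
    print(child, "Is-A", parent)
```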

A segment of the ontology SwetoDblp_2 is as follows:

<owl:Class rdf:about="http://lsdis.cs.uga.edu/projects/semdis/opus#Article">
  <rdfs:label>Article</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://lsdis.cs.uga.edu/projects/semdis/opus#Publication"/>
  <rdfs:comment>An article from a journal or magazine.</rdfs:comment>
  <owl:equivalentClass rdf:resource="http://knowledgeweb.semanticweb.org/semanticportal/OWL/Documentation_Ontology.owl#Article_in_Journal" />
  <owl:equivalentClass rdf:resource="http://sw-portal.deri.org/ontologies/swportal#Article" />
  <owl:equivalentClass rdf:resource="http://purl.org/net/nknouf/ns/bibtex#Article" />
</owl:Class>
...
<owl:ObjectProperty rdf:about="http://lsdis.cs.uga.edu/projects/semdis/opus#at_university">
  <rdfs:comment>Indicates that a publication originates or is related to a specific University.</rdfs:comment>
  <rdfs:label>at university</rdfs:label>
  <rdfs:range rdf:resource="http://lsdis.cs.uga.edu/projects/semdis/opus#University"/>
  <rdfs:domain rdf:resource="http://lsdis.cs.uga.edu/projects/semdis/opus#Publication"/>
</owl:ObjectProperty>

5. Computing the Semantic Similarity

During information retrieval based on semantic similarity, the concepts and properties in the vector are processed separately. Considering the relatedness between different conceptual entities and comparable properties, we first introduce methods for measuring concept similarity and property similarity, and then give the semantic similarity algorithm.

5.1 The Concept Similarity

The ontology uses a hierarchical tree structure to describe the logical relationships between concepts, which is the semantic basis of our retrieval algorithm. Since different concepts are related to some degree, we use concept similarity to describe and measure this relatedness in order to improve the precision of retrieval. Before computing the concept similarity, we give three definitions for the different kinds of relationship between concepts:

Definition 1: Homology concepts. In the hierarchical tree structure of the ontology, concept A and concept B are homology concepts if the node of concept A is an ancestor of the node of concept B. A is then called the nearest root concept of B, denoted R(A, B). The distance between A and B is $d(A, B) = dep(B) - dep(A)$, where $dep(C)$ is the depth of node C in the hierarchical tree structure.

Definition 2: Non-homology concepts. In the hierarchical tree structure of the ontology, concept A and concept B are non-homology concepts if concept A is neither an ancestor nor a descendant of concept B. If R is the nearest common ancestor of A and B, R is called the nearest root concept of A and B, denoted R(A, B). The distance between A and B is $d(A, B) = d(A, R) + d(B, R)$.

Definition 3: Semantic related concepts. Concept C is a semantic related concept of A and B if and only if it satisfies the following conditions: if A and B are homology concepts, C lies in the subtree rooted at A but not in the subtree rooted at B; if A and B are non-homology concepts, C lies in the subtree rooted at R but not in the subtrees rooted at A or B.
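The depth, nearest root concept, and distance used in Definitions 1 and 2 can be sketched in Python on a small tree. The tree below is hypothetical (its labels only imitate ACM topic codes), the root is taken to have depth 0, and the distance for non-homology concepts is assumed to be the sum $d(A, R) + d(B, R)$.

```python
# Hypothetical concept tree as a child -> parent map; "Root" has no parent.
PARENT = {
    "H": "Root", "I": "Root",
    "H.2": "H", "H.3": "H",
    "H.3.1": "H.3", "H.3.3": "H.3",
}


def ancestors(node):
    """Path from node up to the root, starting with node itself."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def dep(node):
    """Depth of node; the root is assumed to have depth 0."""
    return len(ancestors(node)) - 1


def nearest_root(a, b):
    """R(A, B): the deepest node that is an ancestor (or self) of both."""
    seen = set(ancestors(a))
    return next(n for n in ancestors(b) if n in seen)


def distance(a, b):
    """d(A, B): dep(B) - dep(A) for homology concepts (R is A itself),
    d(A, R) + d(B, R) for non-homology concepts."""
    r = nearest_root(a, b)
    return (dep(a) - dep(r)) + (dep(b) - dep(r))


print(nearest_root("H.3", "H.3.3"), distance("H.3", "H.3.3"))  # homology pair
print(nearest_root("H.3.3", "H.2"), distance("H.3.3", "H.2"))  # non-homology pair
```

Because the nearest root of a homology pair is the ancestor concept itself, a single `distance` function covers both definitions.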

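Figure 1 shows the relationships described above. According to these definitions, the structure similarity between concept A and concept B is:

$$
Sim'(A, B) =
\begin{cases}
[\,\text{expression in } \alpha,\ \beta,\ dep(R(A,B)),\ d(A,B),\ son(A),\ son(B)\,], & \text{if } d(A,B) \neq 0 \text{ and } A, B \text{ are homology concepts;} \\
[\,\text{expression in } \alpha,\ \beta,\ dep(R(A,B)),\ d(A,B),\ son(A),\ son(B),\ son(R)\,], & \text{if } d(A,B) \neq 0 \text{ and } A, B \text{ are non-homology concepts;} \\
1, & \text{if } d(A,B) = 0.
\end{cases}
\qquad (2)
$$

(The first two expressions are not legible in this copy; only the quantities they combine and the case conditions can be recovered.)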

where $son(C)$ denotes the total number of nodes in the subtree rooted at concept C. The parameters $\alpha$ and $\beta$ adjust the weights of $dep(R(A,B))$ and $d(A,B)$; they take values in (0, 1) and are set by field experts.

According to the formula above, the concept similarity decreases as the distance between concepts grows. At the same time, the deeper the nearest root of two concepts, the more properties they should have in common and the more similar they should be. Furthermore, the number of nodes in the subtrees and the semantic related concepts are also important factors in computing the similarity.

[Figure 1 panels: Homology Concepts; Non-homology Concepts; Semantic Related Concepts]

Figure 1. Three patterns of concepts


Finally, the formula defines the similarity between identical concepts as 1, since the distance between them is 0.

5.2 The Property Similarity

Each concept in the ontology may have several different entities, and these entities differ mainly in their property values. Furthermore, different concepts may share the same properties. Therefore, both the concept similarity and the property similarity should be considered when computing the similarity between two entities. For measuring property similarity, we give the following definition:

Definition 4: Suppose I is an entity of concept C, and the value of its property $P_i$ is $p_i$, $i = 1, 2, \ldots, n$. We write $I = C[P]$ to denote this entity, where $P$ is the property vector $(p_1, p_2, \ldots, p_n)$.

Only the common properties need to be processed when computing the similarity between property vectors $P = (p_1, p_2, \ldots, p_m)$ and $Q = (q_1, q_2, \ldots, q_n)$.

First, transform the property vectors P and Q into the common property vectors $P' = (p'_1, p'_2, \ldots, p'_r)$ and $Q' = (q'_1, q'_2, \ldots, q'_r)$. Then, according to the properties defined in the ontology and the similarity of the property values, the property similarity of vectors P and Q is:

$$Sim_P(P, Q) = Sim_P(P', Q') = \sum_{i=1}^{r} \mu_i \, \gamma_i \, Sim_i(p'_i, q'_i) \qquad (3)$$

where $\mu_i$ and $\gamma_i$ are the weights of properties $p'_i$ and $q'_i$ in their respective property vectors, which are preset in the ontology; $Sim_i(p'_i, q'_i)$ is the similarity of the property values, which is preset by field experts in the ontology. For example, the similarity between the property values “Data mining” and “Information Retrieval” is 0.7, and that between “Data mining” and “Network” is 0.1. The range of $Sim_P(P, Q)$ is [0, 1].
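Reading Formula (3) as a weighted sum over the common properties, $\sum_i \mu_i \gamma_i Sim_i(p'_i, q'_i)$, a minimal Python sketch looks as follows. Only the two value similarities 0.7 and 0.1 come from the text; the property names, the weights `mu` and `gamma`, and the rules that identical values have similarity 1 and unlisted pairs have similarity 0 are assumptions made for illustration.

```python
# Expert-preset value similarities; the two entries are the examples in the text.
VALUE_SIM = {
    frozenset(["Data mining", "Information Retrieval"]): 0.7,
    frozenset(["Data mining", "Network"]): 0.1,
}


def value_similarity(p, q):
    """Similarity of two property values (assumed 1.0 if equal, 0.0 if unlisted)."""
    if p == q:
        return 1.0
    return VALUE_SIM.get(frozenset([p, q]), 0.0)


def property_similarity(P, Q, mu, gamma):
    """Weighted sum of value similarities over the properties P and Q share."""
    common = P.keys() & Q.keys()  # only common properties are processed
    return sum(
        mu[name] * gamma[name] * value_similarity(P[name], Q[name])
        for name in common
    )


# Hypothetical entities described by property-name -> value dicts.
P = {"field": "Data mining", "year": "2008"}
Q = {"field": "Information Retrieval", "year": "2008", "venue": "JSEA"}
mu = {"field": 0.5, "year": 0.5}                    # hypothetical weights in P
gamma = {"field": 1.0, "year": 1.0, "venue": 1.0}   # hypothetical weights in Q

sim = property_similarity(P, Q, mu, gamma)
print(round(sim, 2))
```

The "venue" property exists only in Q, so it does not contribute to the result.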

5.3 The Semantic Similarity

After computing the concept similarity of the semantic vectors and the property similarity of the conceptual entities, we can obtain the final semantic similarity of the semantic vectors.

Suppose $V_1 = (A_1[P_1], \ldots, A_m[P_m])$ and $V_2 = (B_1[Q_1], \ldots, B_n[Q_n])$ are two semantic vectors. The semantic similarity between $V_1$ and $V_2$ is:

$$Sim_V(V_1, V_2) = \sum_{i} \max_{j} \left( \omega \cdot Sim_C(A_i, B_j) + (1 - \omega) \cdot Sim_P(P_i, Q_j) \right) \qquad (4)$$

where $\omega$ is the weight of the concept similarity, and its range is [0, 1].

The main retrieval algorithm is then as follows:

Begin

1) Initialize the document set, then load the user query vector $V_1$ and determine its document cluster;

2) Load the semantic index file of the documents and initialize the semantic vectors $V_2$;

3) For each vector $V_2$ in the document cluster: if the current vector has not yet been processed, continue with steps 4) to 6); otherwise, move on to the next vector;

4) Compute all the concept similarities between the concepts in $V_1$ and $V_2$;

5) Compute all the property similarities between the conceptual entities in $V_1$ and $V_2$;

6) Compute the semantic similarity between $V_1$ and $V_2$, and insert $V_2$ into list S in descending order of similarity;

7) Output the top n items in list S as the retrieval results;

End.
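Steps 3) to 7) can be sketched in Python as below. The sketch assumes that Formula (4) sums, over the entities of $V_1$, the best weighted combination of concept and property similarity found among the entities of $V_2$. The concept and property similarity functions are passed in as parameters; `toy_sim_c`, `toy_sim_p`, and the sample index are hypothetical stand-ins and do not implement Formulas (2) and (3).

```python
def semantic_similarity(v1, v2, sim_c, sim_p, omega=0.8):
    """Sum over entities (a, p) of v1 of the best match among entities (b, q) of v2."""
    total = 0.0
    for a, p in v1:
        total += max(
            (omega * sim_c(a, b) + (1 - omega) * sim_p(p, q) for b, q in v2),
            default=0.0,  # an empty document vector contributes nothing
        )
    return total


def retrieve(query, index, sim_c, sim_p, top_n=10, omega=0.8):
    """Score every document vector against the query and keep the top n."""
    scored = [
        (doc_id, semantic_similarity(query, v2, sim_c, sim_p, omega))
        for doc_id, v2 in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy stand-ins for the concept and property similarity measures.
def toy_sim_c(a, b):
    return 1.0 if a == b else 0.2


def toy_sim_p(p, q):
    return 1.0 if p == q else 0.0


# Each semantic vector is a list of (concept, property vector) entities.
query = [("Retrieval", ("Data mining",))]
index = {
    "doc1": [("Retrieval", ("Data mining",))],
    "doc2": [("Network", ("Data mining",))],
    "doc3": [("Network", ("Routing",))],
}
results = retrieve(query, index, toy_sim_c, toy_sim_p, top_n=2)
for doc_id, score in results:
    print(doc_id, round(score, 2))
```

The default `omega=0.8` mirrors the value chosen in the experiment of Section 6; sorting once after scoring has the same effect as inserting each vector into list S in descending order.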

6. Experiment and Analysis

To verify the effectiveness of our method, we designed a prototype system and chose 100 abstracts downloaded from DBLP as the retrieval target documents. The prototype system uses the ontologies CmpOnto and SwetoDblp_2 introduced in Section 4.

In the experiment, the depth of the ontology concept tree is 5, the range of $dep(R(A, B))$ in Formula (2) is [1, 5], and the value of $d(A, B)$ is an integer from 1 to 10; the weights $\alpha$ and $\beta$ are both 0.5; $\mu_i$ and $\gamma_i$ are parameters preset in the ontology, which can be obtained by statistical methods. The value of $\omega$ in Formula (4) influences the ranking of the retrieval results. To choose a proper $\omega$, we ran a preliminary experiment and selected 0.8 as the optimal value of $\omega$.

The first step of the experiment is document preprocessing. Each document is described by a semantic vector $V_2$ containing 1 to 4 conceptual entities. We find that the average retrieval precision increases from about 60% to about 80% as the number of concepts in $V_2$ increases. The corresponding results are shown in Table 1.

[Figure 2 plot: precision (y-axis, 0.6 to 0.85) against the number of concepts (x-axis, 1C to 4C), with one series each for 1 C, 2 C, 3 C, and 4 C]

Figure 2. The influence of concepts in V2 on the precision

Table 1. The influence of concepts in V2 on the precision

V1      1C      2C      3C      4C
1 C     0.619   0.711   0.759   0.802
2 C     0.625   0.716   0.752   0.796
3 C     0.630   0.711   0.771   0.803
4 C     0.633   0.724   0.763   0.797


[Figure 3 plot: precision (y-axis, 0.5 to 0.9) against the number of properties (x-axis, 1P to 4P), with one series each for 1 C, 2 C, 3 C, and 4 C]

Figure 3. The influence of properties in V1 on the precision

Table 2. The influence of properties in V1 on the precision

V2      1C      2C      3C      4C
1 P     0.513   0.621   0.675   0.721
2 P     0.662   0.764   0.804   0.859
3 P     0.649   0.741   0.785   0.821
4 P     0.637   0.739   0.781   0.811

Table 3. Comparison of different retrieval models

Documents   Precision (Keywords Retrieval)   Precision (Semantic Retrieval)
5           74.2%                            88.3%
10          66.5%                            82.5%
15          58.4%                            75.5%
20          50.3%                            71.2%
25          43.9%                            65.8%
30          37.7%                            58.5%
35          34.3%                            50.4%
40          27.4%                            47.2%
45          21.6%                            42.9%
50          19.4%                            36.3%
Average     43.37%                           61.86%

Figure 2 shows the relationship between the number of concepts in $V_2$ and the query precision more directly.

Furthermore, the statistical results show that the number of properties in a conceptual entity also influences the precision. Precision is best when the number of properties is 2. If a concept has too many properties, some relevant targets are missed because of the many restrictive conditions. The corresponding results are shown in Table 2, and Figure 3 plots the same data.

In addition, we compare our new model with the traditional keyword-based VSM. The number of documents and the retrieval precision are shown in Table 3. The average precision of semantic retrieval is 61.86%, against only 43.37% for the traditional method on the same document set. According to the experimental data and the analysis above, the ontology plays a positive role in improving retrieval precision.
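The averages quoted from Table 3 can be checked with a few lines of Python. The lists below copy the per-row precision values from the table; the variable names are hypothetical.

```python
# Precision (%) per number of documents, copied from Table 3.
keyword = [74.2, 66.5, 58.4, 50.3, 43.9, 37.7, 34.3, 27.4, 21.6, 19.4]
semantic = [88.3, 82.5, 75.5, 71.2, 65.8, 58.5, 50.4, 47.2, 42.9, 36.3]


def mean(values):
    return sum(values) / len(values)


avg_keyword = round(mean(keyword), 2)
avg_semantic = round(mean(semantic), 2)
# Difference between the two averages, in percentage points.
gain = round(avg_semantic - avg_keyword, 2)
# Row-by-row difference between the two methods.
per_row_gain = [round(s - k, 1) for k, s in zip(keyword, semantic)]

print(avg_keyword, avg_semantic, gain)
print(per_row_gain)
```

The computed averages agree with the 43.37% and 61.86% reported in the table, and the semantic method is ahead in every row.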

7. Conclusions

This paper presents an ontology-based semantic retrieval model for desktop documents. Compared with the traditional vector space model, the new model uses semantic and ontology technology to solve a series of problems that the traditional model cannot overcome. The experimental results demonstrate the effectiveness of the new model.

In addition, individual analyses of the retrieval results show that the different retrieval methods differ little in how they rank results. The main reason for the precision improvement is that the semantic retrieval method reduces the similarity of incorrect results, so that correct results are ranked nearer the front. Therefore, re-ranking and optimizing the retrieval results is an important task and the main focus of our next stage of work.

REFERENCES

[1] T. Berners-Lee, J. Hendler, and O. Lassila, “The semantic web,” Scientific American, Vol. 284, No. 5, pp. 34-43, 2001.

[2] S. Decker and M. Frank, “The social semantic desktop,” WWW 2004 Workshop Application Design, Development and Implementation Issues in the Semantic Web, 2004.

[3] I. R. Silva, J. N. Souza, and K. S. Santos, “Dependence among terms in vector space model,” Database Engineering and Applications Symposium, pp. 97-102, 2004.

[4] G. A. Miller, “WordNet: A lexical database for English,” Communications of the ACM, Vol. 38, No. 11, pp. 39-41, 1995.

[5] G. Aslan and D. McLeod, “Semantic heterogeneity resolution in federated database by metadata implantation and stepwise evolution,” The VLDB Journal, the International Journal on Very Large Databases, Vol. 18, pp. 22-31, 1999.

[6] ACM Topic: http://www.acm.org/class/.

[7] B. Aleman-Meza, F. Hakimpour, I. B. Arpinar, and A. P. Sheth, “SwetoDblp ontology of Computer Science publications,” Web Semantics: Science, Services and Agents on the World Wide Web, pp. 151-155, 2007.
