Element Retrieval Using Namespace Based on Keyword Search over XML Documents

doi:10.4236/jsea.2010.31008

Paper Menu >>

Journal Menu >>

J. Software Engineering & Applications, 2010, 3: 65-72

doi:10.4236/jsea.2010.31008 Published Online January 2010 (http://www.SciRP.org/journal/jsea)

Element Retrieval Using Namespace Based on Keyword

Search over XML Documents

Yang WANG1, Zhikui CHEN1, Xiaodi HUANG2

1School of Software, Dalian University of Technology, Dalian, China; 2School of Computing and Mathematics, Charles Sturt Uni-

versity, Australia.

Email: Wayag2000@yahoo.com.cn, zkchen@dlut.edu.cn, xdhuang@csu.edu.au

Received July 31st, 2009; revised September 12th, 2009; accepted September 22nd, 2009.

ABSTRACT

Querying over XML elements using keyword search is steadily gaining popularity. The traditional similarity measure is

widely employed in order to effectively retrieve various XML documents. A number of authors have already proposed

different similarity-measure methods that take advantage of the structure and content of XML documents. However,

they do not consider the similarity between latent semantic information of element texts and that of keywords in a query.

Although many algorithms on XML element search are available, some of them have the high computational complexity

due to searching for a huge number of elements. In this paper, we propose a new algorithm that makes use of the se-

mantic similarity between elements instead of between entire XML documents, considering not only the structure and

content of an XML document, but also semantic information of namespaces in elements. We compare our algorithm

with the three other algorithms by testing on real datasets. The experiments have demonstrated that our proposed

method is able to improve the query accuracy, as well as to reduce the running time.

Keywords: Semantics, Namespace, SVD, Text Matching

1. Introduction

Keyword search querying over XML elements has emer-

ged as one of the most effective paradigms in information

retrieval. To identify relevant results for an XML key-

word query, different approaches lead to various search

results in general. Some authors calculated the similarity

between the content of XML documents and query, only

analyzing the content and structure of XML (e.g., [1–3]).

Many algorithms calculate the degree of text of elements

matching with the keywords to produce the ranked re-

sult-list (e.g., DIL Query processing algorithm [4] and

Top-k algorithm [5]). The classical methods focus on

TF-IEF formula to calculate the cosine similarity between

elements and query (e.g., Tae-Soon Kim et al. [6]; Maria

Izabel M et al. [7]; Yun-tao Zhang et al. [8]).

In particular, overlaps of elements in XML documents

must be considered. For several overlapping relevant ele-

ments, we have to choose which one should be avoided

to ensure that users do not see the same information for

several times. Su Cheng Haw et al. [9] presented the

TwigINLAB algorithm to improve XML Query process-

ing. In this paper, we modify it to deal with the elements

overlap occurring in keyword search results.

On the basis of previous work, we make the following

contributions in this paper. Firstly, we utilize the seman-

tic information of namespaces in elements to filter the

relevant components since the text of elements are com-

monly related with semantic information of namespace.

Secondly, the precision and recall of our algorithm show

that the non-text matching but semantic relevant ele-

ments with respect to the keyword can be effectively

retrieved. Compared with traditional work, our algorithm

also shows the better performance on time execution

over a large collection of elements.

The rest of this paper is organized as follows: Section

2 introduces the element-rank schema by keyword search.

Section 3 presents the Namespace Filter Algorithm

(NFA). The experiments on the comparison of NFA and

related methods are reported in Section 4. Related work

is presented in Section 5, followed by the conclusion.

2. Element-Rank Schema

In this section, we utilize the namespace of elements to

describe our element rank schema. Another goal of util-

izing namespace is to filter relevant elements with the

keyword in a query to reduce time execution compared

with traditional algorithms.

Interestingly, namespaces can distinguish different ele-

Element Retrieval Using Namespace Based on Keyword Search over XML Documents

ments containing the same markup that refers to different

semantic meanings. As an illustration, we consider two

elements with the same markup of <table>:

<table>

<td>apple</td>

<td>banana<td>

</table>

<table>

<name>coffee table</name>

</table>

This will lead to the confliction when they are in the

same XML document. Thus, we utilize different name-

spaces of 'h' and 'f' to distinguish them as below.

<h:table xmlns:h = "http://.../fruit">

<h:td>apple</h:td>

<h:td>banana</h:td>

</h:table>

<f:table xmlns:f = "http://.../furniture">

<f:name>coffee table</f:name>

<f:width>80</f:width>

<f:length>120</f:length>

</f:table>

As discussed above, the text of elements is commonly

related with the semantic information of their namespaces.

Given the semantic information of namespace that is ir-

relevant to the keyword, it is not desirable to access all the

elements containing this namespace. In order to calculate

the semantic similarity between namespaces and key-

words, we map semantic information of namespaces and

keywords into different vectors in a concept vector space

created by Singular Value Decomposition (SVD) [10]

over a collection of elements. In order to do this, Defini-

tions 1 and 2 are provided as follows.

Definition 1: : a function that maps the

namespace of element v into a vector and represents a

special meaning in the concept vector space created by

SVD.

)(vprefix

Definition 2:: the

degree of relevance calculated by the cosine similarity

between the namespace vector of element v and the key-

word vector in concept vector space created by SVD.

)),(( keywordvprefixncorrelatio

The value of the correlation is commonly normalized

between the range [-1,1]. If the semantic meaning of

namespaces is very close to that of the keywords, the

value of ‘correlation’ will be around 1. Nobert Govert et

al. [2] proposed the concept of degree of relevance be-

tween elements. We extend it to include several intervals

in [-1,1] to describe the degree of the semantic similarity

of namespaces and keywords. Without loss of generality,

some definitions are provided to describe the degree of

relevance between namespaces and keywords.

Definition 3: High relevance: the high correlation

between namespaces and keywords which satisfies

1)),((

1 keywordvprefixncorrelatio



(1)

Definition 4: Common relevance: the median corre-

lation between namespaces and keywords which satisfies

12 )),((







keywordvprefixncorrelatio (2)

Definition 5: Irrelevance: the lower correlation be-

tween namespaces and keywords which satisfies

)),((1









keywordvprefixncorrelatio (3)

In the above equations, we have 10 12







, our

ranking algorithm accesses the elements containing the

namespaces that satisfy either Equation (1) or Equation

(2) rather than Equation (3).

3. Espace Filter Algorithm

In this section, we introduce some preliminary knowl-

edge, followed by presenting our algorithm called the

Namespace Filter Algorithm (NFA).

3.1 Preliminaries

The idftf



weight is commonly used to calculate the

term weight in documents in the field of traditional in-

formation retrieval. The purpose of our work is to re-

trieve the appropriate nested elements that contain the

relevant text to keywords instead of entire XML docu-

ments. So we extend idftf



to for ele-

ments in XML documents.

ieftf et 

Notations:

tf , the number of times that keyword t occurs in

the text of element e.

tf , the number of times that keyword t occurs

in the query q.

ief 10

log whereis the total number of elements

over a collection of XML documents, and is the

number of elements that contain the keyword. We

then give Definition 6 as below.

Definition 6: keyword weights in elements and query









otherwise

tfifieftf

Wetet

0)log1( ,,10

, (4)

ieftfW qtqt )log1( ,10, (5)

where is keyword weight in the text of element ,

and keyword weight in a query . We calculate

the cosine similarity between query vector

qt,



and ele-

ment vector e



in Equation (6) on text matching factor.













iii

eqscore

),( (6)

Element Retrieval Using Namespace Based on Keyword Search over XML Documents67

Figure 1. Example of elements with the label ID in the XML

document tree

where is the ith keyword weight in



, and is the

ith keyword weight in



. Their weight values are calcu-

lated using Equation (4) and Equation (5), respectively.

In XML documents, elements are of varying size and

nested. Since relevant elements can be at any level of

granularity, either an element or its children can be rele-

vant to a given query. These facts commonly lead to a

problem that the same resulting elements of a query based

on keyword search will be presented to users for several

times. As an illustration, Consider the structure of an XML

document that is shown as the labeled tree in Figure 1.

Besides, let us suppose the relevant element list after key-

word search are listed in Table 1.

Elements with ID 0.2.1 and 0.2 are overlapping, so are

with 0.1, 0.1.1, and 0.12. If one element's parent is the

component of another element, the two relevant compo-

nents can be merged into one. An element will be merged

into its parent only if the number of the keyword occur-

ring in this particular element is less than that of its par-

ent element. In this way, there will be no overlap in the

resulting list shown in Table 2.

Furthermore, we denote value[v] calculated by NFA in

Section 3.2 as element v. Combining with Definition 2.

The final comprehensive evaluation formula about rele-

Table 1. Example of ranked list

Rank Self Parent

1 0.2.1 0.2

2 0.2 Root

3 0.1 Root

4 0.1.1 0.1

5 0.1.2 0.1

Table 2. Result list without overlap

Rank Self Parent

1 0.2.1 0.2

2 0.1 Root

vant elements ranking is given as Equation (7).

][)),(()( 21 vvalueakeywordvprefixncorrelatioavrank 

(7)

where 1





aa . In order to highlight the factor of

namespace's semantic, we have .

10 12  aa

3.2 NFA Description

In the following discussion, we will focus on presenting

the Namespace Filter Algorithm (NFA) and how it per-

forms based on the keyword search over a collection of

elements.

Let A be a set consisting of different elements to be

accessed by NFA, and the namespaces of elements in set

A satisfy either Equation (1) or Equation (2). Other ele-

ments not included in A will be neglected by NFA. The

length[e] in Equation (6) is defined as 

Value[e] in Figure 2 gives the degree of text matching

between the text of element and keywords.

3.3 An Example

To evaluate the effectiveness of NFA, using an example,

we perform it with different pair values of 1



and 2



Equations (1) and (2). We empirically provide an XML

document named as record.xml in Figure 3 which con-

sists of many elements with namespace 'c' describing

semantic "computer" and 'n' describing "joy". Let the

query be "data and space in algorithm". Meanwhile, we

set 1



in Equation (1) and 2



in Equation (2) to 0.8

and 0.6, respectively. SVD is commonly applied to docu-

ments in traditional information retrieval. We extend it to

NFA : retrieve the ranked element based on the keyword

Input:query, a collection of relevant elements denoted as A

Output: top k elements of ranked result list

Description:

01 float value[N] = 0//N is the number of elements



02 float Length[N]

03 for each keyword t in the query

04 do for each pair(element  A,tf(t,e))

05 do value[e] += //Equations.(4) and (5)

qtet WW ,, 

06 end-for

07 end-for

08 for each element e

09 do value[e] = value[e] / length[e]

10 end-for

11 merge the overlap

12 calculate the rank[] with Equation (7)

13 return top K elements of rank[] over all documents

Figure 2. Namespace filter algorithm

Element Retrieval Using Namespace Based on Keyword Search over XML Documents

Table 3. Term-element matrix M

<root1>

<c:cs xmlns:c = "http://....../computer">

<c:DBMS>

<c:DB>attribute</c:DB>

<c:DB>Management</c:DB>

</c:DBMS>

<c:programming>

<c:complexity>data and space</c:complexity>

<c:time>data in computer's Algorithm</c:time>

</c:programming>

<c:java>data of Algorithm in computer science</c:java>

</c:cs>

<n:joy xmlns:n = "http://....../happiness">

<n:entertainment>

<n:in>no space with audience's joy</n:in>

<n:out>jackson dance in large space</n:out>

</n:entertainment>

</n:joy>

</root1>

0.1.1.1 0.1.1.2 0.1.2 0.1.3.1 0.1.3.2 0.1.4 0.2.1.1 0.2.1.2

Computer 0 0 1 0 1 1 0 0

Data 0 0 1 1 1 1 0 0

Space 0 0 1 1 1 0 1 1

Algorithm 0 0 0 0 1 1 0 0

Joy 0 0 0 0 0 0 1 0

elements in this example.

Each element in record.xml corresponds to a node in

the tree with labeled IDs in Figure 4.

Given the correlation value between the semantic

meaning of namespaces: 'c', and 'n', and that of the key-

words :"data","space", and "Algorithm", we construct a

term-element matrix denoted as M, the elements of

which are term frequencies occurring over all of ele-

ments in record.xml in Table 3.

Then we normalize matrix M denoted by M1 as fol-

lows













07071.0000000

005774.05774.00000

17071.0007071.05774.000

005774.05774.07071.05774.000

005774.05774.005774.000

M1 is decomposed into the following three matrixes

by SVD

Figure 3. Example of record.xml





















1403.04420.08437.02458.01122.0

7261.03172.03754.04147.02435.0

1073.02915.01155.06863.06474.0

0637.06617.03163.03361.05874.0

6614.04256.01839.04285.04050.0















0003433.00000

00004126.0000

000006569.000

0000003770.10

00000008397.1

Figure 4. Tree structure of record.xml































0003124.07066.01757.04984.03519.0

0000681.02579.07839.04786.02920.0

2877.04082.05.02159.01133.02135.04945.03878.0

2887.04082.05.02159.01133.02135.04945.03878.0

0003520.06345.04647.01798.04746.0

0008250.00774.02179.00328.05146.0

5722.07654.02946.000000

7113.02843.06428.000000

Element Retrieval Using Namespace Based on Keyword Search over XML Documents69

In the following, we consider the reduced semantic

space with two most informative dimensions. Let U1 be

first two columns of U, S1 be the diagonal square matrix

that contains the first two biggest eigenvalues 1.8397, and

13770 of S as diagonal elements, and other elements in

S1 are 0.V1 be the transpose of first two columns of V.

We then build up a new term-element matrix M2 by us-

ing U1* S1*V1 as below.















2413.02222.00874.00874.01587.00950.000

1271.01426.04559.04559.01097.02490.000

8897.07997.00059.00059.07346.05813.000

1492.00937.06475.06475.04292.05707.000

0321.00650.05804.05804.02472.04024.000

The correlation values between terms are shown in Ta-

ble 4a by using T

22. We then normalize these

values in table 4a to the range of [-1,1] as given in Table 4b.

According to Table 4b, the correlation values between

the semantic meanings of namespaces 'c','n' and those of

the keywords in the query are given in Table 5.

Consider the keyword search for "data" or "Algorith-

m" in a query. As shown in Table 5, both the values of

correlation of Namespace 'c' vector with "data" and "Al-

gorithm" vector satisfies Equation (1) and Equation (2).

In contrast, the correlation value of Namespace 'n' vector

and keyword vectors does not satisfy Equations (1) and

(2). So we have {0.2,0.2.1,0.2.1.1,0.2.1.2}A. The pa-

rameter values (in Section 3.1) of elements in set A are

listed in Table 6.

From line 05 to 11 of NFA in Figure 2, combining

with Table 6, the value[e] s of elements in set A are

shown in Table 7.

As shown in Table 7, there exists the overlap between

element 0.1 and other elements. After merging the over-

lap, the result is 0.1 including its descendent elements

0.1.2,0.1.3,0.1.4 as a whole components. The other re-

sulting element is 0.1.1 including all its descendent ele-

ments. Let , and in Equation (7) be 0.9 and 0.1,

respectively, and the correlation value between name-

space and keywords be 0.8085, which is the average cor-

relation value between “data" and "Algorithm". Then we

can get the final ranked result by using Equation (7) in

Table 8.

In order to exploit the relation between in Equation

(1), and in Equation (2) and search result, we assign

different pair values to and such as 0.6 and 0.3.

We still give the same query to perform NFA over re-

cord.xml. This time we focus on "space" in the query

rather than "Algorithm" and "data". Table 5 shows that

the correlation value between namespace 'c' vector and

"space" vector is 0.3038, which satisfies Equation (2) and

namespace 'n' vector and "space" vector is 0.5233, which

satisfies Equation (2). Consequently, the search result

performed by NFA is given in Figure 6.



As shown in Figures 5 and 6, the different degrees of

semantic information relevance between the name-

spaces and keywords will lead to various search results

by using NFA.

In summary, the degree of semantic relevance between

the namespace and keywords depends not only on their

semantic information similarity, but also on user-speci-

fied weights on other factors.

4. Experiments

In our experiments, we compare NFA with other related

Table 4. Correlation value between different pair of terms

in record.xml

















1571.01010.05652.00660.00462.0

1010.05262.00072.07473.06699.0

5652.00072.03087.28471.03281.0

0660.07473.08471.03795.10765.1

0462.06699.03281.00765.19020.0

















1455.00935.05233.00611.00428.0

0935.04872.00066.06919.06203.0

5233.00066.01377.27843.03038.0

0611.06919.07843.02773.19967.0

0428.06203.03038.09967.08352.0

Table 5. Correlation value between semantic of namespace

'c', 'n' vectors and other three keyword vectors over ele-

ments in record.xml

Correlation data space Algorithm

computer 0.9967 0.3038 0.6203

oy 0.0611 0.5233 -0.0935

Table 6. Times of "data","space","algorithm" occurring in

query and revelant elements of record.xml

Dewey ID )(

,data

tf )(

,space

tf )lg(

,orithma

0.1 3 3 2

0.1.2 1 1 0

0.1.3 1 1 1

0.1.3.1 1 1 0

0.1.3.2 0 0 1

0.1.4 1 1 1

Table 7. The ranked result-list with element overlap

Rank Self Parent Value[e]

1 0.1 Root 1.6160

2 0.1.4 0.1 1.5779

3 0.1.3 0.1 1.5779

4 0.1.2 0.1 1.4145

5 0.1.3.1 0.1.3 1.4145

6 0.1.3.2 0.1.3 1

Table 8. Comprehensive ranking using Equation (7)

Rank Dewey ID Score

1 0.1 0.8893

2 0.1.1 0.7277

Element Retrieval Using Namespace Based on Keyword Search over XML Documents

algorithms and methods on two metrics: precision and

recall. The result of comparing NFA with the methods

that have the similar Precision and Recall on aspect of

time execution of algorithm is also presented. We set 0.9

to 1



, 0.6 to 2



, 0.9 to , and 0.1 to in Equation

(7) to perform NFA.

4.1 Experimental Setup and Results

Equipment: Our experiments are performed on a PC

with a 2.33GHz Intel(R) Core(TM) 2 Duo CPU, 3.25 GB

memory, and Microsoft Windows XP. The TermJoin

algorithm [11], semantic tree creation algorithm [12], and

NFA are all implemented in C++.

Data set: We have tested NFA on two data sets called

Dataset1 [13] and Dataset2 [14], respectively. In order to

show its performance, we add some namespaces to ele-

ments [13]. Each namespace represents the general idea

of text embedded in elements [13].

Query set: the query set consists of two parts with 13

queries that represent all kinds of queries over Dataset1

and Dataset2 in Table 9.

4.1.1 Precision and Recall

Precision is defined as the number of relevant elements

retrieved by keyword search divided by the total number

of elements, while recall refers to as the number of rele-

vant elements retrieved by keyword search divided by the

total number of existing relevant elements. We compare

the precision and recall of NFA with the Termjoin algo-

rithm [11], semantic tree creation algorithm [12] on

Dataset1 and CAS Query [7] on Dataset2. We then cal-

culate the precision and recall of top 20 components re-

trieved by each algorithm as reported in Figure 7.

As shown in Figure 7, the Term-join algorithm re-

trieves the relevant elements. However, it also retrieves

some non-relevant elements. The basic idea of the Term-

join algorithm is to calculate the degree of text matching

of elements with keywords rather than the latent semantic

information of text of elements. Furthermore, both NFA

and semantic tree creation algorithm efficiently solve the

semantic information similarity between text of elements

and keyword. However, they do not have the equal run-

ning time as given in Section 4.1.2. In [6, 7], authors pro-

vide the methods that utilize the semantic

Figure 5. Experimental result elements in record.xml re-

trieved by NFA within Equation (1) be 0.8 andin Equa-

tion (2) be 0.6



Figure 6. Experimental result elements in record.xml retrieved by

NFA with in Equation (1) be 0.6 andin Equation (2) be 0.3



Table 9. Query set on Dataset 1 and Dataset 2

Dataset1

Q11: pitch step B and octave 2

Q12: natural type

Q13: voice 1 and type eighth

Q14: music with voice 1 staff 1

Q15: music with beam begin and down

Q16: 16th type in music

Q17: 16th type and type of beam

Q18: 16th type and duration 2

Dataset2

Q21: best table in furniture

Q22: best fruit table in furniture

Q23: eat apple at the table

Q24: have coffee at the table

Q25: the list of table

Figure 7. Precision and recall on Dataset1

Figure 8. Precision and recall on Dataset2

Element Retrieval Using Namespace Based on Keyword Search over XML Documents71

Figure 9. The average running time of NFA and semantic

formation of markups in elements to calculate the se-

ussed in Section 2.1, namespace can distinguish

In terme in practice, we compare NFA to

ts with

knowledge, no existing work has

earches have been done on the area of

er related area in elements retrieval is ranking

the

tree creation algorithm over 100 thousand elements from

Dataset1 based on the queries from Q11 to Q18 in Table 9

mantic information similarity between the elements and

query. However, sometimes it can only get the relevant

components with various markups. In order to present the

difference of search results of CAS Query [7] and NFA,

we test both of them on Dataset2 consisting of elements

with namespace 'h' and 'f' nested in the same markup <ta-

ble>. Both of them are tested by queries from Q21 to Q25

in Table 9 and the precision and recall is shown in Figure

As disc

fferent elements even with the same markup which

leads to different precision of NFA and CAS in Figure 8.

4.1.2. Running Time of NFA and Semantic Tree Crea-

tion Algorithm

s of running tim

the semantic tree creation algorithm. We test both of

them on Dataset1, and plot the average running time

based on the queries from Q11 to Q18 in Table 9 over

100 thousand elements from Dataset1 in Figure 9.

The idea of NFA is to filter the relevant elemen

spect to keywords in order to reduce the running time

of the semantic tree creation algorithm [12], which ac-

cesses all elements in a collection to get the semantic

information similarity between the text of elements and

keywords. Figure 9 shows that the semantic information

of namespace in elements significantly reduces running

time compared with the semantic tree creation algorithm

over a large collection of elements.

5. Related Work

To be the best of our

formally studied on the namespaces [14] for elements

retrieval. There has been a large body of work on con-

tent-oriented of XML documents and corresponding

ranking schema.

Substantial res

king relevant matches between the content and the

query as the criteria. e.g., DIL Query processing algo-

rithm [4], Termjoin Algorithm [11] and Top-k algorithm

[5]. Jovan Pehcevski et al. [15] content that the purpose

of XML retrieval task is to find elements that contain as

much relevant information. However, some elements that

are not keyword matches may be also relevant to the

query but not return in those algorithms. The classical

method is to calculate the value of consine similarity be-

tween the content and keyword utilizing the formula of

TF-IEF, the related work have been reported in

[6–8,11,16,18,19]. Unfortunately, most of them still can-

not accurately calculate the similarity on semantic prob-

lem only by this formula. Li Deng et al. [12] present the

semantic tree creation algorithm. Other proposals are

given on semantic problem from the inner structure of

XML document (e.g., Hongzhi Wang et al. [1]; Norbert

Govert et al. [20]; Felix Weigel et al. [3]; Sihem

Amer-Yahia et al. [5]; M.S.Ali et al. [21]). However,

they have to do a large time execution. Benny Kimelfeld

et al. [5] have observed this shortcoming. They presented

the method which filters the relevant documents before

processing the algorithm. Due to the notion of methods

[22], we interestingly find that the namespace in elements

not only solve the latent semantic problems between

elements and keyword, but also filter the relevant ele-

ments based on the keyword to reduce time execution in

the traditional algorithm. The most related work to this

paper is [6,11], both of which have proposed the content

of markup or frequency of markup as a factor contributed

to semantic similarity between the content and query.

However, It cannot effectively distinguish the elements

with same markup representing different semantic infor-

mation.

Anoth

hema based on keyword search. The classical scoring

function is tf-ief (e.g., [5,23]) in information retrieval.

However, many approaches simply calculate et

tf , with

respect to all elements of the collection [9] or p con-

sider it by estimating et

tf , across elements of the same

type [25]. et

tf , is also calculated based on the concen-

tration of thxt of the element and that of its descen-

dants [25,26]. A different approach is to compute et

tf ,

for leaf-elements only, which are then used to score

leaf-elements themselves. All non-leaf elements are

scored based on combination of the score of their de-

scendants elements. The propagation of score starts from

the leaf elements and can consider the distance between

the element being considered and its descendent

leaf-elements [27]. Similar notion is adopted by the DIL

algorithm [4]. V. Mihajlovi et al. [28] rank elements us-

ing a utility function that is based not only on the rele-

vance score of an element, but also on its size.

artly

e te

Element Retrieval Using Namespace Based on Keyword Search over XML Documents

s the keyword search over elem

nymous reviewers for their

REFERENCES

[1] H. Z. Wang, J X. M. Lin, “Cod

uhr, and M. Lalmas, “Evalua

gasundaram,

A. Marian, D. Srivastava, and D

Lee, J. W. Song, and D. H. Kim, “Similar-

g, and Y. C. Wang, “An improved

aw and C. S. Lee, “TwigINLAB: A decomposi-

Schutze,” Introduc-

C. Yu, and H. V. Jagadish, “Querying

oral

cxml.org/xml/elite.xml.

www.w3schools.

[15] “HixEval: Highlighting

ring similarity of

, G. Lin, C. Botev, and J. Shanmugasundaram,

“XML

“A novel method for

onsens, and M. Lalmas, “Structural

, Y. Sagiv, and D. Yahav, “Using

anmugasundaram, “Context-sensitive

XXL

. Callan, “Parameter estimation for a sim-

t XML,” IR at INEX 2005.

6. Conclusions

This paper addresseents

tion to information retrieval,” Cambridge Press, April,

2008.

[11] S. Al-Khalifa,

in XML documents. Using the namespaces of elements,

we have presented the Namespace Filter Algorithm (NFA)

that retrieves the relevant components of XML docu-

ments with respect to keyword queries. In addition, we

provide a new approach that can remove effectively the

element overlaps occurring in query results. Using an

evaluation formula, our approach is able to produce a

ranking result-list without element overlaps. Compared

with previous algorithms, NFA has demonstrated a better

performance not only on time execution but also on the

precision and recall of query results. Our future work will

study the relation of the previous factors on the back-

ground of graph structures in XML documents.

7. Acknowledgments

We are grateful to the ano

stru

helpful comments.

. Z. Li, W. Wang, and-

sem

ing-based join algorithm for structure queries on

graph-structured XML document,” World Wide Web, Vol.

11, pp. 485–510, 2008.

[2] N. Govert, G. Kazai, N. Fting retrieval: What about using contextualrelevance?” SAC,

pp. 1114–1115, April 23–27, 2006.

[20] B. Jeong, D. Lee, H. Cho, and J. Lee,

the effectiveness of content-oriented XML retrieval,” In-

formation Retrieval, Vol. 9, No. 6, pp. 699–722, 2006.

[3] F. Weigel, H. Meuss, and K. U. Schulz, “Francois bry me

content and structure in indexing and ranking XML,”

WebDB, Vol. 17–18, pp. 68–72, June 2004.

[4] G. Lin, S. Feng, C. Botev, and J. Shanmu

“XRank: Ranked keyword search over XML documents,”

ACM International Conference Proceeding, SIGMOD, pp.

7–11, June 9–12, 2003.

[5] S. A. Yahia, N. Koudas, .

rel

Toman, “Structure and content scoring for XML,” Pro-

ceedings of the 31st VLDB Conference, pp. 362–372,

2005.

[6] T. S. Kim, J. H.

ity measurement of XML documents based on structure

and contents,” International Conference on Computational

Science (ICCS), Part 3, LNCS 4489, pp. 902–905, 2007.

[7] M. Izabel, M. Azevedo, L. P. Amorim, and N. Ziviani, “A

of st

universal model for XML information retrieval,” LNCS

pp. 312–318, 2005.

[8] Y. T. Zhang, L. Gon

TF-IDF approach for text classification,” Journal of

Zhejiang University Science, Vol. 6A, No. 1, pp. 49–55,

2005.

[9] S. C. H

tion-matching-merging approach to improving XML

query processing,” American Journal of Applied Sciences,

Vol. 5, No. 9, pp. 1199–1205, 2008.

[10] C. D. Manning, P. Raghavan, and H.

structured text in an XML database,” ACM International

Conference Proceeding, SIGMOD, June 9–12, 2003.

[12] D. Li, X. J. Wang, and L. H. Wang, “Indexing temp

XML using semantic tree index,” IEEE Xplore, pp. 448–

451, 2008.

[13] http://www.musi

[14] Namespaces in XML Available: http://

com/XML/xml_namespaces.asp/.

J. Pehcevski and J. A. Thom,

XML retrieval evaluation,” LNCS 3977, pp. 43–57, 2006.

[16] T. S. Kim, J. H. Lee, J. W. Song, and S. L. Lee, “Semantic

ctural similarity for clustering XML documents,” Inha

University Technical Report, 2006. http://webbase.inha.

ac.kr/TechnicalReport/tech_04.pdf.

[17] C. Yang and N. Liu, “Measu

i-structured documents with context weights,” ACM

International Conference Proceeding, pp. 719–720, Au-

gust 6–11, 2006.

[18] S. Feng

“Efficient keyword search over virtual XML views,”

VLDB, pp. 1057–1065, September 23–28, 2007.

[19] K. Sauvagnat, L. Hlaoua, and M. Boughanem,

asuring semantic similarity for XML schema match-

ing,” Expert Systems with Applications, Vol. 24, pp.

1651–1658, 2008.

[21] M. S. Ali, M. P. C

evance in XML retrieval evaluation,” Proceedings of

the SIGIR Workshop on XML and Information Retrieval,

pp. 2–8, July 27, 2007.

[22] B. Kimelfeld, E. Kovacs

language models and the HITS algorithm for XML re-

trieval,” LNCS 4518, Springer-Verlag Berlin Heidelberg,

pp. 253–260, 2007.

[23] C. Botev and J. Sh

keyword search and ranking for XML,” WebDB, 2005.

[24] B. Sigurbjornsson, J. Kamps, and M. de Rijke, “The effe

ructured Queries and selective Indexing on XML re-

trieval,” INEX’05, LNCS 3977, pp. 104–118, 2006.

[25] M. Theobald, R. Schenkel, and G. Weikum, TopX &

at INEX 2005.

[26] P. Ogilvie and J

ple hierarchical generative model for XML component re-

trieval,” INEX, 2004.

[27] S. Geva, “GPX-gardens poin

[28] V. Mihajlovic, G. Ramirez, T. Westerveld, D. Hiemstra, H.

E. Blok, and A. P. de Vries, “Vague element selection,

image search, overlap, and relevance feedback,” INEX

2005.