Evaluating Relational Ranking Queries Involving both Text Attributes and Numeric Attributes

doi:10.4236/jsea.2012.512B018

Paper Menu >>

Journal Menu >>

A Journal of Software Engineering and Applications, 2012, 5, 88-93

doi:10.4236/jsea.2012.512b018 Published Online December 2012 (http://www.scir p.org/journal/jsea)

Evaluating Relational Ranking Queries Involving Both

Text Attributes and Numeric At t ri b u t es

Liang Zhu1, Zhaoliang Xie1, Qin Ma2

1Key Laboratory of Machine Learning and Computational Intelligence, School of Mathematics and Computer Science, Hebei

University, Baoding, Hebei 071002, China; 2Department of Foreign Language Teaching and Research, Hebei University, Baoding,

Hebei 071002, China

Email: zhu@hbu.edu.cn, xiezhaoliang533@163.com, maqin@hbu.edu.cn

Received October 12, 2012

ABSTRACT

In many database applications, ranking queries may reference both text and numeri c attr ib utes, whe re t he ra nki ng f unc-

tions are based on both sema ntic distance s/similarities for text attributes and numeric d istances fo r numeric attributes. In

this paper, we propose a new m eth od for evaluating such t yp e o f ranking queries over a relational database. By statis tic s

and training, this method builds a mechanism that combines the semantic and numeric distances, and the mechanism

can be used to balance the effects of text attributes and numeric attributes on matchin g a give n query a nd tuple s in da-

tabase search. The basic idea of the method is to create an index based on WordNet to expand the tuple words semanti-

cally for te xt attribute s and on the i nformation o f numeric attributes. The ca ndidate results for a query are retrie ved by

the index and a simple SQL selection statement, and then top-N answers are obtained. The result s of extensive experi-

ments indica te that the performance of this new strategy is efficient and effective.

Keywords: Relational Datab ase; Ranking Query; Sema ntic Dista nce; N umeric Distance; WordNet

1. Introduction

A r elati onal ran king q uer y (or top-N query) is to find the

N tupl es that satisf y the quer y cond ition t he best but not

necessarily completely, and the results are sorted ac-

cording to a given ranking function. Researches on

top-N queries have intensified since late 1990s [1-3],

and most of the researches involve numeric attributes

and use a numeric distance function (say, Lp-norm dis-

tances, p =1 , 2 , and ∞) to reduce a massive result set of a

conventional query to a few of the most relevant an-

swers . However, there are many applications where

ranking queries involving both text attributes and nu-

meri c attributes are available for processing.

Example 1. Consider a database of used books with

schema: Books(id#, title, pr i ce, year,…). Suppose a

customer wants to buy a used book with title on “film”,

price around “$50” and year about “2000”, where title is

a text attribute with semantics, and pri ce and yea r are

two numeric attributes. Obviously, a book on “algebra”

with price = “$50” and year = “2000” is not the desired

result for the cus tomer though the values for pric e and

year exactly match that of the query respectively.

Another book on “movie”, however, with pri ce = “$53”

and year = “2001” maybe satisfy the need of the cus-

tomer .

Ipum99 Occ50

idx A29 A40 A50 num occ50

87197 42 2 20000 2 Airplane pilots and

navigators

6505 28 3 25000 3 Architects

80789 51 930 33000 930

Gardeners, except

farm, and

groundskeepers

10860 9 999 999999 999 N/A (blank)

Figure 1 . Parts of IPUMS Census Database.

Exa mple 2. As shown in F igure 1, database IPUMS

has two relations Ipum99 and Occ50, which come from

[4]. In relation Occ50(num, occ50), its primary key num

is the value label of occ upation1950. Relatio n Ipum99

has 61 nu meric attributes where A29 is age, A50 means

income, and A40 is the foreign key referencing

Occ50.num. Furthermore, Ipum99 is added an attribute

idx by us as ide nti fier . A user is looking for the informa-

tion of “a horticultu rist with age about 50” from

IPUMS. Since there is no such word “horticulturist” in

Occ50.occ50, the answer will be nothing by using the

Evaluating Relational Ranking Queries Involving bot h Text Attributes and Numeric Attributes

traditional SQL selection statement. In fact, there is a

tuple (“Gardeners, except farm, and groundskeepers”,

age = 51, …) (with gray color) in IPUMS may be an

answer for the user.

Word Ne t::Similarity [5] is an open source Perl mod-

ule for measuring the semantic distance/similarity be-

tween two words; however, it is not easy for us to use

the source dir e ctly to evaluate ranking que ries [6].

Researches on semantic search in IR and SW (sema n-

tic web) have gained attention since 1990s. Taking the-

saurus ontol ogy na vigati on as a step i n query expansion,

many query expansion techniques are employed in key-

word search, say, query ter ms are expanded to WordNet

synonyms and meronyms, using the Boolean OR opera-

tions available i n most web search engines [7]. Different

from the above query expansion techniques in IR and

SW, we discuss semantic match between query words

and tuple words in relational databases via tu p le expan-

sion.

In recent years, ontologies have been used to build

applicatio ns in database com munity [8, 9]; however, it i s

a challenging jo b to construct efficient ontologies [9, 6].

Our method is different from the ontology-based tech-

niques. Firstly, instead of dealing with the challenge of

constructing ontology, we create a simple table main-

tained by RDBMS, and store information of index and

ranki ng funct ion into the table. Secondly, our strategy is

more general since it is based on WordNet. Finally, our

techniques are more efficient than the above ontolo-

gy-based methods due to the efficiency of RDBMS.

Another different yet related research topic is key-

word search in relational databases [10-12], which sup-

ports free-form keyword search in relational databases

without necessarily requiring the users to know the da-

tabase schema. Keyword search may be suitable for any

text attribute, but it is exact search without dealing with

semantic match. For instance, it cannot return the books

on “movie” for the quer y keyword “film”. If query words

match exactly tuple words, the results of semantic search

will contain those of keyword search, and improve the

effecti v eness of keyword search b y us ing our metho d.

There are t wo challe n gi ng p ro b lems for evaluating the

type of relatio nal ranking queries i n this p aper. The first

is how to design a good semantic distance function that

measure the semantic simila r ity between the query

words and the tuple words, [6] presented a solution for

this problem. The second is how to combine the seman-

tic and numeric distances. We empl oy statistics and

training to solve the problem, and create an index to

process ranking queries in terms of semantic and nu-

meric matching in database search. Moreover, this work

is a continuation of the work in [6], which studied the

processing of rela tional ranking queries only with text

attributes, witho ut numeric attr ibutes.

2. Problem Definition and Ranking

Function

Assume that R0(idx, A, B1, B2, …, Bm, …) is a relation

with identi fier idx, where A is a text attribute with natu-

ral l angua ge semantics and B1, B2, …, Bm are m numeric

attributes. Using the project operatio n

A(R0) (duplicate

tuples are eliminated), we get a new relation R(tid, A)

with ident ifie r tid. Furthermore, R0 is added by an

attribute FKid that is the foreign key referencing R.tid,

and the n the schema of R0 becomes R0(idx, A, B1, B2, …,

Bm, …, FKid). Or, let R(tid, A) and S(idx, B1, B2, …,

Bm, …, FKid) be two relations and S.FKid be the fore ign

key referencing R.tid, then R0 = R S with S.FKid =

R.tid.

Let t be a tup le in R0, then t[A] = (tw1, tw2, …, twn) is

the word -stri ng with n wor ds o n the te xt attribute A, a nd

t[Bj] = bj is the numeric value on t he attribute Bj (1 ≤ j ≤

m). For simplicity, we de note t = (tA, b1, b2, …, bm)

where tA = (tw1, tw2, …, twn), and call twi a tuple word

and bj a tuple value (1 ≤ i ≤ n, 1 ≤ j ≤ m).

As defined in [6], given a tuple word w, its kinship

words include the five kinds of words in WordNet: (1)

word w itself, (2) morph, (3) synonyms, (4) the imme-

diate hyponyms (subordinates), and (5) the immediate

hypernyms (superordinates). The s et of all kinship wo rds

of w is denoted by K(w). For instance, the kinship word

set of “computers” is K(computers) ={computers, com-

puter, data processor, machine, internet site, calcula-

tor, …}.

Consider a ranking query q = (qA, q1, q2, …, qm),

where qA = (qw1, qw2, …, qwk) is a wo rd-st ring with k

query words, and qj is a numeric value (1 ≤ j ≤ m). For

tuple t = (tA, b1, b2, …, bm) in R0, denote pj = |qj − bj| (1 ≤

j ≤ m), and dA = d(qA, tA) that is the se man tic distance

between qA and tA defined by Definition 1 to 3 in [6].

Moreover, dA belongs to the interval (0, 1].

We need to find a mechanism c o mbi ni ng t he s e ma nti c

and numeric distances dA and pjs, which can be used to

evaluate the query q over the relation R0, i.e., to define a

ranking function d(q, t ) =

(dA, p1, p2, …, pm).

Intuitively, the ranking function d(q, t) should satisfy

the fo llowing: First, a smaller d(q, t) indicates closer the

pair (q, t). Second, d(q, t) needs to balance the effects of

dA and pjs i n matc hi ng q wit h t. Finall y, it sho uld be easy

to implement.

To obtain d(q, t), we use the statistics of the domain s

of B1, B2, …, Bm and training.

Since the semantic di stance dA ∈ (0, 1], we normalize

it by scientific notation, dA =

×10

−

h, where 1.0 ≤

<10.0, and h is a no nnegative number (i.e., h≥ 0, if h = 0,

Evaluating Relational Ranking Queries Involving both Text Attributes and Numeric Attributes

we define −0 = 0), say, dA = 0.001041 = 1.041×10

−

3. W e

will see that h (the ab solute value of the expo nent) p lays

an important r ole in the ranking function d(q, t).

In collecting statistics, there is a step of cleaning data

and removing outlier, and then we get Min(Bj) and

Max(Bj) of numeric attribute Bj (1 ≤ j ≤ m). Based on the

semantics of attribute Bj, we obtain its reasonable unit

Zj >0, say, Zyear= 1 for attrib ute year and Zprice = 0 .01 for

price respectively in database BOOK.

Let Mj = Max(Bj) − Min(Bj) =

j × 10

, where 1.0 ≤

j <10.0 (Mj ≥ 0, obviously. If Mj = 0, let Mj := Zj then

Mj > 0 ). If Mj ≥ 1, we have cj ≥ 0, else if 0<Mj <1, let

Mj := Mj /Zj, and pj:= pj /Zj, then Mj ≥ 1 and cj ≥ 0.

Let ej := cj + 1 ≥ 1, for 1 ≤ j ≤ m. By using the sta tis-

tics and t rainin g in our experiments, we obtain the rank-

ing function d(q, t) =

(dA, p1, p2, …, pm) below.

d(q, t) = dA + Σ

j1=

(pj/Mj)

eh /

For example, in our experiments for relation Books

(title, price, year,…) with 56180 tuples. price is in (0,

1000] (except 633 tuples with price > 1000), and year

belongs to [1958, 2008] (except 651 tuples with year <

1958). Let attribute A, B1 and B2 be title, price and year

respectively, then M1 = 1000 − 0 = 1000 = 1.0 ×103, M2

= 2008 − 1958 = 50 =5.0×101, and the n c1 = 3, c2 = 1, e1

= c1+1 = 4 and e2 = c2+1=2. Thus, d(q, t) = dA +

( p1/1000)h/4 + (p2/50) h/2.

For query q, we will return dis-N tuples defined in [6]

to replace top-N ones. Let T be a set of tuples, a tuple

t∈T is called a dis-N tuple of q, if d(q, t) ≤ minN {d(q, ti)

| ti ∈T}, whic h means the Nth minimum value in the set

{d(q, ti) | ti ∈T }.

3. Creation of sn-Index

We extend w-index in [ 6] to sn-index ( stand s for seman-

tic and numeric index) in t his s ection.

3.1. Information Stored in sn-Index

A relation is employed to store the information in

sn-index, which is called IndexTable with schema In-

dexTable(id#, Word, Size, dbNSize, DBValue, BValue,

bSize), say, (2490, nights, 2, 2, '47898,0,3,1;2413,0,…',

'47898,8185,10.80,1997;…', 3) is a tuple in IndexTable

for database BOOK.

The attribute Word indicates the kinship word of tuple

words, and the relevant information will be stored in the

attribute DBValue. The value of DBValue is a string

wi th f orm “ tid,d,n ,f ;tid,d,n,f;…;” where “tid,d,n , f ;” is a

node, d = 0 , 1 , . . . , 5 , “d = i” is the subscript i of s eman-

tic distance di defined by Definition 1 in [6] , n is the

number of tuple words in the tuple with tid, and f is the

frequency in the set of kinship words of the tuple. The

attribute Size is the number of tids associated with kw,

and the duplicate tids are counted repeatedly. The

attribute dbNSize means the number of nodes in DBVa-

lue.

Attribute BValue is a string as “tid,id x , b1,b 2,…,bm;

tid,idx,b1,b2,…,bm;…;”, where “ tid,idx, b 1,b 2,…,bm;” is a

node, tid is the identifier of a tuple in the relation R(tid,

A), and idx is that in R0(idx, …). The attrib ute bSize

means the number of nodes in BValue.

3.2. Procedure of Creating sn-Index

Reconstructing w-index in [6], we obtain sn-index as

sho wn i n Figure 2. Sn -index consists of four parts: (1) a

hash table wi th a hash function h(), its each bucket con-

tains a pointer pWnPointer which points to a node in

wn-list; (2) one wn-list, its node has struc ture {iRow, kw,

size, dbNSize, bSize, pDBList, pBList}; (3) s db-lists, the

structure of the node is {tid, d, n, f}; (4) s b-lists, node

structure {tid, idx, b1, b2, …, bm}. The main difference

between sn-index and w-index in [6] is that b-lists (i.e.,

Bij’s in Figure 2) are added. The algorithm of creating

sn-index is sho wn as follows:

Ds1

Ds2

…… ……

…

hash tablewn-list db-list and b-list

Figure 2. Structure of sn-index.

Step 1. For the special character string tA = (tw1, …,

twn), calculate its hash value h=h(tA); check the pointer

pWnPointer in the bucke t of t he hash t able , and wn-list,

(1) If pWnPointer is NULL, or (tw1, …, twn) is not in

wn-list,

(a) Create a new db-list, and a dbNode as tid =

tA.tid, d := −1 (means the special case), n := 1

and f :=1. Insert the dbNode into the db-list.

(b) Using tid above, search (idx, b1, b2, …, bm)

from R0 (or from S) b y the fol l o win g the SQ L

selection statement

Select idx, B1, B2, …, Bm from R0 where

FKid = tid

Evaluating Relational Ranking Queries Involving bot h Text Attributes and Numeric Attributes

Let z be the size of the result set {(idx, b1,

b2, …, bm)}, Create a new b-list, and z

bNode(s), insert the bNode(s) into the b -list.

twn), and dbNSize:=1 and bSize:= z. Insert the

wnNode into wn-list according to t he alphabet

order of kw.

(d) Get the pointers pDB L i st, pBList and

pWnPointer.

(2) Else if the string t is found out in wn-list, then in-

crease dbNSize by 1 in the wnNode with t, create

a dbNode as above, and let it be the first node of

the db-list. Additionally, by the SQL selection

statement as (b) above, get a result set {(idx, b1,

b2, …, bm)} with siz e z ≥ 1. Create z bNode(s), in-

sert the node(s) into the b-list, and increase bSize

by z.

Step 2. For each tuple word twi ∈ t, and eac h kinship

word kw of twi, get h=h(kw), check pWnPointer in the

bucket of the hash table, and wn-list,

(1) If pWnPointer is NULL, or kw is not in wn-list,

do the same jobs as the above (1) in Step1 ex-

cept for replacing tA by kw and defining d = k, k

∈ {0, 1, …, 5} such that dk is d(kw, twi) in De-

finition 1 in [6].

(2) Else if kw has be en in wn -list, then get its db -list

and b-list. There are two cases.

Case 1, in the db-list, there is a dbNode with

dbNode.tid = t.tid, replacing d by the

smaller if the two distances are different,

and increasing f by 1, that is OK.

Case 2, if no dbNode in the db-list satisfies

dbNode.tid = t.tid, increase dbNS i ze by 1

in the wnNode with kw, creat e and insert a

new dbNode into the db-list according to

the increasing order of distance d. In addi-

tion, by the SQL selection statement as (b)

above, get a result set {(idx, b1, b2, …, bm)}

with size z ≥ 1. Create z bNode(s), insert

the node( s) into the b -li s t , and bSize:= z.

Step 3. Storing sn-index.

To evaluate query q, we use two storing strate g ie s.

Strategy-1, the entire sn-index is in main memory.

Strategy-2 will store db-lists and b-lists in fixe d d i sk a nd

only load the hash table and wn-list into main memory.

4. Evaluation of Ranking Query

For query q = (qA, q1, q2, …, qm), firstly, matching the

query words with kinship words of R(tid, A) via

sn-index, we obtain the set T (= {tid}) of identifiers of

candidate tuples, and compute the semantic distances

between q and its candidate tuples, and then ge t the set L

of information of numeric attributes by each identifier

tid in T; secondly, compute d(q, t) =

(dA, p1, p2, …, pm)

between q and each of its cand idate tuples, and then ob-

tain {idx}, which is the sorted set of identifiers of dis-N

results according to d(q, t); finally, we retrieve the dis-N

tupels from underlying database and display the ranked

answers.

To get the set T is an important step for query

processing. The intermediate results are stored in a tem-

porary list, denoted by T-List. Its node has the structure

{tid, d[K], dA} where K is the maximum number of

query words for qA = (qw1,…, qwk). If k > K, let qA =

(qw1, …, qwK) (K = 30 in our implementatio n s). The

intermediate results for obtaining set L are saved in a

temporary list L-List, and the str uct ur e of i ts node is {dA,

p1, p2, …, pm, d, tid, idx}

We discuss Strate g y-2 of storage first (i.e., we only

load the hash tab le and wn-list in main memory).

(1) No r malization of qA. We remove some symbols,

character strings or stop words, and replace some of

them by normal strings for qA. Denoted again by qA =

(qw1, …, qwk).

(2) For each qwi ∈ qA (i= 1, …, k), calculate hash

value h=h(qwi), and check qwi in wn-list. If qwi is found

out, we get the wnNode in wn -lis t, and t he n use the SQ L

selection statement

select ∗ from IndexTable whe re id# = wnNode.iRow

to obtain dbNode(s). For each dbNode, we save the val-

ues i nto T-List , and co mpute dA= d(qA, tA), then obtain T.

(3) If above T≠∅, for each tid∈T, we get its corres-

ponding bNode(s) by using select statements from In-

dextable, store them into L-List, compute d :=

(dA,

p1, …, pm), and then obtain the set L (={idx}) of candi-

date dis-N tuple identifiers.

(4) Given a po sitive inte ger N, we get the set of dis-N

tuple identifiers LN ⊆ L; then, we ob tain t he d is-N tuples

of the query q by using the SQL selection sta tement as

the followin g format:

Select R0.*, R.A from R0, R where (R0.FKid = R.tid)

and (R.idx in LN) and …

(5) Display the dis-N tuples sorted by d(q, t).

Next, we discus s Strategy-1 of stor age, i.e., t he entire

w-index is saved into main memory. The query

processing is similar to the above situations, except for

steps (2) and (3), we can get dbNode(s) and bNode(s)

from sn-index directly, without using select statements.

Thus, the response time of Strategy-1 will be smaller

than that of Strategy-2.

Example 2 (cont.). For the dis-5 query “a ho rticu l-

turist with age=50 and income=30000” against IPUMS,

the answers with f o rm (idx, tA, age, income, d(q, t)) are

t1: (80789, tA, 51, 33000, 0.609496),

t2: (07296, tA, 48, 25000, 0.729997),

t3: (77380, tA, 37, 30600, 0.747815),

Evaluating Relational Ranking Queries Involving both Text Attributes and Numeric Attributes

t4: (43851, tA, 51, 21894, 0.763280), and

t5: (73792, tA, 47, 25000, 0.766698),

where tA = t[occ50] = “Gardeners

，

except farm, and

groundskeepers”.

5. Experimental results

Our experiments are carried out using Microsoft’s SQL

Server 2000 and VC++6.0 o n a P C with Windows XP,

Intel(R) Core(TM) i5-2400/3.10GHz 3.09GHz CPU, and

2.98GB memory. In addition, WordNet 2.1 and its API

functi ons, ODBC, and OD BC API functions are use d.

Datasets: We use two real datasets. The first one is

IPUMS with two relations that is a fragment of US

Censu s Bureau data [4]. T he relation Occ50(num, occ50)

contains 286 tuples with 2 attributes. The relation

Ipum99(idx, FKnum, age, income, ...) has 61 attributes

and 88443 tuples, where age is A29, income is A50, and

FKnum is A40 that is the foreign key referencing

Occ50.num. R0 = Ipum99 Occ50 with FKnum= num.

Pa rt of IPUMS is shown in Example 2 in Section 1.

The second dataset BOOK comes from the Library at

Beijing University of Technology, which is the record

set of English books in the Library, and produces two

relations. One is Books(id#, isbn , title, author, publisher,

price, yea r, FKid) having 56180 tuples. The other rela-

tion Titles(tid , title) has 48107 tuples (duplicate titles are

removed). In addition, Books and Titles act as R0 and R

respectively. Books.id# corresponds to R0.idx, and

Books.FKid is the foreign ke y referencing Titles.tid.

Attributes tid, A, B1, and B2 described in Section 2

correspond num, occ50, age and income for IPUMS,

and tid , title, price and year for BOOK, respectively.

Space cost of sn-index: Str ate g y-1 and Strateg y-2 are

used for IPUMS and BOOK respectively. The main

memory space costs are: index-space-IPUMS is about

3.6MB, and index-space-BOOK about 3MB.

Workloads: We build a program to create a workload

that is a set of 100 queries for each database. First, we

choose 100 tuples from R randomly, and then for each

tuple t, select 1~10 kinship word (s) from K(t) randomly,

where the numbers of simple queries with 1~3 words

and complex queries with 4~10 words are both 50. We

classify t hem int o 10 groups Gi (i =1, 2, …, 10 ), and the

query in Gi has i search word(s). The size of Gi is ran-

dom.

For the legends in the following figures, the suffixes

“1”, “2”, …, and “100” indicate the dis-1, dis-2, …, and

dis-100 queries, respectively.

5.1. Elapsed Time

Figure 3 illustrates the average elapsed time for execut-

ing all queries in eac h Gi for IPUMS by four teen curves

itN’s and dtN’s, which stand for index-time’s and

DB-time’s respectively. The seven curves it1 to it100 are

(almost) the same, and are related to queries, but not to

N (N = 1, 2, …, 100), whic h are from 11 to 230 millise-

conds. The other seven curves dt1 to d t100 are the costs

of retrieving tuples from DB by using SQL selection

statements for the natural join of Ipum99 and Occ50,

which are smal l er than 600 milliseconds.

100

200

300

400

500

600

12345678910

the number of query words (ipums)

elapsed time (ms)

dt1 dt2 dt3 dt10 dt20 dt40 dt100

it1 it2 it3 it10it20it40it100

Figure 3. The elapsed times for IPUMS

Figure 4 shows the elapsed time with fo urteen curves

for BOOK. The seven curves it1 to it100 are (almost)

the same, and are related to queries, b ut not to N (N = 1 ,

2, …, 100), which are between 34ms and 346ms. Curves

dt1 to dt100 show the average elapsed times for access-

ing database to retrieve tuples. The result sets are dif-

ferent for various dis-N queries. The larger N means the

longer the elapsed time. The DB-times are less than

20ms for 1≤ N ≤ 3. If 10≤ N ≤ 100, the DB-times are

between 50 and 300ms.

100

200

300

12345678910

the number of query words (book)

elapsed time (ms)

dt1 dt2 dt3 dt10dt20dt40 dt100

it1it2it3it10it20 it40 it100

Figure 4. The elapsed t imes for BOOK

5.2. Precision

It is difficult to confirm whether one of tuples retrieved

matches a query semantically by a computer, and it is

too big a job to recognize se mantic match manuall y for a

large dataset [6]. Therefore, the traditional recall for

evaluating IR systems is not suitable for measuring se-

mantic match when the dataset is large, and then we

Evaluating Relational Ranking Queries Involving bot h Text Attributes and Numeric Attributes

report only precision . Figures 5 and 6 illustrate the av-

erage precision for IPUMS and BOOK respective ly.

We can see that a smaller N has a larger precision;

therefore, a smaller N indicates more matching tuples

appear in its dis-N results. The precisions for IPUMS

are 1, 0.93, 0.90, 0.73, 0.62, 0.5 2, and 0.44 for N = 1, 2,

3, 10, 20, 40, and 100, respectivel y. The precisions for

BOOK are 0.93, 0.86, 0.78, 0.63, 0.55, 0.45, and 0.34

for N = 1, 2, 3, 10, 20, 40, and 100, respectively.

0.2

0.4

0.6

0.8

12345678910

the number of query words (ipums)

Precisions

p1p2p3 p10

p20 p40 p100

Figure 5. Precisions for IP U MS

0.2

0.4

0.6

0.8

12345678910

the number of query words (book)

Precisions

p1 p2p3 p10

p20 p40 p100

Figure 6. Precisions for BOOK

6. Conclusions

We proposed a new method to evaluate relational ra nk-

ing queries that reference both text attributes and nu-

meri c attributes. The meth od builds a ranking function

combining the semantic and numeric distances, and

creates an index based on WordNet to expand the tuple

words semantically for a text attribute and on the infor-

mation of numeric attributes. Thus, the results for a

query are retrieved by the index and a simple SQL se-

lection statement for the natural join of relations, and

ranked accord ing to the ranking function. We conducted

extensive experiments to measure the performance of

this new technique using two real datasets. The results of

experiments demonstrated that our strategy is efficient

and e ffective.

7. Acknowledgements

T hi s wo r k is supported in part by NSFC ( 61170039) and

the NSF of Hebei Province (F2012201006). The authors

would also like to express their gratitude to Professor

Weiyi Meng for providing many helpful suggestions.

REFERENCES

[1] M. Carey, and D. Kossmann, “On saying ‘enough al-

read y!’ in SQL,” SIGMOD 1997 Proceedings ACM in-

ternational conference on management of data, Vol. 26

No. 2, 1997, pp. 219-230.

[2] I. F Ilyas, G. Beskales, and M. A. Soliman, “A survey of

top-k query processing techniques in relational database

s yst em s,” ACM Computing Surveys, Vol. 40, No. 4, 2008,

Article 11.

[3] L. Zhu, W. Meng, C. Liu, W. Yang, and D. Liu,

“Processing top-N relational queries by learning,” Jour-

nal of Intelligent Information Systems. Vol.34, No.1,

2010, pp.21-55, doi:10.1007/s10844-009-0078-7.

[4] IPUMS Census Database, “ipums.la.99.gz”, 1999,

http://kdd.ics.uci.edu/databases/ipums/ipums.html

[5] T. Pedersen, “WordNet::Similarity,” 2008,

http://www.d.u mn.edu/~tpederse/similarity.html

[6] L. Zhu, Q. Ma, C. Liu, G. Mao and W. Yang. “Seman-

tic-distance based evaluation of ranking queries over re-

lational databases,” Journal of Intelligent Information

Systems, Vol. 35, No. 3, 2010, pp. 415-445.

[7] D. Buscaldi , P. Rosso, and A. E. Sanchis, “A word-

net-based query expansion method for geographical in-

formation retrieval,” Working Notes for the CLEF

Workshop, Vienna Austria, 2005.

[8] S. Das, E. Chong, G. I. Eadon, and J. Srinivasan, “Sup-

porting ontology-based semantic matching in RDBMS,”

Proceedings of the Thirtieth International Conference on

Very Large Data Bases (VLDB ’04), Toronto, Canada,

2004, pp. 1054-1065.

[9] J. Zhang, Z. Peng, S. Wang and H. Niehang,

“Si-SEEKER: Ontology-based semantic search over da-

tabases,” Knowledge Science, Engineering and Man-

agement, Guilin, China, Vol. 4092, 2006 , pp. 599-611.

[10] V. Hristidis, L. Gra vano, and Y. Papakonstantinou, “Ef-

ficient IR-style keyword search over relational databas-

es,” In Proceedings of 29th International Conference on

Very Large Data Bases (VLDB’03), Berlin, Ger many,

2003, pp. 850-861.

[11] F. Liu, C. Yu, W. Meng and A. Chowdhury, “Effective

Keyword Search in Relational Database,” 26th ACM

SIGMOD/PODS international Conference on Manage-

ment of Data/Principle of Database Systems, Chicago,

Illinois , USA, 2006, pp. 563-574.

[12] L. Zhu, Y. Zhu, and Q. Ma, “Chinese Keyword Search

over Relational Databases,” World Conference on

Science and Engineering (WCSE), Wuhan, China, Vol. 1,

2010, pp. 217 – 220.