Chinese Keyword Search by Indexing in Relational Databases

doi:10.4236/jsea.2012.512B021

Paper Menu >>

Journal Menu >>

A Journal of Software Engineering and Applications, 2012, 5, 107-112

doi:10.4236/jsea.2012.512b021 Published Online December 2012 (http://www.scirp.org/journal/jsea)

Chinese Keyword Search by Indexing in Relational

Databases

Liang Zhu1, Lijuan Pan1, Qin Ma2

1Key Laboratory of Machine Learning and Computational Intelligence, School of Mathematics and Computer Science, Hebei

University, Baoding, Hebei 071002, China; 2Department of Foreign Language Teaching and Research, Hebei University, Baoding,

Hebei 071002, China.

Email: zhu@hbu.edu.cn (L. Zhu), katharine162@126.com(L. Pan), maqin@hbu.edu.cn(Q. Ma)

Received Octob e r 15, 2012.

ABSTRACT

In this paper, we propose a new method based on index to r ealize IR-style C hine se key word sea rch with ra nki ng str ate-

gies in relational databases. This method creates an index by using the related information of tuple words and presents a

ranking strategy in terms of the nature of Chinese words. For a Chinese keyword query, the index is used to match

quer y search words and the tuple words in index quickl y, and to compute si milaritie s between the quer y and tuples by

the ranking strategy, and then the set of identifiers of candidate tuples is generated. Thus, we retrieve top-N results of

the query using SQL selectio n statements and output the ranked answers according to the similarities. The experimental

resul ts show that our method is effi cient and effecti ve.

Keywords: Relational Databa se; Chinese Keyword Search; Index; Ranking Strategy

1. Introduction

For a database system, keyword search with the gener-

al-purpo se que ry e ngine uses us er-supplied data to query

the contents of string pr operties that store keywords, and

then requires users to have the knowledge of database

sche ma and a que r y la ng ua ge (s a y, SQ L) . I nspired by the

success of free-form keyword search on information re-

trieval (IR) and Web search engines, i.e., it is popular to

users who need not know q uery lang uages and the struc-

ture of underlying data. Researches of English keyword

search with IR-style free-form in relational databases

have been extensively studied since 2002[1-7]. [1] and [2]

join tuples from multiple relations in the database to

identify tuple trees with a ll the quer y keywords, for each

enumerated join tree, both of them simply rank join se-

quences according to the number of joins. ObjectRank

system [3] applies authority-based ranking to keyword

search in database modeled as labeled graph. [4] pro-

posed a method (G-KS) for selecting the top-N candi-

dates based on their potential to contain results for a

given query. [5] succeeded in putting the model of com-

puting similarities in IR into computing similarities be-

tween a candidate answers and tuples in relational data-

bases, the methods pay more attention on effectiveness

of keyword search. [6] proposed a middleware free ap-

pro a ch to c o mput e such m-keyword queries on RDBMSs

using SQL only. In this paper, we will discuss Chinese

keyword search in a relational database based on an in-

dex.

Chinese is totall y different from Englis h. For instance,

(1) Chinese words are tighter in writing, unlike English;

there is no space between words. (2) Chinese words are

coded by using GB2312-80 and each Chinese word is

stored in two consecutive bytes in memory, while an

English word is composed of letter(s) and each letter

takes one byte, say, the Chinese word “人” (means “per-

son”) is 0xC8CB with two bytes in memory; however,

English word “person” takes 6 b ytes. (3) Abbreviations

in English are acronyms, such as “WWW” is short for

“World Wide Web”; however, abbreviations in Chinese

are extracted words from a phrase, say, “高代” means

“高等代数” (Advanced Algebra).

Example 1. As shown in Figure 1, database BOOKS

has three relations/tables: Titles(tid, title, Faid, Fpid, ...),

Authors(aid, name,...), and Publishers(pid, pname,...).

Without loss of generality, we suppose the relationship

between Aut hors and Titles is “one to many”, not “many

to many”. We regard all authors of a book together as

one author, since the keyword search will be processed

by each Chinese word and the index will be created by

using the text attributes of Entities (say, title, name,

Chinese Keyword Search by Indexin g i n Relational Databases

108

pname), not usin g the id’s i n rela tionships. Mor eover, the

1-to-n type relationship may be more feasible in practice

in our scenarios (say, a string of Chinese words will be

concatenation of names of two or more authors if there is

an error of typing). For query Q = {高代; 高教社}, a

user want to get the answers of “title = 高等代数, au-

thor =高代(if any), publisher = 高等教育出版社

(Higher Education Press)”. Traditional DBMS cannot

obtain suc h res ults, a nd it is d ifficult for us to use direct-

ly the existing methods of English ke yword search mod-

els in Chinese keyword query.

Titles

tid title Faid Fpid

t1 高等代数 a1 p1

t2 环境化学 a2 p1

t3 民法教程 a3 p3

Authors Publ i sh ers

aid aname

pid pname

a1 北京大学数学系 p1 高等教育出版社

a2 戴树桂 p2 新华出版社

a3 江平 p3 中国政法大学出版社

Figure 1. Part of BOOKS database

Inspired by the technique of creating index of tuple

words in [8], we present a new method to perform Chi-

nese keyword search b y i ndexing. This wo rk is a co ntin-

uation of [9], which studied Chinese keyword search

with only one relation; however, this paper discusses a

relational database with multiple rela tions, and it is more

chall enging tha n the work in [9].

We will build an index based on the information of

tuple words and improve the classic ranking strategy in

IR. For a Chinese keyword query, its top-N answers will

be obtained by the index and the improved ranking

strategy.

2. Data Model and Query Model

Consider a database with n relations R1, …, Rn. Each

relation Ri has mi text attributes Ai1, Ai2, ..., Aimi, a primary

key and possibly foreign key(s) referencing other rela-

tion(s).

Definition 1 (Sche ma Graph) [2 , 4]: A directed graph

captures the primary key-foreign key relationsh ips in the

schema of the database. It has a node for each relation Ri

of the databas e an d an edg e Ri→Rj for each f oreign key to

primary key relationship from a set of attributes (Aib1, … ,

Aibt) of Ri to a set o f attributes (Ajb1, …, Ajbt) o f Rj, where

Aibk ≡ Ajbk for k =1,. . ., t.

Definition 2 (Tuple Tree) [2, 5] A tuple tree T is a

joining tree of tuples. Each node ti in T is a tuple in the

base relation Ri. For each pair of adjacent tuples ti, tj ∈ T,

where ti∈Ri, tj∈Rj, there is an edge Ri→Rj and ti tj ∈ Ri

Rj.

Definition 3 (Tuple Word): For each tuple t belong s to

Ri and A∈{Ai1, Ai2 ,..., Aimi}, t[A] is the attrib ute value on

attribute A of relation Ri, it contains single or multiple

Chinese words. We define every single Chinese word as a

Chinese tuple wo rd.

Definition 4 (Index Table): An Index Table is com-

posed of tuple words and their related information ex-

tracted from the database, its schema is TupleWordTable

(wordid, word, size, DBValue), where wordid is the pri-

mary key, word is the tuple word, si ze is the number of

text attributes that co nta in the co rr esp ondi ng tuple wor d,

DBValue is a text attribute with form “cid, df, tid, tf, dl;

cid, df, tid , tf, dl;…;”. In DBValue, cid is the identi fier of

the attribute (or colu mn) cont aini ng the t uple word, df is

the number of cells containing the tuple word in certain

attribute (or column), tid is the identifier of the tuple

conta inin g the t uple word, tf is the number of tuple word

appears in the cell determined by tid and cid , and dl is the

total number of words (counting duplicate words) in the

cell determined by tuple identifier tid and

attribute/column identifier cid.

Definition 5 (Keyword Qu ery): A query Q = (k1, k2, ...,

kp) is a set of Chinese tuple words. The results of the

keyw or d qu ery are the tuple trees joined by relations. The

results are ranked by a ranking strategy, and then the

top-N ones will be the desired ans wers of a user.

3. Construction of Index Table

In this section, we will describe how to create an index

table with information of tuple words, including the de-

sign and implementa tion of the i ndex.

3.1. Design of Index Table

An index table which is designed as relation with schema

TupleWordTable (wordid, word, size, DBValue) is con-

structed to store the information of each single tuple

word. Tuple words in tuples of relation Ri (1≤ i ≤ n) are

extracted from Ri and are stored into the index table. For

a query, we invoke an i nd ex to imple me nt the search and

display the top-N answers ranked by a ranking strategy.

Therefor e, it is impor tant to d ecide info rmatio n gra nular-

ity of tuple words stored in index table. In this paper, we

consider co lumn level of gran ularity as well as cell level.

The answers for a query are tuple trees joined by differ-

ent attributes, and the nature of different attributes will

affect the effectiveness of the keyword query, thus,

Chinese Keyword Search by Indexin g i n Relational Databases

109

attributes information in database need to be recorded

individually. Cell granularity is more detailed which will

create more accurate similarities between query and re-

sults. In relation TupleTableWord, DBValue is used to

store column level and cell lev e l information.

3.2. Implementation of Index

The process of creating index table includes three steps:

(1) Normalize tuples in Ri (i=1, 2, ..., n), remove useless

characters and Chinese punctuation which will obstruct

the pr ocessing of extracting Chinese words. (2) For each

tuple word belongs to t[A], extract related information of

tuple word. (3) Use the information to cre a te ind e x table.

。。。。。。

Hash-Table

Word-List

Col-list Col-list

…

…… ……

…

...

’

tl-list

Figure 2. Structure of the ind ex

The structure of the index is shown in Figure 2. It has

a Hash-Table and three layers of linked lists. The first

layer is the Wo rd-list which stores tuple words. Node Wi

(i=1, 2,…, d) in Word-list corresponds to a single tuple

word and wordi d in the index table, it also has a pointer

pointing to a Col-list whic h is the second layer linked list.

Each node in a Col-list saves the information that corre-

lates to the tuple word appeared in a certain attribute (or

column). Different node in the Col-list of Wi indicates

that Wi turns up in di fferent attrib ute. Like wise, one no de

in the third layer tl-list stores the related information of

Wi of a specific tuple. The algorithm of creating the in-

dex table is described below:

______________________________________________

IndexTableCreationAlgorithm (R1, R2, … ,Rn) {

0. For each relation Ri in {R1, R2, … , Rn}

1. For each attribute Aij in { Ai1, Ai2, ..., Aimi} of Ri

2. For the value of each tuple t on attrib ute Aij // t[Aij]

3. For each tuple word z∈t[Aij]={z1 z2 …zs}

4. { If z has not been saved in word-list

{add a new node Wi in word-list to save z, and

sort the nodes in word-list by the code of

GB212-80;

create a new Col-list pointed by Wi, and add a

new node Di into the Col-li st, to save column

identifier (cid) and doc ument frequency (df);

create a new tl-list pointed by Di, add a new

node Ti into tl-list to save tuple identifier(tid),

tuple frequency(tf) and data length(dl);

}

5. Else

{Return node Wi containing z, search the Col-list

pointed by Wi;

6. If Col -list has the node Di corresponding to z.cid

{update df, search the tl-list pointed by Di;

7. If the tl-lis t has node Ti with z.tid, update tf;

Else add a new node of tl-list to save cid, tf, dl;

}

8. Else

{add a new node of Col-list to save cid and df;

add a new node of tl-list to save tid, tf, dl;

}// end else (8)

} // end else (5)

} // end if (4)

} // end algorith m

______________________________________________

As an example, a part of index table is shown in Fig-

ure 3 for database BOOKS in our expe ri ment. The index

table contains 4506 tuple words including most level 1

and level 2 Chinese words.

Figure 3 . A part of the index table for BOOKS database

In order to implement the keyword search, we need to

load the index table into memory, and match the key-

words with tuple words. Fo r a huge ind ex ta ble , however,

it is hard to load the whole index table into the memory.

Therefore, we need to compress and to improve the in-

dex. Firstly, we remove the stop words which are mea-

ningless words such as “的” and “吗”, and the words

appeared in specific attribute of most tuples, say, “出”,

“版”, and ‘社’ in pname attribute of relation P ublishers,

more than 95% of tuples contain these words. Secondly,

it is necessary for the index table to shrink its structure.

We only load the H ash-Table and Wor d-list into memo-

Chinese Keyword Search by Indexin g i n Relational Databases

110

ry.

4. Chinese Keyword Search

4.1. Generation of identifiers of candidate tuples

For the query Q = (k1, k2, ..., kp), we match query words

ki (i = 1, 2 , …, p ) with tuple wor ds b y usin g the ind ex i n

memory, and obtain the set of DBValue of tuple words.

We divide DBValue into pieces (cid, df, tid , tf, dl).

A tuple is solely identified by its tid, so we collect tid

from every piece and obtain the set of all identifiers of

basic tuples matched with query words ki, denote by

Rjki = {t | t = tid}, where tid means the identifier o f tuple t.

Combi ning all sets Rjki (i = 1, 2, …, p), we get RjQ = Rjk1

∪ Rjk2 ∪ ... ∪ Rjki (j=1, 2, ..., n), which are the sets of all

identifiers of candidate tuples containing query words in

relation Rj. In the process of collecting tid, we record the

number of distinct query words of each tuple. For a tuple,

the more distinct query words contained in a tuple, the

closer of it gets to Q. Fo r e xa mp le , a gi ve n que r y Q ={k1,

k2, k3}, tuple t1 contains one k1 and one k2 , t2 contains

two k2’s, t3 conta ins o ne k1, one k2 and o ne k3. Thus, t3 is

the best match for Q, and t1 is much closer to Q than t2

according to the similarities between Q and t1, t2, and t3,

which can be obtained as follows.

4.2. Ranking Strategy

For each candidate tuple that contains query words, we

extract the information from piece and calculate the si-

milarity b etween the query Q and the tuple . In t hi s p aper,

based on vector space model widely used in IR ranking

strategy, we improve a classic method [10, 11] to com-

pute the similarities between the query Q and tuple trees.

Query words and tuple trees are represented as a vector

of terms, and each term may be an individual word or a

multi-word phrase. The vocabulary of terms makes up a

term space, and each term occupies a dimension in the

space. Each element of vector is non-negative weight

that measures the importance of the ter m in the text. T he

similarity is:

( , )*(, )*(,)

k QT

Sim Q TncSimkTweightkQ

∈

= +

∑

(1)

(, )(,)*

Sim kTweightkAw

∈

∑

(2)

1 ln(1ln())1

( ,)*ln

(1 )*

tf N

weight kAdl df

avdl

++ +

−+

(3)

Equation (1) and (3) are derived from [9, 10], In Equa-

tion (1), weight (k, Q) is the appearance frequency of

word k in q uer y Q, tuple tree T is composed of attributes

{A1, A2, …, Am}, n∗c is a ne w item for our Chi nese key-

word query, c is constant (c = 10 by traini ng in o ur expe-

riments), and n is the number of different tuple words

contained in T. In Equation (3), s is a constant (usually

set to 0.2), N is the number of tuples corresponding to th e

attribute, and avdl is the average number of words in

tuples corresponding to the attribute. For a query, we

define E quation (2) due to the weights (or important fac-

tors) of different attributes. For example, when buying a

book, the title of a book is usually more important than

its publisher; therefore, the weight of a ttribute Titles.t itle

should be hig her than that of attribute Pub li shers.pname.

Sorting the weights of different attributes, we get the se t

W = {w1, w2, …, wn} and wi≥wj (i>j). According to Eq-

uation (1) and (3 ), if the value s of tf, df and dl are equiv-

alent, larger weights will lead to higher similarities, and

the tuple trees with larger weights will rank higher.

Suppose M is the maximum value of the number of

distinct quer y words in a candidate tuple. After obtaining

RjQ, we compute firstly the tuples that owns M distinc t

query words, then M-1, M-2, down to one query words.

According to the number of the distinct query words (i.e.,

n), we achieve the subsets Si’s of results for the query,

and then the set S of all results is S = S1∪S2∪ …∪SM, Si

∩ Sj = ∅(i≠j), where Si is the collection of tuple trees

whose numbers of distinct query words are i. In general,

tuple trees in Sj have h ig her si mi la ri ties tha n t ho se i n Si if

i < j.

4.3. Answers for queries

We utilize the schema graph and RjQ (j = 1, 2, …, n) to

construct SQL selection statements, and then retrieve

tuple tree from the database. Query conditions of SQL

statements have the relation ship of foreign key-primary

key and the location of query words. For example, if the

schema graph of database is R1←R2→R3, for a query Q =

{k1, k2}, SQL sta tements are as below:

Select * from R1, R2, R3 where R1.Pid = R2.Fid and

R3.Pid = R2.Fid and R2.Pid in R2Q (S1)

Select * fro m R1, R2, R3 where R1.Pid = R2.Fid and

R3.Pid = R2.Fid and R1.Pid in R1Q (S2)

Select * fro m R1, R2, R3 where R1.Pid = R2.Fid and

R3.Pid = R2.Fid and R3.Pid in R3Q (S3)

Where Pid and Fid denote Primary key and Foreign key

respectively. If multiple relations (or multiple attributes)

contain the query words that are in the same returned

tuple trees, the above SQL statements may lead to re-

dunda nt se ar c h r e sult s. In order to avoid redundancy, it is

necessary to reduce the repeat selection: Let R1Q = R1Q -

R12 after (S1) is executed, and R3Q = R3Q - R32 - R31 after

(S1) and (S2) are executed, where R12 is the identifier set

Chinese Keyword Search by Indexin g i n Relational Databases

111

of tup les both appear in R1Q and R2Q, and so as to R32 and

R31.

Example 1 (cont.) For query Q = {高代; 高教社}

submitted by the user, a part of answers with “title = 高

等代数, publisher = 高等教育出版社” are shown in

Figure 4 and are ranked by their similarities, where the

three results with id ’s 58734, 58709 and 58735 have the

same similarities.

Figure 4. Par t results of query “高代; 高教社”.

5. Experiments

Our experiments are carried out using Microsoft’s SQL

Server 2000 and VC++ 6.0 on a PC with Windows XP,

Intel(R) Core2 Duo 2.0 GHz CPU, and 2.0GB memory.

The real dataset co mes fr o m t he library of Hebei Uni-

versity, which is a fragment of the data of Chinese books.

As shown in Figure 1, our database BOOKS has three

relations: Titles(tid, title, Faid, Fpid), Authors(aid,

aname), Publishers(pid, pname), tid , aid, pid are the

primary keys for three relations respectively, and Faid,

Fpid are foreign keys of Titles referencing Authors.aid

and Publi shers.pid respectively. The relation Titles con-

tains 87762 tuples, Authors contains 62120 tuples and

Publisher s contains 2995 tupl es.

The parameters that we vary in the experiments are the

number of query words and the number of results N re-

quested in top-N queries. We consider 10 groups queries,

where keywords are randomly chosen from the attr ibute

word of TupleWordTable. We denote 10 groups by

Gi(i=1, 2, ..., 10), each Gi contains 10 queries and the

number of keywords of each query in Gi was i. The 100

queries are used to measure the time and accurac y of our

method.

(1) Time. For the 10 groups of keyword queries, we

record the running time of matching the index table (de-

noted by Index-Time) and the time of returning results

(denoted by R esult-Time ) respectively. As sho wn in Fig-

ure 5, Result -time is between 50 and 400 milliseconds,

Index-time are between 0 and 100 milliseconds. Gener-

ally, Result-time costs more than Index-time, for it re-

quires more time to rank similarities and I/O costs fre-

quently.

When the number of query words comes to 4 and 9 in

Figure 3, Result-time turns up to peak values, it is ob-

vious that Result-time is larger than Index-time, the rea-

son is as follows: some query words are contained in a

large amo unt o f t upl es, li ke q ue ry words “中逅猎竖”

in gr oup G 4, the number o f tuples which contain any sin-

gle word is 17637, while the general number of tuples is

about several thousands. We have to calculate the simi-

larity of ever y single tuple and then rank these tuples by

our ranking strategy.

100

200

300

400

500

12345678910

the number of query words

elapsed time(ms)

Index-time Result-time

Figure 5. E lapsed time for keyword queries

(2) Recall and precision. Fo r N=3, 10, 20, 50, 80 and

100, Figure 6 and F igur e 7 show recalls and precisions of

Top-N result s of keyword search respectively.

0.2

0.4

0.6

0.8

12345678910

the number of query words

recall

Top-3 Top-10 Top-20 50 Top-80 Top-100

Figure 6. R e c alls of Top-N result s

0.2

0.4

0.6

0.8

12345678910

the number of query words

precision

Top-3 Top-10 Top-20 Top-50 Top-80 Top-100

Figure 7. Precisions of Top-N results

In Figure 6, for each Gi, with the increase of N, recall

will become larger. T he reason is that the total number of

desired results in the database is constant, while the

Chinese Keyword Search by Indexin g i n Relational Databases

112

number of matching tuple tree in the Top-N results will

increase as N becomes larger. For a fixed N, the overall

rate of recall will decrease as the number of query words

increases. In Figure 7, for each Gi, with the increase N,

precision will decrease. When N≥80, the average rates of

recalls and precisions are higher than 50% and 60% re-

spectively. In addition, according to requirements, all

ranke d results may be displayed for a query.

6. Conclusions

In this paper, we proposed a new method to realize

IR-style free-form Chinese keyword search over rela-

tional databases. The basic idea of this method is to

create an index by extracting information from relations

in a database. For a given query, we use the index to ob-

tain the candidate tuples and calculate the similarity of

between the query and each candidate tuple through im-

proved ranking strategy. The Top-N results are retrieved

by SQL selection statements for the natural join of rela-

tions in the database. Exte nsive experiments were carried

out to measure the performance of our method based on

a real dataset. Exper i ment a l r e sult s s how tha t t he a ve rage

elapsed ti me including In dex-time and Result-time i s less

than 500 milliseconds for queries with 1 to 10 query

words. When N≥80, the average recalls and precisions

are higher than 50% and 60% respectively.

7. Acknowledgements

This work is supported in part by NSFC (61170039) and

the NSF of Hebei Province (F2012201006).

REFERENCES

[1] S. Agra wal, S. Chaudhuri and G. Das, “DBXplorer: A

System for Key word-Based Search over Relational Da-

tabase,” Proceedings of the 18th International Confe-

rence on Data Engineering, San Jose, 26 February -1

March 2002, pp. 5-16.

[2] V. Hristidis, L. Gracano and Y. Papakonstantinou, “Effi-

cient IR-style Keyword Search over Relational Databas-

es,” Proceedings of 29th International Conference on

Very Large Data Bases, Berlin, 9-12 September 2003, pp.

850-861.

[3] A. Balmin, V. Hristidis and Y. Papakonstantinou: “Ob-

jectRank: Authority-Based Keyword Search in Databas-

es”, Proceedings of the 30th International Conference o n

Very large Data Bases, Toronto, 31August-3September

2004, pp. 564 – 575.

[4] Q. Vu, B. Ooi, D. Papadias and A. Tung, “A Graph Me-

thod for Keyword-Based Selection of the Top-K Data-

bases,” Proceedings of the ACM SIGMOD International

Conference on Management of Data, Vancouver, 10-12

June 2008, pp. 915-926.

[5] F. Liu, C. Yu and W. Meng, A. Chowdhury, “Effective

Keyword Search in Relational Databases,” 26th ACM

SIGMOD/PODS International Conference on Manage-

ment of Data/Principles of Database Systems, Chicago,

27-29 June 2006, pp. 563-574.

[6] L. Qin, J. Yu and L. Chang, “Keyword Search in Data-

bases: The Power of RDBMS,” Proceedings of the 2009

ACM SIGMOD International Conference on Manage-

ment of data, Rhode Island, 29 June-2 July 2009, pp:

681-694.

[7] J. Yu , L. Qi n and L. Chang, “Keyword Search in Rela-

tional Databases: A Survey,” IEEE Data Eng. Bull. Spe-

cial Issue on Keyword Search, Vol. 33 No .1, 2010, pp.

67–78.

[8] L. Zhu, Q. Ma and C. Liu, “Semantic-Distance Based

Evaluation of Ranking Queries over Relational Dat abas-

es,” J. Intell. Inf. Syst, Vol. 35 No. 3, 2010, pp.

415-445. doi: 10.1007/s10844-009-0116-5.

[9] L. Zhu, Y. Zhu and Q. Ma, “Chinese Keyword Search

over Relational Databases,” 2010 Second World Con-

gress on Software Engineering, Wuhan, 19-20 December

2010, pp . 217-220.

[10] A. Singhal, “Modern Information Retrieval: A Brief

Overview,” IEEE Data Eng, Vol. 24, No. 4, 2001, pp. 35-

43.

[11] A. Singhal, C. Buckley and M. Mitra, “Pivoted Document

Length Normalization,” Proceedings of the 19th Annual

International ACM SIGIR Conference on Research and

Development in Information Retrieval, Zurich, 18-22

August 1996, pp. 21-29.