Journal of Software Engineering and Applications, 2012, 5, 107-112
doi:10.4236/jsea.2012.512b021 Published Online December 2012 (http://www.scirp.org/journal/jsea)
Copyright © 2012 SciRes. JSEA
Chinese Keyword Search by Indexing in Relational
Databases
Liang Zhu1, Lijuan Pan1, Qin Ma2
1Key Laboratory of Machine Learning and Computational Intelligence, School of Mathematics and Computer Science, Hebei
University, Baoding, Hebei 071002, China; 2Department of Foreign Language Teaching and Research, Hebei University, Baoding,
Hebei 071002, China.
Email: zhu@hbu.edu.cn (L. Zhu), katharine162@126.com (L. Pan), maqin@hbu.edu.cn (Q. Ma)
Received October 15, 2012.
ABSTRACT
In this paper, we propose a new index-based method to realize IR-style Chinese keyword search with ranking strategies in relational databases. The method creates an index from the related information of tuple words and presents a ranking strategy adapted to the nature of Chinese words. For a Chinese keyword query, the index is used to match the query words with the tuple words quickly and to compute similarities between the query and tuples according to the ranking strategy; the set of identifiers of candidate tuples is then generated. Finally, we retrieve the top-N results of the query using SQL selection statements and output the answers ranked by their similarities. The experimental results show that our method is efficient and effective.
Keywords: Relational Database; Chinese Keyword Search; Index; Ranking Strategy
1. Introduction
For a database system, keyword search with a general-purpose query engine uses user-supplied data to query the contents of string attributes that store keywords, and it requires users to know the database schema and a query language (say, SQL). Free-form keyword search, in contrast, is popular in information retrieval (IR) and on Web search engines because users need not know query languages or the structure of the underlying data. Inspired by this success, IR-style free-form keyword search in relational databases has been studied extensively for English since 2002 [1-7]. [1] and [2] join tuples from multiple relations in the database to identify tuple trees containing all the query keywords; for each enumerated join tree, both of them simply rank join sequences by the number of joins. The ObjectRank system [3] applies authority-based ranking to keyword search in a database modeled as a labeled graph. [4] proposed a method (G-KS) for selecting the top-N candidates based on their potential to contain results for a given query. [5] adapted the IR model of computing similarities to computing similarities between candidate answers and tuples in relational databases, paying more attention to the effectiveness of keyword search. [6] proposed a middleware-free approach to compute such m-keyword queries on RDBMSs using SQL only. In this paper, we discuss Chinese keyword search in a relational database based on an index.
Chinese is quite different from English. For instance, (1) Chinese text is written more tightly than English; there is no space between words. (2) Chinese words are coded with GB2312-80 and each Chinese word is stored in two consecutive bytes in memory, while an English word is composed of letters and each letter takes one byte; say, the Chinese word "人" (meaning "person") is 0xC8CB, occupying two bytes in memory, whereas the English word "person" takes 6 bytes. (3) Abbreviations in English are acronyms, such as "WWW" for "World Wide Web"; abbreviations in Chinese, however, are characters extracted from a phrase, say, "高代" means "高等代数" (Advanced Algebra).
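The byte-level contrast in item (2) can be checked directly; the following is only an illustrative Python snippet, not part of the proposed system:

import sys

# Compare the in-memory size of a Chinese word (GB2312-80) and an English word.
chinese = "人".encode("gb2312")     # two bytes: b'\xc8\xcb'
english = "person".encode("ascii")  # six bytes, one per letter

print(chinese.hex())                # 'c8cb'
print(len(chinese), len(english))   # 2 6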
Example 1. As shown in Figure 1, the database BOOKS has three relations/tables: Titles(tid, title, Faid, Fpid, ...), Authors(aid, aname, ...), and Publishers(pid, pname, ...). Without loss of generality, we suppose the relationship between Authors and Titles is "one to many", not "many to many": we regard all authors of a book together as one author, since the keyword search will be processed word by word and the index will be created from the text attributes of the entities (say, title, aname,
pname), not from the ids of the relationships. Moreover, the one-to-many relationship is more realistic in our scenarios (say, a string of Chinese words would be the concatenation of the names of two or more authors if a typing error occurred). For the query Q = {高代; 高教社}, a user wants to get answers such as "title = 高等代数, author = 高代 (if any), publisher = 高等教育出版社 (Higher Education Press)". A traditional DBMS cannot obtain such results, and it is difficult to directly apply the existing English keyword search models to Chinese keyword queries.
Titles
tid   title      Faid   Fpid
t1    高等代数   a1     p1
t2    环境化学   a2     p1
t3    民法教程   a3     p3

Authors
aid   aname
a1    北京大学数学系
a2    戴树桂
a3    江平

Publishers
pid   pname
p1    高等教育出版社
p2    新华出版社
p3    中国政法大学出版社

Figure 1. Part of the BOOKS database
Inspired by the technique of creating an index of tuple words in [8], we present a new method to perform Chinese keyword search by indexing. This work is a continuation of [9], which studied Chinese keyword search over a single relation; in contrast, this paper discusses a relational database with multiple relations, which is more challenging than the work in [9].
We build an index based on the information of tuple words and improve the classic ranking strategy in IR. For a Chinese keyword query, its top-N answers are obtained by means of the index and the improved ranking strategy.
2. Data Model and Query Model
Consider a database with n relations R1, …, Rn. Each
relation Ri has mi text attributes Ai1, Ai2, ..., Aimi, a primary
key and possibly foreign key(s) referencing other rela-
tion(s).
Definition 1 (Schema Graph) [2, 4]: A directed graph captures the primary key-foreign key relationships in the schema of the database. It has a node for each relation Ri of the database and an edge Ri → Rj for each foreign key to primary key relationship from a set of attributes (Aib1, …, Aibt) of Ri to a set of attributes (Ajb1, …, Ajbt) of Rj, where Aibk references Ajbk for k = 1, ..., t.
Definition 2 (Tuple Tree) [2, 5]: A tuple tree T is a joining tree of tuples. Each node ti in T is a tuple of the base relation Ri. For each pair of adjacent tuples ti, tj ∈ T, where ti ∈ Ri and tj ∈ Rj, there is an edge Ri → Rj and (ti ⋈ tj) ∈ (Ri ⋈ Rj).
Definition 3 (Tuple Word): For each tuple t belonging to Ri and each A ∈ {Ai1, Ai2, ..., Aimi}, t[A] is the value of t on attribute A of relation Ri; it contains one or more Chinese words. We define each single Chinese word in t[A] as a Chinese tuple word.
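For instance, the cell value 高等代数 contributes four tuple words; a short Python illustration (ours, not the paper's) of this character-level decomposition:

words = list("高等代数")
print(words)        # ['高', '等', '代', '数']
print(len(words))   # 4, the number of words (dl) of this cell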
Definition 4 (Index Table): An index table is composed of tuple words and their related information extracted from the database. Its schema is TupleWordTable(wordid, word, size, DBValue), where wordid is the primary key, word is the tuple word, size is the number of text attributes that contain the tuple word, and DBValue is a text attribute of the form "cid, df, tid, tf, dl; cid, df, tid, tf, dl; …". In DBValue, cid is the identifier of the attribute (or column) containing the tuple word, df is the number of cells in that column that contain the tuple word, tid is the identifier of the tuple containing the tuple word, tf is the number of times the tuple word appears in the cell determined by tid and cid, and dl is the total number of words (counting duplicates) in the cell determined by tuple identifier tid and attribute/column identifier cid.
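The DBValue string can be unpacked into its (cid, df, tid, tf, dl) pieces with a few lines of code; the sketch below only illustrates the format, with hypothetical names, and is not the authors' implementation:

from typing import List, NamedTuple

class Piece(NamedTuple):
    cid: str   # attribute (column) identifier
    df: int    # number of cells in that column containing the word
    tid: str   # tuple identifier
    tf: int    # occurrences of the word in the cell (tid, cid)
    dl: int    # total number of words in the cell (tid, cid)

def parse_dbvalue(dbvalue: str) -> List[Piece]:
    """Split a DBValue text "cid, df, tid, tf, dl; ..." into pieces."""
    pieces = []
    for chunk in dbvalue.strip().strip(";").split(";"):
        if not chunk.strip():
            continue
        cid, df, tid, tf, dl = [x.strip() for x in chunk.split(",")]
        pieces.append(Piece(cid, int(df), tid, int(tf), int(dl)))
    return pieces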
Definition 5 (Keyword Query): A query Q = (k1, k2, ..., kp) is a set of Chinese tuple words. The results of the keyword query are tuple trees joined from the relations. The results are ranked by a ranking strategy, and the top-N ones are returned as the answers desired by the user.
3. Construction of Index Table
In this section, we describe how to create an index table with the information of tuple words, including the design and implementation of the index.
3.1. Design of Index Table
An index table, designed as a relation with schema TupleWordTable(wordid, word, size, DBValue), is constructed to store the information of each single tuple word. The tuple words in the tuples of relation Ri (1 ≤ i ≤ n) are extracted from Ri and stored in the index table. For a query, we invoke the index to carry out the search and display the top-N answers ranked by a ranking strategy. Therefore, it is important to decide the granularity of the information about tuple words stored in the index table. In this paper, we consider the column level of granularity as well as the cell level. The answers for a query are tuple trees joined over different attributes, and the nature of the different attributes affects the effectiveness of the keyword query; thus,
attribute information in the database needs to be recorded individually. Cell granularity is more detailed and yields more accurate similarities between the query and the results. In relation TupleWordTable, DBValue is used to store both column-level and cell-level information.
3.2. Implementation of Index
The process of creating the index table includes three steps: (1) Normalize the tuples in Ri (i = 1, 2, ..., n) by removing useless characters and Chinese punctuation that would obstruct the extraction of Chinese words. (2) For each tuple word belonging to t[A], extract the related information of the tuple word. (3) Use this information to create the index table.
Figure 2. Structure of the index: a Hash-Table over the Word-list of tuple words W1, W2, …, Wd; each word node Wi points to a Col-list of column nodes, and each column node points to a tl-list of tuple nodes.
The structure of the index is shown in Figure 2. It has a Hash-Table and three layers of linked lists. The first layer is the Word-list, which stores the tuple words. Node Wi (i = 1, 2, …, d) in the Word-list corresponds to a single tuple word and to wordid in the index table; it also has a pointer to a Col-list, which forms the second layer of linked lists. Each node in a Col-list saves the information about the occurrences of the tuple word in a certain attribute (or column); different nodes in the Col-list of Wi indicate that Wi appears in different attributes. Likewise, each node in the third-layer tl-list stores the related information of Wi for a specific tuple. The algorithm for creating the index table is described below:
______________________________________________
IndexTableCreationAlgorithm (R1, R2, …, Rn) {
0.  For each relation Ri in {R1, R2, …, Rn}
1.   For each attribute Aij in {Ai1, Ai2, ..., Aimi} of Ri
2.    For the value of each tuple t on attribute Aij   // t[Aij]
3.     For each tuple word z ∈ t[Aij] = {z1 z2 … zs}
4.     { If z has not been saved in the Word-list
         { add a new node Wi to the Word-list to save z, and keep
           the nodes of the Word-list sorted by their GB2312-80 codes;
           create a new Col-list pointed to by Wi, and add a new node Di
           to the Col-list to save the column identifier (cid) and
           document frequency (df);
           create a new tl-list pointed to by Di, and add a new node Ti
           to the tl-list to save the tuple identifier (tid),
           tuple frequency (tf) and data length (dl);
         }
5.       Else
         { return the node Wi containing z, and search the Col-list
           pointed to by Wi;
6.         If the Col-list has a node Di corresponding to z.cid
           { update df, and search the tl-list pointed to by Di;
7.           If the tl-list has a node Ti with z.tid, update tf;
             Else add a new node to the tl-list to save tid, tf, dl;
           }
8.         Else
           { add a new node to the Col-list to save cid and df;
             add a new node to the tl-list to save tid, tf, dl;
           } // end else (8)
         } // end else (5)
       } // end if (4)
}  // end algorithm
______________________________________________
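The same construction can be expressed compactly with nested dictionaries in place of linked lists. The sketch below is our own simplified illustration (the normalize step and the cid naming convention are assumptions), not the authors' implementation:

from collections import defaultdict

def normalize(text):
    """Remove characters that would obstruct word extraction (simplified)."""
    return "".join(ch for ch in text if ch.strip() and ch not in "，。；：“”、")

def build_index(relations):
    """relations: {relation_name: (text_columns, rows)}, each row = (tid, {column: cell_text}).
    Returns index[word][cid] = {"df": ..., "tuples": {tid: {"tf": ..., "dl": ...}}}."""
    index = defaultdict(dict)
    for rel_name, (text_columns, rows) in relations.items():
        for tid, cells in rows:
            for col in text_columns:
                cell = normalize(cells.get(col, ""))
                words = list(cell)                 # one tuple word per Chinese character
                dl = len(words)                    # data length of this cell
                cid = f"{rel_name}.{col}"          # column identifier (our convention)
                for z in words:
                    entry = index[z].setdefault(cid, {"df": 0, "tuples": {}})
                    if tid not in entry["tuples"]:
                        entry["df"] += 1           # one more cell in this column contains z
                        entry["tuples"][tid] = {"tf": 0, "dl": dl}
                    entry["tuples"][tid]["tf"] += 1
    return index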
As an example, a part of the index table for the database BOOKS used in our experiment is shown in Figure 3. The index table contains 4506 tuple words, including most level-1 and level-2 Chinese words.
Figure 3. A part of the index table for the BOOKS database
To implement the keyword search, we need to load the index table into memory and match the keywords with the tuple words. For a huge index table, however, it is hard to load the whole table into memory. Therefore, we need to compress and improve the index. Firstly, we remove stop words, i.e., meaningless words, as well as words that appear in a specific attribute of most tuples; for example, more than 95% of the tuples contain the characters of "出版社" in the pname attribute of relation Publishers. Secondly, it is necessary to shrink the structure of the index table: we only load the Hash-Table and the Word-list into memory.
4. Chinese Keyword Search
4.1. Generation of Identifiers of Candidate Tuples
For a query Q = (k1, k2, ..., kp), we match the query words ki (i = 1, 2, …, p) with the tuple words using the in-memory index and obtain the DBValue strings of the matched tuple words. We divide each DBValue into pieces (cid, df, tid, tf, dl).
A tuple is uniquely identified by its tid, so we collect the tids from every piece and obtain the set of identifiers of base tuples matched with query word ki, denoted by Rjki = {t | t = tid}, where tid is the identifier of tuple t. Combining all sets Rjki (i = 1, 2, …, p), we get RjQ = Rjk1 ∪ Rjk2 ∪ ... ∪ Rjkp (j = 1, 2, ..., n), which is the set of identifiers of candidate tuples of relation Rj that contain query words. In the process of collecting tids, we record the number of distinct query words contained in each tuple: the more distinct query words a tuple contains, the closer it is to Q. For example, given a query Q = {k1, k2, k3}, suppose tuple t1 contains one k1 and one k2, t2 contains two occurrences of k2, and t3 contains one k1, one k2 and one k3. Then t3 is the best match for Q, and t1 is much closer to Q than t2 according to the similarities between Q and t1, t2, t3, which are computed as follows.
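Continuing the sketch above (again illustrative only, reusing the assumed index layout), the candidate identifiers and the per-tuple counts of distinct query words can be collected as follows:

from collections import defaultdict

def candidate_tuples(index, query_words):
    """Return (RjQ per relation, number of distinct query words per candidate tid)."""
    tids_per_relation = defaultdict(set)   # relation name -> set of candidate tids (RjQ)
    matched_words = defaultdict(set)        # tid -> distinct query words it contains
    for k in query_words:
        for cid, entry in index.get(k, {}).items():
            relation = cid.split(".")[0]     # cid was built as "Relation.column"
            for tid in entry["tuples"]:
                tids_per_relation[relation].add(tid)
                matched_words[tid].add(k)
    return tids_per_relation, {tid: len(ws) for tid, ws in matched_words.items()}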
4.2. Ranking Strategy
For each candidate tuple that contains query words, we extract the information from its pieces and calculate the similarity between the query Q and the tuple. In this paper, based on the vector space model widely used in IR ranking strategies, we improve a classic method [10, 11] to compute the similarities between the query Q and the tuple trees. Query words and tuple trees are represented as vectors of terms, where each term may be an individual word or a multi-word phrase. The vocabulary of terms makes up a term space, and each term occupies one dimension of the space. Each element of a vector is a non-negative weight that measures the importance of the term in the text. The similarity is:
Sim(Q,T) = n \cdot c + \sum_{k \in Q \cap T} Sim(k,T) \cdot weight(k,Q)   (1)

Sim(k,T) = \sum_{A_i \in T} weight(k,A_i) \cdot w_i   (2)

weight(k,A_i) = \frac{1 + \ln(1 + \ln(tf))}{(1-s) + s \cdot \frac{dl}{avdl}} \cdot \ln\frac{N+1}{df}   (3)
Equations (1) and (3) are derived from [9, 10]. In Equation (1), weight(k, Q) is the frequency of word k in query Q, the tuple tree T is composed of attributes {A1, A2, …, Am}, and n·c is a new term introduced for our Chinese keyword queries, where c is a constant (c = 10 by training in our experiments) and n is the number of distinct query words contained in T. In Equation (3), s is a constant (usually set to 0.2), N is the number of tuples corresponding to the attribute, and avdl is the average number of words in the tuples corresponding to the attribute. We define Equation (2) to account for the weights (or importance factors) of different attributes. For example, when buying a book, the title is usually more important than the publisher; therefore, the weight of attribute Titles.title should be higher than that of attribute Publishers.pname. Sorting the weights of the different attributes, we obtain the ordered set W = {w1, w2, …, wn}. According to Equations (1) and (3), if the values of tf, df and dl are equal, larger weights lead to higher similarities, and the tuple trees with larger weights rank higher.
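A small sketch of how Equations (1)-(3) might be computed (our own illustrative Python, with assumed per-attribute statistics; the constants follow the values reported above):

import math

def attr_weight(tf, df, dl, N, avdl, s=0.2):
    """Equation (3): pivoted-normalization style weight of word k in attribute A_i."""
    return (1 + math.log(1 + math.log(tf))) / ((1 - s) + s * dl / avdl) \
           * math.log((N + 1) / df)

def similarity(query_tf, matched, n_distinct, c=10.0):
    """Equations (1) and (2).
    query_tf:   {k: frequency of k in Q}                       -> weight(k, Q)
    matched:    {k: [(tf, df, dl, N, avdl, w_i), ...]} pieces of k over the attributes of T
    n_distinct: number of distinct query words contained in T  -> n"""
    sim = n_distinct * c
    for k, pieces in matched.items():
        sim_k_t = sum(attr_weight(tf, df, dl, N, avdl) * w_i
                      for tf, df, dl, N, avdl, w_i in pieces)   # Equation (2)
        sim += sim_k_t * query_tf.get(k, 0)                     # Equation (1)
    return sim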
Suppose M is the maximum number of distinct query words contained in any candidate tuple. After obtaining RjQ, we first process the tuples that contain M distinct query words, then M-1, M-2, and so on down to one. According to the number of distinct query words (i.e., n), we obtain the subsets Si of results for the query, and the set S of all results is S = S1 ∪ S2 ∪ … ∪ SM with Si ∩ Sj = ∅ (i ≠ j), where Si is the collection of tuple trees that contain exactly i distinct query words. In general, tuple trees in Sj have higher similarities than those in Si if i < j.
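Given the per-tuple counts of distinct query words from Section 4.1, the partition into S1, …, SM is a simple grouping step; an illustrative sketch (ours, not the authors' code):

from collections import defaultdict

def group_by_distinct_count(distinct_counts):
    """Partition candidate tids into S_i by the number i of distinct query words."""
    subsets = defaultdict(set)
    for tid, i in distinct_counts.items():
        subsets[i].add(tid)
    return dict(subsets)   # process the keys from max(subsets) down to 1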
4.3. Answers for Queries
We utilize the schema graph and RjQ (j = 1, 2, …, n) to construct SQL selection statements and then retrieve the tuple trees from the database. The query conditions of the SQL statements encode the foreign key-primary key relationships and the locations of the query words. For example, if the schema graph of the database is R1 ← R2 → R3 (i.e., R2 has foreign keys referencing R1 and R3), for a query Q = {k1, k2} the SQL statements are as below:

Select * from R1, R2, R3 where R1.Pid = R2.Fid and R3.Pid = R2.Fid and R2.Pid in R2Q   (S1)
Select * from R1, R2, R3 where R1.Pid = R2.Fid and R3.Pid = R2.Fid and R1.Pid in R1Q   (S2)
Select * from R1, R2, R3 where R1.Pid = R2.Fid and R3.Pid = R2.Fid and R3.Pid in R3Q   (S3)
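For illustration, one way to assemble such a statement from RjQ (a hypothetical helper, not the authors' code; the IN list is built from trusted internal identifiers only):

def build_statement(from_clause, join_conditions, anchor, tids):
    """Assemble one SQL selection statement such as (S1)-(S3).
    anchor is e.g. "R2.Pid"; tids is the candidate identifier set RjQ."""
    in_list = ", ".join(str(t) for t in sorted(tids))
    where = " and ".join(join_conditions + [f"{anchor} in ({in_list})"])
    return f"select * from {from_clause} where {where}"

# e.g. build_statement("R1, R2, R3",
#                      ["R1.Pid = R2.Fid", "R3.Pid = R2.Fid"],
#                      "R2.Pid", r2q)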
Here Pid and Fid denote the primary key and the foreign key respectively. If multiple relations (or multiple attributes) contain query words that fall in the same returned tuple trees, the above SQL statements may produce redundant search results. To avoid this redundancy, we reduce the repeated selections: let R1Q = R1Q - R12 after (S1) is executed, and R3Q = R3Q - R32 - R31 after (S1) and (S2) are executed, where R12 is the set of identifiers
of tuples that appear in both R1Q and R2Q, and similarly for R32 and R31.
Example 1 (cont.) For the query Q = {高代; 高教社} submitted by the user, part of the answers with title = "高等代数" and publisher = "高等教育出版社" are shown in Figure 4, ranked by their similarities; the three results with ids 58734, 58709 and 58735 have the same similarity.
Figure 4. Partial results of the query "高代; 高教社".
5. Experiments
Our experiments are carried out using Microsoft SQL Server 2000 and VC++ 6.0 on a PC with Windows XP, an Intel(R) Core 2 Duo 2.0 GHz CPU, and 2.0 GB of memory.
The real dataset comes from the library of Hebei University and is a fragment of its data on Chinese books. As shown in Figure 1, our database BOOKS has three relations: Titles(tid, title, Faid, Fpid), Authors(aid, aname) and Publishers(pid, pname); tid, aid and pid are the primary keys of the three relations respectively, and Faid and Fpid are foreign keys of Titles referencing Authors.aid and Publishers.pid respectively. The relation Titles contains 87762 tuples, Authors contains 62120 tuples and Publishers contains 2995 tuples.
The parameters that we vary in the experiments are the number of query words and the number of results N requested in top-N queries. We consider 10 groups of queries whose keywords are randomly chosen from the word attribute of TupleWordTable. We denote the 10 groups by Gi (i = 1, 2, ..., 10); each Gi contains 10 queries, and the number of keywords of each query in Gi is i. These 100 queries are used to measure the time and accuracy of our method.
(1) Time. For the 10 groups of keyword queries, we record the running time of matching against the index table (denoted by Index-time) and the time of returning results (denoted by Result-time) respectively. As shown in Figure 5, Result-time is between 50 and 400 milliseconds and Index-time is between 0 and 100 milliseconds. Generally, Result-time is larger than Index-time, since ranking by similarity and the frequent I/O cost more time.
When the number of query words reaches 4 and 9 in Figure 5, Result-time shows peak values that are clearly larger than Index-time. The reason is as follows: some query words are contained in a large number of tuples; for the query words in group G4, for example, the number of tuples containing a single such word reaches 17637, while the typical number is only a few thousand. We then have to calculate the similarity of every such tuple and rank these tuples by our ranking strategy.
Figure 5. Elapsed time (ms) of Index-time and Result-time versus the number of query words
(2) Recall and precision. For N = 3, 10, 20, 50, 80 and 100, Figures 6 and 7 show the recalls and precisions of the top-N results of keyword search respectively.
Figure 6. Recalls of top-N results (Top-3, Top-10, Top-20, Top-50, Top-80, Top-100) versus the number of query words
Figure 7. Precisions of top-N results (Top-3, Top-10, Top-20, Top-50, Top-80, Top-100) versus the number of query words
In Figure 6, for each Gi, the recall becomes larger as N increases. The reason is that the total number of desired results in the database is constant, while the
number of matching tuple trees in the top-N results increases as N becomes larger. For a fixed N, the overall recall decreases as the number of query words increases. In Figure 7, for each Gi, the precision decreases as N increases. When N ≥ 80, the average recall and precision are higher than 50% and 60% respectively. In addition, all ranked results may be displayed for a query if required.
6. Conclusions
In this paper, we proposed a new method to realize IR-style free-form Chinese keyword search over relational databases. The basic idea of the method is to create an index by extracting information from the relations of a database. For a given query, we use the index to obtain the candidate tuples and calculate the similarity between the query and each candidate tuple with an improved ranking strategy. The top-N results are retrieved by SQL selection statements over the natural join of the relations in the database. Extensive experiments were carried out to measure the performance of our method on a real dataset. The experimental results show that the average elapsed time, including Index-time and Result-time, is less than 500 milliseconds for queries with 1 to 10 query words, and that when N ≥ 80 the average recall and precision are higher than 50% and 60% respectively.
7. Acknowledgements
This work is supported in part by NSFC (61170039) and
the NSF of Hebei Province (F2012201006).
REFERENCES
[1] S. Agrawal, S. Chaudhuri and G. Das, "DBXplorer: A System for Keyword-Based Search over Relational Databases," Proceedings of the 18th International Conference on Data Engineering, San Jose, 26 February-1 March 2002, pp. 5-16.
[2] V. Hristidis, L. Gravano and Y. Papakonstantinou, "Efficient IR-Style Keyword Search over Relational Databases," Proceedings of the 29th International Conference on Very Large Data Bases, Berlin, 9-12 September 2003, pp. 850-861.
[3] A. Balmin, V. Hristidis and Y. Papakonstantinou, "ObjectRank: Authority-Based Keyword Search in Databases," Proceedings of the 30th International Conference on Very Large Data Bases, Toronto, 31 August-3 September 2004, pp. 564-575.
[4] Q. Vu, B. Ooi, D. Papadias and A. Tung, "A Graph Method for Keyword-Based Selection of the Top-K Databases," Proceedings of the ACM SIGMOD International Conference on Management of Data, Vancouver, 10-12 June 2008, pp. 915-926.
[5] F. Liu, C. Yu, W. Meng and A. Chowdhury, "Effective Keyword Search in Relational Databases," Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, 27-29 June 2006, pp. 563-574.
[6] L. Qin, J. Yu and L. Chang, "Keyword Search in Databases: The Power of RDBMS," Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, Providence, Rhode Island, 29 June-2 July 2009, pp. 681-694.
[7] J. Yu, L. Qin and L. Chang, "Keyword Search in Relational Databases: A Survey," IEEE Data Engineering Bulletin, Special Issue on Keyword Search, Vol. 33, No. 1, 2010, pp. 67-78.
[8] L. Zhu, Q. Ma and C. Liu, "Semantic-Distance Based Evaluation of Ranking Queries over Relational Databases," Journal of Intelligent Information Systems, Vol. 35, No. 3, 2010, pp. 415-445. doi:10.1007/s10844-009-0116-5.
[9] L. Zhu, Y. Zhu and Q. Ma, "Chinese Keyword Search over Relational Databases," 2010 Second World Congress on Software Engineering, Wuhan, 19-20 December 2010, pp. 217-220.
[10] A. Singhal, "Modern Information Retrieval: A Brief Overview," IEEE Data Engineering Bulletin, Vol. 24, No. 4, 2001, pp. 35-43.
[11] A. Singhal, C. Buckley and M. Mitra, "Pivoted Document Length Normalization," Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, 18-22 August 1996, pp. 21-29.