Intelligent Information Management, 2013, 5, 196-203
Published Online November 2013 (http://www.scirp.org/journal/iim)
http://dx.doi.org/10.4236/iim.2013.56022
Open Access IIM
Towards More Efficient Image Web Search
Mohammed Abdel Razek1,2
1Deanship of E-Learning and Distance Education, King Abdul-Aziz University, Jeddah, KSA
2Mathematics and Computer Science Department, Faculty of Science, Azhar University, Cairo, Egypt
Email: abdelram@azhar.edu.eg
Received October 30, 2013; revised November 15, 2013; accepted November 29, 2013
Copyright © 2013 Mohammed Abdel Razek. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
ABSTRACT
With the flood of information on the Web, it has become increasingly necessary for users to utilize automated tools in
order to find, extract, filter, and evaluate the desired information and knowledge discovery. In this research, we will
present a preliminary discussion about using the dominant meaning technique to improve Google Image Web search
engine. Google search engine analyzes the text on the page adjacent to the image, the image caption and dozens of other
factors to determine the image content. To improve the results, we looked for building a dominant meaning classifica-
tion model. This paper investigated the influence of using this model to retrieve more efficient images, through sequen-
tial procedures to formulate a suitable query. In order to build this model, the specific dataset related to an application
domain was collected; K-means algorithm was used to cluster the dataset into K-clusters, and the dominant meaning
technique is used to construct a hierarchy model of these clusters. This hierarchy model is used to reformulate a new
query. Experiments on Google validate the effectiveness of the proposed
approach, which shows improvements in precision, recall, and F1-measure of
57%, 70%, and 61%, respectively.
Keywords: Web Mining; Image Retrieval; Dominant Meaning Technique; K-Means Algorithm; Web Search
1. Introduction
The continuous growth in the size and use of Web information calls for
new techniques to extract Web content. The taxonomy of Web mining
contains three categories: Web content mining, Web structure mining,
and Web usage mining. Web content mining is the process of extracting
information and knowledge from Web pages. It deals with the content
data of Web pages, which consist of text, images, audio, video, or
structured records such as lists and tables. This research focuses only
on Web content mining, specifically on mining the images of a Web page
to find the weight of their content with respect to the search query.
Images on the Web are considered part of Web content [1].
In a major part of this project, we try to answer the following
questions: how can a query model be constructed based on the dominant
meaning, and how can the results of Web image search be improved? To
address them, we use the following procedure to improve the query
results of image search:
Collect a specific dataset related to some application domain;
Use the K-means algorithm to cluster the dataset into K clusters;
Use the dominant meaning technique to construct the hierarchy of
meaning;
Construct a new query based on the dominant meaning algorithm;
Submit the new query to Google;
Filter the results based on the dominant meaning words.
This project uses a clustering method called K-means to partition the
dataset into K clusters. Clustering is the process of partitioning or
grouping a given set of patterns into disjoint clusters. K-means has
been shown to produce good clustering results in many practical
applications [2]. However, a direct implementation of K-means requires
time proportional to the product of the number of patterns and the
number of clusters per iteration [3]. Following [4-6], this project
briefly illustrates the direct K-means algorithm.
The idea behind this research is to improve the Image
Web search using the dominant meaning technique [7].
We apply dominant meaning words, along with a machine learning method,
to classify Web pages. The dominant meaning is defined as “the set of
keywords that best fit an intended meaning of a target word” [7]. This
technique sees a query as a target meaning plus some words that fall
within the range of that meaning. It freezes the target meaning, which
is called the master word, and adds or removes slave words, which
clarify the target meaning.
2. Motivation
This research addresses Web content mining. For semi-structured data,
existing work uses the HTML structure inside Web pages, and some of it
also uses the hyperlink structure between Web pages, for Web page
representation. In the database view, mining tries to infer the
structure of a Web site so that the site can be treated as a database,
which allows better information management and querying on the Web.
For HTML Web pages, many research and commercial systems also use
image captions, e.g. Google image search: “Google analyzes the text on
the page adjacent to the image, the image caption and dozens of other
factors to determine the image content. Google also uses sophisticated
algorithms to remove duplicates and ensure that the highest quality
images are presented first in your results” [8], [9]. In this sense,
this project asks how the dominant meaning technique [7] can be used to
improve Web image search, and how it influences search results.
The dominant meaning definition is known as “the set
of keywords that best fit an intended meaning of a target
word” [7]. This technique sees a query as a target mean-
ing plus some words that fall within the range of that
meaning. It freezes the target meaning, which is called a
master word, and adds or removes some slave words,
which clarify the target meaning.
For example, suppose that the query is “Java”. Figure 1 shows the
results for the word “Java”. As shown, most of the results are images
for the three well-known meanings of Java: Java (computer program
language), Java (coffee), and Java (island).
The idea of this research is to clarify the target meaning with some
slave words. Accordingly, to look for Java (computer program language),
we add some slaves of Java, such as computer, program, and language.
Figure 2 shows the results for Java with its slaves. These results, as
we see, are closer to the Java programming language.
Figure 3 shows the results for Java Island with its slaves. The
results are clearly closer to Java Island in Indonesia, and there are
no images related to the Java programming language.
On the other hand, Figure 4 presents the results for Java Coffee with
its slaves. The results clearly include neither images of the Java
programming language nor images of Java Island. Therefore, we use the
learner’s context of interest and domain knowledge to individualize the
context of this target word. We do that by looking for keywords in the
user profile (the learner’s context of interest) that help to specify
the intended meaning. Because the target meaning is “computer program
language”, we look for slave words in the user profile that best fit
this specific meaning, words such as “computer”, “program”, “awt”,
“application”, and “swing”.
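The selection of slave words from a user profile can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the lexicon `DOMINANT_MEANINGS`, the function names, the overlap-count rule for choosing the intended meaning, and the slave words listed for the coffee and island meanings are hypothetical and are not taken from the paper.

```python
# Hypothetical lexicon: master word -> meaning -> slave words.
# The words for the programming-language meaning come from the text;
# the other entries and the layout are assumptions for illustration.
DOMINANT_MEANINGS = {
    "java": {
        "computer program language": [
            "computer", "program", "language", "awt", "application", "swing",
        ],
        "coffee": ["coffee"],
        "island": ["island", "indonesia"],
    }
}


def pick_meaning(master, profile_words):
    """Return the meaning whose slave words overlap most with the profile."""
    profile = {w.lower() for w in profile_words}
    meanings = DOMINANT_MEANINGS[master.lower()]
    # Count shared words per meaning; on a tie the first listed meaning wins.
    return max(meanings, key=lambda m: len(profile & set(meanings[m])))


def reformulate(master, profile_words, max_slaves=3):
    """Keep the master word fixed and append slave words found in the profile."""
    meaning = pick_meaning(master, profile_words)
    profile = {w.lower() for w in profile_words}
    slaves = [w for w in DOMINANT_MEANINGS[master.lower()][meaning] if w in profile]
    return " ".join([master] + slaves[:max_slaves])
```

For instance, `reformulate("Java", ["swing", "awt", "music", "program"])` keeps the master word and appends the three profile words that belong to the programming-language meaning.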
The main question now is how to specify the core cluster of a query.
To address it, we must answer the following three questions: How can we
construct a dominant meaning for image search? How can the system
decide which meaning is intended for the requested image? And how can
it select the words that must be added to the original query?
The following subsections answer each of them in detail.
3. Methodology
This section presents the methodology for clustering the data
collected from the Web, and shows how these clusters are used to form
the dominant meaning model.
Figure 5 presents the architecture of our approach to improving the
results of the Google image search engine. This project follows the
steps below to create the query and then improve the results of image
search.
First, collect a specific dataset related to some application domain.
Use the K-means algorithm to cluster the dataset into K clusters. Each
collection is divided into K classes. Each cluster is related to one
meaning and contains some words, called slave words, that identify
this meaning.
Use the dominant meaning algorithm to classify slave words under their
master words in order to identify the meaning coming from the cluster.
This technique generates a hierarchy model for the dominant meaning of
each cluster.
Reconstruct the query by adding appropriate slave words; the choice of
slave words to add can be very important.
Send the original and the new query independently to the Google Image
Search Engine.
Choose the top-1000 items from the results for both queries.
Compare the precision and recall of the results for both queries.
Figure 1. The results of Google images for “Java”.
Figure 2. Search results for java with its slaves.
Figure 3. The results of java island with its slaves.
Figure 4. The results of java coffee with its slaves.
Figure 5. Architecture of the methodology.
3.1. K-Means Algorithm
The K-means algorithm attempts to find natural groups of data based on
some similarity measure. It classifies a given data set into a certain
number of clusters (assume K clusters) fixed a priori. It defines K
points (K centroids), one for each cluster. These points must be chosen
carefully because their placement affects the accuracy of the
resulting clusters. The algorithm assigns each point in the data set to
the nearest centroid, which divides the set of data points into
non-overlapping groups. Therefore, points in a group are “more similar”
to one another than to points in other groups.
The first step is completed when no point is pending and an early
grouping is done. The standard measure of the spread of a group of
points about its centroid is the variance, or the sum of the squares of
the distance between each point and the centroid. If the data points
are close to the centroid, the variance will be small. The error
measure, called the objective function, is the sum of all the
variances:

$$E = \sum_{i=1}^{k} \sum_{j=1}^{n_i} d(x_{ij}, z_i) \qquad (1)$$

where the notation $d(x_{ij}, z_i)$ stands for the distance between
$x_{ij}$ and $z_i$. Here $x_{ij}$ is the $j$th point in the $i$th
cluster, $z_i$ is the reference point of the $i$th cluster, and $n_i$
is the number of points in that cluster. Accordingly, to reach a
representative clustering, $E$ should be as small as possible.
The algorithm is composed of the following steps:
k-means algorithm
1) Select K points for initial group centroids.
2) Assign each object to the group that has the closest distance to
the centroid.
3) When all objects have been assigned, recalculate the positions of
the K centroids.
4) Repeat Steps 2 and 3 until the centroids no longer move. This
produces a separation of the objects into groups from which the
metric to be minimized can be calculated.
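The four steps above can be sketched in plain Python as follows. This is a minimal sketch rather than the implementation used in the experiments: it assumes points are numeric tuples, uses the squared Euclidean distance (in line with the "sum of the squares" description of the spread), picks the initial centroids at random from the data, and leaves the centroid of an empty cluster where it is. The names `sq_dist` and `kmeans` are illustrative.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Direct k-means; returns (centroids, labels, objective)."""
    rng = random.Random(seed)
    # Step 1: select K of the data points as initial centroids (an assumption).
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each point to the group with the closest centroid.
        labels = [min(range(k), key=lambda i: sq_dist(p, centroids[i]))
                  for p in points]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # empty cluster: keep as is
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Metric to be minimized: sum of squared distances to the assigned centroid.
    objective = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, objective
```

On two well-separated pairs of points, such as `[(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]` with `k=2`, the sketch groups each pair together. Each iteration compares every point with every centroid, which reflects the per-iteration cost noted in the Introduction.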
The K-means algorithm produces m clusters. These clusters are used to
build the dominant meaning model. The next subsection presents the
methodology for building this model.
3.2. Construction of Dominant Meanings Tree
Following [7], suppose that the clustering result consists of $m$
classes, i.e. $\{C_k\}_{k=1}^{m}$, and each cluster $C_k$ is
represented by a finite set of WebPages
$C_k = \{D_r \mid r = 1, \ldots, r_k\}$.
The question now is: how can we use those WebPages to construct
dominant meanings for the corresponding cluster?
To address this question, each Webpage is represented by a finite set
of words $D_r = \{w_{rj} \mid j = 1, \ldots, n_r\}$. A weight
$f(w_{rj}^{k})$ is assigned to each term $w$ in a document, which
depends on the number of occurrences of the term in the document. This
weight is a statistical measure used to evaluate how important a word
is to a document in a collection of a data set.
The aim of this method is to find the top-$N$ words which represent
cluster $C_k$. To complete the computations, suppose that a word $w_k$
represents the cluster $C_k$.
Dominant Meaning Algorithm (K-Clusters)
1) Calculate the values of $f(w_{rj}^{k})$, $\forall j, r$.
2) Calculate
$F^{k} = \max_{r = 1, \ldots, r_k} \; \max_{j = 1, \ldots, n_r} f(w_{rj}^{k})$.
3) Define a set $\Omega_r$ that contains the top-$N$ maximum values of
$f_j^{r} = f(w_{rj}^{k})$ for a document $D_r$:
$\Omega_r = \{ f_j^{r} \mid j = 1, \ldots, N \}$, where
$0 \le f_j^{r} \le F^{k}$.
4) For each cluster $C_k$, rank the terms of the collection $\Omega_r$
in decreasing order. As a result, the dominant meanings of the cluster
$C_k$ can be represented by the set of words that corresponds to the
set $\{ f_j^{r} \}$. Return
$C_k = \{ w_k^{1}, w_k^{2}, \ldots, w_k^{T} \}$.
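One possible reading of the algorithm above is sketched below in Python. Several choices are assumptions made for illustration and are not specified in the text: pages are given as lists of word tokens, a word is scored by its highest frequency in any single page of the cluster, scores are divided by the cluster maximum so that they fall between 0 and 1, and ties are broken alphabetically. The function name `dominant_meanings` is hypothetical.

```python
from collections import Counter


def dominant_meanings(cluster_pages, top_n=3):
    """Return the top-N words of one cluster with max-normalised scores.

    cluster_pages: list of pages, each a list of word tokens; at least
    one page is assumed to be non-empty.
    """
    # Step 1: frequency of every word in every page of the cluster.
    page_counts = [Counter(page) for page in cluster_pages]
    # Step 2: the largest frequency over all pages and words of the cluster.
    f_max = max(max(c.values()) for c in page_counts if c)
    # Step 3: keep each word's best per-page frequency (0 <= f <= f_max).
    best = {}
    for counts in page_counts:
        for word, freq in counts.items():
            best[word] = max(best.get(word, 0), freq)
    # Step 4: rank in decreasing order (ties alphabetical) and keep the top N.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, freq / f_max) for word, freq in ranked[:top_n]]
```

Applied to each of the clusters in turn, this yields one ranked word list per cluster, which is the kind of input a hierarchy of master and slave words could be built from.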
4. Experimental Results
To ensure that our algorithm works in practice, we con-
ducted experiments with images collected directly from
the Web.
4.1. Data Set
The data set consists of 314 web pages from various web
sites at the University of Waterloo, and some Canadian
websites [10]. The data is categorized into 10 categories
as shown in Table 1 and Figure 6.
4.2. Dominant Meaning Model and Query Formulation
Based on the K-means and dominant meaning algorithms, Figure 7 shows
the hierarchical model of the categories of the proposed dataset shown
in Table 1. Many studies have used ontology and meaning to reformulate
queries [11], [12]. For example, if we used this model to reformulate a
query consisting of one word, $\text{Query} = \{w_2^{1}\}$, we would get
the set of corresponding clusters $\{C_2\}$. We observe that cluster
$C_2$ contains two words, $\{w_2^{1}, w_2^{2}\}$. Consequently, the new
query will contain $\text{New-Query} = \{w_2^{1}, w_2^{2}\}$.
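The lookup described in this example can be sketched in a few lines of Python. The representation of the model as a dictionary from cluster label to a ranked word list, the function name `reformulate_query`, and the sample words are assumptions for illustration; the actual hierarchy model of Figure 7 may be structured differently.

```python
def reformulate_query(query_word, model):
    """Expand a one-word query with the dominant words of matching clusters.

    model: dict mapping a cluster label to its ranked list of dominant words.
    Returns the matching cluster labels and the new query as a word list.
    """
    # Find every cluster whose dominant words include the query word.
    matching = [label for label, words in model.items() if query_word in words]
    new_query = [query_word]
    for label in matching:
        for word in model[label]:
            if word not in new_query:  # avoid repeating a word in the query
                new_query.append(word)
    return matching, new_query
```

With a hypothetical model `{"C1": ["bear", "attack"], "C2": ["campus", "network"]}`, the query word `"campus"` matches only `C2`, and the new query consists of both words of that cluster.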
4.3. Recall and Precision
Recall is the ability of a retrieval system to obtain all or most of
the relevant documents in the collection [13], [14]. The relative
recall can be calculated using the following formula: Relative recall =
Total number of sites retrieved by a search engine / Sum of sites
retrieved.
To compare the two experiments, we use the F1 performance measure [15]
to determine the performance of both of them. With

$$\text{precision} = \frac{\text{\# of correct classes proposed}}{\text{\# of classes proposed}}$$

and

$$\text{recall} = \frac{\text{\# of correct classes proposed}}{\text{\# of classes in test data}}$$

it is given by:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
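As a worked illustration of these measures, the following Python sketch computes them for one query, using the usual definitions: precision as the share of proposed items that are correct, and recall as the share of relevant items that were proposed. The function name and the convention of returning 0.0 when a denominator is zero are assumptions, not part of the paper.

```python
def precision_recall_f1(proposed, relevant):
    """Return (precision, recall, F1) for one query.

    proposed: items returned for the query.
    relevant: items that are actually correct for the query.
    """
    proposed, relevant = set(proposed), set(relevant)
    correct = len(proposed & relevant)
    # Guard against empty sets; 0.0 is an assumed convention.
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, if four items are proposed and two of them are among three relevant items, precision is 2/4, recall is 2/3, and F1 is 4/7, or about 0.57.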
As shown in Table 3 and Figure 8, in the case of the query using
dominant meaning, searching the data set campus-network (D2) had the
highest relative recall value (0.71), followed by the data set
snowboarding-skiing (0.69), with the least relative recall for the data
set river-rafting (D8) (0.41).
As shown in Figure 8, in the case of the bag-of-words query, searching
the data set campus-network (D2) had the highest relative recall value
(0.57), followed by the data set snowboarding-skiing (0.56), with the
least relative recall for the data set river-rafting (D8) (0.23).
Figure 6. Number of training and testing examples.
Figure 7. Dominant meaning model of the dataset.
Table 1. Number of WebPages in dataset.
ID     Subject                        Number of WebPages
D1     Black-bear-attack              30
D2     Campus-network                 33
D3     Canada-transportation-roads    22
D4     Career-services                52
D5     Co-op                          55
D6     Health-services                23
D7     River-fishing                  23
D8     River-rafting                  29
D9     Snowboarding-skiing            24
D10    Winter-Canada                  23
Total                                 314
As shown in Figure 9, in the case of the query using dominant meaning,
searching the data sets black-bear-attack (D1) and winter-Canada (D10)
had the highest relative precision value (0.57), followed by the data
set snowboarding-skiing (0.69), with the least relative precision for
the data set campus-network (D2) (0.39).
Figure 9 shows a comparison of the average precision for the query
using dominant meaning vs. the query using bag-of-words. In the case of
the bag-of-words query, searching the data set campus-network (D9) had
the highest relative precision value (0.43), followed by the data set
winter-Canada (D10) with (0.56), with the least relative precision for
the data set Canada-transportation-roads (D3) with (0.27).
Figure 10 shows the F1-measures for each application domain in the
cluster for both the original query and the query reformulated using
the proposed technique. The highest values are for D1 and D10, with
F1-measures of 0.61 and 0.61, respectively.
We also notice that our approach achieves better performance in terms
of F1 for categories D2 and D3, with the same value (0.5). It is clear
that the query reformulated with the dominant meaning approach improves
the results considerably over the original query. For the best four
categories of the testing dataset (D1, D2, D5, and D4), the
reformulated query improves the F1 measure over the original query by
17.9%, 15.4%, 13.3%, and 12.4%, respectively.
Figure 8. Average recall for query using dominant meaning vs. query using bag-of-words.
Figure 9. Average precision for query using dominant meaning vs. query using bag-of-words.
Figure 10. F1-measures for the original query and for the query with dominant meaning.
5. Conclusion
In this article, we studied the effect of reformulating the query with
the dominant meaning technique on the results of the Google image
search engine. To apply the technique, we used a dataset of 314 web
pages classified into 10 categories. The K-means algorithm was used to
cluster each category in the dataset into K clusters. We applied the
dominant meaning algorithm to each cluster to extract its meaning and
build a hierarchy model, and we used this model to construct a new
query. We compared the results returned by the Google search engine for
the original query and for the restructured query. As the experimental
results show, the proposed technique achieved considerable performance
in terms of precision, recall, and F1-measure.
6. Acknowledgements
This project was funded by the Deanship of Scientific Research, King
Abdulaziz University, Saudi Arabia, under award number (521/214/1432).
The author would like to thank the Deanship of Scientific Research for
its support and encouragement.
REFERENCES
[1] X. Wang, S. Qiu, K. Liu and X. Tang, “Web Image
Re-Ranking Using Query-Specific Semantic Signatures,”
IEEE Transactions on Pattern Analysis and Machine In-
telligence, No. 99, 2013, pp. 1-14.
[2] S. Sujatha and A. S. Sona, “New Fast K-Means Cluster-
ing Algorithm Using Modified Centroid Selection Me-
thod,” International Journal of Engineering Research &
Technology (IJERT), Vol. 2, No. 2, 2013, pp. 1-9.
[3] C. Zhang and S. X. Xia, “K-Means Clustering Algorithm
with Improved Initial Center,” 2nd International Work-
shop on Knowledge Discovery and Data Mining (WKDD),
Moscow, 23-25 January 2009, pp. 790-792.
[4] M. Gautam and A. Xavier, “Speed Improvements to In-
formation Retrieval-Based Dynamic Time Warping Using
Hierarchical K-Means Clustering,” 2013 IEEE Interna-
tional Conference on Acoustics, Speech and Signal Proc-
essing (ICASSP), Vancouver, 26-31 May 2013, pp. 8515-
8519.
[5] D. Mavroeidis and P. Magdalinos, “A Sequential Sam-
pling Framework for Spectral k-Means Based on Efficient
Bootstrap Accuracy Estimations: Application to Distrib-
uted Clustering,” ACM Transactions on Knowledge Dis-
covery from Data, Vol. 7, No. 2, 2012, pp. 2-7.
[6] J. Wu, H. Xiong and J. Chen, “Adapting the Right Meas-
ures for k-Means Clustering,” Proceedings of the 15th
ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, Paris, 28 June-1 July 2009,
pp. 877-886.
[7] M. A. Razek, C. Frasson and M. Kaltenbach, “Dominant
Meanings towards Individualized Web Search for Learn-
ing Environments,” In: G. D. Magoulas and S. Y. Chen,
Eds., Advances in Web-Based Education: Personalized
Learning Environments, IDEA Group Publishing, Her-
shey, 2006.
[8] Y. Jing, M. Covell, D. Tsai and J. M. Rehg, “Learning
Query-Specific Distance Functions for Large-Scale Web
Image Search,” IEEE Transactions on Multimedia, Vol.
15, No. 8, 2013, pp. 2022-2034.
[9] G. Maderlechner, J. Panyr and P. Suda, “Finding Cap-
tions in PDF-Webpages for Semantic Annotations of Im-
ages,” In: D.-Y. Yeung, et al., Eds., Structural, Syntactic,
and Statistical Pattern Recognition, Lecture Notes in
Computer Science Volume, Springer-Verlag Berlin Hei-
delberg, 2006, pp. 422-430.
[10] K. Hammouda, “Web Ming Dataset,” 2013.
http://pami.uwaterloo.ca/~hammouda/webdata
[11] P. Singh, R. H. Goudar, R. Rathore, A. Srivastav and S.
Rao, “Domain Ontology Based Efficient Image Re-
trieval,” 7th International Conference on Intelligent Sys-
tems and Control (ISCO), Coimbatore, 4-5 January 2013,
pp. 445-452.
http://dx.doi.org/10.1109/ISCO.2013.6481196
[12] D. Gowsikhaa, S. Abirami and R. Baskaran, “Construc-
tion of Image Ontology Using Low-Level Features for
Image Retrieval,” International Conference on Computer
Communication and Informatics (ICCCI), Coimbatore,
10-12 January 2012, pp. 1-7.
http://dx.doi.org/10.1109/ICCCI.2012.6158922
[13] B. T. Sampath Kumar and J. N. Prakash, “Precision and
Relative Recall of Search Engines: A Comparative Study
of Google and Yahoo,” Singapore Journal of Library &
Information Management, Vol. 38, No. 1, 2009, pp. 124-
137.
[14] S. M. Shafi and R. A. Rather, “Precision and Recall of
Five Search Engines for Retrieval of Scholarly Informa-
tion in the Field of Biotechnology,” Webology, Vol. 2, No.
2, 2005, pp. 42-47.
http://www.webology.ir/2005/v2n2/a12.html
[15] M. Gagnon, A. Zouaq and L. Jean-Louis, “Can We Use
Linked Data Semantic Annotators for the Extraction of
Domain-Relevant Expressions?” The International World
Wide Web Conference Committee (IW3C2), WWW 2013
Companion, Rio de Janeiro, 13-17 May 2013, pp. 1239-
1246.