J. Software Engineering & Applications, 2009, 2: 150-159
doi:10.4236/jsea.2009.23022 Published Online October 2009 (http://www.SciRP.org/journal/jsea)
Copyright © 2009 SciRes JSEA
Data Mining in Biomedicine: Current Applications and
Further Directions for Research
S. L. TING1, C. C. SHUM2, S. K. KWOK1, A. H. C. TSANG1, W. B. LEE1
1Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Hong Kong, China; 2Department of
Computing, The Hong Kong Polytechnic University, Hong Kong, China.
Email: jacky.ting@polyu.edu.hk
Received January 16th, 2009; revised June 18th, 2009; accepted June 24th, 2009.
Data mining is the process of finding the patterns, associations or relationships among data using different analytical
techniques involving the creation of a model and the concluded result will become useful information or knowledge.
The advancement of the new medical deceives and the database management systems create a huge number of data-
bases in the biomedicine world. Establishing a methodology for knowledge discovery and management of the large
amounts of heterogeneous data has become a major priority of research. This paper introduces some basic data mining
techniques, unsupervised learning and supervising learning, and reviews the application of data mining in biomedicine.
Applications of the multimedia mining, including text, image, video and web mining are discussed. The key issues faced
by the computing professional, medical doctors and clinicians are highlighted. We also state some foreseeable future
developments in the field. Although extracting useful information from raw biomedical data is a challenging task, data
mining is still a good area of scientific study and remains a promising and rich field for research.
Keywords: Data Mining, Biomedicine
1. Introduction
With the tremendous improvement in the speed of com-
puter and the decreasing cost of data storage, huge vol-
umes of data are created. However, data itself has no
value. Only if data can be changed to information, it be-
comes useful. In order to generate meaningful informa-
tion, or knowledge from database, the field of data min-
ing was born. The data mining field is about two decade
old. Early pioneers such as U. Fayyad, H. Mannila, G.
Piatetsky-Shapiro, G. Djorgovski, W. Frawley, P. Smith,
and others found that the traditional statistical techn iques
were not adequate to handle the mass amount of data.
They recognized the need of better, faster and cheaper
ways to deal with the dramatic increase in the amount of
Nowadays, besides the numerous number of databases
created and accumulated in a dramatic speed, data is no
longer restricted to numeric or character only especially
in the biomedicine aspect. The advanced medical de-
ceives and database management systems enable the in-
tegration of the different types of high dimensional mul-
timedia data (e.g. text, image, audio, and video) under
the same umbrella. Establishing a methodology for kno-
wledge discovery and management of large amounts of
heterogeneous data has therefore become a main priority.
Various techniques are used in different areas of bio-
medicine, including genomics, proteomics, medical di-
agnosis, effective drug design and pharmaceutical indus-
In this paper, we would first give a brief outline on
what is data mining, its position or role in the kn owledge
discovery process and the basic principles of some com-
monly used data mining techniques. Next, we present our
investigation results of the applications of the data min-
ing in the biomedicine aspect, which includes the area of
biology, medicine, pharmacy and health care. Lastly, we
discuss some difficulties of data mining in biomedicine
and the possible direction for the future development.
2. What is Data Mining?
Data mining (DM) is the process of finding the patterns,
associations or relationships among data using different
analytical techniques involving the creation of a model
and the concluded result will become useful information
or knowledge. DM can also be expressed as
Nontrivial extraction of implicit, previously un-
known, and potentially useful information from data [1];
Making sense of large amounts of mostly unsuper-
vised data in some domain [2]
Data Mining in Biomedicine: Current Applications and Further Directions for Research151
It is an interdisciplinary subject that lies at the inter
face of pattern recognition and database systems and
emerges the techniques from the mathematics and statis-
tical disciplines as well as from the artificial intelligence
and machine leaning communities. It has a great deal in
common with statistics but on the other hand, there are
differences. Unlike statistics, data mining can be due
with heteroge n eou s dat a field s .
Very often, the term knowledge discovery is used to-
gether with Data Mining. Knowledge discovery, also
known as knowledge d iscover y in databa se (KDD), is the
process that seeks new knowledge in some application
domain. DM is one of the steps in the knowledge discov-
ery process. Figure 1 is an outline of the six step hybrid
KDD model developed by [2].
The initial step of understanding the problem domain
involves working closely with domain experts to define
the problem and determine the project goals, and learning
about current solutions to the problem. A description of
the problem, including its restrictions, is prepared. The
DM tool to be used in the later stage is selected. Next, we
need to understand the data which includes collecting
sample data and deciding which data, including format
and size, will be needed. Data are checked for complete-
ness, redundancy, missing valu es, plausibility of attribute
values, etc. Preparation of data decides which data will
be used as input for DM methods in the subsequent step.
It involves sampling, running correlation and signifi-
cance tests, and data cleaning. Data miner then uses
various DM methods to derive knowledge from preproc-
essed data. Evaluation includes understanding and
checking if the result is novel. Finally, we will decide
how to use and deploy the discovered knowledge.
3. Data Mining Techniques
Data mining techniques fall into two broad categories:
unsupervised and supervised. Unsupervised learning re-
fers to the technique that is not guided by any particular
variable or class label. In the unsupervised learning, we
do not create a model or hypothesis prior to the analysis.
We apply the algorithm directly to the data and observe
the results. A model will then be built according to the
results. Thus, unsupervised leaning is used to define class
for data without class assignments. Clustering is one of
the common unsupervised techniques.
In contrast, for supervised learning, a model is built
prior to the analysis. We then apply the algorithm to the
data in order to estimate the parameters of the model.
The objective of building models using supervised
learning is to predict an outcome or category of interest.
The biomedical literature on applications of supervised
learning techniques is vast. Classification, statistical re-
gression and association rules building are very common
supervised learning techniques used in medical and
clinical research. Table 1 is the summary comparing the
characteristics and the techniques used for the two dif-
ferent learning methods. Followed is a brief explanation
Figure 1. Six-step hybrid KDD model [2]
Copyright © 2009 SciRes JSEA
Data Mining in Biomedicine: Current Applications and Further Directions for Research
Table 1. Comparing the characteristics and the techniques of the unsupervised and supervise d le ar ning
Characteristics Techniques
Unsupervised Learning
No guidance
Use to Define the class
Seldom utilized (until recently)
Association Rule
Supervised Learning
With guidelines
Class defined
Common with vast literature and application
Statistical Regression
Artificial neural networks
of the four learning techniques.
3.1 Clustering
Clustering is an unsupervised learning technique reveal-
ing natural groupings in the data. Cluster analysis refers
to the grouping of a set of data objects into clusters. A
cluster is a collection of data objects wh ich are similar to
one another within the same cluster but not si milar to the
objects in another cluster. Clustering is also called unsu-
pervised classification where no predefined classes are
3.2 Association Rule
Association rule discovery is to find the relationships
between the different items in a data base. It is normally
express in the form X => Y, where X and Y are sets of
attributes of the dataset which implies that transactions
that contain X also contain Y.
3.3 Classification
Classification is a supervised learning method. It is a
method of categorizing or assigning class labels to a pat-
tern set under the supervision. The object of classifica-
tion is to develop a model for each class. Classification
methods can usually be categorized as follows:
a) Decision tree
Decision tree classifiers divide a decision space into
piecewise constant regions. It splits a dataset on the basis
of discrete decisions, using certain thresholds on the at-
tribute values. It is one of the most widely used classifi-
cation method as it is easy to interpret and can be repre-
sented under the If-then-else rule condition.
b) Nearest-neighbor
Nearest-neighbor classifiers [3] typically define the
proximity between instances, find the neighbors if a new
instance, and then assign to it the label for the majority
class of its neighbors.
c) Probabilistic models
Probabilistic models are models which calculate prob-
abilities for hypotheses base on Bayes’ theorem [3].
3.4 Statistical Regression
Regression models are very popular in the biomedical
literature and have been applied in virtually every sub-
specialty of medical research. Before computers were
widely used, linear regression was the most popular
model to find solutions of the problem of estimating the
intercept and coefficients of the regression question. It
has solid foundation from the statistical theory. Linear
regression is similar to the task of finding the line that
minimizes the total distance to a set of data. That is find
the equation for line Y = a + bX. With the help of com-
puters and software package, we can calculate the high
complex models.
3.5 Artificial Neural Networks
Artificial neural networks [4] are signal processing sys-
tems that try to emulate the behavior of human brain by
providing a mathematical model of combination of nu-
merous neurons conn ected in a network. It learns through
examples and discriminate the characteristics among
various pattern classes by reducing the error and auto-
matically discovering inherent relationships in a data-rich
environment. No rules or programmed information is
need beforehand. It composes of many elements, called
nodes which are connected in between. The connection
between two nodes is weighted and by the adjustment of
these weights, the training of the network is performed.
The weights are network parameters and their values are
obtained after the training procedure. There are usually
several layers of nodes. During the training procedure,
the inputs are directed in the input layer with the desir-
able output values as targets. A comparison mechanism
will operates between the out and th e target valu e and the
weights are adjusted in order to reduce error. The proce-
dure is repeated until the network output matches the
targets. There are many advantages of neural networks
like adaptive learning ability, self-o rganization, real-time
operation and insensitivity to noise. However, it also has
a huge disadvantage that it is highly dependence on the
training data and it does not provide an explanation for
the decisions they make, just like working in the ‘black box’.
3.6 Advanced Data Mining Techniques
During the past few years, researchers have tried to com-
bine both unsupervised and supervised methods for the
Copyright © 2009 SciRes JSEA
Data Mining in Biomedicine: Current Applications and Further Directions for Research153
analysis [5]. Some examples of advanced unsupervised
learning models are hierarchical clustering, c-means
clustering self-organizing maps (SOM) and multidimen-
sional scaling techniques. Advanced examples of the
supervised learning models classification and regression
trees (CART) and support vector machines [6].
4. Applications of Data Mining in Biomedicine
4.1 Data Mining Models
Data mining applies in descriptive modeling for under-
standing. In [7], Tseng and Yang use Gene Ontology
(GO) to group genes in advance in order to show the po-
tential relations among gene groups and discover the
hidden relations between genes set in association with
GO terms. It can also be used to predict the outco me of a
future observation or to assess the potential risk in a dis-
ease situation. Regarding the predictive power, data
mining algorithms can learn from past examples in clini-
cal data and model the oftentimes non-linear relation-
ships between the independent and dependent variables,
thereby the resulting model representing the formalized
knowledge that can often provide a good diagnostic op-
tion [8]. Data mining techniques have been widely used
to find new patterns and knowledge from biomedical
4.2 Recent Development
The typical data mining process involves transferring
data originally collected in production systems (such as
electronic medical records) into data warehouse, cleaning
or scrubbing the data to remove errors and check for
format consistency, and then searching the data using
statistical model, artificial intelligence (such as neural
networks), and other machine learning methods [9]. In
[10], Prather et al. employs the KDD for identifying the
factors that will improve the quality and cost effective-
ness of perinatal care in an ex tensive clinical database of
obstetrical patients. Given the data warehouse of diabetic
patients, Breault et al. employ the CART to investigate
the factors affecting the occurrence of diabetics [11].
They are surprisingly discovered that younger age pre-
dicts bad diabetic control, in which explore a new area to
manage the diabetic control in younger age. Similar ap-
plicati ons of data m ining can al s o be f ound in Ta ble 2.
Apart from the diagnostic prediction, the knowledge
discovery ability in data mining also demonstrated a
good detector in adverse drug events (ADE). In [12],
Wilson et al. utilize the KDD techniques in pharma-
covigilance for detecting signals earlier than using exist-
ing methods. In [13], Lian et al. has pointed out that the
prescription is specified by a preference function based
on the user's preference in prior clinical experience. Thus,
they propose a dose optimization framework based on
probability theory. In [14], Susan and Warren have
demonstrated that the cond itional probability (CP) model
is superior in optimizing the drug lists over the multiple
linear regression and discriminant analysis models. Con-
cerning the strong relationship between the diagnosis and
medication, it formulates a posterior probability (what
medication is needed) b ased on a priori probability (wh at
diagnosis has been made). This approach aligns with the
Mediface as purposed by [15].
Table 2. Recent applications of data mining
Author Description
Megalooikonomou et al. [20] They introduce statistical methods that aid the discovery of interesting associations and patterns between brain
images and other clinical data
Brossette et al. [21] They design a Data Mining Surveillance System (DMSS) that uses novel data mining techniques to discover
unsuspected, useful patterns of nosocomial infections and antimicrobial resistance from the analysis of hospital
laboratory data
Antonie et al. [22] They investigate the use of different data mining techniques for anomaly detection and classification of medi-
cal images
Coulter et al. [23] They examine the relation between antipsychotic drugs and myocarditis and cardiomyopathy
Li et al.[24] They explore a novel analytic cancer detection method with different feature selection methods and to compare
the results obtained on different datasets and that reported by Petricoin et al. in terms of detection performance
and selected proteomic patterns
Delen et al.[25] They use two popular data mining algorithms (artificial neural networks and decision trees) along with a most
commonly used statistical method (logistic regression) to develop the prediction models on breast cancer using
a large dataset.
Su et al. [26] They use four different data mining approaches to select the re levant features from the data to predict diabetes
Phillips-Wren et al. [27] They assess the utilization of healthcare resources by lung cancer patients related to their demographic charac-
teristics, socioeconomic markers, ethnic backgrounds, medical histories, and access to healthcare resources in
order to guide medical decision making and pu bl i c p ol icy
Copyright © 2009 SciRes JSEA
Data Mining in Biomedicine: Current Applications and Further Directions for Research
Figure 2. Framework for the integrated approach [17]
In recent years, numerous researchers intend to inte-
grate several data mining and artificial intelligence tech-
niques together to enhance the mining result and support
decision making. For example, Kuo et al. integrate the
clustering analysis and association rules mining tech-
nique to cluster the health insurance database and hence
discover the useful rules for each group [16]. In [17],
Zhuang et al. combine the data mining and case-based
reasoning (CBR) methodologies to provide intelligent
decision support for pathology test ordering by GPs.
They guarantee the integrated system can enhance the
testing ordering in term of evidence based, situational
relevance, flexibility and interactivity. In [18], Huang et
al. propose a model of a chronic diseases prognosis and
diagnosis (CDPD) system by integrating data mining and
CBR to support the chronic d isease treatment. Compared
with traditional coronary artery diseases (CAD) diagnos-
tic methodologies, Tsipouras et al. integrate the decision
trees and fuzzy modeling to form a fuzzy rule-based de-
cision support system that obtain a significant improve-
ment compared with artificial neural networks and adap-
tive neuro-fuzzy inference system [19]. Example of such
integration can be found in Figure 2.
All in all, most of the existing data mining app lications
Copyright © 2009 SciRes JSEA
Data Mining in Biomedicine: Current Applications and Further Directions for Research155
are focused on exploring the pattern in sound biomedical
databases. With proper structure of the data collected via
different medical devices, data mining techniques can
serve as a promising tool to convert the information into
useful and valuable knowledge to physicians and re-
4.3 Current Trend
4.3.1 Mul timedi a Mining
Classically, databases were formed by tuples of numeric
and alphanumeric contents, but with the widespread use
of medical information systems, information absorption
are now expands to different data types including text,
document, image, graphics, speech, audio, hypertext, etc.
At the same time, the growth in Internet information can
also be considered as a new dimension as a distributed
multimedia database of the largest useful information.
Concerning the tremendous amount of visual information,
it is obvious that the development of data mining tech-
niques in these multimedia data is the next generation in
biomedicine. With the widely advanced in digital multi-
media technology, numerous researchers introduce sev-
eral novel data mining techniques, namely image mining,
text mining, video mining, and web mining. Below we
will discuss these four technology revolution and how
does it impact the biomedicine area.
4.3.2 Text Mining
Apart from the medical images and signals, another
clinical data that physicians would like to interpret is the
unstructured free-text. Regarding there is a lot of infor-
mation presented in text or document databases, in form
of electronic books, research articles, digital libraries,
medical dictionaries, etc., several researchers developed
a novel data mining approach in extracting useful
knowledge from textual data or documents, so called the
text mining [28,29]. For example, we can employs text
mining techniques to extract the information of pro-
tein-protein interaction within three different documents.
In addition to the traditional data mining techniques,
text mining uses techniques from many multidisciplinary
scientific fields (e.g. text analysis techniques) to gain
insight and automatically rev eal useful info rmation to the
human users. In [30], Cohen and Hunter describe text
mining is “the use of automated methods for exploiting
the enormous amount of knowledge available in the
biomedical literature”. One of the examples of text min-
ing is to manage the health information in Internet and
response the needs for those who have health information
inquiry in HIV/AIDS [31]. Another common application
of text mining is used to extract the information of pro-
tein-protein interaction. When given the unstructured text,
Zhou et al. employ the semantic parsing and hidden vec-
tor state model to mine the knowledge within the text
[32]. By setting the annotation PROTEIN_NAME (AC-
TIVATE(PROTEIN_NAME), the system will automati-
cally generate the result as shown in Figure 3.
Figure 3. Semantic parsing employe d in protein documents [32]
Copyright © 2009 SciRes JSEA
Data Mining in Biomedicine: Current Applications and Further Directions for Research
4.3.3 Image Mining
More and more medical procedures employ imaging as a
preferred diagnostic tool. Thus, there is a need to develop
methods for efficient mining in images databases, which
is completely different and more difficult than mining in
structured datatypes. Therefore, mining of image data is
a challenge problem. Meanwhile, with numerous imag-
ing techniques (such as SPECT, MRI, PET, and collec-
tion of ECG or EEF signals) can generate gigabytes of
data per day, and heterogeneous nature of image data
(like a single cardiac SPECT procedure of one patient
may contain dozens of 2D images), image mining has
become one of the emerging field in biomedical study.
Typically, most of the activities in mining image data are
based on the searching, retrieving and comparing of
query image with the stored image by its degree of simi-
larity or feature(s). In [22], Antonie et al. present the use
of different data mining techniques for tumor classifica-
tion in digital mammograph y and they find that associate
rule obtains a better result than neural networks. Fur-
thermore, in order to tackle the issue of complicated na-
ture of surrounding of breast tissue, the variation of MCs
in shape, orientation, brightness and size, Peng et al.
propose knowledge-discovery incorporated genetic algo-
rithm (KD-GA) to search for the bright spots in mam-
mogram and hence evaluate the possibility of a bright
spot being a true MC, and adaptively adjust the associ-
ated fitness values [34]. Another example, which intro-
duces a notion of image sequence similarity patterns
(ISSP) for discovering the possible Space-Occupying
Lesion (PSO) in brain images, is presented by [35].
4.3.4 Vi deo Mining
With the advancement in streaming audio and digital TV,
more and more video data are stored in which this brings
the interest of researchers to discover and explore inter-
esting patterns in the audio-visual content. In order to
meet such demand, video mining is developed. In the
biomedicine area, it is observed that specialists intend to
use cameras to take the video in each operation, which
imply there are ample opportunities of applying data
mining principles in conjunction with the video retrieval
techniques. For example, Zhu et al. introduce a video
database management framework and strategies for video
content structure and events mining [36]. They first seg-
mented the video shot into groups and hence organized
the video shots into a hierarchical structure using clus-
tered scenes, scenes, groups, and shots, in increasing
granularity from top to bottom. With a sound structure,
audio and video processing techniques are integrated to
mine event information, such as dialog, presentation and
clinical operation, from the detected scenes.
4.3.5 Web Mining
Internet is growing at a tremendous speed. World Wide
Web (WWW) becomes the largest database that ever
existed. In particular, many medical literatures are writ-
ten in electronic format which are widely available and
accessible in the Internet nowadays. Therefore, the capa-
bility of knowledge discovery and retrieving information
from WWW is important to physicians. But, the com-
plexity of web pages and the dynamic nature of data
stored in the Internet make adoption of data mining tech-
niques difficult. In [37], web mining is the use of data
mining techniques to automatically retrieve, extract and
evaluate information for knowledge discovery from the
Internet. With its exploratory of hidden information abil-
ity, Yu and Jonnalagadda present an approach regarding
Semantic Web and mining that can improve the quality
of Web mining results and enhance the functions and
services and the interoperability of medical information
systems and standards in the healthcare field [38].
5. Discussions
Biomedicine has been evolved as a new application area
for data mining in recent year. As reflected by the brief
literature survey in this study, the current data mining
research concentrates on applying the data mining tech-
niques to manage the complex and unstructured data, and
in particular in form of visual and textual nature. Al-
though numerous studies resulting satisfactory result of
data mining adoption, it is found that data quality is one
of the major challenges on impacting the performance in
the biomedicine industry. In theory, data mining is a data
driven approach as the outcome of data mining heavily
depends on the quality and quantity of available data.
However, the data in the biomedicine area is rather com-
plex in nature. Thus, in order to enhance the performance
of data mining adoption in the domain area, concerns are
raised as follow:
a) Huge volume of data
Because of the sheer size of databases, it is unlikely
that any of the data mining methods will succeed with
raw data. In the field of biomedicine, it is particular true
that particular medical experts are required to pre-process
the data before adopting data mining. As different medi-
cal experts are professional in different medical aspects,
therefore it is time consuming and labor intensive to
handle the data beforehand.
b) Dynamic nature of data
Databases are constantly updated and adding new in-
formation at an alarming rate. For example, new SPECT
images (for the same or a new patient), or by replacement
of the existing ones (a SPECT had to be repeated because
of technical problems). This requires methods that are
able to incrementally update the knowledge learned so
c) Incomplete or imprecise data
The information collected in a database can be either
incomplete or imprecise. To address this problem, fuzzy
sets and rough sets were developed explicitly.
Copyright © 2009 SciRes JSEA
Data Mining in Biomedicine: Current Applications and Further Directions for Research157
d) Noisy data
It is very difficult for any data collection technique to
entirely eliminate noise. This implies that data mining
methods should be made less sensitive to noise, or care
should be taken that the amount of noise in data to be
collected in the future will be approximately the same as
that in the current da ta.
e) Missing attribute values
Missing values create a problem for most data mining
methods, since nearly all the methods require a fixed
dimension for each data object. In fact, this problem is
widely encountered in the medical databases because
most medical data are collected as a byproduct of patient
care activities, rather than for organized research proto-
cols; even in some large medical databases such as breast
cancer data set from University of Wisconsin Hospitals,
this problem are still existed. Typically, one approach to
remedy this problem is to ignore the missing values, or
omit any records containing missing values; whereas
another approach is to substitute missing values with
mostly likely values from obtaining values in the mode
or mean, or directly infer missing values from existing
values via artificial intelligence method (e.g. case-based
f) Redundant, insignificant data, or inconsistent d ata
The data set may contain redundant, insignificant, or
inconsistent data objects and attributes. Generally, medi-
cal data can be stored in numeric and textual format; in
which a large amount of preprocessing is required in or-
der to make the data useful. For example, misspelled of
medical terms is frequently occurred and one medication
or condition may be commonly referred to by a variety of
names (i.e. stomach and abdominal pain).
In addition to the data quality perspectives, several
considerations are also been made:
a) Quality of learning mechanism
Over- and under-learning will affect the performance
of data mining in which the learning mechanism will
misunderstand the human’s preferences and require hu-
man to adjust for achieving the goal state.
b) Quality of knowledge representation
Knowledge representation is an important element to
represent knowledge in an understandable manner to
facilitate the conclusions drawn from knowledge. If the
machine is insufficient to store the k now ledge d iscov ered,
it is also incapable to represent them; thus, such insuffi-
cient knowledge will make the machine less intelligent.
c) Nature o f p roblem
When the problem is too complex, chaos, or has not
encountered before, the intelligent machine do not have
enough knowledge or time to deduce an appropriate re-
sult. Using the case of diagnostic decision support as an
example, if most of the learning cases and rules are re-
lated to some general diagnosis, wh en there is a n ew case
related to specific diagnosis encountered, the system
cannot provide a good solution since there are no rules
triggered inside in the system.
As a result, with this study at hand, we can conclude
that opportunities to use data mining truly in bio medicine
will happen only when the data quality is committed to
the level of standard and there are new methods or algo-
rithms to handle the complex data types. Furthermore,
adoption of data mining in biomedicine is quite a young
field with many issues that still need to be researched and
explored in depth. Some further research directions and
questions are summarized as follow:
a) An absurd and false model may fit perfectly if the
model has enough complexity by comparison to the
amount of data available. When the degrees of freedom
in parameter selection exceed the information content of
the data, this leads to arbitrariness in the final (fitted)
model parameters which reduces or destroys the ability
of the model to generalize beyond the fitting data. If
you've got a learning algorithm in one hand and a dataset
in the other hand, to what extent can you decide whether
the learning algorithm is in danger of over-fitting or un-
der-fitting? Almost all of the data mining research is
done on an ad-hoc base. The techniques are designed for
an individual problem. There is no unifying theory.
b) The storage of large multimedia databases is often
required to be in compressed form. Data compression if
the techniques to reduce the redundancies in data repre-
sentation. Reducing the storage requirement is equivalent
to increasing the capacity of the storage medium. The
development of the data compression technology will
play a significant role in terms of the performance of data
mining. However, it seems the data compression field
has so far been neglected by the data mining community.
c) In today’s network ed society, data care not stored in
a single place. Internet has no doubt being the greatest
and largest databases that we have ever had. Information
inside the internet is often a mixed of text, image, audio,
speech, hypertext, graphics and video components. In
many cases, databases spread over multiple files in dif-
ferent disks or in different geographical locations. How
to handle or collaborate all kind of heterogeneous data in
a distributed environment will open up a newer area of
d) More and more multimedia data mining systems
will be used by medical doctors or clinicians. Th e design
of the system needs to take into consideration of the hu-
man perceptual. How to develop a system work synergis-
tically is a subject of ongoing research. In order to
achieve the goal, biologist, medical doctors, clinicians
and the computing professional all need to work closely
together. Any little part missing may lead to the failure of
the system design.
Copyright © 2009 SciRes JSEA
Data Mining in Biomedicine: Current Applications and Further Directions for Research
6. Conclusions
The well use of the data mining tools in the biomedicine
should bring revolutionary impact to the field. The study
of biomedical processes is heavily based on the identifi-
cation of understandable patterns which are present in the
data. These patterns may be used for diagnostic or prog-
nostic purpose as well as the analysis of microarrays.
Data mining is at the care of the pattern recognition
process. Biologist, medical doctors, clinicians and com-
puting professionals should collaborate so that the two
fields can contribute to each other. The challenge is for
each to widen its focus to attain harmonious and produc-
tive collaboration to develop the best practices.
7. Acknowledgement
The authors would like to express their sincere thanks to
the Research Committee of The Hong Kong Polytechnic
University for financial support of the research work
presented in this paper.
[1] W. Frawley, G. Piatetsky-Shapiro, and C. Matheus,
“Knowledge discovery in databases: An overview,” AI
Magazine, pp. 213–228, 1992.
[2] K. J. Cios, W. Pedrycz, R. W. Swiniarski, and L. A. Kur-
gan, “Data mining: A knowledge discovery approach,”
Springer, New York, 2007.
[3] J. T. Tou and R. C. Gonzalez, “Pattern recognition prin-
ciples,” Addison-Wesley, London, 1974.
[4] R. O. Duda, P. E. Hart, and D. G. Stork, “Pattern classi-
fication,” Wiley, 2001.
[5] T. Hastie, R. Tibshirani, and J. Friedman, “The elements
of statistical learning: Data mining, inference, and predic-
tion,” Springer, New York, 2001.
[6] J. W. Lee, J. B. Lee, M. Park, and S. H. Song, “An exten-
sive comparison of recent classification tools applied to
microarray data,” Computational Statistics & Data
Analysis, Vol. 48, No. 4, pp. 869–885, 2005.
[7] V. S. Tseng and S. C. Yang, “Mining multi-level associa-
tion rules from gene expression profiles and gene ontol-
ogy,” in Proceedings IEEE Workshop Life Science Data
Mining (held with IEEE ICDM), UK, November 2004.
[8] H. Chen, S. S. Fuller, C. Friedman, and W. Hersh,
“Medical informatics–knowledge management and data
mining in biomedicine,” Springer, 2005.
[9] C. D. Krivda, “Data-Mining Dynamine,” Byte, 1995.
[10] J. C. Prather, D. F. Lobach, L. K. Goodwin, J. W. Hales,
M. L. Hage, and W. E. Hammond, “Medical data mining:
Knowledge discovery in a clinical data warehouse,” in
Proceedings AMIA Annual Fall Symposium, pp. 101–
105, 1997.
[11] J. L. Breault, C. R. Goodall, and P. J. Fos, “Data mining a
diabetic data warehouse,” Artificial Intelligence in Medi-
cine, Vol. 26, pp. 37–54, 2002.
[12] A. M. Wilson, L. Thabane, and A. Holbrook, “Applica-
tion of data mining techniques in pharmacovigilance,”
British Journal of Clinical Pharmacology, Vol. 57, No. 2,
pp. 127–134, 2004.
[13] J. Lian, C. Cotrutz, and L. Xing, “Therapeutic treatment
plan optimization with probability density-based dose
prescription,” Medical Physics, Vol. 30, No. 4, pp. 655–
666, 2003.
[14] E. G. Susan and J. M. Warren, “Statistical modelling of
general practice medicine for computer assisted data entry
in electronic medical record systems,” International Jour-
nal of Medical Informatics, Vol. 57, No. 2-3, pp. 77–89,
[15] J. R. Warren, A. Davidovic, S. Spenceley, and P. Bolton,
“Mediface: Anticipative data entry interface for general
practitioners,” in Proceedings Computer Human Interac-
tion Conference 1998, pp. 192–199, 1998.
[16] R. J. Kuo, S. Y. Lin, and C. W. Shih, “Mining association
rules through integration of clustering analysis and ant
colony system for health insurance database in Taiwan,”
Expert Systems with Applications, Vol. 33, pp. 794–808,
[17] Z. Y. Zhuang, L. Churilov, F. Burstein, and K. Sikaris,
“Combining data mining and case-based reasoning for
intelligent decision support for pathology ordering by
general practitioners,” European Journal of Operational
Research, Vol. 195, No. 3, pp. 662–675, 2009.
[18] M. J. Huang, M. Y. Chen, and S. C. Lee, “Integrating
data mining with case-based reasoning for chronic dis-
eases prognosis and diagnosis,” Expert Systems with Ap-
plications, Vol. 32, No. 3, pp. 856–867, 2007.
[19] M. G. Tsipouras, T. P. Exarchos, D. I. Fotiadis, A. P.
Kotsia, K. V. Vakalis, K. K. Naka, and L. K. Michalis,
“Automated diagnosis of coronary artery disease based on
data mining and fuzzy modeling,” IEEE Transactions on
Information Technology in Biomedicine, Vol. 12, No. 4,
pp. 447–457, 2008.
[20] V. Megalooikonomou, J. Ford, L. Shen, F. Makedon, and
A. Saykin, “Data mining in brain imaging,” Statistical
Methods in Medical Research, Vol. 9, No. 4, pp. 359–394,
[21] S. E. Brossette, A. P. Sprague, W. T. Jones, and S. A.
Moser, “A data mining system for infection control sur-
veillance,” Methods of Information in Medicine, Vol. 39,
No. 4-5, pp. 303–310, 2000.
[22] M. L. Antonie, O. R. Zaiane, and A. Coman, “Application
of data mining techniques for medical image classifica-
tion,” in Proceedings Second International Workshop on
Multimedia Data Mining, pp. 94–101, 2001.
[23] D. M. Coulter, A. Bate, R. H. B. Meyboom, M. Lindquist,
and R. Edwards, “Antipsychotic drugs and heart muscle
disorder in international pharmacovigilance: Data mining
study,” British Medical Journal, Vol. 322, pp. 1207–1209,
[24] L. Li, H. Tang, Z. Wu, J. Gong, M. Gruidl, J. Zou, M.
Tockman, and R. Clark, “Data mining techniques for
cancer detection using serum proteomic profiling,” Arti-
Copyright © 2009 SciRes JSEA
Data Mining in Biomedicine: Current Applications and Further Directions for Research
Copyright © 2009 SciRes JSEA
ficial Intelligence in Medicine, Vol. 32, No. 2, pp. 71–83,
[25] D. Delen, G. Walker, and A. Kadam, “Predicting breast
cancer survivability: A comparison of three data mining
methods,” Artificial Intelligence in Medicine, Vol. 34, No.
2, pp. 113–27, 2005.
[26] C. T. Su, C. H. Yang, K. H. Hsu, and W. K. Chiu, “Data
mining for the diagnosis of type II diabetes from three-
dimensional body surface anthropometrical scanning
data,” Computers & Mathematics with Applications, Vol.
51, No. 6–7, pp. 1075–1092, 2006.
[27] G. Philips-Wren, P. Sharkey, and S. Morss, “Mining lung
cancer patient data to assess healthcare resource utiliza-
tion,” Expert Systems with Applications: An International
Journal, Vol. 35, No. 4, pp. 1611–1619, 2008.
[28] M. Hearst, “Untangling text data mining,” in the Pro-
ceedings ACL’99: The 37th annual meeting of the asso-
ciation for computational linguistics, University of Mary-
land, June 1999.
[29] H. Chen, “Knowledge management systems: A text min-
ing perspective,” Tucson, AZ, The University of Arizona,
[30] K. B. Cohen and L. Hunter, “Getting started in text min-
ing,” PLoS Computational Biology, Vol. 4, No. 1, doi:
10.1371/journal.pcbi.0040020, 2008.
[31] Y. Ku, C. Chiu, B. H. Liou, J. H. Liou, and J. Y. Wu,
“Applying text mining to assist people who inquire
HIV/AIDS information from Internet,” in Proceedings ISI
2008 Workshops, pp. 440–448, 2008.
[32] D. Zhou, Y. He, and C. K. Kwoh, “Validating text mining
results on protein-protein interactions using gene expres-
sion profiles,” in Proceedings International Conference on
Biomedical and Pharmaceutical Engineering 2006, pp.
580–585, 2006.
[33] Y. Peng, B. Yao, and J. Jiang, “Knowledge-discovery
incorporated evolutionary search for microcalcification
detection in breast cancer diagnosis,” Artificial Intelli-
gence in Medicine, Vol. 37, No. 1, pp. 43–53, 2006.
[34] H. Pan, Q. Han, X. Xie, Z. Wei, and J. Li, “A Similarity
retrieval method in brain image sequence database,” Ad-
vanced Data Mining and Applications, Vol. 4632, pp.
352–364, 2007.
[35] X. Zhu, W. G. Aref, J. Fan, A.C. Catlin, and A. K. Elma-
garmid, “Medical video mining for efficient database in-
dexing, management and access,” in Proceedings 19th
International Conference on Data Engineering, pp.
569–580, 2003.
[36] R. Kohavi, B. Masand, M. Spilipoulou, and J. Srivastava,
“Web mining,” Data Mining and Knowledge Discovery,
Vol. 6, pp. 5–8, 2002.
[37] W. D. Yu and S. R. Jonnalagadda, “Semantic web and
mining in healthcare,” in Proceedings 8th International
Conference on e-Health Networking, Applications and
Services, pp. 198–201, 2006.
[38] S. Mitra and T. Acharya, “Data mining: Multimedia, soft
computing and bioinformatics,” John Wiley & Sons, Inc.,
New Jersey, 2003.