Data Mining in Biomedicine: Current Applications and Further Directions for Research

doi:10.4236/jsea.2009.23022

J. Software Engineering & Applications, 2009, 2: 150-159

doi:10.4236/jsea.2009.23022 Published Online October 2009 (http://www.SciRP.org/journal/jsea)

S. L. TING1, C. C. SHUM2, S. K. KWOK1, A. H. C. TSANG1, W. B. LEE1

1Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Hong Kong, China; 2Department of

Computing, The Hong Kong Polytechnic University, Hong Kong, China.

Email: jacky.ting@polyu.edu.hk

Received January 16th, 2009; revised June 18th, 2009; accepted June 24th, 2009.

ABSTRACT

Data mining is the process of finding patterns, associations or relationships among data by applying analytical techniques that involve building a model; the result becomes useful information or knowledge. Advances in medical devices and database management systems have created a huge number of databases in biomedicine. Establishing a methodology for knowledge discovery and for managing large amounts of heterogeneous data has become a major research priority. This paper introduces some basic data mining techniques, covering unsupervised and supervised learning, and reviews the application of data mining in biomedicine. Applications of multimedia mining, including text, image, video and web mining, are discussed. The key issues faced by computing professionals, medical doctors and clinicians are highlighted. We also state some foreseeable future developments in the field. Although extracting useful information from raw biomedical data is a challenging task, data mining remains a promising and rich field for research.

Keywords: Data Mining, Biomedicine

1. Introduction

With the tremendous improvement in computer speed and the decreasing cost of data storage, huge volumes of data are being created. However, data by itself has no value; it becomes useful only when it can be turned into information. The field of data mining was born to generate meaningful information, or knowledge, from databases. The field is about two decades old. Early pioneers such as U. Fayyad, H. Mannila, G. Piatetsky-Shapiro, G. Djorgovski, W. Frawley, P. Smith, and others found that traditional statistical techniques were not adequate to handle massive amounts of data. They recognized the need for better, faster and cheaper ways to deal with the dramatic increase in the amount of data.

Nowadays, databases are being created and accumulated at a dramatic speed, and data is no longer restricted to numeric or character types, especially in biomedicine. Advanced medical devices and database management systems enable the integration of different types of high-dimensional multimedia data (e.g. text, image, audio, and video) under the same umbrella. Establishing a methodology for knowledge discovery and management of large amounts of heterogeneous data has therefore become a main priority. Various techniques are used in different areas of biomedicine, including genomics, proteomics, medical diagnosis, effective drug design and the pharmaceutical industry.

In this paper, we first give a brief outline of what data mining is, its role in the knowledge discovery process, and the basic principles of some commonly used data mining techniques. Next, we present the results of our investigation into the applications of data mining in biomedicine, which covers biology, medicine, pharmacy and health care. Lastly, we discuss some difficulties of data mining in biomedicine and possible directions for future development.

2. What is Data Mining?

Data mining (DM) is the process of finding patterns, associations or relationships among data by applying analytical techniques that involve building a model; the result becomes useful information or knowledge. DM can also be expressed as

• Nontrivial extraction of implicit, previously unknown, and potentially useful information from data [1]; and

• Making sense of large amounts of mostly unsupervised data in some domain [2].

It is an interdisciplinary subject that lies at the interface of pattern recognition and database systems and merges techniques from mathematics and statistics as well as from the artificial intelligence and machine learning communities. It has a great deal in common with statistics, but there are differences. Unlike statistics, data mining can deal with heterogeneous data fields.

Very often, the term knowledge discovery is used together with data mining. Knowledge discovery, also known as knowledge discovery in databases (KDD), is the process that seeks new knowledge in some application domain. DM is one of the steps in the knowledge discovery process. Figure 1 outlines the six-step hybrid KDD model described in [2].

The initial step, understanding the problem domain, involves working closely with domain experts to define the problem and determine the project goals, and learning about current solutions to the problem. A description of the problem, including its restrictions, is prepared, and the DM tool to be used in the later stages is selected. Next, we need to understand the data, which includes collecting sample data and deciding which data, including format and size, will be needed. Data are checked for completeness, redundancy, missing values, plausibility of attribute values, etc. Data preparation decides which data will be used as input for the DM methods in the subsequent step. It involves sampling, running correlation and significance tests, and data cleaning. The data miner then uses various DM methods to derive knowledge from the preprocessed data. Evaluation includes understanding the results and checking whether they are novel. Finally, we decide how to use and deploy the discovered knowledge.
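The data-understanding checks mentioned above (missing values, redundancy, plausibility of attribute values) can be illustrated with a minimal Python sketch. The record layout, the field names and the plausible age range are hypothetical and were chosen only for illustration; they are not part of the KDD model in [2].

```python
from collections import Counter

# Hypothetical patient records; None marks a missing value.
records = [
    {"id": 1, "age": 54, "glucose": 6.1},
    {"id": 2, "age": None, "glucose": 7.8},
    {"id": 3, "age": 230, "glucose": 5.4},  # implausible age
    {"id": 1, "age": 54, "glucose": 6.1},   # exact duplicate of the first record
]

def audit(rows, plausible_age=(0, 120)):
    """Count missing values per field, duplicate rows and implausible ages."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                missing[field] += 1

    # A row counts as redundant if an identical row was seen earlier.
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if key in seen:
            duplicates += 1
        seen.add(key)

    low, high = plausible_age  # assumed range, for illustration only
    implausible = sum(
        1 for row in rows
        if row["age"] is not None and not (low <= row["age"] <= high)
    )
    return {"missing": dict(missing), "duplicates": duplicates,
            "implausible_age": implausible}

report = audit(records)
print(report)
```

In practice such a report would be reviewed with the domain experts before any data preparation is carried out.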

3. Data Mining Techniques

Data mining techniques fall into two broad categories: unsupervised and supervised. Unsupervised learning refers to techniques that are not guided by any particular variable or class label. In unsupervised learning, we do not create a model or hypothesis prior to the analysis. We apply the algorithm directly to the data and observe the results. A model is then built according to the results. Thus, unsupervised learning is used to define classes for data without class assignments. Clustering is one of the common unsupervised techniques.

In contrast, for supervised learning, a model is built prior to the analysis. We then apply the algorithm to the data in order to estimate the parameters of the model. The objective of building models using supervised learning is to predict an outcome or category of interest. The biomedical literature on applications of supervised learning techniques is vast. Classification, statistical regression and artificial neural networks are very common supervised learning techniques used in medical and clinical research. Table 1 summarizes the characteristics of the two learning methods and the techniques used for each. A brief explanation of these learning techniques follows.

Figure 1. Six-step hybrid KDD model [2]

Table 1. Comparing the characteristics and the techniques of unsupervised and supervised learning

                        Characteristics                                   Techniques
Unsupervised Learning   • No guidance                                     • Clustering
                        • Used to define the class                        • Association Rule
                        • Seldom utilized (until recently)
Supervised Learning     • With guidelines                                 • Classification
                        • Class defined                                   • Statistical Regression
                        • Common, with vast literature and applications   • Artificial neural networks

3.1 Clustering

Clustering is an unsupervised learning technique that reveals natural groupings in the data. Cluster analysis refers to the grouping of a set of data objects into clusters. A cluster is a collection of data objects that are similar to one another within the same cluster but dissimilar to the objects in other clusters. Clustering is also called unsupervised classification, since no predefined classes are assigned.
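As an illustration of how such groupings can be found, the following is a minimal sketch of k-means, one widely used clustering algorithm (the paper does not prescribe a particular algorithm). The sample points, the choice of k = 2 and the seeding of the centroids with the first k points are assumptions made for the example.

```python
def k_means(points, k, iterations=10):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical measurements forming two visually separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
print(clusters)
```

No class labels are supplied: the two groups emerge only from the distances between the points.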

3.2 Association Rule

Association rule discovery finds relationships between the different items in a database. A rule is normally expressed in the form X => Y, where X and Y are sets of attributes of the dataset, which implies that transactions that contain X tend to also contain Y.
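The strength of a rule X => Y is commonly quantified by its support and confidence. These two measures are standard in the association rule literature but are not defined in this paper, so the sketch below is illustrative only; the transactions are hypothetical.

```python
# Hypothetical transactions: the set of items recorded for each patient visit.
transactions = [
    {"fever", "cough", "antibiotic"},
    {"fever", "cough"},
    {"fever", "antibiotic"},
    {"cough"},
    {"fever", "cough", "antibiotic"},
]

def support(itemset, rows):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for row in rows if itemset <= row) / len(rows)

def confidence(x, y, rows):
    """Estimated probability that a transaction containing X also contains Y."""
    return support(x | y, rows) / support(x, rows)

x, y = {"fever", "cough"}, {"antibiotic"}
rule_support = support(x | y, transactions)
rule_confidence = confidence(x, y, transactions)
print(rule_support, rule_confidence)
```

A rule is usually retained only if both values exceed thresholds chosen by the analyst.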

3.3 Classification

Classification is a supervised learning method. It categorizes, or assigns class labels to, a pattern set under supervision. The objective of classification is to develop a model for each class. Classification methods can usually be categorized as follows:

a) Decision tree

Decision tree classifiers divide a decision space into piecewise constant regions. A tree splits a dataset on the basis of discrete decisions, using certain thresholds on the attribute values. It is one of the most widely used classification methods because it is easy to interpret and can be represented as if-then-else rules.

b) Nearest-neighbor

Nearest-neighbor classifiers [3] typically define the proximity between instances, find the neighbors of a new instance, and then assign to it the label of the majority class of its neighbors.

c) Probabilistic models

Probabilistic models calculate probabilities for hypotheses based on Bayes’ theorem [3].
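Of the three families listed above, the nearest-neighbor classifier is the simplest to sketch. The example below follows the description in b): it measures proximity (here Euclidean distance, an assumption), finds the k nearest training instances and takes a majority vote. The two-feature training data and class labels are hypothetical.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train is a list of (features, label); return the majority label of the k nearest."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set.
train = [
    ((1.0, 1.0), "benign"),
    ((1.2, 0.8), "benign"),
    ((0.9, 1.3), "benign"),
    ((3.0, 3.2), "malignant"),
    ((3.1, 2.9), "malignant"),
    ((2.8, 3.0), "malignant"),
]

prediction = knn_predict(train, (2.9, 3.1))
print(prediction)
```

The choice of k and of the distance measure strongly affects the result and would normally be tuned on held-out data.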

3.4 Statistical Regression

Regression models are very popular in the biomedical literature and have been applied in virtually every subspecialty of medical research. Before computers were widely used, linear regression was the most popular model, since solutions were available for the problem of estimating the intercept and coefficients of the regression equation. It has a solid foundation in statistical theory. Linear regression amounts to finding the line that minimizes the sum of squared vertical distances to a set of data points; that is, finding the equation of the line Y = a + bX. With the help of computers and software packages, we can now fit highly complex models.
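For the single-predictor line Y = a + bX, the least-squares estimates have a closed form: b is the ratio of the covariance of X and Y to the variance of X, and a is the mean of Y minus b times the mean of X. A minimal sketch with made-up data points:

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates (a, b) for the line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

# Illustrative data lying exactly on Y = 1 + 2X.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = fit_line(xs, ys)
print(a, b)
```

With noisy data the same formulas return the line with the smallest sum of squared residuals.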

3.5 Artificial Neural Networks

Artificial neural networks [4] are signal processing systems that try to emulate the behavior of the human brain by providing a mathematical model of a combination of numerous neurons connected in a network. A network learns through examples and discriminates the characteristics among various pattern classes by reducing the error and automatically discovering inherent relationships in a data-rich environment. No rules or programmed information are needed beforehand. It is composed of many interconnected elements, called nodes. The connection between two nodes is weighted, and the network is trained by adjusting these weights. The weights are network parameters whose values are obtained after the training procedure. There are usually several layers of nodes. During the training procedure, the inputs are fed into the input layer with the desired output values as targets. A comparison mechanism operates between the output and the target value, and the weights are adjusted in order to reduce the error. The procedure is repeated until the network output matches the targets. Neural networks have many advantages, such as adaptive learning ability, self-organization, real-time operation and insensitivity to noise. However, they also have significant disadvantages: they are highly dependent on the training data, and they do not provide an explanation for the decisions they make, working like a ‘black box’.
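The training loop described above (compare output with target, adjust the weights to reduce the error, repeat) can be sketched with a single-neuron perceptron. This is a deliberately simplified stand-in for the multi-layer networks discussed in [4]; the learning rate, the number of epochs and the logical-AND training set are assumptions made for the example.

```python
def train_perceptron(samples, epochs=20, rate=1.0):
    """Train one neuron with a step activation using the perceptron update rule."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = 1 if activation > 0 else 0
            error = target - output  # comparison between output and target
            # Adjust each weight in the direction that reduces the error.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

def predict(weights, bias, inputs):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

# Toy training set: the logical AND of two binary inputs.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(samples)
outputs = [predict(weights, bias, x) for x, _ in samples]
print(weights, bias, outputs)
```

The learned weights are just numbers and carry no explanation of the decision, which illustrates the 'black box' criticism on a small scale.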

3.6 Advanced Data Mining Techniques

During the past few years, researchers have tried to combine both unsupervised and supervised methods for the analysis [5]. Some examples of advanced unsupervised learning models are hierarchical clustering, c-means clustering, self-organizing maps (SOM) and multidimensional scaling techniques. Advanced examples of supervised learning models are classification and regression trees (CART) and support vector machines [6].

4. Applications of Data Mining in Biomedicine

4.1 Data Mining Models

Data mining is applied in descriptive modeling to aid understanding. In [7], Tseng and Yang use Gene Ontology (GO) to group genes in advance in order to show the potential relations among gene groups and discover the hidden relations between gene sets in association with GO terms. Data mining can also be used to predict the outcome of a future observation or to assess the potential risk in a disease situation. Regarding predictive power, data mining algorithms can learn from past examples in clinical data and model the often non-linear relationships between the independent and dependent variables; the resulting model represents formalized knowledge that can often provide a good diagnostic option [8]. Data mining techniques have been widely used to find new patterns and knowledge from biomedical data.

4.2 Recent Development

The typical data mining process involves transferring data originally collected in production systems (such as electronic medical records) into a data warehouse, cleaning or scrubbing the data to remove errors and check for format consistency, and then searching the data using statistical models, artificial intelligence (such as neural networks), and other machine learning methods [9]. In [10], Prather et al. employ KDD to identify the factors that will improve the quality and cost effectiveness of perinatal care in an extensive clinical database of obstetrical patients. Given a data warehouse of diabetic patients, Breault et al. employ CART to investigate the factors affecting the occurrence of diabetes [11]. They found, surprisingly, that younger age predicts bad diabetic control, which opens up a new area for managing diabetic control in younger patients. Similar applications of data mining can be found in Table 2.

Apart from diagnostic prediction, the knowledge discovery ability of data mining has also proved to be a good detector of adverse drug events (ADE). In [12], Wilson et al. utilize KDD techniques in pharmacovigilance to detect signals earlier than existing methods. In [13], Lian et al. point out that a prescription is specified by a preference function based on the user's preference in prior clinical experience. Thus, they propose a dose optimization framework based on probability theory. In [14], Susan and Warren demonstrate that the conditional probability (CP) model is superior to multiple linear regression and discriminant analysis models in optimizing drug lists. Exploiting the strong relationship between diagnosis and medication, it formulates a posterior probability (what medication is needed) based on an a priori probability (what diagnosis has been made). This approach aligns with the Mediface proposed in [15].
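The idea of ranking medications by their probability given a diagnosis can be illustrated by estimating P(medication | diagnosis) from past prescriptions. This is only a rough sketch of the general idea and not the CP model of [14]; the prescription history and the drug names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (diagnosis, medication) pairs taken from past prescriptions.
history = [
    ("hypertension", "amlodipine"),
    ("hypertension", "amlodipine"),
    ("hypertension", "lisinopril"),
    ("diabetes", "metformin"),
    ("diabetes", "metformin"),
    ("diabetes", "insulin"),
    ("diabetes", "metformin"),
]

def medication_probabilities(pairs):
    """Estimate P(medication | diagnosis) by relative frequency."""
    counts = defaultdict(Counter)
    for diagnosis, medication in pairs:
        counts[diagnosis][medication] += 1
    return {
        d: {m: c / sum(meds.values()) for m, c in meds.items()}
        for d, meds in counts.items()
    }

table = medication_probabilities(history)
# Rank candidate medications for a given diagnosis, most probable first.
ranked = sorted(table["diabetes"], key=table["diabetes"].get, reverse=True)
print(table["diabetes"], ranked)
```

A drug list ordered in this way places the most likely medication for the stated diagnosis first.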

Table 2. Recent applications of data mining

Author                         Description
Megalooikonomou et al. [20]    They introduce statistical methods that aid the discovery of interesting associations and patterns between brain images and other clinical data.
Brossette et al. [21]          They design a Data Mining Surveillance System (DMSS) that uses novel data mining techniques to discover unsuspected, useful patterns of nosocomial infections and antimicrobial resistance from the analysis of hospital laboratory data.
Antonie et al. [22]            They investigate the use of different data mining techniques for anomaly detection and classification of medical images.
Coulter et al. [23]            They examine the relation between antipsychotic drugs and myocarditis and cardiomyopathy.
Li et al. [24]                 They explore a novel analytic cancer detection method with different feature selection methods and compare the results obtained on different datasets with those reported by Petricoin et al. in terms of detection performance and selected proteomic patterns.
Delen et al. [25]              They use two popular data mining algorithms (artificial neural networks and decision trees) along with a commonly used statistical method (logistic regression) to develop prediction models for breast cancer using a large dataset.
Su et al. [26]                 They use four different data mining approaches to select the relevant features from the data to predict diabetes.
Phillips-Wren et al. [27]      They assess the utilization of healthcare resources by lung cancer patients in relation to their demographic characteristics, socioeconomic markers, ethnic backgrounds, medical histories, and access to healthcare resources in order to guide medical decision making and public policy.

Figure 2. Framework for the integrated approach [17]

In recent years, numerous researchers have sought to integrate several data mining and artificial intelligence techniques to enhance the mining results and support decision making. For example, Kuo et al. integrate clustering analysis and association rule mining to cluster a health insurance database and hence discover useful rules for each group [16]. In [17], Zhuang et al. combine data mining and case-based reasoning (CBR) methodologies to provide intelligent decision support for pathology test ordering by GPs. They claim that the integrated system can enhance test ordering in terms of evidence base, situational relevance, flexibility and interactivity. In [18], Huang et al. propose a model of a chronic diseases prognosis and diagnosis (CDPD) system that integrates data mining and CBR to support chronic disease treatment. In contrast to traditional coronary artery disease (CAD) diagnostic methodologies, Tsipouras et al. integrate decision trees and fuzzy modeling to form a fuzzy rule-based decision support system that obtains a significant improvement compared with artificial neural networks and an adaptive neuro-fuzzy inference system [19]. An example of such integration can be found in Figure 2.

All in all, most of the existing data mining applications focus on exploring patterns in sound biomedical databases. Given properly structured data collected via different medical devices, data mining techniques can serve as a promising tool to convert the information into useful and valuable knowledge for physicians and researchers.

4.3 Current Trend

4.3.1 Multimedia Mining

Classically, databases were formed by tuples of numeric and alphanumeric contents, but with the widespread use of medical information systems, the information captured now extends to different data types including text, document, image, graphics, speech, audio, hypertext, etc. At the same time, the growth of information on the Internet adds a new dimension: the Internet can be considered the largest distributed multimedia database of useful information. Given the tremendous amount of visual information, developing data mining techniques for such multimedia data is a natural next step in biomedicine. With the rapid advances in digital multimedia technology, numerous researchers have introduced several novel data mining techniques, namely image mining, text mining, video mining, and web mining. Below we discuss these four technologies and how they impact biomedicine.

4.3.2 Text Mining

Apart from medical images and signals, another kind of clinical data that physicians would like to interpret is unstructured free text. Because a lot of information is held in text or document databases, in the form of electronic books, research articles, digital libraries, medical dictionaries, etc., several researchers have developed a novel data mining approach for extracting useful knowledge from textual data or documents, called text mining [28,29]. For example, text mining techniques can be employed to extract information on protein-protein interactions from three different documents.

In addition to the traditional data mining techniques, text mining uses techniques from many scientific fields (e.g. text analysis) to gain insight and automatically reveal useful information to human users. In [30], Cohen and Hunter describe text mining as “the use of automated methods for exploiting the enormous amount of knowledge available in the biomedical literature”. One example of text mining is managing health information on the Internet and responding to the needs of those with health information inquiries about HIV/AIDS [31]. Another common application of text mining is extracting information on protein-protein interactions. Given unstructured text, Zhou et al. employ semantic parsing and a hidden vector state model to mine the knowledge within the text [32]. By setting the annotation PROTEIN_NAME (ACTIVATE(PROTEIN_NAME)), the system will automatically generate the result shown in Figure 3.
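To give a flavor of interaction extraction from free text, the following is a minimal pattern-matching sketch. It is far simpler than, and should not be read as, the semantic parsing and hidden vector state model of [32]. The sentences, the list of interaction verbs and the heuristic that a protein name starts with a capital letter and contains a digit are all assumptions made for the example.

```python
import re

# Hypothetical sentences; real biomedical text needs far more robust parsing.
sentences = [
    "Spc97p interacts with Spc98 in yeast.",
    "RAD51 activates BRCA2 in response to DNA damage.",
    "The patient was discharged after two days.",
]

# Crude heuristic: a protein name starts with a capital letter and contains a digit.
PROTEIN = r"[A-Z][A-Za-z]*\d+[A-Za-z0-9]*"
PATTERN = re.compile(
    rf"\b({PROTEIN})\s+(activates|inhibits|interacts with|binds)\s+({PROTEIN})"
)

def extract_interactions(texts):
    """Return (protein, relation, protein) triples found in the texts."""
    found = []
    for text in texts:
        for first, relation, second in PATTERN.findall(text):
            found.append((first, relation, second))
    return found

interactions = extract_interactions(sentences)
print(interactions)
```

Hand-written patterns of this kind miss most real phrasings, which is one motivation for statistical parsing models.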

Figure 3. Semantic parsing employed in protein documents [32]

4.3.3 Image Mining

More and more medical procedures employ imaging as a preferred diagnostic tool. Thus, there is a need to develop methods for efficient mining in image databases, which is completely different from, and more difficult than, mining in structured data types. Mining image data is therefore a challenging problem. Meanwhile, because numerous imaging techniques (such as SPECT, MRI, PET, and the collection of ECG or EEG signals) can generate gigabytes of data per day, and because image data are heterogeneous in nature (for example, a single cardiac SPECT procedure for one patient may contain dozens of 2D images), image mining has become one of the emerging fields in biomedical study. Typically, most activities in mining image data are based on searching, retrieving and comparing a query image with the stored images by degree of similarity or by feature(s). In [22], Antonie et al. present the use of different data mining techniques for tumor classification in digital mammography, and they find that association rules obtain better results than neural networks. Furthermore, in order to tackle the complicated nature of the surrounding breast tissue and the variation of microcalcifications (MCs) in shape, orientation, brightness and size, Peng et al. propose a knowledge-discovery incorporated genetic algorithm (KD-GA) to search for bright spots in a mammogram, evaluate the possibility of a bright spot being a true MC, and adaptively adjust the associated fitness values [34]. Another example, which introduces the notion of image sequence similarity patterns (ISSP) for discovering possible Space-Occupying Lesions (PSO) in brain images, is presented in [35].
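The similarity-based retrieval mentioned above can be sketched with grey-level histograms compared by histogram intersection. This is one simple choice of feature and similarity measure, assumed here for illustration; it is not the method of any of the cited studies. The tiny 2x2 "images" and their names are hypothetical.

```python
def histogram(image, bins=4, max_value=256):
    """Normalised grey-level histogram of a 2-D list of pixel intensities."""
    counts = [0] * bins
    width = max_value // bins
    pixels = [p for row in image for p in row]
    for p in pixels:
        counts[min(p // width, bins - 1)] += 1
    return [c / len(pixels) for c in counts]

def similarity(h1, h2):
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def retrieve(query, database):
    """Return the name of the stored image most similar to the query."""
    q = histogram(query)
    return max(database, key=lambda name: similarity(q, histogram(database[name])))

# Hypothetical stored images (pixel intensities 0-255).
database = {
    "dark_scan": [[10, 20], [30, 40]],
    "bright_scan": [[200, 220], [240, 250]],
}
query = [[15, 25], [35, 60]]
best = retrieve(query, database)
print(best)
```

Real systems use richer features such as texture, shape and spatial layout, but the retrieve-by-similarity structure is the same.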

4.3.4 Video Mining

With the advancement of streaming audio and digital TV, more and more video data are being stored, which has attracted the interest of researchers in discovering and exploring interesting patterns in audio-visual content. Video mining has been developed to meet this demand. In biomedicine, specialists increasingly use cameras to record video of each operation, which implies that there are ample opportunities for applying data mining principles in conjunction with video retrieval techniques. For example, Zhu et al. introduce a video database management framework and strategies for video content structure and event mining [36]. They first segment the video shots into groups and then organize the video shots into a hierarchical structure using clustered scenes, scenes, groups, and shots, in increasing granularity from top to bottom. With a sound structure in place, audio and video processing techniques are integrated to mine event information, such as dialog, presentation and clinical operation, from the detected scenes.

4.3.5 Web Mining

The Internet is growing at a tremendous speed, and the World Wide Web (WWW) has become the largest database that has ever existed. In particular, much of the medical literature is written in electronic format and is now widely available and accessible on the Internet. Therefore, the capability to discover knowledge in, and retrieve information from, the WWW is important to physicians. However, the complexity of web pages and the dynamic nature of data stored on the Internet make the adoption of data mining techniques difficult. In [37], web mining is defined as the use of data mining techniques to automatically retrieve, extract and evaluate information for knowledge discovery from the Internet. Building on this ability to explore hidden information, Yu and Jonnalagadda present an approach that combines the Semantic Web and mining to improve the quality of Web mining results and enhance the functions, services and interoperability of medical information systems and standards in the healthcare field [38].

5. Discussions

Biomedicine has evolved into a new application area for data mining in recent years. As reflected by the brief literature survey in this study, current data mining research concentrates on applying data mining techniques to manage complex and unstructured data, in particular data of a visual and textual nature. Although numerous studies report satisfactory results from adopting data mining, data quality is found to be one of the major challenges affecting performance in the biomedicine industry. In theory, data mining is a data-driven approach, as its outcome depends heavily on the quality and quantity of the available data. However, data in biomedicine is rather complex in nature. Thus, in order to enhance the performance of data mining in this domain, the following concerns are raised:

a) Huge volume of data

Because of the sheer size of databases, it is unlikely that any of the data mining methods will succeed with raw data. In the field of biomedicine, it is particularly true that specific medical experts are required to pre-process the data before data mining is adopted. As different medical experts specialize in different medical areas, handling the data beforehand is time consuming and labor intensive.

b) Dynamic nature of data

Databases are constantly updated, with new information added at an alarming rate, for example by the addition of new SPECT images (for the same or a new patient) or by the replacement of existing ones (when a SPECT had to be repeated because of technical problems). This requires methods that are able to incrementally update the knowledge learned so far.

c) Incomplete or imprecise data

The information collected in a database can be either

incomplete or imprecise. To address this problem, fuzzy

sets and rough sets were developed explicitly.

d) Noisy data

It is very difficult for any data collection technique to

entirely eliminate noise. This implies that data mining

methods should be made less sensitive to noise, or care

should be taken that the amount of noise in data to be

collected in the future will be approximately the same as

that in the current data.

e) Missing attribute values

Missing values create a problem for most data mining methods, since nearly all the methods require a fixed dimension for each data object. In fact, this problem is widely encountered in medical databases because most medical data are collected as a byproduct of patient care activities rather than for organized research protocols; the problem exists even in some large medical databases, such as the breast cancer data set from the University of Wisconsin Hospitals. Typically, one approach to remedying this problem is to ignore the missing values or omit any records containing missing values; another approach is to substitute missing values with the most likely values, such as the mode or mean of the observed values, or to infer the missing values directly from existing values via an artificial intelligence method (e.g. case-based reasoning).

f) Redundant, insignificant, or inconsistent data

The data set may contain redundant, insignificant, or inconsistent data objects and attributes. Generally, medical data can be stored in numeric and textual formats, and a large amount of preprocessing is required to make the data useful. For example, medical terms are frequently misspelled, and one medication or condition may be commonly referred to by a variety of names (e.g. stomach pain and abdominal pain).
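One of the remedies listed under item e), substituting a missing value with the mean (for numeric attributes) or the mode (for categorical attributes) of the observed values, can be sketched as follows. The records and field names are hypothetical, and None is assumed to mark a missing value.

```python
from collections import Counter
from statistics import mean

def impute(rows, field):
    """Fill missing values of one field with the mean (numeric) or mode (categorical)."""
    present = [row[field] for row in rows if row[field] is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    # Return new records; the original rows are left unchanged.
    return [{**row, field: fill if row[field] is None else row[field]} for row in rows]

# Hypothetical records with missing values.
rows = [
    {"age": 40, "sex": "F"},
    {"age": None, "sex": "F"},
    {"age": 60, "sex": None},
    {"age": 50, "sex": "M"},
]
completed = impute(impute(rows, "age"), "sex")
print(completed)
```

Such simple substitution reduces the variance of the attribute, so it should be applied with care when many values are missing.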

In addition to the data quality perspectives, several other considerations are also raised:

a) Quality of learning mechanism

Over- and under-learning will affect the performance of data mining: the learning mechanism will misinterpret the human’s preferences and require a human to make adjustments in order to achieve the goal state.

b) Quality of knowledge representation

Knowledge representation is an important element: it represents knowledge in an understandable manner so that conclusions can be drawn from it. If the machine is unable to store the knowledge discovered, it is also incapable of representing it; such insufficient knowledge will make the machine less intelligent.

c) Nature of problem

When the problem is too complex, is chaotic, or has not been encountered before, the intelligent machine does not have enough knowledge or time to deduce an appropriate result. Using diagnostic decision support as an example, if most of the learning cases and rules relate to some general diagnosis, then when a new case relating to a specific diagnosis is encountered, the system cannot provide a good solution since no rules are triggered inside the system.

As a result of this study, we can conclude that real opportunities to use data mining in biomedicine will arise only when data quality reaches a standard level and there are new methods or algorithms to handle the complex data types. Furthermore, the adoption of data mining in biomedicine is quite a young field with many issues that still need to be researched and explored in depth. Some further research directions and questions are summarized as follows:

a) An absurd and false model may fit perfectly if the model has enough complexity relative to the amount of data available. When the degrees of freedom in parameter selection exceed the information content of the data, the final (fitted) model parameters become arbitrary, which reduces or destroys the ability of the model to generalize beyond the fitting data. If you have a learning algorithm in one hand and a dataset in the other, to what extent can you decide whether the learning algorithm is in danger of over-fitting or under-fitting? Almost all data mining research is done on an ad hoc basis. The techniques are designed for an individual problem. There is no unifying theory.

b) Large multimedia databases often need to be stored in compressed form. Data compression comprises the techniques that reduce redundancy in data representation. Reducing the storage requirement is equivalent to increasing the capacity of the storage medium. The development of data compression technology will play a significant role in the performance of data mining. However, the data compression field seems to have been neglected so far by the data mining community.

c) In today's networked society, data are not stored in a single place. The Internet is without doubt the largest database we have ever had. Information on the Internet is often a mix of text, image, audio, speech, hypertext, graphics and video components. In many cases, databases are spread over multiple files on different disks or in different geographical locations. How to handle and integrate all kinds of heterogeneous data in a distributed environment will open up a new area of development.

d) More and more multimedia data mining systems will be used by medical doctors or clinicians. The design of such a system needs to take human perception into consideration. How to develop a system that works synergistically is a subject of ongoing research. To achieve this goal, biologists, medical doctors, clinicians and computing professionals all need to work closely together. Any missing part may lead to the failure of the system design.
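The over-fitting question raised in direction a) can be illustrated with a minimal sketch. It compares a two-parameter least-squares line with a polynomial that has as many free parameters as there are data points. The data are hypothetical (a true relation y = x observed with an alternating error of +0.5 and -0.5), and the helper names `fit_line`, `fit_interpolant` and `mse` are assumptions made for this example only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line (2 free parameters)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every point (len(xs) free parameters)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical fitting data: y = x with an alternating +/-0.5 observation error.
train_x = [float(k) for k in range(8)]
train_y = [x + (0.5 if k % 2 == 0 else -0.5) for k, x in enumerate(train_x)]
# Held-out points between the fitting points, taken from the true relation.
test_x = [k + 0.5 for k in range(7)]
test_y = list(test_x)

line = fit_line(train_x, train_y)
interp = fit_interpolant(train_x, train_y)

results = {
    "line": (mse(line, train_x, train_y), mse(line, test_x, test_y)),
    "interpolant": (mse(interp, train_x, train_y), mse(interp, test_x, test_y)),
}
for name, (fit_err, held_out_err) in results.items():
    print(f"{name:12s} fitting MSE = {fit_err:.4f}   held-out MSE = {held_out_err:.4f}")
```

The interpolating polynomial reproduces the fitting data exactly, yet its error on the held-out points is far larger than that of the simple line. Comparing the error on the fitting data with the error on held-out data is one practical, though not theoretically unifying, way to check whether a model is over-fitting.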

6. Conclusions

The effective use of data mining tools in biomedicine should bring a revolutionary impact to the field. The study of biomedical processes relies heavily on identifying understandable patterns present in the data. These patterns may be used for diagnostic or prognostic purposes as well as for the analysis of microarrays. Data mining is at the core of the pattern recognition process. Biologists, medical doctors, clinicians and computing professionals should collaborate so that the two fields can contribute to each other. The challenge is for each to widen its focus to attain harmonious and productive collaboration and to develop best practices.

7. Acknowledgement

The authors would like to express their sincere thanks to

the Research Committee of The Hong Kong Polytechnic

University for financial support of the research work

presented in this paper.
