J. Biomedical Science and Engineering, 2009, 2, 506-515
doi: 10.4236/jbise.2009.27073. Published Online November 2009 in SciRes (JBiSE), http://www.scirp.org/journal/jbise
Correlation of selected molecular markers in chemosensitivity
prediction
David King, Thomas Keane, Wei Hu
Department of Computer Science, Houghton College, Houghton, NY 14744, USA.
Email: Wei.Hu@houghton.edu
Received 3 July 2009; revised 19 August 2009; accepted 20 August 2009.
ABSTRACT
Finding effective cancer treatment is a challenge,
because the sensitivity of the cancer stems from both
intrinsic cellular properties and acquired resistances
from prior treatment. Previous research has revealed
individual protein markers that are significant to
chemosensitivity prediction. Our goal is to find corre-
lated protein markers which are collectively signi-
ficant to chemosensitivity prediction to complement
the individual markers already reported. In order to
do this, we used the D’ correlation measurement to
study the feature selection correlations for chemo-
sensitivity prediction of 118 anti-cancer agents with
putatively known mechanisms of action. Three data-
sets on the NCI-60 were utilized in this study: two
protein datasets, one previously studied for chemo-
sensitivity prediction and another novel to this topic,
and one DNA copy number dataset. To validate our
approach, we identified the protein markers that
were strongly correlated by our analysis with the
individual protein markers found in previous studies.
Our feature analysis discovered highly correlated
protein marker pairs, based on which we found
individual protein markers with medical significance.
While some of the markers uncovered were con-
sistent with those previously reported, others were
original to this work. Using these marker pairs we
were able to further correlate the cellular functions
associated with them. As an exploratory analysis, we
discovered feature selection correlation patterns
between and within different drug mechanisms of
action for each of our datasets. In conclusion, the
highly correlated protein marker pairs as well as
their functions found by our feature analysis are
validated by previous studies, and are shown to be
medically significant, demonstrating D’ as an effective
measurement of correlation in the context of
feature selection for the first time.
Keywords: Cancer; Chemosensitivity; Correlation; D’;
Feature Selection; Genetic Algorithm; Markov Blanket;
Memetic Algorithm; NCI-60
1. INTRODUCTION
The success of cancer treatment as well as the severity of
the side effects of said treatment is heavily dependent on
the sensitivity of the cancerous tissue to chemical treat-
ment. Clinics face a great challenge in predicting treat-
ment success, because chemosensitivity is determined by
both intrinsic genomic and proteomic characteristics of
the cancer as well as resistances induced through prior
treatment. When trying to choose a therapy that will
work best for a patient, it is important to evaluate their
physical responses to different drugs. Because of this,
many studies have been done to improve drug response
prediction accuracy.
Data profiling of cancer cells at genomic, proteomic,
chromosomal and functional levels has long been used in
the analysis of pharmacological sensitivity of the cancer
cells [1,2,3,4]. A primary source of cancer data in this
field is a set of 60 human cancer cell lines provided by
the National Cancer Institute (NCI-60) [5]. These cell
lines have been in use since 1990 and over 100,000
chemical compounds have been tested on them [6]. The
NCI-60 includes melanomas, leukemias and samples of
ovarian, prostate, renal, breast, colon, lung and central
nervous system cancers.
1.1. Related Works on the NCI-60
One study [7] used protein expression profiles to predict
responses to a set of 118 anti-cancer agents with known
or experimentally supported mechanisms of action [8].
Well known machine learning algorithms such as Ran-
dom Forest, Nearest Neighbor and Relief were used to
make chemosensitivity predictions. One Random Forest
based classifier was built for each of the 118 drugs. To
measure the significance of their predictions, this study
compared the computed predictions against random pre-
dictions, which can be measured by a standard P-value.
The P-value was the percentage of 1000 random predic-
tions with higher accuracy than the calculated predic-
tions. The study found chemosensitivity prediction ac-
curacies ranging from 50 to 90%, with the vast majority
being between 50 and 70%. Every prediction had
P-values less than 0.019, and 97 of the predictions had
P-values equal to 0.00.
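The P-value described above can be sketched in a few lines of Python. This is a minimal illustration and not the procedure of [7]: the function name `empirical_p_value`, the uniform sampling of random class labels, and the fixed seed are all assumptions made for the example.

```python
import random


def empirical_p_value(observed_accuracy, true_labels, n_random=1000, seed=0):
    """Fraction of random predictions that beat the observed accuracy.

    Assumption: a "random prediction" assigns each sample a class label
    drawn uniformly from the labels present in true_labels.
    """
    rng = random.Random(seed)
    classes = sorted(set(true_labels))
    better = 0
    for _ in range(n_random):
        guess = [rng.choice(classes) for _ in true_labels]
        hits = sum(g == t for g, t in zip(guess, true_labels))
        if hits / len(true_labels) > observed_accuracy:
            better += 1
    return better / n_random
```

With 1000 random predictions the smallest non-zero value this can return is 0.001, which is consistent with P-values being reported as 0.00 when no random prediction was more accurate.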
A subsequent study by the same research group used a
combination of the previously used proteomic data and
new transcriptional data [9]. This integrative approach
demonstrated its advantage, achieving higher accuracy
and statistical significance, with P-values for all 118
drugs less than 0.001, calculated in the same manner as
in [7].
A separate study [10] analyzed the correlation be-
tween DNA copy number variations, gene expression
levels, and chemosensitivities to the same 118 drugs as in
[7,9]. The analysis indicated that the correlations of gene
expression and DNA copy number are particularly evi-
dent among leukemias and ovarian cancers.
An additional study [6] used four gene expression
datasets, two of which were original to the paper, and
one proteomic dataset. These data sets were used to ob-
serve the effectiveness of transcript profiling for the pre-
diction of different protein expression levels. In addition,
a consensus set selected from the four gene expression
datasets was constructed. This consensus set was found
to have a correlation to the protein dataset of 65%, a
notable percentage that was higher than most reports
for mammalian cells. Further, this consensus
dataset was used to predict tissue origin with a higher
accuracy than any of its parent datasets.
1.2. Feature Selection and Motivation
New technologies in biomedical studies, such as mi-
croarrays, have made the analysis of large volumes of
complex data a necessity [11]. Frequently, a majority of
these data contain noise, i.e., features not relevant to a
particular task at hand, such as classification of cancer
types with gene expression data.
Both studies conducted by [7,9] used Random Forest
as a feature selection technique to improve the accuracy
of chemosensitivity predictions and to single out protein
markers that were particularly important to this task.
In studying the effects of feature selection on chemo-
sensitivity prediction, we observed disparity between
expected and observed results. We ranked and ordered
all features in the smaller protein dataset used in this
study according to the Relief algorithm provided by
Weka. We used Random Forest to make predictions
based on incrementing feature subsets, using the top two
ranked features, then three, four, etc. up to 40 features,
as in Figure 1. We observed that contrary to our expec-
tations, some higher ranked features decreased predic-
tion accuracy, while some lower ranked features in-
creased accuracy.
This led us to hypothesize that features contribute to
the prediction accuracy collectively, rather than inde-
pendently. To test this hypothesis, we developed a new
technique using the D’ measure [12] in order to study the
correlations between feature pairs. As a demonstration of
the utility of this technique, we apply it to those protein
markers found to be significant in [9].
2. MATERIALS AND METHODS
2.1. Datasets
Three datasets derived from the NCI-60 were used in our
study: two sets of protein data, and one of DNA data.
Protein expression data. The first protein expression
dataset had 162 protein markers, hereafter referred to as
Protein162. It was created by Shankavaram et al. [6] and
can be found at http://discover.nci.nih.gov/datasets.jsp.
The second dataset, which contains 52 protein markers
(Protein52), is available at
http://discover.nci.nih.gov/host/2003_profilingtable7.xls.
It was generated by a study of the proteomic profiles of
the NCI-60 [13], and was also used by two studies on
chemosensitivity prediction [7,9].
DNA copy number. The DNA copy number variation
dataset was presented in a study of the correlation be-
tween mRNA and DNA copy number [10]. It is available
at http://discover.nci.nih.gov/datasets.jsp.
Drug activity data. Our drug resistance information
contained activity data from 118 anti-cancer agent
activity profiles. They were screened by Scherf et al. [8]
and recorded using the NCI-60 cancer cell lines. The file
containing this data can be found at
http://discover.nci.nih.gov/nature2000/data/selected_data/dataviewer.jsp?baseFileName=a_matrix118&&nsc=2&dataStart=3.
Figure 1. Random forest prediction accuracy. This plot shows
the prediction accuracies of Random Forest using the same
protein dataset used in [7,9]. The drug on which the prediction
was performed was Bisantrene (NSC # 337766).
SciRes Copyright © 2009 JBiSE
Defining drug sensitivity and resistance. As in [7,9],
we used a threshold to divide sensitivity to a drug into
three categories. The log10(GI50) value was taken for each
cell line to determine sensitivity. Cell lines with
log10(GI50) values at least 0.5 standard deviations above
the average were given the label ‘resistant.’ Those with
values at least 0.5 standard deviations below the average
were labeled ‘sensitive.’ The remaining cell lines were
defined as ‘intermediate’ [7,9].
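The three-way labeling rule can be illustrated with a short Python sketch. The function name `label_cell_lines` and the use of the sample standard deviation are assumptions for the example; the text does not state which standard deviation estimator was used.

```python
import statistics


def label_cell_lines(log_gi50, threshold=0.5):
    """Label each cell line from its log10(GI50) value for one drug.

    Values at least `threshold` standard deviations above the mean are
    'resistant', values at least `threshold` below are 'sensitive',
    and everything else is 'intermediate'.
    """
    mean = statistics.mean(log_gi50)
    sd = statistics.stdev(log_gi50)  # assumption: sample standard deviation
    labels = []
    for value in log_gi50:
        if value >= mean + threshold * sd:
            labels.append("resistant")
        elif value <= mean - threshold * sd:
            labels.append("sensitive")
        else:
            labels.append("intermediate")
    return labels
```

A higher GI50 means that more drug is needed to inhibit growth, which is why the upper tail is the resistant one.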
2.2. D’ Formula
A standard measurement for the correlation between
pairs of events i and j in a set of sequences is D’, which
can be defined by the following formulae:
$$D_{ij} = x_{ij} - p_i q_j$$
$$D'_{ij} = \frac{D_{ij}}{D_{\max}}$$
where $x_{ij}$ is the frequency at which both event i and event
j occur in a single sequence, $p_i$ is the frequency of event i
and $q_j$ is the frequency of event j. If $D_{ij} < 0$,
$$D_{\max} = \min\left[\, p_i q_j,\ (1 - p_i)(1 - q_j) \,\right],$$
and if $D_{ij} > 0$,
$$D_{\max} = \min\left[\, p_i (1 - q_j),\ (1 - p_i)\, q_j \,\right].$$
The D’ formula was introduced by Richard Lewontin
as a measurement of linkage disequilibrium of alleles at
two or more loci on the same chromosome [12]. The D’
formula has been shown to be a more reliable measure-
ment than other measurements of correlation between
pairs of events [14], but this study is the first to use it to
correlate pairs of selected features.
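As a worked illustration of the definition above, the following Python sketch computes D’ from the three frequencies. The function name `d_prime` and the convention of returning 0 when D is exactly 0 are assumptions for the example, since the formula leaves that case undefined.

```python
def d_prime(x_ij, p_i, q_j):
    """Normalized D' for two events, given their joint and marginal frequencies.

    x_ij: frequency with which both events occur in the same sequence
    p_i:  frequency of event i
    q_j:  frequency of event j
    """
    d = x_ij - p_i * q_j
    if d == 0:
        return 0.0  # assumption: no association is reported as 0
    if d < 0:
        d_max = min(p_i * q_j, (1 - p_i) * (1 - q_j))
    else:
        d_max = min(p_i * (1 - q_j), (1 - p_i) * q_j)
    return d / d_max
```

Because D is divided by a positive bound, D’ keeps the sign of D and ranges from -1 (the two events never co-occur as often as they could) to 1 (they co-occur as often as their frequencies allow).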
2.3. Markov Blanket-Embedded Genetic
Algorithm (MBEGA)
Genetic Algorithms have been used as a strategy for
feature selection [15] due to their ability to generate bet-
ter feature subsets than other feature selection algorithms.
In some cases, these genetic algorithms are combined
with memetic operations in order to fine tune results
beyond what would be produced by classical genetic
algorithms alone.
One particular implementation of these memetic algo-
rithms is the Markov blanket-embedded genetic algo-
rithm (MBEGA), which uses an approximation of a Mar-
kov blanket to reduce redundancy in selected features.
Pseudocode for the MBEGA can be found in Figure 2.
In each generation of the algorithm, the MBEGA uses
add and delete operations to add and delete features from
some of the elite feature subsets in the population; the
elite feature subsets are improved by adding important
features and removing those that are less important. Af-
ter the memetic operations, standard genetic algorithm
techniques such as linear ranking, crossover and muta-
tion methods occur to generate the next population [16].
The MBEGA was selected in our study for two reasons.
First, the MBEGA generates a population of feature
subsets in each generation, rather than generating a
single final subset as in classical feature selection
algorithms. The feature subsets from each generation are
represented as binary strings, with a 1 representing a
present feature and a 0 representing an absent feature,
and these strings are used to calculate the D’ values of
our correlation analysis. Second, the MBEGA does not
require a predefined number of features to be selected.
Rather, the MBEGA gradually optimizes both the size of
the feature subset and the accuracy of the classifier.

Markov Blanket-Embedded Genetic Algorithm (MBEGA)
BEGIN
(1) Initialize: Randomly generate an initial population of feature
    subsets encoded as binary strings
(2) For the number of iterations to run
(3)   Evaluate all feature subsets in the population based on
      prediction accuracy
(4)   Select a number of elite feature subsets from the population to
      undergo the Markov blanket memetic operations
(5)   For each elite feature subset, create a set of all present
      features X and all absent features Y
        Add operation BEGIN
          1) Rank the features in Y according to their correlation to
             the class label.
          2) Select a feature Yi in Y so that the larger the
             correlation of a feature in Y the more likely it will be
             picked.
          3) Add Yi to X.
        END
        Delete operation BEGIN
          1) Order the features in X according to their correlation to
             the class label.
          2) Select a feature Xi in X so that the larger the
             correlation of a feature in X the more likely it will be
             picked.
          3) Eliminate all features in X which are less correlated
             than Xi. If no feature is eliminated, remove Xi.
        END
(6)     Replace the original elite feature subset with the improved
        feature subset.
(7)   End For
(8)   Perform crossover and mutation to create the next generation of
      feature subsets.
(9) End For
END
Figure 2. Pseudocode for MBEGA.
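The add and delete operations of Figure 2 can be sketched in Python as follows. This is only an illustration of the pseudocode as printed, not the authors' implementation: the function names, the `relevance` list (a stand-in for each feature's correlation to the class label, assumed positive), and the roulette-wheel selection via `random.choices` are assumptions.

```python
import random


def weighted_pick(candidates, relevance, rng):
    """Pick one feature index; higher relevance means higher probability."""
    weights = [relevance[f] for f in candidates]  # assumed to be positive
    return rng.choices(candidates, weights=weights, k=1)[0]


def add_operation(bits, relevance, rng):
    """Switch on one absent feature, favoring highly relevant ones."""
    absent = [i for i, b in enumerate(bits) if b == 0]
    if not absent:
        return list(bits)
    improved = list(bits)
    improved[weighted_pick(absent, relevance, rng)] = 1
    return improved


def delete_operation(bits, relevance, rng):
    """Pick a present feature Xi and drop every present feature that is
    less relevant than Xi; if there is none, drop Xi itself."""
    present = [i for i, b in enumerate(bits) if b == 1]
    if not present:
        return list(bits)
    chosen = weighted_pick(present, relevance, rng)
    weaker = [i for i in present if relevance[i] < relevance[chosen]]
    improved = list(bits)
    for i in (weaker or [chosen]):
        improved[i] = 0
    return improved
```

Each feature subset is a list of 0/1 flags, matching the binary-string encoding described in the text, and both operations return a new list rather than modifying the elite subset in place.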
2.4. Correlation Analysis Using D’ Formula
We used the D’ formula to calculate correlation between
pairs of features selected in each generation of the
MBEGA. Because the MBEGA begins with a randomized
feature subset and becomes more selective as the
algorithm progresses, we decided to use only the last 20%
of the feature subsets generated. We calculated the D’
values for every pair of selected features, using the
presence of one feature within the encoded binary sequence
as event i and the presence of the other as event j.
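A minimal Python sketch of this pairwise calculation is given below. The names `pairwise_d_prime` and `keep_fraction`, the string encoding of each subset, and the treatment of D = 0 as a D’ of 0 are assumptions for illustration only.

```python
from itertools import combinations


def d_prime(x_ij, p_i, q_j):
    """D' from the joint frequency x_ij and the marginals p_i and q_j."""
    d = x_ij - p_i * q_j
    if d == 0:
        return 0.0  # assumption: no association is reported as 0
    if d < 0:
        d_max = min(p_i * q_j, (1 - p_i) * (1 - q_j))
    else:
        d_max = min(p_i * (1 - q_j), (1 - p_i) * q_j)
    return d / d_max


def pairwise_d_prime(sequences, keep_fraction=0.2):
    """D' for every feature pair over the last `keep_fraction` of sequences.

    sequences: binary strings such as "0110", in the order generated.
    Returns a dict mapping (i, j) with i < j to the D' of that pair.
    """
    n_keep = max(1, round(len(sequences) * keep_fraction))
    kept = sequences[-n_keep:]
    n = len(kept)
    n_features = len(kept[0])
    freq = [sum(s[k] == "1" for s in kept) / n for k in range(n_features)]
    result = {}
    for i, j in combinations(range(n_features), 2):
        both = sum(s[i] == "1" and s[j] == "1" for s in kept) / n
        result[(i, j)] = d_prime(both, freq[i], freq[j])
    return result
```

Here event i is "feature i is present in the subset", so the marginal frequencies are simply the fraction of kept subsets in which each feature is switched on.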
2.5. Feature Selection Using Weka’s Relief
Algorithm
There are three primary types of feature selection
algorithms: filter, wrapper and embedded algorithms. Filter
algorithms have advantages in their speed and scalability;
however, they ignore feature dependencies. They also do
not interact with classifiers, which is both an advantage,
because they can select features independently, and a
disadvantage, because they are unable to take the
classifier into account when determining the feature subset.
Wrapper algorithms, on the other hand, do interact with
the classifier, and are therefore able to produce more
informative feature subsets. They are also less prone to
local optima. They are, however, computationally
intensive, and have a higher risk of overfitting. Embedded
algorithms are built directly into a classifier. As such,
they are able to interact with the classifier in the same
manner as wrapper algorithms, but are far less
computationally intensive.
Relief, a filter feature selection algorithm implemented
in WEKA, was used to assess the feature pairs
found by our correlation analysis. Relief ranks features
by assigning them weights according to their ability to

Table 1. Highly correlated protein marker pairs in Protein52 based on significant chemosensitivity protein markers. The protein
markers in column one and associated drugs, expressed as their NSC drug numbers, in column two were found in [9]. The remaining
columns are the ten protein markers with the highest correlation to the protein marker in the first column. The markers notated with a
* are those also selected by Weka’s Relief algorithm.
Protein Marker | NSC Drug # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
ISGF3g 56410 MAPK1 300* N E1* K3B DD AT1* AT3 AT5A 6 EPMSNMGSFASTSTSTMSH
ISGF3g 354646 MGMT *
*
8 * MGMT
VIL1 RIPK1EP300 EP300 MSN GSK3B FADD STAT5A MSH2
STAT3 56410 EP300* EP300MSN GSK3BFADD ISGF3GSTAT1* STAT5ASTAT6* MSH2
NME1 353451 FN1 MVP RELA MSN CDH1*MGMT GSK3B FADD ISGF3G*STAT5A
NME1 344007 KRT1EP300MAPK1CDH1GSK3B FADD ISGF3GSTAT3* STAT5A
NME1 102816 TP53 EP300 MGMT* CCNE MAP2K1CDH1* MGMT*GSK3B FADD STAT3*
NME1 107392 KRT8 MSN* CDH1 MGMT*GSK3B FADD ISGF3G STAT3* STAT5AMSH2
MGMT 95466 KRT18*CDH2 EP300 MSN EP300 MSN* CDH1 NME1* GSK3B
ISGF3G
*
CCNE 95441 KRT8* CCNA2
CCNB1VIL1*CDH1 RELA RIPK1 JAK1 MAP2K2STAT5A
EP300 119875 KRT18 EP300 CDH2 KRT20 FN1 MSN CCNB1*JAK1 MAP2K1
MAP2K
2
EP300 606497 EP300 CDH2 KRT20 FN1 KRT8 1 CCNBCCNE RIPK1STAT3*
MAP2K
1
FN1 135758 KRT18 CDH2 KRT20 KRT8
K2
28
* *
1 2*
*
3 K2
B A
0
1
2
*
1
T K1 *
*
1
CCNA2 CCNB1 CCNE VIL1 MAP2K2ISGF3G
MSN 301739 KRT20 MAPK1MCP MCM7CDK6 G22P1 MVP PGR MAP2K2 FADD
MSN 755 KRT18 CDH2 MCM7CDK6* G22P1 MVP PGR MAP2FADD STAT1
MSN 3761PCNA MCP CDK6 G22P1 MVP* PGR CCNEVIL1 CDH1*EP300
PGR 354646 MSN MVP CCNA2CCNB1*CCNE CDH1CASP2 RIPK1EP300 FADD
STAT1 354646 KRT20MAPKG22P1 MVP MAP2KMSN NME1* FADD STAT3 STAT5A
STAT6 354646 EP300 PCNA MAPK1FADDISGF3G STAT3STAT5A MSH2 MSH6 EP300
CASP2 264880 CCNA2CCNE VIL1 CDH1 RELA* RIPK1 JAK1* STAT3 EP300 STAT1
CDH1 71261 MAPK1 CCNE VIL1* RELA* CASP2 EP300 EP300 CDH1 NME1 MSH2*
MCP 740 TP53 EP300 EP300 KRT20 ACVR2*MCM7*CDK6 CCNB1VIL1 EP300
KRT18 1989TP53 EP300 CDH2 EP300 FN1* KRT8 PGR JAK1 MAP2 EP300
KRT18 757 TP53 EP300 RELA STAT3 EP300 CDH1 GSK3STAT5MSH6 EP300
KRT18 3341TP53 EP300 CDH2 EP300 KRT20*RIPK1 MAP2K2MSN MSH6 EP300
KRT18 125973 TP53 EP300 CDH2 EP300*KRT20 CDK6 CCNA2 CCNERELA EP300
KRT18 658831 TP53 EP300 KRT20 FN1* MAPK1 MSN G22P1 JAK1 CDH1 EP300
KRT18 673188 TP53 EP300 CDH2 FN1 GSK3B* FADD*STAT5ASTAT6MSH6 EP300
KRT18 671867 TP53 EP300 CDH2 EP300 CCNB1 CASP2 EP300 MAP2KMSH6 EP300
KRT18 664402 TP53 EP300 CDH2 EP300 MVP PGR JAK1 EP300 STAT6 EP300
KRT18 661746 TP53 EP300 CDH2 EP300 ACVRMCP EP300 MSN CDH1* EP300
KRT18 673187 TP53 EP300 CDH2* ACVR2CDK6 VIL1 STAT6 MSH2MSH6* EP300
KRT18 664404 TP53 EP300 CDH2 KRT20 RB1 MAPKEP300 EP300 STAT5A*EP300
KRT18 671870 TP53 EP300 CDH2 KRT20*FN1 PCNA CDH1 CASP2 STAT6 EP300*
KRT18 666608 TP53 EP300 CDH2 EP300 KRT20 MAPK1CCNB1RIPK1 STAT3 MGMT
KRT18 600222 TP53 EP300*CDH2* RB1 G22P1 PGR CCNA2 MSN STAT3 STAT5A
KRT18 656178 EP300 EP300 EP300 MGMMAPK1MSN G22P1 MAP2 STAT5ASTAT6
TP53 19893 KRT18 CDH2 RB1 MAPK1 EP300 CDK6 MSN STAT6 MSH6 EP300
TP53 125973KRT18* KRT20 FN1*MGMT MAPK1ERBB2 MCM7STAT6 MSH6 EP300
RELA 153353 CDH2* MSN G22P1VIL1 CDH1 CASP2 RIPK1* JAK1 MAP2KMAP2K
2
G22P1 224131 EP300 MAPK1 MSN MVP* CCNA2CCNB1 RIPK1 EP300* MAP2K2 EP300
discriminate between neighboring patterns. In each iteration
of the algorithm, an instance x containing features
(x1,x2,…,xn) is selected randomly, and one nearest
neighbor from the same class (called NH) along with
one nearest neighbor from a different class (called NM)
are found. The weights of the features in x are updated
such that they will be greater if x is similar to the NH
and dissimilar to the NM, and less if the opposite is true.
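The weight update can be illustrated with a small Python sketch of the basic Relief scheme. It is not Weka's implementation: the function name `relief_weights`, the Manhattan distance, and the assumption that features are numeric and scaled to [0, 1] are choices made for the example.

```python
import random


def relief_weights(X, y, n_iterations=100, seed=0):
    """Basic Relief: reward features that differ on the nearest miss (NM)
    and penalize features that differ on the nearest hit (NH)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    weights = [0.0] * n_features

    def distance(a, b):
        return sum(abs(u - v) for u, v in zip(a, b))

    for _ in range(n_iterations):
        idx = rng.randrange(len(X))
        x, label = X[idx], y[idx]
        hits = [X[k] for k in range(len(X)) if k != idx and y[k] == label]
        misses = [X[k] for k in range(len(X)) if y[k] != label]
        if not hits or not misses:
            continue  # no usable neighbor for this instance
        nh = min(hits, key=lambda row: distance(x, row))
        nm = min(misses, key=lambda row: distance(x, row))
        for f in range(n_features):
            weights[f] += (abs(x[f] - nm[f]) - abs(x[f] - nh[f])) / n_iterations
    return weights
```

A feature whose value is close to the NH and far from the NM accumulates a positive weight, which is the behavior described in the paragraph above.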
3. RESULTS AND DISCUSSION
To generate sequences for D’ analysis, we ran a ten-fold
cross validation of the MBEGA on all three datasets.
Each fold of the MBEGA ran for 100 generations, with a
population of 51. Each fold generated 5100 sequences,
of which we used the last 20% generated, or 1020 from
each fold. The final number of sequences used for the D’
analysis was 10200 for each dataset.
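The sequence counts follow directly from the run settings, as this short Python check shows. The variable names are illustrative only.

```python
# Run settings stated in the text.
generations = 100
population = 51
folds = 10
kept_percent = 20  # only the last 20% of each fold is analyzed

sequences_per_fold = generations * population          # one subset per individual per generation
kept_per_fold = sequences_per_fold * kept_percent // 100
total_sequences = kept_per_fold * folds

print(sequences_per_fold, kept_per_fold, total_sequences)
```

The three printed values are 5100, 1020 and 10200, matching the figures quoted above.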
3.1. Correlation Analysis of Individual
Protein Markers from Previous Study
A previous study [9] discovered 18 individual protein
markers from Protein52 along with their functions,
including transcription factor activity, tumor suppression,
DNA repair, cell adhesion, and apoptosis, among others,
that are significant to the prediction of chemosensitivity
to 33 of the 118 anti-cancer agents. These drugs
represent 12 out of the 15 total mechanisms of action present
in the 118 anti-cancer agents, with a large number of
them being tubulin active antimitotic agents. In order to
investigate the protein markers highly correlated with
those found in [9], for each protein marker/drug
combination identified there, we found the ten protein
markers with the highest D’ values. We also sought to validate
these pairs using a ten-fold cross validation of the Relief
feature selection algorithm provided by Weka, which
measures feature significance individually. As seen in
Table 1, the highly correlated protein marker pairs our
analysis discovered are validated not only by the protein
markers reported in [9], but also by Relief.
In order to discover any patterns in the correlation of
the protein markers selected in Table 1, we took a
frequency count of them, as illustrated in Figure 3. While
most protein markers had frequencies within the same
range, mostly between 4 and 10, there were some which
clearly stood out. In particular, the protein marker CDH2,
a cell adhesion protein, is highly correlated with 19 out
of the 40 protein markers in Table 1. CDH2 was not
selected in the previous study, but is very similar in both
function and family to CDH1, which was selected.
Another protein with a high frequency, 16 out of 40, was
TP53, whose function is tumor suppression and apoptosis.
We found that in most occurrences TP53 was paired with
protein marker KRT18. Both of these protein markers
are involved in cell death, and both were found to be
strong chemosensitivity predictors for the drug Taxol in
the previous study [9]. Lastly, we noticed that STAT5A is
both from the same family as and highly correlated
with the protein markers STAT1 and STAT6, both of
which were highlighted in the previous study [9].
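The selection of the ten most correlated partners and the subsequent frequency count can be sketched in Python as below. The function names and the dictionary layout are assumptions, and the D’ values in the usage note are invented purely to show the call pattern; they are not results from this study.

```python
from collections import Counter


def top_partners(pair_scores, marker, k=10):
    """Return the k markers with the highest D' when paired with `marker`.

    pair_scores: dict mapping (marker_a, marker_b) to the D' of that pair.
    Ties are broken alphabetically so the result is deterministic.
    """
    partners = []
    for (a, b), score in pair_scores.items():
        if a == marker:
            partners.append((score, b))
        elif b == marker:
            partners.append((score, a))
    partners.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in partners[:k]]


def selection_frequency(rows):
    """Count how often each marker appears across the rows of a table."""
    return Counter(name for row in rows for name in row)
```

For example, with the made-up scores `{("KRT18", "TP53"): 0.9, ("KRT18", "CDH2"): 0.8, ("TP53", "CDH2"): 0.7}`, calling `top_partners` for KRT18 and for TP53 with `k=2` and passing both rows to `selection_frequency` counts CDH2 twice, which is the kind of tally plotted in Figure 3.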
We were also interested in observing how the func-
tions of the individual protein markers from [9] corre-
lated with the functions of the highly correlated protein
markers found in Table 1. We grouped the previously
reported protein markers according to their function, and
then selected the protein markers that were most fre-
quently correlated with them. We only included those
protein functions which had three or more protein mark-
ers associated with them, as in Table 2.
Because we used two protein datasets in this study, we
wanted to conduct the same analysis on the Protein162
dataset in order to explore the possibility of discovering
new protein markers highly correlated with those found
in [9]. All but 2 of the previously reported protein mark-
ers from Protein52, G22P1 and CCNE, were also present
in Protein162, so we used the same protein marker/drug
combinations as in Table 1 when generating Table 3 for
Protein162.
We also created a selection frequency histogram for
Figure 3. Frequency of protein52 protein markers present in Table 1.
Table 2. Correlation of the functions of protein markers in Table 1.
Protein Function | Reported Protein Markers | Correlated Protein Markers | Functions of Correlated Protein Markers
Transcriptional Factor | ISGF3G | STAT5A | Transcriptional Factor
 | STAT3 | FADD | Apoptosis
 | EP300 | |
 | STAT1 | |
 | STAT6 | |
 | RELA | |
Integrin Signaling | NME1 | STAT6 | Transcriptional Factor, Interferon Signaling
 | EP300 | MSH6 | DNA Repair
 | TP53 | KRT18 | Structural Protein; Biomarker of cell death
 | | FADD | Apoptosis
Tumor Suppressors | CCNE | FADD | Apoptosis
 | TP53 | FN1 | Cell Adhesion; Integrin Signaling
 | RELA | GSK3B | Hormonal Control
 | | KRT18 | Structural Protein; Biomarker of cell death
Cell Apoptosis | CASP2 | CDH2 | Cell Adhesion
 | KRT18 | KRT20 | Structural Protein
 | TP53 | MSH6 | DNA Repair
 | RELA | TP53 | Tumor Suppressor; Cell Cycle and Apoptosis
Table 3. Highly correlated protein marker pairs in Protein162 based on significant chemosensitivity protein markers. The protein
markers in column one and associated drugs, expressed as their NSC drug numbers, in column two were found in [9]. The remaining
columns are the ten protein markers with the highest correlation to the protein marker in the first column. The protein markers G22P1
and CCNE present in Table 1 are excluded here because neither is present in Protein162. The markers notated with a * are those also
selected by Weka’s Relief algorithm.
Protein Marker | NSC Drug # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
ISGF3g 56410 ANXA1 SP7 K 300 EP300 EP300 K1 LA K1 3 CACREPJARERIPTP5
ISGF3g 354646 AKAP5
CCNA2
AP2M1
CDH1
CDC2
CDKN2
* 1
* 2A
6*
1
4
1
1
9
1*
I1
1
1
D
2
MSN
HRAS
PRKCI
KRT18
RIPK1*
MAPK1
SMARCB
MVP
FASLG TP5
STAT1*
3
STAT3
VIL2
VASP
STAT3 56410 A
EP300
*
IRS1
NME1 353451 ANXA4 EP300 FN1 GTF2B*MGMT NCAM1PCNA
PRKCI
TP53
NME1 344007
102816
CASP7 CASP7 CDH2 ENAH EP300 HSPA4 JAK1 MCC TP53
NME1 CASP2 CASP7 EP300 FADD
EP300
ISGF3G MAP2K1
NCAM1
MGMT MSN NCAM1TUBB
NME1 107392 ANXA1 CASP7 EP300 MAP2K2
MGMT*
PCNA
RB1
PRKCARB1 RELA
MGMT 95466 ACVR2A BCAR1 EP300 FADD STAT1 TP53 TP53 YWHAG
EP300 119875 PARP1 CTNNB1* EP300 GRB2* JAK1 MCC MSN RIPK1 STATTP53
EP300 606497 ADNP
135758
ATXN2*EP300 EP300 GRB2 JAK1
EP300 GSTP1
MCP* MSH6
PCNA
STAT3*
STAT6
TP53
FN1 AKAP8 CDK4 CTTN EP300 EP300 FASLG
MSN 301739 ADNP CDH1 EP300 JAK1 MAP2K2MSN* MVP RIPK1
SMARCB1 STAT3
MSN 755 CDK5 CTNNBEP300 ISGF3GJAK1 KRT18 MAPK1MCC RB1 RIPK1
MSN 376128
PARP1 CDC2 CDH2 EP300 EP300 MGMT MVP*
MSN
PTPN6 RB1* TP53
PGR 354646
354646
CDH2 ENAH EP300 EP300 EP300 EP300 RELA EXOCTP53
STAT1 CASP2 EP300 EP300 EP300 FADD MCM7 MSH6*PRKCB1RELA TYR
STAT6 354646 PARP1 CDK4* CDK7 CRK ENAH EP300 MSH6*
FADD*
RB1 RELA STAT3
CASP2 264880 CASP7 CTNNB1*
KRT19
DSG1 EP300 EP300 EP300 ISGF3G KRT7* PCNA
TRADD
CDH1 71261 FN1 MGMTPTPN1RELA* RELA RELA STAT6
MCM7
TP53
MCP 740 CASP7 DSG1 EP300 ESR1 FADD KRT19 MAP2K2RB1* SMARCB1
STAT5A
KRT18 19893 PARP1
CDK4
PARP1 CDKN2A EP300 MCM7* MSN PRKCA
MCP
RB1 RELA
KRT18
KRT18
757 ENAH EP300 EP300 EP300 KRT19* SMARCB STAT5A STAT6
33410 AKAP8 EP300 EP300 EP300 EP300 JAK1 KLK3 KRT1 STAT1 VASP
KRT18 125973
658831
CDC2 CDK5 GSTP1 IRS1 KRT19*TP53 TP53 TP53 TP53 VIL2
KRT18
673188
A
TXN2 CCNA2 IRS1 MCM7*MSH2*
KLK3
EXOC4 SMARCB TP53 TP53 TRADD
KRT18 AKAP5
ADNP
CDK5 EP300 JAK1 KRT19*MAP2K2
MAP2K1
RELA TGFB1VASP
KRT18 671867 BCAR1 CCNA2
EP300
CDC2 EP300 KLK3 MAPKMVP STAT3
KRT18 664402 ADNP CCNA2
CCNB1
EP300 KRT19*KRT7 PTPN11 RELA RELA STAT6
KRT18 661746 AP2M1 CDH2 DSG1 EP300 KRT19 PRSS8 RELA STAT5A VASP
KRT18 673187 CCNA2
CDK6
EP300 EP300 ERBB2FADD XRCC6MGMT MVP* STAT3 TP53
KRT18 664404 EP300 ISGF3G KLK3 MCC MSH6 PRKCB1STAT1 STAT6 FASLG
KRT18 671870 CASP2 CCNB1 CDH2 EP300 MAP2K1MCP MLH1 PCNA RELA EXOC4
VIL1
KRT18 666608 CDH1*
ADNP
CDK7 ENAH EP300 FADD GSTP1 KLK3 KRT20 PRSS8*
KRT18 600222 AKAP5 AKAP8CDH2 EP300* GSTP1*
EXOC4
KLK3 MAP2KPTPN11
TP53
TYR
KRT18 656178 CDH2 EP300 GTF2B KRT19 KRT8* TP53 TP53 TRAD
TP53 19893 CASP7 EP300 EP300 ESR1 GTF2B KRT8 MAP2KSTAT1 STAT1 STAT6
TP53 125973 CASP7 CCNA2CDH1 EP300 JAK1 KLK3 KRT19* PRKCASTAT1
PTPN11
TP53
RELA 153353ADNP CASP7 CDK7 EP300 EP300 EP300 PRKCI PRSS8 STAT3
Table 3, illustrated in Figure 4. We observed that the
frequencies were lower by roughly a factor of 2 for this
dataset when compared to Protein52. We believe this is
because the number of unique protein markers in
Protein162 was roughly twice that of the unique protein
markers in Protein52; however, we chose the top ten
most correlated pairs in both instances.
Many of the most selected protein markers from
Table 1, including CDH2, TP53, and STAT5A, had only an
average or even low frequency in Table 3. We selected 8
protein markers from Table 3 whose average frequencies
were above 4. These were KRT19, KLK3, CCNA2,
ADNP, MVP, RIPK1, SMARCB1, and ENAH. Because
the Protein162 dataset contains protein markers not
present in Protein52, we found 5 protein markers which
were not reported in the previous study [9]. These
proteins, as well as their cellular functions and associated
drugs, can be seen in Table 4. The most frequently
selected protein marker from Table 3 was KRT19, a
structural protein from the same family as KRT18, a protein
marker found to be significant in [9], and KRT20, a
protein marker frequently selected in Table 1. KLK3 had
the second highest frequency of selection. Serum levels
of the KLK3 protein, called PSA, are used to diagnose
and monitor prostatic carcinoma. Members of the KLK
family are also thought to be biomarkers for cancers and
diseases. CCNA2 has a functional relationship with
CDC2, another protein marker with an above-average
selection frequency in Table 3. ADNP affects both
normal cell growth and cancer proliferation. In addition,
ADNP is a transcription factor, a trait held in common
with six of the eighteen significant protein markers in [9].
MVP is a protein which is over-expressed in multi-drug
resistant cancer cells, and is potentially useful as a signal
for drug resistance. MVP also bears a functional relation
with STAT1, one of the important protein markers
reported in [9]. RIPK1 is an apoptosis protein related to
cell death, much like the KRT18, TP53 and CASP2
found in both the previous study [9] and in Table 1.
SMARCB1 functions as a tumor suppressor, but
mutations within the protein are associated with rhabdoid
tumors. ENAH is a cell adhesion protein which is
present in some breast cancers, and may be used as a
marker for such.
Figure 4. Frequency of Protein162 protein markers present in Table 3. This plot shows the frequency with which protein
markers from the Protein162 dataset were selected in Table 3. Only those protein markers with an average frequency
above 2 are shown due to limited space.

Table 4. Highly correlated protein markers for the evaluated anticancer drugs. Protein markers denoted with a * were unique
to the Protein162 dataset, and as such not reported in the previous study [9].
Protein Marker | Function | NSC Drug Numbers
KRT19* | Structural Protein | 71261, 740, 757, 33410, 125973, 673188, 664402, 661746, 656178
KLK3* | Biomarker of Prostatic Carcinoma | 33410, 673188, 671867, 664404, 666608, 600222, 125973
ADNP* | Cell Growth, Cancer Proliferation, Transcription Factor | 125973, 301739, 671867, 664402, 600222, 153353
CCNA2 | Binding & Activating Agent | 56410, 658831, 671867, 664402, 673187, 125973
MVP | Mediating Drug Resistance, Over-expressed in multi-drug resistance cancer cells | 56410, 301739, 376128, 671867, 673187
RIPK1 | Apoptosis Protein | 56410, 354646, 119875, 301739, 755
SMARCB1* | Tumor Suppressor | 354646, 301739, 740, 757, 658831
ENAH* | Cell Adhesion Protein | 344007, 354646, 354646, 757, 666608
Figure 5. D’ patterns categorized according to mechanism of action. The drugs which alkylate at N7 and O6 produce
similar patterns of D’ values in each of the three datasets. For readability, the curves displayed are moving average
curves with a period of 20 for Protein162 and DNA166 and a period of 7 for Protein52.

3.2. Feature Correlation Analysis of All 118
Drugs
In addition to simply calculating the D’ values of the
feature pairs, we also calculated the average D’ values
for each feature based on the D’ values of all pairs
associated with that feature. We performed this analysis for
both protein datasets as well as the DNA166 dataset,
using the protein markers as features for the protein
datasets, and the DNA copy number variations as
features for the DNA166 dataset. As an exploratory analysis,
we attempted to use the average D’ values to find the
trends of feature correlation within and between the
mechanisms of action of the 118 anti-cancer agents.
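The per-feature averaging, together with the moving average used to smooth the plotted curves, can be sketched in Python as follows. The function names and the dictionary layout of the pairwise D’ values are assumptions for the example, and the simple trailing window is only one way to compute a moving average with a given period.

```python
def average_d_prime_per_feature(pair_scores, n_features):
    """Average, for each feature, the D' of every pair it belongs to.

    pair_scores: dict mapping (i, j) feature index pairs to D' values.
    """
    totals = [0.0] * n_features
    counts = [0] * n_features
    for (i, j), value in pair_scores.items():
        for feature in (i, j):
            totals[feature] += value
            counts[feature] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def moving_average(values, period):
    """Simple moving average; returns len(values) - period + 1 points."""
    return [
        sum(values[k:k + period]) / period
        for k in range(len(values) - period + 1)
    ]
```

Averaging one drug's curve with those of the other drugs sharing its mechanism, position by position, would then give a per-mechanism curve of the kind compared in this section.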
Each dataset generated unique patterns of feature cor-
relation in each of the 118 drugs. We did observe, how-
er, similar patterns of feature correlation in drugs with
poisomerase I inhibitors and topoisomerase II inhibitors
have very similar trends of feature correlation to one
another in all three datasets. Drugs which alkylate at
positions N7 (24 drugs) and O6 (7 drugs) of guanine also
have very similar trends of feature correlation, as shown
in Figure 5. This implies that related drug mechanisms
tend to produce similar patterns of correlation between
feature pairs. Our analysis indicated that this is not nec-
essarily true of drugs with similar chemical structure.
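The per-feature averaging described above, and the moving-average smoothing mentioned in the caption of Figure 5, can be sketched as follows. This is a minimal illustration, assuming the pairwise D’ values for one drug are already available as a symmetric matrix; the function names and the toy values are hypothetical and are not taken from the study.

```python
import numpy as np

def average_d_prime_per_feature(d_prime):
    """Mean D' of each feature over all pairs that include it.

    d_prime: symmetric (n, n) array of pairwise D' values. The diagonal
    (a feature paired with itself) is ignored. Assumed input layout.
    """
    d = np.asarray(d_prime, dtype=float)
    n = d.shape[0]
    off_diagonal = ~np.eye(n, dtype=bool)
    # Sum each row over the other n - 1 features, then divide by n - 1.
    return np.where(off_diagonal, d, 0.0).sum(axis=1) / (n - 1)

def moving_average(values, period):
    """Simple moving average; returns len(values) - period + 1 points."""
    kernel = np.ones(period) / period
    return np.convolve(values, kernel, mode="valid")

# Toy 4-feature example with made-up D' values (not data from the paper).
toy = np.array([
    [0.0, 0.2, -0.1, 0.3],
    [0.2, 0.0, 0.4, -0.2],
    [-0.1, 0.4, 0.0, 0.1],
    [0.3, -0.2, 0.1, 0.0],
])
per_feature = average_d_prime_per_feature(toy)
smoothed = moving_average(per_feature, period=2)
```

With real data, a period of 20 (or 7 for the smaller Protein52 dataset) would play the role of `period=2` in the toy call.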
We also grouped drugs with similar mechanisms into
three larger categories: drugs which alkylate at specific
positions of guanine (Alkylating), drugs which inhibit to-
poisomerase (TIM Inhibitors), and all other drugs
(Other). The D’ values of these larger categories were
generated by averaging the D’ values of the individual
drugs within that larger category. We found the correlation
trends of these three categories to be different for all
three datasets.
Figure 6. D’ patterns according to major mechanism of action categories. Each dataset yields different levels of D’
values between the three categories. The values produced by Protein52 are very similar in all three mechanism categories,
whereas TIM Inhibitor values for Protein162 are distinct from the others. All three categories produce distinct values in
DNA166. For readability, the plots of Protein162 and DNA166 are moving averages with a period of 20, whereas the
plot of Protein52 shows the curve without a moving average.
In Protein52, we observed that while each of the larger
categories carries unique drugs and drug mechanisms,
the averaged D’ values of all three of these
categories were very similar, with the averages being 0.053
for the Alkylating category, 0.054 for the TIM Inhibitor
category, and 0.06 for the Other category. All averaged
D’ values were positive.
The same analysis of the Protein162 dataset revealed
that while the averaged D’ values of the Alkylating and
the Other categories were very similar, with average D’
values –0.0348 and –0.0345 respectively, the TIM
Inhibitors category was quite distinct with an average D’
value of –0.044. All averaged D’ values were negative.
For DNA166, all three curves have distinct averaged
D’ values, with the Alkylating category having an
average D’ of –0.0382, the Other category an average of
–0.0286 and the TIM Inhibitors category an average of
–0.0240. Again, all D’ values were negative.
These plots, available in Figure 6, illustrate the
benefit of using multiple datasets in this type of study.
While all of our datasets are based on the NCI-60, they
each provide unique insight into physical responses of
the cell lines to the 118 anti-cancer drugs. If we were
only using the DNA data, we might be tempted to claim
that these three mechanism categories produce distinctly
different D’ values, whereas if we were only using the
data from Protein52, we might claim the opposite. It is
only when a wide range of data are used in a study that a
holistic understanding of the effects of these 118 drugs
becomes possible.
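The category-level comparison discussed above (averaging the D’ values of the individual drugs within each larger category, then comparing the overall averages) can be sketched as follows. This is only an illustration under stated assumptions: each drug is assumed to have a vector of per-feature average D’ values, and the drug names, category assignments, and numbers below are hypothetical rather than taken from the study.

```python
import numpy as np

def category_profiles(drug_profiles, drug_category):
    """Average the per-feature D' profiles of the drugs in each category."""
    grouped = {}
    for drug, profile in drug_profiles.items():
        grouped.setdefault(drug_category[drug], []).append(profile)
    # Stack the profiles of one category row-wise and average per feature.
    return {cat: np.mean(np.vstack(rows), axis=0) for cat, rows in grouped.items()}

# Hypothetical per-feature average D' values for three made-up drugs.
drug_profiles = {
    "drug_a": np.array([0.10, -0.02, 0.04]),
    "drug_b": np.array([0.06, 0.02, 0.00]),
    "drug_c": np.array([-0.05, -0.03, -0.04]),
}
drug_category = {
    "drug_a": "Alkylating",
    "drug_b": "Alkylating",
    "drug_c": "TIM Inhibitors",
}

profiles = category_profiles(drug_profiles, drug_category)
# One summary number per category, comparable across datasets.
overall = {cat: float(p.mean()) for cat, p in profiles.items()}
```

Running the same function once per dataset would give one set of category averages for each of Protein52, Protein162, and DNA166, which is the kind of side-by-side comparison the text describes.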
4. CONCLUSIONS
We found that each of our datasets provides a unique
insight into the analysis of feature correlations in the
study of the chemosensitivity of the NCI-60 cancer lines.
Two of the three (Protein52, DNA166) have been used
in previous studies for the prediction of chemosensitivity
of cancerous cells, and though Protein162 was novel to
this topic, we have shown that both Protein162 and
Protein52 contain a number of protein markers that are both
medically significant and highly correlated to individual
protein markers highlighted in a previous study [9].
In addition, we have shown D’ to be an accurate
measure of correlation in the context of feature selection
for the first time.
5. ACKNOWLEDGEMENTS
We would like to thank Houghton College for its financial support.
REFERENCES
[1] Staunton, J. E., Slonim, D. K., Coller, H. A., Tamayo, P.,
Angelo, M. J., Park, J., Scherf, U., Lee, J. K., Reinhold,
W. O., Weinstein, J. N., Mesirov, J. P., Lander, E. S., and
Golub, T. R., (2001) Chemosensitivity prediction by
transcriptional profiling, PNAS, 98, 10787-10792.
[2] Potti, A., Dressman, H. K., Bild, A., Riedel, R. F., Chan,
G., Sayer, R., Cragun, J., Cottrill, H., Kelley, M. J.,
Petersen, R., Harpole, D., Marks, J., Berchuck, A.,
Ginsburg, G. S., Febbo, P., Lancaster, J., and Nevins, J. R.,
(2006) Genomic signatures to guide the use of
chemotherapeutics, Nature Medicine, 12, 1294-1300.
[3] Paweletz, C. P., Charboneau, L., Bichsel, V. E., Simone,
N. L., Chen, T., Gillespie, J. W., Emmert-Buck, M. R.,
Roth, M. J., Petricoin, E. F., and Liotta, L. A., (2001)
Reverse phase protein microarrays which capture disease
progression show activation of prosurvival pathways at
the cancer invasion front, Oncogene, 20, 1981-1989.
[4] Lee, J. K., Bussey, K. J., Gwadry, F. G., Reinhold, W.,
Riddick, G., Pelletier, S. L., Nishizuka, S., Szakacs, G.,
Annereau, J., Shankavaram, U., Lababidi, S., Smith, L. H.,
Gottesman, M. M., and Weinstein, J. N., (2003)
Comparing cDNA and oligonucleotide array data:
Concordance of gene expression across platforms for the
NCI-60 cancer cells, Genome Biol., 4, R82.
[5] Ross, D. T., Scherf, U., Eisen, M. B., Perou, C. M., Rees,
C., Spellman, P., Iyer, V., Jeffrey, S. S., Van de Rijn, M.,
Waltham, M., Pergamenschikov, A., Lee, J. C. F.,
Lashkari, D., Shalon, D., Myers, T. G., Weinstein, J. N.,
Botstein, D., and Brown, P. O., (2000) Systematic variation
in gene expression patterns in human cancer cell lines,
Nat. Genet., 24, 227-235.
[6] Shankavaram, U. T., Reinhold, W. C., Nishizuka, S.,
Major, S., Morita, D., Chary, K. K., Reimers, M. A.,
Scherf, U., Kahn, A., Dolginow, D., Cossman, J.,
Kaldjian, E. P., Scudiero, D. A., Petricoin, E., Liotta, L., Lee,
J. K., and Weinstein, J. N., (2007) Transcript and protein
expression profiles of the NCI-60 cancer cell panel: an
integromic microarray study, Mol. Cancer Ther., 6,
820-832.
[7] Ma, Y., Ding, Z., Qian, Y., Shi, X., Castranova, V.,
Harner, E. J., and Guo, L., (2006) Predicting cancer drug
response by proteomic profiling, Clin. Cancer Res., 12,
4583-4589.
[8] Scherf, U., Ross, D. T., Waltham, M., Smith, L. H., Lee, J.
K., Tanabe, L., Kohn, K. W., Reinhold, W. C., Myers, T.
G., Andrews, D. T., Scudiero, D. A., Eisen, M. B.,
Sausville, E. A., Pommier, Y., Botstein, D., Brown, P. O., and
Weinstein, J. N., (2000) A gene expression database for
the molecular pharmacology of cancer, Nat. Genet., 24,
236-244.
[9] Ma, Y., Ding, Z., Qian, Y., Wan, Y., Tosun, K., Shi, X.,
Castranova, V., Harner, E. J., and Guo, N. L., (2009) An
integrative genomic and proteomic approach to
chemosensitivity prediction, International Journal of Oncology,
34, 107-115.
[10] Bussey, K. J., Chin, K., Lababidi, S., Reimers, M.,
Reinhold, W. C., Kuo, W., Gwadry, F., Ajay, Kouros-Mehr, H.,
Fridlyand, J., Jain, A., Collins, C., Nishizuka, S., Tonon,
G., Roschke, A., Gehlhaus, K., Kirsch, I., Scudiero, D.
A., Gray, J. W., and Weinstein, J. N., (2006) Integrating
data on DNA copy number with gene expression levels
and drug sensitivities in the NCI-60 cell line panel, Mol.
Cancer Ther., 5, 853-867.
[11] Saeys, Y., Inza, I., and Larranaga, P., (2007) A review of
feature selection techniques in bioinformatics,
Bioinformatics, 23, 2507-2517.
[12] Lewontin, R. C., (1964) The interaction of selection and
linkage, I. General considerations, Heterotic Models,
Genetics, 49, 49-67.
[13] Nishizuka, S., Charboneau, L., Young, L., Major, S.,
Reinhold, W. C., Waltham, M., Kouros-Mehr, H., Bussey,
K. J., Lee, J. K., Espina, V., Munson, P. J., Petricoin, E.,
Liotta, L. A., and Weinstein, J. N., (2003) Proteomic
profiling of the NCI-60 cancer cell lines using new
high-density reverse-phase lysate microarrays, Proc. Natl.
Acad. Sci., USA, 100, 14229-14234.
[14] Hedrick, P. W., (1987) Gametic disequilibrium measures:
proceed with caution, Genetics, 117, 331-341.
[15] Leardi, R., Boggia, R., and Terrile, M., (2005) Genetic
algorithms as a strategy for feature selection, Journal of
Chemometrics, 6, 267-281.
[16] Zhu, Z., Ong, Y., and Dash, M., (2007) Markov
blanket-embedded genetic algorithm for gene selection,
Pattern Recognition, 40, 3236-3248.