Engineering, 2013, 5, 433-439
http://dx.doi.org/10.4236/eng.2013.510B089 Published Online October 2013 (http://www.scirp.org/journal/eng)
Copyright © 2013 SciRes. ENG
A Localized-Statistic-Based Approach for Biomarker
Identification of Omics Data
Kuan Zhang, He Chen, Yongtao Li
Beijing Aerospace Control Center, Beijing, China
Email: zhangkua@mail.ustc.edu.cn, chenhe_mail@netease.com, xlslyt@sina.com
Received 2013
ABSTRACT
Omics data provide an essential means for molecular biology and systems biology to capture the systematic properties of the inner activities of cells. One of the greatest challenges biological researchers face is finding methods for discovering biomarkers that track the progress of diseases such as cancer. Feature selection methods have therefore been widely used for biomarker discovery. However, omics data usually contain a large number of features but a small number of samples, and some omics data are distributed over a large range, which makes it difficult for feature selection methods to deal with them. To overcome these problems, we present a computing method called localized statistic of abundance distribution based on Gaussian window (LSADBGW) to test the significance of a feature. Experiments on three datasets, including gene and protein datasets, showed the accuracy and efficiency of LSADBGW for feature selection.
Keywords: Protein-Omics Data; Biomarker Selection; Localized Statistic; Gaussian Window
1. Introduction
With the advent of high-throughput measurement techniques such as transcriptome profiling by microarray and proteome profiling by mass spectrometry, omics data, which provide a comprehensive analysis of a specific layer in a cellular system and are emerging as essential methodological approaches for molecular biology and systems biology, have accumulated rapidly and make it possible to capture an entire snapshot of cell-wide activity [1,2]. The increase in data acquisition has led to a demand for practical and effective data mining methods for in silico analysis. One of the greatest challenges biological researchers face is finding methods for discovering biomarkers that track the progress of diseases such as cancer [3,4], as biomarker selection can be viewed as a major bottleneck of supervised learning and data mining on omics data [5,6].
Feature selection approaches, which aim to find a set of features that best discriminate biological samples of different types, have been widely applied to biomarker discovery [3,4,7-9]. The selected features are “biomarkers”, and together they form a “marker panel” for analysis. The fold change and the p-value are two commonly used criteria for selecting features that are differentially expressed under two experimental conditions. In the fold-change method, a feature is viewed as a “biomarker” if the ratio of the expression levels between the two classes exceeds a certain threshold, e.g., a 2-fold change. Ranking by p-value is an alternative approach to feature selection. The p-value is the probability, under the null hypothesis that an individual feature does not differ between the two conditions, of obtaining a test statistic at least as extreme as the one observed. A variety of statistical tests, including the two-sample t test [10-16], the χ² test [10,17], one-way analysis of variance [18,19], the Wilcoxon signed-rank test [20-23] and the Mann-Whitney test [23], have been used to obtain p-values. Although these approaches have achieved great success in selecting biomarkers, omics data remain difficult to handle. Omics datasets are almost always small-sample datasets, because the number of features greatly exceeds the number of samples. As a result, p-value methods based on statistical tests sometimes fail on omics data; for example, if each class contains only one sample, the statistical tests cannot be applied. Moreover, [24] indicates that some omics data are distributed over a large range, so applying the same criterion to data in different ranges, which is the strategy of the fold-change approach, is inappropriate; for example, the significance of a 2-fold change from 2 to 1 is not equal to the significance of a 2-fold change from 20,000 to 10,000.
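The range problem described above can be made concrete with a small worked example. The sketch below is illustrative only: it assumes, purely for the sake of the arithmetic, that the measurements are counts with Poisson-like noise (variance equal to the count), which is an assumption of this example and not a claim made by the paper. The function names are hypothetical.

```python
import math


def fold_change(a: float, b: float) -> float:
    """Ratio of the larger to the smaller expression level."""
    return max(a, b) / min(a, b)


def counting_z(a: float, b: float) -> float:
    """Difference expressed in units of its standard deviation.

    ASSUMPTION (not from the paper): both values are Poisson-like counts,
    so the variance of their difference is approximately a + b.
    """
    return abs(a - b) / math.sqrt(a + b)


# Both pairs show the same 2-fold change, yet the difference is far more
# distinguishable from counting noise at high abundance.
for pair in [(2, 1), (20000, 10000)]:
    print(pair, "fold change:", fold_change(*pair),
          "noise-scaled difference:", round(counting_z(*pair), 2))
```

Under this assumed noise model a single fold-change threshold treats the two cases identically although their noise-scaled differences differ by two orders of magnitude, which is the motivation for evaluating significance locally.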
To overcome the large-range problem, [24] developed a computing method called Localized Statistics of Protein Abundance Distribution (LSPAD) to evaluate the statistical significance of protein-abundance bias between two classes, in which the differential significance of a particular protein is calculated through its local protein-abundance distribution window rather than through the whole distribution range from the lowest to the highest protein abundance. In fact, LSPAD shows good performance even when each class contains only one sample, as validated in [24]. However, LSPAD is under-utilized in practice, and it has two shortcomings. The first is that its strategy for selecting the local distribution window is too coarse: it postulates a local window width of 33%, i.e., only neighboring proteins within 33% of the A-axis around a particular protein are used for the calculation. The second is that LSPAD employs Fisher's exact test to check statistical significance. Fisher's exact test is a statistical significance test used in the analysis of contingency tables where sample sizes are small. However, if the data are floating-point values, a rounding operation must be performed first, which may make Fisher's exact test unsuitable for such omics data, and Fisher's exact test can be time-consuming when the sample sizes are large.
In this study we present a computing method called localized statistic of abundance distribution based on Gaussian window (LSADBGW), which also employs the localized statistic strategy used by LSPAD but uses a Gaussian window as the local abundance distribution window and a simpler and more general statistical approach to test the significance of a feature. With the Gaussian window, the selection of the local abundance distribution window is more reasonable. Moreover, LSADBGW can deal with both integer and floating-point data, which extends its application range compared with LSPAD. Experiments on three datasets, including gene and protein datasets, and a comparison with LSPAD show the accuracy and efficiency of LSADBGW for feature selection.
In summary, our contributions are: 1) We extend the application range of the localized statistic strategy to all omics data, whereas LSPAD is oriented only towards protein tandem mass spectrometry data processed by SEQUEST [25]; 2) We propose a new strategy for selecting the local abundance distribution window that employs a Gaussian window, which is more reasonable than the fixed-width window of LSPAD; 3) We propose a simpler but more effective statistical test in place of the Fisher's exact test used in LSPAD. The rest of the paper is organized as follows. A brief note on LSPAD is given in Section 2. Our method is presented in Section 3. The datasets and experiments are given in Section 4, together with the experimental results and their discussion. Finally, Section 5 concludes.
2. Related Work
The concept of a localized statistic for feature selection on omics data was first proposed by [24], in which human serum from non-diabetic and diabetic cohorts was analyzed by a proteomic approach. To analyze a total of 1377 high-confidence serum proteins, the authors developed a computing strategy called localized statistics of protein abundance distribution (LSPAD) to calculate the significance of the bias of a particular protein's abundance between the two cohorts.
The LSPAD method can be divided into two steps. First, since the peptide spectral counts of the identified serum proteins were spread over a range of 10^5, the authors developed M-A plotting, borrowed from microarray analysis, to display the relative abundance distribution of each protein. The M and A values are defined as follows:
$$
\begin{aligned}
A &= (Y_1 + Y_2)/2 \\
M &= Y_1 - Y_2 \\
Y_1 &= \log_2 (X_1 + 1) \\
Y_2 &= \log_2 (X_2 + 1)
\end{aligned}
\tag{1}
$$
where X1 and X2 respectively represent the peptide spectral counts in diabetic serum and in non-diabetic serum, M represents the differential protein abundance between diabetic and non-diabetic serum, and A represents the average protein abundance.
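As a minimal sketch of Equation (1), the M and A values of a single feature can be computed as follows. The function name `ma_values` is chosen here for illustration and does not come from [24].

```python
import math


def ma_values(x1: float, x2: float) -> tuple[float, float]:
    """Return (M, A) for one feature from its abundances x1 and x2 in the two classes."""
    y1 = math.log2(x1 + 1)  # the +1 keeps zero counts finite
    y2 = math.log2(x2 + 1)
    m = y1 - y2             # differential abundance
    a = (y1 + y2) / 2       # average abundance
    return m, a


# Spectral counts 7 and 1 give Y1 = log2(8) = 3 and Y2 = log2(2) = 1,
# hence M = 2 and A = 2.
print(ma_values(7, 1))
```

A feature with equal abundance in both classes has M = 0 regardless of its abundance level, so M carries the class difference while A carries the position along the abundance axis.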
Second, the differential significance of a particular protein is calculated from the proteins that fall into its local protein-abundance distribution window, using Fisher's exact test. [24] postulates a local window width of 33% of the A-axis.
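The fixed-width window can be sketched as below. This is only one reading of the description: it assumes the window is centred on the tested protein and that its total width is 33% of the observed A range. Both assumptions, and the name `rectangular_window`, are made here for illustration and are not taken from [24].

```python
def rectangular_window(a_values, index, width_fraction=0.33):
    """Indices of features whose A value lies inside the local window.

    ASSUMPTIONS (illustrative): the window is centred on a_values[index]
    and its total width is width_fraction times the full A range.
    """
    span = max(a_values) - min(a_values)
    half_width = 0.5 * width_fraction * span
    centre = a_values[index]
    return [j for j, a in enumerate(a_values) if abs(a - centre) <= half_width]


a_axis = [0.0, 1.0, 2.0, 5.0, 10.0]
print(rectangular_window(a_axis, 1))  # neighbours of the feature with A = 1.0
print(rectangular_window(a_axis, 4))  # neighbours of the feature with A = 10.0
```

Every feature inside the window counts equally and every feature outside it is ignored, so the result depends on the chosen width fraction.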
3. Method
To overcome the under-utilization of LSPAD in practice and its poorly justified window selection strategy, we propose a more practical and reasonable method for selecting significant features, called localized statistic of abundance distribution based on Gaussian window (LSADBGW). The M value defined in Equation (1) can itself be employed as the statistic, whereas LSPAD ignores it. Because of its generality and simplicity, the normal distribution has been widely used in various areas, including omics data such as gene expression data [26]. We therefore use a Gaussian window in LSADBGW instead of the local window used in LSPAD.
3.1. The Significance Test Using the M Value
We assume that the M value follows a normal distribution; Figure 1 suggests that this assumption is reasonable.
Figure 1. The M value distributions. (a) The M value distribution of the serum SELDI MS data (Ovarian, 07 August 2002); (b) the M value distribution of Ewing sarcoma and rhabdomyosarcoma in the small round blue cell tumors dataset, which is a DNA microarray dataset.
Assuming a Gaussian distribution, the significance of a feature is given by
$$
S = \frac{M_{data} - \langle M_W \rangle}{\sigma_W}
\tag{2}
$$
where S represents the significance value, M_data represents the M value of the feature being tested, <M_W> represents the mean of the M values that fall in the statistical window, and σ_W represents the standard deviation of the M values that fall in the statistical window.
After S is obtained from Equation (2), the significance can be read off from it; e.g., |S| ≥ 2.6 can be treated as significant at the 99% level, assuming a Gaussian distribution.
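Equation (2) is a z-score of the tested M value against the M values in its statistical window. The following sketch computes S and the corresponding two-sided Gaussian tail probability. The function names are illustrative, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not state which estimator is used.

```python
from statistics import NormalDist, mean, stdev


def significance(m_data: float, window_m_values: list[float]) -> float:
    """S = (M_data - <M_W>) / sigma_W for the M values in the statistical window."""
    mu_w = mean(window_m_values)
    sigma_w = stdev(window_m_values)  # ASSUMPTION: sample standard deviation
    return (m_data - mu_w) / sigma_w


def two_sided_p(s: float) -> float:
    """Probability that a standard normal variable exceeds |s| in absolute value."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(s)))


window = [-1.0, 0.0, 1.0]          # toy window with mean 0 and standard deviation 1
s = significance(2.6, window)
print(s, two_sided_p(s))           # the tail probability falls below 0.01
```

The exact two-sided 99% point of the standard normal distribution is about 2.576, so the threshold of 2.6 quoted in the text is a rounded value of that quantile.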
3.2. The Gaussian Window
Since the significance calculation for a particular differential feature should be localized to a certain range of related abundance levels [24], the selection of an appropriate local abundance distribution window plays an important role in the localized statistics method. However, choosing a local window that suits all kinds of data distributions and ensures that all the data falling into it are in the same range is difficult or impossible, because the concept of “the same range” is ill-defined. We therefore consider the interaction between samples in different ranges instead of an exact partition into ranges; that is, the correlation between samples located near each other is higher than that between samples located far apart. For example, under the data partition of [24], the correlation between a low-abundance and a high-abundance protein sample is lower than that between two high-abundance samples.
However, how to define and quantify the correlation between two samples according to the distance between their ranges is also a problem. Fortunately, data range and data distribution are closely related, so the problem of estimating the correlation between two samples in different ranges can be recast as one of density estimation. The correlation between two samples can then be measured by the contribution that each sample point makes to the density estimate at the other. For example, if sample point A makes a higher contribution than point B to the density estimate at sample point C, we can say that the relationship between A and C is stronger than that between B and C.
From the point of view of density estimation, then, the local range window can be selected with the same strategy as the local window used in density estimation. LSPAD in effect employs a rectangular window whose width is 33% of the whole range. However, this choice is hard to justify; it is difficult to say that 33% is better than 25% or any other value. We therefore use a Gaussian window instead of a rectangular window.
With a generalized weight kernel function K(x), the density estimator $\hat{p}(x)$ is given by
$$
\hat{p}(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)
\tag{3}
$$
where N is the number of samples, h is called the smoothing parameter or window width, and the kernel function K(x) is required to be a normalized probability density. If K(x) is the Gaussian kernel, the density estimator is given by
$$
\hat{p}(x) = \frac{1}{Nh} \sum_{i=1}^{N} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x - x_i)^2}{2h^2}\right)
\tag{4}
$$
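A direct transcription of the Gaussian kernel density estimator into code may help to see how each sample contributes to the estimate at a point x. This is a minimal sketch with an illustrative function name, and the bandwidth h is simply passed in as a parameter.

```python
import math


def gaussian_kde(x: float, samples: list[float], h: float) -> float:
    """Gaussian kernel density estimate at x with bandwidth h."""
    n = len(samples)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    # Each sample x_i contributes a Gaussian bump centred on x_i; the
    # contribution decays smoothly with the distance |x - x_i|.
    return norm * sum(math.exp(-((x - xi) ** 2) / (2.0 * h * h)) for xi in samples)


samples = [0.0, 1.0, 3.0]
print(gaussian_kde(0.5, samples, h=0.5))
```

Samples close to x dominate the sum, which is the sense in which nearby samples are said to be more strongly related than distant ones.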
The choice of the bandwidth h is crucial to the density estimator: if h is chosen too small, spurious fine structure becomes visible, while if h is too large, all detail, spurious or otherwise, is obscured. Several methods for choosing an appropriate bandwidth are available, but most of them carry a considerable computational burden [27]. As a tradeoff between computational effort and performance, one may choose the bandwidth that minimizes the mean integrated square error under the assumption that the underlying distribution is Gaussian. This optimal Gaussian bandwidth h_opt is given by [28]
$$
h_{opt} = \left(\frac{4}{3N}\right)^{1/5} \sigma \approx 1.06\, \sigma N^{-1/5}
\tag{5}
$$
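The bandwidth rule above is inexpensive to compute. In the sketch below σ is taken to be the sample standard deviation of the data, which is an assumption of this example, and the function name is illustrative.

```python
from statistics import stdev


def optimal_gaussian_bandwidth(samples: list[float]) -> float:
    """h_opt = (4 / (3 N))^(1/5) * sigma, the Gaussian reference bandwidth."""
    n = len(samples)
    sigma = stdev(samples)  # ASSUMPTION: sample standard deviation
    return sigma * (4.0 / (3.0 * n)) ** 0.2


# The constant (4/3)^(1/5) is about 1.059, which is where the rounded
# factor 1.06 in the approximate form comes from.
print((4.0 / 3.0) ** 0.2)
print(optimal_gaussian_bandwidth([-1.0, 0.0, 1.0]))
```

Because h_opt shrinks only as N^(-1/5), the window narrows slowly as more features become available.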
We employ the Gaussian window as the local abundance distribution window. The Gaussian window is not a local window in the original sense; rather, it covers the whole range but assigns a different weight to each sample point. The sample set used for the localized statistics is constructed by the following strategy:
    for each sample x_i
        if random_i < P(x_i)
            then select x_i to staDataset
        end if
    end for                                   (6)
where random_i is a random number drawn from the uniform distribution between 0 and 1, staDataset represents the sample set used for the localized statistics, and P(x_i) is given by
$$
P(x_i) = 2\,\bigl(1 - \mathrm{normcdf}(x, h_{opt}, |x_i|)\bigr)
\tag{7}
$$
where normcdf(x, h_opt, |x_i|) is the normal cumulative distribution function, x represents the mean of the normal distribution, h_opt represents its standard deviation and |x_i| is the absolute value of the sample x_i.
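The sampling strategy can be sketched as follows. The notation of the selection probability leaves some room for interpretation, so this example adopts one possible reading and labels it as such: the probability is taken to be the two-sided Gaussian tail weight of the distance between a sample and the abundance of the feature under test, with standard deviation h_opt. Under that reading a sample at the centre is always selected and distant samples are selected rarely. The function and parameter names are hypothetical.

```python
import random
from statistics import NormalDist


def selection_probability(x_i: float, centre: float, h_opt: float) -> float:
    """Two-sided Gaussian tail weight of x_i relative to the tested feature.

    ASSUMPTION (one reading of the selection rule): the normal CDF is
    evaluated at the distance |x_i - centre| with standard deviation h_opt.
    """
    distance = abs(x_i - centre)
    return 2.0 * (1.0 - NormalDist(0.0, h_opt).cdf(distance))


def sample_window(values, centre, h_opt, rng=None):
    """Randomly keep each value with its selection probability."""
    if rng is None:
        rng = random.Random()
    return [x for x in values if rng.random() < selection_probability(x, centre, h_opt)]


rng = random.Random(0)  # seeded only to make the illustration repeatable
a_axis = [4.0, 4.5, 5.0, 5.5, 6.0, 9.0, 12.0]
print(sample_window(a_axis, centre=5.0, h_opt=1.0, rng=rng))
```

Unlike a rectangular window, no hard cut-off is imposed: every sample can enter the statistic, but its chance of doing so decays smoothly with distance, and the selected set varies from run to run unless the random generator is seeded.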
4. Materials and Experiments
4.1. Datasets
Three datasets are deployed here:
Dataset1: Ovarian cancer dataset (07 August 2002), which was collected using the WCX2 protein array. The sample set includes 91 controls and 162 ovarian cancers. The SELDI MS data for each case are stored in an ASCII file containing 15,155 m/z values with their corresponding intensities.
Dataset2: Small Round Blue Cell Tumors (SRBCTs),
which was obtained from glass-slide cDNA microarrays.
The data consisted of expression measurements on 6567
genes (2308 genes after filtering for minimal level of ex-
pression). The tumors are classified as Burkitt lymphoma
(BL, 11 samples), Ewing sarcoma (EWS, 29 samples),
neuroblastoma (NB, 18 samples) and rhabdomyosarcoma
(RMS, 25 samples). As we only focus on the binary clas-
sification problem, EWS and RMS are selected to form a
new two class dataset.
Dataset3: Stem Cell Matrix (SCM) [29], which is a database of global gene expression profiles. The database consists of 218 samples belonging to 17 cell lines. As with dataset2, the classes ES cells_undifferentiated and ES_differentiated neural stem cells are selected to form a new two-class dataset. IPS cells are also selected to further test our method, and this will be discussed in a later section.
4.2. The Classification Results and Discussion
LSADBGW is currently suited to two-column data, so the mean vectors of the two classes must first be calculated to form a new mean dataset. This operation ignores the differences among samples of the same class, which are useful for feature selection and whose loss may hinder classification. Leave-one-out cross-validation (LOOCV) and a linear SVM are employed in our classification framework.
The mean vectors are used for all three methods. After feature selection, we cluster the selected features into 10 clusters with the k-means method and then select the top feature of each cluster to form the feature set for classification.
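The evaluation protocol can be sketched in a few lines. To keep the example self-contained, a nearest-centroid rule is used as a stand-in classifier; it is not the linear SVM used in the experiments and is swapped in only so that the sketch runs without external libraries. The measures computed are those reported in the result figures, where OCA is read here as overall classification accuracy, an interpretation that the text does not spell out.

```python
def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in classifier (NOT the paper's linear SVM): label of the closest class mean."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(centroids, key=lambda label: dist2(test_x, centroids[label]))


def loocv(xs, ys, predict=nearest_centroid_predict):
    """Leave-one-out: classify each sample with a model trained on all the others."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        preds.append(predict(train_x, train_y, xs[i]))
    return preds


def binary_metrics(y_true, y_pred, positive=1):
    """Sensitivity, specificity, PPV, NPV and overall accuracy from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Note: a denominator of zero (e.g. no predicted positives) is not handled here.
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "oca": (tp + tn) / len(pairs),
    }


# Toy data: two well-separated classes described by two selected features.
xs = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
ys = [0, 0, 0, 1, 1, 1]
print(binary_metrics(ys, loocv(xs, ys)))
```

Any classifier with the same `predict(train_x, train_y, test_x)` signature can be passed to `loocv`, so a linear SVM from an external library could be substituted for the stand-in.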
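In Figures 2-4 we list the results obtained on dataset 1, dataset 2 and dataset 3, respectively, using LSADBGW, LSPAD’ and LSPAD. Here LSPAD’ denotes LSPAD with Fisher's exact test replaced by the simple statistical test on M values, so that it keeps the rectangular window. The p value used in all three methods was 0.95.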
The results in Figure 2 show that LSPAD performed better than LSPAD’, which suggests that Fisher's exact test was better than the simple statistical test using M values. However, in Figures 3 and 4 the results generated by LSPAD are not shown. This is because LSPAD did not generate a good set of significant features, as illustrated in Figures 5 and 6. Figure 5 shows that only two features were selected, while Figure 6 shows that almost all the features were selected. This indicates that LSPAD with Fisher's exact test is not a stable strategy for omics data, whereas LSPAD’ with the simple statistical test is much more stable.
The results also show that the Gaussian window performed better than the rectangular window, especially in Figure 2. In Figure 4, however, LSPAD’ seems slightly better than LSADBGW. We therefore used the top 10 and top 20 features without clustering to investigate the performance of LSADBGW and LSPAD’, and the results are shown in Figures 7 and 8. The new results, especially those in Figure 8, indicate that LSADBGW performed better than LSPAD’, which means that the strategy employing the Gaussian window performs better than the one employing the rectangular window.
The comparative study of the three feature selection methods indicates that the simple statistical test using M values is much more stable than Fisher's exact test, and that the Gaussian window is much more accurate than the rectangular window.
Figure 2. The classification performance comparison on the ovarian cancer dataset.
Figure 3. The classification performance comparison on the small round blue cell tumors dataset.
Figure 4. The classification performance comparison on the stem cell matrix dataset.
Figure 5. M-A plot of the small round blue cell tumors dataset. Red dots represent statistically significant over-represented genes in EWS and green dots represent statistically significant under-represented genes in EWS.
[Bar charts for Figures 2-4: Sensitivity, Specificity, PPV, NPV and OCA for LSADBGW, LSPAD’ and LSPAD.]
Figure 6. M-A plot of the stem cell matrix dataset. Red dots represent statistically significant over-represented genes in ES cells_undifferentiated and green dots represent statistically significant under-represented genes in ES cells_undifferentiated.
Figure 7. The classification performance comparison on the stem cell matrix dataset using the top 10 features without clustering.
Figure 8. The classification performance comparison on the stem cell matrix dataset using the top 20 features without clustering.
5. Conclusion
In this article, we proposed a new localized statistical approach to biomarker selection, called localized statistic of abundance distribution based on Gaussian window (LSADBGW). Compared with the localized statistics of protein abundance distribution (LSPAD), LSADBGW employs a more reasonable strategy for selecting the local statistical window and a simpler and more general statistical test. The classification results indicate that our approach performs better than LSPAD. In conclusion, we hope that LSADBGW will be a useful alternative in the analysis of omics data.
REFERENCES
[1] D. J. Oliver, B. Nikolau and E. S. Wurtele, “Functional Genomics: High-Throughput mRNA, Protein, and Metabolite Analyses,” Elsevier, 2002, pp. 98-106.
[2] N. Ishii and M. Tomita, “Multi-Omics Data-Driven Systems Biology of E. coli,” Springer, 2009, p. 41.
[3] S. Smit, H. C. J. Hoefsloot and A. K. Smilde, “Statistical Data Processing in Clinical Proteomics,” Elsevier, 2008, pp. 77-88.
[4] H. Shin and M. K. Markey, “A Machine Learning Perspective on the Development of Clinical Decision Support Systems Utilizing Mass Spectra of Blood Samples,” Elsevier, 2006, pp. 227-248.
[5] I. Guyon and A. Elisseeff, “An Introduction to Variable and Feature Selection,” MIT Press Cambridge, 2003, pp. 1157-1182.
[6] E. Marchiori, et al., “Feature Selection for Classification with Proteomic Data of Mixed Quality,” 2005, pp. 1-7.
[7] H. W. Ressom, et al., “Classification Algorithms for Phenotype Prediction in Genomics and Proteomics,” NIH Public Access, p. 691.
[8] M. Dakna, et al., “Technical, Bioinformatical and Statistical Aspects of Liquid Chromatography-Mass Spectrometry (LC-MS) and Capillary Electrophoresis-Mass Spectrometry (CE-MS) Based Clinical Proteomics: A Critical Assessment,” Elsevier, 2009, pp. 1250-1258.
[9] J. J. Chen, et al., “Gene Selection with Multiple Ordering Criteria,” BioMed Central Ltd., 2007, p. 74.
[10] A. Vlahou, et al., “Development of a Novel Proteomic
Approach for the Detection of Transitional Cell Carcino-
ma of the Bladder in Urine,” ASIP, 2001, pp. 1491-1502.
[11] M. J. Campa, et al., “Protein Expression Profiling Identi-
fies Macrophage Migration Inhibitory Factor and Cyclo-
philin A as Potential Molecular Targets in Non-Small Cell
Lung Cancer 1,” AACR, 2003, pp. 1652-1656.
[12] J. M. Koomen, et al., “Plasma Protein Profiling for Diagnosis of Pancreatic Cancer Reveals the Presence of Host Response Proteins,” AACR, 2005, pp. 1110-1118.
[13] J. M. Koomen, et al., “Diagnostic Protein Discovery Using Proteolytic Peptide Targeting and Identification,” John Wiley & Sons, Ltd., Chichester, 2004.
[14] K. R. Kozak, et al., “Identification of Biomarkers for Ovarian Cancer Using Strong Anion-Exchange ProteinChips: Potential Use in Diagnosis and Prognosis,” National Acad Sciences, 2003, pp. 12343-12348.
[15] W. Zhu, et al., “Detection of Cancer-Specific Markers Amid Massive Mass Spectral Data,” National Acad Sciences, 2003, pp. 14666-14671.
[16] T. C. W. Poon, et al., “Comprehensive Proteomic Profiling Identifies Serum Proteomic Signatures for Detection of Hepatocellular Carcinoma and Its Subtypes,” American Association of Clinical Chemistry, 2003, pp. 752-760.
[17] A. Valerio, et al., “Serum Protein Profiles of Patients with
Pancreatic Cancer and Chronic Pancreatitis: Searching for
a Diagnostic Protein Pattern,” John Wiley & Sons, Ltd.,
Chichester, 2001.
[18] M. Wagner, D. Naik and A. Pothen, “Protocols for Dis-
ease Classification from Mass Spectrometry Data,” WI-
LEY-VCH Verlag Weinheim, 2003.
[19] M. Wagner, et al., “Computational Protein Biomarker Prediction: A Case Study for Prostate Cancer,” BioMed Central Ltd., 2004, p. 26.
[20] S. Bhattacharyya, et al., “Diagnosis of Pancreatic Cancer Using Serum Proteomic Profiling,” 2004, pp. 674-686.
[21] L. H. Cazares, et al., “Normal, Benign, Preneoplastic, and Malignant Prostate Cells Have Distinct Protein Expression Profiles Resolved by Surface Enhanced Laser Desorption/Ionization Mass Spectrometry 1,” AACR, 2002, pp. 2541-2552.
[22] J. M. Sorace and M. Zhan, “A Data Review and Re-Assessment of Ovarian Cancer Serum Proteomic Profiling,” BioMed Central Ltd., 2003, p. 24.
[23] T. A. Zhukov, et al., “Discovery of Distinct Protein Profiles Specific for Lung Tumors and Pre-Malignant Lung Lesions by SELDI Mass Spectrometry,” 2003, p. 267.
[24] R. X. Li, et al., “Localized-Statistical Quantification of Human Serum Proteome Associated with Type 2 Diabetes,” Public Library of Science, 2008.
[25] J. K. Eng, A. L. McCormack and J. R. Yates III, “An Approach to Correlate Tandem Mass Spectra Data of Peptides with Amino Acid Sequences in a Protein Database,” Elsevier Science Pub. Co., New York, 1994, pp. 976-989.
[26] K. Y. Yeung, et al., “Model-Based Clustering and Data
Transformations for Gene Expression Data,” Oxford Uni-
versity Press, 2001, pp. 977-987.
[27] Y. I. Moon, B. Rajagopalan and U. Lall, “Estimation of Mutual Information Using Kernel Density Estimators,” APS, 1995, pp. 2318-2321.
[28] B. W. Silverman, “Density Estimation for Statistics and Data Analysis,” Chapman & Hall/CRC, 1986.
[29] F. J. Müller, et al., “Regulatory Networks Define Pheno-
typic Classes of Human Stem Cell Lines,” Nature Publi-
shing Group, 2008, pp. 401-405.