Statistical two-group comparisons are widely used in microarray data analysis to identify significantly differentially expressed (DE) signatures associated with therapy response. We applied a rank order statistic based on an Autoregressive Conditional Heteroskedasticity (ARCH) residual empirical process to DE analysis. The approach was evaluated on simulation data and publicly available datasets, and was compared with two-group comparisons based on the original data and on autoregressive (AR) residuals. The lists of significant DE genes obtained from the ARCH and AR residuals were about 20% - 30% shorter than the list obtained from the original data. Almost 100% of the genes identified from the ARCH residuals were also identified from the original data, which was not the case for the genes identified from the AR residuals. GO enrichment and pathway analyses indicate consistent biological characteristics between the genes identified from the ARCH residuals and those identified from the original data. ARCH residuals of array data might therefore help to reduce the number of significant DE genes while detecting the same biological features as the ordinary microarray data.
Microarray technology provides a high-throughput way to simultaneously investigate gene expression at the whole-genome level. In the field of cancer research, genome-wide expression profiling of tumors has become an important tool for identifying gene sets and signatures that can be used as clinical endpoints, such as survival and therapy response [
As one approach for this challenge, based on the residuals from the Autoregressive Conditional Heteroskedasticity (ARCH) models, the proposed rank order statistic for two-sample problems pertaining to empirical processes refines the significant DE gene list. The ARCH process was proposed by Engle [
To investigate whether ARCH residuals can consistently refine a list of significant DE genes, we apply the method to the publicly available datasets called Affy947 [
Denote the sample and the genomic location by $i$ and $j$ in the microarray data $x_{ij}$. The samples are divided into two biologically different groups: one group of breast cancer tumors driven by ER+ and another group of breast cancer tumors driven by ER−. We apply two-group comparison testing to identify, for each gene (genomic location), significant differences in expression level between the ER+ and ER− samples. As the statistical test, we propose the rank order statistic for the ARCH residual empirical process introduced in 2.1. For comparison with the ARCH model's performance, we also apply the two-group comparison test to the original array data and to the residuals obtained by an ordinary autoregressive (AR) model. The details of both methods are summarized in 2.2. For the obtained lists of significant DE genes, biologists and medical scientists require further analysis for biological interpretation, to investigate the underlying biological processes or networks. In this article, we apply GO (gene ontology) analysis, described in 2.3, and pathway analysis, described in 2.4, both of which are generally used to investigate specific genes or relationships among gene groups.
Suppose that a class of ARCH($p_X$) models is generated by the following equations

$$
X_t = \begin{cases} \sigma_t(\theta_X)\,\varepsilon_t, \quad \sigma_t^2(\theta_X) = \theta_X^0 + \sum_{i=1}^{p_X} \theta_X^i X_{t-i}^2 & \text{for } t = 1, \cdots, m \\ 0 & \text{for } t = -p_X + 1, \cdots, 0 \end{cases} \tag{2.1.1}
$$

where $\{\varepsilon_t\}$ is a sequence of i.i.d. (0,1) random variables with fourth-order cumulant $\kappa_4^X$, $\theta_X = (\theta_X^0, \theta_X^1, \cdots, \theta_X^{p_X})' \in \Theta_X \subset \mathbb{R}^{p_X + 1}$ is an unknown parameter vector satisfying $\theta_X^0 > 0$, $\theta_X^i \ge 0$, $i = 1, \cdots, p_X - 1$, $\theta_X^{p_X} > 0$, and $\varepsilon_t$ is independent of $X_s$, $s < t$. Denote by $F(x)$ the distribution function of $\varepsilon_t^2$; we assume that $f(x) = F'(x)$ exists and is continuous on $(0, \infty)$.
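To make the recursion in (2.1.1) concrete, the following Python sketch simulates one ARCH(p) series with zero pre-sample values. It is only an illustration and not the authors' code (the paper reports using R): the function name `simulate_arch`, the Gaussian choice for the i.i.d. (0,1) innovations, and the parameter values in the usage line are assumptions made for the example.

```python
import random


def simulate_arch(theta, m, seed=0):
    """Simulate X_1..X_m from the ARCH(p) recursion in (2.1.1).

    theta = (theta_0, theta_1, ..., theta_p); pre-sample values are zero.
    Returns the series and the squared innovations eps_t^2.
    """
    rng = random.Random(seed)
    p = len(theta) - 1
    x = [0.0] * p                      # X_t = 0 for t = -p+1, ..., 0
    eps_sq = []
    for _ in range(m):
        # sigma_t^2 = theta_0 + sum_i theta_i * X_{t-i}^2
        sigma2 = theta[0] + sum(theta[i] * x[-i] ** 2 for i in range(1, p + 1))
        eps = rng.gauss(0.0, 1.0)      # assumption: Gaussian i.i.d.(0,1) innovations
        x.append(sigma2 ** 0.5 * eps)
        eps_sq.append(eps ** 2)
    return x[p:], eps_sq


# Hypothetical parameters; theta_1 + theta_2 < 1 as required for stationarity.
series, eps_sq = simulate_arch(theta=(0.2, 0.3, 0.2), m=500)
```

The squared innovations returned alongside the series are the quantities whose distribution function is denoted $F(x)$ in the text.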
Suppose that another class of ARCH($p_Y$) models, independent of $\{X_t\}$, is generated similarly by the equations

$$
Y_t = \begin{cases} \sigma_t(\theta_Y)\,\xi_t, \quad \sigma_t^2(\theta_Y) = \theta_Y^0 + \sum_{i=1}^{p_Y} \theta_Y^i Y_{t-i}^2 & \text{for } t = 1, \cdots, m \\ 0 & \text{for } t = -p_Y + 1, \cdots, 0 \end{cases} \tag{2.1.2}
$$

where $\{\xi_t\}$ is a sequence of i.i.d. (0,1) random variables with fourth-order cumulant $\kappa_4^Y$, $\theta_Y = (\theta_Y^0, \theta_Y^1, \cdots, \theta_Y^{p_Y})' \in \Theta_Y \subset \mathbb{R}^{p_Y + 1}$ is an unknown parameter vector satisfying $\theta_Y^0 > 0$, $\theta_Y^i \ge 0$, $i = 1, \cdots, p_Y - 1$, $\theta_Y^{p_Y} > 0$, and $\xi_t$ is independent of $Y_s$, $s < t$. The distribution function of $\xi_t^2$ is denoted by $G(x)$, and we assume that $g(x) = G'(x)$ exists and is continuous on $(0, \infty)$. For (2.1.1) and (2.1.2), we assume that $\theta_X^1 + \cdots + \theta_X^{p_X} < 1$ and $\theta_Y^1 + \cdots + \theta_Y^{p_Y} < 1$ for stationarity (see [
Now we are interested in the two-sample problem of testing

$$
H_0: F(x) = G(x) \ \text{for all } x \quad \text{against} \quad H_A: F(x) \ne G(x) \ \text{for some } x.
$$

In this article, $F(x)$ and $G(x)$ correspond to the distributions for the expression data of the samples driven by ER+ and ER−, respectively.
For this testing problem, we consider a class of rank order statistics that includes, for example, Wilcoxon's two-sample test. The statistic is formed from the empirical residuals $\hat{\varepsilon}_t^2 = X_t^2 / \sigma_t^2(\hat{\theta}_X)$, $t = 1, \cdots, n$, and $\hat{\xi}_t^2 = Y_t^2 / \sigma_t^2(\hat{\theta}_Y)$, $t = 1, \cdots, n$. Because Lee and Taniguchi [
To obtain the empirical residuals described in 2.1, the ARCH model is applied to the vector $\{x_{i1}, x_{i2}, \cdots, x_{iL}\}$ for the $i$th sample, where $L$ is the total number of genomic locations in the microarray data. Assuming that the ER+ and ER− samples correspond to the distributions $F(x)$ and $G(x)$ in 2.1, the orders $p_X$ and $p_Y$ of the ARCH models are identified by model selection using the Akaike Information Criterion (AIC), where the model with the minimum AIC is defined as the best fit model [
For comparison with the ARCH model's performance, we apply the two-group comparison test to the original array data and to the residuals obtained by an ordinary autoregressive (AR) model. The AR model represents the current value as a weighted sum of past values, $x_{ij} = \sum_{k=1}^{K} \beta_{ik} x_{i,j-k} + w_{ij}$, where $\beta_{ik}$, $K$, and $w_{ij}$ are the AR coefficients, the AR order, and the error term. The AR model is widely applied in time series analysis and signal processing in economics, engineering, and science. In this article, we apply it to the expression data for the ER+ and ER− groups. The AR order of the best fit model is identified by AIC. The empirical residuals, $w_{ij}^{+}$ for ER+ and $w_{ij}^{-}$ for ER−, are obtained by subtracting the model predictions from the data.
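The AR residual step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: it fits the coefficients by ordinary least squares without an intercept for a fixed, user-supplied order (AIC-based order selection is omitted), and the names `solve` and `ar_residuals` are hypothetical helpers introduced for the example.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def ar_residuals(x, order):
    """Least-squares fit of x_j = sum_k beta_k x_{j-k} + w_j; returns (beta, w)."""
    # Each row holds the lagged values (x_{j-1}, ..., x_{j-order}) for one j.
    rows = [[x[j - k] for k in range(1, order + 1)] for j in range(order, len(x))]
    y = x[order:]
    # Normal equations (R'R) beta = R'y.
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(order)] for a in range(order)]
    xty = [sum(r[a] * yj for r, yj in zip(rows, y)) for a in range(order)]
    beta = solve(xtx, xty)
    # Residual = observed value minus model prediction.
    resid = [yj - sum(bk * rk for bk, rk in zip(beta, r)) for r, yj in zip(rows, y)]
    return beta, resid


# Usage on a synthetic AR(2) series (illustrative values only).
rng = random.Random(1)
demo = [0.0, 0.0]
for _ in range(300):
    demo.append(0.5 * demo[-1] - 0.3 * demo[-2] + rng.gauss(0.0, 1.0))
beta_hat, w_hat = ar_residuals(demo, order=2)
```

The returned residuals play the role of $w_{ij}$ for one sample; in the paper's workflow they would be computed per sample and then compared between the ER+ and ER− groups.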
These procedures are summarized as follows: 1) take the original microarray data and the clinical data for ER from one study cohort; 2) apply the ARCH and AR models to the original data for each sample and identify the best fit model among the candidate models with 1 - 10 time lags; 3) compute the residuals from the data and the predictions of the best fit model; 4) apply the Wilcoxon statistic to the original data and to the empirical residuals from ARCH and AR; 5) list the p-values and identify the significant genes after FDR (5%) correction; 6) apply Gene Ontology analysis and pathway analysis (see the details in 2.3 and 2.4) to the obtained gene list for biological interpretation (see 4. in
The computations were carried out in R, using the garchFit function (in the "fGarch" package) for ARCH fitting, the ar.ols function for AR fitting, wilcox.test for the rank-sum test, and fdr.R for the FDR adjustment.
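For readers who want to see the testing step spelled out, the following Python sketch applies a rank-sum test gene by gene and then an FDR adjustment. It is not the R code used in the paper and will not reproduce `wilcox.test` exactly: it assumes a normal approximation without continuity or tie correction, and it assumes the Benjamini-Hochberg procedure for the FDR step (the paper only states that fdr.R was used). All function names are hypothetical.

```python
import math


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_pvalue(a, b):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, assumption)."""
    n1, n2 = len(a), len(b)
    ranks = midranks(list(a) + list(b))
    w = sum(ranks[:n1])                      # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2
    var = n1 * n2 * (n1 + n2 + 1) / 12
    z = (w - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adj = [0.0] * n
    running = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running = min(running, pvals[i] * n / rank)
        adj[i] = running
    return adj


def significant_genes(group_pos, group_neg, fdr=0.05):
    """Indices of genes whose adjusted p-value is at or below the FDR level.

    group_pos[g] and group_neg[g] hold the values of gene g in the two groups.
    """
    pvals = [rank_sum_pvalue(a, b) for a, b in zip(group_pos, group_neg)]
    adj = benjamini_hochberg(pvals)
    return [g for g, q in enumerate(adj) if q <= fdr]
```

The same functions can be applied unchanged to the original values or to the ARCH or AR residuals, which is how the three analyses in the paper differ only in their input.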
To investigate the gene product attributes from the gene list, Gene Ontology (GO) analysis was performed to find specific gene sets that are statistically associated among several biological categories. GO is designed as a functional annotation database to capture the known relationships between biological terms and all the genes that are instances of those terms. It is widely used by many functional enrichment tools and is highly regarded both for its comprehensiveness and its unified approach for annotating genes in different species to the same basic set of underlying functions [
As in the GO analysis, the identified genes are mapped to well-defined biological pathways. Pathway analysis determines which pathways are overrepresented among genes that present significant variations. The difference from GO analysis is that pathway analysis includes interactions among a given set of genes. Several tools for pathway analysis have been published. In this study, we used a web-based analysis tool called REACTOME, which is a manually curated open-source open-data resource of human pathways and reactions [
To investigate the performance of our proposed algorithm, we performed a simulation study. We first prepared a clinical indicator analogous to ER+ and ER−: an artificial indicator set to "1" for 50 samples and "0" for 50 samples. Next, we considered two types of artificial data with 1000 arrays and 100 samples. Array data (A) were generated from normal distributions: the overall array data were first generated at once with the mean and variance both set to 1.0, and the data for the 201 - 400 array and the 601 - 600 array were then replaced with data generated from a different normal distribution with mean 1.8 and variance 10. Array data (B) were generated by an ARCH model. The model was fitted to real array data (DES data, see the details in Section 4), and its parameters (mu: the intercept; omega: the constant coefficient of the variance equation; alpha: the coefficients of the variance equation; skew: the skewness of the data; shape: the shape parameter of the conditional distribution, set to 3) were estimated for ER+ and ER− separately. We used these parameters and random numbers to generate the simulation data. For the computations, we used the Matlab® command normrnd to generate normal random variables for array data A, and garchSim from the R package fGarch for array data B. We generated the two array data sets 100 times. To the 100 data sets for each of A and B, we applied the two-group comparison to the original simulation data and to their ARCH residuals, and identified the parts significant at 5% FDR.
Due to the extensive use of microarray technology, the number of publicly available datasets has exploded in recent years [
Among all the probes covering the whole genome, we use the probes that correspond to the intrinsic signatures obtained by classifying breast tumors into five molecular subtypes [
For the simulation data and ARCH residuals, we summarized the average of the number of the identified 5% FDR significant parts and the number of the overlapped parts in
Data | 1. Original series | 2. ARCH residuals | Overlap of 2 with 1 (number) | Overlap ratio |
---|---|---|---|---|
Array set A | 30.2 | 29.8 | 29.3 | 98.2 % |
Array set B | 512.2 | 338.9 | 212.1 | 62.6 % |
For the simulation data generated from the normal distribution (array set A), the numbers of significant parts identified from the original series and from the ARCH residuals did not differ, and the identified parts were mostly the same. On the other hand, for the simulation data generated by the ARCH model (array set B), more significant parts were identified than in array set A, and the number of significant parts for the ARCH residuals was about 30% less than the number for the original series. The overlap was lower than in case A; however, over 50% of the parts were covered.
Based on the method explained in 3.2, the best fit AR and GARCH models were selected by AIC for each sample. The estimated orders of all the best fit models of all the studies are summarized in Supplementary
Using the residuals obtained by the best fit ARCH and AR models and the original data, we applied the Wilcoxon statistic to compare DEs between the two groups, ER+ and ER−. The significant genomic locations were assessed by FDR. The locations were mapped to Entrez gene IDs according to the Affy probes presented in the original microarray data and converted into gene symbols by SOURCE. The identified genes in the original data and the ARCH residual analyses are listed in Supplementary
similar structure to the ARCH process, and empirical ARCH residuals might be more effective than AR residuals for specifying important genes from a long list of genes.
To investigate the overlapping genes by the ARCH residuals with genes by the original data, the corresponding cytoband and gene symbols are summarized in
Model | Data | Des | Mil | Min | Loi | Chin |
---|---|---|---|---|---|---|
- | Original #EntrezID (unique) | 245 (186) | 238 (176) | 274 (195) | 53 (47) | 277 (201) |
ARCH | Residual #EntrezID (unique) | 193 (152) | 175 (139) | 207 (154) | 46 (41) | 177 (133) |
ARCH | Overlapped with original [%] | 100 | 100 | 100 | 98 | 100 |
AR | Residual #EntrezID (unique) | 194 (152) | 161 (131) | 183 (141) | 37 (34) | 178 (139) |
AR | Overlapped with original [%] | 95 | 99 | 87 | 92 | 98 |
chromosome 1q region, ERBB2 in the chromosome 17q region, and ESR1 in the chromosome 6q region, even if the number of identified Entrez genes was less than the number of identified genes from the original data.
Next, we performed GO enrichment analysis using significant DE gene lists for the original data and ARCH’s residual analyses in all studies. To correctly find the enriched GO terms for the associated genes, a background list was prepared of all the probes included in the original microarray data. The Entrez genes in the background list were converted into 13,177 gene symbols without any duplication by SOURCE. As the input gene lists to GOrilla, the numbers of summarized unique genes are shown in the parentheses of
Furthermore, to investigate the consistency of the refined significant gene
Studies | Original: cytoband | Original: genes | ARCH residuals: cytoband | ARCH residuals: genes |
---|---|---|---|---|
Des Mil Min Chin | 1p13.3 1p32.3 1p34.1 1p35 1p35.3 1p35.3-p33 1q21 1q21.1 1q21.3 1q23.2 1q24-q25 1q32.2 1q41 1q42.11 1q42.13 2p11.2 2q35 2q37.3 3p14.3 3p21 3q13.1 3q23-q25 3q24-q25.1 3q25 4q12 4q21.1 4q28.3 4q32.1 4q35.1 5q13.1 5q14-q21 5q22-q23 5q31.1 5q33.2 5q33.3 5q35.2 5q35.3 6p12 6p21.3 6p22.3 6q22.31 6q22.33 6q22-q23 6q23.3 6q25.1 7p13 7p15 7q21 7q21-q31 7q31.1 7q36 8p12 8p21 8p22 | VAV3, GSTM3, CHI3L2 ECHDC2 CTPS1 IFI6 ATPIF1 MEAF6 S100A11, S100A1 PEA15 CRABP2 COPA CACYBP ELF3 TP53BP2 DEGS1 ADCK3 TMSB10 IGFBP5, IGFBP2 LRRFIP1, SNED1 ACOX2 MST1 ALCAM CP GYG1 SIAH2 KIT, PDGFRA USO1 MGST2 GRIA2 ACSL1 PIK3R1 PAM REEP5 JADE2 GALNT10 CYFIP2 MSX2 GNB2L1 MCM3, TFAP2B HIST1H1C ID4 ASF1A ECHDC1 FABP7 CITED2 ESR1 BLVRA GARS FZD1 SEMA3C IFRD1 PTPRN2 NRG1, PLAT EPHX2 ASAH1, TUSC3 | 1p13.3 1p35 1p35.3 1q21 1q21.1 1q21.3 1q23.2 1q24-q25 1q41 1q42.13 2q35 2q37.3 3q23-q25 3q24-q25.1 3q25 4q12 4q21.1 4q35.1 5q13.1 5q14-q21 5q22-q23 5q33.2 5q35.2 5q35.3 6p12 6p21.3 6p22.3 6q22.31 6q22-q23 6q23.3 6q25.1 7p13 7p15 7q21 7q21-q31 7q31.1 7q36 8p12 8p21 8p22 | VAV3, GSTM3 IFI6 ATPIF1 S100A1, S100A11 PEA15 CRABP2 COPA CACYBP TP53BP2 ADCK3 IGFBP5, IGFBP2 LRRFIP1, SNED1 CP GYG1 SIAH2 KIT, PDGFRA USO1 ACSL1 PIK3R1 PAM REEP5 GALNT10 MSX2 GNB2L1 MCM3 HIST1H1C ID4 ASF1A FABP7 CITED2 ESR1 BLVRA GARS FZD1 SEMA3C IFRD1 PTPRN2 NRG1 EPHX2 ASAH1, TUSC3 |
Des Mil Min Chin | 8q21.1 8q22 8q22.1 8q24.1 8q24.12 9q33.3 9q34.1 9q34.11 10p15 10q24 11p12-p11 11p15 11q11-q12 11q12.3 11q13 11q14.1 12p13 12q12 12q13 12q13.12 12q14 12q14.1 12q24.21 13q12 13q21.1-q32 13q22.2 13q31.2-q32.3 13q33 14q11.2 15q24 15q24.2 15q26.3 16p12.2 16p13.3 16q13 16q22.1 16q24.3 17p11.2 17q11.2 17q11.2-q12 17q11-q12 17q12 17q21.2 17q21.31 17q24-q25 18p11.3 18q21.1 18q22-q23 18q23 19p13.3 19p13.3-p13.2 19q13.2 19q13.3 19q13.4 20p11.21 21q21.1 22q11.2 22q13.1 Xp21.3 | PEX2 CA2 LAPTM4B SQLE TRPS1 RALGPS1 CRAT SPTAN1 GATA3 PDCD4, MYOF, PAPSS2 EXT2 RPL27A TCN1 PLA2G16 NUMA1 RSF1 PTMS, SCNN1A TWF1 STAT6, SLC11A2 FKBP11 GNS PPM1H MED13L FLT1 CLN5 LMO7 STK24 EFNB2 MMP14 CIB2 COMMD4 IGF1R POLR3E HCFC1R1 ARL2BP CDH1 PIEZO1 ALDH3A2, PEMT FAM222B, CCL18 LIG3 FLOT2 ERBB2 KRT17 ACBD4 CDC42EP4 RAB31 ACAA2 ZNF236 CYB5A KDM4B EPOR CYP2A6 CA11, ARHGAP35 PEG3 ENTPD6 BTG3 IGL POLR2F, H1F0 ZFX | 8q22 8q22.1 8q24.12 9q34.1 10p15 10q24 11p12-p11 11p15 11q11-q12 11q12.3 11q13 11q14.1 12p13 12q12 12q13 12q13.12 12q14 12q14.1 12q24.21 13q12 13q21.1-q32 13q22.2 13q33 15q26.3 16p12.2 16p13.3 16q13 16q24.3 17p11.2 17q11.2 17q11.2-q12 17q11-q12 17q12 17q24-q25 18q21.1 18q22-q23 18q23 19p13.3 19q13.2 19q13.3 19q13.4 20p11.21 21q21.1 22q11.2 22q13.1 | CA2 LAPTM4B TRPS1 CRAT GATA3 PDCD4, MYOF, PAPSS2 EXT2 RPL27A TCN1 PLA2G16 NUMA1 RSF1 PTMS TWF1 STAT6 FKBP11 GNS PPM1H MED13L FLT1 CLN5 LMO7 EFNB2 IGF1R POLR3E HCFC1R1 ARL2BP PIEZO1 ALDH3A2 FAM222B, CCL18 LIG3 FLOT2 ERBB2 CDC42EP4 ACAA2 ZNF236 CYB5A KDM4B CYP2A6 CA11,ARHGAP35 PEG3 ENTPD6 BTG3 IGL POLR2F, H1F0 |
Des Mil Min Chin | Xp22.1 | SAT1 | | |
Des Mil Min Chin | Xq26.3 | VGLL1 | Xq26.3 | VGLL1 |
+Loi | 1p13.3 | CHI3L2 | | |
+Loi | 1q24-q25 | CACYBP | 1q24-q25 | CACYBP |
+Loi | 2q35 | IGFBP5 | 2q35 | IGFBP5 |
+Loi | 3p21 | | 3p21 | |
+Loi | 5q35.2 | MSX2 | | |
+Loi | 7p13 | BLVRA | 7p13 | BLVRA |
+Loi | 10q24 | PDCD4, MYOF | 10q24 | PDCD4, MYOF |
+Loi | 17q11.2 | CCL18 | | |
+Loi | 17q24-q25 | | 17q24-q25 | CDC42EP4 |
+Loi | 19p13.3-p13.2 | EPOR | 19p13.3-p13.2 | EPOR |
+Loi | 21q21.1 | BTG3 | 21q21.1 | BTG3 |
+Loi | 22q11.2 | IGL | 22q11.2 | IGL |
+Loi | Xp22.1 | SAT1 | | |
lists, we applied pathway analysis to the significant DE genes for the original and ARCH residuals listed in
We applied a rank order statistic for an ARCH residual empirical process to refine the significant DE genes obtained by two-group comparison in microarray analysis. Our approach was applied to publicly available gene expression datasets with the clinical output for ER, in addition to a simulation study. We compared the performance of the analysis based on the ARCH residuals with analyses based on the AR residuals and on the ordinary original microarray data. While the genes identified from the AR residuals were not fully covered by the genes from the original data analysis, the genes identified from the ARCH residuals overlapped almost 100% with those from the original data, and the gene lists were about 30% shorter than the gene lists obtained by the original data analysis. We confirmed a similar 30% reduction in the simulation study. In GO enrichment and pathway analyses, the result by the ARCH residuals was mostly covered with associated biological terms obtained by the original data
Category | Associated GO terms | Des Orig | Des Arch | Mil Orig | Mil Arch | Min Orig | Min Arch | Chin Orig | Chin Arch |
---|---|---|---|---|---|---|---|---|---|
Biological Process | epithelial cell proliferation | + | + | + | + | + | + | + | + |
response to estrogen | + | + | + | + | + | + | + | + | |
epidermis development | + | + | + | + | + | + | |||
regulation of phosphatidylinositol 3-kinase activity | + | + | + | + | + | + | + | ||
erythropoietin-mediated signaling pathway | + | + | + | + | + | + | |||
regulation of lipid kinase activity | + | + | + | + | + | + | + | + | |
phosphatidylinositol 3-kinase signaling | + | + | + | + | + | + | + | + | |
positive regulation of phosphatidylinositol 3-kinase activity | + | + | + | + | + | + | + | + | |
phenylpropanoid catabolic process | + | + | + | + | + | + | + | + | |
mast cell differentiation | + | + | + | + | + | + | + | + | |
extracellular vesicle | + | + | + | + | + | + | + | ||
extracellular vesicular exosome | + | + | + | + | + | + | + | ||
extracellular region part | + | + | + | + | + | + | + |
Pathways | Original: number | Original: gene symbol | ARCH residuals: number | ARCH residuals: gene symbol |
---|---|---|---|---|
ERBB2 signaling | 7 | ERBB2, KIT, NRG1 | 6 | ERBB2 |
EGFR pathways | 11 | FLT1, KIT, PIK3R1, VAV3 | 10 | ERBB2, FLT1, PIK3R1 |
Cell cycle | 5 | CDH1, MCM3 | 3 | MCM3, NUMA1 |
Immune system | 5 | CDH1, KIT, PIK3R1, STAT6 | 5 | ERBB2, IFI6, STAT6 |
Metabolic disorder | 1 | SAT1 | 1 | FZD1 |
PI3K/AKT signaling | 5 | KIT, PIK3R1 | 5 | ERBB2, PIK3R1 |
Wnt pathway | 5 | FZD1 | 5 | FZD1 |
analysis and presented additional important GO terms in biological processes. These results suggest that data processing using ARCH residuals array data could contribute to refining significant DE genes that follow the required gene signatures and provide prognostic accuracy and guide clinical decisions.
The research by the second author was supported by Japanese Grant-in-Aids A23244011 (Taniguchi, M., Waseda Univ.).
The authors declare no competing interests.
Solvang, H.K. and Taniguchi, M. (2017) Microarray Analysis Using Rank Order Statistics for ARCH Residual Empirical Process. Open Journal of Statistics, 7, 54-71. https://doi.org/10.4236/ojs.2017.71005