DNA microarray is a widely used technique that allows one to identify genes that are similarly or differentially expressed across cell types or conditions, to learn how their expression levels change across developmental stages or disease states, and to identify the cellular processes in which they participate. The technology produces large amounts of complex data, necessitating multiple bioinformatics and computational tools and techniques to provide a comprehensive view of the underlying biology. This review surveys methods and techniques that may be employed to analyze and interpret microarray data, focusing primarily on the analysis of gene expression matrices to obtain biological insight. Both supervised and unsupervised methods commonly used for expression data analysis are discussed, and data visualization techniques that help convey the biological relevance of the data are also discussed in brief.
DNA microarrays are microscopic arrays in which thousands of unique DNA molecules (probes) of known sequence are immobilized on a solid substrate. Microarrays are, in principle and in practice, extensions of the hybridization-based methods that have been used for decades to identify and quantitate nucleic acids in biological samples [
The workflow in a microarray experiment encompasses experimental design, experimental procedures, and data pre-processing, i.e. data transformation from raw microarray data to gene expression matrices (
Expression ratios treat up- and down-regulated genes differently. Genes up-regulated by a factor of two have an expression ratio of 2, while those down-regulated by a factor of two have an expression ratio of ½ (0.5). As a result, down-regulated genes are compressed between 0 and 1 while up-regulated genes expand to cover the region between 1 and positive infinity; hence a logarithmic transformation is used, generally with base 2. The advantage of this transformation is that it treats up- and down-regulated genes equivalently and produces a continuous spectrum of values for differentially expressed genes, i.e. log2(2) = 1 and log2(1/2) = −1. However, before transformation, one must first have an accurate method for comparing the measured expression levels between query and reference samples, and this is done by means of normalization. Normalization scales one or both of the measured expression levels of each gene to make them equivalent, and consequently the expression ratios derived from them [
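The symmetry restored by the log transformation can be sketched in a few lines of Python (the intensity values are hypothetical, chosen only for illustration):

```python
import math

def log_ratio(query, reference):
    """Log2-transform the expression ratio of a gene.

    A two-fold up-regulation maps to +1 and a two-fold
    down-regulation maps to -1, so both directions are
    treated symmetrically around zero.
    """
    return math.log2(query / reference)

# A gene up-regulated by a factor of 2 and one down-regulated
# by a factor of 2 (raw ratios 2 and 0.5, respectively):
up = log_ratio(200.0, 100.0)    # log2(2)   = 1.0
down = log_ratio(50.0, 100.0)   # log2(1/2) = -1.0
```

Note how the raw ratios 2 and 0.5 are asymmetric around 1, while their log2 values are symmetric around 0.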
After normalization and transformation, the expression measurements are combined into a log ratio for each sample, which describes numerically the extent to which the gene is differentially expressed and whether it is up- or down-regulated. To identify genes that are consistently differentially expressed across all replicates under an experimental condition, one may choose a threshold, e.g. two-fold differential expression, and select those genes whose average differential expression exceeds it. From a statistical perspective, however, this is not a good approach, because the average ratio takes into account neither the sample size nor the variability within the sample (replicates or individuals). Hence a methodology known as hypothesis testing is used to determine whether or not a gene is differentially expressed.
A hypothesis test builds a probabilistic model for the observed data based on a null hypothesis, which in this case is that there is no biological effect, i.e. the gene is not differentially expressed under the conditions studied, and the observed difference results instead from variation between replicates or measurement errors. Using this model, it is possible to calculate the probability of observing a statistic, e.g. an average fold change, that is at least as extreme as the one observed in the data. This probability is known as the p-value [
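One simple nonparametric hypothesis test of this kind is a permutation test, sketched below in Python; the replicate values are hypothetical, and this is only one of several tests (t-tests being the most common) used in practice:

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference of means.

    Under the null hypothesis the gene is not differentially
    expressed, so the group labels are exchangeable; the p-value
    is the fraction of label permutations whose mean difference
    is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical log-ratio replicates for one gene:
control = [0.1, -0.2, 0.0, 0.1]
treated = [1.9, 2.1, 2.0, 1.8]
p = permutation_p_value(control, treated)  # small p: likely differential
```

With only four replicates per group the smallest attainable p-value is limited by the number of distinct label permutations, which is one reason replicate number matters.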
One may perform statistical tests on individual genes and conclude from these tests whether each gene is up- or down-regulated. In a microarray experiment, however, these tests are applied to many genes in parallel, which has a serious consequence known as the multiplicity of p-values: if the significance threshold is 0.01, then by definition of the p-value each gene has a 1 percent chance of attaining a p-value below 0.01 and thus of appearing significant at the one percent level. If there are 10,000 genes on an array, there may be 100 genes called significant at this level by chance alone. This raises an important question: how does one know that a gene that appears to be differentially expressed truly is? This is a deep problem in statistics, and the p-values must be adjusted so as to achieve an acceptable false positive rate. Multiple test correction is applied to estimate what fraction of the genes called significantly differentially expressed are false positives. This fraction is called the false discovery rate (FDR) [
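The most widely used FDR correction is the Benjamini-Hochberg procedure, which can be sketched as follows (the raw p-values are hypothetical):

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values).

    Each raw p-value p_(i) (sorted ascending, rank i of m) is
    scaled by m/i, and monotonicity is then enforced from the
    largest rank downward so adjusted values never decrease
    with rank.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

raw = [0.005, 0.009, 0.05, 0.2, 0.5]
q = benjamini_hochberg(raw)
# The two smallest q-values stay below 0.05, so at a 5% FDR
# two genes would be called differentially expressed.
```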
The goal of microarray data analysis is to find relationships and patterns in the data. To make further analysis convenient, the expression data are represented as a matrix in which rows represent genes, columns represent experimental conditions, and the value at each position characterizes the expression level of a particular gene under a particular experimental condition. Additional information, e.g. gene annotations, functional descriptions or sample details, may also be attached to the matrix. Once the expression data are organized into such matrices, they can be used for higher-level analysis. Current methodologies for higher-level data analysis fall into two categories: supervised approaches, which determine genes that fit a predetermined pattern and are used to find a “classifier” that separates the data into classes; and unsupervised approaches, which characterize the components of a data set without a priori input or knowledge of the pattern and are used to find groups inherent in the data [
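The matrix organization can be illustrated with a toy example in Python; all gene names, condition names and values here are hypothetical:

```python
# A toy gene expression matrix: rows are genes, columns are
# experimental conditions, entries are log-ratios.
genes = ["geneA", "geneB", "geneC"]
conditions = ["control", "treated"]
matrix = [
    [0.1, 2.0],   # geneA
    [-0.2, 1.8],  # geneB
    [0.0, -1.5],  # geneC
]

def expression(gene, condition):
    """Look up the expression value of one gene under one condition."""
    return matrix[genes.index(gene)][conditions.index(condition)]

# Additional annotation can be attached alongside the matrix:
annotations = {
    "geneA": "hypothetical kinase",
    "geneC": "hypothetical transporter",
}
```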
Most gene expression data analysis algorithms assume that the gene expression values are scalars; in these algorithms, replicates are either treated as separate experimental conditions or replaced by one summarizing scalar, e.g. the mean or median, so information about variance and reliability is lost. Another approach is to treat expression values as vectors: each gene can be considered a point in m-dimensional space, where m is the number of samples, and, similarly, each sample can be considered a vector in n-dimensional space, where n is the number of genes. All genes may thus be represented as points in multidimensional space, and genes with similar expression values will lie closer to each other than genes with dissimilar expression values. This representation provides an intuitive picture of similarity, and mathematical formulae may be used to calculate the “distance” between two expression vectors. There is a variety of methods for measuring distance, typically falling into three general classes: Euclidean, non-Euclidean and semi-metric [
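Two common choices, one Euclidean and one correlation-based, can be sketched in Python (the expression vectors are hypothetical):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 minus the Pearson correlation, a common non-Euclidean
    measure: it compares the *shape* of two expression profiles
    and ignores differences in absolute magnitude.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

g1 = [1.0, 2.0, 3.0]
g2 = [2.0, 4.0, 6.0]              # same shape, larger magnitude
d_euc = euclidean(g1, g2)         # large: vectors are far apart
d_cor = pearson_distance(g1, g2)  # ~0: profiles perfectly correlated
```

The example shows why the choice matters: the same pair of genes is "far" under the Euclidean measure yet "identical" under the correlation measure.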
When choosing a distance measure for further analysis, there is no single answer as to which measure is best; different measures have different strengths and weaknesses. Once a distance measure has been applied, the resulting expression matrix generally shows no apparent pattern or order. Further analytical techniques may then be applied to re-order the rows, the columns or both so that the patterns of expression become apparent. The most common unsupervised techniques are hierarchical clustering, k-means clustering and self-organizing maps.
Clustering is a very useful technique for exploring the expression patterns present in the data. Clustering groups together samples or genes with similar expression profiles, dividing the data into a few groups, thereby reducing its dimensionality and making it more amenable to biological interpretation [
If there is prior knowledge regarding the number of clusters that should be represented in the data, K-means clustering is a good alternative to hierarchical methods [13,14]. In K-means, objects are partitioned into a fixed number (K) of clusters such that each cluster is internally similar and externally dissimilar. The process involved in K-means is conceptually simple but computationally intensive. Initially, all objects are randomly assigned to one of the K clusters. An average expression vector is then calculated for each cluster, and this is used to compute the distances between objects and clusters. Using an iterative method, objects are moved between clusters, and the distances are re-measured with each move. An object is allowed to remain in its new cluster only if it is closer to that cluster than to its previous one. After each move, the expression vectors of the affected clusters are recalculated. The shuffling proceeds until moving any more objects would make the clusters more variable (
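The iteration described above can be sketched as a minimal K-means (Lloyd's algorithm) in Python; the expression profiles are hypothetical, and a production analysis would use an optimized library implementation:

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-means: assign each object to its nearest
    centroid, recompute centroids as cluster averages, and
    repeat until the assignments stop changing.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [-1] * len(points)
    for _ in range(n_iter):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # converged
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(d) / len(members) for d in zip(*members)]
    return assignment, centroids

# Two well-separated groups of hypothetical expression profiles:
profiles = [[0.0, 0.1], [0.1, 0.0], [2.0, 2.1], [2.1, 1.9]]
labels, centers = kmeans(profiles, k=2)
```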
Self-organizing maps (SOMs) are similar to hierarchical clustering in that they also provide a survey of the expression patterns within a data set, but the approach is quite different [17,18]. Genes are represented as points in multidimensional space and are assigned to a series of partitions based on the similarity of their expression vectors to reference vectors defined for each partition. It is the process of defining these reference vectors that distinguishes SOMs from k-means clustering. Before initiating the analysis, the user defines a geometrical configuration for the partitions, typically a two-dimensional rectangular or hexagonal grid. A map is set up with the centers of each cluster-to-be (known as centroids) arranged in the defined configuration. As the method iterates, the centroids move towards randomly chosen genes at a decreasing rate, and it continues until there is no further significant movement of these centroids. The advantages of SOMs include easy two-dimensional visualization of expression patterns [
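A minimal sketch of the SOM update rule, here on a one-dimensional grid of nodes rather than the usual two-dimensional grid, with hypothetical data and simplistic learning-rate and radius schedules:

```python
import math
import random

def som_1d(points, n_nodes=4, n_iter=200, seed=0):
    """Minimal one-dimensional self-organizing map.

    Reference vectors ("nodes") sit on a 1-D grid; at each step
    a randomly chosen point pulls its best-matching node, and
    that node's grid neighbours, toward itself at a rate that
    decays over time.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for t in range(n_iter):
        rate = 0.5 * (1.0 - t / n_iter)               # decreasing learning rate
        radius = max(1, int(n_nodes * (1.0 - t / n_iter) / 2))
        p = rng.choice(points)
        best = min(range(n_nodes), key=lambda i: math.dist(p, nodes[i]))
        for i in range(n_nodes):
            if abs(i - best) <= radius:               # grid neighbourhood
                nodes[i] = [w + rate * (x - w) for w, x in zip(nodes[i], p)]
    return nodes

# Hypothetical profiles forming two groups; after training the
# end nodes of the grid settle near the two groups.
profiles = [[0.0, 0.0], [0.1, 0.1], [3.0, 3.0], [3.1, 2.9]]
trained = som_1d(profiles)
```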
Most of the clustering methods described above are heuristic in the sense that they do not try to optimize a scoring function describing the overall quality of the clustering. Model-based clustering assumes that the data have been generated by some, typically probabilistic (Bayesian), model and tries to find the clustering corresponding to the most probable model. Such methods may still be heuristic in that they may not guarantee identification of the most probable clustering. Although model-based clustering has the potential to incorporate a priori knowledge about the domain into the analysis, it is not easy to apply in a way that produces more biologically meaningful results than purely heuristic methods. Fuzzy clustering, unlike deterministic clustering in which a given object either belongs or does not belong to a given cluster, assigns to each object a probability of belonging to each cluster [
Principal component analysis (PCA) is another way of reducing the dimensionality of the data. PCA is based on finding the direction in multidimensional vector space along which the dispersion of the data points has the largest amplitude, i.e. the greatest variability (
If these directions, or principal components, account for most of the variability, the other directions can be disregarded, greatly reducing the dimensionality of the data. Principal components are a set of vectors in this space that capture the variation seen in the points in decreasing order: the first principal component captures more variation than the second, and so on. In microarray datasets, most of the variability can often be accounted for by a small number of principal components [
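The first principal component can be sketched with power iteration on the covariance matrix; this is only a minimal stdlib illustration on hypothetical data, whereas real analyses use a full eigen- or singular-value decomposition:

```python
import math

def first_principal_component(points, n_iter=100):
    """First principal component via power iteration on the
    sample covariance matrix of mean-centered data.
    """
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1)
            for j in range(dim)] for i in range(dim)]
    v = [1.0] * dim
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Strongly correlated hypothetical data: most of the variance
# lies roughly along the diagonal direction (1, 1)/sqrt(2).
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]
pc1 = first_principal_component(data)
```

Projecting each point onto `pc1` collapses the two dimensions into one while retaining most of the variability, which is exactly the reduction described above.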
Multidimensional scaling (MDS) is a different approach to dimensionality reduction. Unlike PCA, it does not start from the data themselves but from the distance measures: it tries to place the entities being compared in two- or three-dimensional space such that the distances between them are as close as possible to the distances measured between the entities in the higher-dimensional space [25,26].
Supervised techniques use prior knowledge from the beginning and try to find properties that support that knowledge. Here the focus is generally on finding genes that can be used for grouping samples into clinically or biologically relevant classes, rather than on identifying gene functions. This is done by employing classification algorithms, the simplest of which are linear regression and K-nearest neighbor methods. In linear regression, a “least-squares fit” is used to define a threshold line or linear subspace, and a new point is then classified into a group depending on whether it lies below or above this threshold. The K-nearest neighbor technique can be used in both supervised and unsupervised manners; in a supervised fashion it is used to find genes whose patterns best match a designated query pattern.
The query pattern may be an idealized gene pattern for a given condition, i.e. a pattern that is highly expressed in one condition and expressed at a very low level in another. All the measured genes can then be compared to this ideal pattern and ranked by their similarity [27,28]. Although this technique yields genes that might individually split two sets of microarrays, it does not necessarily find the smallest set of genes that most accurately splits the two sets. In other words, a combination of the expression levels of two genes might split two conditions perfectly, yet these two genes might not be the two that are most similar to the idealized pattern.
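Ranking genes against an idealized query pattern can be sketched as follows (gene names, profiles and the query are all hypothetical):

```python
import math

def rank_by_similarity(expression, query):
    """Rank genes by Euclidean distance to an idealized query
    pattern, most similar first.
    """
    return sorted(expression, key=lambda g: math.dist(expression[g], query))

# Idealized pattern: high in condition 1, low in condition 2.
query = [2.0, -2.0]
expression = {
    "geneA": [1.9, -2.1],   # closely matches the query
    "geneB": [0.1, 0.0],    # flat profile
    "geneC": [-2.0, 2.0],   # opposite pattern
}
ranked = rank_by_similarity(expression, query)  # geneA ranks first
```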
Support vector machines (SVMs) address the problem of finding combinations of genes that better split sets of biological samples [
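A minimal sketch of the linear-SVM idea, assuming mean-centered two-gene profiles so no bias term is needed; all names and values are hypothetical, the Pegasos-style stochastic sub-gradient training is just one simple way to fit the hinge loss, and real analyses would use a dedicated SVM library:

```python
import random

def train_linear_svm(samples, labels, lam=0.01, n_epochs=200, seed=0):
    """Minimal linear SVM trained by stochastic sub-gradient
    descent on the regularized hinge loss.

    labels must be +1 or -1; returns the weight vector of the
    separating hyperplane through the origin.
    """
    rng = random.Random(seed)
    dim = len(samples[0])
    w = [0.0] * dim
    t = 0
    idx = list(range(len(samples)))
    for _ in range(n_epochs):
        rng.shuffle(idx)
        for i in idx:
            t += 1
            eta = 1.0 / (lam * t)                     # decaying step size
            margin = labels[i] * sum(wj * xj for wj, xj in zip(w, samples[i]))
            w = [(1 - eta * lam) * wj for wj in w]    # regularization shrinkage
            if margin < 1:                            # point inside the margin
                w = [wj + eta * labels[i] * xj
                     for wj, xj in zip(w, samples[i])]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Hypothetical mean-centered two-gene profiles for two classes:
samples = [[-1.0, -1.1], [-1.2, -0.9], [1.0, 1.1], [1.2, 0.9]]
labels = [-1, -1, 1, 1]
w = train_linear_svm(samples, labels)
```

Here the weight on each gene reflects how much that gene contributes to splitting the two sample classes, which is the "combination of genes" property discussed above.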
All the analytical techniques discussed so far end with a list of genes, which would be meaningless without a biological context. This context may be provided by the Gene Ontology (GO), an expert-curated database that assigns genes to various functional categories. GO is designed as a formal representation of biological knowledge as it relates to genes and gene products [
Pathway analysis is used to map genes onto precompiled pathways, visualizing whole chains of events indicated by the microarray data [
One of the central features of microarray data is its sheer volume. No matter which distance measure is used, one ends up with high-dimensional data that are very difficult to comprehend. To comprehend and visualize the data, it therefore becomes necessary to reduce their dimensionality through PCA or clustering. Once the dimensionality has been reduced, a number of visualization techniques may be used to find patterns in the data, the most popular being heat maps (
Correspondence analysis applies PCA to a chi-square distance matrix, allowing one to assess which groups of genes are most important for defining which experimental conditions (samples), and vice versa. It visualizes two or three principal axes of the gene and sample spaces in the same diagram [
A wide variety of supervised and unsupervised methods is available for analysis; the remaining challenge is translating hypotheses into appropriate bioinformatics techniques. Supervised methods are of most use in domains such as drug discovery and diagnostic testing, where definite answers are needed for specific questions. Unsupervised methods are less intuitive because they start with less direct questions; they may be used, for example, to answer questions about the number and type of expression responses over a period of time after application of a compound. Hierarchical clustering and self-organizing maps survey all the genes and cluster them on the basis of their expression patterns. Relevance networks may be used to search for pairs of genes that are likely to be co-expressed, and true genetic regulatory networks might be found using methods such as Bayesian network construction. Moreover, a combination of supervised and unsupervised methods may be used depending on the answer one seeks from the analysis: e.g. hierarchical clustering may be used to obtain a dendrogram and supervised learning then used to find the best threshold at which to cut sub-trees, or class vectors may be included in the gene expression matrix as an additional dimension and used for clustering. In all of these cases, the analyses are not aimed at providing a definitive answer but are rather used as exploratory tools in the early discovery process.
The obvious truth with which one must agree after experience with the microarray technique is that the rate-limiting step in functional genomics is neither the actual experimental procedure nor the data analysis, but rather data interpretation: determining what the results actually mean. Detailed functional information might not yet be available for genes found to be significant, even though these genes might be well represented in microarray probe sets. An official name, predicted protein domains or a gene-ontology classification might become available in a few days or might take decades. Oligonucleotide sequences thought to be unique at the time a probe was designed against a particular gene might not remain unique as more genomic data are collected. Operationally, this means that one is never done analyzing a set of microarray data; the infrastructure must be developed to constantly reinvestigate genes and gene information from microarray experiments performed in the past.
R.K.G. is thankful to the Council of Scientific and Industrial Research (F. No: 09/015(0346)/2008-EMR-I) and the Department of Biotechnology, Government of India, for providing financial assistance. The authors sincerely acknowledge Bose Institute for providing the infrastructure.