Advances in Anthropology
2011. Vol.1, No.2, 9-14
Copyright © 2011 SciRes. DOI:10.4236/aa.2011.12002
Principal Component Analyses in Anthropological Genetics
Xingdong Chen, Chao Chen, Li Jin
Ministry of Education Key Laboratory of Contemporary Anthropology, Fudan University, Shanghai, China.
Email: lijin.fudan@gmail.com
Received August 17th, 2011; revised October 8th, 2011; accepted October 20th, 2011.
Principal component analysis (PCA) is a statistical method for exploring and making sense of datasets with a
large number of measurements (which can be thought of as dimensions) by reducing the dimensions to the few
principal components (PCs) that explain the main patterns. The first PC is the linear combination of
measurements that accounts for the largest amount of variability in the data. Here, we explain the principle
of PCA and its underlying mathematical algorithm, singular value decomposition (SVD).
PCA can be used in the study of gene expression. It also has a population genetics interpretation and can be used
to identify differences in ancestry among populations and samples, though there are some limitations due to the
dynamics of microevolution and historical processes. With the advent of molecular techniques, PCA on Y
chromosome, mtDNA and nuclear DNA has given more accurate interpretations than PCA on classical markers.
Finally, we list some new extensions and limits of PCA.
Keywords: Principal Component Analysis, Singular Value Decomposition, Human Genetics
Introduction
Principal component analysis (PCA) involves a mathematical
procedure that transforms a number of possibly correlated vari-
ables into a smaller number of uncorrelated variables called pri-
ncipal components. The first principal component accounts for
as much of the variability in the data as possible, and each suc-
ceeding component accounts for as much of the remaining
variability as possible. Depending on the field of application, it
is also named the discrete Karhunen-Loève transform (K.L.T.), the
Hotelling transform or proper orthogonal decomposition (POD).
PCA was invented in 1901 by Karl Pearson (Pearson, 1901).
Now it is mostly used as a tool in exploratory data analysis and
for making predictive models. PCA involves the calculation of
the eigenvalue decomposition of a data covariance matrix or
singular value decomposition of a data matrix, usually after
mean centering the data for each attribute. The results of a PCA
are usually discussed in terms of component scores and loadings.
PCA is mathematically defined as an orthogonal linear
transformation that transforms the data to a new coordinate
system such that the greatest variance by any projection of the
data comes to lie on the first coordinate (called the first prin-
cipal component), the second greatest variance on the second
coordinate, and so on. PCA is the simplest of the true eigen-
vector-based multivariate analyses. Often, its operation can be
thought of as revealing the internal structure of the data in a
way which best explains the variance in the data. If a multivari-
ate dataset is visualised as a set of coordinates in a high-dime-
nsional data space (1 axis per variable), PCA supplies the user
with a lower-dimensional picture, a “shadow” of this object when
viewed from its (in some sense) most informative viewpoint.
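As a minimal sketch of the procedure described above, the following Python/NumPy snippet centers a small synthetic dataset, diagonalizes its covariance matrix and projects the data onto the resulting axes. The data, the random seed and all variable names are illustrative assumptions and are not taken from any of the studies cited in this review.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (assumption): 200 samples, 3 strongly correlated variables.
latent = rng.normal(size=(200, 1))
data = np.hstack([latent, 2.0 * latent, -latent]) + 0.1 * rng.normal(size=(200, 3))

centered = data - data.mean(axis=0)        # mean-center each variable
cov = np.cov(centered, rowvar=False)       # 3 x 3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]          # re-sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = centered @ eigvecs                # component scores, one row per sample
axes = eigvecs                             # columns are the principal axes
```

In this toy case the first column of `scores` carries almost all of the variance, and the columns of `scores` are mutually uncorrelated, which is the defining property stated in the text.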
Luca Cavalli-Sforza and colleagues had the original insight
that PCA could be applied to human genetic variation (Menozzi
et al., 1978), and they eventually analyzed about 100 protein
polymorphisms that had been measured in many human popu-
lations (Cavalli-Sforza et al., 1994). For several decades, PCA
has been used to study human population migrations: detecting
population substructure, correcting for stratification in disease
studies and making qualified inferences about human history.
In recent genome-wide association studies (GWAS), PCA is
used to explicitly model ancestry differences between cases and
controls, because population stratification (allele frequency
differences between cases and controls that arise from systematic
ancestry differences) can cause spurious associations in disease
studies (Price et al., 2006). PCA is also widely used in microarray
expression data analysis to control for surrogate variables, for
example in comparisons between studies, batch-effect correction
and time-course analysis (Alter et al., 2000, 2003; Alter & Golub,
2006; Omberg et al., 2007; Yeung & Ruzzo, 2001).
In this review, Section 1 explains the principal algorithm of
PCA, how it relates mathematically to singular value
decomposition (SVD), and how the two methods differ. Section 2
discusses applications of PCA and SVD in modern genetics, such as
population genetics in anthropology and illustrative gene
expression applications. Finally, Section 3 lists some limits of
PCA and new extensions to it.
Section 1: Principle of PCA and SVD
Principle of PCA
Define a data matrix, $X^{T}$, with zero empirical mean (the
empirical mean of the distribution has been subtracted from the
data set), where each of the n rows represents a different
repetition of the experiment, and each of the m columns gives a
particular kind of datum. The singular value decomposition of X is
$X = W \Sigma V^{T}$, where the m × m matrix W is the matrix of
eigenvectors of $XX^{T}$, the matrix Σ is an m × n rectangular diagonal
matrix with nonnegative real numbers on the diagonal, and the
n × n matrix V is the matrix of eigenvectors of $X^{T}X$. The PCA
transformation that preserves dimensionality (that is, gives the
same number of principal components as original variables) is
then given by:
$$Y^{T} = X^{T} W = V \Sigma^{T}$$
Since W (by definition of the SVD of a real matrix) is an
orthogonal matrix, each row of $Y^{T}$ is simply a rotation of the
corresponding row of $X^{T}$. The first column of $Y^{T}$ is made up of the
“scores” of the cases with respect to the “principal” component,
X. D. CHEN ET AL.
the next column has the scores with respect to the “second prin-
cipal” component, and so on.
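The identity $Y^{T} = X^{T}W = V\Sigma^{T}$ can be checked numerically. The sketch below is only an illustration under stated assumptions: the matrix sizes, the random data and the variable names are hypothetical, and `numpy.linalg.svd` is used as a stand-in for whatever SVD routine one prefers.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 50                      # m variables, n repetitions (illustrative sizes)
Xt = rng.normal(size=(n, m))      # rows = repetitions, columns = variables
Xt -= Xt.mean(axis=0)             # zero empirical mean for every variable
X = Xt.T                          # the m x n matrix X of the text

# Full SVD: W is m x m, Vt is n x n, sigma holds the min(m, n) singular values.
W, sigma, Vt = np.linalg.svd(X, full_matrices=True)
Sigma = np.zeros((m, n))
Sigma[:m, :m] = np.diag(sigma)    # rectangular diagonal matrix

Yt_rotation = Xt @ W              # Y^T = X^T W
Yt_svd = Vt.T @ Sigma.T           # Y^T = V Sigma^T
```

Both routes give the same score matrix, and because W is orthogonal the length of every row is unchanged by the transformation.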
Given a set of points in Euclidean space, the first principal
component (the eigenvector with the largest eigenvalue) corre-
sponds to a line that passes through the mean and minimizes
sum squared error with those points. The second principal com-
ponent corresponds to the same concept after all correlation
with the first principal component has been subtracted out from
the points. Each eigenvalue indicates the portion of the variance
that is associated with its eigenvector. Thus, the sum of all the
eigenvalues of the covariance matrix is equal to the sum of squared
distances of the points from their mean, divided by the number of
points minus one. PCA essentially rotates the set of points around
their mean in order to align with the first few principal
components. This moves as much of the variance as possible (using
a linear transformation) into the first few dimensions. The values
in the remaining dimensions, therefore, tend to be small and may be
dropped with minimal loss of information. PCA is often used in
this manner for dimensionality reduction. PCA has the distinc-
tion of being the optimal linear transformation for keeping the
subspace that has largest variance. This advantage, however,
comes at the price of greater computational requirement if com-
pared, for example, to the discrete cosine transform. Non-linear
dimensionality reduction techniques tend to be more computa-
tionally demanding than PCA.
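The use of PCA for dimensionality reduction can be sketched as follows. The example assumes synthetic five-dimensional points that lie close to a two-dimensional plane; the 99% threshold, the data and the variable names are illustrative choices, not recommendations from the literature cited here.

```python
import numpy as np

rng = np.random.default_rng(2)
n_points = 500
# Synthetic points (assumption): two latent directions plus small noise.
basis = np.array([[1.0, 1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, -1.0, 1.0]])
points = rng.normal(size=(n_points, 2)) @ basis + 0.05 * rng.normal(size=(n_points, 5))

centered = points - points.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]     # largest eigenvalue first

# Sum of eigenvalues = summed squared distance to the mean / (n_points - 1).
total_variance = (centered ** 2).sum() / (n_points - 1)

explained = np.cumsum(eigvals) / eigvals.sum()         # cumulative fraction of variance
k = int(np.searchsorted(explained, 0.99) + 1)          # smallest k reaching 99 %
reduced = centered @ eigvecs[:, :k]                    # n_points x k representation
```

Here two components are enough to pass the threshold, so the remaining three coordinates can be dropped with little loss of information.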
SVD
This section is the most mathematically involved and can be
skipped without much loss of continuity. It is presented solely
for completeness. We derive another algebraic solution for
PCA and in the process, find that PCA is closely related to sin-
gular value decomposition (SVD). In fact, the two are so inti-
mately related that the names are often used interchangeably.
What we will see though is that SVD is a more general method
of understanding fundamental mathematical transformations.
We begin by quickly deriving the decomposition. In the follow-
ing section we interpret the decomposition and in the last sec-
tion we relate these results to PCA.
Let X denote an m × n matrix of real-valued data and rank r,
where without loss of generality m ≥ n, and therefore r ≤ n. In
the case of microarray data, $x_{ij}$ is the expression level of the ith
gene in the jth assay. The elements of the ith row of X form the
n-dimensional vector $g_i$, which we refer to as the
transcriptional response of the ith gene. Alternatively, the elements of
the jth column of X form the m-dimensional vector $a_j$, which we
refer to as the expression profile of the jth assay.
The equation for singular value decomposition of X is the
following:
$$X = U S V^{T} \qquad (1)$$
where U is an m × n matrix, S is an n × n diagonal matrix, and
$V^{T}$ is also an n × n matrix. The columns of U are called the left
singular vectors, $\{u_k\}$, and form an orthonormal basis for the
assay expression profiles, so that $u_i \cdot u_j = 1$ for i = j, and
$u_i \cdot u_j = 0$ otherwise. The rows of $V^{T}$ contain the elements of
the right singular vectors, $\{v_k\}$, and form an orthonormal basis for
the gene transcriptional responses. The elements of S are only
nonzero on the diagonal, and are called the singular values. Thus,
$S = \mathrm{diag}(s_1, \ldots, s_n)$. Furthermore, $s_k > 0$ for
$1 \le k \le r$, and $s_k = 0$ for $(r + 1) \le k \le n$. By convention,
the ordering of the singular vectors is determined by high-to-low
sorting of singular values, with the highest singular value in the
upper left index of the S matrix. Note that for a square, symmetric,
positive semidefinite matrix X, singular value decomposition is
equivalent to diagonalization, or solution of the eigenvalue problem.
One important result of the SVD of X is that:
$$X^{(l)} = \sum_{k=1}^{l} u_k s_k v_k^{T} \qquad (2)$$
is the closest rank-l matrix to X. The term “closest” means that
$X^{(l)}$ minimizes the sum of the squares of the difference of the
elements of X and $X^{(l)}$, $\sum_{ij} |x_{ij} - x^{(l)}_{ij}|^{2}$.
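The rank-l approximation of Equation (2) can be illustrated with a short NumPy sketch. The matrix below is random and purely illustrative, and the helper name `rank_l_approximation` is hypothetical. The squared error of the truncated sum equals the sum of the squared singular values that were discarded.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 5))                        # illustrative m x n data matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # U: 8 x 5, s: 5 values, Vt: 5 x 5

def rank_l_approximation(U, s, Vt, l):
    """Sum of the first l terms u_k s_k v_k^T of Equation (2)."""
    return (U[:, :l] * s[:l]) @ Vt[:l, :]

l = 2
X_l = rank_l_approximation(U, s, Vt, l)
squared_error = ((X - X_l) ** 2).sum()             # sum_ij |x_ij - x^(l)_ij|^2
discarded = (s[l:] ** 2).sum()                     # squared singular values left out
```

Any other rank-2 matrix of the same shape gives a squared error at least as large as `squared_error`.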
One way to calculate the SVD is to first calculate $V^{T}$ and S
by diagonalizing $X^{T}X$:
$$X^{T}X = V S^{2} V^{T} \qquad (3)$$
and then to calculate U as follows:
$$U = X V S^{-1} \qquad (4)$$
where the (r + 1), …, n columns of V for which $s_k = 0$ are
ignored in the matrix multiplication of Equation (4). Choices for
the remaining n − r singular vectors in V or U may be calculated
using the Gram-Schmidt orthogonalization process or some
other extension method. In practice there are several methods
for calculating the SVD that are of higher accuracy and speed.
Section 4 lists some references on the mathematics and compu-
tation of SVD.
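The two-step calculation of Equations (3) and (4) can be written out directly. This is a sketch for illustration only, assuming a small random full-rank matrix; the tolerance `1e-12` and the variable names are arbitrary choices. Library routines such as `numpy.linalg.svd` are generally preferable in practice, since forming $X^{T}X$ explicitly loses numerical accuracy.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 4))                  # m x n with m >= n (illustrative)

# Equation (3): diagonalize X^T X = V S^2 V^T.
eigvals, V = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]            # high-to-low ordering
eigvals, V = eigvals[order], V[:, order]
s = np.sqrt(np.clip(eigvals, 0.0, None))     # clip guards against tiny negative round-off

# Equation (4): U = X V S^{-1}, ignoring columns whose singular value is zero.
keep = s > 1e-12 * s[0]
U = (X @ V[:, keep]) / s[keep]

X_rebuilt = (U * s[keep]) @ V[:, keep].T     # U S V^T
```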
Relation between PCA and SVD
There is a direct relation between PCA and SVD in the case
where principal components are calculated from the covariance
matrix. If one conditions the data matrix X by centering each
column, then $X^{T}X = \sum_i g_i g_i^{T}$ is proportional to the covariance
matrix of the variables of $g_i$ (i.e., the covariance matrix of the
assays). By Equation (3), diagonalization of $X^{T}X$ yields $V^{T}$,
which also yields the principal components of $\{g_i\}$. So, the
right singular vectors $\{v_k\}$ are the same as the principal
components of $\{g_i\}$. The eigenvalues of $X^{T}X$ are equivalent to $s_k^{2}$,
which are proportional to the variances of the principal
components. The matrix US then contains the principal component
scores, which are the coordinates of the genes in the space of
principal components.
If instead each row of X is centered, $XX^{T} = \sum_j a_j a_j^{T}$ is
proportional to the covariance matrix of the variables of $a_j$ (i.e., the
covariance matrix of the genes). In this case, the left singular
vectors $\{u_k\}$ are the same as the principal components of $\{a_j\}$.
The $s_k^{2}$ are again proportional to the variances of the principal
components. The matrix $SV^{T}$ again contains the principal
component scores, which are the coordinates of the assays in the
space of principal components.
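The column-centered case can be verified numerically. The sketch below assumes a synthetic matrix with m = 100 "genes" and n = 4 "assays"; these sizes and all names are illustrative. It compares the SVD of the centered matrix with the eigendecomposition of the assay covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 100, 4                                   # m genes (rows), n assays (columns)
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))
X = X - X.mean(axis=0)                          # center each column

U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = U * s                                  # the matrix US: gene coordinates

cov = X.T @ X / (m - 1)                         # covariance matrix of the assays
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest eigenvalue first
```

The right singular vectors agree with the covariance eigenvectors up to sign, and $s_k^{2}/(m-1)$ equals the variance of the kth column of `scores`.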
Section 2: Application of PCA and SVD
Gene Expression
As we mention in the introduction, gene expression data are
well suited to analysis using SVD/PCA. In this section we pro-
vide examples of SVD-based analysis methods as applied to
gene expression analysis. Before illustrating specific techniques,
we will discuss ways of interpreting the SVD in the context of
gene expression data. This interpretation and the accompanying
nomenclature will serve as a foundation for understanding the
methods described later.
A natural question for a biologist to ask is: “What is the bio-
logical significance of the SVD?” There is, of course, no gen-
eral answer to this question, as it depends on the specific appli-
cation. We can, however, consider classes of experiments and
provide them as a guide for individual cases. For this purpose
we define two broad classes of applications under which most
studies will fall: systems biology applications, and diagnostic
applications (see below). In both cases, the n columns of the
gene expression data matrix X correspond to assays, and the m
rows correspond to the genes. The SVD of X produces two
orthonormal bases, one defined by right singular vectors and
the other by left singular vectors. Referring to the definitions,
the right singular vectors span the space of the gene transcrip-
tional responses {gi} and the left singular vectors span the
space of the assay expression profiles {aj}. Following the con-
vention of Alter et al. (2000), we refer to the left singular vec-
tors {uk} as eigenassays and to the right singular vectors {vk} as
eigengenes.
We sometimes refer to an eigengene or eigenassay generi-
cally as a singular vector, or, by analogy with PCA, as a com-
ponent. In systems biology applications, we generally wish to
understand relations among genes. The signal of interest in this
case is the gene transcriptional response $g_i$. By Equation (1), the
SVD equation for $g_i$ is
$$g_i = \sum_{k=1}^{r} u_{ik} s_k v_k, \quad i = 1, \ldots, m \qquad (7)$$
which is a linear combination of the eigengenes $\{v_k\}$. The ith
row of U, $g'_i$, contains the coordinates of the ith gene in the
coordinate system (basis) of the scaled eigengenes, $s_k v_k$. If r < n,
the transcriptional responses of the genes may be captured with
fewer variables using $g'_i$ rather than $g_i$. This property of the
SVD is sometimes referred to as dimensionality reduction. In
order to reconstruct the original data, however, we still need
access to the eigengenes, which are n-dimensional vectors.
Note that due to the presence of noise in the measurements, r =
n in any real gene expression analysis application, though the
last singular values in S may be very close to zero and thus
irrelevant.
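The expansion of a transcriptional response in the eigengene basis can be sketched with a noise-free synthetic matrix of rank 2, so that r < n holds exactly. The matrix, the chosen gene index and the rank threshold are assumptions made for illustration; real expression data would have r = n, as noted above.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 50, 6                                    # m genes, n assays (illustrative)
# Synthetic expression matrix built from two underlying patterns (assumption).
patterns = rng.normal(size=(2, n))
X = rng.normal(size=(m, 2)) @ patterns

U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = int((s > 1e-10 * s[0]).sum())               # numerical rank

i = 7                                           # an arbitrary gene index
g_i = X[i]                                      # transcriptional response of gene i
g_prime_i = U[i, :r]                            # coordinates in the scaled eigengene basis
g_i_rebuilt = (g_prime_i * s[:r]) @ Vt[:r]      # sum over k of u_ik s_k v_k
```

Each gene is then described by r = 2 numbers instead of n = 6, although the eigengenes in `Vt` are still needed to reconstruct the original response.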
An analysis of micro-array data is a search for genes that
have similar, correlated patterns of expression. This indicates
that some of the data might contain redundant information. For
example, if a group of experiments were more closely related
than we had expected, we could ignore some of the redundant
experiments, or use some average of the information without
loss of information (Khan et al., 2011; Quackenbush, 2001).
Some examples are given of previous applications of SVD to
analysis of gene expression data.
Cell-cycle gene expression data display strikingly simple
patterns when analyzed using SVD. Here we discuss two dif-
ferent studies that, despite having used different pre-processing
methods, have produced similar results (Alter et al., 2000;
Holter et al., 2000). Both studies found cyclic patterns for the
first two eigengenes, and, in two-dimensional correlation scat-
ter plots, previously identified cell cycle genes tended to plot
towards the perimeter of a disc. Alter et al. used information in
SVD correlation scatter plots to show that 641 of the 784
cell-cycle genes identified by Spellman et al. (1998) are associated
with the first two eigengenes. Holter et al. displayed
previously identified cell-cycle gene clusters in scatter plots,
revealing that cell-cycle genes were relatively uniformly dis-
tributed in a ring-like feature around the perimeter, leading
Holter et al. to suggest that cell-cycle gene regulation may be a
more continuous process than had been implied by the previous
application of clustering algorithms (Holter et al., 2000).
Raychaudhuri et al.’s study of yeast sporulation time series
data is an early example of application of PCA to microarray
analysis (Raychaudhuri et al., 2000). In this study, over 90% of
the variance in the data was explained by the first two compo-
nents of the PCA. The first principal component contained a
strong steady-state signal. Projection scatter plots were used in
an attempt to visualize previously identified gene groups, and to
look for structures in the data that would indicate separation of
genes into groups. No clear structures were visible that indi-
cated any separation of genes in scatter plots. Holter et al.’s
more recent SVD analysis of yeast sporulation data made use of
a different pre-processing scheme from that of Raychaudhuri et
al. The crucial difference is that the rows and columns of X in
Holter et al.’s study were iteratively centered and normalized.
In Holter et al.’s analysis, the first two eigengenes were found
to account for over 60% of the variance for yeast sporulation
data. The first two eigengenes were significantly different from
those of Raychaudhuri et al., with no steady-state signal, and,
most notably, structure indicating separation of gene groups
was visible in the data. Below we discuss the discrepancy be-
tween these analyses of yeast sporulation data.
Other Applications in Gene Expression Analysis
Image processing and compression. The property of SVD to
provide the closest rank-l approximation for a matrix X (Equa-
tion (2)) can be used in image processing for compression and
noise reduction, a very common application of SVD. By setting
the small singular values to zero, we can obtain matrix ap-
proximations whose rank equals the number of remaining sin-
gular values (see Equation (2)). Each term ukskvkT is called a
principal image. Very good approximations can often be ob-
tained using only a small number of terms (Richards & Jia, 2006).
SVD is applied in similar ways to signal processing problems.
Immunology. One way to capture global prototypical immune
response patterns is to use PCA on data obtained from
measuring antigen-specific IgM (dominant antibody in primary
immune responses) and IgG (dominant antibody in secondary
immune responses) immunoglobulins using ELISA assays.
Fesel and Coutinho (1998) measured IgM
and IgG responses in Lewis and Fischer rats before and at three
time points after immunization with myelin basic protein (MBP)
in complete Freund’s adjuvant (CFA), which is known to provoke
experimental allergic encephalomyelitis (EAE). They
discovered distinct and mutually independent components of
IgM reaction repertoires, and identified a small number of
strain-specific prototypical regulatory responses.
Molecular dynamics. PCA and SVD analysis methods have
been developed for characterizing protein molecular dynamics
trajectories (Romo et al., 1995). In a study of myoglobin, Romo
et al. used molecular dynamics methods to obtain atomic posi-
tions of all atoms sampled during the course of a simulation.
The higher principal components of the dynamics were found
to correspond to large-scale motions of the protein. Visualiza-
tion of the first three principal components revealed an inter-
esting type of trajectory that was described as resembling beads
on a string, and revealed a visibly sparse sampling of the con-
figuration space.
Small-angle scattering. SVD has been used to detect and
characterize structural intermediates in biomolecular small-
angle scattering experiments (Chen et al., 1996). This study
provides a good illustration of how SVD can be used to extract
biologically meaningful signals from the data. Small-angle sca-
ttering data were obtained from partially unfolded solutions of
lysozyme, each consisting of a different mix of folded, col-
lapsed and unfolded states. The data for each sample was in the
form of intensity values sampled at on the order of 100 differ-
ent scattering angles. UV spectroscopy was used to determine
the relative amounts of folded, collapsed and unfolded lysozy-
me in each sample. SVD was used in combination with the
spectroscopic data to extract a scattering curve for the collapsed
state of the lysozyme, a structural intermediate that was not ob-
served in isolation.
Information Retrieval. SVD became very useful in Informa-
tion Retrieval (IR) to deal with linguistic ambiguity issues. IR
works by producing the documents most associated with a set
of keywords in a query. Keywords, however, necessarily con-
tain much synonymy (several keywords refer to the same con-
cept) and polysemy (the same keyword can refer to several
concepts). For instance, if the query keyword is “feline”,
traditional IR methods will not retrieve documents using the word
“cat”, which is a problem of synonymy. Likewise, if the query keyword
is “java”, documents on the topic of Java as a computer
language, Java as an island in Indonesia, and Java as a coffee bean
will all be retrieved, which is a problem of polysemy. A technique
known as Latent Semantic Indexing (LSI) (Berry et al., 1995)
addresses these problems by calculating the best rank-l
approximation of the keyword-document matrix using its SVD.
This produces a lower dimensional space of singular vectors
that are called eigen-keywords and eigen-documents. Each
eigen-keyword can be associated with several keywords as well
as particular senses of keywords. In the synonymy example
above, “cat” and “feline” would therefore be strongly correlated
with the same eigen-keyword. Similarly, documents using “java”
as a computer language tend to use many of the same keywords,
but not many of the keywords used by documents describing
“java” as coffee or Indonesia. Thus, in the space of singular
vectors, each of these senses of “java” is associated with dis-
tinct eigen-keywords.
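The synonymy effect can be reproduced on a toy keyword-document matrix. Everything in the sketch below is a made-up illustration: the six keywords, the five documents and the choice of l = 2 are assumptions, and real LSI systems use much larger, weighted matrices. In the raw matrix “cat” and “feline” never occur in the same document, but both co-occur with “whiskers”, so they end up aligned in the rank-2 space.

```python
import numpy as np

terms = ["cat", "feline", "whiskers", "java", "compiler", "coffee"]
# Toy keyword-document count matrix (rows = keywords, columns = documents).
A = np.array([
    [1, 0, 0, 0, 0],   # cat
    [0, 1, 0, 0, 0],   # feline
    [1, 1, 0, 0, 0],   # whiskers
    [0, 0, 1, 1, 1],   # java
    [0, 0, 1, 0, 1],   # compiler
    [0, 0, 0, 1, 1],   # coffee
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
l = 2                                         # number of latent dimensions kept
term_coords = U[:, :l] * s[:l]                # keywords in the latent space

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat, feline, java = (term_coords[terms.index(t)] for t in ("cat", "feline", "java"))
raw_similarity = cosine(A[0], A[1])           # keyword rows share no document
latent_similarity = cosine(cat, feline)       # similarity after rank-2 reduction
unrelated_similarity = cosine(cat, java)      # keywords from different topics
```

With only two latent dimensions this toy example separates the two topics but cannot distinguish the different senses of “java”; more dimensions and more documents would be needed for that.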
Population Genetics
Novembre and Stephens pointed out PCA is a tool for ana-
lyzing genetic data. PCA remains useful for genetic analysis in
many contexts that do not require a historical interpretation,
such as in detecting the presence of population structure or in
correcting for stratification in disease studies (Novembre &
Stephens, 2008). On the other hand, if the aim is to study his-
tory and document migrations, it is important to carry out addi-
tional research to correlate the PCA results with other lines of
evidence.
By superimposing the PCs on the geography of the sampled
populations, Cavalli-Sforza and colleagues obtained “synthetic maps”
that showed remarkable gradients of variation across continents
suggestive of historical migrations (Menozzi et al., 1978;
Cavalli-Sforza et al., 1994). For example, the first
European PC map shows a southeast-to-northwest cline that
was interpreted as reflecting the spread of Neolithic farming
from the Levant throughout Europe between 9000 and 6000
years ago. The hypothesis of a demic diffusion of Neolithic
farming has since been supported by additional genetic and
archaeological data (Pinhasi et al., 2005; Semino et al., 2004;
Sokal et al., 1991).
Population Structure and Stratification in Disease
Studies
PCA has a population genetics interpretation and can be used
to partly identify differences in ancestry among populations and
samples. In particular, by assessing whether the proportion of
the variance explained by the first PC is sufficiently large, it is
possible to obtain a formal P value for the presence of popula-
tion substructure and to identify the number of PCs that are
statistically significant (Patterson et al., 2006). PCA is also
useful as a method to address the problem of population strati-
fication—allele frequency differences between cases and con-
trols due to ancestry differences or under selection—that can
cause spurious associations in disease association studies. We
and others have described how one can correct for stratification
in structured populations such as European Americans by ad-
justing genotypes and phenotypes by amounts attributable to
ancestry along the top PCs (Price et al., 2006; Zhu et al., 2008).
Novembre and Stephens (Novembre & Stephens, 2008) em-
phasize that this approach is appropriate regardless of whether
the PCs have arisen as a result of migrations, isolation by dis-
tance or both.
PCA is a tool that has been used to infer population structure
in genetic data for several decades, long before the era of GWA
studies (Novembre & Stephens, 2008; Patterson et al., 2006;
Pearson, 1901; Price et al., 2006). It should be noted that top
principal components do not always reflect population structure:
they may reflect family relatedness (Patterson et al., 2006),
long-range linkage disequilibrium (LD) (due to, for example,
inversion polymorphisms) or assay artifacts (Clayton et al.,
2005). These effects can often be eliminated by removing re-
lated samples, regions of long-range LD or low-quality data,
respectively, from the data used to compute principal compo-
nents. In addition, PCA can highlight effects of differential bias
that require additional quality control (Price et al., 2006).
Using top principal components as covariates corrects for
stratification in GWA studies (Purcell et al., 2007; Zhu et al.,
2008), and this can be done using software such as EIGEN-
STRAT. Like structured association, PCA will appropriately
apply a greater correction to markers with large differences in
allele frequency across ancestral populations. Unlike initial
implementations of structured association, PCA is computa-
tionally tractable in large genome-wide data sets. Related ap-
proaches, such as multidimensional scaling (MDS) and genetic
matching, have also proven useful (Lee et al., 2010; Luca et al.,
2008) and can be carried out using the PLINK software (Purcell
et al., 2007). When genome-wide data are not available (for
example, in replication studies), structured association or PCA
can infer genetic ancestry, and hence correct for stratification,
using ancestry-informative markers (AIMs) (Furey et al., 2000).
A common misconception is that AIMs should be used to infer
genetic ancestry even when genome-wide data are available,
but in fact the best ancestry estimates are obtained using a large
number of random markers, as illustrated by the examples in the
following section.
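The idea of adjusting genotypes and phenotypes along the top PC can be sketched on simulated data. This is not the EIGENSTRAT implementation: the two-population simulation, the normalization by the empirical standard deviation, the use of a single PC and the helper `residualize` are all simplifying assumptions made for illustration. The phenotype depends on ancestry only, so any genotype-phenotype correlation is spurious.

```python
import numpy as np

rng = np.random.default_rng(7)
n_per_pop, n_snps = 200, 500
# Simulated allele frequencies for two populations (assumption, not real data).
p_a = rng.uniform(0.1, 0.9, size=n_snps)
p_b = np.clip(p_a + rng.normal(scale=0.15, size=n_snps), 0.05, 0.95)
geno = np.vstack([rng.binomial(2, p_a, size=(n_per_pop, n_snps)),
                  rng.binomial(2, p_b, size=(n_per_pop, n_snps))]).astype(float)
population = np.repeat([0.0, 1.0], n_per_pop)
# Phenotype depends on ancestry only, not on any SNP.
phenotype = population + rng.normal(scale=0.5, size=2 * n_per_pop)

# Normalize each SNP, then take the top axis of variation across individuals.
centered = geno - geno.mean(axis=0)
normalized = centered / centered.std(axis=0)
U, s, Vt = np.linalg.svd(normalized, full_matrices=False)
pc1 = U[:, 0]                                   # ancestry coordinate per individual

def residualize(y, covariate):
    """Remove the part of y explained by a zero-mean, unit-norm covariate."""
    return y - covariate * (covariate @ y)

snp = int(np.argmax(np.abs(p_a - p_b)))         # most differentiated SNP
naive_r = np.corrcoef(geno[:, snp], phenotype)[0, 1]
adj_r = np.corrcoef(residualize(centered[:, snp], pc1),
                    residualize(phenotype - phenotype.mean(), pc1))[0, 1]
```

In this simulation `pc1` tracks population membership closely, and the correlation between the selected SNP and the phenotype shrinks once both have been adjusted for it.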
Qualified Inferences about Human History
Given the results of Novembre and Stephens (2008), what
confidence should we have in the use of PCA for inferences
regarding human history? To illustrate this, Reich et al.
turned to a dataset of 940 individuals from 53 populations
typed at ~650,000 SNPs as part of the Human Genome Diversity
Project (Li et al., 2008; Reich et al., 2008). They used
EIGENSOFT (Patterson et al., 2006; Price et al., 2006) to find
the principal axes of genetic variation in the seven sub-Saharan
African populations in this dataset and then projected all samples
on the resulting PCs. Another example is a study of the Quebec
population, in which the distribution of Mendelian diseases points
to local founder effects, suggesting stratification of the
contemporary French Canadian gene pool. The authors characterized
the population structure through the analysis of the genetic
contribution of 7798 immigrant founders identified in the
genealogies of 2221 subjects partitioned in eight regions. To
detect population stratification from genealogical data, they
proposed an approach based on principal component analysis (PCA)
of immigrant founders' genetic contributions. Results showed
evidence of a distinct identity of the northeastern and eastern
regions and stratification of the regional populations correlated
with geographical location along the St-Lawrence River. Analysis of
PC-correlated founders illustrates the differential impact of
early versus later founders, consistent with specific regional
genetic patterns. These results highlight the importance of
considering the geographic origin of samples in the design of
genetic epidemiology studies conducted in Quebec (Claude &
Bherer, 2011). A further example comes from a Brazilian group,
who used SNP data from 1129 individuals: 138 from the urban
population of Sao Paulo, Brazil, and 991 from 11 populations
of the HapMap Project. PCA was performed on the SNPs
common to these populations to identify the composition and
the number of SNPs needed to capture their genetic variation.
Both admixture and local ancestry inference were performed
on individuals of the Brazilian sample. Individuals from the
Brazilian sample fell between Europeans, Mexicans and Africans,
and Brazilians were suggested to have the highest internal genetic
variation of the sampled populations. The results indicate that
the Brazilian individuals analyzed descend from Amerindian,
African and/or European ancestors, and that intermarriage
between individuals of different ethnic origin played an important
role in generating the broad genetic variation observed in the
present-day population. These examples highlight how PCA
methods can provide evidence of important migration events.
Interpreting the results to make reliable historical inferences,
however, requires further genetic analysis and integration with
other sources of information from archeology, anthropology,
linguistics and geography.
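The strategy of finding principal axes in a subset of populations and then projecting all samples onto them can be sketched as follows. The random marker matrix, the group sizes and the variable names are illustrative assumptions. This is not the EIGENSOFT procedure, which includes marker normalization and other steps omitted here.

```python
import numpy as np

rng = np.random.default_rng(8)
n_ref, n_other, n_markers = 60, 20, 300
# Simulated marker matrix (assumption): rows = individuals, columns = markers.
reference = rng.normal(size=(n_ref, n_markers))
others = rng.normal(size=(n_other, n_markers))

# Principal axes are estimated from the reference individuals only.
ref_mean = reference.mean(axis=0)
U, s, Vt = np.linalg.svd(reference - ref_mean, full_matrices=False)
axes = Vt[:2].T                                  # n_markers x 2: top two axes

ref_scores = (reference - ref_mean) @ axes       # equals U[:, :2] * s[:2]
other_scores = (others - ref_mean) @ axes        # projection of the remaining samples
```

The projected samples are centered with the reference mean, so their coordinates are expressed in the same system as the reference individuals.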
Section 3: New Extensions and Limits of PCA
This section provides an important context for understanding
when PCA might perform poorly as well as a road map for
understanding new extensions to PCA, as follows:
Linearity
Linearity frames the problem as a change of basis. Several
areas of research have explored how applying a nonlinearity
prior to performing PCA could extend this algorithm—this has
been termed kernel PCA.
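A minimal kernel PCA can be written in a few lines, given a precomputed kernel matrix. The function names `kernel_pca` and `rbf_kernel`, the parameter `gamma` and the random data are hypothetical choices made for this sketch. With a linear kernel the procedure reduces to ordinary PCA, which provides a simple check.

```python
import numpy as np

def kernel_pca(K, n_components):
    """Scores of the training points from a precomputed n x n kernel matrix K."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = J @ K @ J                               # center the data in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    top = np.clip(eigvals[:n_components], 0.0, None)
    return eigvecs[:, :n_components] * np.sqrt(top)

def rbf_kernel(X, gamma):
    """Gaussian kernel exp(-gamma * squared distance) between all pairs of rows."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(9)
X = rng.normal(size=(40, 3))
linear_scores = kernel_pca(X @ X.T, 2)                 # linear kernel: ordinary PCA
rbf_scores = kernel_pca(rbf_kernel(X, gamma=0.5), 2)   # a non-linear variant

# Reference: ordinary PCA scores from the SVD of the centered data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_scores = U[:, :2] * s[:2]
```

The choice of kernel and of its parameters is where prior knowledge enters, which is why the method is no longer parameter-free.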
Mean and variance are sufficient statistics.
The formalism of sufficient statistics captures the notion that
the mean and the variance entirely describe a probability
distribution. The only class of probability distributions that are
fully described by the first two moments are exponential
distributions (e.g., Gaussian, Exponential, etc.). In order for this
assumption to hold, the probability distribution of $x_i$ must
belong to this class. Deviations from this could invalidate
this assumption.
Large variances have important dynamics.
This assumption also encompasses the belief that the data has
a high SNR. Hence, principal components with larger associ-
ated variances represent interesting dynamics, while those with
lower variances represent noise.
The principal components are orthogonal.
This assumption provides an intuitive simplification that
makes PCA soluble with linear algebra decomposition
techniques, namely the eigenvalue decomposition of the
covariance matrix and the SVD described in Section 1.
Limits
Both the strength and the weakness of PCA is that it is a
non-parametric analysis. One only needs to make the assumptions
and then calculate the corresponding answer, and the procedure is
the same for population data and gene expression data. There
are no parameters to tweak and no coefficients to adjust based
on user experience; the answer is unique and independent of the
user.
This same strength can also be viewed as a weakness. If one
knows a-priori some features of the structure of a system, then
it makes sense to incorporate these assumptions into a paramet-
ric algorithm—or an algorithm with selected parameters.
In gene expression studies, with most implementations of PCA it is
difficult to define accurately the precise boundaries of distinct
clusters in the data, or to define the genes (or experiments)
belonging to each cluster. In population genetics, a limitation of
PCA is that it does not model family structure or cryptic relatedness.
These factors may lead to inflation in test statistics if they are
not explicitly modeled, because samples that are correlated are
assumed to be uncorrelated. Association statistics that explicitly
account for family structure or cryptic relatedness are
likely to achieve higher power owing to improved weighting of
the data.
Another weakness is that the assumptions themselves are
sometimes too stringent. One might envision situations
where the principal components need not be orthogonal.
Furthermore, the distributions along each dimension (x_i) need not
be Gaussian. In such cases the largest variances may not correspond
to the meaningful axes, and diagonalizing a covariance matrix might
not produce satisfactory results. The most rigorous form of
removing redundancy is statistical independence,
P(y1, y2) = P(y1)P(y2)
where P(·) denotes the probability density. PCA removes only
second-order correlations, which does not guarantee independence,
so in these situations PCA fails.
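The gap between zero correlation and statistical independence can be seen in a small numerical sketch. It assumes NumPy and a hypothetical pair of variables, x drawn uniformly on [-1, 1] and y = x^2; these choices are purely illustrative. The two variables are uncorrelated, yet the joint probability does not factor into the product of the marginals.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100_000)
y = x**2                      # fully determined by x, yet uncorrelated with it

# Second-order statistic: the sample correlation is close to zero,
# because Cov(x, x^2) = E[x^3] = 0 for a distribution symmetric about zero.
corr_xy = np.corrcoef(x, y)[0, 1]

# Independence would require P(A and B) = P(A) P(B) for every pair of events.
# Check one pair: A = {|x| > 0.5} and B = {y > 0.25}, which describe the same event.
event_a = np.abs(x) > 0.5
event_b = y > 0.25
p_a = event_a.mean()
p_b = event_b.mean()
p_ab = (event_a & event_b).mean()

print("correlation:", corr_xy)
print("P(A and B):", p_ab, " P(A) P(B):", p_a * p_b)
```

Because the covariance matrix of (x, y) is already nearly diagonal here, diagonalizing it changes little, and the strong dependence between the two variables is left untouched.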
However, PCA is still a powerful technique for analysis
when: 1) It is used with another classification technique, such as
k-means clustering or self-organizing maps (SOMs), that requires the
user to specify the number of clusters. A non-linear transformation
can also be applied to the data before PCA; this prior non-linear
transformation is sometimes termed a kernel transformation, and
the entire parametric algorithm is termed kernel PCA. Common
kernel transformations include Fourier and Gaussian
transformations. This procedure is parametric because the user
must incorporate prior knowledge of the structure in the selection
of the kernel, but it is also more optimal in the sense that
the structure is more concisely described. 2) The assumptions of
orthogonality and Gaussianity are dropped and statistical
independence is required instead. This less constrained set of
problems is not trivial and has only recently been
solved adequately via Independent Component Analysis (ICA)
(Hyvärinen & Oja, 2000). ICA decomposes the expression data
into a set of statistically independent modes that we term
“ICA traits”. The statistical independence between modes is
estimated by optimizing a contrast function, such as kurtosis or
mutual information (Biswas et al., 2008). Unlike SVD, ICA
components might differ based on the contrast function and the
number of underlying sources, which under a generative model
are responsible for the variation in the data.
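As a rough illustration of the kernel idea, the following sketch applies a Gaussian (RBF) kernel before the eigendecomposition. It assumes NumPy only; the function name rbf_kernel_pca, the kernel width gamma and the two-ring test data are hypothetical choices made for this example, and the user-selected gamma is exactly the kind of parameter that makes the procedure parametric.

```python
import numpy as np

def rbf_kernel_pca(X, gamma, n_components):
    """Kernel PCA with a Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    # Pairwise squared Euclidean distances between samples (rows of X).
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    K = np.exp(-gamma * d2)

    # Center the kernel matrix, which centers the data in feature space.
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Eigendecomposition of the symmetric centered kernel matrix.
    evals, evecs = np.linalg.eigh(Kc)            # ascending order
    order = np.argsort(evals)[::-1][:n_components]
    evals, evecs = evals[order], evecs[:, order]

    # Scores of the training samples on the leading kernel components.
    scores = evecs * np.sqrt(np.maximum(evals, 0.0))
    return scores, evals

# Illustrative data (an assumption): two noisy concentric rings, a structure
# that no single linear axis can describe.
rng = np.random.default_rng(2)
theta = rng.uniform(0.0, 2.0 * np.pi, size=200)
radius = np.where(np.arange(200) < 100, 1.0, 3.0)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
X += 0.05 * rng.normal(size=X.shape)

scores, evals = rbf_kernel_pca(X, gamma=0.5, n_components=2)
print("leading eigenvalues of the centered kernel matrix:", evals)
```

How well the leading kernel components describe the ring structure depends on the choice of gamma, which must come from prior knowledge of the data. This is the trade-off described above: the structure can be described more concisely, but the answer is no longer independent of the user.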
It is important to note that the application of SVD and PCA to
modern anthropological genetics is relatively recent, and that
methods are currently evolving. Presently, modern genetic
analysis generally consists of iterative, interactive
applications of analysis methods. The detailed path of
any given analysis depends on what specific scientific questions
are being addressed. As new inventions emerge, and further
techniques and insights are obtained from other disciplines, we
mark progress towards the goal of an integrated, theoretically
sound approach to modern genetics.
References
Alter, O., Brown, P. O., & Botstein, D. (2000). Singular value decom-
position for genome-wide expression data processing and modeling.
Proceedings of the National Academy of Sciences, 97, 10101.
doi:10.1073/pnas.97.18.10101
Alter, O., Brown, P. O., & Botstein, D. (2003). Generalized singular
value decomposition for comparative analysis of genome-scale ex-
pression data sets of two different organisms. Proceedings of the Na-
tional Academy of Sciences, 100, 3351.
doi:10.1073/pnas.0530258100
Alter, O., & Golub, G. H. (2006). Singular value decomposition of
genome-scale mRNA lengths distribution reveals asymmetry in RNA
gel electrophoresis band broadening. Proceedings of the National
Academy of Sciences, 103, 11828. doi:10.1073/pnas.0604756103
Berry, M. W., Dumais, S. T., & O’Brien, G. W. (1995). Using linear
algebra for intelligent information retrieval. SIAM Review, 37, 573-
595. doi:10.1137/1037127
Biswas, S., Storey, J., & Akey, J. (2008). Mapping gene expression
quantitative trait loci by singular value decomposition and indepen-
dent component analysis. BMC Bioinformatics, 9, 244.
doi:10.1186/1471-2105-9-244
Cavalli-Sforza, L. L., Menozzi, P., & Piazza, A. (1994). The history
and geography of human genes. Princeton, NJ: Princeton University
Press.
Chen, L., Hodgson, K. O., & Doniach, S. (1996). A lysozyme folding
intermediate revealed by solution X-ray scattering. Journal of Mo-
lecular Biology, 261, 658-671. doi:10.1006/jmbi.1996.0491
Clayton, D. G., Walker, N. M., Smyth, D. J., Pask, R., Cooper, J. D.,
Maier, L. M., Smink, L. J., Lam, A. C., Ovington, N. R., & Stevens,
H. E. (2005). Population structure, differential bias and genomic control
in a large-scale, case-control association study. Nature Genetics,
37, 1243-1246. doi:10.1038/ng1653
Fesel, C., & Coutinho, A. (1998). Dynamics of serum IgM autoreactive
repertoires following immunization: strain specificity, inheritance and
association with autoimmune disease susceptibility. European Journal
of Immunology, 28, 3616-3629.
doi:10.1002/(SICI)1521-4141(199811)28:11<3616::AID-IMMU361
6>3.0.CO;2-B
Furey, T. S., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer,
M., & Haussler, D. (2000). Support vector machine classification and
validation of cancer tissue samples using microarray expression data.
Bioinformatics, 16, 906. doi:10.1093/bioinformatics/16.10.906
Handley, L. J. L., Manica, A., Goudet, J., & Balloux, F. (2007). Going
the distance: Human population genetics in a clinal world. Trends
in Genetics, 23, 432-439.
Holter, N. S., Mitra, M., Maritan, A., Cieplak, M., Banavar, J. R., &
Fedoroff, N. V. (2000). Fundamental patterns underlying gene ex-
pression profiles: Simplicity from complexity. Proceedings of the
National Academy of Sciences, 97, 8409.
doi:10.1073/pnas.150242097
Hyvärinen, A., & Oja, E. (2000). Independent component analysis:
Algorithms and applications. Neural Networks, 13, 411-430.
doi:10.1016/S0893-6080(00)00026-5
Khan, J., Wei, J. S., Ringnér, M., Saal, L. H., Ladanyi, M., Westermann,
F., Berthold, F., Schwab, M., Antonescu, C. R., & Peterson, C.
(2001). Classification and diagnostic prediction of cancers using gene
expression profiling and artificial neural networks. Nature Medicine, 7,
673-679. doi:10.1038/89044
Lee, A. B., Luca, D., Klei, L., Devlin, B., & Roeder, K. (2010). Dis-
covering genetic ancestry using spectral graph theory. Genetic Epi-
demiology, 34, 51-59.
Li, J. Z., Absher, D. M., Tang, H., Southwick, A. M., Casto, A. M.,
Ramachandran, S., Cann, H. M., Barsh, G. S., Feldman, M., & Cavalli-
Sforza, L. L. (2008). Worldwide human relationships inferred from
genome-wide patterns of variation. Science, 319, 1100.
doi:10.1126/science.1153717
Luca, D., Ringquist, S., Klei, L., Lee, A. B., Gieger, C., & Wichmann,
H. (2008). On the use of general control samples for genome-wide
association studies: genetic matching highlights causal variants. The
American Journal of Human Genetics, 82, 453-463.
doi:10.1016/j.ajhg.2007.11.003
Mellars, P. (2006). Going east: New genetic and archaeological perspectives
on the modern human colonization of Eurasia. Science, 313,
796.
Menozzi, P., Piazza, A., & Cavalli-Sforza, L. (1978). Synthetic maps of
human gene frequencies in Europeans. Science, 201, 786.
doi:10.1126/science.356262
Novembre, J., & Stephens, M. (2008). Interpreting principal component
analyses of spatial population genetic variation. Nature Genetics, 40,
646-649. doi:10.1038/ng.139
Omberg, L., Golub, G. H., & Alter, O. (2007). A tensor higher-order
singular value decomposition for integrative analysis of DNA microar-
ray data from different studies. Proceedings of the National Academy
of Sciences, 104, 18371. doi:10.1073/pnas.0709146104
Patterson, N., Price, A. L., & Reich, D. (2006). Population structure and
eigenanalysis. PLoS Genetics, 2, e190.
doi:10.1371/journal.pgen.0020190
Pearson, K. (1901). On lines and planes of closest fit to systems of
points in space. Philosophical Magazine Series 6, 2, 559-572.
doi:10.1080/14786440109462720
Pinhasi, R., Fort, J., & Ammerman, A. J. (2005). Tracing the origin and
spread of agriculture in Europe. PLoS Biology, 3, e410.
doi:10.1371/journal.pbio.0030410
Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick,
N. A., & Reich, D. (2006). Principal components analysis corrects
for stratification in genome-wide association studies. Nature Genetics,
38, 904-909.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A.,
Bender, D., Maller, J., Sklar, P., de Bakker, P. I., Daly, M. J., &
Sham, P. C. (2007). PLINK: A tool set for whole-genome association
and population-based linkage analyses. American Journal of Human
Genetics, 81, 559-575. doi:10.1086/519795
Quackenbush, J. (2001). Computational analysis of microarray data.
Nature Reviews Genetics, 2, 418-427. doi:10.1038/35076576
Raychaudhuri, S., Stuart, J. M., & Altman, R. B. (2000). Principal
components analysis to summarize microarray experiments: Application
to sporulation time series. Pacific Symposium on Biocomputing, 455-466.
Reich, D., Price, A. L., & Patterson, N. (2008). Principal component
analysis of genetic data. Nature Genetics, 40, 491-491.
doi:10.1038/ng0508-491
Richards, J. A., & Jia, X. (2006). Remote sensing digital image analysis:
An introduction. Berlin: Springer Verlag.
Romo, T. D., Clarage, J. B., Sorensen, D. C., & Phillips Jr, G. N.
(1995). Automatic identification of discrete substates in proteins:
Singular value decomposition analysis of time-averaged
crystallographic refinements. Proteins: Structure, Function, and
Bioinformatics, 22, 311-321. doi:10.1002/prot.340220403
Semino, O., Magri, C., Benuzzi, G., Lin, A. A., Al-Zahery, N.,
Battaglia, V., Maccioni, L., Triantaphyllidis, C., Shen, P., & Oefner,
P. J. (2004). Origin, diffusion, and differentiation of Y-chromosome
haplogroups E and J: Inferences on the neolithization of Europe and
later migratory events in the Mediterranean area. The American Jour-
nal of Human Genetics, 74, 1023-1034. doi:10.1086/386295
Sokal, R. R., Oden, N. L., & Wilson, C. (1991). Genetic evidence for
the spread of agriculture in Europe by demic diffusion. Nature, 351,
143-145. doi:10.1038/351143a0
Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K.,
Eisen, M. B., Brown, P. O., Botstein, D., & Futcher, B. (1998).
Comprehensive identification of cell cycle-regulated genes of the
yeast Saccharomyces cerevisiae by microarray hybridization. Mo-
lecular Biology of the Cell, 9, 3273.
Yeung, K. Y., & Ruzzo, W. L. (2001). Principal component analysis for
clustering gene expression data. Bioinformatics, 17, 763.
doi:10.1093/bioinformatics/17.9.763
Zhu, X., Li, S., Cooper, R. S., & Elston, R. C. (2008). A unified asso-
ciation analysis approach for family and unrelated samples correcting
for stratification. The American Journal of Human Genetics, 82,
352-365. doi:10.1016/j.ajhg.2007.10.009