Biology is a challenging and complicated mess. Understanding this challenging complexity is the realm of the biological sciences: trying to make sense of the massive, messy data by discovering patterns and revealing the underlying general rules. Among the most powerful mathematical tools for organizing and helping to structure complex, heterogeneous and noisy data are those provided by multivariate statistical analysis (MSA) approaches. These eigenvector/eigenvalue data-compression approaches were first introduced to electron microscopy (EM) in 1980 to help sort out different views of macromolecules in a micrograph. After 35 years of continuous use and development, new MSA applications are still being proposed regularly. The speed of computing has increased dramatically in the decades since their first use in electron microscopy. However, we have also seen a possibly even more rapid increase in the size and complexity of the EM data sets to be studied. MSA computations had thus become a very serious bottleneck limiting their general use. The parallelization of our programs, speeding up the process by orders of magnitude, has opened whole new avenues of research. The speed of the automatic classification in the compressed eigenvector space had also become a bottleneck which needed to be removed. In this paper we explain the basic principles of multivariate statistical eigenvector-eigenvalue data compression; we provide practical tips and application examples for those working in structural biology, and we provide the more experienced researcher in this and other fields with the formulas associated with these powerful MSA approaches.

The electron microscope (EM) instrument, initially developed by Ernst Ruska in the early nineteen thirties [

Around 1970, in a number of ground-breaking publications, the idea was introduced by the group of Aaron Klug in Cambridge of using images of highly symmetric protein assemblies such as helical assemblies or icosahedral viral capsids [

For other types of complexes, no methods were available for investigating their 3D structures. The vast majority of the publications of those days were thus based on the visual recognition of specific molecular views and their interpretation in terms of the three-dimensional structure of the macromolecules. For example, a large body of literature existed on the 3D structure of the ribosome based on antibody labelling experiments [

Against this background, one of us (MvH), then at the University of Groningen, and Joachim Frank, in Albany NY, started a joint project in 1979 to allow for objectively recognizing specific molecular views in electron micrographs of a negative stain preparation (see Appendix). This would allow one to average similar images prior to further processing and interpretation. Averaging is necessary in single-particle processing in order to improve the very poor signal-to-noise ratios (SNR) of direct, raw electron images. Averaging similar images from a mixed population of images, however, only makes sense if that averaging is based on a coherent strategy for deciding which images are sufficiently similar. We need good similarity measures between images, such as correlation values or distance criteria, for the purpose. Following a suggestion by Jean-Pierre Bretaudière (see Appendix), the first application of multivariate statistical analysis (MSA) emerged in the form of correspondence analysis [

The above MSA eigenvector-eigenvalue data compression was still incomplete without the introduction of automatic “unsupervised” classification in the resulting compressed eigenvector space [

A most fundamental specimen-preparation development strongly influenced the field of single-particle electron microscopy: In the early nineteen eighties the group around Jacques Dubochet at the EMBL in Heidelberg brought the “vitreous-ice” embedding technique for biological macromolecules to maturity [

The instrumental developments in cryo-EM over the last decades have been substantial. The new generation of electron microscopes, optimized for cryo-EM, are computer controlled and equipped with advanced digital cameras (see, for example: [

These newest developments facilitate the automatic collection of large datasets such as are required to bring subtle structural/functional differences existing within the sample to statistical significance. The introduction of movie-mode data collection by itself already increased the size of the acquired data set by an order of magnitude. Cryo-EM datasets already often exceed 5000 movies of 10 frames each, or a total of 50,000 frames of 4096 × 4096 pixels each. This corresponds to 1.7 TB when acquired at 16 bits/pixel (or 3.4 TB at 32 bits/pixel). From this data one would be able to extract ~1,000,000 molecular images of, say, 300 × 300 pixels each. These 1,000,000 molecular images represent 100,000 molecular movies of 10 frames each. The MSA approaches are excellent tools to reveal the information hidden in our huge, multi-terabyte datasets.
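The data volumes quoted above follow from simple arithmetic. A minimal sketch, assuming decimal terabytes (10^12 bytes) and the example numbers given in the text; all variable names are illustrative:

```python
# Back-of-the-envelope data volume for the example numbers in the text.
movies = 5_000
frames_per_movie = 10
frame_side = 4096            # pixels along one edge of a frame
bytes_per_pixel = 2          # 16 bits/pixel

frames = movies * frames_per_movie
raw_bytes = frames * frame_side**2 * bytes_per_pixel
raw_terabytes = raw_bytes / 1e12   # decimal terabytes (assumption)

# Extracted molecular images: 1,000,000 images of 300 x 300 pixels.
particles = 1_000_000
particle_side = 300
particle_bytes = particles * particle_side**2 * bytes_per_pixel
particle_gigabytes = particle_bytes / 1e9

print(f"{frames} frames, {raw_terabytes:.2f} TB of raw movie data")
print(f"{particle_gigabytes:.0f} GB of extracted molecular images")
```

Doubling `bytes_per_pixel` to 4 gives the 32 bits/pixel figure.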

First, one image is simply a “measurement” that can be seen as a collection of numbers, each number representing the density in one pixel of the image. For example, let us assume the images we are interested in are of the size 300 × 300 pixels. This image thus consists of 300 × 300 = 90,000 density values, starting at the top left with the density of pixel (1, 1) and ending at the lower right with the density of pixel (300, 300). The fact that an image is intrinsically two-dimensional is not relevant for what follows. What is relevant is that we are trying to make sense out of a large number of comparable measurements, say 200,000 images, all of the same size, with pixels arranged in the same order. Each of these measurements can be represented formally by a vector of numbers F(a), where a is an index running over all pixels that the image contains (90,000 in our case). A vector of numbers can be seen as a line from the origin, ending at one specific point of a multi-dimensional space known as hyperspace.

There is no “magic” in hyperspace; it is merely a convenient way to represent a measurement. A set of 200,000 images translates to a set of 200,000 points in hyperspace (with 90,000 dimensions, in our case). When trying to make sense of these 200,000 different molecular images, we like to think in terms of the distances between these points. The collection of our 200,000 points in hyperspace is called a data cloud. Similar images are close together in the data cloud and are separated by only a small distance, or, equivalently, have a high degree of correlation as will be discussed in the next chapter. This abstract representation of sets of measurements is applicable to any form of multidimensional data including one-dimensional (1D) spectra, two-dimensional (2D) images, or even three-dimensional (3D) volumes, since all of them will be represented simply by a vector of numbers F(a).
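The hyperspace picture can be made concrete with a few lines of NumPy. This is a minimal sketch on tiny random "images" (5 images of 4 × 4 pixels, an arbitrary choice standing in for the 200,000 images of 300 × 300 pixels); the array names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, side = 5, 4
images = rng.normal(size=(n_images, side, side))  # stack of 2D images

# Each image becomes one row: a point in a (side*side)-dimensional hyperspace.
X = images.reshape(n_images, side * side)

# Squared distances between all pairs of points in the data cloud,
# averaged over the pixels (the 1/p normalisation).
diff = X[:, None, :] - X[None, :, :]
D2 = (diff**2).mean(axis=2)

print(X.shape)        # (5, 16)
print(np.round(D2, 2))
```

The same reshaping applies unchanged to 1D spectra or 3D volumes, since only the flattened vector enters the analysis.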

The entire raw data set, in our example, consists of 200,000 images of each 90,000 pixels or a total of 18 × 10^{9} pixel density measurements. In the hyperspace representation, this translates to a cloud of 200,000 points in a 90,000 dimensional hyperspace, again consisting of 18 × 10^{9} co-ordinates. Each co-ordinate corresponds to one of the original pixel densities, thus the hyperspace representation does not change the data in any way. This type of representation is illustrated in

The basic idea of the MSA approach is to optimize the orthogonal co-ordinate system of the hyperspace to best fit the shape of the cloud. We wish to rotate (and possibly shift) the co-ordinate system, such that the first axis of the rotated co-ordinate system will correspond to the direction of the largest elongation of the data cloud. In the simplistic (two-pixel images) example of

Concentrating on the main direction of variations within the data, in the example of

Multivariate Statistical Analysis is all about comparing large sets of measurements and the first question to be resolved is how to compare them. What measure of similarity would one want to use? The concepts of distances and correlations between measurements are closely related, as we will see below. Different distance and associated correlation criteria are possible depending on the metric one chooses to work with. We will start with the simplest and most widely used metric: the classical Euclidean metric.

The classical measure of similarity between two measurements F(a) and G(a) is the correlation or inner product (also known as the covariance) between the two measurement vectors:

The summation in this correlation between the two vectors is over all possible values of a, in our case, the p pixels of each of the two images being compared. (This summation will be implicit in all further formulas; the normalization by 1/p may also be implicit.) Note that when F and G are the same (F = G), this formula yields the variance of the measurement, that is, the average of the squares of the measurement:

Closely related to the correlation is the Euclidean squared distance between the two measurements F and G:

The relation between the correlation and the Euclidean distance between F and G becomes clear when we work out Equation (2):
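A sketch of this expansion, reconstructed here from the definitions in the surrounding text. It assumes the 1/p normalisation mentioned above, and the symbol D^{2}_{FG} for the squared distance is a name chosen for this illustration:

```latex
\begin{aligned}
C_{FG} &= \frac{1}{p}\sum_{a=1}^{p} F(a)\,G(a), \qquad
F_{VAR} = C_{FF} = \frac{1}{p}\sum_{a=1}^{p} F^{2}(a), \\[4pt]
D^{2}_{FG} &= \frac{1}{p}\sum_{a=1}^{p}\bigl(F(a)-G(a)\bigr)^{2} \\
&= \frac{1}{p}\sum_{a=1}^{p} F^{2}(a)
 + \frac{1}{p}\sum_{a=1}^{p} G^{2}(a)
 - \frac{2}{p}\sum_{a=1}^{p} F(a)\,G(a) \\
&= F_{VAR} + G_{VAR} - 2\,C_{FG}.
\end{aligned}
```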

In other words, the Euclidean squared distance between the two measurements F and G is a constant (the sum of the total variances in F and in G, F_{VAR} + G_{VAR}) minus twice C_{FG}, the correlation between F and G. Thus correlations and Euclidean distances are directly related in a simple way: the shorter the distance between the two, the higher the correlation between F and G; when their distance is zero, their correlation is at its maximum. This metric is the most used metric in the context of multivariate statistics; it is the metric associated with Principal Components Analysis (PCA, see below). Although this is a good metric for signal processing in general, there are some disadvantages associated with the use of pure Euclidean metrics.
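The relation between distance and correlation is easy to check numerically. A minimal sketch, assuming the 1/p-normalised definitions described in the text; the helper names `correlation` and `squared_distance` are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 90_000                            # number of pixels per image
F = rng.normal(size=p)
G = 0.5 * F + rng.normal(size=p)      # a measurement partly similar to F

def correlation(f, g):
    """Inner product of two measurements, normalised by 1/p."""
    return np.mean(f * g)

def squared_distance(f, g):
    """Euclidean squared distance, normalised by 1/p."""
    return np.mean((f - g) ** 2)

lhs = squared_distance(F, G)
rhs = correlation(F, F) + correlation(G, G) - 2 * correlation(F, G)
print(lhs, rhs)   # the two values agree to rounding error
```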

One disadvantage of Euclidean distances and correlations is their sensitivity to multiplication by a constant. For example, suppose the two measurements F and G have approximately the same variance and one then multiplies one of the measurements, say F(a), by the constant value “10”. Multiplying a measurement by such a constant does not change its information content. However, the correlation value C_{FG} between the measurements F and G (Equation (1)) will increase by a factor of 10. The Euclidean squared distance will, after this multiplication, be totally dominated by the F_{VAR} term, which will then be one hundred times larger than the corresponding G_{VAR} term in Equation (4).

A further problem with the Euclidean metric (and with all other metrics discussed here) is the distorting influence that additive constants can have. Add a large constant to the measurements F and G, and their correlation (Equation (1)) and Euclidean distance (Equations (2)-(4)) will be fully dominated by these constants, leaving just very small modulations associated with the real information content of each of F and G. A standard solution to these problems in statistics is to correlate the measurements only after subtracting the average and normalising them by the standard deviation of each measurement. The correlation between F and G thus becomes:

This normalisation of the data is equivalent to replacing the raw measurements F(a) and G(a) by their normalised versions

These substitutions thus bring the Euclidean correlations and distances (Equations (5) and (6)) into exactly the same form as the original ones (Equations (1) and (2)).
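The effect of this normalisation can be illustrated as follows. The sketch assumes the normalised version of a measurement is (F(a) − F_{AV})/F_{SD}, as described in the text; the function names and the constants 10 and 1000 are arbitrary choices for the illustration:

```python
import numpy as np

def normalise(f):
    """Subtract the average and divide by the standard deviation."""
    return (f - f.mean()) / f.std()

def correlation(f, g):
    """Inner product of two measurements, normalised by 1/p."""
    return np.mean(f * g)

rng = np.random.default_rng(2)
F = rng.normal(size=10_000)
G = F + 0.5 * rng.normal(size=10_000)

# Raw correlation is dominated by a multiplicative factor and an offset ...
raw = correlation(F, G)
raw_distorted = correlation(10 * F + 1000, G + 1000)

# ... whereas the correlation of the normalised measurements is unaffected.
norm = correlation(normalise(F), normalise(G))
norm_distorted = correlation(normalise(10 * F + 1000), normalise(G + 1000))

print(raw, raw_distorted)
print(norm, norm_distorted)
```

With this normalisation the correlation coincides with the usual Pearson correlation coefficient, which is bounded between −1 and 1.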

Interestingly, it is a standard procedure to “pretreat” EM images prior to any processing and during the various stages of the data processing and this routine is, in fact, a generalisation of this standard normalisation in statistics discussed above. During the normalisation of the molecular images [

The high-pass filtering is often combined with a low-pass filter to remove some noise in the high-frequency range, again, trying to reduce structure-unrelated noise. These high spatial frequencies, however, although very noisy, also contain the finest details one hopes to retrieve from the data. For a first 3D structure determination, it may be appropriate to suppress the high frequencies. During the 3D structural refinements, the original high frequency information in the data may later be reintroduced. The overall filtering operation is known as “band-pass” filtering.

The high-pass filtering indeed removes the very-low frequencies in the data but that also has the effect of setting the average density in the images to a zero value. In cryo-EM the predominant contribution to the image generated in the transmission electron microscope (“TEM”) is phase contrast. In phase contrast, the very low frequency information in the images is fundamentally not transmitted by the phase-contrast imaging device, because that area of the back focal plane of the imaging device is associated with the image of the illuminating source or “zero order beam” [

As was mentioned above, the first applications of MSA techniques in electron microscopy [χ^{2}-metric). Chi-square distances are good for the analysis of histogram data, which are by definition positive. The χ^{2} correlation and distance are, respectively:

and,

With the χ^{2}-metrics, the measurements F and G are normalized by the average of the measurements:

Substituting the normalized measurements (11) and (12) into formulas (9) and (10) brings us again back to the standard forms (1) and (2) for the correlation and distance.

Why this normalisation by the average? Suppose that of the 15 million inhabitants of Beijing, 9 million own a bicycle and 6 million do not. Suppose also that we would like to compare these numbers with the numbers of cyclists and non-cyclists in Cambridge, a small university city with only 150,000 inhabitants. If the corresponding numbers for Cambridge are 90,000 bicycle owners versus 60,000 non-owners, then the χ^{2} distance (10) between these two measurements is zero, in spite of the 100-fold difference in population size between the two cities. The χ^{2} metric is thus well suited for studying histogram-type information.
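The bicycle example can be worked through numerically. The sketch below assumes, as the text states, that normalising each measurement by its average reduces the χ^{2} distance to the standard Euclidean form; it is not claimed to reproduce every weighting of a full correspondence analysis. Function names are illustrative:

```python
import numpy as np

def normalise_by_average(f):
    """Divide a (positive, histogram-like) measurement by its average."""
    f = np.asarray(f, dtype=float)
    return f / f.mean()

def squared_distance(f, g):
    """Euclidean squared distance, normalised by 1/p."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    return np.mean((f - g) ** 2)

# (bicycle owners, non-owners)
beijing = [9_000_000, 6_000_000]
cambridge = [90_000, 60_000]

euclidean = squared_distance(beijing, cambridge)
chi2_like = squared_distance(normalise_by_average(beijing),
                             normalise_by_average(cambridge))

print(euclidean)   # huge: dominated by the population sizes
print(chi2_like)   # zero up to rounding: identical proportions
```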

Interestingly, with the χ^{2}-distance, the idea of subtracting the average from the measurements is already “built in”, and leads to exactly the same distance (10), as can be easily verified:

This illustrates that the χ^{2}-metrics are oriented towards an analysis of positive histogram data, that is, towards data where the roles of the standard deviation F_{SD} and of the average F_{AV} in Equation (7) are both covered by the average of the measurement F_{AV}. Although this leads to nice properties in the representation of the data (see below), problems arise when the measurements F(a) are not histogram data but rather a non-positive signal. The normalisation by the average F_{AV} rather than by the standard deviation F_{SD} may then lead to an explosive behaviour of

As was discussed above, a disadvantage of the Euclidean distance and of the associated correlation is the sensitivity to multiplication of one of the data vectors by a constant. The χ^{2}-metric does not have this sensitivity to a multiplicative factor, thanks to its normalisation of the measurements by their average. In contrast to histogram data, however, in signal and image processing the measurements need not be positive. Signals often have (or are normalized to) zero average density as discussed in the paragraph above. The χ^{2}-metric, when applied to signal-processing measurements, is associated with a fundamental problem. This problem in the application of correspondence analysis per se to electron microscopical data was realised soon after its introduction in electron microscopy, and a new “modulation”-oriented metric was introduced to circumvent the problem [

and,

With

and,

The MSA variant with modulation distances we call “modulation analysis”, and this MSA technique shares the generally favourable properties of CA, yet, as is the case in PCA, allows for the processing of zero-average-density signals.

We have thus far considered an image (or rather any measurement) as a vector named F(a) or as a vector named G(a). We want to study large data sets of n different images (say: 200,000 images), each containing p pixels (say 90,000 pixels). For the description of such data sets we use the much more compact matrix notation. For completeness, we will repeat some basic matrix formulation, allowing the first-time reader to remain within the notation used here.

In matrix notation we describe the whole data set by a single symbol, say “X”. X stands for a rectangular array of values containing all n × p density values of the data set (say: 200,000 × 90,000 measured densities). The matrix X thus contains n rows, one for each measured image, and each row contains the p pixel densities of that image:

This notation is much more compact than the one used above because we can, for example, multiply the matrix X with a vector G (say an image with p pixels) to yield a correlation vector C as in:

Each element C_{i} of the result of this multiplication has the form:

This sum is identical to the correlation or inner product calculation presented above for the case of the Euclidean metrics (Equation (1)). Explained in words: one here multiplies each element from row i of the data matrix X with the corresponding element of vector G and sums these products to yield element i of a vector C, the n elements of which are the correlations (or inner products or “projections”) of all the images in the data set X with the vector G.
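In code, this matrix-vector product is a one-liner. The following sketch compares it with the explicit double loop over images and pixels; the sizes are small arbitrary stand-ins for n = 200,000 and p = 90,000, and the 1/p normalisation is left implicit as in the text:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 16
X = rng.normal(size=(n, p))   # n images as rows, p pixels each
G = rng.normal(size=p)        # one further image with p pixels

# Matrix notation: all n inner products at once.
C = X @ G

# The same thing written out element by element.
C_loop = np.array([sum(X[i, j] * G[j] for j in range(p)) for i in range(n)])

print(np.allclose(C, C_loop))
```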

An important concept in this matrix formulation is that of the “transposed” data matrix X denoted as X^{T}:

In X^{T}, the transpose of X, the columns have become the images, and the rows have become what first were the columns in X. Similar to the multiplication of the matrix X with the vector G, discussed above, we can calculate the product between matrices, provided their dimensions match. We can multiply X with X^{T} because the rows of X have the same length p as the columns of X^{T}:

This matrix multiplication is like the earlier one (18) of the (n × p) matrix X with a single vector G (of length p) yielding a vector C of length n. Since X^{T} is itself a (p × n) matrix, the inner-product operation is here applied to each column of X^{T} separately, and the result thus is an (n × n) matrix A_{n}. Note that each element of the matrix A is the inner product or co-variance between two images (measurements) of the data set X. The n diagonal elements of A_{n} contain the variance of each of the measurements. (The variance of a measurement is the co-variance of an image with itself.) The sum of these n diagonal elements is the total variance of the data set, that is, the sum of the variances of all images together, and it is known as the trace of A. The matrix A is famous in multivariate statistics and is called the “variance co-variance matrix”. Note that we also have in the conjugate representation (see below):

Note that the matrix A_{n} (Equation (21a)) is square and that its elements are symmetric around the diagonal: therefore its transposed is identical to itself (

In the introduction of the MSA concepts above, we discussed that we aim at rotating the Cartesian co-ordinate system such that the new, rotated co-ordinate system optimally follows the directions of the largest elongation of the data cloud. Like the original orthogonal co-ordinate system of the image space, the rotated one can be seen as a collection of, say, q orthogonal unit vectors (vectors with a normalized length of 1), which form the columns of matrix U:

To find the co-ordinates of the images with respect to the new rotated co-ordinate system we simply need to project the p-dimensional image vectors, stored as the rows in the data matrix X, onto the unit vectors stored as the columns of the matrix U:

Each row of the n × q matrix C_{img} again represents one full input image but now in the rotated co-ordinate system U. We happen to have chosen our example of n = 200,000 and p = 90,000 such that n > p. That means that for the rotated co-ordinate system U we can have a maximum of p columns spanning the p-dimensional space (in other words: q ≤ p). The size of C_{img} will thus be restricted to a maximum size of n × q. We may choose to use values of q smaller than p, but when q = p, the (rotated) co-ordinate matrix C_{img} will contain all the information contained in the original data matrix X.
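A small numerical sketch of this projection follows. The rotated co-ordinate system is here simply an arbitrary orthonormal basis obtained from a QR decomposition of a random matrix (an assumption made only for illustration; it is not the eigenvector basis derived later):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 8
X = rng.normal(size=(n, p))          # n images, p pixels

# An arbitrary rotated orthonormal co-ordinate system (columns of U).
U, _ = np.linalg.qr(rng.normal(size=(p, p)))

C_img = X @ U            # co-ordinates in the rotated system, q = p
X_back = C_img @ U.T     # rotating back recovers the data exactly

C_small = X @ U[:, :3]   # q = 3: only the first three co-ordinates

print(np.allclose(X_back, X))   # no information lost when q = p
print(C_small.shape)
```

With q = p the rotation also leaves all distances between images unchanged; with q < p some information is discarded.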

In an orthonormal co-ordinate system the inner products between the unit vectors spanning the space are always zero, apart from the inner product between a unit vector and itself, which is normalised to the value 1. This orthonormality can be expressed in matrix notation as

We have already seen the matrix U above (Equation (22)); the matrix U^{T} is simply the transpose of U. The matrix I_{q} is a diagonal matrix, meaning that it only contains non-zero elements along the diagonal from the top-left to the lower right of this square matrix. All these diagonal elements have a unity value, implying that the columns of U have been normalized.

The normal definition of the inverse of a variable is that the inverse times the variable itself yields a unity result. In matrix notation, for our unit vector matrix U, this becomes:

The unit matrix I_{q} is again a diagonal matrix: its non-zero elements are all 1 and are all along the diagonal from the top-left to the lower right of this square matrix. The “left-inverse” of the matrix U is identical to its transposed version U^{T} (see above). We will thus use these as being identical below. We will use for example: (U^{T})^{−1} = U.
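The term “left-inverse” matters when q < p, as the following sketch illustrates. The orthonormal columns are again generated by a QR decomposition of a random matrix, which is an assumption made for illustration only:

```python
import numpy as np

rng = np.random.default_rng(5)
p, q = 8, 3
# Reduced QR gives a p x q matrix with orthonormal columns.
U, _ = np.linalg.qr(rng.normal(size=(p, q)))

left = U.T @ U    # q x q: the unit matrix I_q
right = U @ U.T   # p x p: a projector onto the q-dimensional subspace

print(np.allclose(left, np.eye(q)))    # True
print(np.allclose(right, np.eye(p)))   # False when q < p
```

Only when q = p do both products equal the unit matrix, in which case U^{T} is the inverse of U in the ordinary sense.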

It may already have been assumed implicitly above, but let us emphasise one aspect of the matrix representation explicitly. Each row of the X matrix represents a full image, with all its pixel-values written in one long line. To fix our minds, we introduced a data matrix with 200,000 images (n = 200,000) each containing 90,000 pixel densities (p = 90,000). This data set can thus be seen as a data cloud of n points in a p-dimensional “image space”. An alternative hyperspace representation is equally valid, namely, that of a cloud of p points in an n-dimensional hyperspace. The co-ordinates in this conjugate n-dimensional space are given by the columns of matrix X rather than its rows.

The columns of matrix X correspond to specific pixel densities throughout the stack of images. Such column-vectors can therefore be called “pixel-vectors”. The first column of matrix X thus corresponds to the top-left pixel density throughout the whole stack of n input images. Associated with the matrix X are two hyper-spaces in which the full data set can be represented: 1) the image-space in which each of the n images is represented as a point. This space has as many dimensions as there are pixels in the image; image-space is thus p-dimensional. The set of n points in this space is called the image-cloud; 2) the pixel-space in which every one of the p pixel-vectors is represented as a single point: pixel-space is n-dimensional. The set of p points in this space is called the pixel-vector-cloud or, for short, the pixel-cloud. (This application-specific nomenclature will obviously change depending on the type of measurements we are processing.)

Note that the pixel vectors are also the rows of the transposed data matrix X^{T}. We could have chosen those vectors as our basic measurements entering the analysis without changing anything. As we will see below, the analyses in both conjugate spaces are fully equivalent and they can be transformed into each other through “transition formulas”. There is no more information in one space than in the other! In the example we chose n = 200,000 and p = 90,000. The fact that n is larger than p means that the intrinsic dimensionality of the data matrix X here is “p”. Had we had fewer images n than pixels p in each image, the intrinsic dimensionality or the rank of X would have been limited to n. The rank of the matrix is the maximum number of possible independent (non-zero) unit vectors needed to span either pixel-vector space or image space.
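The limit on the rank can be checked on small random matrices. A minimal sketch; the sizes are arbitrary and random data are used so that the maximum rank is actually reached:

```python
import numpy as np

rng = np.random.default_rng(6)
tall = rng.normal(size=(20, 5))   # more images than pixels (n > p)
wide = rng.normal(size=(5, 20))   # fewer images than pixels (n < p)

rank_tall = np.linalg.matrix_rank(tall)
rank_wide = np.linalg.matrix_rank(wide)

print(rank_tall, rank_wide)   # both are limited by the smaller dimension
```

For real data the rank can be lower than min(n, p), for example when images are exact linear combinations of each other.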

The mathematics of the PCA eigenvector eigenvalue procedures have been described in various places (for example in [

Let direction vector “u” be the vector we are after (

maximize the variance it describes of the full data cloud, we need to maximize

of all images to the vector u, making this a standard least-square minimization problem. We have seen above (Equation (18)) how to calculate the inner product of the full data matrix X and a unit vector u.

The sum

Let u_{1} be the unit vector that maximises this variance and let us call that maximised variance λ_{1}. (We will see below how this maximum is actually calculated). We then have for this variance maximizing vector:

Since u_{1} is a unit vector we have (see above) the additional normalisation condition:

The data matrix has many more dimensions (keyword “rank”) than can be covered by just its main “eigenvector” u_{1}, which describes only λ_{1} of the total variance of the data set. (As mentioned above, the total variance of the data set is the sum of the diagonal elements of A, known as its trace). We want the second eigenvector u_{2} to optimally describe the variance in the data cloud that has not yet been described by the first one u_{1}. We thus want:

While, at the same time, u_{2} is normalized and perpendicular to the first eigenvector, thus:

and,

It now becomes more appropriate to write these “eigenvector eigenvalue” equations in full matrix notation. The matrix U contains eigenvector u_{1} as its first column, u_{2} as its second column, etc. The matrix Λ is a diagonal matrix with as its diagonal elements the eigenvalues λ_{1}, λ_{2}, λ_{3}, etc.:

with the additional orthonormalization condition:

The eigenvector eigenvalue Equation (29a) is normally written as:

which is the result of multiplying both sides of Equation (29a) by

Let the eigenvectors in the space of the columns of the matrix A be called V (with v_{1} the first eigenvector of the space as its first column, v_{2} the second column of the matrix V, etc.). The eigenvector equation in this “pixel-vector” space is very similar to the one above (Equation (29a)):

With the additional orthonormalization condition:

Both terms of Equation (30a) multiplied from the left by

It is obvious that the total variance described in both image space and pixel-vector space is the same, since the total sum of the squares of the elements of all rows of matrix X is the same as the total sum of the squares of the elements of all columns of matrix X. The intimacy of both representations goes much further, as we will see below.
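This equality of the total variance is readily verified numerically. In the sketch below, A_n = X∙X^{T} follows the text, while the name A_p for the conjugate matrix X^{T}∙X is an assumption of this illustration; the 1/p normalisation is left out:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 12, 5
X = rng.normal(size=(n, p))

A_n = X @ X.T    # n x n variance co-variance matrix
A_p = X.T @ X    # p x p conjugate matrix (name assumed here)

total = np.sum(X**2)   # sum of squares of all elements of X
print(np.trace(A_n), np.trace(A_p), total)

# Eigenvalues in descending order: the non-zero ones coincide.
eig_n = np.linalg.eigvalsh(A_n)[::-1]
eig_p = np.linalg.eigvalsh(A_p)[::-1]
print(np.allclose(eig_n[:p], eig_p))
```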

Multiplying both sides of the eigenvector Equation (29c) from the left with the data matrix X yields:

This equation is immediately recognised as the eigenvector equation in the conjugate space of the pixel vectors, Equation (30c) with the product matrix (X∙U) taking the place of the eigenvector matrix V. Similarly, multiplying both sides of the eigenvector equation (30c) from the left with the transposed data matrix X^{T} yields

Again, this equation is immediately recognised as the eigenvector equation in image space (Equation (29c)) with the product matrix X^{T}∙V taking the place of the eigenvector matrix U. However, the product matrices X∙U and X^{T}∙V are not normalised the same way as are the eigenvector matrices V and U, respectively. The matrix U, is normalised through U^{T}∙U = I_{q} (Equation (29b)), but the norm of the corresponding product matrix X^{T}∙V is given by eigenvector Equation (30a):

and, correspondingly:

These important formulas are known as the transition formulas relating the eigenvectors in image space (p space) to the eigenvectors in pixel-vector space (n space).
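The transition between the two spaces can be sketched numerically. The code assumes the eigenvector equations take the forms X^{T}∙X∙U = U∙Λ and X∙X^{T}∙V = V∙Λ, and that the transition formulas read V = X∙U∙Λ^{−1/2} and U = X^{T}∙V∙Λ^{−1/2}, as suggested by the derivation in the text; `numpy.linalg.eigh` stands in for whatever eigen-solver is actually used:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, q = 30, 6, 4
X = rng.normal(size=(n, p))

# Eigenvectors in image space (p space), largest eigenvalues first.
lam, U = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1][:q]
lam, U = lam[order], U[:, order]

# Transition to pixel-vector space (n space): V = X U Lambda^(-1/2).
V = (X @ U) / np.sqrt(lam)

# And back again: U = X^T V Lambda^(-1/2).
U_back = (X.T @ V) / np.sqrt(lam)

print(np.allclose(V.T @ V, np.eye(q)))       # V is orthonormal
print(np.allclose(X @ X.T @ V, V * lam))     # V solves the conjugate equation
print(np.allclose(U_back, U))
```

In practice this means only the smaller of the two eigenvector problems needs to be solved; the other set of eigenvectors follows from a matrix product.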

We mentioned earlier that the co-ordinates of the images in the space spanned by the unit vectors U are the product of X and U (Equation (23)); we now expand on that, using transition formulas (34) and (33):

and their pixel-vector space equivalents

We have seen that the co-ordinates of the images with respect to the eigenvectors (or any other orthogonal co-ordinate system of p space) are given by: X∙U = C_{img} (Equation (23)). Multiplying both sides of that equation from the right by U^{T} (a q × p matrix) yields:

Using Equation (34) we can also write this equation as

This formula is known as the eigen-filtering or reconstitution formulas [^{T} are (q × p), with q being maximally the size of n or p (whichever is smaller).

The reason for using a “*” to distinguish X^{*} from the original X, is the following: we are often only interested in the more important eigenvectors, assuming that the higher eigenvectors and eigenvalues are associated with experimental noise rather than with real information we seek to understand. Therefore we may restrict ourselves to a relatively small number of eigenvectors, or restrict ourselves to a value for q, of say, 50. The formulas can now be used to recreate the original data (X) but restricting ourselves to only that information that we consider important.
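A minimal sketch of such a reconstitution, assuming the form X^{*} = C_{img}∙U^{T} = X∙U∙U^{T} with U restricted to its first q columns. The synthetic data (three underlying "views" plus noise) are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, q = 200, 30, 3

# Synthetic data: q underlying "views" mixed per image, plus noise.
views = rng.normal(size=(q, p))
weights = rng.normal(size=(n, q))
X = weights @ views + 0.1 * rng.normal(size=(n, p))

# Eigenvectors of X^T X, sorted by descending eigenvalue.
lam, U = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

U_q = U[:, :q]
C_img = X @ U_q           # image co-ordinates on the first q eigenvectors
X_star = C_img @ U_q.T    # reconstituted (eigen-filtered) data

residual = np.sum((X - X_star) ** 2)
discarded = lam[q:].sum()         # variance carried by the omitted eigenvectors
print(residual, discarded)
```

The variance lost in the reconstitution equals the sum of the discarded eigenvalues, which gives a direct handle on how much information a given choice of q retains.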

We have introduced various distances and correlation measures earlier, but in discussing the MSA approaches we have so far only considered conventional Euclidean metrics.

We have discussed above that Euclidean distances are not always the best way to compare measurements and that it may sometimes be better to normalize the measurements by their total (χ^{2} distances) or by their standard deviation (modulation distance). In matrix notation let us introduce an n × n diagonal weight matrix N that has as its diagonal elements 1/w_{i}, where w_{i} is, say, the average density of image x_{i}, or the standard deviation of that image (row i of the data matrix X). Note that then the new product matrix N∙X has rows x_{i}/w_{i}. We then calculate the associated variance-covariance matrix:

Interestingly, now all the elements of this variance co-variance matrix are normalized by the specific weights w_{i} for each original image as required for the correlations we discussed above for the χ^{2} metrics (Equation (9)) or the modulation metrics (Equation (13)).

Similarly, we can introduce a diagonal (p × p) weight matrix M in the conjugate space with diagonal elements 1/w_{j}, where w_{j} is the average density of pixel-vector x_{j}, or the standard deviation of that pixel vector (column j of the data matrix X). Note that then the product matrix X' = X∙M will have columns x_{j}/w_{j}. Let us now combine these weight matrices in both conjugate spaces into a single formulation. Instead of the original data matrix X we would actually like to use a normalized version X' which relates to the original data matrix X as follows:

and its transposed:

Let us now substitute these into the PCA eigenvector eigenvalue Equation (29c):

Leading to:

with the additional (unchanged) orthonormalization constraint (29b):

With the N and M normalisations of the data matrix X, nothing really changed with respect to the mathematics of the PCA calculations with Euclidean metrics discussed in the previous paragraphs. All the important formulas can be simply generated by the substitution above (Equation (39)). For example, the co-ordinate Equation (35) becomes:

We call this pretreatment because this multiplication of X with N and M can be performed prior to the eigenvector analysis exactly the same way as the pretreatment band-pass filtering of the data discussed above. The procedures of the MSA analysis are not affected by pretreatment of the data (although the results can differ substantially).
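This pretreatment can be sketched as follows, assuming the normalised data matrix has the form X' = N∙X∙M with diagonal weight matrices as described above. The averages are used as weights here; standard deviations could be substituted for modulation-type weights. All variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 8, 5
X = rng.uniform(1.0, 10.0, size=(n, p))   # positive, histogram-like data

w_rows = X.mean(axis=1)   # one weight per image (row of X)
w_cols = X.mean(axis=0)   # one weight per pixel-vector (column of X)

N = np.diag(1.0 / w_rows)   # n x n diagonal weight matrix
M = np.diag(1.0 / w_cols)   # p x p diagonal weight matrix

X_prime = N @ X @ M         # pretreated data matrix

# Equivalent, without building the large diagonal matrices explicitly.
X_fast = X / w_rows[:, None] / w_cols[None, :]

print(np.allclose(X_prime, X_fast))
```

For realistic n and p the explicit diagonal matrices would be wasteful, so the element-wise form is the practical choice; the subsequent eigenvector analysis is applied to the pretreated matrix unchanged.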

The normalisation of the data by N and M allows us to perform the eigenvector analysis from a perspective of χ^{2} distances or that of modulation distances. This normalisation means that, in the 9,000,000-bicycle example for χ^{2} distances, the measurements for Beijing and Cambridge fall on top of each other, which is what we wanted. However, the fact that the weight of the measurement for Beijing is 100 times higher than that for Cambridge will be completely lost with this normalisation! That means that even for the calculation of the eigenvectors and eigenvalues of the system, the weight of the Cambridge contribution remains identical to that of Beijing.

In standard (not normalised) PCA, the contribution of Beijing to the total variance of the data set, and thus to the eigenvalue/eigenvector calculations, would be 100^{2} = 10,000 times higher than that of Cambridge, thus distorting the statistics of the data set. (Squared correlation functions in general suffer from this problem [
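The two extremes can be checked numerically with a small sketch. The four-element vectors are invented for illustration; only the factor of 100 between the two measurements is taken from the text, and the sum of squares is used as the (uncentred) variance contribution.

```python
import numpy as np

cambridge = np.array([3.0, 1.0, 4.0, 2.0])   # hypothetical measurement profile
beijing = 100.0 * cambridge                  # same profile, 100 times the weight

# Standard PCA: the contribution to the total variance goes with the squared length.
ratio = np.sum(beijing ** 2) / np.sum(cambridge ** 2)

# Total normalisation: divide each measurement by its own average density.
cambridge_n = cambridge / cambridge.mean()
beijing_n = beijing / beijing.mean()
```

Here `ratio` is 100^{2}, while `cambridge_n` and `beijing_n` are identical: the first treatment overweights Beijing, the second discards the weight difference altogether.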

A more balanced approach than either the pure PCA approach or the total normalisation of the data matrix can be achieved by concentrating our efforts on a partially normalised data matrix

and its transposed:

Substituting these into the classical PCA eigenvector-eigenvalue equation yields:

with the additional (unchanged) orthonormalization constraint

By then substituting

This is equivalent (after multiplying both sides of the equation from the left by M^{−1/2}) to the eigenvector-eigenvalue equation for generalised metrics [

However, by substituting

Equivalently, we obtain the eigenvector-eigenvalue equation in the conjugate space as:

or, alternatively, formulated as (the result of a multiplication from the left with N^{−1/2}):

with the associated orthonormalization condition

For deriving the transition formulas we proceed as was done earlier for PCA derivations. Starting from the eigenvector-eigenvalue Equation (45), and multiplying both sides of this equation from the left with the normalized data matrix

This last equation, again, is virtually identical to the eigenvector equation in the conjugate space (apart from its scaling):

And, again, we have a different normalisation for

We thus again need to normalise the “transition equation” with Λ^{−1/2}, leading to two transition equations between both conjugate spaces:

and correspondingly:

The calculation of the image co-ordinates in n space as we have seen above (Equation (23)):

With the appropriate substitutions:

and

However, these co-ordinates, seen with respect to the eigenvectors U, have a problem: the matrix ^{1/2} [

(The right hand side was derived using the transition formula ^{1/2}.) And we also have, similarly:

Using these co-ordinates for the compressed data space again puts the “Beijing measurement” smack on top of the “Cambridge measurement”. How is this now different from the “total normalisation” discussed in section 6b (above)? The difference lies in that each measurement is now not only associated with its co-ordinates with respect to the main eigenvectors of the data cloud, but each measurement now is also associated with a weight. The weight for the “Beijing measurement” here is one hundred times higher than that of the “Cambridge measurement”. That weight difference is later taken into account, for example, when performing an automatic hierarchical classification of the data in the compressed eigenvector space.

The algorithm we use for finding the main eigenvectors and eigenvalues of the data cloud is itself illustrative of the whole data compression operation. The IMAGIC “MSA” program, originally written by one of us (MvH) in the early 1980s, is optimised for efficiently finding the predominant eigenvectors/eigenvalues of extremely large sets of images. Here we give a simplified version of the underlying mathematics. For didactical reasons, the “metric” matrices N and M are excluded from the mathematics presented here. The basic principle of the MSA algorithm is the old and relatively simple “power” procedure (cf. [). A random starting vector r_{1} is multiplied through the symmetric variance co-variance matrix A, which will yield a new vector

This resulting vector is then (after normalisation) successively multiplied through the matrix A again:

and that procedure is then repeated iteratively. The resulting vector will gradually converge towards the first (largest) eigenvector u_{1} of the system, for which, by definition, the following equation holds:

Why do these iterative multiplications necessarily iterate towards the largest eigenvector of the system? The reason is that the eigenvectors “u” form a basis of the n-dimensional data space and that means that our random vector r_{1} can be expressed as a linear combination of the eigenvectors:

The iterative multiplication through the variance-covariance matrix A will yield for r_{1} after k iterations (using Equation (58) repeatedly):

or:

Because λ_{1} is the predominant eigenvalue, the contributions of the other terms will rapidly vanish (the ratios (λ_{i}/λ_{1})^{k} tend to zero for i > 1) and the iterates of r_{1} rapidly converge towards the main eigenvector u_{1}. The variance co-variance matrix A is normally calculated as the matrix multiplication of the data matrix X and its transpose, X^{T}:
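The power procedure described above can be sketched as follows. This is a minimal illustration and not the IMAGIC code: the function name, the fixed iteration count and the small test matrix with a known spectrum are all assumptions made for the example.

```python
import numpy as np

def power_iteration(A, iterations=200, seed=0):
    """Approximate the largest eigenvalue/eigenvector of a symmetric matrix A."""
    rng = np.random.default_rng(seed)
    r = rng.standard_normal(A.shape[0])   # random starting vector r_1
    r /= np.linalg.norm(r)
    for _ in range(iterations):
        r = A @ r                         # multiply through the matrix A
        r /= np.linalg.norm(r)            # normalise after every multiplication
    eigenvalue = r @ A @ r                # Rayleigh quotient of the converged vector
    return eigenvalue, r

# Hypothetical symmetric test matrix with known eigenvalues 9, 4, 2, 1, 0.5, 0.1.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = Q @ np.diag([9.0, 4.0, 2.0, 1.0, 0.5, 0.1]) @ Q.T
lam1, u1 = power_iteration(A)
```

The speed of convergence is governed by the ratio of the second to the first eigenvalue (here 4/9), so a data set whose leading eigenvalues are nearly equal needs more iterations.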

As was mentioned above, the data matrix X contains, as its first row, all of the pixels of image 1; its general i^{th} row contains all the pixels of image i. The MSA algorithm operates by multiplying a set of randomly generated eigenvectors (because of the nature of the data also called eigenimages) r_{1}, r_{2}, etc., through the data matrix X and its transpose X^{T}, respectively. The variance-covariance matrix A_{p} is thus never calculated explicitly, since that operation alone is too expensive computationally. The MSA algorithm does not use only one random starting vector for the iterations, but rather uses the full set of q eigenimages desired and multiplies that iteratively through the data matrix X, similar to what was suggested by [

In detail, the MSA algorithm works as follows: the set of q starting eigenimages r_{1} to r_{q} is first filled with random numbers, which are then orthonormalised (normalised and made orthogonal to each other). The number of eigenimages used depends fully on the complexity of the problem at hand but is typically 10 to 100; they are symbolised by a set of two “eigenimages” in the illustration (top of

set. The algorithm converges rapidly (typically within 30 - 50 iterations) to the most important eigenimages of the data set.

An important property of this algorithm is its efficiency for large numbers of images n: its computational requirements scale proportionally to n∙p, assuming the number of active pixels in each image to be p. Most eigenvector-eigenvalue algorithms require the variance-covariance matrix as input. The calculation of the variance-covariance matrix, however, is itself a computationally expensive algorithm requiring computational resources almost proportional to n^{3}. (This number is actually: Min (n^{2}p, np^{2})). The MSA program produces both the eigenimages and the associated eigenpixel-vectors in the conjugate data space as described in [
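The idea of iterating a whole set of eigenimages through X and its transpose, without ever forming the variance-covariance matrix, can be sketched as below. This is an assumption-laden illustration rather than the MSA program itself: the QR decomposition is used here as a convenient orthonormalisation step, the metrics N and M are omitted, and the function name and test data are hypothetical.

```python
import numpy as np

def leading_eigenimages(X, q, iterations=50, seed=0):
    """Approximate the q leading eigenvectors of X.T @ X without forming that matrix."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    U, _ = np.linalg.qr(rng.standard_normal((p, q)))   # random orthonormal start
    for _ in range(iterations):
        C = X @ U                  # n x q image co-ordinates, cost of order n*p*q
        U = X.T @ C                # p x q, back through the transposed data matrix
        U, _ = np.linalg.qr(U)     # re-orthonormalise the set of eigenimages
    C = X @ U
    eigenvalues = np.sum(C ** 2, axis=0)   # Rayleigh quotients u^T X^T X u
    return eigenvalues, U

# Hypothetical data matrix (30 images, 12 pixels) with singular values 10, 5, 2, 1, 0.5.
rng = np.random.default_rng(1)
L, _ = np.linalg.qr(rng.standard_normal((30, 5)))
R, _ = np.linalg.qr(rng.standard_normal((12, 5)))
X = (L * np.array([10.0, 5.0, 2.0, 1.0, 0.5])) @ R.T
eigenvalues, U = leading_eigenimages(X, q=3)
```

Each iteration touches the data matrix twice and nothing larger than n × q or p × q is stored, which is where the n∙p scaling quoted above comes from.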

In spite of its high efficiency and perfect scaling with the continuously expanding sizes of the data sets, compared to most conventional eigenvector-eigenvalue algorithms, the single-CPU version of the algorithm had become a serious bottleneck for the processing of large cryo-EM data sets. The parallelisation of the MSA algorithm had thus moved to the top of our priority list. We have considered various parallelisation schemes including the one depicted in

As was discussed before, Equation (46a) (see also the legend of

in which u_{j} is one single eigenvector and λ_{j} the corresponding eigenvalue. Having stated this, a direct approach to parallelising the calculation of Equation (46a) could exploit the independence of the calculations of the individual eigenvectors u_{j}. Although this is a clear possibility and is straightforward to implement, the scheme is not effective. Since there may typically be more processors available than the number of eigenvectors needed for representing the dataset, this approach would waste processing time, with many processors running idle. Moreover, all processors would need access to the full image matrix X, which is likely to create severe I/O bottlenecks. To analyse another possible parallelisation scheme, let us write Equation (62) as:

where u_{ij} is element i of eigenvector u_{j}, x_{ij} is pixel j of image i, and n_{s} and m_{t} are the diagonal elements of the N and M metrics, respectively. This equation can be expanded into the form:

The equation above suggests a possible parallelisation scheme over the image dataset instead of the eigenvectors (

The implementation of the algorithm can be structured in a master/slave architecture, where the master is responsible for all non-parallel tasks and for summing the partial results of each node. During each iteration of the algorithm, each node needs to read data from the image matrix X, thus potentially creating a network bottleneck. However, since each node in this parallelisation scheme needs to access only its own part of the X matrix, distributing the partial input datasets over the local disks of the nodes efficiently parallelises the I/O operations.
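That the work can be split over subsets of images follows from the product through X and its transpose being a sum over images. The sketch below checks this on a single machine; the nodes are simulated by a plain loop, the metrics N and M are left out, and all names are illustrative assumptions (a real implementation would use MPI or a similar mechanism).

```python
import numpy as np

def partial_product(X_chunk, U):
    """Work of one node: the contribution of its own images to X^T (X U)."""
    return X_chunk.T @ (X_chunk @ U)

def distributed_product(X, U, n_nodes):
    chunks = np.array_split(X, n_nodes, axis=0)                  # each node holds some images
    partials = [partial_product(chunk, U) for chunk in chunks]   # would run concurrently
    return np.sum(partials, axis=0)                              # the master sums the results

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 6))    # hypothetical data: 10 images of 6 pixels
U = rng.standard_normal((6, 2))     # two current eigenimage estimates
full = X.T @ (X @ U)
split = distributed_product(X, U, n_nodes=3)
```

Only the small p × q partial results travel between nodes and master; the image data stay on the node that owns them.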

What we wish to achieve with the MSA approaches is to concentrate on the intrinsic information of large noisy data sets and make that more accessible. The determination of the eigenvectors (eigenimages) by itself does not change or reduce the total image information in any way. It is merely a rotation of the co-ordinate system in a special direction, such that the first eigenvector covers most of the variance in the data cloud, the second, orthogonal to the first, covers most of the remaining variance of the data cloud, not covered by the first eigenvector, etc. If one would find all the eigenvectors of the data cloud (all the eigenvectors of the variance-covariance matrix; all p or n of them, whichever is the smaller number) there would be no loss of information at all. Of course, even without any loss of information the most important information is associated with the first eigenvectors/eigenimages that describe the strongest variations (in terms of variance) within the data set.

When we decide to consider only the first so-many eigenvectors and thus decide to ignore all higher eigenvectors of the system, we make the “political” decision between what is signal and what is noise. Here the decision is that the percentage of the overall variance of the data set (the “trace”) covered by the main eigenvectors suffices for our desired level of understanding of the dataset. An often spectacular level of data compression is thus achieved. The problem remains of how to best exploit that information in the compressed data space. Having concentrated on the mathematics and algorithmic aspects of the eigenvector-eigenvalue aspect of the MSA approach, we here will just give a relatively short account of this most fascinating aspect of the methodology.
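The fraction of the trace covered by the first eigenvectors is a one-line calculation. The eigenvalues and the trace in this sketch are invented numbers, used only to show the bookkeeping.

```python
import numpy as np

def variance_covered(eigenvalues, trace):
    """Cumulative fraction of the total variance (trace) covered by the first eigenvalues."""
    return np.cumsum(eigenvalues) / trace

eigenvalues = np.array([40.0, 25.0, 10.0, 5.0])   # hypothetical leading eigenvalues
trace = 200.0                                     # hypothetical total variance
covered = variance_covered(eigenvalues, trace)
```

With these numbers the first four eigenvectors cover 40% of the total variance; whether that suffices is exactly the decision discussed above.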

The first EM data sets that were ever subjected to the MSA approach were simple negative-stain contrasted specimens where the differences were directly visible in the micrographs. The data sets were chosen carefully for being simple in order for the problem to be solvable with the limited computing resources available in those days. The very first data set that was ever analysed by the MSA approach was an artificial mixture dataset from two different species of hemoglobins. The first analyses were performed with the “AFC” program developed by Jean Pierre Bretaudière from a program originally written by JP Benzécri (see Appendix for details). The 16 molecular images of the giant annelid hemoglobin of Oenone fulgida visibly had extra density in the centre compared to the 16 molecular images of the well-known giant hemoglobin of the common earthworm Lumbricus terrestris, which rather had a “hole” in the centre of its double-ring geometry (see Appendix). The separation of the two species was so obvious that one immediately moved on to a single data set with real internal variations (the example below).

The first real data set with genuine variations among the molecular images was a preparation of Limulus polyphemus hemocyanin half-molecules, each consisting of four hexamers produced by controlled dissociation of the full molecule. The data sets were by today's standards extremely small, with only 46 images in this case. Each 64 × 64 image was then low-pass filtered and coarsened to a mere 32 × 32 pixel format. Nevertheless the fact that the image information could be compressed logically to points on a two-dimensional plane (

In those early days in computing, the address space of the computers was minimal by today’s standards. The 16-bit PDP-11/45 computer on which the work was performed would only address 2^{16} bytes = 64 Kbytes of memory. It was thus difficult to fit any serious computational problem in the central memory of that generation of computers! The critical memory requirement for the eigenvector-eigenvalue calculations was the “core” needed for the square variance-covariance matrix, which practically limited the number of “particles” to around 50 at most. In the IMAGIC implementation of Bretaudière’s AFC program (see Appendix) the limit was slightly over 100 molecular images (see the Limulus polyphemus hemocyanin double-hexamer analysis in [

A new generation of 32-bit computers with a much larger address space (such as the Digital Equipment VAX-780) came on the market, allowing one to handle much larger data sets with a significantly higher level of complexity. There were some efforts to map more complicated cases nonlinearly onto just a 2D plane in order to visualise the underlying structure. Although such approaches may relieve the problem somewhat, what was really needed was a more abstract approach in which all information in the compressed data space is structured automatically. The development of automatic classification procedures operating on a, say, 60-dimensional compressed data space became an absolute necessity.

To eliminate the subjectivity of visual classification (which is simply impossible when the number of factors to be taken into account is larger than two or three), automatic classification schemes were introduced in EM. These schemes emerged in the general statistical literature in the 1960s and 1970s, and were first used in electron microscopy in the early 1980s. There are many algorithms available each with their advantages and disadvantages. The three listed below are important in single particle cryo-EM.

In the first class of algorithms, one chooses the number k of classes into which one wishes to subdivide the full data set (of n images) and then selects at random k of the n elements to serve as the first classification seeds. Then one loops over all n elements of the data set and assigns each element to one of the k seeds based on closest proximity. After one iteration, each of the k original seeds shifts to the centre of mass k' of all those elements of the total of n that were found to be closest to the original random seed k. All original n elements of the data set are then classified against the k' new classification centres and the process is then repeated. This type of algorithm is now known as a “k-means” algorithm [
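The seed-and-update loop just described can be written down compactly. The sketch below is a plain textbook k-means with unit weights, not the implementation used in any particular EM package; the function names and the six test points are assumptions for the example.

```python
import numpy as np

def nearest_centre(points, centres):
    """Index of the closest centre (squared Euclidean distance) for every point."""
    d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def k_means(points, k, iterations=20, seed=0):
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Select k of the n elements at random as the first classification seeds.
    centres = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iterations):
        labels = nearest_centre(points, centres)       # assign every element to a seed
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:
                centres[c] = members.mean(axis=0)      # seed moves to the centre of mass
    return nearest_centre(points, centres), centres

# Hypothetical compressed data: two well-separated groups in a 2D factor space.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centres = k_means(points, k=2)
```

The class numbering depends on the random seeds, so only the grouping of the elements is meaningful, not the label values.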

The HAC technique is a “from-the-bottom-up” classification approach in which each element of the data set is first considered as a “class” by itself. Each of these individual starting classes can be associated with a mass depending on the choice of metric. Individual classes are then merged in pairs, based on a merging criterion, until finally one single large class emerges, containing all the elements of the original data set. The history of the classification procedure (that is, which classes merge together at which value of the aggregation criterion) is stored in a classification or merging “tree”. The user chooses the desired number of classes at which the tree is to be cut. As a merging criterion, the minimum added intra-class variance or Ward criterion is normally used [

In this formula ^{ }is the (now) Euclidean square distance between the classes i and j having masses (weights) w_{i} and w_{j} respectively (Appendix of [^{2}-metric and the modulation metric.

A disadvantage of the HAC algorithm [ is that its computational requirements grow as n^{2} (with n the number of images in the data set); for larger data sets of, say, more than 200,000 elements, the computational requirements become excessive and often exceed the requirements for the eigenvector calculations.

Although, at every merging step, two classes are merged that lead to a minimum AIV contribution, this fact is also a fundamental limitation: If two elements are merged into one class at an early stage of the procedure, the elements will always remain together throughout all further HAC classification levels, whereas if their marital ties were weaker, a lower intra-class variance minimum could easily be obtained [
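A naive version of the bottom-up merging can be sketched as follows. It assumes that the added intra-class variance has the standard Ward form w_{i}w_{j}/(w_{i} + w_{j}) times the squared Euclidean distance between the class centres; the function name is hypothetical, the merging tree is not stored, and the exhaustive pair search is far slower than a production implementation.

```python
import numpy as np

def ward_hac(points, weights, n_classes):
    """Merge classes pairwise, always choosing the minimum added intra-class variance."""
    classes = [([i], np.asarray(points[i], dtype=float), float(weights[i]))
               for i in range(len(points))]            # (members, centre, mass)
    while len(classes) > n_classes:
        best = None
        for a in range(len(classes)):
            for b in range(a + 1, len(classes)):
                _, ca, wa = classes[a]
                _, cb, wb = classes[b]
                # Assumed Ward criterion: added variance when merging a and b.
                added = wa * wb / (wa + wb) * np.sum((ca - cb) ** 2)
                if best is None or added < best[0]:
                    best = (added, a, b)
        _, a, b = best
        ma, ca, wa = classes[a]
        mb, cb, wb = classes[b]
        merged = (ma + mb, (wa * ca + wb * cb) / (wa + wb), wa + wb)
        classes = [c for i, c in enumerate(classes) if i not in (a, b)] + [merged]
    return [sorted(members) for members, _, _ in classes]

# Hypothetical example: four elements with unit masses, cut at two classes.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
partition = ward_hac(points, weights=[1, 1, 1, 1], n_classes=2)
```

Once two elements share a class they are never separated again, which is the limitation discussed above.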

The moving-elements approach is a post-processor to refine and consolidate an existing partition. The starting point is typically the partition obtained with HAC based on the Ward criterion discussed above. The partition is refined, to reach a “deeper” local minimum of intra-class variance, by allowing each member of each class to migrate to any other class where it is happier in terms of the very same added intra-class variance criterion [

After one cycle, many of the classes have changed and the procedure is iterated until no (or little) further moving of elements is observed. The partition thus obtained is a real (and significantly deeper) local minimum of the total intra-class variance than that obtained directly by the HAC in the sense that no single element can change class-membership without increasing the total intra-class variance of the partition. We call this algorithm “moving elements consolidation” (or refinement) as opposed to the “moving centres consolidation” proposed independently by [
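The refinement can be sketched as a loop in which each element is moved whenever that lowers the total intra-class variance. The gain and cost expressions below are the standard weighted forms that follow from the assumed Ward criterion; the function name, the cycle limit and the rule that a class is never emptied are choices made for this illustration only.

```python
import numpy as np

def moving_elements(points, labels, weights=None, max_cycles=50):
    points = np.asarray(points, dtype=float)
    labels = np.array(labels)
    w = np.ones(len(points)) if weights is None else np.asarray(weights, dtype=float)
    for _ in range(max_cycles):
        moved = 0
        for i, x in enumerate(points):
            a = labels[i]
            in_a = labels == a
            mass_a = w[in_a].sum()
            if mass_a <= w[i]:
                continue                       # assumed rule: never empty a class
            centre_a = np.average(points[in_a], axis=0, weights=w[in_a])
            # Variance removed from class a when element i leaves it.
            gain = w[i] * mass_a / (mass_a - w[i]) * np.sum((x - centre_a) ** 2)
            best_cost, best_class = gain, a
            for b in np.unique(labels):
                if b == a:
                    continue
                in_b = labels == b
                mass_b = w[in_b].sum()
                centre_b = np.average(points[in_b], axis=0, weights=w[in_b])
                # Variance added to class b when element i joins it.
                cost = w[i] * mass_b / (mass_b + w[i]) * np.sum((x - centre_b) ** 2)
                if cost < best_cost:
                    best_cost, best_class = cost, b
            if best_class != a:
                labels[i] = best_class
                moved += 1
        if moved == 0:
            break                              # no element wants to move any more
    return labels

# Hypothetical 1D example: the element at 2 starts in the wrong class.
points = [[0.0], [1.0], [2.0], [10.0], [11.0]]
refined = moving_elements(points, labels=[0, 0, 1, 1, 1])
```

The loop stops at a partition in which no single element can change class without increasing the total intra-class variance.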

A variety of further possible classification/refinement schemes have been suggested in [^{2} comparisons we only have to deal with 1,000,000 × 20,000 comparisons. The HAC operations are otherwise fully standard, and the procedure stops when the desired final number of classes is reached (say, 2000). The moving elements post-processor, again based on the Ward criterion, then brings the whole partition to a variance minimum.

After the classification phase, all noisy molecular images that have been assigned to the same class in the classification phase are averaged together. This averaging of images leads to a large improvement in SNR. The SNR improves proportionally to the number of images averaged, under the idealised assumption that all images averaged are identical apart from the noise, which is different from image to image. The new class averages may be used as references for new alignment and classification rounds. After a few iterations, good class averages with improved SNRs can be obtained. The high SNR values obtained are of great importance for an accurate assignment of Euler angles to these projection images by techniques such as “projection matching” [
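The proportionality between SNR and the number of images averaged can be checked with a small simulation under exactly the idealised assumption stated above (identical signal, independent noise). The signal shape, noise level and image counts are arbitrary choices for the sketch, and SNR is taken as the ratio of signal variance to noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0.0, 2.0 * np.pi, 4096))   # hypothetical noise-free "image"
noise_sigma = 5.0                                      # noise much stronger than the signal

def snr_of_average(n_images):
    """SNR (variance ratio) of the average of n_images noisy copies of the signal."""
    images = signal + rng.normal(0.0, noise_sigma, size=(n_images, signal.size))
    average = images.mean(axis=0)
    residual_noise = average - signal
    return signal.var() / residual_noise.var()

snr_1 = snr_of_average(1)
snr_100 = snr_of_average(100)
gain = snr_100 / snr_1    # close to 100, up to statistical fluctuation
```

In real class averages the gain is lower, because the members of a class are never perfectly identical or perfectly aligned.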

Another example of the use of MSA is the unbiased analysis of the main symmetry properties of a macromolecular complex. Earlier symmetry analysis approaches were based on finding the predominant rotational symmetry components of one single image at a time [

A methodologically clean approach for determining the strongest symmetry component in a data set was proposed, which entirely avoids the symmetry bias resulting from any explicit or implicit rotational alignment of the molecular images [

A total of some 7300 particles was submitted to MSA eigenvector analysis. The first eigenvector in this analysis (

Alignments of the images within the data matrix X change the MSA analysis in fundamental ways. Interestingly however, alignments of the individual images do not change the total variance of the data set. A rotation of an image merely shifts around the pixel densities within a row of the data matrix X; the total variance of the measurement, however, does not change. (This is perfectly true as long as no non-zero pixels are rotated or shifted out of the part of the image that is active during the MSA analysis). The variance of each image is the corresponding diagonal element of the variance-covariance matrix A and hence these diagonal elements remain unchanged.

The total variance of the data set (the trace of A) is the sum of the diagonal elements of matrix A (Equation (21a)). What does change during the alignment procedures is that certain images within the data matrix will become better aligned to each other and will sense a higher co-variance among themselves (specific off-diagonal elements of matrix “A” will increase).
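The statement that alignment leaves the diagonal of A untouched while raising specific off-diagonal elements can be illustrated with a toy data matrix. A cyclic shift stands in for a rotation or translation (so no densities leave the active area), and the sum of squares of a row is used as its variance; both are simplifying assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 16))      # 4 hypothetical "images" of 16 pixels each
X[1] = np.roll(X[0], 3)               # image 1 is image 0, cyclically shifted (misaligned)

A_before = X @ X.T                    # n x n matrix; diagonal = variance of each image

X_aligned = X.copy()
X_aligned[1] = np.roll(X[1], -3)      # "alignment": undo the shift of image 1
A_after = X_aligned @ X_aligned.T
```

The diagonal elements and the trace of `A_before` and `A_after` are the same, whereas the covariance element between images 0 and 1 increases after alignment.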

We will here apply an interesting and illustrative form of alignment, namely: “alignment by MSA”, (similar to the Alignment By Classification “ABC” [

We have here aligned the full (centred) data set rotationally with respect to one of the two main eigenimages associated with the 6-fold symmetry property of this hemoglobin. In doing so, we have concentrated a much larger percentage of the total variance of the data set into the first few eigenvectors of the system as is seen from the total sum of the first 10 eigenvalues before (4.3%;

Concentrating the information (variance) into the first few eigenimages means that more eigenvectors will become statistically significant above the background random noise level. With the relatively small data sets from the past, alignments were an absolute necessity for obtaining statistically significant results. Today, however, and especially with the powerful new parallel MSA algorithms, we can afford to greatly increase the overall size of the data set. The rationale for using larger data sets is simple: using ten times more images in a data set means using ten times more information! In doing so, we can achieve statistical significance without the risk of prejudicing the data sets by introducing reference bias [

Mixed populations of molecular images were originally the curse of single-particle electron microscopy. Originally, one would try to work with a population of macromolecules that was as homogeneous as possible. The basic assumption for single particle cryo-EM was that all molecules were the same, apart from the fact that each molecule could exhibit a different orientation and be in a different position in the sample. When the data are heterogeneous or the molecules flexible, and that heterogeneity is not accounted for, the quality of the final 3D reconstruction deteriorates, since one will then be averaging “cows” and “horses” into the same 3D map. When, on the other hand, one knows how to separate the molecular images into structurally homogeneous subgroups, a whole new world of possibilities opens. One can then study a mixture of closely related structures and from that commence to understand the biological reasons for the distinct 3D structures found in a sample. The name “4D” cryo-EM has been used in this context.

One of the first examples of the challenging possibilities came from a cryo-EM sample of the E. coli 70S ribosome, in complex with Release Factor 3 [

Another possibility to study 3D structural variation is to generate a large number of 3D reconstructions and to study those by means of MSA the way 2D molecular images were studied above [

A quality criterion for the 4D MSA types of analysis is the resolution achieved in the subset 3D reconstructions. As usual the resolution is quantified by Fourier Shell Correlation (“FSC”, [

The main goal of the biological sciences is to make sense of the massive and very noisy “data” in terms of elucidating the underlying general principles. There is no better example than the data sets collected in cryo-EM, which are probably among the largest and noisiest currently being collected in the biomedical sciences. Over the past three decades, multivariate statistical analysis approaches have been very successful in helping us sort out complex EM data sets in many different ways, and examples have been discussed such as a first “manifold” separation into two functional states [

How can it be that the computational effort was simply not affordable? After all, one of the most dramatic developments in science over the past decades has been the increase in speed in general computing. In spite of massive computing resources becoming available, however, we have seen an even more rapid increase in the size and complexity of the EM data sets. The size of the data sets has been limited not by what one would wish to analyse, but rather by what was practical to analyse. The high-resolution scanning of (analogue) micrographs into digital form was a tedious activity taking up to hours of labour for a single micrograph. With automatic 24/7 data collection directly on the microscope, using direct electron detector 4 k × 4 k cameras [

There are other reasons why the MSA approaches have not always received the appropriate attention, including ignorance. The MSA approaches are not the simplest of techniques for the potential biological user to understand and become acquainted with. The learning curve is steep and users tend to avoid investing the time necessary to understand the approach. More serious, however, is that many software systems in use in cryo-EM do not have MSA options, and those that do typically apply standard libraries for performing the analysis. These standard eigenvector-eigenvalue routines normally calculate all eigenvectors and eigenvalues of the system, and those are algorithms that scale proportionally to N^{2}, implying that the analysis of data sets of more than a few thousand elements will easily exceed the memory/computing capacity of any modern computer. With the poor scaling of the algorithms it can be very difficult to analyse large data sets, a fact that is also not appreciated in the general statistical literature where data sets are small by cryo-EM standards. Even in the special EM literature this issue is not always appreciated. (See, for example, paragraph 2.5 of [

Another issue that has largely been ignored in the literature is that the original χ^{2}-metric of correspondence analysis is inappropriate for electron-microscopical phase-contrast data. Historically it happened to be the first MSA technique used because it was available (see Appendix). However, the χ^{2}-metric that correspondence analysis uses is designed for positive data (counting occurrences in histograms). To make the EM phase-contrast data positive one could threshold the image data to positive values only. This thresholding, however, is not justifiable for general signals and, at the very least, wastes half of the information for zero-average-density measurements. Alternatively, the negative values can be rendered positive by adding a constant to the data. This too has far-reaching consequences. The strong negative densities will end up as small positive densities and will contribute very little to the total variance of the data set, whereas the large positive values will become very large positive values with a disproportionately strong contribution to the total variance. For high-resolution EM work, however, the χ^{2}-metric must be considered obsolete since the publication of [

In recent years, another multivariate statistical approach was introduced to single particle cryo-EM known as the “maximum likelihood” [

The issue of reference bias is of fundamental importance in the cryo-EM field. There has been a much publicised case where two different 3D structures were presented by Mao & Sodroski and co-workers [

New cryo-EM developments allowing the separation of subsets of 3D structures are emerging that have the potential to change the field of structural biology. Atomic resolution structures (~3 Å) had hitherto been elucidated mainly by X-ray crystallography, where the biological molecules are confined to the rigidity of a crystal. In X-ray crystallography one collects diffraction amplitude data that by definition have been averaged over all unit cells of the crystal. For a better understanding of a biological process, however, it is essential to see the sequence of conformational changes and interactions a molecule or complex undergoes during its functional cycle. A sequence of 3D structures or “4D data analysis” is required for this level of understanding. We want to separate the different manifolds in hyperspace, representing the different states the complexes we study can assume. Single-particle cryo-EM provides a direct window into the solution, revealing different views of different complexes, in different structural/functional states [

New developments in MSA eigenvector-eigenvalue data-compression approaches are expanding the possibilities of single-particle cryo-EM. Decades after their first introduction to electron microscopy, MSA approaches are more alive than ever and they are being tailored for many new tasks. With the orders-of-magnitude speed increase achieved by the MPI parallelisation of the algorithms, huge datasets of Terabytes in size can be processed. A rapid expansion of the multivariate statistical techniques has taken place in single-particle cryogenic electron microscopy. One of the most challenging uses of the parallel MSA algorithm is the separation of heterogeneous data sets into different 3D structures associated with the different functional states of the macromolecular complexes we seek to understand. The different functional states of the molecules we study may be better visualised and understood by assessing the different manifolds one finds in the eigenvector compressed dataspace. Thanks to the maturation and proliferation of multivariate statistical data analysis approaches, single-particle electron microscopy techniques are now one of the preferred instruments of structural biology.

First let us again point to the first MSA ideas that were inspired by Jean-Pierre Bretaudière (see Appendix). This document reflects decades of development and experience with the MSA techniques discussed. During this period many individuals contributed to the various MSA projects (and not just the current authors) including: George Harauz, Lisa Borland, Ralf Schmidt, Elena Orlova, Holger Stark, HansPeter Winkler, Manfred Kastowsky, Bruno Klaholz, Tim Balance, Charlotte Linnemayr-Seer, and Pavel Afanasiev. Countless other scientists who used the programs gave us constructive feedback, reported bugs or inconsistencies, and thus forced us to streamline our programs and ideas on MSA methodologies. We thank Katie Melua for inspiring the Beijing/Cambridge bicycle statistics example. Our research was financed in part by grants from: EU/NOE (grant No. NOE-PE0748), from the Dutch Ministry of Economic Affairs (Cyttron project No. BIBCR_PX0948; Cyttron II FES-0908); the BBSRC (Grant: BB/G015236/1); the Brazilian science foundations: CNPq (Grants CNPq-152746/2012-9 and CNPq-400796/2012-0), and the Instituto Nacional de C, T&I em Materiais Complexos Funcionais (INOMAT).

Marin van Heel, Rodrigo V. Portugal, Michael Schatz (2016) Multivariate Statistical Analysis of Large Datasets: Single Particle Electron Microscopy. Open Journal of Statistics, 06, 701-739. doi: 10.4236/ojs.2016.64059

Jean-Pierre Bretaudière (1946-2008):

The Early Days of Multivariate Statistics in Electron Microscopy

Jean-Pierre Bretaudière (

Since 1976, I had been employed as a project scientist in the electron microscopy group of Professor Ernst van Bruggen of the Biochemistry Department of the University of Groningen, The Netherlands. Erni van Bruggen had been most impressed by the early “averaging” successes of the group around Aaron Klug in Cambridge and wanted to incorporate those techniques in his research. It was my task to develop image processing software for data collected either with the transmission electron microscope (TEM) or the Scanning version of that instrument (a “STEM”). In the fall of 1979, I went to visit Albany, for a 6-week period, to work with Joachim Frank on automatically finding sub-populations of images in a heterogeneous population as discussed in the main paper. We had met two years earlier at an EMBO image processing course in Basel and we had been corresponding ever since. We both agreed that when averaging images of individual macromolecules, the presence of a mixture of different views was a serious and important complication: a challenging problem in need of a solution. I travelled to Albany taking with me two different sets of images in which different views could already be distinguished visually. One set consisted of images of worm hemoglobin (two different preparations: one with a clearly visible extra subunit in the middle and one without) and the other dataset of arthropod 4 × 6 hemocyanins where the difference between “flip” and “flop” could be directly seen (

What followed were many nights in the windowless catacombs of the New York State Department of Health (Division of Laboratories and Research, now the Wadsworth Centre) beneath the Empire State Plaza. During these late sessions I met Jean-Pierre Bretaudière (“JP”) while working at the same PDP-11/45 computer (

where I had spent most of 1971 as a student. We also discussed our actual work and I explained the first attempts Joachim and I were making at separating subsets of images from the overall population of molecular images. Jean-Pierre explained what he was doing at the State Department of Health: quality control of medical laboratories in the state of New York. The Health Department would send out identical blood samples to many different medical laboratories. The results of the various chemical tests from the different labs were then compared using a French multivariate statistical technique called “l’analyse des correspondances” [

Jean-Pierre was immediately convinced that his “AFC” program (for: “Analyse Factorielle des Correspondances”) would be able to help solve our problem and handle our “massive” amount of image data. The AFC program was a modified version of the correspondence analysis program “CORAN” by Jean Paul Benzécri [

What was “massive” in this context was a very relative and very dated concept: at the time we had fewer than 64 images of 64 × 64 pixels = 4096 pixels each; that is a total data set size of less than one megabyte (each image being 16 Kbyte in floating point format). An uncompressed picture from even the cheapest digital camera today is much larger than one megabyte! However, it was characteristic of those early days in computing that the 16-bit PDP-11 computer would only address 2^{16} bytes = 64 Kbytes. It was thus difficult to fit any “major” computational problem in the central memory of such a computer! The critical memory requirement in our case was the “core” needed for the square variance-covariance matrix (see main article) which, for 64 images, would already occupy 16 Kbyte of the precious available address space.
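The memory figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes 4-byte floating-point values (implied by the 16 Kbyte quoted per 64 × 64 image) and takes the upper bound of 64 images; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the memory figures quoted in the text.
# Assumption: 4 bytes per floating-point value, as implied by the
# 16 Kbyte quoted for a single 64 x 64 image.
BYTES_PER_FLOAT = 4
KBYTE = 1024

n_images = 64                # upper bound on the number of images
pixels_per_image = 64 * 64   # 4096 pixels

image_kbytes = pixels_per_image * BYTES_PER_FLOAT // KBYTE
dataset_kbytes = n_images * image_kbytes
address_space_kbytes = 2**16 // KBYTE  # 16-bit addressing
# Square variance-covariance matrix: one entry per pair of images.
covariance_kbytes = n_images * n_images * BYTES_PER_FLOAT // KBYTE

print("one image:         ", image_kbytes, "Kbyte")
print("whole data set:    ", dataset_kbytes, "Kbyte")
print("address space:     ", address_space_kbytes, "Kbyte")
print("covariance matrix: ", covariance_kbytes, "Kbyte")
print("share of address space:", covariance_kbytes / address_space_kbytes)
```

Under these assumptions the covariance matrix alone takes a quarter of the addressable memory, before any program code or image data is loaded.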

As we sat together with Joachim the following morning to discuss the matter, we were all convinced that this was very much worth trying. Joachim wrote a small conversion program to allow JP’s AFC program to read the aligned images produced by the SPIDER program. The result of this first correspondence analysis of single particles was an instant success!

The first trial was on a giant annelid hemoglobin data set extracted from images of hemoglobins from two different species. The giant annelid hemoglobin of Oenone fulgida visibly had extra density in the centre [

On the correspondence analysis maps of these images four clear groups (“classes”) were visible. These reflected the “flip and flop” versions of the 4 × 6 oligomer where the right-hand 2 × 6 part can be shifted up or down with respect to the left-hand 2 × 6-mer. This behaviour had already been anticipated based on visual inspection of the images (

In these early MSA results, we would print out two-dimensional maps of the positions of the images on the first factorial co-ordinates and then draw a circle around a cluster of images, call that a “class” and then sum the members of that class for further interpretation. This visual classification, after eigenvector-eigenvalue data reduction, obviously had its shortcomings and was impossible to perform when more than ~3 factorial axes were to be taken into account. It was JP who also pointed me in the direction of automatic classification and the wealth of literature on the subject [
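The manual procedure described above (circle a cluster on the printed map, call it a class, sum its members) can be sketched as follows. This is an illustrative toy in plain Python, not the original programs: the function name `sum_class_in_circle`, the toy coordinates, and the two-pixel "images" are all assumptions made for the example.

```python
def sum_class_in_circle(coords, images, centre, radius):
    """Sum all images whose 2-D factorial coordinates lie inside a circle.

    coords : list of (x, y) positions on the first two factorial axes
    images : list of equal-length pixel lists, in the same order as coords
    Returns (indices of class members, pixel-wise class sum).
    """
    cx, cy = centre
    members = [i for i, (x, y) in enumerate(coords)
               if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2]
    class_sum = [0.0] * len(images[0])
    for i in members:
        for p, value in enumerate(images[i]):
            class_sum[p] += value
    return members, class_sum


# Hypothetical toy data: four images forming two well-separated clusters.
coords = [(-1.0, 0.1), (-0.9, -0.1), (1.0, 0.0), (1.1, 0.2)]
images = [[1.0, 0.0], [1.0, 0.2], [0.0, 1.0], [0.1, 1.0]]

# "Draw a circle" around the left-hand cluster and sum its members.
members, class_sum = sum_class_in_circle(coords, images,
                                         centre=(-1.0, 0.0), radius=0.3)
print(members, class_sum)
```

A circle can only be drawn on a two-dimensional map, which is one way to see why selection by eye breaks down once more than a few factorial axes matter.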

Jean-Pierre was a wild guy, bursting with energy, full of ideas and projects, always burning the candle at both ends. I have vivid memories of stumbling out of a bar, after a late evening session at work and an even longer night out, roaring with laughter along with JP and Vicky, Joachim’s programmer who later became JP’s wife. JP was always exuberant in his love for life and all its adventures. Jean-Pierre, having seen the success of correspondence analysis in single-particle electron microscopy, became directly interested in EM image processing. He changed fields and moved to the University of Texas Medical Center in Houston, Texas, where he developed the SUPRIM image processing system for electron microscopy [

The last time we spoke, already many years ago, was about the time he left Houston to return to France. We spoke of his health issues and about his plans to leave the single-particle EM field, again. JP was never afraid of drastic decisions. To this day, I am sorry we did not include JP as a co-author of the first papers that resulted [

The beautiful portrait of JP by Jan Galligan (

I am grateful to Jan Galligan and Robert Rej for letting me use their pictures, and for their support and constructive suggestions (Marin van Heel; 2009).
