Test problem simulations are presented for the matrix equation D = BF, which is equivalent to least squares data fitting, where the matrices are rectangular and D and F are experimental data. The chosen application is finding the fractions of secondary structures of proteins from circular dichroism (CD) spectra, employing singular value decomposition (SVD) to obtain the matrix B and its pseudo-reciprocal. In practice, the first step of the analysis is to select the reduced noise representation of the CD spectra, which sets the bounds for subsequent computation and for the development of the CD noise spectra. The outcome of the analysis is the set of structural component spectra summarizing the database, together with all the structural fractions and their uncertainties. The database noise spectrum can then be used to prepare the reduced noise CD spectrum of a new protein and so yield its structural fractions and their uncertainties.
Mathematical modeling of scientific data is at the foundation of modern scientific exploration of natural phenomena. Usually a model of the phenomenon is hypothesized and translated into quantifiable variables and compared with the related laboratory measurements [
Initially the problem was attacked using the techniques that were current, namely constructing the spectral functions describing the secondary structures presumed to be present and responsible for the CD spectra. The quality and quantity of CD spectra have improved with each passing year; however, the care required in obtaining high quality experimental data has not lessened. To maximize the usefulness of experimental data, not only should each laboratory take great care in obtaining it, but a generally acceptable protocol should be employed to ensure that the quality of the data is easily documented. The results from the analysis of quality data are made credible by presenting output obtained by treating selected test problems with the analytical algorithms. The protocol must be comprehensive enough to ensure reproducible measurement results for reproduced materials and environments. Only then can confidence be drawn from the implications of the analytical results. In short, there is no substitute for precise data. After the necessary database is complete with calibrated and validated data, the analytical reduction of the data into model parameters follows. One of the first steps in the development or testing of analytical modeling is the design of test problems. Adequate test problems are those which simulate the real database and demonstrate or quantify the prediction capability of the algorithms employed.
Deriving protein secondary structure content from CD spectra has changed from postulating component spectra of the characteristic structures to the use of the secondary structure fractions from X-ray data as a basis for the mathematical prediction of the component spectra representation of the CD database. The mathematical technique used is well developed [
Simulation of a solution strategy helps to validate its use and to define its limitations and instability. The solution strategy in operation here is to start with two independent experimental measurements on a group of proteins and to combine these measurements to uncover fundamental structural factors that are common to all the proteins. The two experimentally measured quantities used in this case are, for each protein, the fractions of secondary structures as measured by X-ray crystallography and its calibrated CD spectrum. Thus,
$$D = BF \tag{1}$$
where:
D = CD spectra (calibrated);
B = the basis function (fundamental factors), to be determined;
F = the secondary structure fractions (X-ray results).
Generally, the variables are rectangular matrices, reflecting that the CD spectrum for each protein is a digital listing and that the number of proteins must be greater than or equal to the number of secondary structure fractions. The rank of the CD data, D, must agree with the rank of the fraction data, F; this is achieved through the number of singular values retained. The resulting data matrix is the reduced matrix, D+. Then
$$B = D^{+}F^{*}$$
where F* is the pseudo-inverse of F.
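As an illustration of the relation D = BF and its solution for B, the following Python (NumPy) sketch builds a small toy database and recovers B from the reduced matrix and the pseudo-inverse of F. The random matrices, the seed, and the names `reduce_rank`, `B_true`, `D_plus`, and `F_star` are assumptions made for this example; they are not the authors' Fortran routines or data.

```python
import numpy as np

def reduce_rank(D, k):
    """Rank-k truncated-SVD approximation of D (plays the role of D+)."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Hypothetical toy data with the dimensions used in the text.
rng = np.random.default_rng(0)
n_wavelengths, n_proteins, n_structures = 81, 20, 4
B_true = rng.normal(size=(n_wavelengths, n_structures))  # known basis functions
F = rng.random(size=(n_structures, n_proteins))
F /= F.sum(axis=0)                 # fractions sum to one for each protein
D = B_true @ F                     # D = B F

D_plus = reduce_rank(D, n_structures)   # keep as many singular values as structures
F_star = np.linalg.pinv(F)              # pseudo-inverse of F, shape (20, 4)
B = D_plus @ F_star                     # B = D+ F*

max_error = np.abs(B - B_true).max()    # recovery error for exact (noise-free) data
```

With noise-free input the recovered B agrees with `B_true` to roughly machine precision, which is the behaviour a test problem of this kind is meant to confirm.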
This problem can be reformulated from a different, though related, point of view. That is,
$$F = XD^{+} \tag{2}$$
where
X = projector functions (a.k.a.: generalized inverse functions [
D+ = CD spectra (calibrated), reduced;
F = the fractions (X-ray results).
And
$$X = FD^{*}$$
where D* is the pseudo-inverse of D+.
The relationship between the two points of view is that X and B are pseudo-reciprocal to each other. Equation (1) can be used to find the basis functions, B, which characterize the connection between matrices D+ and F. Once B is defined it will be used with other spectra of proteins to find corresponding values of the secondary structure fractions and their confidence intervals or uncertainties. Correspondingly, Equation (2) could be solved to obtain the matrix X to project the fractional values from the spectrum of a protein from the reduced database. By finding B, the pseudo-reciprocal of X, the solution by Equation (1) leads directly to finding the confidence intervals for the fractions obtained by function fitting to the reduced CD data.
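The pseudo-reciprocal relationship between X and B can be checked numerically. The sketch below, in Python (NumPy), computes both from the same toy database and tests that XB is close to the identity and that X projects the spectra onto the fractions. The toy data and the variable names are assumptions for illustration; the pseudo-inverse of the reduced matrix is formed explicitly from the retained singular values.

```python
import numpy as np

# Hypothetical toy database: 81 wavelengths, 20 proteins, 4 structures.
rng = np.random.default_rng(1)
B_true = rng.normal(size=(81, 4))
F = rng.random(size=(4, 20))
F /= F.sum(axis=0)
D = B_true @ F

k = 4                                         # singular values retained
U, s, Vt = np.linalg.svd(D, full_matrices=False)
U, s, Vt = U[:, :k], s[:k], Vt[:k, :]

D_plus = (U * s) @ Vt                         # reduced database D+
D_star = Vt.T @ np.diag(1.0 / s) @ U.T        # pseudo-inverse of D+, shape (20, 81)

X = F @ D_star                                # projectors, shape (4, 81)
B = D_plus @ np.linalg.pinv(F)                # basis functions, shape (81, 4)

reciprocity_error = np.abs(X @ B - np.eye(k)).max()   # X B should be the identity
projection_error = np.abs(X @ D_plus - F).max()       # X D+ should return F
```

Forming the pseudo-inverse from only the retained singular values avoids inverting the numerically negligible ones, which would otherwise dominate the result.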
There is another equivalent point of view for solution of this model. Transforming to physically meaningful representation from the SVD formulation [
$$D^{+} = USV' \tag{3}$$
where
U = column orthogonal matrix;
S = diagonal matrix of singular values;
V' = row orthogonal matrix, transpose.
Introducing T, the rotation matrix,
$$D^{+} = (UST)(T^{-1}V')$$
and equating the factors to the experimental factors; thus
$$B = UST, \qquad F = T^{-1}V'$$
from which T and then B follow.
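One way to carry out the rotation step is sketched below in Python (NumPy). It assumes that the rotation enters as D+ = (UST)(T⁻¹V') with F identified with T⁻¹V', so that T⁻¹ = FV because V'V is the identity; this reading of the text is an assumption, as are the toy data and variable names.

```python
import numpy as np

# Hypothetical toy database with a known basis.
rng = np.random.default_rng(2)
B_true = rng.normal(size=(81, 4))
F = rng.random(size=(4, 20))
F /= F.sum(axis=0)
D = B_true @ F

k = 4
U, s, Vt = np.linalg.svd(D, full_matrices=False)
U, s, Vt = U[:, :k], s[:k], Vt[:k, :]   # retained factors of D+ = U S V'

# Assumed identification: F = T^{-1} V'. Multiplying on the right by V
# gives T^{-1} = F V, since the retained rows of V' are orthonormal.
T_inv = F @ Vt.T                        # shape (4, 4)
T = np.linalg.inv(T_inv)

B = (U * s) @ T                         # B = U S T
rotation_error = np.abs(B - B_true).max()
```

Under this assumption the rotated factors reproduce the same B as the pseudo-inverse route, so the two formulations can be used to cross-check each other.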
The algorithms of SVD set the dimensions of the matrices after reducing the rank by choosing the number of singular values retained to the same number as the primary factors (secondary structure fractions). Since generally the matrices are rectangular, and therefore the subsequent factors are rectangular matrices, the techniques of finding pseudo-inverses are required [
To begin the simulations of this model one must choose the number of basis functions, the size of the spectral range of the database, the number of proteins, and their fractions of secondary structures. In order to validate and perfect the digital algorithms involved in the solution of the protein databases, it is valuable to formulate a simulation of the CD spectral portion of the database by selecting the basis functions possessing some of the characteristics for the structure. This is achieved by choosing well-defined, distinct functions defined over the same wavelength range anticipated for the laboratory setting. In this application 81 points in the wavelength interval for 20 spectra with four secondary structures were chosen.
The four basis functions chosen relate to the first four harmonic oscillator functions (SHM), which have some orthogonality character (that is orthogonality on + to − infinity, a one dimensional domain) though not complete on the finite domain of digital functions as shown in
After the matrix D is constructed using the functions B, as prescribed and proportioned by the matrix F, the algorithms of the solution are applied to the matrices D and F to obtain the matrices B and X corresponding to Equations (1)-(3). The solution algorithms should yield the basis functions, B, that are pseudo-reciprocal to X, the projector functions. The results of the simulations should yield the known basis functions, and a least squares fit to the spectra should yield the elements of the fractions matrix, F, and provide precision indices or standard deviations for the fractions on the constructed database. The solution for the projectors, when applied to the spectra, should yield the corresponding structural fractions of the spectrum. Further, X and B should prove to be pseudo-reciprocal.

Protein | I | II | III | IV | Protein | I | II | III | IV |
---|---|---|---|---|---|---|---|---|---|
1 | 1.0 | 0.0 | 0.0 | 0.0 | 11 | 0.2 | 0.4 | 0.2 | 0.2 |
2 | 0.0 | 1.0 | 0.0 | 0.0 | 12 | 0.4 | 0.2 | 0.2 | 0.2 |
3 | 0.0 | 0.0 | 1.0 | 0.0 | 13 | 0.15 | 0.15 | 0.15 | 0.55 |
4 | 0.0 | 0.0 | 0.0 | 1.0 | 14 | 0.15 | 0.15 | 0.55 | 0.15 |
5 | 0.3 | 0.3 | 0.3 | 0.1 | 15 | 0.15 | 0.55 | 0.15 | 0.15 |
6 | 0.3 | 0.3 | 0.1 | 0.3 | 16 | 0.55 | 0.15 | 0.15 | 0.15 |
7 | 0.3 | 0.1 | 0.3 | 0.3 | 17 | 0.33 | 0.33 | 0.33 | 0.0 |
8 | 0.1 | 0.3 | 0.3 | 0.3 | 18 | 0.33 | 0.33 | 0.0 | 0.33 |
9 | 0.2 | 0.2 | 0.2 | 0.4 | 19 | 0.33 | 0.0 | 0.33 | 0.33 |
10 | 0.2 | 0.2 | 0.4 | 0.2 | 20 | 0.0 | 0.33 | 0.33 | 0.33 |
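
The construction of the simulated database can be sketched in Python (NumPy) as follows. The fractions are transcribed from the table of virtual proteins; the basis uses the first four normalized harmonic oscillator functions. The grid `x` (81 points on [-4, 4]), the unit normalization, and the helper name `oscillator_function` are assumptions, since the text does not state the interval or the scaling the authors used.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermval

def oscillator_function(n, x):
    """n-th normalized harmonic oscillator function on the grid x."""
    coeffs = np.zeros(n + 1)
    coeffs[n] = 1.0                                   # select physicists' H_n
    norm = 1.0 / sqrt(2.0**n * factorial(n) * sqrt(pi))
    return norm * hermval(x, coeffs) * np.exp(-x**2 / 2.0)

x = np.linspace(-4.0, 4.0, 81)                        # assumed grid, 81 points
B = np.column_stack([oscillator_function(n, x) for n in range(4)])

# Fractions of structures I-IV for virtual proteins 1-20 (rows = proteins).
fractions = np.array([
    [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0],
    [0.3, 0.3, 0.3, 0.1], [0.3, 0.3, 0.1, 0.3],
    [0.3, 0.1, 0.3, 0.3], [0.1, 0.3, 0.3, 0.3],
    [0.2, 0.2, 0.2, 0.4], [0.2, 0.2, 0.4, 0.2],
    [0.2, 0.4, 0.2, 0.2], [0.4, 0.2, 0.2, 0.2],
    [0.15, 0.15, 0.15, 0.55], [0.15, 0.15, 0.55, 0.15],
    [0.15, 0.55, 0.15, 0.15], [0.55, 0.15, 0.15, 0.15],
    [0.33, 0.33, 0.33, 0.0], [0.33, 0.33, 0.0, 0.33],
    [0.33, 0.0, 0.33, 0.33], [0.0, 0.33, 0.33, 0.33],
])
F = fractions.T                                       # shape (4, 20)
D = B @ F                                             # simulated CD database, (81, 20)

F_singular_values = np.linalg.svd(F, compute_uv=False)
```

The singular values of this F come out as approximately 2.232 followed by 1.161 three times, which does not depend on the assumed grid because only the tabulated fractions enter.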
This type of scientific modeling can be developed on digital computers using any of many high-level computational compilers. We used Microsoft Fortran Powerstation Professional Development System, Version 1.0 for MS-DOS and Windows Operating Systems, 1993, U.S. patent No. 4,955,066, running on a Dell Dimension 2400 PC. Various subroutines were taken directly from references 1, 2, and 6 (some of these subroutines are available in C in more recent publications by these last authors), while the main programs were our adaptations written in the Fortran Powerstation above (very similar to Fortran 77). SVD was checked by running the equivalent generalized matrix inversion. The criteria for the selection of the virtual secondary structure fractions are described in the text and the legend of
The simulation database, D, is a matrix of 81 rows and 20 columns. Each column represents the CD spectrum of a virtual protein. The first step in the solution is to obtain the singular values for the data matrix, D. The results are shown in
Singular values | Residual std. dev. | Variance |
---|---|---|
96.828 | 3.295 | 0.552 |
50.390 | 2.225 | 0.149 |
50.396 | 1.866 | 0.149 |
50.395 | 1.358 | 0.149 |
Singular values | Residual std. dev. | Variance |
---|---|---|
1.161 | 0.336 | 0.149 |
2.232 | 0.358 | 0.552 |
1.161 | 0.360 | 0.149 |
1.161 | 0.360 | 0.149 |
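
The Variance column in the tables above is consistent with each squared singular value divided by the sum of the squared singular values. A short Python (NumPy) check, using the singular values listed in the first table, is given here; the variable names are illustrative, and the residual standard deviation column is not reproduced because the text does not state its formula.

```python
import numpy as np

# Singular values transcribed from the first table above.
singular_values = np.array([96.828, 50.390, 50.396, 50.395])

# Fraction of the total variance carried by each retained singular value.
squared = singular_values**2
variance_fraction = squared / squared.sum()

rounded = np.round(variance_fraction, 3)
```

The same formula applied to the second table (2.232, 1.161, 1.161, 1.161) gives the same fractions, since the two sets of singular values differ only by a common scale factor.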
Application of the pseudo-reciprocal basis functions (projectors) to the simulated CD spectra showed the secondary structure fractions for each virtual protein of
To further test simulations to the solution algorithms a linearly independent (LID) set of basis functions [
Application of these functions to the simulated spectra showed the structure fractions for each virtual protein as given in
In summary, the results are as follows. The pseudo-reciprocals were obtained using two algorithms, SVD and general matrix methods. The results agree to within the anticipated machine precision at double precision, and the standard deviations calculated for the solution functions obtained by the different approaches show the same high degree of agreement, again governed by machine precision. The singular values obtained for the decomposition of the D and F matrices indicate well-defined completeness of the simulation components and the adequacy of the algorithms used in the digital programs. In developing the simulations and the cross-checking nature of the solution strategy, the programming of the algorithms was maximized.
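A cross-check of this kind can be sketched in Python (NumPy) by computing a pseudo-inverse two ways and comparing. The text does not specify which general matrix method was used, so the right-inverse formula F'(FF')⁻¹ for a full-row-rank matrix is used here as a stand-in; the toy matrix and names are likewise assumptions.

```python
import numpy as np

# Hypothetical full-row-rank fractions matrix: 4 structures, 20 proteins.
rng = np.random.default_rng(4)
F = rng.random(size=(4, 20))
F /= F.sum(axis=0)

# Route 1: pseudo-inverse from the singular value decomposition.
U, s, Vt = np.linalg.svd(F, full_matrices=False)
pinv_svd = Vt.T @ np.diag(1.0 / s) @ U.T

# Route 2: general matrix method, valid when F has full row rank.
pinv_matrix = F.T @ np.linalg.inv(F @ F.T)

max_difference = np.abs(pinv_svd - pinv_matrix).max()
```

For a well-conditioned matrix the two routes should differ only at the level of double precision rounding, which is the agreement described in the summary.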

Structure | Fraction | Deviation factor |
---|---|---|
A (zero set) | 0.127E−15 | 0.201 S |
I | 0.300 | 0.366 S |
II | 0.300 | 0.230 S |
III | 0.300 | 0.305 S |
IV | 0.100 | 0.203 S |

S, sample variance = 0.200E−14.

Singular values | Residual std. dev. | Variance |
---|---|---|
107.75 | 3.382 | 0.6254 |
60.87 | 2.125 | 0.1997 |
49.68 | 1.492 | 0.1330 |
27.87 | 0.7512 | 0.0419 |

Structure | Fraction | Deviation factor times S |
---|---|---|
A (zero set) | 0.1135E−13 | 0.269 S |
I | 0.300 | 0.048 S |
II | 0.300 | 0.017 S |
III | 0.300 | 0.035 S |
IV | 0.100 | 0.035 S |

S, sample variance = 0.585E−14.

Any of the three formulations presented for finding the basis functions and/or projector functions, Equations (1)-(3), are adequate when employed on precisely generated test problems. Testing on precise data requires no choices to be made in regard to the number of singular values to be retained in view of the variance obtained for the CD data or the X-ray secondary structure fractions, as the number of singular values is set by the test problem. If one has access to and understanding of the computer code, then one can insert extra steps to produce results of other checks to the veracity of the analysis program. If not, then inputting designed virtual data and obtaining reasonable output will help to build confidence in the numerical solution.
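The fraction and deviation tables above have the form of output from a linear least squares fit of one spectrum to the basis functions plus a constant term (the "A (zero set)" row). A minimal Python (NumPy) sketch of such a fit follows. It assumes the standard linear least squares error formulas, a hypothetical random basis, and illustrative names; it is not claimed to reproduce the tabulated deviation factors.

```python
import numpy as np

# Hypothetical basis functions and a spectrum with known fractions.
rng = np.random.default_rng(3)
B = rng.normal(size=(81, 4))
f_true = np.array([0.3, 0.3, 0.3, 0.1])      # fractions of virtual protein 5
spectrum = B @ f_true

# Design matrix: a constant baseline column followed by the basis functions.
M = np.column_stack([np.ones(B.shape[0]), B])
params, _, _, _ = np.linalg.lstsq(M, spectrum, rcond=None)

residual = spectrum - M @ params
dof = M.shape[0] - M.shape[1]                # points minus fitted parameters
sample_variance = residual @ residual / dof

# Assumed standard formula: parameter std. dev. = factor * sqrt(sample variance).
deviation_factor = np.sqrt(np.diag(np.linalg.inv(M.T @ M)))
std_dev = deviation_factor * np.sqrt(sample_variance)

baseline, fitted_fractions = params[0], params[1:]
```

For exact input the baseline and the sample variance are zero to within rounding, as in the tables, so the reported deviations reflect machine precision rather than experimental noise.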
When processing real laboratory data the choice of the number of singular values retained is determined by the number of secondary structures expressed in the fractions data. The remaining choices in the steps to obtain the basis functions can be determined by choosing the set of functions that yield the smallest standard deviation of the function fitting of the reduced CD spectrum or spectra. Finally the old adage “garbage in, garbage out” prevails in the digital age. Thus high quality, calibrated, standardized and validated input data will give the high quality, reliable and reproducible output results.
The reward for the labor of algorithm development and precise data accumulation is realized by obtaining the basis set of functions for the database, reciprocal projectors, and the noise database. These results lead directly to finding the secondary structure fractions from the CD of a new protein. This is accomplished by using all the information available from the database and the algorithms used in the analysis. The mathematical description is
$$D = D^{+} + E$$
where
D = the CD database;
D+ = the reduced database for the retained singular values;
E = the noise database for the negligible singular values.
The rectangular matrix, E, the same size as D, contains all the unused information or noise present in the CD spectra and D+ contains all the useful information present in the CD spectra. Any new protein CD spectrum is composed of useful information and noise. What is desired is to remove the noise from the new spectrum and then to process the reduced spectrum by the projection or basis functions to obtain the secondary structure fractions present. This is done by finding the matrix E for the database as part of the first steps before finding the projectors, X, and the basis, B. There is a noise spectrum for each protein in the database in the matrix E and they are representative of the noise spectra of the proteins contained in the database. A simple average of the noise spectra should be subtracted from the CD spectrum of the target protein to obtain a reasonable approximation of its reduced spectrum. The reduced spectrum is then subjected to the database-derived projectors X or basis B to obtain the estimated secondary structure fractions and their uncertainties. Note that this technique works best when the database fits the criteria contained in the Introduction and the legend to
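The noise-removal procedure described above can be sketched in Python (NumPy) as follows: split the database into its reduced part and its noise part, average the noise spectra, subtract that average from a new spectrum, and apply the projectors. The toy data, the noise level, and all variable names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical noisy database: 81 wavelengths, 20 proteins, 4 structures.
rng = np.random.default_rng(5)
B_true = rng.normal(size=(81, 4))
F = rng.random(size=(4, 20))
F /= F.sum(axis=0)
noise_level = 1e-3                                    # assumed measurement noise
D = B_true @ F + noise_level * rng.normal(size=(81, 20))

k = 4                                                 # singular values retained
U, s, Vt = np.linalg.svd(D, full_matrices=False)
D_plus = (U[:, :k] * s[:k]) @ Vt[:k, :]               # reduced database
E = D - D_plus                                        # noise database, same size as D

# Projectors from the reduced database: X = F D*.
D_star = Vt[:k, :].T @ np.diag(1.0 / s[:k]) @ U[:, :k].T
X = F @ D_star

# New protein: subtract the average noise spectrum, then project.
f_new = np.array([0.25, 0.35, 0.15, 0.25])
new_spectrum = B_true @ f_new + noise_level * rng.normal(size=81)
mean_noise = E.mean(axis=1)
reduced_spectrum = new_spectrum - mean_noise
f_estimated = X @ reduced_spectrum
```

In this toy setting the estimated fractions stay close to `f_new` because the new spectrum is built from the same basis as the database, which mirrors the requirement that the database be representative of the target protein.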
In future studies we will obtain CD spectra and X-ray secondary structure fractions from published databases and subject selected lists of proteins to the analysis described here. We will describe the error estimation on the parameters of the model in relation to the significant precision of the input data. Our preliminary studies of the computed prediction of secondary structure fractions for a reduced CD and X-ray secondary structure database show that a single parameter adjustment of the CD spectra will rectify the ambiguity in the summation of the fractions to one per protein.
The authors thank Anne T. Tang for assistance in formatting and Dr. Lisa Alex for her support and encouragement.
David A. Haner, Patrick W. Mobley (2015) Simulations Relating to the Determination of Protein Secondary Structure Fractions from Circular Dichroism Spectra. Open Access Library Journal, 02, 1-10. doi: 10.4236/oalib.1101601