Sequence-Based Protein Crystallization Propensity Prediction for Structural Genomics: Review and Comparative Analysis

doi:10.4236/ns.2009.12012


Vol.1, No.2, 93-106 (2009) Natural Science

http://dx.doi.org/10.4236/ns.2009.12012


Lukasz Kurgan*, Marcin J. Mizianty

Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada.

*University of Alberta, ECERF, 9107 116 Street, Edmonton, Alberta, Canada; lkurgan@ece.ualberta.ca

Received 6 August 2009; revised 28 August 2009; accepted 30 August 2009.

ABSTRACT

Structural genomics (SG) is an international

effort that aims at solving three-dimensional

shapes of important biological macro-molecules

with primary focus on proteins. One of the main

bottlenecks in SG is the ability to produce dif-

fraction quality crystals for X-ray crystallogra-

phy based protein structure determination. SG

pipelines allow for certain flexibility in target

selection which motivates development of in-

silico methods for sequence-based prediction/

assessment of the protein crystallization pro-

pensity. We overview existing SG databanks

that are used to derive these predictive models

and we discuss analytical results concerning

protein sequence properties that were discov-

ered to correlate with the ability to form crystals.

We also contrast and empirically compare mo-

dern sequence-based predictors of crystalliza-

tion propensity including OB-Score, ParCrys,

XtalPred and CRYSTALP2. Our analysis shows

that these methods provide useful and complementary predictions. Although their average accuracy is similar at around 70%, we show that

application of a simple majority-vote based en-

semble improves accuracy to almost 74%. The

best improvements are achieved by combining

XtalPred with CRYSTALP2 while OB-Score and

ParCrys methods overlap to a larger extent,

although they still complement the other two

predictors. We also demonstrate that 90% of the

protein chains can be correctly predicted by at

least one of these methods, which suggests that

more accurate ensembles could be built in the

future. We believe that current protein crystalli-

zation propensity predictors could provide

useful input for the target selection procedures

utilized by the SG centers.

Keywords: Structural Genomics; X-Ray

Crystallography; Crystallization Propensity Prediction;

Protein Structure; Protein Crystallization

1. INTRODUCTION

Proteins are organic compounds composed of amino

acids arranged in a linear chain polymer with the help of

peptide bonds. Proteins implement a wide variety of

functions such as transportation, signalling, catalysis of

chemical reactions, formation of the cell cytoskeleton,

immune responses, regulation of cell processes, etc.

They are so versatile due to their ability to adopt an im-

mense variety of shapes. Knowledge of the tertiary

(three-dimensional) structure of proteins is vitally im-

portant for understanding and manipulating their bio-

chemical and cellular functions. For instance, this know-

ledge is exploited in rational drug design via virtual

screening [1-3], provides insights into various diseases

[4] and it is used in deciphering interactions of proteins

with other macro molecules and smaller ligands [5-7].

1.1. Structural Genomics

As of July 2009 we know close to 8.2 million nonredundant protein chains, which can be found in the RefSeq database [8], but the corresponding structure is known for

“only” about 55 thousand proteins which are deposited

into the Protein Data Bank (PDB) database [9]. This

wide and continually widening sequence-structure gap

calls for new and efficient efforts that would help in ac-

quiring protein structures. This resulted in creation of

structural genomics (SG) which is an international effort

to find the three-dimensional shapes of important bio-

logical macro-molecules, primarily focusing on proteins

[10]. In contrast to a traditional approach used by struc-

tural biologists who often work with a given protein that

they try to solve for many years, the structural genomics

efforts frequently concern “unknown” proteins. Moreover, SG focuses on development and usage of high

L. Kurgan et al. / Natural Science 1 (2009) 93-106

throughput and cost-effective methods for protein pro-

duction and determination of the corresponding structure

which are implemented with the help of dedicated SG

centers. In the United States one of the first SG efforts, undertaken around the year 2000, was the creation of the multi-center Protein Structure Initiative, which includes four large-scale centers and six specialized centers.

Similar SG projects were also carried out in Canada,

Israel, Japan, and Europe. For example, the Structural Genomics Consortium, which was formed in 2004, spans centers at the University of Oxford, the University of Toronto and the Karolinska Institute. Analysis shows that in 2004/2005 about half of protein structures were solved at SG centers rather than in traditional laboratories [11]. Also,

at that time the cost of solving a structure at the most

efficient SG center in the United States was equal to

about 25% of the estimated cost when using the tradi-

tional methods [11]. Another more recent study shows

that the production-line approach taken at the Protein

Structure Initiative centers reduced the cost of solving

structures from ~$250,000 apiece in 2000 to ~$66,000 in

2008 [12]. Most importantly, from our point of view,

these SG initiatives shifted the focus from one-by-one

determination of individual protein structures, which is

being pursued by structural biologists, to protein fam-

ily-directed structure analyses in which a group of pro-

teins is targeted and structure(s) of representative mem-

bers are determined and used to represent the entire

group [13]. The corresponding process of choosing rep-

resentative proteins is known as target selection and it

encompasses a computational process of restricting can-

didate proteins to those that are tractable and of un-

known structure and prioritizing them according to ex-

pected interest and accessibility [14]. In the case of the

Protein Structure Initiative, the target selection concen-

trates on representatives from large, structurally unchar-

acterized protein domain families, and from structurally

uncharacterized subfamilies in very large and diverse

families with incomplete structural coverage [15]. We

note that this approach allows for some flexibility in the

selection of the targets.

1.2. X-ray Crystallography and Protein

Crystallization

The protein structures are being determined with the

help of experimental methods including X-ray crystal-

lography [16], NMR spectroscopy [17], electron mi-

croscopy [18], and (more recently) by application of

computational approaches such as homology modelling

[19, 20]. The most popular method, which accounts for

approximately 86% of the solved and deposited protein

structures, is the X-ray crystallography; see Figure 1. At

the same time, the other approaches play a strong com-

plementary role for some protein types, such as mem-

brane proteins [21, 22].

One of the main challenges the SG initiative faces is

that only about 2-10% of protein targets pursued in the

context of the second step of the Protein Structure Initia-

tive yield high-resolution protein structures [23]. We

further investigated these estimates based on data pub-

lished in the TargetDB database [24] in July 2009. Tar-

getDB is a world-wide database that provides informa-

tion on the experimental progress and status of targets

[Figure 1 plot: cumulative number of solved structures (vertical axis, 5000 to 55000) by year (horizontal axis, 1990 to 2009), with separate curves for X-ray crystallography, NMR spectroscopy and electron microscopy.]

Figure 1. The growth in the number of protein structures deposited into PDB that were solved by X-ray crystallography, NMR spectroscopy and electron microscopy (source http://www.rcsb.org/).


selected for structure determination. Among 150,727

cloned targets that were deposited into TargetDB, only

37,398 (24.8%) were reported to be successfully purified,

12,923 (8.6%) to be successfully crystallized, and 6,942

(4.6%) gave diffraction quality crystals. Moreover, some

estimates show that more than 60% of the cost of struc-

ture determination is consumed by the failed attempts

[25] while crystallization is characterized by a signifi-

cant rate of attrition and is among the most complex and

least understood problems in structural biology [26]. The

above provides a strong motivation for further research

and development in this area. Several strategies have

been proposed to improve the success rate including

obtaining one representative structure per protein family

and working with multiple orthologues [14, 26, 27, 28].

In spite of advances made in the context of protein crys-

tallization [29], the above numbers and insights from

some researchers [30-32] demonstrate that the produc-

tion of high-quality crystals is one of the major bottle-

necks in the protein structure determination. The crystals

should be sufficiently large (> 50 micrometres), pure in

composition, regular in structure and with no significant

internal imperfections. The problem of production of

diffraction-quality crystals is usually tackled using an

empirical approach based mainly on trial and error (also

called the “art” of crystallisation), in which a large

number of experiments is brute-forced to find a suitable

setup, and through understanding of the fundamental

principles that govern crystallisation [30]. The latter is

used to design new (and improved) experimental meth-

odologies that would produce high-quality crystals.
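As a quick arithmetic check of the attrition figures quoted above for TargetDB, the following minimal sketch recomputes each stage as a percentage of the cloned targets. The counts are taken from the text; the variable and function names are illustrative only.

```python
# Target counts reported in the text (TargetDB, July 2009).
cloned = 150_727
stages = {
    "purified": 37_398,
    "crystallized": 12_923,
    "diffraction-quality crystals": 6_942,
}


def percent_of(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100.0 * count / total, 1)


# Each stage expressed relative to all cloned targets.
funnel = {stage: percent_of(n, cloned) for stage, n in stages.items()}

for stage, pct in funnel.items():
    print(f"{stage}: {pct}% of cloned targets")

# Share of crystallized targets that went on to give diffraction-quality crystals.
crystal_to_diffraction = percent_of(
    stages["diffraction-quality crystals"], stages["crystallized"]
)
print(f"diffraction-quality among crystallized: {crystal_to_diffraction}%")
```

The recomputed values agree with the percentages reported in the text (24.8%, 8.6% and 4.6%). The last ratio is not stated in the text; it is derived from the same counts and indicates that roughly half of the crystallized targets gave diffraction-quality crystals.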

1.3. Databases

One of the early steps taken to alleviate the abovemen-

tioned difficulties in resolving the structures via X-ray

crystallography was to create databanks that record in-

formation concerning both successful and failed attempts

to produce the structures. The importance of these efforts

was advocated in 2000 by Raymond Stevens who said

that “industrial-scale efforts will lead to the generation

of knowledge bases that will be mined to expand our

understanding of the techniques used in protein crystal-

lography. These efforts will act as ‘learning factories’, in

which successes and failures will be used to continually

improve the technology for high-throughput protein crystallography” [33].

These words were echoed in 2003 by Rodrigues and

Hubbard who said “as structural genomics projects

evolve, valuable experimental data will be accumulated,

thus presenting researchers with a unique opportunity to

establish improved predictive methods for a protein’s

chemical and physical behaviour based on its amino acid

sequence. It is essential for laboratories producing such

data to keep track of both ‘successful’ and ‘unsuccessful’

results, so that these can be fed back into the structural

determination pipeline through the improvement of the

target selection procedures” [34]. The development of the

databases was fuelled by generation of large and well annotated experiments by SG centers, such as one for the Thermotoga maritima proteome [35]. To the best of our

knowledge, the first such initiative was the PRESAGE

database which included annotations indicating current

experimental status, structural predictions and suggestions

[36]. Some of the SG consortia have established on-line

progress reports which contain details and current ex-

perimental status of their targets. Examples include Inte-

grated Consortium Experimental Database [37], ZebaView

(http://www-nmr.cabm.rutgers.edu/bioinformatics/ZebaView/),

ReportDB (http://www.secsg.org/cgi-bin/report.pl) and

SPINE (Structural Proteomics in the NorthEast) [38, 39].

SPINE, which was developed in early 2000 and reengi-

neered in 2003, integrates a tracking database and a data

mining method for identifying feasible targets. Each

protein deposited in this database is described with in-

formation related to the experimental progress (e.g., ex-

pression level, solubility, ability to crystallize) and 42

descriptors of the underlying protein sequence (amino

acid composition, secondary structure, etc.). The largest and most comprehensive database, TargetDB [24], was launched in July 2001 and builds upon the work on the PRESAGE database. TargetDB serves as a primary target registration database for SG projects worldwide. It consolidates data from 28 SG centers in the USA, Canada, Germany, Israel, Japan, France and the UK, including 9

Protein Structure Initiative centers. PepcDB (Protein

Expression Purification and Crystallization DataBase),

which was created around 2004, was established as an

extension to TargetDB to collect more detailed status

information and the experimental details of each step in

the protein structure production pipeline [40]. This da-

tabase stores a complete history of the experimental

steps in each production trial besides describing the cur-

rent target production status. PepcDB records status his-

tory, stop conditions, reusable text protocols and contact

information collected from 15 SG centers in USA. The

interested readers are directed to two recent articles by

Helen Berman that introduce a wealth of resources con-

cerning the SG initiative [41] and a knowledgebase de-

veloped by the Protein Structure Initiative [42].

1.4. Computational Models in Protein

Crystallization

The problems with the protein crystallization and the

availability of the suitable databases motivated the de-

velopment of analytical and predictive models that can

be used to either support or directly predict protein crys-

tallization [43]. These models were often developed by

researchers at certain SG centers who used their own

data to draw conclusions. In one of the first attempts, a


decision tree that predicts solubility from protein se-

quence was developed [44]. The SPINE system, which

was developed at the Northeast Structural Genomics

Consortium, incorporates decision tree-based classifiers

for solubility and crystallization propensity. This system

was used to extract a few interesting rules, for example that

soluble proteins tend to have more acidic residues and

fewer hydrophobic segments [38]. The SG project on

Plasmodium falciparum has led to an analysis of protein characteristics, such as the presence of transmembrane helices, low-complexity regions, and coiled-coil

regions, in the context of the crystallization propensity

[34]. Another decision tree-based predictive model de-

veloped by Goh and colleagues in 2004 using data from

TargetDB has revealed several protein features that in-

fluence the feasibility of using a given target protein

chain for a high-throughput structure determination [45].

They include conservation of the sequence across organ-

isms, composition of charged residues, occurrence of

hydrophobic patches in the sequence, number of binding

partners, and chain length. Based on the data from the

Thermotoga maritima proteome [35], the researchers at

the Joint Center for Structural Genomics discovered a

few features, which include isoelectric point, sequence

length, average hydropathy, existence of low complexity

regions, presence of signal peptides and trans-membrane

helices, that correlate with crystallization [46]. The

isoelectric point calculated from the protein sequence

was also used to develop a method that suggests optimal

pH ranges for crystallization screening [47, 48]. Ex-

perimental work by Derewenda’s group shows that crys-

tallization can be improved by application of surface

entropy reduction approach in which clusters of two or

three exposed amino acids with high conformational

entropy side chains (such as Lys, Glu and Gln) are re-

placed with lower-entropy residues (like Ala) [49-54].

One drawback of this method is that it may decrease

protein solubility which hinders crystallization screening

[50, 52]. The surface entropy reduction approach was

recently implemented as a web server [55]. This server

utilizes information concerning conformational entropy

and solvent exposure indices, predicted secondary struc-

ture, residues conservation scores, and close homologues

to propose crystallization enhancing mutations for a

given protein sequence. Another study, which was con-

ducted at the Center for Eukaryotic Structural Genomics,

used disorder prediction algorithms to analyze the im-

pact of intrinsic protein disorder on crystallization effi-

ciency [56]. The Berkeley Structural Genomics Center

has utilized several protein features including length of

the sequence and predicted transmembrane helices,

coiled coils, and low-complexity regions to eliminate

targets predicted to be intractable for the high-through-

put structure determination [57]. The most recent study

that was performed at the Northeast Structural Genomics

Consortium shows that crystallization propensity de-

pends primarily on the prevalence of well-ordered sur-

face epitopes [58]. More specifically, the authors show

that crystallization propensity can be computed from the

knowledge of predicted disordered regions, side-chain

entropy of predicted exposed residues, the amount of

predicted buried Gly and the fraction of Phe in the input

sequence.

2. SEQUENCE-BASED METHODS FOR PREDICTION OF PROTEIN CRYSTALLIZATION PROPENSITY

The SG efforts allow for certain flexibility in selection

of the chains for the crystallization and the subsequent

structure determination and this motivates development

of methods that aim at the prediction/assessment of the

crystallization propensity for a given input sequence.

Such methods could be incorporated into target selection

pipelines that are utilized by SG centers. Their development is often supported and motivated by the computational analyses/models described above. We also note that

numerous studies have already demonstrated that se-

quence-based prediction approaches, which may address

a variety of structural and functional properties of pro-

teins, provide useful information and insights for both

basic research and drug design and hence are widely welcomed by the scientific community [59-63].

Crystallization propensity prediction methods incor-

porate predictive models that are extracted from larger

datasets that span data coming from multiple SG centers

and they take the protein sequence as their only input.

The underlying principle is that the predictive models

summarize/describe patterns (similarities) hidden in the

data from databases such as TargetDB. This is done by

generating a set of patterns that describe sequences that

can be crystallized (crystallizable proteins) and another

set of patterns for sequences that were shown to be im-

possible to crystallize (noncrystallizable proteins). The

two sets of patterns should describe the two corresponding sets of protein chains and, at the same time, each of

them should exclude sequences from the other set. The

existing crystallization propensity predictors include

SECRET [64] that was developed by Frishman’s group,

OB-Score [65] and ParCrys [66] that were produced by Barton’s group, XtalPred [67, 68] that came from Godzik’s group, and CRYSTALP [69] and the more recent

CRYSTALP2 [70] that were developed by Kurgan’s

group. These methods perform the prediction in two

steps: (1) the input sequence is converted into a set of

numerical features that describe certain characteristics of

the sequence; and (2) the feature values are fed into a

predictive model that outputs the outcome that quantifies

propensity for crystallization. The predictive model encapsulates the patterns that are computed from the information encoded by the features. Table 1 shows a side-by-side comparison of the six existing methods based on the data source that was used to generate the predictive model and the applied input features and predictive models. It also provides URLs of the corresponding web servers or web pages.

Table 1. A side-by-side comparison of existing methods for sequence-based protein crystallization propensity prediction.

| Method [reference] | Source of data | Input features: description | Input features: # | Predictive model | Web server/page | Notes |
|---|---|---|---|---|---|---|
| SECRET [64] | Depositions from PDB assuming that NMR-only solved proteins are difficult/impossible to crystallize; depositions in TargetDB | Content of mono-, di-, and tripeptides represented by 20-letter amino acid alphabet and by several reduced alphabets grouped by physicochemical and structural properties of amino acids | 103 | Two-layered structure where outputs of several support vector machine classifiers are combined by a second-level Naive Bayes classifier | http://mips.helmholtz-muenchen.de/secret/ | Limited to sequences between 46 and 200 amino acids |
| OB-Score [65] | Depositions in TargetDB | Isoelectric point and average hydrophobicity | | Z-score (two-dimensional lookup-table) | http://www.compbio.dundee.ac.uk/xtal/ | |
| CRYSTALP [69] | Depositions from PDB assuming that NMR-only solved proteins are difficult/impossible to crystallize | Content of selected mono-, di- and collocated dipeptides | 46 | Naive Bayes | N/A | Limited to sequences between 46 and 200 amino acids |
| XtalPred [67, 68] | Depositions in TargetDB | Protein length, molecular mass, gravy and instability indices, extinction coefficient, isoelectric point, content of Cys, Met, Trp, Tyr, and Phe residues, insertions in the alignment compared to homologs in non-redundant protein sequences database, predicted secondary structure, predicted disordered, low-complexity and coiled-coil regions, predicted transmembrane helices and signal peptides | 9 | Normalized product | http://ffas.burnham.org/XtalPred/ | Outputs 1 of 5 crystallization classes: optimal, suboptimal, average, difficult, and very difficult |
| ParCrys [66] | Depositions in TargetDB and PepcDB | Isoelectric point and average hydrophobicity, content of Ser, Cys, Gly, Phe, Tyr, and Met residues | | Kernel-based classifier using Parzen window | http://www.compbio.dundee.ac.uk/xtal/ | |
| CRYSTALP2 [70] | Depositions in TargetDB and PepcDB | Isoelectric point, average hydrophobicity, content of selected mono-, di- and collocated di- and tripeptides | | Normalized Gaussian radial basis function network | http://biomine.ece.ualberta.ca/CRYSTALP2/CRYSTALP2.html | |
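The two-step scheme shared by these predictors (sequence to numerical features, features to a predictive model) can be illustrated with a minimal sketch. It is not the implementation of any of the reviewed methods. The Kyte-Doolittle hydropathy scale is an assumed choice of hydrophobicity scale, the feature names are hypothetical, and the model is left as an arbitrary callable because the trained models themselves are not described in enough detail here to reproduce.

```python
from typing import Callable, Dict

# Kyte-Doolittle hydropathy scale. Assumption: the reviewed predictors may
# rely on a different hydrophobicity scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def featurize(sequence: str) -> Dict[str, float]:
    """Step 1: convert a protein sequence into numerical features."""
    seq = sequence.strip().upper()
    if not seq or any(aa not in KYTE_DOOLITTLE for aa in seq):
        raise ValueError("sequence must contain only the 20 standard amino acids")
    n = len(seq)
    features = {
        "length": float(n),
        "avg_hydropathy": sum(KYTE_DOOLITTLE[aa] for aa in seq) / n,
    }
    # Content of the residues listed for ParCrys in Table 1
    # (Ser, Cys, Gly, Phe, Tyr, Met), as fractions of the chain.
    for aa in "SCGFYM":
        features[f"frac_{aa}"] = seq.count(aa) / n
    return features


def predict(sequence: str, model: Callable[[Dict[str, float]], float]) -> float:
    """Step 2: feed the feature values into a predictive model."""
    return model(featurize(sequence))


def toy_model(features: Dict[str, float]) -> float:
    """Hypothetical placeholder with no predictive value: scores 1.0 when
    the average hydropathy is negative and 0.0 otherwise."""
    return 1.0 if features["avg_hydropathy"] < 0 else 0.0
```

Separating `featurize` from the model mirrors the description above: the same feature vector could be passed to a lookup table, a kernel-based classifier or any other model without changing the first step.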

Two early methods, namely SECRET and CRYSTALP,

accept only sequences between 46 and 200 amino acids

in length. This limitation is due to the composition of

datasets used to generate these prediction models. Although the OB-Score predictor does not impose a limit on sequence size, it considers only two predictive features, i.e., isoelectric point and hydrophobicity. This method was developed for the Scottish Structural Proteomics Facility [65]. The ParCrys method extends OB-Score by

using an advanced kernel-based classification algorithm

and by adding information concerning content of several

amino acids including Ser, Cys, Gly, Phe, Tyr, and Met

to the set of predictive features. Similarly, CRYSTALP2

improves upon CRYSTALP by applying a more ad-

vanced kernel-based classifier and by introducing new

predictive features that are based on the collocation of

amino acids in the sequence, isoelectric point and hy-

drophobicity. The motivation for the application of the

collocation based features comes from their applications

in related fields [71-74] and the fact that they consider

local neighbourhood information in the protein chain,

which was also utilized in a recent method for surface

entropy reduction based design of crystallizable protein

variants [55]. A significant majority of the collocations

used by CRYSTALP2 incorporate residues with high

conformational entropy, or with low entropy and high

potential to mediate crystal contacts, and these residues

are utilized by the surface entropy reduction methods [51,

52].

The above five methods are built using black-box (not

readable by a human) classification models, which are

inductively learned from a set of protein chains which

are annotated as crystallizable and noncrystallizable. By

contrast, XtalPred is a white-box (human readable)

approach that combines probabilities of successful crys-

tallization calculated from several protein features. This

method, which was developed based on experiences at

the Joint Center for Structural Genomics, which is one of

the large centers in the Protein Structure Initiative, mim-

ics the work performed by structural biologists. XtalPred

utilizes nine biochemical and biophysical features of an

input protein with probability distributions estimated

from data from TargetDB. The individual probabilities

concerning each input feature are combined into a single

crystallization score which is used to assign one of five

crystallization classes: optimal, suboptimal, average,

difficult, and very difficult. The design of XtalPred

shows that medium sequence length and hydrophobicity

combined with acidic character improve the success in

protein production. It also demonstrates that very short,

very long, or very hydrophobic proteins are more diffi-

cult to crystallize under standard experimental setups.

This method also confirms the utility of predicted struc-

tural disorder, presence of transmembrane helices, insta-

bility, and high content of predicted loops, insertions,

and coiled-coil structures for the prediction of the crys-

tallization propensity [67]. Several methods, including

XtalPred, OB-Score, ParCrys and CRYSTALP2, utilize

information concerning isoelectric point, which is estimated from the protein sequence. This agrees with prior findings that indicate the important role of this feature [46-48].
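To make concrete how an isoelectric point can be estimated from the sequence alone, the sketch below finds the pH at which the net charge of the chain is zero, using the Henderson-Hasselbalch relation and bisection. This is a generic textbook approach and not necessarily the procedure used by any of the reviewed predictors. The pKa values are illustrative assumptions; published pKa sets differ slightly and would shift the result.

```python
# Illustrative side-chain and terminal pKa values (assumption: other
# published pKa sets differ by a few tenths of a pH unit).
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
PKA_N_TERMINUS = 9.0
PKA_C_TERMINUS = 2.0


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of the chain at a given pH (Henderson-Hasselbalch)."""
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERMINUS))
    negative = 1.0 / (1.0 + 10 ** (PKA_C_TERMINUS - ph))
    for aa in sequence.upper():
        if aa in PKA_POSITIVE:
            positive += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            negative += 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return positive - negative


def isoelectric_point(sequence: str, tolerance: float = 1e-4) -> float:
    """Bisection on pH in [0, 14]; the net charge decreases monotonically
    with pH, so the zero crossing is unique."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0
```

A chain rich in Asp and Glu yields an acidic estimate, whereas a chain rich in Lys and Arg yields a basic one, which is the kind of sequence-derived signal that the predictors discussed above exploit.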

We note that all investigated crystallization propensity

predictors take into account only intra-molecular factors

that are encoded in the protein chain. This means that

they may not provide reliable predictions when in-

ter-molecular factors such as protein-protein and/or pro-

tein-precipitant interactions, buffer composition, pre-

cipitant diffusion method, etc. must be considered. Also,

they are limited to predictions for non-redundant chains

and should not be used when assessing crystallization of

homologues. In the latter case we recommend the use of

the surface entropy reduction server [55].

3. COMPARATIVE ANALYSIS

Below we perform an empirical comparison of the quality of predictions offered by the sequence-based protein

crystallization propensity predictors. Our analysis ex-

cludes CRYSTALP and SECRET methods since they are

limited to only relatively small chains and since their

quality was shown to be inferior when compared with other methods [66, 70]. Our comparative analysis is based on predictions performed for a dataset of

relatively recent depositions to TargetDB and PepcDB.

We analyze predictive power of individual methods and

we also investigate their complementarity.

3.1. Dataset

We use a dataset composed of 2000 protein chains

(hereafter TEST-NEW), which was originally introduced

in [70] and which was developed using the procedure proposed in [66]. The crystallizable proteins were extracted

from sequences deposited in TargetDB and they include

the last 1000 depositions as of December 2008. The non-

crystallizable sequences, which correspond to the actual

construct sequences used, were extracted from the trial

sequences stored in PepcDB. As in the case of crystal-

lizable chains, they include the last 1000 depositions as

of December, 2008. The selected sequences were also

processed to remove the N-terminal hexaHis tag and

LEHHHHHH tag at the C-terminus, which are intro-

duced to ease the purification. Duplicate sequences were

removed and thus the resulting dataset consists of non-redundant chains. It can be freely downloaded from http://biomine.ece.ualberta.ca/CRYSTALP2/CRYSTALP2.html.

Table 2. Summary of results for predictions performed with OB-Score, ParCrys, XtalPred and CRYSTALP2 methods on the TEST-NEW dataset.

| Method | Accuracy (%) | MCC | TPR | TNR | AROC |
|---|---|---|---|---|---|
| OB-Score¹ | 69.8 | 0.42 | 0.86 | 0.54 | 0.74 |
| ParCrys¹ | 70.6 | 0.43 | 0.83 | 0.58 | 0.75 |
| XtalPred² | 70.0 | 0.40 | 0.76 | 0.64 | 0.76 |
| CRYSTALP2³ | 69.3 | 0.39 | 0.76 | 0.63 | 0.74 |

¹ Results computed using the ParCrys/OB-Score server at http://www.compbio.dundee.ac.uk/xtal/
² Results computed using the XtalPred server at http://ffas.burnham.org/XtalPred/
³ Results based on [70]
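The tag removal and duplicate filtering described for the TEST-NEW dataset could be sketched as follows. The C-terminal `LEHHHHHH` tag is taken from the text. The exact form of the N-terminal hexaHis tag is not given, so the pattern used here (an optional initiator Met followed by six His) is an assumption, as are the function names.

```python
import re
from typing import Iterable, List

# C-terminal tag as named in the text.
C_TERMINAL_TAG = re.compile(r"LEHHHHHH$")
# Assumption: the N-terminal hexaHis tag is modelled as an optional leading
# Met followed by six His; real constructs may carry longer linkers.
N_TERMINAL_HEXAHIS = re.compile(r"^M?H{6}")


def strip_tags(sequence: str) -> str:
    """Remove the purification tags from both ends of a construct sequence."""
    seq = sequence.strip().upper()
    seq = N_TERMINAL_HEXAHIS.sub("", seq)
    seq = C_TERMINAL_TAG.sub("", seq)
    return seq


def deduplicate(sequences: Iterable[str]) -> List[str]:
    """Strip tags, then keep the first occurrence of each distinct sequence."""
    seen = set()
    unique = []
    for raw in sequences:
        cleaned = strip_tags(raw)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique
```

Stripping the tags before comparing sequences matters in this sketch: two constructs that differ only in their purification tags would otherwise be retained as distinct entries.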

3.2. Quality Measures

The annotations from TargetDB were stripped from the

input sequences, which in turn were inputted into the

corresponding predictors. The prediction outputs were

compared with the original annotations to assess the

prediction quality. Four potential prediction outcomes

are possible: TP (true positive) which corresponds to

crystallizable chains that were correctly predicted as

crystallizable, FN (false negative) which corresponds to

crystallizable chains that were incorrectly predicted as

noncrystallizable, FP (false positive) which indicates that

noncrystallizable chains were incorrectly predicted as

crystallizable, and TN (true negative) which denotes

cases where noncrystallizable chains were correctly pre-

dicted as noncrystallizable. The predictions were as-

sessed based on the following quality indices:

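$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

$$\text{TPR} = \frac{TP}{TP + FN}$$

$$\text{TNR} = \frac{TN}{TN + FP}$$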

The accuracy measures the fraction of correct predic-

tions among all predictions. The Matthews Correlation

Coefficient (MCC) is confined to the [-1, 1] interval. If the

MCC value is close to 0 then the prediction method is

not better than a random classification. Higher MCC

value corresponds to better performance of the predic-

tion method. These two measures provide an evaluation

of the prediction quality over the entire dataset. In con-

trast, TPR (true positive rate) and TNR (true negative

rate) evaluate the quality separately for crystallizable

(positive) and noncrystallizable (negative) proteins.

TPR/TNR quantifies the fraction of correctly predicted

crystallizable/noncrystallizable proteins. We also report receiver operating characteristic (ROC) curves that plot the TP rate = TP / (TP + FN) against the FP rate = FP / (FP + TN). This is performed by

thresholding the confidence values (probabilities) that

are generated together with the predicted classes (crys-

tallizable vs. noncrystallizable). These plots are also

used to compute the area under the ROC curve (AROC).

The higher the AROC value is the better the predictive

power of the corresponding method.
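The quality indices defined above can be computed directly from the four outcome counts, and the AROC can be obtained by thresholding the confidence values. The following is a minimal sketch. The function names are illustrative, and the trapezoidal rule used for the area is a common choice that the text does not explicitly specify.

```python
import math
from typing import Dict, Sequence


def quality_indices(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Accuracy (in %), MCC, TPR and TNR from the four outcome counts."""
    total = tp + fn + fp + tn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        # By convention MCC is set to 0 when any marginal count is zero.
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
        "tnr": tn / (tn + fp) if (tn + fp) else 0.0,
    }


def aroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve. Labels are 1 (crystallizable) or 0
    (noncrystallizable); scores are confidence values for class 1."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    area = 0.0
    prev_fpr = prev_tpr = 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Lower the threshold past all chains that share the same score.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        tpr, fpr = tp / positives, fp / negatives
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0  # trapezoid
        prev_fpr, prev_tpr = fpr, tpr
        i = j
    return area
```

As a consistency check, assume exactly 1000 crystallizable and 1000 noncrystallizable chains and back-compute counts from the rounded OB-Score rates in Table 2 (TPR 0.86, TNR 0.54). Then `quality_indices(860, 140, 460, 540)` gives an accuracy of 70.0% and an MCC of about 0.42, close to the reported 69.8% and 0.42. These counts are an approximation derived from rounded rates, not the actual prediction counts.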

3.3. Comparison of Existing Prediction

Methods

Results of application of the four crystallization propen-

sity predictors on the TEST-NEW dataset are summa-

rized in Table 2. In the case of XtalPred we assume a

prediction assignment in which optimal, suboptimal, and

average outcomes are categorized as crystallizable pro-

teins and difficult and very difficult as noncrystallizable.

The same assignment was used in [70] since it leads to

optimal results.
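The binary assignment used for XtalPred can be written down as a small mapping. This is a sketch of the rule stated above. Representing the five classes as lowercase text labels is an assumption, since the exact output format of the XtalPred server is not described here.

```python
# Class names as listed in the text; the string representation is assumed.
CRYSTALLIZABLE_CLASSES = {"optimal", "suboptimal", "average"}
NONCRYSTALLIZABLE_CLASSES = {"difficult", "very difficult"}


def is_crystallizable(xtalpred_class: str) -> bool:
    """Map one of the five XtalPred classes to a binary prediction."""
    label = xtalpred_class.strip().lower()
    if label in CRYSTALLIZABLE_CLASSES:
        return True
    if label in NONCRYSTALLIZABLE_CLASSES:
        return False
    raise ValueError(f"unknown XtalPred class: {xtalpred_class!r}")
```

Moving the "average" class to the noncrystallizable side would trade TPR for TNR; the split shown here is the one reported in the text.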

The comparison shows that the four methods are characterized by relatively similar prediction quality, with MCC values between 0.39 and 0.43 and accuracies between 69.3% and 70.6%. Since the dataset is balanced, a random assignment of the prediction outcomes would give an accuracy of 50%, which means that the accuracy of the existing methods is about 20 percentage points better than a random coin toss. At the same time, we observe considerable room for improvement, although we caution the reader that the upper limit of the prediction accuracy should not be assumed to be 100%, because the input data likely includes mislabeled proteins. In particular, since the data comes from multiple SG centers, some proteins that could not be crystallized in one center could potentially be crystallized by another center that uses different protocols and equipment. Some of the proteins could therefore be mislabeled as noncrystallizable, i.e., some of the FPs are in fact TPs. At this time we are not able to estimate their number. We observe that OB-Score and ParCrys are both strongly biased towards predicting crystallizable proteins, i.e., their TPR values are much higher than their TNR values, and the TNR values are relatively low. XtalPred and CRYSTALP2 provide more balanced predictions for the two classes of proteins, and their TNR values are above 0.63. All four methods provide better predictions for crystallizable proteins, i.e., they correctly predict a larger fraction of crystallizable proteins than of noncrystallizable proteins. In other words, they are more likely to succeed in confirming that a crystallizable chain can be crystallized than in showing that a chain that is difficult to crystallize cannot be crystallized, although in both cases all of the considered methods work better than a coin toss. Figure 2 shows the ROC curves for the four predictors. We again observe that all considered methods behave similarly, i.e., they provide comparable TP rates for the same FP rates.
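To make the construction of such ROC curves concrete, the sketch below thresholds a list of confidence values and integrates the resulting curve with the trapezoidal rule. It is a minimal illustration, assuming that a higher score means higher confidence in the crystallizable class; the function names and the toy scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) points; labels are True for crystallizable."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Predict "crystallizable" when score >= threshold, sweeping the
    # threshold from the highest observed score down to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_roc(points):
    """Trapezoidal estimate of the area under the ROC curve (AROC)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy confidence values for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [True, True, False, True, False, False]
aroc = area_under_roc(roc_points(toy_scores, toy_labels))
print(round(aroc, 4))
```

The sketch assumes that both classes are present in the input; otherwise one of the rates is undefined.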

Figure 2. ROC curves (TP rate plotted against FP rate) for the tests performed with OB-Score, ParCrys, XtalPred and CRYSTALP2 methods on the TEST-NEW dataset.

Figure 3. Analysis of the number of correct predictions produced by OB-Score, ParCrys, XtalPred and CRYSTALP2 methods on all proteins, only crystallizable proteins, and only noncrystallizable proteins from the TEST-NEW dataset. The y-axis shows the percentage of predictions for which no method, 1 method, 2 methods, 3 methods, or all 4 methods provide a correct prediction.


Figure 4. Analysis of the predictions and characteristics of the TEST-NEW dataset with respect to the input protein chain length. A) Distribution of the number of all proteins (black bars), crystallizable proteins (green bars) and noncrystallizable proteins (red bars) in the considered protein length intervals. B) Prediction quality measured using MCC for OB-Score, ParCrys, XtalPred and CRYSTALP2 methods for each of the protein length intervals. The intervals are [38, 99], [100, 149], [150, 199], [200, 299], [300, 399] and [400, 1948] amino acids.

Figure 3 analyzes the predictions with respect to the number of correct predictions produced by the four methods for each input protein. The results obtained on the entire TEST-NEW set indicate that at least three methods simultaneously provide correct predictions for two thirds of the test proteins, and that only 9.6% of the proteins are not correctly predicted by any of the considered methods. We again observe that predictions for crystallizable proteins are of higher quality than those for noncrystallizable proteins. In particular, only 1.6% of crystallizable proteins are never correctly predicted and 78.0% are correctly predicted by at least three methods. In contrast, the corresponding values for the noncrystallizable proteins are 17.6% and 53.2%, respectively.
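The per-protein agreement analysis summarized above can be sketched as follows. This is a minimal illustration with hypothetical method names and toy predictions; it simply counts, for each protein, how many methods predict the true class and reports the percentage of proteins for each count.

```python
from collections import Counter


def correct_count_distribution(true_labels, predictions):
    """Percentage of proteins for which exactly k methods are correct.

    predictions maps a method name to its list of boolean predictions
    (True = crystallizable), aligned with true_labels.
    """
    n = len(true_labels)
    counts = Counter(
        sum(1 for pred in predictions.values() if pred[i] == true_labels[i])
        for i in range(n)
    )
    return {k: 100.0 * counts.get(k, 0) / n for k in range(len(predictions) + 1)}


# Toy data for illustration only; "A" to "D" stand in for the four predictors.
toy_truth = [True, True, False, False]
toy_predictions = {
    "A": [True, False, False, True],
    "B": [True, True, True, True],
    "C": [True, False, False, True],
    "D": [True, True, False, True],
}
distribution = correct_count_distribution(toy_truth, toy_predictions)
print(distribution)
```

Restricting `true_labels` and the prediction lists to the crystallizable or the noncrystallizable proteins gives the per-class breakdown.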

3.4. Analysis of Predictions for Varying Protein Sizes

The protein chain length has been indicated as one of the important factors related to the protein crystallization propensity [45, 46, 57, 67]. It is also correlated with the quality of secondary structure prediction [75], which is utilized in the prediction of protein crystallization [55, 67]. To this end, Figure 4 summarizes the results obtained by binning the input protein chains into six length-based intervals. Figure 4A shows, as expected [67], an uneven distribution of the crystallizable and noncrystallizable proteins across the chain lengths. We observe that the majority of short chains with fewer than 100 amino acids are difficult to crystallize, while crystallization is more successful for longer chains. More importantly, Figure 4B shows that XtalPred stands out from the competition, as it provides better predictions for short sequences of up to 150 amino acids. On the other hand, OB-Score shows a slight improvement over the competition when predicting long chains of 400 or more amino acids. Finally, CRYSTALP2 is characterized by the most even quality across the intervals. We also observe a general trend that the best results are, on average, obtained for medium-sized protein chains of between 100 and 200 amino acids.
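The length-based analysis can be reproduced along the lines of the sketch below, which bins proteins into the six intervals from Figure 4 and computes the MCC within each bin. The function names and the toy data are hypothetical, and returning an MCC of 0 for a degenerate confusion matrix is an assumed convention.

```python
import math

# Length intervals (in amino acids) taken from Figure 4.
LENGTH_INTERVALS = [(38, 99), (100, 149), (150, 199), (200, 299), (300, 399), (400, 1948)]


def mcc(true_labels, predicted_labels):
    """Matthews correlation coefficient for boolean labels (True = crystallizable)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator > 0 else 0.0


def mcc_per_length_interval(lengths, true_labels, predicted_labels):
    """Return {(low, high): MCC}, with None for intervals without proteins."""
    results = {}
    for low, high in LENGTH_INTERVALS:
        selected = [i for i, length in enumerate(lengths) if low <= length <= high]
        if not selected:
            results[(low, high)] = None
            continue
        results[(low, high)] = mcc(
            [true_labels[i] for i in selected],
            [predicted_labels[i] for i in selected],
        )
    return results


# Toy data for illustration only.
toy_lengths = [50, 60, 70, 80, 120, 130]
toy_truth = [True, False, True, False, True, False]
toy_predicted = [True, False, True, False, False, True]
per_interval = mcc_per_length_interval(toy_lengths, toy_truth, toy_predicted)
print(per_interval)
```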

3.5. Complementarity of Existing Methods

Although the above results indicate that the existing methods are characterized by comparable prediction quality, the substantial differences in their underlying designs and the results shown in Figures 3 and 4B suggest that their predictions could be complementary. In other words, although on average they provide a similar number of correct predictions, these predictions likely concern different input proteins.

We investigate the complementarity by combining multiple methods using the OR operator, i.e., a given prediction is assumed correct if at least one of the methods in an ensemble provides a correct prediction. This approach quantifies the amount of overlap in predictions and estimates the upper bound on the quality of a potential meta-predictor that combines predictions from the individual methods. Figure 5 summarizes the results, in terms of the achieved TPR, TNR and MCC values, for all combinations of two, three, and four predictors as well as for the individual methods. We observe that certain ensembles obtain higher prediction quality, which indicates stronger complementarity. In particular, combining either OB-Score with XtalPred or CRYSTALP2 with XtalPred gives better results than any other combination of two methods. Among the ensembles of three methods, the combination of XtalPred and CRYSTALP2 with either ParCrys or OB-Score works best. This observation and the fact that OB-Score and ParCrys are the least complementary among all pairs of predictors indicate that these two methods provide relatively overlapping outputs. An ensemble of all four methods obtains an MCC of 0.82, which is not much higher than the 0.80 achieved with just three methods, showing that adding the fourth predictor brings relatively minor improvements. Finally, we again observe that both individual and ensemble-based predictions are characterized by higher quality for crystallizable than for noncrystallizable proteins.

Figure 5. Analysis of the complementarity of predictions for OB-Score (OB), ParCrys (PC), XtalPred (XP) and CRYSTALP2 (C2) methods on the TEST-NEW dataset. Each combination of 1, 2, 3, and 4 methods was applied using the OR operator, i.e., a given prediction was assumed correct if at least one of the predictors predicted it correctly. The x-axis/y-axis shows TPR/TNR values (TPR values are scaled between 0.75 and 1 while TNR values are scaled between 0.5 and 1), and the labels next to the markers denote a particular combination of applied predictors together with the MCC value (e.g., “PC XP C2 .80” means that the combination of ParCrys, XtalPred and CRYSTALP2 obtained an MCC of 0.80). Markers and labels in red denote the best results for a given number of applied methods. The marker labels are: OB PC XP C2 .82; OB PC XP .76; OB XP C2 .80; PC XP C2 .80; OB PC C2 .73; PC C2 .68; OB C2 .68; XP C2 .70; OB PC .58; PC XP .69; OB XP .70; OB .42; PC .43; XP .40; C2 .39.
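The OR-based combination can be sketched in Python as shown below. This is a minimal illustration of the upper-bound analysis, assuming boolean predictions (True = crystallizable); the function names and the toy predictions are hypothetical and do not reproduce the values in Figure 5.

```python
from itertools import combinations


def or_ensemble_rates(true_labels, predictions, members):
    """TPR and TNR when a protein counts as correct if any member is correct."""
    correct = [
        any(predictions[m][i] == truth for m in members)
        for i, truth in enumerate(true_labels)
    ]
    positives = [c for c, truth in zip(correct, true_labels) if truth]
    negatives = [c for c, truth in zip(correct, true_labels) if not truth]
    tpr = sum(positives) / len(positives) if positives else 0.0
    tnr = sum(negatives) / len(negatives) if negatives else 0.0
    return tpr, tnr


def all_or_ensembles(true_labels, predictions):
    """Evaluate every combination of 1..n methods under the OR operator."""
    methods = sorted(predictions)
    return {
        members: or_ensemble_rates(true_labels, predictions, members)
        for size in range(1, len(methods) + 1)
        for members in combinations(methods, size)
    }


# Toy data for illustration only.
toy_truth = [True, True, False, False]
toy_predictions = {
    "OB": [True, False, True, True],
    "XP": [False, True, False, True],
}
rates = all_or_ensembles(toy_truth, toy_predictions)
print(rates)
```

Because the OR operator uses the true labels, these rates describe what an ideal combiner could achieve and cannot be obtained by a deployable predictor.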

We also investigate the possibility of implementing a simple, majority-vote based meta-predictor. Such a method generates predictions that correspond to the most frequent prediction of its member methods. We apply a simple majority vote for the meta-predictors with three members, while for the ensemble of four methods we break ties (2 vs 2 split decisions among the member methods) by applying the prediction of one selected method. This leads to eight potential configurations, i.e., four combinations of three out of four methods and four configurations with four member methods, each using a different method as the tie-breaker. The corresponding results are presented in Figure 6. The results demonstrate that the best ensemble includes XtalPred, CRYSTALP2 and OB-Score. The runner-up configurations include an ensemble of XtalPred, CRYSTALP2 and ParCrys, and two ensembles of four methods with XtalPred and CRYSTALP2 as the tie-breakers. These results are consistent with the above complementarity analysis and indicate that XtalPred and CRYSTALP2 complement each other well. We also observe that the majority-vote mechanism provides only moderate improvements. More specifically, the best vote-based ensemble obtains an MCC of 0.49, while the MCC of the best individual method equals 0.43 and the best OR-based combination of methods from Figure 5 obtains an MCC of 0.82. In terms of the corresponding accuracies, this means that although the four considered methods together can correctly predict up to 90.4% of proteins, simple voting provides only 73.6% correct predictions.

Figure 6. Analysis of the performance of majority-vote based ensembles of OB-Score (OB), ParCrys (PC), XtalPred (XP) and CRYSTALP2 (C2) methods on the TEST-NEW dataset. The x-axis/y-axis shows TPR/TNR values (TPR values are scaled between 0.75 and 0.9 while TNR values are scaled between 0.5 and 0.65), and the labels next to the markers denote a particular ensemble together with the MCC value (e.g., “OB XP C2 .49” means that the ensemble composed of OB-Score, XtalPred and CRYSTALP2 obtained an MCC of 0.49). The prediction of the ensemble corresponds to the most frequent prediction of its members. The tie-breaker for ensembles of 4 methods is chosen as the prediction of one specific method, i.e., “ALL tie-bkr XP” corresponds to an ensemble of all four methods in which a 2 vs 2 split is decided by the prediction of XtalPred. Markers and labels in red/blue denote the best/second best results. The marker labels are: OB XP C2 .49; PC XP C2 .48; ALL tie-bkr XP .48; ALL tie-bkr C2 .48; OB PC XP .47; OB PC C2 .47; ALL tie-bkr OB .47; ALL tie-bkr PC .46; PC .43; OB .42; XP .40; C2 .39.
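The voting scheme with a designated tie-breaker can be sketched as follows. This is a minimal illustration for a single protein, assuming boolean member predictions (True = crystallizable); the function name `majority_vote` and the example votes are hypothetical.

```python
def majority_vote(member_predictions, tie_breaker=None):
    """Return the most frequent prediction among the member methods.

    member_predictions maps a method name to its boolean prediction.
    For an even split (e.g., 2 vs 2 with four members) the prediction of
    the method named by tie_breaker is returned.
    """
    votes_for = sum(1 for p in member_predictions.values() if p)
    votes_against = len(member_predictions) - votes_for
    if votes_for != votes_against:
        return votes_for > votes_against
    if tie_breaker is None:
        raise ValueError("an even split requires a tie-breaker method")
    return member_predictions[tie_breaker]


# Three-member ensemble: a plain majority decides.
three = majority_vote({"OB": True, "XP": False, "C2": True})

# Four-member ensemble with a 2 vs 2 split decided by XtalPred (XP).
four = majority_vote(
    {"OB": True, "PC": True, "XP": False, "C2": False}, tie_breaker="XP"
)
print(three, four)
```

Unlike the OR-based combination, this scheme does not use the true labels and can therefore be applied to new proteins.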

Overall, the analysis shows that the best improvements, when compared with using individual predictors, are achieved by combining XtalPred with CRYSTALP2. The OB-Score and ParCrys methods overlap to a larger extent, although they also complement the other two predictors. This can be explained by the use of very similar input features in ParCrys and OB-Score and the use of larger numbers of more complementary features in CRYSTALP2 and XtalPred. Finally, a simple voting-based meta-predictor is shown to provide some improvements, although more complex designs should be considered to better exploit the complementarity between the existing prediction methods. Such advanced heterogeneous meta-predictors (which use diverse types of member methods) have already been successfully used in sequence-based prediction of other protein properties such as fold type [76, 77], subcellular localization [78-80], structural class [81], and solvent accessibility [82].

4. SUMMARY AND CONCLUSIONS

Structural genomics efforts have entered a mature stage in which a wealth of data that can be analyzed to build useful supporting tools has already been accumulated. One of the most significant bottlenecks in the protein structure determination pipelines implemented by SG centers is the ability to generate diffraction-quality crystals. Although some mechanisms have already been implemented to improve the corresponding success rates, our analysis shows significant room for further improvement. In this context, we have reviewed existing databases, analytical results and predictive methods that aim at supporting the protein crystallization task.

We show that analyses of data from certain SG centers and community-wide databases such as TargetDB revealed that certain factors, such as protein size, isoelectric point, disordered regions, and the presence of transmembrane helices, correlate with the ability to produce quality protein crystals. We also contrasted and compared several modern sequence-based predictors of crystallization propensity, including OB-Score, ParCrys, XtalPred and CRYSTALP2. We demonstrate that these methods provide useful predictions which are complementary to each other. Although their average success rate is similar, at about 70%, we show that a simple majority-vote based combination of these methods can improve the success rate to almost 74%. Our work also reveals that close to 90% of the protein chains can be correctly predicted by at least one of these methods, which motivates the development of more advanced meta-predictors. The best predictions for short chains of under 100 amino acids are produced by XtalPred, and the most accurate predictions, on average, are generated for medium-sized chains of 100 to 200 amino acids. We believe that these crystallization propensity predictors could provide useful input that could be incorporated into the target selection procedures of current SG efforts.

REFERENCES

[1] Guido, R.V., Oliva, G. and Andricopulo, A.D. (2008) Virtual screening and its integration with modern drug design technologies. Current Medicinal Chemistry, 15(1), 37-46.

[2] Norin, M. and Sundström, M. (2001) Protein models in drug discovery. Current Opinion in Drug Discovery & Development, 4, 284-290.

[3] Klebe, G. (2000) Recent developments in structure-based drug design. Journal of Molecular Medicine, 78(5), 269-281.

[4] Fernàndez-Busquets, X., de Groot, N.S., Fernandez, D. and Ventura, S. (2008) Recent structural and computational insights into conformational diseases. Current Medicinal Chemistry, 15, 1336-1349.

[5] Luscombe, N.M., Laskowski, R.A. and Thornton, J.M. (2001) Amino acid-base interactions: a three-dimensional analysis of protein-DNA interactions at an atomic level. Nucleic Acids Research, 29, 2860-2874.

[6] Ellis, J.J., Broom, M. and Jones, S. (2007) Protein-RNA interactions: structural analysis and functional classes. Proteins, 66, 903-911.

[7] Chen, K. and Kurgan, L. (2009) Investigation of atomic level patterns in protein - small ligand interactions. PLoS ONE, 4(2), e4473.

[8] Pruitt, K.D., Tatusova, T. and Maglott, D.R. (2007) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, 35(Database issue), D61-65.

[9] Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) The Protein Data Bank. Nucleic Acids Research, 28, 235-242.

[10] Brenner, S.E. (2001) A tour of structural genomics. Nature Reviews Genetics, 2(10), 801-809.

[11] Chandonia, J.M. and Brenner, S.E. (2006) The impact of structural genomics: expectations and outcomes. Science, 311, 347-351.

[12] Service, R.F. (2008) Protein Structure Initiative: Phase 3 or Phase Out. Science, 319(5870), 1610-1613.

[13] Terwilliger, T.C., Waldo, G., Peat, T.S., Newman, J.M., Chu, K. and Berendzen, J. (1998) Class-directed structure determination: Foundation for a protein structure initiative. Protein Science, 7(9), 1851-1856.

[14] Brenner, S.E. (2000) Target selection for structural genomics. Nature Structural Biology, 7, 967-969.

[15] Dessailly, B.H., Nair, R., Jaroszewski, L., Fajardo, J.E., Kouranov, A., Lee, D., Fiser, A., Godzik, A., Rost, B. and Orengo, C. (2009) PSI-2: structural genomics to cover protein domain family space. Structure, 17(6), 869-881.

[16] Ilari, A. and Savino, C. (2008) Protein structure determination by x-ray crystallography. Methods in Molecular Biology, 452, 63-87.

[17] Wishart, D. (2005) NMR spectroscopy and protein structure determination: applications to drug discovery and development. Current Pharmaceutical Biotechnology, 6(2), 105-120.

[18] Hite, R.K., Raunser, S. and Walz, T. (2007) Revival of electron crystallography. Current Opinion in Structural Biology, 17(4), 389-395.

[19] Fischer, D. (2006) Servers for protein structure prediction. Current Opinion in Structural Biology, 16(2), 178-182.

[20] Xiang, Z. (2006) Advances in homology protein structure modeling. Current Protein & Peptide Science, 7(3), 217-227.

[21] Lacapère, J.J., Pebay-Peyroula, E., Neumann, J.M. and Etchebest, C. (2007) Determining membrane protein structures: still a challenge! Trends in Biochemical Sciences, 32(6), 259-270.

[22] Schnell, J.R. and Chou, J.J. (2008) Structure and mechanism of the M2 proton channel of influenza A virus. Nature, 451, 591-595.

[23] Service, R. (2005) Structural genomics, round 2. Science, 307, 1554-1558.

[24] Chen, L., Oughtred, R., Berman, H.M. and Westbrook, J. (2004) TargetDB: a target registration database for structural genomics projects. Bioinformatics, 20(16), 2860-2862.

[25] Slabinski, L., Jaroszewski, L., Rychlewski, L., Wilson, I.A., Lesley, S.A. and Godzik, A. (2007) XtalPred: a web server for prediction of protein crystallizability. Bioinformatics, 23(24), 3403-3405.

[26] Hui, R. and Edwards, A. (2003) High-throughput protein crystallization. Journal of Structural Biology, 142, 154-161.

[27] Savchenko, A., Yee, A., Khachatryan, A., Skarina, T., Evdokimova, E., Pavlova, M., Semesi, A., Northey, J., Beasley, S., Lan, N., Das, R., Gerstein, M., Arrowsmith, C.H. and Edwards, A.M. (2003) Strategies for structural proteomics of prokaryotes: quantifying the advantages of studying orthologous proteins and of using both NMR and x-ray crystallography approaches. Proteins, 50, 392-399.

[28] Chandonia, J.M. and Brenner, S.E. (2005) Implications of structural genomics target selection strategies: Pfam5000, whole genome, and random approaches. Proteins, 58, 166-179.

[29] McPherson, A. (2004) Protein crystallization in the structural genomics era. Journal of Structural and Functional Genomics, 5(1-2), 3-12.

[30] Chayen, N.E. (2004) Turning protein crystallisation from an art into a science. Current Opinion in Structural Biology, 14(5), 577-583.

[31] Biertumpfel, C., Basquin, J. and Suck, D. (2005) Practical implementations for improving the throughput in a manual crystallization setup. Journal of Applied Crystallography, 38, 568-570.

[32] Pusey, M., Liu, Z.J., Tempel, W., Praissman, J., Lin, D., Wang, B.C., Gavira, J.A. and Ng, J.D. (2005) Life in the fast lane for protein crystallization and X-ray crystallography. Progress in Biophysics and Molecular Biology, 88, 359-386.

[33] Stevens, R.C. (2000) High-throughput protein crystallization. Current Opinion in Structural Biology, 10(5), 558-563.

[34] Rodrigues, A. and Hubbard, R.E. (2003) Making decisions for structural genomics. Briefings in Bioinformatics, 4, 150-167.

[35] Lesley, S.A., Kuhn, P., Godzik, A., Deacon, A.M., Mathews, I., Kreusch, A., Spraggon, G., Klock, H.E., McMullan, D., Shin, T., Vincent, J., Robb, A., Brinen, L.S., Miller, M.D., McPhillips, T.M., Miller, M.A., Scheibe, D., Canaves, J.M., Guda, C., Jaroszewski, L., Selby, T.L., Elsliger, M.A., Wooley, J., Taylor, S.S., Hodgson, K.O., Wilson, I.A., Schultz, P.G. and Stevens, R.C. (2002) Structural genomics of the Thermotoga maritima proteome implemented in a high-throughput structure determination pipeline. Proceedings of the National Academy of Sciences of USA, 99, 11664-11669.

[36] Brenner, S.E., Barken, D. and Levitt, M. (1999) The PRESAGE database for structural genomics. Nucleic Acids Research, 27(1), 251-253.

[37] Chance, M.R., Bresnick, A.R., Burley, S.K., Jiang, J.S., Lima, C.D., Sali, A., Almo, S.C., Bonanno, J.B., Buglino, J.A., Boulton, S., Chen, H., Eswar, N., He, G., Huang, R., Ilyin, V., McMahan, L., Pieper, U., Ray, S., Vidal, M. and Wang, L.K. (2002) Structural genomics: pipeline for providing structures for the biologist. Protein Science, 11(4), 723-738.

[38] Bertone, P., Kluger, Y., Lan, N., Zheng, D., Christendat, D., Yee, A., Edwards, A.M., Arrowsmith, C.H., Montelione, G.T. and Gerstein, M. (2001) SPINE: An integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics. Nucleic Acids Research, 29, 2884-2898.

[39] Goh, C.S., Lan, N., Echols, N., Douglas, S.M., Milburn, D., Bertone, P., Xiao, R., Ma, L.C., Zheng, D., Wunderlich, Z., Acton, T., Montelione, G.T. and Gerstein, M. (2003) SPINE 2: a system for collaborative structural proteomics within a federated database framework. Nucleic Acids Research, 31, 2833-2838.

[40] Kouranov, A., Xie, L., de la Cruz, J., Chen, L., Westbrook, J., Bourne, P.E. and Berman, H.M. (2006) The RCSB PDB information portal for structural genomics. Nucleic Acids Research, 34(Database issue), D302-305.

[41] Berman, H.M. (2008) Harnessing knowledge from structural genomics. Structure, 16, 16-18.

[42] Berman, H.M., Westbrook, J.D., Gabanyi, M.J., Tao, W., Shah, R., Kouranov, A., Schwede, T., Arnold, K., Kiefer, F., Bordoli, L., Kopp, J., Podvinec, M., Adams, P.D., Carter, L.G., Minor, W., Nair, R. and La Baer, J. (2008) The protein structure initiative structural genomics knowledgebase. Nucleic Acids Research, 37(Database issue), D365-368.

[43] Rupp, B. and Wang, J.W. (2004) Predictive models for protein crystallization. Methods, 34, 391-408.

[44] Christendat, D., Yee, A., Dharamsi, A., Kluger, Y., Savchenko, A., Cort, J.R., Booth, V., Mackereth, C.D., Saridakis, V., Ekiel, I., Kozlov, G., Maxwell, K.L., Wu, N., McIntosh, L.P., Gehring, K., Kennedy, M.A., Davidson, A.R., Pai, E.F., Gerstein, M., Edwards, A.M. and Arrowsmith, C.H. (2000) Structural proteomics of an archaeon. Nature Structural Biology, 7, 903-909.

[45] Goh, C.S., Lan, N., Douglas, S.M., Wu, B., Echols, N., Smith, A., Milburn, D., Montelione, G.T., Zhao, H. and Gerstein, M. (2004) Mining the structural genomics pipeline: Identification of protein properties that affect high-throughput experimental analysis. Journal of Molecular Biology, 336, 115-130.

[46] Canaves, J.M., Page, R., Wilson, I.A. and Stevens, R.C. (2004) Protein biophysical properties that correlate with crystallization success in Thermotoga maritima: Maximum clustering strategy for structural genomics. Journal of Molecular Biology, 344, 977-991.

[47] Kantardjieff, K.A. and Rupp, B. (2004) Protein isoelectric point as a predictor for increased crystallization screening efficiency. Bioinformatics, 20, 2162-2168.

[48] Kantardjieff, K.A., Jamshidian, M. and Rupp, B. (2004) Distributions of pI vs pH provide strong prior information for the design of crystallization screening experiments. Bioinformatics, 20, 2171-2174.

[49] Longenecker, K.L., Garrard, S.M., Sheffield, P.J. and Derewenda, Z.S. (2001) Protein crystallization by rational mutagenesis of surface residues: Lys to Ala mutations promote crystallization of RhoGDI. Acta Crystallographica Section D: Biological Crystallography, 57, 679-688.

[50] Mateja, A., Devedjiev, Y., Krowarsch, D., Longenecker, K., Dauter, Z., Otlewski, J. and Derewenda, Z.S. (2002) The impact of Glu-Ala and Glu-Asp mutations on the crystallization properties of RhoGDI: the structure of RhoGDI at 1.3 Å resolution. Acta Crystallographica Section D: Biological Crystallography, 58, 1983-1991.

[51] Derewenda, Z.S. (2004) The use of recombinant methods and molecular engineering in protein crystallization. Methods, 34, 354-363.

[52] Derewenda, Z.S. (2004) Rational protein crystallization by mutational surface engineering. Structure, 12, 529-535.

[53] Derewenda, Z.S. and Vekilov, P.G. (2006) Entropy and surface engineering in protein crystallization. Acta Crystallographica Section D: Biological Crystallography, 62, 116-124.

[54] Cooper, D.R., Boczek, T., Grelewska, K., Pinkowska, M., Sikorska, M., Zawadzki, M. and Derewenda, Z. (2007) Protein crystallization by surface entropy reduction: optimization of the SER strategy. Acta Crystallographica Section D: Biological Crystallography, 63, 636-645.

[55] Goldschmidt, L., Cooper, D.R., Derewenda, Z. and Eisenberg, D. (2007) Toward rational protein crystallization: A Web server for the design of crystallizable protein variants. Protein Science, 16, 1569-1576.

[56] Oldfield, C.J., Ulrich, E.L., Cheng, Y., Dunker, A.K. and Markley, J.L. (2005) Addressing the intrinsic disorder bottleneck in structural proteomics. Proteins, 59, 444-453.

[57] Chandonia, J.M., Kim, S.H. and Brenner, S.E. (2006) Target selection and deselection at the Berkeley Structural Genomics Center. Proteins, 62, 356-370.

[58] Price, W.N. 2nd, Chen, Y., Handelman, S.K., Neely, H., Manor, P., Karlin, R., Nair, R., Liu, J., Baran, M., Everett, J., Tong, S.N., Forouhar, F., Swaminathan, S.S., Acton, T., Xiao, R., Luft, J.R., Lauricella, A., DeTitta, G.T., Rost, B., Montelione, G.T. and Hunt, J.F. (2009) Understanding the physical properties that control protein crystallization by analysis of large-scale experimental data. Nature Biotechnology, 27(1), 51-57.

[59] Chou, K.C. (2004) Structural bioinformatics and its impact to biomedical science. Current Medicinal Chemistry, 11, 2105-2134.

[60] Chou, K.C. (2005) Progress in protein structural class prediction and its impact to bioinformatics and proteomics. Current Protein & Peptide Science, 6, 423-436.

[61] Yang, Z.R., Wang, L., Young, N. and Chou, K.C. (2005) Pattern recognition methods for protein functional site prediction. Current Protein & Peptide Science, 6, 479-491.

[62] Chou, K.C. and Shen, H.B. (2007) Recent progresses in protein subcellular location prediction. Analytical Biochemistry, 370, 1-16.

[63] Kurgan, L., Cios, K.J., Zhang, H., Zhang, T., Chen, K., Shen, S. and Ruan, J. (2008) Sequence-based methods for real value predictions of protein structure. Current Bioinformatics, 3(3), 183-196.

[64] Smialowski, P., Schmidt, T., Cox, J., Kirschner, A. and Frishman, D. (2006) Will my protein crystallize? A sequence-based predictor. Proteins, 62, 343-355.

[65] Overton, I.M. and Barton, G.J. (2006) A normalised scale for structural genomics target ranking: the OB-Score. FEBS Letters, 580, 4005-4009.

[66] Overton, I.M., Padovani, G., Girolami, M.A. and Barton, G.J. (2008) ParCrys: a Parzen window density estimation approach to protein crystallization propensity prediction. Bioinformatics, 24, 901-907.

[67] Slabinski, L., Jaroszewski, L., Rodrigues, A.P.C., Rychlewski, L., Wilson, I.A., Lesley, S.A. and Godzik, A. (2007) The challenge of protein structure determination - lessons from structural genomics. Protein Science, 16(11), 2472-2482.

[68] Slabinski, L., Jaroszewski, L., Rychlewski, L., Wilson, I.A., Lesley, S.A. and Godzik, A. (2007) XtalPred: a web server for prediction of protein crystallizability. Bioinformatics, 23(24), 3403-3405.

[69] Chen, K., Kurgan, L. and Rahbari, M. (2007) Prediction of protein crystallization using collocation of amino acid pairs. Biochemical and Biophysical Research Communications, 355, 764-769.

[70] Kurgan, L., Razib, A.A., Aghakhani, S., Dick, S., Mizianty, M.J. and Jahandideh, S. (2009) CRYSTALP2: sequence-based protein crystallization propensity prediction. BMC Structural Biology, 9, 50.

[71] Campbell, K. and Kurgan, L. (2008) Sequence-only based prediction of β-turn location and type using collocation of amino acid pairs. Open Bioinformatics Journal, 2, 37-49.

[72] Chen, K., Kurgan, L. and Ruan, J. (2007) Prediction of flexible/rigid regions in proteins from sequences using collocated amino acid pairs. BMC Structural Biology, 7, 25.

[73] Chen, Y.Z., Tang, Y.R., Sheng, Z.Y. and Zhang, Z. (2008) Prediction of mucin-type O-glycosylation sites in mammalian proteins using the composition of k-spaced amino acid pairs. BMC Bioinformatics, 9, 101.

[74] Chen, K., Jiang, Y., Du, L. and Kurgan, L. (2009) Prediction of integral membrane protein type by collocated hydrophobic amino acid pairs. Journal of Computational Chemistry, 30(1), 163-172.

[75] Kurgan, L. (2008) On the relation between the predicted secondary structure and the protein size. The Protein Journal, 24(4), 234-239.

[76] Shen, H.B. and Chou, K.C. (2009) Predicting protein fold pattern with functional domain and sequential evolution information. Journal of Theoretical Biology, 256(3), 441-446.

[77] Chen, K. and Kurgan, L. (2007) PFRES: protein fold classification by using evolutionary information and predicted secondary structure. Bioinformatics, 23(21), 2843-2850.

[78] Assfalg, J., Gong, J., Kriegel, H.P., Pryakhin, A., Wei, T. and Zimek, A. (2009) Supervised ensembles of prediction methods for subcellular localization. Journal of Bioinformatics and Computational Biology, 7(2), 269-285.

[79] Shen, H.B. and Chou, K.C. (2007) Hum-mPLoc: an ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple sites. Biochemical and Biophysical Research Communications, 355(4), 1006-1011.

[80] Chou, K.C. and Shen, H.B. (2006) Hum-PLoc: A novel ensemble classifier for predicting human protein subcellular localization. Biochemical and Biophysical Research Communications, 347, 150-157.

[81] Kedarisetti, K.D., Kurgan, L. and Dick, S. (2006) Classifier ensembles for protein structural class prediction with varying homology. Biochemical and Biophysical Research Communications, 348(3), 981-988.

[82] Chen, H. and Zhou, H.X. (2005) Prediction of solvent accessibility and sites of deleterious mutations from protein sequence. Nucleic Acids Research, 33(10), 3193-3199.