Journal of Intelligent Learning Systems and Applications, 2011, 3, 155-170
doi:10.4236/jilsa.2011.33017 Published Online August 2011 (http://www.scirp.org/journal/jilsa)
Learning Probabilistic Models of Hydrogen Bond
Stability from Molecular Dynamics Simulation
Trajectories
Igor Chikalov1, Peggy Yao2, Mikhail Moshkov1, Jean-Claude Latombe3
1Mathematical and Computer Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal,
Saudi Arabia; 2Bio-Medical Informatics, Stanford University, Stanford, USA; 3Computer Science Department, Stanford University,
Stanford, USA.
Email: igor.chikalov@kaust.edu.sa
Received March 7th, 2011; revised March 24th, 2011; accepted April 2nd, 2011.
ABSTRACT
Hydrogen bonds (H-bonds) play a key role in both the formation and stabilization of protein structures. H-bonds involving atoms from residues that are close to each other in the main-chain sequence stabilize secondary structure elements. H-bonds between atoms from distant residues stabilize a protein’s tertiary structure. However, H-bonds greatly vary in stability. They form and break while a protein deforms. For instance, the transition of a protein from a non-functional to a functional state may require some H-bonds to break and others to form. The intrinsic strength of an individual H-bond has been studied from an energetic viewpoint, but energy alone may not be a very good predictor. Other local interactions may reinforce (or weaken) an H-bond. This paper describes inductive learning methods to train a protein-independent probabilistic model of H-bond stability from molecular dynamics (MD) simulation trajectories. The training data describes H-bond occurrences at successive times along these trajectories by the values of attributes called predictors. A trained model is constructed in the form of a regression tree in which each non-leaf node is a Boolean test (split) on a predictor. Each occurrence of an H-bond maps to a path in this tree from the root to a leaf node. Its predicted stability is associated with the leaf node. Experimental results demonstrate that such models can predict H-bond stability quite well. In particular, their performance is roughly 20% better than that of models based on H-bond energy alone. In addition, they can accurately identify a large fraction of the least stable H-bonds in a given conformation. The paper discusses several extensions that may yield further improvements.
Keywords: Molecular Dynamics, Machine Learning, Regression Tree
1. Introduction
A hydrogen bond (H-bond) corresponds to the attractive
electrostatic interaction between a covalent pair D—H of
atoms, in which the hydrogen atom H is bonded to a
more electronegative donor atom D, and another non-
covalently bound, electronegative acceptor atom A. Most
H-bonds in a protein are of the form N—H···O or O—H···O, but other forms are possible. Due to their
strong directional character, short distance ranges, and
relatively large number in a folded protein, H-bonds play
a key role in both the formation and stabilization of pro-
tein structures [1-3]. While H-bonds involving atoms
from residues that are close to each other in the main-
chain sequence stabilize secondary structure elements,
H-bonds between atoms from distant residues stabilize a
protein’s tertiary structure [1,3-5]. In particular, the latter shape loops and other irregular features that may contain functional sites.
Unlike covalent bonds, H-bonds greatly vary in stabil-
ity. They form and break while the conformation1 of a
protein deforms. For instance, the transition of a folded
protein from a non-functional meta-stable state into a
functional (e.g. binding) state may require certain H-
bonds to break and others to form [6]. The intrinsic
strength of an individual H-bond has been studied from
an energetic viewpoint [7-11]. However, potential en-
ergy alone may not be a very good predictor of H-bond
stability. Other local interactions may reinforce or weaken
an H-bond. Moreover, several “redundant” H-bonds may
1A protein conformation defines the relative positions of all the atoms
in the protein.
contribute to rigidify the same group of atoms. To better
understand the possible deformation of a protein in its
folded state, it is desirable to create models that can re-
liably predict the stability of an H-bond not just from its
energy, but also from its local environment. Such a
model can then be used in a variety of ways, e.g. to study
the kinematic deformability of a folded protein confor-
mation (by detecting its rigid components) and sample
new conformations [12].
In this paper we apply inductive learning methods to
train a protein-independent probabilistic model of
H-bond stability from a training set of molecular dynam-
ics (MD) simulation trajectories of various proteins. The
input to the training procedure is a data table in which
each row gives the value of several (32) attributes, called
predictors, of an H-bond and its local environment at a
given time t in a trajectory, as well as the measured stability of this H-bond over an interval of time [t, t + Δ]. The output is a function of a subset of predictors that estimates the probability that an H-bond present in the conformation c achieved by a protein will be present in any conformation achieved by this protein within a time interval of duration Δ. The value of Δ defines the timescale of the prediction.
MD simulation trajectories provide huge amounts of data, yielding training data tables made of several hundred thousand, or more, rows. To build regression trees from such tables we propose methods that run in O(ab log a) time, where a is the number of rows and
b is the number of predictors. Tests demonstrate that
the models trained with these methods can predict H-
bond stability roughly 20% better than models based on
H-bond energy alone. The models can also accurately
identify a large fraction of the least stable H-bonds in a
given conformation. In most tests, about 80% of the 10%
H-bonds predicted as the least stable are actually among
the 10% truly least stable.
Section 2 gives a precise statement of the problem ad-
dressed in this paper. Section 3 presents the machine
learning approach that is used to solve this problem. Sec-
tion 4 describes details of the training algorithm. Section
5 discusses test results obtained with models trained us-
ing software implementing this algorithm. Section 6
suggests future developments that may lead to improving
trained models.
2. Problem Statement
Let c be the conformation of a protein P at some time considered (with no loss of generality) to be 0 and H be an H-bond present in c. Let M(c) be the set of all physically possible trajectories of P passing through c and π be the probability distribution over this set. We define the stability of H in c over the time interval Δ by:

σ(H, c, Δ) = ∫_{q ∈ M(c)} [ (1/Δ) ∫₀^Δ I(q, H, t) dt ] dπ(q),    (1)

where I(q, H, t) ∈ {0, 1} is a Boolean function that takes value 1 if H is present in the conformation q(t) at time t along trajectory q, and 0 otherwise. The value σ(H, c, Δ) can be interpreted as the probability that H will be present in the conformation of P at any specified time t ∈ [0, Δ], given that P is at conformation c at time 0.

Our goal is to design a method for generating good approximations s of σ. We also want these approximations to be protein-independent, i.e., the argument c may be a conformation of any protein.
3. General Approach
We use machine learning methods to train a stability model s from a given set Q of MD simulation trajectories of various proteins. Each trajectory q ∈ Q is a discrete sequence of conformations of a protein. These conformations are reached at times t_i = iδ, i = 0, 1, 2, …, called ticks, where δ is typically on the order of a picosecond2. We detect the H-bonds3 which are present in each conformation q(t_i) using the geometric criteria given in [13] (see Appendix A). Note that an H-bond in a given protein is uniquely identified (across different conformations) by its donor, acceptor, and hydrogen atoms. So, we call the presence of a specific H-bond H in a conformation q(t_i) an occurrence of H in q(t_i).
For each occurrence of an H-bond H in q(t_i) we compute a fixed list of predictors, some numerical, others categorical. Some are time-invariant, like the types of the donor and acceptor atoms and the number of residues along the main-chain between the donor and acceptor atoms. Others are time-dependent. Among them, some describe the geometry of H in q(t_i), e.g. the distance between the hydrogen and the donor atoms and the angle made by the donor, hydrogen, and acceptor atoms. Others describe the local environment of H in q(t_i), e.g. the number of other H-bonds within a certain distance from the mid-point of H. The complete list of predictors used in our work is given in Appendix C. In total, it contains 32 predictors.
2MD simulation trajectories are computed by integrating the equations of motion with a time step on the order of a femtosecond (10^-15 s) in order to take into account high-frequency thermal vibrations. However, to reduce the amount of stored data, they are usually sub-sampled at a time step on the order of a picosecond (10^-12 s).
3We only consider H-bonds inside a protein. We ignore H-bonds be-
tween a protein and the solvent.
We train s as a function of these predictors. The predictor list defines a predictor space, and every H-bond occurrence maps to a point in this space. As some predictors vary over time, two occurrences of the same H-bond at two different ticks usually map to two distinct points.
Given the input set Q of trajectories, we build a data table in which each row corresponds to an occurrence h of an H-bond present in a conformation q(t_i) contained in Q. So, many rows may correspond to the same H-bond at different ticks. In our experiments, a typical data table contains several hundred thousand rows (see Section 5.1.2). Each column, except the last one, corresponds to a predictor p, and the entry (h, p) of the table is the value of p for h. The entry in the last column is the measured stability y of the H-bond occurrence in conformation q(t_i). More precisely, let H be the H-bond of which h is an occurrence. In addition, let l = Δ/δ, where Δ is the duration over which we wish to predict the stability of h (see Section 2), and let m ≤ l be the number of ticks t_k, k = i+1, i+2, …, i+l, such that H is present in q(t_k). The measured stability y of h is the ratio m/l. Figure 1 plots a (typical) histogram of the measured stability of all H-bond occurrences in one protein trajectory. This histogram indicates that H-bond occurrences tend to be quite stable: over 25% have measured stability 1, about 50% have measured stability higher than 0.8, and only 15% have measured stability less than 0.3.
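As a concrete illustration of how the last column of the data table is filled, the short Python sketch below computes the measured stability y = m/l of one H-bond occurrence from a Boolean presence array; the array layout and function name are illustrative assumptions of ours, not part of the original pipeline.

```python
import numpy as np

def measured_stability(presence: np.ndarray, i: int, l: int) -> float:
    """Measured stability y = m/l of an H-bond occurrence at tick i.

    presence[k] is True if the H-bond is present in conformation q(t_k);
    m is the number of ticks t_k, k = i+1, ..., i+l, at which it is present.
    """
    window = presence[i + 1 : i + l + 1]          # the l ticks following t_i
    return float(np.count_nonzero(window)) / l    # ratio m / l

# Illustrative usage: a bond present at every other tick, over l = 10 ticks.
presence = np.array([True, False] * 50)
print(measured_stability(presence, i=0, l=10))    # 0.5
```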
We build s as a binary regression tree [14]. This well-
studied machine learning approach has been one of the
most successful in practice. Regression trees are often
simple to interpret. Not only may this simplicity eventu-
ally lead to pertinent insights to better understand H-
bond stability; it also allows us to perform many experi-
Figure 1. Histogram of the measured stability of H-bond
occurrences in 1eia protein trajectory (the rightmost bar
defines the fraction H-bond occurrences whose measured
stability is exactly 1).
ments, compare the generated trees, and analyze the rela-
tive importance of the predictors. Furthermore, the
method can work with both categorical and numerical
predictors in a unified way, as shown in Section 4.1.
Each non-leaf node N in a regression tree is a Boolean
test, called a split. Each split on a numeric predictor p
divides the predictor space into two half-spaces
separated by a hyper-plane perpendicular to the coordi-
nate axis representing p. Each arc outgoing from N
corresponds to one of these half-spaces. So, each node
N of the tree determines a region of the predictor space, which is obtained by intersecting all the half-spaces associated with
the arcs connecting the root of the tree to N. We say
that an H-bond occurrence falls into a node N if it is
contained in this region. The predicted stability value
stored at a leaf node L is the average of the measured
stability values computed for all the H-bond occurrences
in the training data table that fall into L. We expect this
averaging, which is done over many pieces of trajectories,
to approximate well the averaging defined in Equation
(1). To avoid over-fitting the input data, only a relatively
small subset of predictors (selected by the training algorithm, as described in Section 4) is eventually used in a
regression tree.
Once a regression tree has been generated, it is used as
follows. Given an H-bond
H
in an arbitrary conforma-
tion c of an arbitrary protein, the leaf node L of the
tree into which
H
falls is identified by calculating the
values of the necessary predictors for
H
in c. The
predicted stability value stored at L is returned. (Note
that by construction of the tree, any H-bond
H
falls
into one and only one leaf node.)
4. Training Algorithm
4.1. Basic Tree-Construction Algorithm
We construct a model s as a binary regression tree using the CART (Classification and Regression Tree)
using the CART (Classification and Regression Tree)
method [14]. The tree is generated recursively in a top-
down fashion, i.e. starting from the root. When a new
node N is created, it is inserted as a leaf of the tree if a
predefined recursion depth has been reached or if the
number of H-bond occurrences (from the training data
table) falling into N is smaller than a predefined
threshold. Otherwise, N is added as an intermediate
node, its split is computed, and its left and right children
L and R are created. A split s is defined by a pair (p, r), where p is the split predictor and r is the split value. If p is a numerical predictor, then r is a threshold on p, and s ≡ [p < r]. If p is a categorical predictor, then r is a subset of categories, and s ≡ [p ∈ r]. We define the score w(p, r) of split s = (p, r) at a node N as the reduction of variance in measured stability
that results from s. More formally:

w(p, r) = Var(Y_N) − (n_L/n)·Var(Y_L) − (1 − n_L/n)·Var(Y_R),    (2)

where:
• Y_N is the distribution of the measured stability of the H-bond occurrences in the training data table falling into N,
• Y_L and Y_R are the distributions of the measured stability of the H-bond occurrences falling into L and R, respectively, when split s is applied,
• Var(Y) is the variance of distribution Y,
• n is the number of H-bond occurrences falling into N,
• n_L is the number of H-bond occurrences falling into L when split s is applied.
The algorithm chooses the split—both the predictor and the split value—that has the largest score. The computation of the split value for each predictor is done as follows (where we denote by H_N the subset of H-bond occurrences in the training data table that fall into N):
1) For a numerical predictor, the values of this predictor in H_N are sorted in ascending order. All midpoints between two distinct consecutive values are used as candidate split values. The one with the largest split score is used as the split value. This value is clearly optimal.
2) For a categorical predictor, for every possible value v of this predictor in H_N, we first compute the mean Ȳ_N(v) of the measured stability of all the H-bond occurrences in H_N where the predictor has value v. We then sort the possible values of the predictor into a list (v_1, v_2, …, v_K) ordered by Ȳ_N(v_i). All the K − 1 splits that divide this list into two contiguous sub-lists—e.g. (v_1, …, v_j) and (v_{j+1}, …, v_K)—are considered. The one with the best score is selected. Statement 8.16 in [14] proves that no other split can give a better score.
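To make the split selection concrete, the following Python sketch scores candidate thresholds for a numerical predictor using the variance reduction of Equation (2); it is a plain, unoptimized reading of the procedure above (the incremental version is discussed next and in Appendix B), with variable names of our own choosing.

```python
import numpy as np

def split_score(y_left: np.ndarray, y_right: np.ndarray) -> float:
    """Reduction of variance (Equation (2)) achieved by a split."""
    y_all = np.concatenate([y_left, y_right])
    n, n_l = len(y_all), len(y_left)
    return (np.var(y_all)
            - (n_l / n) * np.var(y_left)
            - (1 - n_l / n) * np.var(y_right))

def best_numeric_split(p_values: np.ndarray, y: np.ndarray):
    """Scan all midpoints between distinct consecutive predictor values."""
    order = np.argsort(p_values)
    p_sorted, y_sorted = p_values[order], y[order]
    best = (None, -np.inf)                       # (split value r, score)
    for k in range(1, len(p_sorted)):
        if p_sorted[k] == p_sorted[k - 1]:
            continue                             # no midpoint between equal values
        r = 0.5 * (p_sorted[k] + p_sorted[k - 1])
        w = split_score(y_sorted[:k], y_sorted[k:])
        if w > best[1]:
            best = (r, w)
    return best

# Toy usage: stability drops sharply when the predictor exceeds about 2.1.
p = np.array([1.8, 1.9, 2.0, 2.2, 2.3, 2.4])
y = np.array([0.95, 0.9, 0.92, 0.4, 0.35, 0.3])
print(best_numeric_split(p, y))                  # best split value near 2.1
```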
Since the number of values of a numerical predictor in H_N may often be large, it is worth noticing that an incremental procedure can compute split scores efficiently. Consider two consecutive candidate split values s_i and s_{i+1} in Step 1. Assume that we have computed the split score for s_i and that we now want to compute the score for s_{i+1}. We can easily identify the H-bond occurrences that are shifting from L to R. Then we can update Var(Y_L) and Var(Y_R) by only considering these occurrences, in time linear in their number, as shown in Appendix B. As a result we can compute the scores of all the candidate split values in time linear in the number of values of the considered numerical predictor in H_N. For a categorical predictor, the computation of the scores of all the candidate split values is also linear in the number of categorical values.
At each layer of the tree the total number of samples
does not exceed the number of rows in the training table.
So, building each layer takes linear time in the table size.
Since we limit the depth of a regression tree by a rela-
tively small constant (see Section 4.3), the complexity of
the tree construction algorithm is dominated by the initial
sorting of the table rows for each predictor. So, a tree is
built in O(ab log a) time, where a is the number of
logOab a time, where a is the number of
rows in the training data table and b is the number of
predictors. This makes it possible to process tables with
dozens of attributes and several hundred thousand rows
using an off-the-shelf computer.
4.2. Violation of IID Property
One important issue to deal with is the violation of the
IID (independent, identically distributed) property in the
training data table. The IID property would require that
H-bond occurrences follow a certain fixed probability
distribution, and that each row of a data table input to the
learning algorithm is sampled according to this distribu-
tion, independent of the other rows. The satisfaction of
this property is critical for the trained model
to pre-
dict reliably the stability of H-bonds in new protein con-
formations. However, it is likely to be violated, mainly
because several H-bond occurrences in a data table cor-
respond to the same H-bond. More specifically, two oc-
currences of the same H-bond along the same trajectory
are more likely to be similar (or even the same, in the
case of time-independent predictors) along several di-
mensions of the predictor space than two occur-
rences of distinct H-bonds, especially if these bonds be-
long to different proteins. This may result in correla-
tions between predictor values and measured stability
that are bond-specific and thus do not extend to other
bonds. To illustrate this point, Figure 2 plots the histo-
gram of the mean measured stability of all the distinct H-
bonds occurring in one MD simulation trajectory. The
figure shows that distinct H-bonds can have very differ-
ent mean measured stability. It also shows that many H-
bonds are unstable. These bonds contribute few bond
occurrences in the training data table, which leads the
histograms in Figures 1 and 2 to have “inverse” shapes.
While Figure 1 indicates that most H-bond occurrences
are quite stable, Figure 2 indicates that many H-bonds
are unstable.
To address this issue, we apply a two-step split calculation procedure [17]. The training data table is divided at random into two tables T1 and T2. The split predictor p and the split value r at a node N are computed separately, using one of these two tables:
1) The best split value r*_p is computed for each predictor p using T1: r*_p = argmax_r w1(p, r), where w1(p, r) denotes the score of split (p, r) on T1.
2) The best split predictor p* is computed using T2 with the best split values computed at the previous step: p* = argmax_p w2(p, r*_p), where w2(p, r*_p) denotes the score of split (p, r*_p) on T2.
3) The selected split is (p*, r*_{p*}).

Figure 2. Histogram of the mean measured stability of H-bonds in the 1eia protein trajectory.
Assume that the best split value computed in the first step is obtained for some predictor p. If this best value results from a bond-specific correlation between p and measured stability in T1, then this correlation is unlikely to happen again in T2, since T1 and T2 have been separated at random. So, in the second step, predictor p will likely have a small score w2(p, r*_p) and so will not be selected as the split predictor. To reduce the risk that the same bond-specific correlation exists in both T1 and T2, we divide the training data in such a way that all occurrences of the same H-bond end up in the same table.
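A minimal sketch of this two-step split selection, assuming numerical predictors only; the helper functions and the dictionary-of-arrays representation of T1 and T2 are our own simplifications, not the authors' code.

```python
import numpy as np

def variance_reduction(y: np.ndarray, mask: np.ndarray) -> float:
    """Score of a split that sends rows with mask == True to the left child."""
    n_l = int(mask.sum())
    if n_l == 0 or n_l == len(y):
        return 0.0
    return (np.var(y) - (n_l / len(y)) * np.var(y[mask])
            - (1 - n_l / len(y)) * np.var(y[~mask]))

def best_threshold(p: np.ndarray, y: np.ndarray) -> float:
    """Best split value for one numerical predictor (midpoint scan)."""
    vals = np.unique(p)
    mids = 0.5 * (vals[1:] + vals[:-1])
    return max(mids, key=lambda r: variance_reduction(y, p < r))

def two_step_split(T1, T2, y1, y2):
    """T1, T2: dict predictor name -> value array; y1, y2: measured stabilities."""
    r_star = {p: best_threshold(T1[p], y1) for p in T1}                         # step 1 on T1
    p_star = max(T2, key=lambda p: variance_reduction(y2, T2[p] < r_star[p]))   # step 2 on T2
    return p_star, r_star[p_star]                                               # step 3

# Toy usage with two hypothetical numerical predictors.
rng = np.random.default_rng(0)
T1 = {"Dist_H_A": rng.uniform(1.6, 2.6, 200), "Range": rng.integers(1, 30, 200).astype(float)}
y1 = (T1["Dist_H_A"] < 2.1).astype(float) * 0.6 + 0.3
T2 = {"Dist_H_A": rng.uniform(1.6, 2.6, 200), "Range": rng.integers(1, 30, 200).astype(float)}
y2 = (T2["Dist_H_A"] < 2.1).astype(float) * 0.6 + 0.3
print(two_step_split(T1, T2, y1, y2))   # expected: ('Dist_H_A', a threshold near 2.1)
```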
4.3. Tree Pruning
To prevent model overfitting, we limit the size of a re-
gression tree by bounding its maximal depth by a rela-
tively small constant (5 in most of our experiments). We
also define a minimal number of H-bond occurrences
that must fall into a node for this node to be split. How-
ever, it is usually better to set these thresholds rather lib-
erally and later prune the obtained tree using an adaptive
algorithm, as described below.
We initially set aside a fraction T3 of the training data table that has no overlap with the subsets T1 and T2 used in Section 4.2. Once a tree has been constructed using T1 and T2, pruning is an iterative process. At each step, one non-leaf node N whose split has minimal score (on T1) becomes a leaf node by removing the sub-tree rooted at N. This process continues until the pruned tree only contains the root node. It creates a sequence of trees with decreasing numbers of nodes. We then estimate the prediction error of each tree as the mean square error of the predictions made by this tree on T3. The tree with the smallest error is selected. T3 is selected so that it does not contain occurrences of H-bonds also represented in T1 or T2.
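The pruning loop can be sketched as follows, under the simplifying assumptions that the tree is a small dict-based structure, that every node stores the mean stability ("value") of the occurrences falling into it, and that each internal node stores the score of its split; this is an illustration of the procedure above, not the authors' implementation.

```python
import copy
import numpy as np

def internal_nodes(node, acc=None):
    """Collect the internal (split) nodes of a dict-based tree."""
    acc = [] if acc is None else acc
    if "split" in node:
        acc.append(node)
        internal_nodes(node["left"], acc)
        internal_nodes(node["right"], acc)
    return acc

def predict(node, x):
    """Follow splits (predictor, threshold) until a leaf 'value' is reached."""
    while "split" in node:
        p, r = node["split"]
        node = node["left"] if x[p] < r else node["right"]
    return node["value"]

def prune(tree, X3, y3):
    """Iteratively remove the weakest split; keep the tree with lowest MSE on T3."""
    candidates = [copy.deepcopy(tree)]
    current = copy.deepcopy(tree)
    while internal_nodes(current):
        weakest = min(internal_nodes(current), key=lambda n: n["score"])
        for k in ("split", "score", "left", "right"):
            weakest.pop(k)                      # the node becomes a leaf (keeps its 'value')
        candidates.append(copy.deepcopy(current))
    mse = [np.mean([(predict(t, x) - y) ** 2 for x, y in zip(X3, y3)]) for t in candidates]
    return candidates[int(np.argmin(mse))]

# Toy usage: a depth-2 tree over a single predictor "Dist_H_A".
tree = {"split": ("Dist_H_A", 2.1), "score": 0.05, "value": 0.7,
        "left": {"split": ("Dist_H_A", 1.9), "score": 0.001, "value": 0.90,
                 "left": {"value": 0.92}, "right": {"value": 0.88}},
        "right": {"value": 0.4}}
X3 = [{"Dist_H_A": 1.8}, {"Dist_H_A": 2.0}, {"Dist_H_A": 2.4}]
y3 = [0.90, 0.90, 0.38]
print(prune(tree, X3, y3))   # the weak inner split is pruned; the root split is kept
```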
5. Test Results
In Section 5.1 we present the experimental setup with
which the test results reported in Section 5.2 have been
obtained. In Section 5.3 we analyze and discuss the con-
tents of regression trees generated in our experiments.
5.1. Experimental Setup
5.1.1. MD Trajectories
In the experiments reported below, we used 6 MD simu-
lation trajectories picked from different sources. We call
these trajectories 1c90A, 1e85A, 1g90A_1 and 1g90A_2
from [18], and 1eia and complex from [19]. In all of
them the time interval δ between two successive ticks is 1 ps. Table 1 indicates the protein simulated in each
trajectory, its number of residues, the force field used by
the simulator, and the duration of the trajectory. Each
trajectory starts from a folded conformation resolved by
X-ray crystallography.
Trajectories obtained with different proteins allow us
to test if a model
trained with one protein can pre-
dict H-bond stability in another protein. Similarly, tra-
jectories generated with different force fields allow us to
test whether a model trained with one force field can predict
H-bond stability in trajectories generated with another
force field. We did additional experiments with a larger
set of trajectories, but the results were similar to those
reported below.
5.1.2. Data Tables
From each of the 6 trajectories we derived a separate
data table in which the rows represent the detected H-
bond occurrences and the columns give the values of the
predictors and H-bond measured stability. Table 2 lists
the number of distinct H-bonds detected in each trajec-
tory and the total number of H-bond occurrences ex-
tracted. Since most H-bonds are not present at all ticks,
the number of H-bond occurrences is smaller than the
number of distinct H-bonds multiplied by the number of
ticks. So, for example, the data table generated from tra-
jectory 1e85A consists of 1,253,879 rows, 32 columns
for predictors, and one column for measured stability.
Note that complex was generated for a complex of two
molecules. All H-bonds occurring in this complex are
taken into account in the corresponding data table.
Table 1. Characteristics of the MD simulation trajectories used to create the data tables.

Trajectory | Protein | # Residues | Force field and water model | Temperature (K) | Duration (ns)
1c90A | Cold shock protein | 66 | ENCAD [15] with F3C explicit water model | 298 | 10
1e85A | Cytochrome C | 124 | Same as above | 298 | 10
1eia | EIAV capsid protein P26 | 207 | Amber 2003 with explicit SPC/E water model | 300 | 2
1g90A_1 | PDZ1 domain of human Na(+)/H(+) exchanger regulatory factor | 91 | Same as for 1c90A | 298 | 10
1g90A_2 | Same as above | 91 | Same as for 1c90A | 298 | 10
complex | Efb-C/C3d complex formed by the C3d domain of human Complement Component C3 and one of its bacterial inhibitors | 362 | Amber 2003 with implicit solvent using the Generalized Born solvation method [16] | 300 | 2
Table 2. Number of distinct H-bonds and H-bond occur-
rences detected in each trajectory.
Trajectory # H-bonds # Occurrences
1c90A 263 363,463
1e85A 525 1,253,879
1eia 757 379,573
1g90A_1 374 558,761
1g90A_2 397 544,491
complex 1825 348,943
The measured stability y of an H-bond H in q(t_i) is computed as described in Section 3, as the ratio of the number of ticks at which the bond is present in the time interval [t_i, t_{i+l}] in trajectory q divided by the total number of ticks l in this interval.
The values of the time-varying predictors are subject to thermal noise. Since a model s will in general be used to predict H-bond stability in a protein conformation sampled using a kinematic model ignoring thermal noise (e.g. by sampling the dihedral angles φ, ψ, and χ) [12], we chose to average the values of these predictors over l′ ticks to remove thermal noise. More precisely, let h be an H-bond occurrence in q(t_i). The value of a predictor stored in the row of the data table corresponding to h is the average value of this predictor in q(t_{i−l′+1}), q(t_{i−l′+2}), …, q(t_i), i.e., over the l′ ticks t_k with i − l′ < k ≤ i.
The values of l and l′ are chosen according to different criteria. The choice of l is based on two considerations. It must be large enough for the measured stability m/l to be statistically meaningful. It must also correspond to the timescale over which one wants to predict H-bond stability. The choice of l′ should be just large enough to remove thermal noise from the predictor values. Experiment #5 in Section 5.2.5 shows that l′ = 50 is near optimal. We also chose l = 50 in most of the tests reported below, as this value both provides a meaningful ratio m/l and corresponds to an interesting prediction timescale (50 ps). In Experiment #5, we will compare the performance of models generated with several values of l.
5.1.3. Performance Measures
The performance of a regression model can be measured
by the root mean square error (RMSE) of the predictions
on a test dataset. For a data table T = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where each x_i, i = 1, …, n, denotes a vector of predictor values for an H-bond occurrence and y_i is the measured stability of the H-bond, and a model s, the RMSE is defined by:

RMSE(s, T) = √( (1/n) Σ_{i=1}^{n} (y_i − s(x_i))² ).

As RMSE depends not only on the accuracy of s, but also on the table T, some normalization is necessary in order to compare results on different tables. So, in our tests we compute the decrease of RMSE relative to a base model s0. The relative base error decrease (or RBED) is then defined by:

RBED(s, s0, T) = 100% × (RMSE(s0, T) − RMSE(s, T)) / RMSE(s0, T).

In most cases, s0 is simply defined by s0(x) = (1/n) Σ_{i=1}^{n} y_i, i.e. the average measured stability of all H-bond occurrences in the dataset. In other cases, s0 is a model based on the H-bond energy.
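Both measures are straightforward to compute from arrays of measured and predicted stabilities; a small sketch (with names of our own choosing) is given below.

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean square error of the predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rbed(y_true: np.ndarray, y_pred: np.ndarray, y_base: np.ndarray) -> float:
    """Relative base error decrease, in percent, of a model against a base model."""
    e0, e = rmse(y_true, y_base), rmse(y_true, y_pred)
    return 100.0 * (e0 - e) / e0

# Toy usage: the base model predicts the average measured stability everywhere.
y_true = np.array([1.0, 0.9, 0.8, 0.2, 0.1])
y_pred = np.array([0.95, 0.85, 0.75, 0.25, 0.15])
y_base = np.full_like(y_true, y_true.mean())
print(round(rbed(y_true, y_pred, y_base), 1))   # large positive RBED: s beats s0
```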
5.2. Experiments
5.2.1. Experiment #1: Training on one Data Table,
Predicting on Another
Here we trained 10 models on each one of the 6 data
tables (i.e. 60 models total). We tested every model sepa-
rately on each of the other 5 data tables. For each
model, the corresponding training data table was parti-
tioned into three tables T1, T2 and T3, as described in Sections 4.2 and 4.3: 60% of the data went to T1, 20% to T2, and 20% to T3. In addition, to achieve
greater independence between the three tables, no two
tables contain occurrences of the same H-bond. The 10
models trained with the same data table were generated
with different partitions generated at random, but still
satisfying the previous ratios and condition. In all cases
the maximal depth of a tree was set to 5.
Table 3 gives the mean value of RBED for each pair
of data tables and Table 4 gives the estimated standard deviations. More specifically, the chart of Figure 3 shows
the distribution of the RBED values for the 10 models
trained with 1c90A on each data table (so each of the 10
models contributes 6 points in the chart). These results
show that, on average, a model trained with one trajec-
tory predicts H-bond stability in another trajectory rea-
sonably well, even if the two trajectories were generated
using different energy functions. However, we note that
the variance of the 10 RBED values for each test data
table is rather large.
We also note that the mean RBED values are generally
lower for models tested on complex, while the mean
RBED values for models trained on complex and tested
on other tables (last column of Table 3) are comparable
to other values. Recall that the trajectory complex was
generated for a complex made of a protein and a ligand,
while every other trajectory was generated for a single
protein. So, it is likely that complex contains H-bonds in
situations that did not occur in any of the other trajecto-
ries. Figures 4 and 5 show trees trained with the data tables 1c90A and complex, respectively. We will comment on the generated trees in Section 5.3.
5.2.2. Experiment #2: Training on Data from
Multiple Trajectories
Here, we trained models on data tables obtained by mix-
ing subsets of 5 data tables and we tested these models
on the remaining data table. For each combination of 5
data tables, we trained 10 models by mixing different
fractions of the 5 data tables. For each model, the mixed
data table was partitioned into three tables T1, T2 and T3 as in Experiment #1: 60% of the data went to T1, 20% to T2, and 20% to T3. Again, no two tables
contain occurrences of the same H-bond. Furthermore,
we trained 4 groups of models varying in the tree’s
maximal depth (5 or 15) and in the fraction of H-bond
occurrences taken from each data table (10% or 50%).
So, in total, 240 models were generated in this experi-
ment.
Table 5 shows the mean RBED value for each com-
bination of data tables and each group of models and
Table 6 shows the estimated standard deviations. In columns
3 through 8 we indicate the data table used for testing the
models trained on a combination of the 5 other data ta-
bles. Figure 6 shows the distribution of the RBED val-
ues for the models built with the settings in the first
row of Table 5 (i.e. maximal depth of 5 and 10% from
each data table).
We note that the RBED values are significantly higher
than in Experiment #1, meaning that models trained us-
ing data from several trajectories are more accurate than
models trained using data from a single trajectory. This
is not surprising, since a training data table generated
from several trajectories is likely to provide richer data
Figure 3. Distribution of RBED values for the 10 models gen-
erated with 1c90A (Experiment #1). The horizontal line
shows the average of all 50 RBED values.
Table 3. Mean values of RBED for each pair of data tables (Experiment #1).
Test\Train 1c90A 1e85A 1eia 1g90A_1 1g90A_2 complex
1c90A 37.89 42.87 11.72 37.04 38.70 35.44
1e85A 44.91 58.82 31.67 53.04 53.10 42.20
1eia 34.41 38.18 41.52 34.55 37.69 40.19
1g90A_1 40.06 44.83 27.75 41.89 47.52 34.36
1g90A_2 34.69 40.79 35.11 37.52 38.72 34.54
complex 28.41 34.75 34.14 28.46 32.81 39.00
Average 36.72 43.37 30.31 38.75 41.42 37.62
Table 4. Standard deviation values of RBED for each pair of data tables (Experiment #1).
Test\Train 1c90A 1e85A 1eia 1g90A_1 1g90A_2 complex
1c90A 8.27 6.35 35.58 11.57 11.49 6.06
1e85A 14.64 3.99 28.99 5.68 8.29 8.48
1eia 5.85 1.85 3.46 5.86 3.55 1.93
1g90A_1 9.04 7.69 14.95 11.99 9.07 11.70
1g90A_2 12.00 4.87 11.94 10.97 8.36 3.57
complex 5.00 1.62 2.49 7.83 3.09 2.58
Figure 4. Regression tree trained with 1c90A (Experiment #1).
Figure 5. Regression tree (only the top 3 layers are shown) trained with complex (Experiment #1).
Table 5. Mean RBED values obtained in Experiment #2.
Fraction of data Max tree depth 1c90A 1e85A 1eia 1g90A_1 1g90A_2 complex Average
0.1 5 46.92 59.37 42.60 50.93 45.29 37.90 47.17
0.5 5 47.07 59.59 43.15 50.69 45.45 38.08 47.34
0.1 15 47.24 59.03 43.35 51.42 45.65 38.07 47.46
0.5 15 46.87 59.04 43.46 51.38 45.89 38.38 47.50
Table 6. Standard deviation of RBED values obtained in Experiment #2.
Fraction of data Max tree depth 1c90A 1e85A 1eia 1g90A_1 1g90A_2 complex
0.1 5 0.25 0.60 0.64 0.54 0.31 0.63
0.5 5 0.25 0.36 0.83 0.66 0.22 0.48
0.1 15 0.57 1.00 0.84 0.42 0.50 0.85
0.5 15 1.27 1.10 1.00 0.62 0.59 0.60
about H-bond stability than a table derived from a single
trajectory. Furthermore, the variance of RBED values is
now very small, meaning that the training process yields
models that are stable in performance. Finally, like in
Experiment #1, the RBED values are again lower for
models tested on complex. All these results suggest that
we should try to train models with a larger set of trajec-
tories. We actually did some experiments using a few
additional trajectories, but with no noticeable improve-
ment. Most likely these trajectories did not contain
enough H-bonds in situations that did not already occur
in the trajectories of Table 1.
Another observation is that both deeper trees and lar-
ger data fractions tend to improve model accuracy, but
the very small gain is not worth the additional model or
computation complexity. Table 6 shows that the standard deviation is higher for deeper trees, which is an expected result of increasing model complexity.
Figures 7 and 8 show two partial trees trained with
combinations of all tables, except 1c90A (in Figure 7)
and 1e85A (in Figure 8).
Figure 6. Distribution of RBED values for the models built
with settings specified in the first row of Table 5.
Figure 7. Top 3 layers of a tree trained with combination of all tables, except 1c90A (Experiment #2). The actual tree has 5
layers (55 nodes).
Figure 8. Top 3 layers of a tree trained with combination of all tables, except 1e85A (Experiment #2). The actual tree has 5
complete layers (63 nodes).
5.2.3. Experiment #3: Comparison with
FIRST-Energy Model
Here, the models are the same as those generated in Ex-
periment #2 in the first row of Table 5 (maximal depth
of 5 and 10% from each data table). But we now com-
pare them to a regression tree 0
built from the same
training data using FIRST_energy as the only predictor
(predictor #32 in Appendix C). FIRST_energy is the
value of the function used in FIRST [10] to evaluate the
energy of an H-bond occurrence; it is a slightly modified
version of the Mayo energy [7]. We compute RBED
values as defined in Section 5.1.4 where 0
is the sim-
ple regression tree. Table 7 shows the mean RBED val-
ues. Tests on all 6 data tables show that the more com-
plex models are significantly more accurate that the
model based on FIRST_energy only. Overall, these re-
sults confirm that the stability of an H-bond occurrence
depends not only on its energy, but also on other pa-
rameters. See Section 5.3 for more comments.
5.2.4. Experiment #4: Identification of Least Stable
H-Bonds
Most H-bond occurrences tend to be stable. So, accu-
rately identifying the weakest ones is important if one
wishes to predict the possible deformation of a protein
[12]. Here, we measure how well the models generated
in Experiment #2 (again, in the first row of Table 5)
identify the least stable H-bond occurrences in the test data table. In each test table T, we first identify the subset S of the 10% least stable H-bond occurrences (i.e., the H-bond occurrences with the smallest measured stability). Using a regression tree s trained with a combination of data from the 5 other tables, we then sort the H-bond occurrences in T in ascending order of predicted stability and we compute the fraction w(u) ∈ [0, 1] of S that is contained in the first 100u% occurrences in this sorted list, for successive values of u ∈ [0, 1]. We call the function w(u) the identification curve of the least stable H-bonds for s.
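A possible way to compute such an identification curve from measured and predicted stabilities is sketched below; the 10% threshold follows the text, while the sampling of u and the variable names are our own choices.

```python
import numpy as np

def identification_curve(y_measured: np.ndarray, y_predicted: np.ndarray,
                         fraction: float = 0.10, num_points: int = 100):
    """Return (u, w(u)): fraction of the truly least stable bonds (bottom 10%
    by measured stability) found among the first 100u% bonds when sorted by
    ascending predicted stability."""
    n = len(y_measured)
    k = max(1, int(round(fraction * n)))
    truly_weak = set(np.argsort(y_measured)[:k])     # the set S
    ranked = np.argsort(y_predicted)                 # ascending predicted stability
    u = np.linspace(0.0, 1.0, num_points + 1)
    w = [len(truly_weak & set(ranked[:int(round(ui * n))])) / k for ui in u]
    return u, np.array(w)

# Toy usage: predictions that roughly track the measured stability.
rng = np.random.default_rng(1)
y_meas = rng.uniform(0, 1, 1000)
y_pred = np.clip(y_meas + rng.normal(0, 0.1, 1000), 0, 1)
u, w = identification_curve(y_meas, y_pred)
print(w[10])   # w(0.1): share of the 10% weakest bonds found in the first 10%
```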
Figure 9 plots identification curves for each of the 6 test tables. Each plot consists of three curves: the red curve is the ideal identification curve (the one that would be obtained with a model that perfectly predicts the 10% least stable H-bonds), the blue curve is obtained with one (randomly picked) regression tree computed in Experiment #2, and the green curve is obtained by sorting H-bond occurrences in decreasing values of FIRST_energy.
One can see that the models computed in Experiment #2
perform well in general. For models tested on data tables
Table 7. Mean values of RBED computed in Experiment #3.
1c90A 1e85A 1eia 1g90A_1 1g90A_2 complex
26.36 27.95 5.65 22.63 19.63 19.42
other than complex, about 80% of the 10% H-bond oc-
currences predicted as the least stable are actually among
the 10% truly least stable. However, several curves show
a rather long tail of poorly ranked unstable bonds. For
example, the set of the 50% least stable bonds predicted
by the model tested on 1eia still misses about 5% of the
truly least stable bonds. Not surprisingly, the results for
complex are much less satisfactory. The regression mod-
els generated in Experiment #2 perform consistently bet-
ter than the FIRST_energy-only models, but for 1eia the
difference is small.
5.2.5. Experiment #5: Models for Different Averaging
and Prediction Windows
Here the testing setup is the same as for Experiment #2
(first row of Table 5), but we let the numbers of ticks l and l′ vary. Recall from Section 5.1.2 that l is the number of ticks over which bond stability is measured, while l′ is the number of ticks over which predictor values are averaged.
First, we set l to 50 ticks (50 ps), and we built and
tested models for predictor averaging windows of suc-
cessive lengths l′ = 2, 5, 10, 20, 50, 100, 200, and 500 ticks. So, in total we built 400 models. Table 8 shows the mean RBED value for each test table and each value of l′.
Figure 10 shows the distribution of the RBED values
for the models tested on 1c90A (10 models for each
value of l′). An averaging window length of l′ = 50 ticks gives the best results. Shorter lengths fail to elimi-
nate thermal noise in predictor values, but longer win-
dows tend to smooth out important changes in predictor
values.
Next, we set l′ to 50 ticks (50 ps) and we built and
tested models for prediction windows of successive
lengths l = 10, 20, 50, 100, 200, and 500 ticks (so here
we built 300 models). Table 9 shows the mean RBED
value for each test table and each value of l.
Figure 11 shows the distribution of the RBED values
for the models tested on 1c90A. Again, a 50-tick predic-
tion window gives the best results. With shorter windows
measured stability is less reliable. But longer windows
lead to making predictions too far beyond an observed
Table 8. Mean RBED values for different lengths l′ of the predictor averaging window (Experiment #5).

l′   1c90A  1e85A  1eia   1g90A_1  1g90A_2  complex
2    28.15  39.55  22.71  36.39    24.72    7.18
5    38.71  48.65  31.08  43.75    38.05    22.03
10   42.78  54.18  36.31  46.50    42.37    29.53
20   45.58  57.43  40.48  49.40    44.66    34.78
50   47.13  59.72  43.88  51.48    45.76    39.05
100  46.58  59.44  43.81  50.96    45.52    38.54
200  45.18  58.91  43.21  49.95    43.38    36.20
500  41.40  56.25  41.81  46.94    37.99    30.69
Figure 9. Identification curves of the least stable H-bonds for different models (one panel per test table: 1c90A, 1e85A, 1eia, 1g90A_1, 1g90A_2, complex).
Table 9. Mean RBED values for different lengths l of the prediction window (Experiment #5).
l 1c90A 1e85A 1eia 1g90A_1 1g90A_2 complex
2 27.13 34.60 21.45 30.54 26.17 17.61
5 36.98 46.65 30.49 41.34 35.99 26.58
10 42.49 53.16 36.17 47.02 41.50 32.23
20 45.75 57.28 40.55 50.57 44.91 36.39
50 47.09 59.78 43.87 51.50 45.79 38.81
100 45.89 59.29 43.47 49.17 44.82 38.24
200 43.32 56.76 42.17 45.38 42.55 35.24
500 38.13 49.77 37.08 41.43 38.32 31.30
H-bond occurrence; the pertinence of the predictor values decreases over such rather long timescales. For l = 500 ticks, the mean RBED values start declining more significantly, while the plot of Figure 11 indicates that the variance of the RBED values also increases sharply. This is not surprising since a window of 500 ticks represents a large fraction of each of the trajectories.
5.3. Analysis of Regression Trees
In all our regression trees the root split was done with
Figure 10. Distribution of RBED values for models tested
on 1c90A for different lengths l’ of the predictor averag-
ing window (Experiment #5).
Figure 11. Distribution of RBED values for models tested
on 1c90A for different lengths l of the prediction window
(Experiment #5).
predictor Dist_H_A (the distance between the H and ac-
ceptor atoms), which therefore appears to be the single most
mean measured stability of the two children of the root
node differs by a ratio ranging from 1.5 to 2 depending
on the specific tree. The importance of the distance be-
tween the H and the acceptor is consistent with previous
findings. Levitt [8] found that most stable H-bonds have
Dist_H_A less than 2.07Å. Jeffrey and Saenger [20] also
suggested that Dist_H_A is a key attribute affecting
H-bond stability, with a value less than 2.2Å for moder-
ate to strong H-bonds. Consistent with these previous
findings, the split values of the deepest Dist_H_A nodes
in our regression trees vary slightly around 2.1Å. This
distance was observed in [8] to sometimes fluctuate by
up to 3Å in stable H-bonds, due to high-frequency
atomic vibration. This observation supports our decision
to average predictor values over windows of l ticks, as
it would be easy to incorrectly predict the stability of an
H-bond from the value of Dist_H_A at a single tick.
Predictor FIRST_energy, a modified Mayo potential [7]
implemented in FIRST (a protein rigidity analysis soft-
ware) [10], is often used in splits close to the root. This is
not surprising since it is a function of several other per-
tinent predictors: Dist_H_A, Angle_D_H_A, Angle_H_
A_AA, and Hybrid_state (hybridization state of the bond).
Some other distance-based predictors (Dist_D_AA, Dist_
D_A, Dist_H_D), angle-based predictors and Ch_type
predictor appear often in regression trees, but closer to
the leaf nodes. They nevertheless play a significant role
in predicting H-bond stability. For example, as shown in
Figures 7 and 8, if Angle_H_A_AA is at least 105˚, an H-
bond has very high stability (about 0.96); otherwise, the
stability drops to 0.71. The preference for larger angle
matches well with the well-known linearity of H-bonds
[20,21].
Other predictors that are used in splits only occasion-
ally have a less obvious role. A number of predictors
(such as, Atom_type_A, Atom_type_AA, Resi_type_H,
Rgd_type) never appeared in our trees. Either they have
no or very small impact on H-bond stability, or they are
highly correlated with other more discriminative predic-
tors.
In order to get a more quantitative measure of the relative impact of the predictors on H-bond stability, let us define the importance of a predictor p in a regression tree by:

I(p) = Σ_{s ∈ N_p} w(s) n(s),

where N_p is the set of nodes where the split is made using p, w(s) is the score of the split s, and n(s) is the number of H-bond occurrences falling into the node where split s is made4. We trained 10 models on data tables combining 10% of each of the 1c90A, 1e85A,
1eia, 1g90A_1, 1g90A_2 and complex data tables. Im-
portance scores for each predictor were averaged over
these models and then linearly scaled so that the score of the least important predictor (with non-zero average importance) equals 1. The average importance of every
predictor appearing in at least one model is shown in
Figure 12. The figure confirms that distance-based and
angle-based predictors, as well as FIRST_energy, are the
most important. It also shows that a number of other pre-
dictors—including Resi_name_H, Resi_name_A and
Range (difference in residue numbers of donor and ac-
ceptor)—have less, but still significant importance.
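For completeness, the importance measure defined above can be accumulated by a simple walk over a trained tree; the sketch below reuses the dict-based node layout assumed in the Section 4.3 sketch and additionally assumes that each internal node stores the number n of occurrences falling into it.

```python
from collections import defaultdict

def predictor_importance(node, importance=None):
    """I(p) = sum over splits made on p of score(split) * n(node).

    Assumes each internal node is a dict with keys "split" (predictor, value),
    "score" (its variance reduction) and "n" (occurrences falling into it).
    """
    importance = defaultdict(float) if importance is None else importance
    if "split" in node:
        p, _ = node["split"]
        importance[p] += node["score"] * node["n"]
        predictor_importance(node["left"], importance)
        predictor_importance(node["right"], importance)
    return importance

# Toy usage on a two-split tree.
tree = {"split": ("Dist_H_A", 2.1), "score": 0.05, "n": 1000,
        "left": {"split": ("FIRST_energy", -1.0), "score": 0.01, "n": 700,
                 "left": {"value": 0.95}, "right": {"value": 0.8}},
        "right": {"value": 0.4}}
print(dict(predictor_importance(tree)))   # {'Dist_H_A': 50.0, 'FIRST_energy': 7.0}
```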
Overall, we observe that predictors that describe the
local environment of an H-bond occurrence play a rela-
tively small role in predicting its stability. In particular,
4This measure is not perfect because some predictors are correlated.
For instance, the value of FIRST_energy is correlated with other pre-
dictors. A better, but more complicated, measure of predictor impor-
tance uses ensembles of trees of a special form [17].
Figure 12. Predictor importance scores.
we had expected that descriptors such as #29 (Num_hb_spaceNbr) and #30 (Num_hb_spaceRgdNbr), which count
the number of other H-bonds located in the neighbor-
hood of the analyzed H-bond, would have had more im-
portance. However, this may reflect the fact that the MD
simulation trajectories used in our tests are too short to
contain enough information to infer the role of such pre-
dictors. Indeed, while transitions between meta-stable
states are rare in those trajectories, predictors describing
local environments may have greater influence on the
stability of H-bonds that must break for such transitions
to happen. So, longer trajectories may eventually be
needed to better model H-bond stability.
6. Conclusions and Future Work
In this paper we have described machine learning meth-
ods to train regression trees modeling H-bond stability in
a protein. The training and test data are in the form of
tables whose rows describe H-bond occurrences at suc-
cessive times along Molecular Dynamics simulation tra-
jectories and columns give the values of various predic-
tors. Each node in a regression tree is a Boolean test on a
predictor. Each row (H-bond occurrence) in a data table
determines a path in the tree from the root to a leaf node.
A predicted stability is associated with each leaf node.
The generated trees are relatively small and easily under-
standable. Trees can be built to predict H-bond stability
over different time scales.
Test results demonstrate that trained models can pre-
dict H-bond stability quite well. In particular, we have
shown that their performance is significantly better
(roughly 20% better) than that of a model based on H-
bond energy alone. We have also shown that they can
accurately identify a large fraction of the least stable H-
bonds in a given conformation. However, our results also
suggest that better results could be obtained with a richer
set of MD simulation trajectories. In particular, the tra-
jectories used in our experiments might be too short to
characterize the stability of H-bonds that break and form
during a transition between meta-stable states.
We believe that the training methods could be improved in several ways:
• To eliminate thermal noise, predictor values are averaged over time windows of 50 ticks, independent of the elapsed time between two ticks (see Section 5.2.5). It would be better to average predictor values before sub-sampling MD simulation trajectories (see Footnote 2). This would result in a much shorter averaging window, hence it would greatly reduce the risk of filtering out changes in predictor values that are important for H-bond stability. Unfortunately, in our trajectories we only had access to the data after sub-sampling.
• More sophisticated learning techniques could be used. For example, instead of generating a single tree, we could generate an ensemble of trees, such as Gradient Boosting Trees [22] or Random Forests [23]. A regression tree could also be enriched by using splits on linear combinations of predictors and by fitting linear regression models at the leaves.
• We could use rigidity analysis methods such as those described in [12] to decompose a protein into rigid groups of atoms (based on distance constraints imposed by covalent and hydrogen bonds present in the current conformation). This would allow us to apply Bayesian techniques to align the predicted stability of individual H-bonds in the same rigid group. By doing so, we could better predict the collective behavior of related H-bonds and avoid solitary incorrect predictions.
• Finally, the notion of stability itself could be refined, for example by distinguishing between the case where an H-bond frequently switches on and off during a prediction window and the case where it rarely switches.
Overall, we believe that considerable progress can still
be made in learning more accurate and robust models of
H-bond stability.
7. Acknowledgements
The authors thank Lydia Kavraki (Rice University),
Vijay Pande (Stanford), Michael Levitt (Stanford) and
Jerry Wang Tsai (University of the Pacific) for providing
us with MD simulation trajectories and for useful comments
during our work.
8. References
[1] E. N. Baker, “Hydrogen Bonding in Biological Macro-
molecules,” International Tables for Crystallography,
Vol. F, No. 22, 2006, pp. 546-552.
[2] A. R. Fersht and L. Serrano, “Principles in Protein Stabil-
ity Derived from Protein Engineering Experiments,” Cur-
rent Opinion in Structural Biology, Vol. 3, No. 1, 1993,
pp. 75-83. doi:10.1016/0959-440X(93)90205-Y
[3] D. Schell, J. Tsai, J. M. Scholtz and C. N. Pace, “Hydro-
gen Bonding Increases Packing Density in the Protein In-
terior,” Proteins: Structure, Function, and Bioinformatics,
Vol. 63, No. 2, 2006, pp. 278-282.
doi:10.1002/prot.20826
[4] B. Honig, “Protein Folding: From the Levinthal Paradox to Structure Prediction,” Journal of Molecular Biology, Vol. 293, No. 2, 1999, pp. 283-293. doi:10.1006/jmbi.1999.3006
[5] C. N. Pace, “Polar Group Burial Contributes More to
Protein Stability than Nonpolar Group Burial,” Biochem-
istry, Vol. 40, No. 2, 2001, pp. 310-313.
doi:10.1021/bi001574j
[6] Z. Bikadi, L. Demko and E. Hazai, “Functional and
Structural Characterization of a Protein Based on Analy-
sis of Its Hydrogen Bonding Network by Hydrogen
Bonding Plot,” Archives of Biochemistry and Biophysics,
Vol. 461, No. 2, 2007, pp. 225-234.
doi:10.1016/j.abb.2007.02.020
[7] B. I. Dahiyat, D. B. Gordon and S. L. Mayo, “Automated Design of the Surface Positions of Protein Helices,” Protein Science, Vol. 6, No. 6, 1997, pp. 1333-1337. doi:10.1002/pro.5560060622
[8] M. Levitt, “Molecular Dynamics of Hydrogen Bonds in Bovine Pancreatic Trypsin Inhibitor Protein,” Nature, Vol. 294, 1981, pp. 379-380. doi:10.1038/294379a0
[9] K. Morokuma, “Why do Molecules Interact? The Origin of Electron Donor-Acceptor Complexes, Hydrogen Bonding, and Proton Affinity,” Accounts of Chemical Research, Vol. 10, No. 8, 1977, pp. 294-300. doi:10.1021/ar50116a004
[10] A. J. Rader, B. M. Hespenheide, L. A. Kuhn and M. F.
Thorpe, “Protein Unfolding: Rigidity Lost,” Proceedings
of the National Academy of Sciences, Vol. 99, No. 6,
2002, pp. 3540-3545. doi:10.1073/pnas.062492699
[11] M. A. Spackman, “A Simple Quantitative Model of Hy-
drogen Bonding,” Journal of Chemical Physics, Vol. 85,
No. 11, 1986, pp. 6587-6601. doi:10.1063/1.451441
[12] M. F. Thorpe, M. Lei, A. J. Rader, D. J. Jacobs and L. A.
Kuhn, “Protein Flexibility and Dynamics Using Con-
straint Theory,” Journal of Molecular Graphics and
Modeling, Vol. 19, No. 1, 2001, pp. 60-69.
doi:10.1016/S1093-3263(00)00122-4
[13] I. K. McDonald and J. M. Thornton, “Satisfying Hydro-
gen Bonding Potential in Proteins,” Journal of Molecular
Biology, Vol. 238, No. 5, 1994, pp. 777-793.
doi:10.1006/jmbi.1994.1334
[14] L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone,
“Classification and Regression Trees,” CRC Press, Boca
Raton, 1984.
[15] M. Levitt, M. Hirshberg, R. Sharon and V. Daggett, “Po-
tential Energy Function and Parameters for Simulations
of the Molecular Dynamics of Proteins and Nucleic Acids
in Solution,” Computer Physics Communications, Vol. 91, No. 1-3, 1995, pp. 215-231.
doi:10.1016/0010-4655(95)00049-L
[16] J. Srinivasan, M. Trevathan, P. Beroza and D. Case, “Ap-
plication of a Pairwise Generalized Born Model to Pro-
teins and Nucleic Acids: Inclusion of Salt Effects,” Theo-
retical Chemistry Accounts, Vol. 101, No. 6, 1999, pp.
426-434. doi:10.1007/s002140050460
[17] E. Tuv, A. Borisov and K. Torkkola, “Best Subset Fea-
ture Selection for Massive Mixed-Type Problems,” Lec-
ture Notes in Computer Science, Springer, Vol. 4224,
2006, pp. 1048-1056. doi:10.1007/11875581_125
[18] H. Joo, X. Qu, R. Swanson, C. M. McCallum and J. Tsai,
“Modeling the Dependency of Residue Packing upon
Backbone Conformation Using Molecular Dynamics
Simulation,” Computational Biology and Chemistry, Ac-
cepted, 2010.
[19] N. Haspel, D. Ricklin, B. Geisbrecht, J. D. Lambris and L. E. Kavraki, “Electrostatic Contributions Drive the Interac-
tion between Staphylococcus Aureus Protein Efb-C and
Its Complement Target C3d,” Protein Science, Vol. 17,
No. 11, 2008, pp. 1894-1906. doi:10.1110/ps.036624.108
[20] G. A. Jeffrey and W. Saenger, “Hydrogen Bonding in
Biological Structures,” Springer-Verlag, 1991.
[21] W. W. Cleland, P. A. Frey and J. A. Gerlt, “The Low
Barrier Hydrogen Bond in Enzymatic Catalysis,” Journal
of Biological Chemistry, Vol. 273, 1998, pp. 25529-
25532. doi:10.1074/jbc.273.40.25529
[22] J. H. Friedman, “Greedy Function Approximation: A Gradient Boosting Machine,” Annals of Statistics, Vol. 29, No. 5, 2001, pp. 1189-1232. doi:10.1214/aos/1013203451
[23] L. Breiman, “Random Forests,” Machine Learning, Vol. 45, No. 1, 2001, pp. 5-32. doi:10.1023/A:1010933404324
Appendix A: H-bond Identification
We use the geometric criteria proposed in [13] to identify
H-bonds in a protein conformation. These criteria, shown
in Figure A.1, specify conditions on distances and angles that must be satisfied by the atoms H (hydrogen), D (donor), A (acceptor), and AA (the atom covalently
bonded to A) for the H-bond to be considered present.
Figure A.1. Constraints on H-bond geometry.
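The sketch below shows the shape of such a geometric test; since Figure A.1 is not reproduced here, the numerical cutoffs (2.5Å on the H···A distance, 3.9Å on the D···A distance, 90˚ lower bounds on the angles) are illustrative placeholders in the spirit of [13], not the exact values used in the paper.

```python
import numpy as np

def angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle a-b-c in degrees, given 3-D coordinates."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def hbond_present(D, H, A, AA,
                  max_h_a=2.5, max_d_a=3.9, min_d_h_a=90.0, min_h_a_aa=90.0):
    """Geometric H-bond test in the spirit of [13]; thresholds are placeholders."""
    return (np.linalg.norm(H - A) <= max_h_a
            and np.linalg.norm(D - A) <= max_d_a
            and angle(D, H, A) >= min_d_h_a
            and angle(H, A, AA) >= min_h_a_aa)

# Toy usage with hand-placed coordinates (Å).
D  = np.array([0.0, 0.0, 0.0])
H  = np.array([1.0, 0.0, 0.0])
A  = np.array([2.9, 0.0, 0.0])
AA = np.array([3.6, 1.1, 0.0])
print(hbond_present(D, H, A, AA))   # True: short, nearly linear D-H...A geometry
```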
Appendix B: Computing the Split Score for
a Numerical Predictor.
Consider Step 1 (computation of the optimal split value of a numerical predictor) in Section 4.1. Let s_i and s_{i+1} be two consecutive candidate split values. Assume that we have computed the split score for s_i and that we now want to compute the score for s_{i+1}.
As we mentioned in Section 4.1, we can easily find the H-bond occurrences that are shifting from the left child L of node N to its right child R. Assume for simplicity that all predictor values are different. Then only one bond occurrence shifts. Let y_i be the measured stability of this occurrence. We keep the sum of response values (SL, SR) and the sum of squared response values (QL, QR) for the left and right nodes, and do the simple update:

SL_{i+1} = SL_i + y_i;   SR_{i+1} = SR_i − y_i;
QL_{i+1} = QL_i + y_i²;   QR_{i+1} = QR_i − y_i².

Having these sums updated, we immediately calculate the variance of the response in the left and right children as the mean of the squares minus the square of the mean:

Var(Y_L) = QL_{i+1}/(i+1) − (SL_{i+1}/(i+1))²;
Var(Y_R) = QR_{i+1}/(n−i−1) − (SR_{i+1}/(n−i−1))².

The response variance in the root node does not depend on the split point, so we can now easily calculate the split score:

W_{i+1} = Var(Y_N) − ((i+1)/n)·Var(Y_L) − ((n−i−1)/n)·Var(Y_R).

As a result, we can calculate the split score in constant time.
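This constant-time update translates directly into code; the sketch below (with our own naming) scans all candidate split points of one numerical predictor while maintaining the running sums SL, SR, QL, QR.

```python
import numpy as np

def incremental_split_scores(y_sorted: np.ndarray) -> np.ndarray:
    """Scores of all splits 'first k samples go left', for k = 1..n-1.

    y_sorted must be ordered by increasing predictor value; each step moves one
    occurrence from the right child to the left child and updates SL, SR, QL, QR.
    """
    n = len(y_sorted)
    var_root = np.var(y_sorted)
    SL = QL = 0.0
    SR, QR = float(y_sorted.sum()), float((y_sorted ** 2).sum())
    scores = []
    for k in range(1, n):                       # k samples in the left child
        yk = float(y_sorted[k - 1])
        SL += yk; QL += yk * yk                 # SL_{i+1} = SL_i + y_i, etc.
        SR -= yk; QR -= yk * yk
        var_l = QL / k - (SL / k) ** 2
        var_r = QR / (n - k) - (SR / (n - k)) ** 2
        scores.append(var_root - (k / n) * var_l - ((n - k) / n) * var_r)
    return np.array(scores)

# Toy usage: the best split separates the two plateaus of stability values.
y = np.array([0.9, 0.95, 0.92, 0.3, 0.35, 0.28])
print(int(np.argmax(incremental_split_scores(y))) + 1)   # 3 samples go left
```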
Appendix C: List of Predictors
# Feature Name Feature Meaning Type
Distance-related
1 Dist_H_D Distance between H and donor (covalent bond length) F
2 Dist_H_A Distance between H and acceptor (H-bond length) F
3 Dist_A_AA Distance between acceptor and the atom it is covalently bonded to F
4 Dist_D_A Distance between donor and acceptor F
5 Dist_D_AA Distance between donor and AA F
6 Dist_H_AA Distance between H and AA F
Angle-related
7 Ang_D_H_A Angle Donor-H-Acceptor F
8 Ang_H_A_AA Angle H-acceptor-the atom the acceptor is covalently bonded to F
9 Ang_D_A_AA Angle donor-acceptor-the atom the acceptor is covalently bonded to F
10 Ang_planar Angle between plane D-H-A and H-A-AA F
Atom
11 Atom_type_D Donor atom type (e.g. O, N, S, C) C
12 Atom_type_A Acceptor atom type (e.g. N, O, S) C
13 Atom_type_AA AA atom type (e.g. P, C, S) C
Residue
14 Resi_name_H Donor residue name (3 letter code) C
15 Resi_name_A Acceptor residue name (3 letter code) C
16 Resi_type_H Donor residue type. Nonpolar (Ala, Val, Leu, Ile, Trp, Met, Pro), Polar_acidic (Asp,
Glu), Polar_uncharged (Gly, Ser, Thr, Cys, Tyr, Asn, Gln), Polar_basic (Lys, Arg, His)C
17 Resi_type_A Acceptor residue type C
18 Resi_sch_size_H Donor residue side-chain size, i.e. number of atoms in the side-chain F
19 Resi_sch_size_A Acceptor residue side-chain size F
Bond structure type
20 Sec_type
Secondary structure of the H-bond. MA (H-atom and O-atom are in the same helix, middle portion), MB (same strand, middle), EA (same helix, end), EB (same strand, end), AL (helix-loop), BL (strand-loop), DA (different helices), SL (same loop), DL (different loops). There is no DB (different strands) because it is hard to know which strand pairs with which strand to form the sheet. C
21 Ch_type H and O are on mch or sch: MM (mch-mch), MS (mch-sch), SS (sch-sch) C
22 Rgd_type SR (H and A are in the same rigid body), DR (different rigid body) C
23 Range Difference in the residue numbers of donor and acceptor, i.e. abs(Residonor-Resiacceptor) F
24 Hybrid_state Hybridization state (sp2-sp2, sp2-sp3, sp3-sp2, sp3-sp3) C
25 Num_furcated_H Number of H-bonds that share the H-atom with this H-bond F
26 Num_furcated_A Number of H-bonds that share the acceptor with this H-bond F
Environment
27 Num_potential_As Number of potential acceptors (N, O, or S) within 3Å of H (but not covalently bonded to it), besides the current acceptor F
28 Num_hb_seqNbr Number of sequence-neighboring H-bonds, i.e., number of H-bonds of residues ±2 of
Residonor and Resiacceptor F
29 Num_hb_spaceNbr Number of space-neighboring H-bonds, i.e., number of H-bonds within 5Å of the
mid-point of this H-bond F
30 Num_hb_spaceRgdNbr Number of space-neighboring H-bonds in the same rigid-body, i.e., number of
Num_hb_spaceNbr in the same rigid-body as this H-bond (cross-rigid = –100)5 F
31 Surface Average surface percentage of the H atom and acceptor F
Energy
32 FIRST_energy Modified Mayo potential implemented in FIRST [10] F
5Here, we first use the FIRST software [12] to decompose the protein into rigid groups of atoms based on distance constraints imposed by covalent and hydrogen bonds present in the current conformation. Num_hb_spaceRgdNbr is the number of H-bonds located within 5Å of the mid-point of the analyzed H-bond in the same rigid component.