P. F. SUN ET AL.
Copyright © 2013 SciRes. ENG
is a user-defined constant, and an unlabelled vector (a
query or test point) is classified by assigning the label
which is most frequent among the k training samples
nearest to that query point.
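The voting rule described above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes Euclidean distance, and the function name `knn_classify`, the toy points, and the labels are hypothetical rather than taken from the paper.

```python
from collections import Counter
import math

def knn_classify(train_points, train_labels, query, k=3):
    """Assign `query` the most frequent label among its k nearest training samples."""
    # Euclidean distance is an assumption; the text does not fix a metric here.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote over the k nearest labels.
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data (hypothetical): two clusters with different labels.
train_points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
train_labels = ["non-bonded", "non-bonded", "bonded", "bonded", "bonded"]
print(knn_classify(train_points, train_labels, (0.95, 0.9), k=3))
```

A larger k smooths the decision at the cost of blurring class boundaries, which is why k is left as a user-defined constant.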
3. Selection of Protein Nature and Encoding Method
Proteins have many and varied attributes, and it is not
realistic to use all of them as conditions for predicting
the formation of disulfide bonds. In this research, based
on earlier results, several important attributes are
selected as inputs to the prediction models, and the data
are encoded according to their characteristics.
Attribute 1: hydrophobicity of amino acids
In an aqueous medium, the folding of a globular protein
tends to bury hydrophobic amino acids in the interior of
the protein. This behavior reflects the hydrophobicity of
amino acids, which plays a prominent role in stabilizing
the three-dimensional structure of the protein. Earlier
research indicates that the proportion of hydrophobic
amino acids differs between disulfide-bonded and
non-disulfide-bonded cases. These results suggest that
hydrophobicity is very important to the formation of
disulfide bonds. Therefore, this property can be used as
an input to the prediction model. A hydrophobic amino
acid is encoded as 1 and a non-hydrophobic amino acid is
encoded as 0.
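The binary encoding can be sketched as follows. The paper does not list which residues it treats as hydrophobic, so the set used here is an assumption (one common textbook grouping), and the function name is hypothetical.

```python
# Assumed hydrophobic residues (one-letter codes); the paper does not give its list.
HYDROPHOBIC = set("AVLIMFWP")

def encode_hydrophobicity(sequence):
    """Return 1 for each hydrophobic residue and 0 for every other residue."""
    return [1 if residue in HYDROPHOBIC else 0 for residue in sequence.upper()]

print(encode_hydrophobicity("ACDLKW"))  # [1, 0, 0, 1, 0, 1]
```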
Attribute 2: protein secondary structure [6]
Protein secondary structure carries folding information
that is significant for predicting and reconstructing the
3D structure of a protein. It is likewise important to the
prediction of protein disulfide bonding addressed in this
work. The secondary structure data used in this research
are taken from the secondary structure database DSSP.
DSSP is a database of secondary structure assignments for
all proteins in the Protein Data Bank (PDB). For any
protein in the PDB, the corresponding secondary structure
can be derived from its three-dimensional structure. The
secondary structure is encoded as follows: the code of α
is 00, the code of β is 01, and the code of any other
structure is 10.
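A minimal sketch of this two-bit code is given below. It assumes DSSP state `H` is treated as α and `E` as β, with every other state mapped to "other"; the paper does not say how the DSSP states are grouped, and the names used here are hypothetical.

```python
# Assumed reduction of DSSP states: H -> alpha (00), E -> beta (01), anything else -> other (10).
SS_CODES = {"H": (0, 0), "E": (0, 1)}
OTHER_CODE = (1, 0)

def encode_secondary_structure(dssp_states):
    """Concatenate the two-bit code of every residue in a DSSP state string."""
    bits = []
    for state in dssp_states:
        bits.extend(SS_CODES.get(state, OTHER_CODE))
    return bits

print(encode_secondary_structure("HHET"))  # [0, 0, 0, 0, 0, 1, 1, 0]
```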
Attribute 3: evolutionary information of protein [7]
Research on protein secondary structure prediction has
shown that using the evolutionary information of a protein
markedly increases prediction accuracy. This suggests that
evolutionary information contains important information
about the formation of protein structure. This research
therefore introduces evolutionary information into the
prediction of protein disulfide bonds. The evolutionary
information is obtained from the HSSP database. HSSP is a
database of protein secondary structures derived by
aligning to each protein of known structure all sequences
deemed homologous. It contains sequence information
derived from the alignment between each protein and its
homologues in the protein database. According to this
information, every sequence position is encoded as Pi,
where Pi is the probability that amino acid type i occurs
at that position, and i ranges from 1 to 20.
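As a rough illustration of the Pi encoding, the sketch below computes the 20 values for a single alignment column. It assumes the probabilities are plain relative frequencies over the aligned homologous residues with gaps ignored; the paper takes these values from HSSP rather than computing them, and the function name, alphabet order, and toy column are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 amino acid types

def position_profile(aligned_column):
    """Return 20 relative frequencies for one alignment column, ignoring gaps."""
    residues = [r for r in aligned_column.upper() if r in AMINO_ACIDS]
    if not residues:
        # No aligned residues at this position: return an all-zero profile.
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]

# Toy column: the residues that five aligned sequences show at one position.
profile = position_profile("CCCS-")
print(profile[AMINO_ACIDS.index("C")], profile[AMINO_ACIDS.index("S")])
```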
To account for the interaction between neighboring amino
acids, a sliding window of length 7 is used as the unit.
With the above encoding method, each candidate amino acid
pair (i, j) is encoded as a 322-dimensional input vector.
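One decomposition consistent with the stated 322 dimensions is 2 residues × 7 window positions × 23 features per position, where 23 = 1 (hydrophobicity) + 2 (structure code) + 20 (profile). The paper does not spell this out, so the sketch below should be read as an assumption; the zero padding at the chain ends and all names are likewise hypothetical.

```python
WINDOW = 7
FEATURES_PER_RESIDUE = 1 + 2 + 20  # hydrophobicity bit + structure code + profile (assumed)

def window_features(per_residue, center):
    """Concatenate the features of the WINDOW residues centred on `center`."""
    half = WINDOW // 2
    padding = [0.0] * FEATURES_PER_RESIDUE  # assumed zero padding past the chain ends
    vector = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(per_residue):
            vector.extend(per_residue[pos])
        else:
            vector.extend(padding)
    return vector

def encode_pair(per_residue, i, j):
    """Build the input vector for the residue pair (i, j) from its two windows."""
    return window_features(per_residue, i) + window_features(per_residue, j)

# Toy protein of 30 residues with placeholder (all-zero) per-residue features.
toy_protein = [[0.0] * FEATURES_PER_RESIDUE for _ in range(30)]
print(len(encode_pair(toy_protein, 2, 17)))  # 322
```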
4. Sample Selection Method Based on the K-Nearest Neighbor Algorithm
The artificial neural network (BP network) is a
classification model that arose from simulating the
information processing and learning procedures of the
brain. It was developed on the basis of research into
human information processing, draws on modern
neurobiology and cognitive science, and offers strong
adaptivity and self-learning ability, nonlinear mapping
ability, robustness, and fault tolerance. In recent years,
with the development of artificial neural networks, they
have been applied successfully throughout bioinformatics
and have become an increasingly important machine
learning tool for solving problems in sequence analysis
and pattern recognition.
The performance of a neural network depends on the
quality of its training samples. If the training set is too
small, it may not contain enough information; if it is too
large, it may contain redundant samples, increase the
training time, and be likely to cause overfitting. The
selection of training samples is therefore important for
prediction modeling, and how the training samples are
chosen is key to improving classification performance.
Analysis of the working principle of a neural network
shows that it is essentially an optimized form of the
nearest neighbor classifier, except that its templates are
stored in the network structure as weight values, which
are adjusted repeatedly until they fit the templates. The
K-Nearest Neighbor algorithm, by contrast, uses
representative samples directly as templates and
determines the category of a sample from its distance to
those templates, so no iterative adjustment is needed.
Comparing the two, neural network training is complex but
gives higher classification accuracy, whereas the
K-Nearest Neighbor algorithm is less precise but simple
and fast. Therefore