Post-translational modification (PTM) increases the functional diversity of proteins by introducing new functional groups to the side chains of their amino acids. Among all amino acid residues, the side chain of lysine (K) can undergo many types of PTM, called K-PTM, such as “acetylation”, “crotonylation”, “methylation” and “succinylation”. Multiple PTMs can also occur at the same lysine of a protein, which creates the need for multi-label PTM site identification. However, most existing computational methods predict single-label PTM sites, and the few that address the multi-label problem need further improvement. Here, we have developed a computational tool termed mLysPTMpred to predict multi-label lysine PTM sites by 1) incorporating sequence-coupled information into the general pseudo amino acid composition, 2) balancing the effect of the skewed training dataset with the Different Error Costs method, and 3) constructing a multi-label predictor from a combination of support vector machines (SVMs). This predictor achieved 83.73% accuracy in predicting the multi-label PTM sites of K-PTM types. Moreover, it outperformed the existing predictor iPTM-mLys on accuracy and on all other reported metrics. A user-friendly web server of mLysPTMpred is available at http://research.ru.ac.bd/mLysPTMpred/.
The structural and functional diversities of proteins as well as plasticity and dynamics of living cells are significantly dominated by the post-translational modifications (PTMs) [
In general, the side chain of lysine plays the key role in increasing the complexity of PTM network [
However, purely experimental techniques for determining the exact modified sites of proteins, such as mass spectrometry, peptide micro-arrays and liquid chromatography, are expensive and time-consuming, especially for large-scale datasets. In this context, computational approaches that identify K-PTM sites effectively and accurately are in high demand [
One of the major challenges in developing a computational classifier is handling the imbalanced dataset problem [8,15,19,20]: in most datasets for this kind of prediction, the negative subset is much larger than the corresponding positive subset [8,15]. Because non-K-PTM sites are always the majority in real data, a predictor trained on such data is naturally biased toward non-K-PTM sites and may misclassify many K-PTM sites as non-K-PTM sites [21-23]. However, information about K-PTM sites is more valuable than information about non-K-PTM sites. It is therefore crucial to find an effective way to counteract this bias.
The current study began as an attempt to address the problems mentioned above and to develop a more powerful predictor, based on a combination of support vector machines, for predicting multiple K-type modification sites of proteins. In this predictor, the Different Error Costs (DEC) method [24-26] is used to resolve the data imbalance issue. It should be noted that the features used in this predictor are extracted using the vectorized sequence-coupling model [
In order to launch a useful sequence-based statistical predictor for a biological system as demonstrated in a series of recent publications [8,15,28-35], the Chou’s five-step rules [
iPTM-mLys’s [
In iPTM-mLys [
$$P_{\xi}(\mathrm{K}) = R_{-\xi} R_{-(\xi-1)} \cdots R_{-2} R_{-1}\, \mathrm{K}\, R_{+1} R_{+2} \cdots R_{+(\xi-1)} R_{+\xi} \tag{1}$$
where the subscript $\xi$ is an integer, $R_{-\xi}$ represents the $\xi$-th upstream amino acid residue from the center, $R_{+\xi}$ represents the $\xi$-th downstream amino acid residue, and so forth.
The $(2\xi+1)$-tuple peptide sample $P_{\xi}(\mathrm{K})$ was further classified into the following two categories [
$$P_{\xi}(\mathrm{K}) \in \begin{cases} P_{\xi}^{+}(\mathrm{K}), & \text{if its center is a K-PTM site} \\ P_{\xi}^{-}(\mathrm{K}), & \text{otherwise} \end{cases} \tag{2}$$
where $P_{\xi}^{+}(\mathrm{K})$ denotes a true K-PTM segment with K at its center, $P_{\xi}^{-}(\mathrm{K})$ a false K-PTM segment with K at its center, and the symbol $\in$ means “a member of” in set theory.
In iPTM-mLys’s work, a $(2\xi+1)$-tuple peptide window was used to collect peptide segments that have K at the center. It should be mentioned that if fewer than $\xi$ residues are available upstream or downstream of the central K, i.e. K lies within the first $\xi$ positions or beyond position $L-\xi$ (where $L$ is the length of the protein sequence concerned), the missing amino acids are filled with the same residue as the nearest one [
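The window collection step can be illustrated with a minimal Python sketch. It is not the code used by iPTM-mLys or mLysPTMpred: the function name `extract_lysine_windows` is hypothetical, and the padding rule (repeating the terminal residue of the sequence) is one reading of "filled with the same residue as its nearest one".

```python
def extract_lysine_windows(sequence, xi=13):
    """Return (position, window) pairs for every K in `sequence`.

    Positions are 1-based. Each window has length 2*xi + 1 with K at its center.
    Assumption of this sketch: positions that fall outside the sequence are
    filled with the nearest existing (terminal) residue.
    """
    last = len(sequence) - 1
    windows = []
    for i, residue in enumerate(sequence):
        if residue != "K":
            continue
        # Clamp every index into [0, last] so out-of-range positions
        # reuse the terminal residue.
        window = "".join(
            sequence[min(max(i + offset, 0), last)]
            for offset in range(-xi, xi + 1)
        )
        windows.append((i + 1, window))
    return windows


# Toy usage with a short made-up sequence and a small window (xi = 2).
print(extract_lysine_windows("AKCDE", xi=2))
```

With the default `xi=13` every returned window has 27 residues, matching the window size described for the benchmark dataset.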
After applying screening procedures to the collected peptide samples, for example fixing the window size and keeping only one sample when two or more share the same sequence, iPTM-mLys finally constructed a benchmark dataset [
The four benchmark datasets $S_{\xi}(\mathrm{K})$ in iPTM-mLys’s study were formulated as
$$\left\{ \begin{aligned} S_{\xi}(\text{acetylation}) &= S_{\xi}^{+}(\text{acetylation}) \cup S_{\xi}^{-}(\text{acetylation}) \\ S_{\xi}(\text{crotonylation}) &= S_{\xi}^{+}(\text{crotonylation}) \cup S_{\xi}^{-}(\text{crotonylation}) \\ S_{\xi}(\text{methylation}) &= S_{\xi}^{+}(\text{methylation}) \cup S_{\xi}^{-}(\text{methylation}) \\ S_{\xi}(\text{succinylation}) &= S_{\xi}^{+}(\text{succinylation}) \cup S_{\xi}^{-}(\text{succinylation}) \end{aligned} \right. \tag{3}$$
where the positive subset $S_{\xi}^{+}(\text{acetylation})$ contains only the peptide samples whose center residue K (Equation (3)) has been experimentally confirmed as an acetylation site, while the negative subset $S_{\xi}^{-}(\text{acetylation})$ contains only those samples that cannot be acetylated, and the symbol $\cup$ means union in set theory. Likewise, the remaining three sub-equations in Equation (3) have exactly the same definition but refer to “crotonylation”, “methylation” and “succinylation”, respectively.
Using numeric values, Equation (3) was formulated in iPTM-mLys’s study as
$$\left\{ \begin{aligned} S_{\xi}(1) &= S_{\xi}^{+}(1) \cup S_{\xi}^{-}(1) \\ S_{\xi}(2) &= S_{\xi}^{+}(2) \cup S_{\xi}^{-}(2) \\ S_{\xi}(3) &= S_{\xi}^{+}(3) \cup S_{\xi}^{-}(3) \\ S_{\xi}(4) &= S_{\xi}^{+}(4) \cup S_{\xi}^{-}(4) \end{aligned} \right. \tag{4}$$
where the numerical argument 1, 2, 3 or 4 denotes ‘acetylation’, ‘crotonylation’, ‘methylation’ or ‘succinylation’, respectively.
Note that, based on preliminary tests, the window size was set to 27 ($2\xi+1$ with $\xi=13$) in iPTM-mLys’s study. The benchmark datasets obtained by iPTM-mLys for $S_{\xi=13}(1)$, $S_{\xi=13}(2)$, $S_{\xi=13}(3)$ and $S_{\xi=13}(4)$ are available in the online supplementary materials (http://research.ru.ac.bd/mLysPTMpred/) as Supporting Information. It should be mentioned that our published online supplementary materials are taken from iPTM-mLys’s work [
Appropriate features of protein sequences or samples play a very important role in PTM site prediction. How to select the core, essential features of protein samples has therefore drawn much attention, but the task is hard because such features are hidden or buried in complicated protein sequences. As most existing machine learning algorithms can handle only vectors, not sequence samples, one of the critical problems in bioinformatics is how to extract a vector from a biological sequence while retaining its key sequence characteristics [
To avoid complete losing the sequence pattern information for a protein, the pseudo amino acid composition or PseAAC [
In this paper, the incorporation of sequence-coupling model [27,42] into Chou’s general PseAAC [
(5)
where
(6)
and
Attribute | Ace, S(1) | Cro, S(2) | Met, S(3) | Suc, S(4) |
---|---|---|---|---|
Positive | 3991 | 115 | 127 | 1169 |
Negative | 2403 | 6279 | 6267 | 5225 |
PTM type and number of samples in each benchmark dataset. Ace, acetylation; Cro, crotonylation; Met, methylation; Suc, succinylation.
(7)
In Equation (5), $p^{+}_{-\xi}(R_{-\xi} \mid R_{-(\xi-1)})$ is the conditional probability of amino acid $R_{-\xi}$ occurring at the left 1st position (see Equation (1)) given that its closest right neighbor is $R_{-(\xi-1)}$, $p^{+}_{-(\xi-1)}(R_{-(\xi-1)} \mid R_{-(\xi-2)})$ is the conditional probability of amino acid $R_{-(\xi-1)}$ occurring at the left 2nd position given that its closest right neighbor is $R_{-(\xi-2)}$, and so forth. It should be mentioned that in Equation (6), only $p^{+}_{-1}(R_{-1})$ and $p^{+}_{+1}(R_{+1})$ are non-conditional probabilities, since the right neighbor of $R_{-1}$ and the left neighbor of $R_{+1}$ are always K. All these probability values can be easily derived from the positive benchmark dataset given in Supporting Information as done in [
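As an illustration of how such probabilities could be estimated by counting, the following Python sketch builds a sequence-coupling feature vector for one peptide. It rests on assumptions that are not confirmed by the text above: the final feature vector is taken to be the element-wise difference between the probabilities estimated from the positive and the negative sets (the usual form of the sequence-coupling model), unseen neighbor contexts are given probability zero, and all function names are hypothetical.

```python
def conditional_frequency(peptides, pos, residue, neighbor_pos, neighbor_residue):
    """Estimate P(residue at pos | neighbor_residue at neighbor_pos) by counting."""
    matching = [p for p in peptides if p[neighbor_pos] == neighbor_residue]
    if not matching:
        return 0.0  # assumption: an unseen neighbor context contributes zero
    return sum(p[pos] == residue for p in matching) / len(matching)


def coupling_vector(peptide, peptides):
    """Probability of each flanking residue given its neighbor toward the center.

    Because every training peptide has K at the center, conditioning the two
    residues adjacent to the center on K yields their unconditional frequency.
    """
    center = len(peptide) // 2
    vector = []
    for pos in range(len(peptide)):
        if pos == center:
            continue
        # Upstream residues look at their right neighbor, downstream at their left.
        neighbor = pos + 1 if pos < center else pos - 1
        vector.append(
            conditional_frequency(peptides, pos, peptide[pos], neighbor, peptide[neighbor])
        )
    return vector


def sequence_coupling_features(peptide, positives, negatives):
    """Assumed feature vector: positive-set minus negative-set probabilities."""
    plus = coupling_vector(peptide, positives)
    minus = coupling_vector(peptide, negatives)
    return [p - m for p, m in zip(plus, minus)]


# Toy usage with made-up 3-residue peptides (xi = 1).
print(sequence_coupling_features("AKC", ["AKC", "AKC", "GKC"], ["GKD", "GKC"]))
```

The resulting vector has $2\xi$ components, one for each non-central position of the window.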
The modeling algorithm of SVM searches for an optimal hyperplane with the maximum margin for separating two classes by solving the following constrained optimization problem [43-45]:
$$\begin{aligned} \underset{\alpha}{\text{maximize}} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j) \\ \text{subject to} \quad & \sum_{i=1}^{n} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C \ \text{ for all } i = 1, 2, 3, \cdots, n \end{aligned} \tag{8}$$
where $x_i \in \mathbb{R}^{p}$ and $y_i \in \{-1, +1\}$ is the class label of $x_i$, $1 \le i \le n$.
Finally, the discriminant function of SVM, involving the kernel function, takes the following form
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i\, k(x, x_i) + b \tag{9}$$
It is noted here that a kernel function and its parameter have to be chosen to build an SVM classifier [43-45]. In this work, the radial basis function kernel has been used to build the SVM classifier, which is defined below:
$$k(x_i, x_j) = \exp\left( -\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}} \right),$$ where $\sigma$ is the width of the function.
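The discriminant function in Equation (9) with the radial basis function kernel can be written out directly. The Python sketch below is only illustrative: the support vectors, multipliers and bias in the usage line are made-up numbers, not values from a trained mLysPTMpred model, and the function names are hypothetical.

```python
import math


def rbf_kernel(x, z, sigma):
    """Radial basis function kernel: exp(-||x - z||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def svm_decision(x, support_vectors, alphas, labels, bias, sigma):
    """Evaluate f(x) = sum_i alpha_i * y_i * k(x, x_i) + b."""
    return sum(
        alpha * y * rbf_kernel(x, sv, sigma)
        for sv, alpha, y in zip(support_vectors, alphas, labels)
    ) + bias


# Toy usage: one positive and one negative support vector on a line.
value = svm_decision([0.0], [[0.0], [2.0]], [1.0, 1.0], [+1, -1], bias=0.0, sigma=1.0)
print("positive" if value > 0 else "negative")
```

A sample is assigned to the positive class when $f(x) > 0$ and to the negative class otherwise.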
Any dataset with an unequal distribution between its classes can be considered an imbalanced dataset. The main challenge is that the small classes are often the more useful ones, but standard classifiers tend to be overwhelmed by the large classes and ignore the small ones. Although SVMs work effectively with balanced datasets, they provide sub-optimal models with imbalanced datasets [24,25]. The main reason the SVM algorithm is sensitive to class imbalance is that the soft margin objective function [43-45] assigns the same cost (i.e., C) to both positive and negative misclassifications in the penalty term [
In this paper, we have used a Different Error Costs (DEC) method to handle imbalance dataset problem of K-PTM sites prediction. The Different Error Costs (DEC) method is a cost-sensitive learning solution proposed in [
$$C^{+} = C \cdot \frac{N}{2 N_1}, \qquad C^{-} = C \cdot \frac{N}{2 N_2} \tag{10}$$
where $N$ is the total number of instances, $N_1$ is the number of instances in the positive class, and $N_2$ is the number of instances in the negative class.
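To make Equation (10) concrete, the short Python sketch below computes the two misclassification costs for the sample counts reported for the four benchmark datasets. The base value `C = 1.0` is chosen only for illustration, and the function name `dec_costs` is hypothetical.

```python
def dec_costs(C, n_positive, n_negative):
    """Return (C_plus, C_minus) as defined in Equation (10)."""
    n_total = n_positive + n_negative
    c_plus = C * n_total / (2 * n_positive)
    c_minus = C * n_total / (2 * n_negative)
    return c_plus, c_minus


# (positive, negative) sample counts of the four benchmark datasets.
SAMPLE_COUNTS = {
    "acetylation": (3991, 2403),
    "crotonylation": (115, 6279),
    "methylation": (127, 6267),
    "succinylation": (1169, 5225),
}

for ptm_type, (n_pos, n_neg) in SAMPLE_COUNTS.items():
    c_plus, c_minus = dec_costs(1.0, n_pos, n_neg)
    print(f"{ptm_type}: C+ = {c_plus:.3f}, C- = {c_minus:.3f}")
```

With these costs the total penalty budget of the two classes is equal ($C^{+} N_1 = C^{-} N_2$), so the minority class is not dominated by the majority class during training.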
In statistical prediction, there are three commonly used methods to derive the metric values for a predictor, these are, the independent dataset test, subsampling (e.g., K-fold cross validation) test, and jackknife test [15,46]. These methods are often used for testing the accuracy of a statistical prediction algorithm. However, among those three methods, the jackknife test is deemed the most objective because it can always yield a unique result for a given benchmark data set, as reported in a comprehensive review [
In this study, we have used K-fold cross validation (subsampling) method to save the computational time. As the information about the exact 5-way splits of dataset used in previous studies is not published [
According to the description of iPTM-mLys dataset in [
To measure the predictive capability and reliability of this kind of classification, a set of metrics is usually used in the literature, defined below [
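$$\left\{ \begin{aligned} \text{Aiming} &= \frac{1}{N} \sum_{k=1}^{N} \left( \frac{\lVert Y_k \cap Z_k \rVert}{\lVert Z_k \rVert} \right) \\ \text{Coverage} &= \frac{1}{N} \sum_{k=1}^{N} \left( \frac{\lVert Y_k \cap Z_k \rVert}{\lVert Y_k \rVert} \right) \\ \text{Accuracy} &= \frac{1}{N} \sum_{k=1}^{N} \left( \frac{\lVert Y_k \cap Z_k \rVert}{\lVert Y_k \cup Z_k \rVert} \right) \\ \text{Absolute-True} &= \frac{1}{N} \sum_{k=1}^{N} \Delta(Y_k, Z_k) \\ \text{Absolute-False} &= \frac{1}{N} \sum_{k=1}^{N} \left( \frac{\lVert Y_k \cup Z_k \rVert - \lVert Y_k \cap Z_k \rVert}{M} \right) \end{aligned} \right. \tag{11}$$
where $N$ is the total number of samples concerned, $M$ is the total number of labels in the system, $\cup$ and $\cap$ are the symbols for “union” and “intersection” in set theory, $\lVert \cdot \rVert$ is the operator that counts the number of elements in a set, $Y_k$ denotes the subset that contains all the experimentally observed labels for the $k$-th sample, $Z_k$ represents the subset that contains all the labels predicted for the $k$-th sample, and
$$\Delta(Y_k, Z_k) = \begin{cases} 1, & \text{if all labels in } Z_k \text{ are identical with those in } Y_k \\ 0, & \text{otherwise} \end{cases}$$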
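The five metrics of Equation (11) can be computed from sets of observed and predicted labels as in the following Python sketch. One point is an assumption of the sketch and not stated in the text: when a denominator is zero, the ratio is taken as 1 if both label sets are empty (a site with no modification that is predicted to have none) and as 0 otherwise. The function names are hypothetical.

```python
def _ratio(numerator, denominator, both_empty):
    """Safe ratio; the handling of empty label sets is an assumption of this sketch."""
    if denominator == 0:
        return 1.0 if both_empty else 0.0
    return numerator / denominator


def multilabel_metrics(true_sets, predicted_sets, n_labels):
    """Return the five metrics of Equation (11) as fractions in [0, 1]."""
    n = len(true_sets)
    totals = {"Aiming": 0.0, "Coverage": 0.0, "Accuracy": 0.0,
              "Absolute-True": 0.0, "Absolute-False": 0.0}
    for y, z in zip(true_sets, predicted_sets):
        y, z = set(y), set(z)
        inter, union = len(y & z), len(y | z)
        both_empty = not y and not z
        totals["Aiming"] += _ratio(inter, len(z), both_empty)
        totals["Coverage"] += _ratio(inter, len(y), both_empty)
        totals["Accuracy"] += _ratio(inter, union, both_empty)
        totals["Absolute-True"] += 1.0 if y == z else 0.0
        totals["Absolute-False"] += (union - inter) / n_labels
    return {name: value / n for name, value in totals.items()}


# Toy usage: two samples, M = 4 possible labels.
observed = [{"Ace", "Cro"}, {"Ace"}]
predicted = [{"Ace", "Cro"}, {"Ace", "Met"}]
print(multilabel_metrics(observed, predicted, n_labels=4))
```

In this toy case the first sample is predicted perfectly and the second receives one spurious label, so Absolute-True is 0.5 while Coverage remains 1.0.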
All of these metrics defined in this section have been successfully applied to study several multi-label systems, such as those in which a protein may stay in two or more different subcellular locations [
In order to generate highly performing SVM classifiers capable of dealing with real data an efficient model selection is required [
In this study, four SVM classifiers, one for each dataset, have been used for predicting the acetylation, crotonylation, methylation and succinylation sites. Model selection for each SVM classifier has been done separately, treating it as a binary classifier, using the corresponding benchmark dataset given in
For the radial basis function (RBF) kernel, we searched for the parameters C (penalty term for the soft margin) and σ (sigma) over the range $2^{-8}$ to $2^{8}$ for each. The value of C is then used to obtain the misclassification costs C+ and C− defined in Equation (10). Since the information about the exact 5-way splits of dataset used in previous studies is not published [
After obtaining the four trained binary SVM classifiers with appropriate values of C and sigma (Supplementary Table S1), a multi-label predictor, named mLysPTMpred, has been developed by combining the outputs of these four SVM classifiers, as shown in
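One simple way to combine four binary outputs into a multi-label prediction is to collect every PTM type whose classifier returns a positive decision value. The Python sketch below illustrates that rule under stated assumptions: the combination rule itself is assumed rather than taken from the text, and `binary_classifiers` is a hypothetical mapping from PTM type to any callable that returns an SVM decision value.

```python
PTM_TYPES = ("acetylation", "crotonylation", "methylation", "succinylation")


def predict_label_set(features, binary_classifiers):
    """Return the set of PTM types whose binary classifier votes positive.

    `binary_classifiers` maps each PTM type to a callable returning a decision
    value; a value above zero is read as "this lysine carries that PTM".
    """
    return {
        ptm_type
        for ptm_type in PTM_TYPES
        if binary_classifiers[ptm_type](features) > 0
    }


# Toy usage with stand-in decision functions instead of trained SVMs.
toy_classifiers = {
    "acetylation": lambda x: x[0],
    "crotonylation": lambda x: -1.0,
    "methylation": lambda x: 0.5,
    "succinylation": lambda x: x[0] - 1.0,
}
print(sorted(predict_label_set([0.2], toy_classifiers)))
```

An empty set corresponds to a lysine predicted to carry none of the four modifications.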
However, to train the system for the web server, we have used the values of C and sigma that were selected most often as the best model across 5 complete runs of 5-fold cross validation on each dataset. Where this “most often” criterion failed to single out a pair, C and sigma were selected at random from the 5 candidate pairs for that dataset. The C and sigma selected in this way for each type of dataset are given in
Type of PTM | C | σ |
---|---|---|
acetylation | $2^{4}$ | $2^{4}$ |
crotonylation | $2^{-2}$ | $2^{7}$ |
methylation | $2^{0}$ | $2^{4}$ |
succinylation | $2^{1}$ | $2^{3}$ |
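The selection rule for the web-server parameters (the pair chosen most often across the five runs, with a random pick when no pair stands out) can be sketched in Python as follows. The function name `select_parameters` and the example pairs are hypothetical, and treating a tie for first place as the failure case is an assumption of the sketch.

```python
import random
from collections import Counter


def select_parameters(best_per_run, rng=None):
    """Pick the (C, sigma) pair that won most runs; fall back to a random run.

    `best_per_run` holds the best (C, sigma) pair of each complete
    cross-validation run. If no single pair has the highest count, one of the
    candidate pairs is drawn at random.
    """
    rng = rng or random.Random()
    counts = Counter(best_per_run)
    top = max(counts.values())
    winners = [pair for pair, count in counts.items() if count == top]
    if len(winners) == 1:
        return winners[0]
    return rng.choice(list(best_per_run))


# Toy usage with made-up results of five runs.
runs = [(2**4, 2**4), (2**4, 2**4), (2**2, 2**3), (2**4, 2**4), (2**1, 2**3)]
print(select_parameters(runs))
```

Passing a seeded `random.Random` instance makes the fallback choice reproducible.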
All training and testing were conducted on a standard DELL Optiplex 390 machine with 8 GB RAM and a Core-i3 processor running at 3.30 GHz. We used Matlab 2014b to implement our system; by default, the svmtrain function of Matlab uses DEC with the same costs as defined in Eq. (10) to handle the imbalance.
The values of the five metrics (cf. Equation (11)) obtained by the current mLysPTMpred predictor for multi-label lysine PTM site are given in the
In Equation (11), the first four metrics are completely opposite to the last one. For the former, the higher the rate is, the better the multi-label predictor’s performance will be; for the latter, the lower the rate is, the better its performance will be [
Among the five metrics in Equation (11), the most strict and harsh one is the “Absolute-True”. According to [
Also, among the same five metrics, the most important is the “Accuracy”: the average ratio of the correctly predicted labels over the total labels, which include the correctly and incorrectly predicted ones as well as the real labels that were missed during prediction. The mLysPTMpred predictor achieves 83.73% accuracy, which is considerably higher than that of iPTM-mLys. In addition, the rate of “Aiming” or “Precision” [
Predictor | Aiming (%) | Coverage (%) | Accuracy (%) | Absolute-True (%) | Absolute-False (%) |
---|---|---|---|---|---|
iPTM-mLys | 69.78 | 74.54 | 68.37 | 60.92 | 13.40 |
mLysPTMpred | 84.82 (±0.0022) | 86.56 (±0.0021) | 83.73 (±0.0024) | 79.73 (±0.0029) | 6.66 (±0.00009) |
Therefore, it is obvious from
In iPTM-mLys, an example protein sequence (Q16778) was used to validate the findings. The same sequence is available on our site under the Example button; for ease of comparison, we have left it unchanged. The prediction result for this sequence (Q16778) from mLysPTMpred and its actual experimental result are reported in
Why can the proposed method enhance the prediction quality so significantly? First, the coupling effects among the amino acids around the target sites have been taken into account via the conditional probability as done by many investigators in successfully enhancing the prediction quality in some applications [27,31,35,42]. Second, the predictor used Different Error Costs (DEC) method to balance the effect of skewed training dataset and hence many false prediction events produced by imbalanced and skewed training datasets can be avoided as established in some recent studies [7,8,15,31,35].
For the convenience of experimental scientists and to enhance the practical value of the method, a user-friendly web server for mLysPTMpred has been established at http://research.ru.ac.bd/mLysPTMpred/. To obtain a prediction, users submit a protein sequence through the input text box on our site. The input sequence should follow the FASTA format. An example sequence in FASTA format is available under the Example button on the site. For batch prediction, users upload a batch input file in FASTA format. Note that the benchmark datasets used to train and test the mLysPTMpred predictor are available under the Supporting Information button.
In this article, we have designed a simple and efficient predictor, mLysPTMpred, for predicting multiple lysine PTM sites. Experimental results show that our method is very promising and can be a useful tool for predicting multiple lysine PTM sites. mLysPTMpred has achieved remarkably higher success rates than the existing predictor (iPTM-mLys) in this area. We believe that the approach and formulations proposed in this article for multi-label K-PTM can be used to study other multi-label PTM systems such as C-PTM, R-PTM and S-PTM for the corresponding multi-label PTM sites at Cys, Arg and Ser residues, respectively.
Site | Predicted Ace | Predicted Cro | Predicted Met | Predicted Suc | Experimental Ace | Experimental Cro | Experimental Met | Experimental Suc |
---|---|---|---|---|---|---|---|---|
6 | Yes | Yes | Yes | No | Yes | Yes | No | No |
12 | Yes | Yes | No | No | Yes | Yes | No | No |
13 | Yes | Yes | No | No | Yes | Yes | No | No |
16 | Yes | Yes | No | No | Yes | Yes | No | No |
17 | Yes | Yes | No | No | Yes | Yes | No | No |
21 | Yes | Yes | No | No | Yes | Yes | No | No |
24 | Yes | Yes | No | No | Yes | Yes | No | No |
25 | No | No | No | No | No | No | No | No |
28 | No | No | No | No | No | No | No | No |
29 | No | No | No | No | No | No | No | No |
31 | No | No | No | No | No | No | No | No |
35 | No | Yes | No | No | No | Yes | No | No |
44 | No | No | No | No | No | No | No | No |
47 | No | No | Yes | No | No | No | Yes | No |
58 | No | No | Yes | No | No | No | Yes | No |
86 | Yes | No | Yes | No | Yes | No | Yes | No |
109 | No | No | Yes | No | No | No | Yes | No |
117 | Yes | No | No | No | No | No | No | No |
121 | Yes | No | No | No | No | No | No | No |
126 | Yes | No | No | No | No | No | No | No |
For the convenience of experimental scientists, we have established a user-friendly web server and provided a step-by-step guide on how to use it. It provides an easier way to obtain the desired results without knowing the mathematical details. We anticipate that mLysPTMpred will become a very useful, high-throughput tool for dealing with both single- and multi-label PTM systems.
As the current mLysPTMpred has been developed to study the multi-label system for only four different K-PTM types, we will try to add more types of K-PTM and include more new sequences [
The authors declare no conflicts of interest regarding the publication of this paper.