J. Biomedical Science and Engineering, 2010, 3, 791-798
doi:10.4236/jbise.2010.38105 Published Online August 2010 (http://www.SciRP.org/journal/jbise/)

Innovative data mining approaches for outcome prediction of trauma patients

Eleni-Maria Theodoraki1, Stylianos Katsaragakis2, Christos Koukouvinos3, Christina Parpoula3

1Department of Statistics and Actuarial-Financial Mathematics, University of the Aegean, Samos Island, Greece; 2First Propaedeutic Surgery Clinic, Hippocratio Hospital, Athens, Greece; 3Department of Mathematics, National Technical University of Athens, Athens, Greece.
Email: parpoula.ch@gmail.com

Received 7 June 2010; revised 17 June 2010; accepted 23 June 2010.

ABSTRACT

Trauma is the most common cause of death among young people, and many of these deaths are preventable [1]. Predicting the outcome of trauma patients has long been a difficult problem. In this study, prediction models are built and their ability to accurately predict mortality is assessed. The analysis compares data mining techniques using classification, clustering and association algorithms. Data were collected by the Hellenic Trauma and Emergency Surgery Society from 30 Greek hospitals. The dataset contains records of 8544 patients suffering from severe injuries, collected from 2005 to 2006. Factors include patients' demographic characteristics and several other variables registered from the time and place of the accident until hospital treatment and final outcome. The obtained results are compared in terms of sensitivity, specificity, positive predictive value and negative predictive value, and ROC curves depict the performance of the methods.

Keywords: Data Mining; Medical Data; Decision Trees; Classification Rules; Association Rules; Clusters; Confusion Matrix; ROC

1. INTRODUCTION

One of the most common and rapidly growing causes of death and disability worldwide, regardless of a country's level of development, is traumatic injury [2]. Every day 16,000 people die [3]; trauma is the leading cause of death up to the age of 44 years [4] and the fourth leading cause across all ages, after cardiovascular, neoplastic, and respiratory diseases. In 1996 the National Academy of Sciences and the National Research Council published a report which characterized injury as the "neglected disease of the modern world".

Due to technological advancements in the healthcare domain, an enormous amount of data has been collected over the last few years. This is accompanied by clinicians' willingness to explore different technologies and methodologies to analyze these data, because their assessment may reveal previously unknown trends and patterns which could significantly enhance the understanding of disease management. Interest in developing prognostic models for binary outcomes has emerged as an essential tool for the evaluation of medical treatment. Multiple models exist to assist with prediction of the outcome of injured patients, and many comparisons between different methods have been published [4]. Traditionally, researchers have used regression techniques, which are not ideal for handling the multidimensional, complex biologic data stored in large databases and are time consuming. Therefore, since there is no consensus as to an optimal method, it is interesting to explore different methods.
Data mining methods were developed to overcome these limitations. With these techniques, a priori knowledge of variable associations is unnecessary. In contrast to an a priori approach to the selection of predictor variables, data mining allows the discovery of previously unknown variable relationships by exploring a wide range of possible predictor variables. The aim of data mining is to find hidden patterns and associations in the data. The use of data mining methods to derive medical prognostic models from retrospective data can take advantage of the increased availability and volume of medical data gathered through the systematic use of laboratory, clinical and hospital information systems. It can also lead to the construction of interpretable prognostic models, the handling of noise and missing values, and the discovery and incorporation of non-linear patterns and feature combinations.

This paper investigates the utility of machine learning techniques to construct outcome prediction models for severe trauma patients and examines measures that may improve the quality of treatment, and therefore the survivability of patients, through optimal management. The study is organized as follows. Section 2 introduces the dataset that was used to investigate the plausibility of modeling the outcome; the statistical methods used for that purpose were classification, association and clustering algorithms. The results of the data analysis and their evaluation according to predictive ability are reported in Section 3. Section 4 summarizes the results and provides the conclusions of the paper.

2. MATERIALS AND METHODS

2.1. Patient Population and Variables

Our database consisted of cases collected during the project entitled "Report of the epidemiology and management of trauma in Greece", which was initiated in October 2005 and lasted for twelve months. The study included patients from 30 teaching and general hospitals who were admitted with a primary diagnosis of injury. Information was gathered for trauma patients admitted for at least one day in hospital. To avoid biasing the estimates, persons who arrived dead or died in the Emergency Room of each hospital were excluded from the analysis. Data collection and injury scoring were performed by a highly-trained coordinator.

The input variables extracted and included in the models concerned demographics, mechanism of injury, month of admission to hospital, whether the patient was referred from another hospital, prehospital care, hospital care and procedures, and outcome at discharge. Various injury severity scores were also considered, including the Injury Severity Score (ISS) [5], the Abbreviated Injury Scale (AIS) [6], and the Glasgow Coma Score (GCS) [7]. For all models there was a single output variable: probability of death.

Registration of the trauma data was followed by extensive correction and verification. Missing data were also handled during preprocessing. Despite the challenges inherent when data are missing, information can be gained when a thoughtful and systematic analytical approach is used [8]. For that purpose, multiple imputation (MI) was used to handle missing-at-random data in our dataset [9], in order to minimize bias and increase the validity of the findings.
In this method, multiple (m) versions of the data set (typically 5-20) are created, using the available data to predict missing values. These data sets are then used to conduct m analyses, which are combined into one inferential analysis. The particular appeal of this method is that once the completed data sets have been developed, standard statistical methods can be used. Adapting the multiple imputation output to the data mining methods, the derived datasets were compared in terms of performance (correctly classified records), and the one with the most correctly classified training and test sets was chosen. The analysis was carried out using the SPSS 17.0 and SPSS Clementine 12.0 statistical software.

2.2. Data Mining Algorithms

In this section we present the data mining methods that were applied to analyze the trauma data. These methods may be categorized according to their goal as feature selection methods, decision tree learners, binary classifier comparison metrics, clustering algorithms and generalized rule induction algorithms.

2.2.1. Feature Selection

In order to reduce the data set size, minimize the computational time and improve model accuracy, a set of variable selection criteria may be used. Such criteria are the maximum percentage of records in a single category criterion, as fields in which too many records fall into the same category may be omitted, and the maximum number of categories as a percentage of records criterion, as a field may be ignored if a high percentage of its categories contains only a single case. There are two more variable selection criteria, the minimum standard deviation criterion and the minimum coefficient of variation criterion. According to these, fields with standard deviation or, respectively, coefficient of variation less than or equal to the specified minimum may be of limited use. The coefficient of variation is defined as the ratio of the predictor standard deviation to the predictor mean.

A common technique used in data mining is ranking the attributes based on a measure of importance defined as (1 - p), where p is the p-value of a chosen statistical test such as the Pearson chi-square statistic, the likelihood-ratio chi-square statistic, Cramer's V or the lambda statistic. More details can be found, among others, in [10-12] and [13]. The Pearson chi-square test is a test of independence between X, a predictor with I categories, and Y, the target with J categories, that involves the difference between the observed and the expected frequencies. The expected cell frequencies under the null hypothesis of independence are estimated by $\hat{N}_{ij} = N_{i.} N_{.j} / N$, where $N$ is the total number of cases, $N_{ij}$ is the number of cases with X = i and Y = j, $N_{i.}$ is the number of cases with X = i ($N_{i.} = \sum_{j=1}^{J} N_{ij}$), and $N_{.j}$ is the number of cases with Y = j ($N_{.j} = \sum_{i=1}^{I} N_{ij}$).

Under the null hypothesis, the Pearson chi-square statistic converges asymptotically to a chi-square distribution $\chi^2_d$ with degrees of freedom $d = (I - 1)(J - 1)$. The p-value based on the Pearson chi-square statistic $X^2$ is calculated by p-value = Prob$(\chi^2_d > X^2)$, where $X^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} (N_{ij} - \hat{N}_{ij})^2 / \hat{N}_{ij}$.
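To make the (1 - p) importance ranking concrete, the minimal sketch below computes it for each categorical predictor against the binary outcome using Pearson's chi-square test of independence. It is an illustration only, not the SPSS Clementine feature selection node itself; the DataFrame `df` and the column name "outcome" are hypothetical placeholders.

```python
# Minimal sketch of the (1 - p) importance ranking described above, using
# Pearson's chi-square test of independence between each categorical predictor
# and the binary outcome. The DataFrame `df` and the column name "outcome"
# are hypothetical placeholders, not actual trauma registry fields.
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_importance(df: pd.DataFrame, target: str) -> pd.Series:
    """Return importance = 1 - p for every predictor column in df."""
    importances = {}
    for col in df.columns:
        if col == target:
            continue
        # Observed contingency table N_ij of predictor categories vs. outcome.
        table = pd.crosstab(df[col], df[target])
        chi2, p_value, dof, expected = chi2_contingency(table)
        importances[col] = 1.0 - p_value
    return pd.Series(importances).sort_values(ascending=False)

# Hypothetical usage: keep predictors significant at the 5% level,
# i.e. importance 1 - p > 0.95.
# ranked = chi_square_importance(df, "outcome")
# selected = ranked[ranked > 0.95].index.tolist()
```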
2.2.2. Decision Trees

Decision tree models are a structural description which provides the opportunity to develop classification models that may be used to predict or classify future data sets according to a number of derived decision rules. Future data with unknown classification may be classified simply by routing them down the tree according to the tests in the nodes and assigning the class of the reached leaf. Some of the advantages of this approach are that it is easily understandable, it can be transformed into a set of if-then rules that interpret the data set, and the resulting tree includes only the important attributes that really contribute to the decision making. The Classification and Regression Tree (C&RT) method is based on recursive partitioning, splitting the training set into subsets so as to obtain subsets more homogeneous than in the previous step. The split is based on the reduction in an impurity index; in this study we used the Gini index. The CHAID algorithm, or Chi-square Automatic Interaction Detection, is based on the significance level of a statistical test and is a non-binary tree method, that is, it can produce more than two branches at any particular level of the tree. The C5.0 algorithm works for data sets where the target field is categorical and builds a decision tree by splitting the sample based on the field that provides the maximum information gain at each level.

2.2.3. Clustering

Clustering is concerned with grouping records with respect to similarity of values for a set of input fields, without the benefit of prior knowledge about the form and characteristics of the groups. K-means is an iterative algorithm which tries to discover k clusters, where k is defined by the user, so that records within a cluster are similar to each other and distinct from records in other clusters. Different distance measures exist, such as the Euclidean, Manhattan and Mahalanobis distances; in our application we used the Euclidean distance. The TwoStep cluster method is a scalable cluster analysis algorithm designed to handle very large data sets and both continuous and categorical variables or attributes. It requires only one data pass and has two steps: 1) pre-cluster the cases (or records) into many small sub-clusters; 2) cluster the sub-clusters resulting from the pre-cluster step into the desired number of clusters. The TwoStep algorithm uses a hierarchical clustering method in the second step to assess multiple cluster solutions and automatically determine the optimal number of clusters for the input data. To determine the number of clusters automatically, TwoStep uses a two-stage procedure that works well with the hierarchical clustering method: in the first stage, the BIC (Bayesian Information Criterion) for each number of clusters within a specified range is calculated and used to find the initial estimate for the number of clusters.
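Since the C&RT, CHAID, C5.0 and TwoStep components used in this study are proprietary SPSS Clementine implementations, the sketch below is only a rough open-source analogue of the workflow: a Gini-based CART-style tree (similar in spirit to C&RT) and a Euclidean K-means model with k = 5, built with scikit-learn. The arrays X and y are randomly generated placeholders standing in for the trauma features and the binary outcome.

```python
# Rough open-source analogue of the tree-learning and clustering steps: a
# Gini-based CART-style classifier and Euclidean K-means with k = 5.
# X and y are placeholders, not the actual trauma registry data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((200, 4))               # placeholder feature matrix
y = (X[:, 0] < 0.1).astype(int)        # placeholder outcome (1 = death)

tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
tree.fit(X, y)
# A fitted tree can be dumped as nested if-then rules, analogous to Table 5.
print(export_text(tree, feature_names=[f"x{i}" for i in range(X.shape[1])]))

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(kmeans.labels_))
```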
2.2.4. Association Rules

Association rule mining finds interesting associations and/or correlation relationships among large sets of data items. Association rules show attribute-value conditions that occur frequently together in a given data set; a typical and widely used example of association rule mining is market basket analysis. Association rules provide information of this type in the form of if-then statements. These rules are computed from the data and, unlike the if-then rules of logic, association rules are probabilistic in nature. In association analysis the antecedent and consequent are sets of items (called itemsets) that are disjoint (they have no items in common). In addition to the antecedent (the "if" part) and the consequent (the "then" part), an association rule has two numbers that express the degree of uncertainty about the rule. The first number is called the support of the rule. The support is simply the number of transactions that include all items in the antecedent and consequent parts of the rule (it is sometimes expressed as a percentage of the total number of records in the database). The other number is known as the confidence of the rule. Confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent (namely, the support) to the number of transactions that include all items in the antecedent. Clementine uses Christian Borgelt's Apriori implementation. Unfortunately, the Apriori [14] algorithm is not well equipped to handle numeric attributes unless they are discretized during preprocessing. Discretization can lead to a loss of information, so an analyst who has numerical inputs and prefers not to discretize them may choose an alternative method for mining association rules: GRI.

The GRI methodology can handle either categorical or numerical variables as inputs, but still requires categorical variables as outputs. Rather than using frequent itemsets, GRI applies an information-theoretic approach to determine the interestingness of a candidate association rule using the quantitative J measure. GRI uses this measure to calculate how interesting a rule may be, and uses bounds on the possible values the measure may take to constrain the rule search space. Briefly, the J measure maximizes the simplicity/goodness-of-fit trade-off by utilizing an information-theoretic, cross-entropy-based calculation. Once a rule is entered in the table, it is examined to determine whether there is any potential benefit to specializing the rule, that is, adding more conditions to its antecedent. Each specialized rule is evaluated by testing its J value against those of other rules in the table with the same outcome; if its value exceeds the smallest J value among those rules, the specialized rule replaces that minimum-J rule in the table. Whenever a specialized rule is added to the table, it is tested to see whether further specialization is warranted; if so, such specialization is performed, and this process proceeds recursively. The association rules in GRI take the form "If X = x then Y = y", where X and Y are two fields (attributes) and x and y are values for those fields. The advantage of an association rule algorithm over a decision tree algorithm is that associations can exist between any of the attributes. A decision tree algorithm will build rules with only a single conclusion, whereas association algorithms attempt to find many rules, each of which may have a different conclusion. The disadvantage of association algorithms is that they try to find patterns within a potentially very large search space and, hence, can require much more time to run than a decision tree algorithm.
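As an illustration of how these quantities interact, the sketch below computes support, confidence and one common form of the Smyth-Goodman J measure for a single candidate rule from boolean indicator arrays. The arrays and the example rule are hypothetical stand-ins, not fields from the actual trauma data set, and the exact J-measure form used by Clementine's GRI node may differ in detail.

```python
# Support, confidence and one common form of the information-theoretic
# J measure for a single candidate rule "if antecedent then consequent".
# The boolean arrays are hypothetical indicator columns; "GCS below 6
# implies death" is only an illustrative rule.
import numpy as np

def rule_metrics(antecedent: np.ndarray, consequent: np.ndarray):
    """antecedent, consequent: boolean arrays, one element per record."""
    p_x = antecedent.mean()                      # P(antecedent)
    p_y = consequent.mean()                      # P(consequent)
    support = np.mean(antecedent & consequent)   # P(antecedent and consequent)
    confidence = support / p_x                   # P(consequent | antecedent)

    def term(p, q):                              # p * log(p / q), with 0 log 0 = 0
        return p * np.log(p / q) if p > 0 else 0.0

    # J measure: the cross-entropy between the conditional and unconditional
    # outcome distributions, weighted by P(antecedent).
    j_measure = p_x * (term(confidence, p_y) + term(1 - confidence, 1 - p_y))
    return support, confidence, j_measure

# Hypothetical usage with indicator arrays derived from the data set:
# s, c, j = rule_metrics(gcs_below_6, died)
```

In GRI, a candidate rule with a larger J value displaces the weakest rule already held in the table, so a computation of this kind would be repeated as candidate rules are specialized.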
2.2.5. Model Performance

After categorizing the features and inducing outcome prediction models, different statistical measures can be used to estimate the quality of the derived models. In the present study, discrimination and calibration were calculated. The discriminatory power of the model (classification accuracy, CA) measures the proportion of correctly classified test examples, i.e., the ability to correctly classify survivors and nonsurvivors. In addition, models were assessed for performance by calculating Receiver Operating Characteristic (ROC) curves, constructed by plotting the true-positive fraction versus the false-positive fraction, and comparing the areas under the curves. Sensitivity and specificity measure the model's ability to "recognize" the patients of a certain group. If we decide to observe the surviving patients, sensitivity is the probability that a patient who has survived is also classified as surviving, and specificity is the probability that a non-surviving patient is classified as non-surviving. The area under the ROC curve (AUC) is based on a non-parametric statistical sign test and estimates the probability that, for a pair of patients of which one has survived and the other has not, the surviving patient is given a greater probability of survival. This probability was estimated from the test data using relative frequencies. An AUC of 1 implies perfect discrimination, whereas an AUC of 0.5 is equivalent to a random model.

The above metrics and statistics were assessed through stratified ten-fold cross-validation [15]. This technique randomly splits the dataset into 10 subgroups, each containing a similar distribution of the outcome variable, reserving one subgroup (10%) as an independent test sample while the nine remaining subgroups (90%) are combined for use as a learning sample. This cross-validation process continues until each 10% subgroup has been held in reserve one time as a test sample. The results of the 10 mini-test samples are then combined to form error rates for trees of each possible size; these error rates are applied to the tree based on the entire learning sample, yielding reliable estimates of the independent predictive accuracy of the tree. The prediction performance on the test data using cross-validation gives the best estimates of the misclassification rates that would occur if the classification tree were applied to new data, assuming that the new data were drawn from the same distribution as the learning data. Misclassification rates are a reflection of undertriage and overtriage, while correct classification of injured patients according to their need for TC or NTC care reflects sensitivity and specificity, respectively. Of the two misclassification errors, undertriage is more serious because of the potential for preventable deaths, whereas overtriage unnecessarily consumes economic and human resources.

Given a classifier and an instance, there are four possible outcomes. If the instance is positive (P) and it is classified as positive, it is counted as a true positive (TP); if it is classified as negative, it is counted as a false negative (FN). If the instance is negative (N) and it is classified as negative, it is counted as a true negative (TN); if it is classified as positive, it is counted as a false positive (FP).

Given a classifier and a set of instances (the test set), a two-by-two confusion matrix (also called a contingency table) can be constructed representing the dispositions of the set of instances. A confusion matrix contains information about the actual and predicted classifications made by a classification system, and the performance of such systems is commonly evaluated using the data in the matrix. This matrix forms the basis for many common metrics. We present this confusion matrix, and several common metrics that can be calculated from it, for each training, validation and test set in our study. The numbers along the major diagonal represent the correct decisions made, and the numbers off this diagonal represent the errors, the confusion, between the various classes. The common metrics of the classifier, together with the additional terms associated with ROC curves,

Sensitivity = TP/(TP + FN)
Specificity = TN/(FP + TN)
Positive predictive value = TP/(TP + FP)
Negative predictive value = TN/(FN + TN)
Accuracy = (TP + TN)/(TP + FP + FN + TN)

are also calculated for each training, test and validation set.
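The following sketch illustrates the evaluation protocol described above: stratified ten-fold cross-validation with per-fold confusion-matrix metrics and the area under the ROC curve. The classifier and the randomly generated data are placeholders, and the simple fold-wise averaging shown here is a simplification of the tree-size selection procedure performed by the actual software.

```python
# Sketch of the evaluation protocol: stratified ten-fold cross-validation with
# per-fold confusion-matrix metrics and ROC AUC. Data and classifier are
# placeholders, not the trauma registry or the Clementine models.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 6))                       # placeholder feature matrix
y = (rng.random(500) < 0.05).astype(int)       # rare positive ("death") outcome

clf = DecisionTreeClassifier(random_state=0)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
sens, spec, aucs = [], [], []
for train_idx, test_idx in skf.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    prob = clf.predict_proba(X[test_idx])[:, 1]
    tn, fp, fn, tp = confusion_matrix(y[test_idx], pred, labels=[0, 1]).ravel()
    sens.append(tp / (tp + fn))                # sensitivity = TP / (TP + FN)
    spec.append(tn / (tn + fp))                # specificity = TN / (FP + TN)
    aucs.append(roc_auc_score(y[test_idx], prob))

print("mean sensitivity:", np.mean(sens))
print("mean specificity:", np.mean(spec))
print("mean AUC:", np.mean(aucs))
```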
3. RESULTS

Altogether, 8544 patients were recorded, with a 1.5% mortality rate (128 intrahospital deaths). The models were therefore trained with a dataset heavily favouring survivors. For each patient the binary response variable y (death: 1, otherwise: 0) is recorded. There were approximately 780,000 data points (92 covariates, 8544 cases). In order to reduce the dimension of the problem we followed the feature selection procedure to detect the most statistically significant covariates according to Pearson's chi-square. The final data set used for further analysis included all of the 8544 available patients and the 36 selected factors (fields for data mining). The data set was divided randomly into three subsets: the training set, containing 50% of the cases (i.e., 4272), the test set, containing 25% of the cases (2136), and the validation set, with 25% of the cases (2136). After medical advice, all of the factors were treated equally during the data mining approach, meaning that no factor was required to be maintained in every model.

Defining the maximum percentage of records in a single category equal to 90%, the maximum number of categories as a percentage of records equal to 95%, the minimum coefficient of variation equal to 0.1 and the minimum standard deviation equal to 0.0, we removed some factors of low importance. Moreover, applying the Pearson chi-square statistic with respect to the categorical type of the target field and a significance level a = 5%, we finally identified the 36 important variables displayed on C. Koukouvinos' webpage, http://www.math.ntua.gr/~ckoukouv.

There were no clear results from the C&RT algorithm because it could not generate a rule set (conditions too complex). The summary of the C5.0 and CHAID models' predictive ability, measured by the percentage of correctly classified records, is displayed in Table 1. The percentage of records for which the outcome is correctly predicted represents the overall accuracy of the examined method. Tables 2, 3 and 4 display the confusion matrix of each set for the C5.0 algorithm.

The metrics for the training set were: Sensitivity (84.6%), Specificity (98.4%), Positive predictive value (45.8%), Negative predictive value (99.8%), Accuracy (98.9%).
The metrics for the test set were: Sensitivity (72.7%), Specificity (98.9%), Positive predictive value (26.6%), Negative predictive value (99.8%), Accuracy (98.83%). The metrics for the validation set were: Sensitivity (64.28%), Specificity (99.2%), Positive predictive value (36%), Negative predictive value (99.76%), Accuracy (98.97%).

The C5.0 tree may be converted into the set of rules listed in Table 5. Each rule is assigned the major classification of the corresponding node. The ruleset for 0 (life) contains 4 rules and the ruleset for 1 (death) contains 3 rules.

Table 1. C5.0 and CHAID models' predictive ability (correctly classified).

Algorithm | Training set | Test set | Validation set
C5.0      | 98.94%       | 98.84%   | 98.97%
CHAID     | 98.31%       | 98.6%    | 98.79%

Table 2. The confusion matrix for the training set for the C5.0 algorithm.

Training set: Outcome | 0(-) life | 1(+) death
0(-) life             | 4178      | 6
1(+) death            | 39        | 33

Table 3. The confusion matrix for the test set for the C5.0 algorithm.

Test set: Outcome | 0(-) life | 1(+) death
0(-) life         | 2113      | 3
1(+) death        | 22        | 8

Table 4. The confusion matrix for the validation set for the C5.0 algorithm.

Validation set: Outcome | 0(-) life | 1(+) death
0(-) life               | 2111      | 5
1(+) death              | 17        | 9

Table 5. Ruleset for the C5.0 algorithm.

IF x3 <= 8.203 and x27 in [0 1 2 3 4] and x71 = 1 and x9 > 8.755 THEN life
IF x3 <= 8.203 and x27 in [0 1 2 3 4] and x71 in [2 4] THEN life
IF x3 <= 8.203 and x27 = 5 and x26 in [12 25 28 31] THEN life
IF x3 > 8.203 THEN life
IF x3 <= 8.203 and x27 in [0 1 2 3 4] and x71 = 1 and x9 <= 8.755 THEN death
IF x3 <= 8.203 and x27 in [0 1 2 3 4] and x71 in [3 6] THEN death
IF x3 <= 8.203 and x27 = 5 and x26 in [15 24] THEN death

The CHAID tree may also be converted into the set of rules listed in Table 6. Each rule is assigned the major classification of the corresponding node. The ruleset for 0 (life) contains 14 rules.

Finally, we evaluated the performance of the aforementioned classification algorithms by means of the ROC curve methodology, complemented by determination of the areas under the curves, as presented in Table 7. From these results we observe that the CHAID algorithm has the largest AUC (0.888), which indicates excellent performance and a very good ability to discriminate the patient's outcome (life or death). The C5.0 algorithm has the largest overall accuracy and a satisfactory AUC (0.709), which indicates good performance and a satisfactory ability to discriminate the patient's outcome. The C&RT algorithm has AUC = 0.5, which indicates random performance and no useful ability to discriminate patients with and without the condition. It is therefore natural not to trust the results of the C&RT algorithm, as expected from our previous attempt to build a decision tree and a ruleset for C&RT, where Clementine produced no clear results because it could not generate a rule set (conditions too complex). Generally, the classification algorithms were successful on the trauma data set; the classification accuracy was especially high, reaching 99% correct classifications. Figure 1 presents the evaluation of the test set for all the classification algorithms by means of ROC curves.

Table 6. Ruleset for the CHAID algorithm.
IF x71 = 1 and x3 <= 13.081 THEN life
IF x71 = 1 and x3 > 13.081 THEN life
IF x71 = 2 and x19 <= 3 and x50 = 0 THEN life
IF x71 = 2 and x19 <= 3 and x50 = 1 THEN life
IF x71 = 2 and x19 > 3 and x19 <= 3.945 and x50 = 0 THEN life
IF x71 = 2 and x19 > 3 and x19 <= 3.945 and x50 = 1 THEN life
IF x71 = 2 and x19 > 3.945 and x19 <= 4 THEN life
IF x71 = 2 and x19 > 4 and x19 <= 4.888 and x11 <= 17827.408 THEN life
IF x71 = 2 and x19 > 4 and x19 <= 4.888 and x11 > 17827.408 THEN life
IF x71 = 2 and x19 > 4.888 THEN life
IF x71 = 3 or x71 = 4 THEN life
IF x71 = 6 and x3 <= 11.447 THEN life
IF x71 = 6 and x3 > 11.447 and x28 = 0 THEN life
IF x71 = 6 and x3 > 11.447 and x28 = 1 THEN life

For clustering we specified a minimum of 2 and a maximum of 15 clusters, and obtained a detailed clustering of the records (distributions of variables with a percentage > 85% for their values). The clusters were obtained automatically from the TwoStep algorithm, using only the most significant fields of the trauma data set as derived from the feature selection algorithm. The number of clusters was 5, grouping 336, 914, 1227, 1050 and 729 records respectively. For the clustering analysis we additionally performed the K-means algorithm, specifying 5 clusters so that records within a cluster are similar to each other and distinct from records in other clusters; here we also obtained a detailed clustering of the records. Five clusters were obtained from the K-means algorithm, again using only the most significant fields of the trauma data set as derived from the feature selection algorithm; the clusters contained 1280, 1010, 379, 935 and 652 records respectively. Hence, we achieved the first goal of clustering, that is, the decomposition of the data set into categories of similar data. Clusters are defined by their centers, where a cluster center is a vector of values for the input fields. The clustering results are reported as follows: for discrete fields, the mean value over the training records assigned to each cluster is presented; for continuous fields we present only the major value of the variable, i.e., the major percentage belonging to each cluster. Both algorithms gave identical rules.

Table 7. Performance of the classification algorithms.

Algorithm | Overall accuracy | AUC
CHAID     | 98.602           | 0.888
C5.0      | 98.835           | 0.709
C&RT      | 98.602           | 0.500

Figure 1. Evaluation of the test set for the CHAID, C5.0 and C&RT algorithms (ROC curves).

The TwoStep algorithm created the most populous cluster, containing 1227 cases, whose members were middle aged (50 years old on average (o.a.)), weighed 73.2 kg (o.a.), had white cell count 10537 (o.a.), glucose level 125 (o.a.), creatinine 1.04 (o.a.), urea 40 (o.a.), good evaluation of disability (4.3 o.a.), were not severely injured (Injury Severity Score mean = 6.2), had high GCS (14.76 o.a.), 84 pulses (o.a.), systolic arterial pressure 130.5 (o.a.), diastolic arterial pressure 77.5 (o.a.), Ht 40 (o.a.) and Hb 13.4 (o.a.). Additionally, these patients were not pale (96.4%), did not have ephidrosis (97.3%), were hydrated with fluids (88.8%), had undergone radiography and CT (92.1% and 89.5% respectively), and were admitted to a hospital clinic after the Emergency Room treatment (90.6%).

Using association rules, we performed the Generalized Rule Induction (GRI) algorithm in order to summarize patterns in the data using a quantitative measure of the interestingness of rules.
The consequent (the "then" part of the rule) is restricted to a single value assignment expression (y = 1, death), while the antecedent (the "if" part of the rule) may be a conjunction of expressions over only the most significant fields of the trauma data set, as derived from the feature selection algorithm. Each rule in the final ruleset has an associated support and confidence, based on the number of records for which the antecedent and the entire rule are true. Defining the minimum antecedent support equal to 0%, the minimum rule confidence equal to 50%, the maximum number of antecedents equal to 3, the maximum number of rules equal to 100 and choosing only true values for flags resulted in a set of rules with consequent y = 1 (death) only. Four association rules were obtained from the GRI algorithm; this set of association rules is presented in Table 8. According to the results derived from the GRI association rules, mortality is predicted with the highest support (1.03%) and confidence of 60% when GCS (x3) is smaller than 6. Moreover, the model suggests with the highest confidence that people with a GCS below the cutpoint of 6 are predicted to die even though they are transferred to hospital by ambulance and do not suffer from a lower limb injury (Support: 0.96%, Confidence: 63.41%).

Table 8. Ruleset for the GRI algorithm.

IF x3 < 5.97405 and x25 = 1 and x108 = 0 THEN death
IF x3 < 5.97405 and x108 = 0 THEN death
IF x3 < 5.97405 and x25 = 1 THEN death
IF x3 < 5.97405 THEN death

4. CONCLUSIONS

In conclusion, the selection of the most important factors determining the outcome of injured patients is critical, particularly when the problem is high dimensional. Therefore, in order to extract the required information, it is imperative to use expertise and cutting-edge statistical methods that meet these needs. Data mining can be considered an in-depth search for information previously unseen in large collections of data, and it has recently been applied to medical data [16], often yielding useful information about patterns. In our study the results were encouraging, because the implemented algorithms generated useful rules that are logical, consistent with medical experience, and provide more specific information which may serve as guidelines for trauma management. Specifically, we found that the CHAID and C5.0 algorithms offer extensive knowledge of the classification of injuries, including combinations of features that lead to death or to a good outcome. The K-means and TwoStep algorithms also group casualties with common features, and the grouping is particularly interesting in the latter case, where the groups are not determined by the analyst. The comparison of data mining methods in terms of sensitivity, specificity, positive predictive value and negative predictive value confirmed that the extraction of knowledge from a medical database such as this one may help to detect factors, or combinations of factors, that can reliably predict the outcome of trauma patients.

REFERENCES

[1] The Trauma Audit and Research Network. http://www.tarn.ac.uk/introduction/firstDecade.pdf
[2] Meyer, A. (1998) Death and disability from injury: A global challenge. Journal of Trauma, 44(1), 1-12.
[3] World Health Organization. http://www.who.int/en/
[4] The Trauma Audit and Research Network. http://www.tarn.ac.uk/content/downloads/36/firstdecade.pdf
[5] Baker, S.P., O'Neill, B., Haddon, W. and Long, W.B.
(1974) The injury severity score: A method for describing patients with multiple injuries and evaluating emergency care. Journal of Trauma, 14(3), 187-196.
[6] Copes, W.S., Sacco, W.J., Champion, H.R. and Bain, L.W. (1990) Progress in characterising anatomic injury. Proceedings of the 33rd Annual Meeting of the Association for the Advancement of Automotive Medicine, Baltimore, 2-4 October 1989, 205-218.
[7] Teasdale, G. and Jennett, B. (1974) Assessment of coma and impaired consciousness: A practical scale. Lancet, 2(7872), 81-84.
[8] Penny, K. and Chesney, T. (2006) Imputation methods to deal with missing values when data mining trauma injury data. Proceedings of the 28th International Conference on Information Technology Interfaces, Cavtat, 19-22 June 2006, 213-218.
[9] Donders, A.R., Van der Heijden, G.J., Stijnen, T. and Moons, K.G. (2006) Review: A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology, 59(10), 1087-1091.
[10] Cox, D.R. and Hinkley, D.V. (1974) Theoretical statistics. Chapman and Hall, London.
[11] Cramer, H. (1946) Mathematical methods of statistics. Princeton University Press, Princeton.
[12] Dobson, A. (2002) An introduction to generalized linear models. 2nd Edition, Chapman and Hall/CRC, London.
[13] Pearson, R.L. (1983) Karl Pearson and the chi-squared test. International Statistical Review, 51, 59-72.
[14] Agrawal, R. and Srikant, R. (1994) Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Databases, Santiago de Chile, 12-15 September 1994, 479-499.
[15] Craven, P. and Wahba, G. (1979) Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31, 377-403.
[16] Breault, J.L., Goodall, C.R. and Fos, P.J. (2002) Data mining a diabetic data warehouse. Artificial Intelligence in Medicine, 26(1-2), 37-54.