Predicting treatment outcomes among TB patients is a major concern of the Department of Health. Data mining in the health care system can support decision making. One of the most widely used techniques for data exploration is the decision tree, which is based on a divide-and-conquer approach. The objectives of this article are to create a predictive data mining model of TB patient category that identifies relapse treatment, and to classify the factors influencing relapse treatment, in order to provide assistance, guidance, and appropriate warning to TB patients who are at risk. The dataset of TB patient records was verified and analyzed with the CHAID classification tree algorithm using SPSS Statistics 17.0. The classification tree model identified two statistically significant independent variables (DSSM Result and Age) as predictors of patient category.
Tuberculosis (TB) is a foremost community health problem in the Philippines and remains a major cause of death; the country is among the nations with a high TB incidence. “Philippines ranked ninth among the 22 high TB burdened countries” [
In TB treatment, one major problem is ensuring that patients continue their treatment, including medication and medical checkups, until completion. Hence, there is a need to improve adherence and retention in care. There may be several reasons for the lack of persistence, and completion of treatment programs may be improved by maintaining better contact between health workers and TB patients. Treatment results serve as proxy measures of the quality of tuberculosis treatment provided by the health care system, and they are essential for assessing the effectiveness of the Directly Observed Therapy-Short course program in controlling the disease and reducing treatment failure, default, and death [
In this study, the Chi-Square Automatic Interaction Detection (CHAID) decision tree algorithm is employed to predict the patient category relapse in Cabanatuan City, Philippines. “CHAID applications focused on the field of medical and psychiatric research although it can be employed also in researches of different fields. The technique was developed in South Africa and was published in 1980 by Gordon V. Kass, who had completed a PhD thesis on this topic” [
Other studies use CHAID to: explore the adverse effects of social networking sites on students’ academic performance in secondary schools [
The reasons for selecting decision trees as the basis of this study are as follows: 1) the CHAID decision tree model is understandable, easy to use, and interpretable; and 2) it is fast to build relative to other predictive methods, which makes it appropriate and flexible for future changes in the data because of its low training time [
This research aims to create a predictive data mining model for the relapse category of TB using Integrated Tuberculosis Information System (ITIS) data on reported cases, and to find the variables influencing relapse treatment of TB.
The study uses a quantitative research design that applies statistical methods to quantify and analyze the data and to generalize results from a sample population. It was conducted in Cabanatuan City, Nueva Ecija, Philippines. The data came from the TB patient records of Cabanatuan City, which were extracted in 2017 from the Integrated Tuberculosis Information System (ITIS) database of the Cabanatuan City Health Office. Extraction from this official database supports the correctness and comparability of the data, which are important for the CHAID model. The collected data were coded and examined using the Statistical Package for Social Sciences (SPSS Statistics 17.0). The study protocol for data collection and interviews was approved by the Office of the City Mayor and the director of the City Health Office.
The study used data mining as a tool with CHAID classification tree as a technique to design the TB patient category relapse prediction model. According to [
The CHAID (Chi-Square Automatic Interaction Detection) algorithm is one of the most prevalent statistically based supervised learning methods for decision tree development. It was proposed by the statistician Kass in the late 1970s. CHAID is an automatic, iterative technique of tree development based on Pearson’s Chi-square statistic and the corresponding p-value. The CHAID analysis builds a predictive model that helps determine how variables are best combined to explain the outcome in the specified dependent variable.
To form a decision tree with the CHAID algorithm, the variables were first described as follows. The variable Patient Category is defined as the dependent variable. Because Patient Category is a nominal variable with two values (non-relapse, relapse), the model can be built on the Chi-square splitting criterion.
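The splitting statistic mentioned above can be illustrated with a minimal sketch of Pearson's Chi-square for a contingency table, written in plain Python. The function names and the toy counts are assumptions made for illustration; they are not taken from the study or from SPSS. The sketch covers only the statistic and a one-degree-of-freedom p-value, not the category merging or p-value adjustment that a full CHAID implementation performs.

```python
import math


def pearson_chi_square(table):
    """Return (chi2, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df


def p_value_df1(chi2):
    """Upper-tail p-value of a Chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical 2x2 table (rows: predictor categories, columns: outcome classes).
# These counts are illustrative only and do not come from the TB dataset.
toy = [[30, 10], [10, 30]]
chi2, df = pearson_chi_square(toy)
print(chi2, df, p_value_df1(chi2))
```

A candidate predictor whose table yields a small p-value is a better splitting variable than one with a large p-value, which is the sense in which the tree selects its most significant variable at each level.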
As observed on
The most significant independent variable in
Most of the respondents (231) fall into node 1, where the value of DSSM Result is “2+” or “ODT”. Node 2 contains 87 respondents whose DSSM Result is “1+”. Node 3 has 130 respondents whose DSSM Result is “0”, and the remaining respondents (62) belong to node 4, where DSSM Result is “3+”. Within the first level of the tree, node 3 is a parent node, while nodes 1, 2, and 4 are terminal.
The second level of the tree is defined by the variable Age, which is statistically significant for splitting node 3 (Chi-square = 144.870, df = 1, p-value = 0.001). Accordingly, two groups of respondents are found: TB patients aged 51 or younger (≤51) belong to node 5, while those older than 51 (>51) belong to node 6. All nodes in the final level of the decision tree are terminal.
The tree has five terminal nodes: nodes 1, 2, 4, and 5 correspond to Non-Relapse, whereas node 6 corresponds to Relapse. The paths from the root to the terminal nodes produce a set of rules for classifying TB patients into one of the defined categories of the variable Patient Category. The model and the knowledge described in the decision tree can therefore be formulated as if-then rules.
| Variable Name | Value (modalities) | Structure: fi | Structure: % | Type of Variable: MS | Type of Variable: IV/DV |
|---|---|---|---|---|---|
| Age | ≤51 | 404 | 79 | SV | IV |
| | >51 | 106 | 21 | | |
| Sex | Male | 343 | 67 | NV | IV |
| | Female | 167 | 33 | | |
| BacStatus (Bacteriological Status) | Bacteriologically-confirmed TB | 258 | 51 | NV | IV |
| | Clinically-diagnosed TB | 252 | 49 | | |
| DSSM Result (Direct Sputum Smear Microscopy) | Other Diagnostic Test (ODT) | 174 | 34 | NV | IV |
| | 0 | 130 | 26 | | |
| | 1+ | 87 | 17 | | |
| | 2+ | 57 | 11 | | |
| | 3+ | 62 | 12 | | |
| Classification | Pulmonary | 507 | 99 | NV | IV |
| | Extra-Pulmonary | 3 | 1 | | |
| Patient Category | Non-Relapse | 455 | 89 | NV | DV |
| | Relapse | 55 | 11 | | |
Legend: MS is Measurement Scales; NV is Nominal Variable; SV is Scale Variable; fi is frequency; % is Percentage; IV is Independent Variable; DV is Dependent Variable.
IF (DSSMResult = "2+") OR (DSSMResult = "ODT") THEN
    Node = 1
    Prediction = 1
    Probability = 1.000

IF (DSSMResult = "1+") THEN
    Node = 2
    Prediction = 1
    Probability = 0.977

IF (DSSMResult = "3+") THEN
    Node = 4
    Prediction = 1
    Probability = 0.871

IF (DSSMResult = "0") AND (AGE <= 51) THEN
    Node = 5
    Prediction = 1
    Probability = 0.789

IF (DSSMResult = "0") AND (AGE > 51) THEN
    Node = 6
    Prediction = 2
    Probability = 0.537
Based on the rule set of the CHAID algorithm with Patient Category as the dependent variable, node 1 predicts category 1 (Non-Relapse) with a probability of 1.000; node 2 predicts Non-Relapse with a probability of 0.977; node 4 predicts Non-Relapse with a probability of 0.871; node 5 predicts Non-Relapse with a probability of 0.789; and node 6 predicts category 2 (Relapse) with a probability of 0.537. The CHAID method shows that DSSM Result is the best predictor of Patient Category. In particular, a DSSM Result of “2+” or “ODT” (node 1) is the strongest predictor, with a 100% probability of Non-Relapse.
At the next level of the tree, DSSM Result and Age together are the next best predictors: if DSSM Result is “0” and Age is less than or equal to 51, the category is Non-Relapse with a probability of 78.9%. Finally, if DSSM Result is “0” and Age is greater than 51, the category is Relapse with a probability of 53.7%; this is also a terminal node.
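The rule set can also be expressed as a small function, which may help when applying the rules to individual records. The following is a minimal sketch in Python; the function name `classify`, the tuple it returns, and the error raised for unknown values are assumptions made for illustration, while the node numbers, categories, and probabilities are copied from the rules above.

```python
def classify(dssm_result, age):
    """Return (node, predicted category, probability) for one patient record.

    Mirrors the five if-then rules of the tree; `dssm_result` is one of
    "ODT", "0", "1+", "2+", "3+" and `age` is the age in years.
    """
    if dssm_result in ("2+", "ODT"):
        return 1, "Non-Relapse", 1.000
    if dssm_result == "1+":
        return 2, "Non-Relapse", 0.977
    if dssm_result == "3+":
        return 4, "Non-Relapse", 0.871
    if dssm_result == "0":
        # Node 3 is the only parent node; it is split on Age at 51
        if age <= 51:
            return 5, "Non-Relapse", 0.789
        return 6, "Relapse", 0.537
    # Hypothetical guard: the published rules do not cover other values
    raise ValueError(f"Unknown DSSM Result: {dssm_result!r}")


print(classify("0", 60))    # node 6, Relapse
print(classify("ODT", 60))  # node 1, Non-Relapse
```

Only the combination of a DSSM Result of “0” and an age above 51 leads to a Relapse prediction; every other path ends in Non-Relapse.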
Prediction risk was presented in
| Risk estimate: Re-substitution | Risk estimate: Cross-validation |
|---|---|
| 0.100 | 0.100 |

| Observed Patient Category | Predicted: Non-Relapse | Predicted: Relapse | Percent Correct |
|---|---|---|---|
| Non-Relapse | 430 | 25 | 94.50% |
| Relapse | 26 | 29 | 52.70% |
| Overall Percentage | 89.40% | 10.60% | 90.00% |
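The percentages in the classification matrix and the risk estimate can be recomputed from the four cell counts. The following Python sketch uses the counts reported in the matrix; the variable names are assumptions made for illustration, and the risk is taken here simply as the misclassified share of all cases, which is how the re-substitution estimate is read in this sketch.

```python
# Cell counts keyed by (observed category, predicted category),
# copied from the classification matrix
confusion = {
    ("Non-Relapse", "Non-Relapse"): 430,
    ("Non-Relapse", "Relapse"): 25,
    ("Relapse", "Non-Relapse"): 26,
    ("Relapse", "Relapse"): 29,
}

total = sum(confusion.values())
correct = sum(n for (obs, pred), n in confusion.items() if obs == pred)

accuracy = correct / total  # overall percent correct
risk = 1 - accuracy         # share of misclassified cases

# Percent correct per observed category (row-wise)
percent_correct = {}
for category in ("Non-Relapse", "Relapse"):
    row_total = sum(n for (obs, _), n in confusion.items() if obs == category)
    percent_correct[category] = confusion[(category, category)] / row_total

# Share of Relapse predictions that were actually Relapse (column-wise)
predicted_relapse = sum(n for (_, pred), n in confusion.items() if pred == "Relapse")
relapse_precision = confusion[("Relapse", "Relapse")] / predicted_relapse

print(f"accuracy={accuracy:.3f} risk={risk:.3f}")
print({k: round(v * 100, 1) for k, v in percent_correct.items()})
print(f"relapse precision={relapse_precision:.3f}")
```

The column-wise share for Relapse comes out at 0.537, which agrees with the probability reported for node 6, the only node that predicts Relapse. The row-wise figure of 52.7% shows that the model identifies roughly half of the observed relapse cases, even though overall accuracy is 90%.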
relations of TB patient category of 10%, and the risk of 10% in cross-validation for the test sample used.
Classification matrix was presented in
In this study, the CHAID classification tree technique was applied to a dataset of 510 TB patients to predict and analyze the relapse patient category. The CHAID prediction model was convenient and useful for evaluating the relationships among the variables used to predict relapse in the TB treatment category. The model was developed from TB patient input variables gathered from the ITIS database of the city health office. The variables DSSM Result and Age are the strongest indicators for predicting relapse. From the classification matrix, the overall accuracy of the model is 90%, with a prediction risk of only 10%.
As future work, the author plans to build models on a three-year dataset to attain more precise results, and to apply additional techniques to the dataset.
The author is thankful for the support offered to him by the Provincial Health Office of Nueva Ecija and Cabanatuan City Health Office.
The author declares no conflicts of interest regarding the publication of this paper.
Dela Cruz, A.P. (2018) Predicting the Relapse Category in Patients with Tuberculosis: A Chi-Square Automatic Interaction Detector (CHAID) Decision Tree Analysis. Open Journal of Social Sciences, 6, 29-36. https://doi.org/10.4236/jss.2018.612003