Estimation of Area under Receiver Operating Characteristic Curve for Bi-Pareto and Bi-Two Parameter Exponential Models

Open Journal of Statistics
Vol.4 No.1(2014), Article ID:42571,10 pages DOI:10.4236/ojs.2014.41001

Bhavna Kaushal, Kanchan Jain, Suresh K. Sharma

Department of Statistics, Panjab University, Chandigarh, India

Email: kaushal.bhavna@gmail.com, jaink14@gmail.com, ssharma643@yahoo.co.in

Received November 13, 2013; revised December 13, 2013; accepted December 20, 2013

ABSTRACT

In this paper, we find the ROC curves for Bi-Pareto and Bi-two parameter exponential distributions. Theoretical, parametric and non-parametric values of area under receiver operating characteristic (AUROC) curve for different parametric combinations have been calculated using simulations. These values are compared in terms of root mean square and mean absolute errors. The results are demonstrated for two real data sets.

Keywords: ROC; AUROC; Pareto; Two Parameter Exponential; Root Mean Square Error; Mean Absolute Error

1. Introduction

Receiver operating characteristic (ROC) curves have become the standard tool for evaluating the discriminatory power of medical diagnostic tests and are commonly used in assessing the predictive ability of binary regression models. The ROC curve is a diagnostic tool that helps in determining the accuracy of a test conducted on a person to know whether a particular disease is present or not. In a typical setting, one has a binary indicator and a set of predictors or marker values. The goal is to see how well the marker values predict the binary indicator. The principal idea is to dichotomize the marker at various thresholds and compute the resulting sensitivity and specificity. Sensitivity of a test is defined as the probability of a positive test result when the disease is present, and specificity is the probability of a negative test result when the disease is absent. Sensitivity is also known as the True Positive Rate (TPR) and specificity as the True Negative Rate (TNR). The False Positive Rate (FPR) equals (1-specificity). The ROC curve is obtained by plotting the sensitivity versus (1-specificity).
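The thresholding idea can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name `roc_points` and the toy marker values are hypothetical, and the sketch assumes that larger marker values indicate disease, so a subject tests positive when the marker exceeds the threshold.

```python
def roc_points(healthy, diseased):
    """Return (FPR, TPR) pairs obtained by dichotomizing the marker.

    Assumption: a subject tests positive when its marker value exceeds
    the threshold c (larger values suggest disease).
    """
    # Start below every observed value, where every subject tests positive.
    points = [(1.0, 1.0)]
    for c in sorted(set(healthy) | set(diseased)):
        tpr = sum(x > c for x in diseased) / len(diseased)  # sensitivity
        fpr = sum(x > c for x in healthy) / len(healthy)    # 1 - specificity
        points.append((fpr, tpr))
    return points


# Hypothetical marker values, for illustration only.
healthy = [1.0, 2.0, 3.0]
diseased = [2.5, 3.5, 4.0]
for fpr, tpr in roc_points(healthy, diseased):
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

Plotting the returned pairs with FPR on the horizontal axis and TPR on the vertical axis gives the empirical ROC curve.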

In credit rating models in finance, sensitivity is termed the “Hit Rate” (HR) whereas (1-specificity) is known as the “False Alarm Rate” (FAR). If the rating score of a debtor is lower than a cut-off value C, the debtor is treated as a defaulter; otherwise, the debtor is treated as a non-defaulter.

Hence and ROC curve plots HR versus FAR . For detailed discussion on ROC curves, one can refer to . If F and G are the cumulative distribution functions (cdfs) for two populations N and P, then the ROC curve has the form .

The area under ROC curve (AUROC) is a widely used summary index [3-6]. It is the average TPR taken uniformly over all FPRs on (0, 1) and written as $\mathrm{AUROC} = \int_0^1 \mathrm{ROC}(t)\,dt$, (1)

where ROC(t) denotes the TPR attained at FPR t.

For credit rating models, the area under ROC curve is

• 0.5 if the model does not have discriminative quality;

• between 0.5 and 1.0 for a reasonable model;

• 1.0 if the model is perfect.
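As a quick numerical check of these three cases, the sketch below approximates the area under a given ROC curve with a midpoint rule. The helper `auroc` and the example curves are assumptions chosen for illustration; the curve t^(1/2) merely stands in for "a reasonable model".

```python
def auroc(roc, steps=100_000):
    """Midpoint-rule approximation of the area under roc(t) for t in (0, 1)."""
    h = 1.0 / steps
    return h * sum(roc((i + 0.5) * h) for i in range(steps))


print(round(auroc(lambda t: t), 4))         # diagonal: no discriminative quality
print(round(auroc(lambda t: t ** 0.5), 4))  # concave curve above the diagonal
print(round(auroc(lambda t: 1.0), 4))       # perfect model
```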

There are many methods, parametric as well as non-parametric, to find the AUROC. Parametric methods are used when the statistical distribution of test values is known in diseased and non-diseased groups. The most common ROC curve model is the Binormal model which assumes that both diseased and healthy test values follow normal distribution. In some situations, the assumption of normality is violated. In case sample sizes are small, this model cannot be adopted. So, we consider Bi-Pareto and Bi-Two parameter exponential models and study the areas under ROC curve.

In Section 2, we derive the expressions for AUROC of Bi-Pareto and Bi-two parameter exponential distributions. In Section 3, we carry out simulations for various combinations of parameters and calculate the values of AUROC using both parametric and non-parametric approaches. Section 4 includes two real life applications for finding ROC curves and areas under them. Conclusions are given in Section 5.

2. Area under ROC for Bi-Pareto and Bi-Two Parameter Exponential Models

In this section, we derive the parametrical forms of ROC by assuming that the two populations labelled as N and P follow some particular distributions. The distributions under consideration are Pareto and two parameter exponential distributions.

2.1. ROC for Bi-Pareto Model

It is assumed that population N follows Pareto distribution with parameters and and population P follows Pareto distribution with parameters and . Hence, the cdf of population N is (2)

and the cdf of population P is (3)

If we write , then using (2) Hence using (3), the ROC curve has the form Therefore using (1), the area under ROC curve is (4)

We estimate the AUROC using the maximum likelihood estimators of parameters of Pareto distribution given by and where Xi’s are the sample observations and GM is the geometric mean of the observations .
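A minimal sketch of the maximum likelihood step, assuming the standard Pareto (type I) parameterization with cdf 1 - (k/x)^a for x ≥ k. The symbol names k (scale) and a (shape), the helper names and the parameter values are assumptions made for illustration. Under this parameterization the MLE of k is the smallest observation and the MLE of a is 1/ln(GM/k̂), where GM is the geometric mean.

```python
import math
import random


def pareto_mle(sample):
    """MLEs for a Pareto(k, a) sample with cdf 1 - (k / x)**a, x >= k.

    The names k (scale) and a (shape) are assumed for this sketch.
    """
    k_hat = min(sample)  # smallest observation
    # Logarithm of the geometric mean of the observations.
    log_gm = sum(math.log(x) for x in sample) / len(sample)
    a_hat = 1.0 / (log_gm - math.log(k_hat))  # 1 / ln(GM / k_hat)
    return k_hat, a_hat


def pareto_sample(k, a, size, rng):
    """Inverse-transform sampling: x = k * (1 - u)**(-1 / a), u uniform on [0, 1)."""
    return [k * (1.0 - rng.random()) ** (-1.0 / a) for _ in range(size)]


rng = random.Random(0)
k_hat, a_hat = pareto_mle(pareto_sample(k=2.0, a=3.0, size=5000, rng=rng))
print(k_hat, a_hat)
```

With a few thousand observations the estimates should land close to the hypothetical values k = 2 and a = 3 used to generate the data.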

In particular, if , then Solving the above integral using Mathematica, we get the theoretical value of AUROC as 0.921433.

2.2. ROC for Bi-Two Parameter Exponential Model

It is assumed that populations N and P follow two parameter exponential distributions with parameters , and , respectively. Then the cdf for population N is (5)

and the cdf for population P is (6)

Writing , and using (5), we get .

Using (6), we get the ROC curve as This gives the area under ROC curve as (7)

We estimate the AUROC using the maximum likelihood estimators of µ and θ, parameters of two parameter exponential distribution given by and where s are the sample observations .
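The estimation step can be sketched as follows, assuming that µ is a location parameter and θ a scale parameter, i.e. the cdf is 1 - exp(-(x - µ)/θ) for x ≥ µ. That reading of µ and θ, the helper names and the parameter values are assumptions of this sketch. Under it, the MLE of µ is the smallest observation and the MLE of θ is the sample mean minus the smallest observation.

```python
import random


def two_param_exp_mle(sample):
    """MLEs for a two-parameter exponential sample (mu: location, theta: scale)."""
    mu_hat = min(sample)                             # smallest observation
    theta_hat = sum(sample) / len(sample) - mu_hat   # mean minus minimum
    return mu_hat, theta_hat


def two_param_exp_sample(mu, theta, size, rng):
    """Shift and scale standard exponential draws: x = mu + theta * E."""
    return [mu + theta * rng.expovariate(1.0) for _ in range(size)]


rng = random.Random(0)
mu_hat, theta_hat = two_param_exp_mle(two_param_exp_sample(1.5, 2.0, 5000, rng))
print(mu_hat, theta_hat)
```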

In particular, if , then .

The theoretical value of AUROC is obtained as 0.815805 by solving the above integral using Mathematica.

In the following section, parametric and non-parametric estimates of AUROC are calculated by carrying out simulations.

3. Simulations

The theoretical AUROC values are calculated by assuming Pareto and two-parameter exponential distributions for both populations N and P. Samples are generated from the assumed distributions by choosing different values of the parameters. We obtain the parametric estimates of AUROC by substituting the MLEs of the parameters into the AUROC formulas given in (4) and (7) for the Bi-Pareto and Bi-two parameter exponential models respectively. The non-parametric estimates of AUROC are obtained using the Mann-Whitney U statistic.
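The non-parametric estimate can be sketched as the Mann-Whitney U statistic divided by the number of pairs. The function name and the toy data are hypothetical, and the sketch assumes that larger values are associated with population P; ties are counted as one half.

```python
def auroc_mann_whitney(neg, pos):
    """Return U / (n * m): the share of (neg, pos) pairs with pos > neg.

    Tied pairs contribute 1/2, the usual convention for the U statistic.
    """
    u = 0.0
    for y in pos:
        for x in neg:
            if y > x:
                u += 1.0
            elif y == x:
                u += 0.5
    return u / (len(neg) * len(pos))


# Hypothetical samples from populations N and P.
print(auroc_mann_whitney([1, 2, 3], [2, 3, 4]))
```

The double loop costs n × m comparisons, which is negligible for the sample sizes considered here; a rank-based implementation would be preferable for large samples.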

We use 1000 replications for sample sizes 25, 50 and 100 for each distribution. The parameters are estimated by maximum likelihood for each replication and substituted back into the AUROC formula. The error is defined as the difference between the sample-based estimate and the theoretical AUROC value. Here n and m denote the sample sizes from the two populations. Various parametric combinations are taken, and the theoretical as well as simulated areas under the ROC curve are computed using Mathematica and R. The theoretical AUROC (TAUROC), the AUROC estimates and their root mean square errors (RMSEs) and mean absolute errors (MAEs) have been computed using the parametric and non-parametric approaches. The results are presented in Table 1 for the Bi-Pareto model and Table 2 for the Bi-two parameter exponential model.
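The simulation design can be sketched for the Bi-two parameter exponential case as below. This is an illustrative Python re-creation and not the authors' Mathematica/R code. It assumes µ is a location and θ a scale parameter, and it takes the theoretical AUROC to be P(X_P > X_N) for independent draws, whose closed form is coded in `auroc_formula`. That closed form is a stand-in and may be parameterized differently from expression (7). All function names and parameter values are hypothetical.

```python
import math
import random


def sample(mu, theta, size, rng):
    """Draw from a two-parameter exponential (location mu, scale theta)."""
    return [mu + theta * rng.expovariate(1.0) for _ in range(size)]


def mle(xs):
    """MLEs under the assumed parameterization: (minimum, mean - minimum)."""
    mu_hat = min(xs)
    return mu_hat, sum(xs) / len(xs) - mu_hat


def auroc_formula(mu_n, th_n, mu_p, th_p):
    """P(X_P > X_N) for independent two-parameter exponentials (assumed target)."""
    d = mu_p - mu_n
    if d >= 0:
        return 1.0 - th_n / (th_n + th_p) * math.exp(-d / th_n)
    return th_p / (th_n + th_p) * math.exp(d / th_p)


def auroc_mw(neg, pos):
    """Mann-Whitney estimate: share of pairs with pos > neg, ties count 1/2."""
    wins = sum((y > x) + 0.5 * (y == x) for y in pos for x in neg)
    return wins / (len(neg) * len(pos))


def simulate(mu_n, th_n, mu_p, th_p, n, m, reps, seed=0):
    """Return the theoretical AUROC and (RMSE, MAE) for both estimators."""
    rng = random.Random(seed)
    truth = auroc_formula(mu_n, th_n, mu_p, th_p)
    errors = {"parametric": [], "non-parametric": []}
    for _ in range(reps):
        neg = sample(mu_n, th_n, n, rng)
        pos = sample(mu_p, th_p, m, rng)
        # Parametric: plug the MLEs into the closed form.
        errors["parametric"].append(auroc_formula(*mle(neg), *mle(pos)) - truth)
        # Non-parametric: Mann-Whitney estimate on the same samples.
        errors["non-parametric"].append(auroc_mw(neg, pos) - truth)
    summary = {}
    for name, errs in errors.items():
        rmse = math.sqrt(sum(e * e for e in errs) / reps)
        mae = sum(abs(e) for e in errs) / reps
        summary[name] = (rmse, mae)
    return truth, summary


truth, summary = simulate(0.0, 1.0, 0.5, 2.0, n=25, m=25, reps=1000)
print("theoretical AUROC:", round(truth, 6))
for name, (rmse, mae) in summary.items():
    print(name, "RMSE:", round(rmse, 4), "MAE:", round(mae, 4))
```

Both estimators are evaluated on the same simulated samples in each replication, so the RMSE and MAE comparison is not affected by sampling differences between the two approaches.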

It is evident from Tables 1 and 2 that

• the root mean square error and mean absolute error for the parametric approach are less than those for the non-parametric approach. Therefore, one can estimate the area under ROC curve more accurately by the parametric approach than by the non-parametric approach in case of Bi-Pareto and Bi-two parameter exponential models.

Table 1. AUROC, Mean absolute error and Root mean square error for Bi-Pareto model.

Table 2. AUROC, Mean absolute error and Root mean square error for Bi-two parameter exponential model.

In the following discussion, we present two real life examples where the two groups in the data sets fit well to Pareto and two-parameter exponential distributions. The theoretical value of AUROC, RMSE and MAE have been obtained for both models.

4. Applications to Real Life Data Sets

4.1. Bi-Pareto Model

The data shown in Table 3 consist of 50 patients with advanced acute myelogenous leukemia reported to the International Bone Marrow Transplant Registry. 28 of these patients had received an autologous (auto) bone marrow transplant in which, after high doses of chemotherapy, their own marrow was reinfused to replace their destroyed immune system. 22 patients had an allogeneic (allo) bone marrow transplant where marrow from an HLA (Histocompatibility Leukocyte Antigen) matched sibling was used to replenish their immune systems.

Table 3. Leukemia-free survival times (in months) for Autologous and Allogeneic Transplants.

By using the EasyFit software, it is seen that the data in both groups fit well to the Pareto distribution. The p-values for Kolmogorov-Smirnov and Chi-square tests are shown in Table 4.

The histograms of Allo and Auto patients are shown in Figures 1 and 2.

The area under ROC curve is calculated to be 0.711 by taking the allo patients as one group and auto patients as the second group when both groups follow Pareto distribution. The ROC curve is plotted in Figure 3.

For the above example, the parametric and non-parametric values of AUROC are 0.8164450 and 0.8049404 respectively. The root mean square errors for the parametric and non-parametric approaches are 0.2107548 and 0.2511394 respectively.

4.2. Bi-Two Parameter Exponential Model

Freireich et al. gave the results of a clinical trial of the drug 6-mercaptopurine (6-MP) versus a placebo in 42 children with acute leukemia; the data are given in Table 5. The trial was conducted at 11 American hospitals. Patients were selected who had a complete or partial remission of their leukemia induced by treatment with the drug prednisone. The trial was conducted by matching pairs of patients at a given hospital by remission status (partial or complete) and randomising within each pair to either 6-MP or placebo maintenance therapy. The patients were followed until their leukemia returned or until the end of the study, with times recorded in months.

By using the EasyFit software, we see that the data for placebo and 6-MP patients fit well to the two parameter exponential distribution. This can also be concluded from the values in Table 6.

The histograms of Placebo and 6-MP patients are as shown in Figures 4 and 5.

Table 4. p-values for Kolmogorov-Smirnov and Chi-square tests.

Figure 1. Histogram for allo patients.

Figure 2. Histogram for auto patients.

Figure 3. ROC Curve for the data given in Table 3.

The area under ROC curve is calculated to be 0.759 by taking the placebo patients as one group and 6-MP patients as the second group when both groups follow the two parameter exponential distribution. The ROC curve is plotted in Figure 6.

For the above example, the parametric value of AUROC is 0.78713997 and the non-parametric value is 0.45884774. The root mean square errors for the parametric and non-parametric approaches are 0.02813997 and 0.30015226 respectively.

Table 5. Time to relapse for Placebo patients and 6-MP patients.

CR: Complete Remission, PR: Partial Remission

Table 6. p-values for Kolmogorov-Smirnov and Chi-square tests.

Figure 4. Histogram for Placebo patients.

Figure 5. Histogram of 6-MP patients.

Figure 6. ROC Curve for the data given in Table 5.

5. Conclusion

In this paper, we derive the AUROC for Bi-Pareto and Bi-two parameter exponential models. The theoretical, parametric and non-parametric values of AUROC for different parameter combinations have been calculated. The root mean square and mean absolute errors are calculated using simulations. For both the models, the area under ROC curve can be estimated more accurately by the parametric approach than by the non-parametric approach. The applications have been discussed using real life data sets.

Acknowledgements

The first author is thankful to University Grants Commission, Government of India, for providing financial support for this work.

REFERENCES

[1] S. Satchell and W. Xia, “Parametrical Models of the ROC Curve: Applications to Credit Rating Model Validation,” Quantitative Finance Research Centre Research Paper 181, University of Technology, Sydney, 2006.

[2] W. J. Krzanowski and D. J. Hand, “ROC Curves for Continuous Data,” Taylor and Francis Group, New York, 2009. http://dx.doi.org/10.1201/9781439800225

[3] D. Bamber, “The Area above the Ordinal Dominance Graph and the Area below the Receiver Operating Characteristic Graph,” Journal of Mathematical Psychology, Vol. 12, No. 4, 1975, pp. 387-415. http://dx.doi.org/10.1016/0022-2496(75)90001-2

[4] J. A. Hanley and B. J. McNeil, “The Meaning and Use of the Area under ROC Curve,” Radiology, Vol. 4, 1982, pp. 49-58.

[5] D. M. Green and J. A. Swets, “Signal Detection Theory and Psychophysics,” Wiley, New York, 1966.

[6] A. P. Bradley, “The Use of the Area under ROC Curve in the Evaluation of Machine Learning Algorithms,” Pattern Recognition, Vol. 30, No. 7, 1997, pp. 1145-1159. http://dx.doi.org/10.1016/S0031-3203(96)00142-2

[7] N. L. Johnson and S. Kotz, “Continuous Univariate Distributions,” Vol. 1, Wiley, New York, 1970.

[8] S. J. Mason and N. E. Graham, “Area Beneath Relative Operating Characteristic (ROC) and Relative Operating Levels (ROL) Curve: Statistical Significance and Interpretation,” Quarterly Journal of the Royal Meteorological Society, Vol. 128, No. 584, 2002, pp. 2145-2166. http://dx.doi.org/10.1256/003590002320603584

[9] J. P. Klein and M. L. Moeschberger, “Survival Analysis Techniques for Censored and Truncated Data,” Springer-Verlag, New York, 2003.

[10] T. R. Freireich, E. Gehan, E. Frei, L. R. Schroeder, I. J. Wolman, R. Anbari, E. O. Burgert, S. D. Mills, D. Pinkel, O. S. Selawry, J. H. Moon, B. R. Gendel, C. L. Spurr, R. Storrs, F. Haurani, B. Hoogstraten and S. Lee, “The Effect of 6-Mercaptopurine on the Duration of Steroid Induced Remissions in Acute Leukemia: A Model for Evaluation of Other Potentially Useful Therapy,” Blood, Vol. 21, No. 6, 1963, pp. 699-716.