[Figure 1 plot panels: Patients 39, 40, 47, 50, 51, 53, 54, 60, 71, and 76.]

Figure 1. Plots of RECIST readings versus time index for each patient by two radiologists (denoted 1, 2) in a longitudinal study. The RECIST markup data here is the largest diameter of one identified nodule, in millimeters (mm).

on (6) for each patient. In the bottom panel, the test statistic is defined as in (11), but with the variance estimate shown in the bottom panel of Figure 2. One can conclude from Figure 3 that the test results in the bottom panel improve markedly on the original results in the top panel. For patient number 2, the new test does not find significance, whereas the original test based on the raw variance finds strong significance because of a low variance estimate. The new test seems to be more consistent with the visual appearance of the data for patient number 2 in Figure 1. Similar comments apply to the data for patient number 11. The opposite is observed for patient numbers 18 and 19: the original tests based on the raw variance estimates do not find significance because of inflated variance estimates, and this is corrected in the new test. As a result, a significant change is observed for both patient numbers 18 and 19 using the improved test. The two readers agree with each other on the direction of change for 19 out of 23 patients (they disagree on patient numbers 1, 20, 21, and 23). A detailed summary of the decisions based on the statistical tests is provided in Table 1, where we use the 10% level as the threshold for a significant increase and the 5% level for a significant decrease.
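The decision rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the critical values are taken as one-sided standard normal quantiles (the paper's exact convention belongs with its equation (11), which is not reproduced here), a Z-ratio of exactly zero is counted as a non-significant increase, and "partial agreement" is taken to mean the same direction with a different significance verdict. The function names `classify_change` and `agreement` are hypothetical.

```python
from statistics import NormalDist

# Assumption: one-sided standard normal critical values.
Z_INCREASE = NormalDist().inv_cdf(1 - 0.10)  # about 1.2816 (10% level)
Z_DECREASE = NormalDist().inv_cdf(0.05)      # about -1.6449 (5% level)

INCREASE_SYMBOLS = {"x", "%"}


def classify_change(z: float) -> str:
    """Map a Z-ratio to the Table 1 symbols: x, %, o, --."""
    if z >= Z_INCREASE:
        return "x"   # significant increase at the 10% level
    if z <= Z_DECREASE:
        return "--"  # significant decrease at the 5% level
    # Not significant: report the direction only (z == 0 counted as increase).
    return "%" if z >= 0 else "o"


def agreement(z1: float, z2: float) -> str:
    """Compare the two readers' decisions for one patient."""
    d1, d2 = classify_change(z1), classify_change(z2)
    if d1 == d2:
        return "agree"
    same_direction = (d1 in INCREASE_SYMBOLS) == (d2 in INCREASE_SYMBOLS)
    return "partial" if same_direction else "disagree"


# Hypothetical Z-ratios for illustration only (not values from the paper).
print(classify_change(2.0), classify_change(-1.0))  # x o
print(agreement(2.0, 0.5))                          # partial
```

Using asymmetric levels (10% for increase, 5% for decrease) makes the rule more sensitive to tumor growth than to shrinkage, which is the choice stated for Table 1.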

It is instructive to compare the statistical test results with the RECIST guidelines (a time-sequence increase of at least 20% defines “progressive disease” (PD), while a decrease of at least 30% defines “partial response” (PR)), whose categories can be inferred from the relative percentage change data plotted in Figure 4. A summary of the RECIST analysis is given in Table 2. In short, on 4 out of 23 patients, the two experts have given percentage changes of opposite directions (cf. patient numbers 1, 20, 21, and 23). The two experts differ in their computed percentage changes with a mean difference of 21%. In terms of RECIST decision results based on the radiologists’ individual assessments, in addition to the 4 patients on which they totally disagree, they agree on 15

Z.-Q. J. LU ET AL. 265

[Figure 2 plot panels, each plotted against patient index (x-axis), with points labeled 1 and 2 by reader. Top: "Raw robust variance estimation", y-axis "Difference-based Std (mm)". Middle: "Variance estimation after bias adjustment", y-axis "std (mm)". Bottom: "Variance estimation after l.e. B pooling", y-axis "std (mm)".]

Figure 2. Variance estimation. Top: Raw robust variance estimates for within-patient readings from each of the two experts (denoted by 1 and 2). Middle: After bias adjustment for signal bias (mainly for the highest variances). Bottom: After variance pooling to stabilize and to improve underestimated variance estimates (low variances) using Robbins' method.

Table 1. Summary of the statistical results: Out of 23 patients annotated, the two readers totally disagree on 4 patients (Patients 1, 20, 21, 23). They agree on 16 patients and agree partially on 3 patients (Patients 3, 8, 17).

                               Reader 2
Reader 1                       x     %     o     --
Significant increase (x)       2     2     0     2
Increase but not sig (%)       0     4     0     0
Decrease but not sig (o)       0     0     4     0
Significant decrease (--)      2     0     1     6

We define the following symbols: x for significant increase at the 10% level, -- for significant decrease, o for non-significant decrease, and % for non-significant increase.

Table 2. Summary of RECIST results: On 4 out of 23 patients, the two experts have given percentage changes of opposite directions (cf. Patients 1, 20, 21, 23). The two experts differ in their computed percentage changes with a mean difference of 21%. In terms of RECIST decision results, they agree on 15 patients, and agree partially on 4 patients (Patients 8, 9, 17, 22).

                                  Reader 2
Reader 1                          PD    Increase <20%    Decrease <30%    PR
Progressive disease (PD)          2     1                1                0
Increase but below 20% (y)        1     4                1                0
Decrease but below 30% (N)        1     1                7                2
Partial response (PR)             0     0                0                2
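The RECIST categories tabulated in Table 2 follow from the relative percentage change between the entry and final measurements. The sketch below is a minimal illustration of that rule, assuming the thresholds are inclusive (at least 20% increase, at least 30% decrease) as stated in the text. The function names and the example diameters are hypothetical and are not taken from the RIDER data.

```python
def percent_change(entry_mm: float, final_mm: float) -> float:
    """Relative change in percent: (final - entry) / entry * 100."""
    if entry_mm <= 0:
        raise ValueError("entry diameter must be positive")
    return 100.0 * (final_mm - entry_mm) / entry_mm


def recist_category(change_pct: float) -> str:
    """Assign one of the four categories used in Table 2."""
    if change_pct >= 20.0:
        return "PD"  # progressive disease: increase of at least 20%
    if change_pct <= -30.0:
        return "PR"  # partial response: decrease of at least 30%
    if change_pct >= 0:
        return "increase below 20%"
    return "decrease below 30%"


# Hypothetical largest diameters in mm (entry, final).
print(recist_category(percent_change(20.0, 25.0)))  # PD (+25%)
print(recist_category(percent_change(40.0, 30.0)))  # decrease below 30% (-25%)
print(recist_category(percent_change(40.0, 28.0)))  # PR (-30%)
```

Unlike the Z-ratio test, this rule uses no variance estimate, so two readers with different measurement noise are held to the same fixed percentage thresholds.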

Copyright © 2012 SciRes. OJS


[Figure 3 plot panels, each plotting Z-statistics (y-axis) against patient index (x-axis), with points labeled 1 and 2 by reader. Top: "Z-statistics ratios using raw variance estimates". Bottom: "Z-statistics ratios using l.e. B pooled variance estimates".]

Figure 3. Z-ratio test statistics for change based on readings from two experts (denoted by symbols 1, 2) in a longitudinal study involving 23 patients. Top: Based on the raw variance estimate. Bottom: Based on the pooled and stabilized variance estimate as given in Figure 2. The solid lines (black) denote the 0.05 significance test threshold for positive or negative change in the mean, and the dashed lines (red) denote the 0.10 significance test threshold for change.

[Figure 4 plot: "Percentage Change Observed" (y-axis) against "Patient Number" (x-axis), with points labeled 1 and 2 by reader.]

Figure 4. RECIST percentage-based interpretation of annotated RIDER data. X-axis: Patient number from 1 through 23 for the 23 patients in the annotated database. Y-axis: Computed percentage changes (measurement at final time minus measurement at entry time, divided by measurement at entry time) for the two experts (1s for reader one, 2s for reader two). Data are the RECIST annotations (largest diameter of the nodule among all slices at a given time) at different times for all 23 patients by two radiologists. The two solid lines denote the 20% increase and 30% decrease thresholds.

patients, and agree partially on 4 patients (patient numbers 8, 9, 17, and 22) in their categorical classifications (progressive disease (PD), partial response (PR)). In terms of the Kappa measure [16], the statistical tests give

$$\kappa = \frac{\sum_i \theta_{ii} - \sum_i \theta_{i.}\,\theta_{.i}}{1 - \sum_i \theta_{i.}\,\theta_{.i}} = 0.5861,$$

where $\theta_{ij}$ denotes an entry of Table 1 divided by the total (23), and $\theta_{i.}$ and $\theta_{.i}$ denote the corresponding row and column sums, also divided by the total. The Kappa value for the statistical tests indicates slightly better agreement between the two readers than the Kappa value (0.5027) for the RECIST-based results in Table 2. However, both approaches indicate moderate agreement between the two readers [17].
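As a check on the two Kappa values quoted above, the following sketch computes Cohen's kappa directly from the counts in Tables 1 and 2, treating the observed agreement as the diagonal proportion and the chance agreement as the sum of products of the row and column proportions. The function name `cohen_kappa` is a hypothetical helper, not part of the paper.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square contingency table of counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: proportion of patients on the diagonal.
    p_obs = sum(table[i][i] for i in range(k)) / n
    # Marginal proportions for Reader 1 (rows) and Reader 2 (columns).
    row = [sum(table[i]) / n for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Agreement expected by chance under independent marginals.
    p_exp = sum(row[i] * col[i] for i in range(k))
    return (p_obs - p_exp) / (1 - p_exp)


# Counts from Table 1 (statistical tests) and Table 2 (RECIST).
TABLE_1 = [[2, 2, 0, 2], [0, 4, 0, 0], [0, 0, 4, 0], [2, 0, 1, 6]]
TABLE_2 = [[2, 1, 1, 0], [1, 4, 1, 0], [1, 1, 7, 2], [0, 0, 0, 2]]

print(round(cohen_kappa(TABLE_1), 4))  # 0.5861
print(round(cohen_kappa(TABLE_2), 4))  # 0.5027
```

Both values depend only on the table counts, so they can be recomputed if a patient's classification changes.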

At a minimum, we can ask whether there is any corroboration or dependence between the two readers’ assessments. If we use only the signs of the categorical score measurements (so that the variance estimation has no impact), there are 7 concordant “increases”, 10 concordant “decreases”, and 4 discordant decisions (“increase” by one reader and “decrease” by the other, or vice versa). The Chi-squared test for independence of the two readers based on this contingency table gives a value of 5.5457 with 1 df and a P-value of 0.0185. Using Fisher’s exact test, the P-value is 0.0092 for the two-sided alternative.
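The two tests on the sign table can be sketched with the standard library alone. Two assumptions are made here that the text does not state: the 4 discordant decisions are split evenly (2 in each direction), and the Chi-squared statistic uses Yates' continuity correction. Under these assumptions the sketch reproduces the reported values, but the paper's exact computation may differ. The function names are hypothetical.

```python
from math import comb, erfc, sqrt


def yates_chi2(a, b, c, d):
    """Chi-squared test with Yates' correction for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    num = n * max(abs(a * d - b * c) - n / 2, 0.0) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    stat = num / den
    p_value = erfc(sqrt(stat / 2))  # chi-squared survival function, 1 df
    return stat, p_value


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P-value: sum of all tables no more likely than observed."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(k):  # hypergeometric probability that the top-left cell equals k
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


# Assumed sign table: 7 concordant increases, 10 concordant decreases,
# and the 4 discordant decisions split 2 and 2.
stat, p = yates_chi2(7, 2, 2, 10)
print(round(stat, 4), round(p, 4))                       # 5.5457 0.0185
print(round(fisher_exact_two_sided(7, 2, 2, 10), 4))     # 0.0092
```

With cell counts this small, the exact test is the safer of the two, and the continuity correction is what keeps the Chi-squared P-value from being too optimistic.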


We may also choose a threshold value, say 1, and assign the decision 1, 0, or −1 for “increase”, “indecision”, or “decrease” if the score for a patient is ≥1, between −1 and 1, or ≤−1, respectively. The contingency table for the two readers, in the column and row order −1, 0, 1, is: 7, 0, 0; 2, 5, 3; 0, 4, 2. Pearson’s Chi-squared test for independence has a value of 16.3215 with df = 4 and a P-value of 0.0026. Fisher’s exact test has a P-value of 0.0012 (for the two-sided alternative).
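The three-level rule and the Pearson Chi-squared test on the resulting 3 × 3 table can be sketched as follows. This is an illustration only: `decide` and `pearson_chi2` are hypothetical helper names, and the P-value uses the closed-form Chi-squared survival function that is valid for 4 degrees of freedom. Fisher's exact test for the 3 × 3 table is not reproduced here.

```python
from math import exp


def decide(score: float, threshold: float = 1.0) -> int:
    """Return 1 (increase), 0 (indecision), or -1 (decrease)."""
    if score >= threshold:
        return 1
    if score <= -threshold:
        return -1
    return 0


def pearson_chi2(table):
    """Pearson Chi-squared statistic and degrees of freedom for an r x c table."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            expected = r * c / n  # expected count under independence
            stat += (table[i][j] - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(cols) - 1)
    return stat, df


def chi2_sf_df4(x: float) -> float:
    """Chi-squared survival function for exactly 4 degrees of freedom."""
    return exp(-x / 2) * (1 + x / 2)


# Contingency table from the text, in the order -1, 0, 1.
TABLE = [[7, 0, 0], [2, 5, 3], [0, 4, 2]]
stat, df = pearson_chi2(TABLE)
print(round(stat, 2), df, round(chi2_sf_df4(stat), 4))  # 16.32 4 0.0026
```

Several expected counts in this table are below 5, so the Chi-squared approximation is rough and the exact test quoted in the text is the more reliable figure.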

5. Discussion

We believe that there is a strong need to study the reliability and statistical performance of RECIST and of other time-sequence tumor size measurement regimes, such as WHO or 3D volume metrics. The statistical methods suggested in this paper demonstrate the potential of medical decision making that explicitly takes into account the uncertainty in the markings by expert radiologists, and a statistical decision rule for change could become available in the future, based on realistic measurement quantification along the lines of [6,18]. In addition, there is a critical need to establish measurement uncertainty, for example by accommodating the effects of protocols and instrument settings [6]. A statistics-based decision rule can readily incorporate the different uncertainty components in therapy response decision making. There is also a need to study biological variability and the algorithmic factors of computer-assisted measurements in other size measures, such as the volume metric, which is mainly useful for thin-slice CT scans (1.0 mm or less) [6].

Partly because of the observed measurement bias in absolute nodule size measurements, alternative procedures have been investigated for direct change measurements (e.g. [19,20]). However, we caution readers that the latter approach raises additional issues with the uncertainty in the change measurements themselves, and it remains unclear how to assess measurement uncertainty in change-measurement data, for example for small nodules. Although there have been many developments with RECIST, this important topic has received little attention in the statistical literature (an exception is [21]). We believe there are ample opportunities for statisticians to engage in this important area of medical image decision analysis, which is concerned with assessing therapeutic effectiveness.

6. Acknowledgements

The first two authors would like to thank our colleague Qiming Wang for her work in analyzing and accessing the DICOM images used in this paper, and our colleague Alden Dima, who made the DICOM image database server available to us.

REFERENCES

[1] E. A. Eisenhauer, P. Therasse, J. Bogaerts, L. H. Schwartz, D. Sargent, R. Ford, J. Dancey, S. Arbuck, S. Gwyther, M. Mooney, L. Rubinstein, L. Shankar, L. Dodd, R. Kaplan, D. Lacombe and J. Verweij, “New Response Evaluation Criteria in Solid Tumours: Revised RECIST Guideline (Version 1.1),” European Journal of Cancer, Vol. 45, No. 2, 2009, pp. 228-247. doi:10.1016/j.ejca.2008.10.026

[2] C. C. Jaffe, “Measures of Response: RECIST, WHO, and New Alternatives,” Journal of Clinical Oncology, Vol. 24, No. 20, 2006, pp. 3245-3251. doi:10.1200/JCO.2006.06.5599

[3] H. Robbins, “Estimating Many Variances,” In: S. S. Gupta, Ed., Statistical Decision Theory and Related Topics III, Vol. 2, Academic Press, New York, 1982, pp. 251-261.

[4] H. Robbins, “Some Thoughts on Empirical Bayes Estimation,” Annals of Statistics, Vol. 11, No. 3, 1983, pp. 713-723. doi:10.1214/aos/1176346239

[5] L. H. Schwartz, M. Mazumdar, W. Brown, A. Smith and D. M. Panicek, “Variability in Response Assessment in Solid Tumors: Effect of Number of Lesions Chosen for Measurement,” Clinical Cancer Research, Vol. 9, No. 12, 2003, pp. 4318-4323.

[6] Z. Q. J. Lu, N. Petrick, C. Fenimore, D. Clunie, K. Borradaile, R. Ford, M. F. McNitt-Gray, H. J. G. Kim, R. Zeng, M. A. Gavrielides, B. Zhao and A. J. Buckler, “Statistical Analysis of Reader Measurement Variability in Nodule Sizing with CT Phantom Imaging Data,” NIST Interagency Report, 2012.

[7] J. J. Erasmus, G. W. Gladish, L. Broemeling, B. S. Sabloff, M. T. Truong, R. S. Herbst and R. F. Munden, “Interobserver and Intraobserver Variability in Measurement of Non-Small-Cell Carcinoma Lung Lesions: Implications for Assessment of Tumor Response,” Journal of Clinical Oncology, Vol. 21, No. 13, 2003, pp. 2574-2582. doi:10.1200/JCO.2003.01.144

[8] L. E. Dodd, R. F. Wagner, S. G. Armato III, M. F. McNitt-Gray, S. Beiden, H.-P. Chan, D. Gur, G. McLennan, C. E. Metz, N. Petrick, B. Sahiner and J. Sayre, “Assessment Methodologies and Statistical Issues for Computer-Aided Diagnosis of Lung Nodules in Computed Tomography: Contemporary Research Topics Relevant to the Lung Image Database Consortium,” Academic Radiology, Vol. 11, No. 4, 2004, pp. 462-475. doi:10.1016/S1076-6332(03)00814-6

[9] C. R. Meyer, T. D. Johnson, G. McLennan, D. R. Aberle, E. A. Kazerooni, H. MacMahon, B. F. Mullan, D. F. Yankelevitz, E. J. R. van Beek, S. G. Armato III, M. F. McNitt-Gray, A. P. Reeves, D. Gur, C. I. Henschke, E. A. Hoffman, R. H. Bland, G. Laderach, R. Pais, D. Qing, C. Piker, J. Guo, A. Starkey, D. Max, B. Y. Croft and L. P. Clarke, “Evaluation of Lung MDCT Nodule Annotation Across Radiologists and Methods,” Academic Radiology, Vol. 13, No. 10, 2006, pp. 1254-1265. doi:10.1016/j.acra.2006.07.012

[10] RIDER: Reference Image Database to Evaluate Response, National Institute of Biomedical Imaging and Bioengineering, NIH. http://www.nibib.nih.gov/Research/Resources/ImageClinData#RIDER

[11] Z. Q. Lu, “Local Polynomial Prediction and Volatility Estimation in Financial Time Series,” In: A. S. Soofi and L. Cao, Eds., Modelling and Forecasting Financial Data: Techniques of Nonlinear Dynamics, Kluwer, Boston, 2002, pp. 115-135.

[12] C. R. Meyer, S. G. Armato III, C. P. Fenimore, G. McLennan, L. M. Bidaut, D. P. Barboriak, M. A. Gavrielides, E. F. Jackson, M. F. McNitt-Gray, P. E. Kinahan, N. Petrick and B. Zhao, “Quantitative Imaging to Assess Tumor Response to Therapy: Common Themes of Measurement, Truth Data, and Error Sources,” Translational Oncology, Vol. 2, No. 4, 2009, pp. 198-210.

[13] P. J. Huber, “Robust Statistics,” Wiley, New York, 1981.

[14] D. C. Hoaglin, F. Mosteller and J. W. Tukey, “Understanding Robust and Exploratory Data Analysis,” Wiley, New York, 1983.

[15] S. G. Armato III, C. R. Meyer, M. F. McNitt-Gray, G. McLennan, A. P. Reeves, B. Y. Croft and L. P. Clarke, “The Reference Image Database to Evaluate Response to Therapy in Lung Cancer (RIDER) Project: A Resource for the Development of Change-Analysis Software,” Clinical Pharmacology & Therapeutics, Vol. 84, No. 4, 2008, pp. 448-456. doi:10.1038/clpt.2008.161

[16] J. R. Landis and G. G. Koch, “The Measurement of Observer Agreement for Categorical Data,” Biometrics, Vol. 33, No. 1, 1977, pp. 159-174. doi:10.2307/2529310

[17] A. J. Viera and J. M. Garrett, “Understanding Interobserver Agreement: The Kappa Statistic,” Family Medicine, Vol. 37, No. 5, 2005, pp. 360-363.

[18] B. Zhao, L. P. James, C. S. Moskowitz, P. Guo, M. S. Ginsberg, R. A. Lefkowitz, Y. Qin, G. J. Riely, M. G. Kris and L. H. Schwartz, “Evaluating Variability in Tumor Measurements from Same-Day Repeat Scans of Patients with Non-Small Cell Lung Cancer,” Radiology, Vol. 252, No. 1, 2009, pp. 263-272. doi:10.1148/radiol.2522081593

[19] A. P. Reeves, A. B. Chan, D. F. Yankelevitz, C. I. Henschke, B. Kressler and W. J. Kostis, “On Measuring the Change in Size of Pulmonary Nodules,” IEEE Transactions on Medical Imaging, Vol. 25, No. 4, 2006, pp. 435-450. doi:10.1109/TMI.2006.871548

[20] J. M. Reinhardt, K. Ding, K. Cao, C. E. Christensen, E. A. Hoffman and S. V. Bodas, “Registration-Based Estimates of Local Lung Tissue Expansion Compared to Xenon CT Measures of Specific Ventilation,” Medical Image Analysis, Vol. 12, No. 6, 2008, pp. 752-763. doi:10.1016/j.media.2008.03.007

[21] L. D. Broemeling, “Bayesian Biostatistics and Diagnostic Medicine,” Chapman & Hall/CRC, Boca Raton, 2007.