[Figure 1 plot panels omitted: one panel per patient (Patients 35, 39, 40, 47, 50, 51, 53, 54, 60, 71, and 76 appear here), with the readings of readers 1 and 2 plotted against time index.]
Figure 1. Plots of RECIST readings versus time index for each patient by two radiologists (denoted 1, 2) in a longitudinal study.
The RECIST markup data here is the largest diameter of one identified nodule, in millimeters (mm).
on (6) for each patient. In the bottom figure, the test statistic is defined as in (11), but with the variance estimate shown in the bottom panel of Figure 2. One can conclude from Figure 3 that the test results in the bottom figure improve markedly on the original results in the top figure. For patient number 2, the new test does not find significance, while the original test based on the raw variance finds strong significance because of a low variance estimate. The new test seems more consistent with the visual appearance of the data for patient number 2 in Figure 1. Similar comments apply to the data for patient number 11. The opposite is observed for patient numbers 18 and 19: the original tests based on the raw variance estimates do not find significance because of inflated variance estimates, and this is corrected in the new test. As a result, significant change is observed for both patients using the improved test. The two readers agree with each other on the direction of change for 19 out of 23 patients (they disagree on patient numbers 1, 20, 21, and 23). A summary of the decisions based on the statistical tests is provided in Table 1, where we use the 10% level as the threshold for a significant increase and the 5% level for a significant decrease.
Interestingly, one may compare the statistical test results with the RECIST guidelines, under which an increase over time of at least 20% defines “progressive disease” (PD) and a decrease of at least 30% defines “partial response” (PR). The RECIST categories can be inferred from the relative percentage change data plotted in Figure 4, and a summary of the RECIST analysis is given in Table 2. In short, on 4 out of 23 patients the two experts have given percentage changes of opposite directions (cf. patient numbers 1, 20, 21, and 23). The two experts differ in their computed percentage changes with a mean difference of 21%. In terms of RECIST decisions based on the radiologists’ individual assessments, in addition to the 4 patients on which they totally disagree, they agree on 15
Z.-Q. J. LU ET AL. 265
[Figure 2 plot panels omitted. Top: "Raw robust variance estimation" (y-axis: difference-based std, mm). Middle: "Variance estimation after bias adjustment" (y-axis: std, mm). Bottom: "Variance estimation after l.e. B pooling" (y-axis: std, mm). x-axis in all panels: patient index.]
Figure 2. Variance estimation. Top: raw robust variance estimates for within-patient readings from each of the two experts (denoted by 1 and 2). Middle: after adjustment for signal bias (mainly affecting the highest variances). Bottom: after variance pooling, using Robbins’ method, to stabilize the estimates and to improve underestimated (low) variance estimates.
Table 1. Summary of the statistical test results. Of the 23 annotated patients, the two readers totally disagree on 4 (Patients 1, 20, 21, 23), agree on 16, and agree partially on 3 (Patients 3, 8, 17). Rows: Reader 1; columns: Reader 2.

Reader 1 \ Reader 2        | Significant increase (x) | Increase but not sig (%) | Decrease but not sig (o) | Significant decrease (--)
Significant increase (x)   | 2                        | 2                        | 0                        | 2
Increase but not sig (%)   | 0                        | 4                        | 0                        | 0
Decrease but not sig (o)   | 0                        | 0                        | 4                        | 0
Significant decrease (--)  | 2                        | 0                        | 1                        | 6

Symbols: x denotes a significant increase at the 10% level, -- a significant decrease, o a non-significant decrease, and % a non-significant increase.
Table 2. Summary of the RECIST results. On 4 out of 23 patients, the two experts have given percentage changes of opposite directions (cf. Patients 1, 20, 21, 23). The two experts differ in their computed percentage changes with a mean difference of 21%. In terms of RECIST decision results, they agree on 15 patients and agree partially on 4 patients (Patients 8, 9, 17, 22). Rows: Reader 1; columns: Reader 2.

Reader 1 \ Reader 2          | Progressive disease (PD) | Increase but below 20% (y) | Decrease but below 30% (y) | Partial recovery (PR)
Progressive disease (PD)     | 2                        | 1                          | 1                          | 0
Increase but below 20% (y)   | 1                        | 4                          | 1                          | 0
Decrease but below 30% (N)   | 1                        | 1                          | 7                          | 2
Partial recovery (PR)        | 0                        | 0                          | 0                          | 2
Copyright © 2012 SciRes. OJS
[Figure 3 plot panels omitted. Top: "Z-statistics ratios using raw variance estimates". Bottom: "Z-statistics ratios using l.e. B pooled variance estimates". x-axis: patient index; y-axis: Z-statistics.]
Figure 3. Z-ratio test statistics for change based on readings from two experts (denoted by symbols 1, 2) in a longitudinal study involving 23 patients. Top: based on the raw variance estimates. Bottom: based on the pooled and stabilized variance estimates given in Figure 2. The solid (black) lines denote the 0.05 significance test thresholds for positive or negative change in the mean, and the dashed (red) lines denote the 0.10 significance test thresholds for change.
[Figure 4 plot omitted. x-axis: patient number; y-axis: percentage change observed.]
Figure 4. RECIST percentage-based interpretation of annotated RIDER data. X-axis: patient number from 1 through 23 for the 23 patients in the annotated database. Y-axis: computed percentage changes (measurement at final time minus measurement at entry time, divided by measurement at entry time) for the two experts (1s for reader one, 2s for reader two). Data are the RECIST annotations (largest diameter of the nodule among all slices at a given time) at different times for all 23 patients by two radiologists. The two solid lines denote the 20% increase and 30% decrease thresholds.
patients, and agree partially on 4 patients (patient numbers 8, 9, 17, and 22) in their categorical classifications (progressive disease (PD), partial response (PR)). In terms of the Kappa measure [16], the statistical tests give
\[
\kappa = \frac{\sum_i \theta_{ii} - \sum_i \theta_{i\cdot}\,\theta_{\cdot i}}{1 - \sum_i \theta_{i\cdot}\,\theta_{\cdot i}} = 0.5861,
\]
where θij denotes an entry of Table 1 divided by the total (23), and θi. and θ.i denote the corresponding row and column sums. The Kappa value for the statistical tests indicates slightly better agreement between the two readers than the Kappa value (0.5027) for the RECIST-based results in Table 2. However, both approaches indicate moderate agreement between the two readers [17].
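As an illustrative check, the two Kappa values quoted above can be recomputed from the counts in Table 1 and Table 2. The following is a minimal sketch, assuming the standard Cohen's Kappa (observed agreement on the diagonal, corrected for chance agreement from the marginals); the function name `cohen_kappa` and the variable names are ours, not the paper's.

```python
from fractions import Fraction


def cohen_kappa(table):
    """Cohen's Kappa for a square contingency table of counts (rows: reader 1, columns: reader 2)."""
    k = len(table)
    total = sum(sum(row) for row in table)
    row_sums = [sum(table[i]) for i in range(k)]
    col_sums = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Observed agreement: proportion of patients on the diagonal.
    observed = Fraction(sum(table[i][i] for i in range(k)), total)
    # Chance agreement: sum over categories of (row proportion) * (column proportion).
    expected = Fraction(sum(row_sums[i] * col_sums[i] for i in range(k)), total * total)
    return float((observed - expected) / (1 - expected))


# Counts copied from Table 1 (statistical tests) and Table 2 (RECIST).
TABLE_1 = [[2, 2, 0, 2], [0, 4, 0, 0], [0, 0, 4, 0], [2, 0, 1, 6]]
TABLE_2 = [[2, 1, 1, 0], [1, 4, 1, 0], [1, 1, 7, 2], [0, 0, 0, 2]]

kappa_stat = cohen_kappa(TABLE_1)
kappa_recist = cohen_kappa(TABLE_2)
print(round(kappa_stat, 4), round(kappa_recist, 4))
```

Exact fractions are used for the intermediate proportions so that rounding only happens once, at the final conversion to a float.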
At the minimum, we can ask whether there is any corroboration or dependence between the two readers’ assessments. If we use only the signs of the categorical score measurements (so that the variance estimation has no impact), there are 7 concordant “increases”, 10 concordant “decreases”, 4 discordant decisions for “increase” by one reader and “decrease” by another reader, and vice versa. The Chi-squared test for independence of the two readers using the contingency table gives a value of 5.5457 with 1 df and a P-value of 0.0185. Using Fisher’s exact test, the P-value is 0.0092 for the two-sided alternative.
We may also decide on a threshold value, say 1, and assign the decision 1, 0, −1 for “increase”, “indecision”, “decrease” if the score on a patient is ≥1, between −1 and 1, or ≤−1. The contingency table for the two readers in the column and row order of 1, 0, −1 is: 7, 0, 0; 2, 5, 3; 0, 4, 2. Pearson’s Chi-squared test for independence has a value of 16.3215, with df = 4 and P-value = 0.0026. Fisher’s exact test has a P-value of 0.0012 (for the two-sided alternative).
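The Pearson statistic for the three-category table can be reproduced directly from the nine counts given in the text. The sketch below is illustrative only: the helper names `pearson_chi2` and `chi2_sf_df4` are ours, and the P-value uses the closed-form upper tail of the chi-squared distribution with 4 degrees of freedom, exp(-x/2)(1 + x/2), rather than a statistics library. Fisher's exact test is not reproduced here.

```python
import math


def pearson_chi2(table):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(row_sums) - 1) * (len(col_sums) - 1)
    return stat, df


def chi2_sf_df4(x):
    """Upper-tail probability of a chi-squared variable with exactly 4 degrees of freedom."""
    return math.exp(-x / 2) * (1 + x / 2)


# Counts quoted in the text for the thresholded decisions of the two readers.
DECISIONS = [[7, 0, 0], [2, 5, 3], [0, 4, 2]]

chi2_stat, chi2_df = pearson_chi2(DECISIONS)
p_value = chi2_sf_df4(chi2_stat)
print(round(chi2_stat, 4), chi2_df, round(p_value, 4))
```

With several expected counts well below 5, the chi-squared approximation is rough for a table this small, which is presumably why the exact test is reported alongside it.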
5. Discussion
We believe that there is a strong need to study the reliability and statistical performance of RECIST and of other time-sequence tumor size measurement regimes, such as WHO or 3D volume metrics. The statistical methods suggested in this paper demonstrate the potential of medical decision making that explicitly takes into account the uncertainty in the markings by expert radiologists. A statistical decision rule for change could become available in the future, based on realistic measurement quantification along the lines of [6,18]. In addition, there is a critical need to establish measurement uncertainty, for example by accommodating the effects of protocols and instrument settings [6]. A statistics-based decision rule can readily incorporate the different uncertainty components in therapy response decision making. There is also a need to study biological variability, and to study the algorithmic factors of computer-assisted measurements for other size measures such as the volume metric, which is mainly useful for thin-slice CT scans (1.0 mm or less) [6].
Partly because of measurement bias in absolute nodule size measurements, alternative procedures have been investigated for measuring change directly (e.g. [19,20]). However, we caution readers that the latter approach raises additional issues of uncertainty in the change measurements themselves, and it remains unclear how to assess measurement uncertainty in change-measurement data, for example for small nodules. Although there have been many developments with RECIST, this important topic has received little attention in the statistical literature (an exception is [21]). We believe there are ample opportunities for statisticians to engage in this important area of medical image decision analysis, which is concerned with assessing therapeutic effectiveness.
6. Acknowledgements
The first two authors would like to thank our colleague Qiming Wang for her work in accessing and analyzing the DICOM images used in this paper, and our colleague Alden Dima, who made the DICOM image database server available to us.
REFERENCES
[1] E. A. Eisenhauer, P. Therasse, J. Bogaerts, L. H. Schwartz,
D. Sargent, R. Ford, J. Dancey, S. Arbuck, S. Gwyther, M.
Mooney, L. Rubinstein, L. Shankar, L. Dodd, R. Kaplan,
D. Lacombe and J. Verweij, “New Response Evaluation
Criteria in Solid Tumours: Revised RECIST Guideline
(Version 1.1),” European Journal of Cancer, Vol. 45, No.
2, 2009, pp. 228-247. doi:10.1016/j.ejca.2008.10.026
[2] C. C. Jaffe, “Measures of Response: RECIST, WHO, and
New Alternatives,” Journal of Clinical Oncology, Vol. 24,
No. 20, 2006, pp. 3245-3251.
doi:10.1200/JCO.2006.06.5599
[3] H. Robbins, “Estimating Many Variances,” In: S. S. Gupta,
Ed., Statistical Decision Theory and Related Topics III,
Vol. 2, Academic Press, New York, 1982, pp. 251-261.
[4] H. Robbins, “Some Thoughts on Empirical Bayes Estimation,”
Annals of Statistics, Vol. 11, No. 3, 1983, pp. 713-723.
doi:10.1214/aos/1176346239
[5] L. H. Schwartz, M. Mazumdar, W. Brown, A. Smith and
D. M. Panicek, “Variability in Response Assessment in
Solid Tumors: Effect of Number of Lesions Chosen for
Measurement,” Clinical Cancer Research, Vol. 9, No. 12,
2003, pp. 4318-4323.
[6] Z. Q. J. Lu, N. Petrick, C. Fenimore, D. Clunie, K. Borradaile,
R. Ford, M. F. McNitt-Gray, H. J. G. Kim, R. Zeng, M. A. Gavrielides,
B. Zhao and A. J. Buckler, “Statistical Analysis of Reader
Measurement Variability in Nodule Sizing with CT Phantom Imaging
Data,” NIST Interagency Report, 2012.
[7] J. J. Erasmus, G. W. Gladish, L. Broemeling, B. S. Sabloff,
M. T. Truong, R. S. Herbst and R. F. Munden, “Interobserver
and Intraobserver Variability in Measurement of
Non-Small-Cell Carcinoma Lung Lesions: Implications
for Assessment of Tumor Response,” Journal of Clinical
Oncology, Vol. 21, No. 13, 2003, pp. 2574-2582.
doi:10.1200/JCO.2003.01.144
[8] L. E. Dodd, R. F. Wagner, S. G. Armato III, M. F. McNitt-Gray,
S. Beiden, H.-P. Chan, D. Gur, G. McLennan, C. E.
Metz, N. Petrick, B. Sahiner and J. Sayre, “Assessment
Methodologies and Statistical Issues for Computer-Aided
Diagnosis of Lung Nodules in Computed Tomography:
Contemporary Research Topics Relevant to the Lung
Image Database Consortium,” Academic Radiology, Vol.
11, No. 4, 2004, pp. 462-475.
doi:10.1016/S1076-6332(03)00814-6
[9] C. R. Meyer, T. D. Johnson, G. McLennan, D. R. Aberle,
E. A. Kazerooni, H. MacMahon, B. F. Mullan, D. F.
Yankelevitz, E. J. R. van Beek, S. G. Armato III, M. F.
McNitt-Gray, A. P. Reeves, D. Gur, C. I. Henschke, E. A.
Hoffman, R. H. Bland, G. Laderach, R. Pais, D. Qing, C.
Piker, J. Guo, A. Starkey, D. Max, B. Y. Croft and L. P.
Clarke, “Evaluation of Lung MDCT Nodule Annotation
Across Radiologists and Methods,” Academic Radiology,
Vol. 13, No. 10, 2006, pp. 1254-1265.
doi:10.1016/j.acra.2006.07.012
[10] RIDER: Reference Image Database to Evaluate Response,
National Institute of Biomedical Imaging and Bioengineering
Institute of NIH.
http://www.nibib.nih.gov/Research/Resources/ImageClinData#RIDER
[11] Z. Q. Lu, “Local Polynomial Prediction and Volatility
Estimation in Financial Time Series,” In: A. S. Soofi and
L. Cao, Eds., Modelling and Forecasting Financial Data:
Techniques of Nonlinear Dynamics, Kluwer, Boston, 2002,
pp. 115-135.
[12] C. R. Meyer, S. G. Armato III, C. P. Fenimore, G. McLennan,
L. M. Bidaut, D. P. Barboriak, M. A. Gavrielides, E.
F. Jackson, M. F. McNitt-Gray, P. E. Kinahan, N. Petrick
and B. Zhao, “Quantitative Imaging to Assess Tumor
Response to Therapy: Common Themes of Measurement,
Truth Data, and Error Sources,” Translational Oncology,
Vol. 2, No. 4, 2009, pp. 198-210.
[13] P. J. Huber, “Robust Statistics,” Wiley, New York, 1981.
[14] D. C. Hoaglin, F. Mosteller and J. W. Tukey, “Understanding
Robust and Exploratory Data Analysis,” Wiley,
New York, 1983.
[15] S. G. Armato III, C. R. Meyer, M. F. McNitt-Gray, G.
McLennan, A. P. Reeves, B. Y. Croft and L. P. Clarke,
“The Reference Image Database to Evaluate Response to
Therapy in Lung Cancer (RIDER) Project: A Resource
for the Development of Change-Analysis Software,”
Clinical Pharmacology & Therapeutics, Vol. 84, No. 4,
2008, pp. 448-456. doi:10.1038/clpt.2008.161
[16] J. R. Landis and G. G. Koch, “The Measurement of Observer
Agreement for Categorical Data,” Biometrics, Vol.
33, No. 1, 1977, pp. 159-174. doi:10.2307/2529310
[17] A. J. Viera and J. M. Garrett, “Understanding the
Interobserver Agreement: The Kappa Statistics,” Family
Medicine, Vol. 37, No. 5, 2005, pp. 360-363.
[18] B. Zhao, L. P. James, C. S. Moskowitz, P. Guo, M. S.
Ginsberg, R. A. Lefkowitz, Y. Qin, G. J. Riely, M. G.
Kris and L. H. Schwartz, “Evaluating Variability in Tumor
Measurements from Same-Day Repeat Scans of Patients
with Non-Small Cell Lung Cancer,” Radiology, Vol.
252, No. 1, 2009, pp. 263-272.
doi:10.1148/radiol.2522081593
[19] A. P. Reeves, A. B. Chan, D. F. Yankelevitz, C. I. Henschke,
B. Kressler and W. J. Kostis, “On Measuring the
Change in Size of Pulmonary Nodules,” IEEE Transactions
on Medical Imaging, Vol. 25, No. 4, 2006, pp. 435-450.
doi:10.1109/TMI.2006.871548
[20] J. M. Reinhardt, K. Ding, K. Cao, C. E. Christensen, E. A.
Hoffman and S. V. Bodas, “Registration-Based Estimates
of Local Lung Tissue Expansion Compared to Xenon CT
Measures of Specific Ventilation,” Medical Image Analysis,
Vol. 12, No. 6, 2008, pp. 752-763.
doi:10.1016/j.media.2008.03.007
[21] L. D. Broemeling, “Bayesian Biostatistics and Diagnostic
Medicine,” Chapman & Hall/CRC, Boca Raton, 2007.