The CONSORT Guideline has increased the amount of scientific knowledge gained from randomized parallel-group trials, but important information with important clinical implications remains to be extracted. This information may be obtained if conventional statistical significance testing is supplemented by: 1) an unbiased and reproducible quantification of the magnitude of the clinical significance/importance of a difference in treatment outcome; 2) a quantification of the credibility of statements on any possible effect size; and 3) a quantification of the risk of committing an error when the null hypothesis is either accepted or rejected. These matters are crucial to the proper conversion of trial results into everyday clinical practice and may produce immediate therapeutic consequences in quite the opposite direction to the usual ones. In our drug-eluting stent trial, “SORT OUT II”, implementing our suggestions would have led to immediate cessation of use of the paclitaxel-eluting stent, which the usual CONSORT-like reporting did not. Consequently, harm to subsequent patients treated with this stent might have been avoided. Our suggestions are also useful in cancer treatment trials and, in fact, in most randomized trials. Increased scientific knowledge with immediate and potentially altered clinical consequences may therefore result if hypothesis testing is made complete and the corresponding adjustments are added to the CONSORT Guideline, first of all for the potential benefit of future patients.

When the CONSORT Statement was published in 2001, it was emphasized that it was “… a continually evolving instrument” to improve the quality of reports of parallel-group randomized trials [

Next, null hypothesis significance testing and interpretation hold some surprises that were well summarized by Gliner, Leech and Morgan [

Finally, it is important to recognize that rejecting or accepting the null hypothesis may still be an error. The risk of being wrong about the null hypothesis can, however, be assessed, and this assessment may have grave consequences in the form of cessation of use of harmful treatments. We therefore suggest including a figure depicting those risks.

SORT OUT II is used to exemplify our proposals and has previously been described thoroughly [

Scientific knowledge may be maximized if the CONSORT trial reporting system were expanded to encompass the following items:

First, the traditional quantification of statistical significance, with calculation of the hazard ratio (HR) and its 95% confidence interval, as already present in the CONSORT Guideline. In SORT OUT II, a MACE occurred in 467 (22.3%) of the patients: 222 (20.8%) in the sirolimus-eluting stent group and 245 (23.7%) in the paclitaxel-eluting stent group. The vertical difference between the two curves of cumulated proportion of patients experiencing a MACE was statistically non-significant (log-rank test, chi-square = 2.49, p = 0.11; HR 0.87, 95% confidence interval 0.72 - 1.04) (

Second, and new, is quantification of the clinical significance/importance of an outcome difference or effect size. The area between the two curves of cumulated events reflects an estimate of the net health gain with the superior treatment, under the important presumption that the present curves (

for X days then the patients experiencing an event would, on average, postpone it by Y days as compared to the patients treated with the apparently worse treatment (

Even if the vertical difference between two curves had been statistically significant, the horizontal difference is still useful for determining whether such a proven difference is also large enough to be clinically significant. The measure may be used in most trials. For instance, a randomized study on pancreatic cancer may show, with statistical significance, that the number needed to treat to save one life is 28. If the clinical significance were an average postponement of death by 21 days, then the statistically significant result might be perceived as a proven difference that does not make much of a difference. Calculation of the postponement, or horizontal difference, is therefore a strong tool for measuring the clinical relevance of a given “number needed to treat”.

[Computation of the horizontal difference: for each single day, the area under each curve of cumulated proportion of patients experiencing an event is determined. The difference between these areas is calculated, together with the cumulated area difference as a function of the observation time. The potential superiority is calculated as this cumulated area difference divided by the concomitant event rate from the event curve that ends up with the lowest cumulated event rate (in this case the sirolimus-eluting stent group).]
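The computation described in the bracketed note can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: both curves are sampled once per day, each day's area is approximated by a one-day-wide rectangle, and the function name `horizontal_difference` and the example values are hypothetical (they are not SORT OUT II data).

```python
import math
from typing import List, Sequence


def horizontal_difference(worse: Sequence[float],
                          better: Sequence[float]) -> List[float]:
    """Average postponement of an event (in days) for each day of follow-up.

    worse[i] and better[i] are the cumulated proportions of patients who
    have experienced an event by day i + 1 in the group that ends with the
    higher and the lower cumulated event rate, respectively.
    Assumption: daily sampling, one-day-wide rectangles for the areas.
    """
    if len(worse) != len(better):
        raise ValueError("both curves must cover the same number of days")

    postponement = []
    cumulated_area_difference = 0.0
    for w, b in zip(worse, better):
        # Area between the two curves for this single day (width = 1 day).
        cumulated_area_difference += (w - b) * 1.0
        # Divide by the concomitant event rate of the better curve.
        if b > 0:
            postponement.append(cumulated_area_difference / b)
        else:
            postponement.append(math.nan)  # undefined before the first event
    return postponement


# Hypothetical three-day example (illustrative values only).
days = horizontal_difference([0.02, 0.04, 0.06], [0.01, 0.02, 0.03])
print(days)  # approximately [1.0, 1.5, 2.0]
```

Plotting the returned values against the observation time would give the kind of depiction from which a reader can pick any follow-up day and read off the corresponding average postponement.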

Third is calculation of the type two error risks, best depicted in an “operating characteristic curve” displaying the connection between increasing minimal relevant differences and the risk of not detecting such differences as statistically significant [
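One possible way to compute the points of such a curve is sketched below. The method is an assumption on our part, not necessarily the one used for SORT OUT II: it uses a two-sided normal-approximation test for the difference between two event proportions with equal group sizes. The function name `type_two_error`, the baseline rate, and the group size are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def type_two_error(difference: float, baseline_rate: float,
                   n_per_group: int, alpha: float = 0.05) -> float:
    """Risk (beta) of not detecting a true absolute difference in event rates.

    Assumption: two-sided z-test on two proportions, equal group sizes,
    normal approximation. This is an illustrative sketch only.
    """
    other_rate = baseline_rate + difference
    standard_error = sqrt(
        baseline_rate * (1 - baseline_rate) / n_per_group
        + other_rate * (1 - other_rate) / n_per_group
    )
    standard_normal = NormalDist()
    z_critical = standard_normal.inv_cdf(1 - alpha / 2)
    shift = difference / standard_error
    # Probability that the test statistic stays inside the acceptance region.
    return (standard_normal.cdf(z_critical - shift)
            - standard_normal.cdf(-z_critical - shift))


# Hypothetical operating characteristic curve: beta for increasing
# minimal relevant differences (absolute, in proportion units).
for mrd in (0.00, 0.02, 0.04, 0.06):
    beta = type_two_error(mrd, baseline_rate=0.20, n_per_group=1000)
    print(f"minimal relevant difference {mrd:.2f}: beta = {beta:.3f}")
```

Under these assumptions, the risk of missing a difference falls as the minimal relevant difference grows, which is the relationship the operating characteristic curve is meant to display.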

Fourth is calculation of the risk of committing an error when either rejecting or accepting the null hypothesis (the delta error or the epsilon error, respectively). These risks depend on the risks of committing a type one and a type two error, and not least on our trust in the correctness of the null hypothesis [

If results from other trials make it plausible that two treatments give different outcomes, then our belief in the null hypothesis is small. If for instance

For the sake of illustration, we may fictitiously set the significance level to 0.11, which would artificially turn SORT OUT II into a statistically significant study. A meta-analysis has shown that one stent is statistically significantly superior to the other, which reduces our trust in the correctness of the null hypothesis to, for instance, 10% [

number of trials (e.g.: 1000) | statistically significant (e.g.: p < 0.05) | statistically ns (e.g.: p > 0.05) | n= |
---|---|---|---|
Hnull true (TER = 0.1 => n = 100) | false positive (0.05 × 100 = 5) | true negative (0.95 × 100 = 95) | 100 |
Hnull false (TER = 0.1 => n = 900) | true positive (0.80 × 900 = 720) | false negative (0.20 × 900 = 180) | 900 |
n= | 725 | 275 | 1000 |
| | Delta error = (false positive)/(all positive) (5 × 100/725 = 0.69%) | Epsilon error = (false negative)/(all negative) (180 × 100/275 = 65%) | |

In brackets, the example contains 1000 virtual trials, alpha = 0.05 and beta = 0.20. TER = true effectiveness ratio; for example, a ratio of 0.1 corresponds to a 10% trust in the correctness of the null hypothesis or

the null hypothesis (
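The arithmetic in the table can be reproduced with a short Python sketch. The function name `delta_epsilon_errors` and its parameter names are hypothetical; the formulas are those given in the table (false positives over all positives, and false negatives over all negatives), expressed as proportions of all trials.

```python
from typing import Tuple


def delta_epsilon_errors(trust_in_null: float, alpha: float = 0.05,
                         beta: float = 0.20) -> Tuple[float, float]:
    """Return (delta error, epsilon error) as proportions.

    trust_in_null is the a priori proportion of trials in which the null
    hypothesis is true (0.1 in the table's example).
    """
    if not 0.0 <= trust_in_null <= 1.0:
        raise ValueError("trust_in_null must lie between 0 and 1")

    false_positive = alpha * trust_in_null
    true_negative = (1 - alpha) * trust_in_null
    true_positive = (1 - beta) * (1 - trust_in_null)
    false_negative = beta * (1 - trust_in_null)

    all_positive = false_positive + true_positive
    all_negative = false_negative + true_negative
    if all_positive == 0 or all_negative == 0:
        raise ValueError("error risks are undefined for these inputs")

    # Delta: risk of being wrong when rejecting the null hypothesis.
    delta = false_positive / all_positive
    # Epsilon: risk of being wrong when accepting the null hypothesis.
    epsilon = false_negative / all_negative
    return delta, epsilon


# The table's example: 10% trust in the null, alpha = 0.05, beta = 0.20.
delta, epsilon = delta_epsilon_errors(0.1)
print(f"delta error = {delta:.2%}, epsilon error = {epsilon:.0%}")
```

Evaluating the function over a range of `trust_in_null` values would yield the data for a figure of delta and epsilon errors as a function of the a priori trust in the null hypothesis.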

SORT OUT II was a statistically non-significant study, and estimation of the epsilon error is therefore the correct thing to do. Continuing with

and as a consequence accept the null hypothesis and continue the clinical use of both stent types, we would be translating the scientific evidence just as always. But those who believe in a difference between the stents (

Explanatory remarks have been incorporated above. It suffices to say that we have, first of all, created an unbiased and reproducible measure of clinical significance, which has not previously been available. This measure is also useful for interpreting the clinical importance of “the number needed to treat”. Estimating the credibility of statements on any effect size is especially useful when smaller trials lack power and reach non-significant results, because the operating characteristic curve shows, for any effect size, how likely it is to have been missed. Finally, our a priori trust in the correctness of the null hypothesis has a strong impact on the risk of being wrong when rejecting or accepting the null hypothesis. These risks may be assessed and displayed in figures on delta and epsilon errors, and used to draw important clinical consequences more effectively than generally happens now, when these parameters are not estimated.

Our suggestions cannot all be reduced to single measures, but only three depictions are needed to enable the reader to choose any individually selected x-value and read the appropriate answer from the corresponding y-value. These depictions should therefore be part of reports on randomized trials and be itemized by the experts in their next revision of the CONSORT Guideline.

Reporting the vertical difference between two cumulated event rates is used to estimate whether there is a statistically significant difference. Expanding the reporting with calculation of the horizontal difference may be used to estimate the magnitude of the clinical significance in an unbiased and reproducible way. Depicting operating characteristic curves for both absolute and relative minimal relevant differences allows the reader to assess how credible it is that any individually chosen minimal relevant difference has not been missed. Finally, when other sources have brought knowledge of expected effect sizes, this may be used in validating the trial results by including a figure displaying the risks of delta and epsilon errors, which may lead to quite different consequences for clinical practice than simple null hypothesis testing would. These simple suggestions may be used to extract additional information from each trial and may improve the quality of reports of parallel-group randomized trials. If the general reader of randomized trials is capable of understanding these few changes, our suggestions may eventually lead to maximum scientific knowledge and better clinical practice, hopefully for the benefit of the patients.

The Danish Heart Registry (DHR) has contributed essential detection of invasive cardiac procedures.

Simon Day, the editor of Statistics in Medicine, has contributed importantly.

Niels Bligaard, Leif Thuesen, Henning Kelbæk, Per Thayssen, Jens Aarøe, Peter R. Hansen, Jens F. Lassen, Kari Saunamäki, Anders Junker, Jan Ravkilde, Ulrik Abildgaard, Hans H. Tilsted, Thomas Engstrøm, Jan S. Jensen, Hans E. Bøtker, Søren Galatius, Carsten T. Larsen, Steen D. Kristensen, Lars R. Krusell, Steen Z. Abildstrøm, Evald H. Christiansen, Ghita Stephansen, R.N., Jørgen L. Jeppesen, John Godtfredsen, Søren Boesgaard, Anders M. Galløe.

Boston Scientific and Cordis, a Johnson & Johnson company, donated unrestricted research grants but had no role in the design and conduct of the study; in the collection, management, analysis, or interpretation of the data; or in the preparation, review, or approval of the manuscript.