Khatry DB. Interrelationships among enrichment with diagnostics, probability of success and demonstration of effectiveness in clinical trials. J Med Stat Inform. 2013; 1:4. http://dx.doi.org/10.7243/2053-7662-1-4
Deepak B. Khatry
Correspondence: Deepak B. Khatry khatryd@medimmune.com
Author Affiliations
MedImmune, Biostatistics/Translational Sciences, One MedImmune Way, Gaithersburg, USA.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background: Prevalence of disease phenotype in clinical practice is often not given adequate importance during formulation, validation, and implementation of diagnostic tests in clinical research and development. After promising biomarkers have been identified as potential screening diagnostics, an important strategic question for optimal decision-making in clinical development of a therapeutic is when to choose an enrichment study design over the traditional all-comer randomized controlled trial design.
Methods: A hypothetical example of a cholesterol-lowering treatment is used to illustrate the influence of key statistical criteria and clinical considerations on the choice of study design. Computer simulations demonstrate how the results of such analyses can aid in deciding whether or not to choose enrichment study designs.
Results: This study shows how an understanding of disease prevalence in practice, the predictive values of the diagnostic test, and pre-specified establishment of a clinically meaningful minimum effectiveness all need to be integrated to ensure clinical trial success and appropriate benefit to targeted patient subgroups. The most important statistical and clinical considerations were the anticipated effect size, phenotype prevalence, predictive values of the diagnostic test, study power, and desired clinically meaningful difference.
Conclusions: This study illustrates how successful clinical studies can be designed with careful planning and utilization of computer simulations, not only to increase the probability of trial success but also to demonstrate to payers convincing evidence of clinical effectiveness. A six-step checklist is recommended as an evidence-based guideline to assist in decision-making on whether or not to adopt a diagnostics-enriched clinical study design.
Keywords: Clinical trial, biomarker, screening diagnostics, clinical effectiveness, personalized healthcare, stratified medicine, computer simulation
Many clinical trials fail because a "sufficiently appropriate" group of patients is not studied. Often, inclusion/exclusion criteria in conventional randomized clinical trials (RCTs) do not optimally match patients to an investigational therapy's specific mechanism of action (MoA). Thus, assessment of "efficacy," a measure of a therapy's "average" benefit to patients compared between standard-of-care (SoC) and investigational treatment arms, poses two types of risks.
The first risk, designated as "consumer risk," is to individual patients. Because of heterogeneity in the clinical trial population, overall large efficacy signals may be produced by only a small subset of patients matching the therapy's MoA. Thus, although a large proportion of study subjects may not be responding to an investigational therapy, the clinical trial, nevertheless, may show statistical significance and result in regulatory approval for marketing. In this latter scenario, there are potential ethical issues in the study itself because many participants with very low probability of benefit from the investigational therapy will have been unnecessarily exposed to potential harm. As the inclusion/exclusion criteria from pivotal trials determine product labels, many patients may later be prescribed the newly approved therapy, causing a large distortion in the benefit/harm or benefit/cost ratio in the real world.
The second risk, designated as "producer risk," is to the sponsor. Because of heterogeneity in the study sample and inferior concordance with the therapy's MoA, the ratio of efficacy signal to background noise may be greatly diminished, thereby increasing the risk of trial failure. Phase II success rates are lower than at any other phase of development, as evidenced by the decline in success rates for new development projects from 28% in 2006-2007 to 18% in 2008-2009 [1]. An estimate of the likelihood of a drug successfully progressing through Phase III to launch is 50% [2]. Such high attrition rates in late-stage drug development result in large financial costs to industry from both lost revenue and the missed opportunity cost of not pursuing alternate drug candidates or targets.
Physicians, patients, and increasingly payers, who control reimbursement decisions, prefer clinical "effectiveness" over "efficacy" (see reference [3] for a regulatory perspective). Clinical effectiveness measures how well a treatment works in patients under real-world conditions. Ideally, quantification of effectiveness should include the proportion of responders and not just the average response of a group. Demonstration of effectiveness will generally demonstrate efficacy, but the reverse may not hold true. A study population must exhibit sufficient homogeneity of response to a treatment to demonstrate effectiveness. Thus, a screening diagnostic (DX) with an optimum threshold for accurate patient classification may become necessary for assuring higher homogeneity. This paper focuses on key interrelated statistical and clinical considerations for aiding decisions on when to adopt a DX-enriched study design. Important among these are the anticipated effect size, prevalence of phenotype, diagnostic test accuracy, study power, and the desired clinically meaningful difference. Simulations of a hypothetical example of a cholesterol-lowering drug are used to demonstrate how well-planned computer simulations can aid in deciding when to adopt DX-enriched study designs.
Diagnostic accuracy and probability
Diagnostic tests are undertaken to determine the presence or absence of a phenotype, and this is carried out by making a decision that the condition is or is not present based upon test results [4,5]. In medical decision-making dependent on diagnostic test results, conditional probabilities are conditioned on the outcome rather than the "unknown" truth. Such probabilities are the "inverse probabilities," also known as "Bayesian" probabilities. It is important to understand the direction of a conditional probability, i.e., whether the direction is from truth to outcome or in the reverse direction [4]. Confusion about the directionality of the conditional probability can misinform an understanding of probabilities that affect clinical decision-making. For the purpose of planning and designing a DX-enrichment study, positive predictive value (PPV) is the most important diagnostic measure of accuracy because it directly impacts the probability of trial success (based on statistical significance) as well as the likelihood of demonstrating a pre-specified minimal clinically meaningful difference. The predictive values of a DX can be estimated from test sensitivity (SN), specificity (SP), and pre-test or prior probability (PP) using the following equations [6]:
PPV = (SN x PP) / [SN x PP + (1 - SP) x (1 - PP)] .............(1)
NPV = [SP x (1 - PP)] / [SP x (1 - PP) + (1 - SN) x PP] ...........(2)
Because the inverse probabilities are calculated from the truth-conditional probabilities and the PP, it is important to conduct sensitivity analyses using a plausible range of pre-test probabilities when calculating PPV. If a plausible range of pre-test probabilities is unknown, but the SN and SP of a DX test are known with sufficient confidence, the number of test +ves and test −ves from a pilot study or retrospective data can be used with a flat Beta prior in a simulated Markov process with a continuous state space to estimate PP and Bayesian confidence intervals (see reference [7] for an example with R programming code). The PP can also be approximated using the assumed SN and SP of the diagnostic test and the observed proportion of DX+ve (PDXP) in a pilot or retrospective study using the following equation:
PP = (PDXP + SP - 1) / (SN + SP - 1) ...............(3)
Based on the normal approximation to the binomial distribution, the traditional 95% confidence interval for PDXP can be calculated and substituted into Equation 3 to obtain a 95% confidence interval for PP. However, as cautioned in [7], absurd estimates of prevalence can sometimes result from Equation 3. The Bayesian method is robust against such absurd estimates and may be preferable, especially considering the ready availability of open-access statistical software such as R. Representative values from inside the confidence interval can subsequently be used for sensitivity analyses in downstream simulations of PPV, and to examine the effects of DX accuracy on the clinically meaningful difference.
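Equations 1-3 can be sketched directly in code. The following minimal Python sketch (the paper's own analyses used R and SYSTAT; the SN, SP, and PDXP values below are purely illustrative) computes the predictive values and the back-calculated pre-test probability over a plausible range, as suggested for sensitivity analyses:

```python
# Illustrative sketch of Equations 1-3; all numeric inputs are assumptions.

def ppv(sn, sp, pp):
    """Equation 1: positive predictive value."""
    return sn * pp / (sn * pp + (1 - sp) * (1 - pp))

def npv(sn, sp, pp):
    """Equation 2: negative predictive value."""
    return sp * (1 - pp) / (sp * (1 - pp) + (1 - sn) * pp)

def pp_from_pdxp(pdxp, sn, sp):
    """Equation 3: pre-test probability from the observed proportion DX+ve."""
    return (pdxp + sp - 1) / (sn + sp - 1)

# Sensitivity analysis over a plausible range of pre-test probabilities,
# for a hypothetical test with SN = 0.9 and SP = 0.8.
for pp_ in (0.3, 0.5, 0.7):
    print(f"PP={pp_:.1f}  PPV={ppv(0.9, 0.8, pp_):.3f}  NPV={npv(0.9, 0.8, pp_):.3f}")
```

As a consistency check, feeding the implied PDXP back through Equation 3 recovers the pre-test probability: with SN = 0.9, SP = 0.8, and PP = 0.5, PDXP = 0.55 and Equation 3 returns 0.5.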
Illustrative example
A hypothetical example of an investigational cholesterol-lowering drug that selectively benefits a targeted subgroup of patients at risk of heart disease is utilized. The example is drawn from an actual Dutch study of cholesterol-lowering therapy as described in [8]. In the Dutch study, familial hypercholesterolemia (FH) was diagnosed through genetic cascade screening, and the study patients were treated with a cholesterol-lowering drug. After analysis of the study data, it was observed that mean low-density lipoprotein cholesterol (LDL-C) decreased to 124 (± 43) mg/dL, which was statistically significant. However, only 22% of study subjects achieved the LDL-C target level of ≤97 mg/dL recommended in Dutch guidelines. Although questions have been raised about the effectiveness of genetic testing for FH [9,10], it is assumed in this example that a novel predictive DX has been developed for selecting likely responders to the investigational lipid-lowering treatment. The DX will be utilized to enrich the clinical trial population to demonstrate both clinical "effectiveness" and "efficacy" of the new investigational therapy. In this paper, mock simulated studies will be utilized with 10% above the Dutch recommended guideline of ≤97 mg/dL of LDL-C (i.e., ≤107 mg/dL) as the reduced post-treatment target for demonstrating clinical effectiveness. This target level is on the low side of the range for the "near ideal" category (100-129 mg/dL) published by the Mayo Clinic [11]. Analyses will incorporate simulation and application of formal statistical tests. First, relationships will be examined between various measures of diagnostic accuracy (SN, SP, PPV, NPV, and overall accuracy) (Table 1). Subsequently, relationships among predictive values of DX tests, effect size, study power, and their contributions to go/no-go decisions will be examined (Figures 1 and 2).
The goal is not only to achieve statistical significance, but also to attain cholesterol reduction to a hypothetical mean target threshold of ≤107 mg/dL LDL-C in the treatment arm. Thus, the interest is in assessing what level of PPV will be necessary to achieve this target, and what the associated study power for demonstrating statistical significance will be, to help decide if a DX-enriched clinical trial should be undertaken.
Table 1: Relationship of different measures of diagnostic accuracy with pre-diagnostic-test probability (values are means and standard deviations obtained from simulations).
Figure 1: Relationship among study power, PPV, and effect size [n=82, power=80%, α=0.05, and reference effect size=0.62 (corresponding to treatment mean=97 mg/dL, SoC mean=124 mg/dL, pooled standard deviation=43 mg/dL)].
Figure 2: Relationship of clinical effectiveness, different effect sizes (each with correctly specified sample size), and PPV.
Simulation
Key features of the simulations and the statistical methods/tests used in this study are shown inside the text box. Simulation was carried out to generate artificial data representing prevalence of phenotype in 10% increments. SYSTAT (v. 11.0) and the open-access statistical software R (v. 2.10.1) were used for data analyses.

Simulation results and observed relationships between the different measures of diagnostic accuracy and pre-diagnostic probabilities are summarized in Table 1. The shaded cells with bold numbers in the table indicate thresholds of PPVs below which choosing a DX-enriched study design will not make sense because the posterior probabilities offer no advantage over the pre-test probabilities.
Table 1 shows that higher overall accuracy of a DX is needed at lower prevalence in order to cross the potential minimal utility threshold of PPV (a PPV of 50% is equivalent to tossing an unbiased coin when the pre-test probability is 50%). Thus, for investment in a DX-enriched study design, an acceptable target PPV will need to be predetermined, taking into consideration the assumed pre-test probability and the sponsor's willingness to risk trial failure. As an example, a PPV threshold of 70% may constitute such an acceptable target. In order to cross a 70% PPV threshold, overall DX accuracies of 80, 80, 70, 60, and 50% are required for prevalences of 30, 40, 50, 60, and 70%, respectively (Table 1).
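The threshold-crossing logic behind Table 1 can be approximated with a simplified closed-form sketch. Assuming, for simplicity, that SN = SP = overall accuracy (Table 1 itself was built by simulation with SN and SP varying separately, so its entries need not match this grid exactly), the minimum accuracy needed to push the PPV past 70% at each prevalence can be found as follows:

```python
# Simplified sketch of the Table 1 threshold search; assumes SN = SP.

def ppv(sn, sp, pp):
    """Equation 1: positive predictive value."""
    return sn * pp / (sn * pp + (1 - sp) * (1 - pp))

def min_accuracy_for_ppv(prevalence, target_ppv=0.70):
    """Smallest SN = SP = accuracy (on a 1% grid) giving PPV >= target_ppv."""
    for pct in range(50, 101):
        acc = pct / 100
        # small tolerance guards against floating-point round-off at the boundary
        if ppv(acc, acc, prevalence) >= target_ppv - 1e-9:
            return acc
    return None  # threshold unreachable even with 100% accuracy

for prev in (0.3, 0.4, 0.5, 0.6, 0.7):
    print(f"prevalence = {prev:.0%}: overall accuracy >= {min_accuracy_for_ppv(prev):.2f}")
```

Consistent with the text, the required accuracy falls as prevalence rises: at 50% prevalence the PPV simply equals the accuracy (so 70% accuracy suffices), while at 70% prevalence an accuracy of only 50% already crosses the threshold.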
Figure 1 demonstrates relationships among study power, PPV, and different effect sizes for a fixed sample size (n=82). As effect size decreases from the reference effect size, the study power obtained from gains in diagnostic accuracy (PPV) declines rapidly, exhibiting progressively lower relative impact of PPV on the study power (bottom three lines in Figure 1). However, for an underestimated or correctly specified effect size, there is rapid gain in study power as PPV increases (top two lines in Figure 1). Thus, as expected, a study's power is affected by both the sample size and the PPV of a DX.
However, having sufficient study power through correct specification of sample size does not guarantee demonstration of clinical effectiveness. This point was previously emphasized in reference [12], in which the authors caution and illustrate that even large changes in statistical significance levels can correspond to small, non-significant changes in the underlying quantities of practical interest. Once a sample size has been adequately specified for an unknown true effect size (i.e., the influence of variance has been sufficiently accounted for) to achieve the desired study power, clinical effectiveness will then depend upon the PPV of the DX test, as shown by the steep negative slope of the superimposed lines representing the different effect sizes in Figure 2 (the lines for the different effect sizes lie on top of one another because sample sizes have been adjusted for differences in variances). Thus, a PPV of 70% (0.7 on the X-axis) would ensure that a DX+ve study should result in an expected mean LDL-C level of 105 mg/dL (which satisfies the clinical effectiveness target of less than 107 mg/dL on the Y-axis), irrespective of the effect size. Note that even though the clinical effectiveness target is expected to be met, the study power would only be ~60% (Figure 1, for the reference effect size of 0.62), not 80%. This difference in study power arises because the sample size for 80% power was calculated assuming a PPV of 100%, not 70%. A PPV of 20% in Figure 2 would be expected to reduce the LDL-C level to an expected mean of only 119 mg/dL, a number well above the desired target LDL-C level set for demonstrating clinical effectiveness, even when statistical significance is attained. Of course, a PPV of 20% will likely fail to attain statistical significance because of low study power, but a consumer risk exists nevertheless.
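The expected treatment-arm means quoted above (105 mg/dL at a PPV of 70%, 119 mg/dL at 20%) follow from a simple two-component mixture model, sketched below in Python. This is an assumed reconstruction of the logic behind Figure 2, not the study's actual simulation code: with imperfect screening, the treatment arm mixes true responders (mean 97 mg/dL) with misclassified non-responders (mean 124 mg/dL, the SoC mean), weighted by the PPV.

```python
# Assumed mixture model for the enriched treatment arm (Figure 2 logic).

RESPONDER_MEAN = 97.0    # mg/dL, treatment mean among true responders
SOC_MEAN = 124.0         # mg/dL, standard-of-care mean (non-responders)

def expected_treatment_mean(ppv):
    """Expected LDL-C mean in a DX+ve-enriched treatment arm."""
    return ppv * RESPONDER_MEAN + (1 - ppv) * SOC_MEAN

for p in (0.2, 0.5, 0.7, 1.0):
    m = expected_treatment_mean(p)
    verdict = "meets" if m <= 107 else "misses"
    print(f"PPV={p:.1f}: expected mean = {m:.1f} mg/dL ({verdict} the <=107 target)")
```

Under this model the expected mean falls linearly from 124 mg/dL at PPV = 0 to 97 mg/dL at PPV = 1, which is why the effectiveness target of ≤107 mg/dL is crossed at roughly PPV = 0.7.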
As development costs escalate, nearing $2 billion for each marketed drug, the success rate has declined from ~12% to ~7% [13]. "Fail fast, fail cheap," "shots on goal," the use of biomarkers, and changing governance and organizational models have been some of the strategies adopted by industry since 2001 [14]. As the science of individualized medicine matures, empiricism and the probabilistic underpinnings of medical practice are increasingly replaced by specific targeted diagnosis and treatments with mechanism-based deterministic precision [13]. Drug developers must now provide evidence of differentiation and clinical value in order to convince major payers to offer reimbursement for new medicines at a fair price [15,16]. Both public and private payers in rich and emerging economies are becoming increasingly interested in using evidence to inform healthcare resource allocation decisions and for preferential coverage in health plans [17,18]. One way to achieve and demonstrate such evidence is through practical enrichment, i.e., seeking to reduce noise (variability of measurement) and minimizing heterogeneity of patients in clinical trials [19]. A sufficiently accurate screening diagnostic, as illustrated in this paper, can be an invaluable tool for such a purpose. Ideally, relevant diagnostics should be evaluated during drug development and be available for use in efficacy trials [20]. So what should such an evaluation encompass? Sensitivity and specificity of a diagnostic test estimate the probability of a positive or negative test result when the gold standard (truth-surrogate outcome) is known. These two commonly reported diagnostic measures are useful in selecting a test from among different competing tests. Positive and negative predictive values, on the other hand, measure the probability of making a correct choice from a test result.
These latter two diagnostic measures are useful in clinical decision-making and in patient screening for enrollment in clinical trials. Sensitivity gives the probability of a +ve DX test, given that the patient has the disease or phenotype of interest. With a test result in hand, however, the clinician wants to know the probability that the patient has the specified disease or phenotype given a +ve or −ve DX test. As predictive values are a function of both the sensitivity and specificity of the DX test and the pre-test probability, clinical decision-making or patient selection for clinical trials should pay significant attention to ensuring that the pre-test probabilities are not seriously over- or underestimated. Test sensitivities and specificities can only be correctly interpreted for decision-making when the unknown true pre-test probabilities have been estimated reasonably accurately. At low pre-test probabilities, diagnostic utility is often limited by inadequate accuracy of tests. As the pre-diagnostic probability decreases further below 50%, the DX test will need to be increasingly more accurate in sensitivity while possessing increasingly higher specificity. At high pre-test probabilities, diagnostic tests may not be necessary for designing clinical trials because the study population will already be relatively homogeneous. Any marginal gain in enrichment from using a diagnostic in such a situation may be inefficient from a time and cost perspective. Honig [21] highlights the lack of importance given to disease/trait prevalence in determining the predictive value, clinical utility, cost-effectiveness, and generalizability of screening, testing, or enrichment in published papers that discuss diagnostic measures, particularly those overtly emphasizing only sensitivity and specificity. Diagnostic tests will have the highest utility in clinical trial designs when pre-test probabilities are in the mid-range [21,22].
Thus, if the success of a clinical trial depends on sufficient enrichment, then both diagnostic accuracy and the pre-test probability of a disease or phenotype are important criteria. If the true prevalence of a disease or genetic trait of interest is low, the PPV of the test/screen will also be correspondingly low and, even with high sensitivity and specificity, the majority of positive tests will tend to be falsely positive [21,22]. A sponsor will want to invest in a clinical trial only if there is sufficient confidence in the trial's probability of success. Although the pre-diagnostic probability can be biased toward trial success by altering a study's inclusion/exclusion criteria, such a manipulation often comes with a trade-off cost in a higher rate of false negatives (i.e., a higher screen failure rate and correspondingly smaller potential market size). Therefore, from a purely DX-enrichment viewpoint, positive predictive value must be recognized as the "primary" diagnostic measure of interest because it directly impacts the probability of success of a clinical trial. Effect size affects the power of a study if a study's signal/noise (S/N) ratio differs significantly from the initially assumed S/N used in calculating the sample size. Clinical trials designed with prognostic or predictive biomarkers as screening diagnostics can greatly increase the efficiency of trials because enrichment positively affects the S/N ratio and, consequently, leads to smaller sample size requirements for demonstrating both clinical efficacy and effectiveness. According to a recently published guidance document of the Food and Drug Administration (FDA) on clinical trial enrichment strategies, "the strategy can be particularly useful for early effectiveness studies because it can provide clinical proof of concept and contribute to selection of appropriate doses for later studies" [23].
The Agency further states, "The decision to use an enrichment design is largely left to the sponsor of the investigation, but like the entire research and clinical communities, FDA is very interested in targeting treatments to the people who can benefit from them (i.e., individualization)."
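The sample-size benefit of enrichment described above can be illustrated with a standard two-sample normal-approximation formula. This sketch assumes, as a simplification, that the effect size scales linearly with PPV (per the mixture model of responders and non-responders) and that the pooled standard deviation is unchanged; it is an illustration of the S/N argument, not a calculation from the paper.

```python
import math
from statistics import NormalDist

def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sample comparison (normal approximation)."""
    z = NormalDist().inv_cdf
    za, zb = z(1 - alpha / 2), z(power)
    return math.ceil(2 * (za + zb) ** 2 / effect_size ** 2)

d = 0.62  # reference effect size: (124 - 97) mg/dL over a pooled SD of 43 mg/dL
for ppv in (0.5, 0.7, 0.9, 1.0):
    # assumed dilution: effective effect size = PPV x d
    print(f"PPV={ppv:.1f}: ~{n_per_arm(ppv * d)} subjects per arm")
```

At PPV = 1 this reproduces the paper's design of 41 per arm (n=82 total) for 80% power at the reference effect size of 0.62; as the PPV of the enrolled population falls, the required sample grows rapidly, which is the quantitative core of the enrichment argument.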
In conclusion, the following six-step checklist is recommended as a generic evidence-based guideline to aid decision-making on whether or not to adopt a DX-enriched clinical study design: (1) be reasonably confident that the assumption of prevalence is not erroneous, (2) ensure that the DX yields a sufficiently high PPV, (3) ensure that the expected PPV will likely result in a pre-specified minimum clinically meaningful average value for the study's primary endpoint, (4) seek assurance that the assumption of effect size is not faulty, in order to ensure sufficient power from the planned study sample size, (5) conduct simulations to assess the effects of plausible under- or overestimations of the critical assumptions (i.e., conduct sensitivity analyses), and lastly (6) make go/no-go decisions based upon careful evaluation of assumptions and synthesis of simulation results from all preceding five steps, in conjunction with the risk tolerance for clinical trial failure deemed acceptable by the sponsor. As only statistical significance is affected by a study's sample size, it is especially important to pre-specify and satisfy a minimum clinically meaningful average treatment difference in both DX-enriched and RCT clinical studies. Adherence to the above guidelines will increase the likelihood of not only successful demonstration of efficacy, safety, and benefit-risk management to obtain regulatory approval, but also the achievement of what Honig [24] has characterized in a recent editorial as the "fourth hurdle" to successful commercialization of biopharmaceutical products: increased likelihood of reimbursement.
The author declares that he has no competing interests.
Received: 20-Sep-2013 Revised: 01-Nov-2013
Re-Revised: 06-Nov-2013 Accepted: 13-Nov-2013
Published: 21-Nov-2013
Copyright © 2015 Herbert Publications Limited. All rights reserved.