Interrelationships among enrichment with diagnostics, probability of success and demonstration of effectiveness in clinical trials

Background: The prevalence of a disease phenotype in clinical practice is often not given adequate importance during formulation, validation, and implementation of diagnostic tests in clinical research and development. After promising biomarkers have been identified as potential screening diagnostics, an important strategic question for optimal decision-making in clinical development of a therapeutic is when to choose an enrichment study design over the traditional all-comer randomized controlled trial design. Methods: A hypothetical example of a cholesterol-lowering treatment is used to illustrate the influence of key statistical criteria and clinical considerations on the choice of study design. Computer simulations demonstrate how the results of such analyses can aid in deciding whether or not to choose an enrichment study design. Results: This study shows how understanding of disease prevalence in practice, predictive values of the diagnostic test, and pre-specified establishment of a clinically meaningful minimum effectiveness all need to be integrated to ensure clinical trial success and appropriate benefit to targeted patient subgroups. The most important statistical and clinical considerations were the anticipated effect size, phenotype prevalence, predictive values of the diagnostic test, study power, and desired clinically meaningful difference. Conclusions: This study illustrates how successful clinical studies can be designed with careful planning and use of computer simulations, not only to increase the probability of trial success but also to give payers convincing evidence of clinical effectiveness. A six-step checklist is recommended as an evidence-based guideline to assist in decision-making on whether or not to adopt a diagnostics-enriched clinical study design.


Introduction
Many clinical trials fail because a "sufficiently appropriate" group of patients is not studied. Often, inclusion/exclusion criteria in conventional randomized clinical trials (RCTs) do not optimally match patients to an investigational therapy's specific mechanism of action (MoA). Thus, assessment of "efficacy," a measure of a therapy's "average" benefit to patients compared between standard-of-care (SoC) and investigational treatment arms, poses two types of risks.
The first risk, designated as "consumer risk," is to individual patients. Because of heterogeneity in the clinical trial population, large overall efficacy signals may be produced by only a small subset of patients matching the therapy's MoA. Thus, although a large proportion of study subjects may not be responding to an investigational therapy, the clinical trial may nevertheless show statistical significance and result in regulatory approval for marketing. In this scenario, there are potential ethical issues in the study itself because many participants with a very low probability of benefit from the investigational therapy will have been unnecessarily exposed to potential harm. As the inclusion/exclusion criteria from pivotal trials determine product labels, many patients may later be prescribed the newly approved therapy, causing a large distortion in the benefit/harm or benefit/cost ratio in the real world.
The second risk, designated as "producer risk," is to the sponsor. Because of heterogeneity in the study sample and inferior concordance with the therapy's MoA, the ratio of efficacy signal to background noise may be greatly diminished, thereby increasing the risk of trial failure. Phase II success rates are lower than at any other phase of development, as evidenced by the decline in success rates for new development projects from 28% in 2006-2007 to 18% in 2008-2009 [1]. An estimate of the likelihood of a drug successfully progressing through Phase III to launch is 50% [2]. Such high attrition rates in late-stage drug development result in large financial costs to industry from both lost revenue and the missed opportunity cost of not pursuing alternate drug candidates or targets.
Physicians, patients, and increasingly payers, who control reimbursement decisions, prefer clinical "effectiveness" over "efficacy" (see reference [3] for a regulatory perspective). Clinical effectiveness measures how well a treatment works in patients in real-world conditions. Ideally, quantification of effectiveness should include the proportion of responders and not just the average response of a group. Demonstration of effectiveness will generally demonstrate efficacy, but the reverse may not hold true. A study population must exhibit sufficient homogeneity of response to a treatment to demonstrate effectiveness. Thus, a screening diagnostic (DX) with an optimum threshold for accurate patient classification may become necessary for assuring higher homogeneity. This paper focuses on key interrelated statistical and clinical considerations for aiding decisions on when to adopt a DX-enriched study design. Important among these are the anticipated effect size, prevalence of phenotype, diagnostic test accuracy, study power, and clinically meaningful desired difference. Simulations of a hypothetical example of a cholesterol-lowering drug are used to demonstrate how well-planned computer simulations can aid in deciding when to adopt DX-enriched study designs.

Diagnostic accuracy and probability
Diagnostic tests are undertaken to determine the presence or absence of a phenotype, and this is carried out by making a decision that the condition is or is not present based upon test results [4,5]. In medical decision-making dependent on diagnostic test results, conditional probabilities are conditioned on the outcome rather than the "unknown" truth. Such probabilities are the "inverse probabilities," also known as "Bayesian" probabilities. It is important to understand the direction of a conditional probability, i.e., whether it runs from truth to outcome or in the reverse direction [4]. Confusion about the directionality of the conditional probability can distort the understanding of probabilities that affect clinical decision-making. For the purpose of planning and designing a DX-enrichment study, positive predictive value (PPV) is the most important diagnostic measure of accuracy because it directly impacts the probability of trial success (based on statistical significance) as well as the likelihood of demonstrating a pre-specified minimal clinically meaningful difference. The predictive values of a DX can be estimated from test sensitivity (SN), specificity (SP), and pre-test or prior probability (PP) using the following equations [6]:

$$\mathrm{PPV} = \frac{\mathrm{SN} \times \mathrm{PP}}{\mathrm{SN} \times \mathrm{PP} + (1 - \mathrm{SP}) \times (1 - \mathrm{PP})} \qquad (1)$$

$$\mathrm{NPV} = \frac{\mathrm{SP} \times (1 - \mathrm{PP})}{\mathrm{SP} \times (1 - \mathrm{PP}) + (1 - \mathrm{SN}) \times \mathrm{PP}} \qquad (2)$$

Because the inverse probabilities are calculated from the truth-conditional probabilities and the PP, it is important to conduct sensitivity analyses using a plausible range of pre-test probabilities when calculating PPV. If a plausible range of pre-test probabilities is unknown, but the SN and SP of a DX test are known with sufficient confidence, the numbers of test +ves and test -ves from a pilot study or retrospective data can be used with a flat Beta prior in a simulated Markov process with a continuous state space to estimate PP and Bayesian confidence intervals (see reference [7] for an example with R programming code).
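The dependence of the predictive values on SN, SP, and PP follows from Bayes' theorem and can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the paper's simulations; the function name `predictive_values` and the test characteristics SN = SP = 0.8 are hypothetical choices made only to show a sensitivity analysis over a range of pre-test probabilities.

```python
def predictive_values(sn, sp, pp):
    """Return (PPV, NPV) from sensitivity, specificity, and pre-test probability.

    Applies Bayes' theorem to the four joint probabilities of
    (true status, test result).
    """
    for name, value in (("sn", sn), ("sp", sp), ("pp", pp)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    true_pos = sn * pp                   # phenotype present, test positive
    false_pos = (1.0 - sp) * (1.0 - pp)  # phenotype absent, test positive
    true_neg = sp * (1.0 - pp)           # phenotype absent, test negative
    false_neg = (1.0 - sn) * pp          # phenotype present, test negative

    pos = true_pos + false_pos
    neg = true_neg + false_neg
    ppv = true_pos / pos if pos > 0 else float("nan")
    npv = true_neg / neg if neg > 0 else float("nan")
    return ppv, npv


if __name__ == "__main__":
    # Hypothetical test with SN = SP = 0.8, swept over pre-test probabilities.
    for pp in (0.1, 0.3, 0.5, 0.7):
        ppv, npv = predictive_values(sn=0.8, sp=0.8, pp=pp)
        print(f"PP={pp:.1f}  PPV={ppv:.2f}  NPV={npv:.2f}")
```

Sweeping PP in this way shows directly how the same test yields very different PPVs at different prevalences, which is why the range of pre-test probabilities should be examined before fixing a target PPV.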
The PP can also be approximated using the assumed SN and SP of the diagnostic test and the observed proportion of DX +ve results (PDXP) in a pilot or retrospective study using the following equation:

$$\mathrm{PP} = \frac{\mathrm{PDXP} + \mathrm{SP} - 1}{\mathrm{SN} + \mathrm{SP} - 1} \qquad (3)$$

Based on the normal approximation to the binomial distribution, the traditional 95% confidence interval for PDXP can be calculated and its limits substituted into Equation 3 to obtain a 95% confidence interval for PP. However, as cautioned in [7], absurd estimates of prevalence (for example, values below 0 or above 1) can sometimes result when using Equation 3. The Bayesian method is robust against such absurd estimates and may be preferable, especially considering the ready availability of open-access statistical software such as R. Representative values from inside the confidence interval can subsequently be used for sensitivity analyses in downstream simulations of PPV, and to examine the effects of DX accuracy on the clinically meaningful difference.
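The approximation can be sketched as follows. The sketch assumes that Equation 3 is the usual correction of an observed positive proportion for imperfect SN and SP, PP = (PDXP + SP - 1) / (SN + SP - 1), and it applies the normal-approximation interval for PDXP described above. The function name and the pilot-study counts are hypothetical.

```python
from statistics import NormalDist


def estimate_pretest_probability(n_pos, n_total, sn, sp, level=0.95):
    """Return (point, lower, upper) estimates of the pre-test probability.

    The observed proportion of DX+ results is corrected for the assumed
    SN and SP; the interval limits for that proportion come from the
    normal approximation to the binomial and are passed through the same
    correction. Results are deliberately NOT clipped to [0, 1], so that
    implausible estimates remain visible.
    """
    if sn + sp <= 1.0:
        raise ValueError("SN + SP must exceed 1 for the correction to be usable")
    if n_total <= 0 or not 0 <= n_pos <= n_total:
        raise ValueError("need 0 <= n_pos <= n_total and n_total > 0")

    p_obs = n_pos / n_total
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * (p_obs * (1.0 - p_obs) / n_total) ** 0.5

    def corrected(p):
        return (p + sp - 1.0) / (sn + sp - 1.0)

    return corrected(p_obs), corrected(p_obs - half_width), corrected(p_obs + half_width)


if __name__ == "__main__":
    # Hypothetical pilot data: 40 DX+ out of 100, with assumed SN=0.9, SP=0.8.
    print(estimate_pretest_probability(40, 100, sn=0.9, sp=0.8))
    # Only 10 DX+ out of 100 is fewer than the false-positive rate alone
    # would predict, so the point estimate becomes negative.
    print(estimate_pretest_probability(10, 100, sn=0.9, sp=0.8))
```

The second call illustrates the caution raised in [7]: when the observed positive proportion falls below 1 - SP, the corrected estimate falls outside the valid range, which is one reason a Bayesian estimate may be preferred.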

Illustrative example
A hypothetical example of an investigational cholesterol-lowering drug that selectively benefits a targeted sub-group of patients at risk of heart disease is utilized. The example is drawn from an actual Dutch study of cholesterol-lowering therapy as described in [8]. In the Dutch study, familial hypercholesterolemia (FH) was diagnosed through genetic cascade screening, and the study patients were treated with a cholesterol-lowering drug. After analysis of the study data, it was observed that mean low-density lipoprotein cholesterol (LDL-C) decreased to 124 (± 43) mg/dL, which was statistically significant. However, only 22% of study subjects achieved the LDL-C target level of ≤97 mg/dL recommended in Dutch guidelines. Although questions have been raised about the effectiveness of genetic testing for FH [9,10], it is assumed in this example that a novel predictive DX has been developed for selecting likely responders to the investigational lipid-lowering treatment. The DX will be utilized to enrich the clinical trial population to demonstrate both clinical "effectiveness" and "efficacy" of the new investigational therapy. In this paper, mock simulated studies will be utilized with a post-treatment target of 10% above the Dutch recommended guideline of ≤97 mg/dL of LDL-C (i.e., ≤107 mg/dL) for demonstrating clinical effectiveness. This target level is on the low side of the range for the "near ideal" category (100-129 mg/dL) published by the Mayo Clinic [11]. Analyses will incorporate simulation and application of formal statistical tests. First, relationships will be examined between various measures of diagnostic accuracy (SN, SP, PPV, NPV, and overall accuracy) (Table 1). Subsequently, relationships among predictive values of DX tests, effect size, study power, and their contributions to go/no go decisions will be examined (Figures 1 and 2). The goal is not only to achieve statistical significance, but also to attain [...] used for data analyses.

Results
Simulation results and observed relationships between the different measures of diagnostic accuracy and pre-diagnostic probabilities are summarized in Table 1. The shaded cells with bold numbers in the table indicate thresholds of PPV below which choosing a DX-enriched study design will not make sense because the posterior probabilities offer no advantage over the pre-test probabilities.
Table 1 shows that higher overall accuracy of a DX is needed at lower prevalence in order to cross the potential minimal utility threshold of PPV (a PPV of 50% is equivalent to tossing an unbiased coin when the pre-test probability is 50%). Thus, for investment in a DX-enriched study design, an acceptable target PPV will need to be predetermined, taking into consideration the assumed pre-test probability and the sponsor's willingness to risk trial failure. As an example, a PPV threshold of 70% may constitute such an acceptable target. In order to cross a 70% PPV threshold, overall DX accuracies of 80, 80, 70, 60, and 50% are required for prevalences of 30, 40, 50, 60, and 70%, respectively (Table 1). Figure 1 demonstrates relationships among study power, PPV, and different effect sizes for a fixed sample size (n=82). As effect size decreases from the reference effect size, the study power obtained from gains in diagnostic accuracy (PPV) declines rapidly, exhibiting progressively lower relative impact of PPV on the study power (bottom three lines in Figure 1). However, for an underestimated or a correctly specified effect size, there is rapid gain in study power as PPV increases (top two lines in Figure 1). Thus, as per expectation, a study's power is affected by both the sample size and the PPV of a DX.
However, having sufficient study power by correct specification of sample size does not guarantee demonstration of clinical effectiveness. This point was previously emphasized in reference [12], in which the authors caution and illustrate that even large changes in statistical significance levels can correspond to small, non-significant changes in the underlying quantities of practical interest. Once a sample size has been adequately specified for an unknown true effect size (i.e., the influence of variance has been sufficiently accounted for) to achieve the desired study power, clinical effectiveness will then depend upon the PPV of a DX test, as shown by the steep negative slope of the superimposed lines representing the different effect sizes in Figure 2 (the lines for the different effect sizes lie on top of one another because sample sizes have been adjusted for differences in variances). Thus, a PPV of 70% (0.7 on the X-axis) would ensure that a DX +ve study should result in an expected mean LDL-C level of 105 mg/dL (which satisfies the clinical effectiveness target of less than 107 mg/dL on the Y-axis), irrespective of the effect size. Note that even though the clinical effectiveness target is expected to be met, the study power would only be ~60% (Figure 1, for the reference effect size of 0.62), not 80%. This difference in study power arises because the sample size for 80% power was calculated assuming a PPV of 100%, not 70%. A PPV of 20% in Figure 2 would be expected to reduce the LDL-C level only to an expected mean of 119 mg/dL, a number well above the desired target LDL-C level set for demonstrating clinical effectiveness, even when statistical significance is attained. Of course, a PPV of 20% will likely fail to attain statistical significance because of low study power, but a consumer risk does exist nevertheless.
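The qualitative behaviour described for Figure 1 can be illustrated with a deliberately simplified dilution model. The sketch below is not the simulation used to produce Figure 1, and its values at intermediate PPVs need not match the figure. It assumes (i) that n=82 is the total sample size split equally between two arms, (ii) that only the fraction PPV of enrolled DX +ve patients respond while the rest show no treatment effect, so the standardized effect size is multiplied by PPV, and (iii) a normal approximation to a two-sided two-sample test at alpha = 0.05. None of these assumptions is stated in the source; the function name `approx_power` is hypothetical.

```python
from statistics import NormalDist


def approx_power(effect_size, ppv, n_total=82, alpha=0.05):
    """Approximate power of a two-sided, two-arm comparison under dilution.

    effect_size : standardized effect among true responders
    ppv         : fraction of enrolled DX+ patients who are true responders
    n_total     : total sample size, assumed split equally between two arms
    """
    if not 0.0 <= ppv <= 1.0:
        raise ValueError("ppv must lie in [0, 1]")
    std_normal = NormalDist()
    diluted_effect = effect_size * ppv      # assumption: non-responders add no effect
    n_per_arm = n_total / 2.0
    noncentrality = diluted_effect * (n_per_arm / 2.0) ** 0.5
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(noncentrality - z_crit) + std_normal.cdf(-noncentrality - z_crit)


if __name__ == "__main__":
    for effect_size in (0.62, 0.50, 0.40):
        row = [f"{approx_power(effect_size, ppv):.2f}" for ppv in (0.4, 0.6, 0.8, 1.0)]
        print(f"effect size {effect_size:.2f}: power at PPV 0.4/0.6/0.8/1.0 = {row}")
```

Under these assumptions the power at the reference effect size of 0.62 and a PPV of 100% is close to 80%, and it falls as either the PPV or the true effect size decreases, which is the pattern the text describes.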
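The dependence of the expected mean LDL-C on PPV can be sketched as a simple two-component mixture of true responders and non-responders among the DX +ve patients. The responder and non-responder means used below (97 and 124 mg/dL) are assumptions chosen for illustration because they are consistent with the expected means quoted above for PPVs of 70% and 20%; the source does not state the values used in its simulations, and the function names are hypothetical.

```python
def expected_ldl(ppv, mean_responder=97.0, mean_nonresponder=124.0):
    """Expected mean post-treatment LDL-C (mg/dL) in a DX+ enriched arm.

    A fraction `ppv` of enrolled patients are true responders; the rest
    are false positives who remain at the non-responder mean.
    """
    if not 0.0 <= ppv <= 1.0:
        raise ValueError("ppv must lie in [0, 1]")
    return ppv * mean_responder + (1.0 - ppv) * mean_nonresponder


def min_ppv_for_target(target, mean_responder=97.0, mean_nonresponder=124.0):
    """Smallest PPV at which the expected mean LDL-C reaches `target`."""
    return (mean_nonresponder - target) / (mean_nonresponder - mean_responder)


if __name__ == "__main__":
    for ppv in (0.2, 0.5, 0.7, 1.0):
        print(f"PPV={ppv:.1f}  expected mean LDL-C = {expected_ldl(ppv):.1f} mg/dL")
    print(f"Minimum PPV for a 107 mg/dL target: {min_ppv_for_target(107.0):.2f}")
```

Because the mixture mean does not depend on sample size or variance, this also shows why the lines for different effect sizes coincide in such a plot: only the PPV moves the expected mean toward or away from the clinical target.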

Discussion
As drug development costs escalate toward $2 billion for each marketed drug, the success rate has declined from ~12% to ~7% [13]. "Fail fast, fail cheap," "shots on goal," the use of biomarkers, and changing governance and organizational models have been some of the strategies adopted by industry since 2001 [14]. As the science of individualized medicine matures, empiricism and the probabilistic underpinnings of medical practice are increasingly replaced by specific targeted diagnosis and treatments with mechanism-based deterministic precision [13]. Drug developers must now provide evidence of differentiation and clinical value in order to convince major payers to offer reimbursement for new medicines at a fair price [15,16]. Both public and private payers in rich and emerging economies are becoming increasingly interested in using evidence to inform health-care resource allocation decisions and for preferential coverage in health plans [17,18]. One way to achieve and demonstrate such evidence is through practical enrichment, i.e., seeking to reduce noise (variability of measurement) and minimize heterogeneity of patients in clinical trials [19]. A sufficiently accurate screening diagnostic, as illustrated in this paper, can be an invaluable tool for such a purpose. Ideally, relevant diagnostics should be evaluated during drug development and be available for use in efficacy trials [20]. So what should such an evaluation encompass? Sensitivity and specificity of a diagnostic test estimate the probability of a positive or negative test result when the gold standard (truth-surrogate outcome) is known. These two commonly reported diagnostic measures are useful in selecting a test from among different competing tests. Positive and negative predictive values, on the other hand, measure the probability of making a correct choice from a test result.
These latter two diagnostic measures are useful in clinical decision making and in patient screening for enrolment in clinical trials. Sensitivity gives the probability of a +ve DX test, given that the patient has the disease or phenotype of interest. With a test result in hand, however, the clinician wants to know the probability that the patient has the specified disease or phenotype given a +ve or -ve DX test. As predictive values are a function of both the sensitivity and specificity of the DX test and the pre-test probability, clinical decision making or patient selection for clinical trials should pay significant attention to making sure the pre-test probabilities are not seriously over- or under-estimated. Test sensitivities and specificities can only be correctly interpreted for decision-making when the unknown true pre-test probabilities have been estimated reasonably accurately. At low pre-test probabilities, diagnostic utility is often limited by inadequate accuracy of tests. As the pre-diagnostic probability decreases further below 50%, the DX test will need increasingly higher sensitivity and specificity. At high pre-test probabilities, diagnostic tests may not be necessary for designing clinical trials because the study population will already be relatively homogeneous. Any marginal gain in enrichment from using a diagnostic in such a situation may be inefficient from a time and cost perspective. Honig [21] notes the lack of importance given to disease/trait prevalence in determining the predictive value, clinical utility, cost-effectiveness, and generalizability of screening, testing, or enrichment in published papers that discuss diagnostic measures, particularly those that emphasize only sensitivity and specificity. Diagnostic tests will have the highest utility in clinical trial designs when pre-test probabilities are in the mid-range [21,22].
Thus, if the success of a clinical trial depends on sufficient enrichment, then both diagnostic accuracy and pre-test probability of a disease or phenotype are important criteria. If the true prevalence of a disease or genetic trait of interest is low, the PPV of the test/screen also will be correspondingly low and, even with high sensitivity and specificity, the majority of positive tests will tend to be falsely positive [21,22]. A sponsor will want to invest in a clinical trial only if there is sufficient confidence in the trial's probability of success. Although pre-diagnostic probability can be biased toward trial success by altering a study's inclusion/exclusion criteria, such a manipulation often comes with a trade-off cost in higher rate of false negatives (i.e., higher screen failure rate and correspondingly smaller potential market size). Therefore, from a purely DX-enrichment viewpoint, positive predictive value must be recognized as the "primary" diagnostic measure of interest because it directly impacts the probability of success of a clinical trial. Effect size affects the power of a study if a study's signal/noise (S/N) ratio differs significantly from the initially assumed S/N used in calculating the sample size. Clinical trials designed with prognostic or predictive biomarkers as screening diagnostics can greatly increase the efficiency of trials because enrichment positively affects the S/N ratio and, consequently, leads to smaller sample size requirements for demonstrating both clinical efficacy and effectiveness. According to a recently published guidance document of the Food and Drug Administration (FDA) on clinical trial enrichment strategies, "the strategy can be particularly useful for early effectiveness studies because it can provide clinical proof of concept and contribute to selection of appropriate doses for later studies" [23]. 
The Agency further states, "The decision to use an enrichment design is largely left to the sponsor of the investigation, but like the entire research and clinical communities, FDA is very interested in targeting treatments to the people who can benefit from them (i.e., individualization)."
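The trade-off noted earlier in this discussion, that at low prevalence most positive tests tend to be false positives and screen failure rates rise, can be made concrete by estimating the screening burden of a DX-enriched trial. The sketch below is illustrative only: the function name, the prevalence of 1%, and SN = SP = 0.95 are hypothetical, and the enrolment target of 82 simply reuses the fixed sample size from the Results.

```python
import math


def screening_burden(n_enrolled, pp, sn, sp):
    """Estimate the screening effort needed to enrol `n_enrolled` DX+ patients.

    Returns the expected number of patients to screen, the expected number
    of false positives among those enrolled, and the PPV.
    """
    if not all(0.0 <= x <= 1.0 for x in (pp, sn, sp)):
        raise ValueError("pp, sn and sp must lie in [0, 1]")
    p_true_pos = sn * pp
    p_false_pos = (1.0 - sp) * (1.0 - pp)
    p_positive = p_true_pos + p_false_pos   # probability that a screened patient is DX+
    if p_positive == 0.0:
        raise ValueError("the test never returns a positive result")
    ppv = p_true_pos / p_positive
    return {
        "n_to_screen": math.ceil(n_enrolled / p_positive),
        "expected_false_pos_enrolled": n_enrolled * (1.0 - ppv),
        "ppv": ppv,
    }


if __name__ == "__main__":
    # Hypothetical low-prevalence setting with an accurate test.
    low = screening_burden(82, pp=0.01, sn=0.95, sp=0.95)
    # Hypothetical mid-range prevalence with the same test.
    mid = screening_burden(82, pp=0.50, sn=0.95, sp=0.95)
    print("low prevalence:", low)
    print("mid prevalence:", mid)
```

Comparing the two settings shows both effects at once: at low prevalence far more patients must be screened per enrolled patient, and most of those enrolled are still false positives, whereas at mid-range prevalence the same test enrols a largely true-positive population with modest screening effort.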

Conclusion
In conclusion, the following six-step checklist is recommended as a generic evidence-based guideline to aid decision-making on whether or not to adopt a DX-enriched clinical study design: (1) be reasonably confident that the assumption of prevalence is not erroneous, (2) ensure that the DX yields a sufficiently high PPV, (3) ensure that the expected PPV will likely result in a pre-specified minimum clinically meaningful average value for the study's primary endpoint, (4) seek assurance that the assumption of effect size is not faulty in order to ensure sufficient power from the planned study sample size, (5) conduct simulations to assess the effects of plausible under- or over-estimations of the critical assumptions (i.e., conduct sensitivity analyses), and lastly (6) make go/no go decisions based upon careful evaluation of assumptions and synthesis of simulation results from all preceding five steps, in conjunction with the risk tolerance for clinical trial failure deemed acceptable by the sponsor. As only statistical significance is affected by a study's sample size, it is especially important to pre-specify and satisfy a minimum clinically meaningful average treatment difference in both DX-enriched and RCT clinical studies. Adherence to the above guidelines will increase the likelihood of not only successful demonstration of efficacy, safety, and benefit-risk management to obtain regulatory approval, but also the achievement of what Honig [24] has characterized in a recent editorial as the "fourth hurdle" to successful commercialization of biopharmaceutical products: increased likelihood of reimbursement.

doi: 10.7243/2053-7662-1-4