Empirical Size and Power of a Hybrid Statistic for Matched and Unmatched Designs in Interim Analysis of Clinical Trials: A Simulation Study

A hybrid test statistic was proposed in the literature to analyze matched studies with non-normally distributed outcomes. In this article, we investigated the hybrid statistic and compared it with the meta-analysis t-test under commonly used interim analysis settings in clinical trials. We estimated the empirical powers and empirical type I errors from 10,000 simulated datasets with different sample sizes, effect sizes, correlation coefficients for matched pairs, and data distributions, in the interim and final analyses with 4 different group sequential methods. Results from our simulation study show that, compared to the meta-analysis t-test commonly used for normally distributed observations, the hybrid statistic nearly maintains the power for data observed from normally distributed random variables and generally achieves greater power for log-normally and multinomially distributed random variables with matched and unmatched subjects, as well as with outliers. Power rose with increases in sample size, effect size, and correlation coefficient for the matched pairs. In addition, lower type I errors were observed with the hybrid statistic in most of the cases studied, which indicates that this test is also conservative for data with outliers in the interim analysis of clinical trials.


Introduction
In clinical trials, an interim analysis is commonly applied to obtain evidence of a significant effect of the study intervention. It can save patient resources and shorten drug development and approval time if a trial is terminated with high confidence earlier than planned. Group sequential designs are the most commonly used methods for interim analysis.
Pocock (1977) and O'Brien and Fleming (1979) proposed group sequential designs for interim analysis [1,2]. Both designs assume a fixed maximum number of interim analyses (K) that are equally spaced. However, the Pocock design requires a larger maximum sample size, while the O'Brien-Fleming design requires a larger expected sample size if the study continues to the final analysis [3].
Wang and Tsiatis (1987) proposed a family of two-sided tests that provide boundaries of different shapes for different values of a parameter delta [4]. The parameter can be varied to put more emphasis on a low maximum sample size or a low expected sample size in a group sequential analysis. Typically, delta ranges from 0 (O'Brien-Fleming design) to 0.5 (Pocock design), and a delta of 0.25 was selected as the midpoint in our study. Haybittle (1971) and Peto et al. (1976) also suggested a simple form of sequential monitoring in which $H_0$ is rejected at the kth interim analysis ($k < K$) if $|Z_k| \geq 3$ [5,6].
Data collected in clinical trials vary in type. Sometimes both matched and unmatched subjects are included in a trial, and it is important to properly include both in the interim analysis. The meta-analysis t-test with an inverse variance-weighted method is commonly used to combine results of matched and unmatched subjects for normally distributed data [7][8][9][10][11]. In practice, there are many other types of outcome data. For example, in ophthalmology clinical trials, the two eyes of a patient are randomized to the treatment or control group as a matched pair if both eyes are diseased. However, if a patient has one diseased eye and one normal eye, only the diseased eye is randomized in the trial. In addition, a study outcome such as visual acuity is usually tested based on the distances and sizes of letters and recorded as 20/x, where x denotes the distance in feet at which a person with normal vision can read the letters that the patient reads at 20 feet. These measurements are neither normally nor dichotomously distributed. Therefore, this study includes both matched and unmatched observations with values that are neither normally nor dichotomously distributed.
A hybrid statistic was reported by investigators of the Early Treatment for Retinopathy of Prematurity study for data analysis [12,13]. They applied this statistic effectively to combine outcome results from the matched and unmatched portions of their research. In our study, we performed a simulation study of the hybrid statistic to assess its empirical sizes and powers under the setting of interim analysis. We also compared the results of the hybrid statistic with those of the meta-analysis t-test.

Meta-analysis t-test and hybrid statistic
The meta-analysis t-test with the inverse variance-weighted method is usually applied in a two-treatment comparison for normally distributed data with paired and unpaired subjects. Briefly, let $T_{paired}$, $\mathrm{var}(T_{paired})$ and $T_{unpaired}$, $\mathrm{var}(T_{unpaired})$ be the observed mean difference and its variance for paired and unpaired subjects, respectively. The weighted mean difference of the meta-analysis t-test at the kth interim analysis is defined as

$$T_k = \frac{T_{paired}/\mathrm{var}(T_{paired}) + T_{unpaired}/\mathrm{var}(T_{unpaired})}{1/\mathrm{var}(T_{paired}) + 1/\mathrm{var}(T_{unpaired})}, \qquad \mathrm{var}(T_k) = \frac{1}{1/\mathrm{var}(T_{paired}) + 1/\mathrm{var}(T_{unpaired})}.$$

In the above formulation, the statistics for the paired and unpaired data are the paired and two-sample T statistics, respectively, calculated at the kth interim analysis.
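The inverse variance-weighted combination described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `combine_inverse_variance` and its argument names are hypothetical, and the standardized value returned as `z_k` assumes the usual form of dividing the weighted mean difference by the square root of its variance.

```python
import math


def combine_inverse_variance(t_paired, var_paired, t_unpaired, var_unpaired):
    """Combine paired and unpaired mean differences by inverse-variance weighting.

    Hypothetical helper for illustration; returns the weighted mean difference,
    its variance, and the standardized statistic (assumed form: t_k / sqrt(var_k)).
    """
    w_paired = 1.0 / var_paired        # weight of the paired estimate
    w_unpaired = 1.0 / var_unpaired    # weight of the unpaired estimate
    t_k = (w_paired * t_paired + w_unpaired * t_unpaired) / (w_paired + w_unpaired)
    var_k = 1.0 / (w_paired + w_unpaired)
    z_k = t_k / math.sqrt(var_k)
    return t_k, var_k, z_k
```

With equal variances the two estimates receive equal weight, so the combined difference is their simple average and the combined variance is half of either input variance.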
Asymptotically, assuming that $T_k$ has a normal distribution and that the mean difference under the null hypothesis $H_0$ is zero, we have the following normally distributed test statistic evaluated at the kth interim analysis:

$$Z_k = \frac{T_k}{\sqrt{\mathrm{var}(T_k)}}.$$

The hybrid statistic defined by Byun et al. can also be evaluated at the kth interim analysis [13].
The Wang-Tsiatis design introduces a parameter $\Delta$ ranging from 0 to 0.5. It equals the Pocock design when $\Delta = 0.5$ and the O'Brien-Fleming design when $\Delta = 0$. The null hypothesis $H_0$ is rejected and the study is terminated if $|Z_k| \geq C_{WT}(K, \alpha, \Delta) \times (k/K)^{\Delta - 1/2}$, where $C_{WT}(K, \alpha, \Delta)$ is the constant for K groups.
Haybittle and Peto et al. suggested a simple form of sequential monitoring in which the null hypothesis $H_0$ is rejected at an interim analysis ($k < K$) if $|Z_k| \geq 3$ and at the final analysis ($k = K$) if $|Z_k| \geq C_{HP}(K, \alpha)$, where $C_{HP}(K, \alpha)$ is the constant at $k = K$.
To assess the empirical sizes, we compared type I errors using the meta-analysis t-test and the hybrid statistic for trials with 4 interim analyses and a final analysis ($K = 5$). Results from the interim analysis at each group can also serve as a reference for a single-stage design. To examine trends in type I errors across different clinical trial designs, four group sequential methods (the Pocock, O'Brien-Fleming, Wang-Tsiatis, and Haybittle-Peto designs) with their corresponding significance levels were applied in this study. Using a 2-sided test, the null hypothesis $H_0$ was rejected at the kth interim analysis if $|Z_k| \geq C_k$, where $C_k$ equals $C_P(K, \alpha)$ for the Pocock design, $C_B(K, \alpha) \times \sqrt{K/k}$ for the O'Brien-Fleming design, $C_{WT}(K, \alpha, \Delta) \times (k/K)^{\Delta - 1/2}$ for the Wang-Tsiatis design ($\Delta = 0.25$), and 3 (if $k < K$) or $C_{HP}(K, \alpha)$ (if $k = K$) for the Haybittle-Peto design. $C_P(K, \alpha)$, $C_B(K, \alpha)$, $C_{WT}(K, \alpha, \Delta)$, and $C_{HP}(K, \alpha)$ are the constants for K groups of observations and type I error $\alpha$ in each design [3].
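The rejection rule above can be sketched as a small Python function. This is a minimal illustration rather than the code used in the study: the name `critical_value`, the design labels, and the argument names are hypothetical, and the design constant `c` is assumed to be supplied by the caller from published tables such as those in [3].

```python
import math


def critical_value(design, k, K, c, delta=0.25):
    """Return the boundary C_k for |Z_k| at analysis k of K (hypothetical helper).

    `c` is the design constant taken from published tables; `delta` is only
    used by the Wang-Tsiatis family.
    """
    if design == "pocock":
        return c                                 # constant boundary at every look
    if design == "obrien_fleming":
        return c * math.sqrt(K / k)              # wide early, narrowing to c at k = K
    if design == "wang_tsiatis":
        return c * (k / K) ** (delta - 0.5)      # delta = 0 or 0.5 recovers the two above
    if design == "haybittle_peto":
        return 3.0 if k < K else c               # fixed 3 at interim looks
    raise ValueError(f"unknown design: {design}")
```

For instance, with the values commonly tabulated for $K = 5$ and two-sided $\alpha = 0.05$ (about 2.413 for Pocock and 2.040 for O'Brien-Fleming), `critical_value("obrien_fleming", 1, 5, 2.040)` gives a first-look boundary of roughly 4.56, which shows how strict that design is at early looks compared with the constant Pocock boundary.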
Empirical sizes were estimated by the proportions of simulated datasets in which the null hypothesis was rejected. Similar to the study by Byun et al. [13], 10,000 datasets from normally and log-normally distributed random variables were simulated with maximum sample sizes of 100, 300, and 500, respectively, with half paired and half unpaired subjects. Each dataset contained 5 equal groups ($K = 5$) with 4 interim analyses and a final analysis. For example, a dataset with a sample size of 100 was divided into 5 equal groups, and each group contained 5 matched pairs and 10 unmatched subjects. The first interim analysis was performed on the first group of 20 subjects, and the second to the fifth (final) analyses were performed at the equally spaced sample sizes of 40, 60, 80, and 100, respectively. Data without outliers were simulated with a treatment effect of $\mu_0 = \mu_1 = 0$ for normally distributed data and $\ln(\mu_0) = \ln(\mu_1) = 0$ for log-normally distributed data, correlation coefficients of 0.2, 0.5, and 0.8 for matched pairs, and a variance of 1 for normally distributed data or a logarithm variance of 1 for log-normally distributed data. We chose equal group sample sizes and the range of correlation coefficients for convenience, and total sample sizes that are commonly encountered in clinical trials utilizing interim analysis.
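The simulation scheme just described can be sketched in Python using only the standard library. This is an illustrative sketch under several assumptions that are not taken from the study: unmatched subjects are split evenly between control and treatment, the test statistic is the inverse variance-weighted Z value, and the reported quantity is the proportion of trials that cross the boundary at any of the K looks. All function names are hypothetical, and the hybrid statistic is not implemented here.

```python
import math
import random
import statistics


def simulate_group(rng, n_pairs, n_unpaired, loc0, loc1, rho, lognormal=False):
    """Simulate one sequential group: matched pairs plus unmatched subjects.

    loc0 and loc1 are the means on the normal (or log) scale; variance is 1.
    """
    def transform(x):
        return math.exp(x) if lognormal else x

    pairs = []
    for _ in range(n_pairs):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        control = loc0 + z1
        treated = loc1 + rho * z1 + math.sqrt(1.0 - rho ** 2) * z2  # corr = rho
        pairs.append((transform(control), transform(treated)))
    half = n_unpaired // 2  # assumption: even split between the two arms
    control_only = [transform(rng.gauss(loc0, 1.0)) for _ in range(half)]
    treated_only = [transform(rng.gauss(loc1, 1.0)) for _ in range(n_unpaired - half)]
    return pairs, control_only, treated_only


def meta_z(pairs, control_only, treated_only):
    """Inverse variance-weighted Z value from paired and unpaired data."""
    diffs = [t - c for c, t in pairs]
    t_p = statistics.fmean(diffs)
    v_p = statistics.variance(diffs) / len(diffs)
    t_u = statistics.fmean(treated_only) - statistics.fmean(control_only)
    v_u = (statistics.variance(treated_only) / len(treated_only)
           + statistics.variance(control_only) / len(control_only))
    w_p, w_u = 1.0 / v_p, 1.0 / v_u
    return (w_p * t_p + w_u * t_u) / math.sqrt(w_p + w_u)


def empirical_size(n_trials, boundaries, rho, seed=0, lognormal=False):
    """Proportion of null trials rejected at any look (one boundary per look)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        pairs, control_only, treated_only = [], [], []
        for c_k in boundaries:
            p, c, t = simulate_group(rng, 5, 10, 0.0, 0.0, rho, lognormal)
            pairs += p
            control_only += c
            treated_only += t
            if abs(meta_z(pairs, control_only, treated_only)) >= c_k:
                rejections += 1
                break  # the trial stops at the first boundary crossing
    return rejections / n_trials
```

A call such as `empirical_size(10000, [2.413] * 5, rho=0.5)` would mimic the maximum sample size of 100 (5 pairs and 10 unmatched subjects per group) with a constant boundary at each of the five looks; the per-look rejection rates tabulated in the study would require recording the look at which each trial stops.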
For a dataset with 20% outliers, 80% of subjects came from the previously simulated dataset, and 20% of subjects were simulated with $\mu_2 = \mu_3 = 10$ and a variance of 20 for normally distributed random variables. For log-normally distributed datasets, $\ln(\mu_2) = \ln(\mu_3) = 2$ with a logarithm variance of 2 was used for the 20% of subjects with outliers.
Datasets observed from multinomially distributed random variables were also simulated but are not shown in this paper because the results were similar to those for log-normally distributed random variables.
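One way to picture the outlier mechanism is the following Python sketch. It is an assumption-laden illustration, not the study's procedure: the function name `contaminate` is hypothetical, and the sketch replaces a randomly chosen 20% of already simulated values with draws from the outlier distribution, which is only one possible reading of the description above.

```python
import math
import random


def contaminate(rng, values, out_mean, out_var, fraction=0.2, lognormal=False):
    """Replace a random `fraction` of `values` with outlier draws (hypothetical helper).

    out_mean and out_var are on the normal (or log) scale; note that
    random.gauss expects a standard deviation, hence the square root.
    """
    n_out = round(fraction * len(values))
    result = list(values)
    for i in rng.sample(range(len(values)), n_out):
        x = rng.gauss(out_mean, math.sqrt(out_var))
        result[i] = math.exp(x) if lognormal else x
    return result
```

For the normal setting the call would be `contaminate(rng, values, 10.0, 20.0)`, and for the log-normal setting `contaminate(rng, values, 2.0, 2.0, lognormal=True)`.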
The z-values from different sample sizes at 5 significance levels with the Pocock, O'Brien-Fleming, Wang-Tsiatis, and Haybittle-Peto designs are summarized in Table 1. Compared to the boundary of a standard normal distribution, all empirical boundaries estimated using both the meta-analysis t-test and the hybrid statistic for data with matched and unmatched subjects were far from the standard Z-test boundary when the sample size was 20. However, they gradually narrowed and approached the standard boundary as the sample size increased. Compared to the meta-analysis t-test, empirical boundaries estimated using the hybrid statistic were closer to the standard boundary line, especially when $n \geq 40$ (Figures 1 and 2).
These results indicate that both the meta-analysis t-test and the hybrid statistic have large type I errors for normally distributed data with matched and unmatched subjects when the sample size is small ($n \leq 20$). As the sample size increased from 20 to 100, type I errors decreased, especially with the hybrid statistic. Table 2 shows the type I errors for the four designs with a sample size of 100. The largest empirical type I error for the Pocock design was observed in the first interim analysis ($k = 1$), whereas the type I error increased with $k_i$ for the O'Brien-Fleming design.
The results of empirical sizes for log-normally distributed random variables are tabulated in Table 3. Lower type I errors were observed with the hybrid statistic, which indicates that this test is also conservative for data with log-normally distributed outcomes. Similar results were also observed when outliers were present in the interim analysis of clinical trials.

Empirical type I errors, which were affected by sample size and correlation coefficient, are shown for the O'Brien-Fleming design in Figure 3. They were greater than the standard type I error when $N = 100$, but similar to it with a maximum sample size of 500 for data without outliers (Figure 4). For data with outliers, empirical type I errors were lower than the standard normal type I error when $N = 100$ and $k_i \geq 3$ (Figure 5), but were similar to the standard one when $N = 500$ (Figure 6). Meanwhile, lower type I errors, especially for data with outliers and with small sample sizes, were observed with the hybrid statistic in most cases of the study. These results suggest that the hybrid statistic was less likely to commit a type I error when there were outliers. The largest empirical type I error of the Pocock design for data observed from log-normally distributed random variables with and without outliers was also observed in the first interim analysis. Type I errors increased with sample size for the O'Brien-Fleming design, and the results for the Wang-Tsiatis and Haybittle-Peto designs were similar to those for data observed from normally distributed random variables. Results for data observed from log-normally distributed random variables suggest that both tests have lower type I errors, especially for data with outliers. Moreover, compared to the meta-analysis t-test, a lower type I error was also observed with the hybrid statistic.

We also observed lower type I errors with the hybrid statistic than with the meta-analysis t-test for multinomially distributed random variables with and without outliers. These results are similar to those observed for log-normally distributed random variables.
Each dataset was simulated 10,000 times, and both the meta-analysis t-test and the hybrid statistic were applied for 4 interim analyses and a final analysis. Four group sequential designs were applied, and data with matched and unmatched subjects and with maximum sample sizes of 100, 300, and 500 were simulated and analyzed using both the meta-analysis t-test and the hybrid statistic. Partial results for a maximum sample size of 300 without and with outliers are summarized in Tables 4 and 5.

doi: 10.7243/2053-7662-8-5

Overall, all empirical powers increased with sample size, number of interim analyses ($k_i$), treatment effect size ($d_i$), and correlation coefficient ($\rho_i$) for matched pairs, using both the meta-analysis t-test and the hybrid statistic. Because the 4 group sequential designs listed in the tables showed similar trends in power, empirical powers of the O'Brien-Fleming design were used for the graphs. The empirical powers of the hybrid test statistic are very close to those of the meta-analysis t-test for normally distributed data without outliers (Table 4 and Figures 7 and 8). However, the hybrid test is much more powerful when there are outliers (Table 5 and Figures 9 and 10). For a closer look at the case where the alternative is log-normal, with and without outliers, we summarize the results in Tables 6 and 7.
From Table 5, we can see that empirical powers were less than 60% for $d_i = 0.3$, except for $k_i = 5$ and $\rho_i = 0.8$ (Figure 11). Table 7 clearly indicates that the hybrid test statistic is much more powerful (Figures 13 and 14).

Table 5. Empirical powers for data observed from normally distributed random variables with matched and unmatched subjects and with 20% outliers (maximum sample size = 300).

For example, as shown in Table 7, with 20% outliers, $d_i = 0.6$, and $\ln(\sigma^2) = 2$, the empirical power of the hybrid statistic reached 95.9%, whereas that of the meta-analysis t-test reached only 27.2% at the final stage. In fact, almost all the empirical powers were larger than 90% at the final stage for the hybrid statistic, but the empirical power of the meta-analysis t-test was mostly less than 30%.
Results for multinomial outcomes were similar to those for the log-normal distribution for matched and unmatched subjects with maximum sample sizes of 100, 300, and 500, using both the meta-analysis t-test and the hybrid statistic. These results are omitted here. The purpose of simulating multinomial outcomes was to validate that the hybrid statistic is powerful not only for data observed from continuous random variables with outliers but also for data observed from discrete variables with outliers in the interim analysis of clinical trials.

Table 7. Empirical powers for data observed from log-normally distributed random variables with matched and unmatched subjects and with 20% outliers (maximum sample size = 300).

Discussions and Concluding Remarks
The meta-analysis t-test can be used for normally distributed random variables with matched and unmatched subjects. Byun et al. reported a hybrid statistic for data from neither normally nor dichotomously distributed random variables in clinical trials without interim analysis [13]. In our study, we conducted simulation studies of interim analyses of clinical trials with different sample sizes and with matched and unmatched subjects to compare the type I errors and powers of the meta-analysis t-test and the hybrid statistic. Our results showed that empirical type I error rates were relatively large when the sample size was small for matched and unmatched subjects in the interim analysis, using both the meta-analysis t-test and the hybrid statistic. However, compared to the meta-analysis t-test, a lower type I error rate was observed for the hybrid test statistic under the normal outcome setting. As sample sizes increased, empirical type I errors gradually approached those for the standard analysis (that is, clinical trials conducted without interim analysis). In general, the hybrid test statistic had a lower type I error rate than the meta-analysis t-test.
The meta-analysis t-test is more powerful for ideal data observed from normally distributed outcomes with matched and unmatched subjects. However, the hybrid test is more robust and more powerful than the meta-analysis t-test when the outcome contains outliers in the interim analysis of clinical trials. All empirical powers rose with increases in sample size, effect size, and correlation coefficient for matched pairs. Our results from the simulation study strongly support the conclusion that the hybrid statistic has more power and a lower type I error for non-normally distributed data with matched and unmatched subjects, especially for data with outliers.