Yunfei Wang and Dejian Lai*

*Correspondence: Dejian Lai dejian.lai@uth.tmc.edu

Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Pressler St., Room 1008, Houston, TX 77030, USA.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A hybrid test statistic was proposed in the literature to analyze matched studies with non-normally distributed outcomes. In this article, we investigated and compared the hybrid statistic with the meta-analysis t-test under commonly used interim analysis settings in clinical trials. We estimated the empirical powers and empirical type I errors among 10,000 simulated datasets with different sample sizes, effect sizes, correlation coefficients for matched pairs, and data distributions in the interim and final analyses under 4 group sequential methods. Results from our simulation study show that, compared to the meta-analysis t-test commonly used for normally distributed observations, the hybrid statistic largely retains power for data observed from normally distributed random variables and generally achieves greater power for log-normally and multinomially distributed random variables with matched and unmatched subjects, as well as for data with outliers. Powers rose with increasing sample size, effect size, and correlation coefficient for the matched pairs. In addition, lower type I errors were observed with the hybrid statistic in most of the cases studied, indicating that this test is also conservative for data with outliers in the interim analysis of clinical trials.

**Keywords**: Clinical Trials, Empirical Power, Empirical Size, Interim Analysis

In clinical trials, an interim analysis is commonly applied to obtain evidence of a significant difference of the study intervention. It could save patient resources and shorten drug development and approval time if a trial is terminated with high confidence earlier than planned. Group sequential designs are the most commonly used methods in interim analysis.

Pocock (1977) and O'Brien and Fleming (1979) proposed their group sequential designs for interim analysis [1,2]. Both designs assume a fixed maximum number of interim analyses (K) at equally spaced looks. However, the Pocock design requires a larger maximum sample size, while the O'Brien-Fleming design requires a larger expected sample size if the study continues to the final analysis [3].

Wang and Tsiatis (1987) proposed a family of two-sided tests that provide boundaries of different shapes for different values of a parameter Δ [4]. The parameter can be varied to put more emphasis on a low maximum sample size or a low expected sample size in a group sequential analysis. Typically Δ ranges from 0 (O'Brien-Fleming design) to 0.5 (Pocock design), and Δ=0.25 was selected as the middle point in our study. Haybittle (1971) and Peto et al. (1976) also suggested a simple form of sequential monitoring in which H_{0} is rejected at the kth interim analysis (k < K) only if |Z_{k}| exceeds a large constant boundary, such as 3 [5,6].

Data collected in clinical trials vary in type. Sometimes both matched and unmatched subjects are included in a trial, and it is important to properly include both in the interim analysis. The meta-analysis t-test with an inverse variance-weighted method is commonly used to combine results of matched and unmatched subjects for normally distributed data [7-11]. In practice, there are many other types of outcome data. For example, in ophthalmology clinical trials, the two eyes of a patient would be randomized to treatment and control groups as a matched pair if both eyes are sick. However, if a patient has one sick and one normal eye, only the sick eye would be randomized in the trial. In addition, a study outcome such as visual acuity is usually tested based on the distances and sizes of letters and recorded as 20/x, where x denotes the distance at which an eye with normal vision could see what the patient sees at 20 feet; such measurements are neither normally nor dichotomously distributed. Therefore, this study includes both matched and unmatched observations with values that are neither normally nor dichotomously distributed.

A hybrid statistic was reported by investigators of the Early Treatment for Retinopathy of Prematurity study for data analysis [12,13]. They applied this statistic effectively to combine outcome results from the matched and unmatched portions of their research. In our study, we performed a simulation study of the hybrid statistic to assess its empirical sizes and powers under the setting of interim analysis. We also compared the results of the hybrid statistic to those of the meta-analysis t-test.

**Meta-analysis t-test and hybrid statistic**

The meta-analysis t-test with the inverse variance-weighted method is usually applied in a two-treatment comparison for normally distributed data with paired and unpaired subjects. Briefly, let

In the above formulations, T^{k} denotes the paired and two-sample t statistics used to calculate the T values for the combined and uncombined data at the kth interim analysis.

Asymptotically, assuming that T^{k}_{combined} has a normal distribution and that the mean difference under the null hypothesis *H_{0}* is zero, we have the following normally distributed test statistic evaluated at the kth interim analysis:
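As a sketch of this inverse variance-weighted combination, the Z statistic can be computed as follows. The function name and the particular variance estimators are our own illustrative choices; the paper's exact formulation (whose displayed equation is not reproduced here) may differ in detail.

```python
import numpy as np

def meta_t_z(x_pair, y_pair, x_unp, y_unp):
    """Combine paired and unpaired mean differences by inverse-variance
    weighting and return an approximately standard-normal Z statistic."""
    d = y_pair - x_pair                        # paired differences
    est_p = d.mean()
    var_p = d.var(ddof=1) / len(d)             # variance of the paired estimate
    est_u = y_unp.mean() - x_unp.mean()
    var_u = x_unp.var(ddof=1) / len(x_unp) + y_unp.var(ddof=1) / len(y_unp)
    w_p, w_u = 1.0 / var_p, 1.0 / var_u        # inverse-variance weights
    # weighted estimate divided by its standard error:
    return (w_p * est_p + w_u * est_u) / np.sqrt(w_p + w_u)
```

Swapping the treatment and control arms flips the sign of the statistic, which makes the two-sided rejection rule symmetric.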

The hybrid statistic defined by Byun et al. [13] can also be evaluated at the kth interim analysis.

Briefly, at interim analysis k, let θ̂^{k}_{paired} be the median of the Walsh averages associated with the Wilcoxon signed-rank test, defined as median{(z_{i}+z_{j})/2, 1 ≤ i ≤ j ≤ n_{1,k}}, where (z_{i}+z_{j})/2 is called a Walsh average, z_{i}=y_{i}-x_{i}, i=1,…,n_{1,k}, and var(θ̂^{k}_{paired}) is its variance [14,15]. Let θ̂^{k}_{unpaired} be the median of the cross differences of the two unmatched samples, defined as median{y_{j}-x_{l}, j=1,…,n_{2,k}; l=1,…,n_{3,k}}, where n_{1,k} is the number of matched pairs at stage k, and n_{2,k} and n_{3,k} are the numbers of observations in the two unmatched samples, respectively.

The median of these cross differences is the Hodges-Lehmann estimate based on the Wilcoxon rank-sum test, and var(θ̂^{k}_{unpaired}) is its variance [16]. The hybrid test statistic evaluated at the kth interim analysis is calculated as follows:
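The two location estimates described above can be sketched in code. This is a minimal illustration with names of our own choosing; the variance formulas for the two Hodges-Lehmann estimators, given in [13-16], are taken here as inputs rather than computed.

```python
import numpy as np

def walsh_median(z):
    """Hodges-Lehmann estimate for paired differences z_i = y_i - x_i:
    the median of all Walsh averages (z_i + z_j)/2, i <= j."""
    z = np.asarray(z, dtype=float)
    i, j = np.triu_indices(len(z))            # all pairs with i <= j
    return np.median((z[i] + z[j]) / 2.0)

def cross_median(x, y):
    """Hodges-Lehmann estimate for two unmatched samples:
    the median of all cross differences y_j - x_l."""
    diffs = np.subtract.outer(np.asarray(y, float), np.asarray(x, float))
    return np.median(diffs.ravel())

def hybrid_z(theta_p, var_p, theta_u, var_u):
    """Inverse-variance combination of the two estimates into a Z value:
    (w_p*theta_p + w_u*theta_u) / sqrt(w_p + w_u), with w = 1/var."""
    w_p, w_u = 1.0 / var_p, 1.0 / var_u
    return (w_p * theta_p + w_u * theta_u) / np.sqrt(w_p + w_u)
```

For example, `walsh_median([1, 2, 3])` returns 2.0, the median of the six Walsh averages {1, 1.5, 2, 2, 2.5, 3}.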

The null hypothesis H_{0} at the kth interim analysis was rejected if |Z_{k}| ≥ C_{k}, where C_{k} equals C_{P}(K, α) for the Pocock design, C_{B}(K, α) × √(K/k) for the O'Brien-Fleming design, C_{WT}(K, α, Δ) × (k/K)^{Δ-1/2} for the Wang-Tsiatis design, and 3 (if k < K) or C_{HP}(K, α) (if k = K) for the Haybittle-Peto design.

Based on the Z values defined in the last section, for the Pocock design the nominal significance level is α'_{k} = 2[1 - Φ{C_{P}(K, α)}] with constant C_{P}(K, α) at K groups. For the O'Brien-Fleming design, the nominal significance level is α'_{k} = 2[1 - Φ{C_{B}(K, α)√(K/k)}] with constant C_{B}(K, α) at K groups. The null hypothesis H_{0} will be rejected and the study terminated if |Z_{k}| ≥ C_{P}(K, α) for the Pocock design or |Z_{k}| ≥ C_{B}(K, α) × √(K/k) for the O'Brien-Fleming design.

The Wang-Tsiatis design introduces a parameter Δ ranging from 0 to 0.5. It equals the Pocock design when Δ=0.5 and the O'Brien-Fleming design when Δ=0. The null hypothesis H_{0} will be rejected and the study terminated if |Z_{k}| ≥ C_{WT}(K, α, Δ) × (k/K)^{Δ-1/2}, where C_{WT}(K, α, Δ) is the constant at K groups. Haybittle and Peto et al. suggested a simple form of sequential monitoring in which the null hypothesis H_{0} is rejected at an interim analysis (k < K) if |Z_{k}| ≥ 3 and at the final analysis (k = K) if |Z_{K}| ≥ C_{HP}(K, α), where C_{HP}(K, α) is the constant at k = K.
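The four boundary shapes can be summarized in one small function. This is a sketch under our own naming; the constants C themselves must be taken from tables such as those in Jennison and Turnbull [3].

```python
import numpy as np

def boundary(k, K, design, C, delta=0.25):
    """Critical value C_k at look k of K for the four designs discussed.
    C is the design's tabulated constant for K groups and overall level alpha."""
    if design == "pocock":            # flat boundary across all looks
        return C
    if design == "obrien-fleming":    # steep early, equals C at the final look
        return C * np.sqrt(K / k)
    if design == "wang-tsiatis":      # shape parameter delta in [0, 0.5]
        return C * (k / K) ** (delta - 0.5)
    if design == "haybittle-peto":    # 3 at interim looks, C at the last
        return 3.0 if k < K else C
    raise ValueError(f"unknown design: {design}")
```

The nominal two-sided level at look k is then `2 * (1 - scipy.stats.norm.cdf(boundary(k, K, design, C)))`, matching the α'_{k} expressions above. Note that Wang-Tsiatis with Δ=0.5 reproduces the flat Pocock shape and Δ=0 reproduces O'Brien-Fleming.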

To assess the empirical sizes, we compared type I errors using the meta-analysis t-test and the hybrid statistic for trials with 4 interim analyses and a final analysis (K=5). Results from the interim analysis at each group can also serve as a reference for a single-stage design. To confirm the trends of type I errors among different clinical trial designs, four group sequential methods, the Pocock, O'Brien-Fleming, Wang-Tsiatis, and Haybittle-Peto designs, with their corresponding significance levels were applied in this study. Using a 2-sided test, the null hypothesis H_{0} was rejected at the kth interim analysis if |Z_{k}| ≥ C_{k}, where C_{k} equals C_{P}(K, α) for the Pocock design, C_{B}(K, α) × √(K/k) for the O'Brien-Fleming design, C_{WT}(K, α, Δ) × (k/K)^{Δ-1/2} for the Wang-Tsiatis design (Δ=0.25), and 3 (if k < K) or C_{HP}(K, α) (if k = K) for the Haybittle-Peto design, where C_{P}(K, α), C_{B}(K, α), C_{WT}(K, α, Δ), and C_{HP}(K, α) are the constants with K groups of observations and type I error α in each design [3].

Empirical sizes were estimated by the proportions of rejecting the null hypothesis in the simulated data. Similar to the study in Byun et al. [13], 10,000 datasets of normally and log-normally distributed random variables were simulated with maximum sample sizes of 100, 300, and 500, respectively, with half of the subjects paired and half unpaired. Each dataset contained 5 equal groups (K=5) with 4 interim analyses and a final analysis. For example, a dataset with a sample size of 100 was divided into 5 equal groups, and each group contained 5 matched pairs and 10 unmatched subjects. The first interim analysis would be done at the first group of 20 subjects, and the second to fifth (final) analyses would be completed at the equally spaced sample sizes of 40, 60, 80, and 100, respectively. Treatment effects of μ_{0}=μ_{1}=0 for normally distributed data and ln(μ_{0})=ln(μ_{1})=0 for log-normally distributed data, correlation coefficients of 0.2, 0.5, and 0.8 for matched pairs, and a variance of 1 for normally distributed data or a logarithm variance of 1 for log-normally distributed data were used for data simulation without outliers. We selected equal group sample sizes and this range of correlation coefficients for convenience, and total sample sizes commonly encountered in clinical trials utilizing interim analysis.

For a dataset with 20% outliers, 80% of subjects were taken from the previously simulated dataset, and 20% of subjects were simulated with an effect size of μ_{2}=μ_{3}=10 and a variance of 20 for normally distributed random variables. For log-normally distributed datasets, an effect size of ln(μ_{2})=ln(μ_{3})=2 with a logarithm variance of 2 was used for the 20% of subjects with outliers.
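The data-generating step described above can be sketched as follows. This is a minimal version under our own assumptions, in particular about how the outlier fraction replaces matched pairs; the paper's exact contamination scheme may differ.

```python
import numpy as np

rng = np.random.default_rng(2020)

def simulate_group(n_pairs, n_unmatched, rho, mu0=0.0, mu1=0.0,
                   outlier_frac=0.0, mu_out=10.0, var_out=20.0):
    """One group of a simulated trial: correlated normal pairs plus
    independent unmatched arms, optionally contaminated with outliers."""
    # bivariate normal pairs with unit variances and correlation rho
    cov = [[1.0, rho], [rho, 1.0]]
    pairs = rng.multivariate_normal([mu0, mu1], cov, size=n_pairs)
    x_u = rng.normal(mu0, 1.0, size=n_unmatched)
    y_u = rng.normal(mu1, 1.0, size=n_unmatched)
    if outlier_frac > 0:  # replace a fraction of pairs with outlier pairs
        m = int(round(outlier_frac * n_pairs))
        pairs[:m] = rng.multivariate_normal([mu_out, mu_out],
                                            np.array(cov) * var_out, size=m)
    return pairs[:, 0], pairs[:, 1], x_u, y_u
```

With `n_pairs=5` and `n_unmatched=10` this yields the 20-subject group of the N=100 example; log-normal data can be obtained by exponentiating the normal draws.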

Datasets observed from multinomially distributed random variables were also simulated but are not shown in this paper because the results were similar to those for log-normally distributed random variables.

The z-values for different sample sizes at 5 significance levels with the Pocock, O'Brien-Fleming, Wang-Tsiatis, and Haybittle-Peto designs are summarized in Table 1. Compared to the boundary of a standard normal distribution, all empirical boundaries estimated using both the meta-analysis t-test and the hybrid statistic for data with matched and unmatched subjects were far from the standard Z-test boundaries when the sample size was 20. However, they gradually narrowed and approached the standard boundary as the sample size increased. Compared to the meta-analysis t-test, empirical boundaries estimated using the hybrid statistic were closer to the standard boundary line, especially when n ≥ 40 (Figures 1 and 2).

Table 1 **: Simulated z-values at 5 different points of significance levels with different sample sizes estimated by using the meta-analysis t-test and the hybrid statistic.**

Figure 1 **:** **O’Brien-Fleming boundaries estimated by hybrid statistic.**

Figure 2 **:** **O’Brien-Fleming boundaries estimated by meta-analysis t-test.**

The above results indicate that both the meta-analysis t-test and the hybrid statistic have large type I errors for normally distributed data with matched and unmatched subjects when the sample size is small (n ≤ 20). But as the sample size increased from 20 to 100, type I errors decreased, especially with the hybrid statistic.

Table 2 shows the type I errors among the four designs with a sample size of 100. The largest empirical type I error for the Pocock design was observed at the first interim analysis (k=1), whereas type I error increased with increasing k_{i} for the O'Brien-Fleming design (Figure 3). The trend of empirical type I errors for the Wang-Tsiatis design was between those of the Pocock and O'Brien-Fleming designs, and the trend for the Haybittle-Peto design was similar to that of the Pocock design in the interim analyses (k_{i} < 5).

Table 2 **: Type I errors (α) for data observed from normally distributed random variables with matched and unmatched subjects (Maximum sample size=100).**

Figure 3 **:** **Type I errors for normal distribution (N=100).**

The results of empirical sizes for log-normally distributed random variables are tabulated in Table 3. Lower type I errors were observed using the hybrid statistic, which indicates that this test is also conservative for data with log-normally distributed outcomes. Similar results were observed when outliers were present in the interim analysis of clinical trials.

Table 3 **: Type I errors (α) for data observed from log-normally distributed random variables with matched and unmatched subjects (Maximum sample size=100).**

How empirical type I errors were affected by sample size and correlation coefficient for the O'Brien-Fleming design is shown in Figure 3. They were greater than the standard type I error when N=100 but similar to it with a maximum sample size of 500 for data without outliers (Figure 4). For data with outliers, empirical type I errors were lower than the standard normal type I error when N=100 and k_{i} ≥ 3 (Figure 5) but similar to the standard one when N=500 (Figure 6). Meanwhile, lower type I errors, especially for data with outliers and low sample sizes, were observed using the hybrid statistic in most cases of the study. These results suggest that the hybrid statistic was less likely to commit a type I error when there were outliers.

Figure 4 **:** **Type I errors for normal distribution (N=500).**

Figure 5 **:** **Type I errors for normal distribution with 20% outliers (N=100).**

Figure 6 **:** **Type I errors for normal distribution with 20% outliers (N=500).**

The largest empirical type I error of the Pocock design for data observed from log-normally distributed random variables with and without outliers was also observed at the first interim analysis. Type I errors increased with increasing sample size for the O'Brien-Fleming design, similar to data observed from normally distributed random variables for the Wang-Tsiatis and Haybittle-Peto designs. Results for data observed from log-normally distributed random variables suggest that both tests have lower type I errors, especially for data with outliers. Moreover, compared to results estimated using the meta-analysis t-test, a lower type I error was also observed using the hybrid statistic.

We also observed lower type I errors using the hybrid statistic than the meta-analysis t-test for multinomially distributed random variables with and without outliers. These results are similar to those observed for log-normally distributed random variables.

**Design and Results of Simulation Studies of Empirical Powers**

Data observed from normally and log-normally distributed random variables were simulated with effects of μ_{1}-μ_{0} = 0.3, 0.6 (where μ_{0}=0, σ^{2}=1) and ln(μ_{1}) - ln(μ_{0}) = 0.3, 0.6 [where ln(μ_{0})=0, ln(σ^{2})=1], respectively. Correlation coefficients of 0.2, 0.5, and 0.8 for matched pairs were used in the data simulation.

In order to simulate datasets with 20% outliers, 80% of subjects were taken from the above datasets and the other 20% were simulated with effect sizes of μ_{3}-μ_{2} = 0.3, 0.6 (where μ_{2}=10 and σ^{2}_{2}=20) for normally distributed random variables, or ln(μ_{3}) - ln(μ_{2}) = 0.3, 0.6 [where ln(μ_{2})=2 and ln(σ^{2}_{2})=2] for log-normally distributed datasets, respectively.

A dataset observed from multinomially distributed random variables was obtained from a dataset simulated from log-normally distributed random variables, with the continuous values transformed to discrete values. For example, values ≤ 0.1 were transformed into 0.05, and values between 0.1 and 0.3 were changed into 0.2, etc.
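This transformation can be sketched with `numpy.digitize`. The cut points beyond 0.3 and the category labels other than the two stated in the text are illustrative assumptions.

```python
import numpy as np

def discretize(values, cuts=(0.1, 0.3), labels=(0.05, 0.2, 0.5)):
    """Map continuous (e.g. log-normal) values onto category midpoints:
    values <= 0.1 -> 0.05, values in (0.1, 0.3] -> 0.2, larger -> 0.5."""
    idx = np.digitize(values, cuts, right=True)  # bin index per value
    return np.asarray(labels)[idx]
```

Extending `cuts` and `labels` with further intervals yields a finer multinomial outcome.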

Each dataset was simulated 10,000 times, and both the meta-analysis t-test and the hybrid statistic were applied for 4 interim analyses and a final analysis. Four group sequential methods, Pocock, O’Brien-Fleming, Wang-Tsiatis, and Haybittle-Peto designs, and their corresponding significance levels with 5 groups were applied to estimate the empirical powers for each dataset.
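The estimation step above reduces to counting boundary crossings across simulated trials; a sketch under the group sequential stopping rule (a trial rejected at an earlier look stays rejected at later looks):

```python
import numpy as np

def empirical_rate(z_matrix, crit):
    """Proportion of simulated trials rejected by look k. z_matrix is
    (n_sims, K) of Z values; crit is the length-K boundary. A trial counts
    as rejected by look k if |Z_j| >= crit[j] at any look j <= k."""
    z_abs = np.abs(np.asarray(z_matrix, dtype=float))
    crossed = z_abs >= np.asarray(crit)              # crossing at each look
    ever = np.maximum.accumulate(crossed, axis=1)    # rejected by look k
    return ever.mean(axis=0)
```

Applied to Z values simulated under H_{0} this gives the empirical type I error by stage; under an alternative effect size it gives the empirical power.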

Empirical powers for data observed from normally distributed random variables with matched and unmatched subjects and with maximum sample sizes of 100, 300, and 500 were simulated using both the meta-analysis t-test and the hybrid statistic. Partial results for a maximum sample size of 300 without and with outliers are summarized in Tables 4 and 5. Overall, all empirical powers increased with increasing sample size, number of interim analyses (k_{i}), treatment effect size (d_{i}), and correlation coefficient (ρ_{i}) for matched pairs using both the meta-analysis t-test and the hybrid statistic. Because the trends of power were similar across the 4 group sequential designs listed in the tables, the empirical powers of the O'Brien-Fleming design were used for the graphs.

Table 4 **: Empirical powers for data observed from normally distributed random variables with matched and unmatched subjects (Maximum sample size=300).**

Table 5 **: Empirical powers for data observed from normally distributed random variables with matched and unmatched subjects and with 20% outliers (Maximum sample size=300).**

Table 4 shows the empirical powers of the hybrid test statistic are very close to those of the meta-analysis t-test under the normal distribution without outliers (Table 4 and Figures 7 and 8). However, the hybrid test is much more powerful when there are outliers (Table 5 and Figures 9 and 10). For a closer look when the alternative is log-normal with and without outliers, we summarize the results in Tables 6 and 7.

Table 6 **: Empirical powers for data observed from log-normally distributed random variables with matched and unmatched subjects (Maximum sample size=300).**

Table 7 **: Empirical powers for data observed from log-normally distributed random variables with matched and unmatched subjects and with 20% outliers (Maximum sample size=300).**

Figure 7 **:** **Normal distribution (N=300, d=0.3).**

Figure 8 **:** **Normal distribution (N=300, d=0.6).**

Figure 9 **:** **Normal dis. 20% outliers (N=300, d=0.3).**

Figure 10 **:** **Normal dis. 20% outliers (N=300, d=0.6).**

From Table 6, we can see empirical powers were less than 60% for d_{i}=0.3 except for k_{i}=5 and ρ_{i}=0.8 (Figure 11). Empirical powers increased for d_{i}=0.6, and all were above 60% for k_{i}=4 and above 80% at the final analysis (k_{i}=5) (Figure 12). Empirical powers increased with increasing sample size. At N=300, power reached 80% when d_{i}=0.3 and ρ_{i}=0.8 at k_{i}=3, and when d_{i}=0.6 with ρ_{i}=0.8 at k_{i}=2 or at k_{i}=3 for each ρ_{i}, for data without outliers. When there are outliers in the data, the results in Table 7 clearly indicate that the hybrid test statistic is much more powerful (Figures 13 and 14). For example, as shown in Table 7, for the O'Brien-Fleming design with 20% outliers and d_{i}=0.6, ln(σ^{2}_{2})=2, the empirical power of the hybrid statistic reached 95.9% whereas the meta-analysis t statistic reached only 27.2% at the final stage. In fact, almost all the empirical powers at the final stage were larger than 90% for the hybrid statistic, but the empirical powers from the meta-analysis t-test were mostly less than 30%.

Figure 11 **: ****Log-normal distribution (N=300, d=0.3)**.

Figure 12 **:** **Log-normal distribution (N=300, d=0.6)**.

Figure 13 **:** **Log-normal dis. 20% outliers (N=300, d=0.3)**.

Figure 14 **:** **Log-normal dis. 20% outliers (N=300, d=0.6)**.

Similar results were observed for multinomial outcomes as for the log-normal distribution with matched and unmatched subjects and with maximum sample sizes of 100, 300, and 500 using both the meta-analysis t-test and the hybrid statistic. These results are omitted here. The purpose of simulating multinomial outcomes was to validate that the hybrid statistic is powerful not only for data observed from continuous random variables with outliers but also for data observed from discrete variables with outliers in the interim analysis of clinical trials.

The meta-analysis t-test can be used for normally distributed random variables with matched and unmatched subjects. Byun et al. reported a hybrid statistic for data from neither normally nor dichotomously distributed random variables in clinical trials without interim analysis [13]. In our study, we conducted simulation studies of an interim analysis of a clinical trial with different sample sizes and with matched and unmatched subjects to compare the type I errors and powers of the meta-analysis t-test and the hybrid statistic. Our results showed that empirical type I error rates were relatively large when the sample size was small for matched and unmatched subjects in the interim analysis using both the meta-analysis t-test and the hybrid statistic. However, compared to the meta-analysis t-test, a lower type I error rate was observed for the hybrid test statistic under the normal outcome setting. With increasing sample sizes, empirical type I errors gradually approached those of standard analysis (that is, clinical trials conducted without interim analysis). In general, the hybrid test statistic had a lower type I error rate compared to the meta-analysis t-test.

The meta-analysis t-test is more powerful for ideal data observed from normally distributed outcomes with matched and unmatched subjects. However, the hybrid test is more robust and more powerful than the meta-analysis t-test when the outcome contains outliers in the interim analysis of clinical trials. All empirical powers rose with increases in sample size, effect size, and correlation coefficient for matched pairs. Our results from the simulation study strongly support that the hybrid statistic has greater power and lower type I error for data with matched and unmatched subjects and with non-normal distributions, especially for data with outliers.

The authors declare that they have no competing interests.

| Authors' contributions | YW | DL |
| --- | --- | --- |
| Research concept and design | √ | √ |
| Collection and/or assembly of data | √ | -- |
| Data analysis and interpretation | √ | -- |
| Writing the article | √ | -- |
| Critical revision of the article | √ | √ |
| Final approval of article | √ | √ |
| Statistical analysis | √ | -- |

This work was partially supported by Cancer Prevention Research Institute of Texas (RP170668).

EIC: Jimmy Efird, East Carolina University, USA.

Received: 15-Mar-2020 Final Revised: 18-May-2020

Accepted: 12-Jun-2020 Published: 25-Jun-2020

1. Bellissant E, Benichou J and Chastang C. **A microcomputer program for the design and analysis of phase II cancer clinical trials with two group sequential methods, the sequential probability ratio test, and the triangular test**. *Comput Biomed Res*. 1994;**27**:13-26.
2. O'Brien PC and Fleming TR. **A multiple testing procedure for clinical trials**. *Biometrics*. 1979;**35**:549-56.
3. Jennison C and Turnbull BW. **Group Sequential Methods with Applications to Clinical Trials**. Chapman & Hall/CRC. 2000.
4. Wang SK and Tsiatis AA. **Approximately optimal one-parameter boundaries for group sequential trials**. *Biometrics*. 1987;**43**:193-9.
5. Haybittle JL. **Repeated assessment of results in clinical trials of cancer treatment**. *Br J Radiol*. 1971;**44**:793-7.
6. Peto R, Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N, McPherson K, Peto J and Smith PG. **Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design**. *Br J Cancer*. 1976;**34**:585-612.
7. Shadish WR and Haddock CK. **Combining estimates of effect size**. In: Cooper H, Hedges LV (editors). The Handbook of Research Synthesis. New York: Russell Sage Foundation. 1994; 261-284.
8. Rosenthal R. **Combining results of independent studies**. *Psychological Bulletin*. 1978;**85**:185-193.
9. Fleiss JL. **Measures of effect size for categorical data**. In: Cooper H, Hedges LV (editors). The Handbook of Research Synthesis. New York: Russell Sage Foundation. 1994; 245-260.
10. Sutton AJ, Abrams KR, Jones DR, Sheldon TA and Song F. **Methods for Meta-Analysis in Medical Research**. New York: Wiley. 2000.
11. Rosenthal R. **Combining results of independent studies**. *Psychological Bulletin*. 1978;**85**:185-193.
12. Early Treatment for Retinopathy of Prematurity Cooperative Group. **Final Visual Acuity Results in the Early Treatment for Retinopathy of Prematurity Study**. *Archives of Ophthalmology*. 2010;**128**:1684-1701.
13. Byun J, Lai D, Luo S, Risser J, Tung B and Hardy RJ. **A hybrid method in combining treatment effects from matched and unmatched studies**. *Stat Med*. 2013;**32**:4924-37.
14. Walsh JE. **Some significance tests for the median which are valid under very general conditions**. *Annals of Mathematical Statistics*. 1949;**20**:64-81.
15. Wilcoxon F. **Probability tables for individual comparisons by ranking methods**. *Biometrics*. 1947;**3**:119-22.
16. Hodges JL and Lehmann EL. **Estimates of location based on rank tests**. *Annals of Mathematical Statistics*. 1963;**34**:598-611.


Wang Y and Lai D. **Empirical Size and Power of a Hybrid Statistic for Matched and Unmatched Designs in Interim Analysis of Clinical Trials: A Simulation Study**. *J Med Stat Inform*. 2020; **8**:5. http://dx.doi.org/10.7243/2053-7662-8-5


