
Almalik O. Combining dependent p-values resulting from multiple effect size homogeneity tests in meta-analysis for binary outcomes. J Med Stat Inform. 2022; 10:1. http://dx.doi.org/10.7243/2053-7662-10-1
Osama Almalik
*Correspondence: Osama Almalik o.almalik@tue.nl
Researcher, Department of Mathematics & Computer Science, Eindhoven University of Technology, Netherlands.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Testing effect size homogeneity is an essential part of conducting a meta-analysis. Comparative studies of effect size homogeneity tests for binary outcomes are found in the literature, but no test has emerged as an absolute winner. An alternative approach is to carry out multiple effect size homogeneity tests on the same meta-analysis and combine the resulting dependent p-values. In this article we applied the correlated Lancaster method for dependent statistical tests. To investigate the proposed approach’s performance, we applied eight different effect size homogeneity tests to a case study and to simulated datasets, and combined the resulting p-values. The proposed method has performance similar to that of tests based on the score function in the presence of an effect size when the number of studies is small, but outperforms these tests as the number of studies increases. However, the method’s performance is sensitive to the correlation coefficient value assumed between the dependent tests, and the method only performs well when this value is high. More research is needed to investigate the method’s assumptions on correlation for effect size homogeneity tests, and to study the method’s performance in meta-analysis of continuous outcomes.
Keywords: Meta-analysis, 2x2 contingency tables, effect size, homogeneity test, dependent p-values
Meta-analysts still carry out effect size homogeneity tests on a regular basis [1,2]. Several methods have been developed to test effect size homogeneity in meta-analysis with multiple 2x2 contingency tables, and the performance of these methods has been studied in the literature. Jones et al. (1989) studied the performance of the Likelihood Ratio test, Pearson’s chi-square test, the Breslow-Day test and Tarone’s adjustment to it, a conditional score test, and Liang and Self’s normal approximation to the score test [3]. Gavaghan et al. (1999) compared the performance of the Peto statistic, the Woolf statistic, the Q-statistic (applied to the estimates of the risk difference), Liang and Self’s normal approximation to the score test, and the Breslow-Day test [4]. Almalik and van den Heuvel (2018) compared the performance of the fixed-effects logistic regression analysis, the random-effects logistic regression analysis, the Q-statistic, the Bliss statistic, the I² statistic, the Breslow-Day test, the Zelen statistic, Liang and Self’s T2 and TR statistics, and the Peto statistic [5]. A recent study focused on meta-analysis with rare binary events [6]. However, no single test emerged as a universal winner from these comparative studies. An alternative approach would be to perform multiple tests of effect size homogeneity on the same meta-analysis and combine the resulting p-values.
The earliest method to combine p-values resulting from independent statistical tests is Fisher’s method [7]. Since then, many methods have been proposed to combine p-values from independent statistical tests; see [8,9] for an overview. The p-values in our context result from different statistical tests for effect size homogeneity testing the same null hypothesis. Since these p-values result from the same meta-analysis dataset, they are correlated [16]. Therefore, we only consider methods that combine p-values resulting from dependent tests, and we present a brief review of these methods here.
Brown (1975) [11] extended Fisher’s method to the case where the p-values result from test statistics having a multivariate normal distribution with a known covariance matrix. Kost and McDermott (2002) [12] extended Brown’s method analytically for unknown covariance matrices. Other methods to calculate the covariance matrix have been proposed in the literature [13]. Brown’s method and its improvements can only be applied to the case where the test statistics follow a multivariate normal distribution. Since most test statistics used to test effect size homogeneity approximately follow the Chi Square distribution, these methods cannot be applied.
Makambi (2003) modified the Fisher statistic, using weights derived from the data, to accommodate correlation between the p-values [16]. However, the author applied methods developed by Brown to derive the first two moments of the weighted distribution and to estimate the correlation coefficients, which implies that the same distributional assumptions made by Brown must hold here. Similar modifications of the Fisher statistic can be found in the literature [17]. Yang (2010) introduced an approximation of the null distribution of the Fisher statistic based on the Lindeberg Central Limit theorem. However, one of its conditions, namely that the test statistics are m-dependent, clearly does not hold here [14]. Another approximation introduced in Yang (2010), based on permutations, is described by the author as numerically intensive [14]. Methods combining dependent p-values for very specific applications have been proposed in the literature [20-22], and Bootstrap methods have been applied to this problem as well [23].
One method, developed by Hartung (1999), considers dependent test statistics testing one null hypothesis, each having a continuous distribution under the null hypothesis [15]. Using the probability integral transformation [24], the dependent p-values are transformed into standardized z values using the probit function. The author then proposed a formula for combining the z values into one z value, using weights for each z value and a correlation coefficient between each pair of z values. This combined z value is then used to test the original null hypothesis. Hartung’s method is only applicable to one-sided hypothesis tests, which makes it suitable for combining p-values resulting from effect size homogeneity tests based on random effects, which have a one-sided alternative hypothesis. However, this approach is not promising since effect size homogeneity tests using random effects have been shown to perform poorly [5,6].
Another possibility is a method developed by Dai et al. [18,10]. Dai’s approach uses a combined test statistic developed earlier by Lancaster [19], adjusted to incorporate correlations between the p-values. Lancaster presented a test statistic that transforms independent p-values into a Chi squared test statistic using the inverse cumulative distribution function of the Gamma distribution. Dai et al. (2014) noted that after introducing correlation between the p-values the Lancaster statistic no longer follows the Chi square distribution, and provided five approaches to approximate the distribution of the Lancaster test statistic under correlation. The approximation is done using the observed test statistics and their corresponding degrees of freedom. The basic approach presented by the authors is the Satterthwaite approximation [25], and the other four approaches are based on the Satterthwaite approximation as well. The authors recommended using the Satterthwaite approximation as the standard procedure to adjust the Lancaster statistic. See section 2.2.2 for a detailed description of the Satterthwaite approximation of the correlated Lancaster test statistic.
This article is structured as follows. Section 2 presents a short description of the tests for effect size homogeneity using a two-sided alternative hypothesis and of the correlated Lancaster procedure. Section 3 describes the case study and the simulation model used. The results are presented in section 4 and the discussion in section 5.
We first introduce notation that will be used throughout this section. For a binary clinical outcome, let X1i and X0i be the number of successes in the treatment group and the control group of study i (i = 1, 2, ..., m) out of n1i and n0i trials, respectively. Methods testing effect size homogeneity assume that E(X1i) = n1i · p1i, with p1i the proportion of success for the treatment group in study i. The odds ratio is calculated by ÔRi = X1i(n0i − X0i) / [(n1i − X1i)X0i], and the standard error of its logarithm is given by
$$\widehat{SE}\left(\log \hat{OR}_i\right) = \sqrt{\frac{1}{X_{1i}} + \frac{1}{n_{1i} - X_{1i}} + \frac{1}{X_{0i}} + \frac{1}{n_{0i} - X_{0i}}}.$$
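As an illustration, the following Python sketch computes the odds ratio and the standard error of its logarithm for a single 2x2 table, following the formulas above. It is a minimal sketch; the function name and the use of NumPy are our own choices, and no continuity correction for zero cells is applied.

```python
import numpy as np

def odds_ratio_and_se(x1, n1, x0, n0):
    """Odds ratio and standard error of its logarithm for one 2x2 table (sketch)."""
    or_hat = (x1 * (n0 - x0)) / ((n1 - x1) * x0)          # OR_i as defined above
    se_log_or = np.sqrt(1.0 / x1 + 1.0 / (n1 - x1) +
                        1.0 / x0 + 1.0 / (n0 - x0))       # SE of log(OR_i)
    return or_hat, se_log_or
```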
All methods testing effect size homogeneity in this section assume that the proportions pji, j = 0,1, satisfy the following form: logit(pji) = αi + (β + γi) · tji, with αi an intercept for study i, β the (mean) effect size, tji a treatment indicator variable for study i with value 1 when j = 1 and 0 otherwise, and γi a study-by-treatment interaction effect for study i. The effect size homogeneity tests then consider the following hypotheses
$$H_0 : \gamma_1 = \gamma_2 = \ldots = \gamma_m = 0 \quad \text{vs} \quad H_1 : \gamma_i \neq \gamma_{i'} \text{ for some } i \neq i' \qquad (1)$$
Tests of effect size homogeneity
In this section the effect size homogeneity tests based on fixed effects are briefly described. Under the null hypothesis in (1), all these tests have a test statistic that approximately follows a Chi squared distribution with m − 1 degrees of freedom [26-34].
The Likelihood ratio test
Assuming a fixed-effects logistic regression model, and using the notation presented above, the success probability pji for a subject in study i receiving treatment j, j = 0,1, is given by pji = exp(αi + (β + γi)tji) / (1 + exp(αi + (β + γi)tji)). The log-likelihood function can then be constructed, up to an additive constant, as
$$\ell(\alpha, \beta, \gamma) = \sum_{i=1}^{m} \sum_{j=0}^{1} \left[ X_{ji} \log p_{ji} + \left(n_{ji} - X_{ji}\right) \log\left(1 - p_{ji}\right) \right] \qquad (2)$$
The Maximum Likelihood estimates are obtained under the full and the null models, denoted from now on by the indices F and N, respectively. The Likelihood ratio test statistic is given by
$$LRT = 2\left(\ell_F - \ell_N\right).$$
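For illustration only, the following Python sketch fits the null and full fixed-effects logistic regression models with statsmodels and computes the likelihood ratio test. The design-matrix construction (reference coding of the study-by-treatment interaction, giving m − 1 interaction parameters) and all names are our own assumptions, not code from the original study.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def likelihood_ratio_test(x1, n1, x0, n0):
    """LRT for effect size homogeneity via fixed-effects logistic regression (sketch)."""
    x1, n1, x0, n0 = (np.asarray(a, dtype=float) for a in (x1, n1, x0, n0))
    m = len(x1)
    successes = np.concatenate([x1, x0])
    failures = np.concatenate([n1 - x1, n0 - x0])
    treatment = np.concatenate([np.ones(m), np.zeros(m)])
    study = np.tile(np.arange(m), 2)

    study_dummies = (study[:, None] == np.arange(m)).astype(float)   # alpha_i terms
    X_null = np.column_stack([study_dummies, treatment])             # alpha_i + beta * t_ji
    # Full model adds reference-coded study-by-treatment interactions (gamma_i)
    interactions = study_dummies[:, 1:] * treatment[:, None]
    X_full = np.column_stack([X_null, interactions])

    endog = np.column_stack([successes, failures])                   # (events, non-events)
    ll_null = sm.GLM(endog, X_null, family=sm.families.Binomial()).fit().llf
    ll_full = sm.GLM(endog, X_full, family=sm.families.Binomial()).fit().llf

    lrt = 2.0 * (ll_full - ll_null)
    return lrt, stats.chi2.sf(lrt, df=m - 1)
```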
Tests based on the Q-statistic
Defining β̂i = log(ÔRi) as the primary effect size estimate for study i, the Q-statistic [26] is given by
$$Q = \sum_{i=1}^{m} w_i\left(\hat{\beta}_i - \bar{\beta}\right)^2, \qquad w_i = \frac{1}{\widehat{SE}\left(\hat{\beta}_i\right)^2},$$
with β̄ a weighted average given by
$$\bar{\beta} = \frac{\sum_{i=1}^{m} w_i \hat{\beta}_i}{\sum_{i=1}^{m} w_i}.$$
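A minimal Python sketch of the Q-statistic on the log odds ratios, using the inverse-variance weights described above, could look as follows; the function name is illustrative, and the inputs can be obtained from the odds-ratio sketch given earlier.

```python
import numpy as np
from scipy import stats

def cochran_q(log_or, se_log_or):
    """Q-statistic on the log odds ratios with inverse-variance weights (sketch)."""
    log_or = np.asarray(log_or, dtype=float)
    w = 1.0 / np.asarray(se_log_or, dtype=float) ** 2   # weights w_i = 1 / SE(beta_i)^2
    beta_bar = np.sum(w * log_or) / np.sum(w)           # weighted average effect size
    q = np.sum(w * (log_or - beta_bar) ** 2)
    return q, stats.chi2.sf(q, df=len(log_or) - 1)      # m - 1 degrees of freedom
```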
Bliss’s test statistic is given by
with
Tests based on the score function
The Breslow-Day approach [28] adjusted by Tarone [29] can be described as follows. Firstly, the Cochran-Mantel-Haenszel pooled odds ratio ÔRC is calculated using
$$\hat{OR}_C = \frac{\sum_{i=1}^{m} X_{1i}\left(n_{0i} - X_{0i}\right)/n_i}{\sum_{i=1}^{m} X_{0i}\left(n_{1i} - X_{1i}\right)/n_i},$$
with ni = n1i + n0i.
Define EC(X1i) = E(X1i | Xi, ÔRC) as the expected value of X1i given Xi = X0i + X1i, where EC(X1i) is obtained by solving the following equation
$$\frac{E_C(X_{1i})\left(n_{0i} - X_i + E_C(X_{1i})\right)}{\left(n_{1i} - E_C(X_{1i})\right)\left(X_i - E_C(X_{1i})\right)} = \hat{OR}_C \qquad (3)$$
The Breslow-Day test statistic with Tarone’s adjustment is given by
$$X^2_{BDT} = \sum_{i=1}^{m} \frac{\left(X_{1i} - E_C(X_{1i})\right)^2}{\mathrm{Var}\left(X_{1i} \mid X_i, \hat{OR}_C\right)} - \frac{\left(\sum_{i=1}^{m}\left(X_{1i} - E_C(X_{1i})\right)\right)^2}{\sum_{i=1}^{m} \mathrm{Var}\left(X_{1i} \mid X_i, \hat{OR}_C\right)},$$
where Var(X1i | Xi, ÔRC) is given by
$$\mathrm{Var}\left(X_{1i} \mid X_i, \hat{OR}_C\right) = \left(\frac{1}{E_C(X_{1i})} + \frac{1}{n_{1i} - E_C(X_{1i})} + \frac{1}{X_i - E_C(X_{1i})} + \frac{1}{n_{0i} - X_i + E_C(X_{1i})}\right)^{-1} \qquad (4)$$
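The following Python sketch illustrates one way to compute the Breslow-Day-Tarone statistic, solving equation (3) numerically with a root finder instead of in closed form. The bracketing of the root and all names are our own choices, and no zero-cell corrections are applied.

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

def breslow_day_tarone(x1, n1, x0, n0):
    """Breslow-Day homogeneity test with Tarone's adjustment (sketch)."""
    x1, n1, x0, n0 = (np.asarray(a, dtype=float) for a in (x1, n1, x0, n0))
    n = n1 + n0
    xi = x1 + x0
    # Cochran-Mantel-Haenszel pooled odds ratio
    or_c = np.sum(x1 * (n0 - x0) / n) / np.sum(x0 * (n1 - x1) / n)

    m = len(x1)
    e = np.empty(m)
    v = np.empty(m)
    for i in range(m):
        # Equation (3): E(n0 - X + E) = OR_C * (n1 - E)(X - E), solved numerically
        def f(E, i=i):
            return E * (n0[i] - xi[i] + E) - or_c * (n1[i] - E) * (xi[i] - E)
        lo = max(0.0, xi[i] - n0[i]) + 1e-6
        hi = min(n1[i], xi[i]) - 1e-6
        e[i] = brentq(f, lo, hi)
        # Equation (4): conditional variance given the margins and OR_C
        v[i] = 1.0 / (1.0 / e[i] + 1.0 / (n1[i] - e[i])
                      + 1.0 / (xi[i] - e[i]) + 1.0 / (n0[i] - xi[i] + e[i]))

    stat = np.sum((x1 - e) ** 2 / v) - np.sum(x1 - e) ** 2 / np.sum(v)  # Tarone term
    return stat, stats.chi2.sf(stat, df=m - 1)
```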
Zelen’s test statistic [30], later corrected by Halperin et al. [31], can be described as follows. Firstly, the odds ratio ÔRZ is given by ÔRZ = exp(β̂(N)), with β̂(N) the Maximum Likelihood estimator of β in the fixed-effects logistic regression model (2) under the null hypothesis of effect size homogeneity. The corrected Zelen statistic for testing effect size homogeneity is given by
with EZ(X1i) now obtained by solving equation (3) with ÔRC replaced by ÔRZ, and Var(X1i | Xi, ÔRZ) obtained from equation (4) with EC(X1i) replaced by EZ(X1i).
Liang and Self (1985) [32] developed the following test statistic using ÔRCL = exp(β̂(CL)), with β̂(CL) the Maximum Likelihood estimator from the conditional likelihood function given Xi and γi = 0. The test statistic is given by
with ECL(X1i) now obtained by solving equation (3) with ÔRC replaced by ÔRCL, and Var(X1i | Xi, ÔRCL) obtained from equation (4) with EC(X1i) replaced by ECL(X1i).
Woolf statistic
The Woolf statistic [33] is given by
Peto test
The Peto statistic is given by [34]
where ni = n1i + n0i and
$$V_i = \frac{X_i\left(n_i - X_i\right) n_{1i} n_{0i}}{n_i^2\left(n_i - 1\right)}.$$
Correlated Lancaster procedure for combining correlated p-values
In this section we describe the correlated Lancaster procedure [18] for combining the dependent p-values resulting from the eight effect size homogeneity tests described above. Lancaster’s method assumes there are n statistical tests, each resulting in a test statistic Ti, i = 1, ..., n, with degrees of freedom dfi and a p-value pi. By the probability integral transformation [24], the values 1 − pi are uniformly distributed on (0,1) under the null hypothesis. Lancaster showed that for n independent p-values
$$T = \sum_{i=1}^{n} \gamma^{-1}\!\left(\frac{df_i}{2}, 2\right)\left(1 - p_i\right) \sim \chi^2_{df},$$
where γ−1(dfi/2, 2) is the inverse cumulative distribution function of a Gamma distribution with shape parameter dfi/2 and scale parameter 2, evaluated here at 1 − pi, and $df = \sum_{i=1}^{n} df_i$
[19]. Dai et al. (2014), noting that for correlated p-values the statistic T no longer follows a $\chi^2_{df}$ distribution, suggested five methods to approximate the distribution of T. The authors recommended a method using the Satterthwaite approximation [25] as the standard procedure to adjust the T statistic, which can be described as follows. For a set of correlated p-values, the first two moments of T under the null hypothesis are approximated using the correlations ρik between the transformed test statistics, and the adjusted statistic is defined as
$$T_A = c \cdot T,$$
which approximately follows a $\chi^2_{\nu}$ distribution,
where c and ν are chosen so that the first and second moments of the scaled chi-square distribution and the distribution of T under the null are identical [9,10]. The statistic TA can be used for testing the null hypothesis of effect size homogeneity. The authors presented several methods to estimate ρik [18].
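To make the procedure concrete, the following Python sketch combines a set of dependent p-values with the correlated Lancaster statistic and a Satterthwaite-type adjustment. It assumes a common correlation ρik = ρ, approximates the covariance of each pair of transformed statistics as ρ · 2·sqrt(dfi·dfk), and matches moments via TA = c·T with c = 2E(T)/Var(T) and ν = 2E(T)²/Var(T); the exact moment estimators in Dai et al. [18] may differ, so this is an illustrative sketch rather than their implementation.

```python
import numpy as np
from scipy import stats

def correlated_lancaster(p_values, dfs, rho=0.75):
    """Correlated Lancaster combination with a Satterthwaite-type adjustment (sketch)."""
    p = np.asarray(p_values, dtype=float)
    df = np.asarray(dfs, dtype=float)

    # Lancaster transformation: inverse Gamma(df_i/2, scale 2) CDF evaluated at 1 - p_i
    transformed = stats.gamma.ppf(1.0 - p, a=df / 2.0, scale=2.0)
    T = transformed.sum()

    # First two moments of T under the null hypothesis
    mean_T = df.sum()                          # E(T) = sum of df_i
    var_T = 2.0 * df.sum()                     # variance if the tests were independent
    for i in range(len(df)):
        for k in range(len(df)):
            if i != k:
                # Assumed covariance approximation between transformed statistics
                var_T += rho * 2.0 * np.sqrt(df[i] * df[k])

    # Moment matching: T_A = c * T approximately follows a chi-square with nu df
    c = 2.0 * mean_T / var_T
    nu = 2.0 * mean_T ** 2 / var_T
    T_A = c * T
    return T_A, nu, stats.chi2.sf(T_A, nu)

# Hypothetical example: eight homogeneity tests, each with m - 1 = 6 degrees of freedom
p_vals = [0.012, 0.030, 0.025, 0.008, 0.020, 0.002, 0.018, 0.004]  # illustrative values only
print(correlated_lancaster(p_vals, dfs=[6] * 8, rho=0.75))
```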
Case study
Bein et al. (2021) carried out a systematic review and meta-analysis to investigate the risk of adverse pregnancy, perinatal and early childhood outcomes among women with subclinical hypothyroidism treated with Levothyroxine [36]. The review included a meta-analysis of preterm delivery associated with Levothyroxine treatment versus no treatment among women with subclinical hypothyroidism during pregnancy. This meta-analysis included seven studies, each having a group treated with Levothyroxine and a control group. For each group the number of preterm deliveries (events) was recorded. This meta-analysis was used here as a case study and the data are shown in Table 1.
Table 1 : Meta-analysis studying the effect of Levothyroxine on preterm delivery among women with subclinical hypothyroidism (Bein et al. 2021).
Simulation model
The simulation model applied can be described as follows [5,35]. In total m studies are created, and for the ith study the group sizes nji, j = 0,1, are independently drawn from a Poisson distribution with parameter δ. The random variables Xji are then drawn independently from the Binomial distribution Bin(nji, pji), with p0i = exp(αi)∕(1 + exp(αi)) and p1i = exp(αi + β + γi)∕(1 + exp(αi + β + γi)). Here αi is a study-specific intercept, β is a constant effect size, and γi is a study-specific effect size, with αi ~ N(α, σ²α) and γi ~ N(0, τ²). Two scenarios are considered: effect size homogeneity (τ² = 0) and effect size heterogeneity (τ² > 0). The following parameter values were used in the simulation study: m = 5, 10, 20, 30; δ = 50; α = 0; β = 0, 2; σ²α = 1; τ² = 0, 0.15, 0.3, 0.5. One thousand simulation runs were carried out for each parameter combination, and the average Type I error rate and the average statistical power at a significance level of 0.05 were calculated and presented in the results section.
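A minimal Python sketch of this data-generating model is given below; the random seed and the function name are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2022)  # arbitrary seed

def simulate_meta_analysis(m=10, delta=50, alpha=0.0, beta=0.0,
                           sigma2_alpha=1.0, tau2=0.15):
    """Generate one simulated meta-analysis of m two-arm studies (sketch of the model above)."""
    alpha_i = rng.normal(alpha, np.sqrt(sigma2_alpha), size=m)       # study-specific intercepts
    gamma_i = rng.normal(0.0, np.sqrt(tau2), size=m) if tau2 > 0 else np.zeros(m)
    n0 = rng.poisson(delta, size=m)                                  # control group sizes
    n1 = rng.poisson(delta, size=m)                                  # treatment group sizes
    p0 = 1.0 / (1.0 + np.exp(-alpha_i))                              # control success probability
    p1 = 1.0 / (1.0 + np.exp(-(alpha_i + beta + gamma_i)))           # treatment success probability
    x0 = rng.binomial(n0, p0)
    x1 = rng.binomial(n1, p1)
    return x1, n1, x0, n0

# Example: one heterogeneous dataset under beta = 2 and tau2 = 0.3
x1, n1, x0, n0 = simulate_meta_analysis(m=5, beta=2.0, tau2=0.3)
```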
This section presents the results of the case study and the simulation study. For the correlated Lancaster method we used ρik = 0.25, 0.5, 0.75.
Case study
We applied the Breslow-Day test adjusted by Tarone (BDT), the Bliss test, the Liang & Self test, the Likelihood ratio test (LRT), the Peto test, the Q-statistic (Q), the Woolf test and the Zelen test to test the effect size homogeneity hypothesis for the meta-analysis in Bein et al. (2021). Subsequently, we used the correlated Lancaster method (CORR. LANC.) to combine the eight resulting p-values. The resulting p-values are shown in Table 2. All homogeneity tests and the correlated Lancaster method (for all values of ρik) rejected the effect size homogeneity hypothesis (p < 0.05). The effect size homogeneity tests based on the score function and the Peto test produced similar p-values. The Q-statistic and the Bliss statistic produced substantially lower p-values than all other tests. For the correlated Lancaster method the p-value increased as the value of ρik increased.
Table 2 : p-values from the effect size homogeneity tests and the correlated Lancaster method applied to meta-analysis from Bein et al. (2021).
Results of simulation study
Table 3 shows the average Type I error rates of the Breslow-Day test, the Bliss test, the Liang & Self test, the Likelihood ratio test, the Peto test, the Q-statistic, the Woolf test, the Zelen test and the correlated Lancaster method. For the correlated Lancaster method, the value of the correlation coefficient assumed between the p-values resulting from the effect size homogeneity test statistics affects the Type I error rate, with the Type I error closest to the nominal value when ρik = 0.75. In case of no effect size, all homogeneity tests are conservative, while the correlated Lancaster method is liberal. This pattern is consistent across the different numbers of studies included in a meta-analysis. In case β = 2, all effect size homogeneity tests, with the exception of the Liang & Self, the Likelihood ratio and the Zelen tests, have a Type I error rate below the nominal value. The Likelihood ratio test has a Type I error rate above the nominal level for all values of m. The Liang & Self test has a Type I error rate equal to the nominal value when m = 5, 10. The Zelen test has a nominal Type I error rate when m = 5 but a Type I error rate above the nominal value when m = 10. As m increases, the Type I error rates of the Liang & Self and the Zelen tests increase well above the nominal value. The Q-statistic and the Bliss statistic have Type I error rates lower than the nominal value, and these rates decrease as m increases. When ρik = 0.75 and for all values of m, the correlated Lancaster method has a Type I error rate either equal to or below the nominal value.
Table 3 : Type I error rates for the fixed-effects homogeneity tests and the correlated Lancaster method (τ² = 0).
The statistical power is shown in Figures 1-3. Since the Type I error was closest to the nominal value when ρik = 0.75, the statistical power of the correlated Lancaster method is only shown for the case ρik = 0.75. When there is no effect size, β = 0, and heterogeneity is low, τ² = 0.15, all methods have a statistical power lower than 30% when m = 5, with the correlated Lancaster method showing higher statistical power than the other methods. Of the other methods, the Likelihood ratio, Liang & Self and Breslow-Day tests have slightly higher statistical power, followed by the Zelen, Peto, Q-statistic, Woolf and Bliss tests. The same pattern remains as m increases, with the statistical power of all methods increasing but remaining below 80%. In case of moderate and high heterogeneity (τ² = 0.3, 0.5), the statistical power increases for all methods when m = 5, and all methods approach or surpass 95% as m increases. The correlated Lancaster method still has higher statistical power than the other methods.
Figure 1 : Statistical power in case of weak effect size heterogeneity: the statistical power at 0.05 significance level is calculated as the average of 1000 simulation runs.
Figure 2 : Statistical power in case of moderate effect size heterogeneity: the statistical power at 0.05 significance level is calculated as the average of 1000 simulation runs.
Figure 3 : Statistical power in case of strong effect size heterogeneity: the statistical power at 0.05 significance level is calculated as the average of 1000 simulation runs.
When β = 2, τ² = 0.15 and m = 5, the Likelihood ratio test has the highest statistical power, followed by the Zelen, Liang & Self, BDT and correlated Lancaster methods. The Q-statistic, Bliss, Woolf and Peto methods clearly have lower statistical power. All methods, however, have a statistical power below 25%. As m increases, the statistical power of all methods increases slightly, except for the Peto method, which remains below 10%. As the heterogeneity level increases, τ² = 0.3, 0.5, the statistical power increases for all methods when m = 5. The same pattern persists, with the Likelihood ratio, Zelen, Liang & Self, Breslow-Day tests and the correlated Lancaster method still having the highest power, although the statistical power remains below 50%. As m increases, the statistical power increases for all methods. When τ² = 0.5 the Likelihood ratio test’s statistical power exceeds the nominal level, while the Zelen, Liang & Self, Breslow-Day and correlated Lancaster methods have statistical power close or equal to the nominal value when m = 30.
The purpose of this article was to apply an approach for combining the p-values of different effect size homogeneity tests applied to a meta-analysis of binary outcomes. The proposed approach is an adjustment of a method introduced by Lancaster (1961) to combine independent p-values into a Chi squared test statistic. The Lancaster method was adjusted to incorporate correlation between dependent p-values, and the Satterthwaite method was used to approximate the distribution of the correlated Lancaster test statistic. The method was originally developed for aggregating effects in high-dimensional genetic data analysis (Dai et al. 2012). To study the performance of the proposed method we analyzed a real-life meta-analysis, and we carried out a simulation study with multiple scenarios including different numbers of studies, different effect sizes and different levels of heterogeneity. We tested the null hypothesis of effect size homogeneity using the Breslow-Day test, the Bliss test, the Liang & Self test, the Likelihood ratio test, the Peto test, the Q-statistic, the Woolf test, and the Zelen test for the case study and for the simulated datasets. Subsequently, we combined the resulting p-values from these eight effect size homogeneity tests using the correlated Lancaster method. For the case study we compared the performance of the correlated Lancaster method to that of the eight effect size homogeneity tests using the p-values. For the simulation study we made the comparison using the average Type I error rates and the average statistical power.
Some findings regarding the effect size homogeneity tests have been established earlier in the literature. The Likelihood ratio test is liberal, and tests based on the score function (the Breslow-Day, Liang & Self and Zelen tests) perform well when the number of studies is small [3,5]. However, as the number of studies increases these score function tests tend to become liberal [5]. The Q-statistic and the Bliss statistic are conservative and become more conservative as the number of studies increases [3,5].
The correlated Lancaster method is sensitive to the value of the correlation coefficient assumed between the test statistics, as the combined p-value is positively correlated with the value of the correlation coefficient. This can be explained by the fact that higher values of the correlation coefficient result in a larger variance of the correlated Lancaster statistic. This produces a smaller value of the correction constant c and in turn a smaller value of the correlated Lancaster test statistic, and thereby a larger p-value. The correlated Lancaster method performs best in case of a high positive correlation between the dependent test statistics, namely a correlation coefficient value of 0.75. The correlated Lancaster method performs quite well in the presence of an effect size, having a Type I error rate at or below the nominal value. Unlike the effect size homogeneity tests considered here, the correlated Lancaster method is robust to the number of studies in a meta-analysis. The statistical power of the correlated Lancaster method is similar to that of the Breslow-Day, Liang & Self and Zelen tests when the number of studies is small. As the number of studies increases, these three tests have superior statistical power to the correlated Lancaster method. This can be explained by the inflated Type I error of the Breslow-Day, Liang & Self and Zelen tests when the number of studies increases.
The correlated Lancaster method performs well and is easy to implement, but a few reservations need to be mentioned. The assumption of positive correlation between the dependent test statistics is intuitively reasonable. However, the method’s performance is sensitive to the value of the correlation coefficient, and the method performs best when this value is high. It is therefore recommended to carry out the correlated Lancaster method with different values for the correlation coefficient between the resulting p-values. An overall significant result can then be reported only when there is consensus among the significance results based on the different values of the correlation coefficient. In addition, we only applied the method to balanced meta-analyses of binary outcomes. Further research is warranted to investigate the method’s performance on unbalanced meta-analyses, meta-analyses of continuous outcomes and meta-analyses of rare binary events.
The author declares that he has no competing interests.
The author would like to thank the two editors for their very useful comments which led to a clear improvement of this manuscript.
Editor: Dr. Nicola Shaw. Algoma University, Canada.
Received: 03-May-2022 Final Revised: 08-Aug-2022
Accepted: 24-Sep-2022 Published: 10-Oct-2022