Osama Almalik

*Correspondence: Osama Almalik o.almalik@tue.nl

Researcher, Department of Mathematics & Computer Science, Eindhoven University of Technology, Netherlands.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Testing effect size homogeneity is an essential part of conducting a meta-analysis. Comparative studies of effect size homogeneity tests for binary outcomes are found in the literature, but no test has emerged as an absolute winner. An alternative approach is to carry out multiple effect size homogeneity tests on the same meta-analysis and combine the resulting dependent p-values. In this article we applied the correlated Lancaster method for dependent statistical tests. To investigate the proposed approach’s performance, we applied eight different effect size homogeneity tests to a case study and to simulated datasets, and combined the resulting p-values. The proposed method performs similarly to tests based on the score function in the presence of an effect size when the number of studies is small, but outperforms these tests as the number of studies increases. However, the method’s performance is sensitive to the correlation coefficient value assumed between dependent tests, and it only performs well when this value is high. More research is needed to investigate the method’s assumptions on correlation in the case of effect size homogeneity tests, and to study the method’s performance in meta-analysis of continuous outcomes.

**Keywords**: Meta-analysis, 2x2 contingency tables, effect size, homogeneity test, dependent p-values

Meta-analysts still carry out effect size homogeneity tests on a regular basis [1,2]. Several methods have been developed to test effect size homogeneity in meta-analysis with multiple 2x2 contingency tables, and the performance of these methods has been studied in the literature. Jones et al. (1989) studied the performance of the Likelihood Ratio test, Pearson’s chi square test, the Breslow-Day test and Tarone’s adjustment to it, a conditional score test and Liang and Self’s normal approximation to the score test [3]. Gavaghan et al. (2000) compared the performance of the Peto statistic, the Woolf statistic, the Q-statistic (applied to the estimates of the risk difference), Liang and Self’s normal approximation to the score test, and the Breslow-Day test [4]. Almalik and van den Heuvel (2018) compared the performance of the fixed effects logistic regression analysis, the random effects logistic regression analysis, the Q-statistic, the Bliss statistic, the *I^{2}*, the Breslow-Day test, the Zelen statistic, Liang and Self’s

The earliest method to combine p-values resulting from independent statistical tests is Fisher’s method [7]. Since then many methods have been proposed to combine p-values resulting from independent statistical tests; see [8,9] for an overview. The p-values in our context result from different statistical tests for effect size homogeneity testing the same null hypothesis. Since these p-values result from the same meta-analysis dataset, they are correlated [16]. Therefore, we only consider methods that combine p-values resulting from dependent tests, and we present a brief review of these methods here.
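For reference, Fisher's method for independent p-values can be sketched as follows. This is a minimal illustration that assumes independent tests and uses the standard result that minus twice the sum of the log p-values follows a chi-square distribution with 2n degrees of freedom under the null hypothesis. The function name `fisher_combine` and the example p-values are hypothetical.

```python
import numpy as np
from scipy import stats


def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method (illustrative helper)."""
    p = np.asarray(pvalues, dtype=float)
    # Under H0 and independence, -2 * sum(log p_i) ~ chi-square with 2n df.
    statistic = -2.0 * np.sum(np.log(p))
    df = 2 * p.size
    return statistic, stats.chi2.sf(statistic, df)


# Hypothetical p-values, for illustration only.
stat, p_comb = fisher_combine([0.04, 0.10, 0.20])
print(f"Fisher statistic = {stat:.3f}, combined p-value = {p_comb:.4f}")
```

Because the reference distribution relies on independence, applying this function to p-values computed from the same dataset would not be appropriate, which is the motivation for the dependent-test methods reviewed next.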

Brown (1975) [11] extended Fisher’s method to the case where the p-values result from test statistics having a multivariate normal distribution with a known covariance matrix. Kost and McDermott (2002) [12] extended Brown’s method analytically for unknown covariance matrices. Other methods to calculate the covariance matrix have been proposed in the literature [13]. Brown’s method and its improvements can only be applied to the case where the test statistics follow a multivariate normal distribution. Since most test statistics used to test effect size homogeneity approximately follow the Chi Square distribution, these methods cannot be applied.

Makambi (2003) modified the Fisher statistic, using weights derived from the data, to accommodate correlation between the p-values [16]. However, the author applied methods developed by Brown to derive the first two moments of the weighted distribution and to estimate the correlation coefficients, which implies that the same distributional assumptions made by Brown must hold here. Similar modifications of the Fisher statistic can be found in the literature [17]. Yang (2010) introduced an approximation of the null distribution of the Fisher statistic, based on the Lindeberg Central Limit theorem. However, one of its conditions, namely that the test statistics are m-dependent, clearly does not hold here [14]. Another approximation introduced in Yang (2010), based on permutations, is said by the author to be numerically intensive [14]. Methods combining dependent p-values for very specific applications have been proposed in the literature [20-22], and Bootstrap methods have been applied to this problem as well [23].

One method developed by Hartung (1999) considered dependent test statistics testing one null hypothesis, each having a continuous distribution under the null hypothesis [15]. Using the probability integral transformation [24], the dependent p-values are transformed into standardized z values using the probit function. The author then proposed a formula for combining the z values into one z value, using a weight for each z value and a correlation coefficient between each pair of z values. This combined z value is then used to test the original null hypothesis. Hartung’s method is only applicable to one-sided hypothesis tests, which makes it applicable for combining p-values resulting from effect size homogeneity tests based on random effects, which have a one-sided alternative hypothesis. However, this approach is not promising since effect size homogeneity tests using random effects have been shown to perform poorly [5,6].
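The idea of combining probit-transformed p-values under correlation can be sketched as below. This is a simplified illustration, not Hartung's estimator: it assumes a single common correlation `rho` that the user supplies for every pair of z values, whereas Hartung's procedure works with an estimated correlation. The function name `combine_z_common_rho` and the example inputs are hypothetical.

```python
import numpy as np
from scipy import stats


def combine_z_common_rho(pvalues, rho, weights=None):
    """Weighted combination of probit-transformed one-sided p-values.

    Assumes (for illustration) that every pair of z values has the same
    known correlation `rho`.
    """
    p = np.asarray(pvalues, dtype=float)
    w = np.ones_like(p) if weights is None else np.asarray(weights, dtype=float)
    # Probit transform: small one-sided p-values map to large positive z.
    z = stats.norm.isf(p)
    # Variance of sum(w_i * z_i) for unit-variance z with common correlation rho.
    var = (1.0 - rho) * np.sum(w**2) + rho * np.sum(w) ** 2
    z_comb = np.sum(w * z) / np.sqrt(var)
    return z_comb, stats.norm.sf(z_comb)


# Hypothetical one-sided p-values and an assumed correlation of 0.5.
z_comb, p_comb = combine_z_common_rho([0.03, 0.06, 0.10], rho=0.5)
print(f"combined z = {z_comb:.3f}, combined p-value = {p_comb:.4f}")
```

With `rho=0` the sketch reduces to the usual weighted z combination for independent tests, and with `rho=1` and identical p-values it returns that same p-value, which is a useful sanity check on the variance term.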

Another possibility is a method developed by Dai et al. [18,10]. Dai’s approach uses a combined test statistic developed earlier by Lancaster [19], but adjusted to incorporate correlations between the p-values. Lancaster presented a test statistic that transforms the independent p-values into a Chi squared test statistic using the inverse cumulative distribution function of the Gamma distribution. Dai et al. (2014) noted that after introducing correlation between the p-values the Lancaster statistic no longer follows the Chi square distribution, and provided five approaches to approximate the distribution of the Lancaster test statistic under correlation. The approximation is done using the observed test statistics and their corresponding degrees of freedom. The basic approach presented by the authors is the Satterthwaite approximation [25], and the other four approaches are based on the Satterthwaite approximation as well. The authors recommended using the Satterthwaite approximation as the standard procedure to adjust the Lancaster statistic. See section 2.2.2 for a detailed description of the Satterthwaite approximation of the correlated Lancaster test statistic.
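The Lancaster transformation for the independent case can be sketched as follows. The sketch assumes independent p-values and relies on the fact that a Gamma distribution with shape df/2 and scale 2 is a chi-square distribution with df degrees of freedom, so the sum of the transformed values is referred to a chi-square distribution with the summed degrees of freedom. The function name `lancaster_independent` and the example inputs are hypothetical.

```python
import numpy as np
from scipy import stats


def lancaster_independent(pvalues, dfs):
    """Lancaster's combination for independent p-values (illustrative helper)."""
    p = np.asarray(pvalues, dtype=float)
    df = np.asarray(dfs, dtype=float)
    # Upper-tail inverse of Gamma(shape=df/2, scale=2), i.e. the chi-square(df)
    # value whose right-tail probability equals p.
    t = stats.gamma.isf(p, a=df / 2.0, scale=2.0)
    total = t.sum()
    # Under H0 and independence the sum is chi-square with sum(df) df.
    return total, stats.chi2.sf(total, df.sum())


# Hypothetical p-values, each paired with 6 degrees of freedom.
total, p_comb = lancaster_independent([0.02, 0.08, 0.15], [6, 6, 6])
print(f"Lancaster statistic = {total:.3f}, combined p-value = {p_comb:.4f}")
```

When every test is assigned 2 degrees of freedom, the transformed value equals minus twice the log p-value, so the statistic coincides with Fisher's statistic.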

This article is structured as follows. Section 2 presents a short description of the tests for effect size homogeneity with a two-sided alternative hypothesis and of the correlated Lancaster procedure. Section 3 describes the case study and the simulation model used. The results are presented in Section 4 and the discussion follows in Section 5.

We first introduce notation that will be used throughout this section. For a binary clinical outcome, let *X _{1i }*and

All methods testing effect size homogeneity in this section assume that the proportion *p _{ji}, j = 0,1*, satisfy the following form: logit

*H_{0} : γ_{1} = γ_{2} = … = γ_{m} = 0 vs H_{1} : γ_{i} ≠ γ_{i'} for some i ≠ i'* (1)

**Tests of effect size homogeneity**

In this section the effect size homogeneity tests based on fixed-effects are briefly described. Under the null hypothesis in (1), all these tests have an approximately Chi squared distributed test statistic with *m-1* degrees of freedom [26-34].
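As an illustration of a fixed-effects homogeneity test referred to a Chi squared distribution with *m-1* degrees of freedom, the sketch below computes a Q-type statistic on study-level log odds ratios. It assumes the standard inverse-variance weights for the log odds ratio and no zero cells; the exact weights used for the individual statistics in this article are not reproduced in this text, so this should be read as a generic example. The function name `q_test_log_or` and the counts are hypothetical and are not the data of Table 1.

```python
import numpy as np
from scipy import stats


def q_test_log_or(x1, n1, x0, n0):
    """Q-type homogeneity test on log odds ratios (illustrative helper).

    x1, n1: events and group sizes in the treated groups, one entry per study.
    x0, n0: events and group sizes in the control groups.
    Assumes all four cells of every 2x2 table are non-zero.
    """
    x1, n1, x0, n0 = (np.asarray(v, dtype=float) for v in (x1, n1, x0, n0))
    # Study-level log odds ratios and their large-sample variances.
    log_or = np.log(x1 * (n0 - x0) / (x0 * (n1 - x1)))
    var = 1.0 / x1 + 1.0 / (n1 - x1) + 1.0 / x0 + 1.0 / (n0 - x0)
    w = 1.0 / var
    # Inverse-variance weighted average of the log odds ratios.
    pooled = np.sum(w * log_or) / np.sum(w)
    q = np.sum(w * (log_or - pooled) ** 2)
    df = log_or.size - 1
    return q, df, stats.chi2.sf(q, df)


# Hypothetical counts for four studies.
q, df, p = q_test_log_or(
    x1=[12, 8, 20, 5], n1=[100, 90, 150, 60],
    x0=[10, 15, 12, 9], n0=[100, 95, 140, 65],
)
print(f"Q = {q:.3f} on {df} df, p-value = {p:.4f}")
```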

*The Likelihood ratio test*

Assuming a fixed-effects logistic regression model, and using the notation presented above, the success probability *p _{ji}* for a subject in study

The Maximum Likelihood estimates are obtained under the Full and the Null models, denoted from now on by indexes *F* and *N*, respectively. The Likelihood ratio test statistic is given by

*Tests based on the Q-statistic*

Defining *β _{i}= log(OR_{i})* as the primary effect size, the Q-statistic [26] is given by

with *β* a weighted average given by

Bliss’s test statistic is given by

with

*Tests based on the score function*

The Breslow-Day approach [28] adjusted by Tarone [29] can be described as follows. Firstly, the Cochran-Mantel-Haenszel pooled odds ratio *OR_{C}* is calculated using

Define *E _{c}(X_{1i})=E(X_{1i}|X_{i}, O^R_{c})* as the expected value of

(3)

The Tarone-adjusted Breslow-Day statistic is given by

where *var(X_{1i} | X_{i}, OR_{c})* is given by

(4)

Zelen’s test statistic [30], later corrected by Halperin et al. [31], can be described as follows. Firstly, the odds ratio ORZ is given by *OR _{Z} = exp(β_{(N)})* with

with *E _{Z}(X_{1i})* now obtained by solving equation (3) with

Liang and Self (1985) [32] developed the following test statistic using *OR _{CL} = exp (β_{(CL)})* with

with *E _{CL} (X_{1i})* now obtained by solving equation (3) with

*Woolf statistic*

The Woolf statistic [33] is given by

*Peto test*

The Peto statistic is given by [34]

where *n _{i} = n_{1i} + n_{0i}* and

**Correlated Lancaster procedure for combining correlated p-values**

In this section we describe the correlated Lancaster procedure [18] for combining dependent p-values resulting from the above mentioned eight effect size homogeneity tests. Lancaster’s method assumes there are *n* statistical tests each resulting in a test statistic *T _{i}, i=1,....n*, degrees of freedom,

where γ^{-1}* _{(dfi/2,2)}*, is the inverse cumulative distribution function of a Gamma distribution with a shape parameter

[19]. Dai et al. (2014), noting that for correlated p-values the statistic *T* no longer follows a *χ_{df}^{2}* distribution, suggested five methods to approximate the distribution of

defined the statistic

where

where *c* and *ν* are chosen so that the first and second moments of the scaled chi-square distribution and the distribution of *T* under the null are identical [9,10]. The statistic *T _{A}* can be used for testing the null hypothesis of effect size homogeneity. The authors presented several methods to estimate

**Case study**

Bein et al. (2021) carried out a systematic review and meta-analysis to investigate the risk of adverse pregnancy, perinatal and early childhood outcomes among women with subclinical hypothyroidism treated with levothyroxine [36]. Part of this extensive study was a meta-analysis of preterm delivery associated with levothyroxine treatment versus no treatment among women with subclinical hypothyroidism during pregnancy. This meta-analysis included seven studies, each study having a group treated with levothyroxine and a control group. For each group the number of preterm deliveries (events) was noted. This meta-analysis was used here as a case study and the data are shown in Table 1.

Table 1 **: Meta-analysis on studying the effect of Levothyroxine on preterm pregnancy among women with subclinical hypothyroidism (Bein et al. 2021).**

**Simulation model**

The simulation model applied can be described as follows [5,35]. In total *m* studies are created, and for the *i ^{th}* study

This section presents the results of the case study and the simulation study. For the correlated Lancaster method we used *ρ_{ik}* = 0.25, 0.5 and 0.75.

**Case study**

We applied the Breslow-Day test adjusted by Tarone (BDT), the Bliss test, the Liang & Self test, the Likelihood ratio test (LRT), the Peto test, the Q-statistic (Q), the Woolf test and the Zelen test to test the effect size homogeneity hypothesis for the meta-analysis study in Bein et al. (2021). Subsequently, we used the correlated Lancaster method (CORR. LANC.) to combine the eight resulting p-values. The resulting p-values are shown in Table 2. All homogeneity tests and the correlated Lancaster method (for all values of *ρ_{ik}*) rejected the effect size homogeneity hypothesis (p<0.05). The effect size homogeneity tests based on the score function and the Peto test produced similar p-values. The Q-statistic and the Bliss statistic produced substantially lower p-values than all other tests. For the correlated Lancaster method the p-value increased as the

Table 2 **: p-values from the effect size homogeneity tests and the correlated Lancaster method applied to meta-analysis from Bein et al. (2021).**

**Results of simulation study**

Table 3 shows the average Type I error rates of the Breslow-Day test, the Bliss test, the Liang & Self test, the Peto test, the Q-statistic, the Woolf test, the Zelen test and the correlated Lancaster method. For the correlated Lancaster method, the value of the correlation coefficient assumed between the p-values resulting from the effect size homogeneity test statistics affects the Type I error, with the Type I error closest to the nominal value when *ρ_{ik}* = 0.75. In case of no effect size, all homogeneity tests are conservative, while the combined Lancaster method is liberal. This pattern is consistent across the different numbers of studies included in a meta-analysis. In case

Table 3 **: Type I error for fixed-effects homogeneity tests and the correlated Lancaster method τ^{2}=0.**

The statistical power is shown in Figures 1-3. Since the Type I error was closest to the nominal value when *ρ _{ik}* = 0.75, the statistical power of the correlated Lancaster method is only shown for the case

Figure 1 **:** **Statistical power in case of weak effect size heterogeneity: the statistical power at 0.05 significance level is calculated as the average of 1000 simulation runs.**

Figure 2 **:** **Statistical power in case of moderate effect size heterogeneity: the statistical power at 0.05 significance level is calculated as the average of 1000 simulation runs.**

Figure 3 **:** **Statistical power in case of strong effect size heterogeneity: the statistical power at 0.05 significance level is calculated as the average of 1000 simulation runs.**

When *β* = 2, *τ ^{2}* = 0.15 and

The purpose of this article was to apply an approach to combine p-values of different effect size homogeneity tests applied to a meta-analysis of binary outcomes. The proposed approach is an adjustment of a method introduced in Lancaster (1961) to combine independent p-values into a Chi squared test statistic. The Lancaster method was adjusted to incorporate correlation between dependent p-values. The Satterthwaite method was used to approximate the correlated Lancaster test statistic in case of dependent p-values. The method was originally developed for aggregating effects in high-dimensional genetic data analysis (Dai et al. 2014). To study the performance of the proposed method we analyzed a real-life meta-analysis, and we carried out a simulation study with multiple scenarios including different numbers of studies, different effect sizes and different levels of heterogeneity. We tested the null hypothesis of effect size homogeneity using the Breslow-Day test, the Bliss test, the Liang & Self test, the Likelihood ratio test, the Peto test, the Q-statistic, the Woolf test, and the Zelen test for the case study and for the simulated datasets. Subsequently, we combined the resulting p-values from these eight effect size homogeneity tests using the correlated Lancaster method. For the case study we compared the performance of the correlated Lancaster method to that of the eight effect size homogeneity tests using the p-values. For the simulation study we did the comparison using the average Type I error rates and the average statistical power.

Some findings regarding the effect size homogeneity tests have been established earlier in the literature. The Likelihood ratio test is liberal, and tests based on the score function (Breslow-Day, Liang & Self and Zelen tests) perform well when the number of studies is small [3,5]. However, as the number of studies increases these score function tests tend to become liberal [5]. The Q-statistic and the Bliss statistic are conservative and they get more conservative as the number of studies increases [3,5].

The correlated Lancaster method is sensitive to the value of the correlation coefficient between the test statistics, as the combined p-value increases with the value of the correlation coefficient. This can be explained by the fact that higher values of the correlation coefficient result in a larger variance of the correlated Lancaster statistic. This produces a smaller value of the correction constant *c* and in turn a smaller value of the correlated Lancaster test statistic and thereby a larger p-value. The correlated Lancaster method performs best in case of a high positive correlation between the dependent test statistics, namely a correlation coefficient value of 0.75. The correlated Lancaster method performs quite well in the presence of an effect size, having a Type I error rate always within the nominal value. Unlike all effect size homogeneity tests considered here, the correlated Lancaster method is robust to the number of studies in a meta-analysis. The statistical power of the correlated Lancaster method is similar to that of the Breslow-Day, Liang & Self and the Zelen tests when the number of studies is small. As the number of studies increases, these three tests have superior statistical power to the correlated Lancaster method. This can be explained by the inflated Type I error of the Breslow-Day, Liang & Self and the Zelen tests when the number of studies increases.
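The mechanism described above can be sketched in a few lines. The sketch is a hedged reading of the Satterthwaite adjustment, not a reproduction of the author's code: it assumes one common correlation `rho` for every pair of transformed statistics, takes the covariance of two transformed statistics as `rho` times the product of their standard deviations, and scales the Lancaster statistic as *T_{A}* = *cT* with *ν* = 2E(T)²/Var(T) and *c* = *ν*/E(T). The function name `correlated_lancaster` and the example p-values are hypothetical.

```python
import numpy as np
from scipy import stats


def correlated_lancaster(pvalues, dfs, rho):
    """Satterthwaite-adjusted Lancaster combination (illustrative sketch).

    Assumes a single common correlation `rho` between every pair of
    transformed statistics; this is an assumption of the sketch.
    """
    p = np.asarray(pvalues, dtype=float)
    df = np.asarray(dfs, dtype=float)
    # Lancaster transform: chi-square(df_i) value with right-tail probability p_i.
    t = stats.gamma.isf(p, a=df / 2.0, scale=2.0)
    total = t.sum()
    mean_t = df.sum()                  # E(T) under H0
    sd = np.sqrt(2.0 * df)             # sd of each chi-square(df_i) term
    cov = rho * np.outer(sd, sd)       # assumed pairwise covariances
    np.fill_diagonal(cov, 2.0 * df)    # variances on the diagonal
    var_t = cov.sum()                  # Var(T) under the assumed correlation
    nu = 2.0 * mean_t**2 / var_t       # matched degrees of freedom
    c = nu / mean_t                    # correction constant
    t_adj = c * total
    return t_adj, nu, stats.chi2.sf(t_adj, nu)


# Hypothetical p-values from eight tests, each with 6 degrees of freedom.
pvals = [0.010, 0.012, 0.020, 0.004, 0.015, 0.003, 0.018, 0.011]
for rho in (0.25, 0.50, 0.75):
    t_adj, nu, p_comb = correlated_lancaster(pvals, [6] * 8, rho)
    print(f"rho={rho:.2f}: T_A={t_adj:.2f}, nu={nu:.2f}, p={p_comb:.5f}")
```

Under these assumptions a larger `rho` increases `var_t`, which lowers both `nu` and `c`. Setting `rho=0` recovers the independent Lancaster test, while `rho=1` with identical p-values and degrees of freedom returns that common p-value.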

The correlated Lancaster method performs well and is easy to implement, but a few reservations need to be mentioned. The assumption of positive correlation between the dependent test statistics is intuitively reasonable. However, the method’s performance is sensitive to the value of the correlation coefficient, as the method performs best when the value of the correlation coefficient is high. It is therefore recommended to carry out the correlated Lancaster method with different values for the correlation coefficient between the resulting p-values. An overall significant result can then be presented only when there is a consensus among the significance results based on the different values of the correlation coefficient. In addition, we only applied the method to balanced meta-analyses of binary outcomes. Further research is warranted to investigate the method’s performance on unbalanced meta-analyses, meta-analyses of continuous outcomes and meta-analyses of rare binary events.
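The consensus rule recommended above can be written as a small check. This is only a sketch of the decision rule: the helper name `consensus_significant` is hypothetical, and the combined p-values in the example are made-up numbers standing in for the results obtained under each assumed correlation value.

```python
def consensus_significant(pvalues_by_rho, alpha=0.05):
    """Return True only if the combined test rejects for every assumed rho.

    pvalues_by_rho maps each assumed correlation value to the combined
    p-value obtained under that value.
    """
    return all(p < alpha for p in pvalues_by_rho.values())


# Hypothetical combined p-values for three assumed correlation values.
combined = {0.25: 0.001, 0.50: 0.012, 0.75: 0.048}
print(consensus_significant(combined))
```

A single non-significant result in the grid is enough for the rule to withhold an overall significant conclusion, which makes the rule conservative with respect to the unknown correlation.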

The author declares that he has no competing interests.

The author would like to thank the two editors for their very useful comments which led to a clear improvement of this manuscript.

Editor: Dr. Nicola Shaw. Algoma University, Canada.

Received: 03-May-2022 Final Revised: 08-Aug-2022

Accepted: 24-Sep-2022 Published: 10-Oct-2022

- Sunjaya AP, Allida SM, Di Tanna GL, Jenkins C (2021) Asthma and risk of infection, hospitalization, ICU admission and mortality from COVID-19: Systematic review and meta-analysis. J Asthma.
- Zhou Y, Yang W, Liu G, Gao W (2021) Risks of vaptans in hypernatremia and serum sodium overcorrection: A systematic review and meta-analysis of randomised controlled trials. Int J Clin Pract 75:e13939.
- Jones MP, O’Gorman TW, Lemke JH, Woolson RF (1989) A Monte Carlo Investigation of Homogeneity Tests of the Odds Ratio under Various Sample Size Configurations. Biometrics 45(1):171-181.
- Gavaghan DJ, Moore RA, McQuay HJ (2000) An evaluation of homogeneity tests in meta-analyses in pain using simulations of individual patient data. Pain 85(3):415-424.
- Almalik O, van den Heuvel ER (2018) Testing homogeneity of effect sizes in pooling 2x2 contingency tables from multiple studies: a comparison of methods. Cogent Math Stat 5(1).
- Zhang C, Wang X, Chen M, Wang TA (2021) Comparison of hypothesis tests for homogeneity in meta-analysis with focus on rare binary events. Res Synth Methods 12:408-428.
- Fisher RA (1948) "Questions and answers #14". Am Stat 2(5):30-31.
- Bhandary M, Zhang X (2011) Comparison of Several Tests for Combining Several Independent Tests. J Mod Appl Stat Methods 10(2):436-446.
- Dai H, Leeder JS, Cui Y (2014) A modified generalized Fisher method for combining probabilities from dependent tests. Front Genet 5(32).
- Li S, Williams BL, Cui Y (2011) A combined p-value approach to infer pathway regulations in eQTL mapping. Stat Interface 4:389-402.
- Brown MB (1975) A method for combining non-independent, one-sided tests of significance. Biometrics 31:987-992.
- Kost JT, McDermott MP (2002) Combining dependent p-values. Stat Probabil Lett 60:183-190.
- Poole W, Gibbs DV, Shmulevich I, Bernard B, Knijnenburg TA (2016) Combining dependent P-values with an empirical adaptation of Brown’s method. Bioinformatics 32:430-436.
- Yang JJ (2010) Distribution of Fisher’s combination statistic when the tests are dependent. J Stat Comput Simul 80(1):1-12.
- Hartung JA (1999) A Note on Combining Dependent Tests of Significance. Biom J 41(7):849-855.
- Makambi K (2003) Weighted inverse chi-square method for correlated significance tests. J Appl Stat 30(2):225-234.
- Hou CD (2005) A simple approximation for the distribution of the weighted combination of nonindependent or independent probabilities. Stat Probab Lett 73:179-187.
- Dai H, Leeder JS, Cui Y (2014) A modified generalized Fisher method for combining probabilities from dependent tests. Front Genet 5(32).
- Lancaster HO (1961) The combination of probabilities: an application of orthonormal functions. Aust J Stat 3:20-33.
- Xu X, Tian L, Wei L (2003) Combining dependent tests for linkage or association across multiple phenotypic traits. Biostatistics 4(2):223-229.
- Yang Y, Jin Z (2006) Combining dependent tests to compare the diagnostic accuracies: a non-parametric approach. Stat Med 25:1239-1250.
- Wei LJ, Johnson WE (1985) Combining Dependent Tests with Incomplete Repeated Measurements. Biometrika 72(2):359-364.
- Sheng X, Yang J (2013) An adaptive truncated product method for combining dependent p-values. Econ Lett 119(2):180-182.
- David FN, Johnson NL (1950) The Probability Integral Transformation when the Variable is Discontinuous. Biometrika 37(1/2):42-49.
- Satterthwaite FE (1946) An approximate distribution of estimates of variance components. Biometrics Bulletin 2(6):110-114.
- Cochran WG (1954) The combination of estimates from different experiments. Biometrics 10(1):101-129.
- Bliss CI (1952) The statistics of Bioassay: with special reference to the vitamins. Academic Press Inc.
- Breslow NE, Day NE (1980) Statistical Methods in Cancer Research. Volume I: The Design and Analysis of Case-Control Studies. IARC Scientific Publications No. 32. Lyon: International Agency for Research on Cancer.
- Tarone RE (1985) On heterogeneity tests based on efficient scores. Biometrika 72:91-95.
- Zelen M (1971) The analysis of several 2x2 contingency tables. Biometrika 58:129-137.
- Halperin M, Ware JH, Byar DP, Mantel N, Brown CC, Koziol J, Gail M, Green SB (1977) Testing for interaction in an I x J x K contingency table. Biometrika 64:271-275.
- Liang KY, Self SG (1985) Tests of homogeneity of odds ratio when the data are sparse. Biometrika 72:353-358.
- Woolf B (1955) On estimating the relationship between blood group and disease. Ann Hum Genet 19:251-253.
- Yusuf S, Peto R, Lewis J, Collins R, Sleight P (1986) Beta blockade during and after myocardial infarction: an overview of the randomized trials. Prog Cardiovasc Dis 27(5):335-371.
- Amatya A, Bhaumik DK, Normand S, Greenhouse J, Kaizar E, Neelon B, Gibbons RD (2015) Likelihood-Based Random-Effect Meta-Analysis of Binary Events. J Biopharm Stat 25(5):984-1004.
- Bein M, Hoi O, Yu Y, Grandi SM, Frati FYE, Kandil K, Filion KB (2021) Levothyroxine and the risk of adverse pregnancy outcomes in women with subclinical hypothyroidism: a systematic review and meta-analysis. BMC Endocr Disord 21(34).

Volume 10

Almalik O. **Combining dependent p-values resulting from multiple effect size homogeneity tests in metaanalysis for binary outcomes**. *J Med Stat Inform*. 2022; **10**:1. http://dx.doi.org/10.7243/2053-7662-10-1


Copyright © 2015 Herbert Publications Limited. All rights reserved.
