Jorge I. Vélez^{1,2}, Cameron A. Jack^{3}, Aaron Chuah^{3}, Bob Buckley^{3}, Juan C. Correa^{4}, Simon Easteal^{5} and Mauricio Arcos-Burgos^{1,2*}

*Correspondence: Mauricio Arcos-Burgos Mauricio.Arcos-Burgos@anu.edu.au

1. Genomics and Predictive Medicine Group, Department of Genome Biology, John Curtin School of Medical Research, The Australian National University, Canberra, ACT, Australia.

2. Neuroscience Research Group, University of Antioquia, Medellín, Colombia.

3. Genome Discovery Unit, John Curtin School of Medical Research, The Australian National University, Canberra, ACT, Australia.

4. Research Group in Statistics, Department of Statistics, National University of Colombia at Medellín, Medellín, Colombia.

5. Genome Diversity and Health Group, Department of Genome Biology, John Curtin School of Medical Research, The Australian National University, Canberra, ACT, Australia.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Recently, we presented a new pooling/resampling method for genome-wide association studies (*pr*GWAS) that uncovered new and known loci associated with Alzheimer's disease. Here, we contrast this method with the Wellcome Trust Case Control Consortium (WTCCC) data, a well-known GWAS of seven complex human diseases. Our results suggest that *pr*GWAS can be considered an efficient, specific, and accurate alternative to the conventional GWAS approach at a fraction of the genotyping cost, and provide insights into other potential applications such as next-generation sequencing.

**Keywords**: Case/control data, GWAS, DNA pooling, *pr*GWAS, subsampling

In a recent manuscript we presented a new strategy for genome-wide association studies (GWAS) that uses resampling and DNA pooling, which we named pooling/bootstrap-based GWAS (*pr*GWAS) [1]. This methodology is well suited to identifying disease-associated genetic variants using limited and relatively small samples at a fraction of the individual genotyping cost [1]. We applied the *pr*GWAS strategy to a unique cohort of patients with an autosomal-dominant form of Alzheimer's disease (AD) segregating a mutation in the presenilin-1 gene (*PSEN1*) [1-4], and identified new and previously reported loci underpinning the susceptibility to AD and/or modifying the age of onset of this dementia [1].

Here we cross-validate the *pr*GWAS methodology using the Wellcome Trust Case Control Consortium (WTCCC) data, a collection of publicly available genetic information obtained by a network of ~50 research groups across the United Kingdom (http://www.wtccc.org.uk/). Based on performance measures [5], we found that *pr*GWAS provides efficient, specific and accurate results comparable to those obtained from the application of traditional GWAS.

Furthermore, *pr*GWAS detects 98% of all regions of the genome showing the strongest or moderate evidence of association [6] whilst using a reasonable number of DNA pools. Overall, these findings demonstrate that *pr*GWAS might be considered a feasible alternative to individual genotyping, and provide insights into its use in other potential whole-genome designs.

**WTCCC data**

**Sample description**

Briefly, the WTCCC aimed to understand patterns of human genome sequence variation, and to explore their utility for the design and analysis of GWAS. The WTCCC analyzed *n*~2,000 cases of each of seven complex human diseases and *n*~3,000 control individuals, all of them from England, Scotland and Wales. Genotyping of ~1.8 million single nucleotide polymorphisms (SNPs) in the *n*~17,000 samples was performed using high-throughput technologies [6].

The two control collections, assumed to be samples from the general population, comprise *n*~3,000 individuals drawn from the 1958 British Birth Cohort (58C, *n*~1,500) and from blood donors recruited as part of this project (NBS cohort, *n*~1,500). Both cohorts have approximately the same number of males and females. Individuals in the 58C cohort, aged 44-45 years, were born in the same week in 1958; NBS samples are from blood donors aged 18-69 years. A previous comparison of SNP allele frequencies between the 58C and NBS cohorts showed few significant differences [6], and the two were therefore treated as a single cohort in this study.

Approximately 14,000 individuals were phenotyped for seven complex human diseases (bipolar disorder [BD], Crohn's disease [CD], coronary artery disease [CAD], hypertension [HT], rheumatoid arthritis [RA], type 1 diabetes [T1D] and type 2 diabetes [T2D]) and comprised the case cohorts, each of which includes *n*~2,000 samples [6].

*Genotype data*

Control cohorts were genotyped using the Illumina 1.2M and Affymetrix V6.0 platforms, and cases were genotyped using the Affymetrix 500K SNP-chip. Complete information for the genotyped SNPs was obtained from Illumina (www.illumina.com) and Affymetrix (www.affymetrix.com). Non-autosomal SNPs and those with no available coordinate information were excluded. Genotypes for all samples were obtained after application to the WTCCC, and processed in R [7] and Python (https://www.python.org/) (**Supplementary material**).

*Genetic association analysis*

Genetic association analyses were performed in PLINK [8] version 1.07. SNPs with a minor allele frequency (MAF) of at least 1%, a *P*-value greater than 10^{-5} for the Hardy-Weinberg equilibrium test, and a genotyping rate of at least 95% across all samples were included for analysis. To some extent, these parameters mimic those previously used by the WTCCC [6]. After completion, the results of the association analyses for all chromosomes in a specific case cohort were combined in a single file.
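The three inclusion criteria can be illustrated with a minimal Python sketch. It is not the PLINK implementation: the function names and the genotype-count interface are hypothetical, and the Hardy-Weinberg test shown here is the one-degree-of-freedom chi-square approximation, which is an assumption made for brevity and may differ from the test applied by PLINK.

```python
import math


def hwe_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) Hardy-Weinberg P-value from genotype counts (approximation)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A among called genotypes
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: no departure can be measured
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(n_aa, n_ab, n_bb, n_missing,
              maf_min=0.01, hwe_min=1e-5, call_rate_min=0.95):
    """Apply the MAF, Hardy-Weinberg and genotyping-rate thresholds to one SNP."""
    n_called = n_aa + n_ab + n_bb
    if n_called == 0:
        return False
    call_rate = n_called / (n_called + n_missing)
    p = (2 * n_aa + n_ab) / (2 * n_called)
    maf = min(p, 1.0 - p)
    return (maf >= maf_min
            and hwe_pvalue(n_aa, n_ab, n_bb) > hwe_min
            and call_rate >= call_rate_min)


if __name__ == "__main__":
    print(passes_qc(360, 480, 160, 0))   # common SNP in Hardy-Weinberg proportions
    print(passes_qc(995, 5, 0, 0))       # rare SNP, MAF below 1%
```

A SNP is retained only when all three conditions hold, so a marker failing any single threshold is dropped before the association analysis.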

**prGWAS method**

**Construction of DNA pools**

Let *N _{1}* be the total number of controls,

DNA pools were constructed as previously described [1]. First, a random sample of size *n _{i}* is drawn from *N _{i}* to form the master plates; and secondly,

Figure 1 : *In silico* experimental strategy to construct and analyze DNA pools using *pr*GWAS on the WTCCC data.

*In silico* experiments were performed to study the effect of *n _{1}*,

*Allele frequency estimation*

After constructing the pairs of DNA pools, the genotypes for all *m* genotyped SNPs were retrieved from the ped files previously constructed (**Supplementary material**) by matching the IDs of the individuals in the DNA pools with those of the WTCCC samples. The allele frequency for the *j*^{th} SNP in the *i*^{th} group was estimated as
*j*^{th} SNP (*i*=1,2; *j*=1,2,..., *m*). The number of alleles was determined based on the ordered two-string genotypes for each SNP.
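As an illustration of counting alleles from two-character genotype strings, the following Python sketch assumes the usual allele-counting estimator, that is, the number of copies of a chosen allele divided by twice the number of called genotypes in the pool. This estimator, the function name, and the convention that `0` marks a missing call are assumptions made for the example, not a reproduction of the expression used in [1].

```python
def allele_frequency(genotypes, allele):
    """Frequency of `allele` among the called two-character genotypes of one pool.

    Assumption: each genotype is a two-character string such as "AG", and any
    genotype containing "0" is treated as a missing call and skipped.
    """
    called = [g for g in genotypes if len(g) == 2 and "0" not in g]
    if not called:
        raise ValueError("no called genotypes for this SNP")
    copies = sum(g.count(allele) for g in called)
    return copies / (2 * len(called))


if __name__ == "__main__":
    pool = ["AA", "AG", "GG", "AG", "00"]
    print(allele_frequency(pool, "A"))  # 4 copies of A among 8 called alleles
```

Because the function uses only the individuals assigned to a pool, calling it once per pool and per SNP gives the pair of frequencies that are later compared between cases and controls.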

**Detecting associated SNPs**

As described in Vélez et al. [1], for each SNP it is of interest to test

against a suitable alternative hypothesis, say *H*_{1,j}. Because a total of *k* DNA pools are generated for cases and controls, (1) is to be tested *k* times for each SNP. For *k* fixed, the test statistic for this purpose is given by [1]

Let
*m*>1 statistical tests are being performed (and usually
*H*_{0,j} when *p _{j}* < α is that actual type I error probability of the m tests is

*Combination of P-values*

Based on the *k* pairs of DNA pools being generated and subsequently compared using (2), a total of *k* *P*-values are calculated for each of the *m* SNPs in the SNP-chip. Further, the *P*-values for each SNP are combined using Stouffer's Z-transform method [17,18] after introducing some degree of dependence [19]. The test statistic for the *j*^{th} SNP is given by [19]

*w*_{l} is the weight and *Z*_{l} the quantile of the standard normal distribution (i.e., the *N*(0,1) distribution) associated with the *P*-value of the *j*^{th} SNP in the *l*^{th} pair of DNA pools. Expressions for *k* can be found elsewhere [19].
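For orientation, the weighted Z-method for independent tests [18] can be sketched in a few lines of Python. This sketch deliberately omits the dependence adjustment of Hartung [19] used in *pr*GWAS, so it should be read as the independent special case only; the function name and the default of equal weights are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist


def weighted_stouffer(p_values, weights=None):
    """Combine one-sided P-values with the weighted Z-method, assuming independence.

    Returns the combined Z statistic and its one-sided P-value.
    Every P-value must lie strictly between 0 and 1.
    """
    if weights is None:
        weights = [1.0] * len(p_values)  # equal weights (illustrative default)
    if len(weights) != len(p_values):
        raise ValueError("weights and p_values must have the same length")
    std_normal = NormalDist()
    # Z_l is the standard normal quantile associated with the l-th P-value.
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sqrt(sum(w * w for w in weights))
    z_combined = numerator / denominator
    return z_combined, 1.0 - std_normal.cdf(z_combined)


if __name__ == "__main__":
    z, p = weighted_stouffer([0.05, 0.05])
    print(z, p)
```

Under positive dependence between the pairs of DNA pools, the independent version tends to give combined *P*-values that are too small, which is the motivation for the adjusted denominator described in [19].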

When combining the *k* *P*-values for each SNP, the null hypothesis of interest is whether the allele frequency in cases and controls is the same across all pairs of DNA pools, that is

with the alternative hypothesis being *H*_{1,l}: *p*_{1,l}>*p*_{2,l} for some l. Under H0 in (4), the statistic
*N*(0,1) distribution. Hartung (1999) [19] showed that

Thus, the calculation of one- and two-tailed *P*-values follows [18].

**Comparison of prGWAS and GWAS**

Let

Results from the *pr*GWAS method were compared to those obtained in GWAS using several performance measures [5] after constructing 2×2 contingency tables (Table 1a) for every combination of disease, chromosome, *k* and sample sizes of the master plates. Initially, the results from the association analysis using GWAS and those using *pr*GWAS were merged by SNP after the *P*-values from the GWAS analysis were corrected for multiple testing using Bonferroni's criterion. For the *l*^{th} pair of DNA pools and *n _{1}* and

Table 1 : **Results and expressions for calculating the performance measures used to quantitatively compare prGWAS and GWAS.**

where *a* is the number of markers statistically significant using both methods, *b* is the number of markers positive in *pr*GWAS but not positive in GWAS, *c* is the number of not significant markers using *pr*GWAS but positive using GWAS, and *d* is the number of SNPs not statistically significant by either method. In this context, markers found to be positive using the complete data set are said to be true positives (Table 1a).
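The classical measures can be computed directly from the four cells of the 2×2 table. The Python sketch below uses the standard definitions [5], with GWAS on the complete data taken as the reference: *a* counts markers positive by both methods, *b* markers positive by *pr*GWAS only, *c* markers positive by GWAS only, and *d* markers negative by both. The function name and the convention of returning `nan` for an empty denominator are assumptions made for the example.

```python
def performance_measures(a, b, c, d):
    """Standard 2x2 performance measures with GWAS as the reference classification."""

    def ratio(numerator, denominator):
        # An empty row or column leaves the measure undefined.
        return numerator / denominator if denominator else float("nan")

    total = a + b + c + d
    return {
        "sensitivity": ratio(a, a + c),          # GWAS positives recovered by prGWAS
        "specificity": ratio(d, b + d),          # GWAS negatives kept negative
        "classification_rate": ratio(a + d, total),
        "ppv": ratio(a, a + b),                  # prGWAS positives confirmed by GWAS
        "npv": ratio(d, c + d),                  # prGWAS negatives confirmed by GWAS
        "fdr": ratio(b, a + b),                  # prGWAS positives not confirmed
    }


if __name__ == "__main__":
    for name, value in performance_measures(40, 10, 60, 890).items():
        print(f"{name}: {value:.3f}")
```

Note that the positive predictive value and the false discovery proportion always sum to one, since both are computed over the markers called positive by *pr*GWAS.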

In addition to the classical measures, we also calculated the lift, a performance measure initially introduced as 'interest' by Brin et al., [20] and recently used in data mining and marketing [21] to evaluate the relative performance of alternative classification models (Table 1b) [22]. If A is the event 'detecting a marker as being statistically significant using *pr*GWAS' and B the event 'detecting a marker as being statistically significant using GWAS', lift measures how many times more often A and B occur together than expected if they were statistically independent [22]. In other words, this is equivalent to quantify how much more successful the *pr*GWAS method is likely to be than if no predictive model (i.e., random selection) was used to detect statistically significant markers.
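The definition of lift as the ratio between the joint probability of A and B and the product of their marginal probabilities [22] can be written in terms of the cells of the 2×2 table. In the Python sketch below, *a* is the number of markers significant by both methods, *b* by *pr*GWAS only, *c* by GWAS only, and *d* by neither; the function name is hypothetical.

```python
def lift(a, b, c, d):
    """Lift of prGWAS (event A) with respect to GWAS (event B).

    lift = P(A and B) / (P(A) * P(B)) = a * N / ((a + b) * (a + c)),
    where N = a + b + c + d is the total number of markers.
    """
    total = a + b + c + d
    positives_prgwas = a + b   # markers with event A
    positives_gwas = a + c     # markers with event B
    if positives_prgwas == 0 or positives_gwas == 0:
        return float("nan")    # lift is undefined without positives
    return a * total / (positives_prgwas * positives_gwas)


if __name__ == "__main__":
    print(lift(40, 10, 60, 890))   # A and B co-occur more often than by chance
    print(lift(10, 90, 90, 810))   # A and B behave as if independent
```

A value of one corresponds to statistical independence between the two classifications, and values above one indicate that *pr*GWAS positives are enriched for GWAS positives.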

**Previously reported SNPs**

Table 2a presents the association signals at previously replicated loci. Using

Table 2 : **Comparison of previously replicated loci, and genomic regions with the strongest and moderate association signals between prGWAS and GWAS.**

Figure 2 : Number of pairs of DNA pools to be randomly generated in *pr*GWAS to obtain *P*-values comparable to those at (**a**) previously replicated loci (Table 2a); (**b**) regions with the strongest association signals (Table 2b); and (**c**) regions of the genome showing moderate association when the full genotype data is used (Table 2c).

In Table 2b, regions of the genome with the strongest association signals are presented. Twenty of the 21 signals previously reported in a standard analysis [6] were detected by *pr*GWAS, with 43% (9/21) of these signals being statistically significant for at least one value of *n* (see Methods). Of the 63 signal/sample size combinations, 11% (7/63) were not significant at the 5% level (BD: rs420259 for *n=*500; CD: rs10761659 and rs2542151 for *n=*250; T1D: rs11171739 for *n=*250; T2D: rs9465871 regardless of *n*). Furthermore, the number of detected markers increases with *n* (trend *P*-value=0.0283, Figure 2b). A more detailed description of the findings for each of the seven diseases is provided in the **Supplementary material**.

A total of 58 markers were reported as showing moderate evidence of association (6th column, Table 2c): 13 in BD, nine in RA and T2D, eight in CD, seven in T1D, and six in CAD and HT. Of those, six (10.3%) did not pass *pr*GWAS quality control. Despite detecting all signals using *pr*GWAS, 38 of the 174 (22.4%) signal/sample size combinations were not significant for at least one value of n. Overall, the number of markers detected increases with *n* (trend *P*-value=0.00343, Figure 2c). An extensive and more detailed description of the associated variants for each disease can be found in the **Supplementary material**.

**Number of DNA pools**

Figure 2 depicts the number of DNA pools to be generated *k* as a function of the sample size of the master plates *n* to obtain, using *pr*GWAS, similar *P*-values to those reported previously (see Table 2) when the full genotype data was used.

Panel (a) depicts the results for detecting markers in previously replicated loci (Table 2a). Regression analysis shows that *k* decreases as a function of *n* (trend *P*-value=7.88×10^{-4}). Furthermore, analysis of variance (ANOVA) discloses a statistically significant difference in the average value of *k* as a function of *n* (*F*_{2,26}=7.197, *P*-value=0.00325). The largest Tukey's honestly significant difference was obtained when comparing *k* for *n=*1000 and *n=*250 (*d=*-4.25, *P*_{adjusted}=0.0237). These results suggest that, in *pr*GWAS, the larger the value of *n*, the lower the number of DNA pools to be generated in order to obtain *P*-values comparable to those obtained at previously robustly replicated loci.

The number of DNA pools to be generated in *pr*GWAS decreases as a function of *n* (trend *P*-value=0.0252) for detecting regions with the strongest association signals (Table 2b). However, no statistically significant difference was found between the average value of *k* across sample sizes (*k*_{all}=5.63, *k*_{250}=7.11, *k*_{500}=5.93, *k*_{1000}=4.68; *F*_{2,40}=2.723, *p*=0.0778). Furthermore, neither a linear relationship between *n* and *k* (trend *P*-value=0.189) nor a difference in the average value of *k* across sample sizes (*k*_{all}=5.29, *k*_{250}=5.85, *k*_{500}=5.76, *k*_{1000}=4.92; *F*_{2,61}=0.898, *p*=0.413) was found when detecting regions of the genome with moderate evidence of association (Table 2c). Altogether, these results indicate that *pr*GWAS detects those regions of the genome previously reported as showing the strongest or moderate evidence of association by randomly generating a reasonable number of DNA pools, regardless of the sample size of the master plates.

**Performance measures**

A total of 4,159 (7 diseases × 22 chromosomes × 3 master plates' sample sizes × 9 combined

Figure 3 : Results of (**a**) sensitivity, (**b**) specificity, (**c**) classification rate, (**d**) positive predictive value, (**e**) negative predictive value, (**f**) false discovery rate and (**g**) lift across all chromosomes as a function of *k* and the sample size of the master plates (*n=*250 in blue, *n=*500 in pink and *n=*1000 in green).

Across all seven diseases, the sensitivity ranges from 0.15% to 1.4%. Regardless of *k*, the lowest sensitivity values are obtained when the sample size of the master plate is *n _{1}*=

In Figure 3b the results for the specificity are depicted. Specificity increases slightly as a function of *k* when *n* is fixed, and is slightly higher for fixed *k* when *n* is large. In practical terms, this implies that applying *pr*GWAS using master plates of relatively large size would result in higher specificity values (i.e., ~99%) regardless of *k*. However, as per the results when the GWAS approach is used, it also seems that the change in specificity is not particularly high for large *n*. Figure 3c presents the results for the classification rate (CR). Obtained values range between 80% and >90%, with master plates of larger size and higher values of *k* producing lower CRs. Conversely, when *n* is small (i.e., *n=*250), *pr*GWAS produces the highest CRs for *k*>2. Furthermore, the CR tends to increase slightly when *k*>5 for this sample size.

Results for the Positive Predictive Value (PPV) and Negative Predictive Value (NPV) are presented in Figures 3d and e, respectively. In the former, master plates of size *n=*1000 provide PPVs of >70% regardless of *k*, with PPVs stabilizing when *k*>3. Despite not being the highest, master plates of ~16% of controls and ~50% of cases (i.e., *n=*500) produce PPVs between 50% and 80%. On the other hand, the NPVs range from 80% to 90%, with the highest values found when *n=*250 regardless of *k*, followed by *n=*500 and *n=*1000. As a function of *k*, the NPV decreases for *n=*1000 and *n=*500, and slightly increases when *k*>3 for *n=*250 (panel (e), blue line). Figure 3f depicts the false discovery rate (FDR). The results indicate that the FDR decreases as *k* increases, and that, in contrast with *n=*1000, which produces FDR values close to the nominal type I error probability α, *n=*250 produces higher FDR values. When the master plate is of size *n=*1000 (i.e., randomly selecting ~33% of the total number of controls and ~50% of the total number of cases) the FDR is close to the nominal level for *k*>4 for most diseases, except T1D. This value of *k* translates into having at least 4,000 cases and an equal number of controls in the equivalent association analysis [1].

Figure 3g shows the results of lift for predicting statistically significant markers using *pr*GWAS as a function of *n* and *k*. Across all diseases, lift values are higher as the sample size of the master plates increases, suggesting that the *pr*GWAS method performs better at predicting statistically significant markers than a random choice targeting model. Lift values range between 2.5 and 7.5, with the highest values being recorded for 2<*k*<4 when *n=*1000 and *n=*500 regardless of the disease, and the lowest values when *n=*250 regardless of *k*. On the other hand, the lift for predicting negative markers is slightly greater than one for all diseases (data not shown). Altogether, these results suggest that the *pr*GWAS method is potentially useful for predicting positive markers in future data sets.

By varying the number of pairs of DNA pools and the sample size of the master plates, it was possible to demonstrate that *pr*GWAS is efficient, specific and accurate compared to conventional GWAS [1].

In the GWAS approach, a total of ~17,000 DNA samples were genotyped, whereas using *pr*GWAS this number would have been reduced considerably, to at most 80 DNA pools (up to 10 DNA pools per disease + 10 DNA pools from the controls), regardless of how many individuals' DNA samples are present in the master plates. Because in DNA pooling genotyping every pool is treated as a single DNA sample, the reduction in genotyping costs by using *pr*GWAS is substantial.
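The arithmetic behind this comparison can be made explicit with a short Python sketch. It counts genotyping assays only, using the figures quoted above (seven diseases, 10 DNA pools per disease, 10 control pools and ~17,000 individual samples); the function name is hypothetical, and costs of pool construction or of replicate assays are not considered.

```python
def pooled_assays(n_diseases, pools_per_disease, control_pools):
    """Number of pooled genotyping assays, treating each DNA pool as one sample."""
    return n_diseases * pools_per_disease + control_pools


if __name__ == "__main__":
    individual_samples = 17_000              # approximate number genotyped in GWAS
    pools = pooled_assays(7, 10, 10)         # 7 diseases, 10 pools each, 10 control pools
    print(pools)                             # 80
    print(individual_samples / pools)        # 212.5-fold fewer assays
```

Under these assumptions the number of assays falls by roughly two orders of magnitude, and it does not depend on the sample size of the master plates.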

Despite the encouraging results of the *pr*GWAS method when compared to GWAS using the WTCCC data, some limitations need to be acknowledged. First, in contrast with the reported GWAS results [6], in the analyses reported here the individual genotypes of all cases and controls were used for the genetic association analysis; secondly, no control for population structure was performed; and thirdly, no SNPs were removed after visual inspection.

Potential applications of *pr*GWAS include whole-exome and whole-genome sequencing for cases and controls, and GWAS when families of (not necessarily) identical structure with affected and unaffected siblings are recruited [23].

**Additional files**

Supplementary material

The authors declare that they have no competing interests.

| Authors' contributions | JIV | CAJ | AC | BB | JCC | SE | MAB |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Research concept and design | √ | -- | -- | -- | -- | √ | √ |
| Collection and/or assembly of data | √ | √ | √ | √ | -- | √ | √ |
| Data analysis and interpretation | √ | -- | -- | -- | √ | √ | √ |
| Writing the article | √ | -- | -- | -- | -- | √ | √ |
| Critical revision of the article | √ | -- | -- | -- | -- | √ | √ |
| Final approval of article | √ | √ | √ | √ | √ | √ | √ |

This research is supported in part by the Eccles Scholarship in Medical Sciences, the Fenner Merit Scholarship and The Australian National University (ANU) High Degree Research scholarships granted to JIV. JIV is a doctoral student at ANU. Some of this work is to be presented in partial fulfillment of the PhD degree requirements. The first author thanks Ms. Lindsay Nailer from ANU and Dr. Fernando Marmolejo-Ramos from Stockholm University, Sweden for critical reading of an earlier version of this document.

EIC: Kenneth Maiese, Wayne State University, USA.

Received: 03-Nov-2014 Final Revised: 09-Dec-2014

Accepted: 14-Jan-2015 Published: 20-Jan-2015

1. Velez JI, Chandrasekharappa SC, Henao E, Martinez AF, Harper U, Jones M, Solomon BD, Lopez L, Garcia G, Aguirre-Acevedo DC, Acosta-Baena N, Correa JC, Lopera-Gomez CM, Jaramillo-Elorza MC, Rivera D, Kosik KS, Schork NJ, Swanson JM, Lopera F and Arcos-Burgos M. **Pooling/bootstrap-based GWAS (pbGWAS) identifies new loci modifying the age of onset in PSEN1 p.Glu280Ala Alzheimer's disease**. *Mol Psychiatry*. 2013; **18**:568-75.
2. Lalli MA, Cox HC, Arcila ML, Cadavid L, Moreno S, Garcia G, Madrigal L, Reiman EM, Arcos-Burgos M, Bedoya G, Brunkow ME, Glusman G, Roach JC, Hood L, Kosik KS and Lopera F. **Origin of the PSEN1 E280A mutation causing early-onset Alzheimer's disease**. *Alzheimers Dement*. 2014; **10**:S277-S283 e10.
3. Londono AC, Castellanos FX, Arbelaez A, Ruiz A, Aguirre-Acevedo DC, Richardson AM, Easteal S, Lidbury BA, Arcos-Burgos M and Lopera F. **An 1H-MRS framework predicts the onset of Alzheimer's disease symptoms in PSEN1 mutation carriers**. *Alzheimers Dement*. 2014; **10**:552-61.
4. Lopera F, Ardilla A, Martinez A, Madrigal L, Arango-Viana JC, Lemere CA, Arango-Lasprilla JC, Hincapie L, Arcos-Burgos M, Ossa JE, Behrens IM, Norton J, Lendon C, Goate AM, Ruiz-Linares A, Rosselli M and Kosik KS. **Clinical features of early-onset Alzheimer disease in a large kindred with an E280A presenilin-1 mutation**. *JAMA*. 1997; **277**:793-9.
5. Parikh R, Mathai A, Parikh S, Chandra Sekhar G and Thomas R. **Understanding and using sensitivity, specificity and predictive values**. *Indian J Ophthalmol*. 2008; **56**:45-50.
6. Wellcome Trust Case Control Consortium. **Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls**. *Nature*. 2007; **447**:661-78.
7. R Core Team. **R: A language and environment for statistical computing**. R Foundation for Statistical Computing, Vienna, Austria. 2014.
8. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ and Sham PC. **PLINK: a tool set for whole-genome association and population-based linkage analyses**. *Am J Hum Genet*. 2007; **81**:559-75.
9. Shaffer JP. **Multiple Hypothesis Testing**. *Ann Rev Psychol*. 1995; **46**:561-84.
10. Benjamini Y and Hochberg Y. **Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing**. *Journal of the Royal Statistical Society Series B (Methodological)*. 1995; **57**:289-300.
11. Bonferroni CE. **Il calcolo delle assicurazioni su gruppi di teste**. In *Studi in Onore del Professore Salvatore Ortu Carboni*. 1935:13-60.
12. Liu JZ, McRae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, Hayward NK, Montgomery GW, Visscher PM, Martin NG and Macgregor S. **A versatile gene-based test for genome-wide association studies**. *Am J Hum Genet*. 2010; **87**:139-45.
13. Nyholt DR. **A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other**. *Am J Hum Genet*. 2004; **74**:765-9.
14. Storey JD and Tibshirani R. **Statistical significance for genomewide studies**. *Proc Natl Acad Sci U S A*. 2003; **100**:9440-5.
15. Storey JD. **A direct approach to false discovery rates**. *Journal of the Royal Statistical Society Series B (Methodological)*. 2002; **64**:479-98.
16. Vélez JI, Correa JC and Arcos-Burgos M. **A new method for detecting significant p-values with applications to genetic data**. *Revista Colombiana de Estadistica*. 2014; **37**:67-76.
17. Stouffer SA, Suchman EA, DeVinney LC, Star SA and Williams RMJ. **Adjustment During Army Life**. Princeton University Press. 1949.
18. Whitlock MC. **Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach**. *J Evol Biol*. 2005; **18**:1368-73.
19. Hartung J. **A Note on Combining Dependent Tests of Significance**. *Biometrical Journal*. 1999; **41**:849-55.
20. Brin S, Motwani R, Ullman JD and Tsur S. **Dynamic itemset counting and implication rules for market basket data**. *ACM SIGMOD International Conference on Management of Data*. 1997:265-76.
21. Piatetsky-Shapiro G and Masand B. **Estimating Campaign Benefits and Modeling Lift**. San Diego, CA, USA. 1999.
22. McNicholas PD, Murphy TB and O'Regan M. **Standardising the Lift of an Association Rule**. Department of Statistics, Trinity College Dublin, Ireland. 2007.
23. Risch N and Teng J. **The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases I. DNA pooling**. *Genome Res*. 1998; **8**:1273-88.

Volume 3

Vélez JI, Jack CA, Chuah A, Buckley B, Correa JC, Easteal S and Arcos-Burgos M. **Cross validation of pooling/resampling GWAS using the WTCCC data**. *Mol Biol Genet Eng*. 2015; **3**:1. http://dx.doi.org/10.7243/2053-5767-3-1


Copyright © 2015 Herbert Publications Limited. All rights reserved.
