Cross validation of pooling/resampling GWAS using the WTCCC data

Recently, we presented a new method of pooling/resampling genome-wide association study (prGWAS) that uncovered new and known loci associated to Alzheimer’s disease. In here, we contrast this method with the Welcome Trust Case Control Consortium (WTCCC) data, a well-known GWAS on seven human complex diseases. Our results suggest that prGWAS can be considered an efficient, specific, and accurate alternative to the conventional GWAS approach at a fraction of the genotyping cost, and provide insights into other potential applications such as next generation sequencing.


Introduction
In a recent manuscript we presented a new strategy for genome wide association studies (GWAS) that uses resampling and DNA pooling, and that we denominated pooling/bootstrap-based GWAS (prGWAS) [1]. This methodology is well suited to identify disease-associated genetic variants using limited and relatively small samples at a fraction of the individual genotyping cost [1]. We applied the prGWAS strategy to a unique cohort of patients with an autosomal-dominant form of Alzheimer's disease (AD) segregating a mutation in the presenilin-1 gene (PSEN1), [1][2][3][4] and identified new and previously reported loci underpinning the susceptibility to AD and/or modifying the age of onset of this dementia [1].
Here we cross-validate the prGWAS methodology using the Welcome Trust Case Control Consortium (WTCCC) data, a collection of publically available genetic information obtained by a network of ~50 research groups across the United Kingdom (http:// www.wtccc.org.uk/). Based on performance measures, [5] we found that the prGWAS provides efficient, specific and accurate parameters comparable to those obtained from the application of traditional GWAS. Furthermore, prGWAS detects 98% of all regions of the genome showing strongest or moderate evidence of association [6] whilst using a reasonably number of DNA pools. Overall, these findings demonstrate that prGWAS might be considered a feasible alternative to individual genotyping, and provide insights of its use into other potential whole genome designs.
doi: 10.7243/2053-5767-3-1 recruited as part of this project (NBS cohort, n~1,500). Both cohorts have approximately the same number of males and females. Individuals in the 58C cohort, aged 44-45 years, were born in the same week in 1958; NBS samples are from blood donors with 18-69 years of age. Previous comparison of SNP allele frequencies between the 58C and NBS cohorts showed few significant differences [6] and therefore treated as a unique cohort in this study.

Genotype data
Control cohorts were genotyped using the lllumina 1.2M and Affymetrix V6.0 platforms, and cases were genotyped using the Affymetrix 500K SNP-chip. Complete information for the genotyped SNPs was obtained from Illumina (www.illumina. com) and Affymetrix (www.affymetrix.com). Non-autosomal SNPs and those with no available coordinate information were excluded. Genotypes for all samples were obtained after application to the WTCCC, and processed in R [7] and Python (https://www.python.org/) (Supplementary material).

Genetic association analysis
Genetic association analysis were performed in PLINK [8] version 1.07. SNPs with a minimal allele frequency (MAF) of at least 1%, a p-value greater than 10 -5 for the Hardy-Weinberg equilibrium test, and a genotype frequency of at least 95% across all samples were included for analysis. To some extent, these parameters mimic those previously used by the WTCCC [6]. After completion, results of the association analyses for all chromosomes in a specific case cohort were combined in a single file.

Construction of DNA pools
Let N 1 be the total number of controls, N 2 the total of cases, n 1 and n 2 the total number of DNA samples in the master plate for cases and controls, respectively, and, n i '= with the integer part of x. Note that, by design, n 1 <N 1 and n 2 <N 2 , and that N 1~3 ,000 and N 2~2 ,000 for each disease in the WTCCC data under this notation. DNA pools were constructed as previously described [1]. First, a random sample of size n i is draw from N i to form the master plates; and secondly, k samples of size n i ' are randomly selected without replacement from the master plate (Figure 1). During the random selection process within master plates, it is unlikely that an individual' DNA sample would be selected more than once. However, it is possible that some individuals belong to several of the k DNA pools but not to the same DNA pool. In the context of prGWAS, k represents the number of Figure 1. In silico experimental strategy to construct and analyze DNA pools using prGWAS on the WTCCC data. In the Collection stage, the 58C and NBS control cohorts are combined into a single control cohort comprising n~3000 individuals. Further, in the Master plates stage, the DNA master plates for the control and case cohorts are constructed by randomly sampling without replacement ni individuals from each cohort (i=1,2). Likewise, up to k random samples n i '=[n i /2] are subsequently draw from each master plates in the Subsampling stage (here, [x] the integer part of x). Finally, in the Testing stage, the allele frequencies of m SNPs in cases and controls are compared for each pair of DNA pools, and the P-values combined across all pools using meta-analytical methods.
DNA pool pairs to be generated from the master plates [1].
In silico experiments were performed to study the effect of n 1 , n 2 and k. In these experiments, master plates of size n 1 = n 2 =n={250, 500, 1000} were generated; these sample sizes correspond to select, respectively, 8%, 16% and 33% of the total number of controls, and 12.5%, 25% and 50% of the total number of cases for each disease. Further, for n 1 and n 2 given, k={1,2,3,4,…,10} pairs of DNA pools were constructed using the individuals' identifiers ( Figure 1). Based on the values of n 1 , n 2 and k, a total of 30 in silico scenarios were evaluated for each disease.

Allele frequency estimation
After constructing the pairs of DNA pools, the genotypes for all m genotyped SNPs were retrieved from the ped files previously constructed (Supplementary material) by matching the IDs of the individuals in the DNA pools with those of the WTCCC samples. The allele frequency for the jth SNP in the ith group was estimated as where y 1,j and y 2,j are, respectively, the number of individuals with one and two alleles in the jth SNP (i=1,2; j=1,2,…, m). The number of alleles was determined based on the ordered two-string genotypes for each SNP.

Detecting associated SNPs
As described in Vélez et al., [1]  a total of k DNA pools are generated for cases and controls, (1) is to be tested k times for each SNP. For k fixed, the test statistic for this purpose is given by [1] (2) with n i ' and as previously defined (i=1,2; j=1,2,…,m). Under the null hypothesis of no difference between the allele frequency of cases and controls for the jth SNP, T j follows with one degree of freedom; the P-value for the j th SNP can be calculated as p j =Pr (T j >X 1 2 ), j=1,2,…,m. Let α є (0,1) be the type I error probability of the test. Because m>1 statistical tests are being performed (and usually m→∞), it is imperative to control by multiple testing [9]. The main problem of rejecting H 0,j when p j <α is that actual type I error probability of the m tests is α m =1-(1-α) m , which leads to an increased number of false positives as m→∞. Although several criteria have been proposed to control by multiple testing in GWAS, [10][11][12][13][14][15][16] the FDR, [10] a relatively new approach, suitable for exploratory analysis, [14] less stringent and more powerful than Bonferroni's, [10,14,15] is preferable in prGWAS [1]. We shall say that the jth SNP is disease-associated in the lth pair of DNA pools if p j (l) <α, with p j (l) the FDR-corrected p-value for the jth SNP when the lth pair of DNA pools is generated (j=1,2,…,m; l=1,2,…,k).

Combination of P-values
Based on the k pairs of DNA pools being generated and subsequently compared using (2), a total of k P-values are calculated for each of m SNPs in the SNP-chip. Further, P-values for each SNP are combined using the Stouffer's Z-transformed method [17,18] after introducing some degree of dependence [19]. The test statistic for the jth SNP is given by [19] (3) w l is the weight and Z l the quantile of the standard normal distribution (i.e., N(0,1) distribution) associated with the j th p-value of the lth DNA pool. Expressions for k can be found elsewhere [19]. When combining the k P-values for each SNP, the null hypothesis of interest when the is whether the allele frequency between cases and controls is the same across all pairs of DNA pools, that is H 0,l : p 1,l =p 2,l l=1,2,…,k (4) with the alternative hypothesis being H 1,l : p 1,l > p 2,l for some l. Under H 0 in (4), the statistic Z s * follows a N(0,1) distribution.

Comparison of prGWAS and GWAS
Let S (l) and S (l) be binary variables constructed after applying the prGWAS and traditional GWAS on the WTCCC data, such that S (l) =1 if p j (l) <α and zero otherwise, and S (l) =1 if p j * <α and equal to zero otherwise. Here, p j * the Bonferroni-corrected P-value of the j th SNP.
Results from the prGWAS method were compared to those obtained in GWAS using several performance measures [5] after constructing 2×2 contingency tables (Table 1a) for every combination of disease, chromosome, k and sample sizes of the master plates. Initially, the results from the association analysis using GWAS and those using prGWAS were merged by SNP after the p-values from the GWAS analysis were corrected by multiple testing using Bonferroni's criterion. For the lth pair of DNA pools and n 1 and n 2 fixed, the entries of the 2×2  contingency table are  ,  ,  ,   and where I {.} is an indicator. For l fixed, a corresponds to the number of statistically significant SNPs (i.e., positive markers) found by both methods, b is the number of positive SNPs according to prGWAS but not positive in GWAS, c is the number of not significant markers using prGWAS but positive using GWAS, and d is the number of SNPs not statistically significant by neither method. In this context, markers found to be positive using the complete data set are said to be true positives (Table 1a).
In addition to the classical measures, we also calculated the lift, a performance measure initially introduced as 'interest' by Brin et al., [20] and recently used in data mining and marketing [21] to evaluate the relative performance of alternative classification models (Table 1b) [22]. If A is the event 'detecting a marker as being statistically significant using prGWAS' and B the event 'detecting a marker as being statistically significant using GWAS' , lift measures how many times more often A and B occur together than expected if they were statistically independent [22]. In other words, this is equivalent to quantify how much more successful the prGWAS method is likely to be than if no predictive model (i.e., random selection) was used to detect statistically significant markers. Table 2a presents the association signals at previously replicated loci. Using prGWAS, we were able to detect 11 of  the 13 associations previously reported; [6] marker rs4420638 in CAD did not pass quality control (see Methods) and marker rs3087243 in T1D was not prGWAS significant at 5%. As a function of the sample size of the master plates, no difference was found in the number of detected markers (trend P-value= 0.2421, Figure 2a). In Table 2b regions of the genome with the strongest association signals are presented. Twenty of the 21 signals previously reported in a standard analysis [6] were detected by prGWAS, with 43% (9/21) of these signals being statistically significant for at least one value of n (see Methods). From the 63 signals/sample size combinations, 11% (7/63) were not significant at 5% (BD: rs420259 for n=500; CD: rs10761659 and rs2542151 for n=250; T1D: rs11171739 for n=250; T2D: rs9465871 regardless of n). Furthermore, the number of detected markers increases with n (trend P-value=0.0283, Figure 2b). A more detailed description of the findings for each of the seven diseases is provided in the Supplementary material.

Previously reported SNPs
A total of 58 markers were reported as showing moderate evidence of association (6th column, Table 2c): 13 in BD, nine in RA and T2D, eight in CD, seven in T1D, and six in CAD and HT. Of those, six (10.3%) did not pass prGWAS quality control. Despite detecting all signals using prGWAS, 38 of the 174 (22.4%) signal/ sample size combinations were not significant for at least one value of n. Overall, the number of markers detected increases with n (trend P-value=0.00343, Figure 2c). An extensive and more detailed description of the associated variants for each disease can be found in the Supplementary Material. Figure 2 depicts the number of DNA pools to be generated k as a function of the sample size of the master plates n to obtain, using prGWAS, similar P-values to those reported previously (see Table 2) when the full genotype data was used.

Number of DNA pools
Panel (a) depicts the results to detect markers in previously replicated loci ( Table 2a). As a function of n, regression analysis shows that k decreases as a function of n (trend P-value=7.88x10 -4 ). Furthermore, analysis of variance (ANOVA) discloses statistically significance difference in the average value of k as a function of n (F 2,26 =7.197, P-value=0.00325). The highest Tukey's honest significance difference was obtained after comparing k for n=1000 and n=250 (d =-4.25, P adjusted =0.0237). These results suggests that, in prGWAS, the larger the value of n the lower the number of DNA pools to be generated in order to obtain comparable P-values to those obtained at previously robustly replicated loci.
The number of DNA pools to be generated in prGWAS decreases as a function of n (trend P-value=0.0252) for detecting regions with the strongest association signals (Table 2b). However, no statistically significance difference was found between the average value of k across sample sizes (k all =5.63, k 250 =7.11, k 250 =5.93, k 1000 =4.68; F 2,40 =2.723, p=0.0778). Furthermore, no linear relationship between n and k (trend P-value=0.189) nor in the average value of k for all sample sizes (k all =5.29, k 250 =5.85, k 250 =5.76, k 1000 =4.92; F 2,61 =0.898, p=0.413) was found when detecting regions of the genome with moderate evidence of association (Table 2c). Altogether, these results indicate that prGWAS detects those regions of the genome showing the strongest or moderate evidence of association previously reported by randomly generating a reasonably number of DNA pools regardless of the sample size of the master plates.

Performance measures
A total of 4,159 (7 diseases 22 chromosomes×3 master plates' sample sizes×9 combined P-values) 2×2 contingency tables were constructed in all in silico experiments. The main results are presented in Figure 3.
Across all seven diseases, the sensitivity ranges from 0.15% to 1.4%. Regardless of k, the lowest sensitivity values are obtained when the sample size of the master plate is n 1 =n 2 =n=250 and the highest, except for CAD, RA and T1D, when n 1 =n 2 =n=1000 (see Figure 3a across all diseases).
In Figure 3b the results for the specificity are depicted. It can be seen that specificity increases slightly as a function of k when n is fixed, and is slightly higher for k fixed when n is large. In practical terms this implies that applying prGWAS using master plates of relatively large size would result in higher specificity values (i.e.,~99%) regardless of k. However, as per      Continuation of Table 2c. HT=Hypertension; T1D=Type 1 diabetes; T2D=Type 2 diabetes. *Statistically significant in prGWAS but the resulting combined P-value was higher than that obtained using the full data set. 1 As reported in Table 2 of Burton et al [6].

Collection SNP prGWAS P-value (k) Genotypic
the results when the GWAS approach is used, it also seems that the change in specificity is not particularly high for large n. Figure 3c presents the results for the classification rate (CR). Obtained values range between 80% and >90%, with master plates of larger size and higher values k producing lower CRs. Conversely, when n is small (i.e., n=250), prGWAS produces the higher CRs for k>2. Furthermore, the CR tends to increase slightly when k>5 for this sample size. Results for the Positive Predictive Value (PPV) and Negative Predictive Value (NPV) are presented in Figures 3d and e, respectively. In the former, master plates of size n=1000 provide PPV of >70% regardless of k, with PPVs stabilizing when k>3. Despite not being the highest, master plates of ~16% of controls and ~50% of cases (i.e., n=500) produce PPVs between 50% and 80%. On the other hand, the NPVs range from 80% to 90%, with the highest values found when n=250 regardless of k, followed by n=500 and n=1000. As a function of k, the NPV decreases for n=1000 and n=500, and slightly increases when k>3 for n=250 (panel (e), blue line).  (Table 2a); (b) regions with the strongest association signals (Table 2b), and (c) regions of the genome showing moderate association when the full genotype data is used (Table 2c). Here, n is the sample size of the master plates across all seven diseases. Number of markers with similar P-values to those reported in Burton et al., [6] are presented in parenthesis. Individual data points were jittered to facilitate interpretation. decreases as k increases, and that, in contrast with n=1000 which produces FDR values close to the nominal type I error probability α, n=250 produces higher FDR values. When the master plate is of size n=1000 (i.e., randomly selecting ~33% of the total number of controls and ~50% of the total number of cases) the FDR is close to the nominal level after k>4 for most diseases, except T1D. This value of k translates in having at least 4,000 cases and equal number of controls in the equivalent association analysis [1]. Figure 3g shows the results of lift for predicting statistically significant markers using prGWAS as a function of n and k. Across all diseases, lift values are higher as the sample size of the master plates increases, suggesting that the prGWAS method performs better at predicting statistically significant markers than a random choice targeting model. Lift values range between 2.5 and 7.5, with the highest values being recorded when for 2<k<4 when n=1000 and n=500 regardless of the disease, and the lowest values when n=250 regardless of k. On the other hand, the lift for predicting negative markers is slightly greater than one for all diseases (data not shown). Altogether, these results suggest that the prGWAS method is potentially useful for predicting positive markers in future data sets.

Discussion
By varying the number of pairs of DNA pools and the sample size of the master plates, it was possible to demonstrate that prGWAS is efficient, specific and accurate compared to conventional GWAS [1].
In the GWAS approach, a total of ~17,000 DNA samples were genotyped, whereas using prGWAS this number would have considerably been reduced to at least 80 DNA pools (up to 10 DNA pools per disease+10 DNA pool from the controls) regardless of how many individuals' DNA samples are present in the master plates. As in DNA pooling genotyping every pool is treated as a single DNA sample, the reduction in genotyping costs by using prGWAS is substantial. Despite the encouraging results of the prGWAS method when compared to GWAS using the WTCCC data, some limitations need to be acknowledged. First, in contrast with the reported GWAS results [6]. In the analyses reported here all cases and controls individual genotypes were used for the genetic association analysis; secondly, no control by population structure was performed, and thirdly, no SNPs were removed after visual inspection.
Potential applications of prGWAS include whole-exome and whole-genome sequencing for cases and controls, and GWAS when families of (not necessarily) identical structure with affected and unaffected siblings are recruited [23].