DNA copy number profiles during tumour progression in breast cancer

Background: In this paper we report the prevalence of copy number aberration events at various stages (subclasses) of breast cancer as assessed by two different statistical methods, GISTIC, a well-known method to identify significant somatic copy-number alterations in a given set and GISDIP: Genomic Identification of Significant Differences in Progression, a newly developed numerical algorithm described here to compare different sets. Methods: GISTIC assesses significant aberrations towards whole genomic location for each stage and GISDIP assesses significant difference towards whole genomic location for two stages pair-wise. Results: We compare GISDIP with the original GISTIC by a simulation study. Comparing with the focal/broad regions obtained by GISTIC, GISDIP can directly illustrate significantly different regions between different stages. We then performed experimental data analyses and identified all significant aberrations at each stage of breast cancer using GISTIC and identified the peaks significant for each stage/ tumour category. Then we applied GISDIP to identify the significant differences between the tumour size categories in progression. The significance was assessed by permuting the data across all samples in all stages. The statistically significant copy number aberrations specific for the various stages are presented here by both statistics on a set of 530 cases (validation set). The same analysis was performed in subsets of cases according to ER-status. The genes located in the identified regions were investigated by applying network analysis and the list for genomic loci related to allele-specific disparity. Conclusions: Our method indicates significant particular regions related to difference of copy number aberration between different stages. The particular biological characteristics for network analysis on estimated regions are summarized.


Introduction
Breast cancer is heterogeneous disease both at clinical and molecular level. Molecular analyses of cancer cells in various stages (subclasses) of progression have revealed that alterations in tumour suppressor genes and oncogenes accumulate during tumour progression and correlate with the clinical aggressiveness of cancer. Tumour size, indicated by the T-category, is known as a strong prognostic indicator for breast cancer and is one of the factors taken into account when deciding how and whether to treat a patient, independently of lymph node status. Although imperfect, tumour size is also together with grade a surrogate marker of time and doi: 10.7243/2053-7662-2-10 progression. Gene expression profiling T1 and T2 tumours was recently attempted and revealed distinct molecular pathways characterizing each tumour category [3,4]. The aim of the present study is to deepen our analysis at array CGH DNA level and compare profiles of various stages of breast tumour as a surrogate of the course of tumorgenesis and identify significant aberrations that may be driving the process at each stage.
To identify somatic copy-number alterations (SCNA) related to tumour development and progression methods, such as Genomic Identification of significant targets in cancer (GISTIC) [1] were applied. This method identifies broad/focal regions of the genome that are significantly amplified or deleted across a set of samples. These regions are recognized as likely drivers SCNAs by evaluating the frequency and amplitude of the observed events. The main procedures to identify the regions take the summation for the aberrations for tumours within a certain class and find significantly aberrant regions using permutated data. GISTIC II [2] adds to the original GISTIC methodological improvements for different complex cancer genomes containing a mixture of SCNA types occurring at distinct background rate. The outputs are very sensitive to sample size. In fact, GISTIC II indicated that most significant peaks become stable around a set as large as 800 samples [2].
Our aim was to identify significant differences between the copy number alteration landscapes in different time points of tumour progression. For this purpose it is not enough to identify SCNA for each stage of progression, but also to assess the significance of difference in their occurrence between the stages. Applying GISTIC for different stages require two steps: 1) apply GISTIC to each stage and 2) compare the significant focal regions between stages. Furthermore, the number of samples for each stage usually varies and the output assessed by permutation within one stage could not be compared with other stages.
Therefore, for the analysis of data involving inconsistent sample number for each stage, we propose a new approach called 'GenomicIdentification of Significant Difference for DNA copy number profilesIn tumour Progression (GISDIP)' , which directly identifies the differences of SCNAs between different stages through one step procedure. GISDIP uses all relevant samples to generate the null distribution to identify the level of significance.
In this study, we apply GISDIP to CGH DNA profiles for the Norwegian Breast Cancer Study consisting of 530 samples from five different cohorts, stratified by T-status and ER-status. The significant regions obtained by GISDIP were subjected to additional analyses such as network function analysis of the harboring genes and allelic-specific disparity [5] that detects allele-specific chromosomal aberrations and loss of heterozygosity.
In the following sections we describe the principle of GISDIP and compare the results obtained by GISTIC and GISDIP as well as their performance using a simulation study.

Segmentation for CGH data
In this study, array CGH DNA profiles were observed as log R (denoted hereafter as logR), which is the log-ratio of observed probe intensity to expected intensity. Positive/negative logR indicate a region of DNA copy number gain/loss. Every tumour genome was segmented by fitting a piecewise constant function to the logR values to obtain smoothing value [6].

GISTIC
Genomic Identification for Significant Targets (GISTIC) was developed to identify broad/focal regions of the genome that are significantly amplified or deleted across a set of samples Each aberration is assigned a G-score that considers the amplitude of the aberration as well as the frequency of its occurrence across samples. False Discovery Rate q-value [7] are then calculated for the aberrant regions, and regions with q-values below a user-defined threshold (default=0.25) are considered significant. Identified 'peak region' is the part of the aberrant region with greatest amplitude and frequency of alteration. The original algorithm was published in 2007 [1] and it has revised as GISTIC II [2] that additionally describes a probabilistic method for defining the boundaries of selected SCNA regions with user-defined confidence. Since each aberration is considered across samples, resolution for average number of genes per peak becomes sensitive for sample size. in article [2] indicated that stable resolution for number of focal peaks and genes per peak can be obtained at sample sizes over 800-1000 samples.

GISDIP
For given CGH data (logR values), let the smoothed value indicate x ijk for genomic location i=1,2,...,N, stage j=1,2,...,L and samples k=1,2,...,m j for stage j. First, the mean value y ij of x ijk is calculated as a representative value for each genomic location of each stage defined by (1) Next, a statistics S i is calculated to investigate difference of y ij between different stages. If the difference between a stage indicating the largest mean and a stage (e.g., j=1) indicating the smallest mean, S i should be defined as.
On the other hand, if each difference between two different stages S i should be taken into account, should be defined as S i |y i1 -y i2 | or S i = y i1 -y i2 . The later formula detect y i2 > y i1 in addition to the difference between two stages. To avoid the variability for each stage, S i should be standardized dividing by the variance σ 1 for each genomic location of all stages that Now, to assess S ij obtained by original data, GISDIP generates the null distribution for S i by the following procedures: ( To assess multiple testing, the false discovery rate (FDR [7]) should be applied to these p-values. The regions with FDR< 5% are taken to indicate significant difference between different stages in this study. The detail of the above procedures is summarized in Figure 1.

Allele-specific disparity
SNP-based arrays can detect allele-specific chromosomal aberrations such as loss of heterozygosity. We and others have developed several novel approaches to identify loci showing preferential disparity from heterozygous toward either the A/B-allele homozygous (allelic-disparity, AD) [5]. In a previous study we applied the method to B-allele frequency and logR for blood and tumour of breast cancer patients and concluded that 13 K loci were subjected to AD and associated with some clinical parameters like grade, ER, Her2, TP53 and PR. In this article, we analyzed the stage specific markers obtained by GISDIP for allelic disparity by overlapping with the 13K loci obtained in [5].

Pathway analysis
Pathway analysis was applied in order to investigate the network function, canonical pathway and biological functional interaction for the lists of markers obtained by GISDIP. This analysis was performed by Ingenuity Pathway Analysis (IPA, Ingenuity system http://www.ingenuity.com/). This delivers an assessment of the signaling and metabolic pathway, molecular networks and biological process that are most significantly perturbed in the dataset of interest. IPA offers many options for finding insights into relationships, mechanisms, function and pathway of relevance. where indicates the largest mean and point to the smallest mean. Then each difference between two different stages takes into account by S ij =|y i1 -y i2 | or S ij =y i1 -y i2 . The later formula detect in addition to the difference between two stages. To avoid the variability and standardization of each stage S i is divided by the variance σ i for each genomic location. To assess

A simulation study
where n is the probe number. In the first set of simulation, the minimum length of probes was equal to 50 and the maximum length of the probes was considered 700 probes. We generated the probes with the normal distribution and the standard deviation of 0.10. To emulate the complexity of real tumour profiles, we added aberrations by uniform distribution that generates random deviates or aberrations. Any values less than 1 were set as a deletion and values greater than two was fixed as amplification. Supplement figure S1 illustrated 25 examples of generated data for each case. Simulation 1 (sim1) involved 30 samples and 1000 artificial features with four amplified and two deleted artificial aberrations. Simulation 2 (sim2) involved 100 samples and 1000 artificial features with six amplifications and four deletions as artificial aberrations. To the two simulation data, we applied the basic theory for GISTIC and GISDIP. We did not apply the entire procedure of GISTIC to enhance the essential difference between GISTIC and GISDIP. In the case of GISTIC (see Figure 1 in [1]), the program first calculates summation [1] (called G-score, -log (probability | background) in the case of Calculate y ij * using x ijk * and obtaine S i GISTIC II [2]) across all samples of a data. To assess the vector of G-score, permutation across the features was performed. For 1000 permutated data, the null distribution was generated for each marker across the genome. As final procedure, the program assesses G-scores for the null distribution and applies q-value to the vector of obtained p-values. In this study, FDR was calculated instead of q-value. GISTIC also involved peeloff algorithm to rescore to potential peaks; however, we here apply only basic procedure (not applying entire procedure of GISTIC) as below: 1. Calculating G-score for sim1 and sim2 across the samples. 2. To assess G-score obtained by original data, generate null-distribution by permuting features and calculating G-score again. 3. Calculating p-value for G-score to null-distribution and apply FDR. Supplement figure S2 summarized the outputs for GISTIC. The solid black plots for two panels indicate G-score for sim1 and sim2. The significant aberrations obtained by GISTIC were indicated by circles for each panel. The regions indicated the highest peaks for amplified and deleted aberrations. On the other hand, the significant regions obtained by GISDIP were illustrated by lines the lowest panels in Supplement figure S3. The top and middle panels of Supplement figure S3 indicated mean values for sim1 and sim2. The curve of the lowest panel of Supplement figure S3 was the absolute value of the difference between mean values of sim1 and sim2. As shown in the lowest panels, the lines obtained by GISDIP indicated clear difference between mean values of sim1 and sim2. When applying GISTIC to this curve, the region marked by circle was estimated as significant peak region. These studies show that while similar profiles were identified by GISTIC and GISDIP, GISDIP had a higher sensitivity to identify significant peaks in smaller subsets. Therefore, GISDIP should be appropriate to identify the significant difference for DNA copy number profiles in the stages for tumour progress.

Materials
We applied GISTIC and GISDIP to logR values transferred from DNA copy number data, which high resolution CGH profiling was carried out by Agilent technologies on 244K Agilent Human Genome CGH Microarrays. These data came from the Norwegian Breast Cancer Study comprising of five different cohorts, designated as MDG, Uppsala, Ull, MicMa and DBCG. MDG [8] consists of pure invasive ductal carcinomas (DCIS) less than 15 mm and mixed invasive and in situ components, stage I (T1, <2cm), stage II (T2, 2cm ≤ and <5cm), and stage III (T3, 5cm ≤). Uppsala [8] came from core biopsies of women with dense breast parenchyma and invasive ductal carcinomas. Both cohorts are clinically similar. Ull [9] consists of mainly stage I and II breast cancers. MicMa consists of mainly stage I and II breast cancers. Fresh frozen tumour biopsies were collected from 150 to 920 patients included in the Oslo Micrometastasis Project between 1995 and 1998. DBCG comprise a collection of tumour tissues from 3,083 high-risk Danish breast cancer patients diagnosed in the period 1982-1990 [10][11][12]. Number of patients for each stage in five cohorts is summarized in Table 1A.
The 530 cases analyzed here have been characterized for all routine clinical parameters. We have in this study performed the analysis on the entire data set as well as stratified by estrogen receptor (ER), which is well-known important parameter in breast cancer [13]. The samples for each stage are shown in Table 1B. A few patients of each stage were not observed ER status. For the reason, the total number of ER status of each stage was less than total number of each stage. We applied GISDIP to all samples shown in Table 1A as well as stratified by ER as shown in Table 1B.

Results and discussion
Applying GISTIC to all samples of different stages GISTIC was first used for the analysis of all samples for different stages and tumour categories. Two plots of G-scores and q-values were obtained with respect to amplifications and deletions separately and are shown for all markers by red and blue lines in all four stages in Figure 2. For detecting significant gain and loss, we used the default threshold 0.25, The symbol '*' and ' x' indicate the areas corresponding to the areas identified also by GISDIP as significant for each stage (see below). The exact amp and del loci and residing genes identified by this method are summarized in Supplement Tables S1a-d. Gene rich loci amplified at 1q32.1, 8q23, 20q13.33, and deletions in multiple others (Supplement Table S1a Tables  S1b-d). Therefore a formal method to compare the differences between the significant peaks at each stage was needed.

significant by GISTIC in the next stages (Supplement
Applying GISDIP to all samples for different stages GISDIP was applied to compare pair wise each two neighboring stages (subclasses) as follows DCIS and T1 (DCIS-T1), T1 and T2 (T1-T2), and T2 and T3 (T2-T3). The absolute value S ij =|y iA -y iB | where A and B are two neighbor stages and the p-values calculated by (3) were summarized in Figure 3. The plots for the significant regions illustrate negative/positive difference S ij =y iA -y iB between stages A and B (for example) instead of the absolute amount calculated by S ij =|y iA -y iB |.
For each difference S ij =y iDCTS -y iT1 , for DCIS-T1, S ij =y iT1 -y iT2 for T1-T2, and S ij =y iT2 -y iT3 for T2-T3 were calculated. The positive plots (above zero) mean that the aberrations of stage A are larger than the aberrations of stage B, and the negative plots (below zero) mean that the aberrations of stage B are larger than the aberrations of stage A, e.g., for 'DCIS-T1' , stage A corresponds to 'DCIS' and stage B corresponds to'T1' . The significant regions by 5% FDR between DCIS and T1, T1 and T2, and T2 and T3 are also summarized in Supplement Table S2. The total number of markers (genes) estimated as significant different regions (FDR 5%) by GISDIP were 1923 (99) when comparing DCIS to T1, 5558 (626), comparing T1-T2, and 1035 (53) comparing T2-T3, respectively. The probes and corresponding gene symbols are summarized in Supplement Table S2. Comparing our data with the TCGA data on breast cancer ( [14],   Table S2. we see that among the focal amplifications/deletions and arm-level gains and losses identified by the TCGA network we are able to suggest a time-line to these CAN events ( Table 2) Table S3). Among the gene lists from the identified regions, we found in   Continuation of Table 2.

Pathway analysis for GISDIP outputs to all samples for different stages
We further examined whether the genes identified by GISDIP as differentially amplified or deleted between pre-invasive and invasive tumours of different size categories are enriched for certain functions, we applied IPA as introduced in subsection Method-Pathway analysis to these markers. Using IPA we investigated the canonical pathways and network functions enriched within the GISDIP gene list. As shown in 'Top canonical pathway' of

ER stratification for different stages
In addition, we applied GISDIP to the samples stratified by ER status (positive/negative). Estimated significant regions were summarized in Supplement Table S4. In the ER negative cases no significant regions were identified by GISDIP when comparing DCIS-T1 and T1-T2. Comparing T2-T3 however we found several loci significantly different between the two subclasses such as 1p31.1, 6q14.1 and 9p21-22 (Figure 5). When overlapping these data with the GISTIC data only the region at 9p21 was found by both methods (Figure 6 and Supplement Table S4). Much more distinct differences were observed in the ER positive cases. The estimated number of markers was 273 for T2-T3 in ER negative and 474 for DCIS-T1, 6020 for T1-T2, 517 for T2-T3 in ER positive (Supplement Table  S4, Figure 5). The overlapping results from GISTIC and GISDIP analyses indicate significant copy number alterations in ER+ (T1-T2) for regions such as 2q11, 4p16, 8q24, 11,q13, 16p13 and 17q25 (Figure 6 and Supplement Table S4). In general, it is known that copy number alterations are better defined in ER positive cases than copy number alternations for ER negative, where a multiple firestorms events are observed [18]. This fact and the smaller number of cases may support that it was hard to have significant differences between DCIS-T1 and T1-T2. We applied IPA to the markers obtained by GISDIP applying to ER negative samples for T2-T3 and positive samples for all cases. IPA investigated the canonical pathway and network function. The outputs were summarized in Table 4.

Overlapped regions with allele-specific disparity (AD)
As introduced in sub-section Method-GISDIP, in one of the datasets (MicMa) we had available SNP-CGH data and found 13K loci in which we observed a preferential loss or gain of a specific allele (referred here to as allelic disparity (AD)). Illumina Beadstudio package was applied to MicMa samples and handled B-allele frequency and logR values as described in Kaveh et al [5]. We matched the identified aberrant list on RefSeq gene IDs from this study (Agilent 44K array) with the positions of the rs-numbers for SNPs, using an annotation file called "SNP_annotation_Infinium HumanBeadChip1_v1.2.1_ allchr" released by Illumina. In 13K loci, the overlapped number and corresponding chromosome regions were summarized in the cases of all samples and samples for ER positive in Table 5

Conclusion
To investigate difference for copy number aberration among different stages, we proposed a new approach called GISDIP. To compare well-known GISTIC that identify focal/broad regions of the genome that are significantly amplified/deleted, a Events specific for ER+ are indicated by blue triangles and ER-events by red triangles.  The terms with* indicate -log (p-value) by Fisher's exact test because these were insignificant after applying multiple testing correction.
simulation study could demonstrate each role from GISTIC and GISDIP. GISDIP illustrated different significant regions and biological characters for all samples and for samples stratified   by ER positive/negative. Significant regions obtained by GISDIP involved the parts overlapped with allele-specific disparity. The probabilities of the overlapping regions were not same between different stages. As the further work, GISDIP could expand to identify differences of expression level like micro-RNA changing according to different stages.