

Solvang HK, Shimodaira H, Kaveh F, Nebdal D, Børresen-Dale A-L, Edvardsen H and Kristensen VN. DNA copy number profiles during tumour progression in breast cancer. J Med Stat Inform. 2014; 2:10. http://dx.doi.org/10.7243/2053-7662-2-10
Hiroko Kato Solvang1*, Hidetoshi Shimodaira2, Fatemeh Kaveh3, Daniel Nebdal4, Anne-Lise Børresen-Dale4,5, Hege Edvardsen6 and Vessela N Kristensen4,5,6
*Correspondence: Hiroko Kato Solvang hiroko.solvang@imr.no
1. Marine Mammals Research Group, Institute of Marine Research, Bergen, Norway.
2. Division of Mathematical Science, Graduate School of Engineering Science, Osaka University, Japan.
3. Medical Genetics Department, Oslo University Hospital (Ullevål), Kirkeveien 166, 0407 Oslo, Norway.
4. Department of Genetics, Institute for Cancer Research, Oslo University Hospital, Montebello, 0310 Oslo, Norway.
5. Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, Norway.
6. Department of Clinical Molecular Biology and Laboratory Science (EpiGen), Division of Medicine, Akershus Universit Hospital, University of Oslo, Norway.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background: In this paper we report the prevalence of copy number aberration events at various stages (subclasses) of breast cancer as assessed by two different statistical methods, GISTIC, a well-known method to identify significant somatic copy-number alterations in a given set and GISDIP: Genomic Identification of Significant Differences in Progression, a newly developed numerical algorithm described here to compare different sets.
Methods: GISTIC assesses significant aberrations towards whole genomic location for each stage and GISDIP assesses significant difference towards whole genomic location for two stages pair-wise.
Results: We compare GISDIP with the original GISTIC by a simulation study. Comparing with the focal/broad regions obtained by GISTIC, GISDIP can directly illustrate significantly different regions between different stages. We then performed experimental data analyses and identified all significant aberrations at each stage of breast cancer using GISTIC and identified the peaks significant for each stage/tumour category. Then we applied GISDIP to identify the significant differences between the tumour size categories in progression. The significance was assessed by permuting the data across all samples in all stages. The statistically significant copy number aberrations specific for the various stages are presented here by both statistics on a set of 530 cases (validation set). The same analysis was performed in subsets of cases according to ER-status. The genes located in the identified regions were investigated by applying network analysis and the list for genomic loci related to allele-specific disparity.
Conclusions: Our method indicates significant particular regions related to difference of copy number aberration between different stages. The particular biological characteristics for network analysis on estimated regions are summarized.
Keywords: DNA copy number alteration, breast cancer, computational algorithm, permutation, false discovery rate, allele-specific disparity
Breast cancer is heterogeneous disease both at clinical and molecular level. Molecular analyses of cancer cells in various stages (subclasses) of progression have revealed that alterations in tumour suppressor genes and oncogenes accumulate during tumour progression and correlate with the clinical aggressiveness of cancer. Tumour size, indicated by the T-category, is known as a strong prognostic indicator for breast cancer and is one of the factors taken into account when deciding how and whether to treat a patient, independently of lymph node status. Although imperfect, tumour size is also together with grade a surrogate marker of time and progression. Gene expression profiling T1 and T2 tumours was recently attempted and revealed distinct molecular pathways characterizing each tumour category [3,4]. The aim of the present study is to deepen our analysis at array CGH DNA level and compare profiles of various stages of breast tumour as a surrogate of the course of tumorgenesis and identify significant aberrations that may be driving the process at each stage.
To identify somatic copy-number alterations (SCNA) related to tumour development and progression methods, such as Genomic Identification of significant targets in cancer (GISTIC) [1] were applied. This method identifies broad/focal regions of the genome that are significantly amplified or deleted across a set of samples. These regions are recognized as likely drivers SCNAs by evaluating the frequency and amplitude of the observed events. The main procedures to identify the regions take the summation for the aberrations for tumours within a certain class and find significantly aberrant regions using permutated data. GISTIC II [2] adds to the original GISTIC methodological improvements for different complex cancer genomes containing a mixture of SCNA types occurring at distinct background rate. The outputs are very sensitive to sample size. In fact, GISTIC II indicated that most significant peaks become stable around a set as large as 800 samples [2].
Our aim was to identify significant differences between the copy number alteration landscapes in different time points of tumour progression. For this purpose it is not enough to identify SCNA for each stage of progression, but also to assess the significance of difference in their occurrence between the stages. Applying GISTIC for different stages require two steps: 1) apply GISTIC to each stage and 2) compare the significant focal regions between stages. Furthermore, the number of samples for each stage usually varies and the output assessed by permutation within one stage could not be compared with other stages.
Therefore, for the analysis of data involving inconsistent sample number for each stage, we propose a new approach called 'Genomic Identification of Significant Difference for DNA copy number profiles In tumour Progression (GISDIP)', which directly identifies the differences of SCNAs between different stages through one step procedure. GISDIP uses all relevant samples to generate the null distribution to identify the level of significance.
In this study, we apply GISDIP to CGH DNA profiles for the Norwegian Breast Cancer Study consisting of 530 samples from five different cohorts, stratified by T-status and ER-status. The significant regions obtained by GISDIP were subjected to additional analyses such as network function analysis of the harboring genes and allelic-specific disparity [5] that detects allele-specific chromosomal aberrations and loss of heterozygosity.
In the following sections we describe the principle of GISDIP and compare the results obtained by GISTIC and GISDIP as well as their performance using a simulation study.
Segmentation for CGH data
In this study, array CGH DNA profiles were observed as log R
(denoted hereafter as logR), which is the log-ratio of observed
probe intensity to expected intensity. Positive/negative logR
indicate a region of DNA copy number gain/loss. Every tumour
genome was segmented by fitting a piecewise constant
function to the logR values to obtain smoothing value [6].
GISTIC
Genomic Identification for Significant Targets (GISTIC) was
developed to identify broad/focal regions of the genome that
are significantly amplified or deleted across a set of samples Each
aberration is assigned a G-score that considers the amplitude of
the aberration as well as the frequency of its occurrence across
samples. False Discovery Rate q-value [7] are then calculated for
the aberrant regions, and regions with q-values below a userdefined
threshold (default=0.25) are considered significant.
Identified 'peak region' is the part of the aberrant region with
greatest amplitude and frequency of alteration. The original
algorithm was published in 2007 [1] and it has revised as
GISTIC II [2] that additionally describes a probabilistic method
for defining the boundaries of selected SCNA regions with
user-defined confidence. Since each aberration is considered
across samples, resolution for average number of genes per
peak becomes sensitive for sample size. In article [2] indicated that stable resolution
for number of focal peaks and genes per peak can be obtained
at sample sizes over 800-1000 samples.
GISDIP
For given CGH data (logR values), let the smoothed value indicate xijk for genomic location i=1,2,....,N, stage j=1,2,....,L and samples
k=1,2,....,mj for stage j. First, the mean value yij of xijk is calculated
as a representative value for each genomic location of each
stage defined by
(1)
Next, a statistics Si is calculated to investigate difference of yij between different stages. If the difference between a stage
indicating the largest mean and a stage (e.g., j=1) indicating
the smallest mean, Si should be defined as
. On
the other hand, if each difference between two different stages
Si should be taken into account, should be defined as Si |yi1-yi2|
or Si = yi1-yi2. The later formula detect yi2>yi1 in addition to
the difference between two stages. To avoid the variability for
each stage, Si should be standardized dividing by the variance
σ1 for each genomic location of all stages that
(2)
Now, to assess Sij obtained by original data, GISDIP generates the null distribution for Si by the following procedures:
If 1000 times iteration is performed through the above procedures 2-4, we can generate a null distribution by Si*(1),..., Si*(1000). In this case, the p-value for each genomic location
(3)
To assess multiple testing, the false discovery rate (FDR [7]) should be applied to these p-values. The regions with FDR<5% are taken to indicate significant difference between different stages in this study. The detail of the above procedures is summarized in Figure 1.
Figure 1 : Procedures to generate permutated data.
Allele-specific disparity
SNP-based arrays can detect allele-specific chromosomal
aberrations such as loss of heterozygosity. We and others have
developed several novel approaches to identify loci showing
preferential disparity from heterozygous toward either the
A/B-allele homozygous (allelic-disparity, AD) [5]. In a previous
study we applied the method to B-allele frequency and logR
for blood and tumour of breast cancer patients and concluded
that 13 K loci were subjected to AD and associated with some
clinical parameters like grade, ER, Her2, TP53 and PR. In this
article, we analyzed the stage specific markers obtained by
GISDIP for allelic disparity by overlapping with the 13K loci
obtained in [5].
Pathway analysis
Pathway analysis was applied in order to investigate the
network function, canonical pathway and biological functional
interaction for the lists of markers obtained by GISDIP. This
analysis was performed by Ingenuity Pathway Analysis (IPA,
Ingenuity system http://www.ingenuity.com/). This delivers an
assessment of the signaling and metabolic pathway, molecular
networks and biological process that are most significantly
perturbed in the dataset of interest. IPA offers many options
for finding insights into relationships, mechanisms, function
A simulation study
Before applying both methods to real data, we performed a
simulation study. Suppose u= {(xij, yij), i=1,..., m, j=1,...,n} be the copy number data where m is the number of samples yi1,yi2,...,yin are the log-ratio measurements of one sample that
are ordered by the location of a probe along one chro-mosome
(xi1≤xi2≤....≤xin) where n is the probe number. In the first set
of simulation, the minimum length of probes was equal to
50 and the maximum length of the probes was considered
700 probes. We generated the probes with the normal
distribution and the standard deviation of 0.10. To emulate
the complexity of real tumour profiles, we added aberrations
by uniform distribution that generates random deviates or
aberrations. Any values less than 1 were set as a deletion and
values greater than two was fixed as amplification. Supplement figure S1 illustrated 25 examples of generated data
for each case. Simulation 1(sim1) involved 30 samples and
1000 artificial features with four amplified and two deleted
artificial aberrations. Simulation 2 (sim2) involved 100 samples
and 1000 artificial features with six amplifications and four
deletions as artificial aberrations. To the two simulation data,
we applied the basic theory for GISTIC and GISDIP. We did not
apply the entire procedure of GISTIC to enhance the essential
difference between GISTIC and GISDIP. In the case of GISTIC
(see Figure 1 in [1]), the program first calculates summation [1]
(called G-score, -log (probability | background) in the case of GISTIC II [2]) across all samples of a data. To assess the vector
of G-score, permutation across the features was performed.
For 1000 permutated data, the null distribution was generated
for each marker across the genome. As final procedure, the
program assesses G-scores for the null distribution and applies
q-value to the vector of obtained p-values. In this study, FDR
was calculated instead of q-value. GISTIC also involved peeloff
algorithm to rescore to potential peaks; however, we here
apply only basic procedure (not applying entire procedure
of GISTIC) as below:
Supplement figure S2 summarized the outputs for GISTIC. The solid black plots for two panels indicate G-score for sim1 and sim2. The significant aberrations obtained by GISTIC were indicated by circles for each panel. The regions indicated the highest peaks for amplified and deleted aberrations. On the other hand, the significant regions obtained by GISDIP were illustrated by lines the lowest panels in Supplement figure S3. The top and middle panels of Supplement figure S3 indicated mean values for sim1 and sim2. The curve of the lowest panel of Supplement figure S3 was the absolute value of the difference between mean values of sim1 and sim2. As shown in the lowest panels, the lines obtained by GISDIP indicated clear difference between mean values of sim1 and sim2. When applying GISTIC to this curve, the region marked by circle was estimated as significant peak region. These studies show that while similar profiles were identified by GISTIC and GISDIP, GISDIP had a higher sensitivity to identify significant peaks in smaller subsets. Therefore, GISDIP should be appropriate to identify the significant difference for DNA copy number profiles in the stages for tumour progress.
Materials
We applied GISTIC and GISDIP to logR values transferred from
DNA copy number data, which high resolution CGH profiling
was carried out by Agilent technologies on 244K Agilent
Human Genome CGH Microarrays. These data came from the
Norwegian Breast Cancer Study comprising of five different
cohorts, designated as MDG, Uppsala, Ull, MicMa and DBCG.
MDG [8] consists of pure invasive ductal carcinomas (DCIS)
less than 15 mm and mixed invasive and in situ components,
stage I (T1, <2cm), stage II (T2, 2cm ≤ and <5cm), and stage
III (T3, 5cm ≤). Uppsala [8] came from core biopsies of women
with dense breast parenchyma and invasive ductal carcinomas.
Both cohorts are clinically similar. Ull [9] consists of mainly
stage I and II breast cancers. MicMa consists of mainly stage
I and II breast cancers. Fresh frozen tumour biopsies were
collected from 150 to 920 patients included in the Oslo Micrometastasis
Project between 1995 and 1998. DBCG comprise
a collection of tumour tissues from 3,083 high-risk Danish
breast cancer patients diagnosed in the period 1982-1990
[10-12]. Number of patients for each stage in five cohorts is
summarized in Table 1A.
Table 1 : Summary of the data for five cohorts and T-status.
The 530 cases analyzed here have been characterized for all routine clinical parameters. We have in this study performed the analysis on the entire data set as well as stratified by estrogen receptor (ER), which is well-known important parameter in breast cancer [13]. The samples for each stage are shown in Table 1B. A few patients of each stage were not observed ER status. For the reason, the total number of ER status of each stage was less than total number of each stage. We applied GISDIP to all samples shown in Table 1A as well as stratified by ER as shown in Table 1B.
Applying GISTIC to all samples of different stages
GISTIC was first used for the analysis of all samples for different
stages and tumour categories. Two plots of G-scores and
q-values were obtained with respect to amplifications and
deletions separately and are shown for all markers by red
and blue lines in all four stages in Figure 2. For detecting
significant gain and loss, we used the default threshold 0.25,
The symbol '*' and 'x' indicate the areas corresponding to the
areas identified also by GISDIP as significant for each stage
(see below). The exact amp and del loci and residing genes
identified by this method are summarized in Supplement Table S1a-d. Gene rich loci amplified at 1q32.1, 8q23,
20q13.33, and deletions in multiple others (Supplement Table S1a) were observed at DCIS level. However many of the same loci continued to be significant by GISTIC in the next
stages (Supplement Table S1b-d). Therefore a formal
method to compare the differences between the significant
peaks at each stage was needed.
Figure 2 : Significant aberration (amplification/deleted) estimated by GISTIC.
Applying GISDIP to all samples for different stages
GISDIP was applied to compare pair wise each two neighboring
stages (subclasses) as follows DCIS and T1 (DCIS-T1), T1
and T2 (T1-T2), and T2 and T3 (T2-T3). The absolute value
Sij=|yiA-yiB| where Aand B are two neighbor stages and the
p-values calculated by (3) were summarized in Figure 3.
The plots for the significant regions illustrate negative/positive
difference Sij=yiA- yiB between stages A and B (for example)
instead of the absolute amount calculated by
Sij=|yiA- yiB|.
For each difference Sij=yiDCTS-yiT1,
for DCIS-T1, Sij=yiT1- yiT2 for
T1-T2, and Sij=yiT2 - yiT3 for T2-T3 were calculated. The positive
plots (above zero) mean that the aberrations of stage A are
larger than the aberrations of stage B, and the negative plots
(below zero) mean that the aberrations of stage B are larger
than the aberrations of stage A, e.g., for 'DCIS-T1', stage A
corresponds to 'DCIS' and stage B corresponds to'T1'. The
significant regions by 5% FDR between DCIS and T1, T1 and
T2, and T2 and T3 are also summarized in Supplement Table S2.
Figure 3 : Pairwise comparison of two neighboring stages by GISDIP.
The total number of markers (genes) estimated as significant different regions (FDR 5%) by GISDIP were 1923 (99) when comparing DCIS to T1, 5558 (626), comparing T1-T2, and 1035 (53) comparing T2-T3, respectively. The probes and corresponding gene symbols are summarized in Supplement Table S2. Comparing our data with the TCGA data on breast cancer [14], we see that among the focal amplifications/deletions and arm-level gains and losses identified by the TCGA network we are able to suggest a time-line to these CAN events (Table 2). Applying GISDIP a more clarified picture of the stage dependent events emerges. Only the deletion on chromosome 3q21.3, and amplification of 16q12.1 and 23 are significantly predominant in DCIS, while chromosomes 1,5,6,8,13,14, 16, 19 and 20, are characteristic Figure 3).
Table 2 : Comparison of GISTIC results with the TCGA data on breast cancer.
Focusing on the overlapping regions identified by both GISTIC and GISDIP we find for DCIS only the deletion on 3q21.3 and amplification of 16q12.1 remaining. For the T1-T2 comparison a main finding is the increasing frequency of amplifications on 1q (regions 1q21.2, 1q21.1 and 1q31 among others) and 8q24.3 and 9q34.3. For T2-T3 we see, among others, an increasing frequency of amplification of 11q13.4-5 as well as deletion of 7q35 only found in T3 (Supplement Table S3). Among the gene lists from the identified regions, we found in DCIS-T1 deleted cluster of differentiation markers CD80 and CD86 providing co-stimulatory signals necessary for T-cell activation [15,16]; in T1-T2 we find in RBBP5, a protein coding and tumour suppressor, PIK3C2B (phosphatidylonositol 3 kinase pathway) and MDM4 for p53 inhibitor like MDM2 [17]; in T2-T3 we see specifically deletion of the tumour suppressor Figure 4).
Figure 4 : Overlapping CNA events identified by GISTIC and GISDIP.
Pathway analysis for GISDIP outputs to all samples for
different stages
We further examined whether the genes identified by GISDIP
as differentially amplified or deleted between pre-invasive
and invasive tumours of different size categories are enriched
for certain functions, we applied IPA as introduced in subsection
Method-Pathway analysis to these markers. Using
IPA we investigated the canonical pathways and network
functions enriched within the GISDIP gene list. As shown
in 'Top canonical pathway' of Table 3, the events specific for
DCIS included genes from DNA Double-Strand Break Repair
by Homologous Recombination, B Cell Development, PPARα/RXRα Activation, Autoimmune Thyroid Disease Signaling and
Allograft Rejection Signaling. The molecules indicated ATRX,
POLA1, IL1RAPL1, and GK located on chromosome X. For T1-T2, all canonical pathways were related to regions indicating
specific to T1 aberrations when compared to T2 aberrations.
For differences between T2-T3, most canonical pathways
contain molecules of CDK2A on chromosome 9, FZD4 on
chromosome 11 for T2 aberrations >T3 aberrations, and WNT11
on chromosome 11 for T3 aberrations >T2 aberrations.
Table 3 : IPA outputs of top canonical pathways and top associated network functions for different stagesa.
ER stratification for different stages
In addition, we applied GISDIP to the samples stratified by ER
status (positive/negative). Estimated significant regions were
summarized in Supplement Table S4. In the ER negative
cases no significant regions were identified by GISDIP when comparing DCIS-T1 and T1-T2. Comparing T2-T3 however we
found several loci significantly different between the two
subclasses such as 1p31.1, 6q14.1 and 9p21-22 (Figure 5).
When overlapping these data with the GISTIC data only the
region at 9p21 was found by both methods (Figure 6 and
Supplement Table S4). Much more distinct differences
were observed in the ER positive cases. The estimated number
of markers was 273 for T2-T3 in ER negative and 474 for DCIS-T1,
6020 for T1-T2, 517 for T2-T3 in ER positive (Supplement Table S4, Figure 5). The overlapping results from GISTIC and
GISDIP analyses indicate significant copy number alterations
in ER+ (T1-T2) for regions such as 2q11, 4p16, 8q24, 11,q13,
16p13 and 17q25 (Figure 6 and Supplement Table S4). In
general, it is known that copy number alterations are better
defined in ER positive cases than copy number alternations for
ER negative, where a multiple firestorms events are observed
[18]. This fact and the smaller number of cases may support
that it was hard to have significant differences between
DCIS-T1 and T1-T2.
Figure 5 : GISDIP identified chromosomal aberrations according to ER-status between
DCIS-T1, T1-T2 and T2-T3.
Figure 6 : ER dependent CNA events identified by both GISTIC and GISDIP.
We applied IPA to the markers obtained by GISDIP applying to ER negative samples for T2-T3 and positive samples for all cases. IPA investigated the canonical pathway and network function. The outputs were summarized in Table 4.
Table 4 : IPA outputs of top canonical pathways and top associated network functions for different stages.
Overlapped regions with allele-specific disparity (AD)
As introduced in sub-section Method-GISDIP, in one of the
datasets (MicMa) we had available SNP-CGH data and found
13K loci in which we observed a preferential loss or gain of
a specific allele (referred here to as allelic disparity (AD)).
Illumina Beadstudio package was applied to MicMa samples
and handled B-allele frequency and logR values as described
in Kaveh et al., [5]. We matched the identified aberrant list on
RefSeq gene IDs from this study (Agilent 44K array) with the
positions of the rs-numbers for SNPs, using an annotation file
called "SNP_annotation_Infinium HumanBeadChip1_v1.2.1_
allchr" released by Illumina. In 13K loci, the overlapped number
and corresponding chromosome regions were summarized in the cases of all samples and samples for ER positive in Table 5.
According to three combinations of different stages, DCIS-T1, T1-T2, and T2-T3, the probabilities for the events of allelic disparity
occurring in the loci of significant stage associated aberration were assessed by Chi-square test. The p-values are 0.00025 for
all cases and 1.63e-14 for the ER positive subgroup. It suggests
that the probabilities finding AD in significant different
regions are significantly different among different stages.
Table 5 : Overlapped number of regions obtained by GISDIP on 13K loci obtained by AD in the case of all samples and ER positive samples.
To investigate difference for copy number aberration among different stages, we proposed a new approach called GISDIP. Tocompare well-known GISTIC that identify focal/broad regions of the genome that are significantly amplified/deleted, a simulation study could demonstrate each role from GISTIC and GISDIP. GISDIP illustrated different significant regions and biological characters for all samples and for samples stratified by ER positive/negative. Significant regions obtained by GISDIP involved the parts overlapped with allele-specific disparity. The probabilities of the overlapping regions were not same between different stages. As the further work, GISDIP could expand to identify differences of expression level like micro- RNA changing according to different stages.
Additional files
Supplement Table S1a
Supplement Table S1b
Supplement Table S1c
Supplement Table S1d
Supplement Table S2
Supplement Table S3
Supplement Table S4
Supplement figure S1
Supplement figure S2
Supplement figure S3
The authors declare that they have no competing interests.
Authors' contributions | HKS | HS | FK | DN | ALB | HE | VNK |
Research concept and design | √ | √ | -- | -- | -- | -- | √ |
Collection and/or assembly of data | -- | -- | -- | √ | √ | -- | √ |
Data analysis and interpretation | √ | √ | √ | √ | -- | √ | -- |
Writing the article | √ | -- | √ | -- | -- | √ | √ |
Critical revision of the article | √ | √ | -- | -- | -- | √ | √ |
Final approval of article | √ | √ | √ | √ | √ | √ | √ |
Statistical analysis | √ | √ | √ | -- | -- | -- | -- |
This study was supported by grants 193387/V50 Understanding breast cancer genomics to ALBD/VKN from the Norwegian Research Council (NFR). We appreciate for Dr. Hege G. Russnes, Department of Genetics, Radiumhospitalet, who gave us valuable biological comments for ER status.
Editor: Zhongxue Chen, Indiana University Bloomington, USA.
EIC: Jimmy Efird, East Carolina University, USA.
Received: 20-Sep-2014 Final Revised: 23-Oct-2014
Accepted: 26-Nov-2014 Published: 02-Dec-2014
Solvang HK, Shimodaira H, Kaveh F, Nebdal D, Børresen-Dale A-L, Edvardsen H and Kristensen VN. DNA copy number profiles during tumour progression in breast cancer. J Med Stat Inform. 2014; 2:10. http://dx.doi.org/10.7243/2053-7662-2-10
Copyright © 2015 Herbert Publications Limited. All rights reserved.