Determining juvenile cancer types from gene expression using gene contribution and differential analysis

Background: The usefulness of DNA microarrays is limited by the efficacy of the available methods of gene importance analysis, which could have far-reaching implications in diagnosis, discovery, and treatment of the genetic nature of diseases. This article applies a powerful differential method (DM) based on principal component analysis (PCA) to a large DNA microarray data set containing data from 88 samples of juvenile small round blue cell tumors. Methods: The DM was used to rank the most critical genes from the microarray data, and the top 25 genes in the resulting rank-ordered list were associated with four functional categories: directly cancer-related, protein synthesis, cell cycle control, and neurological function. Conclusions: The strength of the DM is demonstrated in the ability to tie these functions to previous cancer research. The results show that the method can differentiate between similar types of juvenile cancers and that it could also be useful in exploring the genetic nature of various diseases.


Introduction
With the advent of the Human Genome Project and countless other research projects devoted to gene identification and function, several methods of analyzing the resulting vast arrays of data have arisen [1][2][3]. Some of the most useful gene data has been found in DNA microarray data sets. Microarray data from various organisms have been used to suggest the function as well as the contribution of a multitude of genes [1,2]. To process these data sets, several methods have been developed and proposed as accurate means of extracting useful information about genes from the vast DNA microarray data sets. Methods ranging from clustering to neural networks to PCA have produced both intriguing and varied results in determining gene expression from DNA microarrays [1][2][3][4]. Each of these methods has its own strengths and weaknesses when used to analyze data sets.
We sought out microarray data with high variability to demonstrate the ability of the Rollins and Teh [4] differential method (DM) to handle variability in data sets. This search yielded a DNA microarray data set from Khan et al., [2]. The data set contains 88 samples, or arrays, 83 of which come from four different types of childhood cancers classified as small round blue cell tumors: neuroblastoma (NB), Burkitt lymphoma (BL, a non-Hodgkin lymphoma), rhabdomyosarcoma (RMS), and Ewing sarcoma (EWS) [3]. The remaining five samples were non-cancerous in nature and made up of various normal human tissues. Various methods [2,3] for analyzing this data set have been applied in the hope of determining the genetic differences between these similar childhood cancers. Identifying these genetic differences has potential for both diagnosis and treatment. Microarray analysis opens fascinating possibilities in the realm of gene expression, which is the focus of the application of the method to this data set.
In this study we apply the Rollins and Teh [4] differential method (DM), an extension of the method developed in Rollins et al., [1], to a DNA microarray from Khan et al., [2]. We hypothesize that this application will showcase the capability of the comprehensive method to span the differences found in various sets of DNA microarray data. The DM allows the experimenter to choose between two types of contribution plots, known as eigengene (EG) and eigenassay (EA). This method can also process very large sets of data, removing the need for data reductions that may unintentionally eliminate important data. In this article, we explain the methodologies behind the DM and apply it to the complete, unabridged Khan data set. Next, we use the resulting gene data to make comparisons to results obtained in previous studies. The functions of the genes found to be differentially expressed are then stated and analyzed to give a greater understanding of the power of the DM.
doi: 10.7243/2053-7662-2-2

Methods
The expression data derived from the DNA microarray is defined as $X$, an m by n matrix. The m rows of $X$ correspond to genes and the n columns to assays, so each cell of the matrix is the expression level of the i th gene in the j th assay and is given as $x_{ij}$. By convention, the variables of a data matrix lie along its columns and the measurements along its rows, so for $X$ the assays are variables while the genes are measurements. In the transposed matrix, $X^T$, the assays lie along the rows and the genes along the columns. The DM is applied to the microarray gene expression data, and since the DM is based on PCA, the expression data matrix is orthogonally transformed into a new coordinate system chosen to maximize composite variability [1][2][3][4]. In short, the first axis of the new coordinate system points in the direction of the greatest spread in the data, and the second axis is orthogonal to the first. The axes of this system define the PC loading matrix. The eigenvectors, the principal components (PCs), are found from the correlation matrix or the covariance matrix and are ranked by eigenvalue, so the first PC has the largest eigenvalue. We derive the EG contribution plot from the n by n loading matrix obtained from $X$, and the EA contribution plot from the m by n loading matrix obtained from $X^T$ [4].
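The eigenvector step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name pca_loadings, the choice of the correlation matrix over the covariance matrix, and the random matrix standing in for expression data are all hypothetical and are not taken from the DM software.

```python
import numpy as np

def pca_loadings(X):
    """Eigenvalues and loading vectors (as columns) of the correlation matrix
    of X, ordered so that the first PC has the largest eigenvalue."""
    X = np.asarray(X, dtype=float)
    # Columns (assays) are the variables, so R is n x n for an m x n matrix X.
    R = np.corrcoef(X, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Hypothetical stand-in data: m = 200 genes (rows), n = 6 assays (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
eigvals, E = pca_loadings(X)
print(E.shape)  # (6, 6): one loading vector per column
```

Because the columns of E are orthonormal, each one can serve as a loading vector for the contribution calculations in the following sections.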

Eigengene contribution plot approach
After applying PCA to $X$, the values within the EG scores matrix, an m by n pseudo data matrix, must be found. From there, the EG differential gene contribution between two groups can be determined. Before finding the elements of the scores matrix, however, $X$ must be standardized into the matrix $Z$. Each element of $Z$ is defined as $z_{ij}$ and follows the equation

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$

where $\bar{x}_j$ and $s_j$ are the sample mean and sample standard deviation of data column j, respectively. The elements of $S_{EG}$, the EG scores matrix, are given by

$$s^{EG}_{ij} = \sum_{k=1}^{n} z_{ik}\, e^{EG}_{kj} = \sum_{k=1}^{n} c^{EG}_{ijk}$$

where $s^{EG}_{ij}$ is the score for the i th gene using the j th vector of EG loadings, $e^{EG}_{kj}$ is the k th loading for the j th EG vector, and $c^{EG}_{ijk} = z_{ik}\, e^{EG}_{kj}$ is the contribution for the i th gene on the k th assay from the j th EG loading vector. Next, two groups are chosen so that A = Group A with $n_A$ assay members and B = Group B with $n_B$ assay members. The two groups should have no members in common, such that

$$A \cap B = \emptyset \tag{2}$$

Before giving the EG differential contribution, it is necessary to give the mean contribution for the i th gene from the j th EG loading vector from Group A and Group B, respectively. The equations are given as

$$\bar{c}^{EG}_{ij,A} = \frac{1}{n_A} \sum_{k' \in A} c^{EG}_{ijk'}, \qquad \bar{c}^{EG}_{ij,B} = \frac{1}{n_B} \sum_{k'' \in B} c^{EG}_{ijk''}$$

where the assay members are k' and k'' in Groups A and B, respectively. Now we can find the EG differential gene contribution between both groups for the i th gene from the j th EG loading vector by the equation

$$\Delta c^{EG}_{ij} = \bar{c}^{EG}_{ij,A} - \bar{c}^{EG}_{ij,B} \tag{5}$$

The EG differential gene contribution will be written as $\Delta c^{EG}_{ij}$ in what follows [4].
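The chain of steps in this section (standardize, multiply by a loading vector, average within each group, subtract) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name eg_differential, the uniform loading vector, and the simulated data with one planted gene are all assumptions made for the example. In practice the loading vector would come from PCA of X.

```python
import numpy as np

def eg_differential(X, loading, group_a, group_b):
    """Differential EG contribution of every gene for one loading vector.

    X        : (m, n) expression matrix, genes in rows, assays in columns
    loading  : (n,) EG loading vector of the chosen PC
    group_a, group_b : disjoint lists of assay (column) indices
    """
    if set(group_a) & set(group_b):
        raise ValueError("groups A and B must not share assays")
    X = np.asarray(X, dtype=float)
    # Standardize each assay column with its sample mean and sample std (ddof=1).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Contribution of gene i on assay k: C[i, k] = Z[i, k] * loading[k].
    C = Z * loading
    # Mean contribution per gene within each group, then the difference.
    return C[:, group_a].mean(axis=1) - C[:, group_b].mean(axis=1)

# Hypothetical data: 200 genes, 6 assays, gene 0 raised in assays 0 to 2 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
X[0, :3] += 10.0
loading = np.full(6, 1 / np.sqrt(6))  # placeholder loading vector, not from PCA
d = eg_differential(X, loading, [0, 1, 2], [3, 4, 5])
print(int(np.argmax(np.abs(d))))  # 0: the planted gene stands out
```

Summing C across its columns would recover the EG score of each gene, which is why the per-assay terms can be read as contributions to that score.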

Eigenassay contribution plot approach
The EA contribution plot approach applies PCA to $X^T$, creating an m by n loading matrix that is used to find the values within the EA scores matrix, an n by n pseudo data matrix. From there the EA differential gene contribution between two groups can be determined. Unlike the EG contribution plot approach, $X^T$ does not need to be standardized. This allows us to solve for the elements within $S_{EA}$, the EA scores matrix, with the following equation:

$$s^{EA}_{ij} = \sum_{p=1}^{m} x_{pi}\, e^{EA}_{pj} = \sum_{p=1}^{m} c^{EA}_{ijp}$$

The score for the i th assay using the j th vector of EA loadings is $s^{EA}_{ij}$, $e^{EA}_{pj}$ is the p th loading for the j th EA vector, and $c^{EA}_{ijp} = x_{pi}\, e^{EA}_{pj}$ is the contribution for the p th gene on the i th assay from the j th EA loading vector. Next, we need to choose two groups so that A = Group A with $n_A$ assay members and B = Group B with $n_B$ assay members. One group should not contain any members in common with the other, such that it follows Eq. 2 (that is, $A \cap B = \emptyset$). Again the mean contribution must be found, but this time it is for the p th gene from the j th EA loading vector for both groups. The equations are given as:

$$\bar{c}^{EA}_{pj,A} = \frac{1}{n_A} \sum_{i' \in A} c^{EA}_{i'jp}, \qquad \bar{c}^{EA}_{pj,B} = \frac{1}{n_B} \sum_{i'' \in B} c^{EA}_{i''jp}$$

where the assay members are i' and i'' in Groups A and B, respectively. Now we can find the EA differential gene contribution between both groups for the p th gene from the j th EA loading vector by the equation

$$\Delta c^{EA}_{pj} = \bar{c}^{EA}_{pj,A} - \bar{c}^{EA}_{pj,B} \tag{8}$$

The EA differential gene contribution will be written as $\Delta c^{EA}_{pj}$ in what follows [4].
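A comparable sketch for the EA approach is given below. It rests on several assumptions that are not stated in the text: the function name ea_differential is hypothetical, each gene is mean-centered before the PCA step (the contributions themselves use the unstandardized values), and the loadings are obtained from a singular value decomposition rather than by forming the m by m covariance matrix explicitly. The data are simulated.

```python
import numpy as np

def ea_differential(X, group_a, group_b, pc=0):
    """Differential EA contribution of every gene for one EA loading vector."""
    if set(group_a) & set(group_b):
        raise ValueError("groups A and B must not share assays")
    X = np.asarray(X, dtype=float)
    Xt = X.T                      # (n, m): assays in rows, genes in columns
    Xc = Xt - Xt.mean(axis=0)     # assumption: center each gene before PCA
    # The right singular vectors of the centered X^T are the eigenvectors of
    # its covariance matrix; with full_matrices=False, Vt.T is m x n.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loading = Vt[pc]              # (m,) loading vector of the chosen PC
    # Contribution of gene p on assay i: C[i, p] = x[p, i] * loading[p].
    C = Xt * loading
    return C[group_a].mean(axis=0) - C[group_b].mean(axis=0)

# Hypothetical data: 40 genes, 6 assays, gene 0 raised in assays 0 to 2 only.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 6))
X[0, :3] += 20.0
d = ea_differential(X, [0, 1, 2], [3, 4, 5])
print(int(np.argmax(np.abs(d))))  # 0: the planted gene stands out
```

The sign of a singular vector is arbitrary, so only the magnitude of each differential contribution is compared in this example.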

Application of the DM
Our study involves the application of the DM developed in Rollins and Teh [4]. The DM was created to determine a rank-ordered set of genes that express most differently in two or more groups of assays. Initial testing of the DM involved the use of simulated and real gene data for two groups of assays, mutant and wild type mice. In the simulated data study, two hundred of the 40,000 simulated mouse genes were purposely differentiated between the two groups of mice. The DM was able to distinguish these 200 genes based on the differential between their gene contributions [4]. Following the success of the DM in that work, the present study is its first application to real cancer data, which we call the Khan data set. For the Khan data set, there are essentially five different groups of assays: the four different cancer types and the non-cancerous samples. In applying the DM, we show that it may have far-reaching possibilities in a relatively quick and thorough analysis of DNA microarray data, specifically data sets with discernible groups of assays. We applied the DM to the unfiltered complete data set, which included all 6,567 genes. We discovered that the EG loading plots yielded more interesting groupings for analysis than their EA score plot counterparts. Among the top ten PCs, PC 5 yielded a grouping strikingly similar to, and more refined than, that of PC 6 in the DM analysis. Figure 1 shows this grouping of 11 assays, visibly separated from the rest of the 88 assays. A biologically interpretable grouping is desired when using the DM. After finding such a grouping, the contribution associated with the 11 distinguished assays and its differential relative to the remaining 77 assays are found. Using knowledge of the original data set, we found that PC 5 separates all of the BL samples from the rest of the samples. Thus, we applied the DM to find the differential gene expression between samples with BL and those without BL.
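In this study the separated group of assays is identified by inspecting the loading plot. As a rough programmatic stand-in, one could split the assays at the widest gap between sorted loadings. The heuristic, the function name split_at_largest_gap, and the loading values below are all hypothetical and are not part of the DM as described.

```python
def split_at_largest_gap(loadings):
    """Split assay indices into two groups at the widest gap between
    sorted loading values; the smaller group is returned first."""
    order = sorted(range(len(loadings)), key=lambda k: loadings[k])
    gaps = [loadings[order[i + 1]] - loadings[order[i]]
            for i in range(len(order) - 1)]
    cut = max(range(len(gaps)), key=lambda i: gaps[i]) + 1
    low, high = sorted(order[:cut]), sorted(order[cut:])
    return (low, high) if len(low) <= len(high) else (high, low)

# Hypothetical loadings of eight assays on one PC.
loadings = [0.02, -0.01, 0.31, 0.03, 0.29, 0.00, 0.33, -0.02]
signature, rest = split_at_largest_gap(loadings)
print(signature)  # [2, 4, 6]
print(rest)       # [0, 1, 3, 5, 7]
```

A split found this way would still need to be checked against the sample annotations, since the method looks for a grouping that is biologically interpretable rather than merely well separated.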
After finding the individual contributions of both groups of assays, the differential contribution is found through simple subtraction of the linear combination contribution values for each gene.

Figure 1. The EG loading versus assay number based on PC 5.
The black squares (BL samples) are clearly distinguishable from the rest of the assays, yielding a signature of interest that is further analyzed using the DM.

Figure 2. Rank-ordered Differential Contributions for BL Assays versus non-BL Assays.
Rank-ordered differential contribution, from equation 5, plotted against gene index. The EG differential contribution is ranked in decreasing order. The differential is between BL assays and non-BL assays. Genes of importance are found in the top left of the graph and are more spread out than the majority of the genes.

... groups of assays. We chose to further analyze and compare the top 25 genes of interest to investigate the effectiveness of the DM. Table 1 lists the top 25 genes determined by the DM, including their abbreviations and commonly known functions. Next, these genes and their functions are used to assess the success of the DM in light of past studies and the genes' possible contributions to the cancers being studied.
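The rank-ordering step amounts to sorting the genes by their differential contribution in decreasing order and keeping the first few. A minimal sketch follows, with a hypothetical function name, placeholder gene labels, and made-up values rather than results from the Khan data set.

```python
def top_genes(differentials, gene_names, k=25):
    """Return the k genes with the largest differential contribution,
    ranked in decreasing order, as (name, value) pairs."""
    ranked = sorted(zip(gene_names, differentials),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Placeholder labels and values for illustration only.
names = ["g1", "g2", "g3", "g4"]
values = [0.4, 2.1, -0.3, 1.2]
top = top_genes(values, names, k=2)
print(top)  # [('g2', 2.1), ('g4', 1.2)]
```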

Top genes analysis
In analyzing and comparing the top genes determined by the DM, we made two types of comparison. First, we compared our top genes to those found in previous studies by other methods. Second, we related the functions of the top genes to the cancers being studied. The two previous studies that use the same microarray data, Khan et al., [2] and Pal et al., [3], include lists of genes that supported the results of their methods. None of our top 25 genes can be found in either of the corresponding lists of important genes in those two studies. In fact, when we extend our list of top genes to 100, only two genes, number 66 and number 77, match the top 96 genes found by Khan et al., [2], and none match the seven important genes found by Pal et al., [3]. While these results may seem discouraging, further analysis of the two previous studies reveals that similar discrepancies already exist between these two older methods, as well as possible shortcomings in their filtering techniques. Of the seven genes found to be important using relational fuzzy clustering in Pal et al., [3], only four may be found in the top 96 genes reported in Khan et al., [2]. This reveals a common disparity found when different methods are used to analyze the vast DNA microarray data sets. In addition, seven of our top 100 genes were absent from the analyses in the previous studies because they were subjectively filtered out of the data set before any analysis was completed. This, coupled with the pervasive nature of ...
To support the list of top contributing genes found by the DM, we next discuss these genes' functions. In determining our genes' contributions to the cancers being considered, we reviewed the current gene data using "Gene" [14]. The resulting gene functions generally fell into one of four different categories: specific links to cancer, protein synthesis, cell cycle regulation, and neurological association.
These four functional categories will be used to show the biological importance of the genes chosen by the DM. Some of the top 25 genes have generalized or specific relations to cancer. In addition to the genes expressly identified by their contributions to cancers, some additional genes have also been considered to have possible links to cancer. Of the top 25 genes, 16 have been considered to be associated with some type of cancer according to the Atlas of Genetics and Cytogenetics in Oncology and Haematology [13]. This means that over three-fifths of the genes found to be important by the DM have already been considered to have a possible association with the development of cancer. With the Human Genome Project completed, this number can be expected to increase as researchers uncover more of the mysteries involved with genes and gene function.
Some genes have been directly linked to the role they play in cancer. The top gene in our analysis, RPL36A, has a vital role in tumor cell proliferation. It encodes the protein L36a and has been shown to be overexpressed in hepatocellular carcinoma. The means by which RPL36A becomes overexpressed is unknown, but one particular result of its overexpression is applicable to our study: previous studies of the gene have linked its overexpression to acceleration of the cell cycle [5]. The cell cycle plays a vital role in all cancers and their proliferation, as cells must complete multiple cycles in the process of proliferating and metastasizing. The DM picked out this gene because it showed differential expression in the BL samples as opposed to the other samples. RPL36A may be overexpressed to a much greater or lesser degree in BL than in any of the other cancers. The ninth gene in our analysis, RPL6, is associated with gastric cancer resistance. The gene encodes the protein L6, and its high expression in tumors increases the growth rate of the cells. Counteracting the proliferation with downregulation may suppress the transition from the G1 phase to the S phase, thus impeding the cell cycle and proliferation. Also, RPL6 expression has ties to prognosis and can function as a biomarker for gastric cancer [12]. These findings may pave the way for more in-depth research regarding RPL36A and RPL6 involvement in small round blue cell tumors. Such research would greatly aid in showing the power and potential of the DM.
The largest portion of the top 25 genes was linked to protein synthesis. Proteins are the building blocks of all life and are also essential to cancer cells as they proliferate. Protein synthesis is a complicated process, and the relation of these genes to the process has several varied consequences when considering a link to cancer. For example, EEF2, number four in our list, involves a kinase that has been shown to possibly promote cancer cell survival, particularly in glioblastoma cells [6]. In order for cancer to proliferate, it must have a plethora of proteins for use in cell division and growth. Another of the top 25 genes, ACY1, has low expression levels in small-cell lung cancer and is a putative tumor suppressor. This is because, without the gene deacetylating amino acids, protein synthesis is limited. The suppressor role, however, does not hold for neuroblastoma according to Long et al., [9]. In neuroblastoma, ACY1 expression can be clearly seen, but it is still unclear whether the gene has further roles in other cancerous tumors.
mRNA is essential in the formation of these proteins, and whenever problems in mRNA transcription or translation occur, necessary proteins cannot be made. A number of the top 25 genes found by the DM are genes that have been found to inhibit transcription (HMGB1, BTF3) or translation (RPLP0, EEF1G) and to govern the metabolism and transport of the mRNA (HNRPC). HMGB1 in particular also hinders the effectiveness of most treatments for cancer cells, as stated by Tang et al., [11]. The gene increases tumor cell autophagy, which confers resistance to treatment. For tumor cells to undergo apoptosis and decrease autophagy, HMGB1 must be repressed. Further research regarding the relationship between HMGB1 and autophagy will be of use. These effects can vary greatly based upon the expression of the associated genes and may be key in understanding the significance of the genes found by the DM.
Another functional grouping that contains some of the top 25 genes involves mechanisms that govern the cell cycle. Two of the genes are involved in the control of microtubules. Microtubules are essential in mitosis during cell division, a key component in the cell cycle and in cell proliferation. STMN1 and Homo sapiens clone 24703 beta-tubulin mRNA, complete cds (unknown gene symbol), have both been linked to microtubule stabilization. Microtubules must be stabilized for a cell to enter into mitosis. STMN1 regulates this stability through depolymerization of microtubules during the interphase of a cell. High expression of STMN1 corresponds to microtubule destabilization, which, in the case of cancer, would restrict and retard cell proliferation. Interestingly, however, STMN1 overexpression supports tumor-cell invasion in hepatocellular carcinoma according to Hsieh et al., [10]. Only when STMN1 was silenced did the tumor-cell invasion become suppressed. The high expression destabilizes the cell microtubules, but there is polyploidy formation as well [10], which explains the apparent contradiction. Future research will be able to determine the effects of STMN1 expression on other types of cancer and provide more information about them.
An additional gene affecting the cell cycle is the telomerase reverse transcriptase gene, TERT, number eight on the list. TERT maintains telomere ends and is repressed in postnatal cells. This repression is applicable to the juvenile cancers being analyzed. Its expression is most important in the developmental stage of humans; thus it has an increased likelihood of being linked to the juvenile diseases being studied. In addition, a prior study has linked the overexpression of TERT to samples of BL [7]. Both STMN1 and TERT provide strong support for the DM, as it ranked them very highly in importance.
The fourth and final category for the top 25 genes includes genes of neurological importance. Genes that affect neurological pathways and functions are especially important due to the nature of the four cancers being studied. NB and EWS have important neurological effects, whereas BL does not [8]. TUBA1A, GNAL, and GABRB1 are all important to neurological cells and functions. While they have not been directly linked to these cancers at this time, their level of expression could play a significant role in the proliferation of NB and EWS tumors due to these cancers' neurological ties.

Additional differential analysis
Because the contents of the DNA microarray data set were known, we conducted additional differential analyses of BL against the samples of each other type of cancer and against the five non-related tissues. Although the order of the genes changed for each additional analysis, the top genes remained relatively constant. This provides support for the effectiveness of the PCA, the first part of the DM process. The PC found to have a signature of interest separated the BL samples from all of the rest. Differential analysis of these two groups yields the same basic results as if the BL group were analyzed against any one particular group of cancer. This allows for confidence in the DM when applying it to data sets of unknown nature, which would be the case in any diagnostic application.
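One simple way to quantify how constant the top genes remain across such analyses is the fraction of a reference list that reappears in each other list, ignoring order. The sketch below uses a hypothetical function name and placeholder gene labels; it does not reproduce results from this study.

```python
def shared_fraction(reference, other):
    """Fraction of genes in `reference` that also appear in `other`,
    ignoring their order."""
    ref = set(reference)
    return len(ref & set(other)) / len(ref)

# Placeholder top-4 lists for three comparisons; not real results.
top_lists = {
    "BL vs rest": ["g1", "g2", "g3", "g4"],
    "BL vs NB": ["g2", "g1", "g4", "g7"],
    "BL vs RMS": ["g1", "g3", "g2", "g4"],
}
reference = top_lists["BL vs rest"]
overlap = {name: shared_fraction(reference, genes)
           for name, genes in top_lists.items()}
print(overlap)  # {'BL vs rest': 1.0, 'BL vs NB': 0.75, 'BL vs RMS': 1.0}
```

A value near 1.0 for every comparison would correspond to the behavior reported above, where the order of the genes changes but the set of top genes is largely unchanged.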

Conclusions
This article extends the application of the powerful PCA-based differential method developed by Rollins and Teh [4]. It establishes a foundation for the possible use of the DM in analyzing DNA microarray data with the goal of disease diagnosis or discovery. The DM is able to handle all of the DNA microarray data without the need for any subjective filtering methods for dimensional reduction. This gives it an advantage over other methods, such as neural networks or clustering, that are applied to filtered data. The DM selects assay-specific signatures based upon loading and score plots, giving the user the capability to selectively determine the signature of most interest. It is unique in its use of rank-ordered differential gene contribution to establish a list of genes that contribute most to the chosen assay-specific signature.
This article presented the application of the DM to a DNA microarray including 83 samples from four types of juvenile cancers, NB, BL, RMS, and EWS, all forms of small round blue cell tumors. Five non-related tissues were also included in the microarray. PCA on the complete data set of 6,567 genes established a signature that completely separated BL samples from the rest. Application of the DM yielded a rank-ordered list of all 6,567 genes, listed in order of their importance to the signature. The top 25 genes from this list were closely analyzed to find as many links as possible to the cancers within the data. Top genes were found to fall into one of four categories: directly cancer-related, protein synthesis, cell cycle control, and neurological effects. Gene functions, as well as findings from various studies of the genes, yielded information that supports a general relation of the top genes to the juvenile cancers for which the DM found them to be important. These findings, coupled with further research regarding the nature of the top genes found by the DM, support the strength of the DM in differentiating between two different groups of assays. The DM lays the foundation for future applications in disease diagnosis and provides exciting possibilities for discovery of unknown genetic causes of several diseases.