
1 Department of Oncology and 2 Complex Systems Division, Department of Theoretical Physics, Lund University, Lund, Sweden; 3 Cancer Genetics Branch, National Human Genome Research Institute, NIH, Bethesda, MD; and 4 Department of Pathology, Helsingborg Hospital, Helsingborg, Sweden
Requests for Reprints: Paul S. Meltzer, Section of Molecular Genetics, Cancer Genetics Branch, National Human Genome Research Institute, NIH, MSC 8000, Room 5139, 50 South Drive, Bethesda, MD 20892-8000. Phone: (301) 594-5283; Fax: (301) 480-3281. E-mail: pmeltzer{at}nhgri.nih.gov
| Abstract |
|---|
status. To test whether the properties and specific values of conventional prognostic markers are encoded within tumor gene expression profiles, we have analyzed 48 well-characterized primary tumors from lymph node-negative breast cancer patients using 6728-element cDNA microarrays. In the present study, we used artificial neural networks trained with tumor gene expression data to predict the ER protein values on a continuous scale. Furthermore, we determined a gene expression profile-directed threshold for ER protein level to redefine the cutoff between ER-positive and ER-negative classes that may be more biologically relevant. With a similar approach, we studied the prediction of other prognostic parameters such as percentage cells in the S phase of the cell cycle (SPF), histological grade, DNA ploidy status, and progesterone receptor status. Interestingly, there was a consistent reciprocal relationship in expression levels of the genes important for both ER and SPF prediction. This and similar studies may be used to increase our understanding of the biology underlying these markers as well as to improve the currently available prognostic markers for breast cancer.
| Introduction |
|---|
Multiparametric methods such as microarray analysis, which rely on many pieces of information, seem ideally suited for grouping of tumor subtypes. Indeed, the microarray technique has successfully been used to classify breast cancer into different subgroups with clinical correlations (1–3), and expression profiles have been used to predict cancer types and disease recurrence of patients (4–7). In general, these studies use statistical methods to generate an output, which classifies a sample as a member of one group or another. Expression profiles have thus far not been used to provide a graded output corresponding to the continuum of biological properties exhibited by tumors.
Although prognostic markers for breast tumors are used to categorize tumors into two groups [e.g., estrogen receptor (ER) positive versus ER negative or high SPF versus low SPF], in reality, these subdivisions are defined by applying cutoff values to a continuous laboratory value. For example, the cutoff values used to subgroup tumors based on ER status are defined from clinical studies correlating ER values with response to endocrine treatment and are not based on measurements of the functional activity of the ER signal transduction pathway. In this study, we have investigated the possibility of predicting not only the binary ER status and SPF of a tumor but also the continuous values of ER protein and SPF from gene expression profiles. We have used cDNA microarrays and artificial neural networks (ANNs) to analyze the expression of 6728 genes in 48 well-characterized primary tumors representing a broad spectrum of ER protein expression and SPF values. From the results of these predictions, we have generated ranked lists of the genes most sensitive for the predictions and defined a cutoff for ER status based on gene expression. Furthermore, using a similar approach, we have studied the gene expression profiles associated with histological grade, DNA ploidy, and progesterone receptor (PgR) status in these tumors. Ours and similar studies may give us a better understanding of the underlying biological events in tumors that display these different clinical properties and may one day be used to augment presently used laboratory evaluation of breast cancer.
| Materials and Methods |
|---|
Microarray Analysis
cDNA microarray analysis was performed as described previously (4, 12) and according to standard protocols (http://research.nhgri.nih.gov/microarray/protocols.html). In short, 200 µg of BT-474 total RNA and 65–100 µg of tumor total RNA were used to produce labeled cDNA by anchored oligo(deoxythymidylate)-primed reverse transcription using SuperScript II reverse transcriptase (Invitrogen, Carlsbad, CA) in the presence of either Cy5-dUTP or Cy3-dUTP (Amersham Pharmacia, Piscataway, NJ), respectively. The arrays used were spotted with 6728 sequence-verified cDNA clones obtained from Research Genetics (Invitrogen). Fluorescence scanning and image analysis with DeArray software were performed as described previously (13, 14).
Data Analysis
For each gene, the expression intensity of the most intense channel (red or green) for each sample was averaged over all samples. All genes for which this average exceeded 300 fluorescence units (scale 0–65,535 units) were included in the analysis. In addition, we required, for all samples, that the red and green intensities both exceeded 20 fluorescence units and that the union (of the two channels) spot area exceeded 30 pixels. These requirements left us with different fractions of the original 6728 genes for the different classification problems, depending on the samples included in the analysis, which in turn was determined by the availability of measured clinical variables to be predicted (ER value and DNA ploidy: 48 samples leaving 3855 genes; PgR: 47 samples leaving 3880 genes; SPF: 45 samples leaving 3924 genes; histological grade: 35 samples leaving 4054 genes).
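The filtering criteria above can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the data layout (for each gene, one `(red, green, area)` tuple per sample) and the names `passes_filter` and `filter_genes` are assumptions made for the example; only the three thresholds come from the text.

```python
from statistics import mean

# Thresholds quoted in the text; the data layout below is an assumption.
MIN_MEAN_PEAK = 300   # mean over samples of the brighter channel
MIN_CHANNEL = 20      # both channels must exceed this in every sample
MIN_AREA = 30         # union spot area (pixels) in every sample


def passes_filter(spots):
    """spots: list of (red, green, area) tuples, one per sample, for one gene."""
    mean_peak = mean(max(red, green) for red, green, _ in spots)
    if mean_peak <= MIN_MEAN_PEAK:
        return False
    return all(
        red > MIN_CHANNEL and green > MIN_CHANNEL and area > MIN_AREA
        for red, green, area in spots
    )


def filter_genes(data):
    """data: dict mapping gene id -> list of per-sample spots."""
    return [gene for gene, spots in data.items() if passes_filter(spots)]


# Hypothetical toy data: three genes measured in two samples.
toy = {
    "geneA": [(500, 400, 45), (350, 600, 50)],   # passes all criteria
    "geneB": [(500, 10, 45), (350, 600, 50)],    # green <= 20 in one sample
    "geneC": [(100, 120, 45), (90, 150, 50)],    # mean peak intensity too low
}
kept = filter_genes(toy)
```

Because the per-sample criteria must hold for every sample in the analysis, dropping samples can only keep or enlarge the set of surviving genes, which is consistent with the gene counts reported for the smaller sample sets.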
The data analysis was an extension of what was used by Khan et al. (15) and Gruvberger et al. (4). In brief, principal component analysis projections of the gene expression data were used as inputs to ANNs, and a classifier consisting of a committee of networks was obtained using a 3-fold cross-validation scheme. An ANN sensitivity measure was used to determine the importance of individual genes for the classification. Three extensions to this procedure were introduced: (a) "cross-testing" for better statistics in the test results; (b) a systematic search for the best ANN design; and (c) application to regression problems.
Cross-Testing. The predictive power of a committee can be tested by applying the committee to blind tests. Khan et al. (15) and Gruvberger et al. (4) used fixed blind test sets. In the present study, this was extended, for better statistical significance, to a 7-fold "cross-testing" procedure analogous to a cross-validation scheme (see supplemental methods). Each ANN committee was thus based on 6 of 7 available samples. With the 3-fold cross-validation procedure, each ANN model was then trained on (2/3) * (6/7) = 4/7 of the available samples. With this cross-testing, we obtained as many test results as there were samples. The cross-testing was repeated five times. Thus, the blind test result for a sample was the average result of five different committees.
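The nesting of the two schemes, and the resulting 4/7 training fraction, can be illustrated with a short Python sketch. The helper names `k_fold` and `cross_testing_splits` are hypothetical, the sample count of 42 is chosen only because it divides evenly by 7 and 3, and the folds are taken in a fixed order; an actual run would presumably use random partitions, which the sketch omits.

```python
from fractions import Fraction


def k_fold(items, k):
    """Split items into k folds; yield (held_out, remaining) pairs."""
    folds = [items[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        remaining = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield held_out, remaining


def cross_testing_splits(samples, n_test_folds=7, n_val_folds=3):
    """Yield (test, validation, training) lists for every nested split."""
    for test, committee_pool in k_fold(samples, n_test_folds):
        # The committee never sees the blind test samples.
        for validation, training in k_fold(committee_pool, n_val_folds):
            yield test, validation, training


samples = list(range(42))  # hypothetical sample count divisible by 7 and 3
splits = list(cross_testing_splits(samples))
train_fraction = Fraction(len(splits[0][2]), len(samples))
```

With these numbers each model trains on 24 of 42 samples, so `train_fraction` equals `Fraction(4, 7)`, matching the (2/3) * (6/7) product in the text.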
ANN Architecture Selection. To obtain ANN committees with good predictive power, the ANN designs, architecture, and training parameters as described by Khan et al. (15) were selected to optimize the validation result [in terms of mean squared error (MSE)]. To avoid information leaks in the cross-testing scheme, every member of a predefined pool of different ANN designs was considered for each new blind test selection.
Regression Problems. Part of the analyses involved regression problems (i.e., prediction of continuous values such as ER protein expression levels rather than binary classifications). For regression problems, no logistic response function was applied in the ANN output layer, and the output was directly associated with the target value. As a measure of the performance, the MSE normalized with the variance (Var) of measured (target) values was used. With this normalization, comparisons between the regression problem performances can be made. If there is no useful information in the ANN inputs, MSE/Var = 1, while MSE/Var < 1 indicates a meaningful prediction. Furthermore, it is possible to evaluate the statistical significance of MSE/Var < 1 (for details, see Supplement).5
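As a small worked example of the MSE/Var measure, the sketch below assumes the population variance of the measured values is used for the normalization (the text does not say which variance estimator was applied), and the target values are made up. Under that assumption, a predictor that always outputs the mean of the targets gives exactly MSE/Var = 1.

```python
from statistics import mean, pvariance


def normalized_mse(targets, predictions):
    """Mean squared error divided by the variance of the measured targets."""
    mse = mean((t - p) ** 2 for t, p in zip(targets, predictions))
    # Assumption: population variance, so a constant mean predictor scores 1.
    return mse / pvariance(targets)


measured = [0.0, 10.0, 50.0, 200.0, 490.0]   # hypothetical target values
baseline = [mean(measured)] * len(measured)  # uninformative constant predictor
ratio_baseline = normalized_mse(measured, baseline)
ratio_perfect = normalized_mse(measured, measured)
```

Dividing by the variance removes the scale of the target variable, which is what allows performances on different regression problems (for example ER protein level and SPF) to be compared.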
Gene Lists. Based on the committee of trained networks, the genes were ranked using a sensitivity measure similar to that of Khan et al. (15), although with a few modifications. The new sensitivity definition for a gene was based on the partial derivatives of the ANN output layer arguments, with respect to the gene expression. For each sample, these derivatives were averaged over ANN models, and the absolute value of these committee averages was then averaged over samples to get the sensitivity. Motivations for this sensitivity are given in the supplement. The analysis steps above were then redone using only the 100 genes with highest sensitivity. Note that for each choice of test set, a different gene list was used. To better evaluate the statistical significance of a high sensitivity measure of a gene, a permutation test was performed to calculate the probability that a gene gets a larger sensitivity in a problem where target values are randomly permuted. This permutation analysis is further described in the supplemental methods.
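The order of operations in the sensitivity measure (average over models, then absolute value, then average over samples) and the permutation probability can be sketched as below. This is an illustration under stated assumptions: the partial derivatives are taken as given in a nested list `derivs[model][sample][gene]`, the function names are hypothetical, and the numbers are invented. How the derivatives are obtained from a trained network, and the details of the permutation analysis, are described only in the supplement and are not reproduced here.

```python
def sensitivity(derivs):
    """derivs[m][s][g]: d(output argument)/d(expression of gene g),
    for ANN model m evaluated at sample s. Returns one value per gene."""
    n_models = len(derivs)
    n_samples = len(derivs[0])
    n_genes = len(derivs[0][0])
    result = []
    for g in range(n_genes):
        total = 0.0
        for s in range(n_samples):
            # Average over the committee first, then take the absolute value.
            committee_mean = sum(derivs[m][s][g] for m in range(n_models)) / n_models
            total += abs(committee_mean)
        result.append(total / n_samples)
    return result


def permutation_p_value(observed, permuted):
    """Fraction of permuted-target runs whose sensitivity exceeds the observed one."""
    return sum(1 for value in permuted if value > observed) / len(permuted)


# Hypothetical derivatives: 2 models, 2 samples, 2 genes.
derivs = [
    [[1.0, 0.5], [-1.0, 0.5]],    # model 0
    [[1.0, -0.5], [-1.0, -0.5]],  # model 1
]
scores = sensitivity(derivs)
p = permutation_p_value(scores[0], [0.2, 1.5, 0.4, 0.9])
```

In the toy data the two models agree on the sign of the derivative for the first gene but disagree for the second, so the committee average cancels for the second gene and its sensitivity is zero. Taking the absolute value only after the committee average is what makes the measure reward genes on which the models agree.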
In principle, it is possible to combine the different gene lists to one single list, but it would be computationally very costly to generate gene lists in this way in a permutation test. Instead, the most frequently generated ANN design was chosen, and a committee of 600 nets trained on different subsets of all available samples was employed, using 3-fold cross-validation.
Molecularly Motivated ER Cutoff. We investigated the possibility of defining an ER protein concentration cutoff from gene expression profiles. Classification into ER positives and ER negatives, based on gene expression levels, was done for every possible partition (from having only samples with ER protein concentration = 0 as ER negatives to having only two samples with the largest available ER concentration, 490 fmol/mg protein, as ER positives), and the success of the classification was used as a measure of how well the partition corresponds to molecularly distinct classes. Fisher's linear discriminant (16) was used as a classifier in this analysis.
To distinguish the classification performance of different class partitions, a leave-one-out cross-validation was performed, and the area under the receiver operating characteristic (ROC) curve (17) was calculated based on the validation results. Different choices of the decision threshold correspond to different balances between the sensitivity and the specificity of the classification. Sweeping over all possible thresholds traces out the ROC curve in the (sensitivity, 1 − specificity) plane. The area under this curve (ROC area) is a convenient measure of the classification performance, with a greater area (closer to 1) signifying better performance. In Fisher's discriminant analysis, the samples are projected down to one dimension, and to compare validation results based on different projections, the scale of the one-dimensional projection result was fixed by setting the means of the ER-negative and ER-positive classes to −1 and 1, respectively.
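The two ingredients of this comparison, the ROC area and the rescaling of a one-dimensional projection, can be sketched in Python as follows. The sketch computes the ROC area through its equivalent rank formulation (the probability that a positive sample scores above a negative one, with ties counted as one half) rather than by sweeping thresholds explicitly. The function names and the scores are hypothetical, and the Fisher projection and the leave-one-out loop are not reproduced; in particular, the sketch rescales using the class means of the same scores it evaluates, which is a simplification.

```python
from statistics import mean


def roc_area(neg_scores, pos_scores):
    """Area under the ROC curve: probability that a positive sample scores
    higher than a negative one, counting ties as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def rescale(neg_scores, pos_scores):
    """Affinely map scores so the class means become -1 (neg) and +1 (pos)."""
    m_neg, m_pos = mean(neg_scores), mean(pos_scores)
    centre = (m_pos + m_neg) / 2
    half_gap = (m_pos - m_neg) / 2

    def scale(xs):
        return [(x - centre) / half_gap for x in xs]

    return scale(neg_scores), scale(pos_scores)


# Hypothetical one-dimensional discriminant outputs.
neg = [0.0, 1.0, 2.0]
pos = [1.5, 3.0, 4.5]
area = roc_area(neg, pos)
neg_scaled, pos_scaled = rescale(neg, pos)
```

For a single projection the rescaling leaves the ROC area unchanged, because it preserves the ordering of the scores. Its purpose is to put the outputs of different projections on a common scale, so that results from different leave-one-out rounds can be pooled before the area is computed.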
| Results |
|---|
| Discussion |
|---|
The ER status of a tumor is determined from its protein value and has long been used as a means to identify the group of patients that will benefit from endocrine therapy. However, the ER status based on protein expression does not give a direct verification of the functional activity in the ER signaling pathways. In previous studies of global gene expression of breast tumors, it has become evident that the ER status of tumors is associated with distinct gene expression profiles involving a large number of genes (4–6). However, these studies focused only on the binary ER status and did not examine the relationship of gene expression profiles to the continuous range of ER protein values. In this study, we successfully calculated the ER protein expression values from gene expression profiles, showing that gene expression data from tumors are sufficiently robust and informative not only to determine ER status but also to indicate the actual level of ER protein expression. Moreover, the strength of the ER profile is evident in that, even after removing the 1000 most important genes of the ER profile, we were still able to predict the ER protein values with good performance (MSE/Var = 0.69, P = 1.5 × 10⁻⁴; Fig. 2). The genes associated with ER protein expression value predictions overlap to a large extent with the genes associated with ER status prediction in this and other studies (2, 4–6). Conventionally, the threshold value used to assign ER status (positive or negative) has been determined empirically from response to endocrine treatment, and the cutoffs used differ between laboratories and clinics (18). Using the ER-associated gene expression profiles, we have determined a protein level cutoff for ER status. In this patient cohort, an appropriate cutoff for ER status based on the top 100 ER-associated genes (from the continuous value predictions) is in the range of 6.5–15 fmol/mg protein. Only a few tumors were within this range of protein expression values, which was therefore difficult to narrow. Still, this range of values is somewhat lower than the cutoff that was used at the hospitals of origin of the tumors at the time of diagnosis for these patients (25 fmol/mg protein). Because the number of samples in this study, especially in the critical range, is limited, this cutoff value may not be applicable to other patient cohorts. However, this approach appears sufficiently promising to warrant studies with larger numbers of tumors. Determining an ER status cutoff threshold based on the expression of a panel of genes associated with ER in breast tumors could possibly be a more accurate way of assigning their ER status than using merely the ER protein level.
The proliferative activity of a tumor can be estimated by flow cytometric analysis whereby information on DNA ploidy status and SPF is generated. We found that the performance for the prediction of SPF values based on gene expression profiles is good. It should be mentioned that because of the correlation between ER status and SPF in the patient cohort, the strong signal from genes associated with ER status contributes to some degree to the prediction of SPF. However, a low overlap of the top ranked genes between the S-phase- and ER-associated gene expression profiles (20%) indicates that although ER-associated genes do assist in the prediction of SPF, most genes important to SPF prediction are indeed associated more specifically with the S-phase profile. Interestingly, all of the 20 genes comprising the intersection of the top 100 SPF and top 100 ER value gene lists show an inverse relation in their expression in that the genes that are highly expressed in tumors with high S-phase fraction have a low expression in tumors with high ER values (Fig. 4). Additionally, 67 of the remaining 80 genes on the top 100 S-phase list that are found far down the ER value gene list also show an inverse correlation in expression level (Fig. 5 and Supplemental Tables 2 and 3).5 This striking inverse relationship shows that at the molecular level, the gene expression of many individual genes important to a high proliferation phenotype relates directly to a low expression of ER protein. Not surprisingly, several of the 80 genes that are strongly associated with SPF but not with ER have functions associated with cell proliferation. For example, the ubiquitin-conjugating enzyme E2C (rank 12) is highly expressed in tumors with a high S-phase fraction and is involved in the ubiquitin-dependent proteolysis of both cyclin A and cyclin B (19). The cell growth-inhibiting transcription factors AP-2β (rank 24; Ref. 20) and activating transcription factor 3 (rank 48; Ref. 21) both have a low expression in tumors with high S phase, as do the inhibin βA subunit of the inhibin complex (rank 15; Ref. 22) and insulin-like growth factor-1 (rank 23; Ref. 23), both of which are involved in modulation of cell growth and differentiation. Another group of genes associated with high S-phase fraction are genes that have previously been associated with tumor invasion, tumorigenesis, and transformation, including ADAM8 (a disintegrin and metalloproteinase domain 8; rank 29), a transmembrane protein identified to have metalloproteinase activity (24), and melanoma cell adhesion molecule (rank 11), which has been implicated in initiation and malignant progression in melanoma and prostate cancer (25). The fact that the gene transcobalamin I was more highly expressed in tumors with a low S phase suggests that a higher S-phase fraction in a tumor is also correlated with a lower degree of cellular differentiation. Transcobalamin I, a member of the vitamin B12 binding protein family also called R binders, has been demonstrated by immunohistochemistry to be expressed more often in the well-differentiated tumors in invasive ductal carcinomas of the breast (26).
The histological grade of a tumor is determined by microscopic evaluation of breast tumor paraffin sections. Our results suggest that histological grade can be predicted from gene expression profiles. Although there is an influence by ER status, owing to a correlation between low histological grade and ER positivity, the ER protein expression values alone predict the histological grades less accurately than the gene expression profiles. Very recently, gene expression profiles that distinguish high- and low-grade tumors have been observed (27), although that study did not address the prediction of histological grade.
The DNA ploidy status of a tumor can reveal whether some cells in the tumor have an abnormal amount of DNA in the nucleus. The prediction of ploidy status (diploid versus nondiploid) was not as good as for the other clinical parameters studied, indicating that the DNA ploidy status of a tumor is not strongly correlated to any specific gene expression profile. It is not surprising that it was difficult to find a unifying gene expression profile for all nondiploid tumors because their chromosomal gains and losses do not necessarily follow the same pattern; therefore, the effects of aneuploidy on gene expression are diverse. Possibly, better results could be obtained by grouping tumors according to comparative genomic hybridization profiles, which are determined by specific patterns of cytogenetic change. Indeed, several studies have reported correlations between comparative genomic hybridization profiles and gene expression data (28–30).
Our study sheds light on the molecular background behind the already established markers ER status and SPF. Using computer models, we were able to predict the continuous values of these clinically relevant markers, demonstrating that the biological basis of these markers is encoded and detectable within global gene expression patterns, even from within heterogeneous tumor samples. The method of predicting a tumor characteristic on a continuous scale may be a better approach than predicting binary classes in other microarray studies (e.g., prediction of time to disease recurrence instead of recurrence by a fixed end point). Additional studies and a reliable approach to generate expression data in a clinical setting are necessary before gene expression profiling can be used as a practical clinical tool. However, our study and others strongly suggest the approaching potential of gene expression profiling to aid treatment decision-making for the individual patient by refining prognostic categories and elucidating the molecular properties that affect outcome.
| Acknowledgments |
|---|
| Footnotes |
|---|
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Note: S. K. Gruvberger-Saal and P. Edén contributed equally.
5 Supplementary data for this article are available at MCT Online (http://mct.aacrjournals.org).
Received 6/24/03; revised 11/3/03; accepted 11/4/03.
| References |
|---|
zip. Transcriptional repression versus activation by alternatively spliced isoforms. J Biol Chem, 1994;269:15819–26.