|
|
||||||||
Published online before print July 15, 2003
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



* Max von Pettenkofer-Institut, Department Bacteriology, Munich, Germany;
Institute of Cancer Research, Section of Gene Function and Regulation, Chester Beatty Laboratories, London, United Kingdom; and
Department of Medical Informatics, Biometrics and Epidemiology; University of Munich, Germany
1 Correspondence: Max von Pettenkofer-Institut, Department Bacteriology, Pettenkoferstr. 9a, 80336 Munich, Germany. E-mail: r_hoffmann{at}m3401.mpk.med.uni-muenchen.de
| ABSTRACT |
|---|
|
|
|---|
Key Words: Microarray B/T cell lineage Principal Component Analysis
| INTRODUCTION |
|---|
|
|
|---|
(TCR-ß/
) chains in T cells. The recombination status of these loci can be used to temporarily order and define the cellular stages [1
, 2
]. Cells undergoing this differentiation process express a stage- and lineage-specific set of surface markers. The B cell classification scheme of Rolink et al. [3 ] uses four cell-surface markers to distinguish among the five stages of B cell development (B220, c-kit, CD25, and surface IgM (sIgM); Fig. 1A ]. In T cell development, the expression of four genes (CD25, CD44, CD4, and CD8) distinguishes eight consecutive cellular differentiation stages (Fig. 1B) [4 ]. Here, cellular stages are identified by the specific expression patterns of only a few genes. Although these stage assignments have been shown to perform well in a number of experiments, there might be additional, unknown marker genes. These might distinguish more precisely between the currently known stages, or they might help to define novel subpopulations functionally distinct from the currently known stages.
|
For identification of those genes that were able to separate the distinct cellular populations, we used supervised algorithms previously applied to the classification of leukemias [8 , 9 ], other cancers [10 , 11 ], and cancer cell lines, according to the susceptibility to chemotherapeutic agents [12 ]. Briefly, all genes are first ranked for their ability to distinguish between the classes. Next, different numbers of the top-ranking genes are assembled into models. Based on these models, the samples were classified into two categories using a weighted voting scheme, where every gene casts a vote for the sample to belong to one of the two classes. Next, the models were tested using a leave-one-out cross-validation approach. As a reference, an empirical distribution obtained by permuting the phenotype class labels was used. To avoid over-training, a final model was generated using the smallest number of genes capable of performing the desired classification with high confidence.
The accuracy of this classification process can be analyzed in two ways: First, the proportion of correctly predicted samples among all samples is recorded, resulting in a percentage of correctly classified samples (apparent classification accuracy). Second, by means of Principal Component Analysis (PCA), individual samples are represented on a two-dimensional plot according to the expression patterns of the genes used for classification. This graphical representation enables us to assess how well samples from different cell populations can be separated. We show that this strategy is able to identify novel lineage and stage-specific marker genes and that the currently used protein markers better identify molecularly distinct stages in B cell development than those for T cell development.
| MATERIALS AND METHODS |
|---|
|
|
|---|
For isolation of thymic T cell precursors, cell suspensions from 4-week-old B6 mice (10 mice per experiment) were divided in two portions. One portion was used for the isolation of DP and SP thymocytes and was directly stained with CD4 and CD8 mAb (Fig. 1B) . The second portion of the sample was used to isolate DN cells. First, CD4- and CD8-positive cells were removed by complement-mediated lysis. Subsequently, samples were stained with a cocktail of lineage-specific mAb (B220, NK11, CD3, CD4, CD8), antigen-presenting cell-conjugated together with CD25 (fluorescein isothiocyanate-conjugated) and CD44 (phycoerythrin-conjugated). Populations were sorted as shown in Figure 1B .
A cell purity of
98% was routinely achieved with flow cytometric profiles, expression patterns of genes known to be involved in lymphocyte development, and independent confirmation by polymerase chain reaction of some genes newly found to be differentially expressed can be reviewed in refs. [5
, 6
].
RNA was then subjected to two rounds of in vitro transcription-based RNA amplification as described earlier [5 , 13 , 14 ]. Affymetrix Mu11k subA and subB GeneChip® arrays were hybridized, washed, stained, and scanned according to the manufacturers specifications. Five to six independent, replicate experiments were performed per cell population examined. Scanned raw data images were processed with Affymetrix GeneChip v3.2 software, resulting in processed image (.cel) and numerical (.chp) files.
Data analysis
Arrays were normalized for differences in overall fluorescence using a piecewise running median line fitted onto an invariant set of synthesis features as a normalization curve [15
]. "Average difference" values were then calculated based on the normalized fluorescence levels as a measure of transcript abundance. The average differences were log-transformed, and class prediction was performed according to the method of Golub et al. [8
], as modified by Pomeroy et al. [16
].
Notably, the signal-to-noise ratio of gene x was defined as Sx = (µ1µ2)/(
1+
2), where µk and
k denote the mean expression and standard deviation of gene x in class k (k=1,2). According to a specified number of "informative" genes (e.g., 20), the best discriminating genes are selected. For each informative gene, a decision limit is calculated as bx = (µ1+µ2)/2. To classify a new sample, the gene expression levels of informative genes are taken, and for each gene x and sample y, a so-called vote is calculated as Vx = Sx (gxybx), where gxy denotes expression level of gene x in sample y. The votes of all informative genes are summed up ("weighted voting"), and depending on the sign of this sum, the new sample is classified as class 1 or class 2. The confidence in the prediction is calculated as |
Vx/
|Vx||.
To assess the significance of each gene, a permutation test is performed, which determines signal-to-noise ratios when class labels are permuted randomly.
To assess the robustness of the classifier, a leave-one-out cross-validation is performed. Apparent accuracy is the rate of correctly classified samples.
In addition to Golubs method [8 ], we apply a heuristic approach to select a minimal set of differential genes with high classification accuracy: For each informative gene, accuracy and confidence are calculated. The gene with highest accuracy and confidence is entered into the model. Next, we add each of the remaining informative genes, one-by-one, into the model and repeat the leave-one-out cross-validation. If the gene improves accuracy and confidence as measured in the leave-one-out cross-validation, it is added to the weighted voting model; otherwise, it is discarded. By this method, a subset of informative genes is selected, which is optimized in terms of accuracy and confidence.
For PCA, the program JExpress (www.molmine.com) was used.
| RESULTS |
|---|
|
|
|---|
Figure 2 shows the performance of models containing different numbers of genes in predicting whether the samples belong to the B or T cell lineage, respectively. Here, classification accuracy is defined as number of samples classified correctly, divided by the total number of samples. High-confidence values (maximum 1) indicate a homogeneous pattern of the different genes in the model.
|
Of the eight genes forming this most reliable model, three are highly expressed in T-lineage cells (Fig. 3A
). Two of these (TCR-ß chain, Fyn-binding protein/Src-like adaptor protein) are known to be expressed in T cells [17
, 18
]. The third probe set is an EST with no known homologues. The remaining five classifying probe sets are up-regulated in B-lineage cells and represent four distinct genes (Fig. 3B)
. The early B cell factor is a known B cell-specific transcription factor [19
], and the µ chain-associated protein 8HS20/VpreB3 complexes with Ig µ heavy chains during development [20
].
-Globin has not previously been shown to be expressed in lymphocytes or B cells in particular. However, a recent study shows expression of a human
-globin transgene in mouse BM but not thymus [21
]. Again, one of the newly identified markers is an EST without known homologues.
|
|
Classification of B-lineage samples according to developmental stages
We next evaluated how well our machine-learning algorithm would be able to classify individual B cell precursor cell stages according to the global RNA gene expression program. This was achieved by classifying samples from one particular stage versus all others in turn for every developmental stage. The relative expression of the marker is characterized by the signal-to-noise ratio Sx (Table 1
). All the Sx values of the novel marker genes are below the 0.1 or above the 99.9 percentile of those obtained after randomly permuting the dataset (data not shown), indicating that the association between RNA expression pattern and developmental stage is statistically significant.
|
Of seven probe sets contained in the pre-BI cell prediction model, three are underexpressed in pre-BI cells as compared with the other B cell precursor stages. Two of these are specific for Ig
sequences, and the third one is specific for an EST contained in one UniGene cluster with an Fc receptor family member [22
].
Of the four probe sets overexpressed in pre-BI cell as compared with the remainder of the B cell precursor stages, one is an EST contained in one UniGene cluster with and highly similar to the growth factor receptor-bound protein 7, an Src homology 2 domain containing protein involved in the human epidermal growth factor 2 signaling pathway in breast cancer [23 ]. The other three are known genes. Endoglin is a transforming growth factor-ß receptor [24 ]. Thy-1 is a well-known cell-surface glycoprotein thought to be specific for thymocytes and neurons with a possible role in signaling from the T cell antigen receptor [25 ]. Finding this gene as a pre-BI cell-specific marker indicates that Thy-1 has a previously unrecognized role specifically in early B cell development or that these fully committed precursors still show some degree of multilineage gene expression characteristic of uncommitted precursors.
The model for predicting large pre-BII cells contains only citron, a putative rho/rac effector that is otherwise poorly characterized [26 ]. Still, up-regulation of this gene correctly predicts 92% of the samples.
Three probe sets form the best model for predicting small pre-BII cells. One EST contained in one UniGene cluster with and very similar to the interferon (IFN)-related developmental regulator 2 is underexpresed in small pre-BII cells. The remaining two probe sets are specific for protamine, a sperm nuclear basic protein with a function similar to histones [27 ].
The model predicting immature B cells contains 13 genes; together, these correctly predict 96% of the samples. Seven probe sets are underexpressed in immature B cells. These contain the Ig surrogate light chain
5. This gene is expressed only in pre-BI and a small subpopulation of large pre-BII cells [28
]. It appears as a negative, immature B cell-specific marker, as low-level signals are uniformly detected in the small pre-BII cell samples, and one of five mature B cell samples shows a signal above background, which is likely to be an experimental outlier. Other negative markers include ESTs contained in UniGene clusters with the thymocyte B cell antigen, the kinesin-related protein KIFC5A, and the antimicrobial peptide hepcidin. One EST has no known homologues. Known genes that serve as negative markers include the prostaglandin receptor EP3 subtype and gelsolin, an actin filament severing and capping protein that is implicated in actin remodeling in growing and in apoptotic cells [29
]. The model contains the following up-regulated genes: acrogranin, a secreted factor modifying mouse preimplantation development; properdin, a molecule stabilizing the C3 convertase in the alternative, complement-activation pathway [30
], one antibody sequence, and three ESTs. All ESTs are contained in one UniGene cluster with a IFN-
-inducible lysosomal thiol reductase, which facilitates the processing and presentation to antigen-specific T cells of protein antigens containing disulfide bonds [31
].
Samples derived from mature B cells can be predicted with 100% accuracy using a model consisting of five probe sets. Two of them are underexpressed: one EST without known homologues and the ß-galactoside-binding protein (bgbp), an autocrine-negative growth regulator. Three probe sets overexpressed in mature B cells represent one single gene, apolipoprotein B.
PCA tested resolving the developmental stages (Fig. 4B) . A near-perfect separation of all stages is achieved, and 81.4% of the total variance is retained.
Classification of T-lineage samples according to developmental stages
Finally, a similar analysis was performed for samples derived from eight stages of T cell differentiation. Thirty-nine genes are contained in the cross-validated models (Table 2
). Again, all the signal-to-noise ratios differ significantly from the values obtained after random permutation of the data (data not shown); thus, the association between RNA gene expression and developmental stage is statistically significant.
|
Surprisingly, highly accurate models are also identified for the DN2, DN3, and DN4 cell populations, as we have shown earlier that these have very similar gene expression profiles [6 ]. For predicting DN2 cells, a combination of seven genes achieves a classification accuracy of 93%. Only two ESTs are contained in the model, both without homologous genes. The hyltk gene is a CSK homologous C-terminal Src kinase shown to be active in platelets and megacaryocytes [32 ]. T11 is a very late antigen-4 ligand serving as a cell adhesion molecule [33 ]. TRA1 regulates transmembrane transport of plasma membrane phospholipids [34 ], and ADSEVERIN has a gelsolin-like function [35 ]. Mast cell carboxypeptidase A is an extracellular protease.
Prediction of DN3 cells uses two EST probe sets without known homologues. It is interesting that one of them is already contained in the DN2 model, underlining necessity to use more than one marker gene for classification in some cases.
Models for predicting DN4 and DPL cells are based almost exclusively on ESTs, and only one gene is contained in either of the two models, respectively. Nevertheless, classification accuracy is very high, with 98% and 100% for the DN4- and DPL-specific models, respectively. The putative pheromone receptor VR10 serves as a negative marker for DN4 cells, and another S100 family protein, calgizzarin/S100C/endothelial monocyte-activating polypeptide I, constitutes a positive marker for DPL cells.
DPS cells are characterized by exit from the cell cycle after TCR-
ß selection, a process that involves down-regulation of many genes. Thus, nine of the 11 genes needed to achieve a classification accuracy of 98% are down-regulated. These include cell cycle-related genes, such as a D-type cyclin or the BRCA-1 breast- and ovarian cancer-susceptibility gene. Of the two positive markers, one is an EST, and the other is dynamin, a microtubule motor protein.
Three genes correctly predict 93% of the CD4 SP cell-derived samples. Again, the predictors come from a variety of functional classes. Vascular endothelial growth factor B is a growth factor for endothelial cells. Annexin IV regulates membrane behavior in a calcium-dependent manner. These two genes serve as negative markers, and the creatine kinase B gene is contained as a positive marker.
In contrast to these highly specific models, prediction of CD8 single-positive T cell appears significantly more difficult. The best model involves only one gene, asparagine synthetase. The classification accuracy achieved is 85%, compared with >90% for the other stage-specific models.
Using PCA, only 63.5% of the total variance is retained by the first two components, and only DPL and DPS cells separate clearly from the rest of the samples (Fig 4C) . Thus, the amount of variance retained in the first two principal components is not sufficient for a clear distinction between the different T cell precursor stages.
| DISCUSSION |
|---|
|
|
|---|
Before classification is performed, the algorithms used result in a list of genes that could serve as novel markers for a given developmental stage. As gene expression on the RNA level must not always be concordant with protein expression, the first question would be whether the RNA-based analysis performed here actually discovers genes indicative for the cellular stages as defined on the protein level. Two lines of evidence show that this is indeed the case. First, in the leave-one-out cross-validation, the samples are usually classified to the correct (protein-defined) stage when classified according to the markers identified on the RNA level. Second, analysis of randomly permuted datasets shows that the association between gene expression patterns and developmental stage is statistically highly significant.
The marker genes thus identified were assembled in lineage- or stage-predicting models. The number of genes contained in any model had significant influence on the quality of the prediction, as inclusion of more genes than required can lead to misclassification of individual samples (Fig. 2) . This effect is termed "overfitting". Therefore, the final models contain as few genes as possible, that is, between one and 13 genes. This is smaller than most of the models described in the literature that have been identified with similar algorithms [8 , 12 ]. Although overfitting is effectively controlled by small model sizes, this might render the model more sensitive toward experimental noise. Obviously, these two effects must be balanced against each other in choosing the optimal size of the model. This has been achieved in choosing models with maximum classification accuracy and confidence.
Surprisingly, in the distinction between B- and T-lineage cells, none of the well-established marker genes such as CD19, CD20, CD3 appears in the optimal or another highly reliable model. As these markers are often associated with antigen-receptor complexes [1 , 2 ], they are developmentally regulated and show substantial variation in expression throughout the different precursor stages. Markers differentiating B- from T-lineage cells should, however, be uniformly expressed in all stages of one of the two lineages, as shown for the novel marker genes identified here (Fig. 3A and 3B) . Although most of these genes have previously been described as expressed in B- or T-lineage cells, it has not been evident that they could, in combination, perfectly predict to which of the two lineages investigated a cell belongs.
In B cell development, models containing between one and 13 genes reach classification accuracies between 92% and 100%. Notably, these models contain genes that are up-regulated only in the stage under investigation as well as genes that are specifically not expressed in that stage. It will be interesting to determine whether these genes have some functional relevance, i.e., whether a block in development can be induced by enforced expression of these genes in the respective stage.
Many genes that have not previously been associated with lymphocyte development, including ESTs, are detected as novel marker genes. ApoE, for example, although known to have immune-regulating function [36 ], has never been assessed in B cell development but appears repeatedly in the mature B cell-specific model.
Classification of T cell developmental stages on the RNA level is more difficult than in B cell development. First, only one of eight populations examined can be classified with 100% accuracy, compared with three out of five stages of B cell development. Second, when analyzed by PCA, substantially less of the total variation is retained in the first two principal components, and the stages are not separated well on the corresponding plot.
One possible explanation is that the DN1 cell population is included in this analysis. These cells have multilineage potential [37 , 38 ] and express many genes that are down-regulated during T cell differentiation and re-expressed in mature T cells [6 ]. However, omitting the DN1 cell population from the analysis does not result in an improved accuracy of the classification (data not shown).
A second explanation is that the RNA gene expression patterns are more similar in the different stages of T cell development than they are in the different stages of B cell development. This indicates that separation of T cell precursors by expression of CD25, CD44, CD4, and CD8 incompletely distinguishes distinct entities on the RNA level. It might well be possible that the genes identified here as significantly associated with the T cell developmental stages could help to better define those precursors.
Received February 28, 2003; revised April 25, 2003; accepted May 27, 2003.
| REFERENCES |
|---|
|
|
|---|
This article has been cited by other articles:
![]() |
M. E. Hystad, J. H. Myklebust, T. H. Bo, E. A. Sivertsen, E. Rian, L. Forfang, E. Munthe, A. Rosenwald, M. Chiorazzi, I. Jonassen, et al. Characterization of Early Stages of Human B Cell Development by Gene Expression Profiling J. Immunol., September 15, 2007; 179(6): 3662 - 3671. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |