Published online before print November 5, 2004


* Co-operative Research Centre for Diagnostics and
Child Health Research Institute, Women's and Children's Hospital, Adelaide, South Australia, Australia; and
Department of Biochemistry, La Trobe University, Bundoora, Victoria, Australia
1 Correspondence: Child Health Research Institute, 72 King William Road, North Adelaide 5006, South Australia, Australia. E-mail: ian.nicholson{at}adelaide.edu.au
400 such proteins have been characterized; however, analysis of the completed human genome sequence indicates that it may contain several thousand as-yet unidentified molecules, which may be expressed on the leukocyte cell surface. Recent advances in protein isolation and analysis using mass spectrometry illustrate that it is now feasible to identify the protein composition of a complex sample such as a plasma membrane extract. Such an approach may be useful for the identification of the cell-surface proteins that have not been identified using mAb techniques. Here, we detail the results of an in silico evaluation of the peptides isolated using two methods used to label plasma membrane proteins to determine whether these methods are suitable for the identification of known leukocyte cell-surface proteins by mass spectrometry. The labeling of cell-surface proteins before isolation and characterization is a valuable means of differentiating between plasma membrane and internal membrane proteins. The results indicate that although the majority of cell-surface proteins can be identified using either of the approaches, others known to be important diagnostically and/or therapeutically would not be identified using either approach. The implication of this for the use of these techniques in the discovery of new leukocyte cell-surface proteins is discussed.
Key Words: proteomics; protein identification; MASCOT search; cell-surface biotinylation; glycoprotein identification
The CD molecules represent only a small fraction of the total of plasma membrane proteins expressed by leukocytes. The preparations for the eighth HLDA workshop identified 200 "potential CD molecules" (see http://www.hlda8.org/PotentialCDs.htm), which have been characterized using mAb and/or gene analysis techniques [1, 5]. Using recent experimental data, it is possible to derive an estimate of the number of cell-surface proteins expressed by leukocytes [5]. The human genome is predicted to encode 23,758 genes and can express 34,091 transcripts (Ensembl build v21.34d.1, May 10, 2004; http://www.ensembl.org/Homo_sapiens/whatsnew/index.html). Expressed sequence tag data indicate that individual leukocyte populations express between 6000 and 17,000 genes, depending on their activation state [6]. Various studies predict that between 15% and 20% of the protein sequences in the human genome have one or more transmembrane helices [7]. These results imply that leukocytes may express several thousand membrane proteins, most of which remain to be identified and characterized. This estimate does not include some types of cell-surface proteins, such as glycosylphosphatidylinositol-anchored proteins, which may not have transmembrane domains. The failure of current approaches to generate mAb reagents against these proteins may be a result of a variety of reasons, such as low copy number and lack of immunogenicity as a result of high sequence identity with the equivalent protein in the species used to make antibodies. The identification of these molecules and the generation of suitable high-affinity binding reagents would facilitate the evaluation of these proteins as diagnostic or therapeutic targets.
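The estimate above can be reproduced as simple arithmetic. The following is a minimal sketch, assuming one protein per expressed gene and assuming that the genome-wide transmembrane fraction also applies to the genes expressed in a leukocyte population; neither assumption is stated in the text, and the function name is illustrative only.

```python
# Back-of-envelope bounds on membrane proteins per leukocyte population,
# using only the figures quoted in the text.
GENES_EXPRESSED = (6000, 17000)  # genes expressed per population [6]
TM_FRACTION = (0.15, 0.20)       # fraction with >= 1 transmembrane helix [7]


def membrane_protein_range(genes=GENES_EXPRESSED, fraction=TM_FRACTION):
    """Return (low, high) estimates; assumes one protein per gene."""
    low = round(genes[0] * fraction[0])
    high = round(genes[1] * fraction[1])
    return low, high


low, high = membrane_protein_range()
print(f"Estimated membrane proteins per population: {low} to {high}")
```

As noted in the text, proteins without transmembrane domains (for example, glycosylphosphatidylinositol-anchored proteins) fall outside this estimate.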
Mass spectrometry (MS)-based sequencing of peptide fragments is increasingly being used to define the protein composition of a complex mixture and has been used to characterize protein expression by different populations of leukocytes. Several publications describing the partial analysis of leukocyte whole-cell protein extracts using 1-dimensional or 2-dimensional electrophoresis followed by MS analysis [8, 9, 10] suggest that this approach is relatively inefficient in identifying plasma membrane proteins. Plasma membrane proteins have been identified using electrophoresis followed by MS in the analysis of detergent-resistant lipid raft domains of the membrane; however, these tend to be cytoskeletal proteins (e.g., flotillin, α-actinin, and moesin) and signaling molecules rather than integral membrane proteins [10, 11, 12, 13]. Although the analysis of membrane lipid rafts does focus the analysis on the plasma membrane, only a fraction of the plasma membrane proteins is found in lipid rafts, the raft composition changes depending on the activation state of the cells, and relatively few of the known CD molecules have been identified in such preparations.
Two factors that contribute to the difficulty in analyzing membrane proteins by approaches based on electrophoresis followed by MS are the highly hydrophobic nature of many integral membrane proteins and the presence of post-translational modifications such as N-linked glycosylation. The partial fractionation of complex mixtures of proteins using specific labeling of proteins followed by affinity chromatography can be used in the MS-based identification of leukocyte plasma membrane proteins that have not been identified using mAb-based approaches. The simplest labeling procedure is the biotinylation of cell-surface proteins using a water-soluble reagent such as sulfo-NHS-LC-biotin [14]. Although this approach proved useful to isolate CD molecules, over 40% of the proteins identified by MS-based sequencing were nonplasma membrane proteins. Cell-surface proteins can be labeled in preference to intracellular proteins by incubating intact cells using lipid-insoluble, maleimide-based, thio-reactive reagents [15, 16] or through N-linked carbohydrates using hydrazide-based reagents [17]. Alternatively, cell-surface glycoproteins can be purified using lectin affinity chromatography and then tagged with the required isotope during deglycosylation [18]. An analysis of protein expression using (thiol-based) isotope-coded affinity tag reagents identified many CD molecules in a microsomal extract of HL-60 cells [16]. This analysis also identified a number of "hypothetical" proteins (proteins predicted from open reading frames of the human genome sequence), some of which may be cell-surface molecules.
Although these methods can lead to the discovery of previously unidentified plasma membrane proteins, it is not clear what proportion of proteins may be missed by any of these approaches. To gain some insight into the value of methods based on protein labeling through cysteines or glycosylation sites, we analyzed all 261 protein CD molecules to determine the likelihood that these known cell-surface proteins would be identified using these methods. This provides an indication of the proportion of typical membrane proteins that may be amenable to identification by MS following isolation using the approaches described. We find that a significant number of known CD molecules may not be identified by these methods, including several molecules that are currently important diagnostically and/or therapeutically. This suggests that a single approach will not be sufficient to generate a complete catalog of the proteins expressed on the surface of a cell and also indicates that additional alternatives to the isolation of plasma membrane proteins will need to be developed.
In silico trypsin digestion
Peptide fragments generated by trypsin digestion were predicted using the "PeptideMass" program [19], accessed through the ExPASy proteomics server (http://www.expasy.org/tools/peptide-mass.html). The SwissProt protein identifier or the TrEMBL accession number was used to identify the protein or the amino acid sequence transferred from the NCBI record. Trypsin was selected as the enzyme, and the option to include "all known post-translational modifications" was selected. The other parameters were as follows: Cysteines were treated with "nothing"; peptide masses were isotopic and uncharged; zero missed cleavages were allowed; and peptides of a mass greater than 500 Da were displayed. These parameters will predict peptides that are readily characterizable using the currently available MS equipment.
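The digestion step can be illustrated with a short script. This is a minimal sketch and not the PeptideMass implementation: it assumes the common trypsin rule (cleavage C-terminal to Lys or Arg, but not when the next residue is Pro), zero missed cleavages, unmodified cysteines, and standard monoisotopic residue masses. The function names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide (free N- and C-termini)


def trypsin_digest(sequence):
    """Cleave after K/R unless followed by P; no missed cleavages."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        at_end = i + 1 == len(sequence)
        if aa in "KR" and (at_end or sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal remainder
    return peptides


def peptide_mass(peptide):
    """Uncharged monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def reportable_peptides(sequence, min_mass=500.0):
    """Peptides above the display threshold, with their masses."""
    return [(p, round(peptide_mass(p), 2))
            for p in trypsin_digest(sequence)
            if peptide_mass(p) > min_mass]
```

For example, `peptide_mass("QLLFNK")` for the CD47 fragment discussed in Figure 3 gives a value close to the 762 Da quoted there.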
N-linked glycosylation sites
For proteins with a SwissProt or TrEMBL identifier, the selection of all known post-translational modifications in the PeptideMass program was used to identify the peptide fragments that contained N-linked glycosylation sites. For the other proteins or where no glycosylation sites were indicated in the SwissProt annotation, potential N-linked glycosylation sites were identified using the "NetNGlyc Server" (http://www.cbs.dtu.dk/services/NetNGlyc/), with the output set to "Predict only on Asn-Xaa-Ser/Thr sequons." O-linked glycosylation sites were not examined, as a consensus motif has not been identified [20], making the prediction of which peptides carry O-linked glycosylation too complicated for a rapid analysis.
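Candidate sequons of the Asn-Xaa-Ser/Thr form can be located with a simple pattern match. The sketch below only finds candidate sites; it does not reproduce the NetNGlyc prediction, which scores each site. It also assumes the usual convention that Xaa may be any residue except proline, which the text does not state explicitly.

```python
import re

# Asn followed by any residue except Pro, then Ser or Thr. The lookahead
# keeps overlapping sequons (e.g. "NNSS") from being skipped.
SEQUON = re.compile(r"N(?=[^P][ST])")


def sequon_positions(sequence):
    """Return 1-based positions of Asn residues in N-X-S/T sequons."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]


# CD98-derived peptide used as the worked example in the Methods.
print(sequon_positions("SLVTQLNATGNR"))
```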
MASCOT searching
For each protein, the peptides that contained a single cysteine residue were identified. To determine whether a peptide could identify the original protein, the peptide mass and part or all of the sequence were analyzed using the MASCOT sequence query form [21] on the Matrix Science site (http://www.matrix-science.com). For peptide fragments of fewer than six residues, the entire sequence was submitted; for longer fragments, between five and seven residues were submitted. For example, the sequence tag "1436 seq(y[il][dn]atg)" (where "[il]" and "[dn]" indicate that either amino acid could be present at that position) was submitted as the N -> C sequence (the default "b-ion" setting) to search with the CD98-derived glycosylated peptide SLVTQL<N>ATGNR (amino acids 316–329 of CD98, where the <N> residue is predicted to be glycosylated). The following parameters were used in the form: NCBInr database; Homo sapiens taxonomy; trypsin as the enzyme; no missed cleavages allowed; no modifications selected; peptide tolerance of ±2 Da; MS/MS tolerance of ±0.8 Da; the peptide charge as "molecular weight" (no peptide charge); and the masses as monoisotopic. The score of the top match was recorded, and if the top match was not the source protein, then the position of the source protein in the list was recorded. If the top match was not the correct protein, but the score for the correct protein was close to the score of the top match, then the sequences were compared by a pair-wise BLAST alignment to determine whether they were independent submissions to the database of a single sequence or whether the entries were distinct sequences. If they were independent submissions of a single sequence, then the peptide was included as having identified the correct protein.
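The sequence-tag string quoted above has a regular form, a mass followed by `seq(...)`, that can be assembled programmatically. The helper below is a hypothetical illustration that formats the string only; it does not contact the MASCOT server. It assumes that every Ile/Leu and every Asp/Asn position is written as an ambiguity group, whereas the example in the text shows only one of each.

```python
# Residues written as ambiguity groups in the query string (assumption:
# applied to every occurrence, not only to the glycosylated Asn).
AMBIGUOUS = {"I": "[il]", "L": "[il]", "D": "[dn]", "N": "[dn]"}


def sequence_tag(mass, residues):
    """Format a query of the form '<mass> seq(<residues>)'."""
    body = "".join(AMBIGUOUS.get(aa, aa.lower()) for aa in residues.upper())
    return f"{mass} seq({body})"


print(sequence_tag(1436, "YLNATG"))
```

With these assumptions, the call above yields the same string as the example tag given in the text.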
Data compilation
The following data were collected for each protein analyzed: the predicted polypeptide mass; the total number of peptides with a mass between 500 and 2000 Da; the number of peptides in this size range with a single cysteine; and the number of peptides in this size range with an N-linked glycosylation site.
Protein identification using cysteine-containing peptides
Protein identifications were rejected if the following set of criteria were not met for two or more peptides predicted from trypsin digestion: The peptides were of a mass between 500 and 2000 Da; a single cysteine residue was present in each peptide; there were no predicted N-glycosylation sites; and the peptides identified the original protein using the (MASCOT) search engine.
Protein identification using glycosylated peptides
Protein identifications were rejected if the following set of criteria were not met for two or more peptides predicted from trypsin digestion: The peptides were of a mass between 500 and 2000 Da; the peptide included a predicted N-linked glycosylation site; and the peptides identified the original protein using the (MASCOT) search engine.
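The two sets of acceptance criteria can be expressed as simple predicates over a list of predicted peptides. This is a minimal sketch; the `Peptide` record and its field names are hypothetical, and the outcome of the MASCOT search is supplied as a precomputed flag rather than queried.

```python
from dataclasses import dataclass


@dataclass
class Peptide:
    sequence: str
    mass: float        # predicted mass in Da
    has_sequon: bool   # predicted N-linked glycosylation site present
    mascot_hit: bool   # search returned the source protein


def in_range(p, lo=500.0, hi=2000.0):
    return lo <= p.mass <= hi


def cysteine_ok(p):
    """Size range, exactly one Cys, no N-glycosylation site, search hit."""
    return (in_range(p) and p.sequence.count("C") == 1
            and not p.has_sequon and p.mascot_hit)


def glyco_ok(p):
    """Size range, predicted N-glycosylation site, search hit."""
    return in_range(p) and p.has_sequon and p.mascot_hit


def identified(peptides, rule, minimum=2):
    """A protein is accepted when at least `minimum` peptides pass `rule`."""
    return sum(1 for p in peptides if rule(p)) >= minimum
```

A protein would then be counted as identified by the cysteine approach when `identified(peptides, cysteine_ok)` is true, and by the glycosylation approach when `identified(peptides, glyco_ok)` is true.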
There was a linear relationship between the predicted size of the protein and the number of peptides between 500 and 2000 Da, which would be generated by trypsin digestion (R² = 0.87; Fig. 1A). In contrast, the relationships between the total number of peptides in the size range and the number of peptides with a single cysteine residue (R² = 0.63; Fig. 1B) or the number of peptides with N-linked glycosylation (R² = 0.55; Fig. 1C) were not strong.
Figure 1. Correlations of peptides in a size range 500–2000 Da predicted to be generated by trypsin digestion of the sequences of 261 CD molecules with other characteristics of the proteins. (A) Plot of the mass of the protein sequence against the number of peptides generated. (B) Plot of the number of peptides generated against the number that has a single cysteine residue present. (C) Plot of the number of peptides generated against the number that has a single predicted N-linked glycosylation site. R² values given in the main text were obtained by plotting a linear regression line to pass through the intersection of the axes (0,0).
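A regression line constrained to pass through (0,0) has a closed-form slope. The sketch below shows one way such a fit could be computed. The text does not say which definition of the coefficient of determination was used; the uncentred form shown here is one common convention for no-intercept fits and is an assumption.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope for y = b*x, with an uncentred R^2."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = sxy / sxx
    ss_res = sum((y - slope * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum(y * y for y in ys)  # uncentred total sum of squares
    return slope, 1.0 - ss_res / ss_tot
```

Different software packages report different R² values for no-intercept models, so values computed this way may not match those quoted in the text.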
A protein identification approach using two peptides containing a single cysteine residue was not able to identify 88 of the CD molecules (33.7% of the sequences analyzed), as listed in Table 1. An identification approach using glycosylated peptides was not able to identify 130 CD molecules (49.8% of the sequences analyzed), as listed in Table 2. The reasons why particular CD molecules could not be identified using either method are summarized in Figure 2. In most cases, the sequences did not generate two or more suitable cysteine-containing or glycosylated peptides.
Table 1. List of 88 CD Molecules That Could Not Be Identified by a Sequence Database Search Using Two or More Peptides Containing a Single Cysteine Residue
Table 2. List of 130 CD Molecules That Could Not Be Identified by a Sequence Database Search Using Two or More Glycosylated Peptides
Figure 2. Reasons for which CD molecules could not be identified using cysteine-containing peptides (A) or glycosylated peptides (B). The identities of the CD molecules in each category are listed in Tables 1 and 2. "No Glycosylation" indicates the number of CD molecules that were not glycosylated. "No Peptides" indicates that no peptides in the size range contained a cysteine residue or a glycosylation site. "Single peptide" indicates that only one peptide in the size range contained a cysteine residue. "One suitable peptide" indicates that only one peptide with a cysteine residue or a glycosylation site was present. "Search failed" indicates that the MASCOT search did not return the correct sequence.
Figure 3. Sequences of peptide fragments resulting from trypsin digestion of the CD47 protein. The first 60 amino acids of the CD47 protein sequence are shown. The signal sequence as annotated in the SwissProt entry "CD47_HUMAN" is shown by underlining. The sequence is predicted to be glycosylated at amino acid 23, as indicated with an "N" above the sequence line. The peptide sequence 19..24 is the fragment predicted by trypsin digestion following cleavage of the signal peptide, as would occur when the protein is expressed. The peptide 1..24 is the trypsin digestion fragment predicted if the signal peptide is not removed. The peptide 19..24 (QLLFNK, mass 762 Da) does not return the CD47 sequence in a MASCOT search, as the NCBInr database used does not include a sequence entry without the signal peptide, and the same peptide fragment sequence (QLLFNK) with a mass of 2506 Da returns the CD47 sequence.
Table 3. List of 103 CD Molecules That Could Be Identified Using Cysteine-Containing Peptides and Glycosylated Peptides
Table 4. List of 60 CD Molecules That Could Not Be Identified by Searching for Cysteine-Containing Peptides or Glycosylated Peptides
Table 5. Summary of the Numbers of CD Molecules That Could or Could Not Be Identified Using Each Approach
Figure 4. Alignment of the amino acid sequences of the CD32 family members, showing the sequences of the shared peptides that would be generated by trypsin digestion. Neighboring peptides are distinguished by a change from uppercase text to lowercase. The number below the peptide indicates the predicted mass of the fragment.
Table 6. Matches Returned by MASCOT Search of Peptide Fragments from the CD169 Protein Sequence
Our results demonstrate that the isolation of peptides through N-linked glycosylation or thiol bonds can be used in conjunction with MS analysis to identify most (77%) of the known CD molecules (201 of 261 analyzed). Given the large size of the dataset and that it is representative of the diversity of plasma membrane proteins, this would suggest that these same approaches would be valuable for the detection and characterization of other cell-surface proteins that have not yet been identified. It is also evident that no single approach can be used to identify all the proteins in a particular sample and that other methods of isolating proteins and/or peptides will be needed to ensure that a complete coverage of the cell-surface proteins is obtained.
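The combined figure of 201 of 261 can be cross-checked against the counts given in the Results and in the captions of Tables 1 to 4 by inclusion-exclusion. The sketch below uses only those published counts; the function name is illustrative.

```python
TOTAL = 261            # CD molecules analyzed
MISSED_BY_CYS = 88     # Table 1
MISSED_BY_GLYCO = 130  # Table 2
FOUND_BY_BOTH = 103    # Table 3
MISSED_BY_BOTH = 60    # Table 4


def found_by_either(total, missed_a, missed_b, found_both):
    """|A or B| = |A| + |B| - |A and B|."""
    return (total - missed_a) + (total - missed_b) - found_both


either = found_by_either(TOTAL, MISSED_BY_CYS, MISSED_BY_GLYCO, FOUND_BY_BOTH)
print(either, TOTAL - either, round(100 * either / TOTAL))  # 201 60 77
```

The counts are mutually consistent: 173 molecules identified through cysteine-containing peptides plus 131 through glycosylated peptides, less the 103 found by both, gives 201, leaving the 60 listed in Table 4.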
The analysis described here required that two peptides identify the target protein in the database source. The identification of proteins using peptide mass tags is a probability-based process [21, 23], which can be affected by a variety of factors including the quality of the MS spectra, the mass accuracy of the peptide mass determinations, and the quality of the sequence database. Although it is possible to identify a protein using a single peptide sequence tag [23], mixtures of peptides can often result in poor-quality MS/MS spectra and the failure to identify a correct parent protein [18, 24]. Thus, the need for two peptide sequence tags to contribute to the identification of the parent protein increases the confidence of the match [16]. In the approaches examined here, further confidence in the assignments can be gained by the requirement that the peptide sequence contains a cysteine residue or a motif for N-linked glycosylation. In the present study, an "in silico" analysis of database searches using peptide sequence tags means that factors such as MS spectra quality and mass accuracy could be ignored, and this allowed us to address whether a "perfect" peptide sequence tag is capable of identifying the parent protein when it is known to be in the database.
There are a number of factors that affected the identification of CD molecules using the approaches described and that would also impact on the use of these approaches for the identification of new cell-surface molecules. For example, the CD20 molecule, which is commonly used to identify B lymphocytes, has no N-linked glycosylation sites and generated only one peptide with a cysteine residue in the specified size range. A similar situation was observed with the CD8 α- and β-chains. The detection of these proteins would require the use of alternative methods. In other cases, such as the low-affinity immunoglobulin G FcR-2 proteins (CD32a, CD32b, and CD32c), closely related but functionally distinct proteins could not be distinguished, as they generated a number of identical peptides.
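The problem of closely related proteins can be made concrete by comparing tryptic peptide sets. The sketch below uses short, invented toy sequences (not the CD32 sequences) and assumes the common trypsin rule of cleavage after Lys or Arg except before Pro. Peptides shared by all family members cannot discriminate between them; only peptides unique to one member can.

```python
def trypsin_digest(sequence):
    """Cleave after K/R unless followed by P; no missed cleavages."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        at_end = i + 1 == len(sequence)
        if aa in "KR" and (at_end or sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def shared_and_unique(sequences):
    """Split tryptic peptides into those shared by all entries and
    those unique to each entry. Expects at least two sequences."""
    digests = {name: set(trypsin_digest(s)) for name, s in sequences.items()}
    shared = set.intersection(*digests.values())
    unique = {}
    for name, peps in digests.items():
        others = set().union(*(d for n, d in digests.items() if n != name))
        unique[name] = peps - others
    return shared, unique


# Hypothetical toy family differing only in the C-terminal peptide.
toy = {"memberA": "AAAKCCCRDDDK", "memberB": "AAAKCCCREEEK"}
shared, unique = shared_and_unique(toy)
```

In this toy case, two of the three peptides from each sequence are shared, so a search based on either shared peptide would return both family members.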
The identification of certain proteins may be affected by the quality of the protein database used to match the peptide sequences. This may be a result of duplication of sequences or errors in the annotation of sequences. The database searches using peptides derived from the CD169 sequence returned a number of matches corresponding to clones from cDNA libraries as well as the prototype CD169 sequence. Further analysis of the other matches suggested that they represented fragments of the CD169 sequence rather than full-length sequences of closely related proteins. The presence of sequence fragments in a database may complicate the identification of novel proteins using these approaches. The completion of the human genome sequence and the determination of the final number of 23,758 protein-coding genes and 34,091 transcripts (Ensembl build v21.34d.1, May 10, 2004; http://www.ensembl.org/Homo_sapiens/whatsnew/index.html) mean that a reference sequence database, which does not contain multiple copies of sequences or fragments of sequences, can now be assembled and curated.
In conclusion, the application of the protein isolation and identification strategies discussed in this paper will allow a significant proportion of the unknown cell-surface proteins to be identified and evaluated as markers of cell populations. It is apparent, however, that additional techniques will need to be applied to complete the analysis of the protein composition of the leukocyte cell surface.
Received August 2, 2004; revised October 4, 2004; accepted October 5, 2004.
RII) J. Biol. Regul. Homeost. Agents 14, 311-316