Completed Projects


  • N. Gupta, J. Benhamida, V. Bhargava, D. Goodman, E. Kain, I. Kerman, N. Nguyen, N. Ollikainen, J. Rodriguez, J. Wang, M.S. Lipton, M. Romine, A. Osterman, V. Bafna, R.D. Smith and P.A. Pevzner. Comparative Proteogenomics: Combining Mass Spectrometry and Comparative Genomics to Analyze Multiple Genomes. Genome Research (in press), 2008.

    Mass spectrometry recently emerged as a valuable technique for proteogenomic annotations that improve on the state of the art in predicting genes and other features. However, previous proteogenomic approaches were limited to a single genome and did not take advantage of analyzing mass spectrometry data from multiple genomes at once. We show that such a comparative proteogenomics approach (similar to comparative genomics approaches) allows one to address problems that remained beyond the reach of the traditional “single proteome” approach in mass spectrometry. In particular, we show how comparative proteogenomics addresses the notoriously difficult problem of “one-hit-wonders” in proteomics and improves on the existing gene prediction tools in genomics. (See our earlier paper on proteogenomics: Gupta et al, Genome Res, 2007).

  • J. Rodriguez, N. Gupta, P.A. Pevzner. Does trypsin cut before Proline? Journal of Proteome Research. 7:300-5, 2008.

    Trypsin, which has high substrate specificity, is the most commonly used enzyme for protein digestion in mass spectrometry. Many peptide identification algorithms incorporate its specificity rules as filtering criteria. A generally accepted "Keil rule" is that trypsin cleaves next to arginine or lysine, but not before proline. Since this rule was derived two decades ago based on a small number of experimentally confirmed cleavages, we decided to re-examine it using 14.5 million tandem spectra (a two orders of magnitude increase in the number of observed tryptic cleavages). Our analysis revealed a surprisingly large number of cleavages before proline. We examine several hypotheses to explain these cleavages and argue that trypsin specificity rules used in peptide identification algorithms should be modified to "legitimatize" cleavages before proline. Our approach can be applied to analyzing any protease and we further argue that specificity rules for other enzymes should also be re-evaluated based on statistical evidence derived from large MS/MS datasets.
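
    To make the difference between the two specificity rules concrete, here is a minimal sketch of an in silico digest with and without the proline exception. The function names and the toy protein are illustrative assumptions, not code from the paper; missed cleavages are ignored.

```python
def tryptic_cleavage_sites(protein, block_before_proline=True):
    """Return indices i where the chain is cut between protein[i-1] and protein[i].

    Cleavage is placed after K or R. With block_before_proline=True the
    Keil rule is applied, so a cut is skipped when the next residue is P.
    """
    sites = []
    for i in range(1, len(protein)):
        if protein[i - 1] in "KR":
            if block_before_proline and protein[i] == "P":
                continue
            sites.append(i)
    return sites


def digest(protein, block_before_proline=True):
    """Split a protein into fully tryptic peptides (no missed cleavages)."""
    cuts = [0] + tryptic_cleavage_sites(protein, block_before_proline) + [len(protein)]
    return [protein[a:b] for a, b in zip(cuts, cuts[1:])]


# Hypothetical toy sequence containing one K-P bond.
toy_protein = "MAKPLRGSKTPEK"
keil_peptides = digest(toy_protein, block_before_proline=True)
relaxed_peptides = digest(toy_protein, block_before_proline=False)
print(keil_peptides)
print(relaxed_peptides)
```

    A search tool that filters candidates with the first rule would never consider the peptides that appear only in the second list, which is why the choice of rule matters for identification.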

 

Open Projects


  • Finding Motifs for Post-Translational Modifications (PTMs)

    Post-translational modifications (PTMs) of a protein sequence may change the properties of a protein by adding a chemical group to an amino acid, thus changing its mass. In the analysis of three Shewanella species, we observe several thousand post-translationally modified amino acids in each species. While these sites were predicted using mass spectrometry data, it may be possible to detect these modified sites based only on the amino acid sequence of the protein flanking the modification site (PTM motifs). In the past, PTM motifs were studied for a small set of important modifications like phosphorylation (see the list of PTM prediction tools below). Large MS/MS datasets from three species will make good training sets for building a sequence-based predictor for a much larger variety of PTMs. The goal of this project is to build a tool that allows one to predict which amino acids in a protein are likely to be post-translationally modified, given the protein sequence.
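
    As a possible starting point, the sequence context of each modified site can be collected as a fixed-width window and summarized column by column. The sketch below assumes 0-based site positions and a hypothetical padding character for sites near the termini; the helper names are illustrative only.

```python
from collections import Counter


def flanking_window(sequence, site, flank=3, pad="-"):
    """Return the residues from site-flank to site+flank (site is 0-based).

    Positions that fall outside the protein are filled with the pad character,
    so every window has length 2*flank + 1 with the modified residue centered.
    """
    padded = pad * flank + sequence + pad * flank
    return padded[site:site + 2 * flank + 1]


def position_frequencies(windows):
    """Per-column residue frequencies over a list of equal-length windows."""
    length = len(windows[0])
    freqs = []
    for col in range(length):
        counts = Counter(window[col] for window in windows)
        total = sum(counts.values())
        freqs.append({aa: n / total for aa, n in counts.items()})
    return freqs


# Hypothetical example: two modified serines in two toy proteins.
windows = [
    flanking_window("MKRSPSAGT", 3, flank=2),
    flanking_window("TARSPTLLK", 3, flank=2),
]
print(windows)
print(position_frequencies(windows))
```

    Column frequencies like these are only a descriptive summary; a predictor would also need windows around unmodified residues of the same type as a background.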

    Required Datasets
    * List of observed PTMs in three organisms
    * Protein sequences for each organism

    List of PTM prediction tools:
    http://us.expasy.org/tools/#ptm
    AutoMotif: http://bioinformatics.oxfordjournals.org/cgi/content/full/21/10/2525
    Sulfinator
    NetPhos
    GANNPhos


  • Analysis of Conserved Post-Translational Modifications

    In the analysis of three Shewanella species, we observe several thousand post-translationally modified amino acids in each species. These three species are genetically close to each other. The goal of this project is to find out which PTMs are conserved among the three species, and whether there is anything interesting about these conserved PTMs.

    This is an exploratory project and several directions can be taken. For example, a very recent study from the Matthias Mann group comparing phosphorylation sites between E. coli and B. subtilis found about 80 phosphorylation sites in each of the two organisms. They show that the modified positions in the proteome are significantly more conserved than non-modified positions. In our case, we have many more types of modifications, a much larger number of modified residues, and three closely related species. This puts us in a good position to check the hypothesis that modified residues are more conserved in evolution.

    A challenge here is to distinguish in-vivo modifications from in-vitro modifications. The majority of the modifications identified in our dataset are in-vitro modifications, as these are more common. These, however, may have nothing to do with actual bacterial evolution. The more interesting ones are the in-vivo modifications, so it may be important to develop approaches to filter out in-vitro modifications in this study.
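
    The conservation hypothesis could be checked, in its simplest form, by comparing how often modified and unmodified positions are identical across the aligned orthologs. The sketch below assumes a gapped multiple alignment with the reference species first and a set of alignment columns carrying a modification; both the data layout and the function name are assumptions made for illustration.

```python
def conservation_rates(aligned, modified_columns, gap="-"):
    """Compare conservation at modified and unmodified reference positions.

    aligned: equal-length aligned orthologous sequences, reference first.
    modified_columns: alignment columns where the reference residue is modified.
    A column counts as conserved when every ortholog matches the reference.
    Returns (rate among modified columns, rate among unmodified columns).
    """
    reference = aligned[0]
    counts = {True: [0, 0], False: [0, 0]}  # is_modified -> [conserved, total]
    for col, residue in enumerate(reference):
        if residue == gap:
            continue  # no reference residue in this column
        conserved = all(seq[col] == residue for seq in aligned[1:])
        bucket = counts[col in modified_columns]
        bucket[0] += int(conserved)
        bucket[1] += 1

    def rate(bucket):
        return bucket[0] / bucket[1] if bucket[1] else float("nan")

    return rate(counts[True]), rate(counts[False])


# Hypothetical three-species alignment with modifications at columns 2 and 6.
alignment = ["MSTKA-LR", "MSTRAGLR", "MATKA-LK"]
modified_rate, unmodified_rate = conservation_rates(alignment, {2, 6})
print(modified_rate, unmodified_rate)
```

    On real data the two rates would have to be compared with a proper significance test, and restricted to residues of the same amino acid type, before drawing any conclusion.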

    Reference
    Boris Macek, Florian Gnad, Boumediene Soufi, Chanchal Kumar, Jesper V. Olsen, Ivan Mijakovic, and Matthias Mann. Phosphoproteome analysis of E. coli reveals evolutionary conservation of bacterial Ser/Thr/Tyr phosphorylation. Advanced online papers in Mol Cell Proteomics.

    Required Datasets
    * List of observed PTMs in three organisms
    * Protein sequences for each organism
    * Orthology information (list and alignments)



  • Structural Analysis of Post-Translational Modification and Proteolytic Cleavage Sites

    Post-translational modifications and proteolytic cleavages are specific events that happen at specific sites in a protein. The occurrence of these events may depend on the structure of the protein at these sites. In each of the three Shewanella species, thousands of PTM sites and hundreds of proteolytic cleavage sites are identified. It will be interesting to analyze the secondary and tertiary structure of the proteins in the vicinity of these sites, and learn how structure may affect modifications and proteolytic cleavages in proteins.

    List of secondary structure prediction programs
    http://us.expasy.org/tools/#secondary
    http://www.russell.embl-heidelberg.de/gtsp/secstrucpred.html

    Required Datasets
    * List of observed PTMs in three organisms
    * List of observed doubly-confirmed cleavages in three organisms
    * Protein sequences for each organism



  • Improving Signal Peptide Predictions with Mass Spectrometry Data

    Signal peptides are short amino acid sequences at the N-termini of secreted proteins, which are important for translocating these proteins to their final destination after translation. Once a protein is translocated, the signal sequence is cleaved from its N-terminus. These cleavages can be identified from MS/MS data as described in Gupta et al, 2007. We have now identified signal cleavage sites in about 100 proteins in each of the three Shewanella species. This analysis of signal peptides independently "discovered" the correct signal peptide motif without any prior knowledge. At the same time it found many potential signal peptides missed by signal peptide prediction tools like SignalP and PrediSi and refuted some of their predictions. This is not surprising since SignalP and PrediSi made many predictions that are in conflict with each other.

    Three approaches can be used to identify highly reliable signal peptide predictions from MS/MS data:
    * Signal peptides that are seen in MS/MS data from multiple enzyme digests: indeed it is extremely unlikely that the same computational artifact is confirmed by two different digestions (We are currently generating the list of signal peptides from these multi-enzyme digests).
    * The second way to verify these predictions is comparative proteogenomic analysis of MS/MS data from three Shewanella species.
    * The third way to further increase confidence in MS/MS-based signal peptide predictions is to analyze doubly confirmed cuts that have an order of magnitude smaller false positive rates as compared to MS/MS peptide identifications (We already have this list of doubly-confirmed cuts). See this paper for the definition of doubly-confirmed cuts.

    After forming the set of highly reliable signal peptides, we can analyze why they were missed by SignalP and PrediSi. The profile-like motif models (Position Weight Matrices, or PWMs) that these approaches use are limited to a single motif, while in reality there may be multiple motifs describing signal peptides. This limitation of existing motif models, which is well known in studies of DNA motifs [Hannenhalli 2005], has not yet been explored in studies of signal peptides (this is not a criticism of existing signal peptide prediction tools but rather a limitation imposed by small learning samples). We can explore whether there exist subclasses of signal peptides and whether a mixture of these subclass PWMs can improve signal peptide predictions. The set of reliable signal peptides derived from MS/MS data will allow us to devise a mixture model [Hannenhalli 2005] consisting of a few PWMs for representing signal peptides that will complement the existing PWM.
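
    The idea of scoring with a mixture of subclass PWMs instead of a single PWM can be sketched as follows. The subclass assignments, mixture weights, pseudocount and toy windows are all hypothetical and given by hand here; in practice the subclasses would have to be learned from the data (for example with an EM procedure, which is not shown), and this is not the implementation used by SignalP, PrediSi or the cited paper.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(sequences, pseudocount=0.1):
    """Column-wise residue probabilities estimated from equal-length windows."""
    length = len(sequences[0])
    total = len(sequences) + pseudocount * len(ALPHABET)
    pwm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in sequences)
        pwm.append({aa: (counts[aa] + pseudocount) / total for aa in ALPHABET})
    return pwm


def likelihood(pwm, seq):
    """Probability of a window under one PWM (columns treated as independent)."""
    return math.prod(column[aa] for column, aa in zip(pwm, seq))


def mixture_likelihood(pwms, weights, seq):
    """Probability of a window under a weighted mixture of PWMs."""
    return sum(w * likelihood(pwm, seq) for pwm, w in zip(pwms, weights))


# Hypothetical windows around cleavage sites, split by hand into two subclasses.
subclass_a = ["ALA", "AVA", "ALA"]
subclass_b = ["SQS", "SNS"]
single_pwm = build_pwm(subclass_a + subclass_b)
mixture = [build_pwm(subclass_a), build_pwm(subclass_b)]
weights = [0.6, 0.4]  # fraction of training windows in each subclass

print(likelihood(single_pwm, "ALA"))
print(mixture_likelihood(mixture, weights, "ALA"))
```

    A single PWM averages the two subclasses in every column, so it also gives weight to "chimeric" windows that mix residues from both; the mixture keeps the columns of each subclass tied together.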

    In addition to finding signal peptides missed by existing programs, our preliminary study refuted some of the SignalP and PrediSi predictions. This may reflect the overly general PWM model that these software tools use. A PWM constructed from the refuted predictions differs from the existing PWM. We therefore argue that these existing models have to be made more specific to minimize false positive errors and propose to make such an adjustment based on MS/MS data.

    Required Datasets
    * List of signal peptide identifications in each species
    * List of multi-enzyme signal peptide identifications in each species (optional, being generated)
    * List of doubly-confirmed cuts in each species
    * Protein sequences for each organism
    * Orthology information (list and alignments)

    Reference
    S Hannenhalli, LS Wang. Enhanced position weight matrices using mixture models. Bioinformatics, 2005.


  • Computational Prediction of N-terminal Methionine Cleavage

    Protein N-terminal methionine excision (NME) is an essential cotranslational process that occurs in the cytoplasm of all organisms and in the two organelles (i.e. mitochondria and plastids) displaying protein synthesis.
    The N-terminal methionine residue is cleaved by MAP or AmpP from a number of cytosolic proteins. Methionine, which is important during translation, may not be required (or may actually be detrimental) for the function of the protein. For example, NME is important for the stability of several recombinant proteins. A few hundred proteins in each Shewanella species were identified with the loss of the initial methionine in the protein sequence. It is known that proteins whose second amino acid has a small side chain (such as G, A, P, S, T etc) are more likely to lose the methionine. However, the effect of the subsequent positions in the protein is not clearly understood. The goal of this project will be to build a sequence-based prediction tool for N-terminal methionine cleavages. The list of N-terminal methionine cleavages observed in the three organisms will provide a positive control dataset for the predictions. Proteins that are observed with initial methionines can be used as a negative control dataset. The methionine cleavages that are observed across multiple organisms can be considered more confident than others.
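
    A baseline that any new predictor should beat is the second-residue rule itself. The sketch below encodes that rule and counts its agreement with the positive and negative control sets; the residue set contains only the examples listed above (the "etc" is left unfilled), and the function names and toy proteins are hypothetical.

```python
# Small second residues named in the text; the list there is open-ended,
# so this set is an assumption and may be incomplete.
SMALL_SECOND_RESIDUES = set("GAPST")


def predicts_nme(protein):
    """Baseline rule: predict methionine excision when residue 2 is small."""
    return (
        len(protein) > 1
        and protein[0] == "M"
        and protein[1] in SMALL_SECOND_RESIDUES
    )


def evaluate(cleaved, retained):
    """Confusion counts of the baseline on positive and negative controls.

    cleaved: proteins observed with the initial methionine removed.
    retained: proteins observed with the initial methionine present.
    """
    tp = sum(predicts_nme(p) for p in cleaved)
    fp = sum(predicts_nme(p) for p in retained)
    return {"tp": tp, "fn": len(cleaved) - tp, "fp": fp, "tn": len(retained) - fp}


# Hypothetical toy control sets (annotated sequences, methionine included).
print(evaluate(cleaved=["MAKK", "MKAA"], retained=["MLLL", "MGAA"]))
```

    The errors of this baseline (cleaved proteins with a large second residue, retained proteins with a small one) are exactly the cases where the later positions might carry signal.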

    Required Datasets
    * List of proteins with and without N-terminal methionine cleavages for each organism
    * Protein sequences for each organism
    * Orthology information (list and alignments)

    Reference
    Frottin F, Martinez A, Peynot P, Mitra S, Holz RC, Giglione C, Meinnel T. The proteomics of N-terminal methionine cleavage. Mol Cell Proteomics. 2006 Dec;5(12):2336-49.


  • Sequence-based Prediction of Peptide Detectability

    In mass spectrometry, proteins are first cut at specific positions by an enzyme (usually trypsin, which cuts after R or K), and then the resulting peptides are identified. Not all possible tryptic peptides in a protein are equally detectable in mass spectrometry. The detectability of a peptide may depend on its sequence, its length, its spatial location in the folded protein etc. Several methods have recently been proposed to predict peptide detectability (Tang 2006, Lu 2007, Mallick 2007).

    As opposed to these previous studies, we have the additional advantage of having the comparative data from three closely related species. Orthologous peptides observed in two or more species provide a very strong positive-control dataset, while orthologous peptides not observed in any organism provide a strong negative-control dataset.

    We can further analyze the effect of mutations in protein sequences across these organisms on peptide detectability. For example, by analyzing the orthologous peptides that are observed in multiple species, we can find which mutations do not adversely affect peptide detectability. On the other hand, if a peptide is observed in one species, but its orthologous peptide is not observed in other species, we can try to compare the sequences of these peptides to see what mutations might have turned the peptide non-detectable.
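
    The labeling scheme described above can be sketched as follows: groups of orthologous peptides seen in two or more species become positives, groups never seen become negatives, and groups seen in exactly one species are set aside for the mutation analysis. The data layout (dictionaries keyed by hypothetical group and species names) is an assumption, and the substitution listing assumes the two peptides are already aligned and of equal length.

```python
def label_ortholog_groups(groups, observed):
    """Label each group of orthologous peptides by how many species observed it.

    groups: group id -> {species: peptide sequence in that species}
    observed: species -> set of peptides identified in that species
    Returns group id -> "positive" (seen in >= 2 species),
    "negative" (seen in none) or "discordant" (seen in exactly one).
    """
    labels = {}
    for group_id, members in groups.items():
        n_seen = sum(
            peptide in observed.get(species, set())
            for species, peptide in members.items()
        )
        if n_seen >= 2:
            labels[group_id] = "positive"
        elif n_seen == 0:
            labels[group_id] = "negative"
        else:
            labels[group_id] = "discordant"
    return labels


def substitutions(seen_peptide, unseen_peptide):
    """List (position, residue in seen, residue in unseen) for mismatches."""
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seen_peptide, unseen_peptide))
        if a != b
    ]


# Hypothetical species names and peptides.
groups = {
    "g1": {"sp1": "LVTDLTK", "sp2": "LVTDLTK", "sp3": "LVSDLTK"},
    "g2": {"sp1": "AEGFPWR", "sp2": "AEGFPWR", "sp3": "AEGFPWR"},
    "g3": {"sp1": "GSNQEIK", "sp2": "GSNPEIK", "sp3": "GSNPEIK"},
}
observed = {"sp1": {"LVTDLTK", "GSNQEIK"}, "sp2": {"LVTDLTK"}, "sp3": set()}
print(label_ortholog_groups(groups, observed))
print(substitutions("GSNQEIK", "GSNPEIK"))
```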

    Required Datasets
    * List of peptides identified in each organism
    * Protein sequences for each organism
    * Orthology information (list and alignments)

    References

    Tang, H., Arnold, R., Alves, P., Xun, Z., Clemmer, D., Novotny, M., Reilly, J., and Radivojac, P. 2006. A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics 22: e481–e488.

    Lu, P., Vogel, C., Wang, R., Yao, X., and Marcotte, E. 2007. Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat. Biotechnol. 25: 117–124.

    Mallick, P. et al. Computational prediction of proteotypic peptides for quantitative proteomics. Nat. Biotechnol. 25, 125–131 (2007).


  • Proteogenomic Browser

    We would like to develop a web-based GUI tool for proteogenomic visualization. One should be able to visualize a given part of the genome and see which peptides match there (in each of the six frames), which genes are present in the nearby regions, what the codon is at any position etc. It will have search and slide functionality to make it interactive. One should be able to query with an amino acid or nucleotide sequence and visualize regions containing that sequence, or query by coordinates. Right now, we spend a lot of time getting such information using command-line scripts; the new tool will instantly provide the "proteogenomic landscape" of the region of interest.

    When a user goes to the browser, he/she can upload the genome sequence file, coordinates of genes, and the list of observed peptides. After that, the browser should be able to respond to any of the queries.
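
    The core query behind such a browser, locating a peptide in the six-frame translation of a genome, can be sketched as below. This is a minimal illustration with hypothetical function names: it assumes an uppercase A/C/G/T genome string and the standard codon assignments, and reports 0-based, end-exclusive coordinates on the forward strand.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon; stop codons become '*', unknown codons 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(genome):
    """Frames 1..3 are on the forward strand, -1..-3 on the reverse strand."""
    reverse = reverse_complement(genome)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(genome[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


def find_peptide(genome, peptide):
    """Return (frame, start, end) hits in forward-strand nucleotide coordinates."""
    hits = []
    n = len(genome)
    for frame, protein in six_frames(genome).items():
        offset = abs(frame) - 1
        pos = protein.find(peptide)
        while pos != -1:
            start = offset + 3 * pos
            end = start + 3 * len(peptide)
            if frame < 0:
                # Map reverse-strand coordinates back onto the forward strand.
                start, end = n - end, n - start
            hits.append((frame, start, end))
            pos = protein.find(peptide, pos + 1)
    return hits


# Hypothetical toy genome encoding the peptide MAKR in frame 3.
toy_genome = "CCATGGCTAAACGTG"
print(find_peptide(toy_genome, "MAKR"))
```

    For real genomes the six translations would be computed once and indexed rather than rescanned for every query.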

    Required Datasets
    * List of peptides identified in each organism
    * Protein coordinates
    * Nucleotide sequence files


  • Annotating Bacterial Genomes Using Mass Spectrometry Data (multiple projects)

    We have developed an automated proteogenomic analysis pipeline. This pipeline takes as input the mass spectrometry data and the genome sequence (with existing gene annotations) of a bacterium, and produces a report with suggestions for correcting gene annotations, putative cleavage sites and post-translational modifications. An example of such a report is shown here (please see Gupta et al, 2007 for details). In Gupta et al, 2007, we improved the annotation of Shewanella oneidensis MR-1 using this type of data. We have now generated similar reports for other organisms, including Shewanella putrefaciens CN-32, Shewanella frigidimarina, Pelagibacter ubique and Porphyromonas gingivalis. The goal of these projects is to use these reports to improve the annotations of the respective bacteria.

    It is expected that the student/team working on each species will closely interact with our collaborators who are interested in the particular microbe. These projects may include a significant amount of literature search (for example, to find support for the PTMs), or manual checks (for example, BLAST searches to validate gene corrections).

    Required Datasets
    * List of peptides identified in each organism
    * Protein coordinates
    * List of observed PTMs in three organisms
    * Nucleotide sequence files

Supported By HHMI


What's New

Comparative proteogenomics paper accepted

The paper including results of all students on using comparative approaches for analyzing MS/MS data will appear in Genome Research. The paper is also covered in research highlights of Nature Reviews Genetics!

Trypsin paper accepted

The paper "Does trypsin cut before Proline?", with our undergraduate student Jesse Rodriguez as the first author, has been accepted for publication in the Journal of Proteome Research. Congratulations Jesse!

Consortium alumni

Jesse Rodriguez joins Stanford, Noah Ollikainen joins UCSF and Ian Kerman joins UCSD for graduate studies. Congratulations to all!

Poster at ASMS

Jesse Rodriguez presented a poster on his work on analyzing trypsin specificity at ASMS 2007 in Indianapolis to thousands of mass-spectrometry experts.

HHMI site visit

Mary Koszalka from HHMI visited us, and learnt about the consortium projects through presentations made by Jamal Benhamida, Jesse Rodriguez and Nitin Gupta.

 

Collaborators


University of California, San Diego

Pacific Northwest National Laboratory

Burnham Institute

Howard Hughes Medical Institute