Tandem mass spectrometry (MS/MS) has emerged as the leading technology for high-throughput proteomics analysis, making it possible to rapidly identify and characterize thousands of different proteins in complex biological samples. In recent years we have witnessed a dramatic increase in the capability to acquire proteomics MS/MS data. To avoid computational bottlenecks, this growth in acquisition power must be accompanied by a comparable improvement in analysis capabilities. In this dissertation we present several algorithms we developed to meet some of the major computational challenges that have arisen in MS/MS analysis. Throughout our work we continually address two (sometimes overlapping) problems: how to make MS/MS-based sequence identifications more accurate, and how to make the identification process work much faster.
Much of the work we present revolves around algorithms for de novo peptide sequencing, which aim to determine the amino acid sequence of a peptide (a product of protein digestion) solely from its experimental mass spectrum. We start by describing a new scoring model used in our de novo sequencing algorithm, PepNovo. Our scoring scheme is based on a graphical model decomposition that describes many of the conditions that determine the intensities of fragment ions observed in mass spectra, such as dependencies between related fragment ions and the influence of the amino acids adjacent to the cleavage site.
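To make the notion of fragment ions concrete, the following minimal sketch computes the theoretical singly charged b- and y-ion m/z values at every cleavage site of a peptide. It is not PepNovo's scoring model: it only illustrates where peaks are expected, not how their intensities are modeled. The function name `fragment_ions` is hypothetical; the masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # mass of H2O (Da)


def fragment_ions(peptide):
    """Return (b_ions, y_ions) m/z lists for singly charged fragments.

    Entry i of each list corresponds to the cleavage after residue i+1,
    so b_ions runs b1, b2, ... while y_ions runs y(n-1), ..., y1.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    b_ions, y_ions = [], []
    prefix = 0.0
    for mass in masses[:-1]:          # one cleavage site between each residue pair
        prefix += mass
        b_ions.append(prefix + PROTON)                    # N-terminal fragment
        y_ions.append(total - prefix + WATER + PROTON)    # C-terminal fragment
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("PEPTIDEK")
```

A scoring model such as the one described above would then ask how likely the observed peak intensities are at these positions, given factors like the flanking amino acids.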
Besides predicting whole peptide sequences, one of the most useful applications of de novo algorithms is to generate short sequence tags for the purpose of database filtration. We demonstrate how using these tags speeds up database searches by two orders of magnitude compared to conventional methods. We extend the use of tag filtration and show that with high-resolution data, our de novo sequencing is accurate enough to enable extremely rapid identification via direct hash lookup of peptide sequences.
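The idea of tag-based filtration can be sketched as follows, under simplifying assumptions: the database is a small list of peptide strings, tags are exact substrings of fixed length, and the names `build_tag_index` and `filter_candidates` are hypothetical. A practical tag filter would typically also check the masses flanking the tag and treat isobaric residues such as I and L as equivalent; both are omitted here.

```python
from collections import defaultdict


def build_tag_index(peptides, tag_len=3):
    """Map every substring of length tag_len to the set of peptides containing it."""
    index = defaultdict(set)
    for pep in peptides:
        for i in range(len(pep) - tag_len + 1):
            index[pep[i:i + tag_len]].add(pep)
    return index


def filter_candidates(index, tags):
    """Return the peptides that contain at least one of the query tags."""
    candidates = set()
    for tag in tags:
        candidates |= index.get(tag, set())
    return candidates


# Toy database (illustrative strings, not real data).
database = ["PEPTIDEK", "SAMPLER", "ELVISLIVESK", "TIDEPOOLR"]
tag_index = build_tag_index(database, tag_len=3)

# Tags as a de novo algorithm might propose them for one spectrum.
candidates = filter_candidates(tag_index, ["TID", "LIV"])

# With a sufficiently accurate full-length de novo sequence, filtration
# reduces to a direct hash lookup of the whole peptide.
peptide_set = set(database)
exact_hit = "PEPTIDEK" in peptide_set
```

Only the peptides in `candidates` would need to be scored against the spectrum, which is where the speedup over scoring every database peptide comes from.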
The vast amount of MS/MS data that has become available has made it possible to use advanced data-driven machine learning methods to devise more accurate algorithms. We describe a new scoring function for peptide-spectrum matches that uses the RankBoost ranking algorithm to learn and model the influences of the many intricate processes that occur during peptide fragmentation. Our method's superior discriminatory power boosts PepNovo's performance beyond that of current state-of-the-art de novo sequencing algorithms. Our score also greatly improves the performance of database search programs, significantly increasing both their speed and sensitivity. When we applied our method to the challenging task of a proteogenomic search against a six-frame translation of the human genome, we increased the number of peptide identifications by 60% compared to current techniques.
To help speed up MS/MS analysis, we developed a clustering algorithm that exploits the redundancy inherent in large mass spectrometry datasets, which often contain hundreds or even thousands of spectra of the same peptide. When applied to large MS/MS datasets on the order of ten million spectra, our clustering algorithm reduces the number of spectra by an order of magnitude without losing peptide identifications.
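One simple way to exploit this redundancy is a greedy, single-pass clustering by spectral similarity. The sketch below is an illustration under stated assumptions, not the algorithm developed in this work: spectra are `(precursor_mz, peaks)` tuples, similarity is the cosine between binned intensity vectors, each cluster is represented by its first member rather than a consensus spectrum, and the thresholds `mz_tol` and `min_sim` are arbitrary illustrative values.

```python
import math
from collections import defaultdict


def binned(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (sparse vector)."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[int(mz / bin_width)] += intensity
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse intensity vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_spectra(spectra, mz_tol=2.0, min_sim=0.7):
    """Assign each spectrum to the first sufficiently similar cluster."""
    clusters = []
    for idx, (precursor_mz, peaks) in enumerate(spectra):
        vec = binned(peaks)
        for cluster in clusters:
            close = abs(cluster["precursor_mz"] - precursor_mz) <= mz_tol
            if close and cosine(cluster["vector"], vec) >= min_sim:
                cluster["members"].append(idx)
                break
        else:
            # No matching cluster: this spectrum starts a new one.
            clusters.append(
                {"precursor_mz": precursor_mz, "vector": vec, "members": [idx]}
            )
    return clusters


# Toy spectra: the first two resemble each other, the third does not.
spectra = [
    (500.3, [(175.1, 10.0), (262.2, 40.0), (375.3, 25.0)]),
    (500.4, [(175.1, 12.0), (262.1, 38.0), (375.2, 20.0)]),
    (650.8, [(147.1, 30.0), (304.2, 15.0), (433.3, 50.0)]),
]
clusters = cluster_spectra(spectra)
members = [c["members"] for c in clusters]
```

Downstream identification then runs once per cluster instead of once per spectrum, so the reduction in spectra count translates directly into reduced search time.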
Finally, we touch upon sequencing of intact proteins ("top-down" analysis), which, from a computational perspective, is only in its infancy: very few algorithms have been developed for analysis of this type of data. We developed MS-TopDown, which uses the Spectral Alignment algorithm to characterize protein forms (i.e., determine the modification/mutation sites). Our algorithm can handle heavily modified proteins and can also distinguish between several isobaric protein forms present in the same spectrum.