Monday, 8 September 2014

Evaluation of Proteomic Search Engines for PTM Identification

The peptide-centric MS strategy is called bottom-up: proteins are extracted from cells, digested into peptides with proteases, and analyzed by liquid chromatography-tandem mass spectrometry (LC-MS/MS). More specifically, peptides are separated by chromatography, ionized in the mass spectrometer, and scanned to obtain full MS spectra. Next, some high-abundance peptides (precursor ions) are selected and fragmented by higher-energy C-trap dissociation (HCD) or collision-induced dissociation (CID) to obtain MS/MS spectra.

Then, peptides are commonly identified by searching the MS/MS spectra against a database and finally assembled into identified proteins. Database searching plays an important role in proteomics analysis because it can be used to translate thousands of MS/MS spectra into protein identifications (IDs). 
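
As a rough illustration of the digestion and database-matching steps described above, here is a minimal Python sketch. It is not how any of the search engines discussed here actually work: real engines also score fragment ions, handle modifications, and estimate false discovery rates. The function names, the toy protein, and the 10 ppm tolerance are all assumptions made for the example; the residue masses are standard monoisotopic values rounded to five decimals.

```python
import re

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to the residue sum
PROTON = 1.00728   # mass of a proton, used to compute m/z


def tryptic_digest(protein, missed_cleavages=0):
    """Cleave after K or R, except before P (a common simplification of trypsin)."""
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for start in range(len(pieces)):
        last = min(start + missed_cleavages, len(pieces) - 1)
        for end in range(start, last + 1):
            peptides.append("".join(pieces[start:end + 1]))
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def mz(mass, charge):
    """m/z of a peptide carrying `charge` protons."""
    return (mass + charge * PROTON) / charge


def match_precursor(observed_mz, charge, database, tol_ppm=10.0):
    """Return (protein, peptide) pairs whose theoretical m/z is within tol_ppm."""
    hits = []
    for name, sequence in database.items():
        for peptide in tryptic_digest(sequence):
            theoretical = mz(peptide_mass(peptide), charge)
            if abs(observed_mz - theoretical) / theoretical * 1e6 <= tol_ppm:
                hits.append((name, peptide))
    return hits


# Hypothetical one-protein "database" and a simulated doubly charged precursor.
database = {"toy_protein": "AKRPGKC"}
observed = mz(peptide_mass("RPGK"), 2)
print(match_precursor(observed, 2, database))
```

Matching on precursor mass alone only narrows the candidate list; a real engine would then compare the MS/MS fragment ions of each candidate against the spectrum before reporting an identification.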

Many database search engines have been developed to quickly and accurately analyze large volumes of proteomics data. Some of the better-known search engines are Mascot, SEQUEST, PEAKS DB, ProteinPilot, Andromeda, and X!Tandem. Here is a list of commonly used search engines in proteomics and mass spectrometry.

Recently, Garcia and co-workers published a comparison of search engine results for the analysis of histone modifications (Evaluation of Proteomic Search Engines for the Analysis of Histone Modifications. Zuo-Fei Yuan et al. Journal of Proteome Research. 2014). The authors showed that pFind and Mascot identified most of the confident results.

Besides the accuracy of the search engines, the authors also compare the search time and the size of the result files for each engine. PEAKS is the slowest, taking from 2 to 7 h. MaxQuant is the second slowest, at ∼15 min. X!Tandem is the fastest, at ∼20 s. pFind and OMSSA are the second fastest, taking from 20 to 100 s. The MaxQuant results are the largest, from 200 to 600 MB. The OMSSA results are the smallest, from 1 to 5 MB. The pFind results are the second smallest, from 15 to 40 MB.

Pros and cons of some of the major search engines for the identification of histone modifications:
  • pFind finishes the first six searches in several minutes but finishes the seventh search with all spectra in several hours. 
  • Mascot exhibits excellent performance on the authors' data sets but cannot identify more than nine modifications in one search.
  • Sequest HT is much faster than the old SEQUEST version (e.g., v27 rev12) but cannot identify more than six modifications in one search. 
  • ProteinPilot can identify many modifications in one search by assigning different probabilities beforehand, but its spectrum preprocessing does not work well (e.g., in pParse, the scan number and the precursor type in a filename can be put in either order; when the scan number comes first, as in histone.4.110.2.dta, very few spectra can be identified; when the precursor type comes first, as in histone.110.4.2.dta, many spectra can be identified).
  • PEAKS Studio has many powerful tools for de novo sequencing, database searching, and PTM discovery, but when the maximum number of modification sites allowed per peptide becomes large (e.g., >3) or many modifications are considered, PEAKS DB becomes slow or even runs out of memory.
  • OMSSA in COMPASS is fairly easy to use, but PTMs other than acetylation are not identified well.
  • X!Tandem in TPP is pretty fast but cannot identify different modifications on the same residue. For example, when Propionyl[K] and Acetyl[K] are both set as variable modifications, only the last modification is included in the search, so Propionyl[K] has to be set as a fixed modification, with the other PTMs' masses minus the mass of Propionyl[K] set as the variable modifications. Even so, in the seventh search only Trimethyl[K] and Phospho[ST] are included because ac, me, di, and tr all occur on lysine, which leaves X!Tandem unable to identify many multimodified spectra.
  • Andromeda in MaxQuant has advantages for analyzing SILAC data, but it is slow because of its 38 processing steps, and fewer PTM spectra can be identified because the default score threshold for modified peptides is too high.
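
The X!Tandem workaround mentioned in the list (Propionyl[K] as a fixed modification, with the other lysine PTMs entered as mass differences relative to it) comes down to simple arithmetic. The sketch below is only that arithmetic, not an X!Tandem configuration file. The helper name is hypothetical, and the monoisotopic masses are the values commonly listed in Unimod, so they should be checked against your own settings before use.

```python
# Monoisotopic modification masses (Da); assumed Unimod values, verify before use.
PROPIONYL = 56.026215
LYSINE_PTMS = {
    "Acetyl": 42.010565,
    "Methyl": 14.015650,
    "Dimethyl": 28.031300,
    "Trimethyl": 42.046950,
}


def relative_to_fixed(ptm_masses, fixed_mass):
    """Express each PTM mass as a shift from an already applied fixed modification."""
    return {name: round(mass - fixed_mass, 6) for name, mass in ptm_masses.items()}


deltas = relative_to_fixed(LYSINE_PTMS, PROPIONYL)
for name, delta in deltas.items():
    print(f"{name}[K] as variable modification on top of fixed Propionyl[K]: {delta:+.6f}")
```

All of the resulting shifts are negative because propionyl is heavier than each of the other groups. The acetyl shift, for instance, is the mass of one CH2 unit with a minus sign, since propionyl and acetyl differ by exactly CH2.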