Thursday, 24 October 2013

Creating an Open Source Revolution in Computational Proteomics

First of all, I don’t want to use this post to discuss Open Source itself, or its strengths & weaknesses. This post is about the most useful open-source packages, frameworks and libraries in the field of computational proteomics (a short version of our manuscript “Open source libraries and frameworks for Mass Spectrometry based Proteomics: A developer’s perspective”).

Schema of the possible computational processing steps of a proteomics data set.

In proteomics, as in other omics fields, the bioinformatics effort can be divided into three major areas: data processing, storage and visualization. From MS/MS preprocessing to post-processing of the identification results, the objectives of these libraries and packages can vary significantly, yet they usually share a number of features. Common use cases include the handling of protein and peptide sequences, the parsing of output files from various proteomics search engines, and the visualization of MS-related information (including mass spectra and chromatograms).

I know genomics is the hot topic at the coffee table; in the NGS decade it is difficult to hear about other topics such as proteomics or cheminformatics. In fact, I sometimes get the feeling from Twitter, LinkedIn or bioinformatics RSS feeds that proteomics is only for nerds. However, we exist, and there is an increasing demand for high-performance bioinformatics solutions that can help to address the various data processing and data interpretation challenges in the field of proteomics.

Because proteomics libraries and packages share so many features despite their different objectives, the field is a perfect scenario for an Open-Source environment. Common MS data processing tasks comprise theoretical analysis of proteomes, processing of raw spectra, file format conversions, and the generation of identification statistics as well as identification and quantitation results.

The most extensively used open-source frameworks are OpenMS, the Trans-Proteomic Pipeline (TPP), the Computational Omics (Compomics) suite, the PRoteomics IDEntifications (PRIDE) database toolsuite, ProteoWizard and the Java Proteomic Library (JPL). Other well-known libraries/frameworks with a more specialized scope include InsilicoSpectro, multiplierz, mMass, MZmine, msInspect, MSQuant and MASPECTRAS.

OpenMS (Pure C++): uses external libraries such as: (i) Qt, which provides visualization and database support; (ii) Xerces, for XML file parsing; (iii) libSVM, for machine learning algorithms; and (iv) the GNU Scientific Library (GSL), for mathematical and statistical analysis. One of the strong points of OpenMS is its complete set of examples showing how to use and extend the libraries; the TOPP (The OpenMS Proteomics Pipeline) and TOPPView tutorials describe OpenMS in detail.

Trans-Proteomic Pipeline (C++, Perl, Java): encompasses most of the steps involved in a proteomics data analysis workflow in a single, integrated software system. PeptideProphet, iProphet and ProteinProphet are its cornerstones; they can be used to validate search engine results, to model correct vs. incorrect peptide-spectrum matches (PSMs), and to perform protein inference. The TPP components have been developed in different programming languages (C++, Perl and Java), which complicates both the integration with other pieces of code and the development of new applications on top of the TPP framework.

Compomics (Pure Java): contains a set of parsers for the output files of popular search engines (Mascot, X!Tandem, OMSSA and Proteome Discoverer (Thermo Scientific)). It also includes a collection of user-friendly tools, including among others: (i) ms_lims and DBToolkit, for storing proteomics data and performing different in silico analyses on it; (ii) Peptizer, for manual validation of MS/MS search results; (iii) Rover, for visualizing and validating quantitative proteomics data; (iv) FragmentationAnalyzer, for analyzing MS/MS fragmentation data; (v) the new PeptideShaker, for comprehensive combined analysis of MS results from multiple search engines (Mascot, OMSSA and X!Tandem); and (vi) SearchGUI, which provides a unified GUI (Graphical User Interface) for MS identification using multiple search engines (OMSSA and X!Tandem).

ProteoWizard and Skyline (Pure C++): ProteoWizard enables rapid tool creation and unifies data file access and conversion to perform standard proteomics and LC-MS analysis computations. It includes different tools for data conversion from RAW files and a core API for parsing different data formats. In addition to the open mzML, mzXML, mzIdentML and mzData XML formats, a variety of proprietary formats can also be handled. Skyline uses and extends the ProteoWizard core APIs and tools for targeted proteomics and label-free quantitative methods. Other important features of the tool are the vast community behind the platform, reflected in the number of publications, and the rich array of graphs available for inspecting data integrity.

Java Proteomic Library (Pure Java): provides a strong chemistry-based representation of MS proteomics data. It is composed of several modules and APIs for manipulating peptide or protein sequences, PTMs and mass spectra. It also provides methods for in silico protein digestion and peptide fragmentation, which take into account various ion types and modifications. Many classes dealing with spectrum processing and filtering, and with spectrum matching and clustering, are also provided. In addition, it contains several standalone tools for performing protein sequence digestion, creating spectra and sequence decoy databases, and performing open modification spectrum library searches (QuickMod/Liberator).
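To make "in silico digestion and fragmentation" a bit more concrete, here is a minimal sketch in plain Python. It does not use JPL or any other library named in this post; the function names (`digest_trypsin`, `monoisotopic_mass`, `singly_charged_fragments`) are made up for illustration, and the sketch assumes unmodified peptides, a simple trypsin rule (cleave after K or R unless followed by P) and singly charged b/y ions only.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O, added once per peptide for the free termini
PROTON = 1.00728   # proton mass, used to turn a neutral mass into m/z


def digest_trypsin(sequence, missed_cleavages=0):
    """Cleave after K or R, except when the next residue is P."""
    bounds = [0]
    for i in range(len(sequence) - 1):
        if sequence[i] in "KR" and sequence[i + 1] != "P":
            bounds.append(i + 1)
    bounds.append(len(sequence))

    peptides = []
    for start in range(len(bounds) - 1):
        # Allow up to `missed_cleavages` skipped cleavage sites.
        last = min(start + 1 + missed_cleavages, len(bounds) - 1)
        for end in range(start + 1, last + 1):
            peptides.append(sequence[bounds[start]:bounds[end]])
    return peptides


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def singly_charged_fragments(peptide):
    """Return (b_ions, y_ions) as m/z lists for charge 1+.

    b_ions[k] and y_ions[k] are the two complementary fragments
    produced by the same backbone cleavage.
    """
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        prefix = sum(RESIDUE_MASS[aa] for aa in peptide[:i])
        suffix = sum(RESIDUE_MASS[aa] for aa in peptide[i:])
        b_ions.append(prefix + PROTON)
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions


# Example: digest a short toy sequence and print peptide masses.
for peptide in digest_trypsin("MKWVTFISLLRPAGK", missed_cleavages=1):
    print(peptide, round(monoisotopic_mass(peptide), 4))
```

A real library adds what this sketch leaves out: modifications, other enzymes, more ion types and higher charge states.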

PRIDE toolsuite (Pure Java): constitutes a set of pure Java libraries, tools and packages designed to handle MS proteomics experiments from a vast range of approaches, instruments and analysis platforms. The framework contains components such as: (i) the mzGraph Browser library, for visualizing MS spectra, chromatograms and MS/MS spectrum annotations; (ii) the QualityChart library, which provides a number of charts for a quick quality assessment of MS experiments; (iii) several APIs for parsing standard proteomics data formats such as mzML, mzIdentML, mzTab and PRIDE XML (the PRIDE internal data format); (iv) the XXIndex library, which enables fast indexing of large XML files; (v) the PRIDE Utilities library, which contains functionality shared by many of the PRIDE-related tools; and (vi) the PRIDE core library, for general data management.

PRIDE Converter 2 and PRIDE Inspector are currently the most popular tools of the framework, and both offer a user-friendly GUI. The recently released PRIDE Converter 2 is a new submission tool that converts a large variety of popular MS proteomics formats into PRIDE XML by guiding the user through a wizard-like process. A command line interface (CLI) mode is also available for converting multiple files at once in batch mode. Its predecessor, the original PRIDE Converter tool, is being phased out now that the new software is available. PRIDE Inspector allows the user to efficiently browse, visualize, and perform an initial assessment of MS proteomics data in the PRIDE XML and mzML formats, and also allows direct access to a PRIDE MySQL public database instance; support for the mzIdentML and mzTab formats is in progress. Finally, the most recent addition to the PRIDE toolsuite is the PRIDE spectra clustering API.
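The idea behind indexing large XML files, mentioned above for XXIndex, can be sketched in a few lines of Python. This is not the XXIndex implementation or its API (XXIndex is a Java library); it is only an illustration of the general technique: scan the file once, remember the byte offset of each element of interest, and later seek straight to that offset instead of parsing the whole document. The function names are hypothetical, and the sketch assumes that each start tag fits on one line, carries an `id` attribute, and that elements of the same name are not nested.

```python
import re


def build_offset_index(path, tag="spectrum"):
    """Map each element's id attribute to the byte offset of its start tag."""
    start_tag = re.compile(
        rb"<" + re.escape(tag.encode()) + rb'\s[^>]*?\bid="([^"]*)"'
    )
    index = {}
    offset = 0  # byte offset of the current line within the file
    with open(path, "rb") as handle:
        for line in handle:
            for match in start_tag.finditer(line):
                index[match.group(1).decode()] = offset + match.start()
            offset += len(line)
    return index


def read_element(path, offset, tag="spectrum"):
    """Read one element as text, starting at a previously indexed offset."""
    closing = b"</" + tag.encode() + b">"
    chunks = []
    with open(path, "rb") as handle:
        handle.seek(offset)  # jump directly to the element
        for line in handle:
            end = line.find(closing)
            if end != -1:
                chunks.append(line[:end + len(closing)])
                break
            chunks.append(line)
    return b"".join(chunks).decode()
```

With an index like this, a viewer can fetch a single spectrum from a multi-gigabyte file by calling `read_element(path, index[spectrum_id])` and handing the small fragment to a regular XML parser.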

The R language also has a set of dedicated packages for the analysis and interpretation of mass spectrometry-based proteomics data. Efficient low-level access to raw data in any of the PSI standards is possible through the mzR package (which uses C and C++ code from the ProteoWizard project, see above). Other packages like MSnbase, MALDIquant or xcms (widely used in metabolomics) provide higher-level abstractions to facilitate data processing and analysis. The R environment is particularly well suited for exploratory data analysis, visualisation and statistical analysis of proteomics data. In terms of peptide identification, the rTANDEM package provides an interface to the popular X!Tandem search engine, and mzID parses mzIdentML files. Many more packages are available and new additions are released on a regular basis. For more details and general information about R and proteomics, have a look at the 'R for proteomics' introductory paper (Pubmed, pre-print) and package. Questions about R/Bioconductor packages can be asked on the mailing list, and general questions and suggestions about R/Bioc for proteomics data analysis can be posted on the Google group.

Other tools, packages and open-source frameworks: InsilicoSpectro was developed in Perl and offers a varied set of functionalities, for instance protein digestion, sequence database file readers, property estimation (pI, retention time, mass) and MS fragmentation prediction. Python is not an extensively used programming language in computational proteomics, but it has been gaining popularity in recent years. Multiplierz and Pyteomics are frameworks that support proteomics data analysis tasks in this language; access to the available functionality is provided via high-level Python scripts. MZmine2 is a Java library mainly implemented for MS preprocessing purposes, which also provides several data mining algorithms (principal component analysis, clustering and log-ratio analysis) to reduce the dimensionality of the data. FDRAnalysis is a Java library which enables the upload of peptide identification results from target/decoy searches carried out by three different search engines: Mascot, OMSSA and X!Tandem.
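For readers who have not met target/decoy searches before, the basic calculation can be sketched as follows. This is a plain-Python illustration of the common estimate FDR ≈ (decoy hits) / (target hits) above a score threshold, not the method implemented in FDRAnalysis or any other tool named here. The function names and the `(score, is_decoy)` tuple layout are assumptions made for the example, and higher scores are assumed to be better.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate the FDR among PSMs scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) tuples.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def score_cutoff_for_fdr(psms, max_fdr=0.01):
    """Return the lowest score whose running FDR estimate is <= max_fdr.

    Walks the PSMs from best to worst score; returns None if no cutoff
    satisfies the requested FDR. Assumes scores are distinct.
    """
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


# Toy data: six PSMs, two of which matched decoy sequences.
psms = [(90, False), (85, False), (80, False),
        (70, True), (65, False), (60, True)]
print(fdr_at_threshold(psms, 65))        # 1 decoy / 4 targets
print(score_cutoff_for_fdr(psms, 0.25))
```

Real tools refine this in several ways (q-values, separate or concatenated decoy databases, combining several search engines), but the counting above is the core of the approach.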

Open-Source libraries have been fundamental in building new bioinformatics tools. In fact, there has been great progress in the development of new libraries, which can be folded into other applications and pipelines as reusable building blocks. One of the reasons is that open-source development offers the potential for more flexible technology and, potentially, quicker innovation. One of the known downsides is the lack of thorough documentation in some cases, which can make the software hard to reuse. Also, some libraries are not well tested against extreme use cases and therefore fail frequently. However, in the near future we will continue to expand these libraries and frameworks to provide more powerful and robust analysis tools.

"Open source libraries and frameworks for mass spectrometry based proteomics: A developer's perspective. Yasset Perez-Riverol, Rui Wang, Henning Hermjakob, Markus Müller, Vladimir Vesada, Juan Antonio Vizcaíno. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics, ISSN 1570-9639"