Sunday, 14 September 2014

ProteoWizard: The chosen one in RAW file conversion

I'm the chosen one.
After five years in proteomics and a quick walk through different computational proteomics topics, such as database analysis, proteomics repositories and databases, and identification algorithms, I'm sure that the most painful and thankless job is working with file formats: writing them, reading them, and dealing with end users.

File formats (the way we represent, store, and exchange our data) are a fundamental piece of bioinformatics; more than that, they are one of the milestones of the Information Era. In some fields the topic is more stable than in others, but it is still on the table for most of us. For a quick idea, look at the evolution of general-purpose formats in recent years: XML, JSON, and more recently YAML.


What happens in computational proteomics? Here is a great picture that summarizes the breadth of file formats in computational proteomics, from Eric Deutsch's manuscript in MCP [1]:



Incredible? In proteomics and mass spectrometry, file formats cover a wide range of processes, workflows, and analytical protocols, divided into two main groups: Informatics Analysis and MS Analysis. The starting point is the RAW files (vendor formats). I don't want to explain all of these file formats in detail here; I'll dedicate a post to them in the near future. In this post I will talk about RAW files: why it is important to deal with them, and who is doing the job very well.

The key to interpreting RAW data directly has been the development of specific software to parse the binary content of these raw files into intelligible data, a tedious and time-consuming task that typically needs to be redone each time a new machine or a new version of an existing machine or its operating software appears [2]. Next to the above-mentioned caveats associated with proprietary raw data formats, there is also the very real problem of “aging” that comes with any binary formatted data. As time goes by, support for certain formats tends to evaporate and within the space of several years, readers can no longer be found for the format.

As a result, most new software and tools in computational proteomics avoid handling the original RAW files and use standard formats such as mzXML, mzML, or simple peak files. Most search engines and quantitation or visualisation tools are based on those files, which are simpler to exchange and read. But who exports the original data (RAW files) to those formats? The vendors? No: ProteoWizard.
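To give an idea of why simple peak files are so much easier to handle than a binary RAW file, here is a minimal sketch of an MGF reader in plain Python. It assumes the usual MGF layout (BEGIN IONS / END IONS blocks, KEY=VALUE parameter lines, and "m/z intensity" peak lines). The names parse_mgf, precursor_mz, and SAMPLE_MGF are mine, for illustration only; a real project should rely on an existing library rather than on this sketch.

```python
from typing import Dict, List

# Tiny hand-written MGF document used only for illustration.
SAMPLE_MGF = """\
BEGIN IONS
TITLE=scan=1
PEPMASS=500.25 1000
CHARGE=2+
100.0 10.0
200.5 20.0
END IONS
BEGIN IONS
TITLE=scan=2
PEPMASS=600.3
150.0 5.0
END IONS
"""


def parse_mgf(text: str) -> List[Dict]:
    """Parse MGF text into a list of {"params": {...}, "peaks": [...]} dicts."""
    spectra = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if line[0].isdigit():
                # Peak line: m/z, intensity (any extra columns are ignored).
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
            elif "=" in line:
                # Parameter line: split on the first "=" only.
                key, value = line.split("=", 1)
                current["params"][key.upper()] = value
    return spectra


def precursor_mz(spectrum: Dict) -> float:
    """Return the precursor m/z, the first number on the PEPMASS line."""
    return float(spectrum["params"]["PEPMASS"].split()[0])
```

Calling parse_mgf(SAMPLE_MGF) returns two spectra; nothing comparable can be written in a few lines for a vendor binary format, which is exactly the point.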


Originally published in Bioinformatics in 2008 [3], ProteoWizard has played its role in data conversion better than any other tool. ProteoWizard provides a modular and extensible set of open-source, cross-platform tools and libraries. The framework includes different tools for data conversion and a core API for parsing different data formats [4]. In addition to the open mzML, mzXML, mzIdentML, and mzData XML formats, a variety of proprietary formats can also be handled.
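In practice the conversion is usually done with ProteoWizard's msconvert command-line tool. The sketch below only builds the command line from Python, so it can be inspected or logged before anything is executed. The function name build_msconvert_command and its parameters are my own; the flags --mzML, --mzXML, --mgf, -o, and --filter are msconvert options, but filter strings vary between versions, so check msconvert --help for the exact syntax on your installation.

```python
from typing import List, Optional, Sequence

# Output-format flags understood by msconvert.
FORMAT_FLAGS = {"mzML": "--mzML", "mzXML": "--mzXML", "mgf": "--mgf"}


def build_msconvert_command(
    raw_file: str,
    out_dir: str,
    fmt: str = "mzML",
    filters: Optional[Sequence[str]] = None,
    executable: str = "msconvert",  # assumed to be on the PATH
) -> List[str]:
    """Return the msconvert argument list for converting one RAW file."""
    if fmt not in FORMAT_FLAGS:
        raise ValueError(f"unsupported output format: {fmt!r}")
    cmd = [executable, str(raw_file), FORMAT_FLAGS[fmt], "-o", str(out_dir)]
    for flt in filters or []:
        # Each filter is passed as its own --filter argument.
        cmd += ["--filter", flt]
    return cmd


# To actually run the conversion (requires ProteoWizard to be installed):
#   import subprocess
#   subprocess.run(build_msconvert_command("sample.raw", "converted"), check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems with file names that contain spaces, which are common on instrument PCs.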

One of the things I really like about this tool is its simple and modular design, which allows the conversion of different proprietary formats to a common data model; see the figure from the Nature Biotechnology paper:
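The idea of format-specific readers feeding one common data model can be illustrated with a small Python sketch. This is not ProteoWizard's API or its actual data model; the Spectrum class, the READERS table, and the reader functions are hypothetical names of mine. The two formats used here are SEQUEST .dta (first line: MH+ mass and charge, then peak lines) and MS2 (S lines carry the precursor m/z, Z lines the charge), both heavily simplified.

```python
import os
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

PROTON_MASS = 1.007276  # approximate proton mass in Da


@dataclass
class Spectrum:
    """Common in-memory model that every reader must produce."""
    precursor_mz: float
    charge: int
    peaks: List[Tuple[float, float]] = field(default_factory=list)


def read_dta(text: str) -> List[Spectrum]:
    """Read a single-spectrum .dta file: 'MH+ charge' line, then peaks."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    mh, charge = float(lines[0][0]), int(lines[0][1])
    # Convert the singly protonated mass (MH+) to the observed m/z.
    mz = (mh + (charge - 1) * PROTON_MASS) / charge
    peaks = [(float(f[0]), float(f[1])) for f in lines[1:]]
    return [Spectrum(mz, charge, peaks)]


def read_ms2(text: str) -> List[Spectrum]:
    """Read MS2 text; only S, Z, and peak lines are interpreted."""
    spectra: List[Spectrum] = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        tag = fields[0]
        if tag == "S":  # S <first scan> <last scan> <precursor m/z>
            spectra.append(Spectrum(float(fields[3]), 0))
        elif tag == "Z":  # Z <charge> <M+H mass>; keep the first one only
            if spectra and spectra[-1].charge == 0:
                spectra[-1].charge = int(fields[1])
        elif tag in ("H", "I", "D"):
            continue  # header and extra-information lines are skipped
        elif spectra:
            spectra[-1].peaks.append((float(fields[0]), float(fields[1])))
    return spectra


# One reader per file extension; new formats are added by registering here.
READERS: Dict[str, Callable[[str], List[Spectrum]]] = {
    ".dta": read_dta,
    ".ms2": read_ms2,
}


def read_spectra(filename: str, text: str) -> List[Spectrum]:
    """Dispatch on the file extension and return common Spectrum objects."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in READERS:
        raise ValueError(f"no reader registered for {ext!r}")
    return READERS[ext](text)
```

Everything downstream of read_spectra only sees Spectrum objects, so supporting a new format means writing one more reader, not touching the rest of the code.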


One thing still missing is that, to be fully functional, the tool must be installed and run on a Windows system (the vendors' fault). Here (https://github.com/jmchilton/proteowizard-wine-packager) you can find a Linux wrapper to run it with Wine on a Linux machine (I have not tested it). See also this help page from TPP.

Supported Data Formats


AB SCIEX: WIFF, T2D (with DataExplorer)
Agilent: MassHunter (.d directories)
Bruker: FID, .d directories, XMASSXML
Thermo: RAW
Waters: Raw directories
HUPO PSI: mzML
ISB: mzXML
Matrix Science: MGF
Yates/MacCoss Laboratories: MS2/CMS2/BMS2
mz5


Other tools?

  • ms2mz by bioproximity: simple utility for converting between common mass spectrometer file formats.
  • APLToMGFConverter: converts MaxQuant APL (Andromeda peak lists) to MGF.
  • CompassXport: converts Bruker analysis.baf and analysis.yep files to mzXML.
  • dat2mgf: converts Mascot results files back to MGF
  • DataAnalysis2TPP: converts MGF from Bruker DataAnalysis to TPP-friendly format for use with XPRESS and ASAPRatio
  • MassWolf: converts MassLynx format to mzXML
  • MGF to .dta File Converter: converts MGF to .dta
  • mz2mgf: converts mzData files to MGF
  • mzBruker: converts Bruker analysis.baf files to mzXML
  • mzStar: converts SCIEX/ABI Analyst format (WIFF) to mzXML
  • mzXML2Other: converts mzXML to SEQUEST .dta, MGF, and Micromass .pkl
  • PklFileMerger: merges individual Q-TOF .pkl files into a single file for database searching.
  • ReAdW: converts Xcalibur native acquisition files to mzXML
  • T2D converter: converts ABI SCIEX 4700/4800 t2d files to mzXML
  • unfinnigan: reads Thermo .raw files without MsFileReader
  • wiff2dta: converts ABI WIFF to .dta
  • X2XML: converts from almost any format (Thermo, Bruker, Agilent and Micromass) to mzXML

References

[1] Deutsch, Eric W. "File formats commonly used in mass spectrometry proteomics." Molecular & Cellular Proteomics 11.12 (2012): 1612-1621.

[2] Martens, Lennart, et al. "Do we want our data raw? Including binary mass spectrometry data in public proteomics data repositories." Proteomics 5.13 (2005): 3501-3505.


[3] Kessner, Darren, et al. "ProteoWizard: open source software for rapid proteomics tools development." Bioinformatics 24.21 (2008): 2534-2536.

[4] Perez-Riverol, Yasset, et al. "Open source libraries and frameworks for mass spectrometry based proteomics: A developer's perspective." Biochimica et Biophysica Acta (BBA)-Proteins and Proteomics 1844.1 (2014): 63-76.