Wednesday, 20 August 2014

ProteoStats: Computing false discovery rates in proteomics

By Amit K. Yadav (@theoneamit) & Yasset Perez-Riverol (@ypriverol):

Perl is a legacy language thought to be abstruse by many modern programmers. I’m passionate with the idea of not letting die a programming language such as Perl. Even when the language is used less in Computational Proteomics, it is still widely used in Bioinformatics. I’m enthusiastic writing about new open-source libraries in Perl that can be easily used. Two years ago, I wrote a post about InSilicoSpectro and how it can be used to study protein databases like I did in “In silico analysis of accurate proteomics, complemented by selective isolation of peptides”. 

Today’s post is about ProteoStats [1], a Perl library for False Discovery Rate (FDR) related calculations in proteomics studies. Some background for non-experts:

One of the central and most widely used approach for shotgun proteomics is the use of database search tools to assign spectra to peptides (called as Peptide Spectrum Matches or PSMs). To evaluate the quality of the assignments, these programs need to calculate/correct for population wise error rates to keep the number of false positives under control. In that sense, the best strategy to control the false positives is the target-decoy approach. Originally proposed by Elias & Gygi in 2007, the so-called classical FDR strategy or formula proposed involved a concatenated target-decoy (TD) database search for FDR estimation. This calculation is either done by the search engine or using scripts (in-house, non-published, not benchmarked, different implementations). 

So far, the only library developed to compute FDR at spectra level, peptide level and protein level FDRs is MAYU [2]. But, while MAYU only uses the classical FDR approach, ProteoStats provides options for 5 different strategies for calculating the FDR. The only prerequisite being that you need to search using a separate TD database as proposed by Kall et al (2008) [3]. Also, ProteoStats provides a programming interface that can read the native output from most widely used search tools and provide FDR related statistics. In case of tools not supported, pepXML, which has become a de facto standard output format, can be directly read along with tabular text based formats like TSV and CSV (or any other well-defined separator). 

Here, some concepts will be explained using the functions provided by the library. Let’s start:

A generic project starts from a database search, which can be done either with a (1) concatenated TD database (single search) or (2) separate T-D databases (two separate searches). Researchers then apply either Elias & Gygi formula (Concatenated) or Kall’s formula (Separate). 

FDR estimation formulae/methods in ProteoStats 

The ProteoStats library supports the following formulations on FDR estimation:

(1) Concatenated target decoy search based FDR [4], Concatenated FDR (FDRC)

This method was proposed by Elias and Gygi [4], although the basic concept of reverse database searching was based on Peng et al. [5] The database consists of combined set of proteins from target and their reversed sequences as decoy. In general, the ratio of false positives to true positives is estimated to be the FDR at the given threshold. The concatenated search is the most popular method of FDR estimation. The assumption is that for every decoy passing the threshold, there must be a corresponding false hit in target. Thus, the false positives are estimated by doubling the decoy count. The formula is-

(2) Simple/Separate target-decoy based FDR by Kall et. al [3], Separate/Simple FDR (FDRS)

This method was proposed by Kall et al as an alternative to combined search strategy. Kall et al suggested that combining the target and decoy databases overestimates the FDR and the decoy distribution no longer matches the target incorrect population. In this method, target and decoy searches are conducted separately. Each spectrum has one best target PSM and one best decoy PSM. Since the two searches are independent, it is assumed that the number of false positives amongst the targets is same as the number of decoys passing the threshold. It is also referred to as the simple FDR method by the authors. False positives are estimated by the count of decoys above the threshold. 

(3) Percentage of Incorrect target (PIT) correction to the above simple [3], FDR with PIT (FDRPIT)

Simple FDR described above assumes the size of true and false hits to be same or similar which is not an entirely correct assumption. Though all decoys contribute fully to the null distribution, all targets are not correct. Target distribution is a bimodal distribution representing the true and false target hits. This causes a non-balanced ratio of the true and false hits leading to overestimation of FDR. If a correction factor, Percentage of Incorrect Targets (PIT), is introduced in the equation 2 above, the FDR estimation is more accurate and increases sensitivity by allowing more correct hits at the same FDR. PIT is traditionally known as π0 or true negatives. This formulation depends on the percentage of incorrect target hits (PIT) that contribute to the negative (false) distribution as a weight in the above method to prevent overestimation of FDR and thus enhances the number of identifications. It should be noted that the name PIT is misleading since it is not a percentage but a fraction. 

(4) Refined formulation on Kall’s method [6], Refined Separate FDR (FDRRS)

In this formulation, the authors propose for calculating the FDR in the correct reference population by not doubling the estimated false targets by observing decoy hits above threshold directly. All decoys above threshold should not be taken as false positives unless they score more than target PSMs for the given spectrum. This simplification leads to inflation of decoy population and thus overestimation of FDR. The hits could be above threshold only in target (target only) or only in decoy (decoy only). When both are above threshold, either target could be better (target better) or the decoy (decoy better). Only decoy better are considered to be false positives and their count is doubled as suggested in Elias and Gygi method. The FDR in the correct reference population is then calculated by estimating the false positive PSM by subtracting from this population (decoy better + target better + target only), and dividing the result by the same number. The formula thus becomes- 

(5) Refined method on Elias et al method [7], FDRRC 

In this formulation, the authors argue that since decoy hits are obviously wrong, they can be disregarded in FDR estimation and thus the FDRRC formula is changed to yield the following formula-

ProteoStats supports these two methods and some of their alternate formulae described above, which provide better results. The newer formulae seem to work better but not much in practice due to code availability. It also follows that proteomics community has not settled for a single formula and these differ in basic concept. A primary testing of FDR method suited to a lab protocol is thus recommended. The five methods are explained in the paper. The details for installation, dependencies and developer documentation can be found in Documentation downloaded with the code. 

Quick Guide, Examples, Data & Results files 

Example Scripts Many example scripts are provided in the tests folder.


The files are provided in the DATA folder. Every file name is self-explanatory with a tag for target or decoy and algorithm name. For every algorithm, file names are like this-



The example result files are provided in the OUTPUT folder.

File Parsers

Some inbuilt functions can be used to parse XML or native search engine files into tabular text (CSV/TSV) formats. For example X!Tandem XML, pepXML, Mascot dat etc.

FDR estimation

The user has two files per search after database search - one Target and one corresponding
Decoy file. Choose the script corresponding to the search algorithm used for database searching.

use MODULES::PSMFileIO; #file reading & parsing
use MODULES::PepXMLParser; #parsing pepXML files

#parse tandem XML file
my $tandemCSV = ReadTandemXML($tar); # Tandem target/decoy XML file

#parse Sequest XLS file from Thermo ProteomeDiscoverer
my $SequestCSV = ReadSequestXLS($tar); # Tandem target/decoy XML file

#parse pepXML file to TSV XLS file
my $pepXML2tsv = ReadPepXML($tar); # pepXML to TSV file
#parse Mascot dat file, convert to pepXML and parse the PepXML file
system("Mascot2XML $tar -notgz -nodta -D$tardb"); # convert to pepXML using  Mascot2XML from TPP

my $pepXML2tsv = ReadPepXML($tarpepXML); # pepXML to TSV file

Edit any one of the script to enter the file names in @Tars, @Decs and @Outs arrays. Run the following command perl and press enter to calculate FDR for input files. These scripts have a complete pipeline. 

Use Modules Examples

Initially, the program requires the use pragmas for calling specific modules from library. If you are writing your code in tests folder, you need to specify the base directory (ProteoStats, so that it can find MODULES folder) by use lib ‘../’ pragma.

use strict;
use warnings;
use lib '../'; #to define base directory
use MODULES::FDRCalculator; #main module that controls calling others
use MODULES::PSMFileIO; # Handles input file reading and Output writing
use MODULES::PepXMLParser; # Handles pepXML parsing
use MODULES::SeparateFDR; # Calculates FDRS
use MODULES::ConcatenatedFDR; # Calculates FDRC
use MODULES::RefinedConcatenatedFDR; Calculates FDRRC
use MODULES::RefinedSeparateFDR; Calculates FDRRS
use MODULES::ChartMaker; # Creates ScatterPlot, Histogram, ROC curves
use MODULES:: ComparisonVenn; # Creates Venn, Compares PSMs/Peptides

Apart from the FDR and q-value calculations, ProteoStats can also be used for following tasks-

(1) Generating ROC curves

ROC curves can be plotted by the function ROCfromFDRFile.

use MODULES::ChartMaker; #
#define the parameters/arguments
my @FDRfiles = (‘FDR1_mascot.csv’,
                ‘FDR3_omssa.csv’)#input files
my @seriesname = (‘Mascot’,‘MassWiz,‘OMSSA’)#corresponding seriesname
my $out = ‘MyROC.xlsx’ ; # Excel Output
my $FDRthr = 0.05; # FDR cutoff to show in ROC
my $qvalcol = -1; #column number for q-value. -1 means last column

#Call the ROC function with the arguments


(2) Generating Scatter Plots

Scatterplots can be generated using the function ScatterPlot_FDRFile.

use MODULES::ChartMaker; #
#define the parameters/arguments
my @FDRfiles = (‘FDR1_masswiz.csv’,
                ‘FDR3_masswiz.csv’); #input files
my $xcol = 5;
my $xcolname = Mass’;
my $ycol = 7; #last column
my $ycolname = ‘Score’;
my $out = ‘MyScatter.xlsx’ ; # Excel Output

#Call the ScatterPlot function with the arguments
$out= ScatterPlot_FDRFile(\@FDRfiles,$xcol,$xcolname,$ycol,$ycolname,$out);


(3) Generating Histograms

Histograms can be generated using the function HistogramsfromFile function.

use MODULES::ChartMaker;

#define the input/arguments
my @files = (‘FDR_MassWiz1.csv’, FDR_ MassWiz2.csv’, FDR_ MassWiz3.csv’); #FDR Files
my $ use_linear_axes = 1;
my $use_integral_bins = 0;

HistogramsfromFile(@files, @columns, $bins)

(4) Comparing FDR outputs and Creating Venn Diagrams 

This is a handy utility that can compare two or three FDR files. The FDR files need to contain only the filtered PSMs , else it will compare everything without any cutoff of score/p-value/e-value.

Start by defining the input files and Venn legends. Define 2 files for two set comparison and 3 for a three way comparison. Also define the column numbers for the FDR file for scans, peptide and protein columns.

use MODULES::ComparisonVenn;

#define the input/arguments
my $FDR1 = ‘FDR_MassWiz1.csv’; #FDR File 1
my $FDR2 = ‘FDR_ MassWiz2.csv’; #FDR File 2
my $FDR3 = ‘FDR_ MassWiz3.csv’; #FDR File 3
my $out = ‘Comparison_MassWiz_3_replicates.csv’; # OUTPUT File

# Define the labels for Venn
my $legend1 = ‘MassWiz_rep1’;
my $legend2 = ‘MassWiz_rep2’;
my $legend3 = ‘MassWiz_rep3’;

#Define column numbers for ScanID, peptide and Proteins
my $scancol = 0; #scan column
my $pepcol = 3; #peptide column

my $protcol = 7; #protein column

Read the files and get the data structures as hash references for easy comparison. Dereference the hashes.

#Get Hashrefs for Spectra and Peptides for comparison
my($SpecRef1,$PepRef1) = ReadSpecPep($FDR1,$scancol,$pepcol,$protcol);
my($SpecRef2,$PepRef2) = ReadSpecPep($FDR2,$scancol,$pepcol,$protcol);
my($SpecRef3,$PepRef3) = ReadSpecPep($FDR3,$scancol,$pepcol,$protcol);

#Dereference the hashrefs
my %SpecFDR1 = %$SpecRef1;
my %SpecFDR2 = %$SpecRef2;
my %SpecFDR3 = %$SpecRef3;
my %PepFDR1 = %$PepRef1;
my %PepFDR2 = %$PepRef2;

my %PepFDR3 = %$PepRef3;

Create Venn diagrams for 2 or 3 sets as per requirement. Define a title for chart and call the function CreateVenn2 or CreateVenn3 as shown in example below.

##Create Venn Charts for 3 or 2 sets passing spectra comparison
# Venn 3 sets ScanIDs
my $title1 = 'Spectra Comparison'; #Chart title
my $VennSpec3 = CreateVenn3 ([keys%SpecFDR1], [keys%SpecFDR2], [keys%SpecFDR3], $legend1, $legend2, $legend3, $title1, "$out.spectra.png" ) ;
print "Spectra venn created in file $VennSpec3\n";

#or try Venn 2 sets ScanIDs
my $VennSpec2=CreateVenn2 ( [ keys%SpecFDR1 ] , [ keys%SpecFDR2 ] , $legend1,$legend2,$title1,"$out.spectra.png") ;
print "Spectra venn created in file $VennSpec2\n";

##Create Venn Charts for 3 or 2 sets for peptide comparison
# Venn 3 sets peptides
my $title2 = 'Peptides Comparison';
my $VennPep3 = CreateVenn3 ([ keys%PepFDR1], [keys%PepFDR2], [keys%PepFDR3], $legend1, $legend2, $legend3, $title2, "$out.peptide.png" ) ;

print "Peptide venn created in file $VennPep3\n";

Finally, the comparisons can also be made and results written to a CSV file. The function CompareFDR3 or CompareFDR2 can be used with the hash references. The CSV file $out (defined in the input at the beginning of this code) contains the output of the comparisons. 

Reference List 

[1] Yadav, A. K.; Kadimi, P. K.; Kumar, D.; Dash, D. ProteoStats--a library for estimating false discovery rates in proteomics pipelines. Bioinformatics 2013, 29 (21), 2799-2800. 

[2] Reiter, L.; Claassen, M.; Schrimpf, S. P.; Jovanovic, M.; Schmidt, A.; Buhmann, J. M.; Hengartner, M. O.; Aebersold, R. Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Mol. Cell Proteomics 2009, 8 (11), 2405-2417. 

[3] Kall, L.; Storey, J. D.; MacCoss, M. J.; Noble, W. S. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J. Proteome. Res. 2008, 7 (1), 29-34. 

[4] Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4 (3), 207-214. 

[5] Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J. Proteome. Res. 2003, 2 (1), 43-50. 

[6] Navarro, P.; Vazquez, J. A refined method to calculate false discovery rates for peptide identification using decoy databases. J. Proteome. Res. 2009, 8 (4), 1792-1796. 

[7] Cerqueira, F. R.; Graber, A.; Schwikowski, B.; Baumgartner, C. MUDE: a new approach for optimizing sensitivity in the target-decoy search strategy for large-scale peptide/protein identification. J. Proteome Res. 2010, 9 (5), 2265-2277.