Monday, 28 October 2013

One step ahead in Bioinformatics using Package Repositories

About a year ago I published a post about in-house tools in research and how using this type of software may end up undermining the quality of a manuscript and the reproducibility of its results. While I can certainly relate to someone reluctant to release nasty code (i.e. not commented, not well-tested, not documented), I still think we must provide (as supporting information) all “in-house” tools that have been used to reach a result we intend to publish. This applies especially to manuscripts dealing with software packages, tools, etc. I am willing to cut some slack to journals such as Analytical Chemistry or Molecular & Cellular Proteomics, whose editorial staffs are (and rightly so) more concerned about quality issues involving raw data and experimental reproducibility, but when it comes to Bioinformatics, BMC Bioinformatics, several members of the Nature family and others at the forefront of bioinformatics, methinks we should hold them to a higher standard. Some of these journals would greatly benefit from implementing a review system from the point of view of software production, moving bioinformatics, and science in general, one step forward in terms of reproducibility and software reusability. What do you think would happen if the following were checked during peer review?

Friday, 25 October 2013

Little Book of R for Bioinformatics by Avril Coghlan

Introduction to bioinformatics, with a focus on genome analysis, using the R statistics software. By Avril Coghlan (Wellcome Trust Sanger Institute, Cambridge, UK).

To encourage research into neglected tropical diseases such as leprosy, Chagas disease, trachoma, schistosomiasis etc., most of the examples in this booklet are for analysis of the genomes of the organisms that cause these diseases.

The author has also written a booklet on using R for biomedical statistics, http://a-little-book-of-r-for-biomedical-statistics.readthedocs.org/, and a booklet on using R for time series analysis, http://a-little-book-of-r-for-time-series.readthedocs.org/.

Thursday, 24 October 2013

Creating an Open Source Revolution in Computational Proteomics

First of all, I don’t want this post to be a discussion of Open Source itself, its strengths & weaknesses. This post is about the most useful open-source packages, frameworks and libraries in the field of computational proteomics (a short version of our manuscript “Open source libraries and frameworks for Mass Spectrometry based Proteomics: A developer’s perspective”).


Schema of the possible computational processing steps of a proteomics data set.

In proteomics, as in the other omics, bioinformatics efforts can be divided into three major fields: data processing, storage and visualization. From MS/MS preprocessing to the post-processing of identification results, and even though the objectives of these libraries and packages can vary significantly, they usually share a number of features. Common use cases include the handling of protein and peptide sequences, the parsing of output files from various proteomics search engines, and the visualization of MS-related information (including mass spectra and chromatograms).
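To make the first of these use cases concrete, here is a minimal sketch of peptide-sequence handling. It is not taken from any of the reviewed libraries; it just combines the standard monoisotopic residue masses with a few lines of Python:

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # H2O added at the peptide termini

def peptide_mass(sequence):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in sequence.upper()) + WATER

print(round(peptide_mass("PEPTIDE"), 2))  # ≈ 799.36
```

Real libraries of course go much further (modifications, isotopic distributions, charge states), but this is the kind of building block they all share.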

Tuesday, 22 October 2013

Some Reasons to Rename my Blog as BioCode's Notes


Dear Readers:

I’ve decided that it would be prudent, exposure-wise, to change the name of my professional blog to BioCode's Notes, for a number of reasons:

1. People into bioinformatics comprise a significant part of my (alas, still small) readership. They tend to be always hungry for code tips, language comparisons, and other things that do not fit neatly under the umbrella of “computational proteomics”.

2. My own work is straying more and more from computational proteomics per se into other problems linking biology (Proteomics, Genomics, Life Sciences) with programming (R, Java, Perl, C++). Biocoding is now my bread-and-butter…

3. I need a shorter, catchier name that is easy to use in coffee talks, presentations, or when sharing links with friends.

4. I also decided to add a blog mascot, our T-rex:
              Truth => Science is about truth.
              Tea => UK science.
              STaTisTics => OK, this one’s got as many ‘S’s as ‘T’s, but the latter is more frequent in English.
              T-rex => The future belongs to Big Data, which we’ll use (and are already using) to trace back the march of evolution to our preferred species, including the dinosaurs. And last, but not least, this is Abel’s (my son) favorite animal.

Hope you enjoy this idea!
Yasset

Saturday, 19 October 2013

Which are the best programming languages for a bioinformatician?

This is a basic question when you (as a programmer, biologist or mass spectrometrist) start a career in bioinformatics. What is your favorite programming language in bioinformatics? This poll will give you a short picture of which languages are mandatory in computational proteomics & bioinformatics. Which languages would you recommend to a student wishing to enter the world of bioinformatics? We can use this post to comment on the strengths and weaknesses of each language.


Some polls and discussion about this topic can be found at:

Thursday, 17 October 2013

Introduction to Feature selection for bioinformaticians using R, correlation matrix filters, PCA & backward selection

Bioinformatics is becoming more and more a Data Mining field. Every passing day, Genomics and Proteomics yield bucketloads of multivariate data (genes, proteins, DNA, identified peptides, structures), and every one of these biological data units is described by a number of features: length, physicochemical properties, scores, etc. Careful consideration of which features to select when trying to reduce the dimensionality of a specific dataset is, therefore, critical if one wishes to analyze and understand their impact on a model, or to identify what attributes produce a specific biological effect.

Consider, for instance, a predictive model C1·A1 + C2·A2 + C3·A3 + … + Cn·An = S, where the Ci are constants, the Ai are features or attributes, and S is the predicted output (retention time, toxicity, score, etc.). It is essential to identify which of those features (A1, A2, … An) are most relevant to the model and to understand how they correlate with S: working with such a subset will enable the researcher to discard a lot of irrelevant and redundant information.
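As a toy illustration of the correlation-matrix filter from the title: the post itself works in R, but the same idea can be sketched with NumPy, loosely modelled on caret’s findCorrelation (the 0.75 cut-off is an arbitrary choice):

```python
import numpy as np

def correlation_filter(X, threshold=0.75):
    """Indices of columns to keep after dropping one feature from
    every highly correlated pair (|r| > threshold).  Of each pair,
    the feature with the larger mean absolute correlation to all
    other features is the one dropped."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    n = corr.shape[0]
    drop = set()
    for i in range(n):
        for j in range(i + 1, n):
            if i in drop or j in drop:
                continue
            if corr[i, j] > threshold:
                drop.add(i if corr[i].mean() > corr[j].mean() else j)
    return [k for k in range(n) if k not in drop]

# Toy data: feature 1 is a noisy copy of feature 0.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a,
                     a + 0.01 * rng.normal(size=200),
                     rng.normal(size=200)])
print(correlation_filter(X))  # one of the duplicated features is dropped
```

PCA and backward selection would then operate on the surviving columns, each with its own trade-offs (PCA loses the original feature meaning; backward selection is slower but keeps interpretable attributes).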


Monday, 14 October 2013

What is your tool for peptide/protein identification?

In 2012 we published a poll about the most-used software for peptide/protein identification in proteomics in the Computational Proteomics LinkedIn group. I decided to reproduce the poll here because LinkedIn removed this option, and here I also have the opportunity to add more software to the list.

Wednesday, 9 October 2013

My List of Most Influential Authors in Computational Proteomics (according to Articles References, Google Scholar, twitter, Linkedin, Microsoft Academic Search and ResearchGate)

Young researchers starting their careers will often look for reviews, opinions and research manuscripts from the most influential authors of their chosen field. In science, however, unlike many other topics on the Internet, ranked lists or manuscript repositories of top authors sorted by research topic are hard to come by. For some researchers, the idea of such a task brings the words ‘wasted time’ to mind; the most critical condemn it as a frivolous pursuit. Maybe so. In my opinion, however, it is an excellent starting point.

Home page of ResearchGate, with more than 3 million users

These days, more people than ever are involved in science and research. Just look at ResearchGate’s homepage: there are over 3 million people there, and we’re only counting ResearchGate users. Once simple undertakings, such as finding the right manuscript to cite, the most authoritative group on a topic, or the best software application for a specific task, have become increasingly difficult for graduate students navigating this ocean of data, despite the availability of services such as Google Scholar or PubMed. The situation will only worsen in the future, as is easy to see by simply tallying the number of published papers in the fields of Proteomics, Genomics, Bioinformatics and Computational Proteomics since 1997:

Number of published manuscripts in PubMed per year (1997-2012). The statistics were compiled using the Medline Trend service, http://dan.corlan.net/medline-trend.html

In 2012 alone, over 6,000 and 17,000 manuscripts were published in the fields of proteomics and bioinformatics, respectively. Our young field, computational proteomics, published more than four hundred papers. Perhaps well-established PIs or group leaders can easily tell apart derivative or me-too contributions from groundbreaking work, but young scientists, who spend most of their time implementing someone else’s ideas, can certainly have a hard time doing so. Although technology has come to the rescue with today’s mixture of search engines and social networking tools (ResearchGate, Google Scholar, Twitter and LinkedIn among them), the best way to harness its power is, precisely, by starting from a ranked list of the most authoritative voices within a field of research, whose whereabouts can then be traced in the scientific literature, the blogosphere, and anywhere else.
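Per-year tallies like the ones above can also be reproduced programmatically. The figure was made with the Medline Trend service; here is a small alternative sketch against NCBI’s public E-utilities esearch endpoint (the helper names are my own, and fetching of course requires network access):

```python
import re
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_count_url(term, year):
    """URL asking PubMed's esearch endpoint for just the record
    count of `term`, restricted to one publication year."""
    query = f"{term} AND {year}[pdat]"
    return EUTILS + "?" + urlencode({"db": "pubmed",
                                     "term": query,
                                     "rettype": "count"})

def pubmed_count(term, year):
    """Fetch and parse the count (needs network access)."""
    with urlopen(pubmed_count_url(term, year)) as resp:
        xml = resp.read().decode()
    return int(re.search(r"<Count>(\d+)</Count>", xml).group(1))

print(pubmed_count_url("proteomics", 2012))
```

Looping `pubmed_count` over 1997-2012 for each field gives exactly the kind of growth curve shown in the figure.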


Tuesday, 1 October 2013

Celebrating Ten Years of Mann and Aebersold’s “Mass spectrometry-based proteomics” review.

In 2003 Mann & Aebersold reviewed on the pages of Nature the challenges and perspectives of the then-nascent field of MS-based proteomics. Mass spectrometry (MS) has since entrenched itself as the method of choice for analyzing complex protein samples, and MS-based proteomics has become an indispensable technology for interpreting genomic data and performing protein analyses (primary sequence, post-translational modifications (PTMs) or protein–protein interactions).

" The ability of mass spectrometry to identify and, increasingly, to precisely quantify thousands of proteins from complex samples can be expected to impact broadly on biology and medicine."
The manuscript by Mann & Aebersold is one of the most cited in the field of MS-based proteomics. For this reason it is one of the “core papers” in proteomics and computational proteomics, outlining most of the basic concepts required to understand the fundamentals of the discipline.

Ten years after its publication, the main workflow described in the manuscript has not changed dramatically. The major advances in this period are related to the development of Thermo’s Orbitrap mass spectrometers (Velos, LTQ, Exactive, etc.) and new fragmentation types (ETD, HCD). Separation techniques (electrophoretic and chromatographic) were explored extensively during these ten years. Aebersold’s group pioneered in 2005 the use of OFFGEL electrophoresis for peptide-level separation (Heller 2005), and Mann’s group developed the FASP method for sample preparation before protein digestion (Wiśniewski JR et al. 2009); both have contributed significantly to the dramatic increase in the number of identified proteins characterizing today’s proteomic projects. Surprisingly, the development of electrophoretic methods in the last 3 years looks like a passed-over topic. In ten years we moved from identifying at most 500 species in complex samples to identifying 60% of the human proteome.