Open Critique

Over the last few months, an open debate has been unfolding between the developers of two RNA-seq alignment and quantification methods.  On one side is Lior Pachter’s lab with their kallisto tool (Near-optimal probabilistic RNA-Seq quantification, by Nicolas Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Nature Biotechnology 34 (2016), 525–527), and on the other is Carl Kingsford’s lab with the Salmon tool (Salmon provides fast and bias-aware quantification of transcript expression, by Rob Patro, Geet Duggal, Michael I. Love, Rafael A. Irizarry and Carl Kingsford, Nature Methods 14 (2017), 417–419).

Our mission has always been open, transparent research.  It is the purest way to move knowledge forward.  A key benefit of open science and software development is that there is a permanent record of updates to code and data, which allows analyses to be reproduced; indeed, much of this debate continually references past versions of software.  However, I’m not going to wade into the argument itself, other than to present some scientifically interesting findings that have surfaced throughout the debate.

We use a wide variety of tools to analyze data, and it is a near-certainty that different tools run on the same data will produce slightly different results.  This holds true for genomics (SOAP / Ray / Velvet), transcriptomics (StringTie / BWA / TopHat / STAR), proteomics (SEQUEST / Tandem / Mascot) and even metabolomics (XCMS / MAVEN), just to name a few.  Further, we often apply two or more tools with different core methodologies to the same analysis simply to examine the overlap between them and so discern significant findings.  The idea is that if more than one tool shows similar results, we feel it verifies the finding.  Thus, when repeated studies from different labs (Boj et al. 2015, Beaulieu-Jones and Greene 2017, and Zhang et al. 2017) showed that Salmon and kallisto produce nearly identical findings, the authors of kallisto rightfully raised an eyebrow.  Again, we’re not getting into this debate, only highlighting five simple points for our readers’ edification:

  • Transcripts Per Million (TPM) vs. counts, as inputs for calculating differential expression after read alignment.  TPM is very similar to RPKM or FPKM, but the order of the mathematical operations is switched.  First, divide the read counts by the length of each gene in kilobases to get reads per kilobase (RPK).  Then, calculate the scaling factor by summing all the RPK values in each sample and dividing by 1 million.  Finally, divide each RPK value by the scaling factor to give TPM.  In effect, this normalizes for gene length first and sequencing depth second, so that the TPMs in each sample sum to the same total, making comparisons across samples in an experiment easier.  It does not normalize across experiments.  A count is simply the number of reads that overlap a gene.  There is no normalization; it is the RNA-seq data in its purest form.  How we handle these data has important implications downstream.  There are arguments for either approach, and even more permutations of these two methods that may work better for your experimental aims.  Knowing the research aims, i.e., the question the data are trying to answer, dictates how these data should be handled, normalized or not.  Thus, one should know the output of the alignment tool and the input requirements of the quantification tool being used.  Each has its own flavor, and it is not something that should be assumed without question.
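The TPM arithmetic described above can be sketched in a few lines of numpy; the counts and gene lengths here are made-up illustrative values, not from either paper.

```python
import numpy as np

# Hypothetical data: raw read counts for four genes, and their lengths
counts = np.array([100.0, 500.0, 250.0, 1000.0])
lengths_kb = np.array([2000, 4000, 1000, 5000]) / 1000.0  # lengths in kilobases

# Step 1: divide counts by gene length in kilobases -> reads per kilobase (RPK)
rpk = counts / lengths_kb

# Step 2: per-sample scaling factor = sum of all RPK values, divided by 1 million
scaling_factor = rpk.sum() / 1e6

# Step 3: divide each RPK value by the scaling factor -> TPM
tpm = rpk / scaling_factor

# By construction, the TPMs in every sample sum to 1 million (up to
# floating point), which is what makes cross-sample comparison easier.
print(tpm)
```

Reversing steps 1 and 2 (depth first, length second) recovers the RPKM/FPKM ordering, whose per-sample sums are not guaranteed to match.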

  • GC-bias correction:  this is a systematic error that, while more prevalent in the past, is still very much an issue for all downstream analyses of transcript and genomic data.  You must correct for it in two very important and common situations: 1) when you want to compare samples from different laboratories, or samples run at different times or on different instrumentation in the same lab; and 2) when the library preparation method was unable to amplify fragments with very low or very high GC content.  This is sometimes very sample dependent, as with some fungi or bacteria, but it varies even across different genes within a species, like the sodium transport pumps in humans.  There are burgeoning library preparation technologies that will help with these specific samples, but the importance of systematic bias and its subsequent correction cannot be overstated.
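This is not the bias model Salmon actually fits; the following is only a minimal numpy sketch of the underlying idea, on simulated data: estimate how observed coverage deviates from expected coverage as a function of GC content, then divide that bias factor out.

```python
import numpy as np

# Simulated fragments: GC fraction per fragment, with a uniform expected
# coverage of 100 that the "instrument" fails to deliver at GC extremes.
rng = np.random.default_rng(0)
gc = rng.uniform(0.2, 0.8, size=1000)           # GC fraction per fragment
expected = np.full_like(gc, 100.0)              # what unbiased coverage would be
observed = expected * (1.0 - 2.0 * (gc - 0.5) ** 2) + rng.normal(0, 2, gc.size)

# Estimate a bias factor per GC bin: mean observed / mean expected coverage
bins = np.linspace(0.2, 0.8, 13)
idx = np.clip(np.digitize(gc, bins) - 1, 0, len(bins) - 2)
bias = np.array([observed[idx == b].mean() / expected[idx == b].mean()
                 for b in range(len(bins) - 1)])

# Corrected coverage: divide each observation by its bin's bias factor,
# flattening the GC-dependent dip back toward the expected level.
corrected = observed / bias[idx]
```

Real correction methods fit smooth models rather than crude bins, but the logic (learn the GC-dependent distortion, then remove it) is the same.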
  • Correlation between methods:  correlation is often used not only to compare different methodologies, but to compare different experiments or, even more ambitiously, to compare biological findings across diverse experimental designs and technologies, e.g., proteomics and transcriptomics.  The commonly held view is that the higher the correlation, the more valid the results.  But the statistical truth is that these results, and the methods used to calculate them, only provide one perspective on the data.

To expand on this, two main measures of correlation are used in this debate: Pearson and Spearman.  The primary difference between them is that Spearman ranks the data points and uses the ranks in its calculation, rather than the raw expression values used by Pearson.  Right away, one can see a glaring difference: outliers will strongly influence the Pearson correlation, whereas they have far less influence on Spearman.  Additionally, in their truest form, Pearson should be used when one is measuring linearity, while Spearman is most appropriate for monotonicity.  Monotonicity refers to directional dependence, consistently increasing or decreasing, while linearity takes monotonicity one step further and accounts for how large those increases and decreases are.
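A toy example makes the distinction concrete; the expression values below are invented purely to show the effect of a single outlier.

```python
import numpy as np

# Made-up expression values: perfectly monotone, but the last point is an
# extreme outlier, so the relationship is far from linear.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 500.0])

# Pearson operates on the raw values
pearson = np.corrcoef(x, y)[0, 1]

# Spearman is Pearson applied to the ranks; this double-argsort trick
# assumes no ties (scipy.stats.spearmanr handles ties properly)
def ranks(a):
    return a.argsort().argsort()

spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]

# The outlier drags Pearson well below 1, while Spearman sees a perfect
# monotone relationship and reports exactly 1.
print(f"Pearson = {pearson:.2f}, Spearman = {spearman:.2f}")
```

The same data can therefore look mediocre by one measure and perfect by the other, which is why a reported correlation means little without knowing which coefficient was used.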

  • MA vs. scatter plots are probably the most common plots used in transcriptomics differential expression.  In this debate, both sides used them, as did the other studies referenced, for good reason: these plots convey a lot of information.  Still, improvements can be made.  First, adding +/- lines around zero in an MA plot highlights fold-change thresholds.  Second, zooming in on a scatter plot can hide outliers and should be avoided, even at the cost of compressing the cluster of similar points around the origin.  To correct these visual biases, an alpha transparency factor was used along with an interactive plot; these are very effective for visualizing the properties of the data.  We would also suggest a hexbin plot or a 2-D kernel density estimate, both of which bring density into the visualization.
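For readers unfamiliar with the quantities behind an MA plot, here is a short sketch on simulated abundances from two hypothetical tools: A is the average log2 abundance, M is the log2 ratio, and the +/- 1 threshold lines mentioned above mark a two-fold difference.

```python
import numpy as np

# Hypothetical abundance estimates for the same transcripts from two tools;
# tool_b agrees with tool_a up to multiplicative noise.
rng = np.random.default_rng(1)
tool_a = rng.lognormal(mean=5.0, sigma=1.0, size=5000)
tool_b = tool_a * rng.lognormal(mean=0.0, sigma=0.5, size=5000)

# MA-plot coordinates: A = mean log2 abundance, M = log2 fold change
a_vals = 0.5 * (np.log2(tool_a) + np.log2(tool_b))
m_vals = np.log2(tool_b) - np.log2(tool_a)

# Points beyond M = +/- 1 differ by more than two-fold between the tools;
# these are the discordant calls worth inspecting rather than hiding.
discordant = np.abs(m_vals) > 1.0
print(f"{discordant.sum()} of {m_vals.size} transcripts differ by more than two-fold")
```

With matplotlib, `plt.scatter(a_vals, m_vals, alpha=0.1)` adds the transparency discussed above, and `plt.hexbin(a_vals, m_vals)` gives the density view; the data here are simulated, so the discordant count reflects only the noise level chosen in this sketch.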

  • Instrument peculiarities vs. enhancements / improvements:  this is definitely one of those knowledge structures that is additive.  Having historical perspective, and anticipating the direction a technology must go to improve, are intangible skills that are hard to quantify, but we all acknowledge they can make the difference between answering your research aim or not.  Why use a technology that will not, and cannot, despite best efforts, answer the question you’re asking?  Seems simple enough.  Now apply the same reasoning to data analysis and bioinformatics.  The question of why is the most important of all.