Nifty Visualization Tool

One nifty visualization tool is the result of a three-day hackathon over at Brandeis University.  The open publication and the source-code repository are linked here and here, respectively.

Formally called an ideogram, this tool draws histograms of expression (mRNA, tRNA, ncRNA and others) along the coordinates of a chromosome.  You can visualize the entire set of chromosomes or filter down and focus on a single chromosome.  The latter would be pretty cool for looking at insertions / deletions on a more global scale, showing the more deleterious gross abnormalities.  This tool also helps provide perspective on another question: have you ever wondered whether there are any biases in sequencing reads over parts of a chromosome, or even across all chromosomes?  This tool can easily show that in the data.  Beyond that, there is a host of comparisons between samples or experiments one could develop with this tool.

Currently, it takes an SRA accession, gets read counts directly from NCBI, grabs the proper alignment coordinates (it only handles the human GRCh37 and GRCh38 assemblies), normalizes the reads to TPM (it also provides raw counts), and converts the results to .json format, which can then be imported into Ideogram.js for your viewing pleasure.
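To make the last step of that pipeline concrete, here is a minimal sketch of shaping per-gene expression values into chromosome-grouped annotation JSON of the kind Ideogram.js consumes. This is not the hackathon tool's actual code; the key names (`keys`, `annots`, `chr`) are assumptions based on Ideogram.js's annotation format, and the example gene record is purely illustrative.

```python
import json
from collections import defaultdict

def to_ideogram_json(records):
    """records: iterable of (gene, chrom, start, length, tpm) tuples."""
    # Group gene-level annotations by chromosome, as Ideogram.js expects.
    by_chrom = defaultdict(list)
    for gene, chrom, start, length, tpm in records:
        by_chrom[chrom].append([gene, start, length, round(tpm, 2)])
    return json.dumps({
        "keys": ["name", "start", "length", "tpm"],
        "annots": [{"chr": c, "annots": a} for c, a in sorted(by_chrom.items())],
    })
```

Each inner list is one gene's track entry; the `tpm` field is what drives the histogram height along the chromosome.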

Obviously, this would be seriously cool for other organisms (I'm thinking of some of our plant projects right now), but since it's open source, anyone could update or improve upon it.

Thanks @ RNA-seq Viewer Team at the NCBI-assisted Boston Genomics Hackathon

Open Critique

Over the last few months, an open debate has been unfolding between two research groups that develop RNA-seq alignment and quantification methods: Lior Pachter’s lab, authors of the kallisto tool (Near-optimal probabilistic RNA-Seq quantification by Nicolas Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Nature Biotechnology 34 (2016), 525–527), and Carl Kingsford’s lab, authors of the Salmon tool (Salmon provides fast and bias-aware quantification of transcript expression by Rob Patro, Geet Duggal, Michael I. Love, Rafael A. Irizarry and Carl Kingsford, Nature Methods 14 (2017), 417–419).

Our mission has always been open, transparent research.  It is the purest way to move knowledge forward.  A key benefit of open science and software development is that there is a permanent record of updates to code and data with which to reproduce analyses, and much of this debate continually references past versions of software.  However, I’m not going to get into the argument itself other than to present some scientifically interesting findings that have come out of the debate.

We use a diverse variety of tools to analyze data, and it is all but guaranteed that different tools run on the same data will produce slightly different results.  This holds true for genomics (SOAP / Ray / Velvet), transcriptomics (StringTie / BWA / TopHat / STAR), proteomics (SEQUEST / Tandem / Mascot) and even metabolomics (XCMS / MAVEN), just to name a few.  Further, we often run two or more tools with different core methodologies on the same analysis simply to examine the overlap between the methods and discern the significant findings.  The idea is that if more than one tool shows similar results, we feel it verifies the finding.  Thus, when repeated studies from various labs (Boj et al. 2015, Beaulieu-Jones and Greene 2017, and Zhang et al. 2017) have shown that Salmon and kallisto produce almost identical findings, the authors of kallisto rightfully raised an eyebrow.  Again, we’re not getting into this debate, only highlighting five simple points for our readers’ edification:

  • Transcripts Per Million (TPM) vs. counts, as inputs for calculating differential expression after read alignment.  TPM is very similar to RPKM or FPKM, but the order of the mathematical operations is switched.  One first divides the read counts by the length of each gene in kilobases, giving reads per kilobase (RPK).  Then one calculates a scaling factor by summing all the RPK values in a sample and dividing by one million.  Finally, dividing each RPK value by the scaling factor gives TPM.  In effect, this normalizes for gene length first and sequencing depth second, so the TPMs in each sample sum to the same total, making comparisons easier across samples in an experiment.  It does not normalize across experiments.  A count is simply the number of reads that overlap a gene.  There is no normalization; it is the RNA-seq data in its purest form.  How we handle these data has important implications downstream.  There are arguments for the use of either, and even more permutations of these two methods that may work better for your experimental aims.  Knowing the research aim, i.e., the question the data are trying to answer, dictates how these data are handled, normalized or not.  And thus, one should know the output of the alignment tool and the input requirements of the quantification tool you are using.  Each has its own flavor, and it is not something that should be assumed without question.
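The three TPM steps described above fit in a few lines. This is a generic sketch (not any particular tool's implementation), assuming you already have raw counts and gene lengths in kilobases:

```python
def tpm(counts, lengths_kb):
    """Raw read counts -> TPM, given gene lengths in kilobases."""
    # 1) length-normalize: reads per kilobase (RPK)
    rpk = [c / l for c, l in zip(counts, lengths_kb)]
    # 2) per-sample scaling factor: total RPK divided by one million
    scale = sum(rpk) / 1e6
    # 3) depth-normalize: each RPK over the scaling factor
    return [r / scale for r in rpk]
```

Because of step 2, the TPM values in every sample sum to exactly one million, which is what makes within-experiment comparisons straightforward.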

  • GC-bias correction: this is a systematic error that, while more prevalent in the past, is still very much an issue for all downstream analyses of transcript and genomic data.  You must correct for it in two very important and common analyses: 1) when you want to compare samples from different laboratories, or samples run at different times or on different instruments in the same lab; and 2) when the library preparation method was unable to amplify fragments with very low or very high GC content.  This is sometimes very sample dependent, as in some fungi or bacteria, but it also varies across genes within a species, like the sodium transport pumps in humans.  Burgeoning library preparation technologies will help with these specific samples, but the importance of systematic bias and its correction cannot be overstated.
  • Correlation between methods: correlation is often used not only to compare different methodologies, but to compare different experiments or, more ambitiously, to compare biological findings between diverse experimental designs and technologies, e.g., proteomics and transcriptomics.  The oft-held view is that the higher the correlation, the more valid the results.  But the statistical truth is that these numbers and the methods used to calculate them only provide perspective on the data.

To expand on this, two main measures of correlation are used in this debate: Pearson and Spearman.  The primary difference is that Spearman ranks the data points and uses the rank numbers in its calculation, rather than the raw expression values used by Pearson.  Right away, one can see a glaring difference: outliers will strongly influence the Pearson correlation, whereas they have far less influence on Spearman.  Additionally, in their truest form, Pearson should be used when one is measuring linearity, while Spearman is most appropriate for monotonicity.  Monotonicity is directional dependence, consistently increasing or decreasing, while linearity takes monotonicity one step further and accounts for how large those increases and decreases are.
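The rank-transform relationship between the two coefficients can be sketched in a few lines of plain Python (tie handling omitted for clarity; in practice one would use a library implementation). The example data are made up to show the point about monotone but non-linear expression values:

```python
import math

def pearson(x, y):
    """Pearson correlation on the raw values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman = Pearson computed on the ranks (no tie handling here)."""
    rank = lambda v: [sorted(v).index(e) + 1 for e in v]
    return pearson(rank(x), rank(y))
```

On a monotone but wildly non-linear pair of samples, Spearman returns a perfect 1.0 while Pearson is pulled down by the large values, which is exactly why the choice of coefficient shapes the conclusions drawn from a comparison.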

  • MA vs. scatter plots are probably the most common plots used in transcriptomics differential expression.  In this debate both sides used them, as did the other studies referenced, and for good reason: these plots show a lot of information.  Still, improvements can be made.  First, adding +/- threshold lines around zero in an MA plot highlights fold-change cutoffs.  Second, zooming in on scatter plots can hide outliers and should be avoided, even at the cost of masking the similar points near the origin.  To correct these visual biases, an alpha transparency factor was used along with an interactive plot; these are very effective ways to visualize the properties of the data.  We would also suggest a hexbin plot or a 2-D kernel density estimate, both of which bring density into the visualization.
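For readers unfamiliar with the M and A axes, here is a generic sketch of how the points are computed from two samples' expression values. The pseudocount is an assumption on our part to avoid taking the log of zero; it is not prescribed by either side of the debate:

```python
import math

def ma_points(expr_a, expr_b, pseudocount=1.0):
    """Per-gene M (log2 fold change) and A (mean log2 intensity)."""
    m_vals, a_vals = [], []
    for x, y in zip(expr_a, expr_b):
        x, y = x + pseudocount, y + pseudocount   # avoid log2(0)
        m_vals.append(math.log2(x / y))           # M: log-ratio between samples
        a_vals.append(0.5 * math.log2(x * y))     # A: average log expression
    return m_vals, a_vals
```

With this parameterization, the threshold lines at M = +/-1 mentioned above correspond to a two-fold change between samples, and plotting A on the horizontal axis spreads the low- and high-expression genes apart so density tricks like hexbinning have something to work with.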

  • Instrument peculiarities vs. enhancements / improvements: this is definitely one of those knowledge structures that is additive.  Having historical perspective, and anticipating the direction a technology must go to improve, are intangible skills that are hard to quantify, but we all acknowledge they can make a huge difference between answering your research aim or not.  Why use a technology that will not, and cannot, despite best efforts, answer the question you’re asking?  Seems simple enough.  Now apply that to data analysis and bioinformatics.  The question of why is the most important of all.

Direct to Consumer Genetic Testing

If you didn’t catch this, it’s true.  23andMe was given FDA approval for direct-to-consumer genetic testing services, the FIRST.TIME.EVER.  Interestingly, the FDA hopes this ruling will allow consumers to change lifestyle choices related to the 10 diseases 23andMe will currently report on.

I say interesting because, for many of those diseases, lifestyle choices have not been proven to affect the outcome of the genetic alteration.  Seriously, how does one change the outcome of Gaucher’s disease, early-onset primary dystonia, hereditary hemophilia or hemochromatosis through personal lifestyle choices?  And really, wouldn’t you already know from your physician, through signs and symptoms, that you have one of these diseases, likely from infancy?  Maybe you decide not to have children, but I sincerely hope you talk with a genetic counselor to learn the difference between X-linked and autosomal recessive or dominant inheritance.

Other tidbits that are interesting… this sets major PRECEDENT.  According to the FDA, it intends to offer further exemptions to premarket review:

“…to exempt additional 23andMe GHR tests from the FDA’s premarket review, and GHR tests from other makers may be exempt after submitting their first premarket notification. A proposed exemption of this kind would allow other, similar tests to enter the market as quickly as possible and in the least burdensome way, after a one-time FDA review.”

The agency’s primary concern is “to help ensure that they [Genetic Health Risk tests] provide accurate and reproducible results.”  The FDA will not provide exemptions for “Diagnostic Tests.”

I appreciate the accurate and reproducible results, but really:

How many people know the difference between a Diagnostic Test and a Genetic Health Risk test?

Much educating needs to be done.  Here are the links:

MIT Review

FDA Press Release

And oh… 23andMe sells the data… ALL.THE.DATA!

MIT Review

The Science of Collaboration

“Alone we can do so little; together we can do so much” – Helen Keller.

A new paper put out by Nature Communications made the above quote pop into my head. The paper, “Accelerating the search for the missing proteins in the human proteome,” discusses a new database that can hopefully aid in the efforts to find all of the “missing proteins” in Homo sapiens, MissingProteinPedia.

The goal of MissingProteinPedia is to help speed up the process of classifying proteins as “real” proteins. The Human Proteome Project (HPP) is a major database that classifies human proteins. It uses a ranking system of PE1-PE5, where PE1 proteins are those that have been confirmed through mass spec, solved X-ray structures, antibody verification and/or sequencing via Edman degradation. The PE2-PE4 groups are proteins that have evidence of existence at the transcript level, are inferred to exist based on homology, or are just flat out inferred to exist, while PE5 covers proteins whose existence is uncertain or dubious.

Now, although the HPP is great at ensuring protein data are quantitative and high-stringency, those two factors can at times be a hindrance to categorizing proteins as PE1. For example, the authors bring up multiple proteins with REPRODUCIBLE evidence of their impact on humans (such as prestin and interleukin-9) that are nonetheless relegated to PE2-PE4 status. As long as there are no data “confirming” the existence of a protein in line with HPP requirements, that protein will not be elevated to PE1 status. This is where MissingProteinPedia comes in.

The goal of MissingProteinPedia is two-fold. First, it should be a database where anyone can both deposit and access information. Second, the hope is that this collaborative data can be used as a platform to help researchers generate the data required by the HPP to elevate these proteins to PE1 status.

NOW, are there certain things to be wary of? Of course. The authors of the paper openly admit that there is no check on the quality of the data in the database, and that the data can come from a wide variety of sources, including unpublished work. Call-out to REPLICATE and VALIDATE.

Currently, there are just under 1500 proteins in the MissingProteinPedia database. The website itself is easy to use and has some great information. You can narrow your search to a specific gene, or you can also search by chromosome. Clicking on a protein gets you a short description of the protein, as well as all relevant data, including homology, known domains, and references.

Additionally, there are some great characteristics of the database that make it more user-friendly:

  1. The data provided for proteins includes BLAST results for sequence similarity and functional annotation. This is unique amongst databases.

  2. MissingProteinPedia pulls in mass spectra from two of the best mass spec databases, PRIDE and GPM.

  3. The database is schema-less, making it more flexible. Without rigid requirements for formatting or structuring data, it is much more open and inclusive of different data types.

  4. It incorporates text-mining. This allows researchers to retrieve more information, as the database sifts through text to identify other possibly related and/or relevant information.

Although the quantity and quality of data varies between proteins, there is plenty of information to give a researcher a head start on characterizing these proteins. And isn’t that what we do as scientists? We constantly build off each other and look to take everything one step further. You never know who your limited data will help or what big discovery someone’s piece of information sparks you to make.

When Science isn’t an Exact Science.

Nature recently published an article that highlights one of the uglier aspects of science that at times tends to plague students, postdocs and P.I.s alike: reproducibility.

The Nature editorial article focused on the work of the Reproducibility Project: Cancer Biology, which is a group dedicated to replicating experiments from over 50 papers published in big name journals like Science and Cell. While we always hope that replication studies go smoothly, that isn’t always the case.

The editorial spent a good chunk of its time discussing the attempt to reproduce this 2010 paper, which reported some breakthroughs in tumor penetration of cancer drugs:

Ruoslahti et al., Coadministration of a Tumor-Penetrating Peptide Enhances the Efficacy of Cancer Drugs, Science, 21 May 2010: 1031–1035

Unfortunately, the reproducibility group had some different results than the original paper. And when I say different results, I mean that the replication study found no statistical significance whereas the original study found great significance for the following end-points:

  1. The permeability or penetrance of doxorubicin was not enhanced when it was co-administered with the iRGD peptide.

  2. Tumor weights showed no statistically significant difference.

  3. No difference was seen in TUNEL staining.

So what do we make of a result like this?

Well, we do believe it is important to state what we should NOT do: we shouldn’t entirely disregard the results of the 2010 paper. As stated previously, REPLICATING is not REPRODUCING. To properly reproduce evidence-based science, there need to be different methods and multiple observations under diverse conditions. The reproducibility project used mostly the same conditions, and one would think these experiments should be replicable….. But they weren’t, IN.THIS.CASE.

However, maybe we should not be focusing solely on the issue of reproducibility and instead ask whether the effects of the iRGD peptide are similar to the findings of the 2010 paper when it is tested with other chemotherapeutics and/or cancer models.  If the effects seen with the peptide reflect a true biochemical effect, the enhanced permeability and penetration of chemotherapeutics co-administered with the peptide should be seen across the board, regardless of the model.

To this end… there are currently 51 articles in PubMed that can be found with a simple search of “Tumor Penetrating Peptides”. Most of these 51 papers are not from the lab that published the 2010 paper.

NOW: Should we disregard this line of investigation and view it as bunk due to the failure to replicate? Thankfully, no. The 51 papers on PubMed indicate that this field of study is an active and growing body of research.

Unfortunately in our click-bait society, people will only read the headline and a select few sentences before drawing a conclusion. In fact, Nature spent most of the editorial on this one failure despite mentioning that 10 other labs have already validated the findings of the original 2010 paper. If 10 independent labs are able to reproduce the findings and only 1 lab has failed to do so, that’s science.

And truth be told, isn’t that the purpose of peer-reviewed publication? Putting yourself and your scientific ideas out there for the world to comment on, replicate and reproduce? That is how the body of evidence, and then knowledge, moves forward.