Human Protein Atlas v15 is OUT

Alright, enough commentary about OpEd articles in the NYT… back to science.  

This month, the Human Protein Atlas (HPA) released a fantastic update. The primary research article is here.  They not only integrated the transcript data from Broad’s GTEx database with the HPA, but also gave a nice, concise review of those databases and others.  

First, the appropriate caveat was given, i.e.,

“Overall, these studies suggest that the amount of a given protein in a cell or tissue is, IN GENERAL, reflected by the corresponding mRNA level, ALTHOUGH…..” Blah, blah, “translational rates, protein half-lives,.... the transcript level for a given gene might therefore be used to predict the corresponding protein level.  This hypothesis needs to be confirmed by more…..  thus forming an attractive link between the field of genomics and proteomics.”

Phew… now that that’s been said…. Again.  On to the comparisons.  

Initially, the paper very briefly reviews the comparison by Yu et al. of CAGE (cap analysis of gene expression), in which only the 5’ ends of capped mRNA molecules are sequenced, against full-length RNA-seq.  The HPA generated the full-length RNA-seq data, while the FANTOM5 consortium released the CAGE data.  They concluded that the differences in the data (22 tissues analyzed by both methods) are largely due to technical artifacts inherent in the respective technologies or to annotation issues.   That’s comforting.  

I should take a step back and describe the data:

  1. HPA contains immunohistochemistry-based expression and spatial localization within tissues for the human proteome.  For this update, they also have matched RNA-sequencing data from 32 histologically normal tissues from 95 individuals, with 2 replicates each.  

  2. FANTOM5 (Functional Annotation of the Mammalian Genome 5) performed CAGE analysis on approximately 975 human samples, including tissues, cell lines and primary cells.  

  3. The GTEx data set includes ~1,600 postmortem samples analyzed by RNA-seq.

These are all internally generated data, i.e., systematic laboratory workflows and coherent bioinformatics pipelines.  

Interestingly, and quite smartly, the term “tissue specific” was avoided during the comparisons of data, as its meaning depends on cutoffs that each researcher chooses.  For example, below what expression level should a transcript be considered absent, 1 FPKM?  So, they devised terms that are more qualitative than quantitative, but still give the overall flavor of expression.  
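To make the threshold question concrete, here is a minimal Python sketch of how an FPKM value is computed and how the choice of cutoff changes which genes count as "detected".  The function names, the example numbers and the cutoffs are all hypothetical illustrations of mine, not anything taken from the paper.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


def is_detected(fpkm_value, cutoff=1.0):
    """Call a gene 'detected' if it reaches the cutoff (the cutoff is an assumption)."""
    return fpkm_value >= cutoff


# Hypothetical gene: 500 fragments on a 2,000 bp transcript, 25 million mapped fragments.
example = fpkm(500, 2000, 25_000_000)

# Hypothetical FPKM values for four genes in one tissue.
values = [0.4, 0.9, 1.0, 3.2]
detected_at_half = sum(is_detected(v, cutoff=0.5) for v in values)
detected_at_one = sum(is_detected(v, cutoff=1.0) for v in values)
print(example, detected_at_half, detected_at_one)
```

Moving the cutoff from 0.5 to 1 FPKM changes how many genes are "present", which is exactly why a hard "tissue specific" label is slippery.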

As far as the comparison of the GTEx and HPA RNA-seq data goes, they are very comparable.  Globally, it seems a little less than half of the protein-coding transcripts are ubiquitously expressed across all tissues.  Further, the tissues that show the most “tissue-enriched” genes are identified independently by both datasets as being testis, brain, skin and liver.  Very comforting: the “tissue-enriched” genes in those tissue types are consistent with function.  It was also somewhat surprising, but very comforting nonetheless, that there is significant overlap between the postmortem and fresh frozen tissue samples.  This can be interpreted to mean that the differences in sampling procedures for limiting RNA degradation have very little effect in these datasets and, frankly, that the laboratory procedures in place are excellent.  

The rest of the comparisons between the data, by tissue type, show the independent databases are in spot-on agreement.  Here’s a take-home: those genes that are expressed in all tissues, the ‘housekeeping’ genes, show a very low coefficient of variation among the individuals measured and between laboratories.  In comparison, those genes that do show a tissue-specific distribution of expression have much more inter-individual variation.     
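For anyone who wants to see what that coefficient of variation comparison looks like in practice, here is a small Python sketch.  The expression values are made up for illustration and the function name is my own; the paper's actual pipeline is not described here, so treat this only as the textbook definition (sample standard deviation divided by the mean).

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(values) / mean


# Hypothetical expression levels for one gene across five individuals.
housekeeping = [50.0, 52.0, 48.0, 51.0, 49.0]
tissue_enriched = [5.0, 40.0, 12.0, 80.0, 3.0]

cv_housekeeping = coefficient_of_variation(housekeeping)
cv_tissue_enriched = coefficient_of_variation(tissue_enriched)
print(f"housekeeping CV: {cv_housekeeping:.3f}")
print(f"tissue-enriched CV: {cv_tissue_enriched:.3f}")
```

Because the CV is scaled by the mean, it lets you compare variability between genes with very different overall expression levels.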

Lastly, the paper advertises these resources as a source for creating genome-scale metabolic models (GEMs).  GEMs can contain millions of biochemical reactions across individual tissue types.  By linking genes to proteins in each tissue, one can start generating a map of metabolic processes.   Not surprisingly, these GEMs can be used to predict the effects of small-molecule perturbations in a tissue-specific manner.  Within a given context, GEMs can elucidate metabolism-related disorders or provide insights into how drug metabolism changes in disease versus health.  That said, this work is largely predictive modeling with very niche algorithms, even as it’s being advertised as a personalized medicine approach.    

All in all, a fantastic comparison, and one that truly shows you can study human health and disease with a global, systems biology approach using large-scale data.  Thanks!