Database Upgrades

In the last blog post, I lamented the lack of NIH DAVID updates and noted that it had just been updated after almost 6 years.  So I took it upon myself not only to investigate the update to DAVID, but also to applaud some other database upgrades/updates that have occurred of late.

First, DAVID.  Version 6.7 was released in 2010.  And it was a HUGE release, encompassing the complete rebuilding of the DAVID knowledgebase and the DAVID engine.  Both of these, in theory, allowed for ease in future updates and development.  Yet, it took 6 years to get to the latest version, 6.8.  What do you get in this latest update?  New annotation categories, new list identifier systems for conversions, and the knowledgebase completely rebuilt, again.  So, what is the DAVID Knowledgebase?  Here’s the complete description, but in a nutshell:  The knowledgebase is centered around the “DAVID Gene Concept”, a linkage method to agglomerate tens of millions of gene/protein identifiers and associated annotations from dozens of well-known bio-databases.  It’s a very comprehensive mapping file and freely available.  But, it doesn’t say that it’s a Gene Ontology (GO) or Gene Set Enrichment Analysis (GSEA) service.  Interestingly, this is what we see a lot of people doing with DAVID.  And again, there are definitely different, diverse and much better tools for GO or GSEA; see past posts.  However, digging around a little bit, I found a resource I didn’t know about that was actually included in the older version 6.7: the NIAID Pathogen Annotation Browser.  Following the manual/help, I selected a pathogen of interest, tried a couple of biologically relevant keywords and then submitted.  Unfortunately, the page just hung, showing 0%.  I tried a couple of different approaches just to get something in return, but nothing.  Maybe it’s broken?  Bummer, but for pathogenic organisms, we highly recommend PATRIC.

And, PATRIC underwent a huge update of late.  The publication is entitled, “Improvements to PATRIC, the all-bacterial Bioinformatics Database and Analysis Resource Center”.  This looks to be an entire rehab, starting with a very intuitive web interface.  Most importantly, anyone can set up a private workspace.  Then, you can import raw reads, assemble and annotate your private raw data with RASTtk, and then compare/integrate your private data with the library of public data.  This public data is nothing to sneeze at.  It contains tens of thousands of bacterial genomes with standardized annotations.  One can also integrate RNA-seq for differential expression analysis, run proteome comparisons, build metabolic models and look for genetic variations.  All the tools needed are integrated into this great web interface.  It really opens the door for an open-access, consistent way to do comparative genomic analysis of pathogenic bacteria, especially in the realm of antibiotic resistance studies.  The folks at PATRIC, which by the way is brought to us by the National Institute of Allergy and Infectious Diseases (NIAID), pulled down over 6,000 genomes from the antibiotic resistance studies published in the SRA, re-assembled them, annotated them in a standardized manner and made them available to the public.  Mapping between additional databases was made easy by pulling in legacy taxonomies and GenBank and RefSeq annotations.  What’s super cool for us at A2IDEA is how PATRIC focused its efforts on antimicrobial resistance (AMR) annotations and collections of genes, and curated the AMR genomes with standardized metadata, like infection site.  Super cool!

Thirdly, the Broad’s Genome Aggregation Database, gnomAD.  This new resource is the result of many investigators harmonizing and summarizing human exome and genome sequencing data from various projects, including 1000 Genomes, Framingham Heart Study, GTEx, National Institute of Mental Health Controls, National Heart Lung and Blood Institute, TCGA and many, many more.  The first release and primary paper call it the Exome Aggregation Consortium (ExAC).  gnomAD is a very similar Herculean data analysis project, but it now includes ~16K genomes and a whopping 123K individual exomes.  All raw data from this extensive list of projects have been reprocessed through the same pipeline to increase consistency across the projects.  The pipelines used are also available as open source, built with the Workflow Description Language (WDL, not just for bioinformatics) and Hail, and executed using the Cromwell engine, for anyone needing genomic workflows at massive scale on multiple platforms.  But, let’s just say you’re not into massive-scale genomic analysis; there’s still LOTS to be learned from this very beautifully designed web resource.  Put in your favorite gene and check it out.  The annotations are comprehensive (they include dbSNP IDs) and intuitive, with discrete allele frequency bins to allow quick visualization of statistically important genetic variations.  All with an ‘export table.csv’ option and downloadable coverage plots.  This resource might be the seed of a new, freely obtained A2IDEA instructional datasheet.  Stay tuned or contact us; we’re always happy to listen to your bioinformatics pain points.
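
If you do grab that exported CSV for your favorite gene, here is a minimal Python sketch of the "discrete allele frequency bins" idea: count how many variants fall into each frequency range.  Fair warning on the assumptions: the column name "Allele Frequency", the bin cut-offs and the sample rows below are all made up for illustration, so check the header of your own export and adjust accordingly.

```python
import csv
import io
from bisect import bisect_right
from collections import Counter

# Hypothetical bin edges (as fractions, not percent); the gnomAD browser's
# own cut-offs may differ.
BIN_EDGES = [0.0001, 0.001, 0.01, 0.05]
BIN_LABELS = ["<0.01%", "0.01-0.1%", "0.1-1%", "1-5%", ">=5%"]


def bin_label(freq):
    """Return the label of the bin containing freq (lower edge inclusive)."""
    return BIN_LABELS[bisect_right(BIN_EDGES, freq)]


def count_by_bin(csv_text, column="Allele Frequency"):
    """Count variants per allele frequency bin.

    The column name is an assumption; rows with an empty value are skipped.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get(column) or "").strip()
        if not value:
            continue
        counts[bin_label(float(value))] += 1
    return counts


# Made-up rows standing in for a real export.
SAMPLE = """Variant ID,Allele Frequency
1-100-A-G,0.00002
1-200-C-T,0.0005
1-300-G-A,0.02
1-400-T-C,0.31
1-500-A-T,
"""

if __name__ == "__main__":
    counts = count_by_bin(SAMPLE)
    for label in BIN_LABELS:
        print(label, counts[label])
```

To run it on a real export, read the file into a string (for example with `open(path, newline="").read()`) and pass it to `count_by_bin`, along with the actual column name if it differs.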
