“Alone we can do so little; together we can do so much” – Helen Keller.
A new paper published in Nature Communications made the above quote pop into my head. The paper, “Accelerating the search for the missing proteins in the human proteome,” introduces MissingProteinPedia, a new database intended to aid the effort to find all of the “missing proteins” in Homo sapiens.
The goal of MissingProteinPedia is to help speed up the process of classifying proteins as “real” proteins. The Human Proteome Project (HPP) is a major international effort to catalog human proteins. It uses a ranking system of PE1-PE5, where PE1 proteins are those that have been confirmed through mass spec, solved X-ray structures, antibody verification and/or sequencing via Edman degradation. The PE2-PE4 groups are proteins that have evidence for existence at the transcript level (PE2), are inferred to exist based on homology (PE3), or are simply predicted to exist (PE4).
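For readers who think in code, here is a minimal sketch of that ranking system. The class and function names are my own, not anything from the HPP or the paper, and the one-line descriptions are paraphrased from the paragraph above. PE5 is not described in this post; the comment on it reflects how that level is generally labeled.

```python
from enum import IntEnum


class ProteinExistence(IntEnum):
    """Hypothetical encoding of the PE1-PE5 evidence levels."""
    PE1 = 1  # confirmed at the protein level (mass spec, structures, etc.)
    PE2 = 2  # evidence at the transcript level
    PE3 = 3  # inferred from homology
    PE4 = 4  # predicted to exist
    PE5 = 5  # not covered in this post; generally labeled "uncertain"


def is_missing(level: ProteinExistence) -> bool:
    """Assumption: a 'missing protein' is one stuck at PE2-PE4."""
    return ProteinExistence.PE2 <= level <= ProteinExistence.PE4


def needs_elevation(levels: dict) -> list:
    """Return the (hypothetical) protein names that are candidates for PE1."""
    return sorted(name for name, level in levels.items() if is_missing(level))
```

Calling `needs_elevation` on a small dictionary of made-up protein names and levels would return only the entries sitting at PE2-PE4, which are the ones MissingProteinPedia is meant to help.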
Now, although the HPP is great at ensuring protein data is very quantitative and high-stringency, those two factors can at times be a hindrance to categorizing proteins as PE1 proteins. For example, the authors point to multiple proteins with REPRODUCIBLE evidence of their impact on humans (such as prestin and interleukin-9) that are nonetheless relegated to PE2-PE4 status. Because there are no data “confirming” the existence of these proteins in a way that meets HPP requirements, they will not be elevated to PE1 status. This is where MissingProteinPedia comes in.
The goal of MissingProteinPedia is two-fold. First, it should be a database where anyone can both deposit and access information. Second, the hope is that this collaborative data can be used as a platform to help researchers generate the data required by the HPP to elevate these proteins to PE1 status.
NOW, are there certain things to be wary of? Of course. The authors of the paper openly admit that there is no check on the quality of the data in the database, and the data can come from a wide variety of sources, including unpublished work. Consider this a call-out to REPLICATE and VALIDATE.
Currently, there are just under 1500 proteins in the MissingProteinPedia database. The website itself is easy to use and has some great information. You can narrow your search to a specific gene, or you can also search by chromosome. Clicking on a protein gets you a short description of the protein, as well as all relevant data, including homology, known domains, and references.
Additionally, there are some great characteristics of the database that make it more user-friendly:
The data provided for proteins includes BLAST results for sequence similarity and functional annotation. This is unique amongst databases.
MissingProteinPedia pulls in mass spectra from two of the best mass spec databases, PRIDE and GPM.
The database is schema-less, making it more flexible. Without any rigid requirements for formatting or structuring of data, it is much more open and inclusive of different data.
It incorporates text-mining. This allows researchers to retrieve more information, as the database sifts through text to identify other possibly related and/or relevant information.
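To make the schema-less point a bit more concrete, here is a rough sketch of what loosely structured records might look like. Everything in it is an assumption for illustration: the field names, gene names, and values are made up, and this is not how MissingProteinPedia actually stores or serves its data.

```python
# Two hypothetical records. Note that they do not share the same fields:
# a schema-less store accepts whatever evidence is available for each protein.
records = [
    {
        "gene": "GENE_A",            # made-up gene name
        "chromosome": "7",
        "blast_hits": ["hit1"],      # sequence-similarity results
        "mass_spectra": {"source": "PRIDE", "count": 3},
    },
    {
        "gene": "GENE_B",            # made-up gene name
        "chromosome": "X",
        "text_mined_refs": ["ref-1", "ref-2"],
        "notes": "unpublished observation",
    },
]


def evidence_types(record, core_fields=("gene", "chromosome")):
    """List which kinds of evidence a record carries, whatever they are."""
    return sorted(key for key in record if key not in core_fields)


def find_by_chromosome(all_records, chromosome):
    """Mimic narrowing a search by chromosome; absent fields are skipped."""
    return [r["gene"] for r in all_records if r.get("chromosome") == chromosome]
```

The trade-off is the one raised earlier: nothing here checks that a field is present or trustworthy, so whoever uses the data has to validate it.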
Although the quantity and quality of data vary between proteins, there is plenty of information to give a researcher a head start on characterizing these proteins. And isn’t that what we do as scientists? We constantly build off each other and look to take everything one step further. You never know who your limited data will help or what big discovery someone else’s piece of information will spark you to make.