I was so very excited to read this new open-access paper in Scientific Reports (a Nature journal): “Cell Cycle Model Systems for Advancing Cancer Biomarker Research,” I. M. Lazar, I. Hoeschele, J. de Morais and M. J. Tenga, Sci. Rep., 2017.
I’m a cup-half-full kind of person, so I really don’t want to flame this paper. However, I need some feedback: was this good? How did this get published under the Nature banner? Please, comment.
First, a brief summary of the paper. The group undertook a global proteomics experiment characterizing three cancer cell lines. Each cell line was arrested during the cell cycle and underwent subcellular fractionation, yielding four distinct samples for each of only three cell lines. Each of these samples was replicated three times. Next, they went to the literature and came up with a listing of ‘cancer markers’ across the entire human proteome, primarily using UniProt and DAVID. The listing of literature proteins and the experimentally detected proteins had an intersection of ~300 proteins. They then describe this listing ad nauseam.
OK, so why am I so disappointed?
Depth. They only ID’d at most 1,500 proteins per experiment. Even though they used an older instrument, the LTQ, I expect more coverage for a Nature-branded paper in 2017. Heck, I’d expect more in 2007. These limited measurements actually form their own conclusion, as stated: “The analysis of ... on more sensitive MS platforms is expected to not only increase the number of protein ID’s, but also the overlap and reproducibility of their identification.” Yeah, no... kidding? Suffice to say, they don’t have the depth to conclude anything, much less that the data “can reveal novel protein associations and evidence for previously unsuspected mechanisms.” At a depth of coverage of ~1,500 proteins, everything you see has already been discovered, unless you’re looking at differential post-translational modifications, single nucleotide polymorphisms, or other mutations.
Reproducibility. Even in their conclusion above, they know the data are lacking. They did 12 solid experiments. The largest fraction, about 100 of the ~350 detected ‘cancer markers’, is found in only one experiment. Only 30 proteins out of the 350 ‘cancer markers’ are found in all 12 experiments. I’ll just let that sink in. Is this really biological? The cell lines are MCF7, MCF10A and SKBR3, all breast cell lines. Biology alone would make me believe these should show much more similarity, even with subcellular fractionation, which even in the greatest of hands is not the cleanest separation. But really, cup half full: it doesn’t matter if you only find 30 proteins, if they elucidate that special sauce needed to bring the next therapy to people. But this is high-throughput proteomics... and well, I expect more.
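The overlap arithmetic here is trivial to check yourself. A minimal sketch, using made-up protein IDs rather than the paper’s actual data, of tallying how many experiments each protein shows up in:

```python
from collections import Counter

# Hypothetical detection lists, one set of protein IDs per experiment
# (made-up IDs for illustration, not the paper's 12 experiments).
experiments = [
    {"TP53", "EGFR", "MKI67"},
    {"TP53", "EGFR"},
    {"TP53", "ERBB2"},
]

# Count in how many experiments each protein was detected.
counts = Counter(p for exp in experiments for p in exp)

# Proteins found in every experiment vs. in only one.
in_all = [p for p, n in counts.items() if n == len(experiments)]
in_one = [p for p, n in counts.items() if n == 1]

print(sorted(in_all))  # ['TP53']
print(sorted(in_one))  # ['ERBB2', 'MKI67']
```

With the paper’s numbers plugged in, this is exactly the 30-in-all-12 versus ~100-in-only-1 split complained about above.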
DAVID. I’ve had a love-hate relationship with DAVID for over a decade. It draws you in with its supposed ease of use, but it has never been updated frequently. This paper uses version 6.8, and to my delight this is a new release (2016). That’s way better than the previous version 6.7 from 2010. OK, so this just means I must reacquaint myself with the tool. In the meantime, I’ve learned there are other great tools out there for gene set enrichment with Gene Ontology. Here’s a small listing:
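Whatever tool you pick, the overrepresentation statistic underneath is usually a hypergeometric (Fisher-style) test. A toy sketch with made-up counts, not the paper’s numbers:

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N items from a pool of M containing n successes."""
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, min(n, N) + 1)) / total

# Toy numbers: a 1,500-protein background, 100 annotated to some GO
# term, a 300-protein interest list, 40 of which carry the annotation.
M, n, N, k = 1500, 100, 300, 40

# Expected by chance: N * n / M = 20; observing 40 is strong enrichment.
p_value = hypergeom_sf(k, M, n, N)
print(p_value)
```

This is the quantity the DAVID bar plots are drawn from; the redundancy complaint below is about how those per-term p-values get presented, not how they are computed.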
Figures. There are Venns, horizontal bars, pies, circos plots, and heatmaps. First off, Venns are useful for visualizing overlap, but do we need four? They all show similar proportions, whether it’s G1 nuclear or S-phase cytoplasmic. So, again: do we need four, or is one representative Venn good enough, with a sentence saying these proportions of intersection are seen in all experiments?
Horizontal bars... these are supposed to show the gene set enrichment for Gene Ontology categories according to DAVID. These are normally useful plots, but the authors here chose not to make them useful. When you have category duplication due to the redundancy of GO, the figure becomes useless, because you can no longer use the relative quantification. Why didn’t they try GO Slim, or any number of GO reduction techniques that can simplify the GO terms and, most importantly, provide useful information to the reader? Without taking care of the redundancy in these GO enrichment attempts, the reported fold enrichments are inaccurate at best.
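To make the redundancy point concrete: when an enriched term and its ancestor both appear in the results, one simple reduction is to keep only the stronger hit. A minimal sketch with hypothetical terms, parent links, and p-values; this stands in for the idea behind GO Slim-style reduction, not DAVID’s or GO Slim’s actual algorithm:

```python
# Toy child -> parent links, mimicking a slice of the GO hierarchy.
parents = {
    "mitotic cell cycle": "cell cycle",
    "cell cycle phase": "cell cycle",
}

# Hypothetical enrichment p-values (made up for illustration).
enriched = {
    "cell cycle": 1e-6,
    "cell cycle phase": 1e-8,
    "mitotic cell cycle": 1e-4,
}

def ancestors(term):
    # Walk child -> parent links up the toy hierarchy.
    while term in parents:
        term = parents[term]
        yield term

# For each enriched term/ancestor pair, drop the weaker p-value.
kept = set(enriched)
for term in enriched:
    for anc in ancestors(term):
        if anc in enriched:
            kept.discard(term if enriched[term] > enriched[anc] else anc)

print(sorted(kept))  # ['cell cycle phase']
```

Three redundant bars collapse to one, and the relative fold enrichments in the plot become meaningful again.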
Pies... Granted, I think pie charts are some of the most useless plots for biological data. They offer no information on the totality of the data. This one reaches a new level of ‘useless information’... and it occupies almost half the page. And the topper: it’s in 3-D. How exactly is that useful in conveying the data? I think the authors might be using this pie to summarize the DAVID GO enrichment bar plots above, with the redundancy removed, etc. But did that reduction of redundancy occur manually?
Circos plots can be very useful for seeing associations. In this case, they limit themselves to the top 100 proteins out of the ~350 ‘cancer markers’ identified in their study and plot them by their biological separations (G1/S phase, cytoplasmic and nuclear fractions) and by GO up- and down-regulation. These might have been useful... before the edges of each circos plot were cut off. Now you can’t even see the connections. How did that get through editorial proofing?
Enter the hairballs of STRING protein-protein interactions (PPIs). What good are these, other than to further substantiate their GO enrichment? You can’t even read them.
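If you must draw these, one readability fix is pruning low-confidence edges before plotting. A toy sketch with made-up interactions; STRING’s combined scores run 0-1000, and 700+ is conventionally treated as high confidence:

```python
# Hypothetical STRING-style edges: (protein A, protein B, combined score).
edges = [
    ("TP53", "MDM2", 999),
    ("TP53", "EGFR", 400),
    ("EGFR", "ERBB2", 920),
    ("MKI67", "TP53", 310),
]

# Keep only high-confidence interactions (score >= 700) to thin the hairball.
high_conf = [(a, b, s) for a, b, s in edges if s >= 700]
print(high_conf)  # [('TP53', 'MDM2', 999), ('EGFR', 'ERBB2', 920)]
```

Half the edges disappear and the remaining network is actually legible, which is the whole point of showing a PPI figure.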
Well, sorry to flame. I truly was excited to see this paper, but it really disappointed. The technology used doesn’t meet current standards of high-throughput measurement in proteomics, and the findings are conveyed in redundant, useless figures. I find it hard to believe this got through review at a Nature journal. But it did... What am I missing?
And this is the final kicker:
“The data analyzed in this study are included in the Supplementary Information files. Raw files are available from the corresponding author per reasonable request.” HUH?