FEV_KEGG.Experiments.39 module

Context

The approach of building a consensus/majority graph of enzymes/EC numbers to find a core metabolism shared among several organisms has to be validated against previous research. One such study, Kerou et al. (2016), covers seven representative genomes of Thaumarchaeota, but manually curates all genes, including their associated EC numbers. Kerou et al. (2016) list EC numbers in the additional file “Dataset_S02.xlsx”, from which we extracted the ones on the sheets “Cell surface & glycosyl” and “Metabolism”. As with previous validations, we have to filter out EC numbers which are outdated or otherwise not represented in KEGG’s standard pathways. This is done here by restricting both their set and our set of EC numbers to the ones occurring in NUKA.

Question

Does the consensus/majority graph approach to core metabolism yield a similar set of EC numbers as the approach of Kerou et al. (2016)?

Method

  • extract EC numbers from Kerou et al. (2016) by hand
  • sanitise them by leaving only the ones found in NUKA
  • also remove the ones with wildcards
  • REPEAT with different groups
    1. get group of organisms by clade ‘Thaumarchaeota’
    2. get only the seven organisms used by Kerou et al. (2016)
  • REPEAT for varying majority-percentages:
    • calculate EC numbers occurring in group’s core metabolism
    • sanitise them by leaving only the ones found in NUKA
    • also remove the ones with wildcards
    • overlap their set with ours and print the number of EC numbers inside the intersection and falling off on either side
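
The steps above can be sketched with plain Python sets. This is only an illustration, not the code of this module: the function names `sanitise`, `majority_ecs` and `overlap` are hypothetical, the EC numbers are toy data, and rounding the required organism count up is an assumption (it would be consistent with rows such as 100% and 90% being identical for seven organisms).

```python
from collections import Counter


def sanitise(ec_numbers, nuka_ecs):
    """Keep only EC numbers without a '-' wildcard that also occur in NUKA."""
    return {ec for ec in ec_numbers if "-" not in ec and ec in nuka_ecs}


def majority_ecs(ecs_per_organism, majority_percent):
    """Return EC numbers occurring in at least majority_percent of organisms.

    Assumption: the required number of organisms is rounded up,
    and majority_percent is an integer between 1 and 100.
    """
    n = len(ecs_per_organism)
    required = max(1, (n * majority_percent + 99) // 100)  # integer ceiling
    counts = Counter(ec for ecs in ecs_per_organism for ec in set(ecs))
    return {ec for ec, count in counts.items() if count >= required}


def overlap(theirs, ours):
    """Return the sizes of (only theirs, both, only ours)."""
    return len(theirs - ours), len(theirs & ours), len(ours - theirs)


# Toy data, purely illustrative; not taken from KEGG or Kerou et al. (2016).
nuka = {"1.1.1.1", "2.7.1.1", "4.1.1.39", "6.3.1.2"}
theirs = sanitise({"1.1.1.1", "4.1.1.39", "1.14.-.-", "9.9.9.9"}, nuka)
organisms = [
    {"1.1.1.1", "2.7.1.1", "6.3.1.2"},
    {"1.1.1.1", "2.7.1.1"},
    {"1.1.1.1", "3.5.-.-"},
]

for percent in (100, 50, 1):
    ours = sanitise(majority_ecs(organisms, percent), nuka)
    print(f"{percent}%:", *overlap(theirs, ours))
```

Lowering the majority percentage can only add EC numbers to our set, which is why the overlap grows and the number of ECs found only in their set shrinks from row to row.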

Result

Maj. %  only theirs   both   only ours
Thaumarchaeota:
100%:           102     65          80
90%:             73     94         139
80%:             66    101         154
70%:             65    102         156
60%:             62    105         162
50%:             60    107         167
40%:             58    109         174
30%:             53    114         188
20%:             51    116         203
10%:             46    121         240
1%:              38    129         334

Representative organisms:
100%:            74     93         142
90%:             74     93         142
80%:             65    102         155
70%:             65    102         161
60%:             65    102         161
50%:             61    106         174
40%:             53    114         191
30%:             53    114         191
20%:             50    117         209
10%:             47    120         245
1%:              47    120         245

Conclusion

When comparing the core metabolisms of all Thaumarchaeota known to KEGG today with those of only the seven representative organisms, there is not much difference. This suggests that these seven organisms are indeed well-chosen representatives.

Considering the number of EC numbers falling off on either side: the number of ECs only in our set is larger than the overlap; thus, we again see that core metabolisms created with our approach tend to be bigger than manually curated ones. This is most likely because ECs which occur in all genomes are not necessarily essential, while Kerou et al. aimed at including only essential ECs. The number of ECs only in their set is also very high, amounting to roughly 65% of the overlap, or 40% of their set, and 20% of the overall set. These ECs only in their set cannot stem from ECs absent from KEGG pathways altogether, since we pre-filtered them using NUKA. The most likely explanation seems to be that Kerou et al. were able to annotate many more EC numbers manually than KEGG’s GENE database has stored to date. This, again, would mean that KEGG’s data is incomplete, which is strongly implied by the fact that even the collective graph (1% majority) of the representative organisms lacks 47 of the EC numbers of Kerou et al., which can only happen if these EC numbers are annotated in none of the seven organisms as currently stored in KEGG.
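
The percentages quoted above can be reproduced from the Result table. The sketch below uses the 80% row of the representative organisms as an example; which row the text refers to is an assumption, and the helper name `shares` is hypothetical.

```python
def shares(only_theirs, both, only_ours):
    """Relate the ECs found only in their set to three reference sizes."""
    their_set = only_theirs + both            # all of their sanitised ECs
    overall = only_theirs + both + only_ours  # union of both sets
    return only_theirs / both, only_theirs / their_set, only_theirs / overall


# 80% majority, representative organisms: only theirs, both, only ours
of_overlap, of_their_set, of_overall = shares(65, 102, 155)

print(f"share of the overlap:     {of_overlap:.0%}")    # 64%
print(f"share of their set:       {of_their_set:.0%}")  # 39%
print(f"share of the overall set: {of_overall:.0%}")    # 20%
```

Note that their sanitised set has the same size in every row (only theirs plus both is always 167), so only the split between the first two columns changes with the majority percentage.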

To conclude on the effectiveness of our approach of building a core metabolism: completeness and quality of EC number annotations vary greatly, both within the literature and within KEGG. Therefore, to achieve the most exact model of an organism’s metabolism, one needs to apply further steps beyond our approach. Such steps may involve flux balance analysis with a manually curated list of ‘essential’ metabolites. Still, when reducing the set of EC numbers to the ones known to standard KEGG pathways (using NUKA), core metabolisms created via our approach can be used to roughly compare the metabolic capabilities of closely or even remotely related organisms, groups of organisms, and whole clades.