FEV_KEGG.Experiments.30 module

Context

The approach of building a consensus/majority graph of enzymes/EC numbers to find a core metabolism shared among several organisms has to be validated against previous research. One such study deals with Gammaproteobacteria but uses a slightly different approach: it first calculates Enzymatic Step Sequences (ESS) and then compares these among species. In File S2, Poot-Hernandez et al. (2015) list the EC numbers associated with enzymes from ESS shared among most of the 40 representative species of Gammaproteobacteria, marked with the hexadecimal code for blue. It does not matter in which category the EC numbers occur, since “highly preserved” etc. only describes the percentage of EC numbers in a pathway that are actually common in Gammaproteobacteria and are thus marked blue. These EC numbers are used to validate the approach of this library.

EC numbers containing wildcards (e.g. 1.2.-.-) are excluded on both sides to minimise statistical skew. Multifunctional enzymes are, however, included, because Poot-Hernandez et al. did not exclude them.

The experiment is run on three groups of organisms:

1) The exact same organisms deemed representative by Poot-Hernandez et al. (listed in Table 2)
2) All organisms of Gammaproteobacteria
3) All organisms of Gammaproteobacteria, excluding ‘unclassified’ organisms
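The wildcard exclusion mentioned in this section can be sketched as follows. This is a minimal illustration on plain strings, not the library's own code; the helper names `has_wildcard` and `drop_wildcards` are hypothetical.

```python
from typing import Set


def has_wildcard(ec_number: str) -> bool:
    """Return True if any level of the EC number is the wildcard '-'."""
    return "-" in ec_number.split(".")


def drop_wildcards(ec_numbers: Set[str]) -> Set[str]:
    """Keep only fully specified EC numbers (hypothetical helper)."""
    return {ec for ec in ec_numbers if not has_wildcard(ec)}


# Toy input, not taken from File S2.
example = {"1.2.-.-", "2.7.1.1", "4.2.1.-"}
print(sorted(drop_wildcards(example)))  # ['2.7.1.1']
```

Applying the same filter to both sets before comparing them keeps partially specified entries from inflating either side of the overlap.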

Question

Does the consensus/majority graph approach to core metabolism yield a set of EC numbers similar to that of Poot-Hernandez et al. (2015)?

Method

  • extract by hand all EC numbers from Poot-Hernandez et al. (2015) that are marked as blue (preserved)
  • REPEAT with different groups:
    1. get group of organisms deemed representative by Poot-Hernandez et al.
    2. get group of organisms ‘Gammaproteobacteria’
    3. get group of organisms ‘Gammaproteobacteria’, excluding unclassified
  • REPEAT for varying majority percentages:
    • calculate EC numbers occurring in the group’s core metabolism
    • overlap Poot-Hernandez’ set with ours and print the number of EC numbers inside the intersection and falling off either side
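The two inner steps of the method can be sketched with plain Python sets. This is an illustration under stated assumptions, not the library's implementation: the function names `majority_core` and `overlap_counts` are hypothetical, the organisms and EC numbers are toy data, and "majority" is assumed to mean "present in at least the given percentage of organisms".

```python
from collections import Counter
from typing import Dict, Set, Tuple


def majority_core(ecs_by_organism: Dict[str, Set[str]], majority_percent: int) -> Set[str]:
    """EC numbers found in at least majority_percent % of organisms (assumed definition)."""
    counts = Counter(ec for ecs in ecs_by_organism.values() for ec in ecs)
    total = len(ecs_by_organism)
    # Integer comparison avoids floating-point rounding at the threshold.
    return {ec for ec, n in counts.items() if n * 100 >= majority_percent * total}


def overlap_counts(theirs: Set[str], ours: Set[str]) -> Tuple[int, int, int]:
    """Return (only in theirs, in both, only in ours)."""
    return len(theirs - ours), len(theirs & ours), len(ours - theirs)


# Toy data: three made-up organisms and a made-up reference set.
organisms = {
    "org_a": {"1.1.1.1", "2.7.1.1"},
    "org_b": {"1.1.1.1", "2.7.1.1", "3.1.1.1"},
    "org_c": {"1.1.1.1"},
}
theirs = {"1.1.1.1", "4.1.1.1"}

for percent in (100, 60):
    ours = majority_core(organisms, percent)
    print(percent, overlap_counts(theirs, ours))
# 100 (1, 1, 0)
# 60 (1, 1, 1)
```

The three numbers in each tuple correspond to the 'others', 'both' and 'ours' columns of the result tables.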

Result

Maj. %    only theirs    both    only ours
Representative:
100%:    228    9    14
90%:    128    109    65
80%:    71    166    113
70%:    54    183    163
60%:    44    193    209
50%:    39    198    259
40%:    33    204    315
30%:    26    211    383
20%:    24    213    466
10%:    22    215    659
1%:    15    222    1059

Gammaproteobacteria:
100%:    235    2    0
90%:    85    152    98
80%:    51    186    174
70%:    43    194    219
60%:    39    198    263
50%:    32    205    336
40%:    30    207    402
30%:    25    212    497
20%:    23    214    620
10%:    21    216    760
1%:    18    219    1070

Gammaproteobacteria without unclassified:
100%:    235    2    0
90%:    85    152    98
80%:    51    186    174
70%:    43    194    219
60%:    39    198    264
50%:    32    205    335
40%:    30    207    402
30%:    25    212    500
20%:    23    214    620
10%:    21    216    760
1%:    18    219    1070
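
One way to read the tables: in every row, the first two count columns (EC numbers found only by Poot-Hernandez et al., and EC numbers found by both) sum to 237, which is therefore the size of the reference set after the exclusions. The share of that set recovered by our core metabolism can be computed as below. The sketch copies the rows of the representative group from the table above; the name `recovered_share` is hypothetical.

```python
# Rows of the 'Representative' table: (majority %, only theirs, both, only ours).
REPRESENTATIVE = [
    (100, 228, 9, 14),
    (90, 128, 109, 65),
    (80, 71, 166, 113),
    (70, 54, 183, 163),
    (60, 44, 193, 209),
    (50, 39, 198, 259),
    (40, 33, 204, 315),
    (30, 26, 211, 383),
    (20, 24, 213, 466),
    (10, 22, 215, 659),
    (1, 15, 222, 1059),
]


def recovered_share(only_theirs: int, both: int) -> float:
    """Fraction of the reference set that also appears in our core metabolism."""
    return both / (only_theirs + both)


for percent, only_theirs, both, only_ours in REPRESENTATIVE:
    share = recovered_share(only_theirs, both)
    print(f"{percent:>3}%: {share:.1%} of the reference set recovered, "
          f"{only_ours} additional EC numbers")
```

Lowering the majority percentage raises the recovered share but also adds many EC numbers that are absent from the reference set, so the two columns have to be read together.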

Conclusion

Regarding the result at 1% of the representative group: There are 15 EC numbers which occur in the set of Poot-Hernandez et al. but nowhere in our organisms. Looking at the EC numbers in detail, this is easily explained by outdated EC numbers used by Poot-Hernandez et al.:

  • 1.1.1.158: deleted in 2013
  • 1.17.1.2: deleted in 2016
  • 1.17.4.2: found in none of the 40 organisms today
  • 1.3.1.26: deleted in 2013
  • 1.3.3.1: deleted in 2011
  • 1.3.99.1: deleted in 2014
  • 2.3.1.89: found in none of the 40 organisms today
  • 2.4.2.11: deleted in 2013
  • 2.7.4.14: found in none of the 40 organisms today
  • 3.5.1.47: found in none of the 40 organisms today
  • 3.6.1.15: found in none of the 40 organisms today
  • 3.6.1.19: deleted in 2016
  • 4.2.1.52: deleted in 2012
  • 4.2.1.60: deleted in 2012
  • 5.4.2.1: deleted in 2013

The experiment should be re-run without the now obsolete EC numbers above, to avoid skewed results.
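
A re-run could drop these entries from the reference set before computing the overlap. The sketch below is only an illustration: the constant and function names are hypothetical, and whether the five EC numbers that are merely absent from today's 40 organisms should be removed along with the ten deleted ones is an assumption made here via a switchable parameter.

```python
from typing import Set

# EC numbers reported above as deleted from the nomenclature.
DELETED = {
    "1.1.1.158", "1.17.1.2", "1.3.1.26", "1.3.3.1", "1.3.99.1",
    "2.4.2.11", "3.6.1.19", "4.2.1.52", "4.2.1.60", "5.4.2.1",
}

# EC numbers reported above as found in none of the 40 organisms today.
ABSENT_TODAY = {"1.17.4.2", "2.3.1.89", "2.7.4.14", "3.5.1.47", "3.6.1.15"}


def drop_obsolete(reference: Set[str], include_absent: bool = True) -> Set[str]:
    """Remove obsolete EC numbers from the reference set (hypothetical helper)."""
    obsolete = DELETED | ABSENT_TODAY if include_absent else DELETED
    return reference - obsolete


# Toy reference set: one deleted, one absent, one made-up current EC number.
reference = {"1.1.1.158", "2.7.4.14", "2.7.1.1"}
print(sorted(drop_obsolete(reference)))                        # ['2.7.1.1']
print(sorted(drop_obsolete(reference, include_absent=False)))  # ['2.7.1.1', '2.7.4.14']
```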

Regarding the difference between Gammaproteobacteria with and without ‘unclassified’ organisms: In the upper (100/90%) and lower (20/10/1%) regions of majority, there is no difference. Only in the middle region (60%, 50% and 30%) do some results differ, by one to three counts. This comes as no surprise, because the group of Gammaproteobacteria consists of 1074 organisms, of which only one is excluded as ‘unclassified’. This does not leave much room for difference. Still, it might be best practice to always exclude ‘unclassified’ organisms, if only because of their unknown position in the taxonomy.

As expected, using all Gammaproteobacteria organisms yields a higher overlap between our core metabolism and theirs, e.g. 194 versus 183 shared EC numbers at 70% majority. But the difference is small, implying that the 40 organisms were well-selected representatives.

Still, a high number of EC numbers is found only in their core metabolism. This might result from their methodology. After creating the Enzymatic Step Sequences (ESS), they reduced the EC numbers in the ESS to only three levels, i.e. they abstracted from substrate specificity and kept only the reaction types. Then, at some undocumented point, they translated the set of three-level EC numbers back to the original set of EC numbers. But this results in a list of EC numbers whose reaction types are preserved, not whose substrate specificities are preserved! In order to accurately follow their definition of preservation, we have to reduce both sets of EC numbers, ours and theirs, to their first three levels, and then re-run the experiment.
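
The proposed reduction to three levels could look roughly like this. It is a sketch on plain strings with toy data; `to_three_levels` is a hypothetical helper, and writing the reduced form as `1.1.1` rather than `1.1.1.-` is an arbitrary choice made here.

```python
from typing import Set


def to_three_levels(ec_number: str) -> str:
    """Drop the fourth level (substrate specificity), keeping the reaction type."""
    return ".".join(ec_number.split(".")[:3])


def reduce_set(ec_numbers: Set[str]) -> Set[str]:
    """Reduce every EC number in a set to its first three levels."""
    return {to_three_levels(ec) for ec in ec_numbers}


# Toy sets: no four-level EC number is shared, but one reaction type is.
theirs = {"1.1.1.158", "2.7.4.14"}
ours = {"1.1.1.1", "2.7.1.1"}

print(sorted(theirs & ours))                          # []
print(sorted(reduce_set(theirs) & reduce_set(ours)))  # ['1.1.1']
```

Because several four-level EC numbers collapse into one three-level entry, the reduced sets are smaller, and the counts from such a re-run would not be directly comparable to the tables above.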