FEV_KEGG.Experiments.30 module
Context
The approach of building a consensus/majority graph of enzymes/EC numbers to find a core metabolism shared among several organisms has to be validated against previous research. One such study deals with Gammaproteobacteria, but uses a slightly different approach: it first calculates Enzymatic Step Sequences (ESS) and then compares these among species. In File S2, Poot-Hernandez et al. (2015) list the EC numbers associated with enzymes from ESS shared among most of the 40 representative species of Gammaproteobacteria, marked with the hexadecimal code for blue. It does not matter in which category the EC numbers occur, since “highly preserved” etc. only describes the percentage of EC numbers in a pathway that are actually common in Gammaproteobacteria and are thus marked blue. These EC numbers are used to validate the approach of this library. EC numbers containing wildcards (e.g. 1.2.-.-) are excluded on both sides, to minimise statistical skew. Multifunctional enzymes are, however, included, because they were not excluded by Poot-Hernandez et al.
Three groups of organisms are compared:
1) The exact same organisms deemed representative by Poot-Hernandez et al. (Table 2 lists them)
2) All organisms of Gammaproteobacteria
3) All organisms of Gammaproteobacteria, excluding ‘unclassified’ organisms
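The wildcard exclusion described above can be illustrated with a few lines of plain Python. This is only a sketch: the helper names `has_wildcard` and `drop_wildcards` and the example set are hypothetical and are not part of this library.

```python
def has_wildcard(ec_number: str) -> bool:
    """Return True if any level of the EC number is the wildcard '-'."""
    return "-" in ec_number.split(".")


def drop_wildcards(ec_numbers):
    """Keep only fully specified EC numbers (hypothetical helper)."""
    return {ec for ec in ec_numbers if not has_wildcard(ec)}


# Toy input, for illustration only.
example = {"1.2.-.-", "2.7.4.14", "4.2.1.-", "5.4.2.1"}
print(sorted(drop_wildcards(example)))
```

Applying the same filter to both sets before comparing them keeps partially specified entries from inflating either side of the comparison.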
Question
Does the consensus/majority graph approach to core metabolism yield a set of EC numbers similar to that of Poot-Hernandez et al. (2015)?
Method
- extract by hand all EC numbers marked as blue (preserved) in Poot-Hernandez et al. (2015)
- REPEAT with different groups:
    - get group of organisms deemed representative by Poot-Hernandez et al.
    - get group of organisms ‘Gammaproteobacteria’
    - get group of organisms ‘Gammaproteobacteria’, excluding unclassified
    - REPEAT for varying majority-percentages:
        - calculate EC numbers occurring in group’s core metabolism
        - overlap Poot-Hernandez’ set with ours and print the number of EC numbers inside the intersection and falling off either side
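The inner loop of the method can be sketched with plain Python sets. This is a minimal illustration, not the library's implementation: the function names `majority_ec_numbers` and `overlap`, the rounding of the majority threshold (ceiling), and the toy data are all assumptions.

```python
from collections import Counter
from math import ceil


def majority_ec_numbers(ec_sets, majority_percent):
    """EC numbers present in at least majority_percent % of the organisms.

    Assumption: the threshold is rounded up to a whole number of organisms.
    """
    threshold = ceil(len(ec_sets) * majority_percent / 100)
    counts = Counter(ec for ecs in ec_sets for ec in set(ecs))
    return {ec for ec, n in counts.items() if n >= threshold}


def overlap(theirs, ours):
    """Return (only in theirs, in both, only in ours) as counts."""
    return len(theirs - ours), len(theirs & ours), len(ours - theirs)


# Toy data: one EC number set per organism (hypothetical, not from KEGG).
organisms = [
    {"1.1.1.1", "2.7.1.1", "4.2.1.2"},
    {"1.1.1.1", "2.7.1.1"},
    {"1.1.1.1", "3.5.1.47"},
    {"1.1.1.1", "2.7.1.1", "5.4.2.1"},
]
theirs = {"1.1.1.1", "2.7.1.1", "3.6.1.15"}

for percent in (100, 75, 1):
    ours = majority_ec_numbers(organisms, percent)
    print(f"{percent}%:", *overlap(theirs, ours))
```

The three printed counts per line correspond to the columns "others", "both" and "ours" of the result table.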
Result
others: EC numbers found only in Poot-Hernandez’ set; both: found in both sets; ours: found only in our core metabolism.
Maj. %    others    both    ours
Representative:
100%:        228       9      14
90%:         128     109      65
80%:          71     166     113
70%:          54     183     163
60%:          44     193     209
50%:          39     198     259
40%:          33     204     315
30%:          26     211     383
20%:          24     213     466
10%:          22     215     659
1%:           15     222    1059
Gammaproteobacteria:
100%:        235       2       0
90%:          85     152      98
80%:          51     186     174
70%:          43     194     219
60%:          39     198     263
50%:          32     205     336
40%:          30     207     402
30%:          25     212     497
20%:          23     214     620
10%:          21     216     760
1%:           18     219    1070
Gammaproteobacteria without unclassified:
100%:        235       2       0
90%:          85     152      98
80%:          51     186     174
70%:          43     194     219
60%:          39     198     264
50%:          32     205     335
40%:          30     207     402
30%:          25     212     500
20%:          23     214     620
10%:          21     216     760
1%:           18     219    1070
Conclusion
Regarding the result at 1% of the representative group: there are 15 EC numbers which occur in Poot-Hernandez’ set but nowhere in our organisms. Looking at the EC numbers in detail, this is easily explained by outdated data used by Poot-Hernandez et al.:
- 1.1.1.158: deleted 2013
- 1.17.1.2: deleted 2016
- 1.17.4.2: in none of the 40 organisms of today
- 1.3.1.26: deleted 2013
- 1.3.3.1: deleted 2011
- 1.3.99.1: deleted 2014
- 2.3.1.89: in none of the 40 organisms of today
- 2.4.2.11: deleted 2013
- 2.7.4.14: in none of the 40 organisms of today
- 3.5.1.47: in none of the 40 organisms of today
- 3.6.1.15: in none of the 40 organisms of today
- 3.6.1.19: deleted 2016
- 4.2.1.52: deleted 2012
- 4.2.1.60: deleted 2012
- 5.4.2.1: deleted 2013
The experiment should be re-run without the now obsolete EC numbers above, to avoid skewed results.
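A re-run along these lines could start by removing the 15 EC numbers named above from the reference set. The following sketch only shows that filtering step; the constant `OUTDATED_EC_NUMBERS`, the helper `without_outdated` and the example excerpt are hypothetical names and data, not part of this library.

```python
# The 15 EC numbers named in the conclusion as deleted or absent
# from today's 40 representative organisms.
OUTDATED_EC_NUMBERS = frozenset({
    "1.1.1.158", "1.17.1.2", "1.17.4.2", "1.3.1.26", "1.3.3.1",
    "1.3.99.1", "2.3.1.89", "2.4.2.11", "2.7.4.14", "3.5.1.47",
    "3.6.1.15", "3.6.1.19", "4.2.1.52", "4.2.1.60", "5.4.2.1",
})


def without_outdated(ec_numbers):
    """Drop the outdated EC numbers from a set (hypothetical helper)."""
    return set(ec_numbers) - OUTDATED_EC_NUMBERS


# Hypothetical excerpt of a reference set, for illustration only.
reference_excerpt = {"1.1.1.158", "2.7.4.14", "2.7.1.1"}
print(sorted(without_outdated(reference_excerpt)))
```

The filtered set would then take the place of the hand-extracted set in the overlap step.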
Regarding the difference between Gammaproteobacteria with and without ‘unclassified’ organisms: in the upper (100/90%) and lower (20/10/1%) regions of majority, there is no difference. Only in the middle regions do some results differ, by one to three counts. This comes as no surprise, because the group of Gammaproteobacteria consists of 1074 organisms, while only one of them can be excluded as ‘unclassified’. This does not leave much room for difference. Still, it might be best practice to always exclude ‘unclassified’ organisms, if only because of their unknown position in the taxonomy.
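The size of this difference can be checked directly from the two result tables. The sketch below copies the (others, both, ours) counts from the Result section and lists the majority percentages at which the two groups differ; the variable names are chosen for illustration only.

```python
# (others, both, ours) per majority percentage, copied from the Result section.
with_unclassified = {
    100: (235, 2, 0), 90: (85, 152, 98), 80: (51, 186, 174),
    70: (43, 194, 219), 60: (39, 198, 263), 50: (32, 205, 336),
    40: (30, 207, 402), 30: (25, 212, 497), 20: (23, 214, 620),
    10: (21, 216, 760), 1: (18, 219, 1070),
}
without_unclassified = {
    100: (235, 2, 0), 90: (85, 152, 98), 80: (51, 186, 174),
    70: (43, 194, 219), 60: (39, 198, 264), 50: (32, 205, 335),
    40: (30, 207, 402), 30: (25, 212, 500), 20: (23, 214, 620),
    10: (21, 216, 760), 1: (18, 219, 1070),
}

# Element-wise change when 'unclassified' organisms are excluded.
differences = {
    percent: tuple(b - a for a, b in zip(with_unclassified[percent],
                                         without_unclassified[percent]))
    for percent in with_unclassified
    if with_unclassified[percent] != without_unclassified[percent]
}
print(differences)  # {60: (0, 0, 1), 50: (0, 0, -1), 30: (0, 0, 3)}
```

Only the "ours" column changes, and only at 60%, 50% and 30%.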
As expected, using all Gammaproteobacteria organisms yields a higher overlap between our core metabolism and theirs, e.g. 194 instead of 183 shared EC numbers at 70% majority. But the difference is small, implying that the 40 organisms were well-selected representatives. Still, a high number of EC numbers is found only in their core metabolism. This might result from their methodology. After creating the Enzymatic Step Sequences (ESS), they reduced the EC numbers in the ESS to only three levels, i.e. abstracting from substrate specificity and keeping only reaction types. Then, at some undocumented point, they translated the set of three-level EC numbers back to the original set of EC numbers. But this results in a list of EC numbers whose reaction types are preserved, not one whose substrate specificities are preserved! In order to accurately follow their definition of preservation, we have to reduce both sets of EC numbers, ours and theirs, to their first three levels, and then re-run the experiment.
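The proposed reduction to three levels could look roughly as follows. This is a sketch under assumptions: the helper names `to_three_levels` and `reduce_to_three_levels` are hypothetical, and the two small sets are toy data chosen so that the effect is visible.

```python
def to_three_levels(ec_number: str) -> str:
    """Truncate an EC number to its first three levels, e.g. 4.2.1.52 -> 4.2.1."""
    return ".".join(ec_number.split(".")[:3])


def reduce_to_three_levels(ec_numbers):
    """Apply the truncation to a whole set (hypothetical helper)."""
    return {to_three_levels(ec) for ec in ec_numbers}


# Toy sets, for illustration only.
theirs = {"4.2.1.52", "1.3.1.26"}
ours = {"4.2.1.2", "1.3.1.9"}

# At four levels the sets share nothing; at three levels they agree completely.
print(sorted(theirs & ours))
print(sorted(reduce_to_three_levels(theirs) & reduce_to_three_levels(ours)))
```

Comparing the reduced sets measures agreement on reaction types rather than on substrate specificities, which is closer to the definition of preservation described above.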