FEV_KEGG.Statistics.SequenceComparison module

FEV_KEGG.Statistics.SequenceComparison.getExpectationValue(bitScore: float, searchSequenceLength: int, foundSequenceLength: int, numberOfSequencesInDatabase: int) → float
Returns the E-value of a single query in a sequence database.
Parameters:
- bitScore (float) – Comparison score, normalised to base 2 (bits).
- searchSequenceLength (int) – Length of the sequence that was the input of the query.
- foundSequenceLength (int) – Length of the sequence that was the result of the query. If you have several results, run this function for each of them individually.
- numberOfSequencesInDatabase (int) – Count of all sequences that could potentially have been found. For example, a search across all genes of the organism eco uses the count of all sequenced genes for eco, so numberOfSequencesInDatabase = 4,498. A search across all organisms uses the count of all sequenced genes in KEGG, so numberOfSequencesInDatabase = 25,632,969.
Returns: Statistical E-value (expectation value) for the occurrence of a match of the same confidence with a totally unrelated, e.g. random, sequence.
Return type: float
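As a rough illustration of how an E-value can be derived from a bit score, the sketch below uses the standard bit-score relation E = m · n · 2^(-S'), where S' is the bit score and m · n is the search space. Treating the search space as the product of both sequence lengths and the database sequence count is an assumption made here for illustration. The name expectation_value_sketch is hypothetical, and the module's actual implementation may differ.

```python
def expectation_value_sketch(bit_score: float,
                             search_sequence_length: int,
                             found_sequence_length: int,
                             number_of_sequences_in_database: int) -> float:
    """Hypothetical stand-in for getExpectationValue(); not the library's code."""
    # Assumed search space: query length x hit length x number of database sequences.
    search_space = (search_sequence_length
                    * found_sequence_length
                    * number_of_sequences_in_database)
    # Standard bit-score relation: E = search_space * 2^(-bit_score).
    return search_space * 2.0 ** (-bit_score)


# Illustrative values only: a 300-residue query, a 300-residue hit,
# and the 4,498 eco genes quoted in the parameter description.
e_value = expectation_value_sketch(200.0, 300, 300, 4498)
print(e_value)
```

Under this assumption, every additional bit of score halves the E-value, while a larger database or longer sequences raise it proportionally.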
FEV_KEGG.Statistics.SequenceComparison.isMatchSignificant(bitScore: float, searchSequenceLength: int, foundSequenceLength: int, numberOfSequencesInDatabase: int, significanceThreshold: float = 1e-15) → bool
Checks whether a sequence match is significant.
Calculates the E-value using getExpectationValue(). If the E-value is smaller than significanceThreshold, the match is considered significant and True is returned.
Parameters:
- bitScore (float) – Comparison score, normalised to base 2 (bits).
- searchSequenceLength (int) – Length of the sequence that was the input of the query.
- foundSequenceLength (int) – Length of the sequence that was the result of the query. If you have several results, run this function for each of them individually.
- numberOfSequencesInDatabase (int) – Count of all sequences that could potentially have been found. For example, a search across all genes of the organism eco uses the count of all sequenced genes for eco, so numberOfSequencesInDatabase = 4,498. A search across all organisms uses the count of all sequenced genes in KEGG, so numberOfSequencesInDatabase = 25,632,969.
- significanceThreshold (float, optional) – Threshold of E-value below which to consider a match significant.
Returns: Whether a sequence match is significant.
Return type: bool
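The threshold check described for isMatchSignificant() can be sketched as follows. This is a minimal, self-contained illustration, not the library's code: the name is_match_significant_sketch, the hit labels, and the E-value formula (search space times 2^(-bit score)) are assumptions. Only the default threshold of 1e-15 and the strict "smaller than" comparison come from the documentation above.

```python
def is_match_significant_sketch(bit_score: float,
                                search_sequence_length: int,
                                found_sequence_length: int,
                                number_of_sequences_in_database: int,
                                significance_threshold: float = 1e-15) -> bool:
    """Hypothetical stand-in for isMatchSignificant(); not the library's code."""
    # Assumed E-value model: search space multiplied by 2^(-bit_score).
    e_value = (search_sequence_length
               * found_sequence_length
               * number_of_sequences_in_database
               * 2.0 ** (-bit_score))
    # A match counts as significant only if its E-value is strictly below the threshold.
    return e_value < significance_threshold


# Several results for one 300-residue query are checked one at a time,
# as the parameter description advises. Names and scores are made up.
hits = [("hit_a", 200.0, 310), ("hit_b", 40.0, 295)]
results = {
    name: is_match_significant_sketch(score, 300, length, 4498)
    for name, score, length in hits
}
print(results)  # {'hit_a': True, 'hit_b': False}
```

Passing a looser significance_threshold, such as 1e-5, would accept weaker matches at the cost of more chance hits.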