On counting the frequency distribution of string motifs in molecular sequences
Prosperi MCF., Prosperi L., Gray RR., Salemi M.
This work investigates frequency distributions of strings within a text. The mathematical derivation accounts for variable alphabet size, character probabilities, and string/text lengths, under both the Bernoullian and the Markovian model for string generation. The analysis is limited to the set of non-clumpable strings, i.e. strings that cannot overlap with themselves. Two formulae (exact and approximate) are derived for the frequency distribution of a string of length m found inside a text of length n (with m < n). The approximate formula has constant computational complexity (in contrast to the exponential complexity of the exact one), which makes it applicable to very long texts. The proposed formulae were applied to analyze string frequencies in a portion of the human genome, and to recalculate frequencies of known repeated motifs within genes that are associated with genetic diseases. A comparison with state-of-the-art methods is provided. The formulae presented here can be of use in the statistical evaluation of specific motif frequencies within very long texts (e.g. genes or genomes) and can help in characterizing motifs in pathological conditions. © 2012 World Scientific Publishing Company.
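To make the setting concrete, the following is a minimal Python sketch of the quantities discussed above, restricted to the Bernoullian (independent characters) model. It is not the exact or approximate formula derived in this work; it uses the textbook inclusion-exclusion identity for words that cannot overlap with themselves. The function names (`is_non_clumpable`, `expected_count`, `prob_exact_count`) and the `probs` dictionary of character probabilities are illustrative assumptions.

```python
from math import comb, prod


def is_non_clumpable(word: str) -> bool:
    """Return True if no proper prefix of `word` equals its suffix of the
    same length, i.e. two occurrences of `word` can never overlap."""
    return not any(word[:k] == word[-k:] for k in range(1, len(word)))


def word_probability(word: str, probs: dict) -> float:
    """Probability of `word` at a fixed position under the Bernoullian
    model (characters drawn independently with probabilities `probs`)."""
    return prod(probs[c] for c in word)


def expected_count(word: str, n: int, probs: dict) -> float:
    """Expected number of occurrences of `word` in a random text of length n."""
    m = len(word)
    if m > n:
        return 0.0
    return (n - m + 1) * word_probability(word, probs)


def prob_exact_count(word: str, n: int, probs: dict, k: int) -> float:
    """P(exactly k occurrences) for a non-clumpable word, Bernoullian model.

    Standard inclusion-exclusion (an assumption of this sketch, not the
    paper's formula): j non-overlapping copies of a length-m word fit into
    n positions in comb(n - j*(m - 1), j) ways, each with probability p**j.
    """
    if not is_non_clumpable(word):
        raise ValueError("identity only holds for non-clumpable words")
    m = len(word)
    p = word_probability(word, probs)
    total = 0.0
    for j in range(k, n // m + 1):  # at most n // m disjoint copies fit
        sign = -1.0 if (j - k) % 2 else 1.0
        total += sign * comb(j, k) * comb(n - j * (m - 1), j) * p ** j
    return total
```

For example, `is_non_clumpable("CAG")` holds, whereas `"ABA"` is clumpable because its first and last characters coincide. The number of terms in `prob_exact_count` grows with n, so this sketch is only practical for moderately long texts.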