Cookies on this website

We use cookies to ensure that we give you the best experience on our website. If you click 'Accept all cookies' we'll assume that you are happy to receive all cookies and you won't see this message again. If you click 'Reject all non-essential cookies' only necessary cookies providing core functionality such as security, network management, and accessibility will be enabled. Click 'Find out more' for information on how to change your cookie settings.

This work investigates frequency distributions of strings within a text. The mathematical derivation accounts for variable alphabet size, character probabilities, and string/text lengths, under both the Bernoullian and the Markovian model for string generation. The analysis is limited to the set of non-clumpable strings, that cannot overlap with themselves. Two formulae (exact and approximated) are derived, calculating the frequency distribution of a string of length m found inside a text of length n (with m < n). The approximated formula has a constant complexity (in contrast to an exponential complexity of the exact) and makes it applicable to very long texts. The proposed formulae were applied to analyze string frequencies in a portion of the human genome, and to recalculate frequencies of known repeated motif within genes, associated to genetic diseases. A comparison with state-of-the-art methods was provided. The formulae presented here can be of use in the statistical evaluation of specific motif frequencies within very long texts (e.g. genes or genomes) and help in characterizing motifs in pathologic conditions. © 2012 World Scientific Publishing Company.

Original publication




Journal article


International Journal of Biomathematics

Publication Date