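Given a profile matrix $$\mathcal{P}$$, we can evaluate the probability of every $$k$$-mer in a string $$s$$ and find a $$\mathcal{P}$$-most probable $$k$$-mer in $$s$$, i.e., a $$k$$-mer that is most likely to have been generated by $$\mathcal{P}$$ among all $$k$$-mers in $$s$$. The probability of a $$k$$-mer is the product of the profile entries selected by its nucleotides, one per column. For example, if $$\mathcal{P}$$ is the profile matrix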
A: | .2 | .2 | .0 | .0 | .0 | .0 | .9 | .1 | .1 | .1 | .3 | .0 |
C: | .1 | .6 | .0 | .0 | .0 | .0 | .0 | .4 | .1 | .2 | .4 | .6 |
G: | .0 | .0 | 1. | 1. | .9 | .9 | .1 | .0 | .0 | .0 | .0 | .0 |
T: | .7 | .2 | .0 | .0 | .1 | .1 | .0 | .5 | .8 | .7 | .3 | .4 |
then ACGGGGATTACC is the $$\mathcal{P}$$-most probable 12-mer in GGTACGGGGATTACCT. Indeed, every other 12-mer in this string has probability 0.
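The example above can be checked with a short sketch. It assumes the profile is held as a dictionary that maps each nucleotide to its row of column probabilities. The names `kmer_probability` and `most_probable_kmer` and this in-memory representation are illustrative assumptions, not part of the assignment, and reading the profile from a file is not covered here.

```python
from math import prod  # available in Python 3.8 and later

# Profile matrix from the example: one row of column probabilities per nucleotide.
PROFILE = {
    'A': [.2, .2, .0, .0, .0, .0, .9, .1, .1, .1, .3, .0],
    'C': [.1, .6, .0, .0, .0, .0, .0, .4, .1, .2, .4, .6],
    'G': [.0, .0, 1., 1., .9, .9, .1, .0, .0, .0, .0, .0],
    'T': [.7, .2, .0, .0, .1, .1, .0, .5, .8, .7, .3, .4],
}


def kmer_probability(kmer, profile):
    """Multiply the profile entries selected by each nucleotide of the k-mer."""
    return prod(profile[base][column] for column, base in enumerate(kmer))


def most_probable_kmer(s, profile):
    """Return a profile-most probable k-mer in s, where k is the profile width."""
    k = len(profile['A'])
    kmers = (s[i:i + k] for i in range(len(s) - k + 1))
    # max() keeps the first maximal element, so ties go to the leftmost k-mer.
    return max(kmers, key=lambda kmer: kmer_probability(kmer, profile))


print(most_probable_kmer('GGTACGGGGATTACCT', PROFILE))  # ACGGGGATTACC
```

Because `max` returns the first k-mer that attains the highest probability, ties are resolved in favour of the leftmost occurrence. Whether that is the intended tie-breaking rule is an assumption of this sketch.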
In the following interactive session, we assume that the FASTA files data01.fna, data02.fna and data03.fna and the text files data01.prof, data02.prof and data03.prof are located in the current directory.
>>> profilemost_probable_kmer('data01.fna', 'data01.prof')
'CCGAG'
>>> profilemost_probable_kmer('data02.fna', 'data02.prof')
'AGCAGCTT'
>>> profilemost_probable_kmer('data03.fna', 'data03.prof')
'AAGCAGAGTTTA'