We will now turn to randomized algorithms1 that flip coins and roll dice in order to search for motifs. Making random algorithmic decisions may sound like a disastrous idea. Just imagine a chess game in which every move would be decided by rolling a die. However, an 18th Century French mathematician and naturalist, Comte de Buffon2, first proved that randomized algorithms are useful by randomly dropping needles3 onto parallel strips of wood and using the results of this experiment to accurately approximate the constant $$\pi$$.
Randomized algorithms may be non-intuitive because they lack the control of traditional algorithms. Some randomized algorithms are Las Vegas algorithms4, which deliver solutions that are guaranteed to be exact, despite the fact that they rely on making random decisions. Yet most randomized algorithms are Monte Carlo algorithms5. These algorithms are not guaranteed to return exact solutions, but they do quickly find approximate solutions. Because of their speed, they can be run many times, allowing us to choose the best approximation from thousands of runs.
A randomized algorithm for motif finding is given below.
RandomizedMotifSearch(k, CDNA) randomly select k-mers Motifs = (Motif1, …, Motift) in each string from CDNA BestMotifs ← Motifs while forever Profile ← Profile(Motifs) Motifs ← Motifs(Profile, CDNA) if Score(Motifs) < Score(BestMotifs) BestMotifs ← Motifs else return BestMotifs
Write a function randomized_motif_search that takes an integer $$k$$ and the location of a FASTA file containing a collection of DNA strings $$\mathcal{C}_\text{DNA}$$. The function must return a tuple containing the $$k$$-mers resulting from a randomized motif search with pseudocounts in $$\mathcal{C}_\text{DNA}$$. The function must repeat the randomization process 1000 times.
In the following interactive session, we assume the FASTA files data01.fna6, data02.fna78 and data03.fna9 to be located in the current directory.
>>> randomized_motif_search(8, 'data01.fna') ('TCTCGGGG', 'CCAAGGTG', 'TACAGGCG', 'TTCAGGTG', 'TCCACGTG') >>> randomized_motif_search(6, 'data02.fna') ('CGATAA', 'GGTTAA', 'GGTATA', 'GGTTAA', 'GGTTAC', 'GGTTAA', 'GGCCAA', 'GGTTAA') >>> randomized_motif_search(6, 'data03.fna') ('TTAACC', 'ATAACT', 'TTAACC', 'TGAAGT', 'TTAACC', 'TTAAGC', 'TTAACC', 'TTACTC')