We will now turn to randomized algorithms¹ that flip coins and roll dice in order to search for motifs. Making random algorithmic decisions may sound like a disastrous idea. Just imagine a chess game in which every move would be decided by rolling a die. However, an 18th Century French mathematician and naturalist, Comte de Buffon², first proved that randomized algorithms are useful by randomly dropping needles³ onto parallel strips of wood and using the results of this experiment to accurately approximate the constant $$\pi$$.

Randomized algorithms may be non-intuitive because they lack the control of traditional algorithms. Some randomized algorithms are Las Vegas algorithms⁴, which deliver solutions that are guaranteed to be exact, despite the fact that they rely on making random decisions. Yet most randomized algorithms are Monte Carlo algorithms⁵. These algorithms are not guaranteed to return exact solutions, but they do quickly find approximate solutions. Because of their speed, they can be run many times, allowing us to choose the best approximation from thousands of runs.

A randomized algorithm for motif finding is given below.

        RandomizedMotifSearch(k, CDNA)
    randomly select k-mers Motifs = (Motif1, …, Motift) in each string from CDNA
    BestMotifs ← Motifs
    while forever
        Profile ← Profile(Motifs)
        Motifs ← Motifs(Profile, CDNA)
        if Score(Motifs) < Score(BestMotifs)
            BestMotifs ← Motifs
        else
            return BestMotifs
      

Assignment

Write a function randomized_motif_search that takes an integer $$k$$ and the location of a FASTA file containing a collection of DNA strings $$\mathcal{C}_\text{DNA}$$. The function must return a tuple containing the $$k$$-mers resulting from a randomized motif search with pseudocounts in $$\mathcal{C}_\text{DNA}$$. The function must repeat the randomization process 1000 times.

Example

In the following interactive session, we assume the FASTA files data01.fna⁶, data02.fna⁷⁸ and data03.fna⁹ to be located in the current directory.

        >>> randomized_motif_search(8, 'data01.fna')
('TCTCGGGG', 'CCAAGGTG', 'TACAGGCG', 'TTCAGGTG', 'TCCACGTG')
>>> randomized_motif_search(6, 'data02.fna')
('CGATAA', 'GGTTAA', 'GGTATA', 'GGTTAA', 'GGTTAC', 'GGTTAA', 'GGCCAA', 'GGTTAA')
>>> randomized_motif_search(6, 'data03.fna')
('TTAACC', 'ATAACT', 'TTAACC', 'TGAAGT', 'TTAACC', 'TTAAGC', 'TTAACC', 'TTACTC')