In the exercise "Profile-most probable k-mer", we defined the notion of a $$\mathcal{P}$$-most probable $$k$$-mer in a string. We now define a $$\mathcal{P}$$-randomly generated $$k$$-mer in a string $$s$$. For each $$k$$-mer $$p$$ in $$s$$, compute the probability $$P(p | \mathcal{P})$$, resulting in $$n = |s| - k + 1$$ probabilities $$(p_1, \ldots, p_n)$$. These probabilities do not necessarily sum to 1, but we can still form the random number generator $$\text{random}(p_1, \ldots, p_n)$$ based on them. GibbsSampler uses this random number generator to select a $$\mathcal{P}$$-randomly generated $$k$$-mer at each step: if the die rolls the number $$i$$, then we define the $$\mathcal{P}$$-randomly generated $$k$$-mer as the $$i$$-th $$k$$-mer in $$s$$.

        GibbsSampler(k, CDNA, N)
    randomly select k-mers Motifs = (Motif1, …, Motift) in each string from CDNA
    BestMotifs ← Motifs
    for j ← 1 to N
        i ← Random(t)
        Profile ← profile matrix constructed from all strings in Motifs except for Motifi
        Motifi ← Profile-randomly generated k-mer in the i-th sequence
        if Score(Motifs) < Score(BestMotifs)
            BestMotifs ← Motifs
    return BestMotifs
      

Assignment

Write a function gibbs_sampler that takes three arguments: i) an integer $$k \in \mathbb{N}_0$$, ii) the location of a FASTA file containing a collection of DNA strings $$\mathcal{C}_\text{DNA}$$ and iii) an integer $$N \in \mathbb{N}_0$$. The function must return a tuple containing the $$k$$-mers resulting from Gibbs sampling in $$\mathcal{C}_\text{DNA}$$ $$N$$ iterations, pseudocounts and 20 random starts.

Example

In the following interactive session, we assume the FASTA files data01.fna¹, data02.fna²³ and data03.fna⁴ to be located in the current directory.

        >>> gibbs_sampler(8, 'data01.fna', 100)
('TCTCGGGG', 'CCAAGGTG', 'TACAGGCG', 'TTCAGGTG', 'TCCACGTG')
>>> gibbs_sampler(6, 'data02.fna', 100)
('CGATAA', 'GGTTAA', 'GGTATA', 'GGTTAA', 'GGTTAC', 'GGTTAA', 'GGCCAA', 'GGTTAA')
>>> gibbs_sampler(6, 'data03.fna', 100)
('TTAACC', 'ATAACT', 'TTAACC', 'TGAAGT', 'TTAACC', 'TTAAGC', 'TTAACC', 'TTACTC')