Consider this greedy approach for finding $$k$$-mer motifs in a collection of DNA strings $$\mathcal{C}_\text{DNA}$$.
GreedyMotifSearch(k, CDNA)
BestMotifs ← motif matrix formed by first k-mers in each string from CDNA
for each k-mer Motif in the first string from CDNA
Motif1 ← Motif
for i = 2 to t
form Profile from motifs Motif1, …, Motifi - 1
Motifi ← Profile-most probable k-mer in the i-th string in CDNA
Motifs ← (Motif1, …, Motift)
if Score(Motifs) < Score(BestMotifs)
BestMotifs ← Motifs
return BestMotifs
Write a function greedy_motif_search that takes an integer $$k$$ and the location of a FASTA file containing a collection of DNA strings $$\mathcal{C}_\text{DNA}$$. The function must return a tuple containing the $$k$$-mers resulting from a greedy motif search in $$\mathcal{C}_\text{DNA}$$. If at any step the function finds more than one $$\mathcal{P}$$-most probable $$k$$-mers in a given DNA string, it must use the one occurring first (the leftmost one).
In the following interactive session, we assume the FASTA files data01.fna1, data02.fna2, data03.fna3, data04.fna4 and data05.fna5 to be located in the current directory.
>>> greedy_motif_search(3, 'data01.fna') ('CAG', 'CAG', 'CAA', 'CAA', 'CAA') >>> greedy_motif_search(3, 'data02.fna') ('GCC', 'GCC', 'AAC', 'TTC') >>> greedy_motif_search(5, 'data03.fna') ('GAGGC', 'TCATC', 'TCGGC', 'GAGTC', 'GCAGC', 'GCGGC', 'GCGGC', 'GCATC') >>> greedy_motif_search(6, 'data04.fna') ('GTGCGT', 'GTGCGT', 'GCGCCA', 'GTGCCA', 'GCGCCA') >>> greedy_motif_search(5, 'data05.fna') ('GCAGC', 'TCATT', 'GGAGT', 'TCATC', 'GCATC', 'GCATC', 'GGTAT', 'GCAAC')