In "Distance between pattern and strings" we computed the distance $$d(p, \mathcal{C}_{\text{DNA}})$$ between a $$k$$-mer $$p$$ and a collection of DNA strings $$\mathcal{C}_{\text{DNA}}$$. We now look for a $$k$$-mer $$p$$ that minimizes $$d(p, \mathcal{C}_{\text{DNA}})$$ over all $$k$$-mers $$p$$, which is the same goal as that of the Equivalent Motif Finding problem. We call such a $$k$$-mer a median string for $$\mathcal{C}_{\text{DNA}}$$.
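
As a reminder, and assuming the earlier exercise uses the usual definition, the distance between a $$k$$-mer $$p$$ and a single string $$s$$ is the smallest Hamming distance $$d_H$$ between $$p$$ and any $$k$$-mer occurring in $$s$$, and the distance to a collection is the sum over its strings. The symbols $$s$$, $$q$$ and $$d_H$$ are introduced here for illustration only:

$$d(p, s) = \min_{q \text{ a } k\text{-mer of } s} d_H(p, q), \qquad d(p, \mathcal{C}_{\text{DNA}}) = \sum_{s \in \mathcal{C}_{\text{DNA}}} d(p, s)$$

A median string is then any $$k$$-mer $$p$$ for which this sum is as small as possible. Several $$k$$-mers may reach the same minimum.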

Assignment

Write a function median_string that takes an integer $$k$$ and the location of a FASTA file containing a collection of DNA strings $$\mathcal{C}_{\text{DNA}}$$. The function must return a set containing all $$k$$-mers that minimize $$d(p, \mathcal{C}_{\text{DNA}})$$ over all $$k$$-mers $$p$$.
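
One possible approach is a brute-force search over all $$4^k$$ candidate $$k$$-mers. The sketch below is a minimal illustration of that idea, not a reference solution. The helper names read_fasta, hamming, distance and median_strings are hypothetical, and the code assumes that the distance is the sum, over all strings, of the minimal Hamming distance between the pattern and any $$k$$-mer of the string, that the alphabet is A, C, G, T, and that every string has length at least $$k$$.

```python
from itertools import product


def read_fasta(filename):
    """Return the sequences in a FASTA file as a list of strings."""
    sequences, chunks = [], []
    with open(filename) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith('>'):
                # a header line starts a new record: store the previous one
                if chunks:
                    sequences.append(''.join(chunks))
                chunks = []
            elif line:
                # a sequence may be spread over several lines
                chunks.append(line.upper())
    if chunks:
        sequences.append(''.join(chunks))
    return sequences


def hamming(p, q):
    """Number of positions at which two equally long strings differ."""
    return sum(a != b for a, b in zip(p, q))


def distance(pattern, sequences):
    """Sum over all sequences of the minimal Hamming distance between
    the pattern and any k-mer of the sequence (assumed definition)."""
    k = len(pattern)
    return sum(
        min(hamming(pattern, seq[i:i + k]) for i in range(len(seq) - k + 1))
        for seq in sequences
    )


def median_strings(k, sequences):
    """Set of all k-mers with minimal distance to the given sequences."""
    best, medians = None, set()
    for letters in product('ACGT', repeat=k):
        pattern = ''.join(letters)
        d = distance(pattern, sequences)
        if best is None or d < best:
            # strictly better: restart the set of candidates
            best, medians = d, {pattern}
        elif d == best:
            # tie: keep every k-mer that reaches the minimum
            medians.add(pattern)
    return medians


def median_string(k, filename):
    """Median strings of length k for the DNA strings in a FASTA file."""
    return median_strings(k, read_fasta(filename))
```

Splitting the work this way keeps the file handling separate from the search, so median_strings can be tried directly on a list of strings. The running time grows as $$4^k$$ times the total length of the strings times $$k$$, which is only practical for small $$k$$.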

Example

In the following interactive session, we assume the FASTA files data01.fna, data02.fna, data03.fna and data04.fna to be located in the current directory.

>>> median_string(3, 'data01.fna')
{'GAC', 'ACG'}
>>> median_string(3, 'data02.fna')
{'CGT', 'ACG'}
>>> median_string(3, 'data03.fna')
{'AAA'}
>>> median_string(3, 'data04.fna')
{'AAG', 'AAT'}