In "Distance between pattern and strings¹" we computed the distance $$d(p, \mathcal{C}_{\text{DNA}})$$ between a $$k$$-mer $$p$$ and a collection of DNA strings $$\mathcal{C}_{\text{DNA}}$$. We now look for a $$k$$-mer $$p$$ that minimizes $$d(p, \mathcal{C}_{\text{DNA}})$$ over all $$k$$-mers $$p$$, which is the task posed by the Equivalent Motif Finding problem. Such a $$k$$-mer is called a median string for $$\mathcal{C}_{\text{DNA}}$$. A median string need not be unique: several $$k$$-mers may attain the same minimal distance.

Assignment

Write a function median_string that takes an integer $$k$$ and the location of a FASTA file containing a collection of DNA strings $$\mathcal{C}_{\text{DNA}}$$. The function must return a set containing all $$k$$-mers that minimize $$d(p, \mathcal{C}_{\text{DNA}})$$ over all $$k$$-mers $$p$$.
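One possible approach is a brute-force search over all $$4^k$$ candidate $$k$$-mers. The sketch below is a minimal illustration, not a reference solution. It assumes that $$d(p, \mathcal{C}_{\text{DNA}})$$ is the sum, over all strings in the collection, of the smallest Hamming distance between $$p$$ and any $$k$$-mer of that string, and that every string has length at least $$k$$. The helpers read_fasta, hamming and distance are names chosen here for illustration.

```python
from itertools import product


def read_fasta(path):
    """Return the sequences of a FASTA file as a list of strings."""
    sequences, chunks = [], []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith('>'):
                # A header line closes the previous record, if any.
                if chunks:
                    sequences.append(''.join(chunks))
                chunks = []
            elif line:
                chunks.append(line.upper())
    if chunks:
        sequences.append(''.join(chunks))
    return sequences


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def distance(pattern, sequences):
    """Assumed definition: sum over all sequences of the minimal Hamming
    distance between the pattern and any k-mer of that sequence."""
    k = len(pattern)
    return sum(
        min(hamming(pattern, s[i:i + k]) for i in range(len(s) - k + 1))
        for s in sequences
    )


def median_string(k, fasta_path):
    """Return the set of all k-mers with minimal distance to the collection."""
    sequences = read_fasta(fasta_path)
    best, medians = None, set()
    # Enumerate all 4^k candidate k-mers over the DNA alphabet.
    for letters in product('ACGT', repeat=k):
        pattern = ''.join(letters)
        d = distance(pattern, sequences)
        if best is None or d < best:
            best, medians = d, {pattern}
        elif d == best:
            medians.add(pattern)
    return medians
```

Because the number of candidates grows as $$4^k$$, this exhaustive search is only practical for small values of $$k$$.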

Example

In the following interactive session, we assume the FASTA files data01.fna², data02.fna³, data03.fna⁴ and data04.fna⁵ to be located in the current directory.

    >>> median_string(3, 'data01.fna')
    {'GAC', 'ACG'}
    >>> median_string(3, 'data02.fna')
    {'CGT', 'ACG'}
    >>> median_string(3, 'data03.fna')
    {'AAA'}
    >>> median_string(3, 'data04.fna')
    {'AAG', 'AAT'}