Given a $$k$$-mer $$p$$ and a longer string $$s$$, we use $$d(p, s)$$ to denote the minimum Hamming distance between $$p$$ and any $$k$$-mer in $$s$$.

\[ d(p, s) = \min\limits_{\text{all }k\text{-mers } p' \text{ in } s}{\text{hammingDistance}(p, p')} \]

Given a $$k$$-mer $$p$$ and a collection of DNA strings $$\mathcal{C}_\text{DNA} = \left\{s_1, \ldots, s_n\right\}$$, we define $$d(p, \mathcal{C}_\text{DNA})$$ as the sum of the distances between $$p$$ and each string in $$\mathcal{C}_\text{DNA}$$.

\[ d(p, \mathcal{C}_\text{DNA}) = \sum_{i=1}^{n}d(p, s_i) \]
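The two definitions can be sketched in Python as follows. This is a minimal illustration, not a reference solution: the helper names `hamming_distance`, `min_distance` and `total_distance` are hypothetical, the collection is passed as an in-memory list rather than read from a FASTA file, and every string is assumed to be at least as long as the $$k$$-mer.

```python
def hamming_distance(p, q):
    """Number of positions at which two equal-length strings differ."""
    if len(p) != len(q):
        raise ValueError('strings must have equal length')
    return sum(a != b for a, b in zip(p, q))


def min_distance(p, s):
    """d(p, s): minimum Hamming distance between p and any k-mer of s."""
    k = len(p)
    # Assumption: len(s) >= k, so there is at least one k-mer to compare.
    return min(hamming_distance(p, s[i:i + k]) for i in range(len(s) - k + 1))


def total_distance(p, strings):
    """d(p, C): sum of d(p, s) over all strings s in the collection."""
    return sum(min_distance(p, s) for s in strings)


strings = ['TTACCTTAAC', 'GATATCTGTC', 'ACGGCGTTCG', 'CCCTAAAGAG', 'CGTCAGAGGT']
print([min_distance('AAA', s) for s in strings])  # [1, 1, 2, 0, 1]
print(total_distance('AAA', strings))             # 5
```

Sliding a window of length $$k$$ over $$s$$ visits all $$|s| - k + 1$$ candidate $$k$$-mers, so one call to `min_distance` takes time proportional to $$k \cdot (|s| - k + 1)$$.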

Assignment

Example

In the following interactive session, we assume that the FASTA files data01.fna and data02.fna are located in the current directory.

>>> distance_to_string('AAA', 'TTACCTTAAC')
1
>>> distance_to_string('AAA', 'GATATCTGTC')
1
>>> distance_to_string('AAA', 'ACGGCGTTCG')
2
>>> distance_to_string('AAA', 'CCCTAAAGAG')
0
>>> distance_to_string('AAA', 'CGTCAGAGGT')
1

>>> distance_to_strings('AAA', 'data01.fna')
5
>>> distance_to_strings('TAA', 'data02.fna')
3
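One possible shape for the two functions used in the session is sketched below. The signatures are inferred from the calls above, not from an assignment text. The helper `read_fasta` is hypothetical, and it assumes a plain FASTA layout: header lines start with `>` and a sequence may span several lines.

```python
def read_fasta(filename):
    """Return the sequences in a FASTA file as a list of strings."""
    sequences, current = [], []
    with open(filename) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith('>'):
                # A header starts a new record; store the previous sequence.
                if current:
                    sequences.append(''.join(current))
                current = []
            elif line:
                # Sequence data may be wrapped over several lines.
                current.append(line)
    if current:
        sequences.append(''.join(current))
    return sequences


def distance_to_string(pattern, string):
    """Minimum Hamming distance between pattern and any k-mer of string."""
    k = len(pattern)
    # Assumption: string is at least as long as pattern.
    return min(
        sum(a != b for a, b in zip(pattern, string[i:i + k]))
        for i in range(len(string) - k + 1)
    )


def distance_to_strings(pattern, filename):
    """Sum of distance_to_string over all sequences in a FASTA file."""
    return sum(distance_to_string(pattern, s) for s in read_fasta(filename))
```

Keeping the file parsing in its own helper lets the distance computation be tested on plain strings, independently of any file.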