Given integers $$L$$ and $$t$$, a string $$p$$ forms an $$(L, t)$$-clump inside a (larger) string $$s$$ if there is an interval of length $$L$$ in $$s$$ in which $$p$$ appears at least $$t$$ times. Appearances of $$p$$ are allowed to overlap within the interval. For example, TGCA forms a $$(25,3)$$-clump in the string

gatcagcataagggtcccTGCAATGCATGACAAGCCTGCAgttgttttac
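
The definition can be checked directly for a single pattern. The following is only a minimal sketch: the helper name `forms_clump` is hypothetical and not part of the assignment, and it assumes that $$t \geq 1$$, that $$L$$ does not exceed the length of $$s$$, and that both strings use the same letter case (the example string above is uppercased before the call).

```python
def forms_clump(p, s, L, t):
    """Return True if p occurs at least t times (overlaps allowed)
    inside some interval of length L of s.

    Hypothetical helper; assumes t >= 1 and L <= len(s).
    """
    # Start positions of every (possibly overlapping) occurrence of p in s.
    starts = [i for i in range(len(s) - len(p) + 1) if s.startswith(p, i)]
    # t consecutive occurrences fit in one interval of length L when the
    # distance from the first start to the end of the t-th occurrence is <= L.
    return any(
        starts[j + t - 1] + len(p) - starts[j] <= L
        for j in range(len(starts) - t + 1)
    )


example = 'gatcagcataagggtcccTGCAATGCATGACAAGCCTGCAgttgttttac'.upper()
print(forms_clump('TGCA', example, 25, 3))  # True
```

In the example, the three occurrences of TGCA span exactly 22 positions, so the same call with $$L = 21$$ returns False.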

Assignment

Write a function clump_finding that takes a DNA string $$s$$ and three integers $$k$$, $$L$$ and $$t$$. The function must return a set containing all distinct $$k$$-mers that form $$(L, t)$$-clumps in $$s$$.
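
One possible approach, given here only as a sketch and not necessarily the intended solution, is to record the start positions of every $$k$$-mer in a single pass over $$s$$ and then test, for each $$k$$-mer, whether any $$t$$ consecutive occurrences fit within $$L$$ positions. The sketch assumes that $$s$$ is a plain Python string, that $$t \geq 1$$ and that $$L$$ does not exceed the length of $$s$$.

```python
from collections import defaultdict


def clump_finding(s, k, L, t):
    """Return the set of k-mers that form an (L, t)-clump in s.

    Sketch only; assumes s is a str, t >= 1 and L <= len(s).
    """
    # Map each k-mer to the sorted list of its start positions in s.
    positions = defaultdict(list)
    for i in range(len(s) - k + 1):
        positions[s[i:i + k]].append(i)

    clumps = set()
    for kmer, starts in positions.items():
        # Slide over groups of t consecutive occurrences; the group fits in
        # an interval of length L if it spans at most L positions.
        for j in range(len(starts) - t + 1):
            if starts[j + t - 1] + k - starts[j] <= L:
                clumps.add(kmer)
                break
    return clumps
```

Because only the stored positions of each $$k$$-mer are compared, this avoids recounting all $$k$$-mers for every window of length $$L$$.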

Example

In the following interactive session, we assume that the FASTA file data.fna is located in the current directory.

>>> clump_finding('CGGACTCGACAGATGTGAAGAAATGTGAAGACTGAGTGAAGAGAAGAGGAAACACGACACGACATTGCGACATAATGTACGAATGTAATGTGCCTATGGC', 5, 75, 4)
{'GAAGA', 'CGACA', 'AATGT'}
>>> clump_finding('AAAACGTCGAAAAA', 2, 4, 2)
{'AA'}
>>> clump_finding('ACGTACGT', 1, 5, 2)
{'G', 'T', 'C', 'A'}

>>> from Bio import SeqIO
>>> clump_finding(*SeqIO.parse('data.fna', 'fasta'), 11, 566, 18)
{'AAACCAGGTGG'}