Given integers $$L$$ and $$t$$, a string $$p$$ forms an $$(L, t)$$-clump inside a (larger) string $$s$$ if there is an interval of length $$L$$ in $$s$$ in which $$p$$ appears at least $$t$$ times. Appearances of $$p$$ are allowed to overlap within the interval. For example, TGCA forms a $$(25,3)$$-clump in the string:
gatcagcataagggtcccTGCAATGCATGACAAGCCTGCAgttgttttac
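The definition can be checked directly for a single pattern. The following is a minimal sketch, not part of the original exercise: the helper name `forms_clump` is hypothetical, and it assumes that $$t \geq 1$$, that $$s$$ is at least $$L$$ characters long, and that the uppercase letters in the example string serve only to highlight the occurrences (so the string is uppercased before matching).

```python
def forms_clump(p, s, L, t):
    """Return True if p occurs at least t times (overlaps allowed)
    inside some interval of length L of s.

    Hypothetical helper; assumes t >= 1 and len(s) >= L.
    """
    # Start positions of all (possibly overlapping) occurrences of p in s.
    starts = [i for i in range(len(s) - len(p) + 1) if s.startswith(p, i)]
    # t consecutive occurrences fit in one interval of length L exactly when
    # the distance from the first start to the end of the t-th is at most L.
    for first, last in zip(starts, starts[t - 1:]):
        if last + len(p) - first <= L:
            return True
    return False


example = 'gatcagcataagggtcccTGCAATGCATGACAAGCCTGCAgttgttttac'
print(forms_clump('TGCA', example.upper(), 25, 3))  # True
```

In the example the three occurrences of TGCA start at positions 18, 23 and 36 (0-based), so together they span 22 characters, which fits in an interval of length 25.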
In the following interactive session, we assume that the FASTA file data.fna is located in the current directory.
>>> clump_finding('CGGACTCGACAGATGTGAAGAAATGTGAAGACTGAGTGAAGAGAAGAGGAAACACGACACGACATTGCGACATAATGTACGAATGTAATGTGCCTATGGC', 5, 75, 4)
{'GAAGA', 'CGACA', 'AATGT'}
>>> clump_finding('AAAACGTCGAAAAA', 2, 4, 2)
{'AA'}
>>> clump_finding('ACGTACGT', 1, 5, 2)
{'G', 'T', 'C', 'A'}
>>> from Bio import SeqIO
>>> clump_finding(*SeqIO.parse('data.fna', 'fasta'), 11, 566, 18)
{'AAACCAGGTGG'}
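One possible way to obtain such results for plain strings is sketched below. This is an illustration under stated assumptions, not the exercise's reference solution: the name `find_clumps` is hypothetical, the argument order (string, pattern length $$k$$, $$L$$, $$t$$) is inferred from the calls in the session, $$t \geq 1$$ and a string of at least $$L$$ characters are assumed, and the sketch only accepts `str` input (it does not handle the Biopython records passed in the last call).

```python
from collections import defaultdict


def find_clumps(s, k, L, t):
    """Return the set of k-mers that form an (L, t)-clump in the string s.

    Hypothetical sketch; assumes t >= 1, len(s) >= L and that s is a str.
    """
    # Record the start positions of every k-mer, in increasing order.
    positions = defaultdict(list)
    for i in range(len(s) - k + 1):
        positions[s[i:i + k]].append(i)

    clumps = set()
    for kmer, starts in positions.items():
        # Compare each occurrence with the one t - 1 places further on:
        # if both fit in an interval of length L, the k-mer forms a clump.
        for first, last in zip(starts, starts[t - 1:]):
            if last + k - first <= L:
                clumps.add(kmer)
                break
    return clumps


print(find_clumps('AAAACGTCGAAAAA', 2, 4, 2))  # {'AA'}
```

Grouping positions per k-mer means the string is scanned only once, after which each k-mer is tested with a single pass over its own positions, instead of recounting every window of length $$L$$ from scratch.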