Given a string $$s$$, its $$k$$-mer composition composition($$k$$, $$s$$) is the collection of all $$k$$-mer substrings of $$s$$ (including repeated $$k$$-mers). For example,
>>> composition(3, 'TATGGGGTGC')
['ATG', 'GGG', 'GGG', 'GGT', 'GTG', 'TAT', 'TGC', 'TGG']
Note that we have listed the $$k$$-mers in lexicographic order (i.e., how they would appear in a dictionary) rather than in the order of their appearance in TATGGGGTGC. We have done this because the correct ordering of the reads is unknown when they are generated.
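The definition above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference solution: the helper name `kmer_composition` is hypothetical and works on an in-memory string rather than on files.

```python
def kmer_composition(k, s):
    """Return all k-mers of s (repeats included) in lexicographic order."""
    # A string of length n has n - k + 1 substrings of length k;
    # the range is empty when k is larger than n.
    kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
    return sorted(kmers)


print(kmer_composition(3, 'TATGGGGTGC'))
# ['ATG', 'GGG', 'GGG', 'GGT', 'GTG', 'TAT', 'TGC', 'TGG']
```

Sorting a list (rather than building a set) keeps repeated $$k$$-mers such as GGG, as the definition requires.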
Write a function composition that takes three arguments: i) an integer $$k \in \mathbb{N}_0$$, ii) the location of a FASTA file containing a DNA string $$s$$ and iii) another file location. The function must generate the $$k$$-mer composition of the string $$s$$ and write the lexicographically ordered $$k$$-mers in FASTA format to the file whose location is passed as the third argument.
In the following interactive session, we assume that the FASTA files data01.fna and output01.fna are located in the current directory.
>>> composition(5, 'data01.fna', 'output01.fna')
>>> print(open('output01.fna').read().rstrip())
>seq01
AATCC
>seq02
ATCCA
>seq03
CAATC
>seq04
CCAAC
>seq05
TCCAA
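One possible shape for such a function is sketched below. It rests on several assumptions that the text does not spell out: the input file holds a single FASTA record whose sequence may span multiple lines, the helper `read_fasta_sequence` is a hypothetical name, and the `>seqNN` headers with two-digit zero padding are inferred from the sample output only.

```python
def read_fasta_sequence(path):
    """Return the sequence of a single-record FASTA file as one string."""
    with open(path) as handle:
        # Skip header lines (starting with '>') and join the remaining lines.
        return ''.join(
            line.strip() for line in handle if not line.startswith('>')
        )


def composition(k, infile, outfile):
    """Write the sorted k-mer composition of the DNA string in infile to outfile."""
    s = read_fasta_sequence(infile)
    # Collect every substring of length k, then order them lexicographically.
    kmers = sorted(s[i:i + k] for i in range(len(s) - k + 1))
    with open(outfile, 'w') as handle:
        for number, kmer in enumerate(kmers, start=1):
            # Assumption: headers follow the '>seq01', '>seq02', ... pattern
            # of the sample output (zero-padded to at least two digits).
            handle.write(f'>seq{number:02d}\n{kmer}\n')
```

Each $$k$$-mer is written as its own FASTA record, so the output file can be read back with the same helper logic that parsed the input.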