In bioinformatics, FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format originates from the FASTA software package, but has since become a de facto standard in the field. A FASTA file may contain multiple records, where each record corresponds to a single nucleotide or amino acid sequence. A record begins with a single-line description, followed by one or more lines of sequence data. The description line is distinguished from the sequence data by a greater-than (>) symbol in the first column.

The example FASTA file below contains four records. To show that the number of sequence characters per line is not fixed, the sequences are split over lines using a different line width for each record. Line lengths may also vary within a single record.

>read1
AAAAAAAAGGTGCGGAGGAAAGCAACACATTTGTTCCTCAGGACTCTTCA
GCGGGAGATATCTGCAGAACCAAACACGCTCAAAGACCCGCGCAAATCGG
CAAATTGCCTGACGTAGAACACCGACCTAGCGTGTTTATTATGATACTCG
GCACCTCTGACTTAATCAAACGTTGTCGAGGTGGAGATGGTATCATCTGG
CGTTAGGCATAACGAGCGTGACACTAGCTTC
>read2
CTCTCGGAAGTTTGTCCGCACCGACATAAATAGACTGATACTGATCAGGGGGACGGTACG
ACCCACTCTGTCGATCGAACCAGTGACCTGTTCGCTTCGTAACTGGCCAGACGATAGATC
TTAGCATAACCGACGCGAAGTGTGGCAGATAAGATCCCAAGGTAGTAAATAGTACATATT
AGTGGTCAACAGGTTTTTAGCGCAGCAGCTGATCTATGCAATTGACTGCAACACCATGAC
GTAGGTTGCTTCTATAAGAACAAGTTTACCGCTACGAATTCGCGTCGGTTCCGTACATA
>read3
AAAAAAAAACGTCAGAAACGTGAGTAGGGTTACCCACACGCATCTAAGAATGCTCGGGCAATGTGACGGT
ATGAAGTTGAGAGCCTCTGTTACCCTCCACCTGATGGGGTGGCGGCCATTTACAGTATTGCTTAGCGCAC
TCAGATATACGCATGATGGGACTGATTCCCCAGGCGAGTACGTAGTCACCCGCGGCGACTCGACAAGGAT
ACTATTATCAGGGTTCTCCCCGGGAGGAGGTATTAAGA
>read4
AAAAAAAAAATGTCATGACCCTAGAAGGCCCTGCATATACATGGCAGGGCAGTCTATCAGCGCCCATCATCATCGCTGAC
GTAGTTGGAGCCGTATCTGTACTGGATCTAGGGGGCATCGTGAACTAGCGAGGTCGTGTGACGCGCTACAAAGGCTCGGC
CCATCTGGAACGTCAGACGAGGCTTCTCTACGGGGGTCCTGCCGTGGGCTATTGAGGTGCAAAGTTCGAATTCGGCACTG
TCGCGTGTAATTGAATTCGTGCCCAGT
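
The record layout described above can be read with a few lines of Python. The following is a minimal sketch, not a reference parser: the function name read_fasta is hypothetical, and the example data is a shortened version of the first two records shown above.

```python
def read_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration; sequence lines of any length
    are joined into one string per record.
    """
    description, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            # a new description line closes the previous record, if any
            if description is not None:
                yield description, ''.join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    # emit the last record once the input is exhausted
    if description is not None:
        yield description, ''.join(chunks)


# shortened example data (assumption: truncated copies of read1 and read2)
example = """>read1
AAAAAAAAGGTG
CGGAGG
>read2
CTCTCGGAAGTTTGTCC
"""
records = list(read_fasta(example.splitlines()))
print(records)
# [('read1', 'AAAAAAAAGGTGCGGAGG'), ('read2', 'CTCTCGGAAGTTTGTCC')]
```

Because the sequence lines are simply concatenated, the parser does not care how many characters each line holds.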

Determining an organism's complete genome (called genome sequencing) forms a central task of bioinformatics. Unfortunately, we still don't possess the microscope technology to zoom into the nucleotide level and determine the sequence of a genome's nucleotides, one at a time. However, researchers can apply chemical methods to generate and identify much smaller snippets of DNA, called reads.

Because the current generation of sequencing machines produces so many reads at once, it is possible to sequence different organisms in parallel. This is done by fragmenting the DNA of the organisms and sequencing these fragments in a single run. To link each sequenced fragment to the organism it originates from, each fragment is labeled with an organism-specific barcode. 454 Life Sciences (Roche) sequencers use a barcode of eight nucleotides that prefixes each of the sequence fragments.
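
Since the barcode is a fixed-length prefix, separating it from the actual fragment comes down to string slicing. The sketch below assumes the eight-nucleotide barcode mentioned above; the names split_barcode and group_by_barcode are hypothetical, and the reads are shortened versions of the example records.

```python
from collections import defaultdict

BARCODE_LENGTH = 8  # eight-nucleotide prefix, as described for 454 sequencers


def split_barcode(sequence, length=BARCODE_LENGTH):
    """Return the (barcode, fragment) pair for one read."""
    return sequence[:length], sequence[length:]


def group_by_barcode(records, length=BARCODE_LENGTH):
    """Group (description, sequence) pairs by the barcode prefix.

    The barcode is stripped from each sequence before it is stored.
    """
    groups = defaultdict(list)
    for description, sequence in records:
        barcode, fragment = split_barcode(sequence, length)
        groups[barcode].append((description, fragment))
    return dict(groups)


# shortened reads (assumption: truncated copies of read1, read2 and read3)
reads = [
    ('read1', 'AAAAAAAAGGTGCGGAGG'),
    ('read2', 'CTCTCGGAAGTTTGTCCG'),
    ('read3', 'AAAAAAAAACGTCAGAAA'),
]
groups = group_by_barcode(reads)
for barcode, members in groups.items():
    print(barcode, members)
```

Here read1 and read3 share the barcode AAAAAAAA and end up in the same group, while read2 carries the barcode CTCTCGGA.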

[Figure: DNA pooling]

Assignment

You are given a FASTA file containing the reads generated by a single multiplexed run on a 454 Life Sciences sequencer. Your task is to split this file into multiple FASTA files, one per barcode, each containing the reads labeled with that barcode. The example below shows the expected behaviour of the two functions involved, outputFasta and demultiplexFasta.

Example

In the following interactive session we assume the current directory to contain the text file reads.fasta. The content of this file is the same as the example FASTA file displayed in the introduction.

>>> outputFasta('read1', 'GGTGCGGAGGAAAGCAACACATTTGT', width=10)
>read1
GGTGCGGAGG
AAAGCAACAC
ATTTGT

>>> output = open('out.fasta', 'w')
>>> outputFasta('read2', 'AGTTTGTCCGCACCGACATAAATAGA', width=10, file=output)
>>> outputFasta('read3', 'ACGTCAGAAACGTGAG', width=10, file=output)
>>> output.close()
>>> print(open('out.fasta', 'r').read(), end='')
>read2
AGTTTGTCCG
CACCGACATA
AATAGA
>read3
ACGTCAGAAA
CGTGAG

>>> demultiplexFasta('reads.fasta', width=60)

>>> print(open('AAAAAAAA.fasta', 'r').read(), end='')
>read1
GGTGCGGAGGAAAGCAACACATTTGTTCCTCAGGACTCTTCAGCGGGAGATATCTGCAGA
ACCAAACACGCTCAAAGACCCGCGCAAATCGGCAAATTGCCTGACGTAGAACACCGACCT
AGCGTGTTTATTATGATACTCGGCACCTCTGACTTAATCAAACGTTGTCGAGGTGGAGAT
GGTATCATCTGGCGTTAGGCATAACGAGCGTGACACTAGCTTC
>read3
ACGTCAGAAACGTGAGTAGGGTTACCCACACGCATCTAAGAATGCTCGGGCAATGTGACG
GTATGAAGTTGAGAGCCTCTGTTACCCTCCACCTGATGGGGTGGCGGCCATTTACAGTAT
TGCTTAGCGCACTCAGATATACGCATGATGGGACTGATTCCCCAGGCGAGTACGTAGTCA
CCCGCGGCGACTCGACAAGGATACTATTATCAGGGTTCTCCCCGGGAGGAGGTATTAAGA
>read4
AATGTCATGACCCTAGAAGGCCCTGCATATACATGGCAGGGCAGTCTATCAGCGCCCATC
ATCATCGCTGACGTAGTTGGAGCCGTATCTGTACTGGATCTAGGGGGCATCGTGAACTAG
CGAGGTCGTGTGACGCGCTACAAAGGCTCGGCCCATCTGGAACGTCAGACGAGGCTTCTC
TACGGGGGTCCTGCCGTGGGCTATTGAGGTGCAAAGTTCGAATTCGGCACTGTCGCGTGT
AATTGAATTCGTGCCCAGT

>>> print(open('CTCTCGGA.fasta', 'r').read(), end='')
>read2
AGTTTGTCCGCACCGACATAAATAGACTGATACTGATCAGGGGGACGGTACGACCCACTC
TGTCGATCGAACCAGTGACCTGTTCGCTTCGTAACTGGCCAGACGATAGATCTTAGCATA
ACCGACGCGAAGTGTGGCAGATAAGATCCCAAGGTAGTAAATAGTACATATTAGTGGTCA
ACAGGTTTTTAGCGCAGCAGCTGATCTATGCAATTGACTGCAACACCATGACGTAGGTTG
CTTCTATAAGAACAAGTTTACCGCTACGAATTCGCGTCGGTTCCGTACATA
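
One possible way to obtain the behaviour shown in the session above is sketched here. It is an illustration under stated assumptions, not the reference solution: the default width of 80, the default of writing to standard output, and the extra barcode_length parameter do not appear in the session and are assumptions.

```python
def outputFasta(description, sequence, width=80, file=None):
    """Write one FASTA record, wrapping the sequence at `width` characters.

    Assumption: file=None writes to standard output, as print() does.
    """
    print(f'>{description}', file=file)
    for start in range(0, len(sequence), width):
        print(sequence[start:start + width], file=file)


def demultiplexFasta(filename, width=80, barcode_length=8):
    """Split the reads in `filename` over files named <barcode>.fasta.

    Assumption: barcode_length is a hypothetical parameter; the text
    only mentions barcodes of eight nucleotides.
    """
    # read all records, joining the sequence lines of each record
    records = []
    with open(filename) as source:
        for line in source:
            line = line.strip()
            if line.startswith('>'):
                records.append([line[1:], ''])
            elif line and records:
                records[-1][1] += line

    # write each read, without its barcode, to the file of its barcode
    outputs = {}
    try:
        for description, sequence in records:
            barcode = sequence[:barcode_length]
            fragment = sequence[barcode_length:]
            if barcode not in outputs:
                outputs[barcode] = open(f'{barcode}.fasta', 'w')
            outputFasta(description, fragment, width=width, file=outputs[barcode])
    finally:
        # close every output file, even if an error occurred
        for handle in outputs.values():
            handle.close()
```

Keeping one open file per barcode preserves the order in which the reads occur in the input, which matches the order of read1, read3 and read4 in AAAAAAAA.fasta above.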