The FASTQ format1 is a text-based format for storing both DNA-sequences and their corresponding quality scores. It was originally developed at the Wellcome Trust Sanger Institute2, but has recently become the de facto standard for storing the output of high throughput sequencing instruments.

A FASTQ file contains a series of sequence records, with each record corresponding to four consecutive lines:

Both the sequence letters and quality scores are encoded with a single ASCII character for brevity. As an example, the content of the FASTQ file sample.fq3 containing six sequence records looks like this:

@sample sequence 1
GGAAGTCCATGGAGTTATTTGCGGTAACCGGGACCGGCCT
+sample sequence 1
ba^aaabaa`]baaaaa_aab]D^^`b`aYDW]abaa`^a
@sample sequence 2
CTCCTTATAGCATCATACGCACAGCTGGTATCCCCATTGCTTATTAGGGTTCAGCCCTGTCGCTCTCCCC
+sample sequence 2
ababa``aaaababaaaabaa_``aa``]^aa_`aa_Z_aaaaa^baaaaaaaaaaaaaaaaaa`^```a
@sample sequence 3
GGAAGTCCATGGAGTTATTTGCGGTAACCGGGACCGGCCTGGGGTTAGATTTTATGTCGT
+sample sequence 3
ba^aaabaa`]baaaaa_aab]D^^`b`aYDW]abaa`^ababbaaabaaaaa`]`ba`]
@sample sequence 4
TATAC
+sample sequence 4
baab`
@sample sequence 5
CTCCTTATAGCATCATACGCACAGCTGGTATCCCCATTGCTTATTAGGGT
+sample sequence 5
ababa``aaaababaaaabaa_``aa``]^aa_`aa_Z_aaaaa^baaaa
@sample sequence 6
GGAAGTCCATGGAGTTATTTGCGGTAACCGGGACCGGCCTGGGGTTAGATTTTATGTCGT
+sample sequence 6
ba^aaabaa`]baaaaa_aab]D^^`b`aYDW]abaa`^ababbaaabaaaaa`]`ba`]

Prior to sequencing itself, a genomic library has to be constructed. As an artifact of the PCR amplification during library preparation, the same fragment may be read multiple times. That's why the DNA sequences in a FASTQ file are usually deduplicated, in order to avoid a possible bias when analyzing the sequences. Deduplication does not take into account full length sequences, but only relies on a prefix formed by the first $$n$$ bases of the sequence.

For example, if we consider prefixes of length 10, records 1, 3 and 6 in the sample FASTQ file share the same prefix GGAAGTCCAT. Similarly, records 2 and 4 also share the same prefix CTCCTTATAG. We have graphically depicted the records of this FASTQ file below, where the same prefixes have been given the same color.

FASTQ deduplication

In practice, two strategies are used for deduplicating the sequences in a FASTQ file. Both strategies have in common that all records whose sequence is shorter than $$n$$ bases (the length of the prefix) are always removed. If $$n = 10$$, this is for example the case for record 4 in the sample FASTQ file.

In the first strategy — resulting in the left outcome show in the above figure — a FASTQ record is only retained if the prefix of its sequence is seen for the first time. To implement this strategy, the FASTQ records are read one by one and a set of prefixes is constructed to remember which prefixes have already been seen. A FASTQ record is then only retained in case its prefix was not yet in the set.

In the second strategy — resulting in the right outcome show in the above figure — for each prefix on the record with the longest sequence is retained. To implement this strategy, the FASTQ file now have to be processed twice. The first time, the records are read one by one to construct a dictionary that maps each prefix of sequences longer than $$n$$ onto the length of the longest sequence with that prefix. The second time the records are again read one by one, and only those records are retained whose sequence length matches the length onto which the prefix of the sequence is mapped by the dictionary. In case the same prefix is found in multiple records whose sequence length matches the length onto which the prefix of the sequence is mapped by the dictionary, only the first record of those records needs to be retained. For example, the sequences of records 3 and 6 in the sample FASTQ file share the same prefix and the same sequence length (the maximal length of all records sharing that prefix), and thus only record 3 needs to be retained. The sequence of record 1 also shares the same prefix, but is sequence is shorter than those of records 3 and 6.

Assignment

In this assignment we only work with FASTQ files that contain DNA sequences. These DNA sequences are represented as strings that only contain the uppercase letters A, C, G en T. Your task:

In practice, FASTQ files contain the results of next-generation sequencing experiments that produce millions of reads (short DNA sequences). Such files are simply too big to fit in computer memory. Therefore, avoid during the implementation of the deduplication functions to load the entire file content in computer memory, but only claim the memory needed to store the information of a single FASTQ record, together with a set (function retain_first) or a dictionary (function retain_longest) of prefixes that is used to decide whether or not the FASTQ records must be retained.

Example

In the following interactive session, we assume the FASTQ file sample.fq4 to be located in the current directory. This file has the same content as the sample file shown in the introduction of this assignment.

>>> retain_first('sample.fq', 'first.fq', 10)
>>> print(open('first.fq', 'r').read().rstrip())
@sample sequence 1
GGAAGTCCATGGAGTTATTTGCGGTAACCGGGACCGGCCT
+sample sequence 1
ba^aaabaa`]baaaaa_aab]D^^`b`aYDW]abaa`^a
@sample sequence 2
CTCCTTATAGCATCATACGCACAGCTGGTATCCCCATTGCTTATTAGGGTTCAGCCCTGTCGCTCTCCCC
+sample sequence 2
ababa``aaaababaaaabaa_``aa``]^aa_`aa_Z_aaaaa^baaaaaaaaaaaaaaaaaa`^```a

>>> retain_longest('sample.fq', 'longest.fq', 10)
>>> print(open('longest.fq', 'r').read().rstrip())
@sample sequence 2
CTCCTTATAGCATCATACGCACAGCTGGTATCCCCATTGCTTATTAGGGTTCAGCCCTGTCGCTCTCCCC
+sample sequence 2
ababa``aaaababaaaabaa_``aa``]^aa_`aa_Z_aaaaa^baaaaaaaaaaaaaaaaaa`^```a
@sample sequence 3
GGAAGTCCATGGAGTTATTTGCGGTAACCGGGACCGGCCTGGGGTTAGATTTTATGTCGT
+sample sequence 3
ba^aaabaa`]baaaaa_aab]D^^`b`aYDW]abaa`^ababbaaabaaaaa`]`ba`]