A read mapper maps sequencing reads on a reference sequence. Paired-end reads occur in pairs, such that the distance between the reads in a pair (the insert size) satisfies a certain distribution. This additional information can be used to resolve ambiguities in the mapping process, namely when the reads might individually map to several locations in the reference sequence. Treating them as a pair might lead to a unique alignment (in case of doubt, pick the most probable alignment).
In this assignment you are expected to write a paired-end read mapper that uses the FM-index constructed for a reference genome. In particular, you are provided with the FM-index of the reference genome of Salmonella enterica subsp. enterica serovar Typhimurium str. 14028S (CP0013631) and a set of (simulated) paired-end reads2. In addition, the reference genome is provided, which can be used to check your results.
The FM-index is provided in plain text format and consists of the BWT, a sparse suffix array, a sparse occurrences table and a counts array. In the sparse suffix array only the entries $$SA[k \cdot 32]$$ for $$k \in [0 \dots \left\lfloor{n/32}\right\rfloor]$$ are included. In the sparse occurrences table only the entries $$O[\sigma, k \cdot 128]$$ for $$k \in [0 \dots \left\lfloor{n/128}\right\rfloor]$$ are included. These sparseness factors are introduced to reduce the memory requirements. The other values $$SA[i]$$ and $$O(\sigma, i)$$ can be computed from these sparse tables and the BWT.
There are two ways to introduce sparseness to the suffix array, here we use the sparse suffix array $$SA[k \cdot 32]$$, where $$32$$ is the sparseness factor. Alternatively one could use the compressed suffix array, i.e. the array that only stores those values $$SA[k]$$ that are a multiple of the sparseness factor $$32$$. The compressed suffix array guarantees that $$O(1)$$ applications of Lemma 3 (see further) suffice to find the desired value in $$SA$$, but it requires additional sparse and compressed tables. The approach used here has a $$O(n)$$ worst case run time, but is significantly easier to implement.
Here, the unique terminal character $ is lexicographically larger than the characters ($$\texttt{A}$$, $$\texttt{C}$$, $$\texttt{G}$$, $$\texttt{T}$$) in the text.
The reads contain no errors and no mutations with respect to the reference genome. All read pairs can be uniquely mapped to the reference genome when handling them as pairs.
Note that DNA is double stranded, and each read pair consists of one read from both strands. The reverse complement of a DNA sequence is formed by reversing the letters, interchanging A and T and interchanging C and G. For example, the reverse complement of ACCTGAG is CTCAGGT. The FM-index is only constructed from the forward direction of the reference genome, so make sure to also handle reads that align to the reverse complement of the reference genome. This can be done by aligning every read in its forward direction, and then repeating the alignment procedure with the reverse complement of the read. The reads in a pair are oriented as follows:
The insert size refers to the distance between the two extreme ends of the alignments of the reads. In the given read data set, insert sizes follow a normal distribution $$N(800, 100)$$. Note that one of the reads in a pair maps to the forward strand in the reference, and the other maps to the reverse complement strand. As shown in the figure, the read that maps to the forward strand, maps before the other read (w.r.t. that strand). It is however not known a priori which of the reads in a pair maps where. In particular, the read in a pair that comes first in the FASTA file is not necessarily the read that maps to the forward strand of the reference.
The output is expected to be in SAM format, as specified in the Sequence Alignment/Map Format Specification3. An example of the output4 is provided for the first 5 read pairs. The output must preserve the order of the reads and should be written to standard output.
Define a class ReadMapper that implements the use of a given FM-index of a reference sequence to map paired-end reads (exact matching) onto the reference sequence. This class must support at least the following methods:
An initialization method that takes the path prefix of the files that make up an FM-index. These files use extensions as specified in the Files5 section. The method also takes two more optional integer arguments that respectively represent the sparseness factor of the occurrence table (default value: 128), and the sparseness factor of the suffix array (default value: 32).
A method O that takes a symbol $$\sigma$$ and an integer index $$k$$. The method must return $$O[\sigma, k]$$.
A method LF that takes an integer index $$k$$. The method must return the index $$j$$ such that $$SA[j] = SA[k] - 1$$.
A method SA that takes an integer index $$k$$. The method must return $$SA[k]$$.
A method exactMatches that takes a string $$s$$. The method must return a sorted list containing the start positions (0-based) of all exact substring matches of $$s$$ in the reference sequence.
A method bestPairedMatch that takes two strings $$s_1$$ and $$s_2$$. The method also takes an optional third argument that represents the mean insert size (default value: 800). The method must return the best alignment (exact match) of the given pair of reads $$s_1$$ and $$s_2$$, with respect to the given mean insert size. The method must return this alignment as a tuple containing three elements: i) start of the best alignment of $$s_1$$ in the reference sequence, ii) start of the best alignment of $$s_2$$ in the reference sequence and iii) a Boolean that indicates whether or not the alignment is on the forward strand of the reference sequence.
A method map that takes two file locations: i) the location of a FASTA file containing a sequence of paired-end reads and ii) the location of a SAM file. The read that form a pair are in consecutive records of the FASTA file. The method also takes an optional third argument that represents the mean insert size (default value: 800). The method must find the best alignment (exact match) of each pair of reads in the given FASTA file, and write the alignment results to a SAM file at the given location (an existing file at that location must be overwritten). The SAM output should follow these rules:
the header section should only contain header lines @SQ and @PG
the @SQ line should only have SN and LN tags, with the value of the SN tag being the path prefix passed to the initialization method
the @PG line should only have an ID tag, whose value is ReadMapper
the alignment section should only contain the 11 mandatory fields
the field QNAME should contain the identifier of the read taken from the FASTA file
the field RNAME should contain the path prefix passed to the initialization method
the fields MAPQ, PNEXT, TLEN, SEQ and QUAL should contain an indication that the information is not available or not stored
the field RNEXT should contain an equal sign (=), indicating that RNEXT is identical to RNAME
In the following interactive session, we assume all demo.*6 and CP001363.*7 text files that make up the FM-indices to be located in the current directory, as do the input files demo.reads.fasta8 and CP001363.reads.fasta9. See the explanation of the files below to understand the content and formatting of these files. In addition, you can also use the FASTA files demo.fasta10 and CP001363.fasta11 containing the reference genomes, together with the output files demo.sam12 and CP001363.sam13, for debugging purposes.
>>> demo = ReadMapper('demo', 4, 2) >>> [demo.SA(index) for index in range(11)] [3, 4, 0, 6, 8, 2, 5, 1, 7, 9, 10] >>> [demo.LF(index) for index in range(11)] [5, 0, 10, 6, 8, 7, 1, 2, 3, 4, 9] >>> [demo.O('A', index) for index in range(11)] [0, 0, 1, 1, 1, 1, 1, 2, 3, 4, 5] >>> [demo.O('C', index) for index in range(11)] [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] >>> [demo.O('G', index) for index in range(11)] [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2] >>> [demo.O('T', index) for index in range(11)] [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1] >>> [demo.O('$', index) for index in range(11)] [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1] >>> demo.exactMatches('GAT') [5] >>> demo.exactMatches('AT') [6, 8] >>> demo.bestPairedMatch('GCA', 'TA', 8) (1, 7, True) >>> demo.bestPairedMatch('TGC', 'TA', 8) (1, 7, False) >>> demo.map('demo.reads.fasta14', 'demo.sam15') >>> CP001363 = ReadMapper('CP001363', 128, 32) >>> indices = [0, 247380, 685559, 704037, 1502653, 2759112, 3187255, 4870265] >>> [CP001363.SA(index) for index in indices] [1952500, 901989, 2630792, 4626188, 3179929, 573015, 6885, 4870265] >>> [CP001363.LF(index) for index in indices] [3706925, 2497723, 3847099, 3850517, 2813008, 1870350, 4399036, 1164822] >>> [CP001363.O('A', index) for index in indices] [0, 83953, 215950, 220490, 400412, 661685, 746479, 1164822] >>> [CP001363.O('C', index) for index in indices] [0, 50134, 167547, 175400, 405799, 705527, 852759, 1271002] >>> [CP001363.O('G', index) for index in indices] [0, 61898, 161887, 164554, 377183, 790736, 895905, 1271100] >>> [CP001363.O('T', index) for index in indices] [0, 51395, 140174, 143592, 319258, 601163, 692111, 1163340] >>> [CP001363.O('$', index) for index in indices] [0, 0, 1, 1, 1, 1, 1, 1] >>> CP001363.exactMatches('TAAGTAACGATA') [293380, 4117066, 4213150, 4368115, 4411497] >>> CP001363.exactMatches('TCAAGCTATA') [2851405, 3581795, 4362552] >>> CP001363.bestPairedMatch('TAAGTAACGATA', 'TCAAGCTATA') (293380, 294198, True) >>> CP001363.bestPairedMatch('GTGGGTTGCA', 'GCCCGGGGATT') (291296, 290485, True) >>> CP001363.bestPairedMatch('ATCGTCAACTCC', 'GGCGGCCGTA') (2852374, 2851581, True) >>> CP001363.bestPairedMatch('CTTAACCTTC', 'CAAGACTCAA') (291320, 290529, True) >>> CP001363.map('CP001363.reads.fasta16', 'CP001363.sam17')
In the demo example we attempt to align the sequence GAT to the reference sequence AGCAAGATAT. For the sake of the exposition, the sparseness of the suffix array is $$2$$, whereas the sparseness of the occurrence table is $$4$$. The table below contains the arrays $$BWT$$, $$SA$$ and $$O$$, where bold black numbers indicate values that are contained in the sparse index and greyed out numbers represent values that are not indexed.
$$i$$ | $$BWT$$ | $$SA[i]$$ | $$LF[i]$$ | $$O[\texttt{A}]$$ | $$O[\texttt{C}]$$ | $$O[\texttt{G}]$$ | $$O[\texttt{T}]$$ | $$O[\texttt{\$}]$$ |
---|---|---|---|---|---|---|---|---|
0 | $$\texttt{C}$$ | 3 | 5 | 0 | 0 | 0 | 0 | 0 |
1 | $$\texttt{A}$$ | 4 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | $$\texttt{\$}$$ | 0 | 10 | 1 | 1 | 0 | 0 | 0 |
3 | $$\texttt{G}$$ | 6 | 6 | 1 | 1 | 0 | 0 | 1 |
4 | $$\texttt{T}$$ | 8 | 8 | 1 | 1 | 1 | 0 | 1 |
5 | $$\texttt{G}$$ | 2 | 7 | 1 | 1 | 1 | 1 | 1 |
6 | $$\texttt{A}$$ | 5 | 1 | 1 | 1 | 2 | 1 | 1 |
7 | $$\texttt{A}$$ | 1 | 2 | 2 | 1 | 2 | 1 | 1 |
8 | $$\texttt{A}$$ | 7 | 3 | 3 | 1 | 2 | 1 | 1 |
9 | $$\texttt{A}$$ | 9 | 4 | 4 | 1 | 2 | 1 | 1 |
10 | $$\texttt{T}$$ | 10 | 9 | 5 | 1 | 2 | 1 | 1 |
The table below contains the array $$C$$.
$$C[\texttt{A}]$$ | $$C[\texttt{C}]$$ | $$C[\texttt{G}]$$ | $$C[\texttt{T}]$$ | $$C[\texttt{\$}]$$ |
---|---|---|---|---|
0 | 5 | 6 | 8 | 10 |
To match $$P = \texttt{GAT}$$, we recursively match its suffixes $$P_i$$. The empty string is a prefix of every suffix, hence $$I(P_3) = [0, 10]$$. From relation (1) we get that $$L(P_2) = L(\texttt{T}P_3) = C(\texttt{T}) + O(\texttt{T}, 0) = 8$$, whereas from relation (2) we obtain that $$R(P_2) = R(\texttt{T}P_3) = C(\texttt{T}) + O(\texttt{T}, 11) - 1 = 9$$. Hence $$I(P_2) = [8, 9]$$. Repeated application of this procedure gives $$I(P_1) = [3, 4]$$ and $$I(P) = I(P_0) = [6, 6]$$. Hence, there is a single match and it starts at position $$SA[6]$$.
To find $$SA[6]$$ we compute $$LF^{j}(6)$$ until we find the minimal $$j$$ such that $$LF^{j}(6) = 0\!\!\!\mod\!\! 2$$: \[ \begin{align} LF(6) &= C(\texttt{A}) + O(\texttt{A}, 6) = 1 \\ LF^{2}(6) &= LF(1) = C(\texttt{A}) + O(\texttt{A}, 1) = 0 = 0\!\!\!\!\!\mod\!\! 2 \end{align} \] Hence, $$SA[6] = SA[0] + 2 = 5$$. We conclude that GAT occurs in AGCAAGATAT starting at position 5.
The indexes and reads are provided in plain text. Here we describe what data each of the input file types contain.
The .bwt file contains the BWT string.
The .counts file contains the counts $$C[\sigma]$$ in lexicographical order, separated from each other by newlines.
The .occ file contains the sparse occurrence table $$O(\sigma, k)$$, with $$k$$ a multiple of 128. Each line contains the occurrences in lexicographical order, separated by spaces. Elements $$O(\sigma, k)$$ for $$k$$ not a multiple of 128 can be obtained from this table by parsing the relevant subsection of the BWT.
The .sa file contains the sparse suffix array $$SA[k]$$, for $$k$$ a multiple of 32, separated by newlines. All other elements of the suffix array can be obtained from the sparse suffix array through application of relation (3).
The .fasta files contain the reference genome (genome.fasta) and the paired-end reads (pe.fasta). Concerning paired-end reads: read >i/1 is paired with read >i/2.
This file contains the output that is expected for the first 5 read pairs.
The counts table $$C[\sigma]$$ counts for all $$\sigma \in \Sigma$$ how many symbols in $$BWT$$ are lexicographically smaller than $$\sigma$$: \[ \forall \sigma \in \Sigma: C[\sigma] = \#\{k: BWT[k] < \sigma \} \]
The occurrence table $$O(\sigma, i)$$ is the number of times $$\sigma$$ occurs in $$S[0..i - 1]$$: \[ O(\sigma, i) = \#\{ k: k < i \text{ and } S[k] = \sigma \} \]
Given a pattern $$P$$ we define the interval $$I(P) = [L(P), R(P)]$$ as the interval containing all suffixes of $$S$$ that start with $$P$$: \[ \begin{align} L(P) &= \min \left\{ k: P \text{ is a prefix of } S_{SA[k]} \right\} \tag{1} \\ R(P) &= \max \left\{ k: P \text{ is a prefix of } S_{SA[k]} \right\} \tag{2} \end{align} \]
The last-first relation $$LF$$ maps $$i$$ on $$j$$ if and only if $$SA[i] = SA[j] + 1$$.
The interval $$I(\sigma P)$$ can be computed from $$I(P)$$ and the occurrence tables in the following way: \[ \begin{align} L(\sigma P) &= C(\sigma) + O(\sigma, L(P)) \\ R(\sigma P) &= C(\sigma) + O(\sigma, R(P) + 1) - 1 \end{align} \]
Proof. $$L(\sigma P)$$ refers to a suffix starting with $$\sigma$$, hence it is of the form $$C(\sigma) + k$$ with $$k \in[0, c_\sigma]$$, where $$c_\sigma$$ is the number of $$\sigma$$ in $$S$$. The first $$O(\sigma, L(P))$$ occurrences of $$\sigma$$ in the $$BWT$$ correspond to positions in the suffix array that do not have a prefix $$P$$, else they would be in $$I(P)$$. Hence $$k = O(\sigma, L(P))$$. Analogous for $$R(P)$$, but now we are interested in the number of prefixes starting with $$P$$ or a prefix smaller than $$P$$, hence $$k = O(\sigma, R(P) + 1) - 1$$.
$$LF(i) = C(BWT[i]) + O(BWT[i], i)$$
Proof. Since the suffixes in a suffix array are sorted, the following follows immediately from the definition of $$LF$$: $$LF$$ maps the $$n$$-th occurrence of $$\sigma$$ in $$BWT$$ on the $$n$$-th occurrence of $$\sigma$$ in the lexicographical sorting of all characters in $$BWT$$. Hence, $$LF$$ can be computed from the counts and occurrences tables.
Using the $$LF$$ function the following relation between different suffix array entries can be established: \[ SA(i) = SA[LF^{j}(i)] + j \tag{3} \]
Proof. By definition of $$LF$$ this holds for $$j = 1$$. The formula now follows by induction.