A read mapper maps sequencing reads on a reference sequence. Paired-end reads occur in pairs, such that the distance between the reads in a pair (the insert size) satisfies a certain distribution. This additional information can be used to resolve ambiguities in the mapping process, namely when the reads might individually map to several locations in the reference sequence. Treating them as a pair might lead to a unique alignment (in case of doubt, pick the most probable alignment).

Assignment

In this assignment you are expected to write a paired-end read mapper that uses the FM-index constructed for a reference genome. In particular, you are provided with the FM-index of the reference genome of Salmonella enterica subsp. enterica serovar Typhimurium str. 14028S (CP0013631) and a set of (simulated) paired-end reads2. In addition, the reference genome is provided, which can be used to check your results.

Index structure

The FM-index is provided in plain text format and consists of the BWT, a sparse suffix array, a sparse occurrences table and a counts array. In the sparse suffix array only the entries $$SA[k \cdot 32]$$ for $$k \in [0 \dots \left\lfloor{n/32}\right\rfloor]$$ are included. In the sparse occurrences table only the entries $$O[\sigma, k \cdot 128]$$ for $$k \in [0 \dots \left\lfloor{n/128}\right\rfloor]$$ are included. These sparseness factors are introduced to reduce the memory requirements. The other values $$SA[i]$$ and $$O(\sigma, i)$$ can be computed from these sparse tables and the BWT.

Note

There are two ways to introduce sparseness to the suffix array, here we use the sparse suffix array $$SA[k \cdot 32]$$, where $$32$$ is the sparseness factor. Alternatively one could use the compressed suffix array, i.e. the array that only stores those values $$SA[k]$$ that are a multiple of the sparseness factor $$32$$. The compressed suffix array guarantees that $$O(1)$$ applications of Lemma 3 (see further) suffice to find the desired value in $$SA$$, but it requires additional sparse and compressed tables. The approach used here has a $$O(n)$$ worst case run time, but is significantly easier to implement.

Note

Here, the unique terminal character $ is lexicographically larger than the characters ($$\texttt{A}$$, $$\texttt{C}$$, $$\texttt{G}$$, $$\texttt{T}$$) in the text.

Reads

The reads contain no errors and no mutations with respect to the reference genome. All read pairs can be uniquely mapped to the reference genome when handling them as pairs.

Note that DNA is double stranded, and each read pair consists of one read from both strands. The reverse complement of a DNA sequence is formed by reversing the letters, interchanging A and T and interchanging C and G. For example, the reverse complement of ACCTGAG is CTCAGGT. The FM-index is only constructed from the forward direction of the reference genome, so make sure to also handle reads that align to the reverse complement of the reference genome. This can be done by aligning every read in its forward direction, and then repeating the alignment procedure with the reverse complement of the read. The reads in a pair are oriented as follows:

paired end reads
Paired-end sequencing enables both ends of the DNA fragment to be sequenced. Because the distance between each paired read is known, alignment algorithms can use this information to map the reads over repetitive regions more precisely. This results in much better alignment of the reads, especially across difficult-to-sequence, repetitive regions of the genome.

The insert size refers to the distance between the two extreme ends of the alignments of the reads. In the given read data set, insert sizes follow a normal distribution $$N(800, 100)$$. Note that one of the reads in a pair maps to the forward strand in the reference, and the other maps to the reverse complement strand. As shown in the figure, the read that maps to the forward strand, maps before the other read (w.r.t. that strand). It is however not known a priori which of the reads in a pair maps where. In particular, the read in a pair that comes first in the FASTA file is not necessarily the read that maps to the forward strand of the reference.

Output

The output is expected to be in SAM format, as specified in the Sequence Alignment/Map Format Specification3. An example of the output4 is provided for the first 5 read pairs. The output must preserve the order of the reads and should be written to standard output.

Specification

Define a class ReadMapper that implements the use of a given FM-index of a reference sequence to map paired-end reads (exact matching) onto the reference sequence. This class must support at least the following methods:

Example

In the following interactive session, we assume all demo.*6 and CP001363.*7 text files that make up the FM-indices to be located in the current directory, as do the input files demo.reads.fasta8 and CP001363.reads.fasta9. See the explanation of the files below to understand the content and formatting of these files. In addition, you can also use the FASTA files demo.fasta10 and CP001363.fasta11 containing the reference genomes, together with the output files demo.sam12 and CP001363.sam13, for debugging purposes.

>>> demo = ReadMapper('demo', 4, 2)
>>> [demo.SA(index) for index in range(11)]
[3, 4, 0, 6, 8, 2, 5, 1, 7, 9, 10]
>>> [demo.LF(index) for index in range(11)]
[5, 0, 10, 6, 8, 7, 1, 2, 3, 4, 9]
>>> [demo.O('A', index) for index in range(11)]
[0, 0, 1, 1, 1, 1, 1, 2, 3, 4, 5]
>>> [demo.O('C', index) for index in range(11)]
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> [demo.O('G', index) for index in range(11)]
[0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
>>> [demo.O('T', index) for index in range(11)]
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
>>> [demo.O('$', index) for index in range(11)]
[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
>>> demo.exactMatches('GAT')
[5]
>>> demo.exactMatches('AT')
[6, 8]
>>> demo.bestPairedMatch('GCA', 'TA', 8)
(1, 7, True)
>>> demo.bestPairedMatch('TGC', 'TA', 8)
(1, 7, False)
>>> demo.map('demo.reads.fasta14', 'demo.sam15')

>>> CP001363 = ReadMapper('CP001363', 128, 32)
>>> indices = [0, 247380, 685559, 704037, 1502653, 2759112, 3187255, 4870265]
>>> [CP001363.SA(index) for index in indices]
[1952500, 901989, 2630792, 4626188, 3179929, 573015, 6885, 4870265]
>>> [CP001363.LF(index) for index in indices]
[3706925, 2497723, 3847099, 3850517, 2813008, 1870350, 4399036, 1164822]
>>> [CP001363.O('A', index) for index in indices]
[0, 83953, 215950, 220490, 400412, 661685, 746479, 1164822]
>>> [CP001363.O('C', index) for index in indices]
[0, 50134, 167547, 175400, 405799, 705527, 852759, 1271002]
>>> [CP001363.O('G', index) for index in indices]
[0, 61898, 161887, 164554, 377183, 790736, 895905, 1271100]
>>> [CP001363.O('T', index) for index in indices]
[0, 51395, 140174, 143592, 319258, 601163, 692111, 1163340]
>>> [CP001363.O('$', index) for index in indices]
[0, 0, 1, 1, 1, 1, 1, 1]
>>> CP001363.exactMatches('TAAGTAACGATA')
[293380, 4117066, 4213150, 4368115, 4411497]
>>> CP001363.exactMatches('TCAAGCTATA')
[2851405, 3581795, 4362552]
>>> CP001363.bestPairedMatch('TAAGTAACGATA', 'TCAAGCTATA')
(293380, 294198, True)
>>> CP001363.bestPairedMatch('GTGGGTTGCA', 'GCCCGGGGATT')
(291296, 290485, True)
>>> CP001363.bestPairedMatch('ATCGTCAACTCC', 'GGCGGCCGTA')
(2852374, 2851581, True)
>>> CP001363.bestPairedMatch('CTTAACCTTC', 'CAAGACTCAA')
(291320, 290529, True)
>>> CP001363.map('CP001363.reads.fasta16', 'CP001363.sam17')

In the demo example we attempt to align the sequence GAT to the reference sequence AGCAAGATAT. For the sake of the exposition, the sparseness of the suffix array is $$2$$, whereas the sparseness of the occurrence table is $$4$$. The table below contains the arrays $$BWT$$, $$SA$$ and $$O$$, where bold black numbers indicate values that are contained in the sparse index and greyed out numbers represent values that are not indexed.

$$i$$ $$BWT$$ $$SA[i]$$ $$LF[i]$$ $$O[\texttt{A}]$$ $$O[\texttt{C}]$$ $$O[\texttt{G}]$$ $$O[\texttt{T}]$$ $$O[\texttt{\$}]$$
0 $$\texttt{C}$$ 3 5 0 0 0 0 0
1 $$\texttt{A}$$ 4 0 0 1 0 0 0
2 $$\texttt{\$}$$ 0 10 1 1 0 0 0
3 $$\texttt{G}$$ 6 6 1 1 0 0 1
4 $$\texttt{T}$$ 8 8 1 1 1 0 1
5 $$\texttt{G}$$ 2 7 1 1 1 1 1
6 $$\texttt{A}$$ 5 1 1 1 2 1 1
7 $$\texttt{A}$$ 1 2 2 1 2 1 1
8 $$\texttt{A}$$ 7 3 3 1 2 1 1
9 $$\texttt{A}$$ 9 4 4 1 2 1 1
10 $$\texttt{T}$$ 10 9 5 1 2 1 1

The table below contains the array $$C$$.

$$C[\texttt{A}]$$ $$C[\texttt{C}]$$ $$C[\texttt{G}]$$ $$C[\texttt{T}]$$ $$C[\texttt{\$}]$$
0 5 6 8 10

To match $$P = \texttt{GAT}$$, we recursively match its suffixes $$P_i$$. The empty string is a prefix of every suffix, hence $$I(P_3) = [0, 10]$$. From relation (1) we get that $$L(P_2) = L(\texttt{T}P_3) = C(\texttt{T}) + O(\texttt{T}, 0) = 8$$, whereas from relation (2) we obtain that $$R(P_2) = R(\texttt{T}P_3) = C(\texttt{T}) + O(\texttt{T}, 11) - 1 = 9$$. Hence $$I(P_2) = [8, 9]$$. Repeated application of this procedure gives $$I(P_1) = [3, 4]$$ and $$I(P) = I(P_0) = [6, 6]$$. Hence, there is a single match and it starts at position $$SA[6]$$.

To find $$SA[6]$$ we compute $$LF^{j}(6)$$ until we find the minimal $$j$$ such that $$LF^{j}(6) = 0\!\!\!\mod\!\! 2$$: \[ \begin{align} LF(6) &= C(\texttt{A}) + O(\texttt{A}, 6) = 1 \\ LF^{2}(6) &= LF(1) = C(\texttt{A}) + O(\texttt{A}, 1) = 0 = 0\!\!\!\!\!\mod\!\! 2 \end{align} \] Hence, $$SA[6] = SA[0] + 2 = 5$$. We conclude that GAT occurs in AGCAAGATAT starting at position 5.

Files

The indexes and reads are provided in plain text. Here we describe what data each of the input file types contain.

.bwt

The .bwt file contains the BWT string.

.counts

The .counts file contains the counts $$C[\sigma]$$ in lexicographical order, separated from each other by newlines.

.occ

The .occ file contains the sparse occurrence table $$O(\sigma, k)$$, with $$k$$ a multiple of 128. Each line contains the occurrences in lexicographical order, separated by spaces. Elements $$O(\sigma, k)$$ for $$k$$ not a multiple of 128 can be obtained from this table by parsing the relevant subsection of the BWT.

.sa

The .sa file contains the sparse suffix array $$SA[k]$$, for $$k$$ a multiple of 32, separated by newlines. All other elements of the suffix array can be obtained from the sparse suffix array through application of relation (3).

.fasta

The .fasta files contain the reference genome (genome.fasta) and the paired-end reads (pe.fasta). Concerning paired-end reads: read >i/1 is paired with read >i/2.

SAM-output-example.sam

This file contains the output that is expected for the first 5 read pairs.

Formulas

Definition 1

  1. The counts table $$C[\sigma]$$ counts for all $$\sigma \in \Sigma$$ how many symbols in $$BWT$$ are lexicographically smaller than $$\sigma$$: \[ \forall \sigma \in \Sigma: C[\sigma] = \#\{k: BWT[k] < \sigma \} \]

  2. The occurrence table $$O(\sigma, i)$$ is the number of times $$\sigma$$ occurs in $$S[0..i - 1]$$: \[ O(\sigma, i) = \#\{ k: k < i \text{ and } S[k] = \sigma \} \]

  3. Given a pattern $$P$$ we define the interval $$I(P) = [L(P), R(P)]$$ as the interval containing all suffixes of $$S$$ that start with $$P$$: \[ \begin{align} L(P) &= \min \left\{ k: P \text{ is a prefix of } S_{SA[k]} \right\} \tag{1} \\ R(P) &= \max \left\{ k: P \text{ is a prefix of } S_{SA[k]} \right\} \tag{2} \end{align} \]

  4. The last-first relation $$LF$$ maps $$i$$ on $$j$$ if and only if $$SA[i] = SA[j] + 1$$.

Lemma 1

The interval $$I(\sigma P)$$ can be computed from $$I(P)$$ and the occurrence tables in the following way: \[ \begin{align} L(\sigma P) &= C(\sigma) + O(\sigma, L(P)) \\ R(\sigma P) &= C(\sigma) + O(\sigma, R(P) + 1) - 1 \end{align} \]

Proof. $$L(\sigma P)$$ refers to a suffix starting with $$\sigma$$, hence it is of the form $$C(\sigma) + k$$ with $$k \in[0, c_\sigma]$$, where $$c_\sigma$$ is the number of $$\sigma$$ in $$S$$. The first $$O(\sigma, L(P))$$ occurrences of $$\sigma$$ in the $$BWT$$ correspond to positions in the suffix array that do not have a prefix $$P$$, else they would be in $$I(P)$$. Hence $$k = O(\sigma, L(P))$$. Analogous for $$R(P)$$, but now we are interested in the number of prefixes starting with $$P$$ or a prefix smaller than $$P$$, hence $$k = O(\sigma, R(P) + 1) - 1$$.

Lemma 2

$$LF(i) = C(BWT[i]) + O(BWT[i], i)$$

Proof. Since the suffixes in a suffix array are sorted, the following follows immediately from the definition of $$LF$$: $$LF$$ maps the $$n$$-th occurrence of $$\sigma$$ in $$BWT$$ on the $$n$$-th occurrence of $$\sigma$$ in the lexicographical sorting of all characters in $$BWT$$. Hence, $$LF$$ can be computed from the counts and occurrences tables.

Lemma 3

Using the $$LF$$ function the following relation between different suffix array entries can be established: \[ SA(i) = SA[LF^{j}(i)] + j \tag{3} \]

Proof. By definition of $$LF$$ this holds for $$j = 1$$. The formula now follows by induction.