The odd one out

Recall the puzzles in children's ABC books where you have to spot the odd object out. When you're dealing with nucleotide strings you can face the same problem, for example when you need to detect a foreign admixture among a set of samples. Such analysis is nowadays performed by the customs services of many countries in order to detect smuggling or food fraud (passing tilapia off as salmon, etc.).

Fish samples waiting for analysis in the lab of US Customs and Border Protection.

In order to compare several homologous strings we need to align all of them simultaneously, a procedure known as multiple sequence alignment (MSA). Because it requires us to compare more than two sequences at once, MSA is a more complicated problem than pairwise alignment. In fact, finding an optimal alignment of more than a handful of sequences is so computationally intensive that many MSA programs rely instead on "quick and dirty" heuristic methods that usually provide a "good" solution, but not necessarily the best possible one.

First 90 positions of a protein multiple sequence alignment of instances of the acidic ribosomal protein P0 from several organisms.

In practice, some MSA programs execute a series of pairwise alignments and optimize a score computed over all pairs of characters in each position.
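A score "over all pairs of characters in each position" is often called a sum-of-pairs score. The sketch below illustrates the idea on a toy alignment. The function names and the match, mismatch and gap values are assumptions chosen for illustration, not the scoring scheme of any particular MSA program.

```python
from itertools import combinations


def column_score(column, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one alignment column (a sequence of characters)."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue  # two gaps facing each other contribute nothing
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def sum_of_pairs(alignment, **scores):
    """Add up the column scores over all positions of an equal-length alignment."""
    return sum(column_score(column, **scores) for column in zip(*alignment))


rows = ["ACG-T", "ACGAT", "A-GAT"]
print(sum_of_pairs(rows))  # prints 3
```

Three of the five columns are fully conserved (score 3 each), and the two columns containing a gap score -3 each, which gives the total of 3.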

Assignment

One of the first and most commonly used programs for MSA is Clustal, developed by Des Higgins in 1988. The current version using the same approach is called ClustalW2, and it is embedded in many software packages. There is even a modification of ClustalW2 called ClustalX that provides a graphical user interface for MSA. The EBI website contains a convenient online interface¹ that runs ClustalW2.

Select "Protein" or "DNA", then either paste your sequences in one of the listed formats or upload an entire file. To obtain a more accurate alignment, leave "Alignment type: slow" selected. If you choose to run Clustal on only two sequences, the parameter options correspond to those in Needle (see "Pairwise global alignment²").

Write a function outlier that takes the location of a FASTA file containing two or more DNA strings. The function must return the label of the DNA string that differs most from the others.
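Reading the input could look like the minimal sketch below. It assumes the usual FASTA layout, where each record starts with a ">" header line and the first word of that header is the label. The helper name parse_fasta is hypothetical and not part of the assignment.

```python
def parse_fasta(lines):
    """Map each record label to its sequence, given an iterable of text lines."""
    records = {}
    label = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # assumption: the label is the first word after ">"
            words = line[1:].split()
            label = words[0] if words else ""
            records[label] = []
        elif label is not None:
            # sequence data may be wrapped over several lines
            records[label].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = [">seq01 toy record", "ACGT", "ACGA", "", ">seq02", "ACGT"]
print(parse_fasta(example))  # prints {'seq01': 'ACGTACGA', 'seq02': 'ACGT'}
```

Because the function accepts any iterable of lines, an open file handle can be passed to it directly, for example inside a "with open(...) as handle:" block.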

Example

In the following interactive session, we assume the FASTA files data01.fna³ and data02.fna⁴ to be located in the current directory.

>>> outlier('data01.fna')
'seq01'
>>> outlier('data02.fna')
'seq05'

Programming shortcut

Clustal works in three main steps:

do a pairwise alignment; the program takes every pair of strings in the given set and finds the optimal global alignment for the pair, constructing a distance matrix from the results
create a "guide tree"; the program builds a bifurcating tree from the distance matrix: it takes the closest pair, adds the next closest string to that pair as a neighbour, and so on
use the guide tree to carry out a multiple alignment; DNA strings are aligned progressively according to the hierarchy in the guide tree
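The first step can be imitated in a few lines of Python. The sketch below is not Clustal's own procedure: it swaps in a unit-cost global alignment distance (the Levenshtein edit distance) for Clustal's pairwise scores, and it assumes that "differs most" means "has the largest total distance to all other strings". The function names and the toy data are made up for illustration.

```python
def edit_distance(s, t):
    """Unit-cost global alignment distance (Levenshtein) between two strings."""
    previous = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        current = [i]
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,             # character of s against a gap
                current[j - 1] + 1,          # character of t against a gap
                previous[j - 1] + (a != b),  # match (0) or mismatch (1)
            ))
        previous = current
    return previous[-1]


def distance_matrix(sequences):
    """Symmetric matrix of pairwise distances, stored as nested dicts."""
    labels = list(sequences)
    matrix = {label: {} for label in labels}
    for i, first in enumerate(labels):
        for second in labels[i:]:
            d = edit_distance(sequences[first], sequences[second])
            matrix[first][second] = matrix[second][first] = d
    return matrix


def most_distant(sequences):
    """Label whose summed distance to all other strings is largest."""
    matrix = distance_matrix(sequences)
    return max(matrix, key=lambda label: sum(matrix[label].values()))


toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGAACGT", "s4": "TTTTGGGG"}
print(most_distant(toy))  # prints s4
```

The quadratic-time alignment is fine for short strings. For realistic data, the distances reported by a real aligner such as ClustalW are a better basis for the decision.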

You can download the ClustalW2 program from its homepage and run it from the command line or through the graphical interface ClustalX. You can also use an installed ClustalW directly or via one of these wrappers:

Emma⁵ from the EMBOSS package
the BioPython module Clustalw⁶
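If you prefer to call an installed ClustalW without a wrapper, Python's subprocess module is enough. The sketch below assumes the executable is on your PATH under the name clustalw2 (the name differs between installations) and uses the -INFILE, -ALIGN, -TYPE and -OUTFILE command-line options. The helper names are hypothetical.

```python
import subprocess


def clustalw_command(infile, outfile, executable="clustalw2"):
    """Build the argument list for a full multiple alignment of DNA strings."""
    return [
        executable,             # assumption: name of the installed binary
        f"-INFILE={infile}",    # input sequences, e.g. a FASTA file
        "-ALIGN",               # do a full multiple alignment
        "-TYPE=DNA",            # treat the input as nucleotide sequences
        f"-OUTFILE={outfile}",  # file that receives the alignment
    ]


def run_clustalw(infile, outfile, executable="clustalw2"):
    """Run ClustalW; raises CalledProcessError if the program reports failure."""
    subprocess.run(clustalw_command(infile, outfile, executable), check=True)
    return outfile


print(" ".join(clustalw_command("data01.fna", "data01.aln")))
```

Building the command as a list, rather than as one shell string, avoids quoting problems when file names contain spaces.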

The Clustal algorithm is described in detail in the ClustalW and ClustalX version 2.0 paper⁷.