A sequence logo is a graphical representation of a multiple alignment of DNA, RNA or protein sequences, from which regions with conserved residues can be read at a glance. Sequence logos are used, for example, to detect conserved regions in DNA, because these are the regions where transcription factors can bind.
To make a sequence logo, one starts from related DNA, RNA or protein sequences, or from DNA sequences with conserved binding regions. In a first step, these sequences are aligned relative to each other, such that the most conserved residues are placed underneath each other. Next, the frequency of each residue is calculated per position in this multiple alignment. The sequence logo shows for each position how well the residues are conserved: the more residues of a particular type occur at a position, the taller the letter of that residue, because the conservation of the residue at that position is higher. The letters of the residues at the same position are scaled according to their frequency. The total height of all the letters at the same position corresponds to the information at that position, expressed in bits (entropy).
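The relation between frequencies and letter heights can be illustrated with a small sketch. It assumes the commonly used definition in which the information at a position equals the maximum possible entropy (the base-2 logarithm of the alphabet size) minus the Shannon entropy of the observed frequencies, and it leaves out the small-sample correction that some logo tools apply. The function names information_content and letter_heights are hypothetical and are not part of the assignment.

```python
from math import log2

def information_content(frequencies):
    """Information (in bits) at one position, assuming no small-sample correction."""
    # Shannon entropy of the observed frequencies; zero frequencies contribute nothing
    entropy = -sum(f * log2(f) for f in frequencies if f > 0)
    # maximum entropy for an alphabet of this size, minus the observed entropy
    return log2(len(frequencies)) - entropy

def letter_heights(frequencies):
    """Height of each letter: its frequency times the information at the position."""
    total = information_content(frequencies)
    return [f * total for f in frequencies]
```

Under these assumptions a fully conserved DNA position carries 2 bits, whereas a position at which all four bases are equally frequent carries 0 bits, so its letters have no height.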
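Sequences are often stored in FASTA format. In its simplest form, a FASTA file contains a single description line that starts with a ">", followed by the lines of the sequence itself.
>118480563|DQ207729|Bacillus cereus|16S ribosomal RNA gene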
AGAGTTTGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAATGGATTA
AGAGCTTGCTCTTATGAAGTTAGCGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCCATAAGACTGGG
ATAACTCCGGGAAACCGGGGCTAATACCGGATAACATTTTGAACCGCATGGTTCGAAATTGAAAGGCGGC
TTCGGCTGTCACTTATGGATGGACCCGCGTCGCATTAGCTAGTTGGTGAGGTAACGGCTCACCAAGGCAA
CGATGCGTA
Below is an example of a FASTA file with multiple sequences. Note that in this case, there are multiple description lines - lines that start with a ">" - which indicate that a new sequence begins thereafter.
>571435|U16165|Clostridium acetobutylicum|16S ribosomal RNA gene
TGGCGGCGTGCTTAACACATGCAAGTCGAGCGATGAAGCTCCTTCGGGAGTGGATTAGCGGCGGACGGGT
GAGTAACACGTGGGTAACCTGCCTCATAGAGGGGAATAGCCTTTCGAAAGGAAGATTAATACCGCATAAG
ATTGTAGTGCCGCATGGCATAGCAATTAAAGGAGTAATCCGCTATGAGATGGACCCGCGTCGCATTAGCT
AGTTGGTGAGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACATTGG
GACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTG
>996091|L07834|Geobacter metallireducens|16S ribosomal RNA gene
AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGAGTGCCTAACACATGCAAGTCGAACGTGAAGGGGG
CTTCGGTCCCCGGAAAGTGGCGCACGGGTGAGTAACGCGTGGATAATCTGCCCAGTGATCTGGGATAACA
TCTCGAAAGGGGTGCTAATACCGGATAAGCCCACGGAGTCCTTGGATTCTGCGGGAAAAGGGGGGGACCT
TCGGGCCTTTTGTCACTGGATGAGTCCGCGTACCATTAGCTAGTTGGTGGGGTAATGGCCCACCAAGGCT
ACGATGGTTAG
Each description line is followed by one or more lines that contain the sequence. Sequences can represent both DNA sequences and protein sequences, and they can contain gaps that are represented by a minus sign (-).
Write a function readFasta that reads the sequences from a FASTA file. The location of the FASTA file must be passed to the function as a string argument. The function must return a list of strings, in which the successive strings correspond to the consecutive sequences as they are listed in the file. A sequence that is split over multiple lines in the file must appear in the list as a single string without whitespace.
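One possible way to approach readFasta is sketched below. This is only a sketch under a few assumptions that the assignment does not state explicitly: empty lines are skipped, any text before the first description line is ignored, and the description lines themselves are discarded.

```python
def readFasta(location):
    """Return the sequences in the FASTA file at the given location as a list of strings."""
    sequences = []
    current = None  # fragments of the sequence being read; None before the first '>'
    with open(location) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # assumption: empty lines carry no information
            if line.startswith('>'):
                # a description line closes the previous sequence and opens a new one
                if current is not None:
                    sequences.append(''.join(current))
                current = []
            elif current is not None:
                # remove any whitespace inside the line before storing the fragment
                current.append(''.join(line.split()))
    # do not forget the last sequence in the file
    if current is not None:
        sequences.append(''.join(current))
    return sequences
```

The fragments of a sequence are collected in a list and joined once, which avoids repeated string concatenation for long sequences.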
Write a function sequenceLogo to which a list of
aligned sequences must be passed. This implies that all the strings
in the list have the same length. See the example below to see how the
function should react if this condition is not met. A second,
optional argument alphabet (default ACGT) can be used to pass
another string to the function, which lists the residue letters
of which the sequences are composed. See the example below to see
how the function should respond if the given sequences contain a
residue that does not belong to the given alphabet. The function
must not distinguish between uppercase and lowercase letters in the
letter representation of the residues, neither in the given sequences
nor in the given alphabet.
The function must return a list in which the element at position
$$i$$ is itself a list containing the relative frequencies of the residues at
position $$i$$ in the aligned sequences. The order of the residues in
each frequency table is the same as the order of the residues in the
given alphabet.
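The description above could be turned into code along the following lines. This is a minimal sketch, not a reference solution: the order in which the two conditions are checked is an assumption, the assertion messages are copied from the example session, and a gap character (-) is treated as an invalid residue unless it is included in the alphabet.

```python
def sequenceLogo(sequences, alphabet='ACGT'):
    """Return, per alignment position, the relative frequency of each residue of the alphabet."""
    # compare everything in uppercase so that case does not matter
    alphabet = alphabet.upper()
    sequences = [sequence.upper() for sequence in sequences]

    # all aligned sequences must have the same length
    lengths = {len(sequence) for sequence in sequences}
    assert len(lengths) <= 1, 'not all sequences have the same length'

    # every residue must occur in the alphabet
    assert all(
        residue in alphabet for sequence in sequences for residue in sequence
    ), 'invalid residu'

    # zip(*sequences) yields the columns of the alignment one by one
    logo = []
    for column in zip(*sequences):
        logo.append([column.count(residue) / len(column) for residue in alphabet])
    return logo
```

Transposing the alignment with zip keeps the frequency computation to one line per column; the frequencies appear in the order of the residues in the alphabet.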
In the following example we assume that the file seq.fasta is in the current directory.
>>> sequences = readFasta('seq.fasta')
>>> sequences
['ATG', 'gtg', 'CTA', 'TTa', 'ATG']
>>> sequenceLogo(sequences)
[[0.4, 0.2, 0.2, 0.2], [0.0, 0.0, 0.0, 1.0], [0.4, 0.0, 0.6, 0.0]]
>>> sequenceLogo(sequences, alphabet='gcta')
[[0.2, 0.2, 0.2, 0.4], [0.0, 0.0, 1.0, 0.0], [0.6, 0.0, 0.0, 0.4]]
>>> sequenceLogo(sequences, alphabet='GCUA')
Traceback (most recent call last):
AssertionError: invalid residu
>>> sequenceLogo(['AGCTGC', 'TCGT', 'CGTATGATAG'])
Traceback (most recent call last):
AssertionError: not all sequences have the same length
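The construction of the first sequence logo from the above example is shown in the table below. Each row corresponds to one position in the alignment: the first five columns contain the residues of the five sequences at that position, and the last four columns contain the resulting frequencies. Such a table - for DNA sequences sometimes shortened to the last four columns - is referred to as a position-specific scoring matrix in bioinformatics.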
seq1 | seq2 | seq3 | seq4 | seq5 |  | A | C | G | T |
---|---|---|---|---|---|---|---|---|---|
A | g | C | T | A |  | 0.4 | 0.2 | 0.2 | 0.2 |
T | t | T | T | T |  | 0.0 | 0.0 | 0.0 | 1.0 |
G | g | A | a | G |  | 0.4 | 0.0 | 0.6 | 0.0 |