A sequence logo is a graphical representation of a multiple alignment of DNA, RNA or protein sequences, from which regions with conserved residues can be read at a glance. Sequence logos are used, for example, to detect conserved regions in DNA, because these are sites where transcription factors can bind.

RNA codons
Sequence logo representing the most conserved bases around the start codon of all human mRNAs. Note that the start codon itself is not drawn to scale; otherwise the letters A, U and G would each have a height of 2 bits.

To make a sequence logo, one starts from related DNA, RNA or protein sequences, or from DNA sequences with conserved binding regions. In a first step, these sequences are aligned so that the most conserved residues are placed in the same columns. Next, the frequency of each residue is calculated per position in this multiple alignment. The sequence logo shows for each position how well the residues are conserved: the more often a particular residue occurs at a position, the taller its letter, because that residue is more strongly conserved there. The letters at the same position are scaled according to their frequency. The total height of all the letters at one position corresponds to the information content of that position, expressed in bits (entropy).
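The relation between residue frequencies and letter heights can be sketched in a few lines of Python. This sketch assumes the usual definition of information content for a column, log2(alphabet size) minus the Shannon entropy of the observed frequencies, and leaves out the small-sample correction that logo software often applies. The function names `column_information` and `letter_heights` are hypothetical and are not part of the assignment.

```python
from collections import Counter
from math import log2


def column_information(column, alphabet_size=4):
    """Information content (in bits) of one alignment column.

    Assumption: information = log2(alphabet_size) - Shannon entropy,
    without any small-sample correction.
    """
    counts = Counter(column.upper())
    total = sum(counts.values())
    entropy = -sum((n / total) * log2(n / total) for n in counts.values())
    return log2(alphabet_size) - entropy


def letter_heights(column, alphabet_size=4):
    """Height of each letter: its frequency times the column's information."""
    counts = Counter(column.upper())
    total = sum(counts.values())
    info = column_information(column, alphabet_size)
    return {residue: (n / total) * info for residue, n in counts.items()}


# A fully conserved DNA column reaches the maximum of 2 bits.
print(column_information('TTTTT'))  # 2.0
# A column in which all four bases are equally frequent carries no information.
print(column_information('ACGT'))   # 0.0
```

Under these assumptions a perfectly conserved DNA or RNA position has a height of 2 bits, which matches the remark about the start codon above.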

FASTA format

In bioinformatics, FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. A sequence ends when another line starting with ">" appears, which marks the start of the next sequence, or at the end of the file. A simple example of one sequence in FASTA format:
>118480563|DQ207729|Bacillus cereus|16S ribosomal RNA gene
AGAGTTTGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAATGGATTA
AGAGCTTGCTCTTATGAAGTTAGCGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCCATAAGACTGGG
ATAACTCCGGGAAACCGGGGCTAATACCGGATAACATTTTGAACCGCATGGTTCGAAATTGAAAGGCGGC
TTCGGCTGTCACTTATGGATGGACCCGCGTCGCATTAGCTAGTTGGTGAGGTAACGGCTCACCAAGGCAA
CGATGCGTA

Below is an example of a FASTA file with multiple sequences. Note that in this case, there are multiple description lines - lines that start with a ">" - which indicate that a new sequence begins thereafter.

>571435|U16165|Clostridium acetobutylicum|16S ribosomal RNA gene
TGGCGGCGTGCTTAACACATGCAAGTCGAGCGATGAAGCTCCTTCGGGAGTGGATTAGCGGCGGACGGGT
GAGTAACACGTGGGTAACCTGCCTCATAGAGGGGAATAGCCTTTCGAAAGGAAGATTAATACCGCATAAG
ATTGTAGTGCCGCATGGCATAGCAATTAAAGGAGTAATCCGCTATGAGATGGACCCGCGTCGCATTAGCT
AGTTGGTGAGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACATTGG
GACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTG
>996091|L07834|Geobacter metallireducens|16S ribosomal RNA gene
AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGAGTGCCTAACACATGCAAGTCGAACGTGAAGGGGG
CTTCGGTCCCCGGAAAGTGGCGCACGGGTGAGTAACGCGTGGATAATCTGCCCAGTGATCTGGGATAACA
TCTCGAAAGGGGTGCTAATACCGGATAAGCCCACGGAGTCCTTGGATTCTGCGGGAAAAGGGGGGGACCT
TCGGGCCTTTTGTCACTGGATGAGTCCGCGTACCATTAGCTAGTTGGTGGGGTAATGGCCCACCAAGGCT
ACGATGGTTAG

Each description line is followed by one or more lines of sequence data. Sequences can be DNA sequences or protein sequences, and they may contain gaps, which are represented by a minus sign (-).
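The format rules above can be illustrated with a minimal parser. This is only a sketch under stated assumptions: the name `parse_fasta` is hypothetical, the function takes any iterable of lines (an open file or a list of strings), it skips blank lines, and it ignores sequence data that appears before the first description line.

```python
def parse_fasta(lines):
    """Return a list of (description, sequence) pairs from FASTA lines.

    Assumptions: blank lines are skipped, and sequence data that precedes
    the first description line is ignored.
    """
    records = []
    description, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, ''.join(chunks)))
            description, chunks = line[1:], []
        elif description is not None:
            # Sequence data may span several lines; collect the pieces.
            chunks.append(line)
    # The last record is closed by the end of the input.
    if description is not None:
        records.append((description, ''.join(chunks)))
    return records


example = [
    '>seq1 first example sequence',
    'AGAGTTTGATCC',
    'TGGCTCAG',
    '>seq2',
    'TGG-CGGCGTG',
]
for description, sequence in parse_fasta(example):
    print(description, len(sequence))
```

To read from a file, the same function could be called as `parse_fasta(open_file)` inside a `with open(...)` block, since a file object yields its lines one by one.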

Assignment

Example

In the following example we assume that the file seq.fasta is in the current directory.

>>> sequences = readFasta('seq.fasta')
>>> sequences
['ATG', 'gtg', 'CTA', 'TTa', 'ATG']

>>> sequenceLogo(sequences)
[[0.4, 0.2, 0.2, 0.2], [0.0, 0.0, 0.0, 1.0], [0.4, 0.0, 0.6, 0.0]]

>>> sequenceLogo(sequences, alphabet='gcta')
[[0.2, 0.2, 0.2, 0.4], [0.0, 0.0, 1.0, 0.0], [0.6, 0.0, 0.0, 0.4]]

>>> sequenceLogo(sequences, alphabet='GCUA')
Traceback (most recent call last):
AssertionError: invalid residu

>>> sequenceLogo(['AGCTGC', 'TCGT', 'CGTATGATAG'])
Traceback (most recent call last):
AssertionError: not all sequences have the same length

The construction of the first sequence logo from the above example is shown in the table below. Such a table, which for DNA sequences is sometimes shortened to the last four columns, is referred to as a position-specific scoring matrix in bioinformatics.

seq1  seq2  seq3  seq4  seq5    A    C    G    T
A     g     C     T     A       0.4  0.2  0.2  0.2
T     t     T     T     T       0.0  0.0  0.0  1.0
G     g     A     a     G       0.4  0.0  0.6  0.0
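The table can be reproduced with a short computation. The following is a minimal sketch, not a reference solution: `frequency_matrix` is a hypothetical name, the input list is assumed to be non-empty, and residues are compared case-insensitively, as the examples above suggest. The assertion messages simply mirror those shown in the examples.

```python
def frequency_matrix(sequences, alphabet='ACGT'):
    """Relative residue frequencies per position, one row per position.

    Assumptions: `sequences` is a non-empty list of equally long strings,
    and upper and lower case letters denote the same residue.
    """
    length = len(sequences[0])
    assert all(len(seq) == length for seq in sequences), \
        'not all sequences have the same length'
    symbols = alphabet.upper()
    matrix = []
    # zip(*sequences) walks through the alignment column by column.
    for column in zip(*sequences):
        column = [residue.upper() for residue in column]
        assert all(residue in symbols for residue in column), 'invalid residu'
        matrix.append([column.count(s) / len(column) for s in symbols])
    return matrix


print(frequency_matrix(['ATG', 'gtg', 'CTA', 'TTa', 'ATG']))
# [[0.4, 0.2, 0.2, 0.2], [0.0, 0.0, 0.0, 1.0], [0.4, 0.0, 0.6, 0.0]]
```

Each row of the result corresponds to one row of the table, with the frequencies listed in the order of the given alphabet.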