A quick introduction to molecular biology

Making up all living material, the cell is considered to be the building block of life. The nucleus, a component of most eukaryotic cells, was identified as the hub of cellular activity 150 years ago. Viewed under a light microscope, the nucleus appears only as a darker region of the cell, but as we increase magnification, we find that the nucleus is densely filled with a stew of macromolecules called chromatin. During mitosis (eukaryotic cell division), most of the chromatin condenses into long, thin strings called chromosomes. The following figure shows cells in different stages of mitosis.

A 1900 drawing by Edmund Wilson of onion cells at different stages of mitosis. The sample has been dyed, causing chromatin in the cells (which soaks up the dye) to appear in greater contrast to the rest of the cell.

One class of the macromolecules contained in chromatin is called nucleic acids. Early 20th-century research into the chemical identity of nucleic acids culminated in the conclusion that nucleic acids are polymers, or repeating chains of smaller, similarly structured molecules known as monomers. Because of their tendency to be long and thin, nucleic acid polymers are commonly called strands.

The nucleic acid monomer is called a nucleotide and is used as a unit of strand length (abbreviated to nt). Each nucleotide is formed of three parts: a sugar molecule, a negatively charged ion called a phosphate, and a compound called a nucleobase ("base" for short). Polymerization is achieved as the sugar of one nucleotide bonds to the phosphate of the next nucleotide in the chain, which forms a sugar-phosphate backbone for the nucleic acid strand. A key point is that the nucleotides of a specific type of nucleic acid always contain the same sugar and phosphate molecules, and they differ only in their choice of base. Thus, one strand of a nucleic acid can be differentiated from another based solely on the order of its bases; this ordering of bases defines a nucleic acid's primary structure.

A sketch of DNA's primary structure.

For example, the above figure shows a strand of deoxyribonucleic acid (DNA), in which the sugar is called deoxyribose, and the only four choices for nucleobases are molecules called adenine (A), cytosine (C), guanine (G), and thymine (T).

For reasons we will soon see, DNA is found in all living organisms on Earth, including bacteria. It is even found in many viruses (which are often considered to be nonliving). Because of its importance, we reserve the term genome to refer to the sum total of the DNA contained in an organism's chromosomes.

Assignment

A string is simply an ordered collection of symbols selected from some alphabet and formed into a word. The length of a string is the number of symbols that it contains. An example of a length 21 DNA string (whose alphabet contains the symbols A, C, G and T) is ATGCTTCAGAAAGGTCTTACG.
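As a quick illustration of these definitions, the example string can be inspected in Python. This is only a sketch; the variable names dna, length and alphabet are chosen here for illustration and are not part of the assignment.

```python
# The example DNA string from the text.
dna = 'ATGCTTCAGAAAGGTCTTACG'

# The length of a string is the number of symbols it contains.
length = len(dna)

# The distinct symbols that actually occur in the string, in sorted order.
alphabet = sorted(set(dna))

print(length)    # 21
print(alphabet)  # ['A', 'C', 'G', 'T']
```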

Write a function baseCount that takes a DNA string as its argument. The function must return a dictionary that maps each symbol (a string of length 1) that occurs in the given sequence onto an integer that indicates the number of occurrences of that symbol in the sequence.

Example

>>> baseCount('AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC')
{'A': 20, 'C': 12, 'G': 17, 'T': 21}

>>> from Bio import SeqIO
>>> baseCount(*SeqIO.parse('data.fna', 'fasta'))
{'A': 233, 'C': 232, 'G': 238, 'T': 244}

Programming shortcut

Our default choice for existing functions and modules to analyze biological data is BioPython, a set of freely available tools for computational biology that are written in Python. We will give you tips on how to solve certain problems (like this one) using BioPython functions and methods. Detailed installation instructions for BioPython are available in PDF and HTML formats.

BioPython offers a specific data type called Seq for representing sequences. A Seq object behaves much like Python's built-in str (string) type, but adds biologically relevant methods such as translate() and reverse_complement(). For this problem, the .count() method, which Seq shares with built-in strings, is all you need. Here's how you could count the occurrences of the letter A in a Seq object.

>>> from Bio.Seq import Seq
>>> my_seq = Seq('AGTACACTGGT')
>>> my_seq.count('A')
3
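
The same .count() call can be repeated for every symbol that occurs in a sequence to build the dictionary the assignment asks for. The following is a minimal sketch of one possible approach, not the reference solution. It assumes the argument is a plain Python string and it sorts the symbols only so that the printed dictionary matches the order shown in the example.

```python
def baseCount(sequence):
    """Map each symbol occurring in sequence to its number of occurrences."""
    # set(sequence) yields each distinct symbol once; sorting fixes key order.
    symbols = sorted(set(sequence))
    # Count every symbol with the built-in str.count() method.
    return {symbol: sequence.count(symbol) for symbol in symbols}


counts = baseCount('AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC')
print(counts)  # {'A': 20, 'C': 12, 'G': 17, 'T': 21}
```

Calling .count() once per symbol scans the sequence several times, which is fine for a four-letter alphabet. For larger alphabets, collections.Counter from the standard library counts all symbols in a single pass.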