Gene Regulation

Gene regulation is the process by which the expression of genes is controlled. This control is essential for cells to function properly and to respond to their environment. There are many different mechanisms of gene regulation, but one very important component is the regulation of chromatin accessibility.

Chromatin is the complex of DNA and proteins that make up the chromosomes. It is tightly packed in most cells, which keeps genes closed and unable to be transcribed. However, in some regions of the genome, chromatin is more accessible, which allows genes to be transcribed.

Transcription factors (TFs) are proteins that bind to specific sequences of DNA and regulate gene transcription. They can either activate transcription, by recruiting the machinery that makes RNA from DNA, or repress transcription, by blocking the access of other proteins to the gene.

Accessible chromatin is a region of chromatin that is open and accessible to transcription factors. This means that it is more likely to be transcribed than regions of chromatin that are not accessible.

ATAC-seq (Assay for Transposase-Accessible Chromatin Sequencing)

One of the available techniques that is used to measure chromatin accessibility is ATAC-seq1. It works by introducing a transposase enzyme into cells. The transposase enzyme cuts the DNA at random, and the fragments of DNA are then sequenced. The more accessible a region of chromatin is, the more fragments of DNA will be sequenced from that region.

Mapping the obtained sequences back on the reference genome allows to obtain the coordinates of accessible chromatin regions, which are often encoded in a BED file.

BED files

A typical BED file2 contains at least three columns separated by tabs or spaces:

  1. The chromosome name.
  2. The start coordinate of the region.
  3. The end coordinate of the region.

Additional columns may be added, containing more detailed information such as strand. For the assignment below, any additional columns except the first three required columns can be ignored.

An example of a minimal BED file may look as follows:

chr2    0       900
chr1    120     2304
chr1    3244    5691

Caution: the coordinates in a BED are 0-based and the end coordinate is exclusive. This means that in the example above, the region on chr2 starts at the start of the chromosome (position 0) and ends at position 899 (= 900 - 1). The length (= number of base pairs) of this region is 900.

Assignment

Write a self-executable Bash script that, given a BED file as single argument, writes out some statistics about the contained genomic regions to standard output. The following information should be printed (see example output below):

  1. For each chromosome: the number of regions within that chromosome that are present in the file. Chromosome names are sorted alphabetically (ascending).
  2. The total number of regions.
  3. The minimum, maximum and median3 region length. If there are an even number of regions in the file, the median length is defined as the average of the two middlemost lengths.

For the example BED file above, the output of the script would be:

$ ./script.sh example.bed
chr1: 2 regions
chr2: 1 region
# Total: 3 regions
# Min length: 900
# Max length: 2447
# Median length: 2184

Note that the script knows its English grammar like a pro: it says 2 regions (plural) but 1 region (singular). This should be taken into account in your solution.

Another example input BED file with one additional region and chromosome:

chr2    0       900
chr1    120     2304
chr1    3244    5691
chr3    100     2316

This input would yield the following output:

$ ./script.sh example2.bed
chr1: 2 regions
chr2: 1 region
chr3: 1 region
# Total: 4 regions
# Min length: 900
# Max length: 2447
# Median length: 2200

To further test your solution you can easily create some BED files yourself or look for more examples online, e.g. here4 you will find some.

Bash functions

Try to use functions in your solution to avoid code duplication. For example, for both the number of regions per chromosome as well as the total number of regions in the file, you need to distinguish between 1 region or multiple x regions. How can you use a Bash function to implement the same logic only once? Also for inferring the region lengths, a Bash function may come in handy, to avoid code duplication when computing the minimum, median and maximum region lengths.

Think about other possibilities for writing Bash functions to avoid code duplication in this exercise.