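A GFF (General Feature Format) file describes genes and other features of DNA, RNA or protein sequences. Each feature line consists of nine tab-separated columns, of which the most important ones for this exercise are:
- column 1: the id of the sequence on which the feature is located (e.g. chr1)
- column 3: the feature type (e.g. gene)
- column 4: the start coordinate of the feature
- column 7: the strand of the feature (+ or -)
- column 9: the attributes of the feature
The first few lines of an example GFF file may look as follows: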
# PLAZA instance : dicots_05
# File generation timestamp : Thu Sep 30 11:23:03 CEST 2021
# Species information:
# - species : oeu
# - common name : Olea europaea
# - tax id : 4146
# - assembly/annotation source/version : v1.0
# - annotation data provider : https://phytozome-next.jgi.doe.gov/info/Oeuropaea_v1_0
chr1 v1.0 gene 289 3692 . - . ID=Oeu061231.1;tid=PAC:37727357;id=gOeu061231.1;Name=Oeu061231.1;gene_id=Oeu061231.1
chr1 v1.0 mRNA 289 3692 . - . ID=Oeu061231.1;Parent=Oeu061231.1;Name=Oeu061231.1;gene_id=Oeu061231.1
chr1 v1.0 exon 289 349 . - . ID=Oeu061231.1:exon:1;Parent=Oeu061231.1;Name=Oeu061231.1;gene_id=Oeu061231.1
chr1 v1.0 CDS 289 349 . - 1 ID=Oeu061231.1:CDS;Parent=Oeu061231.1;Name=Oeu061231.1;gene_id=Oeu061231.1
chr1 v1.0 exon 473 787 . - . ID=Oeu061231.1:exon:2;Parent=Oeu061231.1;Name=Oeu061231.1;gene_id=Oeu061231.1
chr1 v1.0 CDS 473 787 . - 1 ID=Oeu061231.1:CDS;Parent=Oeu061231.1;Name=Oeu061231.1;gene_id=Oeu061231.1
...
chr2 v1.0 gene 21189213 21190423 . + . ID=Oeu046640.1;tid=PAC:37723918;id=gOeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2 v1.0 mRNA 21189213 21190423 . + . ID=Oeu046640.1;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2 v1.0 exon 21189213 21189336 . + . ID=Oeu046640.1:exon:1;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2 v1.0 CDS 21189213 21189336 . + 0 ID=Oeu046640.1:CDS;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2 v1.0 exon 21189890 21189977 . + . ID=Oeu046640.1:exon:2;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2 v1.0 CDS 21189890 21189977 . + 2 ID=Oeu046640.1:CDS;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2 v1.0 exon 21190084 21190150 . + . ID=Oeu046640.1:exon:3;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2 v1.0 CDS 21190084 21190150 . + 1 ID=Oeu046640.1:CDS;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2 v1.0 exon 21190370 21190423 . + . ID=Oeu046640.1:exon:4;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2 v1.0 CDS 21190370 21190423 . + 0 ID=Oeu046640.1:CDS;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
...
Note that there may be a header at the top of the file, providing metadata. All header lines start with a # character and should be ignored in this exercise.
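As a minimal sketch of ignoring the header, the lines starting with # can be filtered out before any further processing. The helper name skip_header is hypothetical and only serves as an illustration:

```shell
# Hypothetical helper: drop header lines (those starting with '#')
# from standard input and pass all other lines through unchanged.
skip_header() {
    grep -v '^#'
}

# Usage: the header line is removed, the feature line is kept.
printf '# - tax id : 4146\nchr1\tv1.0\tgene\t289\t3692\n' | skip_header
```

In a real script the input would typically come from the GFF file itself rather than from printf.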
The attributes (column 9) consist of a list of key-value pairs, separated by semicolons (;). Each key-value pair is written as key=value, e.g. ID=Oeu061231.1.
Caution: the order in which attributes occur in the list is arbitrary, so don't assume that a given key appears at a fixed position.
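One way to look up an attribute without relying on its position is to split the list on semicolons and match on the exact key. The sketch below assumes a hypothetical helper named get_attr. Because the match is exact and case-sensitive, the key ID is not confused with the lowercase id that also occurs in the example file:

```shell
# Hypothetical helper: print the value of one attribute key.
#   $1 = key to look up (matched exactly, case-sensitive)
#   $2 = attribute string from column 9, e.g. "ID=x;Name=y"
get_attr() {
    printf '%s\n' "$2" |
        tr ';' '\n' |
        awk -F'=' -v key="$1" '
            # Print everything after "key=" for the first matching pair.
            $1 == key { print substr($0, length(key) + 2); exit }'
}

# Usage: the ID attribute is found even though it is not listed first.
get_attr ID 'tid=PAC:37727357;ID=Oeu061231.1;Name=Oeu061231.1'
```

If the key is absent, the helper prints nothing.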
Write a self-executable Bash script that takes two arguments:
- the path to a GFF file (e.g. example.gff3.gz)
- the id of a sequence (e.g. chr2)
The script writes the list of gene IDs found on the given sequence within the GFF file to standard output, together with the strand of each gene (+ or -), sorted by start coordinate (ascending).
For example, when running the script on the file example.gff3.gz with
$ ./script example.gff3.gz chr2
it would write this text to standard output:
Oeu001786.1+
Oeu001787.1-
Oeu001790.1+
Any features that are not of type gene are ignored, as are any genes on sequences other than the one that was requested to be processed. If the requested sequence id is not present in the provided GFF file, the script does not give any output.
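The requirements above could be combined roughly as follows. This is a sketch under several assumptions, not a reference solution: the input file is assumed to be gzip-compressed (as the name example.gff3.gz suggests), the gene ID is assumed to be the value of the ID attribute, and the ID and strand are printed without a separator to mirror the sample output shown earlier. The function name list_genes is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list gene IDs with their strand for one sequence, sorted by start.

list_genes() {
    # $1 = path to a gzip-compressed GFF file (assumption), $2 = sequence id
    gzip -cd -- "$1" |
        awk -F'\t' -v seq="$2" '
            /^#/ { next }                      # ignore header lines
            $1 == seq && $3 == "gene" {        # genes on the requested sequence
                id = ""
                n = split($9, attrs, ";")      # attribute order is arbitrary
                for (i = 1; i <= n; i++) {
                    if (attrs[i] ~ /^ID=/) { id = substr(attrs[i], 4) }
                }
                # Emit the start coordinate first so it can act as sort key.
                if (id != "") { print $4 "\t" id $7 }
            }' |
        sort -t "$(printf '\t')" -k1,1n |      # ascending numeric start
        cut -f2-                               # drop the sort key again
}

# Run only when both arguments are supplied.
if [ "$#" -eq 2 ]; then
    list_genes "$1" "$2"
fi
```

Prefixing each line with the start coordinate keeps the sort step simple, since the coordinate is not part of the final output and can be cut off afterwards. If the requested sequence does not occur in the file, awk prints nothing and the output stays empty.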
To further test your solution, you can find many GFF files online for various species, e.g. at PLAZA for plant species.