GFF files

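A GFF (General Feature Format) file describes genes and other features of DNA, RNA or protein sequences. Each feature line consists of nine tab-separated columns, of which the most important ones for this exercise are the sequence id (column 1), the feature type (column 3), the start coordinate (column 4), the strand (column 7) and the attributes (column 9).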

The first few lines of an example GFF file may look as follows:

# PLAZA instance : dicots_05
# File generation timestamp : Thu Sep 30 11:23:03 CEST 2021
# Species information:
# - species : oeu
# - common name : Olea europaea
# - tax id : 4146
# - assembly/annotation source/version : v1.0
# - annotation data provider : https://phytozome-next.jgi.doe.gov/info/Oeuropaea_v1_0
chr1	v1.0	gene	289	3692	.	-	.	ID=Oeu061231.1;tid=PAC:37727357;id=gOeu061231.1;Name=Oeu061231.1;gene_id=Oeu061231.1
chr1	v1.0	mRNA	289	3692	.	-	.	ID=Oeu061231.1;Parent=Oeu061231.1;Name=Oeu061231.1;gene_id=Oeu061231.1
chr1	v1.0	exon	289	349	.	-	.	ID=Oeu061231.1:exon:1;Parent=Oeu061231.1;Name=Oeu061231.1;gene_id=Oeu061231.1
chr1	v1.0	CDS	289	349	.	-	1	ID=Oeu061231.1:CDS;Parent=Oeu061231.1;Name=Oeu061231.1;gene_id=Oeu061231.1
chr1	v1.0	exon	473	787	.	-	.	ID=Oeu061231.1:exon:2;Parent=Oeu061231.1;Name=Oeu061231.1;gene_id=Oeu061231.1
chr1	v1.0	CDS	473	787	.	-	1	ID=Oeu061231.1:CDS;Parent=Oeu061231.1;Name=Oeu061231.1;gene_id=Oeu061231.1
...
chr2	v1.0	gene	21189213	21190423	.	+	.	ID=Oeu046640.1;tid=PAC:37723918;id=gOeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2	v1.0	mRNA	21189213	21190423	.	+	.	ID=Oeu046640.1;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2	v1.0	exon	21189213	21189336	.	+	.	ID=Oeu046640.1:exon:1;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2	v1.0	CDS	21189213	21189336	.	+	0	ID=Oeu046640.1:CDS;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2	v1.0	exon	21189890	21189977	.	+	.	ID=Oeu046640.1:exon:2;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2	v1.0	CDS	21189890	21189977	.	+	2	ID=Oeu046640.1:CDS;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2	v1.0	exon	21190084	21190150	.	+	.	ID=Oeu046640.1:exon:3;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2	v1.0	CDS	21190084	21190150	.	+	1	ID=Oeu046640.1:CDS;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2	v1.0	exon	21190370	21190423	.	+	.	ID=Oeu046640.1:exon:4;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
chr2	v1.0	CDS	21190370	21190423	.	+	0	ID=Oeu046640.1:CDS;Parent=Oeu046640.1;Name=Oeu046640.1;gene_id=Oeu046640.1
...
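
To look at the tab-separated columns in isolation, the following minimal sketch cuts a few of them out of one feature line. The line is taken from the example above with its attribute list shortened for brevity, and the variable name line is only illustrative.

```shell
# One feature line from the example above (attributes shortened),
# built with printf so that the separators are real tab characters.
line=$(printf 'chr1\tv1.0\tgene\t289\t3692\t.\t-\t.\tID=Oeu061231.1;tid=PAC:37727357')

# Keep the sequence id (1), feature type (3), start (4) and strand (7).
printf '%s\n' "$line" | cut -f1,3,4,7
```

This prints the four selected fields, still separated by tabs, because cut uses the tab character as its default delimiter.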

Note that there may be a header at the top of the file, providing metadata. All header lines start with a # character and should be ignored in this exercise. The attributes (column 9) consist of a list of key-value pairs separated by semicolons (;). Each key-value pair is written as key=value, e.g. ID=Oeu061231.1. Caution: the order in which attributes occur in the list is arbitrary, so do not assume that a given key, such as ID, appears at a fixed position.
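
Because the attribute order is arbitrary, one possible approach is to look an attribute up by its key rather than by its position. The sketch below is only an illustration: get_attr is a hypothetical helper name, and the attribute string is the one from the first example line with its pairs reordered on purpose.

```shell
# Hypothetical helper: print the value of attribute KEY from a column 9 string.
# Usage: get_attr KEY ATTRIBUTES
get_attr() {
    printf '%s\n' "$2" | awk -F';' -v key="$1" '{
        for (i = 1; i <= NF; i++) {
            eq = index($i, "=")          # position of the first "=" in this pair
            # Compare the whole key, so that ID matches neither id nor gene_id.
            if (eq > 0 && substr($i, 1, eq - 1) == key) {
                print substr($i, eq + 1)
                exit
            }
        }
    }'
}

# The ID attribute is deliberately not in first position here.
get_attr ID 'tid=PAC:37727357;id=gOeu061231.1;ID=Oeu061231.1;gene_id=Oeu061231.1'
# prints Oeu061231.1
```

The comparison is case-sensitive, which matters for this file because both ID and id occur in the same attribute list with different values.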

Assignment

Write a self-executable Bash script that takes two arguments:

  1. A gzip-compressed GFF file.
  2. A sequence id (e.g. chr2).

The script writes the IDs of all genes located on the given sequence to standard output, one per line, each followed by the strand of the gene (+ or -). The genes are sorted by start coordinate in ascending order.

For example, when running the script on the file example.gff3.gz with

$ ./script example.gff3.gz chr2

it would write this text to standard output:

Oeu001786.1+
Oeu001787.1-
Oeu001790.1+

Features that are not of type gene are ignored, as are genes on sequences other than the requested one. If the requested sequence id is not present in the provided GFF file, the script produces no output.
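
The requirements above can be sketched as a single pipeline. This is one possible approach under stated assumptions, not a reference solution: list_genes is a hypothetical function name, each gene line is assumed to carry exactly one ID attribute, and the strand is printed directly after the ID because that is how the example output appears here (adjust the print statement if a separator is intended).

```shell
#!/usr/bin/env bash
# Sketch: print the ID and strand of every gene on one sequence, sorted by start.
# Usage: list_genes FILE.gff3.gz SEQUENCE_ID   (list_genes is a hypothetical name)
list_genes() {
    gzip -dc "$1" |
        awk -F'\t' -v seq="$2" '
            /^#/ { next }                          # skip header lines
            # Appending "" forces a string comparison of the sequence id.
            $1 == seq "" && $3 == "gene" {
                n = split($9, attrs, ";")
                for (i = 1; i <= n; i++)
                    if (substr(attrs[i], 1, 3) == "ID=") {
                        # Emit the start coordinate first so sort can use it.
                        print $4 "\t" substr(attrs[i], 4) $7
                        break
                    }
            }' |
        sort -k1,1n |                              # ascending start coordinate
        cut -f2                                    # drop the sort key again
}

# Only run when called with exactly two arguments.
if [ "$#" -eq 2 ]; then
    list_genes "$1" "$2"
fi
```

To make such a script self-executable, keep the shebang on the first line and set the executable bit once with chmod +x script. A sequence id that does not occur in the file simply matches no line, so nothing is printed.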

To further test your solution, you can find many GFF files online for various species, e.g. at PLAZA for plant species.