Conversion from GFF to BED

In bioinformatics, GFF (General Feature Format) and BED (Browser Extensible Data) are two widely used formats for representing genomic features like genes and exons. Understanding these formats is essential for genomic data analysis. Quite often, conversions between these two formats are needed to match the requirements of various bioinformatics tools.

GFF (General Feature Format)

GFF is a tab-delimited text format that describes each genomic feature on one line with nine columns:

  1. Sequence Name: The chromosome or scaffold (e.g., chr1).
  2. Source: The annotation source (e.g., a tool or database). Warning: the value in this column may contain spaces.
  3. Feature Type: Type of feature (e.g., gene, exon).
  4. Start: Start position (inclusive).
  5. End: End position (inclusive).
  6. Score: Confidence score.
  7. Strand: + or - indicating the strand.
  8. Frame: Reading frame for coding sequences.
  9. Attributes: Additional details in key-value pairs (separated by ;). Examples of important attributes are the ID of the feature and its Parent.

The GFF format is versatile and often used for complex (hierarchical) annotations. Important: positions in a GFF file are 1-based, and both the start and the end position are inclusive.
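As an illustration of the column layout above, the following sketch splits a record on tabs only (so a Source value with spaces stays in one field) and looks up a key in the attributes column. It assumes GFF3-style key=value attributes; the sample record and the helper name get_attr are hypothetical and not part of any standard tool.

```bash
# Hypothetical sample record; the Source column deliberately contains a space.
gff_line=$(printf 'Chr1\tmy tool\tgene\t3631\t5899\t.\t+\t.\tID=AT1G01010;Name=AT1G01010')

# Print the value of one attribute key (column 9) for a GFF line read on stdin.
# Prints nothing if the key is absent. Assumes key=value pairs separated by ";".
get_attr() {
  awk -F'\t' -v key="$1" '{
    n = split($9, kv, ";")
    for (i = 1; i <= n; i++) {
      eq = index(kv[i], "=")
      if (eq > 0 && substr(kv[i], 1, eq - 1) == key) {
        print substr(kv[i], eq + 1)
        exit
      }
    }
  }'
}

# Splitting on tabs keeps "my tool" intact as a single field.
printf '%s\n' "$gff_line" | awk -F'\t' '{ print NF, $2 }'   # prints: 9 my tool
printf '%s\n' "$gff_line" | get_attr ID                      # prints: AT1G01010
```

Splitting on whitespace instead of tabs would shift every column after Source in this record, which is why the field separator is set explicitly.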

BED (Browser Extensible Data)

The BED format is a simpler tab-delimited format primarily used for visualizing genomic features in genome browsers such as the UCSC Genome Browser. It is more compact than GFF and focuses on the essential information needed for displaying features, such as chromosomal location and name.

A BED file has three mandatory columns (the first three below), followed by optional columns for additional data. The first six columns are:

  1. Chromosome: The chromosome or scaffold (e.g., chr1).
  2. Start: Start position (inclusive).
  3. End: End position (exclusive).
  4. Name: The name of the feature (e.g., gene name).
  5. Score: A score.
  6. Strand: + or - indicating the strand.

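Unlike in GFF files, positions in BED files are 0-based and the end position is exclusive. As a result, only the start value changes when converting from GFF to BED (start minus 1); the end value stays the same.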
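The difference between the two coordinate conventions can be sketched with a small Bash function. The function name gff_to_bed_interval and the input values are illustrative assumptions; the values were chosen to correspond to the first expected output line of the assignment.

```bash
# Convert a 1-based, end-inclusive interval (GFF) into a 0-based,
# end-exclusive interval (BED): subtract 1 from the start, keep the end.
gff_to_bed_interval() {
  local gff_start=$1 gff_end=$2
  printf '%d\t%d\n' "$((gff_start - 1))" "$gff_end"
}

gff_to_bed_interval 3631 5899   # prints: 3630<TAB>5899
```

The feature length is the same in both conventions: 5899 - 3631 + 1 in GFF equals 5899 - 3630 in BED.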

Assignment

GFF carries more detail than BED, so converting a GFF file to BED involves extracting and simplifying that information. The conversion typically selects key fields such as chromosome, start, end, name, score, and strand, and omits or transforms the remaining GFF fields.
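The field selection described above might look like the following awk sketch for a single record. It is one possible approach under stated assumptions, not a reference solution: attributes are assumed to be GFF3-style key=value pairs, the sample record and its Source value are hypothetical, and gff_record_to_bed is an illustrative name.

```bash
# Hypothetical GFF record (tab-separated, GFF3-style attributes assumed).
gff_line=$(printf 'Chr1\texample\tgene\t6788\t9130\t.\t-\t.\tID=AT1G01020;Name=AT1G01020')

# Map GFF columns to six BED columns:
# chromosome, start - 1, end, name (ID attribute), score, strand.
# If no ID attribute is found, "." is used as the name (an assumption).
gff_record_to_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    name = "."
    n = split($9, kv, ";")
    for (i = 1; i <= n; i++)
      if (kv[i] ~ /^ID=/) { name = substr(kv[i], 4); break }
    print $1, $4 - 1, $5, name, $6, $7
  }'
}

printf '%s\n' "$gff_line" | gff_record_to_bed
# prints (tab-separated): Chr1 6787 9130 AT1G01020 . -
```

The Source, Feature Type, and Frame columns are dropped because BED has no corresponding field among its first six columns.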

Write a stand-alone, executable Bash script that takes a single argument: the path to a compressed GFF file (e.g., ath.gff.gz). The script should convert the GFF file to a BED file according to these specifications:

  1. Only top-level features should be extracted from the GFF file. These are features without a parent (check the attributes column). Any other features should be ignored.
  2. Only the columns needed for the six BED fields described above (chromosome, start, end, name, score, strand) should be extracted from the GFF file.
  3. Mind the positions: take specific care of the conversion from 1-based (GFF) to 0-based (BED) positions.
  4. The Name column of the output BED file should contain the ID of the corresponding GFF feature (check the attributes column).
  5. The output should be written to a file with the same base name as the input file: e.g., for input file ath.gff.gz the output file name should be ath.bed. If the output file already exists, the following warning should be printed before the file is overwritten: "Overwriting existing output file ath.bed!"
  6. Any lines starting with a # character in the GFF file are header lines and should be ignored.
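Some of the housekeeping steps in the list above can be sketched as small Bash helpers. This is a partial sketch under assumptions, not a complete solution: the helper names are hypothetical, the input is assumed to end in .gff.gz, the output is assumed to go to the current directory, the warning is assumed to go to standard output, and a Parent key is assumed to appear as Parent= in GFF3-style attributes.

```bash
# Derive the BED file name from the input path, e.g. path/to/ath.gff.gz -> ath.bed.
bed_name_for() {
  local base
  base=$(basename "$1")
  base=${base%.gz}     # strip the compression suffix
  base=${base%.gff}    # strip the format suffix
  printf '%s.bed\n' "$base"
}

# Print the required warning if the target file already exists.
warn_if_exists() {
  if [ -e "$1" ]; then
    printf 'Overwriting existing output file %s!\n' "$1"
  fi
}

# Keep only top-level records: drop header lines (starting with #), lines
# with fewer than nine tab-separated fields, and records with a Parent key.
top_level_only() {
  awk -F'\t' '!/^#/ && NF >= 9 && $9 !~ /(^|;)Parent=/'
}

bed_name_for path/to/ath.gff.gz   # prints: ath.bed
printf '##gff-version 3\nChr1\tx\tgene\t1\t9\t.\t+\t.\tID=g1\nChr1\tx\tmRNA\t1\t9\t.\t+\t.\tID=t1;Parent=g1\n' \
  | top_level_only                # prints only the gene record
```

In a full script, the compressed input could be streamed into the filter with gzip -dc "$1" | top_level_only, which avoids writing an uncompressed copy to disk.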

For the example ath.gff.gz GFF file, the first few lines of the output BED file should be exactly these (tab-separated):

Chr1	3630	5899	AT1G01010	.	+
Chr1	6787	9130	AT1G01020	.	-
Chr1	11100	11372	AT1G03987	.	+
Chr1	11648	13714	AT1G01030	.	-
Chr1	11896	11976	AT1TE00010	.	+
Chr1	16882	17009	AT1TE00020	.	-
Chr1	17023	18924	AT1TE00025	.	+
Chr1	18330	18642	AT1TE00030	.	-
Chr1	23311	24099	AT1G03993	.	-
Chr1	23120	31227	AT1G01040	.	+