Assignment
Give valid commands for the following tasks:
- Download the file Ostta_prot_LATEST.tfa.gz.
This file contains protein sequences from the alga Ostreococcus tauri.
Caution: make sure that the command does not output anything to the screen.
- Decompress the downloaded file Ostta_prot_LATEST.tfa.gz into Ostta_prot_LATEST.tfa.
The original compressed file should be discarded.
- Count the number of lines in the file.
- Compute the total length (= number of amino acids) of all protein sequences in the file.
The * character at the end of the sequences should also be counted as part of the length of a sequence.
- Count the number of sequences that are located on the reverse strand (= have "r:" in the header).
- What is the length of the longest sequence in this file? Hint: the length() function in awk may come in handy.
You may assume that the sequences are not spread over multiple lines.
- How many proteins have a genomic length longer than 600 base pairs? Keep in mind that three base pairs translate into a single amino acid.
Again, you may assume that the sequences are not spread over multiple lines.
Keep the commands as simple and concise as possible.
It is of course allowed to combine multiple commands with pipes where needed.
All commands that output counts or lengths should print these numbers only, nothing else.
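
The tasks above could be approached along the following lines. This is only a sketch, not a reference solution: it runs on a tiny made-up file named sample.tfa (the headers and sequences are invented for illustration) rather than on the real Ostta_prot_LATEST.tfa, and it assumes that headers start with ">" and that each sequence sits on a single line. The download step is shown only as a comment, because the assignment does not give the URL; the URL variable there is a placeholder.

```shell
# Work in a scratch directory so that no existing files are touched.
cd "$(mktemp -d)"

# Download step (not executed here, the real URL is not given in the assignment):
#   wget -q "$URL"        # -q suppresses wget's output; URL is a placeholder

# Build a tiny, invented stand-in for the real compressed file.
{
  echo '>p1 f:1-12'
  echo 'MKT*'
  echo '>p2 r:20-46'
  echo 'MAAAGGHK*'
  echo '>p3 r:100-852'
  awk 'BEGIN { for (i = 0; i < 250; i++) printf "A"; print "*" }'
} | gzip > sample.tfa.gz

# Decompress; gunzip removes the .gz file by default.
gunzip sample.tfa.gz

# Number of lines (reading from stdin keeps the file name out of the output).
wc -l < sample.tfa

# Total sequence length, with the trailing * counted.
grep -v '^>' sample.tfa | tr -d '\n' | wc -c

# Number of sequences on the reverse strand ("r:" in the header).
grep -c '^>.*r:' sample.tfa

# Length of the longest sequence.
awk '!/^>/ && length($0) > max { max = length($0) } END { print max }' sample.tfa

# Number of proteins whose genomic length exceeds 600 base pairs.
# Assumption: the trailing * (stop codon) is counted as three base pairs as well.
awk '!/^>/ && length($0) * 3 > 600' sample.tfa | wc -l
```

On this sample the five counting commands print 6, 264, 2, 251 and 1. Whether the * should count toward the genomic length is a judgment call; to exclude it, the last condition could use (length($0) - 1) * 3 instead.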