Determining an organism's complete genome (called genome sequencing) forms a central task of bioinformatics. Unfortunately, we still don't possess the microscope technology to zoom into the nucleotide level and determine the sequence of a genome's nucleotides, one at a time. However, researchers can apply chemical methods to generate and identify much smaller snippets of DNA, called reads. After obtaining a large collection of reads from multiple copies of the same genome, the aim is to reconstruct the desired genome from these small pieces of DNA. This process is called genome assembly.
From a computational perspective, genome assembly is an extremely difficult problem. It becomes even harder, because many of the genomes contain large numbers of identical sequences that are repeated at various locations in the genome. Such repetitions often stretch over thousands of nucleotides, and some are found at thousands of different positions across the genome. This especially occurs in the large genomes of plants and animals. As a result, it is often impossible to reconstruct the complete genome and the assembly process ends with a number of large pieces of the genome which are called contigs in this context.
To determine the quality of a genome assembly (for example, when comparing two different assembly algorithms that separately have computed contigs starting from the same set of reads), the N50 statistic is often used.
Suppose we denote the list of lengths of the contigs of a genome assembly as $$L$$ (this is a list of positive integers), then the N50 statistic can be computed in the following way:
create a new list $$L'$$ that contains $$n$$ copies of each element $$n$$ from the original list $$L$$
the N50 statistic equals the median1 of $$L'$$
Your task:
Write a function median that takes a collection (a list, tuple, set, …) of positive integers. This function must return the median of the elements in the given collection as a floating point number.
Write a function extend that takes a collection (a list, tuple, set, …) of positive integers. The function must return a new list containing $$n$$ copies of each element $$n$$ in the given collection.
Use the functions median and extend to write a function N50. This function takes a collection (a list, tuple, set, …) of positive integers representing the lengths of the contigs of a genome assembly. The function must return the N50 statistic for the given collection of contig lengths.
The median of a list of numbers is the number in the middle if the list is sorted in ascending or descending order. If the list contains an even number of values, the median is determined as the average of the two middle numbers of the sorted list.
>>> contigs = (2, 2, 2, 3, 3, 4, 8, 8)
>>> median(contigs)
3.0
>>> extend(contigs)
[2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
>>> N50(contigs)
6.0