The tree of life

As buds give rise by growth to fresh buds, and these, if vigorous, branch out and overtop on all sides many a feebler branch, so by generation I believe it has been with the great Tree of Life, which fills with its dead and broken branches the crust of the earth, and covers the surface with its ever-branching and beautiful ramifications.

Charles Darwin, The Origin of Species

A century and a half has passed since the publication of Darwin's magnum opus, and yet the construction of a single Tree of Life uniting life on Earth still has not been completed, with perhaps as many as 90% of all living species not yet catalogued (although a beautiful interactive animation has been produced by OneZoom1).

To get an insight about state-of-art attempts to build this tree, you may take a look at the Tree of Life Web Project2 — a collaborative effort of biologists from around the world to combine information about diversity of life on Earth. It is a peer-reviewed ongoing project started in 1995, now it holds more than 10,000 pages with characteristics of different groups of organisms and their evolutionary history, and tree still grows. A similar effort is the Encyclopedia of Life3.

Instead of trying to construct the entire Tree of Life all at once, we often wish to form a simpler tree in which a collection of species have been clumped together for the sake of simplicity. Such a group is called a taxon (pl. taxa). For a given collection of taxa, a phylogeny is a treelike diagram that best represents the evolutionary connections between the taxa: the construction of a particular phylogeny depends on our specific assumptions regarding how these evolutionary relationships should be interpreted.

phylogenetic tree
A phylogeny illustrating proposed evolutionary relationships among the three domains of life: Bacteria, Archaea, and Eukaryota.

Assignment

An undirected graph is connected if there is a path connecting any two nodes. A tree is a connected (undirected) graph containing no cycles. This definition forces the tree to have a branching structure organized around a central core of nodes, just like its living counterpart.

tree graph
A labeled tree with 6 vertices and 5 edges.

We have already grown familiar with trees in "Mendel's first law4", where we introduced the probability tree diagram to visualize the outcomes of a random variable.

In the creation of a phylogeny, taxa are encoded by the tree's leaves, or nodes having degree 1. A node of a tree having degree larger than 1 is called an internal node.

Write a function missingEdges that takes a positive integer $$n$$ and the location of a text file that contains an adjacency list corresponding to a graph on $$n$$ nodes (labeled $$1, 2, \ldots, n$$) that contains no cycles. Each line of the file contains two space-separated integers that describe an edge relation of the graph. The function must return the minimum number of edges that can be added to the graph to produce a tree.

Example

>>> missingEdges(10, 'data01.txt')
3
>>> missingEdges(826, 'data02.txt')
801

Extra information

After solving this assignment, a standard mathematical exercise for the technically minded is to verify that every tree having 2 or more nodes must contain at least two leaves.