Searching through the haystack

In "Finding a motif in DNA" we searched a given genetic string for a motif. However, this problem assumed that we know the motif in advance. In practice, biologists often do not know exactly what they are looking for. Rather, they must hunt through several different genomes at the same time to identify regions of similarity that may indicate genes shared by different organisms or species.

The simplest such region of similarity is a motif occurring without mutation in every one of a collection of genetic strings taken from a database. Such a motif corresponds to a substring shared by all the strings. We want to search for long shared substrings, as a longer motif will likely indicate a greater shared function.

Assignment

A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, CG is a common substring of ACGTACGT and AACCGTATA, but it is not as long as possible. In this case, CGTA is a longest common substring of ACGTACGT and AACCGTATA.

Note that the longest common substring is not necessarily unique. For a simple example, AA and CC are both longest common substrings of AACC and CCAA.

Write a function longestCommonSubstrings that takes the location of a FASTA file containing a collection of DNA strings. The function must return a set containing all longest common substrings of the given collection of DNA strings.

Example

In the following interactive session, we assume the FASTA files data01.fna and data02.fna to be located in the current directory.

        >>> longestCommonSubstrings('data01.fna')
{'AC', 'CA', 'TA'}
>>> longestCommonSubstrings('data02.fna')
{'AACTT', 'CCCGC'}