Proteins perform practically every function in the cell. The structural and functional unit of a protein is the domain: in terms of the protein's primary structure, a domain is a contiguous stretch of amino acids that can evolve and function independently.
Each domain usually corresponds to a single function of the protein (e.g., binding the protein to DNA, creating or breaking specific chemical bonds, …). Some proteins, such as myoglobin and the Cytochrome complex, have only one domain, but many proteins are multifunctional and therefore possess several domains. It is even possible to artificially fuse different domains into a protein molecule with definite properties, creating a chimeric protein.
Just like species, proteins can evolve, forming homologous groups called protein families. Proteins from one family usually have the same set of domains, performing similar functions.
A component of a domain that is essential for its function is called a motif. The term has the same general meaning as it does for nucleic acids, although many other terms are also used (blocks, signatures, fingerprints, …). Protein motifs are usually evolutionarily conserved: they appear in different species in varying forms that differ only slightly from one another.
Proteins are identified in different labs around the world and gathered into freely accessible databases. A central repository for protein data is UniProt, which provides detailed protein annotation, including function description, domain structure, and post-translational modifications. UniProt also supports protein similarity search, taxonomy analysis, and literature citations.
To allow for the presence of its varying forms, a protein motif is represented by the following shorthand notation. In the shorthand, each uppercase letter represents a specific amino acid. If a series of uppercase letters is enclosed within a pair of square brackets, it corresponds to a single amino acid from the series. The motif [AC][DEF]G will thus match the following six protein sequences: ADG, AEG, AFG, CDG, CEG and CFG. If a series of uppercase letters is enclosed within a pair of curly braces, it corresponds to a single amino acid not included in the series. The motif {AC} thus represents any amino acid, except A or C. The lowercase letter x is used to represent a single amino acid without any further restrictions.
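As a small illustration of the bracket notation, the sketch below enumerates every sequence that a motif built only from square-bracket groups and single letters can match. The helper names `expand` and `excluded` and the constant `AMINO_ACIDS` are hypothetical, and the sketch assumes an alphabet of the 20 standard amino acids, which the text above does not state explicitly.

```python
from itertools import product

# Assumption: the alphabet consists of the 20 standard amino acids.
AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'


def expand(choices):
    """Return every sequence obtained by picking one letter per group.

    Each element of choices is a string with the letters allowed at that
    position, e.g. ['AC', 'DEF', 'G'] for the motif [AC][DEF]G.
    """
    return [''.join(combination) for combination in product(*choices)]


def excluded(series):
    """Return the letters allowed by a curly-brace group such as {AC}."""
    return ''.join(aa for aa in AMINO_ACIDS if aa not in series)


print(expand(['AC', 'DEF', 'G']))
# ['ADG', 'AEG', 'AFG', 'CDG', 'CEG', 'CFG']
print(len(excluded('AC')))
# 18
```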
Using the shorthand notation, a motif is represented as a sequence of groups, where each group belongs to one of four categories as summarized in the table below. We say that a protein sequence matches a given motif, if the number of amino acids of the protein sequence equals the number of groups in the motif, and each amino acid matches with its corresponding group. This way, we see for example that the protein sequence NFSD matches the N-glycosylation motif that is written as N{P}[ST]{P}.
category | example | matches with |
---|---|---|
uppercase letter | A | the amino acid A |
lowercase letter x | x | a single amino acid |
series of uppercase letters between square brackets | [ACD] | the amino acid A, C or D |
series of uppercase letters between curly braces | {ACD} | any amino acid, except A, C or D |
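The four categories in the table map closely onto regular expression syntax, so one possible way to test a match is to translate the shorthand into a pattern for Python's `re` module. This is only a sketch, not the required approach: the function name `motif_to_regex` is hypothetical, and it assumes that motifs are well-formed and that protein sequences contain only uppercase letters.

```python
import re


def motif_to_regex(motif):
    """Translate the motif shorthand into a regular expression.

    {ACD} becomes the negated character class [^ACD], x becomes [A-Z],
    and uppercase letters and [ACD] groups are already valid as they are.
    Assumption: the motif is well-formed and uses no other characters.
    """
    return motif.replace('{', '[^').replace('}', ']').replace('x', '[A-Z]')


pattern = motif_to_regex('N{P}[ST]{P}')
print(pattern)
# N[^P][ST][^P]
print(re.fullmatch(pattern, 'NFSD') is not None)
# True
print(re.fullmatch(pattern, 'NPSD') is not None)
# False
```

`re.fullmatch` is used here because a match requires the whole protein sequence to be consumed, one amino acid per group.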
Your task:
Write a function groups that takes a motif as its string argument. The function must return the number of groups that are contained in the given motif.
Write a function match that takes two string arguments: a protein sequence and a motif. The function must return a Boolean value that indicates whether or not the given protein sequence matches the given motif.
Write a function positions that takes two string arguments: a protein sequence and a motif. The function must return a list containing all positions in the given protein sequence where a match with the given motif starts. Positions are zero-based (the first amino acid is at position 0) and must be listed in ascending order.
>>> groups('N{P}[ST]{P}')
4
>>> groups('{TCGFSM}{E}[GYD]xSx[YTA]N[AVWMYGCHD]P')
10
>>> match('NFSD', 'N{P}[ST]{P}')
True
>>> match('MFSD', 'N{P}[ST]{P}')
False
>>> match('NPSD', 'N{P}[ST]{P}')
False
>>> match('NFAD', 'N{P}[ST]{P}')
False
>>> match('NFSP', 'N{P}[ST]{P}')
False
>>> match('QDNPYIEEIR', '{TCGFSM}{E}[GYD]xSx[YTA]N[AVWMYGCHD]P')
False
>>> positions('MKNKFKTQEELVNHLKTVGFVFANSEIYNGLANAWDYGPLGVLLKNNLKNLWWKEFVTKQKDVVGLDSAIILNPLVWKASGHLDNFSDPLIDCKNCKARYRADKLIESFDENIHIAENSSNEEFAKVLNDYEISCPTCKQFNWTEIRHFNLMFKTYQGVIEDAKNVVYLRPETAQGIFVNFKNVQRSMRLHLPFGIAQIGKSFRNEITPGNFIFRTREFEQMEIEFFLKEESAYDIFDKYLNQIENWLVSACGLSLNNLRKHEHPKEELSHYSKKTIDFEYNFLHGFSELYGIAYRTNYDLSVHMNLSKKDLTYFDEQTKEKYVPHVIEPSVGVERLLYAILTEATFIEKLENDDERILMDLKYDLAPYKIAVMPLVNKLKDKAEEIYGKILDLNISATFDNSGSIGKRYRRQDAIGTIYCLTIDFDSLDDQQDPSFTIRERNSMAQKRIKLSELPLYLNQKAHEDFQRQCQK', 'N{P}[ST]{P}')
[84, 117, 141, 305, 394]
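One possible way to approach the three functions is sketched below: split the motif into its groups, compare each amino acid with its group, and slide a window over the protein sequence. This is a minimal sketch and not a reference solution. The helpers `parse_groups` and `group_accepts` are hypothetical names, and the code assumes that every motif is well-formed (each opening bracket has its closing bracket).

```python
def parse_groups(motif):
    """Split a motif into its groups, e.g. 'N{P}[ST]' -> ['N', '{P}', '[ST]']."""
    closing = {'[': ']', '{': '}'}
    parsed, i = [], 0
    while i < len(motif):
        if motif[i] in closing:
            # a bracketed group runs up to and including its closing bracket
            end = motif.index(closing[motif[i]], i)
            parsed.append(motif[i:end + 1])
            i = end + 1
        else:
            # a single uppercase letter or the wildcard x
            parsed.append(motif[i])
            i += 1
    return parsed


def group_accepts(group, amino_acid):
    """Check whether a single amino acid matches a single group."""
    if group == 'x':
        return True
    if group[0] == '[':
        return amino_acid in group[1:-1]
    if group[0] == '{':
        return amino_acid not in group[1:-1]
    return amino_acid == group


def groups(motif):
    return len(parse_groups(motif))


def match(protein, motif):
    parsed = parse_groups(motif)
    return len(protein) == len(parsed) and all(
        group_accepts(group, amino_acid)
        for group, amino_acid in zip(parsed, protein)
    )


def positions(protein, motif):
    width = groups(motif)
    # test every window of the right width, so overlapping matches are found
    return [
        start
        for start in range(len(protein) - width + 1)
        if match(protein[start:start + width], motif)
    ]
```

Testing every window separately keeps overlapping matches, at the cost of parsing the motif again for each window; parsing it once and reusing the result would be a straightforward refinement.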