Proteins perform practically every function in the cell. The structural and functional unit of a protein is the domain: in terms of the protein's primary structure, a domain is an interval of amino acids that can evolve and function independently.

Each domain usually corresponds to a single function of the protein (e.g., binding the protein to DNA, creating or breaking specific chemical bonds, …). Some proteins, such as myoglobin and the Cytochrome complex, have only one domain, but many proteins are multifunctional and therefore possess several domains. It is even possible to artificially fuse different domains into a protein molecule with definite properties, creating a chimeric protein.

Just like species, proteins can evolve, forming homologous groups called protein families. Proteins from one family usually have the same set of domains, performing similar functions.

The human cyclophilin family, as represented by the structures of the isomerase domains of some of its members.

A component of a domain that is essential for its function is called a motif. The term has broadly the same meaning as it does for nucleic acids, although many other terms are also in use (blocks, signatures, fingerprints, …). Protein motifs are usually evolutionarily conserved, meaning that they appear with little change in different species.

Proteins are identified in different labs around the world and gathered into freely accessible databases. A central repository for protein data is UniProt, which provides detailed protein annotation, including function description, domain structure, and post-translational modifications. UniProt also supports protein similarity search, taxonomy analysis, and literature citations.

Assignment

To allow for the presence of its varying forms, a protein motif is represented by the following shorthand notation. In the shorthand, each uppercase letter represents a specific amino acid. If a series of uppercase letters is enclosed within a pair of square brackets, it corresponds to a single amino acid from the series. The motif [AC][DEF]G will thus match the following six protein sequences: ADG, AEG, AFG, CDG, CEG and CFG. If a series of uppercase letters is enclosed within a pair of curly braces, it corresponds to a single amino acid not included in the series. The motif {AC} thus represents any amino acid, except A or C. The lowercase letter x is used to represent a single amino acid without any further restrictions.
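To make the square-bracket rule concrete, the sketch below enumerates every sequence matched by a motif. The helper name `expand` is hypothetical and not part of the assignment, and the code assumes a motif built only from uppercase letters and square-bracket groups (no curly braces and no x, since those would stand for many amino acids).

```python
import re
from itertools import product

def expand(motif):
    """List all sequences matched by a motif that contains only
    uppercase letters and square-bracket groups."""
    # Each token is either a bracketed series or a single uppercase letter;
    # findall returns (series, letter) tuples with one of the two empty.
    tokens = re.findall(r'\[([A-Z]+)\]|([A-Z])', motif)
    choices = [series or letter for series, letter in tokens]
    # Pick one amino acid per group, in every possible combination.
    return [''.join(combo) for combo in product(*choices)]

print(expand('[AC][DEF]G'))
# ['ADG', 'AEG', 'AFG', 'CDG', 'CEG', 'CFG']
```

The number of matching sequences is the product of the sizes of the groups, here 2 × 3 × 1 = 6.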

Using the shorthand notation, a motif is represented as a sequence of groups, where each group belongs to one of four categories as summarized in the table below. We say that a protein sequence matches a given motif, if the number of amino acids of the protein sequence equals the number of groups in the motif, and each amino acid matches with its corresponding group. This way, we see for example that the protein sequence NFSD matches the N-glycosylation motif that is written as N{P}[ST]{P}.

| category | example | matches with |
|---|---|---|
| uppercase letter | A | the amino acid A |
| lowercase letter x | x | a single amino acid |
| series of uppercase letters between square brackets | [ACD] | the amino acid A, C or D |
| series of uppercase letters between curly braces | {ACD} | any amino acid, except A, C or D |
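The four categories can be checked one group at a time. The following is a minimal sketch of one possible approach, not a reference solution: the helper names `split_groups` and `matches_group` are hypothetical, and the code assumes a well-formed motif in which every opening bracket or brace has its closing counterpart.

```python
def split_groups(motif):
    """Split a motif into its groups, e.g. 'N{P}[ST]' -> ['N', '{P}', '[ST]']."""
    closing = {'[': ']', '{': '}'}
    result, i = [], 0
    while i < len(motif):
        if motif[i] in closing:
            # A bracketed group runs up to and including its closing symbol.
            j = motif.index(closing[motif[i]], i)
            result.append(motif[i:j + 1])
            i = j + 1
        else:
            # A single uppercase letter or x is a group on its own.
            result.append(motif[i])
            i += 1
    return result

def groups(motif):
    """Number of groups in the motif."""
    return len(split_groups(motif))

def matches_group(amino_acid, group):
    """Check one amino acid against one group, following the table."""
    if group == 'x':
        return True
    if group.startswith('['):
        return amino_acid in group[1:-1]
    if group.startswith('{'):
        return amino_acid not in group[1:-1]
    return amino_acid == group

def match(protein, motif):
    """True if the protein has one amino acid per group and each one fits."""
    parts = split_groups(motif)
    return len(protein) == len(parts) and all(
        matches_group(aa, g) for aa, g in zip(protein, parts))

print(groups('N{P}[ST]{P}'))         # 4
print(match('NFSD', 'N{P}[ST]{P}'))  # True
print(match('NPSD', 'N{P}[ST]{P}'))  # False
```

Splitting the motif first keeps the length check and the per-group check separate, which mirrors the two conditions in the definition of a match.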

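Your task:

Write a function groups that takes a motif and returns the number of groups in that motif. Write a function match that takes a protein sequence and a motif, and returns whether the protein sequence matches the motif. Write a function positions that takes a protein sequence and a motif, and returns the list of positions at which a part of the protein sequence matches the motif, where positions are counted from zero.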

Example

>>> groups('N{P}[ST]{P}')
4
>>> groups('{TCGFSM}{E}[GYD]xSx[YTA]N[AVWMYGCHD]P')
10

>>> match('NFSD', 'N{P}[ST]{P}')
True
>>> match('MFSD', 'N{P}[ST]{P}')
False
>>> match('NPSD', 'N{P}[ST]{P}')
False
>>> match('NFAD', 'N{P}[ST]{P}')
False
>>> match('NFSP', 'N{P}[ST]{P}')
False
>>> match('QDNPYIEEIR', '{TCGFSM}{E}[GYD]xSx[YTA]N[AVWMYGCHD]P')
False

>>> positions('MKNKFKTQEELVNHLKTVGFVFANSEIYNGLANAWDYGPLGVLLKNNLKNLWWKEFVTKQKDVVGLDSAIILNPLVWKASGHLDNFSDPLIDCKNCKARYRADKLIESFDENIHIAENSSNEEFAKVLNDYEISCPTCKQFNWTEIRHFNLMFKTYQGVIEDAKNVVYLRPETAQGIFVNFKNVQRSMRLHLPFGIAQIGKSFRNEITPGNFIFRTREFEQMEIEFFLKEESAYDIFDKYLNQIENWLVSACGLSLNNLRKHEHPKEELSHYSKKTIDFEYNFLHGFSELYGIAYRTNYDLSVHMNLSKKDLTYFDEQTKEKYVPHVIEPSVGVERLLYAILTEATFIEKLENDDERILMDLKYDLAPYKIAVMPLVNKLKDKAEEIYGKILDLNISATFDNSGSIGKRYRRQDAIGTIYCLTIDFDSLDDQQDPSFTIRERNSMAQKRIKLSELPLYLNQKAHEDFQRQCQK', 'N{P}[ST]{P}')
[84, 117, 141, 305, 394]
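One way to obtain such a list of positions is to translate the shorthand into a regular expression and scan the sequence with a lookahead, so that overlapping occurrences are reported as well. The sketch below is an illustration rather than the intended solution, and it assumes a well-formed motif and a protein sequence that contains only uppercase amino acid letters.

```python
import re

def positions(protein, motif):
    """Zero-based start positions where a part of protein matches motif."""
    # {P} becomes [^P] (anything but P) and x becomes . (any single character).
    # Assumption: the sequence holds only amino acid letters, so "any
    # character" and "any amino acid" coincide.
    regex = motif.replace('{', '[^').replace('}', ']').replace('x', '.')
    # A zero-width lookahead matches at every start position without
    # consuming characters, so overlapping occurrences are not skipped.
    return [m.start() for m in re.finditer(f'(?={regex})', protein)]

print(positions('NFSDNFSD', 'N{P}[ST]{P}'))  # [0, 4]
print(positions('AAAA', 'AA'))               # [0, 1, 2]
```

The positions are counted from zero, in line with the example above, where the first occurrence NFSD starts at the 85th amino acid and is reported as 84.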