Base64 is a way to represent binary data in an ASCII¹ string format. This is done by translating the binary data into a radix-64 representation. The term Base64 originates from a specific MIME content transfer encoding². Each Base64 digit represents exactly 6 bits of data ($$2^6 = 64$$). Three 8-bit bytes (i.e., a total of 24 bits) can therefore be represented by four 6-bit Base64 digits.

The particular set of 64 characters chosen to represent the 64 place-values for the base varies between implementations. The general strategy is to choose 64 characters that are both members of a subset common to most encodings, and also printable. This combination leaves the data unlikely to be modified in transit through information systems, such as email, that were traditionally not 8-bit clean³. For example, MIME⁴'s Base64 implementation uses a set of characters that consists of the 26 uppercase letters (A-Z), the 26 lowercase letters (a-z), the 10 digits (0-9), the plus symbol (+) and the forward slash (/).

index	character	index	character	index	character	index	character
0	`A`	16	`Q`	32	`g`	48	`w`
1	`B`	17	`R`	33	`h`	49	`x`
2	`C`	18	`S`	34	`i`	50	`y`
3	`D`	19	`T`	35	`j`	51	`z`
4	`E`	20	`U`	36	`k`	52	`0`
5	`F`	21	`V`	37	`l`	53	`1`
6	`G`	22	`W`	38	`m`	54	`2`
7	`H`	23	`X`	39	`n`	55	`3`
8	`I`	24	`Y`	40	`o`	56	`4`
9	`J`	25	`Z`	41	`p`	57	`5`
10	`K`	26	`a`	42	`q`	58	`6`
11	`L`	27	`b`	43	`r`	59	`7`
12	`M`	28	`c`	44	`s`	60	`8`
13	`N`	29	`d`	45	`t`	61	`9`
14	`O`	30	`e`	46	`u`	62	`+`
15	`P`	31	`f`	47	`v`	63	`/`

Assignment

Each line of the text file base64.txt⁵ consists of a pattern $$p \in \mathcal{P}$$, followed by a space and a word $$w \in \mathcal{W}$$. The set $$\mathcal{C}$$ contains all printable characters: uppercase letters, lowercase letters, digits and punctuation characters. This set does not contain whitespace characters (spaces, tabs, newlines, …). The set $$\mathcal{P}$$ contains all possible encodings: strings that only consist of characters that belong to the set $$\mathcal{C}$$. The set $$\mathcal{W}$$ contains all words that only consist of letters. Your task:

Determine the shortest possible regular expressions for the following subsets of $$\mathcal{P}$$:

$$\mathcal{P}_1 = \{\,p \in \mathcal{P}\,|\,p\ $$contains characters that do not occur in MIME's Base64 implementation$$\,\}$$

example:

FR+w0|^Z4<bq_,WVT{%xfVxdB<>0JY phis $$\in \mathcal{P}_1$$

$$\mathcal{P}_2 = \{\,p \in \mathcal{P}\,|\,p\ $$contains at least four consecutive digits that are enclosed between forward slashes$$\,\}$$

example:

Q4HXD/7100525/+wgzC5mV2IimhFXh falcate $$\in \mathcal{P}_2$$

$$\mathcal{P}_3 = \{\,p \in \mathcal{P}\,|\,$$a particular digit occurs at least four times in $$p\,\}$$

example:

ZU4mg4I4BAVUzm5lfAbm4r9reE9zyk tarweeds $$\in \mathcal{P}_3$$

the set $$\mathcal{P}_4$$ consists of all lines that are matched by the following two Unix commands; both commands match the exact same lines
```
# first alternative
egrep -i '(.+).* \1$' base64.txt

# second alternative
rev base64.txt | egrep -i '^(.*) .*\1' | rev
```
1. describe in words which lines are matched by these commands
2. explain why one solution works much faster than the other

Each time give a Unix command where the regular expression is used by a command from the grep family to write only those lines from the text file to stdout whose pattern $$p$$ belongs to $$\mathcal{P}_i\ (i = 1, 2, 3, 4)$$.

Note

Make sure you use a recent version of GNU grep for this assignment. A recent version of GNU grep has been installed on helios, but Mac OS X typically uses and older version of GNU grep by default. Mac users are advised to upgrade their grep to the latest version.

Find the words $$w_1\ w_2\ w_3\ w_4$$ of a secret message in the following way:
- the word $$w_1$$ is on the unique line whose pattern $$p$$ belongs to $$\mathcal{P}_1 \cap \mathcal{P_2}$$
- the word $$w_2$$ is on the unique line whose pattern $$p$$ belongs to $$ \mathcal{P}_2 \cap \mathcal{P_3}$$
- the word $$w_3$$ is on the unique line whose pattern $$p$$ belongs to $$\mathcal{P}_3 \cap \mathcal{P_4}$$
- the word $$w_4$$ is on the unique line whose pattern $$p$$ belongs to $$\mathcal{P}_4 \cap \mathcal{P_1}$$
Each time give a Unix command where the regular expressions for the subsets $$\mathcal{P}_i\ (i = 1, 2, 3, 4)$$ are used by commands from the grep family to find the word $$w_j\ (j = 1, 2, 3, 4)$$ in the text file and write it to stdout. It is not allow to write the word $$w_j$$ literally (e.g. echo $$w_j$$).