Base64 is a way to represent binary data in an ASCII1 string format. This is done by translating the binary data into a radix-64 representation. The term Base64 originates from a specific MIME content transfer encoding2. Each Base64 digit represents exactly 6 bits of data ($$2^6 = 64$$). Three 8-bit bytes (i.e., a total of 24 bits) can therefore be represented by four 6-bit Base64 digits.
The particular set of 64 characters chosen to represent the 64 place-values for the base varies between implementations. The general strategy is to choose 64 characters that are both members of a subset common to most encodings, and also printable. This combination leaves the data unlikely to be modified in transit through information systems, such as email, that were traditionally not 8-bit clean3. For example, MIME4's Base64 implementation uses a set of characters that consists of the 26 uppercase letters (A-Z), the 26 lowercase letters (a-z), the 10 digits (0-9), the plus symbol (+) and the forward slash (/).
index | character | index | character | index | character | index | character | |||
---|---|---|---|---|---|---|---|---|---|---|
0 | A | 16 | Q | 32 | g | 48 | w | |||
1 | B | 17 | R | 33 | h | 49 | x | |||
2 | C | 18 | S | 34 | i | 50 | y | |||
3 | D | 19 | T | 35 | j | 51 | z | |||
4 | E | 20 | U | 36 | k | 52 | 0 | |||
5 | F | 21 | V | 37 | l | 53 | 1 | |||
6 | G | 22 | W | 38 | m | 54 | 2 | |||
7 | H | 23 | X | 39 | n | 55 | 3 | |||
8 | I | 24 | Y | 40 | o | 56 | 4 | |||
9 | J | 25 | Z | 41 | p | 57 | 5 | |||
10 | K | 26 | a | 42 | q | 58 | 6 | |||
11 | L | 27 | b | 43 | r | 59 | 7 | |||
12 | M | 28 | c | 44 | s | 60 | 8 | |||
13 | N | 29 | d | 45 | t | 61 | 9 | |||
14 | O | 30 | e | 46 | u | 62 | + | |||
15 | P | 31 | f | 47 | v | 63 | / |
Each line of the text file base64.txt5 consists of a pattern $$p \in \mathcal{P}$$, followed by a space and a word $$w \in \mathcal{W}$$. The set $$\mathcal{C}$$ contains all printable characters: uppercase letters, lowercase letters, digits and punctuation characters. This set does not contain whitespace characters (spaces, tabs, newlines, …). The set $$\mathcal{P}$$ contains all possible encodings: strings that only consist of characters that belong to the set $$\mathcal{C}$$. The set $$\mathcal{W}$$ contains all words that only consist of letters. Your task:
Determine the shortest possible regular expressions for the following subsets of $$\mathcal{P}$$:
$$\mathcal{P}_1 = \{\,p \in \mathcal{P}\,|\,p\ $$contains characters that do not occur in MIME's Base64 implementation$$\,\}$$
example: |
$$\mathcal{P}_2 = \{\,p \in \mathcal{P}\,|\,p\ $$contains at least four consecutive digits that are enclosed between forward slashes$$\,\}$$
example: |
$$\mathcal{P}_3 = \{\,p \in \mathcal{P}\,|\,$$a particular digit occurs at least four times in $$p\,\}$$
example: |
the set $$\mathcal{P}_4$$ consists of all lines that are matched by the following two Unix commands; both commands match the exact same lines
# first alternative egrep -i '(.+).* \1$' base64.txt # second alternative rev base64.txt | egrep -i '^(.*) .*\1' | rev
describe in words which lines are matched by these commands
explain why one solution works much faster than the other
Each time give a Unix command where the regular expression is used by a command from the grep family to write only those lines from the text file to stdout whose pattern $$p$$ belongs to $$\mathcal{P}_i\ (i = 1, 2, 3, 4)$$.
Make sure you use a recent version of GNU grep for this assignment. A recent version of GNU grep has been installed on helios, but Mac OS X typically uses and older version of GNU grep by default. Mac users are advised to upgrade their grep to the latest version.
Find the words $$w_1\ w_2\ w_3\ w_4$$ of a secret message in the following way:
the word $$w_1$$ is on the unique line whose pattern $$p$$ belongs to $$\mathcal{P}_1 \cap \mathcal{P_2}$$
the word $$w_2$$ is on the unique line whose pattern $$p$$ belongs to $$ \mathcal{P}_2 \cap \mathcal{P_3}$$
the word $$w_3$$ is on the unique line whose pattern $$p$$ belongs to $$\mathcal{P}_3 \cap \mathcal{P_4}$$
the word $$w_4$$ is on the unique line whose pattern $$p$$ belongs to $$\mathcal{P}_4 \cap \mathcal{P_1}$$
Each time give a Unix command where the regular expressions for the subsets $$\mathcal{P}_i\ (i = 1, 2, 3, 4)$$ are used by commands from the grep family to find the word $$w_j\ (j = 1, 2, 3, 4)$$ in the text file and write it to stdout. It is not allow to write the word $$w_j$$ literally (e.g. echo $$w_j$$).