Base64 is a way to represent binary data in an ASCII1 string format. This is done by translating the binary data into a radix-64 representation. The term Base64 originates from a specific MIME content transfer encoding2. Each Base64 digit represents exactly 6 bits of data ($$2^6 = 64$$). Three 8-bit bytes (i.e., a total of 24 bits) can therefore be represented by four 6-bit Base64 digits.

The particular set of 64 characters chosen to represent the 64 place-values for the base varies between implementations. The general strategy is to choose 64 characters that are both members of a subset common to most encodings, and also printable. This combination leaves the data unlikely to be modified in transit through information systems, such as email, that were traditionally not 8-bit clean3. For example, MIME4's Base64 implementation uses a set of characters that consists of the 26 uppercase letters (A-Z), the 26 lowercase letters (a-z), the 10 digits (0-9), the plus symbol (+) and the forward slash (/).

index character   index character   index character   index character
0 A 16 Q 32 g 48 w
1 B 17 R 33 h 49 x
2 C 18 S 34 i 50 y
3 D 19 T 35 j 51 z
4 E 20 U 36 k 52 0
5 F 21 V 37 l 53 1
6 G 22 W 38 m 54 2
7 H 23 X 39 n 55 3
8 I 24 Y 40 o 56 4
9 J 25 Z 41 p 57 5
10 K 26 a 42 q 58 6
11 L 27 b 43 r 59 7
12 M 28 c 44 s 60 8
13 N 29 d 45 t 61 9
14 O 30 e 46 u 62 +
15 P 31 f 47 v 63 /

Assignment

Each line of the text file base64.txt5 consists of a pattern $$p \in \mathcal{P}$$, followed by a space and a word $$w \in \mathcal{W}$$. The set $$\mathcal{C}$$ contains all printable characters: uppercase letters, lowercase letters, digits and punctuation characters. This set does not contain whitespace characters (spaces, tabs, newlines, …). The set $$\mathcal{P}$$ contains all possible encodings: strings that only consist of characters that belong to the set $$\mathcal{C}$$. The set $$\mathcal{W}$$ contains all words that only consist of letters. Your task:

  1. Determine the shortest possible regular expressions for the following subsets of $$\mathcal{P}$$:

    • $$\mathcal{P}_1 = \{\,p \in \mathcal{P}\,|\,p\ $$contains characters that do not occur in MIME's Base64 implementation$$\,\}$$

      example: FR+w0|^Z4<bq_,WVT{%xfVxdB<>0JY phis $$\in \mathcal{P}_1$$
    • $$\mathcal{P}_2 = \{\,p \in \mathcal{P}\,|\,p\ $$contains at least four consecutive digits that are enclosed between forward slashes$$\,\}$$

      example: Q4HXD/7100525/+wgzC5mV2IimhFXh falcate $$\in \mathcal{P}_2$$
    • $$\mathcal{P}_3 = \{\,p \in \mathcal{P}\,|\,$$a particular digit occurs at least four times in $$p\,\}$$

      example: ZU4mg4I4BAVUzm5lfAbm4r9reE9zyk tarweeds $$\in \mathcal{P}_3$$
    • the set $$\mathcal{P}_4$$ consists of all lines that are matched by the following two Unix commands; both commands match the exact same lines

      # first alternative
      egrep -i '(.+).* \1$' base64.txt
      
      # second alternative
      rev base64.txt | egrep -i '^(.*) .*\1' | rev
      
      1. describe in words which lines are matched by these commands

      2. explain why one solution works much faster than the other

    Each time give a Unix command where the regular expression is used by a command from the grep family to write only those lines from the text file to stdout whose pattern $$p$$ belongs to $$\mathcal{P}_i\ (i = 1, 2, 3, 4)$$.

    Note

    Make sure you use a recent version of GNU grep for this assignment. A recent version of GNU grep has been installed on helios, but Mac OS X typically uses and older version of GNU grep by default. Mac users are advised to upgrade their grep to the latest version.

  2. Find the words $$w_1\ w_2\ w_3\ w_4$$ of a secret message in the following way:

    • the word $$w_1$$ is on the unique line whose pattern $$p$$ belongs to $$\mathcal{P}_1 \cap \mathcal{P_2}$$

    • the word $$w_2$$ is on the unique line whose pattern $$p$$ belongs to $$ \mathcal{P}_2 \cap \mathcal{P_3}$$

    • the word $$w_3$$ is on the unique line whose pattern $$p$$ belongs to $$\mathcal{P}_3  \cap \mathcal{P_4}$$

    • the word $$w_4$$ is on the unique line whose pattern $$p$$ belongs to $$\mathcal{P}_4 \cap \mathcal{P_1}$$

    Each time give a Unix command where the regular expressions for the subsets $$\mathcal{P}_i\ (i = 1, 2, 3, 4)$$ are used by commands from the grep family to find the word $$w_j\ (j = 1, 2, 3, 4)$$ in the text file and write it to stdout. It is not allow to write the word $$w_j$$ literally (e.g. echo $$w_j$$).