In studying the relationship between brain function and language, psychologists at the University of Alberta (Canada), together with linguists at the University of Tübingen (Germany), found that people agree nearly unanimously on how funny nonsense words are. Here are some of the words their study predicted to be most humorous:
word | funniness (1-grams) | funniness (2-grams) | funniness (3-grams) |
---|---|---|---|
howaymb | 1.06771 | 0.09682 | 0.00699 |
proffic | 1.33591 | 0.19497 | 0.03164 |
subvick | 1.21500 | 0.14689 | 0.02015 |
quarban | 1.38082 | 0.19559 | 0.02078 |
supplic | 1.50944 | 0.20794 | 0.02595 |
octeste | 2.00986 | 0.44417 | 0.06772 |
tatinse | 2.06316 | 0.47488 | 0.10412 |
It seems that the less statistically likely a sequence of letters is to form a real English word, the funnier it strikes us. Why should that be? Possibly laughter signals to ourselves and to others that we have recognized that something is amiss, but that it poses no danger to our safety.
The above study quantifies the funniness of a letter sequence in the following way. Take a corpus of existing words (e.g. words.txt) and count how often each $$n$$-gram occurs (an $$n$$-gram is a sequence of $$n$$ letters). The word banana, for example, contains the following bigrams (2-grams): ba, an, na, an and na. Note that the bigrams an and na each occur twice in this word, whereas the bigram ba occurs only once. The number of occurrences of an $$n$$-gram $$\text{x}$$ in a given corpus is represented as $$O(\text{x})$$. In the corpus from the file words.txt, for example, we count 118845 occurrences of the letter (1-gram) a, making $$O(\text{a}) = 118845$$; 2506 occurrences of the bigram qu, making $$O(\text{qu}) = 2506$$; and 722 occurrences of the trigram (3-gram) que, making $$O(\text{que}) = 722$$.
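As a quick illustration, the $$n$$-grams of a single word can be collected with a sliding window of width $$n$$; a minimal Python sketch using collections.Counter:

```python
from collections import Counter

word = 'banana'
n = 2
# Slide a window of width n over the word: start positions 0 .. len(word) - n.
bigrams = [word[i:i + n] for i in range(len(word) - n + 1)]
print(bigrams)           # ['ba', 'an', 'na', 'an', 'na']
print(Counter(bigrams))  # Counter({'an': 2, 'na': 2, 'ba': 1})
```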
The probability of an $$n$$-gram $$\text{x}$$ (represented as $$P(\text{x})$$) is estimated by dividing $$O(\text{x})$$ by the total number of $$n$$-grams in the corpus. For the corpus words.txt, for example, this gives $$P(\text{a}) = 0.07567$$, $$P(\text{qu}) = 0.001793$$ and $$P(\text{que}) = 0.0005894$$.
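Written as a formula, with the sum in the denominator running over all distinct $$n$$-grams $$\text{y}$$ in the corpus: \[ P(\text{x}) = \frac{O(\text{x})}{\sum_{\text{y}} O(\text{y})} \]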
The entropy of an $$n$$-gram $$\text{x}$$ (represented as $$E(\text{x})$$) is computed as \[ E(\text{x}) = - P(\text{x}) \log_2 P(\text{x})\] By definition $$E(\text{x}) = 0$$ if $$P(\text{x}) = 0$$, which is consistent with the limit \[ \lim_{p \rightarrow 0+} p \log_2 p = 0 \]
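For example, with $$P(\text{a}) = 0.07567$$ as computed above, \[ E(\text{a}) = -0.07567 \log_2 0.07567 \approx 0.2818 \] which matches the value of entropy('a', letters) in the interactive session below.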
For a given $$n \in \mathbb{N}_0$$ the funniness of a letter sequence $$s_1s_2s_3\ldots s_m$$ (with $$m \geq n$$) is computed as the sum of the entropies of all $$n$$-grams in the letter sequence \[ E(s_1\ldots s_n) + E(s_2\ldots s_{n+1}) + E(s_3\ldots s_{n+2}) + \cdots + E(s_{m-n+1}\ldots s_m) \] Note that a sequence of length $$m$$ contains exactly $$m - n + 1$$ such $$n$$-grams.
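For example, with $$n = 2$$ the funniness of banana is the sum of the entropies of its five bigrams listed above: \[ E(\text{ba}) + E(\text{an}) + E(\text{na}) + E(\text{an}) + E(\text{na}) = E(\text{ba}) + 2\,E(\text{an}) + 2\,E(\text{na}) \]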
Your task:
Write a function ngrams that takes an integer $$n \in \mathbb{N}_0$$ (int) and the location of a text file (str). The text file must contain a corpus of words, with each word on a separate line and consisting only of letters. The function must return a dictionary (dict) whose keys are all $$n$$-grams in the given corpus. The function should not distinguish between uppercase and lowercase letters, and must use lowercase letters in the keys of the dictionary. The dictionary must map each $$n$$-gram $$\text{x}$$ onto the number of occurrences $$O(\text{x})$$ of that $$n$$-gram in the given corpus. (A sketch of one possible implementation of the four functions follows these descriptions.)
Write a function probability that takes two arguments: i) an $$n$$-gram $$\text{x}$$ (str) and ii) a dictionary (dict) formatted like the ones returned by the function ngrams when called with first argument $$n$$. The function must return the probability $$P(\text{x})$$ (float), where $$O(\text{x})$$ can be looked up in the given dictionary (by definition $$O(\text{x}) = 0$$ if $$\text{x}$$ is not a key of the given dictionary).
Write a function entropy that takes the same two arguments as the function probability. The function must return the entropy $$E(\text{x})$$ (float), with $$P(\text{x})$$ computed as in the function probability.
Write a function funniness that takes two arguments: i) a string $$s$$ (str) that only contains letters and ii) a dictionary (dict) formatted like the ones returned by the function ngrams. The function must return the funniness of $$s$$ (float), with $$n \in \mathbb{N}_0$$ equal to the length of the keys of the given dictionary. The entropy of the $$n$$-grams must be computed as in the function entropy. The function also has an optional parameter normalized that takes a Boolean value (default value: False). If the value True is passed to normalized, the funniness of $$s$$ must additionally be divided by the number of $$n$$-grams in $$s$$.
The functions probability, entropy and funniness should not make a distinction between uppercase and lowercase letters.
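The following is a minimal sketch of one possible implementation, kept deliberately simple (in particular, probability recomputes the corpus total on every call; caching that total would be an easy optimization):

```python
from collections import Counter
from math import log2

def ngrams(n, filename):
    """Map every n-gram in the corpus file onto its number of occurrences O(x)."""
    counts = Counter()
    with open(filename, encoding='utf-8') as corpus:
        for line in corpus:
            word = line.strip().lower()          # case-insensitive, lowercase keys
            for i in range(len(word) - n + 1):   # sliding window of width n
                counts[word[i:i + n]] += 1
    return dict(counts)

def probability(x, counts):
    """Estimate P(x) as O(x) divided by the total number of n-grams in the corpus."""
    return counts.get(x.lower(), 0) / sum(counts.values())

def entropy(x, counts):
    """Compute E(x) = -P(x) * log2(P(x)), with E(x) = 0 if P(x) = 0."""
    p = probability(x, counts)
    return -p * log2(p) if p > 0 else 0.0

def funniness(s, counts, normalized=False):
    """Sum the entropies of all n-grams in s, optionally divided by their number."""
    n = len(next(iter(counts)))                  # all keys have the same length n
    terms = [entropy(s[i:i + n], counts) for i in range(len(s) - n + 1)]
    return sum(terms) / len(terms) if normalized else sum(terms)
```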
In the following interactive session we assume that the text file words.txt is located in the current directory.
>>> letters = ngrams(1, 'words.txt')
>>> letters['a']
118845
>>> letters['q']
2535
>>> letters['z']
7450
>>> bigrams = ngrams(2, 'words.txt')
>>> bigrams['ac']
6799
>>> bigrams['qu']
2506
>>> bigrams['zx']
Traceback (most recent call last):
KeyError: 'zx'
>>> trigrams = ngrams(3, 'words.txt')
>>> trigrams['abc']
2
>>> trigrams['que']
722
>>> trigrams['xcz']
Traceback (most recent call last):
KeyError: 'xcz'
>>> probability('a', letters)
0.07567142511492862
>>> probability('q', letters)
0.0016140945152622664
>>> probability('z', letters)
0.00474359137621455
>>> probability('ac', bigrams)
0.0048643609543276645
>>> probability('qu', bigrams)
0.001792923746366396
>>> probability('zx', bigrams)
0.0
>>> probability('abc', trigrams)
1.6327943479190852e-06
>>> probability('que', trigrams)
0.0005894387595987898
>>> probability('xcz', trigrams)
0.0
>>> entropy('a', letters)
0.28180852742167367
>>> entropy('Q', letters)
0.014970822223506234
>>> entropy('z', letters)
0.03661959827269034
>>> entropy('ac', bigrams)
0.03737548277295434
>>> entropy('QU', bigrams)
0.016357686287373385
>>> entropy('Zx', bigrams)
0.0
>>> entropy('abc', trigrams)
3.1389206700115134e-05
>>> entropy('QUE', trigrams)
0.006323717369962047
>>> entropy('XcZ', trigrams)
0.0
>>> funniness('howaymb', letters)
1.0677136135941063
>>> funniness('howaymb', letters, normalized=False)
1.0677136135941063
>>> funniness('howaymb', letters, normalized=True)
0.15253051622772948
>>> funniness('Subvick', letters)
1.2150001024776427
>>> funniness('QUARBAN', letters)
1.3808215071539807
>>> funniness('howaymb', bigrams)
0.09682092801393552
>>> funniness('howaymb', bigrams, normalized=False)
0.09682092801393552
>>> funniness('howaymb', bigrams, normalized=True)
0.01613682133565592
>>> funniness('Subvick', bigrams)
0.14689473722037735
>>> funniness('QUARBAN', bigrams)
0.19558791736702053
>>> funniness('howaymb', trigrams)
0.006989174063509218
>>> funniness('howaymb', trigrams, normalized=False)
0.006989174063509218
>>> funniness('howaymb', trigrams, normalized=True)
0.0013978348127018435
>>> funniness('Subvick', trigrams)
0.020151762537186854
>>> funniness('QUARBAN', trigrams)
0.020777229925614614