In studying the relationship between brain function and language, psychologists at the University of Alberta (Canada), together with linguists at the University of Tübingen (Germany), found that people agree nearly unanimously on how funny nonsense words are. Below are some of the words their study predicted to be most humorous, together with their funniness scores based on 1-grams, 2-grams and 3-grams:

word      1-grams   2-grams   3-grams
howaymb   1.06771   0.09682   0.00699
proffic   1.33591   0.19497   0.03164
subvick   1.21500   0.14689   0.02015
quarban   1.38082   0.19559   0.02078
supplic   1.50944   0.20794   0.02595
octeste   2.00986   0.44417   0.06772
tatinse   2.06316   0.47488   0.10412

It seems that the less statistically likely a collection of letters is to form a real word in English, the funnier it strikes us. Why should that be? Possibly laughter signals to ourselves and others that we have recognized that something is amiss, but that it poses no danger to our safety.

Assignment

The above study quantifies the funniness of a letter sequence in the following way. Take a corpus of existing words (e.g. words.txt) and count how often each $$n$$-gram occurs (an $$n$$-gram is a sequence of $$n$$ letters). The word banana, for example, contains the following bigrams (2-grams): ba, an, na, an and na. Note that the bigrams an and na each occur twice in this word, whereas the bigram ba occurs only once. The number of occurrences of an $$n$$-gram $$\text{x}$$ in a given corpus is represented as $$O(\text{x})$$. For example, we count 118845 occurrences of the letter (1-gram) a in the corpus from the file words.txt, making $$O(\text{a}) = 118845$$; 2506 occurrences of the bigram qu, making $$O(\text{qu}) = 2506$$; and 722 occurrences of the trigram (3-gram) que, making $$O(\text{que}) = 722$$.
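A minimal sketch of how this counting could be implemented (assuming the corpus file lists one word per line; the function name ngrams and its behaviour follow the interactive session further down):

from collections import Counter

def ngrams(n, filename):
    # Count the occurrences O(x) of every n-gram x in the corpus,
    # ignoring the distinction between uppercase and lowercase.
    counts = Counter()
    with open(filename, encoding='utf-8') as corpus:
        for word in corpus:
            word = word.strip().lower()
            # slide a window of length n over the word
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    # return a plain dict, so that looking up an absent n-gram raises
    # a KeyError, as in the interactive session below
    return dict(counts)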

The probability of an $$n$$-gram $$\text{x}$$ (represented as $$P(\text{x})$$) is estimated by dividing $$O(\text{x})$$ by the total number of $$n$$-grams in the corpus. For the corpus words.txt, for example, this gives $$P(\text{a}) = 0.07567$$, $$P(\text{qu}) = 0.001793$$ and $$P(\text{que}) = 0.0005894$$.
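Continuing the sketch, the probability could be computed as follows (assuming counts is a dictionary as produced by ngrams above; $$n$$-grams absent from the corpus get probability 0.0, as in the interactive session below):

def probability(x, counts):
    # Estimate P(x) as O(x) divided by the total number of n-grams
    # in the corpus; unseen n-grams get probability 0.0.
    total = sum(counts.values())
    return counts.get(x.lower(), 0) / total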

The entropy of an $$n$$-gram $$\text{x}$$ (represented as $$E(\text{x})$$) is computed as \[ E(\text{x}) = - P(\text{x}) \log_2 P(\text{x}) \] By definition, $$E(\text{x}) = 0$$ if $$P(\text{x}) = 0$$, which is consistent with the limit \[ \lim_{p \rightarrow 0^+} p \log_2 p = 0 \]
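Translated into code under the same assumptions:

from math import log2

def entropy(x, counts):
    # E(x) = -P(x) * log2(P(x)), with E(x) = 0 by definition if P(x) = 0
    p = probability(x, counts)
    return -p * log2(p) if p > 0 else 0.0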

For a given $$n \in \mathbb{N}_0$$, the funniness of a letter sequence $$s_1s_2s_3\ldots s_m$$ (with $$m \geq n$$) is computed as the sum of the entropies of all $$m - n + 1$$ $$n$$-grams in the letter sequence: \[ E(s_1\ldots s_n) + E(s_2\ldots s_{n+1}) + E(s_3\ldots s_{n+2}) + \cdots + E(s_{m-n+1}\ldots s_m) \]
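A sketch of this summation; the optional parameter normalized is taken from the interactive session below, where its effect appears to be dividing the sum by the number of $$n$$-grams in the sequence, i.e. by $$m - n + 1$$:

def funniness(s, counts, normalized=False):
    # the n-gram length n can be read off from any key of the dictionary
    n = len(next(iter(counts)))
    # sum the entropies of all m - n + 1 n-grams in the letter sequence
    entropies = [entropy(s[i:i + n], counts) for i in range(len(s) - n + 1)]
    total = sum(entropies)
    return total / len(entropies) if normalized else total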

Your task:

- Write a function ngrams that takes an integer $$n \in \mathbb{N}_0$$ and the location of a text file containing a corpus of words. The function must return a dictionary that maps each $$n$$-gram $$\text{x}$$ occurring in the corpus onto its number of occurrences $$O(\text{x})$$.
- Write a function probability that takes an $$n$$-gram $$\text{x}$$ and a dictionary as returned by the function ngrams. The function must return the probability $$P(\text{x})$$; $$n$$-grams that do not occur in the corpus have probability 0.
- Write a function entropy that takes an $$n$$-gram $$\text{x}$$ and a dictionary as returned by the function ngrams. The function must return the entropy $$E(\text{x})$$.
- Write a function funniness that takes a letter sequence and a dictionary as returned by the function ngrams. The function must return the funniness of the letter sequence. The function also has an optional parameter normalized (default value False); if set to True, the sum of the entropies must be divided by the number of $$n$$-grams in the letter sequence.

The functions probability, entropy and funniness should not make a distinction between uppercase and lowercase letters.

Example

In the following interactive session we assume that the text file words.txt is located in the current directory.

>>> letters = ngrams(1, 'words.txt')
>>> letters['a']
118845
>>> letters['q']
2535
>>> letters['z']
7450
>>> bigrams = ngrams(2, 'words.txt')
>>> bigrams['ac']
6799
>>> bigrams['qu']
2506
>>> bigrams['zx']
Traceback (most recent call last):
KeyError: 'zx'
>>> trigrams = ngrams(3, 'words.txt')
>>> trigrams['abc']
2
>>> trigrams['que']
722
>>> trigrams['xcz']
Traceback (most recent call last):
KeyError: 'xcz'

>>> probability('a', letters)
0.07567142511492862
>>> probability('q', letters)
0.0016140945152622664
>>> probability('z', letters)
0.00474359137621455
>>> probability('ac', bigrams)
0.0048643609543276645
>>> probability('qu', bigrams)
0.001792923746366396
>>> probability('zx', bigrams)
0.0
>>> probability('abc', trigrams)
1.6327943479190852e-06
>>> probability('que', trigrams)
0.0005894387595987898
>>> probability('xcz', trigrams)
0.0

>>> entropy('a', letters)
0.28180852742167367
>>> entropy('Q', letters)
0.014970822223506234
>>> entropy('z', letters)
0.03661959827269034
>>> entropy('ac', bigrams)
0.03737548277295434
>>> entropy('QU', bigrams)
0.016357686287373385
>>> entropy('Zx', bigrams)
0.0
>>> entropy('abc', trigrams)
3.1389206700115134e-05
>>> entropy('QUE', trigrams)
0.006323717369962047
>>> entropy('XcZ', trigrams)
0.0

>>> funniness('howaymb', letters)
1.0677136135941063
>>> funniness('howaymb', letters, normalized=False)
1.0677136135941063
>>> funniness('howaymb', letters, normalized=True)
0.15253051622772948
>>> funniness('Subvick', letters)
1.2150001024776427
>>> funniness('QUARBAN', letters)
1.3808215071539807

>>> funniness('howaymb', bigrams)
0.09682092801393552
>>> funniness('howaymb', bigrams, normalized=False)
0.09682092801393552
>>> funniness('howaymb', bigrams, normalized=True)
0.01613682133565592
>>> funniness('Subvick', bigrams)
0.14689473722037735
>>> funniness('QUARBAN', bigrams)
0.19558791736702053

>>> funniness('howaymb', trigrams)
0.006989174063509218
>>> funniness('howaymb', trigrams, normalized=False)
0.006989174063509218
>>> funniness('howaymb', trigrams, normalized=True)
0.0013978348127018435
>>> funniness('Subvick', trigrams)
0.020151762537186854
>>> funniness('QUARBAN', trigrams)
0.020777229925614614
