In 1951 James Thurber¹'s friend Joseph Mitchell² challenged him to think of an English word that contains the four consecutive letters SGRA. Lying in bed that night, Thurber came up with these:

kissgranny: a man who seeks the company of older women, especially older women with money; a designing fellow, a fortune hunter
blessgravy: a minister or cleric; the head of a family; one who says grace
hossgrace: innate or native dignity, similar to that of the thoroughbred hoss
bussgranite: literally, a stonekisser; a man who persists in trying to win the favor or attention of cold, indifferent, or capricious women
tossgravel: a male human being who tosses gravel, usually at night, at the window of a female human being’s bedroom, usually that of a young virgin; hence, a lover, a male sweetheart, and an eloper

Unfortunately, none of these is in the dictionary. What word was Mitchell thinking of?

Show answer

Disgrace.

This may well be the most common word containing the consecutive letters SGRA, in most English dictionaries it is definitely not the only one. Consider for example the words palsgrave³ (a high noble title), grosgrain⁴ (a type of fabric characterized by its ribbed appearance), dysgraphia⁵ (a deficiency in the ability to write) or sgraffito⁶ (a technique either of wall decor or in pottery). However, these are all words of foreign origin.

Assignment

An $$n$$-gram is a string (str) with $$n$$ letters. Only the 26 letters of the alphabet (a–z) may appear in an $$n$$-gram, so spaces, punctuation marks or letters with diacritics⁷ are not allowed. Your task:

Write a function ngrams that takes two arguments: i) a string $$s$$ (str) and ii) a number $$n \in \mathbb{N}_0$$ (int). The function must return a list (list) containing all $$n$$-grams of consecutive letters that appear in $$s$$, listed in order of appearance.
Write a function dictionary that takes two arguments: i) the location of a text file (str) and ii) a number $$n \in \mathbb{N}_0$$ (int). The given text file must contain a list of words, with each word on a separate line. The function must return a dictionary (dict) that maps each $$n$$-gram appearing in at least one word from the word list onto the set (set) of words (str) in which the $$n$$-gram appears. No distinction should be made between uppercase and lowercase letters when determining if an $$n$$-gram appears in a word. The keys of the dictionary returned by the function use in uppercase letters.
Write a function ngram_count that takes a dictionary (dict) that is constructed as the dictionaries returned by the function dictionary. As a result, the given dictionary maps each $$n$$-gram (in uppercase letters) that appears in a word list $$\mathcal{W}$$ onto the set of words in which the $$n$$-gram appears, for a certain number $$n \in \mathbb{N}_0$$. The function ngram_count also has a second optional parameter that may take a number $$m \in \mathbb{N}_0$$ (int). If no value is explicitly passed to the optional parameter, the function must return the number of different $$n$$-grams that appear in the word list $$\mathcal{W}$$. Otherwise, the function must return the number of different $$n$$-grams that appear in exactly $$m$$ words from the word list $$\mathcal{W}$$. For example, for $$m = 1$$ the function must return the number of $$n$$-grams that appear in a single word, and for $$m = 2$$ it must return the number of $$n$$-grams that appear in exactly two words.
Write a function unique_ngram that takes the same dictionary (dict) as the function ngram_count. The function must return a tuple (tuple) with two strings (str), where the first element is a randomly selected $$n$$-gram that appears in a single word from the word list $$\mathcal{W}$$ and the second element is the word in which the selected $$n$$-gram appears.

The functions ngram_count and unique_ngram may never change the given dictionary. The text files passed to the function dictionary use the UTF-8⁸ character encoding. The built-in function open has a parameter encoding that can be used to specify the character encoding for the file:

        >>> open('file.txt', 'r', encoding='utf-8')

Example

In the following interactive session we assume the text files dictionary.en.txt¹¹ and dictionary.nl.txt¹² to be located in the current directory.

        >>> ngrams('Disgrace', 4)
['Disg', 'isgr', 'sgra', 'grac', 'race']
>>> ngrams('four-year-old', 3)
['fou', 'our', 'yea', 'ear', 'old']
>>> ngrams('façade', 2)
['fa', 'ad', 'de']
>>> ngrams('semi-self-sustaining', 4)
['semi', 'self', 'sust', 'usta', 'stai', 'tain', 'aini', 'inin', 'ning']

>>> english = dictionary('dictionary.en.txt13', 4)
>>> english['SGRA']
{'disgrading', 'disgrade', 'disgracers', 'misgrafts', 'palsgrave', 'hansgrave', 'disgracement', 'misgraded', 'misgrading', 'disgracer', 'disgraced', 'grosgrain', 'misgracious', 'disgracive', 'disgrace', 'predisgrace', 'palsgraf', 'disgracing', 'dysgraphia', 'misgrafting', 'disgraded', 'disgracefulness', 'misgraft', 'misgrave', 'misgraffed', 'misgrade', 'grosgrained', 'disgraceful', 'disgracia', 'disgracefully', 'disgracious', 'grosgrains', 'disgradation', 'disgraces', 'sgraffiti', 'crossgrainedness', 'misgraff', 'sgraffiato', 'undisgraced', 'disgradulate', 'misgrafted', 'sgraffito', 'palsgravine'}
>>> ngram_count(english)
63546
>>> ngram_count(english, 1)
13711
>>> ngram_count(english, 2)
7408
>>> ngram_count(english, 3)
4613
>>> unique_ngram(english)
('RLBU', 'pearlbush')
>>> unique_ngram(english)
('FIFR', 'kefifrel')
>>> unique_ngram(english)
('PPAT', 'wappato')

>>> english = dictionary('dictionary.en.txt14', 3)
>>> english['GNT']
{'sovereignties', 'pgntt', 'supersovereignty', 'pgnttrp', 'sovereignty', 'cosovereignty', 'semisovereignty'}
>>> ngram_count(english)
9025
>>> ngram_count(english, 1)
1125
>>> ngram_count(english, 2)
561
>>> ngram_count(english, 3)
332
>>> unique_ngram(english)
('HMP', 'rhythmproof')
>>> unique_ngram(english)
('IPZ', 'leipzig')
>>> unique_ngram(english)
('LZH', 'alzheimer')

>>> dutch = dictionary('dictionary.nl.txt15', 4)
>>> dutch['SGRA']
{'visgraatje', 'bezettingsgraad', 'hardheidsgraad', 'asgrauwe', 'jongensgrappen', 'verzuringsgraad', 'doctorsgraden', 'alfabetiseringsgraad', 'werkgelegenheidsgraad', 'schansgravers', 'orpheusgrasmussen', 'visgraatmotief', 'moeilijkheidsgraad', 'werkloosheidsgraad', 'stadsgrachten', 'zeemansgraf', 'tewerkstellingsgraad', 'basisgrammatica', 'varkensgras', 'werkloosheidsgraden', 'werkingsgraad', 'leesgrage', 'scholingsgraad', 'ontwikkelingsgraad', 'vullingsgraad', 'glansgraad', 'Bosgra', 'koersgrafieken', 'vervuilingsgraad', 'activiteitsgraad', 'traangasgranaten', 'visgraatpatronen', 'gifgasgranaten', 'paltsgraven', 'visgraatdessins', 'kostendekkingsgraden', 'bewustzijnsgraad', 'visgraat', 'beladingsgraad', 'oorlogsgraf', 'zelfvoorzieningsgraad', 'oorlogsgraven', 'hellingsgraad', 'paltsgraaf', 'liesgras', 'koningsgraf', 'dekkingsgraden', 'moeilijkheidsgraden', 'visgraatdiagram', 'gasgranaten', 'vrijheidsgraad', 'beschavingsgraad', 'vrijheidsgraden', 'gasgranaat', 'doctorsgraad', 'asgrauw', 'besmettingsgraad', 'dekkingsgraad', 'luchtvochtigheidsgraad', 'beursgraadmeters', 'automatiseringsgraad', 'bezettingsgraden', 'leesgraag', 'Palsgraaf', 'koersgrafiek', 'stadsgracht', 'bedekkingsgraad', 'opleidingsgraad', 'olifantsgras', 'vochtigheidsgraad', 'beschermingsgraad', 'molenaarsgraaf', 'benuttingsgraad', 'zeemansgraven', 'hardheidsgraden', 'paltsgravin', 'bewolkingsgraad', 'struisgras', 'beursgraadmeter', 'koningsgraven', 'kostendekkingsgraad', 'visgraten', 'uitbuitingsgraad', 'paltsgraafschap', 'visgraatjes', 'jongensgrap', 'traangasgranaat', 'werkzaamheidsgraad', 'zuiverheidsgraad'}
>>> ngram_count(dutch)
57210
>>> ngram_count(dutch, 1)
8224
>>> ngram_count(dutch, 2)
7360
>>> ngram_count(dutch, 3)
3856
>>> unique_ngram(dutch)
('LAMR', 'glamrock')
>>> unique_ngram(dutch)
('ACGL', 'cognacglazen')
>>> unique_ngram(dutch)
('ALOL', 'afvalolie')

Assignment

Epilogue

Example