In 1951 James Thurber1's friend Joseph Mitchell2 challenged him to think of an English word that contains the four consecutive letters SGRA. Lying in bed that night, Thurber came up with these:
kissgranny: a man who seeks the company of older women, especially older women with money; a designing fellow, a fortune hunter
blessgravy: a minister or cleric; the head of a family; one who says grace
hossgrace: innate or native dignity, similar to that of the thoroughbred hoss
bussgranite: literally, a stonekisser; a man who persists in trying to win the favor or attention of cold, indifferent, or capricious women
tossgravel: a male human being who tosses gravel, usually at night, at the window of a female human being’s bedroom, usually that of a young virgin; hence, a lover, a male sweetheart, and an eloper
Unfortunately, none of these is in the dictionary. What word was Mitchell thinking of?
Disgrace.
This may well be the most common word containing the consecutive letters SGRA, in most English dictionaries it is definitely not the only one. Consider for example the words palsgrave3 (a high noble title), grosgrain4 (a type of fabric characterized by its ribbed appearance), dysgraphia5 (a deficiency in the ability to write) or sgraffito6 (a technique either of wall decor or in pottery). However, these are all words of foreign origin.
An $$n$$-gram is a string (str) with $$n$$ letters. Only the 26 letters of the alphabet (a–z) may appear in an $$n$$-gram, so spaces, punctuation marks or letters with diacritics7 are not allowed. Your task:
Write a function ngrams that takes two arguments: i) a string $$s$$ (str) and ii) a number $$n \in \mathbb{N}_0$$ (int). The function must return a list (list) containing all $$n$$-grams of consecutive letters that appear in $$s$$, listed in order of appearance.
Write a function dictionary that takes two arguments: i) the location of a text file (str) and ii) a number $$n \in \mathbb{N}_0$$ (int). The given text file must contain a list of words, with each word on a separate line. The function must return a dictionary (dict) that maps each $$n$$-gram appearing in at least one word from the word list onto the set (set) of words (str) in which the $$n$$-gram appears. No distinction should be made between uppercase and lowercase letters when determining if an $$n$$-gram appears in a word. The keys of the dictionary returned by the function use in uppercase letters.
Write a function ngram_count that takes a dictionary (dict) that is constructed as the dictionaries returned by the function dictionary. As a result, the given dictionary maps each $$n$$-gram (in uppercase letters) that appears in a word list $$\mathcal{W}$$ onto the set of words in which the $$n$$-gram appears, for a certain number $$n \in \mathbb{N}_0$$. The function ngram_count also has a second optional parameter that may take a number $$m \in \mathbb{N}_0$$ (int). If no value is explicitly passed to the optional parameter, the function must return the number of different $$n$$-grams that appear in the word list $$\mathcal{W}$$. Otherwise, the function must return the number of different $$n$$-grams that appear in exactly $$m$$ words from the word list $$\mathcal{W}$$. For example, for $$m = 1$$ the function must return the number of $$n$$-grams that appear in a single word, and for $$m = 2$$ it must return the number of $$n$$-grams that appear in exactly two words.
Write a function unique_ngram that takes the same dictionary (dict) as the function ngram_count. The function must return a tuple (tuple) with two strings (str), where the first element is a randomly selected $$n$$-gram that appears in a single word from the word list $$\mathcal{W}$$ and the second element is the word in which the selected $$n$$-gram appears.
The functions ngram_count and unique_ngram may never change the given dictionary. The text files passed to the function dictionary use the UTF-88 character encoding. The built-in function open has a parameter encoding that can be used to specify the character encoding for the file:
>>> open('file.txt', 'r', encoding='utf-8')
From a 1921 essay by A.A. Milne9:
TERALBAY is not a word which one uses much in ordinary life. Rearrange the letters, however, and it becomes such a word. A friend — no, I can call him a friend no longer — a person gave me this collection of letters as I was going to bed and challenged me to make a proper word of it. He added that Lord Melbourne — this, he alleged, is a well-known historical fact — Lord Melbourne had given this word to Queen Victoria once, and it had kept her awake the whole night. After this, one could not be so disloyal as to solve it at once. For two hours or so, therefore, I merely toyed with it. Whenever I seemed to be getting warm I hurriedly thought of something else. This quixotic loyalty has been the undoing of me; my chances of a solution have slipped by, and I am beginning to fear that they will never return. While this is the case, the only word I can write about is TERALBAY.
The answer is not RATEABLY, or BAT-EARLY, which "ought to mean something, but it doesn't". Rudolf Flesch10 notes that TRAYABLE is not a word, and that, though TEARABLY appears in small type in Webster's Unabridged, "it obviously won't do".
What's the answer? There's no trick — it's an ordinary English word.
BETRAYAL
In the following interactive session we assume the text files dictionary.en.txt11 and dictionary.nl.txt12 to be located in the current directory.
>>> ngrams('Disgrace', 4)
['Disg', 'isgr', 'sgra', 'grac', 'race']
>>> ngrams('four-year-old', 3)
['fou', 'our', 'yea', 'ear', 'old']
>>> ngrams('façade', 2)
['fa', 'ad', 'de']
>>> ngrams('semi-self-sustaining', 4)
['semi', 'self', 'sust', 'usta', 'stai', 'tain', 'aini', 'inin', 'ning']
>>> english = dictionary('dictionary.en.txt13', 4)
>>> english['SGRA']
{'disgrading', 'disgrade', 'disgracers', 'misgrafts', 'palsgrave', 'hansgrave', 'disgracement', 'misgraded', 'misgrading', 'disgracer', 'disgraced', 'grosgrain', 'misgracious', 'disgracive', 'disgrace', 'predisgrace', 'palsgraf', 'disgracing', 'dysgraphia', 'misgrafting', 'disgraded', 'disgracefulness', 'misgraft', 'misgrave', 'misgraffed', 'misgrade', 'grosgrained', 'disgraceful', 'disgracia', 'disgracefully', 'disgracious', 'grosgrains', 'disgradation', 'disgraces', 'sgraffiti', 'crossgrainedness', 'misgraff', 'sgraffiato', 'undisgraced', 'disgradulate', 'misgrafted', 'sgraffito', 'palsgravine'}
>>> ngram_count(english)
63546
>>> ngram_count(english, 1)
13711
>>> ngram_count(english, 2)
7408
>>> ngram_count(english, 3)
4613
>>> unique_ngram(english)
('RLBU', 'pearlbush')
>>> unique_ngram(english)
('FIFR', 'kefifrel')
>>> unique_ngram(english)
('PPAT', 'wappato')
>>> english = dictionary('dictionary.en.txt14', 3)
>>> english['GNT']
{'sovereignties', 'pgntt', 'supersovereignty', 'pgnttrp', 'sovereignty', 'cosovereignty', 'semisovereignty'}
>>> ngram_count(english)
9025
>>> ngram_count(english, 1)
1125
>>> ngram_count(english, 2)
561
>>> ngram_count(english, 3)
332
>>> unique_ngram(english)
('HMP', 'rhythmproof')
>>> unique_ngram(english)
('IPZ', 'leipzig')
>>> unique_ngram(english)
('LZH', 'alzheimer')
>>> dutch = dictionary('dictionary.nl.txt15', 4)
>>> dutch['SGRA']
{'visgraatje', 'bezettingsgraad', 'hardheidsgraad', 'asgrauwe', 'jongensgrappen', 'verzuringsgraad', 'doctorsgraden', 'alfabetiseringsgraad', 'werkgelegenheidsgraad', 'schansgravers', 'orpheusgrasmussen', 'visgraatmotief', 'moeilijkheidsgraad', 'werkloosheidsgraad', 'stadsgrachten', 'zeemansgraf', 'tewerkstellingsgraad', 'basisgrammatica', 'varkensgras', 'werkloosheidsgraden', 'werkingsgraad', 'leesgrage', 'scholingsgraad', 'ontwikkelingsgraad', 'vullingsgraad', 'glansgraad', 'Bosgra', 'koersgrafieken', 'vervuilingsgraad', 'activiteitsgraad', 'traangasgranaten', 'visgraatpatronen', 'gifgasgranaten', 'paltsgraven', 'visgraatdessins', 'kostendekkingsgraden', 'bewustzijnsgraad', 'visgraat', 'beladingsgraad', 'oorlogsgraf', 'zelfvoorzieningsgraad', 'oorlogsgraven', 'hellingsgraad', 'paltsgraaf', 'liesgras', 'koningsgraf', 'dekkingsgraden', 'moeilijkheidsgraden', 'visgraatdiagram', 'gasgranaten', 'vrijheidsgraad', 'beschavingsgraad', 'vrijheidsgraden', 'gasgranaat', 'doctorsgraad', 'asgrauw', 'besmettingsgraad', 'dekkingsgraad', 'luchtvochtigheidsgraad', 'beursgraadmeters', 'automatiseringsgraad', 'bezettingsgraden', 'leesgraag', 'Palsgraaf', 'koersgrafiek', 'stadsgracht', 'bedekkingsgraad', 'opleidingsgraad', 'olifantsgras', 'vochtigheidsgraad', 'beschermingsgraad', 'molenaarsgraaf', 'benuttingsgraad', 'zeemansgraven', 'hardheidsgraden', 'paltsgravin', 'bewolkingsgraad', 'struisgras', 'beursgraadmeter', 'koningsgraven', 'kostendekkingsgraad', 'visgraten', 'uitbuitingsgraad', 'paltsgraafschap', 'visgraatjes', 'jongensgrap', 'traangasgranaat', 'werkzaamheidsgraad', 'zuiverheidsgraad'}
>>> ngram_count(dutch)
57210
>>> ngram_count(dutch, 1)
8224
>>> ngram_count(dutch, 2)
7360
>>> ngram_count(dutch, 3)
3856
>>> unique_ngram(dutch)
('LAMR', 'glamrock')
>>> unique_ngram(dutch)
('ACGL', 'cognacglazen')
>>> unique_ngram(dutch)
('ALOL', 'afvalolie')