The Bixby letter of 1864 was not written by President Abraham Lincoln, but by his secretary. A team of forensic linguists claims to have finally solved this longstanding mystery using computational text analysis. The letter is considered one of the most beautiful pieces of prose by a US president.
Boston, November 25, 1864. A woman named Lydia Parker Bixby receives a handwritten letter from Abraham Lincoln, president of the United States of America, which by then had been fighting a civil war against the Confederate States of America for three years. Mrs Bixby was thought to have lost five sons in this conflict. In an attempt to alleviate her suffering, Lincoln writes her some consoling words. On the same day, the letter also appears in a number of local newspapers, where it moves readers deeply. A legend is born.
However, the letter immediately became contentious as questions arose about its authorship. Although it was signed "A. Lincoln", many came to believe that it was in fact written by John Hay, Lincoln's personal secretary. This claim is almost impossible to prove, not least because the original letter was lost quite quickly. Other analysis techniques based on, among other things, style and word choice also fall short, mainly because the Bixby letter is rather concise and therefore offers too few reference points.
Using a completely new method, forensic linguists at the University of Manchester have been able to subject the letter to a thorough analysis. They conclude that Hay is the true author. Their technique is called $$n$$-gram tracing and allows a computer to discover sequences of linguistic forms, even in very short pieces of writing. First, they analyzed 500 texts by Lincoln and 500 texts by Hay. With those results, they then examined the Bixby letter. Specifically, they searched for word segments that do (or do not) correspond to the writing styles of Lincoln and Hay.
Their research sheds new light on the Bixby letter, which has been praised as one of the finest pieces of prose by Lincoln, or indeed by any American president. Passages from the letter have been reproduced on statues and monuments throughout the United States. Former president George W. Bush read the letter on September 11, 2011, at the 10th anniversary of the attacks in New York, and the letter also appears in the 1998 film Saving Private Ryan by Steven Spielberg.
Incidentally, the Bixby letter may be partially based on lies. Although Mrs Bixby stated that five of her sons had been killed in the civil war, two of them reportedly survived the conflict. According to some, she concealed this because she received a survivor's pension for one of them. Others allege that she in fact sympathized with the Confederate States.
To expose the true author of a text using the $$n$$-gram tracing method, the first step is to construct a profile of that text and of other texts that can be attributed with great certainty to candidate authors. We use the following text file as an example to explain how such a profile is constructed.
Knock-knock. Who's there? To. To who? No, to whom.
First, the text is cleansed by
replacing all uppercase letters with their corresponding lowercase variant
replacing all characters that are not letters with an underscore (_)
replacing each run of consecutive underscores with a single underscore
removing leading and trailing underscores
In cleansing the text, only the 26 letters of the alphabet (a–z) are considered letters, so whitespace characters, punctuation marks and letters with diacritics are not. This way, the sample text is cleansed into
knock_knock_who_s_there_to_to_who_no_to_whom
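The cleansing steps above could be sketched as follows. This is a minimal sketch rather than a reference solution. It assumes a regular expression is an acceptable way to combine the replacement steps, and it filters on ASCII letters before lowercasing so that unusual Unicode characters can never turn into a plain letter.

```python
import re


def cleanse(text):
    """Return the cleansed version of the given text."""
    # Replace every run of characters outside A-Z and a-z by one underscore.
    # This merges "replace non-letters" and "collapse consecutive underscores".
    collapsed = re.sub(r'[^A-Za-z]+', '_', text)
    # Only ASCII letters and underscores remain, so lowercasing is safe here.
    lowered = collapsed.lower()
    # Remove leading and trailing underscores.
    return lowered.strip('_')
```

Because underscores themselves fall outside the letter range, underscores already present in the input are collapsed in the same pass.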
Next, the cleansed text is split into $$n$$-grams, where an $$n$$-gram is nothing but a sequence of $$n \in \mathbb{N}_0$$ consecutive characters. For example, the sample text consists of the following 3-grams:
kno kno who the to_ who _to noc noc ho_ her o_t ho_ to_ ock ock o_s ere _to o_n o_w ck_ ck_ _s_ re_ to_ _no _wh k_k k_w s_t e_t o_w no_ who _kn _wh _th _to _wh o_t hom
This representation also makes it explicit that the $$n$$-grams partially overlap. The number of consecutive $$n$$-grams in the cleansed version of text $$t$$ is denoted as $$c_n^t$$. If $$t$$ is the sample text, then $$c_3^t = 42$$. The set of all $$n$$-grams in the cleansed version of text $$t$$ is denoted as $$\Omega_n^t$$.
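Splitting a cleansed text into overlapping $$n$$-grams amounts to sliding a window of width $$n$$ over the string one character at a time. The sketch below is one possible way to do this with slicing. The compact cleanse helper is repeated only to keep the example self-contained.

```python
import re


def cleanse(text):
    """Compact cleansing helper (same rules as described above)."""
    return re.sub(r'[^A-Za-z]+', '_', text).lower().strip('_')


def ngrams(text, n=1):
    """Return all consecutive n-grams of the cleansed text, in order."""
    cleansed = cleanse(text)
    # A cleansed text of length m has m - n + 1 windows of width n;
    # range() is empty when the text is shorter than n.
    return [cleansed[i:i + n] for i in range(len(cleansed) - n + 1)]
```

For the sample text the cleansed string has 44 characters, so there are 44 - 3 + 1 = 42 consecutive 3-grams, which matches $$c_3^t = 42$$.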
The profile of text $$t$$ is nothing but a frequency table of the $$n$$-grams in the cleansed version of $$t$$. The number of occurrences of $$n$$-gram $$\omega$$ in the cleansed version of text $$t$$ is denoted as $$p_n^t(\omega)$$. If $$t$$ is the sample text, then $$p_3^t(\texttt{the}) = 1$$, $$p_3^t(\texttt{_wh}) = 3$$ and $$p_3^t(\texttt{o_t}) = 2$$.
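A frequency table of this kind can be built by counting the $$n$$-grams. The sketch below uses collections.Counter and works on a string rather than a file. The name profile_from_text is a hypothetical helper introduced only for this illustration.

```python
import re
from collections import Counter


def cleanse(text):
    """Compact cleansing helper (same rules as described above)."""
    return re.sub(r'[^A-Za-z]+', '_', text).lower().strip('_')


def profile_from_text(text, n=1):
    """Hypothetical helper: map each n-gram of the text onto its count."""
    cleansed = cleanse(text)
    grams = (cleansed[i:i + n] for i in range(len(cleansed) - n + 1))
    # Counter tallies the occurrences; convert to a plain dict at the end.
    return dict(Counter(grams))
```

Summing the values of such a profile gives back the number of consecutive $$n$$-grams, which is how $$c_n^t$$ can be recovered from a profile alone.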
The final step of $$n$$-gram tracing is to use the profile of text $$t$$ whose authorship is unknown and the profile of text $$a$$ attributed to a known author (this may also be a collection of texts from that author) to determine the possible attribution of the author to the text using the formula \[ -\sum_{\omega \in \Omega_n^t}p_n^t(\omega) \ln\left(\frac{1 + p_n^a(\omega)}{c_n^a}\right) \] where $$\ln(x)$$ is the natural logarithm of $$x$$. The greater the attribution, the greater the chance that text $$t$$ was written by the author of text $$a$$.
Your task:
Write a function cleanse that takes a text (str) and returns the cleansed version (str) of the given text.
Write a function ngrams that takes a text (str). The function also has an optional parameter n (default value: 1) that may take a number $$n \in \mathbb{N}_0$$ (int). The function must return a list (list) containing all consecutive $$n$$-grams in the cleansed version of the given text, listed in their order of appearance in the cleansed text.
Write a function profile that takes the location (str) of a text file. The function also has an optional parameter n (default value: 1) that may take a number $$n \in \mathbb{N}_0$$ (int). The function must return a dictionary (dict) that maps each $$n$$-gram $$\omega \in \Omega_n^t$$ onto $$p_n^t(\omega)$$, where $$t$$ is the text in the given file.
Write a function ngram_count that takes the profile (dict) of a text $$t$$. The function must return the value $$c_n^t$$.
Write a function attribution that takes the profile of text $$t$$ whose authorship is unknown and the profile of text $$a$$ attributed to a known author. The function must return the possible attribution of text $$t$$ to the author who wrote text $$a$$.
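The attribution formula given earlier translates fairly directly into code. The following is a minimal sketch, assuming that both profiles are plain dictionaries mapping $$n$$-grams onto counts and that the profile of the known author is not empty (otherwise $$c_n^a$$ would be zero). The parameter names unknown and known are chosen here for illustration.

```python
from math import log


def ngram_count(profile):
    """Return the total number of n-grams recorded in a profile."""
    return sum(profile.values())


def attribution(unknown, known):
    """Evaluate the attribution formula for two profiles."""
    total = ngram_count(known)  # c_n^a
    # dict.get() supplies 0 for n-grams absent from the known profile,
    # so neither dictionary is modified while evaluating the sum.
    return -sum(
        count * log((1 + known.get(gram, 0)) / total)
        for gram, count in unknown.items()
    )
```

As a small worked example with made-up profiles, take unknown = {'ab': 2, 'bc': 1} and known = {'ab': 3, 'cd': 1}. Then $$c_n^a = 4$$ and the sum is $$-(2\ln(4/4) + 1\ln(1/4)) = \ln 4$$.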
The profiles passed to the functions ngram_count and attribution must be dictionaries (dict) as returned by the function profile and may not be modified by the functions. The text files passed to the function profile use UTF-8 character encoding. The built-in function open has a parameter encoding that can be used to specify the character encoding for the file:
>>> open('file.txt', 'r', encoding='utf-8')
In the following interactive session we assume the text files knock.txt, bixby.txt, lincoln.txt, hay.txt and obama.txt to be located in the current directory.
>>> cleanse("What's wrong? NOTHING's wrong!")
'what_s_wrong_nothing_s_wrong'
>>> cleanse("Knock-knock. Who's there? To. To who? No, to whom.")
'knock_knock_who_s_there_to_to_who_no_to_whom'
>>> cleanse('The past, the present, and the future walked into a bar. It was tense.')
'the_past_the_present_and_the_future_walked_into_a_bar_it_was_tense'
>>> ngrams("What's wrong? NOTHING's wrong!")
['w', 'h', 'a', 't', '_', 's', '_', 'w', 'r', 'o', 'n', 'g', '_', 'n', 'o', 't', 'h', 'i', 'n', 'g', '_', 's', '_', 'w', 'r', 'o', 'n', 'g']
>>> ngrams("Knock-knock. Who's there? To. To who? No, to whom.", 3)
['kno', 'noc', 'ock', 'ck_', 'k_k', '_kn', 'kno', 'noc', 'ock', 'ck_', 'k_w', '_wh', 'who', 'ho_', 'o_s', '_s_', 's_t', '_th', 'the', 'her', 'ere', 're_', 'e_t', '_to', 'to_', 'o_t', '_to', 'to_', 'o_w', '_wh', 'who', 'ho_', 'o_n', '_no', 'no_', 'o_t', '_to', 'to_', 'o_w', '_wh', 'who', 'hom']
>>> ngrams('The past, the present, and the future walked into a bar. It was tense.', n=2)
['th', 'he', 'e_', '_p', 'pa', 'as', 'st', 't_', '_t', 'th', 'he', 'e_', '_p', 'pr', 're', 'es', 'se', 'en', 'nt', 't_', '_a', 'an', 'nd', 'd_', '_t', 'th', 'he', 'e_', '_f', 'fu', 'ut', 'tu', 'ur', 're', 'e_', '_w', 'wa', 'al', 'lk', 'ke', 'ed', 'd_', '_i', 'in', 'nt', 'to', 'o_', '_a', 'a_', '_b', 'ba', 'ar', 'r_', '_i', 'it', 't_', '_w', 'wa', 'as', 's_', '_t', 'te', 'en', 'ns', 'se']
>>> knock = profile('knock.txt', n=3)
>>> len(knock)
27
>>> knock['the']
1
>>> knock['_wh']
3
>>> knock['o_t']
2
>>> ngram_count(knock)
42
>>> bixby = profile('bixby.txt', n=3)
>>> len(bixby)
456
>>> bixby['the']
17
>>> bixby['in_']
3
>>> bixby['f_t']
4
>>> ngram_count(bixby)
743
>>> lincoln = profile('lincoln.txt', n=3)
>>> attribution(bixby, lincoln)
5209.447183892647
>>> hay = profile('hay.txt', n=3)
>>> attribution(bixby, hay)
5216.17091674669
>>> obama = profile('obama.txt', n=3)
>>> attribution(bixby, obama)
5079.372864440405
Two journalists from The Sunday Times collaborated with professor Patrick Juola from Duquesne University (USA) to prove that J.K. Rowling is the true author of The Cuckoo's Calling, the book she wrote in 2013 under the pseudonym Robert Galbraith. This was done using a variant of the $$n$$-gram tracing technique.
The story has been widely reported in the international press, including an elaborate article in the New York Times. Time magazine explains how the discovery was made:
As one part of his work, Juola uses a program to pull out the hundred most frequent words across an author's vocabulary. This step eliminates rare words, character names and plot points, leaving him with words like "of" and "but", ranked by usage. Those words might seem inconsequential, but they leave an authorial fingerprint on any work. "Prepositions and articles and similar little function words are actually very individual," Juola says. "It's actually very, very hard to change them because they're so subconscious."
Brooks R, Flyn C (2013). JK Rowling, the cuckoo in crime novel nest. The Sunday Times.
Grieve J, Clarke I, Chiang E, Gideon H, Heini A, Nini A, Waibel E (2018). Attributing the Bixby Letter using n-gram tracing. Digital Scholarship in the Humanities.
Juola P (2008). Authorship attribution. Foundations and Trends in Information Retrieval 1(3), 233–334.