The Bixby letter [1] of 1864 was not written by President Abraham Lincoln [2], but by his secretary. A team of forensic linguists claims to have finally solved this long-standing mystery using computational text analysis. The letter is considered one of the most beautiful pieces of prose by a US president.

Abraham Lincoln (left) and a later copy of the Bixby Letter (right).

Boston, November 25, 1864. A woman named Lydia Parker Bixby receives a handwritten letter from Abraham Lincoln, president of the United States of America, which by then has been at war with the Confederate States of America for three years. Mrs Bixby is thought to have lost five sons in this conflict, and in an attempt to alleviate her suffering the president writes her some words of consolation. On the same day, the letter also appears in a number of local newspapers, where it grips readers in great numbers. A legend is born.

However, the letter also immediately became contentious as questions arose about its authorship. Although it was signed "A. Lincoln", many came to believe that it was in fact written by John Hay [3], Lincoln's personal secretary. This claim is almost impossible to prove, not least because the original letter was lost soon afterwards. Other analysis techniques based on, among other things, style and word choice also fall short, mainly because the Bixby letter is rather short and therefore offers too few reference points.

Using a completely new method, forensic linguists at the University of Manchester have been able to subject the letter to a thorough analysis. They conclude that Hay is the true author. Their technique is called $$n$$-gram tracing and allows a computer to detect sequences of linguistic forms, even in very short pieces of writing. First they analyzed 500 texts by Lincoln and 500 texts by Hay. With those results they then examined the Bixby letter. Specifically, they searched for word segments that do (or do not) correspond to the writing style of Lincoln and Hay.

Their research sheds new light on the Bixby letter, which has been praised as one of the finest pieces of prose by Lincoln, or indeed by any American president. Passages from the letter have been reproduced on statues and monuments throughout the United States. Former president George W. Bush [4] read the letter on September 11, 2011, at the 10th anniversary of the attacks in New York, and the letter also appears in the 1998 film Saving Private Ryan [5] by Steven Spielberg.

Incidentally, the Bixby letter may be partially based on lies. Although Mrs Bixby stated that five of her sons had been killed in the civil war, in reality two of them appear to have survived the conflict. According to some, she concealed this because she received a survivor's pension for one of them. Others claim that she in fact sympathized with the Confederate States.

Assignment

To expose the true author of a text using the $$n$$-gram tracing method, the first step is to construct a profile of that text and of some other texts that can be attributed with great certainty to candidate authors. We use the following text file as an example to explain how such a profile is constructed.

Knock-knock.
Who's there?
To.
To who?
No, to whom.

First, the text is cleansed by

  1. replacing each uppercase letter with its corresponding lowercase variant

  2. replacing each character that is not a letter with an underscore (_)

  3. replacing each run of consecutive underscores with a single underscore

  4. removing leading and trailing underscores

In cleansing the text, only the 26 letters of the alphabet (a-z) are considered letters; whitespace characters [6], punctuation marks and letters with diacritics [7] are not. This way, the sample text is cleansed into

knock_knock_who_s_there_to_to_who_no_to_whom
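
The four cleansing steps can be sketched in a few lines of Python. This is only one possible implementation, not a reference solution: the function name cleanse is taken from the examples in this assignment, whereas the helper name _TO_LOWER and the decision to lowercase only the ASCII letters A-Z (so that every other character falls under step 2) are assumptions.

```python
import re
import string

# Assumption: only the ASCII letters A-Z are mapped to lowercase;
# any other character is treated as a non-letter in step 2.
_TO_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def cleanse(text):
    """Return the cleansed version of the given text."""
    # step 1: replace uppercase letters with lowercase letters
    lowered = text.translate(_TO_LOWER)
    # steps 2 and 3: replace each run of non-letters with one underscore
    collapsed = re.sub(r'[^a-z]+', '_', lowered)
    # step 4: remove leading and trailing underscores
    return collapsed.strip('_')


print(cleanse("Knock-knock. Who's there? To. To who? No, to whom."))
# prints: knock_knock_who_s_there_to_to_who_no_to_whom
```

Matching whole runs of non-letters with a single regular expression merges steps 2 and 3, which avoids a separate pass for collapsing underscores.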

Next, the cleansed text is split into $$n$$-grams, where an $$n$$-gram is nothing but a sequence of $$n \in \mathbb{N}_0$$ consecutive characters. For example, the sample text consists of the following 3-grams:

kno   kno   who   the   to_   who   _to
 noc   noc   ho_   her   o_t   ho_   to_
  ock   ock   o_s   ere   _to   o_n   o_w
   ck_   ck_   _s_   re_   to_   _no   _wh
    k_k   k_w   s_t   e_t   o_w   no_   who
     _kn   _wh   _th   _to   _wh   o_t   hom

This representation also makes it explicit that the $$n$$-grams partially overlap. The number of consecutive $$n$$-grams in the cleansed version of text $$t$$ is denoted as $$c_n^t$$. If $$t$$ is the sample text, then $$c_3^t = 42$$. The set of all $$n$$-grams in the cleansed version of text $$t$$ is denoted as $$\Omega_n^t$$.
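
Splitting a cleansed text into overlapping $$n$$-grams amounts to sliding a window of width $$n$$ over the string. The sketch below is a possible implementation under a few assumptions: the signature ngrams(text, n=1) is inferred from the examples in this assignment, and the compact cleanse helper lowercases only the ASCII letters A-Z.

```python
import re
import string

_TO_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def cleanse(text):
    """Cleanse a text (assumed helper, ASCII-only lowercasing)."""
    return re.sub(r'[^a-z]+', '_', text.translate(_TO_LOWER)).strip('_')


def ngrams(text, n=1):
    """Return the list of consecutive n-grams in the cleansed text."""
    cleansed = cleanse(text)
    # a cleansed text of length m contains m - n + 1 overlapping n-grams
    # (and none at all if the cleansed text is shorter than n)
    return [cleansed[i:i + n] for i in range(len(cleansed) - n + 1)]


trigrams = ngrams("Knock-knock. Who's there? To. To who? No, to whom.", 3)
print(len(trigrams))       # prints: 42
print(len(set(trigrams)))  # prints: 27
```

The length of the list corresponds to $$c_n^t$$ and the set of its elements to $$\Omega_n^t$$.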

The profile of text $$t$$ is nothing but a frequency table of the $$n$$-grams in the cleansed version of $$t$$. The number of occurrences of $$n$$-gram $$\omega$$ in the cleansed version of text $$t$$ is denoted as $$p_n^t(\omega)$$. If $$t$$ is the sample text, then $$p_3^t(\texttt{the}) = 1$$, $$p_3^t(\texttt{_wh}) = 3$$ and $$p_3^t(\texttt{o_t}) = 2$$.
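
A profile can be built by counting the $$n$$-grams with collections.Counter. The sketch below makes some assumptions: the signatures profile(filename, n=1) and ngram_count(prof) are inferred from the examples in this assignment, the whole file is read and cleansed as one string (so that line breaks become underscores), and the cleanse helper lowercases only the ASCII letters A-Z.

```python
import re
import string
from collections import Counter

_TO_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def cleanse(text):
    """Cleanse a text (assumed helper, ASCII-only lowercasing)."""
    return re.sub(r'[^a-z]+', '_', text.translate(_TO_LOWER)).strip('_')


def profile(filename, n=1):
    """Return a dict mapping each n-gram in the file to its frequency."""
    with open(filename, 'r', encoding='utf-8') as handle:
        # assumption: the file is cleansed as a whole, not line by line
        cleansed = cleanse(handle.read())
    counts = Counter(cleansed[i:i + n] for i in range(len(cleansed) - n + 1))
    return dict(counts)


def ngram_count(prof):
    """Return the total number of n-grams represented by a profile."""
    # the frequencies add up to the number of consecutive n-grams
    return sum(prof.values())
```

Because ngram_count only sums the values, it leaves the profile untouched.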

The final step of $$n$$-gram tracing is to use the profile of a text $$t$$ whose authorship is unknown and the profile of a text $$a$$ attributed to a known author (this may also be a collection of texts from that author) to compute the attribution of text $$t$$ to that author using the formula \[ -\sum_{\omega \in \Omega_n^t}p_n^t(\omega) \ln\left(\frac{1 + p_n^a(\omega)}{c_n^a}\right) \] where $$\ln(x)$$ is the natural logarithm of $$x$$. The greater the attribution, the greater the chance that text $$t$$ was written by the author of text $$a$$. Your task:

The profiles passed to the functions ngram_count and attribution must be dictionaries (dict) as returned by the function profile and must not be modified by these functions. The text files passed to the function profile use UTF-8 [8] character encoding. The built-in function open has a parameter encoding that can be used to specify the character encoding for the file:

>>> open('file.txt', 'r', encoding='utf-8')
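
The attribution formula given earlier translates almost literally into Python. The following is a sketch under stated assumptions, not a reference solution: the signature attribution(text_profile, author_profile) is inferred from the examples in this assignment, and the two toy profiles are made up purely for illustration.

```python
import math


def ngram_count(prof):
    """Return the total number of n-grams represented by a profile."""
    return sum(prof.values())


def attribution(text_profile, author_profile):
    """Return the attribution of a text profile to an author profile."""
    # c_n^a: total number of n-grams in the author's text
    # (assumed to be non-zero, an empty author profile is not handled)
    total = ngram_count(author_profile)
    # dict.get reads a frequency without adding missing n-grams,
    # so neither profile is modified
    return -sum(
        count * math.log((1 + author_profile.get(ngram, 0)) / total)
        for ngram, count in text_profile.items()
    )


# hypothetical toy profiles (2-grams), for illustration only
text_profile = {'ab': 2, 'bc': 1}
author_profile = {'ab': 3, 'cd': 1}
print(attribution(text_profile, author_profile))
```

In this toy case the total is 4, so the term for ab is 2 ln(4/4) = 0 and the term for bc is ln(1/4), which gives an attribution of ln 4, roughly 1.386.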

Example

In the following interactive session we assume the text files knock.txt [9], bixby.txt [10], lincoln.txt [11], hay.txt [12] and obama.txt [13] to be located in the current directory.

>>> cleanse("What's wrong? NOTHING's wrong!")
'what_s_wrong_nothing_s_wrong'
>>> cleanse("Knock-knock. Who's there? To. To who? No, to whom.")
'knock_knock_who_s_there_to_to_who_no_to_whom'
>>> cleanse('The past, the present, and the future walked into a bar. It was tense.')
'the_past_the_present_and_the_future_walked_into_a_bar_it_was_tense'

>>> ngrams("What's wrong? NOTHING's wrong!")
['w', 'h', 'a', 't', '_', 's', '_', 'w', 'r', 'o', 'n', 'g', '_', 'n', 'o', 't', 'h', 'i', 'n', 'g', '_', 's', '_', 'w', 'r', 'o', 'n', 'g']
>>> ngrams("Knock-knock. Who's there? To. To who? No, to whom.", 3)
['kno', 'noc', 'ock', 'ck_', 'k_k', '_kn', 'kno', 'noc', 'ock', 'ck_', 'k_w', '_wh', 'who', 'ho_', 'o_s', '_s_', 's_t', '_th', 'the', 'her', 'ere', 're_', 'e_t', '_to', 'to_', 'o_t', '_to', 'to_', 'o_w', '_wh', 'who', 'ho_', 'o_n', '_no', 'no_', 'o_t', '_to', 'to_', 'o_w', '_wh', 'who', 'hom']
>>> ngrams('The past, the present, and the future walked into a bar. It was tense.', n=2)
['th', 'he', 'e_', '_p', 'pa', 'as', 'st', 't_', '_t', 'th', 'he', 'e_', '_p', 'pr', 're', 'es', 'se', 'en', 'nt', 't_', '_a', 'an', 'nd', 'd_', '_t', 'th', 'he', 'e_', '_f', 'fu', 'ut', 'tu', 'ur', 're', 'e_', '_w', 'wa', 'al', 'lk', 'ke', 'ed', 'd_', '_i', 'in', 'nt', 'to', 'o_', '_a', 'a_', '_b', 'ba', 'ar', 'r_', '_i', 'it', 't_', '_w', 'wa', 'as', 's_', '_t', 'te', 'en', 'ns', 'se']

>>> knock = profile('knock.txt', n=3)
>>> len(knock)
27
>>> knock['the']
1
>>> knock['_wh']
3
>>> knock['o_t']
2
>>> ngram_count(knock)
42

>>> bixby = profile('bixby.txt', n=3)
>>> len(bixby)
456
>>> bixby['the']
17
>>> bixby['in_']
3
>>> bixby['f_t']
4
>>> ngram_count(bixby)
743

>>> lincoln = profile('lincoln.txt', n=3)
>>> attribution(bixby, lincoln)
5209.447183892647

>>> hay = profile('hay.txt', n=3)
>>> attribution(bixby, hay)
5216.17091674669

>>> obama = profile('obama.txt', n=3)
>>> attribution(bixby, obama)
5079.372864440405

Epilogue

Two journalists of The Sunday Times [19] collaborated with professor Patrick Juola [20] of Duquesne University (USA) to prove that J.K. Rowling [21] is the true author of The Cuckoo's Calling [22], the book she published in 2013 under the pseudonym Robert Galbraith. This was done using a variant of the $$n$$-gram tracing technique.

Cover illustration for J.K. Rowling's novel The Cuckoo's Calling (published under the pseudonym Robert Galbraith), which was lavishly praised by critics.

The story has been widely reported in the international press, including an elaborate article in the New York Times [23]. Time [24] magazine explains how the discovery was made:

As one part of his work, Juola uses a program to pull out the hundred most frequent words across an author's vocabulary. This step eliminates rare words, character names and plot points, leaving him with words like "of" and "but", ranked by usage. Those words might seem inconsequential, but they leave an authorial fingerprint on any work. "Prepositions and articles and similar little function words are actually very individual," Juola says. "It's actually very, very hard to change them because they're so subconscious."

Resources