In 1996, Alan Sokal, a professor of physics at New York University, conducted the following somewhat unethical experiment. He sent a hoax article, peppered with nonsensical reasoning and pseudoscientific jargon, to the American journal Social Text, which is published by Duke University Press. With this experiment, Sokal wanted to find out whether a well-crafted but completely nonsensical article would be published in a postmodern journal if it sounded good and appealed to the editors with ideological but hollow concepts. The article was indeed published, and caused a lot of outrage in the international academic world.

The Sokal affair raises the question whether a computer can generate random, nonsensical language that seems legitimate at first glance. Natural language is highly structured and not randomly built at all: not every set of letters represents a word, and not every set of words constitutes a meaningful and grammatically correct sentence. Still, randomly generated phrases can sound familiar if they are sufficiently structured. In addition, each author has their own patterns in choosing the order of words. By preserving the order of adjacent words in a text, but constructing the rest of the text at random, it is possible to generate a surprisingly convincing text that largely preserves the style (but not the meaning) of the original text.

Assignment

  1. Write a function wordlist that reads a list of words from a text file. The location of the text file must be passed to the function as an argument. The function must return a list of the words in the order in which they appear in the text file. A word is the longest possible sequence of characters that contains no whitespace characters (spaces, tabs, carriage returns, and newlines). The words in the text file are separated from each other by one or more consecutive whitespace characters. Punctuation marks that stick to a word are part of that word.

  2. Write a function sequelwords with two parameters: a parameter words to which a list of words must be passed, and a parameter k to which an integer ($$k \geq 1$$) must be passed. The function must return a dictionary whose keys are all tuples of $$k$$ words that occur consecutively in the given word list and that are followed by at least one more word. The dictionary maps each of these $$k$$-tuples onto the list of words that immediately follow an occurrence of these $$k$$ consecutive words in the given word list, in the order in which these occurrences appear in the word list. If, for example, the dictionary contains a key ('aa', 'bb', 'cc') with the matching value ['dd', 'ee', 'dd', 'ff'], this means that the sequence of words aa, bb, cc occurs four times in the given word list, and that these occurrences are followed by the words dd, ee, dd and ff, respectively. Note that the same word can follow the same $$k$$-tuple multiple times, in which case it occurs multiple times in the list onto which the dictionary maps this $$k$$-tuple.

  3. Write a function nonsense that returns a randomly generated text. This function has the following three parameters: a start parameter to which a tuple of words must be passed, a sequel parameter to which a dictionary must be passed which is constructed like the dictionaries that are returned by the sequelwords function, and a parameter minimumlength to which an integer must be passed. The random text is to be generated by the function in the following way:

    1. The words of the tuple start form the first words of the text, and are then separated from each other by a single space.

    2. Use the dictionary sequel to determine the next word of the text, by randomly choosing a word from the list onto which the dictionary maps the tuple start. This new word is added at the end of the text, preceded by a single space.

    3. Calculate a new value for the tuple start by dropping the first word from the old value of the tuple start, and appending the new word that was chosen in step 2 to the back of the tuple.

    4. Keep repeating this procedure from step 2 until

      • tuple start is not a key of the dictionary sequel, or

      • the generated text consists of at least minimumlength words and the last word ends with a period (.), a question mark (?) or an exclamation mark (!), or

      • the generated text consists of at least twice as many words as indicated by the minimumlength parameter.
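
The three functions described above could be sketched as follows. This is only one possible approach, not a reference solution. It rests on a few assumptions that the assignment does not state: the text file is read as UTF-8, the next word is picked uniformly with random.choice, and the stop conditions of step 4 are checked each time a word has been appended.

```python
import random


def wordlist(location):
    """Return the words of the text file at `location`, in order of appearance."""
    # assumption: the file is UTF-8 encoded
    with open(location, encoding='utf-8') as handle:
        # split() without arguments splits on runs of whitespace, so
        # punctuation marks stay attached to the word they stick to
        return handle.read().split()


def sequelwords(words, k):
    """Map each k-tuple of consecutive words onto the list of words following it."""
    sequel = {}
    # the final k words are not followed by another word, so they form no key
    for index in range(len(words) - k):
        key = tuple(words[index:index + k])
        sequel.setdefault(key, []).append(words[index + k])
    return sequel


def nonsense(start, sequel, minimumlength):
    """Generate a random text whose first words are those of the tuple `start`."""
    start = tuple(start)
    text = list(start)                        # step 1
    while start in sequel:                    # first stop condition
        word = random.choice(sequel[start])   # step 2
        text.append(word)
        start = start[1:] + (word,)           # step 3
        # second stop condition: long enough and ending a sentence
        if len(text) >= minimumlength and word.endswith(('.', '?', '!')):
            break
        # third stop condition: twice the minimum length has been reached
        if len(text) >= 2 * minimumlength:
            break
    return ' '.join(text)
```

Because a word that follows a tuple several times is stored several times in the list, random.choice automatically picks it with a proportionally higher probability.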

Example

In the following example we assume that the file shelovesyou.txt, which contains the lyrics of the song She Loves You by the Beatles, is in the current folder. Note that the output of the example was partially omitted to save space, and that the output of the function nonsense was written over multiple lines so as not to make the text unnecessarily wide.

>>> words = wordlist('shelovesyou.txt')
>>> words
['She', 'loves', 'you,', 'yeh,', 'yeh,', ..., 'yeh;', 'yeh,', 'yeh,', 'yeeeh!']

>>> k = 3
>>> start = tuple(words[:k])
>>> start
('She', 'loves', 'you,')
>>> continuation = sequelwords(words, k)
>>> continuation
{
  ('yeh,', 'yeh.', 'She'): ['loves', 'loves'], 
  ('should', 'be', 'glad.'): ['Ooh!', 'Ooh!', 'And', 'Ooh!', 'And', 'And'], 
  ('you', 'shouuuld', 'be'): ['glad.'], 
  ("It's", 'you', "she's"): ['thinking'], 
  ..., 
  ('like', 'that,', 'you'): ['know', 'know', 'know', 'know']
}

>>> nonsense(start, continuation, 25)
She loves you, yeh, yeh, yeh, yeeeh! You think you lost your
love, when I saw her yesterday. It's you she's thinking of,
and she told me what to say.

>>> nonsense(start, continuation, 25)
She loves you, yeh, yeh, yeh. She loves you, yeh, yeh, yeh.
She loves you, yeh, yeh, yeh. She loves you, yeh, yeh, yeh!
She loves you, yeh, yeh, yeh.

>>> nonsense(start, continuation, 25)
She loves you, yeh, yeh, yeh. She loves you, yeh, yeh, yeh!
And with a love like that, you know you should be glad. And
now it's up to you, I think it's only fair, if I should hurt
you too, apologize to her, because she loves you, and you

>>> nonsense(start, continuation, 25)
She loves you, yeh, yeh, yeh! And with a love like that, you
know you shouuuld be glad. Yeh, yeh, yeh; yeh, yeh, yeh;
yeh, yeh, yeh; yeh, yeh, yeh; yeh, yeh, yeh; yeh, yeh, yeh;
yeh, yeh, yeh; yeh, yeh, yeh; yeh, yeh, yeh; yeh, yeh, yeh;
yeh, yeh,

The following text files can be used to further test your solution. The first of these is also used by us to test your solution automatically.

Scientific Background

This assignment is mainly a playful way to familiarize you with reading information from files and working with dictionaries in Python. Still, it is interesting to note that similar concepts are used in major scientific problems that use Markov chains to model probabilistic processes. In the study of genome sequences, for example, there is a lot of evidence that DNA sequences contain higher-order structures (Karlin et al., 1998). This means that it is not sufficient to describe the frequencies of the nucleotides A, C, G and T in a genome; the correlations between longer chains of nucleotides must be described as well. For example, the use of codons (triplets of nucleotides) to encode the amino acids from which proteins are constructed already indicates that there are important three-letter correlations to be found in a genomic sequence. As for natural language, this assignment will hopefully have made it clear that three-word correlations can help to capture the characteristic properties of a language.
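
To illustrate the analogy, the same bookkeeping can be applied to nucleotides instead of words. The sketch below is only an illustration: the function name successor_counts and the toy sequence are made up for this example and are not taken from real genomic data or from Karlin et al.

```python
from collections import Counter, defaultdict


def successor_counts(sequence, k):
    """Count, for every k-mer in `sequence`, which nucleotides follow it."""
    counts = defaultdict(Counter)
    # the final k-mer has no successor, so stop k positions before the end
    for index in range(len(sequence) - k):
        kmer = sequence[index:index + k]
        counts[kmer][sequence[index + k]] += 1
    return counts


# toy sequence, invented for illustration only
toy = 'ATGATGATCATG'
counts = successor_counts(toy, 2)
```

In this toy sequence the 2-mer AT is followed by G three times and by C once, so a random generator built on these counts would mostly, but not always, continue AT with G.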

Karlin S, Campbell AM, Mrázek J (1998). Comparative DNA analysis across diverse genomes. Annu Rev Genet 32, 185-225.