We've all been irritated by jargon — those collections of polysyllabic technical terms used in place of simple English, or worse still, everyday English words redefined into technical meanings. Below are two of my favorite, if ancient, definitions of geological terms, both of which may be just slightly tongue-in-cheek.

"Crocydite, belonging to the group of vaguely bordered migmatites (dictyonite, nebulite, stictolite), may be genetically defined by the new terminology as an endomerismite with magmatic neosome in a palaeosome which is a stereogenic cyriosome." (de Waard D, 1950)

"A cactolith is a quasihorizontal chonolith composed of anastomosing ductoliths whose distal ends curl like a harpolith, thin like a sphenolith, or bulge discordantly like an akmolith or ethmolith." (Hunt CB, 1953)

The latter term and its definition were coined by Charles B. Hunt, a researcher at the United States Geological Survey (USGS). He was describing an actual geological feature, a laccolith that he saw as resembling a cactus, but he was also poking fun at what he saw as an absurd number of "-lith" words in geology. Word Ways: The Journal of Recreational Linguistics chose cactolith as its word of the year for 2010.

In linguistics, the Gunning-Fog index is used to measure the readability of English writing. The index estimates the years of formal education needed to understand a given text on first reading. For example, a fog index of 12 requires the reading level of a U.S. high school senior (around 18 years old). The fog index is commonly used to confirm that a text can be read easily by the intended audience. Texts for a wide audience generally need a fog index below 12. Texts requiring near-universal understanding generally need an index below 8. Philip Chalmers of Benefit from IT provided the following typical fog index scores to help ascertain the readability of documents.

fog index    examples
6            TV guides, the Bible, Mark Twain
8            Reader's Digest
8-10         most popular novels
10           Time, Newsweek
11           Wall Street Journal
14           The Times, The Guardian
15-20        scientific papers
$$\geq$$ 20  only government sites can get away with this, because you can't ignore them
$$\geq$$ 30  the government is covering something up

The Gunning-Fog index is calculated using the following algorithm:

  1. Select a passage (such as one or more full paragraphs) of around 100 words. Do not omit any sentences.

  2. Determine the average sentence length by dividing the number of words by the number of sentences.

  3. Count the complex words: these are the words with three or more syllables.

  4. Add the average sentence length and the percentage of complex words.

  5. Multiply the result from the previous step by 0.4.
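
The steps above can be sketched in Python once the three counts are known. This is a minimal illustration, not the reference solution: the name `fog_index` and its parameters are hypothetical, and the function assumes that sentences, words and complex words have already been counted.

```python
def fog_index(sentences, words, complex_words):
    """Gunning-Fog index from three counts (hypothetical helper name)."""
    if sentences <= 0 or words <= 0:
        raise ValueError("need at least one sentence and one word")
    # Step 2: average sentence length.
    average_sentence_length = words / sentences
    # Steps 3-4: complex words expressed as a percentage of all words.
    percentage_complex = 100 * complex_words / words
    # Step 5: scale the sum by 0.4.
    return 0.4 * (average_sentence_length + percentage_complex)


# Counts reported for crocydite.txt in the example session further down:
# 1 sentence, 34 words, 17 complex words.
print(fog_index(1, 34, 17))
```

For these counts the arithmetic is 0.4 × (34/1 + 100 × 17/34) = 0.4 × (34 + 50) = 33.6.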

Expressed as a formula, this becomes

\[ 0.4 \left[ \left( \frac{\textrm{words}}{\textrm{sentences}} \right) + 100 \left( \frac{\textrm{complex words}}{\textrm{words}} \right) \right] \]

While the fog index is a good sign of hard-to-read text, it has limits. Not all complex words are difficult. For example, "asparagus" is not generally thought to be a difficult word, though it has four syllables. A short word can be difficult if it is not used very often by most people.

Figure: The complete formula of the Gunning-Fog index.

Assignment

Your task is to compute the Gunning-Fog index for a series of text fragments, where each of these fragments has been stored in a text file. In order to do so, you proceed as follows.

Example

In the following interactive session, we assume that the text files crocydite.txt, cactolith.txt and wikipedia.txt are located in the current directory. The first two files contain the definitions of the geological terms as given in the introduction of this exercise. The third file contains the initial paragraphs of the Wikipedia article about Geology.

>>> syllables('cactolith')
3
>>> syllables('quasihorizontal')
6
>>> syllables('palaeosome')
4

>>> statistics('crocydite.txt')
(1, 34, 17)
>>> statistics('cactolith.txt')
(1, 29, 11)
>>> statistics('wikipedia.txt')
(5, 119, 37)

>>> gunningfog('crocydite.txt')
33.6
>>> gunningfog('cactolith.txt')
26.77241379310345
>>> gunningfog('wikipedia.txt')
21.956974789915968
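
The session above is consistent with a simple counting scheme, sketched below under stated assumptions. The exercise text shown here does not spell out how syllables, words or sentences are to be counted, so the rules in this sketch are guesses that happen to reproduce the sample values: a syllable is taken to be a maximal run of vowels (a, e, i, o, u, y), a word a maximal run of letters, and a sentence ends at a run of `.`, `!` or `?`. The names `count_syllables` and `text_statistics` are hypothetical, and the second function works on a string rather than a file name.

```python
import re


def count_syllables(word):
    """Approximate syllable count: number of maximal vowel runs.

    Assumption: the vowels are a, e, i, o, u and y.
    """
    return len(re.findall(r"[aeiouy]+", word.lower()))


def text_statistics(text):
    """Return (sentences, words, complex words) for a string."""
    # Assumption: a sentence ends at a run of '.', '!' or '?'.
    sentences = len(re.findall(r"[.!?]+", text))
    # Assumption: a word is a maximal run of ASCII letters.
    words = re.findall(r"[A-Za-z]+", text)
    # Complex words have three or more syllables.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return sentences, len(words), complex_words


cactolith = (
    "A cactolith is a quasihorizontal chonolith composed of "
    "anastomosing ductoliths whose distal ends curl like a harpolith, "
    "thin like a sphenolith, or bulge discordantly like an akmolith "
    "or ethmolith."
)
sentences, words, complex_words = text_statistics(cactolith)
print(sentences, words, complex_words)
# Fog index computed from the three counts.
print(0.4 * (words / sentences + 100 * complex_words / words))
```

On Hunt's definition this sketch counts 1 sentence, 29 words and 11 complex words, the same triple the session reports for cactolith.txt. Real text with abbreviations, digits or hyphenated words would need more careful rules.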

Resources

de Waard D (1950). Palingenetic structures in augen gneiss of the Sierra de Guadarrama, Spain. Bull. Comm. Géol. Finlande 150(23), 51–66.

Hunt CB (1953). Geology and geography of the Henry Mountains region, Utah. US Geological Survey Professional Paper 228, 234.