We've all been irritated by jargon — those collections of polysyllabic technical terms used in place of simple English, or worse still, everyday English words redefined into technical meanings. Below are two of my favorite, if ancient, definitions of geological terms, both of which may be just slightly tongue-in-cheek.

"Crocydite, belonging to the group of vaguely bordered migmatites (dictyonite, nebulite, stictolite), may be genetically defined by the new terminology as an endomerismite with magmatic neosome in a palaeosome which is a stereogenic cyriosome." (de Waard D, 1950)

"A cactolith is a quasihorizontal chonolith composed of anastomosing ductoliths whose distal ends curl like a harpolith, thin like a sphenolith, or bulge discordantly like an akmolith or ethmolith." (Hunt CB, 1953)

The latter term and its associated definition were created by Charles B. Hunt, a researcher at the United States Geological Survey (USGS). Whilst he was in fact describing an actual geological feature — a laccolith which he saw as resembling a cactus — he was also, tongue-in-cheek, commenting on what he saw as an absurd number of "-lith" words in the field of Geology. Word Ways: The Journal of Recreational Linguistics¹ chose cactolith as its word of the year for 2010.

In linguistics, the Gunning-Fog index is used to measure the readability of English writing. The index estimates the years of formal education needed to understand a given text on first reading. For example, a fog index of 12 requires the reading level of a U.S. high school senior (around 18 years old). The fog index is commonly used to confirm that a text can be read easily by the intended audience. Texts for a wide audience generally need a fog index less than 12. Texts requiring near-universal understanding generally need an index less than 8. Philip Chalmers of Benefit from IT provided the following typical fog index scores, to help ascertain the readability of documents.

fog index	examples
6	TV guides, the Bible, Mark Twain
8	Reader's Digest
8-10	most popular novels
10	Time, Newsweek
11	Wall Street Journal
14	The Times, The Guardian
15-20	scientific papers
$$\geq$$ 20	only government sites can get away with this, because you can't ignore them
$$\geq$$ 30	the government is covering something up
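
The two rule-of-thumb thresholds quoted before the table (below 8 for near-universal understanding, below 12 for a wide audience) could be applied to a computed index roughly as follows. This is only a sketch: the function name audience and the returned labels, in particular the label for indices of 12 and above, are illustrative assumptions and not part of the exercise.

```python
def audience(fog_index):
    """Classify a fog index using the two thresholds from the text.

    Hypothetical helper: the name and the labels are assumptions made
    for illustration only.
    """
    if fog_index < 8:
        return 'near-universal'
    if fog_index < 12:
        return 'wide audience'
    # The text gives no name for this band; the label is an assumption.
    return 'specialised audience'


for index in (6, 10, 15):
    print(index, audience(index))
```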

The Gunning-Fog index is calculated using the following algorithm:

Select a passage (such as one or more full paragraphs) of around 100 words. Do not omit any sentences.
Determine the average sentence length by dividing the number of words by the number of sentences.
Count the complex words: these are the words with three or more syllables.
Add the average sentence length and the percentage of complex words.
Multiply the result from the previous step by 0.4.

Expressed as a formula, this becomes

\[ 0.4 \left[ \left( \frac{\textrm{words}}{\textrm{sentences}} \right) + 100 \left( \frac{\textrm{complex words}}{\textrm{words}} \right) \right] \]

While the fog index is a good sign of hard-to-read text, it has limits. Not all complex words are difficult. For example, "asparagus" is not generally thought to be a difficult word, though it has four syllables. A short word can be difficult if it is not used very often by most people.
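
As a quick check of the formula, the sketch below evaluates it directly from three counts. The function name fog_index and its parameter names are illustrative assumptions; the counts 1, 34 and 17 are the ones reported for crocydite.txt in the example session of this exercise.

```python
def fog_index(sentences, words, complex_words):
    """Gunning-Fog index from raw counts (hypothetical helper, for illustration)."""
    average_sentence_length = words / sentences
    percentage_complex = 100 * complex_words / words
    return 0.4 * (average_sentence_length + percentage_complex)


# One sentence of 34 words, 17 of them complex:
# 0.4 * (34 / 1 + 100 * 17 / 34) = 0.4 * (34 + 50), i.e. about 33.6
print(round(fog_index(1, 34, 17), 1))
```

Note that the sketch does not guard against zero sentences or zero words, which would raise a ZeroDivisionError.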


Assignment

Your task is to compute the Gunning-Fog index for a series of text fragments, where each of these fragments has been stored in a text file. In order to do so, you proceed as follows.

Write a function syllables that takes a word (a string containing letters only). The function must return an estimate of the number of syllables in the given word. Although this is not entirely accurate, the function must count the syllables as the number of maximal sequences of consecutive vowels (a, e, i, o, u or y). For example, the word palaeosome has four such sequences: a, aeo, o and e. The function should make no distinction between uppercase and lowercase letters.
Use the function syllables to write a function statistics that takes the location of a text file. This file must contain a text fragment, with each sentence on a separate line. In addition, the text file may contain empty lines (lines that contain nothing, or only spaces and tabs), for example to separate the sentences of successive paragraphs. These empty lines are not considered to be sentences. The function must return a tuple containing three integers that respectively indicate how many sentences, words and complex words occur in the given text fragment. The words of a sentence are defined as the longest possible sequences of letters. A word is complex if it has three or more syllables, where the number of syllables in the word must be determined using the function syllables.
Use the function statistics to write a function gunningfog that takes the location of a text file. This file contains a text fragment that should be interpreted in the same way as with the function statistics. The function must return the computed Gunning-Fog index of the given text fragment as a floating point number.
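
One possible way to approach the three functions is sketched below. It is a minimal sketch rather than a reference solution, and it rests on a few assumptions that the assignment does not state: "letters" is taken to mean the ASCII letters a-z and A-Z, the files are read as UTF-8, and a file without any sentences or words is not handled (it would raise a ZeroDivisionError in gunningfog).

```python
import re


def syllables(word):
    """Estimate the syllables as the number of maximal vowel runs (a, e, i, o, u, y)."""
    return len(re.findall(r'[aeiouy]+', word.lower()))


def statistics(filename):
    """Return (sentences, words, complex words) for a file with one sentence per line."""
    sentences = words = complex_words = 0
    # Assumption: the file is UTF-8 encoded.
    with open(filename, encoding='utf-8') as handle:
        for line in handle:
            # Lines that are empty or hold only whitespace are not sentences.
            if not line.strip():
                continue
            sentences += 1
            # Assumption: a word is a maximal run of ASCII letters.
            for word in re.findall(r'[A-Za-z]+', line):
                words += 1
                if syllables(word) >= 3:
                    complex_words += 1
    return sentences, words, complex_words


def gunningfog(filename):
    """Compute the Gunning-Fog index of the text fragment stored in the given file."""
    sentences, words, complex_words = statistics(filename)
    return 0.4 * (words / sentences + 100 * complex_words / words)
```

Counting vowel runs with a single regular expression keeps syllables short. Reading the file line by line means statistics never needs to hold the whole fragment in memory.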

Example

In the following interactive session, we assume that the text files crocydite.txt², cactolith.txt³ and wikipedia.txt⁴ are located in the current directory. The first two files contain the definitions of the geological terms as given in the introduction of this exercise. The third file contains the initial paragraphs of the Wikipedia article about Geology.

>>> syllables('cactolith')
3
>>> syllables('quasihorizontal')
6
>>> syllables('palaeosome')
4

>>> statistics('crocydite.txt')
(1, 34, 17)
>>> statistics('cactolith.txt')
(1, 29, 11)
>>> statistics('wikipedia.txt')
(5, 119, 37)

>>> gunningfog('crocydite.txt')
33.6
>>> gunningfog('cactolith.txt')
26.77241379310345
>>> gunningfog('wikipedia.txt')
21.956974789915968

Resources

de Waard D (1950). Palingenetic structures in augen gneiss of the Sierra de Guadarrama, Spain. Bull. Comm. Géol. Finlande 150(23), 51–66. ⁵

Hunt CB (1953). Geology and geography of the Henry Mountains region, Utah. US Geological Survey Professional Paper 228, 234. ⁶