People of all ages use Twitter. Because messages (tweets, in the jargon) are short, posting is easy and the platform is very accessible. This makes Twitter well suited for following trends in the population. To help researchers, Twitter offers an API: a way to query the website without a browser and to obtain the data in a structured form that a computer program can process more easily.
To examine trends, you can for example build a frequency table of the words used. If you do this for the tweets returned by a search for a particular term, you expect that term to have a high frequency and related words to occur often as well.
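As a minimal sketch of such a frequency table, the snippet below counts words in a few made-up tweets. The sample tweets are hypothetical, and splitting on whitespace is a simplification of how words would really be delimited.

```python
from collections import Counter

# Hypothetical sample tweets; real input would come from the search results.
tweets = [
    "tag cloud demo",
    "my first tag cloud",
    "cloud computing news",
]

# Split on whitespace (a simplification) and count every word.
frequencies = Counter(word for tweet in tweets for word in tweet.split())

print(frequencies["cloud"])  # 3
print(frequencies["tag"])    # 2
```

The search term parts 'tag' and 'cloud' end up with the highest counts, as expected.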
A classic way to present a frequency table in a visually appealing way is a tag cloud. In a tag cloud, words with a higher frequency are shown in a larger font, so they are noticed more easily.
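One way to turn frequencies into font sizes is a linear scaling between a smallest and a largest size. The function name font_sizes, its parameters and the point sizes below are assumptions made for illustration; other scalings (logarithmic, for instance) are equally possible.

```python
def font_sizes(frequencies, smallest=10, largest=40):
    """Map each word to a font size (in points) that grows linearly with its frequency."""
    low, high = min(frequencies.values()), max(frequencies.values())
    span = high - low or 1  # avoid division by zero when all counts are equal
    return {
        word: smallest + (largest - smallest) * (count - low) / span
        for word, count in frequencies.items()
    }

# The most frequent word gets the largest size, the least frequent the smallest.
sizes = font_sizes({"geografie": 82, "na": 48, "chemie": 7})
print(sizes)
```

This sketch assumes a non-empty frequency table.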
Above you see an example of a tag cloud, built from the results for the search term 'tag cloud'. It shows immediately that such a frequency table easily fills up with so-called stop words, i.e., words that are not directly related to the search term but simply occur frequently in a given language.
The Twitter Search API can be found at http://search.twitter.com/search.json?q=zoekterm, where zoekterm (Dutch for 'search term') should be replaced by the word you are looking for. The API returns the results in JSON format (JavaScript Object Notation). This format is not tied to Python, but its syntax closely matches that of a nested Python dictionary, so we can simply interpret the string with the function eval. Three small problems arise, however: JavaScript writes None as null, True as true and False as false. If we create three variables with those names and the corresponding values, we can process such a result in one go. Below is an example of how this can be done. We assume that the variable result contains the text that is returned by the API.
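# make the JavaScript literals known to Python before evaluating
null, true, false = None, True, False
# interpret the JSON text as a nested Python dictionary
dictionary = eval(result)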
The dictionary constructed above contains a key 'results' whose value is a list. This list contains dictionaries, each of which holds the text of a tweet under the key 'text'.
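The sketch below applies this to a hypothetical, heavily shortened response; the key 'geo' and the tweet texts are made up for illustration. It also compares the result with json.loads from the standard library, an alternative that parses the same text without evaluating it as code.

```python
import json

# Hypothetical, heavily shortened response; a real response has many more keys.
result = (
    '{"results": ['
    '{"text": "RT my first tag cloud", "geo": null}, '
    '{"text": "tag cloud demo", "geo": null}]}'
)

# The approach from the text: make the JavaScript literals known to eval.
null, true, false = None, True, False
dictionary = eval(result)

# json.loads yields the same nested structure without evaluating code.
assert dictionary == json.loads(result)

# Collect the text of every tweet in the list under the key 'results'.
texts = [tweet["text"] for tweet in dictionary["results"]]
print(texts)  # ['RT my first tag cloud', 'tag cloud demo']
```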
Write a function twittercloud that takes one mandatory argument and two optional arguments. The mandatory argument is the search term, which contains no spaces. The function returns a frequency table of the words in the tweets that are returned for the search term. A word is the longest possible sequence of letters (accents included), digits, the character # and the character @.
The first optional argument is named stopwords and contains a collection of words that should be ignored. If this argument is not given, the collection containing only the string 'RT' is used (the most popular abbreviation on Twitter to indicate a retweet). The second optional argument is named number and has default value 20. It gives the number of words to include in the final frequency table.
The frequency table contains the words with the highest frequency. Words with the same frequency are ordered alphabetically. All comparisons of words have to be case-sensitive.
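The word definition and the ranking rule can be sketched as follows. This is only one possible reading of the specification: the helper name top_words and the sample texts are hypothetical, fetching the tweets is left out, and "alphabetically" is taken to mean Python's default (case-sensitive) string ordering.

```python
import re
from collections import Counter

# Runs of letters, digits, '#' and '@'.
# [^\W_] matches any Unicode letter or digit (\w without the underscore).
WORD = re.compile(r"(?:[^\W_]|[#@])+")

def top_words(texts, stopwords=("RT",), number=20):
    """Count words in texts, skip stop words, keep the `number` most frequent."""
    counts = Counter(
        word
        for text in texts
        for word in WORD.findall(text)
        if word not in stopwords
    )
    # Sort by descending frequency, then by default string order for ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:number])

texts = ["RT @anna: café #geo rocks", "Café or café? #geo"]
print(top_words(texts, number=3))  # {'#geo': 2, 'café': 2, '@anna': 1}
```

Note that 'Café' and 'café' are counted separately because the comparison is case-sensitive, and that 'RT' is dropped by the default stop word collection.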
Because the content returned by the Twitter Search API changes over time, the example below cannot be used as a doctest.
>>> twittercloud('geografie')
{
'i': 39,
'na': 48,
'geografie': 82,
'mi': 9,
'nie': 17,
'sie': 36,
'in': 9,
'mam': 12,
'D': 8,
'Geografie': 13,
'a': 13,
'ale': 9,
'chemie': 7,
'fizyke': 8,
'historie': 12,
'ja': 13,
'to': 8,
'uczyc': 9,
'z': 14,
'za': 8
}