It's puzzling but true that in a group of 23 people there is a roughly 50% chance that at least two of them share the same birthday. This birthday paradox isn't a logical paradox: there's nothing self-contradictory about it, it's just unexpected.

The birthday paradox: $$p(n)$$ indicates the probability that in a group of $$n$$ people at least two of them share the same birthday; $$q(n)$$ indicates the probability that in a group of $$n$$ people at least one person shares a birthday with a person who was chosen beforehand (for example, yourself). The value $$n$$ is shown on the horizontal axis and the probabilities (in the interval from 0 to 1, or equivalently from 0% to 100%) are shown on the vertical axis.

Imagine a typical classroom of 30 students. People usually think it's an amazing coincidence when two students in such a class share the same birthday, but it isn't all that rare after all. In fact, with 30 students there is about a 70% chance.
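The figures of roughly 50% for 23 people and 70% for 30 people can be checked with a few lines of Python. The sketch below is only an illustration under the usual simplifying assumptions: 365 equally likely birthdays, no leap days, and independent birthdays. The function names p and q mirror the notation of the figure caption; reading $$q(n)$$ as "at least one of $$n$$ other people matches a fixed person" is an assumption.

```python
def p(n, days=365):
    """Probability that at least two of n people share a birthday."""
    no_match = 1.0
    for k in range(n):
        # the (k+1)-th person must avoid the k birthdays already taken
        no_match *= (days - k) / days
    return 1.0 - no_match


def q(n, days=365):
    """Probability that at least one of n people shares the birthday
    of one fixed person (assumed not to be among the n)."""
    return 1.0 - ((days - 1) / days) ** n


print(round(p(23), 3))  # 0.507
print(round(p(30), 3))  # 0.706
print(q(23) < p(23))    # True: matching one fixed person is far less likely
```

The gap between p and q is what makes the result feel surprising: people tend to estimate the chance that someone matches their own birthday, not the chance that any pair matches.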

Perhaps the best data set of all for testing the birthday paradox could be found this summer at the 2014 FIFA World Cup, an international football tournament held in Brazil from 12 June to 13 July 2014. Each of the 32 national teams involved in the tournament was required to register a squad of 23 players, including three goalkeepers. Only players in these squads were eligible to take part in the tournament. If the birthday paradox holds, about half of the squads should contain a shared birthday.

In order to test this, we have collected information about all players from FIFA's official squad list. The information about all players of a national team squad is stored in a text file. Each line of such a file contains the following information fields for a single player: i) name, ii) national team, iii) squad number, iv) position on the field (GK=goalkeeper, DF=defender, MF=midfielder, FW=forward), v) date of birth (YYYY-MM-DD), vi) number of caps and vii) club. Information fields are separated by commas (,). As an example, the first few lines of the file france.txt, which contains information about the players of the French national team, are shown below.

Hugo Lloris,France,1,GK,1986-12-26,57,Tottenham Hotspur
Mathieu Debuchy,France,2,DF,1985-07-28,21,Newcastle United
Patrice Evra,France,3,DF,1981-05-15,58,Manchester United
Raphaël Varane,France,4,DF,1993-04-25,6,Real Madrid
Mamadou Sakho,France,5,DF,1990-02-13,19,Liverpool
Yohan Cabaye,France,6,MF,1986-01-14,30,Paris Saint-Germain
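As a small illustration of the file format, a single line can be split into its seven fields on the commas. This is a sketch only: the helper name parse_player and the dictionary keys are made up for this example, and it assumes that no field itself contains a comma.

```python
def parse_player(line):
    """Split one line of a squad file into its seven fields."""
    name, team, number, position, born, caps, club = line.strip().split(',')
    return {
        'name': name,
        'team': team,
        'number': int(number),   # squad number
        'position': position,    # GK, DF, MF or FW
        'born': born,            # date of birth as YYYY-MM-DD
        'caps': int(caps),
        'club': club,
    }


player = parse_player('Mamadou Sakho,France,5,DF,1990-02-13,19,Liverpool\n')
print(player['name'])      # Mamadou Sakho
print(player['born'][5:])  # 02-13 (month and day, without the year)
```

Dropping the first five characters of the date leaves the month and day, which is all that matters when comparing birthdays.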

Assignment

In this exercise we process text files that contain information about all players in a national team squad at the football World Cup. The information about the players is stored in the format outlined in the introduction. All text files use UTF-8 character encoding (see below). Your task:

Unicode files

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems. The standard consists of a set of code charts for visual reference, an encoding method and set of standard character encodings, a set of reference data computer files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic and Hebrew, and left-to-right scripts).

Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8, UTF-16 and the now-obsolete UCS-2. UTF-8 uses one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters. To open a Unicode file in Python, you can pass the name of the encoding to the optional parameter encoding of the built-in function open. For example, in order to read the information from the Unicode text file france.txt, which makes use of UTF-8 encoding, the file can be opened in the following way:

>>> open('france.txt', 'r', encoding='utf-8')
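In practice the file is usually opened in a with statement, so that it is closed automatically after reading. The sketch below is self-contained: it first writes two lines taken from the France sample to a temporary file (the file name sample.txt is made up for this illustration) and then reads them back with the same encoding.

```python
import os
import tempfile

sample = ('Raphaël Varane,France,4,DF,1993-04-25,6,Real Madrid\n'
          'Mamadou Sakho,France,5,DF,1990-02-13,19,Liverpool\n')

# write the sample to a temporary file, explicitly as UTF-8
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as handle:
    handle.write(sample)

# read it back; the file is closed when the with block ends
with open(path, 'r', encoding='utf-8') as handle:
    names = [line.split(',')[0] for line in handle]

print(names)  # ['Raphaël Varane', 'Mamadou Sakho']
```

Passing the encoding explicitly matters for names such as Raphaël: without it, Python falls back on a platform-dependent default encoding, which may decode the non-ASCII character incorrectly.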

Example

In the following interactive session we assume that the text files algeria.txt, belgium.txt and france.txt are located in the current directory.

>>> born = birthdays('france.txt')
>>> born['02-13']
{'Mamadou Sakho', 'Eliaquim Mangala'}
>>> born['03-08']
{'Rio Mavuba', 'Rémy Cabella'}

>>> birthdayparadox('france.txt')
True
>>> birthdayparadox('belgium.txt')
False

>>> testparadox([('Algeria', 'algeria.txt'), ('Belgium', 'belgium.txt'), ('France', 'france.txt')])
{'Algeria', 'France'}

Resources

J. Fletcher (2014). The birthday paradox at the World Cup. BBC News Magazine.