It's puzzling but true that in a group of 23 people there is a chance of slightly more than 50% that at least two of them share the same birthday. This birthday paradox¹ isn't a logical paradox (there's nothing self-contradictory about it); it's just unexpected.

The birthday paradox: $$p(n)$$ indicates the probability that in a group of $$n$$ people at least two of them share the same birthday; $$q(n)$$ indicates the probability that in a group of $$n$$ people there is at least one person sharing the same birthday as another person that was chosen beforehand (for example: yourself). The value $$n$$ is shown on the horizontal axis and probabilities (in the interval from 0 to 1, or the interval from 0% to 100%) are shown on the vertical axis.

Imagine a typical classroom with 30 students. People usually think it's an amazing coincidence when two students in such a class share the same birthday, but it isn't all that rare after all. In fact, with 30 students there is about a 70% chance that at least two of them share a birthday.
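
The two probabilities can be checked numerically. The sketch below rests on the usual simplifying assumptions, which the text above does not state explicitly: a year has 365 equally likely birthdays (no leap days), birthdays are independent, and for q the person chosen beforehand is not counted among the n people. The function names p and q simply mirror the notation of the figure caption.

```python
def p(n, days=365):
    """Probability that at least two out of n people share a birthday."""
    all_distinct = 1.0
    for k in range(n):
        # person k+1 must avoid the k birthdays that are already taken
        all_distinct *= (days - k) / days
    return 1.0 - all_distinct


def q(n, days=365):
    """Probability that at least one out of n people shares a birthday
    with one fixed person chosen beforehand (not part of the n people)."""
    return 1.0 - ((days - 1) / days) ** n


for n in (23, 30):
    print(f"n = {n}: p(n) = {p(n):.3f}, q(n) = {q(n):.3f}")
```

Under these assumptions p(23) comes out just above 0.5 and p(30) just above 0.7, while q(n) stays far smaller for the same group sizes.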

Perhaps the best data set of all to test the birthday paradox could be found this summer at the 2014 FIFA World Cup, an international football tournament held in Brazil from 12 June to 13 July 2014. The 32 national teams involved in the tournament were required to register a squad of 23 players, including three goalkeepers. Only players in these squads were eligible to take part in the tournament. If the birthday paradox holds, about half of the squads should contain at least two players who share a birthday.

To test this, we have collected information about all players from FIFA's official squad list. The information about all players of a national team squad is stored in a separate text file. Each line of such a file contains the following information fields for a single player: i) name, ii) national team, iii) squad number, iv) position on the field (GK=goalkeeper, DF=defender, MF=midfielder, FW=forward), v) date of birth (YYYY-MM-DD), vi) number of caps and vii) club. Information fields are separated by commas (,). As an example, the first few lines of the file france.txt² that contains information about the players of the French national team are shown below.

Hugo Lloris,France,1,GK,1986-12-26,57,Tottenham Hotspur
Mathieu Debuchy,France,2,DF,1985-07-28,21,Newcastle United
Patrice Evra,France,3,DF,1981-05-15,58,Manchester United
Raphaël Varane,France,4,DF,1993-04-25,6,Real Madrid
Mamadou Sakho,France,5,DF,1990-02-13,19,Liverpool
Yohan Cabaye,France,6,MF,1986-01-14,30,Paris Saint-Germain
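
As a minimal illustration of this format, the sketch below splits one of the lines above into its seven fields and derives the day of the year from the date of birth. It assumes that no field itself contains a comma; the variable names are chosen here for illustration only.

```python
line = 'Hugo Lloris,France,1,GK,1986-12-26,57,Tottenham Hotspur\n'

# remove the trailing newline and split the line on commas into seven fields
name, team, number, position, born, caps, club = line.strip().split(',')

# the date of birth has the fixed format YYYY-MM-DD,
# so dropping the first five characters leaves MM-DD
day = born[5:]

print(name, position, day)
```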

Assignment

In this exercise we process text files that contain information about all players in a national team squad at the World Cup. The information about the players is stored in the format outlined in the introduction. All text files use UTF-8 character encoding (see below). Your task:

Write a function birthdays that takes the location of a text file. The function must return a dictionary that maps each day of the year on which at least one player in the national team squad celebrates their birthday onto the set of names of the players who were born on that day of the year. Note that the keys of the dictionary are days of the year (format MM-DD) that need to be derived from the dates of birth of the players as stored in the given file.
Use the function birthdays to write a function birthdayparadox that takes the location of a text file. The function must return a Boolean value that indicates whether or not there is a day of the year on which at least two players of the national team squad celebrate their birthday.
Use the function birthdayparadox to write a function testparadox that takes a list of national team squads. Each national team squad is represented as a tuple containing two strings: the name of the country and the location of the text file containing information about all players in the squad. The function must return a set containing the names of all countries whose national team squad has at least two players that celebrate their birthday on the same day of the year.

Unicode files

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems. The standard consists of a set of code charts for visual reference, an encoding method and set of standard character encodings, a set of reference data computer files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic and Hebrew, and left-to-right scripts).

Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8, UTF-16 and the now-obsolete UCS-2. UTF-8 uses one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters. To open a Unicode file in Python, you can pass the encoding used to the optional parameter encoding of the built-in function open. For example, in order to read the information from the Unicode text file france.txt³ that makes use of UTF-8 encoding, the file can be opened in the following way:

>>> open('france.txt', 'r', encoding='utf-8')
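
The variable length of UTF-8 can be observed directly with the string method encode. The short sketch below uses four sample characters picked for illustration (one ASCII letter and three non-ASCII characters) and reports how many bytes each one occupies in UTF-8.

```python
# sample characters: ASCII letter, accented letter, euro sign, emoji
samples = ['a', 'ë', '€', '\U0001F600']

lengths = {}
for char in samples:
    encoded = char.encode('utf-8')
    lengths[char] = len(encoded)
    # print the code point rather than the character itself,
    # so the output is plain ASCII on any console
    print(f'U+{ord(char):04X}: {len(encoded)} byte(s)')

# an ASCII character has the same single byte in both encodings
same_as_ascii = 'a'.encode('utf-8') == 'a'.encode('ascii')
```

This is why names such as Raphaël Varane only come out right when the file is opened with the matching encoding.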

Example

In the following interactive session we assume that the text files algeria.txt⁴, belgium.txt⁵ and france.txt⁶ are located in the current directory.

>>> born = birthdays('france.txt')
>>> born['02-13']
{'Mamadou Sakho', 'Eliaquim Mangala'}
>>> born['03-08']
{'Rio Mavuba', 'Rémy Cabella'}

>>> birthdayparadox('france.txt')
True
>>> birthdayparadox('belgium.txt')
False

>>> testparadox([('Algeria', 'algeria.txt'), ('Belgium', 'belgium.txt'), ('France', 'france.txt')])
{'Algeria', 'France'}

Resources

J. Fletcher (2014). The birthday paradox at the World Cup. BBC News Magazine. ⁷