Hash tables

Binary search is a huge improvement over linear search. It still requires about log₂(n) trials, though. Is it possible to improve upon that?

Suppose that I have a list of names of personnel, and each employee has an employee number that has six digits. I store the employee names in a list, where they are a tuple of the employee number and their name. For example, the list could look like:

employees = [
    ( 142351, "Apricot, A." ),
    ( 211111, "Cherry, C." ),
    ( 232112, "Banana, B." ),
    ( 341123, "Fig, F." ),
    ( 400012, "Apple, A." ),
    ( 513210, "Mango, M." ),
    ( 778899, "Pomegranate, P." ), 
    ( 963210, "Pear, P." ) ]

Since the list is sorted on employee number, if I want to search for employees by number, I can use binary search. For these eight employees, it may take up to 4 checks with binary search to find out if a particular number is on the list. If this list would contain 1000 employees, it would take about 10 checks.

Exercise to be done on your preferred code editor: Implement a binary search for the employee list above to search for employees by their number.

Can this be done faster? Can you imagine a way to store the employees in a list in such a way that you can check for the existence of their number with a method that is O(1)?

A possible solution would be to store the names of the employees in a list, whereby the index for their name in the list is their employee number. So to find out if employee with number 513210 exists, I just check if employee[513210] contains an empty string (in which case no employee exists with this number), or a name. Thus, checking for existence of an employee takes just one attempt.

Naturally, this solution is not realistic in most cases. For six-digit employee numbers, it needs a list with one million elements. While that might still be doable for modern computers, with bigger numbers they might get problems. Moreover, this approach will not work if the list is not sorted by number, but, for instance, by name. A name cannot be used (directly) as index for a list.

Hopefully at this point you have been thinking “Wouldn’t I use a dictionary in that case?” And the answer is, “Of course, you would use a dictionary.” A dictionary, however, is no more than a list which is ordered in a very smart way, so that you can use any type of data as index. A dictionary is implemented as a so-called “hash table”.