Python's Dictionary Implementation: Being All Things to All People > Inside the Dictionary

18. Python's Dictionary Implementation: Being All Things to All People

Andrew Kuchling

Dictionaries are a fundamental data type in the python programming language. Like awk's associative arrays and Perl's hashes, dictionaries store a mapping of unique keys to values. Basic operations on a dictionary include:

Here's a brief example of using a dictionary at the Python interpreter prompt. (To try out this example, you can just run the python command on Mac OS and most Linux distributions. If Python isn't already installed, you can download it from http://www.python.org.)

In the following interactive session, the >>> signs represent the Python interpreter's prompts, and d is the name of the dictionary I'm playing with:

	>>> d = {1: 'January', 2: 'February',
	... 'jan': 1, 'feb': 2, 'mar': 3}
	{'jan': 1, 1: 'January', 2: 'February', 'mar': 3, 'feb': 2}
	>>> d['jan'], d[1]
	(1, 'January')
	>>> d[12]
	Traceback (most recent call last):
	  File "<stdin>", line 1, in <module>
	KeyError: 12
	>>> del d[2]
	>>> for k, v in d.items(): print k,v # Looping over all pairs.
	jan 1
	1 January
	mar 3
	>feb 2
	...

Two things to note about Python's dictionary type are:

It's important that retrieval of keys be a very fast operation, so dictionary-like types are usually implemented as hash tables. For the C implementation of Python (henceforth referred to as CPython), dictionaries are even more pivotal because they underpin several other language features. For example, classes and class instances use a dictionary to store their attributes:

	>>> obj = MyClass()          # Create a class instance
	>>> obj.name = 'object'      # Add a .name attribute
	>>> obj.id = 14              # Add a .id attribute
	>>> obj._ _dict_ _            # Retrieve the underlying dictionary
	{'name': 'object', 'id': 14}
	>>> obj._ _dict_ _['id'] = 12 # Store a new value in the dictionary
	>>> obj.id                   # Attribute is changed accordingly
	12

Module contents are also represented as a dictionary, most notably the _ _builtin_ _ module that contains built-in identifiers such as int and open. Any expression that uses such built-ins will therefore result in a few dictionary lookups. Another use of dictionaries is to pass keyword arguments to a function, so a dictionary could potentially be created and destroyed on every function call. This internal use of the dictionary type means that any running Python program has many dictionaries active at the same time, even if the user's program code doesn't explicitly use a dictionary. It's therefore important that dictionaries can be created and destroyed quickly and not use an overly large amount of memory.

The implementation of dictionaries in Python teaches several lessons about performance-critical code. First, one has to trade off the advantages of an optimization against the overhead it adds in space or calculation time. There were places where the Python developers found that a relatively naïve implementation was better in the long run than an extra optimization that seemed more appealing at first. In short, it often pays to keep things simple.

Second, real-life benchmarking is critical; only that way can you discover what's really worth doing.

18.1. Inside the Dictionary

Dictionaries are represented by a C structure, PyDictObject, defined in Include/dictobject.h. Here's a schematic of the structure representing a small dictionary mapping "aa", "bb", "cc", …, "mm" to the integers 1 to 13:

	int ma_fill         13
	int ma_used         13
	int ma_mask         31

	PyDictEntry ma_table[]:
	[0]: aa, 1                  hash(aa) == -1549758592, -1549758592 & 31 = 0
	[1]: ii, 9                  hash(ii) == -1500461680, -1500461680 & 31 = 16	
	[2]: null, null
	[3]: null, null 
	[4]: null, null
	[5]: jj, 10                 hash(jj) == 653184214, 653184214 & 31 = 22
	[6]: bb, 2                  hash(bb) == 603887302, 603887302 & 31 = 6
	[7]: null, null 
	[8]: cc, 3                  hash(cc) == -1537434360, -1537434360 & 31 = 8
	[9]: null, null
	[10]: dd, 4                 hash(dd) == 616211530, 616211530 & 31 = 10
	[11]: null, null
	[12]: null, null
	[13]: null, null
	[14]: null, null 
	[15]: null, null
	[16]: gg, 7                 hash(gg) == -1512785904, -1512785904 & 31 = 16
	[17]: ee, 5                 hash(ee) == -1525110136, -1525110136 & 31 = 8
	[18]: hh, 8                 hash(hh) == 640859986, 640859986 & 31 = 18
	[19]: null, null
	[20]: null, null
	[21]: kk, 11                hash(kk) == -1488137240, -1488137240 & 31 = 8
	[22]: ff, 6                 hash(ff) == 628535766, 628535766 & 31 = 22
	[23]: null, null
	[24]: null, null
	[25]: null, null
	[26]: null, null
	[27]: null, null
	[28]: null, null
	[29]: ll, 12                hash(ll) == 665508394, 665508394 & 31 = 10
	[30]: mm, 13                hash(mm) == -1475813016, -1475813016 & 31 = 8 
	[31]: null, null


					    

The ma_prefix in the field names comes from the word mapping, Python's term for data types that provide key/value lookups. The fields in the structure are:


ma_used

Number of slots occupied by keys (in this case, 13).


ma_fill

Number of slots occupied by keys or by dummy entries (also 13).


ma_mask

Bitmask representing the size of the hash table. The hash table contains ma_mask+1 slots—in this case, 32. The number of slots in the table is always a power of 2, so this value is always of the form 2n–1 for some n, and therefore consists of n set bits.


ma_table

Pointer to an array of PyDictEntry structures. PyDictEntry contains pointers to:

  • The key object

  • The value object

  • A cached copy of the key's hash code

The hash value is cached for the sake of speed. When searching for a key, the exact hash values can be quickly compared before performing a slower, full equality comparison of the keys. Resizing a dictionary also requires the hash value for each key, so caching the value saves having to rehash all the keys when resizing.

We don't keep track directly of the number of slots in the table, but derive it instead as needed from ma_mask. When looking up the entry for a key, slot = hash & mask is used to figure out the initial slot for a particular hash value. For instance, the hash function for the first entry generated a hash of –1549758592, and –1549758592 mod 31 is 0, so the entry is stored in slot 0.

Because the mask is needed so often, we store it instead of the number of slots. It's easy to calculate the number of slots by adding 1, and we never need to do so in the most speed-critical sections of code.

ma_fill and ma_used are updated as objects are added and deleted. ma_used is the number of keys present in the dictionary; adding a new key increases it by 1, and deleting a key decreases it by 1. To delete a key, we make the appropriate slot point to a dummy key; ma_fill therefore remains the same when a key is deleted, but may increase by 1 when a new key is added. (ma_fill is never decremented, but will be given a new value when a dictionary is resized.)