| 1 | NOTES ON OPTIMIZING DICTIONARIES
 | 
|---|
| 2 | ================================
 | 
|---|
| 3 | 
 | 
|---|
| 4 | 
 | 
|---|
| 5 | Principal Use Cases for Dictionaries
 | 
|---|
| 6 | ------------------------------------
 | 
|---|
| 7 | 
 | 
|---|
| 8 | Passing keyword arguments
 | 
|---|
| 9 |     Typically, one read and one write for 1 to 3 elements.
 | 
|---|
| 10 |     Occurs frequently in normal python code.
 | 
|---|
| 11 | 
 | 
|---|
| 12 | Class method lookup
 | 
|---|
| 13 |     Dictionaries vary in size with 8 to 16 elements being common.
 | 
|---|
| 14 |     Usually written once with many lookups.
 | 
|---|
| 15 |     When base classes are used, there are many failed lookups
 | 
|---|
| 16 |         followed by a lookup in a base class.
 | 
|---|
| 17 | 
 | 
|---|
| 18 | Instance attribute lookup and Global variables
 | 
|---|
| 19 |     Dictionaries vary in size.  4 to 10 elements are common.
 | 
|---|
| 20 |     Both reads and writes are common.
 | 
|---|
| 21 | 
 | 
|---|
| 22 | Builtins
 | 
|---|
| 23 |     Frequent reads.  Almost never written.
 | 
|---|
| 24 |     Size 126 interned strings (as of Py2.3b1).
 | 
|---|
| 25 |     A few keys are accessed much more frequently than others.
 | 
|---|
| 26 | 
 | 
|---|
| 27 | Uniquification
 | 
|---|
| 28 |     Dictionaries of any size.  Bulk of work is in creation.
 | 
|---|
| 29 |     Repeated writes to a smaller set of keys.
 | 
|---|
| 30 |     Single read of each key.
 | 
|---|
| 31 |     Some use cases have two consecutive accesses to the same key.
 | 
|---|
| 32 | 
 | 
|---|
| 33 |     * Removing duplicates from a sequence.
 | 
|---|
| 34 |         dict.fromkeys(seqn).keys()
 | 
|---|
| 35 | 
 | 
|---|
| 36 |     * Counting elements in a sequence.
 | 
|---|
| 37 |         for e in seqn:
 | 
|---|
| 38 |           d[e] = d.get(e,0) + 1
 | 
|---|
| 39 | 
 | 
|---|
| 40 |     * Accumulating references in a dictionary of lists:
 | 
|---|
| 41 | 
 | 
|---|
| 42 |         for pagenumber, page in enumerate(pages):
 | 
|---|
| 43 |           for word in page:
 | 
|---|
| 44 |             d.setdefault(word, []).append(pagenumber)
 | 
|---|
| 45 | 
 | 
|---|
| 46 |     Note, the second example is a use case characterized by a get and set
 | 
|---|
| 47 |     to the same key.  There are similar use cases with a __contains__
 | 
|---|
| 48 |     followed by a get, set, or del to the same key.  Part of the
 | 
|---|
| 49 |     justification for d.setdefault is combining the two lookups into one.
 | 
|---|
| 50 | 
 | 
|---|
| 51 | Membership Testing
 | 
|---|
| 52 |     Dictionaries of any size.  Created once and then rarely changes.
 | 
|---|
| 53 |     Single write to each key.
 | 
|---|
| 54 |     Many calls to __contains__() or has_key().
 | 
|---|
| 55 |     Similar access patterns occur with replacement dictionaries
 | 
|---|
| 56 |         such as with the % formatting operator.
 | 
|---|
| 57 | 
 | 
|---|
| 58 | Dynamic Mappings
 | 
|---|
| 59 |     Characterized by deletions interspersed with adds and replacements.
 | 
|---|
| 60 |     Performance benefits greatly from the re-use of dummy entries.
 | 
|---|
| 61 | 
 | 
|---|
| 62 | 
 | 
|---|
| 63 | Data Layout (assuming a 32-bit box with 64 bytes per cache line)
 | 
|---|
| 64 | ----------------------------------------------------------------
 | 
|---|
| 65 | 
 | 
|---|
| 66 | Smalldicts (8 entries) are attached to the dictobject structure
 | 
|---|
| 67 | and the whole group nearly fills two consecutive cache lines.
 | 
|---|
| 68 | 
 | 
|---|
| 69 | Larger dicts use the first half of the dictobject structure (one cache
 | 
|---|
| 70 | line) and a separate, continuous block of entries (at 12 bytes each
 | 
|---|
| 71 | for a total of 5.333 entries per cache line).
 | 
|---|
| 72 | 
 | 
|---|
| 73 | 
 | 
|---|
| 74 | Tunable Dictionary Parameters
 | 
|---|
| 75 | -----------------------------
 | 
|---|
| 76 | 
 | 
|---|
| 77 | * PyDict_MINSIZE.  Currently set to 8.
 | 
|---|
| 78 |     Must be a power of two.  New dicts have to zero-out every cell.
 | 
|---|
| 79 |     Each additional 8 consumes 1.5 cache lines.  Increasing improves
 | 
|---|
| 80 |     the sparseness of small dictionaries but costs time to read in
 | 
|---|
| 81 |     the additional cache lines if they are not already in cache.
 | 
|---|
| 82 |     That case is common when keyword arguments are passed.
 | 
|---|
| 83 | 
 | 
|---|
| 84 | * Maximum dictionary load in PyDict_SetItem.  Currently set to 2/3.
 | 
|---|
| 85 |     Increasing this ratio makes dictionaries more dense resulting
 | 
|---|
| 86 |     in more collisions.  Decreasing it improves sparseness at the
 | 
|---|
| 87 |     expense of spreading entries over more cache lines and at the
 | 
|---|
| 88 |     cost of total memory consumed.
 | 
|---|
| 89 | 
 | 
|---|
| 90 |     The load test occurs in highly time sensitive code.  Efforts
 | 
|---|
| 91 |     to make the test more complex (for example, varying the load
 | 
|---|
| 92 |     for different sizes) have degraded performance.
 | 
|---|
| 93 | 
 | 
|---|
| 94 | * Growth rate upon hitting maximum load.  Currently set to *2.
 | 
|---|
| 95 |     Raising this to *4 results in half the number of resizes,
 | 
|---|
| 96 |     less effort to resize, better sparseness for some (but not
 | 
|---|
| 97 |     all dict sizes), and potentially doubles memory consumption
 | 
|---|
| 98 |     depending on the size of the dictionary.  Setting to *4
 | 
|---|
| 99 |     eliminates every other resize step.
 | 
|---|
| 100 | 
 | 
|---|
| 101 | * Maximum sparseness (minimum dictionary load).  What percentage
 | 
|---|
| 102 |     of entries can be unused before the dictionary shrinks to
 | 
|---|
| 103 |     free up memory and speed up iteration?  (The current CPython
 | 
|---|
| 104 |     code does not represent this parameter directly.)
 | 
|---|
| 105 | 
 | 
|---|
| 106 | * Shrinkage rate upon exceeding maximum sparseness.  The current
 | 
|---|
| 107 |     CPython code never even checks sparseness when deleting a
 | 
|---|
| 108 |     key.  When a new key is added, it resizes based on the number
 | 
|---|
| 109 |     of active keys, so that the addition may trigger shrinkage
 | 
|---|
| 110 |     rather than growth.
 | 
|---|
| 111 | 
 | 
|---|
| 112 | Tune-ups should be measured across a broad range of applications and
 | 
|---|
| 113 | use cases.  A change to any parameter will help in some situations and
 | 
|---|
| 114 | hurt in others.  The key is to find settings that help the most common
 | 
|---|
| 115 | cases and do the least damage to the less common cases.  Results will
 | 
|---|
| 116 | vary dramatically depending on the exact number of keys, whether the
 | 
|---|
| 117 | keys are all strings, whether reads or writes dominate, the exact
 | 
|---|
| 118 | hash values of the keys (some sets of values have fewer collisions than
 | 
|---|
| 119 | others).  Any one test or benchmark is likely to prove misleading.
 | 
|---|
| 120 | 
 | 
|---|
| 121 | While making a dictionary more sparse reduces collisions, it impairs
 | 
|---|
| 122 | iteration and key listing.  Those methods loop over every potential
 | 
|---|
| 123 | entry.  Doubling the size of dictionary results in twice as many
 | 
|---|
| 124 | non-overlapping memory accesses for keys(), items(), values(),
 | 
|---|
| 125 | __iter__(), iterkeys(), iteritems(), itervalues(), and update().
 | 
|---|
| 126 | Also, every dictionary iterates at least twice, once for the memset()
 | 
|---|
| 127 | when it is created and once by dealloc().
 | 
|---|
| 128 | 
 | 
|---|
| 129 | Dictionary operations involving only a single key can be O(1) unless 
 | 
|---|
| 130 | resizing is possible.  By checking for a resize only when the 
 | 
|---|
| 131 | dictionary can grow (and may *require* resizing), other operations
 | 
|---|
| 132 | remain O(1), and the odds of resize thrashing or memory fragmentation
 | 
|---|
| 133 | are reduced. In particular, an algorithm that empties a dictionary
 | 
|---|
| 134 | by repeatedly invoking .pop will see no resizing, which might
 | 
|---|
| 135 | not be necessary at all because the dictionary is eventually
 | 
|---|
| 136 | discarded entirely.
 | 
|---|
| 137 | 
 | 
|---|
| 138 | 
 | 
|---|
| 139 | Results of Cache Locality Experiments
 | 
|---|
| 140 | -------------------------------------
 | 
|---|
| 141 | 
 | 
|---|
| 142 | When an entry is retrieved from memory, 4.333 adjacent entries are also
 | 
|---|
| 143 | retrieved into a cache line.  Since accessing items in cache is *much*
 | 
|---|
| 144 | cheaper than a cache miss, an enticing idea is to probe the adjacent
 | 
|---|
| 145 | entries as a first step in collision resolution.  Unfortunately, the
 | 
|---|
| 146 | introduction of any regularity into collision searches results in more
 | 
|---|
| 147 | collisions than the current random chaining approach.
 | 
|---|
| 148 | 
 | 
|---|
| 149 | Exploiting cache locality at the expense of additional collisions fails
 | 
|---|
| 150 | to payoff when the entries are already loaded in cache (the expense
 | 
|---|
| 151 | is paid with no compensating benefit).  This occurs in small dictionaries
 | 
|---|
| 152 | where the whole dictionary fits into a pair of cache lines.  It also
 | 
|---|
| 153 | occurs frequently in large dictionaries which have a common access pattern
 | 
|---|
| 154 | where some keys are accessed much more frequently than others.  The
 | 
|---|
| 155 | more popular entries *and* their collision chains tend to remain in cache.
 | 
|---|
| 156 | 
 | 
|---|
| 157 | To exploit cache locality, change the collision resolution section
 | 
|---|
| 158 | in lookdict() and lookdict_string().  Set i^=1 at the top of the
 | 
|---|
| 159 | loop and move the  i = (i << 2) + i + perturb + 1 to an unrolled
 | 
|---|
| 160 | version of the loop.
 | 
|---|
| 161 | 
 | 
|---|
| 162 | This optimization strategy can be leveraged in several ways:
 | 
|---|
| 163 | 
 | 
|---|
| 164 | * If the dictionary is kept sparse (through the tunable parameters),
 | 
|---|
| 165 | then the occurrence of additional collisions is lessened.
 | 
|---|
| 166 | 
 | 
|---|
| 167 | * If lookdict() and lookdict_string() are specialized for small dicts
 | 
|---|
| 168 | and for largedicts, then the versions for large_dicts can be given
 | 
|---|
| 169 | an alternate search strategy without increasing collisions in small dicts
 | 
|---|
| 170 | which already have the maximum benefit of cache locality.
 | 
|---|
| 171 | 
 | 
|---|
| 172 | * If the use case for a dictionary is known to have a random key
 | 
|---|
| 173 | access pattern (as opposed to a more common pattern with a Zipf's law
 | 
|---|
| 174 | distribution), then there will be more benefit for large dictionaries
 | 
|---|
| 175 | because any given key is no more likely than another to already be
 | 
|---|
| 176 | in cache.
 | 
|---|
| 177 | 
 | 
|---|
| 178 | * In use cases with paired accesses to the same key, the second access
 | 
|---|
| 179 | is always in cache and gets no benefit from efforts to further improve
 | 
|---|
| 180 | cache locality.
 | 
|---|
| 181 | 
 | 
|---|
| 182 | Optimizing the Search of Small Dictionaries
 | 
|---|
| 183 | -------------------------------------------
 | 
|---|
| 184 | 
 | 
|---|
| 185 | If lookdict() and lookdict_string() are specialized for smaller dictionaries,
 | 
|---|
| 186 | then a custom search approach can be implemented that exploits the small
 | 
|---|
| 187 | search space and cache locality.
 | 
|---|
| 188 | 
 | 
|---|
| 189 | * The simplest example is a linear search of contiguous entries.  This is
 | 
|---|
| 190 |   simple to implement, guaranteed to terminate rapidly, never searches
 | 
|---|
| 191 |   the same entry twice, and precludes the need to check for dummy entries.
 | 
|---|
| 192 | 
 | 
|---|
| 193 | * A more advanced example is a self-organizing search so that the most
 | 
|---|
| 194 |   frequently accessed entries get probed first.  The organization
 | 
|---|
| 195 |   adapts if the access pattern changes over time.  Treaps are ideally
 | 
|---|
| 196 |   suited for self-organization with the most common entries at the
 | 
|---|
| 197 |   top of the heap and a rapid binary search pattern.  Most probes and
 | 
|---|
| 198 |   results are all located at the top of the tree allowing them all to
 | 
|---|
| 199 |   be located in one or two cache lines.
 | 
|---|
| 200 | 
 | 
|---|
| 201 | * Also, small dictionaries may be made more dense, perhaps filling all
 | 
|---|
| 202 |   eight cells to take the maximum advantage of two cache lines.
 | 
|---|
| 203 | 
 | 
|---|
| 204 | 
 | 
|---|
| 205 | Strategy Pattern
 | 
|---|
| 206 | ----------------
 | 
|---|
| 207 | 
 | 
|---|
| 208 | Consider allowing the user to set the tunable parameters or to select a
 | 
|---|
| 209 | particular search method.  Since some dictionary use cases have known
 | 
|---|
| 210 | sizes and access patterns, the user may be able to provide useful hints.
 | 
|---|
| 211 | 
 | 
|---|
| 212 | 1) For example, if membership testing or lookups dominate runtime and memory
 | 
|---|
| 213 |    is not at a premium, the user may benefit from setting the maximum load
 | 
|---|
| 214 |    ratio at 5% or 10% instead of the usual 66.7%.  This will sharply
 | 
|---|
| 215 |    curtail the number of collisions but will increase iteration time.
 | 
|---|
| 216 |    The builtin namespace is a prime example of a dictionary that can
 | 
|---|
| 217 |    benefit from being highly sparse.
 | 
|---|
| 218 | 
 | 
|---|
| 219 | 2) Dictionary creation time can be shortened in cases where the ultimate
 | 
|---|
| 220 |    size of the dictionary is known in advance.  The dictionary can be
 | 
|---|
| 221 |    pre-sized so that no resize operations are required during creation.
 | 
|---|
| 222 |    Not only does this save resizes, but the key insertion will go
 | 
|---|
| 223 |    more quickly because the first half of the keys will be inserted into
 | 
|---|
| 224 |    a more sparse environment than before.  The preconditions for this
 | 
|---|
| 225 |    strategy arise whenever a dictionary is created from a key or item
 | 
|---|
| 226 |    sequence and the number of *unique* keys is known.
 | 
|---|
| 227 | 
 | 
|---|
| 228 | 3) If the key space is large and the access pattern is known to be random,
 | 
|---|
| 229 |    then search strategies exploiting cache locality can be fruitful.
 | 
|---|
| 230 |    The preconditions for this strategy arise in simulations and
 | 
|---|
| 231 |    numerical analysis.
 | 
|---|
| 232 | 
 | 
|---|
| 233 | 4) If the keys are fixed and the access pattern strongly favors some of
 | 
|---|
| 234 |    the keys, then the entries can be stored contiguously and accessed
 | 
|---|
| 235 |    with a linear search or treap.  This exploits knowledge of the data,
 | 
|---|
| 236 |    cache locality, and a simplified search routine.  It also eliminates
 | 
|---|
| 237 |    the need to test for dummy entries on each probe.  The preconditions
 | 
|---|
| 238 |    for this strategy arise in symbol tables and in the builtin dictionary.
 | 
|---|
| 239 | 
 | 
|---|
| 240 | 
 | 
|---|
| 241 | Readonly Dictionaries
 | 
|---|
| 242 | ---------------------
 | 
|---|
| 243 | Some dictionary use cases pass through a build stage and then move to a
 | 
|---|
| 244 | more heavily exercised lookup stage with no further changes to the
 | 
|---|
| 245 | dictionary.
 | 
|---|
| 246 | 
 | 
|---|
| 247 | An idea that emerged on python-dev is to be able to convert a dictionary
 | 
|---|
| 248 | to a read-only state.  This can help prevent programming errors and also
 | 
|---|
| 249 | provide knowledge that can be exploited for lookup optimization.
 | 
|---|
| 250 | 
 | 
|---|
| 251 | The dictionary can be immediately rebuilt (eliminating dummy entries),
 | 
|---|
| 252 | resized (to an appropriate level of sparseness), and the keys can be
 | 
|---|
| 253 | jostled (to minimize collisions).  The lookdict() routine can then
 | 
|---|
| 254 | eliminate the test for dummy entries (saving about 1/4 of the time
 | 
|---|
| 255 | spent in the collision resolution loop).
 | 
|---|
| 256 | 
 | 
|---|
| 257 | An additional possibility is to insert links into the empty spaces
 | 
|---|
| 258 | so that dictionary iteration can proceed in len(d) steps instead of
 | 
|---|
| 259 | (mp->mask + 1) steps.  Alternatively, a separate tuple of keys can be
 | 
|---|
| 260 | kept just for iteration.
 | 
|---|
| 261 | 
 | 
|---|
| 262 | 
 | 
|---|
| 263 | Caching Lookups
 | 
|---|
| 264 | ---------------
 | 
|---|
| 265 | The idea is to exploit key access patterns by anticipating future lookups
 | 
|---|
| 266 | based on previous lookups.
 | 
|---|
| 267 | 
 | 
|---|
| 268 | The simplest incarnation is to save the most recently accessed entry.
 | 
|---|
| 269 | This gives optimal performance for use cases where every get is followed
 | 
|---|
| 270 | by a set or del to the same key.
 | 
|---|