NOTES ON OPTIMIZING DICTIONARIES
================================


Principal Use Cases for Dictionaries
------------------------------------

Passing keyword arguments
    Typically, one read and one write for 1 to 3 elements.
    Occurs frequently in normal Python code.

Class method lookup
    Dictionaries vary in size with 8 to 16 elements being common.
    Usually written once with many lookups.
    When base classes are used, there are many failed lookups
    followed by a lookup in a base class.

Instance attribute lookup and Global variables
    Dictionaries vary in size.  4 to 10 elements are common.
    Both reads and writes are common.

Builtins
    Frequent reads.  Almost never written.
    Holds 126 interned strings (as of Py2.3b1).
    A few keys are accessed much more frequently than others.

Uniquification
    Dictionaries of any size.  Bulk of work is in creation.
    Repeated writes to a smaller set of keys.
    Single read of each key.
    Some use cases have two consecutive accesses to the same key.

    * Removing duplicates from a sequence.
        dict.fromkeys(seqn).keys()

    * Counting elements in a sequence.
        for e in seqn:
            d[e] = d.get(e,0) + 1

    * Accumulating references in a dictionary of lists:

        for pagenumber, page in enumerate(pages):
            for word in page:
                d.setdefault(word, []).append(pagenumber)

    Note, the second example is a use case characterized by a get and set
    to the same key.  There are similar use cases with a __contains__
    followed by a get, set, or del to the same key.  Part of the
    justification for d.setdefault is combining the two lookups into one.

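The paired-lookup pattern can be sketched directly; the contrast below is illustrative (the real savings occur in C, where each probe is a full hash-table search):

```python
d = {}
# Two probes of the same key: a membership test followed by a set.
if 'word' not in d:
    d['word'] = []
d['word'].append(1)        # and a third probe for the append

# setdefault combines the test-and-insert into a single lookup.
d2 = {}
d2.setdefault('word', []).append(1)

assert d == d2 == {'word': [1]}
```
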
Membership Testing
    Dictionaries of any size.  Created once and then rarely changed.
    Single write to each key.
    Many calls to __contains__() or has_key().
    Similar access patterns occur with replacement dictionaries
    such as with the % formatting operator.

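A small sketch of the build-once, probe-many pattern (in modern code a set would usually serve pure membership; dict.fromkeys matches the text above):

```python
# Build once; later accesses are all membership tests.
reserved = dict.fromkeys(['and', 'or', 'not', 'if', 'else'])

assert 'if' in reserved
assert 'spam' not in reserved

# A replacement dictionary for the % operator follows the same pattern:
# a single build phase, then read-only lookups while formatting.
subs = {'name': 'dictobject', 'slots': 8}
line = '%(name)s starts with %(slots)d slots' % subs
assert line == 'dictobject starts with 8 slots'
```
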
Dynamic Mappings
    Characterized by deletions interspersed with adds and replacements.
    Performance benefits greatly from the re-use of dummy entries.


Data Layout (assuming a 32-bit box with 64 bytes per cache line)
----------------------------------------------------------------

Small dicts (8 entries) are attached to the dictobject structure
and the whole group nearly fills two consecutive cache lines.

Larger dicts use the first half of the dictobject structure (one cache
line) and a separate, contiguous block of entries (at 12 bytes each
for a total of 5.333 entries per cache line).

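The arithmetic behind those figures, assuming 12-byte entries (a hash slot plus key and value pointers, 4 bytes each on a 32-bit box):

```python
CACHE_LINE = 64   # bytes per cache line, per the assumption above
ENTRY_SIZE = 12   # bytes per entry: hash field, key pointer, value pointer

# Entries that fit in one cache line:
assert round(CACHE_LINE / ENTRY_SIZE, 3) == 5.333

# A block of 8 entries (96 bytes) spans 1.5 cache lines:
assert 8 * ENTRY_SIZE / CACHE_LINE == 1.5
```
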
Tunable Dictionary Parameters
-----------------------------

* PyDict_MINSIZE.  Currently set to 8.
    Must be a power of two.  New dicts have to zero-out every cell.
    Each additional 8 entries consumes 1.5 cache lines.  Increasing
    the size improves the sparseness of small dictionaries but costs
    time to read in the additional cache lines if they are not already
    in cache.  That case is common when keyword arguments are passed.

* Maximum dictionary load in PyDict_SetItem.  Currently set to 2/3.
    Increasing this ratio makes dictionaries more dense, resulting
    in more collisions.  Decreasing it improves sparseness at the
    expense of spreading entries over more cache lines and at the
    cost of total memory consumed.

    The load test occurs in highly time-sensitive code.  Efforts
    to make the test more complex (for example, varying the load
    for different sizes) have degraded performance.

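A sketch of the load test, assuming it is the cheap C condition fill*3 >= size*2 (where fill counts active plus dummy entries):

```python
def must_grow(fill, table_size):
    # Deliberately cheap: one multiply and one compare, since this
    # test sits in highly time-sensitive code.
    return fill * 3 >= table_size * 2

# An 8-slot table holds 5 entries; the 6th crosses the 2/3 threshold.
assert not must_grow(5, 8)
assert must_grow(6, 8)
```
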
* Growth rate upon hitting maximum load.  Currently set to *2.
    Raising this to *4 results in half the number of resizes,
    less effort to resize, better sparseness for some (but not
    all) dict sizes, and potentially doubled memory consumption
    depending on the size of the dictionary.  Setting it to *4
    eliminates every other resize step.

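A rough model of the resize count for different growth factors; the simulation assumes growth by a fixed multiple of the table size, which simplifies the real resize logic:

```python
def count_resizes(n_keys, growth):
    # Simplified model: grow by a fixed factor each time the 2/3
    # load test trips while filling a fresh dictionary.
    size, resizes = 8, 0
    for fill in range(1, n_keys + 1):
        if fill * 3 >= size * 2:
            size *= growth
            resizes += 1
    return resizes

doubling = count_resizes(10000, 2)
quadrupling = count_resizes(10000, 4)
assert quadrupling < doubling      # roughly half as many resize steps
```
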
* Maximum sparseness (minimum dictionary load).  What percentage
    of entries can be unused before the dictionary shrinks to
    free up memory and speed up iteration?  (The current CPython
    code does not represent this parameter directly.)

* Shrinkage rate upon exceeding maximum sparseness.  The current
    CPython code never even checks sparseness when deleting a
    key.  When a new key is added, it resizes based on the number
    of active keys, so that the addition may trigger shrinkage
    rather than growth.

Tune-ups should be measured across a broad range of applications and
use cases.  A change to any parameter will help in some situations and
hurt in others.  The key is to find settings that help the most common
cases and do the least damage to the less common cases.  Results will
vary dramatically depending on the exact number of keys, whether the
keys are all strings, whether reads or writes dominate, and the exact
hash values of the keys (some sets of values have fewer collisions than
others).  Any one test or benchmark is likely to prove misleading.

While making a dictionary more sparse reduces collisions, it impairs
iteration and key listing.  Those methods loop over every potential
entry.  Doubling the size of the dictionary results in twice as many
non-overlapping memory accesses for keys(), items(), values(),
__iter__(), iterkeys(), iteritems(), itervalues(), and update().
Also, every dictionary iterates at least twice: once for the memset()
when it is created and once by dealloc().

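The effect on iteration can be sketched numerically; the helper below picks the smallest power-of-two table keeping a given load, which is exactly the quantity a full iteration must scan:

```python
def table_slots(n_keys, max_load):
    # Smallest power-of-two table keeping the load at or below max_load.
    size = 8
    while n_keys > max_load * size:
        size *= 2
    return size

# Iteration visits every slot, so halving the load doubles the scan:
assert table_slots(1000, 2 / 3) == 2048
assert table_slots(1000, 1 / 3) == 4096
```
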
Dictionary operations involving only a single key can be O(1) unless
resizing is possible.  By checking for a resize only when the
dictionary can grow (and may *require* resizing), other operations
remain O(1), and the odds of resize thrashing or memory fragmentation
are reduced.  In particular, an algorithm that empties a dictionary
by repeatedly invoking .pop will see no resizing, which might
not be necessary at all because the dictionary is eventually
discarded entirely.


Results of Cache Locality Experiments
-------------------------------------

When an entry is retrieved from memory, 4.333 adjacent entries are also
retrieved into a cache line.  Since accessing items in cache is *much*
cheaper than a cache miss, an enticing idea is to probe the adjacent
entries as a first step in collision resolution.  Unfortunately, the
introduction of any regularity into collision searches results in more
collisions than the current random chaining approach.

Exploiting cache locality at the expense of additional collisions fails
to pay off when the entries are already loaded in cache (the expense
is paid with no compensating benefit).  This occurs in small dictionaries
where the whole dictionary fits into a pair of cache lines.  It also
occurs frequently in large dictionaries which have a common access pattern
where some keys are accessed much more frequently than others.  The
more popular entries *and* their collision chains tend to remain in cache.

To exploit cache locality, change the collision resolution section
in lookdict() and lookdict_string().  Set i^=1 at the top of the
loop and move the i = (i << 2) + i + perturb + 1 to an unrolled
version of the loop.

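A sketch of the resulting probe order, using CPython's recurrence (with PERTURB_SHIFT of 5) plus the proposed i^1 pairing; this models the experiment described above, not shipped code:

```python
PERTURB_SHIFT = 5

def paired_probes(h, mask, rounds=4):
    # For each step of the normal recurrence, also try the adjacent
    # slot i ^ 1, which usually shares the same cache line.
    i, perturb, slots = h & mask, h, []
    for _ in range(rounds):
        slots.append(i)
        slots.append(i ^ 1)
        i = ((i << 2) + i + perturb + 1) & mask
        perturb >>= PERTURB_SHIFT
    return slots

seq = paired_probes(12345, mask=31)
assert seq[0] == 12345 & 31        # initial slot
assert seq[1] == seq[0] ^ 1        # its cache-line neighbor
```
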
This optimization strategy can be leveraged in several ways:

* If the dictionary is kept sparse (through the tunable parameters),
  then the occurrence of additional collisions is lessened.

* If lookdict() and lookdict_string() are specialized for small dicts
  and for large dicts, then the versions for large dicts can be given
  an alternate search strategy without increasing collisions in small dicts,
  which already have the maximum benefit of cache locality.

* If the use case for a dictionary is known to have a random key
  access pattern (as opposed to a more common pattern with a Zipf's law
  distribution), then there will be more benefit for large dictionaries
  because any given key is no more likely than another to already be
  in cache.

* In use cases with paired accesses to the same key, the second access
  is always in cache and gets no benefit from efforts to further improve
  cache locality.

Optimizing the Search of Small Dictionaries
-------------------------------------------

If lookdict() and lookdict_string() are specialized for smaller dictionaries,
then a custom search approach can be implemented that exploits the small
search space and cache locality.

* The simplest example is a linear search of contiguous entries.  This is
  simple to implement, guaranteed to terminate rapidly, never searches
  the same entry twice, and precludes the need to check for dummy entries.

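A sketch of such a linear search over a compact entry list (illustrative Python; the real candidate would be a C specialization):

```python
def small_lookup(entries, key):
    # Contiguous scan: terminates within len(entries) steps, never
    # revisits a slot, and has no dummy entries to skip (deletions
    # would compact the list instead of leaving holes).
    for k, v in entries:
        if k == key:
            return v
    raise KeyError(key)

entries = [('x', 1), ('y', 2), ('z', 3)]
assert small_lookup(entries, 'y') == 2
```
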
* A more advanced example is a self-organizing search, so that the most
  frequently accessed entries get probed first.  The organization
  adapts if the access pattern changes over time.  Treaps are ideally
  suited for self-organization, with the most common entries at the
  top of the heap and a rapid binary search pattern.  Most probes and
  results are located at the top of the tree, allowing them all to
  be found in one or two cache lines.

* Also, small dictionaries may be made more dense, perhaps filling all
  eight cells to take maximum advantage of two cache lines.


Strategy Pattern
----------------

Consider allowing the user to set the tunable parameters or to select a
particular search method.  Since some dictionary use cases have known
sizes and access patterns, the user may be able to provide useful hints.

1) For example, if membership testing or lookups dominate runtime and memory
   is not at a premium, the user may benefit from setting the maximum load
   ratio at 5% or 10% instead of the usual 66.7%.  This will sharply
   curtail the number of collisions but will increase iteration time.
   The builtin namespace is a prime example of a dictionary that can
   benefit from being highly sparse.

2) Dictionary creation time can be shortened in cases where the ultimate
   size of the dictionary is known in advance.  The dictionary can be
   pre-sized so that no resize operations are required during creation.
   Not only does this save resizes, but key insertion will go
   more quickly because the first half of the keys will be inserted into
   a more sparse environment than before.  The preconditions for this
   strategy arise whenever a dictionary is created from a key or item
   sequence and the number of *unique* keys is known.

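A sketch of the pre-sizing computation under the 2/3 load rule (CPython exposes no public API for pre-sizing, so the function only estimates the table that would avoid all resizes):

```python
def presized_table(n_unique):
    # Smallest power-of-two table that accepts n_unique keys without
    # tripping the 2/3 load test (fill * 3 >= size * 2).
    size = 8
    while n_unique * 3 >= size * 2:
        size *= 2
    return size

assert presized_table(5) == 8        # small dicts need no resize at all
assert presized_table(1000) == 2048  # one allocation instead of several
```
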
3) If the key space is large and the access pattern is known to be random,
   then search strategies exploiting cache locality can be fruitful.
   The preconditions for this strategy arise in simulations and
   numerical analysis.

4) If the keys are fixed and the access pattern strongly favors some of
   the keys, then the entries can be stored contiguously and accessed
   with a linear search or treap.  This exploits knowledge of the data,
   cache locality, and a simplified search routine.  It also eliminates
   the need to test for dummy entries on each probe.  The preconditions
   for this strategy arise in symbol tables and in the builtin dictionary.


Readonly Dictionaries
---------------------
Some dictionary use cases pass through a build stage and then move to a
more heavily exercised lookup stage with no further changes to the
dictionary.

An idea that emerged on python-dev is to be able to convert a dictionary
to a read-only state.  This can help prevent programming errors and also
provide knowledge that can be exploited for lookup optimization.

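A minimal Python-level sketch of the idea (hypothetical; the python-dev proposal targeted the C implementation, where freezing is what unlocks the lookup optimizations):

```python
class ReadOnlyDict(dict):
    # Hypothetical wrapper: freezes the mapping after its build stage.
    def _readonly(self, *args, **kwargs):
        raise TypeError('dictionary is read-only')
    __setitem__ = __delitem__ = _readonly
    clear = pop = popitem = setdefault = update = _readonly

d = ReadOnlyDict(red=1, green=2)
assert d['red'] == 1               # lookups still work
try:
    d['blue'] = 3                  # writes are rejected
except TypeError:
    pass
else:
    raise AssertionError('write should have been blocked')
```
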
The dictionary can be immediately rebuilt (eliminating dummy entries),
resized (to an appropriate level of sparseness), and the keys can be
jostled (to minimize collisions).  The lookdict() routine can then
eliminate the test for dummy entries (saving about 1/4 of the time
spent in the collision resolution loop).

An additional possibility is to insert links into the empty spaces
so that dictionary iteration can proceed in len(d) steps instead of
(mp->mask + 1) steps.  Alternatively, a separate tuple of keys can be
kept just for iteration.


Caching Lookups
---------------
The idea is to exploit key access patterns by anticipating future lookups
based on previous lookups.

The simplest incarnation is to save the most recently accessed entry.
This gives optimal performance for use cases where every get is followed
by a set or del to the same key.

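A sketch of that simplest incarnation at the Python level (illustrative; a C version would cache the slot index rather than the value):

```python
class CachingDict(dict):
    # Remembers the most recently accessed entry so an immediately
    # following access to the same key skips the table probe.
    def __init__(self, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
        self._last = None                  # (key, value) of last hit

    def __getitem__(self, key):
        if self._last is not None and self._last[0] == key:
            return self._last[1]           # served from the cache
        value = dict.__getitem__(self, key)
        self._last = (key, value)
        return value

    def __setitem__(self, key, value):
        dict.__setitem__(self, key, value)
        self._last = (key, value)          # keep the cache coherent

d = CachingDict(a=1)
assert d['a'] == 1     # first access probes the table and fills the cache
d['a'] = 2             # a set to the same key updates the cached entry
assert d['a'] == 2     # second access is served without a probe
```
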