<HTML>
<HEAD>
<TITLE>Garbage collector scalability</TITLE>
</HEAD>
<BODY>
<H1>Garbage collector scalability</h1>
In its default configuration, the Boehm-Demers-Weiser garbage collector
is not thread-safe. It can be made thread-safe for a number of environments
by building the collector with the appropriate
<TT>-D</tt><I>XXX</i><TT>_THREADS</tt> compilation
flag. This has primarily two effects:
<OL>
<LI> It causes the garbage collector to stop all other threads when
it needs to see a consistent memory state.
<LI> It causes the collector to acquire a lock around essentially all
allocation and garbage collection activity.
</ol>
Since a single lock is used for all allocation-related activity, only one
thread can be allocating or collecting at any given time. This inherently
limits the performance of multi-threaded applications on multiprocessors.
<P>
On most platforms, the allocator/collector lock is implemented as a
spin lock with exponential back-off. Longer wait times are implemented
by yielding and/or sleeping. If a collection is in progress, the pure
spinning stage is skipped. This has the advantage that uncontested lock
acquisitions, and thus most lock acquisitions on a uniprocessor, are very
cheap. It has the disadvantage that the application may sleep for small
periods of time even when there is work to be done, and threads may be
unnecessarily woken up for short periods. Nonetheless, this scheme
empirically outperforms native queue-based mutual exclusion
implementations in most cases, sometimes drastically so.
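<P>
The following is a minimal sketch of this style of lock acquisition, written
with C11 atomics. The names, spin bound, and sleep interval are purely
illustrative and do not reflect the collector's actual code or tuning:
<PRE>
#include &lt;sched.h>       /* sched_yield */
#include &lt;stdatomic.h>   /* atomic_flag */
#include &lt;time.h>        /* nanosleep */

static atomic_flag gc_alloc_lock = ATOMIC_FLAG_INIT;

void acquire_alloc_lock(void)
{
    unsigned pause = 1;
    volatile unsigned i;

    /* Pure spinning stage, roughly doubling the back-off each time. */
    while (pause &lt;= 1024) {
        if (!atomic_flag_test_and_set(&amp;gc_alloc_lock)) return;
        for (i = 0; i &lt; pause; i++) { }    /* busy-wait back-off */
        pause *= 2;
    }
    /* Longer waits: yield the processor, then sleep briefly. */
    for (;;) {
        if (!atomic_flag_test_and_set(&amp;gc_alloc_lock)) return;
        sched_yield();
        if (!atomic_flag_test_and_set(&amp;gc_alloc_lock)) return;
        struct timespec ts = { 0, 1000000 };    /* 1 ms */
        nanosleep(&amp;ts, 0);
    }
}

void release_alloc_lock(void)
{
    atomic_flag_clear(&amp;gc_alloc_lock);
}
</pre>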
|
<H2>Options for enhanced scalability</h2>
Version 6.0 of the collector adds two facilities to enhance collector
scalability on multiprocessors. As of 6.0alpha1, these are supported
only under Linux on X86 and IA64 processors, though ports to other
otherwise supported Pthreads platforms should be straightforward.
They are intended to be used together.
<UL>
<LI>
Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to
run the mark phase in parallel in multiple threads, and thus on multiple
processors. The mark phase typically consumes the large majority of the
collection time. Thus this largely parallelizes the garbage collector
itself, though not the allocation process. Currently the marking is
performed by the thread that triggered the collection, together with
<I>N</i>-1 dedicated
threads, where <I>N</i> is the number of processors detected by the collector.
The dedicated threads are created once at initialization time.
<P>
A second effect of this flag is to switch to a more concurrent
implementation of <TT>GC_malloc_many</tt>, so that free lists can be
built, and memory can be cleared, by more than one thread concurrently
(see the sketch following this list).
<LI>
Building the collector with <TT>-DTHREAD_LOCAL_ALLOC</tt> adds support for
thread-local allocation. It does not, by itself, cause thread-local
allocation to be used. It simply allows the use of the interface in
<TT>gc_local_alloc.h</tt>.
<P>
Memory returned from thread-local allocators is completely interchangeable
with that returned by the standard allocators. It may be used by other
threads. The only difference is that, if the thread allocates enough
memory of a certain kind, it will build a thread-local free list for
objects of that kind, and allocate from that. This greatly reduces
locking. The thread-local free lists are refilled using
<TT>GC_malloc_many</tt>.
<P>
An important side effect of this flag is to replace the default
spin-then-sleep lock with a spin-then-queue based implementation.
This <I>reduces performance</i> for the standard allocation functions,
though it usually improves performance when thread-local allocation is
used heavily, since that greatly reduces the number of short-duration
lock acquisitions.
</ul>
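<P>
As an illustration of <TT>GC_malloc_many</tt>, a thread can refill a private
free list with a single call and then hand out objects without further
locking. This sketch assumes the <TT>GC_NEXT</tt> accessor declared
alongside <TT>GC_malloc_many</tt> in <TT>gc.h</tt>, which reads or writes
the list link stored in the first word of each returned object; the object
size and buffer are illustrative:
<PRE>
#include "gc.h"

/* Build a batch of 32-byte objects with one call, then consume them   */
/* without touching the allocation lock again.  The list head lives on */
/* this thread's stack, which the collector scans for roots.           */
void fill_buffer(void **buffer, int n)
{
    void *free_list = 0;
    int i;

    for (i = 0; i &lt; n; i++) {
        if (free_list == 0) {
            free_list = GC_malloc_many(32);   /* one call builds a whole list */
            if (free_list == 0) return;       /* allocation failed            */
        }
        buffer[i] = free_list;
        free_list = GC_NEXT(free_list);       /* unlink the first object   */
        GC_NEXT(buffer[i]) = 0;               /* clear the stale link word */
    }
}
</pre>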
|
<P>
The easiest way to switch an application to thread-local allocation is to
<OL>
<LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>,
and then include the <TT>gc.h</tt>
header in each client source file.
<LI> Invoke <TT>GC_thr_init()</tt> before any allocation.
<LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>,
and/or <TT>GC_GCJ_MALLOC</tt>.
</ol>
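<P>
A minimal client following these three steps might look like the sketch
below. The thread body, object size, and iteration count are purely
illustrative; defining <TT>GC_LINUX_THREADS</tt> before <TT>gc.h</tt> is
assumed here so that the thread primitives are also intercepted and both
threads are registered with the collector, and the file is assumed to be
compiled with the platform's usual thread options (<I>e.g.</i>
<TT>-pthread</tt>):
<PRE>
/* Step 1: redirect GC_MALLOC and friends to the thread-local versions. */
#define GC_REDIRECT_TO_LOCAL
#define GC_LINUX_THREADS        /* also register threads with the collector */
#include "gc.h"

#include &lt;pthread.h>

static void *worker(void *arg)
{
    void **list = 0;
    int i;

    (void)arg;
    /* After enough allocations of this size, this thread builds its own */
    /* free list and allocates from it with almost no locking.           */
    for (i = 0; i &lt; 1000000; i++) {
        void **cell = (void **)GC_MALLOC(2 * sizeof(void *));
        cell[0] = (void *)list;
        list = cell;
    }
    return list;
}

int main(void)
{
    pthread_t t1, t2;

    GC_thr_init();    /* Step 2: initialize before any allocation. */

    /* Step 3: allocate with GC_MALLOC and friends as usual. */
    pthread_create(&amp;t1, 0, worker, 0);
    pthread_create(&amp;t2, 0, worker, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}
</pre>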
|
<H2>The Parallel Marking Algorithm</h2>
We use an algorithm similar to
<A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by
Endo, Taura, and Yonezawa</a> at the University of Tokyo.
However, the data structures and implementation are different,
and represent a smaller change to the original collector source,
probably at the expense of extreme scalability. Some of
the refinements they suggest, <I>e.g.</i> splitting large
objects, were also incorporated into our approach.
<P>
The global mark stack is transformed into a global work queue.
Unlike the usual case, it never shrinks during a mark phase.
The mark threads remove objects from the queue by copying them to a
local mark stack and changing the global descriptor to zero, indicating
that there is no more work to be done for this entry.
This removal is done with no synchronization. Thus it is possible for
more than one worker to remove the same entry, resulting in some work
duplication.
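<P>
The following sketch conveys the idea; the entry layout and names are
illustrative, not the collector's actual data structures:
<PRE>
/* Illustrative layout: each global queue entry holds an object base    */
/* address and a descriptor word; a zero descriptor means the entry has */
/* already been claimed (or was never filled in).                       */
struct mark_entry {
    void *base;            /* start of the object to be scanned       */
    unsigned long descr;   /* how to scan it; 0 means "nothing to do" */
};

/* Copy one global entry to the local mark stack with no synchronization. */
/* Two markers may both copy the same entry before either clears it, so   */
/* an object may occasionally be scanned twice; since marking an object   */
/* twice is harmless, this only costs a little duplicated work.           */
int take_entry(struct mark_entry *global, struct mark_entry *local_top)
{
    struct mark_entry copy = *global;   /* unsynchronized read             */
    if (copy.descr == 0) return 0;      /* already claimed by someone else */
    *local_top = copy;                  /* push onto the local mark stack  */
    global->descr = 0;                  /* mark the global entry as done   */
    return 1;
}
</pre>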
|
<P>
The global work queue grows only if a marker thread decides to
return some of its local mark stack to the global one. This
is done if the global queue appears to be running low, or if
the local stack is in danger of overflowing. It does require
synchronization, but should be relatively rare.
<P>
The sequential marking code is reused to process local mark stacks.
Hence the amount of additional code required for parallel marking
is minimal.
<P>
It should be possible to use generational collection in the presence of the
parallel collector, by calling <TT>GC_enable_incremental()</tt>.
This does not result in fully incremental collection, since parallel mark
phases cannot currently be interrupted, and doing so may be too
expensive.
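<P>
For example, a client that wants generational collection together with the
parallel marker needs only one extra call at startup; <TT>GC_INIT()</tt> is
the usual initialization macro from <TT>gc.h</tt>:
<PRE>
#include "gc.h"

int main(void)
{
    GC_INIT();                  /* initialize the collector        */
    GC_enable_incremental();    /* request generational collection */
    /* ... create threads and allocate as usual ... */
    return 0;
}
</pre>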
|
<P>
Gcj-style mark descriptors do not currently mix with the combination
of local allocation and incremental collection. They should work correctly
with one or the other, but not both.
<P>
The number of marker threads is set on startup to the number of
available processors (or to the value of the <TT>GC_NPROCS</tt>
environment variable). If only a single processor is detected,
parallel marking is disabled.
<P>
Note that setting <TT>GC_NPROCS</tt> to 1 also causes some lock acquisitions
inside the collector to immediately yield the processor instead of
busy-waiting first. In the case of a multiprocessor and a client with
multiple simultaneously runnable threads, this may have disastrous
performance consequences (e.g. a factor of 10 slowdown).
<H2>Performance</h2>
We conducted some simple experiments with a version of
<A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to
run multiple concurrent client threads in the same address space.
Each client thread does the same work as the original benchmark, but they share
a heap.
This benchmark involves very little work outside of memory allocation.
This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine
under Linux 2.2.12.
<P>
Running with a thread-unsafe collector, the benchmark ran in 9
seconds. With the simple thread-safe collector,
built with <TT>-DLINUX_THREADS</tt>, the execution time
increased to 10.3 seconds, or 23.5 elapsed seconds with two clients.
(The times for the <TT>malloc</tt>/<TT>free</tt> version
with glibc <TT>malloc</tt>
are 10.51 (standard library, pthreads not linked),
20.90 (one thread, pthreads linked),
and 24.55 seconds respectively. The benchmark favors a
garbage collector, since most objects are small.)
<P>
The following table gives execution times for the collector built
with parallel marking and thread-local allocation support
(<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>). We tested
the client using either one or two marker threads, and running
one or two client threads. Note that the client uses thread-local
allocation exclusively. With <TT>-DTHREAD_LOCAL_ALLOC</tt> the collector
switches to a locking strategy that is better tuned to less frequent
lock acquisition. The standard allocation primitives thus perform
slightly worse than without <TT>-DTHREAD_LOCAL_ALLOC</tt>, and should be
avoided in time-critical code.
<P>
(The results using <TT>pthread_mutex_lock</tt>
directly for allocation locking would have been worse still, at
least for older versions of linuxthreads.
With <TT>THREAD_LOCAL_ALLOC</tt>, we first repeatedly try to acquire the
lock with <TT>pthread_mutex_trylock()</tt>, busy-waiting between attempts.
After a fixed number of attempts, we use <TT>pthread_mutex_lock()</tt>.)
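<P>
A sketch of that spin-then-block acquisition strategy is shown below; the
retry count and pause length are illustrative, not the collector's actual
tuning:
<PRE>
#include &lt;pthread.h>

static pthread_mutex_t alloc_ml = PTHREAD_MUTEX_INITIALIZER;

#define SPIN_TRIES 10      /* illustrative bound on trylock attempts */
#define PAUSE      1000    /* illustrative busy-wait length          */

void lock_allocator(void)
{
    int i;
    volatile int j;

    /* Try the lock a bounded number of times, busy-waiting between */
    /* attempts, so an uncontended acquisition never blocks.        */
    for (i = 0; i &lt; SPIN_TRIES; i++) {
        if (pthread_mutex_trylock(&amp;alloc_ml) == 0) return;
        for (j = 0; j &lt; PAUSE; j++) { }    /* brief busy-wait */
    }
    /* Give up spinning and block on the mutex's wait queue. */
    pthread_mutex_lock(&amp;alloc_ml);
}

void unlock_allocator(void)
{
    pthread_mutex_unlock(&amp;alloc_ml);
}
</pre>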
|
<P>
These measurements do not use incremental collection, nor was prefetching
enabled in the marker. We used the C version of the benchmark.
All measurements are in elapsed seconds on an unloaded machine.
<P>
<TABLE BORDER ALIGN="CENTER">
<TR><TH>Number of client threads</th><TH>1 marker thread (secs.)</th>
<TH>2 marker threads (secs.)</th></tr>
<TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td></tr>
<TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td></tr>
</table>
<P>
The execution time for the single-threaded case is slightly worse than with
simple locking. However, even the single-threaded benchmark runs faster than
the thread-unsafe version if a second processor is available.
The execution time for two clients with thread-local allocation is
only 1.4 times the sequential execution time for a single thread in a
thread-unsafe environment, even though it involves twice the client work.
That represents close to a
factor of 2 improvement over the 2 client case with the old collector.
The old collector clearly
still suffered from some contention overhead, in spite of the fact that the
locking scheme had been fairly well tuned.
<P>
Full linear speedup (i.e. the same execution time for 1 client on one
processor as for 2 clients on 2 processors)
is probably not achievable on this kind of
hardware even with such a small number of processors,
since the memory system is
a major constraint for the garbage collector: the processors usually
share a single memory bus, and thus the aggregate memory bandwidth does
not increase in proportion to the number of processors.
<P>
These results are likely to be very sensitive to both hardware and OS
issues. Preliminary experiments with an older Pentium Pro machine running
an older kernel were far less encouraging.

</body>
</html>