| 1 | ptmalloc3 - a multi-thread malloc implementation | 
|---|
| 2 | ================================================ | 
|---|
| 3 |  | 
|---|
| 4 | Wolfram Gloger (wg@malloc.de) | 
|---|
| 5 |  | 
|---|
| 6 | Jan 2006 | 
|---|
| 7 |  | 
|---|
| 8 |  | 
|---|
| 9 | Thanks | 
|---|
| 10 | ====== | 
|---|
| 11 |  | 
|---|
| 12 | This release was partly funded by Pixar Animation Studios.  I would | 
|---|
| 13 | like to thank David Baraff of Pixar for his support and Doug Lea | 
|---|
| 14 | (dl@cs.oswego.edu) for the great original malloc implementation. | 
|---|
| 15 |  | 
|---|
| 16 |  | 
|---|
| 17 | Introduction | 
|---|
| 18 | ============ | 
|---|
| 19 |  | 
|---|
| 20 | This package is a modified version of Doug Lea's malloc-2.8.3 | 
|---|
| 21 | implementation (available seperately from ftp://g.oswego.edu/pub/misc) | 
|---|
| 22 | that I adapted for multiple threads, while trying to avoid lock | 
|---|
| 23 | contention as much as possible. | 
|---|
| 24 |  | 
|---|
| 25 | As part of the GNU C library, the source files may be available under | 
|---|
| 26 | the GNU Library General Public License (see the comments in the | 
|---|
| 27 | files).  But as part of this stand-alone package, the code is also | 
|---|
| 28 | available under the (probably less restrictive) conditions described | 
|---|
| 29 | in the file 'COPYRIGHT'.  In any case, there is no warranty whatsoever | 
|---|
| 30 | for this package. | 
|---|
| 31 |  | 
|---|
| 32 | The current distribution should be available from: | 
|---|
| 33 |  | 
|---|
| 34 | http://www.malloc.de/malloc/ptmalloc3.tar.gz | 
|---|
| 35 |  | 
|---|
| 36 |  | 
|---|
| 37 | Compilation | 
|---|
| 38 | =========== | 
|---|
| 39 |  | 
|---|
| 40 | It should be possible to build ptmalloc3 on any UN*X-like system that | 
|---|
| 41 | implements the sbrk(), mmap(), munmap() and mprotect() calls.  Since | 
|---|
| 42 | there are now several source files, a library (libptmalloc3.a) is | 
|---|
| 43 | generated.  See the Makefile for examples of the compile-time options. | 
|---|
| 44 |  | 
|---|
| 45 | Note that support for non-ANSI compilers is no longer there. | 
|---|
| 46 |  | 
|---|
| 47 | Several example targets are provided in the Makefile: | 
|---|
| 48 |  | 
|---|
| 49 | o Posix threads (pthreads), compile with "make posix" | 
|---|
| 50 |  | 
|---|
| 51 | o Posix threads with explicit initialization, compile with | 
|---|
| 52 | "make posix-explicit" (known to be required on HPUX) | 
|---|
| 53 |  | 
|---|
| 54 | o Posix threads without "tsd data hack" (see below), compile with | 
|---|
| 55 | "make posix-with-tsd" | 
|---|
| 56 |  | 
|---|
| 57 | o Solaris threads, compile with "make solaris" | 
|---|
| 58 |  | 
|---|
| 59 | o SGI sproc() threads, compile with "make sproc" | 
|---|
| 60 |  | 
|---|
| 61 | o no threads, compile with "make nothreads" (currently out of order?) | 
|---|
| 62 |  | 
|---|
| 63 | For Linux: | 
|---|
| 64 |  | 
|---|
| 65 | o make "linux-pthread" (almost the same as "make posix") or | 
|---|
| 66 | make "linux-shared" | 
|---|
| 67 |  | 
|---|
| 68 | Note that some compilers need special flags for multi-threaded code, | 
|---|
| 69 | e.g. with Solaris cc with Posix threads, one should use: | 
|---|
| 70 |  | 
|---|
| 71 | % make posix SYS_FLAGS='-mt' | 
|---|
| 72 |  | 
|---|
| 73 | Some additional targets, ending in `-libc', are also provided in the | 
|---|
| 74 | Makefile, to compare performance of the test programs to the case when | 
|---|
| 75 | linking with the standard malloc implementation in libc. | 
|---|
| 76 |  | 
|---|
| 77 | A potential problem remains: If any of the system-specific functions | 
|---|
| 78 | for getting/setting thread-specific data or for locking a mutex call | 
|---|
| 79 | one of the malloc-related functions internally, the implementation | 
|---|
| 80 | cannot work at all due to infinite recursion.  One example seems to be | 
|---|
| 81 | Solaris 2.4.  I would like to hear if this problem occurs on other | 
|---|
| 82 | systems, and whether similar workarounds could be applied. | 
|---|
| 83 |  | 
|---|
| 84 | For Posix threads, too, an optional hack like that has been integrated | 
|---|
| 85 | (activated when defining USE_TSD_DATA_HACK) which depends on | 
|---|
| 86 | `pthread_t' being convertible to an integral type (which is of course | 
|---|
| 87 | not generally guaranteed).  USE_TSD_DATA_HACK is now the default | 
|---|
| 88 | because I haven't yet found a non-glibc pthreads system where this | 
|---|
| 89 | hack is _not_ needed. | 
|---|
| 90 |  | 
|---|
| 91 | *NEW* and _important_: In (currently) one place in the ptmalloc3 | 
|---|
| 92 | source, a write memory barrier is needed, named | 
|---|
| 93 | atomic_write_barrier().  This macro needs to be defined at the end of | 
|---|
| 94 | malloc-machine.h.  For gcc, a fallback in the form of a full memory | 
|---|
| 95 | barrier is already defined, but you may need to add another definition | 
|---|
| 96 | if you don't use gcc. | 
|---|
| 97 |  | 
|---|
| 98 | Usage | 
|---|
| 99 | ===== | 
|---|
| 100 |  | 
|---|
| 101 | Just link libptmalloc3 into your application. | 
|---|
| 102 |  | 
|---|
| 103 | Some wicked systems (e.g. HPUX apparently) won't let malloc call _any_ | 
|---|
| 104 | thread-related functions before main().  On these systems, | 
|---|
| 105 | USE_STARTER=2 must be defined during compilation (see "make | 
|---|
| 106 | posix-explicit" above) and the global initialization function | 
|---|
| 107 | ptmalloc_init() must be called explicitly, preferably at the start of | 
|---|
| 108 | main(). | 
|---|
| 109 |  | 
|---|
| 110 | Otherwise, when using ptmalloc3, no special precautions are necessary. | 
|---|
| 111 |  | 
|---|
| 112 | Link order is important | 
|---|
| 113 | ======================= | 
|---|
| 114 |  | 
|---|
| 115 | On some systems, when overriding malloc and linking against shared | 
|---|
| 116 | libraries, the link order becomes very important.  E.g., when linking | 
|---|
| 117 | C++ programs on Solaris with Solaris threads [this is probably now | 
|---|
| 118 | obsolete], don't rely on libC being included by default, but instead | 
|---|
| 119 | put `-lthread' behind `-lC' on the command line: | 
|---|
| 120 |  | 
|---|
| 121 | CC ... libptmalloc3.a -lC -lthread | 
|---|
| 122 |  | 
|---|
| 123 | This is because there are global constructors in libC that need | 
|---|
| 124 | malloc/ptmalloc, which in turn needs to have the thread library to be | 
|---|
| 125 | already initialized. | 
|---|
| 126 |  | 
|---|
| 127 | Debugging hooks | 
|---|
| 128 | =============== | 
|---|
| 129 |  | 
|---|
| 130 | All calls to malloc(), realloc(), free() and memalign() are routed | 
|---|
| 131 | through the global function pointers __malloc_hook, __realloc_hook, | 
|---|
| 132 | __free_hook and __memalign_hook if they are not NULL (see the malloc.h | 
|---|
| 133 | header file for declarations of these pointers).  Therefore the malloc | 
|---|
| 134 | implementation can be changed at runtime, if care is taken not to call | 
|---|
| 135 | free() or realloc() on pointers obtained with a different | 
|---|
| 136 | implementation than the one currently in effect.  (The easiest way to | 
|---|
| 137 | guarantee this is to set up the hooks before any malloc call, e.g. | 
|---|
| 138 | with a function pointed to by the global variable | 
|---|
| 139 | __malloc_initialize_hook). | 
|---|
| 140 |  | 
|---|
| 141 | You can now also tune other malloc parameters (normally adjused via | 
|---|
| 142 | mallopt() calls from the application) with environment variables: | 
|---|
| 143 |  | 
|---|
| 144 | MALLOC_TRIM_THRESHOLD_    for deciding to shrink the heap (in bytes) | 
|---|
| 145 |  | 
|---|
| 146 | MALLOC_GRANULARITY_       The unit for allocating and deallocating | 
|---|
| 147 | MALLOC_TOP_PAD_           memory from the system.  The default | 
|---|
| 148 | is 64k and this parameter _must_ be a | 
|---|
| 149 | power of 2. | 
|---|
| 150 |  | 
|---|
| 151 | MALLOC_MMAP_THRESHOLD_    min. size for chunks allocated via | 
|---|
| 152 | mmap() (in bytes) | 
|---|
| 153 |  | 
|---|
| 154 | Tests | 
|---|
| 155 | ===== | 
|---|
| 156 |  | 
|---|
| 157 | Two testing applications, t-test1 and t-test2, are included in this | 
|---|
| 158 | source distribution.  Both perform pseudo-random sequences of | 
|---|
| 159 | allocations/frees, and can be given numeric arguments (all arguments | 
|---|
| 160 | are optional): | 
|---|
| 161 |  | 
|---|
| 162 | % t-test[12] <n-total> <n-parallel> <n-allocs> <size-max> <bins> | 
|---|
| 163 |  | 
|---|
| 164 | n-total = total number of threads executed (default 10) | 
|---|
| 165 | n-parallel = number of threads running in parallel (2) | 
|---|
| 166 | n-allocs = number of malloc()'s / free()'s per thread (10000) | 
|---|
| 167 | size-max = max. size requested with malloc() in bytes (10000) | 
|---|
| 168 | bins = number of bins to maintain | 
|---|
| 169 |  | 
|---|
| 170 | The first test `t-test1' maintains a completely seperate pool of | 
|---|
| 171 | allocated bins for each thread, and should therefore show full | 
|---|
| 172 | parallelism.  On the other hand, `t-test2' creates only a single pool | 
|---|
| 173 | of bins, and each thread randomly allocates/frees any bin.  Some lock | 
|---|
| 174 | contention is to be expected in this case, as the threads frequently | 
|---|
| 175 | cross each others arena. | 
|---|
| 176 |  | 
|---|
| 177 | Performance results from t-test1 should be quite repeatable, while the | 
|---|
| 178 | behaviour of t-test2 depends on scheduling variations. | 
|---|
| 179 |  | 
|---|
| 180 | Conclusion | 
|---|
| 181 | ========== | 
|---|
| 182 |  | 
|---|
| 183 | I'm always interested in performance data and feedback, just send mail | 
|---|
| 184 | to ptmalloc@malloc.de. | 
|---|
| 185 |  | 
|---|
| 186 | Good luck! | 
|---|