| 1 | <HTML> | 
|---|
| 2 | <HEAD> | 
|---|
| 3 | <TITLE>Debugging Garbage Collector Related Problems</title> | 
|---|
| 4 | </head> | 
|---|
| 5 | <BODY> | 
|---|
| 6 | <H1>Debugging Garbage Collector Related Problems</h1> | 
|---|
| 7 | This page contains some hints on | 
|---|
| 8 | debugging issues specific to | 
|---|
| 9 | the Boehm-Demers-Weiser conservative garbage collector. | 
|---|
| 10 | It applies both to debugging issues in client code that manifest themselves | 
|---|
| 11 | as collector misbehavior, and to debugging the collector itself. | 
|---|
| 12 | <P> | 
|---|
| 13 | If you suspect a bug in the collector itself, it is strongly recommended | 
|---|
| 14 | that you try the latest collector release, even if it is labelled as "alpha", | 
|---|
| 15 | before proceeding. | 
|---|
| 16 | <H2>Bus Errors and Segmentation Violations</h2> | 
|---|
| 17 | <P> | 
|---|
| 18 | If the fault occurred in GC_find_limit, or with incremental collection enabled, | 
|---|
| 19 | this is probably normal.  The collector installs handlers to take care of | 
|---|
| 20 | these.  You will not see these unless you are using a debugger. | 
|---|
| 21 | Your debugger <I>should</i> allow you to continue. | 
|---|
| 22 | It's often preferable to tell the debugger to ignore SIGBUS and SIGSEGV | 
|---|
| 23 | ("<TT>handle SIGSEGV SIGBUS nostop noprint</tt>" in gdb, | 
|---|
| 24 | "<TT>ignore SIGSEGV SIGBUS</tt>" in most versions of dbx) | 
|---|
| 25 | and set a breakpoint in <TT>abort</tt>. | 
|---|
| 26 | The collector will call abort if the signal had another cause, | 
|---|
| 27 | and there was not other handler previously installed. | 
|---|
| 28 | <P> | 
|---|
| 29 | We recommend debugging without incremental collection if possible. | 
|---|
| 30 | (This applies directly to UNIX systems. | 
|---|
| 31 | Debugging with incremental collection under win32 is worse.  See README.win32.) | 
|---|
| 32 | <P> | 
|---|
| 33 | If the application generates an unhandled SIGSEGV or equivalent, it may | 
|---|
| 34 | often be easiest to set the environment variable GC_LOOP_ON_ABORT.  On many | 
|---|
| 35 | platforms, this will cause the collector to loop in a handler when the | 
|---|
| 36 | SIGSEGV is encountered (or when the collector aborts for some other reason), | 
|---|
| 37 | and a debugger can then be attached to the looping | 
|---|
| 38 | process.  This sidesteps common operating system problems related | 
|---|
| 39 | to incomplete core files for multithreaded applications, etc. | 
|---|
| 40 | <H2>Other Signals</h2> | 
|---|
| 41 | On most platforms, the multithreaded version of the collector needs one or | 
|---|
| 42 | two other signals for internal use by the collector in stopping threads. | 
|---|
| 43 | It is normally wise to tell the debugger to ignore these.  On Linux, | 
|---|
| 44 | the collector currently uses SIGPWR and SIGXCPU by default. | 
|---|
| 45 | <H2>Warning Messages About Needing to Allocate Blacklisted Blocks</h2> | 
|---|
| 46 | The garbage collector generates warning messages of the form | 
|---|
| 47 | <PRE> | 
|---|
| 48 | Needed to allocate blacklisted block at 0x... | 
|---|
| 49 | </pre> | 
|---|
| 50 | when it needs to allocate a block at a location that it knows to be | 
|---|
| 51 | referenced by a false pointer.  These false pointers can be either permanent | 
|---|
| 52 | (<I>e.g.</i> a static integer variable that never changes) or temporary. | 
|---|
| 53 | In the latter case, the warning is largely spurious, and the block will | 
|---|
| 54 | eventually be reclaimed normally. | 
|---|
| 55 | In the former case, the program will still run correctly, but the block | 
|---|
| 56 | will never be reclaimed.  Unless the block is intended to be | 
|---|
| 57 | permanent, the warning indicates a memory leak. | 
|---|
| 58 | <OL> | 
|---|
| 59 | <LI>Ignore these warnings while you are using GC_DEBUG.  Some of the routines | 
|---|
| 60 | mentioned below don't have debugging equivalents.  (Alternatively, write | 
|---|
| 61 | the missing routines and send them to me.) | 
|---|
| 62 | <LI>Replace allocator calls that request large blocks with calls to | 
|---|
| 63 | <TT>GC_malloc_ignore_off_page</tt> or | 
|---|
| 64 | <TT>GC_malloc_atomic_ignore_off_page</tt>.  You may want to set a | 
|---|
| 65 | breakpoint in <TT>GC_default_warn_proc</tt> to help you identify such calls. | 
|---|
| 66 | Make sure that a pointer to somewhere near the beginning of the resulting block | 
|---|
| 67 | is maintained in a (preferably volatile) variable as long as | 
|---|
| 68 | the block is needed. | 
|---|
| 69 | <LI> | 
|---|
| 70 | If the large blocks are allocated with realloc, we suggest instead allocating | 
|---|
| 71 | them with something like the following.  Note that the realloc size increment | 
|---|
| 72 | should be fairly large (e.g. a factor of 3/2) for this to exhibit reasonable | 
|---|
| 73 | performance.  But we all know we should do that anyway. | 
|---|
| 74 | <PRE> | 
|---|
| 75 | void * big_realloc(void *p, size_t new_size) | 
|---|
| 76 | { | 
|---|
| 77 | size_t old_size = GC_size(p); | 
|---|
| 78 | void * result; | 
|---|
| 79 |  | 
|---|
| 80 | if (new_size <= 10000) return(GC_realloc(p, new_size)); | 
|---|
| 81 | if (new_size <= old_size) return(p); | 
|---|
| 82 | result = GC_malloc_ignore_off_page(new_size); | 
|---|
| 83 | if (result == 0) return(0); | 
|---|
| 84 | memcpy(result,p,old_size); | 
|---|
| 85 | GC_free(p); | 
|---|
| 86 | return(result); | 
|---|
| 87 | } | 
|---|
| 88 | </pre> | 
|---|
| 89 |  | 
|---|
| 90 | <LI> In the unlikely case that even relatively small object | 
|---|
| 91 | (<20KB) allocations are triggering these warnings, then your address | 
|---|
| 92 | space contains lots of "bogus pointers", i.e. values that appear to | 
|---|
| 93 | be pointers but aren't.  Usually this can be solved by using GC_malloc_atomic | 
|---|
| 94 | or the routines in gc_typed.h to allocate large pointer-free regions of bitmaps, etc.  Sometimes the problem can be solved with trivial changes of encoding | 
|---|
| 95 | in certain values.  It is possible, to identify the source of the bogus | 
|---|
| 96 | pointers by building the collector with <TT>-DPRINT_BLACK_LIST</tt>, | 
|---|
| 97 | which will cause it to print the "bogus pointers", along with their location. | 
|---|
| 98 |  | 
|---|
| 99 | <LI> If you get only a fixed number of these warnings, you are probably only | 
|---|
| 100 | introducing a bounded leak by ignoring them.  If the data structures being | 
|---|
| 101 | allocated are intended to be permanent, then it is also safe to ignore them. | 
|---|
| 102 | The warnings can be turned off by calling GC_set_warn_proc with a procedure | 
|---|
| 103 | that ignores these warnings (e.g. by doing absolutely nothing). | 
|---|
| 104 | </ol> | 
|---|
| 105 |  | 
|---|
| 106 | <H2>The Collector References a Bad Address in <TT>GC_malloc</tt></h2> | 
|---|
| 107 |  | 
|---|
| 108 | This typically happens while the collector is trying to remove an entry from | 
|---|
| 109 | its free list, and the free list pointer is bad because the free list link | 
|---|
| 110 | in the last allocated object was bad. | 
|---|
| 111 | <P> | 
|---|
| 112 | With > 99% probability, you wrote past the end of an allocated object. | 
|---|
| 113 | Try setting <TT>GC_DEBUG</tt> before including <TT>gc.h</tt> and | 
|---|
| 114 | allocating with <TT>GC_MALLOC</tt>.  This will try to detect such | 
|---|
| 115 | overwrite errors. | 
|---|
| 116 |  | 
|---|
| 117 | <H2>Unexpectedly Large Heap</h2> | 
|---|
| 118 |  | 
|---|
| 119 | Unexpected heap growth can be due to one of the following: | 
|---|
| 120 | <OL> | 
|---|
| 121 | <LI> Data structures that are being unintentionally retained.  This | 
|---|
| 122 | is commonly caused by data structures that are no longer being used, | 
|---|
| 123 | but were not cleared, or by caches growing without bounds. | 
|---|
| 124 | <LI> Pointer misidentification.  The garbage collector is interpreting | 
|---|
| 125 | integers or other data as pointers and retaining the "referenced" | 
|---|
| 126 | objects. | 
|---|
| 127 | <LI> Heap fragmentation.  This should never result in unbounded growth, | 
|---|
| 128 | but it may account for larger heaps.  This is most commonly caused | 
|---|
| 129 | by allocation of large objects.  On some platforms it can be reduced | 
|---|
| 130 | by building with -DUSE_MUNMAP, which will cause the collector to unmap | 
|---|
| 131 | memory corresponding to pages that have not been recently used. | 
|---|
| 132 | <LI> Per object overhead.  This is usually a relatively minor effect, but | 
|---|
| 133 | it may be worth considering.  If the collector recognizes interior | 
|---|
| 134 | pointers, object sizes are increased, so that one-past-the-end pointers | 
|---|
| 135 | are correctly recognized.  The collector can be configured not to do this | 
|---|
| 136 | (<TT>-DDONT_ADD_BYTE_AT_END</tt>). | 
|---|
| 137 | <P> | 
|---|
| 138 | The collector rounds up object sizes so the result fits well into the | 
|---|
| 139 | chunk size (<TT>HBLKSIZE</tt>, normally 4K on 32 bit machines, 8K | 
|---|
| 140 | on 64 bit machines) used by the collector.   Thus it may be worth avoiding | 
|---|
| 141 | objects of size 2K + 1 (or 2K if a byte is being added at the end.) | 
|---|
| 142 | </ol> | 
|---|
| 143 | The last two cases can often be identified by looking at the output | 
|---|
| 144 | of a call to <TT>GC_dump()</tt>.  Among other things, it will print the | 
|---|
| 145 | list of free heap blocks, and a very brief description of all chunks in | 
|---|
| 146 | the heap, the object sizes they correspond to, and how many live objects | 
|---|
| 147 | were found in the chunk at the last collection. | 
|---|
| 148 | <P> | 
|---|
| 149 | Growing data structures can usually be identified by | 
|---|
| 150 | <OL> | 
|---|
| 151 | <LI> Building the collector with <TT>-DKEEP_BACK_PTRS</tt>, | 
|---|
| 152 | <LI> Preferably using debugging allocation (defining <TT>GC_DEBUG</tt> | 
|---|
| 153 | before including <TT>gc.h</tt> and allocating with <TT>GC_MALLOC</tt>), | 
|---|
| 154 | so that objects will be identified by their allocation site, | 
|---|
| 155 | <LI> Running the application long enough so | 
|---|
| 156 | that most of the heap is composed of "leaked" memory, and | 
|---|
| 157 | <LI> Then calling <TT>GC_generate_random_backtrace()</tt> from backptr.h | 
|---|
| 158 | a few times to determine why some randomly sampled objects in the heap are | 
|---|
| 159 | being retained. | 
|---|
| 160 | </ol> | 
|---|
| 161 | <P> | 
|---|
| 162 | The same technique can often be used to identify problems with false | 
|---|
| 163 | pointers, by noting whether the reference chains printed by | 
|---|
| 164 | <TT>GC_generate_random_backtrace()</tt> involve any misidentified pointers. | 
|---|
| 165 | An alternate technique is to build the collector with | 
|---|
| 166 | <TT>-DPRINT_BLACK_LIST</tt> which will cause it to report values that | 
|---|
| 167 | are almost, but not quite, look like heap pointers.  It is very likely that | 
|---|
| 168 | actual false pointers will come from similar sources. | 
|---|
| 169 | <P> | 
|---|
| 170 | In the unlikely case that false pointers are an issue, it can usually | 
|---|
| 171 | be resolved using one or more of the following techniques: | 
|---|
| 172 | <OL> | 
|---|
| 173 | <LI> Use <TT>GC_malloc_atomic</tt> for objects containing no pointers. | 
|---|
| 174 | This is especially important for large arrays containing compressed data, | 
|---|
| 175 | pseudo-random numbers, and the like.  It is also likely to improve GC | 
|---|
| 176 | performance, perhaps drastically so if the application is paging. | 
|---|
| 177 | <LI> If you allocate large objects containing only | 
|---|
| 178 | one or two pointers at the beginning, either try the typed allocation | 
|---|
| 179 | primitives is <TT>gc_typed.h</tt>, or separate out the pointerfree component. | 
|---|
| 180 | <LI> Consider using <TT>GC_malloc_ignore_off_page()</tt> | 
|---|
| 181 | to allocate large objects.  (See <TT>gc.h</tt> and above for details. | 
|---|
| 182 | Large means > 100K in most environments.) | 
|---|
| 183 | </ol> | 
|---|
| 184 | <H2>Prematurely Reclaimed Objects</h2> | 
|---|
| 185 | The usual symptom of this is a segmentation fault, or an obviously overwritten | 
|---|
| 186 | value in a heap object.  This should, of course, be impossible.  In practice, | 
|---|
| 187 | it may happen for reasons like the following: | 
|---|
| 188 | <OL> | 
|---|
| 189 | <LI> The collector did not intercept the creation of threads correctly in | 
|---|
| 190 | a multithreaded application, <I>e.g.</i> because the client called | 
|---|
| 191 | <TT>pthread_create</tt> without including <TT>gc.h</tt>, which redefines it. | 
|---|
| 192 | <LI> The last pointer to an object in the garbage collected heap was stored | 
|---|
| 193 | somewhere were the collector couldn't see it, <I>e.g.</i> in an | 
|---|
| 194 | object allocated with system <TT>malloc</tt>, in certain types of | 
|---|
| 195 | <TT>mmap</tt>ed files, | 
|---|
| 196 | or in some data structure visible only to the OS.  (On some platforms, | 
|---|
| 197 | thread-local storage is one of these.) | 
|---|
| 198 | <LI> The last pointer to an object was somehow disguised, <I>e.g.</i> by | 
|---|
| 199 | XORing it with another pointer. | 
|---|
| 200 | <LI> Incorrect use of <TT>GC_malloc_atomic</tt> or typed allocation. | 
|---|
| 201 | <LI> An incorrect <TT>GC_free</tt> call. | 
|---|
| 202 | <LI> The client program overwrote an internal garbage collector data structure. | 
|---|
| 203 | <LI> A garbage collector bug. | 
|---|
| 204 | <LI> (Empirically less likely than any of the above.) A compiler optimization | 
|---|
| 205 | that disguised the last pointer. | 
|---|
| 206 | </ol> | 
|---|
| 207 | The following relatively simple techniques should be tried first to narrow | 
|---|
| 208 | down the problem: | 
|---|
| 209 | <OL> | 
|---|
| 210 | <LI> If you are using the incremental collector try turning it off for | 
|---|
| 211 | debugging. | 
|---|
| 212 | <LI> If you are using shared libraries, try linking statically.  If that works, | 
|---|
| 213 | ensure that DYNAMIC_LOADING is defined on your platform. | 
|---|
| 214 | <LI> Try to reproduce the problem with fully debuggable unoptimized code. | 
|---|
| 215 | This will eliminate the last possibility, as well as making debugging easier. | 
|---|
| 216 | <LI> Try replacing any suspect typed allocation and <TT>GC_malloc_atomic</tt> | 
|---|
| 217 | calls with calls to <TT>GC_malloc</tt>. | 
|---|
| 218 | <LI> Try removing any GC_free calls (<I>e.g.</i> with a suitable | 
|---|
| 219 | <TT>#define</tt>). | 
|---|
| 220 | <LI> Rebuild the collector with <TT>-DGC_ASSERTIONS</tt>. | 
|---|
| 221 | <LI> If the following works on your platform (i.e. if gctest still works | 
|---|
| 222 | if you do this), try building the collector with | 
|---|
| 223 | <TT>-DREDIRECT_MALLOC=GC_malloc_uncollectable</tt>.  This will cause | 
|---|
| 224 | the collector to scan memory allocated with malloc. | 
|---|
| 225 | </ol> | 
|---|
| 226 | If all else fails, you will have to attack this with a debugger. | 
|---|
| 227 | Suggested steps: | 
|---|
| 228 | <OL> | 
|---|
| 229 | <LI> Call <TT>GC_dump()</tt> from the debugger around the time of the failure.  Verify | 
|---|
| 230 | that the collectors idea of the root set (i.e. static data regions which | 
|---|
| 231 | it should scan for pointers) looks plausible.  If not, i.e. if it doesn't | 
|---|
| 232 | include some static variables, report this as | 
|---|
| 233 | a collector bug.  Be sure to describe your platform precisely, since this sort | 
|---|
| 234 | of problem is nearly always very platform dependent. | 
|---|
| 235 | <LI> Especially if the failure is not deterministic, try to isolate it to | 
|---|
| 236 | a relatively small test case. | 
|---|
| 237 | <LI> Set a break point in <TT>GC_finish_collection</tt>.  This is a good | 
|---|
| 238 | point to examine what has been marked, i.e. found reachable, by the | 
|---|
| 239 | collector. | 
|---|
| 240 | <LI> If the failure is deterministic, run the process | 
|---|
| 241 | up to the last collection before the failure. | 
|---|
| 242 | Note that the variable <TT>GC_gc_no</tt> counts collections and can be used | 
|---|
| 243 | to set a conditional breakpoint in the right one.  It is incremented just | 
|---|
| 244 | before the call to GC_finish_collection. | 
|---|
| 245 | If object <TT>p</tt> was prematurely recycled, it may be helpful to | 
|---|
| 246 | look at <TT>*GC_find_header(p)</tt> at the failure point. | 
|---|
| 247 | The <TT>hb_last_reclaimed</tt> field will identify the collection number | 
|---|
| 248 | during which its block was last swept. | 
|---|
| 249 | <LI> Verify that the offending object still has its correct contents at | 
|---|
| 250 | this point. | 
|---|
| 251 | The call <TT>GC_is_marked(p)</tt> from the debugger to verify that the | 
|---|
| 252 | object has not been marked, and is about to be reclaimed. | 
|---|
| 253 | <LI> Determine a path from a root, i.e. static variable, stack, or | 
|---|
| 254 | register variable, | 
|---|
| 255 | to the reclaimed object.  Call <TT>GC_is_marked(q)</tt> for each object | 
|---|
| 256 | <TT>q</tt> along the path, trying to locate the first unmarked object, say | 
|---|
| 257 | <TT>r</tt>. | 
|---|
| 258 | <LI> If <TT>r</tt> is pointed to by a static root, | 
|---|
| 259 | verify that the location | 
|---|
| 260 | pointing to it is part of the root set printed by <TT>GC_dump()</tt>.  If it | 
|---|
| 261 | is on the stack in the main (or only) thread, verify that | 
|---|
| 262 | <TT>GC_stackbottom</tt> is set correctly to the base of the stack.  If it is | 
|---|
| 263 | in another thread stack, check the collector's thread data structure | 
|---|
| 264 | (<TT>GC_thread[]</tt> on several platforms) to make sure that stack bounds | 
|---|
| 265 | are set correctly. | 
|---|
| 266 | <LI> If <TT>r</tt> is pointed to by heap object <TT>s</tt>, check that the | 
|---|
| 267 | collector's layout description for <TT>s</tt> is such that the pointer field | 
|---|
| 268 | will be scanned.  Call <TT>*GC_find_header(s)</tt> to look at the descriptor | 
|---|
| 269 | for the heap chunk.  The <TT>hb_descr</tt> field specifies the layout | 
|---|
| 270 | of objects in that chunk.  See gc_mark.h for the meaning of the descriptor. | 
|---|
| 271 | (If it's low order 2 bits are zero, then it is just the length of the | 
|---|
| 272 | object prefix to be scanned.  This form is always used for objects allocated | 
|---|
| 273 | with <TT>GC_malloc</tt> or <TT>GC_malloc_atomic</tt>.) | 
|---|
| 274 | <LI> If the failure is not deterministic, you may still be able to apply some | 
|---|
| 275 | of the above technique at the point of failure.  But remember that objects | 
|---|
| 276 | allocated since the last collection will not have been marked, even if the | 
|---|
| 277 | collector is functioning properly.  On some platforms, the collector | 
|---|
| 278 | can be configured to save call chains in objects for debugging. | 
|---|
| 279 | Enabling this feature will also cause it to save the call stack at the | 
|---|
| 280 | point of the last GC in GC_arrays._last_stack. | 
|---|
| 281 | <LI> When looking at GC internal data structures remember that a number | 
|---|
| 282 | of <TT>GC_</tt><I>xxx</i> variables are really macro defined to | 
|---|
| 283 | <TT>GC_arrays._</tt><I>xxx</i>, so that | 
|---|
| 284 | the collector can avoid scanning them. | 
|---|
| 285 | </ol> | 
|---|
| 286 | </body> | 
|---|
| 287 | </html> | 
|---|
| 288 |  | 
|---|
| 289 |  | 
|---|
| 290 |  | 
|---|
| 291 |  | 
|---|