1 | <HTML>
|
---|
2 | <HEAD>
|
---|
3 | <TITLE>Debugging Garbage Collector Related Problems</title>
|
---|
4 | </head>
|
---|
5 | <BODY>
|
---|
6 | <H1>Debugging Garbage Collector Related Problems</h1>
|
---|
7 | This page contains some hints on
|
---|
8 | debugging issues specific to
|
---|
9 | the Boehm-Demers-Weiser conservative garbage collector.
|
---|
10 | It applies both to debugging issues in client code that manifest themselves
|
---|
11 | as collector misbehavior, and to debugging the collector itself.
|
---|
12 | <P>
|
---|
13 | If you suspect a bug in the collector itself, it is strongly recommended
|
---|
14 | that you try the latest collector release, even if it is labelled as "alpha",
|
---|
15 | before proceeding.
|
---|
16 | <H2>Bus Errors and Segmentation Violations</h2>
|
---|
17 | <P>
|
---|
18 | If the fault occurred in GC_find_limit, or with incremental collection enabled,
|
---|
19 | this is probably normal. The collector installs handlers to take care of
|
---|
20 | these. You will not see these unless you are using a debugger.
|
---|
21 | Your debugger <I>should</i> allow you to continue.
|
---|
22 | It's often preferable to tell the debugger to ignore SIGBUS and SIGSEGV
|
---|
23 | ("<TT>handle SIGSEGV SIGBUS nostop noprint</tt>" in gdb,
|
---|
24 | "<TT>ignore SIGSEGV SIGBUS</tt>" in most versions of dbx)
|
---|
25 | and set a breakpoint in <TT>abort</tt>.
|
---|
26 | The collector will call abort if the signal had another cause,
|
---|
27 | and there was not other handler previously installed.
|
---|
28 | <P>
|
---|
29 | We recommend debugging without incremental collection if possible.
|
---|
30 | (This applies directly to UNIX systems.
|
---|
31 | Debugging with incremental collection under win32 is worse. See README.win32.)
|
---|
32 | <P>
|
---|
33 | If the application generates an unhandled SIGSEGV or equivalent, it may
|
---|
34 | often be easiest to set the environment variable GC_LOOP_ON_ABORT. On many
|
---|
35 | platforms, this will cause the collector to loop in a handler when the
|
---|
36 | SIGSEGV is encountered (or when the collector aborts for some other reason),
|
---|
37 | and a debugger can then be attached to the looping
|
---|
38 | process. This sidesteps common operating system problems related
|
---|
39 | to incomplete core files for multithreaded applications, etc.
|
---|
40 | <H2>Other Signals</h2>
|
---|
41 | On most platforms, the multithreaded version of the collector needs one or
|
---|
42 | two other signals for internal use by the collector in stopping threads.
|
---|
43 | It is normally wise to tell the debugger to ignore these. On Linux,
|
---|
44 | the collector currently uses SIGPWR and SIGXCPU by default.
|
---|
45 | <H2>Warning Messages About Needing to Allocate Blacklisted Blocks</h2>
|
---|
46 | The garbage collector generates warning messages of the form
|
---|
47 | <PRE>
|
---|
48 | Needed to allocate blacklisted block at 0x...
|
---|
49 | </pre>
|
---|
50 | when it needs to allocate a block at a location that it knows to be
|
---|
51 | referenced by a false pointer. These false pointers can be either permanent
|
---|
52 | (<I>e.g.</i> a static integer variable that never changes) or temporary.
|
---|
53 | In the latter case, the warning is largely spurious, and the block will
|
---|
54 | eventually be reclaimed normally.
|
---|
55 | In the former case, the program will still run correctly, but the block
|
---|
56 | will never be reclaimed. Unless the block is intended to be
|
---|
57 | permanent, the warning indicates a memory leak.
|
---|
58 | <OL>
|
---|
59 | <LI>Ignore these warnings while you are using GC_DEBUG. Some of the routines
|
---|
60 | mentioned below don't have debugging equivalents. (Alternatively, write
|
---|
61 | the missing routines and send them to me.)
|
---|
62 | <LI>Replace allocator calls that request large blocks with calls to
|
---|
63 | <TT>GC_malloc_ignore_off_page</tt> or
|
---|
64 | <TT>GC_malloc_atomic_ignore_off_page</tt>. You may want to set a
|
---|
65 | breakpoint in <TT>GC_default_warn_proc</tt> to help you identify such calls.
|
---|
66 | Make sure that a pointer to somewhere near the beginning of the resulting block
|
---|
67 | is maintained in a (preferably volatile) variable as long as
|
---|
68 | the block is needed.
|
---|
69 | <LI>
|
---|
70 | If the large blocks are allocated with realloc, we suggest instead allocating
|
---|
71 | them with something like the following. Note that the realloc size increment
|
---|
72 | should be fairly large (e.g. a factor of 3/2) for this to exhibit reasonable
|
---|
73 | performance. But we all know we should do that anyway.
|
---|
74 | <PRE>
|
---|
75 | void * big_realloc(void *p, size_t new_size)
|
---|
76 | {
|
---|
77 | size_t old_size = GC_size(p);
|
---|
78 | void * result;
|
---|
79 |
|
---|
80 | if (new_size <= 10000) return(GC_realloc(p, new_size));
|
---|
81 | if (new_size <= old_size) return(p);
|
---|
82 | result = GC_malloc_ignore_off_page(new_size);
|
---|
83 | if (result == 0) return(0);
|
---|
84 | memcpy(result,p,old_size);
|
---|
85 | GC_free(p);
|
---|
86 | return(result);
|
---|
87 | }
|
---|
88 | </pre>
|
---|
89 |
|
---|
90 | <LI> In the unlikely case that even relatively small object
|
---|
91 | (<20KB) allocations are triggering these warnings, then your address
|
---|
92 | space contains lots of "bogus pointers", i.e. values that appear to
|
---|
93 | be pointers but aren't. Usually this can be solved by using GC_malloc_atomic
|
---|
94 | or the routines in gc_typed.h to allocate large pointer-free regions of bitmaps, etc. Sometimes the problem can be solved with trivial changes of encoding
|
---|
95 | in certain values. It is possible, to identify the source of the bogus
|
---|
96 | pointers by building the collector with <TT>-DPRINT_BLACK_LIST</tt>,
|
---|
97 | which will cause it to print the "bogus pointers", along with their location.
|
---|
98 |
|
---|
99 | <LI> If you get only a fixed number of these warnings, you are probably only
|
---|
100 | introducing a bounded leak by ignoring them. If the data structures being
|
---|
101 | allocated are intended to be permanent, then it is also safe to ignore them.
|
---|
102 | The warnings can be turned off by calling GC_set_warn_proc with a procedure
|
---|
103 | that ignores these warnings (e.g. by doing absolutely nothing).
|
---|
104 | </ol>
|
---|
105 |
|
---|
106 | <H2>The Collector References a Bad Address in <TT>GC_malloc</tt></h2>
|
---|
107 |
|
---|
108 | This typically happens while the collector is trying to remove an entry from
|
---|
109 | its free list, and the free list pointer is bad because the free list link
|
---|
110 | in the last allocated object was bad.
|
---|
111 | <P>
|
---|
112 | With > 99% probability, you wrote past the end of an allocated object.
|
---|
113 | Try setting <TT>GC_DEBUG</tt> before including <TT>gc.h</tt> and
|
---|
114 | allocating with <TT>GC_MALLOC</tt>. This will try to detect such
|
---|
115 | overwrite errors.
|
---|
116 |
|
---|
117 | <H2>Unexpectedly Large Heap</h2>
|
---|
118 |
|
---|
119 | Unexpected heap growth can be due to one of the following:
|
---|
120 | <OL>
|
---|
121 | <LI> Data structures that are being unintentionally retained. This
|
---|
122 | is commonly caused by data structures that are no longer being used,
|
---|
123 | but were not cleared, or by caches growing without bounds.
|
---|
124 | <LI> Pointer misidentification. The garbage collector is interpreting
|
---|
125 | integers or other data as pointers and retaining the "referenced"
|
---|
126 | objects.
|
---|
127 | <LI> Heap fragmentation. This should never result in unbounded growth,
|
---|
128 | but it may account for larger heaps. This is most commonly caused
|
---|
129 | by allocation of large objects. On some platforms it can be reduced
|
---|
130 | by building with -DUSE_MUNMAP, which will cause the collector to unmap
|
---|
131 | memory corresponding to pages that have not been recently used.
|
---|
132 | <LI> Per object overhead. This is usually a relatively minor effect, but
|
---|
133 | it may be worth considering. If the collector recognizes interior
|
---|
134 | pointers, object sizes are increased, so that one-past-the-end pointers
|
---|
135 | are correctly recognized. The collector can be configured not to do this
|
---|
136 | (<TT>-DDONT_ADD_BYTE_AT_END</tt>).
|
---|
137 | <P>
|
---|
138 | The collector rounds up object sizes so the result fits well into the
|
---|
139 | chunk size (<TT>HBLKSIZE</tt>, normally 4K on 32 bit machines, 8K
|
---|
140 | on 64 bit machines) used by the collector. Thus it may be worth avoiding
|
---|
141 | objects of size 2K + 1 (or 2K if a byte is being added at the end.)
|
---|
142 | </ol>
|
---|
143 | The last two cases can often be identified by looking at the output
|
---|
144 | of a call to <TT>GC_dump()</tt>. Among other things, it will print the
|
---|
145 | list of free heap blocks, and a very brief description of all chunks in
|
---|
146 | the heap, the object sizes they correspond to, and how many live objects
|
---|
147 | were found in the chunk at the last collection.
|
---|
148 | <P>
|
---|
149 | Growing data structures can usually be identified by
|
---|
150 | <OL>
|
---|
151 | <LI> Building the collector with <TT>-DKEEP_BACK_PTRS</tt>,
|
---|
152 | <LI> Preferably using debugging allocation (defining <TT>GC_DEBUG</tt>
|
---|
153 | before including <TT>gc.h</tt> and allocating with <TT>GC_MALLOC</tt>),
|
---|
154 | so that objects will be identified by their allocation site,
|
---|
155 | <LI> Running the application long enough so
|
---|
156 | that most of the heap is composed of "leaked" memory, and
|
---|
157 | <LI> Then calling <TT>GC_generate_random_backtrace()</tt> from backptr.h
|
---|
158 | a few times to determine why some randomly sampled objects in the heap are
|
---|
159 | being retained.
|
---|
160 | </ol>
|
---|
161 | <P>
|
---|
162 | The same technique can often be used to identify problems with false
|
---|
163 | pointers, by noting whether the reference chains printed by
|
---|
164 | <TT>GC_generate_random_backtrace()</tt> involve any misidentified pointers.
|
---|
165 | An alternate technique is to build the collector with
|
---|
166 | <TT>-DPRINT_BLACK_LIST</tt> which will cause it to report values that
|
---|
167 | are almost, but not quite, look like heap pointers. It is very likely that
|
---|
168 | actual false pointers will come from similar sources.
|
---|
169 | <P>
|
---|
170 | In the unlikely case that false pointers are an issue, it can usually
|
---|
171 | be resolved using one or more of the following techniques:
|
---|
172 | <OL>
|
---|
173 | <LI> Use <TT>GC_malloc_atomic</tt> for objects containing no pointers.
|
---|
174 | This is especially important for large arrays containing compressed data,
|
---|
175 | pseudo-random numbers, and the like. It is also likely to improve GC
|
---|
176 | performance, perhaps drastically so if the application is paging.
|
---|
177 | <LI> If you allocate large objects containing only
|
---|
178 | one or two pointers at the beginning, either try the typed allocation
|
---|
179 | primitives is <TT>gc_typed.h</tt>, or separate out the pointerfree component.
|
---|
180 | <LI> Consider using <TT>GC_malloc_ignore_off_page()</tt>
|
---|
181 | to allocate large objects. (See <TT>gc.h</tt> and above for details.
|
---|
182 | Large means > 100K in most environments.)
|
---|
183 | </ol>
|
---|
184 | <H2>Prematurely Reclaimed Objects</h2>
|
---|
185 | The usual symptom of this is a segmentation fault, or an obviously overwritten
|
---|
186 | value in a heap object. This should, of course, be impossible. In practice,
|
---|
187 | it may happen for reasons like the following:
|
---|
188 | <OL>
|
---|
189 | <LI> The collector did not intercept the creation of threads correctly in
|
---|
190 | a multithreaded application, <I>e.g.</i> because the client called
|
---|
191 | <TT>pthread_create</tt> without including <TT>gc.h</tt>, which redefines it.
|
---|
192 | <LI> The last pointer to an object in the garbage collected heap was stored
|
---|
193 | somewhere were the collector couldn't see it, <I>e.g.</i> in an
|
---|
194 | object allocated with system <TT>malloc</tt>, in certain types of
|
---|
195 | <TT>mmap</tt>ed files,
|
---|
196 | or in some data structure visible only to the OS. (On some platforms,
|
---|
197 | thread-local storage is one of these.)
|
---|
198 | <LI> The last pointer to an object was somehow disguised, <I>e.g.</i> by
|
---|
199 | XORing it with another pointer.
|
---|
200 | <LI> Incorrect use of <TT>GC_malloc_atomic</tt> or typed allocation.
|
---|
201 | <LI> An incorrect <TT>GC_free</tt> call.
|
---|
202 | <LI> The client program overwrote an internal garbage collector data structure.
|
---|
203 | <LI> A garbage collector bug.
|
---|
204 | <LI> (Empirically less likely than any of the above.) A compiler optimization
|
---|
205 | that disguised the last pointer.
|
---|
206 | </ol>
|
---|
207 | The following relatively simple techniques should be tried first to narrow
|
---|
208 | down the problem:
|
---|
209 | <OL>
|
---|
210 | <LI> If you are using the incremental collector try turning it off for
|
---|
211 | debugging.
|
---|
212 | <LI> If you are using shared libraries, try linking statically. If that works,
|
---|
213 | ensure that DYNAMIC_LOADING is defined on your platform.
|
---|
214 | <LI> Try to reproduce the problem with fully debuggable unoptimized code.
|
---|
215 | This will eliminate the last possibility, as well as making debugging easier.
|
---|
216 | <LI> Try replacing any suspect typed allocation and <TT>GC_malloc_atomic</tt>
|
---|
217 | calls with calls to <TT>GC_malloc</tt>.
|
---|
218 | <LI> Try removing any GC_free calls (<I>e.g.</i> with a suitable
|
---|
219 | <TT>#define</tt>).
|
---|
220 | <LI> Rebuild the collector with <TT>-DGC_ASSERTIONS</tt>.
|
---|
221 | <LI> If the following works on your platform (i.e. if gctest still works
|
---|
222 | if you do this), try building the collector with
|
---|
223 | <TT>-DREDIRECT_MALLOC=GC_malloc_uncollectable</tt>. This will cause
|
---|
224 | the collector to scan memory allocated with malloc.
|
---|
225 | </ol>
|
---|
226 | If all else fails, you will have to attack this with a debugger.
|
---|
227 | Suggested steps:
|
---|
228 | <OL>
|
---|
229 | <LI> Call <TT>GC_dump()</tt> from the debugger around the time of the failure. Verify
|
---|
230 | that the collectors idea of the root set (i.e. static data regions which
|
---|
231 | it should scan for pointers) looks plausible. If not, i.e. if it doesn't
|
---|
232 | include some static variables, report this as
|
---|
233 | a collector bug. Be sure to describe your platform precisely, since this sort
|
---|
234 | of problem is nearly always very platform dependent.
|
---|
235 | <LI> Especially if the failure is not deterministic, try to isolate it to
|
---|
236 | a relatively small test case.
|
---|
237 | <LI> Set a break point in <TT>GC_finish_collection</tt>. This is a good
|
---|
238 | point to examine what has been marked, i.e. found reachable, by the
|
---|
239 | collector.
|
---|
240 | <LI> If the failure is deterministic, run the process
|
---|
241 | up to the last collection before the failure.
|
---|
242 | Note that the variable <TT>GC_gc_no</tt> counts collections and can be used
|
---|
243 | to set a conditional breakpoint in the right one. It is incremented just
|
---|
244 | before the call to GC_finish_collection.
|
---|
245 | If object <TT>p</tt> was prematurely recycled, it may be helpful to
|
---|
246 | look at <TT>*GC_find_header(p)</tt> at the failure point.
|
---|
247 | The <TT>hb_last_reclaimed</tt> field will identify the collection number
|
---|
248 | during which its block was last swept.
|
---|
249 | <LI> Verify that the offending object still has its correct contents at
|
---|
250 | this point.
|
---|
251 | The call <TT>GC_is_marked(p)</tt> from the debugger to verify that the
|
---|
252 | object has not been marked, and is about to be reclaimed.
|
---|
253 | <LI> Determine a path from a root, i.e. static variable, stack, or
|
---|
254 | register variable,
|
---|
255 | to the reclaimed object. Call <TT>GC_is_marked(q)</tt> for each object
|
---|
256 | <TT>q</tt> along the path, trying to locate the first unmarked object, say
|
---|
257 | <TT>r</tt>.
|
---|
258 | <LI> If <TT>r</tt> is pointed to by a static root,
|
---|
259 | verify that the location
|
---|
260 | pointing to it is part of the root set printed by <TT>GC_dump()</tt>. If it
|
---|
261 | is on the stack in the main (or only) thread, verify that
|
---|
262 | <TT>GC_stackbottom</tt> is set correctly to the base of the stack. If it is
|
---|
263 | in another thread stack, check the collector's thread data structure
|
---|
264 | (<TT>GC_thread[]</tt> on several platforms) to make sure that stack bounds
|
---|
265 | are set correctly.
|
---|
266 | <LI> If <TT>r</tt> is pointed to by heap object <TT>s</tt>, check that the
|
---|
267 | collector's layout description for <TT>s</tt> is such that the pointer field
|
---|
268 | will be scanned. Call <TT>*GC_find_header(s)</tt> to look at the descriptor
|
---|
269 | for the heap chunk. The <TT>hb_descr</tt> field specifies the layout
|
---|
270 | of objects in that chunk. See gc_mark.h for the meaning of the descriptor.
|
---|
271 | (If it's low order 2 bits are zero, then it is just the length of the
|
---|
272 | object prefix to be scanned. This form is always used for objects allocated
|
---|
273 | with <TT>GC_malloc</tt> or <TT>GC_malloc_atomic</tt>.)
|
---|
274 | <LI> If the failure is not deterministic, you may still be able to apply some
|
---|
275 | of the above technique at the point of failure. But remember that objects
|
---|
276 | allocated since the last collection will not have been marked, even if the
|
---|
277 | collector is functioning properly. On some platforms, the collector
|
---|
278 | can be configured to save call chains in objects for debugging.
|
---|
279 | Enabling this feature will also cause it to save the call stack at the
|
---|
280 | point of the last GC in GC_arrays._last_stack.
|
---|
281 | <LI> When looking at GC internal data structures remember that a number
|
---|
282 | of <TT>GC_</tt><I>xxx</i> variables are really macro defined to
|
---|
283 | <TT>GC_arrays._</tt><I>xxx</i>, so that
|
---|
284 | the collector can avoid scanning them.
|
---|
285 | </ol>
|
---|
286 | </body>
|
---|
287 | </html>
|
---|
288 |
|
---|
289 |
|
---|
290 |
|
---|
291 |
|
---|