<HTML>
<HEAD>
<TITLE>Debugging Garbage Collector Related Problems</title>
</head>
<BODY>
<H1>Debugging Garbage Collector Related Problems</h1>
This page contains some hints on
debugging issues specific to
the Boehm-Demers-Weiser conservative garbage collector.
It applies both to debugging issues in client code that manifest themselves
as collector misbehavior, and to debugging the collector itself.
<P>
If you suspect a bug in the collector itself, it is strongly recommended
that you try the latest collector release, even if it is labelled as "alpha",
before proceeding.
<H2>Bus Errors and Segmentation Violations</h2>
<P>
If the fault occurred in <TT>GC_find_limit</tt>, or with incremental collection enabled,
this is probably normal. The collector installs handlers to take care of
these. You will not see these unless you are using a debugger.
Your debugger <I>should</i> allow you to continue.
It's often preferable to tell the debugger to ignore SIGBUS and SIGSEGV
("<TT>handle SIGSEGV SIGBUS nostop noprint</tt>" in gdb,
"<TT>ignore SIGSEGV SIGBUS</tt>" in most versions of dbx)
and set a breakpoint in <TT>abort</tt>.
The collector will call <TT>abort</tt> if the signal had another cause
and no other handler was previously installed.
<P>
We recommend debugging without incremental collection if possible.
(This applies directly to UNIX systems.
Debugging with incremental collection under win32 is worse. See README.win32.)
<P>
If the application generates an unhandled SIGSEGV or equivalent, it may
often be easiest to set the environment variable <TT>GC_LOOP_ON_ABORT</tt>. On many
platforms, this will cause the collector to loop in a handler when the
SIGSEGV is encountered (or when the collector aborts for some other reason),
and a debugger can then be attached to the looping
process. This sidesteps common operating system problems related
to incomplete core files for multithreaded applications, etc.
<H2>Other Signals</h2>
On most platforms, the multithreaded version of the collector needs one or
two other signals for internal use by the collector in stopping threads.
It is normally wise to tell the debugger to ignore these. On Linux,
the collector currently uses SIGPWR and SIGXCPU by default.
<H2>Warning Messages About Needing to Allocate Blacklisted Blocks</h2>
The garbage collector generates warning messages of the form
<PRE>
Needed to allocate blacklisted block at 0x...
</pre>
when it needs to allocate a block at a location that it knows to be
referenced by a false pointer. These false pointers can be either permanent
(<I>e.g.</i> a static integer variable that never changes) or temporary.
In the latter case, the warning is largely spurious, and the block will
eventually be reclaimed normally.
In the former case, the program will still run correctly, but the block
will never be reclaimed. Unless the block is intended to be
permanent, the warning indicates a memory leak.
<OL>
<LI>Ignore these warnings while you are using <TT>GC_DEBUG</tt>. Some of the routines
mentioned below don't have debugging equivalents. (Alternatively, write
the missing routines and send them to me.)
<LI>Replace allocator calls that request large blocks with calls to
<TT>GC_malloc_ignore_off_page</tt> or
<TT>GC_malloc_atomic_ignore_off_page</tt>. You may want to set a
breakpoint in <TT>GC_default_warn_proc</tt> to help you identify such calls.
Make sure that a pointer to somewhere near the beginning of the resulting block
is maintained in a (preferably volatile) variable as long as
the block is needed.
<LI>
If the large blocks are allocated with realloc, we suggest instead allocating
them with something like the following. Note that the realloc size increment
should be fairly large (e.g. a factor of 3/2) for this to exhibit reasonable
performance. But we all know we should do that anyway.
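<PRE>
/* Assumes gc.h and string.h have been included, and that p is a  */
/* non-null pointer to the start of a collectable object.         */
void * big_realloc(void *p, size_t new_size)
{
    size_t old_size = GC_size(p);   /* size of the existing object */
    void * result;

    /* Small requests: leave them to the collector's own realloc. */
    if (new_size <= 10000) return(GC_realloc(p, new_size));
    /* The existing object is already large enough. */
    if (new_size <= old_size) return(p);
    result = GC_malloc_ignore_off_page(new_size);
    if (result == 0) return(0);
    memcpy(result,p,old_size);
    GC_free(p);
    return(result);
}
</pre>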

<LI> In the unlikely case that even relatively small object
(&lt;20KB) allocations are triggering these warnings, then your address
space contains lots of "bogus pointers", i.e. values that appear to
be pointers but aren't. Usually this can be solved by using <TT>GC_malloc_atomic</tt>
or the routines in <TT>gc_typed.h</tt> to allocate large pointer-free regions
of bitmaps, etc. Sometimes the problem can be solved with trivial changes
of encoding in certain values. It is possible to identify the source of the bogus
pointers by building the collector with <TT>-DPRINT_BLACK_LIST</tt>,
which will cause it to print the "bogus pointers", along with their location.

<LI> If you get only a fixed number of these warnings, you are probably only
introducing a bounded leak by ignoring them. If the data structures being
allocated are intended to be permanent, then it is also safe to ignore them.
The warnings can be turned off by calling <TT>GC_set_warn_proc</tt> with a procedure
that ignores these warnings (e.g. by doing absolutely nothing).
</ol>
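<P>
As a rough illustration of the last item, a warning procedure that drops only these messages could look like the sketch below. It is not taken from the collector: <TT>is_blacklist_warning</tt> and <TT>quiet_warn_proc</tt> are hypothetical names, the match relies on the message text quoted above, and the <TT>(char *, unsigned long)</tt> parameter list is an assumption about <TT>GC_warn_proc</tt> that should be checked against <TT>gc.h</tt> in your release.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if msg is the blacklisted-block warning, */
/* judged only by the text of the message, and 0 otherwise.                */
int is_blacklist_warning(const char *msg)
{
    return msg != NULL && strstr(msg, "blacklisted block") != NULL;
}

/* Hypothetical warning procedure. The parameter list is an assumption     */
/* about GC_warn_proc; msg is treated as a printf-style format for arg.    */
void quiet_warn_proc(char *msg, unsigned long arg)
{
    if (is_blacklist_warning(msg)) return;  /* suppress these silently */
    fprintf(stderr, msg, arg);              /* pass everything else on */
}
```

Such a procedure would be installed by passing it to <TT>GC_set_warn_proc</tt>. Filtering on the message text, rather than ignoring everything, keeps other collector warnings visible.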

<H2>The Collector References a Bad Address in <TT>GC_malloc</tt></h2>

This typically happens while the collector is trying to remove an entry from
its free list, and the free list pointer is bad because the free list link
in the last allocated object was bad.
<P>
With &gt; 99% probability, you wrote past the end of an allocated object.
Try defining <TT>GC_DEBUG</tt> before including <TT>gc.h</tt> and
allocating with <TT>GC_MALLOC</tt>. This will try to detect such
overwrite errors.
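<P>
The failure mode can be reproduced in miniature. The following toy allocator is only a sketch under simplifying assumptions (four fixed 16-byte slots, one-byte links stored in the first byte of each free slot); none of the names come from the collector. It shows how a write one byte past the end of an object damages the link in the next free slot, so that the free list head becomes bad as soon as that slot is allocated.

```c
#include <string.h>

#define SLOT_SIZE   16
#define NUM_SLOTS   4
#define END_OF_LIST 0xFF

/* A toy heap: four 16-byte slots carved out of one array. */
static unsigned char arena[SLOT_SIZE * NUM_SLOTS];
static unsigned char free_head;     /* index of the first free slot */

/* Thread every slot onto the free list; byte 0 of a free slot is its link. */
void toy_init(void)
{
    int i;
    for (i = 0; i < NUM_SLOTS; i++) {
        int next = (i + 1 < NUM_SLOTS) ? i + 1 : END_OF_LIST;
        arena[i * SLOT_SIZE] = (unsigned char)next;
    }
    free_head = 0;
}

/* Pop the first free slot, trusting whatever link is stored in it. */
unsigned char *toy_alloc(void)
{
    unsigned char *obj;
    if (free_head == END_OF_LIST) return NULL;
    obj = &arena[free_head * SLOT_SIZE];
    free_head = obj[0];             /* a corrupted link ends up here */
    return obj;
}

/* Link stored in slot i (meaningful only while slot i is free). */
unsigned char toy_link(int i) { return arena[i * SLOT_SIZE]; }

/* Returns 1 if the free list head is either a valid slot or the end marker. */
int toy_head_is_valid(void)
{
    return free_head == END_OF_LIST || free_head < NUM_SLOTS;
}
```

After <TT>toy_init()</tt>, allocating one object and filling it with <TT>SLOT_SIZE + 1</tt> bytes overwrites the link of the following slot. The next allocation then leaves the list head invalid, and any later allocation would reference a bad address.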

<H2>Unexpectedly Large Heap</h2>

Unexpected heap growth can be due to one of the following:
<OL>
<LI> Data structures that are being unintentionally retained. This
is commonly caused by data structures that are no longer being used,
but were not cleared, or by caches growing without bounds.
<LI> Pointer misidentification. The garbage collector is interpreting
integers or other data as pointers and retaining the "referenced"
objects.
<LI> Heap fragmentation. This should never result in unbounded growth,
but it may account for larger heaps. This is most commonly caused
by allocation of large objects. On some platforms it can be reduced
by building with <TT>-DUSE_MUNMAP</tt>, which will cause the collector to unmap
memory corresponding to pages that have not been recently used.
<LI> Per-object overhead. This is usually a relatively minor effect, but
it may be worth considering. If the collector recognizes interior
pointers, object sizes are increased, so that one-past-the-end pointers
are correctly recognized. The collector can be configured not to do this
(<TT>-DDONT_ADD_BYTE_AT_END</tt>).
<P>
The collector rounds up object sizes so the result fits well into the
chunk size (<TT>HBLKSIZE</tt>, normally 4K on 32-bit machines, 8K
on 64-bit machines) used by the collector. Thus, with 4K chunks, it may be
worth avoiding objects of size 2K + 1 (or 2K if a byte is being added at
the end), since only one such object fits in a chunk.
</ol>
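<P>
The arithmetic behind the 2K + 1 advice can be checked with a small model. This is a deliberately simplified sketch: it treats a chunk as a plain array of equally sized objects and ignores any per-chunk bookkeeping and size-class rounding the collector performs. The function names are hypothetical.

```c
#include <stddef.h>

/* Effective object size: the request plus the optional extra byte that */
/* lets one-past-the-end pointers be recognized.                        */
static size_t effective_size(size_t request, int add_byte_at_end)
{
    return request + (add_byte_at_end ? 1 : 0);
}

/* How many objects of this size fit into one chunk (simplified model). */
size_t objects_per_chunk(size_t chunk_size, size_t request, int add_byte_at_end)
{
    size_t size = effective_size(request, add_byte_at_end);
    return size == 0 ? 0 : chunk_size / size;
}

/* Bytes of the chunk left unused by those objects. */
size_t wasted_bytes(size_t chunk_size, size_t request, int add_byte_at_end)
{
    size_t size = effective_size(request, add_byte_at_end);
    return chunk_size - objects_per_chunk(chunk_size, request, add_byte_at_end) * size;
}
```

In this model a 4096-byte chunk holds two 2048-byte objects with nothing wasted, but only one 2049-byte object, leaving 2047 bytes unused. A 2048-byte request behaves the same way once the extra byte is added.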
The last two cases can often be identified by looking at the output
of a call to <TT>GC_dump()</tt>. Among other things, it will print the
list of free heap blocks, and a very brief description of all chunks in
the heap, the object sizes they correspond to, and how many live objects
were found in the chunk at the last collection.
<P>
Growing data structures can usually be identified by
<OL>
<LI> Building the collector with <TT>-DKEEP_BACK_PTRS</tt>,
<LI> Preferably using debugging allocation (defining <TT>GC_DEBUG</tt>
before including <TT>gc.h</tt> and allocating with <TT>GC_MALLOC</tt>),
so that objects will be identified by their allocation site,
<LI> Running the application long enough so
that most of the heap is composed of "leaked" memory, and
<LI> Then calling <TT>GC_generate_random_backtrace()</tt> from <TT>backptr.h</tt>
a few times to determine why some randomly sampled objects in the heap are
being retained.
</ol>
<P>
The same technique can often be used to identify problems with false
pointers, by noting whether the reference chains printed by
<TT>GC_generate_random_backtrace()</tt> involve any misidentified pointers.
An alternate technique is to build the collector with
<TT>-DPRINT_BLACK_LIST</tt>, which will cause it to report values that
almost, but not quite, look like heap pointers. It is very likely that
actual false pointers will come from similar sources.
<P>
In the unlikely case that false pointers are an issue, it can usually
be resolved using one or more of the following techniques:
<OL>
<LI> Use <TT>GC_malloc_atomic</tt> for objects containing no pointers.
This is especially important for large arrays containing compressed data,
pseudo-random numbers, and the like. It is also likely to improve GC
performance, perhaps drastically so if the application is paging.
<LI> If you allocate large objects containing only
one or two pointers at the beginning, either try the typed allocation
primitives in <TT>gc_typed.h</tt>, or separate out the pointer-free component.
<LI> Consider using <TT>GC_malloc_ignore_off_page()</tt>
to allocate large objects. (See <TT>gc.h</tt> and above for details.
Large means &gt; 100K in most environments.)
</ol>
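<P>
To see why pointer-free data is worth allocating with <TT>GC_malloc_atomic</tt>, consider the simplified model below of what a conservative scan does with a buffer of arbitrary words. It is a sketch only: the real collector applies further checks before treating a value as a pointer, and the function names and the idea of a single contiguous heap range are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified conservative test: a word whose value lies inside the   */
/* heap's address range is treated as a potential pointer.            */
int looks_like_heap_pointer(uintptr_t word, uintptr_t heap_start,
                            uintptr_t heap_end)
{
    return word >= heap_start && word < heap_end;
}

/* Count the words in a scanned buffer that would be misread as       */
/* pointers into the heap range [heap_start, heap_end).               */
size_t count_false_pointers(const uintptr_t *words, size_t n,
                            uintptr_t heap_start, uintptr_t heap_end)
{
    size_t i, hits = 0;
    for (i = 0; i < n; i++) {
        if (looks_like_heap_pointer(words[i], heap_start, heap_end)) hits++;
    }
    return hits;
}
```

Every hit in a scanned buffer can keep an otherwise dead object alive. A buffer that the collector knows to be pointer-free is never scanned, so it contributes no hits at all.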
<H2>Prematurely Reclaimed Objects</h2>
The usual symptom of this is a segmentation fault, or an obviously overwritten
value in a heap object. This should, of course, be impossible. In practice,
it may happen for reasons like the following:
<OL>
<LI> The collector did not intercept the creation of threads correctly in
a multithreaded application, <I>e.g.</i> because the client called
<TT>pthread_create</tt> without including <TT>gc.h</tt>, which redefines it.
<LI> The last pointer to an object in the garbage collected heap was stored
somewhere where the collector couldn't see it, <I>e.g.</i> in an
object allocated with system <TT>malloc</tt>, in certain types of
<TT>mmap</tt>ed files,
or in some data structure visible only to the OS. (On some platforms,
thread-local storage is one of these.)
<LI> The last pointer to an object was somehow disguised, <I>e.g.</i> by
XORing it with another pointer.
<LI> Incorrect use of <TT>GC_malloc_atomic</tt> or typed allocation.
<LI> An incorrect <TT>GC_free</tt> call.
<LI> The client program overwrote an internal garbage collector data structure.
<LI> A garbage collector bug.
<LI> (Empirically less likely than any of the above.) A compiler optimization
that disguised the last pointer.
</ol>
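<P>
The disguised-pointer case is easy to demonstrate. In the sketch below, <TT>disguise</tt> and <TT>reveal</tt> are hypothetical helpers, and the key may be any bit pattern, including the bits of another pointer as in an XOR-linked list. The stored value no longer equals the object's address, so a collector scanning memory for that address will not find it, even though the program can still recover the pointer.

```c
#include <stdint.h>

/* Hide a pointer by XORing its bits with a key. */
uintptr_t disguise(void *p, uintptr_t key)
{
    return (uintptr_t)p ^ key;
}

/* Recover the original pointer from the hidden value and the same key. */
void *reveal(uintptr_t hidden, uintptr_t key)
{
    return (void *)(hidden ^ key);
}
```

If the only remaining copy of a pointer is held in disguised form, the object it refers to looks unreachable and may be reclaimed while the program still intends to use it.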
The following relatively simple techniques should be tried first to narrow
down the problem:
<OL>
<LI> If you are using the incremental collector, try turning it off for
debugging.
<LI> If you are using shared libraries, try linking statically. If that works,
ensure that <TT>DYNAMIC_LOADING</tt> is defined on your platform.
<LI> Try to reproduce the problem with fully debuggable unoptimized code.
This will eliminate the last possibility, as well as making debugging easier.
<LI> Try replacing any suspect typed allocation and <TT>GC_malloc_atomic</tt>
calls with calls to <TT>GC_malloc</tt>.
<LI> Try removing any <TT>GC_free</tt> calls (<I>e.g.</i> with a suitable
<TT>#define</tt>).
<LI> Rebuild the collector with <TT>-DGC_ASSERTIONS</tt>.
<LI> If the following works on your platform (i.e. if gctest still works
if you do this), try building the collector with
<TT>-DREDIRECT_MALLOC=GC_malloc_uncollectable</tt>. This will cause
the collector to scan memory allocated with malloc.
</ol>
If all else fails, you will have to attack this with a debugger.
Suggested steps:
<OL>
<LI> Call <TT>GC_dump()</tt> from the debugger around the time of the failure. Verify
that the collector's idea of the root set (i.e. static data regions which
it should scan for pointers) looks plausible. If not, i.e. if it doesn't
include some static variables, report this as
a collector bug. Be sure to describe your platform precisely, since this sort
of problem is nearly always very platform dependent.
<LI> Especially if the failure is not deterministic, try to isolate it to
a relatively small test case.
<LI> Set a breakpoint in <TT>GC_finish_collection</tt>. This is a good
point to examine what has been marked, i.e. found reachable, by the
collector.
<LI> If the failure is deterministic, run the process
up to the last collection before the failure.
Note that the variable <TT>GC_gc_no</tt> counts collections and can be used
to set a conditional breakpoint in the right one. It is incremented just
before the call to <TT>GC_finish_collection</tt>.
If object <TT>p</tt> was prematurely recycled, it may be helpful to
look at <TT>*GC_find_header(p)</tt> at the failure point.
The <TT>hb_last_reclaimed</tt> field will identify the collection number
during which its block was last swept.
<LI> Verify that the offending object still has its correct contents at
this point.
Then call <TT>GC_is_marked(p)</tt> from the debugger to verify that the
object has not been marked and is about to be reclaimed.
<LI> Determine a path from a root, i.e. static variable, stack, or
register variable,
to the reclaimed object. Call <TT>GC_is_marked(q)</tt> for each object
<TT>q</tt> along the path, trying to locate the first unmarked object, say
<TT>r</tt>.
<LI> If <TT>r</tt> is pointed to by a static root,
verify that the location
pointing to it is part of the root set printed by <TT>GC_dump()</tt>. If it
is on the stack in the main (or only) thread, verify that
<TT>GC_stackbottom</tt> is set correctly to the base of the stack. If it is
in another thread stack, check the collector's thread data structure
(<TT>GC_thread[]</tt> on several platforms) to make sure that stack bounds
are set correctly.
<LI> If <TT>r</tt> is pointed to by heap object <TT>s</tt>, check that the
collector's layout description for <TT>s</tt> is such that the pointer field
will be scanned. Examine <TT>*GC_find_header(s)</tt> to look at the descriptor
for the heap chunk. The <TT>hb_descr</tt> field specifies the layout
of objects in that chunk. See <TT>gc_mark.h</tt> for the meaning of the descriptor.
(If its low-order 2 bits are zero, then it is just the length of the
object prefix to be scanned. This form is always used for objects allocated
with <TT>GC_malloc</tt> or <TT>GC_malloc_atomic</tt>.)
<LI> If the failure is not deterministic, you may still be able to apply some
of the above techniques at the point of failure. But remember that objects
allocated since the last collection will not have been marked, even if the
collector is functioning properly. On some platforms, the collector
can be configured to save call chains in objects for debugging.
Enabling this feature will also cause it to save the call stack at the
point of the last GC in <TT>GC_arrays._last_stack</tt>.
<LI> When looking at GC internal data structures, remember that a number
of <TT>GC_</tt><I>xxx</i> variables are really macros defined as
<TT>GC_arrays._</tt><I>xxx</i>, so that
the collector can avoid scanning them.
</ol>
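<P>
When reading an <TT>hb_descr</tt> value in the debugger, the low-order-bits rule mentioned above can be applied by hand or with a small helper such as the following. This is a sketch: the names are hypothetical, <TT>unsigned long</tt> is assumed to be wide enough to hold the descriptor, and the other tag values are not decoded here (<TT>gc_mark.h</tt> defines the collector's own constants).

```c
/* Mask selecting the low-order two bits of a descriptor. */
#define DESCR_TAG_MASK 3UL

/* Returns 1 if the descriptor is a plain length, i.e. its low-order */
/* two bits are zero, and 0 for any other kind of descriptor.        */
int descr_is_length(unsigned long descr)
{
    return (descr & DESCR_TAG_MASK) == 0;
}

/* Length of the object prefix to be scanned. Only meaningful when   */
/* descr_is_length(descr) is true.                                   */
unsigned long descr_length(unsigned long descr)
{
    return descr;
}
```

Under this rule a descriptor of zero is a length of zero, meaning that no part of the object is scanned, which is consistent with pointer-free objects.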
</body>
</html>