bzip2.1@ 3918

Last change on this file since 3918 was 3318, checked in by bird, 19 years ago
bzip2 1.0.4
File size: 15.9 KB

1	.PU
2	.TH bzip2 1
3	.SH NAME
4	bzip2, bunzip2 \- a block-sorting file compressor, v1.0.4
5	.br
6	bzcat \- decompresses files to stdout
7	.br
8	bzip2recover \- recovers data from damaged bzip2 files
9
10	.SH SYNOPSIS
11	.ll +8
12	.B bzip2
13	.RB [ " \-cdfkqstvzVL123456789 " ]
14	[
15	.I "filenames \&..."
16	]
17	.ll -8
18	.br
19	.B bunzip2
20	.RB [ " \-fkvsVL " ]
21	[
22	.I "filenames \&..."
23	]
24	.br
25	.B bzcat
26	.RB [ " \-s " ]
27	[
28	.I "filenames \&..."
29	]
30	.br
31	.B bzip2recover
32	.I "filename"
33
34	.SH DESCRIPTION
35	.I bzip2
36	compresses files using the Burrows-Wheeler block sorting
37	text compression algorithm, and Huffman coding. Compression is
38	generally considerably better than that achieved by more conventional
39	LZ77/LZ78-based compressors, and approaches the performance of the PPM
40	family of statistical compressors.
41
42	The command-line options are deliberately very similar to
43	those of
44	.I GNU gzip,
45	but they are not identical.
46
47	.I bzip2
48	expects a list of file names to accompany the
49	command-line flags. Each file is replaced by a compressed version of
50	itself, with the name "original_name.bz2".
51	Each compressed file
52	has the same modification date, permissions, and, when possible,
53	ownership as the corresponding original, so that these properties can
54	be correctly restored at decompression time. File name handling is
55	naive in the sense that there is no mechanism for preserving original
56	file names, permissions, ownerships or dates in filesystems which lack
57	these concepts, or have serious file name length restrictions, such as
58	MS-DOS.
59
60	.I bzip2
61	and
62	.I bunzip2
63	will by default not overwrite existing
64	files. If you want this to happen, specify the \-f flag.
65
66	If no file names are specified,
67	.I bzip2
68	compresses from standard
69	input to standard output. In this case,
70	.I bzip2
71	will decline to
72	write compressed output to a terminal, as this would be entirely
73	incomprehensible and therefore pointless.
74
75	.I bunzip2
76	(or
77	.I bzip2 \-d)
78	decompresses all
79	specified files. Files which were not created by
80	.I bzip2
81	will be detected and ignored, and a warning issued.
82	.I bzip2
83	attempts to guess the filename for the decompressed file
84	from that of the compressed file as follows:
85
86	       filename.bz2    becomes   filename
87	       filename.bz     becomes   filename
88	       filename.tbz2   becomes   filename.tar
89	       filename.tbz    becomes   filename.tar
90	       anyothername    becomes   anyothername.out
91
92	If the file does not end in one of the recognised endings,
93	.I .bz2,
94	.I .bz,
95	.I .tbz2
96	or
97	.I .tbz,
98	.I bzip2
99	complains that it cannot
100	guess the name of the original file, and uses the original name
101	with
102	.I .out
103	appended.
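
The naming rule above can be sketched in Python. This only illustrates the suffix table in this manual page and is not the code bzip2 itself uses; the names SUFFIX_MAP and guess_output_name are hypothetical.

```python
# Hypothetical helper mirroring the suffix table in this manual page.
# Each entry maps a recognised compressed suffix to its replacement.
SUFFIX_MAP = [
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
    (".bz2", ""),
    (".bz", ""),
]


def guess_output_name(name: str) -> str:
    """Guess the decompressed file name from the compressed one."""
    for suffix, replacement in SUFFIX_MAP:
        if name.endswith(suffix):
            return name[: -len(suffix)] + replacement
    # No recognised ending: keep the original name and append ".out".
    return name + ".out"


for example in ("filename.bz2", "filename.tbz", "anyothername"):
    print(example, "becomes", guess_output_name(example))
```
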
104
105	As with compression, supplying no
106	filenames causes decompression from
107	standard input to standard output.
108
109	.I bunzip2
110	will correctly decompress a file which is the
111	concatenation of two or more compressed files. The result is the
112	concatenation of the corresponding uncompressed files. Integrity
113	testing (\-t)
114	of concatenated
115	compressed files is also supported.
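
As a rough illustration of the concatenation behaviour, the sketch below uses the bz2 module from the Python standard library rather than the bunzip2 program itself, on the assumption that both produce and accept the same stream format. Multi-stream decompression needs Python 3.3 or later.

```python
import bz2

first = b"first file\n"
second = b"second file\n"

# Concatenate two independently compressed streams,
# much as "cat a.bz2 b.bz2" would.
combined = bz2.compress(first) + bz2.compress(second)

# Decompressing the concatenation yields the concatenated originals.
restored = bz2.decompress(combined)
print(restored == first + second)
```
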
116
117	You can also compress or decompress files to the standard output by
118	giving the \-c flag. Multiple files may be compressed and
119	decompressed like this. The resulting outputs are fed sequentially to
120	stdout. Compression of multiple files
121	in this manner generates a stream
122	containing multiple compressed file representations. Such a stream
123	can be decompressed correctly only by
124	.I bzip2
125	version 0.9.0 or
126	later. Earlier versions of
127	.I bzip2
128	will stop after decompressing
129	the first file in the stream.
130
131	.I bzcat
132	(or
133	.I bzip2 -dc)
134	decompresses all specified files to
135	the standard output.
136
137	.I bzip2
138	will read arguments from the environment variables
139	.I BZIP2
140	and
141	.I BZIP,
142	in that order, and will process them
143	before any arguments read from the command line. This gives a
144	convenient way to supply default arguments.
145
146	Compression is always performed, even if the compressed
147	file is slightly
148	larger than the original. Files of less than about one hundred bytes
149	tend to get larger, since the compression mechanism has a constant
150	overhead in the region of 50 bytes. Random data (including the output
151	of most file compressors) is coded at about 8.05 bits per byte, giving
152	an expansion of around 0.5%.
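
The overhead described above can be observed with a short experiment. This sketch uses the bz2 module from the Python standard library as a stand-in for the bzip2 program; the exact sizes printed depend on the library version and, for the random input, on the data drawn.

```python
import bz2
import os

# A very small input: the fixed overhead outweighs any savings.
tiny = b"hello"
tiny_out = bz2.compress(tiny)

# Random data is essentially incompressible, so it expands slightly.
random_data = os.urandom(200_000)
random_out = bz2.compress(random_data)

print("tiny:", len(tiny), "->", len(tiny_out), "bytes")
print("random expansion ratio:", len(random_out) / len(random_data))
```
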
153
154	As a self-check for your protection,
155	.I
156	bzip2
157	uses 32-bit CRCs to
158	make sure that the decompressed version of a file is identical to the
159	original. This guards against corruption of the compressed data, and
160	against undetected bugs in
161	.I bzip2
162	(hopefully very unlikely). The
163	chance of data corruption going undetected is microscopic, about one
164	chance in four billion for each file processed. Be aware, though, that
165	the check occurs upon decompression, so it can only tell you that
166	something is wrong. It can't help you
167	recover the original uncompressed
168	data. You can use
169	.I bzip2recover
170	to try to recover data from
171	damaged files.
172
173	Return values: 0 for a normal exit, 1 for environmental problems (file
174	not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
175	compressed file, 3 for an internal consistency error (e.g., a bug) which
176	caused
177	.I bzip2
178	to panic.
179
180	.SH OPTIONS
181	.TP
182	.B \-c --stdout
183	Compress or decompress to standard output.
184	.TP
185	.B \-d --decompress
186	Force decompression.
187	.I bzip2,
188	.I bunzip2
189	and
190	.I bzcat
191	are
192	really the same program, and the decision about what actions to take is
193	made on the basis of which name is used. This flag overrides that
194	mechanism, and forces
195	.I bzip2
196	to decompress.
197	.TP
198	.B \-z --compress
199	The complement to \-d: forces compression, regardless of the
200	invocation name.
201	.TP
202	.B \-t --test
203	Check integrity of the specified file(s), but don't decompress them.
204	This really performs a trial decompression and throws away the result.
205	.TP
206	.B \-f --force
207	Force overwrite of output files. Normally,
208	.I bzip2
209	will not overwrite
210	existing output files. Also forces
211	.I bzip2
212	to break hard links
213	to files, which it otherwise wouldn't do.
214
215	bzip2 normally declines to decompress files which don't have the
216	correct magic header bytes. If forced (-f), however, it will pass
217	such files through unmodified. This is how GNU gzip behaves.
218	.TP
219	.B \-k --keep
220	Keep (don't delete) input files during compression
221	or decompression.
222	.TP
223	.B \-s --small
224	Reduce memory usage, for compression, decompression and testing. Files
225	are decompressed and tested using a modified algorithm which only
226	requires 2.5 bytes per block byte. This means any file can be
227	decompressed in 2300k of memory, albeit at about half the normal speed.
228
229	During compression, \-s selects a block size of 200k, which limits
230	memory use to around the same figure, at the expense of your compression
231	ratio. In short, if your machine is low on memory (8 megabytes or
232	less), use \-s for everything. See MEMORY MANAGEMENT below.
233	.TP
234	.B \-q --quiet
235	Suppress non-essential warning messages. Messages pertaining to
236	I/O errors and other critical events will not be suppressed.
237	.TP
238	.B \-v --verbose
239	Verbose mode -- show the compression ratio for each file processed.
240	Further \-v's increase the verbosity level, spewing out lots of
241	information which is primarily of interest for diagnostic purposes.
242	.TP
243	.B \-L --license -V --version
244	Display the software version, license terms and conditions.
245	.TP
246	.B \-1 (or \-\-fast) to \-9 (or \-\-best)
247	Set the block size to 100 k, 200 k, ..., 900 k when compressing. Has no
248	effect when decompressing. See MEMORY MANAGEMENT below.
249	The \-\-fast and \-\-best aliases are primarily for GNU gzip
250	compatibility. In particular, \-\-fast doesn't make things
251	significantly faster.
252	And \-\-best merely selects the default behaviour.
253	.TP
254	.B \--
255	Treats all subsequent arguments as file names, even if they start
256	with a dash. This is so you can handle files with names beginning
257	with a dash, for example: bzip2 \-- \-myfilename.
258	.TP
259	.B \--repetitive-fast --repetitive-best
260	These flags are redundant in versions 0.9.5 and above. They provided
261	some coarse control over the behaviour of the sorting algorithm in
262	earlier versions, which was sometimes useful. 0.9.5 and above have an
263	improved algorithm which renders these flags irrelevant.
264
265	.SH MEMORY MANAGEMENT
266	.I bzip2
267	compresses large files in blocks. The block size affects
268	both the compression ratio achieved, and the amount of memory needed for
269	compression and decompression. The flags \-1 through \-9
270	specify the block size to be 100,000 bytes through 900,000 bytes (the
271	default) respectively. At decompression time, the block size used for
272	compression is read from the header of the compressed file, and
273	.I bunzip2
274	then allocates itself just enough memory to decompress
275	the file. Since block sizes are stored in compressed files, it follows
276	that the flags \-1 to \-9 are irrelevant to and so ignored
277	during decompression.
278
279	Compression and decompression requirements,
280	in bytes, can be estimated as:
281
282	       Compression:   400k + ( 8 x block size )
283
284	       Decompression: 100k + ( 4 x block size ), or
285	                      100k + ( 2.5 x block size )
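
The estimates above can be written as a small calculation. This is a sketch under two assumptions not spelled out in the formulas: that "k" means 1000 bytes, and that flag -N selects a block size of N x 100k. The function names are hypothetical.

```python
def compression_kbytes(level: int) -> float:
    """Estimated compression memory, in kbytes, for flag -<level>."""
    block_k = 100 * level  # assumed: -1 is 100k, ..., -9 is 900k
    return 400 + 8 * block_k


def decompression_kbytes(level: int, small: bool = False) -> float:
    """Estimated decompression memory, in kbytes; small=True models -s."""
    block_k = 100 * level
    factor = 2.5 if small else 4
    return 100 + factor * block_k


for level in (1, 9):
    print(level,
          compression_kbytes(level),
          decompression_kbytes(level),
          decompression_kbytes(level, small=True))
```
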
286
287	Larger block sizes give rapidly diminishing marginal returns. Most of
288	the compression comes from the first two or three hundred k of block
289	size, a fact worth bearing in mind when using
290	.I bzip2
291	on small machines.
292	It is also important to appreciate that the decompression memory
293	requirement is set at compression time by the choice of block size.
294
295	For files compressed with the default 900k block size,
296	.I bunzip2
297	will require about 3700 kbytes to decompress. To support decompression
298	of any file on a 4 megabyte machine,
299	.I bunzip2
300	has an option to
301	decompress using approximately half this amount of memory, about 2300
302	kbytes. Decompression speed is also halved, so you should use this
303	option only where necessary. The relevant flag is -s.
304
305	In general, try to use the largest block size memory constraints allow,
306	since that maximises the compression achieved. Compression and
307	decompression speed are virtually unaffected by block size.
308
309	Another significant point applies to files which fit in a single block
310	-- that means most files you'd encounter using a large block size. The
311	amount of real memory touched is proportional to the size of the file,
312	since the file is smaller than a block. For example, compressing a file
313	20,000 bytes long with the flag -9 will cause the compressor to
314	allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
315	kbytes of it. Similarly, the decompressor will allocate 3700k but only
316	touch 100k + 20000 * 4 = 180 kbytes.
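
The arithmetic in this example can be reproduced as follows. The sketch assumes that "k" means 1000 bytes and that the memory touched is bounded by the smaller of the file size and the block size; the function names are hypothetical.

```python
def touched_compress_bytes(file_size: int, block_size: int = 900_000) -> int:
    """Approximate memory touched while compressing, in bytes."""
    return 400_000 + 8 * min(file_size, block_size)


def touched_decompress_bytes(file_size: int, block_size: int = 900_000) -> int:
    """Approximate memory touched while decompressing, in bytes."""
    return 100_000 + 4 * min(file_size, block_size)


# The 20,000-byte file from the text, compressed with -9.
print(touched_compress_bytes(20_000))    # 560000
print(touched_decompress_bytes(20_000))  # 180000
```
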
317
318	Here is a table which summarises the maximum memory usage for different
319	block sizes. Also recorded is the total compressed size for 14 files of
320	the Calgary Text Compression Corpus totalling 3,141,622 bytes. This
321	column gives some feel for how compression varies with block size.
322	These figures tend to understate the advantage of larger block sizes for
323	larger files, since the Corpus is dominated by smaller files.
324
325	           Compress   Decompress   Decompress   Corpus
326	    Flag     usage      usage       -s usage     Size
327
328	     -1      1200k       500k         350k      914704
329	     -2      2000k       900k         600k      877703
330	     -3      2800k      1300k         850k      860338
331	     -4      3600k      1700k        1100k      846899
332	     -5      4400k      2100k        1350k      845160
333	     -6      5200k      2500k        1600k      838626
334	     -7      6100k      2900k        1850k      834096
335	     -8      6800k      3300k        2100k      828642
336	     -9      7600k      3700k        2350k      828642
337
338	.SH RECOVERING DATA FROM DAMAGED FILES
339	.I bzip2
340	compresses files in blocks, usually 900kbytes long. Each
341	block is handled independently. If a media or transmission error causes
342	a multi-block .bz2
343	file to become damaged, it may be possible to
344	recover data from the undamaged blocks in the file.
345
346	The compressed representation of each block is delimited by a 48-bit
347	pattern, which makes it possible to find the block boundaries with
348	reasonable certainty. Each block also carries its own 32-bit CRC, so
349	damaged blocks can be distinguished from undamaged ones.
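
A minimal sketch of how such a boundary search could work is shown below. It is not the bzip2recover source. The pattern value 0x314159265359 is not given in this manual page; it is the block-start marker of the bzip2 file format. The function name find_block_starts is hypothetical. Blocks are not byte-aligned, so the search proceeds bit by bit.

```python
import bz2

# 48-bit block-start pattern of the bzip2 format (assumed here,
# not stated in this manual page).
BLOCK_MAGIC = 0x314159265359
MASK_48 = (1 << 48) - 1


def find_block_starts(data: bytes) -> list:
    """Return the bit offsets at which the block-start pattern occurs."""
    offsets = []
    window = 0      # the most recent 48 bits read
    bits_read = 0
    for byte in data:
        for shift in range(7, -1, -1):  # most significant bit first
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK_48
            bits_read += 1
            if bits_read >= 48 and window == BLOCK_MAGIC:
                offsets.append(bits_read - 48)
    return offsets


compressed = bz2.compress(b"hello, world\n" * 1000)
starts = find_block_starts(compressed)
# The first block follows the 4-byte stream header, i.e. bit offset 32.
print(starts[0])
```

A chance match inside compressed data is possible in principle, which is why the per-block CRC is still needed to confirm that a candidate block is intact.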
350
351	.I bzip2recover
352	is a simple program whose purpose is to search for
353	blocks in .bz2 files, and write each block out into its own .bz2
354	file. You can then use
355	.I bzip2
356	\-t
357	to test the
358	integrity of the resulting files, and decompress those which are
359	undamaged.
360
361	.I bzip2recover
362	takes a single argument, the name of the damaged file,
363	and writes a number of files "rec00001file.bz2",
364	"rec00002file.bz2", etc, containing the extracted blocks.
365	The output filenames are designed so that the use of
366	wildcards in subsequent processing -- for example,
367	"bzip2 -dc rec*file.bz2 > recovered_data" -- processes the files in
368	the correct order.
369
370	.I bzip2recover
371	should be of most use dealing with large .bz2
372	files, as these will contain many blocks. It is clearly
373	futile to use it on damaged single-block files, since a
374	damaged block cannot be recovered. If you wish to minimise
375	any potential data loss through media or transmission errors,
376	you might consider compressing with a smaller
377	block size.
378
379	.SH PERFORMANCE NOTES
380	The sorting phase of compression gathers together similar strings in the
381	file. Because of this, files containing very long runs of repeated
382	symbols, like "aabaabaabaab ..." (repeated several hundred times) may
383	compress more slowly than normal. Versions 0.9.5 and above fare much
384	better than previous versions in this respect. The ratio between
385	worst-case and average-case compression time is in the region of 10:1.
386	For previous versions, this figure was more like 100:1. You can use the
387	\-vvvv option to monitor progress in great detail, if you want.
388
389	Decompression speed is unaffected by these phenomena.
390
391	.I bzip2
392	usually allocates several megabytes of memory to operate
393	in, and then charges all over it in a fairly random fashion. This means
394	that performance, both for compressing and decompressing, is largely
395	determined by the speed at which your machine can service cache misses.
396	Because of this, small changes to the code to reduce the miss rate have
397	been observed to give disproportionately large performance improvements.
398	I imagine
399	.I bzip2
400	will perform best on machines with very large caches.
401
402	.SH CAVEATS
403	I/O error messages are not as helpful as they could be.
404	.I bzip2
405	tries hard to detect I/O errors and exit cleanly, but the details of
406	what the problem is sometimes seem rather misleading.
407
408	This manual page pertains to version 1.0.4 of
409	.I bzip2.
410	Compressed data created by this version is entirely forwards and
411	backwards compatible with the previous public releases, versions
412	0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and 1.0.3, but with the following
413	exception: 0.9.0 and above can correctly decompress multiple
414	concatenated compressed files. 0.1pl2 cannot do this; it will stop
415	after decompressing just the first file in the stream.
416
417	.I bzip2recover
418	versions prior to 1.0.2 used 32-bit integers to represent
419	bit positions in compressed files, so they could not handle compressed
420	files more than 512 megabytes long. Versions 1.0.2 and above use
421	64-bit ints on some platforms which support them (GNU supported
422	targets, and Windows). To establish whether or not bzip2recover was
423	built with such a limitation, run it without arguments. In any event
424	you can build yourself an unlimited version if you can recompile it
425	with MaybeUInt64 set to be an unsigned 64-bit integer.
426
427
428
429	.SH AUTHOR
430	Julian Seward, jseward@bzip.org.
431
432	http://www.bzip.org
433
434	The ideas embodied in
435	.I bzip2
436	are due to (at least) the following
437	people: Michael Burrows and David Wheeler (for the block sorting
438	transformation), David Wheeler (again, for the Huffman coder), Peter
439	Fenwick (for the structured coding model in the original
440	.I bzip,
441	and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
442	(for the arithmetic coder in the original
443	.I bzip).
444	I am much
445	indebted for their help, support and advice. See the manual in the
446	source distribution for pointers to sources of documentation. Christian
447	von Roques encouraged me to look for faster sorting algorithms, so as to
448	speed up compression. Bela Lubkin encouraged me to improve the
449	worst-case compression performance.
450	Donna Robinson XMLised the documentation.
451	The bz* scripts are derived from those of GNU gzip.
452	Many people sent patches, helped
453	with portability problems, lent machines, gave advice and were generally
454	helpful.
