sed-in.texi@ 1846

Last change on this file since 1846 was 599, checked in by bird, 19 years ago
GNU sed 4.1.5.
File size: 134.7 KB

1	\input texinfo @c --texinfo--
2	@c
3	@c -- Stuff that needs adding: ----------------------------------------------
4	@c (document the `;' command-separator)
5	@c --------------------------------------------------------------------------
6	@c Check for consistency: regexps in @code, text that they match in @samp.
7	@c
8	@c Tips:
9	@c @command for command
10	@c @samp for command fragments: @samp{cat -s}
11	@c @code for sed commands and flags
12	@c Use ``quote'' not `quote' or "quote".
13	@c
14	@c %**start of header
15	@setfilename sed.info
16	@settitle sed, a stream editor
17	@c %**end of header
18
19	@c @smallbook
20
21	@include version.texi
22
23	@c Combine indices.
24	@syncodeindex ky cp
25	@syncodeindex pg cp
26	@syncodeindex tp cp
27
28	@defcodeindex op
29	@syncodeindex op fn
30
31	@include config.texi
32
33	@copying
34	This file documents version @value{VERSION} of
35	@value{SSED}, a stream editor.
36
37	Copyright @copyright{} 1998, 1999, 2001, 2002, 2003, 2004 Free
38	Software Foundation, Inc.
39
40	This document is released under the terms of the @acronym{GNU} Free
41	Documentation License as published by the Free Software Foundation;
42	either version 1.1, or (at your option) any later version.
43
44	You should have received a copy of the @acronym{GNU} Free Documentation
45	License along with @value{SSED}; see the file @file{COPYING.DOC}.
46	If not, write to the Free Software Foundation, 59 Temple Place - Suite
47	330, Boston, MA 02110-1301, USA.
48
49	There are no Cover Texts and no Invariant Sections; this text, along
50	with its equivalent in the printed manual, constitutes the Title Page.
51	@end copying
52
53	@setchapternewpage off
54
55	@titlepage
56	@title @command{sed}, a stream editor
57	@subtitle version @value{VERSION}, @value{UPDATED}
58	@author by Ken Pizzini, Paolo Bonzini
59
60	@page
61	@vskip 0pt plus 1filll
62	Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc.
63
64	@insertcopying
65
66	Published by the Free Software Foundation, @*
67	51 Franklin Street, Fifth Floor @*
68	Boston, MA 02110-1301, USA
69	@end titlepage
70
71
72	@node Top
73	@top
74
75	@ifnottex
76	@insertcopying
77	@end ifnottex
78
79	@menu
80	* Introduction:: Introduction
81	* Invoking sed:: Invocation
82	* sed Programs:: @command{sed} programs
83	* Examples:: Some sample scripts
84	* Limitations:: Limitations and (non-)limitations of @value{SSED}
85	* Other Resources:: Other resources for learning about @command{sed}
86	* Reporting Bugs:: Reporting bugs
87
88	* Extended regexps:: @command{egrep}-style regular expressions
89	@ifset PERL
90	* Perl regexps:: Perl-style regular expressions
91	@end ifset
92
93	* Concept Index:: A menu with all the topics in this manual.
94	* Command and Option Index:: A menu with all @command{sed} commands and
95	command-line options.
96
97	@detailmenu
98	--- The detailed node listing ---
99
100	sed Programs:
101	* Execution Cycle:: How @command{sed} works
102	* Addresses:: Selecting lines with @command{sed}
103	* Regular Expressions:: Overview of regular expression syntax
104	* Common Commands:: Often used commands
105	* The "s" Command:: @command{sed}'s Swiss Army Knife
106	* Other Commands:: Less frequently used commands
107	* Programming Commands:: Commands for @command{sed} gurus
108	* Extended Commands:: Commands specific to @value{SSED}
109	* Escapes:: Specifying special characters
110
111	Examples:
112	* Centering lines::
113	* Increment a number::
114	* Rename files to lower case::
115	* Print bash environment::
116	* Reverse chars of lines::
117	* tac:: Reverse lines of files
118	* cat -n:: Numbering lines
119	* cat -b:: Numbering non-blank lines
120	* wc -c:: Counting chars
121	* wc -w:: Counting words
122	* wc -l:: Counting lines
123	* head:: Printing the first lines
124	* tail:: Printing the last lines
125	* uniq:: Make duplicate lines unique
126	* uniq -d:: Print duplicated lines of input
127	* uniq -u:: Remove all duplicated lines
128	* cat -s:: Squeezing blank lines
129
130	@ifset PERL
131	Perl regexps:: Perl-style regular expressions
132	* Backslash:: Introduces special sequences
133	* Circumflex/dollar sign/period:: Behave specially with regard to new lines
134	* Square brackets:: Are a bit different in strange cases
135	* Options setting:: Toggle modifiers in the middle of a regexp
136	* Non-capturing subpatterns:: Are not counted when backreferencing
137	* Repetition:: Allows for non-greedy matching
138	* Backreferences:: Allows for more than 10 back references
139	* Assertions:: Allows for complex look ahead matches
140	* Non-backtracking subpatterns:: Often gives more performance
141	* Conditional subpatterns:: Allows if/then/else branches
142	* Recursive patterns:: For example to match parentheses
143	* Comments:: Because things can get complex...
144	@end ifset
145
146	@end detailmenu
147	@end menu
148
149
150	@node Introduction
151	@chapter Introduction
152
153	@cindex Stream editor
154	@command{sed} is a stream editor.
155	A stream editor is used to perform basic text
156	transformations on an input stream
157	(a file or input from a pipeline).
158	While in some ways similar to an editor which
159	permits scripted edits (such as @command{ed}),
160	@command{sed} works by making only one pass over the
161	input(s), and is consequently more efficient.
162	But it is @command{sed}'s ability to filter text in a pipeline
163	which particularly distinguishes it from other types of
164	editors.
165
166
167	@node Invoking sed
168	@chapter Invocation
169
170	Normally @command{sed} is invoked like this:
171
172	@example
173	sed SCRIPT INPUTFILE...
174	@end example
175
176	The full format for invoking @command{sed} is:
177
178	@example
179	sed OPTIONS... [SCRIPT] [INPUTFILE...]
180	@end example
181
182	If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-},
183	@command{sed} filters the contents of the standard input. The @var{script}
184	is actually the first non-option parameter, which @command{sed} specially
185	considers a script and not an input file if (and only if) none of the
186	other @var{options} specifies a script to be executed, that is if neither
187	of the @option{-e} and @option{-f} options is specified.
188
189	@command{sed} may be invoked with the following command-line options:
190
191	@table @code
192	@item --version
193	@opindex --version
194	@cindex Version, printing
195	Print out the version of @command{sed} that is being run and a copyright notice,
196	then exit.
197
198	@item --help
199	@opindex --help
200	@cindex Usage summary, printing
201	Print a usage message briefly summarizing these command-line options
202	and the bug-reporting address,
203	then exit.
204
205	@item -n
206	@itemx --quiet
207	@itemx --silent
208	@opindex -n
209	@opindex --quiet
210	@opindex --silent
211	@cindex Disabling autoprint, from command line
212	By default, @command{sed} prints out the pattern space
213	at the end of each cycle through the script.
214	These options disable this automatic printing,
215	and @command{sed} only produces output when explicitly told to
216	via the @code{p} command.
217
218	@item -i[@var{SUFFIX}]
219	@itemx --in-place[=@var{SUFFIX}]
220	@opindex -i
221	@opindex --in-place
222	@cindex In-place editing, activating
223	@cindex @value{SSEDEXT}, in-place editing
224	This option specifies that files are to be edited in-place.
225	@value{SSED} does this by creating a temporary file and
226	sending output to this file rather than to the standard
227	output.@footnote{This applies to commands such as @code{=},
228	@code{a}, @code{c}, @code{i}, @code{l}, @code{p}. You can
229	still write to the standard output by using the @code{w}
230	@cindex @value{SSEDEXT}, @file{/dev/stdout} file
231	or @code{W} commands together with the @file{/dev/stdout}
232	special file}.
233
234	This option implies @option{-s}.
235
236	When the end of the file is reached, the temporary file is
237	renamed to the output file's original name. The extension,
238	if supplied, is used to modify the name of the old file
239	before renaming the temporary file, thereby making a backup
240	copy@footnote{Note that @value{SSED} creates the backup
241	file whether or not any output is actually changed.}.
242
243	@cindex In-place editing, Perl-style backup file names
244	This rule is followed: if the extension doesn't contain a @code{*},
245	then it is appended to the end of the current filename as a
246	suffix; if the extension does contain one or more @code{*}
247	characters, then @emph{each} asterisk is replaced with the
248	current filename. This allows you to add a prefix to the
249	backup file, instead of (or in addition to) a suffix, or
250	even to place backup copies of the original files into another
251	directory (provided the directory already exists).
252
253	If no extension is supplied, the original file is
254	overwritten without making a backup.
255
256	@item -l @var{N}
257	@itemx --line-length=@var{N}
258	@opindex -l
259	@opindex --line-length
260	@cindex Line length, setting
261	Specify the default line-wrap length for the @code{l} command.
262	A length of 0 (zero) means to never wrap long lines. If
263	not specified, it is taken to be 70.
264
265	@item --posix
266	@cindex @value{SSEDEXT}, disabling
267	@value{SSED} includes several extensions to @acronym{POSIX}
268	sed. In order to simplify writing portable scripts, this
269	option disables all the extensions that this manual documents,
270	including additional commands.
271	@cindex @code{POSIXLY_CORRECT} behavior, enabling
272	Most of the extensions accept @command{sed} programs that
273	are outside the syntax mandated by @acronym{POSIX}, but some
274	of them (such as the behavior of the @code{N} command
275	described in @ref{Reporting Bugs}) actually violate the
276	standard. If you want to disable only the latter kind of
277	extension, you can set the @code{POSIXLY_CORRECT} variable
278	to a non-empty value.
279
280	@item -r
281	@itemx --regexp-extended
282	@opindex -r
283	@opindex --regexp-extended
284	@cindex Extended regular expressions, choosing
285	@cindex @acronym{GNU} extensions, extended regular expressions
286	Use extended regular expressions rather than basic
287	regular expressions. Extended regexps are those that
288	@command{egrep} accepts; they can be clearer because they
289	usually have fewer backslashes, but are a @acronym{GNU} extension
290	and hence scripts that use them are not portable.
291	@xref{Extended regexps, , Extended regular expressions}.
292
293	@ifset PERL
294	@item -R
295	@itemx --regexp-perl
296	@opindex -R
297	@opindex --regexp-perl
298	@cindex Perl-style regular expressions, choosing
299	@cindex @value{SSEDEXT}, Perl-style regular expressions
300	Use Perl-style regular expressions rather than basic
301	regular expressions. Perl-style regexps are extremely
302	powerful but are a @value{SSED} extension and hence scripts that
303	use them are not portable. @xref{Perl regexps, ,
304	Perl-style regular expressions}.
305	@end ifset
306
307	@item -s
308	@itemx --separate
309	@cindex Working on separate files
310	By default, @command{sed} will consider the files specified on the
311	command line as a single continuous long stream. This @value{SSED}
312	extension allows the user to consider them as separate files:
313	range addresses (such as @samp{/abc/,/def/}) are not allowed
314	to span several files, line numbers are relative to the start
315	of each file, @code{$} refers to the last line of each file,
316	and files invoked from the @code{R} commands are rewound at the
317	start of each file.
318
319	@item -u
320	@itemx --unbuffered
321	@opindex -u
322	@opindex --unbuffered
323	@cindex Unbuffered I/O, choosing
324	Buffer both input and output as minimally as practical.
325	(This is particularly useful if the input is coming from
326	the likes of @samp{tail -f}, and you wish to see the transformed
327	output as soon as possible.)
328
329	@item -e @var{script}
330	@itemx --expression=@var{script}
331	@opindex -e
332	@opindex --expression
333	@cindex Script, from command line
334	Add the commands in @var{script} to the set of commands to be
335	run while processing the input.
336
337	@item -f @var{script-file}
338	@itemx --file=@var{script-file}
339	@opindex -f
340	@opindex --file
341	@cindex Script, from a file
342	Add the commands contained in the file @var{script-file}
343	to the set of commands to be run while processing the input.
344
345	@end table
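
As a rough sketch of the backup-name rule for @option{-i} (assuming a
POSIX shell with @command{mktemp} and @acronym{GNU} @command{sed}; the
file names and the @samp{orig_} prefix are invented for illustration),
the following edits a scratch file in place and keeps the original
under a prefixed name:

@example
# Work in a throwaway directory so that no real files are touched.
dir=$(mktemp -d)
printf 'hello\n' > "$dir/a.txt"

# The suffix contains an asterisk, so the asterisk is replaced by the
# file name and the backup is called orig_a.txt.
(cd "$dir" && sed -i'orig_*' 's/hello/goodbye/' a.txt)

new=$(cat "$dir/a.txt")        # contents after the edit
old=$(cat "$dir/orig_a.txt")   # backup of the original contents
printf '%s %s\n' "$new" "$old"
@end example

Writing the suffix as @samp{.bak} instead would append it, giving a
backup named @file{a.txt.bak}.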
346
347	If no @option{-e}, @option{-f}, @option{--expression}, or @option{--file}
348	options are given on the command-line,
349	then the first non-option argument on the command line is
350	taken to be the @var{script} to be executed.
351
352	@cindex Files to be processed as input
353	If any command-line parameters remain after processing the above,
354	these parameters are interpreted as the names of input files to
355	be processed.
356	@cindex Standard input, processing as input
357	A file name of @samp{-} refers to the standard input stream.
358	The standard input will be processed if no file names are specified.
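
The handling of scripts and input files described above can be tried
out with a short sketch like this one (an illustration only, assuming
a POSIX shell; the input text and variable names are invented):

@example
# Two -e fragments are combined into one script; -n suppresses the
# automatic printing, so only the lines selected by p appear.
out=$(printf 'one\ntwo\nthree\n' | sed -n -e '1p' -e '3p')
printf '%s\n' "$out"

# With no -e or -f, the first non-option argument is the script, and
# a file name of - stands for the standard input.
same=$(printf 'one\ntwo\nthree\n' | sed -n '1p;3p' -)
printf '%s\n' "$same"
@end example

Both invocations print the first and third input lines.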
359
360
361	@node sed Programs
362	@chapter @command{sed} Programs
363
364	@cindex @command{sed} program structure
365	@cindex Script structure
366	A @command{sed} program consists of one or more @command{sed} commands,
367	passed in by one or more of the
368	@option{-e}, @option{-f}, @option{--expression}, and @option{--file}
369	options, or the first non-option argument if zero of these
370	options are used.
371	This document will refer to ``the'' @command{sed} script;
372	this is understood to mean the in-order catenation
373	of all of the @var{script}s and @var{script-file}s passed in.
374
375	Each @code{sed} command consists of an optional address or
376	address range, followed by a one-character command name
377	and any additional command-specific code.
378
379	@menu
380	* Execution Cycle:: How @command{sed} works
381	* Addresses:: Selecting lines with @command{sed}
382	* Regular Expressions:: Overview of regular expression syntax
383	* Common Commands:: Often used commands
384	* The "s" Command:: @command{sed}'s Swiss Army Knife
385	* Other Commands:: Less frequently used commands
386	* Programming Commands:: Commands for @command{sed} gurus
387	* Extended Commands:: Commands specific of @value{SSED}
388	* Escapes:: Specifying special characters
389	@end menu
390
391
392	@node Execution Cycle
393	@section How @command{sed} Works
394
395	@cindex Buffer spaces, pattern and hold
396	@cindex Spaces, pattern and hold
397	@cindex Pattern space, definition
398	@cindex Hold space, definition
399	@command{sed} maintains two data buffers: the active @emph{pattern} space,
400	and the auxiliary @emph{hold} space. Both are initially empty.
401
402	@command{sed} operates by performing the following cycle on each
403	line of input: first, @command{sed} reads one line from the input
404	stream, removes any trailing newline, and places it in the pattern space.
405	Then commands are executed; each command can have an address associated
406	with it: addresses are a kind of condition code, and a command is only
407	executed if the condition holds just before the command is to be
408	executed.
409
410	When the end of the script is reached, unless the @option{-n} option
411	is in use, the contents of pattern space are printed out to the output
412	stream, adding back the trailing newline if it was removed.@footnote{Actually,
413	if @command{sed} prints a line without the terminating newline, it will
414	nevertheless print the missing newline as soon as more text is sent to
415	the same output stream, which gives the ``least expected surprise''
416	even though it does not make commands like @samp{sed -n p} exactly
417	identical to @command{cat}.} Then the next cycle starts for the next
418	input line.
419
420	Unless special commands (like @samp{D}) are used, the pattern space is
421	deleted between two cycles. The hold space, on the other hand, keeps
422	its data between cycles (see commands @samp{h}, @samp{H}, @samp{x},
423	@samp{g}, @samp{G} to move data between both buffers).
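
To see the two buffers at work, here is a small sketch (assuming a
POSIX shell; it uses the well-known line-reversal idiom purely as an
illustration of the cycle, and the variable name is arbitrary):

@example
# 1!G  on every line but the first, append the hold space to the
#      pattern space
# h    copy the pattern space to the hold space, which survives
#      until the next cycle
# $p   on the last line, print what has been accumulated
rev=$(printf 'a\nb\nc\n' | sed -n '1!G;h;$p')
printf '%s\n' "$rev"
@end example

The input lines come out in reverse order (@samp{c}, @samp{b},
@samp{a}), because the hold space carries each earlier line forward
into the next cycle.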
424
425
426	@node Addresses
427	@section Selecting lines with @command{sed}
428	@cindex Addresses, in @command{sed} scripts
429	@cindex Line selection
430	@cindex Selecting lines to process
431
432	Addresses in a @command{sed} script can be in any of the following forms:
433	@table @code
434	@item @var{number}
435	@cindex Address, numeric
436	@cindex Line, selecting by number
437	Specifying a line number will match only that line in the input.
438	(Note that @command{sed} counts lines continuously across all input files
439	unless @option{-i} or @option{-s} options are specified.)
440
441	@item @var{first}~@var{step}
442	@cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses
443	This @acronym{GNU} extension matches every @var{step}th line
444	starting with line @var{first}.
445	In particular, lines will be selected when there exists
446	a non-negative @var{n} such that the current line-number equals
447	@var{first} + (@var{n} * @var{step}).
448	Thus, to select the odd-numbered lines,
449	one would use @code{1~2};
450	to pick every third line starting with the second, @samp{2~3} would be used;
451	to pick every fifth line starting with the tenth, use @samp{10~5};
452	and @samp{50~0} is just an obscure way of saying @code{50}.
453
454	@item $
455	@cindex Address, last line
456	@cindex Last line, selecting
457	@cindex Line, selecting last
458	This address matches the last line of the last file of input, or
459	the last line of each file when the @option{-i} or @option{-s} options
460	are specified.
461
462	@item /@var{regexp}/
463	@cindex Address, as a regular expression
464	@cindex Line, selecting by regular expression match
465	This will select any line which matches the regular expression @var{regexp}.
466	If @var{regexp} itself includes any @code{/} characters,
467	each must be escaped by a backslash (@code{\}).
468
469	@cindex empty regular expression
470	@cindex @value{SSEDEXT}, modifiers and the empty regular expression
471	The empty regular expression @samp{//} repeats the last regular
472	expression match (the same holds if the empty regular expression is
473	passed to the @code{s} command). Note that modifiers to regular expressions
474	are evaluated when the regular expression is compiled, thus it is invalid to
475	specify them together with the empty regular expression.
476
477	@item \%@var{regexp}%
478	(The @code{%} may be replaced by any other single character.)
479
480	@cindex Slash character, in regular expressions
481	This also matches the regular expression @var{regexp},
482	but allows one to use a different delimiter than @code{/}.
483	This is particularly useful if the @var{regexp} itself contains
484	a lot of slashes, since it avoids the tedious escaping of every @code{/}.
485	If @var{regexp} itself includes any delimiter characters,
486	each must be escaped by a backslash (@code{\}).
487
488	@item /@var{regexp}/I
489	@itemx \%@var{regexp}%I
490	@cindex @acronym{GNU} extensions, @code{I} modifier
491	@ifset PERL
492	@cindex Perl-style regular expressions, case-insensitive
493	@end ifset
494	The @code{I} modifier to regular-expression matching is a @acronym{GNU}
495	extension which causes the @var{regexp} to be matched in
496	a case-insensitive manner.
497
498	@item /@var{regexp}/M
499	@itemx \%@var{regexp}%M
500	@ifset PERL
501	@cindex @value{SSEDEXT}, @code{M} modifier
502	@end ifset
503	@cindex Perl-style regular expressions, multiline
504	The @code{M} modifier to regular-expression matching is a @value{SSED}
505	extension which causes @code{^} and @code{$} to match respectively
506	(in addition to the normal behavior) the empty string after a newline,
507	and the empty string before a newline. There are special character
508	sequences
509	@ifset PERL
510	(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
511	in basic or extended regular expression modes)
512	@end ifset
513	@ifclear PERL
514	(@code{\`} and @code{\'})
515	@end ifclear
516	which always match the beginning or the end of the buffer.
517	@code{M} stands for @cite{multi-line}.
518
519	@ifset PERL
520	@item /@var{regexp}/S
521	@itemx \%@var{regexp}%S
522	@cindex @value{SSEDEXT}, @code{S} modifier
523	@cindex Perl-style regular expressions, single line
524	The @code{S} modifier to regular-expression matching is only valid
525	in Perl mode and specifies that the dot character (@code{.}) will
526	match the newline character too. @code{S} stands for @cite{single-line}.
527	@end ifset
528
529	@ifset PERL
530	@item /@var{regexp}/X
531	@itemx \%@var{regexp}%X
532	@cindex @value{SSEDEXT}, @code{X} modifier
533	@cindex Perl-style regular expressions, extended
534	The @code{X} modifier to regular-expression matching is also
535	valid in Perl mode only. If it is used, whitespace in the
536	pattern (other than in a character class) and
537	characters between a @kbd{#} outside a character class and the
538	next newline character are ignored. An escaping backslash
539	can be used to include a whitespace or @kbd{#} character as part
540	of the pattern.
541	@end ifset
542	@end table
543
544	If no addresses are given, then all lines are matched;
545	if one address is given, then only lines matching that
546	address are matched.
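
The single-address forms above can be sketched as follows (assuming a
POSIX shell and @acronym{GNU} @command{sed}, since @samp{1~2} and the
@code{I} modifier are extensions; the sample input is invented):

@example
lines=$(printf '%s\n' 1 2 3 4 5 6)

# Line number, step, and last-line addresses.
third=$(printf '%s\n' "$lines" | sed -n '3p')     # 3
odd=$(printf '%s\n' "$lines" | sed -n '1~2p')     # 1, 3 and 5
last=$(printf '%s\n' "$lines" | sed -n '$p')      # 6

# Regular expression addresses: case-insensitive, and with % as the
# delimiter so that slashes need no escaping.
nocase=$(printf 'Foo\nbar\n' | sed -n '/foo/Ip')          # Foo
path=$(printf '/usr/bin\n/tmp\n' | sed -n '\%^/usr/%p')   # /usr/bin
@end example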
547
548	@cindex Range of lines
549	@cindex Several lines, selecting
550	An address range can be specified by giving two addresses
551	separated by a comma (@code{,}). An address range matches lines
552	starting from where the first address matches, and continues
553	until the second address matches (inclusively).
554
555	If the second address is a @var{regexp}, then checking for the
556	ending match will start with the line @emph{following} the
557	line which matched the first address: a range will always
558	span at least two lines (except of course if the input stream
559	ends).
560
561	If the second address is a @var{number} less than (or equal to)
562	the line matching the first address, then only the one line is
563	matched.
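
The range rules above can be checked with a sketch like the following
(assuming a POSIX shell; the five-line input is invented):

@example
input=$(printf '%s\n' a b c b d)

# Numeric range, inclusive at both ends.
r1=$(printf '%s\n' "$input" | sed -n '2,4p')       # b, c, b

# The ending regexp is first tried on the line after the start, so
# the range runs from the first b through the second one.
r2=$(printf '%s\n' "$input" | sed -n '/b/,/b/p')   # b, c, b

# Second address not greater than the first: one line only.
r3=$(printf '%s\n' "$input" | sed -n '3,1p')       # c
@end example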
564
565	@cindex Special addressing forms
566	@cindex Range with start address of zero
567	@cindex Zero, as range start address
568	@cindex @var{addr1},+N
569	@cindex @var{addr1},~N
570	@cindex @acronym{GNU} extensions, special two-address forms
571	@cindex @acronym{GNU} extensions, @code{0} address
572	@cindex @acronym{GNU} extensions, 0,@var{addr2} addressing
573	@cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing
574	@cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing
575	@value{SSED} also supports some special two-address forms; all these
576	are @acronym{GNU} extensions:
577	@table @code
578	@item 0,/@var{regexp}/
579	A line number of @code{0} can be used in an address specification like
580	@code{0,/@var{regexp}/} so that @command{sed} will try to match
581	@var{regexp} in the first input line too. In other words,
582	@code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/},
583	except that if @var{regexp} matches the very first line of input the
584	@code{0,/@var{regexp}/} form will consider it to end the range, whereas
585	the @code{1,/@var{regexp}/} form will match the beginning of its range and
586	hence make the range span up to the @emph{second} occurrence of the
587	regular expression.
588
589	Note that this is the only place where the @code{0} address makes
590	sense; there is no 0-th line and commands which are given the @code{0}
591	address in any other way will give an error.
592
593	@item @var{addr1},+@var{N}
594	Matches @var{addr1} and the @var{N} lines following @var{addr1}.
595
596	@item @var{addr1},~@var{N}
597	Matches @var{addr1} and the lines following @var{addr1}
598	until the next line whose input line number is a multiple of @var{N}.
599	@end table
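
A brief sketch of these special forms (assuming a POSIX shell and
@acronym{GNU} @command{sed}, as all of them are extensions; the inputs
are invented):

@example
input=$(printf '%s\n' x y x z)

# With 0 the regexp may end the range on the very first line;
# with 1 the search for the end starts on line 2.
zero=$(printf '%s\n' "$input" | sed -n '0,/x/p')   # x
one=$(printf '%s\n' "$input" | sed -n '1,/x/p')    # x, y, x

nums=$(printf '%s\n' 1 2 3 4 5 6)
plus=$(printf '%s\n' "$nums" | sed -n '2,+1p')     # 2, 3
mult=$(printf '%s\n' "$nums" | sed -n '2,~4p')     # 2, 3, 4
@end example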
600
601	@cindex Excluding lines
602	@cindex Selecting non-matching lines
603	Appending the @code{!} character to the end of an address
604	specification negates the sense of the match.
605	That is, if the @code{!} character follows an address range,
606	then only lines which do @emph{not} match the address range
607	will be selected.
608	This also works for singleton addresses,
609	and, perhaps perversely, for the null address.
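
For instance, negation might be used like this (a sketch assuming a
POSIX shell; the input is invented):

@example
nums=$(printf '%s\n' 1 2 3 4)

# Print every line except line 2.
single=$(printf '%s\n' "$nums" | sed -n '2!p')   # 1, 3, 4

# Delete everything outside the range, keeping lines 2 and 3.
range=$(printf '%s\n' "$nums" | sed '2,3!d')     # 2, 3
@end example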
610
611
612	@node Regular Expressions
613	@section Overview of Regular Expression Syntax
614
615	To know how to use @command{sed}, people should understand regular
616	expressions (@dfn{regexp} for short). A regular expression
617	is a pattern that is matched against a
618	subject string from left to right. Most characters are
619	@dfn{ordinary}: they stand for
620	themselves in a pattern, and match the corresponding characters
621	in the subject. As a trivial example, the pattern
622
623	@example
624	The quick brown fox
625	@end example
626
627	@noindent
628	matches a portion of a subject string that is identical to
629	itself. The power of regular expressions comes from the
630	ability to include alternatives and repetitions in the pattern.
631	These are encoded in the pattern by the use of @dfn{special characters},
632	which do not stand for themselves but instead
633	are interpreted in some special way. Here is a brief description
634	of regular expression syntax as used in @command{sed}.
635
636	@table @code
637	@item @var{char}
638	A single ordinary character matches itself.
639
640	@item *
641	@cindex @acronym{GNU} extensions, to basic regular expressions
642	Matches a sequence of zero or more instances of matches for the
643	preceding regular expression, which must be an ordinary character, a
644	special character preceded by @code{\}, a @code{.}, a grouped regexp
645	(see below), or a bracket expression. As a @acronym{GNU} extension, a
646	postfixed regular expression can also be followed by @code{*}; for
647	example, @code{a**} is equivalent to @code{a*}. @acronym{POSIX}
648	1003.1-2001 says that @code{*} stands for itself when it appears at
649	the start of a regular expression or subexpression, but many
650	non-@acronym{GNU} implementations do not support this and portable
651	scripts should instead use @code{\*} in these contexts.
652
653	@item \+
654	@cindex @acronym{GNU} extensions, to basic regular expressions
655	As @code{*}, but matches one or more. It is a @acronym{GNU} extension.
656
657	@item \?
658	@cindex @acronym{GNU} extensions, to basic regular expressions
659	As @code{*}, but only matches zero or one. It is a @acronym{GNU} extension.
660
661	@item \@{@var{i}\@}
662	As @code{*}, but matches exactly @var{i} sequences (@var{i} is a
663	decimal integer; for portability, keep it between 0 and 255
664	inclusive).
665
666	@item \@{@var{i},@var{j}\@}
667	Matches between @var{i} and @var{j}, inclusive, sequences.
668
669	@item \@{@var{i},\@}
670	Matches more than or equal to @var{i} sequences.
671
672	@item \(@var{regexp}\)
673	Groups the inner @var{regexp} as a whole; this is used to:
674
675	@itemize @bullet
676	@item
677	@cindex @acronym{GNU} extensions, to basic regular expressions
678	Apply postfix operators, like @code{\(abcd\)*}:
679	this will search for zero or more whole sequences
680	of @samp{abcd}, while @code{abcd*} would search
681	for @samp{abc} followed by zero or more occurrences
682	of @samp{d}. Note that support for @code{\(abcd\)*} is
683	required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU}
684	implementations do not support it and hence it is not universally
685	portable.
686
687	@item
688	Use back references (see below).
689	@end itemize
690
691	@item .
692	Matches any character, including newline.
693
694	@item ^
695	Matches the null string at beginning of line, i.e. what
696	appears after the circumflex must appear at the
697	beginning of line. @code{^#include} will match only
698	lines where @samp{#include} is the first thing on line---if
699	there are spaces before, for example, the match fails.
700	@code{^} acts as a special character only at the beginning
701	of the regular expression or subexpression (that is,
702	after @code{\(} or @code{\|}). Portable scripts should avoid
703	@code{^} at the beginning of a subexpression, though, as
704	@acronym{POSIX} allows implementations that treat @code{^} as
705	an ordinary character in that context.
706
707
708	@item $
709	It is the same as @code{^}, but refers to end of line.
710	@code{$} also acts as a special character only at the end
711	of the regular expression or subexpression (that is, before @code{\)}
712	or @code{\|}), and its use at the end of a subexpression is not
713	portable.
714
715
716	@item [@var{list}]
717	@itemx [^@var{list}]
718	Matches any single character in @var{list}: for example,
719	@code{[aeiou]} matches all vowels. A list may include
720	sequences like @code{@var{char1}-@var{char2}}, which
721	matches any character between (inclusive) @var{char1}
722	and @var{char2}.
723
724	A leading @code{^} reverses the meaning of @var{list}, so that
725	it matches any single character @emph{not} in @var{list}. To include
726	@code{]} in the list, make it the first character (after
727	the @code{^} if needed), to include @code{-} in the list,
728	make it the first or last; to include @code{^} put
729	it after the first character.
730
731	@cindex @code{POSIXLY_CORRECT} behavior, bracket expressions
732	The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\}
733	are normally not special within @var{list}. For example, @code{[\*]}
734	matches either @samp{\} or @samp{*}, because the @code{\} is not
735	special here. However, strings like @code{[.ch.]}, @code{[=a=]}, and
736	@code{[:space:]} are special within @var{list} and represent collating
737	symbols, equivalence classes, and character classes, respectively, and
738	@code{[} is therefore special within @var{list} when it is followed by
739	@code{.}, @code{=}, or @code{:}. Also, when not in
740	@env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and
741	@code{\t} are recognized within @var{list}. @xref{Escapes}.
742
743	@item @var{regexp1}\|@var{regexp2}
744	@cindex @acronym{GNU} extensions, to basic regular expressions
745	Matches either @var{regexp1} or @var{regexp2}. Use
746	parentheses to use complex alternative regular expressions.
747	The matching process tries each alternative in turn, from
748	left to right, and the first one that succeeds is used.
749	It is a @acronym{GNU} extension.
750
751	@item @var{regexp1}@var{regexp2}
752	Matches the concatenation of @var{regexp1} and @var{regexp2}.
753	Concatenation binds more tightly than @code{\|}, @code{^}, and
754	@code{$}, but less tightly than the other regular expression
755	operators.
756
757	@item \@var{digit}
758	Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized
759	subexpression in the regular expression. This is called a @dfn{back
760	reference}. Subexpressions are implicitly numbered by counting
761	occurrences of @code{\(} left-to-right.
762
763	@item \n
764	Matches the newline character.
765
766	@item \@var{char}
767	Matches @var{char}, where @var{char} is one of @code{$},
768	@code{*}, @code{.}, @code{[}, @code{\}, or @code{^}.
769	Note that the only C-like
770	backslash sequences that you can portably assume to be
771	interpreted are @code{\n} and @code{\\}; in particular
772	@code{\t} is not portable, and matches a @samp{t} under most
773	implementations of @command{sed}, rather than a tab character.
774
775	@end table
776
777	@cindex Greedy regular expression matching
778	Note that the regular expression matcher is greedy, i.e., matches
779	are attempted from left to right and, if two or more matches are
780	possible starting at the same character, it selects the longest.
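
Greedy matching and back references can be seen in a short sketch
(assuming a POSIX shell; the sample strings are invented):

@example
# .*X takes the longest match, up to the last X on the line.
greedy=$(printf 'aXbXc\n' | sed 's/.*X/-/')       # -c

# [^X]*X cannot run past an X, so it stops at the first one.
first=$(printf 'aXbXc\n' | sed 's/[^X]*X/-/')     # -bXc

# A back reference must repeat the text matched by the group.
twice=$(printf 'abab\nabba\n' | sed -n '/^\(ab\)\1$/p')   # abab
@end example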
781
782	@noindent
783	Examples:
784	@table @samp
785	@item abcdef
786	Matches @samp{abcdef}.
787
788	@item a*b
789	Matches zero or more @samp{a}s followed by a single
790	@samp{b}. For example, @samp{b} or @samp{aaaaab}.
791
792	@item a\?b
793	Matches @samp{b} or @samp{ab}.
794
795	@item a\+b\+
796	Matches one or more @samp{a}s followed by one or more
797	@samp{b}s: @samp{ab} is the shortest possible match, but
798	other examples are @samp{aaaab} or @samp{abbbbb} or
799	@samp{aaaaaabbbbbbb}.
800
801	@item .*
802	@itemx .\+
803	These two both match all the characters in a string;
804	however, the first matches every string (including the empty
805	string), while the second matches only strings containing
806	at least one character.
807
808	@item ^main.*(.*)
809	This matches a string starting with @samp{main},
810	followed by an opening and closing
811	parenthesis. The @samp{n}, @samp{(} and @samp{)} need not
812	be adjacent.
813
814	@item ^#
815	This matches a string beginning with @samp{#}.
816
817	@item \\$
818	This matches a string ending with a single backslash. The
819	regexp contains two backslashes for escaping.
820
821	@item \$
822	Instead, this matches a string consisting of a single dollar sign,
823	because it is escaped.
824
825	@item [a-zA-Z0-9]
826	In the C locale, this matches any @acronym{ASCII} letter or digit.
827
828	@item [^ @kbd{tab}]\+
829	(Here @kbd{tab} stands for a single tab character.)
830	This matches a string of one or more
831	characters, none of which is a space or a tab.
832	Usually this means a word.
833
834	@item ^\(.*\)\n\1$
835	This matches a string consisting of two equal substrings separated by
836	a newline.
837
838	@item .\@{9\@}A$
839	This matches nine characters followed by an @samp{A}.
840
841	@item ^.\@{15\@}A
842	This matches the start of a string that contains 16 characters,
843	the last of which is an @samp{A}.
844
845	@end table
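
A few of the points above can be tried directly from the shell.  The
following is a minimal sketch, assuming @value{SSED} is installed as
@command{sed}; the sample input strings are invented for illustration.
It shows greedy matching, a back reference, and the @acronym{GNU}
alternation operator.

```shell
# Greedy matching: .* takes the longest match, so everything up to
# the LAST X on the line is replaced.
printf 'aXbXc\n' | sed 's/.*X/-/'
# prints: -c

# Back reference: \1 must repeat exactly what \(.*\) matched.
printf 'abc abc\nabc abd\n' | sed -n '/^\(.*\) \1$/p'
# prints: abc abc

# Alternation (GNU extension): select lines containing cat or cow.
printf 'cat\ndog\ncow\n' | sed -n '/cat\|cow/p'
# prints: cat
#         cow
```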
846
847
848
849	@node Common Commands
850	@section Often-Used Commands
851
852	If you use @command{sed} at all, you will quite likely want to know
853	these commands.
854
855	@table @code
856	@item #
857	[No addresses allowed.]
858
859	@findex # (comments)
860	@cindex Comments, in scripts
861	The @code{#} character begins a comment;
862	the comment continues until the next newline.
863
864	@cindex Portability, comments
865	If you are concerned about portability, be aware that
866	some implementations of @command{sed} (which are not @sc{posix}
867	conformant) may only support a single one-line comment,
868	and then only when the very first character of the script is a @code{#}.
869
870	@findex -n, forcing from within a script
871	@cindex Caveat --- #n on first line
872	Warning: if the first two characters of the @command{sed} script
873	are @code{#n}, then the @option{-n} (no-autoprint) option is forced.
874	If you want to put a comment in the first line of your script
875	and that comment begins with the letter @samp{n}
876	and you do not want this behavior,
877	then be sure to either use a capital @samp{N},
878	or place at least one space before the @samp{n}.
879
880	@item q [@var{exit-code}]
881	This command only accepts a single address.
882
883	@findex q (quit) command
884	@cindex @value{SSEDEXT}, returning an exit code
885	@cindex Quitting
886	Exit @command{sed} without processing any more commands or input.
887	Note that the current pattern space is printed if auto-print is
888	not disabled with the @option{-n} option. The ability to return
889	an exit code from the @command{sed} script is a @value{SSED} extension.
890
891	@item d
892	@findex d (delete) command
893	@cindex Text, deleting
894	Delete the pattern space;
895	immediately start next cycle.
896
897	@item p
898	@findex p (print) command
899	@cindex Text, printing
900	Print out the pattern space (to the standard output).
901	This command is usually only used in conjunction with the @option{-n}
902	command-line option.
903
904	@item n
905	@findex n (next-line) command
906	@cindex Next input line, replace pattern space with
907	@cindex Read next input line
908	If auto-print is not disabled, print the pattern space,
909	then, regardless, replace the pattern space with the next line of input.
910	If there is no more input then @command{sed} exits without processing
911	any more commands.
912
913	@item @{ @var{commands} @}
914	@findex @{@} command grouping
915	@cindex Grouping commands
916	@cindex Command groups
917	A group of commands may be enclosed between
918	@code{@{} and @code{@}} characters.
919	This is particularly useful when you want a group of commands
920	to be triggered by a single address (or address-range) match.
921
922	@end table
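
As a rough illustration of the commands in this section, here is a
short sketch run against a made-up four-line input.  It assumes
@value{SSED}; the expected output is noted in the comments.

```shell
# q: print the first two lines, then quit.
printf '1\n2\n3\n4\n' | sed 2q
# prints: 1 2 (one per line)

# d: delete line 2 and start the next cycle.
printf '1\n2\n3\n4\n' | sed 2d
# prints: 1 3 4 (one per line)

# p together with -n: print only line 3.
printf '1\n2\n3\n4\n' | sed -n 3p
# prints: 3

# n: print the current line, read the next one, then delete it,
# so only the odd-numbered lines survive.
printf '1\n2\n3\n4\n' | sed 'n;d'
# prints: 1 3 (one per line)

# Grouping: run two commands on the first line that matches /3/.
printf '1\n2\n3\n4\n' | sed -n '/3/{p;q;}'
# prints: 3
```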
923
924	@node The "s" Command
925	@section The @code{s} Command
926
927	The syntax of the @code{s} (as in substitute) command is
928	@samp{s/@var{regexp}/@var{replacement}/@var{flags}}. The @code{/}
929	characters may be uniformly replaced by any other single
930	character within any given @code{s} command. The @code{/}
931	character (or whatever other character is used in its stead)
932	can appear in the @var{regexp} or @var{replacement}
933	only if it is preceded by a @code{\} character.
934
935	The @code{s} command is probably the most important in @command{sed}
936	and has a lot of different options. Its basic concept is simple:
937	the @code{s} command attempts to match the pattern
938	space against the supplied @var{regexp}; if the match is
939	successful, then that portion of the pattern
940	space which was matched is replaced with @var{replacement}.
941
942	@cindex Backreferences, in regular expressions
943	@cindex Parenthesized substrings
944	The @var{replacement} can contain @code{\@var{n}} (@var{n} being
945	a number from 1 to 9, inclusive) references, which refer to
946	the portion of the match which is contained between the @var{n}th
947	@code{\(} and its matching @code{\)}.
948	Also, the @var{replacement} can contain unescaped @code{&}
949	characters which reference the whole matched portion
950	of the pattern space.
951	@cindex @value{SSEDEXT}, case modifiers in @code{s} commands
952	Finally, as a @value{SSED} extension, you can include a
953	special sequence made of a backslash and one of the letters
954	@code{L}, @code{l}, @code{U}, @code{u}, or @code{E}.
955	The meaning is as follows:
956
957	@table @code
958	@item \L
959	Turn the replacement
960	to lowercase until a @code{\U} or @code{\E} is found,
961
962	@item \l
963	Turn the
964	next character to lowercase,
965
966	@item \U
967	Turn the replacement to uppercase
968	until a @code{\L} or @code{\E} is found,
969
970	@item \u
971	Turn the next character
972	to uppercase,
973
974	@item \E
975	Stop case conversion started by @code{\L} or @code{\U}.
976	@end table
977
978	To include a literal @code{\}, @code{&}, or newline in the final
979	replacement, be sure to precede the desired @code{\}, @code{&},
980	or newline in the @var{replacement} with a @code{\}.
981
982	@findex s command, option flags
983	@cindex Substitution of text, options
984	The @code{s} command can be followed by zero or more of the
985	following @var{flags}:
986
987	@table @code
988	@item g
989	@cindex Global substitution
990	@cindex Replacing all text matching regexp in a line
991	Apply the replacement to @emph{all} matches to the @var{regexp},
992	not just the first.
993
994	@item @var{number}
995	@cindex Replacing only @var{n}th match of regexp in a line
996	Only replace the @var{number}th match of the @var{regexp}.
997
998	@cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command
999	@cindex Mixing @code{g} and @var{number} modifiers in the @code{s} command
1000	Note: the @sc{posix} standard does not specify what should happen
1001	when you mix the @code{g} and @var{number} modifiers,
1002	and currently there is no widely agreed upon meaning
1003	across @command{sed} implementations.
1004	For @value{SSED}, the interaction is defined to be:
1005	ignore matches before the @var{number}th,
1006	and then match and replace all matches from
1007	the @var{number}th on.
1008
1009	@item p
1010	@cindex Text, printing after substitution
1011	If the substitution was made, then print the new pattern space.
1012
1013	Note: when both the @code{p} and @code{e} options are specified,
1014	the relative ordering of the two produces very different results.
1015	In general, @code{ep} (evaluate then print) is what you want,
1016	but operating the other way round can be useful for debugging.
1017	For this reason, the current version of @value{SSED} interprets
1018	specially the presence of @code{p} options both before and after
1019	@code{e}, printing the pattern space before and after evaluation,
1020	while in general flags for the @code{s} command show their
1021	effect just once. This behavior, although documented, might
1022	change in future versions.
1023
1024	@item w @var{file-name}
1025	@cindex Text, writing to a file after substitution
1026	@cindex @value{SSEDEXT}, @file{/dev/stdout} file
1027	@cindex @value{SSEDEXT}, @file{/dev/stderr} file
1028	If the substitution was made, then write out the result to the named file.
1029	As a @value{SSED} extension, two special values of @var{file-name} are
1030	supported: @file{/dev/stderr}, which writes the result to the standard
1031	error, and @file{/dev/stdout}, which writes to the standard
1032	output.@footnote{This is equivalent to @code{p} unless the @option{-i}
1033	option is being used.}
1034
1035	@item e
1036	@cindex Evaluate Bourne-shell commands, after substitution
1037	@cindex Subprocesses
1038	@cindex @value{SSEDEXT}, evaluating Bourne-shell commands
1039	@cindex @value{SSEDEXT}, subprocesses
1040	This command allows one to pipe input from a shell command
1041	into pattern space. If a substitution was made, the command
1042	that is found in pattern space is executed and pattern space
1043	is replaced with its output. A trailing newline is suppressed;
1044	results are undefined if the command to be executed contains
1045	a @sc{nul} character. This is a @value{SSED} extension.
1046
1047	@item I
1048	@itemx i
1049	@cindex @acronym{GNU} extensions, @code{I} modifier
1050	@cindex Case-insensitive matching
1051	@ifset PERL
1052	@cindex Perl-style regular expressions, case-insensitive
1053	@end ifset
1054	The @code{I} modifier to regular-expression matching is a @acronym{GNU}
1055	extension which makes @command{sed} match @var{regexp} in a
1056	case-insensitive manner.
1057
1058	@item M
1059	@itemx m
1060	@cindex @value{SSEDEXT}, @code{M} modifier
1061	@ifset PERL
1062	@cindex Perl-style regular expressions, multiline
1063	@end ifset
1064	The @code{M} modifier to regular-expression matching is a @value{SSED}
1065	extension which causes @code{^} and @code{$} to match respectively
1066	(in addition to the normal behavior) the empty string after a newline,
1067	and the empty string before a newline. There are special character
1068	sequences
1069	@ifset PERL
1070	(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
1071	in basic or extended regular expression modes)
1072	@end ifset
1073	@ifclear PERL
1074	(@code{\`} and @code{\'})
1075	@end ifclear
1076	which always match the beginning or the end of the buffer.
1077	@code{M} stands for @cite{multi-line}.
1078
1079	@ifset PERL
1080	@item S
1081	@itemx s
1082	@cindex @value{SSEDEXT}, @code{S} modifier
1083	@cindex Perl-style regular expressions, single line
1084	The @code{S} modifier to regular-expression matching is only valid
1085	in Perl mode and specifies that the dot character (@code{.}) will
1086	match the newline character too. @code{S} stands for @cite{single-line}.
1087	@end ifset
1088
1089	@ifset PERL
1090	@item X
1091	@itemx x
1092	@cindex @value{SSEDEXT}, @code{X} modifier
1093	@cindex Perl-style regular expressions, extended
1094	The @code{X} modifier to regular-expression matching is also
1095	valid in Perl mode only. If it is used, whitespace in the
1096	pattern (other than in a character class) and
1097	characters between a @kbd{#} outside a character class and the
1098	next newline character are ignored. An escaping backslash
1099	can be used to include a whitespace or @kbd{#} character as part
1100	of the pattern.
1101	@end ifset
1102	@end table
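
The replacement syntax and flags described above can be sketched as
follows.  The inputs are invented, and the case-conversion sequences
and the @code{I} flag assume @value{SSED}.

```shell
# \1 and \2 refer to the first and second parenthesized groups.
printf 'hello world\n' | sed 's/\(hello\) \(world\)/\2 \1/'
# prints: world hello

# & stands for the whole matched text.
printf 'hello world\n' | sed 's/[a-z]*/<&>/'
# prints: <hello> world

# number and g combined (GNU behavior): skip the first match,
# then replace the second and all later ones.
printf 'a-b-c-d\n' | sed 's/-/+/2g'
# prints: a-b+c+d

# \u uppercases the next character of the replacement.
printf 'hello world\n' | sed 's/\w\+/\u&/g'
# prints: Hello World

# I makes the match case-insensitive.
printf 'Hello\n' | sed 's/hello/bye/I'
# prints: bye

# Any single character can replace / as the delimiter.
printf '/usr/bin\n' | sed 's|/usr|/opt|'
# prints: /opt/bin
```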
1103
1104
1105	@node Other Commands
1106	@section Less Frequently-Used Commands
1107
1108	Though perhaps less frequently used than those in the previous
1109	section, some very small yet useful @command{sed} scripts can be built with
1110	these commands.
1111
1112	@table @code
1113	@item y/@var{source-chars}/@var{dest-chars}/
1114	(The @code{/} characters may be uniformly replaced by
1115	any other single character within any given @code{y} command.)
1116
1117	@findex y (transliterate) command
1118	@cindex Transliteration
1119	Transliterate any characters in the pattern space which match
1120	any of the @var{source-chars} with the corresponding character
1121	in @var{dest-chars}.
1122
1123	Instances of the @code{/} (or whatever other character is used in its stead),
1124	@code{\}, or newlines can appear in the @var{source-chars} or @var{dest-chars}
1125	lists, provided that each instance is escaped by a @code{\}.
1126	The @var{source-chars} and @var{dest-chars} lists @emph{must}
1127	contain the same number of characters (after de-escaping).
1128
1129	@item a\
1130	@itemx @var{text}
1131	@cindex @value{SSEDEXT}, two addresses supported by most commands
1132	As a @acronym{GNU} extension, this command accepts two addresses.
1133
1134	@findex a (append text lines) command
1135	@cindex Appending text after a line
1136	@cindex Text, appending
1137	Queue the lines of text which follow this command
1138	(each but the last ending with a @code{\},
1139	which are removed from the output)
1140	to be output at the end of the current cycle,
1141	or when the next input line is read.
1142
1143	Escape sequences in @var{text} are processed, so you should
1144	use @code{\\} in @var{text} to print a single backslash.
1145
1146	As a @acronym{GNU} extension, if anything other than a whitespace-@code{\} sequence
1147	appears between the @code{a} and the newline, then the text of this line,
1148	starting at the first non-whitespace character after the @code{a},
1149	is taken as the first line of the @var{text} block.
1150	(This enables a simplification in scripting a one-line add.)
1151	This extension also works with the @code{i} and @code{c} commands.
1152
1153	@item i\
1154	@itemx @var{text}
1155	@cindex @value{SSEDEXT}, two addresses supported by most commands
1156	As a @acronym{GNU} extension, this command accepts two addresses.
1157
1158	@findex i (insert text lines) command
1159	@cindex Inserting text before a line
1160	@cindex Text, insertion
1161	Immediately output the lines of text which follow this command
1162	(each but the last ending with a @code{\},
1163	which are removed from the output).
1164
1165	@item c\
1166	@itemx @var{text}
1167	@findex c (change to text lines) command
1168	@cindex Replacing selected lines with other text
1169	Delete the lines matching the address or address-range,
1170	and output the lines of text which follow this command
1171	(each but the last ending with a @code{\},
1172	which are removed from the output)
1173	in place of the last line
1174	(or in place of each line, if no addresses were specified).
1175	A new cycle is started after this command is done,
1176	since the pattern space will have been deleted.
1177
1178	@item =
1179	@cindex @value{SSEDEXT}, two addresses supported by most commands
1180	As a @acronym{GNU} extension, this command accepts two addresses.
1181
1182	@findex = (print line number) command
1183	@cindex Printing line number
1184	@cindex Line number, printing
1185	Print out the current input line number (with a trailing newline).
1186
1187	@item l @var{n}
1188	@findex l (list unambiguously) command
1189	@cindex List pattern space
1190	@cindex Printing text unambiguously
1191	@cindex Line length, setting
1192	@cindex @value{SSEDEXT}, setting line length
1193	Print the pattern space in an unambiguous form:
1194	non-printable characters (and the @code{\} character)
1195	are printed in C-style escaped form; long lines are split,
1196	with a trailing @code{\} character to indicate the split;
1197	the end of each line is marked with a @code{$}.
1198
1199	@var{n} specifies the desired line-wrap length;
1200	a length of 0 (zero) means to never wrap long lines. If omitted,
1201	the default as specified on the command line is used. The @var{n}
1202	parameter is a @value{SSED} extension.
1203
1204	@item r @var{filename}
1205	@cindex @value{SSEDEXT}, two addresses supported by most commands
1206	As a @acronym{GNU} extension, this command accepts two addresses.
1207
1208	@findex r (read file) command
1209	@cindex Read text from a file
1210	@cindex @value{SSEDEXT}, @file{/dev/stdin} file
1211	Queue the contents of @var{filename} to be read and
1212	inserted into the output stream at the end of the current cycle,
1213	or when the next input line is read.
1214	Note that if @var{filename} cannot be read, it is treated as
1215	if it were an empty file, without any error indication.
1216
1217	As a @value{SSED} extension, the special value @file{/dev/stdin}
1218	is supported for the file name, which reads the contents of the
1219	standard input.
1220
1221	@item w @var{filename}
1222	@findex w (write file) command
1223	@cindex Write to a file
1224	@cindex @value{SSEDEXT}, @file{/dev/stdout} file
1225	@cindex @value{SSEDEXT}, @file{/dev/stderr} file
1226	Write the pattern space to @var{filename}.
1227	As a @value{SSED} extension, two special values of @var{filename} are
1228	supported: @file{/dev/stderr}, which writes the result to the standard
1229	error, and @file{/dev/stdout}, which writes to the standard
1230	output.@footnote{This is equivalent to @code{p} unless the @option{-i}
1231	option is being used.}
1232
1233	The file will be created (or truncated) before the
1234	first input line is read; all @code{w} commands
1235	(including instances of @code{w} flag on successful @code{s} commands)
1236	which refer to the same @var{filename} are output without
1237	closing and reopening the file.
1238
1239	@item D
1240	@findex D (delete first line) command
1241	@cindex Delete first line from pattern space
1242	Delete text in the pattern space up to the first newline.
1243	If any text is left, restart cycle with the resultant
1244	pattern space (without reading a new line of input),
1245	otherwise start a normal new cycle.
1246
1247	@item N
1248	@findex N (append Next line) command
1249	@cindex Next input line, append to pattern space
1250	@cindex Append next input line to pattern space
1251	Add a newline to the pattern space,
1252	then append the next line of input to the pattern space.
1253	If there is no more input then @command{sed} exits without processing
1254	any more commands.
1255
1256	@item P
1257	@findex P (print first line) command
1258	@cindex Print first line from pattern space
1259	Print out the portion of the pattern space up to the first newline.
1260
1261	@item h
1262	@findex h (hold) command
1263	@cindex Copy pattern space into hold space
1264	@cindex Replace hold space with copy of pattern space
1265	@cindex Hold space, copying pattern space into
1266	Replace the contents of the hold space with the contents of the pattern space.
1267
1268	@item H
1269	@findex H (append Hold) command
1270	@cindex Append pattern space to hold space
1271	@cindex Hold space, appending from pattern space
1272	Append a newline to the contents of the hold space,
1273	and then append the contents of the pattern space to that of the hold space.
1274
1275	@item g
1276	@findex g (get) command
1277	@cindex Copy hold space into pattern space
1278	@cindex Replace pattern space with copy of hold space
1279	@cindex Hold space, copy into pattern space
1280	Replace the contents of the pattern space with the contents of the hold space.
1281
1282	@item G
1283	@findex G (appending Get) command
1284	@cindex Append hold space to pattern space
1285	@cindex Hold space, appending to pattern space
1286	Append a newline to the contents of the pattern space,
1287	and then append the contents of the hold space to that of the pattern space.
1288
1289	@item x
1290	@findex x (eXchange) command
1291	@cindex Exchange hold space with pattern space
1292	@cindex Hold space, exchange with pattern space
1293	Exchange the contents of the hold and pattern spaces.
1294
1295	@end table
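
Some of these commands are easier to follow with a concrete run.  The
sketch below assumes @value{SSED} and uses invented input; the one-line
form of @code{a} relies on the @acronym{GNU} extension described above.

```shell
# y: transliterate e to i and l to p.
printf 'hello\n' | sed 'y/el/ip/'
# prints: hippo

# N: append the next line to the pattern space, then join the pair.
printf 'a\nb\nc\nd\n' | sed 'N;s/\n/,/'
# prints: a,b
#         c,d

# G and h: build the input in reverse order in the hold space and
# print it on the last line.
printf '1\n2\n3\n' | sed -n '1!G;h;$p'
# prints: 3 2 1 (one per line)

# =: print the number of the last input line, that is, the line count.
printf 'a\nb\n' | sed -n '$='
# prints: 2

# a (GNU one-line form): queue text to be output after each line.
printf 'x\n' | sed 'a appended'
# prints: x
#         appended
```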
1296
1297
1298	@node Programming Commands
1299	@section Commands for @command{sed} gurus
1300
1301	In most cases, use of these commands indicates that you are
1302	probably better off programming in something like @command{awk}
1303	or Perl. But occasionally one is committed to sticking
1304	with @command{sed}, and these commands can enable one to write
1305	quite convoluted scripts.
1306
1307	@cindex Flow of control in scripts
1308	@table @code
1309	@item : @var{label}
1310	[No addresses allowed.]
1311
1312	@findex : (label) command
1313	@cindex Labels, in scripts
1314	Specify the location of @var{label} for branch commands.
1315	In all other respects, a no-op.
1316
1317	@item b @var{label}
1318	@findex b (branch) command
1319	@cindex Branch to a label, unconditionally
1320	@cindex Goto, in scripts
1321	Unconditionally branch to @var{label}.
1322	The @var{label} may be omitted, in which case the next cycle is started.
1323
1324	@item t @var{label}
1325	@findex t (test and branch if successful) command
1326	@cindex Branch to a label, if @code{s///} succeeded
1327	@cindex Conditional branch
1328	Branch to @var{label} only if there has been a successful @code{s}ubstitution
1329	since the last input line was read or conditional branch was taken.
1330	The @var{label} may be omitted, in which case the next cycle is started.
1331
1332	@end table
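
The branch commands are typically combined into small loops.  The
following sketch, with an invented label name and invented input, shows
a @code{t} loop that repeats a substitution until it fails, and a
@code{b} without a label that skips the rest of the script.

```shell
# t loop: strip one innermost pair of parentheses per pass and
# branch back to the label for as long as a substitution succeeds.
printf 'a((b))\n' | sed -e ':again' -e 's/(\([^()]*\))/\1/' -e 't again'
# prints: ab

# b without a label: lines starting with # skip the remaining
# commands, so only the other lines get the "> " prefix.
printf 'keep\n#skip\n' | sed -e '/^#/b' -e 's/^/> /'
# prints: > keep
#         #skip
```

Giving each command its own @option{-e} avoids relying on how a
particular @command{sed} parses a semicolon after a label.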
1333
1334	@node Extended Commands
1335	@section Commands Specific to @value{SSED}
1336
1337	These commands are specific to @value{SSED}, so you
1338	must use them with care and only when you are sure that
1339	hindering portability is not evil. They allow you to check
1340	for @value{SSED} extensions or to do tasks that are required
1341	quite often, yet are unsupported by standard @command{sed}s.
1342
1343	@table @code
1344	@item e [@var{command}]
1345	@findex e (evaluate) command
1346	@cindex Evaluate Bourne-shell commands
1347	@cindex Subprocesses
1348	@cindex @value{SSEDEXT}, evaluating Bourne-shell commands
1349	@cindex @value{SSEDEXT}, subprocesses
1350	This command allows one to pipe input from a shell command
1351	into pattern space. Without parameters, the @code{e} command
1352	executes the command that is found in pattern space and
1353	replaces the pattern space with the output; a trailing newline
1354	is suppressed.
1355
1356	If a parameter is specified, instead, the @code{e} command
1357	interprets it as a command and sends its output to the output stream
1358	(like @code{r} does). The command can run across multiple
1359	lines, all but the last ending with a backslash.
1360
1361	In both cases, the results are undefined if the command to be
1362	executed contains a @sc{nul} character.
1363
1364	@item L @var{n}
1365	@findex L (fLow paragraphs) command
1366	@cindex Reformat pattern space
1367	@cindex Reformatting paragraphs
1368	@cindex @value{SSEDEXT}, reformatting paragraphs
1369	@cindex @value{SSEDEXT}, @code{L} command
1370	This @value{SSED} extension fills and joins lines in pattern space
1371	to produce output lines of (at most) @var{n} characters, like
1372	@code{fmt} does; if @var{n} is omitted, the default as specified
1373	on the command line is used. This command is considered a failed
1374	experiment and unless there is enough demand (which seems unlikely)
1375	will be removed in future versions.
1376
1377	@ignore
1378	Blank lines, spaces between words, and indentation are
1379	preserved in the output; successive input lines with different
1380	indentation are not joined; tabs are expanded to 8 columns.
1381
1382	If the pattern space contains multiple lines, they are joined, but
1383	since the pattern space usually contains a single line, the behavior
1384	of a simple @code{L;d} script is the same as @samp{fmt -s} (i.e.,
1385	it does not join short lines to form longer ones).
1386
1387	@var{n} specifies the desired line-wrap length; if omitted,
1388	the default as specified on the command line is used.
1389	@end ignore
1390
1391	@item Q [@var{exit-code}]
1392	This command only accepts a single address.
1393
1394	@findex Q (silent Quit) command
1395	@cindex @value{SSEDEXT}, quitting silently
1396	@cindex @value{SSEDEXT}, returning an exit code
1397	@cindex Quitting
1398	This command is the same as @code{q}, but will not print the
1399	contents of pattern space. Like @code{q}, it provides the
1400	ability to return an exit code to the caller.
1401
1402	This command can be useful because the only alternative ways
1403	to accomplish this apparently trivial function are to use
1404	the @option{-n} option (which can unnecessarily complicate
1405	your script) or resorting to the following snippet, which
1406	wastes time by reading the whole file without any visible effect:
1407
1408	@example
1409	:eat
1410	$d @i{Quit silently on the last line}
1411	N @i{Read another line, silently}
1412	g @i{Overwrite pattern space each time to save memory}
1413	b eat
1414	@end example
1415
1416	@item R @var{filename}
1417	@findex R (read line) command
1418	@cindex Read text from a file
1419	@cindex @value{SSEDEXT}, reading a file a line at a time
1420	@cindex @value{SSEDEXT}, @code{R} command
1421	@cindex @value{SSEDEXT}, @file{/dev/stdin} file
1422	Queue a line of @var{filename} to be read and
1423	inserted into the output stream at the end of the current cycle,
1424	or when the next input line is read.
1425	Note that if @var{filename} cannot be read, or if its end is
1426	reached, no line is appended, without any error indication.
1427
1428	As with the @code{r} command, the special value @file{/dev/stdin}
1429	is supported for the file name, which reads a line from the
1430	standard input.
1431
1432	@item T @var{label}
1433	@findex T (test and branch if failed) command
1434	@cindex @value{SSEDEXT}, branch if @code{s///} failed
1435	@cindex Branch to a label, if @code{s///} failed
1436	@cindex Conditional branch
1437	Branch to @var{label} only if there have been no successful
1438	@code{s}ubstitutions since the last input line was read or
1439	conditional branch was taken. The @var{label} may be omitted,
1440	in which case the next cycle is started.
1441
1442	@item v @var{version}
1443	@findex v (version) command
1444	@cindex @value{SSEDEXT}, checking for their presence
1445	@cindex Requiring @value{SSED}
1446	This command does nothing, but makes @command{sed} fail if
1447	@value{SSED} extensions are not supported, simply because other
1448	versions of @command{sed} do not implement it. In addition, you
1449	can specify the version of @command{sed} that your script
1450	requires, such as @code{4.0.5}. The default is @code{4.0}
1451	because that is the first version that implemented this command.
1452
1453	This command enables all @value{SSEDEXT} even if
1454	@env{POSIXLY_CORRECT} is set in the environment.
1455
1456	@item W @var{filename}
1457	@findex W (write first line) command
1458	@cindex Write first line to a file
1459	@cindex @value{SSEDEXT}, writing first line to a file
1460	Write to the given filename the portion of the pattern space up to
1461	the first newline. Everything said under the @code{w} command about
1462	file handling holds here too.
1463	@end table
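
The difference between @code{q} and @code{Q}, and the behavior of
@code{T}, can be sketched as follows.  These commands require
@value{SSED}; the input and the exit code 3 are arbitrary choices made
for illustration.

```shell
# q prints the current pattern space before quitting; Q does not.
printf '1\n2\n3\n' | sed 2q
# prints: 1 2 (one per line)
printf '1\n2\n3\n' | sed 2Q
# prints: 1

# Quit silently with exit status 3 at the first line containing "error".
printf 'ok\nerror\nmore\n' | sed '/error/Q3' || echo "exit status: $?"
# prints: ok
#         exit status: 3

# T without a label: skip the rest of the script when the preceding
# substitution did not succeed.
printf 'foo\nbar\n' | sed -e 's/foo/FOO/' -e 'T' -e 's/$/ (changed)/'
# prints: FOO (changed)
#         bar
```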
1464
1465	@node Escapes
1466	@section @acronym{GNU} Extensions for Escapes in Regular Expressions
1467
1468	@cindex @acronym{GNU} extensions, special escapes
1469	Until this chapter, we have only encountered escapes of the form
1470	@samp{\^}, which tell @command{sed} not to interpret the circumflex
1471	as a special character, but rather to take it literally. For
1472	example, @samp{\*} matches a single asterisk rather than zero
1473	or more backslashes.
1474
1475	@cindex @code{POSIXLY_CORRECT} behavior, escapes
1476	This chapter introduces another kind of escape@footnote{All
1477	the escapes introduced here are @acronym{GNU}
1478	extensions, with the exception of @code{\n}. In basic regular
1479	expression mode, setting @code{POSIXLY_CORRECT} disables them inside
1480	bracket expressions.}---that
1481	is, escapes that are applied to a character or sequence of characters
1482	that ordinarily are taken literally, and that @command{sed} replaces
1483	with a special character. This provides a way
1484	of encoding non-printable characters in patterns in a visible manner.
1485	There is no restriction on the appearance of non-printing characters
1486	in a @command{sed} script but when a script is being prepared in the
1487	shell or by text editing, it is usually easier to use one of
1488	the following escape sequences than the binary character it
1489	represents.
1490
1491	The list of these escapes is:
1492
1493	@table @code
1494	@item \a
1495	Produces or matches a @sc{bel} character, that is an ``alert'' (@sc{ascii} 7).
1496
1497	@item \f
1498	Produces or matches a form feed (@sc{ascii} 12).
1499
1500	@item \n
1501	Produces or matches a newline (@sc{ascii} 10).
1502
1503	@item \r
1504	Produces or matches a carriage return (@sc{ascii} 13).
1505
1506	@item \t
1507	Produces or matches a horizontal tab (@sc{ascii} 9).
1508
1509	@item \v
1510	Produces or matches a so-called ``vertical tab'' (@sc{ascii} 11).
1511
1512	@item \c@var{x}
1513	Produces or matches @kbd{@sc{Control}-@var{x}}, where @var{x} is
1514	any character. The precise effect of @samp{\c@var{x}} is as follows:
1515	if @var{x} is a lower case letter, it is converted to upper case.
1516	Then bit 6 of the character (hex 40) is inverted. Thus @samp{\cz} becomes
1517	hex 1A, but @samp{\c@{} becomes hex 3B, while @samp{\c;} becomes hex 7B.
1518
1519	@item \d@var{xxx}
1520	Produces or matches a character whose decimal @sc{ascii} value is @var{xxx}.
1521
1522	@item \o@var{xxx}
1523	@ifset PERL
1524	@item \@var{xxx}
1525	@end ifset
1526	Produces or matches a character whose octal @sc{ascii} value is @var{xxx}.
1527	@ifset PERL
1528	The syntax without the @code{o} is active in Perl mode, while the one
1529	with the @code{o} is active in the normal or extended @sc{posix} regular
1530	expression modes.
1531	@end ifset
1532
1533	@item \x@var{xx}
1534	Produces or matches a character whose hexadecimal @sc{ascii} value is @var{xx}.
1535	@end table
1536
1537	@samp{\b} (backspace) was omitted because of the conflict with
1538	the existing ``word boundary'' meaning.
1539
1540	Other escapes match a particular character class and are valid only in
1541	regular expressions:
1542
1543	@table @code
1544	@item \w
1545	Matches any ``word'' character. A ``word'' character is any
1546	letter or digit or the underscore character.
1547
1548	@item \W
1549	Matches any ``non-word'' character.
1550
1551	@item \b
1552	Matches a word boundary; that is, it matches if the character
1553	to the left is a ``word'' character and the character to the
1554	right is a ``non-word'' character, or vice-versa.
1555
1556	@item \B
1557	Matches everywhere but on a word boundary; that is, it matches
1558	if the character to the left and the character to the right
1559	are either both ``word'' characters or both ``non-word''
1560	characters.
1561
1562	@item \`
1563	Matches only at the start of pattern space. This is different
1564	from @code{^} in multi-line mode.
1565
1566	@item \'
1567	Matches only at the end of pattern space. This is different
1568	from @code{$} in multi-line mode.
1569
1570	@ifset PERL
1571	@item \G
1572	Match only at the start of pattern space or, when doing a global
1573	substitution using the @code{s///g} command and option, at
1574	the end-of-match position of the prior match. For example,
1575	@samp{s/\Ga/Z/g} will change an initial run of @code{a}s to
1576	a run of @code{Z}s.
1577	@end ifset
1578	@end table
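
A brief sketch of some of these escapes in use, assuming @value{SSED}
with @env{POSIXLY_CORRECT} unset; the sample strings are invented.

```shell
# \t matches a tab character.
printf 'a\tb\n' | sed 's/\t/,/'
# prints: a,b

# \x42 and \o102 both denote the character B (hex 42, octal 102).
printf 'ABC\n' | sed 's/\x42/-/'
# prints: A-C
printf 'ABC\n' | sed 's/\o102/-/'
# prints: A-C

# \b restricts the match to whole words.
printf 'cat concat\n' | sed 's/\bcat\b/dog/g'
# prints: dog concat

# \W matches any non-word character; the underscore counts as a
# word character and is left alone.
printf 'a-b_c\n' | sed 's/\W/ /g'
# prints: a b_c
```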
1579
1580	@node Examples
1581	@chapter Some Sample Scripts
1582
1583	Here are some @command{sed} scripts to guide you in the art of mastering
1584	@command{sed}.
1585
1586	@menu
1587	Some exotic examples:
1588	* Centering lines::
1589	* Increment a number::
1590	* Rename files to lower case::
1591	* Print bash environment::
1592	* Reverse chars of lines::
1593
1594	Emulating standard utilities:
1595	* tac:: Reverse lines of files
1596	* cat -n:: Numbering lines
1597	* cat -b:: Numbering non-blank lines
1598	* wc -c:: Counting chars
1599	* wc -w:: Counting words
1600	* wc -l:: Counting lines
1601	* head:: Printing the first lines
1602	* tail:: Printing the last lines
1603	* uniq:: Make duplicate lines unique
1604	* uniq -d:: Print duplicated lines of input
1605	* uniq -u:: Remove all duplicated lines
1606	* cat -s:: Squeezing blank lines
1607	@end menu
1608
1609	@node Centering lines
1610	@section Centering Lines
1611
1612	This script centers all lines of a file within a width of 80 columns.
1613	To change that width, the number in @code{\@{@dots{}\@}} must be
1614	replaced, and the number of added spaces must be changed as well.
1615
1616	Note how the buffer commands are used to separate parts in
1617	the regular expressions to be matched---this is a common
1618	technique.
1619
1620	@c start-------------------------------------------
1621	@example
1622	#!/usr/bin/sed -f
1623
1624	# Put 80 spaces in the buffer
1625	1 @{
1626	x
1627	s/^$/          /
1628	s/^.*$/&&&&&&&&/
1629	x
1630	@}
1631
1632	# del leading and trailing spaces
1633	y/@kbd{tab}/ /
1634	s/^ *//
1635	s/ *$//
1636
1637	# add a newline and 80 spaces to end of line
1638	G
1639
1640	# keep first 81 chars (80 + a newline)
1641	s/^\(.\@{81\@}\).*$/\1/
1642
1643	# \2 matches half of the spaces, which are moved to the beginning
1644	s/^\(.*\)\n\(.*\)\2/\2\1/
1645	@end example
1646	@c end---------------------------------------------
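
To see the effect of changing the width, here is a sketch of the same
technique for a width of 10 columns, wrapped in a shell function. The
name @code{center_10} and the width are assumptions chosen to keep the
example short; tab handling is left out.

```shell
#!/bin/sh
# Sketch only: center_10 is a hypothetical name, and the width of 10
# is an arbitrary choice for illustration.
center_10() {
    sed '
# put 10 spaces in the hold space (first line only)
1 {
  x
  s/^$/          /
  x
}
# strip leading and trailing spaces
s/^ *//
s/ *$//
# append a newline and the 10 spaces
G
# keep the first 11 characters (10 + the newline)
s/^\(.\{11\}\).*$/\1/
# \2 is half of the remaining spaces; move it to the front
s/^\(.*\)\n\(.*\)\2/\2\1/
'
}
```

Both numbers change together: the hold space receives as many spaces as
the width, and the interval expression keeps one character more than that.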
1647
1648	@node Increment a number
1649	@section Increment a Number
1650
1651	This script is one of a few that demonstrate how to do arithmetic
1652	in @command{sed}. This is indeed possible,@footnote{@command{sed} guru Greg
1653	Ubben wrote an implementation of the @command{dc} @sc{rpn} calculator!
1654	It is distributed together with sed.} but must be done manually.
1655
1656	To increment a number you add 1 to its last digit, replacing
1657	that digit with the following one. There is one exception: when the
1658	digit is a nine, it becomes a zero and the digit before it must be
1659	incremented as well, and so on until a digit other than nine is found.
1660
1661	This solution by Bruno Haible is very clever because
1662	it uses a single buffer; if you don't have this limitation, the
1663	algorithm used in @ref{cat -n, Numbering lines}, is faster.
1664	It works by replacing trailing nines with an underscore, then
1665	using multiple @code{s} commands to increment the last digit,
1666	and then again substituting underscores with zeros.
1667
1668	@c start-------------------------------------------
1669	@example
1670	#!/usr/bin/sed -f
1671
1672	/[^0-9]/ d
1673
1674	# replace all trailing 9s by _ (any other character except digits
1675	# could be used)
1676	:d
1677	s/9\(_*\)$/_\1/
1678	td
1679
1680	# incr last digit only. The first line adds a most-significant
1681	# digit of 1 if we have to add a digit.
1682	#
1683	# The @code{tn} commands are not necessary, but make the thing
1684	# faster
1685
1686	s/^\(_*\)$/1\1/; tn
1687	s/8\(_*\)$/9\1/; tn
1688	s/7\(_*\)$/8\1/; tn
1689	s/6\(_*\)$/7\1/; tn
1690	s/5\(_*\)$/6\1/; tn
1691	s/4\(_*\)$/5\1/; tn
1692	s/3\(_*\)$/4\1/; tn
1693	s/2\(_*\)$/3\1/; tn
1694	s/1\(_*\)$/2\1/; tn
1695	s/0\(_*\)$/1\1/; tn
1696
1697	:n
1698	y/_/0/
1699	@end example
1700	@c end---------------------------------------------
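
As a quick check of the algorithm, the same commands can be wrapped in a
shell function and fed a few numbers. The function name @code{increment}
is an assumption made for this sketch.

```shell
#!/bin/sh
# Sketch only: increment is a hypothetical wrapper around the
# single-buffer algorithm described above.
increment() {
    sed '
/[^0-9]/ d
# turn trailing 9s into underscores, one per pass
:d
s/9\(_*\)$/_\1/
td
# bump the last remaining digit (or add a leading 1)
s/^\(_*\)$/1\1/; tn
s/8\(_*\)$/9\1/; tn
s/7\(_*\)$/8\1/; tn
s/6\(_*\)$/7\1/; tn
s/5\(_*\)$/6\1/; tn
s/4\(_*\)$/5\1/; tn
s/3\(_*\)$/4\1/; tn
s/2\(_*\)$/3\1/; tn
s/1\(_*\)$/2\1/; tn
s/0\(_*\)$/1\1/; tn
:n
# the underscores stand for the 9s that rolled over to 0
y/_/0/
'
}
```

With this wrapper, @samp{echo 199 | increment} goes through the
intermediate forms @samp{19_}, @samp{1__} and @samp{2__} before the final
@code{y} command turns it into @samp{200}.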
1701
1702	@node Rename files to lower case
1703	@section Rename Files to Lower Case
1704
1705	This is a pretty strange use of @command{sed}. We transform text
1706	into shell commands, then just feed them to the shell.
1707	Don't worry, even worse hacks are done when using @command{sed}; I have
1708	seen a script converting the output of @command{date} into a @command{bc}
1709	program!
1710
1711	The main body of this is the @command{sed} script, which remaps the name
1712	from lower to upper case (or vice versa) and even checks
1713	whether the remapped name is the same as the original name.
1714	Note how the script is parameterized using shell
1715	variables and proper quoting.
1716
1717	@c start-------------------------------------------
1718	@example
1719	#! /bin/sh
1720	# rename files to lower/upper case...
1721	#
1722	# usage:
1723	# move-to-lower *
1724	# move-to-upper *
1725	# or
1726	# move-to-lower -R .
1727	# move-to-upper -R .
1728	#
1729
1730	help()
1731	@{
1732	cat << eof
1733	Usage: $0 [-n] [-R] [-h] files...
1734
1735	-n do nothing, only see what would be done
1736	-R recursive (use find)
1737	-h this message
1738	files files to remap to lower case
1739
1740	Examples:
1741	$0 -n * (see if everything is ok, then...)
1742	$0 *
1743
1744	$0 -R .
1745
1746	eof
1747	@}
1748
1749	apply_cmd='sh'
1750	finder='echo "$@@" | tr " " "\n"'
1751	files_only=
1752
1753	while :
1754	do
1755	case "$1" in
1756	-n) apply_cmd='cat' ;;
1757	-R) finder='find "$@@" -type f';;
1758	-h) help ; exit 1 ;;
1759	*) break ;;
1760	esac
1761	shift
1762	done
1763
1764	if [ -z "$1" ]; then
1765	echo Usage: $0 [-h] [-n] [-R] files...
1766	exit 1
1767	fi
1768
1769	LOWER='abcdefghijklmnopqrstuvwxyz'
1770	UPPER='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
1771
1772	case `basename $0` in
1773	*upper*) TO=$UPPER; FROM=$LOWER ;;
1774	*) FROM=$UPPER; TO=$LOWER ;;
1775	esac
1776
1777	eval $finder | sed -n '
1778
1779	# remove all trailing slashes
1780	s/\/*$//
1781
1782	# add ./ if there is no path, only a filename
1783	/\//! s/^/.\//
1784
1785	# save path+filename
1786	h
1787
1788	# remove path
1789	s/.*\///
1790
1791	# do conversion only on filename
1792	y/'$FROM'/'$TO'/
1793
1794	# now line contains original path+file, while
1795	# hold space contains the new filename
1796	x
1797
1798	# add converted file name to line, which now contains
1799	# path/file-name\nconverted-file-name
1800	G
1801
1802	# check if converted file name is equal to original file name,
1803	# if it is, do not print anything
1804	/^.*\/\(.*\)\n\1/b
1805
1806	# now, transform path/fromfile\n, into
1807	# mv path/fromfile path/tofile and print it
1808	s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p
1809
1810	' | $apply_cmd
1811	@end example
1812	@c end---------------------------------------------
1813
1814	@node Print bash environment
1815	@section Print @command{bash} Environment
1816
1817	This script strips the definition of the shell functions
1818	from the output of the @command{set} Bourne-shell command.
1819
1820	@c start-------------------------------------------
1821	@example
1822	#!/bin/sh
1823
1824	set | sed -n '
1825	:x
1826
1827	@ifinfo
1828	# if no occurrence of "=()" print and load next line
1829	@end ifinfo
1830	@ifnotinfo
1831	# if no occurrence of @samp{=()} print and load next line
1832	@end ifnotinfo
1833	/=()/! @{ p; b; @}
1834	/ () $/! @{ p; b; @}
1835
1836	# possible start of functions section
1837	# save the line in case this is a var like FOO="() "
1838	h
1839
1840	# if the next line has a brace, we quit because
1841	# nothing comes after functions
1842	n
1843	/^@{/ q
1844
1845	# print the old line
1846	x; p
1847
1848	# work on the new line now
1849	x; bx
1850	'
1851	@end example
1852	@c end---------------------------------------------
1853
1854	@node Reverse chars of lines
1855	@section Reverse Characters of Lines
1856
1857	This script can be used to reverse the position of characters
1858	in lines. The technique moves two characters at a time, hence
1859	it is faster than more intuitive implementations.
1860
1861	Note the @code{tx} command before the definition of the label.
1862	This is often needed to reset the flag that is tested by
1863	the @code{t} command.
1864
1865	Imaginative readers will find uses for this script. An example
1866	is reversing the output of @command{banner}.@footnote{This requires
1867	another script to pad the output of banner; for example
1868
1869	@example
1870	#! /bin/sh
1871
1872	banner -w $1 $2 $3 $4 |
1873	sed -e :a -e '/^.\@{0,'$1'\@}$/ @{ s/$/ /; ba; @}' |
1874	~/sedscripts/reverseline.sed
1875	@end example
1876	}
1877
1878	@c start-------------------------------------------
1879	@example
1880	#!/usr/bin/sed -f
1881
1882	/../! b
1883
1884	# Reverse a line. Begin embedding the line between two newlines
1885	s/^.*$/\
1886	&\
1887	/
1888
1889	# Move first character at the end. The regexp matches until
1890	# there are zero or one characters between the markers
1891	tx
1892	:x
1893	s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
1894	tx
1895
1896	# Remove the newline markers
1897	s/\n//g
1898	@end example
1899	@c end---------------------------------------------
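
The two-characters-at-a-time loop can be exercised from the shell as
follows. The wrapper name @code{reverse_line} is an assumption of this
sketch.

```shell
#!/bin/sh
# Sketch only: reverse_line is a hypothetical wrapper name.
reverse_line() {
    sed '
# lines shorter than two characters are already reversed
/../! b
# embed the line between two newline markers
s/^.*$/\
&\
/
# reset the t flag, then swap the characters next to the markers
tx
:x
s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
tx
# remove the markers
s/\n//g
'
}
```

For the input @samp{abcd}, pattern space goes through
@samp{\nabcd\n}, @samp{d\nbc\na} and @samp{dc\n\nba} (writing the
markers as @samp{\n}) before the markers are removed.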
1900
1901	@node tac
1902	@section Reverse Lines of Files
1903
1904	This one begins a series of totally useless (yet interesting)
1905	scripts emulating various Unix commands. This, in particular,
1906	is a @command{tac} workalike.
1907
1908	Note that on implementations other than @acronym{GNU} @command{sed}
1909	@ifset PERL
1910	and @value{SSED}
1911	@end ifset
1912	this script might easily overflow internal buffers.
1913
1914	@c start-------------------------------------------
1915	@example
1916	#!/usr/bin/sed -nf
1917
1918	# reverse all lines of input, i.e. the first line becomes the last, ...
1919
1920	# from the second line, the buffer (which contains all previous lines)
1921	# is appended to current line, so, the order will be reversed
1922	1! G
1923
1924	# on the last line we're done -- print everything
1925	$ p
1926
1927	# store everything on the buffer again
1928	h
1929	@end example
1930	@c end---------------------------------------------
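
The same three commands also fit on one line. A minimal sketch, in which
the function name @code{tac_sed} is made up for the illustration:

```shell
#!/bin/sh
# Sketch only: tac_sed is a hypothetical name.
# 1!G  append the hold space (all previous lines, reversed) to the line
# $p   on the last line, print the accumulated result
# h    save the accumulated result for the next cycle
tac_sed() {
    sed -n '1!G;$p;h'
}
```

Note that the whole input ends up in the hold space, so memory use grows
with the size of the file.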
1931
1932	@node cat -n
1933	@section Numbering Lines
1934
1935	This script replaces @samp{cat -n}; in fact it formats its output
1936	exactly like @acronym{GNU} @command{cat} does.
1937
1938	Of course this is completely useless, for two reasons: first,
1939	because somebody else did it in C, second, because the following
1940	Bourne-shell script could be used for the same purpose and would
1941	be much faster:
1942
1943	@c start-------------------------------------------
1944	@example
1945	#! /bin/sh
1946	sed -e "=" $@@ | sed -e '
1947	s/^/      /
1948	N
1949	s/^ *\(......\)\n/\1  /
1950	'
1951	@end example
1952	@c end---------------------------------------------
1953
1954	It uses @command{sed} to print the line number, then groups lines two
1955	by two using @code{N}. Of course, this script does not teach as much as
1956	the one presented below.
1957
1958	The algorithm used for incrementing uses both buffers, so the line
1959	is printed as soon as possible and then discarded. The number
1960	is split so that changing digits go in a buffer and unchanged ones go
1961	in the other; the changed digits are modified in a single step
1962	(using a @code{y} command). The line number for the next line
1963	is then composed and stored in the hold space, to be used in the
1964	next iteration.
1965
1966	@c start-------------------------------------------
1967	@example
1968	#!/usr/bin/sed -nf
1969
1970	# Prime the pump on the first line
1971	x
1972	/^$/ s/^.*$/1/
1973
1974	# Add the correct line number before the pattern
1975	G
1976	h
1977
1978	# Format it and print it
1979	s/^/      /
1980	s/^ *\(......\)\n/\1  /p
1981
1982	# Get the line number from hold space; add a zero
1983	# if we're going to add a digit on the next line
1984	g
1985	s/\n.*$//
1986	/^9*$/ s/^/0/
1987
1988	# separate changing/unchanged digits with an x
1989	s/.9*$/x&/
1990
1991	# keep changing digits in hold space
1992	h
1993	s/^.*x//
1994	y/0123456789/1234567890/
1995	x
1996
1997	# keep unchanged digits in pattern space
1998	s/x.*$//
1999
2000	# compose the new number, remove the newline implicitly added by G
2001	G
2002	s/\n//
2003	h
2004	@end example
2005	@c end---------------------------------------------
2006
2007	@node cat -b
2008	@section Numbering Non-blank Lines
2009
2010	Emulating @samp{cat -b} is almost the same as @samp{cat -n}---we only
2011	have to select which lines are to be numbered and which are not.
2012
2013	The part that is common to this script and the previous one is
2014	not commented to show how important it is to comment @command{sed}
2015	scripts properly...
2016
2017	@c start-------------------------------------------
2018	@example
2019	#!/usr/bin/sed -nf
2020
2021	/^$/ @{
2022	p
2023	b
2024	@}
2025
2026	# Same as cat -n from now
2027	x
2028	/^$/ s/^.*$/1/
2029	G
2030	h
2031	s/^/      /
2032	s/^ *\(......\)\n/\1  /p
2033	x
2034	s/\n.*$//
2035	/^9*$/ s/^/0/
2036	s/.9*$/x&/
2037	h
2038	s/^.*x//
2039	y/0123456789/1234567890/
2040	x
2041	s/x.*$//
2042	G
2043	s/\n//
2044	h
2045	@end example
2046	@c end---------------------------------------------
2047
2048	@node wc -c
2049	@section Counting Characters
2050
2051	This script shows another way to do arithmetic with @command{sed}.
2052	In this case we have to add possibly large numbers, so implementing
2053	this by successive increments would not be feasible (and possibly
2054	even more complicated to contrive than this script).
2055
2056	The approach is to map numbers to letters, kind of an abacus
2057	implemented with @command{sed}. @samp{a}s are units, @samp{b}s are
2058	tens and so on: we simply add the number of characters
2059	on the current line as units, and then propagate the carry
2060	to tens, hundreds, and so on.
2061
2062	As usual, running totals are kept in hold space.
2063
2064	On the last line, we convert the abacus form back to decimal.
2065	For the sake of variety, this is done with a loop rather than
2066	with some 80 @code{s} commands@footnote{Some implementations
2067	have a limit of 199 commands per script}: first we
2068	convert units, removing @samp{a}s from the number; then we
2069	rotate letters so that tens become @samp{a}s, and so on
2070	until no more letters remain.
2071
2072	@c start-------------------------------------------
2073	@example
2074	#!/usr/bin/sed -nf
2075
2076	# Add n+1 a's to hold space (+1 is for the newline)
2077	s/./a/g
2078	H
2079	x
2080	s/\n/a/
2081
2082	# Do the carry. The t's and b's are not necessary,
2083	# but they do speed up the thing
2084	t a
2085	: a; s/aaaaaaaaaa/b/g; t b; b done
2086	: b; s/bbbbbbbbbb/c/g; t c; b done
2087	: c; s/cccccccccc/d/g; t d; b done
2088	: d; s/dddddddddd/e/g; t e; b done
2089	: e; s/eeeeeeeeee/f/g; t f; b done
2090	: f; s/ffffffffff/g/g; t g; b done
2091	: g; s/gggggggggg/h/g; t h; b done
2092	: h; s/hhhhhhhhhh//g
2093
2094	: done
2095	$! @{
2096	h
2097	b
2098	@}
2099
2100	# On the last line, convert back to decimal
2101
2102	: loop
2103	/a/! s/[b-h]*/&0/
2104	s/aaaaaaaaa/9/
2105	s/aaaaaaaa/8/
2106	s/aaaaaaa/7/
2107	s/aaaaaa/6/
2108	s/aaaaa/5/
2109	s/aaaa/4/
2110	s/aaa/3/
2111	s/aa/2/
2112	s/a/1/
2113
2114	: next
2115	y/bcdefgh/abcdefg/
2116	/[a-h]/ b loop
2117	p
2118	@end example
2119	@c end---------------------------------------------
2120
2121	@node wc -w
2122	@section Counting Words
2123
2124	This script is almost the same as the previous one, once each
2125	of the words on the line is converted to a single @samp{a}
2126	(in the previous script each letter was changed to an @samp{a}).
2127
2128	It is interesting that real @command{wc} programs have optimized
2129	loops for @samp{wc -c}, so they are much slower at counting
2130	words than at counting characters. This script's bottleneck,
2131	instead, is arithmetic, and hence the word-counting one
2132	is faster (it has to manage smaller numbers).
2133
2134	Again, the common parts are not commented to show the importance
2135	of commenting @command{sed} scripts.
2136
2137	@c start-------------------------------------------
2138	@example
2139	#!/usr/bin/sed -nf
2140
2141	# Convert words to a's
2142	s/[ @kbd{tab}][ @kbd{tab}]*/ /g
2143	s/^/ /
2144	s/ [^ ][^ ]*/a /g
2145	s/ //g
2146
2147	# Append them to hold space
2148	H
2149	x
2150	s/\n//
2151
2152	# From here on it is the same as in wc -c.
2153	/aaaaaaaaaa/! bx; s/aaaaaaaaaa/b/g
2154	/bbbbbbbbbb/! bx; s/bbbbbbbbbb/c/g
2155	/cccccccccc/! bx; s/cccccccccc/d/g
2156	/dddddddddd/! bx; s/dddddddddd/e/g
2157	/eeeeeeeeee/! bx; s/eeeeeeeeee/f/g
2158	/ffffffffff/! bx; s/ffffffffff/g/g
2159	/gggggggggg/! bx; s/gggggggggg/h/g
2160	s/hhhhhhhhhh//g
2161	:x
2162	$! @{ h; b; @}
2163	:y
2164	/a/! s/[b-h]*/&0/
2165	s/aaaaaaaaa/9/
2166	s/aaaaaaaa/8/
2167	s/aaaaaaa/7/
2168	s/aaaaaa/6/
2169	s/aaaaa/5/
2170	s/aaaa/4/
2171	s/aaa/3/
2172	s/aa/2/
2173	s/a/1/
2174	y/bcdefgh/abcdefg/
2175	/[a-h]/ by
2176	p
2177	@end example
2178	@c end---------------------------------------------
2179
2180	@node wc -l
2181	@section Counting Lines
2182
2183	No strange things are done now, because @command{sed} gives us
2184	@samp{wc -l} functionality for free!!! Look:
2185
2186	@c start-------------------------------------------
2187	@example
2188	#!/usr/bin/sed -nf
2189	$=
2190	@end example
2191	@c end---------------------------------------------
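
From the shell, the same script is a one-liner. In this sketch the
function name @code{count_lines} is an assumption.

```shell
#!/bin/sh
# Sketch only: count_lines is a hypothetical name.
# -n suppresses the normal output; $= prints the line number of the
# last line, which is the number of lines read.
count_lines() {
    sed -n '$='
}
```

Unlike @samp{wc -l}, this prints nothing at all for empty input, because
there is no last line for the @code{$} address to match.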
2192
2193	@node head
2194	@section Printing the First Lines
2195
2196	This script is probably the simplest useful @command{sed} script.
2197	It displays the first 10 lines of input; the number of displayed
2198	lines is right before the @code{q} command.
2199
2200	@c start-------------------------------------------
2201	@example
2202	#!/usr/bin/sed -f
2203	10q
2204	@end example
2205	@c end---------------------------------------------
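
Since the count is simply the address of the @code{q} command, it is easy
to take it from a shell argument. A minimal sketch, with the hypothetical
function name @code{head_sed}:

```shell
#!/bin/sh
# Sketch only: head_sed is a hypothetical name; $1 is the number of
# lines to print and is assumed to be a positive integer.
head_sed() {
    sed "${1}q"
}
```

Because @code{q} stops reading as soon as the given line has been printed,
the rest of the input is never processed.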
2206
2207	@node tail
2208	@section Printing the Last Lines
2209
2210	Printing the last @var{n} lines rather than the first is more complex
2211	but indeed possible. @var{n} is encoded in the second line, before
2212	the bang character.
2213
2214	This script is similar to the @command{tac} script in that it keeps the
2215	final output in the hold space and prints it at the end:
2216
2217	@c start-------------------------------------------
2218	@example
2219	#!/usr/bin/sed -nf
2220
2221	1! @{; H; g; @}
2222	1,10 !s/[^\n]*\n//
2223	$p
2224	h
2225	@end example
2226	@c end---------------------------------------------
2227
2228	Mainly, the script keeps a window of 10 lines and slides it
2229	by adding a line and deleting the oldest (the substitution command
2230	on the second line works like a @code{D} command but does not
2231	restart the loop).
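
The window size can be made a parameter by building the second command
from a shell argument. This is a sketch under two assumptions: the function
name @code{tail_sed} is made up, and the @code{\n} inside the bracket
expression relies on @acronym{GNU} @command{sed}.

```shell
#!/bin/sh
# Sketch only: tail_sed is a hypothetical name; $1 is the window size.
# Assumes GNU sed, which accepts \n inside a bracket expression.
tail_sed() {
    sed -n -e '1!{H;g;}' \
           -e "1,${1}!s/[^\n]*\n//" \
           -e '$p' \
           -e 'h'
}
```

With a window of 2 and the input lines 1 to 5, the hold space contains
lines 1-2, then 2-3, 3-4 and finally 4-5, which is what gets printed.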
2232
2233	The ``sliding window'' technique is a very powerful way to write
2234	efficient and complex @command{sed} scripts, because commands like
2235	@code{P} would require a lot of work if implemented manually.
2236
2237	To introduce the technique, which is fully demonstrated in the
2238	rest of this chapter and is based on the @code{N}, @code{P}
2239	and @code{D} commands, here is an implementation of @command{tail}
2240	using a simple ``sliding window.''
2241
2242	This looks complicated but in fact the working is the same as
2243	the last script: after we have kicked in the appropriate number
2244	of lines, however, we stop using the hold space to keep inter-line
2245	state, and instead use @code{N} and @code{D} to slide pattern
2246	space by one line:
2247
2248	@c start-------------------------------------------
2249	@example
2250	#!/usr/bin/sed -f
2251
2252	1h
2253	2,10 @{; H; g; @}
2254	$q
2255	1,9d
2256	N
2257	D
2258	@end example
2259	@c end---------------------------------------------
2260
2261	Note how the first, second and fourth lines are inactive after
2262	the first ten lines of input. After that, all the script does
2263	is: exiting on the last line of input, appending the next input
2264	line to pattern space, and removing the first line.
2265
2266	@node uniq
2267	@section Make Duplicate Lines Unique
2268
2269	This is an example of the art of using the @code{N}, @code{P}
2270	and @code{D} commands, probably the most difficult to master.
2271
2272	@c start-------------------------------------------
2273	@example
2274	#!/usr/bin/sed -f
2275	h
2276
2277	:b
2278	# On the last line, print and exit
2279	$b
2280	N
2281	/^\(.*\)\n\1$/ @{
2282	# The two lines are identical. Undo the effect of
2283	# the N command.
2284	g
2285	bb
2286	@}
2287
2288	# If the @code{N} command had added the last line, print and exit
2289	$b
2290
2291	# The lines are different; print the first and go
2292	# back working on the second.
2293	P
2294	D
2295	@end example
2296	@c end---------------------------------------------
2297
2298	As you can see, we maintain a 2-line window using @code{P} and @code{D}.
2299	This technique is often used in advanced @command{sed} scripts.
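
To watch the 2-line window at work, the script can be wrapped in a shell
function and run on a short input. The name @code{uniq_sed} is an
assumption of this sketch.

```shell
#!/bin/sh
# Sketch only: uniq_sed is a hypothetical wrapper name.
uniq_sed() {
    sed '
# remember the current line
h
:b
# on the last line, print and exit
$b
# pull the next line into the window
N
/^\(.*\)\n\1$/ {
    # both lines are identical: drop the copy and look further
    g
    bb
}
# N consumed the last line: print both lines and exit
$b
# the lines differ: print the first, keep working on the second
P
D
'
}
```

For the input lines @samp{a}, @samp{a}, @samp{b}, the window holds
@samp{a\na} (collapsed back to @samp{a}) and then @samp{a\nb}, which is
printed as is because the second @code{$b} fires.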
2300
2301	@node uniq -d
2302	@section Print Duplicated Lines of Input
2303
2304	This script prints only duplicated lines, like @samp{uniq -d}.
2305
2306	@c start-------------------------------------------
2307	@example
2308	#!/usr/bin/sed -nf
2309
2310	$b
2311	N
2312	/^\(.*\)\n\1$/ @{
2313	# Print the first of the duplicated lines
2314	s/.*\n//
2315	p
2316
2317	# Loop until we get a different line
2318	:b
2319	$b
2320	N
2321	/^\(.*\)\n\1$/ @{
2322	s/.*\n//
2323	bb
2324	@}
2325	@}
2326
2327	# The last line cannot be followed by duplicates
2328	$b
2329
2330	# Found a different one. Leave it alone in the pattern space
2331	# and go back to the top, hunting its duplicates
2332	D
2333	@end example
2334	@c end---------------------------------------------
2335
2336	@node uniq -u
2337	@section Remove All Duplicated Lines
2338
2339	This script prints only unique lines, like @samp{uniq -u}.
2340
2341	@c start-------------------------------------------
2342	@example
2343	#!/usr/bin/sed -f
2344
2345	# Search for a duplicate line --- until that, print what you find.
2346	$b
2347	N
2348	/^\(.*\)\n\1$/ ! @{
2349	P
2350	D
2351	@}
2352
2353	:c
2354	# Got two equal lines in pattern space. At the
2355	# end of the file we simply exit
2356	$d
2357
2358	# Else, we keep reading lines with @code{N} until we
2359	# find a different one
2360	s/.*\n//
2361	N
2362	/^\(.*\)\n\1$/ @{
2363	bc
2364	@}
2365
2366	# Remove the last instance of the duplicate line
2367	# and go back to the top
2368	D
2369	@end example
2370	@c end---------------------------------------------
2371
2372	@node cat -s
2373	@section Squeezing Blank Lines
2374
2375	As a final example, here are three scripts, of increasing complexity
2376	and speed, that implement the same function as @samp{cat -s}, that is
2377	squeezing blank lines.
2378
2379	The first leaves a blank line at the beginning and end if there are
2380	some already.
2381
2382	@c start-------------------------------------------
2383	@example
2384	#!/usr/bin/sed -f
2385
2386	# on empty lines, join with next
2387	# Note there is a star in the regexp
2388	:x
2389	/^\n*$/ @{
2390	N
2391	bx
2392	@}
2393
2394	# now, squeeze all '\n', this can be also done by:
2395	# s/^\(\n\)*/\1/
2396	s/\n*/\
2397	/
2398	@end example
2399	@c end---------------------------------------------
2400
2401	This one is a bit more complex and removes all empty lines
2402	at the beginning. It does leave a single blank line at the end
2403	if one was there.
2404
2405	@c start-------------------------------------------
2406	@example
2407	#!/usr/bin/sed -f
2408
2409	# delete all leading empty lines
2410	1,/^./@{
2411	/./!d
2412	@}
2413
2414	# on an empty line we remove it and all the following
2415	# empty lines, but one
2416	:x
2417	/./!@{
2418	N
2419	s/^\n$//
2420	tx
2421	@}
2422	@end example
2423	@c end---------------------------------------------
2424
2425	This removes leading and trailing blank lines. It is also the
2426	fastest. Note that loops are completely done with @code{n} and
2427	@code{b}, without relying on @command{sed} to restart
2428	the script automatically at the end of a line.
2429
2430	@c start-------------------------------------------
2431	@example
2432	#!/usr/bin/sed -nf
2433
2434	# delete all (leading) blanks
2435	/./!d
2436
2437	# get here: so there is a non empty
2438	:x
2439	# print it
2440	p
2441	# get next
2442	n
2443	# got chars? print it again, etc...
2444	/./bx
2445
2446	# no, don't have chars: got an empty line
2447	:z
2448	# get next, if last line we finish here so no trailing
2449	# empty lines are written
2450	n
2451	# also empty? then ignore it, and get next... this will
2452	# remove ALL empty lines
2453	/./!bz
2454
2455	# all empty lines were deleted/ignored, but we have a non empty. As
2456	# what we want to do is to squeeze, insert a blank line artificially
2457	i\
2458
2459	bx
2460	@end example
2461	@c end---------------------------------------------
2462
2463	@node Limitations
2464	@chapter @value{SSED}'s Limitations and Non-limitations
2465
2466	@cindex @acronym{GNU} extensions, unlimited line length
2467	@cindex Portability, line length limitations
2468	For those who want to write portable @command{sed} scripts,
2469	be aware that some implementations have been known to
2470	limit line lengths (for the pattern and hold spaces)
2471	to be no more than 4000 bytes.
2472	The @sc{posix} standard specifies that conforming @command{sed}
2473	implementations shall support at least 8192 byte line lengths.
2474	@value{SSED} has no built-in limit on line length;
2475	as long as it can @code{malloc()} more (virtual) memory,
2476	you can feed or construct lines as long as you like.
2477
2478	However, recursion is used to handle subpatterns and indefinite
2479	repetition. This means that the available stack space may limit
2480	the size of the buffer that can be processed by certain patterns.
2481
2482	@ifset PERL
2483	There are some size limitations in the regular expression
2484	matcher but it is hoped that they will never in practice
2485	be relevant. The maximum length of a compiled pattern
2486	is 65539 (sic) bytes. All values in repeating quantifiers
2487	must be less than 65536. The maximum nesting depth of
2488	all parenthesized subpatterns, including capturing and
2489	non-capturing subpatterns@footnote{The
2490	distinction is meaningful when referring to Perl-style
2491	regular expressions.}, assertions, and other types of
2492	subpattern, is 200.
2493
2494	Also, @value{SSED} recognizes the @sc{posix} syntax
2495	@code{[.@var{ch}.]} and @code{[=@var{ch}=]}
2496	where @var{ch} is a ``collating element'', but these
2497	are not supported, and an error is given if they are
2498	encountered.
2499
2500	Here are a few distinctions between the real Perl-style
2501	regular expressions and those that @option{-R} recognizes.
2502
2503	@enumerate
2504	@item
2505	Lookahead assertions do not allow repeat quantifiers after them.
2506	Perl permits them, but they do not mean what you
2507	might think. For example, @samp{(?!a)@{3@}} does not assert that the
2508	next three characters are not @samp{a}. It just asserts three times that the
2509	next character is not @samp{a} --- a waste of time and nothing else.
2510
2511	@item
2512	Capturing subpatterns that occur inside negative lookahead
2513	assertions are counted, but their entries are counted
2514	as empty in the second half of an @code{s} command.
2515	Perl sets its numerical variables from any such patterns
2516	that are matched before the assertion fails to match
2517	something (thereby succeeding), but only if the negative
2518	lookahead assertion contains just one branch.
2519
2520	@item
2521	The following Perl escape sequences are not supported:
2522	@samp{\l}, @samp{\u}, @samp{\L}, @samp{\U}, @samp{\E},
2523	@samp{\Q}. In fact these are implemented by Perl's general
2524	string-handling and are not part of its pattern matching engine.
2525
2526	@item
2527	The Perl @samp{\G} assertion is not supported as it is not
2528	relevant to single pattern matches.
2529
2530	@item
2531	Fairly obviously, @value{SSED} does not support the @samp{(?@{code@})}
2532	and @samp{(?p@{code@})} constructions. However, there is some experimental
2533	support for recursive patterns using the non-Perl item @samp{(?R)}.
2534
2535	@item
2536	There are at the time of writing some oddities in Perl
2537	5.005_02 concerned with the settings of captured strings
2538	when part of a pattern is repeated. For example, matching
2539	@samp{aba} against the pattern @samp{/^(a(b)?)+$/} sets
2540	@samp{$2}@footnote{@samp{$2} would be @samp{\2} in @value{SSED}.}
2541	to the value @samp{b}, but matching @samp{aabbaa}
2542	against @samp{/^(aa(bb)?)+$/} leaves @samp{$2}
2543	unset. However, if the pattern is changed to
2544	@samp{/^(aa(b(b))?)+$/} then @samp{$2} (and @samp{$3}) are set.
2545	In Perl 5.004 @samp{$2} is set in both cases, and that is also
2546	true of @value{SSED}.
2547
2548	@item
2549	Another as yet unresolved discrepancy is that in Perl
2550	5.005_02 the pattern @samp{/^(a)?(?(1)a|b)+$/} matches
2551	the string @samp{a}, whereas in @value{SSED} it does not.
2552	However, in both Perl and @value{SSED} @samp{/^(a)?a/} matched
2553	against @samp{a} leaves $1 unset.
2554	@end enumerate
2555	@end ifset
2556
2557	@node Other Resources
2558	@chapter Other Resources for Learning About @command{sed}
2559
2560	@cindex Additional reading about @command{sed}
2561	In addition to several books that have been written about @command{sed}
2562	(either specifically or as chapters in books which discuss
2563	shell programming), one can find out more about @command{sed}
2564	(including suggestions of a few books) from the FAQ
2565	for the @code{sed-users} mailing list, available from any of:
2566	@display
2567	@uref{http://www.student.northpark.edu/pemente/sed/sedfaq.html}
2568	@uref{http://sed.sf.net/grabbag/tutorials/sedfaq.html}
2569	@end display
2570
2571	Also of interest are
2572	@uref{http://www.student.northpark.edu/pemente/sed/index.htm}
2573	and @uref{http://sed.sf.net/grabbag},
2574	which include @command{sed} tutorials and other @command{sed}-related goodies.
2575
2576	The @code{sed-users} mailing list itself is maintained by Sven Guckes.
2577	To subscribe, visit @uref{http://groups.yahoo.com} and search
2578	for the @code{sed-users} mailing list.
2579
2580	@node Reporting Bugs
2581	@chapter Reporting Bugs
2582
2583	@cindex Bugs, reporting
2584	Email bug reports to @email{bonzini@@gnu.org}.
2585	Be sure to include the word ``sed'' somewhere in the @code{Subject:} field.
2586	Also, please include the output of @samp{sed --version} in the body
2587	of your report if at all possible.
2588
2589	Please do not send a bug report like this:
2590
2591	@example
2592	@i{while building frobme-1.3.4}
2593	$ configure
2594	@error{} sed: file sedscr line 1: Unknown option to 's'
2595	@end example
2596
2597	If @value{SSED} doesn't configure your favorite package, take a
2598	few extra minutes to identify the specific problem and make a stand-alone
2599	test case. Unlike other programs such as C compilers, making such test
2600	cases for @command{sed} is quite simple.
2601
2602	A stand-alone test case includes all the data necessary to perform the
2603	test, and the specific invocation of @command{sed} that causes the problem.
2604	The smaller a stand-alone test case is, the better. A test case should
2605	not involve something as far removed from @command{sed} as ``try to configure
2606	frobme-1.3.4''. Yes, that is in principle enough information to look
2607	for the bug, but that is not a very practical prospect.
2608
2609	Here are a few commonly reported bugs that are not bugs.
2610
2611	@table @asis
2612	@item @code{N} command on the last line
2613	@cindex Portability, @code{N} command on the last line
2614	@cindex Non-bugs, @code{N} command on the last line
2615
2616	Most versions of @command{sed} exit without printing anything when
2617	the @command{N} command is issued on the last line of a file.
2618	@value{SSED} prints pattern space before exiting unless of course
2619	the @command{-n} command switch has been specified. This choice is
2620	by design.
2621
2622	For example, the behavior of
2623	@example
2624	sed N foo bar
2625	@end example
2626	@noindent
2627	would depend on whether foo has an even or an odd number of
2628	lines@footnote{which is the actual ``bug'' that prompted the
2629	change in behavior}. Or, when writing a script to read the
2630	next few lines following a pattern match, traditional
2631	implementations of @code{sed} would force you to write
2632	something like
2633	@example
2634	/foo/@{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N @}
2635	@end example
2636	@noindent
2637	instead of just
2638	@example
2639	/foo/@{ N;N;N;N;N;N;N;N;N; @}
2640	@end example
2641
2642	@cindex @code{POSIXLY_CORRECT} behavior, @code{N} command
2643	In any case, the simplest workaround is to use @code{$d;N} in
2644	scripts that rely on the traditional behavior, or to set
2645	the @code{POSIXLY_CORRECT} variable to a non-empty value.
2646
2647	@item Regex syntax clashes (problems with backslashes)
2648	@cindex @acronym{GNU} extensions, to basic regular expressions
2649	@cindex Non-bugs, regex syntax clashes
2650	@command{sed} uses the @sc{posix} basic regular expression syntax. According to
2651	the standard, the meaning of some escape sequences is undefined in
2652	this syntax; notable in the case of @command{sed} are @code{\|},
2653	@code{\+}, @code{\?}, @code{\`}, @code{\'}, @code{\<},
2654	@code{\>}, @code{\b}, @code{\B}, @code{\w}, and @code{\W}.
2655
2656	As in all @acronym{GNU} programs that use @sc{posix} basic regular
2657	expressions, @command{sed} interprets these escape sequences as special
2658	characters. So, @code{x\+} matches one or more occurrences of @samp{x}.
2659	@code{abc\|def} matches either @samp{abc} or @samp{def}.
2660
2661	This syntax may cause problems when running scripts written for other
2662	@command{sed}s. Some @command{sed} programs have been written with the
2663	assumption that @code{\|} and @code{\+} match the literal characters
2664	@code{|} and @code{+}. Such scripts must be modified by removing the
2665	spurious backslashes if they are to be used with modern implementations
2666	of @command{sed}, like
2667	@ifset PERL
2668	@value{SSED} or
2669	@end ifset
2670	@acronym{GNU} @command{sed}.
2671
2672	On the other hand, some scripts use @code{s|abc\|def||g} to remove occurrences
2673	of @emph{either} @code{abc} or @code{def}. While this worked until
2674	@command{sed} 4.0.x, newer versions interpret this as removing the
2675	string @code{abc|def}. This is again undefined behavior according to
2676	@acronym{POSIX}, and this interpretation is arguably more robust: older
2677	@command{sed}s, for example, required that the regex matcher parsed
2678	@code{\/} as @code{/} in the common case of escaping a slash, which is
2679	again undefined behavior; the new behavior avoids this, and this is good
2680	because the regex matcher is only partially under our control.
2681
2682	@cindex @acronym{GNU} extensions, special escapes
2683	In addition, this version of @command{sed} supports several escape characters
2684	(some of which are multi-character) to insert non-printable characters
2685	in scripts (@code{\a}, @code{\c}, @code{\d}, @code{\o}, @code{\r},
2686	@code{\t}, @code{\v}, @code{\x}). These can cause similar problems
2687	with scripts written for other @command{sed}s.
2688
2689	@item @option{-i} clobbers read-only files
2690	@cindex In-place editing
2691	@cindex @value{SSEDEXT}, in-place editing
2692	@cindex Non-bugs, in-place editing
2693
2694	In short, @samp{sed -i} will let you delete the contents of
2695	a read-only file, and in general the @option{-i} option
2696	(@pxref{Invoking sed, , Invocation}) lets you clobber
2697	protected files. This is not a bug, but rather a consequence
2698	of how the Unix filesystem works.
2699
2700	The permissions on a file say what can happen to the data
2701	in that file, while the permissions on a directory say what can
2702	happen to the list of files in that directory. @samp{sed -i}
2703	will not ever open for writing a file that is already on disk.
2704	Rather, it will work on a temporary file that is finally renamed
2705	to the original name: if you rename or delete files, you're actually
2706	modifying the contents of the directory, so the operation depends on
2707	the permissions of the directory, not of the file. For this same
2708	reason, @command{sed} does not let you use @option{-i} on a writeable file
2709	in a read-only directory (but unbelievably nobody reports that as a
2710	bug@dots{}).
2711
2712	@item @code{0a} does not work (gives an error)
2713	There is no line 0. 0 is a special address that is only used to treat
2714	addresses like @code{0,/@var{RE}/} as active when the script starts: if
2715	you write @code{1,/abc/d} and the first line includes the word @samp{abc},
2716	then that match would be ignored because address ranges must span at least
2717	two lines (barring the end of the file); but what you probably wanted is
2718	to delete every line up to the first one including @samp{abc}, and this
2719	is obtained with @code{0,/abc/d}.
2720
2721	@ifclear PERL
2722	@item @code{[a-z]} is case insensitive
2723	You are encountering problems with locales. POSIX mandates that @code{[a-z]}
2724	uses the current locale's collation order -- in C parlance, that means using
2725	@code{strcoll(3)} instead of @code{strcmp(3)}. Some locales have a
2726	case-insensitive collation order, others don't: one of those that have
2727	problems is Estonian.
2728
2729	Another problem is that @code{[a-z]} tries to use collation symbols.
2730	This only happens if you are on the @acronym{GNU} system, using
2731	@acronym{GNU} libc's regular expression matcher instead of compiling the
2732	one supplied with @acronym{GNU} sed. In a Danish locale, for example,
2733	the regular expression @code{^[a-z]$} matches the string @samp{aa},
2734	because this is a single collating symbol that comes after @samp{a}
2735	and before @samp{b}; @samp{ll} behaves similarly in Spanish
2736	locales, or @samp{ij} in Dutch locales.
2737
2738	To work around these problems, which may cause bugs in shell scripts, set
2739	the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.
2740	@end ifclear
2741	@end table
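
Three of the non-bugs above can be reproduced from the shell. The
following is only a sketch: it assumes that @acronym{GNU} @command{sed}
is installed as @command{sed}, and the sample inputs, the variable names
and the @code{show} helper are invented for illustration. Other
implementations may give different results.

```shell
#!/bin/sh
# Sketch only: assumes GNU sed is the `sed` found in PATH.

# 1. N on the last line. GNU sed prints the pending pattern space;
#    with POSIXLY_CORRECT set it exits without printing it.
n_gnu=$(printf 'a\nb\nc\n' | sed N)
n_posix=$(printf 'a\nb\nc\n' | POSIXLY_CORRECT=1 sed N)

# 2. Backslash clashes. In GNU basic regular expressions \+ and \|
#    are operators, not literal characters.
plus=$(echo 'axxxb' | sed 's/x\+/-/')
alt=$(echo 'abc def ghi' | sed 's/abc\|def/-/g')
#    When | is the delimiter of s, \| stands for a literal |,
#    so only the string "abc|def" is removed.
delim=$(echo 'abc|def abc def' | sed 's|abc\|def||g')

# 3. The 0 address. 1,/abc/ cannot end on line 1; 0,/abc/ can.
from1=$(printf 'abc\nx\nabc\ny\n' | sed '1,/abc/d')
from0=$(printf 'abc\nx\nabc\ny\n' | sed '0,/abc/d')

# Hypothetical helper: print a label and a value on one line.
show() { printf '%s: %s\n' "$1" "$(printf '%s' "$2" | tr '\n' ' ')"; }

show 'N, default'       "$n_gnu"
show 'N, POSIX mode'    "$n_posix"
show 'x\+'              "$plus"
show 'abc\|def'         "$alt"
show 's|abc\|def||g'    "$delim"
show '1,/abc/d'         "$from1"
show '0,/abc/d'         "$from0"
```

Comparing the two @code{N} results and the two address results side by
side is usually enough to tell whether a script depends on the
traditional behavior.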
2742
2743
2744	@node Extended regexps
2745	@appendix Extended regular expressions
2746	@cindex Extended regular expressions, syntax
2747
2748	The only difference between basic and extended regular expressions is in
2749	the behavior of a few characters: @samp{?}, @samp{+}, parentheses,
2750	and braces (@samp{@{@}}). While basic regular expressions require
2751	these to be escaped if you want them to behave as special characters,
2752	when using extended regular expressions you must escape them if
2753	you want them @emph{to match a literal character}.
2754
2755	@noindent
2756	Examples:
2757	@table @code
2758	@item abc?
2759	becomes @samp{abc\?} when using extended regular expressions. It matches
2760	the literal string @samp{abc?}.
2761
2762	@item c\+
2763	becomes @samp{c+} when using extended regular expressions. It matches
2764	one or more @samp{c}s.
2765
2766	@item a\@{3,\@}
2767	becomes @samp{a@{3,@}} when using extended regular expressions. It matches
2768	three or more @samp{a}s.
2769
2770	@item \(abc\)\@{2,3\@}
2771	becomes @samp{(abc)@{2,3@}} when using extended regular expressions. It
2772	matches either @samp{abcabc} or @samp{abcabcabc}.
2773
2774	@item \(abc*\)\1
2775	becomes @samp{(abc*)\1} when using extended regular expressions.
2776	Backreferences must still be escaped when using extended regular
2777	expressions.
2778	@end table
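
The pairs in the table above can be checked by running the same
substitution in both syntaxes. This sketch assumes @acronym{GNU}
@command{sed} and its @option{-r} option; the @code{compare} function
and the sample inputs are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch only: assumes GNU sed, where -r selects extended regexps.

# compare INPUT BRE ERE
# Replace the first match with X, once with the basic regexp and
# once with the extended one, and print both results.
compare() {
    bre_out=$(printf '%s\n' "$1" | sed "s/$2/X/")
    ere_out=$(printf '%s\n' "$1" | sed -r "s/$3/X/")
    printf '%s -> %s / %s\n' "$1" "$bre_out" "$ere_out"
}

compare 'abc?'     'abc?'              'abc\?'        # literal question mark
compare 'ccc'      'c\+'               'c+'           # one or more c
compare 'aaaa'     'a\{3,\}'           'a{3,}'        # three or more a
compare 'abcabc'   '\(abc\)\{2,3\}'    '(abc){2,3}'   # repeated group
compare 'abccabcc' '\(abc*\)\1'        '(abc*)\1'     # backreference
```

Each line should show the same result on both sides of the slash if the
two expressions are truly equivalent.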
2779
2780	@ifset PERL
2781	@node Perl regexps
2782	@appendix Perl-style regular expressions
2783	@cindex Perl-style regular expressions, syntax
2784
2785	@emph{This part is taken from the @file{pcre.txt} file distributed together
2786	with the free @sc{pcre} regular expression matcher; it was written by Philip Hazel.}
2787
2788	Perl introduced several extensions to regular expressions, some
2789	of them incompatible with the syntax of regular expressions
2790	accepted by Emacs and other @acronym{GNU} tools (whose matcher was
2791	based on the Emacs matcher). @value{SSED} implements
2792	both kinds of extensions.
2793
2794	@iftex
2795	Summarizing, we have:
2796
2797	@itemize @bullet
2798	@item
2799	A backslash can introduce several special sequences
2800
2801	@item
2802	The circumflex, dollar sign, and period characters behave specially
2803	with regard to new lines
2804
2805	@item
2806	Strange uses of square brackets are parsed differently
2807
2808	@item
2809	You can toggle modifiers in the middle of a regular expression
2810
2811	@item
2812	You can specify that a subpattern does not count when numbering backreferences
2813
2814	@item
2815	@cindex Greedy regular expression matching
2816	You can specify greedy or non-greedy matching
2817
2818	@item
2819	You can have more than ten back references
2820
2821	@item
2822	You can do complex look aheads and look behinds (in the spirit of
2823	@code{\b}, but with subpatterns).
2824
2825	@item
2826	You can often improve performance by preventing @command{sed} from
2827	wasting time on backtracking
2828
2829	@item
2830	You can have if/then/else branches
2831
2832	@item
2833	You can do recursive matches, for example to look for unbalanced parentheses
2834
2835	@item
2836	You can have comments and non-significant whitespace, because things can
2837	get complex...
2838	@end itemize
2839
2840	Most of these extensions are introduced by the special @code{(?}
2841	sequence, which gives special meanings to parenthesized groups.
2842	@end iftex
2843	@menu
2844	Other extensions can be roughly subdivided into two categories.
2845	On one hand, Perl introduces several more escaped sequences
2846	(that is, sequences introduced by a backslash). On the other
2847	hand, it specifies that if a question mark follows an open
2848	parenthesis, it should give a special meaning to the parenthesized
2849	group.
2850
2851	* Backslash:: Introduces special sequences
2852	* Circumflex/dollar sign/period:: Behave specially with regard to new lines
2853	* Square brackets:: Are a bit different in strange cases
2854	* Options setting:: Toggle modifiers in the middle of a regexp
2855	* Non-capturing subpatterns:: Are not counted when backreferencing
2856	* Repetition:: Allows for non-greedy matching
2857	* Backreferences:: Allows for more than 10 back references
2858	* Assertions:: Allows for complex look ahead matches
2859	* Non-backtracking subpatterns:: Often gives more performance
2860	* Conditional subpatterns:: Allows if/then/else branches
2861	* Recursive patterns:: For example to match parentheses
2862	* Comments:: Because things can get complex...
2863	@end menu
2864
2865	@node Backslash
2866	@appendixsec Backslash
2867	@cindex Perl-style regular expressions, escaped sequences
2868
2869	There are a few differences in the handling of backslashed
2870	sequences in Perl mode.
2871
2872	First of all, there are no @code{\o} and @code{\d} sequences.
2873	@sc{ascii} values for characters can be specified in octal
2874	with a @code{\@var{xxx}} sequence, where @var{xxx} is a
2875	sequence of up to three octal digits. If the first digit
2876	is a zero, the treatment of the sequence is straightforward;
2877	just note that if the character that follows the escaped digit
2878	is itself an octal digit, you have to supply three octal digits
2879	for @var{xxx}. For example @code{\07} is a @sc{bel} character
2880	rather than a @sc{nul} and a literal @code{7} (this sequence is
2881	instead represented by @code{\0007}).
2882
2883	@cindex Perl-style regular expressions, backreferences
2884	The handling of a backslash followed by a digit other than 0
2885	is complicated. Outside a character class, @command{sed} reads it
2886	and any following digits as a decimal number. If the number
2887	is less than 10, or if there have been at least that many
2888	previous capturing left parentheses in the expression, the
2889	entire sequence is taken as a back reference. A description
2890	of how this works is given later, following the discussion
2891	of parenthesized subpatterns.
2892
2893	Inside a character class, or if the decimal number is
2894	greater than 9 and there have not been that many capturing
2895	subpatterns, @command{sed} re-reads up to three octal digits following
2896	the backslash, and generates a single byte from the
2897	least significant 8 bits of the value. Any subsequent digits
2898	stand for themselves. For example:
2899
2900	@example
2901	\040 @i{is another way of writing a space}
2902	\40 @i{is the same, provided there are fewer than 40}
2903	@i{previous capturing subpatterns}
2904	\7 @i{is always a back reference}
2905	\011 @i{is always a tab}
2906	\11 @i{might be a back reference, or another way of}
2907	@i{writing a tab}
2908	\0113 @i{is a tab followed by the character @samp{3}}
2909	\113 @i{is the character with octal code 113 (since there}
2910	@i{can be no more than 99 back references)}
2911	\377 @i{is a byte consisting entirely of 1 bits (@sc{ascii} 255)}
2912	\81 @i{is either a back reference, or a binary zero}
2913	@i{followed by the two characters @samp{81}}
2914	@end example
2915
2916	Note that octal values of 100 or greater must not be introduced
2917	by a leading zero, because no more than three octal
2918	digits are ever read.
2919
2920	All the sequences that define a single byte value can be
2921	used both inside and outside character classes. In addition,
2922	inside a character class, the sequence @code{\b} is interpreted
2923	as the backspace character (hex 08). Outside a character
2924	class it has a different meaning (see below).
2925
2926	Perl mode also adds escapes specifying
2927	generic character classes (like @code{\w} and @code{\W} do):
2928
2929	@cindex Perl-style regular expressions, character classes
2930	@table @samp
2931	@item \d
2932	Matches any decimal digit
2933
2934	@item \D
2935	Matches any character that is not a decimal digit
2936	@end table
2937
2938	In Perl mode, these character type sequences can appear both inside and
2939	outside character classes. Instead, in @sc{posix} mode these sequences
2940	(as well as @code{\w} and @code{\W}) are treated as two literal characters
2941	(a backslash and a letter) inside square brackets.
2942
2943	Escaped sequences specifying assertions are also different in
2944	Perl mode. An assertion specifies a condition that has to be met
2945	at a particular point in a match, without consuming any
2946	characters from the subject string. The use of subpatterns
2947	for more complicated assertions is described below. The
2948	backslashed assertions are
2949
2950	@cindex Perl-style regular expressions, assertions
2951	@table @samp
2952	@item \b
2953	Asserts that the point is at a word boundary.
2954	A word boundary is a position in the subject string where
2955	the current character and the previous character do not both
2956	match @code{\w} or @code{\W} (i.e. one matches @code{\w} and
2957	the other matches @code{\W}), or the start or end of the string
2958	if the first or last character matches @code{\w}, respectively.
2959
2960	@item \B
2961	Asserts that the point is not at a word boundary.
2962
2963	@item \A
2964	Asserts the matcher is at the start of pattern space (independent
2965	of multiline mode).
2966
2967	@item \Z
2968	Asserts the matcher is at the end of pattern space,
2969	or at a newline before the end of pattern space (independent of
2970	multiline mode)
2971
2972	@item \z
2973	Asserts the matcher is at the end of pattern space (independent
2974	of multiline mode)
2975	@end table
2976
2977	These assertions may not appear in character classes (but
2978	note that @code{\b} has a different meaning, namely the
2979	backspace character, inside a character class).
2980	Note that Perl mode does not directly support assertions
2981	for the beginning and the end of a word; the @acronym{GNU} extensions
2982	@code{\<} and @code{\>} achieve this purpose in @sc{posix} mode
2983	instead.
2984
2985	The @code{\A}, @code{\Z}, and @code{\z} assertions differ
2986	from the traditional circumflex and dollar sign (described below)
2987	in that they only ever match at the very start and end of the
2988	subject string, whatever options are set; in particular @code{\A}
2989	and @code{\z} are the same as the @acronym{GNU} extensions
2990	@code{\`} and @code{\'} that are active in @sc{posix} mode.
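
The word assertions mentioned above can be tried in the default
(@sc{posix}) mode. The following sketch does not use Perl mode: it
assumes @acronym{GNU} @command{sed}, whose basic regular expressions
accept @code{\b}, @code{\B}, @code{\<} and @code{\>}. The sample text
and variable names are made up.

```shell
#!/bin/sh
# Sketch only: GNU sed in its default mode, not Perl mode.
words='cat concat cats'

# \< and \> anchor at the beginning and at the end of a word.
whole=$(echo "$words" | sed 's/\<cat\>/X/g')

# \b matches at either edge of a word.
edge=$(echo "$words" | sed 's/\bcat/X/g')

# \B matches only where there is no word boundary.
inner=$(echo "$words" | sed 's/\Bcat/X/g')

printf '%s\n' "$whole" "$edge" "$inner"
```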
2991
2992	@node Circumflex/dollar sign/period
2993	@appendixsec Circumflex, dollar sign, period
2994	@cindex Perl-style regular expressions, newlines
2995
2996	Outside a character class, in the default matching mode, the
2997	circumflex character is an assertion which is true only if
2998	the current matching point is at the start of the subject
2999	string. Inside a character class, the circumflex has an entirely
3000	different meaning (see below).
3001
3002	The circumflex need not be the first character of the pattern if
3003	a number of alternatives are involved, but it should be the
3004	first thing in each alternative in which it appears if the
3005	pattern is ever to match that branch. If all possible alternatives
3006	start with a circumflex, that is, if the pattern is
3007	constrained to match only at the start of the subject, it is
3008	said to be an @dfn{anchored} pattern. (There are also other
3009	constructs that can cause a pattern to be anchored.)
3010
3011	A dollar sign is an assertion which is true only if the
3012	current matching point is at the end of the subject string,
3013	or immediately before a newline character that is the last
3014	character in the string (by default). A dollar sign need not be the
3015	last character of the pattern if a number of alternatives
3016	are involved, but it should be the last item in any branch
3017	in which it appears. A dollar sign has no special meaning in a
3018	character class.
3019
3020	@cindex Perl-style regular expressions, multiline
3021	The meanings of the circumflex and dollar sign characters are
3022	changed if the @code{M} modifier option is used. When this is
3023	the case, they match immediately after and immediately
3024	before an internal @code{\n} character, respectively, in addition
3025	to matching at the start and end of the subject string. For
3026	example, the pattern @code{/^abc$/} matches the subject string
3027	@samp{def\nabc} in multiline mode, but not otherwise. Consequently,
3028	patterns that are anchored in single line mode
3029	because all branches start with @code{^} are not anchored in
3030	multiline mode.
3031
3032	@cindex Perl-style regular expressions, multiline
3033	Note that the sequences @code{\A}, @code{\Z}, and @code{\z}
3034	can be used to match the start and end of the subject in both
3035	modes, and if all branches of a pattern start with @code{\A}
3036	it is always anchored, whether the @code{M} modifier is set or not.
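
The effect of the @code{M} modifier can be observed without Perl mode,
because @acronym{GNU} @command{sed} also accepts it in the default
mode. The sketch below relies on that assumption and uses @code{\`},
the @sc{posix}-mode counterpart of @code{\A}; the inputs and variable
names are illustrative only.

```shell
#!/bin/sh
# Sketch only: assumes GNU sed; uses the default mode, not Perl mode.

# N joins both input lines, so pattern space holds "def<newline>abc".
off=$(printf 'def\nabc\n' | sed -n 'N; /^abc$/p')   # no M: no match
on=$(printf 'def\nabc\n' | sed -n 'N; /^abc$/Mp')   # M: matches line 2

# With M, ^ matches after every newline, while \` still matches
# only at the very start of pattern space.
caret=$(printf 'a\na\n' | sed 'N; s/^a/X/Mg')
bufstart=$(printf 'a\na\n' | sed 'N; s/\`a/X/Mg')

printf 'without M: [%s]\n' "$off"
printf 'with M:    [%s]\n' "$(printf '%s' "$on" | tr '\n' ' ')"
printf 'caret:     [%s]\n' "$(printf '%s' "$caret" | tr '\n' ' ')"
printf 'bufstart:  [%s]\n' "$(printf '%s' "$bufstart" | tr '\n' ' ')"
```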
3037
3038	@cindex Perl-style regular expressions, single line
3039	Outside a character class, a dot in the pattern matches any
3040	one character in the subject, including a non-printing character,
3041	but not (by default) newline. If the @code{S} modifier is used,
3042	dots match newlines as well. Actually, the handling of
3043	dot is entirely independent of the handling of circumflex
3044	and dollar sign, the only relationship being that they both
3045	involve newline characters. Dot has no special meaning in a
3046	character class.
3047
3048	@node Square brackets
3049	@appendixsec Square brackets
3050	@cindex Perl-style regular expressions, character classes
3051
3052	An opening square bracket introduces a character class, terminated
3053	by a closing square bracket. A closing square bracket on its own
3054	is not special. If a closing square bracket is required as a
3055	member of the class, it should be the first data character in
3056	the class (after an initial circumflex, if present) or escaped with a backslash.
3057
3058	A character class matches a single character in the subject;
3059	the character must be in the set of characters defined by
3060	the class, unless the first character in the class is a circumflex,
3061	in which case the subject character must not be in
3062	the set defined by the class. If a circumflex is actually
3063	required as a member of the class, ensure it is not the
3064	first character, or escape it with a backslash.
3065
3066	For example, the character class @code{[aeiou]} matches any lower
3067	case vowel, while @code{[^aeiou]} matches any character that is not
3068	a lower case vowel. Note that a circumflex is just a convenient
3069	notation for specifying the characters which are in
3070	the class by enumerating those that are not. It is not an
3071	assertion: it still consumes a character from the subject
3072	string, and fails if the current pointer is at the end of
3073	the string.
3074
3075	@cindex Perl-style regular expressions, case-insensitive
3076	When caseless matching is set, any letters in a class
3077	represent both their upper case and lower case versions, so
3078	for example, a caseless @code{[aeiou]} matches uppercase
3079	and lowercase @samp{A}s, and a caseless @code{[^aeiou]}
3080	does not match @samp{A}, whereas a case-sensitive version would.
3081
3082	@cindex Perl-style regular expressions, single line
3083	@cindex Perl-style regular expressions, multiline
3084	The newline character is never treated in any special way in
3085	character classes, whatever the setting of the @code{S} and
3086	@code{M} options (modifiers) is. A class such as @code{[^a]} will
3087	always match a newline.
3088
3089	The minus (hyphen) character can be used to specify a range
3090	of characters in a character class. For example, @code{[d-m]}
3091	matches any letter between d and m, inclusive. If a minus
3092	character is required in a class, it must be escaped with a
3093	backslash or appear in a position where it cannot be interpreted
3094	as indicating a range, typically as the first or last
3095	character in the class.
3096
3097	It is not possible to have the literal character @code{]} as the
3098	end character of a range. A pattern such as @code{[W-]46]} is
3099	interpreted as a class of two characters (@code{W} and @code{-})
3100	followed by a literal string @code{46]}, so it would match
3101	@samp{W46]} or @samp{-46]}. However, if the @code{]} is escaped
3102	with a backslash it is interpreted as the end of range, so
3103	@code{[W-\]46]} is interpreted as a single class containing a
3104	range followed by two separate characters. The octal or
3105	hexadecimal representation of @code{]} can also be used to end a range.
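
The rules for @code{-} and @code{]} that do not involve a backslash
behave the same way in the default mode, so they can be tried there.
This sketch assumes @acronym{GNU} @command{sed} and the @samp{C}
locale; it does not cover the backslash-escaped @code{\]} form, which
is specific to Perl mode. Inputs and variable names are invented.

```shell
#!/bin/sh
# Sketch only: GNU sed, default mode, C locale for predictable ranges.

# A range: every letter from d to m is replaced.
range=$(echo 'abcdefmnop' | LC_ALL=C sed 's/[d-m]/_/g')

# A hyphen placed last cannot form a range, so it is a member.
hyphen=$(echo 'a-b' | LC_ALL=C sed 's/[a-]/_/g')

# A closing bracket placed first is a member of the class.
bracket=$(echo 'a]b' | LC_ALL=C sed 's/[]a]/_/g')

# [W-]46] is the class [W-] followed by the literal string 46].
literal=$(echo 'W46] -46] X46]' | LC_ALL=C sed 's/[W-]46]/<&>/g')

printf '%s\n' "$range" "$hyphen" "$bracket" "$literal"
```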
3106
3107	Ranges operate in @sc{ascii} collating sequence. They can also be
3108	used for characters specified numerically, for example
3109	@code{[\000-\037]}. If a range that includes letters is used when
3110	caseless matching is set, it matches the letters in either
3111	case. For example, a caseless @code{[W-c]} is equivalent to
3112	@code{[][\^_`wxyzabc]}, matched caselessly, and if character
3113	tables for the French locale are in use, @code{[\xc8-\xcb]}
3114	matches accented E characters in both cases.
3115
3116	Unlike in @sc{posix} mode, the character types @code{\d},
3117	@code{\D}, @code{\s}, @code{\S}, @code{\w}, and @code{\W}
3118	may also appear in a character class, and add the characters
3119	that they match to the class. For example, @code{[\dABCDEF]} matches any
3120	hexadecimal digit. A circumflex can conveniently be used
3121	with the upper case character types to specify a more restricted
3122	set of characters than the matching lower case type.
3123	For example, the class @code{[^\W_]} matches any letter or digit,
3124	but not underscore.
3125
3126	All non-alphanumeric characters other than @code{\}, @code{-},
3127	@code{^} (at the start) and the terminating @code{]}
3128	are non-special in character classes, but it does no harm
3129	if they are escaped.
3130
3131	Perl 5.6 supports the @sc{posix} notation for character classes, which
3132	uses names enclosed by @code{[:} and @code{:]} within the enclosing
3133	square brackets, and @value{SSED} supports this notation as well.
3134	For example,
3135
3136	@example
3137	[01[:alpha:]%]
3138	@end example
3139
3140	@noindent
3141	matches @samp{0}, @samp{1}, any alphabetic character, or @samp{%}.
3142	The supported class names are
3143
3144	@table @code
3145	@item alnum
3146	Matches letters and digits
3147
3148	@item alpha
3149	Matches letters
3150
3151	@item ascii
3152	Matches character codes 0 - 127
3153
3154	@item cntrl
3155	Matches control characters
3156
3157	@item digit
3158	Matches decimal digits (same as @code{\d})
3159
3160	@item graph
3161	Matches printing characters, excluding space
3162
3163	@item lower
3164	Matches lower case letters
3165
3166	@item print
3167	Matches printing characters, including space
3168
3169	@item punct
3170	Matches printing characters, excluding letters and digits
3171
3172	@item space
3173	Matches white space (same as @code{\s})
3174
3175	@item upper
3176	Matches upper case letters
3177
3178	@item word
3179	Matches ``word'' characters (same as @code{\w})
3180
3181	@item xdigit
3182	Matches hexadecimal digits
3183	@end table
3184
3185	The names @code{ascii} and @code{word} are extensions valid only in
3186	Perl mode. Another Perl extension is negation, which is
3187	indicated by a circumflex character after the colon. For example,
3188
3189	@example
3190	[12[:^digit:]]
3191	@end example
3192
3193	@noindent
3194	matches @samp{1}, @samp{2}, or any non-digit.
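
The plain @code{[:@var{name}:]} notation is also accepted in the
default mode, so the first example can be tried there. The negated
form @code{[:^digit:]} is a Perl-mode extension; as an assumption for
illustration only, the sketch below replaces it with the complemented
set @code{[^03-9]}, which also matches @samp{1}, @samp{2} or any
non-digit. It assumes @acronym{GNU} @command{sed} and the @samp{C}
locale.

```shell
#!/bin/sh
# Sketch only: GNU sed, default mode, C locale.

# 0, 1, any alphabetic character, or %.
mixed=$(echo 'a0%-Z1' | LC_ALL=C sed 's/[01[:alpha:]%]/X/g')

# A named class on its own.
digits=$(echo 'room 101' | LC_ALL=C sed 's/[[:digit:]]/#/g')

# Stand-in for [12[:^digit:]]: everything except 0 and 3-9.
nondigit=$(echo 'a1b2c3' | LC_ALL=C sed 's/[^03-9]/_/g')

printf '%s\n' "$mixed" "$digits" "$nondigit"
```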
3195
3196	@node Options setting
3197	@appendixsec Options setting
3198	@cindex Perl-style regular expressions, toggling options
3199	@cindex Perl-style regular expressions, case-insensitive
3200	@cindex Perl-style regular expressions, multiline
3201	@cindex Perl-style regular expressions, single line
3202	@cindex Perl-style regular expressions, extended
3203
3204	The settings of the @code{I}, @code{M}, @code{S}, @code{X}
3205	modifiers can be changed from within the pattern by
3206	a sequence of Perl option letters enclosed between @code{(?}
3207	and @code{)}. The option letters must be lowercase.
3208
3209	For example, @code{(?im)} sets caseless, multiline matching. It is
3210	also possible to unset these options by preceding the letter
3211	with a hyphen; you can also have combined settings and unsettings:
3212	@code{(?im-sx)} sets caseless and multiline matching,
3213	while unsetting single line matching (for dots) and extended
3214	whitespace interpretation. If a letter appears both before
3215	and after the hyphen, the option is unset.
3216
3217	The scope of these option changes depends on where in the
3218	pattern the setting occurs. For settings that are outside
3219	any subpattern (defined below), the effect is the same as if
3220	the options were set or unset at the start of matching. The
3221	following patterns all behave in exactly the same way:
3222
3223	@example
3224	(?i)abc
3225	a(?i)bc
3226	ab(?i)c
3227	abc(?i)
3228	@end example
3229
3230	which in turn is the same as specifying the pattern @code{abc} with
3231	the @code{I} modifier. In other words, ``top level'' settings
3232	apply to the whole pattern (unless there are other
3233	changes inside subpatterns). If there is more than one setting
3234	of the same option at top level, the rightmost setting
3235	is used.
3236
3237	If an option change occurs inside a subpattern, the effect
3238	is different. This is a change of behaviour in Perl 5.005.
3239	An option change inside a subpattern affects only that part
3240	of the subpattern @emph{that follows} it, so
3241
3242	@example
3243	(a(?i)b)c
3244	@end example
3245
3246	@noindent
3247	matches @samp{abc} and @samp{aBc} and no other strings (assuming
3248	case-sensitive matching is used). By this means, options can
3249	be made to have different settings in different parts of the
3250	pattern. Any changes made in one alternative do carry on
3251	into subsequent branches within the same subpattern. For
3252	example,
3253
3254	@example
3255	(a(?i)b|c)
3256	@end example
3257
3258	@noindent
3259	matches @samp{ab}, @samp{aB}, @samp{c}, and @samp{C},
3260	even though when matching @samp{C} the first branch is
3261	abandoned before the option setting.
3262	This is because the effects of option settings happen at
3263	compile time. There would be some very weird behaviour otherwise.
3264
3265	@ignore
3266	There are two PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA
3267	that can be changed in the same way as the Perl-compatible options by
3268	using the characters U and X respectively. The (?X) flag
3269	setting is special in that it must always occur earlier in
3270	the pattern than any of the additional features it turns on,
3271	even when it is at top level. It is best put at the start.
3272	@end ignore
3273
3274
3275	@node Non-capturing subpatterns
3276	@appendixsec Non-capturing subpatterns
3277	@cindex Perl-style regular expressions, non-capturing subpatterns
3278
3279	Marking part of a pattern as a subpattern does two things.
3280	On one hand, it localizes a set of alternatives; on the other
3281	hand, it sets up the subpattern as a capturing subpattern (as
3282	defined above). The subpattern can be backreferenced and
3283	referenced in the right side of @code{s} commands.
3284
3285	For example, if the string @samp{the red king} is matched against
3286	the pattern
3287
3288	@example
3289	the ((red|white) (king|queen))
3290	@end example
3291
3292	@noindent
3293	the captured substrings are @samp{red king}, @samp{red},
3294	and @samp{king}, and are numbered 1, 2, and 3.
3295
3296	The fact that plain parentheses fulfil two functions is not
3297	always helpful. There are often times when a grouping
3298	subpattern is required without a capturing requirement. If an
3299	opening parenthesis is followed by @code{?:}, the subpattern does
3300	not do any capturing, and is not counted when computing the
3301	number of any subsequent capturing subpatterns. For example,
3302	if the string @samp{the white queen} is matched against the pattern
3303
3304	@example
3305	the ((?:red|white) (king|queen))
3306	@end example
3307
3308	@noindent
3309	the captured substrings are @samp{white queen} and @samp{queen},
3310	and are numbered 1 and 2. The maximum number of captured
3311	substrings is 99, while the maximum number of all subpatterns,
3312	both capturing and non-capturing, is 200.
3313
3314	As a convenient shorthand, if any option settings are
3315	required at the start of a non-capturing subpattern, the
3316	option letters may appear between the @code{?} and the
3317	@code{:}. Thus the two patterns
3318
3319	@example
3320	(?i:saturday|sunday)
3321	(?:(?i)saturday|sunday)
3322	@end example
3323
3324	@noindent
3325	match exactly the same set of strings. Because alternative
3326	branches are tried from left to right, and options are not
3327	reset until the end of the subpattern is reached, an option
3328	setting in one branch does affect subsequent branches, so
3329	the above patterns match @samp{SUNDAY} as well as @samp{Saturday}.
3330
3331
3332	@node Repetition
3333	@appendixsec Repetition
3334	@cindex Perl-style regular expressions, repetitions
3335
3336	Repetition is specified by quantifiers, which can follow any
3337	of the following items:
3338
3339	@itemize @bullet
3340	@item
3341	a single character, possibly escaped
3342
3343	@item
3344	the @code{.} special character
3345
3346	@item
3347	a character class
3348
3349	@item
3350	a back reference (see next section)
3351
3352	@item
3353	a parenthesized subpattern (unless it is an assertion; @pxref{Assertions})
3354	@end itemize
3355
3356	The general repetition quantifier specifies a minimum and
3357	maximum number of permitted matches, by giving the two
3358	numbers in curly brackets (braces), separated by a comma.
3359	The numbers must be less than 65536, and the first must be
3360	less than or equal to the second. For example:
3361
3362	@example
3363	z@{2,4@}
3364	@end example
3365
3366	@noindent
3367	matches @samp{zz}, @samp{zzz}, or @samp{zzzz}. A closing brace on its own
3368	is not a special character. If the second number is omitted,
3369	but the comma is present, there is no upper limit; if the
3370	second number and the comma are both omitted, the quantifier
3371	specifies an exact number of required matches. Thus
3372
3373	@example
3374	[aeiou]@{3,@}
3375	@end example
3376
3377	@noindent
3378	matches at least 3 successive vowels, but may match many
3379	more, while
3380
3381	@example
3382	\d@{8@}
3383	@end example
3384
3385	@noindent
3386	matches exactly 8 digits. An opening curly bracket that
3387	appears in a position where a quantifier is not allowed, or
3388	one that does not match the syntax of a quantifier, is taken
3389	as a literal character. For example, @samp{@{,6@}} is not a quantifier,
3390	but a literal string of four characters.@footnote{It
3391	raises an error if @option{-R} is not used.}
3392
3393	The quantifier @samp{@{0@}} is permitted, causing the expression to
3394	behave as if the previous item and the quantifier were not
3395	present.
3396
3397	For convenience (and historical compatibility) the three
3398	most common quantifiers have single-character abbreviations:
3399
3400	@table @code
3401	@item *
3402	is equivalent to @{0,@}
3403
3404	@item +
3405	is equivalent to @{1,@}
3406
3407	@item ?
3408	is equivalent to @{0,1@}
3409	@end table
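
The quantifier forms described above can be tried outside @command{sed}.
The following is a minimal sketch, not part of @value{SSED}: it uses
Python's @code{re} module as a stand-in, on the assumption that it treats
these particular quantifiers as Perl mode does. The sample strings and
variable names are invented for illustration. One known difference is that
Python accepts @samp{@{,6@}} as a quantifier meaning ``zero to six'', so
that case is left out.

```python
import re

# z{2,4}: between two and four copies of "z".
two_to_four = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz")
               if re.fullmatch(r"z{2,4}", s)]

# [aeiou]{3,}: at least three successive vowels, with no upper limit.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# \d{8}: exactly eight digits, no more and no fewer.
exactly_eight = re.fullmatch(r"\d{8}", "20240131") is not None
not_nine = re.fullmatch(r"\d{8}", "202401310") is None

# *, + and ? behave like {0,}, {1,} and {0,1}.
sample = "b ab aab aaab"
same_star = re.findall(r"a*b", sample) == re.findall(r"a{0,}b", sample)
same_plus = re.findall(r"a+b", sample) == re.findall(r"a{1,}b", sample)
same_opt = re.findall(r"a?b", sample) == re.findall(r"a{0,1}b", sample)

print(two_to_four, vowel_run, exactly_eight, not_nine)
print(same_star, same_plus, same_opt)
```

Using @code{re.fullmatch} rather than a search keeps each check about the
quantifier alone, because the whole string has to be consumed.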
3410
3411	It is possible to construct infinite loops by following a
3412	subpattern that can match no characters with a quantifier
3413	that has no upper limit, for example:
3414
3415	@example
3416	(a?)*
3417	@end example
3418
3419	Earlier versions of Perl used to give an error at
3420	compile time for such patterns. However, because there are
3421	cases where this can be useful, such patterns are now
3422	accepted, but if any repetition of the subpattern does in
3423	fact match no characters, the loop is forcibly broken.
3424
3425	@cindex Greedy regular expression matching
3426	@cindex Perl-style regular expressions, stingy repetitions
3427	By default, the quantifiers are @dfn{greedy}, as in @sc{posix}
3428	mode; that is, they match as much as possible (up to the maximum
3429	number of permitted times), without causing the rest of the
3430	pattern to fail. The classic example of where this gives problems
3431	is in trying to match comments in C programs. These appear between
3432	the sequences @code{/*} and @code{*/}, and within the sequence, individual
3433	@code{*} and @code{/} characters may appear. An attempt to match C
3434	comments by applying the pattern
3435
3436	@example
3437	/\*.*\*/
3438	@end example
3439
3440	@noindent
3441	to the string
3442
3443	@example
3444	/* first comment */ not comment /* second comment */
3445	@end example
3446
3447	@noindent
3448
3449	fails, because it matches the entire string owing to the
3450	greediness of the @code{.*} item.
3451
3452	However, if a quantifier is followed by a question mark, it
3453	ceases to be greedy, and instead matches the minimum number
3454	of times possible, so the pattern @code{/\*.*?\*/}
3455	does the right thing with the C comments. The meaning of the
3456	various quantifiers is not otherwise changed, just the preferred
3457	number of matches. Do not confuse this use of question
3458	mark with its use as a quantifier in its own right.
3459	Because it has two uses, it can sometimes appear doubled, as in
3460
3461	@example
3462	\d??\d
3463	@end example
3464
3465	which matches one digit by preference, but can match two if
3466	that is the only way the rest of the pattern matches.
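
The difference between greedy and non-greedy repetition is easy to observe
on the C comment example. The sketch below uses Python's @code{re} module
as a stand-in for Perl mode, on the assumption that both handle the
@code{?} suffix identically; the variable names are illustrative only.

```python
import re

text = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last "*/", so the whole line is one match.
greedy = re.findall(r"/\*.*\*/", text)

# Non-greedy: .*? stops at the first "*/", so each comment is found separately.
lazy = re.findall(r"/\*.*?\*/", text)

# \d??\d prefers a single digit ...
prefers_one = re.match(r"\d??\d", "12").group()
# ... but takes two when the rest of the match requires it.
forced_two = re.fullmatch(r"\d??\d", "12").group()

print(greedy)
print(lazy)
print(prefers_one, forced_two)
```

The last two lines use the same pattern; only the requirement to reach the
end of the string changes how many digits the doubled question mark
allows.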
3467
3468	Note that greediness does not matter when specifying addresses,
3469	but can nevertheless be used to improve performance.
3470
3471	@ignore
3472	If the PCRE_UNGREEDY option is set (an option which is not
3473	available in Perl), the quantifiers are not greedy by
3474	default, but individual ones can be made greedy by following
3475	them with a question mark. In other words, it inverts the
3476	default behaviour.
3477	@end ignore
3478
3479	When a parenthesized subpattern is quantified with a minimum
3480	repeat count that is greater than 1 or with a limited maximum,
3481	more memory is required for the compiled pattern, in
3482	proportion to the size of the minimum or maximum.
3483
3484	@cindex Perl-style regular expressions, single line
3485	If a pattern starts with @code{.*} or @code{.@{0,@}} and the
3486	@code{S} modifier is used, the pattern is implicitly anchored,
3487	because whatever follows will be tried against every character
3488	position in the subject string, so there is no point in
3489	retrying the overall match at any position after the first.
3490	PCRE treats such a pattern as though it were preceded by @code{\A}.
3491
3492	When a capturing subpattern is repeated, the value captured
3493	is the substring that matched the final iteration. For example,
3494	after
3495
3496	@example
3497	(tweedle[dume]@{3@}\s*)+
3498	@end example
3499
3500	@noindent
3501	has matched @samp{tweedledum tweedledee} the value of the
3502	captured substring is @samp{tweedledee}. However, if there are
3503	nested capturing subpatterns, the corresponding captured
3504	values may have been set in previous iterations. For example,
3505	after
3506
3507	@example
3508	/(a|(b))+/
3509	@end example
3510
3511	matches @samp{aba}, the value of the second captured substring is
3512	@samp{b}.
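
Both capture rules can be checked with a short experiment. This is a
sketch under the assumption that Python's @code{re} module assigns captures
in repeated groups the same way as Perl mode; it is not output from
@value{SSED} itself, and the variable names are made up.

```python
import re

# Group 1 is overwritten on every iteration, so only the last one survives.
m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m.group(1)

# Group 2 took part only in the middle iteration, yet its value is kept
# after the final iteration matches "a" through the other branch.
nested = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = nested.group(1), nested.group(2)

print(last_iteration, outer, inner)
```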
3513
3514	@node Backreferences
3515	@appendixsec Backreferences
3516	@cindex Perl-style regular expressions, backreferences
3517
3518	Outside a character class, a backslash followed by a digit
3519	greater than 0 (and possibly further digits) is a back
3520	reference to a capturing subpattern earlier (i.e. to its
3521	left) in the pattern, provided there have been that many
3522	previous capturing left parentheses.
3523
3524	However, if the decimal number following the backslash is
3525	less than 10, it is always taken as a back reference, and
3526	causes an error only if there are not that many capturing
3527	left parentheses in the entire pattern. In other words, the
3528	parentheses that are referenced need not be to the left of
3529	the reference for numbers less than 10. @xref{Backslash},
3530	for further details of the handling of digits following a backslash.
3531
3532	A back reference matches whatever actually matched the capturing
3533	subpattern in the current subject string, rather than
3534	anything matching the subpattern itself. So the pattern
3535
3536	@example
3537	(sens|respons)e and \1ibility
3538	@end example
3539
3540	@noindent
3541	matches @samp{sense and sensibility} and @samp{response and responsibility},
3542	but not @samp{sense and responsibility}. If caseful
3543	matching is in force at the time of the back reference, the
3544	case of letters is relevant. For example,
3545
3546	@example
3547	((?i)blah)\s+\1
3548	@end example
3549
3550	@noindent
3551	matches @samp{blah blah} and @samp{Blah Blah}, but not
3552	@samp{BLAH blah}, even though the original capturing
3553	subpattern is matched caselessly.
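
The two examples above can be reproduced as follows. This is an
illustrative sketch using Python's @code{re} module rather than
@value{SSED}. One deliberate change: recent Python versions reject an
option setting such as @code{(?i)} in the middle of a pattern, so the
scoped spelling @code{(?i:blah)} is assumed to be an acceptable
substitute for @code{(?i)blah} inside the group.

```python
import re

# \1 must repeat whatever text group 1 actually matched.
pair = re.compile(r"(sens|respons)e and \1ibility")
accepted = [s for s in ("sense and sensibility",
                        "response and responsibility",
                        "sense and responsibility")
            if pair.fullmatch(s)]

# Only the group is caseless; the back reference itself is caseful,
# so it must repeat the captured text letter for letter.
echo = re.compile(r"((?i:blah))\s+\1")
echo_ok = [s for s in ("blah blah", "Blah Blah", "BLAH blah")
           if echo.fullmatch(s)]

print(accepted)
print(echo_ok)
```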
3554
3555	There may be more than one back reference to the same subpattern.
3556	Also, if a subpattern has not actually been used in a
3557	particular match, any back references to it always fail. For
3558	example, the pattern
3559
3560	@example
3561	(a|(bc))\2
3562	@end example
3563
3564	@noindent
3565	always fails if it starts to match @samp{a} rather than
3566	@samp{bc}. Because there may be up to 99 back references, all
3567	digits following the backslash are taken as part of a potential
3568	back reference number; this is different from what happens
3569	in @sc{posix} mode. If the pattern continues with a digit
3570	character, some delimiter must be used to terminate the back
3571	reference. If the @code{X} modifier option is set, this can be
3572	whitespace. Otherwise an empty comment can be used, or the
3573	following character can be expressed in hexadecimal or octal.
3574
3575	A back reference that occurs inside the parentheses to which
3576	it refers fails when the subpattern is first used, so, for
3577	example, @code{(a\1)} never matches. However, such references
3578	can be useful inside repeated subpatterns. For example, the
3579	pattern
3580
3581	@example
3582	(a|b\1)+
3583	@end example
3584
3585	@noindent
3586	matches any number of @samp{a}s and also @samp{aba}, @samp{ababbaa},
3587	etc. At each iteration of the subpattern, the back reference matches
3588	the character string corresponding to the previous iteration. In
3589	order for this to work, the pattern must be such that the first
3590	iteration does not need to match the back reference. This can be
3591	done using alternation, as in the example above, or by a
3592	quantifier with a minimum of zero.
3593
3594	@node Assertions
3595	@appendixsec Assertions
3596	@cindex Perl-style regular expressions, assertions
3597	@cindex Perl-style regular expressions, asserting subpatterns
3598
3599	An assertion is a test on the characters following or
3600	preceding the current matching point that does not actually
3601	consume any characters. The simple assertions coded as @code{\b},
3602	@code{\B}, @code{\A}, @code{\Z}, @code{\z}, @code{^} and @code{$}
3603	are described above. More complicated assertions are coded as
3604	subpatterns. There are two kinds: those that look ahead of the
3605	current position in the subject string, and those that look behind it.
3606
3607	@cindex Perl-style regular expressions, lookahead subpatterns
3608	An assertion subpattern is matched in the normal way, except
3609	that it does not cause the current matching position to be
3610	changed. Lookahead assertions start with @code{(?=} for positive
3611	assertions and @code{(?!} for negative assertions. For example,
3612
3613	@example
3614	\w+(?=;)
3615	@end example
3616
3617	@noindent
3618	matches a word followed by a semicolon, but does not include
3619	the semicolon in the match, and
3620
3621	@example
3622	foo(?!bar)
3623	@end example
3624
3625	@noindent
3626	matches any occurrence of @samp{foo} that is not followed by
3627	@samp{bar}.
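
A quick way to see that a lookahead consumes nothing is to inspect what
the match returns. The sketch below assumes that Python's @code{re}
module implements @code{(?=} and @code{(?!} as Perl mode does; the input
strings are invented.

```python
import re

# The semicolon is required but is not part of the matched text.
words = re.findall(r"\w+(?=;)", "x = alpha; y = beta")

# Only the second "foo" qualifies: the first is followed by "bar".
starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

print(words, starts)
```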
3628
3629	Note that the apparently similar pattern
3630
3631	@example
3632	(?!foo)bar
3633	@end example
3634
3635	@noindent
3636	@cindex Perl-style regular expressions, lookbehind subpatterns
3637	finds any occurrence of @samp{bar} even if it is preceded by
3638	@samp{foo}, because the assertion @code{(?!foo)} is always true
3639	when the next three characters are @samp{bar}. A lookbehind
3640	assertion is needed to achieve this effect.
3641	Lookbehind assertions start with @code{(?<=} for positive
3642	assertions and @code{(?<!} for negative assertions. So,
3643
3644	@example
3645	(?<!foo)bar
3646	@end example
3647
3648	achieves the required effect of finding an occurrence of
3649	@samp{bar} that is not preceded by @samp{foo}. The contents of a
3650	lookbehind assertion are restricted
3651	such that all the strings it matches must have a fixed
3652	length. However, if there are several alternatives, they do
3653	not all have to have the same fixed length. This is an extension
3654	compared with Perl 5.005, which requires all branches to match
3655	the same length of string. Thus
3656
3657	@example
3658	(?<=dogs|cats|)
3659	@end example
3660
3661	@noindent
3662	is permitted, but the apparently equivalent regular expression
3663
3664	@example
3665	(?<!dogs?|cats?)
3666	@end example
3667
3668	@noindent
3669	causes an error at compile time. Branches that match different
3670	length strings are permitted only at the top level of
3671	a lookbehind assertion: an assertion such as
3672
3673	@example
3674	(?<=ab(c|de))
3675	@end example
3676
3677	@noindent
3678	is not permitted, because its single top-level branch can
3679	match two different lengths, but it is acceptable if rewritten
3680	to use two top-level branches:
3681
3682	@example
3683	(?<=abc|abde)
3684	@end example
3685
3686	All this is required because lookbehind assertions simply
3687	move the current position back by the alternative's fixed
3688	width and then try to match. If there are
3689	insufficient characters before the current position, the
3690	match is deemed to fail. Lookbehinds, in conjunction with
3691	non-backtracking subpatterns, can be particularly useful for
3692	matching at the ends of strings; an example is given at the end
3693	of the section on non-backtracking subpatterns.
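
The contrast between @code{(?!foo)bar} and @code{(?<!foo)bar} can be
demonstrated directly. This is a sketch using Python's @code{re} module as
a stand-in. Python is stricter than the rules given here, since it
requires every alternative inside a lookbehind to have the same width, so
only the single-branch case is shown.

```python
import re

# The lookahead inspects "bar" itself, which never starts with "foo",
# so the assertion always succeeds and "bar" is found after "foo".
ahead = re.findall(r"(?!foo)bar", "foobar")

# The lookbehind inspects the three characters before "bar".
behind = re.findall(r"(?<!foo)bar", "foobar")
behind_mixed = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar crowbar")]

print(ahead, behind, behind_mixed)
```

In the last case only the @samp{bar} of @samp{crowbar} is reported,
because it is preceded by @samp{row} rather than @samp{foo}.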
3694
3695	Several assertions (of any sort) may occur in succession.
3696	For example,
3697
3698	@example
3699	(?<=\d@{3@})(?<!999)foo
3700	@end example
3701
3702	@noindent
3703	matches @samp{foo} preceded by three digits that are not @samp{999}.
3704	Notice that each of the assertions is applied independently
3705	at the same point in the subject string. First there is a
3706	check that the previous three characters are all digits, and
3707	then there is a check that the same three characters are not
3708	@samp{999}. This pattern does not match @samp{foo} preceded by six
3709	characters, the first three of which are digits and the last three
3710	of which are not @samp{999}. For example, it doesn't match
3711	@samp{123abcfoo}. A pattern to do that is
3712
3713	@example
3714	(?<=\d@{3@}...)(?<!999)foo
3715	@end example
3716
3717	@noindent
3718	This time the first assertion looks at the preceding six
3719	characters, checking that the first three are digits, and
3720	then the second assertion checks that the preceding three
3721	characters are not @samp{999}. Actually, assertions can be
3722	nested in any combination, so one can write this as
3723
3724	@example
3725	(?<=\d@{3@}(?!999)...)foo
3726	@end example
3727
3728	or
3729
3730	@example
3731	(?<=\d@{3@}...(?<!999))foo
3732	@end example
3733
3734	@noindent
3735	both of which might be considered more readable.
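
The following sketch checks the patterns discussed above against a few
invented subjects. It assumes Python's @code{re} module evaluates
successive and nested assertions as Perl mode does; all four lookbehinds
have a fixed width, so Python accepts them unchanged.

```python
import re

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]

def hits(pattern):
    """Return the subjects in which the pattern is found."""
    return [s for s in subjects if re.search(pattern, s)]

# Both assertions look at the same three characters before "foo".
adjacent = hits(r"(?<=\d{3})(?<!999)foo")

# The first assertion now spans six characters; the second still spans three.
six_back = hits(r"(?<=\d{3}...)(?<!999)foo")

# The nested spellings express the same condition as six_back.
nested_ahead = hits(r"(?<=\d{3}(?!999)...)foo")
nested_behind = hits(r"(?<=\d{3}...(?<!999))foo")

print(adjacent, six_back, nested_ahead, nested_behind)
```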
3736
3737	Assertion subpatterns are not capturing subpatterns, and may
3738	not be repeated, because it makes no sense to assert the
3739	same thing several times. If any kind of assertion contains
3740	capturing subpatterns within it, these are counted for the
3741	purposes of numbering the capturing subpatterns in the whole
3742	pattern. However, substring capturing is carried out only
3743	for positive assertions, because it does not make sense for
3744	negative assertions.
3745
3746	Assertions count towards the maximum of 200 parenthesized
3747	subpatterns.
3748
3749	@node Non-backtracking subpatterns
3750	@appendixsec Non-backtracking subpatterns
3751	@cindex Perl-style regular expressions, non-backtracking subpatterns
3752
3753	With both maximizing and minimizing repetition, failure of
3754	what follows normally causes the repeated item to be evaluated
3755	again to see if a different number of repeats allows the
3756	rest of the pattern to match. Sometimes it is useful to
3757	prevent this, either to change the nature of the match, or
3758	to cause it to fail earlier than it otherwise might, when the
3759	author of the pattern knows there is no point in carrying
3760	on.
3761
3762	Consider, for example, the pattern @code{\d+foo} when applied to
3763	the subject line
3764
3765	@example
3766	123456bar
3767	@end example
3768
3769	After matching all 6 digits and then failing to match @samp{foo},
3770	the normal action of the matcher is to try again with only 5
3771	digits matching the @code{\d+} item, and then with 4, and so on,
3772	before ultimately failing. Non-backtracking subpatterns
3773	provide the means for specifying that once a portion of the
3774	pattern has matched, it is not to be re-evaluated in this way,
3775	so the matcher would give up immediately on failing to match
3776	@samp{foo} the first time. The notation is another kind of special
3777	parenthesis, starting with @code{(?>} as in this example:
3778
3779	@example
3780	(?>\d+)foo
3781	@end example
3782
3783	This kind of parenthesis ``locks up'' the part of the pattern
3784	it contains once it has matched, and a failure further into
3785	the pattern is prevented from backtracking into it.
3786	Backtracking past it to previous items, however, works as
3787	normal.
3788
3789	Non-backtracking subpatterns are not capturing subpatterns. Simple
3790	cases such as the above example can be thought of as a maximizing
3791	repeat that must swallow everything it can. So,
3792	while both @code{\d+} and @code{\d+?} are prepared to adjust the number of
3793	digits they match in order to make the rest of the pattern
3794	match, @code{(?>\d+)} can only match an entire sequence of digits.
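
The ``locking'' behaviour can be observed with a pattern whose success
depends on giving a digit back. The sketch below is written for Python's
@code{re} module and, so that it also runs on versions older than 3.11
(which lack @code{(?>...)}), it emulates the non-backtracking group with
a different technique: a lookahead that captures the digits, followed by a
back reference that consumes them, @code{(?=(?P<run>\d+))(?P=run)}. This
relies on the assumption that a lookahead is never re-entered once it has
succeeded. The group name @code{run} is arbitrary.

```python
import re

# Ordinary greedy repeat: \d+ gives back the final "5" so the match succeeds.
backtracking = re.search(r"\d+5", "12345")

# Emulated non-backtracking group: the lookahead captures the longest run of
# digits and (?P=run) then consumes exactly that text.  Nothing can be given
# back afterwards, so no "5" is left over and the search fails.
locked = re.search(r"(?=(?P<run>\d+))(?P=run)5", "12345")

print(backtracking.group(), locked)
```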
3795
3796	This construction can of course contain arbitrarily complicated
3797	subpatterns, and it can be nested.
3798
3799	@cindex Perl-style regular expressions, lookbehind subpatterns
3800	Non-backtracking subpatterns can be used in conjunction with lookbehind
3801	assertions to specify efficient matching at the end
3802	of the subject string. Consider a simple pattern such as
3803
3804	@example
3805	abcd$
3806	@end example
3807
3808	@noindent
3809	when applied to a long string which does not match. Because
3810	matching proceeds from left to right, @command{sed} will look for
3811	each @samp{a} in the subject and then see if what follows matches
3812	the rest of the pattern. If the pattern is specified as
3813
3814	@example
3815	^.*abcd$
3816	@end example
3817
3818	@noindent
3819	the initial @code{.*} matches the entire string at first, but when
3820	this fails (because there is no following @samp{a}), it backtracks
3821	to match all but the last character, then all but the
3822	last two characters, and so on. Once again the search for
3823	@samp{a} covers the entire string, from right to left, so we are
3824	no better off. However, if the pattern is written as
3825
3826	@example
3827	^(?>.*)(?<=abcd)
3828	@end example
3829
3830	there can be no backtracking for the @code{.*} item; it can match
3831	only the entire string. The subsequent lookbehind assertion
3832	does a single test on the last four characters. If it fails,
3833	the match fails immediately. For long strings, this approach
3834	makes a significant difference to the processing time.
3835
3836	When a pattern contains an unlimited repeat inside a subpattern
3837	that can itself be repeated an unlimited number of
3838	times, the use of a non-backtracking subpattern is the only way to
3839	avoid some failing matches taking a very long time
3840	indeed.@footnote{Actually, the matcher embedded in @value{SSED}
3841	tries to do something for this in the simplest cases,
3842	like @code{([^b]*b)*}. These cases are actually quite
3843	common: they happen for example in a regular expression
3844	like @code{\/\*([^*]*\*)*\/} which matches C comments.}
3845
3846	The pattern
3847
3848	@example
3849	(\D+|<\d+>)*[!?]
3850	@end example
3851
3852	([^0-9<]+<(\d+>)?)*[!?]
3853
3854	@noindent
3855	matches an unlimited number of substrings that either consist
3856	of non-digits, or digits enclosed in angle brackets, followed by
3857	an exclamation or question mark. When it matches, it runs quickly.
3858	However, if it is applied to
3859
3860	@example
3861	aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3862	@end example
3863
3864	@noindent
3865	it takes a long time before reporting failure. This is
3866	because the string can be divided between the two repeats in
3867	a large number of ways, and all have to be tried.@footnote{The
3868	example used @code{[!?]} rather than a single character at the end,
3869	because both @value{SSED} and Perl have an optimization that allows
3870	for fast failure when a single character is used. They
3871	remember the last single character that is required for a
3872	match, and fail early if it is not present in the string.}
3873
3874	If the pattern is changed to
3875
3876	@example
3877	((?>\D+)|<\d+>)*[!?]
3878	@end example
3879
3880	sequences of non-digits cannot be broken, and failure happens
3881	quickly.
3882
3883	@node Conditional subpatterns
3884	@appendixsec Conditional subpatterns
3885	@cindex Perl-style regular expressions, conditional subpatterns
3886
3887	It is possible to cause the matching process to obey a subpattern
3888	conditionally or to choose between two alternative
3889	subpatterns, depending on the result of an assertion, or
3890	whether a previous capturing subpattern matched or not. The
3891	two possible forms of conditional subpattern are
3892
3893	@example
3894	(?(@var{condition})@var{yes-pattern})
3895	(?(@var{condition})@var{yes-pattern}|@var{no-pattern})
3896	@end example
3897
3898	If the condition is satisfied, the yes-pattern is used; otherwise
3899	the no-pattern (if present) is used. If there are more than two
3900	alternatives in the subpattern, a compile-time error occurs.
3901
3902	There are two kinds of condition. If the text between the
3903	parentheses consists of a sequence of digits, the condition
3904	is satisfied if the capturing subpattern of that number has
3905	previously matched. The number must be greater than zero.
3906	Consider the following pattern, which contains non-significant
3907	white space to make it more readable (assume the @code{X} modifier)
3908	and to divide it into three parts for ease of discussion:
3909
3910	@example
3911	( \( )? [^()]+ (?(1) \) )
3912	@end example
3913
3914	The first part matches an optional opening parenthesis, and
3915	if that character is present, sets it as the first captured
3916	substring. The second part matches one or more characters
3917	that are not parentheses. The third part is a conditional
3918	subpattern that tests whether the first set of parentheses
3919	matched or not. If they did, that is, if the subject started
3920	with an opening parenthesis, the condition is true, and so
3921	the yes-pattern is executed and a closing parenthesis is
3922	required. Otherwise, since no-pattern is not present, the
3923	subpattern matches nothing. In other words, this pattern
3924	matches a sequence of non-parentheses, optionally enclosed
3925	in parentheses.
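
The behaviour described above can be checked on four small subjects. This
sketch uses Python's @code{re} module, which is assumed to treat a numeric
condition such as @code{(?(1)...)} like Perl mode does. The pattern is
written without the extra white space so that no extended-syntax option is
needed.

```python
import re

# Optional "(", then non-parentheses, then ")" only if group 1 matched.
optionally_wrapped = re.compile(r"(\()?[^()]+(?(1)\))")

results = {s: optionally_wrapped.fullmatch(s) is not None
           for s in ("(abc)", "abc", "(abc", "abc)")}

print(results)
```

Anchoring with @code{fullmatch} matters here: an unanchored search on
@samp{(abc} would simply skip the parenthesis and report @samp{abc}.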
3926
3927	@cindex Perl-style regular expressions, lookahead subpatterns
3928	If the condition is not a sequence of digits, it must be an
3929	assertion. This may be a positive or negative lookahead or
3930	lookbehind assertion. Consider this pattern, again containing
3931	non-significant white space, and with the two alternatives
3932	on the second and third lines:
3933
3934	@example
3935	(?(?=...[a-z])
3936	\d\d-[a-z]@{3@}-\d\d |
3937	\d\d-\d\d-\d\d )
3938	@end example
3939
3940	The condition is a positive lookahead assertion that matches
3941	a letter that is three characters away from the current point.
3942	If a letter is found, the subject is matched against the first
3943	alternative @samp{@var{dd}-@var{aaa}-@var{dd}} (where @var{aaa} are
3944	letters and @var{dd} are digits); otherwise it is matched against
3945	the second alternative, @samp{@var{dd}-@var{dd}-@var{dd}}.
3946
3947
3948	@node Recursive patterns
3949	@appendixsec Recursive patterns
3950	@cindex Perl-style regular expressions, recursive patterns
3951	@cindex Perl-style regular expressions, recursion
3952
3953	Consider the problem of matching a string in parentheses,
3954	allowing for unlimited nested parentheses. Without the use
3955	of recursion, the best that can be done is to use a pattern
3956	that matches up to some fixed depth of nesting. It is not
3957	possible to handle an arbitrary nesting depth. Perl 5.6 has
3958	provided an experimental facility that allows regular
3959	expressions to recurse (amongst other things). It does this
3960	by interpolating Perl code in the expression at run time,
3961	and the code can refer to the expression itself. A Perl
3962	pattern to solve the parentheses problem can be created like
3963	this:
3964
3965	@example
3966	$re = qr@{\( (?: (?>[^()]+) | (?p@{$re@}) )* \)@}x;
3967	@end example
3968
3969	The @code{(?p@{...@})} item interpolates Perl code at run time,
3970	and in this case refers recursively to the pattern in which it
3971	appears. Obviously, @command{sed} cannot support the interpolation of
3972	Perl code. Instead, the special item @code{(?R)} is provided for
3973	the specific case of recursion. This pattern solves the
3974	parentheses problem (assume the @code{X} modifier option is used
3975	so that white space is ignored):
3976
3977	@example
3978	\( ( (?>[^()]+) | (?R) )* \)
3979	@end example
3980
3981	First it matches an opening parenthesis. Then it matches any
3982	number of substrings which can either be a sequence of
3983	non-parentheses, or a recursive match of the pattern itself
3984	(i.e. a correctly parenthesized substring). Finally there is
3985	a closing parenthesis.
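
Python's @code{re} module has no @code{(?R)} item, so the three steps just
described are mirrored below by a small hand-written recursive function
instead of a regular expression. This is a different technique, shown only
to make the structure of the recursion concrete; the name
@code{match_parens} is hypothetical and nothing here is taken from
@value{SSED}.

```python
def match_parens(s, i=0):
    """Return the index just past the balanced group that starts at s[i],
    or None if there is no such group."""
    if i >= len(s) or s[i] != "(":
        return None
    i += 1                                # the opening parenthesis
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            i = match_parens(s, i)        # plays the role of (?R)
            if i is None:
                return None
        else:
            i += 1                        # one non-parenthesis character
    return i + 1 if i < len(s) else None  # the closing parenthesis


balanced = match_parens("(ab(cd)ef)")
unbalanced = match_parens("(aaaa()")
print(balanced, unbalanced)
```

Because the function never revisits a character, it has no counterpart to
the backtracking cost that the pattern can incur on subjects that do not
match.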
3986
3987	This particular example pattern contains nested unlimited
3988	repeats, and so the use of a non-backtracking subpattern for
3989	matching strings of non-parentheses is important when applying
3990	the pattern to strings that do not match. For example, when
3991	it is applied to
3992
3993	@example
3994	(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
3995	@end example
3996
3997	it yields a ``no match'' response quickly. However, if a
3998	standard backtracking subpattern is used instead, the match runs
3999	for a very long time indeed because there are so many different
4000	ways the @code{+} and @code{*} repeats can carve up the subject,
4001	and all have to be tested before failure can be reported.
4002
4003	The values set for any capturing subpatterns are those from
4004	the outermost level of the recursion at which the subpattern
4005	value is set. If the pattern above is matched against
4006
4007	@example
4008	(ab(cd)ef)
4009	@end example
4010
4011	@noindent
4012	the value for the capturing parentheses is @samp{ef}, which is
4013	the last value taken on at the top level.
4014
4015	@node Comments
4016	@appendixsec Comments
4017	@cindex Perl-style regular expressions, comments
4018
4019	The sequence @code{(?#} marks the start of a comment which
4020	continues up to the next closing parenthesis. Nested parentheses
4021	are not permitted. The characters that make up a comment
4022	play no part in the pattern matching at all.
4023
4024	@cindex Perl-style regular expressions, extended
4025	If the @code{X} modifier option is used, an unescaped @code{#} character
4026	outside a character class introduces a comment that continues
4027	up to the next newline character in the pattern.
4028	@end ifset
4029
4030
4031	@page
4032	@node Concept Index
4033	@unnumbered Concept Index
4034
4035	This is a general index of all issues discussed in this manual, with the
4036	exception of the @command{sed} commands and command-line options.
4037
4038	@printindex cp
4039
4040	@page
4041	@node Command and Option Index
4042	@unnumbered Command and Option Index
4043
4044	This is an alphabetical list of all @command{sed} commands and command-line
4045	options.
4046
4047	@printindex fn
4048
4049	@contents
4050	@bye
4051
4052	@c XXX FIXME: the term "cycle" is never defined...
