GNU sed 4.1.5.

1\input texinfo @c -*-texinfo-*-
2@c Do not edit this file!! It is automatically generated from sed-in.texi.
3@c
4@c -- Stuff that needs adding: ----------------------------------------------
5@c (document the `;' command-separator)
6@c --------------------------------------------------------------------------
7@c Check for consistency: regexps in @code, text that they match in @samp.
8@c
9@c Tips:
10@c @command for command
11@c @samp for command fragments: @samp{cat -s}
12@c @code for sed commands and flags
13@c Use ``quote'' not `quote' or "quote".
14@c
15@c %**start of header
16@setfilename sed.info
17@settitle sed, a stream editor
18@c %**end of header
19
20@c @smallbook
21
22@include version.texi
23
24@c Combine indices.
25@syncodeindex ky cp
26@syncodeindex pg cp
27@syncodeindex tp cp
28
29@defcodeindex op
30@syncodeindex op fn
31
32@include config.texi
33
34@copying
35This file documents version @value{VERSION} of
36@value{SSED}, a stream editor.
37
38Copyright @copyright{} 1998, 1999, 2001, 2002, 2003, 2004 Free
39Software Foundation, Inc.
40
41This document is released under the terms of the @acronym{GNU} Free
42Documentation License as published by the Free Software Foundation;
43either version 1.1, or (at your option) any later version.
44
45You should have received a copy of the @acronym{GNU} Free Documentation
46License along with @value{SSED}; see the file @file{COPYING.DOC}.
If not, write to the Free Software Foundation, 51 Franklin Street,
Fifth Floor, Boston, MA 02110-1301, USA.
49
50There are no Cover Texts and no Invariant Sections; this text, along
51with its equivalent in the printed manual, constitutes the Title Page.
52@end copying
53
54@setchapternewpage off
55
56@titlepage
57@title @command{sed}, a stream editor
58@subtitle version @value{VERSION}, @value{UPDATED}
59@author by Ken Pizzini, Paolo Bonzini
60
61@page
62@vskip 0pt plus 1filll
63Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc.
64
65@insertcopying
66
67Published by the Free Software Foundation, @*
6851 Franklin Street, Fifth Floor @*
69Boston, MA 02110-1301, USA
70@end titlepage
71
72
73@node Top
74@top
75
76@ifnottex
77@insertcopying
78@end ifnottex
79
80@menu
81* Introduction:: Introduction
82* Invoking sed:: Invocation
83* sed Programs:: @command{sed} programs
84* Examples:: Some sample scripts
85* Limitations:: Limitations and (non-)limitations of @value{SSED}
86* Other Resources:: Other resources for learning about @command{sed}
87* Reporting Bugs:: Reporting bugs
88
89* Extended regexps:: @command{egrep}-style regular expressions
90@ifset PERL
91* Perl regexps:: Perl-style regular expressions
92@end ifset
93
94* Concept Index:: A menu with all the topics in this manual.
95* Command and Option Index:: A menu with all @command{sed} commands and
96 command-line options.
97
98@detailmenu
99--- The detailed node listing ---
100
101sed Programs:
102* Execution Cycle:: How @command{sed} works
103* Addresses:: Selecting lines with @command{sed}
104* Regular Expressions:: Overview of regular expression syntax
105* Common Commands:: Often used commands
106* The "s" Command:: @command{sed}'s Swiss Army Knife
107* Other Commands:: Less frequently used commands
108* Programming Commands:: Commands for @command{sed} gurus
* Extended Commands:: Commands specific to @value{SSED}
110* Escapes:: Specifying special characters
111
112Examples:
113* Centering lines::
114* Increment a number::
115* Rename files to lower case::
116* Print bash environment::
117* Reverse chars of lines::
118* tac:: Reverse lines of files
119* cat -n:: Numbering lines
120* cat -b:: Numbering non-blank lines
121* wc -c:: Counting chars
122* wc -w:: Counting words
123* wc -l:: Counting lines
124* head:: Printing the first lines
125* tail:: Printing the last lines
126* uniq:: Make duplicate lines unique
127* uniq -d:: Print duplicated lines of input
128* uniq -u:: Remove all duplicated lines
129* cat -s:: Squeezing blank lines
130
131@ifset PERL
132Perl regexps:: Perl-style regular expressions
133* Backslash:: Introduces special sequences
134* Circumflex/dollar sign/period:: Behave specially with regard to new lines
135* Square brackets:: Are a bit different in strange cases
136* Options setting:: Toggle modifiers in the middle of a regexp
137* Non-capturing subpatterns:: Are not counted when backreferencing
138* Repetition:: Allows for non-greedy matching
139* Backreferences:: Allows for more than 10 back references
140* Assertions:: Allows for complex look ahead matches
141* Non-backtracking subpatterns:: Often gives more performance
142* Conditional subpatterns:: Allows if/then/else branches
143* Recursive patterns:: For example to match parentheses
144* Comments:: Because things can get complex...
145@end ifset
146
147@end detailmenu
148@end menu
149
150
151@node Introduction
152@chapter Introduction
153
154@cindex Stream editor
155@command{sed} is a stream editor.
156A stream editor is used to perform basic text
157transformations on an input stream
158(a file or input from a pipeline).
159While in some ways similar to an editor which
160permits scripted edits (such as @command{ed}),
161@command{sed} works by making only one pass over the
162input(s), and is consequently more efficient.
163But it is @command{sed}'s ability to filter text in a pipeline
164which particularly distinguishes it from other types of
165editors.
166
167
168@node Invoking sed
169@chapter Invocation
170
171Normally @command{sed} is invoked like this:
172
173@example
174sed SCRIPT INPUTFILE...
175@end example
176
177The full format for invoking @command{sed} is:
178
179@example
180sed OPTIONS... [SCRIPT] [INPUTFILE...]
181@end example
182
183If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-},
184@command{sed} filters the contents of the standard input. The @var{script}
185is actually the first non-option parameter, which @command{sed} specially
186considers a script and not an input file if (and only if) none of the
187other @var{options} specifies a script to be executed, that is if neither
188of the @option{-e} and @option{-f} options is specified.
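
As a minimal sketch of the two ways a script can reach @command{sed}
(the sample text is invented for illustration): in the first command the
script is the first non-option argument; in the second it is passed with
@option{-e}, so the remaining argument @file{-} is taken as an input file
name that stands for the standard input.

```shell
# Script given as the first non-option argument; input comes from stdin.
first=$(printf 'hello world\n' | sed 's/hello/goodbye/')

# Script given with -e; "-" explicitly names the standard input.
second=$(printf 'hello world\n' | sed -e 's/hello/goodbye/' -)

printf '%s\n' "$first" "$second"
```

Both commands should produce the same line, since only the way the
script is supplied differs.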
189
190@command{sed} may be invoked with the following command-line options:
191
192@table @code
193@item --version
194@opindex --version
195@cindex Version, printing
196Print out the version of @command{sed} that is being run and a copyright notice,
197then exit.
198
199@item --help
200@opindex --help
201@cindex Usage summary, printing
202Print a usage message briefly summarizing these command-line options
203and the bug-reporting address,
204then exit.
205
206@item -n
207@itemx --quiet
208@itemx --silent
209@opindex -n
210@opindex --quiet
211@opindex --silent
212@cindex Disabling autoprint, from command line
213By default, @command{sed} prints out the pattern space
214at the end of each cycle through the script.
215These options disable this automatic printing,
216and @command{sed} only produces output when explicitly told to
217via the @code{p} command.
218
219@item -i[@var{SUFFIX}]
220@itemx --in-place[=@var{SUFFIX}]
221@opindex -i
222@opindex --in-place
223@cindex In-place editing, activating
224@cindex @value{SSEDEXT}, in-place editing
225This option specifies that files are to be edited in-place.
226@value{SSED} does this by creating a temporary file and
227sending output to this file rather than to the standard
228output.@footnote{This applies to commands such as @code{=},
229@code{a}, @code{c}, @code{i}, @code{l}, @code{p}. You can
230still write to the standard output by using the @code{w}
231@cindex @value{SSEDEXT}, @file{/dev/stdout} file
232or @code{W} commands together with the @file{/dev/stdout}
233special file}.
234
235This option implies @option{-s}.
236
237When the end of the file is reached, the temporary file is
238renamed to the output file's original name. The extension,
239if supplied, is used to modify the name of the old file
240before renaming the temporary file, thereby making a backup
copy@footnote{Note that @value{SSED} creates the backup
file whether or not any output is actually changed.}.
243
244@cindex In-place editing, Perl-style backup file names
245This rule is followed: if the extension doesn't contain a @code{*},
246then it is appended to the end of the current filename as a
247suffix; if the extension does contain one or more @code{*}
248characters, then @emph{each} asterisk is replaced with the
249current filename. This allows you to add a prefix to the
250backup file, instead of (or in addition to) a suffix, or
251even to place backup copies of the original files into another
252directory (provided the directory already exists).
253
254If no extension is supplied, the original file is
255overwritten without making a backup.
256
257@item -l @var{N}
258@itemx --line-length=@var{N}
259@opindex -l
260@opindex --line-length
261@cindex Line length, setting
262Specify the default line-wrap length for the @code{l} command.
263A length of 0 (zero) means to never wrap long lines. If
264not specified, it is taken to be 70.
265
266@item --posix
267@cindex @value{SSEDEXT}, disabling
268@value{SSED} includes several extensions to @acronym{POSIX}
269sed. In order to simplify writing portable scripts, this
270option disables all the extensions that this manual documents,
271including additional commands.
272@cindex @code{POSIXLY_CORRECT} behavior, enabling
273Most of the extensions accept @command{sed} programs that
274are outside the syntax mandated by @acronym{POSIX}, but some
of them (such as the behavior of the @command{N} command
described in @ref{Reporting Bugs}) actually violate the
277standard. If you want to disable only the latter kind of
278extension, you can set the @code{POSIXLY_CORRECT} variable
279to a non-empty value.
280
281@item -r
282@itemx --regexp-extended
283@opindex -r
284@opindex --regexp-extended
285@cindex Extended regular expressions, choosing
286@cindex @acronym{GNU} extensions, extended regular expressions
287Use extended regular expressions rather than basic
288regular expressions. Extended regexps are those that
289@command{egrep} accepts; they can be clearer because they
usually have fewer backslashes, but are a @acronym{GNU} extension
291and hence scripts that use them are not portable.
292@xref{Extended regexps, , Extended regular expressions}.
293
294@ifset PERL
295@item -R
296@itemx --regexp-perl
297@opindex -R
298@opindex --regexp-perl
299@cindex Perl-style regular expressions, choosing
300@cindex @value{SSEDEXT}, Perl-style regular expressions
301Use Perl-style regular expressions rather than basic
302regular expressions. Perl-style regexps are extremely
303powerful but are a @value{SSED} extension and hence scripts that
use them are not portable. @xref{Perl regexps, ,
305Perl-style regular expressions}.
306@end ifset
307
308@item -s
309@itemx --separate
310@cindex Working on separate files
311By default, @command{sed} will consider the files specified on the
312command line as a single continuous long stream. This @value{SSED}
313extension allows the user to consider them as separate files:
314range addresses (such as @samp{/abc/,/def/}) are not allowed
315to span several files, line numbers are relative to the start
316of each file, @code{$} refers to the last line of each file,
317and files invoked from the @code{R} commands are rewound at the
318start of each file.
319
320@item -u
321@itemx --unbuffered
322@opindex -u
323@opindex --unbuffered
324@cindex Unbuffered I/O, choosing
325Buffer both input and output as minimally as practical.
326(This is particularly useful if the input is coming from
327the likes of @samp{tail -f}, and you wish to see the transformed
328output as soon as possible.)
329
330@item -e @var{script}
331@itemx --expression=@var{script}
332@opindex -e
333@opindex --expression
334@cindex Script, from command line
335Add the commands in @var{script} to the set of commands to be
336run while processing the input.
337
338@item -f @var{script-file}
339@itemx --file=@var{script-file}
340@opindex -f
341@opindex --file
342@cindex Script, from a file
343Add the commands contained in the file @var{script-file}
344to the set of commands to be run while processing the input.
345
346@end table
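
The following sketch exercises three of the options above, assuming a
@acronym{GNU} @command{sed} (both @option{-i} and @option{-s} are
extensions). The scratch directory and the file names @file{one.txt} and
@file{two.txt} are made up for the example.

```shell
# Work in a scratch directory so that no real files are touched.
dir=$(mktemp -d)
printf 'a\nb\nc\n' > "$dir/one.txt"
printf 'd\ne\n' > "$dir/two.txt"

# -n suppresses autoprint; only the explicit p command produces output.
only_second=$(sed -n '2p' "$dir/one.txt")

# Without -s the files form one stream, so $ is the last line of two.txt.
joined=$(sed -n '$p' "$dir/one.txt" "$dir/two.txt")

# With -s, $ refers to the last line of each file.
separate=$(sed -n -s '$p' "$dir/one.txt" "$dir/two.txt")

# -i.bak edits one.txt in place and keeps the original as one.txt.bak.
sed -i.bak 's/b/B/' "$dir/one.txt"
edited=$(cat "$dir/one.txt")
backup=$(cat "$dir/one.txt.bak")

printf '%s\n' "$only_second" "$joined" "$separate" "$edited" "$backup"
rm -rf "$dir"
```

Note that the backup suffix is attached directly to @option{-i}, with no
intervening space.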
347
348If no @option{-e}, @option{-f}, @option{--expression}, or @option{--file}
349options are given on the command-line,
350then the first non-option argument on the command line is
351taken to be the @var{script} to be executed.
352
353@cindex Files to be processed as input
354If any command-line parameters remain after processing the above,
355these parameters are interpreted as the names of input files to
356be processed.
357@cindex Standard input, processing as input
358A file name of @samp{-} refers to the standard input stream.
359The standard input will be processed if no file names are specified.
360
361
362@node sed Programs
363@chapter @command{sed} Programs
364
365@cindex @command{sed} program structure
366@cindex Script structure
367A @command{sed} program consists of one or more @command{sed} commands,
368passed in by one or more of the
369@option{-e}, @option{-f}, @option{--expression}, and @option{--file}
370options, or the first non-option argument if zero of these
371options are used.
372This document will refer to ``the'' @command{sed} script;
373this is understood to mean the in-order catenation
374of all of the @var{script}s and @var{script-file}s passed in.
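
A small sketch of this in-order catenation follows; the script file name
@file{second.sed} and the sample input are invented for the example. The
same two commands give different results depending on the order in which
@option{-e} and @option{-f} appear.

```shell
dir=$(mktemp -d)
printf 's/b/c/\n' > "$dir/second.sed"

# -e first, then -f: "a" becomes "b", which then becomes "c".
chained=$(printf 'a\n' | sed -e 's/a/b/' -f "$dir/second.sed")

# -f first, then -e: s/b/c/ finds nothing, then "a" becomes "b".
reversed=$(printf 'a\n' | sed -f "$dir/second.sed" -e 's/a/b/')

printf '%s\n' "$chained" "$reversed"
rm -rf "$dir"
```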
375
376Each @code{sed} command consists of an optional address or
377address range, followed by a one-character command name
378and any additional command-specific code.
379
380@menu
381* Execution Cycle:: How @command{sed} works
382* Addresses:: Selecting lines with @command{sed}
383* Regular Expressions:: Overview of regular expression syntax
384* Common Commands:: Often used commands
385* The "s" Command:: @command{sed}'s Swiss Army Knife
386* Other Commands:: Less frequently used commands
387* Programming Commands:: Commands for @command{sed} gurus
* Extended Commands:: Commands specific to @value{SSED}
389* Escapes:: Specifying special characters
390@end menu
391
392
393@node Execution Cycle
394@section How @command{sed} Works
395
396@cindex Buffer spaces, pattern and hold
397@cindex Spaces, pattern and hold
398@cindex Pattern space, definition
399@cindex Hold space, definition
400@command{sed} maintains two data buffers: the active @emph{pattern} space,
401and the auxiliary @emph{hold} space. Both are initially empty.
402
@command{sed} operates by performing the following cycle on each
line of input: first, @command{sed} reads one line from the input
stream, removes any trailing newline, and places it in the pattern space.
Then commands are executed; each command can have an address associated
with it. Addresses are a kind of condition, and a command is only
executed if its condition holds at the moment the command is about to
be executed.
410
411When the end of the script is reached, unless the @option{-n} option
412is in use, the contents of pattern space are printed out to the output
413stream, adding back the trailing newline if it was removed.@footnote{Actually,
414 if @command{sed} prints a line without the terminating newline, it will
415 nevertheless print the missing newline as soon as more text is sent to
416 the same output stream, which gives the ``least expected surprise''
417 even though it does not make commands like @samp{sed -n p} exactly
418 identical to @command{cat}.} Then the next cycle starts for the next
419input line.
420
421Unless special commands (like @samp{D}) are used, the pattern space is
422deleted between two cycles. The hold space, on the other hand, keeps
423its data between cycles (see commands @samp{h}, @samp{H}, @samp{x},
424@samp{g}, @samp{G} to move data between both buffers).
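
The cycle can be observed with two short experiments, given here as an
illustrative sketch on invented input. The first relies on the automatic
print at the end of each cycle; the second uses the hold space to carry
data from one cycle to the next and prints the input in reverse order.

```shell
# Autoprint runs at the end of every cycle, so an explicit p
# makes each line appear twice.
doubled=$(printf 'x\ny\n' | sed p)

# G appends the hold space to the pattern space (skipped on line 1),
# h copies the result back to the hold space, and $p prints only on
# the last line; the hold space survives between cycles.
reversed=$(printf '1\n2\n3\n' | sed -n -e '1!G' -e 'h' -e '$p')

printf '%s\n' "$doubled" "$reversed"
```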
425
426
427@node Addresses
428@section Selecting lines with @command{sed}
429@cindex Addresses, in @command{sed} scripts
430@cindex Line selection
431@cindex Selecting lines to process
432
433Addresses in a @command{sed} script can be in any of the following forms:
434@table @code
435@item @var{number}
436@cindex Address, numeric
437@cindex Line, selecting by number
438Specifying a line number will match only that line in the input.
439(Note that @command{sed} counts lines continuously across all input files
440unless @option{-i} or @option{-s} options are specified.)
441
442@item @var{first}~@var{step}
443@cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses
444This @acronym{GNU} extension matches every @var{step}th line
445starting with line @var{first}.
446In particular, lines will be selected when there exists
447a non-negative @var{n} such that the current line-number equals
448@var{first} + (@var{n} * @var{step}).
449Thus, to select the odd-numbered lines,
450one would use @code{1~2};
451to pick every third line starting with the second, @samp{2~3} would be used;
452to pick every fifth line starting with the tenth, use @samp{10~5};
453and @samp{50~0} is just an obscure way of saying @code{50}.
454
455@item $
456@cindex Address, last line
457@cindex Last line, selecting
458@cindex Line, selecting last
459This address matches the last line of the last file of input, or
460the last line of each file when the @option{-i} or @option{-s} options
461are specified.
462
463@item /@var{regexp}/
464@cindex Address, as a regular expression
465@cindex Line, selecting by regular expression match
466This will select any line which matches the regular expression @var{regexp}.
467If @var{regexp} itself includes any @code{/} characters,
468each must be escaped by a backslash (@code{\}).
469
470@cindex empty regular expression
471@cindex @value{SSEDEXT}, modifiers and the empty regular expression
472The empty regular expression @samp{//} repeats the last regular
473expression match (the same holds if the empty regular expression is
474passed to the @code{s} command). Note that modifiers to regular expressions
475are evaluated when the regular expression is compiled, thus it is invalid to
476specify them together with the empty regular expression.
477
478@item \%@var{regexp}%
479(The @code{%} may be replaced by any other single character.)
480
481@cindex Slash character, in regular expressions
482This also matches the regular expression @var{regexp},
483but allows one to use a different delimiter than @code{/}.
484This is particularly useful if the @var{regexp} itself contains
485a lot of slashes, since it avoids the tedious escaping of every @code{/}.
486If @var{regexp} itself includes any delimiter characters,
487each must be escaped by a backslash (@code{\}).
488
489@item /@var{regexp}/I
490@itemx \%@var{regexp}%I
491@cindex @acronym{GNU} extensions, @code{I} modifier
492@ifset PERL
493@cindex Perl-style regular expressions, case-insensitive
494@end ifset
495The @code{I} modifier to regular-expression matching is a @acronym{GNU}
496extension which causes the @var{regexp} to be matched in
497a case-insensitive manner.
498
499@item /@var{regexp}/M
500@itemx \%@var{regexp}%M
@cindex @value{SSEDEXT}, @code{M} modifier
@ifset PERL
@cindex Perl-style regular expressions, multiline
@end ifset
505The @code{M} modifier to regular-expression matching is a @value{SSED}
506extension which causes @code{^} and @code{$} to match respectively
507(in addition to the normal behavior) the empty string after a newline,
508and the empty string before a newline. There are special character
509sequences
510@ifset PERL
511(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
512in basic or extended regular expression modes)
513@end ifset
514@ifclear PERL
515(@code{\`} and @code{\'})
516@end ifclear
517which always match the beginning or the end of the buffer.
518@code{M} stands for @cite{multi-line}.
519
520@ifset PERL
521@item /@var{regexp}/S
522@itemx \%@var{regexp}%S
523@cindex @value{SSEDEXT}, @code{S} modifier
524@cindex Perl-style regular expressions, single line
525The @code{S} modifier to regular-expression matching is only valid
526in Perl mode and specifies that the dot character (@code{.}) will
527match the newline character too. @code{S} stands for @cite{single-line}.
528@end ifset
529
530@ifset PERL
531@item /@var{regexp}/X
532@itemx \%@var{regexp}%X
533@cindex @value{SSEDEXT}, @code{X} modifier
534@cindex Perl-style regular expressions, extended
535The @code{X} modifier to regular-expression matching is also
536valid in Perl mode only. If it is used, whitespace in the
537pattern (other than in a character class) and
538characters between a @kbd{#} outside a character class and the
539next newline character are ignored. An escaping backslash
540can be used to include a whitespace or @kbd{#} character as part
541of the pattern.
542@end ifset
543@end table
544
545If no addresses are given, then all lines are matched;
546if one address is given, then only lines matching that
547address are matched.
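
The single-address forms can be tried out as in the sketch below. It
assumes a @acronym{GNU} @command{sed}, because @samp{1~2} and the
@code{I} modifier are extensions; the helper function @code{sample} and
its five words are made up for the example.

```shell
# Hypothetical helper that prints five sample lines.
sample() { printf '%s\n' alpha beta gamma delta epsilon; }

third=$(sample | sed -n '3p')            # a line number
odd=$(sample | sed -n '1~2p')            # first~step (GNU extension)
last=$(sample | sed -n '$p')             # the last line
b_or_d=$(sample | sed -n '/^[bd]/p')     # a regular expression
nocase=$(sample | sed -n '/ALPHA/Ip')    # I modifier (GNU extension)

# A different delimiter avoids escaping every slash in the regexp.
usr=$(printf '%s\n' /usr/bin /etc | sed -n '\%^/usr/%p')

printf '%s\n' "$third" "$odd" "$last" "$b_or_d" "$nocase" "$usr"
```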
548
549@cindex Range of lines
550@cindex Several lines, selecting
551An address range can be specified by specifying two addresses
552separated by a comma (@code{,}). An address range matches lines
553starting from where the first address matches, and continues
554until the second address matches (inclusively).
555
556If the second address is a @var{regexp}, then checking for the
557ending match will start with the line @emph{following} the
558line which matched the first address: a range will always
559span at least two lines (except of course if the input stream
560ends).
561
562If the second address is a @var{number} less than (or equal to)
563the line matching the first address, then only the one line is
564matched.
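
These range rules can be checked with a short sketch on invented input
(the helper function @code{digits} is hypothetical). The last command
shows that a line matching the first address does not close its own
range, even when it also matches the second address.

```shell
# Hypothetical helper that prints the numbers 1 to 6, one per line.
digits() { printf '%s\n' 1 2 3 4 5 6; }

by_number=$(digits | sed -n '2,4p')      # lines 2 through 4
by_regex=$(digits | sed -n '/3/,/5/p')   # from a match of 3 to a match of 5
one_line=$(digits | sed -n '4,2p')       # second number too small: line 4 only

# The end regexp is first tested on the line after the starting line.
spans=$(printf '%s\n' x y x y | sed -n '/x/,/x/p')

printf '%s\n' "$by_number" "$by_regex" "$one_line" "$spans"
```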
565
566@cindex Special addressing forms
567@cindex Range with start address of zero
568@cindex Zero, as range start address
569@cindex @var{addr1},+N
570@cindex @var{addr1},~N
571@cindex @acronym{GNU} extensions, special two-address forms
572@cindex @acronym{GNU} extensions, @code{0} address
573@cindex @acronym{GNU} extensions, 0,@var{addr2} addressing
574@cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing
575@cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing
576@value{SSED} also supports some special two-address forms; all these
577are @acronym{GNU} extensions:
578@table @code
579@item 0,/@var{regexp}/
580A line number of @code{0} can be used in an address specification like
581@code{0,/@var{regexp}/} so that @command{sed} will try to match
582@var{regexp} in the first input line too. In other words,
583@code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/},
except that if @var{regexp} matches the very first line of input the
585@code{0,/@var{regexp}/} form will consider it to end the range, whereas
586the @code{1,/@var{regexp}/} form will match the beginning of its range and
587hence make the range span up to the @emph{second} occurrence of the
588regular expression.
589
590Note that this is the only place where the @code{0} address makes
591sense; there is no 0-th line and commands which are given the @code{0}
592address in any other way will give an error.
593
594@item @var{addr1},+@var{N}
595Matches @var{addr1} and the @var{N} lines following @var{addr1}.
596
597@item @var{addr1},~@var{N}
598Matches @var{addr1} and the lines following @var{addr1}
599until the next line whose input line number is a multiple of @var{N}.
600@end table
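
The special two-address forms can be compared side by side in the
following sketch, which only works with a @acronym{GNU} @command{sed}.
The sample words and numbers are invented.

```shell
# Hypothetical helper: the first line already matches /foo/.
words() { printf '%s\n' foo bar foo baz; }

# 0,/foo/ lets the regexp end the range on the very first line.
from_zero=$(words | sed -n '0,/foo/p')

# 1,/foo/ starts on line 1 and looks for the end from line 2 onward.
from_one=$(words | sed -n '1,/foo/p')

# addr1,+N: line 2 and the 1 line that follows it.
plus=$(printf '%s\n' 1 2 3 4 5 6 | sed -n '2,+1p')

# addr1,~N: from line 2 up to the next line number divisible by 4.
multiple=$(printf '%s\n' 1 2 3 4 5 6 | sed -n '2,~4p')

printf '%s\n' "$from_zero" "$from_one" "$plus" "$multiple"
```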
601
602@cindex Excluding lines
603@cindex Selecting non-matching lines
604Appending the @code{!} character to the end of an address
605specification negates the sense of the match.
606That is, if the @code{!} character follows an address range,
607then only lines which do @emph{not} match the address range
608will be selected.
609This also works for singleton addresses,
610and, perhaps perversely, for the null address.
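
A brief sketch of negated addresses, on invented input:

```shell
# Print everything outside the range 2,3.
outside=$(printf '%s\n' 1 2 3 4 | sed -n '2,3!p')

# Print every line except the last one.
not_last=$(printf '%s\n' 1 2 3 4 | sed -n '$!p')

# Print the lines that do not start with a hash sign.
no_comment=$(printf '%s\n' '# note' code | sed -n '/^#/!p')

printf '%s\n' "$outside" "$not_last" "$no_comment"
```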
611
612
613@node Regular Expressions
614@section Overview of Regular Expression Syntax
615
616To know how to use @command{sed}, people should understand regular
617expressions (@dfn{regexp} for short). A regular expression
618is a pattern that is matched against a
619subject string from left to right. Most characters are
620@dfn{ordinary}: they stand for
621themselves in a pattern, and match the corresponding characters
622in the subject. As a trivial example, the pattern
623
624@example
625 The quick brown fox
626@end example
627
628@noindent
629matches a portion of a subject string that is identical to
630itself. The power of regular expressions comes from the
631ability to include alternatives and repetitions in the pattern.
632These are encoded in the pattern by the use of @dfn{special characters},
633which do not stand for themselves but instead
634are interpreted in some special way. Here is a brief description
635of regular expression syntax as used in @command{sed}.
636
637@table @code
638@item @var{char}
639A single ordinary character matches itself.
640
641@item *
642@cindex @acronym{GNU} extensions, to basic regular expressions
643Matches a sequence of zero or more instances of matches for the
644preceding regular expression, which must be an ordinary character, a
645special character preceded by @code{\}, a @code{.}, a grouped regexp
646(see below), or a bracket expression. As a @acronym{GNU} extension, a
647postfixed regular expression can also be followed by @code{*}; for
648example, @code{a**} is equivalent to @code{a*}. @acronym{POSIX}
6491003.1-2001 says that @code{*} stands for itself when it appears at
650the start of a regular expression or subexpression, but many
non-@acronym{GNU} implementations do not support this and portable
652scripts should instead use @code{\*} in these contexts.
653
654@item \+
655@cindex @acronym{GNU} extensions, to basic regular expressions
656As @code{*}, but matches one or more. It is a @acronym{GNU} extension.
657
658@item \?
659@cindex @acronym{GNU} extensions, to basic regular expressions
660As @code{*}, but only matches zero or one. It is a @acronym{GNU} extension.
661
662@item \@{@var{i}\@}
663As @code{*}, but matches exactly @var{i} sequences (@var{i} is a
664decimal integer; for portability, keep it between 0 and 255
665inclusive).
666
667@item \@{@var{i},@var{j}\@}
As @code{*}, but matches between @var{i} and @var{j} sequences, inclusive.
669
670@item \@{@var{i},\@}
As @code{*}, but matches @var{i} or more sequences.
672
673@item \(@var{regexp}\)
Groups the inner @var{regexp} as a whole; this is used to:
675
676@itemize @bullet
677@item
678@cindex @acronym{GNU} extensions, to basic regular expressions
679Apply postfix operators, like @code{\(abcd\)*}:
680this will search for zero or more whole sequences
681of @samp{abcd}, while @code{abcd*} would search
682for @samp{abc} followed by zero or more occurrences
683of @samp{d}. Note that support for @code{\(abcd\)*} is
684required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU}
685implementations do not support it and hence it is not universally
686portable.
687
688@item
689Use back references (see below).
690@end itemize
691
692@item .
693Matches any character, including newline.
694
695@item ^
696Matches the null string at beginning of line, i.e. what
697appears after the circumflex must appear at the
698beginning of line. @code{^#include} will match only
lines where @samp{#include} is the first thing on the line---if
700there are spaces before, for example, the match fails.
701@code{^} acts as a special character only at the beginning
702of the regular expression or subexpression (that is,
703after @code{\(} or @code{\|}). Portable scripts should avoid
704@code{^} at the beginning of a subexpression, though, as
705@acronym{POSIX} allows implementations that treat @code{^} as
706an ordinary character in that context.
707
708
709@item $
710It is the same as @code{^}, but refers to end of line.
711@code{$} also acts as a special character only at the end
712of the regular expression or subexpression (that is, before @code{\)}
713or @code{\|}), and its use at the end of a subexpression is not
714portable.
715
716
717@item [@var{list}]
718@itemx [^@var{list}]
719Matches any single character in @var{list}: for example,
720@code{[aeiou]} matches all vowels. A list may include
721sequences like @code{@var{char1}-@var{char2}}, which
722matches any character between (inclusive) @var{char1}
723and @var{char2}.
724
725A leading @code{^} reverses the meaning of @var{list}, so that
726it matches any single character @emph{not} in @var{list}. To include
@code{]} in the list, make it the first character (after
the @code{^} if needed); to include @code{-} in the list,
make it the first or last; to include @code{^}, put
it after the first character.
731
732@cindex @code{POSIXLY_CORRECT} behavior, bracket expressions
733The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\}
734are normally not special within @var{list}. For example, @code{[\*]}
735matches either @samp{\} or @samp{*}, because the @code{\} is not
736special here. However, strings like @code{[.ch.]}, @code{[=a=]}, and
737@code{[:space:]} are special within @var{list} and represent collating
738symbols, equivalence classes, and character classes, respectively, and
739@code{[} is therefore special within @var{list} when it is followed by
740@code{.}, @code{=}, or @code{:}. Also, when not in
741@env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and
742@code{\t} are recognized within @var{list}. @xref{Escapes}.
743
744@item @var{regexp1}\|@var{regexp2}
745@cindex @acronym{GNU} extensions, to basic regular expressions
746Matches either @var{regexp1} or @var{regexp2}. Use
747parentheses to use complex alternative regular expressions.
748The matching process tries each alternative in turn, from
749left to right, and the first one that succeeds is used.
750It is a @acronym{GNU} extension.
751
752@item @var{regexp1}@var{regexp2}
753Matches the concatenation of @var{regexp1} and @var{regexp2}.
754Concatenation binds more tightly than @code{\|}, @code{^}, and
755@code{$}, but less tightly than the other regular expression
756operators.
757
758@item \@var{digit}
759Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized
760subexpression in the regular expression. This is called a @dfn{back
reference}. Subexpressions are implicitly numbered by counting
762occurrences of @code{\(} left-to-right.
763
764@item \n
765Matches the newline character.
766
767@item \@var{char}
768Matches @var{char}, where @var{char} is one of @code{$},
769@code{*}, @code{.}, @code{[}, @code{\}, or @code{^}.
770Note that the only C-like
771backslash sequences that you can portably assume to be
772interpreted are @code{\n} and @code{\\}; in particular
773@code{\t} is not portable, and matches a @samp{t} under most
774implementations of @command{sed}, rather than a tab character.
775
776@end table
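
Several of the operators above are easier to follow when the matched
text is made visible. The sketch below does this by replacing each match
with @samp{X}, using the @code{s} command (covered in its own section)
purely as a display device. The inputs are invented, and @code{\|} needs
a @acronym{GNU} @command{sed}.

```shell
# A grouped regexp repeated as a whole, versus * applied to "b" only.
group_star=$(printf 'ababc\n' | sed 's/\(ab\)*c/X/')
plain_star=$(printf 'ababc\n' | sed 's/ab*c/X/')

# An interval: exactly two "a"s.
interval=$(printf 'aaa\n' | sed 's/a\{2\}/X/')

# Alternation (GNU extension): the leftmost match in the input wins.
alternation=$(printf 'cat dog\n' | sed 's/dog\|cat/X/')

# A back reference: \1 repeats whatever the first group matched.
backref=$(printf 'hello hello world\n' | sed 's/\(hello\) \1/\1/')

printf '%s\n' "$group_star" "$plain_star" "$interval" "$alternation" "$backref"
```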
777
778@cindex Greedy regular expression matching
779Note that the regular expression matcher is greedy, i.e., matches
780are attempted from left to right and, if two or more matches are
781possible starting at the same character, it selects the longest.
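
Both halves of this rule can be seen in a two-line sketch on invented
input, again using the @code{s} command only to mark the matched text
with @samp{X}:

```shell
# Longest match: .* runs to the last "c", not the first one.
longest=$(printf 'abcabc\n' | sed 's/a.*c/X/')

# Leftmost match: a* matches the empty string before "b", so the
# later run of "a"s is never reached.
leftmost=$(printf 'baaac\n' | sed 's/a*/X/')

printf '%s\n' "$longest" "$leftmost"
```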
782
783@noindent
784Examples:
785@table @samp
786@item abcdef
787Matches @samp{abcdef}.
788
789@item a*b
790Matches zero or more @samp{a}s followed by a single
791@samp{b}. For example, @samp{b} or @samp{aaaaab}.
792
793@item a\?b
794Matches @samp{b} or @samp{ab}.
795
796@item a\+b\+
797Matches one or more @samp{a}s followed by one or more
798@samp{b}s: @samp{ab} is the shortest possible match, but
799other examples are @samp{aaaab} or @samp{abbbbb} or
800@samp{aaaaaabbbbbbb}.
801
802@item .*
803@itemx .\+
804These two both match all the characters in a string;
805however, the first matches every string (including the empty
806string), while the second matches only strings containing
807at least one character.
808
809@item ^main.*(.*)
This matches a string starting with @samp{main},
811followed by an opening and closing
812parenthesis. The @samp{n}, @samp{(} and @samp{)} need not
813be adjacent.
814
815@item ^#
816This matches a string beginning with @samp{#}.
817
818@item \\$
819This matches a string ending with a single backslash. The
820regexp contains two backslashes for escaping.
821
822@item \$
By contrast, this matches a single literal dollar sign,
because the @samp{$} is escaped.
825
826@item [a-zA-Z0-9]
In the C locale, this matches any single @acronym{ASCII} letter or digit.
828
829@item [^ @kbd{tab}]\+
830(Here @kbd{tab} stands for a single tab character.)
831This matches a string of one or more
832characters, none of which is a space or a tab.
833Usually this means a word.
834
835@item ^\(.*\)\n\1$
836This matches a string consisting of two equal substrings separated by
837a newline.
838
839@item .\@{9\@}A$
This matches nine characters followed by an @samp{A} at the end of a string.
841
842@item ^.\@{15\@}A
843This matches the start of a string that contains 16 characters,
844the last of which is an @samp{A}.
845
846@end table
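
A few of the patterns above can be tried directly from the shell.
The following is a minimal sketch, assuming a three-line sample input
invented for the purpose; @option{-n} together with the @code{p}
command prints only the lines that match the address:

@example
printf 'main (void)\n# a comment\nint x;\n' | sed -n '/^main.*(.*)/p'
printf 'main (void)\n# a comment\nint x;\n' | sed -n '/^#/p'
@end example

@noindent
The first command should print @samp{main (void)} and the second
@samp{# a comment}.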
847
848
849
850@node Common Commands
851@section Often-Used Commands
852
853If you use @command{sed} at all, you will quite likely want to know
854these commands.
855
856@table @code
857@item #
858[No addresses allowed.]
859
860@findex # (comments)
861@cindex Comments, in scripts
862The @code{#} character begins a comment;
863the comment continues until the next newline.
864
865@cindex Portability, comments
866If you are concerned about portability, be aware that
867some implementations of @command{sed} (which are not @sc{posix}
868conformant) may only support a single one-line comment,
869and then only when the very first character of the script is a @code{#}.
870
871@findex -n, forcing from within a script
872@cindex Caveat --- #n on first line
873Warning: if the first two characters of the @command{sed} script
874are @code{#n}, then the @option{-n} (no-autoprint) option is forced.
875If you want to put a comment in the first line of your script
876and that comment begins with the letter @samp{n}
877and you do not want this behavior,
878then be sure to either use a capital @samp{N},
879or place at least one space before the @samp{n}.
880
881@item q [@var{exit-code}]
882This command only accepts a single address.
883
884@findex q (quit) command
885@cindex @value{SSEDEXT}, returning an exit code
886@cindex Quitting
887Exit @command{sed} without processing any more commands or input.
Note that the current pattern space is printed if auto-print is
not disabled with the @option{-n} option. The ability to return
an exit code from the @command{sed} script is a @value{SSED} extension.
891
892@item d
893@findex d (delete) command
894@cindex Text, deleting
895Delete the pattern space;
896immediately start next cycle.
897
898@item p
899@findex p (print) command
900@cindex Text, printing
901Print out the pattern space (to the standard output).
902This command is usually only used in conjunction with the @option{-n}
903command-line option.
904
905@item n
906@findex n (next-line) command
907@cindex Next input line, replace pattern space with
908@cindex Read next input line
909If auto-print is not disabled, print the pattern space,
910then, regardless, replace the pattern space with the next line of input.
911If there is no more input then @command{sed} exits without processing
912any more commands.
913
914@item @{ @var{commands} @}
915@findex @{@} command grouping
916@cindex Grouping commands
917@cindex Command groups
918A group of commands may be enclosed between
919@code{@{} and @code{@}} characters.
920This is particularly useful when you want a group of commands
921to be triggered by a single address (or address-range) match.
922
923@end table
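
The commands above are easiest to understand by trying them on a small
input. The following sketch assumes a three-line input generated with
@command{printf}; the line numbers used as addresses are arbitrary:

@example
# print only line 2 (auto-print disabled by -n)
printf 'one\ntwo\nthree\n' | sed -n '2p'
# delete line 2, print the rest
printf 'one\ntwo\nthree\n' | sed '2d'
# quit after line 2 (which is still printed)
printf 'one\ntwo\nthree\n' | sed '2q'
@end example

@noindent
The first command should print @samp{two}, the second @samp{one} and
@samp{three}, and the third @samp{one} and @samp{two}.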
924
925@node The "s" Command
926@section The @code{s} Command
927
928The syntax of the @code{s} (as in substitute) command is
929@samp{s/@var{regexp}/@var{replacement}/@var{flags}}. The @code{/}
930characters may be uniformly replaced by any other single
931character within any given @code{s} command. The @code{/}
932character (or whatever other character is used in its stead)
933can appear in the @var{regexp} or @var{replacement}
934only if it is preceded by a @code{\} character.
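
For instance, when the text to be replaced is a file name, choosing a
delimiter other than @code{/} avoids escaping every slash. The two
commands in this sketch (the path is only an illustration) are
equivalent:

@example
echo '/usr/local/bin' | sed 's|/usr/local|/opt|'
echo '/usr/local/bin' | sed 's/\/usr\/local/\/opt/'
@end example

@noindent
Both should print @samp{/opt/bin}.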
935
936The @code{s} command is probably the most important in @command{sed}
937and has a lot of different options. Its basic concept is simple:
938the @code{s} command attempts to match the pattern
939space against the supplied @var{regexp}; if the match is
940successful, then that portion of the pattern
941space which was matched is replaced with @var{replacement}.
942
943@cindex Backreferences, in regular expressions
944@cindex Parenthesized substrings
945The @var{replacement} can contain @code{\@var{n}} (@var{n} being
946a number from 1 to 9, inclusive) references, which refer to
947the portion of the match which is contained between the @var{n}th
948@code{\(} and its matching @code{\)}.
949Also, the @var{replacement} can contain unescaped @code{&}
950characters which reference the whole matched portion
951of the pattern space.
952@cindex @value{SSEDEXT}, case modifiers in @code{s} commands
953Finally, as a @value{SSED} extension, you can include a
954special sequence made of a backslash and one of the letters
955@code{L}, @code{l}, @code{U}, @code{u}, or @code{E}.
956The meaning is as follows:
957
958@table @code
959@item \L
960Turn the replacement
961to lowercase until a @code{\U} or @code{\E} is found,
962
963@item \l
964Turn the
965next character to lowercase,
966
967@item \U
968Turn the replacement to uppercase
969until a @code{\L} or @code{\E} is found,
970
971@item \u
972Turn the next character
973to uppercase,
974
975@item \E
976Stop case conversion started by @code{\L} or @code{\U}.
977@end table
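
As a sketch of how these sequences combine (this relies on the
@value{SSED} extension just described and is not expected to work with
other implementations; the input is invented), the following
upper-cases the first word and capitalizes the second:

@example
echo 'hello world' | sed 's/\(hello\) \(world\)/\U\1\E \u\2/'
@end example

@noindent
With @value{SSED} this should print @samp{HELLO World}.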
978
979To include a literal @code{\}, @code{&}, or newline in the final
980replacement, be sure to precede the desired @code{\}, @code{&},
981or newline in the @var{replacement} with a @code{\}.
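
The following sketch (with made-up input) contrasts an unescaped
@code{&}, which stands for the matched text, with an escaped one, and
shows back-references swapping two words:

@example
echo 'fish' | sed 's/fish/& \& chips/'
echo 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/'
@end example

@noindent
These should print @samp{fish & chips} and @samp{world hello}
respectively.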
982
983@findex s command, option flags
984@cindex Substitution of text, options
985The @code{s} command can be followed by zero or more of the
986following @var{flags}:
987
988@table @code
989@item g
990@cindex Global substitution
991@cindex Replacing all text matching regexp in a line
992Apply the replacement to @emph{all} matches to the @var{regexp},
993not just the first.
994
995@item @var{number}
996@cindex Replacing only @var{n}th match of regexp in a line
997Only replace the @var{number}th match of the @var{regexp}.
998
999@cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command
1000@cindex Mixing @code{g} and @var{number} modifiers in the @code{s} command
1001Note: the @sc{posix} standard does not specify what should happen
1002when you mix the @code{g} and @var{number} modifiers,
1003and currently there is no widely agreed upon meaning
1004across @command{sed} implementations.
1005For @value{SSED}, the interaction is defined to be:
1006ignore matches before the @var{number}th,
1007and then match and replace all matches from
1008the @var{number}th on.
1009
1010@item p
1011@cindex Text, printing after substitution
1012If the substitution was made, then print the new pattern space.
1013
1014Note: when both the @code{p} and @code{e} options are specified,
1015the relative ordering of the two produces very different results.
1016In general, @code{ep} (evaluate then print) is what you want,
1017but operating the other way round can be useful for debugging.
1018For this reason, the current version of @value{SSED} interprets
1019specially the presence of @code{p} options both before and after
1020@code{e}, printing the pattern space before and after evaluation,
1021while in general flags for the @code{s} command show their
1022effect just once. This behavior, although documented, might
1023change in future versions.
1024
1025@item w @var{file-name}
1026@cindex Text, writing to a file after substitution
1027@cindex @value{SSEDEXT}, @file{/dev/stdout} file
1028@cindex @value{SSEDEXT}, @file{/dev/stderr} file
1029If the substitution was made, then write out the result to the named file.
1030As a @value{SSED} extension, two special values of @var{file-name} are
1031supported: @file{/dev/stderr}, which writes the result to the standard
1032error, and @file{/dev/stdout}, which writes to the standard
1033output.@footnote{This is equivalent to @code{p} unless the @option{-i}
1034option is being used.}
1035
1036@item e
1037@cindex Evaluate Bourne-shell commands, after substitution
1038@cindex Subprocesses
1039@cindex @value{SSEDEXT}, evaluating Bourne-shell commands
1040@cindex @value{SSEDEXT}, subprocesses
1041This command allows one to pipe input from a shell command
1042into pattern space. If a substitution was made, the command
1043that is found in pattern space is executed and pattern space
1044is replaced with its output. A trailing newline is suppressed;
1045results are undefined if the command to be executed contains
1046a @sc{nul} character. This is a @value{SSED} extension.
1047
1048@item I
1049@itemx i
1050@cindex @acronym{GNU} extensions, @code{I} modifier
1051@cindex Case-insensitive matching
1052@ifset PERL
1053@cindex Perl-style regular expressions, case-insensitive
1054@end ifset
1055The @code{I} modifier to regular-expression matching is a @acronym{GNU}
1056extension which makes @command{sed} match @var{regexp} in a
1057case-insensitive manner.
1058
1059@item M
1060@itemx m
1061@cindex @value{SSEDEXT}, @code{M} modifier
1062@ifset PERL
1063@cindex Perl-style regular expressions, multiline
1064@end ifset
1065The @code{M} modifier to regular-expression matching is a @value{SSED}
1066extension which causes @code{^} and @code{$} to match respectively
1067(in addition to the normal behavior) the empty string after a newline,
1068and the empty string before a newline. There are special character
1069sequences
1070@ifset PERL
1071(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
1072in basic or extended regular expression modes)
1073@end ifset
1074@ifclear PERL
1075(@code{\`} and @code{\'})
1076@end ifclear
1077which always match the beginning or the end of the buffer.
1078@code{M} stands for @cite{multi-line}.
1079
1080@ifset PERL
1081@item S
1082@itemx s
1083@cindex @value{SSEDEXT}, @code{S} modifier
1084@cindex Perl-style regular expressions, single line
1085The @code{S} modifier to regular-expression matching is only valid
1086in Perl mode and specifies that the dot character (@code{.}) will
1087match the newline character too. @code{S} stands for @cite{single-line}.
1088@end ifset
1089
1090@ifset PERL
1091@item X
1092@itemx x
1093@cindex @value{SSEDEXT}, @code{X} modifier
1094@cindex Perl-style regular expressions, extended
1095The @code{X} modifier to regular-expression matching is also
1096valid in Perl mode only. If it is used, whitespace in the
1097pattern (other than in a character class) and
1098characters between a @kbd{#} outside a character class and the
1099next newline character are ignored. An escaping backslash
1100can be used to include a whitespace or @kbd{#} character as part
1101of the pattern.
1102@end ifset
1103@end table
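
The effect of the most common flags can be compared on a single input
line. This is a sketch with an invented input; the last two commands
depend on the @value{SSED} behavior described above (mixing @code{g}
with @var{number}, and the @code{I} modifier):

@example
echo 'a-a-a-a' | sed 's/a/X/'
echo 'a-a-a-a' | sed 's/a/X/g'
echo 'a-a-a-a' | sed 's/a/X/3'
echo 'a-a-a-a' | sed 's/a/X/3g'
echo 'Hello' | sed -n 's/hello/bye/Ip'
@end example

@noindent
With @value{SSED} these should print @samp{X-a-a-a}, @samp{X-X-X-X},
@samp{a-a-X-a}, @samp{a-a-X-X} and @samp{bye}, in that order.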
1104
1105
1106@node Other Commands
1107@section Less Frequently-Used Commands
1108
1109Though perhaps less frequently used than those in the previous
1110section, some very small yet useful @command{sed} scripts can be built with
1111these commands.
1112
1113@table @code
1114@item y/@var{source-chars}/@var{dest-chars}/
1115(The @code{/} characters may be uniformly replaced by
1116any other single character within any given @code{y} command.)
1117
1118@findex y (transliterate) command
1119@cindex Transliteration
1120Transliterate any characters in the pattern space which match
1121any of the @var{source-chars} with the corresponding character
1122in @var{dest-chars}.
1123
Instances of the @code{/} (or whatever other character is used in its stead),
@code{\}, or newlines can appear in the @var{source-chars} or @var{dest-chars}
lists, provided that each instance is escaped by a @code{\}.
1127The @var{source-chars} and @var{dest-chars} lists @emph{must}
1128contain the same number of characters (after de-escaping).
1129
1130@item a\
1131@itemx @var{text}
1132@cindex @value{SSEDEXT}, two addresses supported by most commands
1133As a @acronym{GNU} extension, this command accepts two addresses.
1134
1135@findex a (append text lines) command
1136@cindex Appending text after a line
1137@cindex Text, appending
1138Queue the lines of text which follow this command
1139(each but the last ending with a @code{\},
1140which are removed from the output)
1141to be output at the end of the current cycle,
1142or when the next input line is read.
1143
1144Escape sequences in @var{text} are processed, so you should
1145use @code{\\} in @var{text} to print a single backslash.
1146
As a @acronym{GNU} extension, if there is anything other than a
whitespace-@code{\} sequence between the @code{a} and the newline,
then the text of this line,
starting at the first non-whitespace character after the @code{a},
is taken as the first line of the @var{text} block.
1151(This enables a simplification in scripting a one-line add.)
1152This extension also works with the @code{i} and @code{c} commands.
1153
1154@item i\
1155@itemx @var{text}
1156@cindex @value{SSEDEXT}, two addresses supported by most commands
1157As a @acronym{GNU} extension, this command accepts two addresses.
1158
1159@findex i (insert text lines) command
1160@cindex Inserting text before a line
1161@cindex Text, insertion
1162Immediately output the lines of text which follow this command
1163(each but the last ending with a @code{\},
1164which are removed from the output).
1165
1166@item c\
1167@itemx @var{text}
1168@findex c (change to text lines) command
1169@cindex Replacing selected lines with other text
1170Delete the lines matching the address or address-range,
1171and output the lines of text which follow this command
1172(each but the last ending with a @code{\},
1173which are removed from the output)
1174in place of the last line
1175(or in place of each line, if no addresses were specified).
1176A new cycle is started after this command is done,
1177since the pattern space will have been deleted.
1178
1179@item =
1180@cindex @value{SSEDEXT}, two addresses supported by most commands
1181As a @acronym{GNU} extension, this command accepts two addresses.
1182
1183@findex = (print line number) command
1184@cindex Printing line number
1185@cindex Line number, printing
1186Print out the current input line number (with a trailing newline).
1187
1188@item l @var{n}
1189@findex l (list unambiguously) command
1190@cindex List pattern space
1191@cindex Printing text unambiguously
1192@cindex Line length, setting
1193@cindex @value{SSEDEXT}, setting line length
1194Print the pattern space in an unambiguous form:
1195non-printable characters (and the @code{\} character)
1196are printed in C-style escaped form; long lines are split,
1197with a trailing @code{\} character to indicate the split;
1198the end of each line is marked with a @code{$}.
1199
1200@var{n} specifies the desired line-wrap length;
1201a length of 0 (zero) means to never wrap long lines. If omitted,
1202the default as specified on the command line is used. The @var{n}
1203parameter is a @value{SSED} extension.
1204
1205@item r @var{filename}
1206@cindex @value{SSEDEXT}, two addresses supported by most commands
1207As a @acronym{GNU} extension, this command accepts two addresses.
1208
1209@findex r (read file) command
1210@cindex Read text from a file
1211@cindex @value{SSEDEXT}, @file{/dev/stdin} file
1212Queue the contents of @var{filename} to be read and
1213inserted into the output stream at the end of the current cycle,
1214or when the next input line is read.
1215Note that if @var{filename} cannot be read, it is treated as
1216if it were an empty file, without any error indication.
1217
1218As a @value{SSED} extension, the special value @file{/dev/stdin}
1219is supported for the file name, which reads the contents of the
1220standard input.
1221
1222@item w @var{filename}
1223@findex w (write file) command
1224@cindex Write to a file
1225@cindex @value{SSEDEXT}, @file{/dev/stdout} file
1226@cindex @value{SSEDEXT}, @file{/dev/stderr} file
1227Write the pattern space to @var{filename}.
As a @value{SSED} extension, two special values of @var{filename} are
1229supported: @file{/dev/stderr}, which writes the result to the standard
1230error, and @file{/dev/stdout}, which writes to the standard
1231output.@footnote{This is equivalent to @code{p} unless the @option{-i}
1232option is being used.}
1233
1234The file will be created (or truncated) before the
1235first input line is read; all @code{w} commands
1236(including instances of @code{w} flag on successful @code{s} commands)
1237which refer to the same @var{filename} are output without
1238closing and reopening the file.
1239
1240@item D
1241@findex D (delete first line) command
1242@cindex Delete first line from pattern space
1243Delete text in the pattern space up to the first newline.
1244If any text is left, restart cycle with the resultant
1245pattern space (without reading a new line of input),
1246otherwise start a normal new cycle.
1247
1248@item N
1249@findex N (append Next line) command
1250@cindex Next input line, append to pattern space
1251@cindex Append next input line to pattern space
1252Add a newline to the pattern space,
1253then append the next line of input to the pattern space.
1254If there is no more input then @command{sed} exits without processing
1255any more commands.
1256
1257@item P
1258@findex P (print first line) command
1259@cindex Print first line from pattern space
1260Print out the portion of the pattern space up to the first newline.
1261
1262@item h
1263@findex h (hold) command
1264@cindex Copy pattern space into hold space
1265@cindex Replace hold space with copy of pattern space
1266@cindex Hold space, copying pattern space into
1267Replace the contents of the hold space with the contents of the pattern space.
1268
1269@item H
1270@findex H (append Hold) command
1271@cindex Append pattern space to hold space
1272@cindex Hold space, appending from pattern space
1273Append a newline to the contents of the hold space,
1274and then append the contents of the pattern space to that of the hold space.
1275
1276@item g
1277@findex g (get) command
1278@cindex Copy hold space into pattern space
1279@cindex Replace pattern space with copy of hold space
1280@cindex Hold space, copy into pattern space
1281Replace the contents of the pattern space with the contents of the hold space.
1282
1283@item G
1284@findex G (appending Get) command
1285@cindex Append hold space to pattern space
1286@cindex Hold space, appending to pattern space
1287Append a newline to the contents of the pattern space,
1288and then append the contents of the hold space to that of the pattern space.
1289
1290@item x
1291@findex x (eXchange) command
1292@cindex Exchange hold space with pattern space
1293@cindex Hold space, exchange with pattern space
1294Exchange the contents of the hold and pattern spaces.
1295
1296@end table
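
A few of these commands are illustrated in the sketch below, which
uses invented inputs and the @code{;} command separator. The last
command reverses the order of its input lines by accumulating them in
the hold space:

@example
# transliterate a-j to upper case
echo 'hello' | sed 'y/abcdefghij/ABCDEFGHIJ/'
# print each line number before the line itself
printf 'a\nb\n' | sed '='
# append the hold space to every line but the first, save the result,
# and print it once the last line has been read
printf '1\n2\n3\n' | sed -n '1!G;h;$p'
@end example

@noindent
The first command should print @samp{HEllo}; the second prints
@samp{1}, @samp{a}, @samp{2}, @samp{b} on separate lines; the third
prints the lines @samp{3}, @samp{2}, @samp{1}.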
1297
1298
1299@node Programming Commands
1300@section Commands for @command{sed} gurus
1301
1302In most cases, use of these commands indicates that you are
1303probably better off programming in something like @command{awk}
1304or Perl. But occasionally one is committed to sticking
1305with @command{sed}, and these commands can enable one to write
1306quite convoluted scripts.
1307
1308@cindex Flow of control in scripts
1309@table @code
1310@item : @var{label}
1311[No addresses allowed.]
1312
1313@findex : (label) command
1314@cindex Labels, in scripts
1315Specify the location of @var{label} for branch commands.
1316In all other respects, a no-op.
1317
1318@item b @var{label}
1319@findex b (branch) command
1320@cindex Branch to a label, unconditionally
1321@cindex Goto, in scripts
1322Unconditionally branch to @var{label}.
1323The @var{label} may be omitted, in which case the next cycle is started.
1324
1325@item t @var{label}
1326@findex t (test and branch if successful) command
1327@cindex Branch to a label, if @code{s///} succeeded
1328@cindex Conditional branch
1329Branch to @var{label} only if there has been a successful @code{s}ubstitution
1330since the last input line was read or conditional branch was taken.
1331The @var{label} may be omitted, in which case the next cycle is started.
1332
1333@end table
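
As a minimal sketch of a loop built from a label and @code{t} (the
label name @code{again} and the input are arbitrary), the following
command keeps replacing a pair of hyphens with a single one until the
substitution fails. Each command is given in its own @option{-e}
option so that the label ends at the end of its script fragment:

@example
echo 'a----b--c' | sed -e ':again' -e 's/--/-/' -e 't again'
@end example

@noindent
This should print @samp{a-b-c}. A single @samp{s/--*/-/g} would do the
same job here; the loop is shown only to illustrate the branching
commands.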
1334
1335@node Extended Commands
1336@section Commands Specific to @value{SSED}
1337
These commands are specific to @value{SSED}, so you
must use them with care and only when you are sure that
sacrificing portability is acceptable. They allow you to check
for @value{SSED} extensions or to do tasks that are required
quite often, yet are unsupported by standard @command{sed}s.
1343
1344@table @code
1345@item e [@var{command}]
1346@findex e (evaluate) command
1347@cindex Evaluate Bourne-shell commands
1348@cindex Subprocesses
1349@cindex @value{SSEDEXT}, evaluating Bourne-shell commands
1350@cindex @value{SSEDEXT}, subprocesses
1351This command allows one to pipe input from a shell command
1352into pattern space. Without parameters, the @code{e} command
1353executes the command that is found in pattern space and
1354replaces the pattern space with the output; a trailing newline
1355is suppressed.
1356
1357If a parameter is specified, instead, the @code{e} command
1358interprets it as a command and sends its output to the output stream
(like @code{r} does). The command can run across multiple
lines, all but the last ending with a backslash.
1361
1362In both cases, the results are undefined if the command to be
1363executed contains a @sc{nul} character.
1364
1365@item L @var{n}
1366@findex L (fLow paragraphs) command
1367@cindex Reformat pattern space
1368@cindex Reformatting paragraphs
1369@cindex @value{SSEDEXT}, reformatting paragraphs
1370@cindex @value{SSEDEXT}, @code{L} command
1371This @value{SSED} extension fills and joins lines in pattern space
1372to produce output lines of (at most) @var{n} characters, like
1373@code{fmt} does; if @var{n} is omitted, the default as specified
on the command line is used. This command is considered a failed
experiment and, unless there is enough demand (which seems unlikely),
will be removed in future versions.
1377
1378@ignore
1379Blank lines, spaces between words, and indentation are
1380preserved in the output; successive input lines with different
1381indentation are not joined; tabs are expanded to 8 columns.
1382
1383If the pattern space contains multiple lines, they are joined, but
1384since the pattern space usually contains a single line, the behavior
1385of a simple @code{L;d} script is the same as @samp{fmt -s} (i.e.,
1386it does not join short lines to form longer ones).
1387
1388@var{n} specifies the desired line-wrap length; if omitted,
1389the default as specified on the command line is used.
1390@end ignore
1391
1392@item Q [@var{exit-code}]
1393This command only accepts a single address.
1394
1395@findex Q (silent Quit) command
1396@cindex @value{SSEDEXT}, quitting silently
1397@cindex @value{SSEDEXT}, returning an exit code
1398@cindex Quitting
1399This command is the same as @code{q}, but will not print the
1400contents of pattern space. Like @code{q}, it provides the
1401ability to return an exit code to the caller.
1402
This command can be useful because the only alternative ways
to accomplish this apparently trivial function are to use
the @option{-n} option (which can unnecessarily complicate
your script) or to resort to the following snippet, which
wastes time by reading the whole file without any visible effect:
1408
@example
:eat
$d       @i{Quit silently on the last line}
N        @i{Read another line, silently}
g        @i{Overwrite pattern space each time to save memory}
b eat
@end example
1416
1417@item R @var{filename}
1418@findex R (read line) command
1419@cindex Read text from a file
1420@cindex @value{SSEDEXT}, reading a file a line at a time
1421@cindex @value{SSEDEXT}, @code{R} command
1422@cindex @value{SSEDEXT}, @file{/dev/stdin} file
1423Queue a line of @var{filename} to be read and
1424inserted into the output stream at the end of the current cycle,
1425or when the next input line is read.
1426Note that if @var{filename} cannot be read, or if its end is
1427reached, no line is appended, without any error indication.
1428
1429As with the @code{r} command, the special value @file{/dev/stdin}
1430is supported for the file name, which reads a line from the
1431standard input.
1432
1433@item T @var{label}
1434@findex T (test and branch if failed) command
1435@cindex @value{SSEDEXT}, branch if @code{s///} failed
1436@cindex Branch to a label, if @code{s///} failed
1437@cindex Conditional branch
1438Branch to @var{label} only if there have been no successful
1439@code{s}ubstitutions since the last input line was read or
1440conditional branch was taken. The @var{label} may be omitted,
1441in which case the next cycle is started.
1442
1443@item v @var{version}
1444@findex v (version) command
1445@cindex @value{SSEDEXT}, checking for their presence
1446@cindex Requiring @value{SSED}
1447This command does nothing, but makes @command{sed} fail if
1448@value{SSED} extensions are not supported, simply because other
1449versions of @command{sed} do not implement it. In addition, you
1450can specify the version of @command{sed} that your script
1451requires, such as @code{4.0.5}. The default is @code{4.0}
1452because that is the first version that implemented this command.
1453
1454This command enables all @value{SSEDEXT} even if
1455@env{POSIXLY_CORRECT} is set in the environment.
1456
1457@item W @var{filename}
1458@findex W (write first line) command
1459@cindex Write first line to a file
1460@cindex @value{SSEDEXT}, writing first line to a file
1461Write to the given filename the portion of the pattern space up to
1462the first newline. Everything said under the @code{w} command about
1463file handling holds here too.
1464@end table
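
The difference between @code{q} and @code{Q} can be seen in the
following sketch, which assumes @value{SSED} and an invented three-line
input. The last command also shows the exit code being passed back to
the shell:

@example
printf 'one\ntwo\nthree\n' | sed '2q'
printf 'one\ntwo\nthree\n' | sed '2Q'
printf 'one\n' | sed 'Q5' || echo "exit code $?"
@end example

@noindent
The first command should print @samp{one} and @samp{two}, the second
only @samp{one}, and the third @samp{exit code 5}.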
1465
1466@node Escapes
1467@section @acronym{GNU} Extensions for Escapes in Regular Expressions
1468
1469@cindex @acronym{GNU} extensions, special escapes
1470Until this chapter, we have only encountered escapes of the form
1471@samp{\^}, which tell @command{sed} not to interpret the circumflex
1472as a special character, but rather to take it literally. For
1473example, @samp{\*} matches a single asterisk rather than zero
1474or more backslashes.
1475
1476@cindex @code{POSIXLY_CORRECT} behavior, escapes
1477This chapter introduces another kind of escape@footnote{All
1478the escapes introduced here are @acronym{GNU}
1479extensions, with the exception of @code{\n}. In basic regular
1480expression mode, setting @code{POSIXLY_CORRECT} disables them inside
1481bracket expressions.}---that
1482is, escapes that are applied to a character or sequence of characters
1483that ordinarily are taken literally, and that @command{sed} replaces
1484with a special character. This provides a way
1485of encoding non-printable characters in patterns in a visible manner.
There is no restriction on the appearance of non-printing characters
in a @command{sed} script, but when a script is being prepared in the
shell or by text editing, it is usually easier to use one of
the following escape sequences than the binary character it
represents.

The list of these escapes is:
1493
1494@table @code
1495@item \a
1496Produces or matches a @sc{bel} character, that is an ``alert'' (@sc{ascii} 7).
1497
1498@item \f
1499Produces or matches a form feed (@sc{ascii} 12).
1500
1501@item \n
1502Produces or matches a newline (@sc{ascii} 10).
1503
1504@item \r
1505Produces or matches a carriage return (@sc{ascii} 13).
1506
1507@item \t
1508Produces or matches a horizontal tab (@sc{ascii} 9).
1509
1510@item \v
Produces or matches a so-called ``vertical tab'' (@sc{ascii} 11).
1512
1513@item \c@var{x}
1514Produces or matches @kbd{@sc{Control}-@var{x}}, where @var{x} is
1515any character. The precise effect of @samp{\c@var{x}} is as follows:
1516if @var{x} is a lower case letter, it is converted to upper case.
1517Then bit 6 of the character (hex 40) is inverted. Thus @samp{\cz} becomes
1518hex 1A, but @samp{\c@{} becomes hex 3B, while @samp{\c;} becomes hex 7B.
1519
1520@item \d@var{xxx}
1521Produces or matches a character whose decimal @sc{ascii} value is @var{xxx}.
1522
1523@item \o@var{xxx}
1524@ifset PERL
1525@item \@var{xxx}
1526@end ifset
1527Produces or matches a character whose octal @sc{ascii} value is @var{xxx}.
1528@ifset PERL
1529The syntax without the @code{o} is active in Perl mode, while the one
1530with the @code{o} is active in the normal or extended @sc{posix} regular
1531expression modes.
1532@end ifset
1533
1534@item \x@var{xx}
1535Produces or matches a character whose hexadecimal @sc{ascii} value is @var{xx}.
1536@end table
1537
1538@samp{\b} (backspace) was omitted because of the conflict with
1539the existing ``word boundary'' meaning.
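
As a sketch of these escapes in use (assuming @value{SSED}; as noted
earlier, @code{\t} is not portable to other implementations), the
commands below match a tab and the letter @samp{A} by their escape
sequences:

@example
printf 'a\tb\n' | sed 's/\t/,/'
echo 'A' | sed 's/\x41/a/'
@end example

@noindent
With @value{SSED} these should print @samp{a,b} and @samp{a}.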
1540
1541Other escapes match a particular character class and are valid only in
1542regular expressions:
1543
1544@table @code
1545@item \w
1546Matches any ``word'' character. A ``word'' character is any
1547letter or digit or the underscore character.
1548
1549@item \W
1550Matches any ``non-word'' character.
1551
1552@item \b
1553Matches a word boundary; that is it matches if the character
1554to the left is a ``word'' character and the character to the
1555right is a ``non-word'' character, or vice-versa.
1556
1557@item \B
1558Matches everywhere but on a word boundary; that is it matches
1559if the character to the left and the character to the right
1560are either both ``word'' characters or both ``non-word''
1561characters.
1562
1563@item \`
1564Matches only at the start of pattern space. This is different
1565from @code{^} in multi-line mode.
1566
1567@item \'
1568Matches only at the end of pattern space. This is different
1569from @code{$} in multi-line mode.
1570
1571@ifset PERL
1572@item \G
Matches only at the start of pattern space or, when doing a global
substitution using the @code{s///g} command and option, at
the end-of-match position of the prior match. For example,
@samp{s/\Ga/Z/g} will change an initial run of @code{a}s to
a run of @code{Z}s.
1578@end ifset
1579@end table
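
The word-boundary escape is convenient for replacing whole words only.
In this sketch (invented input, @value{SSED} assumed), the @samp{cat}
inside @samp{concat} is left alone because it is not preceded by a
word boundary:

@example
echo 'cat concat cat' | sed 's/\bcat\b/dog/g'
@end example

@noindent
With @value{SSED} this should print @samp{dog concat dog}.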
1580
1581@node Examples
1582@chapter Some Sample Scripts
1583
1584Here are some @command{sed} scripts to guide you in the art of mastering
1585@command{sed}.
1586
1587@menu
1588Some exotic examples:
1589* Centering lines::
1590* Increment a number::
1591* Rename files to lower case::
1592* Print bash environment::
1593* Reverse chars of lines::
1594
1595Emulating standard utilities:
1596* tac:: Reverse lines of files
1597* cat -n:: Numbering lines
1598* cat -b:: Numbering non-blank lines
1599* wc -c:: Counting chars
1600* wc -w:: Counting words
1601* wc -l:: Counting lines
1602* head:: Printing the first lines
1603* tail:: Printing the last lines
1604* uniq:: Make duplicate lines unique
1605* uniq -d:: Print duplicated lines of input
1606* uniq -u:: Remove all duplicated lines
1607* cat -s:: Squeezing blank lines
1608@end menu
1609
1610@node Centering lines
1611@section Centering Lines
1612
This script centers all lines of a file within an 80-column width.
To change that width, the number in @code{\@{@dots{}\@}} must be
replaced, and the number of added spaces must be changed as well.
1616
1617Note how the buffer commands are used to separate parts in
1618the regular expressions to be matched---this is a common
1619technique.
1620
1621@c start-------------------------------------------
1622@example
1623#!/usr/bin/sed -f
1624
1625@group
1626# Put 80 spaces in the buffer
16271 @{
1628 x
1629 s/^$/ /
1630 s/^.*$/&&&&&&&&/
1631 x
1632@}
1633@end group
1634
1635@group
1636# del leading and trailing spaces
1637y/@kbd{tab}/ /
1638s/^ *//
1639s/ *$//
1640@end group
1641
1642@group
1643# add a newline and 80 spaces to end of line
1644G
1645@end group
1646
1647@group
1648# keep first 81 chars (80 + a newline)
1649s/^\(.\@{81\@}\).*$/\1/
1650@end group
1651
1652@group
1653# \2 matches half of the spaces, which are moved to the beginning
1654s/^\(.*\)\n\(.*\)\2/\2\1/
1655@end group
1656@end example
1657@c end---------------------------------------------
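
The width appears in two places in the script: the number of spaces
stored in the buffer and the repetition count in the regular expression.
The following is only a sketch of how both could be derived from one
shell parameter. The function name @code{center} and the variables
@code{width} and @code{pad} are made up for illustration, and the
tab-to-space step is left out for brevity.

```shell
#!/bin/sh
# Hypothetical helper: center each input line within $1 columns.
center() {
  width=$1
  # Build a string of exactly $width spaces for the hold space.
  pad=$(printf "%${width}s" '')
  sed -e '1{' -e x -e "s/^\$/$pad/" -e x -e '}' \
      -e 's/^ *//' -e 's/ *$//' \
      -e G \
      -e "s/^\\(.\\{$((width + 1))\\}\\).*\$/\\1/" \
      -e 's/^\(.*\)\n\(.*\)\2/\2\1/'
}

# Center a short line within 20 columns.
printf 'hello\n' | center 20
```

When the amount of padding is odd, one space is left over after the text,
because the last substitution can only move half of an even number of spaces.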

@node Increment a number
@section Increment a Number

This script is one of a few that demonstrate how to do arithmetic
in @command{sed}. This is indeed possible,@footnote{@command{sed} guru Greg
Ubben wrote an implementation of the @command{dc} @sc{rpn} calculator!
It is distributed together with sed.} but must be done manually.

To increment a number, add 1 to its last digit, replacing
it with the following digit. There is one exception: when the digit
is a nine, the preceding digits must also be incremented until you
reach a digit that is not a nine.

This solution by Bruno Haible is clever because
it uses a single buffer; if you don't have this limitation, the
algorithm used in @ref{cat -n, Numbering lines}, is faster.
It works by replacing trailing nines with underscores, then
using multiple @code{s} commands to increment the last digit,
and finally replacing the underscores with zeros.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

/[^0-9]/ d

@group
# replace all trailing 9s by _ (any other character except digits
# could be used)
:d
s/9\(_*\)$/_\1/
td
@end group

@group
# incr last digit only. The first line adds a most-significant
# digit of 1 if we have to add a digit.
#
# The @code{tn} commands are not necessary, but make the thing
# faster
@end group

@group
s/^\(_*\)$/1\1/; tn
s/8\(_*\)$/9\1/; tn
s/7\(_*\)$/8\1/; tn
s/6\(_*\)$/7\1/; tn
s/5\(_*\)$/6\1/; tn
s/4\(_*\)$/5\1/; tn
s/3\(_*\)$/4\1/; tn
s/2\(_*\)$/3\1/; tn
s/1\(_*\)$/2\1/; tn
s/0\(_*\)$/1\1/; tn
@end group

@group
:n
y/_/0/
@end group
@end example
@c end---------------------------------------------
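
As a rough check of the algorithm, the same commands can be wrapped in a
shell function and fed a few numbers. This is only a sketch: the function
name @code{increment} is made up for illustration, and the commands are
written one per line so that every label ends at a newline.

```shell
#!/bin/sh
# Hypothetical wrapper: increment each decimal number read from stdin.
increment() {
  sed '
/[^0-9]/d
# turn trailing 9s into underscores
:d
s/9\(_*\)$/_\1/
td
# bump the last remaining digit, or prepend 1 if none is left
s/^\(_*\)$/1\1/
tn
s/8\(_*\)$/9\1/
tn
s/7\(_*\)$/8\1/
tn
s/6\(_*\)$/7\1/
tn
s/5\(_*\)$/6\1/
tn
s/4\(_*\)$/5\1/
tn
s/3\(_*\)$/4\1/
tn
s/2\(_*\)$/3\1/
tn
s/1\(_*\)$/2\1/
tn
s/0\(_*\)$/1\1/
:n
# underscores stand for the 9s that rolled over to 0
y/_/0/
'
}

printf '41\n199\n999\n' | increment   # prints 42, 200 and 1000
```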

@node Rename files to lower case
@section Rename Files to Lower Case

This is a pretty strange use of @command{sed}. We transform text, and
transform it to be shell commands, then just feed them to the shell.
Don't worry, even worse hacks are done when using @command{sed}; I have
seen a script converting the output of @command{date} into a @command{bc}
program!

The main body of this is the @command{sed} script, which remaps the name
from lower to upper (or vice-versa) and even checks
whether the remapped name is the same as the original name.
Note how the script is parameterized using shell
variables and proper quoting.

@c start-------------------------------------------
@example
@group
#! /bin/sh
# rename files to lower/upper case...
#
# usage:
#    move-to-lower *
#    move-to-upper *
# or
#    move-to-lower -R .
#    move-to-upper -R .
#
@end group

@group
help()
@{
        cat << eof
Usage: $0 [-n] [-R] [-h] files...
@end group

@group
-n      do nothing, only see what would be done
-R      recursive (use find)
-h      this message
files   files to remap to lower case
@end group

@group
Examples:
       $0 -n *        (see if everything is ok, then...)
       $0 *
@end group

       $0 -R .

@group
eof
@}
@end group

@group
apply_cmd='sh'
finder='echo "$@@" | tr " " "\n"'
files_only=
@end group

@group
while :
do
    case "$1" in
        -n) apply_cmd='cat' ;;
        -R) finder='find "$@@" -type f';;
        -h) help ; exit 1 ;;
        *) break ;;
    esac
    shift
done
@end group

@group
if [ -z "$1" ]; then
        echo "Usage: $0 [-h] [-n] [-R] files..."
        exit 1
fi
@end group

@group
LOWER='abcdefghijklmnopqrstuvwxyz'
UPPER='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
@end group

@group
case `basename $0` in
        *upper*) TO=$UPPER; FROM=$LOWER ;;
        *)       FROM=$UPPER; TO=$LOWER ;;
esac
@end group

eval $finder | sed -n '

@group
# remove all trailing slashes
s/\/*$//
@end group

@group
# add ./ if there is no path, only a filename
/\//! s/^/.\//
@end group

@group
# save path+filename
h
@end group

@group
# remove path
s/.*\///
@end group

@group
# do conversion only on filename
y/'$FROM'/'$TO'/
@end group

@group
# now line contains original path+file, while
# hold space contains the new filename
x
@end group

@group
# add converted file name to line, which now contains
# path/file-name\nconverted-file-name
G
@end group

@group
# check if converted file name is equal to original file name,
# if it is, do not print anything
/^.*\/\(.*\)\n\1/b
@end group

@group
# now, transform path/fromfile\ntofile into
# mv path/fromfile path/tofile and print it
s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p
@end group

' | $apply_cmd
@end example
@c end---------------------------------------------
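
To see what the @command{sed} part produces without touching any file,
its commands can be run on a few sample names. This is only a sketch: the
function name @code{lowercase_plan} and the sample file names are
hypothetical, and the upper-to-lower mapping is written out literally
instead of coming from shell variables.

```shell
#!/bin/sh
# Hypothetical dry run: print the mv commands that would lower-case names.
lowercase_plan() {
  sed -n '
# remove trailing slashes, add ./ when there is no path
s/\/*$//
/\//!s/^/.\//
# keep path+name in hold space, convert only the file name
h
s/.*\///
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
# pattern space becomes path/name\nconverted-name
x
G
# nothing to do when the name is already lower case
/^.*\/\(.*\)\n\1/b
s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p
'
}

printf '%s\n' ./Docs/README.TXT Makefile notes.txt | lowercase_plan
```

Only the first two names yield a command; @file{notes.txt} is already in
lower case, so the branch skips it.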

@node Print bash environment
@section Print @command{bash} Environment

This script strips the definition of the shell functions
from the output of the @command{set} Bourne-shell command.

@c start-------------------------------------------
@example
#!/bin/sh

@group
set | sed -n '
:x
@end group

@group
@ifinfo
# if no occurrence of "=()" print and load next line
@end ifinfo
@ifnotinfo
# if no occurrence of @samp{=()} print and load next line
@end ifnotinfo
/=()/! @{ p; b; @}
/ () $/! @{ p; b; @}
@end group

@group
# possible start of functions section
# save the line in case this is a var like FOO="() "
h
@end group

@group
# if the next line has a brace, we quit because
# nothing comes after functions
n
/^@{/ q
@end group

@group
# print the old line
x; p
@end group

@group
# work on the new line now
x; bx
'
@end group
@end example
@c end---------------------------------------------

@node Reverse chars of lines
@section Reverse Characters of Lines

This script can be used to reverse the position of characters
in lines. The technique moves two characters at a time, hence
it is faster than more intuitive implementations.

Note the @code{tx} command before the definition of the label.
This is often needed to reset the flag that is tested by
the @code{t} command.

Imaginative readers will find uses for this script. An example
is reversing the output of @command{banner}.@footnote{This requires
another script to pad the output of banner; for example

@example
#! /bin/sh

banner -w $1 $2 $3 $4 |
  sed -e :a -e '/^.\@{0,'$1'\@}$/ @{ s/$/ /; ba; @}' |
  ~/sedscripts/reverseline.sed
@end example
}

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

/../! b

@group
# Reverse a line. Begin by embedding the line between two newlines
s/^.*$/\
&\
/
@end group

@group
# Move the first character to the end. The regexp matches until
# there are zero or one characters between the markers
tx
:x
s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
tx
@end group

@group
# Remove the newline markers
s/\n//g
@end group
@end example
@c end---------------------------------------------
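
A quick way to watch the two markers close in on each other is to run the
commands on short words of odd and even length. This is only a sketch; the
function name @code{reverse_line} is made up for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper around the commands of the script above.
reverse_line() {
  sed '
# lines with fewer than two characters are already reversed
/../!b
# embed the line between two newline markers
s/^.*$/\
&\
/
# reset the t flag, then swap the characters next to the markers
tx
:x
s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
tx
# remove the markers
s/\n//g
'
}

printf 'hello\nabcd\n' | reverse_line   # prints olleh and dcba
```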

@node tac
@section Reverse Lines of Files

This one begins a series of totally useless (yet interesting)
scripts emulating various Unix commands. This, in particular,
is a @command{tac} workalike.

Note that on implementations other than @acronym{GNU} @command{sed}
@ifset PERL
and @value{SSED}
@end ifset
this script might easily overflow internal buffers.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# reverse all lines of input, i.e. first line becomes last, ...

@group
# from the second line, the buffer (which contains all previous lines)
# is *appended* to current line, so, the order will be reversed
1! G
@end group

@group
# on the last line we're done -- print everything
$ p
@end group

@group
# store everything on the buffer again
h
@end group
@end example
@c end---------------------------------------------
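
The three commands are short enough to try directly from the shell.
This is only a sketch; the function name @code{reverse_lines} is made up
for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: print the input lines in reverse order.
reverse_lines() {
  sed -n '
# from the second line on, append everything seen so far
1!G
# on the last line the pattern space holds the reversed input
$p
# remember it for the next cycle
h
'
}

printf 'one\ntwo\nthree\n' | reverse_lines   # prints three, two, one
```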

@node cat -n
@section Numbering Lines

This script replaces @samp{cat -n}; in fact it formats its output
exactly like @acronym{GNU} @command{cat} does.

Of course this is completely useless, for two reasons: first,
because somebody else did it in C; second, because the following
Bourne-shell script could be used for the same purpose and would
be much faster:

@c start-------------------------------------------
@example
@group
#! /bin/sh
sed -e "=" "$@@" | sed -e '
  s/^/      /
  N
  s/^ *\(......\)\n/\1  /
'
@end group
@end example
@c end---------------------------------------------

It uses @command{sed} to print the line number, then groups lines two
by two using @code{N}. Of course, this script does not teach as much as
the one presented below.

The algorithm used for incrementing uses both buffers, so the line
is printed as soon as possible and then discarded. The number
is split so that changing digits go in one buffer and unchanged ones go
in the other; the changed digits are modified in a single step
(using a @code{y} command). The line number for the next line
is then composed and stored in the hold space, to be used in the
next iteration.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
# Prime the pump on the first line
x
/^$/ s/^.*$/1/
@end group

@group
# Add the correct line number before the pattern
G
h
@end group

@group
# Format it and print it
s/^/      /
s/^ *\(......\)\n/\1  /p
@end group

@group
# Get the line number from hold space; add a zero
# if we're going to add a digit on the next line
g
s/\n.*$//
/^9*$/ s/^/0/
@end group

@group
# separate changing/unchanged digits with an x
s/.9*$/x&/
@end group

@group
# keep changing digits in hold space
h
s/^.*x//
y/0123456789/1234567890/
x
@end group

@group
# keep unchanged digits in pattern space
s/x.*$//
@end group

@group
# compose the new number, remove the newline implicitly added by G
G
s/\n//
h
@end group
@end example
@c end---------------------------------------------

@node cat -b
@section Numbering Non-blank Lines

Emulating @samp{cat -b} is almost the same as @samp{cat -n}; we only
have to select which lines are to be numbered and which are not.

The part that is common to this script and the previous one is
not commented to show how important it is to comment @command{sed}
scripts properly...

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
/^$/ @{
  p
  b
@}
@end group

@group
# Same as cat -n from now
x
/^$/ s/^.*$/1/
G
h
s/^/      /
s/^ *\(......\)\n/\1  /p
x
s/\n.*$//
/^9*$/ s/^/0/
s/.9*$/x&/
h
s/^.*x//
y/0123456789/1234567890/
x
s/x.*$//
G
s/\n//
h
@end group
@end example
@c end---------------------------------------------

@node wc -c
@section Counting Characters

This script shows another way to do arithmetic with @command{sed}.
In this case we have to add possibly large numbers, so implementing
this by successive increments would not be feasible (and possibly
even more complicated to contrive than this script).

The approach is to map numbers to letters, a kind of abacus
implemented with @command{sed}. @samp{a}s are units, @samp{b}s are
tens and so on: we simply add the number of characters
on the current line as units, and then propagate the carry
to tens, hundreds, and so on.

As usual, running totals are kept in hold space.

On the last line, we convert the abacus form back to decimal.
For the sake of variety, this is done with a loop rather than
with some 80 @code{s} commands@footnote{Some implementations
have a limit of 199 commands per script.}: first we
convert units, removing @samp{a}s from the number; then we
rotate letters so that tens become @samp{a}s, and so on
until no more letters remain.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
# Add n+1 a's to hold space (+1 is for the newline)
s/./a/g
H
x
s/\n/a/
@end group

@group
# Do the carry. The t's and b's are not necessary,
# but they do speed up the thing
t a
: a; s/aaaaaaaaaa/b/g; t b; b done
: b; s/bbbbbbbbbb/c/g; t c; b done
: c; s/cccccccccc/d/g; t d; b done
: d; s/dddddddddd/e/g; t e; b done
: e; s/eeeeeeeeee/f/g; t f; b done
: f; s/ffffffffff/g/g; t g; b done
: g; s/gggggggggg/h/g; t h; b done
: h; s/hhhhhhhhhh//g
@end group

@group
: done
$! @{
  h
  b
@}
@end group

# On the last line, convert back to decimal

@group
: loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
@end group

@group
: next
y/bcdefgh/abcdefg/
/[a-h]/ b loop
p
@end group
@end example
@c end---------------------------------------------
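
The conversion loop that runs on the last line can be tried in isolation
by feeding it strings that are already in abacus form. This is only a
sketch: the function name @code{abacus_to_decimal} is made up for
illustration, and the input strings are hand-written stand-ins for what
the hold space would contain.

```shell
#!/bin/sh
# Hypothetical helper: convert abacus strings (a = units, b = tens,
# c = hundreds, ...) to decimal, one string per input line.
abacus_to_decimal() {
  sed -n '
:loop
# no units at this position: emit a 0 after the remaining letters
/a/!s/[b-h]*/&0/
# otherwise replace the run of a-s with the matching digit
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
# shift every letter down one position and repeat while any remain
y/bcdefgh/abcdefg/
/[a-h]/bloop
p
'
}

printf 'cbbaaa\ncaa\nb\n' | abacus_to_decimal   # prints 123, 102 and 10
```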

@node wc -w
@section Counting Words

This script is almost the same as the previous one, once each
of the words on the line is converted to a single @samp{a}
(in the previous script each letter was changed to an @samp{a}).

It is interesting that real @command{wc} programs have optimized
loops for @samp{wc -c}, so they are much slower at counting
words than at counting characters. This script's bottleneck,
instead, is arithmetic, and hence the word-counting one
is faster (it has to manage smaller numbers).

Again, the common parts are not commented to show the importance
of commenting @command{sed} scripts.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
# Convert words to a's
s/[ @kbd{tab}][ @kbd{tab}]*/ /g
s/^/ /
s/ [^ ][^ ]*/a /g
s/ //g
@end group

@group
# Append them to hold space
H
x
s/\n//
@end group

@group
# From here on it is the same as in wc -c.
/aaaaaaaaaa/! bx;   s/aaaaaaaaaa/b/g
/bbbbbbbbbb/! bx;   s/bbbbbbbbbb/c/g
/cccccccccc/! bx;   s/cccccccccc/d/g
/dddddddddd/! bx;   s/dddddddddd/e/g
/eeeeeeeeee/! bx;   s/eeeeeeeeee/f/g
/ffffffffff/! bx;   s/ffffffffff/g/g
/gggggggggg/! bx;   s/gggggggggg/h/g
s/hhhhhhhhhh//g
:x
$! @{ h; b; @}
:y
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
y/bcdefgh/abcdefg/
/[a-h]/ by
p
@end group
@end example
@c end---------------------------------------------

@node wc -l
@section Counting Lines

No strange things are done now, because @command{sed} gives us
@samp{wc -l} functionality for free!!! Look:

@c start-------------------------------------------
@example
@group
#!/usr/bin/sed -nf
$=
@end group
@end example
@c end---------------------------------------------
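
From the shell the same thing is a one-liner. This is only a sketch; the
function name @code{count_lines} is made up for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: print the number of the last input line.
count_lines() {
  sed -n '$='
}

printf 'a\nb\nc\n' | count_lines   # prints 3
```

One difference from @samp{wc -l} is worth keeping in mind: on empty input
no cycle is ever run, so nothing is printed at all, where @command{wc}
would print 0.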

@node head
@section Printing the First Lines

This script is probably the simplest useful @command{sed} script.
It displays the first 10 lines of input; the number of displayed
lines is right before the @code{q} command.

@c start-------------------------------------------
@example
@group
#!/usr/bin/sed -f
10q
@end group
@end example
@c end---------------------------------------------
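
Since the count is just the address in front of @code{q}, it is easy to
pass it in from the shell. This is only a sketch: the function name
@code{first_lines} is made up for illustration, and the count is assumed
to be at least 1.

```shell
#!/bin/sh
# Hypothetical wrapper: print the first $1 lines (default 10).
first_lines() {
  sed "${1:-10}q"
}

printf '1\n2\n3\n4\n5\n' | first_lines 2   # prints 1 and 2
```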

@node tail
@section Printing the Last Lines

Printing the last @var{n} lines rather than the first is more complex
but indeed possible. @var{n} is encoded in the second line, before
the bang character.

This script is similar to the @command{tac} script in that it keeps the
final output in the hold space and prints it at the end:

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
1! @{; H; g; @}
1,10 !s/[^\n]*\n//
$p
h
@end group
@end example
@c end---------------------------------------------

Mainly, the script keeps a window of 10 lines and slides it
by adding a line and deleting the oldest (the substitution command
on the second line works like a @code{D} command but does not
restart the loop).

The ``sliding window'' technique is a very powerful way to write
efficient and complex @command{sed} scripts, because commands like
@code{P} would require a lot of work if implemented manually.

To introduce the technique, which is fully demonstrated in the
rest of this chapter and is based on the @code{N}, @code{P}
and @code{D} commands, here is an implementation of @command{tail}
using a simple ``sliding window.''

This looks complicated but in fact it works the same way as
the last script: after we have kicked in the appropriate number
of lines, however, we stop using the hold space to keep inter-line
state, and instead use @code{N} and @code{D} to slide pattern
space by one line:

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

@group
1h
2,10 @{; H; g; @}
$q
1,9d
N
D
@end group
@end example
@c end---------------------------------------------

Note how the first, second and fourth lines are inactive after
the first ten lines of input. After that, all the script does
is exit on the last line of input, append the next input
line to pattern space, and remove the first line.
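
The window is easier to follow when it is only three lines wide. This is
only a sketch: the function name @code{last_three} is made up for
illustration, and the two numbers 3 and 2 play the role of 10 and 9 in
the script above.

```shell
#!/bin/sh
# Hypothetical wrapper: print the last three lines of input.
last_three() {
  sed '
# fill the window through the hold space for the first three lines
1h
2,3{
H
g
}
# on the last line, print the window and stop
$q
# nothing to print while the window is still filling up
1,2d
# slide: append the next line, drop the oldest one
N
D
'
}

printf '1\n2\n3\n4\n5\n' | last_three   # prints 3, 4 and 5
```

If the input has fewer lines than the window, @code{$q} fires while the
window is still being filled, so the whole input is printed.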

@node uniq
@section Make Duplicate Lines Unique

This is an example of the art of using the @code{N}, @code{P}
and @code{D} commands, probably the most difficult to master.

@c start-------------------------------------------
@example
@group
#!/usr/bin/sed -f
h
@end group

@group
:b
# On the last line, print and exit
$b
N
/^\(.*\)\n\1$/ @{
    # The two lines are identical. Undo the effect of
    # the N command.
    g
    bb
@}
@end group

@group
# If the @code{N} command had added the last line, print and exit
$b
@end group

@group
# The lines are different; print the first and go
# back working on the second.
P
D
@end group
@end example
@c end---------------------------------------------

As you can see, we maintain a 2-line window using @code{P} and @code{D}.
This technique is often used in advanced @command{sed} scripts.
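
To check the 2-line window on a small input, the commands can be run from
a shell function. This is only a sketch; the function name
@code{uniq_lines} is made up for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: collapse runs of identical adjacent lines.
uniq_lines() {
  sed '
h
:b
# on the last line, print and exit
$b
N
/^\(.*\)\n\1$/{
# identical pair: go back to the single saved line
g
bb
}
# the appended line was the last one: print both and exit
$b
# different lines: print the first, keep working on the second
P
D
'
}

printf 'a\na\nb\nb\nb\nc\n' | uniq_lines   # prints a, b and c
```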

@node uniq -d
@section Print Duplicated Lines of Input

This script prints only duplicated lines, like @samp{uniq -d}.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
$b
N
/^\(.*\)\n\1$/ @{
    # Print the first of the duplicated lines
    s/.*\n//
    p
@end group

@group
    # Loop until we get a different line
    :b
    $b
    N
    /^\(.*\)\n\1$/ @{
        s/.*\n//
        bb
    @}
@}
@end group

@group
# The last line cannot be followed by duplicates
$b
@end group

@group
# Found a different one. Leave it alone in the pattern space
# and go back to the top, hunting its duplicates
D
@end group
@end example
@c end---------------------------------------------
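
The nested loop can be exercised on an input that mixes single lines with
runs of two and three. This is only a sketch; the function name
@code{duplicated_lines} is made up for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: print one copy of each run of repeated lines.
duplicated_lines() {
  sed -n '
$b
N
/^\(.*\)\n\1$/{
# print one copy of the duplicated line
s/.*\n//
p
# skip the remaining copies
:b
$b
N
/^\(.*\)\n\1$/{
s/.*\n//
bb
}
}
# the last line cannot be followed by duplicates
$b
# keep only the newest line and start over with it
D
'
}

printf 'a\nb\nb\nc\nc\nc\nd\n' | duplicated_lines   # prints b and c
```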

@node uniq -u
@section Remove All Duplicated Lines

This script prints only unique lines, like @samp{uniq -u}.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

@group
# Search for a duplicate line; until then, print what you find.
$b
N
/^\(.*\)\n\1$/ ! @{
    P
    D
@}
@end group

@group
:c
# Got two equal lines in pattern space. At the
# end of the file we simply exit
$d
@end group

@group
# Else, we keep reading lines with @code{N} until we
# find a different one
s/.*\n//
N
/^\(.*\)\n\1$/ @{
    bc
@}
@end group

@group
# Remove the last instance of the duplicate line
# and go back to the top
D
@end group
@end example
@c end---------------------------------------------
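
Running the commands on the kind of input used for @samp{uniq -d} shows
the complementary result. This is only a sketch; the function name
@code{unique_lines} is made up for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: print only the lines that are not repeated.
unique_lines() {
  sed '
# until a duplicate shows up, print the first line of the window
$b
N
/^\(.*\)\n\1$/!{
P
D
}
:c
# two equal lines at the end of input: print nothing
$d
# otherwise keep reading until a different line arrives
s/.*\n//
N
/^\(.*\)\n\1$/{
bc
}
# drop the last copy of the duplicate and start over
D
'
}

printf 'a\nb\nb\nc\nc\nc\nd\n' | unique_lines   # prints a and d
```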

@node cat -s
@section Squeezing Blank Lines

As a final example, here are three scripts, of increasing complexity
and speed, that implement the same function as @samp{cat -s}, that is
squeezing blank lines.

The first leaves a blank line at the beginning and end if there are
some already.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

@group
# on empty lines, join with next
# Note there is a star in the regexp
:x
/^\n*$/ @{
N
bx
@}
@end group

@group
# now, squeeze all leading '\n' into one. When there are none,
# \1 is empty and the line is left alone; a plain s/\n*/\n/
# would also match the empty string and add a blank line
# before every non-empty line
s/^\(\n\)*/\1/
@end group
@end example
@c end---------------------------------------------

This one is a bit more complex and removes all empty lines
at the beginning. It does leave a single blank line at the end
if one was there.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

@group
# delete all leading empty lines
1,/^./@{
/./!d
@}
@end group

@group
# on an empty line we remove it and all the following
# empty lines, but one
:x
/./!@{
N
s/^\n$//
tx
@}
@end group
@end example
@c end---------------------------------------------

This removes leading and trailing blank lines. It is also the
fastest. Note that loops are completely done with @code{n} and
@code{b}, without relying on @command{sed} to restart
the script automatically at the end of a line.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
# delete all (leading) blanks
/./!d
@end group

@group
# get here: so there is a non empty
:x
# print it
p
# get next
n
# got chars? print it again, etc...
/./bx
@end group

@group
# no, don't have chars: got an empty line
:z
# get next, if last line we finish here so no trailing
# empty lines are written
n
# also empty? then ignore it, and get next... this will
# remove ALL empty lines
/./!bz
@end group

@group
# all empty lines were deleted/ignored, but we have a non empty. As
# what we want to do is to squeeze, insert a blank line artificially
i\
@end group

bx
@end example
@c end---------------------------------------------

@node Limitations
@chapter @value{SSED}'s Limitations and Non-limitations

@cindex @acronym{GNU} extensions, unlimited line length
@cindex Portability, line length limitations
For those who want to write portable @command{sed} scripts,
be aware that some implementations have been known to
limit line lengths (for the pattern and hold spaces)
to be no more than 4000 bytes.
The @sc{posix} standard specifies that conforming @command{sed}
implementations shall support at least 8192 byte line lengths.
@value{SSED} has no built-in limit on line length;
as long as it can @code{malloc()} more (virtual) memory,
you can feed or construct lines as long as you like.

However, recursion is used to handle subpatterns and indefinite
repetition. This means that the available stack space may limit
the size of the buffer that can be processed by certain patterns.

@ifset PERL
There are some size limitations in the regular expression
matcher but it is hoped that they will never in practice
be relevant. The maximum length of a compiled pattern
is 65539 (sic) bytes. All values in repeating quantifiers
must be less than 65536. The maximum nesting depth of
all parenthesized subpatterns, including capturing and
non-capturing subpatterns@footnote{The
distinction is meaningful when referring to Perl-style
regular expressions.}, assertions, and other types of
subpattern, is 200.

Also, @value{SSED} recognizes the @sc{posix} syntax
@code{[.@var{ch}.]} and @code{[=@var{ch}=]}
where @var{ch} is a ``collating element'', but these
are not supported, and an error is given if they are
encountered.

Here are a few distinctions between the real Perl-style
regular expressions and those that @option{-R} recognizes.

@enumerate
@item
Lookahead assertions do not allow repeat quantifiers after them.
Perl permits them, but they do not mean what you
might think. For example, @samp{(?!a)@{3@}} does not assert that the
next three characters are not @samp{a}. It just asserts three times that the
next character is not @samp{a}, a waste of time and nothing else.

@item
Capturing subpatterns that occur inside negative lookahead
assertions are counted, but their entries are counted
as empty in the second half of an @code{s} command.
Perl sets its numerical variables from any such patterns
that are matched before the assertion fails to match
something (thereby succeeding), but only if the negative
lookahead assertion contains just one branch.

@item
The following Perl escape sequences are not supported:
@samp{\l}, @samp{\u}, @samp{\L}, @samp{\U}, @samp{\E},
@samp{\Q}. In fact these are implemented by Perl's general
string-handling and are not part of its pattern matching engine.

@item
The Perl @samp{\G} assertion is not supported as it is not
relevant to single pattern matches.

@item
Fairly obviously, @value{SSED} does not support the @samp{(?@{code@})}
and @samp{(?p@{code@})} constructions. However, there is some experimental
support for recursive patterns using the non-Perl item @samp{(?R)}.

@item
There are at the time of writing some oddities in Perl
5.005_02 concerned with the settings of captured strings
when part of a pattern is repeated. For example, matching
@samp{aba} against the pattern @samp{/^(a(b)?)+$/} sets
@samp{$2}@footnote{@samp{$2} would be @samp{\2} in @value{SSED}.}
to the value @samp{b}, but matching @samp{aabbaa}
against @samp{/^(aa(bb)?)+$/} leaves @samp{$2}
unset. However, if the pattern is changed to
@samp{/^(aa(b(b))?)+$/} then @samp{$2} (and @samp{$3}) are set.
In Perl 5.004 @samp{$2} is set in both cases, and that is also
true of @value{SSED}.

@item
Another as yet unresolved discrepancy is that in Perl
5.005_02 the pattern @samp{/^(a)?(?(1)a|b)+$/} matches
the string @samp{a}, whereas in @value{SSED} it does not.
However, in both Perl and @value{SSED} @samp{/^(a)?a/} matched
against @samp{a} leaves @samp{$1} unset.
@end enumerate
@end ifset

@node Other Resources
@chapter Other Resources for Learning About @command{sed}

@cindex Additional reading about @command{sed}
In addition to several books that have been written about @command{sed}
(either specifically or as chapters in books which discuss
shell programming), one can find out more about @command{sed}
(including suggestions of a few books) from the FAQ
for the @code{sed-users} mailing list, available from any of:
@display
 @uref{http://www.student.northpark.edu/pemente/sed/sedfaq.html}
 @uref{http://sed.sf.net/grabbag/tutorials/sedfaq.html}
@end display

Also of interest are
@uref{http://www.student.northpark.edu/pemente/sed/index.htm}
and @uref{http://sed.sf.net/grabbag},
which include @command{sed} tutorials and other @command{sed}-related goodies.

The @code{sed-users} mailing list itself is maintained by Sven Guckes.
To subscribe, visit @uref{http://groups.yahoo.com} and search
for the @code{sed-users} mailing list.
2746
2747@node Reporting Bugs
2748@chapter Reporting Bugs
2749
2750@cindex Bugs, reporting
2751Email bug reports to @email{bonzini@@gnu.org}.
2752Be sure to include the word ``sed'' somewhere in the @code{Subject:} field.
2753Also, please include the output of @samp{sed --version} in the body
2754of your report if at all possible.
2755
2756Please do not send a bug report like this:
2757
2758@example
2759@i{while building frobme-1.3.4}
2760$ configure
2761@error{} sed: file sedscr line 1: Unknown option to 's'
2762@end example
2763
2764If @value{SSED} doesn't configure your favorite package, take a
2765few extra minutes to identify the specific problem and make a stand-alone
2766test case. Unlike other programs such as C compilers, making such test
2767cases for @command{sed} is quite simple.
2768
2769A stand-alone test case includes all the data necessary to perform the
2770test, and the specific invocation of @command{sed} that causes the problem.
2771The smaller a stand-alone test case is, the better. A test case should
2772not involve something as far removed from @command{sed} as ``try to configure
2773frobme-1.3.4''. Yes, that is in principle enough information to look
2774for the bug, but that is not a very practical prospect.
2775
2776Here are a few commonly reported bugs that are not bugs.
2777
2778@table @asis
2779@item @code{N} command on the last line
2780@cindex Portability, @code{N} command on the last line
2781@cindex Non-bugs, @code{N} command on the last line
2782
2783Most versions of @command{sed} exit without printing anything when
2784the @command{N} command is issued on the last line of a file.
2785@value{SSED} prints pattern space before exiting unless of course
2786the @command{-n} command switch has been specified. This choice is
2787by design.
2788
2789For example, the behavior of
2790@example
2791sed N foo bar
2792@end example
2793@noindent
2794would depend on whether foo has an even or an odd number of
2795lines@footnote{which is the actual ``bug'' that prompted the
2796change in behavior}. Or, when writing a script to read the
2797next few lines following a pattern match, traditional
2798implementations of @code{sed} would force you to write
2799something like
2800@example
2801/foo/@{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N @}
2802@end example
2803@noindent
2804instead of just
2805@example
2806/foo/@{ N;N;N;N;N;N;N;N;N; @}
2807@end example
2808
2809@cindex @code{POSIXLY_CORRECT} behavior, @code{N} command
2810In any case, the simplest workaround is to use @code{$d;N} in
2811scripts that rely on the traditional behavior, or to set
2812the @code{POSIXLY_CORRECT} variable to a non-empty value.
2813
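The difference can be observed with a three-line input. This is a
minimal sketch that assumes @acronym{GNU} @command{sed} is installed as
@command{sed}; the variable names are illustrative only.

```shell
# Assumption: `sed` is GNU sed. Make sure POSIX mode is not forced.
unset POSIXLY_CORRECT
input=$(printf 'a\nb\nc')

# GNU behavior: N finds no next line on line 3, but pattern space
# is still printed before sed exits.
gnu_out=$(printf '%s\n' "$input" | sed N)

# Traditional behavior, emulated by deleting the last line before
# N gets a chance to run on it.
trad_out=$(printf '%s\n' "$input" | sed '$d;N')

printf 'GNU:\n%s\n' "$gnu_out"
printf 'traditional:\n%s\n' "$trad_out"
```

With an odd number of input lines the last line survives only in the
first invocation.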
2814@item Regex syntax clashes (problems with backslashes)
2815@cindex @acronym{GNU} extensions, to basic regular expressions
2816@cindex Non-bugs, regex syntax clashes
2817@command{sed} uses the @sc{posix} basic regular expression syntax. According to
2818the standard, the meaning of some escape sequences is undefined in
2819this syntax; notable in the case of @command{sed} are @code{\|},
2820@code{\+}, @code{\?}, @code{\`}, @code{\'}, @code{\<},
2821@code{\>}, @code{\b}, @code{\B}, @code{\w}, and @code{\W}.
2822
2823As in all @acronym{GNU} programs that use @sc{posix} basic regular
2824expressions, @command{sed} interprets these escape sequences as special
2825characters. So, @code{x\+} matches one or more occurrences of @samp{x}.
2826@code{abc\|def} matches either @samp{abc} or @samp{def}.
2827
2828This syntax may cause problems when running scripts written for other
2829@command{sed}s. Some @command{sed} programs have been written with the
2830assumption that @code{\|} and @code{\+} match the literal characters
2831@code{|} and @code{+}. Such scripts must be modified by removing the
2832spurious backslashes if they are to be used with modern implementations
2833of @command{sed}, like
2834@ifset PERL
2835@value{SSED} or
2836@end ifset
2837@acronym{GNU} @command{sed}.
2838
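The following sketch shows how the escaped and unescaped forms behave.
It assumes @acronym{GNU} @command{sed} in its default basic regular
expression mode; the variable names are illustrative only.

```shell
# Assumption: `sed` is GNU sed in its default (basic regex) mode.

# \+ is special: it matches one or more x.
plus_special=$(echo 'xxx+y' | sed 's/x\+/A/')

# An unescaped + is an ordinary character in basic regexps.
plus_literal=$(echo 'xxx+y' | sed 's/x+/A/')

# \| is alternation.
alt=$(echo 'abc def ghi' | sed 's/abc\|def/X/g')

# A script that escaped + to mean a literal plus no longer matches.
ported=$(echo '1+1' | sed 's/1\+1/X/')

echo "$plus_special"
echo "$plus_literal"
echo "$alt"
echo "$ported"
```

The last command leaves its input unchanged, which is the typical
symptom when such a script is moved from another @command{sed}.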
On the other hand, some scripts use @code{s|abc\|def||g} to remove occurrences
2840of @emph{either} @code{abc} or @code{def}. While this worked until
2841@command{sed} 4.0.x, newer versions interpret this as removing the
2842string @code{abc|def}. This is again undefined behavior according to
2843@acronym{POSIX}, and this interpretation is arguably more robust: older
2844@command{sed}s, for example, required that the regex matcher parsed
2845@code{\/} as @code{/} in the common case of escaping a slash, which is
2846again undefined behavior; the new behavior avoids this, and this is good
2847because the regex matcher is only partially under our control.
2848
2849@cindex @acronym{GNU} extensions, special escapes
2850In addition, this version of @command{sed} supports several escape characters
2851(some of which are multi-character) to insert non-printable characters
2852in scripts (@code{\a}, @code{\c}, @code{\d}, @code{\o}, @code{\r},
2853@code{\t}, @code{\v}, @code{\x}). These can cause similar problems
2854with scripts written for other @command{sed}s.
2855
2856@item @option{-i} clobbers read-only files
2857@cindex In-place editing
2858@cindex @value{SSEDEXT}, in-place editing
2859@cindex Non-bugs, in-place editing
2860
2861In short, @samp{sed -i} will let you delete the contents of
2862a read-only file, and in general the @option{-i} option
2863(@pxref{Invoking sed, , Invocation}) lets you clobber
2864protected files. This is not a bug, but rather a consequence
2865of how the Unix filesystem works.
2866
2867The permissions on a file say what can happen to the data
2868in that file, while the permissions on a directory say what can
2869happen to the list of files in that directory. @samp{sed -i}
2870will not ever open for writing a file that is already on disk.
2871Rather, it will work on a temporary file that is finally renamed
2872to the original name: if you rename or delete files, you're actually
2873modifying the contents of the directory, so the operation depends on
2874the permissions of the directory, not of the file. For this same
2875reason, @command{sed} does not let you use @option{-i} on a writeable file
2876in a read-only directory (but unbelievably nobody reports that as a
2877bug@dots{}).
2878
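A short sketch of the first case, a read-only file in a writable
directory. It assumes @acronym{GNU} @command{sed} (where @option{-i}
takes no mandatory suffix) and the @command{mktemp} utility; the
directory and file names are hypothetical.

```shell
# Assumption: GNU sed and mktemp are available.
dir=$(mktemp -d)
printf 'hello\n' > "$dir/file"
chmod a-w "$dir/file"      # the file is read-only, the directory is not

# sed writes a temporary file in $dir and renames it over the original,
# so only the directory permissions matter here.
sed -i 's/hello/goodbye/' "$dir/file"
result=$(cat "$dir/file")
echo "$result"

rm -rf "$dir"
```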
2879@item @code{0a} does not work (gives an error)
2880There is no line 0. 0 is a special address that is only used to treat
2881addresses like @code{0,/@var{RE}/} as active when the script starts: if
2882you write @code{1,/abc/d} and the first line includes the word @samp{abc},
2883then that match would be ignored because address ranges must span at least
2884two lines (barring the end of the file); but what you probably wanted is
2885to delete every line up to the first one including @samp{abc}, and this
2886is obtained with @code{0,/abc/d}.
2887
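The two address forms can be compared on an input whose first line
already matches. This is a sketch assuming @acronym{GNU} @command{sed},
since the @code{0,/@var{RE}/} form is an extension.

```shell
# Assumption: `sed` is GNU sed (0,/RE/ is a GNU extension).
input=$(printf 'abc\nx\nabc\ny')

# 1,/abc/ starts looking for the end of the range on line 2,
# so lines 1 through 3 are deleted.
from_one=$(printf '%s\n' "$input" | sed '1,/abc/d')

# 0,/abc/ may end the range on line 1, so only line 1 is deleted.
from_zero=$(printf '%s\n' "$input" | sed '0,/abc/d')

printf '1,/abc/d:\n%s\n' "$from_one"
printf '0,/abc/d:\n%s\n' "$from_zero"
```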
2888@ifclear PERL
2889@item @code{[a-z]} is case insensitive
2890You are encountering problems with locales. POSIX mandates that @code{[a-z]}
2891uses the current locale's collation order -- in C parlance, that means using
2892@code{strcoll(3)} instead of @code{strcmp(3)}. Some locales have a
2893case-insensitive collation order, others don't: one of those that have
2894problems is Estonian.
2895
2896Another problem is that @code{[a-z]} tries to use collation symbols.
2897This only happens if you are on the @acronym{GNU} system, using
2898@acronym{GNU} libc's regular expression matcher instead of compiling the
2899one supplied with @acronym{GNU} sed. In a Danish locale, for example,
2900the regular expression @code{^[a-z]$} matches the string @samp{aa},
2901because this is a single collating symbol that comes after @samp{a}
2902and before @samp{b}; @samp{ll} behaves similarly in Spanish
2903locales, or @samp{ij} in Dutch locales.
2904
2905To work around these problems, which may cause bugs in shell scripts, set
2906the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.
2907@end ifclear
2908@end table
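As a sketch of the locale workaround described for @code{[a-z]}, the
locale can be overridden for a single command. The example uses
@env{LC_ALL}, which takes precedence over both @env{LC_COLLATE} and
@env{LC_CTYPE}; that choice is an assumption made for brevity.

```shell
# LC_ALL=C overrides LC_COLLATE and LC_CTYPE for this one command,
# so [a-z] covers exactly the 26 lowercase ASCII letters.
matched=$(printf 'A\na\nB\nb\n' | LC_ALL=C sed -n '/^[a-z]$/p')
echo "$matched"
```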
2909
2910
2911@node Extended regexps
2912@appendix Extended regular expressions
2913@cindex Extended regular expressions, syntax
2914
2915The only difference between basic and extended regular expressions is in
2916the behavior of a few characters: @samp{?}, @samp{+}, parentheses,
2917and braces (@samp{@{@}}). While basic regular expressions require
2918these to be escaped if you want them to behave as special characters,
2919when using extended regular expressions you must escape them if
2920you want them @emph{to match a literal character}.
2921
2922@noindent
2923Examples:
2924@table @code
2925@item abc?
2926becomes @samp{abc\?} when using extended regular expressions. It matches
2927the literal string @samp{abc?}.
2928
2929@item c\+
2930becomes @samp{c+} when using extended regular expressions. It matches
2931one or more @samp{c}s.
2932
2933@item a\@{3,\@}
2934becomes @samp{a@{3,@}} when using extended regular expressions. It matches
2935three or more @samp{a}s.
2936
2937@item \(abc\)\@{2,3\@}
2938becomes @samp{(abc)@{2,3@}} when using extended regular expressions. It
2939matches either @samp{abcabc} or @samp{abcabcabc}.
2940
2941@item \(abc*\)\1
2942becomes @samp{(abc*)\1} when using extended regular expressions.
2943Backreferences must still be escaped when using extended regular
2944expressions.
2945@end table
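The pairs above can be checked from the shell. This sketch assumes
@acronym{GNU} @command{sed}, with extended regular expressions selected
by the @option{-r} option; the variable names are illustrative only.

```shell
# Assumption: `sed` is GNU sed; -r selects extended regular expressions.

# One or more c: escaped in basic mode, bare in extended mode.
bre_plus=$(echo 'abccc' | sed 's/c\+/X/')
ere_plus=$(echo 'abccc' | sed -r 's/c+/X/')

# In extended mode an escaped ? is a literal question mark.
ere_literal_q=$(echo 'abc?' | sed -r 's/abc\?/X/')

# Interval: three or more a.
ere_interval=$(echo 'aaaa' | sed -r 's/a{3,}/X/')

# Backreferences keep their backslash in extended mode.
ere_backref=$(echo 'abcabc' | sed -r 's/(abc*)\1/X/')

echo "$bre_plus $ere_plus $ere_literal_q $ere_interval $ere_backref"
```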
2946
2947@ifset PERL
2948@node Perl regexps
2949@appendix Perl-style regular expressions
2950@cindex Perl-style regular expressions, syntax
2951
2952@emph{This part is taken from the @file{pcre.txt} file distributed together
2953with the free @sc{pcre} regular expression matcher; it was written by Philip Hazel.}
2954
2955Perl introduced several extensions to regular expressions, some
2956of them incompatible with the syntax of regular expressions
2957accepted by Emacs and other @acronym{GNU} tools (whose matcher was
2958based on the Emacs matcher). @value{SSED} implements
2959both kinds of extensions.
2960
2961@iftex
2962Summarizing, we have:
2963
2964@itemize @bullet
2965@item
2966A backslash can introduce several special sequences
2967
2968@item
2969The circumflex, dollar sign, and period characters behave specially
2970with regard to new lines
2971
2972@item
2973Strange uses of square brackets are parsed differently
2974
2975@item
2976You can toggle modifiers in the middle of a regular expression
2977
2978@item
2979You can specify that a subpattern does not count when numbering backreferences
2980
2981@item
2982@cindex Greedy regular expression matching
2983You can specify greedy or non-greedy matching
2984
2985@item
2986You can have more than ten back references
2987
2988@item
2989You can do complex look aheads and look behinds (in the spirit of
2990@code{\b}, but with subpatterns).
2991
2992@item
You can often improve performance by preventing @command{sed} from wasting
time on backtracking
2995
2996@item
2997You can have if/then/else branches
2998
2999@item
3000You can do recursive matches, for example to look for unbalanced parentheses
3001
3002@item
3003You can have comments and non-significant whitespace, because things can
3004get complex...
3005@end itemize
3006
3007Most of these extensions are introduced by the special @code{(?}
3008sequence, which gives special meanings to parenthesized groups.
3009@end iftex
3010@menu
Other extensions can be roughly subdivided into two categories.
On one hand, Perl introduces several more escaped sequences
(that is, sequences introduced by a backslash). On the other
hand, it specifies that if a question mark follows an open
parenthesis it should give a special meaning to the parenthesized
group.
3017
3018* Backslash:: Introduces special sequences
3019* Circumflex/dollar sign/period:: Behave specially with regard to new lines
3020* Square brackets:: Are a bit different in strange cases
3021* Options setting:: Toggle modifiers in the middle of a regexp
3022* Non-capturing subpatterns:: Are not counted when backreferencing
3023* Repetition:: Allows for non-greedy matching
3024* Backreferences:: Allows for more than 10 back references
3025* Assertions:: Allows for complex look ahead matches
3026* Non-backtracking subpatterns:: Often gives more performance
3027* Conditional subpatterns:: Allows if/then/else branches
3028* Recursive patterns:: For example to match parentheses
3029* Comments:: Because things can get complex...
3030@end menu
3031
3032@node Backslash
3033@appendixsec Backslash
3034@cindex Perl-style regular expressions, escaped sequences
3035
There are a few differences in the handling of backslashed
3037sequences in Perl mode.
3038
3039First of all, there are no @code{\o} and @code{\d} sequences.
3040@sc{ascii} values for characters can be specified in octal
3041with a @code{\@var{xxx}} sequence, where @var{xxx} is a
3042sequence of up to three octal digits. If the first digit
3043is a zero, the treatment of the sequence is straightforward;
3044just note that if the character that follows the escaped digit
3045is itself an octal digit, you have to supply three octal digits
3046for @var{xxx}. For example @code{\07} is a @sc{bel} character
3047rather than a @sc{nul} and a literal @code{7} (this sequence is
3048instead represented by @code{\0007}).
3049
3050@cindex Perl-style regular expressions, backreferences
3051The handling of a backslash followed by a digit other than 0
3052is complicated. Outside a character class, @command{sed} reads it
3053and any following digits as a decimal number. If the number
3054is less than 10, or if there have been at least that many
3055previous capturing left parentheses in the expression, the
3056entire sequence is taken as a back reference. A description
3057of how this works is given later, following the discussion
3058of parenthesized subpatterns.
3059
3060Inside a character class, or if the decimal number is
3061greater than 9 and there have not been that many capturing
3062subpatterns, @command{sed} re-reads up to three octal digits following
3063the backslash, and generates a single byte from the
3064least significant 8 bits of the value. Any subsequent digits
3065stand for themselves. For example:
3066
3067@example
3068 \040 @i{is another way of writing a space}
3069 \40 @i{is the same, provided there are fewer than 40}
3070 @i{previous capturing subpatterns}
3071 \7 @i{is always a back reference}
3072 \011 @i{is always a tab}
3073 \11 @i{might be a back reference, or another way of}
3074 @i{writing a tab}
3075 \0113 @i{is a tab followed by the character @samp{3}}
3076 \113 @i{is the character with octal code 113 (since there}
3077 @i{can be no more than 99 back references)}
3078 \377 @i{is a byte consisting entirely of 1 bits (@sc{ascii} 255)}
3079 \81 @i{is either a back reference, or a binary zero}
3080 @i{followed by the two characters @samp{81}}
3081@end example
3082
Note that octal values of 100 or greater must not be introduced
by a leading zero, because no more than three octal
digits are ever read.
3086
3087All the sequences that define a single byte value can be
3088used both inside and outside character classes. In addition,
3089inside a character class, the sequence @code{\b} is interpreted
3090as the backspace character (hex 08). Outside a character
3091class it has a different meaning (see below).
3092
Perl mode also provides escapes specifying
generic character classes (like @code{\w} and @code{\W} do):
3095
3096@cindex Perl-style regular expressions, character classes
3097@table @samp
3098@item \d
3099Matches any decimal digit
3100
3101@item \D
3102Matches any character that is not a decimal digit
3103@end table
3104
3105In Perl mode, these character type sequences can appear both inside and
3106outside character classes. Instead, in @sc{posix} mode these sequences
3107(as well as @code{\w} and @code{\W}) are treated as two literal characters
3108(a backslash and a letter) inside square brackets.
3109
3110Escaped sequences specifying assertions are also different in
3111Perl mode. An assertion specifies a condition that has to be met
3112at a particular point in a match, without consuming any
3113characters from the subject string. The use of subpatterns
3114for more complicated assertions is described below. The
3115backslashed assertions are
3116
3117@cindex Perl-style regular expressions, assertions
3118@table @samp
3119@item \b
3120Asserts that the point is at a word boundary.
3121A word boundary is a position in the subject string where
3122the current character and the previous character do not both
3123match @code{\w} or @code{\W} (i.e. one matches @code{\w} and
3124the other matches @code{\W}), or the start or end of the string
3125if the first or last character matches @code{\w}, respectively.
3126
3127@item \B
3128Asserts that the point is not at a word boundary.
3129
3130@item \A
3131Asserts the matcher is at the start of pattern space (independent
3132of multiline mode).
3133
3134@item \Z
3135Asserts the matcher is at the end of pattern space,
3136or at a newline before the end of pattern space (independent of
3137multiline mode)
3138
3139@item \z
3140Asserts the matcher is at the end of pattern space (independent
3141of multiline mode)
3142@end table
3143
3144These assertions may not appear in character classes (but
3145note that @code{\b} has a different meaning, namely the
3146backspace character, inside a character class).
3147Note that Perl mode does not support directly assertions
3148for the beginning and the end of word; the @acronym{GNU} extensions
3149@code{\<} and @code{\>} achieve this purpose in @sc{posix} mode
3150instead.
3151
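The word-boundary assertions can be tried in @sc{posix} mode, where
@code{\b}, @code{\B}, @code{\<} and @code{\>} are available as
@acronym{GNU} extensions. This is a sketch assuming @acronym{GNU}
@command{sed}; it does not use Perl mode.

```shell
# Assumption: `sed` is GNU sed in its default (POSIX) mode.

# \b on both sides: only the stand-alone word is replaced.
word_b=$(echo 'cat concat cats' | sed 's/\bcat\b/X/g')

# \< and \> assert beginning and end of word explicitly.
word_lt_gt=$(echo 'cat concat cats' | sed 's/\<cat\>/X/g')

# \B: match "cat" only where it does not start at a word boundary.
not_boundary=$(echo 'cat concat' | sed 's/\Bcat/X/g')

echo "$word_b"
echo "$word_lt_gt"
echo "$not_boundary"
```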
3152The @code{\A}, @code{\Z}, and @code{\z} assertions differ
3153from the traditional circumflex and dollar sign (described below)
3154in that they only ever match at the very start and end of the
3155subject string, whatever options are set; in particular @code{\A}
3156and @code{\z} are the same as the @acronym{GNU} extensions
3157@code{\`} and @code{\'} that are active in @sc{posix} mode.
3158
3159@node Circumflex/dollar sign/period
3160@appendixsec Circumflex, dollar sign, period
3161@cindex Perl-style regular expressions, newlines
3162
3163Outside a character class, in the default matching mode, the
3164circumflex character is an assertion which is true only if
3165the current matching point is at the start of the subject
3166string. Inside a character class, the circumflex has an entirely
3167different meaning (see below).
3168
3169The circumflex need not be the first character of the pattern if
3170a number of alternatives are involved, but it should be the
3171first thing in each alternative in which it appears if the
pattern is ever to match that branch. If all possible alternatives
start with a circumflex, that is, if the pattern is
constrained to match only at the start of the subject, it is
said to be an @dfn{anchored} pattern. (There are also other
constructs that can cause a pattern to be anchored.)
3177
3178A dollar sign is an assertion which is true only if the
3179current matching point is at the end of the subject string,
3180or immediately before a newline character that is the last
3181character in the string (by default). A dollar sign need not be the
3182last character of the pattern if a number of alternatives
3183are involved, but it should be the last item in any branch
3184in which it appears. A dollar sign has no special meaning in a
3185character class.
3186
3187@cindex Perl-style regular expressions, multiline
3188The meanings of the circumflex and dollar sign characters are
3189changed if the @code{M} modifier option is used. When this is
3190the case, they match immediately after and immediately
3191before an internal @code{\n} character, respectively, in addition
3192to matching at the start and end of the subject string. For
3193example, the pattern @code{/^abc$/} matches the subject string
3194@samp{def\nabc} in multiline mode, but not otherwise. Consequently,
3195patterns that are anchored in single line mode
3196because all branches start with @code{^} are not anchored in
3197multiline mode.
3198
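The effect of the @code{M} modifier can be sketched with the same
subject string. The example assumes @acronym{GNU} @command{sed} and
uses its @sc{posix} mode, where the @code{M} flag of the @code{s}
command has the same effect on circumflex and dollar sign; @code{N} is
used only to place two lines in pattern space.

```shell
# Assumption: `sed` is GNU sed; M is a GNU extension flag of s.
two_lines=$(printf 'def\nabc')

# Without M, ^ and $ match only at the ends of pattern space.
single=$(printf '%s\n' "$two_lines" | sed 'N;s/^abc$/X/')

# With M, ^ and $ also match around the embedded newline.
multi=$(printf '%s\n' "$two_lines" | sed 'N;s/^abc$/X/M')

printf 'without M:\n%s\n' "$single"
printf 'with M:\n%s\n' "$multi"
```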
3199@cindex Perl-style regular expressions, multiline
3200Note that the sequences @code{\A}, @code{\Z}, and @code{\z}
3201can be used to match the start and end of the subject in both
3202modes, and if all branches of a pattern start with @code{\A}
it is always anchored, whether the @code{M} modifier is set or not.
3204
3205@cindex Perl-style regular expressions, single line
3206Outside a character class, a dot in the pattern matches any
3207one character in the subject, including a non-printing character,
3208but not (by default) newline. If the @code{S} modifier is used,
3209dots match newlines as well. Actually, the handling of
3210dot is entirely independent of the handling of circumflex
3211and dollar sign, the only relationship being that they both
3212involve newline characters. Dot has no special meaning in a
3213character class.
3214
3215@node Square brackets
3216@appendixsec Square brackets
3217@cindex Perl-style regular expressions, character classes
3218
3219An opening square bracket introduces a character class, terminated
3220by a closing square bracket. A closing square bracket on its own
3221is not special. If a closing square bracket is required as a
3222member of the class, it should be the first data character in
3223the class (after an initial circumflex, if present) or escaped with a backslash.
3224
3225A character class matches a single character in the subject;
3226the character must be in the set of characters defined by
3227the class, unless the first character in the class is a circumflex,
3228in which case the subject character must not be in
3229the set defined by the class. If a circumflex is actually
3230required as a member of the class, ensure it is not the
3231first character, or escape it with a backslash.
3232
For example, the character class @code{[aeiou]} matches any lower
case vowel, while @code{[^aeiou]} matches any character that is not
a lower case vowel. Note that a circumflex is just a convenient
notation for specifying the characters which are in
the class by enumerating those that are not. It is not an
3238assertion: it still consumes a character from the subject
3239string, and fails if the current pointer is at the end of
3240the string.
3241
3242@cindex Perl-style regular expressions, case-insensitive
3243When caseless matching is set, any letters in a class
3244represent both their upper case and lower case versions, so
3245for example, a caseless @code{[aeiou]} matches uppercase
3246and lowercase @samp{A}s, and a caseless @code{[^aeiou]}
3247does not match @samp{A}, whereas a case-sensitive version would.
3248
3249@cindex Perl-style regular expressions, single line
3250@cindex Perl-style regular expressions, multiline
3251The newline character is never treated in any special way in
3252character classes, whatever the setting of the @code{S} and
3253@code{M} options (modifiers) is. A class such as @code{[^a]} will
3254always match a newline.
3255
3256The minus (hyphen) character can be used to specify a range
3257of characters in a character class. For example, @code{[d-m]}
3258matches any letter between d and m, inclusive. If a minus
3259character is required in a class, it must be escaped with a
3260backslash or appear in a position where it cannot be interpreted
3261as indicating a range, typically as the first or last
3262character in the class.
3263
3264It is not possible to have the literal character @code{]} as the
3265end character of a range. A pattern such as @code{[W-]46]} is
3266interpreted as a class of two characters (@code{W} and @code{-})
3267followed by a literal string @code{46]}, so it would match
3268@samp{W46]} or @samp{-46]}. However, if the @code{]} is escaped
3269with a backslash it is interpreted as the end of range, so
3270@code{[W-\]46]} is interpreted as a single class containing a
3271range followed by two separate characters. The octal or
3272hexadecimal representation of @code{]} can also be used to end a range.
3273
3274Ranges operate in @sc{ascii} collating sequence. They can also be
3275used for characters specified numerically, for example
3276@code{[\000-\037]}. If a range that includes letters is used when
3277caseless matching is set, it matches the letters in either
3278case. For example, a caseless @code{[W-c]} is equivalent to
3279@code{[][\^_`wxyzabc]}, matched caselessly, and if character
3280tables for the French locale are in use, @code{[\xc8-\xcb]}
3281matches accented E characters in both cases.
3282
3283Unlike in @sc{posix} mode, the character types @code{\d},
3284@code{\D}, @code{\s}, @code{\S}, @code{\w}, and @code{\W}
3285may also appear in a character class, and add the characters
3286that they match to the class. For example, @code{[\dABCDEF]} matches any
3287hexadecimal digit. A circumflex can conveniently be used
3288with the upper case character types to specify a more restricted
3289set of characters than the matching lower case type.
3290For example, the class @code{[^\W_]} matches any letter or digit,
3291but not underscore.
3292
All non-alphanumeric characters other than @code{\}, @code{-},
3294@code{^} (at the start) and the terminating @code{]}
3295are non-special in character classes, but it does no harm
3296if they are escaped.
3297
3298Perl 5.6 supports the @sc{posix} notation for character classes, which
3299uses names enclosed by @code{[:} and @code{:]} within the enclosing
3300square brackets, and @value{SSED} supports this notation as well.
3301For example,
3302
3303@example
3304 [01[:alpha:]%]
3305@end example
3306
3307@noindent
3308matches @samp{0}, @samp{1}, any alphabetic character, or @samp{%}.
3309The supported class names are
3310
3311@table @code
3312@item alnum
3313Matches letters and digits
3314
3315@item alpha
3316Matches letters
3317
3318@item ascii
3319Matches character codes 0 - 127
3320
3321@item cntrl
3322Matches control characters
3323
3324@item digit
3325Matches decimal digits (same as \d)
3326
3327@item graph
3328Matches printing characters, excluding space
3329
3330@item lower
3331Matches lower case letters
3332
3333@item print
3334Matches printing characters, including space
3335
3336@item punct
3337Matches printing characters, excluding letters and digits
3338
3339@item space
3340Matches white space (same as \s)
3341
3342@item upper
3343Matches upper case letters
3344
3345@item word
3346Matches ``word'' characters (same as \w)
3347
3348@item xdigit
3349Matches hexadecimal digits
3350@end table
3351
3352The names @code{ascii} and @code{word} are extensions valid only in
3353Perl mode. Another Perl extension is negation, which is
3354indicated by a circumflex character after the colon. For example,
3355
3356@example
3357 [12[:^digit:]]
3358@end example
3359
3360@noindent
3361matches @samp{1}, @samp{2}, or any non-digit.
3362
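The bracketed class names that are not Perl extensions can be tried in
@sc{posix} mode as well. This sketch assumes @acronym{GNU}
@command{sed}. It does not cover the @code{[:^digit:]} negation, which
is a Perl extension; the second command uses the ordinary negated class
instead.

```shell
# Assumption: `sed` is GNU sed in its default (POSIX) mode.

# 0, 1, any alphabetic character, or % are replaced.
classes=$(echo '0a%9-' | LC_ALL=C sed 's/[01[:alpha:]%]/X/g')

# POSIX-mode negation applies to the whole class: any non-digit.
non_digits=$(echo '12ab3' | LC_ALL=C sed 's/[^[:digit:]]/_/g')

echo "$classes"
echo "$non_digits"
```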
3363@node Options setting
3364@appendixsec Options setting
3365@cindex Perl-style regular expressions, toggling options
3366@cindex Perl-style regular expressions, case-insensitive
3367@cindex Perl-style regular expressions, multiline
3368@cindex Perl-style regular expressions, single line
3369@cindex Perl-style regular expressions, extended
3370
3371The settings of the @code{I}, @code{M}, @code{S}, @code{X}
3372modifiers can be changed from within the pattern by
3373a sequence of Perl option letters enclosed between @code{(?}
3374and @code{)}. The option letters must be lowercase.
3375
3376For example, @code{(?im)} sets caseless, multiline matching. It is
3377also possible to unset these options by preceding the letter
3378with a hyphen; you can also have combined settings and unsettings:
3379@code{(?im-sx)} sets caseless and multiline matching,
while unsetting single line matching (for dots) and extended
3381whitespace interpretation. If a letter appears both before
3382and after the hyphen, the option is unset.
3383
3384The scope of these option changes depends on where in the
3385pattern the setting occurs. For settings that are outside
3386any subpattern (defined below), the effect is the same as if
3387the options were set or unset at the start of matching. The
3388following patterns all behave in exactly the same way:
3389
3390@example
3391 (?i)abc
3392 a(?i)bc
3393 ab(?i)c
3394 abc(?i)
3395@end example
3396
which in turn is the same as specifying the pattern @code{abc} with
3398the @code{I} modifier. In other words, ``top level'' settings
3399apply to the whole pattern (unless there are other
3400changes inside subpatterns). If there is more than one setting
3401of the same option at top level, the rightmost setting
3402is used.
3403
3404If an option change occurs inside a subpattern, the effect
3405is different. This is a change of behaviour in Perl 5.005.
3406An option change inside a subpattern affects only that part
3407of the subpattern @emph{that follows} it, so
3408
3409@example
3410 (a(?i)b)c
3411@end example
3412
3413@noindent
matches @samp{abc} and @samp{aBc} and no other strings (assuming
3415case-sensitive matching is used). By this means, options can
3416be made to have different settings in different parts of the
3417pattern. Any changes made in one alternative do carry on
3418into subsequent branches within the same subpattern. For
3419example,
3420
3421@example
3422 (a(?i)b|c)
3423@end example
3424
3425@noindent
3426matches @samp{ab}, @samp{aB}, @samp{c}, and @samp{C},
3427even though when matching @samp{C} the first branch is
3428abandoned before the option setting.
3429This is because the effects of option settings happen at
3430compile time. There would be some very weird behaviour otherwise.
3431
3432@ignore
3433There are two PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA
3434that can be changed in the same way as the Perl-compatible options by
3435using the characters U and X respectively. The (?X) flag
3436setting is special in that it must always occur earlier in
3437the pattern than any of the additional features it turns on,
3438even when it is at top level. It is best put at the start.
3439@end ignore
3440
3441
3442@node Non-capturing subpatterns
3443@appendixsec Non-capturing subpatterns
3444@cindex Perl-style regular expressions, non-capturing subpatterns
3445
3446Marking part of a pattern as a subpattern does two things.
3447On one hand, it localizes a set of alternatives; on the other
3448hand, it sets up the subpattern as a capturing subpattern (as
3449defined above). The subpattern can be backreferenced and
3450referenced in the right side of @code{s} commands.
3451
3452For example, if the string @samp{the red king} is matched against
3453the pattern
3454
3455@example
3456 the ((red|white) (king|queen))
3457@end example
3458
3459@noindent
3460the captured substrings are @samp{red king}, @samp{red},
3461and @samp{king}, and are numbered 1, 2, and 3.
3462
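The numbering of capturing subpatterns can be made visible by using
them in the right side of an @code{s} command. This sketch assumes
@acronym{GNU} @command{sed} with extended regular expressions
(@option{-r}), where plain parentheses capture in the same order.

```shell
# Assumption: `sed` is GNU sed; -r selects extended regular expressions.
# Groups are numbered by the position of their opening parenthesis.
captures=$(echo 'the red king' |
  sed -r 's/the ((red|white) (king|queen))/1=[\1] 2=[\2] 3=[\3]/')
echo "$captures"
```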
3463The fact that plain parentheses fulfil two functions is not
3464always helpful. There are often times when a grouping
3465subpattern is required without a capturing requirement. If an
3466opening parenthesis is followed by @code{?:}, the subpattern does
3467not do any capturing, and is not counted when computing the
3468number of any subsequent capturing subpatterns. For example,
3469if the string @samp{the white queen} is matched against the pattern
3470
3471@example
3472 the ((?:red|white) (king|queen))
3473@end example
3474
3475@noindent
3476the captured substrings are @samp{white queen} and @samp{queen},
3477and are numbered 1 and 2. The maximum number of captured
3478substrings is 99, while the maximum number of all subpatterns,
3479both capturing and non-capturing, is 200.
3480
3481As a convenient shorthand, if any option settings are
required at the start of a non-capturing subpattern, the
3483option letters may appear between the @code{?} and the
3484@code{:}. Thus the two patterns
3485
3486@example
3487 (?i:saturday|sunday)
3488 (?:(?i)saturday|sunday)
3489@end example
3490
3491@noindent
3492match exactly the same set of strings. Because alternative
3493branches are tried from left to right, and options are not
3494reset until the end of the subpattern is reached, an option
3495setting in one branch does affect subsequent branches, so
3496the above patterns match @samp{SUNDAY} as well as @samp{Saturday}.
3497
3498
3499@node Repetition
3500@appendixsec Repetition
3501@cindex Perl-style regular expressions, repetitions
3502
3503Repetition is specified by quantifiers, which can follow any
3504of the following items:
3505
3506@itemize @bullet
3507@item
3508a single character, possibly escaped
3509
3510@item
3511the @code{.} special character
3512
3513@item
3514a character class
3515
3516@item
3517a back reference (see next section)
3518
3519@item
3520a parenthesized subpattern (unless it is an assertion; @pxref{Assertions})
3521@end itemize
3522
3523The general repetition quantifier specifies a minimum and
3524maximum number of permitted matches, by giving the two
3525numbers in curly brackets (braces), separated by a comma.
3526The numbers must be less than 65536, and the first must be
3527less than or equal to the second. For example:
3528
3529@example
3530 z@{2,4@}
3531@end example
3532
3533@noindent
3534matches @samp{zz}, @samp{zzz}, or @samp{zzzz}. A closing brace on its own
3535is not a special character. If the second number is omitted,
3536but the comma is present, there is no upper limit; if the
3537second number and the comma are both omitted, the quantifier
3538specifies an exact number of required matches. Thus
3539
3540@example
3541 [aeiou]@{3,@}
3542@end example
3543
3544@noindent
3545matches at least 3 successive vowels, but may match many
3546more, while
3547
3548@example
3549 \d@{8@}
3550@end example
3551
3552@noindent
3553matches exactly 8 digits. An opening curly bracket that
3554appears in a position where a quantifier is not allowed, or
3555one that does not match the syntax of a quantifier, is taken
as a literal character. For example, @code{@{,6@}} is not a quantifier,
3557but a literal string of four characters.@footnote{It
3558raises an error if @option{-R} is not used.}
3559
3560The quantifier @samp{@{0@}} is permitted, causing the expression to
3561behave as if the previous item and the quantifier were not
3562present.
3563
3564For convenience (and historical compatibility) the three
3565most common quantifiers have single-character abbreviations:
3566
3567@table @code
@item *
is equivalent to @code{@{0,@}}

@item +
is equivalent to @code{@{1,@}}

@item ?
is equivalent to @code{@{0,1@}}
3576@end table
3577
3578It is possible to construct infinite loops by following a
3579subpattern that can match no characters with a quantifier
3580that has no upper limit, for example:
3581
3582@example
3583 (a?)*
3584@end example
3585
3586Earlier versions of Perl used to give an error at
3587compile time for such patterns. However, because there are
3588cases where this can be useful, such patterns are now
3589accepted, but if any repetition of the subpattern does in
3590fact match no characters, the loop is forcibly broken.
3591
3592@cindex Greedy regular expression matching
3593@cindex Perl-style regular expressions, stingy repetitions
By default, the quantifiers are @dfn{greedy}, as in @sc{posix}
3595mode, that is, they match as much as possible (up to the maximum
3596number of permitted times), without causing the rest of the
3597pattern to fail. The classic example of where this gives problems
3598is in trying to match comments in C programs. These appear between
3599the sequences @code{/*} and @code{*/} and within the sequence, individual
3600@code{*} and @code{/} characters may appear. An attempt to match C
3601comments by applying the pattern
3602
3603@example
3604 /\*.*\*/
3605@end example
3606
3607@noindent
3608to the string
3609
3610@example
 /* first comment */ not comment /* second comment */
3612@end example
3613
@noindent
fails, because it matches the entire string owing to the
greediness of the @code{.*} item.
3618
3619However, if a quantifier is followed by a question mark, it
3620ceases to be greedy, and instead matches the minimum number
3621of times possible, so the pattern @code{/\*.*?\*/}
3622does the right thing with the C comments. The meaning of the
3623various quantifiers is not otherwise changed, just the preferred
3624number of matches. Do not confuse this use of question
3625mark with its use as a quantifier in its own right.
3626Because it has two uses, it can sometimes appear doubled, as in
3627
3628@example
3629 \d??\d
3630@end example
3631
3632which matches one digit by preference, but can match two if
3633that is the only way the rest of the pattern matches.
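
The difference between greedy and non-greedy repetition can be tried
out with the following Python sketch. It assumes only that Python's
@code{re} module treats @code{*}, @code{*?} and @code{??} as described
above, which holds for these cases; the variable names are
illustrative.

@verbatim
```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs on to the last "*/" in the string.
greedy = re.search(r"/\*.*\*/", subject).group(0)

# Non-greedy: .*? stops at the first "*/" that lets the match succeed.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest requires it.
one_digit = re.match(r"\d??\d", "12").group(0)
two_digits = re.fullmatch(r"\d??\d", "12").group(0)
```
@end verbatim

Here @code{re.match} is free to stop after one digit, while
@code{re.fullmatch} forces the pattern to cover the whole subject and
so makes the optional digit necessary.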
3634
Note that greediness does not matter when specifying addresses,
but can nevertheless be used to improve performance.
3637
3638@ignore
3639 If the PCRE_UNGREEDY option is set (an option which is not
3640 available in Perl), the quantifiers are not greedy by
3641 default, but individual ones can be made greedy by following
3642 them with a question mark. In other words, it inverts the
3643 default behaviour.
3644@end ignore
3645
3646When a parenthesized subpattern is quantified with a minimum
3647repeat count that is greater than 1 or with a limited maximum,
3648more store is required for the compiled pattern, in
3649proportion to the size of the minimum or maximum.
3650
3651@cindex Perl-style regular expressions, single line
3652If a pattern starts with @code{.*} or @code{.@{0,@}} and the
3653@code{S} modifier is used, the pattern is implicitly anchored,
3654because whatever follows will be tried against every character
3655position in the subject string, so there is no point in
3656retrying the overall match at any position after the first.
PCRE treats such a pattern as though it were preceded by @code{\A}.
3658
3659When a capturing subpattern is repeated, the value captured
3660is the substring that matched the final iteration. For example,
3661after
3662
3663@example
3664 (tweedle[dume]@{3@}\s*)+
3665@end example
3666
3667@noindent
3668has matched @samp{tweedledum tweedledee} the value of the
3669captured substring is @samp{tweedledee}. However, if there are
3670nested capturing subpatterns, the corresponding captured
3671values may have been set in previous iterations. For example,
3672after
3673
3674@example
3675 /(a|(b))+/
3676@end example
3677
3678matches @samp{aba}, the value of the second captured substring is
3679@samp{b}.
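
Both capture rules can be observed in a short Python sketch. Python's
@code{re} module happens to follow the same rules for these two
patterns; that is an observation about Python, not a statement about
the matcher in @value{SSED}.

@verbatim
```python
import re

# Group 1 holds the text of the final iteration only.
last = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
final_iteration = last.group(1)

# Group 2 is set on the second iteration; the third iteration takes the
# "a" branch and leaves group 2 as it was.
nested = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = nested.group(1), nested.group(2)
```
@end verbatim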
3680
3681@node Backreferences
3682@appendixsec Backreferences
3683@cindex Perl-style regular expressions, backreferences
3684
3685Outside a character class, a backslash followed by a digit
3686greater than 0 (and possibly further digits) is a back
3687reference to a capturing subpattern earlier (i.e. to its
3688left) in the pattern, provided there have been that many
3689previous capturing left parentheses.
3690
3691However, if the decimal number following the backslash is
3692less than 10, it is always taken as a back reference, and
3693causes an error only if there are not that many capturing
3694left parentheses in the entire pattern. In other words, the
3695parentheses that are referenced need not be to the left of
the reference for numbers less than 10. @xref{Backslash},
for further details of the handling of digits following a backslash.
3698
3699A back reference matches whatever actually matched the capturing
3700subpattern in the current subject string, rather than
3701anything matching the subpattern itself. So the pattern
3702
3703@example
3704 (sens|respons)e and \1ibility
3705@end example
3706
3707@noindent
3708matches @samp{sense and sensibility} and @samp{response and responsibility},
3709but not @samp{sense and responsibility}. If caseful
3710matching is in force at the time of the back reference, the
3711case of letters is relevant. For example,
3712
3713@example
3714 ((?i)blah)\s+\1
3715@end example
3716
3717@noindent
3718matches @samp{blah blah} and @samp{Blah Blah}, but not
3719@samp{BLAH blah}, even though the original capturing
3720subpattern is matched caselessly.
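
The first of these examples can be checked with a small Python sketch;
the list of subjects is made up for the purpose, and Python's
@code{re} module is used only because it shares the @code{\1}
notation.

@verbatim
```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

titles = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

# \1 must repeat the text that group 1 actually matched, so the mixed
# third title is rejected.
matched = [t for t in titles if pattern.fullmatch(t)]
```
@end verbatim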
3721
3722There may be more than one back reference to the same subpattern.
3723Also, if a subpattern has not actually been used in a
3724particular match, any back references to it always fail. For
3725example, the pattern
3726
3727@example
3728 (a|(bc))\2
3729@end example
3730
3731@noindent
3732always fails if it starts to match @samp{a} rather than
3733@samp{bc}. Because there may be up to 99 back references, all
3734digits following the backslash are taken as part of a potential
3735back reference number; this is different from what happens
3736in @sc{posix} mode. If the pattern continues with a digit
3737character, some delimiter must be used to terminate the back
3738reference. If the @code{X} modifier option is set, this can be
3739whitespace. Otherwise an empty comment can be used, or the
3740following character can be expressed in hexadecimal or octal.
3741
3742A back reference that occurs inside the parentheses to which
3743it refers fails when the subpattern is first used, so, for
3744example, @code{(a\1)} never matches. However, such references
3745can be useful inside repeated subpatterns. For example, the
3746pattern
3747
3748@example
3749 (a|b\1)+
3750@end example
3751
3752@noindent
3753matches any number of @samp{a}s and also @samp{aba}, @samp{ababbaa},
3754etc. At each iteration of the subpattern, the back reference matches
3755the character string corresponding to the previous iteration. In
3756order for this to work, the pattern must be such that the first
3757iteration does not need to match the back reference. This can be
3758done using alternation, as in the example above, or by a
3759quantifier with a minimum of zero.
3760
3761@node Assertions
3762@appendixsec Assertions
3763@cindex Perl-style regular expressions, assertions
3764@cindex Perl-style regular expressions, asserting subpatterns
3765
3766An assertion is a test on the characters following or
3767preceding the current matching point that does not actually
3768consume any characters. The simple assertions coded as @code{\b},
3769@code{\B}, @code{\A}, @code{\Z}, @code{\z}, @code{^} and @code{$}
3770are described above. More complicated assertions are coded as
3771subpatterns. There are two kinds: those that look ahead of the
3772current position in the subject string, and those that look behind it.
3773
3774@cindex Perl-style regular expressions, lookahead subpatterns
3775An assertion subpattern is matched in the normal way, except
3776that it does not cause the current matching position to be
3777changed. Lookahead assertions start with @code{(?=} for positive
3778assertions and @code{(?!} for negative assertions. For example,
3779
3780@example
3781 \w+(?=;)
3782@end example
3783
3784@noindent
3785matches a word followed by a semicolon, but does not include
3786the semicolon in the match, and
3787
3788@example
3789 foo(?!bar)
3790@end example
3791
3792@noindent
3793matches any occurrence of @samp{foo} that is not followed by
3794@samp{bar}.
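
A brief Python sketch of the two lookahead forms follows. The subject
strings are invented, and Python's @code{re} module is assumed to
treat @code{(?=} and @code{(?!} as described here, which it does for
these patterns.

@verbatim
```python
import re

# \w+(?=;): the semicolon must follow, but is not part of the match.
word = re.search(r"\w+(?=;)", "return value;").group(0)

# foo(?!bar): "foo" only where "bar" does not come next.
subject = "foobar foobaz foo"
positions = [m.start() for m in re.finditer(r"foo(?!bar)", subject)]
```
@end verbatim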
3795
3796Note that the apparently similar pattern
3797
3798@example
3799 (?!foo)bar
3800@end example
3801
@noindent
@cindex Perl-style regular expressions, lookbehind subpatterns
does not find only those occurrences of @samp{bar} that are not
preceded by @samp{foo}; it finds every occurrence of @samp{bar},
because the assertion @code{(?!foo)} is always true
when the next three characters are @samp{bar}. A lookbehind
assertion is needed to exclude the occurrences preceded by @samp{foo}.
Lookbehind assertions start with @code{(?<=} for positive
assertions and @code{(?<!} for negative assertions. So,
3810
3811@example
3812 (?<!foo)bar
3813@end example
3814
3815achieves the required effect of finding an occurrence of
3816@samp{bar} that is not preceded by @samp{foo}. The contents of a
3817lookbehind assertion are restricted
3818such that all the strings it matches must have a fixed
3819length. However, if there are several alternatives, they do
3820not all have to have the same fixed length. This is an extension
3821compared with Perl 5.005, which requires all branches to match
3822the same length of string. Thus
3823
3824@example
3825 (?<=dogs|cats|)
3826@end example
3827
3828@noindent
3829is permitted, but the apparently equivalent regular expression
3830
3831@example
3832 (?<!dogs?|cats?)
3833@end example
3834
3835@noindent
3836causes an error at compile time. Branches that match different
3837length strings are permitted only at the top level of
3838a lookbehind assertion: an assertion such as
3839
3840@example
3841 (?<=ab(c|de))
3842@end example
3843
3844@noindent
3845is not permitted, because its single top-level branch can
3846match two different lengths, but it is acceptable if rewritten
3847to use two top-level branches:
3848
3849@example
3850 (?<=abc|abde)
3851@end example
3852
3853All this is required because lookbehind assertions simply
3854move the current position back by the alternative's fixed
3855width and then try to match. If there are
3856insufficient characters before the current position, the
3857match is deemed to fail. Lookbehinds, in conjunction with
3858non-backtracking subpatterns can be particularly useful for
3859matching at the ends of strings; an example is given at the end
3860of the section on non-backtracking subpatterns.
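
The contrast between @code{(?!foo)bar} and @code{(?<!foo)bar} can be
seen in the following Python sketch, with an invented subject string.
Note that Python's @code{re} module is stricter than the matcher
described here: it requires every alternative in a lookbehind to have
the same width, so only single-width lookbehinds are shown.

@verbatim
```python
import re

subject = "foobar bazbar"

# (?!foo)bar: the lookahead is tested where "bar" begins, so it can
# never fail there and every "bar" is found.
lookahead_hits = [m.start() for m in re.finditer(r"(?!foo)bar", subject)]

# (?<!foo)bar: the three characters before "bar" must not be "foo".
lookbehind_hits = [m.start() for m in re.finditer(r"(?<!foo)bar", subject)]
```
@end verbatim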
3861
3862Several assertions (of any sort) may occur in succession.
3863For example,
3864
3865@example
3866 (?<=\d@{3@})(?<!999)foo
3867@end example
3868
3869@noindent
3870matches @samp{foo} preceded by three digits that are not @samp{999}.
3871Notice that each of the assertions is applied independently
3872at the same point in the subject string. First there is a
3873check that the previous three characters are all digits, and
3874then there is a check that the same three characters are not
@samp{999}. This pattern does not match @samp{foo} preceded by six
characters, the first three of which are digits and the last three
of which are not @samp{999}. For example, it doesn't match
@samp{123abcfoo}. A pattern to do that is
3879
3880@example
3881 (?<=\d@{3@}...)(?<!999)foo
3882@end example
3883
3884@noindent
3885This time the first assertion looks at the preceding six
3886characters, checking that the first three are digits, and
3887then the second assertion checks that the preceding three
3888characters are not @samp{999}. Actually, assertions can be
3889nested in any combination, so one can write this as
3890
3891@example
3892 (?<=\d@{3@}(?!999)...)foo
3893@end example
3894
3895or
3896
3897@example
3898 (?<=\d@{3@}...(?<!999))foo
3899@end example
3900
3901@noindent
3902both of which might be considered more readable.
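
The first two patterns of this discussion can be compared with a short
Python sketch. The helper @code{hits} and the subject strings are
hypothetical names chosen for the illustration; both lookbehinds have
a single fixed width, so Python's @code{re} module accepts them.

@verbatim
```python
import re

three_digits_not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def hits(pattern, subjects):
    """Return the subjects in which the pattern is found."""
    return [s for s in subjects if pattern.search(s)]

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]

# Both assertions look at the same three characters before "foo".
first = hits(three_digits_not_999, subjects)

# The first assertion now looks six characters back.
second = hits(six_back, subjects)
```
@end verbatim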
3903
3904Assertion subpatterns are not capturing subpatterns, and may
3905not be repeated, because it makes no sense to assert the
3906same thing several times. If any kind of assertion contains
3907capturing subpatterns within it, these are counted for the
3908purposes of numbering the capturing subpatterns in the whole
3909pattern. However, substring capturing is carried out only
3910for positive assertions, because it does not make sense for
3911negative assertions.
3912
3913Assertions count towards the maximum of 200 parenthesized
3914subpatterns.
3915
3916@node Non-backtracking subpatterns
3917@appendixsec Non-backtracking subpatterns
3918@cindex Perl-style regular expressions, non-backtracking subpatterns
3919
3920With both maximizing and minimizing repetition, failure of
3921what follows normally causes the repeated item to be evaluated
3922again to see if a different number of repeats allows the
3923rest of the pattern to match. Sometimes it is useful to
3924prevent this, either to change the nature of the match, or
to cause it to fail earlier than it otherwise might, when the
3926author of the pattern knows there is no point in carrying
3927on.
3928
3929Consider, for example, the pattern @code{\d+foo} when applied to
3930the subject line
3931
3932@example
3933 123456bar
3934@end example
3935
3936After matching all 6 digits and then failing to match @samp{foo},
3937the normal action of the matcher is to try again with only 5
3938digits matching the @code{\d+} item, and then with 4, and so on,
3939before ultimately failing. Non-backtracking subpatterns
3940provide the means for specifying that once a portion of the
3941pattern has matched, it is not to be re-evaluated in this way,
3942so the matcher would give up immediately on failing to match
3943@samp{foo} the first time. The notation is another kind of special
3944parenthesis, starting with @code{(?>} as in this example:
3945
3946@example
 (?>\d+)foo
3948@end example
3949
3950This kind of parenthesis ``locks up'' the part of the pattern
3951it contains once it has matched, and a failure further into
3952the pattern is prevented from backtracking into it.
3953Backtracking past it to previous items, however, works as
3954normal.
3955
3956Non-backtracking subpatterns are not capturing subpatterns. Simple
3957cases such as the above example can be thought of as a maximizing
3958repeat that must swallow everything it can. So,
3959while both @code{\d+} and @code{\d+?} are prepared to adjust the number of
3960digits they match in order to make the rest of the pattern
3961match, @code{(?>\d+)} can only match an entire sequence of digits.
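
The effect of ``locking up'' a repeat can be sketched in Python.
Python's @code{re} module accepts @code{(?>...)} only from version
3.11 onwards, so this sketch swaps in a well-known emulation instead:
a lookahead captures the whole run of digits, and a back reference
then consumes exactly that text. The group name @code{run} and the
subjects are assumptions made for the example.

@verbatim
```python
import re

# Ordinary greedy repeat: \d+ gives back the final "4" so that the
# literal "4" can still match.
backtracking = re.match(r"\d+4", "1234")

# Emulated (?>\d+)4: the digit run is taken whole and nothing is given
# back, so the literal "4" has nothing left to match.
atomic = re.match(r"(?=(?P<run>\d+))(?P=run)4", "1234")

# When the suffix really is present, both forms agree.
plain_ok = re.match(r"\d+foo", "123456foo").group(0)
atomic_ok = re.match(r"(?=(?P<run>\d+))(?P=run)foo", "123456foo").group(0)
```
@end verbatim

A named group is used rather than @code{\1} so that the digit which
follows cannot be read as part of the back reference number.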
3962
3963This construction can of course contain arbitrarily complicated
3964subpatterns, and it can be nested.
3965
3966@cindex Perl-style regular expressions, lookbehind subpatterns
3967Non-backtracking subpatterns can be used in conjunction with look-behind
3968assertions to specify efficient matching at the end
3969of the subject string. Consider a simple pattern such as
3970
3971@example
3972 abcd$
3973@end example
3974
3975@noindent
3976when applied to a long string which does not match. Because
3977matching proceeds from left to right, @command{sed} will look for
3978each @samp{a} in the subject and then see if what follows matches
3979the rest of the pattern. If the pattern is specified as
3980
3981@example
3982 ^.*abcd$
3983@end example
3984
3985@noindent
3986the initial @code{.*} matches the entire string at first, but when
3987this fails (because there is no following @samp{a}), it backtracks
3988to match all but the last character, then all but the
3989last two characters, and so on. Once again the search for
3990@samp{a} covers the entire string, from right to left, so we are
3991no better off. However, if the pattern is written as
3992
3993@example
3994 ^(?>.*)(?<=abcd)
3995@end example
3996
@noindent
there can be no backtracking for the @code{.*} item; it can match
3998only the entire string. The subsequent lookbehind assertion
3999does a single test on the last four characters. If it fails,
4000the match fails immediately. For long strings, this approach
4001makes a significant difference to the processing time.
4002
When a pattern contains an unlimited repeat inside a subpattern
that can itself be repeated an unlimited number of
times, the use of a non-backtracking subpattern is the only way to
avoid some failing matches taking a very long time
4007indeed.@footnote{Actually, the matcher embedded in @value{SSED}
4008 tries to do something for this in the simplest cases,
4009 like @code{([^b]*b)*}. These cases are actually quite
4010 common: they happen for example in a regular expression
4011 like @code{\/\*([^*]*\*)*\/} which matches C comments.}
4012
4013The pattern
4014
4015@example
4016 (\D+|<\d+>)*[!?]
4017@end example
4018
@c ([^0-9<]+<(\d+>)?)*[!?]
4020
4021@noindent
matches an unlimited number of substrings that either consist
of non-digits, or digits enclosed in angle brackets, followed by
an exclamation or question mark. When it matches, it runs quickly.
4025However, if it is applied to
4026
4027@example
4028 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
4029@end example
4030
4031@noindent
4032it takes a long time before reporting failure. This is
4033because the string can be divided between the two repeats in
4034a large number of ways, and all have to be tried.@footnote{The
4035example used @code{[!?]} rather than a single character at the end,
4036because both @value{SSED} and Perl have an optimization that allows
4037for fast failure when a single character is used. They
4038remember the last single character that is required for a
4039match, and fail early if it is not present in the string.}
4040
4041If the pattern is changed to
4042
4043@example
4044 ((?>\D+)|<\d+>)*[!?]
4045@end example
4046
4047sequences of non-digits cannot be broken, and failure happens
4048quickly.
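
The following Python sketch illustrates the quick failure. Since
@code{(?>...)} is available in Python's @code{re} module only from
version 3.11, the non-backtracking group is emulated with a lookahead
and a back reference; this is a substitute technique, not the notation
described in this section. The group name @code{run} and the subjects
are assumptions.

@verbatim
```python
import re

# Plain form: \D+ inside (...)* can split a run in many ways.
plain = re.compile(r"(\D+|<\d+>)*[!?]")

# Emulated ((?>\D+)|<\d+>)*[!?]: each run of non-digits is taken whole.
atomic = re.compile(r"(?:(?=(?P<run>\D+))(?P=run)|<\d+>)*[!?]")

# On a subject that matches, both forms give the same answer.
good = "<123>!"
plain_match = plain.match(good).group(0)
atomic_match = atomic.match(good).group(0)

# A short failing subject is rejected by both forms.
short_plain = plain.match("a" * 12)
short_atomic = atomic.match("a" * 12)

# A long failing subject is tried with the emulated form only; the plain
# form is deliberately not run on it.
long_failure = atomic.match("a" * 52)
```
@end verbatim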
4049
4050@node Conditional subpatterns
4051@appendixsec Conditional subpatterns
4052@cindex Perl-style regular expressions, conditional subpatterns
4053
4054It is possible to cause the matching process to obey a subpattern
4055conditionally or to choose between two alternative
4056subpatterns, depending on the result of an assertion, or
4057whether a previous capturing subpattern matched or not. The
4058two possible forms of conditional subpattern are
4059
4060@example
4061 (?(@var{condition})@var{yes-pattern})
4062 (?(@var{condition})@var{yes-pattern}|@var{no-pattern})
4063@end example
4064
4065If the condition is satisfied, the yes-pattern is used; otherwise
4066the no-pattern (if present) is used. If there are more than two
4067alternatives in the subpattern, a compile-time error occurs.
4068
4069There are two kinds of condition. If the text between the
4070parentheses consists of a sequence of digits, the condition
4071is satisfied if the capturing subpattern of that number has
4072previously matched. The number must be greater than zero.
4073Consider the following pattern, which contains non-significant
4074white space to make it more readable (assume the @code{X} modifier)
4075and to divide it into three parts for ease of discussion:
4076
4077@example
4078 ( \( )? [^()]+ (?(1) \) )
4079@end example
4080
4081The first part matches an optional opening parenthesis, and
4082if that character is present, sets it as the first captured
4083substring. The second part matches one or more characters
4084that are not parentheses. The third part is a conditional
4085subpattern that tests whether the first set of parentheses
matched or not. If they did, that is, if the subject started
4087with an opening parenthesis, the condition is true, and so
4088the yes-pattern is executed and a closing parenthesis is
4089required. Otherwise, since no-pattern is not present, the
4090subpattern matches nothing. In other words, this pattern
4091matches a sequence of non-parentheses, optionally enclosed
4092in parentheses.
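
This pattern can be tried in Python, whose @code{re} module supports
the numbered form of the condition. In the sketch below the pattern is
written without the non-significant white space, and the subjects are
invented.

@verbatim
```python
import re

# ( \( )? [^()]+ (?(1) \) ) with the white space removed.
optional_parens = re.compile(r"(\()?[^()]+(?(1)\))")

subjects = ["abc", "(abc)", "(abc", "abc)"]

# fullmatch requires the whole subject to be covered, so a parenthesis
# without its partner is rejected.
accepted = [s for s in subjects if optional_parens.fullmatch(s)]
```
@end verbatim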
4093
4094@cindex Perl-style regular expressions, lookahead subpatterns
4095If the condition is not a sequence of digits, it must be an
4096assertion. This may be a positive or negative lookahead or
4097lookbehind assertion. Consider this pattern, again containing
non-significant white space, and with the two alternatives
on the second and third lines:
4100
4101@example
4102 (?(?=...[a-z])
4103 \d\d-[a-z]@{3@}-\d\d |
4104 \d\d-\d\d-\d\d )
4105@end example
4106
The condition is a positive lookahead assertion that skips any
three characters and then requires a letter, so it tests the fourth
character from the current point.
4109If a letter is found, the subject is matched against the first
4110alternative @samp{@var{dd}-@var{aaa}-@var{dd}} (where @var{aaa} are
4111letters and @var{dd} are digits); otherwise it is matched against
4112the second alternative, @samp{@var{dd}-@var{dd}-@var{dd}}.
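
Python's @code{re} module does not accept an assertion as a condition,
so the sketch below substitutes an equivalent alternation: the
positive lookahead guards the first branch and its negation guards the
second. This rewriting is an assumption of the example, not the
conditional syntax itself, and the subjects are invented.

@verbatim
```python
import re

# (?(?=A) yes | no) rewritten as (?: (?=A) yes | (?!A) no ).
date_like = re.compile(
    r"(?:(?=...[a-z])\d\d-[a-z]{3}-\d\d"
    r"|(?!...[a-z])\d\d-\d\d-\d\d)"
)

subjects = ["12-abc-34", "12-34-56", "12-ab3-34", "12-3a-56"]

# The fourth character selects the branch; the chosen branch must then
# match on its own.
accepted = [s for s in subjects if date_like.fullmatch(s)]
```
@end verbatim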
4113
4114
4115@node Recursive patterns
4116@appendixsec Recursive patterns
4117@cindex Perl-style regular expressions, recursive patterns
4118@cindex Perl-style regular expressions, recursion
4119
4120Consider the problem of matching a string in parentheses,
4121allowing for unlimited nested parentheses. Without the use
4122of recursion, the best that can be done is to use a pattern
4123that matches up to some fixed depth of nesting. It is not
4124possible to handle an arbitrary nesting depth. Perl 5.6 has
4125provided an experimental facility that allows regular
4126expressions to recurse (amongst other things). It does this
4127by interpolating Perl code in the expression at run time,
and the code can refer to the expression itself. A Perl pattern
to solve the parentheses problem can be created like
this:
4131
4132@example
4133 $re = qr@{\( (?: (?>[^()]+) | (?p@{$re@}) )* \)@}x;
4134@end example
4135
4136The @code{(?p@{...@})} item interpolates Perl code at run time,
4137and in this case refers recursively to the pattern in which it
4138appears. Obviously, @command{sed} cannot support the interpolation of
4139Perl code. Instead, the special item @code{(?R)} is provided for
4140the specific case of recursion. This pattern solves the
4141parentheses problem (assume the @code{X} modifier option is used
4142so that white space is ignored):
4143
4144@example
4145 \( ( (?>[^()]+) | (?R) )* \)
4146@end example
4147
4148First it matches an opening parenthesis. Then it matches any
4149number of substrings which can either be a sequence of
4150non-parentheses, or a recursive match of the pattern itself
4151(i.e. a correctly parenthesized substring). Finally there is
4152a closing parenthesis.
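
Python's standard @code{re} module has no @code{(?R)} item, so the
sketch below spells out the same three steps as a small hand-written
recursive function. The name @code{match_parens} is hypothetical, and
the function only illustrates what the recursive pattern accepts; it
is not how the matcher works internally.

@verbatim
```python
def match_parens(text, pos=0):
    """Return the index just past a balanced group at pos, or None."""
    # Step 1: an opening parenthesis is required.
    if pos >= len(text) or text[pos] != "(":
        return None
    pos += 1
    # Step 2: any number of non-parentheses or nested balanced groups.
    while pos < len(text):
        ch = text[pos]
        if ch == ")":
            # Step 3: the closing parenthesis ends this level.
            return pos + 1
        if ch == "(":
            end = match_parens(text, pos)  # the recursive step
            if end is None:
                return None
            pos = end
        else:
            pos += 1
    return None  # ran out of text before the closing parenthesis


balanced_end = match_parens("(ab(cd)ef)")
unbalanced = match_parens("(aaaa()")
```
@end verbatim

For a balanced subject the function returns the position just after
the closing parenthesis; for an unbalanced one it returns
@code{None}.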
4153
4154This particular example pattern contains nested unlimited
4155repeats, and so the use of a non-backtracking subpattern for
4156matching strings of non-parentheses is important when applying
4157the pattern to strings that do not match. For example, when
4158it is applied to
4159
4160@example
4161 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
4162@end example
4163
it yields a ``no match'' response quickly. However, if a
non-backtracking subpattern is not used, the match runs
for a very long time indeed because there are so many different
ways the @code{+} and @code{*} repeats can carve up the subject,
and all have to be tested before failure can be reported.
4169
4170The values set for any capturing subpatterns are those from
4171the outermost level of the recursion at which the subpattern
4172value is set. If the pattern above is matched against
4173
4174@example
4175 (ab(cd)ef)
4176@end example
4177
4178@noindent
4179the value for the capturing parentheses is @samp{ef}, which is
4180the last value taken on at the top level.
4181
4182@node Comments
4183@appendixsec Comments
4184@cindex Perl-style regular expressions, comments
4185
The sequence @code{(?#} marks the start of a comment which continues
up to the next closing parenthesis. Nested parentheses
are not permitted. The characters that make up a comment
play no part in the pattern matching at all.
4190
4191@cindex Perl-style regular expressions, extended
4192If the @code{X} modifier option is used, an unescaped @code{#} character
4193outside a character class introduces a comment that continues
4194up to the next newline character in the pattern.
4195@end ifset
4196
4197
4198@page
4199@node Concept Index
4200@unnumbered Concept Index
4201
4202This is a general index of all issues discussed in this manual, with the
4203exception of the @command{sed} commands and command-line options.
4204
4205@printindex cp
4206
4207@page
4208@node Command and Option Index
4209@unnumbered Command and Option Index
4210
4211This is an alphabetical list of all @command{sed} commands and command-line
4212options.
4213
4214@printindex fn
4215
4216@contents
4217@bye
4218
4219@c XXX FIXME: the term "cycle" is never defined...