Timestamp:
Sep 19, 2024, 2:34:43 AM
Author:
bird
Message:

src/sed: Merged in changes between 4.1.5 and 4.9 from the vendor branch. (svn merge /vendor/sed/4.1.5 /vendor/sed/current .)

Location:
trunk/src/sed
Files:
2 edited

  • trunk/src/sed

  • trunk/src/sed/doc/sed.texi

    r599 r3613  
    11\input texinfo  @c -*-texinfo-*-
    2 @c Do not edit this file!! It is automatically generated from sed-in.texi.
    32@c
    43@c -- Stuff that needs adding: ----------------------------------------------
    5 @c (document the `;' command-separator)
     4@c (nothing!)
    65@c --------------------------------------------------------------------------
    76@c Check for consistency: regexps in @code, text that they match in @samp.
    8 @c 
     7@c
    98@c Tips:
    109@c    @command for command
     
    3635@value{SSED}, a stream editor.
    3736
    38 Copyright @copyright{} 1998, 1999, 2001, 2002, 2003, 2004 Free
    39 Software Foundation, Inc.
    40 
    41 This document is released under the terms of the @acronym{GNU} Free
    42 Documentation License as published by the Free Software Foundation;
    43 either version 1.1, or (at your option) any later version.
    44 
    45 You should have received a copy of the @acronym{GNU} Free Documentation
    46 License along with @value{SSED}; see the file @file{COPYING.DOC}.
    47 If not, write to the Free Software Foundation, 59 Temple Place - Suite
    48 330, Boston, MA 02110-1301, USA.
    49 
    50 There are no Cover Texts and no Invariant Sections; this text, along
    51 with its equivalent in the printed manual, constitutes the Title Page.
     37Copyright @copyright{} 1998--2022 Free Software Foundation, Inc.
     38
     39@quotation
     40Permission is granted to copy, distribute and/or modify this document
     41under the terms of the GNU Free Documentation License, Version 1.3
     42or any later version published by the Free Software Foundation;
     43with no Invariant Sections, no Front-Cover Texts, and no
     44Back-Cover Texts.  A copy of the license is included in the
     45section entitled ``GNU Free Documentation License''.
     46@end quotation
    5247@end copying
    5348
     
    5550
    5651@titlepage
    57 @title @command{sed}, a stream editor
     52@title @value{SSED}, a stream editor
    5853@subtitle version @value{VERSION}, @value{UPDATED}
    59 @author by Ken Pizzini, Paolo Bonzini
     54@author by Ken Pizzini, Paolo Bonzini, Jim Meyering, Assaf Gordon
    6055
    6156@page
    6257@vskip 0pt plus 1filll
    63 Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc.
    64 
    6558@insertcopying
    66 
    67 Published by the Free Software Foundation, @*
    68 51 Franklin Street, Fifth Floor @*
    69 Boston, MA 02110-1301, USA
    7059@end titlepage
    7160
    72 
     61@contents
     62
     63@ifnottex
    7364@node Top
    74 @top
    75 
    76 @ifnottex
     65@top @value{SSED}
     66
    7767@insertcopying
    7868@end ifnottex
     
    8171* Introduction::               Introduction
    8272* Invoking sed::               Invocation
    83 * sed Programs::               @command{sed} programs
     73* sed scripts::                @command{sed} scripts
     74* sed addresses::              Addresses: selecting lines
     75* sed regular expressions::    Regular expressions: selecting text
     76* advanced sed::               Advanced @command{sed}: cycles and buffers
    8477* Examples::                   Some sample scripts
    8578* Limitations::                Limitations and (non-)limitations of @value{SSED}
    8679* Other Resources::            Other resources for learning about @command{sed}
    8780* Reporting Bugs::             Reporting bugs
    88 
    89 * Extended regexps::           @command{egrep}-style regular expressions
    90 @ifset PERL
    91 * Perl regexps::               Perl-style regular expressions
    92 @end ifset
    93 
     81* GNU Free Documentation License:: Copying and sharing this manual
    9482* Concept Index::              A menu with all the topics in this manual.
    9583* Command and Option Index::   A menu with all @command{sed} commands and
    9684                               command-line options.
    97 
    98 @detailmenu
    99 --- The detailed node listing ---
    100 
    101 sed Programs:
    102 * Execution Cycle::                 How @command{sed} works
    103 * Addresses::                       Selecting lines with @command{sed}
    104 * Regular Expressions::             Overview of regular expression syntax
    105 * Common Commands::                 Often used commands
    106 * The "s" Command::                 @command{sed}'s Swiss Army Knife
    107 * Other Commands::                  Less frequently used commands
    108 * Programming Commands::            Commands for @command{sed} gurus
    109 * Extended Commands::               Commands specific of @value{SSED}
    110 * Escapes::                         Specifying special characters
    111 
    112 Examples:
    113 * Centering lines::
    114 * Increment a number::
    115 * Rename files to lower case::
    116 * Print bash environment::
    117 * Reverse chars of lines::
    118 * tac::                             Reverse lines of files
    119 * cat -n::                          Numbering lines
    120 * cat -b::                          Numbering non-blank lines
    121 * wc -c::                           Counting chars
    122 * wc -w::                           Counting words
    123 * wc -l::                           Counting lines
    124 * head::                            Printing the first lines
    125 * tail::                            Printing the last lines
    126 * uniq::                            Make duplicate lines unique
    127 * uniq -d::                         Print duplicated lines of input
    128 * uniq -u::                         Remove all duplicated lines
    129 * cat -s::                          Squeezing blank lines
    130 
    131 @ifset PERL
    132 Perl regexps::                      Perl-style regular expressions
    133 * Backslash::                       Introduces special sequences
    134 * Circumflex/dollar sign/period::   Behave specially with regard to new lines
    135 * Square brackets::                 Are a bit different in strange cases
    136 * Options setting::                 Toggle modifiers in the middle of a regexp
    137 * Non-capturing subpatterns::       Are not counted when backreferencing
    138 * Repetition::                      Allows for non-greedy matching
    139 * Backreferences::                  Allows for more than 10 back references
    140 * Assertions::                      Allows for complex look ahead matches
    141 * Non-backtracking subpatterns::    Often gives more performance
    142 * Conditional subpatterns::         Allows if/then/else branches
    143 * Recursive patterns::              For example to match parentheses
    144 * Comments::                        Because things can get complex...
    145 @end ifset
    146 
    147 @end detailmenu
    14885@end menu
    14986
     
    167104
    168105@node Invoking sed
    169 @chapter Invocation
    170 
     106@chapter Running sed
     107
     108This chapter covers how to run @command{sed}. Details of @command{sed}
     109scripts and individual @command{sed} commands are discussed in the
     110next chapter.
     111
     112@menu
     113* Overview::
     114* Command-Line Options::
     115* Exit status::
     116@end menu
     117
     118
     119@node Overview
     120@section Overview
    171121Normally @command{sed} is invoked like this:
    172122
     
    175125@end example
    176126
     127For example, to change every @samp{hello} to @samp{world}
     128in the file @file{input.txt}:
     129
     130@example
     131sed 's/hello/world/g' input.txt > output.txt
     132@end example
     133
     134Without the @samp{g} (global) modifier, @command{sed} affects
     135only the first instance per line.
     136
     137@cindex stdin
     138@cindex standard input
     139If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-},
     140@command{sed} filters the contents of the standard input. The following
     141commands are equivalent:
     142
     143@example
     144sed 's/hello/world/g' input.txt > output.txt
     145sed 's/hello/world/g' < input.txt > output.txt
     146cat input.txt | sed 's/hello/world/g' - > output.txt
     147@end example
     148
     149@cindex stdout
     150@cindex output
     151@cindex standard output
     152@cindex -i, example
     153@command{sed} writes output to standard output. Use @option{-i} to edit
     154files in-place instead of printing to standard output.
     155See also the @code{W} and @code{s///w} commands for writing output to
     156other files. The following command modifies @file{file.txt} and
     157does not produce any output:
     158
     159@example
     160sed -i 's/hello/world/' file.txt
     161@end example
     162
     163@cindex -n, example
     164@cindex p, example
     165@cindex suppressing output
     166@cindex output, suppressing
     167By default @command{sed} prints all processed input (except input
     168that has been modified/deleted by commands such as @command{d}).
     169Use @option{-n} to suppress output, and the @code{p} command
     170to print specific lines. The following command prints only line 45
     171of the input file:
     172
     173@example
     174sed -n '45p' file.txt
     175@end example
     176
     177
     178
     179@cindex multiple files
     180@cindex -s, example
     181@command{sed} treats multiple input files as one long stream.
     182The following example prints the first line of the first file
     183(@file{one.txt}) and the last line of the last file (@file{three.txt}).
     184Use @option{-s} to process each file separately instead.
     185
     186@example
     187sed -n  '1p ; $p' one.txt two.txt three.txt
     188@end example
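
A minimal sketch of the difference that @option{-s} makes, assuming GNU sed and three hypothetical three-line sample files created in a scratch directory (the file contents and variable names are illustrative only and are not part of the manual):

```shell
# Build three small sample files (hypothetical contents) in a scratch directory.
tmpdir=$(mktemp -d)
printf 'a1\na2\na3\n' > "$tmpdir/one.txt"
printf 'b1\nb2\nb3\n' > "$tmpdir/two.txt"
printf 'c1\nc2\nc3\n' > "$tmpdir/three.txt"

# Default: the files form one long stream, so '1' and '$' each match once.
joined=$(sed -n '1p ; $p' "$tmpdir/one.txt" "$tmpdir/two.txt" "$tmpdir/three.txt")

# With -s (a GNU extension): '1' and '$' are evaluated per file.
separate=$(sed -s -n '1p ; $p' "$tmpdir/one.txt" "$tmpdir/two.txt" "$tmpdir/three.txt")

printf 'one stream:\n%s\n' "$joined"
printf 'separate files:\n%s\n' "$separate"

rm -rf "$tmpdir"
```

Under these assumptions the first command should print only the first line of one.txt and the last line of three.txt, while the second should print the first and last line of every file.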
     189
     190
     191@cindex -e, example
     192@cindex --expression, example
     193@cindex -f, example
     194@cindex --file, example
     195@cindex script parameter
     196@cindex parameters, script
     197Without @option{-e} or @option{-f} options, @command{sed} uses
     198the first non-option parameter as the @var{script}, and the following
     199non-option parameters as input files.
     200If @option{-e} or @option{-f} options are used to specify a @var{script},
     201all non-option parameters are taken as input files.
     202Options @option{-e} and @option{-f} can be combined, and can appear
     203multiple times (in which case the final effective @var{script} will be
     204the concatenation of all the individual @var{script}s).
     205
     206The following examples are equivalent:
     207
     208@example
     209sed 's/hello/world/' input.txt > output.txt
     210
     211sed -e 's/hello/world/' input.txt > output.txt
     212sed --expression='s/hello/world/' input.txt > output.txt
     213
     214echo 's/hello/world/' > myscript.sed
     215sed -f myscript.sed input.txt > output.txt
     216sed --file=myscript.sed input.txt > output.txt
     217@end example
     218
     219
     220@node Command-Line Options
     221@section Command-Line Options
     222
    177223The full format for invoking @command{sed} is:
    178224
     
    180226sed OPTIONS... [SCRIPT] [INPUTFILE...]
    181227@end example
    182 
    183 If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-},
    184 @command{sed} filters the contents of the standard input.  The @var{script}
    185 is actually the first non-option parameter, which @command{sed} specially
    186 considers a script and not an input file if (and only if) none of the
    187 other @var{options} specifies a script to be executed, that is if neither
    188 of the @option{-e} and @option{-f} options is specified.
    189228
    190229@command{sed} may be invoked with the following command-line options:
     
    212251@cindex Disabling autoprint, from command line
    213252By default, @command{sed} prints out the pattern space
    214 at the end of each cycle through the script.
     253at the end of each cycle through the script (@pxref{Execution Cycle, ,
     254How @code{sed} works}).
    215255These options disable this automatic printing,
    216256and @command{sed} only produces output when explicitly told to
    217257via the @code{p} command.
     258
     259@item --debug
     260@opindex --debug
     261@cindex @value{SSEDEXT}, debug
     262Print the input sed program in canonical form,
     263and annotate program execution.
     264@codequotebacktick on
     265@codequoteundirected on
     266@example
     267$ echo 1 | sed '\%1%s21232'
     2683
     269
     270$ echo 1 | sed --debug '\%1%s21232'
     271SED PROGRAM:
     272  /1/ s/1/3/
     273INPUT:   'STDIN' line 1
     274PATTERN: 1
     275COMMAND: /1/ s/1/3/
     276PATTERN: 3
     277END-OF-CYCLE:
     2783
     279@end example
     280@codequotebacktick off
     281@codequoteundirected off
     282
     283
     284@item -e @var{script}
     285@itemx --expression=@var{script}
     286@opindex -e
     287@opindex --expression
     288@cindex Script, from command line
     289Add the commands in @var{script} to the set of commands to be
     290run while processing the input.
     291
     292@item -f @var{script-file}
     293@itemx --file=@var{script-file}
     294@opindex -f
     295@opindex --file
     296@cindex Script, from a file
     297Add the commands contained in the file @var{script-file}
     298to the set of commands to be run while processing the input.
    218299
    219300@item -i[@var{SUFFIX}]
     
    240321before renaming the temporary file, thereby making a backup
    241322copy@footnote{Note that @value{SSED} creates the backup
    242     file whether or not any output is actually changed.}).
     323file whether or not any output is actually changed.}).
    243324
    244325@cindex In-place editing, Perl-style backup file names
     
    255336overwritten without making a backup.
    256337
     338Because @option{-i} takes an optional argument, it should
     339not be followed by other short options:
     340@table @code
     341@item sed -Ei '...' FILE
     342Same as @option{-E -i} with no backup suffix - @file{FILE} will be
     343edited in-place without creating a backup.
     344
     345@item sed -iE '...' FILE
     346This is equivalent to @option{--in-place=E}, creating @file{FILEE} as a backup
     347of @file{FILE}.
     348@end table
     349
     350Be cautious of using @option{-n} with @option{-i}: the former disables
     351automatic printing of lines and the latter changes the file in-place
     352without a backup. If they are combined carelessly (without an explicit @code{p} command),
     353the output file will be empty:
     354@codequotebacktick on
     355@codequoteundirected on
     356@example
     357# WRONG USAGE: 'FILE' will be truncated.
     358sed -ni 's/foo/bar/' FILE
     359@end example
     360@codequotebacktick off
     361@codequoteundirected off
     362
    257363@item -l @var{N}
    258364@itemx --line-length=@var{N}
     
    265371
    266372@item --posix
     373@opindex --posix
    267374@cindex @value{SSEDEXT}, disabling
    268 @value{SSED} includes several extensions to @acronym{POSIX}
     375@value{SSED} includes several extensions to POSIX
    269376sed.  In order to simplify writing portable scripts, this
    270377option disables all the extensions that this manual documents,
     
    272379@cindex @code{POSIXLY_CORRECT} behavior, enabling
    273380Most of the extensions accept @command{sed} programs that
    274 are outside the syntax mandated by @acronym{POSIX}, but some
     381are outside the syntax mandated by POSIX, but some
    275382of them (such as the behavior of the @command{N} command
    276 described in @pxref{Reporting Bugs}) actually violate the
     383described in @ref{Reporting Bugs}) actually violate the
    277384standard.  If you want to disable only the latter kind of
    278385extension, you can set the @code{POSIXLY_CORRECT} variable
    279386to a non-empty value.
    280387
    281 @item -r
     388@item -b
     389@itemx --binary
     390@opindex -b
     391@opindex --binary
     392This option is available on every platform, but is only effective where the
     393operating system makes a distinction between text files and binary files.
     394When such a distinction is made---as is the case for MS-DOS, Windows,
     395Cygwin---text files are composed of lines separated by a carriage return
     396@emph{and} a line feed character, and @command{sed} does not see the
     397ending CR.  When this option is specified, @command{sed} will open
     398input files in binary mode, thus not requesting this special processing
     399and considering lines to end at a line feed.
     400
     401@item --follow-symlinks
     402@opindex --follow-symlinks
     403This option is available only on platforms that support
     404symbolic links and has an effect only if option @option{-i}
     405is specified.  In this case, if the file that is specified
     406on the command line is a symbolic link, @command{sed} will
     407follow the link and edit the ultimate destination of the
     408link.  The default behavior is to break the symbolic link,
     409so that the link destination will not be modified.
     410
     411@item -E
     412@itemx -r
    282413@itemx --regexp-extended
     414@opindex -E
    283415@opindex -r
    284416@opindex --regexp-extended
    285417@cindex Extended regular expressions, choosing
    286 @cindex @acronym{GNU} extensions, extended regular expressions
     418@cindex GNU extensions, extended regular expressions
    287419Use extended regular expressions rather than basic
    288420regular expressions.  Extended regexps are those that
    289421@command{egrep} accepts; they can be clearer because they
    290 usually have less backslashes, but are a @acronym{GNU} extension
    291 and hence scripts that use them are not portable.
    292 @xref{Extended regexps, , Extended regular expressions}.
    293 
    294 @ifset PERL
    295 @item -R
    296 @itemx --regexp-perl
    297 @opindex -R
    298 @opindex --regexp-perl
    299 @cindex Perl-style regular expressions, choosing
    300 @cindex @value{SSEDEXT}, Perl-style regular expressions
    301 Use Perl-style regular expressions rather than basic
    302 regular expressions.  Perl-style regexps are extremely
    303 powerful but are a @value{SSED} extension and hence scripts that
    304 use it are not portable.  @xref{Perl regexps, ,
    305 Perl-style regular expressions}.
    306 @end ifset
     422usually have fewer backslashes.
     423Historically this was a GNU extension,
     424but the @option{-E}
     425extension has since been added to the POSIX standard
     426(http://austingroupbugs.net/view.php?id=528),
     427so use @option{-E} for portability.
     428GNU sed has accepted @option{-E} as an undocumented option for years,
     429and *BSD seds have accepted @option{-E} for years as well,
     430but scripts that use @option{-E} might not port to other older systems.
     431@xref{ERE syntax, , Extended regular expressions}.
     432
    307433
    308434@item -s
    309435@itemx --separate
     436@opindex -s
     437@opindex --separate
    310438@cindex Working on separate files
    311439By default, @command{sed} will consider the files specified on the
     
    318446start of each file.
    319447
     448@item --sandbox
     449@opindex --sandbox
     450@cindex Sandbox mode
     451In sandbox mode, @code{e/w/r} commands are rejected - programs containing
     452them will be aborted without being run. Sandbox mode ensures @command{sed}
     453operates only on the input files designated on the command line, and
     454cannot run external programs.
     455
     456
    320457@item -u
    321458@itemx --unbuffered
     
    328465output as soon as possible.)
    329466
    330 @item -e @var{script}
    331 @itemx --expression=@var{script}
    332 @opindex -e
    333 @opindex --expression
    334 @cindex Script, from command line
    335 Add the commands in @var{script} to the set of commands to be
    336 run while processing the input.
    337 
    338 @item -f @var{script-file}
    339 @itemx --file=@var{script-file}
    340 @opindex -f
    341 @opindex --file
    342 @cindex Script, from a file
    343 Add the commands contained in the file @var{script-file}
    344 to the set of commands to be run while processing the input.
    345 
     467@item -z
     468@itemx --null-data
     469@itemx --zero-terminated
     470@opindex -z
     471@opindex --null-data
     472@opindex --zero-terminated
     473Treat the input as a set of lines, each terminated by a zero byte
     474(the ASCII @samp{NUL} character) instead of a newline.  This option can
     475be used with commands like @samp{sort -z} and @samp{find -print0}
     476to process arbitrary file names.
    346477@end table
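
As a rough illustration of the @option{-z} description above, the following sketch pairs it with @samp{find -print0} and @samp{sort -z}. It assumes GNU sed, find and sort; the scratch directory and file names are hypothetical and chosen only to include a space in one name:

```shell
# Create two sample files, one with a space in its name (hypothetical names).
tmpdir=$(mktemp -d)
touch "$tmpdir/plain.txt" "$tmpdir/with space.txt"

# find emits NUL-terminated paths; sed -z treats each one as a "line" and
# strips everything up to the last slash; sort -z keeps the NUL terminators.
# tr converts the NULs to newlines only at the very end, for display.
names=$(find "$tmpdir" -type f -print0 \
          | sed -z 's|.*/||' \
          | sort -z \
          | tr '\0' '\n')

printf '%s\n' "$names"

rm -rf "$tmpdir"
```

Keeping the data NUL-terminated until the final step is what lets the pipeline handle names containing spaces (or even newlines) without splitting them.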
    347478
     
    359490The standard input will be processed if no file names are specified.
    360491
    361 
    362 @node sed Programs
    363 @chapter @command{sed} Programs
    364 
    365 @cindex @command{sed} program structure
     492@node Exit status
     493@section Exit status
     494@cindex exit status
     495An exit status of zero indicates success, and a nonzero value
     496indicates failure. @value{SSED} returns the following exit status
     497error values:
     498
     499@table @asis
     500@item 0
     501Successful completion.
     502
     503@item 1
     504Invalid command, invalid syntax, invalid regular expression or a
     505@value{SSED} extension command used with @option{--posix}.
     506
     507@item 2
     508One or more of the input files specified on the command line could not be
     509opened (e.g. if a file is not found, or read permission is denied).
     510Processing continued with other files.
     511
     512@item 4
     513An I/O error or a serious processing error occurred at run time;
     514@value{SSED} aborted immediately.
     515@end table
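
The exit status values in the table above can be checked with a short sketch like the one below. It assumes GNU sed; the command letter @samp{k} is used only because it is not a valid sed command, and the input path is a hypothetical file that is assumed not to exist:

```shell
# Status 0: a valid script over readable input.
status_ok=0
printf 'x\n' | sed -n 'p' >/dev/null || status_ok=$?

# Status 1: 'k' is not a sed command, so the script is rejected.
status_bad_script=0
printf 'x\n' | sed 'k' >/dev/null 2>&1 || status_bad_script=$?

# Status 2: the input file cannot be opened (path assumed not to exist).
status_missing_file=0
sed -n 'p' /nonexistent/input.txt >/dev/null 2>&1 || status_missing_file=$?

echo "ok=$status_ok bad_script=$status_bad_script missing_file=$status_missing_file"
```

Capturing the status with `|| var=$?` rather than reading `$?` on the next line keeps the sketch usable in scripts that run under `set -e`.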
     516
     517@cindex Q, example
     518@cindex exit status, example
     519Additionally, the commands @code{q} and @code{Q} can be used to terminate
     520@command{sed} with a custom exit code value (this is a @value{SSED} extension):
     521
     522@example
     523$ echo | sed 'Q42' ; echo $?
     52442
     525@end example
     526
     527
     528@node sed scripts
     529@chapter @command{sed} scripts
     530
     531
     532@menu
     533* sed script overview::      @command{sed} script overview
     534* sed commands list::        @command{sed} commands summary
     535* The "s" Command::          @command{sed}'s Swiss Army Knife
     536* Common Commands::          Often used commands
     537* Other Commands::           Less frequently used commands
     538* Programming Commands::     Commands for @command{sed} gurus
     539* Extended Commands::        Commands specific to @value{SSED}
     540* Multiple commands syntax:: Extension for easier scripting
     541@end menu
     542
     543@node sed script overview
     544@section @command{sed} script overview
     545
     546@cindex @command{sed} script structure
    366547@cindex Script structure
     548
    367549A @command{sed} program consists of one or more @command{sed} commands,
    368550passed in by one or more of the
     
    371553options are used.
    372554This document will refer to ``the'' @command{sed} script;
    373 this is understood to mean the in-order catenation
     555this is understood to mean the in-order concatenation
    374556of all of the @var{script}s and @var{script-file}s passed in.
    375 
    376 Each @code{sed} command consists of an optional address or
    377 address range, followed by a one-character command name
    378 and any additional command-specific code.
    379 
    380 @menu
    381 * Execution Cycle::          How @command{sed} works
    382 * Addresses::                Selecting lines with @command{sed}
    383 * Regular Expressions::      Overview of regular expression syntax
    384 * Common Commands::          Often used commands
    385 * The "s" Command::          @command{sed}'s Swiss Army Knife
    386 * Other Commands::           Less frequently used commands
    387 * Programming Commands::     Commands for @command{sed} gurus
    388 * Extended Commands::        Commands specific of @value{SSED}
    389 * Escapes::                  Specifying special characters
    390 @end menu
    391 
    392 
    393 @node Execution Cycle
    394 @section How @command{sed} Works
    395 
    396 @cindex Buffer spaces, pattern and hold
    397 @cindex Spaces, pattern and hold
    398 @cindex Pattern space, definition
    399 @cindex Hold space, definition
    400 @command{sed} maintains two data buffers: the active @emph{pattern} space,
    401 and the auxiliary @emph{hold} space. Both are initially empty.
    402 
    403 @command{sed} operates by performing the following cycle on each
    404 lines of input: first, @command{sed} reads one line from the input
    405 stream, removes any trailing newline, and places it in the pattern space.
    406 Then commands are executed; each command can have an address associated
    407 to it: addresses are a kind of condition code, and a command is only
    408 executed if the condition is verified before the command is to be
    409 executed.
    410 
    411 When the end of the script is reached, unless the @option{-n} option
    412 is in use, the contents of pattern space are printed out to the output
    413 stream, adding back the trailing newline if it was removed.@footnote{Actually,
    414   if @command{sed} prints a line without the terminating newline, it will
    415   nevertheless print the missing newline as soon as more text is sent to
    416   the same output stream, which gives the ``least expected surprise''
    417   even though it does not make commands like @samp{sed -n p} exactly
    418   identical to @command{cat}.} Then the next cycle starts for the next
    419 input line.
    420 
    421 Unless special commands (like @samp{D}) are used, the pattern space is
    422 deleted between two cycles. The hold space, on the other hand, keeps
    423 its data between cycles (see commands @samp{h}, @samp{H}, @samp{x},
    424 @samp{g}, @samp{G} to move data between both buffers).
    425 
    426 
    427 @node Addresses
    428 @section Selecting lines with @command{sed}
    429 @cindex Addresses, in @command{sed} scripts
    430 @cindex Line selection
    431 @cindex Selecting lines to process
    432 
    433 Addresses in a @command{sed} script can be in any of the following forms:
     557@xref{Overview}.
     558
     559
     560@cindex @command{sed} commands syntax
     561@cindex syntax, @command{sed} commands
     562@cindex addresses, syntax
     563@cindex syntax, addresses
     564@command{sed} commands follow this syntax:
     565
     566@example
     567[addr]@var{X}[options]
     568@end example
     569
     570@var{X} is a single-letter @command{sed} command.
     571@c TODO: add @pxref{commands} when there is a command-list section.
     572@code{[addr]} is an optional line address. If @code{[addr]} is specified,
     573the command @var{X} will be executed only on the matched lines.
     574@code{[addr]} can be a single line number, a regular expression,
     575or a range of lines (@pxref{sed addresses}).
     576Additional @code{[options]} are used for some @command{sed} commands.
     577
     578@cindex @command{d}, example
     579@cindex address range, example
     580@cindex example, address range
     581The following example deletes lines 30 to 35 in the input.
     582@code{30,35} is an address range. @command{d} is the delete command:
     583
     584@example
     585sed '30,35d' input.txt > output.txt
     586@end example
     587
     588@cindex @command{q}, example
     589@cindex regular expression, example
     590@cindex example, regular expression
     591The following example prints all input until a line
     592starting with the string @samp{foo} is found. If such a line is found,
     593@command{sed} will terminate with exit status 42.
     594If no such line is found (and no other error occurred), @command{sed}
     595will exit with status 0.
     596@code{/^foo/} is a regular-expression address.
     597@command{q} is the quit command. @code{42} is the command option.
     598
     599@example
     600sed '/^foo/q42' input.txt > output.txt
     601@end example
     602
     603
     604@cindex multiple @command{sed} commands
     605@cindex @command{sed} commands, multiple
     606@cindex newline, command separator
     607@cindex semicolons, command separator
     608@cindex ;, command separator
     609@cindex -e, example
     610@cindex -f, example
     611Commands within a @var{script} or @var{script-file} can be
     612separated by semicolons (@code{;}) or newlines (ASCII 10).
     613Multiple scripts can be specified with @option{-e} or @option{-f}
     614options.
     615
     616The following examples are all equivalent. They perform two @command{sed}
     617operations: deleting any lines matching the regular expression @code{/^foo/},
     618and replacing all occurrences of the string @samp{hello} with @samp{world}:
     619
     620@example
     621sed '/^foo/d ; s/hello/world/g' input.txt > output.txt
     622
     623sed -e '/^foo/d' -e 's/hello/world/g' input.txt > output.txt
     624
     625echo '/^foo/d' > script.sed
     626echo 's/hello/world/g' >> script.sed
     627sed -f script.sed input.txt > output.txt
     628
     629echo 's/hello/world/g' > script2.sed
     630sed -e '/^foo/d' -f script2.sed input.txt > output.txt
     631@end example
     632
     633
     634@cindex @command{a}, and semicolons
     635@cindex @command{c}, and semicolons
     636@cindex @command{i}, and semicolons
     637Commands @command{a}, @command{c}, @command{i}, due to their syntax,
     638cannot be followed by semicolons working as command separators and
     639thus should be terminated
     640with newlines or be placed at the end of a @var{script} or @var{script-file}.
     641Commands can also be preceded with optional non-significant
     642whitespace characters.
     643@xref{Multiple commands syntax}.
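
The restriction on @command{a}, @command{c} and @command{i} can be seen in a small sketch such as the following. It assumes GNU sed, whose one-line form @samp{a text} is an extension to POSIX; the input and variable names are illustrative only:

```shell
# Intended: append "hello" after line 1, then delete line 2.
# Ending the 'a' command by starting a new -e script works as expected.
result=$(printf '1\n2\n' | sed -e '1a hello' -e '2d')

# A semicolon does NOT end the 'a' command: everything up to the newline,
# including "; 2d", becomes the appended text, and line 2 is not deleted.
swallowed=$(printf '1\n2\n' | sed '1a hello ; 2d')

printf 'separate -e scripts:\n%s\n' "$result"
printf 'semicolon after a:\n%s\n' "$swallowed"
```

Splitting the script across several @option{-e} options (or across lines in a script file) is the simplest way to combine these commands with others.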
     644
     645
     646
     647@node sed commands list
     648@section @command{sed} commands summary
     649
     650The following commands are supported in @value{SSED}.
     651Some are standard POSIX commands, while others are @value{SSEDEXT}.
     652Details and examples for each command are in the following sections.
     653(Mnemonics) are shown in parentheses.
     654
    434655@table @code
    435 @item @var{number}
    436 @cindex Address, numeric
    437 @cindex Line, selecting by number
    438 Specifying a line number will match only that line in the input.
    439 (Note that @command{sed} counts lines continuously across all input files
    440 unless @option{-i} or @option{-s} options are specified.)
    441 
    442 @item @var{first}~@var{step}
    443 @cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses
    444 This @acronym{GNU} extension matches every @var{step}th line
    445 starting with line @var{first}.
    446 In particular, lines will be selected when there exists
    447 a non-negative @var{n} such that the current line-number equals
    448 @var{first} + (@var{n} * @var{step}).
    449 Thus, to select the odd-numbered lines,
    450 one would use @code{1~2};
    451 to pick every third line starting with the second, @samp{2~3} would be used;
    452 to pick every fifth line starting with the tenth, use @samp{10~5};
    453 and @samp{50~0} is just an obscure way of saying @code{50}.
    454 
    455 @item $
    456 @cindex Address, last line
    457 @cindex Last line, selecting
    458 @cindex Line, selecting last
    459 This address matches the last line of the last file of input, or
    460 the last line of each file when the @option{-i} or @option{-s} options
    461 are specified.
    462 
    463 @item /@var{regexp}/
    464 @cindex Address, as a regular expression
    465 @cindex Line, selecting by regular expression match
    466 This will select any line which matches the regular expression @var{regexp}.
    467 If @var{regexp} itself includes any @code{/} characters,
    468 each must be escaped by a backslash (@code{\}).
    469 
    470 @cindex empty regular expression
    471 @cindex @value{SSEDEXT}, modifiers and the empty regular expression
    472 The empty regular expression @samp{//} repeats the last regular
    473 expression match (the same holds if the empty regular expression is
    474 passed to the @code{s} command).  Note that modifiers to regular expressions
    475 are evaluated when the regular expression is compiled, thus it is invalid to
    476 specify them together with the empty regular expression.
    477 
    478 @item \%@var{regexp}%
    479 (The @code{%} may be replaced by any other single character.)
    480 
    481 @cindex Slash character, in regular expressions
    482 This also matches the regular expression @var{regexp},
    483 but allows one to use a different delimiter than @code{/}.
    484 This is particularly useful if the @var{regexp} itself contains
    485 a lot of slashes, since it avoids the tedious escaping of every @code{/}.
    486 If @var{regexp} itself includes any delimiter characters,
    487 each must be escaped by a backslash (@code{\}).
    488 
    489 @item /@var{regexp}/I
    490 @itemx \%@var{regexp}%I
    491 @cindex @acronym{GNU} extensions, @code{I} modifier
    492 @ifset PERL
    493 @cindex Perl-style regular expressions, case-insensitive
    494 @end ifset
    495 The @code{I} modifier to regular-expression matching is a @acronym{GNU}
    496 extension which causes the @var{regexp} to be matched in
    497 a case-insensitive manner.
    498 
    499 @item /@var{regexp}/M
    500 @itemx \%@var{regexp}%M
    501 @ifset PERL
    502 @cindex @value{SSEDEXT}, @code{M} modifier
    503 @end ifset
    504 @cindex Perl-style regular expressions, multiline
    505 The @code{M} modifier to regular-expression matching is a @value{SSED}
    506 extension which causes @code{^} and @code{$} to match respectively
    507 (in addition to the normal behavior) the empty string after a newline,
    508 and the empty string before a newline.  There are special character
    509 sequences
    510 @ifset PERL
    511 (@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
    512 in basic or extended regular expression modes)
    513 @end ifset
    514 @ifclear PERL
    515 (@code{\`} and @code{\'})
    516 @end ifclear
    517 which always match the beginning or the end of the buffer.
    518 @code{M} stands for @cite{multi-line}.
    519 
    520 @ifset PERL
    521 @item /@var{regexp}/S
    522 @itemx \%@var{regexp}%S
    523 @cindex @value{SSEDEXT}, @code{S} modifier
    524 @cindex Perl-style regular expressions, single line
    525 The @code{S} modifier to regular-expression matching is only valid
    526 in Perl mode and specifies that the dot character (@code{.}) will
    527 match the newline character too.  @code{S} stands for @cite{single-line}.
    528 @end ifset
    529 
    530 @ifset PERL
    531 @item /@var{regexp}/X
    532 @itemx \%@var{regexp}%X
    533 @cindex @value{SSEDEXT}, @code{X} modifier
    534 @cindex Perl-style regular expressions, extended
    535 The @code{X} modifier to regular-expression matching is also
    536 valid in Perl mode only.  If it is used, whitespace in the
    537 pattern (other than in a character class) and
    538 characters between a @kbd{#} outside a character class and the
    539 next newline character are ignored. An escaping backslash
    540 can be used to include a whitespace or @kbd{#} character as part
    541 of the pattern.
    542 @end ifset
    543 @end table
    544 
    545 If no addresses are given, then all lines are matched;
    546 if one address is given, then only lines matching that
    547 address are matched.
    548 
    549 @cindex Range of lines
    550 @cindex Several lines, selecting
    551 An address range can be specified by specifying two addresses
    552 separated by a comma (@code{,}).  An address range matches lines
    553 starting from where the first address matches, and continues
    554 until the second address matches (inclusively).
    555 
    556 If the second address is a @var{regexp}, then checking for the
    557 ending match will start with the line @emph{following} the
    558 line which matched the first address: a range will always
    559 span at least two lines (except of course if the input stream
    560 ends).
    561 
    562 If the second address is a @var{number} less than (or equal to)
    563 the line matching the first address, then only the one line is
    564 matched.
    565 
    566 @cindex Special addressing forms
    567 @cindex Range with start address of zero
    568 @cindex Zero, as range start address
    569 @cindex @var{addr1},+N
    570 @cindex @var{addr1},~N
    571 @cindex @acronym{GNU} extensions, special two-address forms
    572 @cindex @acronym{GNU} extensions, @code{0} address
    573 @cindex @acronym{GNU} extensions, 0,@var{addr2} addressing
    574 @cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing
    575 @cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing
    576 @value{SSED} also supports some special two-address forms; all these
    577 are @acronym{GNU} extensions:
    578 @table @code
    579 @item 0,/@var{regexp}/
    580 A line number of @code{0} can be used in an address specification like
    581 @code{0,/@var{regexp}/} so that @command{sed} will try to match
    582 @var{regexp} in the first input line too.  In other words,
    583 @code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/},
    584 except that if @var{addr2} matches the very first line of input the
    585 @code{0,/@var{regexp}/} form will consider it to end the range, whereas
    586 the @code{1,/@var{regexp}/} form will match the beginning of its range and
    587 hence make the range span up to the @emph{second} occurrence of the
    588 regular expression.
    589 
    590 Note that this is the only place where the @code{0} address makes
    591 sense; there is no 0-th line and commands which are given the @code{0}
    592 address in any other way will give an error.
    593 
    594 @item @var{addr1},+@var{N}
    595 Matches @var{addr1} and the @var{N} lines following @var{addr1}.
    596 
    597 @item @var{addr1},~@var{N}
    598 Matches @var{addr1} and the lines following @var{addr1}
    599 until the next line whose input line number is a multiple of @var{N}.
    600 @end table
    601 
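The address forms described above can be tried with a few one-liners. This is a sketch that assumes GNU @command{sed}, since @code{@var{first}~@var{step}}, @code{0,/@var{regexp}/}, @code{@var{addr1},+@var{N}} and @code{@var{addr1},~@var{N}} are all GNU extensions; the variable names are only illustrative.

```shell
# Assumes GNU sed; all of these address forms are GNU extensions.

# first~step: every 2nd line starting with line 1 (the odd lines).
odd=$(printf '%s\n' 1 2 3 4 5 6 | sed -n '1~2p')

# addr1,+N: line 2 and the 2 lines that follow it.
plus=$(printf '%s\n' 1 2 3 4 5 6 | sed -n '2,+2p')

# addr1,~N: from line 2 up to the next line number that is a multiple of 4.
mult=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | sed -n '2,~4p')

# 0,/x/ may end on the first line; 1,/x/ starts looking on line 2.
zero=$(printf 'x\na\nx\nb\n' | sed '0,/x/d')
one=$(printf 'x\na\nx\nb\n' | sed '1,/x/d')

printf '%s\n' "$odd" "$plus" "$mult" "$zero" "$one"
```

With this input, @code{0,/x/d} deletes only the first line, whereas @code{1,/x/d} deletes through the second @samp{x} and leaves just @samp{b}.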
    602 @cindex Excluding lines
    603 @cindex Selecting non-matching lines
    604 Appending the @code{!} character to the end of an address
    605 specification negates the sense of the match.
    606 That is, if the @code{!} character follows an address range,
    607 then only lines which do @emph{not} match the address range
    608 will be selected.
    609 This also works for singleton addresses,
    610 and, perhaps perversely, for the null address.
    611 
    612 
    613 @node Regular Expressions
    614 @section Overview of Regular Expression Syntax
    615 
    616 To know how to use @command{sed}, people should understand regular
    617 expressions (@dfn{regexp} for short).  A regular expression
    618 is a pattern that is matched against a
    619 subject string from left to right.  Most characters are
    620 @dfn{ordinary}: they stand for
    621 themselves in a pattern, and match the corresponding characters
    622 in the subject.  As a trivial example, the pattern
    623 
    624 @example
    625      The quick brown fox
    626 @end example
    627 
    628 @noindent
    629 matches a portion of a subject string that is identical to
    630 itself.  The power of regular expressions comes from the
    631 ability to include alternatives and repetitions in the pattern.
    632 These are encoded in the pattern by the use of @dfn{special characters},
    633 which do not stand for themselves but instead
    634 are interpreted in some special way.  Here is a brief description
    635 of regular expression syntax as used in @command{sed}.
    636 
    637 @table @code
    638 @item @var{char}
    639 A single ordinary character matches itself.
    640 
    641 @item *
    642 @cindex @acronym{GNU} extensions, to basic regular expressions
    643 Matches a sequence of zero or more instances of matches for the
    644 preceding regular expression, which must be an ordinary character, a
    645 special character preceded by @code{\}, a @code{.}, a grouped regexp
    646 (see below), or a bracket expression.  As a @acronym{GNU} extension, a
    647 postfixed regular expression can also be followed by @code{*}; for
    648 example, @code{a**} is equivalent to @code{a*}.  @acronym{POSIX}
    649 1003.1-2001 says that @code{*} stands for itself when it appears at
    650 the start of a regular expression or subexpression, but many
    651 non@acronym{GNU} implementations do not support this and portable
    652 scripts should instead use @code{\*} in these contexts.
    653 
    654 @item \+
    655 @cindex @acronym{GNU} extensions, to basic regular expressions
    656 As @code{*}, but matches one or more.  It is a @acronym{GNU} extension.
    657 
    658 @item \?
    659 @cindex @acronym{GNU} extensions, to basic regular expressions
    660 As @code{*}, but only matches zero or one.  It is a @acronym{GNU} extension.
    661 
    662 @item \@{@var{i}\@}
    663 As @code{*}, but matches exactly @var{i} sequences (@var{i} is a
    664 decimal integer; for portability, keep it between 0 and 255
    665 inclusive).
    666 
    667 @item \@{@var{i},@var{j}\@}
     668 Matches between @var{i} and @var{j} sequences, inclusive.
    669 
    670 @item \@{@var{i},\@}
     671 Matches @var{i} or more sequences.
    672 
    673 @item \(@var{regexp}\)
     674 Groups the inner @var{regexp} as a whole; this is used to:
    675 
    676 @itemize @bullet
    677 @item
    678 @cindex @acronym{GNU} extensions, to basic regular expressions
    679 Apply postfix operators, like @code{\(abcd\)*}:
    680 this will search for zero or more whole sequences
    681 of @samp{abcd}, while @code{abcd*} would search
    682 for @samp{abc} followed by zero or more occurrences
    683 of @samp{d}.  Note that support for @code{\(abcd\)*} is
    684 required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU}
    685 implementations do not support it and hence it is not universally
    686 portable.         
    687 
    688 @item
    689 Use back references (see below).
    690 @end itemize
    691 
    692 @item .
    693 Matches any character, including newline.
    694 
    695 @item ^
    696 Matches the null string at beginning of line, i.e. what
    697 appears after the circumflex must appear at the
    698 beginning of line. @code{^#include} will match only
    699 lines where @samp{#include} is the first thing on line---if
    700 there are spaces before, for example, the match fails.
    701 @code{^} acts as a special character only at the beginning
    702 of the regular expression or subexpression (that is,
    703 after @code{\(} or @code{\|}).  Portable scripts should avoid
    704 @code{^} at the beginning of a subexpression, though, as
    705 @acronym{POSIX} allows implementations that treat @code{^} as
    706 an ordinary character in that context.
    707 
    708 
    709 @item $
    710 It is the same as @code{^}, but refers to end of line.
    711 @code{$} also acts as a special character only at the end
    712 of the regular expression or subexpression (that is, before @code{\)}
    713 or @code{\|}), and its use at the end of a subexpression is not
    714 portable.
    715 
    716 
    717 @item [@var{list}]
    718 @itemx [^@var{list}]
    719 Matches any single character in @var{list}: for example,
    720 @code{[aeiou]} matches all vowels.  A list may include
    721 sequences like @code{@var{char1}-@var{char2}}, which
    722 matches any character between (inclusive) @var{char1}
    723 and @var{char2}.
    724 
    725 A leading @code{^} reverses the meaning of @var{list}, so that
    726 it matches any single character @emph{not} in @var{list}.  To include
    727 @code{]} in the list, make it the first character (after
     728 the @code{^} if needed); to include @code{-} in the list,
     729 make it the first or last; to include @code{^}, put
    730 it after the first character.
    731 
    732 @cindex @code{POSIXLY_CORRECT} behavior, bracket expressions
    733 The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\}
    734 are normally not special within @var{list}.  For example, @code{[\*]}
    735 matches either @samp{\} or @samp{*}, because the @code{\} is not
    736 special here.  However, strings like @code{[.ch.]}, @code{[=a=]}, and
    737 @code{[:space:]} are special within @var{list} and represent collating
    738 symbols, equivalence classes, and character classes, respectively, and
    739 @code{[} is therefore special within @var{list} when it is followed by
    740 @code{.}, @code{=}, or @code{:}.  Also, when not in
    741 @env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and
    742 @code{\t} are recognized within @var{list}.  @xref{Escapes}.
    743 
    744 @item @var{regexp1}\|@var{regexp2}
    745 @cindex @acronym{GNU} extensions, to basic regular expressions
    746 Matches either @var{regexp1} or @var{regexp2}.  Use
    747 parentheses to use complex alternative regular expressions.
    748 The matching process tries each alternative in turn, from
    749 left to right, and the first one that succeeds is used.
    750 It is a @acronym{GNU} extension.
    751 
    752 @item @var{regexp1}@var{regexp2}
    753 Matches the concatenation of @var{regexp1} and @var{regexp2}.
    754 Concatenation binds more tightly than @code{\|}, @code{^}, and
    755 @code{$}, but less tightly than the other regular expression
    756 operators.
    757 
    758 @item \@var{digit}
    759 Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized
    760 subexpression in the regular expression.  This is called a @dfn{back
     761 reference}.  Subexpressions are implicitly numbered by counting
    762 occurrences of @code{\(} left-to-right.
    763 
    764 @item \n
    765 Matches the newline character.
    766 
    767 @item \@var{char}
    768 Matches @var{char}, where @var{char} is one of @code{$},
    769 @code{*}, @code{.}, @code{[}, @code{\}, or @code{^}.
    770 Note that the only C-like
    771 backslash sequences that you can portably assume to be
    772 interpreted are @code{\n} and @code{\\}; in particular
    773 @code{\t} is not portable, and matches a @samp{t} under most
    774 implementations of @command{sed}, rather than a tab character.
    775 
    776 @end table
    777 
    778 @cindex Greedy regular expression matching
    779 Note that the regular expression matcher is greedy, i.e., matches
    780 are attempted from left to right and, if two or more matches are
    781 possible starting at the same character, it selects the longest.
    782 
    783 @noindent
    784 Examples:
    785 @table @samp
    786 @item abcdef
    787 Matches @samp{abcdef}.
    788 
    789 @item a*b
    790 Matches zero or more @samp{a}s followed by a single
    791 @samp{b}.  For example, @samp{b} or @samp{aaaaab}.
    792 
    793 @item a\?b
    794 Matches @samp{b} or @samp{ab}.
    795 
    796 @item a\+b\+
    797 Matches one or more @samp{a}s followed by one or more
    798 @samp{b}s: @samp{ab} is the shortest possible match, but
    799 other examples are @samp{aaaab} or @samp{abbbbb} or
    800 @samp{aaaaaabbbbbbb}.
    801 
    802 @item .*
    803 @itemx .\+
    804 These two both match all the characters in a string;
    805 however, the first matches every string (including the empty
    806 string), while the second matches only strings containing
    807 at least one character.
    808 
    809 @item ^main.*(.*)
     810 This matches a string starting with @samp{main},
    811 followed by an opening and closing
    812 parenthesis.  The @samp{n}, @samp{(} and @samp{)} need not
    813 be adjacent.
    814 
    815 @item ^#
    816 This matches a string beginning with @samp{#}.
    817 
    818 @item \\$
    819 This matches a string ending with a single backslash.  The
    820 regexp contains two backslashes for escaping.
    821 
    822 @item \$
    823 Instead, this matches a string consisting of a single dollar sign,
    824 because it is escaped.
    825 
    826 @item [a-zA-Z0-9]
     827 In the C locale, this matches any single @acronym{ASCII} letter or digit.
    828 
    829 @item [^ @kbd{tab}]\+
    830 (Here @kbd{tab} stands for a single tab character.)
    831 This matches a string of one or more
    832 characters, none of which is a space or a tab.
    833 Usually this means a word.
    834 
    835 @item ^\(.*\)\n\1$
    836 This matches a string consisting of two equal substrings separated by
    837 a newline.
    838 
    839 @item .\@{9\@}A$
     840 This matches nine characters followed by an @samp{A} at the end of a line.
    841 
    842 @item ^.\@{15\@}A
    843 This matches the start of a string that contains 16 characters,
    844 the last of which is an @samp{A}.
    845 
    846 @end table
    847 
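A few of the patterns above can be checked from the shell. This is a sketch that assumes GNU @command{sed} (because of @code{\?}); the sample inputs and variable names are made up for illustration.

```shell
# Assumes GNU sed; \? is a GNU extension to basic regular expressions.

# Greedy matching: .*X takes the longest match, up to the last X.
greedy=$(echo 'aXbXc' | sed 's/.*X/-/')

# Back reference: \1 must repeat what \(ab\) matched.
backref=$(printf 'abab\nabba\n' | sed -n '/^\(ab\)\1$/p')

# Interval: exactly nine characters, then an A at the end of the line.
interval=$(printf '123456789A\n12345A\n' | sed -n '/^.\{9\}A$/p')

# Optional item: zero or one a, then b, and nothing else.
optional=$(printf 'b\nab\naab\n' | sed -n '/^a\?b$/p')

printf '%s\n' "$greedy" "$backref" "$interval" "$optional"
```

The first command prints @samp{-c} rather than @samp{-bXc}, which illustrates the longest-match rule.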
    848 
    849 
    850 @node Common Commands
    851 @section Often-Used Commands
    852 
    853 If you use @command{sed} at all, you will quite likely want to know
    854 these commands.
    855 
    856 @table @code
    857 @item #
    858 [No addresses allowed.]
    859 
    860 @findex # (comments)
    861 @cindex Comments, in scripts
    862 The @code{#} character begins a comment;
    863 the comment continues until the next newline.
    864 
    865 @cindex Portability, comments
    866 If you are concerned about portability, be aware that
    867 some implementations of @command{sed} (which are not @sc{posix}
    868 conformant) may only support a single one-line comment,
    869 and then only when the very first character of the script is a @code{#}.
    870 
    871 @findex -n, forcing from within a script
    872 @cindex Caveat --- #n on first line
    873 Warning: if the first two characters of the @command{sed} script
    874 are @code{#n}, then the @option{-n} (no-autoprint) option is forced.
    875 If you want to put a comment in the first line of your script
    876 and that comment begins with the letter @samp{n}
    877 and you do not want this behavior,
    878 then be sure to either use a capital @samp{N},
    879 or place at least one space before the @samp{n}.
    880 
    881 @item q [@var{exit-code}]
    882 This command only accepts a single address.
    883 
    884 @findex q (quit) command
    885 @cindex @value{SSEDEXT}, returning an exit code
    886 @cindex Quitting
    887 Exit @command{sed} without processing any more commands or input.
    888 Note that the current pattern space is printed if auto-print is
     889 not disabled with the @option{-n} option.  The ability to return
    890 an exit code from the @command{sed} script is a @value{SSED} extension.
     656
     657@item a\
     658@itemx @var{text}
     659Append @var{text} after a line.
     660
     661@item a @var{text}
     662Append @var{text} after a line (alternative syntax).
     663
     664@item b @var{label}
     665Branch unconditionally to @var{label}.
     666The @var{label} may be omitted, in which case the next cycle is started.
     667
     668@item c\
     669@itemx @var{text}
     670Replace (change) lines with @var{text}.
     671
     672@item c @var{text}
     673Replace (change) lines with @var{text} (alternative syntax).
    891674
    892675@item d
    893 @findex d (delete) command
    894 @cindex Text, deleting
    895676Delete the pattern space;
    896677immediately start next cycle.
    897678
    898 @item p
    899 @findex p (print) command
    900 @cindex Text, printing
    901 Print out the pattern space (to the standard output).
    902 This command is usually only used in conjunction with the @option{-n}
    903 command-line option.
     679@item D
     680If pattern space contains newlines, delete text in the pattern
     681space up to the first newline, and restart cycle with the resultant
     682pattern space, without reading a new line of input.
     683
     684If pattern space contains no newline, start a normal new cycle as if
     685the @code{d} command was issued.
     686@c TODO: add a section about D+N and D+n commands
     687
     688@item e
     689Executes the command that is found in pattern space and
     690replaces the pattern space with the output; a trailing newline
     691is suppressed.
     692
     693@item e @var{command}
     694Executes @var{command} and sends its output to the output stream.
     695The command can run across multiple lines, all but the last ending with
     696a backslash.
     697
     698@item F
     699(filename) Print the file name of the current input file (with a trailing
     700newline).
     701
     702@item g
     703Replace the contents of the pattern space with the contents of the hold space.
     704
     705@item G
     706Append a newline to the contents of the pattern space,
     707and then append the contents of the hold space to that of the pattern space.
     708
     709@item h
     710(hold) Replace the contents of the hold space with the contents of the
     711pattern space.
     712
     713@item H
     714Append a newline to the contents of the hold space,
     715and then append the contents of the pattern space to that of the hold space.
     716
     717@item i\
     718@itemx @var{text}
     719Insert @var{text} before a line.
     720
     721@item i @var{text}
     722Insert @var{text} before a line (alternative syntax).
     723
     724@item l
     725Print the pattern space in an unambiguous form.
    904726
    905727@item n
    906 @findex n (next-line) command
    907 @cindex Next input line, replace pattern space with
    908 @cindex Read next input line
    909 If auto-print is not disabled, print the pattern space,
     728(next) If auto-print is not disabled, print the pattern space,
    910729then, regardless, replace the pattern space with the next line of input.
    911730If there is no more input then @command{sed} exits without processing
    912731any more commands.
    913732
    914 @item @{ @var{commands} @}
    915 @findex @{@} command grouping
    916 @cindex Grouping commands
    917 @cindex Command groups
    918 A group of commands may be enclosed between
    919 @code{@{} and @code{@}} characters.
    920 This is particularly useful when you want a group of commands
    921 to be triggered by a single address (or address-range) match.
     733@item N
     734Add a newline to the pattern space,
     735then append the next line of input to the pattern space.
     736If there is no more input then @command{sed} exits without processing
     737any more commands.
     738
     739@item p
     740Print the pattern space.
     741@c useful with @option{-n}
     742
     743@item P
     744Print the pattern space, up to the first <newline>.
     745
     746@item q@var{[exit-code]}
     747(quit) Exit @command{sed} without processing any more commands or input.
     748
     749@item Q@var{[exit-code]}
     750(quit) This command is the same as @code{q}, but will not print the
     751contents of pattern space.  Like @code{q}, it provides the
     752ability to return an exit code to the caller.
     753@c useful to quit on a conditional without printing
     754
     755@item r filename
     756Reads file @var{filename}.
     757
     758@item R filename
     759Queue a line of @var{filename} to be read and
     760inserted into the output stream at the end of the current cycle,
     761or when the next input line is read.
     762@c useful to interleave files
     763
     764@item s@var{/regexp/replacement/[flags]}
     765(substitute) Match the regular-expression against the content of the
     766pattern space.  If found, replace matched string with
     767@var{replacement}.
     768
     769@item t @var{label}
     770(test) Branch to @var{label} only if there has been a successful
     771@code{s}ubstitution since the last input line was read or conditional
     772branch was taken.  The @var{label} may be omitted, in which case the
     773next cycle is started.
     774
     775@item T @var{label}
     776(test) Branch to @var{label} only if there have been no successful
     777@code{s}ubstitutions since the last input line was read or
     778conditional branch was taken. The @var{label} may be omitted,
     779in which case the next cycle is started.
     780
     781@item v @var{[version]}
     782(version) This command does nothing, but makes @command{sed} fail if
     783@value{SSED} extensions are not supported, or if the requested version
     784is not available.
     785
     786@item w filename
     787Write the pattern space to @var{filename}.
     788
     789@item W filename
     790Write to the given filename the portion of the pattern space up to
     791the first newline.
     792
     793@item x
     794Exchange the contents of the hold and pattern spaces.
     795
     796
     797@item y/@var{src}/@var{dst}/
     798Transliterate any characters in the pattern space which match
     799any of the characters in @var{src} with the corresponding character
     800in @var{dst}.
     801
     802
     803@item z
     804(zap) This command empties the content of pattern space.
     805
     806@item #
     807A comment, until the next newline.
     808
     809
     810@item @{ @var{cmd ; cmd ...} @}
     811Group several commands together.
     812@c useful for multiple commands on same address
     813
     814@item =
     815Print the current input line number (with a trailing newline).
     816
     817@item : @var{label}
     818Specify the location of @var{label} for branch commands (@code{b},
     819@code{t}, @code{T}).
    922820
    923821@end table
     822
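Several of the commands summarized above can be combined into short one-liners. The following sketch assumes GNU @command{sed} (@code{Q} is a GNU extension); the inputs and variable names are illustrative only.

```shell
# Assumes GNU sed; Q is a GNU extension.

# G: append a newline plus the (empty) hold space, i.e. double-space the input.
doubled=$(printf 'a\nb\n' | sed G)

# = on the last line ($): print the number of input lines.
count=$(printf 'a\nb\nc\n' | sed -n '$=')

# q prints the current pattern space before quitting; Q does not.
with_q=$(printf 'a\nb\nc\n' | sed 2q)
with_Q=$(printf 'a\nb\nc\n' | sed 2Q)

# G and h together: accumulate lines in reverse order in the hold space,
# then print the result on the last line.
reversed=$(printf 'a\nb\nc\n' | sed -n '1!G;h;$p')

printf '%s\n' "$doubled" "$count" "$with_q" "$with_Q" "$reversed"
```

Here @code{2q} outputs lines 1 and 2, while @code{2Q} outputs only line 1, because @code{Q} exits without printing the pattern space.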
    924823
    925824@node The "s" Command
    926825@section The @code{s} Command
    927826
    928 The syntax of the @code{s} (as in substitute) command is
    929 @samp{s/@var{regexp}/@var{replacement}/@var{flags}}.  The @code{/}
    930 characters may be uniformly replaced by any other single
    931 character within any given @code{s} command.  The @code{/}
    932 character (or whatever other character is used in its stead)
    933 can appear in the @var{regexp} or @var{replacement}
    934 only if it is preceded by a @code{\} character.
    935 
    936 The @code{s} command is probably the most important in @command{sed}
    937 and has a lot of different options.  Its basic concept is simple:
    938 the @code{s} command attempts to match the pattern
    939 space against the supplied @var{regexp}; if the match is
    940 successful, then that portion of the pattern
    941 space which was matched is replaced with @var{replacement}.
     827The @code{s} command (as in substitute) is probably the most important
     828in @command{sed} and has a lot of different options.  The syntax of
     829the @code{s} command is
     830@samp{s/@var{regexp}/@var{replacement}/@var{flags}}.
     831
     832Its basic concept is simple: the @code{s} command attempts to match
     833the pattern space against the supplied regular expression @var{regexp};
     834if the match is successful, then that portion of the
     835pattern space which was matched is replaced with @var{replacement}.
     836
     837For details about @var{regexp} syntax @pxref{Regexp Addresses,,Regular
     838Expression Addresses}.
    942839
    943840@cindex Backreferences, in regular expressions
     
    950847characters which reference the whole matched portion
    951848of the pattern space.
     849
     850@c TODO: xref to backreference section mention @var{\'}.
     851
     852The @code{/}
     853characters may be uniformly replaced by any other single
     854character within any given @code{s} command.  The @code{/}
     855character (or whatever other character is used in its stead)
     856can appear in the @var{regexp} or @var{replacement}
     857only if it is preceded by a @code{\} character.
     858
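As a small sketch of the delimiter rule, the two commands below perform the same substitution; the path used is only an example.

```shell
# With / as the delimiter, every / in the regexp and replacement is escaped.
escaped=$(echo '/usr/local/bin' | sed 's/\/usr\/local/\/opt/')

# With | as the delimiter, the slashes need no escaping.
piped=$(echo '/usr/local/bin' | sed 's|/usr/local|/opt|')

printf '%s\n' "$escaped" "$piped"
```

Both commands print @samp{/opt/bin}; choosing a delimiter that does not occur in the text usually gives the more readable script.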
     859
     860
    952861@cindex @value{SSEDEXT}, case modifiers in @code{s} commands
    953862Finally, as a @value{SSED} extension, you can include a
     
    976885Stop case conversion started by @code{\L} or @code{\U}.
    977886@end table
     887
     888When the @code{g} flag is being used, case conversion does not
     889propagate from one occurrence of the regular expression to
     890another.  For example, when the following command is executed
     891with @samp{a-b-} in pattern space:
     892@example
     893s/\(b\?\)-/x\u\1/g
     894@end example
     895
     896@noindent
     897the output is @samp{axxB}.  When replacing the first @samp{-},
     898the @samp{\u} sequence only affects the empty replacement of
     899@samp{\1}.  It does not affect the @code{x} character that is
     900added to pattern space when replacing @code{b-} with @code{xB}.
     901
     902On the other hand, @code{\l} and @code{\u} do affect the remainder
     903of the replacement text if they are followed by an empty substitution.
     904With @samp{a-b-} in pattern space, the following command:
     905@example
     906s/\(b\?\)-/\u\1x/g
     907@end example
     908
     909@noindent
     910will replace @samp{-} with @samp{X} (uppercase) and @samp{b-} with
     911@samp{Bx}.  If this behavior is undesirable, you can prevent it by
     912adding a @samp{\E} sequence---after @samp{\1} in this case.
    978913
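The two cases described above can be reproduced from the shell. This sketch assumes a GNU @command{sed} whose behavior matches this text (older releases may differ); the variable names are illustrative.

```shell
# Assumes GNU sed; \u, \E and \? are GNU extensions.

# \u applies to \1 only; the x before it is never affected.
first=$(echo 'a-b-' | sed 's/\(b\?\)-/x\u\1/g')

# When \1 is empty, \u carries over to the x that follows it.
second=$(echo 'a-b-' | sed 's/\(b\?\)-/\u\1x/g')

# Adding \E after \1 is the suggested way to stop that carry-over.
third=$(echo 'a-b-' | sed 's/\(b\?\)-/\u\1\Ex/g')

printf '%s\n' "$first" "$second" "$third"
```

The first command yields @samp{axxB} and the second @samp{aXBx}, as stated in the text.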
    979914To include a literal @code{\}, @code{&}, or newline in the final
     
    997932Only replace the @var{number}th match of the @var{regexp}.
    998933
    999 @cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command
      934@cindex GNU extensions, @code{g} and @var{number} modifier interaction in @code{s} command
    1000936@cindex Mixing @code{g} and @var{number} modifiers in the @code{s} command
    1001937Note: the @sc{posix} standard does not specify what should happen
     
    1023959change in future versions.
    1024960
    1025 @item w @var{file-name}
     961@item w @var{filename}
    1026962@cindex Text, writing to a file after substitution
    1027963@cindex @value{SSEDEXT}, @file{/dev/stdout} file
    1028964@cindex @value{SSEDEXT}, @file{/dev/stderr} file
    1029965If the substitution was made, then write out the result to the named file.
    1030 As a @value{SSED} extension, two special values of @var{file-name} are
     966As a @value{SSED} extension, two special values of @var{filename} are
    1031967supported: @file{/dev/stderr}, which writes the result to the standard
    1032968error, and @file{/dev/stdout}, which writes to the standard
     
    1048984@item I
    1049985@itemx i
    1050 @cindex @acronym{GNU} extensions, @code{I} modifier
     986@cindex GNU extensions, @code{I} modifier
    1051987@cindex Case-insensitive matching
    1052 @ifset PERL
    1053 @cindex Perl-style regular expressions, case-insensitive
    1054 @end ifset
    1055 The @code{I} modifier to regular-expression matching is a @acronym{GNU}
     988The @code{I} modifier to regular-expression matching is a GNU
    1056989extension which makes @command{sed} match @var{regexp} in a
    1057990case-insensitive manner.
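A minimal sketch of the @code{I} modifier, assuming GNU @command{sed} is the
@command{sed} found on the path (the modifier is not part of POSIX); the
variable name @code{out} is only illustrative:

```shell
# The pattern is written in lowercase, the input is capitalized;
# the I modifier makes the match succeed anyway.
out=$(printf 'Hello World\n' | sed 's/hello/bye/I')
printf '%s\n' "$out"
```

Without the @code{I} modifier the same command would leave the line unchanged.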
     
    1060993@itemx m
    1061994@cindex @value{SSEDEXT}, @code{M} modifier
    1062 @ifset PERL
    1063 @cindex Perl-style regular expressions, multiline
    1064 @end ifset
    1065995The @code{M} modifier to regular-expression matching is a @value{SSED}
    1066 extension which causes @code{^} and @code{$} to match respectively
    1067 (in addition to the normal behavior) the empty string after a newline,
    1068 and the empty string before a newline.  There are special character
    1069 sequences
    1070 @ifset PERL
    1071 (@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
    1072 in basic or extended regular expression modes)
    1073 @end ifset
     996extension which directs @value{SSED} to match the regular expression
     997in @cite{multi-line} mode.  The modifier causes @code{^} and @code{$} to
     998match respectively (in addition to the normal behavior) the empty string
     999after a newline, and the empty string before a newline.  There are
     1000special character sequences
    10741001@ifclear PERL
    10751002(@code{\`} and @code{\'})
    10761003@end ifclear
    10771004which always match the beginning or the end of the buffer.
    1078 @code{M} stands for @cite{multi-line}.
    1079 
    1080 @ifset PERL
    1081 @item S
    1082 @itemx s
    1083 @cindex @value{SSEDEXT}, @code{S} modifier
    1084 @cindex Perl-style regular expressions, single line
    1085 The @code{S} modifier to regular-expression matching is only valid
    1086 in Perl mode and specifies that the dot character (@code{.}) will
    1087 match the newline character too.  @code{S} stands for @cite{single-line}.
    1088 @end ifset
    1089 
    1090 @ifset PERL
    1091 @item X
    1092 @itemx x
    1093 @cindex @value{SSEDEXT}, @code{X} modifier
    1094 @cindex Perl-style regular expressions, extended
    1095 The @code{X} modifier to regular-expression matching is also
    1096 valid in Perl mode only.  If it is used, whitespace in the
    1097 pattern (other than in a character class) and
    1098 characters between a @kbd{#} outside a character class and the
    1099 next newline character are ignored. An escaping backslash
    1100 can be used to include a whitespace or @kbd{#} character as part
    1101 of the pattern.
    1102 @end ifset
     1005In addition,
     1006the period character does not match a new-line character in
     1007multi-line mode.
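The effect of @code{M} on @code{^} can be sketched as follows, assuming GNU
@command{sed}; the variable names are illustrative only.  The @code{N} command
is used to get two lines into the pattern space, since a single-line pattern
space gives @code{M} nothing to act on:

```shell
# N joins both input lines into one pattern space: "a<newline>b".
# With M, ^ also matches right after the embedded newline.
with_m=$(printf 'a\nb\n' | sed 'N;s/^/>/Mg')
# Without M, ^ matches only at the very start of the pattern space.
without_m=$(printf 'a\nb\n' | sed 'N;s/^/>/g')
printf '%s\n---\n%s\n' "$with_m" "$without_m"
```

In the first case both lines gain a @samp{>} prefix; in the second only the
first line does.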
     1008
     1009
     1010@end table
     1011
     1012@node Common Commands
     1013@section Often-Used Commands
     1014
     1015If you use @command{sed} at all, you will quite likely want to know
     1016these commands.
     1017
     1018@table @code
     1019@item #
     1020[No addresses allowed.]
     1021
     1022@findex # (comments)
     1023@cindex Comments, in scripts
     1024The @code{#} character begins a comment;
     1025the comment continues until the next newline.
     1026
     1027@cindex Portability, comments
     1028If you are concerned about portability, be aware that
     1029some implementations of @command{sed} (which are not @sc{posix}
     1030conforming) may only support a single one-line comment,
     1031and then only when the very first character of the script is a @code{#}.
     1032
     1033@findex -n, forcing from within a script
     1034@cindex Caveat --- #n on first line
     1035Warning: if the first two characters of the @command{sed} script
     1036are @code{#n}, then the @option{-n} (no-autoprint) option is forced.
     1037If you want to put a comment in the first line of your script
     1038and that comment begins with the letter @samp{n}
     1039and you do not want this behavior,
     1040then be sure to either use a capital @samp{N},
     1041or place at least one space before the @samp{n}.
     1042
     1043@item q [@var{exit-code}]
     1044@findex q (quit) command
     1045@cindex @value{SSEDEXT}, returning an exit code
     1046@cindex Quitting
     1047Exit @command{sed} without processing any more commands or input.
     1048
     1049Example: stop after printing the second line:
     1050@example
     1051$ seq 3 | sed 2q
     10521
     10532
     1054@end example
     1055
     1056This command accepts only one address.
     1057Note that the current pattern space is printed if auto-print is
     1058not disabled with the @option{-n} option.  The ability to return
     1059an exit code from the @command{sed} script is a @value{SSED} extension.
     1060
     1061See also the @value{SSED} extension @code{Q} command which quits silently
     1062without printing the current pattern space.
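A small sketch of the exit-code extension, assuming GNU @command{sed}; the
variable names @code{out} and @code{status} are illustrative:

```shell
# Quit after the second line and ask sed to exit with status 5.
# The && / || chain records the status without aborting under "set -e".
out=$(seq 3 | sed '2q5') && status=0 || status=$?
printf '%s\n' "$out"
printf 'exit status: %s\n' "$status"
```

The first two lines are still printed, because @code{q} auto-prints the
pattern space before exiting.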
     1063
     1064@item d
     1065@findex d (delete) command
     1066@cindex Text, deleting
     1067Delete the pattern space;
     1068immediately start next cycle.
     1069
     1070Example: delete the second input line:
     1071@example
     1072$ seq 3 | sed 2d
     10731
     10743
     1075@end example
     1076
     1077@item p
     1078@findex p (print) command
     1079@cindex Text, printing
     1080Print out the pattern space (to the standard output).
     1081This command is usually only used in conjunction with the @option{-n}
     1082command-line option.
     1083
     1084Example: print only the second input line:
     1085@example
     1086$ seq 3 | sed -n 2p
     10872
     1088@end example
     1089
     1090@item n
     1091@findex n (next-line) command
     1092@cindex Next input line, replace pattern space with
     1093@cindex Read next input line
     1094If auto-print is not disabled, print the pattern space,
     1095then, regardless, replace the pattern space with the next line of input.
     1096If there is no more input then @command{sed} exits without processing
     1097any more commands.
     1098
     1099This command is useful to skip lines (e.g. process every Nth line).
     1100
     1101Example: perform substitution on every 3rd line (i.e. two @code{n} commands
     1102skip two lines):
     1103@codequoteundirected on
     1104@codequotebacktick on
     1105@example
     1106$ seq 6 | sed 'n;n;s/./x/'
     11071
     11082
     1109x
     11104
     11115
     1112x
     1113@end example
     1114
     1115@value{SSED} provides an extension address syntax of @var{first}~@var{step}
     1116to achieve the same result:
     1117
     1118@example
     1119$ seq 6 | sed '0~3s/./x/'
     11201
     11212
     1122x
     11234
     11245
     1125x
     1126@end example
     1127
     1128@codequotebacktick off
     1129@codequoteundirected off
     1130
     1131
     1132@item @{ @var{commands} @}
     1133@findex @{@} command grouping
     1134@cindex Grouping commands
     1135@cindex Command groups
     1136A group of commands may be enclosed between
     1137@code{@{} and @code{@}} characters.
     1138This is particularly useful when you want a group of commands
     1139to be triggered by a single address (or address-range) match.
     1140
     1141Example: perform substitution then print the second input line:
     1142@codequoteundirected on
     1143@codequotebacktick on
     1144@example
     1145$ seq 3 | sed -n '2@{s/2/X/ ; p@}'
     1146X
     1147@end example
     1148@codequoteundirected off
     1149@codequotebacktick off
     1150
    11031151@end table
    11041152
     
    11131161@table @code
    11141162@item y/@var{source-chars}/@var{dest-chars}/
    1115 (The @code{/} characters may be uniformly replaced by
    1116 any other single character within any given @code{y} command.)
    1117 
    11181163@findex y (transliterate) command
    11191164@cindex Transliteration
     
    11221167in @var{dest-chars}.
    11231168
     1169Example: transliterate @samp{a-j} into @samp{0-9}:
     1170@codequoteundirected on
     1171@codequotebacktick on
     1172@example
     1173$ echo hello world | sed 'y/abcdefghij/0123456789/'
     117474llo worl3
     1175@end example
     1176@codequoteundirected off
     1177@codequotebacktick off
     1178
     1179(The @code{/} characters may be uniformly replaced by
     1180any other single character within any given @code{y} command.)
     1181
    11241182Instances of the @code{/} (or whatever other character is used in its stead),
    11251183@code{\}, or newlines can appear in the @var{source-chars} or @var{dest-chars}
     
    11281186contain the same number of characters (after de-escaping).
    11291187
     1188See the @command{tr} command from GNU coreutils for similar functionality.
     1189
     1190@item a @var{text}
     1191Appends @var{text} after a line.  This is a GNU extension
     1192to the standard @code{a} command; see below for details.
     1193
     1194Example: Add @samp{hello} after the second line:
     1195@codequoteundirected on
     1196@codequotebacktick on
     1197@example
     1198$ seq 3 | sed '2a hello'
     11991
     12002
     1201hello
     12023
     1203@end example
     1204@codequoteundirected off
     1205@codequotebacktick off
     1206
     1207Leading whitespace after the @code{a} command is ignored.
     1208The text to add is read until the end of the line.
     1209
     1210
    11301211@item a\
    11311212@itemx @var{text}
    1132 @cindex @value{SSEDEXT}, two addresses supported by most commands
    1133 As a @acronym{GNU} extension, this command accepts two addresses.
    1134 
    11351213@findex a (append text lines) command
    11361214@cindex Appending text after a line
    11371215@cindex Text, appending
    1138 Queue the lines of text which follow this command
     1216Appends @var{text} after a line.
     1217
     1218Example: Add @samp{hello} after the second line
     1219(@print{} indicates printed output lines):
     1220@codequoteundirected on
     1221@codequotebacktick on
     1222@example
     1223$ seq 3 | sed '2a\
     1224hello'
     1225@print{}1
     1226@print{}2
     1227@print{}hello
     1228@print{}3
     1229@end example
     1230@codequoteundirected off
     1231@codequotebacktick off
     1232
     1233The @code{a} command queues the lines of text which follow this command
    11391234(each but the last ending with a @code{\},
    11401235which are removed from the output)
     
    11421237or when the next input line is read.
    11431238
     1239@cindex @value{SSEDEXT}, two addresses supported by most commands
     1240As a GNU extension, this command accepts two addresses.
     1241
    11441242Escape sequences in @var{text} are processed, so you should
    11451243use @code{\\} in @var{text} to print a single backslash.
    11461244
    1147 As a @acronym{GNU} extension, if between the @code{a} and the newline there is
    1148 other than a whitespace-@code{\} sequence, then the text of this line,
    1149 starting at the first non-whitespace character after the @code{a},
    1150 is taken as the first line of the @var{text} block.
    1151 (This enables a simplification in scripting a one-line add.)
    1152 This extension also works with the @code{i} and @code{c} commands.
    1153 
     1245Commands resume after the first line of @var{text} that does not end
     1246with a backslash (@code{\}), @samp{world} in the following example:
     1247@codequoteundirected on
     1248@codequotebacktick on
     1249@example
     1250$ seq 3 | sed '2a\
     1251hello\
     1252world
     12533s/./X/'
     1254@print{}1
     1255@print{}2
     1256@print{}hello
     1257@print{}world
     1258@print{}X
     1259@end example
     1260@codequoteundirected off
     1261@codequotebacktick off
     1262
     1263As a GNU extension, the @code{a} command and @var{text} can be
     1264separated into two @code{-e} parameters, enabling easier scripting:
     1265@codequoteundirected on
     1266@codequotebacktick on
     1267@example
     1268$ seq 3 | sed -e '2a\' -e hello
     12691
     12702
     1271hello
     12723
     1273
     1274$ sed -e '2a\' -e "$VAR"
     1275@end example
     1276@codequoteundirected off
     1277@codequotebacktick off
     1278
     1279@item i @var{text}
     1280Inserts @var{text} before a line.  This is a GNU extension
     1281to the standard @code{i} command; see below for details.
     1282
     1283Example: Insert @samp{hello} before the second line:
     1284@codequoteundirected on
     1285@codequotebacktick on
     1286@example
     1287$ seq 3 | sed '2i hello'
     12881
     1289hello
     12902
     12913
     1292@end example
     1293@codequoteundirected off
     1294@codequotebacktick off
     1295
     1296Leading whitespace after the @code{i} command is ignored.
     1297The text to add is read until the end of the line.
     1298
     1299@anchor{insert command}
    11541300@item i\
    11551301@itemx @var{text}
    1156 @cindex @value{SSEDEXT}, two addresses supported by most commands
    1157 As a @acronym{GNU} extension, this command accepts two addresses.
    1158 
    11591302@findex i (insert text lines) command
    11601303@cindex Inserting text before a line
    11611304@cindex Text, insertion
    1162 Immediately output the lines of text which follow this command
    1163 (each but the last ending with a @code{\},
    1164 which are removed from the output).
     1305Immediately output the lines of text which follow this command.
     1306
     1307Example: Insert @samp{hello} before the second line
     1308(@print{} indicates printed output lines):
     1309@codequoteundirected on
     1310@codequotebacktick on
     1311@example
     1312$ seq 3 | sed '2i\
     1313hello'
     1314@print{}1
     1315@print{}hello
     1316@print{}2
     1317@print{}3
     1318@end example
     1319@codequoteundirected off
     1320@codequotebacktick off
     1321
     1322@cindex @value{SSEDEXT}, two addresses supported by most commands
     1323As a GNU extension, this command accepts two addresses.
     1324
     1325Escape sequences in @var{text} are processed, so you should
     1326use @code{\\} in @var{text} to print a single backslash.
     1327
     1328Commands resume after the first line of @var{text} that does not end
     1329with a backslash (@code{\}), @samp{world} in the following example:
     1330@codequoteundirected on
     1331@codequotebacktick on
     1332@example
     1333$ seq 3 | sed '2i\
     1334hello\
     1335world
     1336s/./X/'
     1337@print{}X
     1338@print{}hello
     1339@print{}world
     1340@print{}X
     1341@print{}X
     1342@end example
     1343@codequoteundirected off
     1344@codequotebacktick off
     1345
     1346As a GNU extension, the @code{i} command and @var{text} can be
     1347separated into two @code{-e} parameters, enabling easier scripting:
     1348@codequoteundirected on
     1349@codequotebacktick on
     1350@example
     1351$ seq 3 | sed -e '2i\' -e hello
     13521
     1353hello
     13542
     13553
     1356
     1357$ sed -e '2i\' -e "$VAR"
     1358@end example
     1359@codequoteundirected off
     1360@codequotebacktick off
     1361
     1362@item c @var{text}
     1363Replaces the line(s) with @var{text}.  This is a GNU extension
     1364to the standard @code{c} command; see below for details.
     1365
     1366Example: Replace the 2nd to 9th lines with the word @samp{hello}:
     1367@codequoteundirected on
     1368@codequotebacktick on
     1369@example
     1370$ seq 10 | sed '2,9c hello'
     13711
     1372hello
     137310
     1374@end example
     1375@codequoteundirected off
     1376@codequotebacktick off
     1377
     1378Leading whitespace after the @code{c} command is ignored.
     1379The text to add is read until the end of the line.
    11651380
    11661381@item c\
     
    11691384@cindex Replacing selected lines with other text
    11701385Delete the lines matching the address or address-range,
    1171 and output the lines of text which follow this command
    1172 (each but the last ending with a @code{\},
    1173 which are removed from the output)
    1174 in place of the last line
    1175 (or in place of each line, if no addresses were specified).
     1386and output the lines of text which follow this command.
     1387
     1388Example: Replace 2nd to 4th lines with the words @samp{hello} and
     1389@samp{world} (@print{} indicates printed output lines):
     1390@codequoteundirected on
     1391@codequotebacktick on
     1392@example
     1393$ seq 5 | sed '2,4c\
     1394hello\
     1395world'
     1396@print{}1
     1397@print{}hello
     1398@print{}world
     1399@print{}5
     1400@end example
     1401@codequoteundirected off
     1402@codequotebacktick off
     1403
     1404If no addresses are given, each line is replaced.
     1405
    11761406A new cycle is started after this command is done,
    11771407since the pattern space will have been deleted.
     1408In the following example, the @code{c} starts a
     1409new cycle and the substitution command is not performed
     1410on the replaced text:
     1411
     1412@codequoteundirected on
     1413@codequotebacktick on
     1414@example
     1415$ seq 3 | sed '2c\
     1416hello
     1417s/./X/'
     1418@print{}X
     1419@print{}hello
     1420@print{}X
     1421@end example
     1422@codequoteundirected off
     1423@codequotebacktick off
     1424
     1425As a GNU extension, the @code{c} command and @var{text} can be
     1426separated into two @code{-e} parameters, enabling easier scripting:
     1427@codequoteundirected on
     1428@codequotebacktick on
     1429@example
     1430$ seq 3 | sed -e '2c\' -e hello
     14311
     1432hello
     14333
     1434
     1435$ sed -e '2c\' -e "$VAR"
     1436@end example
     1437@codequoteundirected off
     1438@codequotebacktick off
     1439
    11781440
    11791441@item =
    1180 @cindex @value{SSEDEXT}, two addresses supported by most commands
    1181 As a @acronym{GNU} extension, this command accepts two addresses.
    1182 
    11831442@findex = (print line number) command
    11841443@cindex Printing line number
    11851444@cindex Line number, printing
    11861445Print out the current input line number (with a trailing newline).
     1446
     1447@codequoteundirected on
     1448@codequotebacktick on
     1449@example
     1450$ printf '%s\n' aaa bbb ccc | sed =
     14511
     1452aaa
     14532
     1454bbb
     14553
     1456ccc
     1457@end example
     1458@codequoteundirected off
     1459@codequotebacktick off
     1460
     1461@cindex @value{SSEDEXT}, two addresses supported by most commands
     1462As a GNU extension, this command accepts two addresses.
     1463
     1464
     1465
    11871466
    11881467@item l @var{n}
     
    12041483
    12051484@item r @var{filename}
    1206 @cindex @value{SSEDEXT}, two addresses supported by most commands
    1207 As a @acronym{GNU} extension, this command accepts two addresses.
    12081485
    12091486@findex r (read file) command
    12101487@cindex Read text from a file
     1488Reads file @var{filename}. Example:
     1489
     1490@codequoteundirected on
     1491@codequotebacktick on
     1492@example
     1493$ seq 3 | sed '2r/etc/hostname'
     14941
     14952
     1496fencepost.gnu.org
     14973
     1498@end example
     1499@codequoteundirected off
     1500@codequotebacktick off
     1501
    12111502@cindex @value{SSEDEXT}, @file{/dev/stdin} file
    12121503Queue the contents of @var{filename} to be read and
     
    12201511standard input.
    12211512
     1513@cindex @value{SSEDEXT}, two addresses supported by most commands
     1514As a GNU extension, this command accepts two addresses. The
     1515file will then be reread and inserted on each of the addressed lines.
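The two-address behavior can be sketched like this, assuming GNU
@command{sed} and a scratch file created with @command{mktemp} (the file
contents and variable names are made up for the illustration):

```shell
# Create a one-line file to be read by the r command.
tmp=$(mktemp)
printf 'X\n' > "$tmp"
# With the address range 1,2 the file is queued after each addressed line.
out=$(seq 3 | sed "1,2r $tmp")
rm -f "$tmp"
printf '%s\n' "$out"
```

The contents of the file appear once after line 1 and once after line 2,
but not after line 3.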
     1516
     1517As a @value{SSED} extension, the @code{r} command accepts a zero address,
     1518inserting a file @emph{before} the first line of the input
     1519(@pxref{Adding a header to multiple files}).
     1520
    12221521@item w @var{filename}
    12231522@findex w (write file) command
     
    12261525@cindex @value{SSEDEXT}, @file{/dev/stderr} file
    12271526Write the pattern space to @var{filename}.
    1228 As a @value{SSED} extension, two special values of @var{file-name} are
     1527As a @value{SSED} extension, two special values of @var{filename} are
    12291528supported: @file{/dev/stderr}, which writes the result to the standard
    12301529error, and @file{/dev/stdout}, which writes to the standard
     
    12321531option is being used.}
    12331532
    1234 The file will be created (or truncated) before the
    1235 first input line is read; all @code{w} commands
    1236 (including instances of @code{w} flag on successful @code{s} commands)
    1237 which refer to the same @var{filename} are output without
    1238 closing and reopening the file.
     1533The file will be created (or truncated) before the first input line is
     1534read; all @code{w} commands (including instances of the @code{w} flag
     1535on successful @code{s} commands) which refer to the same @var{filename}
     1536are output without closing and reopening the file.
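A short sketch of two @code{w} commands sharing one output file, assuming
GNU @command{sed}; the directory and file name are hypothetical scratch
locations:

```shell
dir=$(mktemp -d)
# Both w commands name the same file; sed opens it once, so the second
# write is appended after the first instead of truncating the file.
seq 4 | sed -n -e "1w $dir/out.txt" -e "3w $dir/out.txt"
out=$(cat "$dir/out.txt")
rm -rf "$dir"
printf '%s\n' "$out"
```

The file ends up holding both line 1 and line 3.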
    12391537
    12401538@item D
    12411539@findex D (delete first line) command
    12421540@cindex Delete first line from pattern space
    1243 Delete text in the pattern space up to the first newline.
    1244 If any text is left, restart cycle with the resultant
    1245 pattern space (without reading a new line of input),
    1246 otherwise start a normal new cycle.
     1541If pattern space contains no newline, start a normal new cycle as if
     1542the @code{d} command was issued.  Otherwise, delete text in the pattern
     1543space up to the first newline, and restart cycle with the resultant
     1544pattern space, without reading a new line of input.
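One well-known use of @code{D}, shown here as a sketch rather than as the
only idiom, keeps a two-line sliding window in the pattern space so that only
the last two input lines are printed (much like @samp{tail -n 2}); the
variable name is illustrative:

```shell
# $!N appends the next line unless we are on the last one;
# $!D drops the oldest line and restarts the cycle without reading input.
# On the last line neither command runs and the window is auto-printed.
out=$(seq 4 | sed '$!N;$!D')
printf '%s\n' "$out"
```

Because @code{D} restarts the cycle without reading a new line, the
following @code{N} always refills the window to two lines.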
    12471545
    12481546@item N
     
    12541552If there is no more input then @command{sed} exits without processing
    12551553any more commands.
     1554
     1555When @option{-z} is used, a zero byte (the ASCII @samp{NUL} character) is
     1556added between the lines (instead of a newline).
     1557
     1558By default @command{sed} does not terminate if there is no 'next' input line.
     1559This is a GNU extension which can be disabled with @option{--posix}.
     1560@xref{N_command_last_line,,N command on the last line}.
     1561
    12561562
    12571563@item P
     
    13561662
    13571663If a parameter is specified, instead, the @code{e} command
    1358 interprets it as a command and sends its output to the output stream
    1359 (like @code{r} does).  The command can run across multiple
    1360 lines, all but the last ending with a back-slash.
     1664interprets it as a command and sends its output to the output stream.
     1665The command can run across multiple lines, all but the last ending with
     1666a backslash.
    13611667
    13621668In both cases, the results are undefined if the command to be
    13631669executed contains a @sc{nul} character.
    13641670
    1365 @item L @var{n}
    1366 @findex L (fLow paragraphs) command
    1367 @cindex Reformat pattern space
    1368 @cindex Reformatting paragraphs
    1369 @cindex @value{SSEDEXT}, reformatting paragraphs
    1370 @cindex @value{SSEDEXT}, @code{L} command
    1371 This @value{SSED} extension fills and joins lines in pattern space
    1372 to produce output lines of (at most) @var{n} characters, like
    1373 @code{fmt} does; if @var{n} is omitted, the default as specified
    1374 on the command line is used.  This command is considered a failed
    1375 experiment and unless there is enough request (which seems unlikely)
    1376 will be removed in future versions.
    1377 
    1378 @ignore
    1379 Blank lines, spaces between words, and indentation are
    1380 preserved in the output; successive input lines with different
    1381 indentation are not joined; tabs are expanded to 8 columns.
    1382 
    1383 If the pattern space contains multiple lines, they are joined, but
    1384 since the pattern space usually contains a single line, the behavior
    1385 of a simple @code{L;d} script is the same as @samp{fmt -s} (i.e.,
    1386 it does not join short lines to form longer ones).
    1387 
    1388 @var{n} specifies the desired line-wrap length; if omitted,
    1389 the default as specified on the command line is used.
    1390 @end ignore
     1671Note that, unlike the @code{r} command, the output of the command will
     1672be printed immediately; the @code{r} command instead delays the output
     1673to the end of the current cycle.
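The ordering difference between @code{e} and @code{r} can be sketched as
follows, assuming GNU @command{sed} built with the @code{e} command enabled
(it is rejected in @option{--sandbox} mode); the scratch file and variable
names are illustrative:

```shell
tmp=$(mktemp)
printf 'from-file\n' > "$tmp"
# e: the command's output is printed immediately, before the pattern space.
out_e=$(printf 'line\n' | sed '1e echo from-command')
# r: the file is queued and printed at the end of the cycle, after the line.
out_r=$(printf 'line\n' | sed "1r $tmp")
rm -f "$tmp"
printf '%s\n---\n%s\n' "$out_e" "$out_r"
```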
     1674
     1675@item F
     1676@findex F (File name) command
     1677@cindex Printing file name
     1678@cindex File name, printing
     1679Print out the file name of the current input file (with a trailing
     1680newline).
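A minimal sketch of @code{F}, assuming a GNU @command{sed} recent enough to
provide the command; the directory and file name are hypothetical:

```shell
dir=$(mktemp -d)
printf 'x\n' > "$dir/input.txt"
# F prints the name of the current input file, as given on the command
# line, before the pattern space is auto-printed.
out=$(sed F "$dir/input.txt")
rm -rf "$dir"
printf '%s\n' "$out"
```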
    13911681
    13921682@item Q [@var{exit-code}]
    1393 This command only accepts a single address.
     1683This command accepts only one address.
    13941684
    13951685@findex Q (silent Quit) command
     
    14091699@example
    14101700:eat
    1411 $d       @i{Quit silently on the last line}
    1412 N        @i{Read another line, silently}
    1413 g        @i{Overwrite pattern space each time to save memory}
     1701$d       @i{@r{Quit silently on the last line}}
     1702N        @i{@r{Read another line, silently}}
     1703g        @i{@r{Overwrite pattern space each time to save memory}}
    14141704b eat
    14151705@end example
     
    14621752the first newline.  Everything said under the @code{w} command about
    14631753file handling holds here too.
     1754
     1755@item z
     1756@findex z (Zap) command
     1757@cindex @value{SSEDEXT}, emptying pattern space
     1758@cindex Emptying pattern space
     1759This command empties the content of pattern space.  It is
     1760usually the same as @samp{s/.*//}, but is more efficient
     1761and works in the presence of invalid multibyte sequences
     1762in the input stream.  @sc{posix} mandates that such sequences
     1763are @emph{not} matched by @samp{.}, so that there is no portable
     1764way to clear @command{sed}'s buffers in the middle of the
     1765script in most multibyte locales (including UTF-8 locales).
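A small sketch of @code{z}, assuming GNU @command{sed}; the replacement
word @samp{emptied} is only there to make the empty pattern space visible:

```shell
# z empties the pattern space; the substitution then matches the
# resulting empty string and replaces it with a marker word.
out=$(printf 'hello\n' | sed 'z;s/^$/emptied/')
printf '%s\n' "$out"
```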
    14641766@end table
    14651767
     1768
     1769@node Multiple commands syntax
     1770@section Multiple commands syntax
     1771
     1772@c POSIX says:
     1773@c   Editing commands other than {...}, a, b, c, i, r, t, w, :, and #
     1774@c   can be followed by a <semicolon>, optional <blank> characters, and
     1775@c   another editing command. However, when an s editing command is used
     1776@c   with the w flag, following it with another command in this manner
     1777@c   produces undefined results.
     1778
     1779There are several methods to specify multiple commands in a @command{sed}
     1780program.
     1781
     1782Using newlines is most natural when running a sed script from a file
     1783(using the @option{-f} option).
     1784
     1785On the command line, all @command{sed} commands may be separated by newlines.
     1786Alternatively, you may specify each command as an argument to an @option{-e}
     1787option:
     1788
     1789@codequoteundirected on
     1790@codequotebacktick on
     1791@example
     1792@group
     1793$ seq 6 | sed '1d
     17943d
     17955d'
     17962
     17974
     17986
     1799
     1800$ seq 6 | sed -e 1d -e 3d -e 5d
     18012
     18024
     18036
     1804@end group
     1805@end example
     1806@codequoteundirected off
     1807@codequotebacktick off
     1808
     1809A semicolon (@samp{;}) may be used to separate most simple commands:
     1810
     1811@codequoteundirected on
     1812@codequotebacktick on
     1813@example
     1814@group
     1815$ seq 6 | sed '1d;3d;5d'
     18162
     18174
     18186
     1819@end group
     1820@end example
     1821@codequoteundirected off
     1822@codequotebacktick off
     1823
     1824The @code{@{},@code{@}},@code{b},@code{t},@code{T},@code{:} commands can
     1825be separated with a semicolon (this is a non-portable @value{SSED} extension).
     1826
     1827@codequoteundirected on
     1828@codequotebacktick on
     1829@example
     1830@group
     1831$ seq 4 | sed '@{1d;3d@}'
     18322
     18334
     1834
     1835$ seq 6 | sed '@{1d;3d@};5d'
     18362
     18374
     18386
     1839@end group
     1840@end example
     1841@codequoteundirected off
     1842@codequotebacktick off
     1843
     1844Labels used in @code{b},@code{t},@code{T},@code{:} commands are read
     1845until a semicolon.  Leading and trailing whitespace is ignored.  In
     1846the examples below the label is @samp{x}.  The first example works
     1847with @value{SSED}.  The second is a portable equivalent.  For more
     1848information about branching and labels, see @ref{Branching and flow
     1849control}.
     1850
     1851@codequoteundirected on
     1852@codequotebacktick on
     1853@example
     1854@group
     1855$ seq 3 | sed '/1/b x ; s/^/=/ ; :x ; 3d'
     18561
     1857=2
     1858
     1859$ seq 3 | sed -e '/1/bx' -e 's/^/=/' -e ':x' -e '3d'
     18601
     1861=2
     1862@end group
     1863@end example
     1864@codequoteundirected off
     1865@codequotebacktick off
     1866
     1867
     1868
     1869@subsection Commands Requiring a newline
     1870
     1871The following commands cannot be separated by a semicolon and
     1872require a newline:
     1873
     1874@table @asis
     1875
     1876@item @code{a},@code{c},@code{i} (append/change/insert)
     1877
     1878All characters following @code{a},@code{c},@code{i} commands are taken
     1879as the text to append/change/insert.  Using a semicolon leads to
     1880undesirable results:
     1881
     1882@codequoteundirected on
     1883@codequotebacktick on
     1884@example
     1885@group
     1886$ seq 2 | sed '1aHello ; 2d'
     18871
     1888Hello ; 2d
     18892
     1890@end group
     1891@end example
     1892@codequoteundirected off
     1893@codequotebacktick off
     1894
     1895Separate the commands using @option{-e} or a newline:
     1896
     1897@codequoteundirected on
     1898@codequotebacktick on
     1899@example
     1900@group
     1901$ seq 2 | sed -e 1aHello -e 2d
     19021
     1903Hello
     1904
     1905$ seq 2 | sed '1aHello
     19062d'
     19071
     1908Hello
     1909@end group
     1910@end example
     1911@codequoteundirected off
     1912@codequotebacktick off
     1913
     1914Note that specifying the text to add (@samp{Hello}) immediately
     1915after @code{a},@code{c},@code{i} is itself a @value{SSED} extension.
     1916A portable, POSIX-compliant alternative is:
     1917
     1918@codequoteundirected on
     1919@codequotebacktick on
     1920@example
     1921@group
     1922$ seq 2 | sed '1a\
     1923Hello
     19242d'
     19251
     1926Hello
     1927@end group
     1928@end example
     1929@codequoteundirected off
     1930@codequotebacktick off
     1931
     1932@item @code{#} (comment)
     1933
     1934All characters following @samp{#} until the next newline are ignored.
     1935
     1936@codequoteundirected on
     1937@codequotebacktick on
     1938@example
     1939@group
     1940$ seq 3 | sed '# this is a comment ; 2d'
     19411
     19422
     19433
     1944
     1945
     1946$ seq 3 | sed '# this is a comment
     19472d'
     19481
     19493
     1950@end group
     1951@end example
     1952@codequoteundirected off
     1953@codequotebacktick off
     1954
     1955@item @code{r},@code{R},@code{w},@code{W} (reading and writing files)
     1956
     1957The @code{r},@code{R},@code{w},@code{W} commands parse the filename
     1958until the end of the line.  If whitespace, comments or semicolons are found,
     1959they will be included in the filename, leading to unexpected results:
     1960
     1961@codequoteundirected on
     1962@codequotebacktick on
     1963@example
     1964@group
     1965$ seq 2 | sed '1w hello.txt ; 2d'
     19661
     19672
     1968
     1969$ ls -log
     1970total 4
     1971-rw-rw-r-- 1 2 Jan 23 23:03 hello.txt ; 2d
     1972
     1973$ cat 'hello.txt ; 2d'
     19741
     1975@end group
     1976@end example
     1977@codequoteundirected off
     1978@codequotebacktick off
     1979
     1980Note that @command{sed} silently ignores read/write errors in
     1981@code{r},@code{R},@code{w},@code{W} commands (such as missing files).
     1982In the following example, @command{sed} tries to read a file named
     1983@samp{@file{hello.txt ; N}}. The file is missing, and the error is silently
     1984ignored:
     1985
     1986@codequoteundirected on
     1987@codequotebacktick on
     1988@example
     1989@group
     1990$ echo x | sed '1rhello.txt ; N'
     1991x
     1992@end group
     1993@end example
     1994@codequoteundirected off
     1995@codequotebacktick off
     1996
     1997@item @code{e} (command execution)
     1998
     1999Any characters following the @code{e} command until the end of the line
     2000will be sent to the shell.  If whitespace, comments or semicolons are found,
     2001they will be included in the shell command, leading to unexpected results:
     2002
     2003@codequoteundirected on
     2004@codequotebacktick on
     2005@example
     2006@group
     2007$ echo a | sed '1e touch foo#bar'
     2008a
     2009
     2010$ ls -1
     2011foo#bar
     2012
     2013$ echo a | sed '1e touch foo ; s/a/b/'
     2014sh: 1: s/a/b/: not found
     2015a
     2016@end group
     2017@end example
     2018@codequoteundirected off
     2019@codequotebacktick off
     2020
     2021
     2022@item @code{s///[we]} (substitute with @code{e} or @code{w} flags)
     2023
     2024In a substitution command, the @code{w} flag writes the substitution
     2025result to a file, and the @code{e} flag executes the substitution result
     2026as a shell command.  As with the @code{r/R/w/W/e} commands, these
     2027must be terminated with a newline.  If whitespace, comments or semicolons
     2028are found, they will be included in the shell command or filename, leading to
     2029unexpected results:
     2030
     2031@codequoteundirected on
     2032@codequotebacktick on
     2033@example
     2034@group
     2035$ echo a | sed 's/a/b/w1.txt#foo'
     2036b
     2037
     2038$ ls -1
     20391.txt#foo
     2040@end group
     2041@end example
     2042@codequoteundirected off
     2043@codequotebacktick off
     2044
     2045@end table
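
One way to avoid the filename pitfall described above for the @code{w} command is to end the command with a separate @option{-e} option (or a newline) instead of a semicolon.  The following is a minimal sketch, assuming GNU @command{sed}; the scratch directory created with @command{mktemp} and the file name @file{hello.txt} are illustrative only:

```shell
# Scratch directory so the example does not litter the current directory.
tmpdir=$(mktemp -d)

# Each -e ends the previous command, so the filename stops at "hello.txt".
out=$(seq 2 | sed -e "1w $tmpdir/hello.txt" -e '2d')
written=$(cat "$tmpdir/hello.txt")
files=$(ls "$tmpdir")

echo "stdout:  $out"
echo "written: $written"
echo "files:   $files"

rm -r "$tmpdir"
```

Here the second command (@code{2d}) is parsed as a command rather than as part of the filename, so only @file{hello.txt} is created.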
     2046
     2047
     2048@node sed addresses
     2049@chapter Addresses: selecting lines
     2050
     2051@menu
     2052* Addresses overview::                Addresses overview
     2053* Numeric Addresses::                 selecting lines by numbers
     2054* Regexp Addresses::                  selecting lines by text matching
     2055* Range Addresses::                   selecting a range of lines
     2056* Zero Address::                      Using address @code{0}
     2057@end menu
     2058
     2059@node Addresses overview
     2060@section Addresses overview
     2061
     2062@cindex addresses, numeric
     2063@cindex numeric addresses
     2064Addresses determine on which line(s) the @command{sed} command will be
     2065executed. The following command replaces the first occurrence of @samp{hello}
     2066with @samp{world} only on line 144:
     2067
     2068@codequoteundirected on
     2069@codequotebacktick on
     2070@example
     2071sed '144s/hello/world/' input.txt > output.txt
     2072@end example
     2073@codequoteundirected off
     2074@codequotebacktick off
     2075
     2076
     2077
     2078If no address is specified, the command is performed on all lines.
     2079The following command replaces @samp{hello} with @samp{world},
     2080targeting every line of the input file.
     2081However, note that it modifies only the first instance of @samp{hello}
     2082on each line.
     2083Use the @samp{g} modifier to affect every instance on each affected line.
     2084
     2085@codequoteundirected on
     2086@codequotebacktick on
     2087@example
     2088sed 's/hello/world/' input.txt > output.txt
     2089@end example
     2090@codequoteundirected off
     2091@codequotebacktick off
     2092
     2093
     2094
     2095@cindex addresses, regular expression
     2096@cindex regular expression addresses
     2097Addresses can contain regular expressions to match lines based
     2098on content instead of line numbers. The following command replaces
     2099@samp{hello} with @samp{world} only on lines
     2100containing the string @samp{apple}:
     2101
     2102@codequoteundirected on
     2103@codequotebacktick on
     2104@example
     2105sed '/apple/s/hello/world/' input.txt > output.txt
     2106@end example
     2107@codequoteundirected off
     2108@codequotebacktick off
     2109
     2110
     2111
     2112@cindex addresses, range
     2113@cindex range addresses
     2114An address range is specified with two addresses separated by a comma
     2115(@code{,}). Addresses can be numeric, regular expressions, or a mix of
     2116both.
     2117The following command replaces @samp{hello} with @samp{world}
     2118only on lines 4 to 17 (inclusive):
     2119
     2120@codequoteundirected on
     2121@codequotebacktick on
     2122@example
     2123sed '4,17s/hello/world/' input.txt > output.txt
     2124@end example
     2125@codequoteundirected off
     2126@codequotebacktick off
     2127
     2128
     2129
     2130@cindex Excluding lines
     2131@cindex Selecting non-matching lines
     2132@cindex addresses, negating
     2133@cindex addresses, excluding
     2134Appending the @code{!} character to the end of an address
     2135specification (before the command letter) negates the sense of the
     2136match.  That is, if the @code{!} character follows an address or an
     2137address range, then only lines which do @emph{not} match the addresses
     2138will be selected. The following command replaces @samp{hello}
     2139with @samp{world} only on lines @emph{not} containing the string
     2140@samp{apple}:
     2141
     2142@example
     2143sed '/apple/!s/hello/world/' input.txt > output.txt
     2144@end example
     2145
     2146The following command replaces @samp{hello} with
     2147@samp{world} only on lines 1 to 3 and from line 18 to the last line of the
     2148input file (i.e. excluding lines 4 to 17):
     2149
     2150@example
     2151sed '4,17!s/hello/world/' input.txt > output.txt
     2152@end example
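
The effect of @code{!} is easier to see on a small input.  The following sketch, assuming GNU @command{sed} and made-up sample lines, applies a negated regular expression address and a negated range:

```shell
# Substitute only on lines that do NOT contain "apple".
negated_regex=$(printf '%s\n' 'hello apple' 'hello pear' |
  sed '/apple/!s/hello/world/')
echo "$negated_regex"

# Print only the lines OUTSIDE the range 2,4.
negated_range=$(seq 6 | sed -n '2,4!p')
echo "$negated_range"
```

The first command leaves @samp{hello apple} untouched and changes the other line; the second prints lines 1, 5 and 6.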
     2153
     2154
     2155
     2156
     2157
     2158@node Numeric Addresses
     2159@section Selecting lines by numbers
     2160@cindex Addresses, in @command{sed} scripts
     2161@cindex Line selection
     2162@cindex Selecting lines to process
     2163
     2164Addresses in a @command{sed} script can be in any of the following forms:
     2165@table @code
     2166@item @var{number}
     2167@cindex Address, numeric
     2168@cindex Line, selecting by number
     2169Specifying a line number will match only that line in the input.
     2170(Note that @command{sed} counts lines continuously across all input files
     2171unless @option{-i} or @option{-s} options are specified.)
     2172
     2173@item $
     2174@cindex Address, last line
     2175@cindex Last line, selecting
     2176@cindex Line, selecting last
     2177This address matches the last line of the last file of input, or
     2178the last line of each file when the @option{-i} or @option{-s} options
     2179are specified.
     2180
     2181
     2182@item @var{first}~@var{step}
     2183@cindex GNU extensions, @samp{@var{n}~@var{m}} addresses
     2184This GNU extension matches every @var{step}th line
     2185starting with line @var{first}.
     2186In particular, lines will be selected when there exists
     2187a non-negative @var{n} such that the current line-number equals
     2188@var{first} + (@var{n} * @var{step}).
     2189Thus, one would use @code{1~2} to select the odd-numbered lines and
     2190@code{0~2} for even-numbered lines;
     2191to pick every third line starting with the second, @samp{2~3} would be used;
     2192to pick every fifth line starting with the tenth, use @samp{10~5};
     2193and @samp{50~0} is just an obscure way of saying @code{50}.
     2194
     2195The following commands demonstrate the step address usage:
     2196
     2197@example
     2198$ seq 10 | sed -n '0~4p'
     21994
     22008
     2201
     2202$ seq 10 | sed -n '1~3p'
     22031
     22044
     22057
     220610
     2207@end example
     2208
     2209
     2210@end table
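
The difference that @option{-s} makes to the @code{$} address can be sketched as follows.  This assumes GNU @command{sed} (the @option{-s} option is an extension); the two scratch files @file{a.txt} and @file{b.txt} are hypothetical:

```shell
tmpdir=$(mktemp -d)
seq 1 3 > "$tmpdir/a.txt"
seq 4 6 > "$tmpdir/b.txt"

# Without -s, the inputs form one stream: $ is the last line of the last file.
last_overall=$(sed -n '$p' "$tmpdir/a.txt" "$tmpdir/b.txt")

# With -s, each file is treated separately: $ matches once per file.
last_each=$(sed -s -n '$p' "$tmpdir/a.txt" "$tmpdir/b.txt")

echo "$last_overall"
echo "$last_each"

rm -r "$tmpdir"
```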
     2211
     2212
     2213
     2214@node Regexp Addresses
     2215@section Selecting lines by text matching
     2216
     2217@value{SSED} supports the following regular expression addresses.
     2218The default regular expression is
     2219@ref{BRE syntax, , Basic Regular Expression (BRE)}.
     2220If the @option{-E} or @option{-r} option is used, the regular expression should be
     2221in @ref{ERE syntax, , Extended Regular Expression (ERE)} syntax.
     2222@xref{BRE vs ERE}.
     2223
     2224@table @code
     2225@item /@var{regexp}/
     2226@cindex Address, as a regular expression
     2227@cindex Line, selecting by regular expression match
     2228This will select any line which matches the regular expression @var{regexp}.
     2229If @var{regexp} itself includes any @code{/} characters,
     2230each must be escaped by a backslash (@code{\}).
     2231
     2232The following command prints lines in @file{/etc/passwd}
     2233which end with @samp{bash}@footnote{
     2234There are of course many other ways to do the same,
     2235e.g.
     2236@example
     2237grep 'bash$' /etc/passwd
     2238awk -F: '$7 == "/bin/bash"' /etc/passwd
     2239@end example
     2240}:
     2241
     2242@example
     2243sed -n '/bash$/p' /etc/passwd
     2244@end example
     2245
     2246@cindex empty regular expression
     2247@cindex @value{SSEDEXT}, modifiers and the empty regular expression
     2248The empty regular expression @samp{//} repeats the last regular
     2249expression match (the same holds if the empty regular expression is
     2250passed to the @code{s} command).  Note that modifiers to regular expressions
     2251are evaluated when the regular expression is compiled, thus it is invalid to
     2252specify them together with the empty regular expression.
     2253
     2254@item \%@var{regexp}%
     2255(The @code{%} may be replaced by any other single character.)
     2256
     2257@cindex Slash character, in regular expressions
     2258This also matches the regular expression @var{regexp},
     2259but allows one to use a different delimiter than @code{/}.
     2260This is particularly useful if the @var{regexp} itself contains
     2261a lot of slashes, since it avoids the tedious escaping of every @code{/}.
     2262If @var{regexp} itself includes any delimiter characters,
     2263each must be escaped by a backslash (@code{\}).
     2264
     2265The following commands are equivalent. They print lines
     2266which start with @samp{/home/alice/documents/}:
     2267
     2268@example
     2269sed -n '/^\/home\/alice\/documents\//p'
     2270sed -n '\%^/home/alice/documents/%p'
     2271sed -n '\;^/home/alice/documents/;p'
     2272@end example
     2273
     2274
     2275@item /@var{regexp}/I
     2276@itemx \%@var{regexp}%I
     2277@cindex GNU extensions, @code{I} modifier
     2278@cindex case insensitive, regular expression
     2279The @code{I} modifier to regular-expression matching is a GNU
     2280extension which causes the @var{regexp} to be matched in
     2281a case-insensitive manner.
     2282
     2283In many other programming languages, a lower case @code{i} is used
     2284for case-insensitive regular expression matching. However, in @command{sed}
     2285the @code{i} is used for the insert command (@pxref{insert command}).
     2286
     2287Observe the difference between the following examples.
     2288
     2289In this example, @code{/b/I} is the address: a regular expression with the @code{I}
     2290modifier. @code{d} is the delete command:
     2291
     2292@example
     2293$ printf "%s\n" a b c | sed '/b/Id'
     2294a
     2295c
     2296@end example
     2297
     2298Here, @code{/b/} is the address: a regular expression.
     2299@code{i} is the insert command.
     2300@code{d} is the value to insert.
     2301A line with @samp{d} is then inserted above the matched line:
     2302
     2303@example
     2304$ printf "%s\n" a b c | sed '/b/id'
     2305a
     2306d
     2307b
     2308c
     2309@end example
     2310
     2311@item /@var{regexp}/M
     2312@itemx \%@var{regexp}%M
     2313@cindex @value{SSEDEXT}, @code{M} modifier
     2314The @code{M} modifier to regular-expression matching is a @value{SSED}
     2315extension which directs @value{SSED} to match the regular expression
     2316in @cite{multi-line} mode.  The modifier causes @code{^} and @code{$} to
     2317match respectively (in addition to the normal behavior) the empty string
     2318after a newline, and the empty string before a newline.  There are
     2319special character sequences
     2320@ifclear PERL
     2321(@code{\`} and @code{\'})
     2322@end ifclear
     2323which always match the beginning or the end of the buffer.
     2324In addition,
     2325the period character does not match a new-line character in
     2326multi-line mode.
     2327@end table
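
The empty regular expression described in the table above can be sketched with a short example.  It assumes GNU @command{sed} and made-up input; the address @code{/an/} selects the line, and the empty pattern in @code{s//AN/} reuses that same regular expression:

```shell
# s//AN/ reuses the last regular expression used, here /an/.
reused=$(printf '%s\n' apple banana cherry | sed -n '/an/s//AN/p')
echo "$reused"
```

Only the first occurrence of @samp{an} on the matching line is replaced, because no @code{g} flag is given.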
     2328
     2329
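The @code{M} modifier only makes a difference when the pattern space holds more than one line.  In the following sketch, which assumes GNU @command{sed}, the @code{N} command joins two input lines so that the pattern space contains an embedded newline:

```shell
# Pattern space is "a\nb": without M, ^ matches only at its very start.
without_m=$(printf 'a\nb\n' | sed -n 'N;/^b/p')

# With M, ^ also matches just after the embedded newline.
with_m=$(printf 'a\nb\n' | sed -n 'N;/^b/Mp')

echo "without M: [$without_m]"
echo "with M:    [$with_m]"
```

The first command prints nothing; the second prints both joined lines.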
     2330@cindex regex addresses and pattern space
     2331@cindex regex addresses and input lines
     2332Regex addresses operate on the content of the current
     2333pattern space. If the pattern space is changed (for example with @code{s///}
     2334command) the regular expression matching will operate on the changed text.
     2335
     2336In the following example, automatic printing is disabled with
     2337@option{-n}.  The @code{s/2/X/} command replaces @samp{2} with
     2338@samp{X} on lines containing it. The command @code{/[0-9]/p} matches
     2339lines with digits and prints them.
     2340Because the second line is changed before the @code{/[0-9]/} regex is
     2341evaluated, it no longer matches and is not printed:
     2342
     2343@codequoteundirected on
     2344@codequotebacktick on
     2345@example
     2346@group
     2347$ seq 3 | sed -n 's/2/X/ ; /[0-9]/p'
     23481
     23493
     2350@end group
     2351@end example
     2352@codequoteundirected off
     2353@codequotebacktick off
     2354
     2355
     2356@node Range Addresses
     2357@section Range Addresses
     2358
     2359@cindex Range of lines
     2360@cindex Several lines, selecting
     2361An address range can be specified by specifying two addresses
     2362separated by a comma (@code{,}).  An address range matches lines
     2363starting from where the first address matches, and continues
     2364until the second address matches (inclusively):
     2365
     2366@example
     2367$ seq 10 | sed -n '4,6p'
     23684
     23695
     23706
     2371@end example
     2372
     2373If the second address is a @var{regexp}, then checking for the
     2374ending match will start with the line @emph{following} the
     2375line which matched the first address: a range will always
     2376span at least two lines (except of course if the input stream
     2377ends).
     2378
     2379@example
     2380$ seq 10 | sed -n '4,/[0-9]/p'
     23814
     23825
     2383@end example
     2384
     2385If the second address is a @var{number} less than (or equal to)
     2386the line matching the first address, then only the one line is
     2387matched:
     2388
     2389@example
     2390$ seq 10 | sed -n '4,1p'
     23914
     2392@end example
     2393
     2394@anchor{Zero Address Regex Range}
     2395@cindex Special addressing forms
     2396@cindex Range with start address of zero
     2397@cindex Zero, as range start address
     2398@cindex @var{addr1},+N
     2399@cindex @var{addr1},~N
     2400@cindex GNU extensions, special two-address forms
     2401@cindex GNU extensions, @code{0} address
     2402@cindex GNU extensions, 0,@var{addr2} addressing
     2403@cindex GNU extensions, @var{addr1},+@var{N} addressing
     2404@cindex GNU extensions, @var{addr1},~@var{N} addressing
     2405@value{SSED} also supports some special two-address forms; all these
     2406are GNU extensions:
     2407@table @code
     2408@item 0,/@var{regexp}/
     2409A line number of @code{0} can be used in an address specification like
     2410@code{0,/@var{regexp}/} so that @command{sed} will try to match
     2411@var{regexp} in the first input line too.  In other words,
     2412@code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/},
     2413except that if @var{regexp} matches the very first line of input the
     2414@code{0,/@var{regexp}/} form will consider it to end the range, whereas
     2415the @code{1,/@var{regexp}/} form will match the beginning of its range and
     2416hence make the range span up to the @emph{second} occurrence of the
     2417regular expression.
     2418
     2419The following examples demonstrate the difference between starting
     2420with address 1 and 0:
     2421
     2422@example
     2423$ seq 10 | sed -n '1,/[0-9]/p'
     24241
     24252
     2426
     2427$ seq 10 | sed -n '0,/[0-9]/p'
     24281
     2429@end example
     2430
     2431
     2432@item @var{addr1},+@var{N}
     2433Matches @var{addr1} and the @var{N} lines following @var{addr1}.
     2434
     2435@example
     2436$ seq 10 | sed -n '6,+2p'
     24376
     24387
     24398
     2440@end example
     2441
     2442@var{addr1} can be a line number or a regular expression.
     2443
     2444@item @var{addr1},~@var{N}
     2445Matches @var{addr1} and the lines following @var{addr1}
     2446until the next line whose input line number is a multiple of @var{N}.
     2447The following command prints starting at line 6, until the next line which
     2448is a multiple of 4 (i.e. line 8):
     2449
     2450@example
     2451$ seq 10 | sed -n '6,~4p'
     24526
     24537
     24548
     2455@end example
     2456
     2457@var{addr1} can be a line number or a regular expression.
     2458
     2459@end table
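
Since @var{addr1} can also be a regular expression, a range such as @code{@var{addr1},+@var{N}} may start more than once in the same input.  A minimal sketch, assuming GNU @command{sed}:

```shell
# Each line containing 3 or 7 starts a new range of two lines.
regex_plus=$(seq 10 | sed -n '/[37]/,+1p')
echo "$regex_plus"
```

The range starts at line 3 and again at line 7, so lines 3, 4, 7 and 8 are printed.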
     2460
     2461
     2462
     2463@node Zero Address
     2464@section Zero Address
     2465@cindex Zero Address
     2466As a @value{SSED} extension, the @code{0} address can be used in two cases:
     2467@enumerate
     2468@item
     2469In a regex range address, as @code{0,/@var{regexp}/}
     2470(@pxref{Zero Address Regex Range}).
     2471@item
     2472With the @code{r} command, inserting a file before the first line
     2473(@pxref{Adding a header to multiple files}).
     2474@end enumerate
     2475
     2476Note that these are the only places where the @code{0} address makes
     2477sense; commands which are given the @code{0} address in any
     2478other way will give an error.
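
The restriction can be checked from the shell.  The sketch below assumes GNU @command{sed} and relies only on the exit status, not on the wording of the error message, which is discarded:

```shell
# Valid: 0 as the start of a regex range.
first_match=$(seq 10 | sed -n '0,/[0-9]/p')
echo "$first_match"

# Invalid: 0 as a standalone address; sed reports an error and exits non-zero.
if seq 3 | sed -n '0p' 2>/dev/null; then
  zero_alone=accepted
else
  zero_alone=rejected
fi
echo "$zero_alone"
```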
     2479
     2480
     2481
     2482@node sed regular expressions
     2483@chapter Regular Expressions: selecting text
     2484
     2485@menu
     2486* Regular Expressions Overview:: Overview of regular expressions in @command{sed}
     2487* BRE vs ERE::               Basic (BRE) and extended (ERE) regular expression
     2488                             syntax
     2489* BRE syntax::               Overview of basic regular expression syntax
     2490* ERE syntax::               Overview of extended regular expression syntax
     2491* Character Classes and Bracket Expressions::
     2492* regexp extensions::        Additional regular expression commands
     2493* Back-references and Subexpressions:: Back-references and Subexpressions
     2494* Escapes::                  Specifying special characters
     2495* Locale Considerations::    Multibyte characters and locale considerations
     2496@end menu
     2497
     2498@node Regular Expressions Overview
     2499@section Overview of regular expressions in @command{sed}
     2500
     2501@c NOTE: Keep examples in the 'overview' section
     2502@c neutral in regards to BRE/ERE - to ease understanding.
     2503
     2504
     2505To use @command{sed} effectively, you should understand regular
     2506expressions (@dfn{regexp} for short).  A regular expression
     2507is a pattern that is matched against a
     2508subject string from left to right.  Most characters are
     2509@dfn{ordinary}: they stand for
     2510themselves in a pattern, and match the corresponding characters.
     2511Regular expressions in @command{sed} are specified between two
     2512slashes.
     2513
     2514The following command prints lines containing the string @samp{hello}:
     2515
     2516@example
     2517sed -n '/hello/p'
     2518@end example
     2519
     2520The above example is equivalent to this @command{grep} command:
     2521
     2522@example
     2523grep 'hello'
     2524@end example
     2525
     2526The power of regular expressions comes from the ability to include
     2527alternatives and repetitions in the pattern.  These are encoded in the
     2528pattern by the use of @dfn{special characters}, which do not stand for
     2529themselves but instead are interpreted in some special way.
     2530
     2531The character @code{^} (caret) in a regular expression matches the
     2532beginning of the line. The character @code{.} (dot) matches any single
     2533character. The following @command{sed} command matches and prints
     2534lines which start with the letter @samp{b}, followed by any single character,
     2535followed by the letter @samp{d}:
     2536
     2537@example
     2538$ printf "%s\n" abode bad bed bit bid byte body | sed -n '/^b.d/p'
     2539bad
     2540bed
     2541bid
     2542body
     2543@end example
     2544
     2545The following sections explain the meaning and usage of special
     2546characters in regular expressions.
     2547
     2548@node BRE vs ERE
     2549@section Basic (BRE) and extended (ERE) regular expression
     2550
     2551Basic and extended regular expressions are two variations on the
     2552syntax of the specified pattern. Basic Regular Expression (BRE) syntax is the
     2553default in @command{sed} (and similarly in @command{grep}).
     2554Use the POSIX-specified @option{-E} option (@option{-r},
     2555@option{--regexp-extended}) to enable Extended Regular Expression (ERE) syntax.
     2556
     2557In @value{SSED}, the only difference between basic and extended regular
     2558expressions is in the behavior of a few special characters: @samp{?},
     2559@samp{+}, parentheses, braces (@samp{@{@}}), and @samp{|}.
     2560
     2561With basic (BRE) syntax, these characters do not have special meaning
     2562unless prefixed with a backslash (@samp{\}), while with extended (ERE) syntax
     2563it is reversed: these characters are special unless they are prefixed
     2564with backslash (@samp{\}).
     2565
     2566@multitable @columnfractions .28 .36 .35
     2567
     2568@headitem Desired pattern
     2569@tab Basic (BRE) Syntax
     2570@tab Extended (ERE) Syntax
     2571
     2572@item literal @samp{+} (plus sign)
     2573
     2574@tab
     2575@exampleindent 0
     2576@codequoteundirected on
     2577@codequotebacktick on
     2578@example
     2579$ echo 'a+b=c' > foo
     2580$ sed -n '/a+b/p' foo
     2581a+b=c
     2582@end example
     2583@codequotebacktick off
     2584@codequoteundirected off
     2585
     2586@tab
     2587@exampleindent 0
     2588@codequoteundirected on
     2589@codequotebacktick on
     2590@example
     2591$ echo 'a+b=c' > foo
     2592$ sed -E -n '/a\+b/p' foo
     2593a+b=c
     2594@end example
     2595@codequotebacktick off
     2596@codequoteundirected off
     2597
     2598
     2599@item One or more @samp{a} characters followed by @samp{b}
     2600(plus sign as special meta-character)
     2601
     2602@tab
     2603@exampleindent 0
     2604@codequoteundirected on
     2605@codequotebacktick on
     2606@example
     2607$ echo aab > foo
     2608$ sed -n '/a\+b/p' foo
     2609aab
     2610@end example
     2611@codequotebacktick off
     2612@codequoteundirected off
     2613
     2614@tab
     2615@exampleindent 0
     2616@codequoteundirected on
     2617@codequotebacktick on
     2618@example
     2619$ echo aab > foo
     2620$ sed -E -n '/a+b/p' foo
     2621aab
     2622@end example
     2623@codequotebacktick off
     2624@codequoteundirected off
     2625
     2626@end multitable
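
The same reversal applies to parentheses and braces.  The following sketch, assuming GNU @command{sed} and made-up input strings, matches two repetitions of @samp{ab} in both syntaxes, and then shows that unescaped parentheses and braces are literal in BRE:

```shell
# Grouping and an interval: escaped in BRE, bare in ERE.
bre=$(echo 'abab' | sed -n '/\(ab\)\{2\}/p')
ere=$(echo 'abab' | sed -E -n '/(ab){2}/p')

# In BRE, unescaped ( ) { } match themselves.
bre_literal=$(echo 'f(x){2}' | sed -n '/f(x){2}/p')

echo "$bre"
echo "$ere"
echo "$bre_literal"
```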
     2627
     2628
     2629
     2630
     2631@node BRE syntax
     2632@section Overview of basic regular expression syntax
     2633
     2634Here is a brief description
     2635of regular expression syntax as used in @command{sed}.
     2636
     2637@table @code
     2638@item @var{char}
     2639A single ordinary character matches itself.
     2640
     2641@item *
     2642@cindex GNU extensions, to basic regular expressions
     2643Matches a sequence of zero or more instances of matches for the
     2644preceding regular expression, which must be an ordinary character, a
     2645special character preceded by @code{\}, a @code{.}, a grouped regexp
     2646(see below), or a bracket expression.  As a GNU extension, a
     2647postfixed regular expression can also be followed by @code{*}; for
     2648example, @code{a**} is equivalent to @code{a*}.  POSIX
     26491003.1-2001 says that @code{*} stands for itself when it appears at
     2650the start of a regular expression or subexpression, but many
     2651non-GNU implementations do not support this and portable
     2652scripts should instead use @code{\*} in these contexts.
     2653@item .
     2654Matches any character, including newline.
     2655
     2656@item ^
     2657Matches the null string at beginning of the pattern space, i.e. what
     2658appears after the circumflex must appear at the beginning of the
     2659pattern space.
     2660
     2661In most scripts, pattern space is initialized to the content of each
     2662line (@pxref{Execution Cycle, , How @code{sed} works}).  So, it is a
     2663useful simplification to think of @code{^#include} as matching only
     2664lines where @samp{#include} is the first thing on the line---if there is
     2665any preceding space, for example, the match fails.  This simplification is
     2666valid as long as the original content of pattern space is not modified,
     2667for example with an @code{s} command.
     2668
     2669@code{^} acts as a special character only at the beginning of the
     2670regular expression or subexpression (that is, after @code{\(} or
     2671@code{\|}).  Portable scripts should avoid @code{^} at the beginning of
     2672a subexpression, though, as POSIX allows implementations that
     2673treat @code{^} as an ordinary character in that context.
     2674
     2675@item $
     2676It is the same as @code{^}, but refers to the end of the pattern space.
     2677@code{$} also acts as a special character only at the end
     2678of the regular expression or subexpression (that is, before @code{\)}
     2679or @code{\|}), and its use at the end of a subexpression is not
     2680portable.
     2681
     2682
     2683@item [@var{list}]
     2684@itemx [^@var{list}]
     2685Matches any single character in @var{list}: for example,
     2686@code{[aeiou]} matches all vowels.  A list may include
     2687sequences like @code{@var{char1}-@var{char2}}, which
     2688matches any character between (inclusive) @var{char1}
     2689and @var{char2}.
     2690@xref{Character Classes and Bracket Expressions}.
     2691
     2692@item \+
     2693@cindex GNU extensions, to basic regular expressions
     2694As @code{*}, but matches one or more.  It is a GNU extension.
     2695
     2696@item \?
     2697@cindex GNU extensions, to basic regular expressions
     2698As @code{*}, but only matches zero or one.  It is a GNU extension.
     2699
     2700@item \@{@var{i}\@}
     2701As @code{*}, but matches exactly @var{i} sequences (@var{i} is a
     2702decimal integer; for portability, keep it between 0 and 255
     2703inclusive).
     2704
     2705@item \@{@var{i},@var{j}\@}
     2706As @code{*}, but matches between @var{i} and @var{j} sequences, inclusive.
     2707
     2708@item \@{@var{i},\@}
     2709As @code{*}, but matches @var{i} or more sequences.
     2710
     2711@item \(@var{regexp}\)
     2712Groups the inner @var{regexp} as a whole.  This is used to:
     2713
     2714@itemize @bullet
     2715@item
     2716@cindex GNU extensions, to basic regular expressions
     2717Apply postfix operators, like @code{\(abcd\)*}:
     2718this will search for zero or more whole sequences
     2719of @samp{abcd}, while @code{abcd*} would search
     2720for @samp{abc} followed by zero or more occurrences
     2721of @samp{d}.  Note that support for @code{\(abcd\)*} is
     2722required by POSIX 1003.1-2001, but many non-GNU
     2723implementations do not support it and hence it is not universally
     2724portable.
     2725
     2726@item
     2727Use back references (see below).
     2728@end itemize
     2729
     2730
     2731@item @var{regexp1}\|@var{regexp2}
     2732@cindex GNU extensions, to basic regular expressions
     2733Matches either @var{regexp1} or @var{regexp2}.  Use
     2734parentheses to use complex alternative regular expressions.
     2735The matching process tries each alternative in turn, from
     2736left to right, and the first one that succeeds is used.
     2737It is a GNU extension.
     2738
     2739@item @var{regexp1}@var{regexp2}
     2740Matches the concatenation of @var{regexp1} and @var{regexp2}.
     2741Concatenation binds more tightly than @code{\|}, @code{^}, and
     2742@code{$}, but less tightly than the other regular expression
     2743operators.
     2744
     2745@item \@var{digit}
     2746Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized
     2747subexpression in the regular expression.  This is called a @dfn{back
     2748reference}.  Subexpressions are implicitly numbered by counting
     2749occurrences of @code{\(} left-to-right.
     2750
     2751@item \n
     2752Matches the newline character.
     2753
     2754@item \@var{char}
     2755Matches @var{char}, where @var{char} is one of @code{$},
     2756@code{*}, @code{.}, @code{[}, @code{\}, or @code{^}.
     2757Note that the only C-like
     2758backslash sequences that you can portably assume to be
     2759interpreted are @code{\n} and @code{\\}; in particular
     2760@code{\t} is not portable, and matches a @samp{t} under most
     2761implementations of @command{sed}, rather than a tab character.
     2762
     2763@end table
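
Two of the constructs from the table above, back references and intervals, can be tried on small inputs.  This is a minimal sketch assuming GNU @command{sed}; the sample words are made up for illustration:

```shell
# \1 must repeat exactly what \(.*\) matched, so only the doubled word matches.
repeated=$(printf '%s\n' 'hello hello' 'hello world' |
  sed -n '/^\(.*\) \1$/p')
echo "$repeated"

# \{2,\} requires at least two a's before the final b.
interval=$(printf '%s\n' ab aab aaab | sed -n '/^a\{2,\}b$/p')
echo "$interval"
```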
     2764
     2765@cindex Greedy regular expression matching
     2766Note that the regular expression matcher is greedy, i.e., matches
     2767are attempted from left to right and, if two or more matches are
     2768possible starting at the same character, it selects the longest.
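
Greedy matching can be seen in the following sketch, which assumes GNU @command{sed} and a made-up input string.  The first command lets @code{.*} run to the last @samp{X}; the second uses a bracket expression, one common way to stop at the first @samp{X}:

```shell
# .* takes the longest match: everything from "a" up to the LAST "X".
greedy=$(echo 'aXbXc' | sed 's/a.*X/-/')

# [^X]* cannot cross an "X", so the match ends at the FIRST "X".
limited=$(echo 'aXbXc' | sed 's/a[^X]*X/-/')

echo "$greedy"
echo "$limited"
```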
     2769
     2770@noindent
     2771Examples:
     2772@table @samp
     2773@item abcdef
     2774Matches @samp{abcdef}.
     2775
     2776@item a*b
     2777Matches zero or more @samp{a}s followed by a single
     2778@samp{b}.  For example, @samp{b} or @samp{aaaaab}.
     2779
     2780@item a\?b
     2781Matches @samp{b} or @samp{ab}.
     2782
     2783@item a\+b\+
     2784Matches one or more @samp{a}s followed by one or more
     2785@samp{b}s: @samp{ab} is the shortest possible match, but
     2786other examples are @samp{aaaab} or @samp{abbbbb} or
     2787@samp{aaaaaabbbbbbb}.
     2788
     2789@item .*
     2790@itemx .\+
     2791These two both match all the characters in a string;
     2792however, the first matches every string (including the empty
     2793string), while the second matches only strings containing
     2794at least one character.
     2795
     2796@item ^main.*(.*)
     2797This matches a string starting with @samp{main},
     2798followed by an opening and closing
     2799parenthesis.  The @samp{n}, @samp{(} and @samp{)} need not
     2800be adjacent.
     2801
     2802@item ^#
     2803This matches a string beginning with @samp{#}.
     2804
     2805@item \\$
     2806This matches a string ending with a single backslash.  The
     2807regexp contains two backslashes for escaping.
     2808
     2809@item \$
     2810Instead, this matches a string consisting of a single dollar sign,
     2811because it is escaped.
     2812
     2813@item [a-zA-Z0-9]
     2814In the C locale, this matches any single ASCII letter or digit.
     2815
     2816@item [^ @kbd{@key{TAB}}]\+
     2817(Here @kbd{@key{TAB}} stands for a single tab character.)
     2818This matches a string of one or more
     2819characters, none of which is a space or a tab.
     2820Usually this means a word.
     2821
     2822@item ^\(.*\)\n\1$
     2823This matches a string consisting of two equal substrings separated by
     2824a newline.
     2825
     2826@item .\@{9\@}A$
     2827This matches nine characters followed by an @samp{A} at the end of a line.
     2828
     2829@item ^.\@{15\@}A
     2830This matches the start of a string that contains 16 characters,
     2831the last of which is an @samp{A}.
     2832
     2833@end table
     2834
     2835
     2836@node ERE syntax
     2837@section Overview of extended regular expression syntax
     2838@cindex Extended regular expressions, syntax
     2839
     2840The only difference between basic and extended regular expressions is in
     2841the behavior of a few characters: @samp{?}, @samp{+}, parentheses,
     2842braces (@samp{@{@}}), and @samp{|}.  While basic regular expressions
     2843require these to be escaped if you want them to behave as special
     2844characters, when using extended regular expressions you must escape
     2845them if you want them @emph{to match a literal character}.  @samp{|}
     2846is special here because @samp{\|} is a GNU extension -- standard
     2847basic regular expressions do not provide its functionality.
     2848
     2849@noindent
     2850Examples:
     2851@table @code
     2852@item abc?
     2853becomes @samp{abc\?} when using extended regular expressions.  It matches
     2854the literal string @samp{abc?}.
     2855
     2856@item c\+
     2857becomes @samp{c+} when using extended regular expressions.  It matches
     2858one or more @samp{c}s.
     2859
     2860@item a\@{3,\@}
     2861becomes @samp{a@{3,@}} when using extended regular expressions.  It matches
     2862three or more @samp{a}s.
     2863
     2864@item \(abc\)\@{2,3\@}
     2865becomes @samp{(abc)@{2,3@}} when using extended regular expressions.  It
     2866matches either @samp{abcabc} or @samp{abcabcabc}.
     2867
     2868@item \(abc*\)\1
     2869becomes @samp{(abc*)\1} when using extended regular expressions.
     2870Backreferences must still be escaped when using extended regular
     2871expressions.
     2872
     2873@item a\|b
     2874becomes @samp{a|b} when using extended regular expressions.  It matches
     2875@samp{a} or @samp{b}.
     2876@end table
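
The correspondence in the table above can be tried directly by running
each pattern with and without @option{-E}.  This is a small sketch,
assuming a @command{sed} that accepts @option{-E}; the sample inputs are
illustrative only.

```shell
# Interval: escaped braces in BRE, bare braces in ERE.
echo 'aaaa' | sed 's/a\{3,\}/X/'       # prints: X
echo 'aaaa' | sed -E 's/a{3,}/X/'      # prints: X

# Literal question mark: bare in BRE, escaped in ERE.
echo 'abc?' | sed 's/abc?/X/'          # prints: X
echo 'abc?' | sed -E 's/abc\?/X/'      # prints: X
```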
     2877
     2878@node Character Classes and Bracket Expressions
     2879@section Character Classes and Bracket Expressions
     2880
     2881@c The 'character class' section is shamelessly copied from grep's manual.
     2882
     2883@cindex bracket expression
     2884@cindex character class
     2885A @dfn{bracket expression} is a list of characters enclosed by @samp{[} and
     2886@samp{]}.
     2887It matches any single character in that list;
     2888if the first character of the list is the caret @samp{^},
     2889then it matches any character @strong{not} in the list.
     2890For example, the following command replaces the strings
     2891@samp{gray} or @samp{grey} with @samp{blue}:
     2892
     2893@example
     2894sed  's/gr[ae]y/blue/'
     2895@end example
     2896
     2897@c TODO: fix 'ref' to look good in both HTML and PDF
     2898Bracket expressions can be used in both
     2899@ref{BRE syntax,,basic} and @ref{ERE syntax,,extended}
     2900regular expressions (that is, with or without the @option{-E}/@option{-r}
     2901options).
     2902
     2903@cindex range expression
     2904Within a bracket expression, a @dfn{range expression} consists of two
     2905characters separated by a hyphen.
     2906It matches any single character that
     2907sorts between the two characters, inclusive.
     2908In the default C locale, the sorting sequence is the native character
     2909order; for example, @samp{[a-d]} is equivalent to @samp{[abcd]}.
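
As a quick check of the equivalence just described, the sketch below
forces the @samp{C} locale with @env{LC_ALL} and runs both forms on the
same made-up input.

```shell
# In the C locale the range [a-d] covers exactly a, b, c and d.
echo 'abcdef ABCD' | LC_ALL=C sed 's/[a-d]/X/g'    # prints: XXXXef ABCD
echo 'abcdef ABCD' | LC_ALL=C sed 's/[abcd]/X/g'   # prints: XXXXef ABCD
```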
     2910
     2911
     2912Finally, certain named classes of characters are predefined within
     2913bracket expressions, as follows.
     2914
     2915These named classes must be used @emph{inside} brackets
     2916themselves. Correct usage:
     2917@example
     2918$ echo 1 | sed 's/[[:digit:]]/X/'
     2919X
     2920@end example
     2921
     2922Incorrect usage is rejected by newer @command{sed} versions.
     2923Older versions accepted it but treated it as a single bracket expression
     2924(which is equivalent to @samp{[dgit:]},
     2925that is, only the characters @var{d/g/i/t/:}):
     2926@example
     2927# current GNU sed versions - incorrect usage rejected
     2928$ echo 1 | sed 's/[:digit:]/X/'
     2929sed: character class syntax is [[:space:]], not [:space:]
     2930
     2931# older GNU sed versions
     2932$ echo 1 | sed 's/[:digit:]/X/'
     29331
     2934@end example
     2935
     2936
     2937@cindex classes of characters
     2938@cindex character classes
     2939@cindex named character classes
     2940@table @samp
     2941
     2942@item [:alnum:]
     2943@opindex alnum @r{character class}
     2944@cindex alphanumeric characters
     2945Alphanumeric characters:
     2946@samp{[:alpha:]} and @samp{[:digit:]}; in the @samp{C} locale and ASCII
     2947character encoding, this is the same as @samp{[0-9A-Za-z]}.
     2948
     2949@item [:alpha:]
     2950@opindex alpha @r{character class}
     2951@cindex alphabetic characters
     2952Alphabetic characters:
     2953@samp{[:lower:]} and @samp{[:upper:]}; in the @samp{C} locale and ASCII
     2954character encoding, this is the same as @samp{[A-Za-z]}.
     2955
     2956@item [:blank:]
     2957@opindex blank @r{character class}
     2958@cindex blank characters
     2959Blank characters:
     2960space and tab.
     2961
     2962@item [:cntrl:]
     2963@opindex cntrl @r{character class}
     2964@cindex control characters
     2965Control characters.
     2966In ASCII, these characters have octal codes 000
     2967through 037, and 177 (DEL).
     2968In other character sets, these are
     2969the equivalent characters, if any.
     2970
     2971@item [:digit:]
     2972@opindex digit @r{character class}
     2973@cindex digit characters
     2974@cindex numeric characters
     2975Digits: @code{0 1 2 3 4 5 6 7 8 9}.
     2976
     2977@item [:graph:]
     2978@opindex graph @r{character class}
     2979@cindex graphic characters
     2980Graphical characters:
     2981@samp{[:alnum:]} and @samp{[:punct:]}.
     2982
     2983@item [:lower:]
     2984@opindex lower @r{character class}
     2985@cindex lower-case letters
     2986Lower-case letters; in the @samp{C} locale and ASCII character
     2987encoding, this is
     2988@code{a b c d e f g h i j k l m n o p q r s t u v w x y z}.
     2989
     2990@item [:print:]
     2991@opindex print @r{character class}
     2992@cindex printable characters
     2993Printable characters:
     2994@samp{[:alnum:]}, @samp{[:punct:]}, and space.
     2995
     2996@item [:punct:]
     2997@opindex punct @r{character class}
     2998@cindex punctuation characters
     2999Punctuation characters; in the @samp{C} locale and ASCII character
     3000encoding, this is
     3001@code{!@: " # $ % & ' ( ) * + , - .@: / : ; < = > ?@: @@ [ \ ] ^ _ ` @{ | @} ~}.
     3002
     3003@item [:space:]
     3004@opindex space @r{character class}
     3005@cindex space characters
     3006@cindex whitespace characters
     3007Space characters: in the @samp{C} locale, this is
     3008tab, newline, vertical tab, form feed, carriage return, and space.
     3009
     3010
     3011@item [:upper:]
     3012@opindex upper @r{character class}
     3013@cindex upper-case letters
     3014Upper-case letters: in the @samp{C} locale and ASCII character
     3015encoding, this is
     3016@code{A B C D E F G H I J K L M N O P Q R S T U V W X Y Z}.
     3017
     3018@item [:xdigit:]
     3019@opindex xdigit @r{character class}
     3020@cindex xdigit class
     3021@cindex hexadecimal digits
     3022Hexadecimal digits:
     3023@code{0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f}.
     3024
     3025@end table
     3026Note that the brackets in these class names are
     3027part of the symbolic names, and must be included in addition to
     3028the brackets delimiting the bracket expression.
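
Because the class names are ordinary list items, several of them can
share one bracket expression with literal characters.  The following
sketch assumes the @samp{C} locale (forced with @env{LC_ALL}) and an
invented input string.

```shell
# Two named classes plus a literal underscore in one bracket expression.
echo 'a1_ b2!' | LC_ALL=C sed 's/[[:alpha:][:digit:]_]/X/g'   # prints: XXX XX!

# A leading caret negates the whole list, named class included.
echo 'a1_ b2!' | LC_ALL=C sed 's/[^[:alnum:]]/X/g'            # prints: a1XXb2X
```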
     3029
     3030Most meta-characters lose their special meaning inside bracket expressions:
     3031
     3032@table @samp
     3033@item ]
     3034ends the bracket expression if it's not the first list item.
     3035So, if you want to make the @samp{]} character a list item,
     3036you must put it first.
     3037
     3038@item -
     3039represents the range if it's not first or last in a list or the ending point
     3040of a range.
     3041
     3042@item ^
     3043represents the characters not in the list.
     3044If you want to make the @samp{^}
     3045character a list item, place it anywhere but first.
     3046@end table
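
The three placement rules above can be combined in a single list.  In
this sketch (the input is made up) the @samp{]} comes first, the
@samp{^} is not first, and the @samp{-} is last, so all three are
ordinary list items:

```shell
# []^-] matches a literal ']', '^' or '-'.
printf '%s\n' 'a]b-c^d' | sed 's/[]^-]/X/g'    # prints: aXbXcXd
```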
     3047
     3048TODO: incorporate this paragraph (copied verbatim from BRE section).
     3049
     3050@cindex @code{POSIXLY_CORRECT} behavior, bracket expressions
     3051The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\}
     3052are normally not special within @var{list}.  For example, @code{[\*]}
     3053matches either @samp{\} or @samp{*}, because the @code{\} is not
     3054special here.  However, strings like @code{[.ch.]}, @code{[=a=]}, and
     3055@code{[:space:]} are special within @var{list} and represent collating
     3056symbols, equivalence classes, and character classes, respectively, and
     3057@code{[} is therefore special within @var{list} when it is followed by
     3058@code{.}, @code{=}, or @code{:}.  Also, when not in
     3059@env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and
     3060@code{\t} are recognized within @var{list}.  @xref{Escapes}.
     3061@c ********
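
The @code{[\*]} case from the paragraph above can be tried as follows.
This is a minimal sketch assuming GNU @command{sed}; @command{printf} is
used instead of @command{echo} so that the shell does not interpret the
backslash in the sample input.

```shell
# Inside the list, the backslash is an ordinary character, so both the
# backslash and the star are replaced.
printf '%s\n' 'a\b*c' | sed 's/[\*]/X/g'    # prints: aXbXc
```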
     3062
     3063
     3064@c TODO: improve explanation about collation classes and equivalence classes
     3065@c       perhaps dedicate a section to Locales ??
     3066
     3067@table @samp
     3068@item [.
     3069represents the open collating symbol.
     3070
     3071@item .]
     3072represents the close collating symbol.
     3073
     3074@item [=
     3075represents the open equivalence class.
     3076
     3077@item =]
     3078represents the close equivalence class.
     3079
     3080@item [:
     3081represents the open character class symbol, and should be followed by a
     3082valid character class name.
     3083
     3084@item :]
     3085represents the close character class symbol.
     3086@end table
     3087
     3088
     3089@node regexp extensions
     3090@section regular expression extensions
     3091
     3092The following sequences have special meaning inside regular expressions
     3093(used in @ref{Regexp Addresses,,addresses} and the @code{s} command).
     3094
     3095These can be used in both
     3096@ref{BRE syntax,,basic} and @ref{ERE syntax,,extended}
     3097regular expressions (that is, with or without the @option{-E}/@option{-r}
     3098options).
     3099
     3100@table @code
     3101@item \w
     3102Matches any ``word'' character.  A ``word'' character is any
     3103letter or digit or the underscore character.
     3104
     3105@example
     3106$ echo "abc %-= def." | sed 's/\w/X/g'
     3107XXX %-= XXX.
     3108@end example
     3109
     3110
     3111@item \W
     3112Matches any ``non-word'' character.
     3113
     3114@example
     3115$ echo "abc %-= def." | sed 's/\W/X/g'
     3116abcXXXXXdefX
     3117@end example
     3118
     3119
     3120@item \b
     3121Matches a word boundary; that is, it matches if the character
     3122to the left is a ``word'' character and the character to the
     3123right is a ``non-word'' character, or vice-versa.
     3124
     3125@example
     3126$ echo "abc %-= def." | sed 's/\b/X/g'
     3127XabcX %-= XdefX.
     3128@end example
     3129
     3130
     3131@item \B
     3132Matches everywhere but on a word boundary; that is, it matches
     3133if the character to the left and the character to the right
     3134are either both ``word'' characters or both ``non-word''
     3135characters.
     3136
     3137@example
     3138$ echo "abc %-= def." | sed 's/\B/X/g'
     3139aXbXc X%X-X=X dXeXf.X
     3140@end example
     3141
     3142
     3143@item \s
     3144Matches whitespace characters (spaces and tabs).
     3145Newlines embedded in the pattern/hold spaces will also match:
     3146
     3147@example
     3148$ echo "abc %-= def." | sed 's/\s/X/g'
     3149abcX%-=Xdef.
     3150@end example
     3151
     3152
     3153@item \S
     3154Matches non-whitespace characters.
     3155
     3156@example
     3157$ echo "abc %-= def." | sed 's/\S/X/g'
     3158XXX XXX XXXX
     3159@end example
     3160
     3161
     3162@item \<
     3163Matches the beginning of a word.
     3164
     3165@example
     3166$ echo "abc %-= def." | sed 's/\</X/g'
     3167Xabc %-= Xdef.
     3168@end example
     3169
     3170
     3171@item \>
     3172Matches the end of a word.
     3173
     3174@example
     3175$ echo "abc %-= def." | sed 's/\>/X/g'
     3176abcX %-= defX.
     3177@end example
     3178
     3179
     3180@item \`
     3181Matches only at the start of pattern space.  This is different
     3182from @code{^} in multi-line mode.
     3183
     3184Compare the following two examples:
     3185
     3186@example
     3187$ printf "a\nb\nc\n" | sed 'N;N;s/^/X/gm'
     3188Xa
     3189Xb
     3190Xc
     3191
     3192$ printf "a\nb\nc\n" | sed 'N;N;s/\`/X/gm'
     3193Xa
     3194b
     3195c
     3196@end example
     3197
     3198@item \'
     3199Matches only at the end of pattern space.  This is different
     3200from @code{$} in multi-line mode.
     3201
     3202
     3203
     3204@end table
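
The @code{\'} anchor can be compared with @code{$} in the same way as
@code{\`} is compared with @code{^} above.  This is a sketch assuming
GNU @command{sed} (the @code{M} flag and @code{\'} are GNU extensions);
note that the second script is double-quoted, because a single quote
cannot appear inside a single-quoted shell string.

```shell
# '$' with the M flag matches at the end of every embedded line.
printf 'a\nb\nc\n' | sed 'N;N;s/$/X/gm'
# prints:
#   aX
#   bX
#   cX

# \' matches only at the very end of the pattern space.
printf 'a\nb\nc\n' | sed "N;N;s/\\'/X/gm"
# prints:
#   a
#   b
#   cX
```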
     3205
     3206
     3207@node Back-references and Subexpressions
     3208@section Back-references and Subexpressions
     3209@cindex subexpression
     3210@cindex back-reference
     3211
     3212@dfn{Back-references} are regular expression constructs which refer to a
     3213previous part of the matched regular expression.  Back-references are
     3214specified with backslash and a single digit (e.g. @samp{\1}).  The
     3215part of the regular expression they refer to is called a
     3216@dfn{subexpression}, and is designated with parentheses.
     3217
     3218Back-references and subexpressions are used in two cases: in the
     3219regular expression search pattern, and in the @var{replacement} part
     3220of the @command{s} command (@pxref{Regexp Addresses,,Regular
     3221Expression Addresses} and @ref{The "s" Command}).
     3222
     3223In a regular expression pattern, back-references are used to match
     3224the same content as a previously matched subexpression.  In the
     3225following example, the subexpression is @samp{.} - any single
     3226character (being surrounded by parentheses makes it a
     3227subexpression). The back-reference @samp{\1} asks to match the same
     3228content (same character) as the sub-expression.
     3229
     3230The command below matches words starting with any character,
     3231followed by the letter @samp{o}, followed by the same character as the
     3232first.
     3233
     3234@example
     3235$ sed -E -n '/^(.)o\1$/p' /usr/share/dict/words
     3236bob
     3237mom
     3238non
     3239pop
     3240sos
     3241tot
     3242wow
     3243@end example
     3244
     3245Multiple subexpressions are automatically numbered from
     3246left-to-right. This command searches for 6-letter
     3247palindromes (the first three letters are 3 subexpressions,
     3248followed by 3 back-references in reverse order):
     3249
     3250@example
     3251$ sed -E -n '/^(.)(.)(.)\3\2\1$/p' /usr/share/dict/words
     3252redder
     3253@end example
     3254
     3255In the @command{s} command, back-references can be
     3256used in the @var{replacement} part to refer back to subexpressions in
     3257the @var{regexp} part.
     3258
     3259The following example uses two subexpressions in the regular
     3260expression to match two space-separated words. The back-references in
     3261the @var{replacement} part print the words in a different order:
     3262
     3263@example
     3264$ echo "James Bond" | sed -E 's/(.*) (.*)/The name is \2, \1 \2./'
     3265The name is Bond, James Bond.
     3266@end example
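
Both uses can appear in one command.  The sketch below, which assumes
GNU @command{sed} for the @code{\b} and @code{\w} extensions and uses an
invented sentence, collapses a doubled word:

```shell
# \1 inside the pattern requires the second word to repeat the first;
# \1 in the replacement keeps a single copy.
echo 'this is is a test' | sed -E 's/\b(\w+) \1\b/\1/'    # prints: this is a test
```

The leading @code{\b} keeps the pattern from matching the @samp{is}
that ends @samp{this}.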
     3267
     3268
     3269When used with alternation, if the group does not participate in the
     3270match then the back-reference makes the whole match fail.  For
     3271example, @samp{a(.)|b\1} will not match @samp{ba}.  When multiple
     3272regular expressions are given with @option{-e} or from a file
     3273(@samp{-f @var{file}}), back-references are local to each expression.
     3274
     3275
    14663276@node Escapes
    1467 @section @acronym{GNU} Extensions for Escapes in Regular Expressions
    1468 
    1469 @cindex @acronym{GNU} extensions, special escapes
     3277@section Escape Sequences - specifying special characters
     3278
     3279@cindex GNU extensions, special escapes
    14703280Until this chapter, we have only encountered escapes of the form
    14713281@samp{\^}, which tell @command{sed} not to interpret the circumflex
     
    14763286@cindex @code{POSIXLY_CORRECT} behavior, escapes
    14773287This chapter introduces another kind of escape@footnote{All
    1478 the escapes introduced here are @acronym{GNU}
     3288the escapes introduced here are GNU
    14793289extensions, with the exception of @code{\n}.  In basic regular
    14803290expression mode, setting @code{POSIXLY_CORRECT} disables them inside
     
    15223332
    15233333@item \o@var{xxx}
    1524 @ifset PERL
    1525 @item \@var{xxx}
    1526 @end ifset
    15273334Produces or matches a character whose octal @sc{ascii} value is @var{xxx}.
    1528 @ifset PERL
    1529 The syntax without the @code{o} is active in Perl mode, while the one
    1530 with the @code{o} is active in the normal or extended @sc{posix} regular
    1531 expression modes.
    1532 @end ifset
    15333335
    15343336@item \x@var{xx}
     
    15393341the existing ``word boundary'' meaning.
    15403342
    1541 Other escapes match a particular character class and are valid only in
    1542 regular expressions:
    1543 
     3343@subsection Escaping Precedence
     3344
     3345@value{SSED} processes escape sequences @emph{before} passing
     3346the text on to the regular-expression matching of the @command{s///} command
     3347and address matching. Thus the following two commands are equivalent
     3348(@samp{0x5e} is the hexadecimal @sc{ascii} value of the character @samp{^}):
     3349
     3350@codequoteundirected on
     3351@codequotebacktick on
     3352@example
     3353@group
     3354$ echo 'a^c' | sed 's/^/b/'
     3355ba^c
     3356
     3357$ echo 'a^c' | sed 's/\x5e/b/'
     3358ba^c
     3359@end group
     3360@end example
     3361@codequoteundirected off
     3362@codequotebacktick off
     3363
     3364As are the following (@samp{0x5b},@samp{0x5d} are the hexadecimal
     3365@sc{ascii} values of @samp{[},@samp{]}, respectively):
     3366
     3367@codequoteundirected on
     3368@codequotebacktick on
     3369@example
     3370@group
     3371$ echo abc | sed 's/[a]/x/'
     3372xbc
     3373$ echo abc | sed 's/\x5ba\x5d/x/'
     3374xbc
     3375@end group
     3376@end example
     3377@codequoteundirected off
     3378@codequotebacktick off
     3379
     3380However it is recommended to avoid such special characters
     3381due to unexpected edge-cases. For example, the following
     3382are not equivalent:
     3383
     3384@codequoteundirected on
     3385@codequotebacktick on
     3386@example
     3387@group
     3388$ echo 'a^c' | sed 's/\^/b/'
     3389abc
     3390
     3391$ echo 'a^c' | sed 's/\\\x5e/b/'
     3392a^c
     3393@end group
     3394@end example
     3395@codequoteundirected off
     3396@codequotebacktick off
     3397
     3398@c also: this fails in different places:
     3399@c   $ sed 's/[//'
     3400@c   sed: -e expression #1, char 5: unterminated `s' command
     3401@c   $ sed 's/\x5b//'
     3402@c   sed: -e expression #1, char 8: Invalid regular expression
     3403@c
     3404@c which is OK but confusing to explain why (the first
     3405@c fails in compile.c:snarf_char_class while the second
     3406@c is passed to the regex engine and then fails).
     3407
     3408
     3409@node Locale Considerations
     3410@section Multibyte characters and Locale Considerations
     3411
     3412@value{SSED} processes valid multibyte characters in multibyte locales
     3413(e.g. @code{UTF-8}).  @footnote{Some regexp edge-cases depend on the
     3414operating system and libc implementation. The examples shown are known
     3415to work as-expected on GNU/Linux systems using glibc.}
     3416
     3417@noindent The following example uses the Greek letter Capital Sigma
     3418(@value{ucsigma},
     3419Unicode code point @code{0x03A3}). In a @code{UTF-8} locale,
     3420@command{sed} correctly processes the Sigma as one character despite
     3421it being 2 octets (bytes):
     3422
     3423@codequoteundirected on
     3424@codequotebacktick on
     3425@example
     3426@group
     3427$ locale | grep LANG
     3428LANG=en_US.UTF-8
     3429
     3430$ printf 'a\u03A3b'
     3431a@value{ucsigma}b
     3432
     3433$ printf 'a\u03A3b' | sed 's/./X/g'
     3434XXX
     3435
     3436$ printf 'a\u03A3b' | od -tx1 -An
     3437 61 ce a3 62
     3438@end group
     3439@end example
     3440@codequoteundirected off
     3441@codequotebacktick off
     3442
     3443@noindent
     3444To force @command{sed} to process octets separately, use the @code{C} locale
     3445(also known as the @code{POSIX} locale):
     3446
     3447@codequoteundirected on
     3448@codequotebacktick on
     3449@example
     3450$ printf 'a\u03A3b' | LC_ALL=C sed 's/./X/g'
     3451XXXX
     3452@end example
     3453@codequoteundirected off
     3454@codequotebacktick off
     3455
     3456@subsection Invalid multibyte characters
     3457
     3458@command{sed}'s regular expressions @emph{do not} match
     3459invalid multibyte sequences in a multibyte locale.
     3460
     3461@noindent
     3462In the following examples, the octet @code{0xCE} is
     3463an incomplete multibyte character (shown here as @value{unicodeFFFD}).
     3464The regular expression @samp{.} does not match it:
     3465
     3466@codequoteundirected on
     3467@codequotebacktick on
     3468@example
     3469@group
     3470$ printf 'a\xCEb\n'
     3471a@value{unicodeFFFD}b
     3472
     3473$ printf 'a\xCEb\n' | sed 's/./X/g'
     3474X@value{unicodeFFFD}X
     3475
     3476$ printf 'a\xCEc\n' | sed 's/./X/g' | od -tx1c -An
     3477  58  ce  58  0a
     3478   X      X   \n
     3479@end group
     3480@end example
     3481@codequoteundirected off
     3482@codequotebacktick off
     3483
     3484@noindent Similarly, the 'catch-all' regular expression @samp{.*} does not
     3485match the entire line:
     3486
     3487@codequoteundirected on
     3488@codequotebacktick on
     3489@example
     3490@group
     3491$ printf 'a\xCEc\n' | sed 's/.*//' | od -tx1c -An
     3492  ce  63  0a
     3493       c  \n
     3494@end group
     3495@end example
     3496@codequoteundirected off
     3497@codequotebacktick off
     3498
     3499@noindent
     3500@value{SSED} offers the special @command{z} command to clear the
     3501current pattern space regardless of invalid multibyte characters
     3502(i.e. it works like @code{s/.*//} but also removes invalid multibyte
     3503characters):
     3504
     3505@codequoteundirected on
     3506@codequotebacktick on
     3507@example
     3508@group
     3509$ printf 'a\xCEc\n' | sed 'z' | od -tx1c -An
     3510   0a
     3511   \n
     3512@end group
     3513@end example
     3514@codequoteundirected off
     3515@codequotebacktick off
     3516
     3517@noindent Alternatively, force the @code{C} locale to process
     3518each octet separately (every octet is a valid character in the @code{C}
     3519locale):
     3520
     3521@codequoteundirected on
     3522@codequotebacktick on
     3523@example
     3524@group
     3525$ printf 'a\xCEc\n' | LC_ALL=C sed 's/.*//' | od -tx1c -An
     3526  0a
     3527  \n
     3528@end group
     3529@end example
     3530@codequoteundirected off
     3531@codequotebacktick off
     3532
     3533
     3534@command{sed}'s inability to process invalid multibyte characters
     3535can be used to detect such invalid sequences in a file.
     3536In the following examples, the @code{\xCE\xCE} is an invalid
     3537multibyte sequence, while @code{\xCE\xA3} is a valid multibyte sequence
     3538(of the Greek Sigma character).
     3539
     3540@noindent
     3541The following @command{sed} program removes all valid
     3542characters using @code{s/.//g}.  Any content left in the pattern space
     3543(the invalid characters) is added to the hold space using the
     3544@code{H} command. On the last line (@code{$}), the hold space is retrieved
     3545(@code{x}), newlines are removed (@code{s/\n//g}), and any remaining
     3546octets are printed unambiguously (@code{l}).  Thus, any invalid
     3547multibyte sequences are printed as octal values:
     3548
     3549@codequoteundirected on
     3550@codequotebacktick on
     3551@example
     3552@group
     3553$ printf 'ab\nc\n\xCE\xCEde\n\xCE\xA3f\n' > invalid.txt
     3554
     3555$ cat invalid.txt
     3556ab
     3557c
     3558@value{unicodeFFFD}@value{unicodeFFFD}de
     3559@value{ucsigma}f
     3560
     3561$ sed -n 's/.//g ; H ; $@{x;s/\n//g;l@}' invalid.txt
     3562\316\316$
     3563@end group
     3564@end example
     3565@codequoteundirected off
     3566@codequotebacktick off
     3567
     3568@noindent With a few more commands, @command{sed} can print
     3569the exact line number of each line containing invalid characters (line 3 here).
     3570These characters can then be removed by forcing the @code{C} locale
     3571and using octal escape sequences:
     3572
     3573@codequoteundirected on
     3574@codequotebacktick on
     3575@example
     3576$ sed -n 's/.//g;=;l' invalid.txt | paste - -  | awk '$2!="$"'
     35773       \316\316$
     3578
     3579$ LC_ALL=C sed '3s/\o316\o316//' invalid.txt > fixed.txt
     3580@end example
     3581@codequoteundirected off
     3582@codequotebacktick off
     3583
     3584@subsection Upper/Lower case conversion
     3585
     3586
     3587@value{SSED}'s substitute command (@code{s}) supports upper/lower
     3588case conversions using @code{\U},@code{\L} codes.
     3589These conversions support multibyte characters:
     3590
     3591@codequoteundirected on
     3592@codequotebacktick on
     3593@example
     3594$ printf 'ABC\u03a3\n'
     3595ABC@value{ucsigma}
     3596
     3597$ printf 'ABC\u03a3\n' | sed 's/.*/\L&/'
     3598abc@value{lcsigma}
     3599@end example
     3600@codequoteundirected off
     3601@codequotebacktick off
     3602
     3603@noindent
     3604@xref{The "s" Command}.
     3605
     3606
     3607@subsection Multibyte regexp character classes
     3608
     3609@c TODO: fix following paragraphs (copied verbatim from 'bracket
     3610@c expression' section).
     3611
     3612In locales other than @samp{C}, the sorting sequence is not specified, and
     3613@samp{[a-d]} might be equivalent to @samp{[abcd]} or to
     3614@samp{[aBbCcDd]}, or it might fail to match any character, or the set of
     3615characters that it matches might even be erratic.
     3616To obtain the traditional interpretation
     3617of bracket expressions, you can use the @samp{C} locale by setting the
     3618@env{LC_ALL} environment variable to the value @samp{C}.
     3619
     3620@example
     3621# TODO: is there any real-world system/locale where 'A'
     3622#       is replaced by '-' ?
     3623$ echo A | sed 's/[a-z]/-/'
     3624A
     3625@end example
     3626
     3627The interpretation of named character classes depends on the @env{LC_CTYPE} locale;
     3628for example, @samp{[[:alnum:]]} means the character class of numbers and letters
     3629in the current locale.
     3630
     3631TODO: show example of collation
     3632
     3633@codequoteundirected on
     3634@codequotebacktick on
     3635@example
     3636# TODO: this works on glibc systems, not on musl-libc/freebsd/macosx.
     3637$ printf 'cliché\n' | LC_ALL=fr_FR.utf8 sed 's/[[=e=]]/X/g'
     3638clichX
     3639@end example
     3640@codequoteundirected off
     3641@codequotebacktick off
     3642
     3643
     3644@node advanced sed
     3645@chapter Advanced @command{sed}: cycles and buffers
     3646
     3647@menu
     3648* Execution Cycle::          How @command{sed} works
     3649* Hold and Pattern Buffers::
     3650* Multiline techniques::     Using D,G,H,N,P to process multiple lines
     3651* Branching and flow control::
     3652@end menu
     3653
     3654@node Execution Cycle
     3655@section How @command{sed} Works
     3656
     3657@cindex Buffer spaces, pattern and hold
     3658@cindex Spaces, pattern and hold
     3659@cindex Pattern space, definition
     3660@cindex Hold space, definition
     3661@command{sed} maintains two data buffers: the active @emph{pattern} space,
     3662and the auxiliary @emph{hold} space. Both are initially empty.
     3663
     3664@command{sed} operates by performing the following cycle on each
     3665line of input: first, @command{sed} reads one line from the input
     3666stream, removes any trailing newline, and places it in the pattern space.
     3667Then commands are executed; each command can have an address associated
     3668with it: addresses are a kind of condition code, and a command is only
     3669executed if the condition is verified before the command is to be
     3670executed.
     3671
     3672When the end of the script is reached, unless the @option{-n} option
     3673is in use, the contents of pattern space are printed out to the output
     3674stream, adding back the trailing newline if it was removed.@footnote{Actually,
     3675if @command{sed} prints a line without the terminating newline, it will
     3676nevertheless print the missing newline as soon as more text is sent to
     3677the same output stream, which gives the ``least expected surprise''
     3678even though it does not make commands like @samp{sed -n p} exactly
     3679identical to @command{cat}.} Then the next cycle starts for the next
     3680input line.
     3681
     3682Unless special commands (like @samp{D}) are used, the pattern space is
     3683deleted between two cycles. The hold space, on the other hand, keeps
     3684its data between cycles (see commands @samp{h}, @samp{H}, @samp{x},
     3685@samp{g}, @samp{G} to move data between both buffers).
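
The fact that the hold space survives from one cycle to the next is what
makes the classic line-reversal one-liner work.  It is shown here as an
illustrative sketch with made-up input, not as the only way to use the
two buffers:

```shell
# 1!G  on every line but the first, append the hold space to the pattern space
# h    copy the pattern space (the lines seen so far, reversed) to the hold space
# $p   on the last line, print the accumulated result
printf '1\n2\n3\n' | sed -n '1!G;h;$p'
# prints:
#   3
#   2
#   1
```

The @option{-n} option suppresses the automatic print at the end of each
cycle, so only the explicit @code{p} on the last line produces output.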
     3686
     3687@node Hold and Pattern Buffers
     3688@section Hold and Pattern Buffers
     3689
     3690TODO
     3691
     3692@node Multiline techniques
     3693@section Multiline techniques - using D,G,H,N,P to process multiple lines
     3694
     3695Multiple lines can be processed as one buffer using the
     3696@code{D},@code{G},@code{H},@code{N},@code{P} commands. They are similar to
     3697their lowercase counterparts (@code{d},@code{g},
     3698@code{h},@code{n},@code{p}), except that these commands append or
     3699subtract data while respecting embedded newlines - allowing adding and
     3700removing lines from the pattern and hold spaces.
     3701
     3702They operate as follows:
    15443703@table @code
    1545 @item \w
    1546 Matches any ``word'' character.  A ``word'' character is any
    1547 letter or digit or the underscore character.
    1548 
    1549 @item \W
    1550 Matches any ``non-word'' character.
    1551 
    1552 @item \b
    1553 Matches a word boundary; that is it matches if the character
    1554 to the left is a ``word'' character and the character to the
    1555 right is a ``non-word'' character, or vice-versa.
    1556 
    1557 @item \B
    1558 Matches everywhere but on a word boundary; that is it matches
    1559 if the character to the left and the character to the right
    1560 are either both ``word'' characters or both ``non-word''
    1561 characters.
    1562 
    1563 @item \`
    1564 Matches only at the start of pattern space.  This is different
    1565 from @code{^} in multi-line mode.
    1566 
    1567 @item \'
    1568 Matches only at the end of pattern space.  This is different
    1569 from @code{$} in multi-line mode.
    1570 
    1571 @ifset PERL
    1572 @item \G
    1573 Match only at the start of pattern space or, when doing a global
    1574 substitution using the @code{s///g} command and option, at
    1575 the end-of-match position of the prior match.  For example,
    1576 @samp{s/\Ga/Z/g} will change an initial run of @code{a}s to
    1577 a run of @code{Z}s
    1578 @end ifset
     3704@item D
     3705@emph{deletes} line from the pattern space until the first newline,
     3706and restarts the cycle.
     3707
     3708@item G
     3709@emph{appends} line from the hold space to the pattern space, with a
     3710newline before it.
     3711
     3712@item H
     3713@emph{appends} line from the pattern space to the hold space, with a
     3714newline before it.
     3715
     3716@item N
     3717@emph{appends} line from the input file to the pattern space.
     3718
     3719@item P
     3720@emph{prints} line from the pattern space until the first newline.
     3721
    15793722@end table
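
The @code{G} command can be combined with its lowercase counterpart @code{h} to reverse the order of the input lines. This is only a sketch, assuming a POSIX shell; the three input lines and the variable name @code{reversed} are illustrative:

```shell
# 1!G : on every line but the first, append a newline and the hold
#       space (all previous lines, newest first) to the pattern space
# h   : copy the accumulated pattern space back to the hold space
# $p  : on the last line, print the accumulated text
reversed=$(printf '%s\n' 1 2 3 | sed -n -e '1!G' -e 'h' -e '$p')
printf '%s\n' "$reversed"
# prints 3, 2 and 1 on separate lines
```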
     3723
     3724
     3725The following example illustrates the operation of @code{N} and
     3726@code{D} commands:
     3727
     3728@codequoteundirected on
     3729@codequotebacktick on
     3730@example
     3731@group
     3732$ seq 6 | sed -n 'N;l;D'
     37331\n2$
     37342\n3$
     37353\n4$
     37364\n5$
     37375\n6$
     3738@end group
     3739@end example
     3740@codequoteundirected off
     3741@codequotebacktick off
     3742
     3743@enumerate
     3744@item
     3745@command{sed} starts by reading the first line into the pattern space
     3746(i.e. @samp{1}).
     3747@item
     3748At the beginning of every cycle, the @code{N}
     3749command appends a newline and the next line to the pattern space
     3750(i.e. @samp{1}, @samp{\n}, @samp{2} in the first cycle).
     3751@item
     3752The @code{l} command prints the content of the pattern space
     3753unambiguously.
     3754@item
     3755The @code{D} command then removes the content of pattern
     3756space up to the first newline (leaving @samp{2} at the end of
     3757the first cycle).
     3758@item
     3759At the next cycle the @code{N} command appends a
     3760newline and the next input line to the pattern space
     3761(e.g. @samp{2}, @samp{\n}, @samp{3}).
     3762@end enumerate
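
The same @code{N}/@code{D} pairing can keep a sliding window of lines in the pattern space. The sketch below prints only the last two input lines. It assumes a POSIX shell, and the variable name @code{last_two} is illustrative:

```shell
# $!N : unless this is the last line, append the next input line
# $!D : unless the last line has been read, delete up to the first
#       newline and restart the cycle without reading new input
# Once the last line has been appended, neither command runs and the
# two remaining lines are printed at the end of the cycle.
last_two=$(printf '%s\n' 1 2 3 4 5 6 | sed -e '$!N' -e '$!D')
printf '%s\n' "$last_two"
# prints 5 and 6 on separate lines
```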
     3763
     3764
     3765@cindex processing paragraphs
     3766@cindex paragraphs, processing
     3767A common technique to process blocks of text such as paragraphs
     3768(instead of line-by-line) is using the following construct:
     3769
     3770@codequoteundirected on
     3771@codequotebacktick on
     3772@example
     3773sed '/./@{H;$!d@} ; x ; s/REGEXP/REPLACEMENT/'
     3774@end example
     3775@codequoteundirected off
     3776@codequotebacktick off
     3777
     3778@enumerate
     3779@item
     3780The first expression, @code{/./@{H;$!d@}} operates on all non-empty lines,
     3781and adds the current line (in the pattern space) to the hold space.
     3782On all lines except the last, the pattern space is deleted and the cycle is
     3783restarted.
     3784
     3785@item
     3786The other expressions @code{x} and @code{s} are executed only on empty
     3787lines (i.e. paragraph separators). The @code{x} command fetches the
     3788accumulated lines from the hold space back to the pattern space. The
     3789@code{s///} command then operates on all the text in the paragraph
     3790(including the embedded newlines).
     3791@end enumerate
     3792
     3793The following example demonstrates this technique:
     3794@codequoteundirected on
     3795@codequotebacktick on
     3796@example
     3797@group
     3798$ cat input.txt
     3799a a a aa aaa
     3800aaaa aaaa aa
     3801aaaa aaa aaa
     3802
     3803bbbb bbb bbb
     3804bb bb bbb bb
     3805bbbbbbbb bbb
     3806
     3807ccc ccc cccc
     3808cccc ccccc c
     3809cc cc cc cc
     3810
     3811$ sed '/./@{H;$!d@} ; x ; s/^/\nSTART-->/ ; s/$/\n<--END/' input.txt
     3812
     3813START-->
     3814a a a aa aaa
     3815aaaa aaaa aa
     3816aaaa aaa aaa
     3817<--END
     3818
     3819START-->
     3820bbbb bbb bbb
     3821bb bb bbb bb
     3822bbbbbbbb bbb
     3823<--END
     3824
     3825START-->
     3826ccc ccc cccc
     3827cccc ccccc c
     3828cc cc cc cc
     3829<--END
     3830@end group
     3831@end example
     3832@codequoteundirected off
     3833@codequotebacktick off
     3834
     3835For more annotated examples, @pxref{Text search across multiple lines}
     3836and @ref{Line length adjustment}.
     3837
     3838@node Branching and flow control
     3839@section Branching and Flow Control
     3840
     3841The branching commands @code{b}, @code{t}, and @code{T} enable
     3842changing the flow of @command{sed} programs.
     3843
     3844By default, @command{sed} reads an input line into the pattern buffer,
     3845then continues to process all commands in order.
     3846Commands without addresses affect all lines.
     3847Commands with addresses affect only matching lines.
     3848@xref{Execution Cycle} and @ref{Addresses overview}.
     3849
     3850@command{sed} does not support a typical @code{if/then} construct.
     3851Instead, some commands can be used as conditionals or to change the
     3852default flow control:
     3853
     3854@table @code
     3855
     3856@item d
     3857delete (clear) the current pattern space,
     3858and restart the program cycle without processing the rest of the commands
     3859and without printing the pattern space.
     3860
     3861@item D
     3862delete the contents of the pattern space @emph{up to the first newline},
     3863and restart the program cycle without processing the rest of
     3864the commands and without printing the pattern space.
     3865
     3866@item [addr]X
     3867@itemx [addr]@{ X ; X ; X @}
     3868@item /regexp/X
     3869@item /regexp/@{ X ; X ; X @}
     3870Addresses and regular expressions can be used as an @code{if/then}
     3871conditional: If @var{[addr]} matches the current pattern space,
     3872execute the command(s).
     3873For example: The command @code{/^#/d} means:
     3874@emph{if} the current pattern matches the regular expression @code{^#} (a line
     3875starting with a hash), @emph{then} execute the @code{d} command:
     3876delete the line without printing it, and restart the program cycle
     3877immediately.
     3878
     3879@item b
     3880branch unconditionally (that is: always jump to a label, skipping
     3881or repeating other commands, without restarting a new cycle). Combined
     3882with an address, the branch can be conditionally executed on matched
     3883lines.
     3884
     3885@item t
     3886branch conditionally (that is: jump to a label) @emph{only if} a
     3887@code{s///} command has succeeded since the last input line was read
     3888or another conditional branch was taken.
     3889
     3890@item T
     3891similar but opposite to the @code{t} command: branch only if
     3892there have been @emph{no} successful substitutions since the last
     3893input line was read.
     3894@end table
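
As a hedged illustration of how @code{t} and @code{T} mirror each other, the two programs below differ only in the branch command. With no label, both commands branch to the end of the script, so the final @code{s} command is skipped whenever the branch is taken. The sketch assumes a POSIX shell, and @value{SSED} for @code{T}; the input lines and variable names are illustrative:

```shell
# t: branch to the end of the script when s/a/z/ succeeded,
#    so "!" is appended only to lines that did NOT contain "a"
with_t=$(printf '%s\n' a1 b2 a3 | sed -e 's/a/z/' -e 't' -e 's/$/!/')

# T: branch to the end of the script when s/a/z/ did not succeed,
#    so "!" is appended only to lines that DID contain "a"
with_T=$(printf '%s\n' a1 b2 a3 | sed -e 's/a/z/' -e 'T' -e 's/$/!/')

printf '%s\n' "$with_t"   # prints z1, b2! and z3
printf '%s\n' "$with_T"   # prints z1!, b2 and z3!
```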
     3895
     3896
     3897The following two @command{sed} programs are equivalent.  The first
     3898(contrived) example uses the @code{b} command to skip the @code{s///}
     3899command on lines containing @samp{1}.  The second example uses an
     3900address with negation (@samp{!})  to perform substitution only on
     3901desired lines.  The @code{y///} command is still executed on all
     3902lines:
     3903
     3904@codequoteundirected on
     3905@codequotebacktick on
     3906@example
     3907@group
     3908$ printf '%s\n' a1 a2 a3 | sed -E '/1/bx ; s/a/z/ ; :x ; y/123/456/'
     3909a4
     3910z5
     3911z6
     3912
     3913$ printf '%s\n' a1 a2 a3 | sed -E '/1/!s/a/z/ ; y/123/456/'
     3914a4
     3915z5
     3916z6
     3917@end group
     3918@end example
     3919@codequoteundirected off
     3920@codequotebacktick off
     3921
     3922
     3923
     3924@subsection Branching and Cycles
     3925@cindex labels
     3926@cindex omitting labels
     3927@cindex cycle, restarting
     3928@cindex restarting a cycle
     3929The @code{b},@code{t} and @code{T} commands can be followed by a label
     3930(typically a single letter). Labels are defined with a colon followed by
     3931one or more letters (e.g. @samp{:x}). If the label is omitted the
     3932branch commands restart the cycle.  Note the difference between
     3933branching to a label and restarting the cycle: when a cycle is
     3934restarted, @command{sed} first prints the current content of the
     3935pattern space, then reads the next input line into the pattern space;
     3936jumping to a label (even if it is at the beginning of the program)
     3937does not print the pattern space and does not read the next input line.
     3938
     3939The following program is a no-op. The @code{b} command (the only command
     3940in the program) does not have a label, and thus simply restarts the cycle.
     3941On each cycle, the pattern space is printed and the next input line is read:
     3942
     3943@example
     3944@group
     3945$ seq 3 | sed b
     39461
     39472
     39483
     3949@end group
     3950@end example
     3951
     3952@cindex infinite loop, branching
     3953@cindex branching, infinite loop
     3954The following example is an infinite loop: it doesn't terminate and
     3955doesn't print anything. The @code{b} command jumps to the @samp{x}
     3956label, and a new cycle is never started:
     3957
     3958@codequoteundirected on
     3959@codequotebacktick on
     3960@example
     3961@group
     3962$ seq 3 | sed ':x ; bx'
     3963
     3964# The above command requires gnu sed (which supports additional
     3965# commands following a label, without a newline). A portable equivalent:
     3966#     sed -e ':x' -e bx
     3967@end group
     3968@end example
     3969@codequoteundirected off
     3970@codequotebacktick off
     3971
     3972@cindex branching and n, N
     3973@cindex n, and branching
     3974@cindex N, and branching
     3975Branching is often complemented with the @code{n} or @code{N} commands:
     3976both commands read the next input line into the pattern space without waiting
     3977for the cycle to restart. Before reading the next input line, @code{n}
     3978prints the current pattern space then empties it, while @code{N}
     3979appends a newline and the next input line to the pattern space.
     3980
     3981Consider the following two examples:
     3982
     3983@codequoteundirected on
     3984@codequotebacktick on
     3985@example
     3986@group
     3987$ seq 3 | sed ':x ; n ; bx'
     39881
     39892
     39903
     3991
     3992$ seq 3 | sed ':x ; N ; bx'
     39931
     39942
     39953
     3996@end group
     3997@end example
     3998@codequoteundirected off
     3999@codequotebacktick off
     4000
     4001@itemize
     4002@item
     4003Neither example loops forever, despite never starting a new cycle.
     4004
     4005@item
     4006In the first example, the @code{n} command first prints the content
     4007of the pattern space, empties the pattern space then reads the next
     4008input line.
     4009
     4010@item
     4011In the second example, the @code{N} command appends the next input
     4012line to the pattern space (with a newline).  Lines are accumulated in
     4013the pattern space until there are no more input lines to read, then
     4014the @code{N} command terminates the @command{sed} program. When the
     4015program terminates, the end-of-cycle actions are performed, and the
     4016entire pattern space is printed.
     4017
     4018@item
     4019The second example requires @value{SSED},
     4020because it uses the non-POSIX-standard behavior of @code{N}.
     4021See the ``@code{N} command on the last line'' paragraph
     4022in @ref{Reporting Bugs}.
     4023
     4024@item
     4025To further examine the difference between the two examples,
     4026try the following commands:
     4027@codequoteundirected on
     4028@codequotebacktick on
     4029@example
     4030@group
     4031printf '%s\n' aa bb cc dd | sed ':x ; n ; = ; bx'
     4032printf '%s\n' aa bb cc dd | sed ':x ; N ; = ; bx'
     4033printf '%s\n' aa bb cc dd | sed ':x ; n ; s/\n/***/ ; bx'
     4034printf '%s\n' aa bb cc dd | sed ':x ; N ; s/\n/***/ ; bx'
     4035@end group
     4036@end example
     4037@codequoteundirected off
     4038@codequotebacktick off
     4039
     4040@end itemize
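
To make the difference concrete, here is a sketch of the last pair of commands with the results captured in shell variables. It assumes a POSIX shell and @value{SSED} running without @env{POSIXLY_CORRECT} set, since the second program relies on the GNU behavior of @code{N} on the last line. The variable names are illustrative:

```shell
# n replaces the pattern space with the next line, so it never
# contains a newline and the substitution never matches.
with_n=$(printf '%s\n' aa bb cc dd | sed -e ':x' -e 'n' -e 's/\n/***/' -e 'bx')

# N appends the next line after a newline, so the substitution
# joins the accumulated lines one at a time.
with_N=$(printf '%s\n' aa bb cc dd | sed -e ':x' -e 'N' -e 's/\n/***/' -e 'bx')

printf '%s\n' "$with_n"   # prints aa, bb, cc and dd on separate lines
printf '%s\n' "$with_N"   # prints aa***bb***cc***dd
```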
     4041
     4042
     4043
     4044@subsection Branching example: joining lines
     4045
     4046@cindex joining lines with branching
     4047@cindex branching, joining lines
     4048@cindex quoted-printable lines, joining
     4049@cindex joining quoted-printable lines
     4050@cindex t, joining lines with
     4051@cindex b, joining lines with
     4052@cindex b, versus t
     4053@cindex t, versus b
     4054As a real-world example of using branching, consider the case of
     4055@uref{https://en.wikipedia.org/wiki/Quoted-printable,quoted-printable} files,
     4056typically used to encode email messages.
     4057In these files long lines are split and marked with a @dfn{soft line break}
     4058consisting of a single @samp{=} character at the end of the line:
     4059
     4060@example
     4061@group
     4062$ cat jaques.txt
     4063All the wor=
     4064ld's a stag=
     4065e,
     4066And all the=
     4067 men and wo=
     4068men merely =
     4069players:
     4070They have t=
     4071heir exits =
     4072and their e=
     4073ntrances;
     4074And one man=
     4075 in his tim=
     4076e plays man=
     4077y parts.
     4078@end group
     4079@end example
     4080
     4081
     4082The following program uses an address match @samp{/=$/} as a
     4083conditional: If the current pattern space ends with a @samp{=}, it
     4084reads the next input line using @code{N}, replaces all @samp{=}
     4085characters which are followed by a newline, and unconditionally
     4086branches (@code{b}) to the beginning of the program without restarting
     4087a new cycle. If the pattern space does not end with @samp{=}, the
     4088default action is performed: the pattern space is printed and a new
     4089cycle is started:
     4090
     4091@codequoteundirected on
     4092@codequotebacktick on
     4093@example
     4094@group
     4095$ sed ':x ; /=$/ @{ N ; s/=\n//g ; bx @}' jaques.txt
     4096All the world's a stage,
     4097And all the men and women merely players:
     4098They have their exits and their entrances;
     4099And one man in his time plays many parts.
     4100@end group
     4101@end example
     4102@codequoteundirected off
     4103@codequotebacktick off
     4104
     4105Here's an alternative program with a slightly different approach: On
     4106all lines except the last, @code{N} appends the line to the pattern
     4107space.  A substitution command then removes soft line breaks
     4108(@samp{=} at the end of a line, i.e. followed by a newline) by replacing
     4109them with an empty string.
     4110@emph{If} the substitution was successful (meaning the pattern space contained
     4111a line which should be joined), the conditional branch command @code{t} jumps
     4112to the beginning of the program without completing or restarting the cycle.
     4113If the substitution failed (meaning there were no soft line breaks),
     4114the @code{t} command will @emph{not} branch. Then, @code{P} will
     4115print the pattern space content until the first newline, and @code{D}
     4116will delete the pattern space content until the first newline.
     4117(To learn more about @code{N}, @code{P} and @code{D} commands
     4118@pxref{Multiline techniques}).
     4119
     4120
     4121@codequoteundirected on
     4122@codequotebacktick on
     4123@example
     4124@group
     4125$ sed ':x ; $!N ; s/=\n// ; tx ; P ; D' jaques.txt
     4126All the world's a stage,
     4127And all the men and women merely players:
     4128They have their exits and their entrances;
     4129And one man in his time plays many parts.
     4130@end group
     4131@end example
     4132@codequoteundirected off
     4133@codequotebacktick off
     4134
     4135
     4136For more line-joining examples @pxref{Joining lines}.
     4137
    15804138
    15814139@node Examples
     
    15864144
    15874145@menu
     4146
     4147Useful one-liners:
     4148* Joining lines::
     4149
    15884150Some exotic examples:
    15894151* Centering lines::
     
    15924154* Print bash environment::
    15934155* Reverse chars of lines::
     4156* Text search across multiple lines::
     4157* Line length adjustment::
     4158* Adding a header to multiple files::
    15944159
    15954160Emulating standard utilities:
     
    16084173@end menu
    16094174
     4175@node Joining lines
     4176@section Joining lines
     4177
     4178This section uses @code{N}, @code{D} and @code{P} commands to process
     4179multiple lines, and the @code{b} and @code{t} commands for branching.
     4180@xref{Multiline techniques} and @ref{Branching and flow control}.
     4181
     4182Join specific lines (e.g. if lines 2 and 3 need to be joined):
     4183
     4184@codequoteundirected on
     4185@codequotebacktick on
     4186@example
     4187$ cat lines.txt
     4188hello
     4189hel
     4190lo
     4191hello
     4192
     4193$ sed '2@{N;s/\n//;@}' lines.txt
     4194hello
     4195hello
     4196hello
     4197@end example
     4198@codequoteundirected off
     4199@codequotebacktick off
     4200
     4201Join backslash-continued lines:
     4202
     4203@codequoteundirected on
     4204@codequotebacktick on
     4205@example
     4206$ cat 1.txt
     4207this \
     4208is \
     4209a \
     4210long \
     4211line
     4212and another \
     4213line
     4214
     4215$ sed -e ':x /\\$/ @{ N; s/\\\n//g ; bx @}'  1.txt
     4216this is a long line
     4217and another line
     4218
     4219
     4220#TODO: The above requires gnu sed.
     4221#      non-gnu seds need newlines after ':' and 'b'
     4222@end example
     4223@codequoteundirected off
     4224@codequotebacktick off
     4225
     4226Join lines that start with whitespace (e.g. SMTP headers):
     4227
     4228@codequoteundirected on
     4229@codequotebacktick on
     4230@example
     4231@group
     4232$ cat 2.txt
     4233Subject: Hello
     4234    World
     4235Content-Type: multipart/alternative;
     4236    boundary=94eb2c190cc6370f06054535da6a
     4237Date: Tue, 3 Jan 2017 19:41:16 +0000 (GMT)
     4238Authentication-Results: mx.gnu.org;
     4239       dkim=pass header.i=@@gnu.org;
     4240       spf=pass
     4241Message-ID: <abcdef@@gnu.org>
     4242From: John Doe <jdoe@@gnu.org>
     4243To: Jane Smith <jsmith@@gnu.org>
     4244
     4245$ sed -E ':a ; $!N ; s/\n\s+/ / ; ta ; P ; D' 2.txt
     4246Subject: Hello World
     4247Content-Type: multipart/alternative; boundary=94eb2c190cc6370f06054535da6a
     4248Date: Tue, 3 Jan 2017 19:41:16 +0000 (GMT)
     4249Authentication-Results: mx.gnu.org; dkim=pass header.i=@@gnu.org; spf=pass
     4250Message-ID: <abcdef@@gnu.org>
     4251From: John Doe <jdoe@@gnu.org>
     4252To: Jane Smith <jsmith@@gnu.org>
     4253
     4254# A portable (non-gnu) variation:
     4255#   sed -e :a -e '$!N;s/\n  */ /;ta' -e 'P;D'
     4256@end group
     4257@end example
     4258@codequoteundirected off
     4259@codequotebacktick off
     4260
     4261
    16104262@node Centering lines
    16114263@section Centering Lines
     
    16344286
    16354287@group
    1636 # del leading and trailing spaces
    1637 y/@kbd{tab}/ /
     4288# delete leading and trailing spaces
     4289y/@kbd{@key{TAB}}/ /
    16384290s/^ *//
    16394291s/ *$//
     
    16844336
    16854337@group
    1686 # replace all leading 9s by _ (any other character except digits, could
     4338# replace all trailing 9s by _ (any other character except digits, could
    16874339# be used)
    16884340:d
     
    16944346# incr last digit only.  The first line adds a most-significant
    16954347# digit of 1 if we have to add a digit.
    1696 #
    1697 # The @code{tn} commands are not necessary, but make the thing
    1698 # faster
    16994348@end group
    17004349
     
    17274376seen a script converting the output of @command{date} into a @command{bc}
    17284377program!
    1729  
     4378
    17304379The main body of this is the @command{sed} script, which remaps the name
    1731 from lower to upper (or vice-versa) and even checks out 
     4380from lower to upper (or vice-versa) and even checks out
    17324381if the remapped name is the same as the original name.
    17334382Note how the script is parameterized using shell
     
    17384387@group
    17394388#! /bin/sh
    1740 # rename files to lower/upper case... 
     4389# rename files to lower/upper case...
    17414390#
    1742 # usage: 
    1743 #    move-to-lower * 
    1744 #    move-to-upper * 
     4391# usage:
     4392#    move-to-lower *
     4393#    move-to-upper *
    17454394# or
    17464395#    move-to-lower -R .
     
    17524401help()
    17534402@{
    1754         cat << eof
     4403        cat << eof
    17554404Usage: $0 [-n] [-r] [-h] files...
    17564405@end group
     
    17854434while :
    17864435do
    1787     case "$1" in 
     4436    case "$1" in
    17884437        -n) apply_cmd='cat' ;;
    17894438        -R) finder='find "$@@" -type f';;
     
    18134462esac
    18144463@end group
    1815        
     4464
    18164465eval $finder | sed -n '
    18174466
     
    18554504@group
    18564505# check if converted file name is equal to original file name,
    1857 # if it is, do not print nothing
     4506# if it is, do not print anything
    18584507/^.*\/\(.*\)\n\1/b
     4508@end group
     4509
     4510@group
     4511# escape special characters for the shell
     4512s/["$`\\]/\\&/g
    18594513@end group
    18604514
     
    19744628@c end---------------------------------------------
    19754629
     4630
     4631@node Text search across multiple lines
     4632@section Text search across multiple lines
     4633
     4634This section uses @code{N} and @code{D} commands to search for
     4635consecutive words spanning multiple lines. @xref{Multiline techniques}.
     4636
     4637These examples deal with finding doubled occurrences of words in a document.
     4638
     4639Finding doubled words in a single line is easy using GNU @command{grep}
     4640and similarly with @value{SSED}:
     4641
     4642@c NOTE: in all examples, 'the@ the' is used to prevent
     4643@c 'make syntax-check' from complaining about double words.
     4644@codequoteundirected on
     4645@codequotebacktick on
     4646@example
     4647@group
     4648$ cat two-cities-dup1.txt
     4649It was the best of times,
     4650it was the worst of times,
     4651it was the@ the age of wisdom,
     4652it was the age of foolishness,
     4653
     4654$ grep -E '\b(\w+)\s+\1\b' two-cities-dup1.txt
     4655it was the@ the age of wisdom,
     4656
     4657$ grep -n -E '\b(\w+)\s+\1\b' two-cities-dup1.txt
     46583:it was the@ the age of wisdom,
     4659
     4660$ sed -En '/\b(\w+)\s+\1\b/p' two-cities-dup1.txt
     4661it was the@ the age of wisdom,
     4662
     4663$ sed -En '/\b(\w+)\s+\1\b/@{=;p@}' two-cities-dup1.txt
     46643
     4665it was the@ the age of wisdom,
     4666@end group
     4667@end example
     4668@codequoteundirected off
     4669@codequotebacktick off
     4670
     4671@itemize @bullet
     4672@item
     4673The regular expression @samp{\b\w+\s+} searches for word-boundary (@samp{\b}),
     4674followed by one-or-more word-characters (@samp{\w+}), followed by whitespace
     4675(@samp{\s+}). @xref{regexp extensions}.
     4676
     4677@item
     4678Adding parentheses around the @samp{(\w+)} expression creates a subexpression.
     4679The regular expression pattern @samp{(PATTERN)\s+\1} defines a subexpression
     4680(in the parentheses) followed by a back-reference, separated by whitespace.
     4681A successful match means the @var{PATTERN} was repeated twice in succession.
     4682@xref{Back-references and Subexpressions}.
     4683
     4684@item
     4685The word-boundary expression (@samp{\b}) at both ends ensures partial
     4686words are not matched (e.g. @samp{the then} is not a desired match).
     4687@c Thanks to Jim for pointing this out in
     4688@c https://lists.gnu.org/archive/html/sed-devel/2016-12/msg00041.html
     4689
     4690@item
     4691The @option{-E} option enables extended regular expression syntax, alleviating
     4692the need to add backslashes before the parentheses. @xref{ERE syntax}.
     4693
     4694@end itemize
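
The effect of the trailing @samp{\b} can be checked with a small sketch. It assumes a POSIX shell and @value{SSED}, since @samp{\b}, @samp{\w} and @samp{\s} are GNU extensions; the two input lines and the variable name @code{matches} are illustrative:

```shell
# 'the then' repeats "the" only as a prefix of "then"; the trailing \b
# rejects it. 'is is' is a genuine doubled word and is printed.
matches=$(printf '%s\n' 'the then' 'is is' | sed -En '/\b(\w+)\s+\1\b/p')
printf '%s\n' "$matches"
# prints: is is
```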
     4695
     4696When the doubled words span two lines, the above regular expression
     4697will not find them, because @command{grep} and @command{sed} operate line-by-line.
     4698
     4699By using @command{N} and @command{D} commands, @command{sed} can apply
     4700regular expressions on multiple lines (that is, multiple lines are stored
     4701in the pattern space, and the regular expression works on it):
     4702
     4703@c NOTE: use 'the@*the' instead of a real new line to prevent
     4704@c 'make syntax-check' to complain about doubled-words.
     4705@codequoteundirected on
     4706@codequotebacktick on
     4707@example
     4708$ cat two-cities-dup2.txt
     4709It was the best of times, it was the
     4710worst of times, it was the@*the age of wisdom,
     4711it was the age of foolishness,
     4712
     4713$ sed -En '@{N; /\b(\w+)\s+\1\b/@{=;p@} ; D@}'  two-cities-dup2.txt
     47143
     4715worst of times, it was the@*the age of wisdom,
     4716@end example
     4717@codequoteundirected off
     4718@codequotebacktick off
     4719
     4720@itemize @bullet
     4721@item
     4722The @command{N} command appends the next line to the pattern space
     4723(thus ensuring it contains two consecutive lines in every cycle).
     4724
     4725@item
     4726The regular expression uses @samp{\s+} for word separator which matches
     4727both spaces and newlines.
     4728
     4729@item
     4730If the regular expression matches, the entire pattern space is printed
     4731with @command{p}. No lines are printed by default due to the @option{-n} option.
     4732
     4733@item
     4734The @command{D} removes the first line from the pattern space (up until the
     4735first newline), readying it for the next cycle.
     4736@end itemize
     4737
     4738See the GNU @command{coreutils} manual for an alternative solution using
     4739@command{tr -s} and @command{uniq} at
     4740@c NOTE: cheating and keeping the URL line shorter than 80 characters
     4741@c by using 'gnu.org' and '/s/'.
     4742@url{https://gnu.org/s/coreutils/manual/html_node/Squeezing-and-deleting.html}.
     4743
     4744@node Line length adjustment
     4745@section Line length adjustment
     4746
     4747This section uses @code{N} and @code{P} commands to read and write
     4748lines, and the @code{b} command for branching.
     4749@xref{Multiline techniques} and @ref{Branching and flow control}.
     4750
     4751This (somewhat contrived) example deals with formatting and wrapping
     4752lines of text of the following input file:
     4753
     4754@example
     4755@group
     4756$ cat two-cities-mix.txt
     4757It was the best of times, it was
     4758the worst of times, it
     4759was the age of
     4760wisdom,
     4761it
     4762was
     4763the age
     4764of foolishness,
     4765@end group
     4766@end example
     4767
     4768@exdent The following sed program wraps lines at 40 characters:
     4769@codequoteundirected on
     4770@codequotebacktick on
     4771@example
     4772@group
     4773$ cat wrap40.sed
     4774# outer loop
     4775:x
     4776
     4777# Append a newline followed by the next input line to the pattern buffer
     4778N
     4779
     4780# Replace all newlines in the pattern buffer with spaces
     4781s/\n/ /g
     4782
     4783
     4784# Inner loop
     4785:y
     4786
     4787# Add a newline after the first 40 characters
     4788s/(.@{40,40@})/\1\n/
     4789
     4790# If there is a newline in the pattern buffer
     4791# (i.e. the previous substitution added a newline)
     4792/\n/ @{
     4793    # There are newlines in the pattern buffer -
     4794    # print the content until the first newline.
     4795    P
     4796
     4797   # Remove the printed characters and the first newline
     4798   s/.*\n//
     4799
     4800   # branch to label 'y' - repeat inner loop
     4801   by
     4802 @}
     4803
     4804# No newlines in the pattern buffer - Branch to label 'x' (outer loop)
     4805# and read the next input line
     4806bx
     4807@end group
     4808@end example
     4809@codequoteundirected off
     4810@codequotebacktick off
     4811
     4812
     4813
     4814@exdent The wrapped output:
     4815@codequoteundirected on
     4816@codequotebacktick on
     4817@example
     4818@group
     4819$ sed -E -f wrap40.sed two-cities-mix.txt
     4820It was the best of times, it was the wor
     4821st of times, it was the age of wisdom, i
     4822t was the age of foolishness,
     4823@end group
     4824@end example
     4825@codequoteundirected off
     4826@codequotebacktick off
     4827
     4828
     4829
     4830
     4831@node Adding a header to multiple files
     4832@section Adding a header to multiple files
     4833
     4834@value{SSED} can be used to safely modify multiple files at once.
     4835
     4836@exdent Add a single line to the beginning of source code files:
     4837
     4838@codequoteundirected on
     4839@codequotebacktick on
     4840@example
     4841sed -i '1i/* Copyright (C) FOO BAR */' *.c
     4842@end example
     4843@codequoteundirected off
     4844@codequotebacktick off
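
To try the command without touching real sources, it can be run against a scratch file first. This is only a sketch: it assumes a POSIX shell with @command{mktemp} available and @value{SSED} (for @option{-i} and the one-line form of @code{i}); the directory, the file name @file{demo.c} and its content are hypothetical:

```shell
# Create a throwaway source file in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'int main(void) { return 0; }' > "$tmpdir/demo.c"

# Insert the header before line 1, editing the file in place.
sed -i '1i/* Copyright (C) FOO BAR */' "$tmpdir/demo.c"

# Inspect the result, then clean up.
contents=$(cat "$tmpdir/demo.c")
printf '%s\n' "$contents"
rm -rf "$tmpdir"
```

The file should now start with the copyright comment, followed by the original line.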
     4845
     4846@exdent Adding a few lines is possible using @samp{\n} in the text:
     4847
     4848@codequoteundirected on
     4849@codequotebacktick on
     4850@example
     4851sed -i '1i/*\n * Copyright (C) FOO BAR\n * Created by Jane Doe\n */' *.c
     4852@end example
     4853@codequoteundirected off
     4854@codequotebacktick off
     4855
     4856To add multiple lines from another file, use @code{0rFILE}.
     4857A typical use case is adding a license notice header to all files:
     4858
     4859@codequoteundirected on
     4860@codequotebacktick on
     4861@example
     4862## Create the header file:
     4863$ cat<<'EOF'>LIC.TXT
     4864/*
     4865    Copyright (C) 1989-2021 FOO BAR
     4866
     4867    This program is free software; you can redistribute it and/or modify
     4868    it under the terms of the GNU General Public License as published by
     4869    the Free Software Foundation; either version 3, or (at your option)
     4870    any later version.
     4871
     4872    This program is distributed in the hope that it will be useful,
     4873    but WITHOUT ANY WARRANTY; without even the implied warranty of
     4874    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
     4875    GNU General Public License for more details.
     4876
     4877    You should have received a copy of the GNU General Public License
     4878    along with this program; If not, see <https://www.gnu.org/licenses/>.
     4879*/
     4880EOF
     4881
     4882## Add the file at the beginning of all source code files:
     4883$ sed -i '0rLIC.TXT' *.cpp *.h
     4884@end example
     4885@codequoteundirected off
     4886@codequotebacktick off
     4887
     4888
     4889With script files (e.g. @file{.sh}, @file{.py}, @file{.pl} files)
     4890the license notice typically appears @emph{after} the first line (the
     4891'shebang' @samp{#!} line). The @code{1rFILE} command will add @file{FILE}
     4892@emph{after} the first line:
     4893
     4894@codequoteundirected on
     4895@codequotebacktick on
     4896@example
     4897## Create the header file:
     4898$ cat<<'EOF'>LIC.TXT
     4899##
     4900## Copyright (C) 1989-2021 FOO BAR
     4901##
     4902## This program is free software; you can redistribute it and/or modify
     4903## it under the terms of the GNU General Public License as published by
     4904## the Free Software Foundation; either version 3, or (at your option)
     4905## any later version.
     4906##
     4907## This program is distributed in the hope that it will be useful,
     4908## but WITHOUT ANY WARRANTY; without even the implied warranty of
     4909## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
     4910## GNU General Public License for more details.
     4911##
     4912## You should have received a copy of the GNU General Public License
     4913## along with this program; If not, see <https://www.gnu.org/licenses/>.
     4914##
     4915##
     4916EOF
     4917
     4918## Add the file after the first line of all script files:
     4919$ sed -i '1rLIC.TXT' *.py *.sh
     4920@end example
     4921@codequoteundirected off
     4922@codequotebacktick off
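
The effect of @code{1rFILE} on a script with a shebang line can be seen on a small scratch file. This is a minimal sketch, assuming GNU @command{sed}; the file names @file{demo.sh} and the one-line @file{LIC.TXT} are placeholders, not part of the original example:

```shell
# Scratch directory with a two-line script and a one-line header file.
workdir=$(mktemp -d)
printf '#!/bin/sh\necho hello\n' > "$workdir/demo.sh"
printf '## Copyright (C) FOO BAR\n' > "$workdir/LIC.TXT"

# Queue the header file for output after line 1 (the shebang line).
sed -i "1r $workdir/LIC.TXT" "$workdir/demo.sh"

# The shebang stays first, the header follows, then the original body.
result=$(cat "$workdir/demo.sh")
echo "$result"
```

The shebang line must stay on line 1 for the kernel to honor it, which is why @code{1r} is used here rather than @code{0r}.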
     4923
     4924The above @command{sed} commands can be combined with @command{find}
     4925to locate files in all subdirectories, @command{xargs} to run additional
     4926commands on selected files and @command{grep} to filter out files that already
     4927contain a copyright notice:
     4928
     4929@codequoteundirected on
     4930@codequotebacktick on
     4931@example
     4932find \( -iname '*.cpp' -o -iname '*.c' -o -iname '*.h' \) \
     4933    | xargs grep -Li copyright \
     4934    | xargs -r sed -i '0rLIC.TXT'
     4935@end example
     4936@codequoteundirected off
     4937@codequotebacktick off
     4938
     4939@exdent Or a slightly safer version (handling files with spaces and newlines):
     4940
     4941@codequoteundirected on
     4942@codequotebacktick on
     4943@example
     4944find \( -iname '*.cpp' -o -iname '*.c' -o -iname '*.h' \) -print0 \
     4945    | xargs -0 grep -Z -Li copyright \
     4946    | xargs -0 -r sed -i '0rLIC.TXT'
     4947@end example
     4948@codequoteundirected off
     4949@codequotebacktick off
     4950
     4951Note: using the @code{0} address with @code{r} command requires @value{SSED}
     4952version 4.9 or later. @xref{Zero Address}.
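
Where an older @command{sed} without @code{0r} support has to be used, a similar result can be obtained without @command{sed} at all. The following is only a sketch under that assumption; @code{prepend_header} is a hypothetical helper name, not something provided by @value{SSED}:

```shell
# prepend_header HEADER FILE...
# Hypothetical helper: put the content of HEADER before each FILE.
prepend_header() {
    header=$1
    shift
    for file in "$@"; do
        tmp=$(mktemp) || return 1
        # Build header + original content in a temporary file, then copy
        # it back over the original so its permissions and links survive.
        if cat "$header" "$file" > "$tmp" && cat "$tmp" > "$file"; then
            rm -f "$tmp"
        else
            rm -f "$tmp"
            return 1
        fi
    done
}
```

A call such as @code{prepend_header LIC.TXT *.cpp *.h} would then mirror the @code{0rLIC.TXT} example. Unlike @option{-i}, this variant overwrites the file content rather than replacing the file, so hard and symbolic links are kept.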
     4953
     4954
     4955
    19764956@node tac
    19774957@section Reverse Lines of Files
     
    19814961is a @command{tac} workalike.
    19824962
    1983 Note that on implementations other than @acronym{GNU} @command{sed}
    1984 @ifset PERL
    1985 and @value{SSED}
    1986 @end ifset
     4963Note that on implementations other than GNU @command{sed}
    19874964this script might easily overflow internal buffers.
    19884965
     
    20154992
    20164993This script replaces @samp{cat -n}; in fact it formats its output
    2017 exactly like @acronym{GNU} @command{cat} does.
     4994exactly like GNU @command{cat} does.
    20184995
    20194996Of course this is completely useless and for two reasons:  first,
     
    22545231@group
    22555232# Convert words to a's
    2256 s/[ @kbd{tab}][ @kbd{tab}]*/ /g
     5233s/[ @kbd{@key{TAB}}][ @kbd{@key{TAB}}]*/ /g
    22575234s/^/ /
    22585235s/ [^ ][^ ]*/a /g
     
    24315408@c end---------------------------------------------
    24325409
    2433 As you can see, we mantain a 2-line window using @code{P} and @code{D}.
     5410As you can see, we maintain a 2-line window using @code{P} and @code{D}.
    24345411This technique is often used in advanced @command{sed} scripts.
    24355412
     
    25855562fastest.  Note that loops are completely done with @code{n} and
    25865563@code{b}, without relying on @command{sed} to restart the
    2587 the script automatically at the end of a line.
     5564script automatically at the end of a line.
    25885565
    25895566@c start-------------------------------------------
     
    26035580# get next
    26045581n
    2605 # got chars? print it again, etc... 
     5582# got chars? print it again, etc...
    26065583/./bx
    26075584@end group
     
    26315608@chapter @value{SSED}'s Limitations and Non-limitations
    26325609
    2633 @cindex @acronym{GNU} extensions, unlimited line length
     5610@cindex GNU extensions, unlimited line length
    26345611@cindex Portability, line length limitations
    26355612For those who want to write portable @command{sed} scripts,
     
    26475624the size of the buffer that can be processed by certain patterns.
    26485625
    2649 @ifset PERL
    2650 There are some size limitations in the regular expression
    2651 matcher but it is hoped that they will never in practice
    2652 be relevant.  The maximum length of a compiled pattern
    2653 is 65539 (sic) bytes.  All values in repeating quantifiers
    2654 must be less than 65536.  The maximum nesting depth of
    2655 all parenthesized subpatterns, including capturing and
    2656 non-capturing subpatterns@footnote{The
    2657 distinction is meaningful when referring to Perl-style
    2658 regular expressions.}, assertions, and other types of
    2659 subpattern, is 200.
    2660 
    2661 Also, @value{SSED} recognizes the @sc{posix} syntax
    2662 @code{[.@var{ch}.]} and @code{[=@var{ch}=]}
    2663 where @var{ch} is a ``collating element'', but these
    2664 are not supported, and an error is given if they are
    2665 encountered.
    2666 
    2667 Here are a few distinctions between the real Perl-style
    2668 regular expressions and those that @option{-R} recognizes.
    2669 
    2670 @enumerate
    2671 @item
    2672 Lookahead assertions do not allow repeat quantifiers after them
    2673 Perl permits them, but they do not mean what you
    2674 might think. For example, @samp{(?!a)@{3@}} does not assert that the
    2675 next three characters are not @samp{a}. It just asserts three times that the
    2676 next character is not @samp{a} --- a waste of time and nothing else.
    2677 
    2678 @item
    2679 Capturing subpatterns that occur inside  negative  lookahead
    2680 head  assertions  are  counted,  but  their  entries are counted
    2681 as empty in the second half of an @code{s} command.
    2682 Perl sets its numerical variables from any such patterns
    2683 that are matched before the assertion fails to match
    2684 something (thereby succeeding), but only if the negative
    2685 lookahead assertion contains just one branch.
    2686 
    2687 @item
    2688 The following Perl escape sequences are not supported:
    2689 @samp{\l}, @samp{\u}, @samp{\L}, @samp{\U}, @samp{\E},
    2690 @samp{\Q}. In fact these are implemented by Perl's general
    2691 string-handling and are not part of its pattern matching engine.
    2692 
    2693 @item
    2694 The Perl @samp{\G} assertion is not supported as it is not
    2695 relevant to single pattern matches.
    2696 
    2697 @item
    2698 Fairly obviously, @value{SSED} does not support the @samp{(?@{code@})}
    2699 and @samp{(?p@{code@})} constructions. However, there is some experimental
    2700 support for recursive patterns using the non-Perl item @samp{(?R)}.
    2701 
    2702 @item
    2703 There are at the time of writing some oddities in Perl
    2704 5.005_02 concerned with the settings of captured strings
    2705 when part of a pattern is repeated. For example, matching
    2706 @samp{aba} against the pattern @samp{/^(a(b)?)+$/} sets
    2707 @samp{$2}@footnote{@samp{$2} would be @samp{\2} in @value{SSED}.}
    2708 to the value @samp{b}, but matching @samp{aabbaa}
    2709 against @samp{/^(aa(bb)?)+$/} leaves @samp{$2}
    2710 unset.  However, if the pattern is changed to
    2711 @samp{/^(aa(b(b))?)+$/} then @samp{$2} (and @samp{$3}) are set.
    2712 In Perl 5.004 @samp{$2} is set in both cases, and that is also
    2713 true of @value{SSED}.
    2714 
    2715 @item
    2716 Another as yet unresolved discrepancy is that in Perl
    2717 5.005_02 the pattern @samp{/^(a)?(?(1)a|b)+$/} matches
    2718 the string @samp{a}, whereas in @value{SSED} it does not.
    2719 However, in both Perl and @value{SSED} @samp{/^(a)?a/} matched
    2720 against @samp{a} leaves $1 unset.
    2721 @end enumerate
    2722 @end ifset
    27235626
    27245627@node Other Resources
    27255628@chapter Other Resources for Learning About @command{sed}
    27265629
     5630For up to date information about @value{SSED} please
     5631visit @uref{https://www.gnu.org/software/sed/}.
     5632
     5633Send general questions and suggestions to @email{sed-devel@@gnu.org}.
     5634Visit the mailing list archives for past discussions at
     5635@uref{https://lists.gnu.org/archive/html/sed-devel/}.
     5636
    27275637@cindex Additional reading about @command{sed}
    2728 In addition to several books that have been written about @command{sed}
    2729 (either specifically or as chapters in books which discuss
    2730 shell programming), one can find out more about @command{sed}
    2731 (including suggestions of a few books) from the FAQ
    2732 for the @code{sed-users} mailing list, available from any of:
    2733 @display
    2734  @uref{http://www.student.northpark.edu/pemente/sed/sedfaq.html}
    2735  @uref{http://sed.sf.net/grabbag/tutorials/sedfaq.html}
    2736 @end display
    2737 
    2738 Also of interest are
    2739 @uref{http://www.student.northpark.edu/pemente/sed/index.htm}
    2740 and @uref{http://sed.sf.net/grabbag},
    2741 which include @command{sed} tutorials and other @command{sed}-related goodies.
    2742 
    2743 The @code{sed-users} mailing list itself maintained by Sven Guckes.
    2744 To subscribe, visit @uref{http://groups.yahoo.com} and search
    2745 for the @code{sed-users} mailing list.
     5638The following resources provide information about @command{sed}
      5639(both @value{SSED} and other variations). Note that these are not maintained by
     5640@value{SSED} developers.
     5641
     5642@itemize @bullet
     5643
     5644@item
     5645sed @code{$HOME}: @uref{http://sed.sf.net}
     5646
     5647@item
     5648sed FAQ: @uref{http://sed.sf.net/sedfaq.html}
     5649
     5650@item
     5651seder's grabbag: @uref{http://sed.sf.net/grabbag}
     5652
     5653@item
     5654The @code{sed-users} mailing list maintained by Sven Guckes:
     5655@uref{http://groups.yahoo.com/group/sed-users/}
     5656(note this is @emph{not} the @value{SSED} mailing list).
     5657
     5658@end itemize
    27465659
    27475660@node Reporting Bugs
     
    27495662
    27505663@cindex Bugs, reporting
    2751 Email bug reports to @email{bonzini@@gnu.org}.
    2752 Be sure to include the word ``sed'' somewhere in the @code{Subject:} field.
     5664Email bug reports to @email{bug-sed@@gnu.org}.
    27535665Also, please include the output of @samp{sed --version} in the body
    27545666of your report if at all possible.
     
    27575669
    27585670@example
    2759 @i{while building frobme-1.3.4}
    2760 $ configure 
     5671@i{@i{@r{while building frobme-1.3.4}}}
     5672$ configure
    27615673@error{} sed: file sedscr line 1: Unknown option to 's'
    27625674@end example
     
    27775689
    27785690@table @asis
     5691@anchor{N_command_last_line}
    27795692@item @code{N} command on the last line
    27805693@cindex Portability, @code{N} command on the last line
     
    27865699the @command{-n} command switch has been specified.  This choice is
    27875700by design.
     5701
      5702Default behavior (GNU extension, not POSIX-conforming):
     5703@example
     5704$ seq 3 | sed N
     57051
     57062
     57073
     5708@end example
     5709@noindent
     5710To force POSIX-conforming behavior:
     5711@example
     5712$ seq 3 | sed --posix N
     57131
     57142
     5715@end example
    27885716
    27895717For example, the behavior of
     
    28065734/foo/@{ N;N;N;N;N;N;N;N;N; @}
    28075735@end example
    2808  
     5736
    28095737@cindex @code{POSIXLY_CORRECT} behavior, @code{N} command
    28105738In any case, the simplest workaround is to use @code{$d;N} in
     
    28135741
    28145742@item Regex syntax clashes (problems with backslashes)
    2815 @cindex @acronym{GNU} extensions, to basic regular expressions
     5743@cindex GNU extensions, to basic regular expressions
    28165744@cindex Non-bugs, regex syntax clashes
    28175745@command{sed} uses the @sc{posix} basic regular expression syntax.  According to
     
    28215749@code{\>}, @code{\b}, @code{\B}, @code{\w}, and @code{\W}.
    28225750
    2823 As in all @acronym{GNU} programs that use @sc{posix} basic regular
     5751As in all GNU programs that use @sc{posix} basic regular
    28245752expressions, @command{sed} interprets these escape sequences as special
    28255753characters.  So, @code{x\+} matches one or more occurrences of @samp{x}.
     
    28325760spurious backslashes if they are to be used with modern implementations
    28335761of @command{sed}, like
    2834 @ifset PERL
    2835 @value{SSED} or
    2836 @end ifset
    2837 @acronym{GNU} @command{sed}.
     5762GNU @command{sed}.
    28385763
    28395764On the other hand, some scripts use s|abc\|def||g to remove occurrences
     
    28415766@command{sed} 4.0.x, newer versions interpret this as removing the
    28425767string @code{abc|def}.  This is again undefined behavior according to
    2843 @acronym{POSIX}, and this interpretation is arguably more robust: older
     5768POSIX, and this interpretation is arguably more robust: older
    28445769@command{sed}s, for example, required that the regex matcher parsed
    28455770@code{\/} as @code{/} in the common case of escaping a slash, which is
     
    28475772because the regex matcher is only partially under our control.
    28485773
    2849 @cindex @acronym{GNU} extensions, special escapes
     5774@cindex GNU extensions, special escapes
    28505775In addition, this version of @command{sed} supports several escape characters
    28515776(some of which are multi-character) to insert non-printable characters
     
    28635788(@pxref{Invoking sed, , Invocation}) lets you clobber
    28645789protected files.  This is not a bug, but rather a consequence
    2865 of how the Unix filesystem works.
     5790of how the Unix file system works.
    28665791
    28675792The permissions on a file say what can happen to the data
     
    28735798modifying the contents of the directory, so the operation depends on
    28745799the permissions of the directory, not of the file.  For this same
    2875 reason, @command{sed} does not let you use @option{-i} on a writeable file
    2876 in a read-only directory (but unbelievably nobody reports that as a
    2877 bug@dots{}).
     5800reason, @command{sed} does not let you use @option{-i} on a writable file
     5801in a read-only directory, and will break hard or symbolic links when
     5802@option{-i} is used on such a file.
    28785803
    28795804@item @code{0a} does not work (gives an error)
     5805@cindex @code{0} address
     5806@cindex GNU extensions, @code{0} address
     5807@cindex Non-bugs, @code{0} address
     5808
    28805809There is no line 0.  0 is a special address that is only used to treat
    28815810addresses like @code{0,/@var{RE}/} as active when the script starts: if
    2882 you write @code{1,/abc/d} and the first line includes the word @samp{abc},
     5811you write @code{1,/abc/d} and the first line includes the string @samp{abc},
    28835812then that match would be ignored because address ranges must span at least
    28845813two lines (barring the end of the file); but what you probably wanted is
     
    28885817@ifclear PERL
    28895818@item @code{[a-z]} is case insensitive
     5819@cindex Non-bugs, localization-related
     5820
    28905821You are encountering problems with locales.  POSIX mandates that @code{[a-z]}
    28915822uses the current locale's collation order -- in C parlance, that means using
    28925823@code{strcoll(3)} instead of @code{strcmp(3)}.  Some locales have a
    2893 case-insensitive collation order, others don't: one of those that have
    2894 problems is Estonian.
     5824case-insensitive collation order, others don't.
    28955825
    28965826Another problem is that @code{[a-z]} tries to use collation symbols.
    2897 This only happens if you are on the @acronym{GNU} system, using
    2898 @acronym{GNU} libc's regular expression matcher instead of compiling the
    2899 one supplied with @acronym{GNU} sed.  In a Danish locale, for example,
     5827This only happens if you are on the GNU system, using
     5828GNU libc's regular expression matcher instead of compiling the
     5829one supplied with GNU sed.  In a Danish locale, for example,
    29005830the regular expression @code{^[a-z]$} matches the string @samp{aa},
    29015831because this is a single collating symbol that comes after @samp{a}
     
    29055835To work around these problems, which may cause bugs in shell scripts, set
    29065836the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.
     5837
     5838@item @code{s/.*//} does not clear pattern space
     5839@cindex Non-bugs, localization-related
     5840@cindex @value{SSEDEXT}, emptying pattern space
     5841@cindex Emptying pattern space
     5842
     5843This happens if your input stream includes invalid multibyte
     5844sequences.  @sc{posix} mandates that such sequences
     5845are @emph{not} matched by @samp{.}, so that @samp{s/.*//} will not clear
     5846pattern space as you would expect.  In fact, there is no way to clear
     5847sed's buffers in the middle of the script in most multibyte locales
     5848(including UTF-8 locales).  For this reason, @value{SSED} provides a `z'
     5849command (for `zap') as an extension.
     5850
     5851To work around these problems, which may cause bugs in shell scripts, set
     5852the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.
    29075853@end ifclear
    29085854@end table
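
The @code{z} command mentioned in the last item can be tried on input that contains an invalid multibyte sequence. This is a minimal sketch, assuming GNU @command{sed}; the byte @samp{\377} is chosen because it is never valid in UTF-8, and the marker text @samp{EMPTY} is arbitrary:

```shell
# Feed a line containing the byte 0xFF, which is invalid in UTF-8.
# 'z' empties pattern space regardless of the bytes it holds; the
# following 's' command matches only if pattern space is really empty.
zapped=$(printf 'a\377b\n' | sed 'z;s/^$/EMPTY/')
echo "$zapped"
```

Since @code{z} does not go through the regular expression matcher, it does not depend on whether @samp{.} can match the offending bytes in the current locale.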
    29095855
    29105856
    2911 @node Extended regexps
    2912 @appendix Extended regular expressions
    2913 @cindex Extended regular expressions, syntax
    2914 
    2915 The only difference between basic and extended regular expressions is in
    2916 the behavior of a few characters: @samp{?}, @samp{+}, parentheses,
    2917 and braces (@samp{@{@}}).  While basic regular expressions require
    2918 these to be escaped if you want them to behave as special characters,
    2919 when using extended regular expressions you must escape them if
    2920 you want them @emph{to match a literal character}.
    2921 
    2922 @noindent
    2923 Examples:
    2924 @table @code
    2925 @item abc?
    2926 becomes @samp{abc\?} when using extended regular expressions.  It matches
    2927 the literal string @samp{abc?}.
    2928 
    2929 @item c\+
    2930 becomes @samp{c+} when using extended regular expressions.  It matches
    2931 one or more @samp{c}s.
    2932 
    2933 @item a\@{3,\@}
    2934 becomes @samp{a@{3,@}} when using extended regular expressions.  It matches
    2935 three or more @samp{a}s.
    2936 
    2937 @item \(abc\)\@{2,3\@}
    2938 becomes @samp{(abc)@{2,3@}} when using extended regular expressions.  It
    2939 matches either @samp{abcabc} or @samp{abcabcabc}.
    2940 
    2941 @item \(abc*\)\1
    2942 becomes @samp{(abc*)\1} when using extended regular expressions.
    2943 Backreferences must still be escaped when using extended regular
    2944 expressions.
    2945 @end table
    2946 
    2947 @ifset PERL
    2948 @node Perl regexps
    2949 @appendix Perl-style regular expressions
    2950 @cindex Perl-style regular expressions, syntax
    2951 
    2952 @emph{This part is taken from the @file{pcre.txt} file distributed together
    2953 with the free @sc{pcre} regular expression matcher; it was written by Philip Hazel.}
    2954 
    2955 Perl introduced several extensions to regular expressions, some
    2956 of them incompatible with the syntax of regular expressions
    2957 accepted by Emacs and other @acronym{GNU} tools (whose matcher was
    2958 based on the Emacs matcher).  @value{SSED} implements
    2959 both kinds of extensions.
    2960 
    2961 @iftex
    2962 Summarizing, we have:
    2963 
    2964 @itemize @bullet
    2965 @item
    2966 A backslash can introduce several special sequences
    2967 
    2968 @item
    2969 The circumflex, dollar sign, and period characters behave specially
    2970 with regard to new lines
    2971 
    2972 @item
    2973 Strange uses of square brackets are parsed differently
    2974 
    2975 @item
    2976 You can toggle modifiers in the middle of a regular expression
    2977 
    2978 @item
    2979 You can specify that a subpattern does not count when numbering backreferences
    2980 
    2981 @item
    2982 @cindex Greedy regular expression matching
    2983 You can specify greedy or non-greedy matching
    2984 
    2985 @item
    2986 You can have more than ten back references
    2987 
    2988 @item
    2989 You can do complex look aheads and look behinds (in the spirit of
    2990 @code{\b}, but with subpatterns).
    2991 
    2992 @item
    2993 You can often improve performance by avoiding that @command{sed} wastes
    2994 time with backtracking
    2995 
    2996 @item
    2997 You can have if/then/else branches
    2998 
    2999 @item
    3000 You can do recursive matches, for example to look for unbalanced parentheses
    3001 
    3002 @item
    3003 You can have comments and non-significant whitespace, because things can
    3004 get complex...
    3005 @end itemize
    3006 
    3007 Most of these extensions are introduced by the special @code{(?}
    3008 sequence, which gives special meanings to parenthesized groups.
    3009 @end iftex
    3010 @menu
    3011 Other extensions can be roughly subdivided in two categories
    3012 On one hand Perl introduces several more escaped sequences
    3013 (that is, sequences introduced by a backslash).  On the other
    3014 hand, it specifies that if a question mark follows an open
    3015 parentheses it should give a special meaning to the parenthesized
    3016 group.
    3017 
    3018 * Backslash::                       Introduces special sequences
    3019 * Circumflex/dollar sign/period::   Behave specially with regard to new lines
    3020 * Square brackets::                 Are a bit different in strange cases
    3021 * Options setting::                 Toggle modifiers in the middle of a regexp
    3022 * Non-capturing subpatterns::       Are not counted when backreferencing
    3023 * Repetition::                      Allows for non-greedy matching
    3024 * Backreferences::                  Allows for more than 10 back references
    3025 * Assertions::                      Allows for complex look ahead matches
    3026 * Non-backtracking subpatterns::    Often gives more performance
    3027 * Conditional subpatterns::         Allows if/then/else branches
    3028 * Recursive patterns::              For example to match parentheses
    3029 * Comments::                        Because things can get complex...
    3030 @end menu
    3031 
    3032 @node Backslash
    3033 @appendixsec Backslash
    3034 @cindex Perl-style regular expressions, escaped sequences
    3035 
    3036 There are a few difference in the handling of backslashed
    3037 sequences in Perl mode.
    3038 
    3039 First of all, there are no @code{\o} and @code{\d} sequences.
    3040 @sc{ascii} values for characters can be specified in octal
    3041 with a @code{\@var{xxx}} sequence, where @var{xxx} is a
    3042 sequence of up to three octal digits.  If the first digit
    3043 is a zero, the treatment of the sequence is straightforward;
    3044 just note that if the character that follows the escaped digit
    3045 is itself an octal digit, you have to supply three octal digits
    3046 for @var{xxx}.  For example @code{\07} is a @sc{bel} character
    3047 rather than a @sc{nul} and a literal @code{7} (this sequence is
    3048 instead represented by @code{\0007}).
    3049 
    3050 @cindex Perl-style regular expressions, backreferences
    3051 The handling of a backslash followed by a digit other than 0
    3052 is complicated.  Outside a character class, @command{sed} reads it
    3053 and any following digits as a decimal number. If the number
    3054 is less than 10, or if there have been at least that many
    3055 previous capturing left parentheses in the expression, the
    3056 entire sequence is taken as a back reference. A description
    3057 of how this works is given later, following the discussion
    3058 of parenthesized subpatterns.
    3059 
    3060 Inside a character class, or if the decimal number is
    3061 greater than 9 and there have not been that many capturing
    3062 subpatterns, @command{sed} re-reads up to three octal digits following
    3063 the backslash, and generates a single byte from the
    3064 least significant 8 bits of the value. Any subsequent digits
    3065 stand for themselves.  For example:
    3066 
    3067 @example
    3068      \040  @i{is another way of writing a space}
    3069      \40   @i{is the same, provided there are fewer than 40}
    3070            @i{previous capturing subpatterns}
    3071      \7    @i{is always a back reference}
    3072      \011  @i{is always a tab}
    3073      \11   @i{might be a back reference, or another way of}
    3074            @i{writing a tab}
    3075      \0113 @i{is a tab followed by the character @samp{3}}
    3076      \113  @i{is the character with octal code 113 (since there}
    3077            @i{can be no more than 99 back references)}
    3078      \377  @i{is a byte consisting entirely of 1 bits (@sc{ascii} 255)}
    3079      \81   @i{is either a back reference, or a binary zero}
    3080            @i{followed by the two characters @samp{81}}
    3081 @end example
    3082 
    3083 Note that octal values of 100 or greater must not be introduced
    3084 duced by a leading zero, because no more than three octal
    3085 digits are ever read.
    3086 
    3087 All the sequences that define a single byte value can be
    3088 used both inside and outside character classes. In addition,
    3089 inside a character class, the sequence @code{\b} is interpreted
    3090 as the backspace character (hex 08). Outside a character
    3091 class it has a different meaning (see below).
    3092 
    3093 In addition, there are four additional escapes specifying
    3094 generic character classes (like @code{\w} and @code{\W} do):
    3095 
    3096 @cindex Perl-style regular expressions, character classes
    3097 @table @samp
    3098 @item \d
    3099 Matches any decimal digit
    3100 
    3101 @item \D
    3102 Matches any character that is not a decimal digit
    3103 @end table
    3104 
    3105 In Perl mode, these character type sequences can appear both inside and
    3106 outside character classes. Instead, in @sc{posix} mode these sequences
    3107 (as well as @code{\w} and @code{\W}) are treated as two literal characters
    3108 (a backslash and a letter) inside square brackets.
    3109 
    3110 Escaped sequences specifying assertions are also different in
    3111 Perl mode.  An assertion specifies a condition that has to be met
    3112 at a particular point in a match, without consuming any
    3113 characters from the subject string. The use of subpatterns
    3114 for more complicated assertions is described below.  The
    3115 backslashed assertions are
    3116 
    3117 @cindex Perl-style regular expressions, assertions
    3118 @table @samp
    3119 @item \b
    3120 Asserts that the point is at a word boundary.
    3121 A word boundary is a position in the subject string where
    3122 the current character and the previous character do not both
    3123 match @code{\w} or @code{\W} (i.e. one matches @code{\w} and
    3124 the other matches @code{\W}), or the start or end of the string
    3125 if the first or last character matches @code{\w}, respectively.
    3126 
    3127 @item \B
    3128 Asserts that the point is not at a word boundary.
    3129 
    3130 @item \A
    3131 Asserts the matcher is at the start of pattern space (independent
    3132 of multiline mode).
    3133 
    3134 @item \Z
    3135 Asserts the matcher is at the end of pattern space,
    3136 or at a newline before the end of pattern space (independent of
    3137 multiline mode)
    3138 
    3139 @item \z
    3140 Asserts the matcher is at the end of pattern space (independent
    3141 of multiline mode)
    3142 @end table
    3143 
    3144 These assertions may not appear in character classes (but
    3145 note that @code{\b} has a different meaning, namely the
    3146 backspace character, inside a character class).
    3147 Note that Perl mode does not support directly assertions
    3148 for the beginning and the end of word; the @acronym{GNU} extensions
    3149 @code{\<} and @code{\>} achieve this purpose in @sc{posix} mode
    3150 instead.
    3151 
    3152 The @code{\A}, @code{\Z}, and @code{\z} assertions differ
    3153 from the traditional circumflex and dollar sign (described below)
    3154 in that they only ever match at the very start and end of the
    3155 subject string, whatever options are set; in particular @code{\A}
    3156 and @code{\z} are the same as the @acronym{GNU} extensions
    3157 @code{\`} and @code{\'} that are active in @sc{posix} mode.
    3158 
    3159 @node Circumflex/dollar sign/period
    3160 @appendixsec Circumflex, dollar sign, period
    3161 @cindex Perl-style regular expressions, newlines
    3162 
    3163 Outside a character class, in the default matching mode, the
    3164 circumflex character is an assertion which is true only if
    3165 the current matching point is at the start of the subject
    3166 string.  Inside a character class, the circumflex has an entirely
    3167 different meaning (see below).
    3168 
    3169 The circumflex need not be the first character of the pattern if
    3170 a number of alternatives are involved, but it should be the
    3171 first thing in each alternative in which it appears if the
    3172 pattern is ever to match that branch. If all possible alternatives
    3173 start with a circumflex, that is, if the pattern is
    3174 constrained to match only at the start of the subject, it is
    3175 said to be an @dfn{anchored} pattern. (There are also other
    3176 constructs that can cause a pattern to be anchored.)
    3177 
    3178 A dollar sign is an assertion which is true only if the
    3179 current matching point is at the end of the subject string,
    3180 or immediately before a newline character that is the last
    3181 character in the string (by default).  A dollar sign need not be the
    3182 last character of the pattern if a number of alternatives
    3183 are involved, but it should be the last item in any branch
    3184 in which it appears.  A dollar sign has no special meaning in a
    3185 character class.
    3186 
    3187 @cindex Perl-style regular expressions, multiline
    3188 The meanings of the circumflex and dollar sign characters are
    3189 changed if the @code{M} modifier option is used. When this is
    3190 the case, they match immediately after and immediately
    3191 before an internal @code{\n} character, respectively, in addition
    3192 to matching at the start and end of the subject string.  For
    3193 example, the pattern @code{/^abc$/} matches the subject string
    3194 @samp{def\nabc} in multiline mode, but not otherwise.  Consequently,
    3195 patterns that are anchored in single line mode
    3196 because all branches start with @code{^} are not anchored in
    3197 multiline mode.
    3198 
    3199 @cindex Perl-style regular expressions, multiline
    3200 Note that the sequences @code{\A}, @code{\Z}, and @code{\z}
    3201 can be used to match the start and end of the subject in both
    3202 modes, and if all branches of a pattern start with @code{\A}
    3203 it is always anchored, whether the @code{M} modifier is set or not.
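
The difference between the two anchoring styles can be sketched with Python's @code{re} module. This is only a stand-in: its @code{re.MULTILINE} flag plays the role of the @code{M} modifier, but it is not the matcher that @value{SSED} uses, and the variable names are illustrative.

```python
import re

subject = "def\nabc"

# Without the multiline flag, ^ and $ anchor only at the ends of the subject.
single_line = re.search(r"^abc$", subject)

# With re.MULTILINE, ^ also matches just after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# \A ignores the multiline flag and anchors at the very start only.
anchored = re.search(r"\Aabc", subject, re.MULTILINE)

print(single_line, multi_line.group(), anchored)
```

Only the multiline search finds @samp{abc}; the @code{\A} form stays anchored even with the flag set.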
    3204 
    3205 @cindex Perl-style regular expressions, single line
    3206 Outside a character class, a dot in the pattern matches any
    3207 one character in the subject, including a non-printing character,
    3208 but not (by default) newline.  If the @code{S} modifier is used,
    3209 dots match newlines as well.  Actually, the handling of
    3210 dot is entirely independent of the handling of circumflex
    3211 and dollar sign, the only relationship being that they both
    3212 involve newline characters. Dot has no special meaning in a
    3213 character class.
    3214 
    3215 @node Square brackets
    3216 @appendixsec Square brackets
    3217 @cindex Perl-style regular expressions, character classes
    3218 
    3219 An opening square bracket introduces a character class, terminated
    3220 by a closing square bracket.  A closing square bracket on its own
    3221 is not special.  If a closing square bracket is required as a
    3222 member of the class, it should be the first data character in
    3223 the class (after an initial circumflex, if present) or escaped with a backslash.
    3224 
    3225 A character class matches a single character in the subject;
    3226 the character must be in the set of characters defined by
    3227 the class, unless the first character in the class is a circumflex,
    3228 in which case the subject character must not be in
    3229 the set defined by the class. If a circumflex is actually
    3230 required as a member of the class, ensure it is not the
    3231 first character, or escape it with a backslash.
    3232 
    3233 For example, the character class @code{[aeiou]} matches any lower
    3234 case vowel, while @code{[^aeiou]} matches any character that is not
    3235 a lower case vowel. Note that a circumflex is just a convenient
    3236 notation for specifying the characters which are in
    3237 the class by enumerating those that are not. It is not an
    3238 assertion: it still consumes a character from the subject
    3239 string, and fails if the current pointer is at the end of
    3240 the string.
    3241 
    3242 @cindex Perl-style regular expressions, case-insensitive
    3243 When caseless matching is set, any letters in a class
    3244 represent both their upper case and lower case versions, so
    3245 for example, a caseless @code{[aeiou]} matches uppercase
    3246 and lowercase @samp{A}s, and a caseless @code{[^aeiou]}
    3247 does not match @samp{A}, whereas a case-sensitive version would.
    3248 
    3249 @cindex Perl-style regular expressions, single line
    3250 @cindex Perl-style regular expressions, multiline
    3251 The newline character is never treated in any special way in
    3252 character classes, whatever the setting of the @code{S} and
    3253 @code{M} options (modifiers) is.  A class such as @code{[^a]} will
    3254 always match a newline.
    3255 
    3256 The minus (hyphen) character can be used to specify a range
    3257 of characters in a character class.  For example, @code{[d-m]}
    3258 matches any letter between d and m, inclusive.  If a minus
    3259 character is required in a class, it must be escaped with a
    3260 backslash or appear in a position where it cannot be interpreted
    3261 as indicating a range, typically as the first or last
    3262 character in the class.
    3263 
    3264 It is not possible to have the literal character @code{]} as the
    3265 end character of a range.  A pattern such as @code{[W-]46]} is
    3266 interpreted as a class of two characters (@code{W} and @code{-})
    3267 followed by a literal string @code{46]}, so it would match
    3268 @samp{W46]} or @samp{-46]}. However, if the @code{]} is escaped
    3269 with a backslash it is interpreted as the end of range, so
    3270 @code{[W-\]46]} is interpreted as a single class containing a
    3271 range followed by two separate characters. The octal or
    3272 hexadecimal representation of @code{]} can also be used to end a range.
    3273 
    3274 Ranges operate in @sc{ascii} collating sequence. They can also be
    3275 used for characters specified numerically, for example
    3276 @code{[\000-\037]}. If a range that includes letters is used when
    3277 caseless matching is set, it matches the letters in either
    3278 case. For example, a caseless @code{[W-c]} is equivalent to
    3279 @code{[][\^_`wxyzabc]}, matched caselessly, and if character
    3280 tables for the French locale are in use, @code{[\xc8-\xcb]}
    3281 matches accented E characters in both cases.
    3282 
    3283 Unlike in @sc{posix} mode, the character types @code{\d},
    3284 @code{\D}, @code{\s}, @code{\S}, @code{\w}, and @code{\W}
    3285 may also appear in a character class, and add the characters
    3286 that they match to the class. For example, @code{[\dABCDEF]} matches any
    3287 hexadecimal digit.  A circumflex can conveniently be used
    3288 with the upper case character types to specify a more restricted
    3289 set of characters than the matching lower case type.
    3290 For example, the class @code{[^\W_]} matches any letter or digit,
    3291 but not underscore.
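
A few of the classes above, tried with Python's @code{re} module as a rough stand-in for Perl mode. The assumption is that its class syntax agrees with Perl for these cases; @code{re.ASCII} is passed so that @code{\d} and @code{\w} cover only the @sc{ascii} ranges the text has in mind.

```python
import re

# Character types may appear inside a class and add their members to it.
hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)

# Negating \W and adding _ leaves letters and digits only.
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

# A negated class is free to match a newline.
not_a = re.compile(r"[^a]")

hex_ok = [c for c in "09AFGa" if hex_digit.fullmatch(c)]
alnum_ok = [c for c in "aZ5_-" if letter_or_digit.fullmatch(c)]
newline_ok = not_a.fullmatch("\n") is not None

print(hex_ok, alnum_ok, newline_ok)
```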
    3292 
    3293 All non-alphanumeric characters other than @code{\}, @code{-},
    3294 @code{^} (at the start) and the terminating @code{]}
    3295 are non-special in character classes, but it does no harm
    3296 if they are escaped.
    3297 
    3298 Perl 5.6 supports the @sc{posix} notation for character classes, which
    3299 uses names enclosed by @code{[:} and @code{:]} within the enclosing
    3300 square brackets, and @value{SSED} supports this notation as well.
    3301 For example,
    3302 
    3303 @example
    3304      [01[:alpha:]%]
    3305 @end example
    3306 
    3307 @noindent
    3308 matches @samp{0}, @samp{1}, any alphabetic character, or @samp{%}.
    3309 The supported class names are
    3310 
    3311 @table @code
    3312 @item alnum
    3313 Matches letters and digits
    3314 
    3315 @item alpha
    3316 Matches letters
    3317 
    3318 @item ascii
    3319 Matches character codes 0 - 127
    3320 
    3321 @item cntrl
    3322 Matches control characters
    3323 
    3324 @item digit
    3325 Matches decimal digits (same as \d)
    3326 
    3327 @item graph
    3328 Matches printing characters, excluding space
    3329 
    3330 @item lower
    3331 Matches lower case letters
    3332 
    3333 @item print
    3334 Matches printing characters, including space
    3335 
    3336 @item punct
    3337 Matches printing characters, excluding letters and digits
    3338 
    3339 @item space
    3340 Matches white space (same as \s)
    3341 
    3342 @item upper
    3343 Matches upper case letters
    3344 
    3345 @item word
    3346 Matches ``word'' characters (same as \w)
    3347 
    3348 @item xdigit
    3349 Matches hexadecimal digits
    3350 @end table
    3351 
    3352 The names @code{ascii} and @code{word} are extensions valid only in
    3353 Perl mode.  Another Perl extension is negation, which is
    3354 indicated by a circumflex character after the colon. For example,
    3355 
    3356 @example
    3357      [12[:^digit:]]
    3358 @end example
    3359 
    3360 @noindent
    3361 matches @samp{1}, @samp{2}, or any non-digit.
    3362 
    3363 @node Options setting
    3364 @appendixsec Options setting
    3365 @cindex Perl-style regular expressions, toggling options
    3366 @cindex Perl-style regular expressions, case-insensitive
    3367 @cindex Perl-style regular expressions, multiline
    3368 @cindex Perl-style regular expressions, single line
    3369 @cindex Perl-style regular expressions, extended
    3370 
    3371 The settings of the @code{I}, @code{M}, @code{S}, @code{X}
    3372 modifiers can be changed from within the pattern by
    3373 a sequence of Perl option letters enclosed between @code{(?}
    3374 and @code{)}. The option letters must be lowercase.
    3375 
    3376 For example, @code{(?im)} sets caseless, multiline matching. It is
    3377 also possible to unset these options by preceding the letter
    3378 with a hyphen; you can also have combined settings and unsettings:
    3379 @code{(?im-sx)} sets caseless and multiline matching,
    3380 while unsetting single line matching (for dots) and extended
    3381 whitespace interpretation.  If a letter appears both before
    3382 and after the hyphen, the option is unset.
    3383 
    3384 The scope of these option changes depends on where in the
    3385 pattern the setting occurs. For settings that are outside
    3386 any subpattern (defined below), the effect is the same as if
    3387 the options were set or unset at the start of matching. The
    3388 following patterns all behave in exactly the same way:
    3389 
    3390 @example
    3391      (?i)abc
    3392      a(?i)bc
    3393      ab(?i)c
    3394      abc(?i)
    3395 @end example
    3396 
    3397 which in turn is the same as specifying the pattern @code{abc} with
    3398 the @code{I} modifier.  In other words, ``top level'' settings
    3399 apply to the whole pattern (unless there are other
    3400 changes inside subpatterns). If there is more than one setting
    3401 of the same option at top level, the rightmost setting
    3402 is used.
    3403 
    3404 If an option change occurs inside a subpattern, the effect
    3405 is different.  This is a change of behaviour in Perl 5.005.
    3406 An option change inside a subpattern affects only that part
    3407 of the subpattern @emph{that follows} it, so
    3408 
    3409 @example
    3410      (a(?i)b)c
    3411 @end example
    3412 
    3413 @noindent
    3414 matches @samp{abc} and @samp{aBc} and no other strings (assuming
    3415 case-sensitive matching is used).  By this means, options can
    3416 be made to have different settings in different parts of the
    3417 pattern.  Any changes made in one alternative do carry on
    3418 into subsequent branches within the same subpattern.  For
    3419 example,
    3420 
    3421 @example
    3422      (a(?i)b|c)
    3423 @end example
    3424 
    3425 @noindent
    3426 matches @samp{ab}, @samp{aB}, @samp{c}, and @samp{C},
    3427 even though when matching @samp{C} the first branch is
    3428 abandoned before the option setting.
    3429 This is because the effects of option settings happen at
    3430 compile time. There would be some very weird behaviour otherwise.
    3431 
    3432 @ignore
    3433 There are two PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA
    3434 that can be changed in the same way as the Perl-compatible options by
    3435 using the characters U and X respectively.  The (?X) flag
    3436 setting is special in that it must always occur earlier in
    3437 the pattern than any of the additional features it turns on,
    3438 even when it is at top level. It is best put at the start.
    3439 @end ignore
    3440 
    3441 
    3442 @node Non-capturing subpatterns
    3443 @appendixsec Non-capturing subpatterns
    3444 @cindex Perl-style regular expressions, non-capturing subpatterns
    3445 
    3446 Marking part of a pattern as a subpattern does two things.
    3447 On one hand, it localizes a set of alternatives; on the other
    3448 hand, it sets up the subpattern as a capturing subpattern (as
    3449 defined above).  The subpattern can be backreferenced and
    3450 referenced in the right side of @code{s} commands.
    3451 
    3452 For example, if the string @samp{the red king} is matched against
    3453 the pattern
    3454 
    3455 @example
    3456      the ((red|white) (king|queen))
    3457 @end example
    3458 
    3459 @noindent
    3460 the captured substrings are @samp{red king}, @samp{red},
    3461 and @samp{king}, and are numbered 1, 2, and 3.
    3462 
    3463 The fact that plain parentheses fulfil two functions is not
    3464 always helpful.  There are often times when a grouping
    3465 subpattern is required without a capturing requirement.  If an
    3466 opening parenthesis is followed by @code{?:}, the subpattern does
    3467 not do any capturing, and is not counted when computing the
    3468 number of any subsequent capturing subpatterns. For example,
    3469 if the string @samp{the white queen} is matched against the pattern
    3470 
    3471 @example
    3472      the ((?:red|white) (king|queen))
    3473 @end example
    3474 
    3475 @noindent
    3476 the captured substrings are @samp{white queen} and @samp{queen},
    3477 and are numbered 1 and 2. The maximum number of captured
    3478 substrings is 99, while the maximum number of all subpatterns,
    3479 both capturing and non-capturing, is 200.
    3480 
    3481 As a convenient shorthand, if any option settings are
    3482 required at the start of a non-capturing subpattern, the
    3483 option letters may appear between the @code{?} and the
    3484 @code{:}.  Thus the two patterns
    3485 
    3486 @example
    3487    (?i:saturday|sunday)
    3488    (?:(?i)saturday|sunday)
    3489 @end example
    3490 
    3491 @noindent
    3492 match exactly the same set of strings.  Because alternative
    3493 branches are tried from left to right, and options are not
    3494 reset until the end of the subpattern is reached, an option
    3495 setting in one branch does affect subsequent branches, so
    3496 the above patterns match @samp{SUNDAY} as well as @samp{Saturday}.
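
The group numbering described in this section can be checked with Python's @code{re} module, which accepts the same @code{(?:...)} and @code{(?i:...)} syntax. It is used here only as an approximation of Perl mode, and the variable names are illustrative.

```python
import re

# Plain parentheses both group and capture: three numbered substrings.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# (?: ... ) groups without capturing, so later groups are renumbered.
grouping = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")

# Option letters between ? and : apply to every branch of the group.
scoped = re.compile(r"(?i:saturday|sunday)")
weekend = [s for s in ("Saturday", "SUNDAY", "monday") if scoped.fullmatch(s)]

print(capturing.groups(), grouping.groups(), weekend)
```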
    3497 
    3498 
    3499 @node Repetition
    3500 @appendixsec Repetition
    3501 @cindex Perl-style regular expressions, repetitions
    3502 
    3503 Repetition is specified by quantifiers, which can follow any
    3504 of the following items:
    3505 
    3506 @itemize @bullet
    3507 @item
    3508 a single character, possibly escaped
    3509 
    3510 @item
    3511 the @code{.} special character
    3512 
    3513 @item
    3514 a character class
    3515 
    3516 @item
    3517 a back reference (see next section)
    3518 
    3519 @item
    3520 a parenthesized subpattern (unless it is an assertion; @pxref{Assertions})
    3521 @end itemize
    3522 
    3523 The general repetition quantifier specifies a minimum and
    3524 maximum number of permitted matches, by giving the two
    3525 numbers in curly brackets (braces), separated by a comma.
    3526 The numbers must be less than 65536, and the first must be
    3527 less than or equal to the second. For example:
    3528 
    3529 @example
    3530      z@{2,4@}
    3531 @end example
    3532 
    3533 @noindent
    3534 matches @samp{zz}, @samp{zzz}, or @samp{zzzz}. A closing brace on its own
    3535 is not a special character. If the second number is omitted,
    3536 but the comma is present, there is no upper limit; if the
    3537 second number and the comma are both omitted, the quantifier
    3538 specifies an exact number of required matches. Thus
    3539 
    3540 @example
    3541      [aeiou]@{3,@}
    3542 @end example
    3543 
    3544 @noindent
    3545 matches at least 3 successive vowels, but may match many
    3546 more, while
    3547 
    3548 @example
    3549      \d@{8@}
    3550 @end example
    3551 
    3552 @noindent
    3553 matches exactly 8 digits.  An opening curly bracket that
    3554 appears in a position where a quantifier is not allowed, or
    3555 one that does not match the syntax of a quantifier, is taken
    3556 as a literal character. For example, @{,6@} is not a quantifier,
    3557 but a literal string of four characters.@footnote{It
    3558 raises an error if @option{-R} is not used.}
    3559 
    3560 The quantifier @samp{@{0@}} is permitted, causing the expression to
    3561 behave as if the previous item and the quantifier were not
    3562 present.
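
The brace quantifiers above behave the same way in Python's @code{re} module, which serves here as a convenient test bed rather than as a description of @value{SSED} itself. The helper @code{accepted} is a made-up name for this sketch.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

two_to_four = accepted(r"z{2,4}", ["z", "zz", "zzz", "zzzz", "zzzzz"])
three_or_more = accepted(r"[aeiou]{3,}", ["ae", "aei", "aeiou"])
exactly_eight = accepted(r"\d{8}", ["1234567", "12345678", "123456789"])

# {0} makes the preceding item vanish from the pattern.
zero_times = accepted(r"ab{0}c", ["ac", "abc"])

print(two_to_four, three_or_more, exactly_eight, zero_times)
```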
    3563 
    3564 For convenience (and historical compatibility) the three
    3565 most common quantifiers have single-character abbreviations:
    3566 
    3567 @table @code
    3568 @item *
    3569 is equivalent to @{0,@}
    3570 
    3571 @item +
    3572 is equivalent to @{1,@}
    3573 
    3574 @item ?
    3575 is equivalent to @{0,1@}
    3576 @end table
    3577 
    3578 It is possible to construct infinite loops by following a
    3579 subpattern that can match no characters with a quantifier
    3580 that has no upper limit, for example:
    3581 
    3582 @example
    3583      (a?)*
    3584 @end example
    3585 
    3586 Earlier versions of Perl used to give an error at
    3587 compile time for such patterns. However, because there are
    3588 cases where this can be useful, such patterns are now
    3589 accepted, but if any repetition of the subpattern does in
    3590 fact match no characters, the loop is forcibly broken.
    3591 
    3592 @cindex Greedy regular expression matching
    3593 @cindex Perl-style regular expressions, stingy repetitions
    3594 By default, the quantifiers are @dfn{greedy} like in @sc{posix}
    3595 mode, that is, they match as much as possible (up to the maximum
    3596 number of permitted times), without causing the rest of the
    3597 pattern to fail. The classic example of where this gives problems
    3598 is in trying to match comments in C programs. These appear between
    3599 the sequences @code{/*} and @code{*/} and within the sequence, individual
    3600 @code{*} and @code{/} characters may appear. An attempt to match C
    3601 comments by applying the pattern
    3602 
    3603 @example
    3604      /\*.*\*/
    3605 @end example
    3606 
    3607 @noindent
    3608 to the string
    3609 
    3610 @example
    3611      /* first comment */ not comment /* second comment */
    3612 @end example
    3613 
    3614 @noindent
    3615 
    3616 fails, because it matches the entire string owing to the
    3617 greediness of the @code{.*} item.
    3618 
    3619 However, if a quantifier is followed by a question mark, it
    3620 ceases to be greedy, and instead matches the minimum number
    3621 of times possible, so the pattern @code{/\*.*?\*/}
    3622 does the right thing with the C comments. The meaning of the
    3623 various quantifiers is not otherwise changed, just the preferred
    3624 number of matches.  Do not confuse this use of question
    3625 mark with its use as a quantifier in its own right.
    3626 Because it has two uses, it can sometimes appear doubled, as in
    3627 
    3628 @example
    3629      \d??\d
    3630 @end example
    3631 
    3632 which matches one digit by preference, but can match two if
    3633 that is the only way the rest of the pattern matches.
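
Greedy and non-greedy matching can be compared side by side. The sketch below uses Python's @code{re} module, on the assumption that it handles @code{*?} and @code{??} the way Perl does; the subject string and variable names are chosen for illustration.

```python
import re

text = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last */ and swallows everything in between.
greedy = re.findall(r"/\*.*\*/", text)

# Non-greedy .*? stops at the first */ it can reach.
lazy = re.findall(r"/\*.*?\*/", text)

# \d??\d prefers one digit, but takes two when the whole subject must match.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy, lazy, one_digit, two_digits)
```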
    3634 
    3635 Note that greediness does not matter when specifying addresses,
    3636 but can nevertheless be used to improve performance.
    3637 
    3638 @ignore
    3639    If the PCRE_UNGREEDY option is set (an option which is not
    3640    available in Perl), the quantifiers are not greedy by
    3641    default, but individual ones can be made greedy by following
    3642    them with a question mark. In other words, it inverts the
    3643    default behaviour.
    3644 @end ignore
    3645 
    3646 When a parenthesized subpattern is quantified with a minimum
    3647 repeat count that is greater than 1 or with a limited maximum,
    3648 more memory is required for the compiled pattern, in
    3649 proportion to the size of the minimum or maximum.
    3650 
    3651 @cindex Perl-style regular expressions, single line
    3652 If a pattern starts with @code{.*} or @code{.@{0,@}} and the
    3653 @code{S} modifier is used, the pattern is implicitly anchored,
    3654 because whatever follows will be tried against every character
    3655 position in the subject string, so there is no point in
    3656 retrying the overall match at any position after the first.
    3657 PCRE treats such a pattern as though it were preceded by @code{\A}.
    3658 
    3659 When a capturing subpattern is repeated, the value captured
    3660 is the substring that matched the final iteration. For example,
    3661 after
    3662 
    3663 @example
    3664      (tweedle[dume]@{3@}\s*)+
    3665 @end example
    3666 
    3667 @noindent
    3668 has matched @samp{tweedledum tweedledee} the value of the
    3669 captured substring is @samp{tweedledee}.  However, if there are
    3670 nested capturing subpatterns, the corresponding captured
    3671 values may have been set in previous iterations. For example,
    3672 after
    3673 
    3674 @example
    3675      /(a|(b))+/
    3676 @end example
    3677 
    3678 matches @samp{aba}, the value of the second captured substring is
    3679 @samp{b}.
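
Both capture rules can be observed in Python's @code{re} module, which happens to keep the value of an inner group from an earlier iteration just as described here. Other engines may differ on that point, so treat this as a sketch and not as a statement about @value{SSED}.

```python
import re

# The outer group is overwritten on each iteration; the last one wins.
m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m.group(1)

# Group 2 took part only in the middle iteration, yet its value survives.
nested = re.fullmatch(r"(a|(b))+", "aba")

print(last_iteration, nested.groups())
```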
    3680 
    3681 @node Backreferences
    3682 @appendixsec Backreferences
    3683 @cindex Perl-style regular expressions, backreferences
    3684 
    3685 Outside a character class, a backslash followed by a digit
    3686 greater than 0 (and possibly further digits) is a back
    3687 reference to a capturing subpattern earlier (i.e.  to its
    3688 left) in the pattern, provided there have been that many
    3689 previous capturing left parentheses.
    3690 
    3691 However, if the decimal number following the backslash is
    3692 less than 10, it is always taken as a back reference, and
    3693 causes an error only if there are not that many capturing
    3694 left parentheses in the entire pattern. In other words, the
    3695 parentheses that are referenced need not be to the left of
    3696 the reference for numbers less than 10. @xref{Backslash},
    3697 for further details of the handling of digits following a backslash.
    3698 
    3699 A back reference matches whatever actually matched the capturing
    3700 subpattern in the current subject string, rather than
    3701 anything matching the subpattern itself. So the pattern
    3702 
    3703 @example
    3704      (sens|respons)e and \1ibility
    3705 @end example
    3706 
    3707 @noindent
    3708 matches @samp{sense and sensibility} and @samp{response and responsibility},
    3709 but not @samp{sense and responsibility}. If caseful
    3710 matching is in force at the time of the back reference, the
    3711 case of letters is relevant. For example,
    3712 
    3713 @example
    3714      ((?i)blah)\s+\1
    3715 @end example
    3716 
    3717 @noindent
    3718 matches @samp{blah blah} and @samp{Blah Blah}, but not
    3719 @samp{BLAH blah}, even though the original capturing
    3720 subpattern is matched caselessly.
    3721 
    3722 There may be more than one back reference to the same subpattern.
    3723 Also, if a subpattern has not actually been used in a
    3724 particular match, any back references to it always fail. For
    3725 example, the pattern
    3726 
    3727 @example
    3728      (a|(bc))\2
    3729 @end example
    3730 
    3731 @noindent
    3732 always fails if it starts to match @samp{a} rather than
    3733 @samp{bc}.  Because there may be up to 99 back references, all
    3734 digits following the backslash are taken as part of a potential
    3735 back reference number; this is different from what happens
    3736 in @sc{posix} mode. If the pattern continues with a digit
    3737 character, some delimiter must be used to terminate the back
    3738 reference.  If the @code{X} modifier option is set, this can be
    3739 whitespace.  Otherwise an empty comment can be used, or the
    3740 following character can be expressed in hexadecimal or octal.
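
A short check of the back reference rules, again using Python's @code{re} module as an approximation of Perl mode. The phrase list and variable names are made up for the example.

```python
import re

# \1 must repeat the text that group 1 matched, not merely its pattern.
pattern = re.compile(r"(sens|respons)e and \1ibility")
phrases = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
matched = [p for p in phrases if pattern.fullmatch(p)]

# A reference to a group that took no part in the match always fails.
unset = re.compile(r"(a|(bc))\2")
via_bc = unset.fullmatch("bcbc")
via_a = unset.search("aa")

print(matched, via_bc is not None, via_a)
```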
    3741 
    3742 A back reference that occurs inside the parentheses to which
    3743 it refers fails when the subpattern is first used, so, for
    3744 example, @code{(a\1)} never matches.  However, such references
    3745 can be useful inside repeated subpatterns. For example, the
    3746 pattern
    3747 
    3748 @example
    3749      (a|b\1)+
    3750 @end example
    3751 
    3752 @noindent
    3753 matches any number of @samp{a}s and also @samp{aba}, @samp{ababbaa},
    3754 etc. At each iteration of the subpattern, the back reference matches
    3755 the character string corresponding to the previous iteration.  In
    3756 order for this to work, the pattern must be such that the first
    3757 iteration does not need to match the back reference.  This can be
    3758 done using alternation, as in the example above, or by a
    3759 quantifier with a minimum of zero.
    3760 
    3761 @node Assertions
    3762 @appendixsec Assertions
    3763 @cindex Perl-style regular expressions, assertions
    3764 @cindex Perl-style regular expressions, asserting subpatterns
    3765 
    3766 An assertion is a test on the characters following or
    3767 preceding the current matching point that does not actually
    3768 consume any characters. The simple assertions coded as @code{\b},
    3769 @code{\B}, @code{\A}, @code{\Z}, @code{\z}, @code{^} and @code{$}
    3770 are described above. More complicated assertions are coded as
    3771 subpatterns.  There are two kinds: those that look ahead of the
    3772 current position in the subject string, and those that look behind it.
    3773 
    3774 @cindex Perl-style regular expressions, lookahead subpatterns
    3775 An assertion subpattern is matched in the normal way, except
    3776 that it does not cause the current matching position to be
    3777 changed. Lookahead assertions start with @code{(?=} for positive
    3778 assertions and @code{(?!} for negative assertions. For example,
    3779 
    3780 @example
    3781      \w+(?=;)
    3782 @end example
    3783 
    3784 @noindent
    3785 matches a word followed by a semicolon, but does not include
    3786 the semicolon in the match, and
    3787 
    3788 @example
    3789      foo(?!bar)
    3790 @end example
    3791 
    3792 @noindent
    3793 matches any occurrence of @samp{foo} that is not followed by
    3794 @samp{bar}.
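
The two lookahead forms can be tried out in Python's @code{re} module, which shares this syntax with Perl. The subject strings are illustrative.

```python
import re

# The semicolon is required but stays out of the matched text.
word = re.search(r"\w+(?=;)", "total;").group()

# foo(?!bar) rejects only the subject where bar follows foo directly.
subjects = ("foobaz", "foobar", "foo")
not_followed = [s for s in subjects if re.search(r"foo(?!bar)", s)]

print(word, not_followed)
```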
    3795 
    3796 Note that the apparently similar pattern
    3797 
    3798 @example
    3799      (?!foo)bar
    3800 @end example
    3801 
    3802 @noindent
    3803 @cindex Perl-style regular expressions, lookbehind subpatterns
    3804 finds any occurrence of @samp{bar} even if it is preceded by
    3805 @samp{foo}, because the assertion @code{(?!foo)} is always true
    3806 when the next three characters are @samp{bar}. A lookbehind
    3807 assertion is needed to achieve this effect.
    3808 Lookbehind assertions start with @code{(?<=} for positive
    3809 assertions and @code{(?<!} for negative assertions. So,
    3810 
    3811 @example
    3812      (?<!foo)bar
    3813 @end example
    3814 
    3815 achieves the required effect of finding an occurrence of
    3816 @samp{bar} that is not preceded by @samp{foo}. The contents of a
    3817 lookbehind assertion are restricted
    3818 such that all the strings it matches must have a fixed
    3819 length.  However, if there are several alternatives, they do
    3820 not all have to have the same fixed length.  This is an extension
    3821 compared with Perl 5.005, which requires all branches to match
    3822 the same length of string. Thus
    3823 
    3824 @example
    3825      (?<=dogs|cats|)
    3826 @end example
    3827 
    3828 @noindent
    3829 is permitted, but the apparently equivalent regular expression
    3830 
    3831 @example
    3832      (?<!dogs?|cats?)
    3833 @end example
    3834 
    3835 @noindent
    3836 causes an error at compile time. Branches that match different
    3837 length strings are permitted only at the top level of
    3838 a lookbehind assertion: an assertion such as
    3839 
    3840 @example
    3841      (?<=ab(c|de))
    3842 @end example
    3843 
    3844 @noindent
    3845 is not permitted, because its single top-level branch can
    3846 match two different lengths, but it is acceptable if rewritten
    3847 to use two top-level branches:
    3848 
    3849 @example
    3850      (?<=abc|abde)
    3851 @end example
    3852 
    3853 All this is required because lookbehind assertions simply
    3854 move the current position back by the alternative's fixed
    3855 width and then try to match.  If there are
    3856 insufficient characters before the current position, the
    3857 match is deemed to fail.  Lookbehinds, in conjunction with
    3858 non-backtracking subpatterns, can be particularly useful for
    3859 matching at the ends of strings; an example is given at the end
    3860 of the section on non-backtracking subpatterns.
    3861 
    3862 Several assertions (of any sort) may occur in succession.
    3863 For example,
    3864 
    3865 @example
    3866      (?<=\d@{3@})(?<!999)foo
    3867 @end example
    3868 
    3869 @noindent
    3870 matches @samp{foo} preceded by three digits that are not @samp{999}.
    3871 Notice that each of the assertions is applied independently
    3872 at the same point in the subject string. First there is a
    3873 check that the previous three characters are all digits, and
    3874 then there is a check that the same three characters are not
    3875 @samp{999}.  This pattern does not match @samp{foo} preceded by six
    3876 characters, the first three of which are digits and the last three
    3877 of which are not @samp{999}.  For example, it doesn't match
    3878 @samp{123abcfoo}. A pattern to do that is
    3879 
    3880 @example
    3881      (?<=\d@{3@}...)(?<!999)foo
    3882 @end example
    3883 
    3884 @noindent
    3885 This time the first assertion looks at the preceding six
    3886 characters, checking that the first three are digits, and
    3887 then the second assertion checks that the preceding three
    3888 characters are not @samp{999}.  Actually, assertions can be
    3889 nested in any combination, so one can write this as
    3890 
    3891 @example
    3892      (?<=\d@{3@}(?!999)...)foo
    3893 @end example
    3894 
    3895 or
    3896 
    3897 @example
    3898      (?<=\d@{3@}...(?<!999))foo
    3899 @end example
    3900 
    3901 @noindent
    3902 both of which might be considered more readable.
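
The lookbehind examples of this section can be exercised with Python's @code{re} module. Python insists that every lookbehind has a single fixed width, so only the fixed-width patterns are shown; this is a sketch under that restriction, not a statement about what @value{SSED} accepts.

```python
import re

simple = re.compile(r"(?<!foo)bar")

# Both assertions inspect the same three characters before foo.
three_digits = re.compile(r"(?<=\d{3})(?<!999)foo")

# Here the first assertion reaches back six characters instead.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

results = {
    "foobar": simple.search("foobar") is not None,
    "bazbar": simple.search("bazbar") is not None,
    "123foo": three_digits.search("123foo") is not None,
    "999foo": three_digits.search("999foo") is not None,
    "123abcfoo (3 back)": three_digits.search("123abcfoo") is not None,
    "123abcfoo (6 back)": six_back.search("123abcfoo") is not None,
    "123999foo (6 back)": six_back.search("123999foo") is not None,
}
print(results)
```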
    3903 
    3904 Assertion subpatterns are not capturing subpatterns, and may
    3905 not be repeated, because it makes no sense to assert the
    3906 same thing several times. If any kind of assertion contains
    3907 capturing subpatterns within it, these are counted for the
    3908 purposes of numbering the capturing subpatterns in the whole
    3909 pattern.  However, substring capturing is carried out only
    3910 for positive assertions, because it does not make sense for
    3911 negative assertions.
    3912 
    3913 Assertions count towards the maximum of 200 parenthesized
    3914 subpatterns.
    3915 
    3916 @node Non-backtracking subpatterns
    3917 @appendixsec Non-backtracking subpatterns
    3918 @cindex Perl-style regular expressions, non-backtracking subpatterns
    3919 
    3920 With both maximizing and minimizing repetition, failure of
    3921 what follows normally causes the repeated item to be evaluated
    3922 again to see if a different number of repeats allows the
    3923 rest of the pattern to match. Sometimes it is useful to
    3924 prevent this, either to change the nature of the match, or
    3925 to cause it to fail earlier than it otherwise might, when the
    3926 author of the pattern knows there is no point in carrying
    3927 on.
    3928 
    3929 Consider, for example, the pattern @code{\d+foo} when applied to
    3930 the subject line
    3931 
    3932 @example
    3933      123456bar
    3934 @end example
    3935 
    3936 After matching all 6 digits and then failing to match @samp{foo},
    3937 the normal action of the matcher is to try again with only 5
    3938 digits matching the @code{\d+} item, and then with 4, and so on,
    3939 before ultimately failing. Non-backtracking subpatterns
    3940 provide the means for specifying that once a portion of the
    3941 pattern has matched, it is not to be re-evaluated in this way,
    3942 so the matcher would give up immediately on failing to match
    3943 @samp{foo} the first time.  The notation is another kind of special
    3944 parenthesis, starting with @code{(?>} as in this example:
    3945 
    3946 @example
    3947      (?>\d+)foo
    3948 @end example
    3949 
    3950 This kind of parenthesis ``locks up'' the part of the pattern
    3951 it contains once it has matched, and a failure further into
    3952 the pattern is prevented from backtracking into it.
    3953 Backtracking past it to previous items, however, works as
    3954 normal.
    3955 
    3956 Non-backtracking subpatterns are not capturing subpatterns.  Simple
    3957 cases such as the above example can be thought of as a maximizing
    3958 repeat that must swallow everything it can.  So,
    3959 while both @code{\d+} and @code{\d+?} are prepared to adjust the number of
    3960 digits they match in order to make the rest of the pattern
    3961 match, @code{(?>\d+)} can only match an entire sequence of digits.
    3962 
    3963 This construction can of course contain arbitrarily complicated
    3964 subpatterns, and it can be nested.
    3965 
    3966 @cindex Perl-style regular expressions, lookbehind subpatterns
    3967 Non-backtracking subpatterns can be used in conjunction with lookbehind
    3968 assertions to specify efficient matching at the end
    3969 of the subject string. Consider a simple pattern such as
    3970 
    3971 @example
    3972      abcd$
    3973 @end example
    3974 
    3975 @noindent
    3976 when applied to a long string which does not match.  Because
    3977 matching proceeds from left to right, @command{sed} will look for
    3978 each @samp{a} in the subject and then see if what follows matches
    3979 the rest of the pattern. If the pattern is specified as
    3980 
    3981 @example
    3982      ^.*abcd$
    3983 @end example
    3984 
    3985 @noindent
    3986 the initial @code{.*} matches the entire string at first, but when
    3987 this fails (because there is no following @samp{a}), it backtracks
    3988 to match all but the last character, then all but the
    3989 last two characters, and so on. Once again the search for
    3990 @samp{a} covers the entire string, from right to left, so we are
    3991 no better off. However, if the pattern is written as
    3992 
    3993 @example
    3994      ^(?>.*)(?<=abcd)
    3995 @end example
    3996 
    3997 there can be no backtracking for the @code{.*} item; it can match
    3998 only the entire string. The subsequent lookbehind assertion
    3999 does a single test on the last four characters. If it fails,
    4000 the match fails immediately. For long strings, this approach
    4001 makes a significant difference to the processing time.
    4002 
    4003 When a pattern contains an unlimited repeat inside a subpattern
    4004 that can itself be repeated an unlimited number of
    4005 times, the use of a non-backtracking subpattern is the only way to
    4006 avoid some failing matches taking a very long time
    4007 indeed.@footnote{Actually, the matcher embedded in @value{SSED}
    4008     tries to do something for this in the simplest cases,
    4009     like @code{([^b]*b)*}.  These cases are actually quite
    4010     common: they happen for example in a regular expression
    4011     like @code{\/\*([^*]*\*)*\/} which matches C comments.}
    4012 
    4013 The pattern
    4014 
    4015 @example
    4016      (\D+|<\d+>)*[!?]
    4017 @end example
    4018 
    4021 @noindent
    4022 matches an unlimited number of substrings that either consist
    4023 of non-digits, or digits enclosed in angle brackets, followed by
    4024 an exclamation or question mark. When it matches, it runs quickly.
    4025 However, if it is applied to
    4026 
    4027 @example
    4028      aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    4029 @end example
    4030 
    4031 @noindent
    4032 it takes a long time before reporting failure.  This is
    4033 because the string can be divided between the two repeats in
    4034 a large number of ways, and all have to be tried.@footnote{The
    4035 example used @code{[!?]} rather than a single character at the end,
    4036 because both @value{SSED} and Perl have an optimization that allows
    4037 for fast failure when a single character is used. They
    4038 remember the last single character that is required for a
    4039 match, and fail early if it is not present in the string.}
    4040 
    4041 If the pattern is changed to
    4042 
    4043 @example
    4044      ((?>\D+)|<\d+>)*[!?]
    4045 @end example
    4046 
    4047 sequences of non-digits cannot be broken, and failure happens
    4048 quickly.
    4049 
    4050 @node Conditional subpatterns
    4051 @appendixsec Conditional subpatterns
    4052 @cindex Perl-style regular expressions, conditional subpatterns
    4053 
    4054 It is possible to cause the matching process to obey a subpattern
    4055 conditionally or to choose between two alternative
    4056 subpatterns, depending on the result of an assertion, or
    4057 whether a previous capturing subpattern matched or not. The
    4058 two possible forms of conditional subpattern are
    4059 
    4060 @example
    4061      (?(@var{condition})@var{yes-pattern})
    4062      (?(@var{condition})@var{yes-pattern}|@var{no-pattern})
    4063 @end example
    4064 
    4065 If the condition is satisfied, the yes-pattern is used; otherwise
    4066 the no-pattern (if present) is used. If there are more than two
    4067 alternatives in the subpattern, a compile-time error occurs.
    4068 
    4069 There are two kinds of condition. If the text between the
    4070 parentheses consists of a sequence of digits, the condition
    4071 is satisfied if the capturing subpattern of that number has
    4072 previously matched.  The number must be greater than zero.
    4073 Consider the following pattern, which contains non-significant
    4074 white space to make it more readable (assume the @code{X} modifier)
    4075 and to divide it into three parts for ease of discussion:
    4076 
    4077 @example
    4078      ( \( )?   [^()]+   (?(1) \) )
    4079 @end example
    4080 
    4081 The first part matches an optional opening parenthesis, and
    4082 if that character is present, sets it as the first captured
    4083 substring. The second part matches one or more characters
    4084 that are not parentheses. The third part is a conditional
    4085 subpattern that tests whether the first set of parentheses
    4086 matched or not.  If they did, that is, if the subject started
    4087 with an opening parenthesis, the condition is true, and so
    4088 the yes-pattern is executed and a closing parenthesis is
    4089 required. Otherwise, since no-pattern is not present, the
    4090 subpattern matches nothing.  In other words, this pattern
    4091 matches a sequence of non-parentheses, optionally enclosed
    4092 in parentheses.
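The three-part pattern above can be tried out in Python, whose @code{re} module also accepts the numbered form of the condition. This is a sketch under that assumption: @code{re.VERBOSE} stands in for the @code{X} modifier, and the name @code{optionally_parenthesized} is invented for the example.

```python
import re

optionally_parenthesized = re.compile(
    r"""
    ( \( )?      # part 1: optional opening parenthesis, captured as group 1
    [^()]+       # part 2: one or more characters that are not parentheses
    (?(1) \) )   # part 3: a closing parenthesis, required only if group 1 matched
    """,
    re.VERBOSE,
)


def whole_match(subject):
    """Return True if the entire subject matches the pattern."""
    return optionally_parenthesized.fullmatch(subject) is not None


print(whole_match("abc"))    # True
print(whole_match("(abc)"))  # True
print(whole_match("(abc"))   # False
print(whole_match("abc)"))   # False
```

Matching the whole subject makes the effect of the condition visible: an unbalanced parenthesis on either side leaves the pattern unable to account for every character.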
    4093 
    4094 @cindex Perl-style regular expressions, lookahead subpatterns
    4095 If the condition is not a sequence of digits, it must be an
    4096 assertion.  This may be a positive or negative lookahead or
    4097 lookbehind assertion. Consider this pattern, again containing
    4098 non-significant white space, and with the two alternatives
    4099 on the second line:
    4100 
    4101 @example
    4102      (?(?=...[a-z])
    4103         \d\d-[a-z]@{3@}-\d\d |
    4104         \d\d-\d\d-\d\d )
    4105 @end example
    4106 
    4107 The condition is a positive lookahead assertion that matches
    4108 a letter that is three characters away from the current point.
    4109 If a letter is found, the subject is matched against the first
    4110 alternative @samp{@var{dd}-@var{aaa}-@var{dd}} (where @var{aaa} are
    4111 letters and @var{dd} are digits); otherwise it is matched against
    4112 the second alternative, @samp{@var{dd}-@var{dd}-@var{dd}}.
    4113 
    4114 
    4115 @node Recursive patterns
    4116 @appendixsec Recursive patterns
    4117 @cindex Perl-style regular expressions, recursive patterns
    4118 @cindex Perl-style regular expressions, recursion
    4119 
    4120 Consider the problem of matching a string in parentheses,
    4121 allowing for unlimited nested parentheses. Without the use
    4122 of recursion, the best that can be done is to use a pattern
    4123 that matches up to some fixed depth of nesting. It is not
    4124 possible to handle an arbitrary nesting depth. Perl 5.6 has
    4125 provided an experimental facility that allows regular
    4126 expressions to recurse (amongst other things). It does this
    4127 by interpolating Perl code in the expression at run time,
    4128 and the code can refer to the expression itself. A Perl
    4129 pattern to solve the parentheses problem can be created like
    4130 this:
    4131 
    4132 @example
    4133      $re = qr@{\( (?: (?>[^()]+) | (?p@{$re@}) )* \)@}x;
    4134 @end example
    4135 
    4136 The @code{(?p@{...@})} item interpolates Perl code at run time,
    4137 and in this case refers recursively to the pattern in which it
    4138 appears. Obviously, @command{sed} cannot support the interpolation of
    4139 Perl code.  Instead, the special item @code{(?R)} is provided for
    4140 the specific case of recursion. This pattern solves the
    4141 parentheses problem (assume the @code{X} modifier option is used
    4142 so that white space is ignored):
    4143 
    4144 @example
    4145      \( ( (?>[^()]+) | (?R) )* \)
    4146 @end example
    4147 
    4148 First it matches an opening parenthesis. Then it matches any
    4149 number of substrings which can either be a sequence of
    4150 non-parentheses, or a recursive match of the pattern itself
    4151 (i.e. a correctly parenthesized substring). Finally there is
    4152 a closing parenthesis.
    4153 
    4154 This particular example pattern contains nested unlimited
    4155 repeats, and so the use of a non-backtracking subpattern for
    4156 matching strings of non-parentheses is important when applying
    4157 the pattern to strings that do not match. For example, when
    4158 it is applied to
    4159 
    4160 @example
    4161      (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
    4162 @end example
    4163 
    4164 it yields a ``no match'' response quickly. However, if a
    4165 non-backtracking subpattern is not used, the match runs
    4166 for a very long time indeed because there are so many different
    4167 ways the @code{+} and @code{*} repeats can carve up the subject,
    4168 and all have to be tested before failure can be reported.
    4169 
    4170 The values set for any capturing subpatterns are those from
    4171 the outermost level of the recursion at which the subpattern
    4172 value is set. If the pattern above is matched against
    4173 
    4174 @example
    4175      (ab(cd)ef)
    4176 @end example
    4177 
    4178 @noindent
    4179 the value for the capturing parentheses is @samp{ef}, which is
    4180 the last value taken on at the top level.
    4181 
    4182 @node Comments
    4183 @appendixsec Comments
    4184 @cindex Perl-style regular expressions, comments
    4185 
    4186 The sequence @code{(?#} marks the start of a comment which
    4187 continues up to the next closing parenthesis. Nested parentheses
    4188 are not permitted. The characters that make up a comment
    4189 play no part in the pattern matching at all.
    4190 
    4191 @cindex Perl-style regular expressions, extended
    4192 If the @code{X} modifier option is used, an unescaped @code{#} character
    4193 outside a character class introduces a comment that continues
    4194 up to the next newline character in the pattern.
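Both comment styles can be seen side by side in the following Python sketch. It assumes that Python's @code{re} module treats @code{(?#...)} and, with @code{re.VERBOSE} standing in for the @code{X} modifier, @code{#} comments in the way described above; the patterns and the names @code{inline} and @code{extended} are illustrative only.

```python
import re

# Inline comment: everything from "(?#" to the next ")" is ignored.
inline = re.compile(r"\d+(?# one or more digits)px")

# Extended syntax: an unescaped "#" outside a character class starts a
# comment that runs to the end of the line.
extended = re.compile(
    r"""
    \d+   # one or more digits
    px    # literal unit suffix
    """,
    re.VERBOSE,
)

print(inline.fullmatch("12px") is not None)    # True
print(extended.fullmatch("12px") is not None)  # True
print(inline.fullmatch("12 px") is not None)   # False
```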
    4195 @end ifset
     5857
     5858
     5859@page
     5860@node GNU Free Documentation License
     5861@appendix GNU Free Documentation License
     5862
     5863@include fdl.texi