Changeset 3613 for trunk/src/sed/doc/sed.texi
- Timestamp: Sep 19, 2024, 2:34:43 AM
- Location: trunk/src/sed
- Files: 2 edited
trunk/src/sed
- Property svn:mergeinfo set to /vendor/sed/current merged eligible
- Property svn:mergeinfo set to
trunk/src/sed/doc/sed.texi
r599 r3613 1 1 \input texinfo @c -*-texinfo-*- 2 @c Do not edit this file!! It is automatically generated from sed-in.texi.3 2 @c 4 3 @c -- Stuff that needs adding: ---------------------------------------------- 5 @c ( document the `;' command-separator)4 @c (nothing!) 6 5 @c -------------------------------------------------------------------------- 7 6 @c Check for consistency: regexps in @code, text that they match in @samp. 8 @c 7 @c 9 8 @c Tips: 10 9 @c @command for command … … 36 35 @value{SSED}, a stream editor. 37 36 38 Copyright @copyright{} 1998, 1999, 2001, 2002, 2003, 2004 Free 39 Software Foundation, Inc. 40 41 This document is released under the terms of the @acronym{GNU} Free 42 Documentation License as published by the Free Software Foundation; 43 either version 1.1, or (at your option) any later version. 44 45 You should have received a copy of the @acronym{GNU} Free Documentation 46 License along with @value{SSED}; see the file @file{COPYING.DOC}. 47 If not, write to the Free Software Foundation, 59 Temple Place - Suite 48 330, Boston, MA 02110-1301, USA. 49 50 There are no Cover Texts and no Invariant Sections; this text, along 51 with its equivalent in the printed manual, constitutes the Title Page. 37 Copyright @copyright{} 1998--2022 Free Software Foundation, Inc. 38 39 @quotation 40 Permission is granted to copy, distribute and/or modify this document 41 under the terms of the GNU Free Documentation License, Version 1.3 42 or any later version published by the Free Software Foundation; 43 with no Invariant Sections, no Front-Cover Texts, and no 44 Back-Cover Texts. A copy of the license is included in the 45 section entitled ``GNU Free Documentation License''. 
46 @end quotation 52 47 @end copying 53 48 … … 55 50 56 51 @titlepage 57 @title @ command{sed}, a stream editor52 @title @value{SSED}, a stream editor 58 53 @subtitle version @value{VERSION}, @value{UPDATED} 59 @author by Ken Pizzini, Paolo Bonzini 54 @author by Ken Pizzini, Paolo Bonzini, Jim Meyering, Assaf Gordon 60 55 61 56 @page 62 57 @vskip 0pt plus 1filll 63 Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc.64 65 58 @insertcopying 66 67 Published by the Free Software Foundation, @*68 51 Franklin Street, Fifth Floor @*69 Boston, MA 02110-1301, USA70 59 @end titlepage 71 60 72 61 @contents 62 63 @ifnottex 73 64 @node Top 74 @top 75 76 @ifnottex 65 @top @value{SSED} 66 77 67 @insertcopying 78 68 @end ifnottex … … 81 71 * Introduction:: Introduction 82 72 * Invoking sed:: Invocation 83 * sed Programs:: @command{sed} programs 73 * sed scripts:: @command{sed} scripts 74 * sed addresses:: Addresses: selecting lines 75 * sed regular expressions:: Regular expressions: selecting text 76 * advanced sed:: Advanced @command{sed}: cycles and buffers 84 77 * Examples:: Some sample scripts 85 78 * Limitations:: Limitations and (non-)limitations of @value{SSED} 86 79 * Other Resources:: Other resources for learning about @command{sed} 87 80 * Reporting Bugs:: Reporting bugs 88 89 * Extended regexps:: @command{egrep}-style regular expressions 90 @ifset PERL 91 * Perl regexps:: Perl-style regular expressions 92 @end ifset 93 81 * GNU Free Documentation License:: Copying and sharing this manual 94 82 * Concept Index:: A menu with all the topics in this manual. 95 83 * Command and Option Index:: A menu with all @command{sed} commands and 96 84 command-line options. 
97
98 @detailmenu
99 --- The detailed node listing ---
100
101 sed Programs:
102 * Execution Cycle:: How @command{sed} works
103 * Addresses:: Selecting lines with @command{sed}
104 * Regular Expressions:: Overview of regular expression syntax
105 * Common Commands:: Often used commands
106 * The "s" Command:: @command{sed}'s Swiss Army Knife
107 * Other Commands:: Less frequently used commands
108 * Programming Commands:: Commands for @command{sed} gurus
109 * Extended Commands:: Commands specific of @value{SSED}
110 * Escapes:: Specifying special characters
111
112 Examples:
113 * Centering lines::
114 * Increment a number::
115 * Rename files to lower case::
116 * Print bash environment::
117 * Reverse chars of lines::
118 * tac:: Reverse lines of files
119 * cat -n:: Numbering lines
120 * cat -b:: Numbering non-blank lines
121 * wc -c:: Counting chars
122 * wc -w:: Counting words
123 * wc -l:: Counting lines
124 * head:: Printing the first lines
125 * tail:: Printing the last lines
126 * uniq:: Make duplicate lines unique
127 * uniq -d:: Print duplicated lines of input
128 * uniq -u:: Remove all duplicated lines
129 * cat -s:: Squeezing blank lines
130
131 @ifset PERL
132 Perl regexps:: Perl-style regular expressions
133 * Backslash:: Introduces special sequences
134 * Circumflex/dollar sign/period:: Behave specially with regard to new lines
135 * Square brackets:: Are a bit different in strange cases
136 * Options setting:: Toggle modifiers in the middle of a regexp
137 * Non-capturing subpatterns:: Are not counted when backreferencing
138 * Repetition:: Allows for non-greedy matching
139 * Backreferences:: Allows for more than 10 back references
140 * Assertions:: Allows for complex look ahead matches
141 * Non-backtracking subpatterns:: Often gives more performance
142 * Conditional subpatterns:: Allows if/then/else branches
143 * Recursive patterns:: For example to match parentheses
144 * Comments:: Because things can get complex...
145 @end ifset
146
147 @end detailmenu
148 85 @end menu
149 86
… …
167
104 168 105 @node Invoking sed 169 @chapter Invocation 170 106 @chapter Running sed 107 108 This chapter covers how to run @command{sed}. Details of @command{sed} 109 scripts and individual @command{sed} commands are discussed in the 110 next chapter. 111 112 @menu 113 * Overview:: 114 * Command-Line Options:: 115 * Exit status:: 116 @end menu 117 118 119 @node Overview 120 @section Overview 171 121 Normally @command{sed} is invoked like this: 172 122 … … 175 125 @end example 176 126 127 For example, to change every @samp{hello} to @samp{world} 128 in the file @file{input.txt}: 129 130 @example 131 sed 's/hello/world/g' input.txt > output.txt 132 @end example 133 134 Without the @samp{g} (global) modifier, @command{sed} affects 135 only the first instance per line. 136 137 @cindex stdin 138 @cindex standard input 139 If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-}, 140 @command{sed} filters the contents of the standard input. The following 141 commands are equivalent: 142 143 @example 144 sed 's/hello/world/g' input.txt > output.txt 145 sed 's/hello/world/g' < input.txt > output.txt 146 cat input.txt | sed 's/hello/world/g' - > output.txt 147 @end example 148 149 @cindex stdout 150 @cindex output 151 @cindex standard output 152 @cindex -i, example 153 @command{sed} writes output to standard output. Use @option{-i} to edit 154 files in-place instead of printing to standard output. 155 See also the @code{W} and @code{s///w} commands for writing output to 156 other files. The following command modifies @file{file.txt} and 157 does not produce any output: 158 159 @example 160 sed -i 's/hello/world/' file.txt 161 @end example 162 163 @cindex -n, example 164 @cindex p, example 165 @cindex suppressing output 166 @cindex output, suppressing 167 By default @command{sed} prints all processed input (except input 168 that has been modified/deleted by commands such as @command{d}). 
169 Use @option{-n} to suppress output, and the @code{p} command 170 to print specific lines. The following command prints only line 45 171 of the input file: 172 173 @example 174 sed -n '45p' file.txt 175 @end example 176 177 178 179 @cindex multiple files 180 @cindex -s, example 181 @command{sed} treats multiple input files as one long stream. 182 The following example prints the first line of the first file 183 (@file{one.txt}) and the last line of the last file (@file{three.txt}). 184 Use @option{-s} to reverse this behavior. 185 186 @example 187 sed -n '1p ; $p' one.txt two.txt three.txt 188 @end example 189 190 191 @cindex -e, example 192 @cindex --expression, example 193 @cindex -f, example 194 @cindex --file, example 195 @cindex script parameter 196 @cindex parameters, script 197 Without @option{-e} or @option{-f} options, @command{sed} uses 198 the first non-option parameter as the @var{script}, and the following 199 non-option parameters as input files. 200 If @option{-e} or @option{-f} options are used to specify a @var{script}, 201 all non-option parameters are taken as input files. 202 Options @option{-e} and @option{-f} can be combined, and can appear 203 multiple times (in which case the final effective @var{script} will be 204 concatenation of all the individual @var{script}s). 205 206 The following examples are equivalent: 207 208 @example 209 sed 's/hello/world/' input.txt > output.txt 210 211 sed -e 's/hello/world/' input.txt > output.txt 212 sed --expression='s/hello/world/' input.txt > output.txt 213 214 echo 's/hello/world/' > myscript.sed 215 sed -f myscript.sed input.txt > output.txt 216 sed --file=myscript.sed input.txt > output.txt 217 @end example 218 219 220 @node Command-Line Options 221 @section Command-Line Options 222 177 223 The full format for invoking @command{sed} is: 178 224 … … 180 226 sed OPTIONS... [SCRIPT] [INPUTFILE...] 
181 227 @end example 182 183 If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-},184 @command{sed} filters the contents of the standard input. The @var{script}185 is actually the first non-option parameter, which @command{sed} specially186 considers a script and not an input file if (and only if) none of the187 other @var{options} specifies a script to be executed, that is if neither188 of the @option{-e} and @option{-f} options is specified.189 228 190 229 @command{sed} may be invoked with the following command-line options: … … 212 251 @cindex Disabling autoprint, from command line 213 252 By default, @command{sed} prints out the pattern space 214 at the end of each cycle through the script. 253 at the end of each cycle through the script (@pxref{Execution Cycle, , 254 How @code{sed} works}). 215 255 These options disable this automatic printing, 216 256 and @command{sed} only produces output when explicitly told to 217 257 via the @code{p} command. 258 259 @item --debug 260 @opindex --debug 261 @cindex @value{SSEDEXT}, debug 262 Print the input sed program in canonical form, 263 and annotate program execution. 264 @codequotebacktick on 265 @codequoteundirected on 266 @example 267 $ echo 1 | sed '\%1%s21232' 268 3 269 270 $ echo 1 | sed --debug '\%1%s21232' 271 SED PROGRAM: 272 /1/ s/1/3/ 273 INPUT: 'STDIN' line 1 274 PATTERN: 1 275 COMMAND: /1/ s/1/3/ 276 PATTERN: 3 277 END-OF-CYCLE: 278 3 279 @end example 280 @codequotebacktick off 281 @codequoteundirected off 282 283 284 @item -e @var{script} 285 @itemx --expression=@var{script} 286 @opindex -e 287 @opindex --expression 288 @cindex Script, from command line 289 Add the commands in @var{script} to the set of commands to be 290 run while processing the input. 
291 292 @item -f @var{script-file} 293 @itemx --file=@var{script-file} 294 @opindex -f 295 @opindex --file 296 @cindex Script, from a file 297 Add the commands contained in the file @var{script-file} 298 to the set of commands to be run while processing the input. 218 299 219 300 @item -i[@var{SUFFIX}] … … 240 321 before renaming the temporary file, thereby making a backup 241 322 copy@footnote{Note that @value{SSED} creates the backup 242 323 file whether or not any output is actually changed.}). 243 324 244 325 @cindex In-place editing, Perl-style backup file names … … 255 336 overwritten without making a backup. 256 337 338 Because @option{-i} takes an optional argument, it should 339 not be followed by other short options: 340 @table @code 341 @item sed -Ei '...' FILE 342 Same as @option{-E -i} with no backup suffix - @file{FILE} will be 343 edited in-place without creating a backup. 344 345 @item sed -iE '...' FILE 346 This is equivalent to @option{--in-place=E}, creating @file{FILEE} as backup 347 of @file{FILE} 348 @end table 349 350 Be cautious of using @option{-n} with @option{-i}: the former disables 351 automatic printing of lines and the latter changes the file in-place 352 without a backup. Used carelessly (and without an explicit @code{p} command), 353 the output file will be empty: 354 @codequotebacktick on 355 @codequoteundirected on 356 @example 357 # WRONG USAGE: 'FILE' will be truncated. 358 sed -ni 's/foo/bar/' FILE 359 @end example 360 @codequotebacktick off 361 @codequoteundirected off 362 257 363 @item -l @var{N} 258 364 @itemx --line-length=@var{N} … … 265 371 266 372 @item --posix 373 @opindex --posix 267 374 @cindex @value{SSEDEXT}, disabling 268 @value{SSED} includes several extensions to @acronym{POSIX}375 @value{SSED} includes several extensions to POSIX 269 376 sed. 
In order to simplify writing portable scripts, this 270 377 option disables all the extensions that this manual documents, … … 272 379 @cindex @code{POSIXLY_CORRECT} behavior, enabling 273 380 Most of the extensions accept @command{sed} programs that 274 are outside the syntax mandated by @acronym{POSIX}, but some381 are outside the syntax mandated by POSIX, but some 275 382 of them (such as the behavior of the @command{N} command 276 described in @ pxref{Reporting Bugs}) actually violate the383 described in @ref{Reporting Bugs}) actually violate the 277 384 standard. If you want to disable only the latter kind of 278 385 extension, you can set the @code{POSIXLY_CORRECT} variable 279 386 to a non-empty value. 280 387 281 @item -r 388 @item -b 389 @itemx --binary 390 @opindex -b 391 @opindex --binary 392 This option is available on every platform, but is only effective where the 393 operating system makes a distinction between text files and binary files. 394 When such a distinction is made---as is the case for MS-DOS, Windows, 395 Cygwin---text files are composed of lines separated by a carriage return 396 @emph{and} a line feed character, and @command{sed} does not see the 397 ending CR. When this option is specified, @command{sed} will open 398 input files in binary mode, thus not requesting this special processing 399 and considering lines to end at a line feed. 400 401 @item --follow-symlinks 402 @opindex --follow-symlinks 403 This option is available only on platforms that support 404 symbolic links and has an effect only if option @option{-i} 405 is specified. In this case, if the file that is specified 406 on the command line is a symbolic link, @command{sed} will 407 follow the link and edit the ultimate destination of the 408 link. The default behavior is to break the symbolic link, 409 so that the link destination will not be modified. 
410 411 @item -E 412 @itemx -r 282 413 @itemx --regexp-extended 414 @opindex -E 283 415 @opindex -r 284 416 @opindex --regexp-extended 285 417 @cindex Extended regular expressions, choosing 286 @cindex @acronym{GNU}extensions, extended regular expressions418 @cindex GNU extensions, extended regular expressions 287 419 Use extended regular expressions rather than basic 288 420 regular expressions. Extended regexps are those that 289 421 @command{egrep} accepts; they can be clearer because they 290 usually have less backslashes, but are a @acronym{GNU} extension 291 and hence scripts that use them are not portable. 292 @xref{Extended regexps, , Extended regular expressions}. 293 294 @ifset PERL 295 @item -R 296 @itemx --regexp-perl 297 @opindex -R 298 @opindex --regexp-perl 299 @cindex Perl-style regular expressions, choosing 300 @cindex @value{SSEDEXT}, Perl-style regular expressions 301 Use Perl-style regular expressions rather than basic 302 regular expressions. Perl-style regexps are extremely 303 powerful but are a @value{SSED} extension and hence scripts that 304 use it are not portable. @xref{Perl regexps, , 305 Perl-style regular expressions}. 306 @end ifset 422 usually have fewer backslashes. 423 Historically this was a GNU extension, 424 but the @option{-E} 425 extension has since been added to the POSIX standard 426 (http://austingroupbugs.net/view.php?id=528), 427 so use @option{-E} for portability. 428 GNU sed has accepted @option{-E} as an undocumented option for years, 429 and *BSD seds have accepted @option{-E} for years as well, 430 but scripts that use @option{-E} might not port to other older systems. 431 @xref{ERE syntax, , Extended regular expressions}. 432 307 433 308 434 @item -s 309 435 @itemx --separate 436 @opindex -s 437 @opindex --separate 310 438 @cindex Working on separate files 311 439 By default, @command{sed} will consider the files specified on the … … 318 446 start of each file. 
319 447 448 @item --sandbox 449 @opindex --sandbox 450 @cindex Sandbox mode 451 In sandbox mode, @code{e/w/r} commands are rejected - programs containing 452 them will be aborted without being run. Sandbox mode ensures @command{sed} 453 operates only on the input files designated on the command line, and 454 cannot run external programs. 455 456 320 457 @item -u 321 458 @itemx --unbuffered … … 328 465 output as soon as possible.) 329 466 330 @item -e @var{script} 331 @itemx --expression=@var{script} 332 @opindex -e 333 @opindex --expression 334 @cindex Script, from command line 335 Add the commands in @var{script} to the set of commands to be 336 run while processing the input. 337 338 @item -f @var{script-file} 339 @itemx --file=@var{script-file} 340 @opindex -f 341 @opindex --file 342 @cindex Script, from a file 343 Add the commands contained in the file @var{script-file} 344 to the set of commands to be run while processing the input. 345 467 @item -z 468 @itemx --null-data 469 @itemx --zero-terminated 470 @opindex -z 471 @opindex --null-data 472 @opindex --zero-terminated 473 Treat the input as a set of lines, each terminated by a zero byte 474 (the ASCII @samp{NUL} character) instead of a newline. This option can 475 be used with commands like @samp{sort -z} and @samp{find -print0} 476 to process arbitrary file names. 346 477 @end table 347 478 … … 359 490 The standard input will be processed if no file names are specified. 360 491 361 362 @node sed Programs 363 @chapter @command{sed} Programs 364 365 @cindex @command{sed} program structure 492 @node Exit status 493 @section Exit status 494 @cindex exit status 495 An exit status of zero indicates success, and a nonzero value 496 indicates failure. @value{SSED} returns the following exit status 497 error values: 498 499 @table @asis 500 @item 0 501 Successful completion. 
502 503 @item 1 504 Invalid command, invalid syntax, invalid regular expression or a 505 @value{SSED} extension command used with @option{--posix}. 506 507 @item 2 508 One or more of the input file specified on the command line could not be 509 opened (e.g. if a file is not found, or read permission is denied). 510 Processing continued with other files. 511 512 @item 4 513 An I/O error, or a serious processing error during runtime, 514 @value{SSED} aborted immediately. 515 @end table 516 517 @cindex Q, example 518 @cindex exit status, example 519 Additionally, the commands @code{q} and @code{Q} can be used to terminate 520 @command{sed} with a custom exit code value (this is a @value{SSED} extension): 521 522 @example 523 $ echo | sed 'Q42' ; echo $? 524 42 525 @end example 526 527 528 @node sed scripts 529 @chapter @command{sed} scripts 530 531 532 @menu 533 * sed script overview:: @command{sed} script overview 534 * sed commands list:: @command{sed} commands summary 535 * The "s" Command:: @command{sed}'s Swiss Army Knife 536 * Common Commands:: Often used commands 537 * Other Commands:: Less frequently used commands 538 * Programming Commands:: Commands for @command{sed} gurus 539 * Extended Commands:: Commands specific of @value{SSED} 540 * Multiple commands syntax:: Extension for easier scripting 541 @end menu 542 543 @node sed script overview 544 @section @command{sed} script overview 545 546 @cindex @command{sed} script structure 366 547 @cindex Script structure 548 367 549 A @command{sed} program consists of one or more @command{sed} commands, 368 550 passed in by one or more of the … … 371 553 options are used. 372 554 This document will refer to ``the'' @command{sed} script; 373 this is understood to mean the in-order c atenation555 this is understood to mean the in-order concatenation 374 556 of all of the @var{script}s and @var{script-file}s passed in. 
375 376 Each @code{sed} command consists of an optional address or 377 address range, followed by a one-character command name 378 and any additional command-specific code. 379 380 @menu 381 * Execution Cycle:: How @command{sed} works 382 * Addresses:: Selecting lines with @command{sed} 383 * Regular Expressions:: Overview of regular expression syntax 384 * Common Commands:: Often used commands 385 * The "s" Command:: @command{sed}'s Swiss Army Knife 386 * Other Commands:: Less frequently used commands 387 * Programming Commands:: Commands for @command{sed} gurus 388 * Extended Commands:: Commands specific of @value{SSED} 389 * Escapes:: Specifying special characters 390 @end menu 391 392 393 @node Execution Cycle 394 @section How @command{sed} Works 395 396 @cindex Buffer spaces, pattern and hold 397 @cindex Spaces, pattern and hold 398 @cindex Pattern space, definition 399 @cindex Hold space, definition 400 @command{sed} maintains two data buffers: the active @emph{pattern} space, 401 and the auxiliary @emph{hold} space. Both are initially empty. 402 403 @command{sed} operates by performing the following cycle on each 404 lines of input: first, @command{sed} reads one line from the input 405 stream, removes any trailing newline, and places it in the pattern space. 406 Then commands are executed; each command can have an address associated 407 to it: addresses are a kind of condition code, and a command is only 408 executed if the condition is verified before the command is to be 409 executed. 
410 411 When the end of the script is reached, unless the @option{-n} option 412 is in use, the contents of pattern space are printed out to the output 413 stream, adding back the trailing newline if it was removed.@footnote{Actually, 414 if @command{sed} prints a line without the terminating newline, it will 415 nevertheless print the missing newline as soon as more text is sent to 416 the same output stream, which gives the ``least expected surprise'' 417 even though it does not make commands like @samp{sed -n p} exactly 418 identical to @command{cat}.} Then the next cycle starts for the next 419 input line. 420 421 Unless special commands (like @samp{D}) are used, the pattern space is 422 deleted between two cycles. The hold space, on the other hand, keeps 423 its data between cycles (see commands @samp{h}, @samp{H}, @samp{x}, 424 @samp{g}, @samp{G} to move data between both buffers). 425 426 427 @node Addresses 428 @section Selecting lines with @command{sed} 429 @cindex Addresses, in @command{sed} scripts 430 @cindex Line selection 431 @cindex Selecting lines to process 432 433 Addresses in a @command{sed} script can be in any of the following forms: 557 @xref{Overview}. 558 559 560 @cindex @command{sed} commands syntax 561 @cindex syntax, @command{sed} commands 562 @cindex addresses, syntax 563 @cindex syntax, addresses 564 @command{sed} commands follow this syntax: 565 566 @example 567 [addr]@var{X}[options] 568 @end example 569 570 @var{X} is a single-letter @command{sed} command. 571 @c TODO: add @pxref{commands} when there is a command-list section. 572 @code{[addr]} is an optional line address. If @code{[addr]} is specified, 573 the command @var{X} will be executed only on the matched lines. 574 @code{[addr]} can be a single line number, a regular expression, 575 or a range of lines (@pxref{sed addresses}). 576 Additional @code{[options]} are used for some @command{sed} commands. 
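As a rough illustration of the [addr]X[options] form described above, the following sketch runs the p command with each of the three kinds of address. Generating the input with seq is an assumption made for this example and does not come from the manual text.

```shell
# Sketch only: seq(1) is assumed to be available to produce the lines 1..10.

# Single line number as the address: print only line 3.
seq 10 | sed -n '3p'

# Regular expression as the address: print the lines that start with "1".
seq 10 | sed -n '/^1/p'

# Range of lines as the address: print lines 4 through 6.
seq 10 | sed -n '4,6p'
```

In each command, -n suppresses automatic printing, so only the lines selected by the address reach the output.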
577 578 @cindex @command{d}, example 579 @cindex address range, example 580 @cindex example, address range 581 The following example deletes lines 30 to 35 in the input. 582 @code{30,35} is an address range. @command{d} is the delete command: 583 584 @example 585 sed '30,35d' input.txt > output.txt 586 @end example 587 588 @cindex @command{q}, example 589 @cindex regular expression, example 590 @cindex example, regular expression 591 The following example prints all input until a line 592 starting with the string @samp{foo} is found. If such line is found, 593 @command{sed} will terminate with exit status 42. 594 If such line was not found (and no other error occurred), @command{sed} 595 will exit with status 0. 596 @code{/^foo/} is a regular-expression address. 597 @command{q} is the quit command. @code{42} is the command option. 598 599 @example 600 sed '/^foo/q42' input.txt > output.txt 601 @end example 602 603 604 @cindex multiple @command{sed} commands 605 @cindex @command{sed} commands, multiple 606 @cindex newline, command separator 607 @cindex semicolons, command separator 608 @cindex ;, command separator 609 @cindex -e, example 610 @cindex -f, example 611 Commands within a @var{script} or @var{script-file} can be 612 separated by semicolons (@code{;}) or newlines (ASCII 10). 613 Multiple scripts can be specified with @option{-e} or @option{-f} 614 options. 615 616 The following examples are all equivalent. 
They perform two @command{sed} 617 operations: deleting any lines matching the regular expression @code{/^foo/}, 618 and replacing all occurrences of the string @samp{hello} with @samp{world}: 619 620 @example 621 sed '/^foo/d ; s/hello/world/g' input.txt > output.txt 622 623 sed -e '/^foo/d' -e 's/hello/world/g' input.txt > output.txt 624 625 echo '/^foo/d' > script.sed 626 echo 's/hello/world/g' >> script.sed 627 sed -f script.sed input.txt > output.txt 628 629 echo 's/hello/world/g' > script2.sed 630 sed -e '/^foo/d' -f script2.sed input.txt > output.txt 631 @end example 632 633 634 @cindex @command{a}, and semicolons 635 @cindex @command{c}, and semicolons 636 @cindex @command{i}, and semicolons 637 Commands @command{a}, @command{c}, @command{i}, due to their syntax, 638 cannot be followed by semicolons working as command separators and 639 thus should be terminated 640 with newlines or be placed at the end of a @var{script} or @var{script-file}. 641 Commands can also be preceded with optional non-significant 642 whitespace characters. 643 @xref{Multiple commands syntax}. 644 645 646 647 @node sed commands list 648 @section @command{sed} commands summary 649 650 The following commands are supported in @value{SSED}. 651 Some are standard POSIX commands, while other are @value{SSEDEXT}. 652 Details and examples for each command are in the following sections. 653 (Mnemonics) are shown in parentheses. 654 434 655 @table @code 435 @item @var{number} 436 @cindex Address, numeric 437 @cindex Line, selecting by number 438 Specifying a line number will match only that line in the input. 439 (Note that @command{sed} counts lines continuously across all input files 440 unless @option{-i} or @option{-s} options are specified.) 441 442 @item @var{first}~@var{step} 443 @cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses 444 This @acronym{GNU} extension matches every @var{step}th line 445 starting with line @var{first}. 
446 In particular, lines will be selected when there exists 447 a non-negative @var{n} such that the current line-number equals 448 @var{first} + (@var{n} * @var{step}). 449 Thus, to select the odd-numbered lines, 450 one would use @code{1~2}; 451 to pick every third line starting with the second, @samp{2~3} would be used; 452 to pick every fifth line starting with the tenth, use @samp{10~5}; 453 and @samp{50~0} is just an obscure way of saying @code{50}. 454 455 @item $ 456 @cindex Address, last line 457 @cindex Last line, selecting 458 @cindex Line, selecting last 459 This address matches the last line of the last file of input, or 460 the last line of each file when the @option{-i} or @option{-s} options 461 are specified. 462 463 @item /@var{regexp}/ 464 @cindex Address, as a regular expression 465 @cindex Line, selecting by regular expression match 466 This will select any line which matches the regular expression @var{regexp}. 467 If @var{regexp} itself includes any @code{/} characters, 468 each must be escaped by a backslash (@code{\}). 469 470 @cindex empty regular expression 471 @cindex @value{SSEDEXT}, modifiers and the empty regular expression 472 The empty regular expression @samp{//} repeats the last regular 473 expression match (the same holds if the empty regular expression is 474 passed to the @code{s} command). Note that modifiers to regular expressions 475 are evaluated when the regular expression is compiled, thus it is invalid to 476 specify them together with the empty regular expression. 477 478 @item \%@var{regexp}% 479 (The @code{%} may be replaced by any other single character.) 480 481 @cindex Slash character, in regular expressions 482 This also matches the regular expression @var{regexp}, 483 but allows one to use a different delimiter than @code{/}. 484 This is particularly useful if the @var{regexp} itself contains 485 a lot of slashes, since it avoids the tedious escaping of every @code{/}. 
486 If @var{regexp} itself includes any delimiter characters, 487 each must be escaped by a backslash (@code{\}). 488 489 @item /@var{regexp}/I 490 @itemx \%@var{regexp}%I 491 @cindex @acronym{GNU} extensions, @code{I} modifier 492 @ifset PERL 493 @cindex Perl-style regular expressions, case-insensitive 494 @end ifset 495 The @code{I} modifier to regular-expression matching is a @acronym{GNU} 496 extension which causes the @var{regexp} to be matched in 497 a case-insensitive manner. 498 499 @item /@var{regexp}/M 500 @itemx \%@var{regexp}%M 501 @ifset PERL 502 @cindex @value{SSEDEXT}, @code{M} modifier 503 @end ifset 504 @cindex Perl-style regular expressions, multiline 505 The @code{M} modifier to regular-expression matching is a @value{SSED} 506 extension which causes @code{^} and @code{$} to match respectively 507 (in addition to the normal behavior) the empty string after a newline, 508 and the empty string before a newline. There are special character 509 sequences 510 @ifset PERL 511 (@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'} 512 in basic or extended regular expression modes) 513 @end ifset 514 @ifclear PERL 515 (@code{\`} and @code{\'}) 516 @end ifclear 517 which always match the beginning or the end of the buffer. 518 @code{M} stands for @cite{multi-line}. 519 520 @ifset PERL 521 @item /@var{regexp}/S 522 @itemx \%@var{regexp}%S 523 @cindex @value{SSEDEXT}, @code{S} modifier 524 @cindex Perl-style regular expressions, single line 525 The @code{S} modifier to regular-expression matching is only valid 526 in Perl mode and specifies that the dot character (@code{.}) will 527 match the newline character too. @code{S} stands for @cite{single-line}. 528 @end ifset 529 530 @ifset PERL 531 @item /@var{regexp}/X 532 @itemx \%@var{regexp}%X 533 @cindex @value{SSEDEXT}, @code{X} modifier 534 @cindex Perl-style regular expressions, extended 535 The @code{X} modifier to regular-expression matching is also 536 valid in Perl mode only. 
If it is used, whitespace in the 537 pattern (other than in a character class) and 538 characters between a @kbd{#} outside a character class and the 539 next newline character are ignored. An escaping backslash 540 can be used to include a whitespace or @kbd{#} character as part 541 of the pattern. 542 @end ifset 543 @end table 544 545 If no addresses are given, then all lines are matched; 546 if one address is given, then only lines matching that 547 address are matched. 548 549 @cindex Range of lines 550 @cindex Several lines, selecting 551 An address range can be specified by specifying two addresses 552 separated by a comma (@code{,}). An address range matches lines 553 starting from where the first address matches, and continues 554 until the second address matches (inclusively). 555 556 If the second address is a @var{regexp}, then checking for the 557 ending match will start with the line @emph{following} the 558 line which matched the first address: a range will always 559 span at least two lines (except of course if the input stream 560 ends). 561 562 If the second address is a @var{number} less than (or equal to) 563 the line matching the first address, then only the one line is 564 matched. 
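The two range rules in the paragraphs above can be checked with a short sketch. The sample inputs, and the use of printf and seq to produce them, are assumptions chosen for illustration.

```shell
# Sketch only: the inputs below are made up for illustration.

# Second address is a regexp: the search for the end of the range starts on
# the line after the one that opened it, so the range covers lines 1 to 3
# even though line 1 itself matches /a/.
printf 'a\nb\na\nb\n' | sed -n '/a/,/a/p'

# Second address is a number not greater than the first: only line 4 is selected.
seq 5 | sed -n '4,2p'
```

The first command is expected to print the first three input lines, and the second to print only line 4.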
565 566 @cindex Special addressing forms 567 @cindex Range with start address of zero 568 @cindex Zero, as range start address 569 @cindex @var{addr1},+N 570 @cindex @var{addr1},~N 571 @cindex @acronym{GNU} extensions, special two-address forms 572 @cindex @acronym{GNU} extensions, @code{0} address 573 @cindex @acronym{GNU} extensions, 0,@var{addr2} addressing 574 @cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing 575 @cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing 576 @value{SSED} also supports some special two-address forms; all these 577 are @acronym{GNU} extensions: 578 @table @code 579 @item 0,/@var{regexp}/ 580 A line number of @code{0} can be used in an address specification like 581 @code{0,/@var{regexp}/} so that @command{sed} will try to match 582 @var{regexp} in the first input line too. In other words, 583 @code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/}, 584 except that if @var{addr2} matches the very first line of input the 585 @code{0,/@var{regexp}/} form will consider it to end the range, whereas 586 the @code{1,/@var{regexp}/} form will match the beginning of its range and 587 hence make the range span up to the @emph{second} occurrence of the 588 regular expression. 589 590 Note that this is the only place where the @code{0} address makes 591 sense; there is no 0-th line and commands which are given the @code{0} 592 address in any other way will give an error. 593 594 @item @var{addr1},+@var{N} 595 Matches @var{addr1} and the @var{N} lines following @var{addr1}. 596 597 @item @var{addr1},~@var{N} 598 Matches @var{addr1} and the lines following @var{addr1} 599 until the next line whose input line number is a multiple of @var{N}. 600 @end table 601 602 @cindex Excluding lines 603 @cindex Selecting non-matching lines 604 Appending the @code{!} character to the end of an address 605 specification negates the sense of the match. 
606 That is, if the @code{!} character follows an address range, 607 then only lines which do @emph{not} match the address range 608 will be selected. 609 This also works for singleton addresses, 610 and, perhaps perversely, for the null address. 611 612 613 @node Regular Expressions 614 @section Overview of Regular Expression Syntax 615 616 To know how to use @command{sed}, people should understand regular 617 expressions (@dfn{regexp} for short). A regular expression 618 is a pattern that is matched against a 619 subject string from left to right. Most characters are 620 @dfn{ordinary}: they stand for 621 themselves in a pattern, and match the corresponding characters 622 in the subject. As a trivial example, the pattern 623 624 @example 625 The quick brown fox 626 @end example 627 628 @noindent 629 matches a portion of a subject string that is identical to 630 itself. The power of regular expressions comes from the 631 ability to include alternatives and repetitions in the pattern. 632 These are encoded in the pattern by the use of @dfn{special characters}, 633 which do not stand for themselves but instead 634 are interpreted in some special way. Here is a brief description 635 of regular expression syntax as used in @command{sed}. 636 637 @table @code 638 @item @var{char} 639 A single ordinary character matches itself. 640 641 @item * 642 @cindex @acronym{GNU} extensions, to basic regular expressions 643 Matches a sequence of zero or more instances of matches for the 644 preceding regular expression, which must be an ordinary character, a 645 special character preceded by @code{\}, a @code{.}, a grouped regexp 646 (see below), or a bracket expression. As a @acronym{GNU} extension, a 647 postfixed regular expression can also be followed by @code{*}; for 648 example, @code{a**} is equivalent to @code{a*}. 
@acronym{POSIX} 649 1003.1-2001 says that @code{*} stands for itself when it appears at 650 the start of a regular expression or subexpression, but many 651 non@acronym{GNU} implementations do not support this and portable 652 scripts should instead use @code{\*} in these contexts. 653 654 @item \+ 655 @cindex @acronym{GNU} extensions, to basic regular expressions 656 As @code{*}, but matches one or more. It is a @acronym{GNU} extension. 657 658 @item \? 659 @cindex @acronym{GNU} extensions, to basic regular expressions 660 As @code{*}, but only matches zero or one. It is a @acronym{GNU} extension. 661 662 @item \@{@var{i}\@} 663 As @code{*}, but matches exactly @var{i} sequences (@var{i} is a 664 decimal integer; for portability, keep it between 0 and 255 665 inclusive). 666 667 @item \@{@var{i},@var{j}\@} 668 Matches between @var{i} and @var{j}, inclusive, sequences. 669 670 @item \@{@var{i},\@} 671 Matches more than or equal to @var{i} sequences. 672 673 @item \(@var{regexp}\) 674 Groups the inner @var{regexp} as a whole, this is used to: 675 676 @itemize @bullet 677 @item 678 @cindex @acronym{GNU} extensions, to basic regular expressions 679 Apply postfix operators, like @code{\(abcd\)*}: 680 this will search for zero or more whole sequences 681 of @samp{abcd}, while @code{abcd*} would search 682 for @samp{abc} followed by zero or more occurrences 683 of @samp{d}. Note that support for @code{\(abcd\)*} is 684 required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU} 685 implementations do not support it and hence it is not universally 686 portable. 687 688 @item 689 Use back references (see below). 690 @end itemize 691 692 @item . 693 Matches any character, including newline. 694 695 @item ^ 696 Matches the null string at beginning of line, i.e. what 697 appears after the circumflex must appear at the 698 beginning of line. 
@code{^#include} will match only 699 lines where @samp{#include} is the first thing on line---if 700 there are spaces before, for example, the match fails. 701 @code{^} acts as a special character only at the beginning 702 of the regular expression or subexpression (that is, 703 after @code{\(} or @code{\|}). Portable scripts should avoid 704 @code{^} at the beginning of a subexpression, though, as 705 @acronym{POSIX} allows implementations that treat @code{^} as 706 an ordinary character in that context. 707 708 709 @item $ 710 It is the same as @code{^}, but refers to end of line. 711 @code{$} also acts as a special character only at the end 712 of the regular expression or subexpression (that is, before @code{\)} 713 or @code{\|}), and its use at the end of a subexpression is not 714 portable. 715 716 717 @item [@var{list}] 718 @itemx [^@var{list}] 719 Matches any single character in @var{list}: for example, 720 @code{[aeiou]} matches all vowels. A list may include 721 sequences like @code{@var{char1}-@var{char2}}, which 722 matches any character between (inclusive) @var{char1} 723 and @var{char2}. 724 725 A leading @code{^} reverses the meaning of @var{list}, so that 726 it matches any single character @emph{not} in @var{list}. To include 727 @code{]} in the list, make it the first character (after 728 the @code{^} if needed), to include @code{-} in the list, 729 make it the first or last; to include @code{^} put 730 it after the first character. 731 732 @cindex @code{POSIXLY_CORRECT} behavior, bracket expressions 733 The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\} 734 are normally not special within @var{list}. For example, @code{[\*]} 735 matches either @samp{\} or @samp{*}, because the @code{\} is not 736 special here. 
However, strings like @code{[.ch.]}, @code{[=a=]}, and 737 @code{[:space:]} are special within @var{list} and represent collating 738 symbols, equivalence classes, and character classes, respectively, and 739 @code{[} is therefore special within @var{list} when it is followed by 740 @code{.}, @code{=}, or @code{:}. Also, when not in 741 @env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and 742 @code{\t} are recognized within @var{list}. @xref{Escapes}. 743 744 @item @var{regexp1}\|@var{regexp2} 745 @cindex @acronym{GNU} extensions, to basic regular expressions 746 Matches either @var{regexp1} or @var{regexp2}. Use 747 parentheses to use complex alternative regular expressions. 748 The matching process tries each alternative in turn, from 749 left to right, and the first one that succeeds is used. 750 It is a @acronym{GNU} extension. 751 752 @item @var{regexp1}@var{regexp2} 753 Matches the concatenation of @var{regexp1} and @var{regexp2}. 754 Concatenation binds more tightly than @code{\|}, @code{^}, and 755 @code{$}, but less tightly than the other regular expression 756 operators. 757 758 @item \@var{digit} 759 Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized 760 subexpression in the regular expression. This is called a @dfn{back 761 reference}. Subexpressions are implicitly numbered by counting 762 occurrences of @code{\(} left-to-right. 763 764 @item \n 765 Matches the newline character. 766 767 @item \@var{char} 768 Matches @var{char}, where @var{char} is one of @code{$}, 769 @code{*}, @code{.}, @code{[}, @code{\}, or @code{^}. 770 Note that the only C-like 771 backslash sequences that you can portably assume to be 772 interpreted are @code{\n} and @code{\\}; in particular 773 @code{\t} is not portable, and matches a @samp{t} under most 774 implementations of @command{sed}, rather than a tab character. 
775 776 @end table 777 778 @cindex Greedy regular expression matching 779 Note that the regular expression matcher is greedy, i.e., matches 780 are attempted from left to right and, if two or more matches are 781 possible starting at the same character, it selects the longest. 782 783 @noindent 784 Examples: 785 @table @samp 786 @item abcdef 787 Matches @samp{abcdef}. 788 789 @item a*b 790 Matches zero or more @samp{a}s followed by a single 791 @samp{b}. For example, @samp{b} or @samp{aaaaab}. 792 793 @item a\?b 794 Matches @samp{b} or @samp{ab}. 795 796 @item a\+b\+ 797 Matches one or more @samp{a}s followed by one or more 798 @samp{b}s: @samp{ab} is the shortest possible match, but 799 other examples are @samp{aaaab} or @samp{abbbbb} or 800 @samp{aaaaaabbbbbbb}. 801 802 @item .* 803 @itemx .\+ 804 These two both match all the characters in a string; 805 however, the first matches every string (including the empty 806 string), while the second matches only strings containing 807 at least one character. 808 809 @item ^main.*(.*) 810 This matches a string starting with @samp{main}, 811 followed by an opening and closing 812 parenthesis. The @samp{n}, @samp{(} and @samp{)} need not 813 be adjacent. 814 815 @item ^# 816 This matches a string beginning with @samp{#}. 817 818 @item \\$ 819 This matches a string ending with a single backslash. The 820 regexp contains two backslashes for escaping. 821 822 @item \$ 823 Instead, this matches a string consisting of a single dollar sign, 824 because it is escaped. 825 826 @item [a-zA-Z0-9] 827 In the C locale, this matches any @acronym{ASCII} letters or digits. 828 829 @item [^ @kbd{tab}]\+ 830 (Here @kbd{tab} stands for a single tab character.) 831 This matches a string of one or more 832 characters, none of which is a space or a tab. 833 Usually this means a word. 834 835 @item ^\(.*\)\n\1$ 836 This matches a string consisting of two equal substrings separated by 837 a newline. 
838 839 @item .\@{9\@}A$ 840 This matches nine characters followed by an @samp{A}. 841 842 @item ^.\@{15\@}A 843 This matches the start of a string that contains 16 characters, 844 the last of which is an @samp{A}. 845 846 @end table 847 848 849 850 @node Common Commands 851 @section Often-Used Commands 852 853 If you use @command{sed} at all, you will quite likely want to know 854 these commands. 855 856 @table @code 857 @item # 858 [No addresses allowed.] 859 860 @findex # (comments) 861 @cindex Comments, in scripts 862 The @code{#} character begins a comment; 863 the comment continues until the next newline. 864 865 @cindex Portability, comments 866 If you are concerned about portability, be aware that 867 some implementations of @command{sed} (which are not @sc{posix} 868 conformant) may only support a single one-line comment, 869 and then only when the very first character of the script is a @code{#}. 870 871 @findex -n, forcing from within a script 872 @cindex Caveat --- #n on first line 873 Warning: if the first two characters of the @command{sed} script 874 are @code{#n}, then the @option{-n} (no-autoprint) option is forced. 875 If you want to put a comment in the first line of your script 876 and that comment begins with the letter @samp{n} 877 and you do not want this behavior, 878 then be sure to either use a capital @samp{N}, 879 or place at least one space before the @samp{n}. 880 881 @item q [@var{exit-code}] 882 This command only accepts a single address. 883 884 @findex q (quit) command 885 @cindex @value{SSEDEXT}, returning an exit code 886 @cindex Quitting 887 Exit @command{sed} without processing any more commands or input. 888 Note that the current pattern space is printed if auto-print is 889 not disabled with the @option{-n} option. The ability to return 890 an exit code from the @command{sed} script is a @value{SSED} extension. 656 657 @item a\ 658 @itemx @var{text} 659 Append @var{text} after a line. 
660 661 @item a @var{text} 662 Append @var{text} after a line (alternative syntax). 663 664 @item b @var{label} 665 Branch unconditionally to @var{label}. 666 The @var{label} may be omitted, in which case the next cycle is started. 667 668 @item c\ 669 @itemx @var{text} 670 Replace (change) lines with @var{text}. 671 672 @item c @var{text} 673 Replace (change) lines with @var{text} (alternative syntax). 891 674 892 675 @item d 893 @findex d (delete) command894 @cindex Text, deleting895 676 Delete the pattern space; 896 677 immediately start next cycle. 897 678 898 @item p 899 @findex p (print) command 900 @cindex Text, printing 901 Print out the pattern space (to the standard output). 902 This command is usually only used in conjunction with the @option{-n} 903 command-line option. 679 @item D 680 If pattern space contains newlines, delete text in the pattern 681 space up to the first newline, and restart cycle with the resultant 682 pattern space, without reading a new line of input. 683 684 If pattern space contains no newline, start a normal new cycle as if 685 the @code{d} command was issued. 686 @c TODO: add a section about D+N and D+n commands 687 688 @item e 689 Executes the command that is found in pattern space and 690 replaces the pattern space with the output; a trailing newline 691 is suppressed. 692 693 @item e @var{command} 694 Executes @var{command} and sends its output to the output stream. 695 The command can run across multiple lines, all but the last ending with 696 a back-slash. 697 698 @item F 699 (filename) Print the file name of the current input file (with a trailing 700 newline). 701 702 @item g 703 Replace the contents of the pattern space with the contents of the hold space. 704 705 @item G 706 Append a newline to the contents of the pattern space, 707 and then append the contents of the hold space to that of the pattern space. 708 709 @item h 710 (hold) Replace the contents of the hold space with the contents of the 711 pattern space. 
712 713 @item H 714 Append a newline to the contents of the hold space, 715 and then append the contents of the pattern space to that of the hold space. 716 717 @item i\ 718 @itemx @var{text} 719 insert @var{text} before a line. 720 721 @item i @var{text} 722 insert @var{text} before a line (alternative syntax). 723 724 @item l 725 Print the pattern space in an unambiguous form. 904 726 905 727 @item n 906 @findex n (next-line) command 907 @cindex Next input line, replace pattern space with 908 @cindex Read next input line 909 If auto-print is not disabled, print the pattern space, 728 (next) If auto-print is not disabled, print the pattern space, 910 729 then, regardless, replace the pattern space with the next line of input. 911 730 If there is no more input then @command{sed} exits without processing 912 731 any more commands. 913 732 914 @item @{ @var{commands} @} 915 @findex @{@} command grouping 916 @cindex Grouping commands 917 @cindex Command groups 918 A group of commands may be enclosed between 919 @code{@{} and @code{@}} characters. 920 This is particularly useful when you want a group of commands 921 to be triggered by a single address (or address-range) match. 733 @item N 734 Add a newline to the pattern space, 735 then append the next line of input to the pattern space. 736 If there is no more input then @command{sed} exits without processing 737 any more commands. 738 739 @item p 740 Print the pattern space. 741 @c useful with @option{-n} 742 743 @item P 744 Print the pattern space, up to the first <newline>. 745 746 @item q@var{[exit-code]} 747 (quit) Exit @command{sed} without processing any more commands or input. 748 749 @item Q@var{[exit-code]} 750 (quit) This command is the same as @code{q}, but will not print the 751 contents of pattern space. Like @code{q}, it provides the 752 ability to return an exit code to the caller. 753 @c useful to quit on a conditional without printing 754 755 @item r filename 756 Reads file @var{filename}. 
757 758 @item R filename 759 Queue a line of @var{filename} to be read and 760 inserted into the output stream at the end of the current cycle, 761 or when the next input line is read. 762 @c useful to interleave files 763 764 @item s@var{/regexp/replacement/[flags]} 765 (substitute) Match the regular-expression against the content of the 766 pattern space. If found, replace matched string with 767 @var{replacement}. 768 769 @item t @var{label} 770 (test) Branch to @var{label} only if there has been a successful 771 @code{s}ubstitution since the last input line was read or conditional 772 branch was taken. The @var{label} may be omitted, in which case the 773 next cycle is started. 774 775 @item T @var{label} 776 (test) Branch to @var{label} only if there have been no successful 777 @code{s}ubstitutions since the last input line was read or 778 conditional branch was taken. The @var{label} may be omitted, 779 in which case the next cycle is started. 780 781 @item v @var{[version]} 782 (version) This command does nothing, but makes @command{sed} fail if 783 @value{SSED} extensions are not supported, or if the requested version 784 is not available. 785 786 @item w filename 787 Write the pattern space to @var{filename}. 788 789 @item W filename 790 Write to the given filename the portion of the pattern space up to 791 the first newline. 792 793 @item x 794 Exchange the contents of the hold and pattern spaces. 795 796 797 @item y/src/dst/ 798 Transliterate any characters in the pattern space which match 799 any of the @var{source-chars} with the corresponding character 800 in @var{dest-chars}. 801 802 803 @item z 804 (zap) This command empties the content of pattern space. 805 806 @item # 807 A comment, until the next newline. 808 809 810 @item @{ @var{cmd ; cmd ...} @} 811 Group several commands together. 812 @c useful for multiple commands on same address 813 814 @item = 815 Print the current input line number (with a trailing newline). 
816 817 @item : @var{label} 818 Specify the location of @var{label} for branch commands (@code{b}, 819 @code{t}, @code{T}). 922 820 923 821 @end table 822 924 823 925 824 @node The "s" Command 926 825 @section The @code{s} Command 927 826 928 The syntax of the @code{s} (as in substitute) command is 929 @samp{s/@var{regexp}/@var{replacement}/@var{flags}}. The @code{/} 930 characters may be uniformly replaced by any other single 931 character within any given @code{s} command. The @code{/} 932 character (or whatever other character is used in its stead) 933 can appear in the @var{regexp} or @var{replacement} 934 only if it is preceded by a @code{\} character. 935 936 The @code{s} command is probably the most important in @command{sed} 937 and has a lot of different options. Its basic concept is simple: 938 the @code{s} command attempts to match the pattern 939 space against the supplied @var{regexp}; if the match is 940 successful, then that portion of the pattern 941 space which was matched is replaced with @var{replacement}. 827 The @code{s} command (as in substitute) is probably the most important 828 in @command{sed} and has a lot of different options. The syntax of 829 the @code{s} command is 830 @samp{s/@var{regexp}/@var{replacement}/@var{flags}}. 831 832 Its basic concept is simple: the @code{s} command attempts to match 833 the pattern space against the supplied regular expression @var{regexp}; 834 if the match is successful, then that portion of the 835 pattern space which was matched is replaced with @var{replacement}. 836 837 For details about @var{regexp} syntax @pxref{Regexp Addresses,,Regular 838 Expression Addresses}. 942 839 943 840 @cindex Backreferences, in regular expressions … … 950 847 characters which reference the whole matched portion 951 848 of the pattern space. 849 850 @c TODO: xref to backreference section mention @var{\'}. 
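As a small illustration of the substitution behavior described in this section, the following sketch runs a few @code{s} commands on made-up input. It assumes a POSIX shell and uses only basic regular expression syntax; the sample strings are hypothetical and not taken from the manual.

```shell
# Without flags, only the first match in the pattern space is replaced.
echo 'hello world' | sed 's/o/0/'

# An unescaped & in the replacement stands for the whole matched text,
# so the matched "llo" is wrapped in brackets.
echo 'hello' | sed 's/l*o/[&]/'

# \1 and \2 refer to the first and second \(...\) groups of the regexp,
# which lets the replacement reorder the two words.
echo 'John Smith' | sed 's/\(.*\) \(.*\)/\2, \1/'
```

The three commands print @samp{hell0 world}, @samp{he[llo]}, and @samp{Smith, John} respectively.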
851 852 The @code{/} 853 characters may be uniformly replaced by any other single 854 character within any given @code{s} command. The @code{/} 855 character (or whatever other character is used in its stead) 856 can appear in the @var{regexp} or @var{replacement} 857 only if it is preceded by a @code{\} character. 858 859 860 952 861 @cindex @value{SSEDEXT}, case modifiers in @code{s} commands 953 862 Finally, as a @value{SSED} extension, you can include a … … 976 885 Stop case conversion started by @code{\L} or @code{\U}. 977 886 @end table 887 888 When the @code{g} flag is being used, case conversion does not 889 propagate from one occurrence of the regular expression to 890 another. For example, when the following command is executed 891 with @samp{a-b-} in pattern space: 892 @example 893 s/\(b\?\)-/x\u\1/g 894 @end example 895 896 @noindent 897 the output is @samp{axxB}. When replacing the first @samp{-}, 898 the @samp{\u} sequence only affects the empty replacement of 899 @samp{\1}. It does not affect the @code{x} character that is 900 added to pattern space when replacing @code{b-} with @code{xB}. 901 902 On the other hand, @code{\l} and @code{\u} do affect the remainder 903 of the replacement text if they are followed by an empty substitution. 904 With @samp{a-b-} in pattern space, the following command: 905 @example 906 s/\(b\?\)-/\u\1x/g 907 @end example 908 909 @noindent 910 will replace @samp{-} with @samp{X} (uppercase) and @samp{b-} with 911 @samp{Bx}. If this behavior is undesirable, you can prevent it by 912 adding a @samp{\E} sequence---after @samp{\1} in this case. 978 913 979 914 To include a literal @code{\}, @code{&}, or newline in the final … … 997 932 Only replace the @var{number}th match of the @var{regexp}. 
998 933 999 @cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command 934 @cindex GNU extensions, @code{g} and @var{number} modifier 935 interaction in @code{s} command 1000 936 @cindex Mixing @code{g} and @var{number} modifiers in the @code{s} command 1001 937 Note: the @sc{posix} standard does not specify what should happen … … 1023 959 change in future versions. 1024 960 1025 @item w @var{file -name}961 @item w @var{filename} 1026 962 @cindex Text, writing to a file after substitution 1027 963 @cindex @value{SSEDEXT}, @file{/dev/stdout} file 1028 964 @cindex @value{SSEDEXT}, @file{/dev/stderr} file 1029 965 If the substitution was made, then write out the result to the named file. 1030 As a @value{SSED} extension, two special values of @var{file -name} are966 As a @value{SSED} extension, two special values of @var{filename} are 1031 967 supported: @file{/dev/stderr}, which writes the result to the standard 1032 968 error, and @file{/dev/stdout}, which writes to the standard … … 1048 984 @item I 1049 985 @itemx i 1050 @cindex @acronym{GNU}extensions, @code{I} modifier986 @cindex GNU extensions, @code{I} modifier 1051 987 @cindex Case-insensitive matching 1052 @ifset PERL 1053 @cindex Perl-style regular expressions, case-insensitive 1054 @end ifset 1055 The @code{I} modifier to regular-expression matching is a @acronym{GNU} 988 The @code{I} modifier to regular-expression matching is a GNU 1056 989 extension which makes @command{sed} match @var{regexp} in a 1057 990 case-insensitive manner. … … 1060 993 @itemx m 1061 994 @cindex @value{SSEDEXT}, @code{M} modifier 1062 @ifset PERL1063 @cindex Perl-style regular expressions, multiline1064 @end ifset1065 995 The @code{M} modifier to regular-expression matching is a @value{SSED} 1066 extension which causes @code{^} and @code{$} to match respectively 1067 (in addition to the normal behavior) the empty string after a newline, 1068 and the empty string before a newline. 
There are special character 1069 sequences 1070 @ifset PERL 1071 (@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'} 1072 in basic or extended regular expression modes) 1073 @end ifset 996 extension which directs @value{SSED} to match the regular expression 997 in @cite{multi-line} mode. The modifier causes @code{^} and @code{$} to 998 match respectively (in addition to the normal behavior) the empty string 999 after a newline, and the empty string before a newline. There are 1000 special character sequences 1074 1001 @ifclear PERL 1075 1002 (@code{\`} and @code{\'}) 1076 1003 @end ifclear 1077 1004 which always match the beginning or the end of the buffer. 1078 @code{M} stands for @cite{multi-line}. 1079 1080 @ifset PERL 1081 @item S 1082 @itemx s 1083 @cindex @value{SSEDEXT}, @code{S} modifier 1084 @cindex Perl-style regular expressions, single line 1085 The @code{S} modifier to regular-expression matching is only valid 1086 in Perl mode and specifies that the dot character (@code{.}) will 1087 match the newline character too. @code{S} stands for @cite{single-line}. 1088 @end ifset 1089 1090 @ifset PERL 1091 @item X 1092 @itemx x 1093 @cindex @value{SSEDEXT}, @code{X} modifier 1094 @cindex Perl-style regular expressions, extended 1095 The @code{X} modifier to regular-expression matching is also 1096 valid in Perl mode only. If it is used, whitespace in the 1097 pattern (other than in a character class) and 1098 characters between a @kbd{#} outside a character class and the 1099 next newline character are ignored. An escaping backslash 1100 can be used to include a whitespace or @kbd{#} character as part 1101 of the pattern. 1102 @end ifset 1005 In addition, 1006 the period character does not match a new-line character in 1007 multi-line mode. 1008 1009 1010 @end table 1011 1012 @node Common Commands 1013 @section Often-Used Commands 1014 1015 If you use @command{sed} at all, you will quite likely want to know 1016 these commands. 
1017 1018 @table @code 1019 @item # 1020 [No addresses allowed.] 1021 1022 @findex # (comments) 1023 @cindex Comments, in scripts 1024 The @code{#} character begins a comment; 1025 the comment continues until the next newline. 1026 1027 @cindex Portability, comments 1028 If you are concerned about portability, be aware that 1029 some implementations of @command{sed} (which are not @sc{posix} 1030 conforming) may only support a single one-line comment, 1031 and then only when the very first character of the script is a @code{#}. 1032 1033 @findex -n, forcing from within a script 1034 @cindex Caveat --- #n on first line 1035 Warning: if the first two characters of the @command{sed} script 1036 are @code{#n}, then the @option{-n} (no-autoprint) option is forced. 1037 If you want to put a comment in the first line of your script 1038 and that comment begins with the letter @samp{n} 1039 and you do not want this behavior, 1040 then be sure to either use a capital @samp{N}, 1041 or place at least one space before the @samp{n}. 1042 1043 @item q [@var{exit-code}] 1044 @findex q (quit) command 1045 @cindex @value{SSEDEXT}, returning an exit code 1046 @cindex Quitting 1047 Exit @command{sed} without processing any more commands or input. 1048 1049 Example: stop after printing the second line: 1050 @example 1051 $ seq 3 | sed 2q 1052 1 1053 2 1054 @end example 1055 1056 This command accepts only one address. 1057 Note that the current pattern space is printed if auto-print is 1058 not disabled with the @option{-n} option. The ability to return 1059 an exit code from the @command{sed} script is a @value{SSED} extension. 1060 1061 See also the @value{SSED} extension @code{Q} command which quits silently 1062 without printing the current pattern space. 1063 1064 @item d 1065 @findex d (delete) command 1066 @cindex Text, deleting 1067 Delete the pattern space; 1068 immediately start next cycle. 
1069 1070 Example: delete the second input line: 1071 @example 1072 $ seq 3 | sed 2d 1073 1 1074 3 1075 @end example 1076 1077 @item p 1078 @findex p (print) command 1079 @cindex Text, printing 1080 Print out the pattern space (to the standard output). 1081 This command is usually only used in conjunction with the @option{-n} 1082 command-line option. 1083 1084 Example: print only the second input line: 1085 @example 1086 $ seq 3 | sed -n 2p 1087 2 1088 @end example 1089 1090 @item n 1091 @findex n (next-line) command 1092 @cindex Next input line, replace pattern space with 1093 @cindex Read next input line 1094 If auto-print is not disabled, print the pattern space, 1095 then, regardless, replace the pattern space with the next line of input. 1096 If there is no more input then @command{sed} exits without processing 1097 any more commands. 1098 1099 This command is useful to skip lines (e.g. process every Nth line). 1100 1101 Example: perform substitution on every 3rd line (i.e. two @code{n} commands 1102 skip two lines): 1103 @codequoteundirected on 1104 @codequotebacktick on 1105 @example 1106 $ seq 6 | sed 'n;n;s/./x/' 1107 1 1108 2 1109 x 1110 4 1111 5 1112 x 1113 @end example 1114 1115 @value{SSED} provides an extension address syntax of @var{first}~@var{step} 1116 to achieve the same result: 1117 1118 @example 1119 $ seq 6 | sed '0~3s/./x/' 1120 1 1121 2 1122 x 1123 4 1124 5 1125 x 1126 @end example 1127 1128 @codequotebacktick off 1129 @codequoteundirected off 1130 1131 1132 @item @{ @var{commands} @} 1133 @findex @{@} command grouping 1134 @cindex Grouping commands 1135 @cindex Command groups 1136 A group of commands may be enclosed between 1137 @code{@{} and @code{@}} characters. 1138 This is particularly useful when you want a group of commands 1139 to be triggered by a single address (or address-range) match. 
1140 1141 Example: perform substitution then print the second input line: 1142 @codequoteundirected on 1143 @codequotebacktick on 1144 @example 1145 $ seq 3 | sed -n '2@{s/2/X/ ; p@}' 1146 X 1147 @end example 1148 @codequoteundirected off 1149 @codequotebacktick off 1150 1103 1151 @end table 1104 1152 … … 1113 1161 @table @code 1114 1162 @item y/@var{source-chars}/@var{dest-chars}/ 1115 (The @code{/} characters may be uniformly replaced by1116 any other single character within any given @code{y} command.)1117 1118 1163 @findex y (transliterate) command 1119 1164 @cindex Transliteration … … 1122 1167 in @var{dest-chars}. 1123 1168 1169 Example: transliterate @samp{a-j} into @samp{0-9}: 1170 @codequoteundirected on 1171 @codequotebacktick on 1172 @example 1173 $ echo hello world | sed 'y/abcdefghij/0123456789/' 1174 74llo worl3 1175 @end example 1176 @codequoteundirected off 1177 @codequotebacktick off 1178 1179 (The @code{/} characters may be uniformly replaced by 1180 any other single character within any given @code{y} command.) 1181 1124 1182 Instances of the @code{/} (or whatever other character is used in its stead), 1125 1183 @code{\}, or newlines can appear in the @var{source-chars} or @var{dest-chars} … … 1128 1186 contain the same number of characters (after de-escaping). 1129 1187 1188 See the @command{tr} command from GNU coreutils for similar functionality. 1189 1190 @item a @var{text} 1191 Appending @var{text} after a line. This is a GNU extension 1192 to the standard @code{a} command - see below for details. 1193 1194 Example: Add @samp{hello} after the second line: 1195 @codequoteundirected on 1196 @codequotebacktick on 1197 @example 1198 $ seq 3 | sed '2a hello' 1199 1 1200 2 1201 hello 1202 3 1203 @end example 1204 @codequoteundirected off 1205 @codequotebacktick off 1206 1207 Leading whitespace after the @code{a} command is ignored. 1208 The text to add is read until the end of the line. 
1209 1210 1130 1211 @item a\ 1131 1212 @itemx @var{text} 1132 @cindex @value{SSEDEXT}, two addresses supported by most commands1133 As a @acronym{GNU} extension, this command accepts two addresses.1134 1135 1213 @findex a (append text lines) command 1136 1214 @cindex Appending text after a line 1137 1215 @cindex Text, appending 1138 Queue the lines of text which follow this command 1216 Appending @var{text} after a line. 1217 1218 Example: Add @samp{hello} after the second line 1219 (@print{} indicates printed output lines): 1220 @codequoteundirected on 1221 @codequotebacktick on 1222 @example 1223 $ seq 3 | sed '2a\ 1224 hello' 1225 @print{}1 1226 @print{}2 1227 @print{}hello 1228 @print{}3 1229 @end example 1230 @codequoteundirected off 1231 @codequotebacktick off 1232 1233 The @code{a} command queues the lines of text which follow this command 1139 1234 (each but the last ending with a @code{\}, 1140 1235 which are removed from the output) … … 1142 1237 or when the next input line is read. 1143 1238 1239 @cindex @value{SSEDEXT}, two addresses supported by most commands 1240 As a GNU extension, this command accepts two addresses. 1241 1144 1242 Escape sequences in @var{text} are processed, so you should 1145 1243 use @code{\\} in @var{text} to print a single backslash. 1146 1244 1147 As a @acronym{GNU} extension, if between the @code{a} and the newline there is 1148 other than a whitespace-@code{\} sequence, then the text of this line, 1149 starting at the first non-whitespace character after the @code{a}, 1150 is taken as the first line of the @var{text} block. 1151 (This enables a simplification in scripting a one-line add.) 1152 This extension also works with the @code{i} and @code{c} commands. 
1153 1245 The commands resume after the last line without a backslash (@code{\}) - 1246 @samp{world} in the following example: 1247 @codequoteundirected on 1248 @codequotebacktick on 1249 @example 1250 $ seq 3 | sed '2a\ 1251 hello\ 1252 world 1253 3s/./X/' 1254 @print{}1 1255 @print{}2 1256 @print{}hello 1257 @print{}world 1258 @print{}X 1259 @end example 1260 @codequoteundirected off 1261 @codequotebacktick off 1262 1263 As a GNU extension, the @code{a} command and @var{text} can be 1264 separated into two @code{-e} parameters, enabling easier scripting: 1265 @codequoteundirected on 1266 @codequotebacktick on 1267 @example 1268 $ seq 3 | sed -e '2a\' -e hello 1269 1 1270 2 1271 hello 1272 3 1273 1274 $ sed -e '2a\' -e "$VAR" 1275 @end example 1276 @codequoteundirected off 1277 @codequotebacktick off 1278 1279 @item i @var{text} 1280 insert @var{text} before a line. This is a GNU extension 1281 to the standard @code{i} command - see below for details. 1282 1283 Example: Insert @samp{hello} before the second line: 1284 @codequoteundirected on 1285 @codequotebacktick on 1286 @example 1287 $ seq 3 | sed '2i hello' 1288 1 1289 hello 1290 2 1291 3 1292 @end example 1293 @codequoteundirected off 1294 @codequotebacktick off 1295 1296 Leading whitespace after the @code{i} command is ignored. 1297 The text to add is read until the end of the line. 1298 1299 @anchor{insert command} 1154 1300 @item i\ 1155 1301 @itemx @var{text} 1156 @cindex @value{SSEDEXT}, two addresses supported by most commands1157 As a @acronym{GNU} extension, this command accepts two addresses.1158 1159 1302 @findex i (insert text lines) command 1160 1303 @cindex Inserting text before a line 1161 1304 @cindex Text, insertion 1162 Immediately output the lines of text which follow this command 1163 (each but the last ending with a @code{\}, 1164 which are removed from the output). 1305 Immediately output the lines of text which follow this command. 
1306 1307 Example: Insert @samp{hello} before the second line 1308 (@print{} indicates printed output lines): 1309 @codequoteundirected on 1310 @codequotebacktick on 1311 @example 1312 $ seq 3 | sed '2i\ 1313 hello' 1314 @print{}1 1315 @print{}hello 1316 @print{}2 1317 @print{}3 1318 @end example 1319 @codequoteundirected off 1320 @codequotebacktick off 1321 1322 @cindex @value{SSEDEXT}, two addresses supported by most commands 1323 As a GNU extension, this command accepts two addresses. 1324 1325 Escape sequences in @var{text} are processed, so you should 1326 use @code{\\} in @var{text} to print a single backslash. 1327 1328 The commands resume after the last line without a backslash (@code{\}) - 1329 @samp{world} in the following example: 1330 @codequoteundirected on 1331 @codequotebacktick on 1332 @example 1333 $ seq 3 | sed '2i\ 1334 hello\ 1335 world 1336 s/./X/' 1337 @print{}X 1338 @print{}hello 1339 @print{}world 1340 @print{}X 1341 @print{}X 1342 @end example 1343 @codequoteundirected off 1344 @codequotebacktick off 1345 1346 As a GNU extension, the @code{i} command and @var{text} can be 1347 separated into two @code{-e} parameters, enabling easier scripting: 1348 @codequoteundirected on 1349 @codequotebacktick on 1350 @example 1351 $ seq 3 | sed -e '2i\' -e hello 1352 1 1353 hello 1354 2 1355 3 1356 1357 $ sed -e '2i\' -e "$VAR" 1358 @end example 1359 @codequoteundirected off 1360 @codequotebacktick off 1361 1362 @item c @var{text} 1363 Replaces the line(s) with @var{text}. This is a GNU extension 1364 to the standard @code{c} command - see below for details. 1365 1366 Example: Replace the 2nd to 9th lines with the word @samp{hello}: 1367 @codequoteundirected on 1368 @codequotebacktick on 1369 @example 1370 $ seq 10 | sed '2,9c hello' 1371 1 1372 hello 1373 10 1374 @end example 1375 @codequoteundirected off 1376 @codequotebacktick off 1377 1378 Leading whitespace after the @code{c} command is ignored. 
1379 The text to add is read until the end of the line. 1165 1380 1166 1381 @item c\ … … 1169 1384 @cindex Replacing selected lines with other text 1170 1385 Delete the lines matching the address or address-range, 1171 and output the lines of text which follow this command 1172 (each but the last ending with a @code{\}, 1173 which are removed from the output) 1174 in place of the last line 1175 (or in place of each line, if no addresses were specified). 1386 and output the lines of text which follow this command. 1387 1388 Example: Replace 2nd to 4th lines with the words @samp{hello} and 1389 @samp{world} (@print{} indicates printed output lines): 1390 @codequoteundirected on 1391 @codequotebacktick on 1392 @example 1393 $ seq 5 | sed '2,4c\ 1394 hello\ 1395 world' 1396 @print{}1 1397 @print{}hello 1398 @print{}world 1399 @print{}5 1400 @end example 1401 @codequoteundirected off 1402 @codequotebacktick off 1403 1404 If no addresses are given, each line is replaced. 1405 1176 1406 A new cycle is started after this command is done, 1177 1407 since the pattern space will have been deleted. 
1408 In the following example, the @code{c} starts a 1409 new cycle and the substitution command is not performed 1410 on the replaced text: 1411 1412 @codequoteundirected on 1413 @codequotebacktick on 1414 @example 1415 $ seq 3 | sed '2c\ 1416 hello 1417 s/./X/' 1418 @print{}X 1419 @print{}hello 1420 @print{}X 1421 @end example 1422 @codequoteundirected off 1423 @codequotebacktick off 1424 1425 As a GNU extension, the @code{c} command and @var{text} can be 1426 separated into two @code{-e} parameters, enabling easier scripting: 1427 @codequoteundirected on 1428 @codequotebacktick on 1429 @example 1430 $ seq 3 | sed -e '2c\' -e hello 1431 1 1432 hello 1433 3 1434 1435 $ sed -e '2c\' -e "$VAR" 1436 @end example 1437 @codequoteundirected off 1438 @codequotebacktick off 1439 1178 1440 1179 1441 @item = 1180 @cindex @value{SSEDEXT}, two addresses supported by most commands1181 As a @acronym{GNU} extension, this command accepts two addresses.1182 1183 1442 @findex = (print line number) command 1184 1443 @cindex Printing line number 1185 1444 @cindex Line number, printing 1186 1445 Print out the current input line number (with a trailing newline). 1446 1447 @codequoteundirected on 1448 @codequotebacktick on 1449 @example 1450 $ printf '%s\n' aaa bbb ccc | sed = 1451 1 1452 aaa 1453 2 1454 bbb 1455 3 1456 ccc 1457 @end example 1458 @codequoteundirected off 1459 @codequotebacktick off 1460 1461 @cindex @value{SSEDEXT}, two addresses supported by most commands 1462 As a GNU extension, this command accepts two addresses. 1463 1464 1465 1187 1466 1188 1467 @item l @var{n} … … 1204 1483 1205 1484 @item r @var{filename} 1206 @cindex @value{SSEDEXT}, two addresses supported by most commands1207 As a @acronym{GNU} extension, this command accepts two addresses.1208 1485 1209 1486 @findex r (read file) command 1210 1487 @cindex Read text from a file 1488 Reads file @var{filename}. 
Example: 1489 1490 @codequoteundirected on 1491 @codequotebacktick on 1492 @example 1493 $ seq 3 | sed '2r/etc/hostname' 1494 1 1495 2 1496 fencepost.gnu.org 1497 3 1498 @end example 1499 @codequoteundirected off 1500 @codequotebacktick off 1501 1211 1502 @cindex @value{SSEDEXT}, @file{/dev/stdin} file 1212 1503 Queue the contents of @var{filename} to be read and … … 1220 1511 standard input. 1221 1512 1513 @cindex @value{SSEDEXT}, two addresses supported by most commands 1514 As a GNU extension, this command accepts two addresses. The 1515 file will then be reread and inserted on each of the addressed lines. 1516 1517 As a @value{SSED} extension, the @code{r} command accepts a zero address, 1518 inserting a file @emph{before} the first line of the input 1519 @pxref{Adding a header to multiple files}. 1520 1222 1521 @item w @var{filename} 1223 1522 @findex w (write file) command … … 1226 1525 @cindex @value{SSEDEXT}, @file{/dev/stderr} file 1227 1526 Write the pattern space to @var{filename}. 1228 As a @value{SSED} extension, two special values of @var{file -name} are1527 As a @value{SSED} extension, two special values of @var{filename} are 1229 1528 supported: @file{/dev/stderr}, which writes the result to the standard 1230 1529 error, and @file{/dev/stdout}, which writes to the standard … … 1232 1531 option is being used.} 1233 1532 1234 The file will be created (or truncated) before the 1235 first input line is read; all @code{w} commands 1236 (including instances of @code{w} flag on successful @code{s} commands) 1237 which refer to the same @var{filename} are output without 1238 closing and reopening the file. 1533 The file will be created (or truncated) before the first input line is 1534 read; all @code{w} commands (including instances of the @code{w} flag 1535 on successful @code{s} commands) which refer to the same @var{filename} 1536 are output without closing and reopening the file. 
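As a sketch of the file handling described above, the following session writes two selected lines to one file from two separate @code{w} commands. The temporary directory and the file name @file{out.txt} are hypothetical, chosen only for illustration.

```shell
# Work in a throwaway directory (out.txt is a made-up name).
dir=$(mktemp -d)

# Both w commands name the same file, so the writes land in it in order:
# line 1 first, then line 3.
seq 4 | sed -n -e "1w $dir/out.txt" -e "3w $dir/out.txt"

cat "$dir/out.txt"
rm -r "$dir"
```

Each @code{w} is given in its own @option{-e} because the file name is read up to the end of the line.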
1239 1537 1240 1538 @item D 1241 1539 @findex D (delete first line) command 1242 1540 @cindex Delete first line from pattern space 1243 Delete text in the pattern space up to the first newline. 1244 If any text is left, restart cycle with the resultant 1245 pattern space (without reading a new line of input), 1246 otherwise start a normal new cycle.1541 If pattern space contains no newline, start a normal new cycle as if 1542 the @code{d} command was issued. Otherwise, delete text in the pattern 1543 space up to the first newline, and restart cycle with the resultant 1544 pattern space, without reading a new line of input. 1247 1545 1248 1546 @item N … … 1254 1552 If there is no more input then @command{sed} exits without processing 1255 1553 any more commands. 1554 1555 When @option{-z} is used, a zero byte (the ascii @samp{NUL} character) is 1556 added between the lines (instead of a new line). 1557 1558 By default @command{sed} does not terminate if there is no 'next' input line. 1559 This is a GNU extension which can be disabled with @option{--posix}. 1560 @xref{N_command_last_line,,N command on the last line}. 1561 1256 1562 1257 1563 @item P … … 1356 1662 1357 1663 If a parameter is specified, instead, the @code{e} command 1358 interprets it as a command and sends its output to the output stream 1359 (like @code{r} does). The command can run across multiple 1360 lines, all but the last ending witha back-slash.1664 interprets it as a command and sends its output to the output stream. 1665 The command can run across multiple lines, all but the last ending with 1666 a back-slash. 1361 1667 1362 1668 In both cases, the results are undefined if the command to be 1363 1669 executed contains a @sc{nul} character. 
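A short sketch of the parameter form of @code{e}, assuming GNU @command{sed} with command execution available (that is, not run with @option{--sandbox}). The output of @command{echo} is expected to appear before the input line that triggered it.

```shell
# GNU extension: run "echo hello" when line 1 is reached.
# The command's output is printed before the pattern space for that line.
echo a | sed '1e echo hello'
```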
1364 1670 1365 @item L @var{n} 1366 @findex L (fLow paragraphs) command 1367 @cindex Reformat pattern space 1368 @cindex Reformatting paragraphs 1369 @cindex @value{SSEDEXT}, reformatting paragraphs 1370 @cindex @value{SSEDEXT}, @code{L} command 1371 This @value{SSED} extension fills and joins lines in pattern space 1372 to produce output lines of (at most) @var{n} characters, like 1373 @code{fmt} does; if @var{n} is omitted, the default as specified 1374 on the command line is used. This command is considered a failed 1375 experiment and unless there is enough request (which seems unlikely) 1376 will be removed in future versions. 1377 1378 @ignore 1379 Blank lines, spaces between words, and indentation are 1380 preserved in the output; successive input lines with different 1381 indentation are not joined; tabs are expanded to 8 columns. 1382 1383 If the pattern space contains multiple lines, they are joined, but 1384 since the pattern space usually contains a single line, the behavior 1385 of a simple @code{L;d} script is the same as @samp{fmt -s} (i.e., 1386 it does not join short lines to form longer ones). 1387 1388 @var{n} specifies the desired line-wrap length; if omitted, 1389 the default as specified on the command line is used. 1390 @end ignore 1671 Note that, unlike the @code{r} command, the output of the command will 1672 be printed immediately; the @code{r} command instead delays the output 1673 to the end of the current cycle. 1674 1675 @item F 1676 @findex F (File name) command 1677 @cindex Printing file name 1678 @cindex File name, printing 1679 Print out the file name of the current input file (with a trailing 1680 newline). 1391 1681 1392 1682 @item Q [@var{exit-code}] 1393 This command only accepts a single address.1683 This command accepts only one address. 
1394 1684 1395 1685 @findex Q (silent Quit) command … … 1409 1699 @example 1410 1700 :eat 1411 $d @i{ Quit silently on the last line}1412 N @i{ Read another line, silently}1413 g @i{ Overwrite pattern space each time to save memory}1701 $d @i{@r{Quit silently on the last line}} 1702 N @i{@r{Read another line, silently}} 1703 g @i{@r{Overwrite pattern space each time to save memory}} 1414 1704 b eat 1415 1705 @end example … … 1462 1752 the first newline. Everything said under the @code{w} command about 1463 1753 file handling holds here too. 1754 1755 @item z 1756 @findex z (Zap) command 1757 @cindex @value{SSEDEXT}, emptying pattern space 1758 @cindex Emptying pattern space 1759 This command empties the content of pattern space. It is 1760 usually the same as @samp{s/.*//}, but is more efficient 1761 and works in the presence of invalid multibyte sequences 1762 in the input stream. @sc{posix} mandates that such sequences 1763 are @emph{not} matched by @samp{.}, so that there is no portable 1764 way to clear @command{sed}'s buffers in the middle of the 1765 script in most multibyte locales (including UTF-8 locales). 1464 1766 @end table 1465 1767 1768 1769 @node Multiple commands syntax 1770 @section Multiple commands syntax 1771 1772 @c POSIX says: 1773 @c Editing commands other than {...}, a, b, c, i, r, t, w, :, and # 1774 @c can be followed by a <semicolon>, optional <blank> characters, and 1775 @c another editing command. However, when an s editing command is used 1776 @c with the w flag, following it with another command in this manner 1777 @c produces undefined results. 1778 1779 There are several methods to specify multiple commands in a @command{sed} 1780 program. 1781 1782 Using newlines is most natural when running a sed script from a file 1783 (using the @option{-f} option). 1784 1785 On the command line, all @command{sed} commands may be separated by newlines. 
1786 Alternatively, you may specify each command as an argument to an @option{-e} 1787 option: 1788 1789 @codequoteundirected on 1790 @codequotebacktick on 1791 @example 1792 @group 1793 $ seq 6 | sed '1d 1794 3d 1795 5d' 1796 2 1797 4 1798 6 1799 1800 $ seq 6 | sed -e 1d -e 3d -e 5d 1801 2 1802 4 1803 6 1804 @end group 1805 @end example 1806 @codequoteundirected off 1807 @codequotebacktick off 1808 1809 A semicolon (@samp{;}) may be used to separate most simple commands: 1810 1811 @codequoteundirected on 1812 @codequotebacktick on 1813 @example 1814 @group 1815 $ seq 6 | sed '1d;3d;5d' 1816 2 1817 4 1818 6 1819 @end group 1820 @end example 1821 @codequoteundirected off 1822 @codequotebacktick off 1823 1824 The @code{@{},@code{@}},@code{b},@code{t},@code{T},@code{:} commands can 1825 be separated with a semicolon (this is a non-portable @value{SSED} extension). 1826 1827 @codequoteundirected on 1828 @codequotebacktick on 1829 @example 1830 @group 1831 $ seq 4 | sed '@{1d;3d@}' 1832 2 1833 4 1834 1835 $ seq 6 | sed '@{1d;3d@};5d' 1836 2 1837 4 1838 6 1839 @end group 1840 @end example 1841 @codequoteundirected off 1842 @codequotebacktick off 1843 1844 Labels used in @code{b},@code{t},@code{T},@code{:} commands are read 1845 until a semicolon. Leading and trailing whitespace is ignored. In 1846 the examples below the label is @samp{x}. The first example works 1847 with @value{SSED}. The second is a portable equivalent. For more 1848 information about branching and labels @pxref{Branching and flow 1849 control}. 
1850 1851 @codequoteundirected on 1852 @codequotebacktick on 1853 @example 1854 @group 1855 $ seq 3 | sed '/1/b x ; s/^/=/ ; :x ; 3d' 1856 1 1857 =2 1858 1859 $ seq 3 | sed -e '/1/bx' -e 's/^/=/' -e ':x' -e '3d' 1860 1 1861 =2 1862 @end group 1863 @end example 1864 @codequoteundirected off 1865 @codequotebacktick off 1866 1867 1868 1869 @subsection Commands Requiring a newline 1870 1871 The following commands cannot be separated by a semicolon and 1872 require a newline: 1873 1874 @table @asis 1875 1876 @item @code{a},@code{c},@code{i} (append/change/insert) 1877 1878 All characters following @code{a},@code{c},@code{i} commands are taken 1879 as the text to append/change/insert. Using a semicolon leads to 1880 undesirable results: 1881 1882 @codequoteundirected on 1883 @codequotebacktick on 1884 @example 1885 @group 1886 $ seq 2 | sed '1aHello ; 2d' 1887 1 1888 Hello ; 2d 1889 2 1890 @end group 1891 @end example 1892 @codequoteundirected off 1893 @codequotebacktick off 1894 1895 Separate the commands using @option{-e} or a newline: 1896 1897 @codequoteundirected on 1898 @codequotebacktick on 1899 @example 1900 @group 1901 $ seq 2 | sed -e 1aHello -e 2d 1902 1 1903 Hello 1904 1905 $ seq 2 | sed '1aHello 1906 2d' 1907 1 1908 Hello 1909 @end group 1910 @end example 1911 @codequoteundirected off 1912 @codequotebacktick off 1913 1914 Note that specifying the text to add (@samp{Hello}) immediately 1915 after @code{a},@code{c},@code{i} is itself a @value{SSED} extension. 1916 A portable, POSIX-compliant alternative is: 1917 1918 @codequoteundirected on 1919 @codequotebacktick on 1920 @example 1921 @group 1922 $ seq 2 | sed '1a\ 1923 Hello 1924 2d' 1925 1 1926 Hello 1927 @end group 1928 @end example 1929 @codequoteundirected off 1930 @codequotebacktick off 1931 1932 @item @code{#} (comment) 1933 1934 All characters following @samp{#} until the next newline are ignored. 
1935 1936 @codequoteundirected on 1937 @codequotebacktick on 1938 @example 1939 @group 1940 $ seq 3 | sed '# this is a comment ; 2d' 1941 1 1942 2 1943 3 1944 1945 1946 $ seq 3 | sed '# this is a comment 1947 2d' 1948 1 1949 3 1950 @end group 1951 @end example 1952 @codequoteundirected off 1953 @codequotebacktick off 1954 1955 @item @code{r},@code{R},@code{w},@code{W} (reading and writing files) 1956 1957 The @code{r},@code{R},@code{w},@code{W} commands parse the filename 1958 until end of the line. If whitespace, comments or semicolons are found, 1959 they will be included in the filename, leading to unexpected results: 1960 1961 @codequoteundirected on 1962 @codequotebacktick on 1963 @example 1964 @group 1965 $ seq 2 | sed '1w hello.txt ; 2d' 1966 1 1967 2 1968 1969 $ ls -log 1970 total 4 1971 -rw-rw-r-- 1 2 Jan 23 23:03 hello.txt ; 2d 1972 1973 $ cat 'hello.txt ; 2d' 1974 1 1975 @end group 1976 @end example 1977 @codequoteundirected off 1978 @codequotebacktick off 1979 1980 Note that @command{sed} silently ignores read/write errors in 1981 @code{r},@code{R},@code{w},@code{W} commands (such as missing files). 1982 In the following example, @command{sed} tries to read a file named 1983 @samp{@file{hello.txt ; N}}. The file is missing, and the error is silently 1984 ignored: 1985 1986 @codequoteundirected on 1987 @codequotebacktick on 1988 @example 1989 @group 1990 $ echo x | sed '1rhello.txt ; N' 1991 x 1992 @end group 1993 @end example 1994 @codequoteundirected off 1995 @codequotebacktick off 1996 1997 @item @code{e} (command execution) 1998 1999 Any characters following the @code{e} command until the end of the line 2000 will be sent to the shell. 
If whitespace, comments or semicolons are found, 2001 they will be included in the shell command, leading to unexpected results: 2002 2003 @codequoteundirected on 2004 @codequotebacktick on 2005 @example 2006 @group 2007 $ echo a | sed '1e touch foo#bar' 2008 a 2009 2010 $ ls -1 2011 foo#bar 2012 2013 $ echo a | sed '1e touch foo ; s/a/b/' 2014 sh: 1: s/a/b/: not found 2015 a 2016 @end group 2017 @end example 2018 @codequoteundirected off 2019 @codequotebacktick off 2020 2021 2022 @item @code{s///[we]} (substitute with @code{e} or @code{w} flags) 2023 2024 In a substitution command, the @code{w} flag writes the substitution 2025 result to a file, and the @code{e} flag executes the substitution result 2026 as a shell command. As with the @code{r/R/w/W/e} commands, these 2027 must be terminated with a newline. If whitespace, comments or semicolons 2028 are found, they will be included in the shell command or filename, leading to 2029 unexpected results: 2030 2031 @codequoteundirected on 2032 @codequotebacktick on 2033 @example 2034 @group 2035 $ echo a | sed 's/a/b/w1.txt#foo' 2036 b 2037 2038 $ ls -1 2039 1.txt#foo 2040 @end group 2041 @end example 2042 @codequoteundirected off 2043 @codequotebacktick off 2044 2045 @end table 2046 2047 2048 @node sed addresses 2049 @chapter Addresses: selecting lines 2050 2051 @menu 2052 * Addresses overview:: Addresses overview 2053 * Numeric Addresses:: selecting lines by numbers 2054 * Regexp Addresses:: selecting lines by text matching 2055 * Range Addresses:: selecting a range of lines 2056 * Zero Address:: Using address @code{0} 2057 @end menu 2058 2059 @node Addresses overview 2060 @section Addresses overview 2061 2062 @cindex addresses, numeric 2063 @cindex numeric addresses 2064 Addresses determine on which line(s) the @command{sed} command will be 2065 executed. 
The following command replaces any first occurrence of @samp{hello} 2066 with @samp{world} only on line 144: 2067 2068 @codequoteundirected on 2069 @codequotebacktick on 2070 @example 2071 sed '144s/hello/world/' input.txt > output.txt 2072 @end example 2073 @codequoteundirected off 2074 @codequotebacktick off 2075 2076 2077 2078 If no address is specified, the command is performed on all lines. 2079 The following command replaces @samp{hello} with @samp{world}, 2080 targeting every line of the input file. 2081 However, note that it modifies only the first instance of @samp{hello} 2082 on each line. 2083 Use the @samp{g} modifier to affect every instance on each affected line. 2084 2085 @codequoteundirected on 2086 @codequotebacktick on 2087 @example 2088 sed 's/hello/world/' input.txt > output.txt 2089 @end example 2090 @codequoteundirected off 2091 @codequotebacktick off 2092 2093 2094 2095 @cindex addresses, regular expression 2096 @cindex regular expression addresses 2097 Addresses can contain regular expressions to match lines based 2098 on content instead of line numbers. The following command replaces 2099 @samp{hello} with @samp{world} only on lines 2100 containing the string @samp{apple}: 2101 2102 @codequoteundirected on 2103 @codequotebacktick on 2104 @example 2105 sed '/apple/s/hello/world/' input.txt > output.txt 2106 @end example 2107 @codequoteundirected off 2108 @codequotebacktick off 2109 2110 2111 2112 @cindex addresses, range 2113 @cindex range addresses 2114 An address range is specified with two addresses separated by a comma 2115 (@code{,}). Addresses can be numeric, regular expressions, or a mix of 2116 both. 
2117 The following command replaces @samp{hello} with @samp{world} 2118 only on lines 4 to 17 (inclusive): 2119 2120 @codequoteundirected on 2121 @codequotebacktick on 2122 @example 2123 sed '4,17s/hello/world/' input.txt > output.txt 2124 @end example 2125 @codequoteundirected off 2126 @codequotebacktick off 2127 2128 2129 2130 @cindex Excluding lines 2131 @cindex Selecting non-matching lines 2132 @cindex addresses, negating 2133 @cindex addresses, excluding 2134 Appending the @code{!} character to the end of an address 2135 specification (before the command letter) negates the sense of the 2136 match. That is, if the @code{!} character follows an address or an 2137 address range, then only lines which do @emph{not} match the addresses 2138 will be selected. The following command replaces @samp{hello} 2139 with @samp{world} only on lines @emph{not} containing the string 2140 @samp{apple}: 2141 2142 @example 2143 sed '/apple/!s/hello/world/' input.txt > output.txt 2144 @end example 2145 2146 The following command replaces @samp{hello} with 2147 @samp{world} only on lines 1 to 3 and from line 18 to the last line of the 2148 input file (i.e. excluding lines 4 to 17): 2149 2150 @example 2151 sed '4,17!s/hello/world/' input.txt > output.txt 2152 @end example 2153 2154 2155 2156 2157 2158 @node Numeric Addresses 2159 @section Selecting lines by numbers 2160 @cindex Addresses, in @command{sed} scripts 2161 @cindex Line selection 2162 @cindex Selecting lines to process 2163 2164 Addresses in a @command{sed} script can be in any of the following forms: 2165 @table @code 2166 @item @var{number} 2167 @cindex Address, numeric 2168 @cindex Line, selecting by number 2169 Specifying a line number will match only that line in the input. 2170 (Note that @command{sed} counts lines continuously across all input files 2171 unless @option{-i} or @option{-s} options are specified.) 
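To illustrate the note above, here is a small sketch with two hypothetical files, @file{a.txt} and @file{b.txt}, created in a temporary directory. Without @option{-s} the address @code{1} matches once across both files; with @option{-s} (a GNU extension) it matches the first line of each file.

```shell
dir=$(mktemp -d)
printf 'a1\na2\n' > "$dir/a.txt"
printf 'b1\nb2\n' > "$dir/b.txt"

# Lines are numbered continuously: only the first line of a.txt is line 1.
sed -n '1p' "$dir/a.txt" "$dir/b.txt"

# With -s each file is numbered separately, so both first lines match.
sed -s -n '1p' "$dir/a.txt" "$dir/b.txt"

rm -r "$dir"
```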
2172 2173 @item $ 2174 @cindex Address, last line 2175 @cindex Last line, selecting 2176 @cindex Line, selecting last 2177 This address matches the last line of the last file of input, or 2178 the last line of each file when the @option{-i} or @option{-s} options 2179 are specified. 2180 2181 2182 @item @var{first}~@var{step} 2183 @cindex GNU extensions, @samp{@var{n}~@var{m}} addresses 2184 This GNU extension matches every @var{step}th line 2185 starting with line @var{first}. 2186 In particular, lines will be selected when there exists 2187 a non-negative @var{n} such that the current line-number equals 2188 @var{first} + (@var{n} * @var{step}). 2189 Thus, one would use @code{1~2} to select the odd-numbered lines and 2190 @code{0~2} for even-numbered lines; 2191 to pick every third line starting with the second, @samp{2~3} would be used; 2192 to pick every fifth line starting with the tenth, use @samp{10~5}; 2193 and @samp{50~0} is just an obscure way of saying @code{50}. 2194 2195 The following commands demonstrate the step address usage: 2196 2197 @example 2198 $ seq 10 | sed -n '0~4p' 2199 4 2200 8 2201 2202 $ seq 10 | sed -n '1~3p' 2203 1 2204 4 2205 7 2206 10 2207 @end example 2208 2209 2210 @end table 2211 2212 2213 2214 @node Regexp Addresses 2215 @section Selecting lines by text matching 2216 2217 @value{SSED} supports the following regular expression addresses. 2218 The default regular expression syntax is 2219 @ref{BRE syntax, , Basic Regular Expression (BRE)}. 2220 If the @option{-E} or @option{-r} option is used, the regular expression should be 2221 in @ref{ERE syntax, , Extended Regular Expression (ERE)} syntax. 2222 @xref{BRE vs ERE}. 2223 2224 @table @code 2225 @item /@var{regexp}/ 2226 @cindex Address, as a regular expression 2227 @cindex Line, selecting by regular expression match 2228 This will select any line which matches the regular expression @var{regexp}. 
2229 If @var{regexp} itself includes any @code{/} characters, 2230 each must be escaped by a backslash (@code{\}). 2231 2232 The following command prints lines in @file{/etc/passwd} 2233 which end with @samp{bash}@footnote{ 2234 There are of course many other ways to do the same, 2235 e.g. 2236 @example 2237 grep 'bash$' /etc/passwd 2238 awk -F: '$7 == "/bin/bash"' /etc/passwd 2239 @end example 2240 }: 2241 2242 @example 2243 sed -n '/bash$/p' /etc/passwd 2244 @end example 2245 2246 @cindex empty regular expression 2247 @cindex @value{SSEDEXT}, modifiers and the empty regular expression 2248 The empty regular expression @samp{//} repeats the last regular 2249 expression match (the same holds if the empty regular expression is 2250 passed to the @code{s} command). Note that modifiers to regular expressions 2251 are evaluated when the regular expression is compiled, thus it is invalid to 2252 specify them together with the empty regular expression. 2253 2254 @item \%@var{regexp}% 2255 (The @code{%} may be replaced by any other single character.) 2256 2257 @cindex Slash character, in regular expressions 2258 This also matches the regular expression @var{regexp}, 2259 but allows one to use a different delimiter than @code{/}. 2260 This is particularly useful if the @var{regexp} itself contains 2261 a lot of slashes, since it avoids the tedious escaping of every @code{/}. 2262 If @var{regexp} itself includes any delimiter characters, 2263 each must be escaped by a backslash (@code{\}). 2264 2265 The following commands are equivalent. 
They print lines 2266 which start with @samp{/home/alice/documents/}: 2267 2268 @example 2269 sed -n '/^\/home\/alice\/documents\//p' 2270 sed -n '\%^/home/alice/documents/%p' 2271 sed -n '\;^/home/alice/documents/;p' 2272 @end example 2273 2274 2275 @item /@var{regexp}/I 2276 @itemx \%@var{regexp}%I 2277 @cindex GNU extensions, @code{I} modifier 2278 @cindex case insensitive, regular expression 2279 The @code{I} modifier to regular-expression matching is a GNU 2280 extension which causes the @var{regexp} to be matched in 2281 a case-insensitive manner. 2282 2283 In many other programming languages, a lower case @code{i} is used 2284 for case-insensitive regular expression matching. However, in @command{sed} 2285 the @code{i} is used for the insert command (@pxref{insert command}). 2286 2287 Observe the difference between the following examples. 2288 2289 In this example, @code{/b/I} is the address: regular expression with @code{I} 2290 modifier. @code{d} is the delete command: 2291 2292 @example 2293 $ printf "%s\n" a b c | sed '/b/Id' 2294 a 2295 c 2296 @end example 2297 2298 Here, @code{/b/} is the address: a regular expression. 2299 @code{i} is the insert command. 2300 @code{d} is the value to insert. 2301 A line with @samp{d} is then inserted above the matched line: 2302 2303 @example 2304 $ printf "%s\n" a b c | sed '/b/id' 2305 a 2306 d 2307 b 2308 c 2309 @end example 2310 2311 @item /@var{regexp}/M 2312 @itemx \%@var{regexp}%M 2313 @cindex @value{SSEDEXT}, @code{M} modifier 2314 The @code{M} modifier to regular-expression matching is a @value{SSED} 2315 extension which directs @value{SSED} to match the regular expression 2316 in @cite{multi-line} mode. The modifier causes @code{^} and @code{$} to 2317 match respectively (in addition to the normal behavior) the empty string 2318 after a newline, and the empty string before a newline. 
There are 2319 special character sequences 2320 @ifclear PERL 2321 (@code{\`} and @code{\'}) 2322 @end ifclear 2323 which always match the beginning or the end of the buffer. 2324 In addition, 2325 the period character does not match a new-line character in 2326 multi-line mode. 2327 @end table 2328 2329 2330 @cindex regex addresses and pattern space 2331 @cindex regex addresses and input lines 2332 Regex addresses operate on the content of the current 2333 pattern space. If the pattern space is changed (for example with @code{s///} 2334 command) the regular expression matching will operate on the changed text. 2335 2336 In the following example, automatic printing is disabled with 2337 @option{-n}. The @code{s/2/X/} command changes lines containing 2338 @samp{2} to @samp{X}. The command @code{/[0-9]/p} matches 2339 lines with digits and prints them. 2340 Because the second line is changed before the @code{/[0-9]/} regex, 2341 it will not match and will not be printed: 2342 2343 @codequoteundirected on 2344 @codequotebacktick on 2345 @example 2346 @group 2347 $ seq 3 | sed -n 's/2/X/ ; /[0-9]/p' 2348 1 2349 3 2350 @end group 2351 @end example 2352 @codequoteundirected off 2353 @codequotebacktick off 2354 2355 2356 @node Range Addresses 2357 @section Range Addresses 2358 2359 @cindex Range of lines 2360 @cindex Several lines, selecting 2361 An address range can be specified by specifying two addresses 2362 separated by a comma (@code{,}). An address range matches lines 2363 starting from where the first address matches, and continues 2364 until the second address matches (inclusively): 2365 2366 @example 2367 $ seq 10 | sed -n '4,6p' 2368 4 2369 5 2370 6 2371 @end example 2372 2373 If the second address is a @var{regexp}, then checking for the 2374 ending match will start with the line @emph{following} the 2375 line which matched the first address: a range will always 2376 span at least two lines (except of course if the input stream 2377 ends). 
@example
$ seq 10 | sed -n '4,/[0-9]/p'
4
5
@end example

If the second address is a @var{number} less than (or equal to)
the line matching the first address, then only the one line is
matched:

@example
$ seq 10 | sed -n '4,1p'
4
@end example

@anchor{Zero Address Regex Range}
@cindex Special addressing forms
@cindex Range with start address of zero
@cindex Zero, as range start address
@cindex @var{addr1},+N
@cindex @var{addr1},~N
@cindex GNU extensions, special two-address forms
@cindex GNU extensions, @code{0} address
@cindex GNU extensions, 0,@var{addr2} addressing
@cindex GNU extensions, @var{addr1},+@var{N} addressing
@cindex GNU extensions, @var{addr1},~@var{N} addressing
@value{SSED} also supports some special two-address forms; all these
are GNU extensions:
@table @code
@item 0,/@var{regexp}/
A line number of @code{0} can be used in an address specification like
@code{0,/@var{regexp}/} so that @command{sed} will try to match
@var{regexp} in the first input line too.  In other words,
@code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/},
except that if @var{addr2} matches the very first line of input the
@code{0,/@var{regexp}/} form will consider it to end the range, whereas
the @code{1,/@var{regexp}/} form will match the beginning of its range and
hence make the range span up to the @emph{second} occurrence of the
regular expression.

The following examples demonstrate the difference between starting
with address 1 and 0:

@example
$ seq 10 | sed -n '1,/[0-9]/p'
1
2

$ seq 10 | sed -n '0,/[0-9]/p'
1
@end example


@item @var{addr1},+@var{N}
Matches @var{addr1} and the @var{N} lines following @var{addr1}.
@example
$ seq 10 | sed -n '6,+2p'
6
7
8
@end example

@var{addr1} can be a line number or a regular expression.

@item @var{addr1},~@var{N}
Matches @var{addr1} and the lines following @var{addr1}
until the next line whose input line number is a multiple of @var{N}.
The following command prints starting at line 6, until the next line whose
number is a multiple of 4 (i.e. line 8):

@example
$ seq 10 | sed -n '6,~4p'
6
7
8
@end example

@var{addr1} can be a line number or a regular expression.

@end table



@node Zero Address
@section Zero Address
@cindex Zero Address
As a @value{SSED} extension, the @code{0} address can be used in two cases:
@enumerate
@item
In a regex range address, as @code{0,/@var{regexp}/}
(@pxref{Zero Address Regex Range}).
@item
With the @code{r} command, inserting a file before the first line
(@pxref{Adding a header to multiple files}).
@end enumerate

Note that these are the only places where the @code{0} address makes
sense; commands which are given the @code{0} address in any
other way will give an error.
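A common use of the first case is to edit only the first occurrence of a
pattern in the input.  The following is a minimal sketch, assuming GNU
@command{sed} and a POSIX shell; the sample input and the variable names
(@code{input}, @code{zero_range}, @code{one_range}, @code{zero_alone}) are
illustrative only.  It also checks that a bare @code{0} address is rejected.

```shell
# Assumption: GNU sed (the 0 address is a GNU extension).
input=$(printf 'foo\nfoo\nfoo\n')

# 0,/foo/ allows the range to end on line 1, so only the first line is edited.
zero_range=$(printf '%s\n' "$input" | sed '0,/foo/s/foo/bar/')

# 1,/foo/ starts looking for the end of the range on line 2,
# so the first two lines are edited.
one_range=$(printf '%s\n' "$input" | sed '1,/foo/s/foo/bar/')

printf '%s\n--\n%s\n' "$zero_range" "$one_range"

# A 0 address outside the two supported forms is an error.
if sed -n '0p' </dev/null 2>/dev/null; then
  zero_alone=accepted
else
  zero_alone=rejected
fi
echo "0p is $zero_alone"
```

The first result keeps the second and third @samp{foo} untouched, while the
second result also changes line 2.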
@node sed regular expressions
@chapter Regular Expressions: selecting text

@menu
* Regular Expressions Overview:: Overview of Regular expression in @command{sed}
* BRE vs ERE::                   Basic (BRE) and extended (ERE) regular expression
                                 syntax
* BRE syntax::                   Overview of basic regular expression syntax
* ERE syntax::                   Overview of extended regular expression syntax
* Character Classes and Bracket Expressions::
* regexp extensions::            Additional regular expression commands
* Back-references and Subexpressions:: Back-references and Subexpressions
* Escapes::                      Specifying special characters
* Locale Considerations::        Multibyte characters and locale considerations
@end menu

@node Regular Expressions Overview
@section Overview of regular expression in @command{sed}

@c NOTE: Keep examples in the 'overview' section
@c neutral in regards to BRE/ERE - to ease understanding.


To know how to use @command{sed}, people should understand regular
expressions (@dfn{regexp} for short).  A regular expression
is a pattern that is matched against a
subject string from left to right.  Most characters are
@dfn{ordinary}: they stand for
themselves in a pattern, and match the corresponding characters.
Regular expressions in @command{sed} are specified between two
slashes.

The following command prints lines containing the string @samp{hello}:

@example
sed -n '/hello/p'
@end example

The above example is equivalent to this @command{grep} command:

@example
grep 'hello'
@end example

The power of regular expressions comes from the ability to include
alternatives and repetitions in the pattern.  These are encoded in the
pattern by the use of @dfn{special characters}, which do not stand for
themselves but instead are interpreted in some special way.
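As a small illustration of a repetition, the special character @code{*}
means ``zero or more of the preceding item''.  This is only a sketch,
assuming a POSIX shell; the sample words and the variable name
@code{star_matches} are made up for the example.

```shell
# 'ab*c' matches an 'a', then zero or more 'b' characters, then a 'c'.
# 'adc' is not printed because 'd' is neither 'b' nor 'c'.
star_matches=$(printf '%s\n' ac abc abbc adc | sed -n '/ab*c/p')
printf '%s\n' "$star_matches"
```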
The character @code{^} (caret) in a regular expression matches the
beginning of the line.  The character @code{.} (dot) matches any single
character.  The following @command{sed} command matches and prints
lines which start with the letter @samp{b}, followed by any single character,
followed by the letter @samp{d}:

@example
$ printf "%s\n" abode bad bed bit bid byte body | sed -n '/^b.d/p'
bad
bed
bid
body
@end example

The following sections explain the meaning and usage of special
characters in regular expressions.

@node BRE vs ERE
@section Basic (BRE) and extended (ERE) regular expression

Basic and extended regular expressions are two variations on the
syntax of the specified pattern.  Basic Regular Expression (BRE) syntax is the
default in @command{sed} (and similarly in @command{grep}).
Use the POSIX-specified @option{-E} option (@option{-r},
@option{--regexp-extended}) to enable Extended Regular Expression (ERE) syntax.

In @value{SSED}, the only difference between basic and extended regular
expressions is in the behavior of a few special characters: @samp{?},
@samp{+}, parentheses, braces (@samp{@{@}}), and @samp{|}.

With basic (BRE) syntax, these characters do not have special meaning
unless prefixed with a backslash (@samp{\}); with extended (ERE) syntax
the reverse holds: these characters are special unless they are prefixed
with a backslash (@samp{\}).
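The same contrast applies to grouping, intervals and literal parentheses.
The following sketch assumes GNU @command{sed} and a POSIX shell; the sample
strings and the variable names are illustrative and not part of
@command{sed} itself.

```shell
# Grouping and interval: escaped in BRE, written bare in ERE.
bre_group=$(echo abab | sed -n '/^\(ab\)\{2\}$/p')
ere_group=$(echo abab | sed -E -n '/^(ab){2}$/p')

# Literal parentheses: written bare in BRE, escaped in ERE.
bre_literal=$(echo 'f(x)' | sed -n '/f(x)/p')
ere_literal=$(echo 'f(x)' | sed -E -n '/f\(x\)/p')

printf '%s\n' "$bre_group" "$ere_group" "$bre_literal" "$ere_literal"
```

Each pair of commands selects the same line; only the escaping differs.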
@multitable @columnfractions .28 .36 .35

@headitem Desired pattern
@tab Basic (BRE) Syntax
@tab Extended (ERE) Syntax

@item literal @samp{+} (plus sign)

@tab
@exampleindent 0
@codequoteundirected on
@codequotebacktick on
@example
$ echo 'a+b=c' > foo
$ sed -n '/a+b/p' foo
a+b=c
@end example
@codequotebacktick off
@codequoteundirected off

@tab
@exampleindent 0
@codequoteundirected on
@codequotebacktick on
@example
$ echo 'a+b=c' > foo
$ sed -E -n '/a\+b/p' foo
a+b=c
@end example
@codequotebacktick off
@codequoteundirected off


@item One or more @samp{a} characters followed by @samp{b}
(plus sign as special meta-character)

@tab
@exampleindent 0
@codequoteundirected on
@codequotebacktick on
@example
$ echo aab > foo
$ sed -n '/a\+b/p' foo
aab
@end example
@codequotebacktick off
@codequoteundirected off

@tab
@exampleindent 0
@codequoteundirected on
@codequotebacktick on
@example
$ echo aab > foo
$ sed -E -n '/a+b/p' foo
aab
@end example
@codequotebacktick off
@codequoteundirected off

@end multitable




@node BRE syntax
@section Overview of basic regular expression syntax

Here is a brief description
of regular expression syntax as used in @command{sed}.

@table @code
@item @var{char}
A single ordinary character matches itself.

@item *
@cindex GNU extensions, to basic regular expressions
Matches a sequence of zero or more instances of matches for the
preceding regular expression, which must be an ordinary character, a
special character preceded by @code{\}, a @code{.}, a grouped regexp
(see below), or a bracket expression.
As a GNU extension, a
postfixed regular expression can also be followed by @code{*}; for
example, @code{a**} is equivalent to @code{a*}.  POSIX
1003.1-2001 says that @code{*} stands for itself when it appears at
the start of a regular expression or subexpression, but many
non-GNU implementations do not support this and portable
scripts should instead use @code{\*} in these contexts.
@item .
Matches any character, including newline.

@item ^
Matches the null string at beginning of the pattern space, i.e. what
appears after the circumflex must appear at the beginning of the
pattern space.

In most scripts, pattern space is initialized to the content of each
line (@pxref{Execution Cycle, , How @code{sed} works}).  So, it is a
useful simplification to think of @code{^#include} as matching only
lines where @samp{#include} is the first thing on the line---if there is
any preceding space, for example, the match fails.  This simplification is
valid as long as the original content of pattern space is not modified,
for example with an @code{s} command.

@code{^} acts as a special character only at the beginning of the
regular expression or subexpression (that is, after @code{\(} or
@code{\|}).  Portable scripts should avoid @code{^} at the beginning of
a subexpression, though, as POSIX allows implementations that
treat @code{^} as an ordinary character in that context.

@item $
It is the same as @code{^}, but refers to end of pattern space.
@code{$} also acts as a special character only at the end
of the regular expression or subexpression (that is, before @code{\)}
or @code{\|}), and its use at the end of a subexpression is not
portable.


@item [@var{list}]
@itemx [^@var{list}]
Matches any single character in @var{list}: for example,
@code{[aeiou]} matches all vowels.
A list may include
sequences like @code{@var{char1}-@var{char2}}, which
match any character between (inclusive) @var{char1}
and @var{char2}.
@xref{Character Classes and Bracket Expressions}.

@item \+
@cindex GNU extensions, to basic regular expressions
As @code{*}, but matches one or more.  It is a GNU extension.

@item \?
@cindex GNU extensions, to basic regular expressions
As @code{*}, but only matches zero or one.  It is a GNU extension.

@item \@{@var{i}\@}
As @code{*}, but matches exactly @var{i} sequences (@var{i} is a
decimal integer; for portability, keep it between 0 and 255
inclusive).

@item \@{@var{i},@var{j}\@}
Matches between @var{i} and @var{j}, inclusive, sequences.

@item \@{@var{i},\@}
Matches @var{i} or more sequences.

@item \(@var{regexp}\)
Groups the inner @var{regexp} as a whole; this is used to:

@itemize @bullet
@item
@cindex GNU extensions, to basic regular expressions
Apply postfix operators, like @code{\(abcd\)*}:
this will search for zero or more whole sequences
of @samp{abcd}, while @code{abcd*} would search
for @samp{abc} followed by zero or more occurrences
of @samp{d}.  Note that support for @code{\(abcd\)*} is
required by POSIX 1003.1-2001, but many non-GNU
implementations do not support it and hence it is not universally
portable.

@item
Use back references (see below).
@end itemize


@item @var{regexp1}\|@var{regexp2}
@cindex GNU extensions, to basic regular expressions
Matches either @var{regexp1} or @var{regexp2}.  Use
parentheses to use complex alternative regular expressions.
The matching process tries each alternative in turn, from
left to right, and the first one that succeeds is used.
It is a GNU extension.
@item @var{regexp1}@var{regexp2}
Matches the concatenation of @var{regexp1} and @var{regexp2}.
Concatenation binds more tightly than @code{\|}, @code{^}, and
@code{$}, but less tightly than the other regular expression
operators.

@item \@var{digit}
Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized
subexpression in the regular expression.  This is called a @dfn{back
reference}.  Subexpressions are implicitly numbered by counting
occurrences of @code{\(} left-to-right.

@item \n
Matches the newline character.

@item \@var{char}
Matches @var{char}, where @var{char} is one of @code{$},
@code{*}, @code{.}, @code{[}, @code{\}, or @code{^}.
Note that the only C-like
backslash sequences that you can portably assume to be
interpreted are @code{\n} and @code{\\}; in particular
@code{\t} is not portable, and matches a @samp{t} under most
implementations of @command{sed}, rather than a tab character.

@end table

@cindex Greedy regular expression matching
Note that the regular expression matcher is greedy, i.e., matches
are attempted from left to right and, if two or more matches are
possible starting at the same character, it selects the longest.

@noindent
Examples:
@table @samp
@item abcdef
Matches @samp{abcdef}.

@item a*b
Matches zero or more @samp{a}s followed by a single
@samp{b}.  For example, @samp{b} or @samp{aaaaab}.

@item a\?b
Matches @samp{b} or @samp{ab}.

@item a\+b\+
Matches one or more @samp{a}s followed by one or more
@samp{b}s: @samp{ab} is the shortest possible match, but
other examples are @samp{aaaab} or @samp{abbbbb} or
@samp{aaaaaabbbbbbb}.
@item .*
@itemx .\+
These two both match all the characters in a string;
however, the first matches every string (including the empty
string), while the second matches only strings containing
at least one character.

@item ^main.*(.*)
This matches a string starting with @samp{main},
followed by an opening and closing
parenthesis.  The @samp{n}, @samp{(} and @samp{)} need not
be adjacent.

@item ^#
This matches a string beginning with @samp{#}.

@item \\$
This matches a string ending with a single backslash.  The
regexp contains two backslashes for escaping.

@item \$
In contrast, this matches a literal dollar sign anywhere in a string,
because the @samp{$} is escaped.

@item [a-zA-Z0-9]
In the C locale, this matches any ASCII letters or digits.

@item [^ @kbd{@key{TAB}}]\+
(Here @kbd{@key{TAB}} stands for a single tab character.)
This matches a string of one or more
characters, none of which is a space or a tab.
Usually this means a word.

@item ^\(.*\)\n\1$
This matches a string consisting of two equal substrings separated by
a newline.

@item .\@{9\@}A$
This matches nine characters followed by an @samp{A} at the end of a line.

@item ^.\@{15\@}A
This matches the start of a string that contains 16 characters,
the last of which is an @samp{A}.

@end table


@node ERE syntax
@section Overview of extended regular expression syntax
@cindex Extended regular expressions, syntax

The only difference between basic and extended regular expressions is in
the behavior of a few characters: @samp{?}, @samp{+}, parentheses,
braces (@samp{@{@}}), and @samp{|}.
While basic regular expressions
require these to be escaped if you want them to behave as special
characters, when using extended regular expressions you must escape
them if you want them @emph{to match a literal character}.  @samp{|}
is special here because @samp{\|} is a GNU extension -- standard
basic regular expressions do not provide its functionality.

@noindent
Examples:
@table @code
@item abc?
becomes @samp{abc\?} when using extended regular expressions.  It matches
the literal string @samp{abc?}.

@item c\+
becomes @samp{c+} when using extended regular expressions.  It matches
one or more @samp{c}s.

@item a\@{3,\@}
becomes @samp{a@{3,@}} when using extended regular expressions.  It matches
three or more @samp{a}s.

@item \(abc\)\@{2,3\@}
becomes @samp{(abc)@{2,3@}} when using extended regular expressions.  It
matches either @samp{abcabc} or @samp{abcabcabc}.

@item \(abc*\)\1
becomes @samp{(abc*)\1} when using extended regular expressions.
Backreferences must still be escaped when using extended regular
expressions.

@item a\|b
becomes @samp{a|b} when using extended regular expressions.  It matches
@samp{a} or @samp{b}.
@end table

@node Character Classes and Bracket Expressions
@section Character Classes and Bracket Expressions

@c The 'character class' section is shamelessly copied from grep's manual.

@cindex bracket expression
@cindex character class
A @dfn{bracket expression} is a list of characters enclosed by @samp{[} and
@samp{]}.
It matches any single character in that list;
if the first character of the list is the caret @samp{^},
then it matches any character @strong{not} in the list.
For example, the following command replaces the strings
@samp{gray} or @samp{grey} with @samp{blue}:

@example
sed 's/gr[ae]y/blue/'
@end example

@c TODO: fix 'ref' to look good in both HTML and PDF
Bracket expressions can be used in both
@ref{BRE syntax,,basic} and @ref{ERE syntax,,extended}
regular expressions (that is, with or without the @option{-E}/@option{-r}
options).

@cindex range expression
Within a bracket expression, a @dfn{range expression} consists of two
characters separated by a hyphen.
It matches any single character that
sorts between the two characters, inclusive.
In the default C locale, the sorting sequence is the native character
order; for example, @samp{[a-d]} is equivalent to @samp{[abcd]}.


Finally, certain named classes of characters are predefined within
bracket expressions, as follows.

These named classes must be used @emph{inside} brackets
themselves.  Correct usage:
@example
$ echo 1 | sed 's/[[:digit:]]/X/'
X
@end example

Incorrect usage is rejected by newer @command{sed} versions.
Older versions accepted it but treated it as a single bracket expression
(which is equivalent to @samp{[dgit:]},
that is, only the characters @var{d/g/i/t/:}):
@example
# current GNU sed versions - incorrect usage rejected
$ echo 1 | sed 's/[:digit:]/X/'
sed: character class syntax is [[:space:]], not [:space:]

# older GNU sed versions
$ echo 1 | sed 's/[:digit:]/X/'
1
@end example


@cindex classes of characters
@cindex character classes
@cindex named character classes
@table @samp

@item [:alnum:]
@opindex alnum @r{character class}
@cindex alphanumeric characters
Alphanumeric characters:
@samp{[:alpha:]} and @samp{[:digit:]}; in the @samp{C} locale and ASCII
character encoding, this is the same as @samp{[0-9A-Za-z]}.

@item [:alpha:]
@opindex alpha @r{character class}
@cindex alphabetic characters
Alphabetic characters:
@samp{[:lower:]} and @samp{[:upper:]}; in the @samp{C} locale and ASCII
character encoding, this is the same as @samp{[A-Za-z]}.

@item [:blank:]
@opindex blank @r{character class}
@cindex blank characters
Blank characters:
space and tab.

@item [:cntrl:]
@opindex cntrl @r{character class}
@cindex control characters
Control characters.
In ASCII, these characters have octal codes 000
through 037, and 177 (DEL).
In other character sets, these are
the equivalent characters, if any.

@item [:digit:]
@opindex digit @r{character class}
@cindex digit characters
@cindex numeric characters
Digits: @code{0 1 2 3 4 5 6 7 8 9}.

@item [:graph:]
@opindex graph @r{character class}
@cindex graphic characters
Graphical characters:
@samp{[:alnum:]} and @samp{[:punct:]}.
@item [:lower:]
@opindex lower @r{character class}
@cindex lower-case letters
Lower-case letters; in the @samp{C} locale and ASCII character
encoding, this is
@code{a b c d e f g h i j k l m n o p q r s t u v w x y z}.

@item [:print:]
@opindex print @r{character class}
@cindex printable characters
Printable characters:
@samp{[:alnum:]}, @samp{[:punct:]}, and space.

@item [:punct:]
@opindex punct @r{character class}
@cindex punctuation characters
Punctuation characters; in the @samp{C} locale and ASCII character
encoding, this is
@code{!@: " # $ % & ' ( ) * + , - .@: / : ; < = > ?@: @@ [ \ ] ^ _ ` @{ | @} ~}.

@item [:space:]
@opindex space @r{character class}
@cindex space characters
@cindex whitespace characters
Space characters: in the @samp{C} locale, this is
tab, newline, vertical tab, form feed, carriage return, and space.


@item [:upper:]
@opindex upper @r{character class}
@cindex upper-case letters
Upper-case letters: in the @samp{C} locale and ASCII character
encoding, this is
@code{A B C D E F G H I J K L M N O P Q R S T U V W X Y Z}.

@item [:xdigit:]
@opindex xdigit @r{character class}
@cindex xdigit class
@cindex hexadecimal digits
Hexadecimal digits:
@code{0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f}.

@end table
Note that the brackets in these class names are
part of the symbolic names, and must be included in addition to
the brackets delimiting the bracket expression.

Most meta-characters lose their special meaning inside bracket expressions:

@table @samp
@item ]
ends the bracket expression if it's not the first list item.
So, if you want to make the @samp{]} character a list item,
you must put it first.
@item -
represents the range if it's not first or last in a list or the ending point
of a range.

@item ^
represents the characters not in the list.
If you want to make the @samp{^}
character a list item, place it anywhere but first.
@end table

TODO: incorporate this paragraph (copied verbatim from BRE section).

@cindex @code{POSIXLY_CORRECT} behavior, bracket expressions
The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\}
are normally not special within @var{list}.  For example, @code{[\*]}
matches either @samp{\} or @samp{*}, because the @code{\} is not
special here.  However, strings like @code{[.ch.]}, @code{[=a=]}, and
@code{[:space:]} are special within @var{list} and represent collating
symbols, equivalence classes, and character classes, respectively, and
@code{[} is therefore special within @var{list} when it is followed by
@code{.}, @code{=}, or @code{:}.  Also, when not in
@env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and
@code{\t} are recognized within @var{list}.  @xref{Escapes}.
@c ********


@c TODO: improve explanation about collation classes and equivalence classes
@c perhaps dedicate a section to Locales ??

@table @samp
@item [.
represents the open collating symbol.

@item .]
represents the close collating symbol.

@item [=
represents the open equivalence class.

@item =]
represents the close equivalence class.

@item [:
represents the open character class symbol, and should be followed by a
valid character class name.

@item :]
represents the close character class symbol.
@end table


@node regexp extensions
@section regular expression extensions

The following sequences have special meaning inside regular expressions
(used in @ref{Regexp Addresses,,addresses} and the @code{s} command).

These can be used in both
@ref{BRE syntax,,basic} and @ref{ERE syntax,,extended}
regular expressions (that is, with or without the @option{-E}/@option{-r}
options).

@table @code
@item \w
Matches any ``word'' character.  A ``word'' character is any
letter or digit or the underscore character.

@example
$ echo "abc %-= def." | sed 's/\w/X/g'
XXX %-= XXX.
@end example


@item \W
Matches any ``non-word'' character.

@example
$ echo "abc %-= def." | sed 's/\W/X/g'
abcXXXXXdefX
@end example


@item \b
Matches a word boundary; that is, it matches if the character
to the left is a ``word'' character and the character to the
right is a ``non-word'' character, or vice-versa.

@example
$ echo "abc %-= def." | sed 's/\b/X/g'
XabcX %-= XdefX.
@end example


@item \B
Matches everywhere but on a word boundary; that is, it matches
if the character to the left and the character to the right
are either both ``word'' characters or both ``non-word''
characters.

@example
$ echo "abc %-= def." | sed 's/\B/X/g'
aXbXc X%X-X=X dXeXf.X
@end example


@item \s
Matches whitespace characters (spaces and tabs).
Newlines embedded in the pattern/hold spaces will also match:

@example
$ echo "abc %-= def." | sed 's/\s/X/g'
abcX%-=Xdef.
@end example


@item \S
Matches non-whitespace characters.

@example
$ echo "abc %-= def." | sed 's/\S/X/g'
XXX XXX XXXX
@end example


@item \<
Matches the beginning of a word.
@example
$ echo "abc %-= def." | sed 's/\</X/g'
Xabc %-= Xdef.
@end example


@item \>
Matches the end of a word.

@example
$ echo "abc %-= def." | sed 's/\>/X/g'
abcX %-= defX.
@end example


@item \`
Matches only at the start of pattern space.  This is different
from @code{^} in multi-line mode.

Compare the following two examples:

@example
$ printf "a\nb\nc\n" | sed 'N;N;s/^/X/gm'
Xa
Xb
Xc

$ printf "a\nb\nc\n" | sed 'N;N;s/\`/X/gm'
Xa
b
c
@end example

@item \'
Matches only at the end of pattern space.  This is different
from @code{$} in multi-line mode.



@end table


@node Back-references and Subexpressions
@section Back-references and Subexpressions
@cindex subexpression
@cindex back-reference

@dfn{Back-references} are regular expression commands which refer to a
previous part of the matched regular expression.  Back-references are
specified with a backslash and a single digit (e.g. @samp{\1}).  The
part of the regular expression they refer to is called a
@dfn{subexpression}, and is designated with parentheses.

Back-references and subexpressions are used in two cases: in the
regular expression search pattern, and in the @var{replacement} part
of the @command{s} command (@pxref{Regexp Addresses,,Regular
Expression Addresses} and @ref{The "s" Command}).

In a regular expression pattern, back-references are used to match
the same content as a previously matched subexpression.  In the
following example, the subexpression is @samp{.} - any single
character (being surrounded by parentheses makes it a
subexpression).  The back-reference @samp{\1} asks to match the same
content (same character) as the sub-expression.
The command below matches words starting with any character,
followed by the letter @samp{o}, followed by the same character as the
first.

@example
$ sed -E -n '/^(.)o\1$/p' /usr/share/dict/words
bob
mom
non
pop
sos
tot
wow
@end example

Multiple subexpressions are automatically numbered from
left-to-right.  This command searches for 6-letter
palindromes (the first three letters are 3 subexpressions,
followed by 3 back-references in reverse order):

@example
$ sed -E -n '/^(.)(.)(.)\3\2\1$/p' /usr/share/dict/words
redder
@end example

In the @command{s} command, back-references can be
used in the @var{replacement} part to refer back to subexpressions in
the @var{regexp} part.

The following example uses two subexpressions in the regular
expression to match two space-separated words.  The back-references in
the @var{replacement} part print the words in a different order:

@example
$ echo "James Bond" | sed -E 's/(.*) (.*)/The name is \2, \1 \2./'
The name is Bond, James Bond.
@end example


When used with alternation, if the group does not participate in the
match then the back-reference makes the whole match fail.  For
example, @samp{a(.)|b\1} will not match @samp{ba}.  When multiple
regular expressions are given with @option{-e} or from a file
(@samp{-f @var{file}}), back-references are local to each expression.
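The two uses can also be tried without a dictionary file.  The following is
a minimal sketch, assuming GNU @command{sed} and a POSIX shell; the inline
word list, the sample sentence and the variable names @code{pal} and
@code{dedup} are illustrative only.  The second command relies on the
@code{\b} and @code{\w} extensions (@pxref{regexp extensions}).

```shell
# Back-reference in the pattern: three-letter words of the form XoX.
pal=$(printf '%s\n' bob mom cat redder | sed -E -n '/^(.)o\1$/p')

# Back-reference in both pattern and replacement: collapse a word that is
# immediately repeated into a single copy.
dedup=$(echo 'the the cat' | sed -E 's/\b(\w+) \1\b/\1/')

printf '%s\n' "$pal" "$dedup"
```

In the second command, @samp{\1} in the pattern requires the repeated word
to be identical to the first, and @samp{\1} in the replacement writes it
back once.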
@node Escapes
@section Escape Sequences - specifying special characters

@cindex GNU extensions, special escapes
Until this chapter, we have only encountered escapes of the form
@samp{\^}, which tell @command{sed} not to interpret the circumflex
…
@cindex @code{POSIXLY_CORRECT} behavior, escapes
This chapter introduces another kind of escape@footnote{All
the escapes introduced here are GNU
extensions, with the exception of @code{\n}.  In basic regular
expression mode, setting @code{POSIXLY_CORRECT} disables them inside
…

@item \o@var{xxx}
Produces or matches a character whose octal @sc{ascii} value is @var{xxx}.

@item \x@var{xx}
…
the existing ``word boundary'' meaning.

@subsection Escaping Precedence

@value{SSED} processes escape sequences @emph{before} passing
the text onto the regular-expression matching of the @command{s///} command
and Address matching.
Thus the following two commands are equivalent
(@samp{0x5e} is the hexadecimal @sc{ascii} value of the character @samp{^}):

@codequoteundirected on
@codequotebacktick on
@example
@group
$ echo 'a^c' | sed 's/^/b/'
ba^c

$ echo 'a^c' | sed 's/\x5e/b/'
ba^c
@end group
@end example
@codequoteundirected off
@codequotebacktick off

As are the following (@samp{0x5b},@samp{0x5d} are the hexadecimal
@sc{ascii} values of @samp{[},@samp{]}, respectively):

@codequoteundirected on
@codequotebacktick on
@example
@group
$ echo abc | sed 's/[a]/x/'
xbc
$ echo abc | sed 's/\x5ba\x5d/x/'
xbc
@end group
@end example
@codequoteundirected off
@codequotebacktick off

However, it is recommended to avoid such special characters
due to unexpected edge-cases.  For example, the following
are not equivalent:

@codequoteundirected on
@codequotebacktick on
@example
@group
$ echo 'a^c' | sed 's/\^/b/'
abc

$ echo 'a^c' | sed 's/\\\x5e/b/'
a^c
@end group
@end example
@codequoteundirected off
@codequotebacktick off

@c also: this fails in different places:
@c $ sed 's/[//'
@c sed: -e expression #1, char 5: unterminated `s' command
@c $ sed 's/\x5b//'
@c sed: -e expression #1, char 8: Invalid regular expression
@c
@c which is OK but confusing to explain why (the first
@c fails in compile.c:snarf_char_class while the second
@c is passed to the regex engine and then fails).


@node Locale Considerations
@section Multibyte characters and Locale Considerations

@value{SSED} processes valid multibyte characters in multibyte locales
(e.g. @code{UTF-8}). @footnote{Some regexp edge-cases depend on the
operating system and libc implementation.
The examples shown are known 3415 to work as-expected on GNU/Linux systems using glibc.} 3416 3417 @noindent The following example uses the Greek letter Capital Sigma 3418 (@value{ucsigma}, 3419 Unicode code point @code{0x03A3}). In a @code{UTF-8} locale, 3420 @command{sed} correctly processes the Sigma as one character despite 3421 it being 2 octets (bytes): 3422 3423 @codequoteundirected on 3424 @codequotebacktick on 3425 @example 3426 @group 3427 $ locale | grep LANG 3428 LANG=en_US.UTF-8 3429 3430 $ printf 'a\u03A3b' 3431 a@value{ucsigma}b 3432 3433 $ printf 'a\u03A3b' | sed 's/./X/g' 3434 XXX 3435 3436 $ printf 'a\u03A3b' | od -tx1 -An 3437 61 ce a3 62 3438 @end group 3439 @end example 3440 @codequoteundirected off 3441 @codequotebacktick off 3442 3443 @noindent 3444 To force @command{sed} to process octets separately, use the @code{C} locale 3445 (also known as the @code{POSIX} locale): 3446 3447 @codequoteundirected on 3448 @codequotebacktick on 3449 @example 3450 $ printf 'a\u03A3b' | LC_ALL=C sed 's/./X/g' 3451 XXXX 3452 @end example 3453 @codequoteundirected off 3454 @codequotebacktick off 3455 3456 @subsection Invalid multibyte characters 3457 3458 @command{sed}'s regular expressions @emph{do not} match 3459 invalid multibyte sequences in a multibyte locale. 3460 3461 @noindent 3462 In the following examples, the ascii value @code{0xCE} is 3463 an incomplete multibyte character (shown here as @value{unicodeFFFD}). 
3464 The regular expression @samp{.} does not match it: 3465 3466 @codequoteundirected on 3467 @codequotebacktick on 3468 @example 3469 @group 3470 $ printf 'a\xCEb\n' 3471 a@value{unicodeFFFD}b 3472 3473 $ printf 'a\xCEb\n' | sed 's/./X/g' 3474 X@value{unicodeFFFD}X 3475 3476 $ printf 'a\xCEc\n' | sed 's/./X/g' | od -tx1c -An 3477 58 ce 58 0a 3478 X X \n 3479 @end group 3480 @end example 3481 @codequoteundirected off 3482 @codequotebacktick off 3483 3484 @noindent Similarly, the 'catch-all' regular expression @samp{.*} does not 3485 match the entire line: 3486 3487 @codequoteundirected on 3488 @codequotebacktick on 3489 @example 3490 @group 3491 $ printf 'a\xCEc\n' | sed 's/.*//' | od -tx1c -An 3492 ce 63 0a 3493 c \n 3494 @end group 3495 @end example 3496 @codequoteundirected off 3497 @codequotebacktick off 3498 3499 @noindent 3500 @value{SSED} offers the special @command{z} command to clear the 3501 current pattern space regardless of invalid multibyte characters 3502 (i.e. it works like @code{s/.*//} but also removes invalid multibyte 3503 characters): 3504 3505 @codequoteundirected on 3506 @codequotebacktick on 3507 @example 3508 @group 3509 $ printf 'a\xCEc\n' | sed 'z' | od -tx1c -An 3510 0a 3511 \n 3512 @end group 3513 @end example 3514 @codequoteundirected off 3515 @codequotebacktick off 3516 3517 @noindent Alternatively, force the @code{C} locale to process 3518 each octet separately (every octet is a valid character in the @code{C} 3519 locale): 3520 3521 @codequoteundirected on 3522 @codequotebacktick on 3523 @example 3524 @group 3525 $ printf 'a\xCEc\n' | LC_ALL=C sed 's/.*//' | od -tx1c -An 3526 0a 3527 \n 3528 @end group 3529 @end example 3530 @codequoteundirected off 3531 @codequotebacktick off 3532 3533 3534 @command{sed}'s inability to process invalid multibyte characters 3535 can be used to detect such invalid sequences in a file.
3536 In the following examples, the @code{\xCE\xCE} is an invalid 3537 multibyte sequence, while @code{\xCE\xA3} is a valid multibyte sequence 3538 (of the Greek Sigma character). 3539 3540 @noindent 3541 The following @command{sed} program removes all valid 3542 characters using @code{s/.//g}. Any content left in the pattern space 3543 (the invalid characters) is added to the hold space using the 3544 @code{H} command. On the last line (@code{$}), the hold space is retrieved 3545 (@code{x}), newlines are removed (@code{s/\n//g}), and any remaining 3546 octets are printed unambiguously (@code{l}). Thus, any invalid 3547 multibyte sequences are printed as octal values: 3548 3549 @codequoteundirected on 3550 @codequotebacktick on 3551 @example 3552 @group 3553 $ printf 'ab\nc\n\xCE\xCEde\n\xCE\xA3f\n' > invalid.txt 3554 3555 $ cat invalid.txt 3556 ab 3557 c 3558 @value{unicodeFFFD}@value{unicodeFFFD}de 3559 @value{ucsigma}f 3560 3561 $ sed -n 's/.//g ; H ; $@{x;s/\n//g;l@}' invalid.txt 3562 \316\316$ 3563 @end group 3564 @end example 3565 @codequoteundirected off 3566 @codequotebacktick off 3567 3568 @noindent With a few more commands, @command{sed} can print 3569 the exact line number corresponding to each invalid character (line 3). 3570 These characters can then be removed by forcing the @code{C} locale 3571 and using octal escape sequences: 3572 3573 @codequoteundirected on 3574 @codequotebacktick on 3575 @example 3576 $ sed -n 's/.//g;=;l' invalid.txt | paste - - | awk '$2!="$"' 3577 3 \316\316$ 3578 3579 $ LC_ALL=C sed '3s/\o316\o316//' invalid.txt > fixed.txt 3580 @end example 3581 @codequoteundirected off 3582 @codequotebacktick off 3583 3584 @subsection Upper/Lower case conversion 3585 3586 3587 @value{SSED}'s substitute command (@code{s}) supports upper/lower 3588 case conversions using @code{\U},@code{\L} codes.
3589 These conversions support multibyte characters: 3590 3591 @codequoteundirected on 3592 @codequotebacktick on 3593 @example 3594 $ printf 'ABC\u03a3\n' 3595 ABC@value{ucsigma} 3596 3597 $ printf 'ABC\u03a3\n' | sed 's/.*/\L&/' 3598 abc@value{lcsigma} 3599 @end example 3600 @codequoteundirected off 3601 @codequotebacktick off 3602 3603 @noindent 3604 @xref{The "s" Command}. 3605 3606 3607 @subsection Multibyte regexp character classes 3608 3609 @c TODO: fix following paragraphs (copied verbatim from 'bracket 3610 @c expression' section). 3611 3612 In other locales, the sorting sequence is not specified, and 3613 @samp{[a-d]} might be equivalent to @samp{[abcd]} or to 3614 @samp{[aBbCcDd]}, or it might fail to match any character, or the set of 3615 characters that it matches might even be erratic. 3616 To obtain the traditional interpretation 3617 of bracket expressions, you can use the @samp{C} locale by setting the 3618 @env{LC_ALL} environment variable to the value @samp{C}. 3619 3620 @example 3621 # TODO: is there any real-world system/locale where 'A' 3622 # is replaced by '-' ? 3623 $ echo A | sed 's/[a-z]/-/' 3624 A 3625 @end example 3626 3627 Their interpretation depends on the @env{LC_CTYPE} locale; 3628 for example, @samp{[[:alnum:]]} means the character class of numbers and letters 3629 in the current locale. 3630 3631 TODO: show example of collation 3632 3633 @codequoteundirected on 3634 @codequotebacktick on 3635 @example 3636 # TODO: this works on glibc systems, not on musl-libc/freebsd/macosx. 
3637 $ printf 'cliché\n' | LC_ALL=fr_FR.utf8 sed 's/[[=e=]]/X/g' 3638 clichX 3639 @end example 3640 @codequoteundirected off 3641 @codequotebacktick off 3642 3643 3644 @node advanced sed 3645 @chapter Advanced @command{sed}: cycles and buffers 3646 3647 @menu 3648 * Execution Cycle:: How @command{sed} works 3649 * Hold and Pattern Buffers:: 3650 * Multiline techniques:: Using D,G,H,N,P to process multiple lines 3651 * Branching and flow control:: 3652 @end menu 3653 3654 @node Execution Cycle 3655 @section How @command{sed} Works 3656 3657 @cindex Buffer spaces, pattern and hold 3658 @cindex Spaces, pattern and hold 3659 @cindex Pattern space, definition 3660 @cindex Hold space, definition 3661 @command{sed} maintains two data buffers: the active @emph{pattern} space, 3662 and the auxiliary @emph{hold} space. Both are initially empty. 3663 3664 @command{sed} operates by performing the following cycle on each 3665 line of input: first, @command{sed} reads one line from the input 3666 stream, removes any trailing newline, and places it in the pattern space. 3667 Then commands are executed; each command can have an address associated 3668 to it: addresses are a kind of condition code, and a command is only 3669 executed if the condition is verified before the command is to be 3670 executed. 3671 3672 When the end of the script is reached, unless the @option{-n} option 3673 is in use, the contents of pattern space are printed out to the output 3674 stream, adding back the trailing newline if it was removed.@footnote{Actually, 3675 if @command{sed} prints a line without the terminating newline, it will 3676 nevertheless print the missing newline as soon as more text is sent to 3677 the same output stream, which gives the ``least expected surprise'' 3678 even though it does not make commands like @samp{sed -n p} exactly 3679 identical to @command{cat}.} Then the next cycle starts for the next 3680 input line. 
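The end-of-cycle printing described above can be observed directly. The following is a minimal sketch, not taken from the manual itself; it assumes a POSIX shell and that the @command{seq} utility is available (any three-line input would do). It contrasts the default auto-print with @option{-n} and with an explicit @code{p}:

```shell
# Default: the pattern space is auto-printed at the end of every cycle,
# so each (modified) input line appears once.
seq 3 | sed 's/^/line /'
# line 1
# line 2
# line 3

# With -n the end-of-cycle print is suppressed; only the explicit 'p'
# (restricted here to line 2 by its address) produces output.
seq 3 | sed -n '2p'
# 2

# Without -n, an explicit 'p' adds to the auto-print, so every line
# is shown twice.
seq 2 | sed p
# 1
# 1
# 2
# 2
```

The address @samp{2} in the second command also illustrates the "condition" role of addresses: @code{p} runs only in the cycle whose input line matches the address.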
3681 3682 Unless special commands (like @samp{D}) are used, the pattern space is 3683 deleted between two cycles. The hold space, on the other hand, keeps 3684 its data between cycles (see commands @samp{h}, @samp{H}, @samp{x}, 3685 @samp{g}, @samp{G} to move data between both buffers). 3686 3687 @node Hold and Pattern Buffers 3688 @section Hold and Pattern Buffers 3689 3690 TODO 3691 3692 @node Multiline techniques 3693 @section Multiline techniques - using D,G,H,N,P to process multiple lines 3694 3695 Multiple lines can be processed as one buffer using the 3696 @code{D},@code{G},@code{H},@code{N},@code{P}. They are similar to 3697 their lowercase counterparts (@code{d},@code{g}, 3698 @code{h},@code{n},@code{p}), except that these commands append or 3699 subtract data while respecting embedded newlines - allowing adding and 3700 removing lines from the pattern and hold spaces. 3701 3702 They operate as follows: 1544 3703 @table @code 1545 @item \w 1546 Matches any ``word'' character. A ``word'' character is any 1547 letter or digit or the underscore character. 1548 1549 @item \W 1550 Matches any ``non-word'' character. 1551 1552 @item \b 1553 Matches a word boundary; that is it matches if the character 1554 to the left is a ``word'' character and the character to the 1555 right is a ``non-word'' character, or vice-versa. 1556 1557 @item \B 1558 Matches everywhere but on a word boundary; that is it matches 1559 if the character to the left and the character to the right 1560 are either both ``word'' characters or both ``non-word'' 1561 characters. 1562 1563 @item \` 1564 Matches only at the start of pattern space. This is different 1565 from @code{^} in multi-line mode. 1566 1567 @item \' 1568 Matches only at the end of pattern space. This is different 1569 from @code{$} in multi-line mode. 
1570 1571 @ifset PERL 1572 @item \G 1573 Match only at the start of pattern space or, when doing a global 1574 substitution using the @code{s///g} command and option, at 1575 the end-of-match position of the prior match. For example, 1576 @samp{s/\Ga/Z/g} will change an initial run of @code{a}s to 1577 a run of @code{Z}s 1578 @end ifset 3704 @item D 3705 @emph{deletes} line from the pattern space until the first newline, 3706 and restarts the cycle. 3707 3708 @item G 3709 @emph{appends} line from the hold space to the pattern space, with a 3710 newline before it. 3711 3712 @item H 3713 @emph{appends} line from the pattern space to the hold space, with a 3714 newline before it. 3715 3716 @item N 3717 @emph{appends} line from the input file to the pattern space. 3718 3719 @item P 3720 @emph{prints} line from the pattern space until the first newline. 3721 1579 3722 @end table 3723 3724 3725 The following example illustrates the operation of @code{N} and 3726 @code{D} commands: 3727 3728 @codequoteundirected on 3729 @codequotebacktick on 3730 @example 3731 @group 3732 $ seq 6 | sed -n 'N;l;D' 3733 1\n2$ 3734 2\n3$ 3735 3\n4$ 3736 4\n5$ 3737 5\n6$ 3738 @end group 3739 @end example 3740 @codequoteundirected off 3741 @codequotebacktick off 3742 3743 @enumerate 3744 @item 3745 @command{sed} starts by reading the first line into the pattern space 3746 (i.e. @samp{1}). 3747 @item 3748 At the beginning of every cycle, the @code{N} 3749 command appends a newline and the next line to the pattern space 3750 (i.e. @samp{1}, @samp{\n}, @samp{2} in the first cycle). 3751 @item 3752 The @code{l} command prints the content of the pattern space 3753 unambiguously. 3754 @item 3755 The @code{D} command then removes the content of pattern 3756 space up to the first newline (leaving @samp{2} at the end of 3757 the first cycle). 3758 @item 3759 At the next cycle the @code{N} command appends a 3760 newline and the next input line to the pattern space 3761 (e.g. 
@samp{2}, @samp{\n}, @samp{3}). 3762 @end enumerate 3763 3764 3765 @cindex processing paragraphs 3766 @cindex paragraphs, processing 3767 A common technique to process blocks of text such as paragraphs 3768 (instead of line-by-line) is using the following construct: 3769 3770 @codequoteundirected on 3771 @codequotebacktick on 3772 @example 3773 sed '/./@{H;$!d@} ; x ; s/REGEXP/REPLACEMENT/' 3774 @end example 3775 @codequoteundirected off 3776 @codequotebacktick off 3777 3778 @enumerate 3779 @item 3780 The first expression, @code{/./@{H;$!d@}} operates on all non-empty lines, 3781 and adds the current line (in the pattern space) to the hold space. 3782 On all lines except the last, the pattern space is deleted and the cycle is 3783 restarted. 3784 3785 @item 3786 The other expressions @code{x} and @code{s} are executed only on empty 3787 lines (i.e. paragraph separators). The @code{x} command fetches the 3788 accumulated lines from the hold space back to the pattern space. The 3789 @code{s///} command then operates on all the text in the paragraph 3790 (including the embedded newlines). 
3791 @end enumerate 3792 3793 The following example demonstrates this technique: 3794 @codequoteundirected on 3795 @codequotebacktick on 3796 @example 3797 @group 3798 $ cat input.txt 3799 a a a aa aaa 3800 aaaa aaaa aa 3801 aaaa aaa aaa 3802 3803 bbbb bbb bbb 3804 bb bb bbb bb 3805 bbbbbbbb bbb 3806 3807 ccc ccc cccc 3808 cccc ccccc c 3809 cc cc cc cc 3810 3811 $ sed '/./@{H;$!d@} ; x ; s/^/\nSTART-->/ ; s/$/\n<--END/' input.txt 3812 3813 START--> 3814 a a a aa aaa 3815 aaaa aaaa aa 3816 aaaa aaa aaa 3817 <--END 3818 3819 START--> 3820 bbbb bbb bbb 3821 bb bb bbb bb 3822 bbbbbbbb bbb 3823 <--END 3824 3825 START--> 3826 ccc ccc cccc 3827 cccc ccccc c 3828 cc cc cc cc 3829 <--END 3830 @end group 3831 @end example 3832 @codequoteundirected off 3833 @codequotebacktick off 3834 3835 For more annotated examples, @pxref{Text search across multiple lines} 3836 and @ref{Line length adjustment}. 3837 3838 @node Branching and flow control 3839 @section Branching and Flow Control 3840 3841 The branching commands @code{b}, @code{t}, and @code{T} enable 3842 changing the flow of @command{sed} programs. 3843 3844 By default, @command{sed} reads an input line into the pattern buffer, 3845 then continues to process all commands in order. 3846 Commands without addresses affect all lines. 3847 Commands with addresses affect only matching lines. 3848 @xref{Execution Cycle} and @ref{Addresses overview}. 3849 3850 @command{sed} does not support a typical @code{if/then} construct. 3851 Instead, some commands can be used as conditionals or to change the 3852 default flow control: 3853 3854 @table @code 3855 3856 @item d 3857 delete (clear) the current pattern space, 3858 and restart the program cycle without processing the rest of the commands 3859 and without printing the pattern space.
3860 3861 @item D 3862 delete the contents of the pattern space @emph{up to the first newline}, 3863 and restart the program cycle without processing the rest of 3864 the commands and without printing the pattern space. 3865 3866 @item [addr]X 3867 @itemx [addr]@{ X ; X ; X @} 3868 @item /regexp/X 3869 @item /regexp/@{ X ; X ; X @} 3870 Addresses and regular expressions can be used as an @code{if/then} 3871 conditional: If @var{[addr]} matches the current pattern space, 3872 execute the command(s). 3873 For example: The command @code{/^#/d} means: 3874 @emph{if} the current pattern matches the regular expression @code{^#} (a line 3875 starting with a hash), @emph{then} execute the @code{d} command: 3876 delete the line without printing it, and restart the program cycle 3877 immediately. 3878 3879 @item b 3880 branch unconditionally (that is: always jump to a label, skipping 3881 or repeating other commands, without restarting a new cycle). Combined 3882 with an address, the branch can be conditionally executed on matched 3883 lines. 3884 3885 @item t 3886 branch conditionally (that is: jump to a label) @emph{only if} a 3887 @code{s///} command has succeeded since the last input line was read 3888 or another conditional branch was taken. 3889 3890 @item T 3891 similar but opposite to the @code{t} command: branch only if 3892 there have been @emph{no} successful substitutions since the last 3893 input line was read. 3894 @end table 3895 3896 3897 The following two @command{sed} programs are equivalent. The first 3898 (contrived) example uses the @code{b} command to skip the @code{s///} 3899 command on lines containing @samp{1}. The second example uses an 3900 address with negation (@samp{!}) to perform substitution only on 3901 desired lines.
The @code{y///} command is still executed on all 3902 lines: 3903 3904 @codequoteundirected on 3905 @codequotebacktick on 3906 @example 3907 @group 3908 $ printf '%s\n' a1 a2 a3 | sed -E '/1/bx ; s/a/z/ ; :x ; y/123/456/' 3909 a4 3910 z5 3911 z6 3912 3913 $ printf '%s\n' a1 a2 a3 | sed -E '/1/!s/a/z/ ; y/123/456/' 3914 a4 3915 z5 3916 z6 3917 @end group 3918 @end example 3919 @codequoteundirected off 3920 @codequotebacktick off 3921 3922 3923 3924 @subsection Branching and Cycles 3925 @cindex labels 3926 @cindex omitting labels 3927 @cindex cycle, restarting 3928 @cindex restarting a cycle 3929 The @code{b},@code{t} and @code{T} commands can be followed by a label 3930 (typically a single letter). Labels are defined with a colon followed by 3931 one or more letters (e.g. @samp{:x}). If the label is omitted the 3932 branch commands restart the cycle. Note the difference between 3933 branching to a label and restarting the cycle: when a cycle is 3934 restarted, @command{sed} first prints the current content of the 3935 pattern space, then reads the next input line into the pattern space; 3936 Jumping to a label (even if it is at the beginning of the program) 3937 does not print the pattern space and does not read the next input line. 3938 3939 The following program is a no-op. The @code{b} command (the only command 3940 in the program) does not have a label, and thus simply restarts the cycle. 3941 On each cycle, the pattern space is printed and the next input line is read: 3942 3943 @example 3944 @group 3945 $ seq 3 | sed b 3946 1 3947 2 3948 3 3949 @end group 3950 @end example 3951 3952 @cindex infinite loop, branching 3953 @cindex branching, infinite loop 3954 The following example is an infinite-loop - it doesn't terminate and 3955 doesn't print anything. 
The @code{b} command jumps to the @samp{x} 3956 label, and a new cycle is never started: 3957 3958 @codequoteundirected on 3959 @codequotebacktick on 3960 @example 3961 @group 3962 $ seq 3 | sed ':x ; bx' 3963 3964 # The above command requires gnu sed (which supports additional 3965 # commands following a label, without a newline). A portable equivalent: 3966 # sed -e ':x' -e bx 3967 @end group 3968 @end example 3969 @codequoteundirected off 3970 @codequotebacktick off 3971 3972 @cindex branching and n, N 3973 @cindex n, and branching 3974 @cindex N, and branching 3975 Branching is often complemented with the @code{n} or @code{N} commands: 3976 both commands read the next input line into the pattern space without waiting 3977 for the cycle to restart. Before reading the next input line, @code{n} 3978 prints the current pattern space then empties it, while @code{N} 3979 appends a newline and the next input line to the pattern space. 3980 3981 Consider the following two examples: 3982 3983 @codequoteundirected on 3984 @codequotebacktick on 3985 @example 3986 @group 3987 $ seq 3 | sed ':x ; n ; bx' 3988 1 3989 2 3990 3 3991 3992 $ seq 3 | sed ':x ; N ; bx' 3993 1 3994 2 3995 3 3996 @end group 3997 @end example 3998 @codequoteundirected off 3999 @codequotebacktick off 4000 4001 @itemize 4002 @item 4003 Neither example loops forever, despite never starting a new cycle. 4004 4005 @item 4006 In the first example, the @code{n} command first prints the content 4007 of the pattern space, empties the pattern space then reads the next 4008 input line. 4009 4010 @item 4011 In the second example, the @code{N} command appends the next input 4012 line to the pattern space (with a newline). Lines are accumulated in 4013 the pattern space until there are no more input lines to read, then 4014 the @code{N} command terminates the @command{sed} program. When the 4015 program terminates, the end-of-cycle actions are performed, and the 4016 entire pattern space is printed.
4017 4018 @item 4019 The second example requires @value{SSED}, 4020 because it uses the non-POSIX-standard behavior of @code{N}. 4021 See the ``@code{N} command on the last line'' paragraph 4022 in @ref{Reporting Bugs}. 4023 4024 @item 4025 To further examine the difference between the two examples, 4026 try the following commands: 4027 @codequoteundirected on 4028 @codequotebacktick on 4029 @example 4030 @group 4031 printf '%s\n' aa bb cc dd | sed ':x ; n ; = ; bx' 4032 printf '%s\n' aa bb cc dd | sed ':x ; N ; = ; bx' 4033 printf '%s\n' aa bb cc dd | sed ':x ; n ; s/\n/***/ ; bx' 4034 printf '%s\n' aa bb cc dd | sed ':x ; N ; s/\n/***/ ; bx' 4035 @end group 4036 @end example 4037 @codequoteundirected off 4038 @codequotebacktick off 4039 4040 @end itemize 4041 4042 4043 4044 @subsection Branching example: joining lines 4045 4046 @cindex joining lines with branching 4047 @cindex branching, joining lines 4048 @cindex quoted-printable lines, joining 4049 @cindex joining quoted-printable lines 4050 @cindex t, joining lines with 4051 @cindex b, joining lines with 4052 @cindex b, versus t 4053 @cindex t, versus b 4054 As a real-world example of using branching, consider the case of 4055 @uref{https://en.wikipedia.org/wiki/Quoted-printable,quoted-printable} files, 4056 typically used to encode email messages. 4057 In these files long lines are split and marked with a @dfn{soft line break} 4058 consisting of a single @samp{=} character at the end of the line: 4059 4060 @example 4061 @group 4062 $ cat jaques.txt 4063 All the wor= 4064 ld's a stag= 4065 e, 4066 And all the= 4067 men and wo= 4068 men merely = 4069 players: 4070 They have t= 4071 heir exits = 4072 and their e= 4073 ntrances; 4074 And one man= 4075 in his tim= 4076 e plays man= 4077 y parts. 
4078 @end group 4079 @end example 4080 4081 4082 The following program uses an address match @samp{/=$/} as a 4083 conditional: If the current pattern space ends with a @samp{=}, it 4084 reads the next input line using @code{N}, removes every @samp{=} 4085 character which is followed by a newline (together with that newline), and unconditionally 4086 branches (@code{b}) to the beginning of the program without restarting 4087 a new cycle. If the pattern space does not end with @samp{=}, the 4088 default action is performed: the pattern space is printed and a new 4089 cycle is started: 4090 4091 @codequoteundirected on 4092 @codequotebacktick on 4093 @example 4094 @group 4095 $ sed ':x ; /=$/ @{ N ; s/=\n//g ; bx @}' jaques.txt 4096 All the world's a stage, 4097 And all the men and women merely players: 4098 They have their exits and their entrances; 4099 And one man in his time plays many parts. 4100 @end group 4101 @end example 4102 @codequoteundirected off 4103 @codequotebacktick off 4104 4105 Here's an alternative program with a slightly different approach: On 4106 all lines except the last, @code{N} appends the line to the pattern 4107 space. A substitution command then removes soft line breaks 4108 (@samp{=} at the end of a line, i.e. followed by a newline) by replacing 4109 them with an empty string. 4110 @emph{If} the substitution was successful (meaning the pattern space contained 4111 a line which should be joined), the conditional branch command @code{t} jumps 4112 to the beginning of the program without completing or restarting the cycle. 4113 If the substitution failed (meaning there were no soft line breaks), 4114 the @code{t} command will @emph{not} branch. Then, @code{P} will 4115 print the pattern space content until the first newline, and @code{D} 4116 will delete the pattern space content until the first newline. 4117 (To learn more about @code{N}, @code{P} and @code{D} commands 4118 @pxref{Multiline techniques}).
4119 4120 4121 @codequoteundirected on 4122 @codequotebacktick on 4123 @example 4124 @group 4125 $ sed ':x ; $!N ; s/=\n// ; tx ; P ; D' jaques.txt 4126 All the world's a stage, 4127 And all the men and women merely players: 4128 They have their exits and their entrances; 4129 And one man in his time plays many parts. 4130 @end group 4131 @end example 4132 @codequoteundirected off 4133 @codequotebacktick off 4134 4135 4136 For more line-joining examples @pxref{Joining lines}. 4137 1580 4138 1581 4139 @node Examples … … 1586 4144 1587 4145 @menu 4146 4147 Useful one-liners: 4148 * Joining lines:: 4149 1588 4150 Some exotic examples: 1589 4151 * Centering lines:: … … 1592 4154 * Print bash environment:: 1593 4155 * Reverse chars of lines:: 4156 * Text search across multiple lines:: 4157 * Line length adjustment:: 4158 * Adding a header to multiple files:: 1594 4159 1595 4160 Emulating standard utilities: … … 1608 4173 @end menu 1609 4174 4175 @node Joining lines 4176 @section Joining lines 4177 4178 This section uses @code{N}, @code{D} and @code{P} commands to process 4179 multiple lines, and the @code{b} and @code{t} commands for branching. 4180 @xref{Multiline techniques} and @ref{Branching and flow control}. 4181 4182 Join specific lines (e.g. if lines 2 and 3 need to be joined): 4183 4184 @codequoteundirected on 4185 @codequotebacktick on 4186 @example 4187 $ cat lines.txt 4188 hello 4189 hel 4190 lo 4191 hello 4192 4193 $ sed '2@{N;s/\n//;@}' lines.txt 4194 hello 4195 hello 4196 hello 4197 @end example 4198 @codequoteundirected off 4199 @codequotebacktick off 4200 4201 Join backslash-continued lines: 4202 4203 @codequoteundirected on 4204 @codequotebacktick on 4205 @example 4206 $ cat 1.txt 4207 this \ 4208 is \ 4209 a \ 4210 long \ 4211 line 4212 and another \ 4213 line 4214 4215 $ sed -e ':x /\\$/ @{ N; s/\\\n//g ; bx @}' 1.txt 4216 this is a long line 4217 and another line 4218 4219 4220 #TODO: The above requires gnu sed. 
4221 # non-gnu seds need newlines after ':' and 'b' 4222 @end example 4223 @codequoteundirected off 4224 @codequotebacktick off 4225 4226 Join lines that start with whitespace (e.g SMTP headers): 4227 4228 @codequoteundirected on 4229 @codequotebacktick on 4230 @example 4231 @group 4232 $ cat 2.txt 4233 Subject: Hello 4234 World 4235 Content-Type: multipart/alternative; 4236 boundary=94eb2c190cc6370f06054535da6a 4237 Date: Tue, 3 Jan 2017 19:41:16 +0000 (GMT) 4238 Authentication-Results: mx.gnu.org; 4239 dkim=pass header.i=@@gnu.org; 4240 spf=pass 4241 Message-ID: <abcdef@@gnu.org> 4242 From: John Doe <jdoe@@gnu.org> 4243 To: Jane Smith <jsmith@@gnu.org> 4244 4245 $ sed -E ':a ; $!N ; s/\n\s+/ / ; ta ; P ; D' 2.txt 4246 Subject: Hello World 4247 Content-Type: multipart/alternative; boundary=94eb2c190cc6370f06054535da6a 4248 Date: Tue, 3 Jan 2017 19:41:16 +0000 (GMT) 4249 Authentication-Results: mx.gnu.org; dkim=pass header.i=@@gnu.org; spf=pass 4250 Message-ID: <abcdef@@gnu.org> 4251 From: John Doe <jdoe@@gnu.org> 4252 To: Jane Smith <jsmith@@gnu.org> 4253 4254 # A portable (non-gnu) variation: 4255 # sed -e :a -e '$!N;s/\n */ /;ta' -e 'P;D' 4256 @end group 4257 @end example 4258 @codequoteundirected off 4259 @codequotebacktick off 4260 4261 1610 4262 @node Centering lines 1611 4263 @section Centering Lines … … 1634 4286 1635 4287 @group 1636 # del leading and trailing spaces1637 y/@kbd{ tab}/ /4288 # delete leading and trailing spaces 4289 y/@kbd{@key{TAB}}/ / 1638 4290 s/^ *// 1639 4291 s/ *$// … … 1684 4336 1685 4337 @group 1686 # replace all leading 9s by _ (any other character except digits, could4338 # replace all trailing 9s by _ (any other character except digits, could 1687 4339 # be used) 1688 4340 :d … … 1694 4346 # incr last digit only. The first line adds a most-significant 1695 4347 # digit of 1 if we have to add a digit. 
1696 #1697 # The @code{tn} commands are not necessary, but make the thing1698 # faster1699 4348 @end group 1700 4349 … … 1727 4376 seen a script converting the output of @command{date} into a @command{bc} 1728 4377 program! 1729 4378 1730 4379 The main body of this is the @command{sed} script, which remaps the name 1731 from lower to upper (or vice-versa) and even checks out 4380 from lower to upper (or vice-versa) and even checks out 1732 4381 if the remapped name is the same as the original name. 1733 4382 Note how the script is parameterized using shell … … 1738 4387 @group 1739 4388 #! /bin/sh 1740 # rename files to lower/upper case... 4389 # rename files to lower/upper case... 1741 4390 # 1742 # usage: 1743 # move-to-lower * 1744 # move-to-upper * 4391 # usage: 4392 # move-to-lower * 4393 # move-to-upper * 1745 4394 # or 1746 4395 # move-to-lower -R . … … 1752 4401 help() 1753 4402 @{ 1754 4403 cat << eof 1755 4404 Usage: $0 [-n] [-r] [-h] files... 1756 4405 @end group … … 1785 4434 while : 1786 4435 do 1787 case "$1" in 4436 case "$1" in 1788 4437 -n) apply_cmd='cat' ;; 1789 4438 -R) finder='find "$@@" -type f';; … … 1813 4462 esac 1814 4463 @end group 1815 4464 1816 4465 eval $finder | sed -n ' 1817 4466 … … 1855 4504 @group 1856 4505 # check if converted file name is equal to original file name, 1857 # if it is, do not print nothing4506 # if it is, do not print anything 1858 4507 /^.*\/\(.*\)\n\1/b 4508 @end group 4509 4510 @group 4511 # escape special characters for the shell 4512 s/["$`\\]/\\&/g 1859 4513 @end group 1860 4514 … … 1974 4628 @c end--------------------------------------------- 1975 4629 4630 4631 @node Text search across multiple lines 4632 @section Text search across multiple lines 4633 4634 This section uses @code{N} and @code{D} commands to search for 4635 consecutive words spanning multiple lines. @xref{Multiline techniques}. 4636 4637 These examples deal with finding doubled occurrences of words in a document. 
4638 4639 Finding doubled words in a single line is easy using GNU @command{grep} 4640 and similarly with @value{SSED}: 4641 4642 @c NOTE: in all examples, 'the@ the' is used to prevent 4643 @c 'make syntax-check' from complaining about double words. 4644 @codequoteundirected on 4645 @codequotebacktick on 4646 @example 4647 @group 4648 $ cat two-cities-dup1.txt 4649 It was the best of times, 4650 it was the worst of times, 4651 it was the@ the age of wisdom, 4652 it was the age of foolishness, 4653 4654 $ grep -E '\b(\w+)\s+\1\b' two-cities-dup1.txt 4655 it was the@ the age of wisdom, 4656 4657 $ grep -n -E '\b(\w+)\s+\1\b' two-cities-dup1.txt 4658 3:it was the@ the age of wisdom, 4659 4660 $ sed -En '/\b(\w+)\s+\1\b/p' two-cities-dup1.txt 4661 it was the@ the age of wisdom, 4662 4663 $ sed -En '/\b(\w+)\s+\1\b/@{=;p@}' two-cities-dup1.txt 4664 3 4665 it was the@ the age of wisdom, 4666 @end group 4667 @end example 4668 @codequoteundirected off 4669 @codequotebacktick off 4670 4671 @itemize @bullet 4672 @item 4673 The regular expression @samp{\b\w+\s+} searches for a word boundary (@samp{\b}), 4674 followed by one-or-more word-characters (@samp{\w+}), followed by whitespace 4675 (@samp{\s+}). @xref{regexp extensions}. 4676 4677 @item 4678 Adding parentheses around the @samp{\w+} expression creates a subexpression. 4679 The regular expression pattern @samp{(PATTERN)\s+\1} defines a subexpression 4680 (in the parentheses) followed by a back-reference, separated by whitespace. 4681 A successful match means the @var{PATTERN} was repeated twice in succession. 4682 @xref{Back-references and Subexpressions}. 4683 4684 @item 4685 The word-boundary expression (@samp{\b}) at both ends ensures partial 4686 words are not matched (e.g. @samp{the then} is not a desired match).
4687 @c Thanks to Jim for pointing this out in 4688 @c https://lists.gnu.org/archive/html/sed-devel/2016-12/msg00041.html 4689 4690 @item 4691 The @option{-E} option enables extended regular expression syntax, alleviating 4692 the need to add backslashes before the parentheses. @xref{ERE syntax}. 4693 4694 @end itemize 4695 4696 When the doubled word spans two lines, the above regular expression 4697 will not find it, as @command{grep} and @command{sed} operate line-by-line. 4698 4699 By using @command{N} and @command{D} commands, @command{sed} can apply 4700 regular expressions on multiple lines (that is, multiple lines are stored 4701 in the pattern space, and the regular expression works on it): 4702 4703 @c NOTE: use 'the@*the' instead of a real new line to prevent 4704 @c 'make syntax-check' from complaining about doubled words. 4705 @codequoteundirected on 4706 @codequotebacktick on 4707 @example 4708 $ cat two-cities-dup2.txt 4709 It was the best of times, it was the 4710 worst of times, it was the@*the age of wisdom, 4711 it was the age of foolishness, 4712 4713 $ sed -En '@{N; /\b(\w+)\s+\1\b/@{=;p@} ; D@}' two-cities-dup2.txt 4714 3 4715 worst of times, it was the@*the age of wisdom, 4716 @end example 4717 @codequoteundirected off 4718 @codequotebacktick off 4719 4720 @itemize @bullet 4721 @item 4722 The @command{N} command appends the next line to the pattern space 4723 (thus ensuring it contains two consecutive lines in every cycle). 4724 4725 @item 4726 The regular expression uses @samp{\s+} as the word separator, which matches 4727 both spaces and newlines. 4728 4729 @item 4730 When the regular expression matches, the entire pattern space is printed 4731 with @command{p}. No lines are printed by default due to the @option{-n} option. 4732 4733 @item 4734 The @command{D} command removes the first line from the pattern space (up until the 4735 first newline), readying it for the next cycle.
4736 @end itemize 4737 4738 See the GNU @command{coreutils} manual for an alternative solution using 4739 @command{tr -s} and @command{uniq} at 4740 @c NOTE: cheating and keeping the URL line shorter than 80 characters 4741 @c by using 'gnu.org' and '/s/'. 4742 @url{https://gnu.org/s/coreutils/manual/html_node/Squeezing-and-deleting.html}. 4743 4744 @node Line length adjustment 4745 @section Line length adjustment 4746 4747 This section uses @code{N} and @code{P} commands to read and write 4748 lines, and the @code{b} command for branching. 4749 @xref{Multiline techniques} and @ref{Branching and flow control}. 4750 4751 This (somewhat contrived) example deals with formatting and wrapping 4752 the lines of text in the following input file: 4753 4754 @example 4755 @group 4756 $ cat two-cities-mix.txt 4757 It was the best of times, it was 4758 the worst of times, it 4759 was the age of 4760 wisdom, 4761 it 4762 was 4763 the age 4764 of foolishness, 4765 @end group 4766 @end example 4767 4768 @exdent The following sed program wraps lines at 40 characters: 4769 @codequoteundirected on 4770 @codequotebacktick on 4771 @example 4772 @group 4773 $ cat wrap40.sed 4774 # Outer loop 4775 :x 4776 4777 # Append a newline followed by the next input line to the pattern buffer 4778 N 4779 4780 # Replace all newlines in the pattern buffer with spaces 4781 s/\n/ /g 4782 4783 4784 # Inner loop 4785 :y 4786 4787 # Add a newline after the first 40 characters 4788 s/(.@{40,40@})/\1\n/ 4789 4790 # If there is a newline in the pattern buffer 4791 # (i.e. the previous substitution added a newline) 4792 /\n/ @{ 4793 # There are newlines in the pattern buffer - 4794 # print the content until the first newline.
4795 P 4796 4797 # Remove the printed characters and the first newline 4798 s/.*\n// 4799 4800 # branch to label 'y' - repeat inner loop 4801 by 4802 @} 4803 4804 # No newlines in the pattern buffer - Branch to label 'x' (outer loop) 4805 # and read the next input line 4806 bx 4807 @end group 4808 @end example 4809 @codequoteundirected off 4810 @codequotebacktick off 4811 4812 4813 4814 @exdent The wrapped output: 4815 @codequoteundirected on 4816 @codequotebacktick on 4817 @example 4818 @group 4819 $ sed -E -f wrap40.sed two-cities-mix.txt 4820 It was the best of times, it was the wor 4821 st of times, it was the age of wisdom, i 4822 t was the age of foolishness, 4823 @end group 4824 @end example 4825 @codequoteundirected off 4826 @codequotebacktick off 4827 4828 4829 4830 4831 @node Adding a header to multiple files 4832 @section Adding a header to multiple files 4833 4834 @value{SSED} can be used to safely modify multiple files at once. 4835 4836 @exdent Add a single line to the beginning of source code files: 4837 4838 @codequoteundirected on 4839 @codequotebacktick on 4840 @example 4841 sed -i '1i/* Copyright (C) FOO BAR */' *.c 4842 @end example 4843 @codequoteundirected off 4844 @codequotebacktick off 4845 4846 @exdent Adding a few lines is possible using @samp{\n} in the text: 4847 4848 @codequoteundirected on 4849 @codequotebacktick on 4850 @example 4851 sed -i '1i/*\n * Copyright (C) FOO BAR\n * Created by Jane Doe\n */' *.c 4852 @end example 4853 @codequoteundirected off 4854 @codequotebacktick off 4855 4856 To add multiple lines from another file, use @code{0rFILE}. 
4857 A typical use case is adding a license notice header to all files: 4858 4859 @codequoteundirected on 4860 @codequotebacktick on 4861 @example 4862 ## Create the header file: 4863 $ cat<<'EOF'>LIC.TXT 4864 /* 4865 Copyright (C) 1989-2021 FOO BAR 4866 4867 This program is free software; you can redistribute it and/or modify 4868 it under the terms of the GNU General Public License as published by 4869 the Free Software Foundation; either version 3, or (at your option) 4870 any later version. 4871 4872 This program is distributed in the hope that it will be useful, 4873 but WITHOUT ANY WARRANTY; without even the implied warranty of 4874 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 4875 GNU General Public License for more details. 4876 4877 You should have received a copy of the GNU General Public License 4878 along with this program; If not, see <https://www.gnu.org/licenses/>. 4879 */ 4880 EOF 4881 4882 ## Add the file at the beginning of all source code files: 4883 $ sed -i '0rLIC.TXT' *.cpp *.h 4884 @end example 4885 @codequoteundirected off 4886 @codequotebacktick off 4887 4888 4889 With script files (e.g. @file{.sh},@file{.py},@file{.pl} files) 4890 the license notice typically appears @emph{after} the first line (the 4891 'shebang' @samp{#!} line). The @code{1rFILE} command will add @file{FILE} 4892 @emph{after} the first line: 4893 4894 @codequoteundirected on 4895 @codequotebacktick on 4896 @example 4897 ## Create the header file: 4898 $ cat<<'EOF'>LIC.TXT 4899 ## 4900 ## Copyright (C) 1989-2021 FOO BAR 4901 ## 4902 ## This program is free software; you can redistribute it and/or modify 4903 ## it under the terms of the GNU General Public License as published by 4904 ## the Free Software Foundation; either version 3, or (at your option) 4905 ## any later version. 
4906 ## 4907 ## This program is distributed in the hope that it will be useful, 4908 ## but WITHOUT ANY WARRANTY; without even the implied warranty of 4909 ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 4910 ## GNU General Public License for more details. 4911 ## 4912 ## You should have received a copy of the GNU General Public License 4913 ## along with this program; If not, see <https://www.gnu.org/licenses/>. 4914 ## 4915 ## 4916 EOF 4917 4918 ## Add the file after the first line of all script files: 4919 $ sed -i '1rLIC.TXT' *.py *.sh 4920 @end example 4921 @codequoteundirected off 4922 @codequotebacktick off 4923 4924 The above @command{sed} commands can be combined with @command{find} 4925 to locate files in all subdirectories, @command{xargs} to run additional 4926 commands on selected files and @command{grep} to filter out files that already 4927 contain a copyright notice: 4928 4929 @codequoteundirected on 4930 @codequotebacktick on 4931 @example 4932 find \( -iname '*.cpp' -o -iname '*.c' -o -iname '*.h' \) \ 4933 | xargs grep -Li copyright \ 4934 | xargs -r sed -i '0rLIC.TXT' 4935 @end example 4936 @codequoteundirected off 4937 @codequotebacktick off 4938 4939 @exdent Or a slightly safer version (handling file names with spaces and newlines): 4940 4941 @codequoteundirected on 4942 @codequotebacktick on 4943 @example 4944 find \( -iname '*.cpp' -o -iname '*.c' -o -iname '*.h' \) -print0 \ 4945 | xargs -0 grep -Z -Li copyright \ 4946 | xargs -0 -r sed -i '0rLIC.TXT' 4947 @end example 4948 @codequoteundirected off 4949 @codequotebacktick off 4950 4951 Note: using the @code{0} address with the @code{r} command requires @value{SSED} 4952 version 4.9 or later. @xref{Zero Address}. 4953 4954 4955 1976 4956 @node tac 1977 4957 @section Reverse Lines of Files … … 1981 4961 is a @command{tac} workalike.
1982 4962 1983 Note that on implementations other than @acronym{GNU} @command{sed} 1984 @ifset PERL 1985 and @value{SSED} 1986 @end ifset 4963 Note that on implementations other than GNU @command{sed} 1987 4964 this script might easily overflow internal buffers. 1988 4965 … … 2015 4992 2016 4993 This script replaces @samp{cat -n}; in fact it formats its output 2017 exactly like @acronym{GNU}@command{cat} does.4994 exactly like GNU @command{cat} does. 2018 4995 2019 4996 Of course this is completely useless and for two reasons: first, … … 2254 5231 @group 2255 5232 # Convert words to a's 2256 s/[ @kbd{ tab}][ @kbd{tab}]*/ /g5233 s/[ @kbd{@key{TAB}}][ @kbd{@key{TAB}}]*/ /g 2257 5234 s/^/ / 2258 5235 s/ [^ ][^ ]*/a /g … … 2431 5408 @c end--------------------------------------------- 2432 5409 2433 As you can see, we ma ntain a 2-line window using @code{P} and @code{D}.5410 As you can see, we maintain a 2-line window using @code{P} and @code{D}. 2434 5411 This technique is often used in advanced @command{sed} scripts. 2435 5412 … … 2585 5562 fastest. Note that loops are completely done with @code{n} and 2586 5563 @code{b}, without relying on @command{sed} to restart the 2587 thescript automatically at the end of a line.5564 script automatically at the end of a line. 2588 5565 2589 5566 @c start------------------------------------------- … … 2603 5580 # get next 2604 5581 n 2605 # got chars? print it again, etc... 5582 # got chars? print it again, etc... 2606 5583 /./bx 2607 5584 @end group … … 2631 5608 @chapter @value{SSED}'s Limitations and Non-limitations 2632 5609 2633 @cindex @acronym{GNU}extensions, unlimited line length5610 @cindex GNU extensions, unlimited line length 2634 5611 @cindex Portability, line length limitations 2635 5612 For those who want to write portable @command{sed} scripts, … … 2647 5624 the size of the buffer that can be processed by certain patterns. 
2648 5625 2649 @ifset PERL2650 There are some size limitations in the regular expression2651 matcher but it is hoped that they will never in practice2652 be relevant. The maximum length of a compiled pattern2653 is 65539 (sic) bytes. All values in repeating quantifiers2654 must be less than 65536. The maximum nesting depth of2655 all parenthesized subpatterns, including capturing and2656 non-capturing subpatterns@footnote{The2657 distinction is meaningful when referring to Perl-style2658 regular expressions.}, assertions, and other types of2659 subpattern, is 200.2660 2661 Also, @value{SSED} recognizes the @sc{posix} syntax2662 @code{[.@var{ch}.]} and @code{[=@var{ch}=]}2663 where @var{ch} is a ``collating element'', but these2664 are not supported, and an error is given if they are2665 encountered.2666 2667 Here are a few distinctions between the real Perl-style2668 regular expressions and those that @option{-R} recognizes.2669 2670 @enumerate2671 @item2672 Lookahead assertions do not allow repeat quantifiers after them2673 Perl permits them, but they do not mean what you2674 might think. For example, @samp{(?!a)@{3@}} does not assert that the2675 next three characters are not @samp{a}. It just asserts three times that the2676 next character is not @samp{a} --- a waste of time and nothing else.2677 2678 @item2679 Capturing subpatterns that occur inside negative lookahead2680 head assertions are counted, but their entries are counted2681 as empty in the second half of an @code{s} command.2682 Perl sets its numerical variables from any such patterns2683 that are matched before the assertion fails to match2684 something (thereby succeeding), but only if the negative2685 lookahead assertion contains just one branch.2686 2687 @item2688 The following Perl escape sequences are not supported:2689 @samp{\l}, @samp{\u}, @samp{\L}, @samp{\U}, @samp{\E},2690 @samp{\Q}. 
In fact these are implemented by Perl's general2691 string-handling and are not part of its pattern matching engine.2692 2693 @item2694 The Perl @samp{\G} assertion is not supported as it is not2695 relevant to single pattern matches.2696 2697 @item2698 Fairly obviously, @value{SSED} does not support the @samp{(?@{code@})}2699 and @samp{(?p@{code@})} constructions. However, there is some experimental2700 support for recursive patterns using the non-Perl item @samp{(?R)}.2701 2702 @item2703 There are at the time of writing some oddities in Perl2704 5.005_02 concerned with the settings of captured strings2705 when part of a pattern is repeated. For example, matching2706 @samp{aba} against the pattern @samp{/^(a(b)?)+$/} sets2707 @samp{$2}@footnote{@samp{$2} would be @samp{\2} in @value{SSED}.}2708 to the value @samp{b}, but matching @samp{aabbaa}2709 against @samp{/^(aa(bb)?)+$/} leaves @samp{$2}2710 unset. However, if the pattern is changed to2711 @samp{/^(aa(b(b))?)+$/} then @samp{$2} (and @samp{$3}) are set.2712 In Perl 5.004 @samp{$2} is set in both cases, and that is also2713 true of @value{SSED}.2714 2715 @item2716 Another as yet unresolved discrepancy is that in Perl2717 5.005_02 the pattern @samp{/^(a)?(?(1)a|b)+$/} matches2718 the string @samp{a}, whereas in @value{SSED} it does not.2719 However, in both Perl and @value{SSED} @samp{/^(a)?a/} matched2720 against @samp{a} leaves $1 unset.2721 @end enumerate2722 @end ifset2723 5626 2724 5627 @node Other Resources 2725 5628 @chapter Other Resources for Learning About @command{sed} 2726 5629 5630 For up to date information about @value{SSED} please 5631 visit @uref{https://www.gnu.org/software/sed/}. 5632 5633 Send general questions and suggestions to @email{sed-devel@@gnu.org}. 5634 Visit the mailing list archives for past discussions at 5635 @uref{https://lists.gnu.org/archive/html/sed-devel/}. 
5636 2727 5637 @cindex Additional reading about @command{sed} 2728 In addition to several books that have been written about @command{sed} 2729 (either specifically or as chapters in books which discuss 2730 shell programming), one can find out more about @command{sed} 2731 (including suggestions of a few books) from the FAQ 2732 for the @code{sed-users} mailing list, available from any of: 2733 @display 2734 @uref{http://www.student.northpark.edu/pemente/sed/sedfaq.html} 2735 @uref{http://sed.sf.net/grabbag/tutorials/sedfaq.html} 2736 @end display 2737 2738 Also of interest are 2739 @uref{http://www.student.northpark.edu/pemente/sed/index.htm} 2740 and @uref{http://sed.sf.net/grabbag}, 2741 which include @command{sed} tutorials and other @command{sed}-related goodies. 2742 2743 The @code{sed-users} mailing list itself maintained by Sven Guckes. 2744 To subscribe, visit @uref{http://groups.yahoo.com} and search 2745 for the @code{sed-users} mailing list. 5638 The following resources provide information about @command{sed} 5639 (both @value{SSED} and other variations). Note that these are not maintained by 5640 @value{SSED} developers. 5641 5642 @itemize @bullet 5643 5644 @item 5645 sed @code{$HOME}: @uref{http://sed.sf.net} 5646 5647 @item 5648 sed FAQ: @uref{http://sed.sf.net/sedfaq.html} 5649 5650 @item 5651 seder's grabbag: @uref{http://sed.sf.net/grabbag} 5652 5653 @item 5654 The @code{sed-users} mailing list maintained by Sven Guckes: 5655 @uref{http://groups.yahoo.com/group/sed-users/} 5656 (note this is @emph{not} the @value{SSED} mailing list). 5657 5658 @end itemize 2746 5659 2747 5660 @node Reporting Bugs … … 2749 5662 2750 5663 @cindex Bugs, reporting 2751 Email bug reports to @email{bonzini@@gnu.org}. 2752 Be sure to include the word ``sed'' somewhere in the @code{Subject:} field. 5664 Email bug reports to @email{bug-sed@@gnu.org}. 2753 5665 Also, please include the output of @samp{sed --version} in the body 2754 5666 of your report if at all possible.
… … 2757 5669 2758 5670 @example 2759 @i{ while building frobme-1.3.4}2760 $ configure 5671 @i{@i{@r{while building frobme-1.3.4}}} 5672 $ configure 2761 5673 @error{} sed: file sedscr line 1: Unknown option to 's' 2762 5674 @end example … … 2777 5689 2778 5690 @table @asis 5691 @anchor{N_command_last_line} 2779 5692 @item @code{N} command on the last line 2780 5693 @cindex Portability, @code{N} command on the last line … … 2786 5699 the @command{-n} command switch has been specified. This choice is 2787 5700 by design. 5701 5702 Default behavior (gnu extension, non-POSIX conforming): 5703 @example 5704 $ seq 3 | sed N 5705 1 5706 2 5707 3 5708 @end example 5709 @noindent 5710 To force POSIX-conforming behavior: 5711 @example 5712 $ seq 3 | sed --posix N 5713 1 5714 2 5715 @end example 2788 5716 2789 5717 For example, the behavior of … … 2806 5734 /foo/@{ N;N;N;N;N;N;N;N;N; @} 2807 5735 @end example 2808 5736 2809 5737 @cindex @code{POSIXLY_CORRECT} behavior, @code{N} command 2810 5738 In any case, the simplest workaround is to use @code{$d;N} in … … 2813 5741 2814 5742 @item Regex syntax clashes (problems with backslashes) 2815 @cindex @acronym{GNU}extensions, to basic regular expressions5743 @cindex GNU extensions, to basic regular expressions 2816 5744 @cindex Non-bugs, regex syntax clashes 2817 5745 @command{sed} uses the @sc{posix} basic regular expression syntax. According to … … 2821 5749 @code{\>}, @code{\b}, @code{\B}, @code{\w}, and @code{\W}. 2822 5750 2823 As in all @acronym{GNU}programs that use @sc{posix} basic regular5751 As in all GNU programs that use @sc{posix} basic regular 2824 5752 expressions, @command{sed} interprets these escape sequences as special 2825 5753 characters. So, @code{x\+} matches one or more occurrences of @samp{x}. … … 2832 5760 spurious backslashes if they are to be used with modern implementations 2833 5761 of @command{sed}, like 2834 @ifset PERL 2835 @value{SSED} or 2836 @end ifset 2837 @acronym{GNU} @command{sed}. 
5762 GNU @command{sed}. 2838 5763 2839 5764 On the other hand, some scripts use s|abc\|def||g to remove occurrences … … 2841 5766 @command{sed} 4.0.x, newer versions interpret this as removing the 2842 5767 string @code{abc|def}. This is again undefined behavior according to 2843 @acronym{POSIX}, and this interpretation is arguably more robust: older5768 POSIX, and this interpretation is arguably more robust: older 2844 5769 @command{sed}s, for example, required that the regex matcher parsed 2845 5770 @code{\/} as @code{/} in the common case of escaping a slash, which is … … 2847 5772 because the regex matcher is only partially under our control. 2848 5773 2849 @cindex @acronym{GNU}extensions, special escapes5774 @cindex GNU extensions, special escapes 2850 5775 In addition, this version of @command{sed} supports several escape characters 2851 5776 (some of which are multi-character) to insert non-printable characters … … 2863 5788 (@pxref{Invoking sed, , Invocation}) lets you clobber 2864 5789 protected files. This is not a bug, but rather a consequence 2865 of how the Unix file system works.5790 of how the Unix file system works. 2866 5791 2867 5792 The permissions on a file say what can happen to the data … … 2873 5798 modifying the contents of the directory, so the operation depends on 2874 5799 the permissions of the directory, not of the file. For this same 2875 reason, @command{sed} does not let you use @option{-i} on a writ eable file2876 in a read-only directory (but unbelievably nobody reports that as a2877 bug@dots{}).5800 reason, @command{sed} does not let you use @option{-i} on a writable file 5801 in a read-only directory, and will break hard or symbolic links when 5802 @option{-i} is used on such a file. 2878 5803 2879 5804 @item @code{0a} does not work (gives an error) 5805 @cindex @code{0} address 5806 @cindex GNU extensions, @code{0} address 5807 @cindex Non-bugs, @code{0} address 5808 2880 5809 There is no line 0. 
0 is a special address that is only used to treat 2881 5810 addresses like @code{0,/@var{RE}/} as active when the script starts: if 2882 you write @code{1,/abc/d} and the first line includes the word@samp{abc},5811 you write @code{1,/abc/d} and the first line includes the string @samp{abc}, 2883 5812 then that match would be ignored because address ranges must span at least 2884 5813 two lines (barring the end of the file); but what you probably wanted is … … 2888 5817 @ifclear PERL 2889 5818 @item @code{[a-z]} is case insensitive 5819 @cindex Non-bugs, localization-related 5820 2890 5821 You are encountering problems with locales. POSIX mandates that @code{[a-z]} 2891 5822 uses the current locale's collation order -- in C parlance, that means using 2892 5823 @code{strcoll(3)} instead of @code{strcmp(3)}. Some locales have a 2893 case-insensitive collation order, others don't: one of those that have 2894 problems is Estonian. 5824 case-insensitive collation order, others don't. 2895 5825 2896 5826 Another problem is that @code{[a-z]} tries to use collation symbols. 2897 This only happens if you are on the @acronym{GNU}system, using2898 @acronym{GNU}libc's regular expression matcher instead of compiling the2899 one supplied with @acronym{GNU}sed. In a Danish locale, for example,5827 This only happens if you are on the GNU system, using 5828 GNU libc's regular expression matcher instead of compiling the 5829 one supplied with GNU sed. In a Danish locale, for example, 2900 5830 the regular expression @code{^[a-z]$} matches the string @samp{aa}, 2901 5831 because this is a single collating symbol that comes after @samp{a} … … 2905 5835 To work around these problems, which may cause bugs in shell scripts, set 2906 5836 the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}. 
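As an illustration of the workaround just described (a sketch that is not part of the changeset), the locale can also be overridden for a single @command{sed} invocation instead of for the whole shell session. The example assumes that setting @env{LC_ALL}, which takes precedence over both @env{LC_COLLATE} and @env{LC_CTYPE}, is acceptable in your script; the helper name @code{ascii_lower_only} and the sample input are hypothetical.

```shell
# Hypothetical helper (not from the sed manual): print only the input
# lines that consist of a single lowercase ASCII letter.
# LC_ALL=C overrides LC_COLLATE and LC_CTYPE for this one sed call,
# so the range [a-z] follows plain byte order.
ascii_lower_only() {
  LC_ALL=C sed -n '/^[a-z]$/p'
}

# Sample input: only 'a' and 'z' are single lowercase ASCII letters.
printf 'a\nB\naa\nz\n' | ascii_lower_only
```

In the C locale this prints only the lines @samp{a} and @samp{z}. Scoping the override to one command keeps the rest of the script running in the user's own locale.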
5837 5838 @item @code{s/.*//} does not clear pattern space 5839 @cindex Non-bugs, localization-related 5840 @cindex @value{SSEDEXT}, emptying pattern space 5841 @cindex Emptying pattern space 5842 5843 This happens if your input stream includes invalid multibyte 5844 sequences. @sc{posix} mandates that such sequences 5845 are @emph{not} matched by @samp{.}, so that @samp{s/.*//} will not clear 5846 pattern space as you would expect. In fact, there is no way to clear 5847 sed's buffers in the middle of the script in most multibyte locales 5848 (including UTF-8 locales). For this reason, @value{SSED} provides a `z' 5849 command (for `zap') as an extension. 5850 5851 To work around these problems, which may cause bugs in shell scripts, set 5852 the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}. 2907 5853 @end ifclear 2908 5854 @end table 2909 5855 2910 5856 2911 @node Extended regexps 2912 @appendix Extended regular expressions 2913 @cindex Extended regular expressions, syntax 2914 2915 The only difference between basic and extended regular expressions is in 2916 the behavior of a few characters: @samp{?}, @samp{+}, parentheses, 2917 and braces (@samp{@{@}}). While basic regular expressions require 2918 these to be escaped if you want them to behave as special characters, 2919 when using extended regular expressions you must escape them if 2920 you want them @emph{to match a literal character}. 2921 2922 @noindent 2923 Examples: 2924 @table @code 2925 @item abc? 2926 becomes @samp{abc\?} when using extended regular expressions. It matches 2927 the literal string @samp{abc?}. 2928 2929 @item c\+ 2930 becomes @samp{c+} when using extended regular expressions. It matches 2931 one or more @samp{c}s. 2932 2933 @item a\@{3,\@} 2934 becomes @samp{a@{3,@}} when using extended regular expressions. It matches 2935 three or more @samp{a}s. 2936 2937 @item \(abc\)\@{2,3\@} 2938 becomes @samp{(abc)@{2,3@}} when using extended regular expressions. 
It 2939 matches either @samp{abcabc} or @samp{abcabcabc}. 2940 2941 @item \(abc*\)\1 2942 becomes @samp{(abc*)\1} when using extended regular expressions. 2943 Backreferences must still be escaped when using extended regular 2944 expressions. 2945 @end table 2946 2947 @ifset PERL 2948 @node Perl regexps 2949 @appendix Perl-style regular expressions 2950 @cindex Perl-style regular expressions, syntax 2951 2952 @emph{This part is taken from the @file{pcre.txt} file distributed together 2953 with the free @sc{pcre} regular expression matcher; it was written by Philip Hazel.} 2954 2955 Perl introduced several extensions to regular expressions, some 2956 of them incompatible with the syntax of regular expressions 2957 accepted by Emacs and other @acronym{GNU} tools (whose matcher was 2958 based on the Emacs matcher). @value{SSED} implements 2959 both kinds of extensions. 2960 2961 @iftex 2962 Summarizing, we have: 2963 2964 @itemize @bullet 2965 @item 2966 A backslash can introduce several special sequences 2967 2968 @item 2969 The circumflex, dollar sign, and period characters behave specially 2970 with regard to new lines 2971 2972 @item 2973 Strange uses of square brackets are parsed differently 2974 2975 @item 2976 You can toggle modifiers in the middle of a regular expression 2977 2978 @item 2979 You can specify that a subpattern does not count when numbering backreferences 2980 2981 @item 2982 @cindex Greedy regular expression matching 2983 You can specify greedy or non-greedy matching 2984 2985 @item 2986 You can have more than ten back references 2987 2988 @item 2989 You can do complex look aheads and look behinds (in the spirit of 2990 @code{\b}, but with subpatterns). 
2991 2992 @item 2993 You can often improve performance by avoiding that @command{sed} wastes 2994 time with backtracking 2995 2996 @item 2997 You can have if/then/else branches 2998 2999 @item 3000 You can do recursive matches, for example to look for unbalanced parentheses 3001 3002 @item 3003 You can have comments and non-significant whitespace, because things can 3004 get complex... 3005 @end itemize 3006 3007 Most of these extensions are introduced by the special @code{(?} 3008 sequence, which gives special meanings to parenthesized groups. 3009 @end iftex 3010 @menu 3011 Other extensions can be roughly subdivided in two categories 3012 On one hand Perl introduces several more escaped sequences 3013 (that is, sequences introduced by a backslash). On the other 3014 hand, it specifies that if a question mark follows an open 3015 parentheses it should give a special meaning to the parenthesized 3016 group. 3017 3018 * Backslash:: Introduces special sequences 3019 * Circumflex/dollar sign/period:: Behave specially with regard to new lines 3020 * Square brackets:: Are a bit different in strange cases 3021 * Options setting:: Toggle modifiers in the middle of a regexp 3022 * Non-capturing subpatterns:: Are not counted when backreferencing 3023 * Repetition:: Allows for non-greedy matching 3024 * Backreferences:: Allows for more than 10 back references 3025 * Assertions:: Allows for complex look ahead matches 3026 * Non-backtracking subpatterns:: Often gives more performance 3027 * Conditional subpatterns:: Allows if/then/else branches 3028 * Recursive patterns:: For example to match parentheses 3029 * Comments:: Because things can get complex... 3030 @end menu 3031 3032 @node Backslash 3033 @appendixsec Backslash 3034 @cindex Perl-style regular expressions, escaped sequences 3035 3036 There are a few difference in the handling of backslashed 3037 sequences in Perl mode. 3038 3039 First of all, there are no @code{\o} and @code{\d} sequences. 
3040 @sc{ascii} values for characters can be specified in octal 3041 with a @code{\@var{xxx}} sequence, where @var{xxx} is a 3042 sequence of up to three octal digits. If the first digit 3043 is a zero, the treatment of the sequence is straightforward; 3044 just note that if the character that follows the escaped digit 3045 is itself an octal digit, you have to supply three octal digits 3046 for @var{xxx}. For example @code{\07} is a @sc{bel} character 3047 rather than a @sc{nul} and a literal @code{7} (this sequence is 3048 instead represented by @code{\0007}). 3049 3050 @cindex Perl-style regular expressions, backreferences 3051 The handling of a backslash followed by a digit other than 0 3052 is complicated. Outside a character class, @command{sed} reads it 3053 and any following digits as a decimal number. If the number 3054 is less than 10, or if there have been at least that many 3055 previous capturing left parentheses in the expression, the 3056 entire sequence is taken as a back reference. A description 3057 of how this works is given later, following the discussion 3058 of parenthesized subpatterns. 3059 3060 Inside a character class, or if the decimal number is 3061 greater than 9 and there have not been that many capturing 3062 subpatterns, @command{sed} re-reads up to three octal digits following 3063 the backslash, and generates a single byte from the 3064 least significant 8 bits of the value. Any subsequent digits 3065 stand for themselves. 
For example: 3066 3067 @example 3068 \040 @i{is another way of writing a space} 3069 \40 @i{is the same, provided there are fewer than 40} 3070 @i{previous capturing subpatterns} 3071 \7 @i{is always a back reference} 3072 \011 @i{is always a tab} 3073 \11 @i{might be a back reference, or another way of} 3074 @i{writing a tab} 3075 \0113 @i{is a tab followed by the character @samp{3}} 3076 \113 @i{is the character with octal code 113 (since there} 3077 @i{can be no more than 99 back references)} 3078 \377 @i{is a byte consisting entirely of 1 bits (@sc{ascii} 255)} 3079 \81 @i{is either a back reference, or a binary zero} 3080 @i{followed by the two characters @samp{81}} 3081 @end example 3082 3083 Note that octal values of 100 or greater must not be introduced 3084 duced by a leading zero, because no more than three octal 3085 digits are ever read. 3086 3087 All the sequences that define a single byte value can be 3088 used both inside and outside character classes. In addition, 3089 inside a character class, the sequence @code{\b} is interpreted 3090 as the backspace character (hex 08). Outside a character 3091 class it has a different meaning (see below). 3092 3093 In addition, there are four additional escapes specifying 3094 generic character classes (like @code{\w} and @code{\W} do): 3095 3096 @cindex Perl-style regular expressions, character classes 3097 @table @samp 3098 @item \d 3099 Matches any decimal digit 3100 3101 @item \D 3102 Matches any character that is not a decimal digit 3103 @end table 3104 3105 In Perl mode, these character type sequences can appear both inside and 3106 outside character classes. Instead, in @sc{posix} mode these sequences 3107 (as well as @code{\w} and @code{\W}) are treated as two literal characters 3108 (a backslash and a letter) inside square brackets. 3109 3110 Escaped sequences specifying assertions are also different in 3111 Perl mode. 
An assertion specifies a condition that has to be met 3112 at a particular point in a match, without consuming any 3113 characters from the subject string. The use of subpatterns 3114 for more complicated assertions is described below. The 3115 backslashed assertions are 3116 3117 @cindex Perl-style regular expressions, assertions 3118 @table @samp 3119 @item \b 3120 Asserts that the point is at a word boundary. 3121 A word boundary is a position in the subject string where 3122 the current character and the previous character do not both 3123 match @code{\w} or @code{\W} (i.e. one matches @code{\w} and 3124 the other matches @code{\W}), or the start or end of the string 3125 if the first or last character matches @code{\w}, respectively. 3126 3127 @item \B 3128 Asserts that the point is not at a word boundary. 3129 3130 @item \A 3131 Asserts the matcher is at the start of pattern space (independent 3132 of multiline mode). 3133 3134 @item \Z 3135 Asserts the matcher is at the end of pattern space, 3136 or at a newline before the end of pattern space (independent of 3137 multiline mode) 3138 3139 @item \z 3140 Asserts the matcher is at the end of pattern space (independent 3141 of multiline mode) 3142 @end table 3143 3144 These assertions may not appear in character classes (but 3145 note that @code{\b} has a different meaning, namely the 3146 backspace character, inside a character class). 3147 Note that Perl mode does not support directly assertions 3148 for the beginning and the end of word; the @acronym{GNU} extensions 3149 @code{\<} and @code{\>} achieve this purpose in @sc{posix} mode 3150 instead. 
3151 3152 The @code{\A}, @code{\Z}, and @code{\z} assertions differ 3153 from the traditional circumflex and dollar sign (described below) 3154 in that they only ever match at the very start and end of the 3155 subject string, whatever options are set; in particular @code{\A} 3156 and @code{\z} are the same as the @acronym{GNU} extensions 3157 @code{\`} and @code{\'} that are active in @sc{posix} mode. 3158 3159 @node Circumflex/dollar sign/period 3160 @appendixsec Circumflex, dollar sign, period 3161 @cindex Perl-style regular expressions, newlines 3162 3163 Outside a character class, in the default matching mode, the 3164 circumflex character is an assertion which is true only if 3165 the current matching point is at the start of the subject 3166 string. Inside a character class, the circumflex has an entirely 3167 different meaning (see below). 3168 3169 The circumflex need not be the first character of the pattern if 3170 a number of alternatives are involved, but it should be the 3171 first thing in each alternative in which it appears if the 3172 pattern is ever to match that branch. If all possible alternatives 3173 start with a circumflex, that is, if the pattern is 3174 constrained to match only at the start of the subject, it is 3175 said to be an @dfn{anchored} pattern. (There are also other constructs 3176 that can cause a pattern to be anchored.) 3177 3178 A dollar sign is an assertion which is true only if the 3179 current matching point is at the end of the subject string, 3180 or immediately before a newline character that is the last 3181 character in the string (by default). A dollar sign need not be the 3182 last character of the pattern if a number of alternatives 3183 are involved, but it should be the last item in any branch 3184 in which it appears. A dollar sign has no special meaning in a 3185 character class.
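As an illustration of the default (non-multiline) behaviour just described, here is a minimal sketch. It uses Python's @code{re} module as a stand-in, on the assumption that its default handling of @code{^} and @code{$} matches the Perl-style rules given here; the sample strings are made up.

```python
import re

# $ is true at the very end, or just before a final newline.
before_final_newline = re.search(r"abc$", "abc\n")

# Without multiline mode, ^ is true only at the start of the subject.
not_at_start = re.search(r"^abc$", "def\nabc")

# Every alternative starts with ^, so the pattern is anchored.
anchored = re.compile(r"^abc|^def")
hit = anchored.search("def")
miss = anchored.search("xabc")

# Inside a character class the dollar sign is an ordinary character.
literal_dollar = re.search(r"[$]5", "cost: $5")

print(before_final_newline.span(), not_at_start, hit.group(), miss,
      literal_dollar.group())
```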
3186 3187 @cindex Perl-style regular expressions, multiline 3188 The meanings of the circumflex and dollar sign characters are 3189 changed if the @code{M} modifier option is used. When this is 3190 the case, they match immediately after and immediately 3191 before an internal @code{\n} character, respectively, in addition 3192 to matching at the start and end of the subject string. For 3193 example, the pattern @code{/^abc$/} matches the subject string 3194 @samp{def\nabc} in multiline mode, but not otherwise. Consequently, 3195 patterns that are anchored in single line mode 3196 because all branches start with @code{^} are not anchored in 3197 multiline mode. 3198 3199 @cindex Perl-style regular expressions, multiline 3200 Note that the sequences @code{\A}, @code{\Z}, and @code{\z} 3201 can be used to match the start and end of the subject in both 3202 modes, and if all branches of a pattern start with @code{\A} 3203 it is always anchored, whether the @code{M} modifier is set or not. 3204 3205 @cindex Perl-style regular expressions, single line 3206 Outside a character class, a dot in the pattern matches any 3207 one character in the subject, including a non-printing character, 3208 but not (by default) newline. If the @code{S} modifier is used, 3209 dots match newlines as well. The handling of 3210 dot is entirely independent of the handling of circumflex 3211 and dollar sign, the only relationship being that they both 3212 involve newline characters. Dot has no special meaning in a 3213 character class. 3214 3215 @node Square brackets 3216 @appendixsec Square brackets 3217 @cindex Perl-style regular expressions, character classes 3218 3219 An opening square bracket introduces a character class, terminated 3220 by a closing square bracket. A closing square bracket on its own 3221 is not special.
If a closing square bracket is required as a 3222 member of the class, it should be the first data character in 3223 the class (after an initial circumflex, if present) or escaped with a backslash. 3224 3225 A character class matches a single character in the subject; 3226 the character must be in the set of characters defined by 3227 the class, unless the first character in the class is a circumflex, 3228 in which case the subject character must not be in 3229 the set defined by the class. If a circumflex is actually 3230 required as a member of the class, ensure it is not the 3231 first character, or escape it with a backslash. 3232 3233 For example, the character class @code{[aeiou]} matches any lower 3234 case vowel, while @code{[^aeiou]} matches any character that is not 3235 a lower case vowel. Note that a circumflex is just a convenient 3236 notation for specifying the characters which are in 3237 the class by enumerating those that are not. It is not an 3238 assertion: it still consumes a character from the subject 3239 string, and fails if the current pointer is at the end of 3240 the string. 3241 3242 @cindex Perl-style regular expressions, case-insensitive 3243 When caseless matching is set, any letters in a class 3244 represent both their upper case and lower case versions, so 3245 for example, a caseless @code{[aeiou]} matches uppercase 3246 and lowercase @samp{A}s, and a caseless @code{[^aeiou]} 3247 does not match @samp{A}, whereas a case-sensitive version would. 3248 3249 @cindex Perl-style regular expressions, single line 3250 @cindex Perl-style regular expressions, multiline 3251 The newline character is never treated in any special way in 3252 character classes, whatever the setting of the @code{S} and 3253 @code{M} options (modifiers) is. A class such as @code{[^a]} will 3254 always match a newline. 3255 3256 The minus (hyphen) character can be used to specify a range 3257 of characters in a character class.
For example, @code{[d-m]} 3258 matches any letter between d and m, inclusive. If a minus 3259 character is required in a class, it must be escaped with a 3260 backslash or appear in a position where it cannot be interpreted 3261 as indicating a range, typically as the first or last 3262 character in the class. 3263 3264 It is not possible to have the literal character @code{]} as the 3265 end character of a range. A pattern such as @code{[W-]46]} is 3266 interpreted as a class of two characters (@code{W} and @code{-}) 3267 followed by a literal string @code{46]}, so it would match 3268 @samp{W46]} or @samp{-46]}. However, if the @code{]} is escaped 3269 with a backslash it is interpreted as the end of range, so 3270 @code{[W-\]46]} is interpreted as a single class containing a 3271 range followed by two separate characters. The octal or 3272 hexadecimal representation of @code{]} can also be used to end a range. 3273 3274 Ranges operate in @sc{ascii} collating sequence. They can also be 3275 used for characters specified numerically, for example 3276 @code{[\000-\037]}. If a range that includes letters is used when 3277 caseless matching is set, it matches the letters in either 3278 case. For example, a caseless @code{[W-c]} is equivalent to 3279 @code{[][\^_`wxyzabc]}, matched caselessly, and if character 3280 tables for the French locale are in use, @code{[\xc8-\xcb]} 3281 matches accented E characters in both cases. 3282 3283 Unlike in @sc{posix} mode, the character types @code{\d}, 3284 @code{\D}, @code{\s}, @code{\S}, @code{\w}, and @code{\W} 3285 may also appear in a character class, and add the characters 3286 that they match to the class. For example, @code{[\dABCDEF]} matches any 3287 hexadecimal digit. A circumflex can conveniently be used 3288 with the upper case character types to specify a more restricted 3289 set of characters than the matching lower case type. 
3290 For example, the class @code{[^\W_]} matches any letter or digit, 3291 but not underscore. 3292 3293 All non-alphameric characters other than @code{\}, @code{-}, 3294 @code{^} (at the start) and the terminating @code{]} 3295 are non-special in character classes, but it does no harm 3296 if they are escaped. 3297 3298 Perl 5.6 supports the @sc{posix} notation for character classes, which 3299 uses names enclosed by @code{[:} and @code{:]} within the enclosing 3300 square brackets, and @value{SSED} supports this notation as well. 3301 For example, 3302 3303 @example 3304 [01[:alpha:]%] 3305 @end example 3306 3307 @noindent 3308 matches @samp{0}, @samp{1}, any alphabetic character, or @samp{%}. 3309 The supported class names are 3310 3311 @table @code 3312 @item alnum 3313 Matches letters and digits 3314 3315 @item alpha 3316 Matches letters 3317 3318 @item ascii 3319 Matches character codes 0 - 127 3320 3321 @item cntrl 3322 Matches control characters 3323 3324 @item digit 3325 Matches decimal digits (same as \d) 3326 3327 @item graph 3328 Matches printing characters, excluding space 3329 3330 @item lower 3331 Matches lower case letters 3332 3333 @item print 3334 Matches printing characters, including space 3335 3336 @item punct 3337 Matches printing characters, excluding letters and digits 3338 3339 @item space 3340 Matches white space (same as \s) 3341 3342 @item upper 3343 Matches upper case letters 3344 3345 @item word 3346 Matches ``word'' characters (same as \w) 3347 3348 @item xdigit 3349 Matches hexadecimal digits 3350 @end table 3351 3352 The names @code{ascii} and @code{word} are extensions valid only in 3353 Perl mode. Another Perl extension is negation, which is 3354 indicated by a circumflex character after the colon. For example, 3355 3356 @example 3357 [12[:^digit:]] 3358 @end example 3359 3360 @noindent 3361 matches @samp{1}, @samp{2}, or any non-digit. 
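Several of the character class rules in this section can be checked with a short sketch. Python's @code{re} module is used as a stand-in and is assumed to follow the same rules for the classes shown; it does not support the @code{[:name:]} notation, so those classes are not exercised here.

```python
import re

# Character types inside a class: \d plus the letters A-F.
hex_digits = re.findall(r"[\dABCDEF]", "x1fB")

# [^\W_] is "word character, but not underscore".
alnum_only = re.fullmatch(r"[^\W_]+", "abc123")
underscore = re.search(r"[^\W_]", "_")

# A negated class matches a newline like any other character.
newline_hit = re.search(r"[^a]", "\n")

# An escaped ] can end a range: W-] covers W X Y Z [ \ ]
range_to_bracket = re.findall(r"[W-\]46]", "W[]46-")

# Unescaped, the class ends early: [W-] followed by the literal "46]".
short_class = re.fullmatch(r"[W-]46]", "-46]")

# Caseless matching applies to negated classes too.
caseless_negated = re.findall(r"[^aeiou]", "A", re.IGNORECASE)

print(hex_digits, range_to_bracket, caseless_negated)
```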
3362 3363 @node Options setting 3364 @appendixsec Options setting 3365 @cindex Perl-style regular expressions, toggling options 3366 @cindex Perl-style regular expressions, case-insensitive 3367 @cindex Perl-style regular expressions, multiline 3368 @cindex Perl-style regular expressions, single line 3369 @cindex Perl-style regular expressions, extended 3370 3371 The settings of the @code{I}, @code{M}, @code{S}, @code{X} 3372 modifiers can be changed from within the pattern by 3373 a sequence of Perl option letters enclosed between @code{(?} 3374 and @code{)}. The option letters must be lowercase. 3375 3376 For example, @code{(?im)} sets caseless, multiline matching. It is 3377 also possible to unset these options by preceding the letter 3378 with a hyphen; you can also have combined settings and unsettings: 3379 @code{(?im-sx)} sets caseless and multiline matching, 3380 while unsetting single line matching (for dots) and extended 3381 whitespace interpretation. If a letter appears both before 3382 and after the hyphen, the option is unset. 3383 3384 The scope of these option changes depends on where in the 3385 pattern the setting occurs. For settings that are outside 3386 any subpattern (defined below), the effect is the same as if 3387 the options were set or unset at the start of matching. The 3388 following patterns all behave in exactly the same way: 3389 3390 @example 3391 (?i)abc 3392 a(?i)bc 3393 ab(?i)c 3394 abc(?i) 3395 @end example 3396 3397 which in turn is the same as specifying the pattern abc with 3398 the @code{I} modifier. In other words, ``top level'' settings 3399 apply to the whole pattern (unless there are other 3400 changes inside subpatterns). If there is more than one setting 3401 of the same option at top level, the rightmost setting 3402 is used. 3403 3404 If an option change occurs inside a subpattern, the effect 3405 is different. This is a change of behaviour in Perl 5.005.
3406 An option change inside a subpattern affects only that part 3407 of the subpattern @emph{that follows} it, so 3408 3409 @example 3410 (a(?i)b)c 3411 @end example 3412 3413 @noindent 3414 matches abc and aBc and no other strings (assuming 3415 case-sensitive matching is used). By this means, options can 3416 be made to have different settings in different parts of the 3417 pattern. Any changes made in one alternative do carry on 3418 into subsequent branches within the same subpattern. For 3419 example, 3420 3421 @example 3422 (a(?i)b|c) 3423 @end example 3424 3425 @noindent 3426 matches @samp{ab}, @samp{aB}, @samp{c}, and @samp{C}, 3427 even though when matching @samp{C} the first branch is 3428 abandoned before the option setting. 3429 This is because the effects of option settings happen at 3430 compile time. There would be some very weird behaviour otherwise. 3431 3432 @ignore 3433 There are two PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA 3434 that can be changed in the same way as the Perl-compatible options by 3435 using the characters U and X respectively. The (?X) flag 3436 setting is special in that it must always occur earlier in 3437 the pattern than any of the additional features it turns on, 3438 even when it is at top level. It is best put at the start. 3439 @end ignore 3440 3441 3442 @node Non-capturing subpatterns 3443 @appendixsec Non-capturing subpatterns 3444 @cindex Perl-style regular expressions, non-capturing subpatterns 3445 3446 Marking part of a pattern as a subpattern does two things. 3447 On one hand, it localizes a set of alternatives; on the other 3448 hand, it sets up the subpattern as a capturing subpattern (as 3449 defined above). The subpattern can be backreferenced and 3450 referenced in the right side of @code{s} commands. 
3451 3452 For example, if the string @samp{the red king} is matched against 3453 the pattern 3454 3455 @example 3456 the ((red|white) (king|queen)) 3457 @end example 3458 3459 @noindent 3460 the captured substrings are @samp{red king}, @samp{red}, 3461 and @samp{king}, and are numbered 1, 2, and 3. 3462 3463 The fact that plain parentheses fulfil two functions is not 3464 always helpful. There are often times when a grouping 3465 subpattern is required without a capturing requirement. If an 3466 opening parenthesis is followed by @code{?:}, the subpattern does 3467 not do any capturing, and is not counted when computing the 3468 number of any subsequent capturing subpatterns. For example, 3469 if the string @samp{the white queen} is matched against the pattern 3470 3471 @example 3472 the ((?:red|white) (king|queen)) 3473 @end example 3474 3475 @noindent 3476 the captured substrings are @samp{white queen} and @samp{queen}, 3477 and are numbered 1 and 2. The maximum number of captured 3478 substrings is 99, while the maximum number of all subpatterns, 3479 both capturing and non-capturing, is 200. 3480 3481 As a convenient shorthand, if any option settings are 3482 required at the start of a non-capturing subpattern, the 3483 option letters may appear between the @code{?} and the 3484 @code{:}. Thus the two patterns 3485 3486 @example 3487 (?i:saturday|sunday) 3488 (?:(?i)saturday|sunday) 3489 @end example 3490 3491 @noindent 3492 match exactly the same set of strings. Because alternative 3493 branches are tried from left to right, and options are not 3494 reset until the end of the subpattern is reached, an option 3495 setting in one branch does affect subsequent branches, so 3496 the above patterns match @samp{SUNDAY} as well as @samp{Saturday}.
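The numbering of captured substrings described in this section can be reproduced with a small sketch. Python's @code{re} module is used as a stand-in and is assumed to number groups the same way. Only the @code{(?i:...)} shorthand is shown, because recent Python versions reject an option setting such as @code{(?i)} in the middle of a pattern.

```python
import re

# Plain parentheses both group and capture: three captured substrings.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")

# (?:...) groups without capturing, so later groups are renumbered.
non_capturing = re.match(r"the ((?:red|white) (king|queen))",
                         "the white queen")

# Option letters between ? and : apply to the whole subpattern,
# including the second alternative.
days = re.compile(r"(?i:saturday|sunday)")
sunday = days.fullmatch("SUNDAY")
saturday = days.fullmatch("Saturday")

print(capturing.groups())
print(non_capturing.groups())
```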
3497 3498 3499 @node Repetition 3500 @appendixsec Repetition 3501 @cindex Perl-style regular expressions, repetitions 3502 3503 Repetition is specified by quantifiers, which can follow any 3504 of the following items: 3505 3506 @itemize @bullet 3507 @item 3508 a single character, possibly escaped 3509 3510 @item 3511 the @code{.} special character 3512 3513 @item 3514 a character class 3515 3516 @item 3517 a back reference (see next section) 3518 3519 @item 3520 a parenthesized subpattern (unless it is an assertion; @pxref{Assertions}) 3521 @end itemize 3522 3523 The general repetition quantifier specifies a minimum and 3524 maximum number of permitted matches, by giving the two 3525 numbers in curly brackets (braces), separated by a comma. 3526 The numbers must be less than 65536, and the first must be 3527 less than or equal to the second. For example: 3528 3529 @example 3530 z@{2,4@} 3531 @end example 3532 3533 @noindent 3534 matches @samp{zz}, @samp{zzz}, or @samp{zzzz}. A closing brace on its own 3535 is not a special character. If the second number is omitted, 3536 but the comma is present, there is no upper limit; if the 3537 second number and the comma are both omitted, the quantifier 3538 specifies an exact number of required matches. Thus 3539 3540 @example 3541 [aeiou]@{3,@} 3542 @end example 3543 3544 @noindent 3545 matches at least 3 successive vowels, but may match many 3546 more, while 3547 3548 @example 3549 \d@{8@} 3550 @end example 3551 3552 @noindent 3553 matches exactly 8 digits. An opening curly bracket that 3554 appears in a position where a quantifier is not allowed, or 3555 one that does not match the syntax of a quantifier, is taken 3556 as a literal character. 
For example, @{,6@} is not a quantifier, 3557 but a literal string of four characters.@footnote{It 3558 raises an error if @option{-R} is not used.} 3559 3560 The quantifier @samp{@{0@}} is permitted, causing the expression to 3561 behave as if the previous item and the quantifier were not 3562 present. 3563 3564 For convenience (and historical compatibility) the three 3565 most common quantifiers have single-character abbreviations: 3566 3567 @table @code 3568 @item * 3569 is equivalent to @{0,@} 3570 3571 @item + 3572 is equivalent to @{1,@} 3573 3574 @item ? 3575 is equivalent to @{0,1@} 3576 @end table 3577 3578 It is possible to construct infinite loops by following a 3579 subpattern that can match no characters with a quantifier 3580 that has no upper limit, for example: 3581 3582 @example 3583 (a?)* 3584 @end example 3585 3586 Earlier versions of Perl used to give an error at 3587 compile time for such patterns. However, because there are 3588 cases where this can be useful, such patterns are now 3589 accepted, but if any repetition of the subpattern does in 3590 fact match no characters, the loop is forcibly broken. 3591 3592 @cindex Greedy regular expression matching 3593 @cindex Perl-style regular expressions, stingy repetitions 3594 By default, the quantifiers are @dfn{greedy} like in @sc{posix} 3595 mode, that is, they match as much as possible (up to the maximum 3596 number of permitted times), without causing the rest of the 3597 pattern to fail. The classic example of where this gives problems 3598 is in trying to match comments in C programs. These appear between 3599 the sequences @code{/*} and @code{*/} and within the sequence, individual 3600 @code{*} and @code{/} characters may appear. 
An attempt to match C 3601 comments by applying the pattern 3602 3603 @example 3604 /\*.*\*/ 3605 @end example 3606 3607 @noindent 3608 to the string 3609 3610 @example 3611 /* first command */ not comment /* second comment */ 3612 @end example 3613 3614 @noindent 3615 3616 fails, because it matches the entire string owing to the 3617 greediness of the @code{.*} item. 3618 3619 However, if a quantifier is followed by a question mark, it 3620 ceases to be greedy, and instead matches the minimum number 3621 of times possible, so the pattern @code{/\*.*?\*/} 3622 does the right thing with the C comments. The meaning of the 3623 various quantifiers is not otherwise changed, just the preferred 3624 number of matches. Do not confuse this use of question 3625 mark with its use as a quantifier in its own right. 3626 Because it has two uses, it can sometimes appear doubled, as in 3627 3628 @example 3629 \d??\d 3630 @end example 3631 3632 which matches one digit by preference, but can match two if 3633 that is the only way the rest of the pattern matches. 3634 3635 Note that greediness does not matter when specifying addresses, 3636 but can be nevertheless used to improve performance. 3637 3638 @ignore 3639 If the PCRE_UNGREEDY option is set (an option which is not 3640 available in Perl), the quantifiers are not greedy by 3641 default, but individual ones can be made greedy by following 3642 them with a question mark. In other words, it inverts the 3643 default behaviour. 3644 @end ignore 3645 3646 When a parenthesized subpattern is quantified with a minimum 3647 repeat count that is greater than 1 or with a limited maximum, 3648 more store is required for the compiled pattern, in 3649 proportion to the size of the minimum or maximum. 
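The difference between greedy and non-greedy quantifiers discussed above can be seen in a minimal sketch. It uses Python's @code{re} module as a stand-in, assuming it treats @code{*}, @code{*?} and @code{??} as described here; the subject string is an illustrative variant of the one in the text.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last "*/", so the whole string is one match.
greedy = re.findall(r"/\*.*\*/", subject)

# Non-greedy: .*? stops at the first "*/" it can, giving two comments.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit ...
one_digit = re.match(r"\d??\d", "12").group()
# ... but takes two when the rest of the match requires it.
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy)
print(lazy)
print(one_digit, two_digits)
```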
3650 3651 @cindex Perl-style regular expressions, single line 3652 If a pattern starts with @code{.*} or @code{.@{0,@}} and the 3653 @code{S} modifier is used, the pattern is implicitly anchored, 3654 because whatever follows will be tried against every character 3655 position in the subject string, so there is no point in 3656 retrying the overall match at any position after the first. 3657 PCRE treats such a pattern as though it were preceded by \A. 3658 3659 When a capturing subpattern is repeated, the value captured 3660 is the substring that matched the final iteration. For example, 3661 after 3662 3663 @example 3664 (tweedle[dume]@{3@}\s*)+ 3665 @end example 3666 3667 @noindent 3668 has matched @samp{tweedledum tweedledee} the value of the 3669 captured substring is @samp{tweedledee}. However, if there are 3670 nested capturing subpatterns, the corresponding captured 3671 values may have been set in previous iterations. For example, 3672 after 3673 3674 @example 3675 /(a|(b))+/ 3676 @end example 3677 3678 matches @samp{aba}, the value of the second captured substring is 3679 @samp{b}. 3680 3681 @node Backreferences 3682 @appendixsec Backreferences 3683 @cindex Perl-style regular expressions, backreferences 3684 3685 Outside a character class, a backslash followed by a digit 3686 greater than 0 (and possibly further digits) is a back 3687 reference to a capturing subpattern earlier (i.e. to its 3688 left) in the pattern, provided there have been that many 3689 previous capturing left parentheses. 3690 3691 However, if the decimal number following the backslash is 3692 less than 10, it is always taken as a back reference, and 3693 causes an error only if there are not that many capturing 3694 left parentheses in the entire pattern. In other words, the 3695 parentheses that are referenced need not be to the left of 3696 the reference for numbers less than 10. @xref{Backslash}, 3697 for further details of the handling of digits following a backslash.
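The rules for repeated captures and for simple back references can be sketched as follows. Python's @code{re} module is used as a stand-in; it is an assumption that @value{SSED} keeps captured values across iterations in the same way. The word @samp{hello} is just an illustrative subject.

```python
import re

# A repeated capturing subpattern keeps the text of its final iteration.
repeated = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = repeated.group(1)

# A nested group can retain a value set in an earlier iteration.
nested = re.match(r"(a|(b))+", "aba")
outer, inner = nested.group(1), nested.group(2)

# \1 matches whatever group 1 actually matched: here, a doubled letter.
doubled = re.search(r"(\w)\1", "hello").group()

print(last_iteration, outer, inner, doubled)
```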
3698 3699 A back reference matches whatever actually matched the capturing 3700 subpattern in the current subject string, rather than 3701 anything matching the subpattern itself. So the pattern 3702 3703 @example 3704 (sens|respons)e and \1ibility 3705 @end example 3706 3707 @noindent 3708 matches @samp{sense and sensibility} and @samp{response and responsibility}, 3709 but not @samp{sense and responsibility}. If caseful 3710 matching is in force at the time of the back reference, the 3711 case of letters is relevant. For example, 3712 3713 @example 3714 ((?i)blah)\s+\1 3715 @end example 3716 3717 @noindent 3718 matches @samp{blah blah} and @samp{Blah Blah}, but not 3719 @samp{BLAH blah}, even though the original capturing 3720 subpattern is matched caselessly. 3721 3722 There may be more than one back reference to the same subpattern. 3723 Also, if a subpattern has not actually been used in a 3724 particular match, any back references to it always fail. For 3725 example, the pattern 3726 3727 @example 3728 (a|(bc))\2 3729 @end example 3730 3731 @noindent 3732 always fails if it starts to match @samp{a} rather than 3733 @samp{bc}. Because there may be up to 99 back references, all 3734 digits following the backslash are taken as part of a potential 3735 back reference number; this is different from what happens 3736 in @sc{posix} mode. If the pattern continues with a digit 3737 character, some delimiter must be used to terminate the back 3738 reference. If the @code{X} modifier option is set, this can be 3739 whitespace. Otherwise an empty comment can be used, or the 3740 following character can be expressed in hexadecimal or octal. 3741 3742 A back reference that occurs inside the parentheses to which 3743 it refers fails when the subpattern is first used, so, for 3744 example, @code{(a\1)} never matches. However, such references 3745 can be useful inside repeated subpatterns. 
For example, the 3746 pattern 3747 3748 @example 3749 (a|b\1)+ 3750 @end example 3751 3752 @noindent 3753 matches any number of @samp{a}s and also @samp{aba}, @samp{ababbaa}, 3754 etc. At each iteration of the subpattern, the back reference matches 3755 the character string corresponding to the previous iteration. In 3756 order for this to work, the pattern must be such that the first 3757 iteration does not need to match the back reference. This can be 3758 done using alternation, as in the example above, or by a 3759 quantifier with a minimum of zero. 3760 3761 @node Assertions 3762 @appendixsec Assertions 3763 @cindex Perl-style regular expressions, assertions 3764 @cindex Perl-style regular expressions, asserting subpatterns 3765 3766 An assertion is a test on the characters following or 3767 preceding the current matching point that does not actually 3768 consume any characters. The simple assertions coded as @code{\b}, 3769 @code{\B}, @code{\A}, @code{\Z}, @code{\z}, @code{^} and @code{$} 3770 are described above. More complicated assertions are coded as 3771 subpatterns. There are two kinds: those that look ahead of the 3772 current position in the subject string, and those that look behind it. 3773 3774 @cindex Perl-style regular expressions, lookahead subpatterns 3775 An assertion subpattern is matched in the normal way, except 3776 that it does not cause the current matching position to be 3777 changed. Lookahead assertions start with @code{(?=} for positive 3778 assertions and @code{(?!} for negative assertions. For example, 3779 3780 @example 3781 \w+(?=;) 3782 @end example 3783 3784 @noindent 3785 matches a word followed by a semicolon, but does not include 3786 the semicolon in the match, and 3787 3788 @example 3789 foo(?!bar) 3790 @end example 3791 3792 @noindent 3793 matches any occurrence of @samp{foo} that is not followed by 3794 @samp{bar}. 
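The two lookahead examples above can be run as a short sketch. Python's @code{re} module is used as a stand-in and is assumed to implement @code{(?=} and @code{(?!} as described; the subject strings are made up for illustration.

```python
import re

# Positive lookahead: the semicolon must follow but is not consumed.
words_before_semicolon = re.findall(r"\w+(?=;)", "foo; bar baz;")

# Negative lookahead: only the "foo" that is not followed by "bar".
starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

print(words_before_semicolon, starts)
```

Because the assertion does not move the matching position, the semicolon is left in the subject for whatever part of the pattern comes next.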
3795 3796 Note that the apparently similar pattern 3797 3798 @example 3799 (?!foo)bar 3800 @end example 3801 3802 @noindent 3803 @cindex Perl-style regular expressions, lookbehind subpatterns 3804 finds any occurrence of @samp{bar} even if it is preceded by 3805 @samp{foo}, because the assertion @code{(?!foo)} is always true 3806 when the next three characters are @samp{bar}. A lookbehind 3807 assertion is needed to achieve this effect. 3808 Lookbehind assertions start with @code{(?<=} for positive 3809 assertions and @code{(?<!} for negative assertions. So, 3810 3811 @example 3812 (?<!foo)bar 3813 @end example 3814 3815 achieves the required effect of finding an occurrence of 3816 @samp{bar} that is not preceded by @samp{foo}. The contents of a 3817 lookbehind assertion are restricted 3818 such that all the strings it matches must have a fixed 3819 length. However, if there are several alternatives, they do 3820 not all have to have the same fixed length. This is an extension 3821 compared with Perl 5.005, which requires all branches to match 3822 the same length of string. Thus 3823 3824 @example 3825 (?<=dogs|cats|) 3826 @end example 3827 3828 @noindent 3829 is permitted, but the apparently equivalent regular expression 3830 3831 @example 3832 (?<!dogs?|cats?) 3833 @end example 3834 3835 @noindent 3836 causes an error at compile time. Branches that match different 3837 length strings are permitted only at the top level of 3838 a lookbehind assertion: an assertion such as 3839 3840 @example 3841 (?<=ab(c|de)) 3842 @end example 3843 3844 @noindent 3845 is not permitted, because its single top-level branch can 3846 match two different lengths, but it is acceptable if rewritten 3847 to use two top-level branches: 3848 3849 @example 3850 (?<=abc|abde) 3851 @end example 3852 3853 All this is required because lookbehind assertions simply 3854 move the current position back by the alternative's fixed 3855 width and then try to match. 
If there are 3856 insufficient characters before the current position, the 3857 match is deemed to fail. Lookbehinds, in conjunction with 3858 non-backtracking subpatterns can be particularly useful for 3859 matching at the ends of strings; an example is given at the end 3860 of the section on non-backtracking subpatterns. 3861 3862 Several assertions (of any sort) may occur in succession. 3863 For example, 3864 3865 @example 3866 (?<=\d@{3@})(?<!999)foo 3867 @end example 3868 3869 @noindent 3870 matches @samp{foo} preceded by three digits that are not @samp{999}. 3871 Notice that each of the assertions is applied independently 3872 at the same point in the subject string. First there is a 3873 check that the previous three characters are all digits, and 3874 then there is a check that the same three characters are not 3875 @samp{999}. This pattern does not match @samp{foo} preceded by six 3876 characters, the first of which are digits and the last three 3877 of which are not @samp{999}. For example, it doesn't match 3878 @samp{123abcfoo}. A pattern to do that is 3879 3880 @example 3881 (?<=\d@{3@}...)(?<!999)foo 3882 @end example 3883 3884 @noindent 3885 This time the first assertion looks at the preceding six 3886 characters, checking that the first three are digits, and 3887 then the second assertion checks that the preceding three 3888 characters are not @samp{999}. Actually, assertions can be 3889 nested in any combination, so one can write this as 3890 3891 @example 3892 (?<=\d@{3@}(?!999)...)foo 3893 @end example 3894 3895 or 3896 3897 @example 3898 (?<=\d@{3@}...(?<!999))foo 3899 @end example 3900 3901 @noindent 3902 both of which might be considered more readable. 3903 3904 Assertion subpatterns are not capturing subpatterns, and may 3905 not be repeated, because it makes no sense to assert the 3906 same thing several times. 
If any kind of assertion contains 3907 capturing subpatterns within it, these are counted for the 3908 purposes of numbering the capturing subpatterns in the whole 3909 pattern. However, substring capturing is carried out only 3910 for positive assertions, because it does not make sense for 3911 negative assertions. 3912 3913 Assertions count towards the maximum of 200 parenthesized 3914 subpatterns. 3915 3916 @node Non-backtracking subpatterns 3917 @appendixsec Non-backtracking subpatterns 3918 @cindex Perl-style regular expressions, non-backtracking subpatterns 3919 3920 With both maximizing and minimizing repetition, failure of 3921 what follows normally causes the repeated item to be evaluated 3922 again to see if a different number of repeats allows the 3923 rest of the pattern to match. Sometimes it is useful to 3924 prevent this, either to change the nature of the match, or 3925 to cause it to fail earlier than it otherwise might, when the 3926 author of the pattern knows there is no point in carrying 3927 on. 3928 3929 Consider, for example, the pattern @code{\d+foo} when applied to 3930 the subject line 3931 3932 @example 3933 123456bar 3934 @end example 3935 3936 After matching all 6 digits and then failing to match @samp{foo}, 3937 the normal action of the matcher is to try again with only 5 3938 digits matching the @code{\d+} item, and then with 4, and so on, 3939 before ultimately failing. Non-backtracking subpatterns 3940 provide the means for specifying that once a portion of the 3941 pattern has matched, it is not to be re-evaluated in this way, 3942 so the matcher would give up immediately on failing to match 3943 @samp{foo} the first time.
The notation is another kind of special 3944 parenthesis, starting with @code{(?>} as in this example: 3945 3946 @example 3947 (?>\d+)foo 3948 @end example 3949 3950 This kind of parenthesis ``locks up'' the part of the pattern 3951 it contains once it has matched, and a failure further into 3952 the pattern is prevented from backtracking into it. 3953 Backtracking past it to previous items, however, works as 3954 normal. 3955 3956 Non-backtracking subpatterns are not capturing subpatterns. Simple 3957 cases such as the above example can be thought of as a maximizing 3958 repeat that must swallow everything it can. So, 3959 while both @code{\d+} and @code{\d+?} are prepared to adjust the number of 3960 digits they match in order to make the rest of the pattern 3961 match, @code{(?>\d+)} can only match an entire sequence of digits. 3962 3963 This construction can of course contain arbitrarily complicated 3964 subpatterns, and it can be nested. 3965 3966 @cindex Perl-style regular expressions, lookbehind subpatterns 3967 Non-backtracking subpatterns can be used in conjunction with look-behind 3968 assertions to specify efficient matching at the end 3969 of the subject string. Consider a simple pattern such as 3970 3971 @example 3972 abcd$ 3973 @end example 3974 3975 @noindent 3976 when applied to a long string which does not match. Because 3977 matching proceeds from left to right, @command{sed} will look for 3978 each @samp{a} in the subject and then see if what follows matches 3979 the rest of the pattern. If the pattern is specified as 3980 3981 @example 3982 ^.*abcd$ 3983 @end example 3984 3985 @noindent 3986 the initial @code{.*} matches the entire string at first, but when 3987 this fails (because there is no following @samp{a}), it backtracks 3988 to match all but the last character, then all but the 3989 last two characters, and so on. Once again the search for 3990 @samp{a} covers the entire string, from right to left, so we are 3991 no better off.
However, if the pattern is written as

@example
^(?>.*)(?<=abcd)
@end example

@noindent
there can be no backtracking for the @code{.*} item; it can match
only the entire string. The subsequent lookbehind assertion
does a single test on the last four characters. If it fails,
the match fails immediately. For long strings, this approach
makes a significant difference to the processing time.

When a pattern contains an unlimited repeat inside a subpattern
that can itself be repeated an unlimited number of
times, the use of a non-backtracking subpattern is the only way to
avoid some failing matches taking a very long time
indeed.@footnote{Actually, the matcher embedded in @value{SSED}
tries to do something for this in the simplest cases,
like @code{([^b]*b)*}. These cases are actually quite
common: they happen for example in a regular expression
like @code{\/\*([^*]*\*)*\/} which matches C comments.}

The pattern

@example
(\D+|<\d+>)*[!?]
@end example

@c ([^0-9<]+<(\d+>)?)*[!?]

@noindent
matches an unlimited number of substrings that either consist
of non-digits, or digits enclosed in angle brackets, followed by
an exclamation or question mark. When it matches, it runs quickly.
However, if it is applied to

@example
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
@end example

@noindent
it takes a long time before reporting failure. This is
because the string can be divided between the two repeats in
a large number of ways, and all have to be tried.@footnote{The
example used @code{[!?]} rather than a single character at the end,
because both @value{SSED} and Perl have an optimization that allows
for fast failure when a single character is used. They
remember the last single character that is required for a
match, and fail early if it is not present in the string.}

If the pattern is changed to

@example
((?>\D+)|<\d+>)*[!?]
@end example

@noindent
sequences of non-digits cannot be broken, and failure happens
quickly.

@node Conditional subpatterns
@appendixsec Conditional subpatterns
@cindex Perl-style regular expressions, conditional subpatterns

It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative
subpatterns, depending on the result of an assertion, or
whether a previous capturing subpattern matched or not. The
two possible forms of conditional subpattern are

@example
(?(@var{condition})@var{yes-pattern})
(?(@var{condition})@var{yes-pattern}|@var{no-pattern})
@end example

If the condition is satisfied, the yes-pattern is used; otherwise
the no-pattern (if present) is used. If there are more than two
alternatives in the subpattern, a compile-time error occurs.

There are two kinds of condition. If the text between the
parentheses consists of a sequence of digits, the condition
is satisfied if the capturing subpattern of that number has
previously matched. The number must be greater than zero.
Consider the following pattern, which contains non-significant
white space to make it more readable (assume the @code{X} modifier)
and to divide it into three parts for ease of discussion:

@example
( \( )? [^()]+ (?(1) \) )
@end example

The first part matches an optional opening parenthesis, and
if that character is present, sets it as the first captured
substring. The second part matches one or more characters
that are not parentheses.
The third part is a conditional
subpattern that tests whether the first set of parentheses
matched or not. If they did, that is, if the subject started
with an opening parenthesis, the condition is true, and so
the yes-pattern is executed and a closing parenthesis is
required. Otherwise, since no-pattern is not present, the
subpattern matches nothing. In other words, this pattern
matches a sequence of non-parentheses, optionally enclosed
in parentheses.

@cindex Perl-style regular expressions, lookahead subpatterns
If the condition is not a sequence of digits, it must be an
assertion. This may be a positive or negative lookahead or
lookbehind assertion. Consider this pattern, again containing
non-significant white space, and with the two alternatives
on the second and third lines:

@example
(?(?=...[a-z])
   \d\d-[a-z]@{3@}-\d\d |
   \d\d-\d\d-\d\d )
@end example

The condition is a positive lookahead assertion that matches
a letter that is three characters away from the current point.
If a letter is found, the subject is matched against the first
alternative @samp{@var{dd}-@var{aaa}-@var{dd}} (where @var{aaa} are
letters and @var{dd} are digits); otherwise it is matched against
the second alternative, @samp{@var{dd}-@var{dd}-@var{dd}}.


@node Recursive patterns
@appendixsec Recursive patterns
@cindex Perl-style regular expressions, recursive patterns
@cindex Perl-style regular expressions, recursion

Consider the problem of matching a string in parentheses,
allowing for unlimited nested parentheses. Without the use
of recursion, the best that can be done is to use a pattern
that matches up to some fixed depth of nesting. It is not
possible to handle an arbitrary nesting depth.
Perl 5.6 provides
an experimental facility that allows regular
expressions to recurse (amongst other things). It does this
by interpolating Perl code in the expression at run time,
and the code can refer to the expression itself. A Perl
pattern to solve the parentheses problem can be created like
this:

@example
$re = qr@{\( (?: (?>[^()]+) | (?p@{$re@}) )* \)@}x;
@end example

The @code{(?p@{...@})} item interpolates Perl code at run time,
and in this case refers recursively to the pattern in which it
appears. Obviously, @command{sed} cannot support the interpolation of
Perl code. Instead, the special item @code{(?R)} is provided for
the specific case of recursion. This pattern solves the
parentheses problem (assume the @code{X} modifier option is used
so that white space is ignored):

@example
\( ( (?>[^()]+) | (?R) )* \)
@end example

First it matches an opening parenthesis. Then it matches any
number of substrings which can either be a sequence of
non-parentheses, or a recursive match of the pattern itself
(i.e. a correctly parenthesized substring). Finally there is
a closing parenthesis.

This particular example pattern contains nested unlimited
repeats, and so the use of a non-backtracking subpattern for
matching strings of non-parentheses is important when applying
the pattern to strings that do not match. For example, when
it is applied to

@example
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
@end example

@noindent
it yields a ``no match'' response quickly. However, if a
non-backtracking subpattern is not used, the match runs
for a very long time indeed because there are so many different
ways the @code{+} and @code{*} repeats can carve up the subject,
and all have to be tested before failure can be reported.
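
The recursive pattern above can be tried outside @command{sed} in any
engine that offers both recursion and non-backtracking groups. The
following is a minimal sketch in Ruby, not @command{sed} syntax: it
assumes Ruby's subexpression call @code{\g<paren>} as a stand-in for
@code{(?R)}, and the names @code{BALANCED}, @code{paren} and
@code{balanced?} are made up for this illustration.

```ruby
# Illustration only: Ruby's regex engine, not sed.
# The group name `paren` and the helper `balanced?` are hypothetical.
# \g<paren> re-enters the group recursively, standing in for (?R);
# (?>[^()]+) is the non-backtracking subpattern discussed in the text.
BALANCED = /\A(?<paren>\((?:(?>[^()]+)|\g<paren>)*\))\z/

# Returns true when the whole string is one correctly parenthesized group.
def balanced?(text)
  !BALANCED.match(text).nil?
end

puts balanced?("(ab(cd)ef)")            # nested and closed
puts balanced?("(" + "a" * 53 + "()")   # outer parenthesis never closed
```

The second call mirrors the failing subject shown above: because the
run of @samp{a} characters is held by the non-backtracking group, the
engine has no alternative ways to split it and can report failure
promptly.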

The values set for any capturing subpatterns are those from
the outermost level of the recursion at which the subpattern
value is set. If the pattern above is matched against

@example
(ab(cd)ef)
@end example

@noindent
the value for the capturing parentheses is @samp{ef}, which is
the last value taken on at the top level.

@node Comments
@appendixsec Comments
@cindex Perl-style regular expressions, comments

The sequence @code{(?#} marks the start of a comment that
continues up to the next closing parenthesis. Nested parentheses
are not permitted. The characters that make up a comment
play no part in the pattern matching at all.

@cindex Perl-style regular expressions, extended
If the @code{X} modifier option is used, an unescaped @code{#} character
outside a character class introduces a comment that continues
up to the next newline character in the pattern.
@end ifset


@page
@node GNU Free Documentation License
@appendix GNU Free Documentation License

@include fdl.texi