1\section{\module{re} ---
2 Regular expression operations}
3\declaremodule{standard}{re}
4\moduleauthor{Fredrik Lundh}{fredrik@pythonware.com}
5\sectionauthor{Andrew M. Kuchling}{amk@amk.ca}
6
7
8\modulesynopsis{Regular expression search and match operations with a
9 Perl-style expression syntax.}
10
11
12This module provides regular expression matching operations similar to
13those found in Perl. Regular expression pattern strings may not
14contain null bytes, but can specify the null byte using the
15\code{\e\var{number}} notation. Both patterns and strings to be
16searched can be Unicode strings as well as 8-bit strings. The
17\module{re} module is always available.
18
19Regular expressions use the backslash character (\character{\e}) to
20indicate special forms or to allow special characters to be used
21without invoking their special meaning. This collides with Python's
22usage of the same character for the same purpose in string literals;
23for example, to match a literal backslash, one might have to write
24\code{'\e\e\e\e'} as the pattern string, because the regular expression
25must be \samp{\e\e}, and each backslash must be expressed as
26\samp{\e\e} inside a regular Python string literal.
27
28The solution is to use Python's raw string notation for regular
29expression patterns; backslashes are not handled in any special way in
30a string literal prefixed with \character{r}. So \code{r"\e n"} is a
31two-character string containing \character{\e} and \character{n},
32while \code{"\e n"} is a one-character string containing a newline.
33Usually patterns will be expressed in Python code using this raw
34string notation.
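
The difference between the two notations can be checked directly. The
following is a minimal sketch; the variable names and the sample path
are chosen only for this illustration.

\begin{verbatim}
import re

# A regular literal processes the escape; a raw literal keeps the backslash.
newline = "\n"        # one character: a newline
raw = r"\n"           # two characters: a backslash and 'n'
lengths = (len(newline), len(raw))       # (1, 2)

# Matching a literal backslash: both spellings produce the same
# two-character pattern, but the raw string is easier to read.
plain_pattern = '\\\\'
raw_pattern = r'\\'
same = (plain_pattern == raw_pattern)    # True
found = re.search(raw_pattern, r'C:\temp') is not None    # True
\end{verbatim}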
35
36\begin{seealso}
37 \seetitle{Mastering Regular Expressions}{Book on regular expressions
38 by Jeffrey Friedl, published by O'Reilly. The second
39 edition of the book no longer covers Python at all,
40 but the first edition covered writing good regular expression
41 patterns in great detail.}
42\end{seealso}
43
44
45\subsection{Regular Expression Syntax \label{re-syntax}}
46
47A regular expression (or RE) specifies a set of strings that matches
48it; the functions in this module let you check if a particular string
49matches a given regular expression (or if a given regular expression
50matches a particular string, which comes down to the same thing).
51
52Regular expressions can be concatenated to form new regular
53expressions; if \emph{A} and \emph{B} are both regular expressions,
54then \emph{AB} is also a regular expression. In general, if a string
55\emph{p} matches \emph{A} and another string \emph{q} matches \emph{B},
the string \emph{pq} will match \emph{AB}. This holds unless \emph{A} or
\emph{B} contains low-precedence operations, imposes boundary conditions
between \emph{A} and \emph{B}, or uses numbered group references. Thus, complex
59expressions can easily be constructed from simpler primitive
60expressions like the ones described here. For details of the theory
61and implementation of regular expressions, consult the Friedl book
62referenced above, or almost any textbook about compiler construction.
63
64A brief explanation of the format of regular expressions follows. For
65further information and a gentler presentation, consult the Regular
66Expression HOWTO, accessible from \url{http://www.python.org/doc/howto/}.
67
68Regular expressions can contain both special and ordinary characters.
69Most ordinary characters, like \character{A}, \character{a}, or
70\character{0}, are the simplest regular expressions; they simply match
71themselves. You can concatenate ordinary characters, so \regexp{last}
72matches the string \code{'last'}. (In the rest of this section, we'll
write REs in \regexp{this special style}, usually without quotes, and
74strings to be matched \code{'in single quotes'}.)
75
76Some characters, like \character{|} or \character{(}, are special.
77Special characters either stand for classes of ordinary characters, or
78affect how the regular expressions around them are interpreted.
79
80The special characters are:
81%
82\begin{description}
83
84\item[\character{.}] (Dot.) In the default mode, this matches any
85character except a newline. If the \constant{DOTALL} flag has been
86specified, this matches any character including a newline.
87
88\item[\character{\textasciicircum}] (Caret.) Matches the start of the
89string, and in \constant{MULTILINE} mode also matches immediately
90after each newline.
91
92\item[\character{\$}] Matches the end of the string or just before the
93newline at the end of the string, and in \constant{MULTILINE} mode
94also matches before a newline. \regexp{foo} matches both 'foo' and
95'foobar', while the regular expression \regexp{foo\$} matches only
96'foo'. More interestingly, searching for \regexp{foo.\$} in
97'foo1\textbackslash nfoo2\textbackslash n' matches 'foo2' normally,
98but 'foo1' in \constant{MULTILINE} mode.
99
100\item[\character{*}] Causes the resulting RE to
101match 0 or more repetitions of the preceding RE, as many repetitions
102as are possible. \regexp{ab*} will
103match 'a', 'ab', or 'a' followed by any number of 'b's.
104
105\item[\character{+}] Causes the
106resulting RE to match 1 or more repetitions of the preceding RE.
107\regexp{ab+} will match 'a' followed by any non-zero number of 'b's; it
108will not match just 'a'.
109
110\item[\character{?}] Causes the resulting RE to
111match 0 or 1 repetitions of the preceding RE. \regexp{ab?} will
112match either 'a' or 'ab'.
113
114\item[\code{*?}, \code{+?}, \code{??}] The \character{*},
115\character{+}, and \character{?} qualifiers are all \dfn{greedy}; they
116match as much text as possible. Sometimes this behaviour isn't
117desired; if the RE \regexp{<.*>} is matched against
118\code{'<H1>title</H1>'}, it will match the entire string, and not just
119\code{'<H1>'}. Adding \character{?} after the qualifier makes it
120perform the match in \dfn{non-greedy} or \dfn{minimal} fashion; as
121\emph{few} characters as possible will be matched. Using \regexp{.*?}
122in the previous expression will match only \code{'<H1>'}.
123
124\item[\code{\{\var{m}\}}]
125Specifies that exactly \var{m} copies of the previous RE should be
126matched; fewer matches cause the entire RE not to match. For example,
127\regexp{a\{6\}} will match exactly six \character{a} characters, but
128not five.
129
130\item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from
131\var{m} to \var{n} repetitions of the preceding RE, attempting to
132match as many repetitions as possible. For example, \regexp{a\{3,5\}}
133will match from 3 to 5 \character{a} characters. Omitting \var{m}
134specifies a lower bound of zero,
135and omitting \var{n} specifies an infinite upper bound. As an
136example, \regexp{a\{4,\}b} will match \code{aaaab} or a thousand
137\character{a} characters followed by a \code{b}, but not \code{aaab}.
138The comma may not be omitted or the modifier would be confused with
139the previously described form.
140
141\item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to
142match from \var{m} to \var{n} repetitions of the preceding RE,
143attempting to match as \emph{few} repetitions as possible. This is
144the non-greedy version of the previous qualifier. For example, on the
1456-character string \code{'aaaaaa'}, \regexp{a\{3,5\}} will match 5
146\character{a} characters, while \regexp{a\{3,5\}?} will only match 3
147characters.
148
149\item[\character{\e}] Either escapes special characters (permitting
150you to match characters like \character{*}, \character{?}, and so
151forth), or signals a special sequence; special sequences are discussed
152below.
153
154If you're not using a raw string to
155express the pattern, remember that Python also uses the
156backslash as an escape sequence in string literals; if the escape
157sequence isn't recognized by Python's parser, the backslash and
158subsequent character are included in the resulting string. However,
if Python would recognize the resulting sequence, the backslash should
be doubled. This is complicated and hard to understand, so
161it's highly recommended that you use raw strings for all but the
162simplest expressions.
163
164\item[\code{[]}] Used to indicate a set of characters. Characters can
165be listed individually, or a range of characters can be indicated by
166giving two characters and separating them by a \character{-}. Special
167characters are not active inside sets. For example, \regexp{[akm\$]}
168will match any of the characters \character{a}, \character{k},
169\character{m}, or \character{\$}; \regexp{[a-z]}
170will match any lowercase letter, and \code{[a-zA-Z0-9]} matches any
171letter or digit. Character classes such as \code{\e w} or \code{\e S}
(defined below) are also acceptable inside a set. If you want to
173include a \character{]} or a \character{-} inside a set, precede it with a
174backslash, or place it as the first character. The
175pattern \regexp{[]]} will match \code{']'}, for example.
176
You can match the characters not within a set by \dfn{complementing}
178the set. This is indicated by including a
179\character{\textasciicircum} as the first character of the set;
180\character{\textasciicircum} elsewhere will simply match the
181\character{\textasciicircum} character. For example,
182\regexp{[{\textasciicircum}5]} will match
183any character except \character{5}, and
184\regexp{[\textasciicircum\code{\textasciicircum}]} will match any character
185except \character{\textasciicircum}.
186
187\item[\character{|}]\code{A|B}, where A and B can be arbitrary REs,
188creates a regular expression that will match either A or B. An
189arbitrary number of REs can be separated by the \character{|} in this
190way. This can be used inside groups (see below) as well. As the target
191string is scanned, REs separated by \character{|} are tried from left to
192right. When one pattern completely matches, that branch is accepted.
193This means that once \code{A} matches, \code{B} will not be tested further,
194even if it would produce a longer overall match. In other words, the
195\character{|} operator is never greedy. To match a literal \character{|},
196use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.
197
198\item[\code{(...)}] Matches whatever regular expression is inside the
199parentheses, and indicates the start and end of a group; the contents
200of a group can be retrieved after a match has been performed, and can
201be matched later in the string with the \regexp{\e \var{number}} special
202sequence, described below. To match the literals \character{(} or
203\character{)}, use \regexp{\e(} or \regexp{\e)}, or enclose them
204inside a character class: \regexp{[(] [)]}.
205
206\item[\code{(?...)}] This is an extension notation (a \character{?}
207following a \character{(} is not meaningful otherwise). The first
208character after the \character{?}
209determines what the meaning and further syntax of the construct is.
210Extensions usually do not create a new group;
211\regexp{(?P<\var{name}>...)} is the only exception to this rule.
212Following are the currently supported extensions.
213
214\item[\code{(?iLmsux)}] (One or more letters from the set \character{i},
215\character{L}, \character{m}, \character{s}, \character{u},
216\character{x}.) The group matches the empty string; the letters set
217the corresponding flags (\constant{re.I}, \constant{re.L},
218\constant{re.M}, \constant{re.S}, \constant{re.U}, \constant{re.X})
219for the entire regular expression. This is useful if you wish to
220include the flags as part of the regular expression, instead of
221passing a \var{flag} argument to the \function{compile()} function.
222
223Note that the \regexp{(?x)} flag changes how the expression is parsed.
224It should be used first in the expression string, or after one or more
225whitespace characters. If there are non-whitespace characters before
226the flag, the results are undefined.
227
228\item[\code{(?:...)}] A non-grouping version of regular parentheses.
229Matches whatever regular expression is inside the parentheses, but the
230substring matched by the
231group \emph{cannot} be retrieved after performing a match or
232referenced later in the pattern.
233
234\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
235the substring matched by the group is accessible via the symbolic group
236name \var{name}. Group names must be valid Python identifiers, and
237each group name must be defined only once within a regular expression. A
238symbolic group is also a numbered group, just as if the group were not
named. So the group named 'id' in the example below can also be
240referenced as the numbered group 1.
241
242For example, if the pattern is
243\regexp{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its
244name in arguments to methods of match objects, such as
245\code{m.group('id')} or \code{m.end('id')}, and also by name in
246pattern text (for example, \regexp{(?P=id)}) and replacement text
247(such as \code{\e g<id>}).
248
249\item[\code{(?P=\var{name})}] Matches whatever text was matched by the
250earlier group named \var{name}.
251
252\item[\code{(?\#...)}] A comment; the contents of the parentheses are
253simply ignored.
254
255\item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't
256consume any of the string. This is called a lookahead assertion. For
257example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's
258followed by \code{'Asimov'}.
259
260\item[\code{(?!...)}] Matches if \regexp{...} doesn't match next. This
261is a negative lookahead assertion. For example,
262\regexp{Isaac (?!Asimov)} will match \code{'Isaac~'} only if it's \emph{not}
263followed by \code{'Asimov'}.
264
265\item[\code{(?<=...)}] Matches if the current position in the string
266is preceded by a match for \regexp{...} that ends at the current
267position. This is called a \dfn{positive lookbehind assertion}.
268\regexp{(?<=abc)def} will find a match in \samp{abcdef}, since the
269lookbehind will back up 3 characters and check if the contained
270pattern matches. The contained pattern must only match strings of
271some fixed length, meaning that \regexp{abc} or \regexp{a|b} are
272allowed, but \regexp{a*} and \regexp{a\{3,4\}} are not. Note that
273patterns which start with positive lookbehind assertions will never
274match at the beginning of the string being searched; you will most
275likely want to use the \function{search()} function rather than the
276\function{match()} function:
277
\begin{verbatim}
>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'
\end{verbatim}
284
285This example looks for a word following a hyphen:
286
\begin{verbatim}
>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
>>> m.group(0)
'egg'
\end{verbatim}
292
293\item[\code{(?<!...)}] Matches if the current position in the string
294is not preceded by a match for \regexp{...}. This is called a
295\dfn{negative lookbehind assertion}. Similar to positive lookbehind
296assertions, the contained pattern must only match strings of some
297fixed length. Patterns which start with negative lookbehind
298assertions may match at the beginning of the string being searched.
299
300\item[\code{(?(\var{id/name})yes-pattern|no-pattern)}] Will try to match
301with \regexp{yes-pattern} if the group with given \var{id} or \var{name}
302exists, and with \regexp{no-pattern} if it doesn't. \regexp{|no-pattern}
303is optional and can be omitted. For example,
304\regexp{(<)?(\e w+@\e w+(?:\e .\e w+)+)(?(1)>)} is a poor email matching
305pattern, which will match with \code{'<user@host.com>'} as well as
306\code{'user@host.com'}, but not with \code{'<user@host.com'}.
307\versionadded{2.4}
308
309\end{description}
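
Several of the constructs described above can be tried out directly.
The following is a minimal sketch rather than a complete tour; the
sample strings come from the descriptions above where possible, and the
variable names are chosen only for this illustration.

\begin{verbatim}
import re

# Greedy and non-greedy repetition on the same input.
tag = '<H1>title</H1>'
greedy = re.match(r'<.*>', tag).group(0)      # '<H1>title</H1>'
minimal = re.match(r'<.*?>', tag).group(0)    # '<H1>'

# Alternation tries branches from left to right and accepts the first
# one that matches, even if a later branch would match more text.
first_branch = re.match(r'a|ab', 'ab').group(0)    # 'a'

# A named group is also available under its number.
m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam_1 = 2')
by_name = m.group('id')       # 'spam_1'
by_number = m.group(1)        # 'spam_1'

# Lookahead checks what follows without consuming it.
ahead = re.match(r'Isaac (?=Asimov)', 'Isaac Asimov').group(0)  # 'Isaac '
not_ahead = re.match(r'Isaac (?!Asimov)', 'Isaac Asimov')       # None

# The conditional pattern requires '>' only if '<' was matched.
email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')
bracketed = email.match('<user@host.com>') is not None    # True
bare = email.match('user@host.com') is not None           # True
unclosed = email.match('<user@host.com') is not None      # False
\end{verbatim}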
310
311The special sequences consist of \character{\e} and a character from the
312list below. If the ordinary character is not on the list, then the
313resulting RE will match the second character. For example,
314\regexp{\e\$} matches the character \character{\$}.
315%
316\begin{description}
317
318\item[\code{\e \var{number}}] Matches the contents of the group of the
319same number. Groups are numbered starting from 1. For example,
320\regexp{(.+) \e 1} matches \code{'the the'} or \code{'55 55'}, but not
321\code{'the end'} (note
322the space after the group). This special sequence can only be used to
323match one of the first 99 groups. If the first digit of \var{number}
324is 0, or \var{number} is 3 octal digits long, it will not be interpreted
325as a group match, but as the character with octal value \var{number}.
326Inside the \character{[} and \character{]} of a character class, all numeric
327escapes are treated as characters.
328
329\item[\code{\e A}] Matches only at the start of the string.
330
331\item[\code{\e b}] Matches the empty string, but only at the
332beginning or end of a word. A word is defined as a sequence of
333alphanumeric or underscore characters, so the end of a word is indicated by
334whitespace or a non-alphanumeric, non-underscore character. Note that
335{}\code{\e b} is defined as the boundary between \code{\e w} and \code{\e
336W}, so the precise set of characters deemed to be alphanumeric depends on the
337values of the \code{UNICODE} and \code{LOCALE} flags. Inside a character
338range, \regexp{\e b} represents the backspace character, for compatibility
339with Python's string literals.
340
341\item[\code{\e B}] Matches the empty string, but only when it is \emph{not}
342at the beginning or end of a word. This is just the opposite of {}\code{\e
343b}, so is also subject to the settings of \code{LOCALE} and \code{UNICODE}.
344
345\item[\code{\e d}]When the \constant{UNICODE} flag is not specified, matches
346any decimal digit; this is equivalent to the set \regexp{[0-9]}.
347With \constant{UNICODE}, it will match whatever is classified as a digit
348in the Unicode character properties database.
349
350\item[\code{\e D}]When the \constant{UNICODE} flag is not specified, matches
351any non-digit character; this is equivalent to the set
352\regexp{[{\textasciicircum}0-9]}. With \constant{UNICODE}, it will match
anything other than characters marked as digits in the Unicode character
354properties database.
355
356\item[\code{\e s}]When the \constant{LOCALE} and \constant{UNICODE}
357flags are not specified, matches any whitespace character; this is
358equivalent to the set \regexp{[ \e t\e n\e r\e f\e v]}.
359With \constant{LOCALE}, it will match this set plus whatever characters
360are defined as space for the current locale. If \constant{UNICODE} is set,
361this will match the characters \regexp{[ \e t\e n\e r\e f\e v]} plus
362whatever is classified as space in the Unicode character properties
363database.
364
365\item[\code{\e S}]When the \constant{LOCALE} and \constant{UNICODE}
366flags are not specified, matches any non-whitespace character; this is
equivalent to the set \regexp{[\textasciicircum\ \e t\e n\e r\e f\e v]}.
368With \constant{LOCALE}, it will match any character not in this set,
369and not defined as space in the current locale. If \constant{UNICODE}
370is set, this will match anything other than \regexp{[ \e t\e n\e r\e f\e v]}
371and characters marked as space in the Unicode character properties database.
372
373\item[\code{\e w}]When the \constant{LOCALE} and \constant{UNICODE}
374flags are not specified, matches any alphanumeric character and the
375underscore; this is equivalent to the set
376\regexp{[a-zA-Z0-9_]}. With \constant{LOCALE}, it will match the set
377\regexp{[0-9_]} plus whatever characters are defined as alphanumeric for
378the current locale. If \constant{UNICODE} is set, this will match the
379characters \regexp{[0-9_]} plus whatever is classified as alphanumeric
380in the Unicode character properties database.
381
382\item[\code{\e W}]When the \constant{LOCALE} and \constant{UNICODE}
383flags are not specified, matches any non-alphanumeric character; this
384is equivalent to the set \regexp{[{\textasciicircum}a-zA-Z0-9_]}. With
385\constant{LOCALE}, it will match any character not in the set
386\regexp{[0-9_]}, and not defined as alphanumeric for the current locale.
387If \constant{UNICODE} is set, this will match anything other than
388\regexp{[0-9_]} and characters marked as alphanumeric in the Unicode
389character properties database.
390
391\item[\code{\e Z}]Matches only at the end of the string.
392
393\end{description}
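
A few of these sequences are easiest to understand by comparing a
match with a near miss. The following is a minimal sketch; the sample
strings and variable names are chosen only for this illustration.

\begin{verbatim}
import re

# \1 must repeat exactly what group 1 matched.
repeated = re.match(r'(.+) \1', 'the the') is not None     # True
different = re.match(r'(.+) \1', 'the end') is not None    # False

# \b matches only at a word boundary, \B only away from one.
whole_word = re.search(r'\bfoo\b', 'a foo b') is not None  # True
inside_word = re.search(r'\bfoo\b', 'afoob') is not None   # False
embedded = re.search(r'\Bfoo\B', 'afoob') is not None      # True

# \A anchors to the start of the string even in MULTILINE mode,
# whereas '^' then also matches after each newline.
start_only = re.findall(r'(?m)\Aa', 'a\na')    # ['a']
every_line = re.findall(r'(?m)^a', 'a\na')     # ['a', 'a']
\end{verbatim}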
394
395Most of the standard escapes supported by Python string literals are
396also accepted by the regular expression parser:
397
\begin{verbatim}
\a      \b      \f      \n
\r      \t      \v      \x
\\
\end{verbatim}
403
404Octal escapes are included in a limited form: If the first digit is a
4050, or if there are three octal digits, it is considered an octal
406escape. Otherwise, it is a group reference. As for string literals,
407octal escapes are always at most three digits in length.
408
409
410% Note the lack of a period in the section title; it causes problems
411% with readers of the GNU info version. See http://www.python.org/sf/581414.
412\subsection{Matching vs Searching \label{matching-searching}}
413\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}
414
415Python offers two different primitive operations based on regular
416expressions: match and search. If you are accustomed to Perl's
417semantics, the search operation is what you're looking for. See the
418\function{search()} function and corresponding method of compiled
419regular expression objects.
420
421Note that match may differ from search using a regular expression
422beginning with \character{\textasciicircum}:
423\character{\textasciicircum} matches only at the
424start of the string, or in \constant{MULTILINE} mode also immediately
425following a newline. The ``match'' operation succeeds only if the
426pattern matches at the start of the string regardless of mode, or at
427the starting position given by the optional \var{pos} argument
428regardless of whether a newline precedes it.
429
430% Examples from Tim Peters:
\begin{verbatim}
import re

re.compile("a").match("ba", 1)           # succeeds
re.compile("^a").search("ba", 1)         # fails; 'a' not at start
re.compile("^a").search("\na", 1)        # fails; 'a' not at start
re.compile("^a", re.M).search("\na", 1)  # succeeds
re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n
\end{verbatim}
438
439
440\subsection{Module Contents}
441\nodename{Contents of Module re}
442
443The module defines several functions, constants, and an exception. Some of the
444functions are simplified versions of the full featured methods for compiled
445regular expressions. Most non-trivial applications always use the compiled
446form.
447
448\begin{funcdesc}{compile}{pattern\optional{, flags}}
449 Compile a regular expression pattern into a regular expression
450 object, which can be used for matching using its \function{match()} and
451 \function{search()} methods, described below.
452
453 The expression's behaviour can be modified by specifying a
454 \var{flags} value. Values can be any of the following variables,
455 combined using bitwise OR (the \code{|} operator).
456
457The sequence
458
\begin{verbatim}
prog = re.compile(pat)
result = prog.match(str)
\end{verbatim}

is equivalent to

\begin{verbatim}
result = re.match(pat, str)
\end{verbatim}
469
470but the version using \function{compile()} is more efficient when the
471expression will be used several times in a single program.
472%(The compiled version of the last pattern passed to
473%\function{re.match()} or \function{re.search()} is cached, so
474%programs that use only a single regular expression at a time needn't
475%worry about compiling regular expressions.)
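
As an illustration of combining flags, the following minimal sketch
compiles one pattern with and without \constant{DOTALL}; the pattern,
the sample strings and the variable names are chosen only for this
example.

\begin{verbatim}
import re

# Combine flags with bitwise OR when compiling.
prog = re.compile(r'^spam.eggs$', re.I | re.S)
hit = prog.match('SPAM\nEGGS') is not None      # True

# The same pattern without DOTALL: '.' refuses the newline.
strict = re.compile(r'^spam.eggs$', re.I)
miss = strict.match('SPAM\nEGGS') is None       # True

# The compiled object can be reused for several strings.
results = [bool(prog.match(s))
           for s in ('Spam eggs', 'spam', 'sPaM-EgGs')]
# results == [True, False, True]
\end{verbatim}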
476\end{funcdesc}
477
478\begin{datadesc}{I}
479\dataline{IGNORECASE}
480Perform case-insensitive matching; expressions like \regexp{[A-Z]}
481will match lowercase letters, too. This is not affected by the
482current locale.
483\end{datadesc}
484
485\begin{datadesc}{L}
486\dataline{LOCALE}
487Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, \regexp{\e B},
488\regexp{\e s} and \regexp{\e S} dependent on the current locale.
489\end{datadesc}
490
491\begin{datadesc}{M}
492\dataline{MULTILINE}
493When specified, the pattern character \character{\textasciicircum}
494matches at the beginning of the string and at the beginning of each
495line (immediately following each newline); and the pattern character
496\character{\$} matches at the end of the string and at the end of each
497line (immediately preceding each newline). By default,
498\character{\textasciicircum} matches only at the beginning of the
499string, and \character{\$} only at the end of the string and
500immediately before the newline (if any) at the end of the string.
501\end{datadesc}
502
503\begin{datadesc}{S}
504\dataline{DOTALL}
505Make the \character{.} special character match any character at all,
506including a newline; without this flag, \character{.} will match
507anything \emph{except} a newline.
508\end{datadesc}
509
510\begin{datadesc}{U}
511\dataline{UNICODE}
512Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, \regexp{\e B},
513\regexp{\e d}, \regexp{\e D}, \regexp{\e s} and \regexp{\e S}
514dependent on the Unicode character properties database.
515\versionadded{2.0}
516\end{datadesc}
517
518\begin{datadesc}{X}
519\dataline{VERBOSE}
520This flag allows you to write regular expressions that look nicer.
521Whitespace within the pattern is ignored,
522except when in a character class or preceded by an unescaped
523backslash, and, when a line contains a \character{\#} neither in a
character class nor preceded by an unescaped backslash, all characters
525from the leftmost such \character{\#} through the end of the line are
526ignored.
527% XXX should add an example here
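
As a minimal sketch of what this looks like in practice, the two
patterns below describe the same expression; the pattern and the names
\code{compact} and \code{verbose} are chosen only for this illustration.

\begin{verbatim}
import re

# The same pattern written twice; only the layout differs.
compact = re.compile(r'\d+\.\d*')
verbose = re.compile(r"""
    \d+    # the integral part
    \.     # the decimal point
    \d*    # some fractional digits
    """, re.X)

same_result = (compact.match('3.14 rad').group(0) ==
               verbose.match('3.14 rad').group(0))    # True
\end{verbatim}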
528\end{datadesc}
529
530
531\begin{funcdesc}{search}{pattern, string\optional{, flags}}
532 Scan through \var{string} looking for a location where the regular
533 expression \var{pattern} produces a match, and return a
534 corresponding \class{MatchObject} instance.
535 Return \code{None} if no
536 position in the string matches the pattern; note that this is
537 different from finding a zero-length match at some point in the string.
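
  The distinction between no match and a zero-length match can be seen
  in a minimal sketch such as the following (the variable names are
  chosen only for this illustration):

\begin{verbatim}
import re

# No position matches: search() returns None.
missing = re.search(r'x+', 'abc')        # None

# A zero-length match is still a match object.
empty = re.search(r'x*', 'abc')
empty_text = empty.group(0)              # ''
empty_span = empty.span()                # (0, 0)
\end{verbatim}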
538\end{funcdesc}
539
540\begin{funcdesc}{match}{pattern, string\optional{, flags}}
541 If zero or more characters at the beginning of \var{string} match
542 the regular expression \var{pattern}, return a corresponding
543 \class{MatchObject} instance. Return \code{None} if the string does not
544 match the pattern; note that this is different from a zero-length
545 match.
546
547 \note{If you want to locate a match anywhere in
548 \var{string}, use \method{search()} instead.}
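
  A minimal sketch of the difference, using a sample string and
  variable names chosen only for this illustration:

\begin{verbatim}
import re

text = 'spam and eggs'
at_start = re.match(r'spam', text) is not None   # True
not_at_start = re.match(r'eggs', text)           # None: not at index 0
anywhere = re.search(r'eggs', text).start()      # 9
\end{verbatim}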
549\end{funcdesc}
550
551\begin{funcdesc}{split}{pattern, string\optional{, maxsplit\code{ = 0}}}
552 Split \var{string} by the occurrences of \var{pattern}. If
553 capturing parentheses are used in \var{pattern}, then the text of all
  groups in the pattern is also returned as part of the resulting list.
555 If \var{maxsplit} is nonzero, at most \var{maxsplit} splits
556 occur, and the remainder of the string is returned as the final
557 element of the list. (Incompatibility note: in the original Python
558 1.5 release, \var{maxsplit} was ignored. This has been fixed in
559 later releases.)
560
\begin{verbatim}
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
\end{verbatim}
569\end{funcdesc}
570
571\begin{funcdesc}{findall}{pattern, string\optional{, flags}}
572 Return a list of all non-overlapping matches of \var{pattern} in
573 \var{string}. If one or more groups are present in the pattern,
574 return a list of groups; this will be a list of tuples if the
575 pattern has more than one group. Empty matches are included in the
576 result unless they touch the beginning of another match.
577 \versionadded{1.5.2}
578 \versionchanged[Added the optional flags argument]{2.4}
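
  How the shape of the result depends on the number of groups can be
  seen in a minimal sketch such as the following; the sample text and
  variable names are chosen only for this illustration.

\begin{verbatim}
import re

text = 'set width=80 height=24'

# No groups: a list of whole matches.
whole = re.findall(r'\w+=\d+', text)      # ['width=80', 'height=24']

# One group: a list of that group's text.
values = re.findall(r'\w+=(\d+)', text)   # ['80', '24']

# Several groups: a list of tuples.
pairs = re.findall(r'(\w+)=(\d+)', text)
# pairs == [('width', '80'), ('height', '24')]
\end{verbatim}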
579\end{funcdesc}
580
581\begin{funcdesc}{finditer}{pattern, string\optional{, flags}}
582 Return an iterator over all non-overlapping matches for the RE
583 \var{pattern} in \var{string}. For each match, the iterator returns
584 a match object. Empty matches are included in the result unless they
585 touch the beginning of another match.
586 \versionadded{2.2}
587 \versionchanged[Added the optional flags argument]{2.4}
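
  Because each item is a match object, positions are available as well
  as text. A minimal sketch, with a sample string and variable names
  chosen only for this illustration:

\begin{verbatim}
import re

spans = []
for m in re.finditer(r'\d+', 'a1 bb22 ccc333'):
    spans.append((m.group(0), m.start(), m.end()))
# spans == [('1', 1, 2), ('22', 5, 7), ('333', 11, 14)]
\end{verbatim}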
588\end{funcdesc}
589
590\begin{funcdesc}{sub}{pattern, repl, string\optional{, count}}
591 Return the string obtained by replacing the leftmost non-overlapping
592 occurrences of \var{pattern} in \var{string} by the replacement
593 \var{repl}. If the pattern isn't found, \var{string} is returned
594 unchanged. \var{repl} can be a string or a function; if it is a
595 string, any backslash escapes in it are processed. That is,
596 \samp{\e n} is converted to a single newline character, \samp{\e r}
597 is converted to a linefeed, and so forth. Unknown escapes such as
598 \samp{\e j} are left alone. Backreferences, such as \samp{\e6}, are
599 replaced with the substring matched by group 6 in the pattern. For
600 example:
601
\begin{verbatim}
>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'
\end{verbatim}
608
609 If \var{repl} is a function, it is called for every non-overlapping
610 occurrence of \var{pattern}. The function takes a single match
611 object argument, and returns the replacement string. For example:
612
\begin{verbatim}
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
\end{verbatim}
620
621 The pattern may be a string or an RE object; if you need to specify
  regular expression flags, you must use an RE object, or use embedded
623 modifiers in a pattern; for example, \samp{sub("(?i)b+", "x", "bbbb
624 BBBB")} returns \code{'x x'}.
625
626 The optional argument \var{count} is the maximum number of pattern
627 occurrences to be replaced; \var{count} must be a non-negative
628 integer. If omitted or zero, all occurrences will be replaced.
629 Empty matches for the pattern are replaced only when not adjacent to
630 a previous match, so \samp{sub('x*', '-', 'abc')} returns
631 \code{'-a-b-c-'}.
632
633 In addition to character escapes and backreferences as described
634 above, \samp{\e g<name>} will use the substring matched by the group
635 named \samp{name}, as defined by the \regexp{(?P<name>...)} syntax.
636 \samp{\e g<number>} uses the corresponding group number;
637 \samp{\e g<2>} is therefore equivalent to \samp{\e 2}, but isn't
638 ambiguous in a replacement such as \samp{\e g<2>0}. \samp{\e 20}
639 would be interpreted as a reference to group 20, not a reference to
640 group 2 followed by the literal character \character{0}. The
641 backreference \samp{\e g<0>} substitutes in the entire substring
642 matched by the RE.
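
  The group references and the \var{count} argument can be illustrated
  with a minimal sketch such as the following; the patterns, sample
  strings and variable names are chosen only for this example.

\begin{verbatim}
import re

# Swap two named groups using \g<name>.
swapped = re.sub(r'(?P<key>\w+)=(?P<value>\w+)',
                 r'\g<value>=\g<key>', 'width=80')   # '80=width'

# \g<1>0 is group 1 followed by a literal '0'; r'\10' would instead
# be read as a reference to group 10.
padded = re.sub(r'(\d)', r'\g<1>0', 'a1b2')          # 'a10b20'

# \g<0> stands for the whole match.
quoted = re.sub(r'\w+', r'<\g<0>>', 'two words')     # '<two> <words>'

# count limits the number of replacements.
limited = re.sub(r'-', '+', 'a-b-c', count=1)        # 'a+b-c'
\end{verbatim}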
643\end{funcdesc}
644
645\begin{funcdesc}{subn}{pattern, repl, string\optional{, count}}
646 Perform the same operation as \function{sub()}, but return a tuple
647 \code{(\var{new_string}, \var{number_of_subs_made})}.
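
  A minimal sketch, with a sample string chosen only for this
  illustration:

\begin{verbatim}
import re

# Collapse each run of whitespace into a single space and count the runs.
new_string, number_of_subs_made = re.subn(r'\s+', ' ', 'a  b \t c')
# new_string == 'a b c', number_of_subs_made == 2
\end{verbatim}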
648\end{funcdesc}
649
650\begin{funcdesc}{escape}{string}
651 Return \var{string} with all non-alphanumerics backslashed; this is
652 useful if you want to match an arbitrary literal string that may have
653 regular expression metacharacters in it.
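
  The effect is easiest to see by using the same text with and without
  escaping. The following is a minimal sketch; the sample strings and
  variable names are chosen only for this illustration, and the exact
  escaped form is deliberately not shown.

\begin{verbatim}
import re

literal = 'a.b*c'
pattern = re.escape(literal)

# The escaped pattern matches the literal text only.
exact = re.match(pattern, 'a.b*c') is not None      # True
strict = re.match(pattern, 'axbbbc') is not None    # False

# Unescaped, '.' and '*' act as metacharacters and match other text.
loose = re.match(literal, 'axbbbc') is not None     # True
\end{verbatim}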
\end{funcdesc}

\begin{excdesc}{error}
  Exception raised when a string passed to one of the functions here
  is not a valid regular expression (for example, it might contain
  unmatched parentheses) or when some other error occurs during
  compilation or matching. It is never an error if a string contains
  no match for a pattern.
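
  A sketch of the distinction between an invalid pattern and a failed
  match (the flag name \code{compiled_ok} is illustrative only):

\begin{verbatim}
import re

# An unmatched parenthesis makes the pattern invalid.
try:
    re.compile('(unclosed')
    compiled_ok = True
except re.error:
    compiled_ok = False

# A valid pattern that finds nothing raises no exception;
# the result is simply None.
no_match = re.search('x', 'abc')
\end{verbatim}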
\end{excdesc}


\subsection{Regular Expression Objects \label{re-objects}}

Compiled regular expression objects support the following methods and
attributes:

\begin{methoddesc}[RegexObject]{match}{string\optional{, pos\optional{,
                                  endpos}}}
  If zero or more characters at the beginning of \var{string} match
  this regular expression, return a corresponding
  \class{MatchObject} instance. Return \code{None} if the string does not
  match the pattern; note that this is different from a zero-length
  match.

  \note{If you want to locate a match anywhere in
        \var{string}, use \method{search()} instead.}

  The optional second parameter \var{pos} gives an index in the string
  where the search is to start; it defaults to \code{0}. This is not
  completely equivalent to slicing the string; the
  \code{'\textasciicircum'} pattern
  character matches at the real beginning of the string and at positions
  just after a newline, but not necessarily at the index where the search
  is to start.

  The optional parameter \var{endpos} limits how far the string will
  be searched; it will be as if the string is \var{endpos} characters
  long, so only the characters from \var{pos} to \code{\var{endpos} -
  1} will be searched for a match. If \var{endpos} is less than
  \var{pos}, no match will be found; otherwise, if \var{rx} is a
  compiled regular expression object,
  \code{\var{rx}.match(\var{string}, 0, 50)} is equivalent to
  \code{\var{rx}.match(\var{string}[:50], 0)}.
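
  The difference between \var{pos} and slicing, and the effect of
  \var{endpos}, can be sketched as follows (patterns, strings and
  variable names are invented for the illustration):

\begin{verbatim}
import re

rx = re.compile('^b')

# pos=1 starts the attempt at 'b', but '^' still refers to the real
# beginning of the string, so this attempt fails.
with_pos = rx.match('abc', 1)            # None

# Slicing creates a new string whose real beginning is 'b'.
with_slice = rx.match('abc'[1:])         # matches 'b'

# endpos behaves as if the string were truncated at that index.
word = re.compile('abc')
limited = word.match('abcdef', 0, 2)     # None: only 'ab' is visible
truncated = word.match('abcdef'[:2], 0)  # None as well
\end{verbatim}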
\end{methoddesc}

\begin{methoddesc}[RegexObject]{search}{string\optional{, pos\optional{,
                                   endpos}}}
  Scan through \var{string} looking for a location where this regular
  expression produces a match, and return a
  corresponding \class{MatchObject} instance. Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.

  The optional \var{pos} and \var{endpos} parameters have the same
  meaning as for the \method{match()} method.
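
  A brief sketch contrasting \method{search()} with \method{match()};
  the sample text is hypothetical:

\begin{verbatim}
import re

rx = re.compile('dog')

anchored = rx.match('hot dog')          # None: 'dog' is not at index 0
found = rx.search('hot dog')            # match starting at index 4
clipped = rx.search('hot dog', 0, 5)    # None: only 'hot d' is searched

# A zero-length match is still a match object, not None.
empty = re.compile('x*').search('abc')  # matches '' at index 0
\end{verbatim}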
\end{methoddesc}

\begin{methoddesc}[RegexObject]{split}{string\optional{,
                                  maxsplit\code{ = 0}}}
Identical to the \function{split()} function, using the compiled pattern.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{findall}{string\optional{, pos\optional{,
                                    endpos}}}
Identical to the \function{findall()} function, using the compiled pattern.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{finditer}{string\optional{, pos\optional{,
                                     endpos}}}
Identical to the \function{finditer()} function, using the compiled pattern.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{sub}{repl, string\optional{, count\code{ = 0}}}
Identical to the \function{sub()} function, using the compiled pattern.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{subn}{repl, string\optional{,
                                 count\code{ = 0}}}
Identical to the \function{subn()} function, using the compiled pattern.
\end{methoddesc}


\begin{memberdesc}[RegexObject]{flags}
The flags argument used when the RE object was compiled, or
\code{0} if no flags were provided.
\end{memberdesc}

\begin{memberdesc}[RegexObject]{groupindex}
A dictionary mapping any symbolic group names defined by
\regexp{(?P<\var{id}>)} to group numbers. The dictionary is empty if no
symbolic groups were used in the pattern.
\end{memberdesc}

\begin{memberdesc}[RegexObject]{pattern}
The pattern string from which the RE object was compiled.
\end{memberdesc}
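
The three attributes above might be inspected as in the sketch below.
The pattern is a made-up example. The flag is checked with a bitwise
test rather than an equality comparison, on the assumption that some
interpreter versions may set additional flag bits of their own.

\begin{verbatim}
import re

rx = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})', re.IGNORECASE)

source = rx.pattern             # the original pattern string
names = dict(rx.groupindex)     # {'year': 1, 'month': 2}
ignores_case = bool(rx.flags & re.IGNORECASE)
\end{verbatim}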


\subsection{Match Objects \label{match-objects}}

\class{MatchObject} instances support the following methods and
attributes:

\begin{methoddesc}[MatchObject]{expand}{template}
Return the string obtained by doing backslash substitution on the
template string \var{template}, as done by the \method{sub()} method.
Escapes such as \samp{\e n} are converted to the appropriate
characters, and numeric backreferences (\samp{\e 1}, \samp{\e 2}) and
named backreferences (\samp{\e g<1>}, \samp{\e g<name>}) are replaced
by the contents of the corresponding group.
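
For instance, assuming a hypothetical pattern with two named groups,
a template can mix named and numeric backreferences:

\begin{verbatim}
import re

m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Ada Lovelace')

# \g<last> refers to the named group, \1 to the first group.
formatted = m.expand(r'\g<last>, \1')    # 'Lovelace, Ada'
\end{verbatim}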
\end{methoddesc}

\begin{methoddesc}[MatchObject]{group}{\optional{group1, \moreargs}}
Return one or more subgroups of the match. If there is a single
argument, the result is a single string; if there are
multiple arguments, the result is a tuple with one item per argument.
Without arguments, \var{group1} defaults to zero (the whole match
is returned).
If a \var{groupN} argument is zero, the corresponding return value is the
entire matching string; if it is in the inclusive range [1..99], it is
the string matching the corresponding parenthesized group. If a
group number is negative or larger than the number of groups defined
in the pattern, an \exception{IndexError} exception is raised.
If a group is contained in a part of the pattern that did not match,
the corresponding result is \code{None}. If a group is contained in a
part of the pattern that matched multiple times, the last match is
returned.

If the regular expression uses the \regexp{(?P<\var{name}>...)} syntax,
the \var{groupN} arguments may also be strings identifying groups by
their group name. If a string argument is not used as a group name in
the pattern, an \exception{IndexError} exception is raised.

A moderately complicated example:

\begin{verbatim}
m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
\end{verbatim}

After performing this match, \code{m.group(1)} is \code{'3'}, as is
\code{m.group('int')}, and \code{m.group(2)} is \code{'14'}.
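
The remaining cases described above (several arguments, a group that
did not match, a group that matched repeatedly, and an undefined
group) can be sketched as follows; apart from the first pattern, the
patterns and names are invented for the illustration:

\begin{verbatim}
import re

m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
pair = m.group(1, 2)          # several arguments give a tuple
mixed = m.group('int', 2)     # names and numbers can be mixed

# A group in a branch that did not match yields None.
unused = re.match(r'(a)|(b)', 'a').group(2)

# A group that matched several times reports its last match.
last = re.match(r'(?:(\d)-)+', '1-2-3-').group(1)

# Asking for a group the pattern does not define raises IndexError.
try:
    m.group(3)
    raised = False
except IndexError:
    raised = True
\end{verbatim}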
\end{methoddesc}

\begin{methoddesc}[MatchObject]{groups}{\optional{default}}
Return a tuple containing all the subgroups of the match, from 1 up to
however many groups are in the pattern. The \var{default} argument is
used for groups that did not participate in the match; it defaults to
\code{None}. (Incompatibility note: in the original Python 1.5
release, if the tuple was one element long, a string would be returned
instead. In later versions (from 1.5.1 on), a singleton tuple is
returned in such cases.)
\end{methoddesc}

\begin{methoddesc}[MatchObject]{groupdict}{\optional{default}}
Return a dictionary containing all the \emph{named} subgroups of the
match, keyed by the subgroup name. The \var{default} argument is
used for groups that did not participate in the match; it defaults to
\code{None}.
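
The \var{default} argument of \method{groups()} and
\method{groupdict()} might be used as in this sketch, where the
pattern with an optional fractional part is a hypothetical example:

\begin{verbatim}
import re

# The fractional part is optional and absent from the input.
m = re.match(r'(?P<int>\d+)(?:\.(?P<frac>\d+))?', '24')

plain = m.groups()             # ('24', None)
filled = m.groups('0')         # ('24', '0')

by_name = m.groupdict()        # {'int': '24', 'frac': None}
by_name_filled = m.groupdict('0')
\end{verbatim}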
\end{methoddesc}

\begin{methoddesc}[MatchObject]{start}{\optional{group}}
\methodline{end}{\optional{group}}
Return the indices of the start and end of the substring
matched by \var{group}; \var{group} defaults to zero (meaning the whole
matched substring).
Return \code{-1} if \var{group} exists but
did not contribute to the match. For a match object
\var{m}, and a group \var{g} that did contribute to the match, the
substring matched by group \var{g} (equivalent to
\code{\var{m}.group(\var{g})}) is

\begin{verbatim}
m.string[m.start(g):m.end(g)]
\end{verbatim}

Note that
\code{m.start(\var{group})} will equal \code{m.end(\var{group})} if
\var{group} matched a null string. For example, after \code{\var{m} =
re.search('b(c?)', 'cba')}, \code{\var{m}.start(0)} is 1,
\code{\var{m}.end(0)} is 2, \code{\var{m}.start(1)} and
\code{\var{m}.end(1)} are both 2, and \code{\var{m}.start(2)} raises
an \exception{IndexError} exception.
\end{methoddesc}

\begin{methoddesc}[MatchObject]{span}{\optional{group}}
For \class{MatchObject} \var{m}, return the 2-tuple
\code{(\var{m}.start(\var{group}), \var{m}.end(\var{group}))}.
Note that if \var{group} did not contribute to the match, this is
\code{(-1, -1)}. Again, \var{group} defaults to zero.
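
As a sketch, the search used to illustrate \method{start()} and
\method{end()} gives the spans below; the second pattern is an
invented example of a group that takes no part in the match:

\begin{verbatim}
import re

m = re.search('b(c?)', 'cba')
whole = m.span()          # (1, 2)
empty = m.span(1)         # (2, 2): group 1 matched the empty string

alt = re.match('(a)|(b)', 'a')
skipped = alt.span(2)     # (-1, -1): group 2 did not contribute
\end{verbatim}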
\end{methoddesc}

\begin{memberdesc}[MatchObject]{pos}
The value of \var{pos} which was passed to the \function{search()} or
\function{match()} method of the \class{RegexObject}. This is the
index into the string at which the RE engine started looking for a
match.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{endpos}
The value of \var{endpos} which was passed to the \function{search()}
or \function{match()} method of the \class{RegexObject}. This is the
index into the string beyond which the RE engine will not go.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{lastindex}
The integer index of the last matched capturing group, or \code{None}
if no group was matched at all. For example, the expressions
\regexp{(a)b}, \regexp{((a)(b))}, and \regexp{((ab))} will have
\code{lastindex == 1} if applied to the string \code{'ab'},
while the expression \regexp{(a)(b)} will have \code{lastindex == 2},
if applied to the same string.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{lastgroup}
The name of the last matched capturing group, or \code{None} if the
group didn't have a name, or if no group was matched at all.
\end{memberdesc}
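
The \member{lastindex} cases listed above, together with two
hypothetical patterns for \member{lastgroup}, can be checked with a
sketch such as:

\begin{verbatim}
import re

one = re.match('(a)b', 'ab').lastindex            # 1
nested = re.match('((a)(b))', 'ab').lastindex     # 1
two = re.match('(a)(b)', 'ab').lastindex          # 2
no_groups = re.match('ab', 'ab').lastindex        # None

# lastgroup is the name of that last group, if it has one.
named = re.match('(?P<x>a)(?P<y>b)', 'ab').lastgroup    # 'y'
unnamed = re.match('(?P<x>a)(b)', 'ab').lastgroup       # None
\end{verbatim}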

\begin{memberdesc}[MatchObject]{re}
The regular expression object whose \method{match()} or
\method{search()} method produced this \class{MatchObject} instance.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{string}
The string passed to \function{match()} or \function{search()}.
\end{memberdesc}

\subsection{Examples}

\leftline{\strong{Simulating \cfunction{scanf()}}}

Python does not currently have an equivalent to \cfunction{scanf()}.
\ttindex{scanf()}
Regular expressions are generally more powerful, though also more
verbose, than \cfunction{scanf()} format strings. The table below
offers some more-or-less equivalent mappings between
\cfunction{scanf()} format tokens and regular expressions.

\begin{tableii}{l|l}{textrm}{\cfunction{scanf()} Token}{Regular Expression}
  \lineii{\code{\%c}}
         {\regexp{.}}
  \lineii{\code{\%5c}}
         {\regexp{.\{5\}}}
  \lineii{\code{\%d}}
         {\regexp{[-+]?\e d+}}
  \lineii{\code{\%e}, \code{\%E}, \code{\%f}, \code{\%g}}
         {\regexp{[-+]?(\e d+(\e.\e d*)?|\e.\e d+)([eE][-+]?\e d+)?}}
  \lineii{\code{\%i}}
         {\regexp{[-+]?(0[xX][\e dA-Fa-f]+|0[0-7]*|\e d+)}}
  \lineii{\code{\%o}}
         {\regexp{0[0-7]*}}
  \lineii{\code{\%s}}
         {\regexp{\e S+}}
  \lineii{\code{\%u}}
         {\regexp{\e d+}}
  \lineii{\code{\%x}, \code{\%X}}
         {\regexp{0[xX][\e dA-Fa-f]+}}
\end{tableii}

To extract the filename and numbers from a string like

\begin{verbatim}
  /usr/sbin/sendmail - 0 errors, 4 warnings
\end{verbatim}

you would use a \cfunction{scanf()} format like

\begin{verbatim}
  %s - %d errors, %d warnings
\end{verbatim}

The equivalent regular expression would be

\begin{verbatim}
  (\S+) - (\d+) errors, (\d+) warnings
\end{verbatim}
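
Applied from Python, the expression above might be used as in the
following sketch. Unlike \cfunction{scanf()}, the groups come back as
strings, so the numeric conversion is done explicitly; the variable
names are illustrative only.

\begin{verbatim}
import re

line = '/usr/sbin/sendmail - 0 errors, 4 warnings'
m = re.match(r'(\S+) - (\d+) errors, (\d+) warnings', line)

filename = m.group(1)         # '/usr/sbin/sendmail'
errors = int(m.group(2))      # 0
warnings = int(m.group(3))    # 4
\end{verbatim}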

\leftline{\strong{Avoiding recursion}}

If you create regular expressions that require the engine to perform a
lot of recursion, you may encounter a \exception{RuntimeError} exception with
the message \code{maximum recursion limit exceeded}. For example,

\begin{verbatim}
>>> import re
>>> s = 'Begin ' + 1000*'a very long string ' + 'end'
>>> re.match('Begin (\w| )*? end', s).end()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.5/re.py", line 132, in match
    return _compile(pattern, flags).match(string)
RuntimeError: maximum recursion limit exceeded
\end{verbatim}

You can often restructure your regular expression to avoid recursion.

Starting with Python 2.3, simple uses of the \regexp{*?} pattern are
special-cased to avoid recursion. Thus, the above regular expression
can avoid recursion by being recast as
\regexp{Begin [a-zA-Z0-9_ ]*?end}. As a further benefit, such regular
expressions will run faster than their recursive equivalents.
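
As a sketch, the recast expression can be applied to the same test
string. It is expected to consume the whole string, since
\code{'end'} first occurs at the very end of it; the name
\code{consumed_all} is illustrative only.

\begin{verbatim}
import re

s = 'Begin ' + 1000*'a very long string ' + 'end'

# A character class replaces the alternation inside the repeated
# group, so the engine does not need to recurse for each character.
m = re.match('Begin [a-zA-Z0-9_ ]*?end', s)
consumed_all = (m.end() == len(s))
\end{verbatim}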