re.rst

Last change on this file was 391, checked in by dmik, 11 years ago
python: Merge vendor 2.7.6 to trunk.
Property svn:eol-style set to `native`
File size: 52.1 KB

Rev	Line
[2]	1
	2	:mod:`re` --- Regular expression operations
	3	===========================================
	4
	5	.. module:: re
	6	:synopsis: Regular expression operations.
	7	.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
	8	.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
	9
	10
	11	This module provides regular expression matching operations similar to
	12	those found in Perl. Both patterns and strings to be searched can be
	13	Unicode strings as well as 8-bit strings.
	14
	15	Regular expressions use the backslash character (``'\'``) to indicate
	16	special forms or to allow special characters to be used without invoking
	17	their special meaning. This collides with Python's usage of the same
	18	character for the same purpose in string literals; for example, to match
	19	a literal backslash, one might have to write ``'\\\\'`` as the pattern
	20	string, because the regular expression must be ``\\``, and each
	21	backslash must be expressed as ``\\`` inside a regular Python string
	22	literal.
	23
	24	The solution is to use Python's raw string notation for regular expression
	25	patterns; backslashes are not handled in any special way in a string literal
	26	prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
	27	``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
	28	newline. Usually patterns will be expressed in Python code using this raw
	29	string notation.
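As a small illustration of the point above (the variable names are chosen only for this sketch), the following compares a raw and a regular string literal and shows that both spellings of the backslash pattern denote the same regular expression:

```python
import re

# A raw string keeps the backslash: two characters, '\' and 'n'.
raw = r"\n"
# A regular string literal turns the escape into one newline character.
cooked = "\n"

# Both spellings denote the same two-character pattern, which matches
# one literal backslash in the searched text.
plain_pattern = "\\\\"
raw_pattern = r"\\"

text = "C:\\temp"  # the text itself contains a single backslash
match = re.search(raw_pattern, text)
```

The raw form is usually easier to read because every backslash in the literal is a backslash in the pattern.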
	30
	31	It is important to note that most regular expression operations are available as
	32	module-level functions and :class:`RegexObject` methods. The functions are
	33	shortcuts that don't require you to compile a regex object first, but miss some
	34	fine-tuning parameters.
	35
	36	.. seealso::
	37
	38	Mastering Regular Expressions
	39	Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
	40	second edition of the book no longer covers Python at all, but the first
	41	edition covered writing good regular expression patterns in great detail.
	42
	43
	44	.. _re-syntax:
	45
	46	Regular Expression Syntax
	47	-------------------------
	48
	49	A regular expression (or RE) specifies a set of strings that matches it; the
	50	functions in this module let you check if a particular string matches a given
	51	regular expression (or if a given regular expression matches a particular
	52	string, which comes down to the same thing).
	53
	54	Regular expressions can be concatenated to form new regular expressions; if A
	55	and B are both regular expressions, then AB is also a regular expression.
	56	In general, if a string p matches A and another string q matches B, the
	57	string pq will match AB. This holds unless A or B contain low precedence
	58	operations; boundary conditions between A and B; or have numbered group
	59	references. Thus, complex expressions can easily be constructed from simpler
	60	primitive expressions like the ones described here. For details of the theory
	61	and implementation of regular expressions, consult the Friedl book referenced
	62	above, or almost any textbook about compiler construction.
	63
	64	A brief explanation of the format of regular expressions follows. For further
	65	information and a gentler presentation, consult the :ref:`regex-howto`.
	66
	67	Regular expressions can contain both special and ordinary characters. Most
	68	ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
	69	expressions; they simply match themselves. You can concatenate ordinary
	70	characters, so ``last`` matches the string ``'last'``. (In the rest of this
	71	section, we'll write RE's in ``this special style``, usually without quotes, and
	72	strings to be matched ``'in single quotes'``.)
	73
	74	Some characters, like ``'|'`` or ``'('``, are special. Special
	75	characters either stand for classes of ordinary characters, or affect
	76	how the regular expressions around them are interpreted. Regular
	77	expression pattern strings may not contain null bytes, but can specify
	78	the null byte using the ``\number`` notation, e.g., ``'\x00'``.
	79
	80
	81	The special characters are:
	82
	83	``'.'``
	84	(Dot.) In the default mode, this matches any character except a newline. If
	85	the :const:`DOTALL` flag has been specified, this matches any character
	86	including a newline.
	87
	88	``'^'``
	89	(Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
	90	matches immediately after each newline.
	91
	92	``'$'``
	93	Matches the end of the string or just before the newline at the end of the
	94	string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
	95	matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
	96	only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
	97	matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
	98	a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
	99	the newline, and one at the end of the string.
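The behaviour of ``'$'`` described above can be checked with a short sketch; the sample strings are the ones used in the text, and the variable names are illustrative only:

```python
import re

text = 'foo1\nfoo2\n'

# Default mode: '$' matches only at the very end (or before a final newline).
default_hit = re.search(r'foo.$', text).group(0)

# MULTILINE mode: '$' also matches before every newline.
multiline_hit = re.search(r'foo.$', text, re.MULTILINE).group(0)

# A lone '$' finds two empty matches in 'foo\n'.
empty_positions = [m.start() for m in re.finditer(r'$', 'foo\n')]
```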
	100
	101	``'*'``
	102	Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
	103	many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
	104	by any number of 'b's.
	105
	106	``'+'``
	107	Causes the resulting RE to match 1 or more repetitions of the preceding RE.
	108	``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
	109	match just 'a'.
	110
	111	``'?'``
	112	Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
	113	``ab?`` will match either 'a' or 'ab'.
	114
	115	``*?``, ``+?``, ``??``
	116	The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
	117	as much text as possible. Sometimes this behaviour isn't desired; if the RE
	118	``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
	119	string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
	120	perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
	121	characters as possible will be matched. Using ``.*?`` in the previous
	122	expression will match only ``'<H1>'``.
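A minimal sketch of the difference between the greedy and non-greedy forms, using the same sample string as above:

```python
import re

text = '<H1>title</H1>'

# Greedy: '.*' grabs as much as possible, so the match runs to the last '>'.
greedy = re.match(r'<.*>', text).group(0)

# Non-greedy: '.*?' stops at the first '>' that lets the pattern succeed.
minimal = re.match(r'<.*?>', text).group(0)
```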
	123
	124	``{m}``
	125	Specifies that exactly m copies of the previous RE should be matched; fewer
	126	matches cause the entire RE not to match. For example, ``a{6}`` will match
	127	exactly six ``'a'`` characters, but not five.
	128
	129	``{m,n}``
	130	Causes the resulting RE to match from m to n repetitions of the preceding
	131	RE, attempting to match as many repetitions as possible. For example,
	132	``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting m specifies a
	133	lower bound of zero, and omitting n specifies an infinite upper bound. As an
	134	example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
	135	followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
	136	modifier would be confused with the previously described form.
	137
	138	``{m,n}?``
	139	Causes the resulting RE to match from m to n repetitions of the preceding
	140	RE, attempting to match as few repetitions as possible. This is the
	141	non-greedy version of the previous qualifier. For example, on the
	142	6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
	143	while ``a{3,5}?`` will only match 3 characters.
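The counted qualifiers can be compared side by side. This sketch reuses the examples from the text; the variable names are illustrative:

```python
import re

text = 'aaaaaa'

# Greedy counted repetition takes the maximum allowed (5 of the 6 characters).
greedy = re.match(r'a{3,5}', text).group(0)

# The non-greedy form stops as soon as the minimum (3) is satisfied.
lazy = re.match(r'a{3,5}?', text).group(0)

# Omitting n leaves the upper bound open; four or more 'a' are required.
open_ended = re.match(r'a{4,}b', 'aaaab')
too_short = re.match(r'a{4,}b', 'aaab')
```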
	144
	145	``'\'``
	146	Either escapes special characters (permitting you to match characters like
	147	``'*'``, ``'?'``, and so forth), or signals a special sequence; special
	148	sequences are discussed below.
	149
	150	If you're not using a raw string to express the pattern, remember that Python
	151	also uses the backslash as an escape sequence in string literals; if the escape
	152	sequence isn't recognized by Python's parser, the backslash and subsequent
	153	character are included in the resulting string. However, if Python would
	154	recognize the resulting sequence, the backslash should be repeated twice. This
	155	is complicated and hard to understand, so it's highly recommended that you use
	156	raw strings for all but the simplest expressions.
	157
	158	``[]``
[391]	159	Used to indicate a set of characters. In a set:
[2]	160
[391]	161	* Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
	162	``'m'``, or ``'k'``.
[2]	163
[391]	164	* Ranges of characters can be indicated by giving two characters and separating
	165	them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
	166	``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
	167	``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
	168	``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
	169	it will match a literal ``'-'``.
[2]	170
[391]	171	* Special characters lose their special meaning inside sets. For example,
	172	``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
	173	``'*'``, or ``')'``.
	174
	175	* Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
	176	inside a set, although the characters they match depend on whether
	177	:const:`LOCALE` or :const:`UNICODE` mode is in force.
	178
	179	* Characters that are not within a range can be matched by :dfn:`complementing`
	180	the set. If the first character of the set is ``'^'``, all the characters
	181	that are not in the set will be matched. For example, ``[^5]`` will match
	182	any character except ``'5'``, and ``[^^]`` will match any character except
	183	``'^'``. ``^`` has no special meaning if it's not the first character in
	184	the set.
	185
	186	* To match a literal ``']'`` inside a set, precede it with a backslash, or
	187	place it at the beginning of the set. For example, both ``[()[\]{}]`` and
	188	``[]()[{}]`` will match a parenthesis.
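The rules for sets listed above can be exercised with a few one-liners. This is only a sketch; the sample strings and variable names are made up for illustration:

```python
import re

# Characters listed individually.
individual = re.findall(r'[amk]', 'makes')

# Ranges: two-digit numbers from 00 to 59, and hexadecimal digits.
minute_ok = re.match(r'[0-5][0-9]$', '59') is not None
minute_bad = re.match(r'[0-5][0-9]$', '60') is not None
hex_digits = re.match(r'[0-9A-Fa-f]+$', 'DeadBeef') is not None

# A '-' that is escaped or placed last is a literal hyphen.
escaped_hyphen = re.findall(r'[a\-z]', 'a-z b')
trailing_hyphen = re.findall(r'[a-]', 'a-b')

# Special characters lose their meaning inside a set.
literal_specials = re.findall(r'[(+*)]', '(1+2)*3')

# A leading '^' complements the set.
complemented = re.findall(r'[^5]', '1525')

# A backslash lets ']' appear inside the set.
brackets = re.findall(r'[()[\]{}]', 'f(x)[0]{}')
```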
	189
[2]	190	``'|'``
	191	``A|B``, where A and B can be arbitrary REs, creates a regular expression that
	192	will match either A or B. An arbitrary number of REs can be separated by the
	193	``'|'`` in this way. This can be used inside groups (see below) as well. As
	194	the target string is scanned, REs separated by ``'|'`` are tried from left to
	195	right. When one pattern completely matches, that branch is accepted. This means
	196	that once ``A`` matches, ``B`` will not be tested further, even if it would
	197	produce a longer overall match. In other words, the ``'|'`` operator is never
	198	greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
	199	character class, as in ``[|]``.
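The left-to-right behaviour of alternation can be seen in a short sketch (sample strings are illustrative):

```python
import re

# The first alternative that matches wins, even if a later one is longer.
first_wins = re.match(r'a|ab', 'ab').group(0)

# Reordering the alternatives changes the result.
longer_first = re.match(r'ab|a', 'ab').group(0)

# Two ways to match a literal vertical bar.
escaped = re.split(r'\|', 'x|y|z')
in_class = re.split(r'[|]', 'x|y|z')
```

When one alternative is a prefix of another, listing the longer one first is a common way to get the longer match.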
	200
	201	``(...)``
	202	Matches whatever regular expression is inside the parentheses, and indicates the
	203	start and end of a group; the contents of a group can be retrieved after a match
	204	has been performed, and can be matched later in the string with the ``\number``
	205	special sequence, described below. To match the literals ``'('`` or ``')'``,
	206	use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
	207
	208	``(?...)``
	209	This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
	210	otherwise). The first character after the ``'?'`` determines what the meaning
	211	and further syntax of the construct is. Extensions usually do not create a new
	212	group; ``(?P<name>...)`` is the only exception to this rule. Following are the
	213	currently supported extensions.
	214
	215	``(?iLmsux)``
	216	(One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
	217	``'u'``, ``'x'``.) The group matches the empty string; the letters
	218	set the corresponding flags: :const:`re.I` (ignore case),
	219	:const:`re.L` (locale dependent), :const:`re.M` (multi-line),
	220	:const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
	221	and :const:`re.X` (verbose), for the entire regular expression. (The
	222	flags are described in :ref:`contents-of-module-re`.) This
	223	is useful if you wish to include the flags as part of the regular
	224	expression, instead of passing a flag argument to the
	225	:func:`re.compile` function.
	226
	227	Note that the ``(?x)`` flag changes how the expression is parsed. It should be
	228	used first in the expression string, or after one or more whitespace characters.
	229	If there are non-whitespace characters before the flag, the results are
	230	undefined.
	231
	232	``(?:...)``
[391]	233	A non-capturing version of regular parentheses. Matches whatever regular
[2]	234	expression is inside the parentheses, but the substring matched by the group
	235	cannot be retrieved after performing a match or referenced later in the
	236	pattern.
	237
	238	``(?P<name>...)``
	239	Similar to regular parentheses, but the substring matched by the group is
[391]	240	accessible via the symbolic group name *name*. Group names must be valid
	241	Python identifiers, and each group name must be defined only once within a
	242	regular expression. A symbolic group is also a numbered group, just as if
	243	the group were not named.
[2]	244
[391]	245	Named groups can be referenced in three contexts. If the pattern is
	246	``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
	247	single or double quotes):
[2]	248
[391]	249	+---------------------------------------+----------------------------------+
	250	| Context of reference to group "quote" | Ways to reference it             |
	251	+=======================================+==================================+
	252	| in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
	253	|                                       | * ``\1``                         |
	254	+---------------------------------------+----------------------------------+
	255	| when processing match object ``m``    | * ``m.group('quote')``           |
	256	|                                       | * ``m.end('quote')`` (etc.)      |
	257	+---------------------------------------+----------------------------------+
	258	| in a string passed to the ``repl``    | * ``\g<quote>``                  |
	259	| argument of ``re.sub()``              | * ``\g<1>``                      |
	260	|                                       | * ``\1``                         |
	261	+---------------------------------------+----------------------------------+
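The three contexts in the table can be tried out as follows. The pattern is the one given above; the sample text and variable names are assumptions made for this sketch:

```python
import re

pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
text = 'say "hi" now'

# Context 1: inside the pattern itself, via (?P=quote).
m = pattern.search(text)

# Context 2: on the match object, by name.
quote_char = m.group('quote')
quote_end = m.end('quote')

# Context 3: in the repl string of sub(), by name or by number.
by_name = pattern.sub(r'\g<quote>X\g<quote>', text)
by_number = pattern.sub(r'\g<1>X\g<1>', text)
by_backslash = pattern.sub(r'\1X\1', text)
```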
	262
[2]	263	``(?P=name)``
[391]	264	A backreference to a named group; it matches whatever text was matched by the
	265	earlier group named *name*.
[2]	266
	267	``(?#...)``
	268	A comment; the contents of the parentheses are simply ignored.
	269
	270	``(?=...)``
	271	Matches if ``...`` matches next, but doesn't consume any of the string. This is
	272	called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
	273	``'Isaac '`` only if it's followed by ``'Asimov'``.
	274
	275	``(?!...)``
	276	Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
	277	For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's not
	278	followed by ``'Asimov'``.
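A quick sketch of both lookahead forms, using the patterns from the text (the second sample string is made up):

```python
import re

# Positive lookahead: 'Asimov' must follow, but is not part of the match.
followed = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
not_followed = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')

# Negative lookahead: the match succeeds only when 'Asimov' does not follow.
negative_hit = re.search(r'Isaac (?!Asimov)', 'Isaac Newton')
negative_miss = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')
```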
	279
	280	``(?<=...)``
	281	Matches if the current position in the string is preceded by a match for ``...``
	282	that ends at the current position. This is called a :dfn:`positive lookbehind
	283	assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
	284	lookbehind will back up 3 characters and check if the contained pattern matches.
	285	The contained pattern must only match strings of some fixed length, meaning that
	286	``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
[391]	287	patterns which start with positive lookbehind assertions will not match at the
[2]	288	beginning of the string being searched; you will most likely want to use the
	289	:func:`search` function rather than the :func:`match` function:
	290
	291	>>> import re
	292	>>> m = re.search('(?<=abc)def', 'abcdef')
	293	>>> m.group(0)
	294	'def'
	295
	296	This example looks for a word following a hyphen:
	297
	298	>>> m = re.search('(?<=-)\w+', 'spam-egg')
	299	>>> m.group(0)
	300	'egg'
	301
	302	``(?<!...)``
	303	Matches if the current position in the string is not preceded by a match for
	304	``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
	305	positive lookbehind assertions, the contained pattern must only match strings of
	306	some fixed length. Patterns which start with negative lookbehind assertions may
	307	match at the beginning of the string being searched.
	308
	309	``(?(id/name)yes-pattern|no-pattern)``
	310	Will try to match with ``yes-pattern`` if the group with given id or name
	311	exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
	312	can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
	313	matching pattern, which will match with ``'<user@host.com>'`` as well as
	314	``'user@host.com'``, but not with ``'<user@host.com'``.
	315
	316	.. versionadded:: 2.4
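The conditional pattern from the example can be checked like this. Note that the statements in the text apply to matching from the start of the string; with a search, the unbracketed address inside ``'<user@host.com'`` is still found:

```python
import re

email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')

bracketed = email.match('<user@host.com>')
bare = email.match('user@host.com')

# The opening '<' was matched, so the closing '>' is required.
unbalanced = email.match('<user@host.com')

# Searching skips the '<' and matches the address without brackets.
searched = email.search('<user@host.com')
```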
	317
	318	The special sequences consist of ``'\'`` and a character from the list below.
	319	If the ordinary character is not on the list, then the resulting RE will match
	320	the second character. For example, ``\$`` matches the character ``'$'``.
	321
	322	``\number``
	323	Matches the contents of the group of the same number. Groups are numbered
	324	starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
[391]	325	but not ``'thethe'`` (note the space after the group). This special sequence
[2]	326	can only be used to match one of the first 99 groups. If the first digit of
	327	number is 0, or number is 3 octal digits long, it will not be interpreted as
	328	a group match, but as the character with octal value number. Inside the
	329	``'['`` and ``']'`` of a character class, all numeric escapes are treated as
	330	characters.
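A short sketch of numbered backreferences. The first three calls use the examples from the text; the doubled-word search is an extra illustration with a made-up sentence:

```python
import re

repeated = re.compile(r'(.+) \1')

words = repeated.match('the the')
digits = repeated.match('55 55')

# No space between the two halves, so the pattern cannot match.
no_space = repeated.match('thethe')

# Finding an accidentally doubled word in running text.
doubled = re.findall(r'\b(\w+) \1\b', 'it is is fine')
```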
	331
	332	``\A``
	333	Matches only at the start of the string.
	334
	335	``\b``
	336	Matches the empty string, but only at the beginning or end of a word. A word is
	337	defined as a sequence of alphanumeric or underscore characters, so the end of a
	338	word is indicated by whitespace or a non-alphanumeric, non-underscore character.
[391]	339	Note that formally, ``\b`` is defined as the boundary between a ``\w`` and
	340	a ``\W`` character (or vice versa), or between ``\w`` and the beginning/end
	341	of the string, so the precise set of characters deemed to be alphanumeric
	342	depends on the values of the ``UNICODE`` and ``LOCALE`` flags.
	343	For example, ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
	344	``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
	345	Inside a character range, ``\b`` represents the backspace character, for
	346	compatibility with Python's string literals.
[2]	347
	348	``\B``
	349	Matches the empty string, but only when it is not at the beginning or end of a
[391]	350	word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``, ``'py2'``,
	351	but not ``'py'``, ``'py.'``, or ``'py!'``.
	352	``\B`` is just the opposite of ``\b``, so is also subject to the settings
[2]	353	of ``LOCALE`` and ``UNICODE``.
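The ``\b`` and ``\B`` examples above can be verified with a small sketch (plain ASCII samples, no flags set; the variable names are illustrative):

```python
import re

word = re.compile(r'\bfoo\b')
candidates = ('foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3')
whole_word = [s for s in candidates if word.search(s)]

inside = re.compile(r'py\B')
samples = ('python', 'py3', 'py2', 'py', 'py.', 'py!')
not_at_boundary = [s for s in samples if inside.search(s)]

# Inside a character class, \b stands for the backspace character.
backspace = re.match(r'[\b]', '\x08') is not None
```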
	354
	355	``\d``
	356	When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
	357	is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
[391]	358	whatever is classified as a decimal digit in the Unicode character properties
	359	database.
[2]	360
	361	``\D``
	362	When the :const:`UNICODE` flag is not specified, matches any non-digit
	363	character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
	364	will match any character not marked as a digit in the Unicode
	365	character properties database.
	366
	367	``\s``
[391]	368	When the :const:`UNICODE` flag is not specified, matches any whitespace
	369	character; this is equivalent to the set ``[ \t\n\r\f\v]``. The
	370	:const:`LOCALE` flag has no extra effect on matching of the space.
	371	If :const:`UNICODE` is set, this will match the characters ``[ \t\n\r\f\v]``
	372	plus whatever is classified as space in the Unicode character properties
	373	database.
[2]	374
	375	``\S``
[391]	376	When the :const:`UNICODE` flag is not specified, matches any non-whitespace
	377	character; this is equivalent to the set ``[^ \t\n\r\f\v]``. The
	378	:const:`LOCALE` flag has no extra effect on non-whitespace match. If
	379	:const:`UNICODE` is set, then any character not marked as space in the
	380	Unicode character properties database is matched.
[2]	381
[391]	382
[2]	383	``\w``
	384	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
	385	any alphanumeric character and the underscore; this is equivalent to the set
	386	``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
	387	whatever characters are defined as alphanumeric for the current locale. If
	388	:const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
	389	is classified as alphanumeric in the Unicode character properties database.
	390
	391	``\W``
	392	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
	393	any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
	394	With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
	395	not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
[391]	396	this will match anything other than ``[0-9_]`` plus characters classified as
	397	not alphanumeric in the Unicode character properties database.
[2]	398
	399	``\Z``
	400	Matches only at the end of the string.
	401
[391]	402	If both the :const:`LOCALE` and :const:`UNICODE` flags are included for a
	403	particular sequence, then the :const:`LOCALE` flag takes effect first, followed
	404	by the :const:`UNICODE` flag.
	405
[2]	406	Most of the standard escapes supported by Python string literals are also
	407	accepted by the regular expression parser::
	408
	409	\a \b \f \n
	410	\r \t \v \x
	411	\\
	412
[391]	413	(Note that ``\b`` is used to represent word boundaries, and means "backspace"
	414	only inside character classes.)
	415
[2]	416	Octal escapes are included in a limited form: If the first digit is a 0, or if
	417	there are three octal digits, it is considered an octal escape. Otherwise, it is
	418	a group reference. As for string literals, octal escapes are always at most
	419	three digits in length.
	420
	421
	422	.. _contents-of-module-re:
	423
	424	Module Contents
	425	---------------
	426
	427	The module defines several functions, constants, and an exception. Some of the
	428	functions are simplified versions of the full featured methods for compiled
	429	regular expressions. Most non-trivial applications use the compiled
	430	form.
	431
	432
[391]	433	.. function:: compile(pattern, flags=0)
[2]	434
	435	Compile a regular expression pattern into a regular expression object, which
	436	can be used for matching using its :func:`match` and :func:`search` methods,
	437	described below.
	438
	439	The expression's behaviour can be modified by specifying a flags value.
	440	Values can be any of the following variables, combined using bitwise OR (the
	441	``|`` operator).
	442
	443	The sequence ::
	444
	445	prog = re.compile(pattern)
	446	result = prog.match(string)
	447
	448	is equivalent to ::
	449
	450	result = re.match(pattern, string)
	451
	452	but using :func:`re.compile` and saving the resulting regular expression
	453	object for reuse is more efficient when the expression will be used several
	454	times in a single program.
	455
	456	.. note::
	457
	458	The compiled versions of the most recent patterns passed to
	459	:func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
	460	programs that use only a few regular expressions at a time needn't worry
	461	about compiling regular expressions.
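A minimal sketch of reusing a compiled pattern and of combining flags with bitwise OR. The sample data and variable names are assumptions made for illustration:

```python
import re

lines = ['order 66', 'no digits', '7 of 9']

# Compile once, then reuse the object for every line.
prog = re.compile(r'\d+')
compiled_hits = [prog.match(s) is not None for s in lines]

# The module-level shortcut gives the same answers.
module_hits = [re.match(r'\d+', s) is not None for s in lines]

# Flags are combined with the | operator.
starts_with_b = re.compile(r'^b\w+', re.IGNORECASE | re.MULTILINE)
b_words = starts_with_b.findall('apple\nBanana\nberry')
```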
	462
	463
[391]	464	.. data:: DEBUG
	465
	466	Display debug information about the compiled expression.
	467
	468
[2]	469	.. data:: I
	470	IGNORECASE
	471
	472	Perform case-insensitive matching; expressions like ``[A-Z]`` will match
	473	lowercase letters, too. This is not affected by the current locale.
	474
	475
	476	.. data:: L
	477	LOCALE
	478
	479	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
	480	current locale.
	481
	482
	483	.. data:: M
	484	MULTILINE
	485
	486	When specified, the pattern character ``'^'`` matches at the beginning of the
	487	string and at the beginning of each line (immediately following each newline);
	488	and the pattern character ``'$'`` matches at the end of the string and at the
	489	end of each line (immediately preceding each newline). By default, ``'^'``
	490	matches only at the beginning of the string, and ``'$'`` only at the end of the
	491	string and immediately before the newline (if any) at the end of the string.
	492
	493
	494	.. data:: S
	495	DOTALL
	496
	497	Make the ``'.'`` special character match any character at all, including a
	498	newline; without this flag, ``'.'`` will match anything except a newline.
	499
	500
	501	.. data:: U
	502	UNICODE
	503
	504	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
	505	on the Unicode character properties database.
	506
	507	.. versionadded:: 2.0
	508
	509
	510	.. data:: X
	511	VERBOSE
	512
	513	This flag allows you to write regular expressions that look nicer. Whitespace
	514	within the pattern is ignored, except when in a character class or preceded by
	515	an unescaped backslash, and, when a line contains a ``'#'`` neither in a
	516	character class nor preceded by an unescaped backslash, all characters from the
	517	leftmost such ``'#'`` through the end of the line are ignored.
	518
	519	That means that the two following regular expression objects that match a
	520	decimal number are functionally equal::
	521
	522	a = re.compile(r"""\d + # the integral part
	523	\. # the decimal point
	524	\d * # some fractional digits""", re.X)
	525	b = re.compile(r"\d+\.\d*")
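The equivalence of the two objects can be spot-checked on a few inputs. This is only a sketch; the sample list is arbitrary, and agreeing on samples does not prove equality in general:

```python
import re

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

samples = ['3.14', '10.', '.5', 'abc', '42']
results_a = [a.match(s) is not None for s in samples]
results_b = [b.match(s) is not None for s in samples]
```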
	526
	527
[391]	528	.. function:: search(pattern, string, flags=0)
[2]	529
	530	Scan through string looking for a location where the regular expression
	531	pattern produces a match, and return a corresponding :class:`MatchObject`
	532	instance. Return ``None`` if no position in the string matches the pattern; note
	533	that this is different from finding a zero-length match at some point in the
	534	string.
	535
	536
[391]	537	.. function:: match(pattern, string, flags=0)
[2]	538
	539	If zero or more characters at the beginning of string match the regular
	540	expression pattern, return a corresponding :class:`MatchObject` instance.
	541	Return ``None`` if the string does not match the pattern; note that this is
	542	different from a zero-length match.
	543
[391]	544	Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
	545	at the beginning of the string and not at the beginning of each line.
[2]	546
[391]	547	If you want to locate a match anywhere in string, use :func:`search`
	548	instead (see also :ref:`search-vs-match`).
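The difference between the two functions in :const:`MULTILINE` mode can be sketched as follows (the sample text is made up):

```python
import re

text = 'first line\nsecond line'

# match() is anchored at the start of the string, even with MULTILINE.
anchored = re.match(r'second', text, re.MULTILINE)

# search() with '^' and MULTILINE finds the start of the second line.
per_line = re.search(r'^second', text, re.MULTILINE)

# Without MULTILINE, '^' only matches at the start of the string.
default = re.search(r'^second', text)
```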
[2]	549
	550
[391]	551	.. function:: split(pattern, string, maxsplit=0, flags=0)
[2]	552
	553	Split string by the occurrences of pattern. If capturing parentheses are
	554	used in pattern, then the text of all groups in the pattern are also returned
	555	as part of the resulting list. If maxsplit is nonzero, at most maxsplit
	556	splits occur, and the remainder of the string is returned as the final element
	557	of the list. (Incompatibility note: in the original Python 1.5 release,
	558	maxsplit was ignored. This has been fixed in later releases.)
	559
	560	>>> re.split('\W+', 'Words, words, words.')
	561	['Words', 'words', 'words', '']
	562	>>> re.split('(\W+)', 'Words, words, words.')
	563	['Words', ', ', 'words', ', ', 'words', '.', '']
	564	>>> re.split('\W+', 'Words, words, words.', 1)
	565	['Words', 'words, words.']
[391]	566	>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
	567	['0', '3', '9']
[2]	568
	569	If there are capturing groups in the separator and it matches at the start of
	570	the string, the result will start with an empty string. The same holds for
	571	the end of the string:
	572
	573	>>> re.split('(\W+)', '...words, words...')
	574	['', '...', 'words', ', ', 'words', '...', '']
	575
	576	That way, separator components are always found at the same relative
	577	indices within the result list (e.g., if there's one capturing group
	578	in the separator, the 0th, the 2nd and so forth).
	579
	580	Note that split will never split a string on an empty pattern match.
	581	For example:
	582
	583	>>> re.split('x*', 'foo')
	584	['foo']
	585	>>> re.split("(?m)^$", "foo\n\nbar\n")
	586	['foo\n\nbar\n']
	587
[391]	588	.. versionchanged:: 2.7
	589	Added the optional flags argument.
[2]	590
	591
[391]	592	.. function:: findall(pattern, string, flags=0)
	593
[2]	594	Return all non-overlapping matches of pattern in string, as a list of
	595	strings. The string is scanned left-to-right, and matches are returned in
	596	the order found. If one or more groups are present in the pattern, return a
	597	list of groups; this will be a list of tuples if the pattern has more than
	598	one group. Empty matches are included in the result unless they touch the
	599	beginning of another match.
	600
	601	.. versionadded:: 1.5.2
	602
	603	.. versionchanged:: 2.4
	604	Added the optional flags argument.
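How the return value depends on the number of groups can be seen in a short sketch (sample text and names are illustrative):

```python
import re

text = 'width=80, height=24'

# No groups: a list of whole matches.
no_groups = re.findall(r'\w+=\d+', text)

# One group: a list of that group's strings.
one_group = re.findall(r'\w+=(\d+)', text)

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(\w+)=(\d+)', text)
```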
	605
	606
[391]	607	.. function:: finditer(pattern, string, flags=0)
[2]	608
	609	Return an :term:`iterator` yielding :class:`MatchObject` instances over all
	610	non-overlapping matches for the RE pattern in string. The string is
	611	scanned left-to-right, and matches are returned in the order found. Empty
	612	matches are included in the result unless they touch the beginning of another
	613	match.
	614
	615	.. versionadded:: 2.2
	616
	617	.. versionchanged:: 2.4
	618	Added the optional flags argument.
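Because the iterator yields match objects instead of strings, positions are available as well. A small sketch with a made-up input:

```python
import re

# Collect each match together with its start and end offsets.
spans = [(m.group(0), m.start(), m.end())
         for m in re.finditer(r'\d+', 'a1bc23d456')]
```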
	619
	620
[391]	621	.. function:: sub(pattern, repl, string, count=0, flags=0)
[2]	622
	623	Return the string obtained by replacing the leftmost non-overlapping occurrences
	624	of pattern in string by the replacement repl. If the pattern isn't found,
	625	string is returned unchanged. repl can be a string or a function; if it is
	626	a string, any backslash escapes in it are processed. That is, ``\n`` is
[391]	627	converted to a single newline character, ``\r`` is converted to a carriage return, and
[2]	628	so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
	629	as ``\6``, are replaced with the substring matched by group 6 in the pattern.
	630	For example:
	631
	632	>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
	633	... r'static PyObject*\npy_\1(void)\n{',
	634	... 'def myfunc():')
	635	'static PyObject*\npy_myfunc(void)\n{'
	636
	637	If repl is a function, it is called for every non-overlapping occurrence of
	638	pattern. The function takes a single match object argument, and returns the
	639	replacement string. For example:
	640
	641	>>> def dashrepl(matchobj):
	642	... if matchobj.group(0) == '-': return ' '
	643	... else: return '-'
	644	>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
	645	'pro--gram files'
[391]	646	>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
	647	'Baked Beans & Spam'
[2]	648
[391]	649	The pattern may be a string or an RE object.
[2]	650
	651	The optional argument count is the maximum number of pattern occurrences to be
	652	replaced; count must be a non-negative integer. If omitted or zero, all
	653	occurrences will be replaced. Empty matches for the pattern are replaced only
	654	when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
	655	``'-a-b-c-'``.
	656
[391]	657	In string-type repl arguments, in addition to the character escapes and
	658	backreferences described above,
[2]	659	``\g<name>`` will use the substring matched by the group named ``name``, as
	660	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
	661	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
	662	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
	663	reference to group 20, not a reference to group 2 followed by the literal
	664	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
	665	substring matched by the RE.
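The following sketch illustrates the ``\g<...>`` forms described above. The patterns and input strings are made up for the illustration:

```python
import re

# Named groups referenced by name in the replacement.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
swapped = date.sub(r"\g<month>/\g<year>", "2013-11")

# \g<2>0 means "group 2, then a literal '0'"; writing \20 here
# would instead be read as a reference to group 20.
pair = re.compile(r"(a)(b)")
group_then_zero = pair.sub(r"\g<2>0", "ab")

# \g<0> stands for the entire matched substring.
whole = pair.sub(r"[\g<0>]", "ab")
```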
	666
[391]	667	.. versionchanged:: 2.7
	668	Added the optional flags argument.
[2]	669
	670
[391]	671	.. function:: subn(pattern, repl, string, count=0, flags=0)
	672
[2]	673	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
	674	number_of_subs_made)``.
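A short sketch of the returned tuple, using made-up input strings:

```python
import re

# Collapse runs of whitespace and learn how many runs were replaced.
new_string, number_of_subs_made = re.subn(r"\s+", " ", "a  b \t c")

# count limits the replacements, and the second item reflects that.
limited = re.subn(r"-", "+", "1-2-3-4", count=2)
```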
	675
[391]	676	.. versionchanged:: 2.7
	677	Added the optional flags argument.
[2]	678
[391]	679
[2]	680	.. function:: escape(string)
	681
	682	Return string with all non-alphanumerics backslashed; this is useful if you
	683	want to match an arbitrary literal string that may have regular expression
	684	metacharacters in it.
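As an illustration of why escaping matters, the sketch below searches for a literal string that happens to contain metacharacters. The exact escaped form differs between Python versions, so the example only relies on the escaped pattern matching the literal text:

```python
import re

literal = "1+1=2 (really?)"
text = "is 1+1=2 (really?) true"

# Used directly, '+', '(', ')' and '?' are treated as metacharacters,
# so the pattern does not match its own literal text.
unescaped_hit = re.search(literal, text)

# After escaping, the pattern matches the literal string exactly.
escaped_hit = re.search(re.escape(literal), text)
```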
	685
	686
[391]	687	.. function:: purge()
	688
	689	Clear the regular expression cache.
	690
	691
[2]	692	.. exception:: error
	693
	694	Exception raised when a string passed to one of the functions here is not a
	695	valid regular expression (for example, it might contain unmatched parentheses)
	696	or when some other error occurs during compilation or matching. It is never an
	697	error if a string contains no match for a pattern.
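One way to use this exception is to validate patterns that come from user input. The helper name ``is_valid_pattern`` below is hypothetical and only serves the illustration:

```python
import re

def is_valid_pattern(pattern):
    """Return True if pattern compiles, False if re.error is raised."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

bad = is_valid_pattern("(unclosed")   # unmatched parenthesis
good = is_valid_pattern("(closed)")

# A failed search is not an error; it simply returns None.
no_match = re.search("x", "abc")
```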
	698
	699
	700	.. _re-objects:
	701
	702	Regular Expression Objects
	703	--------------------------
	704
[391]	705	.. class:: RegexObject
[2]	706
[391]	707	The :class:`RegexObject` class supports the following methods and attributes:
[2]	708
[391]	709	.. method:: RegexObject.search(string[, pos[, endpos]])
[2]	710
[391]	711	Scan through string looking for a location where this regular expression
	712	produces a match, and return a corresponding :class:`MatchObject` instance.
	713	Return ``None`` if no position in the string matches the pattern; note that this
	714	is different from finding a zero-length match at some point in the string.
[2]	715
[391]	716	The optional second parameter pos gives an index in the string where the
	717	search is to start; it defaults to ``0``. This is not completely equivalent to
	718	slicing the string; the ``'^'`` pattern character matches at the real beginning
	719	of the string and at positions just after a newline, but not necessarily at the
	720	index where the search is to start.
[2]	721
The optional parameter endpos limits how far the string will be searched; it
will be as if the string is endpos characters long, so only the characters
from pos to ``endpos - 1`` will be searched for a match. If endpos is less
than pos, no match will be found; otherwise, if rx is a compiled regular
expression object, ``rx.search(string, 0, 50)`` is equivalent to
``rx.search(string[:50], 0)``.
[2]	728
[391]	729	>>> pattern = re.compile("d")
	730	>>> pattern.search("dog") # Match at index 0
	731	<_sre.SRE_Match object at ...>
	732	>>> pattern.search("dog", 1) # No match; search doesn't include the "d"
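The difference between passing pos and slicing the string, and the effect of endpos, can be sketched as follows (sample strings are illustrative):

```python
import re

rx = re.compile(r"^b")

# With pos=1 the search starts at "b", but '^' still refers to the
# real beginning of the string, so nothing matches.
with_pos = rx.search("ab", 1)

# Slicing creates a new string that really does begin with "b".
with_slice = rx.search("ab"[1:])

digits = re.compile(r"\d")
s = "a1b2c3"

# endpos=1 means only s[0] is examined, and it is not a digit.
limited = digits.search(s, 0, 1)

# endpos smaller than pos never produces a match.
reversed_bounds = digits.search(s, 4, 2)
```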
[2]	733
	734
[391]	735	.. method:: RegexObject.match(string[, pos[, endpos]])
	736
	737	If zero or more characters at the beginning of string match this regular
	738	expression, return a corresponding :class:`MatchObject` instance. Return
	739	``None`` if the string does not match the pattern; note that this is different
	740	from a zero-length match.
	741
	742	The optional pos and endpos parameters have the same meaning as for the
	743	:meth:`~RegexObject.search` method.
	744
[2]	745	>>> pattern = re.compile("o")
[391]	746	>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
[2]	747	>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
	748	<_sre.SRE_Match object at ...>
	749
[391]	750	If you want to locate a match anywhere in string, use
	751	:meth:`~RegexObject.search` instead (see also :ref:`search-vs-match`).
[2]	752
	753
[391]	754	.. method:: RegexObject.split(string, maxsplit=0)
[2]	755
[391]	756	Identical to the :func:`split` function, using the compiled pattern.
[2]	757
	758
[391]	759	.. method:: RegexObject.findall(string[, pos[, endpos]])
[2]	760
[391]	761	Similar to the :func:`findall` function, using the compiled pattern, but
	762	also accepts optional pos and endpos parameters that limit the search
	763	region like for :meth:`match`.
[2]	764
	765
[391]	766	.. method:: RegexObject.finditer(string[, pos[, endpos]])
[2]	767
[391]	768	Similar to the :func:`finditer` function, using the compiled pattern, but
	769	also accepts optional pos and endpos parameters that limit the search
	770	region like for :meth:`match`.
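A small sketch of the pos and endpos parameters on the compiled-pattern versions of findall and finditer, using a made-up string:

```python
import re

digits = re.compile(r"\d")
s = "1a2b3c4"

# No bounds: the whole string is scanned.
everything = digits.findall(s)

# Only s[2:5] ("2b3") is scanned.
window = digits.findall(s, 2, 5)

# Match positions are still reported relative to the full string.
positions = [m.start() for m in digits.finditer(s, 2, 5)]
```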
[2]	771
	772
[391]	773	.. method:: RegexObject.sub(repl, string, count=0)
[2]	774
[391]	775	Identical to the :func:`sub` function, using the compiled pattern.
[2]	776
	777
[391]	778	.. method:: RegexObject.subn(repl, string, count=0)
[2]	779
[391]	780	Identical to the :func:`subn` function, using the compiled pattern.
[2]	781
	782
[391]	783	.. attribute:: RegexObject.flags
[2]	784
[391]	785	The regex matching flags. This is a combination of the flags given to
	786	:func:`.compile` and any ``(?...)`` inline flags in the pattern.
[2]	787
	788
[391]	789	.. attribute:: RegexObject.groups
[2]	790
[391]	791	The number of capturing groups in the pattern.
[2]	792
	793
[391]	794	.. attribute:: RegexObject.groupindex
[2]	795
[391]	796	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
	797	numbers. The dictionary is empty if no symbolic groups were used in the
	798	pattern.
[2]	799
	800
[391]	801	.. attribute:: RegexObject.pattern
[2]	802
[391]	803	The pattern string from which the RE object was compiled.
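The attributes above can be inspected on any compiled pattern. A minimal sketch, with an illustrative pattern; the numeric value of flags is not shown because it may include implementation-defined bits:

```python
import re

source = r"(?i)(?P<first>\w+) (?P<last>\w+)(,.*)?"
p = re.compile(source)

group_count = p.groups            # three capturing groups
names = dict(p.groupindex)        # only the named groups appear here
ignores_case = bool(p.flags & re.IGNORECASE)  # set by the inline (?i)
original = p.pattern
```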
[2]	804
	805
	806	.. _match-objects:
	807
	808	Match Objects
	809	-------------
	810
[391]	811	.. class:: MatchObject
[2]	812
[391]	813	Match objects always have a boolean value of ``True``.
	814	Since :meth:`~regex.match` and :meth:`~regex.search` return ``None``
	815	when there is no match, you can test whether there was a match with a simple
	816	``if`` statement::
[2]	817
   match = re.search(pattern, string)
   if match:
       process(match)
[2]	821
[391]	822	Match objects support the following methods and attributes:
[2]	823
	824
[391]	825	.. method:: MatchObject.expand(template)
[2]	826
[391]	827	Return the string obtained by doing backslash substitution on the template
	828	string template, as done by the :meth:`~RegexObject.sub` method. Escapes
	829	such as ``\n`` are converted to the appropriate characters, and numeric
	830	backreferences (``\1``, ``\2``) and named backreferences (``\g<1>``,
	831	``\g<name>``) are replaced by the contents of the corresponding group.
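As a sketch of the template forms listed above (the names in the pattern are illustrative):

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# Numeric backreferences.
by_number = m.expand(r"\2, \1")

# Named backreferences.
by_name = m.expand(r"\g<last>, \g<first>")

# Escapes such as \n in the template become real characters.
with_newline = m.expand(r"\1\n\2")
```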
[2]	832
	833
[391]	834	.. method:: MatchObject.group([group1, ...])
[2]	835
[391]	836	Returns one or more subgroups of the match. If there is a single argument, the
	837	result is a single string; if there are multiple arguments, the result is a
	838	tuple with one item per argument. Without arguments, group1 defaults to zero
	839	(the whole match is returned). If a groupN argument is zero, the corresponding
	840	return value is the entire matching string; if it is in the inclusive range
	841	[1..99], it is the string matching the corresponding parenthesized group. If a
	842	group number is negative or larger than the number of groups defined in the
	843	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
	844	part of the pattern that did not match, the corresponding result is ``None``.
	845	If a group is contained in a part of the pattern that matched multiple times,
	846	the last match is returned.
[2]	847
[391]	848	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
	849	>>> m.group(0) # The entire match
	850	'Isaac Newton'
	851	>>> m.group(1) # The first parenthesized subgroup.
	852	'Isaac'
	853	>>> m.group(2) # The second parenthesized subgroup.
	854	'Newton'
	855	>>> m.group(1, 2) # Multiple arguments give us a tuple.
	856	('Isaac', 'Newton')
[2]	857
[391]	858	If the regular expression uses the ``(?P<name>...)`` syntax, the groupN
	859	arguments may also be strings identifying groups by their group name. If a
	860	string argument is not used as a group name in the pattern, an :exc:`IndexError`
	861	exception is raised.
[2]	862
[391]	863	A moderately complicated example:
[2]	864
[391]	865	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
	866	>>> m.group('first_name')
	867	'Malcolm'
	868	>>> m.group('last_name')
	869	'Reynolds'
[2]	870
[391]	871	Named groups can also be referred to by their index:
[2]	872
[391]	873	>>> m.group(1)
	874	'Malcolm'
	875	>>> m.group(2)
	876	'Reynolds'
[2]	877
[391]	878	If a group matches multiple times, only the last match is accessible:
[2]	879
[391]	880	>>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times.
	881	>>> m.group(1) # Returns only the last match.
	882	'c3'
[2]	883
	884
[391]	885	.. method:: MatchObject.groups([default])
[2]	886
[391]	887	Return a tuple containing all the subgroups of the match, from 1 up to however
	888	many groups are in the pattern. The default argument is used for groups that
	889	did not participate in the match; it defaults to ``None``. (Incompatibility
	890	note: in the original Python 1.5 release, if the tuple was one element long, a
	891	string would be returned instead. In later versions (from 1.5.1 on), a
	892	singleton tuple is returned in such cases.)
[2]	893
[391]	894	For example:
[2]	895
[391]	896	>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
	897	>>> m.groups()
	898	('24', '1632')
[2]	899
[391]	900	If we make the decimal place and everything after it optional, not all groups
	901	might participate in the match. These groups will default to ``None`` unless
	902	the default argument is given:
[2]	903
[391]	904	>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
	905	>>> m.groups() # Second group defaults to None.
	906	('24', None)
	907	>>> m.groups('0') # Now, the second group defaults to '0'.
	908	('24', '0')
[2]	909
	910
[391]	911	.. method:: MatchObject.groupdict([default])
[2]	912
[391]	913	Return a dictionary containing all the named subgroups of the match, keyed by
	914	the subgroup name. The default argument is used for groups that did not
	915	participate in the match; it defaults to ``None``. For example:
[2]	916
[391]	917	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
	918	>>> m.groupdict()
	919	{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
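The default argument only matters when a named group did not participate in the match. A minimal sketch, assuming an illustrative pattern with an optional fractional part:

```python
import re

number = re.compile(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?")
m = number.match("24")

# The 'frac' group did not participate, so it maps to None ...
without_default = m.groupdict()

# ... or to the supplied default.
with_default = m.groupdict("0")
```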
[2]	920
	921
[391]	922	.. method:: MatchObject.start([group])
	923	MatchObject.end([group])
[2]	924
[391]	925	Return the indices of the start and end of the substring matched by group;
	926	group defaults to zero (meaning the whole matched substring). Return ``-1`` if
	927	group exists but did not contribute to the match. For a match object m, and
	928	a group g that did contribute to the match, the substring matched by group g
	929	(equivalent to ``m.group(g)``) is ::
[2]	930
[391]	931	m.string[m.start(g):m.end(g)]
[2]	932
[391]	933	Note that ``m.start(group)`` will equal ``m.end(group)`` if group matched a
	934	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
	935	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
	936	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
[2]	937
[391]	938	An example that will remove remove_this from email addresses:
[2]	939
[391]	940	>>> email = "tony@tiremove_thisger.net"
	941	>>> m = re.search("remove_this", email)
	942	>>> email[:m.start()] + email[m.end():]
	943	'tony@tiger.net'
[2]	944
	945
[391]	946	.. method:: MatchObject.span([group])
[2]	947
[391]	948	For :class:`MatchObject` m, return the 2-tuple ``(m.start(group),
	949	m.end(group))``. Note that if group did not contribute to the match, this is
	950	``(-1, -1)``. group defaults to zero, the entire match.
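A short sketch of span for the whole match, for a group that matched the empty string, and for a group that did not contribute to the match (patterns are illustrative):

```python
import re

m = re.search(r"b(c?)", "cba")
whole_span = m.span()      # the "b" at index 1
empty_group = m.span(1)    # (c?) matched the empty string at index 2

# In an alternation, the branch that was not taken did not contribute.
alt = re.match(r"(a)|(b)", "a")
unused_group = alt.span(2)
```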
[2]	951
	952
[391]	953	.. attribute:: MatchObject.pos
[2]	954
[391]	955	The value of pos which was passed to the :meth:`~RegexObject.search` or
	956	:meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
	957	index into the string at which the RE engine started looking for a match.
[2]	958
	959
[391]	960	.. attribute:: MatchObject.endpos
[2]	961
[391]	962	The value of endpos which was passed to the :meth:`~RegexObject.search` or
	963	:meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
	964	index into the string beyond which the RE engine will not go.
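The pos and endpos attributes simply record the bounds that were in effect for the search. A minimal sketch with a made-up string:

```python
import re

rx = re.compile(r"\d+")
s = "ab12cd34"

# Restrict the search to s[4:8].
m = rx.search(s, 4, 8)
bounds = (m.pos, m.endpos)

# Without explicit bounds, pos is 0 and endpos is len(s).
default = rx.search(s)
default_bounds = (default.pos, default.endpos)
```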
[2]	965
	966
[391]	967	.. attribute:: MatchObject.lastindex
[2]	968
[391]	969	The integer index of the last matched capturing group, or ``None`` if no group
	970	was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
	971	``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
	972	the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
	973	string.
[2]	974
	975
[391]	976	.. attribute:: MatchObject.lastgroup
[2]	977
[391]	978	The name of the last matched capturing group, or ``None`` if the group didn't
	979	have a name, or if no group was matched at all.
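The lastindex cases listed above, together with lastgroup, can be checked directly. The group names ``x`` and ``y`` are chosen only for the illustration:

```python
import re

# Nested groups: the outermost group closes last, so lastindex is 1.
nested = re.match(r"((a)(b))", "ab").lastindex

# Sibling groups: the second group closes last.
siblings = re.match(r"(a)(b)", "ab").lastindex

# lastgroup is the name of that last group, if it has one.
named = re.match(r"(?P<x>a)(?P<y>b)", "ab").lastgroup
unnamed = re.match(r"(?P<x>a)(b)", "ab").lastgroup

# A pattern without groups leaves both attributes as None.
no_groups = re.match(r"ab", "ab")
```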
[2]	980
	981
[391]	982	.. attribute:: MatchObject.re
[2]	983
[391]	984	The regular expression object whose :meth:`~RegexObject.match` or
	985	:meth:`~RegexObject.search` method produced this :class:`MatchObject`
	986	instance.
[2]	987
	988
[391]	989	.. attribute:: MatchObject.string
	990
	991	The string passed to :meth:`~RegexObject.match` or
	992	:meth:`~RegexObject.search`.
	993
	994
[2]	995	Examples
	996	--------
	997
	998
	999	Checking For a Pair
	1000	^^^^^^^^^^^^^^^^^^^
	1001
	1002	In this example, we'll use the following helper function to display match
	1003	objects a little more gracefully:
	1004
	1005	.. testcode::
	1006
   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())
	1011
	1012	Suppose you are writing a poker program where a player's hand is represented as
	1013	a 5-character string with each character representing a card, "a" for ace, "k"
[391]	1014	for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
[2]	1015	representing the card with that value.
	1016
	1017	To see if a given string is a valid hand, one could do the following:
	1018
[391]	1019	>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
	1020	>>> displaymatch(valid.match("akt5q")) # Valid.
	1021	"<Match: 'akt5q', groups=()>"
	1022	>>> displaymatch(valid.match("akt5e")) # Invalid.
	1023	>>> displaymatch(valid.match("akt")) # Invalid.
[2]	1024	>>> displaymatch(valid.match("727ak")) # Valid.
	1025	"<Match: '727ak', groups=()>"
	1026
	1027	That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
	1028	To match this with a regular expression, one could use backreferences as such:
	1029
>>> pair = re.compile(r".*(.).*\1")
>>> displaymatch(pair.match("717ak")) # Pair of 7s.
"<Match: '717', groups=('7',)>"
>>> displaymatch(pair.match("718ak")) # No pairs.
>>> displaymatch(pair.match("354aa")) # Pair of aces.
"<Match: '354aa', groups=('a',)>"
	1036
	1037	To find out what card the pair consists of, one could use the
	1038	:meth:`~MatchObject.group` method of :class:`MatchObject` in the following
	1039	manner:
	1040
	1041	.. doctest::
	1042
   >>> pair.match("717ak").group(1)
   '7'

   # Error because re.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       re.match(r".*(.).*\1", "718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'
	1055
	1056
	1057	Simulating scanf()
	1058	^^^^^^^^^^^^^^^^^^
	1059
	1060	.. index:: single: scanf()
	1061
[391]	1062	Python does not currently have an equivalent to :c:func:`scanf`. Regular
[2]	1063	expressions are generally more powerful, though also more verbose, than
[391]	1064	:c:func:`scanf` format strings. The table below offers some more-or-less
	1065	equivalent mappings between :c:func:`scanf` format tokens and regular
[2]	1066	expressions.
	1067
+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+
	1089
	1090	To extract the filename and numbers from a string like ::
	1091
	1092	/usr/sbin/sendmail - 0 errors, 4 warnings
	1093
[391]	1094	you would use a :c:func:`scanf` format like ::
[2]	1095
	1096	%s - %d errors, %d warnings
	1097
	1098	The equivalent regular expression would be ::
	1099
	1100	(\S+) - (\d+) errors, (\d+) warnings
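Applied to the sample line, the regular expression above could be used as in this sketch. The variable names are illustrative, and the captured numbers are strings until converted:

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

# Unlike scanf's %d, the groups are strings; convert them explicitly.
filename = m.group(1)
n_errors = int(m.group(2))
n_warnings = int(m.group(3))
```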
	1101
	1102
[391]	1103	.. _search-vs-match:
[2]	1104
	1105	search() vs. match()
	1106	^^^^^^^^^^^^^^^^^^^^
	1107
[391]	1108	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
[2]	1109
[391]	1110	Python offers two different primitive operations based on regular expressions:
	1111	:func:`re.match` checks for a match only at the beginning of the string, while
	1112	:func:`re.search` checks for a match anywhere in the string (this is what Perl
	1113	does by default).
[2]	1114
[391]	1115	For example::
[2]	1116
[391]	1117	>>> re.match("c", "abcdef") # No match
	1118	>>> re.search("c", "abcdef") # Match
	1119	<_sre.SRE_Match object at ...>
[2]	1120
[391]	1121	Regular expressions beginning with ``'^'`` can be used with :func:`search` to
	1122	restrict the match at the beginning of the string::
[2]	1123
[391]	1124	>>> re.match("c", "abcdef") # No match
	1125	>>> re.search("^c", "abcdef") # No match
	1126	>>> re.search("^a", "abcdef") # Match
	1127	<_sre.SRE_Match object at ...>
[2]	1128
[391]	1129	Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
	1130	beginning of the string, whereas using :func:`search` with a regular expression
	1131	beginning with ``'^'`` will match at the beginning of each line.
[2]	1132
[391]	1133	>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
	1134	>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
[2]	1135	<_sre.SRE_Match object at ...>
	1136
	1137
	1138	Making a Phonebook
	1139	^^^^^^^^^^^^^^^^^^
	1140
	1141	:func:`split` splits a string into a list delimited by the passed pattern. The
	1142	method is invaluable for converting textual data into data structures that can be
	1143	easily read and modified by Python as demonstrated in the following example that
	1144	creates a phonebook.
	1145
First, here is the input. Normally it may come from a file; here we are using
triple-quoted string syntax:
	1148
[391]	1149	>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
[2]	1150	...
	1151	... Ronald Heathmore: 892.345.3428 436 Finley Avenue
	1152	... Frank Burger: 925.541.7625 662 South Dogwood Way
	1153	...
	1154	...
	1155	... Heather Albrecht: 548.326.4584 919 Park Place"""
	1156
	1157	The entries are separated by one or more newlines. Now we convert the string
	1158	into a list with each nonempty line having its own entry:
	1159
	1160	.. doctest::
	1161	:options: +NORMALIZE_WHITESPACE
	1162
[391]	1163	>>> entries = re.split("\n+", text)
[2]	1164	>>> entries
	1165	['Ross McFluff: 834.345.1254 155 Elm Street',
	1166	'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
	1167	'Frank Burger: 925.541.7625 662 South Dogwood Way',
	1168	'Heather Albrecht: 548.326.4584 919 Park Place']
	1169
	1170	Finally, split each entry into a list with first name, last name, telephone
	1171	number, and address. We use the ``maxsplit`` parameter of :func:`split`
	1172	because the address has spaces, our splitting pattern, in it:
	1173
	1174	.. doctest::
	1175	:options: +NORMALIZE_WHITESPACE
	1176
	1177	>>> [re.split(":? ", entry, 3) for entry in entries]
	1178	[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
	1179	['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
	1180	['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
	1181	['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
	1182
	1183	The ``:?`` pattern matches the colon after the last name, so that it does not
	1184	occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
	1185	house number from the street name:
	1186
	1187	.. doctest::
	1188	:options: +NORMALIZE_WHITESPACE
	1189
	1190	>>> [re.split(":? ", entry, 4) for entry in entries]
	1191	[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
	1192	['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
	1193	['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
	1194	['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
	1195
	1196
	1197	Text Munging
	1198	^^^^^^^^^^^^
	1199
	1200	:func:`sub` replaces every occurrence of a pattern with a string or the
	1201	result of a function. This example demonstrates using :func:`sub` with
	1202	a function to "munge" text, or randomize the order of all the characters
	1203	in each word of a sentence except for the first and last characters::
	1204
   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
	1214
	1215
	1216	Finding all Adverbs
	1217	^^^^^^^^^^^^^^^^^^^
	1218
:func:`findall` matches all occurrences of a pattern, not just the first
one as :func:`search` does. For example, a writer who wants to find all of
the adverbs in some text might use :func:`findall` in the following manner:
	1223
	1224	>>> text = "He was carefully disguised but captured quickly by police."
	1225	>>> re.findall(r"\w+ly", text)
	1226	['carefully', 'quickly']
	1227
	1228
	1229	Finding all Adverbs and their Positions
	1230	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	1231
If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides instances of
:class:`MatchObject` instead of strings. Continuing with the previous example,
a writer who wants to find all of the adverbs and their positions in some
text would use :func:`finditer` in the following manner:
	1237
>>> text = "He was carefully disguised but captured quickly by police."
>>> for m in re.finditer(r"\w+ly", text):
...     print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
07-16: carefully
40-47: quickly
	1243
	1244
	1245	Raw String Notation
	1246	^^^^^^^^^^^^^^^^^^^
	1247
	1248	Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
	1249	every backslash (``'\'``) in a regular expression would have to be prefixed with
	1250	another one to escape it. For example, the two following lines of code are
	1251	functionally identical:
	1252
	1253	>>> re.match(r"\W(.)\1\W", " ff ")
	1254	<_sre.SRE_Match object at ...>
	1255	>>> re.match("\\W(.)\\1\\W", " ff ")
	1256	<_sre.SRE_Match object at ...>
	1257
	1258	When one wants to match a literal backslash, it must be escaped in the regular
	1259	expression. With raw string notation, this means ``r"\\"``. Without raw string
	1260	notation, one must use ``"\\\\"``, making the following lines of code
	1261	functionally identical:
	1262
	1263	>>> re.match(r"\\", r"\\")
	1264	<_sre.SRE_Match object at ...>
	1265	>>> re.match("\\\\", r"\\")
	1266	<_sre.SRE_Match object at ...>
