\documentclass{howto}

% TODO:
% Document lookbehind assertions
% Better way of displaying a RE, a string, and what it matches
% Mention optional argument to match.groups()
% Unicode (at least a reference)

\title{Regular Expression HOWTO}

\release{0.05}

\author{A.M. Kuchling}
\authoraddress{\email{amk@amk.ca}}

\begin{document}
\maketitle

\begin{abstract}
\noindent
This document is an introductory tutorial to using regular expressions
in Python with the \module{re} module. It provides a gentler
introduction than the corresponding section in the Library Reference.

This document is available from
\url{http://www.amk.ca/python/howto}.

\end{abstract}

\tableofcontents

\section{Introduction}

The \module{re} module was added in Python 1.5, and provides
Perl-style regular expression patterns. Earlier versions of Python
came with the \module{regex} module, which provided Emacs-style
patterns. The \module{regex} module was removed in Python 2.5.

Regular expressions (or REs) are essentially a tiny, highly
specialized programming language embedded inside Python and made
available through the \module{re} module. Using this little language,
you specify the rules for the set of possible strings that you want to
match; this set might contain English sentences, or e-mail addresses,
or TeX commands, or anything you like. You can then ask questions
such as ``Does this string match the pattern?'', or ``Is there a match
for the pattern anywhere in this string?''. You can also use REs to
modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes
which are then executed by a matching engine written in C. For
advanced use, it may be necessary to pay careful attention to how the
engine will execute a given RE, and write the RE in a certain way in
order to produce bytecode that runs faster. Optimization isn't
covered in this document, because it requires that you have a good
understanding of the matching engine's internals.

The regular expression language is relatively small and restricted, so
not all possible string processing tasks can be done using regular
expressions. There are also tasks that \emph{can} be done with
regular expressions, but the expressions turn out to be very
complicated. In these cases, you may be better off writing Python
code to do the processing; while Python code will be slower than an
elaborate regular expression, it will also probably be more understandable.

\section{Simple Patterns}

We'll start by learning about the simplest possible regular
expressions. Since regular expressions are used to operate on
strings, we'll begin with the most common task: matching characters.

For a detailed explanation of the computer science underlying regular
expressions (deterministic and non-deterministic finite automata), you
can refer to almost any textbook on writing compilers.

\subsection{Matching Characters}

Most letters and characters will simply match themselves. For
example, the regular expression \regexp{test} will match the string
\samp{test} exactly. (You can enable a case-insensitive mode that
would let this RE match \samp{Test} or \samp{TEST} as well; more
about this later.)

There are exceptions to this rule; some characters are
special \dfn{metacharacters}, and don't match themselves. Instead, they
signal that some out-of-the-ordinary thing should be matched, or they
affect other portions of the RE by repeating them. Much of this document is
devoted to discussing various metacharacters and what they do.

Here's a complete list of the metacharacters; their meanings will be
discussed in the rest of this HOWTO.

\begin{verbatim}
. ^ $ * + ? { } [ ] \ | ( )
\end{verbatim}
% $

The first metacharacters we'll look at are \samp{[} and \samp{]}.
They're used for specifying a character class, which is a set of
characters that you wish to match. Characters can be listed
individually, or a range of characters can be indicated by giving two
characters and separating them by a \character{-}. For example,
\regexp{[abc]} will match any of the characters \samp{a}, \samp{b}, or
\samp{c}; this is the same as
\regexp{[a-c]}, which uses a range to express the same set of
characters. If you wanted to match only lowercase letters, your
RE would be \regexp{[a-z]}.

Metacharacters are not active inside classes. For example,
\regexp{[akm\$]} will match any of the characters \character{a},
\character{k}, \character{m}, or \character{\$}; \character{\$} is
usually a metacharacter, but inside a character class it's stripped of
its special nature.

You can match the characters not within a range by \dfn{complementing}
the set. This is indicated by including a \character{\^} as the first
character of the class; \character{\^} elsewhere will simply match the
\character{\^} character. For example, \verb|[^5]| will match any
character except \character{5}.

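The character classes described above can be checked quickly in an
interpreter. The following is only a sketch: the sample strings are made up
for illustration, and \function{re.findall()}, which this HOWTO introduces
later, is used simply to list every character that a class matches.

```python
import re

# Illustrative sample strings; they are not taken from the HOWTO itself.
# [abc] and [a-c] describe the same set of characters.
same_set = re.findall(r'[abc]', 'cabbage') == re.findall(r'[a-c]', 'cabbage')

# Inside a class, '$' loses its special meaning and matches a literal '$'.
in_class = re.findall(r'[akm$]', 'make $5')

# A leading '^' complements the set: match anything except '5'.
not_five = re.findall(r'[^5]', '1525')

print(same_set)
print(in_class)
print(not_five)
```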
Perhaps the most important metacharacter is the backslash, \samp{\e}.
As in Python string literals, the backslash can be followed by various
characters to signal various special sequences. It's also used to escape
all the metacharacters so you can still match them in patterns; for
example, if you need to match a \samp{[} or
\samp{\e}, you can precede them with a backslash to remove their
special meaning: \regexp{\e[} or \regexp{\e\e}.

Some of the special sequences beginning with \character{\e} represent
predefined sets of characters that are often useful, such as the set
of digits, the set of letters, or the set of anything that isn't
whitespace. The following predefined special sequences are available:

\begin{itemize}
\item[\code{\e d}]Matches any decimal digit; this is
equivalent to the class \regexp{[0-9]}.

\item[\code{\e D}]Matches any non-digit character; this is
equivalent to the class \verb|[^0-9]|.

\item[\code{\e s}]Matches any whitespace character; this is
equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}.

\item[\code{\e S}]Matches any non-whitespace character; this is
equivalent to the class \verb|[^ \t\n\r\f\v]|.

\item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class
\regexp{[a-zA-Z0-9_]}.

\item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class
\verb|[^a-zA-Z0-9_]|.
\end{itemize}

These sequences can be included inside a character class. For
example, \regexp{[\e s,.]} is a character class that will match any
whitespace character, or \character{,} or \character{.}.

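The predefined sequences can be tried out in the same way. This is a minimal
sketch under a few assumptions: the sample strings are invented, only ASCII
text is used, and \function{re.findall()} (covered later) serves only to
collect the matching characters.

```python
import re

text = 'Room 42, floor 3.'   # made-up sample text

# \d picks out the decimal digits.
digits = re.findall(r'\d', text)

# A special sequence can sit inside a class: whitespace, ',' or '.'.
separators = re.findall(r'[\s,.]', text)

# \w covers letters, digits and the underscore; \W is its complement.
word_chars = re.findall(r'\w', 'a_1 !')
non_word_chars = re.findall(r'\W', 'a_1 !')

print(digits)
print(separators)
print(word_chars)
print(non_word_chars)
```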
The final metacharacter in this section is \regexp{.}. It matches
anything except a newline character, and there's an alternate mode
(\code{re.DOTALL}) where it will match even a newline. \character{.}
is often used where you want to match ``any character''.

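A short sketch of the two modes of \regexp{.} follows. The pattern and the
test strings are illustrative assumptions; \function{re.match()} and the
\var{flags} argument are explained later in this HOWTO.

```python
import re

# '.' stands for any single character other than a newline...
plain = re.match(r'a.c', 'abc')

# ...so by default it will not match the newline in the middle here.
across_newline = re.match(r'a.c', 'a\nc')

# With re.DOTALL the same pattern accepts the newline as well.
with_dotall = re.match(r'a.c', 'a\nc', re.DOTALL)

print(plain is not None)
print(across_newline is None)
print(with_dotall is not None)
```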
\subsection{Repeating Things}

Being able to match varying sets of characters is the first thing
regular expressions can do that isn't already possible with the
methods available on strings. However, if that were the only
additional capability of regexes, they wouldn't be much of an advance.
Another capability is that you can specify that portions of the RE
must be repeated a certain number of times.

The first metacharacter for repeating things that we'll look at is
\regexp{*}. \regexp{*} doesn't match the literal character \samp{*};
instead, it specifies that the previous character can be matched zero
or more times, instead of exactly once.

For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a}
characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a}
characters), and so forth. The RE engine has various internal
limitations, stemming from the size of C's \code{int} type, that will
prevent it from matching over 2 billion \samp{a} characters; you
probably don't have enough memory to construct a string that large, so
you shouldn't run into that limit.

Repetitions such as \regexp{*} are \dfn{greedy}; when repeating a RE,
the matching engine will try to repeat it as many times as possible.
If later portions of the pattern don't match, the matching engine will
then back up and try again with fewer repetitions.

A step-by-step example will make this more obvious. Let's consider
the expression \regexp{a[bcd]*b}. This matches the letter
\character{a}, zero or more letters from the class \code{[bcd]}, and
finally ends with a \character{b}. Now imagine matching this RE
against the string \samp{abcbd}.

\begin{tableiii}{c|l|l}{}{Step}{Matched}{Explanation}
\lineiii{1}{\code{a}}{The \regexp{a} in the RE matches.}
\lineiii{2}{\code{abcbd}}{The engine matches \regexp{[bcd]*}, going as far as
it can, which is to the end of the string.}
\lineiii{3}{\emph{Failure}}{The engine tries to match \regexp{b}, but the
current position is at the end of the string, so it fails.}
\lineiii{4}{\code{abcb}}{Back up, so that \regexp{[bcd]*} matches
one less character.}
\lineiii{5}{\emph{Failure}}{Try \regexp{b} again, but the
current position is at the last character, which is a \character{d}.}
\lineiii{6}{\code{abc}}{Back up again, so that \regexp{[bcd]*} is
only matching \samp{bc}.}
\lineiii{7}{\code{abcb}}{Try \regexp{b} again. This time
the character at the current position is \character{b}, so it succeeds.}
\end{tableiii}

The end of the RE has now been reached, and it has matched
\samp{abcb}. This demonstrates how the matching engine goes as far as
it can at first, and if no match is found it will then progressively
back up and retry the rest of the RE again and again. It will back up
until it has tried zero matches for \regexp{[bcd]*}, and if that
subsequently fails, the engine will conclude that the string doesn't
match the RE at all.

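The outcome of the walk-through above can be confirmed from Python. This is
only a sketch of the end result; it does not expose the individual
backtracking steps. The second test string, \samp{acd}, is an added
assumption chosen so that every shorter repetition fails too.

```python
import re

# Greedy [bcd]* first swallows 'bcbd', then backs up until a 'b' can follow.
m = re.match(r'a[bcd]*b', 'abcbd')
print(m.group())
print(m.span())

# No 'b' is available at any backed-up position, so the match fails entirely.
no_match = re.match(r'a[bcd]*b', 'acd')
print(no_match)
```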
Another repeating metacharacter is \regexp{+}, which matches one or
more times. Pay careful attention to the difference between
\regexp{*} and \regexp{+}; \regexp{*} matches \emph{zero} or more
times, so whatever's being repeated may not be present at all, while
\regexp{+} requires at least \emph{one} occurrence. To use a similar
example, \regexp{ca+t} will match \samp{cat} (1 \samp{a}),
\samp{caaat} (3 \samp{a}'s), but won't match \samp{ct}.

There are two more repeating qualifiers. The question mark character,
\regexp{?}, matches either once or zero times; you can think of it as
marking something as being optional. For example, \regexp{home-?brew}
matches either \samp{homebrew} or \samp{home-brew}.

The most complicated repeated qualifier is
\regexp{\{\var{m},\var{n}\}}, where \var{m} and \var{n} are decimal
integers. This qualifier means there must be at least \var{m}
repetitions, and at most \var{n}. For example, \regexp{a/\{1,3\}b}
will match \samp{a/b}, \samp{a//b}, and \samp{a///b}. It won't match
\samp{ab}, which has no slashes, or \samp{a////b}, which has four.

You can omit either \var{m} or \var{n}; in that case, a reasonable
value is assumed for the missing value. Omitting \var{m} is
interpreted as a lower limit of 0, while omitting \var{n} results in an
upper bound of infinity --- actually, the 2 billion limit mentioned
earlier, but that might as well be infinity.

Readers of a reductionist bent may notice that the three other qualifiers
can all be expressed using this notation. \regexp{\{0,\}} is the same
as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and
\regexp{\{0,1\}} is the same as \regexp{?}. It's better to use
\regexp{*}, \regexp{+}, or \regexp{?} when you can, simply because
they're shorter and easier to read.

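The qualifiers above can be compared side by side. In this sketch the helper
\function{accepted()} is hypothetical (it is not part of the \module{re}
module); it relies on \function{re.match()}, described later, and every
pattern here ends in a literal character, so a match from the start of the
string is enough for the comparison.

```python
import re

def accepted(pattern, candidates):
    """Hypothetical helper: the candidates that the pattern matches from the start."""
    return [s for s in candidates if re.match(pattern, s)]

cats = ['ct', 'cat', 'caaat']
star = accepted(r'ca*t', cats)        # zero or more 'a' characters
plus = accepted(r'ca+t', cats)        # at least one 'a'

brews = ['homebrew', 'home-brew', 'home--brew']
optional = accepted(r'home-?brew', brews)   # the hyphen is optional

slashes = ['ab', 'a/b', 'a//b', 'a///b', 'a////b']
bounded = accepted(r'a/{1,3}b', slashes)    # between one and three slashes

# The shorter qualifiers are special cases of the {m,n} notation.
same_as_star = accepted(r'ca{0,}t', cats) == star
same_as_plus = accepted(r'ca{1,}t', cats) == plus
same_as_optional = accepted(r'home-{0,1}brew', brews) == optional

print(star)
print(plus)
print(optional)
print(bounded)
```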
\section{Using Regular Expressions}

Now that we've looked at some simple regular expressions, how do we
actually use them in Python? The \module{re} module provides an
interface to the regular expression engine, allowing you to compile
REs into objects and then perform matches with them.

\subsection{Compiling Regular Expressions}

Regular expressions are compiled into \class{RegexObject} instances,
which have methods for various operations such as searching for
pattern matches or performing string substitutions.

\begin{verbatim}
>>> import re
>>> p = re.compile('ab*')
>>> print p
<re.RegexObject instance at 80b4150>
\end{verbatim}

\function{re.compile()} also accepts an optional \var{flags}
argument, used to enable various special features and syntax
variations. We'll go over the available settings later, but for now a
single example will do:

\begin{verbatim}
>>> p = re.compile('ab*', re.IGNORECASE)
\end{verbatim}

The RE is passed to \function{re.compile()} as a string. REs are
handled as strings because regular expressions aren't part of the core
Python language, and no special syntax was created for expressing
them. (There are applications that don't need REs at all, so there's
no need to bloat the language specification by including them.)
Instead, the \module{re} module is simply a C extension module
included with Python, just like the \module{socket} or \module{zlib}
module.

Putting REs in strings keeps the Python language simpler, but has one
disadvantage which is the topic of the next section.

\subsection{The Backslash Plague}

As stated earlier, regular expressions use the backslash
character (\character{\e}) to indicate special forms or to allow
special characters to be used without invoking their special meaning.
This conflicts with Python's usage of the same character for the same
purpose in string literals.

Let's say you want to write a RE that matches the string
\samp{{\e}section}, which might be found in a \LaTeX\ file. To figure
out what to write in the program code, start with the desired string
to be matched. Next, you must escape any backslashes and other
metacharacters by preceding them with a backslash, resulting in the
string \samp{\e\e section}. The string passed
to \function{re.compile()} must therefore be \verb|\\section|. However, to
express this as a Python string literal, both backslashes must be
escaped \emph{again}.

\begin{tableii}{c|l}{code}{Characters}{Stage}
\lineii{\e section}{Text string to be matched}
\lineii{\e\e section}{Escaped backslash for \function{re.compile()}}
\lineii{"\e\e\e\e section"}{Escaped backslashes for a string literal}
\end{tableii}

In short, to match a literal backslash, one has to write
\code{'\e\e\e\e'} as the RE string, because the regular expression
must be \samp{\e\e}, and each backslash must be expressed as
\samp{\e\e} inside a regular Python string literal. In REs that
feature backslashes repeatedly, this leads to lots of repeated
backslashes and makes the resulting strings difficult to understand.

The solution is to use Python's raw string notation for regular
expressions; backslashes are not handled in any special way in
a string literal prefixed with \character{r}, so \code{r"\e n"} is a
two-character string containing \character{\e} and \character{n},
while \code{"\e n"} is a one-character string containing a newline.
Frequently regular expressions will be expressed in Python
code using this raw string notation.

\begin{tableii}{c|c}{code}{Regular String}{Raw string}
\lineii{"ab*"}{\code{r"ab*"}}
\lineii{"\e\e\e\e section"}{\code{r"\e\e section"}}
\lineii{"\e\e w+\e\e s+\e\e 1"}{\code{r"\e w+\e s+\e 1"}}
\end{tableii}

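The equivalence between the two notations can be verified directly. The
sketch below assumes a made-up input line in the style of a \LaTeX\ file and
uses \function{re.search()}, which is described in the next section.

```python
import re

regular = "\\\\section"    # four backslashes typed, two in the actual string
raw = r"\\section"         # raw string: the RE engine sees exactly what is typed

same_pattern = (regular == raw)

raw_len = len(r"\n")       # two characters: a backslash and 'n'
cooked_len = len("\n")     # one character: a newline

# Either spelling matches a literal backslash followed by 'section'.
m = re.search(raw, r"\section{Introduction}")
matched = m.group()

print(same_pattern)
print(raw_len)
print(cooked_len)
print(matched)
```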
\subsection{Performing Matches}

Once you have an object representing a compiled regular expression,
what do you do with it? \class{RegexObject} instances have several
methods and attributes. Only the most significant ones will be
covered here; consult \ulink{the Library
Reference}{http://www.python.org/doc/lib/module-re.html} for a
complete listing.

\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
\lineii{match()}{Determine if the RE matches at the beginning of
the string.}
\lineii{search()}{Scan through a string, looking for any location
where this RE matches.}
\lineii{findall()}{Find all substrings where the RE matches,
and return them as a list.}
\lineii{finditer()}{Find all substrings where the RE matches,
and return them as an iterator.}
\end{tableii}

\method{match()} and \method{search()} return \code{None} if no match
can be found. If they're successful, a \class{MatchObject} instance is
returned, containing information about the match: where it starts and
ends, the substring it matched, and more.

You can learn about this by interactively experimenting with the
\module{re} module. If you have Tkinter available, you may also want
to look at \file{Tools/scripts/redemo.py}, a demonstration program
included with the Python distribution. It allows you to enter REs and
strings, and displays whether the RE matches or fails.
\file{redemo.py} can be quite useful when trying to debug a
complicated RE. Phil Schwartz's
\ulink{Kodos}{http://kodos.sourceforge.net} is also an interactive
tool for developing and testing RE patterns. This HOWTO will use the
standard Python interpreter for its examples.

First, run the Python interpreter, import the \module{re} module, and
compile a RE:

\begin{verbatim}
Python 2.2.2 (#1, Feb 10 2003, 12:57:01)
>>> import re
>>> p = re.compile('[a-z]+')
>>> p
<_sre.SRE_Pattern object at 80c3c28>
\end{verbatim}

Now, you can try matching various strings against the RE
\regexp{[a-z]+}. An empty string shouldn't match at all, since
\regexp{+} means ``one or more repetitions''. \method{match()} should
return \code{None} in this case, which will cause the interpreter to
print no output. You can explicitly print the result of
\method{match()} to make this clear.

\begin{verbatim}
>>> p.match("")
>>> print p.match("")
None
\end{verbatim}

Now, let's try it on a string that it should match, such as
\samp{tempo}. In this case, \method{match()} will return a
\class{MatchObject}, so you should store the result in a variable for
later use.

\begin{verbatim}
>>> m = p.match('tempo')
>>> print m
<_sre.SRE_Match object at 80c4f68>
\end{verbatim}

Now you can query the \class{MatchObject} for information about the
matching string. \class{MatchObject} instances also have several
methods and attributes; the most important ones are:

\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
\lineii{group()}{Return the string matched by the RE}
\lineii{start()}{Return the starting position of the match}
\lineii{end()}{Return the ending position of the match}
\lineii{span()}{Return a tuple containing the (start, end) positions
of the match}
\end{tableii}

Trying these methods will soon clarify their meaning:

\begin{verbatim}
>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)
\end{verbatim}

\method{group()} returns the substring that was matched by the
RE. \method{start()} and \method{end()} return the starting and
ending index of the match. \method{span()} returns both start and end
indexes in a single tuple. Since the \method{match()} method only
checks if the RE matches at the start of a string,
\method{start()} will always be zero. However, the \method{search()}
method of \class{RegexObject} instances scans through the string, so
the match may not start at zero in that case.

\begin{verbatim}
>>> print p.match('::: message')
None
>>> m = p.search('::: message') ; print m
<re.MatchObject instance at 80c9650>
>>> m.group()
'message'
>>> m.span()
(4, 11)
\end{verbatim}

In actual programs, the most common style is to store the
\class{MatchObject} in a variable, and then check if it was
\code{None}. This usually looks like:

\begin{verbatim}
p = re.compile( ... )
m = p.match( 'string goes here' )
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'
\end{verbatim}

Two \class{RegexObject} methods return all of the matches for a pattern.
\method{findall()} returns a list of matching strings:

\begin{verbatim}
>>> p = re.compile(r'\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']
\end{verbatim}

\method{findall()} has to create the entire list before it can be
returned as the result. As of Python 2.2, the \method{finditer()} method
is also available, returning a sequence of \class{MatchObject} instances
as an iterator.

\begin{verbatim}
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable-iterator object at 0x401833ac>
>>> for match in iterator:
...     print match.span()
...
(0, 2)
(22, 24)
(29, 31)
\end{verbatim}


\subsection{Module-Level Functions}

You don't have to produce a \class{RegexObject} and call its methods;
the \module{re} module also provides top-level functions called
\function{match()}, \function{search()}, \function{sub()}, and so
forth. These functions take the same arguments as the corresponding
\class{RegexObject} method, with the RE string added as the first
argument, and still return either \code{None} or a \class{MatchObject}
instance.

\begin{verbatim}
>>> print re.match(r'From\s+', 'Fromage amk')
None
>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
<re.MatchObject instance at 80c5978>
\end{verbatim}

Under the hood, these functions simply produce a \class{RegexObject}
for you and call the appropriate method on it. They also store the
compiled object in a cache, so future calls using the same
RE are faster.

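As a sketch of that equivalence, the module-level function and the compiled
object can be called on the same input and their results compared. The
variable names are illustrative; the input line is the one used in the
example above.

```python
import re

line = 'From amk Thu May 14 19:12:10 1998'

# Module-level function: the RE string is passed as the first argument.
via_module = re.match(r'From\s+', line)

# Compiled object: compile once, then call the method on the object.
pattern = re.compile(r'From\s+')
via_object = pattern.match(line)

same_result = via_module.group() == via_object.group()

# Both styles also agree when there is no match.
no_match = (re.match(r'From\s+', 'Fromage amk') is None
            and pattern.match('Fromage amk') is None)

print(via_module.group() == 'From ')
print(same_result)
print(no_match)
```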
Should you use these module-level functions, or should you get the
\class{RegexObject} and call its methods yourself? That choice
depends on how frequently the RE will be used, and on your personal
coding style. If a RE is being used at only one point in the code,
then the module functions are probably more convenient. If a program
contains a lot of regular expressions, or re-uses the same ones in
several locations, then it might be worthwhile to collect all the
definitions in one place, in a section of code that compiles all the
REs ahead of time. To take an example from the standard library,
here's an extract from \file{xmllib.py}:

\begin{verbatim}
ref = re.compile( ... )
entityref = re.compile( ... )
charref = re.compile( ... )
starttagopen = re.compile( ... )
\end{verbatim}

I generally prefer to work with the compiled object, even for
one-time uses, but few people will be as much of a purist about this
as I am.

\subsection{Compilation Flags}

Compilation flags let you modify some aspects of how regular
expressions work. Flags are available in the \module{re} module under
two names, a long name such as \constant{IGNORECASE}, and a short,
one-letter form such as \constant{I}. (If you're familiar with Perl's
pattern modifiers, the one-letter forms use the same letters; the
short form of \constant{re.VERBOSE} is \constant{re.X}, for example.)
Multiple flags can be specified by bitwise OR-ing them; \code{re.I |
re.M} sets both the \constant{I} and \constant{M} flags, for example.

Here's a table of the available flags, followed by
a more detailed explanation of each one.

\begin{tableii}{c|l}{}{Flag}{Meaning}
\lineii{\constant{DOTALL}, \constant{S}}{Make \regexp{.} match any
character, including newlines}
\lineii{\constant{IGNORECASE}, \constant{I}}{Do case-insensitive matches}
\lineii{\constant{LOCALE}, \constant{L}}{Do a locale-aware match}
\lineii{\constant{MULTILINE}, \constant{M}}{Multi-line matching,
affecting \regexp{\^} and \regexp{\$}}
\lineii{\constant{VERBOSE}, \constant{X}}{Enable verbose REs,
which can be organized more cleanly and understandably.}
\end{tableii}

\begin{datadesc}{I}
\dataline{IGNORECASE}
Perform case-insensitive matching; character classes and literal strings
will match
letters by ignoring case. For example, \regexp{[A-Z]} will match
lowercase letters, too, and \regexp{Spam} will match \samp{Spam},
\samp{spam}, or \samp{spAM}.
This lowercasing doesn't take the current locale into account; it will
if you also set the \constant{LOCALE} flag.
\end{datadesc}

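A brief sketch of \constant{IGNORECASE} in use follows. The list of
candidate strings (including \samp{ham}, which should not match) is an
illustrative assumption.

```python
import re

# Literal text matches regardless of case once the flag is set.
p = re.compile(r'Spam', re.IGNORECASE)
hits = [s for s in ['Spam', 'spam', 'spAM', 'ham'] if p.match(s)]

# Character classes are affected too: [A-Z] now accepts a lowercase letter.
relaxed = re.match(r'[A-Z]', 'q', re.I)   # re.I is the short form
strict = re.match(r'[A-Z]', 'q')          # without the flag there is no match

print(hits)
print(relaxed is not None)
print(strict is None)
```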
\begin{datadesc}{L}
\dataline{LOCALE}
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b},
and \regexp{\e B} dependent on the current locale.

Locales are a feature of the C library intended to help in writing
programs that take account of language differences. For example, if
you're processing French text, you'd want to be able to write
\regexp{\e w+} to match words, but \regexp{\e w} only matches the
character class \regexp{[a-zA-Z0-9_]}; it won't match \character{\'e} or
\character{\c c}. If your system is configured properly and a French
locale is selected, certain C functions will tell the program that
\character{\'e} should also be considered a letter. Setting the
\constant{LOCALE} flag when compiling a regular expression will cause the
resulting compiled object to use these C functions for \regexp{\e w};
this is slower, but also enables \regexp{\e w+} to match French words as
you'd expect.
\end{datadesc}

591 | \begin{datadesc}{M}
|
---|
592 | \dataline{MULTILINE}
|
---|
593 | (\regexp{\^} and \regexp{\$} haven't been explained yet;
|
---|
594 | they'll be introduced in section~\ref{more-metacharacters}.)
|
---|
595 |
|
---|
596 | Usually \regexp{\^} matches only at the beginning of the string, and
|
---|
597 | \regexp{\$} matches only at the end of the string and immediately before the
|
---|
598 | newline (if any) at the end of the string. When this flag is
|
---|
599 | specified, \regexp{\^} matches at the beginning of the string and at
|
---|
600 | the beginning of each line within the string, immediately following
|
---|
each newline. Similarly, the \regexp{\$} metacharacter matches both at
|
---|
602 | the end of the string and at the end of each line (immediately
|
---|
603 | preceding each newline).
|
---|
604 |
|
---|
605 | \end{datadesc}
|
---|
606 |
|
---|
607 | \begin{datadesc}{S}
|
---|
608 | \dataline{DOTALL}
|
---|
609 | Makes the \character{.} special character match any character at all,
|
---|
610 | including a newline; without this flag, \character{.} will match
|
---|
611 | anything \emph{except} a newline.
|
---|
612 | \end{datadesc}
|
---|
613 |
|
---|
614 | \begin{datadesc}{X}
|
---|
615 | \dataline{VERBOSE} This flag allows you to write regular expressions
|
---|
616 | that are more readable by granting you more flexibility in how you can
|
---|
617 | format them. When this flag has been specified, whitespace within the
|
---|
618 | RE string is ignored, except when the whitespace is in a character
|
---|
619 | class or preceded by an unescaped backslash; this lets you organize
|
---|
620 | and indent the RE more clearly. It also enables you to put comments
|
---|
621 | within a RE that will be ignored by the engine; comments are marked by
|
---|
a \character{\#} that's neither in a character class nor preceded by an
|
---|
623 | unescaped backslash.
|
---|
624 |
|
---|
625 | For example, here's a RE that uses \constant{re.VERBOSE}; see how
|
---|
626 | much easier it is to read?
|
---|
627 |
|
---|
628 | \begin{verbatim}
|
---|
charref = re.compile(r"""
 &[#]                          # Start of a numeric entity reference
 (
     [0-9]+[^0-9]              # Decimal form
   | 0[0-7]+[^0-7]             # Octal form
   | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
 )
""", re.VERBOSE)
|
---|
637 | \end{verbatim}
|
---|
638 |
|
---|
639 | Without the verbose setting, the RE would look like this:
|
---|
640 | \begin{verbatim}
|
---|
charref = re.compile("&#([0-9]+[^0-9]"
                     "|0[0-7]+[^0-7]"
                     "|x[0-9a-fA-F]+[^0-9a-fA-F])")
|
---|
644 | \end{verbatim}
|
---|
645 |
|
---|
646 | In the above example, Python's automatic concatenation of string
|
---|
647 | literals has been used to break up the RE into smaller pieces, but
|
---|
648 | it's still more difficult to understand than the version using
|
---|
649 | \constant{re.VERBOSE}.
|
---|
650 |
|
---|
651 | \end{datadesc}
|
---|
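
As a rough sketch of how three of these flags change matching, the
following script runs the same small patterns with and without each
flag. The sample text and the variable names are invented for
illustration, and \function{print()} is called with parentheses so that
the script should also run on newer Python versions.

\begin{verbatim}
import re

text = "Spam\nspam and eggs\nSPAM"

# IGNORECASE: 'spam' matches regardless of letter case.
case_sensitive = re.findall(r'spam', text)
case_insensitive = re.findall(r'spam', text, re.IGNORECASE)
print(case_sensitive)      # -> ['spam']
print(case_insensitive)    # -> ['Spam', 'spam', 'SPAM']

# MULTILINE: '^' also matches immediately after each newline.
first_word = re.findall(r'^\w+', text)
line_starts = re.findall(r'^\w+', text, re.MULTILINE)
print(first_word)          # -> ['Spam']
print(line_starts)         # -> ['Spam', 'spam', 'SPAM']

# DOTALL: '.' is allowed to match the newline between the two lines.
without_dotall = re.match(r'Spam.spam', text)
with_dotall = re.match(r'Spam.spam', text, re.DOTALL)
print(without_dotall is None)    # -> True
print(with_dotall is not None)   # -> True
\end{verbatim}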
652 |
|
---|
653 | \section{More Pattern Power}
|
---|
654 |
|
---|
655 | So far we've only covered a part of the features of regular
|
---|
656 | expressions. In this section, we'll cover some new metacharacters,
|
---|
657 | and how to use groups to retrieve portions of the text that was matched.
|
---|
658 |
|
---|
659 | \subsection{More Metacharacters\label{more-metacharacters}}
|
---|
660 |
|
---|
661 | There are some metacharacters that we haven't covered yet. Most of
|
---|
662 | them will be covered in this section.
|
---|
663 |
|
---|
664 | Some of the remaining metacharacters to be discussed are
|
---|
665 | \dfn{zero-width assertions}. They don't cause the engine to advance
|
---|
666 | through the string; instead, they consume no characters at all,
|
---|
667 | and simply succeed or fail. For example, \regexp{\e b} is an
|
---|
668 | assertion that the current position is located at a word boundary; the
|
---|
669 | position isn't changed by the \regexp{\e b} at all. This means that
|
---|
670 | zero-width assertions should never be repeated, because if they match
|
---|
671 | once at a given location, they can obviously be matched an infinite
|
---|
672 | number of times.
|
---|
673 |
|
---|
674 | \begin{list}{}{}
|
---|
675 |
|
---|
676 | \item[\regexp{|}]
|
---|
677 | Alternation, or the ``or'' operator.
|
---|
678 | If A and B are regular expressions,
|
---|
679 | \regexp{A|B} will match any string that matches either \samp{A} or \samp{B}.
|
---|
680 | \regexp{|} has very low precedence in order to make it work reasonably when
|
---|
681 | you're alternating multi-character strings.
|
---|
682 | \regexp{Crow|Servo} will match either \samp{Crow} or \samp{Servo}, not
|
---|
683 | \samp{Cro}, a \character{w} or an \character{S}, and \samp{ervo}.
|
---|
684 |
|
---|
685 | To match a literal \character{|},
|
---|
686 | use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.
|
---|
687 |
|
---|
688 | \item[\regexp{\^}] Matches at the beginning of lines. Unless the
|
---|
689 | \constant{MULTILINE} flag has been set, this will only match at the
|
---|
690 | beginning of the string. In \constant{MULTILINE} mode, this also
|
---|
691 | matches immediately after each newline within the string.
|
---|
692 |
|
---|
693 | For example, if you wish to match the word \samp{From} only at the
|
---|
694 | beginning of a line, the RE to use is \verb|^From|.
|
---|
695 |
|
---|
696 | \begin{verbatim}
|
---|
697 | >>> print re.search('^From', 'From Here to Eternity')
|
---|
698 | <re.MatchObject instance at 80c1520>
|
---|
699 | >>> print re.search('^From', 'Reciting From Memory')
|
---|
700 | None
|
---|
701 | \end{verbatim}
|
---|
702 |
|
---|
703 | %To match a literal \character{\^}, use \regexp{\e\^} or enclose it
|
---|
704 | %inside a character class, as in \regexp{[{\e}\^]}.
|
---|
705 |
|
---|
706 | \item[\regexp{\$}] Matches at the end of a line, which is defined as
|
---|
707 | either the end of the string, or any location followed by a newline
|
---|
708 | character.
|
---|
709 |
|
---|
710 | \begin{verbatim}
|
---|
711 | >>> print re.search('}$', '{block}')
|
---|
712 | <re.MatchObject instance at 80adfa8>
|
---|
713 | >>> print re.search('}$', '{block} ')
|
---|
714 | None
|
---|
715 | >>> print re.search('}$', '{block}\n')
|
---|
716 | <re.MatchObject instance at 80adfa8>
|
---|
717 | \end{verbatim}
|
---|
718 | % $
|
---|
719 |
|
---|
720 | To match a literal \character{\$}, use \regexp{\e\$} or enclose it
|
---|
721 | inside a character class, as in \regexp{[\$]}.
|
---|
722 |
|
---|
723 | \item[\regexp{\e A}] Matches only at the start of the string. When
|
---|
724 | not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are
|
---|
725 | effectively the same. In \constant{MULTILINE} mode, however, they're
|
---|
726 | different; \regexp{\e A} still matches only at the beginning of the
|
---|
727 | string, but \regexp{\^} may match at any location inside the string
|
---|
728 | that follows a newline character.
|
---|
729 |
|
---|
\item[\regexp{\e Z}] Matches only at the end of the string.
|
---|
731 |
|
---|
732 | \item[\regexp{\e b}] Word boundary.
|
---|
733 | This is a zero-width assertion that matches only at the
|
---|
734 | beginning or end of a word. A word is defined as a sequence of
|
---|
735 | alphanumeric characters, so the end of a word is indicated by
|
---|
736 | whitespace or a non-alphanumeric character.
|
---|
737 |
|
---|
738 | The following example matches \samp{class} only when it's a complete
|
---|
739 | word; it won't match when it's contained inside another word.
|
---|
740 |
|
---|
741 | \begin{verbatim}
|
---|
742 | >>> p = re.compile(r'\bclass\b')
|
---|
743 | >>> print p.search('no class at all')
|
---|
744 | <re.MatchObject instance at 80c8f28>
|
---|
745 | >>> print p.search('the declassified algorithm')
|
---|
746 | None
|
---|
747 | >>> print p.search('one subclass is')
|
---|
748 | None
|
---|
749 | \end{verbatim}
|
---|
750 |
|
---|
751 | There are two subtleties you should remember when using this special
|
---|
752 | sequence. First, this is the worst collision between Python's string
|
---|
753 | literals and regular expression sequences. In Python's string
|
---|
754 | literals, \samp{\e b} is the backspace character, ASCII value 8. If
|
---|
755 | you're not using raw strings, then Python will convert the \samp{\e b} to
|
---|
756 | a backspace, and your RE won't match as you expect it to. The
|
---|
757 | following example looks the same as our previous RE, but omits
|
---|
758 | the \character{r} in front of the RE string.
|
---|
759 |
|
---|
760 | \begin{verbatim}
|
---|
761 | >>> p = re.compile('\bclass\b')
|
---|
762 | >>> print p.search('no class at all')
|
---|
763 | None
|
---|
764 | >>> print p.search('\b' + 'class' + '\b')
|
---|
765 | <re.MatchObject instance at 80c3ee0>
|
---|
766 | \end{verbatim}
|
---|
767 |
|
---|
768 | Second, inside a character class, where there's no use for this
|
---|
769 | assertion, \regexp{\e b} represents the backspace character, for
|
---|
770 | compatibility with Python's string literals.
|
---|
771 |
|
---|
772 | \item[\regexp{\e B}] Another zero-width assertion, this is the
|
---|
773 | opposite of \regexp{\e b}, only matching when the current
|
---|
774 | position is not at a word boundary.
|
---|
775 |
|
---|
776 | \end{list}
|
---|
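
The following sketch collects a few of the points above into one
script; the sample strings are invented for illustration. It checks
that \regexp{|} separates whole alternatives, that \regexp{\e A}
ignores \constant{MULTILINE} while \regexp{\^} doesn't, that
\regexp{\e Z} refuses to match before a trailing newline, and that
\samp{\e b} is a single backspace character unless a raw string is used.

\begin{verbatim}
import re

# '|' has low precedence: the alternatives are 'Crow' and 'Servo'.
found = re.findall(r'Crow|Servo', 'Crow, Servo and Crervo')
print(found)                 # -> ['Crow', 'Servo']

# In MULTILINE mode '^' matches after each newline; '\A' never does.
text = 'From here\nFrom there'
caret_hits = re.findall(r'^From', text, re.MULTILINE)
anchor_hits = re.findall(r'\AFrom', text, re.MULTILINE)
print(len(caret_hits))       # -> 2
print(len(anchor_hits))      # -> 1

# '$' matches before a trailing newline; '\Z' only at the very end.
dollar = re.search(r'}$', '{block}\n')
strict = re.search(r'}\Z', '{block}\n')
print(dollar is not None)    # -> True
print(strict is None)        # -> True

# Without the r prefix, '\b' is one backspace character (ASCII 8).
plain = '\b'
raw = r'\b'
print(len(plain))            # -> 1
print(len(raw))              # -> 2
\end{verbatim}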
777 |
|
---|
778 | \subsection{Grouping}
|
---|
779 |
|
---|
780 | Frequently you need to obtain more information than just whether the
|
---|
781 | RE matched or not. Regular expressions are often used to dissect
|
---|
782 | strings by writing a RE divided into several subgroups which
|
---|
783 | match different components of interest. For example, an RFC-822
|
---|
784 | header line is divided into a header name and a value, separated by a
|
---|
785 | \character{:}. This can be handled by writing a regular expression
|
---|
786 | which matches an entire header line, and has one group which matches the
|
---|
787 | header name, and another group which matches the header's value.
|
---|
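
As a rough illustration of this idea, here's one way such a header
pattern might be written. The pattern is a simplified assumption, not
a complete RFC-822 parser, and the \method{group()} calls it relies on
are explained in the rest of this section.

\begin{verbatim}
import re

# Group 1 is the header name, group 2 is the value after the colon.
header = re.compile(r'([^:\s]+):\s*(.*)')

m = header.match('Subject: Regular expressions')
name = m.group(1)
value = m.group(2)
print(name)     # -> Subject
print(value)    # -> Regular expressions
\end{verbatim}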
788 |
|
---|
789 | Groups are marked by the \character{(}, \character{)} metacharacters.
|
---|
790 | \character{(} and \character{)} have much the same meaning as they do
|
---|
791 | in mathematical expressions; they group together the expressions
|
---|
792 | contained inside them. For example, you can repeat the contents of a
|
---|
793 | group with a repeating qualifier, such as \regexp{*}, \regexp{+},
|
---|
794 | \regexp{?}, or \regexp{\{\var{m},\var{n}\}}. For example,
|
---|
795 | \regexp{(ab)*} will match zero or more repetitions of \samp{ab}.
|
---|
796 |
|
---|
797 | \begin{verbatim}
|
---|
798 | >>> p = re.compile('(ab)*')
|
---|
799 | >>> print p.match('ababababab').span()
|
---|
800 | (0, 10)
|
---|
801 | \end{verbatim}
|
---|
802 |
|
---|
803 | Groups indicated with \character{(}, \character{)} also capture the
|
---|
804 | starting and ending index of the text that they match; this can be
|
---|
805 | retrieved by passing an argument to \method{group()},
|
---|
806 | \method{start()}, \method{end()}, and \method{span()}. Groups are
|
---|
807 | numbered starting with 0. Group 0 is always present; it's the whole
|
---|
808 | RE, so \class{MatchObject} methods all have group 0 as their default
|
---|
809 | argument. Later we'll see how to express groups that don't capture
|
---|
810 | the span of text that they match.
|
---|
811 |
|
---|
812 | \begin{verbatim}
|
---|
813 | >>> p = re.compile('(a)b')
|
---|
814 | >>> m = p.match('ab')
|
---|
815 | >>> m.group()
|
---|
816 | 'ab'
|
---|
817 | >>> m.group(0)
|
---|
818 | 'ab'
|
---|
819 | \end{verbatim}
|
---|
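
The same group number can be passed to \method{start()},
\method{end()}, and \method{span()}. The short sketch below reuses the
pattern from the example above and asks where group 1 matched; the
variable names are only illustrative.

\begin{verbatim}
import re

p = re.compile('(a)b')
m = p.match('ab')

whole = m.span()                    # same as m.span(0)
inner = m.span(1)                   # where group 1 matched
endpoints = (m.start(1), m.end(1))  # the same pair, piece by piece

print(whole)        # -> (0, 2)
print(inner)        # -> (0, 1)
print(endpoints)    # -> (0, 1)
\end{verbatim}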
820 |
|
---|
821 | Subgroups are numbered from left to right, from 1 upward. Groups can
|
---|
822 | be nested; to determine the number, just count the opening parenthesis
|
---|
823 | characters, going from left to right.
|
---|
824 |
|
---|
825 | \begin{verbatim}
|
---|
826 | >>> p = re.compile('(a(b)c)d')
|
---|
827 | >>> m = p.match('abcd')
|
---|
828 | >>> m.group(0)
|
---|
829 | 'abcd'
|
---|
830 | >>> m.group(1)
|
---|
831 | 'abc'
|
---|
832 | >>> m.group(2)
|
---|
833 | 'b'
|
---|
834 | \end{verbatim}
|
---|
835 |
|
---|
836 | \method{group()} can be passed multiple group numbers at a time, in
|
---|
837 | which case it will return a tuple containing the corresponding values
|
---|
838 | for those groups.
|
---|
839 |
|
---|
840 | \begin{verbatim}
|
---|
841 | >>> m.group(2,1,2)
|
---|
842 | ('b', 'abc', 'b')
|
---|
843 | \end{verbatim}
|
---|
844 |
|
---|
845 | The \method{groups()} method returns a tuple containing the strings
|
---|
846 | for all the subgroups, from 1 up to however many there are.
|
---|
847 |
|
---|
848 | \begin{verbatim}
|
---|
849 | >>> m.groups()
|
---|
850 | ('abc', 'b')
|
---|
851 | \end{verbatim}
|
---|
852 |
|
---|
853 | Backreferences in a pattern allow you to specify that the contents of
|
---|
854 | an earlier capturing group must also be found at the current location
|
---|
855 | in the string. For example, \regexp{\e 1} will succeed if the exact
|
---|
856 | contents of group 1 can be found at the current position, and fails
|
---|
857 | otherwise. Remember that Python's string literals also use a
|
---|
858 | backslash followed by numbers to allow including arbitrary characters
|
---|
859 | in a string, so be sure to use a raw string when incorporating
|
---|
860 | backreferences in a RE.
|
---|
861 |
|
---|
862 | For example, the following RE detects doubled words in a string.
|
---|
863 |
|
---|
864 | \begin{verbatim}
|
---|
865 | >>> p = re.compile(r'(\b\w+)\s+\1')
|
---|
866 | >>> p.search('Paris in the the spring').group()
|
---|
867 | 'the the'
|
---|
868 | \end{verbatim}
|
---|
869 |
|
---|
870 | Backreferences like this aren't often useful for just searching
|
---|
871 | through a string --- there are few text formats which repeat data in
|
---|
872 | this way --- but you'll soon find out that they're \emph{very} useful
|
---|
873 | when performing string substitutions.
|
---|
874 |
|
---|
875 | \subsection{Non-capturing and Named Groups}
|
---|
876 |
|
---|
877 | Elaborate REs may use many groups, both to capture substrings of
|
---|
878 | interest, and to group and structure the RE itself. In complex REs,
|
---|
879 | it becomes difficult to keep track of the group numbers. There are
|
---|
880 | two features which help with this problem. Both of them use a common
|
---|
881 | syntax for regular expression extensions, so we'll look at that first.
|
---|
882 |
|
---|
883 | Perl 5 added several additional features to standard regular
|
---|
884 | expressions, and the Python \module{re} module supports most of them.
|
---|
885 | It would have been difficult to choose new single-keystroke
|
---|
886 | metacharacters or new special sequences beginning with \samp{\e} to
|
---|
887 | represent the new features without making Perl's regular expressions
|
---|
888 | confusingly different from standard REs. If you chose \samp{\&} as a
|
---|
889 | new metacharacter, for example, old expressions would be assuming that
|
---|
890 | \samp{\&} was a regular character and wouldn't have escaped it by
|
---|
891 | writing \regexp{\e \&} or \regexp{[\&]}.
|
---|
892 |
|
---|
893 | The solution chosen by the Perl developers was to use \regexp{(?...)}
|
---|
894 | as the extension syntax. \samp{?} immediately after a parenthesis was
|
---|
895 | a syntax error because the \samp{?} would have nothing to repeat, so
|
---|
896 | this didn't introduce any compatibility problems. The characters
|
---|
897 | immediately after the \samp{?} indicate what extension is being used,
|
---|
898 | so \regexp{(?=foo)} is one thing (a positive lookahead assertion) and
|
---|
899 | \regexp{(?:foo)} is something else (a non-capturing group containing
|
---|
900 | the subexpression \regexp{foo}).
|
---|
901 |
|
---|
902 | Python adds an extension syntax to Perl's extension syntax. If the
|
---|
903 | first character after the question mark is a \samp{P}, you know that
|
---|
904 | it's an extension that's specific to Python. Currently there are two
|
---|
905 | such extensions: \regexp{(?P<\var{name}>...)} defines a named group,
|
---|
906 | and \regexp{(?P=\var{name})} is a backreference to a named group. If
|
---|
907 | future versions of Perl 5 add similar features using a different
|
---|
908 | syntax, the \module{re} module will be changed to support the new
|
---|
909 | syntax, while preserving the Python-specific syntax for
|
---|
910 | compatibility's sake.
|
---|
911 |
|
---|
912 | Now that we've looked at the general extension syntax, we can return
|
---|
913 | to the features that simplify working with groups in complex REs.
|
---|
914 | Since groups are numbered from left to right and a complex expression
|
---|
915 | may use many groups, it can become difficult to keep track of the
|
---|
916 | correct numbering, and modifying such a complex RE is annoying.
|
---|
917 | Insert a new group near the beginning, and you change the numbers of
|
---|
918 | everything that follows it.
|
---|
919 |
|
---|
920 | First, sometimes you'll want to use a group to collect a part of a
|
---|
921 | regular expression, but aren't interested in retrieving the group's
|
---|
922 | contents. You can make this fact explicit by using a non-capturing
|
---|
923 | group: \regexp{(?:...)}, where you can put any other regular
|
---|
924 | expression inside the parentheses.
|
---|
925 |
|
---|
926 | \begin{verbatim}
|
---|
927 | >>> m = re.match("([abc])+", "abc")
|
---|
928 | >>> m.groups()
|
---|
929 | ('c',)
|
---|
930 | >>> m = re.match("(?:[abc])+", "abc")
|
---|
931 | >>> m.groups()
|
---|
932 | ()
|
---|
933 | \end{verbatim}
|
---|
934 |
|
---|
935 | Except for the fact that you can't retrieve the contents of what the
|
---|
936 | group matched, a non-capturing group behaves exactly the same as a
|
---|
937 | capturing group; you can put anything inside it, repeat it with a
|
---|
938 | repetition metacharacter such as \samp{*}, and nest it within other
|
---|
939 | groups (capturing or non-capturing). \regexp{(?:...)} is particularly
|
---|
useful when modifying an existing pattern, since you can add new groups
|
---|
941 | without changing how all the other groups are numbered. It should be
|
---|
942 | mentioned that there's no performance difference in searching between
|
---|
943 | capturing and non-capturing groups; neither form is any faster than
|
---|
944 | the other.
|
---|
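
To see why this matters when a pattern is modified, consider the
following sketch. The patterns and the sample strings are made up for
illustration: an optional prefix is added to an existing RE, once as a
non-capturing group and once as an ordinary capturing group.

\begin{verbatim}
import re

original = re.compile(r'(\d+)-(\d+)')
# The optional prefix uses (?:...), so no existing group is renumbered.
extended = re.compile(r'(?:ext\.\s*)?(\d+)-(\d+)')
# The same prefix as a capturing group shifts every later group by one.
renumbered = re.compile(r'(ext\.\s*)?(\d+)-(\d+)')

before = original.match('555-1234').groups()
after = extended.match('ext. 555-1234').groups()
shifted = renumbered.match('ext. 555-1234').groups()

print(before)     # -> ('555', '1234')
print(after)      # -> ('555', '1234')
print(shifted)    # -> ('ext. ', '555', '1234')
\end{verbatim}

Code that reads \code{group(1)} and \code{group(2)} keeps working with
the non-capturing version, but would have to be updated for the
capturing one.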
945 |
|
---|
946 | The second, and more significant, feature is named groups; instead of
|
---|
947 | referring to them by numbers, groups can be referenced by a name.
|
---|
948 |
|
---|
949 | The syntax for a named group is one of the Python-specific extensions:
|
---|
950 | \regexp{(?P<\var{name}>...)}. \var{name} is, obviously, the name of
|
---|
951 | the group. Except for associating a name with a group, named groups
|
---|
952 | also behave identically to capturing groups. The \class{MatchObject}
|
---|
953 | methods that deal with capturing groups all accept either integers, to
|
---|
954 | refer to groups by number, or a string containing the group name.
|
---|
955 | Named groups are still given numbers, so you can retrieve information
|
---|
956 | about a group in two ways:
|
---|
957 |
|
---|
958 | \begin{verbatim}
|
---|
959 | >>> p = re.compile(r'(?P<word>\b\w+\b)')
|
---|
960 | >>> m = p.search( '(((( Lots of punctuation )))' )
|
---|
961 | >>> m.group('word')
|
---|
962 | 'Lots'
|
---|
963 | >>> m.group(1)
|
---|
964 | 'Lots'
|
---|
965 | \end{verbatim}
|
---|
966 |
|
---|
967 | Named groups are handy because they let you use easily-remembered
|
---|
968 | names, instead of having to remember numbers. Here's an example RE
|
---|
969 | from the \module{imaplib} module:
|
---|
970 |
|
---|
971 | \begin{verbatim}
|
---|
InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')
|
---|
978 | \end{verbatim}
|
---|
979 |
|
---|
980 | It's obviously much easier to retrieve \code{m.group('zonem')},
|
---|
981 | instead of having to remember to retrieve group 9.
|
---|
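
The following sketch applies that pattern to a sample string and
retrieves the same field both ways. The input string is invented for
this example and is only assumed to resemble what an IMAP server might
send.

\begin{verbatim}
import re

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# Hypothetical input, made up for illustration.
m = InternalDate.match('INTERNALDATE "17-Jul-1996 02:44:25 -0700"')

month = m.group('mon')
minutes = m.group('zonem')   # retrieved by name
same = m.group(9)            # the same group, retrieved by number

print(month)      # -> Jul
print(minutes)    # -> 00
print(same)       # -> 00
\end{verbatim}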
982 |
|
---|
The syntax for backreferences in an expression such as
\regexp{(...)\e 1} refers to the number of the group. There's
naturally a variant that uses the group name instead of the number.
|
---|
986 | This is also a Python extension: \regexp{(?P=\var{name})} indicates
|
---|
987 | that the contents of the group called \var{name} should again be found
|
---|
988 | at the current point. The regular expression for finding doubled
|
---|
989 | words, \regexp{(\e b\e w+)\e s+\e 1} can also be written as
|
---|
990 | \regexp{(?P<word>\e b\e w+)\e s+(?P=word)}:
|
---|
991 |
|
---|
992 | \begin{verbatim}
|
---|
993 | >>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
|
---|
994 | >>> p.search('Paris in the the spring').group()
|
---|
995 | 'the the'
|
---|
996 | \end{verbatim}
|
---|
997 |
|
---|
998 | \subsection{Lookahead Assertions}
|
---|
999 |
|
---|
1000 | Another zero-width assertion is the lookahead assertion. Lookahead
|
---|
1001 | assertions are available in both positive and negative form, and
|
---|
1002 | look like this:
|
---|
1003 |
|
---|
1004 | \begin{itemize}
|
---|
1005 | \item[\regexp{(?=...)}] Positive lookahead assertion. This succeeds
|
---|
1006 | if the contained regular expression, represented here by \code{...},
|
---|
1007 | successfully matches at the current location, and fails otherwise.
|
---|
1008 | But, once the contained expression has been tried, the matching engine
|
---|
1009 | doesn't advance at all; the rest of the pattern is tried right where
|
---|
1010 | the assertion started.
|
---|
1011 |
|
---|
1012 | \item[\regexp{(?!...)}] Negative lookahead assertion. This is the
|
---|
1013 | opposite of the positive assertion; it succeeds if the contained expression
|
---|
1014 | \emph{doesn't} match at the current position in the string.
|
---|
1015 | \end{itemize}
|
---|
1016 |
|
---|
1017 | An example will help make this concrete by demonstrating a case
|
---|
1018 | where a lookahead is useful. Consider a simple pattern to match a
|
---|
1019 | filename and split it apart into a base name and an extension,
|
---|
1020 | separated by a \samp{.}. For example, in \samp{news.rc}, \samp{news}
|
---|
1021 | is the base name, and \samp{rc} is the filename's extension.
|
---|
1022 |
|
---|
1023 | The pattern to match this is quite simple:
|
---|
1024 |
|
---|
1025 | \regexp{.*[.].*\$}
|
---|
1026 |
|
---|
1027 | Notice that the \samp{.} needs to be treated specially because it's a
|
---|
1028 | metacharacter; I've put it inside a character class. Also notice the
|
---|
1029 | trailing \regexp{\$}; this is added to ensure that all the rest of the
|
---|
1030 | string must be included in the extension. This regular expression
|
---|
1031 | matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and
|
---|
1032 | \samp{printers.conf}.
|
---|
1033 |
|
---|
1034 | Now, consider complicating the problem a bit; what if you want to
|
---|
1035 | match filenames where the extension is not \samp{bat}?
|
---|
1036 | Some incorrect attempts:
|
---|
1037 |
|
---|
1038 | \verb|.*[.][^b].*$|
|
---|
1039 | % $
|
---|
1040 |
|
---|
1041 | The first attempt above tries to exclude \samp{bat} by requiring that
|
---|
1042 | the first character of the extension is not a \samp{b}. This is
|
---|
1043 | wrong, because the pattern also doesn't match \samp{foo.bar}.
|
---|
1044 |
|
---|
1045 | % Messes up the HTML without the curly braces around \^
|
---|
1046 | \regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$}
|
---|
1047 |
|
---|
1048 | The expression gets messier when you try to patch up the first
|
---|
1049 | solution by requiring one of the following cases to match: the first
|
---|
1050 | character of the extension isn't \samp{b}; the second character isn't
|
---|
1051 | \samp{a}; or the third character isn't \samp{t}. This accepts
|
---|
1052 | \samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a
|
---|
1053 | three-letter extension and won't accept a filename with a two-letter
|
---|
1054 | extension such as \samp{sendmail.cf}. We'll complicate the pattern
|
---|
1055 | again in an effort to fix it.
|
---|
1056 |
|
---|
1057 | \regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$}
|
---|
1058 |
|
---|
In the third attempt, the second and third letters are both made
|
---|
1060 | optional in order to allow matching extensions shorter than three
|
---|
1061 | characters, such as \samp{sendmail.cf}.
|
---|
1062 |
|
---|
1063 | The pattern's getting really complicated now, which makes it hard to
|
---|
1064 | read and understand. Worse, if the problem changes and you want to
|
---|
1065 | exclude both \samp{bat} and \samp{exe} as extensions, the pattern
|
---|
1066 | would get even more complicated and confusing.
|
---|
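
A quick way to check the three attempts is to run them against a few
of the filenames mentioned above. This is only a sketch; the list
names are arbitrary.

\begin{verbatim}
import re

attempts = [
    r'.*[.][^b].*$',
    r'.*[.]([^b]..|.[^a].|..[^t])$',
    r'.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$',
]
filenames = ['foo.bar', 'autoexec.bat', 'sendmail.cf']

# For each attempt, collect the filenames that it accepts.
accepted = []
for pattern in attempts:
    accepted.append([name for name in filenames
                     if re.match(pattern, name)])

print(accepted[0])    # -> ['sendmail.cf']
print(accepted[1])    # -> ['foo.bar']
print(accepted[2])    # -> ['foo.bar', 'sendmail.cf']
\end{verbatim}

Only the third attempt treats all three names correctly, and it does
so at a considerable cost in readability.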
1067 |
|
---|
1068 | A negative lookahead cuts through all this:
|
---|
1069 |
|
---|
1070 | \regexp{.*[.](?!bat\$).*\$}
|
---|
1071 | % $
|
---|
1072 |
|
---|
The negative lookahead means: if the expression \regexp{bat\$} doesn't match at
|
---|
1074 | this point, try the rest of the pattern; if \regexp{bat\$} does match,
|
---|
1075 | the whole pattern will fail. The trailing \regexp{\$} is required to
|
---|
1076 | ensure that something like \samp{sample.batch}, where the extension
|
---|
1077 | only starts with \samp{bat}, will be allowed.
|
---|
1078 |
|
---|
1079 | Excluding another filename extension is now easy; simply add it as an
|
---|
1080 | alternative inside the assertion. The following pattern excludes
|
---|
1081 | filenames that end in either \samp{bat} or \samp{exe}:
|
---|
1082 |
|
---|
1083 | \regexp{.*[.](?!bat\$|exe\$).*\$}
|
---|
1084 | % $
|
---|
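
The following sketch tries this last pattern on a handful of
filenames. Apart from the names already used in this section,
\samp{setup.exe} is an invented example.

\begin{verbatim}
import re

pattern = re.compile(r'.*[.](?!bat$|exe$).*$')

filenames = ['foo.bar', 'autoexec.bat', 'sendmail.cf',
             'sample.batch', 'setup.exe']

# Keep only the names whose extension is neither 'bat' nor 'exe'.
accepted = [name for name in filenames if pattern.match(name)]
print(accepted)    # -> ['foo.bar', 'sendmail.cf', 'sample.batch']
\end{verbatim}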
1085 |
|
---|
1086 |
|
---|
1087 | \section{Modifying Strings}
|
---|
1088 |
|
---|
1089 | Up to this point, we've simply performed searches against a static
|
---|
1090 | string. Regular expressions are also commonly used to modify a string
|
---|
1091 | in various ways, using the following \class{RegexObject} methods:
|
---|
1092 |
|
---|
1093 | \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
|
---|
1094 | \lineii{split()}{Split the string into a list, splitting it wherever the RE matches}
|
---|
1095 | \lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string}
|
---|
1096 | \lineii{subn()}{Does the same thing as \method{sub()},
|
---|
1097 | but returns the new string and the number of replacements}
|
---|
1098 | \end{tableii}
|
---|
1099 |
|
---|
1100 |
|
---|
1101 | \subsection{Splitting Strings}
|
---|
1102 |
|
---|
1103 | The \method{split()} method of a \class{RegexObject} splits a string
|
---|
1104 | apart wherever the RE matches, returning a list of the pieces.
|
---|
It's similar to the \method{split()} method of strings but provides
much more generality in the delimiters that you can split by; the
string method only supports splitting by whitespace or by a fixed
string. As you'd expect, there's a module-level
|
---|
1110 | \function{re.split()} function, too.
|
---|
1111 |
|
---|
1112 | \begin{methoddesc}{split}{string \optional{, maxsplit\code{ = 0}}}
|
---|
1113 | Split \var{string} by the matches of the regular expression. If
|
---|
1114 | capturing parentheses are used in the RE, then their contents will
|
---|
1115 | also be returned as part of the resulting list. If \var{maxsplit}
|
---|
1116 | is nonzero, at most \var{maxsplit} splits are performed.
|
---|
1117 | \end{methoddesc}
|
---|
1118 |
|
---|
1119 | You can limit the number of splits made, by passing a value for
|
---|
1120 | \var{maxsplit}. When \var{maxsplit} is nonzero, at most
|
---|
1121 | \var{maxsplit} splits will be made, and the remainder of the string is
|
---|
1122 | returned as the final element of the list. In the following example,
|
---|
1123 | the delimiter is any sequence of non-alphanumeric characters.
|
---|
1124 |
|
---|
1125 | \begin{verbatim}
|
---|
1126 | >>> p = re.compile(r'\W+')
|
---|
1127 | >>> p.split('This is a test, short and sweet, of split().')
|
---|
1128 | ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
|
---|
1129 | >>> p.split('This is a test, short and sweet, of split().', 3)
|
---|
1130 | ['This', 'is', 'a', 'test, short and sweet, of split().']
|
---|
1131 | \end{verbatim}
|
---|
1132 |
|
---|
1133 | Sometimes you're not only interested in what the text between
|
---|
1134 | delimiters is, but also need to know what the delimiter was. If
|
---|
1135 | capturing parentheses are used in the RE, then their values are also
|
---|
1136 | returned as part of the list. Compare the following calls:
|
---|
1137 |
|
---|
1138 | \begin{verbatim}
|
---|
1139 | >>> p = re.compile(r'\W+')
|
---|
1140 | >>> p2 = re.compile(r'(\W+)')
|
---|
1141 | >>> p.split('This... is a test.')
|
---|
1142 | ['This', 'is', 'a', 'test', '']
|
---|
1143 | >>> p2.split('This... is a test.')
|
---|
1144 | ['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']
|
---|
1145 | \end{verbatim}
|
---|
1146 |
|
---|
1147 | The module-level function \function{re.split()} adds the RE to be
|
---|
1148 | used as the first argument, but is otherwise the same.
|
---|
1149 |
|
---|
1150 | \begin{verbatim}
|
---|
>>> re.split(r'[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'[\W]+', 'Words, words, words.', 1)
['Words', 'words, words.']
|
---|
1157 | \end{verbatim}
|
---|
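
To illustrate the extra generality over the string method, the sketch
below splits a line on either of two delimiters while discarding the
whitespace around them. The sample line is invented for illustration.

\begin{verbatim}
import re

line = 'apples, pears;plums ,  figs'

# The string method can only split on one fixed delimiter at a time.
by_comma = line.split(',')
print(by_comma)    # -> ['apples', ' pears;plums ', '  figs']

# The RE splits on a comma or semicolon plus any surrounding whitespace.
fields = re.split(r'\s*[,;]\s*', line)
print(fields)      # -> ['apples', 'pears', 'plums', 'figs']
\end{verbatim}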
1158 |
|
---|
1159 | \subsection{Search and Replace}
|
---|
1160 |
|
---|
1161 | Another common task is to find all the matches for a pattern, and
|
---|
1162 | replace them with a different string. The \method{sub()} method takes
|
---|
1163 | a replacement value, which can be either a string or a function, and
|
---|
1164 | the string to be processed.
|
---|
1165 |
|
---|
1166 | \begin{methoddesc}{sub}{replacement, string\optional{, count\code{ = 0}}}
|
---|
1167 | Returns the string obtained by replacing the leftmost non-overlapping
|
---|
1168 | occurrences of the RE in \var{string} by the replacement
|
---|
1169 | \var{replacement}. If the pattern isn't found, \var{string} is returned
|
---|
1170 | unchanged.
|
---|
1171 |
|
---|
1172 | The optional argument \var{count} is the maximum number of pattern
|
---|
1173 | occurrences to be replaced; \var{count} must be a non-negative
|
---|
1174 | integer. The default value of 0 means to replace all occurrences.
|
---|
1175 | \end{methoddesc}
|
---|
1176 |
|
---|
1177 | Here's a simple example of using the \method{sub()} method. It
|
---|
1178 | replaces colour names with the word \samp{colour}:
|
---|
1179 |
|
---|
1180 | \begin{verbatim}
|
---|
1181 | >>> p = re.compile( '(blue|white|red)')
|
---|
1182 | >>> p.sub( 'colour', 'blue socks and red shoes')
|
---|
1183 | 'colour socks and colour shoes'
|
---|
1184 | >>> p.sub( 'colour', 'blue socks and red shoes', count=1)
|
---|
1185 | 'colour socks and red shoes'
|
---|
1186 | \end{verbatim}
|
---|
1187 |
|
---|
1188 | The \method{subn()} method does the same work, but returns a 2-tuple
|
---|
1189 | containing the new string value and the number of replacements
|
---|
1190 | that were performed:
|
---|
1191 |
|
---|
1192 | \begin{verbatim}
|
---|
1193 | >>> p = re.compile( '(blue|white|red)')
|
---|
1194 | >>> p.subn( 'colour', 'blue socks and red shoes')
|
---|
1195 | ('colour socks and colour shoes', 2)
|
---|
1196 | >>> p.subn( 'colour', 'no colours at all')
|
---|
1197 | ('no colours at all', 0)
|
---|
1198 | \end{verbatim}
|
---|
1199 |
|
---|
1200 | Empty matches are replaced only when they're not
|
---|
1201 | adjacent to a previous match.
|
---|
1202 |
|
---|
1203 | \begin{verbatim}
|
---|
1204 | >>> p = re.compile('x*')
|
---|
1205 | >>> p.sub('-', 'abxd')
|
---|
1206 | '-a-b-d-'
|
---|
1207 | \end{verbatim}
|
---|
1208 |
|
---|
1209 | If \var{replacement} is a string, any backslash escapes in it are
|
---|
1210 | processed. That is, \samp{\e n} is converted to a single newline
|
---|
1211 | character, \samp{\e r} is converted to a carriage return, and so forth.
|
---|
1212 | Unknown escapes such as \samp{\e j} are left alone. Backreferences,
|
---|
1213 | such as \samp{\e 6}, are replaced with the substring matched by the
|
---|
1214 | corresponding group in the RE. This lets you incorporate
|
---|
1215 | portions of the original text in the resulting
|
---|
1216 | replacement string.
|
---|
1217 |
|
---|
1218 | This example matches the word \samp{section} followed by a string
|
---|
1219 | enclosed in \samp{\{}, \samp{\}}, and changes \samp{section} to
|
---|
1220 | \samp{subsection}:
|
---|
1221 |
|
---|
1222 | \begin{verbatim}
|
---|
1223 | >>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
|
---|
1224 | >>> p.sub(r'subsection{\1}','section{First} section{second}')
|
---|
1225 | 'subsection{First} subsection{second}'
|
---|
1226 | \end{verbatim}
|
---|
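Backreferences can also reorder pieces of the matched text. The sketch below, written for Python 3 with a made-up sample string, swaps the two halves of a \samp{Surname, Forename} pair.

```python
import re

# Group 1 is the word before the comma, group 2 the word after it.
p = re.compile(r'(\w+), (\w+)')

# \2 and \1 in the replacement refer back to those groups.
result = p.sub(r'\2 \1', 'Lovelace, Ada')
print(result)   # Ada Lovelace
```

The replacement is written as a raw string so that the backslashes reach the \module{re} module unchanged.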
1227 |
|
---|
1228 | There's also a syntax for referring to named groups as defined by the
|
---|
1229 | \regexp{(?P<name>...)} syntax. \samp{\e g<name>} will use the
|
---|
1230 | substring matched by the group named \samp{name}, and
|
---|
1231 | \samp{\e g<\var{number}>}
|
---|
1232 | uses the corresponding group number.
|
---|
1233 | \samp{\e g<2>} is therefore equivalent to \samp{\e 2},
|
---|
1234 | but isn't ambiguous in a
|
---|
1235 | replacement string such as \samp{\e g<2>0}. (\samp{\e 20} would be
|
---|
1236 | interpreted as a reference to group 20, not a reference to group 2
|
---|
1237 | followed by the literal character \character{0}.) The following
|
---|
1238 | substitutions are all equivalent, but use all three variations of the
|
---|
1239 | replacement string.
|
---|
1240 |
|
---|
1241 | \begin{verbatim}
|
---|
1242 | >>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
|
---|
1243 | >>> p.sub(r'subsection{\1}','section{First}')
|
---|
1244 | 'subsection{First}'
|
---|
1245 | >>> p.sub(r'subsection{\g<1>}','section{First}')
|
---|
1246 | 'subsection{First}'
|
---|
1247 | >>> p.sub(r'subsection{\g<name>}','section{First}')
|
---|
1248 | 'subsection{First}'
|
---|
1249 | \end{verbatim}
|
---|
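The ambiguity that \samp{\e g<2>0} avoids can be seen in a short experiment. This is a sketch assuming Python 3, where an invalid group reference in the replacement raises \code{re.error}; the two-group pattern is only an illustration.

```python
import re

p = re.compile(r'(\w)(\w)')

# \g<2>0 means "group 2, followed by a literal 0".
result = p.sub(r'\g<2>0', 'ab')
print(result)   # b0

# \20 is read as a reference to group 20, which this pattern lacks.
try:
    p.sub(r'\20', 'ab')
except re.error as exc:
    print('error:', exc)
```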
1250 |
|
---|
1251 | \var{replacement} can also be a function, which gives you even more
|
---|
1252 | control. If \var{replacement} is a function, the function is
|
---|
1253 | called for every non-overlapping occurrence of \var{pattern}. On each
|
---|
1254 | call, the function is
|
---|
1255 | passed a \class{MatchObject} argument for the match
|
---|
1256 | and can use this information to compute the desired replacement string and return it.
|
---|
1257 |
|
---|
1258 | In the following example, the replacement function translates
|
---|
1259 | decimals into hexadecimal:
|
---|
1260 |
|
---|
1261 | \begin{verbatim}
|
---|
1262 | >>> def hexrepl(match):
|
---|
1263 | ...     "Return the hex string for a decimal number"
|
---|
1264 | ...     value = int(match.group())
|
---|
1265 | ...     return hex(value)
|
---|
1266 | ...
|
---|
1267 | >>> p = re.compile(r'\d+')
|
---|
1268 | >>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
|
---|
1269 | 'Call 0xffd2 for printing, 0xc000 for user code.'
|
---|
1270 | \end{verbatim}
|
---|
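A replacement function can also read named groups from the match it receives. The following sketch assumes Python 3; the function name \code{fahrenheit\_to\_celsius}, the group name \code{deg}, and the sample sentence are all hypothetical.

```python
import re

def fahrenheit_to_celsius(match):
    "Return the Celsius equivalent of a matched Fahrenheit reading"
    degrees_f = float(match.group('deg'))
    degrees_c = (degrees_f - 32) * 5 / 9
    return '%.1fC' % degrees_c

# A number (optionally negative or fractional) followed by a capital F.
p = re.compile(r'(?P<deg>-?\d+(?:\.\d+)?)F\b')

result = p.sub(fahrenheit_to_celsius, 'Low 32F, high 212F.')
print(result)   # Low 0.0C, high 100.0C.
```

Because the function returns an ordinary string, any computation at all can sit between the match and its replacement.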
1271 |
|
---|
1272 | When using the module-level \function{re.sub()} function, the pattern
|
---|
1273 | is passed as the first argument. The pattern may be a string or a
|
---|
1274 | \class{RegexObject}; if you need to specify regular expression flags,
|
---|
1275 | you must either use a \class{RegexObject} as the first parameter, or use
|
---|
1276 | embedded modifiers in the pattern, e.g. \code{sub("(?i)b+", "x", "bbbb
|
---|
1277 | BBBB")} returns \code{'x x'}.
|
---|
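Both ways of supplying a flag to the module-level function can be compared side by side. This is a small sketch, assuming Python 3, that reuses the sample string from the paragraph above.

```python
import re

text = 'bbbb BBBB'

# Option 1: an embedded modifier inside the pattern string.
with_modifier = re.sub('(?i)b+', 'x', text)

# Option 2: a compiled pattern object that carries the flag.
compiled = re.compile('b+', re.IGNORECASE)
with_object = re.sub(compiled, 'x', text)

print(with_modifier)   # x x
print(with_object)     # x x
```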
1278 |
|
---|
1279 | \section{Common Problems}
|
---|
1280 |
|
---|
1281 | Regular expressions are a powerful tool for some applications, but in
|
---|
1282 | some ways their behaviour isn't intuitive and at times they don't
|
---|
1283 | behave the way you may expect them to. This section will point out
|
---|
1284 | some of the most common pitfalls.
|
---|
1285 |
|
---|
1286 | \subsection{Use String Methods}
|
---|
1287 |
|
---|
1288 | Sometimes using the \module{re} module is a mistake. If you're
|
---|
1289 | matching a fixed string, or a single character class, and you're not
|
---|
1290 | using any \module{re} features such as the \constant{IGNORECASE} flag,
|
---|
1291 | then the full power of regular expressions may not be required.
|
---|
1292 | Strings have several methods for performing operations with fixed
|
---|
1293 | strings and they're usually much faster, because the implementation is
|
---|
1294 | a single small C loop that's been optimized for the purpose, instead
|
---|
1295 | of the large, more generalized regular expression engine.
|
---|
1296 |
|
---|
1297 | One example might be replacing a single fixed string with another
|
---|
1298 | one; for example, you might replace \samp{word}
|
---|
1299 | with \samp{deed}. \code{re.sub()} seems like the function to use for
|
---|
1300 | this, but consider the \method{replace()} method. Note that
|
---|
1301 | \method{replace()} will also replace \samp{word} inside
|
---|
1302 | words, turning \samp{swordfish} into \samp{sdeedfish}, but the
|
---|
1303 | na{\"\i}ve RE \regexp{word} would have done that, too. (To avoid performing
|
---|
1304 | the substitution on parts of words, the pattern would have to be
|
---|
1305 | \regexp{\e bword\e b}, in order to require that \samp{word} have a
|
---|
1306 | word boundary on either side. This takes the job beyond
|
---|
1307 | \method{replace()}'s abilities.)
|
---|
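The difference between the three approaches is easy to check. The sketch below assumes Python 3 and uses an invented sample sentence.

```python
import re

text = 'A swordfish has no word for deed.'

# The string method and the naive RE behave identically here.
by_method = text.replace('word', 'deed')
by_naive_re = re.sub('word', 'deed', text)

# Only the word-boundary pattern leaves 'swordfish' alone.
by_boundary_re = re.sub(r'\bword\b', 'deed', text)

print(by_method)        # A sdeedfish has no deed for deed.
print(by_boundary_re)   # A swordfish has no deed for deed.
```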
1308 |
|
---|
1309 | Another common task is deleting every occurrence of a single character
|
---|
1310 | from a string or replacing it with another single character. You
|
---|
1311 | might do this with something like \code{re.sub('\e n', ' ', S)}, but
|
---|
1312 | \method{translate()} is capable of doing both tasks
|
---|
1313 | and will be faster than any regular expression operation can be.
|
---|
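As a rough illustration, here is how both tasks might look with \method{translate()}. This sketch assumes Python 3, where the method takes a single mapping from code points to replacements (older versions of Python used a different signature); the sample text is made up.

```python
text = 'line one\nline two\n'

# Replace every newline with a space: map the code point to a string.
spaced = text.translate({ord('\n'): ' '})

# Delete every newline: map the code point to None.
joined = text.translate({ord('\n'): None})

print(repr(spaced))   # 'line one line two '
print(repr(joined))   # 'line oneline two'
```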
1314 |
|
---|
1315 | In short, before turning to the \module{re} module, consider whether
|
---|
1316 | your problem can be solved with a faster and simpler string method.
|
---|
1317 |
|
---|
1318 | \subsection{match() versus search()}
|
---|
1319 |
|
---|
1320 | The \function{match()} function only checks whether the RE matches at
|
---|
1321 | the beginning of the string, while \function{search()} scans
|
---|
1322 | forward through the string for a match.
|
---|
1323 | It's important to keep this distinction in mind. Remember,
|
---|
1324 | \function{match()} only reports a successful match that
|
---|
1325 | starts at position 0; if the match wouldn't start at position 0,
|
---|
1326 | \function{match()} will \emph{not} report it.
|
---|
1327 |
|
---|
1328 | \begin{verbatim}
|
---|
1329 | >>> print(re.match('super', 'superstition').span())
|
---|
1330 | (0, 5)
|
---|
1331 | >>> print(re.match('super', 'insuperable'))
|
---|
1332 | None
|
---|
1333 | \end{verbatim}
|
---|
1334 |
|
---|
1335 | On the other hand, \function{search()} will scan forward through the
|
---|
1336 | string, reporting the first match it finds.
|
---|
1337 |
|
---|
1338 | \begin{verbatim}
|
---|
1339 | >>> print(re.search('super', 'superstition').span())
|
---|
1340 | (0, 5)
|
---|
1341 | >>> print(re.search('super', 'insuperable').span())
|
---|
1342 | (2, 7)
|
---|
1343 | \end{verbatim}
|
---|
1344 |
|
---|
1345 | Sometimes you'll be tempted to keep using \function{re.match()}, and
|
---|
1346 | just add \regexp{.*} to the front of your RE. Resist this temptation
|
---|
1347 | and use \function{re.search()} instead. The regular expression
|
---|
1348 | compiler does some analysis of REs in order to speed up the process of
|
---|
1349 | looking for a match. One such analysis figures out what the first
|
---|
1350 | character of a match must be; for example, a pattern starting with
|
---|
1351 | \regexp{Crow} must match starting with a \character{C}. The analysis
|
---|
1352 | lets the engine quickly scan through the string looking for the
|
---|
1353 | starting character, only trying the full match if a \character{C} is found.
|
---|
1354 |
|
---|
1355 | Adding \regexp{.*} defeats this optimization, requiring scanning to
|
---|
1356 | the end of the string and then backtracking to find a match for the
|
---|
1357 | rest of the RE. Use \function{re.search()} instead.
|
---|
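Apart from the speed penalty, the leading \regexp{.*} also changes what the match object reports. A short sketch, assuming Python 3:

```python
import re

text = 'insuperable'

# The .* becomes part of the match, so the span starts at 0.
m1 = re.match('.*super', text)

# search() reports where 'super' itself begins.
m2 = re.search('super', text)

print(m1.span())   # (0, 7)
print(m2.span())   # (2, 7)
```

With the \regexp{.*} version, an extra group would be needed just to recover the position that \function{re.search()} gives directly.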
1358 |
|
---|
1359 | \subsection{Greedy versus Non-Greedy}
|
---|
1360 |
|
---|
1361 | When repeating a regular expression, as in \regexp{a*}, the resulting
|
---|
1362 | action is to consume as much of the string as possible. This
|
---|
1363 | fact often bites you when you're trying to match a pair of
|
---|
1364 | balanced delimiters, such as the angle brackets surrounding an HTML
|
---|
1365 | tag. The na{\"\i}ve pattern for matching a single HTML tag doesn't
|
---|
1366 | work because of the greedy nature of \regexp{.*}.
|
---|
1367 |
|
---|
1368 | \begin{verbatim}
|
---|
1369 | >>> s = '<html><head><title>Title</title>'
|
---|
1370 | >>> len(s)
|
---|
1371 | 32
|
---|
1372 | >>> print(re.match('<.*>', s).span())
|
---|
1373 | (0, 32)
|
---|
1374 | >>> print(re.match('<.*>', s).group())
|
---|
1375 | <html><head><title>Title</title>
|
---|
1376 | \end{verbatim}
|
---|
1377 |
|
---|
1378 | The RE matches the \character{<} in \samp{<html>}, and the
|
---|
1379 | \regexp{.*} consumes the rest of the string. There's still more left
|
---|
1380 | in the RE, though, and the \regexp{>} can't match at the end of
|
---|
1381 | the string, so the regular expression engine has to backtrack
|
---|
1382 | character by character until it finds a match for the \regexp{>}.
|
---|
1383 | The final match extends from the \character{<} in \samp{<html>}
|
---|
1384 | to the \character{>} in \samp{</title>}, which isn't what you want.
|
---|
1385 |
|
---|
1386 | In this case, the solution is to use the non-greedy qualifiers
|
---|
1387 | \regexp{*?}, \regexp{+?}, \regexp{??}, or
|
---|
1388 | \regexp{\{\var{m},\var{n}\}?}, which match as \emph{little} text as
|
---|
1389 | possible. In the above example, the \character{>} is tried
|
---|
1390 | immediately after the first \character{<} matches, and when it fails,
|
---|
1391 | the engine advances a character at a time, retrying the \character{>}
|
---|
1392 | at every step. This produces just the right result:
|
---|
1393 |
|
---|
1394 | \begin{verbatim}
|
---|
1395 | >>> print(re.match('<.*?>', s).group())
|
---|
1396 | <html>
|
---|
1397 | \end{verbatim}
|
---|
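The contrast is even clearer when every match is collected. The sketch below assumes Python 3 and uses \function{re.findall()} on the same string \code{s}.

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: one match that swallows the whole string.
greedy = re.findall('<.*>', s)

# Non-greedy: each tag is matched separately.
non_greedy = re.findall('<.*?>', s)

print(greedy)       # ['<html><head><title>Title</title>']
print(non_greedy)   # ['<html>', '<head>', '<title>', '</title>']
```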
1398 |
|
---|
1399 | (Note that parsing HTML or XML with regular expressions is painful.
|
---|
1400 | Quick-and-dirty patterns will handle common cases, but HTML and XML
|
---|
1401 | have special cases that will break the obvious regular expression; by
|
---|
1402 | the time you've written a regular expression that handles all of the
|
---|
1403 | possible cases, the patterns will be \emph{very} complicated. Use an
|
---|
1404 | HTML or XML parser module for such tasks.)
|
---|
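For comparison, here is roughly what the parser-based approach looks like. This is only a sketch: it assumes Python 3, where the standard library's parser lives in \module{html.parser}, and the class name \code{TagCollector} is hypothetical.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record the name of every start tag seen."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called by the parser once for each opening tag.
        self.tags.append(tag)

collector = TagCollector()
collector.feed('<html><head><title>Title</title>')
collector.close()
print(collector.tags)   # ['html', 'head', 'title']
```

The parser deals with attributes, quoting, and comments itself, which are exactly the special cases that make hand-written patterns grow complicated.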
1405 |
|
---|
1406 | \subsection{Not Using re.VERBOSE}
|
---|
1407 |
|
---|
1408 | By now you've probably noticed that regular expressions are a very
|
---|
1409 | compact notation, but they're not terribly readable. REs of
|
---|
1410 | moderate complexity can become lengthy collections of backslashes,
|
---|
1411 | parentheses, and metacharacters, making them difficult to read and
|
---|
1412 | understand.
|
---|
1413 |
|
---|
1414 | For such REs, specifying the \code{re.VERBOSE} flag when
|
---|
1415 | compiling the regular expression can be helpful, because it allows
|
---|
1416 | you to format the regular expression more clearly.
|
---|
1417 |
|
---|
1418 | The \code{re.VERBOSE} flag has several effects. Whitespace in the
|
---|
1419 | regular expression that \emph{isn't} inside a character class is
|
---|
1420 | ignored. This means that an expression such as \regexp{dog | cat} is
|
---|
1421 | equivalent to the less readable \regexp{dog|cat}, but \regexp{[a b]}
|
---|
1422 | will still match the characters \character{a}, \character{b}, or a
|
---|
1423 | space. In addition, you can also put comments inside a RE; comments
|
---|
1424 | extend from a \samp{\#} character to the next newline. When used with
|
---|
1425 | triple-quoted strings, this enables REs to be formatted more neatly:
|
---|
1426 |
|
---|
1427 | \begin{verbatim}
|
---|
1428 | pat = re.compile(r"""
|
---|
1429 |  \s*                  # Skip leading whitespace
|
---|
1430 |  (?P<header>[^:]+)    # Header name
|
---|
1431 |  \s* :                # Whitespace, and a colon
|
---|
1432 |  (?P<value>.*?)       # The header's value -- *? used to
|
---|
1433 |                       # lose the following trailing whitespace
|
---|
1434 |  \s*$                 # Trailing whitespace to end-of-line
|
---|
1435 | """, re.VERBOSE)
|
---|
1436 | \end{verbatim}
|
---|
1437 | % $
|
---|
1438 |
|
---|
1439 | This is far more readable than:
|
---|
1440 |
|
---|
1441 | \begin{verbatim}
|
---|
1442 | pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
|
---|
1443 | \end{verbatim}
|
---|
1444 | % $
|
---|
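To confirm that the two spellings describe the same pattern, both can be run against one input. The following sketch assumes Python 3; the header line is an invented example.

```python
import re

verbose = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value (non-greedy)
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

compact = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

line = '  Subject: Regular expressions   '
m = verbose.match(line)
print(repr(m.group('header')))   # 'Subject'
print(repr(m.group('value')))    # ' Regular expressions'

# Both patterns capture exactly the same groups.
same = m.groupdict() == compact.match(line).groupdict()
```

Note that the value keeps the space after the colon; only the trailing whitespace is dropped by the non-greedy \regexp{.*?}.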
1445 |
|
---|
1446 | \section{Feedback}
|
---|
1447 |
|
---|
1448 | Regular expressions are a complicated topic. Did this document help
|
---|
1449 | you understand them? Were there parts that were unclear, or problems
|
---|
1450 | you encountered that weren't covered here? If so, please send
|
---|
1451 | suggestions for improvements to the author.
|
---|
1452 |
|
---|
1453 | The most complete book on regular expressions is almost certainly
|
---|
1454 | Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published
|
---|
1455 | by O'Reilly. Unfortunately, it exclusively concentrates on Perl and
|
---|
1456 | Java's flavours of regular expressions, and doesn't contain any Python
|
---|
1457 | material at all, so it won't be useful as a reference for programming
|
---|
1458 | in Python. (The first edition covered Python's now-removed
|
---|
1459 | \module{regex} module, which won't help you much.) Consider checking
|
---|
1460 | it out from your library.
|
---|
1461 |
|
---|
1462 | \end{document}
|
---|
1463 |
|
---|