source: python/trunk/Doc/library/htmlparser.rst

Last change on this file was 391, checked in by dmik, 11 years ago

python: Merge vendor 2.7.6 to trunk.

:mod:`HTMLParser` --- Simple HTML and XHTML parser
==================================================

.. module:: HTMLParser
   :synopsis: A simple parser that can handle HTML and XHTML.

.. note::

   The :mod:`HTMLParser` module has been renamed to :mod:`html.parser` in Python
   3.  The :term:`2to3` tool will automatically adapt imports when converting
   your sources to Python 3.


.. versionadded:: 2.2

.. index::
   single: HTML
   single: XHTML

**Source code:** :source:`Lib/HTMLParser.py`

--------------

This module defines a class :class:`.HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
in :mod:`sgmllib`.


.. class:: HTMLParser()

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered.  The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   The :class:`.HTMLParser` class is instantiated without arguments.

   Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
   match start tags or call the end-tag handler for elements which are closed
   implicitly by closing an outer element.

An exception is defined as well:

.. exception:: HTMLParseError

   :class:`.HTMLParser` is able to handle broken markup, but in some cases it
   might raise this exception when it encounters an error while parsing.
   This exception provides three attributes: :attr:`msg` is a brief
   message explaining the error, :attr:`lineno` is the number of the line on
   which the broken construct was detected, and :attr:`offset` is the number of
   characters into the line at which the construct starts.


Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`.HTMLParser` class to print out start tags, end tags and data
as they are encountered::

   from HTMLParser import HTMLParser

   # create a subclass and override the handler methods
   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print "Encountered a start tag:", tag
       def handle_endtag(self, tag):
           print "Encountered an end tag :", tag
       def handle_data(self, data):
           print "Encountered some data :", data

   # instantiate the parser and feed it some HTML
   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be::

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`.HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.  *data* can be either :class:`unicode` or
   :class:`str`, but passing :class:`unicode` is advised.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`.HTMLParser` base class method :meth:`close`.
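
As a rough sketch of how :meth:`feed` and :meth:`close` fit together, the
subclass below is fed a document in fixed-size pieces and overrides
:meth:`close`, calling the base class method first.  The names
``ChunkedTextParser`` and ``parse_in_chunks`` and the ``finished`` flag are
invented for this illustration; the import fallback assumes the Python 3 module
name :mod:`html.parser` mentioned in the note at the top of this page.

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3 name of the module


class ChunkedTextParser(HTMLParser):
    """Hypothetical subclass: collects text and notes when input has ended."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.text = []
        self.finished = False

    def handle_data(self, data):
        # May be called several times for one run of text.
        self.text.append(data)

    def close(self):
        # Always let the base class process whatever is still buffered.
        HTMLParser.close(self)
        self.finished = True


def parse_in_chunks(parser, document, size=8):
    """Feed *document* to *parser* in pieces of *size* characters."""
    for start in range(0, len(document), size):
        parser.feed(document[start:start + size])
    parser.close()
    return parser


result = parse_in_chunks(ChunkedTextParser(),
                         '<p>Hello, <b>world</b>!</p>', size=5)
joined = ''.join(result.text)   # 'Hello, world!'
```

Joining the collected pieces gives the same text regardless of where the chunk
boundaries fall, which is why the sketch never relies on the number of
:meth:`handle_data` calls.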


.. method:: HTMLParser.reset()

   Reset the instance.  Loses all unprocessed data.  This is called implicitly at
   instantiation time.


.. method:: HTMLParser.getpos()

   Return current line number and offset.


.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).
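
The two methods above can be combined, for instance, to report where each start
tag appeared and how it was originally written.  This is only a sketch: the
``PositionRecorder`` class and its ``records`` list are hypothetical names, and
the import fallback assumes the Python 3 module name :mod:`html.parser`.

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3 name of the module


class PositionRecorder(HTMLParser):
    """Hypothetical subclass: remembers position and raw text of start tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.records = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives (line number, offset); the tag name is lower-cased,
        # while get_starttag_text() keeps the original spelling.
        self.records.append((tag, self.getpos(), self.get_starttag_text()))


recorder = PositionRecorder()
recorder.feed('<p>\n  <B Class=x>bold</B></p>')
for tag, (lineno, offset), raw in recorder.records:
    print('%s at line %d, offset %d: %s' % (tag, lineno, offset, raw))
```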


The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass.  The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case.  The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  Each *name* is converted to lower case;
   in each *value*, the surrounding quotes are removed and character and entity
   references are replaced.

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   .. versionchanged:: 2.6
      All entity references from :mod:`htmlentitydefs` are now replaced in the
      attribute values.
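
A common use of :meth:`handle_starttag` is to pick out one attribute from one
kind of tag.  The sketch below gathers the ``href`` values of ``a`` elements;
``LinkCollector`` and its ``links`` list are names made up for this example, and
the import fallback assumes the Python 3 module name :mod:`html.parser`.

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3 name of the module


class LinkCollector(HTMLParser):
    """Hypothetical subclass: collects the href of every <a> start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs with lower-cased names.
        href = dict(attrs).get('href')
        if href is not None:
            self.links.append(href)


collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A>'
               '<a name="top"></a>'
               '<a href="?a=1&amp;b=2">query</a>')
# collector.links == ['http://www.cwi.nl/', '?a=1&b=2']
```

Because the tag and attribute names arrive in lower case, the comparison with
``'a'`` and ``'href'`` also matches ``<A HREF=...>``, and the ``&amp;`` in the
last attribute value has already been replaced by ``&``.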


.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``).  This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
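
To make the difference concrete, the following sketch records events with and
without an overridden :meth:`handle_startendtag`.  The class names
``EventRecorder`` and ``EmptyTagRecorder`` are hypothetical, and the import
fallback assumes the Python 3 module name :mod:`html.parser`.

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3 name of the module


class EventRecorder(HTMLParser):
    """Hypothetical subclass: relies on the default handle_startendtag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


class EmptyTagRecorder(EventRecorder):
    """Hypothetical subclass: treats XHTML-style empty tags separately."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(('empty', tag))


default = EventRecorder()
default.feed('<br/><hr>')
# default.events == [('start', 'br'), ('end', 'br'), ('start', 'hr')]

custom = EmptyTagRecorder()
custom.feed('<br/><hr>')
# custom.events == [('empty', 'br'), ('start', 'hr')]
```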


.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).
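
Since the content of ``script`` and ``style`` elements also reaches
:meth:`handle_data`, a subclass that only wants visible text has to filter it
out itself.  One possible approach, sketched here with the hypothetical names
``VisibleTextParser``, ``SKIPPED`` and ``_skip_depth``, is to track whether the
parser is currently inside such an element.  The import fallback assumes the
Python 3 module name :mod:`html.parser`.

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3 name of the module


class VisibleTextParser(HTMLParser):
    """Hypothetical subclass: keeps text outside <script> and <style>."""

    SKIPPED = ('script', 'style')

    def __init__(self):
        HTMLParser.__init__(self)
        self.text = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore data delivered while inside a skipped element.
        if not self._skip_depth:
            self.text.append(data)


visible = VisibleTextParser()
visible.feed('<p>Hi</p><style>p { color: red }</style>'
             '<script>var x = "<b>";</script><p>there</p>')
# visible.text == ['Hi', 'there']
```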


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``).


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.


.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   This method is called when a processing instruction is encountered.  The *data*
   parameter will contain the entire processing instruction.  For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.

   .. note::

      The :class:`.HTMLParser` class uses the SGML syntactic rules for processing
      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.
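
The effect described in the note can be observed with a small recorder.  This
is a sketch only: ``PIRecorder`` and the helper ``strip_xml_style_marker`` are
hypothetical names and not part of the module, and the import fallback assumes
the Python 3 module name :mod:`html.parser`.

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3 name of the module


class PIRecorder(HTMLParser):
    """Hypothetical subclass: stores the data of processing instructions."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


def strip_xml_style_marker(data):
    """Drop one trailing '?' left over from an XHTML-style ``<?...?>``."""
    return data[:-1] if data.endswith('?') else data


pis = PIRecorder()
pis.feed("<?proc color='red'><?xml version='1.0'?>")
# pis.instructions == ["proc color='red'", "xml version='1.0'?"]
cleaned = [strip_xml_style_marker(pi) for pi in pis.instructions]
# cleaned == ["proc color='red'", "xml version='1.0'"]
```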


.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup.  Overriding this method in a derived class is
   sometimes useful.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from HTMLParser import HTMLParser
   from htmlentitydefs import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print "Start tag:", tag
           for attr in attrs:
               print " attr:", attr
       def handle_endtag(self, tag):
           print "End tag :", tag
       def handle_data(self, data):
           print "Data :", data
       def handle_comment(self, data):
           print "Comment :", data
       def handle_entityref(self, name):
           c = unichr(name2codepoint[name])
           print "Named ent:", c
       def handle_charref(self, name):
           if name.startswith('x'):
               c = unichr(int(name[1:], 16))
           else:
               c = unichr(int(name))
           print "Num ent :", c
       def handle_decl(self, data):
           print "Decl :", data

   parser = MyHTMLParser()

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
    attr: ('src', 'python-logo.png')
    attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data : Python
   End tag : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
    attr: ('type', 'text/css')
   Data : #python { color: green }
   End tag : style
   >>>
   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
    attr: ('type', 'text/javascript')
   Data : alert("<strong>hello!</strong>");
   End tag : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment : a comment
   Comment : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``)::

   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent : >
   Num ent : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data : buff
   Data : ered
   Data : text
   End tag : span
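
If a single piece of text per element is wanted, one option is to buffer the
data and join it when the next tag (or the end of input) arrives.  The sketch
below is an assumption about how this could be done, not something the module
provides: ``CoalescingParser``, ``_flush`` and ``texts`` are hypothetical names,
and the import fallback assumes the Python 3 module name :mod:`html.parser`.

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3 name of the module


class CoalescingParser(HTMLParser):
    """Hypothetical subclass: joins consecutive handle_data() pieces."""

    def __init__(self):
        HTMLParser.__init__(self)
        self._buffer = []
        self.texts = []

    def _flush(self):
        # Emit the buffered pieces as one string, if there are any.
        if self._buffer:
            self.texts.append(''.join(self._buffer))
            self._buffer = []

    def handle_data(self, data):
        self._buffer.append(data)

    def handle_starttag(self, tag, attrs):
        self._flush()

    def handle_endtag(self, tag):
        self._flush()

    def close(self):
        # Let the base class process pending input, then emit trailing text.
        HTMLParser.close(self)
        self._flush()


coalescing = CoalescingParser()
for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
    coalescing.feed(chunk)
coalescing.close()
# coalescing.texts == ['buffered text']
```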

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
    attr: ('class', 'link')
    attr: ('href', '#main')
   Data : tag soup
   End tag : p
   End tag : a