Changeset 391 for python/trunk/Doc/library/htmlparser.rst
- Timestamp:
- Mar 19, 2014, 11:31:01 PM (11 years ago)
- Location:
- python/trunk
- Files:
-
- 2 edited
python/trunk
-
Property svn:mergeinfo
set to
/python/vendor/Python-2.7.6 merged eligible /python/vendor/current merged eligible
-
Property svn:mergeinfo
set to
-
python/trunk/Doc/library/htmlparser.rst
r2 r391 9 9 10 10 The :mod:`HTMLParser` module has been renamed to :mod:`html.parser` in Python 11 3.0. The :term:`2to3` tool will automatically adapt imports when converting 12 your sources to 3.0. 11 3. The :term:`2to3` tool will automatically adapt imports when converting 12 your sources to Python 3. 13 13 14 14 … … 19 19 single: XHTML 20 20 21 This module defines a class :class:`HTMLParser` which serves as the basis for 21 **Source code:** :source:`Lib/HTMLParser.py` 22 23 -------------- 24 25 This module defines a class :class:`.HTMLParser` which serves as the basis for 22 26 parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML. 23 27 Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser … … 27 31 .. class:: HTMLParser() 28 32 29 The :class:`HTMLParser` class is instantiated without arguments. 30 31 An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags 32 begin and end. The :class:`HTMLParser` class is meant to be overridden by the 33 user to provide a desired behavior. 33 An :class:`.HTMLParser` instance is fed HTML data and calls handler methods 34 when start tags, end tags, text, comments, and other markup elements are 35 encountered. The user should subclass :class:`.HTMLParser` and override its 36 methods to implement the desired behavior. 37 38 The :class:`.HTMLParser` class is instantiated without arguments. 34 39 35 40 Unlike the parser in :mod:`htmllib`, this parser does not check that end tags … … 39 44 An exception is defined as well: 40 45 41 42 46 .. exception:: HTMLParseError 43 47 44 Exception raised by the :class:`HTMLParser` class when it encounters an error 45 while parsing. 
This exception provides three attributes: :attr:`msg` is a brief 46 message explaining the error, :attr:`lineno` is the number of the line on which 47 the broken construct was detected, and :attr:`offset` is the number of 48 :class:`.HTMLParser` is able to handle broken markup, but in some cases it 49 might raise this exception when it encounters an error while parsing. 50 This exception provides three attributes: :attr:`msg` is a brief 51 message explaining the error, :attr:`lineno` is the number of the line on 52 which the broken construct was detected, and :attr:`offset` is the number of 48 53 characters into the line at which the construct starts. 49 54 50 :class:`HTMLParser` instances have the following methods: 51 52 53 .. method:: HTMLParser.reset() 54 55 Reset the instance. Loses all unprocessed data. This is called implicitly at 56 instantiation time. 55 56 Example HTML Parser Application 57 ------------------------------- 58 59 As a basic example, below is a simple HTML parser that uses the 60 :class:`.HTMLParser` class to print out start tags, end tags and data 61 as they are encountered:: 62 63 from HTMLParser import HTMLParser 64 65 # create a subclass and override the handler methods 66 class MyHTMLParser(HTMLParser): 67 def handle_starttag(self, tag, attrs): 68 print "Encountered a start tag:", tag 69 def handle_endtag(self, tag): 70 print "Encountered an end tag :", tag 71 def handle_data(self, data): 72 print "Encountered some data :", data 73 74 # instantiate the parser and feed it some HTML 75 parser = MyHTMLParser() 76 parser.feed('<html><head><title>Test</title></head>' 77 '<body><h1>Parse me!</h1></body></html>') 78 79 The output will then be:: 80 81 Encountered a start tag: html 82 Encountered a start tag: head 83 Encountered a start tag: title 84 Encountered some data : Test 85 Encountered an end tag : title 86 Encountered an end tag : head 87 Encountered a start tag: body 88 Encountered a start tag: h1 89 Encountered some data : Parse me! 
90 Encountered an end tag : h1 91 Encountered an end tag : body 92 Encountered an end tag : html 93 94 95 :class:`.HTMLParser` Methods 96 ---------------------------- 97 98 :class:`.HTMLParser` instances have the following methods: 57 99 58 100 … … 61 103 Feed some text to the parser. It is processed insofar as it consists of 62 104 complete elements; incomplete data is buffered until more data is fed or 63 :meth:`close` is called. 105 :meth:`close` is called. *data* can be either :class:`unicode` or 106 :class:`str`, but passing :class:`unicode` is advised. 64 107 65 108 … … 69 112 mark. This method may be redefined by a derived class to define additional 70 113 processing at the end of the input, but the redefined version should always call 71 the :class:`HTMLParser` base class method :meth:`close`. 114 the :class:`.HTMLParser` base class method :meth:`close`. 115 116 117 .. method:: HTMLParser.reset() 118 119 Reset the instance. Loses all unprocessed data. This is called implicitly at 120 instantiation time. 72 121 73 122 … … 85 134 86 135 136 The following methods are called when data or markup elements are encountered 137 and they are meant to be overridden in a subclass. The base class 138 implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`): 139 140 87 141 .. method:: HTMLParser.handle_starttag(tag, attrs) 88 142 89 This method is called to handle the start of a tag. It is intended to be 90 overridden by a derived class; the base class implementation does nothing. 143 This method is called to handle the start of a tag (e.g. ``<div id="main">``). 91 144 92 145 The *tag* argument is the name of the tag converted to lower case. The *attrs* … … 94 147 inside the tag's ``<>`` brackets. The *name* will be translated to lower case, 95 148 and quotes in the *value* have been removed, and character and entity references 96 have been replaced. 
For instance, for the tag ``<A 97 HREF="http://www.cwi.nl/">``, this method would be called as 98 ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``. 149 have been replaced. 150 151 For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method 152 would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``. 99 153 100 154 .. versionchanged:: 2.6 101 All entity references from :mod:`htmlentitydefs` are now replaced in the attribute 102 values. 155 All entity references from :mod:`htmlentitydefs` are now replaced in the 156 attribute values. 157 158 159 .. method:: HTMLParser.handle_endtag(tag) 160 161 This method is called to handle the end tag of an element (e.g. ``</div>``). 162 163 The *tag* argument is the name of the tag converted to lower case. 103 164 104 165 … … 106 167 107 168 Similar to :meth:`handle_starttag`, but called when the parser encounters an 108 XHTML-style empty tag (``<a .../>``). This method may be overridden by 169 XHTML-style empty tag (``<img ... />``). This method may be overridden by 109 170 subclasses which require this particular lexical information; the default 110 implementation simple calls :meth:`handle_starttag` and :meth:`handle_endtag`. 111 112 113 .. method:: HTMLParser.handle_endtag(tag) 114 115 This method is called to handle the end tag of an element. It is intended to be 116 overridden by a derived class; the base class implementation does nothing. The 117 *tag* argument is the name of the tag converted to lower case. 171 implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`. 118 172 119 173 120 174 .. method:: HTMLParser.handle_data(data) 121 175 122 This method is called to process arbitrary data. It is intended to be 123 overridden by a derived class; the base class implementation does nothing. 176 This method is called to process arbitrary data (e.g. text nodes and the 177 content of ``<script>...</script>`` and ``<style>...</style>``). 178 179 180 .. 
method:: HTMLParser.handle_entityref(name) 181 182 This method is called to process a named character reference of the form 183 ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference 184 (e.g. ``'gt'``). 124 185 125 186 126 187 .. method:: HTMLParser.handle_charref(name) 127 188 128 This method is called to process a character reference of the form ``&#ref;``. 129 It is intended to be overridden by a derived class; the base class 130 implementation does nothing. 131 132 133 .. method:: HTMLParser.handle_entityref(name) 134 135 This method is called to process a general entity reference of the form 136 ``&name;`` where *name* is an general entity reference. It is intended to be 137 overridden by a derived class; the base class implementation does nothing. 189 This method is called to process decimal and hexadecimal numeric character 190 references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal 191 equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``; 192 in this case the method will receive ``'62'`` or ``'x3E'``. 138 193 139 194 140 195 .. method:: HTMLParser.handle_comment(data) 141 196 142 This method is called when a comment is encountered. The *comment* argument is 143 a string containing the text between the ``--`` and ``--`` delimiters, but not 144 the delimiters themselves. For example, the comment ``<!--text-->`` will cause 145 this method to be called with the argument ``'text'``. It is intended to be 146 overridden by a derived class; the base class implementation does nothing. 197 This method is called when a comment is encountered (e.g. ``<!--comment-->``). 198 199 For example, the comment ``<!-- comment -->`` will cause this method to be 200 called with the argument ``' comment '``. 
201 202 The content of Internet Explorer conditional comments (condcoms) will also be 203 sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``, 204 this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``. 147 205 148 206 149 207 .. method:: HTMLParser.handle_decl(decl) 150 208 151 Method called when an SGML declaration is read by the parser. The *decl* 152 parameter will be the entire contents of the declaration inside the ``<!``...\ 153 ``>`` markup. It is intended to be overridden by a derived class; the base 154 class implementation does nothing. 209 This method is called to handle an HTML doctype declaration (e.g. 210 ``<!DOCTYPE html>``). 211 212 The *decl* parameter will be the entire contents of the declaration inside 213 the ``<!...>`` markup (e.g. ``'DOCTYPE html'``). 155 214 156 215 157 216 .. method:: HTMLParser.handle_pi(data) 158 217 159 Method called when a processing instruction is encountered. The *data* 160 parameter will contain the entire processing instruction. For example, for the 218 This method is called when a processing instruction is encountered. The *data* 219 parameter will contain the entire processing instruction. For example, for the 161 220 processing instruction ``<?proc color='red'>``, this method would be called as 162 ``handle_pi("proc color='red'")``. It is intended to be overridden by a derived 163 class; the base class implementation does nothing. 221 ``handle_pi("proc color='red'")``. 164 222 165 223 .. note:: 166 224 167 The :class:`HTMLParser` class uses the SGML syntactic rules for processing 225 The :class:`.HTMLParser` class uses the SGML syntactic rules for processing 168 226 instructions. An XHTML processing instruction using the trailing ``'?'`` will 169 227 cause the ``'?'`` to be included in *data*. 170 228 171 229 172 .. 
_htmlparser-example: 173 174 Example HTML Parser Application 175 ------------------------------- 176 177 As a basic example, below is a very basic HTML parser that uses the 178 :class:`HTMLParser` class to print out tags as they are encountered:: 230 .. method:: HTMLParser.unknown_decl(data) 231 232 This method is called when an unrecognized declaration is read by the parser. 233 234 The *data* parameter will be the entire contents of the declaration inside 235 the ``<![...]>`` markup. It is sometimes useful to be overridden by a 236 derived class. 237 238 239 .. _htmlparser-examples: 240 241 Examples 242 -------- 243 244 The following class implements a parser that will be used to illustrate more 245 examples:: 179 246 180 247 from HTMLParser import HTMLParser 248 from htmlentitydefs import name2codepoint 181 249 182 250 class MyHTMLParser(HTMLParser): 183 184 251 def handle_starttag(self, tag, attrs): 185 print "Encountered the beginning of a %s tag" % tag 186 252 print "Start tag:", tag 253 for attr in attrs: 254 print " attr:", attr 187 255 def handle_endtag(self, tag): 188 print "Encountered the end of a %s tag" % tag 189 256 print "End tag :", tag 257 def handle_data(self, data): 258 print "Data :", data 259 def handle_comment(self, data): 260 print "Comment :", data 261 def handle_entityref(self, name): 262 c = unichr(name2codepoint[name]) 263 print "Named ent:", c 264 def handle_charref(self, name): 265 if name.startswith('x'): 266 c = unichr(int(name[1:], 16)) 267 else: 268 c = unichr(int(name)) 269 print "Num ent :", c 270 def handle_decl(self, data): 271 print "Decl :", data 272 273 parser = MyHTMLParser() 274 275 Parsing a doctype:: 276 277 >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" ' 278 ... 
'"http://www.w3.org/TR/html4/strict.dtd">') 279 Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd" 280 281 Parsing an element with a few attributes and a title:: 282 283 >>> parser.feed('<img src="python-logo.png" alt="The Python logo">') 284 Start tag: img 285 attr: ('src', 'python-logo.png') 286 attr: ('alt', 'The Python logo') 287 >>> 288 >>> parser.feed('<h1>Python</h1>') 289 Start tag: h1 290 Data : Python 291 End tag : h1 292 293 The content of ``script`` and ``style`` elements is returned as is, without 294 further parsing:: 295 296 >>> parser.feed('<style type="text/css">#python { color: green }</style>') 297 Start tag: style 298 attr: ('type', 'text/css') 299 Data : #python { color: green } 300 End tag : style 301 >>> 302 >>> parser.feed('<script type="text/javascript">' 303 ... 'alert("<strong>hello!</strong>");</script>') 304 Start tag: script 305 attr: ('type', 'text/javascript') 306 Data : alert("<strong>hello!</strong>"); 307 End tag : script 308 309 Parsing comments:: 310 311 >>> parser.feed('<!-- a comment -->' 312 ... '<!--[if IE 9]>IE-specific content<![endif]-->') 313 Comment : a comment 314 Comment : [if IE 9]>IE-specific content<![endif] 315 316 Parsing named and numeric character references and converting them to the 317 correct char (note: these 3 references are all equivalent to ``'>'``):: 318 319 >>> parser.feed('&gt;&#62;&#x3E;') 320 Named ent: > 321 Num ent : > 322 Num ent : > 323 324 Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but 325 :meth:`~HTMLParser.handle_data` might be called more than once:: 326 327 >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']: 328 ... parser.feed(chunk) 329 ... 330 Start tag: span 331 Data : buff 332 Data : ered 333 Data : text 334 End tag : span 335 336 Parsing invalid HTML (e.g. 
unquoted attributes) also works:: 337 338 >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>') 339 Start tag: p 340 Start tag: a 341 attr: ('class', 'link') 342 attr: ('href', '#main') 343 Data : tag soup 344 End tag : p 345 End tag : a
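Because :meth:`~HTMLParser.handle_data` may be called several times for a single run of text when input arrives in chunks, a subclass that needs the whole text usually has to accumulate the pieces itself. The following is a minimal sketch of that idea, not part of the changeset: the class name ``TextCollector`` and its ``events``, ``chunks`` and ``text()`` members are hypothetical names chosen for illustration, and the import fallback assumes the Python 3 module name ``html.parser`` mentioned at the top of the document.

```python
try:
    from HTMLParser import HTMLParser    # Python 2 module name used in this document
except ImportError:
    from html.parser import HTMLParser   # renamed module in Python 3


class TextCollector(HTMLParser):
    """Hypothetical helper: records tag events and accumulates text pieces."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []   # ('start', tag, attrs) and ('end', tag) tuples
        self.chunks = []   # text pieces in the order handle_data received them

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        # May be called more than once for one text node, so only store here.
        self.chunks.append(data)

    def text(self):
        # Join the pieces only when the caller asks for the full text.
        return ''.join(self.chunks)


collector = TextCollector()
for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
    collector.feed(chunk)
collector.close()   # flush anything still buffered
print(collector.text())
```

Joining in ``text()`` rather than in ``handle_data`` keeps the handler cheap and makes the result independent of how the input happened to be split across ``feed()`` calls.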