Timestamp:
Mar 19, 2014, 11:31:01 PM (11 years ago)
Author:
dmik
Message:

python: Merge vendor 2.7.6 to trunk.

Location:
python/trunk
Files:
2 edited

  • python/trunk

  • python/trunk/Doc/library/htmlparser.rst

    r2 r391  
    99
    1010   The :mod:`HTMLParser` module has been renamed to :mod:`html.parser` in Python
    11    3.0.  The :term:`2to3` tool will automatically adapt imports when converting
    12    your sources to 3.0.
     11   3.  The :term:`2to3` tool will automatically adapt imports when converting
     12   your sources to Python 3.
    1313
    1414
     
    1919   single: XHTML
    2020
    21 This module defines a class :class:`HTMLParser` which serves as the basis for
     21**Source code:** :source:`Lib/HTMLParser.py`
     22
     23--------------
     24
     25This module defines a class :class:`.HTMLParser` which serves as the basis for
    2226parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
    2327Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
     
    2731.. class:: HTMLParser()
    2832
    29    The :class:`HTMLParser` class is instantiated without arguments.
    30 
    31    An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags
    32    begin and end.  The :class:`HTMLParser` class is meant to be overridden by the
    33    user to provide a desired behavior.
     33   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
     34   when start tags, end tags, text, comments, and other markup elements are
     35   encountered.  The user should subclass :class:`.HTMLParser` and override its
     36   methods to implement the desired behavior.
     37
     38   The :class:`.HTMLParser` class is instantiated without arguments.
    3439
    3540   Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
     
    3944An exception is defined as well:
    4045
    41 
    4246.. exception:: HTMLParseError
    4347
    44    Exception raised by the :class:`HTMLParser` class when it encounters an error
    45    while parsing.  This exception provides three attributes: :attr:`msg` is a brief
    46    message explaining the error, :attr:`lineno` is the number of the line on which
    47    the broken construct was detected, and :attr:`offset` is the number of
     48   :class:`.HTMLParser` is able to handle broken markup, but in some cases it
     49   might raise this exception when it encounters an error while parsing.
     50   This exception provides three attributes: :attr:`msg` is a brief
     51   message explaining the error, :attr:`lineno` is the number of the line on
     52   which the broken construct was detected, and :attr:`offset` is the number of
    4853   characters into the line at which the construct starts.
    4954
    50 :class:`HTMLParser` instances have the following methods:
    51 
    52 
    53 .. method:: HTMLParser.reset()
    54 
    55    Reset the instance.  Loses all unprocessed data.  This is called implicitly at
    56    instantiation time.
     55
     56Example HTML Parser Application
     57-------------------------------
     58
     59As a basic example, below is a simple HTML parser that uses the
     60:class:`.HTMLParser` class to print out start tags, end tags and data
     61as they are encountered::
     62
     63   from HTMLParser import HTMLParser
     64
     65   # create a subclass and override the handler methods
     66   class MyHTMLParser(HTMLParser):
     67       def handle_starttag(self, tag, attrs):
     68           print "Encountered a start tag:", tag
     69       def handle_endtag(self, tag):
     70           print "Encountered an end tag :", tag
     71       def handle_data(self, data):
     72           print "Encountered some data  :", data
     73
     74   # instantiate the parser and feed it some HTML
     75   parser = MyHTMLParser()
     76   parser.feed('<html><head><title>Test</title></head>'
     77               '<body><h1>Parse me!</h1></body></html>')
     78
     79The output will then be::
     80
     81   Encountered a start tag: html
     82   Encountered a start tag: head
     83   Encountered a start tag: title
     84   Encountered some data  : Test
     85   Encountered an end tag : title
     86   Encountered an end tag : head
     87   Encountered a start tag: body
     88   Encountered a start tag: h1
     89   Encountered some data  : Parse me!
     90   Encountered an end tag : h1
     91   Encountered an end tag : body
     92   Encountered an end tag : html
     93
     94
     95:class:`.HTMLParser` Methods
     96----------------------------
     97
     98:class:`.HTMLParser` instances have the following methods:
    5799
    58100
     
    61103   Feed some text to the parser.  It is processed insofar as it consists of
    62104   complete elements; incomplete data is buffered until more data is fed or
    63    :meth:`close` is called.
     105   :meth:`close` is called.  *data* can be either :class:`unicode` or
     106   :class:`str`, but passing :class:`unicode` is advised.
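A minimal sketch of the buffering behaviour described for ``feed()``, assuming the import falls back to ``html.parser`` on Python 3 (the rename mentioned at the top of this file). ``TextCollector`` is a hypothetical name used only for illustration. Because text may arrive in several ``handle_data`` calls, the sketch joins the pieces afterwards:

```python
try:
    from html.parser import HTMLParser   # Python 3 name of the module
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 name used in this document


class TextCollector(HTMLParser):
    """Hypothetical helper: records tag events and collects text chunks."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        # May be called several times for one run of text.
        self.chunks.append(data)


collector = TextCollector()
# The tags are split across chunks; incomplete markup is buffered
# until the rest of it has been fed.
for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
    collector.feed(chunk)
collector.close()

text = ''.join(collector.chunks)
```

Joining the chunks at the end avoids depending on how many times ``handle_data`` happens to be called.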
    64107
    65108
     
    69112   mark.  This method may be redefined by a derived class to define additional
    70113   processing at the end of the input, but the redefined version should always call
    71    the :class:`HTMLParser` base class method :meth:`close`.
     114   the :class:`.HTMLParser` base class method :meth:`close`.
     115
     116
     117.. method:: HTMLParser.reset()
     118
     119   Reset the instance.  Loses all unprocessed data.  This is called implicitly at
     120   instantiation time.
    72121
    73122
     
    85134
    86135
     136The following methods are called when data or markup elements are encountered
     137and they are meant to be overridden in a subclass.  The base class
     138implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
     139
     140
    87141.. method:: HTMLParser.handle_starttag(tag, attrs)
    88142
    89    This method is called to handle the start of a tag.  It is intended to be
    90    overridden by a derived class; the base class implementation does nothing.
     143   This method is called to handle the start of a tag (e.g. ``<div id="main">``).
    91144
    92145   The *tag* argument is the name of the tag converted to lower case. The *attrs*
     
    94147   inside the tag's ``<>`` brackets.  The *name* will be translated to lower case,
    95148   and quotes in the *value* have been removed, and character and entity references
    96    have been replaced.  For instance, for the tag ``<A
    97    HREF="http://www.cwi.nl/">``, this method would be called as
    98    ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
     149   have been replaced.
     150
     151   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
     152   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
    99153
    100154   .. versionchanged:: 2.6
    101       All entity references from :mod:`htmlentitydefs` are now replaced in the attribute
    102       values.
     155      All entity references from :mod:`htmlentitydefs` are now replaced in the
     156      attribute values.
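As an illustration of the *attrs* list passed to ``handle_starttag``, the following sketch collects ``href`` values from ``<a>`` tags. ``LinkCollector`` is a hypothetical name, and the import fallback assumes the Python 3 module name ``html.parser``:

```python
try:
    from html.parser import HTMLParser   # Python 3 name of the module
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 name used in this document


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href attribute of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive already converted to lower case.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A>')
collector.close()
```

Comparing against lower-case names is enough here because the parser normalises the case of both the tag and its attribute names.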
     157
     158
     159.. method:: HTMLParser.handle_endtag(tag)
     160
     161   This method is called to handle the end tag of an element (e.g. ``</div>``).
     162
     163   The *tag* argument is the name of the tag converted to lower case.
    103164
    104165
     
    106167
    107168   Similar to :meth:`handle_starttag`, but called when the parser encounters an
    108    XHTML-style empty tag (``<a .../>``).  This method may be overridden by
     169   XHTML-style empty tag (``<img ... />``).  This method may be overridden by
    109170   subclasses which require this particular lexical information; the default
    110    implementation simple calls :meth:`handle_starttag` and :meth:`handle_endtag`.
    111 
    112 
    113 .. method:: HTMLParser.handle_endtag(tag)
    114 
    115    This method is called to handle the end tag of an element.  It is intended to be
    116    overridden by a derived class; the base class implementation does nothing.  The
    117    *tag* argument is the name of the tag converted to lower case.
     171   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
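The difference between the default ``handle_startendtag`` and an overriding one can be sketched as follows. Both class names are hypothetical, and the import fallback assumes the Python 3 module name ``html.parser``:

```python
try:
    from html.parser import HTMLParser   # Python 3 name of the module
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 name used in this document


class EventRecorder(HTMLParser):
    """Hypothetical helper relying on the default handle_startendtag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


class EmptyTagRecorder(EventRecorder):
    """Hypothetical subclass that keeps the XHTML-style empty tag distinct."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(('startend', tag))


default = EventRecorder()
default.feed('<img src="python-logo.png" />')
default.close()

custom = EmptyTagRecorder()
custom.feed('<img src="python-logo.png" />')
custom.close()
```

With the default implementation the empty tag shows up as a start event followed by an end event. The override receives it as a single event instead.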
    118172
    119173
    120174.. method:: HTMLParser.handle_data(data)
    121175
    122    This method is called to process arbitrary data.  It is intended to be
    123    overridden by a derived class; the base class implementation does nothing.
     176   This method is called to process arbitrary data (e.g. text nodes and the
     177   content of ``<script>...</script>`` and ``<style>...</style>``).
     178
     179
     180.. method:: HTMLParser.handle_entityref(name)
     181
     182   This method is called to process a named character reference of the form
     183   ``&name;`` (e.g. ``&gt;``), where *name* is the name of the entity
     184   (e.g. ``'gt'``).
    124185
    125186
    126187.. method:: HTMLParser.handle_charref(name)
    127188
    128    This method is called to process a character reference of the form ``&#ref;``.
    129    It is intended to be overridden by a derived class; the base class
    130    implementation does nothing.
    131 
    132 
    133 .. method:: HTMLParser.handle_entityref(name)
    134 
    135    This method is called to process a general entity reference of the form
    136    ``&name;`` where *name* is an general entity reference.  It is intended to be
    137    overridden by a derived class; the base class implementation does nothing.
     189   This method is called to process decimal and hexadecimal numeric character
     190   references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
     191   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
     192   in this case the method will receive ``'62'`` or ``'x3E'``.
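The *name* received by ``handle_charref`` can be turned back into a character along these lines. ``charref_to_text`` is a hypothetical helper, not part of the module, and the fallback from ``unichr`` to ``chr`` is an assumption made so the sketch also runs on Python 3:

```python
try:
    unichr                # Python 2 builtin
except NameError:
    unichr = chr          # Python 3: chr() already returns a text character


def charref_to_text(name):
    """Convert the argument of handle_charref (e.g. '62' or 'x3E') to text."""
    if name[:1] in ('x', 'X'):
        # Hexadecimal form: drop the leading 'x' and parse base 16.
        return unichr(int(name[1:], 16))
    # Decimal form.
    return unichr(int(name))


decimal_result = charref_to_text('62')
hex_result = charref_to_text('x3E')
```

Both calls yield ``'>'``, matching the two numeric spellings given above.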
    138193
    139194
    140195.. method:: HTMLParser.handle_comment(data)
    141196
    142    This method is called when a comment is encountered.  The *comment* argument is
    143    a string containing the text between the ``--`` and ``--`` delimiters, but not
    144    the delimiters themselves.  For example, the comment ``<!--text-->`` will cause
    145    this method to be called with the argument ``'text'``.  It is intended to be
    146    overridden by a derived class; the base class implementation does nothing.
     197   This method is called when a comment is encountered (e.g. ``<!--comment-->``).
     198
     199   For example, the comment ``<!-- comment -->`` will cause this method to be
     200   called with the argument ``' comment '``.
     201
     202   The content of Internet Explorer conditional comments (condcoms) will also be
     203   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
     204   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.
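A short sketch of what ``handle_comment`` receives for a plain comment and for a conditional comment. ``CommentCollector`` is a hypothetical name, and the import fallback assumes the Python 3 module name ``html.parser``:

```python
try:
    from html.parser import HTMLParser   # Python 3 name of the module
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 name used in this document


class CommentCollector(HTMLParser):
    """Hypothetical helper: stores the text of every comment."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []

    def handle_comment(self, data):
        # data excludes the '<!--' and '-->' delimiters but keeps
        # any whitespace that sits between them.
        self.comments.append(data)


collector = CommentCollector()
collector.feed('<!-- comment -->'
               '<!--[if IE 9]>IE9-specific content<![endif]-->')
collector.close()
```

Note that the surrounding spaces of the first comment are preserved, so callers that want the bare text may need to strip it themselves.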
    147205
    148206
    149207.. method:: HTMLParser.handle_decl(decl)
    150208
    151    Method called when an SGML declaration is read by the parser.  The *decl*
    152    parameter will be the entire contents of the declaration inside the ``<!``...\
    153    ``>`` markup.  It is intended to be overridden by a derived class; the base
    154    class implementation does nothing.
     209   This method is called to handle an HTML doctype declaration (e.g.
     210   ``<!DOCTYPE html>``).
     211
     212   The *decl* parameter will be the entire contents of the declaration inside
     213   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).
    155214
    156215
    157216.. method:: HTMLParser.handle_pi(data)
    158217
    159    Method called when a processing instruction is encountered.  The *data*
    160    parameter will contain the entire processing instruction. For example, for the
     218   This method is called when a processing instruction is encountered.  The *data*
     219   parameter will contain the entire processing instruction.  For example, for the
    161220   processing instruction ``<?proc color='red'>``, this method would be called as
    162    ``handle_pi("proc color='red'")``.  It is intended to be overridden by a derived
    163    class; the base class implementation does nothing.
     221   ``handle_pi("proc color='red'")``.
    164222
    165223   .. note::
    166224
    167       The :class:`HTMLParser` class uses the SGML syntactic rules for processing
     225      The :class:`.HTMLParser` class uses the SGML syntactic rules for processing
    168226      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
    169227      cause the ``'?'`` to be included in *data*.
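The effect described in the note can be sketched as follows. ``PICollector`` and the ``xml-stylesheet`` instruction are illustrative choices, and the import fallback assumes the Python 3 module name ``html.parser``:

```python
try:
    from html.parser import HTMLParser   # Python 3 name of the module
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 name used in this document


class PICollector(HTMLParser):
    """Hypothetical helper: stores every processing instruction."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


collector = PICollector()
# SGML-style instruction, terminated by '>' alone.
collector.feed("<?proc color='red'>")
# XHTML-style instruction, terminated by '?>'; the '?' stays in data.
collector.feed('<?xml-stylesheet href="style.css"?>')
collector.close()
```

A subclass that handles XHTML input may therefore want to strip a trailing ``'?'`` from *data* before using it.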
    170228
    171229
    172 .. _htmlparser-example:
    173 
    174 Example HTML Parser Application
    175 -------------------------------
    176 
    177 As a basic example, below is a very basic HTML parser that uses the
    178 :class:`HTMLParser` class to print out tags as they are encountered::
     230.. method:: HTMLParser.unknown_decl(data)
     231
     232   This method is called when an unrecognized declaration is read by the parser.
     233
     234   The *data* parameter will be the entire contents of the declaration inside
     235   the ``<![...]>`` markup.  Overriding this method in a derived class is
     236   sometimes useful.
     237
     238
     239.. _htmlparser-examples:
     240
     241Examples
     242--------
     243
     244The following class implements a parser that will be used to illustrate more
     245examples::
    179246
    180247   from HTMLParser import HTMLParser
     248   from htmlentitydefs import name2codepoint
    181249
    182250   class MyHTMLParser(HTMLParser):
    183 
    184251       def handle_starttag(self, tag, attrs):
    185            print "Encountered the beginning of a %s tag" % tag
    186 
     252           print "Start tag:", tag
     253           for attr in attrs:
     254               print "     attr:", attr
    187255       def handle_endtag(self, tag):
    188            print "Encountered the end of a %s tag" % tag
    189 
     256           print "End tag  :", tag
     257       def handle_data(self, data):
     258           print "Data     :", data
     259       def handle_comment(self, data):
     260           print "Comment  :", data
     261       def handle_entityref(self, name):
     262           c = unichr(name2codepoint[name])
     263           print "Named ent:", c
     264       def handle_charref(self, name):
     265           if name.startswith('x'):
     266               c = unichr(int(name[1:], 16))
     267           else:
     268               c = unichr(int(name))
     269           print "Num ent  :", c
     270       def handle_decl(self, data):
     271           print "Decl     :", data
     272
     273   parser = MyHTMLParser()
     274
     275Parsing a doctype::
     276
     277   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
     278   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
     279   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
     280
     281Parsing an element with a few attributes and a title::
     282
     283   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
     284   Start tag: img
     285        attr: ('src', 'python-logo.png')
     286        attr: ('alt', 'The Python logo')
     287   >>>
     288   >>> parser.feed('<h1>Python</h1>')
     289   Start tag: h1
     290   Data     : Python
     291   End tag  : h1
     292
     293The content of ``script`` and ``style`` elements is returned as is, without
     294further parsing::
     295
     296   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
     297   Start tag: style
     298        attr: ('type', 'text/css')
     299   Data     : #python { color: green }
     300   End tag  : style
     301   >>>
     302   >>> parser.feed('<script type="text/javascript">'
     303   ...             'alert("<strong>hello!</strong>");</script>')
     304   Start tag: script
     305        attr: ('type', 'text/javascript')
     306   Data     : alert("<strong>hello!</strong>");
     307   End tag  : script
     308
     309Parsing comments::
     310
     311   >>> parser.feed('<!-- a comment -->'
     312   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
     313   Comment  :  a comment
     314   Comment  : [if IE 9]>IE-specific content<![endif]
     315
     316Parsing named and numeric character references and converting them to the
     317correct char (note: these 3 references are all equivalent to ``'>'``)::
     318
     319   >>> parser.feed('&gt;&#62;&#x3E;')
     320   Named ent: >
     321   Num ent  : >
     322   Num ent  : >
     323
     324Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
     325:meth:`~HTMLParser.handle_data` might be called more than once::
     326
     327   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
     328   ...     parser.feed(chunk)
     329   ...
     330   Start tag: span
     331   Data     : buff
     332   Data     : ered
     333   Data     : text
     334   End tag  : span
     335
     336Parsing invalid HTML (e.g. unquoted attributes) also works::
     337
     338   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
     339   Start tag: p
     340   Start tag: a
     341        attr: ('class', 'link')
     342        attr: ('href', '#main')
     343   Data     : tag soup
     344   End tag  : p
     345   End tag  : a