:mod:`htmllib` --- A parser for HTML documents
==============================================

.. module:: htmllib
   :synopsis: A parser for HTML documents.
   :deprecated:

.. deprecated:: 2.6
   The :mod:`htmllib` module has been removed in Python 3.


.. index::
   single: HTML
   single: hypertext

.. index::
   module: sgmllib
   module: formatter
   single: SGMLParser (in module sgmllib)

This module defines a class which can serve as a base for parsing text files
formatted in the HyperText Mark-up Language (HTML). The class is not directly
concerned with I/O --- it must be provided with input in string form via a
method, and makes calls to methods of a "formatter" object in order to produce
output. The :class:`HTMLParser` class is designed to be used as a base class
for other classes in order to add functionality, and allows most of its methods
to be extended or overridden. In turn, this class is derived from and extends
the :class:`SGMLParser` class defined in module :mod:`sgmllib`. The
:class:`HTMLParser` implementation supports the HTML 2.0 language as described
in :rfc:`1866`. Two implementations of formatter objects are provided in the
:mod:`formatter` module; refer to the documentation for that module for
information on the formatter interface.

The following is a summary of the interface defined by
:class:`sgmllib.SGMLParser`:

* The interface to feed data to an instance is through the :meth:`feed` method,
  which takes a string argument. This can be called with as little or as much
  text at a time as desired; ``p.feed(a); p.feed(b)`` has the same effect as
  ``p.feed(a+b)``. When the data contains complete HTML markup constructs, these
  are processed immediately; incomplete constructs are saved in a buffer. To
  force processing of all unprocessed data, call the :meth:`close` method.

  For example, to parse the entire contents of a file, use::

     parser.feed(open('myfile.html').read())
     parser.close()

* The interface to define semantics for HTML tags is very simple: derive a class
  and define methods called :meth:`start_tag`, :meth:`end_tag`, or :meth:`do_tag`,
  where ``tag`` stands for the lowercase name of the element (for example,
  ``start_h1``). The parser will call these at appropriate moments:
  :meth:`start_tag` or :meth:`do_tag` is called when an opening tag of the form
  ``<tag ...>`` is encountered; :meth:`end_tag` is called when a closing tag of
  the form ``</tag>`` is encountered. If an opening tag requires a corresponding
  closing tag, like ``<H1>`` ... ``</H1>``, the class should define the
  :meth:`start_tag` method; if a tag requires no closing tag, like ``<P>``, the
  class should define the :meth:`do_tag` method.

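The two points above can be sketched together. This is a minimal illustration
rather than part of the module: ``trace_tags`` and ``TagTracer`` are
hypothetical names, and the code assumes Python 2, where :mod:`sgmllib` can be
imported (the import fails on Python 3). It feeds the input in several pieces
and records which tag methods the parser calls.

```python
def trace_tags(chunks):
    """Feed *chunks* to a parser and return the tag-method calls, in order."""
    import sgmllib  # Python 2 only; raises ImportError on Python 3

    events = []

    class TagTracer(sgmllib.SGMLParser):  # hypothetical example class
        # <h1> needs a closing tag, so it gets a start_/end_ pair.
        def start_h1(self, attrs):
            events.append(('start', 'h1'))

        def end_h1(self):
            events.append(('end', 'h1'))

        # <p> is treated here as needing no closing tag, so it gets do_p.
        def do_p(self, attrs):
            events.append(('do', 'p'))

    parser = TagTracer()
    for chunk in chunks:
        parser.feed(chunk)   # incomplete constructs stay buffered
    parser.close()           # force processing of any remaining data
    return events
```

Under these assumptions, ``trace_tags(['<H1>Title</', 'H1><P>Text'])`` should
return the same list as ``trace_tags(['<H1>Title</H1><P>Text'])``: a ``start``
and an ``end`` event for ``h1``, followed by a ``do`` event for ``p``.
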
The module defines a parser class and an exception:


.. class:: HTMLParser(formatter)

   This is the basic HTML parser class. It supports all entity names required by
   the XHTML 1.0 Recommendation (http://www.w3.org/TR/xhtml1). It also defines
   handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.


.. exception:: HTMLParseError

   Exception raised by the :class:`HTMLParser` class when it encounters an error
   while parsing.

   .. versionadded:: 2.4


.. seealso::

   Module :mod:`formatter`
      Interface definition for transforming an abstract flow of formatting events
      into specific output events on writer objects.

   Module :mod:`HTMLParser`
      Alternate HTML parser that offers a slightly lower-level view of the input, but
      is designed to work with XHTML, and does not implement some of the SGML syntax
      not used in "HTML as deployed" and which isn't legal for XHTML.

   Module :mod:`htmlentitydefs`
      Definition of replacement text for XHTML 1.0 entities.

   Module :mod:`sgmllib`
      Base class for :class:`HTMLParser`.


.. _html-parser-objects:

HTMLParser Objects
------------------

In addition to tag methods, the :class:`HTMLParser` class provides some
additional methods and instance variables for use within tag methods.


.. attribute:: HTMLParser.formatter

   This is the formatter instance associated with the parser.


.. attribute:: HTMLParser.nofill

   Boolean flag which should be true when whitespace should not be collapsed, or
   false when it should be. In general, this should only be true when character
   data is to be treated as "preformatted" text, as within a ``<PRE>`` element.
   The default value is false. This affects the operation of :meth:`handle_data`
   and :meth:`save_end`.


.. method:: HTMLParser.anchor_bgn(href, name, type)

   This method is called at the start of an anchor region. The arguments
   correspond to the attributes of the ``<A>`` tag with the same names. The
   default implementation maintains a list of hyperlinks (defined by the ``HREF``
   attribute for ``<A>`` tags) within the document. The list of hyperlinks is
   available as the data attribute :attr:`anchorlist`.


.. method:: HTMLParser.anchor_end()

   This method is called at the end of an anchor region. The default
   implementation adds a textual footnote marker using an index into the list of
   hyperlinks created by :meth:`anchor_bgn`.


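As a small sketch of the default anchor handling, the hypothetical helper
``list_links`` below parses a string and returns :attr:`anchorlist`. It assumes
Python 2 (neither :mod:`htmllib` nor :mod:`formatter` is available on current
Python 3 releases) and uses :class:`formatter.NullFormatter` so that all
formatted output is discarded.

```python
def list_links(html):
    """Return the HREF values that the default anchor_bgn() collected."""
    import formatter  # Python 2 (removed from Python 3 in 3.10)
    import htmllib    # Python 2 only; raises ImportError on Python 3

    # NullFormatter discards all output; only the parser state matters here.
    parser = htmllib.HTMLParser(formatter.NullFormatter())
    parser.feed(html)
    parser.close()
    return parser.anchorlist
```

Anchors that carry only a ``NAME`` attribute have an empty *href*, so the
default implementation does not add them to the list.
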
.. method:: HTMLParser.handle_image(source, alt[, ismap[, align[, width[, height]]]])

   This method is called to handle images. The default implementation simply
   passes the *alt* value to the :meth:`handle_data` method.


.. method:: HTMLParser.save_bgn()

   Begins saving character data in a buffer instead of sending it to the formatter
   object. Retrieve the stored data via :meth:`save_end`. Use of the
   :meth:`save_bgn` / :meth:`save_end` pair may not be nested.


.. method:: HTMLParser.save_end()

   Ends buffering character data and returns all data saved since the preceding
   call to :meth:`save_bgn`. If the :attr:`nofill` flag is false, whitespace is
   collapsed to single spaces. A call to this method without a preceding call to
   :meth:`save_bgn` will raise a :exc:`TypeError` exception.


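The :meth:`save_bgn` / :meth:`save_end` pair is typically used from within tag
methods. The following sketch, again assuming Python 2, overrides the ``<H1>``
handlers to capture heading text; ``collect_headings`` and
``HeadingCollector`` are hypothetical names introduced only for this example.

```python
def collect_headings(html):
    """Return the text of every <H1> element, with whitespace collapsed."""
    import formatter  # Python 2
    import htmllib    # Python 2 only; raises ImportError on Python 3

    class HeadingCollector(htmllib.HTMLParser):  # hypothetical example class
        def __init__(self):
            htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
            self.headings = []

        def start_h1(self, attrs):
            self.save_bgn()   # divert character data into the buffer

        def end_h1(self):
            # nofill is false by default, so runs of whitespace
            # come back as single spaces.
            self.headings.append(self.save_end())

    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings
```

Because the pair may not be nested, this sketch only makes sense for elements
that do not contain other elements whose handlers also call :meth:`save_bgn`.
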
:mod:`htmlentitydefs` --- Definitions of HTML general entities
==============================================================

.. module:: htmlentitydefs
   :synopsis: Definitions of HTML general entities.
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

.. note::

   The :mod:`htmlentitydefs` module has been renamed to :mod:`html.entities` in
   Python 3. The :term:`2to3` tool will automatically adapt imports when
   converting your sources to Python 3.

**Source code:** :source:`Lib/htmlentitydefs.py`

--------------

This module defines three dictionaries, ``name2codepoint``, ``codepoint2name``,
and ``entitydefs``. ``entitydefs`` is used by the :mod:`htmllib` module to
provide the :attr:`entitydefs` attribute of the :class:`HTMLParser` class. The
definition provided here contains all the entities defined by XHTML 1.0 that
can be handled using simple textual substitution in the Latin-1 character set
(ISO-8859-1).


.. data:: entitydefs

   A dictionary mapping XHTML 1.0 entity definitions to their replacement text in
   ISO Latin-1.


.. data:: name2codepoint

   A dictionary that maps HTML entity names to Unicode codepoints.

   .. versionadded:: 2.3


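As an illustration of ``name2codepoint``, the sketch below replaces named
entity references in a string. The function name ``decode_named_entities`` and
the regular expression are assumptions made for this example, and the import
fallback relies on the Python 3 rename to :mod:`html.entities` described in the
note above.

```python
import re

try:
    from htmlentitydefs import name2codepoint   # Python 2
except ImportError:
    from html.entities import name2codepoint    # Python 3 name of the module

try:
    unichr                                       # Python 2 builtin
except NameError:
    unichr = chr                                 # Python 3: chr() covers Unicode


def decode_named_entities(text):
    """Replace ``&name;`` references whose name is in name2codepoint."""
    def replace(match):
        name = match.group(1)
        if name in name2codepoint:
            return unichr(name2codepoint[name])
        return match.group(0)   # unknown name: leave the reference as is

    return re.sub(r'&([A-Za-z][A-Za-z0-9]*);', replace, text)
```

For instance, ``decode_named_entities(u'caf&eacute; &amp; &bogus;')`` turns
``&eacute;`` and ``&amp;`` into the corresponding characters and leaves the
unknown ``&bogus;`` untouched.
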
.. data:: codepoint2name

   A dictionary that maps Unicode codepoints to HTML entity names.

   .. versionadded:: 2.3

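The reverse mapping can be used in a similar way. The following sketch replaces
every character that has a named entity with its ``&name;`` form;
``encode_named_entities`` is a hypothetical helper, not something this module
provides.

```python
try:
    from htmlentitydefs import codepoint2name   # Python 2
except ImportError:
    from html.entities import codepoint2name    # Python 3 name of the module


def encode_named_entities(text):
    """Replace each character that has a named entity with ``&name;``."""
    pieces = []
    for char in text:
        name = codepoint2name.get(ord(char))
        pieces.append(char if name is None else '&%s;' % name)
    return ''.join(pieces)
```

Note that this also rewrites markup characters such as ``<`` and ``&``, since
their codepoints have entity names too; the sketch is therefore suited to
plain text rather than to strings that already contain HTML.
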