:mod:`sgmllib` --- Simple SGML parser
=====================================

.. module:: sgmllib
   :synopsis: Only as much of an SGML parser as needed to parse HTML.
   :deprecated:

.. deprecated:: 2.6
   The :mod:`sgmllib` module has been removed in Python 3.

.. index:: single: SGML

This module defines a class :class:`SGMLParser` which serves as the basis for
parsing text files formatted in SGML (Standard Generalized Mark-up Language).
In fact, it does not provide a full SGML parser --- it only parses SGML insofar
as it is used by HTML, and the module only exists as a base for the
:mod:`htmllib` module. Another HTML parser which supports XHTML and offers a
somewhat different interface is available in the :mod:`HTMLParser` module.


.. class:: SGMLParser()

   The :class:`SGMLParser` class is instantiated without arguments. The parser is
   hardcoded to recognize the following constructs:

   * Opening and closing tags of the form ``<tag attr="value" ...>`` and
     ``</tag>``, respectively.

   * Numeric character references of the form ``&#name;``.

   * Entity references of the form ``&name;``.

   * SGML comments of the form ``<!--text-->``. Note that spaces, tabs, and
     newlines are allowed between the trailing ``>`` and the immediately preceding
     ``--``.

A single exception is defined as well:


.. exception:: SGMLParseError

   Exception raised by the :class:`SGMLParser` class when it encounters an error
   while parsing.

   .. versionadded:: 2.1

:class:`SGMLParser` instances have the following methods:


.. method:: SGMLParser.reset()

   Reset the instance. Loses all unprocessed data. This is called implicitly at
   instantiation time.


.. method:: SGMLParser.setnomoretags()

   Stop processing tags. Treat all following input as literal input (CDATA).
   (This is only provided so the HTML tag ``<PLAINTEXT>`` can be implemented.)


.. method:: SGMLParser.setliteral()

   Enter literal mode (CDATA mode).


.. method:: SGMLParser.feed(data)

   Feed some text to the parser. It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.


.. method:: SGMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark. This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always
   call the base class :meth:`close` method.


.. method:: SGMLParser.get_starttag_text()

   Return the text of the most recently opened start tag. This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).


.. method:: SGMLParser.handle_starttag(tag, method, attributes)

   This method is called to handle start tags for which either a :meth:`start_tag`
   or :meth:`do_tag` method has been defined. The *tag* argument is the name of
   the tag converted to lower case, and the *method* argument is the bound method
   which should be used to support semantic interpretation of the start tag. The
   *attributes* argument is a list of ``(name, value)`` pairs containing the
   attributes found inside the tag's ``<>`` brackets.

   The *name* has been translated to lower case. Double quotes and backslashes in
   the *value* have been interpreted, as well as known character references and
   known entity references terminated by a semicolon (normally, entity references
   can be terminated by any non-alphanumerical character, but this would break the
   very common case of ``<A HREF="url?spam=1&eggs=2">`` when ``eggs`` is a valid
   entity name).

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method would
   be called as ``handle_starttag('a', method, [('href', 'http://www.cwi.nl/')])``,
   where *method* is the bound :meth:`start_a` or :meth:`do_a` method. The
   base implementation simply calls *method* with *attributes* as the only
   argument.

   .. versionadded:: 2.5
      Handling of entity and character references within attribute values.

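The reason for requiring a terminating semicolon inside attribute values can be
illustrated with a small standalone sketch. It is not the module's own
attribute parser: the function name ``unescape_attribute``, the regular
expression, and the sample mapping (including the made-up ``eggs`` entity) are
assumptions chosen for illustration, and numeric character references are left
out for brevity.

```python
import re

# Hypothetical pattern: a named reference only counts if it ends with ';'.
_ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def unescape_attribute(value, entitydefs):
    """Replace known, semicolon-terminated entity references in value."""
    def replace(match):
        name = match.group(1)
        # Unknown names are left exactly as they appeared in the input.
        return entitydefs.get(name, match.group(0))
    return _ENTITY_REF.sub(replace, value)


# Sample mapping; "eggs" is a made-up entity used only for this demonstration.
entitydefs = {"amp": "&", "eggs": "[eggs]"}

# No semicolon after "eggs", so the query string survives untouched.
plain_url = unescape_attribute("url?spam=1&eggs=2", entitydefs)
# A properly escaped ampersand is translated in a single pass.
escaped_url = unescape_attribute("url?spam=1&amp;eggs=2", entitydefs)
# An explicit, semicolon-terminated reference is translated.
explicit_ref = unescape_attribute("spam &eggs; spam", entitydefs)
```

Under these assumptions, an unescaped query string such as
``url?spam=1&eggs=2`` passes through unchanged even though ``eggs`` is a known
entity name, which is the behaviour the paragraph above describes.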

.. method:: SGMLParser.handle_endtag(tag, method)

   This method is called to handle end tags for which an :meth:`end_tag` method has
   been defined. The *tag* argument is the name of the tag converted to lower
   case, and the *method* argument is the bound method which should be used to
   support semantic interpretation of the end tag. If no :meth:`end_tag` method is
   defined for the closing element, this handler is not called. The base
   implementation simply calls *method*.


.. method:: SGMLParser.handle_data(data)

   This method is called to process arbitrary data. It is intended to be
   overridden by a derived class; the base class implementation does nothing.


.. method:: SGMLParser.handle_charref(ref)

   This method is called to process a character reference of the form ``&#ref;``.
   The base implementation uses :meth:`convert_charref` to convert the reference to
   a string. If that method returns a string, it is passed to :meth:`handle_data`,
   otherwise ``unknown_charref(ref)`` is called to handle the error.

   .. versionchanged:: 2.5
      Use :meth:`convert_charref` instead of hard-coding the conversion.


.. method:: SGMLParser.convert_charref(ref)

   Convert a character reference to a string, or ``None``. *ref* is the reference
   passed in as a string. In the base implementation, *ref* must be a decimal
   number in the range 0-255. It converts the code point found using the
   :meth:`convert_codepoint` method. If *ref* is invalid or out of range, this
   method returns ``None``. This method is called by the default
   :meth:`handle_charref` implementation and by the attribute value parser.

   .. versionadded:: 2.5

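The conversion and fallback rules described for :meth:`handle_charref` and
:meth:`convert_charref` can be sketched as standalone functions. This is only
an approximation written from the description above, not the module's source:
the ``*_sketch`` names are hypothetical, the built-in ``chr`` stands in for
:meth:`convert_codepoint`, and the range check uses the documented 0-255 bounds.

```python
def convert_charref_sketch(ref, convert_codepoint=chr):
    """Return the converted character for a decimal reference, or None."""
    try:
        codepoint = int(ref)
    except ValueError:
        # Not a decimal number, for example the hexadecimal form "x41".
        return None
    if not 0 <= codepoint <= 255:
        return None
    return convert_codepoint(codepoint)


def handle_charref_sketch(ref, handle_data, unknown_charref):
    """Pass the converted text to handle_data, else report the reference."""
    replacement = convert_charref_sketch(ref)
    if replacement is None:
        unknown_charref(ref)
    else:
        handle_data(replacement)


# Collect the outcome of a few references in two plain lists.
data, unknown = [], []
for ref in ("65", "233", "256", "x41"):
    handle_charref_sketch(ref, data.append, unknown.append)
```

In this sketch ``"65"`` and ``"233"`` end up in ``data``, while the
out-of-range ``"256"`` and the non-decimal ``"x41"`` are handed to the
stand-in for :meth:`unknown_charref`.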

.. method:: SGMLParser.convert_codepoint(codepoint)

   Convert a codepoint to a :class:`str` value. Encodings can be handled here if
   appropriate, though the rest of :mod:`sgmllib` is oblivious to this matter.

   .. versionadded:: 2.5


.. method:: SGMLParser.handle_entityref(ref)

   This method is called to process a general entity reference of the form
   ``&ref;`` where *ref* is a general entity reference. It converts *ref* by
   passing it to :meth:`convert_entityref`. If a translation is returned, it calls
   the method :meth:`handle_data` with the translation; otherwise, it calls the
   method ``unknown_entityref(ref)``. The default :attr:`entitydefs` defines
   translations for ``&amp;``, ``&apos;``, ``&gt;``, ``&lt;``, and ``&quot;``.

   .. versionchanged:: 2.5
      Use :meth:`convert_entityref` instead of hard-coding the conversion.


.. method:: SGMLParser.convert_entityref(ref)

   Convert a named entity reference to a :class:`str` value, or ``None``. The
   resulting value will not be parsed. *ref* will be only the name of the entity.
   The default implementation looks for *ref* in the instance (or class) variable
   :attr:`entitydefs` which should be a mapping from entity names to corresponding
   translations. If no translation is available for *ref*, this method returns
   ``None``. This method is called by the default :meth:`handle_entityref`
   implementation and by the attribute value parser.

   .. versionadded:: 2.5

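The lookup described for :meth:`convert_entityref`, together with the fallback
in :meth:`handle_entityref`, amounts to a dictionary access. The following is a
minimal sketch under stated assumptions: the ``*_sketch`` function names and
the ``ENTITYDEFS`` constant are hypothetical, and the mapping simply mirrors the
default translations mentioned for :attr:`entitydefs`.

```python
# Assumed stand-in for the default entitydefs mapping (entity name -> text).
ENTITYDEFS = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}


def convert_entityref_sketch(ref, entitydefs=ENTITYDEFS):
    """Return the translation for the entity name ref, or None if unknown."""
    return entitydefs.get(ref)


def handle_entityref_sketch(ref, handle_data, unknown_entityref,
                            entitydefs=ENTITYDEFS):
    """Pass the translation to handle_data, else report the unknown name."""
    replacement = convert_entityref_sketch(ref, entitydefs)
    if replacement is None:
        unknown_entityref(ref)
    else:
        handle_data(replacement)


# A derived parser could supply a larger mapping, here with one extra entity.
extended = dict(ENTITYDEFS, nbsp="\xa0")

data, unknown = [], []
for ref in ("amp", "nbsp", "lt"):
    handle_entityref_sketch(ref, data.append, unknown.append)
```

With the default mapping, ``nbsp`` is reported as unknown; passing ``extended``
as the mapping would translate it instead.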

.. method:: SGMLParser.handle_comment(comment)

   This method is called when a comment is encountered. The *comment* argument is
   a string containing the text between the ``<!--`` and ``-->`` delimiters, but
   not the delimiters themselves. For example, the comment ``<!--text-->`` will
   cause this method to be called with the argument ``'text'``. The default method
   does nothing.


.. method:: SGMLParser.handle_decl(data)

   Method called when an SGML declaration is read by the parser. In practice, the
   ``DOCTYPE`` declaration is the only thing observed in HTML, but the parser does
   not discriminate among different (or broken) declarations. Internal subsets in
   a ``DOCTYPE`` declaration are not supported. The *data* parameter will be the
   entire contents of the declaration inside the ``<!``...\ ``>`` markup. The
   default implementation does nothing.


.. method:: SGMLParser.report_unbalanced(tag)

   This method is called when an end tag is found which does not correspond to any
   open element.


.. method:: SGMLParser.unknown_starttag(tag, attributes)

   This method is called to process an unknown start tag. It is intended to be
   overridden by a derived class; the base class implementation does nothing.


.. method:: SGMLParser.unknown_endtag(tag)

   This method is called to process an unknown end tag. It is intended to be
   overridden by a derived class; the base class implementation does nothing.


.. method:: SGMLParser.unknown_charref(ref)

   This method is called to process unresolvable numeric character references.
   Refer to :meth:`handle_charref` to determine what is handled by default. It is
   intended to be overridden by a derived class; the base class implementation does
   nothing.


.. method:: SGMLParser.unknown_entityref(ref)

   This method is called to process an unknown entity reference. It is intended to
   be overridden by a derived class; the base class implementation does nothing.

Apart from overriding or extending the methods listed above, derived classes may
also define methods of the following form to define processing of specific tags.
Tag names in the input stream are case independent; the *tag* occurring in
method names must be in lower case:


.. method:: SGMLParser.start_tag(attributes)
   :noindex:

   This method is called to process an opening tag *tag*. It takes precedence over
   :meth:`do_tag`. The *attributes* argument has the same meaning as described for
   :meth:`handle_starttag` above.


.. method:: SGMLParser.do_tag(attributes)
   :noindex:

   This method is called to process an opening tag *tag* for which no
   :meth:`start_tag` method is defined. The *attributes* argument has the same
   meaning as described for :meth:`handle_starttag` above.


.. method:: SGMLParser.end_tag()
   :noindex:

   This method is called to process a closing tag *tag*.

Note that the parser maintains a stack of open elements for which no end tag has
been found yet. Only tags processed by :meth:`start_tag` are pushed on this
stack. Definition of an :meth:`end_tag` method is optional for these tags. For
tags processed by :meth:`do_tag` or by :meth:`unknown_starttag`, no
:meth:`end_tag` method needs to be defined; if defined, it will not be used. If
both :meth:`start_tag` and :meth:`do_tag` methods exist for a tag, the
:meth:`start_tag` method takes precedence.

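The precedence and stack rules above can be sketched with a small standalone
class that looks handlers up by name. It is an illustration only and does not
reproduce the module's implementation: ``DispatchSketch``, ``LinkSketch``,
``dispatch_starttag`` and the ``log`` attribute are hypothetical names
introduced for this example.

```python
class DispatchSketch(object):
    """Toy dispatcher: start_<tag> first, then do_<tag>, else unknown."""

    def __init__(self):
        self.stack = []   # open elements that were handled by start_<tag>
        self.log = []     # record of which handler ran, for demonstration

    def dispatch_starttag(self, tag, attributes):
        tag = tag.lower()  # method names always use the lower-case tag
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            # Only tags with a start_<tag> handler are pushed on the stack.
            self.stack.append(tag)
            method(attributes)
            return
        method = getattr(self, "do_" + tag, None)
        if method is not None:
            method(attributes)
            return
        self.unknown_starttag(tag, attributes)

    def unknown_starttag(self, tag, attributes):
        self.log.append(("unknown", tag))


class LinkSketch(DispatchSketch):
    def start_a(self, attributes):
        self.log.append(("start_a", dict(attributes).get("href")))

    def do_a(self, attributes):
        # Never reached here: start_a takes precedence over do_a.
        self.log.append(("do_a", None))

    def do_br(self, attributes):
        self.log.append(("do_br", None))


parser = LinkSketch()
parser.dispatch_starttag("A", [("href", "http://www.cwi.nl/")])
parser.dispatch_starttag("BR", [])
parser.dispatch_starttag("p", [])
```

After these three calls, ``parser.log`` records ``start_a``, ``do_br`` and the
unknown ``p`` tag in that order, and only ``a`` is on ``parser.stack``.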