htmlparser.rst

Last change on this file was 391, checked in by dmik, 11 years ago
python: Merge vendor 2.7.6 to trunk.
Property svn:eol-style set to `native`
File size: 11.3 KB

:mod:`HTMLParser` --- Simple HTML and XHTML parser
==================================================

.. module:: HTMLParser
   :synopsis: A simple parser that can handle HTML and XHTML.

.. note::

   The :mod:`HTMLParser` module has been renamed to :mod:`html.parser` in
   Python 3.  The :term:`2to3` tool will automatically adapt imports when
   converting your sources to Python 3.


.. versionadded:: 2.2

.. index::
   single: HTML
   single: XHTML

Source code: :source:`Lib/HTMLParser.py`

--------------

This module defines a class :class:`.HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
in :mod:`sgmllib`.


.. class:: HTMLParser()

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered.  The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   The :class:`.HTMLParser` class is instantiated without arguments.

   Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
   match start tags or call the end-tag handler for elements which are closed
   implicitly by closing an outer element.

An exception is defined as well:

.. exception:: HTMLParseError

   :class:`.HTMLParser` is able to handle broken markup, but in some cases it
   might raise this exception when it encounters an error while parsing.
   This exception provides three attributes: :attr:`msg` is a brief
   message explaining the error, :attr:`lineno` is the number of the line on
   which the broken construct was detected, and :attr:`offset` is the number of
   characters into the line at which the construct starts.


Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`.HTMLParser` class to print out start tags, end tags and data
as they are encountered::

   from HTMLParser import HTMLParser

   # create a subclass and override the handler methods
   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print "Encountered a start tag:", tag
       def handle_endtag(self, tag):
           print "Encountered an end tag :", tag
       def handle_data(self, data):
           print "Encountered some data  :", data

   # instantiate the parser and feed it some HTML
   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be::

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data  : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data  : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`.HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.  *data* can be either :class:`unicode` or
   :class:`str`, but passing :class:`unicode` is advised.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`.HTMLParser` base class method :meth:`close`.
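
As a rough sketch of redefining :meth:`close`, the hypothetical ``TextCollector`` subclass below joins the text it has seen once the input ends. The class name and its ``chunks`` and ``text`` attributes are illustrative assumptions, not part of the module. The fallback import is only there so the same code also runs under Python 3, where the module is named :mod:`html.parser`.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # Python 3 module name


class TextCollector(HTMLParser):
    """Hypothetical subclass: gathers text and finalizes it in close()."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []    # pieces of text seen so far
        self.text = None    # set only once close() has run

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Always call the base class method first so that any buffered
        # input is processed before the extra end-of-input work.
        HTMLParser.close(self)
        self.text = ''.join(self.chunks)


collector = TextCollector()
for chunk in ['<p>buff', 'ered ', 'text</p>']:
    collector.feed(chunk)
collector.close()
text = collector.text
```

Joining the pieces in :meth:`close` rather than in :meth:`handle_data` means the result does not depend on how the input happened to be split across :meth:`feed` calls.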


.. method:: HTMLParser.reset()

   Reset the instance.  Loses all unprocessed data.  This is called implicitly at
   instantiation time.
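
The effect of :meth:`reset` on buffered input can be sketched as follows. ``TagRecorder`` and its ``tags`` list are hypothetical names used only for illustration, and the fallback import assumes the code may also be run under Python 3.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # Python 3 module name


class TagRecorder(HTMLParser):
    """Hypothetical subclass that records the names of start tags."""

    def __init__(self):
        HTMLParser.__init__(self)   # reset() is called implicitly here
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


recorder = TagRecorder()
recorder.feed('<p cl')          # incomplete start tag: buffered, no handler call
recorder.reset()                # the buffered '<p cl' is discarded
recorder.feed('<b>bold</b>')
recorder.close()
tags = recorder.tags
```

Only the ``b`` tag is recorded, because the partial ``<p cl`` was still unprocessed when :meth:`reset` was called.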


.. method:: HTMLParser.getpos()

   Return the current line number and offset.
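
A minimal sketch of using :meth:`getpos` from inside a handler to note where each start tag begins. ``PositionRecorder`` and ``positions`` are hypothetical names, and the fallback import assumes the code may also be run under Python 3.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # Python 3 module name


class PositionRecorder(HTMLParser):
    """Hypothetical subclass: stores (tag, (lineno, offset)) per start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # While the handler runs, getpos() still points at the start of the tag.
        self.positions.append((tag, self.getpos()))


recorder = PositionRecorder()
recorder.feed('<html>\n  <body>\n<p>text</p>')
recorder.close()
positions = recorder.positions
```

Line numbers start at 1 and offsets at 0, so the ``body`` tag on the second line, indented by two spaces, is reported at ``(2, 2)``.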


.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).
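
To illustrate the difference between the normalized arguments a handler receives and the raw text returned by :meth:`get_starttag_text`, here is a small sketch. ``RawTagRecorder`` and ``seen`` are hypothetical names, and the fallback import assumes the code may also be run under Python 3.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # Python 3 module name


class RawTagRecorder(HTMLParser):
    """Hypothetical subclass: pairs each lower-cased tag with its raw text."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # tag is lower-cased; get_starttag_text() keeps case and spacing.
        self.seen.append((tag, self.get_starttag_text()))


recorder = RawTagRecorder()
recorder.feed('<A   HREF="x.html"   CLASS=big>link</A>')
recorder.close()
seen = recorder.seen
```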


The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass.  The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case.  The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  The *name* is translated to lower case,
   quotes in the *value* are removed, and character and entity references
   are replaced.

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   .. versionchanged:: 2.6
      All entity references from :mod:`htmlentitydefs` are now replaced in the
      attribute values.
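
A common use of the ``(name, value)`` pairs is collecting link targets. The sketch below assumes a hypothetical ``LinkCollector`` subclass with a ``links`` list; the fallback import only lets the same code run under Python 3.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # Python 3 module name


class LinkCollector(HTMLParser):
    """Hypothetical subclass: gathers the href value of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':                  # tag names arrive lower-cased
            return
        for name, value in attrs:       # attribute names are lower-cased too
            if name == 'href':
                self.links.append(value)


collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A> '
               '<a href="?a=1&amp;b=2">next</a> '
               '<a name="top">anchor</a>')
collector.close()
links = collector.links
```

The second link shows the entity replacement described above: the handler receives ``'?a=1&b=2'``, not the ``&amp;`` form that appears in the markup.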


.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``).  This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
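
The default behavior and an override can be compared side by side. In this sketch ``EventRecorder``, ``EmptyTagRecorder`` and the ``events`` list are hypothetical names, and the fallback import assumes the code may also be run under Python 3.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # Python 3 module name


class EventRecorder(HTMLParser):
    """Hypothetical subclass that keeps the default handle_startendtag()."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


class EmptyTagRecorder(EventRecorder):
    """Hypothetical subclass that treats empty tags as their own event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(('startend', tag))


default = EventRecorder()
default.feed('<p><br/></p>')
default.close()

custom = EmptyTagRecorder()
custom.feed('<p><br/></p>')
custom.close()

default_events = default.events
custom_events = custom.events
```

Without the override, ``<br/>`` shows up as a start event followed by an end event; with it, the same markup produces a single ``'startend'`` event.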


.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``).


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.


.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   This method is called when a processing instruction is encountered.  The *data*
   parameter will contain the entire processing instruction.  For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.

   .. note::

      The :class:`.HTMLParser` class uses the SGML syntactic rules for processing
      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.
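
The trailing ``'?'`` behavior can be seen in a short sketch. ``PIRecorder`` and ``instructions`` are hypothetical names, the ``xml-stylesheet`` instruction is just sample input, and the fallback import assumes the code may also be run under Python 3.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # Python 3 module name


class PIRecorder(HTMLParser):
    """Hypothetical subclass that stores every processing instruction."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


recorder = PIRecorder()
recorder.feed("<?proc color='red'>")                   # SGML style, no trailing '?'
recorder.feed('<?xml-stylesheet href="style.css"?>')   # XHTML style
recorder.close()
instructions = recorder.instructions
```

A handler that expects XHTML input may want to strip a final ``'?'`` from *data* itself before interpreting the instruction.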


.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup.  Overriding this method in a derived class is
   sometimes useful.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from HTMLParser import HTMLParser
   from htmlentitydefs import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print "Start tag:", tag
           for attr in attrs:
               print "     attr:", attr
       def handle_endtag(self, tag):
           print "End tag  :", tag
       def handle_data(self, data):
           print "Data     :", data
       def handle_comment(self, data):
           print "Comment  :", data
       def handle_entityref(self, name):
           c = unichr(name2codepoint[name])
           print "Named ent:", c
       def handle_charref(self, name):
           if name.startswith('x'):
               c = unichr(int(name[1:], 16))
           else:
               c = unichr(int(name))
           print "Num ent  :", c
       def handle_decl(self, data):
           print "Decl     :", data

   parser = MyHTMLParser()

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
        attr: ('src', 'python-logo.png')
        attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data     : Python
   End tag  : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
        attr: ('type', 'text/css')
   Data     : #python { color: green }
   End tag  : style
   >>>
   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
        attr: ('type', 'text/javascript')
   Data     : alert("<strong>hello!</strong>");
   End tag  : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment  :  a comment
   Comment  : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``)::

   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent  : >
   Num ent  : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data     : buff
   Data     : ered
   Data     : text
   End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
        attr: ('class', 'link')
        attr: ('href', '#main')
   Data     : tag soup
   End tag  : p
   End tag  : a
