:mod:`xml.dom.minidom` --- Minimal DOM implementation
=====================================================

.. module:: xml.dom.minidom
   :synopsis: Minimal Document Object Model (DOM) implementation.
.. moduleauthor:: Paul Prescod <paul@prescod.net>
.. sectionauthor:: Paul Prescod <paul@prescod.net>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. versionadded:: 2.0

**Source code:** :source:`Lib/xml/dom/minidom.py`

--------------

:mod:`xml.dom.minidom` is a minimal implementation of the Document Object
Model interface, with an API similar to that in other languages. It is intended
to be simpler than the full DOM and also significantly smaller. Users who are
not already proficient with the DOM should consider using the
:mod:`xml.etree.ElementTree` module for their XML processing instead.


.. warning::

   The :mod:`xml.dom.minidom` module is not secure against
   maliciously constructed data. If you need to parse untrusted or
   unauthenticated data see :ref:`xml-vulnerabilities`.


DOM applications typically start by parsing some XML into a DOM. With
:mod:`xml.dom.minidom`, this is done through the parse functions::

   from xml.dom.minidom import parse, parseString

   dom1 = parse('c:\\temp\\mydata.xml')  # parse an XML file by name

   datasource = open('c:\\temp\\mydata.xml')
   dom2 = parse(datasource)  # parse an open file

   dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')

The :func:`parse` function can take either a filename or an open file object.


.. function:: parse(filename_or_file[, parser[, bufsize]])

   Return a :class:`Document` from the given input. *filename_or_file* may be
   either a file name or a file-like object. *parser*, if given, must be a SAX2
   parser object. This function will change the document handler of the parser and
   activate namespace support; other parser configuration (like setting an entity
   resolver) must have been done in advance.

If you have XML in a string, you can use the :func:`parseString` function
instead:


.. function:: parseString(string[, parser])

   Return a :class:`Document` that represents the *string*. This function creates a
   :class:`~StringIO.StringIO` object for the string and passes that on to :func:`parse`.

Both functions return a :class:`Document` object representing the content of the
document.

What the :func:`parse` and :func:`parseString` functions do is connect an XML
parser with a "DOM builder" that can accept parse events from any SAX parser and
convert them into a DOM tree. The names of the functions are perhaps misleading,
but are easy to grasp when learning the interfaces. The parsing of the document
will be completed before these functions return; it's simply that these
functions do not provide a parser implementation themselves.

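The link between a SAX parser and the DOM builder becomes visible when the
parser is supplied explicitly. The following is a minimal sketch, assuming the
default parser returned by ``xml.sax.make_parser()`` is suitable; the sample
XML and the variable names are invented for illustration.

```python
import xml.sax
from xml.dom.minidom import parseString

# Create a SAX2 parser explicitly. Any extra configuration (for example an
# entity resolver) would have to be applied here, before parsing starts.
sax_parser = xml.sax.make_parser()

# parseString() installs its own content handler on the parser, activates
# namespace support, and builds the DOM tree from the SAX events.
doc = parseString("<inventory><item>widget</item></inventory>", sax_parser)

root = doc.documentElement
items = root.getElementsByTagName("item")
```

By the time ``parseString()`` returns, the whole tree has been built, so
``root`` and ``items`` can be used immediately.
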
You can also create a :class:`Document` by calling a method on a "DOM
Implementation" object. You can get this object by calling the
:func:`getDOMImplementation` function in either the :mod:`xml.dom` package or the
:mod:`xml.dom.minidom` module. Using the implementation from the
:mod:`xml.dom.minidom` module will always return a :class:`Document` instance
from the minidom implementation, while the version from :mod:`xml.dom` may
provide an alternate implementation (this is likely if you have the `PyXML
package <http://pyxml.sourceforge.net/>`_ installed). Once you have a
:class:`Document`, you can add child nodes to it to populate the DOM::

   from xml.dom.minidom import getDOMImplementation

   impl = getDOMImplementation()

   newdoc = impl.createDocument(None, "some_tag", None)
   top_element = newdoc.documentElement
   text = newdoc.createTextNode('Some textual content.')
   top_element.appendChild(text)

Once you have a DOM document object, you can access the parts of your XML
document through its properties and methods. These properties are defined in
the DOM specification. The main property of the document object is the
:attr:`documentElement` property. It gives you the main element in the XML
document: the one that holds all others. Here is an example program::

   from xml.dom.minidom import parseString

   dom3 = parseString("<myxml>Some data</myxml>")
   assert dom3.documentElement.tagName == "myxml"

When you are finished with a DOM tree, you may optionally call the
:meth:`unlink` method to encourage early cleanup of the now-unneeded
objects. :meth:`unlink` is a :mod:`xml.dom.minidom`\ -specific
extension to the DOM API that renders the node and its descendants
essentially useless. Otherwise, Python's garbage collector will
eventually take care of the objects in the tree.

.. seealso::

   `Document Object Model (DOM) Level 1 Specification <http://www.w3.org/TR/REC-DOM-Level-1/>`_
      The W3C recommendation for the DOM supported by :mod:`xml.dom.minidom`.


.. _minidom-objects:

DOM Objects
-----------

The definition of the DOM API for Python is given as part of the :mod:`xml.dom`
module documentation. This section lists the differences between the API and
:mod:`xml.dom.minidom`.


.. method:: Node.unlink()

   Break internal references within the DOM so that it will be garbage collected on
   versions of Python without cyclic GC. Even when cyclic GC is available, using
   this can make large amounts of memory available sooner, so calling this on DOM
   objects as soon as they are no longer needed is good practice. This only needs
   to be called on the :class:`Document` object, but may be called on child nodes
   to discard children of that node.

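One way to follow this advice is to copy the needed values out of the tree and
then call ``unlink()`` in a ``finally`` clause. This is only a sketch; the
sample document and the variable names are made up for the example.

```python
from xml.dom.minidom import parseString

doc = parseString("<log><entry>started</entry><entry>stopped</entry></log>")
try:
    # Copy out plain Python data while the tree is still intact.
    entries = [node.firstChild.data
               for node in doc.getElementsByTagName("entry")]
finally:
    # Break the internal references so the nodes can be reclaimed early.
    doc.unlink()

# Inspected here for demonstration only: the unlinked document has lost
# its children and should not be used any further.
remaining_children = len(doc.childNodes)
```

The extracted ``entries`` list remains usable after the call because it holds
ordinary strings rather than DOM nodes.
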

.. method:: Node.writexml(writer, indent="", addindent="", newl="")

   Write XML to the writer object. The writer should have a :meth:`write` method
   which matches that of the file object interface. The *indent* parameter is the
   indentation of the current node. The *addindent* parameter is the incremental
   indentation to use for subnodes of the current one. The *newl* parameter
   specifies the string to use to terminate lines.

   For the :class:`Document` node, an additional keyword argument *encoding* can
   be used to specify the encoding field of the XML header.

   .. versionchanged:: 2.1
      The optional keyword parameters *indent*, *addindent*, and *newl* were added to
      support pretty output.

   .. versionchanged:: 2.3
      For the :class:`Document` node, an additional keyword argument
      *encoding* can be used to specify the encoding field of the XML header.

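Because ``writexml()`` only requires a ``write()`` method, any small object can
serve as the writer. The sketch below uses a hypothetical ``ChunkCollector``
class (not part of the library) to show how *addindent* and *newl* shape the
output.

```python
from xml.dom.minidom import parseString


class ChunkCollector(object):
    """Hypothetical writer: an object with a write() method is sufficient."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(data)


doc = parseString("<config><option/></config>")
writer = ChunkCollector()

# Start at column zero, indent each nesting level by two spaces, and end
# every line with a newline character.
doc.writexml(writer, indent="", addindent="  ", newl="\n")
output = "".join(writer.chunks)
```

An open file object can be passed in place of ``writer`` in exactly the same
way.
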

.. method:: Node.toxml([encoding])

   Return the XML that the DOM represents as a string.

   With no argument, the XML header does not specify an encoding, and the result is
   a Unicode string if the default encoding cannot represent all characters in the
   document. Encoding this string in an encoding other than UTF-8 is likely
   incorrect, since UTF-8 is the default encoding of XML.

   With an explicit *encoding* [1]_ argument, the result is a byte string in the
   specified encoding. It is recommended that this argument always be specified. To
   avoid :exc:`UnicodeError` exceptions in case of unrepresentable text data, the
   encoding argument should be specified as "utf-8".

   .. versionchanged:: 2.3
      the *encoding* argument was introduced; see :meth:`writexml`.

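The recommendation to pass an explicit encoding can be illustrated with a
document that contains a non-ASCII character. This is a small sketch with an
invented sample document, not a complete serialization recipe.

```python
from xml.dom.minidom import parseString

# Parse UTF-8 bytes that contain a non-ASCII character (e with acute accent).
doc = parseString(u"<msg>caf\u00e9</msg>".encode("utf-8"))

# With an explicit encoding the result is a byte string, and the XML
# declaration names that encoding.
encoded = doc.toxml("utf-8")
```

The resulting bytes can be written to a file opened in binary mode without any
further encoding step.
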

.. method:: Node.toprettyxml([indent="\t"[, newl="\n"[, encoding=None]]])

   Return a pretty-printed version of the document. *indent* specifies the
   indentation string and defaults to a tab character; *newl* specifies the string
   emitted at the end of each line and defaults to ``\n``.

   .. versionadded:: 2.1

   .. versionchanged:: 2.3
      the encoding argument was introduced; see :meth:`writexml`.

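As a brief illustration of the *indent* and *newl* parameters, the following
sketch pretty-prints an invented sample document with two-space indentation.
The exact whitespace placed around text nodes has varied between Python
versions, so the comments describe only the element lines.

```python
from xml.dom.minidom import parseString

doc = parseString("<list><item>one</item><item>two</item></list>")

# Use two spaces instead of the default tab for each nesting level.
pretty = doc.toprettyxml(indent="  ", newl="\n")
lines = pretty.splitlines()

# lines[0] is the XML declaration, lines[1] is the opening <list> tag,
# and each <item> start tag is indented by two spaces.
```
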
The following standard DOM methods have special considerations with
:mod:`xml.dom.minidom`:


.. method:: Node.cloneNode(deep)

   Although this method was present in the version of :mod:`xml.dom.minidom`
   packaged with Python 2.0, it was seriously broken. This has been corrected for
   subsequent releases.

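For readers unfamiliar with the *deep* flag, here is a short sketch of the
difference between a shallow and a deep clone. The sample document is invented
for the example.

```python
from xml.dom.minidom import parseString

doc = parseString('<a id="1"><b>text</b></a>')
original = doc.documentElement

# A shallow clone copies the element and its attributes only.
shallow = original.cloneNode(False)

# A deep clone also copies all descendants.
deep = original.cloneNode(True)
```

Neither clone is attached to the tree; it has to be inserted explicitly, for
example with ``appendChild()``, before it becomes part of a document.
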

.. _dom-example:

DOM Example
-----------

This example is a fairly realistic, if simple, program. In this
particular case, we do not take much advantage of the flexibility of the DOM.

.. literalinclude:: ../includes/minidom-example.py


.. _minidom-and-dom:

minidom and the DOM standard
----------------------------

The :mod:`xml.dom.minidom` module is essentially a DOM 1.0-compatible DOM with
some DOM 2 features (primarily namespace features).

Usage of the DOM interface in Python is straightforward. The following mapping
rules apply:

* Interfaces are accessed through instance objects. Applications should not
  instantiate the classes themselves; they should use the creator functions
  available on the :class:`Document` object. Derived interfaces support all
  operations (and attributes) from the base interfaces, plus any new operations.

* Operations are used as methods. Since the DOM uses only ``in``
  parameters, the arguments are passed in normal order (from left to right).
  There are no optional arguments. ``void`` operations return ``None``.

* IDL attributes map to instance attributes. For compatibility with the OMG IDL
  language mapping for Python, an attribute ``foo`` can also be accessed through
  accessor methods :meth:`_get_foo` and :meth:`_set_foo`. ``readonly``
  attributes must not be changed; this is not enforced at runtime.

* The types ``short int``, ``unsigned int``, ``unsigned long long``, and
  ``boolean`` all map to Python integer objects.

* The type ``DOMString`` maps to Python strings. :mod:`xml.dom.minidom` supports
  either byte or Unicode strings, but will normally produce Unicode strings.
  Values of type ``DOMString`` may also be ``None`` where allowed to have the IDL
  ``null`` value by the DOM specification from the W3C.

* ``const`` declarations map to variables in their respective scope (e.g.
  ``xml.dom.minidom.Node.PROCESSING_INSTRUCTION_NODE``); they must not be changed.

* ``DOMException`` is currently not supported in :mod:`xml.dom.minidom`.
  Instead, :mod:`xml.dom.minidom` uses standard Python exceptions such as
  :exc:`TypeError` and :exc:`AttributeError`.

* :class:`NodeList` objects are implemented using Python's built-in list type.
  Starting with Python 2.2, these objects provide the interface defined in the DOM
  specification, but with earlier versions of Python they do not support the
  official API. They are, however, much more "Pythonic" than the interface
  defined in the W3C recommendations.

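Several of these mapping rules can be seen together in a short sketch. The
sample document is invented; the example assumes a Python version in which
:class:`NodeList` provides the DOM ``length`` and ``item()`` interface.

```python
from xml.dom.minidom import Node, parseString

doc = parseString("<root><a/><b/>tail</root>")
children = doc.documentElement.childNodes

# A NodeList offers the DOM interface (length, item) and list behaviour.
count = children.length
first = children.item(0)
element_names = [node.tagName for node in children
                 if node.nodeType == Node.ELEMENT_NODE]

# IDL attributes such as nodeType are plain instance attributes, and the
# IDL constants are available as attributes of the Node class.
last_is_text = children[-1].nodeType == Node.TEXT_NODE
```

Iteration, indexing, and ``len()`` all work on ``children`` because it is a
list underneath.
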
The following interfaces have no implementation in :mod:`xml.dom.minidom`:

* :class:`DOMTimeStamp`

* :class:`DocumentType` (added in Python 2.1)

* :class:`DOMImplementation` (added in Python 2.1)

* :class:`CharacterData`

* :class:`CDATASection`

* :class:`Notation`

* :class:`Entity`

* :class:`EntityReference`

* :class:`DocumentFragment`

Most of these reflect information in the XML document that is not of general
utility to most DOM users.

.. rubric:: Footnotes

.. [#] The encoding string included in XML output should conform to the
   appropriate standards. For example, "UTF-8" is valid, but "UTF8" is
   not. See http://www.w3.org/TR/2006/REC-xml11-20060816/#NT-EncodingDecl
   and http://www.iana.org/assignments/character-sets .