\section{\module{xml.dom.minidom} ---
         Lightweight DOM implementation}

\declaremodule{standard}{xml.dom.minidom}
\modulesynopsis{Lightweight Document Object Model (DOM) implementation.}
\moduleauthor{Paul Prescod}{paul@prescod.net}
\sectionauthor{Paul Prescod}{paul@prescod.net}
\sectionauthor{Martin v. L\"owis}{martin@v.loewis.de}

\versionadded{2.0}

\module{xml.dom.minidom} is a lightweight implementation of the
Document Object Model interface. It is intended to be
simpler than the full DOM and also significantly smaller.

DOM applications typically start by parsing some XML into a DOM. With
\module{xml.dom.minidom}, this is done through the parse functions:

\begin{verbatim}
from xml.dom.minidom import parse, parseString

dom1 = parse('c:\\temp\\mydata.xml') # parse an XML file by name

datasource = open('c:\\temp\\mydata.xml')
dom2 = parse(datasource)   # parse an open file

dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')
\end{verbatim}

The \function{parse()} function can take either a filename or an open
file object.

\begin{funcdesc}{parse}{filename_or_file\optional{, parser}}
  Return a \class{Document} from the given input. \var{filename_or_file}
  may be either a file name or a file-like object. \var{parser}, if
  given, must be a SAX2 parser object. This function will change the
  document handler of the parser and activate namespace support; other
  parser configuration (like setting an entity resolver) must have been
  done in advance.
\end{funcdesc}

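The two input forms accepted by \function{parse()} can be sketched as
follows. This is an illustrative example only: the temporary file, its
contents, and the variable names are assumptions, and the code is
written for a current Python 3 interpreter.

```python
import os
import tempfile
from xml.dom.minidom import parse

# Write a small illustrative document to a temporary file.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "w") as f:
    f.write("<myxml>Some data</myxml>")

try:
    # Form 1: pass the file name.
    dom_by_name = parse(path)

    # Form 2: pass an open file object (binary mode lets the
    # parser detect the document encoding itself).
    with open(path, "rb") as datasource:
        dom_by_file = parse(datasource)
finally:
    os.remove(path)

root_by_name = dom_by_name.documentElement.tagName
root_by_file = dom_by_file.documentElement.tagName
```

Both calls should yield equivalent \class{Document} objects; opening the
file yourself is mainly useful when the data comes from something other
than a plain path.
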
If you have XML in a string, you can use the
\function{parseString()} function instead:

\begin{funcdesc}{parseString}{string\optional{, parser}}
  Return a \class{Document} that represents the \var{string}. This
  function creates a \class{StringIO} object for the string and passes
  that on to \function{parse()}.
\end{funcdesc}

Both functions return a \class{Document} object representing the
content of the document.

What the \function{parse()} and \function{parseString()} functions do
is connect an XML parser with a ``DOM builder'' that can accept parse
events from any SAX parser and convert them into a DOM tree. The names
of the functions are perhaps misleading, but they are easy to grasp when
learning the interfaces. The parsing of the document will be
completed before these functions return; it's simply that these
functions do not provide a parser implementation themselves.

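To make the division of labour concrete, the sketch below passes an
explicitly created SAX2 parser to \function{parseString()}. It assumes a
current Python 3 interpreter and the standard \module{xml.sax} package;
the sample document and variable names are illustrative.

```python
import xml.sax
from xml.dom.minidom import parseString

# Create a SAX2 parser explicitly; any extra configuration
# (for example an entity resolver) would be applied here.
sax_parser = xml.sax.make_parser()

# minidom installs its own content handler on the parser and
# builds the DOM tree from the SAX events it receives.
dom = parseString("<myxml>Some data<empty/></myxml>", sax_parser)

# The tree is complete by the time parseString() returns.
root = dom.documentElement
child_names = [node.nodeName for node in root.childNodes]
```

Omitting the second argument lets \module{xml.dom.minidom} choose a
parser itself, which is the usual case.
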
You can also create a \class{Document} by calling a method on a ``DOM
Implementation'' object. You can get this object by calling the
\function{getDOMImplementation()} function in either the
\refmodule{xml.dom} package or the \module{xml.dom.minidom} module.
Using the implementation from the \module{xml.dom.minidom} module will
always return a \class{Document} instance from the minidom
implementation, while the version from \refmodule{xml.dom} may provide
an alternate implementation (this is likely if you have the
\ulink{PyXML package}{http://pyxml.sourceforge.net/} installed). Once
you have a \class{Document}, you can add child nodes to it to populate
the DOM:

\begin{verbatim}
from xml.dom.minidom import getDOMImplementation

impl = getDOMImplementation()

newdoc = impl.createDocument(None, "some_tag", None)
top_element = newdoc.documentElement
text = newdoc.createTextNode('Some textual content.')
top_element.appendChild(text)
\end{verbatim}

Once you have a DOM document object, you can access the parts of your
XML document through its properties and methods. These properties are
defined in the DOM specification. The main property of the document
object is the \member{documentElement} property. It gives you the
main element in the XML document: the one that holds all others. Here
is an example program:

\begin{verbatim}
dom3 = parseString("<myxml>Some data</myxml>")
assert dom3.documentElement.tagName == "myxml"
\end{verbatim}

When you are finished with a DOM, you should clean it up. This is
necessary because some versions of Python do not support garbage
collection of objects that refer to each other in a cycle. Until this
restriction is removed from all versions of Python, it is safest to
write your code as if cycles would not be cleaned up.

The way to clean up a DOM is to call its \method{unlink()} method:

\begin{verbatim}
dom1.unlink()
dom2.unlink()
dom3.unlink()
\end{verbatim}

\method{unlink()} is a \module{xml.dom.minidom}-specific extension to
the DOM API. After calling \method{unlink()} on a node, the node and
its descendants are essentially useless.

\begin{seealso}
  \seetitle[http://www.w3.org/TR/REC-DOM-Level-1/]{Document Object
            Model (DOM) Level 1 Specification}
           {The W3C recommendation for the
            DOM supported by \module{xml.dom.minidom}.}
\end{seealso}


\subsection{DOM Objects \label{dom-objects}}

The definition of the DOM API for Python is given as part of the
\refmodule{xml.dom} module documentation. This section lists the
differences between the API and \refmodule{xml.dom.minidom}.


\begin{methoddesc}[Node]{unlink}{}
Break internal references within the DOM so that it will be garbage
collected on versions of Python without cyclic GC. Even when cyclic
GC is available, using this can make large amounts of memory available
sooner, so calling this on DOM objects as soon as they are no longer
needed is good practice. This only needs to be called on the
\class{Document} object, but may be called on child nodes to discard
children of that node.
\end{methoddesc}

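A short sketch of the cleanup pattern follows. The sample documents and
variable names are illustrative. The second half relies on context
manager support for nodes, which is an assumption about the interpreter
in use: it is available in Python 3.2 and later, not in the releases
this section was originally written for.

```python
from xml.dom.minidom import parseString

dom = parseString("<myxml><item/><item/></myxml>")
items_before = len(dom.getElementsByTagName("item"))

# Explicit cleanup: breaks the parent/child reference cycles.
dom.unlink()
children_after = len(dom.childNodes)

# Python 3.2 and later: the with statement calls unlink() on exit.
with parseString("<myxml>Some data</myxml>") as scoped_dom:
    tag = scoped_dom.documentElement.tagName
```

After \method{unlink()} the document no longer holds any child nodes, so
any data still needed must be extracted beforehand, as is done with
\code{tag} above.
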
\begin{methoddesc}[Node]{writexml}{writer\optional{,indent=""\optional{,addindent=""\optional{,newl=""}}}}
Write XML to the writer object. The writer should have a
\method{write()} method which matches that of the file object
interface. The \var{indent} parameter is the indentation of the current
node. The \var{addindent} parameter is the incremental indentation to use
for subnodes of the current one. The \var{newl} parameter specifies the
string to use to terminate lines.

\versionchanged[The optional keyword parameters
\var{indent}, \var{addindent}, and \var{newl} were added to support pretty
output]{2.1}

\versionchanged[For the \class{Document} node, an additional keyword
argument \var{encoding} can be used to specify the encoding field of the XML
header]{2.3}
\end{methoddesc}

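As an illustration of the three formatting parameters, the following
sketch writes a small document to an in-memory writer. The use of
\class{io.StringIO} as the writer and the sample document are
assumptions made for the example; any object with a suitable
\method{write()} method should work.

```python
import io
from xml.dom.minidom import parseString

dom = parseString("<root><child/><child/></root>")

# Any object with a write() method can serve as the writer.
writer = io.StringIO()

# indent: prefix for the current node; addindent: extra indentation
# per nesting level; newl: string written at the end of each line.
dom.writexml(writer, indent="", addindent="  ", newl="\n")

output = writer.getvalue()
print(output)
```

With the default empty strings for all three parameters, the document is
written without any added whitespace.
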
\begin{methoddesc}[Node]{toxml}{\optional{encoding}}
Return the XML that the DOM represents as a string.

With no argument, the XML header does not specify an encoding, and the
result is a Unicode string if the default encoding cannot represent all
characters in the document. Encoding this string in an encoding other
than UTF-8 is likely incorrect, since UTF-8 is the default encoding of
XML.

With an explicit \var{encoding} argument, the result is a byte string
in the specified encoding. It is recommended that this argument
always be specified. To avoid \exception{UnicodeError} exceptions in case of
unrepresentable text data, the encoding argument should be specified
as \code{"utf-8"}.

\versionchanged[the \var{encoding} argument was introduced]{2.3}
\end{methoddesc}

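The difference between the two call forms can be sketched as below. The
example assumes a current Python 3 interpreter, where the no-argument
form always returns a text string; the distinction between byte and
Unicode strings described above applies to Python 2. The sample document
is illustrative.

```python
from xml.dom.minidom import parseString

dom = parseString("<greeting>caf\u00e9</greeting>")

# No argument: text result, no encoding in the XML declaration.
as_text = dom.toxml()

# Explicit encoding: encoded bytes, with the encoding declared.
as_bytes = dom.toxml("utf-8")
```

Passing the encoding explicitly keeps the XML declaration and the actual
byte encoding of the output in agreement.
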
\begin{methoddesc}[Node]{toprettyxml}{\optional{indent\optional{, newl\optional{, encoding}}}}
Return a pretty-printed version of the document. \var{indent} specifies
the indentation string and defaults to a tab character; \var{newl} specifies
the string emitted at the end of each line and defaults to \code{\e n}.

\versionadded{2.1}
\versionchanged[the \var{encoding} argument was introduced; see \method{toxml()}]{2.3}
\end{methoddesc}

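A minimal sketch of the effect of \var{indent} and \var{newl}, using an
illustrative three-level document and assuming a current Python 3
interpreter:

```python
from xml.dom.minidom import parseString

dom = parseString("<root><child><leaf/></child></root>")

# Defaults: a tab per nesting level, "\n" after each line.
default_pretty = dom.toprettyxml()

# Two spaces per level instead of a tab.
spaced_pretty = dom.toprettyxml(indent="  ", newl="\n")
print(spaced_pretty)
```

Note that pretty-printing adds whitespace to the output, so it is better
suited to human inspection than to data that will be parsed again with
whitespace-sensitive content.
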
The following standard DOM methods have special considerations with
\refmodule{xml.dom.minidom}:

\begin{methoddesc}[Node]{cloneNode}{deep}
Although this method was present in the version of
\refmodule{xml.dom.minidom} packaged with Python 2.0, it was seriously
broken. This has been corrected for subsequent releases.
\end{methoddesc}


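For reference, the behaviour of \method{cloneNode()} in a corrected
release can be sketched as follows. The sample document and variable
names are illustrative, and the code assumes a current Python 3
interpreter.

```python
from xml.dom.minidom import parseString

dom = parseString('<root><item id="1"><name>first</name></item></root>')
item = dom.documentElement.firstChild

# Shallow clone: copies the element and its attributes only.
shallow = item.cloneNode(False)

# Deep clone: also copies the whole subtree below the element.
deep = item.cloneNode(True)

# Clones are detached until inserted into the tree.
dom.documentElement.appendChild(deep)
```
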
\subsection{DOM Example \label{dom-example}}

The following program is a fairly realistic example of a simple
application. In this particular case, we do not take much advantage
of the flexibility of the DOM.

\verbatiminput{minidom-example.py}


\subsection{minidom and the DOM standard \label{minidom-and-dom}}

The \refmodule{xml.dom.minidom} module is essentially a DOM
1.0-compatible DOM with some DOM 2 features (primarily namespace
features).

Usage of the DOM interface in Python is straightforward. The
following mapping rules apply:

\begin{itemize}
\item Interfaces are accessed through instance objects. Applications
      should not instantiate the classes themselves; they should use
      the creator functions available on the \class{Document} object.
      Derived interfaces support all operations (and attributes) from
      the base interfaces, plus any new operations.

\item Operations are used as methods. Since the DOM uses only
      \keyword{in} parameters, the arguments are passed in normal
      order (from left to right). There are no optional
      arguments. \keyword{void} operations return \code{None}.

\item IDL attributes map to instance attributes. For compatibility
      with the OMG IDL language mapping for Python, an attribute
      \code{foo} can also be accessed through accessor methods
      \method{_get_foo()} and \method{_set_foo()}. \keyword{readonly}
      attributes must not be changed; this is not enforced at
      runtime.

\item The types \code{short int}, \code{unsigned int}, \code{unsigned
      long long}, and \code{boolean} all map to Python integer
      objects.

\item The type \code{DOMString} maps to Python strings.
      \refmodule{xml.dom.minidom} supports either byte or Unicode
      strings, but will normally produce Unicode strings. Values
      of type \code{DOMString} may also be \code{None} where allowed
      to have the IDL \code{null} value by the DOM specification from
      the W3C.

\item \keyword{const} declarations map to variables in their
      respective scope
      (e.g. \code{xml.dom.minidom.Node.PROCESSING_INSTRUCTION_NODE});
      they must not be changed.

\item \code{DOMException} is currently not supported in
      \refmodule{xml.dom.minidom}. Instead,
      \refmodule{xml.dom.minidom} uses standard Python exceptions such
      as \exception{TypeError} and \exception{AttributeError}.

\item \class{NodeList} objects are implemented using Python's built-in
      list type. Starting with Python 2.2, these objects provide the
      interface defined in the DOM specification, but with earlier
      versions of Python they do not support the official API. They
      are, however, much more ``Pythonic'' than the interface defined
      in the W3C recommendations.
\end{itemize}


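Several of the mapping rules above can be seen together in a short
sketch. The sample document and variable names are illustrative, and the
code assumes a current Python 3 interpreter.

```python
from xml.dom.minidom import Node, parseString

dom = parseString("<a><b/><c/></a>")
root = dom.documentElement

# IDL attributes are plain instance attributes.
name = root.tagName

# const declarations are class-level variables.
is_element = root.nodeType == Node.ELEMENT_NODE

# NodeList supports both the DOM interface and list operations.
children = root.childNodes
dom_style = [children.item(i).tagName for i in range(children.length)]
python_style = [child.tagName for child in children]

# A DOMString that is null in IDL terms is None in Python.
namespace = root.namespaceURI
```

The list-style iteration is usually the more convenient form in Python
code; \method{item()} and \member{length} are mainly of interest when
porting code written against the W3C interfaces.
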
The following interfaces have no implementation in
\refmodule{xml.dom.minidom}:

\begin{itemize}
\item \class{DOMTimeStamp}

\item \class{DocumentType} (added in Python 2.1)

\item \class{DOMImplementation} (added in Python 2.1)

\item \class{CharacterData}

\item \class{CDATASection}

\item \class{Notation}

\item \class{Entity}

\item \class{EntityReference}

\item \class{DocumentFragment}
\end{itemize}

Most of these reflect information in the XML document that is of
little use to the majority of DOM users.