.. Source file: python/trunk/Doc/library/urlparse.rst
   Last change on this file was 391, checked in by dmik, 11 years ago
   (python: Merge vendor 2.7.6 to trunk).

:mod:`urlparse` --- Parse URLs into components
==============================================

.. module:: urlparse
   :synopsis: Parse URLs into or assemble them from components.


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   pair: URL; parsing
   pair: relative; URL

.. note::
   The :mod:`urlparse` module is renamed to :mod:`urllib.parse` in Python 3.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to Python 3.

**Source code:** :source:`Lib/urlparse.py`

--------------

This module defines a standard interface to break Uniform Resource Locator (URL)
strings up into components (addressing scheme, network location, path etc.), to
combine the components back into a URL string, and to convert a "relative URL"
to an absolute URL given a "base URL."

The module has been designed to match the Internet RFC on Relative Uniform
Resource Locators (:rfc:`1808`). It supports the following URL schemes: ``file``,
``ftp``, ``gopher``, ``hdl``, ``http``, ``https``, ``imap``, ``mailto``, ``mms``,
``news``, ``nntp``, ``prospero``, ``rsync``, ``rtsp``, ``rtspu``, ``sftp``,
``shttp``, ``sip``, ``sips``, ``snews``, ``svn``, ``svn+ssh``, ``telnet``,
``wais``.

.. versionadded:: 2.5
   Support for the ``sftp`` and ``sips`` schemes.

The :mod:`urlparse` module defines the following functions:


.. function:: urlparse(urlstring[, scheme[, allow_fragments]])

   Parse a URL into six components, returning a 6-tuple. This corresponds to the
   general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``.
   Each tuple item is a string, possibly empty. The components are not broken up
   into smaller parts (for example, the network location is a single string), and
   % escapes are not expanded. The delimiters as shown above are not part of the
   result, except for a leading slash in the *path* component, which is retained if
   present. For example:

      >>> from urlparse import urlparse
      >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
      >>> o   # doctest: +NORMALIZE_WHITESPACE
      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> o.scheme
      'http'
      >>> o.port
      80
      >>> o.geturl()
      'http://www.cwi.nl:80/%7Eguido/Python.html'


   Following the syntax specifications in :rfc:`1808`, urlparse recognizes
   a netloc only if it is properly introduced by '//'. Otherwise the
   input is presumed to be a relative URL and thus to start with
   a path component.

      >>> from urlparse import urlparse
      >>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                 params='', query='', fragment='')
      >>> urlparse('www.cwi.nl/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html',
                 params='', query='', fragment='')
      >>> urlparse('help/Python.html')
      ParseResult(scheme='', netloc='', path='help/Python.html', params='',
                 query='', fragment='')

   If the *scheme* argument is specified, it gives the default addressing
   scheme, to be used only if the URL does not specify one. The default value for
   this argument is the empty string.

   If the *allow_fragments* argument is false, fragment identifiers are not
   allowed, even if the URL's addressing scheme normally does support them. The
   default value for this argument is :const:`True`.
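
   The effect of the two optional arguments can be sketched as follows. The
   URLs are made up for illustration, and the fallback import is an assumption
   added only so that the snippet also runs on Python 3, where the module is
   named :mod:`urllib.parse`.

   .. code-block:: python

      try:
          from urlparse import urlparse        # Python 2, as documented here
      except ImportError:
          from urllib.parse import urlparse    # Python 3 name of the module

      # The default scheme is applied only when the URL carries none.
      no_scheme = urlparse('//www.cwi.nl/index.html', 'http')
      has_scheme = urlparse('ftp://www.cwi.nl/index.html', 'http')
      print(no_scheme.scheme)     # http
      print(has_scheme.scheme)    # ftp

      # With allow_fragments false, the '#...' part is not split off and
      # stays attached to the preceding component (here, the path).
      kept = urlparse('http://www.cwi.nl/doc#intro', allow_fragments=False)
      print(kept.path)            # /doc#intro
      print(kept.fragment == '')  # True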

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+--------------------------+----------------------+
   | Attribute        | Index | Value                    | Value if not present |
   +==================+=======+==========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier     | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part    | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path        | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`params`   | 3     | Parameters for last path | empty string         |
   |                  |       | element                  |                      |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`query`    | 4     | Query component          | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`fragment` | 5     | Fragment identifier      | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`username` |       | User name                | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`password` |       | Password                 | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)   | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer,  | :const:`None`        |
   |                  |       | if present               |                      |
   +------------------+-------+--------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.
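
   A short sketch of how the indexed items and the convenience attributes
   relate. The credentials and port in the URL are invented for the example,
   and the fallback import is an assumption so the snippet also runs on
   Python 3.

   .. code-block:: python

      try:
          from urlparse import urlparse        # Python 2
      except ImportError:
          from urllib.parse import urlparse    # Python 3 name of the module

      result = urlparse('http://guido:secret@WWW.CWI.NL:8080/%7Eguido/')

      # Items 0-5 are reachable both by index and by attribute name.
      print(result[1] == result.netloc)   # True
      print(result.netloc)                # guido:secret@WWW.CWI.NL:8080

      # The remaining attributes are derived from netloc and have no index.
      print(result.username)              # guido
      print(result.password)              # secret
      print(result.hostname)              # www.cwi.nl (lower-cased)
      print(result.port)                  # 8080 (an integer)

      # Parts that are absent from netloc come back as None.
      bare = urlparse('http://www.cwi.nl/')
      print(bare.username is None and bare.port is None)   # True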

   .. versionchanged:: 2.5
      Added attributes to return value.

   .. versionchanged:: 2.7
      Added IPv6 URL parsing capabilities.


.. function:: parse_qs(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a
   dictionary. The dictionary keys are the unique query variable names and the
   values are lists of values for each name.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings. A true value
   indicates that blanks should be retained as blank strings. The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such dictionaries into
   query strings.
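
   The flags can be illustrated with a small sketch. The query string is made
   up, and the fallback imports are an assumption so the snippet also runs on
   Python 3, where both functions live in :mod:`urllib.parse`.

   .. code-block:: python

      try:
          from urlparse import parse_qs
          from urllib import urlencode
      except ImportError:
          from urllib.parse import parse_qs, urlencode

      qs = 'name=guido&lang=python&lang=c&empty='

      # Repeated names are collected into one list; blank values are dropped.
      default = parse_qs(qs)
      print(default == {'name': ['guido'], 'lang': ['python', 'c']})   # True

      # keep_blank_values retains 'empty' with a blank string as its value.
      with_blanks = parse_qs(qs, keep_blank_values=True)
      print(with_blanks['empty'])   # ['']

      # strict_parsing turns a malformed field (no '=') into a ValueError.
      try:
          parse_qs('name=guido&broken', strict_parsing=True)
          strict_failed = False
      except ValueError:
          strict_failed = True
      print(strict_failed)          # True

      # urlencode() with doseq=True goes the other way; parsing its output
      # gives back the same dictionary (field order is not guaranteed).
      round_trip = parse_qs(urlencode(default, doseq=True))
      print(round_trip == default)  # True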

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.


.. function:: parse_qsl(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a list of
   name, value pairs.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings. A true value
   indicates that blanks should be retained as blank strings. The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such lists of pairs into
   query strings.
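
   Because the result is a list, the order of the fields and any repeated
   names are preserved. A minimal sketch, with an invented query string and a
   fallback import that is assumed only for Python 3 compatibility:

   .. code-block:: python

      try:
          from urlparse import parse_qsl
          from urllib import urlencode
      except ImportError:
          from urllib.parse import parse_qsl, urlencode

      pairs = parse_qsl('x=1&y=2&x=3')
      print(pairs)             # [('x', '1'), ('y', '2'), ('x', '3')]

      # A list of pairs can be passed straight back to urlencode().
      encoded = urlencode(pairs)
      print(encoded)           # x=1&y=2&x=3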

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.


.. function:: urlunparse(parts)

   Construct a URL from a tuple as returned by :func:`urlparse`. The *parts* argument
   can be any six-item iterable. This may result in a slightly different, but
   equivalent URL, if the URL that was parsed originally had unnecessary delimiters
   (for example, a ``?`` with an empty query; the RFC states that these are
   equivalent).
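
   Both points can be seen in a short sketch. The query and fragment values
   are invented, and the fallback import is an assumption for Python 3.

   .. code-block:: python

      try:
          from urlparse import urlparse, urlunparse
      except ImportError:
          from urllib.parse import urlparse, urlunparse

      # A trailing '?' with an empty query is dropped on reassembly.
      rebuilt = urlunparse(urlparse('http://www.cwi.nl/%7Eguido/Python.html?'))
      print(rebuilt)     # http://www.cwi.nl/%7Eguido/Python.html

      # Any six-item iterable works, not only a ParseResult.
      parts = ['http', 'www.cwi.nl', '/%7Eguido/Python.html', '', 'lang=en', 'top']
      assembled = urlunparse(parts)
      print(assembled)   # http://www.cwi.nl/%7Eguido/Python.html?lang=en#top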


.. function:: urlsplit(urlstring[, scheme[, allow_fragments]])

   This is similar to :func:`urlparse`, but does not split the params from the URL.
   This should generally be used instead of :func:`urlparse` if the more recent URL
   syntax allowing parameters to be applied to each segment of the *path* portion
   of the URL (see :rfc:`2396`) is wanted. A separate function is needed to
   separate the path segments and parameters. This function returns a 5-tuple:
   (addressing scheme, network location, path, query, fragment identifier).
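
   The difference from :func:`urlparse` shows up when more than one path
   segment carries parameters. The sketch below uses an invented URL, and
   ``split_segment_params`` is a hypothetical helper (it is not part of this
   module) standing in for the "separate function" mentioned above. The
   fallback import is an assumption for Python 3.

   .. code-block:: python

      try:
          from urlparse import urlparse, urlsplit
      except ImportError:
          from urllib.parse import urlparse, urlsplit

      url = 'http://www.cwi.nl/a;x=1/b;y=2?q=3'

      # urlparse() splits off only the parameters of the last segment.
      six = urlparse(url)
      print(six.path)      # /a;x=1/b
      print(six.params)    # y=2

      # urlsplit() leaves every segment's parameters inside the path.
      five = urlsplit(url)
      print(len(five))     # 5
      print(five.path)     # /a;x=1/b;y=2


      def split_segment_params(path):
          """Hypothetical helper: return (segment, params) for each segment."""
          pairs = []
          for segment in path.split('/'):
              name, _, params = segment.partition(';')
              pairs.append((name, params))
          return pairs

      segments = split_segment_params(five.path)
      print(segments)      # [('', ''), ('a', 'x=1'), ('b', 'y=2')]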

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+-------------------------+----------------------+
   | Attribute        | Index | Value                   | Value if not present |
   +==================+=======+=========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier    | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part   | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path       | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`query`    | 3     | Query component         | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`fragment` | 4     | Fragment identifier     | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`username` |       | User name               | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`password` |       | Password                | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer, | :const:`None`        |
   |                  |       | if present              |                      |
   +------------------+-------+-------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionadded:: 2.2

   .. versionchanged:: 2.5
      Added attributes to return value.


.. function:: urlunsplit(parts)

   Combine the elements of a tuple as returned by :func:`urlsplit` into a complete
   URL as a string. The *parts* argument can be any five-item iterable. This may
   result in a slightly different, but equivalent URL, if the URL that was parsed
   originally had unnecessary delimiters (for example, a ``?`` with an empty query;
   the RFC states that these are equivalent).

   .. versionadded:: 2.2


.. function:: urljoin(base, url[, allow_fragments])

   Construct a full ("absolute") URL by combining a "base URL" (*base*) with
   another URL (*url*). Informally, this uses components of the base URL, in
   particular the addressing scheme, the network location and (part of) the path,
   to provide missing components in the relative URL. For example:

      >>> from urlparse import urljoin
      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
      'http://www.cwi.nl/%7Eguido/FAQ.html'

   The *allow_fragments* argument has the same meaning and default as for
   :func:`urlparse`.
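
   How much of the base path survives depends on the form of the relative
   reference. A brief sketch with invented relative paths (the fallback
   import is an assumption for Python 3):

   .. code-block:: python

      try:
          from urlparse import urljoin
      except ImportError:
          from urllib.parse import urljoin

      base = 'http://www.cwi.nl/%7Eguido/Python.html'

      # '..' climbs out of the directory that holds the base document.
      parent = urljoin(base, '../index.html')
      print(parent)    # http://www.cwi.nl/index.html

      # A path starting with '/' replaces the whole base path.
      rooted = urljoin(base, '/docs/')
      print(rooted)    # http://www.cwi.nl/docs/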

   .. note::

      If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``),
      the *url*'s host name and/or scheme will be present in the result. For example:

   .. doctest::

      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
      ...         '//www.python.org/%7Eguido')
      'http://www.python.org/%7Eguido'

   If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and
   :func:`urlunsplit`, removing possible *scheme* and *netloc* parts.
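
   One possible way to do that preprocessing is sketched below. The variable
   names are illustrative only, and the fallback import is an assumption for
   Python 3.

   .. code-block:: python

      try:
          from urlparse import urljoin, urlsplit, urlunsplit
      except ImportError:
          from urllib.parse import urljoin, urlsplit, urlunsplit

      base = 'http://www.cwi.nl/%7Eguido/Python.html'
      untrusted = '//www.python.org/%7Eguido'

      # Keep only path, query and fragment; blank out scheme and netloc.
      parts = urlsplit(untrusted)
      local_only = urlunsplit(('', '', parts.path, parts.query, parts.fragment))
      print(local_only)      # /%7Eguido

      # The join now stays on the host of the base URL.
      joined = urljoin(base, local_only)
      print(joined)          # http://www.cwi.nl/%7Eguido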


.. function:: urldefrag(url)

   If *url* contains a fragment identifier, returns a modified version of *url*
   with no fragment identifier, and the fragment identifier as a separate string.
   If there is no fragment identifier in *url*, returns *url* unmodified and an
   empty string.
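
   A minimal sketch of both cases, using an invented fragment name (the
   fallback import is an assumption for Python 3):

   .. code-block:: python

      try:
          from urlparse import urldefrag
      except ImportError:
          from urllib.parse import urldefrag

      # The result unpacks into the stripped URL and the fragment.
      url, fragment = urldefrag('http://www.cwi.nl/%7Eguido/Python.html#intro')
      print(url)         # http://www.cwi.nl/%7Eguido/Python.html
      print(fragment)    # intro

      # Without a fragment the URL comes back unchanged.
      plain, empty = urldefrag('http://www.cwi.nl/%7Eguido/Python.html')
      print(plain)       # http://www.cwi.nl/%7Eguido/Python.html
      print(empty == '') # True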


.. seealso::

   :rfc:`3986` - Uniform Resource Identifiers
      This is the current standard (STD66). Any changes to the urlparse module
      should conform to this. Certain deviations could be observed, which are
      mostly due to backward compatibility and to certain de-facto
      parsing requirements commonly observed in major browsers.

   :rfc:`2732` - Format for Literal IPv6 Addresses in URL's.
      This specifies the parsing requirements of IPv6 URLs.

   :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax
      Document describing the generic syntactic requirements for both Uniform Resource
      Names (URNs) and Uniform Resource Locators (URLs).

   :rfc:`2368` - The mailto URL scheme.
      Parsing requirements for mailto URL schemes.

   :rfc:`1808` - Relative Uniform Resource Locators
      This Request For Comments includes the rules for joining an absolute and a
      relative URL, including a fair number of "Abnormal Examples" which govern the
      treatment of border cases.

   :rfc:`1738` - Uniform Resource Locators (URL)
      This specifies the formal syntax and semantics of absolute URLs.


.. _urlparse-result-object:

Results of :func:`urlparse` and :func:`urlsplit`
------------------------------------------------

The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
subclasses of the :class:`tuple` type. These subclasses add the attributes
described in those functions, as well as provide an additional method:


.. method:: ParseResult.geturl()

   Return the re-combined version of the original URL as a string. This may differ
   from the original URL in that the scheme will always be normalized to lower case
   and empty components may be dropped. Specifically, empty parameters, queries,
   and fragment identifiers will be removed.

   The result of this method is a fixpoint if passed back through the original
   parsing function:

      >>> import urlparse
      >>> url = 'HTTP://www.Python.org/doc/#'

      >>> r1 = urlparse.urlsplit(url)
      >>> r1.geturl()
      'http://www.Python.org/doc/'

      >>> r2 = urlparse.urlsplit(r1.geturl())
      >>> r2.geturl()
      'http://www.Python.org/doc/'

   .. versionadded:: 2.5

The following classes provide the implementations of the parse results:


.. class:: BaseResult

   Base class for the concrete result classes. This provides most of the attribute
   definitions. It does not provide a :meth:`geturl` method. It is derived from
   :class:`tuple`, but does not override the :meth:`__init__` or :meth:`__new__`
   methods.


.. class:: ParseResult(scheme, netloc, path, params, query, fragment)

   Concrete class for :func:`urlparse` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.


.. class:: SplitResult(scheme, netloc, path, query, fragment)

   Concrete class for :func:`urlsplit` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.
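
The concrete classes can also be instantiated directly, which may be handy for
building a URL from known parts. The sketch below uses invented component
values, and the fallback import is an assumption so that it also runs on
Python 3, where the classes live in :mod:`urllib.parse`.

.. code-block:: python

   try:
       from urlparse import ParseResult, SplitResult
   except ImportError:
       from urllib.parse import ParseResult, SplitResult

   # All six components must be given; empty strings mark absent parts.
   built = ParseResult('http', 'www.cwi.nl', '/doc', '', 'q=1', '')
   print(isinstance(built, tuple))   # True
   print(built.geturl())             # http://www.cwi.nl/doc?q=1

   # Passing the wrong number of arguments is rejected with a TypeError.
   try:
       SplitResult('http', 'www.cwi.nl', '/doc', 'q=1')   # only four items
       rejected = False
   except TypeError:
       rejected = True
   print(rejected)                   # True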