:mod:`urlparse` --- Parse URLs into components
==============================================

.. module:: urlparse
   :synopsis: Parse URLs into or assemble them from components.


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   pair: URL; parsing
   pair: relative; URL

.. note::
   The :mod:`urlparse` module is renamed to :mod:`urllib.parse` in Python 3.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to Python 3.

**Source code:** :source:`Lib/urlparse.py`

--------------

This module defines a standard interface to break Uniform Resource Locator (URL)
strings up into components (addressing scheme, network location, path, etc.), to
combine the components back into a URL string, and to convert a "relative URL"
to an absolute URL given a "base URL".

The module has been designed to match the Internet RFC on Relative Uniform
Resource Locators. It supports the following URL schemes: ``file``, ``ftp``,
``gopher``, ``hdl``, ``http``, ``https``, ``imap``, ``mailto``, ``mms``,
``news``, ``nntp``, ``prospero``, ``rsync``, ``rtsp``, ``rtspu``, ``sftp``,
``shttp``, ``sip``, ``sips``, ``snews``, ``svn``, ``svn+ssh``, ``telnet``,
``wais``.

.. versionadded:: 2.5
   Support for the ``sftp`` and ``sips`` schemes.

The :mod:`urlparse` module defines the following functions:


.. function:: urlparse(urlstring[, scheme[, allow_fragments]])

   Parse a URL into six components, returning a 6-tuple.  This corresponds to the
   general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``.
   Each tuple item is a string, possibly empty. The components are not broken up
   into smaller parts (for example, the network location is a single string), and %
   escapes are not expanded. The delimiters as shown above are not part of the
   result, except for a leading slash in the *path* component, which is retained if
   present.  For example:

      >>> from urlparse import urlparse
      >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
      >>> o   # doctest: +NORMALIZE_WHITESPACE
      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> o.scheme
      'http'
      >>> o.port
      80
      >>> o.geturl()
      'http://www.cwi.nl:80/%7Eguido/Python.html'


   Following the syntax specifications in :rfc:`1808`, :func:`urlparse`
   recognizes a netloc only if it is properly introduced by ``//``.  Otherwise
   the input is presumed to be a relative URL and thus to start with
   a path component.

      >>> from urlparse import urlparse
      >>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> urlparse('www.cwi.nl/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> urlparse('help/Python.html')
      ParseResult(scheme='', netloc='', path='help/Python.html', params='',
                  query='', fragment='')

   If the *scheme* argument is specified, it gives the default addressing
   scheme, to be used only if the URL does not specify one.  The default value for
   this argument is the empty string.

   If the *allow_fragments* argument is false, fragment identifiers are not
   allowed, even if the URL's addressing scheme normally does support them.  The
   default value for this argument is :const:`True`.

   The return value is actually an instance of a subclass of :class:`tuple`.  This
   class has the following additional read-only convenience attributes:

   +------------------+-------+--------------------------+----------------------+
   | Attribute        | Index | Value                    | Value if not present |
   +==================+=======+==========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier     | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part    | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path        | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`params`   | 3     | Parameters for last path | empty string         |
   |                  |       | element                  |                      |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`query`    | 4     | Query component          | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`fragment` | 5     | Fragment identifier      | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`username` |       | User name                | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`password` |       | Password                 | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)   | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer,  | :const:`None`        |
   |                  |       | if present               |                      |
   +------------------+-------+--------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionchanged:: 2.5
      Added attributes to return value.

   .. versionchanged:: 2.7
      Added IPv6 URL parsing capabilities.


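The optional arguments and convenience attributes of :func:`urlparse` described
above can be sketched as follows. This is only an illustration: the credentials
and the ``:8080`` port in the last URL are invented for the example, and the
``try``/``except`` import is an assumption made so the snippet also runs under
the Python 3 name :mod:`urllib.parse`.

```python
# Works with the Python 2 module name and its Python 3 replacement.
try:
    from urlparse import urlparse
except ImportError:
    from urllib.parse import urlparse

# The default scheme is used only when the URL does not specify one.
with_default = urlparse('//www.cwi.nl:80/%7Eguido/Python.html', 'https')
explicit = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html', 'https')
default_scheme = with_default.scheme    # 'https'
explicit_scheme = explicit.scheme       # 'http'

# With allow_fragments false, '#' is not treated as a fragment delimiter.
no_fragment = urlparse('http://www.cwi.nl/doc#intro', allow_fragments=False)
kept_in_path = no_fragment.path         # '/doc#intro'
empty_fragment = no_fragment.fragment   # ''

# Convenience attributes (user name, password and port are made up).
o = urlparse('http://guido:secret@WWW.CWI.NL:8080/index.html')
credentials = (o.username, o.password)  # ('guido', 'secret')
host_and_port = (o.hostname, o.port)    # ('www.cwi.nl', 8080)

# Attributes that are not present in the URL are None.
relative = urlparse('help/Python.html')
missing = (relative.hostname, relative.port)  # (None, None)
```

Note that :attr:`hostname` is lower-cased while :attr:`netloc` keeps the
original spelling.
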
.. function:: parse_qs(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`).  Data are returned as a
   dictionary.  The dictionary keys are the unique query variable names and the
   values are lists of values for each name.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings.  A true value
   indicates that blanks should be retained as blank strings.  The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors.  If false (the default), errors are silently ignored.  If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such dictionaries into
   query strings.

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.


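A short sketch of how the two flags of :func:`parse_qs` change the result. The
query strings are invented for the example, and the fallback import is an
assumption so that the snippet also runs on Python 3.

```python
try:
    from urlparse import parse_qs
except ImportError:
    from urllib.parse import parse_qs

query = 'name=guido&lang=python&lang=c&empty='

# Repeated names are collected into one list; blank values are dropped.
default_result = parse_qs(query)
# {'name': ['guido'], 'lang': ['python', 'c']}

# keep_blank_values retains 'empty' with a blank string as its value.
with_blanks = parse_qs(query, keep_blank_values=True)
# {'name': ['guido'], 'lang': ['python', 'c'], 'empty': ['']}

# A field without '=' is a parsing error: ignored by default,
# reported as ValueError when strict_parsing is true.
lenient = parse_qs('name=guido&broken')
strict_error = None
try:
    parse_qs('name=guido&broken', strict_parsing=True)
except ValueError as exc:
    strict_error = exc
```
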
.. function:: parse_qsl(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`).  Data are returned as a list of
   name, value pairs.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings.  A true value
   indicates that blanks should be retained as blank strings.  The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors.  If false (the default), errors are silently ignored.  If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such lists of pairs into
   query strings.

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.


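Because :func:`parse_qsl` returns pairs in their original order, a simple query
string can be round-tripped through :func:`urllib.urlencode`. The sketch below
uses an invented query string; the Python 3 import path
(``urllib.parse.urlencode``) is included as an assumption so the snippet runs on
either version.

```python
try:
    from urlparse import parse_qsl
    from urllib import urlencode
except ImportError:
    from urllib.parse import parse_qsl, urlencode

query = 'lang=python&lang=c&name=guido'

# Order and duplicates are preserved, unlike with parse_qs().
pairs = parse_qsl(query)
# [('lang', 'python'), ('lang', 'c'), ('name', 'guido')]

# Encoding the pairs again yields the original query string.
round_trip = urlencode(pairs)

# Blank values are dropped unless keep_blank_values is true.
dropped = parse_qsl('a=&b=2')                       # [('b', '2')]
kept = parse_qsl('a=&b=2', keep_blank_values=True)  # [('a', ''), ('b', '2')]
```
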
.. function:: urlunparse(parts)

   Construct a URL from a tuple as returned by ``urlparse()``. The *parts* argument
   can be any six-item iterable. This may result in a slightly different, but
   equivalent URL, if the URL that was parsed originally had unnecessary delimiters
   (for example, a ? with an empty query; the RFC states that these are
   equivalent).


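The following sketch illustrates both points made for :func:`urlunparse`: an
unnecessary delimiter is dropped on the way back, and any six-item iterable is
accepted. The query and fragment values in the list are invented for the
example, and the fallback import is an assumption for Python 3.

```python
try:
    from urlparse import urlparse, urlunparse
except ImportError:
    from urllib.parse import urlparse, urlunparse

# A trailing '?' with an empty query is not reproduced.
original = 'http://www.cwi.nl/%7Eguido/Python.html?'
rebuilt = urlunparse(urlparse(original))
# 'http://www.cwi.nl/%7Eguido/Python.html'

# A plain list of six strings works as well as a ParseResult:
# (scheme, netloc, path, params, query, fragment).
from_list = urlunparse(
    ['http', 'www.cwi.nl:80', '/%7Eguido/Python.html', '', 'lang=en', 'top'])
# 'http://www.cwi.nl:80/%7Eguido/Python.html?lang=en#top'
```
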
.. function:: urlsplit(urlstring[, scheme[, allow_fragments]])

   This is similar to :func:`urlparse`, but does not split the params from the URL.
   This should generally be used instead of :func:`urlparse` if the more recent URL
   syntax allowing parameters to be applied to each segment of the *path* portion
   of the URL (see :rfc:`2396`) is wanted.  A separate function is needed to
   separate the path segments and parameters.  This function returns a 5-tuple:
   (addressing scheme, network location, path, query, fragment identifier).

   The return value is actually an instance of a subclass of :class:`tuple`.  This
   class has the following additional read-only convenience attributes:

   +------------------+-------+-------------------------+----------------------+
   | Attribute        | Index | Value                   | Value if not present |
   +==================+=======+=========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier    | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part   | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path       | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`query`    | 3     | Query component         | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`fragment` | 4     | Fragment identifier     | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`username` |       | User name               | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`password` |       | Password                | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer, | :const:`None`        |
   |                  |       | if present              |                      |
   +------------------+-------+-------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionadded:: 2.2

   .. versionchanged:: 2.5
      Added attributes to return value.


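To make the difference between :func:`urlsplit` and :func:`urlparse` concrete,
the sketch below parses a URL whose path segments each carry parameters. The URL
is invented, and ``split_segment_params`` is a hypothetical helper (not part of
the module) standing in for the "separate function" mentioned above; the
fallback import is an assumption for Python 3.

```python
try:
    from urlparse import urlparse, urlsplit
except ImportError:
    from urllib.parse import urlparse, urlsplit

url = 'http://www.cwi.nl/a;x=1/b;y=2?lang=en'

# urlparse() splits off only the parameters of the last path segment.
parsed = urlparse(url)
parsed_path, parsed_params = parsed.path, parsed.params  # '/a;x=1/b', 'y=2'

# urlsplit() leaves all parameters inside the path and returns five items.
split = urlsplit(url)
split_path = split.path                                   # '/a;x=1/b;y=2'


def split_segment_params(path):
    """Hypothetical helper: return (segment, params) for each path segment."""
    pairs = []
    for segment in path.split('/'):
        name, _, params = segment.partition(';')
        pairs.append((name, params))
    return pairs


segments = split_segment_params(split_path)
# [('', ''), ('a', 'x=1'), ('b', 'y=2')]
```

The leading ``('', '')`` pair comes from the empty string before the first
slash of an absolute path.
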
.. function:: urlunsplit(parts)

   Combine the elements of a tuple as returned by :func:`urlsplit` into a complete
   URL as a string. The *parts* argument can be any five-item iterable. This may
   result in a slightly different, but equivalent URL, if the URL that was parsed
   originally had unnecessary delimiters (for example, a ? with an empty query; the
   RFC states that these are equivalent).

   .. versionadded:: 2.2


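Since :func:`urlunsplit` accepts any five-item iterable, one way to edit a
single component is to copy the result of :func:`urlsplit` into a list, change
the item, and recombine. This is a sketch with an invented query string; the
fallback import is an assumption for Python 3.

```python
try:
    from urlparse import urlsplit, urlunsplit
except ImportError:
    from urllib.parse import urlsplit, urlunsplit

url = 'http://www.cwi.nl/%7Eguido/Python.html?lang=en#top'

# Indices follow the 5-tuple: scheme, netloc, path, query, fragment.
parts = list(urlsplit(url))
parts[3] = 'lang=nl'   # replace the query
parts[4] = ''          # drop the fragment

edited = urlunsplit(parts)
# 'http://www.cwi.nl/%7Eguido/Python.html?lang=nl'
```
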
.. function:: urljoin(base, url[, allow_fragments])

   Construct a full ("absolute") URL by combining a "base URL" (*base*) with
   another URL (*url*).  Informally, this uses components of the base URL, in
   particular the addressing scheme, the network location and (part of) the path,
   to provide missing components in the relative URL.  For example:

      >>> from urlparse import urljoin
      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
      'http://www.cwi.nl/%7Eguido/FAQ.html'

   The *allow_fragments* argument has the same meaning and default as for
   :func:`urlparse`.

   .. note::

      If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``),
      the *url*'s host name and/or scheme will be present in the result.  For example:

      .. doctest::

         >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
         ...         '//www.python.org/%7Eguido')
         'http://www.python.org/%7Eguido'

      If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and
      :func:`urlunsplit`, removing possible *scheme* and *netloc* parts.


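The preprocessing suggested in the note for :func:`urljoin` might look like the
sketch below. ``join_same_host`` is a hypothetical name chosen for this example,
not a function of the module, and the fallback import is an assumption for
Python 3.

```python
try:
    from urlparse import urljoin, urlsplit, urlunsplit
except ImportError:
    from urllib.parse import urljoin, urlsplit, urlunsplit


def join_same_host(base, url):
    """Join *url* to *base*, ignoring any scheme or netloc in *url*."""
    parts = urlsplit(url)
    # Rebuild the URL with empty scheme and netloc so only the path,
    # query and fragment of *url* take part in the join.
    relative = urlunsplit(('', '', parts.path, parts.query, parts.fragment))
    return urljoin(base, relative)


base = 'http://www.cwi.nl/%7Eguido/Python.html'

# Plain urljoin() lets the second argument replace the host.
plain = urljoin(base, '//www.python.org/%7Eguido')
# 'http://www.python.org/%7Eguido'

# The wrapper keeps the host of the base URL.
pinned = join_same_host(base, '//www.python.org/%7Eguido')
# 'http://www.cwi.nl/%7Eguido'
```

Relative URLs pass through the wrapper unchanged, so
``join_same_host(base, 'FAQ.html')`` gives the same result as :func:`urljoin`.
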
.. function:: urldefrag(url)

   If *url* contains a fragment identifier, returns a modified version of *url*
   with no fragment identifier, and the fragment identifier as a separate string.
   If there is no fragment identifier in *url*, returns *url* unmodified and an
   empty string.


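A brief sketch of both cases handled by :func:`urldefrag`. The ``#intro``
fragment is invented for the example, and the fallback import is an assumption
for Python 3.

```python
try:
    from urlparse import urldefrag
except ImportError:
    from urllib.parse import urldefrag

# With a fragment: the URL without it, plus the fragment itself.
stripped, fragment = urldefrag('http://www.python.org/doc/#intro')
# stripped == 'http://www.python.org/doc/', fragment == 'intro'

# Without a fragment: the URL unchanged and an empty string.
same, empty = urldefrag('http://www.python.org/doc/')
```
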
.. seealso::

   :rfc:`3986` - Uniform Resource Identifiers
      This is the current standard (STD66). Any changes to the :mod:`urlparse`
      module should conform to it. Certain deviations may be observed, mostly
      for backward compatibility and to meet de facto parsing requirements
      commonly observed in major browsers.

   :rfc:`2732` - Format for Literal IPv6 Addresses in URL's.
      This specifies the parsing requirements of IPv6 URLs.

   :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax
      Document describing the generic syntactic requirements for both Uniform Resource
      Names (URNs) and Uniform Resource Locators (URLs).

   :rfc:`2368` - The mailto URL scheme.
      Parsing requirements for mailto URL schemes.

   :rfc:`1808` - Relative Uniform Resource Locators
      This Request For Comments includes the rules for joining an absolute and a
      relative URL, including a fair number of "Abnormal Examples" which govern the
      treatment of border cases.

   :rfc:`1738` - Uniform Resource Locators (URL)
      This specifies the formal syntax and semantics of absolute URLs.


.. _urlparse-result-object:

Results of :func:`urlparse` and :func:`urlsplit`
------------------------------------------------

The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
subclasses of the :class:`tuple` type.  These subclasses add the attributes
described in those functions, as well as provide an additional method:


.. method:: ParseResult.geturl()

   Return the re-combined version of the original URL as a string. This may differ
   from the original URL in that the scheme will always be normalized to lower case
   and empty components may be dropped. Specifically, empty parameters, queries,
   and fragment identifiers will be removed.

   The result of this method is a fixpoint if passed back through the original
   parsing function:

      >>> import urlparse
      >>> url = 'HTTP://www.Python.org/doc/#'

      >>> r1 = urlparse.urlsplit(url)
      >>> r1.geturl()
      'http://www.Python.org/doc/'

      >>> r2 = urlparse.urlsplit(r1.geturl())
      >>> r2.geturl()
      'http://www.Python.org/doc/'

   .. versionadded:: 2.5

The following classes provide the implementations of the parse results:


.. class:: BaseResult

   Base class for the concrete result classes.  This provides most of the attribute
   definitions.  It does not provide a :meth:`geturl` method.  It is derived from
   :class:`tuple`, but does not override the :meth:`__init__` or :meth:`__new__`
   methods.


.. class:: ParseResult(scheme, netloc, path, params, query, fragment)

   Concrete class for :func:`urlparse` results.  The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.


.. class:: SplitResult(scheme, netloc, path, query, fragment)

   Concrete class for :func:`urlsplit` results.  The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.

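The result classes can also be instantiated directly, which the sketch below
uses to illustrate the argument-count check. That the failed check surfaces as
:exc:`TypeError` is an observation about the tuple-based implementation rather
than something stated above, and the fallback import is an assumption for
Python 3.

```python
try:
    from urlparse import ParseResult, SplitResult
except ImportError:
    from urllib.parse import ParseResult, SplitResult

# Five components for SplitResult, six for ParseResult.
split = SplitResult('http', 'www.python.org', '/doc/', '', '')
parse = ParseResult('http', 'www.cwi.nl:80', '/%7Eguido/Python.html', '', '', '')

split_url = split.geturl()   # 'http://www.python.org/doc/'
parse_url = parse.geturl()   # 'http://www.cwi.nl:80/%7Eguido/Python.html'
parse_port = parse.port      # 80

# Passing the wrong number of components is rejected.
wrong_arity_rejected = False
try:
    SplitResult('http', 'www.python.org')
except TypeError:
    wrong_arity_rejected = True
```
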