
urllib.rst

Last change on this file was 391, checked in by dmik, 11 years ago
python: Merge vendor 2.7.6 to trunk.
Property svn:eol-style set to `native`
File size: 22.5 KB

Rev	Line
[2]	1	:mod:`urllib` --- Open arbitrary resources by URL
	2	=================================================
	3
	4	.. module:: urllib
	5	:synopsis: Open an arbitrary network resource by URL (requires sockets).
	6
	7	.. note::
	8	The :mod:`urllib` module has been split into parts and renamed in
[391]	9	Python 3 to :mod:`urllib.request`, :mod:`urllib.parse`,
[2]	10	and :mod:`urllib.error`. The :term:`2to3` tool will automatically adapt
[391]	11	imports when converting your sources to Python 3.
[2]	12	Also note that the :func:`urllib.urlopen` function has been removed in
[391]	13	Python 3 in favor of :func:`urllib2.urlopen`.
[2]	14
	15	.. index::
	16	single: WWW
	17	single: World Wide Web
	18	single: URL
	19
	20	This module provides a high-level interface for fetching data across the World
	21	Wide Web. In particular, the :func:`urlopen` function is similar to the
	22	built-in function :func:`open`, but accepts Universal Resource Locators (URLs)
	23	instead of filenames. Some restrictions apply --- it can only open URLs for
	24	reading, and no seek operations are available.
	25
[391]	26	.. warning:: When opening HTTPS URLs, it does not attempt to validate the
	27	server certificate. Use at your own risk!
	28
	29
[2]	30	High-level interface
	31	--------------------
	32
	33	.. function:: urlopen(url[, data[, proxies]])
	34
[391]	35	Open a network object denoted by a URL for reading. If the URL does not
	36	have a scheme identifier, or if it has :file:`file:` as its scheme
	37	identifier, this opens a local file (without :term:`universal newlines`);
	38	otherwise it opens a socket to a server somewhere on the network. If the
	39	connection cannot be made the :exc:`IOError` exception is raised. If all
	40	went well, a file-like object is returned. This supports the following
	41	methods: :meth:`read`, :meth:`readline`, :meth:`readlines`, :meth:`fileno`,
	42	:meth:`close`, :meth:`info`, :meth:`getcode` and :meth:`geturl`. It also
	43	has proper support for the :term:`iterator` protocol. One caveat: the
	44	:meth:`read` method, if the size argument is omitted or negative, may not
	45	read until the end of the data stream; there is no good way to determine
[2]	46	that the entire stream from a socket has been read in the general case.
	47
	48	Except for the :meth:`info`, :meth:`getcode` and :meth:`geturl` methods,
	49	these methods have the same interface as for file objects --- see section
	50	:ref:`bltin-file-objects` in this manual. (It is not a built-in file object,
	51	however, so it can't be used at those few places where a true built-in file
	52	object is required.)
	53
	54	.. index:: module: mimetools
	55
	56	The :meth:`info` method returns an instance of the class
[391]	57	:class:`mimetools.Message` containing meta-information associated with the
[2]	58	URL. When the method is HTTP, these headers are those returned by the server
	59	at the head of the retrieved HTML page (including Content-Length and
	60	Content-Type). When the method is FTP, a Content-Length header will be
	61	present if (as is now usual) the server passed back a file length in response
	62	to the FTP retrieval request. A Content-Type header will be present if the
	63	MIME type can be guessed. When the method is local-file, returned headers
	64	will include a Date representing the file's last-modified time, a
	65	Content-Length giving file size, and a Content-Type containing a guess at the
	66	file's type. See also the description of the :mod:`mimetools` module.
	67
	68	The :meth:`geturl` method returns the real URL of the page. In some cases, the
	69	HTTP server redirects a client to another URL. The :func:`urlopen` function
	70	handles this transparently, but in some cases the caller needs to know which URL
	71	the client was redirected to. The :meth:`geturl` method can be used to get at
	72	this redirected URL.
	73
	74	The :meth:`getcode` method returns the HTTP status code that was sent with the
	75	response, or ``None`` if the URL is not an HTTP URL.
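
As a minimal sketch of the methods described above, the following opens a local file through a ``file:`` URL, so it needs no network access. The temporary file and the variable names are illustrative assumptions. The ``try``/``except`` import is only there so the snippet also runs on Python 3, where these functions live in :mod:`urllib.request`.

```python
import os
import tempfile

try:  # Python 2, the version this page documents
    from urllib import urlopen, pathname2url
except ImportError:  # Python 3 moved both names into urllib.request
    from urllib.request import urlopen, pathname2url

# Create a small local file so the sketch does not touch the network.
fd, path = tempfile.mkstemp(suffix=".txt")
os.write(fd, b"hello\n")
os.close(fd)

# Build a file: URL from the local path (hypothetical example file).
url = "file:" + pathname2url(path)

f = urlopen(url)
try:
    body = f.read()                       # file-like read()
    headers = f.info()                    # meta-information (headers)
    length = headers["Content-Length"]    # file size, as a string
    status = f.getcode()                  # None: this is not an HTTP URL
    final_url = f.geturl()                # URL actually opened
finally:
    f.close()
    os.remove(path)
```

For an HTTP URL the same calls apply, except that ``getcode()`` would return the numeric status code sent by the server.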
	76
	77	If the url uses the :file:`http:` scheme identifier, the optional data
	78	argument may be given to specify a ``POST`` request (normally the request type
	79	is ``GET``). The data argument must be in standard
	80	:mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
	81	function below.
	82
	83	The :func:`urlopen` function works transparently with proxies which do not
	84	require authentication. In a Unix or Windows environment, set the
	85	:envvar:`http_proxy` or :envvar:`ftp_proxy` environment variable to a URL that
	86	identifies the proxy server before starting the Python interpreter. For example
	87	(the ``'%'`` is the command prompt)::
	88
	89	% http_proxy="http://www.someproxy.com:3128"
	90	% export http_proxy
	91	% python
	92	...
	93
	94	The :envvar:`no_proxy` environment variable can be used to specify hosts which
	95	shouldn't be reached via proxy; if set, it should be a comma-separated list
	96	of hostname suffixes, optionally with ``:port`` appended, for example
	97	``cern.ch,ncsa.uiuc.edu,some.host:8080``.
	98
	99	In a Windows environment, if no proxy environment variables are set, proxy
	100	settings are obtained from the registry's Internet Settings section.
	101
	102	.. index:: single: Internet Config
	103
	104	In a Mac OS X environment, :func:`urlopen` will retrieve proxy information
	105	from the OS X System Configuration Framework, which can be managed with
	106	the Network System Preferences panel.
	107
	108
	109	Alternatively, the optional proxies argument may be used to explicitly specify
	110	proxies. It must be a dictionary mapping scheme names to proxy URLs, where an
	111	empty dictionary causes no proxies to be used, and ``None`` (the default value)
	112	causes environmental proxy settings to be used as discussed above. For
	113	example::
	114
	115	# Use http://www.someproxy.com:3128 for http proxying
	116	proxies = {'http': 'http://www.someproxy.com:3128'}
	117	filehandle = urllib.urlopen(some_url, proxies=proxies)
	118	# Don't use any proxies
	119	filehandle = urllib.urlopen(some_url, proxies={})
	120	# Use proxies from environment - both versions are equivalent
	121	filehandle = urllib.urlopen(some_url, proxies=None)
	122	filehandle = urllib.urlopen(some_url)
	123
	124	Proxies which require authentication for use are not currently supported; this
	125	is considered an implementation limitation.
	126
	127	.. versionchanged:: 2.3
	128	Added the proxies support.
	129
	130	.. versionchanged:: 2.6
	131	Added :meth:`getcode` to returned object and support for the
	132	:envvar:`no_proxy` environment variable.
	133
	134	.. deprecated:: 2.6
[391]	135	The :func:`urlopen` function has been removed in Python 3 in favor
[2]	136	of :func:`urllib2.urlopen`.
	137
	138
	139	.. function:: urlretrieve(url[, filename[, reporthook[, data]]])
	140
	141	Copy a network object denoted by a URL to a local file, if necessary. If the URL
	142	points to a local file, or a valid cached copy of the object exists, the object
	143	is not copied. Return a tuple ``(filename, headers)`` where filename is the
	144	local file name under which the object can be found, and headers is whatever
	145	the :meth:`info` method of the object returned by :func:`urlopen` returned (for
	146	a remote object, possibly cached). Exceptions are the same as for
	147	:func:`urlopen`.
	148
	149	The second argument, if present, specifies the file location to copy to (if
	150	absent, the location will be a tempfile with a generated name). The third
	151	argument, if present, is a hook function that will be called once on
	152	establishment of the network connection and once after each block read
	153	thereafter. The hook will be passed three arguments: a count of blocks
	154	transferred so far, a block size in bytes, and the total size of the file. The
	155	total size may be ``-1`` on older FTP servers which do not return a file
	156	size in response to a retrieval request.
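
The hook signature described above can be sketched as follows. The helper name ``make_progress_hook`` and the ``log`` list are hypothetical, chosen only for illustration. The hook is called by hand here with sample values rather than through a real download, so the snippet runs without a network connection.

```python
def make_progress_hook(log):
    """Return a reporthook that appends progress messages to ``log``."""
    def hook(block_count, block_size, total_size):
        # Bytes received so far is an estimate: blocks * block size.
        received = block_count * block_size
        if total_size > 0:
            # The last block is usually partial, so cap at the total.
            received = min(received, total_size)
            percent = 100.0 * received / total_size
            log.append("%d of %d bytes (%.0f%%)"
                       % (received, total_size, percent))
        else:
            # Total size unknown (for example -1 from an older FTP server).
            log.append("%d bytes (total size unknown)" % received)
    return hook

log = []
hook = make_progress_hook(log)
hook(0, 8192, 20000)   # the call made once the connection is established
hook(1, 8192, 20000)   # after the first block
hook(3, 8192, 20000)   # after the final (partial) block
hook(2, 8192, -1)      # a transfer whose total size is not known
```

Such a function would be passed as the third argument, for example ``urlretrieve(url, filename, hook)``.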
	157
	158	If the url uses the :file:`http:` scheme identifier, the optional data
	159	argument may be given to specify a ``POST`` request (normally the request type
	160	is ``GET``). The data argument must be in standard
	161	:mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
	162	function below.
	163
	164	.. versionchanged:: 2.5
	165	:func:`urlretrieve` will raise :exc:`ContentTooShortError` when it detects that
	166	the amount of data available was less than the expected amount (which is the
	167	size reported by a Content-Length header). This can occur, for example, when
	168	the download is interrupted.
	169
	170	The Content-Length is treated as a lower bound: if there's more data to read,
[391]	171	:func:`urlretrieve` reads more data, but if less data is available, it raises
	172	the exception.
[2]	173
	174	You can still retrieve the downloaded data in this case; it is stored in the
	175	:attr:`content` attribute of the exception instance.
	176
[391]	177	If no Content-Length header was supplied, :func:`urlretrieve` cannot check
	178	the size of the data it has downloaded, and just returns it. In this case you
	179	just have to assume that the download was successful.
[2]	180
	181
	182	.. data:: _urlopener
	183
	184	The public functions :func:`urlopen` and :func:`urlretrieve` create an instance
	185	of the :class:`FancyURLopener` class and use it to perform their requested
	186	actions. To override this functionality, programmers can create a subclass of
	187	:class:`URLopener` or :class:`FancyURLopener`, then assign an instance of that
	188	class to the ``urllib._urlopener`` variable before calling the desired function.
	189	For example, applications may want to specify a different
	190	:mailheader:`User-Agent` header than :class:`URLopener` defines. This can be
	191	accomplished with the following code::
	192
	193	import urllib
	194
	195	class AppURLopener(urllib.FancyURLopener):
	196	    version = "App/1.7"
	197
	198	urllib._urlopener = AppURLopener()
	199
	200
	201	.. function:: urlcleanup()
	202
	203	Clear the cache that may have been built up by previous calls to
	204	:func:`urlretrieve`.
	205
	206
	207	Utility functions
	208	-----------------
	209
	210	.. function:: quote(string[, safe])
	211
	212	Replace special characters in string using the ``%xx`` escape. Letters,
	213	digits, and the characters ``'_.-'`` are never quoted. By default, this
[391]	214	function is intended for quoting the path section of the URL. The optional
[2]	215	safe parameter specifies additional characters that should not be quoted
	216	--- its default value is ``'/'``.
	217
	218	Example: ``quote('/~connolly/')`` yields ``'/%7Econnolly/'``.
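
A short sketch of how the safe parameter changes the result. The sample path is an assumption made for illustration, and the fallback import only lets the snippet run on Python 3 as well, where :func:`quote` lives in :mod:`urllib.parse`.

```python
try:  # Python 2, the version this page documents
    from urllib import quote
except ImportError:  # Python 3
    from urllib.parse import quote

path = "/docs/my file#1.txt"  # hypothetical path with a space and a '#'

# Default safe='/' keeps the slashes and escapes the space and the '#'.
default_quoted = quote(path)

# With safe='' the slashes are escaped too.
fully_quoted = quote(path, safe="")

print(default_quoted)  # /docs/my%20file%231.txt
print(fully_quoted)    # %2Fdocs%2Fmy%20file%231.txt
```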
	219
	220
	221	.. function:: quote_plus(string[, safe])
	222
	223	Like :func:`quote`, but also replaces spaces by plus signs, as required for
	224	quoting HTML form values when building up a query string to go into a URL.
	225	Plus signs in the original string are escaped unless they are included in
	226	safe. Unlike :func:`quote`, safe does not default to ``'/'``.
	227
	228
	229	.. function:: unquote(string)
	230
	231	Replace ``%xx`` escapes by their single-character equivalent.
	232
	233	Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``.
	234
	235
	236	.. function:: unquote_plus(string)
	237
	238	Like :func:`unquote`, but also replaces plus signs by spaces, as required for
	239	unquoting HTML form values.
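
The following sketch shows :func:`quote_plus` and :func:`unquote_plus` as a round trip on a form value. The sample string is an assumption, and the fallback import only lets the snippet run on Python 3 too.

```python
try:  # Python 2, the version this page documents
    from urllib import quote_plus, unquote_plus
except ImportError:  # Python 3
    from urllib.parse import quote_plus, unquote_plus

value = "fish & chips + salt"  # hypothetical form field value

# Spaces become '+', while '&' and a literal '+' are percent-escaped.
encoded = quote_plus(value)

# unquote_plus reverses both the '+' and the %xx substitutions.
decoded = unquote_plus(encoded)

# '/' is not safe by default in quote_plus, unlike in quote.
slash_encoded = quote_plus("a/b")

print(encoded)        # fish+%26+chips+%2B+salt
print(slash_encoded)  # a%2Fb
```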
	240
	241
	242	.. function:: urlencode(query[, doseq])
	243
[391]	244	Convert a mapping object or a sequence of two-element tuples to a
	245	"percent-encoded" string, suitable to pass to :func:`urlopen` above as the
	246	optional data argument. This is useful to pass a dictionary of form
	247	fields to a ``POST`` request. The resulting string is a series of
	248	``key=value`` pairs separated by ``'&'`` characters, where both key and
	249	value are quoted using :func:`quote_plus` above. When a sequence of
	250	two-element tuples is used as the query argument, the first element of
	251	each tuple is a key and the second is a value. The value element in itself
	252	can be a sequence and in that case, if the optional parameter doseq
	253	evaluates to True, individual ``key=value`` pairs separated by ``'&'`` are
	254	generated for each element of the value sequence for the key. The order of
	255	parameters in the encoded string will match the order of parameter tuples in
	256	the sequence. The :mod:`urlparse` module provides the functions
[2]	257	:func:`parse_qs` and :func:`parse_qsl` which are used to parse query strings
	258	into Python data structures.
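
A minimal sketch of the effect of doseq, using a sequence of tuples so that the parameter order is predictable. The field names are made up for illustration, and the fallback import only lets the snippet run on Python 3 too.

```python
try:  # Python 2, the version this page documents
    from urllib import urlencode
except ImportError:  # Python 3
    from urllib.parse import urlencode

# Hypothetical form fields; the second value is itself a sequence.
pairs = [("q", "spam eggs"), ("tag", ["a", "b"])]

# Without doseq the list is converted with str() and quoted as one value.
flat = urlencode(pairs)

# With doseq each element of the list gets its own key=value pair.
expanded = urlencode(pairs, doseq=True)

print(flat)      # q=spam+eggs&tag=%5B%27a%27%2C+%27b%27%5D
print(expanded)  # q=spam+eggs&tag=a&tag=b
```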
	259
	260
	261	.. function:: pathname2url(path)
	262
	263	Convert the pathname path from the local syntax for a path to the form used in
	264	the path component of a URL. This does not produce a complete URL. The return
	265	value will already be quoted using the :func:`quote` function.
	266
	267
	268	.. function:: url2pathname(path)
	269
[391]	270	Convert the path component path from a percent-encoded URL to the local syntax for a
[2]	271	path. This does not accept a complete URL. This function uses :func:`unquote`
	272	to decode path.
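
As a sketch, :func:`pathname2url` and :func:`url2pathname` can be used as a round trip on a relative path. The directory and file names are assumptions, and the fallback import only lets the snippet run on Python 3, where both functions live in :mod:`urllib.request`.

```python
import os

try:  # Python 2, the version this page documents
    from urllib import pathname2url, url2pathname
except ImportError:  # Python 3
    from urllib.request import pathname2url, url2pathname

# Hypothetical relative path, built with the local path separator.
local_path = os.path.join("reports", "my file.txt")

# The result is only the quoted path component, not a complete URL.
url_path = pathname2url(local_path)

# Converting back unquotes it and restores the local separator.
round_trip = url2pathname(url_path)

print(url_path)  # reports/my%20file.txt
```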
	273
	274
	275	.. function:: getproxies()
	276
	277	This helper function returns a dictionary of scheme to proxy server URL
[391]	278	mappings. It first scans the environment, on all operating systems, for
	279	variables named ``<scheme>_proxy`` (case-insensitively). If none are found, it
	280	looks up proxy information in the Mac OS X System Configuration on Mac OS X
	281	and in the Windows registry on Windows.
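
A small sketch of the environment lookup. The proxy URL is a placeholder, and setting the variable from inside the process is done here only to keep the example self-contained; normally it would be set in the shell. The fallback import only lets the snippet run on Python 3 too.

```python
import os

try:  # Python 2, the version this page documents
    from urllib import getproxies
except ImportError:  # Python 3
    from urllib.request import getproxies

# Placeholder proxy URL, set in-process purely for illustration.
os.environ["http_proxy"] = "http://proxy.example.com:3128"

# The mapping is keyed by scheme name, with the "_proxy" suffix removed.
proxies = getproxies()
http_proxy = proxies.get("http")
print(http_proxy)
```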
[2]	282
[391]	283	.. note::
	284	urllib also exposes certain utility functions, such as splittype, splithost and
	285	others, that split a URL into its components. It is recommended to use
	286	:mod:`urlparse` for parsing URLs rather than calling these functions directly.
	287	Python 3 does not expose these helper functions from the :mod:`urllib.parse`
	288	module.
[2]	289
[391]	290
[2]	291	URL Opener objects
	292	------------------
	293
	294	.. class:: URLopener([proxies[, **x509]])
	295
	296	Base class for opening and reading URLs. Unless you need to support opening
	297	objects using schemes other than :file:`http:`, :file:`ftp:`, or :file:`file:`,
	298	you probably want to use :class:`FancyURLopener`.
	299
	300	By default, the :class:`URLopener` class sends a :mailheader:`User-Agent` header
	301	of ``urllib/VVV``, where VVV is the :mod:`urllib` version number.
	302	Applications can define their own :mailheader:`User-Agent` header by subclassing
	303	:class:`URLopener` or :class:`FancyURLopener` and setting the class attribute
	304	:attr:`version` to an appropriate string value in the subclass definition.
	305
	306	The optional proxies parameter should be a dictionary mapping scheme names to
	307	proxy URLs, where an empty dictionary turns proxies off completely. Its default
	308	value is ``None``, in which case environmental proxy settings will be used if
	309	present, as discussed in the definition of :func:`urlopen`, above.
	310
	311	Additional keyword parameters, collected in x509, may be used for
	312	authentication of the client when using the :file:`https:` scheme. The keywords
	313	key_file and cert_file are supported to provide an SSL key and certificate;
	314	both are needed to support client authentication.
	315
	316	:class:`URLopener` objects will raise an :exc:`IOError` exception if the server
	317	returns an error code.
	318
	319	.. method:: open(fullurl[, data])
	320
	321	Open fullurl using the appropriate protocol. This method sets up cache and
	322	proxy information, then calls the appropriate open method with its input
	323	arguments. If the scheme is not recognized, :meth:`open_unknown` is called.
	324	The data argument has the same meaning as the data argument of
	325	:func:`urlopen`.
	326
	327
	328	.. method:: open_unknown(fullurl[, data])
	329
	330	Overridable interface to open unknown URL types.
	331
	332
	333	.. method:: retrieve(url[, filename[, reporthook[, data]]])
	334
	335	Retrieves the contents of url and places it in filename. The return value
	336	is a tuple consisting of a local filename and either a
	337	:class:`mimetools.Message` object containing the response headers (for remote
	338	URLs) or ``None`` (for local URLs). The caller must then open and read the
	339	contents of filename. If filename is not given and the URL refers to a
	340	local file, the input filename is returned. If the URL is non-local and
	341	filename is not given, the filename is the output of :func:`tempfile.mktemp`
	342	with a suffix that matches the suffix of the last path component of the input
	343	URL. If reporthook is given, it must be a function accepting three numeric
	344	parameters. It will be called after each chunk of data is read from the
	345	network. reporthook is ignored for local URLs.
	346
	347	If the url uses the :file:`http:` scheme identifier, the optional data
	348	argument may be given to specify a ``POST`` request (normally the request type
	349	is ``GET``). The data argument must be in standard
	350	:mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
	351	function below.
	352
	353
	354	.. attribute:: version
	355
	356	Variable that specifies the user agent of the opener object. To get
	357	:mod:`urllib` to tell servers that it is a particular user agent, set this in a
	358	subclass as a class variable or in the constructor before calling the base
	359	constructor.
	360
	361
	362	.. class:: FancyURLopener(...)
	363
	364	:class:`FancyURLopener` subclasses :class:`URLopener` providing default handling
	365	for the following HTTP response codes: 301, 302, 303, 307 and 401. For the 30x
	366	response codes listed above, the :mailheader:`Location` header is used to fetch
	367	the actual URL. For 401 response codes (authentication required), basic HTTP
	368	authentication is performed. For the 30x response codes, recursion is bounded
	369	by the value of the maxtries attribute, which defaults to 10.
	370
	371	For all other response codes, the method :meth:`http_error_default` is called
	372	which you can override in subclasses to handle the error appropriately.
	373
	374	.. note::
	375
	376	According to the letter of :rfc:`2616`, 301 and 302 responses to POST requests
	377	must not be automatically redirected without confirmation by the user. In
	378	reality, browsers do allow automatic redirection of these responses, changing
	379	the POST to a GET, and :mod:`urllib` reproduces this behaviour.
	380
	381	The parameters to the constructor are the same as those for :class:`URLopener`.
	382
	383	.. note::
	384
	385	When performing basic authentication, a :class:`FancyURLopener` instance calls
	386	its :meth:`prompt_user_passwd` method. The default implementation asks the
	387	users for the required information on the controlling terminal. A subclass may
	388	override this method to support more appropriate behavior if needed.
	389
	390	The :class:`FancyURLopener` class offers one additional method that should be
	391	overridden to provide the appropriate behavior:
	392
	393	.. method:: prompt_user_passwd(host, realm)
	394
	395	Return information needed to authenticate the user at the given host in the
	396	specified security realm. The return value should be a tuple, ``(user,
	397	password)``, which can be used for basic authentication.
	398
	399	The implementation prompts for this information on the terminal; an application
	400	should override this method to use an appropriate interaction model in the local
	401	environment.
	402
	403	.. exception:: ContentTooShortError(msg[, content])
	404
	405	This exception is raised when the :func:`urlretrieve` function detects that the
	406	amount of the downloaded data is less than the expected amount (given by the
	407	Content-Length header). The :attr:`content` attribute stores the downloaded
	408	(and supposedly truncated) data.
	409
	410	.. versionadded:: 2.5
	411
	412
	413	:mod:`urllib` Restrictions
	414	--------------------------
	415
	416	.. index::
	417	pair: HTTP; protocol
	418	pair: FTP; protocol
	419
	420	* Currently, only the following protocols are supported: HTTP (versions 0.9 and
	421	1.0), FTP, and local files.
	422
	423	* The caching feature of :func:`urlretrieve` has been disabled until I find the
	424	time to hack proper processing of Expiration time headers.
	425
	426	* There should be a function to query whether a particular URL is in the cache.
	427
	428	* For backward compatibility, if a URL appears to point to a local file but the
	429	file can't be opened, the URL is re-interpreted using the FTP protocol. This
	430	can sometimes cause confusing error messages.
	431
	432	* The :func:`urlopen` and :func:`urlretrieve` functions can cause arbitrarily
	433	long delays while waiting for a network connection to be set up. This means
	434	that it is difficult to build an interactive Web client using these functions
	435	without using threads.
	436
	437	.. index::
	438	single: HTML
	439	pair: HTTP; protocol
	440	module: htmllib
	441
	442	* The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data
	443	returned by the server. This may be binary data (such as an image), plain text
	444	or (for example) HTML. The HTTP protocol provides type information in the reply
	445	header, which can be inspected by looking at the :mailheader:`Content-Type`
	446	header. If the returned data is HTML, you can use the module :mod:`htmllib` to
	447	parse it.
	448
	449	.. index:: single: FTP
	450
	451	* The code handling the FTP protocol cannot differentiate between a file and a
	452	directory. This can lead to unexpected behavior when attempting to read a URL
	453	that points to a file that is not accessible. If the URL ends in a ``/``, it is
	454	assumed to refer to a directory and will be handled accordingly. But if an
	455	attempt to read a file leads to a 550 error (meaning the URL cannot be found or
	456	is not accessible, often for permission reasons), then the path is treated as a
	457	directory in order to handle the case when a directory is specified by a URL but
	458	the trailing ``/`` has been left off. This can cause misleading results when
	459	you try to fetch a file whose read permissions make it inaccessible; the FTP
	460	code will try to read it, fail with a 550 error, and then perform a directory
	461	listing for the unreadable file. If fine-grained control is needed, consider
[391]	462	using the :mod:`ftplib` module, subclassing :class:`FancyURLopener`, or changing
[2]	463	_urlopener to meet your needs.
	464
	465	* This module does not support the use of proxies which require authentication.
	466	This may be implemented in the future.
	467
	468	.. index:: module: urlparse
	469
	470	* Although the :mod:`urllib` module contains (undocumented) routines to parse
	471	and unparse URL strings, the recommended interface for URL manipulation is in
	472	module :mod:`urlparse`.
	473
	474
	475	.. _urllib-examples:
	476
	477	Examples
	478	--------
	479
	480	Here is an example session that uses the ``GET`` method to retrieve a URL
	481	containing parameters::
	482
	483	>>> import urllib
	484	>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
	485	>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
	486	>>> print f.read()
	487
	488	The following example uses the ``POST`` method instead::
	489
	490	>>> import urllib
	491	>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
	492	>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
	493	>>> print f.read()
	494
	495	The following example uses an explicitly specified HTTP proxy, overriding
	496	environment settings::
	497
	498	>>> import urllib
	499	>>> proxies = {'http': 'http://proxy.example.com:8080/'}
	500	>>> opener = urllib.FancyURLopener(proxies)
	501	>>> f = opener.open("http://www.python.org")
	502	>>> f.read()
	503
	504	The following example uses no proxies at all, overriding environment settings::
	505
	506	>>> import urllib
	507	>>> opener = urllib.FancyURLopener({})
	508	>>> f = opener.open("http://www.python.org/")
	509	>>> f.read()
	510


source: python/trunk/Doc/library/urllib.rst
