************************************************
HOWTO Fetch Internet Resources Using urllib2
************************************************

:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_

.. note::

    There is a French translation of an earlier revision of this
    HOWTO, available at `urllib2 - Le Manuel manquant
    <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.


Introduction
============

.. sidebar:: Related Articles

    You may also find useful the following article on fetching web resources
    with Python:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

      A tutorial on *Basic Authentication*, with examples in Python.

**urllib2** is a `Python <http://www.python.org>`_ module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.

urllib2 supports fetching URLs for many "URL schemes" (identified by the string
before the ":" in the URL - for example "ftp" is the URL scheme of
"ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.
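
The scheme is simply the text before the first colon. As an illustrative sketch
(ordinary string handling, not part of urllib2 - the function name here is
invented for this example), you could pull the scheme out like this:

```python
def url_scheme(url):
    # The scheme is everything before the first ":".
    # Returns None if the URL contains no colon at all.
    scheme, sep, _ = url.partition(':')
    if not sep:
        return None
    return scheme.lower()

url_scheme('ftp://python.org/')  # 'ftp'
```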

For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib2*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib2` docs, but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib2 is as follows::

    import urllib2
    response = urllib2.urlopen('http://python.org/')
    html = response.read()

Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.

HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib2 mirrors this with a ``Request`` object which represents
the HTTP request you are making. In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can for example call ``.read()`` on the
response::

    import urllib2

    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that urllib2 makes use of the same Request interface to handle all URL
schemes. For example, you can make an FTP request like so::

    req = urllib2.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server. Second, you can pass
extra information ("metadata") *about* the data or about the request itself, to
the server - this information is sent as HTTP "headers". Let's look at each of
these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script [#]_ or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit an HTML form that you filled in on the
web. Not all POSTs have to come from forms: you can use a POST to transmit
arbitrary data to your own application. In the common case of HTML forms, the
data needs to be encoded in a standard way, and then passed to the Request
object as the ``data`` argument. The encoding is done using a function from the
``urllib`` library, *not* from ``urllib2``. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).

If you do not pass the ``data`` argument, urllib2 uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door). Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib2
    >>> import urllib
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.urlencode(data)
    >>> print url_values  # The order may differ. #doctest: +SKIP
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib2.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.

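To take some of the mystery out of the encoding step, here is a deliberately
simplified sketch of what ``urlencode`` does for plain ASCII values. The name
``simple_urlencode`` is invented for this illustration, and the real
``urllib.urlencode`` also percent-encodes unsafe characters, which this sketch
does not:

```python
def simple_urlencode(data):
    # Build a 'key=value&key=value' query string.
    # Spaces become '+', as in the output shown above.
    # Keys are sorted here for a stable order; a real
    # dict's order may differ.
    pairs = []
    for key in sorted(data):
        value = data[key]
        pairs.append(key.replace(' ', '+') + '=' + value.replace(' ', '+'))
    return '&'.join(pairs)

query = simple_urlencode({'name': 'Somebody Here', 'language': 'Python'})
full_url = 'http://www.example.com/example.cgi' + '?' + query
```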
Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_ . By default urllib2 identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    the_page = response.read()

The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.

Handling Exceptions
===================

*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
usual with Python APIs, built-in exceptions such as :exc:`ValueError`,
:exc:`TypeError` etc. may also be raised).

:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case
of HTTP URLs.

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist. In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.

e.g. ::

    >>> req = urllib2.Request('http://www.pretend_server.org')
    >>> try: urllib2.urlopen(req)
    ... except URLError as e:
    ...     print e.reason  #doctest: +SKIP
    ...
    (4, 'getaddrinfo failed')


HTTPError
---------

Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib2 will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).

See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.

The :exc:`HTTPError` instance raised will have an integer 'code' attribute,
which corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100-299 range indicate success, you will usually only see error
codes in the 400-599 range.

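Those ranges can be captured in a small helper. This is an illustrative sketch
only - the function name is invented and nothing like it is part of urllib2:

```python
def code_category(code):
    # Map an HTTP status code to the range it falls in,
    # mirroring the description above.
    if 100 <= code < 300:
        return 'success'        # 1xx/2xx: no error raised
    if 300 <= code < 400:
        return 'redirection'    # usually followed for you
    if 400 <= code < 500:
        return 'client error'   # e.g. 404, 403, 401
    if 500 <= code < 600:
        return 'server error'   # e.g. 500, 503
    return 'unknown'

code_category(404)  # 'client error'
```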
``BaseHTTPServer.BaseHTTPRequestHandler.responses`` is a useful dictionary of
response codes that shows all the response codes used by :rfc:`2616`. The
dictionary is reproduced here for convenience ::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
    }

When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response
object for the page returned. This means that as well as the code attribute, it
also has read, geturl, and info methods. ::

    >>> req = urllib2.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib2.urlopen(req)
    ... except urllib2.HTTPError as e:
    ...     print e.code
    ...     print e.read()  #doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ...
    404
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    ...
    <title>Page Not Found</title>
    ...


Wrapping it Up
--------------

So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there
are two basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError as e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except URLError as e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        # everything is fine


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an :exc:`HTTPError`.

Number 2
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError as e:
        if hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
        elif hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
    else:
        # everything is fine


info and geturl
===============

The response returned by urlopen (or the :exc:`HTTPError` instance) has two
useful methods :meth:`info` and :meth:`geturl`.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
``httplib.HTTPMessage`` instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.


Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named :class:`urllib2.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call. ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication,
and other common but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.

Opener objects have an ``open`` method, which can be called directly to fetch
urls in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.


Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
including an explanation of how Basic Authentication works - see the `Basic
Authentication Tutorial
<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.

When authentication is required, the server sends a header (as well as the 401
error code) requesting authentication. This specifies the authentication scheme
and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
realm="REALM"``.

e.g. ::

    WWW-Authenticate: Basic realm="cPanel Users"

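As a sketch of how a client could pick the scheme and realm out of such a
header, the hypothetical helper below matches only the simple form shown above
(real headers can carry extra parameters that this pattern ignores, and urllib2
does this parsing for you internally):

```python
import re

def parse_auth_header(header):
    # Match the simple form: SCHEME realm="REALM".
    match = re.match(r'(\w+)\s+realm="([^"]*)"', header)
    if match is None:
        return None
    return match.group(1), match.group(2)

parse_auth_header('Basic realm="cPanel Users"')  # ('Basic', 'cPanel Users')
```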

The client should then retry the request with the appropriate name and password
for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance of
``HTTPBasicAuthHandler`` and an opener to use this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use a
``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL. This will be supplied
in the absence of you providing an alternative combination for a specific
realm. We indicate this by providing ``None`` as the realm argument to the
``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to .add_password() will also match. ::

    # create a password manager
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of None.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib2.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib2.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib2.urlopen use our opener.
    urllib2.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler`` to
    ``build_opener``. By default openers have the handlers for normal situations
    -- ``ProxyHandler`` (if a proxy setting such as an :envvar:`http_proxy`
    environment variable is set), ``UnknownHandler``, ``HTTPHandler``,
    ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
    ``FileHandler``, ``HTTPErrorProcessor``.

``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number)
e.g. "http://example.com/" *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. "example.com" or "example.com:8080"
(the latter example includes a port number). The authority, if present, must
NOT contain the "userinfo" component - for example "joe:password@example.com" is
not correct.

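To see the difference between the two forms, here is a small sketch that
reduces a full URL to its authority and rejects userinfo. The helper is
invented for illustration only and assumes an 'http://'-style URL:

```python
def authority_of(url):
    # Strip the scheme (everything up to '://') and the path,
    # leaving the host and optional port.
    rest = url.split('://', 1)[-1]
    authority = rest.split('/', 1)[0]
    # add_password accepts no "userinfo" before an '@'.
    if '@' in authority:
        raise ValueError('authority must not contain userinfo')
    return authority

authority_of('http://example.com:8080/foo/')  # 'example.com:8080'
```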

Proxies
=======

**urllib2** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler``, which is part of the normal handler chain when a proxy
setting is detected. Normally that's a good thing, but there are occasions
when it may not be helpful [#]_. One way to disable proxy handling is to set up
our own ``ProxyHandler``, with no proxies defined. This is done using similar
steps to setting up a `Basic Authentication`_ handler::

    >>> proxy_support = urllib2.ProxyHandler({})
    >>> opener = urllib2.build_opener(proxy_support)
    >>> urllib2.install_opener(opener)

.. note::

    Currently ``urllib2`` *does not* support fetching of ``https`` locations
    through a proxy. However, this can be enabled by extending urllib2 as
    shown in the recipe [#]_.


Sockets and Layers
==================

The Python support for fetching resources from the web is layered. urllib2 uses
the httplib library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang. Currently,
the socket timeout is not exposed at the httplib or urllib2 levels. However,
you can set the default timeout globally for all sockets using ::

    import socket
    import urllib2

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib2.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)


-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] For an introduction to the CGI protocol see
       `Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_.
.. [#] Like Google for example. The *proper* way to use Google from a program
       is to use `PyGoogle <http://pygoogle.sourceforge.net>`_ of course. See
       `Voidspace Google <http://www.voidspace.org.uk/python/recipebook.shtml#google>`_
       for some examples of using the Google API.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib2 picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib2 from using
       the proxy.
.. [#] urllib2 opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
       <http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/456195>`_.