:mod:`robotparser` --- Parser for robots.txt
=============================================

.. module:: robotparser
   :synopsis: Loads a robots.txt file and answers questions about
              fetchability of other URLs.
.. sectionauthor:: Skip Montanaro <skip@pobox.com>


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt

.. note::
   The :mod:`robotparser` module has been renamed :mod:`urllib.robotparser` in
   Python 3.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to Python 3.
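
For code that has to run unchanged on both Python 2 and Python 3, one common
idiom is to try the old module name first and fall back to the new one. The
following is only a minimal sketch of that idiom, not something the module
itself provides::

   try:
       import robotparser                  # Python 2 name
   except ImportError:
       from urllib import robotparser      # Python 3 name: urllib.robotparser

   # The class is used the same way under either name.
   rp = robotparser.RobotFileParser()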

This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether or not a particular user agent can fetch a URL on the
Web site that published the :file:`robots.txt` file. For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.


.. class:: RobotFileParser(url='')

   This class provides methods to read, parse and answer questions about the
   :file:`robots.txt` file at *url*.


   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.


   .. method:: read()

      Reads the :file:`robots.txt` URL and feeds it to the parser.


   .. method:: parse(lines)

      Parses the lines argument.


   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.
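
      When the rules are already available as a list of lines, they can be
      handed to :meth:`parse` directly, without fetching anything over the
      network. The sketch below assumes a small, made-up rule set and the
      placeholder host ``www.example.com``; neither comes from a real site::

         try:
             import robotparser                  # Python 2 name
         except ImportError:
             from urllib import robotparser      # Python 3 name

         # Hypothetical rules, used only for illustration.
         rules = [
             "User-agent: *",
             "Disallow: /cgi-bin/",
         ]

         rp = robotparser.RobotFileParser()
         rp.parse(rules)                         # feed the lines to the parser

         # Paths under /cgi-bin/ are disallowed for every user agent;
         # everything else is allowed.
         allowed_home = rp.can_fetch("*", "http://www.example.com/")
         allowed_cgi = rp.can_fetch("*", "http://www.example.com/cgi-bin/search")

      With these rules, ``allowed_home`` is ``True`` and ``allowed_cgi`` is
      ``False``.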


   .. method:: mtime()

      Returns the time the ``robots.txt`` file was last fetched. This is
      useful for long-running web spiders that need to check for new
      ``robots.txt`` files periodically.


   .. method:: modified()

      Sets the time the ``robots.txt`` file was last fetched to the current
      time.
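
      A long-running spider might combine :meth:`mtime` and :meth:`modified`
      to decide when its rules are stale. The following is a rough sketch
      only: ``MAX_AGE`` and ``needs_refresh`` are hypothetical names, and the
      sketch assumes the recorded time is in seconds since the epoch, as
      returned by :func:`time.time`::

         import time

         try:
             import robotparser                  # Python 2 name
         except ImportError:
             from urllib import robotparser      # Python 3 name

         MAX_AGE = 3600  # hypothetical refresh interval, in seconds

         def needs_refresh(rp, max_age=MAX_AGE):
             """Return True if the rules held by *rp* are older than *max_age*."""
             return time.time() - rp.mtime() > max_age

         rp = robotparser.RobotFileParser()
         stale_before = needs_refresh(rp)    # nothing recorded yet
         rp.modified()                       # record "fetched just now"
         stale_after = needs_refresh(rp)

      In a real spider, the call to :meth:`modified` would follow a successful
      :meth:`read` rather than stand alone as it does here.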

The following example demonstrates basic use of the RobotFileParser class. ::

   >>> import robotparser
   >>> rp = robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True