\section{\module{robotparser} ---
         Parser for robots.txt}

\declaremodule{standard}{robotparser}
\modulesynopsis{Loads a \protect\file{robots.txt} file and
                answers questions about fetchability of other URLs.}
\sectionauthor{Skip Montanaro}{skip@mojam.com}

\index{WWW}
\index{World Wide Web}
\index{URL}
\index{robots.txt}

This module provides a single class, \class{RobotFileParser}, which answers
questions about whether a particular user agent can fetch a URL on
the Web site that published the \file{robots.txt} file.  For more details on
the structure of \file{robots.txt} files, see
\url{http://www.robotstxt.org/wc/norobots.html}.

\begin{classdesc}{RobotFileParser}{}

This class provides a set of methods to read, parse and answer questions
about a single \file{robots.txt} file.

\begin{methoddesc}{set_url}{url}
Sets the URL referring to a \file{robots.txt} file.
\end{methoddesc}

\begin{methoddesc}{read}{}
Reads the \file{robots.txt} URL and feeds it to the parser.
\end{methoddesc}

\begin{methoddesc}{parse}{lines}
Parses \var{lines}, a list of lines from a \file{robots.txt} file.
\end{methoddesc}

\begin{methoddesc}{can_fetch}{useragent, url}
Returns \code{True} if the \var{useragent} is allowed to fetch the \var{url}
according to the rules contained in the parsed \file{robots.txt} file.
\end{methoddesc}

\begin{methoddesc}{mtime}{}
Returns the time the \file{robots.txt} file was last fetched.  This is
useful for long-running web spiders that need to check for new
\file{robots.txt} files periodically.
\end{methoddesc}

\begin{methoddesc}{modified}{}
Sets the time the \file{robots.txt} file was last fetched to the current
time.
\end{methoddesc}

\end{classdesc}

The following example demonstrates basic use of the \class{RobotFileParser}
class.

\begin{verbatim}
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True
\end{verbatim}
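
The \method{parse()} method also makes it possible to check rules without
any network access.  The following is a minimal sketch rather than part of
the module itself: the rule text and the \code{www.example.com} URLs are made
up for illustration, and the fallback import assumes that newer Python
versions provide the same class as \module{urllib.robotparser}.

```python
try:
    import robotparser                # module name as documented here
except ImportError:
    from urllib import robotparser    # assumed location on newer Pythons

# Hypothetical rules: every user agent is kept out of /cgi-bin/.
rules = """\
User-agent: *
Disallow: /cgi-bin/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)                       # feed the lines directly, no read()

# Ask about one URL under the disallowed prefix and one outside it.
blocked = rp.can_fetch(
    "*", "http://www.example.com/cgi-bin/search?city=San+Francisco")
allowed = rp.can_fetch("*", "http://www.example.com/")
```

With these rules, \code{blocked} should be \code{False} and \code{allowed}
should be \code{True}, which mirrors the results of the example above.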
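
For a long-running spider, \method{mtime()} and \method{modified()} can be
combined into a simple staleness check.  The sketch below is one possible
approach, not a recommended implementation: the helper \code{stale()} and its
one-hour default age are hypothetical, and the code assumes that
\method{mtime()} returns seconds since the epoch, as \code{time.time()}
does, and reports \code{0} before anything has been fetched.

```python
import time

try:
    import robotparser                # module name as documented here
except ImportError:
    from urllib import robotparser    # assumed location on newer Pythons

MAX_AGE = 3600  # hypothetical refresh interval, in seconds


def stale(rp, max_age=MAX_AGE, now=None):
    """Return True if rp's robots.txt was last fetched more than
    max_age seconds before now (hypothetical helper)."""
    if now is None:
        now = time.time()
    return now - rp.mtime() > max_age


rp = robotparser.RobotFileParser()
before = stale(rp)    # nothing fetched yet; assumes mtime() starts at 0

rp.modified()         # record the fetch time as "now"
after = stale(rp)     # just refreshed, so no longer stale
```

A spider would call \method{read()} again whenever \code{stale(rp)} is true.
Whether \method{read()} itself updates the timestamp is not stated above, so
calling \method{modified()} after a successful fetch is the cautious choice.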