#! /usr/bin/env python

# Original code by Guido van Rossum; extensive changes by Sam Bayer,
# including code to check URL fragments.

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors. A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.
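
For example, with the root URL http://www.example.com/docs/ (a
hypothetical site), the page http://www.example.com/docs/api/index.html
belongs to the subweb, while http://www.example.com/news/ does not.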

File URL extension:

In order to ease the checking of subwebs via the local file system,
the interpretation of ``file:'' URLs is extended to mimic the behavior
of your average HTTP daemon: if a directory pathname is given, the
file index.html in that directory is returned if it exists, otherwise
a directory listing is returned. Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked. In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).
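
For example, the URL file:/usr/local/etc/httpd/htdocs/manual/ (the
manual/ subdirectory is hypothetical) is answered with the file
index.html from that directory if it exists, and with a generated
directory listing otherwise.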

Report printed:

When done, it reports pages with bad links within the subweb. When
interrupted, it reports on the pages that it has already checked.

In verbose mode, additional messages are printed during the
information gathering phase. By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered. Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file and the -R option allows it to restart from the
checkpoint (assuming that the pages on the subweb that were already
processed haven't changed). Even when it has run till completion, -R
can still be useful -- it will print the reports again, and -Rq prints
the errors only. In this case, the checkpoint file is not written
again. The checkpoint file can be set with the -d option.

The checkpoint file is written as a Python pickle. Remember that
Python's pickle module is currently quite slow. Give it the time it
needs to load and save the checkpoint file. When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.

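For example, a typical checkpointed session might look like this (the
URL is hypothetical):

    webchecker.py -d mysite.pickle http://www.example.com/docs/
    webchecker.py -R -d mysite.pickle

The first command dumps its state to mysite.pickle; the second resumes
from that checkpoint, or just reprints the reports if the first run
had finished.
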
Miscellaneous:

- You may find the (Tk-based) GUI version easier to use. See wcgui.py.

- Webchecker honors the "robots.txt" convention. Thanks to Skip
Montanaro for his robotparser.py module (included in this directory)!
The agent name is hardwired to "webchecker". URLs that are disallowed
by the robots.txt file are reported as external URLs.

- Because the SGML parser is a bit slow, very large SGML files are
skipped. The size limit can be set with the -m option.

- When the server or protocol does not tell us a file's type, we guess
it based on the URL's suffix. The mimetypes.py module (also in this
directory) has a built-in table mapping most currently known suffixes,
and in addition attempts to read the mime.types configuration files in
the default locations of Netscape and the NCSA HTTP daemon.

- We follow links indicated by <A>, <FRAME> and <IMG> tags. We also
honor the <BASE> tag.

- We now check internal NAME anchor links, as well as toplevel links.

- Checking external links is now done by default; use -x to *disable*
this feature. External links are now checked during normal
processing. (XXX The status of a checked link could be categorized
better. Later...)

- If external links are not checked, you can use the -t flag to
provide specific overrides to -x, as in the example below.
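
For example, to skip external links in general but still treat one
extra tree as internal (hypothetical URLs):

    webchecker.py -x -t http://www.example.com/other/ http://www.example.com/docs/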

Usage: webchecker.py [option] ... [rooturl] ...

Options:

-R        -- restart from checkpoint file
-d file   -- checkpoint filename (default %(DUMPFILE)s)
-m bytes  -- skip HTML pages larger than this size (default %(MAXPAGE)d)
-n        -- reports only, no checking (use with -R)
-q        -- quiet operation (also suppresses external links report)
-r number -- number of links processed per round (default %(ROUNDSIZE)d)
-t root   -- specify root dir which should be treated as internal (can repeat)
-v        -- verbose operation; repeating -v will increase verbosity
-x        -- don't check external links (these are often slow to check)
-a        -- don't check name anchors

Arguments:

rooturl   -- URL to start checking
             (default %(DEFROOT)s)

"""


__version__ = "$Revision$"


import sys
import os
from types import *
import StringIO
import getopt
import pickle

import urllib
import urlparse
import sgmllib
import cgi

import mimetypes
import robotparser

# Extract real version number if necessary
if __version__[0] == '$':
    _v = __version__.split()
    if len(_v) == 3:
        __version__ = _v[1]
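# For example, once the CVS keyword has been expanded,
# "$Revision: 1.28 $" splits into three words and the middle word
# ("1.28"; the number here is made up) is kept as the version.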


# Tunable parameters
DEFROOT = "file:/usr/local/etc/httpd/htdocs/"   # Default root URL
CHECKEXT = 1                            # Check external references (1 deep)
VERBOSE = 1                             # Verbosity level (0-3)
MAXPAGE = 150000                        # Ignore files bigger than this
ROUNDSIZE = 50                          # Number of links processed per round
DUMPFILE = "@webchecker.pickle"         # Pickled checkpoint
AGENTNAME = "webchecker"                # Agent name for robots.txt parser
NONAMES = 0                             # Force name anchor checking


# Global variables


def main():
    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    dumpfile = DUMPFILE
    restart = 0
    norun = 0

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:t:vxa')
    except getopt.error, msg:
        sys.stdout = sys.stderr
        print msg
        print __doc__%globals()
        sys.exit(2)

    # The extra_roots list collects the extra roots given with -t.
    extra_roots = []
    nonames = NONAMES

    for o, a in opts:
        if o == '-R':
            restart = 1
        if o == '-d':
            dumpfile = a
        if o == '-m':
            maxpage = int(a)
        if o == '-n':
            norun = 1
        if o == '-q':
            verbose = 0
        if o == '-r':
            roundsize = int(a)
        if o == '-t':
            extra_roots.append(a)
        if o == '-a':
            nonames = not nonames
        if o == '-v':
            verbose = verbose + 1
        if o == '-x':
            checkext = not checkext

    if verbose > 0:
        print AGENTNAME, "version", __version__

    if restart:
        c = load_pickle(dumpfile=dumpfile, verbose=verbose)
    else:
        c = Checker()

    c.setflags(checkext=checkext, verbose=verbose,
               maxpage=maxpage, roundsize=roundsize,
               nonames=nonames
               )

    if not restart and not args:
        args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    # The -t flag is only needed if external links are not to be
    # checked, so -t values are ignored unless -x was also specified.
    if not checkext:
        for root in extra_roots:
            # Make sure it's terminated by a slash,
            # so that addroot doesn't discard the last
            # directory component.
            if root[-1] != "/":
                root = root + "/"
            c.addroot(root, add_to_do = 0)

    try:

        if not norun:
            try:
                c.run()
            except KeyboardInterrupt:
                if verbose > 0:
                    print "[run interrupted]"

        try:
            c.report()
        except KeyboardInterrupt:
            if verbose > 0:
                print "[report interrupted]"

    finally:
        if c.save_pickle(dumpfile):
            if dumpfile == DUMPFILE:
                print "Use ``%s -R'' to restart." % sys.argv[0]
            else:
                print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
                                                           dumpfile)


def load_pickle(dumpfile=DUMPFILE, verbose=VERBOSE):
    if verbose > 0:
        print "Loading checkpoint from %s ..." % dumpfile
    f = open(dumpfile, "rb")
    c = pickle.load(f)
    f.close()
    if verbose > 0:
        print "Done."
        print "Root:", "\n      ".join(c.roots)
    return c


class Checker:

    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    nonames = NONAMES

    validflags = tuple(dir())

    def __init__(self):
        self.reset()

    def setflags(self, **kw):
        for key in kw.keys():
            if key not in self.validflags:
                raise NameError, "invalid keyword argument: %s" % str(key)
        for key, value in kw.items():
            setattr(self, key, value)

    def reset(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.bad = {}

        # Add a name table, so that the name URLs can be checked. Also
        # serves as an implicit cache for which URLs are done.
        self.name_table = {}

        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.errors = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def note(self, level, format, *args):
        if self.verbose > level:
            if args:
                format = format%args
            self.message(format)

    def message(self, format, *args):
        if args:
            format = format%args
        print format

    def __getstate__(self):
        return (self.roots, self.todo, self.done, self.bad, self.round)

    def __setstate__(self, state):
        self.reset()
        (self.roots, self.todo, self.done, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)
        for url in self.bad.keys():
            self.markerror(url)

    def addroot(self, root, add_to_do = 1):
        if root not in self.roots:
            troot = root
            scheme, netloc, path, params, query, fragment = \
                    urlparse.urlparse(root)
            i = path.rfind("/") + 1
            if 0 < i < len(path):
                path = path[:i]
                troot = urlparse.urlunparse((scheme, netloc, path,
                                             params, query, fragment))
            self.roots.append(troot)
            self.addrobot(root)
            if add_to_do:
                self.newlink((root, ""), ("<root>", root))
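
    # For example (hypothetical URL), addroot("http://host/docs/intro.html")
    # records the trimmed root "http://host/docs/", so sibling pages under
    # docs/ count as internal, while the full URL itself is what gets queued.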

    def addrobot(self, root):
        root = urlparse.urljoin(root, "/")
        if self.robots.has_key(root): return
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        self.note(2, "Parsing %s", url)
        rp.debug = self.verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except (OSError, IOError), msg:
            self.note(1, "I/O error parsing %s: %s", url, msg)

    def run(self):
        while self.todo:
            self.round = self.round + 1
            self.note(0, "\nRound %d (%s)\n", self.round, self.status())
            urls = self.todo.keys()
            urls.sort()
            del urls[self.roundsize:]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "%d total, %d to do, %d done, %d bad" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.bad))

    def report(self):
        self.message("")
        if not self.todo: s = "Final"
        else: s = "Interim"
        self.message("%s Report (%s)", s, self.status())
        self.report_errors()

    def report_errors(self):
        if not self.bad:
            self.message("\nNo errors")
            return
        self.message("\nError Report:")
        sources = self.errors.keys()
        sources.sort()
        for source in sources:
            triples = self.errors[source]
            self.message("")
            if len(triples) > 1:
                self.message("%d Errors in %s", len(triples), source)
            else:
                self.message("Error in %s", source)
            # Call self.format_url() instead of referring
            # to the URL directly, since the URLs in these
            # triples are now (URL, fragment) pairs. The value
            # of the "source" variable comes from the list of
            # origins, and is a URL, not a pair.
            for url, rawlink, msg in triples:
                if rawlink != self.format_url(url): s = " (%s)" % rawlink
                else: s = ""
                self.message("  HREF %s%s\n    msg %s",
                             self.format_url(url), s, msg)

    def dopage(self, url_pair):

        # All printing of URLs uses format_url(); argument changed to
        # url_pair for clarity.
        if self.verbose > 1:
            if self.verbose > 2:
                self.show("Check ", self.format_url(url_pair),
                          "  from", self.todo[url_pair])
            else:
                self.message("Check %s", self.format_url(url_pair))
        url, local_fragment = url_pair
        if local_fragment and self.nonames:
            self.markdone(url_pair)
            return
        try:
            page = self.getpage(url_pair)
        except sgmllib.SGMLParseError, msg:
            msg = self.sanitize(msg)
            self.note(0, "Error parsing %s: %s",
                      self.format_url(url_pair), msg)
            # Don't actually mark the URL as bad -- it exists, we
            # just can't parse it!
            page = None
        if page:
            # Store the page which corresponds to this URL.
            self.name_table[url] = page
            # If there is a fragment in this url_pair, and it's not
            # in the list of names for the page, call setbad(), since
            # it's a missing anchor.
            if local_fragment and local_fragment not in page.getnames():
                self.setbad(url_pair, ("Missing name anchor `%s'" % local_fragment))
            for info in page.getlinkinfos():
                # getlinkinfos() now returns the fragment as well,
                # and we store that fragment here in the "todo" dictionary.
                link, rawlink, fragment = info
                # However, we don't want the fragment as the origin, since
                # the origin is logically a page.
                origin = url, rawlink
                self.newlink((link, fragment), origin)
        else:
            # If no page has been created yet, we want to
            # record that fact.
            self.name_table[url_pair[0]] = None
        self.markdone(url_pair)

    def newlink(self, url, origin):
        if self.done.has_key(url):
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        if origin not in self.done[url]:
            self.done[url].append(origin)

        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        self.note(3, "  Done link %s", self.format_url(url))

        # If the link is bad, make sure the origin gets added to its
        # error report.
        if self.bad.has_key(url):
            source, rawlink = origin
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def newtodolink(self, url, origin):
        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        if self.todo.has_key(url):
            if origin not in self.todo[url]:
                self.todo[url].append(origin)
            self.note(3, "  Seen todo link %s", self.format_url(url))
        else:
            self.todo[url] = [origin]
            self.note(3, "  New todo link %s", self.format_url(url))

    def format_url(self, url):
        link, fragment = url
        if fragment: return link + "#" + fragment
        else: return link

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

    def inroots(self, url):
        for root in self.roots:
            if url[:len(root)] == root:
                return self.isallowed(root, url)
        return 0
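
    # For example (hypothetical URLs), with the root "http://host/docs/" the
    # URL "http://host/docs/api/x.html" passes the prefix test above and is
    # then still subject to that host's robots.txt via isallowed() below.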

    def isallowed(self, root, url):
        root = urlparse.urljoin(root, "/")
        return self.robots[root].can_fetch(AGENTNAME, url)

    def getpage(self, url_pair):
        # Incoming argument name is a (URL, fragment) pair.
        # The page may have been cached in the name_table variable.
        url, fragment = url_pair
        if self.name_table.has_key(url):
            return self.name_table[url]

        scheme, path = urllib.splittype(url)
        if scheme in ('mailto', 'news', 'javascript', 'telnet'):
            self.note(1, " Not checking %s URL" % scheme)
            return None
        isint = self.inroots(url)

        # Ensure that openpage gets the URL pair to
        # print out its error message and record the error pair
        # correctly.
        if not isint:
            if not self.checkext:
                self.note(1, " Not checking ext link")
                return None
            f = self.openpage(url_pair)
            if f:
                self.safeclose(f)
            return None
        text, nurl = self.readhtml(url_pair)

        if nurl != url:
            self.note(1, " Redirected to %s", nurl)
            url = nurl
        if text:
            return Page(text, url, maxpage=self.maxpage, checker=self)

    # These next three functions take (URL, fragment) pairs as
    # arguments, so that openpage() receives the appropriate tuple to
    # record error messages.
    def readhtml(self, url_pair):
        url, fragment = url_pair
        text = None
        f, url = self.openhtml(url_pair)
        if f:
            text = f.read()
            f.close()
        return text, url

    def openhtml(self, url_pair):
        url, fragment = url_pair
        f = self.openpage(url_pair)
        if f:
            url = f.geturl()
            info = f.info()
            if not self.checkforhtml(info, url):
                self.safeclose(f)
                f = None
        return f, url

    def openpage(self, url_pair):
        url, fragment = url_pair
        try:
            return self.urlopener.open(url)
        except (OSError, IOError), msg:
            msg = self.sanitize(msg)
            self.note(0, "Error %s", msg)
            if self.verbose > 0:
                self.show(" HREF ", url, "  from", self.todo[url_pair])
            self.setbad(url_pair, msg)
            return None

    def checkforhtml(self, info, url):
        if info.has_key('content-type'):
            ctype = cgi.parse_header(info['content-type'])[0].lower()
            if ';' in ctype:
                # handle content-type: text/html; charset=iso8859-1 :
                ctype = ctype.split(';', 1)[0].strip()
        else:
            if url[-1:] == "/":
                return 1
            ctype, encoding = mimetypes.guess_type(url)
        if ctype == 'text/html':
            return 1
        else:
            self.note(1, " Not HTML, mime type %s", ctype)
            return 0

    def setgood(self, url):
        if self.bad.has_key(url):
            del self.bad[url]
            self.changed = 1
            self.note(0, "(Clear previously seen error)")

    def setbad(self, url, msg):
        if self.bad.has_key(url) and self.bad[url] == msg:
            self.note(0, "(Seen this error before)")
            return
        self.bad[url] = msg
        self.changed = 1
        self.markerror(url)

    def markerror(self, url):
        try:
            origins = self.todo[url]
        except KeyError:
            origins = self.done[url]
        for source, rawlink in origins:
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def seterror(self, url, triple):
        try:
            # Because of the way the URLs are now processed, we need to
            # check that the URL hasn't already been entered in the
            # error list. The first element of the triple here is a
            # (URL, fragment) pair, but the URL key is not, since it's
            # from the list of origins.
            if triple not in self.errors[url]:
                self.errors[url].append(triple)
        except KeyError:
            self.errors[url] = [triple]

    # The following used to be toplevel functions; they have been
    # changed into methods so they can be overridden in subclasses.

    def show(self, p1, link, p2, origins):
        self.message("%s %s", p1, link)
        i = 0
        for source, rawlink in origins:
            i = i+1
            if i == 2:
                p2 = ' '*len(p2)
            if rawlink != link: s = " (%s)" % rawlink
            else: s = ""
            self.message("%s %s%s", p2, source, s)

    def sanitize(self, msg):
        if isinstance(IOError, ClassType) and isinstance(msg, IOError):
            # Do the other branch recursively
            msg.args = self.sanitize(msg.args)
        elif isinstance(msg, TupleType):
            if len(msg) >= 4 and msg[0] == 'http error' and \
               isinstance(msg[3], InstanceType):
                # Remove the Message instance -- it may contain
                # a file object which prevents pickling.
                msg = msg[:3] + msg[4:]
        return msg

    def safeclose(self, f):
        try:
            url = f.geturl()
        except AttributeError:
            pass
        else:
            if url[:4] == 'ftp:' or url[:7] == 'file://':
                # Apparently ftp connections don't like to be closed
                # prematurely...
                text = f.read()
        f.close()

    def save_pickle(self, dumpfile=DUMPFILE):
        if not self.changed:
            self.note(0, "\nNo need to save checkpoint")
        elif not dumpfile:
            self.note(0, "No dumpfile, won't save checkpoint")
        else:
            self.note(0, "\nSaving checkpoint to %s ...", dumpfile)
            newfile = dumpfile + ".new"
            f = open(newfile, "wb")
            pickle.dump(self, f)
            f.close()
            try:
                os.unlink(dumpfile)
            except os.error:
                pass
            os.rename(newfile, dumpfile)
            self.note(0, "Done.")
            return 1


class Page:

    def __init__(self, text, url, verbose=VERBOSE, maxpage=MAXPAGE, checker=None):
        self.text = text
        self.url = url
        self.verbose = verbose
        self.maxpage = maxpage
        self.checker = checker

        # The page is parsed in __init__() so that the list of names
        # the file contains is initialized up front; the parser is
        # stored in an instance variable and the URL is passed on to
        # MyHTMLParser().
        size = len(self.text)
        if size > self.maxpage:
            self.note(0, "Skip huge file %s (%.0f Kbytes)", self.url, (size*0.001))
            self.parser = None
            return
        # Use self.note() rather than self.checker.note() directly, so
        # that a Page constructed without a checker doesn't crash here.
        self.note(2, "  Parsing %s (%d bytes)", self.url, size)
        self.parser = MyHTMLParser(url, verbose=self.verbose,
                                   checker=self.checker)
        self.parser.feed(self.text)
        self.parser.close()

    def note(self, level, msg, *args):
        if self.checker:
            apply(self.checker.note, (level, msg) + args)
        else:
            if self.verbose >= level:
                if args:
                    msg = msg%args
                print msg

    # Method to retrieve names.
    def getnames(self):
        if self.parser:
            return self.parser.names
        else:
            return []

    def getlinkinfos(self):
        # File reading is done in the __init__() routine; the stored
        # parser (or None) indicates whether parsing succeeded.

        # If no parser was stored, fail.
        if not self.parser: return []

        rawlinks = self.parser.getlinks()
        base = urlparse.urljoin(self.url, self.parser.getbase() or "")
        infos = []
        for rawlink in rawlinks:
            t = urlparse.urlparse(rawlink)
            # DON'T DISCARD THE FRAGMENT! Instead, include
            # it in the tuples which are returned. See Checker.dopage().
            fragment = t[-1]
            t = t[:-1] + ('',)
            rawlink = urlparse.urlunparse(t)
            link = urlparse.urljoin(base, rawlink)
            infos.append((link, rawlink, fragment))

        return infos
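
    # For example (hypothetical URLs): with base "http://host/doc/", a raw
    # link "ref.html#sec2" yields the info tuple
    # ("http://host/doc/ref.html", "ref.html", "sec2").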


class MyStringIO(StringIO.StringIO):

    def __init__(self, url, info):
        self.__url = url
        self.__info = info
        StringIO.StringIO.__init__(self)

    def info(self):
        return self.__info

    def geturl(self):
        return self.__url


class MyURLopener(urllib.FancyURLopener):

    http_error_default = urllib.URLopener.http_error_default

    def __init__(*args):
        self = args[0]
        apply(urllib.FancyURLopener.__init__, args)
        self.addheaders = [
            ('User-agent', 'Python-webchecker/%s' % __version__),
            ]

    def http_error_401(self, url, fp, errcode, errmsg, headers):
        return None

    def open_file(self, url):
        path = urllib.url2pathname(urllib.unquote(url))
        if os.path.isdir(path):
            if path[-1] != os.sep:
                url = url + '/'
            indexpath = os.path.join(path, "index.html")
            if os.path.exists(indexpath):
                return self.open_file(url + "index.html")
            try:
                names = os.listdir(path)
            except os.error, msg:
                exc_type, exc_value, exc_tb = sys.exc_info()
                raise IOError, msg, exc_tb
            names.sort()
            s = MyStringIO("file:"+url, {'content-type': 'text/html'})
            s.write('<BASE HREF="file:%s">\n' %
                    urllib.quote(os.path.join(path, "")))
            for name in names:
                q = urllib.quote(name)
                s.write('<A HREF="%s">%s</A>\n' % (q, q))
            s.seek(0)
            return s
        return urllib.FancyURLopener.open_file(self, url)
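
    # For example, a directory /htdocs/pics/ holding a.gif and b.gif (all
    # names hypothetical) is served by open_file() above as:
    #
    #     <BASE HREF="file:/htdocs/pics/">
    #     <A HREF="a.gif">a.gif</A>
    #     <A HREF="b.gif">b.gif</A>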


class MyHTMLParser(sgmllib.SGMLParser):

    def __init__(self, url, verbose=VERBOSE, checker=None):
        self.myverbose = verbose # now unused
        self.checker = checker
        self.base = None
        self.links = {}
        self.names = []
        self.url = url
        sgmllib.SGMLParser.__init__(self)

    def check_name_id(self, attributes):
        """ Check the name or id attributes on an element.
        """
        # We must rescue the NAME or id (name is deprecated in XHTML)
        # attributes from the anchor, in order to
        # cache the internal anchors which are made
        # available in the page.
        for name, value in attributes:
            if name == "name" or name == "id":
                if value in self.names:
                    # Guard against a missing checker (it defaults to None).
                    if self.checker:
                        self.checker.message(
                            "WARNING: duplicate ID name %s in %s",
                            value, self.url)
                else: self.names.append(value)
                break

    def unknown_starttag(self, tag, attributes):
        """ In XHTML, you can have id attributes on any element.
        """
        self.check_name_id(attributes)

    def start_a(self, attributes):
        self.link_attr(attributes, 'href')
        self.check_name_id(attributes)

    def end_a(self): pass

    def do_area(self, attributes):
        self.link_attr(attributes, 'href')
        self.check_name_id(attributes)

    def do_body(self, attributes):
        self.link_attr(attributes, 'background', 'bgsound')
        self.check_name_id(attributes)

    def do_img(self, attributes):
        self.link_attr(attributes, 'src', 'lowsrc')
        self.check_name_id(attributes)

    def do_frame(self, attributes):
        self.link_attr(attributes, 'src', 'longdesc')
        self.check_name_id(attributes)

    def do_iframe(self, attributes):
        self.link_attr(attributes, 'src', 'longdesc')
        self.check_name_id(attributes)

    def do_link(self, attributes):
        for name, value in attributes:
            if name == "rel":
                parts = value.lower().split()
                if (parts == ["stylesheet"]
                        or parts == ["alternate", "stylesheet"]):
                    self.link_attr(attributes, "href")
                break
        self.check_name_id(attributes)

    def do_object(self, attributes):
        self.link_attr(attributes, 'data', 'usemap')
        self.check_name_id(attributes)

    def do_script(self, attributes):
        self.link_attr(attributes, 'src')
        self.check_name_id(attributes)

    def do_table(self, attributes):
        self.link_attr(attributes, 'background')
        self.check_name_id(attributes)

    def do_td(self, attributes):
        self.link_attr(attributes, 'background')
        self.check_name_id(attributes)

    def do_th(self, attributes):
        self.link_attr(attributes, 'background')
        self.check_name_id(attributes)

    def do_tr(self, attributes):
        self.link_attr(attributes, 'background')
        self.check_name_id(attributes)

    def link_attr(self, attributes, *args):
        for name, value in attributes:
            if name in args:
                if value: value = value.strip()
                if value: self.links[value] = None

    def do_base(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = value.strip()
                if value:
                    if self.checker:
                        self.checker.note(1, "  Base %s", value)
                    self.base = value
        self.check_name_id(attributes)

    def getlinks(self):
        return self.links.keys()

    def getbase(self):
        return self.base

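# A minimal programmatic sketch of the flow that main() drives above
# (the root URL is hypothetical):
#
#     c = Checker()
#     c.setflags(verbose=0, checkext=0)
#     c.addroot("http://www.example.com/docs/")
#     c.run()
#     c.report()
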
if __name__ == '__main__':
    main()