#! /usr/bin/env python

# Original code by Guido van Rossum; extensive changes by Sam Bayer,
# including code to check URL fragments.

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors.  A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.

File URL extension:

In order to ease the checking of subwebs via the local file system,
the interpretation of ``file:'' URLs is extended to mimic the behavior
of your average HTTP daemon: if a directory pathname is given, the
file index.html in that directory is returned if it exists, otherwise
a directory listing is returned.  Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked.  In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).

Report printed:

When done, it reports pages with bad links within the subweb.  When
interrupted, it reports on the pages that it has checked already.

In verbose mode, additional messages are printed during the
information gathering phase.  By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered.  Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file, and the -R option allows it to restart from that
checkpoint (assuming that the pages on the subweb that were already
processed haven't changed).  Even when it has run till completion, -R
can still be useful -- it will print the reports again, and -Rq prints
the errors only.  In this case, the checkpoint file is not written
again.  The checkpoint file can be set with the -d option.

The checkpoint file is written as a Python pickle.  Remember that
Python's pickle module is currently quite slow.  Give it the time it
needs to load and save the checkpoint file.  When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.

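For example (hypothetical checkpoint file name and URL):

    webchecker.py -d mysite.pickle http://www.example.com/
    webchecker.py -R -d mysite.pickle

The first run saves its state to mysite.pickle; the second restarts
from that checkpoint and prints the reports again.
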
Miscellaneous:

- You may find the (Tk-based) GUI version easier to use.  See wcgui.py.

- Webchecker honors the "robots.txt" convention.  Thanks to Skip
Montanaro for his robotparser.py module (included in this directory)!
The agent name is hardwired to "webchecker".  URLs that are disallowed
by the robots.txt file are reported as external URLs.

- Because the SGML parser is a bit slow, very large SGML files are
skipped.  The size limit can be set with the -m option.

- When the server or protocol does not tell us a file's type, we guess
it based on the URL's suffix.  The mimetypes.py module (also in this
directory) has a built-in table mapping most currently known suffixes,
and in addition attempts to read the mime.types configuration files in
the default locations of Netscape and the NCSA HTTP daemon.

- We follow links indicated by <A>, <FRAME> and <IMG> tags.  We also
honor the <BASE> tag.

- We now check internal NAME anchor links, as well as toplevel links.

- Checking external links is now done by default; use -x to *disable*
this feature.  External links are now checked during normal
processing.  (XXX The status of a checked link could be categorized
better.  Later...)

- If external links are not checked, you can use the -t flag to
provide specific overrides to -x.

Usage: webchecker.py [option] ... [rooturl] ...

Options:

-R        -- restart from checkpoint file
-d file   -- checkpoint filename (default %(DUMPFILE)s)
-m bytes  -- skip HTML pages larger than this size (default %(MAXPAGE)d)
-n        -- reports only, no checking (use with -R)
-q        -- quiet operation (also suppresses external links report)
-r number -- number of links processed per round (default %(ROUNDSIZE)d)
-t root   -- specify root dir which should be treated as internal (can repeat)
-v        -- verbose operation; repeating -v will increase verbosity
-x        -- don't check external links (these are often slow to check)
-a        -- don't check name anchors

Arguments:

rooturl   -- URL to start checking
             (default %(DEFROOT)s)

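Example (hypothetical host and paths, for illustration only):

    webchecker.py -x -t http://www.example.com/extra/ http://www.example.com/docs/

This checks the subweb rooted at http://www.example.com/docs/ without
following external links, while still treating pages under /extra/ as
internal.
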
| 103 | """ | 
|---|
| 104 |  | 
|---|
| 105 |  | 
|---|
| 106 | __version__ = "$Revision$" | 
|---|
| 107 |  | 
|---|
| 108 |  | 
|---|
| 109 | import sys | 
|---|
| 110 | import os | 
|---|
| 111 | from types import * | 
|---|
| 112 | import StringIO | 
|---|
| 113 | import getopt | 
|---|
| 114 | import pickle | 
|---|
| 115 |  | 
|---|
| 116 | import urllib | 
|---|
| 117 | import urlparse | 
|---|
| 118 | import sgmllib | 
|---|
| 119 | import cgi | 
|---|
| 120 |  | 
|---|
| 121 | import mimetypes | 
|---|
| 122 | import robotparser | 
|---|
| 123 |  | 
|---|
| 124 | # Extract real version number if necessary | 
|---|
| 125 | if __version__[0] == '$': | 
|---|
| 126 | _v = __version__.split() | 
|---|
| 127 | if len(_v) == 3: | 
|---|
| 128 | __version__ = _v[1] | 
|---|
| 129 |  | 
|---|
| 130 |  | 
|---|
| 131 | # Tunable parameters | 
|---|
| 132 | DEFROOT = "file:/usr/local/etc/httpd/htdocs/"   # Default root URL | 
|---|
| 133 | CHECKEXT = 1                            # Check external references (1 deep) | 
|---|
| 134 | VERBOSE = 1                             # Verbosity level (0-3) | 
|---|
| 135 | MAXPAGE = 150000                        # Ignore files bigger than this | 
|---|
| 136 | ROUNDSIZE = 50                          # Number of links processed per round | 
|---|
| 137 | DUMPFILE = "@webchecker.pickle"         # Pickled checkpoint | 
|---|
| 138 | AGENTNAME = "webchecker"                # Agent name for robots.txt parser | 
|---|
| 139 | NONAMES = 0                             # Force name anchor checking | 
|---|
| 140 |  | 
|---|
| 141 |  | 
|---|
| 142 | # Global variables | 
|---|
| 143 |  | 
|---|
| 144 |  | 
|---|
| 145 | def main(): | 
|---|
| 146 | checkext = CHECKEXT | 
|---|
| 147 | verbose = VERBOSE | 
|---|
| 148 | maxpage = MAXPAGE | 
|---|
| 149 | roundsize = ROUNDSIZE | 
|---|
| 150 | dumpfile = DUMPFILE | 
|---|
| 151 | restart = 0 | 
|---|
| 152 | norun = 0 | 
|---|
| 153 |  | 
|---|
| 154 | try: | 
|---|
| 155 | opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:t:vxa') | 
|---|
| 156 | except getopt.error, msg: | 
|---|
| 157 | sys.stdout = sys.stderr | 
|---|
| 158 | print msg | 
|---|
| 159 | print __doc__%globals() | 
|---|
| 160 | sys.exit(2) | 
|---|
| 161 |  | 
|---|
| 162 | # The extra_roots variable collects extra roots. | 
|---|
| 163 | extra_roots = [] | 
|---|
| 164 | nonames = NONAMES | 
|---|
| 165 |  | 
|---|
| 166 | for o, a in opts: | 
|---|
| 167 | if o == '-R': | 
|---|
| 168 | restart = 1 | 
|---|
| 169 | if o == '-d': | 
|---|
| 170 | dumpfile = a | 
|---|
| 171 | if o == '-m': | 
|---|
| 172 | maxpage = int(a) | 
|---|
| 173 | if o == '-n': | 
|---|
| 174 | norun = 1 | 
|---|
| 175 | if o == '-q': | 
|---|
| 176 | verbose = 0 | 
|---|
| 177 | if o == '-r': | 
|---|
| 178 | roundsize = int(a) | 
|---|
| 179 | if o == '-t': | 
|---|
| 180 | extra_roots.append(a) | 
|---|
| 181 | if o == '-a': | 
|---|
| 182 | nonames = not nonames | 
|---|
| 183 | if o == '-v': | 
|---|
| 184 | verbose = verbose + 1 | 
|---|
| 185 | if o == '-x': | 
|---|
| 186 | checkext = not checkext | 
|---|
| 187 |  | 
|---|
| 188 | if verbose > 0: | 
|---|
| 189 | print AGENTNAME, "version", __version__ | 
|---|
| 190 |  | 
|---|
| 191 | if restart: | 
|---|
| 192 | c = load_pickle(dumpfile=dumpfile, verbose=verbose) | 
|---|
| 193 | else: | 
|---|
| 194 | c = Checker() | 
|---|
| 195 |  | 
|---|
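    # Note: the flag values live as class attributes on Checker and are
    # not part of the pickled state, so they are (re)applied here even
    # when restarting from a checkpoint.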
    c.setflags(checkext=checkext, verbose=verbose,
               maxpage=maxpage, roundsize=roundsize,
               nonames=nonames
               )

    if not restart and not args:
        args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    # The -t flag is only needed if external links are not to be
    # checked. So -t values are ignored unless -x was specified.
    if not checkext:
        for root in extra_roots:
            # Make sure it's terminated by a slash,
            # so that addroot doesn't discard the last
            # directory component.
            if root[-1] != "/":
                root = root + "/"
            c.addroot(root, add_to_do = 0)

    try:

        if not norun:
            try:
                c.run()
            except KeyboardInterrupt:
                if verbose > 0:
                    print "[run interrupted]"

        try:
            c.report()
        except KeyboardInterrupt:
            if verbose > 0:
                print "[report interrupted]"

    finally:
        if c.save_pickle(dumpfile):
            if dumpfile == DUMPFILE:
                print "Use ``%s -R'' to restart." % sys.argv[0]
            else:
                print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
                                                           dumpfile)


def load_pickle(dumpfile=DUMPFILE, verbose=VERBOSE):
    if verbose > 0:
        print "Loading checkpoint from %s ..." % dumpfile
    f = open(dumpfile, "rb")
    c = pickle.load(f)
    f.close()
    if verbose > 0:
        print "Done."
        print "Root:", "\n      ".join(c.roots)
    return c

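# The Checker class below can also be driven programmatically rather
# than through main().  A minimal sketch (the URL is a placeholder):
#
#     c = Checker()
#     c.setflags(verbose=0, checkext=0)
#     c.addroot("http://www.example.com/docs/")
#     c.run()
#     c.report()
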
class Checker:

    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    nonames = NONAMES

    validflags = tuple(dir())
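    # Note: dir() in a class body lists the names bound so far, so
    # validflags captures exactly the flag names above; setflags()
    # uses it to reject unknown keyword arguments.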

    def __init__(self):
        self.reset()

    def setflags(self, **kw):
        for key in kw.keys():
            if key not in self.validflags:
                raise NameError, "invalid keyword argument: %s" % str(key)
        for key, value in kw.items():
            setattr(self, key, value)

    def reset(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.bad = {}

        # Add a name table, so that the name URLs can be checked. Also
        # serves as an implicit cache for which URLs are done.
        self.name_table = {}

        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.errors = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def note(self, level, format, *args):
        if self.verbose > level:
            if args:
                format = format%args
            self.message(format)

    def message(self, format, *args):
        if args:
            format = format%args
        print format

    def __getstate__(self):
        return (self.roots, self.todo, self.done, self.bad, self.round)

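    # Only the five fields above are saved in a checkpoint; everything
    # else (robots, errors, the URL opener) is rebuilt by __setstate__()
    # below via reset() and addrobot().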
    def __setstate__(self, state):
        self.reset()
        (self.roots, self.todo, self.done, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)
        for url in self.bad.keys():
            self.markerror(url)

    def addroot(self, root, add_to_do = 1):
        if root not in self.roots:
            troot = root
            scheme, netloc, path, params, query, fragment = \
                    urlparse.urlparse(root)
            i = path.rfind("/") + 1
            if 0 < i < len(path):
                path = path[:i]
                troot = urlparse.urlunparse((scheme, netloc, path,
                                             params, query, fragment))
            self.roots.append(troot)
            self.addrobot(root)
            if add_to_do:
                self.newlink((root, ""), ("<root>", root))

    def addrobot(self, root):
        root = urlparse.urljoin(root, "/")
        if self.robots.has_key(root): return
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        self.note(2, "Parsing %s", url)
        rp.debug = self.verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except (OSError, IOError), msg:
            self.note(1, "I/O error parsing %s: %s", url, msg)

    def run(self):
        while self.todo:
            self.round = self.round + 1
            self.note(0, "\nRound %d (%s)\n", self.round, self.status())
            urls = self.todo.keys()
            urls.sort()
            del urls[self.roundsize:]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "%d total, %d to do, %d done, %d bad" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.bad))

    def report(self):
        self.message("")
        if not self.todo: s = "Final"
        else: s = "Interim"
        self.message("%s Report (%s)", s, self.status())
        self.report_errors()

    def report_errors(self):
        if not self.bad:
            self.message("\nNo errors")
            return
        self.message("\nError Report:")
        sources = self.errors.keys()
        sources.sort()
        for source in sources:
            triples = self.errors[source]
            self.message("")
            if len(triples) > 1:
                self.message("%d Errors in %s", len(triples), source)
            else:
                self.message("Error in %s", source)
            # Call self.format_url() instead of referring
            # to the URL directly, since the URLs in these
            # triples are now (URL, fragment) pairs.  The value
            # of the "source" variable comes from the list of
            # origins, and is a URL, not a pair.
            for url, rawlink, msg in triples:
                if rawlink != self.format_url(url): s = " (%s)" % rawlink
                else: s = ""
                self.message("  HREF %s%s\n    msg %s",
                             self.format_url(url), s, msg)

    def dopage(self, url_pair):

        # All printing of URLs uses format_url(); argument changed to
        # url_pair for clarity.
        if self.verbose > 1:
            if self.verbose > 2:
                self.show("Check ", self.format_url(url_pair),
                          "  from", self.todo[url_pair])
            else:
                self.message("Check %s", self.format_url(url_pair))
        url, local_fragment = url_pair
        if local_fragment and self.nonames:
            self.markdone(url_pair)
            return
        try:
            page = self.getpage(url_pair)
        except sgmllib.SGMLParseError, msg:
            msg = self.sanitize(msg)
            self.note(0, "Error parsing %s: %s",
                      self.format_url(url_pair), msg)
            # Don't actually mark the URL as bad - it exists, we just
            # can't parse it!
            page = None
        if page:
            # Store the page which corresponds to this URL.
            self.name_table[url] = page
            # If there is a fragment in this url_pair, and it's not
            # in the list of names for the page, call setbad(), since
            # it's a missing anchor.
            if local_fragment and local_fragment not in page.getnames():
                self.setbad(url_pair, ("Missing name anchor `%s'" % local_fragment))
            for info in page.getlinkinfos():
                # getlinkinfos() now returns the fragment as well,
                # and we store that fragment here in the "todo" dictionary.
                link, rawlink, fragment = info
                # However, we don't want the fragment as the origin, since
                # the origin is logically a page.
                origin = url, rawlink
                self.newlink((link, fragment), origin)
        else:
            # If no page has been created yet, we want to
            # record that fact.
            self.name_table[url_pair[0]] = None
        self.markdone(url_pair)

    def newlink(self, url, origin):
        if self.done.has_key(url):
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        if origin not in self.done[url]:
            self.done[url].append(origin)

        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        self.note(3, "  Done link %s", self.format_url(url))

        # Make sure that if it's bad, the origin gets added.
        if self.bad.has_key(url):
            source, rawlink = origin
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def newtodolink(self, url, origin):
        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        if self.todo.has_key(url):
            if origin not in self.todo[url]:
                self.todo[url].append(origin)
            self.note(3, "  Seen todo link %s", self.format_url(url))
        else:
            self.todo[url] = [origin]
            self.note(3, "  New todo link %s", self.format_url(url))

    def format_url(self, url):
        link, fragment = url
        if fragment: return link + "#" + fragment
        else: return link

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

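    # A URL counts as "internal" only if it starts with one of the roots
    # *and* robots.txt allows it; URLs disallowed by robots.txt therefore
    # show up as external links, as the module docstring notes.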
    def inroots(self, url):
        for root in self.roots:
            if url[:len(root)] == root:
                return self.isallowed(root, url)
        return 0

    def isallowed(self, root, url):
        root = urlparse.urljoin(root, "/")
        return self.robots[root].can_fetch(AGENTNAME, url)

    def getpage(self, url_pair):
        # Incoming argument name is a (URL, fragment) pair.
        # The page may have been cached in the name_table variable.
        url, fragment = url_pair
        if self.name_table.has_key(url):
            return self.name_table[url]

        scheme, path = urllib.splittype(url)
        if scheme in ('mailto', 'news', 'javascript', 'telnet'):
            self.note(1, " Not checking %s URL" % scheme)
            return None
        isint = self.inroots(url)

        # Ensure that openpage gets the URL pair to
        # print out its error message and record the error pair
        # correctly.
        if not isint:
            if not self.checkext:
                self.note(1, " Not checking ext link")
                return None
            f = self.openpage(url_pair)
            if f:
                self.safeclose(f)
            return None
        text, nurl = self.readhtml(url_pair)

        if nurl != url:
            self.note(1, " Redirected to %s", nurl)
            url = nurl
        if text:
            return Page(text, url, maxpage=self.maxpage, checker=self)

    # These next three functions take (URL, fragment) pairs as
    # arguments, so that openpage() receives the appropriate tuple to
    # record error messages.
    def readhtml(self, url_pair):
        url, fragment = url_pair
        text = None
        f, url = self.openhtml(url_pair)
        if f:
            text = f.read()
            f.close()
        return text, url

    def openhtml(self, url_pair):
        url, fragment = url_pair
        f = self.openpage(url_pair)
        if f:
            url = f.geturl()
            info = f.info()
            if not self.checkforhtml(info, url):
                self.safeclose(f)
                f = None
        return f, url

    def openpage(self, url_pair):
        url, fragment = url_pair
        try:
            return self.urlopener.open(url)
        except (OSError, IOError), msg:
            msg = self.sanitize(msg)
            self.note(0, "Error %s", msg)
            if self.verbose > 0:
                self.show(" HREF ", url, "  from", self.todo[url_pair])
            self.setbad(url_pair, msg)
            return None

    def checkforhtml(self, info, url):
        if info.has_key('content-type'):
            ctype = cgi.parse_header(info['content-type'])[0].lower()
            if ';' in ctype:
                # handle content-type: text/html; charset=iso8859-1 :
                ctype = ctype.split(';', 1)[0].strip()
        else:
            if url[-1:] == "/":
                return 1
            ctype, encoding = mimetypes.guess_type(url)
        if ctype == 'text/html':
            return 1
        else:
            self.note(1, " Not HTML, mime type %s", ctype)
            return 0

    def setgood(self, url):
        if self.bad.has_key(url):
            del self.bad[url]
            self.changed = 1
            self.note(0, "(Clear previously seen error)")

    def setbad(self, url, msg):
        if self.bad.has_key(url) and self.bad[url] == msg:
            self.note(0, "(Seen this error before)")
            return
        self.bad[url] = msg
        self.changed = 1
        self.markerror(url)

    def markerror(self, url):
        try:
            origins = self.todo[url]
        except KeyError:
            origins = self.done[url]
        for source, rawlink in origins:
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def seterror(self, url, triple):
        try:
            # Because of the way the URLs are now processed, I need to
            # check to make sure the URL hasn't been entered in the
            # error list.  The first element of the triple here is a
            # (URL, fragment) pair, but the URL key is not, since it's
            # from the list of origins.
            if triple not in self.errors[url]:
                self.errors[url].append(triple)
        except KeyError:
            self.errors[url] = [triple]

    # The following used to be toplevel functions; they have been
    # changed into methods so they can be overridden in subclasses.

    def show(self, p1, link, p2, origins):
        self.message("%s %s", p1, link)
        i = 0
        for source, rawlink in origins:
            i = i+1
            if i == 2:
                p2 = ' '*len(p2)
            if rawlink != link: s = " (%s)" % rawlink
            else: s = ""
            self.message("%s %s%s", p2, source, s)

    def sanitize(self, msg):
        if isinstance(IOError, ClassType) and isinstance(msg, IOError):
            # Do the other branch recursively
            msg.args = self.sanitize(msg.args)
        elif isinstance(msg, TupleType):
            if len(msg) >= 4 and msg[0] == 'http error' and \
               isinstance(msg[3], InstanceType):
                # Remove the Message instance -- it may contain
                # a file object which prevents pickling.
                msg = msg[:3] + msg[4:]
        return msg

    def safeclose(self, f):
        try:
            url = f.geturl()
        except AttributeError:
            pass
        else:
            if url[:4] == 'ftp:' or url[:7] == 'file://':
                # Apparently ftp connections don't like to be closed
                # prematurely...
                text = f.read()
        f.close()

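    # save_pickle() writes the new checkpoint to "<dumpfile>.new" and only
    # renames it over the old file once the dump has succeeded, so an
    # interrupt while writing leaves the previous checkpoint intact.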
    def save_pickle(self, dumpfile=DUMPFILE):
        if not self.changed:
            self.note(0, "\nNo need to save checkpoint")
        elif not dumpfile:
            self.note(0, "No dumpfile, won't save checkpoint")
        else:
            self.note(0, "\nSaving checkpoint to %s ...", dumpfile)
            newfile = dumpfile + ".new"
            f = open(newfile, "wb")
            pickle.dump(self, f)
            f.close()
            try:
                os.unlink(dumpfile)
            except os.error:
                pass
            os.rename(newfile, dumpfile)
            self.note(0, "Done.")
            return 1


class Page:

    def __init__(self, text, url, verbose=VERBOSE, maxpage=MAXPAGE, checker=None):
        self.text = text
        self.url = url
        self.verbose = verbose
        self.maxpage = maxpage
        self.checker = checker

        # The parsing of the page is done in the __init__() routine in
        # order to initialize the list of names the file
        # contains. Stored the parser in an instance variable. Passed
        # the URL to MyHTMLParser().
        size = len(self.text)
        if size > self.maxpage:
            self.note(0, "Skip huge file %s (%.0f Kbytes)", self.url, (size*0.001))
            self.parser = None
            return
        self.checker.note(2, "  Parsing %s (%d bytes)", self.url, size)
        self.parser = MyHTMLParser(url, verbose=self.verbose,
                                   checker=self.checker)
        self.parser.feed(self.text)
        self.parser.close()

    def note(self, level, msg, *args):
        if self.checker:
            apply(self.checker.note, (level, msg) + args)
        else:
            if self.verbose >= level:
                if args:
                    msg = msg%args
                print msg

    # Method to retrieve names.
    def getnames(self):
        if self.parser:
            return self.parser.names
        else:
            return []

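    # getlinkinfos() returns (link, rawlink, fragment) triples.  For
    # illustration (hypothetical page): <A HREF="sub/page.html#sec1"> seen
    # on http://host/dir/index.html yields
    # ('http://host/dir/sub/page.html', 'sub/page.html', 'sec1').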
    def getlinkinfos(self):
        # File reading is done in __init__() routine.  Store parser in
        # local variable to indicate success of parsing.

        # If no parser was stored, fail.
        if not self.parser: return []

        rawlinks = self.parser.getlinks()
        base = urlparse.urljoin(self.url, self.parser.getbase() or "")
        infos = []
        for rawlink in rawlinks:
            t = urlparse.urlparse(rawlink)
            # DON'T DISCARD THE FRAGMENT! Instead, include
            # it in the tuples which are returned. See Checker.dopage().
            fragment = t[-1]
            t = t[:-1] + ('',)
            rawlink = urlparse.urlunparse(t)
            link = urlparse.urljoin(base, rawlink)
            infos.append((link, rawlink, fragment))

        return infos


class MyStringIO(StringIO.StringIO):

    def __init__(self, url, info):
        self.__url = url
        self.__info = info
        StringIO.StringIO.__init__(self)

    def info(self):
        return self.__info

    def geturl(self):
        return self.__url


class MyURLopener(urllib.FancyURLopener):

    http_error_default = urllib.URLopener.http_error_default

    def __init__(*args):
        self = args[0]
        apply(urllib.FancyURLopener.__init__, args)
        self.addheaders = [
            ('User-agent', 'Python-webchecker/%s' % __version__),
            ]

    def http_error_401(self, url, fp, errcode, errmsg, headers):
        return None

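    # open_file() implements the "file:" URL extension described in the
    # module docstring: a directory serves its index.html if present,
    # otherwise a small HTML listing is synthesized in a MyStringIO.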
    def open_file(self, url):
        path = urllib.url2pathname(urllib.unquote(url))
        if os.path.isdir(path):
            if path[-1] != os.sep:
                url = url + '/'
            indexpath = os.path.join(path, "index.html")
            if os.path.exists(indexpath):
                return self.open_file(url + "index.html")
            try:
                names = os.listdir(path)
            except os.error, msg:
                exc_type, exc_value, exc_tb = sys.exc_info()
                raise IOError, msg, exc_tb
            names.sort()
            s = MyStringIO("file:"+url, {'content-type': 'text/html'})
            s.write('<BASE HREF="file:%s">\n' %
                    urllib.quote(os.path.join(path, "")))
            for name in names:
                q = urllib.quote(name)
                s.write('<A HREF="%s">%s</A>\n' % (q, q))
            s.seek(0)
            return s
        return urllib.FancyURLopener.open_file(self, url)


class MyHTMLParser(sgmllib.SGMLParser):

    def __init__(self, url, verbose=VERBOSE, checker=None):
        self.myverbose = verbose # now unused
        self.checker = checker
        self.base = None
        self.links = {}
        self.names = []
        self.url = url
        sgmllib.SGMLParser.__init__(self)

    def check_name_id(self, attributes):
        """ Check the name or id attributes on an element.
        """
        # We must rescue the NAME or id (name is deprecated in XHTML)
        # attributes from the anchor, in order to
        # cache the internal anchors which are made
        # available in the page.
        for name, value in attributes:
            if name == "name" or name == "id":
                if value in self.names:
                    self.checker.message("WARNING: duplicate ID name %s in %s",
                                         value, self.url)
                else: self.names.append(value)
                break

    def unknown_starttag(self, tag, attributes):
        """ In XHTML, you can have id attributes on any element.
        """
        self.check_name_id(attributes)

    def start_a(self, attributes):
        self.link_attr(attributes, 'href')
        self.check_name_id(attributes)

    def end_a(self): pass

    def do_area(self, attributes):
        self.link_attr(attributes, 'href')
        self.check_name_id(attributes)

    def do_body(self, attributes):
        self.link_attr(attributes, 'background', 'bgsound')
        self.check_name_id(attributes)

    def do_img(self, attributes):
        self.link_attr(attributes, 'src', 'lowsrc')
        self.check_name_id(attributes)

    def do_frame(self, attributes):
        self.link_attr(attributes, 'src', 'longdesc')
        self.check_name_id(attributes)

    def do_iframe(self, attributes):
        self.link_attr(attributes, 'src', 'longdesc')
        self.check_name_id(attributes)

    def do_link(self, attributes):
        for name, value in attributes:
            if name == "rel":
                parts = value.lower().split()
                if (  parts == ["stylesheet"]
                      or parts == ["alternate", "stylesheet"]):
                    self.link_attr(attributes, "href")
                break
        self.check_name_id(attributes)

    def do_object(self, attributes):
        self.link_attr(attributes, 'data', 'usemap')
        self.check_name_id(attributes)

    def do_script(self, attributes):
        self.link_attr(attributes, 'src')
        self.check_name_id(attributes)

    def do_table(self, attributes):
        self.link_attr(attributes, 'background')
        self.check_name_id(attributes)

    def do_td(self, attributes):
        self.link_attr(attributes, 'background')
        self.check_name_id(attributes)

    def do_th(self, attributes):
        self.link_attr(attributes, 'background')
        self.check_name_id(attributes)

    def do_tr(self, attributes):
        self.link_attr(attributes, 'background')
        self.check_name_id(attributes)

    def link_attr(self, attributes, *args):
        for name, value in attributes:
            if name in args:
                if value: value = value.strip()
                if value: self.links[value] = None

    def do_base(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = value.strip()
                if value:
                    if self.checker:
                        self.checker.note(1, "  Base %s", value)
                    self.base = value
        self.check_name_id(attributes)

    def getlinks(self):
        return self.links.keys()

    def getbase(self):
        return self.base


if __name__ == '__main__':
    main()