Chapter 8. HTML Processing

8.1. Diving in

I often see questions on comp.lang.python like “How can I list all the [headers|images|links] in my HTML document?” “How do I parse/translate/munge the text of my HTML document but leave the tags alone?” “How can I add/remove/quote attributes of all my HTML tags at once?” This chapter will answer all of these questions.

Here is a complete, working Python program in two parts. The first part, BaseHTMLProcessor.py, is a generic tool to help you process HTML files by walking through the tags and text blocks. The second part, dialect.py, is an example of how to use BaseHTMLProcessor.py to translate the text of an HTML document but leave the tags alone. Read the doc strings and comments to get an overview of what's going on. Most of it will seem like black magic, because it's not obvious how any of these class methods ever get called. Don't worry, all will be revealed in due time.

Example 8.1. BaseHTMLProcessor.py

If you have not already done so, you can download this and other examples used in this book.

from sgmllib import SGMLParser
import htmlentitydefs

class BaseHTMLProcessor(SGMLParser):
    def reset(self):
        # extend (called by SGMLParser.__init__)
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):
        # called for each start tag
        # attrs is a list of (attr, value) tuples
        # e.g. for <pre class="screen">, tag="pre", attrs=[("class", "screen")]
        # Ideally we would like to reconstruct original tag and attributes, but
        # we may end up quoting attribute values that weren't quoted in the source
        # document, or we may change the type of quotes around the attribute value
        # (single to double quotes).
        # Note that improperly embedded non-HTML code (like client-side JavaScript)
        # may be parsed incorrectly by the ancestor, causing runtime script errors.
        # All non-HTML code must be enclosed in HTML comment tags (<!-- code -->)
        # to ensure that it will pass through this parser unaltered (in handle_comment).
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

    def unknown_endtag(self, tag):
        # called for each end tag, e.g. for </pre>, tag will be "pre"
        # Reconstruct the original end tag.
        self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):
        # called for each character reference, e.g. for "&#160;", ref will be "160"
        # Reconstruct the original character reference.
        self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):
        # called for each entity reference, e.g. for "&copy;", ref will be "copy"
        # Reconstruct the original entity reference.
        self.pieces.append("&%(ref)s" % locals())
        # standard HTML entities are closed with a semicolon; other entities are not
        if htmlentitydefs.entitydefs.has_key(ref):
            self.pieces.append(";")

    def handle_data(self, text):
        # called for each block of plain text, i.e. outside of any tag and
        # not containing any character or entity references
        # Store the original text verbatim.
        self.pieces.append(text)

    def handle_comment(self, text):
        # called for each HTML comment, e.g. <!-- insert JavaScript code here -->
        # Reconstruct the original comment.
        # It is especially important that the source document enclose client-side
        # code (like JavaScript) within comments so it can pass through this
        # processor undisturbed; see comments in unknown_starttag for details.
        self.pieces.append("<!--%(text)s-->" % locals())

    def handle_pi(self, text):
        # called for each processing instruction, e.g.