Archive for October, 2008

Image spidering in Python

Tuesday, October 28th, 2008

I have several useful tools in Python for working with websites. Today I needed a script to report the images on a website, along with their corresponding alt attributes. The script was extremely quick to write using the available tools, which makes it a fairly good example of how powerful Python is.

I have based this script on a pre-existing webspider class I have written:

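import re
import sys
import urllib2

# ElementSoup is lxml's BeautifulSoup-backed parser (assumed import path)
from lxml.html import ElementSoup

class Spider(object):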
   def __init__(self, base_url):
      self.base_url = base_url
   
   def pages(self):
      queue = [self.base_url]
      seen = set(queue)
   
      while queue:
         url = queue.pop(0)
         f = urllib2.urlopen(url)
         if f.info().gettype() not in ['text/html', 'application/xhtml+xml']:
            continue
         doc = ElementSoup.parse(f)
         doc.make_links_absolute(url)
         for element, attribute, link, pos in doc.iterlinks():
            if not link.startswith(self.base_url):
               continue
            if element.tag == 'a' and attribute == 'href':
               l = re.sub(r'#.*$', '', link)
               if l not in seen:
                  queue.append(l)
                  seen.add(l)
   
         path = url[len(self.base_url):]
         yield path, doc

This class effectively wraps a generator that yields a (path, document) pair for every page it finds on the site. Generators are incredibly useful for keeping code simple without being memory hungry. Typing yield is easier than building a list of items, but in this case the benefit goes further: this code returns one lxml ElementTree at a time, rather than reading and parsing every page up front.
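To make the laziness concrete, here is a minimal sketch, written for Python 3 rather than the Python 2 used in the script. The squares generator and the computed list are invented purely for illustration; the point is only that the generator body runs as many times as the caller asks for items.

```python
import itertools

computed = []  # records which items the generator has actually produced


def squares():
    """Yield (n, n * n) pairs forever, one at a time."""
    n = 0
    while True:
        computed.append(n)
        yield n, n * n
        n += 1


# Taking three items runs the generator body only three times,
# even though the generator itself never terminates.
first_three = list(itertools.islice(squares(), 3))
```

Building the same sequence as a list would be impossible here, since it is infinite; with a generator the caller decides how much work is done.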

Generators encapsulate state as local variables, which generally means you don't even need to wrap them in a class like I've done. I only do this because I like to add functionality by subclassing. This may be a throwback to my days of programming Java.
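As a rough illustration of the class-free approach, the same breadth-first walk can be written as a plain generator function. This is a Python 3 sketch, not the code from the script: crawl and get_links are hypothetical names, and get_links stands in for the fetching and parsing that the real spider does over HTTP.

```python
from collections import deque


def crawl(start, get_links):
    """Breadth-first walk that yields each reachable page exactly once.

    get_links is a hypothetical callable returning the links on a page;
    the queue and the seen set live on as ordinary local variables.
    """
    queue = deque([start])
    seen = {start}
    while queue:
        page = queue.popleft()
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        yield page


# A tiny in-memory "site" standing in for real HTTP requests.
site = {'/': ['/a', '/b'], '/a': ['/', '/b'], '/b': ['/c'], '/c': []}
order = list(crawl('/', site.__getitem__))
```

The state that the class version keeps alive between pages is simply the generator's suspended stack frame here.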

It should be noted that most of the heavy lifting here is being done by lxml and BeautifulSoup. lxml.html makes it extremely easy to work with HTML. BeautifulSoup's excellent broken-HTML parser is used not because my HTML demands it, but to allow this one script to work with any site I want to use it with.

class ImageSpider(Spider):
   def images(self):
      seen = set()
      for path, doc in self.pages():
         imgs = []
         for img in doc.findall('.//img'):
            src = img.get('src')
            alt = img.get('alt')
            title = img.get('title')
            i = (src, alt, title)
            if i not in seen:
               seen.add(i)
               imgs.append(i)
   
         if imgs:
            yield path, imgs
   
...

This is another generator that effectively filters the list of pages, yielding a list of images within each page. Generators calling generators is again very elegant. Each time the caller asks for the next page of images, ImageSpider will go back to the original Spider for a new page until it has one with images.
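The chaining can be sketched on its own, without any network access. In this Python 3 example the pages function is a made-up stand-in for Spider.pages(), and pages_with_new_images is a hypothetical name for the filtering step; it skips pages that contribute no previously unseen images, much as ImageSpider.images() does.

```python
def pages():
    """Stand-in for Spider.pages(): yields (path, image sources) pairs."""
    yield '/', ['logo.png']
    yield '/about', []
    yield '/gallery', ['logo.png', 'cat.jpg']


def pages_with_new_images(source):
    """Yield only pages that contain images not seen on an earlier page."""
    seen = set()
    for path, srcs in source:
        fresh = [s for s in srcs if s not in seen]
        seen.update(fresh)
        if fresh:
            yield path, fresh


result = list(pages_with_new_images(pages()))
```

Each request for the next result pulls pages from the inner generator until one with new images turns up, so nothing is fetched earlier than needed.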

   def text_report(self, out=sys.stdout):
      for path, imgs in self.images():
         print >>out, 'In', path
         for src, alt, title in imgs:
            print >>out, '- src:', src
            if alt is not None:
               print >>out, '  alt:', alt
            else:
               print >>out, '  alt is MISSING'
            if title is not None:
               print >>out, '  title:', title
         print >>out

Other methods of ImageSpider generate reports. Here I use the handy print chevrons to write to any file-like object. File-like objects are a particularly handy piece of duck typing. By default these methods will write to stdout, which is the same as printing normally, but you can pass in any other file-like object for very simple redirection.
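For comparison, here is a small Python 3 sketch of the same idea, where print(..., file=out) replaces the print chevrons. The report function and its sample data are invented for illustration; io.StringIO serves as the substitute file-like object.

```python
import io
import sys


def report(lines, out=sys.stdout):
    """Write each line to any object that has a write() method."""
    for line in lines:
        print('-', line, file=out)


# An in-memory buffer is accepted just as readily as a real file.
buffer = io.StringIO()
report(['a.png', 'b.png'], out=buffer)
captured = buffer.getvalue()
```

The function never checks what kind of object it was given, which is exactly what makes the redirection so cheap.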

   def html_report(self, out=sys.stdout):
      from cgi import escape
      print >>out, """<html>
   <head>
      <title>Image Report for %(base_url)s</title>
   </head>
   <body>
      <h1>Image report for %(base_url)s</h1>
      """ % {'base_url': escape(self.base_url)}
   
      for path, imgs in self.images():
         print >>out, '\t\t<h2>%s</h2>' % escape(path).encode('utf8')
         for src, alt, title in imgs:
            # Passing True also escapes double quotes, since src and alt are
            # written into attributes; a missing value becomes ''.
            idict = {'src': escape(unicode(src or ''), True).encode('utf8'),
                'alt': escape(unicode(alt or ''), True).encode('utf8'),
                'title': escape(unicode(title or ''), True).encode('utf8')}
            print >>out, '\t\t<img src="%(src)s" alt="%(alt)s" />' % idict
            if alt is not None:
               print >>out, '\t\t<p><strong>alt:</strong> %(alt)s</p>' % idict
            else:
               print >>out, '\t\t<p><strong>alt is MISSING</strong></p>'
            if title is not None:
               print >>out, '\t\t<p><strong>title:</strong> %(title)s</p>' % idict
            print >>out
      print >>out, """   </body>
</html>
"""

This method is similar, but it also demonstrates a simple form of templating: the string formatting operator, %, can fill named placeholders from a dictionary.
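The dictionary form of % can be tried in isolation. This is a Python 3 sketch with made-up values; html.escape is used as the Python 3 counterpart of cgi.escape, and unlike the latter it escapes double quotes by default.

```python
from html import escape

# Named placeholders are looked up in the dictionary on the right of %.
template = '<img src="%(src)s" alt="%(alt)s" />'
values = {
    'src': escape('/img/a&b.png'),
    'alt': escape('Tom "the cat"'),
}
tag = template % values
```

Escaping each value before substitution keeps the template itself free of any escaping logic.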

Finally, there's the command-line interface to all this:

from optparse import OptionParser
   
op = OptionParser()
op.add_option('-f', '--format', choices=['text', 'html'])
op.add_option('-o', '--outfile')
   
options, args = op.parse_args()
   
if len(args) != 1:
   op.error('You must provide a site URL from which to spider images.')
   
s = ImageSpider(args[0])
   
if options.outfile:
   out = open(options.outfile, 'w')
else:
   out = sys.stdout
   
if options.format == 'html':
   s.html_report(out)
else:
   s.text_report(out)

In a few lines, the amazing optparse module turns a quick script into a flexible commandline tool.
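One convenient property of optparse is that parse_args accepts an explicit argument list, so the parser can be exercised without touching the real command line. The sketch below assumes Python 3, where optparse is still available; the build_parser helper, the usage string and the default of 'text' are additions for illustration and are not part of the script above.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser with the same two options as the script."""
    op = OptionParser(usage='%prog [options] SITE_URL')
    # default='text' is an assumption; the script leaves the default unset
    op.add_option('-f', '--format', choices=['text', 'html'], default='text')
    op.add_option('-o', '--outfile')
    return op


# Passing a list parses it instead of sys.argv[1:].
options, args = build_parser().parse_args(
    ['-f', 'html', 'http://example.com/'])
```

Options that were not supplied, such as --outfile here, come back as None, which is why the script can test options.outfile directly.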

Download the source: siteimages.py

Bubble Background Animation

Saturday, October 11th, 2008

I was pondering concepts for interesting web designs when the idea occurred to me that an animated bubble effect might lend a peaceful ambience to a webpage. I experimented with placing a Javascript-controlled SVG animation into the background of a page. You might like to judge for yourself whether this is successful or not (SVG-enabled browser required and a reasonably fast CPU recommended).

If you were around at the dawn of dynamic HTML you will probably have stumbled across amateur websites whose authors thought it was really rather stylish to add a Javascript snow or bubble effect over the top of the page content.

Fortunately, those days are gone. By and large, it seems that amateur webmasters today know that just a nice colour scheme and a consistent, simple style trump a jumble of styles, javascript effects and stock animated GIFs that we all remember too well. Nice graphic design is done for you if you just install a blog and browse existing themes. Some may not even remember effects like this (Warning: Not safe for work or indeed any other time you require functioning eyeballs).

It's well-known that animations draw the user's attention in webpages. That doesn't mean we always want to avoid them: sometimes we want to direct the user's attention in one direction or another, particularly when the page is being updated dynamically with Javascript. This is not one of those special cases. Since the goal of this experiment is to build a fully-animated webpage, we will have to ignore that inconvenient little fact. However, this suggests we need to keep the animation as unintrusive as possible. Keeping it nice and slow may help, and it should certainly be in the background and not the foreground.

SVG is useful for this kind of effect because it has a feature (<svg:use>) for manipulating independent clones of a symbol. It is therefore simple to draw the original shape using an SVG editor, and the Javascript merely needs to manage instances of the clones.
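To show what the clones look like in the markup, here is a Python 3 sketch that generates a static SVG document with one shape definition and several use elements pointing at it. It is not the Javascript from the experiment: the bubble_scene function, the plain circle used as the shape, and all sizes and positions are hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = 'http://www.w3.org/2000/svg'
XLINK_NS = 'http://www.w3.org/1999/xlink'
ET.register_namespace('', SVG_NS)
ET.register_namespace('xlink', XLINK_NS)


def bubble_scene(positions):
    """Return an <svg> element with one shape and a <use> clone per position."""
    svg = ET.Element('{%s}svg' % SVG_NS, {'width': '400', 'height': '300'})
    defs = ET.SubElement(svg, '{%s}defs' % SVG_NS)
    # The original shape is defined once (a plain circle stands in for it).
    ET.SubElement(defs, '{%s}circle' % SVG_NS,
                  {'id': 'bubble', 'r': '10', 'fill': 'none', 'stroke': 'white'})
    for x, y in positions:
        # Each clone only references the shape and carries its own position.
        ET.SubElement(svg, '{%s}use' % SVG_NS,
                      {'{%s}href' % XLINK_NS: '#bubble', 'x': str(x), 'y': str(y)})
    return svg


scene = bubble_scene([(50, 200), (120, 250), (300, 220)])
markup = ET.tostring(scene, encoding='unicode')
```

Animating the scene then comes down to updating the x and y attributes of each use element, leaving the shape itself untouched.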

Using Inkscape, I drew up a bubble that looks like this:

[Image: Bubble]

There's a certain knack to drawing bubbles like this, of course. Air bubbles in water are colourless, but they are reflective due to total internal reflection. The amount of reflection increases as the angle of incidence increases, up to the critical angle, at which all light is reflected. At a water-air boundary the critical angle is 48.6° so actually the bubble should appear totally reflective from about 75% of the radius.
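The two figures quoted above can be checked with a few lines of arithmetic. This Python sketch assumes a refractive index of 1.333 for water and 1.0 for air, and assumes parallel rays, so that a ray striking the bubble at a distance d from its centre line meets the surface at an angle of incidence whose sine is d / r.

```python
import math

n_water = 1.333  # assumed refractive index of water
n_air = 1.0      # assumed refractive index of air

# Snell's law: total internal reflection begins where sin(theta) = n_air / n_water.
critical = math.asin(n_air / n_water)
critical_degrees = math.degrees(critical)

# For parallel rays, sin(angle of incidence) = d / r, so the bubble is
# totally reflective beyond this fraction of its radius.
radius_fraction = math.sin(critical)
```

Under those assumptions critical_degrees comes out at roughly 48.6 and radius_fraction at roughly 0.75, in line with the text.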

If this sends you into a bit of a panic as you struggle to remember your school physics lessons, don't worry. I'm not recommending a mathematically accurate implementation of Fresnel's Equations. With a lot of art (not just on computers), an appreciation of the physics can go a long way towards adding realism. But a 100% accurate simulation is not necessary for an effect to seem convincing – trial and error is much easier. The gradient as I've drawn it is not accurate but looks alright. Similarly, bubbles have two specular highlights corresponding to the water-air boundary and the air-water boundary.

As an aside, one day it may be possible to depict fully reflective and refractive bubbles. Using SVG's incredible feDisplacementMap filter, you could distort the background using a pre-computed "lens" image. But that is unlikely to run at interactive speeds today, even if the filters required were fully and accurately supported, which they are not. The bubbles I've drawn are intended to be a compromise between rendering simplicity and attractiveness.

The bubble system (really just the SVG on its own) animates 20 clones of the bubble symbol. Again, this is based on some physical principles: the smaller bubbles are subject to less drag and so have a higher terminal velocity, bubbles grow slightly as they rise and the pressure decreases, and so on. One of the most effective things is that the bubbles drift with a random walk: they can randomly drift to one side or the other. They don't go straight up, nor do they oscillate sinusoidally like the classic DynamicDrive script. For the most effective animation, bubbles would drift with the currents, but this is simpler and reasonably effective.
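The update rule described here can be sketched in a few lines. This is a Python illustration rather than the Javascript that drives the page, and the step function and every constant in it are hypothetical; it simply follows the rules as stated above (smaller bubbles rise faster, radii grow slowly, sideways motion is a random walk).

```python
import random

BASE_SPEED = 40.0  # hypothetical: rise speed in pixels/second at radius 1
DRIFT = 15.0       # hypothetical: maximum sideways speed in pixels/second
GROWTH = 0.02      # hypothetical: fractional radius growth per second


def step(bubble, dt, rng=random):
    """Return the bubble (a dict with x, y, radius) advanced by dt seconds."""
    x, y, radius = bubble['x'], bubble['y'], bubble['radius']
    return {
        # random walk: an independent sideways nudge on every step
        'x': x + rng.uniform(-DRIFT, DRIFT) * dt,
        # screen y decreases upwards; smaller radius means faster rise
        'y': y - (BASE_SPEED / radius) * dt,
        # slight growth as the bubble rises and the pressure drops
        'radius': radius * (1 + GROWTH * dt),
    }


rng = random.Random(1)  # seeded so that runs are repeatable
bubbles = [{'x': 100.0, 'y': 300.0, 'radius': r} for r in (2.0, 4.0)]
for _ in range(10):
    bubbles = [step(b, 0.1, rng) for b in bubbles]
```

Because each sideways nudge is independent of the last, the paths wander rather than oscillate, which is the difference from a sinusoidal drift.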

I am quite pleased with the results. To really rid ourselves of the legacy of Javascript-animated GIF images, it would be important for this effect to tie in with the graphic design of the page, which I haven't shown.

I don't think this is realistically ready for production websites: Internet Explorer cannot display SVG, for one thing, and the intensive CPU requirement is also a problem. But I do think that sharp SVG graphics allow us to produce a wholly better standard of animation than what was possible before. With this, I think it's possible to make a bubble animation complement rather than detract from a web page.

SVG Buttons

Thursday, October 9th, 2008

With SVG filters, it's easier than ever to create stylish graphical buttons for the web.

Using images for buttons is a much more pragmatic approach than attempting to style buttons with CSS, at least until widespread support for CSS3's draft-but-stable border-image property is available.

Up until a couple of years ago, I had generally created buttons using a PHP script that glued them together:

[Image: Example of Add To Basket button]

This was useful when working with XSL, allowing me to simply call a template to include an arbitrary button text, rather than linking to a static button image.

Because I now use Django for most of my sites, this technique is no longer relevant. Because I'm not now producing templates to transform an arbitrary XML model, but producing templates to render specific models, I know when writing the template what buttons it will require. A typical button, designed for editing convenience, would look like this:

[Image: Example of Add To Basket button]

This button is a rounded rectangle with a gradient. The label is typed twice to give it a slightly inset look. Even though you have to type the label twice to change a button, it takes only a few seconds to change the label and adjust the width of the rectangle to fit.

Inkscape 0.46 provided access to a wide range of SVG filters, making the process even simpler. Buttons are now never more complicated than a rectangle, a label, and the SVG filters to make them look pretty and three-dimensional:

[Image: Example of View Products button]

Changing a button is as simple as it can be. Or is it?

I sometimes like to connect adjacent buttons into one strip, something which will be familiar to Mac OS X users:

[Image: Example of connected buttons]

SVG filters can make this a doddle too. By using SVG filters to create all of the graphical effects, including the rounded corners, these buttons can be dragged together and automatically connect with one another. The filter is applied to the layer, and the above buttons are editable simply as rectangles.

Try it: Download the SVG (Inkscape 0.46+ recommended).