Image spidering in Python

Tuesday, October 28th, 2008

I have several useful tools in Python for working with websites. Today I needed a script to report the images on a website, along with their corresponding alt and title attributes. The script was extremely quick to write using the available tools, which makes it a fairly good example of how powerful Python is.

I based this script on a web spider class I had already written:

import re
import urllib2

from lxml.html import ElementSoup


class Spider(object):
   def __init__(self, base_url):
      self.base_url = base_url

   def pages(self):
      """Yield a (path, document) pair for every HTML page under base_url."""
      queue = [self.base_url]
      seen = set(queue)

      while queue:
         url = queue.pop(0)
         f = urllib2.urlopen(url)
         # Skip anything that isn't an HTML or XHTML document.
         if f.info().gettype() not in ['text/html', 'application/xhtml+xml']:
            continue
         doc = ElementSoup.parse(f)
         doc.make_links_absolute(url)
         for element, attribute, link, pos in doc.iterlinks():
            # Stay within the site being spidered.
            if not link.startswith(self.base_url):
               continue
            if element.tag == 'a' and attribute == 'href':
               # Strip the fragment so each page is queued only once.
               l = re.sub(r'#.*$', '', link)
               if l not in seen:
                  queue.append(l)
                  seen.add(l)

         path = url[len(self.base_url):]
         yield path, doc

This class wraps a generator which yields a (path, page) pair for every page it finds on the site. Generators are incredibly useful for keeping code simple without being memory hungry. Typing yield is easier than building a list of items, but in this case the benefit is greater than that: this code returns one lxml ElementTree at a time, rather than reading and parsing them all up front.

Generators encapsulate state as local variables, which generally means you don't even need to wrap them in a class like I've done. I only do this because I like to add functionality by subclassing. This may be a throwback to my days of programming Java.
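As a minimal sketch of that point, here is the same breadth-first crawl written as a plain generator function, with the queue and the seen set held as local variables. It is written for Python 3, and the `links` dictionary is a hypothetical in-memory stand-in for fetching a page and extracting its links; it is not part of the original script.

```python
from collections import deque


def crawl(start, links):
    """Yield each reachable page once, breadth-first.

    `links` is an assumption made for this sketch: a dict mapping a
    URL to the list of URLs that page links to.
    """
    queue = deque([start])
    seen = {start}
    while queue:
        url = queue.popleft()
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # Execution pauses here until the caller asks for the next page.
        yield url


site = {
    '/': ['/about', '/blog'],
    '/about': ['/'],
    '/blog': ['/blog/1', '/about'],
}

pages = crawl('/', site)
first = next(pages)   # only the first page has been processed so far
rest = list(pages)    # drain the remaining pages
```

The queue and seen set survive between calls to `next()` without any instance attributes, which is all the class would otherwise be providing.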

Most of the heavy lifting here is done by lxml and BeautifulSoup. lxml.html makes it extremely easy to work with HTML. BeautifulSoup's excellent broken-HTML parser is used not because my HTML demands it, but so that this one script works with any site I want to use it with.

class ImageSpider(Spider):
   def images(self):
      seen = set()
      for path, doc in self.pages():
         imgs = []
         for img in doc.findall('.//img'):
            src = img.get('src')
            alt = img.get('alt')
            title = img.get('title')
            i = (src, alt, title)
            if i not in seen:
               seen.add(i)
               imgs.append(i)
   
         if imgs:
            yield path, imgs
   
...

This is another generator that effectively filters the list of pages, yielding a list of images within each page. Generators calling generators is again very elegant. Each time the caller asks for the next page of images, ImageSpider will go back to the original Spider for a new page until it has one with images.
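The laziness of chained generators can be seen in a small sketch. It is written for Python 3, and `fake_pages` is a hypothetical stand-in for `Spider.pages()` that records which pages it has produced, so no network access is needed.

```python
fetched = []   # records which pages the inner generator has produced


def fake_pages():
    """Hypothetical stand-in for Spider.pages(): yields (path, images)."""
    data = [
        ('/', []),
        ('/a', ['logo.png']),
        ('/b', []),
        ('/c', ['x.png', 'y.png']),
    ]
    for path, imgs in data:
        fetched.append(path)
        yield path, imgs


def pages_with_images(pages):
    """Filter a stream of pages, yielding only those that contain images."""
    for path, imgs in pages:
        if imgs:
            yield path, imgs


gen = pages_with_images(fake_pages())
first = next(gen)
```

After the single `next()` call, `fetched` holds only `'/'` and `'/a'`: the outer generator pulled pages from the inner one until it found one with images, and then stopped.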

   def text_report(self, out=sys.stdout):
      for path, imgs in self.images():
         print >>out, 'In', path
         for src, alt, title in imgs:
            print >>out, '- src:', src
            if alt is not None:
               print >>out, '  alt:', alt
            else:
               print >>out, '  alt is MISSING'
            if title is not None:
               print >>out, '  title:', title
         print >>out

Other methods of ImageSpider generate reports. Here I use the handy print chevrons to write to any file-like object. File-like objects are a particularly handy piece of duck typing. By default these methods will write to stdout, which is the same as printing normally, but you can pass in any other file-like object for very simple redirection.
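As a rough illustration of that duck typing, the sketch below passes two different objects as `out`. It is written for Python 3, so it calls `out.write()` rather than using the Python 2 print chevrons, and `LineCounter` is a hypothetical class invented for the example.

```python
import io
import sys


def text_report(lines, out=sys.stdout):
    """Write each line to any object that has a write() method."""
    for line in lines:
        out.write(line + '\n')


class LineCounter(object):
    """Hypothetical file-like object: counts lines instead of storing them."""

    def __init__(self):
        self.lines = 0

    def write(self, text):
        self.lines += text.count('\n')


lines = ['In /about', '- src: logo.png', '  alt is MISSING']

# An in-memory buffer captures the report as a string.
buf = io.StringIO()
text_report(lines, out=buf)
captured = buf.getvalue()

# Anything with a write() method will do, even if it is not a real file.
counter = LineCounter()
text_report(lines, out=counter)
```

The report function never checks the type of `out`; it only relies on the one method it actually calls.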

   def html_report(self, out=sys.stdout):
      from cgi import escape
      print >>out, """<html>
   <head>
      <title>Image Report for %(base_url)s</title>
   </head>
   <body>
      <h1>Image report for %(base_url)s</h1>
      """ % {'base_url': escape(self.base_url)}
   
      for path, imgs in self.images():
         print >>out, '\t\t<h2>%s</h2>' % escape(path).encode('utf8')
         for src, alt, title in imgs:
            idict = {'src': escape(unicode(src)).encode('utf8'),
                'alt': escape(unicode(alt)).encode('utf8'),
                'title': escape(unicode(title)).encode('utf8')}
            print >>out, '\t\t<img src="%(src)s" alt="%(alt)s" />' % idict
            if alt is not None:
               print >>out, '\t\t<p><strong>alt:</strong> %(alt)s</p>' % idict
            else:
               print >>out, '\t\t<p><strong>alt is MISSING</strong></p>'
            if title is not None:
               print >>out, '\t\t<p><strong>title:</strong> %(title)s</p>' % idict
            print >>out
      print >>out, """   </body>
</html>
"""

This method is similar, but it also demonstrates a simple form of templating: the string formatting operator, %, can fill named placeholders such as %(src)s from a dictionary.
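In isolation, dictionary-based formatting looks like the sketch below. It is written for Python 3, where `html.escape` takes the place of the `cgi.escape` function used in the listing; the sample values are invented for the example.

```python
from html import escape

template = '<img src="%(src)s" alt="%(alt)s" />'

# Keys the template does not mention (here 'title') are simply ignored.
values = {
    'src': escape('logo.png'),
    'alt': escape('Tom & "Jerry"'),
    'title': escape('unused'),
}

html = template % values
```

Because `html.escape` also escapes double quotes by default, the result is safe to place inside a quoted attribute.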

Finally, there's the command-line interface to all this:

import sys
from optparse import OptionParser
   
op = OptionParser()
op.add_option('-f', '--format', choices=['text', 'html'])
op.add_option('-o', '--outfile')
   
options, args = op.parse_args()
   
if len(args) != 1:
   op.error('You must provide a site URL from which to spider images.')
   
s = ImageSpider(args[0])
   
if options.outfile:
   out = open(options.outfile, 'w')
else:
   out = sys.stdout
   
if options.format == 'html':
   s.html_report(out)
else:
   s.text_report(out)

In a few lines, the amazing optparse module turns a quick script into a flexible command-line tool.
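One way to try out the option handling without running the spider is to pass an explicit argument list to `parse_args`. The sketch below does that; the `build_parser` helper, the usage string and the `default='text'` setting are additions for this example rather than part of the original script.

```python
from optparse import OptionParser


def build_parser():
    """Build the same options as above, wrapped in a hypothetical helper."""
    op = OptionParser(usage='%prog [options] SITE_URL')
    op.add_option('-f', '--format', choices=['text', 'html'], default='text')
    op.add_option('-o', '--outfile')
    return op


# Parse an explicit list instead of sys.argv[1:].
options, args = build_parser().parse_args(
    ['-f', 'html', 'http://example.com/'])

# With no options given, the defaults apply.
defaults, _ = build_parser().parse_args(['http://example.com/'])
```

optparse still ships with Python 3, although newer scripts usually use argparse instead.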

Download the source: siteimages.py

Paypal with Django

Wednesday, October 10th, 2007

In a previous post I discussed the method I used to integrate Paypal's Encrypted Web Payments, in generic SSL terms that I hoped would make it easy to implement from scratch in any language. I've had a request from Ross Poulton to share the Python code that makes it work using the M2Crypto wrapper. So, here it is:

from M2Crypto import BIO, SMIME, X509
from django.conf import settings

class PaypalOrder(dict):
        """Acts as a dictionary which can be encrypted to Paypal's EWP service"""
        def __init__(self):
                dict.__init__(self)
                self['cert_id']=settings.MY_CERT_ID

        def setNotifyURL(self, notify_url):
                self['notify_url']=notify_url

        # snip more wrapper functions

        def plaintext(self):
                """The plaintext for the cryptography operation."""
                s=''
                for k in self:
                        s+=u'%s=%s\n'%(k,self[k])
                return s.encode('utf-8')

        __str__=plaintext

        def encrypt(self):
                """Return the contents of this order, encrypted to Paypal's
                certificate and signed using the private key
                configured in the Django settings."""

                # Instantiate an SMIME object.
                s = SMIME.SMIME()

                # Load signer's key and cert. Sign the buffer.
                s.load_key_bio(BIO.openfile(settings.MY_KEYPAIR), BIO.openfile(settings.MY_CERT))

                p7 = s.sign(BIO.MemoryBuffer(self.plaintext()), flags=SMIME.PKCS7_BINARY)

                # Load target cert to encrypt the signed message to.
                x509 = X509.load_cert_bio(BIO.openfile(settings.PAYPAL_CERT))
                sk = X509.X509_Stack()
                sk.push(x509)
                s.set_x509_stack(sk)

                # Set cipher: 3-key triple-DES in CBC mode.
                s.set_cipher(SMIME.Cipher('des_ede3_cbc'))

                # Create a temporary buffer.
                tmp = BIO.MemoryBuffer()

                # Write the signed message into the temporary buffer.
                p7.write_der(tmp)

                # Encrypt the temporary buffer.
                p7 = s.encrypt(tmp, flags=SMIME.PKCS7_BINARY)

                # Output p7 in mail-friendly format.
                out = BIO.MemoryBuffer()
                p7.write(out)

                return out.read()

The settings required are as follows:

MY_KEYPAIR='keys/keypair.pem'    #path to keypair in PEM format
MY_CERT='keys/merchant.crt'    #path to merchant certificate
MY_CERT_ID='ASDF12345'    # code which Paypal assign to the certificate when you upload it
PAYPAL_CERT='keys/paypal.crt'    #path to Paypal's own certificate 
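The signing and encryption steps need M2Crypto and real certificates, but the plaintext step can be tried on its own. The sketch below is a simplified Python 3 stand-in for `PaypalOrder.plaintext()`, without Django or M2Crypto; the `Order` class name and the notify URL are hypothetical.

```python
class Order(dict):
    """Simplified stand-in for PaypalOrder: only the plaintext step."""

    def plaintext(self):
        """Serialise the order as key=value lines, encoded as UTF-8 bytes."""
        lines = ['%s=%s\n' % (key, value) for key, value in self.items()]
        return ''.join(lines).encode('utf-8')


order = Order()
order['cert_id'] = 'ASDF12345'
order['notify_url'] = 'https://example.com/ipn'

# These bytes are what encrypt() would go on to sign and then encrypt.
data = order.plaintext()
```

The output order here follows insertion order, which dictionaries preserve from Python 3.7 onwards.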

Web apps need scriptable interfaces

Wednesday, November 1st, 2006

I was just working on a set of separate Joomla installations for a client today when I realised that I really needed to be able to run scripts against the different installations.

I was trying to install three different Mambots (one of Joomla's three types of extension) in about 8 installations of Joomla, each with different database configurations and paths. I started out with a Bash script that merely copied the plugin files into place. Then I realised that automating the whole operation would involve reading a configuration file in PHP syntax and running some MySQL queries based on it, so coding this would probably take longer than installing the plugins manually.

There are not very many web apps which have any kind of scriptable API. In fact, I only really know of Mailman, which is only partly a web application. But it's a feature I've used frequently in Mailman – there is a script bin/withlist which acquires locks and opens the list, allows you to modify the list as a Python object, and saves it on exit. Mailman provides a few CLI tools too which can be used in scripting but which are really only trivial examples of the power of the scriptable API.

When I began writing Mailhammer, my own announcement-only mailing list software, I took this scriptability even further, based on my positive experience with Mailman's scriptable API. All of the working parts are implemented in Python, and the PHP is just an HTML wrapper which opens and talks to a CLI Python script over pipes. This means that the PHP is kept extremely simple, the Python core is a very clean and simple API, and the CLI can do everything reliably. It's a cleanly divided implementation of an n-tier architecture. In practice, I only use the web interface for viewing the data already in the database. Consequently, that interface isn't very powerful – yet!
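The post does not show Mailhammer's code or its wire format, so the following is only a sketch of the general shape such a design could take. It assumes one JSON request per line on the pipe; the `MailingList` class, its methods and the `ALLOWED` set are all hypothetical. It is written for Python 3.

```python
import json


class MailingList(object):
    """Hypothetical core API; the real Mailhammer API is not shown."""

    def __init__(self):
        self.subscribers = set()

    def subscribe(self, address):
        self.subscribers.add(address)
        return sorted(self.subscribers)

    def count(self):
        return len(self.subscribers)


# Only these method names may be called through the pipe.
ALLOWED = {'subscribe', 'count'}


def handle(line, api):
    """Decode one JSON request line, call the method, encode the reply."""
    request = json.loads(line)
    name = request['method']
    if name not in ALLOWED:
        return json.dumps({'error': 'unknown method: %s' % name})
    result = getattr(api, name)(*request.get('args', []))
    return json.dumps({'result': result})


api = MailingList()
reply1 = handle('{"method": "subscribe", "args": ["a@example.com"]}', api)
reply2 = handle('{"method": "count"}', api)
reply3 = handle('{"method": "drop_table"}', api)
```

A CLI wrapper would then loop over `sys.stdin`, printing `handle(line, api)` for each line, while the web front end writes requests to the pipe and reads the replies back.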

Python is well suited to scriptable APIs – its interactive interpreter and neat object model mean that it's easy to perform arbitrary operations interactively on complex, persistent data structures. In PHP web applications it might be more feasible to build an XML-RPC interface of some kind and provide a command-line client.

I don't think scriptability is even considered as a potential feature in almost any web application I've tried; their operation is tied inextricably to their unique interfaces.

If you are developing a new web application, please ask yourself this: will administrators using your software want to be locked in to your pretty and easy-to-use interface, or will they end up cursing you for failing to provide them with power beyond what HTML can provide?

How I came to love developing in Python

Wednesday, October 4th, 2006

As I've implied previously, I find PHP a desperately bad language for developing web applications. Python is my current favourite; it is a joy to work with, both in writing code and in maintaining it. Using Python, I can develop web applications faster, and with more complexity, than I ever could with PHP.

There was a disaster a couple of years ago with PHP which was the reason my preference changed. PHP fell apart when it came to the crunch, but using Python I was able to rapidly pick up the pieces. I was developing an application which would display quite an extensive mortgage application form, collect the answers and print them back to the PDF, because the mortgage lender was still using a paper-based system.

I developed a system which read questions in XML. The asking of some questions could be predicated on the answers given to previous questions. This allowed me to omit questions which the original paper form didn't require, and this would mean that I could require valid values to all of the questions I asked.

I had written the system in PHP, as was our standard practice at the time. Obviously this required quite complex data structures; each question was an object, but the predication was effectively a parse tree which could be evaluted down to a single value: true, false or unknown. PHP makes this kind of work a huge nuisance. It's only got a SAX parser, which means you need your own stack to parse it, and when you're doing any data structure work in PHP you have to be very careful to keep references rather than copies, which means you have to insert & in every assignment and function spec, and you can't update the $v in foreach ($x as $k=>$v) – that's also a copy.
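To make the idea of collapsing a predicate tree to true, false or unknown concrete, here is a minimal Python 3 sketch. The nested-tuple representation, the operator names and the question identifiers are assumptions made for the example; the original system built its tree from objects parsed out of XML, and that code is not shown.

```python
def evaluate(node, answers):
    """Collapse a predicate tree to True, False or None (unknown)."""
    op = node[0]
    if op == 'equals':
        _, question, expected = node
        if question not in answers:
            return None          # not answered yet, so unknown
        return answers[question] == expected
    if op == 'not':
        value = evaluate(node[1], answers)
        return None if value is None else not value
    values = [evaluate(child, answers) for child in node[1:]]
    if op == 'and':
        if False in values:
            return False         # one false operand decides the result
        return None if None in values else True
    if op == 'or':
        if True in values:
            return True          # one true operand decides the result
        return None if None in values else False
    raise ValueError('unknown operator: %r' % (op,))


# Ask this question only if the applicant is employed and the
# application is not a joint one (hypothetical question ids).
tree = ('and',
        ('equals', 'employed', 'yes'),
        ('not', ('equals', 'joint', 'yes')))

unknown = evaluate(tree, {'employed': 'yes'})
no = evaluate(tree, {'employed': 'no'})
yes = evaluate(tree, {'employed': 'yes', 'joint': 'no'})
```

An unknown operand only makes the whole expression unknown when the other operands do not already decide it, which is why the second call returns False even though one question is unanswered.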

The system worked on my simple hand-drafted test data, which was much of the first page of the form, but it was extremely laborious to set up the XML source, because the questions needed coordinates from the PDF.

I stopped work on the web application and swapped over to writing a tool to generate the XML input from the original paper form, which we had in the form of a PDF. I wrote a Java tool which called on Ghostscript to render the PDF, and displayed a Swing and Java2D UI to draw the fields onto the page.

A week of programming and 3 days and 12 pages of questions later, I plugged a completed XML file into the PHP application, and… nothing. Blank page. Couldn't get any output from PHP at all. It turned out PHP was segfaulting serialising the data structure. This was an almost impossible situation to resolve; the gdb trace was useless, the project was running late, and PHP wasn't behaving in a deterministic way, making it impossible to debug.

The best solution I could think of was to rewrite the entire application in a language I trusted more than PHP, and Python, which I had been experimenting with, seemed appropriate. I already had a very basic framework for writing CGI applications in Python, and even though I didn't start with a session system, I was able to write one, transcribe the PHP into Python, and get it all up and running within about 2 hours, which I remain impressed with to this day.

As I worked, I found I could transcribe every PHP construct into Python quickly and more succinctly. I could simply omit the & nonsense as objects are always passed by reference. It's amazing to be able to look at a block of PHP code, recall what it does, and write one line of Python which can do the same thing, omitting all the hoops that PHP requires you to jump through to construct data structures.
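A short sketch of the behaviour being described: in Python, a loop variable or a function argument refers to the same object that the container holds, so updating it in place needs no special syntax. The question dictionaries here are invented for the example.

```python
questions = [
    {'id': 'q1', 'answer': None},
    {'id': 'q2', 'answer': None},
]

# The loop variable refers to the dict stored in the list, not a copy.
for q in questions:
    q['answer'] = 'yes'


def clear(question):
    """Mutate the caller's object; no reference marker is required."""
    question['answer'] = None


clear(questions[0])
```

Strictly speaking, Python passes references to objects by value, so rebinding `question` inside `clear` would not affect the caller, but mutating the object it refers to does.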