<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>From Accessibility to Zope &#187; Python</title>
	<atom:link href="http://blog.mauveweb.co.uk/category/python/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mauveweb.co.uk</link>
	<description>experiments in contemporary web development</description>
	<lastBuildDate>Fri, 30 Jul 2010 13:09:58 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Image spidering in Python</title>
		<link>http://blog.mauveweb.co.uk/2008/10/28/image-spidering-in-python/</link>
		<comments>http://blog.mauveweb.co.uk/2008/10/28/image-spidering-in-python/#comments</comments>
		<pubDate>Tue, 28 Oct 2008 20:51:22 +0000</pubDate>
		<dc:creator>mauve</dc:creator>
				<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://blog.mauveweb.co.uk/?p=196</guid>
		<description><![CDATA[I have several useful tools in Python for working with websites. Today I needed a script to report the images on a website, along with their corresponding alt attributes. The script was extremely quick to write using the available tools, which makes it a fairly good example of how powerful Python is.
I have based this [...]]]></description>
			<content:encoded><![CDATA[<p>I have several useful tools in Python for working with websites. Today I needed a script to report the images on a website, along with their corresponding alt attributes. The script was extremely quick to write using the available tools, which makes it a fairly good example of how powerful Python is.</p>
<p>I have based this script on a pre-existing webspider class I have written:</p>
<p><code><span style="color: #ff7700;font-weight:bold;">class</span> Spider<span style="color: black;">&#40;</span><span style="color: #008000;">object</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, base_url<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; <span style="color: #008000;">self</span>.<span style="color: black;">base_url</span> = base_url<br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">def</span> pages<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; <span style="color: #dc143c;">queue</span> = <span style="color: black;">&#91;</span><span style="color: #008000;">self</span>.<span style="color: black;">base_url</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; seen = <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">queue</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">while</span> <span style="color: #dc143c;">queue</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;url = <span style="color: #dc143c;">queue</span>.<span style="color: black;">pop</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;f = <span style="color: #dc143c;">urllib2</span>.<span style="color: black;">urlopen</span><span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">if</span> f.<span style="color: black;">info</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">gettype</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: black;">&#91;</span><span style="color: #483d8b;">'text/html'</span>, <span style="color: #483d8b;">'application/xhtml+xml'</span><span style="color: black;">&#93;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">continue</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;doc = ElementSoup.<span style="color: black;">parse</span><span style="color: black;">&#40;</span>f<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;doc.<span style="color: black;">make_links_absolute</span><span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">for</span> element, attribute, link, pos <span style="color: #ff7700;font-weight:bold;">in</span> doc.<span style="color: black;">iterlinks</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #ff7700;font-weight:bold;">not</span> link.<span style="color: black;">startswith</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">base_url</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">continue</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> element.<span style="color: black;">tag</span> == <span style="color: #483d8b;">'a'</span> <span style="color: #ff7700;font-weight:bold;">and</span> attribute == <span style="color: #483d8b;">'href'</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;l = <span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">'#.*$'</span>, <span style="color: #483d8b;">''</span>, link<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">if</span> l <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #ff7700;font-weight:bold;">in</span> seen:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #dc143c;">queue</span>.<span style="color: black;">append</span><span style="color: black;">&#40;</span>l<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; seen.<span style="color: black;">add</span><span style="color: black;">&#40;</span>l<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;path = url<span style="color: black;">&#91;</span><span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">base_url</span><span style="color: black;">&#41;</span>:<span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">yield</span> path, doc</code></p>
<p>This class effectively wraps a <strong>generator</strong> which yields a pair of path and parsed document for every web page it finds on the site. Generators are incredibly useful for keeping code simple without being memory hungry. It's easier to type <code>yield</code> than to build a list of items, but in this case it's better than that: this code returns one lxml ElementTree at a time, rather than reading and parsing them all up front.</p>
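<p>As a rough sketch of that lazy behaviour (written in Python 3 syntax, unlike the rest of this post; the <code>pages</code> function and the <code>fetched</code> list are made up for illustration and are not part of the spider):</p>

```python
def pages(urls, fetched):
    """Yield (url, document) pairs one at a time."""
    for url in urls:
        # Hypothetical stand-in for the download-and-parse step.
        fetched.append(url)
        yield url, 'document for ' + url


fetched = []
gen = pages(['/a', '/b', '/c'], fetched)

# Asking for one item runs the generator body only up to its first yield.
first = next(gen)
```

<p>After the call to <code>next()</code>, only the first URL has been processed. The remaining ones are handled when the caller asks for them.</p>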
<p>Generators encapsulate state as local variables, which generally means you don't even need to wrap them in a class like I've done. I only do this because I like to add functionality by subclassing. This may be a throwback to my days of programming Java.</p>
<p>It should be noted that most of the heavy lifting here is being done by lxml and BeautifulSoup. lxml.html makes it extremely easy to work with <acronym title="HyperText Markup Language">HTML</acronym>. BeautifulSoup's excellent broken-<acronym title="HyperText Markup Language">HTML</acronym> parser is used not because my <acronym title="HyperText Markup Language">HTML</acronym> demands it, but to allow this one script to work with any site I want to use it with.</p>
<p><code><span style="color: #ff7700;font-weight:bold;">class</span> ImageSpider<span style="color: black;">&#40;</span>Spider<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">def</span> images<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; seen = <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> path, doc <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">self</span>.<span style="color: black;">pages</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;imgs = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">for</span> img <span style="color: #ff7700;font-weight:bold;">in</span> doc.<span style="color: black;">findall</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'.//img'</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; src = img.<span style="color: black;">get</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'src'</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; alt = img.<span style="color: black;">get</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'alt'</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; title = img.<span style="color: black;">get</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'title'</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; i = <span style="color: black;">&#40;</span>src, alt, title<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> i <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #ff7700;font-weight:bold;">in</span> seen:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;seen.<span style="color: black;">add</span><span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;imgs.<span style="color: black;">append</span><span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">if</span> imgs:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">yield</span> path, imgs<br />
&nbsp; &nbsp; <br />
...</code></p>
<p>This is another generator that effectively filters the list of pages, yielding the list of images found within each page. Images already seen on an earlier page are skipped, so each distinct combination of src, alt and title is reported only once. Generators calling generators is again very elegant. Each time the caller asks for the next page of images, ImageSpider will go back to the original Spider for a new page until it has one with images.</p>
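<p>The same chaining pattern can be shown in miniature. This is only a sketch in Python 3 syntax, with a hard-coded <code>pages()</code> standing in for the real spider and plain strings standing in for the image tuples:</p>

```python
def pages():
    # Hypothetical stand-in for Spider.pages(): (path, image sources) pairs.
    yield '/', ['logo.png']
    yield '/about', []
    yield '/gallery', ['logo.png', 'cat.jpg']


def images(page_iter):
    """Yield (path, new images), skipping pages that add nothing new."""
    seen = set()
    for path, srcs in page_iter:
        new = []
        for src in srcs:
            if src not in seen:
                seen.add(src)
                new.append(src)
        if new:
            yield path, new


result = list(images(pages()))
```

<p>The page with no images and the repeated logo are both dropped, leaving one entry for <code>/</code> and one for <code>/gallery</code>.</p>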
<p><code><span style="color: #ff7700;font-weight:bold;">def</span> text_report<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, out=<span style="color: #dc143c;">sys</span>.<span style="color: black;">stdout</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> path, imgs <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">self</span>.<span style="color: black;">images</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">print</span> &gt;&gt;out, <span style="color: #483d8b;">'In'</span>, path<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">for</span> src, alt, title <span style="color: #ff7700;font-weight:bold;">in</span> imgs:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> &gt;&gt;out, <span style="color: #483d8b;">'- src:'</span>, src<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> alt <span style="color: #ff7700;font-weight:bold;">is</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #008000;">None</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">print</span> &gt;&gt;out, <span style="color: #483d8b;">'&nbsp; alt:'</span>, alt<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">print</span> &gt;&gt;out, <span style="color: #483d8b;">'&nbsp; alt is MISSING'</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> title <span style="color: #ff7700;font-weight:bold;">is</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #008000;">None</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">print</span> &gt;&gt;out, <span style="color: #483d8b;">'&nbsp; title:'</span>, title<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">print</span> &gt;&gt;out</code></p>
<p>Other methods of ImageSpider generate reports. Here I use the handy print chevron syntax (<code>print &gt;&gt;out</code>) to write to any file-like object. File-like objects are a particularly handy piece of duck typing: anything with a <code>write()</code> method will do. By default these methods write to stdout, which is the same as printing normally, but you can pass in any other file-like object for very simple redirection.</p>
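<p>To illustrate the redirection, here is a cut-down sketch of a report function writing into an in-memory buffer. It uses Python 3, where the chevron form is replaced by <code>print(..., file=out)</code>. The function takes a plain list of (src, alt) pairs, which is a simplification of the method above:</p>

```python
import io
import sys


def text_report(images, out=sys.stdout):
    """Write a short report to any object with a write() method."""
    for src, alt in images:
        print('- src:', src, file=out)
        if alt is not None:
            print('  alt:', alt, file=out)
        else:
            print('  alt is MISSING', file=out)


# A StringIO buffer is file-like, so it can stand in for stdout or a real file.
buf = io.StringIO()
text_report([('logo.png', 'Logo'), ('spacer.gif', None)], out=buf)
report = buf.getvalue()
```

<p>The same function would write to a file opened with <code>open()</code>, or to stdout if <code>out</code> is left at its default.</p>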
<p><code><span style="color: #ff7700;font-weight:bold;">def</span> html_report<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, out=<span style="color: #dc143c;">sys</span>.<span style="color: black;">stdout</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">cgi</span> <span style="color: #ff7700;font-weight:bold;">import</span> escape<br />
&nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> &gt;&gt;out, <span style="color: #483d8b;">&quot;&quot;</span><span style="color: #483d8b;">&quot;&lt;html&gt;<br />
&nbsp; &nbsp;&lt;head&gt;<br />
&nbsp; &nbsp; &nbsp; &lt;title&gt;Image Report for %(base_url)s&lt;/title&gt;<br />
&nbsp; &nbsp;&lt;/head&gt;<br />
&nbsp; &nbsp;&lt;body&gt;<br />
&nbsp; &nbsp; &nbsp; &lt;h1&gt;Image report for %(base_url)s&lt;/h1&gt;<br />
&nbsp; &nbsp; &nbsp; &quot;</span><span style="color: #483d8b;">&quot;&quot;</span> % <span style="color: black;">&#123;</span><span style="color: #483d8b;">'base_url'</span>: escape<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">base_url</span><span style="color: black;">&#41;</span><span style="color: black;">&#125;</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> path, imgs <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">self</span>.<span style="color: black;">images</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">print</span> &gt;&gt;out, <span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\t</span><span style="color: #000099; font-weight: bold;">\t</span>&lt;h2&gt;%s&lt;/h2&gt;'</span> % escape<span style="color: black;">&#40;</span>path<span style="color: black;">&#41;</span>.<span style="color: black;">encode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'utf8'</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">for</span> src, alt, title <span style="color: #ff7700;font-weight:bold;">in</span> imgs:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># Pass quote=True so double quotes are escaped too: src and alt go into attributes.</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; idict = <span style="color: black;">&#123;</span><span style="color: #483d8b;">'src'</span>: escape<span style="color: black;">&#40;</span><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span>src<span style="color: black;">&#41;</span>, <span style="color: #008000;">True</span><span style="color: black;">&#41;</span>.<span style="color: black;">encode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'utf8'</span><span style="color: black;">&#41;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'alt'</span>: escape<span style="color: black;">&#40;</span><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span>alt<span style="color: black;">&#41;</span>, <span style="color: #008000;">True</span><span style="color: black;">&#41;</span>.<span style="color: black;">encode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'utf8'</span><span style="color: black;">&#41;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'title'</span>: escape<span style="color: black;">&#40;</span><span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span>title<span style="color: black;">&#41;</span>, <span style="color: #008000;">True</span><span style="color: black;">&#41;</span>.<span style="color: black;">encode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'utf8'</span><span style="color: black;">&#41;</span><span style="color: black;">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> &gt;&gt;out, <span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\t</span><span style="color: #000099; font-weight: bold;">\t</span>&lt;img src=&quot;%(src)s&quot; alt=&quot;%(alt)s&quot; /&gt;'</span> % idict<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> alt <span style="color: #ff7700;font-weight:bold;">is</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #008000;">None</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">print</span> &gt;&gt;out, <span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\t</span><span style="color: #000099; font-weight: bold;">\t</span>&lt;p&gt;&lt;strong&gt;alt:&lt;/strong&gt; %(alt)s&lt;/p&gt;'</span> % idict<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">print</span> &gt;&gt;out, <span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\t</span><span style="color: #000099; font-weight: bold;">\t</span>&lt;p&gt;&lt;strong&gt;alt is MISSING&lt;/strong&gt;&lt;/p&gt;'</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> title <span style="color: #ff7700;font-weight:bold;">is</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #008000;">None</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">print</span> &gt;&gt;out, <span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\t</span><span style="color: #000099; font-weight: bold;">\t</span>&lt;p&gt;&lt;strong&gt;title:&lt;/strong&gt; %(title)s&lt;/p&gt;'</span> % idict<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> &gt;&gt;out<br />
&nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">print</span> &gt;&gt;out, <span style="color: #483d8b;">&quot;&quot;</span><span style="color: #483d8b;">&quot;&nbsp; &nbsp;&lt;/body&gt;<br />
&lt;/html&gt;<br />
&quot;</span><span style="color: #483d8b;">&quot;&quot;</span></code></p>
<p>This method is similar, but it also demonstrates a simple form of templating: the string formatting operator, %, can fill named placeholders such as <code>%(src)s</code> with values from a dictionary.</p>
<p>Finally, there's the command-line interface to all this:</p>
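<p>A minimal sketch of that dictionary-based formatting on its own, using a made-up summary line rather than the report markup:</p>

```python
# Each %(name)s or %(name)d placeholder is looked up by key in the dictionary.
template = '%(path)s: %(count)d images, %(missing)d without alt text'
line = template % {'path': '/gallery', 'count': 12, 'missing': 3}
```

<p>Because the placeholders are named, the same value can be used several times in one template, and the dictionary may contain keys the template never uses.</p>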
<p><code><span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">optparse</span> <span style="color: #ff7700;font-weight:bold;">import</span> OptionParser<br />
&nbsp; &nbsp; <br />
op = OptionParser<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
op.<span style="color: black;">add_option</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'-f'</span>, <span style="color: #483d8b;">'--format'</span>, choices=<span style="color: black;">&#91;</span><span style="color: #483d8b;">'text'</span>, <span style="color: #483d8b;">'html'</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
op.<span style="color: black;">add_option</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'-o'</span>, <span style="color: #483d8b;">'--outfile'</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <br />
options, args = op.<span style="color: black;">parse_args</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <br />
<span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>args<span style="color: black;">&#41;</span> != <span style="color: #ff4500;">1</span>:<br />
&nbsp; &nbsp;op.<span style="color: black;">error</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'You must provide a site <acronym title="Uniform Resource Locator">URL</acronym> from which to spider images.'</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <br />
s = ImageSpider<span style="color: black;">&#40;</span>args<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <br />
<span style="color: #ff7700;font-weight:bold;">if</span> options.<span style="color: black;">outfile</span>:<br />
&nbsp; &nbsp;out = <span style="color: #008000;">open</span><span style="color: black;">&#40;</span>options.<span style="color: black;">outfile</span>, <span style="color: #483d8b;">'w'</span><span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp;out = <span style="color: #dc143c;">sys</span>.<span style="color: black;">stdout</span><br />
&nbsp; &nbsp; <br />
<span style="color: #ff7700;font-weight:bold;">if</span> options.<span style="color: black;">format</span> == <span style="color: #483d8b;">'html'</span>:<br />
&nbsp; &nbsp;s.<span style="color: black;">html_report</span><span style="color: black;">&#40;</span>out<span style="color: black;">&#41;</span><br />
<span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp;s.<span style="color: black;">text_report</span><span style="color: black;">&#40;</span>out<span style="color: black;">&#41;</span></code></p>
<p>In a few lines, the amazing <code>optparse</code> module turns a quick script into a flexible command-line tool.</p>
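<p>To see what the parser hands back without running the script from a shell, <code>parse_args()</code> can be given an explicit argument list. This sketch assumes the same two options as above; the <code>default='text'</code> setting, the argument values and the URL are invented for the example:</p>

```python
from optparse import OptionParser

op = OptionParser(usage='%prog [options] SITE_URL')
# Supplying choices makes optparse reject any other value for --format.
op.add_option('-f', '--format', choices=['text', 'html'], default='text')
op.add_option('-o', '--outfile')

# Passing a list here replaces the default of reading sys.argv[1:].
options, args = op.parse_args(
    ['-f', 'html', '-o', 'report.html', 'http://example.com/'])
```

<p>Options end up as attributes on <code>options</code>, and anything left over, here the site URL, is returned in <code>args</code>.</p>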
<p><a href='http://blog.mauveweb.co.uk/wp-content/uploads/2008/10/siteimages.py'>Download the source: siteimages.py</a></p>]]></content:encoded>
			<wfw:commentRss>http://blog.mauveweb.co.uk/2008/10/28/image-spidering-in-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Paypal with Django</title>
		<link>http://blog.mauveweb.co.uk/2007/10/10/paypal-with-django/</link>
		<comments>http://blog.mauveweb.co.uk/2007/10/10/paypal-with-django/#comments</comments>
		<pubDate>Wed, 10 Oct 2007 22:13:06 +0000</pubDate>
		<dc:creator>mauve</dc:creator>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[e-Commerce]]></category>

		<guid isPermaLink="false">http://blog.mauveweb.co.uk/2007/10/10/paypal-with-django/</guid>
		<description><![CDATA[In a previous post I discussed the method I used to integrate Paypal's Encrypted Web Payments in generic SSL terms I hoped would make it easy to implement from scratch in any language. I've had a request from Ross Poulton to share the Python code that makes it work using the M2Crypto wrapper. So, here [...]]]></description>
			<content:encoded><![CDATA[<p>In a <a href="/2007/06/14/paypal-encrypted-web-payments/">previous post</a> I described how I integrated PayPal's Encrypted Web Payments, using generic <acronym title="Secure Sockets Layer">SSL</acronym> terms that I hoped would make it easy to implement from scratch in any language. I've had a request from Ross Poulton to share the Python code that makes it work using the M2Crypto wrapper. So, here it is:</p>
<pre><code><span style="color: #ff7700;font-weight:bold;">from</span> M2Crypto <span style="color: #ff7700;font-weight:bold;">import</span> BIO, <acronym title="Secure Multipurpose Internet Mail Extensions">SMIME</acronym>, X509
<span style="color: #ff7700;font-weight:bold;">from</span> django.<span style="color: black;">conf</span> <span style="color: #ff7700;font-weight:bold;">import</span> settings

<span style="color: #ff7700;font-weight:bold;">class</span> PaypalOrder<span style="color: black;">&#40;</span><span style="color: #008000;">dict</span><span style="color: black;">&#41;</span>:
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">&quot;&quot;</span><span style="color: #483d8b;">&quot;Acts as a dictionary which can be encrypted to Paypal's <acronym title="(PayPal) Encrypted Web Payments">EWP</acronym> service&quot;</span><span style="color: #483d8b;">&quot;&quot;</span>
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">dict</span>.<span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">self</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'cert_id'</span><span style="color: black;">&#93;</span>=settings.<span style="color: black;">MY_CERT_ID</span>

&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> setNotifyURL<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, notify_url<span style="color: black;">&#41;</span>:
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #008000;">self</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'notify_url'</span><span style="color: black;">&#93;</span>=notify_url

&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># snip more wrapper functions</span>

&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> plaintext<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">&quot;&quot;</span><span style="color: #483d8b;">&quot;The plaintext for the cryptography operation.&quot;</span><span style="color: #483d8b;">&quot;&quot;</span>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; s=<span style="color: #483d8b;">''</span>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> k <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">self</span>:
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; s+=u<span style="color: #483d8b;">'%s=%s<span style="color: #000099; font-weight: bold;">\n</span>'</span>%<span style="color: black;">&#40;</span>k,<span style="color: #008000;">self</span><span style="color: black;">&#91;</span>k<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> s.<span style="color: black;">encode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'utf-8'</span><span style="color: black;">&#41;</span>

&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #0000cd;">__str__</span>=plaintext

&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">def</span> encrypt<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">&quot;&quot;</span><span style="color: #483d8b;">&quot;Return the contents of this order, encrypted to Paypal's
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; certificate and signed using the private key
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; configured in the Django settings.&quot;</span><span style="color: #483d8b;">&quot;&quot;</span>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># Instantiate an <acronym title="Secure Multipurpose Internet Mail Extensions">SMIME</acronym> object.</span>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; s = <acronym title="Secure Multipurpose Internet Mail Extensions">SMIME</acronym>.<span style="color: black;"><acronym title="Secure Multipurpose Internet Mail Extensions">SMIME</acronym></span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># Load signer's key and cert. Sign the buffer.</span>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; s.<span style="color: black;">load_key_bio</span><span style="color: black;">&#40;</span>BIO.<span style="color: black;">openfile</span><span style="color: black;">&#40;</span>settings.<span style="color: black;">MY_KEYPAIR</span><span style="color: black;">&#41;</span>, BIO.<span style="color: black;">openfile</span><span style="color: black;">&#40;</span>settings.<span style="color: black;">MY_CERT</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; p7 = s.<span style="color: black;">sign</span><span style="color: black;">&#40;</span>BIO.<span style="color: black;">MemoryBuffer</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">plaintext</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>, flags=<acronym title="Secure Multipurpose Internet Mail Extensions">SMIME</acronym>.<span style="color: black;">PKCS7_BINARY</span><span style="color: black;">&#41;</span>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># Load target cert to encrypt the signed message to.</span>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; x509 = X509.<span style="color: black;">load_cert_bio</span><span style="color: black;">&#40;</span>BIO.<span style="color: black;">openfile</span><span style="color: black;">&#40;</span>settings.<span style="color: black;">PAYPAL_CERT</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sk = X509.<span style="color: black;">X509_Stack</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sk.<span style="color: black;">push</span><span style="color: black;">&#40;</span>x509<span style="color: black;">&#41;</span>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; s.<span style="color: black;">set_x509_stack</span><span style="color: black;">&#40;</span>sk<span style="color: black;">&#41;</span>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># Set cipher: 3-key triple-<acronym title="Data Encryption Standard">DES</acronym> in <acronym title="Cipher-Block Chaining">CBC</acronym> mode.</span>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; s.<span style="color: black;">set_cipher</span><span style="color: black;">&#40;</span><acronym title="Secure Multipurpose Internet Mail Extensions">SMIME</acronym>.<span style="color: black;">Cipher</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'des_ede3_cbc'</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># Create a temporary buffer.</span>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; tmp = BIO.<span style="color: black;">MemoryBuffer</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># Write the signed message into the temporary buffer.</span>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; p7.<span style="color: black;">write_der</span><span style="color: black;">&#40;</span>tmp<span style="color: black;">&#41;</span>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># Encrypt the temporary buffer.</span>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; p7 = s.<span style="color: black;">encrypt</span><span style="color: black;">&#40;</span>tmp, flags=<acronym title="Secure Multipurpose Internet Mail Extensions">SMIME</acronym>.<span style="color: black;">PKCS7_BINARY</span><span style="color: black;">&#41;</span>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># Output p7 in mail-friendly format.</span>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; out = BIO.<span style="color: black;">MemoryBuffer</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; p7.<span style="color: black;">write</span><span style="color: black;">&#40;</span>out<span style="color: black;">&#41;</span>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> out.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></code></pre>
<p>The settings required are as follows:</p>
<pre><code>MY_KEYPAIR=<span style="color: #483d8b;">'keys/keypair.pem'</span>&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#path to keypair in <acronym title="Privacy Enhanced Mail">PEM</acronym> format</span>
MY_CERT=<span style="color: #483d8b;">'keys/merchant.crt'</span>&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#path to merchant certificate</span>
MY_CERT_ID=<span style="color: #483d8b;">'ASDF12345'</span>&nbsp; &nbsp; <span style="color: #808080; font-style: italic;"># code which Paypal assign to the certificate when you upload it</span>
PAYPAL_CERT=<span style="color: #483d8b;">'keys/paypal.crt'</span>&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">#path to Paypal's own certificate </span></code></pre>]]></content:encoded>
			<wfw:commentRss>http://blog.mauveweb.co.uk/2007/10/10/paypal-with-django/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Web apps need scriptable interfaces</title>
		<link>http://blog.mauveweb.co.uk/2006/11/01/web-apps-need-scriptable-interfaces/</link>
		<comments>http://blog.mauveweb.co.uk/2006/11/01/web-apps-need-scriptable-interfaces/#comments</comments>
		<pubDate>Wed, 01 Nov 2006 00:59:06 +0000</pubDate>
		<dc:creator>mauve</dc:creator>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Apps]]></category>

		<guid isPermaLink="false">http://blog.mauveweb.co.uk/2006/11/01/web-apps-need-scriptable-interfaces/</guid>
		<description><![CDATA[I was just working on a set of separate Joomla installations for a client today when I realised that I really needed to be able to run scripts against the different installations.
I was trying to install three different Mambots (one of Joomla's three different types of extensions) in about 8 installations of Joomla &#8211; each [...]]]></description>
			<content:encoded><![CDATA[<p>I was just working on a set of separate <a href="http://joomla.org/">Joomla</a> installations for a client today when I realised that I really needed to be able to run scripts against the different installations.</p>
<p>I was trying to install three different Mambots (one of Joomla's three different types of extensions) in about 8 installations of Joomla &#8211; each with different database configurations and paths. I started out with a Bash script to merely copy the plugin files into place, but then realised that automating the whole operation would involve reading a configuration file in <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> syntax and performing some MySQL queries with it, so coding this would probably take longer than installing the plugins manually.</p>
<p>There are not very many web apps which have any kind of scriptable <acronym title="	Application Programming Interface">API</acronym>. In fact, I only really know of <a href="http://www.gnu.org/software/mailman/">Mailman</a>, which is only partly a web application. But it's a feature I've used frequently in Mailman &#8211; there is a script <code>bin/withlist</code> which acquires locks and opens the list, allows you to modify the list as a Python object, and saves it on exit. Mailman provides a few <acronym title="Command-Line Interface">CLI</acronym> tools too which can be used in scripting but which are really only trivial examples of the power of the scriptable <acronym title="	Application Programming Interface">API</acronym>.</p>
<p>When I began writing <a href="http://www.mauveinternet.co.uk/products.xml">Mailhammer</a>, my own announcement-only mailing list software, I took this scriptability even further, based on my positive experience with Mailman's scriptable <acronym title="	Application Programming Interface">API</acronym>. All of the working parts are implemented in Python, and the <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> is just an <acronym title="HyperText Markup Language">HTML</acronym> wrapper which opens and talks to a <acronym title="Command-Line Interface">CLI</acronym> Python script over pipes. This means that the <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> is kept extremely simple, that the Python core exposes a very clean and simple <acronym title="	Application Programming Interface">API</acronym>, and that the <acronym title="Command-Line Interface">CLI</acronym> can do everything reliably. It's a cleanly divided implementation of an <em>n</em>-tier architecture. In practice, I only use the web interface for viewing the data already in the database. Consequently, that interface isn't very powerful &#8211; yet!</p>
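<p>As a rough illustration of that split (a minimal sketch, not Mailhammer's actual code), the Python side can be a loop that reads one command per line from its input pipe and writes one reply per line to its output pipe. The <code>MailingList</code> class, the <code>serve</code> function and the JSON line format are all assumptions made for this example.</p>

```python
import io
import json


class MailingList:
    """Hypothetical stand-in for the Python core; not Mailhammer's real API."""

    def __init__(self):
        self.subscribers = set()

    def subscribe(self, address):
        self.subscribers.add(address)
        return sorted(self.subscribers)

    def unsubscribe(self, address):
        self.subscribers.discard(address)
        return sorted(self.subscribers)


def serve(core, infile, outfile):
    """Read one JSON command per line and write one JSON reply per line."""
    for line in infile:
        line = line.strip()
        if not line:
            continue
        try:
            request = json.loads(line)
            name = request["method"]
            if name.startswith("_"):
                # Refuse private attributes so only the public API is reachable.
                raise AttributeError(name)
            method = getattr(core, name)
            reply = {"ok": True, "result": method(*request.get("args", []))}
        except Exception as exc:
            # Report the error to the caller instead of killing the process.
            reply = {"ok": False, "error": repr(exc)}
        outfile.write(json.dumps(reply) + "\n")
        outfile.flush()


# A front end would attach its pipes here, e.g. serve(core, sys.stdin, sys.stdout).
# In-memory streams stand in for the pipes so the sketch runs on its own.
requests = io.StringIO('{"method": "subscribe", "args": ["a@example.com"]}\n')
replies = io.StringIO()
serve(MailingList(), requests, replies)
```

<p>Because the front end only ever passes lines of text down a pipe, the same script can be driven by hand or from a shell script, which is where the scriptability comes from.</p>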
<p>Python is well-suited to scriptable <acronym title="	Application Programming Interface">APIs</acronym> &#8211; its interactive interpreter and neat object model mean that it's easy to perform arbitrary operations interactively on complex, persistent data structures. In <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> web applications it might be more feasible to build an <acronym title="eXtensible Markup Language">XML</acronym>-<acronym title="Remote Procedure Call">RPC</acronym> interface of some kind and provide a command-line client.</p>
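<p>To give an idea of how small such an interface can be, here is a hedged sketch using the XML-RPC modules in Python 3's standard library. The <code>list_plugins</code> function and its return value are hypothetical; in the scenario above the server half would be implemented inside the PHP application, and only the client half would live in a command-line tool.</p>

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def list_plugins():
    """Hypothetical admin operation; a real application would query its own data."""
    return ["example-plugin"]


# Port 0 asks the OS for any free port; logRequests=False keeps the output quiet.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(list_plugins)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The command-line client side: call the remote function like a local one.
host, port = server.server_address
client = ServerProxy("http://%s:%d/" % (host, port))
plugins = client.list_plugins()

# Stop the demonstration server again.
server.shutdown()
server.server_close()
```

<p>The server runs in a background thread here only so that the sketch is self-contained; a real client would simply point <code>ServerProxy</code> at the application's own URL.</p>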
<p>I don't think that scriptability was considered as even a <em>potential</em> feature in almost any web application I've tried; their operation is tied inextricably to their unique interfaces.</p>
<p>For anybody developing a new web application please ask yourself this: will administrators using your software want to be locked in to your pretty and easy-to-use interface, or will they end up cursing you for failing to provide them with power beyond what <acronym title="HyperText Markup Language">HTML</acronym> can provide?</p>]]></content:encoded>
			<wfw:commentRss>http://blog.mauveweb.co.uk/2006/11/01/web-apps-need-scriptable-interfaces/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>How I came to love developing in Python</title>
		<link>http://blog.mauveweb.co.uk/2006/10/04/how-i-came-to-love-developing-in-python/</link>
		<comments>http://blog.mauveweb.co.uk/2006/10/04/how-i-came-to-love-developing-in-python/#comments</comments>
		<pubDate>Wed, 04 Oct 2006 16:11:05 +0000</pubDate>
		<dc:creator>mauve</dc:creator>
				<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://blog.mauveweb.co.uk/2006/10/04/why-i-develop-in-python/</guid>
		<description><![CDATA[As I've implied previously, I find PHP a desperately bad language for developing web applications. Python is my current favourite; it is a joy to work with both in writing code and maintaining code. Using Python, I can develop web applications faster, and with more complexity, than I ever could with PHP.
There was a disaster [...]]]></description>
			<content:encoded><![CDATA[<p>As I've implied previously, I find <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> a desperately bad language for developing web applications. Python is my current favourite; it is a joy to work with both in writing code and maintaining code. Using Python, I can develop web applications faster, and with more complexity, than I ever could with <acronym title="PHP: Hypertext Preprocessor">PHP</acronym>.</p>
<p>There was a disaster a couple of years ago with <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> which was the reason my preference changed. <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> fell apart when it came to the crunch, but using Python I was able to rapidly pick up the pieces. I was developing an application which would display quite an extensive mortgage application form, collect the answers and print them back to the <acronym title="Portable Document Format">PDF</acronym>, because the mortgage lender was still using a paper-based system.</p>
<p>I developed a system which read questions in <acronym title="eXtensible Markup Language">XML</acronym>. The asking of some questions could be predicated on the answers given to previous questions. This allowed me to omit questions which the original paper form didn't require, which in turn meant that I could <em>require</em> valid values for all of the questions I did ask.</p>
<p>I had written the system in <acronym title="PHP: Hypertext Preprocessor">PHP</acronym>, as was our standard practice at the time. Obviously this required quite complex data structures; each question was an object, but the predication was effectively a parse tree which could be evaluated &#8211; collapsed to a single value: true, false or unknown. <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> makes this kind of work a huge nuisance. It's only got a <acronym title="Simple API for XML">SAX</acronym> parser, which means you need your own stack to parse the <acronym title="eXtensible Markup Language">XML</acronym>. When you're doing <em>any</em> data structure work in <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> you have to be very careful to keep references rather than copies, which means you have to insert <code>&amp;</code> in every assignment and function spec, and you can't update the <code><span style="color: #0000ff;">$v</span></code> in <code><span style="color: #b1b100;">foreach</span> <span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">$x</span> <span style="color: #b1b100;">as</span> <span style="color: #0000ff;">$k</span>=&gt;<span style="color: #0000ff;">$v</span><span style="color: #66cc66;">&#41;</span></code> &#8211; that's also a copy.</p>
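<p>To make the idea of predication concrete, here is a minimal sketch of such a tree, written in Python rather than the original <acronym title="PHP: Hypertext Preprocessor">PHP</acronym>. It is not the code from the project: the class names, the question names and the use of <code>None</code> for "unknown" are assumptions for illustration.</p>

```python
# Three-valued logic: True, False, or None for "unknown" (not answered yet).


class Answered:
    """Leaf node: was the named question answered with the expected value?"""

    def __init__(self, question, expected):
        self.question = question
        self.expected = expected

    def evaluate(self, answers):
        if self.question not in answers:
            return None
        return answers[self.question] == self.expected


class Not:
    def __init__(self, child):
        self.child = child

    def evaluate(self, answers):
        value = self.child.evaluate(answers)
        return None if value is None else not value


class And:
    def __init__(self, *children):
        self.children = children

    def evaluate(self, answers):
        values = [child.evaluate(answers) for child in self.children]
        if False in values:
            return False  # one false branch decides the result
        if None in values:
            return None  # nothing false, but something still unknown
        return True


class Or:
    def __init__(self, *children):
        self.children = children

    def evaluate(self, answers):
        values = [child.evaluate(answers) for child in self.children]
        if True in values:
            return True  # one true branch decides the result
        if None in values:
            return None
        return False


# Hypothetical rule: ask about a spouse only for married, joint applicants.
ask_about_spouse = And(
    Answered("married", "yes"),
    Not(Answered("applying_alone", "yes")),
)

nothing_answered = ask_about_spouse.evaluate({})
not_married = ask_about_spouse.evaluate({"married": "no"})
joint_application = ask_about_spouse.evaluate(
    {"married": "yes", "applying_alone": "no"}
)
```

<p>The tree collapses to a single value as soon as enough answers are known: a false branch settles an <code>And</code> even while other branches are still unknown, so a question can be skipped without waiting for every earlier answer.</p>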
<p>The system worked on my simple hand-drafted test data, which was much of the first page of the form, but it was extremely laborious to set up the <acronym title="eXtensible Markup Language">XML</acronym> source, because the questions needed coordinates from the <acronym title="Portable Document Format">PDF</acronym>.</p>
<p>I stopped work on the web application and swapped over to writing a tool to generate the <acronym title="eXtensible Markup Language">XML</acronym> input from the original paper form, which we had in the form of a <acronym title="Portable Document Format">PDF</acronym>. I wrote a Java tool which called on Ghostscript to render the <acronym title="Portable Document Format">PDF</acronym>, and displayed a Swing and Java2D <acronym title="User Interface">UI</acronym> to draw the fields onto the page.</p>
<p>A week of programming and 3 days and 12 pages of questions later, I plugged a completed <acronym title="eXtensible Markup Language">XML</acronym> file into the <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> application, and&#8230; nothing. Blank page. Couldn't get any output from <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> at all. It turned out <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> was segfaulting while serialising the data structure. This was an almost impossible situation to resolve; the gdb trace was useless, the project was running late, and <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> wasn't behaving in a deterministic way, which made the crash very hard to debug.</p>
<p>The best solution I could think of was to rewrite the entire application in a language I trusted more than <acronym title="PHP: Hypertext Preprocessor">PHP</acronym>, and Python, which I had been experimenting with, seemed appropriate. I already had a very basic framework for writing <acronym title="Common Gateway Interface">CGI</acronym> applications in Python, and even though I didn't start with a session system, I was able to write one, transcribe the <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> into Python, and get it all up and running within about 2 hours, which I remain impressed with to this day.</p>
<p>As I worked, I found I could transcribe every <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> construct into Python quickly and more succinctly. I could simply omit the <code>&amp;</code> nonsense, as Python assignments and function calls always pass references to objects, never copies. It's amazing to be able to look at a block of <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> code, recall what it does, and write one line of Python which can do the same thing, omitting all the hoops that <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> requires you to jump through to construct data structures.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.mauveweb.co.uk/2006/10/04/how-i-came-to-love-developing-in-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
