Profanity

January 28th, 2009

The web has never responded very well to censorship. So much of the web is about freedom of expression that whenever someone tries to express himself, and is prevented from doing so, he feels disenfranchised. That applies even more so in the case of the Scunthorpe problem, because people who weren't trying to swear in the first place feel much more aggrieved.

On the other hand, website owners do not want their image damaged by users who can't keep their potty mouths shut.

When developing sites that allow users a voice, we need to find ways to protect the website owners, or the atmosphere of a community, without damaging the goodwill of the user base. Any website that depends on user input, and which doesn't have any users, is a failure.

Profanity filtering is not the answer; at least, I've never seen it done well enough to be both comprehensive and unintrusive. All problems that relate to processing natural language are extremely complicated. We have barely started to scratch the surface in terms of parsing English text, let alone extracting the semantics from it that we would need to determine if a word is offensive. So any attempt at a naïve profanity filter is doomed to failure. For example, you can be profane without being offensive:

She turned round and screamed, "Fuck off, you stuck-up bitch". I was appalled!

You're a grumpy old bastard, but I love you.

and you can be offensive without being profane:

I did your mum last night. She's fatter than a blue whale, but she knows a trick or two. Your sister does too actually.

and let's not forget the cases where you can't tell:

Do you have a cock or do you just keep hens? Oh, we have a big gold cock. You know, the pussy is afraid of him!

Ok, the last example is contrived and of course nobody would type it with a straight face.  Still, in the right context, it's innuendo not profanity.
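To make the limits concrete, here is a minimal sketch of the two obvious word-list filters. The blocklist, the function names and the sample sentences are illustrative assumptions, not a filter anyone should deploy.

```python
import re

# Hypothetical blocklist, drawn from words used in the examples above.
BLOCKLIST = {"bastard", "bitch", "cock"}


def substring_filter(text):
    """Flag text if a blocked word appears anywhere, even inside another word."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)


def whole_word_filter(text):
    """Flag text only if a blocked word appears as a complete word."""
    words = re.findall(r"[a-z]+", text.lower())
    return any(word in BLOCKLIST for word in words)


samples = [
    "Mr Hancock photographed a peacock.",            # innocent
    "You're a grumpy old bastard, but I love you.",  # profane but affectionate
    "She's fatter than a blue whale.",               # offensive but not profane
]

for sample in samples:
    print(substring_filter(sample), whole_word_filter(sample), sample)
```

The substring filter trips over the innocent first sentence (the Scunthorpe problem in miniature), the whole-word filter still flags the affectionate second one, and neither notices the third.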

With those insurmountable problems, there's simply no substitute for a human keeping an eye on things. However, even with moderation, there are problems to face. Exactly what is acceptable? Moderators can easily pronounce on clear-cut cases of abusiveness or offensiveness, but people have different sensibilities as to what's acceptable. It's also fairly easy for moderators to miss the odd bit of abuse, especially if it's only offensive in some contexts.

One trick to help keep control of the situation is to carefully set the tone. If you can use the language and style of the website to convey a sense of what might be appropriate, you can influence the tone users are likely to take. Though moderators still have to check the same amount of content, this reduces the chance that something untoward will slip through. Phrases like "Interglobal Inc do not take any responsibility for the content of this service" – phrases which are of dubious merit anyway – may have the opposite effect, by giving users the impression that the site's owners don't care what the tone is. You also stand to lose control of the tone in the subconscious minds of users if you use some well-known software – phpBB for example – which users might have used elsewhere and come to associate with a certain mode of speech.

If you do censor people, a light touch is often better than a heavy hand.

Mauvesoft

January 26th, 2009

I've overhauled Mauvesoft, my programming projects website. Check it out.

How to program a calendar

January 26th, 2009

Programming a calendar sounds deceptively easy. And it is, until you come to realise that there's very little point in displaying a calendar that doesn't show information about events and periods. You have a potentially overlapping set of periods to display, each spanning days or months. It becomes much more complicated.

At the moment I'm programming a calendar for the booking of accommodation, which is particularly complicated because a) you book nights, not days, and month planners have cells for days, not nights, and b) the dates that are available are the dates not booked, not the dates booked.

I'm using a simpler approach, converting all calendar periods into a stream of events in date order. The interface between producers and consumers of calendar events looks like this:

class CalendarListener(object):
  def start_month(self, month):
    """Called before the first day of the month, and before any periods in that month."""
   
  def end_month(self, month):
    """Called after the last day of the month, and after any periods in that month."""
   
  def start_day(self, date):
    """Called once for each day to display"""
   
  def start_period(self, date, period):
    """Called before the day in which the period begins"""
   
  def end_period(self, date, period):
    """Ends the previously started period"""

This interface makes it very easy to produce, filter, and consume calendar data. What was previously a complicated process of intersecting, splitting, joining, structuring and outputting date ranges suddenly becomes very simple. All of the events received via this interface are guaranteed to be in chronological order, so no date comparison is needed. Almost all calendar operations can be performed with a simple state machine.
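As an illustration of the state-machine idea, here is a minimal sketch of a listener that works out which days are free, that is, not covered by any booking. It is written for Python 3, mirrors the method names of the interface above, and assumes that end_period() arrives after start_day() for the period's last day; the class name, the hand-written event sequence and the string used in place of a Month object are all made up for the example.

```python
from datetime import date


class FreeDayCollector(object):
    """Listener that records the days not covered by any period."""

    def __init__(self):
        self.open_periods = 0   # the entire state: how many periods are open
        self.free_days = []

    def start_month(self, month):
        pass

    def end_month(self, month):
        pass

    def start_day(self, day):
        # Events arrive in date order, so no date comparison is needed here.
        if self.open_periods == 0:
            self.free_days.append(day)

    def start_period(self, day, period):
        self.open_periods += 1

    def end_period(self, day, period):
        self.open_periods -= 1


# Drive the listener by hand with a made-up booking covering 2-3 January.
collector = FreeDayCollector()
collector.start_month("January 2009")  # stand-in for a Month object
collector.start_day(date(2009, 1, 1))
collector.start_period(date(2009, 1, 2), "booking")
collector.start_day(date(2009, 1, 2))
collector.start_day(date(2009, 1, 3))
collector.end_period(date(2009, 1, 3), "booking")
collector.start_day(date(2009, 1, 4))
collector.end_month("January 2009")

print(collector.free_days)
```

A single counter is all the state required, and overlapping bookings are handled without any extra code.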

A consumer that renders to HTML, for example, is as simple as this:

from StringIO import StringIO

class MonthRenderer(CalendarListener):
  def __init__(self):
    self.buf = StringIO()
   
  def start_month(self, month):
    print >>self.buf, """<div class="month"><h4>%s</h4>
      <img class="week" src="/assets/cal/week.png" alt=""/>""" % month.name()
   
    w = month.first_day().weekday()
    if w:
      print >>self.buf, '<div class="padding" style="width: %dpx"></div>' % (w * 21)
 
  def end_month(self, month):
    print >>self.buf, "</div>"
   
  def start_day(self, date):
    print >>self.buf, '<span class="day">%d</span>' % date.day

(Note: date and datetime are standard Python classes. Month, however, is my own class. Also, some people use a table rather than CSS for this; that's obviously a fairly simple alteration.)
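The other side of the interface, a producer, might be sketched as follows. This is Python 3 and makes several assumptions that the post does not confirm: periods are (start, end, label) tuples with inclusive dates, a month is passed as a (year, month) tuple rather than a Month object, a period is closed after its final day, and periods that extend beyond the requested range are clipped to it. The names emit_events and Recorder are hypothetical.

```python
from datetime import date, timedelta


def emit_events(listener, first, last, periods):
    """Feed `listener` the events for every day from `first` to `last`."""
    # Ignore periods that do not touch the requested range at all.
    periods = [p for p in periods if p[0] <= last and p[1] >= first]
    starts = sorted(periods, key=lambda p: p[0])
    ends = sorted(periods, key=lambda p: p[1])
    day = first
    month = None
    while day <= last:
        if (day.year, day.month) != month:
            if month is not None:
                listener.end_month(month)
            month = (day.year, day.month)
            listener.start_month(month)
        # Periods beginning today (or earlier) are announced before the day.
        while starts and starts[0][0] <= day:
            listener.start_period(day, starts.pop(0)[2])
        listener.start_day(day)
        # Assumption: a period is closed after its final day.
        while ends and ends[0][1] == day:
            listener.end_period(day, ends.pop(0)[2])
        day += timedelta(days=1)
    # Close anything that runs past the end of the range.
    for _start, _end, label in ends:
        listener.end_period(last, label)
    if month is not None:
        listener.end_month(month)


class Recorder(object):
    """Listener that records each call as a short string."""

    def __init__(self):
        self.calls = []

    def start_month(self, month):
        self.calls.append("start_month %d-%02d" % month)

    def end_month(self, month):
        self.calls.append("end_month %d-%02d" % month)

    def start_day(self, day):
        self.calls.append("day %s" % day.isoformat())

    def start_period(self, day, period):
        self.calls.append("start_period %s" % period)

    def end_period(self, day, period):
        self.calls.append("end_period %s" % period)


recorder = Recorder()
booking = (date(2009, 1, 31), date(2009, 2, 1), "booking")
emit_events(recorder, date(2009, 1, 30), date(2009, 2, 2), [booking])
print("\n".join(recorder.calls))
```

Because the producer walks the days once, in order, every consumer downstream can rely on chronological delivery.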

It took me quite a few false starts before I realised the relative simplicity and convenience of this pattern, which is why I wanted to recommend this. It's very easy to fall into a trap of building complexity and tackling problems using ever-more complicated calendar classes and processors and never take the step back to find a better approach.

The naïve approach for programming a calendar is to write a function, say, print_month() which renders a month of a calendar. Then call this 12 times. Then wrap it up in a class so you can subclass it to retrieve a list of events and modify output. This quickly became excessively complicated, as I wrote methods to chop and join periods together, work out what the formatting of each day should be, and render it.

Alas, the calendar also requires Javascript, and doesn't benefit quite as much from an event-driven approach because it needs to operate on the structured HTML DOM.

Tip: Don't use uppercase/lowercase in HTML

January 14th, 2009

It's sometimes tempting to use case for emphasis: uppercase and lowercase are well within the repertoire of useful graphic design tools. Graphic designers know that uppercase is slower to read than lowercase, but in isolated phrases that's unimportant. But on the web there's a penalty to using just upper- or lowercase: it's not as accessible. Writing in normal sentence case conveys information. Specifically, the semantics of the sentence – particularly abbreviations – depend on the use of case, as this photo shows:

NUT CONFERENCE

CSS provides a way around this: the text-transform property. This allows you to write your content in full-sentence case, and display it in full uppercase or lowercase as desired for stylistic reasons. For example, if your design calls for <h2> tags to be in uppercase, use

h2 {
  text-transform: uppercase;
}

Of course, this allows you to simply remove the property if you change your site design; no content needs to be rewritten.

Some offenders even publish an RSS feed using uppercase titles. Never do this. People who want to syndicate your feed normally want it in sentence case, and there's no way to force that to happen if you aren't publishing the RSS feed using proper sentence case.
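The reason it can't be forced is that changing case only works losslessly in one direction. A small Python sketch, using the phrase from the photo as a hypothetical feed title:

```python
title = "NUT conference"        # sentence case, as the author wrote it

shouted = title.upper()         # what text-transform: uppercase would display
print(shouted)

# Going back means guessing which words were abbreviations.
print(shouted.capitalize())
print(shouted.title())
```

Neither attempt recovers the original: once the title is published in uppercase, the fact that NUT is an abbreviation has gone.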

Wordpress Audio Player

January 9th, 2009

Martin Laine's Wordpress Audio Player seems to have quite a broad penetration, but having seen it in a couple of places, I want to add that I think it's excellent. When not playing, it's a plain, unintrusive icon that clearly indicates an option to play a sound, and which smoothly expands to a straightforward, clutter-free player. By changing the colour scheme, you could make this fit with nearly any website style, and unlike many alternatives it will not draw attention away from your text or audio content.

Users don't need WYSIWYG

December 12th, 2008

We've all been wowed by impressive off-the-shelf components for WYSIWYG editing in webpages. TinyMCE and HTMLArea are just two. Looking at them it all looks staggeringly simple.

However, in practice these editors become incredibly painful to use. When you watch users trying to edit a content management system, it becomes very obvious that the usability they profess is an illusion.

  • The editing experience is slow, sometimes to the point of non-interactivity
  • Loading of the component is also slow
  • Users struggle to make formatting behave when all they want to do is write
  • Pasting from word processor documents also pastes unwanted formatting

My solution is to return to the humble textarea. This was something I discovered by accident: Django provides a Markdown filter, so this was the simplest way to provide formatting, though my intention had been to embed TinyMCE later. Markdown is a simple formatting language based on formatting conventions in plain-text e-mails. I soon discovered the convenience of editing content with Markdown – freed from worries about formatting, the experience of publishing content accelerates, at the expense of a slightly steeper learning curve. It's only a learning curve for the formatting features. In the simplest case of unformatted paragraphs, writing in Markdown is the same as writing in plain text, so users can start publishing at full speed straight away.

There's also the benefit that Markdown doesn't give users the ability to break out of the style of a site. Users are not graphic designers: if you give them the ability to make text red, thinking perhaps that this may be useful on special occasions to draw attention to an important message, don't be surprised if they make every other sentence red, bold and italic. After all, they want to draw attention to everything they say, don't they?

To enhance the experience in my applications, I've bolted on a toolbar based on Livepipe and a preview based on Showdown.

Screenshot of my MarkdownArea

Calls to action

November 28th, 2008

A way of supposedly increasing the conversions from your site is by adding calls to action, links or banners or buttons nudging people away from simply reading and towards taking action – purchasing your products, enquiring about your services and so on.

The practice of including calls to action is taken straight out of the advertising industry. Advertisers have a small list of things that they need to include in an advert, and a call to action is on that list. However a website is not an advert. Users browsing the web are mainly in a mode where they will read and compare and research a purchase. Who would click the first "buy this now" button they see when they can hop onto another site and check out alternatives and price first? In this context, calls to action may not be very effective and can be intrusive. It's even less effective if your call to action is not something as passive and easily handled over the web as just "buying", such as "Enquire now about our calibration service".

In the UK we also like our calls to action implicit. Watch TV ads for a few minutes and the number you'll see that include an explicit call like "Sofas half price at DFS until Monday! Come down to DFS showrooms today!" is small compared to the number that run more along the lines of "The sun is shining and this man in trendy clothes is laughing with a group of attractive women. What's that he's drinking? Oh, Coca-Cola."

So include calls to action, make sure they are seen, but keep them understated and out of people's faces and users may find your site that much more appealing – easily enough to outweigh the effectiveness of intrusive calls to action.

Book Meme

November 17th, 2008
  • Grab the nearest book.
  • Open it to page 56.
  • Find the fifth sentence.
  • Post the text of the sentence in your blog along with these instructions.
  • Don’t dig for your favorite book, the cool book, or the intellectual one: pick the CLOSEST.

So, my sentence is this:

"Spring Deer is almost like drinking flavored water; there's tons of flavor in a creamy and velvety package."

The book is Sake: A Modern Guide by Beau Timken and Sara Deseran (ISBN 0-8118-4960-0).

Debunking SEO

November 17th, 2008

I've discussed previously how the SEO industry constructs its advice. But I now want to take them to task on actual advice I've received from SEO companies. SEO companies make claims that are hard to verify scientifically, because it's very difficult to distinguish causal factors in changes in search result positions.

To validate these claims we could imagine a study where we compare the rankings of two groups of websites, distinguished only by whether they implement a given SEO suggestion. If the hypothesised recommendation does affect ranking we would expect to see a statistically significant amelioration of search engine ranking.

I don't believe this is possible. For one thing, there are too many factors, given the complexity of the web, to be able to extract a clear picture, so any results would be unlikely to be "statistically significant". This means any effect noted would not be as great as the margins of error of the experiment. The results would be too muddied by independent and much more important considerations like inbound links and accessibility. Also you can't get a very good appreciation of how much a rank is affected: you only see the order of results, not how much better one result is considered than the next. Statistically that should widen the margins of error.
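The point about ordering can be shown with a toy example in Python. The scores below are entirely made up, since search engines do not publish their internal scoring; the sketch only shows that very different scores can produce an identical results page.

```python
def ranking(scores):
    """Return site names ordered from highest score to lowest."""
    return sorted(scores, key=scores.get, reverse=True)


# Two hypothetical sets of internal relevance scores.
close_race = {"site-a": 0.91, "site-b": 0.90, "site-c": 0.89}
runaway = {"site-a": 0.99, "site-b": 0.40, "site-c": 0.05}

print(ranking(close_race))
print(ranking(runaway))
```

An observer sees the same order in both cases, so a change that moved site-b from 0.40 to 0.90 would be invisible, while a tiny change in the close race could reorder everything.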

I am skeptical about a lot of these things. I don't think I can disprove them given the doubts I've expressed above, but I do contest them. I believe they are unlikely and I believe SEO people believe them for invalid reasons.

Image spidering in Python

October 28th, 2008

I have several useful tools in Python for working with websites. Today I needed a script to report the images on a website, along with their corresponding alt attributes. The script was extremely quick to write using the available tools, which makes it a fairly good example of how powerful Python is.

I have based this script on a pre-existing webspider class I have written:

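import re
import sys
import urllib2

# Legacy BeautifulSoup-backed parser that shipped with lxml at the time
from lxml.html import ElementSoup

class Spider(object):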
   def __init__(self, base_url):
      self.base_url = base_url
   
   def pages(self):
      queue = [self.base_url]
      seen = set(queue)
   
      while queue:
         url = queue.pop(0)
         f = urllib2.urlopen(url)
         if f.info().gettype() not in ['text/html', 'application/xhtml+xml']:
            continue
         doc = ElementSoup.parse(f)
         doc.make_links_absolute(url)
         for element, attribute, link, pos in doc.iterlinks():
            if not link.startswith(self.base_url):
               continue
            if element.tag == 'a' and attribute == 'href':
               l = re.sub(r'#.*$', '', link)
               if l not in seen:
                  queue.append(l)
                  seen.add(l)
   
         path = url[len(self.base_url):]
         yield path, doc

This class effectively wraps a generator which yields every pair of path and web page it finds on the site. Generators are incredibly useful for keeping code simple without being memory hungry. It's easier to type yield than to build a list of items, but in this case it's better than that: this code returns one lxml ElementTree at a time, rather than reading and parsing them all up front.

Generators encapsulate state as local variables, which generally means you don't even need to wrap them in a class like I've done. I only do this because I like to add functionality by subclassing. This may be a throwback to my days of programming Java.
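For comparison, here is a minimal sketch of the same breadth-first crawl as a plain generator function, with the queue and the seen set held as local variables. It is written for Python 3 and walks a hypothetical in-memory link table (SITE) instead of fetching real pages, so the names and data are assumptions for illustration only.

```python
from collections import deque

# Hypothetical site: each path maps to the paths it links to.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/", "/blog/post-1"],
    "/blog/post-1": ["/blog"],
}


def crawl(links, start="/"):
    """Yield each reachable path once, in breadth-first order."""
    queue = deque([start])
    seen = {start}
    while queue:
        path = queue.popleft()
        for target in links.get(path, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
        yield path


pages = crawl(SITE)
print(next(pages))   # only the first page has been visited so far
print(list(pages))   # the remaining pages are produced on demand
```

The generator pauses at each yield with its queue and seen set intact, which is exactly the state a class would otherwise have to store on self.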

It should be noted that most of the heavy lifting here is being done by lxml and BeautifulSoup. lxml.html makes it extremely easy to work with HTML. BeautifulSoup's excellent broken-HTML parser is used not because my HTML demands it, but to allow this one script to work with any site I want to use it with.

class ImageSpider(Spider):
   def images(self):
      seen = set()
      for path, doc in self.pages():
         imgs = []
         for img in doc.findall('.//img'):
            src = img.get('src')
            alt = img.get('alt')
            title = img.get('title')
            i = (src, alt, title)
            if i not in seen:
               seen.add(i)
               imgs.append(i)
   
         if imgs:
            yield path, imgs
   
...

This is another generator that effectively filters the list of pages, yielding a list of images within each page. Generators calling generators is again very elegant. Each time the caller asks for the next page of images, ImageSpider will go back to the original Spider for a new page until it has one with images.
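The laziness of chained generators can be seen in a small Python 3 sketch. The pages() function here is a stand-in for Spider.pages() that yields made-up data and records each page it produces; nothing in it touches the network.

```python
fetched = []  # records which pages the inner generator has produced


def pages():
    """Stand-in for Spider.pages(): yields (path, image list) pairs."""
    site = [
        ("/", []),
        ("/about", ["team.jpg"]),
        ("/blog", []),
        ("/gallery", ["a.png", "b.png"]),
    ]
    for path, imgs in site:
        fetched.append(path)
        yield path, imgs


def pages_with_images(source):
    """Pass through only the pages that contain at least one image."""
    for path, imgs in source:
        if imgs:
            yield path, imgs


results = pages_with_images(pages())
first = next(results)
print(first)
print(fetched)
```

After one call to next(), only the first two pages have been produced: the outer generator pulled from the inner one just far enough to find a page with images.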

   def text_report(self, out=sys.stdout):
      for path, imgs in self.images():
         print >>out, 'In', path
         for src, alt, title in imgs:
            print >>out, '- src:', src
            if alt is not None:
               print >>out, '  alt:', alt
            else:
               print >>out, '  alt is MISSING'
            if title is not None:
               print >>out, '  title:', title
         print >>out

Other methods of ImageSpider generate reports. Here I use the handy print chevrons to write to any file-like object. File-like objects are a particularly handy piece of duck typing. By default these methods will write to stdout, which is the same as printing normally, but you can pass in any other file-like object for very simple redirection.
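In Python 3 the print chevrons became the file= argument of the print() function, but the duck typing works the same way. A minimal sketch, with a made-up report() function and made-up file names:

```python
import io
import sys


def report(lines, out=sys.stdout):
    """Write each line to any object that has a write() method."""
    for line in lines:
        print("-", line, file=out)


report(["logo.png", "photo.jpg"])              # goes to the terminal

buffer = io.StringIO()                         # an in-memory file-like object
report(["logo.png", "photo.jpg"], out=buffer)
captured = buffer.getvalue()
```

The function neither knows nor cares whether it is writing to a terminal, a file on disk or a string buffer, which also makes it easy to test.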

   def html_report(self, out=sys.stdout):
      from cgi import escape
      print >>out, """<html>
   <head>
      <title>Image Report for %(base_url)s</title>
   </head>
   <body>
      <h1>Image report for %(base_url)s</h1>
      """ % {'base_url': escape(self.base_url)}
   
      for path, imgs in self.images():
         print >>out, '\t\t<h2>%s</h2>' % escape(path).encode('utf8')
         for src, alt, title in imgs:
            # quote=True also escapes double quotes, which attribute values need
            idict = {'src': escape(unicode(src), True).encode('utf8'),
                'alt': escape(unicode(alt), True).encode('utf8'),
                'title': escape(unicode(title), True).encode('utf8')}
            print >>out, '\t\t<img src="%(src)s" alt="%(alt)s" />' % idict
            if alt is not None:
               print >>out, '\t\t<p><strong>alt:</strong> %(alt)s</p>' % idict
            else:
               print >>out, '\t\t<p><strong>alt is MISSING</strong></p>'
            if title is not None:
               print >>out, '\t\t<p><strong>title:</strong> %(title)s</p>' % idict
            print >>out
      print >>out, """   </body>
</html>
"""

Again, similar, but this method demonstrates a simple form of templating: the string formatting operator, %, allows you to retrieve values from a dictionary.
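A minimal Python 3 sketch of the same dictionary-based formatting follows. It uses html.escape(), the standard-library replacement for cgi.escape (which was removed in Python 3.8); the template and values are made up for the example.

```python
from html import escape

template = '<img src="%(src)s" alt="%(alt)s" />'
values = {"src": "/img/a&b.png", "alt": 'A "quoted" caption'}

# html.escape() escapes quotes by default, which attribute values need.
safe = {key: escape(value) for key, value in values.items()}
rendered = template % safe
print(rendered)
```

Each %(name)s placeholder is looked up by key, so the same value can appear several times in a template and the order of the dictionary does not matter.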

Finally, there's the commandline interface to all this:

from optparse import OptionParser
   
op = OptionParser()
op.add_option('-f', '--format', choices=['text', 'html'])
op.add_option('-o', '--outfile')
   
options, args = op.parse_args()
   
if len(args) != 1:
   op.error('You must provide a site URL from which to spider images.')
   
s = ImageSpider(args[0])
   
if options.outfile:
   out = open(options.outfile, 'w')
else:
   out = sys.stdout
   
if options.format == 'html':
   s.html_report(out)
else:
   s.text_report(out)

In a few lines, the amazing optparse module turns a quick script into a flexible commandline tool.
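The parser can also be exercised without a command line by passing parse_args() an explicit list. The sketch below is Python 3, where optparse is still in the standard library alongside the newer argparse. The parse() wrapper, the usage string, the default of 'text' and the example arguments are assumptions added for illustration.

```python
from optparse import OptionParser


def parse(argv):
    """Parse a list of arguments instead of reading sys.argv."""
    op = OptionParser(usage="%prog [options] SITE_URL")
    op.add_option("-f", "--format", type="choice",
                  choices=["text", "html"], default="text")
    op.add_option("-o", "--outfile")
    options, args = op.parse_args(argv)
    if len(args) != 1:
        op.error("You must provide a site URL from which to spider images.")
    return options, args[0]


options, url = parse(["-f", "html", "-o", "report.html", "http://example.com/"])
print(options.format, options.outfile, url)
```

Running the script with -h prints generated help text, and an invalid --format value is rejected with an error message, all without any extra code.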

Download the source: siteimages.py