Archive for October, 2006

Data mining with AJAX

Friday, October 27th, 2006

Just had an idea: how about using Javascript to record client-side usage of your website?

The principle is this:

  1. Register Javascript listeners which construct a list of events, particularly mouse, scroll and click events, along with the time that the event was fired.
  2. Register an unload event which posts the information as XML with AJAX to a script on the server when the user leaves the page.
  3. Browsing sessions can be collated on the server using cookies.
  4. Create a player, which reads the events as XML and renders them using a DHTML ‘cursor’ and/or by firing events within the DOM. Could have a time slider and fast-forward controls, etc, depending on how complex you want to get.

Voila - see exactly what people are doing with your site. I have knocked up a test which implements the first two steps, for mousemove events, and that much works, so the whole concept would be workable. I can imagine it would break down if your site uses plugins (or Javascript navigation, depending on how easy it is to replay the events accurately) but that’s a limitation you would have to live with.
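
For what it's worth, here is a rough sketch of the server side of steps 3 and 4 in Python. The XML layout (an events element holding event elements with type, t, x and y attributes), the session IDs taken from a cookie, and all the function names are my own assumptions for illustration; timestamps are assumed to be absolute milliseconds so that events from several pages can be merged.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def parse_events(xml_text):
    """Parse one posted XML document into a time-ordered list of events.

    Assumed (hypothetical) format:
    <events><event type="click" t="40" x="5" y="6"/>...</events>
    """
    root = ET.fromstring(xml_text)
    events = []
    for node in root.iter("event"):
        events.append({
            "type": node.get("type"),
            "t": int(node.get("t")),      # time the event fired, in ms
            "x": int(node.get("x", 0)),   # pointer position, if recorded
            "y": int(node.get("y", 0)),
        })
    return sorted(events, key=lambda e: e["t"])


def collate(posts):
    """Group posted documents by session ID (e.g. a cookie value)."""
    sessions = defaultdict(list)
    for session_id, xml_text in posts:
        sessions[session_id].extend(parse_events(xml_text))
    for events in sessions.values():
        events.sort(key=lambda e: e["t"])
    return dict(sessions)


def replay_schedule(events, speed=1.0):
    """Yield (delay_ms, event) pairs; speed > 1 is fast-forward."""
    previous = None
    for event in events:
        delay = 0 if previous is None else (event["t"] - previous) / speed
        previous = event["t"]
        yield delay, event
```

A player would wait for each delay and then move its fake cursor or fire the event; the speed argument is where a fast-forward control would plug in.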

There are obviously privacy concerns, but these are relatively mild as no personal data would be recorded. Perhaps it could pop up a Javascript window.confirm() dialog asking if it’s OK to record your behaviour. But it would be a very useful tool for examining site usage, especially for commercial sites. This is the way modern marketing works. I leave it up to your conscience as to whether it’s ethical.

IE7 and FF2

Thursday, October 26th, 2006

Well, this was the week that the world of web design was turned on its head. The release of Firefox 2.0 wasn’t the cause: it’s only Microsoft who can shake the industry up like this, with its release of Windows Internet Explorer 7.

It’s been 5 years with the same IE bugs, but now we get a lot of them taken away and a whole new load handed back. Because of this, we once again have three major platforms to target: IE6, IE7 and Real Browsers. There has been a lot of talk about better standards compliance in IE7 but it only takes a glance at comparative Acid2 renderings to see that IE is still way off the mark. Firefox 2 doesn’t pass Acid2, but its performance is actually not bad at all. In fact there is only one major class of bugs left for Acid2, bugs which will all be fixed by the reflow rewrite, which has been in progress for a while now.

(more…)

More PHP segfaults

Monday, October 16th, 2006

Another case of PHP segfaulting. This time, at least, it was behaving deterministically and by inserting

print "Meep!";
flush();

throughout various bits of the code I managed to track down the problem. It was segfaulting trying to read a config file to which it didn’t have read permissions.

PHP is bad.

Why I’m not sold on RSS

Saturday, October 14th, 2006

I don’t know if I’m the only one but I’ve just never gotten on with RSS (under the umbrella of which I include Atom too). Nothing I’ve read about it resolves these open questions:

  • What is RSS for?
  • Why is RSS the best way to do… whatever it is that it’s for?

I think that RSS’s history lends credibility to the idea that nobody really has the answers to those questions.

(more…)

MP3-spliced Encrypted Filesystem

Tuesday, October 10th, 2006

I’ve had a crazy idea for a way of protecting data using an MP3 collection. It’s completely ridiculous, inefficient, and it can probably be shot to pieces. But it’s fun.

MP3 streams consist of a bundle of frames. Frames begin with a 12-bit sync word of all ones, then there is a brief header which gives the bitrate and sample rate (from which the length of the frame follows), then the frame data. MP3 players should wait for the 12-bit sync, then read the frame, then wait for another sync (in most MP3 data this follows immediately).

It should be possible to bung in random bytes in there between frames and have players ignore it completely.
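
To make that concrete, here is a minimal Python sketch that walks a byte stream, recognises MPEG-1 Layer III frame headers, and collects whatever lies between frames. It handles only that one flavour of MP3, ignores ID3 tags, and the function names are mine. It also assumes the hidden bytes never happen to look like a frame header, which a real implementation would have to guarantee by escaping them.

```python
# Lookup tables for MPEG-1 Layer III; None marks free-format/reserved values.
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112,
                 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES_HZ = [44100, 48000, 32000, None]


def frame_length(header):
    """Return the frame length in bytes, or None if this is not a
    valid MPEG-1 Layer III frame header."""
    if len(header) < 4:
        return None
    # 12 sync bits of ones, then ID bit 1 (MPEG-1) and layer bits 01
    # (Layer III); the lowest bit of byte 1 is the protection flag.
    if header[0] != 0xFF or (header[1] & 0xFE) != 0xFA:
        return None
    bitrate = BITRATES_KBPS[header[2] >> 4]
    sample_rate = SAMPLE_RATES_HZ[(header[2] >> 2) & 0x03]
    if bitrate is None or sample_rate is None:
        return None
    padding = (header[2] >> 1) & 0x01
    return 144 * bitrate * 1000 // sample_rate + padding


def split_stream(data):
    """Separate a stream into (list of frames, bytes found between frames)."""
    frames, hidden = [], bytearray()
    pos = 0
    while pos < len(data):
        length = frame_length(data[pos:pos + 4])
        if length is not None and pos + length <= len(data):
            frames.append(data[pos:pos + length])
            pos += length
        else:
            hidden.append(data[pos])  # not part of a frame: keep it
            pos += 1
    return frames, bytes(hidden)
```

For a 128 kbps, 44.1 kHz frame (header bytes FF FB 90 00), frame_length gives 417 bytes, and anything wedged between two such frames comes back out of split_stream as the hidden data.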

What if the bytes you stick in there comprise a filesystem? Say you use 10 6MB MP3s and pad them by say 10% with the filesystem data. 6MB filesystem! Enough for your most important secrets, and nobody is going to look for it there. Security through obscurity. Still, that’s only version 1 of the protocol (ie. it occurred to me first).

Version 2: use a block cipher to encrypt the filesystem. Obviously, this is important as otherwise plaintext bytes are readily visible.

Version 3: hide the MP3s that you’ve used to create the filesystem in a collection of MP3s - a large but random number of MP3s that have been similarly padded, but with junk. Now you need the right MP3s in the right order. Choosing r MP3s in the right order from a set of n gives nPr permutations, which, for r << n, is approximately n^r. For example, choosing 10 (in order) from a relatively modest collection of 500 (~ 2^9) is roughly equivalent to a 90-bit passphrase or a 15-character random password consisting of A-Z,a-z,0-9. But the nice thing about this is that humans should be good at reconstructing playlists from memory, even with thousands of MP3s to choose from.
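
The arithmetic is easy to check in Python. This is only a back-of-envelope sketch (function names are mine, and math.perm needs Python 3.8 or later), and it assumes an attacker has no idea which MP3s you are likely to pick.

```python
import math


def ordered_choice_bits(n, r):
    """Bits of entropy in choosing r items, in order, from n."""
    return math.log2(math.perm(n, r))


def random_password_bits(alphabet_size, length):
    """Bits of entropy in a uniformly random password."""
    return length * math.log2(alphabet_size)


playlist_bits = ordered_choice_bits(500, 10)   # 10 MP3s in order from 500
password_bits = random_password_bits(62, 15)   # 15 chars of A-Z, a-z, 0-9
```

Both values come out a little under 90 bits, so the comparison holds.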

Version 4: Stripe the bytes between different MP3s. Ensure that cipher blocks are split between MP3s. This ensures that you can’t run a brute force crack attempt against part of the encrypted data because you can’t dig a whole block out of any one file.

Version 5: (optionally) use some acoustic element of the assembled MP3 playlist as part of the passphrase for the block cipher. Entering one ‘digit’ of passphrase might be the equivalent of selecting a riff in the right song, or one particular lyric. Say you have to choose a 5-second segment from your 10 3-minute MP3s - that’s about 8.5 bits of passphrase.
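
The same kind of quick check works for the segment choice, again as a sketch with a made-up function name: 10 tracks of 180 seconds give 360 possible 5-second segments.

```python
import math


def segment_choice_bits(tracks, track_seconds, segment_seconds):
    """Bits of entropy in picking one fixed-length segment from a playlist."""
    segments = tracks * (track_seconds // segment_seconds)
    return math.log2(segments)


bits_per_choice = segment_choice_bits(10, 180, 5)  # log2(360)
```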

Version 6: swap some, but not all, of your MP3s over P2P networks. There is an element of deniability - the random data may or may not be yours. Most MP3 collections I’ve seen have been collected from hundreds of different sources - and anyone using the system will have lots of MP3s with a mixture of junk and real padding, so to find many MP3s containing junk on any one system does not mean they have encrypted data. It just means they are guilty of piracy.

Told you it’s completely ridiculous. But isn’t it fun? :)

How I came to love developing in Python

Wednesday, October 4th, 2006

As I’ve implied previously, I find PHP a desperately bad language for developing web applications. Python is my current favourite; it is a joy to work with both in writing code and maintaining code. Using Python, I can develop web applications faster, and with more complexity, than I ever could with PHP.

There was a disaster a couple of years ago with PHP which was the reason my preference changed. PHP fell apart when it came to the crunch, but using Python I was able to rapidly pick up the pieces. I was developing an application which would display quite an extensive mortgage application form, collect the answers and print them back to the PDF, because the mortgage lender was still using a paper-based system.

I developed a system which read questions in XML. The asking of some questions could be predicated on the answers given to previous questions. This allowed me to omit questions which the original paper form didn’t require, and this would mean that I could require valid answers to all of the questions I asked.

I had written the system in PHP, as was our standard practice at the time. Obviously this required quite complex data structures; each question was an object, but the predication was effectively a parse tree which could be evaluated - collapsed to a single value: true, false or unknown. PHP makes this kind of work a huge nuisance. It’s only got a SAX parser, which means you need your own stack to parse it. When you’re doing any data structure work in PHP you also have to be very careful to keep references rather than copies, which means you have to insert & in every assignment and function spec, and you can’t update the $v in foreach ($x as $k=>$v) - that’s also a copy.
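
To give a flavour of that predicate tree, here is a minimal Python sketch of three-valued evaluation, with None standing in for "unknown". The node types and question names are invented for illustration and are not the ones from the real application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Equals:
    question: str   # hypothetical question ID
    value: str      # answer that makes this leaf true


@dataclass
class Not:
    operand: object


@dataclass
class And:
    left: object
    right: object


@dataclass
class Or:
    left: object
    right: object


def evaluate(pred, answers) -> Optional[bool]:
    """Collapse a predicate tree to True, False or None (unknown)."""
    if isinstance(pred, Equals):
        if pred.question not in answers:
            return None  # not answered yet
        return answers[pred.question] == pred.value
    if isinstance(pred, Not):
        value = evaluate(pred.operand, answers)
        return None if value is None else not value
    if isinstance(pred, (And, Or)):
        left = evaluate(pred.left, answers)
        right = evaluate(pred.right, answers)
        # For And, a single False decides the result; for Or, a single True.
        decisive = isinstance(pred, Or)
        if left is decisive or right is decisive:
            return decisive
        if left is None or right is None:
            return None
        return not decisive
    raise TypeError(f"unknown predicate node: {pred!r}")
```

A question would then be asked only when its predicate evaluates to True, and held back while the result is still unknown.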

The system worked on my simple hand-drafted test data, which was much of the first page of the form, but it was extremely laborious to set up the XML source, because the questions needed coordinates from the PDF.

I stopped work on the web application and swapped over to writing a tool to generate the XML input from the original paper form, which we had in the form of a PDF. I wrote a Java tool which called on Ghostscript to render the PDF, and displayed a Swing and Java2D UI to draw the fields onto the page.

A week of programming and 3 days and 12 pages of questions later, I plugged a completed XML file into the PHP application, and… nothing. Blank page. Couldn’t get any output from PHP at all. It turned out PHP was segfaulting serialising the data structure. This was an almost impossible situation to resolve; the gdb trace was useless, the project was running late, and PHP wasn’t behaving in a deterministic way, making it impossible to debug.

The best solution I could think of was to rewrite the entire application in a language I trusted more than PHP, and Python, which I had been experimenting with, seemed appropriate. I already had a very basic framework for writing CGI applications in Python, and even though I didn’t start with a session system, I was able to write one, transcribe the PHP into Python, and get it all up and running within about 2 hours, which I remain impressed with to this day.

As I worked, I found I could transcribe every PHP construct into Python quickly and more succinctly. I could simply omit the & nonsense as objects are always passed by reference. It’s amazing to be able to look at a block of PHP code, recall what it does, and write one line of Python which can do the same thing, omitting all the hoops that PHP requires you to jump through to construct data structures.
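
A tiny illustration of the point about references, with made-up data: a function or a for loop in Python receives the object itself, so changing it changes the original. Strictly speaking the reference is what gets passed, so rebinding the name inside the function would not affect the caller; mutating the object does.

```python
def record_answer(question, answer):
    """Mutates the caller's dict; no & or copy-back needed."""
    question["answer"] = answer


questions = [{"id": "q1", "answer": None}, {"id": "q2", "answer": None}]

# The loop variable refers to the same dict that lives in the list.
for question in questions:
    record_answer(question, "yes")

answered = [q["id"] for q in questions if q["answer"] == "yes"]
```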

Domains as a measure of trust

Tuesday, October 3rd, 2006

I’m increasingly amazed by the number of banks and other secure services that seem to spread their online services over dozens of different domains. Simply put, a domain is one unit of trust, for a variety of reasons, and this is even assumed for security reasons in many applications (cookies and XSS sandboxing spring to mind). It’s cheaper, easier, more secure, and visibly more secure to use subdomains than purchase a separate domain to redirect users to for secure services.

Some of the culprits I’ve come across:

  • NatWest (at natwest.com) use nwolb.com for online banking.
  • RBS (which owns NatWest) also owns Streamline Direct, a payment gateway. RBS’ merchants’ customers get redirected onto Streamline Direct (at streamline-esolutions.com) to enter credit card details. Most won’t have ever heard of them. But if you did Google for them you’d find them at streamline-direct.co.uk and/or streamline.com.
  • Paying for domains online yesterday (at streamline), I was redirected to securesuite.com, ostensibly some Mastercard security thing, and asked to enter my credit card details a second time.
  • Barclays (at barclays.co.uk) runs its payment gateway out of epdq.co.uk.
  • Play.com hands over to playsecureserver1.com to take card details.

And just to contrast the way it’s supposed to work, let’s think of a few examples of big sites with secure services:

  • Amazon (www.amazon.co.uk) uses https://www.amazon.co.uk.
  • If you pay Google for advertising (adwords.google.co.uk), you’ll pay at https://adwords.google.co.uk.
  • What domain does Paypal (www.paypal.com) use for secure services? https://www.paypal.com/.
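
The test a user (or a browser) is implicitly applying can be sketched in a few lines of Python. The function name is mine, and it deliberately takes the registered domain as an argument rather than trying to work it out, because suffixes like co.uk make that impossible without a public suffix list.

```python
def within_domain(host, registered_domain):
    """True if host is the registered domain itself or a subdomain of it."""
    host = host.lower().rstrip(".")
    registered_domain = registered_domain.lower().rstrip(".")
    # The leading dot matters: evilnatwest.com must not match natwest.com.
    return host == registered_domain or host.endswith("." + registered_domain)


stays_put = within_domain("adwords.google.co.uk", "google.co.uk")
wanders_off = not within_domain("nwolb.com", "natwest.com")
```

By this measure the sites in the second list stay inside their own unit of trust and the ones in the first list do not.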

It is relatively trivial for a hacker to obtain an SSL cert for an arbitrary domain of his own, but extremely hard to obtain an SSL cert for someone else’s domain and then insert his machine into their DNS. Either way, he still has to compromise a web server somewhere to get his machine inserted into the chain, but web servers do get compromised. He would also have to find it beneficial to redirect to a third-party machine rather than set up some credit-card interception on the compromised host, but that’s not hard to imagine either: maybe he can’t obtain the requisite privileges, or perhaps it’s less traceable to redirect to a different (perhaps also compromised) server.

Maybe I’m just paranoid, but more important than technical security measures are social measures: How can the public be expected to avoid phishing attacks when legitimate services are being given untrusted domains?

Refactoring stylesheets

Monday, October 2nd, 2006

Incidentally, one of the things I need to do to the shop is refactor the styles, which are XSL and CSS.

There is no good way I know of refactoring selector-driven stylesheets. Procedural styling, such as that used by Smarty or even vanilla PHP, is easy. Styling starts in one place and flows in a controlled way through to the end. Selectors make life difficult because you don’t know what is going to match where or what is overridden by a different selector elsewhere. It’s very difficult to work out how it works, even with comments, because you don’t know which comments are immediately relevant and besides, they refer to a document structure which is dynamically generated.

I’ve never managed to cleanly refactor CSS and I’ve not really tried with XSL, because it looks difficult for all the same reasons. With XSL there’s an intermediate XML structure that can be refactored, but this is generated procedurally. But for the presentation layer - CSS, XSL and to some extent Javascript - if anyone knows a better way of refactoring than throwing it all out and starting again, please let me know.

e-Commerce enquiries

Monday, October 2nd, 2006

Mauve Internet has had two new enquiries about e-Commerce sites this week, which is good. First in a while.

I suspect that there is typically a slump in the summer as smaller business owners plan more for their weekends than the future of their business. As summer has now passed, people start looking ahead more.

This does however mean that I will have to pimp my shop codebase. It really needs tidying up - lots of things that I wouldn’t do the way they are done now that I’ve had some experience of maintaining the codebase.

I have a ton of integration to do. There are two branches to the codebase:

  • One (let’s call it ‘stable’) has seen bugfixes and customer-driven improvements, but has been branched a dozen times and is a huge mess.
  • One has had some refactoring and more developer-driven improvements, but currently crashes due to character set issues.

After that is done, the administration interface needs some serious work. Most importantly, the ImageChooser service needs to be pretty much redone. It all needs a bit of AJAX on top to make administration a smoother experience, and I need to hook up TinyMCE to bolt in a minimal CMS.

The difficulty, if I do this work, is that I may still have to work with the aforementioned ‘stable’ version even though I will have a much improved next-generation version available. Perhaps I can cut a deal on that.

I’m also considering supporting osCommerce, because it would be cheaper in terms of codebase maintenance, but I wouldn’t be able to make the same guarantees I can about implementation of bespoke features and use of future-proof technologies. This would be available as an alternative to my shop software.

What I most want to do is rewrite everything in Python. Python is much faster to develop with than PHP, and leads to much tidier and more legible code.