On Elephant Traps

Elephant trap (software): ground that looks firm and takes one’s initial weight, but which subsequently drops out from under you, leaving you in a dark pit, staring up at the disapproving face of your project manager as they scatter the ashes of your estimates over you like sarcastic confetti.

The difference between real elephant traps and software ones is that the software traps aren’t intentional (and don’t actually trap elephants, which is Bad™). Instead, software traps are created by our desire to avoid risk, but in the process create massive unknown risks that far outweigh the initial risk we thought we were avoiding.

A real example from my career was when we decided to move from using GUIDs as primary keys to integers. This was a demonstrably Good Thing To Do, but it was also a huge change: our application logic lived primarily in stored procedures, and all of these took GUIDs as keys (not to mention our ADO.NET application code, web pages, etc). In our desire to avoid risk, we made a fateful decision: fudge it.

Where possible, we continued to use the GUIDs, and left the columns in place in the database. We created SQL functions to look up the real integer keys from GUIDs, and vice-versa, and peppered our stored procedures with these; meanwhile, the website kept passing GUIDs from page to page.

What we thought we were doing was avoiding risk:

change = risk ∴ fewer LoC changed = less risk

Unfortunately, this is false, and in fact is one of the main causes of risk in software. By being clever, we made the system less obvious – new developers had to get their heads around the fact that the keys passed via the website weren’t the actual database keys. This might have been OK, except that newer pages bypassed the GUIDs and just used integers, not to mention some pages that had to be modified to use integers to get better performance. Gradually, stored procedures evolved that could take either a GUID or an integer as a key – and would use one if the other was null (except when developers got confused, passed both in and the procedure crashed).

Smoke & Mirrors

smokemirrorsIf code is less obvious, it takes longer for a developer to fully understand it – and the result is that developers are more likely to inadvertently break stuff because they don’t realise that the layer of smoke-and-mirrors is pretending that it’s something else to satisfy the needs of an old bit of VB6. In effect, we’re disguising complexity, and thus we create elephant traps.

As I see it, this is a legacy of old, poorly-understood code. In such projects, the application is so poorly known that developers cannot be sure what the effects of making code changes will be. The natural instinct therefore is to make changes in such a way as to change the code as little as possible – in practise, this usually means pushing changes down to the database, because a) it’s the lowest level, and b) it can be changed live, so if it breaks it can be fixed without a build.

Change != Risk

The problem we often face is that we have no way of knowing what effects our changes will have, and so our natural instinct is to make fewer changes. This results in brittle software and strangles innovation. No-one is going to risk adding a cool new feature if they’re too scared to change the page layout because no-one understands how that custom JavaScript-generated CSS engine resizes page content.

We should make breaking changes as soon as possible: the sooner we break things, the sooner we know the scale of the problem we face, and the better we understand the system (breaking something is, despite what Gandalf said, a fantastic way of finding out what something is).

More Tests = Less Risk

The answer is tests. An application with a comprehensive suite of unit tests, integration tests, automated UI tests and a manual tester has a high degree of safety – a developer just has to run the unit tests to see whether they’ve broken anything fundamental. To get the best out of this, the application should have a short feedback cycle – if it takes a day to get it running locally, developers will make as few changes as possible. If it’s weeks between builds and integration tests, then people will start resenting fixing weeks-old mistakes because they’re now working on something else (and in the worst case, these bugs will become Business As Usual).

A well-run project with unit tests that can be run in seconds, continuous integration builds and tests on every push, and regular builds to testing with, at the very least, a five minute manual smoke test, is one where developers are not afraid to make changes.

Don’t be mean to elephants! They never forget…