Jan 10 2010

Competing on the Basis of Speed

For those not familiar with the ideas behind Lean software (and even for those who are!), please check out Competing on the Basis of Speed, a talk given to Google by Mary Poppendieck in 2006.

One of my work goals for 2010 is to compete on the basis of speed. Specifically, I want to help my team:

  • identify tech debt, defects, or process problems that are increasing lead time
  • minimize or eliminate the creation of new defects and tech debt
  • convince stakeholders of the value of delivering fast to get customer feedback

I think I’m already off to a good start. The aim of the project I’m currently working on is to add functionality to our product that makes it easier to integrate into customers’ existing infrastructure. The technical requirements are known to us, and we’ve finished implementing the functionality. However, we’re at a point where we’ve hit a bit of a wall – no one on the team has any experience as a consumer of this functionality (that is to say, none of us are IT administrators), and it has been difficult to get focused feedback from others within our organization who have such experience. What it comes down to is that we’ve got a feature that is technically correct (follows RFC specifications, has been load tested, etc.) but has not been tuned to customer environments.

This is not an uncommon situation in software – in fact, it’s part of the reason why “Have an embedded customer representative on the team” is a practice in Extreme Programming. However, it’s not always possible to find internal “customers” (we call them Product Managers) with extensive experience with every particular area of the field. For larger projects it may make sense to train the Product Manager in the specifics of the new functionality by having them consult heavily with paying customers, but for smaller projects this is not always feasible. My current project is less than a month old and we’re code complete, and much of the time was over the Christmas holidays with our Product Manager (and most of our customers) on vacation, so consultation wasn’t much of an option.

So, here we stand – a mostly finished project that can be release-ready within 2 weeks but that has not been fine-tuned to meet all customer requirements (as such requirements are unknown). What to do? The traditional approach within the company has been to do a beta, but betas don’t necessarily solve all problems. Ours are opt-in, and often contain very few members (less than 0.5% of our customer base). Feedback can be hard to gather as beta systems are often put into non-production situations. There is also quite a bit of overhead involved in coordinating and communicating with all of the customers involved.

Instead, I’m pushing towards getting this thing released to all customers as early as possible. The more people playing with it the better. The code is not buggy (we hope!), it just may be lacking some specific features or compatibility. Rather than wait around for 2 or 3 months as we do research and try to model customer scenarios with complete accuracy (a process that’s inevitably difficult and fraught with errors), we’ll get the code out into the field.

The best case scenario is that everything we’ve done so far is adequate for the market, and there are no future requirements. This means we’ve started recognizing value 2-3 months earlier than if we had waited and done more market research, and we haven’t sat around gold plating the project for a quarter. The most likely scenario is that our code is adequate for some, but others will need some enhancements before it will work for them. We can then prioritize these enhancements based on some sort of financial metric (renewal date of customers who need the feature, likelihood that the enhancement will bring new customers, etc.) and deliver them over the next little while.

The worst case scenario is by far the least likely to happen – that would be where customers get the new feature, find that it’s not quite up to snuff, and because of this decide to overhaul their IT infrastructure and rip out all of our company’s products. That’s so unlikely that it’s barely worth mentioning. Something like this is more likely in the case where we’re changing an existing feature instead of adding a new one, but even then it’s a very slim chance, and would be the result of a decision on the customer’s part based on emotion rather than reason.

If you presented a customer with the choice between the following two options:

  • a rudimentary version of Feature X now, with improvements to come soon afterward
  • a “complete” version of Feature X several months from now

… I’m willing to bet that most customers would pick the first option. The choice would be even easier once the customer realized that the “complete” version from the second option would likely have to be followed up by a release or two afterward containing improvements that the developers / Product Management failed to identify in the first go-round.

I’m excited that there seems to be some buy-in to this approach so far – hopefully it pays off for us!


Nov 14 2009

Limited WIP – Project Portfolio

All of this Kanban reading I’ve been doing has been great, and as I’ve mentioned before I’ve started implementing some of the techniques and metric tracking. However, after meeting with my manager (himself an experienced Agile thinker), I’ve realized that in some ways I’m trying to solve problems I don’t have, while neglecting some of my bigger issues.

Limited WIP is of course one of the key techniques / philosophies of Kanban (and lean in general). It’s not important in and of itself, but instead because out of it comes increased collaboration, reduced cycle time, identification of bottlenecks, and all of that other great stuff. We haven’t really reaped any benefit from this (yet) besides limiting the backlog, which has fostered cross-business-unit collaboration, but it’s only been 3 weeks or so and we were a fairly disciplined team before that. Some of the metrics I’ve been tracking (such as how long bugs have been in the system before we ship them, and how long items are blocked for) will definitely come in handy to measure the team’s effectiveness and customer responsiveness.

That said, we’ve never really had much of a problem with Work in Progress, at least at the task level. After talking with my manager, we realized that we had another problem, one that was discussed by Johanna Rothman at the recent Agile Vancouver conference – far too many items in our project portfolio! We have 8 people on the team, and had limited our active WIP to 12 (to allow for some blockages), but it turns out that we were working on 7 different, mostly independent projects.  Most of these projects were rather small, and the diffusion of resources was mostly due to the fact that there was no obvious parallelization for most of the projects, but still – that’s almost 1 independent project per person!

Due to the nature of my team (maintaining 3 products, and integrating these products once monthly with software from elsewhere in the company), it’s unlikely that we’ll ever get down to a project WIP limit of one. After we clean up the current mess, we’re going to try 1+1 – a maximum of one project on the go at any given time, with the exception of these small monthly integrations. With the slack time, we can clean up some of our (ample) technical debt. Part of the metrics I’ve been tracking is the percentage of our work spent on “failure load”, a combination of technical debt and missed requirements.
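As a rough illustration of the “failure load” metric, here is a minimal Python sketch. The category names, the `failure_load_percent` helper, and the sample hours are all hypothetical; the only thing taken from the post is the definition (technical debt plus missed requirements, as a share of all work).

```python
# Categories counted as "failure load" (names are assumptions for illustration).
FAILURE_CATEGORIES = {"tech_debt", "missed_requirement"}


def failure_load_percent(work_items):
    """Return the percentage of total hours spent on failure load.

    work_items is a list of (category, hours) pairs.
    """
    total = sum(hours for _, hours in work_items)
    if total == 0:
        return 0.0  # no work logged, so nothing to attribute
    failure = sum(hours for category, hours in work_items
                  if category in FAILURE_CATEGORIES)
    return 100.0 * failure / total


# Made-up numbers for one iteration: 40 of 120 hours are failure load.
iteration = [
    ("feature", 60),
    ("tech_debt", 25),
    ("missed_requirement", 15),
    ("integration", 20),
]
print(round(failure_load_percent(iteration), 1))  # 33.3
```

Tracking this number per iteration would show whether the slack time is actually shrinking the debt.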

This will likely start in earnest in the new year, but I’m already excited about the results. I’m actually surprised that I let it get this bad – we’d been effectively doing a 1+1 project WIP limit for most of the past year and a half – but I think by formalizing the process a bit more (not making it heavier though!) we can keep ourselves to good habits.


Sep 15 2009

Performance Loss due to Project Switching

This post is a slightly edited version of an email I sent today in response to the question “Do you know any publications on performance loss due to switching from project to project?” I thought I gave a good answer, and I’ve wanted to write more about software engineering recently, so here it is.

It really depends on what is meant by “performance loss” and what the nature of the “switching” is.

The worst case scenario (while still remaining plausible) would be switching back and forth between 2 (or more!) projects in some sort of phased SDLC – that is, doing all of the analysis for project 1, then project 2, followed by all of the design for project 1, then 2, then coding 1 & 2 and testing 1 & 2 and so on until you’ve shipped both projects. This may seem absurd to some people, and would definitely be a major process failure for an independent team, but in some cases this might be necessary; say, if specialist resources like designers and testers are a constraint, or if there are external dependencies. The obvious problem here is the greatly increased lead time for both of the projects. If the projects are of roughly the same size, the lead time has at the very least doubled (and that’s discounting any cost at all for context switching). This is a “Bad Thing” (there should be ample references in the lean literature about the benefit of decreased lead time), even if the scenario is not as blatantly bad as the one I just laid out.
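To put rough numbers on that, here is a toy Python model. It assumes two equal-sized projects with four phases of one week each and zero context-switching cost; the `lead_times` helper and all of the figures are hypothetical. Lead time is measured from the first work on a project until its last phase is done.

```python
PHASES = ["analysis", "design", "coding", "testing"]
PROJECTS = ["project 1", "project 2"]
WEEKS_PER_PHASE = 1  # assumed; real phases are never this tidy


def lead_times(schedule):
    """Return {project: weeks from first work started to last work finished}.

    schedule is an ordered list of (project, duration) work items handled
    one at a time by the same team.
    """
    clock = 0
    started, finished = {}, {}
    for project, duration in schedule:
        started.setdefault(project, clock)  # remember first touch only
        clock += duration
        finished[project] = clock           # overwritten until the last phase
    return {p: finished[p] - started[p] for p in started}


# Finish project 1 completely, then do project 2.
sequential = [(p, WEEKS_PER_PHASE) for p in PROJECTS for _ in PHASES]
# Do each phase for both projects before moving to the next phase.
interleaved = [(p, WEEKS_PER_PHASE) for _ in PHASES for p in PROJECTS]

print(lead_times(sequential))   # {'project 1': 4, 'project 2': 4}
print(lead_times(interleaved))  # {'project 1': 7, 'project 2': 7}
```

In this simplified model the total work is identical, but each project’s lead time goes from 4 weeks to 7, and the ratio approaches 2x as the number of phases grows. Any real context-switching cost would push it further.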

For “performance loss”, it really depends on what’s being measured. If you’re thinking about lead time (which is directly related to throughput[1], and thus revenue – in other words, you should always be thinking about lead time), switching (or expediting, a different manifestation of switching) will negatively affect performance. Tom DeMarco argues in Slack that the real penalties for task / project switching are often hidden because of how people measure efficiency (number of hours worked / busy people as opposed to value delivered to customers, i.e. throughput).

As for actual loss due to context switching, which was possibly the original intent of the question, I think this may have been partially covered by Brooks in The Mythical Man-Month, and according to a blog post by Jeff Atwood of Coding Horror this topic is discussed in Gerald Weinberg’s Quality Software Management: Systems Thinking. I haven’t read this book, but the title is evocative of Peter Senge’s The Fifth Discipline: The Art and Practice of the Learning Organization (which helped pioneer systems thinking). I haven’t had a chance to read that either (it’s on my to-do list!), but it’s pretty well regarded from what I can tell.

So, does anyone out there have some other references to exactly how switching from project to project can cause trouble? I’m looking for papers, books, studies and the like, not just a restatement of lean or agile principles, so you can’t just say “context switching is bad because it slows people down” – show me the study!

[1] Agile Management for Software Engineering, David J. Anderson [2003]


May 6 2009

GMake it Happen: Build Improvements and Parallelization

Finally, a post on something that’s potentially interesting to software folk! After attending Eric Ries’ talk on the Lean Startup, I started thinking about how to work towards continuous deployment within Sophos. Note that I say “work towards” and not “achieve” – for my product lines, at least, achieving continuous deployment would involve a very large and fundamental re-architecture, so that’s not in my plans at the minute. However, I believe that in working towards continuous deployment it is possible to obtain some very real benefits, so I decided to take some first steps.

We’ve already made great improvements since the early days of our projects. One such change was the componentization of our builds. Rather than have to rebuild absolutely everything whenever anything in the entire product changes (leading to statements like “I changed the wording in the help file, so now it’s going to take two hours to rebuild the operating system”), we’ve broken things out into logical components. In the Sophos Email Appliance, for example, these components include:

* os (our custom hardened version of FreeBSD)
* sophox (core system tools that are separate from the os)
* pmx (the appliance version of PureMessage, the core mail filtering software)
* apps (third-party things such as the database, MTA, CPAN modules, etc.)
* ui (all code related to the browser-based GUI of the product)

Almost everything has to rebuild if you change the OS, but hardly anything builds if you only change the UI. This didn’t do anything for our worst-case build time, of course, but it’s certainly cut down our average build time quite a bit. Starting with version 5.5, PureMessage for Unix has adopted this componentized build system as well (much to the team’s relief).
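The rebuild rule can be sketched in a few lines of Python. The dependency graph below is purely an assumption for illustration (the real relationships between our components are not spelled out here), as is the `components_to_rebuild` helper; the point is only that a change triggers a rebuild of the changed component plus everything that transitively depends on it.

```python
from collections import deque

# HYPOTHETICAL graph: component -> components it depends on.
DEPENDS_ON = {
    "os": [],
    "sophox": ["os"],
    "apps": ["os"],
    "pmx": ["sophox", "apps"],
    "ui": ["pmx"],
}


def components_to_rebuild(changed, depends_on=DEPENDS_ON):
    """Return the changed component plus all of its transitive dependents."""
    # Invert the graph so we can walk from a component to what uses it.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(name)

    to_rebuild = {changed}
    queue = deque([changed])
    while queue:  # breadth-first walk over dependents
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in to_rebuild:
                to_rebuild.add(dependent)
                queue.append(dependent)
    return to_rebuild


print(sorted(components_to_rebuild("os")))  # ['apps', 'os', 'pmx', 'sophox', 'ui']
print(sorted(components_to_rebuild("ui")))  # ['ui']
```

With a graph shaped like this, the worst case (an OS change) is unchanged, while a UI-only change rebuilds a single component.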

To repeatedly get software out quickly, you need to be confident in your code base and automated tests. If every change requires a week-long manual test pass, you’re never going to be releasing every 3 days – the numbers just don’t add up. So, we want a lot of automated tests, but this can get to be slow as well (our current nightly regression suite takes about 10 hours, mostly due to having to set up and tear down browsers). Unit tests can help mitigate this, as they are often much speedier than full end-to-end UI tests due to their limited scope, and we have several thousand unit tests in our product. But now, since we run unit tests (with coverage!) on every build, it can take up to several hours to build the entire system. So, are we stuck with a time vs. quality trade off?

As it turns out, there’s still a lot of room for improvement. In the last week, we’ve made a few relatively easy changes that have cut the build time on several key components by over 50%. There was no new hardware purchased, and nothing was rewritten. What we did was take advantage of our existing hardware and the lack of dependence in most of our packages by adding some parallelism to the build. This started out with me toying around with the ‘-j’ option for Test::Harness, which runs all unit tests in parallel. This mostly worked, except for a few straggling .t files that didn’t use process-specific temporary directories. Over the weekend, a fellow development manager took these ideas, fixed up all the tests, and actually got parallel tests in the build – a big win! However, he wasn’t done. After realizing that a lot of the build time is spent running “make” in all of our sub-directories (as every “make test” first runs “make”), he changed our builds to take advantage of yet another ‘-j’ option, this time for gmake. Once this was in the build, all packages were built simultaneously, and all tests within a package were run simultaneously. This really makes our build boxes work – all 4 cores are now being used – and it cuts build times in half.
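The temporary-directory problem is worth a small illustration. Our tests are Perl .t files, so the Python below is only an analogy, and `run_test` and `run_all` are hypothetical stand-ins: each “test” gets its own scratch directory, which is what lets several of them run at once without trampling each other’s files. It uses threads for simplicity, whereas the real build runs separate processes.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_test(name):
    """Stand-in for one test: write scratch data, read it back, report pass/fail."""
    # A private temporary directory per run means no shared path to collide on.
    with tempfile.TemporaryDirectory(prefix=f"{name}-") as scratch:
        scratch_file = Path(scratch) / "scratch.txt"
        scratch_file.write_text(name)
        return name, scratch_file.read_text() == name


def run_all(test_names, jobs=4):
    """Run the tests with up to `jobs` in flight at once; return {name: passed}."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return dict(pool.map(run_test, test_names))


if __name__ == "__main__":
    results = run_all([f"test_{i}" for i in range(8)])
    print(all(results.values()))
```

If `run_test` wrote to one fixed path instead, the same tests would pass serially and fail intermittently in parallel, which is the kind of straggler we had to fix.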

Continuous deployment is still miles away, but we’ve halved the amount of time that it takes to get a change validated in a real build. It’s a start!


Apr 22 2009

Lean Development for Lean Times

Yesterday I was able to attend a mini-conference entitled “Lean Development for Lean Times”, put on by Agile Vancouver. The conference consisted of 3 speakers in two rooms, each speaking twice over the course of three time slots, allowing everyone to see every speaker (or one speaker twice, if they so desired). As you know, my company has been investigating lean development practices, and my manager was one of the conference organizers, so there ended up being about 10 people from Sophos there. Thankfully, there was a mix of software developers, managers, product management, and even the VP of development, which I believe increases our chances of adopting lean.

The first speaker I took in was Silicon Valley veteran Eric Ries, who gave a rapid-fire talk on the Lean Startup. The talk was quite informative, my favourite of the three, even though I have no desire to start or join a startup – the ideas and methods he gave are applicable to startup-like teams within large organizations as well. I took extensive notes, and I’ll likely end up doing a separate post just on this talk (unless my writing duplicates Eric’s website, which I have yet to read in-depth).

The second talk, given by Corey Ladas, was entitled “Scrumban: Lean Thinking for Agile Process Evolution”. Scrumban is a portmanteau of Scrum, an implementation of the agile software methodology, and kanban, a Japanese word translated roughly as “sign board”. Scrumban is the idea of taking kanban cards, an idea borrowed directly from lean manufacturing, and integrating them into the standard agile task board as a way of implementing pull scheduling. This is an idea that I’m likely going to try out with my team, so I’ll talk more about the concept once I’ve had a bit of experience with it.

The last talk of the day was “An Introduction to Lean Product Development” by Katherine Radeka. True to its name, it was an introductory talk on lean development principles, how lean development differs from lean manufacturing, and the nature of waste. Having read a few books on lean recently, there was nothing too groundbreaking here for me, although I did enjoy the Q&A and participatory aspects of the talk.

I quite enjoyed the conference, and it was interesting to see the reactions of people to some of the claims being put forth by the presenters (as well as from others in the audience). Hopefully I’ll get to try out some of these techniques at work, and I’ll definitely post about any successes or failures that result.