GMake it Happen: Build Improvements and Parallelization
Finally, a post on something that’s potentially interesting to software folk! After attending Eric Ries’ talk on the Lean Startup, I started thinking about how to work towards continuous deployment within Sophos. Note that I say “work towards” and not “achieve” – for my product lines, at least, achieving continuous deployment would involve a very large and fundamental re-architecture, so that’s not in my plans at the minute. However, I believe that in working towards continuous deployment it is possible to obtain some very real benefits, so I decided to take some first steps.
We’ve already made great improvements since the early days of our projects. One such change was the componentization of our builds. Rather than have to rebuild absolutely everything whenever anything in the entire product changes (leading to statements like “I changed the wording in the help file, so now it’s going to take two hours to rebuild the operating system”), we’ve broken things out into logical components. In the Sophos Email Appliance, for example, these components include:
* os (our custom hardened version of FreeBSD)
* sophox (core system tools that are separate from the os)
* pmx (the appliance version of PureMessage, the core mail filtering software)
* apps (third-party things such as the database, MTA, CPAN modules, etc.)
* ui (all code related to the browser-based GUI of the product)
Almost everything has to rebuild if you change the OS, but hardly anything builds if you only change the UI. This didn’t do anything for our worst-case build time, of course, but it’s certainly cut down our average build time quite a bit. Starting with version 5.5, PureMessage for Unix has adopted this componentized build system as well (much to the team’s relief).
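To make the idea concrete, here is a minimal sketch in Python of how a componentized build can decide what to rebuild. The component names are the ones listed above, but the dependency edges between them are my own guess for illustration, not our actual build graph, and `components_to_rebuild` is a hypothetical helper rather than anything in our build system.

```python
from collections import deque

# Hypothetical dependency map: component -> components it builds on.
# The names come from the list above; the edges are assumed for illustration.
DEPENDS_ON = {
    "os": [],
    "sophox": ["os"],
    "apps": ["os"],
    "pmx": ["sophox", "apps"],
    "ui": ["pmx"],
}


def components_to_rebuild(changed):
    """Return the changed component plus everything that transitively depends on it."""
    # Invert the map so we can walk from a component to its dependents.
    dependents = {name: [] for name in DEPENDS_ON}
    for name, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].append(name)

    # Breadth-first walk over dependents, starting from the changed component.
    to_rebuild = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in to_rebuild:
                to_rebuild.add(dependent)
                queue.append(dependent)
    return to_rebuild


print(sorted(components_to_rebuild("ui")))  # ['ui']
print(sorted(components_to_rebuild("os")))  # ['apps', 'os', 'pmx', 'sophox', 'ui']
```

With these assumed edges, a UI change rebuilds only the UI, while an OS change rebuilds everything. That is the worst case staying the same while the average case improves.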
To repeatedly get software out quickly, you need to be confident in your code base and automated tests. If every change requires a week-long manual test pass, you’re never going to be releasing every 3 days – the numbers just don’t add up. So, we want a lot of automated tests, but this can get to be slow as well (our current nightly regression suite takes about 10 hours, mostly due to having to set up and tear down browsers). Unit testing can help mitigate this, as unit tests are often much speedier than full end-to-end UI tests due to their limited scope, and we have several thousand unit tests in our product. But now, since we run unit tests (with coverage!) on every build, it can take up to several hours to build the entire system. So, are we stuck with a time vs. quality trade-off?
As it turns out, there’s still a lot of room for improvement. In the last week, we’ve made a few relatively easy changes that have cut the build time on several key components by over 50%. There was no new hardware purchased, and nothing was rewritten. What we did was take advantage of our existing hardware and the lack of dependencies between most of our packages by adding some parallelism to the build. This started out with me toying around with the ‘-j’ option for Test::Harness, which runs unit test files in parallel. This mostly worked, except for a few straggling .t files that didn’t use process-specific temporary directories and so collided with each other when run at the same time. Over the weekend, a fellow development manager took these ideas, fixed up all the tests, and actually got parallel tests in the build – a big win! However, he wasn’t done. After realizing that a lot of the build time is spent running “make” in all of our sub-directories (as every “make test” first runs “make”), he changed our builds to take advantage of yet another ‘-j’ option, this time for gmake. Once this was in the build, all packages were built simultaneously, and all tests within a package were run simultaneously. This really makes our build boxes work – all 4 cores are now being used – but it cuts build times in half.
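The temporary-directory problem is worth a small illustration. Our real tests are Perl .t files, so the following is only a Python sketch of the same idea, with threads standing in for the separate test processes; `run_test` and `run_all` are hypothetical names, and the `jobs` parameter merely mimics what a ‘-j’ style option controls.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def run_test(name):
    """Stand-in for one test file: write scratch data, then read it back."""
    # Each run gets a fresh, uniquely named directory, so concurrent runs
    # cannot overwrite each other's scratch files. The directory is removed
    # when the with-block exits.
    with tempfile.TemporaryDirectory(prefix=f"{name}-") as workdir:
        scratch = os.path.join(workdir, "scratch.txt")
        with open(scratch, "w") as handle:
            handle.write(name)
        with open(scratch) as handle:
            passed = handle.read() == name
    return name, workdir, passed


def run_all(names, jobs=4):
    """Run the stand-in tests concurrently, with at most `jobs` at a time."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run_test, names))


results = run_all(["a.t", "b.t", "c.t", "d.t"])
print(all(passed for _, _, passed in results))  # True
```

Had every test written to one shared, fixed path instead, the read-back check could see another test’s data whenever two of them overlapped, which is the kind of failure that only shows up once tests run in parallel.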
Continuous deployment is still miles away, but we’ve halved the amount of time that it takes to get a change validated in a real build. It’s a start!
