Automated Builds | kebabShopBlues

[Image: days since the build has been broken]

The picture was taken in my office a few weeks ago. It is not faked. The output of an automated build system (TeamCity) is reporting how many days the build has had problems, and the number of days is 42.

I thought that the point of the automated build was to help ensure that test suites were run, that individual commits were free of problems, and of course that associated units of work had not been negatively affected. If they had been, ‘someone’ would need to resolve the problem and get the build fixed again. I’m not exactly clear on what the process was. That may be because of late I have been a DBA more than a developer (and thus, quite reasonably, not kept in the loop about development processes), but I think I can say there had been at least 42 days where one or more developers were unclear on the process too!

When it comes to automated tests, I am a little bit proud that I added the very first (NUnit) test projects to the code-base. The company at that time was very small, and although I published information on how to run them, uptake was limited. I have to admit they were a bit slow – being what I would now term ‘Integration tests’ (which is to say that they tested the behaviour of the system at the database level, requiring all normal database reads and writes to occur for a test to be undertaken). Still, I had started a process that I was able to follow – before I committed I would run the tests and see if system behaviour had been changed negatively. After quite a period of time, the company gained momentum on unit-testing, and lots more were written by many people.

A few people even got into the habit of running the tests! But even this early increase in uptake exposed one thing, intermittently: the integration tests did not handle multiple concurrent runs very well (e.g. each test might set up its own test data in the database, but this would cause problems if two similar tests were setting up that same data at the same time). This got much worse when the automated build tools were brought in, because now the automated build might conflict with itself in certain circumstances! Developers were heard muttering about how unreliable the tests were.
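To illustrate the kind of collision I mean, here is a minimal sketch in Python rather than the NUnit code we actually had. Everything in it is hypothetical: the `customer` table, the `TEST-CUSTOMER` key and the function names are invented for the example, an in-memory SQLite database stands in for the shared test database, and two sequential set-ups stand in for two overlapping runs. It shows one possible remedy (a unique key per run), not the only one.

```python
import sqlite3
import uuid


def make_db():
    # In-memory SQLite stands in for the shared test database (an assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (ref TEXT PRIMARY KEY, name TEXT NOT NULL)")
    return conn


def fixed_key_setup(conn):
    # Fragile: every run inserts the same hard-coded key.
    conn.execute(
        "INSERT INTO customer (ref, name) VALUES (?, ?)", ("TEST-CUSTOMER", "Test")
    )
    return "TEST-CUSTOMER"


def unique_key_setup(conn):
    # Safer: each run generates its own key, so overlapping runs do not collide.
    ref = "TEST-" + uuid.uuid4().hex
    conn.execute("INSERT INTO customer (ref, name) VALUES (?, ?)", (ref, "Test"))
    return ref


def simulate_two_runs(setup):
    """Apply the same set-up twice to one database, as two overlapping runs would."""
    conn = make_db()
    try:
        setup(conn)
        setup(conn)
        return "ok"
    except sqlite3.IntegrityError:
        # The second run tripped over data created by the first.
        return "collision"
    finally:
        conn.close()
```

With the fixed key the second set-up fails on the primary key; with a generated key both succeed. Each test then has to clean up (or tolerate) its own rows, which is a cost of this approach.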

A push was made to add more ‘unit’ tests, which is to say tests that tested very small parts of code and ran quickly, and this caused Rhino Mocks and Ninjection code to proliferate. A rash of errors started occurring in live: code that had no unit tests was being changed so that it could be tested, and in the process some mistakes were inevitably made and reached live. One issue was that developers were proving to be poor at explaining to testers exactly what they had changed, so the changes had simply not been tested.

I believe there were also some fundamental misunderstandings about the purpose of testing any particular piece of functionality. I saw tests for code where almost every line was treated as if it were critical… and tests like this tended to be fragile to incredibly minor changes. My view is that in any block of code there is usually a critical intent (let’s say to add a new loan record), there may be side-effects of medium importance (perhaps adding a record that describes to a user, in English, what has been done), and then there is the stuff of relatively minor importance (like the exact wording of that message). Remove a duplicated space in such a message, and you have a test failure.
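A rough sketch of how I would split those levels of importance in a test, written in Python with the standard `unittest` module rather than NUnit. The `LoanBook` class and its message text are hypothetical stand-ins invented for the example. The first test pins down the critical intent exactly; the second only checks that a message exists and mentions the key facts, so tidying the wording or spacing would not break it.

```python
import unittest


class LoanBook:
    """Hypothetical stand-in for the loan code described above."""

    def __init__(self):
        self.loans = []
        self.messages = []

    def add_loan(self, customer, amount):
        self.loans.append({"customer": customer, "amount": amount})
        self.messages.append(f"Loan of {amount} added for {customer}.")


class AddLoanTests(unittest.TestCase):
    def setUp(self):
        self.book = LoanBook()
        self.book.add_loan("Smith", 250)

    def test_loan_record_is_created(self):
        # Critical intent: exactly one loan, with exactly these values.
        self.assertEqual(self.book.loans, [{"customer": "Smith", "amount": 250}])

    def test_user_is_told_what_happened(self):
        # Medium importance: one message that mentions the key facts.
        # The exact wording is deliberately not asserted.
        self.assertEqual(len(self.book.messages), 1)
        self.assertIn("Smith", self.book.messages[0])
        self.assertIn("250", self.book.messages[0])
```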

One problem is that even with well-intentioned team members working to fix some issues, it seems to be difficult to assess in hindsight which parts of the code are critical and which are not. I certainly had conversations with developers who suggested that particular tests had broken because they were flawed and unreliable tests, whereas my opinion was that the code had been changed or broken somehow! If such confusion is possible, then the efficacy of the tests will always weaken as time passes. But why does it occur? Additionally, I saw that some tests were set to Ignore, and I am pretty certain that this was occasionally done without reference to the original test author. To me, that is dangerous; surely an ‘Ignored’ test is as good as a test failure? It is something that someone once considered worth testing, and that someone else has since overruled. So now I assert that developers may be bad at assessing the importance of historic code when retrofitting tests to it, and they may also be bad at assessing the importance of a test when it has failed (or is simply being reviewed).
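If an ignored test really is as good as a failure, the build could be made to say so. The following is only a sketch of the idea using Python’s standard `unittest` module, where a skipped test plays the role of an NUnit ‘Ignored’ one; `ExampleTests` and `run_strict` are names made up for the illustration, and I am not claiming any build server works this way out of the box.

```python
import unittest


class ExampleTests(unittest.TestCase):
    def test_passes(self):
        self.assertEqual(1 + 1, 2)

    @unittest.skip("flaky, disabled without asking the author")
    def test_ignored(self):
        self.fail("never runs")


def run_strict(test_case):
    """Run a TestCase class; succeed only if nothing failed AND nothing was skipped."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    result = unittest.TestResult()
    suite.run(result)
    # result.skipped holds (test, reason) pairs; keep them so the report can
    # name each skipped test and the reason someone gave for skipping it.
    skipped = [(str(test), reason) for test, reason in result.skipped]
    ok = result.wasSuccessful() and not skipped
    return ok, skipped
```

By default `wasSuccessful()` is still true when tests are skipped, which is exactly how an ignored test can go unnoticed; `run_strict` reports the run as not OK and lists what was skipped and why.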

I think the most recent killer issue is that other tests were added to the build suites that were somehow considered to be even lower priority, or ‘experimental’. Thus, this marker-number of ‘days the build has been broken’ started rising even if the key unit and integration tests had been passing. Perhaps they should have been separated into a totally unrelated build until they were well-proven.

As for myself, I certainly remain a fan of the concept of automated testing, but I probably still consider the full ‘Integration test’ to be the most valid kind. A unit-test of a small function may be A Good Thing, but I would say the overall process is still of far more interest to the business. As an example, one time I went out of my way to write ‘unit’ tests of code rather than integration tests. Of course, the tests proved that what I had written worked OK, and that encouraged me to press for the code to be released. It was only then that I realised that there was an error in the ORM layer, and as a result, values that should have been null were appearing as a magic null-equivalent value! Had I written Integration tests I would have caught this error. Integration tests also allow you to test database functionality that your system might rely on, such as unique keys intended to prevent duplication, or triggers, or whatever. Unit tests will never allow you to leverage or test such DB functionality.
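The unique-key point can be shown with a small sketch. It is in Python with the standard `sqlite3` module, and the `loan` table, the two store classes and `add_twice` are all hypothetical names for the illustration. `FakeLoanStore` is the sort of in-memory double a unit test might use; `SqlLoanStore` talks to a real (if tiny) database that carries the constraint.

```python
import sqlite3


class FakeLoanStore:
    """In-memory double of the kind a unit test might use; it knows nothing about constraints."""

    def __init__(self):
        self.rows = []

    def add(self, loan_ref, amount):
        self.rows.append((loan_ref, amount))


class SqlLoanStore:
    """Same interface, backed by a database with a unique key on loan_ref."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE loan (loan_ref TEXT NOT NULL UNIQUE, amount INTEGER)"
        )

    def add(self, loan_ref, amount):
        self.conn.execute(
            "INSERT INTO loan (loan_ref, amount) VALUES (?, ?)", (loan_ref, amount)
        )


def add_twice(store):
    """Try to store the same loan twice and report what the store did."""
    store.add("L-1", 100)
    try:
        store.add("L-1", 100)
    except sqlite3.IntegrityError:
        return "duplicate rejected"
    return "duplicate accepted"
```

Against the fake, the duplicate sails through; against the database, the unique key rejects it. Only the second kind of test tells you whether the protection the system relies on is actually there.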

And now? The screens in the office are left off all the time… the builds and failures still happen behind-the-scenes, but no-one knows and I presume no-one can face looking. It’s difficult to know if the business will recover from this situation. I think, had I documented the timeline here in full, that you the reader would have identified a hundred places where different decisions or processes could have been implemented with substantially improved results.