Augmenting agile with formal methods • hillel wayne

class BoundedBuffer { synchronized void put ( Object x ) throws InterruptedException { while ( occupied == buffer . length ) wait (); notify (); ++ occupied ; putAt %= buffer . length ; buffer [ putAt ++] = x ; } synchronized Object take () throws InterruptedException { while ( occupied == 0 ) wait (); notify (); — occupied ; takeAt %= buffer . length ; return buffer [ takeAt ++]; } private Object [] buffer = new Object [ 4 ]; private int putAt , takeAt , occupied ; }

Nobody could do it. Don Wells even said that Cargill was mistaken: the code was just fine! After a few weeks 2 Cargill revealed the bug and asked if anybody could write a test knowing now what to specifically look for. A few weeks after that Wells finally managed it: if he took 11 threads and ran one specific test 250 times, there was a good (but not certain!) chance one of those runs would exhibit the bug.

He claimed finding the right test took him about 16 hours, or “two days on an actual project”, and that this proved concurrent code wasn’t a problem for XP. Everybody else disagreed.

Concurrency is evil. Synchronous algorithms have a single well-defined output for a given input. Concurrent algorithms have a set of possible outputs depending on the order of events. There are n! ways to order n unique, independent events. Assuming each thread runs exactly once, that’s almost 40,000,000 different behaviors. Some of these are impossible, and some are permutations of other behaviors, but even so you’re looking at hundreds or thousands of distinct behaviors to test. And you have to test them all, as even one broken behavior is enough to crash production.

It doesn’t help that testing specific behaviors is notoriously difficult. Fortunately, we have much a better tool for studying concurrency: formal methods. By creating a mathematical description of the system, we can explore its properties in abstract. Checking a few thousand behaviors of a design is much easier than setting up a few thousand finely-tuned test cases.

My spec language of choice was TLA+. Not only is it extremely good at modeling concurrency, but it has a DSL, PlusCal, that is almost perfect for representing sequential processes. I wrote a spec and then used the TLC model-checker to evaluate all of the reachable states. I’ve talked about TLA+ a bunch before, but if this is the first you’re hearing about it, I have a gentler intro here. Writing the Spec

with nondeterministically pulls an arbitrary element from SleepingThreads. Since TLC is an exhaustive checker, it will verify all possible selections. So if something calls notify() and there are three sleeping threads, TLC will split into three new timelines, one for each element, and check them all. If in one of those timelines we immediately call notify() again TLC will split that into two more timelines.

get and put are pretty straightforward translations of the Java code. Each has a single label, which represents the unit of atomicity. Since each process has only one label, TLC will evaluate the entire body before switching to another process. As with with, the choice of next process to run is nondeterministic, so after evaluating one process it will create another timeline for each other awake process. This is all on top of the with forks, so the state space balloons extremely quickly. Running the Model

Cargill hardcoded the buffer length to 4, but I decided to generalize it. Before getting to more complex properties, I had TLC check for deadlocks, undefined operations, and type errors. I first ran TLC with BuffLen <- 2, Threads <- 1..2. That passed. I also ran a model with four threads and eight threads. 4 threads passed, too. But when we had 8 threads, we get a deadlock:

Once he knew what the bug was, it still took Wells several hours to write a unit test that caught it. Even then, the test was incredibly fine-tuned and had to be run 250 times in a row to usually catch the bug. If he modified the function and the unit test starts passing, there’s no way to know whether he actually fixed the bug or just moved it outside of the test’s calibration.

Case in point: Wells identified the bug happens because “take threads are not specifically limited to put threads as they should be.” Changing the getters to only wake up putters makes all of his tests pass, but the code is still broken (put threads can wake other put threads). It just requires a slightly different configuration and several more steps, so his tests don’t find it.

The difference between writing TLA+ and just writing unit tests isn’t half an hour versus sixteen hours, it’s half an hour versus “Two weeks to realize there’s a bug, a week to find the bug, three days to understand the bug, sixteen hours to write the test, twenty minutes to run the test, and you don’t know if your fix really works.” Caveats

• “You might make a mistake translating the spec to code.” This one is really interesting. I’ll admit I have made mistakes translating specs before, leading to bugs in the implemented system. You still need to rigorously test and review the code you write; specs can’t really help you here. At the same time, you have that problem with any code you write. If you write it without a spec, though, you don’t know whether the code is buggy because the implementation is wrong or because the fundamental design is wrong.

Additionally, TLA+ specs can take a long time to run. For a small model like this one it isn’t too bad, but you tried, say, BuffLen <- 10 and Threads <- 21, it will take a very long time to finish. There’s some symmetry tricks you can exploit to reduce the state space, and you can parallelize checking safety invariants, but you’re still not going to be able to run this as part of your CI.

Finally, the comparison here isn’t completely accurate, because I got lucky with the bug. Since I first wrote and verified the locking code, I didn’t need to write the rest of the logic. If the bug was something about the data in the buffer itself I would have had to flesh that stuff out. By contrast, the Agilists were also trying to test that the buffer data was also valid. Extending the spec would take a little more time, making the relative effort ratio a little less extreme. 3

Ultimately, though, I’m fine with these limitations. I’m not suggesting formal methods replaces heavy refactoring, pairing, testing, etc, but that it augments it. Specifications give us a means of thinking quickly and deeply about complex systems and finding flaws in our designs. It helps us build higher-quality systems faster and cheaper. If that isn’t Agile I don’t know what is.