BDDBDDB – Pointer alias analysis comes of age
One of the most exciting program analysis techniques I’ve ever seen (I love having my own geek blog, where I can say things like that with a straight face) is the BDDBDDB system, from Whaley and Lam at Stanford. (I will capitalize BDDBDDB here, though they don’t.)
Pointer alias analysis is tracking the movement of object references through a large piece of code, across method calls and through heap references. I’ve been reading programming research papers for almost two decades now, and for much of that time, pointer alias analysis was a chimera. It just wasn’t possible to make it scale. The number of potential paths in even a moderately large piece of code is dizzying — numbers like 10 to the 14th power aren’t uncommon. And for most of the last two decades, this has been an uncracked nut.
Well, Whaley and Lam seem to have finally cracked it, by building on other research done using binary decision diagrams (BDDs) to compactly encode the movement of values through variables in a program, in a way that maximizes the amount of shared structure in the variable flow graphs. BDDs were originally created for hardware logic analysis, to track the movement of signals through a huge circuit. So it’s elegant that they can also apply to software analysis, tracking the movement of value through a large code base.
The further elegance in Whaley and Lam’s technique is the query language laid on top of the BDD-based analysis. Binary decision diagrams can be used to encode relations; they can be joined and queried much like relational tables. And there is a query language, Datalog, that is well suited for expressing queries over a BDD-based relational model. Datalog is a relational query language, similar to a more abstract version of SQL, except it supports recursive queries (which SQL does not). Whaley and Lam are able to encode a Datalog-based query and relationship-generation engine on top of their BDD aliasing structures. Hence, BDDBDDB — Binary Decision Diagram-Based Deductive DataBase.
One of the more amazing results in their paper is their discussion of how their analyses improved once they had the Datalog layer. originally they wrote a context-specific alias analysis on top of a BDD engine without any intervening abstract query language. This took them a few months, the code was thousands of lines long, the performance was not good, and there were a number of subtle bugs. Once they wrote the Datalog engine and reformulated their analysis as Datalog formulas, the total size of the analysis dropped to just around a dozen lines (!) of Datalog, and the performance was one to two orders of magnitude faster.
Once you have a query engine that can efficiently query an entire software system to track value and type flow throughout, in a reasonably accurate way, there are many systems you can build on top of it. For instance, you can write an Eclipse plugin to track the exact lines of code that expose a web application to SQL injection attacks. This can be done by notating particular methods as sources of input data from the Internet, and other methods as destinations (“sinks”) of query text for databases, and then evaluating whether any values flow from source methods to sink methods. Any paths through the code that allow unfiltered input to travel from the Internet to your database are security holes. The BDDBDDB analysis can tell you exactly what code paths allow that value flow. And it can do it purely as a static analysis!
They use Java as the basis of their system, for unstated but obvious reasons. (Namely, strong static typing makes their value flow analysis much more efficient, since it can eliminate many value flows that are not dynamically permitted; and lack of pointer arithmetic means their analysis can be trusted, with no need to limit to the “safe” subset of the language without pointer math.) Many people have been wooed away from Java by the flexibility of dynamic language programming, where you needn’t encumber yourself with types; but those of us who remain in the static typing camp are pretty much hooked on the tools. And BDDBDDB is the harbinger of a whole new level of Java-based analysis tools that are going to take Java programming and safety checking to the next level.
One thing not mentioned in any of the BDDBDDB papers I’ve yet seen is reflection. I can’t see how reflection can be handled in the BDDBDDB system, since tracking reflective method invocations requires not just tracking object references through variable sites, but actually tracking method values (and even specific method name strings!) through reflective invocations. Certainly their system as described doesn’t address reflection in any way. This also seems to me to limit the applicability of BDDBDDB to languages like Ruby, which make heavy use of code generation (in Rails, for instance). Though I’m sure that some enterprising Rubyists will undertake to prove me wrong.
In general it’s very challenging to consider the interactions of static analysis with metaprogramming frameworks — runtime bytecode generation is becoming more common in Java systems, but BDDBDDB currently runs on a source analysis, without access to the runtime-modified bytecode. Metaprogramming can blur the line between compile-time and runtime, and BDDBDDB is solidly on the compile-time side… at least for the moment. I say this not to denigrate the importance of this technology, but rather to indicate an even further level to which it can go.
I look forward to what’s next for these analyses. Good times in the static typing world!