Archive for May 2010
So for various reasons I have been learning a lot about parsers lately. The #1 thing I have learned is that parsing is all about tradeoffs. The #2 thing I have learned is that the ultimate parsing environment does not yet exist.
Long-time readers of this blog (assuming I didn’t lose them all in the slightly bobbled switch to WordPress!) know that I’m very interested in growable languages and in extensible programming environments. Parsing is at the heart of programming language design, since it brings meaning to the text we programmers write. When writing extensible programming environments, it is important that your language be closed under composition, so extending your language’s grammar still has some meaning (even if it introduces more ambiguity). Parsers, in this sense, want to be powerful.
On the other end of the spectrum, there are parsers for relatively straightforward formats such as XML or CSS. These parsers are often running in resource-constrained environments (a cell phone browser) or are dealing with large volumes under high load (an enterprise messaging bus). For these parsers, high performance is absolutely critical, sometimes in both memory and CPU. Parsers, in this sense, want to be fast.
There is a very tricky tradeoff between power and speed throughout the parsing landscape. As just one example, I’ve been experimenting with ANTLR lately. It is, on many fronts, the best parsing toolkit available. Terence Parr has done a bang-up job over many years of addressing difficult technical issues while building a very usable tool. It is just purely wonderful that ANTLR’s generated code has line-by-line comments that come directly from the grammar that generated it. ANTLRWorks is a very enjoyable tool for language design. Finally, Terence’s books are immensely useful.
But even ANTLR has some surprises. I generated CSharp2 code from the example CSS grammar on the ANTLR site. Then I ran it and looked at its memory footprint with a profiler. I was not surprised to see that there were an immense number of IToken objects being created — ANTLR’s generated lexer (as of version 3, anyway) allocates a new IToken object for each token parsed. This alone is actually a really big performance problem for high-load parsers — allocation in the parser’s hot path is a big problem in a managed language.
There was a surprise, though, and not a good one. There were close to 5 megabytes of int16 arrays. Where did those come from? From the lexer DFAs, which are stored in static lexer fields and uncompressed when the lexer is loaded. The compressed char representation of the lexer DFAs is tiny — certainly not megabytes. So there is a shockingly high compression ratio here. I have to question whether that grammar is really complex enough to motivate a multi-megabyte lexer DFA. It seems that I am not the only person encountering surprisingly large lexers with ANTLR v3. It looks like this can be worked around with a hand-written lexer, but wouldn’t it be wonderful if ANTLR itself were able to tune its lexer generation to achieve the same effect (trading off some speed, perhaps, to reduce DFA explosion)?
So there is room for performance improvement even on the most mature tools in the industry. And on the other end of the spectrum, there are grammatical constructs that are impossible to express in purely context-free parsing frameworks like YACC. I have been corresponding a bit with Yitzhak Mandelbaum at AT&T, subsequent to his excellent paper (with Trevor Jim and David Walker) on data-dependent parsing. This has a lot of similarities with ANTLR’s semantic predicates, but seems to be perhaps more powerful and general (being embedded in a framework that is inherently scannerless). However, being based on Earley parsing, it is hard for this method to gain competitive performance. The method does support embedded “black-box” parsers, however.
The holy grail might be an environment with a powerful, compositional, potentially ambiguous top-level user language, and an implementation that could aggressively analyze sublanguages of the grammar and choose optimized implementations. Grammar and language design is a balancing act between ambiguity and generality on the one hand, and efficiency on the other. Different implementations have different tradeoffs — implementations supporting unambiguous grammars only will inevitably be faster than those that handle ambiguous parses; LR and LL algorithms both have their own idiosyncrasies with respect to the languages they can support; and so on. So the ideal language environment would itself be extensible with different parsing algorithms that could themselves be composed. That way, if all you have is a regular expression, you get tightly tuned inner loops and no backtracking; but if you have a highly ambiguous grammar, you get a backtracking and/or chart-parsing implementation, perhaps with lots of data dependencies. The toolset should let you apply different algorithms to different parts of your language, and should support you in refactoring (for example, the left-corner transform) your grammar.
Incidentally, this last point is one reason why Parsing Expression Grammars are potentially problematic. PEGs are a beautiful embedding of a declarative grammar in a functional programming language. But in some sense most PEG implementations are too embedded — the only representation of the grammar’s structure is in the code itself, and all you can do with your PEG is to run it, not refactor or otherwise transform it. (And let’s leave aside the issue of ordered choice potentially breaking your compositional intention.) There is at least one endeavor to make PEG representations transformable, which is extremely interesting and worth following.
There is much more that could be said here — error recovery, for instance, is extremely tricky in many cases, and has tight interactions with the algorithm you are using. Yet you still would like a declarative way to express your grammar’s error behavior, if possible.
And what about low-level performance? Which is faster: a virtual object dispatch in your parser’s inner loop, or a colossal switch statement? If you are doing code generation, the only answer is probably “generate both versions and then measure it on your specific grammar!” There are so many tradeoffs here that drawing general conclusions may be nigh impossible.
The basic point I want to make is that parsing is very far from being a properly solved problem. Tools like ANTLR point the way, but there is more still to be done. And this is the glorious possibility: building a parsing environment that is itself composable, and that is fully transparent to the developer at all times, letting the developer choose their implementation tradeoffs at all levels.
As Terence knows better than anyone, this alone could consume an entire career. Or multiple such careers. Still, I could see myself having a lot of fun building an F# implementation of this kind of transparent, composable parsing toolkit. What a useful thing it would be!
Edit: And you know I will be following Terence’s work on ANTLR 4. Also, this presentation of his on parser implementation is a great look at the state of the art in managed-language parser code generation.
Edit #2: I must also mention that ANTLR has soundly convinced me of the benefit of LL style: LL style is top-down recursive, which is a very natural way for most programmers to think. Table-driven parsers, without good and scalable visualization, can be quite hard to debug and extend, but ANTLR’s generated code is a model of clarity.