I’ve always felt that code generation is mostly a means to optimize performance of stuff (e.g. DSL execution) that could as well be done during runtime. I might be – and probably are – wrong about that, but, c’mon, you don’t want to argue about feelings, do you. Keep this prejudice of mine in mind when asking yourself why I am doing this and writing about it in the first place.
Recently, on my ongoing quest for the wholly grail of acceptance testing, I examined and reconsidered the available AT frameworks such as FitNesse, Concordion, Cucumber and some of their siblings. At some point or other all of them have to solve the task of parsing text snippets from the application domain in order to extract some information which will then be used to execute the real code, aka the system under test. Consider for example the following step definition from Cucumber’s introductory tutorial:
Given /I have entered (.*) into the calculator/ do |n|
calculator = new Calculator()
calculator.push(n.to_i)
end
What this code does is straightforward: It checks if the line in the test scenarios, ehm sorry, in the behaviour specification matches “I have entered…”, extracts the value of the number and tries to feed a calculator object with it. This allows for nearly natural language specifications of requirements, if – and only if – you stick exactly to the given pattern. Slight variations, like using “put in” instead of “entered into” will either break the spec execution or require you to write more and more complex (and unreadable) regular expressions. What if – so my line of thoughts – if we could use parsing technology to make text snippet interpretation in testing tools more robust and thus more liable to be used by people from the application domain?
No sooner said than done. Being mostly located in the JVM world of Java & Groovy, I figured that ANTLR would be a safe bet to use. Since buying (and sometimes reading) books has been one of my favourite coping strategies for a couple of years, I ordered “The definitive ANTLR Reference”, installed the Eclipse Antlr plugin and dived in. Well, to tell you the truth, it was a short – and cold – swim since I couldn’t find any really simple examples, had to use special editors, go through a code generation step, look for where the generated classed had been placed and so on and so on.
This meant it was time for thinking yet again: If I really wanted parsing technology to be an intergrated part of a testing framework this technology would have to be lighter, meaning at least: no code generation required, no extra tooling required except what the language requires anyway. As a consequence, the ANTLR Reference has joined the fate (the shelf that is) of a dozen other unread books, but I rememered a book I read in 2001: “Building Parsers with Java” by Steven John Metsker. Therein John introduces a framework to build left recursive parsers by programming only. This was exactly what I needed, despite the fact that the library is in fin de siecle Java style, i.e. JDK 1.1 without the new collections and without generics and with enumerators instead of iterators… You get the picture. That’s why I tried to find out if Steve had maintained and modernized his library only to learn the sad truth that he had passed way in 2008. The book is still available and his website is still up; you can download the original version of his library and the examples there.
Since the topic fascinated me so much I eventually ended up modernizing Steve’s original library. Moreover, I added an additional layer to facilitate the creation of grammars. The whole thing is called Parse4SJM in order to give continuous credit to Steve John Metsker. The library is basically an implementation of a recursive descent parser with backup; this type of parser has the comfortable property that you do not need to specify any look ahead since all possible interpretations of a grammar will be tried – if necessary. Of course, this feature comes with a price: potentially exponential execution time.
Let’s consider a simple example. You want to create a language to parse and evaluate simple math expression with addition, e.g. “3 + 5″ or “1.0 + 2 + 3.1″. Your final grammar should look like that:
expr = term ('+' term)*;
term = Num;
Being agile we get to the end result in small steps, of course. A simple step is to start with the term rule:
import sjm.grammar.Grammar
Grammar mathGrammar = new Grammar("simple math");
mathGrammar.defineRule("term = Num;", new IParserMatched() {
public void apply(List<Object> matches,
Stack<Object> stack) {
stack.push(((Token) matches.get(0)).value());
}
});
IParsingResult result = mathGrammar.parse("5.1");
assert result.getStack().pop() == new BigDecimal("5.1");
This is as easy as it gets – in Java. In Groovy you can get rid of a bit ceremony:
def mathGrammar = new Grammar("simple math")
mathGrammar.defineRule("term = Num;")
{ matches, stack ->
stack.push(matches[0].value())
}
def result = mathGrammar.parse("5.1")
assert result.stack.pop() == 5.1
What the code in the closure does is fairly simple: Whenever the rule matches, take the token from the stack and put its (mathematical) value on the stack instead. This kind of combined parsing and evaluation works quite often; if it doesn’t, Parse4SJM has other options as well. The full example looks like that:
def mathGrammar = new Grammar("simple math")
mathGrammar.discardAllConstants()
mathGrammar.defineRule("expr = term ('+' term)*;")
{ matches, stack ->
def sum = matches.inject(0) {sum, each -> sum + each}
stack.push(sum)
}
mathGrammar.defineRule("term = Num;")
{ matches, stack ->
stack.push(matches[0].value())
}
result = mathGrammar.parse("5.1 + 2.0 + 1.8")
assert result.stack.pop() == 8.9
What’s my conclusion, then? Creating parsers can be easier than you think – and it doesn’t even require a code generation step. Maybe this approach can open the door for DSLs even a bit more. If you’re interested, just grab the jar or make your on clone from github. And then give me feedback. If you don’t the project will probably fall into oblivion, which should give you pangs of remorse.
August 13, 2009 at 19:56 |
This parser might be powerful. However, it’s hart for me to see why, for this purpose, an external DSL would be better than an internal one.
The Cucumber example of yours is an internal DSL. You want “[s]light variations, like using “put in” instead of “entered into…”. But I can’t see how you’ve done that with Pars4SJM. However, you solved another example with Parse4SJM: adding numbers.
Well, even the Groovy Parse4SJM solution here is still much too heavy weight. That’s because you chose an external DSL. An internal DSL would do much better here, e.g. a Groovy one. I think a internal Groovy DSL could deal with the adding numbers thing in one line of code, maybe two, but that’d be all.
(Might want to have a look at this.
Blog: http://berndschiffer.blogspot.com/2007/02/kilometerfresser.html
Code: http://www.bytemycode.com/snippets/snippet/593/
Skip the GUI and the formatting parts and you’ll end up with a one or two liner.)
There could be limits in the use of an internal DSL here, but I don’t see why you could not fix your main problem with an internal DSL, too. The benefit would be a light weight kind of tool: your programming language of choice.
So, what power are you looking for in external DSLs that you could not find in an internal DSL?
August 13, 2009 at 20:02 |
My example has been badly chosen as for the WHY of using an on-the-fly parser. Sometimes internal DSL restrict the syntax too much to be usable for non-tech analysts, testers or customers. Imagine you want to use (parts of) prose texts to be used as executable scenarios. Maybe I’ll blog such an example later… this year
.
August 13, 2009 at 23:41
I often hear the point you mentioned, that internal DSLs would restricts the syntax and that non-techs won’t be able to handle them.
I agree with that. So I came up with a mix of both. My example is not a DSL. “10 km” is not a internal Groovy DSL. But “10.km” (see the dot) is. All I have to do to convert the external “10 km”-DSL into an internal “10.km” one is to delete the dot. I’ve done that by a simple regular expression.
So, my point is: Isn’t it totally enough to build an internal DSL and to do some cosmetic surgery with simple tools like regular expressions? Do you really need extra tools for that?
I’m eager to see your example. I’ve set an reminder on that to December 31th, 2009