Building Grammars (and Parsers) on the Fly

August 9, 2009 by johanneslink

I’ve always felt that code generation is mostly a means to optimize performance of stuff (e.g. DSL execution) that could as well be done during runtime. I might be – and probably are – wrong about that, but, c’mon, you don’t want to argue about feelings, do you. Keep this prejudice of mine in mind when asking yourself why I am doing this and writing about it in the first place.

Recently, on my ongoing quest for the wholly grail of acceptance testing, I examined and reconsidered the available AT frameworks such as FitNesse, Concordion, Cucumber and some of their siblings. At some point or other all of them have to solve the task of parsing text snippets from the application domain in order to extract some information which will then be used to execute the real code, aka the system under test. Consider for example the following step definition from Cucumber’s introductory tutorial:

Given /I have entered (.*) into the calculator/ do |n|
     calculator = new Calculator()
     calculator.push(n.to_i)
end

What this code does is  straightforward: It checks if the line in the test scenarios, ehm sorry, in the behaviour specification matches “I have entered…”, extracts the value of the number and tries to feed a calculator object with it. This allows for nearly natural language specifications of requirements, if – and only if – you stick exactly to the given pattern. Slight variations, like using “put in” instead of “entered into” will either break the spec execution or require you to write more and more complex (and unreadable) regular expressions. What if – so my line of thoughts – if we could use parsing technology to make text snippet interpretation in testing tools more robust and thus more liable to be used by people from the application domain?

No sooner said than done. Being mostly located in the JVM world of Java & Groovy, I figured that ANTLR would be a safe bet to use. Since buying (and sometimes reading) books has been one of my favourite coping strategies for a couple of years, I ordered “The definitive ANTLR Reference”, installed the Eclipse Antlr plugin and dived in. Well, to tell you the truth, it was a short – and cold – swim since I couldn’t find any really simple examples, had to use special editors, go through a code generation step, look for where the generated classed had been placed and so on and so on.

This meant it was time for thinking yet again: If I really wanted parsing technology to be an intergrated part of a testing framework this technology would have to be lighter, meaning at least: no code generation required, no extra tooling required except what the language requires anyway. As a consequence, the ANTLR Reference has joined the fate (the shelf that is) of a dozen other unread books, but I rememered a book I read in 2001: “Building Parsers with Java” by Steven John Metsker. Therein John introduces a framework to build left recursive parsers by programming only. This was exactly what I needed, despite the fact that the library is in  fin de siecle Java style, i.e. JDK 1.1 without the new collections and without generics and with enumerators instead of iterators… You get the picture. That’s why I tried to find out if Steve had maintained and modernized his library only to learn the sad truth that he had passed way in 2008. The book is still available and his website is still up; you can download the original version of his library and the examples there.

Since the topic fascinated me so much I eventually ended up modernizing Steve’s original library. Moreover, I added an additional layer to  facilitate the creation of grammars. The whole thing is called Parse4SJM in order to give continuous credit to Steve John Metsker. The library is basically an implementation of a recursive descent parser with backup; this type of parser has the comfortable property that you do not need to specify any look ahead since all possible interpretations of a grammar will be tried – if necessary. Of course, this feature comes with a price: potentially exponential execution time.

Let’s consider a simple example. You want to create a language to parse and evaluate simple math expression with addition, e.g. “3 + 5″ or “1.0 + 2  + 3.1″. Your final grammar should look like that:

expr = term ('+' term)*;
term = Num;

Being agile we get to the end result in small steps, of course. A simple step is to start with the term rule:

import sjm.grammar.Grammar
Grammar mathGrammar = new Grammar("simple math");
mathGrammar.defineRule("term = Num;", new IParserMatched() {
  public void apply(List<Object> matches,
        Stack<Object> stack) {
    stack.push(((Token) matches.get(0)).value());
  }
});

IParsingResult result = mathGrammar.parse("5.1");
assert result.getStack().pop() == new BigDecimal("5.1");

This is as easy as it gets – in Java. In Groovy you can get rid of a bit ceremony:

def mathGrammar = new Grammar("simple math")
mathGrammar.defineRule("term = Num;")
  { matches, stack ->
    stack.push(matches[0].value())
  }
def result = mathGrammar.parse("5.1")
assert result.stack.pop() == 5.1

What the code in the closure does is fairly simple: Whenever the rule matches, take the token from the stack and put its (mathematical) value on the stack instead. This kind of combined parsing and evaluation works quite often; if it doesn’t, Parse4SJM has other options as well. The full example looks like that:

def mathGrammar = new Grammar("simple math")
mathGrammar.discardAllConstants()
mathGrammar.defineRule("expr = term ('+' term)*;")
  { matches, stack ->
    def sum = matches.inject(0) {sum, each -> sum + each}
    stack.push(sum)
  }
mathGrammar.defineRule("term = Num;")
  { matches, stack ->
    stack.push(matches[0].value())
  }
result = mathGrammar.parse("5.1 + 2.0 + 1.8")
assert result.stack.pop() == 8.9

What’s my conclusion, then? Creating parsers can be easier than you think – and it doesn’t even require a code generation step. Maybe this approach can open the door for DSLs even a bit more. If you’re interested, just grab the jar or make your on clone from github. And then give me feedback. If you don’t the project will probably fall into oblivion, which should give you pangs of remorse.

A Unified Theory of Software Design, Architecture and Everything

March 6, 2009 by johanneslink

It’s probably an obvious symptom for my being on a downward spiral from software activist to theoretician, but I’ll do it anyway. I’m going to present you with a naive unified theory of software design, which I will call DRE: Dependencies Rule Everything. I consider it a real pity that the most common use of this abbreviation until today is Digital Rectal Examination. This has to change.

I have been wondering for a while now why software design principles, heuristics and trade-offs are being described in so many different terms: coupling, cohesion, SOLID, LSP, DRY, LoD, you name it. It has always seemed obvious to me that there is only one concept that covers all the others: dependencies.

Here is my definition:

Party A depends on party B means: If party B changes one or more of its observable aspects, there is some probability (greater than zero) that party A will have to adapt.

Let’s examine the building blocks of this definition and its implications more closely:

  • I use the word “party” instead of some more code-centric expression like “module” or “component”. Rationale: There are dependencies between pieces of my software (functions, classes, modules, subsystems, components etc) but there are also dependencies onto the outer world – the domain. For instance, within DRE I would consider the method who implements the logic for withdrawing money from an account as dependent on the domain rule how money should be withdrawn. Thus, DRE dependencies are a superset of static code dependencies.
  • Change, the probability of the change and the probability of the change to affect the dependent together define the strength of a dependency. If the dependee can change (eg. disappear) without forcing the dependent to adapt there is no dependency worth mentioning. In that sense loose coupling has the goal to lower the probability of a change affecting the dependent and static typing usually raises the probability, e.g. by enforcing the number of arguments to a function call.
  • I am only interested in the observables of a party like interface, timing behaviour, error conditions, availability, cost. To put it differently: I don’t care about unobservable implementation details or anything that can change without the dependent noticing.
  • A dependency between individual parts results in a dependency between the respective aggregates. The more and the stronger the dependencies between the parts, the stronger the dependency between the aggregates.
  • Dependencies are directed and that’s a good thing. On an atomic level there must not be bidirectional dependencies since otherwise change would result in an unstable state. On an aggregate level bidirectional dependencies arise much too easy if you are not careful.
  • For most cases the dependency relation is not transitive. In some programming languages it seems like they are, but that’s usually because public visibility of inner parts is the default and the Law of Demeter strikes. One notable exception to the rule is facilitated by C/C++ compilers, which enforce recompilation of statically dependent components and thus propagate change all the way up.
  • There exist at least three different major species of dependencies: compile-time dependencies, runtime dependencies and domain-rule dependencies. Statically typed and dynamically typed languages do different trade-offs between compile-time and runtime dependencies; type information is one kind of explicit dependency, which is obviously preferable over implicit – i.e. potentially unknown – dependencies. Speaking of explicitness: unit tests and acceptance tests are a different way of making dependencies explicit. Many of the techniques that promise  you a loosely coupled design do nothing else than going from explicit dependencies (a statically enforced method call) to implicit dependency by putting some obfuscating mechanism (e.g. XML serializing) in-between. This does not help you a thing, it just makes the system slower and more complex by introducing additional dependencies to libraries and structured documents. Unless, of course, if you really really really need it for cross-platform, cross-version, cross-process communication.

What is Good Design then?

A good design – or architecture for that matter – in DRE is one in which the overall of dependencies and their strengths is minimized. Since strength is defined as the probability that a change will hit you there is quite some fortune telling involved in finding “the best” of all designs. In other words: if you cannot agree on a probable future you cannot agree on a good design. That’s why Agile designers use a simple heuristic to foretell the future: Everything will change but we don’t know how. In DRE-terms that means: optimizing dependencies within the system and assuming that the world outside (the domain and its rules) has zero probability to change. In practice this leads to lots of design changes in early stages of a project until the typical domain changes have been incorporated in internal design elements and will only affect the outer-most “parties” (e.g. the adapter class, the configuration file, the business rules database).

What follows from that is: If you really know the future, the agile approach of evolutionary design is not the best. But be honest, who the hell does?

Reformulating OO Design Principles

For most heuristics of good OO design it’s quite easy to see why they work at least as well in the DRE universe. I leave it to the astute reader to translate the SOLID principles into DRE speak. One central rule is not so easy to reconcile with DRE, namely Don’t Repeat Yourself. I’ll try it anyway:

Consider a simple case of duplication:

def calculationOne() {
  ...
  def salesTax = amount * 0.19
  ...
}
def calculationTwo() {
  ...
  def salesTax = amount * 0.19
  ...
}

Given a probability of p that the sales tax rate will change during the system’s life cycle, you have two dependencies with a strength of p, so your total dependency number is 2p. Let’s now apply DRY in a straightforward matter:

def calculationOne() {
  ...
  def salesTax = calculateSalesTax(amount)
  ...
}
def calculationTwo() {
  ...
  def salesTax = calculateSalesTax(amount)
  ...
}
def calculateSalesTax(amount) {
  return amount * 0.19
}

Given a probability of r that we will have to change the interface of calculateSalesTax later on, the overall dependency number is now 2r + p; if we have chosen our abstraction wisely, this figure will be lower than 2p. Thus, the code with less duplication is better design in DRE theory. Quod erat demonstrandum.

So what?

There are two striking reasons why the unified theory appears attractive to me:

  1. It gives me the vocabulary to talk about the relation between different design approaches and thereby tackle questions like “What has loose coupling to do with the SOLID principles?” and “Does this decoupling technique really decouple anything?”.
  2. It might provide me with a quantifiable means to compare different designs. Any tool vendors, contact me for licensing the approach!

Now it’s up to you all to tell me why this is utter nonsense or perfectly useless or both. And I do appreciate congratulations for my future nobel prize on computer science. Shoot!

Update:

To make that clear, it was not my intention to suggest that simply adding up propabilities would make for a mathematically sound model for a “design quality number”; it was just the simplest thing to do and it felt intuitive enough.

ClasspathSuite 1.2.1

March 4, 2009 by johanneslink

In order to prove that gaussian distribution does not hold for the time between releases I’ve made ClasspathSuite version 1.2.1 available as of now.

Basically one new feature got added. No bugs fixed since 1.2.0 beta – because no one has found any.

ClasspathSuite 1.2.0 beta

March 2, 2009 by johanneslink

After more than a year of perfect stability I eventually adapted ClasspathSuite to JUnit 4.5’s way of building test suites.

I added a new feature, too.

I am Probably Not

February 7, 2009 by johanneslink

There is probably no Johannes Link