header

Torsten Curdt’s weblog

Surfing In Mimizan

148793190 2941567f67 t Surfing In MimizanJust once I tried surfin in Sydney …but I fear I got infected by the surfin virus. Marcus wanted to give me some lessons in Biarritz but unfortunately it did not work out. But since I am really eager to learn surfin I’ll be spending the next two weeks in a surfcamp in Mimizan. Getting ready for a joined trip in autumn :-)

Two weeks out at the beach …without a computer. Sounds awesome!
(sometimes it’s good to prove you are not an addict)

Building Compilers

Compiler building always felt like an interesting subject to me. It’s one of the fields of computer science that has a very practical touch.
Defining your syntax and the lexical structure outside the codebase gives you an elegant separation and therefore makes it much easier to maintain. Plus it helps to keep your lexer and parser portable across different languages.

The lexical analyser (lexer) finds expressions according to the syntax of the parsed format and transforms the character input stream into a stream of tokens.

The parser parses this stream of tokens and acts as defined in the grammar. It might also create the AST (Abstract Syntax Tree) that reflects the full parsed document. The AST in turn can be traversed and evaluated.

[i] -> [lexer] --[ii]--> [parser] --[iii]--> [treeparser]

i) character stream
ii) token stream
iii) abstract syntax tree

From the C world well known compiler building packages include flex/bison or yacc. For java I found javacc, jflex and antlr. After some investigations I did go for antlr and even building parsers for complex expressions becomes a breeze once you understood how the machinery works. Admittingly this might take some time. Although there is quite some documentation available the antlr documentation did not cover all the nifty details I needed to know about.

First off: the parser generator creates classes so we need to specify the name and their package scope.

Inside the class section ‘options’ define implementation parameters. The common “k” defines the look-ahead distance. For the lexer it seems to be important to define the character range in order to support negation of character ranges.

For the parser we can here specify the creation of the AST.

The next thing to do is to divide the source document into simple token definitions.

Note the use of “~” to negate the character set. It’s basically saying: all but or “. By using the “protected” keyword “ESC” is not really a token but more a macro. It will not be passed into the stream of tokens. It’s just a convenient way of specifing the quotation of the special characters (like ”
“”). The “!” in “QLITERAL” does also prevent a copy into the output tree.

Usually whitespace and newline can completely be ignored for the stream of tokens. Well, in fact the newline character needs to trigger the “newline()” method. But other than that we don’t care about them.

Now comes the interesting part: the tokens that define e.g. an expression. The following example demonstrates a simple boolean expression that takes variables in a syntax like $(variable).

These tokens can used to build an expressions. The expression itself needs to be evaluated in particular order. Operator precedence is defined through a hierarchy of rules.

value > expression3 > expression2 > expression1

Evaluation then comes more or less out-of-the-box. Since the parser is meant to create an AST all that needs to be done is to walk the generated tree representation of the document and act on certain expression nodes. The necessary TreeParser class can also be generated from the grammar.

The literal function …well, just returns the literal value of the node. Of course quoted and simple literals behave the same here.

The value function also takes the variables into account. It looks them up from an environment object. (In this case it’s a simple HashMap)

Now the expression function itself calls the evaluation of the operators. And voila! That’s it!

Of course this is just a simple example. But there are quite a few grammars freely available. Java, C#, SQL, …and a lot more.

Great stuff!


header {
  package org.vafer.expression;
}

class MyLexer extends Lexer;
  ...
class MyParser extends Parser;
  ...


class MyLexer extends Lexer;
options {
    k=2;
    charVocabulary='u0000'..'uFFFE';
}


class MyParser extends Parser;
options {
    k=4;
    buildAST = true;
}


LITERAL:
  ('a'..'z'|'A'..'Z'|'0'..'9')+
  ;

QLITERAL:
  '"'! (ESC | ~('\'|'"'))* '"'!
  ;

protected
ESC:
  '\' ('\'|'t'|'n'|'r'|'"') ;


NEWLINE : ('
''
') => '
''
' //DOS
        | '
'                   //MAC
        | '
'                   //UNIX
        { newline(); }
        ;
WS      : (' '|'        ') { $setType(Token.SKIP); } ;


VAR       : '$'  ;
LPAREN    : '('  ;
RPAREN    : ')'  ;
OR        : "||" ;
AND       : "&&" ;
EQUAL     : "==" ;
ASSIGN    : "="  ;
NOT_EQUAL : "!=" ;
NOT       : '!'  ;
LT        : '<'  ;
LE        : "<=" ;
GE        : ">=" ;
GT        : '>'  ;


value
  : LITERAL
  | QLITERAL
  | VAR LPAREN LITERAL RPAREN
  ;

expression
  : expression1 (OR^ expression1)*
  ;

expression1
  : expression2 (AND^ expression2)*
  ;

expression2
  : expression3 ((EQUAL^ | NOT_EQUAL^ | LT^ | LE^ | GE^ | GT^) expression3)*
  ;

expression3
  : value
  | LPAREN! expression RPAREN!
  | NOT^ expression3
  ;


class MyTreeParser extends TreeParser;

literal returns [ String s ]
  {
     s = null;
  }
  : l:LITERAL
    {
      s = l.getText();
    }
  | ql:QLITERAL
    {
      s = ql.getText();
    }
  ;


value returns [ String s ]
  {
    String variable = null;
    s = null;
  }
  : l:LITERAL
    {
      s = l.getText();
    }
  | ql:QLITERAL
    {
      s = ql.getText();
    }
  | VAR! LPAREN! v:LITERAL RPAREN!
    {
      variable = v.getText();

      s = (String) environment.get(variable);

      if (s == null) {
        throw new RecognitionException("unrecognized variable " + variable);
      }
    }
  ;


expression returns [ boolean e ]
  {
    boolean a, b;
    String l, r;

    e = false;
  }
  : #(AND a=expression b=expression { e = a && b; } )
  | #(OR  a=expression b=expression { e = a || b; } )
  | #(EQUAL l=value r=value
       {
         e = l.equals(r);
       }
     )
  ;

Finding Bugs

148793158 7bfa6c9edd t Finding BugsSomehow I came across the findbugs project. I just gave it a try on the current cocoon codebase. Whether they are really all bugs or not… this tool came up with a lot of interesting findings and annotations. Wondering if it would be worth integrating it into the build system. The LGPL probably rules this out but anyway …at least it’s also available as eclipse plugin.

…another show off for BCEL

New iPod Released

148793059 a4b0bfca2d t New iPod Released
A new iPod version has been released. I am wondering if the buttons inside the scroll wheel are really such a good idea. Even on the “old” iPod I accidently hit the buttons from time to time. But the longer lasting battery (4h!) sounds like a big improvement.

Native Java Continuations

148793037 55c0e73233 t Native Java ContinuationsWe’d like to invite everyone who is interested to join our initiative on codehaus.org. We are aiming to write and submit a JSR for native java continuations support inside the JVM.
We are currently looking for people that have any kind of expertise in the continuations field or just like to support the effort by any means.

Please subscribe by sending a mail to

[email protected]

all post will have to go to

[email protected]

Please spread the word,
Thanks