header

Torsten Curdt’s weblog

Building Compilers

Compiler building always felt like an interesting subject to me. It’s one of the fields of computer science that has a very practical touch.
Defining your syntax and the lexical structure outside the codebase gives you an elegant separation and therefore makes it much easier to maintain. Plus it helps to keep your lexer and parser portable across different languages.

The lexical analyser (lexer) finds expressions according to the syntax of the parsed format and transforms the character input stream into a stream of tokens.

The parser parses this stream of tokens and acts as defined in the grammar. It might also create the AST (Abstract Syntax Tree) that reflects the full parsed document. The AST in turn can be traversed and evaluated.

[i] -> [lexer] --[ii]--> [parser] --[iii]--> [treeparser]

i) character stream
ii) token stream
iii) abstract syntax tree

From the C world well known compiler building packages include flex/bison or yacc. For java I found javacc, jflex and antlr. After some investigations I did go for antlr and even building parsers for complex expressions becomes a breeze once you understood how the machinery works. Admittingly this might take some time. Although there is quite some documentation available the antlr documentation did not cover all the nifty details I needed to know about.

First off: the parser generator creates classes so we need to specify the name and their package scope.

Inside the class section ‘options’ define implementation parameters. The common “k” defines the look-ahead distance. For the lexer it seems to be important to define the character range in order to support negation of character ranges.

For the parser we can here specify the creation of the AST.

The next thing to do is to divide the source document into simple token definitions.

Note the use of “~” to negate the character set. It’s basically saying: all but or “. By using the “protected” keyword “ESC” is not really a token but more a macro. It will not be passed into the stream of tokens. It’s just a convenient way of specifing the quotation of the special characters (like ”
“”). The “!” in “QLITERAL” does also prevent a copy into the output tree.

Usually whitespace and newline can completely be ignored for the stream of tokens. Well, in fact the newline character needs to trigger the “newline()” method. But other than that we don’t care about them.

Now comes the interesting part: the tokens that define e.g. an expression. The following example demonstrates a simple boolean expression that takes variables in a syntax like $(variable).

These tokens can used to build an expressions. The expression itself needs to be evaluated in particular order. Operator precedence is defined through a hierarchy of rules.

value > expression3 > expression2 > expression1

Evaluation then comes more or less out-of-the-box. Since the parser is meant to create an AST all that needs to be done is to walk the generated tree representation of the document and act on certain expression nodes. The necessary TreeParser class can also be generated from the grammar.

The literal function …well, just returns the literal value of the node. Of course quoted and simple literals behave the same here.

The value function also takes the variables into account. It looks them up from an environment object. (In this case it’s a simple HashMap)

Now the expression function itself calls the evaluation of the operators. And voila! That’s it!

Of course this is just a simple example. But there are quite a few grammars freely available. Java, C#, SQL, …and a lot more.

Great stuff!


header {
  package org.vafer.expression;
}

class MyLexer extends Lexer;
  ...
class MyParser extends Parser;
  ...


class MyLexer extends Lexer;
options {
    k=2;
    charVocabulary='u0000'..'uFFFE';
}


class MyParser extends Parser;
options {
    k=4;
    buildAST = true;
}


LITERAL:
  ('a'..'z'|'A'..'Z'|'0'..'9')+
  ;

QLITERAL:
  '"'! (ESC | ~('\'|'"'))* '"'!
  ;

protected
ESC:
  '\' ('\'|'t'|'n'|'r'|'"') ;


NEWLINE : ('
''
') => '
''
' //DOS
        | '
'                   //MAC
        | '
'                   //UNIX
        { newline(); }
        ;
WS      : (' '|'        ') { $setType(Token.SKIP); } ;


VAR       : '$'  ;
LPAREN    : '('  ;
RPAREN    : ')'  ;
OR        : "||" ;
AND       : "&&" ;
EQUAL     : "==" ;
ASSIGN    : "="  ;
NOT_EQUAL : "!=" ;
NOT       : '!'  ;
LT        : '<'  ;
LE        : "<=" ;
GE        : ">=" ;
GT        : '>'  ;


value
  : LITERAL
  | QLITERAL
  | VAR LPAREN LITERAL RPAREN
  ;

expression
  : expression1 (OR^ expression1)*
  ;

expression1
  : expression2 (AND^ expression2)*
  ;

expression2
  : expression3 ((EQUAL^ | NOT_EQUAL^ | LT^ | LE^ | GE^ | GT^) expression3)*
  ;

expression3
  : value
  | LPAREN! expression RPAREN!
  | NOT^ expression3
  ;


class MyTreeParser extends TreeParser;

literal returns [ String s ]
  {
     s = null;
  }
  : l:LITERAL
    {
      s = l.getText();
    }
  | ql:QLITERAL
    {
      s = ql.getText();
    }
  ;


value returns [ String s ]
  {
    String variable = null;
    s = null;
  }
  : l:LITERAL
    {
      s = l.getText();
    }
  | ql:QLITERAL
    {
      s = ql.getText();
    }
  | VAR! LPAREN! v:LITERAL RPAREN!
    {
      variable = v.getText();

      s = (String) environment.get(variable);

      if (s == null) {
        throw new RecognitionException("unrecognized variable " + variable);
      }
    }
  ;


expression returns [ boolean e ]
  {
    boolean a, b;
    String l, r;

    e = false;
  }
  : #(AND a=expression b=expression { e = a && b; } )
  | #(OR  a=expression b=expression { e = a || b; } )
  | #(EQUAL l=value r=value
       {
         e = l.equals(r);
       }
     )
  ;

  • roula
    i want to make a very small language with special keywords i want to detect it (its just an idea)

    is there any books or tutorials or web sites on that help me in building a compiler to that language
blog comments powered by Disqus