« September 2001 | Main | February 2002 »
January 02, 2002
Accent Vol 2 No 11
Every couple of years or so I bump onto an engineering problem most suitably solved by a simple lexical analyzer and parser. Typical examples being a configuration file processor, or a URL processor. Each time, I dust off my old lex and yacc manuals and try to remember how a DFSA differs from an NDFSA and how a shift-reduce error differs from a reduce-reduce error. Invariably I give up and revert to the tried and true method of just hacking up some code myself. Surely there must be a better way?
There is, but first a brief introduction to some language processing terms.
A lexical analyzer, also known as a lexer or scanner, simply transforms an input stream of characters into an output stream of tokens. For example, a lexer might be programmed by a set of rules to transform a stream such as (10+20) into the token stream LP N PLUS N RP. Lexers specialize in grouping sequences of characters into larger constructs named tokens.
A parser takes a stream of tokens from a lexer and groups them into higher order constructions called productions. Following on from our example above, a parser might be programmed by a set of rules to recognize the token stream LP N PLUS N RP as the production named expression.
ANTLR, or ANother Language Recognizer, is a combined lexer and parser generator, with a number of features that make it much simpler and nicer to use than the traditional combination of lex and yacc, or the GNU alternative of flex and bison.
That’s enough terms for now, lets get down to some code. We’ll work through the canonical calculator example. What’s the result of the expression (1+1)? Well, first we’ll need a lexer.
// calc.g
options
{
language="Cpp";
}
class CalcLexer extends Lexer;
public
INTEGER: (DIGIT)+;
LPAREN: '(';
RPAREN: ')';
PLUS: '+';
protected
DIGIT: '0'..'9';
The definition code for the lexer is comprised of four sections; options, class derivation, public tokens and protected tokens. The options section contains parameters for the ANTLR tool. In this case the ‘language’ statement selects the generation of C++ code. ANTLR can also generate Java and Sather code, but for this article we’ll use C++. The class statement declares that the subsequent sections are defining a lexer, and allows the programmer to name their scanner. The next two sections provide the token definitions. The case of the token names is significant. The tool knows to treat identifiers that begin with an uppercase letter as tokens. In our example lexer a DIGIT is defined to be any character in the range ‘0’ through to ‘9’, and an INTEGER is a sequence of DIGITS. The definitions use the standard regular expression syntax. Here the ‘+’ in (DIGIT)+ means a sequence of one or more. Other important regular expression operators to keep in mind are (x)? which means x is optional, x | y which means x or y, and (x)* which means a sequence of zero or many x’s.
We need a parser to consume the token stream and recognize productions so that we can evaluate our expressions. The rules used to describe the parser are called production rules and collectively describe a grammar. The grammar for our calculator, in ANTLR notation, looks like this.
class CalcParser extends Parser;
options
{
buildAST=true;
}
expr: LPAREN INTEGER (PLUS INTEGER)* RPAREN;
These three sections of code can appear either before or after the lexer definition. The first section again declares the start of the rules that describe the parser. The options section can contain many statements, but for our simple demonstration we need only to turn on the automatic construction of an abstract syntax tree (AST). The AST is a tree of nodes that is built to represent the productions as they are recognized within the token stream. And the third section provides a definition of the expr production rule. This rule states that an expression begins with a left parenthesis followed by an integer and then followed by zero or more integer additions, followed by a closing right parenthesis.
The lexer and parser source code is generated by executing the following command line. (You’ll have to install ANTLR, the Java SDK, and have set up your class path appropriately.)
> java antlr.Tool calc.g
The files resulting from this execution will be the lexer code: CalcLexer.cpp, CalcLexer.hpp, CalcLexerTokenTypes.hpp; and the parser code: CalcParser.cpp, CalcParser.hpp, CalcParserTokenTypes.hpp. One of the nice things about ANTLR, over lex, flex, yacc, and bision, is that the code it generates is implemented as nested case statements, rather than tables of numbers. This makes the code much simpler to understand when implementing a complex grammar or whilst debugging.
Now we need some main() code to drive our lexer and parser.
//main.cpp
#include
#include "CalcLexer.hpp"
#include "CalcParser.hpp"
using namespace std;
using namespace antlr;
void main()
{
try {
CalcLexer lexer(cin);
CalcParser parser(lexer);
parser.expr();
RefAST t = parser.getAST();
cout << t->toStringList() << endl;
} catch(exception& e) {
cerr << "exception: " << e.what() << endl;
}
}
A stream of characters it taken from the standard input stream, passed into the lexer, from which a token stream is passed into the parser, from which productions are recognized, and an abstract syntax tree generated.
Compile main.cpp and the tool generated code, and link with the ANTLR C++ library. (You’ll need to compile this library yourself, but instructions are provided.)
Here’s a sample session showing a sample input string, and the resultant AST.
[c:\temp\test]test
(1+1)
( 1 + 1 )
[c:\temp\test]test
(1+2+3)
( 1 + 2 + 3 )
[c:\temp\test] test
(1*1)
exception: unexpected char: *
The next stage is clearly for us to actually calculate the resultant value of the expression. A common approach is to place actions with the production rules so that as the production is recognized the result of the expression in evaluated. This solution is cumbersome for large languages as the recognition and evaluation code is intermingled. We’d prefer to separate these issues, just as we separated the concerns of lexical analysis and parsing. Typically, for a complex language, the actions that adorn the grammar do no more than build the abstract syntax tree. But, ANTLR will do this for us.
The default AST being built isn’t quite optimal. In the output above the tree includes nodes that represent the parenthesis. These are just there for syntactic sugar. Also, ,the root node in the tree is also the left parenthesis. We’d much rather it was the plus, as this is the operation we’re trying to perform. The expression production should be changed thus:
expr: LPAREN! INTEGER (PLUS^ INTEGER)* RPAREN!;
Note the addition of the ‘!’ and ‘^’ notations. The ‘!’ indicates that for AST generation the token should be ignored, and the ‘^’ after the PLUS token signifies that the PLUS is to be the root of the tree. Our interactive session now produces this:
[c:\temp\test]test
(1)
1
[c:\temp\test]test
(1+1)
( + 1 1 )
[c:\temp\test]test
(1+2+3)
( + ( + 1 2 ) 3 )
The AST is flattened for printing. Each pair of parenthesis marks the children of the preceding node.
To actually calculate the result of the expression we must build a walker to walk the AST tree implementing the evaluation actions. Again, ANTLR will write this code for us. Here’s the specification of our walker.
class CalcTreeWalker extends TreeParser;
expr returns [float r]
{
float a,b;
r=0;
}
: #(PLUS a=expr b=expr) {r = a+b;}
| i:INTEGER {r = atof(i->getText().c_str());}
;
This introduces quite a bit of new syntax, but the underlying structure is still the same. The walker takes an abstract syntax tree and recognizes productions within that tree. The rule above describes a production called expr, which has a return value of type float. The statements between braces, {…}, are regular C++ code, which I won’t drill into. The #(PLUS …) construction describes an AST node for the walker to match with. A PLUS node is expected to have an expression on the left and an expression on the right. The value returned from the left and right hand sides is assigned to the temporary variables a and b.
Our main code must be enhanced to set the walker off down the AST.
CalcLexer lexer(cin);
CalcParser parser(lexer);
parser.expr();
RefAST t = parser.getAST();
CalcTreeWalker walker;
float r = walker.expr(t);
cout << t->toStringList() << " = " << r << endl;
And, our test session will now produce the answer to our troublesome (1+1) question.
[c:\temp\test]test
(1)
1 = 1
[c:\temp\test]test
(1+1)
( + 1 1 ) = 2
[c:\temp\test]test
(1+2+3)
( + ( + 1 2 ) 3 ) = 6
This is just a taste of some of the powerful features provided by ANTLR. A heartily recommended tool for both pleasure and profit.
John Merrells
merrells@acm.org
www.antlr.org - The home ANTLR.
Posted by John at 10:39 AM | Comments (1)
Overload 46 Editorial
After four long years in the wilderness of C and Java I have finally returned to C++. It’s quite a relief, but also quite humbling. It’s amazing what you forget, and even worse forget that you’ve forgotten.
There was an article in Communications of the ACM a couple of years ago about ignorance. In typical ACM style they presented a model comprised of the five orders of ignorance. I’d hoped I was at the first order of ignorance: I knew that I didn’t know something. But, it turns out that I’m a second order ignoramus. I don’t know what I don’t know. The forth order of ignorance, meta-ignorance, is the ignorance of the five levels of ignorance. I suppose that you, dear reader, may not have been aware of the five orders of ignorance, so perhaps I just bumped you up an order. [1]
I stopped being a fulltime C++ programmer before the Standard was even completed. So, in an attempt to reeducate myself rapidly I have turned to my bulging bookshelves of unread and part-read books. Placing my C++ books in order of purchase I skimmed through many until I encountered Scott Meyer’s Effective C++. What a superb book that is. On my first reading many years ago it all seemed pretty obvious stuff. After four years of brain rot I now hang on every word. For example, I’d remembered that you can’t overload the return type of a method, but I’d forgotten that you can if they differ only by const-ness. In my code I’d written methods for getFoo and getFooAsConst. I now have lots of improvements to make to my unimpressive code.
Emerging from the wilderness I expected all to be lush and right and good in C++-land. But I immediately bumped into a bunch of quite depressing problems.
iostream and iostream.h – You’ve got to chose between them, which is fine. But sometimes some third party code will make the choice for you, and they may make the right choice for them (portability) and the wrong choice for you (conformance).
Templates – The compiler I’m using just doesn’t have a very happy time with some of the more sophisticated aspects of templates. Is it partial specialisation that doesn’t the problem? Sure they’re hard to understand and implement, but it’s been four years.
Ancient STL – The STL that ships with my compiler has copyright notices that date from 1996. That’s partly due to an old legal argument that had nothing to do with the compiler vendor, but surely they could have just got one from somewhere else, or written their own. The compiler just can’t handle a more recent implementation of the STL because it can’t cope with a more ‘modern’ implementation.
Of course I’m writing about the Microsoft compiler here, which has always been notorious for its lackadaisical attitude to language conformance. The good news is that version 7.1 will fix most of the serious conformance problems and should be shipping within a year. Coupled with this, Stan Lippman has just joined the team and has declared that one of his top goals is to bring that compiler as reasonably close to the Standard as possible. [2]
In my pursuit of a conformant development platform I’ve been exploring the domains of independent and open source software. From comments I’ve read recently the Comeau C++ compiler appears to be the most compliant compiler available, and is only deficient in that it does not quite yet support the slightly esoteric ‘export’ keyword. It’s also priced at an amazingly reasonable $50. [3]
I have also been playing with STLPort and the Boost libraries. Both are very impressive, and I’m hoping that Overload will soon be providing more coverage of these libraries. [4, 5]
Editorial Team
After four years of reading draft articles, helping authors, and the editing of Overloads 31 and 32 Einar Nilsen-Nygaard has decided to resign from the Overload editorial team. His contribution has been, and is greatly appreciated, and we wish him well with his future endeavours.
The advantage of having an editorial team work on Overload is that the workload of each issue can be spread over a number of shoulders. With work and life schedules being so hectic each of us are often willing to make an effort, but are only able to do so in occasional spurts. Having a team smoothes out these bumps in the road, by providing an avenue for a voluntary contribution without the commitment to any particular level of contribution or schedule. With that introduction I’d like to ask someone in the Overload readership to step forward and join the team. Please email us if you’d like to contribute in this way to your journal.
Overload Authors - 2001
As this issue of Overload will be the last published in 2001 I want to publicly thank all those authors who over the year have contributed their time by writing engaging articles. Thank you - Thaddaeus Frogley, Francis Glassborow, Pete Goodliffe, Michael Gradman, Alan Griffiths, Kevlin Henney, Jim Hyslop, Corwin Joy, Allan Kelly, Steve Love, Dave Midgly, Antony Pinchbeck, Mark Radford, Oliver Schoenborn, Julian Smith, Detlef Vollmann, Josh Walker, Rob Weltman, Anthony Williams, and Simon Wilson.
References
[1] The five orders of ignorance, Phillip G. Armour, Communnications of the ACM, October 2000, Volume 43, Issue 10.
[2] http://www.microsoft.com/PressPass/press/2001/Oct01/10-19lippmanpr.asp - Stan Lippman joins Microsoft.
[3] http://www.comeaucomputing.com/ - Comeau C++ Compiler.
[4] http://www.stlport.org/ - STLPort Standard Library Project.
[5] http://boost.org/ - Free peer-reviewed portable C++ source libraries.
Posted by John at 10:34 AM | Comments (1)