BuildParse: A C++ Library for Parser Generation

Copyright (c) 2001 Michael A. MacDonald Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no invariant sections, and with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".


Table of Contents

Using BuildParse
Error Handling
Class Reference

BuildParse is a free C++ library that combines a parser and a parser generator. Unlike other parser generators (for example yacc and bison) both the grammar and the generated parser are objects in your program and can be manipulated at run-time. It is written in Standard C++ in the modern idiom, making use of the STL, namespaces and exceptions. Rule actions are full-fledged C++ objects. Grammars can be expressed entirely in C++ code clearly and fairly concisely. BuildParse may be a good fit for C++ programs that require a parser; you might also be interested in seeing a parser generator with an object-oriented design.

BuildParse generates a state-machine (bottom-up) parser that operates in much the same way that a yacc or bison parser would. The chief differences between BuildParse and yacc/bison are:

  1. yacc/bison read a grammar from a source file and generate C-code for a parsing function in an output file. BuildParse takes a grammar defined by a set of objects within a C++ program and returns a parser that is another object in the program. This is the most important difference, since it lets you use a parser in your program that you don't know how to build until runtime; however, not very many applications will really have need for this. (Of course if pressed, you could do the same thing with yacc/bison; you'd just have to compile the generated parser into a shared library and dynamically link to it; not for the faint of heart.) I like the idea of doing everything in C++ instead of using a special-purpose grammar specification language, so I use BuildParse in situations where this feature is not actually required.

  2. The parser generated by BuildParse does not use callbacks to get the next token; similarly the example lexer does not use callbacks to get more text. BuildParse processes as much as it can with the input it has, then returns a code saying either "I am finished" or "I need more input." This is particularly useful in an interactive program. You can get around the use of callbacks in bison/yacc, but it does require some contortions.

  3. The parser generated by BuildParse is likely to be quite a bit slower than an equivalent one based on bison/yacc. I haven't measured this, but I know it's true. bison/yacc stores parse tables as arrays of integers, while I use a Lisp-like approach of lists of atoms. This made the internal workings of the parser much clearer to me, but if you have a batch program parsing lots and lots of text, BuildParse probably isn't a very good choice.

Whether the library is a good choice for your project depends on how important these considerations are to you. I have used it to parse a couple of special purpose languages as well as VBScript.

Using BuildParse

This discussion assumes you know how to construct a LALR(1) grammar and why you might wish to do so. Excellent information about parsers in general can be found in the bison manual. Given that, you can learn to use BuildParse to process a grammar.

To use BuildParse you first define a number of ParseSymbol objects. These come in two varieties: TerminalSymbol objects that define tokens that can be returned by the lexical scanner, and NonTerminalSymbol objects that define sequences that can be recognized by the parser. [1] Then you create a BuildParser object. You create and add Rule objects to the BuildParser. Rule objects define sequences of ParseSymbols that the parser can recognize. Rule actions may have a RuleAction associated with them. The RuleAction object has a method that the parser executes when the parser recognizes the sequence defined by the rule. You can also assign precedences to Rules. BuildParser will use the precedence to resolve ambiguities in the rules when it creates the parser. You set a goal for the parser with the setGoal method. Finally, you use the build method to create a Parser object that implements the parser.

To use the Parser you call the parse method repeatedly with TerminalSymbol objects returned by the lexical scanner of your choice (BuildParse includes an example lexical scanner.) The parse method will return a code indicating:

  • It needs more input to continue.
  • It has processed all input and successfully recognized the goal ParseSymbol.
  • It has encountered a sequence of symbols that the grammar can not account for.

Error Handling

When the parser can not recognize its input according to the grammar from which it was built, it enters a special error-handling state. In this state the parser acts as though it has been asked to recognize the special TerminalSymbol, _error_. If the current state of the parser can not recognize the error symbol, previously recognized tokens are unrecognized and merged with the _error_ token until some rule can match. Only if no rule can be found that will match with the _error_ token will the parser give up.

Even when the parser recognized the error in some rule, it does not immediately return to a normal state. If another error occurs within a few tokens of the point that the error situation is recognized, the error handling continues from the point of the original error.



[1] Actually, you don't need to define symbols in advance; you can create them as you use them in rules. However, since symbols are typically used many times in a grammar definition, it is convenient to define them in advance.