Unix has a long-standing tradition of hosting tools that are specifically designed to generate code for various special purposes. The venerable monuments of this tradition, which go back to Version 7 and earlier days, and were actually used to write the original Portable C Compiler back in the 1970s, are lex(1) and yacc(1). Their modern, upward-compatible successors are flex(1) and bison(1), part of the GNU toolkit and still heavily used today. These programs have set an example that is carried forward in projects like GNOME's Glade interface builder.
yacc and lex are tools for generating language parsers. We observed in Chapter 8 that your first minilanguage is all too likely to be an accident rather than a design. That accident is likely to have a hand-coded parser that costs you far too much maintenance and debugging time — especially if you have not realized it is a parser, and have thus failed to properly separate it from the remainder of your application code. Parser generators are tools for doing better than an accidental, ad-hoc implementation; they don't just let you express your grammar specification at a higher level, they also wall off all the parser's implementation complexity from the rest of your code.
If you reach a point where you are planning to implement a minilanguage from scratch, rather than by extending or embedding an existing scripting language or parsing XML, yacc and lex will probably be your most important tools after your C compiler.
lex was written to automate the task of generating lexical analyzers (tokenizers) for compilers. It turned out to have a surprisingly wide range of uses for other kinds of pattern recognition, and has since been described as “the Swiss-army knife of Unix programming”.[130]
If you are attacking any kind of pattern-recognition or state-machine problem in which all the possible input stimuli will fit in a byte, lex may enable you to generate code that will be more efficient and reliable than a hand-crafted state machine.
Most importantly, the lex specification minilanguage is much higher-level and more compact than equivalent handcrafted C. Modules are available to use flex, the open-source version, with Perl (find them with a Web search for “lex perl”), and a work-alike implementation is part of PLY in Python.
If you can implement your parser in a higher-level language than C (which we recommend you do; see Chapter 14 for discussion), then look for equivalent facilities like Python's PLY (which covers both lex and yacc)[131] or Perl's PY and Parse::Yapp modules, or Java's CUP,[132] Jack,[133] or Yacc/M[134] packages.
As with macro processors, one of the problems with code generators and preprocessors is that compile-time errors in the generated code may carry line numbers that are relative to the generated code (which you don't want to edit) rather than the generator input (which is where you need to make corrections). yacc and lex address this by generating the same #line constructs that the C preprocessor does; these set the current line number for error reporting so the numbers will come out right. Any program that generates C or C++ should do likewise.
The canonical demonstration example that seems to have appeared in every lex and yacc tutorial ever written is a toy interactive calculator program that parses and evaluates arithmetic expressions entered by the user. We will spare you yet another repetition of this cliche; if you are interested, consult the source code of the bc(1) and dc(1) calculator implementations from the GNU project, or the paradigm example ‘hoc’[135] from [Kernighan-Pike84].
Instead, the grammar of fetchmail's run-control-file parser provides a good medium-sized case study in lex and yacc usage. There are a couple of points of interest here.
We looked at Glade in Chapter 8 as a good example of a declarative minilanguage. We also noted that its back end produces a result by generating code in any one of several languages.
Glade is a good modern example of an application-code generator. What makes it Unixy in spirit are the following features, which most GUI builders (especially most proprietary GUI builders) don't have:
Rather than being glued together as one monster monolith, the Glade GUI and Glade code generator obey the Rule of Separation (following the “separated engine and interface” design pattern).
The GUI and code generator are connected by an (XML-based) textual data file format that can be read and modified by other tools.
Multiple target languages (as opposed to just C or C++) are supported. More could easily be added.
[131] PLY is downloadable.
[132] CUP is downloadable.
[133] Jack is downloadable.
[134] Yacc/M is downloadable.