For some related ideas, see the descriptions of the Alternate Hard And Soft Layers and Scripted Components design patterns.
An interesting survey of design styles and techniques in minilanguages is Notable Design Patterns for Domain-Specific Languages [Spinellis].
One very pragmatic reason to stick with structured data rather than a minilanguage is that in a networked world, embedded minilanguage facilities are subject to abuses that can be inconvenient or even dangerous. JavaScript is a prime example in the ‘inconvenient’ category; its designers didn't anticipate that it would be used for pop-up advertisements so obnoxious as to create a demand for browser features that suppress JavaScript interpretation.
Microsoft Word macro viruses show how this sort of thing can become actively dangerous, a security hole that costs billions of dollars in downtime and lost productivity annually. It is instructive to note that despite the existence of at least twenty million Unix users worldwide[95] there has never been any Unix equivalent of Windows's frequent macro-virus outbreaks. There are a number of reasons for this, including the fundamentally better security design of Unix; but at least one is the fact that Unix mail agents do not default to executing live content in any document that the user views.[96]
If there is any way that your application's users might end up running programs from untrusted sources, risky features of your application minilanguage might end up having to be suppressed. Languages like Java and JavaScript are explicitly sandboxed—that is, they have limited access to their environment not merely to simplify their design but to try to prevent potentially destructive operations by buggy or malicious code.
On the other hand, a lot of designs have been botched by designers who failed to face up to the fact that they really needed a minilanguage rather than a data-file format. Too often, language-like features get pasted on as an afterthought. The two most common symptoms of this problem are weak, ad-hoc control structures and poor or nonexistent facilities for declaring procedures.
It's risky to design minilanguages that are only accidentally Turing-complete. If you do this the odds are good that, sometime in the future, some clever fellow is going to think he needs to press your language into doing loops and conditionals for him. Because these are only available in an obfuscated way, he'll produce obfuscated code. The results may be serviceable in the short term, but are likely to be a nightmare for those who come after him.
Minilanguage design is both powerful and esthetically rewarding, but it's also full of similar traps. There are kinds of design in which it is appropriate to take the bottom-up approach of pasting together a bunch of low-level services and worrying about their organization after you have explored the problem domain for a while. One of the virtues of minilanguages is that they can help you get a good design out of bottom-up programming by allowing you to defer some top-down decisions into the control flow of programs in your minilanguage. But if you take a bottom-up approach to the minilanguage design itself, you are likely to end up with an ugly syntax reflecting a weak language and a poorly-thought-out implementation.
There are many places in a minilanguage design where small choices make a large difference in the usability and ease of use of the tool.
On these questions, as elsewhere, there is no substitute for good taste and engineering judgment. If you're going to design a minilanguage, don't do it halfway. Declarative minilanguages should have a clear, consistent language-like syntax designed to be readable by humans. Imperative ones should add a full range of control structures adapted from language models you can expect your users to be familiar with. Think about the language as a language; ask yourself esthetic questions like “Will this be comfortable to program in?” and even “Will it be pleasant to look at?” Here, as elsewhere in software design, David Gelernter's maxim is apt: beauty is the ultimate defense against complexity.
One fundamentally important question is whether you can implement your minilanguage by extending or embedding an existing scripting language. This is often the right way to go for an imperative minilanguage, but much less appropriate for a declarative one.
This is the easiest way to write a minilanguage. Old-school Lisp programmers (including me) love this technique and use it heavily. It underlies the design of the Emacs editor, and has been rediscovered in the new-school scripting languages like Tcl, Python, and Perl. There are drawbacks to it, however.
Your host language may be unable to interface to a code library that you need. Or, internally, its ontology of data types may be inadequate for the kind of computation you need to do. Or, after measuring the performance of a prototype, you discover that it's too slow. When any of these things happen, your solution is usually going to involve coding in C (or C++) and integrating the results into your minilanguage.
The option of extending a scripting language with C code, or of embedding a scripting language in a C program, relies on the existence of scripting languages designed for it. You extend a scripting language by telling it to dynamically load a C library or module in such a way that the C entry points become visible as functions in the extended language. You embed a scripting language in a C program by sending commands to an instance of the interpreter and receiving the results back as values in C.
Both techniques also rely on the ability to move data between the type ontology of C and the type ontology of your scripting language. Some scripting languages are designed from the ground up to support this. One such is Tcl, which we'll cover in Chapter 14. Another is Guile, an open-source dialect of the Lisp variant Scheme. Guile is shipped as a library and specifically designed to be embedded in C programs.
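As a concrete illustration of the embedding direction, here is a minimal sketch in C (not taken from any particular project) that hands a script fragment to an embedded Tcl interpreter and reads the result back as a string. It assumes a Tcl development installation providing tcl.h and libtcl.

#include <stdio.h>
#include <tcl.h>

int main(int argc, char *argv[])
{
    (void)argc;                                /* unused */
    Tcl_FindExecutable(argv[0]);               /* initialize Tcl's internals */
    Tcl_Interp *interp = Tcl_CreateInterp();   /* one interpreter instance */

    /* Send a command to the interpreter... */
    if (Tcl_Eval(interp, "expr {3 + 4}") != TCL_OK) {
        fprintf(stderr, "script error: %s\n", Tcl_GetStringResult(interp));
        Tcl_DeleteInterp(interp);
        return 1;
    }

    /* ...and receive the result back as a value in C. */
    printf("result: %s\n", Tcl_GetStringResult(interp));

    Tcl_DeleteInterp(interp);
    return 0;
}

Linked against libtcl, the host program never needs to know how Tcl parses the script; it only moves strings across the boundary.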
It is possible (though in 2003 still rather painful and difficult) to extend or embed Perl. It is very easy to extend Python and only slightly more difficult to embed it; C extension is especially heavily used in the Python world. Java has an interface to call ‘native methods’ in C, though the practice is explicitly discouraged because it tends to break portability. Most versions of shell are not designed for embeddability and extension, but the Korn shell (ksh93 and later versions) is a notable exception.
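The extension direction is the mirror image. The following minimal sketch uses the present-day Python 3 C API (the details differed in 2003, and the module and function names here are invented for illustration); it exposes one C function to Python code as example.maxl.

#include <Python.h>

/* A toy C function made visible to Python as example.maxl(x, y). */
static PyObject *
maxl(PyObject *self, PyObject *args)
{
    long x, y;
    if (!PyArg_ParseTuple(args, "ll", &x, &y))
        return NULL;                     /* argument error already set */
    return PyLong_FromLong(x > y ? x : y);
}

static PyMethodDef example_methods[] = {
    {"maxl", maxl, METH_VARARGS, "Return the larger of two integers."},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef example_module = {
    PyModuleDef_HEAD_INIT, "example", "Toy extension module.", -1, example_methods
};

PyMODINIT_FUNC
PyInit_example(void)
{
    return PyModule_Create(&example_module);
}

Compiled into a shared library, the module can be loaded from Python with import example and called like any other Python function.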
There are lots of bad reasons not to piggyback your imperative minilanguage on an existing scripting language. One of the few good ones is that you actually want to implement your own custom grammar for error checking. If that's the case, then see the advice about yacc and lex below.
For declarative minilanguages, one major question is whether or not you should use XML as a base syntax and specify your grammar as an XML document type. This may well be the right thing for elaborately structured declarative minilanguages, but the same caveats we noted in Chapter 5 about the design of data-file formats apply — XML might be overkill. If you don't use XML, follow the Rule of Least Surprise by supporting the Unix conventions we described for data files (simple token-oriented syntax, supporting C backslash conventions, etc.).
If you do need a custom grammar, yacc and lex (or their local equivalent in the language you're using) should probably be your best friends, unless the grammar of your language is so simple that hand-coding a recursive-descent parser is trivial. Even then, yacc may give you better error recovery, and a yacc specification will be easier to modify as the language syntax evolves. See Chapter 9 for a look at the yacc- and lex-derived tools available in different implementation languages.
Even if you decide you must implement your own syntax, consider what mileage you can get from reusing existing tools. If you need a macro facility, consider whether preprocessing with m4(1) might be the right answer — but consider the cautions in the next section first.
Macro expansion facilities were a favored tactic for language designers in early Unix; the C language has one, of course, and we have seen them show up in some of the more complex special-purpose minilanguages like pic(1). The m4 preprocessor provides a generic tool for implementing macro-expanding preprocessors.
The strength of macro expansion is that it knows nothing about the underlying syntax of the base language; the corresponding weakness is that the expanded text can parse in ways the macro's author never intended. In C, the classic example of this sort of problem is a macro such as this:
#define max(x, y) x > y ? x : y
The danger is that arguments and surrounding context are pasted together as raw text. A call like max(a, b) + 1 expands to a > b ? a : b + 1, which C parses as a > b ? a : (b + 1) because + binds more tightly than ?:, so the + 1 is silently dropped whenever a > b. This sort of bad interaction can be headed off by coding the macro definition more defensively.
#define max(x, y) ((x) > (y) ? (x) : (y))
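A few lines of illustrative C (not part of the original discussion) make the difference visible:

#include <stdio.h>

#define naive_max(x, y)  x > y ? x : y
#define safe_max(x, y)   ((x) > (y) ? (x) : (y))

int main(void)
{
    int a = 5, b = 3;
    printf("%d\n", naive_max(a, b) + 1);  /* prints 5, not the expected 6 */
    printf("%d\n", safe_max(a, b) + 1);   /* prints 6 */
    return 0;
}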
The TeX formatting language (see Chapter 18) well illustrates the general problem. TeX is intentionally Turing-complete (it has conditionals, loops, and recursion), but while it can be made to do amazing things, TeX code tends to be unreadable and painful to debug. The sources for LaTeX, the most widely used TeX macro package, are an instructive example: they're in very good TeX style, but even so are extremely difficult to follow.
A minor problem, compared to this one, is that macro expansion tends to screw up error diagnostics. The base language processor generates its error reports relative to the macro-expanded text, not the original text the programmer is looking at. If the relationship between the two has been obfuscated by macro expansion, the emitted diagnostic can be very difficult to associate with the actual location of the error.
This is especially a problem with preprocessors and macros that can have multiline expansions, conditionally include or exclude text, or otherwise change line numbers in the expanded text.
Macro expansion stages that are built into a language can do their own compensation, fiddling line numbers to refer back to the preexpanded text. The macro facility in pic(1) does this, for example. This problem is more difficult to solve when the macro expansion is done by a preprocessor.
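When the expanded text is C (or when your minilanguage translator emits C), the conventional compensation is to emit #line directives so that the compiler's diagnostics point back at the user's original file rather than at the generated text. Here is a sketch, with a hypothetical helper function:

#include <stdio.h>

/* Hypothetical helper for a macro-expanding preprocessor that emits C:
   call it whenever the output switches to material that originated at
   line 'lineno' of 'srcfile'. */
static void emit_line_marker(FILE *out, const char *srcfile, int lineno)
{
    /* The C compiler will attribute subsequent errors to srcfile:lineno,
       not to positions in the generated text. */
    fprintf(out, "#line %d \"%s\"\n", lineno, srcfile);
}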
Another important question you need to ask is whether your minilanguage engine will be called interactively by other programs, as a slave process. If so, your design should probably look less like a conversational language for human interaction and more like the kind of application protocols we looked at in Chapter 5.
The main difference is how carefully marked the boundaries of transactions are. Human beings are good at spotting where conversational output from a CLI ends, and where the prompt for the next input is. They can use context to tell what's significant and what should be ignored. Computer programs have much more trouble with this. Without either unambiguous end markers on output or advance knowledge of the length of the output, they can't tell when to stop reading.
Even worse is when a program's input is buffered (often inadvertently, as by stdio). A program that stops overtly reading at the right place can nonetheless eat past it.
Programs in which master processes are trying to do interactive things with slaved minilanguages that are not carefully designed around this problem are prone to deadlock as the master and slave fall out of synchronization (a problem we first noted in Chapter 7).
There are workarounds for driving minilanguages that are not so carefully designed. The prototype for most of them is the Tcl expect package. This package assists conversation with CLIs. It's built around the following operation: read from slave until either a given regular-expression pattern is matched or a specified timeout elapses. With this (and, of course, a send-to-slave operation) it's often possible to construct master programs to do reliable dialogues with slave processes even when the latter have not been tailored for the role.
Be aware of this issue when designing your minilanguage. It may be a good idea to add an option that changes its conversational behavior to make it respond more like an application protocol, with unambiguous end-of-output delimiters and an analog of byte stuffing.
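Here is a minimal sketch of what such an option might look like on the slave side (the helper names are invented for illustration): every response line that begins with a '.' is stuffed with an extra dot, SMTP-style, a lone '.' line marks end of output, and the stream is flushed so nothing lingers in a stdio buffer.

#include <stdio.h>

/* Emit one response line, stuffing a leading "." so it can never be
   mistaken for the end-of-output marker. */
static void emit_response_line(FILE *out, const char *line)
{
    if (line[0] == '.')
        fputc('.', out);
    fputs(line, out);
    fputc('\n', out);
}

/* Mark the end of a response unambiguously, then flush so the master
   is never left waiting on data stuck in a stdio buffer. */
static void emit_end_of_response(FILE *out)
{
    fputs(".\n", out);
    fflush(out);
}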