Data-driven programming is sometimes confused with object orientation, another style in which data organization is supposed to be central. There are at least two differences. One is that in data-driven programming, the data is not merely the state of some object, but actually defines the control flow of the program. Where the primary concern in OO is encapsulation, the primary concern in data-driven programming is writing as little fixed code as possible. Unix has a stronger tradition of data-driven programming than of OO.
At the upper end of its complexity scale, data-driven programming merges into writing interpreters for p-code or simple minilanguages of the kind we surveyed in Chapter 8. At other edges, it merges into code generation and state-machine programming. The distinctions are not actually that important; the important part is moving program logic away from hardwired control structures and into data.
I maintain a program called ascii, a very simple little utility that tries to interpret its command-line arguments as names of ASCII (American Standard Code for Information Interchange) characters and report all the equivalent names. Code and documentation for the tool are available from the project page. Here is an illustrative screenshot:
esr@snark:~/WWW/writings/taoup$ ascii 10 ASCII 1/0 is decimal 016, hex 10, octal 020, bits 00010000: called ^P, DLE Official name: Data Link Escape ASCII 0/10 is decimal 010, hex 0a, octal 012, bits 00001010: called ^J, LF, NL Official name: Line Feed C escape: '\n' Other names: Newline ASCII 0/8 is decimal 008, hex 08, octal 010, bits 00001000: called ^H, BS Official name: Backspace C escape: '\b' Other names: ASCII 0/2 is decimal 002, hex 02, octal 002, bits 00000010: called ^B, STX Official name: Start of Text
One indication that this program was a good idea is the fact that it has an unexpected use — as a quick CLI aid to converting between decimal, hex, octal, and binary representations of bytes.
The main logic of this program could have been coded as a 128-branch case statement. This would, however, have made the code bulky and difficult to maintain. It would also have tangled parts that change relatively rapidly (like the list of slang names for characters) with parts that change slowly or not at all (like the official names), putting them both in the same legend string and making errors during editing much more likely to touch data that ought to be stable.
Instead, we apply data-driven programming. All the character name strings live in a table structure that is quite a bit larger than any of the functions in the code (indeed, counted in lines it is larger than any three of the functions in the program). The code merely navigates the table and does low-level tasks like radix conversions. The initializer actually lives in the file nametable.h, which is generated in a way we'll describe later in this chapter.
This organization makes it easy to add new character names, change existing ones, or delete old names by simply editing the table, without disturbing the code.
(The way the program is built is good Unix style, but the output format is questionable. It's hard to see how the output could usefully become the input of any other program, so it does not play well with others.)
Programs like these became common on the Internet very rapidly following Paul Graham's landmark paper A Plan for Spam [Graham] in 2002. While the explosion was triggered by the increasing cost of the pattern-matching arms race, the statistical-filtering idea was adopted first and fastest by Unix shops. In part, this was certainly because almost all the Internet service providers (who are most burdened by spam, and thus had most incentive to adopt effective new techniques) are Unix shops — but undoubtedly the harmony with some traditional themes in Unix software design helped as well.
Conventional spam filters require that a system administrator, or some other responsible party, maintain information on patterns of text found in spam — names of sites that emit nothing but spam, come-on phrases often used by pornography sites or Internet scam artists, and the like. In his paper, Graham noted accurately that computer programmers like the idea of pattern-matching filters, and sometimes have difficulty seeing past that approach, because it offers them so many opportunities to be clever.
Statistical spam filters, on the other hand, work by collecting feedback about what the user judges to be spam versus nonspam. That feedback is processed into databases of statistical correlation coefficients or weights connecting words or phrases to the user's spam/nonspam classification. The most popular algorithms use minor variants of Bayes's Theorem on conditional probabilities, but other techniques (including various sorts of polynomial hashing) are also employed.
In all these programs, the correlation check is a relatively trivial mathematical formula. The weights fed into the formula along with the message being checked serve as implicit control structure for the filtering algorithm.
The problem with conventional pattern-matching spam filters is that they are brittle. Spammers are constantly gaming against the filter-rule databases, forcing the filter maintainers to constantly reprogram their filters to stay ahead in the arms race. Statistical spam filters generate their own filter rules from the user feedback.
In fact, experience with statistical filters seems to show that the particular learning algorithm used is far less important than the quality of the spam and nonspam data sets from which the learning algorithm computes its weights. So the results of statistical filters really are driven more by the shape of the data than by the algorithm.
A Plan for Spam was something of a bombshell because its author argued convincingly that a simple, even crude, statistical approach gave a lower rate of nonspam being erroneously classified as spam than either elaborate pattern-matching techniques or the human eyeball could manage. For Unix programmers, seeing past the lure of clever pattern-matching was far easier than in other programming cultures without as strong an attachment to “Keep It Simple, Stupid!”
The fetchmailconf(1) dotfile configurator shipped with fetchmail(1) contains an instructive example of advanced data-driven programming in a very high-level, object-oriented language.
In October 1997 a series of questions on the fetchmail-friends mailing list made it clear that end-users were having increasing troubles generating configuration files for fetchmail. The file uses a simple, classically-Unixy free-format syntax, but can become forbiddingly complicated when a user has POP3 and IMAP accounts at multiple sites. See Example 9.1 for a somewhat simplified version of the fetchmail author's configuration file.
The parser for fetchmail's configuration file syntax is rather elaborate. It's actually written in yacc and lex, the two classic Unix tools for generating language-parsing code in C. For fetchmailconf to be able to edit existing configuration files, it at first appeared that it would be necessary to replicate that elaborate parser in fetchmailconf's implementation language — Python.
This tactic seemed doomed. Even leaving aside the amount of duplicative work implied, it is notoriously hard to be certain that two parsers in two different languages accept the same grammar. Keeping them synchronized as the configuration language evolved bid fair to be a maintenance nightmare. It would have violated the SPOT rule we discussed in Chapter 4 wholesale.
This problem stumped me for a while. The insight that cracked it was that fetchmailconf could use fetchmail's own parser as a filter! I added a --configdump option to fetchmail that would parse .fetchmailrc and dump the result to standard output in the format of a Python initializer. For the file above, the result would look roughly like Example 9.2 (to save space, some data not relevant to the example is omitted).
Lisp and Java programmers call this introspection; in some other object-oriented languages it's called metaclass hacking and is generally considered fearsomely esoteric, deep black magic. Most object-oriented languages don't support it at all; in those that do (Perl and Java among them), it tends to be a complicated and fragile undertaking. But Python's facilities for introspection and metaclass hacking are unusually accessible.
See Example 9.3 for the solution code, from near line 1895 of the 1.43 version.
def copy_instance(toclass, fromdict): for x in fromdict.keys(): setattr(toclass, x, fromdict[x])
When your code is this simple, it is far more likely to be right. See Example 9.4 for the code that calls it.
This is a new-school sort of example; Python was not even invented until 1990. But it reflects themes that go back to 1969 in the Unix tradition. If meditating on Unix programming as practiced by his predecessors had not taught me constructive laziness — insisting on reuse, and refusing to write duplicative glue code in accordance with the SPOT rule—I might have rushed into coding a parser in Python. The first key insight that fetchmail itself could be made into fetchmailconf's configuration parser might never have happened.
Insights like this can be extraordinarily powerful. The code we have been looking at was written in about ninety minutes, worked the first time it was run, and has been stable in the years since (the only time it has ever broken is when it threw an exception in the presence of genuine version skew). It's less than forty lines and beautifully simple. There is no way that the naïve approach of building an entire second parser could possibly have produced this kind of maintainability, reliability or compactness. Reuse, simplification, generalization, orthogonality: this is the Zen of Unix in action.
In Chapter 10, we'll examine the run-control syntax of fetchmail as an example of the standard shell-like metaformat for run-control files. In Chapter 14 we'll use fetchmailconf as an example of Python's strength in rapidly building GUIs.