more ANTLR - Java, and comparisons to PLY and PyParsing

[ previous | newer ] /home/writings/diary/archive/2007/11/03/antlr_java

more ANTLR - Java, and comparisons to PLY and PyParsing

Parts 1, 2, 3, and comments.

I've been writing a series of articles on parsing a molecular formula (like H2SO4) using ANTLR as a parser. See the first and second essays.

There are several reasons to use ANTLR over one of the Python parsers like PLY and PyParsing. The GUI interface is very nice, it has a sophisticated understanding of how to define a grammar, it's top-down approach means the generated code is easier to understand, tree-grammers simplify some otherwise tedious parts of writing parsers, and the one I'll talk about now - it supports multiple output languages.

Using ANTLR to generate a Java-based MW calculator

I chose to generate Python code but ANTLR also supports generating parsers in C, C# and Java. Python is actually not quite as supported as those three languages, and there's some support for yet other languages.

Converting a grammar to a new language is straight-forward: rewrite the action code in the new language. To show you what I mean, here's the molecular weight evaluator in Python, from earlier:

grammar MWGrammar;

options {
	language=Python;
}

calculate_mw returns [float mw]
@init {
  $mw = 0.0
}
	: (species { $mw += $species.species_weight})* EOF
	;

species	returns [float species_weight]
	: atom DIGITS? {
		count = int($DIGITS.text) if $DIGITS else 1
		$species_weight = $atom.weight * count
		}
	;

atom returns [float weight]
	: 'H' { $weight = 1.00794 }
	| 'C' { $weight = 12.001 }
	| 'Cl' { $weight = 35.453 }
	| 'O' { $weight = 15.999 }
	| 'S' { $weight = 32.06 }
	;

DIGITS	: count='0' .. '9'+ ;

and here's it is rewritten for Java:

grammar JMWGrammar;

options {
	language=Java;
}

calculate_mw returns [double mw]
@init {
  $mw = 0.0;
}
	: (species { $mw += $species.species_weight;} )* EOF
	;

species	returns [double species_weight]
@init {
  int count = 0;
}
	: atom DIGITS? {
		if ($DIGITS == null) {
			count = 1;
		} else {
			count = Integer.parseInt($DIGITS.text);
		}
		$species_weight = $atom.weight * count;
		}
	;

atom returns [double weight]
	: 'H' { $weight = 1.00794; }
	| 'C' { $weight = 12.001; }
	| 'Cl' { $weight = 35.453; }
	| 'O' { $weight = 15.999; }
	| 'S' { $weight = 32.06; }
	;

DIGITS	: '0' .. '9'+ ;

I added a bunch of semicolons, changed a few function name lookups, and used Java's 'null' instead of Python's None. Almost mechanical. I also decided not to use Java's ternary operator and instead have an 'if' statement. Oh, and I changed everything to 'double' instead of 'float', and had to declare a type for the 'count' variable. I suppose I should go back to the Python grammar and change everything to 'double' there, but for the Python code it doesn't actually matter.

It's nice I can do that, but there's almost certainly a price to pay. The generated Python code looks slow. There are a lot of function calls, which are cheap in Java but expensive in Python. How does the ANTLR generated parser compare with a native Python parsing system?

Parsing with PLY

My prefered native Python parser is PLY. I've written about it in my NBN lecture notes. (The NBN is South Africa's National Bioinformatics Network. I taught several Python-related courses for them.)

If you go to my PLY writeup you'll see I described how to use PLY to parse a molecular formula. It's my "hello world" for parsers. Luckily for me, that means I don't have to explain the details again. Hmm, though if I ever write a book or something based on this (not bloody likely), I'll have to spend time to normalize the vocabulary I use for the different parts of the grammar.

Here's my PLY parser for computing the molecular weight. I've added comments directly in the code which point out some of the differences between this grammar and the one I developed in the lecture notes.

"""Calculate the molecular weight given a molecular formula

Parse the formula using PLY.
"""
# ply_mw.py

from ply import lex
from ply.lex import TOKEN
import ply.yacc as yacc

class ParseError(Exception):
    def __init__(self, msg, offset):
        self.msg = msg
        self.offset = offset
    def __repr__(self):
        return "ParseError(%r, %r)" % (self.msg, self.offset)
    def __str__(self):
        return "%s at position %s" % (self.msg, self.offset + 1)


### Define the lexer

tokens = (
    "ATOM",
    "DIGITS",
)

mw_table = {
    'H': 1.00794,
    'C': 12.001,
    'Cl': 35.453,
    'O': 15.999,
    'S': 32.06,
}


# I don't want to duplicate the atom names so extract the
# keys to make the lexer pattern.

# Sort order is:
#   - alphabetically on first character, to make it easier
# for a human to look at and debug any problems
# 
#   - then by the length of the symbol; two letters before 1
# Needed because Python's regular expression matcher
# uses "first match" not "longest match" rules.
# For example, "C|Cl" matches only the "C" in "Cl"
# The "-" in "-len(symbol)" is a trick to reverse the sort order.
#
#   - then by the full symbol, to make it easier for people

# (This is more complicated than needed; it's to show how
# this approach can scale to all 100+ known and named elements)

atom_names = sorted(
    mw_table.keys(),
    key = lambda symbol: (symbol[0], -len(symbol), symbol))

# Creates a pattern like:  Cl|C|H|O|S
atom_pattern = "|".join(atom_names)

# Use a relatively new PLY feature to set the __doc__
# string based on a Python variable.
@TOKEN(atom_pattern)
def t_ATOM(t):
    t.value = mw_table[t.value]
    return t

def t_DIGITS(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    raise ParseError("unknown character", t.lexpos)

lexer = lex.lex()

## Here's an example of using the lexer

# data = "H2SO4"
# 
# lex.input(data)
# 
# for tok in iter(lex.token, None):
#     print tok

##### Define the grammar

# The molecular weight of "" is 0.0
def p_mw_empty(p):
    "mw : "
    p[0] = 0.0

def p_mw_formula(p):
    "mw : formula"
    p[0] = p[1]
    

def p_first_species_term(p):
    "formula : species"
    p[0] = p[1]

def p_species_list(p):
    "formula : formula species"
    p[0] = p[1] + p[2]

def p_species(p):
    "species : ATOM DIGITS"
    p[0] = p[1] * p[2]

def p_species_default(p):
    "species : ATOM"
    p[0] = p[1]

def p_error(p):
    raise ParseError("unexpected character", p.lexpos)

parser = yacc.yacc()

# Work around a problem in PLY 2.3 where the first parse does not
# allow a "".  I reported it to the ply mailing list on 2 November.
# This guarantees the first parse will never be "" :)
parser.parse("C")

### Calculate molecular weight

def calculate_mw(formula):
    return parser.parse(formula, lexer=lexer)

By comparison, here's the driver code for the ANTLR generated parser. Remember, the grammar and actions are defined in a file named "MWGrammar.g", which is why this code is so short:

"""Calculate the molecular weight given a molecular formula

Parse the formula using a parser generated by ANTLR
"""
# antlr_mw.py

import sys
import antlr3
from MWGrammarParser import MWGrammarParser
from MWGrammarLexer import MWGrammarLexer

def calculate_mw(formula):
    char_stream = antlr3.ANTLRStringStream(formula)
    lexer = MWGrammarLexer(char_stream)
    tokens = antlr3.CommonTokenStream(lexer)
    parser = MWGrammarParser(tokens)
    return parser.calculate_mw()

Tests and timing comparisons

The next step is to test both implementations and compare their performance. I'll extract the self-test code from my previous essay. It generates a lot of molecular formulas semi-randomly along with the MW. The MW calculation uses a completely different approach than the parser code, so it's very unlikely that there will be the same problem in both cods. But just in case, I added a few tests by hand.

With the code extracted, I'll use it to generate 2505 tests and run those formulas through each of the parser-based molecular weight calculators.

"""Run tests to validate the MW parsers and compare timing results."""
# compare_mw.py

import antlr_mw
import ply_mw

# time.clock is more accurate under Windows
import time, sys
if sys.platform == "win32":
    timer = time.clock
else:
    timer = time.time

_mw_table = {
    'H': 1.00794,
    'C': 12.001,
    'Cl': 35.453,
    'O': 15.999,
    'S': 32.06,
}
_element_names = _mw_table.keys()
def _generate_random_formulas():
    import random
    # Using semi-random values so I can check a wide space
    # Number of terms in the formula
    _possible_lengths = (1, 2, 3, 4, 5, 10, 53, 104)
    # Repeat count for each formula
    _possible_counts = tuple(range(12)) +  (88, 91, 106, 107, 200, 1234)
    for i in range(2500):
        terms = []
        total_mw = 0.0
        # Use a variety of lengths
        for j in range(random.choice(_possible_lengths)):
            symbol = random.choice(_element_names)
            terms.append(symbol)
            count = random.choice(_possible_counts)
            if count == 1 and random.randint(0, 2) == 1:
                pass
            else:
                terms.append(str(count))

            total_mw += _mw_table[symbol] * count
        yield total_mw, "".join(terms)

_selected_formulas = [
    (0.0, ""),
    (1.00794, "H"),
    (1.00794, "H1"),
    (32.06, "S"),
    (12.001+1.00794*4, "CH4"),
    ]

good_test_data = (_selected_formulas +
                  list(_generate_random_formulas()))

def do_tests(calculate_mw):
    start_time = timer()
    for expected_mw, formula in good_test_data:
        got_mw = calculate_mw(formula)
        if expected_mw != got_mw:
            raise AssertionError("%r expected %r got %r" %
                                 (formula, expected_mw, got_mw))
    return timer() - start_time

print "Testing", len(good_test_data), "formulas"

# Evaluate everything with ANTLR
antlr_time = do_tests(antlr_mw.calculate_mw)
print "ANTLR", antlr_time

# Evaluate everything with PLY
ply_time = do_tests(ply_mw.calculate_mw)
print "PLY", ply_time

print "ratio = %.02f" % (antlr_time / ply_time)

# I really should test that they handle invalid formulas ...

which when I run it gives me output like this:

% python compare_mw.py
Testing 2505 formulas
ANTLR 8.90074515343
PLY 2.39185404778
ratio = 3.72

PLY for this grammar is about 3.7 times faster than ANTLR.

Faster with a specialized parser

Of course, if you want speed you can write a special-purpose parser instead. The following runs about 13 times faster than PLY:

import re
mw_pat = re.compile(r"(Cl|C|O|S|H)(\d+)?")

_mw_table = {
    'H': 1.00794,
    'C': 12.001,
    'Cl': 35.453,
    'O': 15.999,
    'S': 32.06,
}

def re_mw(formula):
    total_mw = 0.0
    offset = 0
    while 1:
        match = mw_pat.match(formula, offset)
        if match is None:
            break

        symbol = match.group(1)
        weight = _mw_table[symbol]
        count = match.group(2)
        if count is None:
            total_mw += weight
        else:
            total_mw += weight * int(count)
        offset = match.end(0)

    if offset != len(formula):
        raise TypeError("unknown or unexpected character at position %d" % (offset+1,))
    return total_mw

Remember though that the point of this is to get a better understanding of how to use ANTLR. I normally wouldn't parse such a simple grammar using ANTLR or PLY or other parsing system as that's overkill. Consider, however, a more complex formula allowing expressions like "CH3(CH2)6CH3" for octane. That's much harder to do manually but only a couple of extra lines using ANTLR or PLY.

Error handling in ANTLR

(Note: In my original version I could not get error handling to work as I expected. With help from Benjamin Niemann I got it figured out.)

I would like the ANTLR grammar to stop and raise an exception at the first point of failure, rather than writing a message to the console. I tried subclassing the MWGrammarLexer and MWGrammerParser but had limited sucess. Looking in the ANTLR book, p241, section 10.4 "Exiting the Recognizer upon First Error" describes almost exactly what I want to do. I need to add a special @rulecatch action in the grammar and override a couple of default methods.

However, the text given there only describes how to catch and handle parser errors, and doesn't describe what to do for lexer errors. To figure that out I emailed Benjamin Niemann, who is the contact person for the Python back-end. I originally thought there was a problem with the generated code, but it was really due to my limited understanding.

grammar MWGrammar;

options {
	language=Python;
}

# This part is NOT in the Terence Parr's "The Definitive ANTLR Reference"
@lexer::members {
def reportError(self, e):
   raise e
}

@members {
def mismatch(self, input, ttype, follow):
    raise MismatchedTokenException(ttype, input)

def recoverFromMismatchedSet(self, input, e, follow):
    raise e
}

@rulecatch {
except RecognitionException, e:
    raise
}

calculate_mw returns [float mw]
@init {
  $mw = 0.0
}
	: (species { $mw += $species.species_weight})* EOF
	;

species	returns [float species_weight]
	: atom DIGITS? {
		count = int($DIGITS.text) if $DIGITS else 1
		$species_weight = $atom.weight * count
		}
	;

atom returns [float weight]
	: 'H' { $weight = 1.00794 }
	| 'C' { $weight = 12.001 }
	| 'Cl' { $weight = 35.453 }
	| 'O' { $weight = 15.999 }
	| 'S' { $weight = 32.06 }
	;

DIGITS	: count='0' .. '9'+ ;

Here are examples of the errors raised first by the lexer ("@" is not a supported token) and parser (the number "10" cannot start a molecular formula). For clarity I've removed the leading part of the path to the antlr runtime egg:

% python compute_mw2.py "@"
MW is
Traceback (most recent call last):
  File "compute_mw2.py", line 17, in <module>
    print "MW is", calculate_mw(formula)
  File "compute_mw2.py", line 15, in calculate_mw
    return parser.calculate_mw()
  File "/Users/dalke/src/dayparsers/MWGrammarParser.py", line 62, in calculate_mw
    LA1_0 = self.input.LA(1)
  File ".../antlr_python_runtime-3.0.1-py2.5.egg/antlr3/streams.py", line 813, in LA
    return self.LT(i).type
  File ".../antlr_python_runtime-3.0.1-py2.5.egg/antlr3/streams.py", line 752, in LT
    self.fillBuffer()
  File ".../antlr_python_runtime-3.0.1-py2.5.egg/antlr3/streams.py", line 623, in fillBuffer
    t = self.tokenSource.nextToken()
  File ".../antlr_python_runtime-3.0.1-py2.5.egg/antlr3/recognizers.py", line 915, in nextToken
    self.reportError(re)
  File "/Users/dalke/src/dayparsers/MWGrammarLexer.py", line 31, in reportError
    raise e
antlr3.exceptions.NoViableAltException: NoViableAltException('@'!=['1:1: Tokens : ( T5 | T6 | T7 | T8 | T9 | DIGITS );'])

% python compute_mw2.py "10"
MW is
Traceback (most recent call last):
  File "compute_mw2.py", line 17, in <module>
    print "MW is", calculate_mw(formula)
  File "compute_mw2.py", line 15, in calculate_mw
    return parser.calculate_mw()
  File "/Users/dalke/src/dayparsers/MWGrammarParser.py", line 83, in calculate_mw
    self.match(self.input, EOF, self.FOLLOW_EOF_in_calculate_mw61)
  File ".../antlr_python_runtime-3.0.1-py2.5.egg/antlr3/recognizers.py", line 145, in match
    self.mismatch(input, ttype, follow)
  File "/Users/dalke/src/dayparsers/MWGrammarParser.py", line 36, in mismatch
    raise MismatchedTokenException(ttype, input)
antlr3.exceptions.MismatchedTokenException: MismatchedTokenException(4!=-1)

ANTLR is very flexible. I can use my own exception types, change the error handling behaviour quite a bit, and get a lot of parse state information for error reporting.

Parsing with PyParsing

As long I'm here, Paul McGuire did a chemical formula calculator using PyParsing. I might as well show and time that one. I didn't have PyParsing installed but it's available through PyPI and easy_install by doing:

easy_install pyparsing

Once installed I created this module, named "pyparsing_mw.py"

"""Calculate the molecular weight given a molecular formula

Parse the formula using PyParsing.
"""
# pyparsing_mw.py
#
# This code is a modification of
#   http://pyparsing.wikispaces.com/space/showimage/chemicalFormulas.py
# done by Andrew Dalke

# chemicalFormulas.py
#
# Copyright (c) 2003, 2007, Paul McGuire
#

from pyparsing import Word, Optional, ZeroOrMore, Group, ParseException

# define a simple Python dict of atomic weights, with chemical symbols
# for keys.
# The original numbers were more precise than mine; I changed
# them to match what I was using so the tests continue to pass -- Andrew
atomicWeight = {
    'H': 1.00794,
    'C': 12.001,
    'Cl': 35.453,
    'O': 15.999,
    'S': 32.06,
    #"Na" : 22.9897,  # from the original code
    }

# define some strings to use later, when describing valid lists 
# of characters for chemical symbols and numbers
caps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers = caps.lower()
digits = "0123456789"

# Define grammar for a chemical formula
# - an element is a Word, beginning with one of the characters in caps,
#   followed by zero or more characters in lowers

element = Word( caps, lowers )

# - an integer is a Word composed of digits

integer = Word( digits )

# - an elementRef is an element, optionally followed by an integer - if 
#   the integer is omitted, assume the value "1" as a default; these are 
#   enclosed in a Group to make it easier to walk the list of parsed 
#   chemical symbols, each with its associated number of atoms per 
#   molecule

elementRef = Group( element("symbol") + Optional( integer, default=1 )("qty") )

# - a chemicalFormula is just zero or more elementRef's
#  (I changed this from OneOrMore to allow "" -- Andrew)
chemicalFormula = ZeroOrMore( elementRef )

## Add some actions 

# Auto-convert integers
def convertIntegers(tokens):
    return int(tokens[0])

# Compute partial molecular weight per element
def computeElementWeight(tokens):
    element = tokens[0]
    element["weight"] = atomicWeight[element.symbol] * element.qty


integer.setParseAction( convertIntegers )
elementRef.setParseAction( computeElementWeight )

def calculate_mw(formula):
    formulaData = chemicalFormula.parseString(formula)
    return sum(element.weight for element in formulaData)

# Some test code; gives you an idea of how to use this
# 
#for formula in ( "H2O", "NaCl", "C6H5OH", "H2SO4" ):
#     # print the results
#     print formula, "->", calculate_mw(formula)

Validation and timings

It was an easy tweak to compare_mw.py file to add the PyParsing-based parser and verify that it passed all of the tests. This also gave me timings comparisons for the different implementations:

Testing 2505 formulas
RE 0.18
PLY 2.33
ANTLR 8.73
PyParsing 9.87

Just because PyParsing is the slowest for this test doesn't mean you shouldn't use it. PyParsing and ANTLR are both top-down parsers, which is more natural for most people compared to the bottom-up parser in PLY. You can also see that PyParsing makes a much more pervasive use of objects. I happen to know regular expressions and BNF very well so I prefer PLY's terseness. It's also nice that it's a lot faster than the other two.

Hybrid parsing as a possible speedup

In emails with Benjamin Niemann I asked about the likelihood of the ANTLR-generated code getting any faster. I used as an example this generated lexer code, which matches \d+:

            cnt1 = 0
            while True: #loop1
                alt1 = 2
                LA1_0 = self.input.LA(1)
                if ((u'0' <= LA1_0 <= u'9')) :
                    alt1 = 1
                if alt1 == 1:
                    # MWGrammar.g:49:15: count= '0' .. '9'
                    count = self.input.LA(1)
                    self.matchRange(u'0', u'9')
                else:
                    if cnt1 >= 1:
                        break #loop1
                    eee = EarlyExitException(1, self.input)
                    raise eee
                cnt1 += 1

It looks like every character read incurs several Python function calls, which are a lot more expensive in Python than in Java or C++. There's no easy change for this, so I'm pretty sure the ANTLR-generated parser always going to be slower than PLY.

Benjamin pointed out that there's no reason to use ANTLR for both lexing and parsing. You can replace the lexer with your own hand-coded one, and let ANLTR do the parser. I wonder how much faster than would be. Perhaps someday I'll try it out, or you can do it and let me know.

Comments?

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me