Number Parsing Strategies for ICU 61
Ticket: http://bugs.icu-project.org/trac/ticket/13513
Branch: http://source.icu-project.org/repos/icu/branches/shane/numberformat3/
Rietveld: https://codereview.appspot.com/335150043/
Current State of Number Parsers
Proposals for Number Parsing in ICU 61
How should the "lead chars" optimization be used?
Known Issues / Behavior Changes (Java)
Modifier Symbols Are Always Required
Modifier Symbols Are Always Accepted
Multiplier / MathContext Exception
Examples of Difficult Strings to Parse
Parsing arbitrary user-input number strings has many challenges, some of which include:
The old implementation (ICU4J prior to 59) of number parsing is what I call "greedy ad-hoc". Using a large if statement, it considers several options for tokens at each offset into the string. It saves state between parse sessions, may backtrack, and has behavior that is difficult to understand and debug. The main benefit of this implementation is that it is reasonably fast.
The current implementation (ICU4J 59 and 60) uses a state table. The states include: BEFORE_PREFIX, AFTER_PREFIX, AFTER_INTEGER_DIGIT, AFTER_FRACTION_DIGIT, AFTER_EXPONENT_SEPARATOR, and so forth (see the code file for the full list). The current state determines what tokens can appear next. The main advantage of this approach is that it is very resilient to ambiguity, with multiple parse paths taken when there is more than one possibility. It is also easy to understand and debug. The disadvantages are that the state table structure is not as easy to extend, and overall the state table approach is much slower than the greedy ad-hoc approach by an average of one order of magnitude.
I am proposing a new implementation to launch in ICU4J and ICU4C 61. I call this one "modular greedy". The basic approach starts with defining an ordered list of matchers. At each point in the string, ask each matcher, in order, to attempt to consume the string starting at the current offset. When a match succeeds, update the offset and go back to the top of the ordered list of matchers. When no matchers can match anything else, you are done.
The pseudocode for "modular greedy" is something like:
def parse(string, results) -> void:
return if string is empty # base case
for matcher in matchers:
num_chars_consumed, _ = matcher.match(string, results)
if num_chars_consumed > 0:
parse(string[num_chars_consumed:], results)
break
In this pseudo-code, "results" is the output variable.
Here is a bare bones naive implementation of a simple minus sign matcher to illustrate how matchers work:
class MinusSignMatcher implements MatcherInterface:
def match(string, results) -> (int, bool):
# Minus sign options: "-" and "**"
if string[:1] == "-":
results.is_negative = true
return 1, false # num_chars_consumed, want_more
elif string[:2] == "**":
results.is_negative = true
return 2, false
elif string == "*":
# Partial prefix match: ask for another char.
return 0, true
return 0, false # no match
In addition, with the ordered list of matchers, it is easy to supplement a slower, "modular non-greedy" algorithm that handles a few extra corner cases with very high code reuse. This algorithm no longer lets matchers greedily eat up as many chars as they want. It essentially tries giving the matcher every possible length, checking each time to see if any of the other matchers are interested in what comes next. For slightly better efficiency, the matcher is allowed to say whether it is interested in seeing more of the string beyond the given segment.
def parse_non_greedy(string, results) -> void:
return if string is empty # base case
initial = results.clone()
for matcher in matchers:
for num_chars in 1 through string.length:
candidate = initial.clone()
num_chars_consumed, want_more = matcher.match(string[:num_chars], candidate)
if num_chars == num_chars_consumed:
parse_non_greedy(string[num_chars_consumed:], candidate)
if candidate is better than best:
results <= candidate
if not want_more:
break # go to next matcher
Since the vast majority of users do not need the non-greedy code path, however, I suggest using greedy as the default. If users run into trouble with the greedy algorithm for some reason, they can discover the non-greedy option by reading the documentation, and they can flip the switch at that time to see if it solves their problem.
To achieve a modest boost to parsing runtime, each matcher can have a method to compute a set of "lead chars". The algorithm can look at that set and determine if it needs to call the slower "match" method. The contract is that the "lead chars" needs to contain a superset of the chars that can actually be accepted by "match".
The downside of this optimization is that it adds to the startup time (constructing your parsing object), so it should be used only when necessary.
A few options:
The "modular greedy" algorithm described above should be better for users than either ICU4J 58 or ICU4J 59/60:
Here are graphs comparing the runtime and allocations for the new implementation (with and without the "lead chars" optimization) and JDK, ICU4J 58, and ICU4J 59/60. These benchmarks are computed with Google Caliper and represent the resources required to parse the string "123,456.789" in locale en-US. The benchmarks were taken on an x86-64 Linux workstation. The JDK being used is OpenJDK.
NOTE: For the ICU 61 implementation, these benchmarks represent the runtime required to compute the correct binary-coded-decimal internal representation of the number. Converting from BCD to an output data type like Long or BigDecimal incurs an additional cost.
In my branch, I have the DecimalFormat parsing API hooked up to my new parsing backend. This section contains all behavior changes that are exhibited via ICU4J's unit test suite.
Old behavior: Accept all forms of the locale's currency, but do not accept non-locale currencies unless parseCurrency() is being used.
New behavior: Always accept all currencies. The performance cost is small because the currency parsing code is very well optimized. The cost of maintaining a separate structure for the locale's currencies, as in previous versions, would add to construction cost.
Consider the following unit test (note that this is a unit test I added in 59):
In other words, if the affix is empty, I don't automatically assume that the number is negative. I require some sort of evidence, like a minus sign or accounting-style parentheses. This also applies to other symbols like percent signs: unless I see the percent sign, I don't scale by 100. Related: #13114
In lenient mode, most symbols are accepted whether or not they appear in the pattern. This change primarily affects percent signs:
This change does not affect strict mode.
The parse method API docs make only one guarantee for the return type: if the number fits inside a Long, then a Long is returned. Previously, a BigInteger might have been returned when possible. Now I always unconditionally return a BigDecimal, which I have found to be faster and easier to deal with than BigInteger. The API spec makes absolutely no guarantees about the return type other than Long, but some users may be expecting certain behavior cases with the BigInteger.
You set a MathContext with unlimited precision in conjunction with a multiplier that when inverted produces a nonterminating decimal.
Old behavior: The exception is thrown in .parse()
New behavior: The exception is thrown immediately in .setMultiplier()/.setMathContext()
String: 123,456
Is the comma a grouping or decimal separator? (Take hints from the locale)
String: 123,456.789
How about now? (If we are in Europe, do we accept the comma as an American-style grouping separator?)
Pattern: 0'0N'
String: 40N
Greedy parses this as 40 and stops. Non-greedy can discover the suffix 0N and parse this as 4. This string is parseable in ICU4J 59/60 and in the non-greedy algorithm for ICU 61. It is not pareseable in ICU4J 58 or greedy 61.
Digit Strings: (0), (1), (2), (3), ...
String: (1)(2)(3)
Backwards compatibility dictates that we should accept symbols like these that are more than one code point in length.