Tuesday 11 December 2007 — This is 17 years old. Be careful.
Kay Rhodes wrote a post about simplistic vs. useful sorting: Alphabetical != ASCIIbetical (cute name). In it, he points to Dave Koelle’s Alphanum algorithm, which says to split the string to be sorted into numeric and non-numeric chunks, then sort so that the numeric chunks are treated as numbers. This makes “z2” sort after “z100”, for example.
I looked at the code Dave provided (in Java, C++, or Perl), and all of it is much longer than I expected: the C++ is 40 or so lines. A comment in there says it’s easier in a pattern language like Perl, but the Perl is still 20 iterative lines.
In Python:
import re
def tryint(s):
"""
Return an int if possible, or `s` unchanged.
"""
try:
return int(s)
except ValueError:
return s
def alphanum_key(s):
"""
Turn a string into a list of string and number chunks.
>>> alphanum_key("z23a")
["z", 23, "a"]
"""
return [ tryint(c) for c in re.split('([0-9]+)', s) ]
def human_sort(l):
"""
Sort a list in the way that humans expect.
"""
l.sort(key=alphanum_key)
A few helpful Python features make this more compact: re.split provides just the function we want for chunking the string, sort takes a key function for computing sort keys from data, the key to sort on can itself be a list, and comparing two lists compares lexicographically among the elements of the list.
Each of these features in and of itself may have only occasional use, but here they conspire to help me write code nearly as expressive as the English description in my first paragraph. And there are enough of those features that they often help make expressive code like that.
Comments
In Python I can express my self better, faster, and without all the bureaucracy involved with other languages. That also makes for faster coding.
- Paddy.
>>> from itertools import groupby
>>> s = 'zx1a23456qwert'
>>> [int("".join(g)) if k else "".join(g) for k,g in groupby(s, lambda x: x in '0123456789')]
['zx', 1, 'a', 23456, 'qwert']
>>>
- Paddy.
Better not to use an exception, since neither of the two branches of tryint are really exceptions, and exception handling is slow.
How about something simpleminded like
if s[0] in ['0','1','2','3','4','5','6','7','8','9']:
return int(s)
else:
return s
or
if ord(s[0]) > 48 and ord(s[0]) < 58:
....
def tryint(s):
... if s.isdigit():
....... return int(s)
... else:
....... return s
def alphanum_key(s):
... return [int(c) if c.isdigit() else c for c in re.split('([0-9]+)', s)]
How about
return [tryfloat(c) for c in re.split('(-?[0-9\.]+)',s)]
-----
import re
def sort_nicely( l ):
""" Sort the given list in the way that humans expect.
"""
convert = lambda text: int(text) if text.isdigit() else text
alphanum = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ]
l.sort( key=alphanum_key )
I don't think the regexp is a cleaner solution, have a look at my longer version that I posted here: http://weblog.masukomi.org/2007/12/10/alphabetical-asciibetical/comments/1199#comment-1199
The guts of which is:
def splitondigits(string):
return [int("".join(chars)) if digits else "".join(chars)
for digits,chars in groupby(string, str.isdigit)]
asciisorted = sorted(unsorted[:])
alphanumsorted = sorted(unsorted[:], key=splitondigits)
I had to read-up on groupby , whereas probably like you, I know the regexp syntax, but this seems a good fit for the groupby function.
Otherwise 'abc' will sort after 'DEF'
def alphanum_key(s):
____key = re.split(r"(\d+)", s)
____key[1::2] = map(int, key[1::2])
____return key
def sort_nicely(l):
____""" Sort the given list in the way that humans expect."""
____return sorted(l, key=alphanum_key)
def nsort(l): return sorted(l, key=lambda s: [map(int, e[1::2]) for e in re.split(r'(\d+)', s)])
Ned, the python version does have another advantage over the (original) Java version: It works. The Java version throws NumberFormatExceptions if the numbers get to big (e.g. x999999999 > x8888888888). I have sent a changed Java version that uses the fact that bigger numbers are usually longer (taking care of trailing zeroes) to Dave Koelle.
def nsort(l) return sorted(l, key=lambda a:zip(re.split("(\\d+)", a)[0::2], map(int, re.split("(\\d+)", a)[1::2])))
def nsort(l) return sorted(l, key=lambda a.lower()):zip(re.split("(\\d+)", a)[0::2], map(int, re.split("(\\d+)", a)[1::2])))
You've posted 13 months after the original article, making grandiose statements, all without code. I hope your not on the Ruby development team ;-)
- Paddy.
Add a comment: