Iterator merge in Python

Tuesday 3 October 2006This is 18 years old. Be careful.

Although I’ve been working in Python for a long time, every once in a while I come across a chunk of code that makes me stop and think, and admire its elegance and Pythonic nature. Raymond Hettinger’s recipe for merging multiple sorted iterators is one of those pieces of code. It’s only 29 lines including the supporting stuff, but it opened my eyes in these ways:

  • It works with iterators in unusual ways, including taking a list of objects, turning them into iterators by mapping iter over them, and then iterating over the list of iterators.
  • It manipulates the iterators directly, by calling the next() function explicitly. Usually I’d think about the implementation of iterators involving the next() method only when considering writing an unusual iterator, not when using an iterator.
  • It uses a standard module that had escaped my attention, heapq for using a list as a sorted heap.
  • It uses object methods as first-class objects that can be pushed into the heap along with the data to be sorted.

There’s nothing like a succinct thought-provoking snippet to start your day.

Comments

[gravatar]
Not impressed with this one. Far too much work for such a triffle thing. It does too much of the grunt work when Python already includes most of the bells and whistles. Simplicity trumps all else.

def merge(*iters):
    iters = list(iter(i) for i in iters)
    while iters:
        for i in iters:
            try:
                yield i.next()
            except StopIteration, e:
                iters.remove(i)

print sorted(merge([1,2,3,5], [4,5]))
[gravatar]
Calvin, yours is shorter, but misses an important point in Raymond's: yours pulls the entire contents of the iterators into memory, while his does not. Imagine trying to merge 5 sorted 2Gb files together. His will use very little memory (only 5 lines of text need to be stored at once), while yours will consume 10Gb.
[gravatar]
One more thing: Calvin's version of course has use where memory use is not a concern, and can be simplified further, using only standard modules:

from itertools import chain
print sorted(chain([1,2,3,4], [4,5]))
[gravatar]
I don't want to brag or anything, but I wrote something very close to Raymond a few day before he entered his version in the cookbook :

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/466302

It include all the nifty stuff about using heapq and calling next() on iterators. Well, yeah, I guess I'm bragging :) At least Raymond's version is generic and not mixed with file management.
[gravatar]
How embarrasing! I misread the recipe originally. I did not notice the nifty trick of pushing both the value and the next function, and somehow interpreted it as pushing all values into the heap immediately. This is why I mistakenly saw nothing in the recipe that even cared or took advantage of the iterables being pre-sorted, and thus shot it down.

I conceed my disagreement.
[gravatar]
Neat. It's worth noting the similarity of the (val, next-method) approach to the "streams" model in SICP which is fascinating (required?) reading.
[gravatar]
It's great recipe. But it'll need some modification to make it work in Python 3.7+? Would be nice to revisit and offer new insight.

Thanks,

Daniel

Add a comment:

Ignore this:
Leave this empty:
Name is required. Either email or web are required. Email won't be displayed and I won't spam you. Your web site won't be indexed by search engines.
Don't put anything here:
Leave this empty:
Comment text is Markdown.