Removing overlapping regex matches

Saturday 8 September 2012 — This is nearly 13 years old. Be careful.

In a Stack Overflow question a few months ago, a petitioner wanted to remove all the matches of a number of regexes. The complication was that the regexes could overlap.

Simply using re.sub() on each pattern in turn wouldn’t work, because the overlapping matches wouldn’t be fully matched once the first patterns were removed from the string.

The solution is to match the regexes, and note the locations of the matches, and then in a second pass, delete all those parts of the string. Here’s an updated version of my answer:

def remove_regexes(text, patterns):
    """Remove all the matches of any pattern in patterns from text."""
    bytes = bytearray(text)
    for pat in patterns:
        for m in re.finditer(pat, text):
            start, end = m.span()
            bytes[start:end] = [0] * (end-start)
    new_string = ''.join(chr(c) for c in bytes if c)
    return new_string

There are a few rarely-used features of Python at work here. First, I use a bytearray, which is kind of like a mutable string. Like strings, it is a sequence of bytes. Unlike strings, you can change the bytes in place. This is handy for us to mark which portions of the string are being removed.

I initialize the bytearray to have the same contents as the text string, then for each pattern, I find all the matches for the pattern, and remove them from the bytearray by replacing the matched bytes with a zero bytes.

The re.finditer method gives us an iterator over all the matches, and produces a match object for each one. Match objects are usually just tested and then examined for the matched string, but they have other methods on them too. Here I use m.span(), which returns a two-tuple containing the starting and ending indexes of the match, suitable for use as a slice. I unpack them into start and end, and then use those indexes to write zero bytes into my bytearray using slice assignment.

Because I match against the original unchanged string, the overlapping regexes are not a problem. When all of the patterns have been matched, what’s left in my bytearray are zero bytes where the patterns matched, and real byte values where they didn’t. A list comprehension joins all the good bytes back together, and produces a string.

Nothing earth-shattering here, just a nice showcase of some little-used Python features.

Comments

Nick Coghlan 9:15 PM on 8 Sep 2012

Please, please, please don't encourage anyone to think of arbitrary strings as a sequence of bytes. I know *you* know the difference, but there are still an awful lot of people that don't get the difference between "sequence of bytes" and "sequence of code points".

This trick also needs a strong caveat that it only works for ASCII or latin-1 encoded text.

Nick Coghlan 9:28 PM on 8 Sep 2012

Thinking about this a little further, wouldn't it work to build up a series of index pairs for the parts of the string that *don't* match?

Along the lines of:

   start = 0
   sections = []
   for m in re.finditer(pat, text):
       end, next_start = m.span()
       sections.append((start, end))
       start = next_start
    sections.append((start, len(text)))
    # When end

Nick Coghlan 9:33 PM on 8 Sep 2012

Oh, it got confused by the "<=" in the comment :(

   start = 0
   sections = []
   for m in re.finditer(pat, text):
       end, next_start = m.span()
       sections.append((start, end))
       start = next_start
    sections.append((start, len(text)))
    # When end is less than start, the slice operation
    # produces the empty string, which does what we want
    return ''.join(text[start:end] for start, end in sections)

Ned Batchelder 9:35 PM on 8 Sep 2012

@Nick, you are right, the problem with the code as it stands now is that it only works for bytestrings. With a minor tweak to use a list instead of a bytearray, it will work for Unicode strings as well. The list is a list of Unicode characters, and they are replaced with None if they match, and then the list is joined together to create the final Unicode string.

Ned Batchelder 9:42 PM on 8 Sep 2012

@Nick: your update to keep the good spans is also good, thanks!

Lawrence 12:56 PM on 9 Sep 2012

Sorry to be pedantic but your solution will not work if there are bytes with value zero in the original string. If the original string is only letters, numbers and ordinary special characters which can be easily typed on a keyboard the solution will work. But if the string has bytes that are zero those bytes will not be copied to returning string.

A slight change to your solution would be to use a set to store the index of every character matched instead of storing that data in a byte array. Being a set it automatically ignores duplicates added to the set. Getting on a pedantic soap box, this way does not use the content of the byte as an indicator and separates the data from the meta-data.

Mike 3:28 AM on 11 Sep 2012

Here's a solution in the form of a generator that yields the 'clean' parts of the text as they become known. spans is a list of generators. Each generator produces the spans of the matches for one regex in pats. The regex '\Z' is appended to the patterns to ensure there is a match at the end of the text string. heapq.merge() provides the outputs of the generators in sorted order.

Whenever there is a gap between the maximum end of any previous match and the start of the next match, a chunk of clean text is yielded. If you're using Python 3, you could use 'yield from' instead of 'yield' to have it generate the clean text one character at a time.

import heapq 
import re

def clean(text, pats):
    spans = [(m.span() for m in re.finditer(p, text)) for p in pats + [r'\Z']]
    prev_end = 0
    for next_start, next_end in heapq.merge(*spans):
        if next_start > prev_end:
            yield text[prev_end:next_start]
        prev_end = max(prev_end, next_end)

This is the simplest solution I came up with. Finally found a use for itertools.compress.

import itertools as it
import re

def clean(text, patterns):
    mask = [1] * len(text)
    for pattern in patterns:
        for mo in re.finditer(pattern, text):
            mask[mo.start():mo.end()] = [0]*(mo.end()-mo.start())
    return ''.join(it.compress(text, mask))

Removing overlapping regex matches

Comments

Add a comment: