Personally I love using regular expressions. It simply the best way to parse/split/replace/validate pieces of text. Many people look at regular expressions as black magic. They might be able to create a regex, but don’t understand the flow. When you are creating more advanced expressions, it is important to understand the inner workings for many reasons, including performance.

Let’s have a look at the following regular expression: ^(\d+)(\d+)(\d+)$. It matches one or more digits and does that three times. It is an illogical expression, but bear with me.

Let’s match the string “12345678901234567890″. When we look at the inner workings of the regex engine we can see what is happening:

    1 Beginning match attempt at character 0
    2 ok
    3 12345678901234567890
    4 12345678901234567890
    5 12345678901234567890backtrack
    6 12345678901234567890backtrack
    7 1234567890123456789
    8 1234567890123456789
    9 12345678901234567890
   10 12345678901234567890
   11 12345678901234567890backtrack
   12 12345678901234567890backtrack
   13 1234567890123456789backtrack
   14 123456789012345678
   15 123456789012345678
   16 12345678901234567890
   17 12345678901234567890
   18 12345678901234567890backtrack
   19 12345678901234567890backtrack
   20 1234567890123456789
   21 1234567890123456789
   22 12345678901234567890
   23 12345678901234567890
   24 12345678901234567890ok
   25 Match found
   26

First it matches the whole string to the (\d+) it the first group. The second group doesnt match, so the processor backtracks and tries again with matching the last character of the string to the second group. Now the third group doesn’t match, so it backtracks 2 steps. But now for the interresting part. It first tries to match the last two to characters to the second group. Of course, that will fail as well, so it will backtrack one step and use the last character for the third group and finally find the match.

Now let’s see what will happen if we try to match the string “12345678901234567890a”. This string doesn’t match, but how fast will the regexp processer see this? Let’s take another look:

    1 Beginning match attempt at character 0
    2 ok
    3 12345678901234567890
    4 12345678901234567890
    5 12345678901234567890backtrack
    6 12345678901234567890backtrack
    7 1234567890123456789
    8 1234567890123456789
    9 12345678901234567890
   10 12345678901234567890
   11 12345678901234567890backtrack
   12 12345678901234567890backtrack
   13 1234567890123456789backtrack
   14 123456789012345678
   15 123456789012345678
   16 12345678901234567890
   17 12345678901234567890
   18 12345678901234567890backtrack
   19 12345678901234567890backtrack
   20 1234567890123456789
   21 1234567890123456789
   22 12345678901234567890
   23 12345678901234567890
   24 12345678901234567890backtrack
   25 1234567890123456789backtrack
   26 123456789012345678backtrack
   27 12345678901234567
   28 12345678901234567
   29 12345678901234567890
   29 12345678901234567890
   30 12345678901234567890
   31 12345678901234567890backtrack
   32 12345678901234567890backtrack
   33 1234567890123456789
   34 1234567890123456789
   35 12345678901234567890
   36 12345678901234567890
   37 12345678901234567890backtrack
   38 1234567890123456789backtrack
   39 123456789012345678
   40 123456789012345678
   41 12345678901234567890
   42 12345678901234567890
   42 12345678901234567890backtrack
    ...
 4056 1234567
 4057 1234567
 4058 1234567backtrack
 4059 123456
 4060 123456
 4061 123456backtrack
 4062 12345
 4063 12345
 4064 12345backtrack
 4065 1234
 4066 1234
 4067 1234backtrack
 4068 123
 4069 123
 4070 123backtrack
 4071 12backtrack
 4072 1backtrack
 4073 backtrack
 4074 Match attempt failed
 4075

The processor needs to take 4074 steps to see that the string doesn’t match, where it only needed 25 steps to match the previous string. This indicates that something is wrong. We can see that the processor tries to match the whole string to the first group, backtracks and tries the last character on the second group, etc, etc. Just like we have seen in the first example. But since the string never matches, the processor continues on doing this until it has tried every single combination with spreading the characters over the three groups.

It is unlikely that you will use this expression, because it just is silly. However the same problem can happen in a more subtle way. Have a look at the following expressions:

  • inv\s*(?:nr:)?\s*(\d{5}-\d{3})\s*
  • (\d*[a-z]\d*)*Q

You can see that these kind of structures can really kill the performance of your application. Linking performance problems with regular expressions is something which is often overlooked. I recommend using a regular expression debugger, like RegexBuddy, which can show you exactly what is going on, instead of only showing if the string matches or not.

On the information site of RegexBuddy you can find more expamles of catastrophic backtracking.