Beyond black magic: Regex gotcha
Personally I love using regular expressions. It simply the best way to parse/split/replace/validate pieces of text. Many people look at regular expressions as black magic. They might be able to create a regex, but don’t understand the flow. When you are creating more advanced expressions, it is important to understand the inner workings for many reasons, including performance.
Let’s have a look at the following regular expression: ^(\d+)(\d+)(\d+)$. It matches one or more digits and does that three times. It is an illogical expression, but bear with me.
Let’s match the string “12345678901234567890″. When we look at the inner workings of the regex engine we can see what is happening:
1 Beginning match attempt at character 0
2 ok
3 12345678901234567890
4 12345678901234567890
5 12345678901234567890backtrack
6 12345678901234567890backtrack
7 1234567890123456789
8 1234567890123456789
9 12345678901234567890
10 12345678901234567890
11 12345678901234567890backtrack
12 12345678901234567890backtrack
13 1234567890123456789backtrack
14 123456789012345678
15 123456789012345678
16 12345678901234567890
17 12345678901234567890
18 12345678901234567890backtrack
19 12345678901234567890backtrack
20 1234567890123456789
21 1234567890123456789
22 12345678901234567890
23 12345678901234567890
24 12345678901234567890ok
25 Match found
26
First it matches the whole string to the (\d+) it the first group. The second group doesnt match, so the processor backtracks and tries again with matching the last character of the string to the second group. Now the third group doesn’t match, so it backtracks 2 steps. But now for the interresting part. It first tries to match the last two to characters to the second group. Of course, that will fail as well, so it will backtrack one step and use the last character for the third group and finally find the match.
Now let’s see what will happen if we try to match the string “12345678901234567890a”. This string doesn’t match, but how fast will the regexp processer see this? Let’s take another look:
1 Beginning match attempt at character 0
2 ok
3 12345678901234567890
4 12345678901234567890
5 12345678901234567890backtrack
6 12345678901234567890backtrack
7 1234567890123456789
8 1234567890123456789
9 12345678901234567890
10 12345678901234567890
11 12345678901234567890backtrack
12 12345678901234567890backtrack
13 1234567890123456789backtrack
14 123456789012345678
15 123456789012345678
16 12345678901234567890
17 12345678901234567890
18 12345678901234567890backtrack
19 12345678901234567890backtrack
20 1234567890123456789
21 1234567890123456789
22 12345678901234567890
23 12345678901234567890
24 12345678901234567890backtrack
25 1234567890123456789backtrack
26 123456789012345678backtrack
27 12345678901234567
28 12345678901234567
29 12345678901234567890
29 12345678901234567890
30 12345678901234567890
31 12345678901234567890backtrack
32 12345678901234567890backtrack
33 1234567890123456789
34 1234567890123456789
35 12345678901234567890
36 12345678901234567890
37 12345678901234567890backtrack
38 1234567890123456789backtrack
39 123456789012345678
40 123456789012345678
41 12345678901234567890
42 12345678901234567890
42 12345678901234567890backtrack
...
4056 1234567
4057 1234567
4058 1234567backtrack
4059 123456
4060 123456
4061 123456backtrack
4062 12345
4063 12345
4064 12345backtrack
4065 1234
4066 1234
4067 1234backtrack
4068 123
4069 123
4070 123backtrack
4071 12backtrack
4072 1backtrack
4073 backtrack
4074 Match attempt failed
4075
The processor needs to take 4074 steps to see that the string doesn’t match, where it only needed 25 steps to match the previous string. This indicates that something is wrong. We can see that the processor tries to match the whole string to the first group, backtracks and tries the last character on the second group, etc, etc. Just like we have seen in the first example. But since the string never matches, the processor continues on doing this until it has tried every single combination with spreading the characters over the three groups.
It is unlikely that you will use this expression, because it just is silly. However the same problem can happen in a more subtle way. Have a look at the following expressions:
- inv\s*(?:nr:)?\s*(\d{5}-\d{3})\s*
- (\d*[a-z]\d*)*Q
You can see that these kind of structures can really kill the performance of your application. Linking performance problems with regular expressions is something which is often overlooked. I recommend using a regular expression debugger, like RegexBuddy, which can show you exactly what is going on, instead of only showing if the string matches or not.
On the information site of RegexBuddy you can find more expamles of catastrophic backtracking.
25 Nov 2007 Arnold Daniels



