Text search is something we do every day. Fast and reliable search is such a staple that it is easy to forget there can be elegance in the underlying mechanisms. It is time to pull back the curtain and dig into one of the foundational search algorithms: the Boyer-Moore fast string search. It is a good algorithm for demonstrating some of the methods used to achieve fast search results. As is typical, I'll reach for F# to show implementation details.
For reference, most of what I discuss is pulled from Boyer and Moore's 1977 paper A Fast String Searching Algorithm. If you want to read it straight from the source, it's an easy-to-digest paper. One of the things I love about this algorithm is the cleverness in understanding how the structure of the data can serve the goal. This results in an elegant solution. It also strikes the right balance: it contains some complexity, but it's a comprehensible algorithm without the burden of being too clever.
Before digging into the algorithm, it's worth setting a baseline. For that I'll use a naive search. It starts at the beginning of a string and iterates character by character, checking whether each character matches the first character of the pattern. When it hits a character match, it checks deeper into the pattern. If it mismatches within the pattern, it backtracks and continues from where it left off. Its intent is the simplest, most straightforward approach. Once I have this extremely basic approach, I can dig into how Boyer-Moore improves upon the concept. Below is an example of how the process works, and how long it takes to find the pattern "coffeecake" in the string.
Naive Method:

```
Step 1:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern: coffeecake
         ↑

Step 2:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:  coffeecake
          ↑

Step 3:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:   coffeecake
           ↑
...

Step 38:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:                                coffeecake
                                        ↑
...

Step 47:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:                                coffeecake
                                                 ↑
```
The example above takes 47 comparisons from start to finish. Some of those comparisons come from backtracking after the partial match at "coffee". Most of the cost, though, comes from having to move through the string one character at a time.
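To make the baseline concrete, here is a minimal sketch of that naive scan. The post's own code is F#; this illustration is in Python, and the function name is my own:

```python
def naive_search(text: str, pattern: str) -> int:
    """Scan left to right; on a partial match, backtrack and resume."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        j = 0
        # compare forward into the pattern from alignment i
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i  # full match starting at index i
    return -1  # no match anywhere
```

Every alignment restarts the comparison at the front of the pattern, which is exactly the character-by-character cost the example above illustrates.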
If you haven't guessed by now, there is a better way. There are a couple of preliminary aspects of the Boyer-Moore algorithm to establish. First, when doing pattern compares, it starts at the end of the pattern. For the pattern "coffeecake", it'll try to match the final 'e', then 'k', and so on. At a minimum, this gives it a bit of a jump start, allowing it to skip nearly the length of the pattern on non-matches. This isn't the big optimization, but this dynamic will come into play later. Second, the algorithm uses a preprocessing step prior to the actual search. This is the meat of the optimizations. It builds up information about the pattern that allows it to make more efficient choices during the search, leveraging knowledge of the pattern's structure to improve the scanning process.
The first optimization is a simple one involving a lookup table (described as delta1 in the paper). Prior to the search, it makes a pass through the pattern and builds a table that maps every possible character to the number of positions the search pointer can advance when that character mismatches the current character in the pattern. If the character isn't in the pattern at all, the search can skip over it entirely, which translates to incrementing the pointer by the length of the pattern. If the character is in the pattern, it finds the last instance of that character within the pattern and uses it to calculate the number of characters that can be skipped. The formula in that case is: patternLength - index - 1. The result is being able to advance multiple positions in the string when the current character doesn't match. For illustrative purposes, below is an example.
```
Pattern: coffeecake
Index  : 0123456789

Resulting lookup table (abbreviated):
a: 2   (10 - 7 - 1)
b: 10
c: 3   (10 - 6 - 1)
d: 10
e: 0   (10 - 9 - 1)
f: 6   (10 - 3 - 1)
g: 10
...
z: 10
```
```
Step 1:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern: coffeecake
Pointer:          ↑
```

Action: It compares ' ' to 'e'. Looking at the lookup table, blank doesn't exist in the pattern, so it can jump ahead 10 positions.
```
Step 2:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:           coffeecake
Pointer:                    ↑
```

Action: It compares 'f' to 'e'. They don't match, but it does know that 'f' exists in the pattern, so using the lookup table it can increment 6.
```
Step 3:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:                 coffeecake
Pointer:                          ↑
```

Action: Now the previously seen 'f' in the string is aligned with an 'f' in the pattern. The goal is to jump ahead as far as possible while keeping the string and pattern plausibly aligned. This is good, although as you can see, it will now compare 't' with 'e'. Since 't' isn't in the pattern, it will increment 10 positions.
```
Step 4:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:                           coffeecake
Pointer:                                    ↑
```

Action: Here it finds a match: the 'e' matches in the string and pattern. This means it can start comparing backward through the pattern.
```
Step 5:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:                           coffeecake
Pointer:                                   ↑
```

Action: Here 'f' doesn't match 'k', but it exists in the pattern, so as before, it can increment 6 positions.
```
Step 6:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:                                coffeecake
Pointer:                                         ↑
```

Action: Again, it aligned the 'f' from the string with the 'f' in the pattern. It gets a hit with 'e', so it will start walking backward through the pattern. We can also see at this point that we've found our match; the algorithm just doesn't know it yet.
```
Steps 7-15:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:                                coffeecake
Pointer:                                        ↑
Pointer:                                       ↑
...
Pointer:                                ↑
```

Action: As you can see, it is inside the match. Steps 7-15 walk backward through the pattern until the match is verified.
This shows that it found the match in a total of 15 compares. A naive character-by-character check takes considerably more than that. This is a simple example, but roughly a third the number of checks on a small example feels like a good optimization. One attribute of this method is that longer search patterns allow the algorithm to jump through the string faster on non-matching characters. How much is obviously pattern specific, but anything is better than one character at a time. What does this lookup table construction look like in code? It's always a joy when we can get such benefits from something so direct. As a performance sidenote, it is very tempting to build a Map for character lookup; it feels very natural. Unfortunately, the overhead of using Map in this particular case is high enough to be an issue, so using an array for character lookup is worth the effort. It is a reminder of the constant trade-offs that need to be made when looking for performance.
/// Build the delta 1 table
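The F# listing isn't reproduced here, so as an illustrative stand-in, a Python sketch of the delta1 construction might look like the following. It assumes the input is limited to 8-bit character codes, so a plain 256-entry array can serve as the lookup table:

```python
def build_delta1(pattern: str) -> list[int]:
    """delta1[c] = how far the pointer may advance on a mismatch at char c."""
    m = len(pattern)
    # characters absent from the pattern allow skipping the full length
    table = [m] * 256
    for i, ch in enumerate(pattern):
        # later occurrences overwrite earlier ones, so the rightmost
        # instance of each character wins: patternLength - index - 1
        table[ord(ch)] = m - i - 1
    return table
```

A single pass over the pattern is enough, since overwriting naturally keeps the last occurrence of each character.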
This is great, but Boyer and Moore didn't stop there; they realized there is even more structure they can extract and leverage from within the pattern. If a subpattern is repeated within the pattern, more can be done, and this is the second optimization. It focuses on subpatterns anchored at the end of the pattern. The primary reason for this is that searching through the pattern starts at the end and moves backward. The algorithm can use this information in cases where it has started matching against the pattern but cannot match any further. It gains a benefit from repetition of a terminal subpattern, potentially allowing a bigger jump than the first method provides.
To build the delta2 table, it looks at the iteratively growing terminal subpatterns. For the previous example pattern "coffeecake", that means "e", then "ke", then "ake", and so on. The final "e", although short, is still a subpattern. Understanding the circumstances when partway through a match is important. It is not looking for just another "e"; it specifically wants an "e" not preceded by a "k". This is an important distinction, because the point is to smartly shift to another part of the search string. The shift isn't needed once "ke" has matched; it is needed when an "e" has matched but the next letter "k" does not. Knowing this allows the algorithm to do a smarter alignment when it gets a partial match on "e" but the subpattern stops matching at 'k'. This is done for all terminal subpatterns. So next: is there another instance of "ke" not preceded by "a"? Then: is there another "ake" not preceded by "c"? This continues through the entire set of terminal subpatterns. That is only part of the story. The other case is when there is no "e" that isn't preceded by a "k". In cases like this, it knows how far into the pattern match it is, and can do a larger shift as a result when it finds a character that doesn't match. Below is the resulting table of pattern shifts based on subpattern, as well as an example use.
```
Resulting lookup table for pattern indexes:
0 (c): 19
1 (o): 18
2 (f): 17
3 (f): 16
4 (e): 15
5 (e): 14
6 (c): 13
7 (a): 12
8 (k): 5
9 (e): 1
```
```
Example 1:
String : Some coffee pairs well with coffeecake in the morning.
Pattern:  coffeecake
Pointer:          ↑
```

Action: When matching backward through the pattern, it has matched the 'e' but doesn't match the 'k'. Here is where it leverages the knowledge that there is an 'e' not preceded by a 'k' earlier in the pattern, so it can shift 5 positions.
```
String : Some coffee pairs well with coffeecake in the morning.
Pattern:      coffeecake
Pointer:           ↑
```

Action: As you can see, the shift aligned the 'e' in the string with the earlier 'e' in the pattern, attempting to optimize the search.
```
Example 2:
String : Small cake is not a match.
Pattern: coffeecake
Pointer:      ↑
```

Action: Here is a different case. It matches the "cake" part of the pattern, then mismatches on the ' ' versus 'e'. It also knows this terminal subpattern isn't seen again, so at this point it can jump past everything it has matched so far.
```
String : Small cake is not a match.
Pattern:           coffeecake
Pointer:                    ↑
```
Building this lookup table takes a bit more work than the delta1 table. Searching for the best subpattern match within the pattern takes a couple of steps. This is the most complex part of the preprocessing, since it needs to perform multiple scans through the pattern. This is a good place to mention that Boyer-Moore does involve some overhead. In many cases, building the preprocessing tables can dominate the overall search, so it is important to make this part as fast as possible. For performance I use arrays, indexes, and mutable variables instead of passing strings around. It is not my preference, but it is a necessary tradeoff; the performance cost is too much to bear in this part of the algorithm.
// Note: rpr is an index from the end of pattern
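As an illustrative stand-in for the F# code, here is a straightforward (if quadratic) Python sketch of the delta2 construction. It searches directly for the rightmost plausible reoccurrence of each terminal subpattern, as described above:

```python
def build_delta2(pattern: str) -> list[int]:
    """delta2[j] = pointer advance when the first mismatch is at pattern[j]."""
    m = len(pattern)
    table = [0] * m
    for j in range(m):
        suffix = pattern[j + 1:]  # terminal subpattern already matched
        shift = m                 # default: no reoccurrence found
        for s in range(1, m + 1):
            # does the matched suffix reappear s positions to the left?
            # positions falling off the front of the pattern match anything
            ok = all(j + 1 + k - s < 0 or pattern[j + 1 + k - s] == suffix[k]
                     for k in range(len(suffix)))
            # the preceding character must differ, otherwise the same
            # mismatch would simply repeat after the shift
            if ok and j - s >= 0 and pattern[j - s] == pattern[j]:
                ok = False
            if ok:
                shift = s
                break
        # total advance: realigning shift plus the distance already matched
        table[j] = shift + (m - 1 - j)
    return table
```

Running this on "coffeecake" reproduces the table above, including the 5 at index 8 and the 1 at index 9.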
Now that I have the two parts, they just need to be put together. At each step, the current character in the string is compared against the current character in the pattern. If it matches, it keeps trying to match the pattern backward. If it doesn't match, the search index is incremented by the maximum of the delta1 and delta2 values. This allows the algorithm to jump as far as possible using the available structural information about the pattern.
/// Find pattern in string using a Boyer-Moore fast string search method
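Putting the pieces together, here is a self-contained Python sketch of the whole search (again an illustration of the technique, not the post's F# code). It inlines the two table builds and, on each mismatch, advances the pointer by the larger of the two deltas:

```python
def boyer_moore_search(text: str, pattern: str) -> int:
    """Return the index of the first match of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # delta1: skip distance keyed by the mismatched text character
    # (assumes 8-bit character codes)
    delta1 = [m] * 256
    for i, ch in enumerate(pattern):
        delta1[ord(ch)] = m - i - 1
    # delta2: skip distance keyed by the pattern position of the mismatch
    delta2 = [0] * m
    for j in range(m):
        suffix = pattern[j + 1:]
        shift = m
        for s in range(1, m + 1):
            ok = all(j + 1 + k - s < 0 or pattern[j + 1 + k - s] == suffix[k]
                     for k in range(len(suffix)))
            if ok and j - s >= 0 and pattern[j - s] == pattern[j]:
                ok = False
            if ok:
                shift = s
                break
        delta2[j] = shift + (m - 1 - j)
    # the scan: align the end of the pattern, compare backward, and on a
    # mismatch advance the pointer by the larger of the two deltas
    i = m - 1
    while i < n:
        j = m - 1
        while j >= 0 and text[i] == pattern[j]:
            i -= 1
            j -= 1
        if j < 0:
            return i + 1  # walked off the front: match starts here
        i += max(delta1[ord(text[i])], delta2[j])
    return -1
```

Note that delta2 is always at least one more than the distance already matched, so the max guarantees forward progress even when delta1 suggests a small (or backward) move.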
In a nutshell, these are the not-so-secret weapons of the algorithm. Having walked through the methods and examples, it becomes obvious that it can leverage the pattern's internal structure to make smart jumps through the string. These jumps can be far larger than a naive search's single-character steps, which translates into a pretty hefty performance boost. Intuitively this makes sense, but a benchmark helps provide some more concrete data.
| Method | StringSize | PatternSize | Mean | Error | StdDev | Median | Gen 0 | Gen 1 | Gen 2 | Allocated |
As always, you have to take benchmarks with a grain of salt, but the results do match our intuition of a performance boost. As I mentioned earlier, the benefits are highly dependent not only on the length of the string and pattern, but also on the structure of the data itself. It at least shows on the surface what is possible. Hopefully you have enjoyed this dive into the Boyer-Moore string search. If you found this interesting, search can be a pretty deep topic and I encourage you to go deeper. Even if you don't, I hope you at least take some inspiration regarding interesting ways to solve difficult problems. Until next time.