2021-08-07

Examining Boyer-Moore String Search with F#

Read Time: 19 minutes

Text search is something we do everyday. Fast and reliable search is such a staple, it is easy to forget there can be elegance to those underlying mechanisms. It is time to pull back the curtain and dig into one of the foundational search algorithms, the Boyer-Moore fast string search. It is a good algorithm to demonstrate some of the methods used to achieve fast search results. As typical, I’ll reach for F# to show implementation details.

For reference, most of what I discuss is pulled from Boyer and Moore’s 1977 paper A Fast String Searching Algorithm. If you want to read it straight from the source, its a easy to digest paper. One of the things I love about this algorithm is the cleverness in understanding how the structure of the data can enhance the goal. This results in an elegant solution. It also strikes the right balance. It contains some complexity, but its a comprehendable algorithm without the burden of being too clever.

Before digging into the algorithm, its worth setting a baseline. For that I’ll use a naive search. It starts at the beginning of a string, iterates character by character, checking to see if it matches the first character of the pattern. If/When it hits a character match, it checks deeper into the pattern. If it mismatches within the pattern, it backtracks and continues from where it left off. It’s intent is the simplest, most straight-forward approach. Once I have an extremely basic approach, I can dig into how Boyer-Moore improves upon this concept. Below is an example of how the process works, and how long it takes to find the pattern “coffeecake” in the string.

Naive Method:
Step 1:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern: coffeecake
         ↑

Step 2:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:  coffeecake
          ↑

Step 3:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:   coffeecake
           ↑
...
Step 38:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:                                coffeecake
                                        ↑
...
Step 47:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:                                coffeecake
                                                 ↑

The example above takes 47 comparisons from start to finish. Some of this is taken by backtracking when it finds a partial match at “coffee”. Most of the time is taken by the fact that it has to go character by character.

If you haven’t guessed by now, there is a better way. There are a couple preliminary aspects of the Boyer-Moore algorithm to establish. First, when doing pattern compares, it starts at the end of the pattern. For the pattern “coffeecake”, it’ll try to match the final ‘e’, then ‘k’, etc. At a minimum, this gives it a bit of a jump start by skipping nearly the length of the search string for non-matches. This isn’t the big optimization, but this dynamic will come into play later. Second, the algorithm uses a preprocessing step prior to the actual search. This is the meat of the optimizations. It builds up information regarding the search string that allows it to make more efficient choices during the search. This process leverages knowledge of the pattern’s structure to improve the scanning process.

The first optimization is a simple one involving a lookup table (described as delta 1 in the paper). Prior to the search, it makes a pass through the pattern. During which a table is built for each possible character, and how many positions it can increment the search pointer if the character it hits doesn’t match against the current character in the pattern. If the character isn’t in the pattern, then it knows it can skip over it entirely, which translates to incrementing the pointer the length of the pattern. If the character is in the pattern, it determines the last instance of that character within the pattern and uses that to calculate the number of characters that can be skipped. The formula in that case is: patternLength - index - 1. The result is being able to increment by multiple positions in the string when the current character doesn’t match. For illustrative purposes, below is an example.

Pattern: coffeecake
Index  : 0123456789

Resulting lookup table (abbreviated):
a: 2 (10 - 7 - 1)
b: 10
c: 3 (10 - 6 - 1)
d: 10
e: 0 (10 - 9 - 1)
f: 6  (10 - 3 - 1)
g: 10
...
z: 10

Step 1:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern: coffeecake
Pointer:          ↑

Action: It compares ' ' to 'e'. Looking at the lookup table, blank doesn't exist in the pattern, so it can jump ahead 10 positions.

Step 2:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:           coffeecake
Pointer:                    ↑

Action: It compares 'f' to 'e'. They don't match, but it does know that f exists in the pattern, using the lookup table it can increment 6.

Step 3:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:                 coffeecake
Pointer:                          ↑

Action: Now it is aligned with the previously seen 'f' with an 'f' in the pattern. The goal here is to jump ahead as far as it can, while attempting alignment with the string and pattern. This is good, although as you can see, it will now compare 't' with 'e'. Since 't' isn't in the pattern, it will increment 10 positions.

Step 4:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:                           coffeecake
Pointer:                                    ↑

Action: Here it find a match, the 'e' matches in the string and pattern. This means it can start comparing backward through the pattern.

Step 5:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:                           coffeecake
Pointer:                                   ↑

Action: Here 'f' doesn't match 'k', but exists in the pattern, as before, it can increment 6 positions.

Step 6:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:                                coffeecake
Pointer:                                         ↑

Action: Again, it aligned the 'f' from string with the 'f' in the pattern. It get a hit with 'e', so it will start walking backwards through the pattern. We can also see at this point we've found our match, the algorithm just doesn't know it yet.

Steps 7-15:
String : I like to drink coffee with my coffeecake at breakfast.
Pattern:                                coffeecake
Pointer:                                        ↑
Pointer:                                       ↑
...
Pointer:                                ↑

Action: As you can see, it is in the match. Steps 7 - 15 will be walking backwards through the pattern until it verifies the match.

What this shows in a total of 15 compares it found the match. A naive character-by-character check takes considerably more than that. This is a simple example, but a third the number of checks on a small example feels like a good optimization. One attribute of this method is longer search patterns allow the algorithm to jump through the string faster on non-matching characters. How much is obviously pattern specific, but anything is better than one character at a time. What does this lookup table construction look like in code? It’s always a joy when we can get such benefits from something so direct. As a performance sidenote, it is very tempting to build a Map for character lookup, it feels very natural. Unfortunantly the overhead of using Map in this particular case is high enough to be an issue, so using an array for character lookup is fast enough to be worth the effort. It is a reminder of the constant trade-offs that need to be made when looking for performance.

/// Build the delta 1 table
/// If the char is in the string, it is the rightmost value,
/// otherwise it is the length of the pattern.
let buildDelta1 (pattern: string) =
  let patternLength = pattern.Length 

  [| 0..127 |]
  |> Array.map (fun x ->
      let c = char x
      let index = pattern.LastIndexOf(c)
      if index = -1 then patternLength else patternLength - index - 1)

This is great, but Boyer and Moore didn’t stop there, they realized there is even more structure they can extract and leverage from within the pattern. If a subpattern is repeated within the pattern, there is more that can be done, this is the second optimization. It is focused on the subpattern anchored at the end of the pattern. The primary reason for this is searching through the pattern starts at the end and moves backward. it can use this information in cases where it have started matching against the pattern, but cannot match any further. It gains the benefit from a repetition of a terminal subpattern, potentially allowing an optimization over the first methodology.

To build the delta 2 table it looks at the iteratively growing terminal subpatterns. Looking at the previous example pattern “coffeecake”, that means “e” then “ke”, then “ake”, etc. To look at the final “e”, although short is a pattern. Understanding the circumstances to get partway through a match is important. It is not looking for just another “e”, it specifically wants an “e” not preceded by a “k”. This is an important distinction. This is because it is trying to smartly shift to another part of the search string. But it doesn’t need to do that when it has matched the “ke”, it needs it when there is an “e”, but can’t match the next letter “k”. Knowing this allows the algorithm to do a smarter alignment if it gets a partial match “e”, but the subpattern stops matching at ‘k’. This is done for all terminal subpatterns. So next, is there another instance of “ke” not preceded by “a”. Then is there another “ake” not preceded by “c”. This goes through the entire set of terminal subpatterns. This is only part of the case. The other case is what if there is no “e” that isn’t preceded by a “k”. In cases like this, it knows how far into the pattern match it is, and can do a larger shift as a result when it finds a character that doesn’t match. Below is the resulting table of pattern shifts based on subpattern, as well as an example use.

Resulting lookup table for pattern indexes:
0 (c): 19 
1 (o): 18
2 (f): 17 
3 (f): 16 
4 (e): 15
5 (e): 14
6 (c): 13
7 (a): 12
8 (k):  5
9 (e):  1

Example 1: 
String : Some coffee pairs well with coffeecake in the morning.
Pattern: coffeecake
Pointer:         ↑

Action: When it is matching backwards through the pattern, it has matched the 'e', but don't match the 'k'. Here is where it leverages the knowledge that there is an 'e' not preceded by a 'k' pattern earlier. So it can shift 5 positions.

String : Some coffee pairs well with coffeecake in the morning.
Pattern:      coffeecake
Pointer:               ↑

Action: As you can see, the shift aligned the 'e' between the main string and search string, attempting to optimize the search.

Example 2: 
String : Small cake is not a match.
Pattern: coffeecake
Pointer       ↑

Action: Here is a different case. It matchs the ' cake' part of the pattern. Then it mismatch on the ' ' versus 'e'. It also knows the terminal subpattern isn't seen again. At this point it can jump past what it has already matched so far.

String : Small cake is not a match.
Pattern:           coffeecake
Pointer:                    ↑

Building this lookup table takes a bit more work than the delta1 table. Searching for the best subpattern match within a string takes a couple steps. This is the most complex part of the preprocessing since it needs to perform multiple scans through the pattern. This is a good place to mention that Boyer-Moore does involve some overhead. In many cases, building the preprocessing tables can dominate the overall search. It is important to make this part as fast as possible. For performance I use arrays, indexes, and mutables instead of passing strings around. It is not my preference, but it is a necessary tradeoff. The performance cost is too much to bear in this part of the algorithm.

// Note: rpr is an index from the end of pattern
let findRpr (pattern:string) (candidateStartIndex: int) (candidateStopIndex: int) :int =
  let patternLength = pattern.Length
  let candidateLength = candidateStopIndex - candidateStartIndex + 1

  let rec findRpr' i =
    if i >= patternLength || i >= candidateLength then i
    elif pattern.[patternLength - 1 - i] <> pattern.[candidateStartIndex + (candidateLength - 1 - i)] then i
    else findRpr' (i + 1)

  findRpr' 0

/// Check if the pattern starts with the suffix starting at the defined index
/// For my purposes, this is faster than String.StartsWith
let startsWithSuffix (pattern: string) (subPatternStart: int) :bool =
  let rec startsWithSuffix' i j =
    if j = pattern.Length then true
    elif i = pattern.Length then false
    elif pattern.[i] <> pattern.[j] then false
    else startsWithSuffix' (i + 1) (j + 1)
  startsWithSuffix' 0 subPatternStart

/// Build delta2 table
let buildDelta2 (pattern: string) =
  let last = pattern.Length - 1
  let rprs = Array.create pattern.Length 0

  // Find rprs for pattern
  for i in [| 0 .. (last - 1) |] do
    let rpr = findRpr pattern 1 i
    if pattern.[i-rpr] <> pattern.[last-rpr] then
      rprs.[last - rpr] <- rpr + last - i

  // If rpr isn't found for character, default to a shift based on index position (to handle rpr finding off left side
  let mutable lastPrefix = last
  for i in [| last .. -1 .. 0 |] do
    // Only do this where an rpr value wasn't found 
    if startsWithSuffix pattern (i + 1) then lastPrefix <- i + 1
    if rprs.[i] = 0 then rprs.[i] <- lastPrefix + last - i

  rprs

Now that I have the two parts, they just need put together. At each step, the current character in the string attempts to match the current character in the pattern. If it matches, it keeps trying to match the pattern. If it doesn’t match, the search index is incremented the maximum of the delta1 and delta2 tables. This allows the algorithm to jump as far as possible, using the available pattern structural information.

/// Find pattern in string using a Boyer-Moore fast string search method
let fastSearch (pattern: string) (s: string) :int option =
  let delta1 = buildDelta1 pattern
  let delta2 = buildDelta2 pattern

  let rec fastSearch' stringPosition patternPosition =
    if patternPosition < 0 then
      None
    elif stringPosition >= s.Length then
      // At end of string, pattern not found
      None
    elif patternPosition = 0 && s.[stringPosition] = pattern.[patternPosition] then
      // At the end of pattern, match found
      Some stringPosition
    else
      // If chars match, keep comparing (decrement both stringPosition & patternPosition)
      // If chars don't match, jump ahead (stringPosition jumps by max of delta1 and delta2 tables, patternPosition is reset to end of pattern)
      let (stringPosition', patternPosition') = 
        if s.[stringPosition] = pattern.[patternPosition] then
          (stringPosition - 1, patternPosition - 1)
        else
          (stringPosition + max (delta1.[int s.[stringPosition]]) (delta2.[patternPosition]), pattern.Length - 1)

      fastSearch' stringPosition' patternPosition'

  if pattern.Length > s.Length then
    // If pattern is longer than the string, it can't be found
    None
  else
    // Start searching
    fastSearch' (pattern.Length - 1) (pattern.Length - 1)

In a nutshell, these are the not-so-secret weapons of the algorithm. As I’ve walked through some methods and examples, it becomes obvious it can leverage internal structure to implement smart jumps through the string. These can provide orders of magnitude larger jumps than a naive search, which translates into a pretty hefty performance boost. Intuitively this makes sense, but a benchmark helps to provide some more concrete data.

|      Method | StringSize | PatternSize |         Mean |      Error |     StdDev |       Median |     Gen 0 |  Gen 1 | Gen 2 | Allocated |
|------------ |----------- |------------ |-------------:|-----------:|-----------:|-------------:|----------:|-------:|------:|----------:|
|  fastSearch |      10000 |          10 |    12.295 us |  0.0432 us |  0.0383 us |    12.294 us |    5.3711 |      - |     - |     33 KB |
| naiveSearch |      10000 |          10 |    69.570 us |  0.1936 us |  0.1810 us |    69.589 us |   53.1006 |      - |     - |    326 KB |
|  fastSearch |      10000 |          50 |     8.068 us |  0.1575 us |  0.1992 us |     8.178 us |    2.4719 |      - |     - |     15 KB |
| naiveSearch |      10000 |          50 |    71.696 us |  0.2221 us |  0.2077 us |    71.774 us |   53.4668 |      - |     - |    328 KB |
|  fastSearch |      10000 |         100 |     9.702 us |  0.1869 us |  0.2494 us |     9.852 us |    2.7924 |      - |     - |     17 KB |
| naiveSearch |      10000 |         100 |    73.313 us |  0.1421 us |  0.1329 us |    73.284 us |   54.0771 |      - |     - |    331 KB |
|  fastSearch |      10000 |        1000 |    41.977 us |  0.2469 us |  0.2309 us |    42.120 us |    9.8877 | 0.2441 |     - |     61 KB |
| naiveSearch |      10000 |        1000 |    76.811 us |  0.1456 us |  0.1362 us |    76.796 us |   58.9600 |      - |     - |    362 KB |
|  fastSearch |    1000000 |          10 |   989.763 us |  1.5934 us |  1.4905 us |   989.457 us |  476.5625 |      - |     - |  2,921 KB |
| naiveSearch |    1000000 |          10 | 7,069.001 us | 29.2355 us | 27.3469 us | 7,068.265 us | 5312.5000 |      - |     - | 32,548 KB |
|  fastSearch |    1000000 |          50 |   384.506 us |  0.5249 us |  0.4910 us |   384.562 us |  176.2695 | 0.4883 |     - |  1,081 KB |
| naiveSearch |    1000000 |          50 | 6,928.350 us | 22.2543 us | 20.8167 us | 6,920.000 us | 5312.5000 |      - |     - | 32,549 KB |
|  fastSearch |    1000000 |         100 |   266.840 us |  0.7624 us |  0.6759 us |   266.723 us |  120.6055 |      - |     - |    740 KB |
| naiveSearch |    1000000 |         100 | 7,011.172 us | 77.9405 us | 72.9055 us | 6,968.868 us | 5312.5000 |      - |     - | 32,555 KB |
|  fastSearch |    1000000 |        1000 |   433.229 us |  0.8704 us |  0.7716 us |   433.164 us |  191.4063 | 2.9297 |     - |  1,173 KB |
| naiveSearch |    1000000 |        1000 | 6,994.461 us |  9.9059 us | 15.1274 us | 6,994.001 us | 5312.5000 |      - |     - | 32,580 KB |

As always, you have to take benchmarks with a grain of salt, but the results do match our intuition of a performance boost. As I mentioned earlier, the benefits can be highly dependent upon not only the length of the string and pattern, but of the strucutre of the data itself. It at least shows on the surface what is possible. Hopefully you have enjoyed this dive into the Boyer-Moore string search. If you found this interesting, search can be a pretty deep topic and I encourage you to go deeper. Even if you don’t, I hope you at least take some inspiration regarding interesting ways to solve difficult problems. Until next time.