String Pattern Matching with KMP
Table of Contents + β
Say you are looking for a small word inside a big piece of text. You want to know where the word βabcabβ shows up inside βabcabcabxabcabβ. The simple way is to line the word up at the start. You check it letter by letter. If something does not match, you move one step to the right and start checking from the beginning again. That restart is the real problem. You throw away everything you already learned. KMP fixes exactly that. It never re-checks letters it has already matched. Let us go slow and build it up together.
π― The Problem with Naive Search
First let us name the pain clearly. The naive way searches by brute force.
- You place the pattern at position 0 of the text.
- You compare letter by letter.
- On a mismatch, you slide the pattern just one step to the right.
- Then you start comparing from the very first letter of the pattern again.
That last step is the waste. Imagine the text is βAAAAABβ and the pattern is βAAABβ. You match βAAAβ, and then the fourth letter fails. The naive method shifts by one and compares letters it already knew were a match. For a long text with lots of repeats, this gets slow. In the worst case it does about n times m comparisons. Here n is the text length and m is the pattern length.
KMP is short for Knuth-Morris-Pratt, the three people who came up with it. Its whole trick is simple. When a mismatch happens, KMP already knows that some letters at the start of the pattern matched. So instead of starting again from zero, it jumps the pattern forward to the next sensible spot. It never re-checks those letters.
π§© The Key Idea: Prefix that is also a Suffix
To slide smartly, KMP needs one piece of information about the pattern. For every position in the pattern, it asks one simple question.
Take the part of the pattern matched so far. What is the longest chunk at its start that is exactly the same as the chunk at its end?
That question has a name. A prefix is any chunk taken from the start of a string. A suffix is any chunk taken from the end. We want the longest prefix that is also a suffix. But it cannot be the whole string itself.
Let us look at the word βababaβ.
| String | Prefixes (not whole) | Suffixes (not whole) | Longest match |
|---|---|---|---|
| βaβ | none | none | 0 |
| βab" | "a" | "bβ | 0 |
| βaba" | "aβ, βab" | "aβ, βbaβ | 1 (βaβ) |
| βabab" | "aβ, βabβ, βaba" | "bβ, βabβ, βbabβ | 2 (βabβ) |
| βababa" | "aβ, βabβ, βabaβ, βabab" | "aβ, βbaβ, βabaβ, βbabaβ | 3 (βabaβ) |
So why does this matter? When the pattern fails at some point, the part before the failure already matched the text. Say that matched part has a prefix that is also a suffix. Then that prefix is already sitting in the text in the right place. We can reuse it. We do not check those letters again. We just continue from there.
π The LPS Table
We store that answer for every position in an array. We call it the LPS array. LPS stands for Longest Prefix that is also Suffix.
So lps[i] is the length of the longest proper prefix of the pattern, up to and including index i, that is also a suffix of that same chunk. βProperβ just means we do not count the whole chunk itself.
For the pattern βababaβ, the LPS array looks like this.
Note
Pattern: a b a b a
Index: 0 1 2 3 4
LPS: 0 0 1 2 3
Read it slowly. At index 2 the chunk is βabaβ, and its longest prefix-suffix is βaβ, so the value is 1. At index 4 the chunk is βababaβ, and the answer is βabaβ, which has length 3. This single array is the brain of KMP.
βοΈ How KMP Searches, Step by Step
Now for the search. We walk through the text with one pointer i. We walk through the pattern with another pointer j. Here is the flow in points.
- If
text[i]equalspattern[j], the two letters match. Move bothiandjforward by one. - If
jreaches the pattern length, we found a full match. Record the position. Then setj = lps[j-1]to keep searching for more matches. - If the letters do not match and
jis greater than 0, do not movei. Instead setj = lps[j-1]. This jumps the pattern to reuse the prefix we already matched. - If the letters do not match and
jis 0, there is nothing to reuse. Moveiforward by one.
Notice the important thing here. On a mismatch, i almost never goes backward. The text pointer only moves forward. That is where the speed comes from.
Let us trace a tiny example. The text is βababcababβ and the pattern is βababβ with LPS 0 0 1 2.
text: a b a b c a b a bindex: 0 1 2 3 4 5 6 7 8
i=0 j=0 a==a -> i=1 j=1i=1 j=1 b==b -> i=2 j=2i=2 j=2 a==a -> i=3 j=3i=3 j=3 b==b -> j hits 4 = pattern length -> MATCH at index 0 set j = lps[3] = 2i=4 j=2 text 'c' vs pattern 'a' mismatch, j>0 -> j = lps[1] = 0i=4 j=0 text 'c' vs pattern 'a' mismatch, j==0 -> i=5i=5 j=0 a==a -> i=6 j=1... continues forward, never re-reading old textSee how i kept moving right and never went back? That is the whole win.
π» KMP in Code
Here is the full thing in five languages. We build the LPS array first, and then we run the search. The search prints every starting index where the pattern is found.
# Build the LPS array for the patterndef build_lps(pat): m = len(pat) lps = [0] * m length = 0 # length of the current prefix-suffix i = 1 while i < m: if pat[i] == pat[length]: length += 1 lps[i] = length i += 1 elif length != 0: length = lps[length - 1] # fall back, do not move i else: lps[i] = 0 i += 1 return lps
# Search and print every match indexdef kmp_search(text, pat): n, m = len(text), len(pat) lps = build_lps(pat) i = j = 0 while i < n: if text[i] == pat[j]: i += 1 j += 1 if j == m: print("Found at index", i - j) j = lps[j - 1] elif j != 0: j = lps[j - 1] else: i += 1
kmp_search("abxabcabcaby", "abcaby")The output of the above code will be:
Found at index 6Tip
The fall-back line j = lps[j - 1] is the heart of KMP. When a letter does not match, it does not throw away the matched part. It looks up the LPS value and jumps the pattern to the next useful spot. The text pointer i stays put and never moves backward.
β±οΈ Why KMP is Fast
Let us put a number on it. Say n is the text length and m is the pattern length.
- Building the LPS array walks the pattern once, so that costs O(m).
- The search walks the text. The pointer
ionly moves forward and never goes back. So the search costs O(n). - Add the two together, and KMP runs in O(n + m) time overall.
That is much better than the naive O(n times m). KMP uses O(m) extra space for the LPS array.
Note
KMP time complexity is O(n + m). The naive method can hit O(n times m) on bad inputs, like long runs of repeated letters. KMP avoids that because it never re-reads text it has already passed.
Here is the comparison in one table.
| Approach | Time | Extra Space | Re-checks text? |
|---|---|---|---|
| Naive search | O(n times m) | O(1) | Yes |
| KMP | O(n + m) | O(m) | No |
β οΈ Common Mistakes
A few things trip people up the first time. Watch out for these.
- Moving the text pointer
ibackward on a mismatch. In KMP,ionly goes forward. Onlyjfalls back using the LPS array. - Using
lps[j]instead oflps[j - 1]on a mismatch. You always look up the value for the letter just before the one that failed. - Forgetting to reset
j = lps[j - 1]after a full match. Without it, you miss patterns that overlap. - Building the LPS array from the text. The LPS array is built only from the pattern. The text is never part of it.
- Setting
lps[0]to anything but 0. A single letter has no proper prefix that is also a suffix, so it is always 0.
π§© What Youβve Learned
β Naive search wastes work because it re-checks letters it already matched after every mismatch.
β KMP avoids that by precomputing the LPS array, the longest prefix of the pattern that is also a suffix at each index.
β
On a mismatch, KMP sets j = lps[j - 1] to slide the pattern smartly, and the text pointer i never moves backward.
β Building LPS is O(m) and the search is O(n), so KMP runs in O(n + m) time with O(m) extra space.
β
Common slip-ups are moving i back, using the wrong LPS index, and building LPS from the text instead of the pattern.
Check Your Knowledge
Test what you learned. Pick an answer for each question, then click Check.
- 1
What does each value in the LPS array store?
Why: LPS stands for Longest Prefix that is also a Suffix, measured for the pattern up to each index.
- 2
On a mismatch where j is greater than 0, what does KMP do?
Why: KMP reuses the matched prefix by setting j = lps[j - 1], leaving the text pointer i in place.
- 3
What is the overall time complexity of KMP?
Why: Building LPS is O(m) and the search is O(n), so KMP runs in O(n + m).
- 4
The LPS array is built from which string?
Why: LPS depends only on the pattern's own structure, never on the text.
π Whatβs Next?
- Big O Notation explains what those O(β¦) symbols mean, so the O(n + m) you saw here makes full sense.
- How to Analyze Algorithms walks you through reading code and working out its running time, just like we did for KMP.