KMP Algorithm (Reference)
Table of Contents + β
You already met the idea of searching for a pattern inside a bigger text over in the strings module. So we will not re-teach the slow way here. This page is a tight reference for one algorithm. It is KMP, short for Knuth-Morris-Pratt. The whole point of KMP is simple. It never looks at the same character of the text twice when a match breaks.
Here is the pain it fixes. In the naive search, when characters stop matching, you slide the pattern one step to the right. Then you start comparing all over again from the beginning. You throw away everything you just learned. KMP says no. You already compared those characters, so do not check them again. It remembers what it matched. Then it jumps the pattern forward in a smart way. That memory lives in a small helper array called the LPS array. Building it correctly is the real skill here.
π€ What Problem Does the LPS Array Solve?
Say your text is aaaab and your pattern is aaab. The naive way matches aaa. Then it hits a mismatch on the 4th letter. So it slides one step and re-checks the same letters again. That is wasted work.
The pattern itself tells us how far we can safely jump. If the start of the pattern repeats inside the pattern, we can reuse that overlap. We do not have to go back to zero. The LPS array captures exactly that overlap for every position.
LPS stands for Longest Proper Prefix which is also a Suffix. Let me unpack each word so nothing is hidden.
- A prefix is any chunk taken from the start of a string. For
abc, the prefixes area,ab,abc. - A suffix is any chunk taken from the end. For
abc, the suffixes arec,bc,abc. - Proper means we are not allowed to take the whole string itself. So
abcdoes not count as its own proper prefix.
So for each position i in the pattern, lps[i] is the length of the longest chunk that is both a proper prefix and a suffix of the part of the pattern ending at i. That number tells us how far to jump back on a mismatch.
π Building the LPS Array Step by Step
Letβs build LPS for the pattern ababaca. We keep two pointers. len tracks the length of the current matching prefix. i walks through the pattern.
Here is how the walk goes, in plain points:
- Start with
lps[0] = 0always. A single character has no proper prefix, so its answer is 0. - Set
len = 0andi = 1. Now comparepattern[i]withpattern[len]. - If they match, the matching prefix grew by one. Do
len = len + 1, writelps[i] = len, then moveiforward. - If they do not match and
lenis not 0, do not movei. Instead fall back withlen = lps[len - 1]and try comparing again. - If they do not match and
lenis already 0, there is no overlap here. Writelps[i] = 0and moveiforward.
Follow those rules on ababaca and you land on this table.
| Index i | Character | lps[i] | Why |
|---|---|---|---|
| 0 | a | 0 | First character is always 0. |
| 1 | b | 0 | b does not match a, and len is 0. |
| 2 | a | 1 | a matches the prefix a, so len becomes 1. |
| 3 | b | 2 | b matches, prefix ab is also a suffix. |
| 4 | a | 3 | a matches, prefix aba is also a suffix. |
| 5 | c | 0 | c breaks the match; fall back until len is 0. |
| 6 | a | 1 | a matches the prefix a again, len becomes 1. |
So the final LPS array for ababaca is [0, 0, 1, 2, 3, 0, 1]. That last row is the tricky one. When c broke the run, we did not reset blindly to 0 in one shot. We fell back using len = lps[len - 1] until either a match happened or len hit 0.
Tip
The fall-back line len = lps[len - 1] is the heart of KMP. It reuses earlier answers instead of recomputing them. That is why building LPS itself stays linear in the pattern length.
π How KMP Uses LPS to Search
Now we have the map, so searching the text becomes simple. We keep i walking the text and j walking the pattern.
Here is the search loop in points:
- If
text[i]equalspattern[j], both pointers move forward together. - If
jreaches the pattern length, you found a full match ending just beforei. Record it, then setj = lps[j - 1]to keep searching for more matches. - On a mismatch, if
jis not 0, jump the pattern withj = lps[j - 1]. Noticeidoes not move back here. The text pointer never goes backward, and that is the whole trick. - On a mismatch when
jis 0, just moveiforward by one.
Because i only ever moves forward across the text, KMP reads each text character a constant number of times. There is no restart from the middle.
π» KMP in Code
Below is computeLPS to build the array, and KMPSearch to find every place the pattern appears. We search for the pattern ababaca inside the text abababacaba. The match starts at index 2.
# Build the LPS array for the patterndef compute_lps(pat): m = len(pat) lps = [0] * m length = 0 # length of current matching prefix i = 1 while i < m: if pat[i] == pat[length]: length += 1 lps[i] = length i += 1 elif length != 0: length = lps[length - 1] # fall back, do not move i else: lps[i] = 0 i += 1 return lps
# Search the pattern inside the text using LPSdef kmp_search(text, pat): n, m = len(text), len(pat) lps = compute_lps(pat) i = j = 0 while i < n: if text[i] == pat[j]: i += 1 j += 1 if j == m: print("Found at index", i - j) j = lps[j - 1] elif i < n and text[i] != pat[j]: if j != 0: j = lps[j - 1] else: i += 1
kmp_search("abababacaba", "ababaca")The output of the above code will be:
Found at index 2Note
Building the LPS array costs O(m), where m is the pattern length. The search loop costs O(n), where n is the text length. Put together, KMP runs in O(n + m) time. It uses O(m) extra space for the LPS array. The naive search can drift to O(n times m) on bad inputs. So KMP is a clear win on long texts.
β οΈ Common Mistakes
A few traps catch people every time they write KMP. Watch for these.
- Moving
iwhenlenis not 0 during the LPS build. On a mismatch withlengreater than 0, you fall back withlen = lps[len - 1]and you do not touchi. - Resetting
lenstraight to 0 on a mismatch. That throws away reusable overlap. Always fall back throughlps[len - 1]first. - Forgetting to reset
jwithlps[j - 1]after a full match. If you skip that, you miss overlapping matches. - Letting the text pointer
imove backward. In KMPionly ever moves forward. If it moves back, you broke the whole speed promise.
π§© What Youβve Learned
- β The LPS array stores, for each position, the longest proper prefix that is also a suffix.
- β
You build LPS in O(m) using two pointers and the fall-back line
len = lps[len - 1]. - β
For the pattern
ababaca, the LPS array is[0, 0, 1, 2, 3, 0, 1]. - β KMP search keeps the text pointer moving forward and only jumps the pattern using LPS.
- β The full algorithm runs in O(n + m) time with O(m) extra space.
Check Your Knowledge
Test what you learned. Pick an answer for each question, then click Check.
- 1
What does each value lps[i] represent?
Why: lps[i] is the length of the longest chunk that is both a proper prefix and a suffix of the pattern ending at index i.
- 2
During the LPS build, on a mismatch when len is greater than 0, what do you do?
Why: You fall back with len = lps[len - 1] and leave i where it is, reusing earlier answers.
- 3
What is the LPS array for the pattern ababaca?
Why: Walking the build rules on ababaca gives [0, 0, 1, 2, 3, 0, 1].
- 4
Why does KMP run in O(n + m) instead of O(n times m)?
Why: The LPS array lets the pattern jump forward without rewinding the text pointer, so the text is scanned in linear time.