KMP Algorithm (Reference)

You already met the idea of searching for a pattern inside a bigger text over in the strings module. So we will not re-teach the slow way here. This page is a tight reference for one algorithm. It is KMP, short for Knuth-Morris-Pratt. The whole point of KMP is simple. It never looks at the same character of the text twice when a match breaks.

Here is the pain it fixes. In the naive search, when characters stop matching, you slide the pattern one step to the right. Then you start comparing all over again from the beginning. You throw away everything you just learned. KMP says no. You already compared those characters, so do not check them again. It remembers what it matched. Then it jumps the pattern forward in a smart way. That memory lives in a small helper array called the LPS array. Building it correctly is the real skill here.

πŸ€” What Problem Does the LPS Array Solve?

Say your text is aaaab and your pattern is aaab. The naive way matches aaa. Then it hits a mismatch on the 4th letter. So it slides one step and re-checks the same letters again. That is wasted work.

The pattern itself tells us how far we can safely jump. If the start of the pattern repeats inside the pattern, we can reuse that overlap. We do not have to go back to zero. The LPS array captures exactly that overlap for every position.

LPS stands for Longest Proper Prefix which is also a Suffix. Let me unpack each word so nothing is hidden.

  • A prefix is any chunk taken from the start of a string. For abc, the prefixes are a, ab, abc.
  • A suffix is any chunk taken from the end. For abc, the suffixes are c, bc, abc.
  • Proper means we are not allowed to take the whole string itself. So abc does not count as its own proper prefix.

So for each position i in the pattern, lps[i] is the length of the longest chunk that is both a proper prefix and a suffix of the part of the pattern ending at i. That number tells us how far to jump back on a mismatch.

πŸ“ Building the LPS Array Step by Step

Let’s build LPS for the pattern ababaca. We keep two pointers. len tracks the length of the current matching prefix. i walks through the pattern.

Here is how the walk goes, in plain points:

  • Start with lps[0] = 0 always. A single character has no proper prefix, so its answer is 0.
  • Set len = 0 and i = 1. Now compare pattern[i] with pattern[len].
  • If they match, the matching prefix grew by one. Do len = len + 1, write lps[i] = len, then move i forward.
  • If they do not match and len is not 0, do not move i. Instead fall back with len = lps[len - 1] and try comparing again.
  • If they do not match and len is already 0, there is no overlap here. Write lps[i] = 0 and move i forward.

Follow those rules on ababaca and you land on this table.

Index i Character lps[i] Why
0 a 0 First character is always 0.
1 b 0 b does not match a, and len is 0.
2 a 1 a matches the prefix a, so len becomes 1.
3 b 2 b matches, prefix ab is also a suffix.
4 a 3 a matches, prefix aba is also a suffix.
5 c 0 c breaks the match; fall back until len is 0.
6 a 1 a matches the prefix a again, len becomes 1.

So the final LPS array for ababaca is [0, 0, 1, 2, 3, 0, 1]. That last row is the tricky one. When c broke the run, we did not reset blindly to 0 in one shot. We fell back using len = lps[len - 1] until either a match happened or len hit 0.

Tip

The fall-back line len = lps[len - 1] is the heart of KMP. It reuses earlier answers instead of recomputing them. That is why building LPS itself stays linear in the pattern length.

Now we have the map, so searching the text becomes simple. We keep i walking the text and j walking the pattern.

Here is the search loop in points:

  • If text[i] equals pattern[j], both pointers move forward together.
  • If j reaches the pattern length, you found a full match ending just before i. Record it, then set j = lps[j - 1] to keep searching for more matches.
  • On a mismatch, if j is not 0, jump the pattern with j = lps[j - 1]. Notice i does not move back here. The text pointer never goes backward, and that is the whole trick.
  • On a mismatch when j is 0, just move i forward by one.

Because i only ever moves forward across the text, KMP reads each text character a constant number of times. There is no restart from the middle.

Yes

Yes

No

No

Yes

No

Compare text i with pattern j

Match?

i++, j++

j equals pattern length?

Found match, j = lps j-1

j greater than 0?

j = lps j-1, keep i

i++

πŸ’» KMP in Code

Below is computeLPS to build the array, and KMPSearch to find every place the pattern appears. We search for the pattern ababaca inside the text abababacaba. The match starts at index 2.

kmp.py
# Build the LPS array for the pattern
def compute_lps(pat):
m = len(pat)
lps = [0] * m
length = 0 # length of current matching prefix
i = 1
while i < m:
if pat[i] == pat[length]:
length += 1
lps[i] = length
i += 1
elif length != 0:
length = lps[length - 1] # fall back, do not move i
else:
lps[i] = 0
i += 1
return lps
# Search the pattern inside the text using LPS
def kmp_search(text, pat):
n, m = len(text), len(pat)
lps = compute_lps(pat)
i = j = 0
while i < n:
if text[i] == pat[j]:
i += 1
j += 1
if j == m:
print("Found at index", i - j)
j = lps[j - 1]
elif i < n and text[i] != pat[j]:
if j != 0:
j = lps[j - 1]
else:
i += 1
kmp_search("abababacaba", "ababaca")

The output of the above code will be:

Output
Found at index 2

Note

Building the LPS array costs O(m), where m is the pattern length. The search loop costs O(n), where n is the text length. Put together, KMP runs in O(n + m) time. It uses O(m) extra space for the LPS array. The naive search can drift to O(n times m) on bad inputs. So KMP is a clear win on long texts.

⚠️ Common Mistakes

A few traps catch people every time they write KMP. Watch for these.

  • Moving i when len is not 0 during the LPS build. On a mismatch with len greater than 0, you fall back with len = lps[len - 1] and you do not touch i.
  • Resetting len straight to 0 on a mismatch. That throws away reusable overlap. Always fall back through lps[len - 1] first.
  • Forgetting to reset j with lps[j - 1] after a full match. If you skip that, you miss overlapping matches.
  • Letting the text pointer i move backward. In KMP i only ever moves forward. If it moves back, you broke the whole speed promise.

🧩 What You’ve Learned

  • βœ… The LPS array stores, for each position, the longest proper prefix that is also a suffix.
  • βœ… You build LPS in O(m) using two pointers and the fall-back line len = lps[len - 1].
  • βœ… For the pattern ababaca, the LPS array is [0, 0, 1, 2, 3, 0, 1].
  • βœ… KMP search keeps the text pointer moving forward and only jumps the pattern using LPS.
  • βœ… The full algorithm runs in O(n + m) time with O(m) extra space.

Check Your Knowledge

Test what you learned. Pick an answer for each question, then click Check.

  1. 1

    What does each value lps[i] represent?

    Why: lps[i] is the length of the longest chunk that is both a proper prefix and a suffix of the pattern ending at index i.

  2. 2

    During the LPS build, on a mismatch when len is greater than 0, what do you do?

    Why: You fall back with len = lps[len - 1] and leave i where it is, reusing earlier answers.

  3. 3

    What is the LPS array for the pattern ababaca?

    Why: Walking the build rules on ababaca gives [0, 0, 1, 2, 3, 0, 1].

  4. 4

    Why does KMP run in O(n + m) instead of O(n times m)?

    Why: The LPS array lets the pattern jump forward without rewinding the text pointer, so the text is scanned in linear time.

πŸš€ What’s Next?

Share & Connect