Longest Common Subsequence
Table of Contents + β
Say you have two words. You want to know how similar they are. Not equal, just similar. How many characters do they share, in the same order? That question shows up everywhere. Spell checkers use it. The git diff you see on GitHub uses it. Even DNA matching uses it. The tool for it is called the Longest Common Subsequence, or LCS for short. So let us learn it step by step.
π What Is a Subsequence?
First we need one word clear. The whole problem depends on it. A subsequence is what you get when you take a string and delete some characters but keep the rest in the same order.
You are allowed to skip characters. You are not allowed to reorder them.
So take the word ABCDE. Here are some valid subsequences:
ACEβ we kept A, C, E and dropped the restABEβ we kept A, B, EABCDEβ we kept everything (the full string counts too)""β the empty string counts as well
But ECA is not a subsequence of ABCDE. The order is wrong. In ABCDE the A comes first. In ECA it comes last. Order matters.
People mix this up with substring. So let us clear that now.
| Term | Rule | Example from ABCDE |
|---|---|---|
| Substring | Characters must be next to each other | BCD is a substring |
| Subsequence | Characters keep their order but can have gaps | BD is a subsequence, not a substring |
So a substring is strict. The characters sit together. A subsequence is relaxed. It allows gaps. Every substring is also a subsequence, but not the other way around.
π― The LCS Problem
Now the real problem. You are given two strings. You want the longest subsequence that appears in both of them.
Take these two:
X = "ABCBDAB"Y = "BDCAB"
One common subsequence is BCAB. Check it. B, C, A, B all appear in X in that order. They all appear in Y in that order too. Its length is 4. There is no common subsequence longer than that here. So the LCS length is 4.
For now we only ask for the length of the LCS, not the actual string. The length is what most problems want. And finding the length first makes the rest easy.
Note
A pair of strings can have more than one LCS of the same length. That is fine. We only care about the length, and the length is always unique.
π§ Why Brute Force Is Too Slow
The naive idea is to list every subsequence of the first string and check each one against the second. The trouble is that a string of length n has 2^n subsequences. For a string of just 40 characters that is over a trillion checks. Way too slow.
So we use dynamic programming. Dynamic programming means we break the big problem into small overlapping pieces. We solve each small piece once. Then we store the answer in a table so we never redo work.
π The Match / No-Match Rule
Here is the idea that makes LCS click. We compare the two strings one character at a time, from the end. At each step we look at the last character of each piece.
Let i be how many characters of X we are looking at, and j how many of Y. We build a table where each cell dp[i][j] holds the LCS length of the first i characters of X and the first j characters of Y. The rules are short:
- If the two current characters match, that character is part of the LCS. So we take the answer from the smaller problem and add 1:
dp[i][j] = dp[i-1][j-1] + 1. - If they do not match, we cannot use both. So we drop one character, try both ways, and keep the better one:
dp[i][j] = max(dp[i-1][j], dp[i][j-1]). - If either string is empty (
iis 0 orjis 0), the LCS is 0. This is the base case and it fills the first row and first column with zeros.
That is the whole algorithm. Match means add one and go diagonal. No match means take the best of up or left.
π A Small Worked Example
Let us fill the table for X = "AGGT" and Y = "GT". The rows are characters of X. The columns are characters of Y. The extra empty row and column at the top hold the base-case zeros.
| "" | G | T | |
|---|---|---|---|
| "" | 0 | 0 | 0 |
| A | 0 | 0 | 0 |
| G | 0 | 1 | 1 |
| G | 0 | 1 | 1 |
| T | 0 | 1 | 2 |
Read the bottom-right cell. It says 2. So the LCS of AGGT and GT has length 2, and that subsequence is GT. See how the G row first hit 1 because G matches G. Then the T row hit 2 because T matches T after that. The answer always sits in the last cell.
This diagram shows where each cell pulls its value from:
π Steps to Find the LCS Length
Let us turn the rules into a clear plan before we write any code.
- Read both strings. Let
mbe the length of the first andnbe the length of the second. - Make a table
dpwithm + 1rows andn + 1columns. Fill it with zeros. The extra row and column are the empty-string base case. - Loop
ifrom 1 tom. Inside it, loopjfrom 1 ton. - If the character at
i-1in the first string equals the character atj-1in the second, setdp[i][j] = dp[i-1][j-1] + 1. - Otherwise set
dp[i][j] = max(dp[i-1][j], dp[i][j-1]). - After both loops finish, the answer is in
dp[m][n]. That is the LCS length.
π» LCS in Five Languages
Now the code. Each one builds the same table and prints the LCS length for X = "ABCBDAB" and Y = "BDCAB", which we said is 4.
# builds the dp table and returns the LCS lengthdef lcs(x, y): m = len(x) n = len(y) # table of zeros with one extra row and column for the base case dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1): for j in range(1, n + 1): if x[i - 1] == y[j - 1]: dp[i][j] = dp[i - 1][j - 1] + 1 # characters match else: dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]) # no match
return dp[m][n]
x = "ABCBDAB"y = "BDCAB"print("LCS length:", lcs(x, y))The output of the above code will be:
LCS length: 4Tip
We fill the first row and first column with zeros and then start our loops at 1. That way the lines dp[i-1][j-1] never read outside the table. The extra row and column do that safety job for us.
β±οΈ Time and Space Complexity
The work is one pair of nested loops. The outer loop runs m times and the inner loop runs n times. So we fill m * n cells, and each cell takes a fixed amount of work.
Note
Time: O(m * n) β one visit per cell in the table.
Space: O(m * n) β the table itself, where m and n are the two string lengths.
Here is the summary in one place.
| Approach | Time | Space |
|---|---|---|
| Brute force (list every subsequence) | O(2^n * m) | O(n) |
| DP table (this tutorial) | O(m * n) | O(m * n) |
Caution
A common mistake is to use the index i directly into the string. Remember the table is shifted by one. Row i lines up with character i-1 of the string. So always compare x[i-1] with y[j-1], not x[i] with y[j].
π§© What Youβve Learned
- β A subsequence keeps order but allows gaps, while a substring must be a run of characters next to each other.
- β The LCS is the longest subsequence shared by two strings, and we usually want its length first.
- β The DP rule: if the current characters match, take the diagonal cell and add 1; if not, take the larger of the cell above and the cell to the left.
- β
The first row and first column are zeros, and the final answer sits in the bottom-right cell
dp[m][n]. - β
Building the table costs O(m * n) time and O(m * n) space, far better than the brute-force
2^nblow-up.
Check Your Knowledge
Test what you learned. Pick an answer for each question, then click Check.
- 1
Which of these is a subsequence of "ABCDE" but NOT a substring?
Why: AE keeps the order but skips B, C, D, so it has a gap; that makes it a subsequence but not a substring.
- 2
In the LCS table, when the two current characters match, what value goes in dp[i][j]?
Why: A match means that character joins the LCS, so we take the diagonal cell and add 1.
- 3
Why do we make the dp table with m+1 rows and n+1 columns instead of m and n?
Why: The extra row and column represent comparing against an empty string, which has an LCS of 0, and they keep the index math safe.
- 4
What is the time complexity of the DP solution for strings of length m and n?
Why: We fill every one of the m * n cells once, and each cell takes constant work, so it is O(m * n).