Longest Common Subsequence

Say you have two words. You want to know how similar they are. Not equal, just similar. How many characters do they share, in the same order? That question shows up everywhere. Spell checkers use it. The git diff you see on GitHub uses it. Even DNA matching uses it. The tool for it is called the Longest Common Subsequence, or LCS for short. So let us learn it step by step.

πŸ“š What Is a Subsequence?

First we need one word clear. The whole problem depends on it. A subsequence is what you get when you take a string and delete some characters but keep the rest in the same order.

You are allowed to skip characters. You are not allowed to reorder them.

So take the word ABCDE. Here are some valid subsequences:

  • ACE β€” we kept A, C, E and dropped the rest
  • ABE β€” we kept A, B, E
  • ABCDE β€” we kept everything (the full string counts too)
  • "" β€” the empty string counts as well

But ECA is not a subsequence of ABCDE. The order is wrong. In ABCDE the A comes first. In ECA it comes last. Order matters.

People mix this up with substring. So let us clear that now.

Term Rule Example from ABCDE
Substring Characters must be next to each other BCD is a substring
Subsequence Characters keep their order but can have gaps BD is a subsequence, not a substring

So a substring is strict. The characters sit together. A subsequence is relaxed. It allows gaps. Every substring is also a subsequence, but not the other way around.

🎯 The LCS Problem

Now the real problem. You are given two strings. You want the longest subsequence that appears in both of them.

Take these two:

  • X = "ABCBDAB"
  • Y = "BDCAB"

One common subsequence is BCAB. Check it. B, C, A, B all appear in X in that order. They all appear in Y in that order too. Its length is 4. There is no common subsequence longer than that here. So the LCS length is 4.

For now we only ask for the length of the LCS, not the actual string. The length is what most problems want. And finding the length first makes the rest easy.

Note

A pair of strings can have more than one LCS of the same length. That is fine. We only care about the length, and the length is always unique.

🧠 Why Brute Force Is Too Slow

The naive idea is to list every subsequence of the first string and check each one against the second. The trouble is that a string of length n has 2^n subsequences. For a string of just 40 characters that is over a trillion checks. Way too slow.

So we use dynamic programming. Dynamic programming means we break the big problem into small overlapping pieces. We solve each small piece once. Then we store the answer in a table so we never redo work.

πŸ”‘ The Match / No-Match Rule

Here is the idea that makes LCS click. We compare the two strings one character at a time, from the end. At each step we look at the last character of each piece.

Let i be how many characters of X we are looking at, and j how many of Y. We build a table where each cell dp[i][j] holds the LCS length of the first i characters of X and the first j characters of Y. The rules are short:

  • If the two current characters match, that character is part of the LCS. So we take the answer from the smaller problem and add 1: dp[i][j] = dp[i-1][j-1] + 1.
  • If they do not match, we cannot use both. So we drop one character, try both ways, and keep the better one: dp[i][j] = max(dp[i-1][j], dp[i][j-1]).
  • If either string is empty (i is 0 or j is 0), the LCS is 0. This is the base case and it fills the first row and first column with zeros.

That is the whole algorithm. Match means add one and go diagonal. No match means take the best of up or left.

πŸ“Š A Small Worked Example

Let us fill the table for X = "AGGT" and Y = "GT". The rows are characters of X. The columns are characters of Y. The extra empty row and column at the top hold the base-case zeros.

"" G T
"" 0 0 0
A 0 0 0
G 0 1 1
G 0 1 1
T 0 1 2

Read the bottom-right cell. It says 2. So the LCS of AGGT and GT has length 2, and that subsequence is GT. See how the G row first hit 1 because G matches G. Then the T row hit 2 because T matches T after that. The answer always sits in the last cell.

This diagram shows where each cell pulls its value from:

Characters match

Take diagonal cell, add 1

Characters do not match

Take max of cell above and cell to left

πŸ“ Steps to Find the LCS Length

Let us turn the rules into a clear plan before we write any code.

  1. Read both strings. Let m be the length of the first and n be the length of the second.
  2. Make a table dp with m + 1 rows and n + 1 columns. Fill it with zeros. The extra row and column are the empty-string base case.
  3. Loop i from 1 to m. Inside it, loop j from 1 to n.
  4. If the character at i-1 in the first string equals the character at j-1 in the second, set dp[i][j] = dp[i-1][j-1] + 1.
  5. Otherwise set dp[i][j] = max(dp[i-1][j], dp[i][j-1]).
  6. After both loops finish, the answer is in dp[m][n]. That is the LCS length.

πŸ’» LCS in Five Languages

Now the code. Each one builds the same table and prints the LCS length for X = "ABCBDAB" and Y = "BDCAB", which we said is 4.

lcs.py
# builds the dp table and returns the LCS length
def lcs(x, y):
m = len(x)
n = len(y)
# table of zeros with one extra row and column for the base case
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if x[i - 1] == y[j - 1]:
dp[i][j] = dp[i - 1][j - 1] + 1 # characters match
else:
dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]) # no match
return dp[m][n]
x = "ABCBDAB"
y = "BDCAB"
print("LCS length:", lcs(x, y))

The output of the above code will be:

Output
LCS length: 4

Tip

We fill the first row and first column with zeros and then start our loops at 1. That way the lines dp[i-1][j-1] never read outside the table. The extra row and column do that safety job for us.

⏱️ Time and Space Complexity

The work is one pair of nested loops. The outer loop runs m times and the inner loop runs n times. So we fill m * n cells, and each cell takes a fixed amount of work.

Note

Time: O(m * n) β€” one visit per cell in the table. Space: O(m * n) β€” the table itself, where m and n are the two string lengths.

Here is the summary in one place.

Approach Time Space
Brute force (list every subsequence) O(2^n * m) O(n)
DP table (this tutorial) O(m * n) O(m * n)

Caution

A common mistake is to use the index i directly into the string. Remember the table is shifted by one. Row i lines up with character i-1 of the string. So always compare x[i-1] with y[j-1], not x[i] with y[j].

🧩 What You’ve Learned

  • βœ… A subsequence keeps order but allows gaps, while a substring must be a run of characters next to each other.
  • βœ… The LCS is the longest subsequence shared by two strings, and we usually want its length first.
  • βœ… The DP rule: if the current characters match, take the diagonal cell and add 1; if not, take the larger of the cell above and the cell to the left.
  • βœ… The first row and first column are zeros, and the final answer sits in the bottom-right cell dp[m][n].
  • βœ… Building the table costs O(m * n) time and O(m * n) space, far better than the brute-force 2^n blow-up.

Check Your Knowledge

Test what you learned. Pick an answer for each question, then click Check.

  1. 1

    Which of these is a subsequence of "ABCDE" but NOT a substring?

    Why: AE keeps the order but skips B, C, D, so it has a gap; that makes it a subsequence but not a substring.

  2. 2

    In the LCS table, when the two current characters match, what value goes in dp[i][j]?

    Why: A match means that character joins the LCS, so we take the diagonal cell and add 1.

  3. 3

    Why do we make the dp table with m+1 rows and n+1 columns instead of m and n?

    Why: The extra row and column represent comparing against an empty string, which has an LCS of 0, and they keep the index math safe.

  4. 4

    What is the time complexity of the DP solution for strings of length m and n?

    Why: We fill every one of the m * n cells once, and each cell takes constant work, so it is O(m * n).

πŸš€ What’s Next?

Share & Connect