KMP algorithm aims to solve the string matching problem. And I believe that it is the most efficient solution of this problem. First let’s review this problem.

String Matching

Given a text string \(T[1…n]\) and a pattern \(P[1…m]\), if \(0<=s<=n-m\) and \(T[s+1…s+m]]=P[1…m]\), then \(P\) occurs with shift \(s\). The string matching problem is to find all valid shifts. For example, given \(T: “bacbababaabcbab”\) and \(P: “ababa”\), we can get s=4 after string matching.

Brute Force

It is easy to come up with the brute force algorithm. Just check if T[s+1…s+m] equals to P[1…m] for s=0…n-m. The worst-case running time is \(\theta((n-m+1)m)\).

Exploration

In order to understand KMP, we should first know why the string matching process could be improved. Consider that when using brute force, after one time of comparing the pattern, which has m elements being compared, the position of s just moves to the next one position. Then we must compare m elements again, which includes \(m-1\) elements that has been compared at the last step. So what we should make effort to is making use of the comparisons.

The idea is that the shift does not need to move to the next one postion, it could move to a further position, that is to say, in one time of comparing the pattern, when a character is unmatched, we could find the longest prefix and suffix of matched substring that are same. Then the shift can move to the start position of the suffix rather than the next one position of this shift’s start position. In other words, the comparing element cursor could move to the next of the end element of the suffix, which is just the cursor’s current position!

KMP

We define a \(\pi\) array ( or call it “next” array) to store the data about which position the shift should return to after one unmatched comparison.

void ComputePrefixFunction(string P,vector<int>& pi)
{
	int m = P.length();
	pi.resize(m);
	pi[0] = 0;
	int k = 0;
	for (int q = 1; q < m; q++)
	{
		while (k >0 && P[k] != P[q])
		{
			k = pi[k-1];
		}
		if (P[k] == P[q])
			k = k + 1;
		pi[q] = k;
	}
}

The code is easy to understand except Line 11 \(k=\pi[k-1]\). We know that if \(P[k]==P[q]\) then the length of same longest prefix and suffix increase one. But what if they are not equal? Here we know the length of same longest prefix and suffix for \(P[0…k-1]\) and \(P[q-k…q-1]\). Then we can check if the next element of the last time’s prefix equals to the element now and do this iteratively until find a element satisfy it or \(k=0\).

After we get the \(\pi\) array, it is easy to inplement the following process.

void KMPMatcher(string T, string P)
{
	int n = T.length(), m = P.length();
	vector<int> pi;
	ComputePrefixFunction(P,pi);
	int q = 0;
	for (int i = 0; i < n; i++)
	{
		while (q > 0 && P[q] != T[i])
			q = pi[q-1];
		if (P[q] == T[i])
			q = q + 1;
		if (q == m)
		{
			printf("%d\n", i - m + 1);
			q = pi[q-1];
		}
	}
	printf("pi:\n");
	for (int i = 0; i < pi.size(); i++)
	{
		printf("%d ", pi[i]);
	}
	printf("\n");
}

Test

#include <bits/stdc++.h>
using namespace std;
int main()
{
	string T, P;
	cin >> T;
	cin >> P;
	KMPMatcher(T, P);
	system("pause");
}

Result:

bacbababaabcbab
ababa
4
pi:
0 0 1 2 3

Analysis

Obviously, the running time of ComputePrefixFunction is \(\theta(m)\). And the running time of KMPMatcher is \(\theta(n)\). It is really efficient!

Reference

[1] Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to algorithms. MIT press.

[2] https://www.cnblogs.com/c-cloud/p/3224788.html

[3] https://subetter.com/articles/2018/04/how-to-understand-kmp.html

[4] https://www.zhihu.com/question/36149122/answer/66867065

Leave a Reply

Your email address will not be published. Required fields are marked *