The Knuth–Morris–Pratt (or KMP algorithm) is a substring matching algorithm. It is used for searching occurrences of a pattern P within a string S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters. Time complexity of this algorithm is O(k + n), where k is the length of P and n is the length of S; therefore, the algorithm is linear. In a naive substring matching algorithm, we iterate through S[i] matching each character with P[0], if a match is observed, we proceed matching consecutive characters, comparing S[i + j] P[j]. Upon a mismatch, we resume comparison with P[0] starting at S[i + 1]. This has the complexity of O(k * n), where k is the length of P and n is the length of S. KMP is more efficient because it does some pre-processing on the pattern P, and use that information to prevent from starting back at S[i + 1] upon a mismatch. The pre-processing of pattern P is to give us a way to tell, at any index of the pattern, P[j], when looking at the substring, from P[0] to P[j], if there is a suffix that is also a prefix. So for example if our pattern is "APXBAPWX", at P[5] we should instantly be able to recognize that in the substring of P[0] to P[5], which is "APXBAP", there is a suffix, "AP", which is also a prefix. KMP achieves this capability by computing a temporary array T as follows, where at every index, z, of T, there is an integer value that if non zero, it tells you there is a suffix that is also a prefix, and the value is the size of the said prefix/suffix. Temporary array T is computed as follows: int *getTempArray(char *pattern, int size) { if (pattern == nullptr || size < 1) return nullptr; int *T = new int[size]; T[0] = 0; int j = 0, i = 1; while (i < size) { if (pattern[i] == pattern[j]) { T[i] = j + 1; j++; i++; } else { if (j != 0) j = T[j - 1]; else { T[i] = 0; i++; } } } return T; } So having the temporary array T already computed, when searching for a substring and there is a mismatch in the middle of a partial match, S[i + j] and P[j], KMP simply looks at T[j - 1], value m, which indicates that there is no need to restart the matching from S[i + 1] and P[0] like the naive algorithm. It can just continue matching at S[i + j + 1] and P[m]. This is because if there is a suffix that is also a prefix in the substring of P that we already looked at so far, we know in S, the suffix was the very last values we had (because we had a partial match so far), so if these last few values are also a prefix, there is no need to start over. We just continue matching at the location we were on S and continue matching at the end of a prefix on P (for which T provided us with the exact index). So KMP algorithm is implemented as follows. int KMPSearch(char *S, int sizeS, char *P, int sizeP) { int ret = -1; if (S == nullptr || sizeS < 1 || P == nullptr || sizeP < 1) return ret; int *T = getTempArray(P, sizeP); int i = 0, j = 0; while (i < sizeS && j < sizeP) { if (S[i] == P[j]) { ret = i; while (i < sizeS && j < sizeP && S[i] == P[j]) { i++; j++; } if (j == sizeP) break; else { if (j != 0) j = T[j - 1]; ret = -1; } } else i++; } return ret; } |