Review on String-Matching Algorithm

Abstract: String matching is one of the most studied problems in computer science and has become an important component of many technologies. The field aims to find a desired sequence of characters in complex data using as little time and as few resources as possible. The most classical and famous string-search algorithms are the Knuth-Morris-Pratt (KMP) algorithm and the Boyer-Moore (BM) algorithm. These two algorithms provide efficient heuristic jump rules based on prefixes and suffixes, respectively. The Bitap algorithm was the first to introduce bit-parallelism into the string-matching field. The Backward Non-Deterministic DAWG Matching (BNDM) algorithm is a modern practical algorithm that combines theoretical research with practical application. These algorithms can guide future research on string-search algorithms toward better average performance and lower resource consumption.


INTRODUCTION
The string-matching problem is one of the oldest and most widely studied problems in computer science. The problem is to find single or multiple patterns in a target string or text. It has practical applications in a range of fields, including DNA sequence matching in bioinformatics, language translation, data compression, and information retrieval. More routinely, string matching is an unavoidable step in recognizing instructions in compilers and systems. Advanced matching methods with low time complexity have therefore become an important part of computer science.
Although string-matching algorithms have been developed for half a century, truly practical algorithms have appeared only in recent decades. There is a gap between theoretical research and application in the field. Algorithm researchers tend to care only about algorithms that look attractive in theory and have good time complexity, while developers pursue whichever algorithm is fastest in practice. A non-expert can easily get lost in the voluminous literature on matching algorithms and end up choosing an algorithm that looks sophisticated but performs no better, or even worse, than a simple one. A practical algorithm should perform well in applications and be easy to implement in an acceptable amount of time. The commonly used algorithms turn out to be the classical algorithms and their variants, which usually provide simple and effective approaches.
String matching is usually divided into exact pattern matching and fuzzy matching. A fuzzy matching algorithm finds whether an approximately equal substring occurs in the text string, where 'approximately equal' is defined by the Levenshtein distance [1]. In this paper we review only exact pattern search. We review four classical string-matching algorithms, the KMP algorithm, the BM string-search algorithm, the Bitap algorithm, and the BNDM algorithm, and discuss their characteristics. The paper aims to provide insight and direction for string-matching research.

Knuth-Morris-Pratt Algorithm
The Knuth-Morris-Pratt (KMP) algorithm was introduced jointly by James Hiram Morris, Vaughan Pratt and Donald Knuth in the SIAM Journal on Computing in 1977 [2]. KMP improves on the naive string-search algorithm. The core of the KMP algorithm is to use the information gained from a failed match to avoid moving the text pointer backward. KMP was the first algorithm to provide a matching method based on prefixes, indicating a way to optimize string-search algorithms. Subsequently, many developers built on KMP and produced many optimizations and hybrid methods [2,3].
Before searching, KMP preprocesses the pattern to generate a table recording the maximum length of a substring that is both a prefix and a suffix. This table determines the skip length before the next alignment, avoiding pointer fallback.
Here we give the pseudocode as follows:
Algorithm kmp_string_search:
In this pseudocode, the auxiliary table F is established in preprocessing. In this stage, there are three branches controlling the increments of t and p. In the first branch, t and p both increase by 1. In the second branch, t is replaced with the smaller value F[t], thereby increasing p − t. In the third branch, p increases by 1 and t remains unchanged. Therefore, in every iteration either p or the lower boundary p − t increases. The iteration must end after at most 2m − 2 loops, where m is the length of the pattern. Thus, the preprocessing time complexity is O(m).
The searching stage is simpler. Each step either realigns the pattern or moves the text pointer forward one step. This implies that the loop executes at most 2n times, where n is the length of the text. Thus, the search time complexity is O(n).
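The two stages described above can be sketched in Python as follows. This is a minimal illustration, not the authors' pseudocode: the function names `build_failure_table` and `kmp_search` are hypothetical, and the sketch assumes the common 0-indexed convention in which F[p] is the length of the longest proper prefix of pattern[0..p] that is also its suffix, so the fallback step reads F[t - 1] rather than F[t].

```python
def build_failure_table(pattern):
    """F[p] = length of the longest proper prefix of pattern[:p+1]
    that is also a suffix of it (assumed 0-indexed convention)."""
    m = len(pattern)
    F = [0] * m
    p, t = 1, 0
    while p < m:
        if pattern[p] == pattern[t]:
            # Branch 1: extend the current border; t and p advance together.
            t += 1
            F[p] = t
            p += 1
        elif t > 0:
            # Branch 2: fall back to a shorter border; p - t increases.
            t = F[t - 1]
        else:
            # Branch 3: no border at all; p advances, t stays at 0.
            F[p] = 0
            p += 1
    return F


def kmp_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    F = build_failure_table(pattern)
    matches = []
    t = 0  # number of pattern characters currently matched
    for i, c in enumerate(text):
        # Realign the pattern on a mismatch; the text index i never moves back.
        while t > 0 and c != pattern[t]:
            t = F[t - 1]
        if c == pattern[t]:
            t += 1
        if t == m:
            matches.append(i - m + 1)
            t = F[t - 1]  # keep the longest border to allow overlapping matches
    return matches
```

For example, `kmp_search("abababa", "aba")` reports the overlapping occurrences at positions 0, 2 and 4 without ever re-reading a text character.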

Boyer-Moore Algorithm
The Boyer-Moore (BM) string-matching algorithm was developed by Robert Stephen Boyer and J Strother Moore in 1977 [4]. The algorithm preprocesses the pattern. Instead of traversing every character of the text, BM skips some of them. Generally, the longer the search keyword, the faster the algorithm runs. In general, it is 3-5 times faster than the KMP algorithm. This efficiency comes from the fact that the algorithm uses the information gained from every failed matching attempt to exclude as many impossible alignments as possible. The algorithm is often used in the search function of text editors. For example, the well-known GNU grep command uses BM, which is an important reason why GNU grep is faster than BSD grep. Because the BM algorithm is so classical and successful, it has been studied extensively and has produced many variants and optimizations, such as the Apostolico-Giancarlo algorithm [5] and the Horspool algorithm [6]. BM also performs very well in exact pattern matching and has become the standard benchmark in the practical string-search literature [7].
Unlike the brute-force and KMP algorithms, BM compares from the end of the window to its beginning. If the last compared text character mismatches and does not occur in the pattern, we move the window directly past that character. Otherwise, we skip the maximum distance given by two heuristic search rules [4]: the bad character rule and the good suffix rule.

The bad character rule
The bad character rule follows a simple principle. When a mismatch is found, assuming the character pair are

The good suffix rule
The good suffix rule is more complex and has three cases. If the algorithm has matched a good suffix and that suffix, or part of it, occurs elsewhere in the pattern, we align them. Suppose a substring S, which is a suffix of P, has been matched against T when a mismatch occurs at P[i]; we then search for a substring S′ = S in P[0 : i − 1].
Case 1: If such an S′ exists, we move the pattern to align the rightmost S′ with S.
T:........csfz
K: .sfz...dsfz
K: .sfz...sfz
Case 2: If not, we find the longest suffix of S that is a prefix of P and align them.
T:.......csfz
K: fz...dsfz
K: fz.....sfz
Case 3: If neither of the above two cases holds, we shift the whole pattern right by m characters.
T:...cxxsfz
K:...dxxsfz
K: ...dxxsfz
The core issue is deciding which case applies to a given suffix. To implement the good suffix rule efficiently, we define an auxiliary array suffix[]. The construction of this array is simply realized as follows: With the suffix[] array, we can establish the Gsr[] array in a clever way to find the matching substring for each good suffix.

New array Gsr[0:m-1]
for (i←0;i<m;i++) do ←m-1-i
When a position satisfies several of the above three cases at the same time, we choose the smallest Gsr[i]. If there is a substring that satisfies case 1 and a prefix that satisfies case 2, we choose case 1, which has the smaller step length.
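The construction of the two arrays can be sketched in Python as follows. This is a hedged reconstruction rather than the paper's own pseudocode: it assumes that suffix[i] is the length of the longest substring of P ending at position i that is also a suffix of P, and that Gsr[j] is the shift applied when a mismatch occurs at P[j]. The function names are hypothetical, and the suffix[] array is built in a deliberately simple quadratic way for readability.

```python
def build_suffix(pattern):
    """suffix[i] = length of the longest substring ending at index i
    that is also a suffix of the whole pattern (assumed definition)."""
    m = len(pattern)
    suffix = [0] * m
    if m:
        suffix[m - 1] = m
    for i in range(m - 1):
        k = 0
        # Extend leftwards while the substring ending at i mirrors the suffix.
        while k <= i and pattern[i - k] == pattern[m - 1 - k]:
            k += 1
        suffix[i] = k
    return suffix


def build_gsr(pattern):
    """Gsr[j] = shift distance when the mismatch occurs at pattern[j]."""
    m = len(pattern)
    suffix = build_suffix(pattern)
    gsr = [m] * m  # case 3: shift the whole pattern by m
    j = 0
    # Case 2: a prefix pattern[:i+1] equals a suffix of the pattern.
    for i in range(m - 1, -1, -1):
        if suffix[i] == i + 1:
            while j < m - 1 - i:
                if gsr[j] == m:
                    gsr[j] = m - 1 - i
                j += 1
    # Case 1: the good suffix reoccurs inside the pattern. Increasing i
    # overwrites with smaller shifts, so the smallest value wins.
    for i in range(m - 1):
        gsr[m - 1 - suffix[i]] = m - 1 - i
    return gsr


def search_with_good_suffix(text, pattern):
    """Right-to-left window comparison using only the good suffix shifts."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    gsr = build_gsr(pattern)
    hits = []
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[i + j]:
            j -= 1
        if j < 0:
            hits.append(i)
            i += gsr[0]  # after a full match, shift by the pattern's period
        else:
            i += gsr[j]
    return hits
```

The search function uses the good suffix shifts alone, which is enough to show that the table by itself never skips an occurrence; a full BM implementation would take the maximum of this shift and the bad character shift.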
Here we give the complete pseudocode of the BM algorithm:
Gsr[j] calculate in advance*/
while (i < n) do
for (j←m-1;P
Return R[]
In practical implementations, however, most versions found online create only a one-dimensional bad character array that stores the rightmost occurrence of each character c. Such a one-dimensional array does not affect the actual output. A loose explanation is that even if we always align the rightmost c with the mismatched character in the text, the good suffix rule still ensures that the algorithm works properly. A rigorous proof follows. First, suppose the mismatched character T[i] = c occurs in P only to the right of P[j], at P[k] (k > j). The good suffix rule will then be applied in case 2 or case 3. Here we can deduce that Gsr If the good suffix rule applies case 2 or case 3, we still have Gsr

Bitap Algorithm
The Bitap algorithm is an approximate string-matching algorithm, also known as the shift-or, shift-and, or Baeza-Yates-Gonnet algorithm. The exact-matching bitap algorithm was first introduced by Bálint Dömölki in 1964 and extended by R. K. Shyamasundar in 1977 [8]. Later, in 1989, Ricardo Baeza-Yates extended it to handle wildcards and mismatches [9]. Baeza-Yates' paper [9] applied bit-parallelism to string searching for the first time, giving rise to the classical Shift-Or algorithm. Then, in 1991, a paper by Udi Manber and Sun Wu introduced the Shift-And algorithm, an improvement on Shift-Or that added a new extension to fuzzy matching of regular expressions [10]. It was improved again in 1999 by Baeza-Yates and Gonzalo Navarro [11].
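To make the idea of bit-parallelism concrete, the following Python sketch shows exact matching in the Shift-And formulation named above. It is a minimal illustration under stated assumptions, not code from any of the cited papers: the function name `shift_and_search` is hypothetical, and Python's arbitrary-precision integers stand in for the machine word, so the usual restriction that the pattern fit in one word does not appear here.

```python
def shift_and_search(text, pattern):
    """Exact matching with one bit per pattern position (Shift-And style).

    Bit j of `state` is set when pattern[0..j] matches the text ending
    at the current character.
    """
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit j set wherever pattern[j] == c.
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << j)
    accept = 1 << (m - 1)  # bit that signals a complete match
    state = 0
    hits = []
    for i, c in enumerate(text):
        # Extend every partial match by one character in a single step:
        # shift in a fresh candidate, then keep only positions matching c.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(i - m + 1)
    return hits
```

Each text character costs one shift, one OR and one AND, regardless of how many partial matches are in progress, which is where the practical speed of this family of algorithms comes from.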
improve the average performance of the algorithm and reduce resource consumption. This review is limited in that it covers only four exact pattern-matching algorithms; it excludes other well-known and efficient algorithms such as the Rabin-Karp algorithm and offers no guidance on fuzzy matching algorithms.
The time complexity of processing the table is O(mσ), where σ is the size of the alphabet.
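A table with O(mσ) preprocessing cost can be sketched as follows, assuming the table in question is a two-dimensional bad character table with one row per pattern position and one entry per alphabet symbol. The sketch also builds the one-dimensional variant that stores only the rightmost occurrence of each character, for comparison. All function names are hypothetical.

```python
def build_bad_char_2d(pattern, alphabet):
    """table[j][c] = rightmost index k < j with pattern[k] == c, else -1.

    Copying one row of size sigma per pattern position gives O(m * sigma).
    """
    table = []
    last = {c: -1 for c in alphabet}
    for j, ch in enumerate(pattern):
        table.append(dict(last))  # occurrences strictly left of j
        last[ch] = j
    return table


def build_bad_char_1d(pattern):
    """last[c] = rightmost index of c anywhere in the pattern."""
    return {c: k for k, c in enumerate(pattern)}


def shift_2d(table, j, c):
    """Shift proposed by the 2-D table for text character c at position j."""
    return j - table[j].get(c, -1)


def shift_1d(last, j, c):
    """Shift proposed by the 1-D table; may be zero or negative."""
    return j - last.get(c, -1)


# Usage: pattern "abcab", mismatch at j = 2 against text character "b".
table = build_bad_char_2d("abcab", "abc")
last = build_bad_char_1d("abcab")
two_d = shift_2d(table, 2, "b")  # uses the "b" at index 1, left of j
one_d = shift_1d(last, 2, "b")   # uses the "b" at index 4, right of j
```

In this example the two-dimensional table proposes a positive shift, while the one-dimensional table proposes a negative one because the rightmost "b" lies to the right of the mismatch. An implementation that uses the one-dimensional table therefore has to rely on another rule, such as the good suffix rule, to guarantee forward progress.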
this point, we can always ensure that the good suffix rule can move a greater distance.