Journal of Software Engineering and Applications
Vol.08 No.03(2015), Article ID:54874,11 pages
10.4236/jsea.2015.83016

Four Sliding Windows Pattern Matching Algorithm (FSW)

Amjad Hudaib1, Rola Al-Khalid1, Aseel Al-Anani1, Mariam Itriq2, Dima Suleiman2

1Department of Computer Information Systems, King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan

2Department of Business Information Technology, King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan

Email: dima.suleiman@ju.edu.jo, m.itriq@ju.edu.jo, a.anani@ju.edu.jo, r.khalid@ju.edu.jo, ahudaib@ju.edu.jo

Copyright © 2015 by authors and Scientific Research Publishing Inc.

This work is licensed under the Creative Commons Attribution International License (CC BY).

http://creativecommons.org/licenses/by/4.0/

Received 25 February 2015; accepted 18 March 2015; published 20 March 2015

ABSTRACT

This paper presents an efficient pattern matching algorithm (FSW). FSW improves the searching process for a pattern in a text. It scans the text with the help of four sliding windows. The windows are equal to the length of the pattern, allowing multiple alignments in the searching process. The text is divided into two parts; each part is scanned from both sides simultaneously using two sliding windows. The four windows slide in parallel in both parts of the text. The comparisons done between the text and the pattern are done from both of the pattern sides in parallel. The conducted experiments show that FSW achieves the best overall results in the number of attempts and the number of character comparisons compared to the pattern matching algorithms: Two Sliding Windows (TSW), Enhanced Two Sliding Windows algorithm (ETSW) and Berry-Ravindran algorithm (BR). The best time case is calculated and found to be while the average case time complexity is.

Keywords:

Pattern Matching, FWS, Enhanced Two Sliding Windows Algorithm, RS-A Fast Pattern Matching Algorithm

1. Introduction

String matching is a challenging subject in computer science. Many researchers proposed and designed different techniques and algorithms to find all possible occurrences of a pattern P of size m from the text string T of size n [1] -[3] . The researchers focus on reducing the number of character comparisons and processing time. String matching algorithms are used in various applications such as matching DNA sequences [4] [5] , voice recognition, image processing, text processing [6] -[8] , network security, real-time problem, web applications and information retrieval from databases [9] [10] .

In this paper, we combine the searching strategy of the ETSW algorithm and the shifting process of the BR algorithm [11] [12] . This paper ends up with a new algorithm (FSW) that uses four sliding windows which are equal to the length of the pattern. FSW divides the text into two parts; each part is scanned from both sides simultaneously using two sliding windows. The four windows slide in parallel in both halves of the text. The comparisons done between the text and the pattern are done from both of the pattern sides in parallel. The FSW algorithm finds the first occurrence of the pattern from either the left windows or the right windows from both parts of the text. In all the cases we tested and the comparisons we performed with other matching algorithms such as BR, TSW and ETSW, the proposed algorithm FSW proved to be the best in reducing the number of comparisons and attempts needed to find the pattern [13] [14] . This paper is organized as follows: Section 2 provides an overview of the related works; Section 3 explains the FSW algorithm; Section 4 includes the performance analysis and Section 5 concludes the paper.

2. Related Works

Recently, several new pattern matching algorithms have been proposed to minimize the number of comparisons done to locate a pattern in a text [15] -[17] . Enhancements are made on both the searching process by using several sliding windows that scan the text in parallel and on the preprocessing phase by determining the shift value that the pattern should move in the process of searching the text for the pattern [18] .

The Berry-Ravindran algorithm (BR) uses the bad character shift function to calculate the shift value for the two consecutive characters in the text immediately to the right of the pattern window. In BR the searching time

complexity is calculated to be and the pre-processing time complexity is [14] . The Two-

Sliding Window algorithm (TSW) determines the shift value by using the idea of Berry-Ravindran bad character

shift function. The pre-processing time complexity is found to be .

In the searching phase, TSW uses two sliding windows to scan the text from both sides in parallel. The search process continues until the first occurrence of the pattern is found or until both windows reach the middle of the text. The size of each sliding window is equal to the length of the pattern. In TSW, the best time complexity is

and the worst case time complexity is [12] .

The Enhanced Two Sliding Windows algorithm (ETSW) utilizes the idea of Berry-Ravindran bad character shift function to get better shift values during the searching phase. In the searching phase, ETSW scans both of the text and the pattern from both sides in parallel. Both the text and the pattern are divided into left and right parts. So, the text is searched from both parts simultaneously and the comparisons with the pattern are done from both its parts at the same time. ETSW algorithm stops when the pattern is not found. In ETSW, the best

time case is while the average case time complexity is [11] . The Enhanced RS-A algorithm

(ERS-A) [19] utilizes the idea of RS-A algorithm to get better shift values. ERS-A algorithm uses four consecutive characters in the text immediately to the right of the pattern window.

The ERS-A algorithm uses two sliding windows to search for a pattern in a text. The two windows slide from both sides of the text simultaneously. The searching process continues until a match is found. It stops immediately if the pattern is not found in the text. In ERS-A, the best case complexity is while the average

case time complexity is [19] .

In this paper enhancements are made on the ETSW algorithm, the preprocessing phase is the same while the searching process is made better using four sliding windows to scan the text simultaneously. The comparisons with the pattern are also done from both of the pattern sides in parallel.

3. The FSW Algorithm

The FSW algorithm scans the text as well as the pattern from both sides simultaneously in order to improve the search process. The proposed algorithm (FSW) scans the text using four sliding windows, allowing multiple alignments in the searching process. Each window size is equal to the length of the pattern. In the searching phase, the text is divided in into two parts; each part is scanned from both sides simultaneously using two sliding windows. The four windows slide in parallel in both halves of the text. The comparisons done between the text and the pattern are done from both sides of the pattern in parallel.

Two of the siding windows are aligned with the left and the right sides of the first part of the text and at the same time the other two sliding windows are aligned with the left and the right sides of the second part of the text resulting in four sliding windows that scam the text simultaneously. FSW algorithm stops when a sliding window finds the pattern or the pattern is not found within the text string at all.

FSW algorithm enhances the searching process in the ETSW algorithm. Both the FSW and ETSW algorithms utilize the idea of BR bad character shift function to get better shift values during the searching phase.

The main difference between the FSW and the ETSW algorithms lies in the searching process. During the search the comparisons between the pattern and the text in the FSW are made using four sliding windows while in the Enhanced TSW algorithm two sliding windows are used. Using two additional windows during the search process decreases the number of comparisons and attempts done.

3.1. Pre-Processing Phase

The pre-processing phase of the FSW algorithm is the same as in ETSW algorithm. Two arrays nextl and nextr are generated. Each array is a one-dimensional array. The shift values are calculated according to Berry-Ravindran bad character algorithm (BR). The shift values needed to search the text from the left side are stored in the nextl array. On the other hand nextr array contains the shift values needed to search the text from the right side.

To build the two arrays (nextl and nextr), we take each two consecutive characters of the pattern and give it an index starting from 0. For example for the pattern structure abcd, the consecutive characters ab, bc and cd are given the indexes 0, 1 and 2 respectively.

The shift values for the nextl array are calculated according to Equation (1) while the shift values for the nextr array are calculated according to Equation (2). In Equation (1), we compare between the last character in the pattern m − 1 with a if there is a match the window is shifted 1 character to the right. If there is a mismatch the shift is the minimum of m − i in case of, m + 1 in case of and m + 2 otherwise. In Equation (2), if the first character of the pattern matches b then the window is shifted 1 character to the left otherwise we take the minimum of in case of, m + 1 in case of, m + 2 otherwise.

(1)

(2)

3.2. Searching Phase

In the four sliding windows algorithm, the text is divided into two parts. The left part is named part 1 while the right part is named part 2. Four windows are created for the whole test. Two windows are created for each part of the text, to search for the pattern in parallel. The left and right windows of part 1 are named p1L and p1R respectively. The left and right windows of part 2 are named p2L and p2R respectively. At the beginning of the search, p1L and p2R windows are aligned with the left most and rightmost sides of the text. p2L window is aligned with the text at index while p1R window is aligned with the text at index where n is the text length and m is the pattern length. The alignments of p2L and p1R are calculated taking into consideration the case where some characters of the pattern may appear in part1 of the text and the rest may appear in the second part of the text.

Figure 1 explains the algorithm of the FSW algorithm.

Figure 1. FSW pattern matching algorithm.

3.3. Working Example

In this section we will present an example to clarify the FSW algorithm.

Given:

Pattern (P) = “abcd”, m = 4,

Text (T) = “abaccbacdacdbadcbaadcbbcacbbcaaddcaabcbaaacbddababcdddabdaabaabccdabccdbacbdcbcdacc dbcbddaadddbcabdb”, n = 100.

3.3.1. Pre-Processing Phase

Initially, shiftl1 = shiftr1 = shiftl2 = shiftr2 = m + 2 = 6.

The shift values are calculated using equations 1 and 2. The values are then stored in two arrays nextl and nextr as shown in Figure 2(a) and Figure 2(b) respectively.

3.3.2. Searching Phase

The searching process for the pattern P is illustrated through the working example as shown in Figure 3.

First attempt: (see Figure 3(b))

We align p1L with the text from the left of part 1. In this case, comparisons are made between the text character located at index 0 (character a) with the leftmost character in the pattern (character a). At the same time, comparisons are made between the text character at index 3 (character c) with the rightmost character in the pattern (character d). As a result, a mismatch occurs between text character c and pattern character d; therefore we take the two consecutive characters from the text at index 4 and 5 which are c and b respectively. To determine the amount of shift (shiftl) we have to do the following two steps:

a) We look for the index of cb in the pattern.

b) Since cb is not found in the pattern, so the window is shifted to the right 6 steps (see Equation (1)).

As explained in the example the number of comparisons needed to determine if there is a match or not is one; this is because two character comparisons between the text and the pattern are performed at the same time.

Second attempt: (see Figure 3(c))

We align p1R with the text from the right of part 1. In this case, a match occurs between the text character at index 52 (d) and the rightmost character in the pattern d while there is a mismatch between the text character at index 49 (b) and the leftmost character in the pattern a; therefore we take the two consecutive characters from the text at index 47 and 48 which are b and a respectively. To determine the amount of shift (shiftr), we have to do the following two steps:

a) We look for the index of ba in the pattern.

b) Since ba is not found in the pattern, but the which is a matches the text character at index 48 then according to the pre-processing phase the sliding window will be shifted 1 step to the left.

Third attempt: (see Figure 3(d))

We align p2L with the text from the left of part 2. In this case, a mismatch occurs between the text character at index 50 (c) and the leftmost character in the pattern a while there is a match between the text character at index 53 (d) and the rightmost character in the pattern d; therefore we take the two consecutive characters from the text at index 54 and 55 which are a and b respectively. To determine the amount of shift (shiftl) we have to do the following two steps:

a) We look for the index of ab in the pattern, which is found 0.

Figure 2. The nextl and nextr arrays.

Figure 2. The nextl and nextr arrays.

Figure 3. Working example.

b) Since we search from the left side we use nextl array, and shiftl = nextl [0] = 4.

Therefore the window will be shifted to the right 4 steps.

Fourth attempt:

In the fourth attempt (see Figure 3(e)), we align the fourth sliding window with the text from the right of part 2. In this case, a mismatch occurs between the text character at index 99 (b) and the rightmost character in the pattern (character d) while there is a match between the text character at index 96 (character a) and the leftmost character in the pattern (character a); therefore we take the two consecutive characters from the text at index 94 and 95 which are b and c respectively. To determine the amount of shift (shiftr), we have to do the following two steps:

a) We look for the index of bc in the pattern.

b) Since we search from the right side, we use nextr array, and shiftr = nextr [1] = 3.

Therefore the window is shifted to the left 3 steps.

Fifth attempt:

In the fifth attempt (see Figure 3(f)), we align the first sliding window with the text from the left of part 1. In this case, a match occurs between the text character at index 6 (a) and the leftmost character in the pattern (character a) while there is a mismatch between the text character at index 9 (character a) and the rightmost character in the pattern (character d); therefore we take the two consecutive characters from the text at index 10 and 11 which are c and d respectively,

a) We look for the index of cd in the pattern.

b) Since we search from the left side we use nextl array, and shiftl = nextl [2] = 2.

Therefore the window is shifted to the right 2 steps.

Sixth attempt:

In the sixth attempt (see Figure 3(g)), we align the second sliding window with the text from the right of part 1. A comparison between the pattern and the text characters leads to a complete match at index 48. In this case, the occurrence of the pattern is found using the right window of part 1.

4. Analysis

Preposition 1: The space complexity is where m is the pattern length.

Preposition 2: The pre-process time complexity is .

Lemma 1: The worst case time complexity is

Proof: The worst case occurs when at each attempt, all the compared characters of both pattern sides at 4 windows that slide simultaneously are matched the corresponding text characters except the pattern character indexed (m), and at the same time the shift value is equal to 1.

Lemma 2: The best case time complexity is

Proof: The best case occurs when the pattern is found at the first index, , or at the last index (n − m), in this case the number of comparisons made to compare m pattern characters are .

Lemma 3: The average case time complexity is .

Proof: The average case occurs when the two consecutive characters of the text directly following the sliding window is not found in the pattern. In this case, the shift value will be (m + 2) for each window from 4 available windows.

5. Results

In order to ensure that the FSW algorithm gives extraordinary results in the searching process, several experiments were performed. The FSW algorithm searches the text using four sliding windows. All the windows slide in parallel. Comparisons done with the pattern is also done from both sides simultaneously. Tables 1-5 as well as Figures 4-7 show the results of comparing FSW with ETSW, TSW and BR algorithms.

Table 1, Figure 4 and Figure 5 show the average number of attempts and comparisons for patterns with different lengths. It is noticeable that the number of comparisons and attempts in FSW is much better than the others. This is because in FSW four windows are used while in both ETSW and TSW algorithms two sliding windows are used. On the other hand, BR algorithm uses only one sliding window. For example, if the text has 1167 words, each of length 8, then the average number of comparisons and attempts made by FSW is 3577 and 3502 respectively. The number of comparisons and attempts made by ETSW is 10115 and 10056 respectively. Looking at Table 1, the number of comparisons and attempts of TSW and BR are also greater than FSW. This makes FSW algorithm better than the other algorithms in terms of the average number of comparisons and attempts.

Table 1. The average number of attempts and comparisons for patterns with different lengths.

Table 2. The average number of attempts and comparisons performed to search for (100) patterns selected from the middle of the text.

Table 3. The number of attempts and comparisons performed to search for a set of patterns that do not exist in the text.

Table 4. The average number of attempts and comparisons performed to search for (100) patterns selected from the beginning of the text.

Table 5. The average number of attempts and comparisons performed to search for (100) patterns selected from the end of the text.

Figure 4. The average number of attempts for patterns with different lengths.

Figure 5. The average number of comparisons for patterns with different lengths.

Figure 6. The average number of attempts performed to search for (100) patterns from the middle of the text.

Figure 7. The average number of comparisons performed to search for (100) patterns from the middle of the text.

Table 2, Figure 6 and Figure 7 show the average number of attempts and comparisons performed to search for 100 patterns selected from the middle of the text. FSW algorithm shows the best results in both number of comparisons and attempts.

This is expected since FSW search the text using four windows, two of them starts from the middle of the text. On the other hand, ETSW and TSW uses two windows aligned at the rightmost and the leftmost sides of the text.

BR algorithm uses only one sliding window starting from the left of the text.

For example, to search for a pattern of length 4, the average number of comparisons and attempts made by FSW is 879 and 875 respectively. Compared to ETSW, TSW and BR, the results are far better in FSW than in the other algorithms.

FSW algorithm has the minimum number of comparisons and attempts performed to search for patterns of different lengths that are not found in the text as shown in Table 3.

Table 4 show the average number of comparisons and attempts performed to search for 100 patterns selected from the beginning of the text. BR algorithm has the minimum number since it searches the text using only one window from the left, i.e. from the beginning of the text. On the other hand, ETSW and TSW use two sliding windows that slide from the left side and the right side of the text which increases the number compared to BR algorithm. FSW algorithm’s results show that there is an increase in the number of comparisons and attempts performed especially if the pattern is found in the middle of the text. This is expected since four sliding widows are used.

Table 5 show the average number of comparisons and attempts performed to search for 100 patterns selected from the end of the text. The results of FSW are reasonable since the pattern is found at the end of the text. BR on the other hand performed a large number of comparisons and attempts since it searches the text starting from the left side of the text.

6. Conclusions

In this paper, we presented a new pattern―matching algorithm (FSW) which finds all occurrences of a given pattern p in a given text t using four sliding windows. The new algorithm enhances the ETSW algorithm which uses only two sliding windows. Extensive experiments have been conducted. The results show that FSW best performance appears when the pattern is found in the middle of the text. If the pattern is in the beginning or the end of the text, the number of comparisons and attempts in FSW increases compared to other algorithms.

Using four sliding windows that search the text in parallel as well as comparing the pattern from both sides simultaneously makes the FSW performance better and decreases the searching time. In the future we intend to apply the FSW algorithm to additional applications such as computational biology and search engines. Also we intend to use threads to implement the FSW algorithm.

References

  1. Simone, F. and Thierry, L. (2013) The Exact Online String Matching Problem: A Review of the Most Recent Results. ACM Computing Surveys, 45, 13.
  2. Yang, Z., Yu, J. and Kitsuregawa, M. (2010) Fast Algorithms for Top-k Approximate String Matching. Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, 11-15 July 2010.
  3. Pendlimarri, D. and Petlu, P.B.B. (2010) Novel Pattern Matching Algorithm for Single Pattern Matching. International Journal on Computer Science and Engineering, 2, 2698-2704.
  4. Bhukya, R. and Somayajulu, D. (2011) Article: Exact Multiple Pattern Matching Algorithm Using DNA Sequence and Pattern Pair. International Journal of Computer Applications, 17, 32-38. http://dx.doi.org/10.5120/2239-2862
  5. Alsmadi, I. and Nuser, M. (2012) String Matching Evaluation Methods for DNA Comparison. International Journal of Advanced Science and Technology, 47, 13-32.
  6. Bhandaru, J. and Kumar, A. (2014) A Survey of Fast Hybrid String Matching Algorithms. International Journal of Emerging Sciences, 4, 24-37.
  7. Linhai, C. (2014) An Innovative Approach for Regular Expression Matching Based on NoC Architecture. International Journal of Smart Home, 8, 45-52. http://dx.doi.org/10.14257/ijsh.2014.8.1.06
  8. Diwate, R. and Alaspurkar, S. (2013) Study of Different Algorithms for Pattern Matching. International Journal of Advanced Research in Computer science and Software Engineering, 3, 615-620.
  9. Singla, N. and Garg, D. (2012) String Matching Algorithms and Their Applicability in Various Applications. International Journal of Soft Computing and Engineering (IJSCE), I, 218-222.
  10. Guo, L., Du, S., Ren, M., Liu, Y., Li, J., He, J., Tian, N. and Li, K. (2013) Parallel Algorithm for Approximate String Matching with K Differences. IEEE Eighth International Conference on Networking, Architecture and Storage, 17-19 July 2013, 257-261. http://dx.doi.org/10.1109/NAS.2013.40
  11. Itriq, M., Hudaib, A., Al-Anani, A., Al-Khalid, R. and Suleiman, D (2012) Enhanced Two Sliding Windows Algorithm for Pattern Matching (ETSW). Journal of American Science, 8, 607- 616.
  12. Hudaib, A., Al-Khalid, R., Suleiman, D., Itriq, M. and Al-Anani, A, (2008) A Fast Pattern Matching Algorithm with Two Sliding Windows (TSW). Journal of Computer Science, 4, 393-401. http://dx.doi.org/10.3844/jcssp.2008.393.401
  13. Suleiman, D. (2014) Enhanced Berry Ravindran Pattern Matching Algorithm (EBR). Life Science Journal, 11, 395- 402.
  14. Berry, T. and Ravindran, S. (2001) A Fast String Matching Algorithm and Experimental Results. In: Holub, J. and Simanek, M., Eds., Proceedings of the Prague Stringology Club Workshop’99, Collaborative Report DC-99-05, Czech Technical University, Prague, 16-26.
  15. Khan, Z. and Pateriya, R.K. (2012) Multiple Pattern String Matching Methodologies: A Comparative Analysis. International Journal of Scientific and Research Publications, 2, 2250-3153.
  16. Claude, F., Navarro, G., Peltola, H., Salmela, L. and Tarhio, J. (2012) String Matching with Alphabet Sampling. Journal of Discrete Algorithms, 11, 37-50. http://dx.doi.org/10.1016/j.jda.2010.09.004
  17. Zhang, P. and Liu, J. (2011) An Improved Pattern Matching Algorithm in the Intrusion Detection System. Applied Mechanics and Materials, 48-49, 203-207. http://dx.doi.org/10.4028/www.scientific.net/AMM.48-49.203
  18. Faro, S. and Lecroq, T. (2012) A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms. SEA, 172-183.
  19. Suleiman, D., Hudaib, A., Al-Anani, A., Al-Khalid, R. and Itriq, M. (2013) ERS-A Algorithm for Pattern Matching. Middle East Journal of Scientific Research, 15, 1067-1075.