Pattern matching is an important problem with many applications, such as search engines and DNA analysis. Pattern matching algorithms aim to find a pattern in a text. This paper proposes a Pattern Matching Algorithm Using Changing Consecutive Characters (PMCCC) to make the searching process faster. PMCCC enhances the shift process, which determines how far the pattern moves when a mismatch occurs between the pattern and the text. It enhances the Berry-Ravindran (BR) shift function by using m consecutive characters, where m is the pattern length. The formal basis and the algorithms are presented. The experimental results show that PMCCC improves the searching process by reducing the number of comparisons and the number of attempts. Compared with other related algorithms, PMCCC shows significant improvements in the average number of comparisons and the average number of attempts.

Pattern matching is considered very important in various applications such as search engines and DNA analysis [

The main purpose of pattern matching algorithms is to find a pattern (which is relatively small) in a text (which is relatively large). They aim to make the search process faster, either by decreasing the number of comparisons, the number of attempts, or both. Some algorithms improve the searching process and the way comparisons are made between the text and the pattern [

In some algorithms, comparisons occur between the text and the pattern from only one side of the text either the left side [

Another type of enhancement concerns the shift values. The shift value is the distance the pattern moves when a mismatch occurs between the text and the pattern. This enhancement is very important since it affects the efficiency of the searching process. Most algorithms determine the shift value according to a number of consecutive characters in the text after aligning the pattern with the text. Some algorithms take one consecutive character [

This paper proposes a new Pattern Matching algorithm Using Changing Consecutive Characters (PMCCC) to make the searching process faster. PMCCC enhances the shift values, maximizing the distance the pattern moves once a mismatch between the text and the pattern occurs. The new algorithm uses only one sliding window, so the pattern is aligned with the text from the left side. The formal basis of the algorithm, its steps and its analysis are presented. Comparisons are made with existing algorithms such as BR [

The remainder of this paper is organized as follows: Section 2 presents the related work. Section 3 describes the PMCCC algorithm and analyzes its performance. Section 4 presents a working example to show how the algorithm works. Section 5 shows the experiments and the comparisons with other algorithms. Finally, Section 6 covers the conclusions and future work.

Pattern matching is one of the main topics in research since it is very important and can be used in different applications [

Enhancements have been made to make the searching process faster, either by using a different technique for the search process itself or by increasing the shift value that the pattern moves when there is a mismatch between the pattern and the text [

Boyer Moore [

Berry-Ravindran (BR) [

Shift technique used in Berry-Ravindran algorithm (BR) [

Two Sliding Windows algorithm (TSW) [

Enhanced Two Sliding Window algorithm (ETSW) [

To determine the amount of shift instead of using two consecutive characters, in case a mismatch occurs between the text and the pattern as in TSW, Enhanced Berry Ravindran (EBR) [

ERS-A [

Enhancing ERS-A algorithm for pattern matching EERS-A [

Four sliding windows pattern matching algorithm (FSW) [

A Performance Study of the Running Times of well-known Pattern Matching Algorithms for Signature-based Intrusion Detection Systems [

PMCCC uses only one pattern window to search for a pattern p in a text t from the left side of the text. When a mismatch occurs between the pattern and the text, the pattern is shifted to the right according to the shift value. The shift value is determined according to the pattern length m. Instead of taking two consecutive characters to determine the amount of shift as in TSW, three characters as in EBR or four characters as in EERS-A, the new algorithm takes the m (pattern length) consecutive text characters immediately after the window once the pattern is aligned with the left side of the text. Two additional algorithms, shift 5 and shift 6, were developed for comparison. Both use one window to scan the text from the left side, and they use five and six consecutive characters respectively to determine the amount of shift.

Comparing the results of the proposed algorithm with those of BR, EBR and RS-A shows significant improvements in the average number of attempts and comparisons. The new algorithm makes the searching process faster and more efficient.

PMCCC enhances the searching process by increasing the amount of shift applied every time there is a mismatch between the text and the pattern. It uses the m consecutive characters of the text immediately after the window aligned with the pattern, where m is the pattern length, so the number of consecutive characters is determined by the length of the pattern.

The main difference between BR [

PMCCC calculates the value of the shift using m consecutive text characters

Initially, m consecutive characters in the text have the indexes

After shifting, the m consecutive characters immediately to the right in the text will be (leftindex + 1), (leftindex + 2), (leftindex + 3), …, (leftindex + m) for

The main idea of the proposed algorithm is to use the shift function used in BR [ … where x_{1} and x_{2} are the two consecutive characters, p is the pattern, and m is the pattern length.
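
For reference, the two-character Berry-Ravindran shift can be sketched as below. This follows the form in which the BR bad-character shift is commonly stated; it is not an equation reproduced from this paper. The function name `br_shift` and the 0-based indexing are choices made for this illustration only.

```python
def br_shift(p: str, x1: str, x2: str) -> int:
    """Berry-Ravindran shift for pattern p, given the two text characters
    x1 and x2 that immediately follow the current window (0-based indexing).
    Illustrative sketch; names are assumptions, not taken from the paper."""
    m = len(p)
    if m == 0:
        raise ValueError("pattern must not be empty")
    # Shift 1: the last pattern character would land on x1.
    if p[m - 1] == x1:
        return 1
    # Shift m - i: the pair p[i]p[i+1] would land on x1x2.
    # Scanning from the right finds the smallest such shift first.
    for i in range(m - 2, -1, -1):
        if p[i] == x1 and p[i + 1] == x2:
            return m - i
    # Shift m + 1: the first pattern character would land on x2.
    if p[0] == x2:
        return m + 1
    # Neither character can take part in a match: skip past both.
    return m + 2


print(br_shift("ABACCCBAE", "C", "C"))  # 5
print(br_shift("ABACCCBAE", "A", "F"))  # 11
```

The largest possible value is m + 2, which is the limit that a longer lookahead (up to 2m for m characters) is meant to raise.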

Other algorithms also used BR shift function such as EBR [

To allow a fair comparison, we implemented two new algorithms, called shift 5 and shift 6. Both also use the Berry-Ravindran shift, but with 5 and 6 characters respectively to determine the amount of shift. The shifting process of shift 5 depends on equation (4), and the shifting process of shift 6 depends on equation (5).

The PMCCC algorithm uses m characters, where m is the pattern length, to determine the amount of shift, according to equation (6).

The PMCCC algorithm starts searching from the left side using one window. When a mismatch occurs between the text and the window, the window is shifted to the right according to the shift value in equation (6).

One pointer in the text (leftindex) and one pointer in the pattern (L) are used to compare the text with the pattern. The first character of the pattern is compared with the corresponding text character (at leftindex). If a match occurs, leftindex and L are each incremented by one. If a mismatch occurs, leftindex is incremented by the shift value and L is reset to zero.
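
The search loop described above can be sketched as follows. Equation (6) is not reproduced in this text, so the shift rule in `assumed_shift` is an assumption, chosen only because it agrees with the shifts of 13, 11 and 6 reported in the worked example of Section 4; it should not be read as the authors' definition. All function and variable names are illustrative.

```python
def assumed_shift(p: str, lookahead: str) -> int:
    """Smallest shift k in 1..2m whose test passes.

    ASSUMPTION (not the paper's equation (6)):
      * k <= m: the last k pattern characters must equal the first k
        lookahead characters;
      * m < k < 2m: only p[0] is tested, against lookahead[k - m];
      * otherwise the shift is 2m.
    Each test is a necessary condition for a match at that shift, so
    taking the smallest passing k cannot skip an occurrence.
    """
    m = len(p)
    for k in range(1, m + 1):
        if p[m - k:] == lookahead[:k]:
            return k
    for k in range(m + 1, 2 * m):
        if k - m < len(lookahead) and p[0] == lookahead[k - m]:
            return k
    return 2 * m


def pmccc_like_search(t: str, p: str):
    """Return (match positions, window start of every attempt)."""
    n, m = len(t), len(p)
    if m == 0:
        raise ValueError("pattern must not be empty")
    matches, attempts = [], []
    left = 0  # start of the current window in the text
    while left <= n - m:
        attempts.append(left)
        L = 0  # pointer into the pattern
        while L < m and t[left + L] == p[L]:
            L += 1
        if L == m:
            matches.append(left)
        # The m text characters immediately after the window
        # (fewer near the end of the text).
        lookahead = t[left + m:left + 2 * m]
        left += assumed_shift(p, lookahead)
    return matches, attempts


T = "ABECABACBAFECABAEEBEBEABACBEECABACCCBAEEBABEBEBABA"
P = "ABACCCBAE"
found, starts = pmccc_like_search(T, P)
print(found, starts[:4])
```

On the text and pattern of the worked example, this sketch reports the match at index 30, and its first four attempts start at indexes 0, 13, 24 and 30.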

Lemma 1: The time complexity in the worst case is O((n − m + 1)m).

Proof: The worst case occurs when every character of the pattern matches the text until the last character of the pattern, where a mismatch occurs, and the shift value is equal to one. This case is repeated until we reach index (n − (m + 1)) of the text.

Lemma 2: The time complexity in the best case is O(m).

Proof: The best case occurs when the pattern matches the text at the leftmost position of the text.

Lemma 3: The time complexity in the average case is O(n/(2*m)).

Proof: The average case occurs when the m consecutive characters immediately to the right of the window, after aligning the text with the pattern, are not found in the pattern. In this case the shift value is (2*m) and the time complexity is O(n/(2*m)).
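
As a rough illustration of these bounds, the expressions can be evaluated for the sizes used in the worked example (n = 50, m = 9). The figures below are plain arithmetic on the formulas, not measurements.

```latex
\begin{align*}
\text{worst case:}\quad & (n - m + 1)\cdot m = (50 - 9 + 1)\cdot 9 = 378 \text{ character comparisons} \\
\text{largest shift:}\quad & 2m = 18 \\
\text{attempts if every shift is } 2m\text{:}\quad & \left\lceil \frac{n - m + 1}{2m} \right\rceil = \left\lceil \frac{42}{18} \right\rceil = 3
\end{align*}
```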

In this section, we will give an example to explain the new algorithm.

Given a text T with n = 50.

T = “ABECABACBAFECABAEEBEBEABACBEECABACCCBAEEBABEBEBABA”, with index from 0 to 49.

And a pattern P with m = 9:

P = “ABACCCBAE”, with index from 0 to 8.

Pre-processing phase

Initially, shift value = 2*m = 18.

Searching phase

The searching process for the pattern p is explained through the following steps:

First attempt:

In the first attempt (

To determine the shift value we have to make the following comparison according to equation (6) as in

Second attempt:

In the second attempt, after shifting the pattern 13 steps to the right (

Third attempt:

In the third attempt, a mismatch occurs between the text character at index 25 (character C) and the second pattern character (character B) (see

Shift value | Comparisons |
---|---|
1 | if p[ |
2 | if p[ |
3 | if p[ |
4 | if p[ |
5 | if p[ |
6 | if p[ |
7 | if p[ |
8 | if p[ |
9 | if p[ |
10 | if p[ |
11 | if p[ |
12 | if p[ |
13 | if p[ |

Shift value | Comparisons |
---|---|
1 | if p[ |
2 | if p[ |
3 | if p[ |
4 | if p[ |
5 | if p[ |
6 | if p[ |
7 | if p[ |
8 | if p[ |
9 | if p[ |
10 | if p[ |
11 | if p[ |

Now, to determine the amount of shift, we apply equation (6), as shown in

Fourth attempt:

We align the first character of the pattern (character A) with text character at index 30 (character A). A complete match between the text and pattern occurs at index 30, (see

Multiple experiments were carried out to compare PMCCC with other algorithms. In addition to PMCCC, we implemented two other algorithms, shift 5 and shift 6, for comparison. Shift 5 and shift 6 use the same shift function as BR [

Shift value | Comparisons |
---|---|
1 | if p[ |
2 | if p[ |
3 | if p[ |
4 | if p[ |
5 | if p[ |
6 | if p[ |

Pattern length | Number of words | BR | EBR | RS-A | Shift 5 | Shift 6 | PMCCC |
---|---|---|---|---|---|---|---|
7 | 1988 | 13345 | 11749 | 10638 | 9783 | 9100 | 8056 |
8 | 1167 | 14807 | 13092 | 11922 | 11033 | 10320 | 9217 |
9 | 681 | 15892 | 14095 | 12911 | 12004 | 11273 | 10141 |
10 | 382 | 15799 | 14070 | 12927 | 12064 | 11362 | 10289 |
11 | 191 | 14243 | 12675 | 11672 | 10910 | 10298 | 9367 |
12 | 69 | 10923 | 9774 | 9030 | 8508 | 8074 | 7401 |
13 | 55 | 11370 | 10191 | 9422 | 8895 | 8466 | 7786 |
14 | 139 | 21673 | 19255 | 17845 | 16843 | 16008 | 14734 |
15 | 32 | 22384 | 19747 | 18318 | 17261 | 16435 | 15107 |
16 | 10 | 28644 | 25452 | 23531 | 22163 | 21080 | 19381 |
17 | 3 | 28148 | 25169 | 23119 | 21922 | 20895 | 19252 |

Pattern length | Number of words | BR | EBR | RS-A | Shift 5 | Shift 6 | PMCCC |
---|---|---|---|---|---|---|---|
7 | 1988 | 11953 | 10505 | 9512 | 8749 | 8139 | 7203 |
8 | 1167 | 13256 | 11704 | 10660 | 9866 | 9226 | 8241 |
9 | 681 | 14149 | 12532 | 11477 | 10673 | 10024 | 9019 |
10 | 382 | 14127 | 12567 | 11543 | 10774 | 10148 | 9188 |
11 | 191 | 12808 | 11378 | 10480 | 9798 | 9252 | 8410 |
12 | 69 | 9598 | 8584 | 7927 | 7472 | 7091 | 6507 |
13 | 55 | 10334 | 9250 | 8560 | 8069 | 7677 | 7065 |
14 | 139 | 19548 | 17356 | 16086 | 15172 | 14423 | 13273 |
15 | 32 | 19817 | 17454 | 16176 | 15249 | 14517 | 13362 |
16 | 10 | 26086 | 23176 | 21411 | 20176 | 19194 | 17662 |
17 | 3 | 22554 | 20138 | 18551 | 17570 | 16747 | 15458 |
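
To put a number on the improvement, the relative reduction of PMCCC against BR can be computed directly from the BR and PMCCC columns of the two results tables. The sketch below only restates the tabulated values. The table captions are not available in this text, so the tables are referred to by order of appearance, and the names `FIRST_TABLE`, `SECOND_TABLE` and `relative_reduction` are illustrative.

```python
# (pattern length, BR, PMCCC), copied from the two results tables
# in order of appearance.
FIRST_TABLE = [
    (7, 13345, 8056), (8, 14807, 9217), (9, 15892, 10141),
    (10, 15799, 10289), (11, 14243, 9367), (12, 10923, 7401),
    (13, 11370, 7786), (14, 21673, 14734), (15, 22384, 15107),
    (16, 28644, 19381), (17, 28148, 19252),
]
SECOND_TABLE = [
    (7, 11953, 7203), (8, 13256, 8241), (9, 14149, 9019),
    (10, 14127, 9188), (11, 12808, 8410), (12, 9598, 6507),
    (13, 10334, 7065), (14, 19548, 13273), (15, 19817, 13362),
    (16, 26086, 17662), (17, 22554, 15458),
]


def relative_reduction(rows):
    """Map pattern length -> (BR - PMCCC) / BR for one table."""
    return {length: (br - pmccc) / br for length, br, pmccc in rows}


for name, rows in (("first", FIRST_TABLE), ("second", SECOND_TABLE)):
    for length, r in relative_reduction(rows).items():
        print(f"{name} table, m={length}: {100 * r:.1f}% lower than BR")
```

For pattern length 7 in the first table this gives (13345 − 8056) / 13345 ≈ 0.396, that is, a value roughly 40% lower than BR. Across both tables the reduction stays between about 31% and 40%.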

Dataset:

All six algorithms used Book1 from the Calgary corpus as the text [

This paper proposed a new pattern matching algorithm based on Changing Consecutive Characters (PMCCC) to make the searching process faster. The PMCCC algorithm uses m consecutive characters, where m is the pattern length, to determine the shift value when a mismatch occurs between the text and the pattern. This made PMCCC faster than many other algorithms that use a shift function. Comparisons were made between PMCCC and the existing algorithms BR, EBR and RS-A, and also with two algorithms implemented for the purpose of comparison, called Shift 5 and Shift 6. The experimental results show that PMCCC is faster than the other algorithms in terms of the number of comparisons and attempts.

Amjad Hudaib, Dima Suleiman, Arafat Awajan (2016) A Fast Pattern Matching Algorithm Using Changing Consecutive Characters. Journal of Software Engineering and Applications, 09, 399-411. doi: 10.4236/jsea.2016.98026