Antiviral software systems (AVSs) have problems in detecting polymorphic variants of viruses without specific signatures for such variants. Previous alignment-based approaches for automatic signature extraction have shown how signatures can be generated from consensuses found in polymorphic variant code. Such sequence alignment approaches required variable length viral code to be extended through gap insertions into much longer equal length code for signature extraction through data mining of consensuses. Non-nested generalized exemplars (NNge) are used in this paper in an attempt to further improve the automatic detection of polymorphic variants. The important contribution of this paper is to compare a variable length data mining technique using viral source code to the previously used equal length data mining technique obtained through sequence alignment. This comparison was achieved by conducting three different experiments (i.e. Experiments I-III). Although Experiments I and II generated unique and effective syntactic signatures, Experiment III generated the most effective signatures with an average detection rate of over 93%. The implications are that future, syntactic-based smart AVSs may be able to generate effective signatures automatically from malware code by adopting data mining and alignment techniques to cover for both known and unknown polymorphic variants and without the need for semantic (run-time) analysis.
Computer worms and viruses continue to grow despite improved intrusion detection, firewall and antivirus software systems (AVSs). For a malware detection system such as an AVS, the primary issue is to detect a worm or virus variant that is not stored in its signature database. Modern detection methods are frequently unable to detect new malware variants until they make an appearance even when a signature of one variant of that particular malware is known [
Current signature extraction is by manual assessment using semantic information, by string-based syntactic approaches (see [
A data mining algorithm (i.e. rule induction algorithm) is adopted in this paper to search and extract meaningful and smart information from malware source code in the form of rules which represent patterns (code sequence signatures) in malware data. In particular, a nearest neighbor rule induction algorithm such as NNge (details provided later) may work better in noisy domains such as malware code where there may be obfuscation and deliberate introduction of redundancy. If it is possible to generate a rule-based signature automatically from known polymorphic variants, it may also be possible to automatically create signatures that can detect entirely new variants that have not previously been encountered. If this is the case, future smart AVSs can be “pre-emptive” in that they already know, to some extent, what future variants of a virus may look like based on encountering known variants of that virus. The aim of this paper is to explore this possibility in more detail.
One of the issues in applying data mining algorithms to malware data directly is the problem of variable length strings [
The significance of this paper is to continue a purely syntactic exploration of the possibility of generating signatures automatically from malware source code without the need for semantic analysis. Syntactic techniques for signature extraction based on structural detection of malware are relatively unexplored in comparison to semantic techniques (i.e. techniques based on analyzing the execution behavior of malware). The primary benefit with a syntactic or structural technique is that new and previously unknown variants can be generated from the extracted syntactic or structural rules of existing variants (see [
Previous work used sequence alignment to extract consensuses (calculated order of the most frequent symbols found in each position) from malware code variants for the purpose of generating the minimum possible number of signatures for detecting those variants and previously unseen variants. But there was no attempt made to make the most of a by-product of the alignment for data mining purposes, which is the output of equal length malware code of variants. Our task in this paper is to compare signatures produced from the outcomes of data mining the variable length malware code before alignment with the outcomes of data mining the equal length malware code after alignment to determine which method produces better signatures automatically.
Malware is typically a script or program written first in a high-level language (e.g. C, Java) and then compiled into hex code. The source code will contain instructions for the infector part (how to spread), the payload part (what action to take) and methods for encryption/decryption to hide the malware intent. The infector part also usually contains instructions on how to change the code so that new variants are produced on infection. This leads to many “variants” of the same family where the infector and payload are the same but differently coded. The run-time behavior of the variant is used by human experts to generate signatures (short strings of hex code) for storage in libraries of AVSs to scan incoming packets and the contents of memory to detect the variant and its family. One of the main problems for AVSs is that polymorphic techniques that change the order of the malware code can evade signatures that assume a constant left-to-right ordering in malware code variants. As will be seen below, some very old and well-known viruses still evade modern AVSs because their variants adopt simple code sequence changes that cannot be detected by the latest signatures.
The task of a syntactic learning system for signature generation of polymorphic malware using hex code only (i.e. no execution traces) is specified below (see
a) From the code of a set of seen variants Ps, automatically generate signatures to identify and detect unseen variants Pu, where Ps and Pu form currently known variants Pk.
b) From the code of a set of known variants Pk, automatically generate signatures to identify and detect unknown variants Px for cross-validation. In this case, Px are code variants that have not been seen before for either training or testing purposes.
The learning task is to maximize true positive rates and minimize false positive and false negative rates in both cases above. As will be seen below, previous work has addressed a) through sequence alignment techniques that use insertion operations as well as substitution matrices for matching malware code. It is currently not known whether matching techniques that work well for a) will continue to work well for b), or whether data mining techniques that look for patterns in underlying structure are required to allow generalization to unknown variants.
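The two evaluation settings a) and b) can be illustrated with a minimal sketch. It assumes that a signature "detects" a variant when it occurs as a substring of the variant's hex code; the signature, the variant strings and the function names are hypothetical and are not taken from the experiments.

```python
def detects(code, signatures):
    """Flag a sample if any signature occurs as a substring of its hex code."""
    return any(sig in code for sig in signatures)


def rates(malicious, benign, signatures):
    """True positive, false negative and false positive rates for one test set."""
    tp = sum(detects(code, signatures) for code in malicious)
    fp = sum(detects(code, signatures) for code in benign)
    tp_rate = tp / len(malicious)
    return {"tp_rate": tp_rate, "fn_rate": 1 - tp_rate, "fp_rate": fp / len(benign)}


# Hypothetical toy data (not real malware code).
signatures = ["3fa9c1"]                      # assumed to be learned from the seen set Ps
unseen_pu = ["ab3fa9c1cd"]                   # task a): known variants held out for testing
unknown_px = ["3fa9c2c1ee", "9e3fa9c100"]    # task b): variants never seen before
benign = ["00ff00ff00", "7e7e7e7e7e"]

result_a = rates(unseen_pu, benign, signatures)
result_b = rates(unknown_px, benign, signatures)
```

In this toy case the signature detects the unseen variant but only one of the two unknown variants, which is the kind of gap between a) and b) that the learning task aims to close.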
Roadmap: Section 2 and Section 3 discuss the background and limitations of previous work. Section 4 discusses previous related work relevant to this paper. Section 5 and Section 6 discuss the data mining technique and sequence representations adopted in this paper. In Section 7, we describe our systems and methods. Section 8 summarizes the key features and steps by comparing the three different sets of experiments conducted in Section 7. Section 9 discusses the results: Section 9-1) compares the data mining results obtained from the three sets of experiments against other related work, and Section 9-2) evaluates the signatures generated through the three sets of experiments against state of the art AVS products on the detection of the JS.Cassandra polymorphic virus and its known and unknown variants. Section 10 and Section 11 contain the discussions and conclusions. The paper concludes with references and an Appendix. Appendix Sections A1-A3 explain the three different sets of experiments (Experiments I-III) that were individually performed with these methods.
A key development in syntactic approaches has been adoption of string-based algorithms in bioinformatics for identifying structural matches in malware code. Such algorithms do not just look for the presence or absence of characters in specific positions but also manipulate the strings to allow for insertion of characters to expand the number of matching characters. Importantly, the results of such string manipulation are a set of equal length strings from an initial set of variable length strings. Earlier work [
A sequence-based method to signature extraction was previously proposed and illustrated utilizing the Smith-Waterman algorithm (SWA) without gap penalties [
Previous work using a sequence alignment approach [
A second limitation, as noted above, was that the alignment using SWA was “pairwise” and only allowed alignment of two viral sequences at a time in the first round of alignment. Multiple sequence alignment was then used on all pairwise consensuses to generate equal length sequences for rule-based data mining using PRISM. However, in the first round, only those regions of similarity in the pairwise alignment were extracted rather than regions of similarity across all viral sequences. It was not known how much “family” information was lost through pairwise comparison of variants. A rule-based data mining approach that allows all sequences to be used to extract signatures, on the other hand, should take into account all the information in all the sequences at the same time, including both family generic and variant specific information. This should then lead to more effective signatures.
The main body of research over the last fifteen years has concentrated on malware detection adopting semantic-based approaches and only a few adopting syntactic-based approaches. A list of approaches to automatic signature generation is presented in
Some other related and selected previous work that primarily focuses on malware detection using data mining and bioinformatics approaches are shown in
Researchers/Application | Type of Malware | Type of Approach | Description |
---|---|---|---|
Wespi et al. [ | Intrusions | Semantic | Variable length patterns from training data consisting of system call traces of commands under normal execution were analyzed by a sequence-based algorithm called Teiresias for intrusion detection. |
Honeycomb [ | Worms | Syntactic | Generate signatures that constitute individual adjoining byte strings (tokens). |
Polygraph [ | Polymorphic worms | Syntactic | Generates an array of tokens, a subsequence of tokens and Bayes signatures based on probabilistic methods to detect polymorphic worms. |
Nemean [ | Worms | Semantic | Focus on generating signatures that defend against worms. |
PAYL [ | Worms | Semantic | Produces subsequence signature tokens that associate ingress/egress payload notifications to detect the initial replication of worms. |
Hamsa [ | Polymorphic worms | Semantic | Produces a set of signature tokens that can deal with polymorphic worms by investigating their invariant activity. |
ShieldGen [ | Worms | Semantic | Generates network signatures for unseen vulnerabilities (worms) that are based on protocol-aware for instance. |
AutoRE [ | Botnets | Semantic | Produces a spam signature creation architecture from spam emails that use botnets to detect them. |
Coull and Szymanski [ | Masquerade | Semantic | Sequence alignment was used to identify masquerade detection by comparing “audit data” with legitimate user signatures extracted from their actual command line entries. |
Scheirer et al. [ | Polymorphic worms | Syntactic and Semantic | Detection of many polymorphic worms and uses intrusion detection techniques such as sliding window schemes and instruction semantics. |
Wurzinger et al. [ | Botnets | Semantic | Detects botnets that are under the influence of botmaster (malicious body) using network signatures by examining the response from a compromised host to a received command and by generating detection models. |
Botzilla [ | Malware binaries | Semantic | Produces signatures for the malicious activities (traffic) created by a malware binary executed several times within a controlled domain. |
Zhao et al. [ | General malware datasets | Semantic | Generates signatures through an inverse transcoding method by converting the malware sequential information, such as system call sequences, propagation dataflow, etc. into amino acid sequences and then aligning them using multiple sequence alignment tool. |
ProVex [ | Botnets | Semantic | Generates signatures to detect botnets that use encrypted command and control (C & C) systems after being given the keys and decryption routine employed by the malware be derived using binary code reuse strategy. |
FIRMA [ | Botnets | Semantic | Detects C & C systems but does not produce signatures for those. |
Ki et al. [ | Worms, Trojans, etc. | Semantic | Generates sequences that are typical API call sequence motifs of malicious activities belonging to several malware samples and employed multiple sequence alignment tool to align those malware samples to extract signatures. |
MalGene [ | Evasive malware samples | Semantic | Uses sequence alignment techniques on two sequences of system call events belonging to two different analysis environments: one environment in which the malware evades the AVS, and the other in which it exhibits the malicious activities. These events are used to construct an “evasion signature” using sequence alignment. |
Researchers | Data Mining | Data Set | Type of Malware | Type of Approach |
---|---|---|---|---|
Chen et al. [ | Data Mining Classifiers Algorithms i.e. ANNs (Artificial Neural Networks) i.e. JavaNNS and Symbolic Rule Extraction i.e. J48 classifier | 60 malicious files, 30 belonging to virus group and 30 belonging to worm group. | One family, with a total of 60 malicious samples, 30 each for virus and worm categories. | Extraction of hex sequences from viral and worm malicious files. Multiple sequence alignment using T-Coffee was applied on the extracted hex sequences for data mining process. |
Kumar et al. [ | Data Mining Classifier Algorithms i.e. IBK (k-nearest neighbours classifier) | Existing dataset: 323 malicious files with a combination of viruses and worms. New upcoming dataset: 323 malicious files with a combination of viruses and worms. | Virus and Worm. | Extraction of hex sequences from viral files and conversion of hex sequences into ASCII sequences. Multiple sequence alignment was applied on the converted ASCII sequences for data mining process. |
Prabha et al. [ | Data Mining Classifier Algorithms i.e. J48, KNN (K-Nearest Neighbours), Naïve Bayes. | 100 binaries out of which 90 were benign and 10 were malware binaries. | 15 subfamilies, with a total of 1056 malicious viral samples. | Extraction of hex dumps/Extraction of byte sequences in terms of n-grams of different sizes. |
Srakaew et al. [ | Data Mining Classifier Algorithm i.e. J48 by generating decision trees. | Reference Data Set: 1200 files in total out of which 900 are malicious and 300 are non-malicious. Application Data Set: 3251 files in total out of which 2951 are malicious and 300 are non-malicious. | Reference Data Set: Allapple, Podhuha and Virut viral families each containing 300 malicious samples. Application Data Set: Allapple, Podhuha and Virut viral families with 890, 8 and 2,053 malicious samples, respectively. | Statistical Features Approach: Conversion of malicious and non-malicious files into hex sequences for extracting statistical aspects using n-grams of bytes. Abstract Assembly Approach: Conversion of malicious and non-malicious files into assembly instructions for extracting selected instructions using n-grams of interesting opcodes. |
variants, let alone its unknown Px variants. The syntactic approach most closely related [
Previous use of sequence alignment and data mining has for the most part been semantic in nature, depending on system behavior patterns or using n-grams of bytes instead of code or structural patterns for the detection of malware. Also, because of their semantic nature, the generalizability of the results to new Pu variants generated through polymorphism is unknown. A purely syntactic-oriented approach, on the other hand, is based on the intuition that most new Pu (polymorphic) variants are simple syntactic variations of existing versions. The complicating aspect is variable length variations. The “expressive power” of signatures can be estimated by detecting how well these signatures generalize to unseen Pu and unknown Px variants of the same family, all obtained through polymorphic (structural) alterations to the code. The benefit of a syntactic approach is that no semantics is needed. More importantly, as will be shown below, the number of malware training instances required to extract signatures for use against unseen Pu test instances is exceptionally small given the sequence alignment and data mining approaches adopted in the experiments.
Previous work [
As an instance of a polymorphic string-based technique, consider the structurally-related set of sentences [
The cat saw the mouse (Class 1)
The mouse was seen by the cat (Class 2)
We see that the cat saw the mouse (Class 1)
We see that the mouse was seen by the cat (Class 2)
PRISM and NNge were applied to the set of four structurally related sequences by categorizing them into two classes, namely Class 1 (cat saw the mouse) and Class 2 (mouse was seen by the cat). The variable length strings were converted into equal length strings by expanding each shorter string to the length of the longest string, adding the letter “x” at the end of each short string.
PRISM gave the following rules with 75% accuracy after four iterations (“pos” = position):
If pos1 = the, pos2 = cat, pos3 = saw, pos4 = the, pos5 = cat, pos7 = the, pos8= mouse, pos9 = x and pos10 = x then Class 1
If pos2 = mouse, pos3 = was, pos4 = seen, pos6 = was, pos7 = seen, pos8 = by, pos9 = the, pos9 = x and pos10 = x then Class 2
NNge gave the following rules with 100% accuracy (“^” = conjunction; “{}” signifies disjunctive options):
Class 1 IF: pos1 in {the, we} ^ pos2 in {cat, see} ^ pos3 in {saw, that} ^ pos4 in {the} ^ pos5 in {cat, mouse} ^ pos6 in {saw, x} ^ pos7 in {the, x} ^ pos8 in {mouse, x} ^ pos9 in {x} ^ pos10 in {x}
Class 2 IF: pos1 in {the, we} ^ pos2 in {mouse, see} ^ pos3 in {was, that} ^ pos4 in {the, seen} ^ pos5 in {mouse, by} ^ pos6 in {the, was} ^ pos7 in {cat, seen} ^ pos8 in {by, x} ^ pos9 in {the, x} ^ pos10 in {cat, x}
The strings were extracted from the above-mentioned PRISM and NNge rules and are shown as follows in their corresponding classes:
PRISM:
Class 1: the cat saw the cat the mouse
Class 2: mouse was seen was seen by the
NNge:
Class 1: the we cat see saw that the cat mouse saw the mouse
Class 2: the we mouse see was that the seen mouse by the was cat seen by the cat
The results on this example string set show that NNge generated rules with 100% accuracy, whereas PRISM generated rules with 75% accuracy. One of the aims of this paper is to determine whether this result is generalizable to many more instances of strings (variants) belonging to different classes (families).
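The cat/mouse example can be replayed in a few lines. The sketch below pads the four sentences with “x”, checks which of the two NNge rules quoted above covers each padded sentence, and rebuilds the extracted strings by concatenating the non-padding values of each rule position. The rule values are transcribed from the text; the function names are illustrative only.

```python
PAD = "x"

SENTENCES = [
    ("The cat saw the mouse", 1),
    ("The mouse was seen by the cat", 2),
    ("We see that the cat saw the mouse", 1),
    ("We see that the mouse was seen by the cat", 2),
]

# NNge rules transcribed from the text: one tuple of allowed values per position.
NNGE_RULES = {
    1: [("the", "we"), ("cat", "see"), ("saw", "that"), ("the",), ("cat", "mouse"),
        ("saw", "x"), ("the", "x"), ("mouse", "x"), ("x",), ("x",)],
    2: [("the", "we"), ("mouse", "see"), ("was", "that"), ("the", "seen"), ("mouse", "by"),
        ("the", "was"), ("cat", "seen"), ("by", "x"), ("the", "x"), ("cat", "x")],
}


def pad_tokens(sentence, length):
    """Lower-case, split into words and right-pad with 'x' to a fixed length."""
    tokens = sentence.lower().split()
    return tokens + [PAD] * (length - len(tokens))


def covering_classes(tokens):
    """Classes whose rule admits the token found at every position."""
    return [cls for cls, rule in NNGE_RULES.items()
            if all(tok in allowed for tok, allowed in zip(tokens, rule))]


def extract_string(rule):
    """Concatenate the non-padding values of a rule, position by position."""
    return " ".join(value for allowed in rule for value in allowed if value != PAD)


longest = max(len(sentence.split()) for sentence, _ in SENTENCES)
padded = [(pad_tokens(sentence, longest), label) for sentence, label in SENTENCES]
predictions = [covering_classes(tokens) for tokens, _ in padded]
extracted = {cls: extract_string(rule) for cls, rule in NNGE_RULES.items()}
```

Each padded sentence is covered by exactly one rule, the one for its own class, which is what the 100% accuracy figure for NNge reflects on this toy set.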
NNge, first introduced by Martin (1995), is a nearest neighbor algorithm and an expansion of Nge [
class B if p1 = (2 or 4 or 6) AND
p2 = (22) AND
(p3 ≥ 9 AND p3 ≤ 32) AND
p4 = (b or c)
This hyperrectangle covers strings “42210b” and “62231c” but not “3118b”, for instance. Within the NNge algorithm [
NNge Algorithm:
For each instance In in the training collection do:
    Locate the hyperrectangle Gb which is nearest to In /*Classification Stage*/
    IF D(Gb, In) = 0 THEN
        IF class(In) ≠ class(Gb) THEN Divide/Split(Gb, In) /*Adjustment Stage*/
    ELSE
        G’ := Extend(Gb, In) /*Generalization Stage*/
        IF G’ overlaps with inconsistent hyperrectangles
            THEN add In as a non-generalized exemplar
            ELSE Gb := G’
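The generalization branch of this pseudocode can be sketched for categorical attributes only. This is a simplified illustration, not the Weka implementation: it assumes that each hyperrectangle is a list of value sets (one per position), that distance is the number of positions not covered, and that the nearest hyperrectangle is searched among those of the same class. The adjustment (split) stage is deliberately left out. All names are hypothetical.

```python
PAD = "x"


def covered(instance, sides):
    """True if every attribute value lies inside the hyperrectangle."""
    return all(value in side for value, side in zip(instance, sides))


def mismatch_count(instance, sides):
    """Simplified distance for nominal attributes: number of uncovered positions."""
    return sum(value not in side for value, side in zip(instance, sides))


def overlaps(sides_a, sides_b):
    """Two hyperrectangles overlap if they share a value at every position."""
    return all(a & b for a, b in zip(sides_a, sides_b))


def train(instances):
    model = []  # entries are [label, sides], where sides is a list of sets
    for instance, label in instances:
        if any(lab != label and covered(instance, sides) for lab, sides in model):
            raise NotImplementedError("adjustment (split) stage is not sketched here")
        same_class = [entry for entry in model if entry[0] == label]
        if not same_class:
            model.append([label, [{v} for v in instance]])
            continue
        nearest = min(same_class, key=lambda entry: mismatch_count(instance, entry[1]))
        extended = [side | {v} for side, v in zip(nearest[1], instance)]
        if any(lab != label and overlaps(extended, sides) for lab, sides in model):
            # Extension would clash with another class: keep the instance as it is.
            model.append([label, [{v} for v in instance]])
        else:
            nearest[1] = extended
    return model


raw = [
    ("The cat saw the mouse", 1),
    ("The mouse was seen by the cat", 2),
    ("We see that the cat saw the mouse", 1),
    ("We see that the mouse was seen by the cat", 2),
]
longest = max(len(sentence.split()) for sentence, _ in raw)
data = [(s.lower().split() + [PAD] * (longest - len(s.split())), label) for s, label in raw]
model = train(data)
```

Run on the four padded cat/mouse sentences, this sketch ends with one hyperrectangle per class whose value sets match the two NNge rules listed earlier for that example.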
The classification stage is formulated based on the distance D(I, G) between an instance I = (I1, I2, …, In) and a hyperrectangle G as shown in Equation (1) (Classification Stage).
$$D(I, G) = \sqrt{\sum_{k=1}^{n} \left( w_k \, \frac{d(I_k, G_k)}{I_k^{\max} - I_k^{\min}} \right)^{2}} \qquad (1)$$

In Equation (1), $I_k^{\min}$ and $I_k^{\max}$ delimit the range of numerical values across the training collection that correspond to attribute $k$. For categorical (i.e. nominal) attributes, the size of this range is always 1. $G_k$ is the interval $[G_k^{\min}, G_k^{\max}]$ if $I_k$ is a quantitative attribute, and a list of values if $I_k$ is a categorical attribute. The distance between an attribute value and the corresponding “side” of the hyperrectangle depends on the type of the attribute, as illustrated in Equation (2) (Distance between the Corresponding Hyperrectangle), where $d_{nom}$ applies to categorical attributes and $d_{num}$ to quantitative attributes.

$$d_{nom}(I_k, G_k) = \begin{cases} 0, & I_k \in G_k \\ 1, & \text{otherwise} \end{cases} \qquad d_{num}(I_k, G_k) = \begin{cases} 0, & G_k^{\min} \le I_k \le G_k^{\max} \\ G_k^{\min} - I_k, & I_k < G_k^{\min} \\ I_k - G_k^{\max}, & I_k > G_k^{\max} \end{cases} \qquad (2)$$
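A minimal sketch of Equations (1) and (2), applied to the “class B” hyperrectangle described earlier. Several details are assumptions made for illustration: the strings are split into four attributes (for example “42210b” as 4, 22, 10, b and “3118b” as 3, 11, 8, b), p3 is treated as the only quantitative attribute with a hypothetical training range of 0 to 40, all weights are 1, and the square root is the usual Euclidean form (it does not change which hyperrectangle is nearest).

```python
import math


def side_distance(value, side):
    """Equation (2): distance from one attribute value to one side."""
    if isinstance(side, tuple):            # quantitative attribute: (min, max) interval
        low, high = side
        if value < low:
            return low - value
        if value > high:
            return value - high
        return 0.0
    return 0.0 if value in side else 1.0   # categorical attribute: set of values


def distance(instance, hyperrectangle, ranges, weights=None):
    """Equation (1): weighted, range-normalised distance to a hyperrectangle."""
    weights = weights or [1.0] * len(instance)
    total = 0.0
    for value, side, value_range, weight in zip(instance, hyperrectangle, ranges, weights):
        total += (weight * side_distance(value, side) / value_range) ** 2
    return math.sqrt(total)


# "class B" hyperrectangle: p1 in {2, 4, 6}, p2 = 22, 9 <= p3 <= 32, p4 in {b, c}.
CLASS_B = [{"2", "4", "6"}, {"22"}, (9, 32), {"b", "c"}]
RANGES = [1, 1, 40, 1]   # range is 1 for categorical attributes; 40 is a hypothetical range for p3

d_covered_1 = distance(("4", "22", 10, "b"), CLASS_B, RANGES)
d_covered_2 = distance(("6", "22", 31, "c"), CLASS_B, RANGES)
d_outside = distance(("3", "11", 8, "b"), CLASS_B, RANGES)
```

The two covered strings are at distance zero, while the third string lies outside on p1, p2 and (by one unit, scaled by the assumed range) p3.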
The constant wk signifies weights corresponding to attributes and can be regulated throughout the training procedure [
The adjustment stage is applied when a previously created hyperrectangle covers an instance associated with a different class. To avoid creating nested hyperrectangles, NNge adjusts the current hyperrectangle so that the inconsistent instance is excluded. This is accomplished by splitting the hyperrectangle into two or more hyperrectangles and potentially into a few isolated variants/instances. The generalization stage modifies the “border” of the nearest hyperrectangle of the same class as the training case in order to cover it. The extension is accepted only when the extended hyperrectangle does not overlap with hyperrectangles of a different class. If an overlap is detected, the training case is included in the model as a non-generalized exemplar [
In the experiments that follow, two different types of code representation are tested for data mining using NNge. The first type uses the hex representation and the second uses a DNA version of the hex representation, using the conversion rules as follows:
Conversion of hexadecimal into binary code was accomplished employing the subsequent rules: “1” → “0001”; “2” → “0010”; “3” → “0011”; “4” → “0100”; “5” → “0101”; “6” → “0110”; “7” → “0111”; “8” → “1000”; “9” → “1001”; “0” → “0000”; “a” → “1010”; “b” → “1011”; “c” → “1100”; “d” → “1101”; “e” → “1110”; and “f” → “1111”. Successive conversion of the binary code into DNA sequences was accomplished employing the subsequent rules: “00” → “A”; “11” → “T”; “10” → “G”; and “01” → “C”.
So, for instance, the hex string “1234567890abcdef” becomes “0001001000110100010101100111100010010000101010111100110111101111” (binary code) and then becomes “ACAGATCACCCGCTGAGCAAGGGTTATCTGTT” (DNA sequence).
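The conversion rules above translate directly into code. The following sketch implements the hex to binary to DNA mapping and its inverse (as needed when meta-signatures are converted back into hex); the function names are illustrative, and the inverse assumes a DNA string of even length.

```python
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def hex_to_binary(hex_string):
    """Each hex digit becomes four bits, e.g. 'a' -> '1010'."""
    return "".join(format(int(digit, 16), "04b") for digit in hex_string)


def binary_to_dna(bits):
    """Each pair of bits becomes one DNA letter, e.g. '10' -> 'G'."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def hex_to_dna(hex_string):
    return binary_to_dna(hex_to_binary(hex_string))


def dna_to_hex(dna):
    """Inverse mapping; assumes an even number of DNA letters."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return "".join(format(int(bits[i:i + 4], 2), "x") for i in range(0, len(bits), 4))


example_hex = "1234567890abcdef"
example_dna = hex_to_dna(example_hex)
```

Because every hex digit maps to exactly two DNA letters, the DNA representation is always twice as long as the hex representation.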
The experiments are intended to check whether data mining using DNA code produces better results than using hex code. Once viral code is converted to DNA code, sequence alignment using publicly validated and provably tested alignment software becomes possible.
Also, in the experiments below, “padding” was required to convert variable length viral strings into equal length strings for two of the experiments (Experiments I and II). For example, given hex strings “13ad3” and “245335623f”, padding produces “13ad3xxxxx” so that both strings have the same length (10 characters).
Methods Overview (Experiments I-III): The method in Experiment I consists of six steps, summarized as follows. Step-1 deals with virus code variant generation Pk and separating the training set Ps from the test set Pu. Step-2 deals with the process of variable length data mining on a small percentage of the training Ps and test Pu sets using NNge classifier to generate rules for string extraction. Step-3 deals with the extraction of common training sequences (i.e. strings, or first-level rule-based consensuses) using the NNge rules. Step-4 deals with converting the hex code of the training Ps and test Pu sets (obtained from Step-1) as well as first-level consensuses (obtained from Step-3) into a form (in this case, DNA) acceptable for sequence alignment. Step-5a deals with the process of pairwise (local) sequence alignment between the first-level consensuses and some variants of the training set Ps (both obtained from Step-4) using the SWA to produce equal length sequences (i.e. second-level consensuses). Step-5b deals with the extraction of meta-signatures, or common substrings, from these second-level consensuses. Step-6 deals with the conversion of meta-signatures back into viral hex code for the purpose of signature testing against Pk and Px viral sets. More details concerning each step are supplied in
The method in Experiment II consists of six steps. The same procedure as Experiment I was used along with the same training Ps and test Pu sets, with the only difference being that some variants of the training set Ps were converted into DNA format prior to the process of variable length data mining. More details concerning each step are supplied in
The method in Experiment III consists of seven steps. The same procedure as Experiments I and II was adopted and the same training Ps and test Pu sets were used, with the only difference being an additional step of multiple sequence alignment on the training set Ps to produce equal length sequences prior to the process of equal length data mining. More details concerning each step are supplied in
Experiment I consists of taking 22 viral strings in hex (11 malicious (set M) and 11 non-malicious (set NM)), applying NNge to MHEX and NMHEX, and converting the NNge results into two variable length strings (N1HEX, N2HEX), as shown in the “cat mouse” examples previously. The hex strings are then converted to DNA for pairwise sequence alignment between N1DNA and Ps on the one hand and N2DNA and Ps on the other. This produces consensuses C1DNA (between N1DNA and Ps) and C2DNA (between N2DNA and Ps), and these consensuses become the meta-signatures for use against Pk and Px after conversion back into hex (i.e. C1HEX and C2HEX). Therefore, the viral code remains in hex format until just before pairwise sequence alignment. Set NM in this paper is defined as variants generated by eliminating the key polymorphic functions of the malware, so that they are only partially functional and have no payload properties.
Experiment II consists of taking 22 viral strings in hex (11 malicious (set M) and 11 non-malicious (set NM)). MHEX and NMHEX are then converted to DNA before applying NNge to MDNA and NMDNA, and converting the NNge results into two variable length strings (N1DNA, N2DNA). N1DNA and N2DNA are then pairwise aligned against Ps. This produces C1DNA (between N1DNA and Ps) and C2DNA (between N2DNA and Ps), and C1DNA and C2DNA become the meta-signatures for use against Pk and Px after conversion back into hex (i.e. C1HEX and C2HEX). The difference between Experiment I and Experiment II is that the viral strings are converted to DNA first before NNge is applied.
Experiment III consists of taking 22 viral strings in hex (11 malicious (set M) and 11 non-malicious (set NM)). MHEX and NMHEX are then converted to DNA. Multiple sequence alignment is then applied on MDNA and NMDNA to produce equal length sequences ME and NME. Then NNge is applied to ME and NME to produce variable length strings N1DNA and N2DNA. N1DNA and N2DNA are then pairwise aligned against Ps. This produces C1DNA (between N1DNA and Ps) and C2DNA (between N2DNA and Ps), and C1DNA and C2DNA become the meta-signatures for use against Pk and Px after conversion back into hex (i.e. C1HEX and C2HEX). The difference between Experiment II and Experiment III is that the viral strings are multiply aligned first to produce equal length strings before NNge is applied.
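The pairwise (local) alignment step shared by all three experiments can be sketched with a textbook Smith-Waterman implementation, followed by the extraction of common substrings from the aligned region. The scoring values (match 2, mismatch -1, gap -1), the minimum substring length and the two short DNA strings are assumptions chosen for illustration; they are not the settings or data used in the experiments.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned part of a, aligned part of b) for a local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


def common_runs(aligned_a, aligned_b, min_len=4):
    """Common substrings: maximal runs of identical, non-gap aligned characters."""
    runs, current = [], []
    for x, y in zip(aligned_a, aligned_b):
        if x == y and x != "-":
            current.append(x)
        else:
            if len(current) >= min_len:
                runs.append("".join(current))
            current = []
    if len(current) >= min_len:
        runs.append("".join(current))
    return runs


# Hypothetical short DNA strings standing in for a first-level consensus and a variant.
best_score, part_a, part_b = smith_waterman("ACAGATCACC", "TTAGATCAGG")
candidates = common_runs(part_a, part_b)
```

Here the shared region “AGATCA” is recovered as the only candidate substring; in the experiments, such substrings would then be converted back into hex before testing.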
1) Comparison of the Data Mining Results Obtained from Three Sets of Experiments as Well as from Other Related and Selected Previous Work
Features/Steps | Experiment I (variable length data mining) | Experiment II (variable length data mining) | Experiment III (equal length data mining) |
---|---|---|---|
Hex to DNA conversion | For the process of pairwise sequence alignment only. | For the processes of data mining and pairwise sequence alignment. | For the processes of multiple sequence alignment, data mining and pairwise sequence alignment. |
Multiple sequence alignment for the process of data mining | No | No | Yes |
Conversion of variable length sequences into equal length sequences | By adding the letter “x” towards the end of each sequence until all the variable length sequences were of equal lengths. | By adding the letter “X” towards the end of each sequence until all the variable length sequences were of equal lengths. | By the process of multiple sequence alignment. All the gaps introduced by the process of alignment were substituted by “X”. |
Total number of attributes for the process of data mining | 24,565 | 49,129 | 93,438 |
Total number of labels for the process of data mining | 17 (hex labels: a - f, 0 - 9 and x) | Five (DNA labels: A, T, G, C and X) | Five (DNA labels: A, T, G, C and X) |
File size of the ARFF file | 2.49 MB | 3.87 MB | 7.38 MB |
Total time taken to generate NNge results by Weka | 2 minutes and 32 seconds | 6 minutes and 13 seconds | 32 minutes and 28 seconds |
Time taken to build model | 0.62 second | 0.73 second | 1.23 seconds |
Correctly classified instances (%)―Accuracy | 22/22 (100.00%) | 0/22 (0.00%) | 22/22 (100.00%) |
Incorrectly classified instances (%)―Inaccuracy | 0/22 (0.00%) | 22/22 (100.00%) | 0/22 (0.00%) |
Kappa statistic | 1 | −1 | 1 |
Mean absolute error | 0 | 1 | 0 |
Root mean squared error | 0 | 1 | 0 |
Relative absolute error (%) | 0.00% | 200.00% | 0.00% |
Root relative squared error (%) | 0.00% | 200.00% | 0.00% |
Total number of instances | 22 | 22 | 22 |
Total number of rules generated | Two (one for malicious class and one for non-malicious class) | Two (one for malicious class and one for non-malicious class) | Three (one for malicious class and two for non-malicious class) |
Sequence lengths of extracted hex/DNA data (first-level consensuses) from NNge rules | Malicious (hex): 123,338 Non-Malicious (hex): 37,249 | Malicious (DNA): 132,103 Non-Malicious (DNA): 41,670 | Malicious (DNA): 161,495 Non-Malicious 1 (DNA): 59,740 Non-Malicious 2 (DNA): 11,860 |
Total number of pairwise alignments performed | Six (three each for malicious and non-malicious classes) | Six (three each for malicious and non-malicious classes) | Nine (three each for malicious, non-malicious 1 and non-malicious 2 classes) |
Total number of meta-signatures (C1HEX, C2HEX) generated | Nine (Four for malicious class and five for non-malicious class) | 14 (Nine for malicious class and five for non-malicious class) | 48 (31 for malicious class, nine for non-malicious class 1 and eight for non-malicious class 2) |
Total number of unique meta-signatures (C1HEX, C2HEX) | Five | Five | 43 |
Total number of common meta-signatures (C1HEX, C2HEX) | Four | Nine | Five |
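The accuracy and kappa values in the table above can be checked from a two-class confusion matrix. The sketch below assumes 11 malicious and 11 non-malicious instances; for Experiment II, 0/22 correct in a two-class problem implies that every instance was assigned to the opposite class. The function name and dictionary keys are illustrative.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, TP/FP rates, precision, recall, F1 and Cohen's kappa for two classes."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {"accuracy": accuracy, "tp_rate": recall, "fp_rate": fp_rate,
            "precision": precision, "recall": recall, "f1": f1, "kappa": kappa}


all_correct = binary_metrics(tp=11, fn=0, fp=0, tn=11)   # Experiments I and III
all_wrong = binary_metrics(tp=0, fn=11, fp=11, tn=0)     # Experiment II
```

With balanced classes the chance agreement is 0.5, so perfect classification gives a kappa of 1 and systematic misclassification gives a kappa of -1.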
Data Mining based Techniques | Evaluation Setting | Correctly Classified Instances (%) | Incorrectly Classified Instances (%) | TP (True Positive) Rate | FP (False Positive) Rate | Precision | Recall | F1 Score |
---|---|---|---|---|---|---|---|---|
Experiment I (Variable length) | - | 100.00% | 0.00% | 1 | 0 | 1 | 1 | 1 |
Experiment II (Variable length) | - | 0.00% | 100.00% | 0 | 1 | 0 | 0 | 0 |
Experiment III (Equal length) | - | 100.00% | 0.00% | 1 | 0 | 1 | 1 | 1 |
Chen et al. [ | Training | 85.00% | 15.00% | - | - | - | - | - |
Chen et al. [ | 5-fold cross validation | 60.00% | 40.00% | - | - | - | - | - |
Chen et al. [ | 10-fold cross validation | 63.33% | 36.67% | - | - | - | - | - |
Chen et al. [ | 15-fold cross validation | 68.33% | 31.67% | - | - | - | - | - |
Chen et al. [ | 20-fold cross validation | 60.00% | 40.00% | - | - | - | - | - |
Chen et al. [ | Training | 96.67% | 3.33% | - | - | - | - | - |
Chen et al. [ | 5-fold cross validation | 78.33% | 21.67% | - | - | - | - | - |
Chen et al. [ | 10-fold cross validation | 66.67% | 33.33% | - | - | - | - | - |
Chen et al. [ | 15-fold cross validation | 70.00% | 30.00% | - | - | - | - | - |
Chen et al. [ | 20-fold cross validation | 63.33% | 36.67% | - | - | - | - | - |
Kumar et al. [ | Existing (known) dataset (Average) | 95.9752% | 4.0248% | 0.96 | 0.094 | 0.962 | 0.96 | 0.959 |
Kumar et al. [ | New (unknown) dataset (Average) | 86.6873% | 13.3127% | 0.867 | 0.275 | 0.872 | 0.867 | 0.858 |
Prabha et al. [ | - | - | - | - | - | - | - | - |
Statistical method by Srakaew et al. [ | Reference Set | 98.9167% | 1.0833% | - | - | - | - | - |
Statistical method by Srakaew et al. [ | Application Set | 95.0477% | 4.9523% | - | - | - | - | - |
Statistical method by Srakaew et al. [ | 10-fold cross validation | 95.333% | 4.667% | - | - | - | - | - |
Abstract assembly method by Srakaew et al. [ | Reference Set | 99.75% | 0.25% | - | - | - | - | - |
Abstract assembly method by Srakaew et al. [ | Application Set | 98.39% | 1.661% | - | - | - | - | - |
Abstract assembly method by Srakaew et al. [ | 10-fold cross validation | 99.5% | 0.5% | - | - | - | - | - |
as true positive rate, false positive rate, precision, recall and F1 score were not reported. These results are not presented here.
Experiments I and III gave results which outperformed those previously reported achieving 100% correctly classified instances and thus 0% incorrectly classified instances (see
2) An Evaluation of the State of the Art AVS Products and Meta-Signatures (C1HEX and C2HEX) on the Detection of JS.Cassandra Virus and Its Known Pk and Unknown Px Variants
Files Scanned | Metrics | Virus Detection Method | ||||
---|---|---|---|---|---|---|
AVG | AntiVir | ClamAV | ESET | F-Prot | ||
352 known (Pk) JS.Cassandra Malicious Variants | Detection Ratio (Accuracy) | 312/352 (88.64%) | 25/352 (7.10%) | 340/352 (96.59%) | 296/352 (84.09%) | 4/352 (1.14%) |
Sensitivity/Recall | 88.64% | 7.10% | 96.59% | 84.09% | 1.14% | |
Specificity | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
Precision | 100% | 100% | 100% | 100% | 100% | |
F1 Score | 93.97% | 13.26% | 98.26% | 91.36% | 2.25% | |
43 JS.Cassandra Non-Malicious (Pu) Variants | Detection Ratio (Accuracy) | 0/43 (0.00%) | 1/43 (2.32%) | 0/43 (0.00%) | 0/43 (0.00%) | 0/43 (0.00%) |
Sensitivity/Recall | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
Specificity | 100% | 97.67% | 100% | 100% | 100% | |
Precision | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
F1 Score | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
352 Random JavaScript Files | Detection Ratio (Accuracy) | 0/352 (0.00%) | 0/352 (0.00%) | 0/352 (0.00%) | 0/352 (0.00%) | 0/352 (0.00%) |
Sensitivity/Recall | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
Specificity | 100% | 100% | 100% | 100% | 100% | |
Precision | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
F1 Score | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
Files Scanned | Metrics | Malicious C1HEX1 (I), C1HEX3 (II), non-malicious C2HEX41 (III) and C2HEX43 (III) | Malicious C1HEX3 (I) and C1HEX6 (II) | Malicious C1HEX7 (II) | Malicious C1HEX4 (I), C1HEX9 (II), non-malicious C2HEX37 (III) | Malicious C1HEX5 (III) |
---|---|---|---|---|---|---|
352 known (Pk) JS.Cassandra Malicious Variants | Detection Ratio (Accuracy) | 340/352 (96.59%) | 85/352 (24.15%) | 325/352 (92.33%) | 352/352 (100%) | 340/352 (96.59%) |
Sensitivity/Recall | 96.59% | 24.15% | 92.33% | 100% | 96.59% | |
Specificity | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
Precision | 100% | 100% | 100% | 100% | 100% | |
F1 Score | 98.26% | 38.90% | 96.01% | 100% | 98.26% |
43 JS.Cassandra Non-Malicious (Pu) Variants | Detection Ratio (Accuracy) | 6/43 (13.95%) | 1/43 (2.32%) | 20/43 (46.51%) | 43/43 (100%) | 8/43 (18.60%) |
Sensitivity/Recall | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
Specificity | 86.05% | 97.67% | 53.49% | 0.00% | 81.39% | |
Precision | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
F1 Score | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
352 Random JavaScript Files | Detection Ratio (Accuracy) | 0/352 (0.00%) | 0/352 (0.00%) | 0/352 (0.00%) | 0/352 (0.00%) | 0/352 (0.00%) |
Sensitivity/Recall | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
Specificity | 100% | 100% | 100% | 100% | 100% | |
Precision | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
F1 Score | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
Files Scanned | Metrics | Malicious C1HEX9 (III) | Malicious C1HEX15 (III) | Malicious C1HEX20 (III) | Malicious C1HEX24 (III) | Malicious C1HEX26 (III) |
---|---|---|---|---|---|---|
352 known (Pk) JS.Cassandra Malicious Variants | Detection Ratio (Accuracy) | 329/352 (93.46%) | 344/352 (97.73%) | 191/352 (54.26%) | 202/352 (57.39%) | 352/352 (100%) |
Sensitivity/Recall | 93.46% | 97.73% | 54.26% | 57.39% | 100% | |
Specificity | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
Precision | 100% | 100% | 100% | 100% | 100% | |
F1 Score | 96.62% | 98.85% | 70.35% | 72.93% | 100% | |
43 JS.Cassandra Non-Malicious (Pu) Variants | Detection Ratio (Accuracy) | 1/43 (2.32%) | 29/43 (67.44%) | 9/43 (20.93%) | 14/43 (32.56%) | 43/43 (100%) |
Sensitivity/Recall | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
Specificity | 97.67% | 32.56% | 79.07% | 67.44% | 0.00% | |
Precision | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
F1 Score | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
352 Random JavaScript Files | Detection Ratio (Accuracy) | 0/352 (0.00%) | 0/352 (0.00%) | 0/352 (0.00%) | 0/352 (0.00%) | 0/352 (0.00%) |
Sensitivity/Recall | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
Specificity | 100% | 100% | 100% | 100% | 100% | |
Precision | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
F1 Score | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
Files Scanned | Metrics | Malicious C1HEX27 (III) | Non-malicious C2HEX7 (I), C2HEX11 (II) | Non-malicious C2HEX8 (I) | Non-malicious C2HEX12 (II) | Non-malicious C2HEX35 (III) |
---|---|---|---|---|---|---|
352 known (Pk) JS.Cassandra Malicious Variants | Detection Ratio (Accuracy) | 140/352 (39.77%) | 339/352 (96.31%) | 140/352 (39.77%) | 325/352 (92.33%) | 352/352 (100%) |
Sensitivity/Recall | 39.77% | 96.31% | 39.77% | 92.33% | 100% | |
Specificity | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
Precision | 100% | 100% | 100% | 100% | 100% | |
F1 Score | 56.91% | 98.12% | 56.91% | 96.01% | 100% |
43 JS.Cassandra Non-Malicious (Pu) Variants | Detection Ratio (Accuracy) | 3/43 (6.98%) | 37/43 (86.04%) | 16/43 (37.21%) | 20/43 (46.51%) | 43/43 (100%) |
Sensitivity/Recall | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
Specificity | 93.02% | 13.95% | 62.79% | 53.49% | 0.00% | |
Precision | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
F1 Score | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
352 Random JavaScript Files | Detection Ratio (Accuracy) | 0/352 (0.00%) | 0/352 (0.00%) | 0/352 (0.00%) | 0/352 (0.00%) | 0/352 (0.00%) |
Sensitivity/Recall | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
Specificity | 100% | 100% | 100% | 100% | 100% | |
Precision | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
F1 Score | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
and the non-malicious meta-signatures C2HEX35 (III) and C2HEX37 (III) successfully detected all 352 known (Pk) malicious variants of the JS.Cassandra polymorphic virus, where (I), (II) and (III) represent the meta-signatures (C1HEX and C2HEX) generated from Experiments I, II and III. None of the five state of the art AVSs fully detected all known (Pk) JS.Cassandra variants. Scan results for AVG, AntiVir and F-Prot AVS products were obtained from an open source online website known as “Gary’s Hood” [
In total, 71 meta-signatures (9 meta-signatures from Experiment I, 14 meta-signatures from Experiment II and 48 meta-signatures from Experiment III) were generated from malicious and non-malicious sequences. All the 71 meta-signatures (C1HEX and C2HEX) were scanned/tested against the 352 known (Pk) JS.Cassandra malicious variants, 43 JS.Cassandra non-malicious (Pu) variants and 352 random JavaScript files individually by placing these meta-signatures inside their own generated (.ndb) database [
Malicious C1HEX4 (I), C1HEX9 (II), C1HEX26 (III) along with non-malicious C2HEX35 (III) and C2HEX37 (III) were the only five meta-signatures (C1HEX and C2HEX) that fully detected all 43 non-malicious (Pu) JS.Cassandra variants. These meta-signatures (C1HEX and C2HEX) not only detected 352 malicious (Pk) variants successfully but also detected 43 non-malicious (Pu) variants. As noted in
The same batch of 71 meta-signatures (C1HEX and C2HEX) was once again tested against the 100 unknown (Px) JS.Cassandra malicious variants by using the own generated (.ndb) database [
Files Scanned | Metrics | Virus Detection Method | ||||
---|---|---|---|---|---|---|
ClamAV | Bitdefender Total Security 2017 | Nine Meta-Signatures (Experiment I) | 14 Meta-Signatures (Experiment II) | 48 Meta-Signatures (Experiment III) | ||
100 unknown (Px) JS.Cassandra Malicious Variants | Detection Ratio (Accuracy) | 85/100 (85.00%) | 0/100 (0.00%) | 100/100 (100.00%) | 100/100 (100.00%) | 100/100 (100.00%) |
Sensitivity/Recall | 85.00% | 0.00% | 100.00% | 100.00% | 100.00% | |
Specificity | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | |
Precision | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | |
F1 Score | 91.89% | 0.00% | 100.00% | 100.00% | 100.00% |
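The metrics reported in these tables follow from the usual confusion-matrix definitions. The sketch below is illustrative only: the function name `detection_metrics` and the convention of returning 0 when a denominator is empty are assumptions, and the tables may follow a different convention for undefined ratios.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute standard detection metrics from confusion-matrix counts.

    Assumption: a ratio with an empty denominator is reported as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)          # sensitivity
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# ClamAV on the 100 unknown (Px) variants: 85 detected, 15 missed,
# and no benign files in the scanned set.
clamav_px = detection_metrics(tp=85, fp=0, tn=0, fn=15)
```

With 85 of 100 malicious files detected and no benign files scanned, recall is 85%, precision is 100% and the F1 score is 2 × 0.85 / 1.85, in line with the ClamAV column above.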
Clamscan had overall accuracies of 100%, across all three experiments (see
The 71 meta-signatures (C1HEX and C2HEX) were tested for false positives. First, any duplicate meta-signatures (C1HEX and C2HEX) along with meta-signatures (C1HEX and C2HEX) that were six characters or below were removed. In total, 26 meta-signatures (i.e. 16 malicious C1HEX and 10 non-malicious C2HEX) were removed from the generated (.ndb) database [
Figures 2(a)-(c) are the screenshots of the scan results indicating that 352 of the 352 known (Pk) malicious variants, 43 of the 43 non-malicious (Pu) variants and 100 of the 100 unknown (Px) malicious variants were successfully identified as infected by the Clamscan antivirus scanner using the 45 meta-signatures (C1HEX and C2HEX).
detected as false positives (0.159% false positive rate) using the 45 meta-signatures (C1HEX and C2HEX), thereby satisfying the false positive rate requisite of 0.1%.
The experiments conducted in this paper show that Experiment III (equal length data mining technique) gave the highest number of successful meta-signatures (C1HEX and C2HEX) in comparison to Experiments I and II (variable length data mining technique). Experiment II gave the lowest number of successful meta-signatures (C1HEX and C2HEX). Not only did Experiment III give the highest number of meta-signatures (C1HEX and C2HEX), but it also gave the highest number of effective meta-signatures (C1HEX and C2HEX). Moreover, Experiment III generated meta-signatures (C1HEX and C2HEX) that were not generated in Experiments I and II. Performing multiple sequence alignment prior to data mining significantly improved both the quality and quantity of meta-signatures (C1HEX and C2HEX) in comparison to Experiments I and II. In comparison to previously reported work (see Section 4 and Section 5), the syntactic approach to automatic signature generation using NNge has successfully addressed the limitations of previous work by generating signatures in the quickest, simplest and most accurate manner.
In total, 45 out of the 71 overall meta-signatures (C1HEX and C2HEX), i.e. around 63.38% (33.80% malicious (24/71) and 29.58% non-malicious (21/71)), were effective, i.e. they detected seen (Ps) and unseen (Pu) variants from the two groups (malicious and non-malicious). Specifically, six out of the nine meta-signatures (C1HEX and C2HEX) generated from Experiment I (i.e. around 66.67% of the meta-signatures: 44.44% malicious (4/9) and 22.22% non-malicious (2/9)) detected seen (Ps) and unseen (Pu) variants belonging to malicious and non-malicious groups (see
Experiments I and II were performed using two different representational approaches (i.e. hex and DNA), and Experiment III used aligned DNA sequences, all with the same (unchanged) instances each time. As a result, some of the meta-signatures (C1HEX and C2HEX) obtained from the three sets were identical to each other. Malicious C1HEX1 (I), C1HEX3 (II), non-malicious C2HEX41 (III) and C2HEX43 (III) share an identical meta-signature. Similarly, malicious C1HEX4 (I), C1HEX9 (II) and non-malicious C2HEX37 (III) share an identical meta-signature. Although Experiment II generated rules with 100% inaccuracy, the overall combined percentage of effective meta-signatures (C1HEX and C2HEX) generated from all three sets of experiments was 57.75%. Conversely, the overall combined percentage of non-effective meta-signatures (C1HEX and C2HEX) generated from all three sets of experiments was 42.25%.
The key differences between previous related work [
1) Previous work adopted left-to-right string matching techniques to find the most optimally-conserved meta-signatures. The work presented in this paper adopts a rule-based or top-down approach that attempts to find underlying patterns.
2) Previous work generated equal length consensuses using sequence alignment techniques, whereas the current work generates variable length consensuses adopting a variable length data mining technique (NNge).
3) Previous work adopted pairwise alignment techniques for extracting signatures which only allowed alignment of two viral sequences at a time taking into account only the information available in the sequence pair. This work allows all sequences to be used to extract signatures and so takes into account all the information in all the sequences at the same time, including both family generic and variant specific information.
In this paper, some of the limitations (discussed in Section 3) of previous work [
The use of newly generated novel (Px) variants differentiates our approach from all previous research that adopts existing malware samples from an online repository. In comparison to the semantic-based approaches as shown in
In conclusion, the contributions of this paper are listed as follows:
1) Adopting a data mining algorithm, NNge, to generate rule-based signatures automatically from real malware data.
2) Comparing a variable length data mining technique with an equal length data mining technique using NNge on malware source code by conducting three different experiments (Experiments I-III).
3) Distinguishing malicious variants from non-malicious with the help of rules generated using the data mining algorithm, NNge.
4) Testing the derived rule-based signatures against real malware data and comparing the results to other commercial AVSs.
5) Comparing the overall performance metrics such as true positive rate, false positive rate, precision, recall, etc. with other related work on malware detection using data mining algorithms.
6) Detecting known Pk (i.e. Ps and Pu) and unknown Px variants of a polymorphic malware family using rule-based signatures (see
More work is required to apply the current rule-based approach to more intricate polymorphic as well as metamorphic viruses.
The authors declare no conflicts of interest regarding the publication of this paper.
Naidu, V., Whalley, J. and Narayanan, A. (2018) Generating Rule-Based Signatures for Detecting Polymorphic Variants Using Data Mining and Sequence Alignment Approaches. Journal of Information Security, 9, 265-298. https://doi.org/10.4236/jis.2018.94019
More details referring to the steps involved in this experiment can be found in the previous work [
Hex dump extraction (Step-1) and testing (Step-6) were undertaken on a stand-alone system to prevent possible unintended infection of other systems. Downloading of polymorphic malware (and seen Ps as well as unseen Pu variants) was performed using “Oracle VM VirtualBox” [
23 withheld variants (Ps and Pu) were selected for Experiment I. A CRC32b hash value was generated for each of these 23 withheld variants and no duplicates were found, indicating that they were unique. The ratio of training to test samples (Ps to Pu) in Pk for the malicious JS.Cassandra virus is 3.125% (11:352). A severely reduced proportion of training to test samples was used to reflect the current difficulty in detecting signatures that generalize from a small, previously encountered set of known (Pk) variants to a potentially infinite set of new (Px) variants.
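The uniqueness check described above can be sketched as follows. This is a minimal illustration, assuming that CRC32b corresponds to the standard CRC-32 computed by Python's `zlib.crc32`; the helper names and the sample file names are hypothetical.

```python
import zlib
from collections import defaultdict


def crc32_hex(data: bytes) -> str:
    """Return the CRC-32 of `data` as eight lower-case hex digits."""
    return f"{zlib.crc32(data) & 0xFFFFFFFF:08x}"


def find_duplicates(samples):
    """Group sample names that share the same CRC-32 value.

    `samples` maps a (hypothetical) file name to its raw bytes.
    Returns a list of name groups; an empty list means all are unique.
    """
    by_hash = defaultdict(list)
    for name, data in samples.items():
        by_hash[crc32_hex(data)].append(name)
    return [sorted(names) for names in by_hash.values() if len(names) > 1]


# Toy data standing in for variant files (not real samples).
toy = {"a.js": b"var x=1;", "b.js": b"var x=1;", "c.js": b"var y=2;"}
duplicates = find_duplicates(toy)

# Training-to-test proportion quoted in the text: 11 of 352.
train_test_percent = 11 / 352 * 100
```

A CRC-32 match is only a cheap indicator of identical content, so a stronger hash could be substituted where collisions are a concern.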
All 23 withheld variants (Ps and Pu) were checked using the “VirusTotal” [
For the process of data mining using NNge (Step-2), the variable length hex sequences were converted into equal length sequences by padding each shorter sequence with the letter “x” at its end until it matched the length of the longest sequence. A lower case “x” was used because the hex sequences were represented in lower case. An ARFF (Attribute-Relation File Format) file was created which contained the hex dump sequences (MHEX and NMHEX) for the 22 JS.Cassandra variants. The 23rd variant was not included in the ARFF file since it was only used in Step-4 and Step-5 for the process of pairwise sequence alignment.
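The padding and ARFF construction can be sketched as below. This is an illustrative reconstruction rather than the tooling actually used: the function names, the relation name and the toy sequences are assumptions, with one nominal attribute per position followed by a class attribute.

```python
def pad_to_longest(seqs, pad="x"):
    """Right-pad every sequence with `pad` to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, pad) for s in seqs]


def to_arff(labelled, relation="variants", pad="x"):
    """Build ARFF text from (sequence, class_label) pairs.

    Each position becomes a nominal attribute whose values are the
    symbols observed at that position (including the padding symbol).
    """
    seqs = pad_to_longest([seq for seq, _ in labelled], pad)
    labels = [label for _, label in labelled]
    lines = [f"@relation {relation}"]
    for pos in range(len(seqs[0])):
        values = sorted({s[pos] for s in seqs})
        lines.append(f"@attribute pos{pos + 1} {{{','.join(values)}}}")
    lines.append(f"@attribute class {{{','.join(sorted(set(labels)))}}}")
    lines.append("@data")
    for seq, label in zip(seqs, labels):
        lines.append(",".join(seq) + "," + label)
    return "\n".join(lines)


# Toy example: two short hex strings of different length.
padded = pad_to_longest(["2603", "26"])
arff_text = to_arff([("26", "m"), ("2", "nm")])
```

Passing `pad="X"` gives the upper case variant used for DNA sequences.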
In total, the ARFF file consisted of 24,565 attributes (one attribute per position) and two classes (malicious and non-malicious). The NNge classifier was trained on the full dataset. Two NNge rules (one for each class) were generated with a data fitting accuracy of 100%. Partial segments of the two NNge (hex) rules obtained in this step for the malicious (m) and 11 non-malicious (nm) hex sequences are shown below:
Malicious (m)―class m IF: pos1 in {2, 6} ^ pos2 in {0, 3} ^ pos3 in {6, 7} ^ pos4 in {a, b, 1, 2, 3, 7, 9} ^ pos5 in {6, 7} ^ pos6 in {a, e, 1, 2, 3, 5, 6, 7, 9} ^ pos7 in {6, 7} ... and so on.
Non-Malicious (nm)―class nm IF: pos1 in {2, 6, 7} ^ pos2 in {f, 6} ^ pos3 in {2, 6, 7} ^ pos4 in {f, 1, 5} ^ pos5 in {2, 6, 7} ^ pos6 in {e, 0, 2} ^ pos7 in {2, 6, 7} ... and so on.
The process of rule extraction (Step-3) can be illustrated using the malicious (m) rule above: the values listed at pos1 become the first substring of the extracted string, the values at pos2 become the second substring, and so on, giving “260367ab1237967ae123567967...”. The length of the malicious string (N1HEX) was 123,338 hex characters, whereas the length of the non-malicious string (N2HEX) was 37,249 hex characters. Only hex data were extracted from the two NNge rules; the padding letter “x” was excluded.
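The extraction of a first-level consensus from a rule can be sketched as follows. The parser is a minimal illustration that assumes the rule text has the `posN in {...}` form shown above; the function name `consensus_from_rule` is hypothetical.

```python
import re


def consensus_from_rule(rule, pad_chars="xX"):
    """Concatenate the value sets of an NNge rule in position order.

    Values are kept in the order in which they appear in the rule text,
    and padding symbols ("x" for hex, "X" for DNA) are dropped.
    """
    parts = []
    for values in re.findall(r"pos\d+ in \{([^}]*)\}", rule):
        symbols = [v.strip() for v in values.split(",")]
        parts.append("".join(s for s in symbols if s not in pad_chars))
    return "".join(parts)


# Partial malicious (m) hex rule quoted in the text.
HEX_RULE = (
    "pos1 in {2, 6} ^ pos2 in {0, 3} ^ pos3 in {6, 7} ^ "
    "pos4 in {a, b, 1, 2, 3, 7, 9} ^ pos5 in {6, 7} ^ "
    "pos6 in {a, e, 1, 2, 3, 5, 6, 7, 9} ^ pos7 in {6, 7}"
)
hex_consensus = consensus_from_rule(HEX_RULE)
```

Applied to the partial rule above, the function reproduces the string “260367ab1237967ae123567967” quoted in the text.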
After conversion (Step-4), six discrete pairwise alignments (Step-5a) were first conducted (sequence 1 with sequence 2, sequence 3 with sequence 4, etc.). The equal combination of gap open (i.e. 10) and gap extend (i.e. 1) penalty (as used in [
CAATCAAGGCGCGCTCCCGTGCGATCTCACGGCCGTTCGTGAGAACGATC
In Step-6a and Step-6b, the nine DNA meta-signatures were first converted into hex (C1HEX and C2HEX) and then tested against the JS.Cassandra viral variants (Pk and Px) using the clamscan scanner. One of the nine hex meta-signatures, with a sequence length of 25, is shown below in hex representation:
4342999d5b98dd1a5bdb8818d
A2. Experiment II
The same procedure as Experiment I was used along with the same JS.Cassandra (training) variants, with the only difference being that the variants (MHEX and NMHEX) were converted into DNA format (MDNA and NMDNA) prior to NNge rule generation. The conversion to DNA format was undertaken as normal using the DNA representational method as detailed in Section 6. Our method for Experiment II consists of six steps (see
In this step (Step-3), as for Experiment I, equal length sequences were created by adding the letter “X” at the end of each sequence up to the length of the longest variant. An upper case “X” was used because the DNA sequences were represented in upper case. In total, the resultant ARFF file contained 49,129 attributes and two class labels (malicious and non-malicious). The final, error-free version of the ARFF file was loaded into Weka and NNge classification was undertaken using all the data as the training set. After the first iteration, two NNge rules (one for each class) were generated in under seven minutes. Partial segments of the two NNge (DNA) rules are shown below:
Malicious (M)―class M IF: pos1 in {A, C} ^ pos2 in {G} ^ pos3 in {A} ^ pos4 in {A, T} ^ pos5 in {C} ^ pos6 in {T, G} ^ pos7 in {A, G, C} ^ pos8 in {T, G, C} ... and so on.
Non-Malicious (NM)―class NM IF: pos1 in {A, C} ^ pos2 in {T, G} ^ pos3 in {T, C} ^ pos4 in {T, G} ^ pos5 in {A, C} ^ pos6 in {T, G} ^ pos7 in {A, T, C} ^ pos8 in {T, C} ... and so on.
In this step (Step-4), two strings (first-level consensuses―N1DNA and N2DNA) in DNA format were extracted in the same way as for Experiment I from these two NNge rules and the substrings in each position were concatenated as illustrated here for the Malicious class: “ACGAATCTGAGCTGC...”.
The sequence length of the malicious NNge DNA string (N1DNA) was 132,103 bases, whereas the sequence length of the non-malicious NNge DNA string (N2DNA) was 41,670 bases. In this step (Step-5a), pairwise local alignment was then performed using SWA and the ID matrix with the same gap penalties, in a process similar to that described for Experiment I (Step-5a). In total, as in Experiment I, six pairwise alignments were performed in this step (Step-5a).
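For readers unfamiliar with the alignment step, a compact local alignment with affine gap penalties is sketched below. It is not the alignment tool used in the experiments. The default gap open (10) and gap extend (1) values follow the text, whereas the match and mismatch scores standing in for the ID matrix (+1 and -1) are assumptions, as is the function name.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap_open=10, gap_extend=1):
    """Local alignment with affine gaps (Gotoh-style recurrences).

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    Returns (best_score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    neg = float("-inf")
    H = [[0] * (m + 1) for _ in range(n + 1)]    # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]  # ends with a gap in a
    F = [[neg] * (m + 1) for _ in range(n + 1)]  # ends with a gap in b
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] - gap_extend, H[i][j - 1] - gap_open)
            F[i][j] = max(F[i - 1][j] - gap_extend, H[i - 1][j] - gap_open)
            H[i][j] = max(0, H[i - 1][j - 1] + sub(i, j), E[i][j], F[i][j])
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    state, out_a, out_b = "H", [], []
    while i > 0 and j > 0:
        if state == "H":
            if H[i][j] == 0:
                break
            if H[i][j] == H[i - 1][j - 1] + sub(i, j):
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i, j = i - 1, j - 1
            elif H[i][j] == E[i][j]:
                state = "E"
            else:
                state = "F"
        elif state == "E":
            out_a.append("-"); out_b.append(b[j - 1])
            if E[i][j] == H[i][j - 1] - gap_open:
                state = "H"  # this column opened the gap
            j -= 1
        else:
            out_a.append(a[i - 1]); out_b.append("-")
            if F[i][j] == H[i - 1][j] - gap_open:
                state = "H"  # this row opened the gap
            i -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TTACGTTT", "GGACGGG")
```

With a gap open penalty of 10 and a match score of 1, a gap is only introduced when the matching regions on both sides are long enough to outweigh it, so short inputs such as the toy example align without gaps.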
Overall, 14 common substrings (i.e. meta-signatures―C1DNA and C2DNA) were obtained in this step (Step-5b) from the six pairwise local alignments. One of the 14 meta-signatures, with a sequence length of 59, generated from one of the six pairwise alignments, is shown below in nucleic acid representation:
ACAGGAAGGCCTTCAATCAAGGCGCGCTCCCGTGCGATCTCACGGCCGTTCGTGAGAAC
In Step-6a and Step-6b, the 14 DNA meta-signatures were first converted into hex (C1HEX and C2HEX) and then tested against the JS.Cassandra viral variants (Pk and Px) using the clamscan scanner. One of the 14 hex meta-signatures, with a sequence length of 28, is shown below in hex representation:
28297d0d0a66756e6374696f6e20
A3. Experiment III
This experiment takes a different approach from Experiments I and II to dealing with the need for equal length sequences in order to generate rules using an equal length data mining approach. Multiple sequence alignment is undertaken prior to NNge rule generation to convert the variable length sequences (MDNA and NMDNA) into equal length sequences (ME and NME) by inserting gaps (
In Step-4, the same NNge classification was undertaken using Weka. The data was converted into Weka’s ARFF file format and consisted of 93,438 attributes and two classes (malicious and non-malicious). Three NNge rules (one for the malicious class and two for the non-malicious class) were generated with an accuracy of 100% in under 33 minutes. A partial segment of each of these NNge rules is shown below:
Malicious (M)―class M IF: pos1 in {A, X} ^ pos2 in {G, X} ^ pos3 in {A, X} ^ pos4 in {A, X} ^ pos5 in {C, X} ^ pos6 in {T, G, X} ^ pos7 in {A, G, X} ^ pos8 in {T, G, C, X} ^ pos9 in {C, X} ^ pos10 in {T, G, X} ... and so on.
Non-Malicious 1 (NM1)―class NM IF: pos1 in {X} ^ pos2 in {X}… ^ pos96 in {T, X} ^ pos97 in {A, X} ^ pos98 in {G, X} ^ pos99 in {A, X} ^ pos100 in {A, X} ^ pos101 in {C, X} ^ pos102 in {T, G, X} ^ pos103 in {G, C, X} ... and so on.
Non-Malicious 2 (NM2)―class NM IF: pos1 in {X} ^ pos2 in {X} … ^ pos1294 in {X} ^ pos1295 in {C} ^ pos1296 in {A} ^ pos1297 in {G} ^ pos1298 in {T} ^ pos1299 in {C} ^ pos1300 in {A} ^ pos1301 in {T} ... and so on.
In Step-5, three strings (first-level consensuses―N1DNA and N2DNA) in DNA format were constructed based on each of these NNge rules. The process of extracting strings from the rules is the same as detailed in Experiments I and II, and any “X” string extension characters were ignored. An example of this string extraction process from the rule for NM1 is: “TAGAACTGGC...”. The sequence length of the resultant malicious DNA string (N1DNA) was 161,495 bases, whereas the sequence lengths of the non-malicious DNA strings (N2DNA) were 59,740 bases (NM1) and 11,860 bases (NM2).
Next, in Step-6a, local pairwise sequence alignment between these DNA sequences (first-level consensuses―N1DNA and N2DNA) extracted from each of the NNge rules and the three malicious JS.Cassandra variants (Ps) in DNA format was performed one by one using SWA and the ID matrix, as per Experiments I and II. In this step (Step-6b), common substrings that are the meta-signatures (C1DNA and C2DNA) for JS.Cassandra were extracted from the nine second-level consensuses generated from the process of nine pairwise local alignments. In total, 48 meta-signatures (C1DNA and C2DNA) were obtained. The meta-signature of sequence length 88 obtained from one of the nine pairwise alignments is shown below in its nucleic acid representation:
GGAAGTGCTAGCGTTCTCCCGTGCGCAAGGACATCCGACCTCACGGAAGTGCTAGCGACCGTGCGCACGTTCGTCAGGAAGGCAGGGA
In Step-7a and Step-7b, the 48 DNA meta-signatures were first converted into hex (C1HEX and C2HEX) and then tested against the JS.Cassandra viral variants (Pk and Px) using the clamscan scanner. One of the 48 hex meta-signatures, with a sequence length of 22, is shown below in hex representation:
292c283538322f36292c28
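Hex meta-signatures such as the one above can be placed in a ClamAV extended signature (.ndb) database, whose entries take the form `MalwareName:TargetType:Offset:HexSignature`. The sketch below is a hedged illustration: the signature names, the target type 0 (any file), the offset `*` (any position) and the helper names are assumptions, not the settings used in the experiments. The filter mirrors the false positive testing step described earlier, where duplicates and signatures of six characters or fewer were removed.

```python
def ndb_entry(name, hex_sig, target_type=0, offset="*"):
    """Format one extended-signature line: Name:TargetType:Offset:HexSignature."""
    int(hex_sig, 16)  # raises ValueError if the string is not hexadecimal
    return f"{name}:{target_type}:{offset}:{hex_sig.lower()}"


def build_ndb(hex_sigs, prefix="JS.Cassandra.Meta", min_len=7):
    """Return .ndb text, skipping duplicates and signatures shorter than min_len.

    `prefix` is a hypothetical naming scheme for the generated entries.
    """
    seen, lines = set(), []
    for sig in hex_sigs:
        sig = sig.lower()
        if len(sig) < min_len or sig in seen:
            continue
        seen.add(sig)
        lines.append(ndb_entry(f"{prefix}-{len(lines) + 1}", sig))
    return "\n".join(lines) + "\n"


SIG = "292c283538322f36292c28"           # hex meta-signature quoted above
ndb_text = build_ndb([SIG, SIG.upper(), "0a0d"])
preview = bytes.fromhex(SIG)             # raw bytes the signature matches
```

The resulting text can be written to a file with an `.ndb` extension and passed to clamscan through its database option (`-d`).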