A botnet is a network of compromised devices controlled by a malicious “botmaster” to perform various tasks, such as executing DoS attacks, sending spam, and obtaining personal data. As botmasters generate network traffic while communicating with their bots, analyzing network traffic to detect botnet traffic can be a promising feature of an Intrusion Detection System. Although such systems have applied various machine learning techniques, a comparison of machine learning algorithms, including their ensembles, for botnet detection has not been carried out. In this study, not only are the three most popular classification machine learning algorithms (Naive Bayes, decision tree, and neural network) evaluated, but the ensemble methods known to strengthen classifiers are also tested to see whether they indeed provide enhanced predictions for botnet detection. The evaluation is conducted on the public CTU-13 dataset, measuring the training time of each classifier along with its F-measure and MCC score.
As a network of compromised devices called bots, a botnet executes malicious tasks under the control of an attacker, the botmaster. Botnets have been a threat to cybersecurity [
・ Information dispersion: sending spam, executing Denial of Service (DoS) attacks, and distributing false information from illegal sources.
・ Information harvesting: obtaining identities, passwords, and financial data.
・ Information processing: processing data to crack the password for access to additional hosts.
Botnets have grown into a menace since the first botnet, EggDrop, was reported in 1993 [
The Mirai botnet, according to the same report [
The botnet threat is also growing rapidly. According to the report from Spamhaus [
A botnet usually has distinguishable architectural features [
To detect botnets, approaches focusing on anomalies in the network behavior of bots or botnets, with or without temporal behavior, have been proposed. Most of the previous studies adopted machine learning technologies along with heuristic rules [
In this paper, focusing on the ensemble methods, supervised machine learning algorithms popularly used in previous studies are evaluated together with their ensembles. As the ensemble methods were designed to strengthen weak classifiers, it is worth investigating whether they are indeed beneficial when it comes to botnet detection.
In the following section, popular classification algorithms that have been used in several botnet detection proposals, as well as the ensemble technology, are explained.
In previous studies using supervised machine learning, three algorithms were popularly adopted: Naive Bayes, decision tree, and (artificial) neural networks [
The Naive Bayes algorithm is a simple and intuitive classification technique based on Bayes' theorem, which assumes that each feature contributes independently to the probability of an event [
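As a minimal sketch with Scikit-learn (the toolkit used in the evaluation below), Gaussian Naive Bayes can be applied to flow-like records; the two features and all values here are purely illustrative, not taken from the CTU-13 data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy records with two illustrative features (e.g. packet count, byte count);
# 0 = normal traffic, 1 = botnet traffic.
X = np.array([[10, 200], [12, 250], [900, 50000], [950, 52000]], dtype=float)
y = np.array([0, 0, 1, 1])

clf = GaussianNB()
clf.fit(X, y)
# A flow resembling the "normal" cluster is classified accordingly.
print(clf.predict([[11, 230]]))  # → [0]
```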
Neural networks, analogous to the human brain, refer to large connections of simple units called neurons. Consisting of three kinds of layers (an input layer, hidden layer(s), and an output layer), a neural network takes each record and passes its features to the input layer; the model then makes a decision by calculating the weights of the hidden neurons to obtain the single highest value at the output layer. A feed-forward neural network, where the output of one layer is used as input to the next, iterates over the same data, comparing the output to the true value and adjusting the weights of the hidden neurons according to the error term. Recurrent neural networks, in contrast, adopt feedback loops between neurons, which more closely resemble the human brain [
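A feed-forward network of this kind can be sketched with Scikit-learn's MLPClassifier; the hidden layer size, iteration count, and XOR toy problem below are arbitrary illustrative choices, not the configuration used in this study:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# XOR: a classic problem that no single linear unit can solve,
# but one hidden layer can.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# One hidden layer of 8 neurons; the weights are adjusted iteratively
# from the error between the network output and the true label.
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=5000, random_state=1)
clf.fit(X, y)
print(clf.predict(X))
```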
As another popular classification method, a decision tree generates a tree-like model of decisions based on decision rules inferred from the data. The goal is to create a model that predicts the value of a target variable from several input variables. In a classification decision tree, the dependent variable can be categorical. Unlike other machine learning algorithms, a decision tree is easy to interpret once the tree is visualized.
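The interpretability point can be illustrated with Scikit-learn, whose `export_text` helper prints the learned rules directly; the feature names and toy flow records are made up for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative flow records: 0 = normal, 1 = botnet.
X = np.array([[10, 200], [12, 250], [900, 50000], [950, 52000]], dtype=float)
y = np.array([0, 0, 1, 1])

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# The inferred decision rules are human-readable, which is what makes
# a decision tree easy to interpret.
print(export_text(clf, feature_names=["packets", "bytes"]))
```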
Ensemble methods combine a set of classifiers into an ensemble by merging the prediction of each classifier, with or without weighting. They are regarded as one possibility for improving accuracy. Typically, there are three types of ensemble methods, as introduced below [
・ Voting: as the simplest way to form an ensemble, a voting classifier consists of multiple models of diverse types. In the training step, all the models are trained separately on the whole training data; in the recognition step, the ensemble averages the posterior probabilities calculated by each model.
・ Bagging: also called bootstrap aggregation, bagging manipulates the training data to generate multiple models. Instead of training one model on the whole training data, it randomly samples training sets from the total training data to build sub-models.
・ Boosting: boosting also samples the training data, as bagging does, but maintains a set of weights on the data. In AdaBoost in particular, the weighted error of each model updates the weights on the training data, giving more weight to data predicted with lower accuracy and less weight to data predicted with higher accuracy.
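The three ensemble types above can be sketched with Scikit-learn's ensemble module; the base estimators, estimator counts, and synthetic data are illustrative choices, not the exact configuration evaluated in this study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for labeled flows.
X, y = make_classification(n_samples=300, random_state=0)

# Voting: diverse model types, each trained on the whole training set;
# "soft" voting averages the posterior probabilities.
voting = VotingClassifier(
    estimators=[("gnb", GaussianNB()), ("dt", DecisionTreeClassifier(random_state=0))],
    voting="soft",
)

# Bagging: sub-models trained on bootstrap samples of the training data.
bagging = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                            n_estimators=10, random_state=0)

# Boosting (AdaBoost): data weights are updated from each model's weighted error.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for clf in (voting, bagging, boosting):
    clf.fit(X, y)
```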
Along with the concept of bagging, a random forest is an ensemble of multiple decision trees. By randomly selecting features from the data, it generates decision trees; for unseen data, the class that the majority of the decision trees predict is selected as the prediction for the input. Random forest is known as a way of avoiding the overfitting that can happen in a single decision tree [
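A minimal random forest sketch in Scikit-learn, again on synthetic data with illustrative parameters: each tree sees a bootstrap sample of the rows and a random subset of the features at every split, and because the trees are independent, training parallelizes naturally.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# max_features="sqrt" selects a random feature subset per split;
# n_jobs=-1 builds the independent trees in parallel.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            n_jobs=-1, random_state=0)
rf.fit(X, y)
# The forest also yields an implicit feature ranking.
print(rf.feature_importances_.round(3))
```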
Although ensemble methods are an effective way of reducing the variance of a prediction model, it is obvious that they come with more computation. Thus, a poor model can enhance its accuracy at the cost of the extra computation. For this reason, ensemble methods are often used with a fast classifier such as a decision tree, as in the case of random forest.
Finding an appropriate network traffic dataset for machine learning is often challenging. For supervised machine learning, the data must be properly labeled unless the target feature is already in the dataset, which makes the task onerous. Addressing this problem, Sebastian Garcia et al. created the CTU-13 dataset, labeled as botnet, normal, and background, in their previous research [
When creating the CTU-13 dataset, the researchers conducted preprocessing to convert pcap files to NetFlow files. In that stage, they configured the data with the following features: start time, end time, duration, protocol, source IP address, source port, direction, destination IP address, destination port, flags, type of service, number of packets, number of bytes, number of flows, and label.
To measure the accuracy of a classifier, taking the confusion matrix into account is the most common approach. Precision, the percentage of correctly predicted events among all predicted events, and recall, the percentage of correctly predicted events among all actual events, are both important.
Taking both precision and recall into account, the F1 score gives a more balanced view than precision or recall alone. The F1 score ranges between 0 and 1, where 1 means the best accuracy.
Unlike the metrics above, the MCC, also known as the phi coefficient, is considered less biased because it incorporates true negatives as well. According to [
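The difference can be made concrete on a small made-up example of skewed data: a classifier that simply predicts the majority class everywhere earns a near-perfect F1 score, while MCC, which also accounts for the true negatives (here, zero of them), collapses to 0.

```python
from sklearn.metrics import f1_score, matthews_corrcoef

# 100 made-up samples, 95 positives: heavily imbalanced data.
y_true = [1] * 95 + [0] * 5
# A degenerate classifier that predicts the majority class for everything:
# 95 TP, 5 FP, 0 TN, 0 FN.
y_pred = [1] * 100

print(round(f1_score(y_true, y_pred), 3))   # → 0.974
print(matthews_corrcoef(y_true, y_pred))    # → 0.0 (no true negatives)
```

Scikit-learn defines MCC as 0 when its denominator is zero, which is exactly the degenerate case above.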
To evaluate the classification algorithms along with the ensemble methods on the CTU-13 dataset, Scikit-learn on a single core of an Intel Xeon E5 with 64 GB of memory was used. Because some of the features were categorical, which Scikit-learn cannot handle directly, data preparation including encoding and standardization was conducted.
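Such a preparation step might look like the following sketch; the column names are assumed from the CTU-13 feature list and the values are made up, so this is illustrative rather than the study's actual pipeline:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical NetFlow-style rows (column names and values are illustrative).
df = pd.DataFrame({
    "Proto":    ["tcp", "udp", "tcp"],
    "Dur":      [0.5, 1.2, 30.0],
    "TotBytes": [300, 120, 90000],
})

# Encode the categorical feature as integers...
df["Proto"] = LabelEncoder().fit_transform(df["Proto"])
# ...and standardize the numeric features to zero mean and unit variance.
df[["Dur", "TotBytes"]] = StandardScaler().fit_transform(df[["Dur", "TotBytes"]])
print(df)
```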
The evaluation results are described in
For every dataset and algorithm, the F1 scores are higher than the MCC scores. This is because the F1 score does not consider the true negatives; for this reason, MCC is preferred for binary classification. In the following discussion, accuracy refers to the MCC score and S denotes a scenario of the dataset.
Among the individual algorithms, for S10 and S11, NN and DT show decent accuracies of over 0.91 and 0.98, respectively. However, NN takes much longer, about 3 - 4 times longer in S4 and S10. For S4, the accuracy score drops dramatically. The only structural difference between these three datasets is the ratio of botnet traffic. Even though S4 is the largest dataset, it has only one Rbot with a botnet traffic ratio of 0.15%, which means the data is highly imbalanced, or skewed. In contrast, S10 has 8.11% botnet traffic and S11 has 7.6%.
This pattern also appears in the voting results, because voting works by averaging the outcomes of the models. The boosting method does not significantly help either GNB or DT. The nature of boosting is to turn weak models, whose predictions are only slightly better than random, into a strong one. In this regard, it obviously cannot make DT stronger, as DT already had good accuracy. The interesting result comes with boosting-GNB. For S4, its MCC score is near zero, which means the prediction is no better than random. It also scores around 0.16 for S10 and S11, the opposite of the result with GNB alone. In the study by Ting and Zheng [

Accuracy (F1 and MCC score) of each method per scenario:

| Method | S4 F1 | S4 MCC | S10 F1 | S10 MCC | S11 F1 | S11 MCC |
|---|---|---|---|---|---|---|
| GNB | 0.986159 | 0.135260 | 0.988762 | 0.910358 | 0.992302 | 0.982357 |
| NN | 0.998489 | 0.000000 | 0.992646 | 0.939776 | 0.993299 | 0.984639 |
| DT | 0.999990 | 0.996779 | 0.999982 | 0.999849 | 0.999971 | 0.999935 |
| Voting | 0.999117 | 0.644421 | 0.994763 | 0.956586 | 0.993841 | 0.985878 |
| Boosting-GNB | 0.967339 | 0.043543 | 0.867776 | 0.162963 | 0.378982 | 0.168781 |
| Boosting-DT | 0.999989 | 0.996285 | 0.999983 | 0.999857 | 0.999963 | 0.999916 |
| Bagging-GNB | 0.986170 | 0.135319 | 0.988758 | 0.910333 | 0.992253 | 0.982245 |
| Bagging-NN | 0.998489 | 0.000000 | 0.993613 | 0.946912 | 0.993670 | 0.985486 |
| Bagging-DT | 0.999991 | 0.996955 | 0.999981 | 0.999836 | 0.999955 | 0.999897 |
| RF | 0.999997 | 0.998930 | 0.999988 | 0.999896 | 0.999972 | 0.999935 |

Training time of each method per scenario:

| Method | Scenario 4 | Scenario 10 | Scenario 11 |
|---|---|---|---|
| GNB | 2.68 | 1.59 | 1.57 |
| NN | 76.24 | 163.86 | 21.44 |
| DT | 25.48 | 35.39 | 0.62 |
| Voting | 103.05 | 139.56 | 18.06 |
| Boosting-GNB | 554.14 | 222.48 | 15.2 |
| Boosting-DT | 56.77 | 83.23 | 0.77 |
| Bagging-GNB | 62.90 | 22.13 | 1.47 |
| Bagging-NN | 437.47 | 654.84 | 41.61 |
| Bagging-DT | 175.11 | 186.07 | 2.65 |
| RF | 43.17 | 63.74 | 1.44 |
Bagging each algorithm performs very similarly to using the single classifier alone on each dataset. While training a bagging model, multiple sub-datasets sampled from the original dataset each train their own classifier, and the predictions of those classifiers are then voted. This dataset, however, may not benefit from the sampling because the data is too imbalanced.
While the ensemble methods offered by Scikit-learn are not significantly beneficial for each algorithm, random forest appears highly effective in terms of both accuracy and training time. As a combination of decision trees, it performs implicit feature selection by taking feature importance into consideration. Also, by building multiple sub-decision-trees with subsets of the features and data rows, it runs much faster than the other methods and can even be easily parallelized. Considering that parallelization is hard to implement in boosting and in large neural networks, random forest stands out.
Compared to the previous research [
In this study, three popular machine learning algorithms (Gaussian Naive Bayes, neural networks, and decision tree) were tested. Furthermore, the ensemble methods (voting, AdaBoost, and bagging) were compared to determine whether ensemble methods are significantly beneficial for botnet detection. Random forest, a refined ensemble of decision trees, was also tested. To detect botnet traffic among all network traffic, a decision tree without any ensemble method, or a random forest, would be the most reliable approach. Both run much faster than NN alone, with better accuracy. Even though GNB runs the fastest, its accuracy varies with the dataset. Contrary to common expectation, adopting ensemble methods on machine learning algorithms for botnet detection in the hope of enhancing accuracy is not preferable, because they do not give remarkably more accurate results while consuming much more time. The question this evaluation raises is why the accuracy scores drop when boosting is applied to GNB. Even though it was explained in [
This work was supported in part by a grant from Intel Grant #301620.
Ryu, S. and Yang, B. (2018) A Comparative Study of Machine Learning Algorithms and Their Ensembles for Botnet Detection. Journal of Computer and Communications, 6, 119-129. https://doi.org/10.4236/jcc.2018.65010