Confusion Matrix and Cyber Security
Confusion matrix is a fairly common term in machine learning. Today I will try to relate the importance of the confusion matrix to cyber crime and cyber security.
Basically, a confusion matrix is a tool for evaluating a classification model: it tells us how well our model is performing.
As we will see, there is more to the confusion matrix than being just another classification metric.
So before we dive deep let’s first understand what a confusion matrix is.
What is a Confusion Matrix?
Once we have the data, and after cleaning, pre-processing and wrangling it, the next step is to feed it to a model and get predictions, often as probabilities. But how can we measure the effectiveness of our model? The better the effectiveness, the better the performance, and that’s what we want. This is where the confusion matrix comes into the limelight. A confusion matrix is a performance measurement for machine learning classification.
There are multiple ways of measuring error in a machine learning model. The error of a single prediction is the difference between the actual and the predicted value, “y - ŷ”. The Mean Absolute Error (MAE), used as an error/cost function, averages the absolute values of these differences; minimizing it guides training in the right direction by pushing the distance between the actual and predicted values toward 0.
Mean Squared Error (MSE): the error for each point in the data set is squared first, and then the mean of the squared errors is taken. Squaring stops positive and negative errors from cancelling each other out and penalizes large errors more heavily.
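To make the two error measures concrete, here is a minimal sketch in plain Python. The values in `y_true` and `y_pred` are made up purely for illustration.

```python
# Hypothetical actual and predicted values, chosen only for illustration.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Per-sample errors: y - y_hat
errors = [actual - predicted for actual, predicted in zip(y_true, y_pred)]

# MAE: mean of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

# MSE: mean of the squared errors
mse = sum(e ** 2 for e in errors) / len(errors)

print(mae)  # 0.875
print(mse)  # 1.3125
```

Notice that the single error of 2.0 dominates the MSE much more than the MAE, because squaring amplifies large errors.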
MAE and MSE suit models that predict numeric values. In classification models, the errors are examined with the help of the confusion matrix.
A confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. For binary classification, it is a table with 4 different combinations of predicted and actual values.
Didn’t get it? Don’t worry, let’s simplify :)
The above diagram shows the structure of a confusion matrix and its classes.
Let’s understand the values a confusion matrix contains, with an example:
- TRUE POSITIVE:
The predicted result and the actual result are both positive. For example, a patient is predicted to have cancer, and after check-up it was found that the patient had cancer.
- FALSE POSITIVE (Type 1 Error):
The predicted result was “yes” but the actual result was “no”. For example, it was predicted that the patient had cancer, and after check-up it was found that the patient did not have cancer.
- FALSE NEGATIVE (Type 2 Error):
The predicted result was “no” but the actual result was “yes”. For example, it was predicted that the patient did not have cancer, and after check-up it was found that the patient had cancer.
- TRUE NEGATIVE:
The predicted result and the actual result are both negative. For example, it was predicted that the patient did not have cancer, and after check-up it was found that the patient did not have cancer.
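The four cases above can be counted directly from a list of actual and predicted labels. The following is a small sketch in plain Python; the function name `confusion_counts` and the sample labels are hypothetical, chosen just to mirror the cancer example (1 = has cancer, 0 = no cancer).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for a binary problem (illustrative helper)."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted yes, actually yes
        elif predicted == positive:
            fp += 1  # predicted yes, actually no (Type 1 error)
        elif actual == positive:
            fn += 1  # predicted no, actually yes (Type 2 error)
        else:
            tn += 1  # predicted no, actually no
    return tp, fp, fn, tn


# Made-up labels for eight patients: 1 = has cancer, 0 = no cancer.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 3
```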
From the confusion matrix, we can calculate five different metrics that are used to measure the validity of the model:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Misclassification = (FP + FN) / (TP + TN + FP + FN)
Precision (true positives / predicted positives) = TP / (TP + FP)
Recall (true positives / all actual positives) = TP / (TP + FN)
Specificity (true negatives / all actual negatives) = TN / (TN + FP)
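As a quick worked example, the five formulas can be written as one small function. This is only a sketch: the name `classification_metrics` and the counts passed to it are hypothetical, and it assumes none of the denominators is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the five metrics from the four confusion matrix counts.

    Illustrative helper; assumes every denominator is non-zero.
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Made-up counts for 100 predictions.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.85 and the precision is 0.8, while recall (40/45) and specificity (45/55) tell us separately how well the positives and the negatives are caught.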
How to implement a Confusion Matrix using Python:
In Python, a confusion matrix can be obtained using the function
confusion_matrix()
which is part of the “sklearn” (scikit-learn) library.
This function can be imported into Python using the syntax:
from sklearn.metrics import confusion_matrix
To obtain the confusion matrix, users need to provide the actual values and the predicted values to the function, in that order.
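Here is a minimal usage sketch, assuming scikit-learn is installed. The labels are made up for illustration (1 = positive, 0 = negative).

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

# Actual values go first, predicted values second.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Rows are actual classes and columns are predicted classes, so for the
# labels [0, 1] the layout is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```

Note that scikit-learn puts the negative class first, so the top-left cell is TN and not TP as in many textbook diagrams.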
How can the Confusion Matrix help the Cyber Security field tackle Cyber Attacks?
Cyber security is a major concern, and there should not be any lag while developing a security system. In the cyber security world there is one great application of machine learning, known as the Intrusion Detection System (IDS). The confusion matrix helps to ensure the effectiveness of the ML algorithm behind the Intrusion Detection System.
An IDS is a system, here designed using machine learning, which helps to detect cyber attacks and to raise an alarm.
In today’s cyber world, the demand for the internet is increasing day by day, which increases the concern for network security. The aim of an Intrusion Detection System (IDS) is to provide defenses against many fast-growing network attacks (e.g., DDoS attacks, ransomware attacks, botnet attacks, etc.) by detecting harmful activities occurring in the network system so that they can be blocked.
To design an IDS we can use multiple machine learning algorithms, but the question is which algorithm is the most accurate at detecting cyber attacks while also requiring the least time. To answer this question we need to test and compare the results of various algorithms, and here comes the role of the “Confusion Matrix”.
To dive deep into the IDS (Intrusion Detection System) and the confusion matrix, I studied one paper: “Classification model for accuracy and intrusion detection using machine learning approach”
In this paper, three different classification machine learning algorithms are tested for the Intrusion Detection System:
- Naïve Bayes (NB)
- Support Vector Machine (SVM)
- K-nearest neighbor (KNN)
The table below shows the attack classes and subcategories with the number of samples extracted from the dataset:
To create the IDS models based on these three algorithms, the dataset is first pre-processed.
Using this pre-processed dataset, each model is trained one by one and its performance is checked using the confusion matrix.
1) Fig No: 1 below shows the classification model fitted using the SVM algorithm. The accuracy it achieves is 97.77777%, i.e., the algorithm learns the patterns of the dataset with an accuracy of 97.77777%.
2) Fig No: 2 shows the classification model fitted using the KNN algorithm. Here, k = 3 for initial experimentation. The value of k can be increased or decreased depending on the data entries. The accuracy it achieves is 93.33333%, i.e., the algorithm learns the patterns of the dataset with an accuracy of 93.33333%.
3) Fig No: 3 shows the classification model fitted using the Naive Bayes (NB) algorithm. The accuracy it achieves is 95.55555%, i.e., the algorithm learns the patterns of the dataset with an accuracy of 95.55555%.
By comparing the three algorithms’ accuracy levels and confusion matrices, we find that the Support Vector Machine (SVM) has the greatest accuracy and gives a better result than the other two algorithms.
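If you want to try this kind of comparison yourself, the workflow could look roughly like the sketch below. It is not the paper's code: the dataset is a synthetic stand-in generated with `make_classification` (not the intrusion dataset from the paper), the models use scikit-learn defaults apart from k = 3 for KNN, and `GaussianNB` is my assumption for the Naive Bayes variant. The numbers it prints will therefore not match the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data; NOT the dataset used in the paper.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The three algorithms compared in the paper, with assumed settings.
models = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=3),  # k = 3, as in the text
    "NB": GaussianNB(),  # assumed Naive Bayes variant
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
    print(name, "accuracy:", results[name]["accuracy"])
    print(results[name]["confusion_matrix"])
```

Looking at the confusion matrix next to the accuracy matters for an IDS, because a missed attack (false negative) and a false alarm (false positive) are very different kinds of mistakes.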
This is how the confusion matrix can be used in the cyber security world, where it has a big impact!
Hope you readers enjoyed this blog :)