Confusion Matrix and Cyber Crime

Kanchan Daryanani
8 min read · Jun 5, 2021

MLOPS Summer Internship Task 5

Create a blog/article/video about cyber crime cases where they talk about confusion matrix or its two types of error.

What is Confusion Matrix?

The confusion matrix traces back to 1904, when Karl Pearson introduced it under the name Contingency Table. A confusion matrix is a performance measurement technique for machine learning classification problems. It is a simple table that shows how a classification model performs on test data for which the true values are known.

A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making.

For a binary classification problem, we would have a 2 x 2 matrix as shown below with 4 values:

Let’s decipher the matrix:

  • The target variable has two values: Positive or Negative
  • The columns represent the actual values of the target variable
  • The rows represent the predicted values of the target variable
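To make the layout concrete, here is a minimal sketch in plain Python that builds the 2 x 2 matrix with rows as predicted values and columns as actual values, as described above. The function name `binary_confusion_matrix` and the sample labels are made up for illustration. Note that some libraries use the opposite orientation: scikit-learn's `confusion_matrix`, for example, puts actual values in rows and predicted values in columns.

```python
def binary_confusion_matrix(actual, predicted, positive=1):
    """Return [[TP, FP], [FN, TN]]: rows = predicted, columns = actual."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return [[tp, fp], [fn, tn]]


# Hypothetical labels: 1 = positive, 0 = negative
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

matrix = binary_confusion_matrix(actual, predicted)
print(matrix)  # [[3, 1], [1, 3]]
```

The first row holds everything the model predicted as positive, and the first column holds everything that was actually positive, so the top-left cell is the True Positive count.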

Understanding True Positive, True Negative, False Positive and False Negative in a Confusion Matrix

True Positive (TP)

  • The predicted value matches the actual value
  • The actual value was positive and the model predicted a positive value

True Negative (TN)

  • The predicted value matches the actual value
  • The actual value was negative and the model predicted a negative value

False Positive (FP) — Type 1 error

  • The predicted value does not match the actual value
  • The actual value was negative but the model predicted a positive value
  • Also known as the Type 1 error

False Negative (FN) — Type 2 error

  • The predicted value does not match the actual value
  • The actual value was positive but the model predicted a negative value
  • Also known as the Type 2 error

For example: Suppose we had a classification dataset with 1000 data points. We fit a classifier on it and get the below confusion matrix:

The different values of the Confusion matrix would be as follows:

  • True Positive (TP) = 560; meaning 560 positive class data points were correctly classified by the model
  • True Negative (TN) = 330; meaning 330 negative class data points were correctly classified by the model
  • False Positive (FP) = 60; meaning 60 negative class data points were incorrectly classified as belonging to the positive class by the model
  • False Negative (FN) = 50; meaning 50 positive class data points were incorrectly classified as belonging to the negative class by the model
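As a quick check of these numbers, the sketch below computes the usual summary metrics (accuracy, precision, recall, and F1-score) from the four counts in this example. The formulas are the standard textbook definitions; the variable names are only illustrative.

```python
# Counts from the example above
tp, tn, fp, fn = 560, 330, 60, 50

total = tp + tn + fp + fn           # all data points
accuracy = (tp + tn) / total        # share of all predictions that were correct
precision = tp / (tp + fp)          # share of predicted positives that were right
recall = tp / (tp + fn)             # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"total={total}")             # total=1000
print(f"accuracy={accuracy:.3f}")   # accuracy=0.890
print(f"precision={precision:.3f}") # precision=0.903
print(f"recall={recall:.3f}")       # recall=0.918
print(f"f1={f1:.3f}")               # f1=0.911
```

The four counts add up to the 1000 data points, and the model is right on 890 of them, which is the 89% accuracy.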

What is cyber crime?

Cybercrime, also called computer crime, is the use of a computer as an instrument to further illegal ends, such as committing fraud, trafficking in child pornography and intellectual property, stealing identities, or violating privacy. Cybercrime, especially through the Internet, has grown in importance as the computer has become central to commerce, entertainment, and government.

Because of the early and widespread adoption of computers and the Internet in the United States, most of the earliest victims and villains of cybercrime were Americans. By the 21st century, though, hardly a hamlet remained anywhere in the world that had not been touched by cybercrime of one sort or another.

Defining cybercrime

New technologies create new criminal opportunities but few new types of crime. What distinguishes cybercrime from traditional criminal activity? Obviously, one difference is the use of the digital computer, but technology alone is insufficient for any distinction that might exist between different realms of criminal activity. Criminals do not need a computer to commit fraud, traffic in child pornography and intellectual property, steal an identity, or violate someone’s privacy. All those activities existed before the “cyber” prefix became ubiquitous. Cybercrime, especially involving the Internet, represents an extension of existing criminal behaviour alongside some novel illegal activities.

Types of cybercrime

Cybercrime ranges across a spectrum of activities. At one end are crimes that involve fundamental breaches of personal or corporate privacy, such as assaults on the integrity of information held in digital depositories and the use of illegally obtained digital information to blackmail a firm or individual. Also at this end of the spectrum is the growing crime of identity theft. Midway along the spectrum lie transaction-based crimes such as fraud, trafficking in child pornography, digital piracy, money laundering, and counterfeiting. These are specific crimes with specific victims, but the criminal hides in the relative anonymity provided by the Internet. Another part of this type of crime involves individuals within corporations or government bureaucracies deliberately altering data for either profit or political objectives. At the other end of the spectrum are those crimes that involve attempts to disrupt the actual workings of the Internet. These range from spam, hacking, and denial of service attacks against specific sites to acts of cyberterrorism, that is, the use of the Internet to cause public disturbances and even death. Cyberterrorism focuses upon the use of the Internet by nonstate actors to affect a nation’s economic and technological infrastructure. Since the September 11 attacks of 2001, public awareness of the threat of cyberterrorism has grown dramatically.

Case Study of Confusion Matrix’s implementation in monitoring Cyber Attacks

Cyber Attack Detection and Classification Using Parallel Support Vector Machine

The Support Vector Machine (SVM) is a powerful tool for classifying cyber attacks, but it still has some drawbacks. The first is that SVM is very sensitive to attacks. The second is that SVM was designed for two-class problems, so it has to be extended for multiclass problems, and a suitable kernel function has to be chosen, because the performance of the SVM depends on the kernel function. Several methods have been proposed to improve SVM performance, and Fuzzy SVM is one such improvement on the traditional SVM.

Several machine learning paradigms, including Artificial Neural Networks, Linear Genetic Programming (LGP), and Data Mining, have been investigated for the classification of cyber attacks. Machine learning techniques are also sensitive to noise in the training samples. Mislabeled data, if present, can result in a highly nonlinear decision surface and overfitting of the training set, which leads to poor generalization ability and classification accuracy.

A decision-tree-based support vector machine, which combines support vector machines with a decision tree, can be an effective way of solving multi-class problems. This method can decrease training and testing time, increasing the efficiency of the system. An Improved Support Vector Machine (iSVM) algorithm for classifying cyber attack datasets has been reported to give 100% detection accuracy for the Normal and Denial of Service (DoS) classes, with comparable false alarm rate, training time, and testing time. A new feature selection algorithm for distributed cyber attack detection and classification has also been proposed. Different types of attacks, together with the normal condition of the network, are modeled as different classes of the network data, and binary classifiers are used at local sensors to distinguish each class from the rest.
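The "each class from the rest" idea also applies when evaluating a multi-class detector: for any one class, every other class is treated as negative, and the usual four counts are read off. The sketch below is only an illustration of that bookkeeping, not the method from the case study; the helper name `one_vs_rest_counts` and the sample labels are hypothetical.

```python
def one_vs_rest_counts(actual, predicted, target):
    """Count TP, FP, FN, TN for `target`, treating all other classes as negative."""
    pairs = list(zip(actual, predicted))
    return {
        "TP": sum(1 for a, p in pairs if a == target and p == target),
        "FP": sum(1 for a, p in pairs if a != target and p == target),
        "FN": sum(1 for a, p in pairs if a == target and p != target),
        "TN": sum(1 for a, p in pairs if a != target and p != target),
    }


# Hypothetical connection labels, for illustration only
actual = ["normal", "dos", "probe", "dos", "normal", "r2l", "dos", "normal"]
predicted = ["normal", "dos", "normal", "dos", "dos", "r2l", "probe", "normal"]

print(one_vs_rest_counts(actual, predicted, "dos"))
# {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 4}
```

Running the same function once per class gives one 2 x 2 summary per attack type, which makes it easy to see which classes the model confuses most.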

The data set used was the one prepared for the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between “bad” connections, called intrusions or attacks, and “good” normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment. In the KDD99 dataset, the four attack categories (DoS, U2R, R2L, and probe) are divided into 22 specific attack types, which are tabulated below:

In the KDD Cup 99, the criterion used to evaluate participant entries is the Cost Per Test (CPT), computed using the confusion matrix and a given cost matrix.
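As a rough sketch of how such a score can be computed: each cell of the confusion matrix is multiplied by the cost of that kind of outcome, and the sum is divided by the number of test examples. The 3 x 3 matrices below are hypothetical values chosen for illustration, not the actual KDD Cup 99 cost matrix, and the function name `cost_per_test` is made up. Both matrices must use the same row and column orientation.

```python
def cost_per_test(confusion, cost):
    """Average misclassification cost: sum(confusion[i][j] * cost[i][j]) / N."""
    total = sum(sum(row) for row in confusion)
    weighted = sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )
    return weighted / total


# Hypothetical 3-class example (NOT the real KDD Cup 99 numbers)
confusion = [
    [90, 5, 5],
    [10, 80, 10],
    [2, 8, 90],
]
cost = [
    [0, 1, 2],  # correct predictions on the diagonal cost nothing
    [1, 0, 2],
    [3, 2, 0],
]

cpt = cost_per_test(confusion, cost)
print(round(cpt, 4))  # 0.2233
```

A lower CPT is better: a perfect classifier only fills the diagonal, where the cost is zero.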

The Detection of Attack and Normal Pattern Can be Generalized as Follows:

  • True Positive (TP): The number of connections detected as attack that are actually attacks.
  • True Negative (TN): The number of connections detected as normal that are actually normal.
  • False Positive (FP): The number of connections detected as attack that are actually normal (false alarm).
  • False Negative (FN): The number of connections detected as normal that are actually attacks.
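Two rates commonly derived from these four counts in intrusion detection are the detection rate (the share of real attacks that were caught) and the false alarm rate (the share of normal traffic wrongly flagged). The sketch below uses the standard definitions; the counts are hypothetical and the function names are only illustrative.

```python
def detection_rate(tp, fn):
    """Share of actual attacks that were detected: TP / (TP + FN)."""
    return tp / (tp + fn)


def false_alarm_rate(fp, tn):
    """Share of normal connections flagged as attacks: FP / (FP + TN)."""
    return fp / (fp + tn)


# Hypothetical counts for illustration
tp, fn = 450, 50   # 500 real attacks, 50 of them missed
fp, tn = 20, 480   # 500 normal connections, 20 false alarms

print(detection_rate(tp, fn))    # 0.9
print(false_alarm_rate(fp, tn))  # 0.04
```

In a security setting the two errors carry different risks: a False Negative is an attack that slipped through, while a False Positive is a false alarm that costs an analyst's time.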

In the confusion matrix above, rows correspond to predicted categories, while columns correspond to actual categories.

This case study presents a new cyber attack detection and classification system. In it, the performance of an intrusion detection system (IDS) is improved using a parallel support vector machine (PSVM) for distributed cyber attack detection and classification. The new PSVM is shown to be more efficient at detecting and classifying different types of cyber attacks than SDF. The experimental results on the KDD99 benchmark dataset show that the proposed algorithm achieves a high detection rate on different types of network attacks.

Conclusion:

A confusion matrix is a tabular summary of the number of correct and incorrect predictions made by a classifier. It is used to evaluate the performance of a classification model through performance metrics such as accuracy, precision, recall, and F1-score.

Need for Confusion Matrix in Machine learning:

  • It evaluates the performance of classification models when they make predictions on test data, and tells us how good our classification model is.
  • It tells us not only how many errors the classifier made but also the type of each error, that is, whether it is a Type 1 or a Type 2 error.
  • With the help of the confusion matrix, we can calculate different metrics for the model, such as accuracy and precision.

The confusion matrix can only be built if the true values for the test data are known. The matrix itself is easy to understand and to implement when testing an ML model.
