K-means clustering and its use case in the cyber security domain
Clustering
Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean-based distance or correlation-based distance. The decision of which similarity measure to use is application-specific.
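To make the two similarity measures mentioned above a bit more concrete, here is a minimal sketch in plain Python. The function names and the sample vectors are assumptions made up for illustration; they do not come from any library.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 - Pearson correlation: close to 0 when the vectors rise and fall together."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (norm_a * norm_b)

# Same shape, different scale: far apart in Euclidean terms,
# but practically identical under correlation-based distance.
p = [1.0, 2.0, 3.0]
q = [10.0, 20.0, 30.0]
euclid = euclidean_distance(p, q)
corr = correlation_distance(p, q)
```

The two measures can disagree strongly on the same pair of points, which is why the choice depends on the application.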
Clustering analysis can be done on the basis of features, where we try to find subgroups of samples based on features, or on the basis of samples, where we try to find subgroups of features based on samples. Here we’ll cover clustering based on features. Clustering is used in market segmentation, where we try to find customers that are similar to each other in terms of behaviors or attributes; in image segmentation and compression, where we try to group similar regions together; in document clustering based on topics; and so on.
Unlike supervised learning, clustering is considered an unsupervised learning method because we don’t have ground-truth labels against which to compare the output of the clustering algorithm and evaluate its performance. We only want to investigate the structure of the data by grouping the data points into distinct subgroups.
In this post, we will discuss K-means, which is considered one of the most widely used clustering algorithms due to its simplicity, and its use cases in the cyber security domain, which is a very hot topic nowadays!
K-means Algorithm
In simple terms, the K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
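The quantity K-means tries to minimize can be computed directly. Below is a small sketch in plain Python that scores two different groupings of the same four points; the helper names and the toy data are assumptions for illustration only.

```python
def centroid(points):
    """Arithmetic mean of a list of equal-length points."""
    n = len(points)
    return [sum(coord) / n for coord in zip(*points)]

def within_cluster_sum_of_squares(clusters):
    """Sum, over all clusters, of squared distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((x - m) ** 2 for x, m in zip(p, c))
    return total

# The same four points, grouped two different ways.
tight = [[[0, 0], [0, 2]], [[10, 10], [10, 12]]]   # neighbours grouped together
loose = [[[0, 0], [10, 10]], [[0, 2], [10, 12]]]   # distant points grouped together
tight_score = within_cluster_sum_of_squares(tight)
loose_score = within_cluster_sum_of_squares(loose)
```

The grouping that keeps neighbours together gets a much lower score, which is the kind of partition K-means searches for.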
The K-means algorithm works as follows:
- Specify number of clusters K.
- Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
- Keep iterating the following three steps until the centroids no longer change, i.e. the assignment of data points to clusters isn’t changing:
- Compute the squared distance between each data point and every centroid.
- Assign each data point to the closest cluster (centroid).
- Compute the centroids for the clusters by taking the average of all the data points that belong to each cluster.
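The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the function names, the `max_iter` safety cap, the fixed `seed` and the toy data are all assumptions added for the example.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of equal-length numeric points."""
    rng = random.Random(seed)
    # Pick K data points as the initial centroids (sampling without replacement).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stop once no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels

# Two well-separated groups of three points each.
data = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centroids, labels = kmeans(data, k=2)
```

Because the starting centroids are random, different seeds can number the clusters differently, and on harder data they can end in different local optima. Running the algorithm several times and keeping the best result is a common remedy.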
Image Steganography Method Using K-Means Clustering
What is steganography?
Steganography is a method that allows people to communicate while hiding the very existence of the communication.
In simple words, image steganography is used to deliver messages securely without interference from a third party (a hacker!).
Steganography is the science of invisible communication. The transfer of information takes place by hiding data inside media.
The various steganographic techniques can be categorized as follows:
• Text Steganography.
• Image Steganography.
• Audio Steganography.
• Video Steganography.
• Network Steganography.
From the list above, I hope you can see how wide the range of applications of steganography is.
Steganography + K-means Clustering = Enhanced Security
- First, we encrypt the text message using the DES algorithm.
(The Data Encryption Standard (DES) is a symmetric-key block cipher. Its design has held up well against decades of cryptanalysis, although its 56-bit key is short by today’s standards. Encrypting the message first makes the hidden text harder to read even if it is extracted.)
- Next, the image in which the encrypted text is to be hidden is chosen. The pixels of the image are then clustered into groups based on their pixel values (R, G, B) using the K-means algorithm. K-means clustering aims to partition n observations (in this case, pixels) into k clusters, in which each observation belongs to the cluster with the nearest centroid, which serves as a prototype of the cluster.
- The encrypted message is then divided into k segments, and each segment is embedded into one of the pixel clusters.
- The clusters are then stacked back together according to pixel position, which reassembles the picture as a stego-image that looks like the original.
- Finally, the stego-image, the k value and the DES encryption key are securely sent to the receiver.
- For this transmission, the RSA algorithm is used.
(RSA is an asymmetric cryptographic algorithm that uses two different keys. This is also called public-key cryptography, because one of the keys can be given to everyone.)
In the end, the receiver uses the private key that matches that public key to recover the DES key, and can then decrypt the message hidden inside the image.
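The bookkeeping part of this scheme (grouping pixels by cluster, splitting the encrypted message into k segments, and stacking the clusters back by pixel position) can be sketched in plain Python. The function names, the six-pixel "image" and the cluster labels are hypothetical; the labels are simply assumed to come from a K-means run on the (R, G, B) values, and the byte string stands in for real DES output.

```python
def split_into_segments(data, k):
    """Split a bytes object into k nearly equal, ordered segments."""
    base, extra = divmod(len(data), k)
    segments, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        segments.append(data[start:start + size])
        start += size
    return segments

def group_positions_by_cluster(labels, k):
    """Map each cluster index to the pixel positions assigned to it."""
    groups = [[] for _ in range(k)]
    for position, label in enumerate(labels):
        groups[label].append(position)
    return groups

def restack(groups, cluster_pixels, n_pixels):
    """Put every pixel back at its original position in the image."""
    image = [None] * n_pixels
    for positions, pixels in zip(groups, cluster_pixels):
        for position, pixel in zip(positions, pixels):
            image[position] = pixel
    return image

# Hypothetical 6-pixel image, flattened row by row, with assumed cluster labels.
pixels = [(250, 0, 0), (0, 0, 250), (255, 5, 5), (5, 5, 255), (0, 0, 245), (245, 0, 0)]
labels = [0, 1, 0, 1, 1, 0]
k = 2

groups = group_positions_by_cluster(labels, k)
cluster_pixels = [[pixels[p] for p in positions] for positions in groups]
segments = split_into_segments(b"ciphertext", k)   # placeholder for the DES output
rebuilt = restack(groups, cluster_pixels, len(pixels))
```

Remembering each pixel's original position is what lets the clusters be reassembled into an image after the segments have been hidden in them.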
Now let’s see how it works in practice:
A) Sender Side
1. Encrypting the Text: The text is encrypted using the DES encryption technique.
2. Cluster-wise Steganography: The image file and the encrypted text file are chosen, and the text is embedded into the clusters. After steganography, clusters with hidden information are formed.
3. Secure Transmission of the DES Key: The final task is to securely send the DES key. For this, we use the receiver’s public key to encrypt the file in which all the information is stored. The encrypted text generated is stored in a separate file. This file, along with the clusters, is then sent to the intended recipient, who alone has the private key to decrypt it.
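The post does not say how the text is embedded into each cluster, so the sketch below assumes the common least-significant-bit (LSB) technique: each bit of a message segment overwrites the lowest bit of one colour channel. The function names and the sample cluster are hypothetical.

```python
def bytes_to_bits(data):
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]

def embed_segment(cluster_pixels, segment):
    """Hide one message segment in the lowest bits of a cluster's colour channels."""
    channels = [value for pixel in cluster_pixels for value in pixel]
    bits = bytes_to_bits(segment)
    if len(bits) > len(channels):
        raise ValueError("cluster too small for this segment")
    for i, bit in enumerate(bits):
        channels[i] = (channels[i] & ~1) | bit   # overwrite only the lowest bit
    # Regroup the flat channel list into (R, G, B) tuples.
    return [tuple(channels[i:i + 3]) for i in range(0, len(channels), 3)]

# Hypothetical cluster of three similar pixels; b"A" is 01000001 in binary.
cluster = [(200, 100, 50), (201, 101, 51), (202, 102, 52)]
stego_cluster = embed_segment(cluster, b"A")
```

Each channel value changes by at most 1, which is why the stego-image is visually hard to tell apart from the original.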
B) Receiver Side
1. Decryption of the Encrypted File at the Receiver’s Side: The first step is to decrypt the encrypted file containing the DES key. This is done using the receiver’s private key. The recovered DES key is later used to transform the data obtained from the image into plain text, i.e. the actual message.
2. Cluster-wise Extraction: The number of clusters and their filenames are entered. The data extracted from these files is appended together and a text file is generated for DES decryption.
3. Decryption of the Entire Data at the Receiver’s Side: Once we have the key, the DES decryption technique is used to transform the cipher text, obtained from the various clusters, into plain text, i.e. the original message.
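Cluster-wise extraction can be sketched as the mirror image of embedding. The example assumes the segments were hidden in the least significant bits of the colour channels and that the receiver knows how many bytes each cluster carries. Both are assumptions, since the post does not specify them, and the function names and pixel values are hypothetical. DES decryption of the recovered bytes is left out.

```python
def bits_to_bytes(bits):
    """Pack a list of bits (most significant first) into bytes, ignoring leftovers."""
    out = bytearray()
    for i in range(0, len(bits) - len(bits) % 8, 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)

def extract_segment(cluster_pixels, n_bytes):
    """Read n_bytes back from the lowest bits of a cluster's colour channels."""
    channels = [value for pixel in cluster_pixels for value in pixel]
    bits = [value & 1 for value in channels[:n_bytes * 8]]
    return bits_to_bytes(bits)

def recover_ciphertext(stego_clusters, segment_lengths):
    """Append the segments from all clusters, in cluster order."""
    return b"".join(
        extract_segment(cluster, n)
        for cluster, n in zip(stego_clusters, segment_lengths)
    )

# Hypothetical clusters whose lowest bits spell 01000001 ("A") and 01000010 ("B").
cluster_a = [(200, 101, 50), (200, 100, 50), (202, 103, 52)]
cluster_b = [(10, 11, 10), (10, 10, 10), (11, 10, 99)]
ciphertext = recover_ciphertext([cluster_a, cluster_b], [1, 1])
```

The recovered bytes would then go through DES decryption with the key obtained in step 1.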
Booyah! The receiver has got the text securely. That’s the beauty of combining steganography with K-means clustering to enhance security.
I hope you learned something new and interesting :)
Thank you!