Case-Study: How Unsupervised Machine Learning K-Means Clustering is Used in Cyber Security Domain

Sathvika Kolisetty
7 min read · Aug 26, 2021

The activity of Internet users increases from year to year and has had an impact on the behaviour of the users themselves. Assessment of user behaviour is often based only on interaction across the Internet, without knowledge of any other activities. Activity logs offer another way to study user behaviour. Because Internet activity logs are a type of big data, data mining with the K-Means technique can serve as a solution for analysing user behaviour. In the study summarized here, clustering with the K-Means algorithm produced three clusters: high, medium, and low. The results from a higher education institution show that, in each of these clusters, the most frequented websites were, in order, search engines, social media, news, and information sites. The study also showed that cyber profiling is strongly influenced by environmental factors and daily activities.

So, what are the machine learning applications in information security?

In principle, machine learning can help businesses better analyze threats and respond to attacks and security incidents. It could also help to automate more menial tasks previously carried out by stretched and sometimes under-skilled security teams.

In this article, we will look at K-Means Clustering, an Unsupervised Machine Learning algorithm that is very useful in the security domain across different use cases.

K-Means Clustering is an Unsupervised Machine Learning Algorithm, which groups the unlabeled dataset into different clusters.

Let’s break this statement down to better understand what K-Means clustering means.

What is K-Means clustering?


K-Means Clustering is an Unsupervised Learning Algorithm that groups an unlabeled dataset into different clusters. Here, K is the number of clusters to be created, chosen in advance: if K=2, there will be two clusters; if K=3, there will be three clusters; and so on.

How does the K-Means Algorithm Work?


Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as the initial centroids. (They need not come from the input dataset.)

Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.

Step-4: Compute a new centroid for each cluster as the mean of its assigned points (this is what minimizes the variance within each cluster).

Step-5: Repeat the third step, that is, reassign each data point to the new closest centroid.

Step-6: If any reassignment occurs, go to Step-4; otherwise, go to FINISH.

Step-7: The model is ready.
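
The steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function name kmeans, the seed and max_iters parameters, and the sample points are all assumptions made for the example, and the initial centroids are simply drawn from the data.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of equal-length numeric tuples) into k groups."""
    rng = random.Random(seed)
    # Step 2: pick k data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Steps 3 and 5: assign every point to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 6: stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 4: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical 2-D data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

Here labels holds one cluster index per input point, so the first three points should share one index and the last three the other. Which group gets index 0 depends on the random initialization.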

Applications of K-means Clustering

The K-Means algorithm is very popular and is used in a variety of applications, such as market segmentation, document clustering, image segmentation, and image compression. It groups unlabeled data and is used mostly in the following fields:

  • Customer Profiling
  • Market segmentation
  • Computer vision
  • Geo-statistics
  • Astronomy
  • Document clustering
  • Identifying crime-prone areas
  • Customer segmentation
  • Insurance fraud detection
  • Public transport data analysis
  • Clustering of IT alerts

Use-Cases in the Security Domain:

K-Means can typically be applied to data that is numeric, is continuous, and has a relatively small number of dimensions. Think of a scenario in which you want to form groups of similar things from a randomly distributed collection of things; K-Means is very suitable for such scenarios.
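
Because K-Means compares points by distance, numeric features measured on very different scales are usually rescaled first so that no single feature dominates. The following is a small sketch of that preprocessing step; the standardize helper and the per-host features (bytes sent, failed logins) are hypothetical and not taken from any particular dataset.

```python
import statistics

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]

# Hypothetical features per host: (bytes sent, failed logins)
raw = [(120_000.0, 1.0), (450_000.0, 0.0), (98_000.0, 7.0), (510_000.0, 2.0)]
scaled = standardize(raw)
```

After this step both columns are on a comparable scale, so a difference of a few failed logins is no longer drowned out by differences of hundreds of thousands of bytes.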

1. Intrusion Detection System (IDS)

An intrusion detection system (IDS) is a device or software application that monitors a network for malicious activity or policy violations. Any malicious activity or violation is typically reported or collected centrally using a security information and event management system. Anomaly detection is one type of intrusion detection. Current anomaly detection is often associated with high false alarm rates and only moderate accuracy and detection rates, because it is unable to detect all types of attacks correctly.

To overcome this problem, K-Means clustering is useful: it clusters all the data into corresponding groups before a classifier is applied, with the aim of keeping the false alarm rate reasonable. This approach has resulted in high accuracy and good detection rates, but still with moderate false alarms on novel attacks.
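
As a rough sketch of the cluster-then-classify idea, the example below assigns labeled training records to their nearest centroid and then predicts the majority label of whichever cluster a new record falls into. The article does not say which classifier is used, so the majority vote here is only a stand-in, and the centroids, features, and labels are made-up values assumed to come from a prior K-Means run on scaled data.

```python
import math
from collections import Counter, defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def fit_cluster_labels(records, labels, centroids):
    """Map each cluster index to the most common label among its training records."""
    votes = defaultdict(Counter)
    for record, label in zip(records, labels):
        votes[nearest(record, centroids)][label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}

def classify(record, centroids, cluster_labels, default="unknown"):
    """Predict by looking up the label of the record's nearest cluster."""
    return cluster_labels.get(nearest(record, centroids), default)

# Hypothetical, already-scaled features: (connection duration, failed logins)
centroids = [(0.2, 0.1), (0.9, 0.8)]  # assumed output of a prior K-Means run
train = [(0.1, 0.0), (0.3, 0.2), (0.8, 0.9), (1.0, 0.7)]
train_labels = ["normal", "normal", "attack", "attack"]

cluster_labels = fit_cluster_labels(train, train_labels, centroids)
prediction = classify((0.85, 0.75), centroids, cluster_labels)
```

In practice the per-cluster majority vote would likely be replaced by a real classifier trained on each cluster, but the flow is the same: cluster first, then classify within the group.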

2. Identify Outlier Access:

The average user has more than 100 entitlements, which can be very difficult to manage manually. With a K-Means clustering model, we can detect access outliers by analyzing what is going on within dynamic peer groups of users.

For example, on a Saturday afternoon, the company’s access data shows an employee from IT working on the production finance system. This is seemingly an outlier activity for an IT employee, as it’s not typical for someone in this role to be accessing a production finance system, much less on a Saturday afternoon. So, is this risky activity? Meanwhile, at the exact same time on the same day, a business analyst is accessing and working on that same production finance application.

If we examine these two access activities individually, we might perceive a problem. Yet, if we combine these two access data points dynamically, the situation may appear to be less risky. Read on.

Now, let’s add another person, a financial analyst from the Finance organization, who is also accessing the same production finance application on the same Saturday. We now have three different people, from different workgroups, all accessing the production finance system at the same time on the same day. So, what’s going on?

What’s most likely taking place in this scenario is these employees are working together to perform a system upgrade or are resolving a production issue occurring in the financial system. From a real-world viewpoint, where we can examine traditional static data attributes such as job title or department number, these three employees would not be considered a relevant peer group. From a behavioural analytics standpoint, these three employees do comprise a dynamically generated peer group, as there is system data logging their actions of accessing the same production finance system at the same time.

Dynamic peer groups are clusters of users that are created as Risk Analytics ingests log data, in near real-time, all internal to the machine learning algorithms. Dynamic peer groups are fairly transient, yet they can be retained for future reference.
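
To make the peer-group idea concrete, here is a deliberately simplified sketch. It does not run K-Means; it just buckets access events by system and hour, which is enough to show how the three Saturday accesses end up in one group while a solitary access stands out. The event tuples, user names, and system names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def peer_groups(events):
    """Group users who accessed the same system within the same clock hour."""
    groups = defaultdict(set)
    for user, system, ts in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        groups[(system, hour)].add(user)
    return groups

def lone_accesses(groups):
    """Users who had no peers on the same system in the same hour."""
    return sorted(user for users in groups.values() if len(users) == 1 for user in users)

# Hypothetical log: (user, system, timestamp); 2021-08-28 was a Saturday.
events = [
    ("it_admin", "prod-finance", datetime(2021, 8, 28, 14, 5)),
    ("business_analyst", "prod-finance", datetime(2021, 8, 28, 14, 20)),
    ("financial_analyst", "prod-finance", datetime(2021, 8, 28, 14, 40)),
    ("intern", "prod-finance", datetime(2021, 8, 29, 3, 15)),
]

groups = peer_groups(events)
outliers = lone_accesses(groups)
```

A real system would presumably cluster on richer behavioural features than a fixed hourly bucket, but the outcome is the same kind of signal: an access with peers looks less risky than one without.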

3. Android Malware Classification

Android malware is malicious software that targets a specific type of device: the Android device. Aspects of the Android platform that make it less secure, such as the Play Store from which applications are downloaded and users’ ability to sideload content from the internet, create an environment where malware can thrive. Malware often also harvests fake clicks on ads, doubling up on the value for its makers. Ransomware and scareware are the main malicious activities.

K-Means clustering can be used to group these malware samples into clusters and, in addition, to build a classification model for Android malware, where each cluster prediction becomes an element of the cluster. The clusters constructed by the rule-based clustering algorithm are then used to train the classifier algorithm.

4. Crime Analysis:

Crime analysis is a law enforcement function that involves systematic analysis for identifying and analyzing patterns and trends in crime and disorder. Crime analysis also plays a role in devising solutions to crime problems and formulating crime prevention strategies. Analysis of crime is essential for providing safety and security to the civilian population.

The K-Means clustering technique is used to extract useful information from high-volume crime datasets and to interpret the data, which assists police in identifying and analyzing crime patterns so as to reduce further occurrences of similar incidents and provide information that helps reduce crime.
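
Once incident locations have been clustered, one simple way to read the result is to count incidents per cluster and rank the centroids as hotspots. The sketch below assumes the centroids already exist from a prior K-Means run and uses invented coordinates on a local grid in kilometres, so plain Euclidean distance is reasonable; the rank_hotspots name is hypothetical.

```python
import math
from collections import Counter

def rank_hotspots(incidents, centroids):
    """Count incidents per nearest centroid; return (centroid, count), busiest first."""
    counts = Counter(
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in incidents
    )
    return [(centroids[i], n) for i, n in counts.most_common()]

# Hypothetical incident locations as (x_km, y_km) on a local grid.
incidents = [(1.0, 1.2), (0.8, 0.9), (1.1, 1.0), (5.0, 5.2), (5.1, 4.9), (9.0, 1.0)]
centroids = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)]  # assumed output of a prior K-Means run

hotspots = rank_hotspots(incidents, centroids)
```

The first entry of hotspots is the area with the most recorded incidents, which is the kind of summary that can help prioritize patrols or further analysis.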

5. Malware Detection:

Malware interferes with the registry when it enters a computer: it tends to create and modify files in the file system and Windows registry entries, as well as inter-process communication and basic network interaction. Intrusion attacks such as malware are known to breach an organization’s network security policy and continuously try to undermine the core fundamentals of cyber security: Confidentiality, Integrity, and Availability, known as CIA.

Therefore, previous cyber security researchers have proposed a detection-based framework for malware intrusion that monitors the behaviour of system activity. The framework analyzes this behaviour and notifies users if there is a sign of intrusion.
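
The article does not describe how such a framework works internally, so the following is only one assumed way to connect it to K-Means: treat the centroids of known-normal behaviour as a baseline and raise an alert when a new observation is far from all of them. The features (registry writes and file modifications per minute, scaled), the centroids, and the threshold are all hypothetical.

```python
import math

def anomaly_score(sample, centroids):
    """Distance from `sample` to the closest centroid of known-normal behaviour."""
    return min(math.dist(sample, c) for c in centroids)

def check(sample, centroids, threshold):
    """Return an alert message if the sample is far from every normal cluster, else None."""
    score = anomaly_score(sample, centroids)
    if score > threshold:
        return f"ALERT: behaviour score {score:.2f} exceeds {threshold:.2f}"
    return None

# Hypothetical scaled features: (registry writes per minute, files modified per minute)
normal_centroids = [(0.1, 0.2), (0.3, 0.1)]  # assumed output of K-Means on clean activity
threshold = 0.5                              # assumed cut-off, tuned on historical data

quiet = check((0.2, 0.15), normal_centroids, threshold)
noisy = check((0.9, 0.95), normal_centroids, threshold)
```

Here quiet stays None because the sample sits close to normal behaviour, while noisy carries an alert string that a monitoring tool could forward to the user.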