The goal of unsupervised machine learning is to enable an algorithm to make sense of input that has not been explicitly labelled or categorised. This happens during the training phase: the machine's job is to find similarities and differences among the unsorted data and arrange it accordingly, without any previously labelled examples to learn from.
The goal of clustering is to divide a population or set of data points into groups such that points within the same group are more similar to one another than they are to points in other groups. Put simply, it is the process of sorting objects into categories based on how similar or dissimilar they are to one another.
We are given a set of objects, each with its own set of attributes and values for those attributes (represented as a feature vector). The task at hand is to arrange these objects into sensible clusters. To do this, we'll use k-means clustering, an unsupervised learning technique; the 'k' in the name stands for the desired number of clusters.
Based on their similarities, the algorithm divides the items into k distinct clusters, using Euclidean distance as the yardstick for how close two items are.
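To make that yardstick concrete, here is a minimal sketch using NumPy; the two feature vectors are made up purely for illustration:

```python
import numpy as np

# Two hypothetical items, each described by a three-feature vector.
a = np.array([1.0, 2.5, 0.3])
b = np.array([2.0, 0.5, 1.3])

# Euclidean distance: the square root of the sum of squared
# coordinate differences.
distance = np.sqrt(np.sum((a - b) ** 2))
print(distance)  # equivalent to np.linalg.norm(a - b)
```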
The algorithm operates as follows:
- Start by picking k locations at random and using them as the cluster centroids, or means.
- Assign each item to the cluster whose mean is closest to it, then update each mean's coordinates to be the average of the items currently assigned to that cluster.
- Repeat the previous step until the assignments stop changing, or until a fixed number of iterations has been reached; the resulting groups are the required clusters (see the sketch after this list).
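Written out in code, the loop looks something like the following. This is a minimal NumPy sketch under the assumptions above (random data-point initialisation, Euclidean distance), not a production implementation; the function name `k_means` and its parameters are mine:

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """A minimal k-means loop; X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Initialise the means at k randomly chosen data points.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the average of its assigned points
        # (keeping the old mean if a cluster happens to be empty).
        new_means = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        if np.allclose(new_means, means):  # converged: means stopped moving
            break
        means = new_means
    return means, labels
```

The iteration cap guards against slow convergence, while the `np.allclose` check lets the loop stop early once the means stop moving.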
The centroids we have been calling "points" are more properly called "means", since each one is the average value of the items in its group. The means can be initialised in several ways. One sensible option is to initialise them at randomly chosen items from the data set. Another is to initialise them with random values drawn from each feature's observed range (for instance, if a feature x takes values from 0 to 3, each mean's x-coordinate is drawn from between 0 and 3), so that the starting points are representative of the data.
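A sketch of the second strategy, drawing each mean's coordinates uniformly from the corresponding feature's observed range (the helper name is mine):

```python
import numpy as np

def init_means_in_range(X, k, seed=0):
    """Draw each coordinate of each initial mean uniformly from that
    feature's observed [min, max] range, as described above."""
    rng = np.random.default_rng(seed)
    low, high = X.min(axis=0), X.max(axis=0)
    return rng.uniform(low, high, size=(k, X.shape[1]))
```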
Using this method, the data points are clustered so that the sum of their squared distances to their cluster's centroid is minimised. Each iteration can only decrease this quantity, so the procedure is guaranteed to converge, though to a local minimum of the objective rather than necessarily the best possible clustering. The smaller this sum, the more similar the data points within each cluster are to one another.
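The quantity being minimised, the within-cluster sum of squared distances (often called inertia), can be computed directly. A sketch, assuming `X`, `means`, and `labels` as produced by the `k_means` function above:

```python
import numpy as np

def inertia(X, means, labels):
    """Sum, over all points, of the squared Euclidean distance from
    each point to the mean of the cluster it is assigned to."""
    return float(np.sum((X - means[labels]) ** 2))
```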
Conclusion
K-means uses an approach called expectation-maximisation to find a solution. During the expectation step, each data point is assigned to the cluster whose centroid is closest to it in feature space, and during the maximisation step, the centroid of each cluster is recomputed as the mean of its assigned points.
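In practice these two steps are rarely hand-coded; scikit-learn's `KMeans`, for example, runs the same expectation-maximisation loop. A brief illustration on made-up two-dimensional data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic blobs of 2-D points, 50 samples each.
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(3, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the two learned means
print(km.inertia_)          # within-cluster sum of squared distances
```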