Today I learnt about the K-means, K-medoids, and DBSCAN clustering methods.
- K-means is a nonhierarchical clustering method. You tell it how many clusters you want, and it tries to find the “best” clustering.
- “K means” refers to the following:
- The number of clusters you specify (K).
- The process of assigning observations to the cluster with the nearest center (mean).
- The drawbacks of K-means are as follows:
- Sensitivity to initial conditions
- Difficulty in determining K
- Inability to handle categorical data
- Time complexity
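As a minimal sketch of the ideas above, here is K-means via scikit-learn's `KMeans` on toy 2-D data; the data and parameter values are illustrative assumptions, and `n_init` reruns the algorithm from several random starts to soften the sensitivity-to-initial-conditions drawback:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated 2-D blobs (toy data for illustration).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.3, size=(50, 2)),
])

# n_clusters is the K you must choose up front.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the two learned means
print(km.labels_[:5])       # cluster assignments of the first few points
```

Note that `cluster_centers_` are computed means, which generally are not actual data points.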
- K-medoids clustering is a variant of K-means that is more robust to noise and outliers.
- Instead of using the mean point as the center of a cluster, K-medoids uses an actual point in the cluster to represent it.
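To make the medoid idea concrete, here is a naive NumPy sketch of K-medoids (not a production algorithm like PAM; the function name and toy data are my own). It alternates between assigning points to the nearest medoid and picking, within each cluster, the actual point that minimises the total distance to its cluster-mates:

```python
import numpy as np

def k_medoids(X, k, n_iter=10, seed=0):
    """Naive K-medoids sketch: alternate assignment and medoid update."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)   # random initial medoids
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)         # assign to nearest medoid
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue
            # New medoid: the member with the smallest summed distance to the rest.
            medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
    return medoids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.3, (30, 2)),
               rng.normal([5, 5], 0.3, (30, 2))])
medoids, labels = k_medoids(X, k=2)
# Each medoid is the index of an actual data point, not a computed mean.
```

Because the center must be a real observation, a single far-away outlier cannot drag it the way it drags a mean.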
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is also a clustering algorithm.
- Although it is an old algorithm (published in 1996), it is still used today because it is versatile and produces very high-quality clusters: any point that doesn’t fit into a dense region is designated an outlier.
- There are two hyperparameters in DBSCAN:
- epsilon: the radius of the neighborhood around a point, used to check how dense that neighborhood is.
- minPts: the minimum number of points a neighborhood must contain for its center to count as a core point of a cluster.
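A quick sketch of the two hyperparameters using scikit-learn's `DBSCAN` (where epsilon is `eps` and minPts is `min_samples`); the toy data, including a deliberately isolated point, is my own:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.2, (40, 2)),   # dense blob 1
    rng.normal([4, 4], 0.2, (40, 2)),   # dense blob 2
    [[10.0, 10.0]],                     # an isolated point
])

# eps is the neighborhood radius (epsilon); min_samples is minPts.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
# Points that belong to no dense region get the label -1 (outliers).
print(sorted(set(db.labels_)))
```

Unlike K-means, you never specify the number of clusters; it falls out of the density structure.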
- Hierarchical DBSCAN (HDBSCAN) is a more recent algorithm that essentially replaces the epsilon hyperparameter of DBSCAN with a more intuitive one called ‘min_cluster_size’.