Today I learnt about the K-means, K-medoids, and DBSCAN clustering methods.
- K-means is a nonhierarchical clustering method. You tell it how many clusters you want, and it tries to find the “best” clustering.
- “K means” refers to the following:
- The number of clusters you specify (K).
- The process of assigning observations to the cluster with the nearest center (mean).
- The drawbacks of K-means are as follows:
- Sensitivity to initial conditions
- Difficulty in determining K
- Inability to handle categorical data
- Time complexity
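As a minimal sketch of the ideas above, here is K-means via scikit-learn's `KMeans` on toy 2-D data; the data and parameter values are illustrative assumptions, and `n_init` reruns the algorithm from several random starts to soften the sensitivity-to-initial-conditions drawback:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated 2-D blobs (toy data for illustration).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.3, size=(50, 2)),
])

# n_clusters is the K you must choose up front.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the two learned means
print(km.labels_[:5])       # cluster assignments of the first few points
```

Note that `cluster_centers_` are computed means, which generally are not actual data points.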
- K-medoids clustering is a variant of K-means that is more robust to noise and outliers.
- Instead of using the mean point as the center of a cluster, K-medoids uses an actual point in the cluster to represent it.
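To make the medoid idea concrete, here is a naive NumPy sketch of K-medoids (not a production algorithm like PAM; the function name and toy data are my own). It alternates between assigning points to the nearest medoid and picking, within each cluster, the actual point that minimises the total distance to its cluster-mates:

```python
import numpy as np

def k_medoids(X, k, n_iter=10, seed=0):
    """Naive K-medoids sketch: alternate assignment and medoid update."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)   # random initial medoids
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)         # assign to nearest medoid
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue
            # New medoid: the member with the smallest summed distance to the rest.
            medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
    return medoids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.3, (30, 2)),
               rng.normal([5, 5], 0.3, (30, 2))])
medoids, labels = k_medoids(X, k=2)
# Each medoid is the index of an actual data point, not a computed mean.
```

Because the center must be a real observation, a single far-away outlier cannot drag it the way it drags a mean.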
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is also a clustering algorithm.
- Although it is an old algorithm (published in 1996), it is still used today because it is versatile and produces very high-quality clusters: any point that doesn’t fit into a dense region is designated an outlier.
- There are two hyperparameters in DBSCAN:
- epsilon: the radius of the neighborhood around a point, used to check how dense that neighborhood is.
- minPts: the minimum number of points a neighborhood must contain for its center to count as a core point of a cluster.
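A quick sketch of the two hyperparameters using scikit-learn's `DBSCAN` (where epsilon is `eps` and minPts is `min_samples`); the toy data, including a deliberately isolated point, is my own:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.2, (40, 2)),   # dense blob 1
    rng.normal([4, 4], 0.2, (40, 2)),   # dense blob 2
    [[10.0, 10.0]],                     # an isolated point
])

# eps is the neighborhood radius (epsilon); min_samples is minPts.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
# Points that belong to no dense region get the label -1 (outliers).
print(sorted(set(db.labels_)))
```

Unlike K-means, you never specify the number of clusters; it falls out of the density structure.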
- Hierarchical DBSCAN (HDBSCAN) is a more recent algorithm that essentially replaces the epsilon hyperparameter of DBSCAN with a more intuitive one called ‘min_cluster_size’.