How to cluster noisy data sets

Series: Kmeans and Its Variants

Real-world data sets often contain outliers that you cannot remove completely during the data cleanup phase. If you have run into this problem, I want to introduce you to the k-medians algorithm. Because it uses the median instead of the mean, together with a more robust dissimilarity metric, it is much less sensitive to outliers than k-means.
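To make the median-versus-mean point concrete, here is a minimal sketch. The data is made up for illustration and is not taken from the post: a one-dimensional cluster with one extreme outlier, for which we compute both candidate centers.

```python
from statistics import mean, median

# Hypothetical one-dimensional cluster with a single extreme outlier (100.0).
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]

center_mean = mean(cluster)      # dragged toward the outlier
center_median = median(cluster)  # stays with the bulk of the points

print(center_mean, center_median)
```

The mean lands far away from the four points that make up the actual cluster, while the median does not move. This is the property k-medians relies on when it updates its cluster centers.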

The k-means++ algorithm to kick-start your initialization

Series: Kmeans and Its Variants

k-means is a very simple and ubiquitous clustering algorithm. But quite often it does not work well on your problem, for example because of a bad initialization. Fortunately, there is an improved initialization method, k-means++, which helps to alleviate this problem.
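As a rough sketch of the idea behind k-means++ seeding: the first center is drawn uniformly at random, and each further center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The function name `kmeans_pp_init`, its parameters, and the sample points are hypothetical and only serve this illustration. It assumes the data contains at least `k` distinct points.

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers from `points` (a list of tuples), k-means++ style."""
    rng = random.Random(seed)
    # First center: chosen uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared Euclidean distance from each point to its nearest chosen center.
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
            for p in points
        ]
        # Next center: sampled with probability proportional to that distance,
        # so far-away points are favored and already chosen points have weight 0.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

# Hypothetical data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = kmeans_pp_init(points, k=2)
print(centers)
```

With well-separated groups like these, the second center is very likely to come from the group that does not contain the first one, which is the spread-out starting position that plain random initialization does not guarantee.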

A deep dive into kmeans

Series: Kmeans and Its Variants

A simple framework for performance metrics

The list of performance metrics is seemingly never-ending. Especially if you are new to data science, you can easily feel stranded in an ocean of choices. Learn how the metrics connect to each other and how you can use these connections to choose the best metric for your problem and model.
