A deep dive into partitioning around medoids

The final evolution of k-means and why you probably learned it wrong before

By Martin Helm

August 20, 2021

library(ggplot2)
library(gridExtra)
library(knitr)

Introduction

During my work as a Data Scientist, I have often come across problems where simple algorithms do not suffice because they get stuck in a local optimum. This leads to a lot of frustration during development: you first think your approach is working, only to discover later that it does not do so consistently, or not for all of your data sets. In this final article in my mini-series on k-means and its variants, I will talk about the k-medoids algorithm, also commonly called partitioning around medoids (PAM). It has the beauty of being essentially deterministic and of finding very good solutions reliably. This comes at the cost of a higher computational cost, but if your data set is not extremely large, it is still a good candidate to try if you need reliable results.

Intuition of k-medoids

As with k-medians, k-medoids commonly uses the Manhattan metric, but the centers are now always actual points of the data set. Instead of the centroids, we now determine medoids: for each cluster, the data point whose total distance to all other points of the cluster is smallest. This increases the explainability of the approach, as the representative point in the data can always be retrieved.

This is often confused with k-medians (which you can check out here), where the center points do not need to be actual objects. Consider the following example cluster, consisting of 5 points in 2 dimensions:

kable(data.frame(observation = paste("point", 1:5), 
                 x = c(1, 2, 3, 4, 5), 
                 y = c(1, 2, 4, 3, 5)))
observation   x   y
point 1       1   1
point 2       2   2
point 3       3   4
point 4       4   3
point 5       5   5

Because the median is calculated for each dimension separately in k-medians, the medians would be x = 3, and y = 3. But there exists no point (3, 3) in the data set.
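
To make this difference concrete, here is a small base R sketch (not part of the implementations below) that computes both kinds of centers for this toy cluster. Note that points 3 and 4 are tied as medoid candidates, a first glimpse of the tie-breaking issue we will come back to later.

# The 5-point toy cluster from the table above
cluster <- data.frame(x = c(1, 2, 3, 4, 5),
                      y = c(1, 2, 4, 3, 5))

# k-medians style center: the per-dimension median (3, 3), which is not
# a point of the data set
apply(cluster, 2, median)

# k-medoids style center: the actual data point with the smallest sum of
# Manhattan distances to all other points
distance_sums <- colSums(as.matrix(dist(cluster, method = "manhattan")))
distance_sums # 20 14 13 13 20 -> points 3 and 4 tie; which.min() returns the first
cluster[which.min(distance_sums), ]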

Multiple algorithms for k-medoids exist; the most common ones are again a Lloyd-style algorithm (also called Voronoi iteration) and true partitioning around medoids (PAM). Unfortunately, the Lloyd-style algorithm is often also called PAM, but this is misleading: the BUILD phase of PAM (as we will see later) is very different from Lloyd. The BUILD phase of true PAM is the crucial step for the success of this algorithm, which is also why Lloyd-style k-medoids usually arrives at worse results than PAM.

k-medoids Lloyd style

To start off easy, let us first implement k-medoids in Lloyd style and later build upon it for true PAM. As usual, we first initialize the centers randomly, but the update of the centers is now fundamentally different.

The update step is now called the SWAP phase. As the name already suggests, we consider swapping the current medoid with every other point in its cluster. For each of these candidate swaps, we calculate the total cost, which is the sum of the distances of all points in this cluster to the new medoid. We keep track of all swaps that lower the cost and perform the best one, i.e. the one that arrives at the lowest cost.

The algorithm then terminates when the cost can no longer be decreased. Do note that this does not mean that we arrive at a global minimum. Because we only perform steps that decrease the cost, the algorithm has no way to escape a local minimum and reach the global minimum if it was not initialized within the global minimum “valley”.

kmedoids_lloyd <- function(df, k) {
  # define manhattan distance
  manhattan_distance <- function(x, y){
    return(sum(abs(x - y)))
  }
  
  # Calculate distances between all points and all current medoids
  calculate_distances <- function(df, medoids) {
    distances <- matrix(NA, nrow = nrow(df), ncol = nrow(medoids))
    for (object_id in 1:nrow(df)) {
      for (medoid_id in 1:nrow(medoids)) {
        distances[object_id, medoid_id] <- manhattan_distance(df[object_id, ], medoids[medoid_id, ])
      }
    }
    return(distances)
  }
  
  # Calculate the total cost 
  calculate_cost <- function(df, medoids) {
    distances <- calculate_distances(df, medoids)
    costs <- rep(NA, ncol(distances))
    cluster_id <- apply(distances, 1, which.min)
    
    # number of columns in the distance matrix equals the number of clusters
    for (cluster in 1:ncol(distances)) {
      costs[cluster] <- sum(distances[cluster_id == cluster, cluster])
    }
    cost <- sum(costs)
    return(cost)
  }
  
  get_non_medoids <- function(df, medoid) {
    non_medoids <- !apply(df, 1, function(x) all(x == medoid))
    non_medoids <- df[non_medoids, ]
    return(non_medoids)
  }
  
  # Find the best medoid for a cluster by trying to swap the current medoid
  # with every other point of the cluster
  get_best_swap_medoid <- function(df, medoid) {
    # Start from the current medoid, so it only gets replaced if a candidate
    # strictly lowers the cost
    best_cost <- calculate_cost(df, medoid)
    best_medoid <- medoid
    non_medoids <- get_non_medoids(df, medoid)
    
    for (non_medoid_id in 1:nrow(non_medoids)) {
      candidate_medoid <- non_medoids[non_medoid_id, ]
      this_cost <- calculate_cost(df, candidate_medoid)
      if (this_cost < best_cost) {
        best_cost <- this_cost
        best_medoid <- candidate_medoid
      }
    }
    out <- cbind(cost = best_cost, best_medoid)
    return(out)
  }
  
  # Initialize medoids randomly and calculate initial cost
  medoids <- df[sample(nrow(df), k), ]
  cost <- calculate_cost(df, medoids)
  iteration <- 0 # To keep track how many iterations we needed.
  
  while (TRUE) {
    
    distances <- calculate_distances(df, medoids)
    cluster_id <- apply(distances, 1, which.min)
    
    for (medoid_id in 1:nrow(medoids)) {
      this_medoid <- medoids[medoid_id, ]
      this_cluster <- df[cluster_id == medoid_id, ]
      best_swap <- get_best_swap_medoid(this_cluster, this_medoid)
      medoids[medoid_id, ] <- best_swap[, colnames(best_swap) != "cost", drop = FALSE]
      rownames(medoids)[medoid_id] <- rownames(best_swap)
    }
    
    new_cost <- calculate_cost(df, medoids)
    
    if (new_cost < cost) {
      cost <- new_cost
      iteration <- iteration + 1
    } else {
      # If cost no longer decreases break out of the loop and return results
      print(paste("Converged after", iteration, "iterations."))
      distances <- calculate_distances(df, medoids)
      cluster_id <- apply(distances, 1, which.min)
      out <- list(cluster_id = cluster_id,
                  medoids = medoids)
      return(out)
    }
  }
}

Partitioning around medoids (PAM)

Finally, the PAM algorithm. As I already hinted before, it has a unique BUILD phase that ensures a very good initialization. The subsequent SWAP phase works similarly to the one we previously implemented for Lloyd-style k-medoids, with a few important differences that we will discuss below.

During the BUILD phase, the first medoid is selected as the point with the minimum cost, with cost being the sum of the distances to all other points. The first medoid is therefore the most central point of the data set. All further medoids are then selected iteratively: for every non-medoid we calculate the total cost that would result from selecting this point as the next medoid (now the sum, over all points, of the distance to their closest medoid, including the already selected ones), and pick the candidate with the smallest cost as the next medoid.

To verify that this is indeed the true PAM algorithm, you can check out the paper or book by Kaufman and Rousseeuw, who originally invented it, here.

As one immediately sees, this is computationally expensive. In our implementation we recalculate all distances in every iteration; a less expensive solution would be to calculate the distance matrix only once (and only one of its triangles, as it is symmetric) and then index into it as needed for the cost calculation.
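
This precomputation is not part of the implementations in this article, but a minimal sketch of the idea could look like this (the example uses the same two iris columns as the comparison later on, and the three row indices are arbitrary):

# Sketch only: compute the distance matrix a single time and look up the
# total cost for any candidate set of medoids, given as row indices
cost_from_matrix <- function(distance_matrix, medoid_ids) {
  # For every point, take the distance to its closest medoid, then sum up
  sum(apply(distance_matrix[, medoid_ids, drop = FALSE], 1, min))
}

distance_matrix <- as.matrix(dist(iris[, c("Sepal.Length", "Sepal.Width")],
                                  method = "manhattan"))
cost_from_matrix(distance_matrix, medoid_ids = c(1, 51, 101))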

The advantage of this algorithm is that the exhaustive BUILD phase typically already arrives at a very good clustering. The following SWAP phase usually only runs for a few iterations before it converges; the authors even state that one could sometimes omit it entirely and still get a good partitioning. There are also some differences in the SWAP phase between Lloyd-style k-medoids and true PAM: Lloyd only considers swaps within the same cluster, whereas PAM considers all current non-medoids as potential swap partners, irrespective of whether they currently lie in the same cluster or not. This increases the search space for PAM and potentially enables it to find better solutions.

Another characteristic of PAM is that it is close to deterministic, because it does not use random elements during initialization and always considers all points as possible next medoids. Because there can be ties between the costs of two candidate medoids, the algorithm is not 100% deterministic; depending on the implementation, ties may be resolved randomly or by the order in which the points are presented.
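
Our implementations below resolve ties by the order of the points, simply because which.min() always returns the first of several tied minima:

# which.min() returns the index of the first minimum, so ties are broken
# by the order in which the points appear
which.min(c(13, 13, 20))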

kmedoids_pam <- function(df, k) {
  # define manhattan distance
  manhattan_distance <- function(x, y){
    return(sum(abs(x - y)))
  }
  
  # Calculate distances between all points and all current medoids
  calculate_distances <- function(df, medoids) {
    distances <- matrix(NA, nrow = nrow(df), ncol = nrow(medoids))
    for (object_id in 1:nrow(df)) {
      for (medoid_id in 1:nrow(medoids)) {
        distances[object_id, medoid_id] <- manhattan_distance(df[object_id, ], medoids[medoid_id, ])
      }
    }
    return(distances)
  }
  
  # Calculate the total cost 
  calculate_cost <- function(df, medoids) {
    distances <- calculate_distances(df, medoids)
    costs <- rep(NA, ncol(distances))
    cluster_id <- apply(distances, 1, which.min)
    
    # number of columns in the distance matrix equals the number of clusters
    for (cluster in 1:ncol(distances)) {
      costs[cluster] <- sum(distances[cluster_id == cluster, cluster])
    }
    cost <- sum(costs)
    return(cost)
  }
  
  # Get non medoids. This function is slightly different to the one of the Lloyd
  # style kmedoids, because this time we are comparing against multiple medoids.
  # The concept is: go through the data frame by row (first apply call)
  # subtract the current row from the medoids data frame
  # Go through the resulting differences. If one of them is all 0 then the current
  # row is a medoid. 
  # We cannot use rowSums for this, because a row with entries c(-1, 1) would
  # also sum to 0, but is not a medoid!
  get_non_medoids <- function(df, medoids) {
    non_medoids <- !apply(df, 1, function(x) {
      differences <- sweep(medoids, 2, x)
      is_medoid <- any(apply(differences, 1, function(y) all(y == 0)))
      is_medoid
    })
    non_medoids <- df[non_medoids, ]
    return(non_medoids)
  }
  
  # Get the best medoid for a potential swap of the medoid at position
  # medoid_id. Each candidate is evaluated with the full medoid set, in which
  # only this one medoid is replaced by the candidate.
  get_best_swap_medoid <- function(df, medoids, medoid_id) {
    # Start from the current configuration, so the medoid only gets replaced
    # if a candidate strictly lowers the total cost
    best_cost <- calculate_cost(df, medoids)
    best_medoid <- medoids[medoid_id, ]
    non_medoids <- get_non_medoids(df, medoids)
    
    for (non_medoid_id in 1:nrow(non_medoids)) {
      candidate_medoid <- non_medoids[non_medoid_id, ]
      candidate_medoids <- medoids
      candidate_medoids[medoid_id, ] <- candidate_medoid
      this_cost <- calculate_cost(df, candidate_medoids)
      if (this_cost < best_cost) {
        best_cost <- this_cost
        best_medoid <- candidate_medoid
      }
    }
    out <- cbind(cost = best_cost, best_medoid)
    return(out)
  }
  
  # BUILD phase
  # Select first medoid as the one which has the smallest cost
  distances <- as.matrix(dist(df, method = "manhattan"))
  distances <- colSums(distances)
  medoid_id <- which.min(distances) # In case of ties this will return the first minimum
  medoids <- df[medoid_id, ]
  
  # From the remaining non_medoids select the next one that has the smallest cost
  # until we have k medoids
  while (nrow(medoids) < k) {
    non_medoids <- get_non_medoids(df, medoids)
    best_cost <- Inf
    for (non_medoid_id in 1:nrow(non_medoids)) {
      candidate_medoid <- non_medoids[non_medoid_id, ]
      candidate_cost <- calculate_cost(df, rbind(medoids, candidate_medoid))
      if (candidate_cost < best_cost) {
        best_medoid <- candidate_medoid
        best_cost <- candidate_cost
      }
    }
    
    # Add the best medoid to the medoids
    medoids <- rbind(medoids, best_medoid)
  }
  
  # Calculate initial cost  
  cost <- calculate_cost(df, medoids)
  
  # SWAP phase
  # Run until algorithm converged
  iteration <- 0 # To keep track how many iterations we needed.
  
  while (TRUE) {
    # In contrast to Lloyd style k-medoids, consider the complete data set
    # for potential swaps, not only the current cluster of the medoid
    candidate_swaps <- list()
    for (medoid_id in 1:nrow(medoids)) {
      candidate_swaps[[medoid_id]] <- get_best_swap_medoid(df, medoids, medoid_id)
    }
    candidate_swaps <- Reduce(rbind, candidate_swaps)
    
    # Select and perform the best swap
    medoid_to_swap <- which.min(candidate_swaps$cost)
    medoids[medoid_to_swap, ] <- candidate_swaps[medoid_to_swap, colnames(candidate_swaps) != "cost", drop = FALSE]
    rownames(medoids)[medoid_to_swap] <- rownames(candidate_swaps)[medoid_to_swap] 
    
    new_cost <- calculate_cost(df, medoids)
    
    if (new_cost < cost) {
      cost <- new_cost
      iteration <- iteration + 1
    } else {
      # If cost no longer decreases break out of the loop and return results
      print(paste("Converged after", iteration, "iterations."))
      distances <- calculate_distances(df, medoids)
      cluster_id <- apply(distances, 1, which.min)
      out <- list(cluster_id = cluster_id,
                  medoids = medoids)
      return(out)
    }
  }
  
}

Comparison between algorithms

Now that we have implemented the different algorithms, let’s compare them a bit regarding runtime and outcome. Because we implemented everything in base R without taking advantage of vectorization, the runtime will be significantly longer than using optimized algorithms built in C or FORTRAN.

Clustering outcome

Let’s start by visualizing the results. Of course the colors for the “same” cluster can differ between the different algorithms, because they do not know which cluster belongs to which species.

set.seed(2)
df <- iris[, c("Sepal.Length", "Sepal.Width")]
species <- iris$Species
kmeans_clusters <- my_kmeans(df, k = 3, n_iter = 10, init = "random")
kmeanspp_clusters <- my_kmeans(df, k = 3, n_iter = 10, init = "kmeans++")
kmedians_clusters <- kmedians(df, k = 3, n_iter = 10)
kmedoids_lloyd_clusters <- kmedoids_lloyd(df, k = 3)
## [1] "Converged after 13 iterations."
kmedoids_pam_clusters <- kmedoids_pam(df, k = 3)
## [1] "Converged after 0 iterations."
# Helper function to create plots
plot_clusters <- function(df, cluster_id, title) {
  p <- ggplot(df, aes(Sepal.Length, Sepal.Width, color = factor(cluster_id))) + 
    geom_point(show.legend = FALSE) +
    labs(title = title,
         x = "Sepal Width", 
         y = "Sepal Length")
  return(p)
}

p_ground_truth <- plot_clusters(df, species, "Ground truth")
p_kmeans <- plot_clusters(df, kmeans_clusters, "k-means")
p_kmeanspp <- plot_clusters(df, kmeanspp_clusters, "k-means++")
p_kmedians <- plot_clusters(df, kmedians_clusters, "k-medians")
p_kmedoids_lloyd <- plot_clusters(df, kmedoids_lloyd_clusters$cluster_id, "k-medoids Lloyd")
p_kmedoids_pam <- plot_clusters(df, kmedoids_pam_clusters$cluster_id, "k-medoids PAM")

grid.arrange(p_ground_truth, p_kmeans, p_kmeanspp, p_kmedians, p_kmedoids_lloyd, p_kmedoids_pam, 
             ncol = 2)

Most algorithms do find more or less correct clusters, with the exception of k-medians. We also see that the PAM algorithm does not perform any swaps at all, highlighting the strength of its BUILD phase! Also keep in mind, if you compare the number of iterations between PAM and Lloyd k-medoids, that PAM performs at most a single swap per iteration, whereas Lloyd k-medoids performs a swap for each current medoid; the total number of swaps for Lloyd is therefore up to k * iterations.

If you want to learn more about which objective metrics you can use to judge your clustering outcomes, check out the corresponding section in my article on k-means.
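
As one example of such a metric, the average silhouette width can be computed directly from the cluster assignments and a distance matrix. A minimal sketch, assuming the cluster package (which is not used anywhere else in this post) is installed:

library(cluster)

# Average silhouette width of the PAM result: values close to 1 indicate
# compact, well-separated clusters, values near 0 indicate overlap
sil <- silhouette(kmedoids_pam_clusters$cluster_id,
                  dist(df, method = "manhattan"))
mean(sil[, "sil_width"])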

Runtime

Finally, let’s compare the runtimes of the different algorithms, and let’s also check how much faster the compiled implementation that ships with R is.

library(microbenchmark)
bench <- microbenchmark(
  base_kmeans = kmeans(df, centers = 3, iter.max = 10, algorithm = "Lloyd"),
  my_kmeans = my_kmeans(df, k = 3, n_iter = 10, init = "random"),
  my_kmeanspp = my_kmeans(df, k = 3, n_iter = 10, init = "kmeans++"),
  kmedians = kmedians(df, k = 3, n_iter = 10),
  kmedoids_lloyd = kmedoids_lloyd(df, k = 3),
  kmedoids_pam = kmedoids_pam(df, k = 3),
  times = 5
)
## Warning: did not converge in 10 iterations
## [1] "Converged after 13 iterations."
## Warning: did not converge in 10 iterations
## [1] "Converged after 5 iterations."
## [1] "Converged after 0 iterations."
## [1] "Converged after 0 iterations."
## [1] "Converged after 9 iterations."
## [1] "Converged after 0 iterations."
## [1] "Converged after 0 iterations."
## [1] "Converged after 3 iterations."
## [1] "Converged after 9 iterations."
## [1] "Converged after 0 iterations."
plot(bench)

As expected, PAM is the slowest algorithm, followed by Lloyd-style k-medoids. Since the other lines are very close on this scale, let’s look at ratios instead:

bench_df <- summary(bench)
bench_df$ratio <- bench_df$median / bench_df[bench_df$expr == "base_kmeans", "median"]

ggplot(bench_df, aes(x = expr, y = ratio)) +
  geom_count(show.legend = FALSE) + 
  scale_y_log10() + 
  labs(title = "Runtime of algorithms normalized to base k-means",
       x = "Algorithm",
       y = "Ratio to base k-means")

Our vanilla k-means algorithm is 4000x slower than the base k-means! This demonstrates the drastic performance gain one can get by implementing an algorithm more efficiently and in a compiled, lower-level language such as C. But our goal here was not efficiency, but understanding.

Summary

Congratulations if you made it this far. With PAM, you now know a very sophisticated clustering method that can be applied robustly to many data sets. Due to its high computational cost, it might not be suitable for very large data sets; if that is the case for you, check out algorithms designed for that purpose, such as CLARA or CLARANS. This post also concludes my mini-series on k-means and related clustering algorithms. Stay tuned for what comes next!
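
If you want to use these methods in practice rather than reimplement them for learning purposes, the cluster package ships optimized versions. A minimal sketch, assuming the cluster package is installed and using the same two iris columns as above:

library(cluster)

# PAM as implemented in the cluster package
pam_fit <- pam(iris[, c("Sepal.Length", "Sepal.Width")], k = 3,
               metric = "manhattan")
pam_fit$medoids

# CLARA repeatedly runs PAM on subsamples and therefore scales to much
# larger data sets
clara_fit <- clara(iris[, c("Sepal.Length", "Sepal.Width")], k = 3,
                   metric = "manhattan", samples = 10)
clara_fit$medoids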

Photo by Alina Grubnyak on Unsplash