Unsupervised Learning
DSA 220 - Introduction to Data Science and Analytics
Learning Objectives
Understand the difference between unsupervised and supervised learning
Calculate the Euclidean distance between vectors
Understand and implement the K-means algorithm for clustering
Standardize data for clustering methods
Select the number of clusters for clustering methods
Understand challenges and limitations of clustering methods
Let’s load some R packages to start.
Code
# Load necessary packages
library(tidyverse)
library(ggthemes)
library(gt)
library(broom)
library(scales)
library(cluster)
library(factoextra)
Code
# Set ggplot theme for visualizations
theme_set(ggthemes::theme_few(base_size = 16))
Unsupervised Learning
In the field of data science, supervised learning methods use labeled datasets (with a known outcome \(y\)) to fit models such as multiple linear regression or logistic regression. Supervised learning methods can be used for prediction or to explain relationships between a set of \(p\) predictors or features (\(x_1, x_2, \dots, x_p\)) and a known outcome or response variable (\(y\)). Unsupervised learning methods use unlabeled data containing only feature measurements (\(x_1, ..., x_p\)) to discover hidden structures. That is, a common goal of supervised learning is prediction, whereas unsupervised learning methods often aim to discover patterns or subgroups in the data.
Some common applications of unsupervised learning include:
Market Segmentation💵️🛒: Grouping customers into distinct segments based on their purchasing habits, browsing history, or demographics, which can facilitate the creation of targeted marketing campaigns.
Genomics and Bioinformatics🧬🧪: Clustering gene expression data to find genes that function similarly, or grouping patients based on their genetic information to discover different subtypes of cancer. This can facilitate development of more targeted therapies for specific cancer subtypes.
Topic Modeling📰📚: An unsupervised method used in Natural Language Processing (NLP) to automatically group a large set of documents (like customer reviews or news articles) by latent or underlying topics or themes.
Anomaly Detection🚩🚩🚩: Identifying data points that stray far from patterns in the rest of the data. Anomaly or outlier detection can be used for detecting fraud (e.g., unusual credit card transactions) or system failures.
Recommender Systems📺🎵: Companies like Netflix, Amazon, and Spotify suggest products, movies, or songs. These recommendations can be generated by clustering users with similar preferences or finding content or products that are frequently viewed or purchased together.
Clustering
Clustering is a broad class of unsupervised learning methods that can discover subgroups or clusters in a data set. The resulting clusters consist of similar or homogeneous observations, and examining them can improve understanding of the data. K-means clustering is one of the most commonly used unsupervised learning methods.
K-Means Clustering
K-means clustering partitions the data into a pre-specified number of distinct, non-overlapping clusters1. Specifically, \(K\) clusters or subgroups are defined in the data that minimize the total within-cluster variation. Within-cluster variation can be defined in several ways, one of which is the sum of squared Euclidean distances between each observation and the “center” or centroid of the cluster.
Euclidean Distance
The Euclidean distance is the straight-line distance between two points or vectors. It is calculated as the square root of the sum of the squared differences between the corresponding components of the two vectors. For a general case with two \(p\)-dimensional vectors, \(\mathbf{a} = (a_1, a_2, ..., a_p)\) and \(\mathbf{b} = (b_1, b_2, ..., b_p)\), the Euclidean distance between \(\mathbf{a}\) and \(\mathbf{b}\) is:\[d(\mathbf{a}, \mathbf{b}) = ||\mathbf{a} - \mathbf{b}||_2 = \sqrt{\sum_{j=1}^{p} (a_j - b_j)^2}\].
For example, in the 2-dimensional case, the Euclidean distance between two vectors \(\mathbf{a} = (a_1, a_2)\) and \(\mathbf{b} = (b_1, b_2)\) is
\[d(\mathbf{a}, \mathbf{b}) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}\]
That is, if \(\mathbf{a} = (8, 3)\) and \(\mathbf{b} = (4, 0)\), then the Euclidean distance between \(\mathbf{a}\) and \(\mathbf{b}\) is
\[d(\mathbf{a}, \mathbf{b}) = \sqrt{(8 - 4)^2 + (3 - 0)^2} = \sqrt{4^2 + 3^2} = \sqrt{16+9} = \sqrt{25} = 5\]
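As a quick check, this calculation can be reproduced in R, either directly from the formula or with the built-in dist() function (a minimal sketch using the example vectors above):
Code
# Two example vectors
a <- c(8, 3)
b <- c(4, 0)

# Euclidean distance computed directly from the formula
sqrt(sum((a - b)^2))

# Equivalent calculation using dist()
dist(rbind(a, b))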
Calculate the Euclidean distance between the two vectors \(\mathbf{a} = (7, 3)\) and \(\mathbf{b} = (2, -9)\).
The K-Means Algorithm
For a prespecified number of clusters, \(K\), the K-Means algorithm is as follows:
Assign each observation to one of \(K\) clusters. This can be done randomly or using a more informative initialization approach for the cluster assignments.
Repeat the following until convergence:
Compute the centroid (mean vector) for each of the \(K\) clusters.
Reassign each observation to the cluster whose centroid is closest (in Euclidean distance).
Convergence is commonly met when the centroids do not change significantly between iterations or a maximum number of iterations is reached.
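To make these steps concrete, below is a minimal, illustrative sketch of the algorithm in base R. The function name simple_kmeans and the simulated data are made up for illustration, and empty clusters are not handled; in practice we will use the built-in kmeans() function, as shown later.
Code
# A minimal K-means sketch for illustration only (use stats::kmeans() in practice)
simple_kmeans <- function(X, K, max_iter = 100) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)

  # Step 1: randomly assign each observation to one of K clusters
  assignment <- sample(rep_len(1:K, n))

  for (iter in 1:max_iter) {
    # Compute the centroid (mean vector) of each cluster
    centroids <- t(sapply(1:K, function(k) colMeans(X[assignment == k, , drop = FALSE])))

    # Squared Euclidean distance from each observation to each centroid
    dists <- sapply(1:K, function(k) {
      rowSums((X - matrix(centroids[k, ], n, p, byrow = TRUE))^2)
    })

    # Reassign each observation to the cluster with the closest centroid
    new_assignment <- max.col(-dists)

    # Converged: assignments no longer change
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
  }

  list(cluster = assignment, centers = centroids)
}

# Example usage on simulated 2-dimensional data
set.seed(1994)
X <- matrix(rnorm(200), ncol = 2)
simple_kmeans(X, K = 3)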
Since variables can be on different scales, an essential step prior to implementing K-means clustering is standardization.
Standardization
When variables are measured on different scales (e.g., income in dollars vs. age in years), larger-scale variables can disproportionately affect the distance calculation and consequently the clusters produced by unsupervised methods. Standardization of variables, also called scaling, is important for many data science methods; for clustering, it helps ensure that variables contribute more equally to cluster formation. Typically, only numeric variables are standardized, not dummy variables for categorical variables. There are multiple types of standardization that can be used.
Z-Score Normalization
To implement Z-score normalization, scaled versions of the variables are used in the clustering algorithm:
\[z = \frac{x - \bar{x}}{s},\]
where \(\bar{x}\) is the sample mean and \(s\) the sample standard deviation. One of the most common methods for standardization, Z-score normalization centers each variable to have a mean of 0 and a standard deviation of 1. Z-score scaling is the standard approach when features have different units or distributions.
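As a small illustration of how this formula maps to R code (the example values here are arbitrary), the built-in scale() function performs Z-score normalization, which is also how we will standardize the penguin measurements later:
Code
# An arbitrary example variable
x <- c(7, 2, 5, 10)

# Z-score normalization computed directly from the formula
(x - mean(x)) / sd(x)

# Equivalent result using the built-in scale() function
as.numeric(scale(x))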
Treating \(\mathbf{a}\) and \(\mathbf{b}\) as two observations and standardizing each component (variable) across them, calculate the Z-score normalized transformations of \(\mathbf{a} = (7, 3)\) and \(\mathbf{b} = (2, -9)\), confirming that \(\mathbf{z}_{a} = (\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}) \approx (0.7071, 0.7071)\) and \(\mathbf{z}_{b} = (-\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}) \approx (-0.7071, -0.7071)\).
Min-Max Scaling
For min-max scaling, variables are standardized based on their minimum and maximum values:
\[x^* = \frac{x - \min(x)}{\max(x) - \min(x)}\]
Min-max scaling rescales variables to be bounded between 0 and 1, but it can yield unreliable results if extreme outliers are present.
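A brief sketch of min-max scaling in R (again with arbitrary example values), both from the formula and using rescale() from the scales package loaded earlier:
Code
# An arbitrary example variable
x <- c(7, 2, 5, 10)

# Min-max scaling computed directly from the formula
(x - min(x)) / (max(x) - min(x))

# Equivalent result using scales::rescale(), which defaults to the [0, 1] range
scales::rescale(x)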
Again treating \(\mathbf{a}\) and \(\mathbf{b}\) as two observations and scaling each component across them, calculate the min-max scaled transformations of \(\mathbf{a} = (7, 3)\) and \(\mathbf{b} = (2, -9)\), confirming that \(\mathbf{a}^* = (1, 1)\) and \(\mathbf{b}^* = (0, 0)\).
Palmer Penguins Data
To explore how we can use clustering methods, let’s consider the Palmer Penguins data set, which contains data collected on 344 penguins from 3 islands in the Palmer Archipelago, Antarctica2.
Code
# Importing penguins data from GitHub
penguins <- read_csv("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Data/penguins.csv")
# Creating vector of colorblind-friendly colors
penguin_colors <- c("Chinstrap" = "#CC79A7", "Gentoo" = "#009E73", "Adelie" = "#E69F00")
colors_vector <- c("#E69F00", "#CC79A7", "#009E73")
Below is a data dictionary for the Palmer Penguins data, along with a preview of some randomly selected rows.
| Variable | Description |
|---|---|
| species | Species of the penguin |
| island | Island the penguin was found on |
| bill_length_mm | Bill length (mm) |
| bill_depth_mm | Bill depth (mm) |
| flipper_length_mm | Flipper length (mm) |
| body_mass_g | Body mass (g) |
| sex | Sex of the penguin |
| year | Year data was collected |
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
| Adelie | Biscoe | 39.7 | 18.9 | 184 | 3550 | male | 2009 |
| Chinstrap | Dream | 45.4 | 18.7 | 188 | 3525 | female | 2007 |
| Gentoo | Biscoe | 46.2 | 14.1 | 217 | 4375 | female | 2009 |
| Gentoo | Biscoe | 50.0 | 16.3 | 230 | 5700 | male | 2007 |
First, let’s visualize the bill lengths and bill depths of the penguins, coloring the points by their actual species.
Code
# Visualize the penguins data
penguins |>
ggplot(aes(x = bill_length_mm, y = bill_depth_mm,
color = species, shape = species)) +
scale_color_manual(values = penguin_colors) +
geom_point(size = 1) +
labs(title = "Penguin bill dimensions",
subtitle = "Colored by species",
x = "Bill length (mm)",
y = "Bill depth (mm)",
color = "Species",
shape = "Species",
caption = "palmerpenguins") +
theme_few(base_size = 10) +
theme(legend.position = "bottom")
Without coloring the points or altering their shapes by species, can we uncover these subgroups in the data?
Code
# Visualize the penguins data
penguins |>
ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
scale_color_manual(values = penguin_colors) +
geom_point(size = 1) +
labs(title = "Penguin bill dimensions",
subtitle = "Species not shown",
x = "Bill length (mm)",
y = "Bill depth (mm)",
caption = "palmerpenguins") +
theme_few(base_size = 10) +
theme(legend.position = "bottom")
We will ignore the species variable and see if we can discover the species groups via clustering, using only the bill lengths (bill_length_mm) and bill depths (bill_depth_mm).
Clustering the Palmer Penguins
First, to cluster observations in R we must remove missing values (NA’s) for any of the variables we are hoping to use for clustering.
Code
# Removing rows with missing values for bill_depth_mm or bill_length_mm
penguins_complete <- penguins |>
drop_na(bill_depth_mm, bill_length_mm)
Then, we scale the data using Z-score normalization to ensure all variables are on the same scale.
Code
# Implementing Z-score normalization for the numeric columns
penguins_scaled <- penguins_complete |>
mutate(across(where(is.numeric), ~ as.numeric(scale(.))))
Then, we can implement K-means clustering of the penguins using only their bill depths and bill lengths. For now, we will specify \(K=3\) clusters, although typically we would not know the true number of subgroups prior to clustering.
Code
# Setting seed for reproducible results
set.seed(1994)
# Subsetting to only bill_depth_mm and bill_length_mm
penguins_clustering_data <- penguins_scaled |>
select(bill_depth_mm, bill_length_mm)
# Apply K-means clustering with K = 3 clusters
kmeans_result <- penguins_clustering_data |>
kmeans(centers = 3)
Observe that K-means clustering discovers a grouping structure in the data that closely aligns with the species of the penguins, despite the algorithm not knowing their species.
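One way to see this is to re-create the earlier scatterplot, but coloring the points by their assigned cluster instead of species (a sketch using the objects created above):
Code
# Visualize the K-means cluster assignments on the scaled bill measurements
penguins_clustering_data |>
  mutate(cluster = as.factor(kmeans_result$cluster)) |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm,
             color = cluster, shape = cluster)) +
  geom_point(size = 1) +
  scale_color_manual(values = colors_vector) +
  labs(title = "Penguin bill dimensions",
       subtitle = "Colored by K-means cluster",
       x = "Bill length (standardized)",
       y = "Bill depth (standardized)",
       color = "Cluster",
       shape = "Cluster") +
  theme_few(base_size = 10) +
  theme(legend.position = "bottom")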
How well did K-means do exactly? Since we actually know the penguin species here, one way to assess the results is with a confusion matrix, a tool commonly used for evaluating supervised machine learning methods.
In general, for unsupervised methods one does not know the “true” clustering or grouping structure in the data, so it can be difficult to assess the reliability of the results. However, let’s consider how well the clustering of the points aligns with the known species of the penguins.
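Since the species labels happen to be available here, a quick sketch of such a cross-tabulation using the objects created above:
Code
# Cross-tabulate the known species against the K-means cluster assignments
table(Species = penguins_complete$species,
      Cluster = kmeans_result$cluster)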
What percent of Adelie penguins were “correctly” grouped together? What about for Gentoo and Chinstrap penguins?
Why did K-means appear to have a more “difficult” time clustering the Chinstrap penguins compared to the other species?
How could we have likely “improved” the clustering results to more closely align with the actual penguin species?
A common analysis step after clustering is to examine differences across the resulting clusters. Since we only used numeric variables for clustering here, we can use side-by-side boxplots to visualize any differences. When using categorical variables for clustering, one can create bar charts to visualize differences as well.
Code
# Add the cluster assignments to original data
penguins_clustered <- penguins_complete |>
select(bill_depth_mm, bill_length_mm) |>
mutate(cluster = as.factor(kmeans_result$cluster))
# Visualizing distributions of clustering variables across clusters
# Creating side-by-side box plots
penguins_clustered |>
pivot_longer(cols = c(bill_depth_mm, bill_length_mm),
names_to = "variable", values_to = "value") |>
ggplot(aes(x = cluster, y = value, fill = cluster)) +
geom_boxplot() +
facet_wrap(. ~ variable, ncol = 2, scales = "free_y") +
scale_fill_manual(values = colors_vector) +
labs(title = "Variable distributions by K-means cluster",
x = "Cluster") +
theme(legend.position = "none")
Another common challenge with K-means clustering is how to select the number of clusters for the algorithm to use. Here we knew there were three penguin species in the data, but for most real-world data sets one should select the number of clusters with a data-driven approach.
Selecting the Number of Clusters
For clustering methods, specifying too few clusters can oversimplify the data and not effectively uncover important subgroups, while too many clusters can overfit the data and yield uninterpretable results. There are a few common strategies for selecting the number of clusters.
Silhouette Score
The silhouette score3 measures how similar each point is to its own cluster compared to other clusters. Silhouette scores range from -1 to 1, with higher values indicating better fit.
Silhouette scores can be calculated for different numbers of clusters, and the number of clusters with the highest average silhouette score across all points is preferred. In general, the average silhouette score is useful for selecting the number of clusters \(K\) that yields the most compact and well-separated clusters.
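For reference, silhouette widths for a single clustering solution can be computed directly with the silhouette() function from the cluster package; a brief sketch using the \(K = 3\) result from above:
Code
# Silhouette widths for the K = 3 solution, using Euclidean distances
sil <- silhouette(kmeans_result$cluster, dist(penguins_clustering_data))

# Average silhouette width across all penguins
mean(sil[, "sil_width"])
The code below instead computes the average silhouette width across a range of \(K\) values to help choose the number of clusters.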
Code
# fviz_nbclust fits kmeans for a range of K values (up to k.max)
# and computes the average silhouette width for each
silhouette_results <- penguins_clustering_data |>
fviz_nbclust(FUNcluster = kmeans, method = "silhouette",
k.max = 10) +
labs(title = "Silhouette Method for Optimal K")
silhouette_results
How many clusters is considered optimal based on the average silhouette score for the Palmer penguins data?
Gap Statistic
The gap statistic4 is another method for selecting the number of clusters in clustering algorithms such as K-means. The gap statistic provides a quantitative measure by comparing the within-cluster variability of the observed data for a specified \(K\) to that of a reference null distribution with no true clustering structure. This reference null distribution is simulated using Monte Carlo methods to reflect the values in the original observed data set, but with no subgroups present. The closer the gap statistic is to 0 for a specified \(K\), the more this indicates that the formed subgroups are capturing noise rather than true structure in the data.
To select the number of clusters \(K\) using the gap statistic, clustering is implemented for a range of values for \(K\), and the value of \(K\) that yields the greatest gap statistic is preferred.
An advantage of the gap statistic over other common methods like the average silhouette score is that the gap statistic can suggest there are no clear subgroups in the data (i.e., \(K=1\) is optimal). A limitation of the gap statistic is that uninformative variables can lead to a poor selection of the number of clusters since equal weight is given to all variables.
Code
# Setting seed value for reproducible results
set.seed(1994)
# Calculating Gap statistics for up to K.max clusters
gap_statistic_results <- penguins_clustering_data |>
clusGap(FUNcluster = kmeans, nstart = 25, K.max = 10, B = 1000)
# Plot the results
fviz_gap_stat(gap_statistic_results,
maxSE = list(method = "globalmax")) +
labs(title = "Gap Statistic Method for Optimal K")
How many clusters is considered optimal based on the gap statistic for the Palmer penguins data?
Practical Issues in Clustering
Some common challenges associated with clustering methods are that they are highly sensitive to:
Whether or not variables are properly standardized.
The choice of dissimilarity measure (e.g., Euclidean distance vs. correlation-based distance).
The selection of the number of clusters, \(K\).
Moreover, it is often difficult to validate the results of clustering methods, so results should be viewed as a starting point for further exploration.
Footnotes
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. Le Cam & J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 281–297). University of California Press.↩︎
Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. doi:10.5281/zenodo.3960218, R package version 0.1.0, https://allisonhorst.github.io/palmerpenguins/.↩︎
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. https://doi.org/10.1016/0377-0427(87)90125-7↩︎
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411–423. https://doi.org/10.1111/1467-9868.00293↩︎