Unsupervised Learning
DSA 220 - Introduction to Data Science and Analytics
Learning Objectives
Understand the difference between unsupervised and supervised learning
Calculate the Euclidean distance between vectors
Understand and implement the K-means algorithm for clustering
Standardize data for clustering methods
Select the number of clusters for clustering methods
Understand challenges and limitations of clustering methods
Let’s load some R packages to start.
Code
# Load necessary packages
library(tidyverse)
library(ggthemes)
library(gt)
library(broom)
library(scales)
library(cluster)
library(factoextra)
Code
# Set ggplot theme for visualizations
theme_set(ggthemes::theme_few(base_size = 16))
Unsupervised Learning
In the field of data science, supervised learning methods use labeled datasets (with a known outcome \(y\)) to fit models such as multiple linear regression or logistic regression. Supervised learning methods can be used for prediction or to explain relationships between a set of \(p\) predictors or features (\(x_1, x_2, \dots, x_p\)) and a known outcome or response variable (\(y\)). Unsupervised learning methods use unlabeled data containing only feature measurements (\(x_1, ..., x_p\)) to discover hidden structures. That is, a common goal of supervised learning is prediction, whereas unsupervised learning methods often aim to discover patterns or subgroups in the data.
Some common applications of unsupervised learning include:
Market Segmentation💵️🛒: Grouping customers into distinct segments based on their purchasing habits, browsing history, or demographics, which can facilitate the creation of targeted marketing campaigns.
Genomics and Bioinformatics🧬🧪: Clustering gene expression data to find genes that function similarly, or grouping patients based on their genetic information to discover different subtypes of cancer. This can facilitate development of more targeted therapies for specific cancer subtypes.
Topic Modeling📰📚: An unsupervised method used in Natural Language Processing (NLP) to automatically group a large set of documents (like customer reviews or news articles) by latent or underlying topics or themes.
Anomaly Detection🚩🚩🚩: Identifying data points that stray far from patterns in the rest of the data. Anomaly or outlier detection can be used for detecting fraud (e.g., unusual credit card transactions) or system failures.
Recommender Systems📺🎵: Companies like Netflix, Amazon, and Spotify suggest products, movies, or songs. These recommendations can be generated by clustering users with similar preferences or finding content or products that are frequently viewed or purchased together.
Clustering
Clustering is a broad class of unsupervised learning methods that can discover subgroups or clusters in a data set. The resulting clusters consist of similar or homogeneous observations, and examining them can improve understanding of the data. K-means clustering is one of the most commonly used unsupervised learning methods.
K-Means Clustering
K-means clustering partitions the data into a pre-specified number of distinct, non-overlapping clusters1. Specifically, \(K\) clusters or subgroups are defined in the data that minimize the total within-cluster variation. Within-cluster variation can be defined in several ways, one of which is the sum of squared Euclidean distances between each observation and the “center” or centroid of the cluster.
Euclidean Distance
The Euclidean distance is the straight-line distance between two points or vectors. It is calculated as the square root of the sum of the squared differences between the corresponding components of the two vectors. For a general case with two \(p\)-dimensional vectors, \(\mathbf{a} = (a_1, a_2, ..., a_p)\) and \(\mathbf{b} = (b_1, b_2, ..., b_p)\), the Euclidean distance between \(\mathbf{a}\) and \(\mathbf{b}\) is:\[d(\mathbf{a}, \mathbf{b}) = ||\mathbf{a} - \mathbf{b}||_2 = \sqrt{\sum_{j=1}^{p} (a_j - b_j)^2}\].
For example, in the 2-dimensional case, the Euclidean distance between two vectors \(\mathbf{a} = (a_1, a_2)\) and \(\mathbf{b} = (b_1, b_2)\) is
\[d(\mathbf{a}, \mathbf{b}) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}\]
That is, if \(\mathbf{a} = (8, 3)\) and \(\mathbf{b} = (4, 0)\), then the Euclidean distance between \(\mathbf{a}\) and \(\mathbf{b}\) is
\[d(\mathbf{a}, \mathbf{b}) = \sqrt{(8 - 4)^2 + (3 - 0)^2} = \sqrt{4^2 + 3^2} = \sqrt{16+9} = \sqrt{25} = 5\]
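As a quick check, this calculation can be reproduced in R, either directly from the formula or with the built-in dist() function (a minimal sketch using the example vectors above):
Code
# Two example vectors
a <- c(8, 3)
b <- c(4, 0)

# Euclidean distance computed directly from the formula
sqrt(sum((a - b)^2))

# Equivalent calculation using dist()
dist(rbind(a, b))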
Calculate the Euclidean distance between the two vectors \(\mathbf{a} = (7, 3)\) and \(\mathbf{b} = (2, -9)\).
The K-Means Algorithm
For a prespecified number of clusters, \(K\), the K-Means algorithm is as follows:
Assign each observation to one of \(K\) clusters. This can be done randomly or using a more informative initialization approach for the cluster assignments.
Repeat the following until convergence:
Compute the centroid (mean vector) for each of the \(K\) clusters.
Reassign each observation to the cluster whose centroid is closest (in Euclidean distance).
Convergence is commonly met when the centroids do not change significantly between iterations or a maximum number of iterations is reached.
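To make these steps concrete, below is a minimal, illustrative sketch of the algorithm in base R. The function name simple_kmeans and the simulated data are made up for illustration, and empty clusters are not handled; in practice we will use the built-in kmeans() function, as shown later.
Code
# A minimal K-means sketch for illustration only (use stats::kmeans() in practice)
simple_kmeans <- function(X, K, max_iter = 100) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)

  # Step 1: randomly assign each observation to one of K clusters
  assignment <- sample(rep_len(1:K, n))

  for (iter in 1:max_iter) {
    # Compute the centroid (mean vector) of each cluster
    centroids <- t(sapply(1:K, function(k) colMeans(X[assignment == k, , drop = FALSE])))

    # Squared Euclidean distance from each observation to each centroid
    dists <- sapply(1:K, function(k) {
      rowSums((X - matrix(centroids[k, ], n, p, byrow = TRUE))^2)
    })

    # Reassign each observation to the cluster with the closest centroid
    new_assignment <- max.col(-dists)

    # Converged: assignments no longer change
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
  }

  list(cluster = assignment, centers = centroids)
}

# Example usage on simulated 2-dimensional data
set.seed(1994)
X <- matrix(rnorm(200), ncol = 2)
simple_kmeans(X, K = 3)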
Since variables can be on different scales, an essential step prior to implementing K-means clustering is standardization.
Standardization
When variables are measured on different scales (e.g., income in dollars vs. age in years), larger-scale variables can disproportionately affect the distance calculation and consequently the clusters produced by unsupervised methods. Standardization of variables, also called scaling, is important for many data science methods; for clustering, it helps ensure that variables contribute more equally to cluster formation. Typically, only numeric variables are standardized, not dummy variables for categorical variables. There are multiple types of standardization that can be used.
Z-Score Normalization
To implement Z-score normalization, scaled versions of the variables are used in the clustering algorithm:
\[z = \frac{x - \bar{x}}{s},\]
where \(\bar{x}\) is the sample mean and \(s\) the sample standard deviation. One of the most common methods for standardization, Z-score normalization centers each variable to have a mean of 0 and a standard deviation of 1. Z-score scaling is the standard approach when features have different units or distributions.
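As a small illustration of how this formula maps to R code (the example values here are arbitrary), the built-in scale() function performs Z-score normalization, which is also how we will standardize the penguin measurements later:
Code
# An arbitrary example variable
x <- c(7, 2, 5, 10)

# Z-score normalization computed directly from the formula
(x - mean(x)) / sd(x)

# Equivalent result using the built-in scale() function
as.numeric(scale(x))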
Treating \(\mathbf{a}\) and \(\mathbf{b}\) as two observations and standardizing each component (variable) across them, calculate the Z-score normalized transformations of \(\mathbf{a} = (7, 3)\) and \(\mathbf{b} = (2, -9)\), confirming that \(\mathbf{z}_{a} = (\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}) \approx (0.7071, 0.7071)\) and \(\mathbf{z}_{b} = (-\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}) \approx (-0.7071, -0.7071)\).
Min-Max Scaling
For min-max scaling, variables are standardized based on their minimum and maximum values:
\[x^* = \frac{x - \min(x)}{\max(x) - \min(x)}\]
Min-max scaling rescales variables to be bounded between 0 and 1, but it can yield unreliable results if extreme outliers are present.
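A brief sketch of min-max scaling in R (again with arbitrary example values), both from the formula and using rescale() from the scales package loaded earlier:
Code
# An arbitrary example variable
x <- c(7, 2, 5, 10)

# Min-max scaling computed directly from the formula
(x - min(x)) / (max(x) - min(x))

# Equivalent result using scales::rescale(), which defaults to the [0, 1] range
scales::rescale(x)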
Again treating \(\mathbf{a}\) and \(\mathbf{b}\) as two observations and scaling each component across them, calculate the min-max scaled transformations of \(\mathbf{a} = (7, 3)\) and \(\mathbf{b} = (2, -9)\), confirming that \(\mathbf{a}^* = (1, 1)\) and \(\mathbf{b}^* = (0, 0)\).
Palmer Penguins Data
To explore how we can use clustering methods, let’s consider the Palmer Penguins data set, which contains data collected on 344 penguins from 3 islands in the Palmer Archipelago, Antarctica2.
Code
# Importing penguins data from GitHub
penguins <- read_csv("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Data/penguins.csv")
# Creating vector of colorblind-friendly colors
penguin_colors <- c("Chinstrap" = "#CC79A7", "Gentoo" = "#009E73", "Adelie" = "#E69F00")
colors_vector <- c("#E69F00", "#CC79A7", "#009E73")
Below is a data dictionary for the Palmer Penguins data, along with a preview of some randomly selected rows.
| Variable | Description |
|---|---|
| species | Species of the penguin |
| island | Island the penguin was found on |
| bill_length_mm | Bill length (mm) |
| bill_depth_mm | Bill depth (mm) |
| flipper_length_mm | Flipper length (mm) |
| body_mass_g | Body mass (g) |
| sex | Sex of the penguin |
| year | Year data was collected |
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
| Adelie | Biscoe | 39.7 | 18.9 | 184 | 3550 | male | 2009 |
| Chinstrap | Dream | 45.4 | 18.7 | 188 | 3525 | female | 2007 |
| Gentoo | Biscoe | 46.2 | 14.1 | 217 | 4375 | female | 2009 |
| Gentoo | Biscoe | 50.0 | 16.3 | 230 | 5700 | male | 2007 |
First, let’s visualize the bill lengths and bill depths of the penguins, coloring the points by their actual species.
Code
# Visualize the penguins data
penguins |>
ggplot(aes(x = bill_length_mm, y = bill_depth_mm,
color = species, shape = species)) +
scale_color_manual(values = penguin_colors) +
geom_point(size = 1) +
labs(title = "Penguin bill dimensions",
subtitle = "Colored by species",
x = "Bill length (mm)",
y = "Bill depth (mm)",
color = "Species",
shape = "Species",
caption = "palmerpenguins") +
theme_few(base_size = 10) +
theme(legend.position = "bottom")
Without coloring the points or altering their shapes by species, can we uncover these subgroups in the data?
Code
# Visualize the penguins data
penguins |>
ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
scale_color_manual(values = penguin_colors) +
geom_point(size = 1) +
labs(title = "Penguin bill dimensions",
subtitle = "Species not shown",
x = "Bill length (mm)",
y = "Bill depth (mm)",
caption = "palmerpenguins") +
theme_few(base_size = 10) +
theme(legend.position = "bottom")
We will ignore the species variable and see if we can discover the species groups via clustering, using only the bill lengths (bill_length_mm) and bill depths (bill_depth_mm).
Clustering the Palmer Penguins
First, to cluster observations in R we must remove missing values (NA’s) for any of the variables we are hoping to use for clustering.
Code
# Removing rows with missing values for bill_depth_mm or bill_length_mm
penguins_complete <- penguins |>
drop_na(bill_depth_mm, bill_length_mm)
Then, we scale the data using Z-score normalization to ensure all variables are on the same scale.
Code
# Implementing Z-score normalization for the numeric columns
penguins_scaled <- penguins_complete |>
mutate(across(where(is.numeric), ~ as.numeric(scale(.))))
Then, we can implement K-means clustering of the penguins using only their bill depths and bill lengths. For now, we will specify \(K=3\) clusters, although typically we would not know the true number of subgroups prior to clustering.
Code
# Setting seed for reproducible results
set.seed(1994)
# Subsetting to only bill_depth_mm and bill_length_mm
penguins_clustering_data <- penguins_scaled |>
select(bill_depth_mm, bill_length_mm)
# Apply K-means clustering with K = 3 clusters
kmeans_result <- penguins_clustering_data |>
kmeans(centers = 3)
Observe that K-means clustering discovers a grouping structure in the data that closely aligns with the species of the penguins, despite the algorithm not knowing their species.
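One way to see this is to re-create the earlier scatterplot, but coloring the points by their assigned cluster instead of species (a sketch using the objects created above):
Code
# Visualize the K-means cluster assignments on the scaled bill measurements
penguins_clustering_data |>
  mutate(cluster = as.factor(kmeans_result$cluster)) |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm,
             color = cluster, shape = cluster)) +
  geom_point(size = 1) +
  scale_color_manual(values = colors_vector) +
  labs(title = "Penguin bill dimensions",
       subtitle = "Colored by K-means cluster",
       x = "Bill length (standardized)",
       y = "Bill depth (standardized)",
       color = "Cluster",
       shape = "Cluster") +
  theme_few(base_size = 10) +
  theme(legend.position = "bottom")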
How well did K-means do exactly? Since we actually know the penguin species here, one way to assess the results is with a confusion matrix, a tool commonly used for evaluating supervised machine learning methods.
In general, for unsupervised methods one does not know the “true” clustering or grouping structure in the data, so it can be difficult to assess the reliability of the results. However, let’s consider how well the clustering of the points aligns with the known species of the penguins.
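Since the species labels happen to be available here, a quick sketch of such a cross-tabulation using the objects created above:
Code
# Cross-tabulate the known species against the K-means cluster assignments
table(Species = penguins_complete$species,
      Cluster = kmeans_result$cluster)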
What percent of Adelie penguins were “correctly” grouped together? What about for Gentoo and Chinstrap penguins?
Why did K-means appear to have a more “difficult” time clustering the Chinstrap penguins compared to the other species?
How could we have likely “improved” the clustering results to more closely align with the actual penguin species?
A common analysis step after clustering is to examine differences across the resulting clusters. Since we only used numeric variables for clustering here, we can use side-by-side boxplots to visualize any differences. When using categorical variables for clustering, one can create bar charts to visualize differences as well.
Code
# Add the cluster assignments to original data
penguins_clustered <- penguins_complete |>
select(bill_depth_mm, bill_length_mm) |>
mutate(cluster = as.factor(kmeans_result$cluster))
# Visualizing distributions of clustering variables across clusters
# Creating side-by-side box plots
penguins_clustered |>
pivot_longer(cols = c(bill_depth_mm, bill_length_mm),
names_to = "variable", values_to = "value") |>
ggplot(aes(x = cluster, y = value, fill = cluster)) +
geom_boxplot() +
facet_wrap(. ~ variable, ncol = 2, scales = "free_y") +
scale_fill_manual(values = colors_vector) +
labs(title = "Variable distributions by K-means cluster",
x = "Cluster") +
theme(legend.position = "none")
Another common challenge with K-means clustering is how to select the number of clusters for the algorithm to use. Here we knew there were three penguin species in the data, but for most real-world data sets one should select the number of clusters with a data-driven approach.
Selecting the Number of Clusters
For clustering methods, specifying too few clusters can oversimplify the data and not effectively uncover important subgroups, while too many clusters can overfit the data and yield uninterpretable results. There are a few common strategies for selecting the number of clusters.
Silhouette Score
The silhouette score3 measures how similar each point is to its own cluster compared to other clusters. Silhouette scores range from -1 to 1, with higher values indicating better fit.
Silhouette scores can be calculated for different numbers of clusters, and the number of clusters with the highest average silhouette score across all points is preferred. In general, the average silhouette score is useful for selecting the number of clusters \(K\) that yields the most compact and well-separated clusters.
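For reference, silhouette widths for a single clustering solution can be computed directly with the silhouette() function from the cluster package; a brief sketch using the \(K = 3\) result from above:
Code
# Silhouette widths for the K = 3 solution, using Euclidean distances
sil <- silhouette(kmeans_result$cluster, dist(penguins_clustering_data))

# Average silhouette width across all penguins
mean(sil[, "sil_width"])
The code below instead computes the average silhouette width across a range of \(K\) values to help choose the number of clusters.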
Code
# fviz_nbclust fits kmeans for a range of K values (up to k.max)
# and computes the average silhouette width for each
silhouette_results <- penguins_clustering_data |>
fviz_nbclust(FUNcluster = kmeans, method = "silhouette",
k.max = 10) +
labs(title = "Silhouette Method for Optimal K")
silhouette_results
How many clusters is considered optimal based on the average silhouette score for the Palmer penguins data?
Gap Statistic
The gap statistic4 is another method for selecting the number of clusters in clustering algorithms such as K-means. The gap statistic provides a quantitative measure by comparing the within-cluster variability of the observed data for a specified \(K\) to that of a reference null distribution with no true clustering structure. This reference null distribution is simulated using Monte Carlo methods to reflect the values in the original observed data set, but with no subgroups present. The closer the gap statistic is to 0 for a specified \(K\), the more this indicates that the formed subgroups are capturing noise rather than true structure in the data.
To select the number of clusters \(K\) using the gap statistic, clustering is implemented for a range of values for \(K\), and the value of \(K\) that yields the greatest gap statistic is preferred.
An advantage of the gap statistic over other common methods like the average silhouette score is that the gap statistic can suggest there are no clear subgroups in the data (i.e., \(K=1\) is optimal). A limitation of the gap statistic is that uninformative variables can lead to a poor selection of the number of clusters since equal weight is given to all variables.
Code
# Setting seed value for reproducible results
set.seed(1994)
# Calculating Gap statistics for up to K.max clusters
gap_statistic_results <- penguins_clustering_data |>
clusGap(FUNcluster = kmeans, nstart = 25, K.max = 10, B = 1000)
# Plot the results
fviz_gap_stat(gap_statistic_results,
maxSE = list(method = "globalmax")) +
labs(title = "Gap Statistic Method for Optimal K")
How many clusters is considered optimal based on the gap statistic for the Palmer penguins data?
Practical Issues in Clustering
Some common challenges associated with clustering methods are that they are highly sensitive to:
Whether or not variables are properly standardized.
The choice of dissimilarity measure (e.g., Euclidean distance vs. correlation-based distance).
The selection of the number of clusters, \(K\).
Moreover, it is often difficult to validate the results of clustering methods, so results should be viewed as a starting point for further exploration.
Footnotes
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. Le Cam & J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 281–297). University of California Press.↩︎
Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. doi:10.5281/zenodo.3960218, R package version 0.1.0, https://allisonhorst.github.io/palmerpenguins/.↩︎
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. https://doi.org/10.1016/0377-0427(87)90125-7↩︎
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411–423. https://doi.org/10.1111/1467-9868.00293↩︎