R Lab: Unsupervised Learning

DSA 220 - Introduction to Data Science and Analytics

Author

Andrew DiLernia

Once completed, submit a zipped file (.zip) containing two documents via Blackboard: a .qmd Quarto file and the corresponding rendered HTML document.

In this assignment we will be analyzing data from the U.S. Census Bureau's American Community Survey (ACS) at the state level for 2023. Note that the estimates are from the 1-year ACS.

Code
library(tidyverse)
library(sf)
library(tigris)

# Download the state boundary data (cb = TRUE uses cartographic boundaries)
# This resolution is simplified and good for a national map.
# This file includes all 50 states plus Puerto Rico and other territories.
us_states <- states(cb = TRUE, resolution = "20m",
                    progress_bar = FALSE)

# Use the built-in tigris function to move Alaska, Hawaii, 
# and Puerto Rico to the bottom-left of the map.
us_states_shifted <- shift_geometry(us_states)

# Now, plot the shifted data using ggplot2
us_map <- ggplot(data = us_states_shifted) +
    geom_sf(fill = "dodgerblue", alpha = 0.4,
            color = "black", linewidth = 0.3) +
    theme_void()

For these data, we will focus on the variables in the data dictionary below.

Table 1: Data dictionary for census data set.
Variable Description
county_state Name of state
population Population of the state
median_income Median household income in US dollars
median_monthly_rent_cost Median monthly rent cost in US dollars
prop_poverty Proportion of people 25 and older living in poverty, defined by the Census Bureau as having an income below the poverty threshold for their family size
prop_highschool Proportion of people 25 and older whose highest education-level is high school
prop_white Proportion of people who are white alone
prop_black Proportion of people who are black alone

We consider clustering the 50 states along with Washington, DC to determine whether there are any clear underlying groupings of the United States. Specifically, we will use all of the variables in the data dictionary above except the state name for clustering, yielding the clustering of states depicted below.

We will exclude data for Puerto Rico since extreme outliers can substantially affect the performance of K-means clustering, and Puerto Rico differs considerably from the 50 states and DC on many metrics.

1.

First, add a new code chunk containing the code below to load R packages for this lab. Note that you may need to install some of these packages (e.g., sf, tigris, and factoextra).

Code
# Load necessary packages
library(tidyverse)
library(ggthemes)
library(gt)
library(sf)
library(tigris)
library(cluster)
library(factoextra)

# Setting default ggplot2 theme
theme_set(ggthemes::theme_few())

Then, import the ACS estimates for 2023 using the code below.

Code
# Vector of variables for clustering
cluster_variables <- c("population", "median_income",
                       "median_monthly_rent_cost", 
                       "prop_poverty", "prop_highschool",
                       "prop_white", "prop_black")

# Importing census data from course GitHub page
census_data <- read_csv("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Data/census_data_state_2008-2023.csv") |>  
  dplyr::filter(year == 2023, county_state != "Puerto Rico") |> 
  dplyr::select(county_state, all_of(cluster_variables))

2.

Before implementing K-means clustering, scale the data using Z-score normalization to ensure all variables are on the same scale, making an object called census_scaled.
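One possible approach (a sketch, assuming `census_data` and `cluster_variables` from the previous step): drop the state name column and apply `scale()`, which performs Z-score normalization column by column.

```r
# Z-score normalize the clustering variables (sketch)
census_scaled <- census_data |>
  dplyr::select(dplyr::all_of(cluster_variables)) |>
  scale()
```

The result is a numeric matrix in which each variable has mean 0 and standard deviation 1, which is the form `kmeans()` expects.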

3.

Next, determine the optimal number of clusters, \(K\). For this part, only set a seed value of 1994 once, just before all code for selecting the number of clusters.

  1. Using modified code from our in-class activity, calculate the average silhouette score for \(K=2\) through \(K = 5\) (the average silhouette score is undefined for \(K = 1\)), creating a corresponding visualization for the scores.

  2. What is the optimal \(K\) based on the average silhouette scores?

  3. Using modified code from our in-class activity, calculate the gap statistic for \(K=1\) through \(K = 5\), creating a corresponding visualization for the statistics. Specify 25 random starts (nstart = 25), and simulate 1000 null data sets (B = 1000).

  4. What is the optimal \(K\) based on the gap statistic?
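The steps above can be sketched as follows, assuming `census_scaled` from the previous step and the packages loaded earlier; your in-class activity code may differ in the details, and the exact plots are not guaranteed to match.

```r
# Set the seed once, before all cluster-selection code
set.seed(1994)

# Average silhouette score for K = 2 through 5 (undefined for K = 1)
dist_mat <- dist(census_scaled)
sil_scores <- purrr::map_dbl(2:5, \(k) {
  km <- kmeans(census_scaled, centers = k, nstart = 25)
  mean(cluster::silhouette(km$cluster, dist_mat)[, "sil_width"])
})

# Visualize the average silhouette scores
tibble(K = 2:5, avg_silhouette = sil_scores) |>
  ggplot(aes(x = K, y = avg_silhouette)) +
  geom_line() +
  geom_point()

# Gap statistic for K = 1 through 5 with 25 random starts
# and 1000 simulated null data sets (this can take a while)
gap_result <- cluster::clusGap(census_scaled, FUNcluster = kmeans,
                               K.max = 5, B = 1000, nstart = 25)
factoextra::fviz_gap_stat(gap_result)
```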

4.

Implement the K-means clustering algorithm using the optimal \(K\) value based on the gap statistic. Set a seed value of 1994 immediately before your code, and use 25 random starts (nstart = 25).
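As a sketch, fitting the final model might look like the code below; the value of `centers` assumes the gap statistic selected \(K = 3\), so substitute your own optimal value.

```r
# Fit K-means with the optimal K chosen via the gap statistic
set.seed(1994)
kmeans_result <- kmeans(census_scaled, centers = 3, nstart = 25)
```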

5.

Next, let’s examine differences in the resulting clusters.

  1. Create a data frame called census_data_clustered that contains the raw unscaled data and a new column called cluster that contains the cluster assignment for each state using the code below.
Code
# Add the cluster assignments to original data
census_data_clustered <- census_data |> 
  select(county_state, all_of(cluster_variables)) |>  
  mutate(cluster = as.factor(kmeans_result$cluster))
  2. Reproduce the frequency table below showing how many states were grouped into each cluster.

cluster  n
1        19
2        15
3        17
  3. Reproduce the side-by-side boxplots below to visualize differences in the clustering variables across the resulting clusters using the raw (unscaled) variables.
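One way to produce both summaries is sketched below, assuming `census_data_clustered` from above; the facet layout and styling of the boxplots may differ from the versions shown in the lab.

```r
# Frequency table of states per cluster
census_data_clustered |>
  dplyr::count(cluster)

# Side-by-side boxplots of each raw (unscaled) variable by cluster
census_data_clustered |>
  tidyr::pivot_longer(cols = dplyr::all_of(cluster_variables),
                      names_to = "variable", values_to = "value") |>
  ggplot(aes(x = cluster, y = value, fill = cluster)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y")
```

Using `scales = "free_y"` lets each variable's panel use its own axis range, which matters here because population and the proportion variables differ by several orders of magnitude.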