Categorical Predictors & Dummy Variables
DSA 220 - Introduction to Data Science and Analytics
Learning Objectives
Incorporate categorical predictors with dummy variables
Interpret the effects of categorical predictors
In this activity we will analyze the Palmer Penguins data set: a data set consisting of measurements collected on 344 penguins from 3 islands in Palmer Archipelago, Antarctica. Specifically, data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research Network. Researchers collected data by capturing and releasing breeding adult penguins across the islands at their nests.
Let’s load some R packages to start, load a custom function for plotting VIF values, and set a default theme for all visualizations made with ggplot2.
Code
# Loading R packages
library(tidyverse)
library(gt)
library(ggthemes)
library(skimr)
library(corrr)
library(ggfortify)
library(broom)
library(scales)
library(car)
library(GGally)
library(gtsummary)
# Load function for visualizing VIF & GVIF values
source("https://raw.githubusercontent.com/dilernia/STA323/main/Functions/vif_plot.R")
# Setting default ggplot2 theme
theme_set(ggthemes::theme_few())Then, we can load the data for the activity.
Code
# Importing penguins data from GitHub
penguins <- read_csv("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Data/penguins.csv") |>
dplyr::mutate(sex = fct_relevel(sex, "male", "female"))We will also create some vectors of colorblind-friendly colors to help us consistently depict subgroups in this activity.
Code
# Creating vector of colorblind-friendly colors
penguin_species_colors <- c("Chinstrap" = "#CC79A7", "Gentoo" = "#009E73", "Adelie" = "#E69F00")
penguin_sex_colors <- c("male" = "#D55E00", "female" = "#56B4E9")Below are a data dictionary for the Palmer Penguins data and a preview of some randomly selected rows.
| Variable | Description |
|---|---|
| species | Species of the penguin |
| island | Island the penguin was found on |
| bill_length_mm | Bill length (mm) |
| bill_depth_mm | Bill depth (mm) |
| flipper_length_mm | Flipper length (mm) |
| body_mass_g | Body mass (g) |
| sex | Sex of the penguin |
| year | Year data was collected |
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
| Adelie | Biscoe | 39.7 | 18.9 | 184 | 3550 | male | 2009 |
| Chinstrap | Dream | 45.4 | 18.7 | 188 | 3525 | female | 2007 |
| Gentoo | Biscoe | 46.2 | 14.1 | 217 | 4375 | female | 2009 |
| Gentoo | Biscoe | 50.0 | 16.3 | 230 | 5700 | male | 2007 |
Let’s start with considering the simple linear regression model with a quantitative predictor.
Code
# Creating a scatter plot of body mass by flipper length
penguins |>
ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
geom_smooth(aes(x = flipper_length_mm, y = body_mass_g), method = "lm", se = FALSE) +
geom_point(pch = 21, fill = "black",
color = "white") +
labs(y = "Body mass (g)",
x = "Flipper length (mm)",
title = "Palmer penguins data: body mass by flipper length")Code
# Fitting SLR model
slr_model <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
# Printing SLR model output
slr_model |>
tbl_regression(intercept = TRUE) |>
add_glance_source_note()| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | -5,781 | -6,382, -5,179 | <0.001 |
| flipper_length_mm | 50 | 47, 53 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.759; Adjusted R² = 0.758; Sigma = 394; Statistic = 1,071; p-value = <0.001; df = 1; Log-likelihood = -2,528; AIC = 5,063; BIC = 5,074; Deviance = 52,854,796; Residual df = 340; No. Obs. = 342 | |||
The corresponding estimated simple linear regression model is:
\[\hat{y} = -5781 + 50 x_1,\]
where \(\hat{y}\) is the estimated penguin body mass (g), and \(x_1\) is a penguin’s flipper length (mm).
How do we interpret the estimated slope?
How do we interpret the estimated intercept?
What if we wanted to incorporate categorical variables in our model?
Code
# Creating scatter plot of body mass by flipper length coloring points by sex
penguins |>
drop_na(sex) |>
ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
geom_smooth(aes(x = flipper_length_mm, y = body_mass_g),
method = "lm", se = FALSE, color = "black") +
geom_point(aes(x = flipper_length_mm, y = body_mass_g, fill = sex), pch = 21, color = "white") +
labs(y = "Body mass (g)", x = "Flipper length (mm)", title = "Palmer penguins data: body mass by flipper length") +
scale_fill_manual(values = penguin_sex_colors) +
theme(legend.position = "bottom")Code
# Creating side-by-side boxplots for body mass by sex
penguins |>
drop_na(sex) |>
ggplot(aes(x = sex, y = body_mass_g, fill = sex)) +
geom_boxplot() +
labs(y = "Body mass (g)", x = "sex", title = "Palmer penguins data: body mass by sex") +
scale_fill_manual(values = penguin_sex_colors) +
theme(legend.position = "none")Does a penguin’s body mass (g) appear to vary based on its sex?
Do male or female penguins have larger average body masses?
Categorical Predictors
Linear regression requires a quantitative response variable, but the explanatory or predictor variable can be quantitative or categorical. Since solutions are obtained numerically, categorical predictors cannot be used directly, but must be converted to a numerical form for model fitting. The most common method for doing so is to use dummy variables.
- dummy variable: a dummy variable, also called an indicator variable, is a variable that takes on the value 0 or 1 to indicate the presence of an effect on an outcome.
For each categorical predictor \(x_k\), the number of needed dummy variables is \(\ell_k-1\) where \(\ell_k\) is the number of levels of \(x_k\). Typically, one category must be set as the baseline / reference category and does not have a dummy variable that is included in the model to avoid redundancy and perfectly collinear predictors. Sometimes, specifying the baseline category as a certain group is more preferable than others for interpretation purposes. Other times which group is set as the reference category is not important, so selecting any group works fine.
In data science and machine learning, when all categories retain a dummy variable in the model, this process of creating binary numerical representations of categorical variables is also called 🔥“one-hot encoding”🔥.
Example 1: Predicting how many points a basketball team will score using playing location (home, away, or a neutral court) 🏀
Example 2: Estimating the effect of different treatments on blood pressure (Drug A, Drug B, and a placebo) 💊
How many dummy variables would be needed to incorporate the categorical predictors in Examples 1 and 2?
Palmer Penguins Data
Using sex as a predictor variable, it has \(\ell = 2\) levels in this data set, so it has \(2 - 1 = 1\) corresponding dummy variable.
Code
# Set seed value for reproducible random number generation
set.seed(1999)
# Creating explicit dummy variable for sex of the penguin
penguins_dummy <- penguins |>
drop_na(sex) |>
dplyr::mutate(sex_female = ifelse(sex == "female", 1, ifelse(sex == "male", 0, NA)))
# Displaying data with dummy variable
penguins_dummy |>
dplyr::select(species, sex, sex_female, body_mass_g, flipper_length_mm) |>
sample_n(size = 5) |>
gt()| species | sex | sex_female | body_mass_g | flipper_length_mm |
|---|---|---|---|---|
| Gentoo | male | 0 | 6300 | 221 |
| Adelie | female | 1 | 3150 | 182 |
| Adelie | male | 0 | 3700 | 191 |
| Gentoo | female | 1 | 5050 | 207 |
| Chinstrap | male | 0 | 3775 | 193 |
Modeling Penguin Body Masses
Theoretical Model
Let’s consider using the sex of the penguins as a predictor of penguin body masses.
Our assumed simple linear regression model is:
\[y = \beta_0 + x_1\beta_1 + \varepsilon, \]
where \(y\) is a penguin’s body mass (g), \(x_1 = 1\) if a penguin is female and 0 if it is male, and \(\varepsilon\) the independent and normally distributed error term.
Code
# Group means for plotting line of best fit
body_mass_means <- penguins_dummy |>
dplyr::group_by(sex) |>
dplyr::summarize(Mean = mean(body_mass_g),
sex_female = mean(sex_female))
# Visualizing group means
penguins_dummy |>
ggplot(aes(x = sex_female, y = body_mass_g,
fill = sex)) +
geom_point(pch = 21, color = "white") +
geom_hline(data = body_mass_means,
aes(yintercept = Mean, color = sex)) +
labs(y = "Body mass (g)", x = "Female indicator", title = "Palmer penguins data: body mass by sex") +
scale_fill_manual(values = penguin_sex_colors) +
scale_color_manual(values = penguin_sex_colors) +
scale_x_continuous(breaks = c(0, 1)) +
theme(legend.position = "bottom")Code
# Fitting SLR model with sex as predictor
slr_model2 <- lm(body_mass_g ~ sex, data = penguins)
# Printing SLR model output
slr_model2 |>
tbl_regression(intercept = TRUE) |>
add_glance_source_note()| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 4,546 | 4,435, 4,656 | <0.001 |
| sex | |||
| male | — | — | |
| female | -683 | -841, -526 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.181; Adjusted R² = 0.178; Sigma = 730; Statistic = 73.0; p-value = <0.001; df = 1; Log-likelihood = -2,667; AIC = 5,340; BIC = 5,351; Deviance = 176,380,769; Residual df = 331; No. Obs. = 333 | |||
Estimated Model
Our estimated simple linear regression model is:
\[\hat{y} = 4546 - 683x_1,\]
where \(\hat{y}\) is the estimated penguin body mass (g), \(x_1 = 1\) if a penguin is female, and 0 if it is male.
Which group is the baseline / reference category?
How do we interpret the estimated slope?
How do we interpret the estimated intercept?
What is the expected body mass (g) of a female penguin?
Code
# Obtaining predicted value
predict(slr_model2, newdata = tibble(sex = "female"))Note that we can incorporate both quantitative and categorical predictors at the same time in a regression model.
Code
# Creating scatter plot of body mass by flipper length colored by sex
penguins |>
drop_na(sex) |>
ggplot(aes(x = flipper_length_mm, y = body_mass_g,
fill = sex)) +
geom_point(pch = 21, color = "white") +
labs(y = "Body mass (g)", x = "Flipper length (mm)", title = "Palmer Penguins Data: Body Mass by Flipper Length") +
scale_fill_manual(values = penguin_sex_colors) +
theme(legend.position = "bottom")Let’s consider modeling the penguin body masses using the sex of the penguins and their flipper lengths.
Our assumed multiple linear regression model is:
\[y = \beta_0 + x_1\beta_1 + x_2\beta_2 + \varepsilon, \]
where \(y\) is a penguin’s body mass (g), \(x_1 = 1\) if a penguin is female and 0 if it is male, \(x_2\) is a penguin’s flipper length (mm), and \(\varepsilon\) the independent and normally distributed error term.
Code
# Fitting MLR model
sex_flipper_mlr <- lm(body_mass_g ~ sex + flipper_length_mm,
data = penguins)
# Extracting MLR model coefficients
beta_hats <- coefficients(sex_flipper_mlr)
# Printing MLR model output
sex_flipper_mlr |>
tbl_regression(intercept = TRUE) |>
add_glance_source_note()| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | -5,062 | -5,645, -4,480 | <0.001 |
| sex | |||
| male | — | — | |
| female | -348 | -427, -268 | <0.001 |
| flipper_length_mm | 47 | 44, 50 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.806; Adjusted R² = 0.805; Sigma = 356; Statistic = 685; p-value = <0.001; df = 2; Log-likelihood = -2,427; AIC = 4,862; BIC = 4,878; Deviance = 41,795,374; Residual df = 330; No. Obs. = 333 | |||
Our estimated multiple linear regression model is:
\[\hat{y} = -5062 - 348x_1 + 47x_2\]
where \(\hat{y}\) is the estimated penguin body mass (g), \(x_1 = 1\) if a penguin is female, and 0 if it is male, and \(x_2\) is a penguin’s flipper length (mm).
This two-predictor model implies separate parallel lines for each group as depicted below.
Code
# Creating scatter plot with separate parallel lines for each group
sex_segment_data <- penguins |>
drop_na(sex) |>
dplyr::group_by(sex) |>
dplyr::summarize(x = min(flipper_length_mm),
xend = max(flipper_length_mm)) |>
ungroup() |>
mutate(y = predict(sex_flipper_mlr, newdata = data.frame(flipper_length_mm = x, sex = sex)),
yend = predict(sex_flipper_mlr, newdata = data.frame(flipper_length_mm = xend, sex = sex)))
penguins |>
drop_na(sex) |>
ggplot(aes(x = flipper_length_mm, y = body_mass_g,
fill = sex)) +
geom_segment(data = sex_segment_data, aes(x = x, xend = xend, y = y, yend = yend,
color = sex), linewidth = 1) +
geom_point(pch = 21, color = "white") +
labs(y = "Body mass (g)", x = "Flipper length (mm)", title = "Palmer Penguins Data: Body Mass by Flipper Length") +
scale_fill_manual(values = penguin_sex_colors) +
scale_color_manual(values = penguin_sex_colors) +
theme(legend.position = "bottom")This assumption of parallel lines across the subgroups in the data may or may not be reasonable. One can allow the lines to have different slopes within each subgroup using an interaction between the predictors (flipper length and sex here), but this adds complexity to the model which can lead to overfitting.
Categorical predictors
Species is another categorical variable in the penguins data. Using species as our predictor variable, it has \(\ell = 3\) levels in this data set.
How many dummy variables are used to represent the species variable?
Let’s consider modeling the penguin body masses using the species of the penguins.
The assumed multiple linear regression model using species as a predictor is:
\[y = \beta_0 + x_1\beta_1 + x_2\beta_2 + \varepsilon, \]
where \(y\) is a penguin’s body mass (g), \(x_1 = 1\) for a Chinstrap penguin and 0 otherwise, \(x_2=1\) for a Gentoo penguin and 0 otherwise, and \(\varepsilon\) is the independent and normally distributed error term.
Based on the model above, which species is the baseline / reference category? Why?
Code
# Set seed value for reproducible random number generation
set.seed(1994)
# Creating explicit dummy variables for species
penguins_dummy2 <- penguins |>
dplyr::mutate(species_chinstrap = ifelse(species == "Chinstrap", 1, 0),
species_gentoo = ifelse(species == "Gentoo", 1, 0))
# Displaying data with dummy variables for species
penguins_dummy2 |>
dplyr::select(species, species_chinstrap, species_gentoo, body_mass_g, flipper_length_mm) |>
dplyr::sample_n(size = 5) |>
gt()| species | species_chinstrap | species_gentoo | body_mass_g | flipper_length_mm |
|---|---|---|---|---|
| Chinstrap | 1 | 0 | 3775 | 198 |
| Adelie | 0 | 0 | 3550 | 184 |
| Chinstrap | 1 | 0 | 3525 | 188 |
| Gentoo | 0 | 1 | 4375 | 217 |
| Gentoo | 0 | 1 | 5700 | 230 |
Code
# Fitting linear model with species as predictor
species_model <- lm(body_mass_g ~ species,
data = penguins)
beta_hats <- coefficients(species_model)Code
# Printing MLR model output
species_model |>
tbl_regression(intercept = TRUE) |>
add_glance_source_note()| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 3,701 | 3,627, 3,775 | <0.001 |
| species | |||
| Adelie | — | — | |
| Chinstrap | 32 | -100, 165 | 0.6 |
| Gentoo | 1,375 | 1,265, 1,486 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.670; Adjusted R² = 0.668; Sigma = 462; Statistic = 344; p-value = <0.001; df = 2; Log-likelihood = -2,582; AIC = 5,173; BIC = 5,188; Deviance = 72,443,483; Residual df = 339; No. Obs. = 342 | |||
Our estimated regression model using species as a predictor is:
\[\hat{y} = 3701 + 32x_1 + 1375x_2,\]
where \(\hat{y}\) is the estimated penguin body mass (g), \(x_1 = 1\) for a Chinstrap penguin and 0 otherwise, and \(x_2=1\) for a Gentoo penguin and 0 otherwise.
Code
# Group means for plotting line of best fit
body_mass_means <- penguins_dummy |>
dplyr::group_by(species) |>
dplyr::summarize(Mean = mean(body_mass_g, na.rm = TRUE))
# Plotting group data
penguins_dummy |>
ggplot(aes(x = as.integer(as.factor(species)), y = body_mass_g,
fill = species)) +
geom_point(pch = 21, color = "white") +
labs(y = "Body mass (g)", x = "", title = "Palmer Penguins Data: Body Mass by species") +
geom_hline(data = body_mass_means,
aes(yintercept = Mean, color = species)) +
scale_fill_manual(values = penguin_species_colors) +
scale_color_manual(values = penguin_species_colors) +
scale_x_continuous(breaks = 0) +
theme(legend.position = "bottom")The theoretical regression model using species and flipper length as predictors is:
\[y = \beta_0 + x_1\beta_1 + x_2\beta_2 + x_3\beta_3 + \varepsilon, \]
where \(y\) is a penguin’s body mass (g), \(x_1 = 1\) for a Chinstrap penguin and 0 otherwise, \(x_2=1\) for a Gentoo penguin and 0 otherwise, \(x_3\) is a penguin’s flipper length in mm, and \(\varepsilon\) is the independent and normally distributed error term.
Code
# Fitting linear models
species_flipper_mlr <- lm(body_mass_g ~ species + flipper_length_mm, data = penguins)
beta_hats <- coefficients(species_flipper_mlr)
# Printing MLR model output
species_flipper_mlr |>
tbl_regression(intercept = TRUE) |>
add_glance_source_note()| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | -4,031 | -5,181, -2,882 | <0.001 |
| species | |||
| Adelie | — | — | |
| Chinstrap | -207 | -320, -93 | <0.001 |
| Gentoo | 267 | 79, 454 | 0.005 |
| flipper_length_mm | 41 | 35, 47 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.783; Adjusted R² = 0.781; Sigma = 376; Statistic = 406; p-value = <0.001; df = 3; Log-likelihood = -2,511; AIC = 5,032; BIC = 5,051; Deviance = 47,666,988; Residual df = 338; No. Obs. = 342 | |||
The estimated regression model using species and flipper length as predictors:
\[\hat{y} = -4031 -207x_1 + 267x_2 + 41x_3,\]
where \(\hat{y}\) is the estimated penguin body mass (g), \(x_1 = 1\) for a Chinstrap penguin and 0 otherwise, \(x_2=1\) for a Gentoo penguin and 0 otherwise, and \(x_3\) is a penguin’s flipper length in mm.
This three-predictor model implies separate parallel lines for each group as depicted below.
Code
# Making scatter plot with parallel lines for each group
species_segment_data <- penguins |>
dplyr::group_by(species) |>
dplyr::summarize(x = min(flipper_length_mm, na.rm = TRUE),
xend = max(flipper_length_mm, na.rm = TRUE)) |>
ungroup() |>
mutate(y = predict(species_flipper_mlr, newdata = data.frame(flipper_length_mm = x, species = species)),
yend = predict(species_flipper_mlr, newdata = data.frame(flipper_length_mm = xend, species = species)))
penguins |>
ggplot(aes(x = flipper_length_mm, y = body_mass_g, fill = species)) +
geom_segment(data = species_segment_data, aes(x = x, xend = xend,
y = y, yend = yend,
color = species),
linewidth = 1) +
geom_point(pch = 21, color = "white") +
labs(y = "Body mass (g)", x = "Flipper length (mm)", title = "Palmer Penguins Data: Body Mass by Flipper Length") +
scale_fill_manual(values = penguin_species_colors) +
scale_color_manual(values = penguin_species_colors) +
theme(legend.position = "bottom")How do we interpret the estimated intercept?
How do we interpret the estimated slopes?
Confirm that the predicted body mass (g) of a Chinstrap penguin with a flipper length of 195mm is 3700 grams. Is this extrapolation?
Code
# Obtaining predicted value
predict(species_flipper_mlr,
newdata = tibble(species = "Chinstrap",
flipper_length_mm = 195))If a Chinstrap penguin with a flipper length of 195mm had a body mass of 3900 grams, what is its corresponding residual? What does this mean?
What is the expected body mass (g) of a Chinstrap penguin with a flipper length of 225mm? Is this extrapolation?
Code
# Obtaining predicted value
predict(species_flipper_mlr,
newdata = tibble(species = "Chinstrap",
flipper_length_mm = 225))Does it seem reasonable to constrain the regression lines to be parallel for each group?