Categorical Predictors & Dummy Variables

DSA 220 - Introduction to Data Science and Analytics

Author

Andrew DiLernia

Learning Objectives

Incorporate categorical predictors with dummy variables
Interpret the effects of categorical predictors

In this activity we will analyze the Palmer Penguins data set: a data set consisting of measurements collected on 344 penguins from 3 islands in Palmer Archipelago, Antarctica. Specifically, data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research Network. Researchers collected data by capturing and releasing breeding adult penguins across the islands at their nests.

*Artwork by @allison_horst*

Let’s load some R packages to start, load a custom function for plotting VIF values, and set a default theme for all visualizations made with ggplot2.

Code

# Loading R packages
library(tidyverse)
library(gt)
library(ggthemes)
library(skimr)
library(corrr)
library(ggfortify)
library(broom)
library(scales)
library(car)
library(GGally)
library(gtsummary)

# Load function for visualizing VIF & GVIF values
source("https://raw.githubusercontent.com/dilernia/STA323/main/Functions/vif_plot.R")

# Setting default ggplot2 theme
theme_set(ggthemes::theme_few())

Then, we can load the data for the activity.

Code

# Importing penguins data from GitHub
penguins <- read_csv("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Data/penguins.csv") |> 
    dplyr::mutate(sex = fct_relevel(sex, "male", "female"))

We will also create some vectors of colorblind-friendly colors to help us consistently depict subgroups in this activity.

Code

# Creating vector of colorblind-friendly colors
penguin_species_colors <- c("Chinstrap" = "#CC79A7", "Gentoo" = "#009E73", "Adelie" = "#E69F00")
penguin_sex_colors <- c("male" = "#D55E00", "female" = "#56B4E9")

Below are a data dictionary for the Palmer Penguins data and a preview of some randomly selected rows.

Table 1: Data dictionary for the Palmer Penguins data set.

Variable	Description
species	Species of the penguin
island	Island the penguin was found on
bill_length_mm	Bill length (mm)
bill_depth_mm	Bill depth (mm)
flipper_length_mm	Flipper length (mm)
body_mass_g	Body mass (g)
sex	Sex of the penguin
year	Year data was collected

Table 2: Randomly selected rows from the Palmer Penguins data set.

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Chinstrap	Dream	50.2	18.7	198	3775	female	2009
Adelie	Biscoe	39.7	18.9	184	3550	male	2009
Chinstrap	Dream	45.4	18.7	188	3525	female	2007
Gentoo	Biscoe	46.2	14.1	217	4375	female	2009
Gentoo	Biscoe	50.0	16.3	230	5700	male	2007

Let’s start with considering the simple linear regression model with a quantitative predictor.

Code

# Creating a scatter plot of body mass by flipper length
penguins |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_smooth(aes(x = flipper_length_mm, y = body_mass_g), method = "lm", se = FALSE) +
  geom_point(pch = 21, fill = "black",
             color = "white") + 
  labs(y = "Body mass (g)", 
       x = "Flipper length (mm)", 
       title = "Palmer penguins data: body mass by flipper length")

Code

# Fitting SLR model
slr_model <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

# Printing SLR model output
slr_model  |> 
  tbl_regression(intercept = TRUE) |> 
  add_glance_source_note()

Table 3: Regression model output.

Characteristic	Beta	95% CI	p-value
(Intercept)	-5,781	-6,382, -5,179	<0.001
flipper_length_mm	50	47, 53	<0.001
Abbreviation: CI = Confidence Interval
R² = 0.759; Adjusted R² = 0.758; Sigma = 394; Statistic = 1,071; p-value = <0.001; df = 1; Log-likelihood = -2,528; AIC = 5,063; BIC = 5,074; Deviance = 52,854,796; Residual df = 340; No. Obs. = 342

The corresponding estimated simple linear regression model is:

\[\hat{y} = -5781 + 50 x_1,\]

where \(\hat{y}\) is the estimated penguin body mass (g), and \(x_1\) is a penguin’s flipper length (mm).

Tip

How do we interpret the estimated slope?
How do we interpret the estimated intercept?

What if we wanted to incorporate categorical variables in our model?

Code

# Creating scatter plot of body mass by flipper length coloring points by sex
penguins |> 
  drop_na(sex) |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_smooth(aes(x = flipper_length_mm, y = body_mass_g), 
              method = "lm", se = FALSE, color = "black") +
  geom_point(aes(x = flipper_length_mm, y = body_mass_g, fill = sex), pch = 21, color = "white") + 
  labs(y = "Body mass (g)", x = "Flipper length (mm)", title = "Palmer penguins data: body mass by flipper length") +
    scale_fill_manual(values = penguin_sex_colors) +
  theme(legend.position = "bottom")

Code

# Creating side-by-side boxplots for body mass by sex
penguins |> 
  drop_na(sex) |> 
  ggplot(aes(x = sex, y = body_mass_g, fill = sex)) + 
  geom_boxplot() + 
  labs(y = "Body mass (g)", x = "sex", title = "Palmer penguins data: body mass by sex") +
  scale_fill_manual(values = penguin_sex_colors) + 
  theme(legend.position = "none")

Tip

Does a penguin’s body mass (g) appear to vary based on its sex?
Do male or female penguins have larger average body masses?

Categorical Predictors

Linear regression requires a quantitative response variable, but the explanatory or predictor variable can be quantitative or categorical. Since solutions are obtained numerically, categorical predictors cannot be used directly, but must be converted to a numerical form for model fitting. The most common method for doing so is to use dummy variables.

dummy variable: a dummy variable, also called an indicator variable, is a variable that takes on the value 0 or 1 to indicate the presence of an effect on an outcome.

For each categorical predictor \(x_k\), the number of needed dummy variables is \(\ell_k-1\) where \(\ell_k\) is the number of levels of \(x_k\). Typically, one category must be set as the baseline / reference category and does not have a dummy variable that is included in the model to avoid redundancy and perfectly collinear predictors. Sometimes, specifying the baseline category as a certain group is more preferable than others for interpretation purposes. Other times which group is set as the reference category is not important, so selecting any group works fine.

In data science and machine learning, when all categories retain a dummy variable in the model, this process of creating binary numerical representations of categorical variables is also called 🔥“one-hot encoding”🔥.

Example 1: Predicting how many points a basketball team will score using playing location (home, away, or a neutral court) 🏀
Example 2: Estimating the effect of different treatments on blood pressure (Drug A, Drug B, and a placebo) 💊

Tip

How many dummy variables would be needed to incorporate the categorical predictors in Examples 1 and 2?

Palmer Penguins Data

Using sex as a predictor variable, it has \(\ell = 2\) levels in this data set, so it has \(2 - 1 = 1\) corresponding dummy variable.

Code

# Set seed value for reproducible random number generation
set.seed(1999)

# Creating explicit dummy variable for sex of the penguin
penguins_dummy <- penguins |> 
  drop_na(sex) |> 
  dplyr::mutate(sex_female = ifelse(sex == "female", 1, ifelse(sex == "male", 0, NA)))

# Displaying data with dummy variable
penguins_dummy |> 
  dplyr::select(species, sex, sex_female, body_mass_g, flipper_length_mm) |> 
  sample_n(size = 5) |>
  gt()

Table 4: Five randomly selected rows from the Palmer Penguins data displaying a dummy variable.

species	sex	sex_female	body_mass_g	flipper_length_mm
Gentoo	male	0	6300	221
Adelie	female	1	3150	182
Adelie	male	0	3700	191
Gentoo	female	1	5050	207
Chinstrap	male	0	3775	193

Modeling Penguin Body Masses

Theoretical Model

Let’s consider using the sex of the penguins as a predictor of penguin body masses.

Our assumed simple linear regression model is:

\[y = \beta_0 + x_1\beta_1 + \varepsilon, \]

where \(y\) is a penguin’s body mass (g), \(x_1 = 1\) if a penguin is female and 0 if it is male, and \(\varepsilon\) the independent and normally distributed error term.

Code

# Group means for plotting line of best fit
body_mass_means <- penguins_dummy |> 
  dplyr::group_by(sex) |> 
  dplyr::summarize(Mean = mean(body_mass_g), 
                   sex_female = mean(sex_female))

# Visualizing group means
penguins_dummy |> 
  ggplot(aes(x = sex_female, y = body_mass_g, 
             fill = sex)) + 
  geom_point(pch = 21, color = "white") +
  geom_hline(data = body_mass_means, 
              aes(yintercept = Mean, color = sex)) +
    labs(y = "Body mass (g)", x = "Female indicator", title = "Palmer penguins data: body mass by sex") +
  scale_fill_manual(values = penguin_sex_colors) + 
    scale_color_manual(values = penguin_sex_colors) + 
  scale_x_continuous(breaks = c(0, 1)) +
  theme(legend.position = "bottom")

Code

# Fitting SLR model with sex as predictor
slr_model2 <- lm(body_mass_g ~ sex, data = penguins)

# Printing SLR model output
slr_model2  |> 
  tbl_regression(intercept = TRUE) |> 
  add_glance_source_note()

Table 5: Regression model output with categorical predictor.

Characteristic	Beta	95% CI	p-value
(Intercept)	4,546	4,435, 4,656	<0.001
sex
male	—	—
female	-683	-841, -526	<0.001
Abbreviation: CI = Confidence Interval
R² = 0.181; Adjusted R² = 0.178; Sigma = 730; Statistic = 73.0; p-value = <0.001; df = 1; Log-likelihood = -2,667; AIC = 5,340; BIC = 5,351; Deviance = 176,380,769; Residual df = 331; No. Obs. = 333

Estimated Model

Our estimated simple linear regression model is:

\[\hat{y} = 4546 - 683x_1,\]

where \(\hat{y}\) is the estimated penguin body mass (g), \(x_1 = 1\) if a penguin is female, and 0 if it is male.

Tip

Which group is the baseline / reference category?
How do we interpret the estimated slope?
How do we interpret the estimated intercept?
What is the expected body mass (g) of a female penguin?

Code

# Obtaining predicted value
predict(slr_model2, newdata = tibble(sex = "female"))

Note that we can incorporate both quantitative and categorical predictors at the same time in a regression model.

Code

# Creating scatter plot of body mass by flipper length colored by sex
penguins |> 
  drop_na(sex) |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g,
             fill = sex)) + 
  geom_point(pch = 21, color = "white") + 
  labs(y = "Body mass (g)", x = "Flipper length (mm)", title = "Palmer Penguins Data: Body Mass by Flipper Length") +
  scale_fill_manual(values = penguin_sex_colors) + 
  theme(legend.position = "bottom")

Let’s consider modeling the penguin body masses using the sex of the penguins and their flipper lengths.

Our assumed multiple linear regression model is:

\[y = \beta_0 + x_1\beta_1 + x_2\beta_2 + \varepsilon, \]

where \(y\) is a penguin’s body mass (g), \(x_1 = 1\) if a penguin is female and 0 if it is male, \(x_2\) is a penguin’s flipper length (mm), and \(\varepsilon\) the independent and normally distributed error term.

Code

# Fitting MLR model
sex_flipper_mlr <- lm(body_mass_g ~ sex + flipper_length_mm,
                      data = penguins)

# Extracting MLR model coefficients
beta_hats <- coefficients(sex_flipper_mlr)

# Printing MLR model output
sex_flipper_mlr |> 
  tbl_regression(intercept = TRUE) |> 
  add_glance_source_note()

Table 6: Regression model output.

Characteristic	Beta	95% CI	p-value
(Intercept)	-5,062	-5,645, -4,480	<0.001
sex
male	—	—
female	-348	-427, -268	<0.001
flipper_length_mm	47	44, 50	<0.001
Abbreviation: CI = Confidence Interval
R² = 0.806; Adjusted R² = 0.805; Sigma = 356; Statistic = 685; p-value = <0.001; df = 2; Log-likelihood = -2,427; AIC = 4,862; BIC = 4,878; Deviance = 41,795,374; Residual df = 330; No. Obs. = 333

Our estimated multiple linear regression model is:

\[\hat{y} = -5062 - 348x_1 + 47x_2\]

where \(\hat{y}\) is the estimated penguin body mass (g), \(x_1 = 1\) if a penguin is female, and 0 if it is male, and \(x_2\) is a penguin’s flipper length (mm).

This two-predictor model implies separate parallel lines for each group as depicted below.

Code

# Creating scatter plot with separate parallel lines for each group 
sex_segment_data <- penguins |> 
  drop_na(sex) |> 
  dplyr::group_by(sex) |> 
  dplyr::summarize(x = min(flipper_length_mm),
                   xend = max(flipper_length_mm)) |> 
  ungroup() |> 
  mutate(y = predict(sex_flipper_mlr, newdata = data.frame(flipper_length_mm = x, sex = sex)), 
         yend = predict(sex_flipper_mlr, newdata = data.frame(flipper_length_mm = xend, sex = sex)))

penguins |> 
  drop_na(sex) |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, 
             fill = sex)) + 
  geom_segment(data = sex_segment_data, aes(x = x, xend = xend, y = y, yend = yend, 
                                    color = sex), linewidth = 1) +
  geom_point(pch = 21, color = "white") +
  labs(y = "Body mass (g)", x = "Flipper length (mm)", title = "Palmer Penguins Data: Body Mass by Flipper Length") +
  scale_fill_manual(values = penguin_sex_colors) + 
  scale_color_manual(values = penguin_sex_colors) + 
  theme(legend.position = "bottom")

This assumption of parallel lines across the subgroups in the data may or may not be reasonable. One can allow the lines to have different slopes within each subgroup using an interaction between the predictors (flipper length and sex here), but this adds complexity to the model which can lead to overfitting.

Categorical predictors

Species is another categorical variable in the penguins data. Using species as our predictor variable, it has \(\ell = 3\) levels in this data set.

Tip

How many dummy variables are used to represent the species variable?

Let’s consider modeling the penguin body masses using the species of the penguins.

The assumed multiple linear regression model using species as a predictor is:

\[y = \beta_0 + x_1\beta_1 + x_2\beta_2 + \varepsilon, \]

where \(y\) is a penguin’s body mass (g), \(x_1 = 1\) for a Chinstrap penguin and 0 otherwise, \(x_2=1\) for a Gentoo penguin and 0 otherwise, and \(\varepsilon\) is the independent and normally distributed error term.

Tip

Based on the model above, which species is the baseline / reference category? Why?

Code

# Set seed value for reproducible random number generation
set.seed(1994)

# Creating explicit dummy variables for species
penguins_dummy2 <- penguins |> 
  dplyr::mutate(species_chinstrap = ifelse(species == "Chinstrap", 1, 0),
         species_gentoo = ifelse(species == "Gentoo", 1, 0))

# Displaying data with dummy variables for species
penguins_dummy2 |> 
  dplyr::select(species, species_chinstrap, species_gentoo, body_mass_g, flipper_length_mm) |> 
  dplyr::sample_n(size = 5) |>
  gt()

Table 7: Five randomly selected rows from the Palmer Penguins data displaying dummy variables.

species	species_chinstrap	species_gentoo	body_mass_g	flipper_length_mm
Chinstrap	1	0	3775	198
Adelie	0	0	3550	184
Chinstrap	1	0	3525	188
Gentoo	0	1	4375	217
Gentoo	0	1	5700	230

Code

# Fitting linear model with species as predictor
species_model <- lm(body_mass_g ~ species, 
                   data = penguins)

beta_hats <- coefficients(species_model)

Code

# Printing MLR model output
species_model |> 
  tbl_regression(intercept = TRUE) |> 
  add_glance_source_note()

Table 8: Regression model output.

Characteristic	Beta	95% CI	p-value
(Intercept)	3,701	3,627, 3,775	<0.001
species
Adelie	—	—
Chinstrap	32	-100, 165	0.6
Gentoo	1,375	1,265, 1,486	<0.001
Abbreviation: CI = Confidence Interval
R² = 0.670; Adjusted R² = 0.668; Sigma = 462; Statistic = 344; p-value = <0.001; df = 2; Log-likelihood = -2,582; AIC = 5,173; BIC = 5,188; Deviance = 72,443,483; Residual df = 339; No. Obs. = 342

Our estimated regression model using species as a predictor is:

\[\hat{y} = 3701 + 32x_1 + 1375x_2,\]

where \(\hat{y}\) is the estimated penguin body mass (g), \(x_1 = 1\) for a Chinstrap penguin and 0 otherwise, and \(x_2=1\) for a Gentoo penguin and 0 otherwise.

Code

# Group means for plotting line of best fit
body_mass_means <- penguins_dummy |> 
  dplyr::group_by(species) |> 
  dplyr::summarize(Mean = mean(body_mass_g, na.rm = TRUE))

# Plotting group data
penguins_dummy |> 
  ggplot(aes(x = as.integer(as.factor(species)), y = body_mass_g, 
             fill = species)) + 
  geom_point(pch = 21, color = "white") +
  labs(y = "Body mass (g)", x = "", title = "Palmer Penguins Data: Body Mass by species") +
   geom_hline(data = body_mass_means, 
              aes(yintercept = Mean, color = species)) +
  scale_fill_manual(values = penguin_species_colors) +
  scale_color_manual(values = penguin_species_colors) +
  scale_x_continuous(breaks = 0) +
  theme(legend.position = "bottom")

The theoretical regression model using species and flipper length as predictors is:

\[y = \beta_0 + x_1\beta_1 + x_2\beta_2 + x_3\beta_3 + \varepsilon, \]

where \(y\) is a penguin’s body mass (g), \(x_1 = 1\) for a Chinstrap penguin and 0 otherwise, \(x_2=1\) for a Gentoo penguin and 0 otherwise, \(x_3\) is a penguin’s flipper length in mm, and \(\varepsilon\) is the independent and normally distributed error term.

Code

# Fitting linear models
species_flipper_mlr <- lm(body_mass_g ~ species + flipper_length_mm, data = penguins)

beta_hats <- coefficients(species_flipper_mlr)

# Printing MLR model output
species_flipper_mlr |> 
  tbl_regression(intercept = TRUE) |> 
  add_glance_source_note()

Table 9: Regression model output.

Characteristic	Beta	95% CI	p-value
(Intercept)	-4,031	-5,181, -2,882	<0.001
species
Adelie	—	—
Chinstrap	-207	-320, -93	<0.001
Gentoo	267	79, 454	0.005
flipper_length_mm	41	35, 47	<0.001
Abbreviation: CI = Confidence Interval
R² = 0.783; Adjusted R² = 0.781; Sigma = 376; Statistic = 406; p-value = <0.001; df = 3; Log-likelihood = -2,511; AIC = 5,032; BIC = 5,051; Deviance = 47,666,988; Residual df = 338; No. Obs. = 342

The estimated regression model using species and flipper length as predictors:

\[\hat{y} = -4031 -207x_1 + 267x_2 + 41x_3,\]

where \(\hat{y}\) is the estimated penguin body mass (g), \(x_1 = 1\) for a Chinstrap penguin and 0 otherwise, \(x_2=1\) for a Gentoo penguin and 0 otherwise, and \(x_3\) is a penguin’s flipper length in mm.

This three-predictor model implies separate parallel lines for each group as depicted below.

Code

# Making scatter plot with parallel lines for each group
species_segment_data <- penguins |> 
  dplyr::group_by(species) |> 
  dplyr::summarize(x = min(flipper_length_mm, na.rm = TRUE),
                   xend = max(flipper_length_mm, na.rm = TRUE)) |> 
  ungroup() |> 
  mutate(y = predict(species_flipper_mlr, newdata = data.frame(flipper_length_mm = x, species = species)), 
         yend = predict(species_flipper_mlr, newdata = data.frame(flipper_length_mm = xend, species = species)))

penguins |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, fill = species)) + 
geom_segment(data = species_segment_data, aes(x = x, xend = xend,
                                              y = y, yend = yend,
                                              color = species), 
                              linewidth = 1) +
  geom_point(pch = 21, color = "white") +
  labs(y = "Body mass (g)", x = "Flipper length (mm)", title = "Palmer Penguins Data: Body Mass by Flipper Length") +
  scale_fill_manual(values = penguin_species_colors) + 
  scale_color_manual(values = penguin_species_colors) + 
  theme(legend.position = "bottom")

Tip

How do we interpret the estimated intercept?
How do we interpret the estimated slopes?
Confirm that the predicted body mass (g) of a Chinstrap penguin with a flipper length of 195mm is 3700 grams. Is this extrapolation?

Code

# Obtaining predicted value
predict(species_flipper_mlr, 
        newdata = tibble(species = "Chinstrap",
                         flipper_length_mm = 195))

Tip

If a Chinstrap penguin with a flipper length of 195mm had a body mass of 3900 grams, what is its corresponding residual? What does this mean?
What is the expected body mass (g) of a Chinstrap penguin with a flipper length of 225mm? Is this extrapolation?

Code

# Obtaining predicted value
predict(species_flipper_mlr, 
        newdata = tibble(species = "Chinstrap",
                         flipper_length_mm = 225))

Tip

Does it seem reasonable to constrain the regression lines to be parallel for each group?