Categorical Predictors & Dummy Variables

DSA 220 - Introduction to Data Science and Analytics

Author

Andrew DiLernia

Learning Objectives

  • Incorporate categorical predictors with dummy variables

  • Interpret the effects of categorical predictors

In this activity we will analyze the Palmer Penguins data set: a data set consisting of measurements collected on 344 penguins from 3 islands in Palmer Archipelago, Antarctica. Specifically, data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research Network. Researchers collected data by capturing and releasing breeding adult penguins across the islands at their nests.

Artwork by @allison_horst

Let’s load some R packages to start, load a custom function for plotting VIF values, and set a default theme for all visualizations made with ggplot2.

Code
# Loading R packages
library(tidyverse)
library(gt)
library(ggthemes)
library(skimr)
library(corrr)
library(ggfortify)
library(broom)
library(scales)
library(car)
library(GGally)
library(gtsummary)

# Load function for visualizing VIF & GVIF values
source("https://raw.githubusercontent.com/dilernia/STA323/main/Functions/vif_plot.R")

# Setting default ggplot2 theme
theme_set(ggthemes::theme_few())

Then, we can load the data for the activity.

Code
# Importing penguins data from GitHub
penguins <- read_csv("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Data/penguins.csv") |> 
    dplyr::mutate(sex = fct_relevel(sex, "male", "female"))

We will also create some vectors of colorblind-friendly colors to help us consistently depict subgroups in this activity.

Code
# Creating vector of colorblind-friendly colors
penguin_species_colors <- c("Chinstrap" = "#CC79A7", "Gentoo" = "#009E73", "Adelie" = "#E69F00")
penguin_sex_colors <- c("male" = "#D55E00", "female" = "#56B4E9")

Below are a data dictionary for the Palmer Penguins data and a preview of some randomly selected rows.

Table 1: Data dictionary for the Palmer Penguins data set.
Variable Description
species Species of the penguin
island Island the penguin was found on
bill_length_mm Bill length (mm)
bill_depth_mm Bill depth (mm)
flipper_length_mm Flipper length (mm)
body_mass_g Body mass (g)
sex Sex of the penguin
year Year data was collected
Table 2: Randomly selected rows from the Palmer Penguins data set.
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Chinstrap Dream 50.2 18.7 198 3775 female 2009
Adelie Biscoe 39.7 18.9 184 3550 male 2009
Chinstrap Dream 45.4 18.7 188 3525 female 2007
Gentoo Biscoe 46.2 14.1 217 4375 female 2009
Gentoo Biscoe 50.0 16.3 230 5700 male 2007

Let’s start with considering the simple linear regression model with a quantitative predictor.

Code
# Creating a scatter plot of body mass by flipper length
penguins |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_smooth(aes(x = flipper_length_mm, y = body_mass_g), method = "lm", se = FALSE) +
  geom_point(pch = 21, fill = "black",
             color = "white") + 
  labs(y = "Body mass (g)", 
       x = "Flipper length (mm)", 
       title = "Palmer penguins data: body mass by flipper length")

Code
# Fitting SLR model
slr_model <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

# Printing SLR model output
slr_model  |> 
  tbl_regression(intercept = TRUE) |> 
  add_glance_source_note()
Table 3: Regression model output.
Characteristic Beta 95% CI p-value
(Intercept) -5,781 -6,382, -5,179 <0.001
flipper_length_mm 50 47, 53 <0.001
Abbreviation: CI = Confidence Interval
R² = 0.759; Adjusted R² = 0.758; Sigma = 394; Statistic = 1,071; p-value = <0.001; df = 1; Log-likelihood = -2,528; AIC = 5,063; BIC = 5,074; Deviance = 52,854,796; Residual df = 340; No. Obs. = 342

The corresponding estimated simple linear regression model is:

\[\hat{y} = -5781 + 50 x_1,\]

where \(\hat{y}\) is the estimated penguin body mass (g), and \(x_1\) is a penguin’s flipper length (mm).

Tip
  1. How do we interpret the estimated slope?

  2. How do we interpret the estimated intercept?

What if we wanted to incorporate categorical variables in our model?

Code
# Creating scatter plot of body mass by flipper length coloring points by sex
penguins |> 
  drop_na(sex) |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_smooth(aes(x = flipper_length_mm, y = body_mass_g), 
              method = "lm", se = FALSE, color = "black") +
  geom_point(aes(x = flipper_length_mm, y = body_mass_g, fill = sex), pch = 21, color = "white") + 
  labs(y = "Body mass (g)", x = "Flipper length (mm)", title = "Palmer penguins data: body mass by flipper length") +
    scale_fill_manual(values = penguin_sex_colors) +
  theme(legend.position = "bottom")

Code
# Creating side-by-side boxplots for body mass by sex
penguins |> 
  drop_na(sex) |> 
  ggplot(aes(x = sex, y = body_mass_g, fill = sex)) + 
  geom_boxplot() + 
  labs(y = "Body mass (g)", x = "sex", title = "Palmer penguins data: body mass by sex") +
  scale_fill_manual(values = penguin_sex_colors) + 
  theme(legend.position = "none")

Tip
  1. Does a penguin’s body mass (g) appear to vary based on its sex?

  2. Do male or female penguins have larger average body masses?

Categorical Predictors

Linear regression requires a quantitative response variable, but the explanatory or predictor variable can be quantitative or categorical. Since solutions are obtained numerically, categorical predictors cannot be used directly, but must be converted to a numerical form for model fitting. The most common method for doing so is to use dummy variables.

  • dummy variable: a dummy variable, also called an indicator variable, is a variable that takes on the value 0 or 1 to indicate the presence of an effect on an outcome.

For each categorical predictor \(x_k\), the number of needed dummy variables is \(\ell_k-1\) where \(\ell_k\) is the number of levels of \(x_k\). Typically, one category must be set as the baseline / reference category and does not have a dummy variable that is included in the model to avoid redundancy and perfectly collinear predictors. Sometimes, specifying the baseline category as a certain group is more preferable than others for interpretation purposes. Other times which group is set as the reference category is not important, so selecting any group works fine.

In data science and machine learning, when all categories retain a dummy variable in the model, this process of creating binary numerical representations of categorical variables is also called 🔥“one-hot encoding”🔥.

  • Example 1: Predicting how many points a basketball team will score using playing location (home, away, or a neutral court) 🏀

  • Example 2: Estimating the effect of different treatments on blood pressure (Drug A, Drug B, and a placebo) 💊

Tip

How many dummy variables would be needed to incorporate the categorical predictors in Examples 1 and 2?

Palmer Penguins Data

Using sex as a predictor variable, it has \(\ell = 2\) levels in this data set, so it has \(2 - 1 = 1\) corresponding dummy variable.

Code
# Set seed value for reproducible random number generation
set.seed(1999)

# Creating explicit dummy variable for sex of the penguin
penguins_dummy <- penguins |> 
  drop_na(sex) |> 
  dplyr::mutate(sex_female = ifelse(sex == "female", 1, ifelse(sex == "male", 0, NA)))

# Displaying data with dummy variable
penguins_dummy |> 
  dplyr::select(species, sex, sex_female, body_mass_g, flipper_length_mm) |> 
  sample_n(size = 5) |>
  gt()
Table 4: Five randomly selected rows from the Palmer Penguins data displaying a dummy variable.
species sex sex_female body_mass_g flipper_length_mm
Gentoo male 0 6300 221
Adelie female 1 3150 182
Adelie male 0 3700 191
Gentoo female 1 5050 207
Chinstrap male 0 3775 193

Modeling Penguin Body Masses

Theoretical Model

Let’s consider using the sex of the penguins as a predictor of penguin body masses.

Our assumed simple linear regression model is:

\[y = \beta_0 + x_1\beta_1 + \varepsilon, \]

where \(y\) is a penguin’s body mass (g), \(x_1 = 1\) if a penguin is female and 0 if it is male, and \(\varepsilon\) the independent and normally distributed error term.

Code
# Group means for plotting line of best fit
body_mass_means <- penguins_dummy |> 
  dplyr::group_by(sex) |> 
  dplyr::summarize(Mean = mean(body_mass_g), 
                   sex_female = mean(sex_female))

# Visualizing group means
penguins_dummy |> 
  ggplot(aes(x = sex_female, y = body_mass_g, 
             fill = sex)) + 
  geom_point(pch = 21, color = "white") +
  geom_hline(data = body_mass_means, 
              aes(yintercept = Mean, color = sex)) +
    labs(y = "Body mass (g)", x = "Female indicator", title = "Palmer penguins data: body mass by sex") +
  scale_fill_manual(values = penguin_sex_colors) + 
    scale_color_manual(values = penguin_sex_colors) + 
  scale_x_continuous(breaks = c(0, 1)) +
  theme(legend.position = "bottom")

Code
# Fitting SLR model with sex as predictor
slr_model2 <- lm(body_mass_g ~ sex, data = penguins)

# Printing SLR model output
slr_model2  |> 
  tbl_regression(intercept = TRUE) |> 
  add_glance_source_note()
Table 5: Regression model output with categorical predictor.
Characteristic Beta 95% CI p-value
(Intercept) 4,546 4,435, 4,656 <0.001
sex


    male
    female -683 -841, -526 <0.001
Abbreviation: CI = Confidence Interval
R² = 0.181; Adjusted R² = 0.178; Sigma = 730; Statistic = 73.0; p-value = <0.001; df = 1; Log-likelihood = -2,667; AIC = 5,340; BIC = 5,351; Deviance = 176,380,769; Residual df = 331; No. Obs. = 333

Estimated Model

Our estimated simple linear regression model is:

\[\hat{y} = 4546 - 683x_1,\]

where \(\hat{y}\) is the estimated penguin body mass (g), \(x_1 = 1\) if a penguin is female, and 0 if it is male.

Tip
  1. Which group is the baseline / reference category?

  2. How do we interpret the estimated slope?

  3. How do we interpret the estimated intercept?

  4. What is the expected body mass (g) of a female penguin?

Code
# Obtaining predicted value
predict(slr_model2, newdata = tibble(sex = "female"))

Note that we can incorporate both quantitative and categorical predictors at the same time in a regression model.

Code
# Creating scatter plot of body mass by flipper length colored by sex
penguins |> 
  drop_na(sex) |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g,
             fill = sex)) + 
  geom_point(pch = 21, color = "white") + 
  labs(y = "Body mass (g)", x = "Flipper length (mm)", title = "Palmer Penguins Data: Body Mass by Flipper Length") +
  scale_fill_manual(values = penguin_sex_colors) + 
  theme(legend.position = "bottom")

Let’s consider modeling the penguin body masses using the sex of the penguins and their flipper lengths.

Our assumed multiple linear regression model is:

\[y = \beta_0 + x_1\beta_1 + x_2\beta_2 + \varepsilon, \]

where \(y\) is a penguin’s body mass (g), \(x_1 = 1\) if a penguin is female and 0 if it is male, \(x_2\) is a penguin’s flipper length (mm), and \(\varepsilon\) the independent and normally distributed error term.

Code
# Fitting MLR model
sex_flipper_mlr <- lm(body_mass_g ~ sex + flipper_length_mm,
                      data = penguins)

# Extracting MLR model coefficients
beta_hats <- coefficients(sex_flipper_mlr)

# Printing MLR model output
sex_flipper_mlr |> 
  tbl_regression(intercept = TRUE) |> 
  add_glance_source_note()
Table 6: Regression model output.
Characteristic Beta 95% CI p-value
(Intercept) -5,062 -5,645, -4,480 <0.001
sex


    male
    female -348 -427, -268 <0.001
flipper_length_mm 47 44, 50 <0.001
Abbreviation: CI = Confidence Interval
R² = 0.806; Adjusted R² = 0.805; Sigma = 356; Statistic = 685; p-value = <0.001; df = 2; Log-likelihood = -2,427; AIC = 4,862; BIC = 4,878; Deviance = 41,795,374; Residual df = 330; No. Obs. = 333

Our estimated multiple linear regression model is:

\[\hat{y} = -5062 - 348x_1 + 47x_2\]

where \(\hat{y}\) is the estimated penguin body mass (g), \(x_1 = 1\) if a penguin is female, and 0 if it is male, and \(x_2\) is a penguin’s flipper length (mm).

This two-predictor model implies separate parallel lines for each group as depicted below.

Code
# Creating scatter plot with separate parallel lines for each group 
sex_segment_data <- penguins |> 
  drop_na(sex) |> 
  dplyr::group_by(sex) |> 
  dplyr::summarize(x = min(flipper_length_mm),
                   xend = max(flipper_length_mm)) |> 
  ungroup() |> 
  mutate(y = predict(sex_flipper_mlr, newdata = data.frame(flipper_length_mm = x, sex = sex)), 
         yend = predict(sex_flipper_mlr, newdata = data.frame(flipper_length_mm = xend, sex = sex)))

penguins |> 
  drop_na(sex) |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, 
             fill = sex)) + 
  geom_segment(data = sex_segment_data, aes(x = x, xend = xend, y = y, yend = yend, 
                                    color = sex), linewidth = 1) +
  geom_point(pch = 21, color = "white") +
  labs(y = "Body mass (g)", x = "Flipper length (mm)", title = "Palmer Penguins Data: Body Mass by Flipper Length") +
  scale_fill_manual(values = penguin_sex_colors) + 
  scale_color_manual(values = penguin_sex_colors) + 
  theme(legend.position = "bottom")

This assumption of parallel lines across the subgroups in the data may or may not be reasonable. One can allow the lines to have different slopes within each subgroup using an interaction between the predictors (flipper length and sex here), but this adds complexity to the model which can lead to overfitting.

Categorical predictors

Species is another categorical variable in the penguins data. Using species as our predictor variable, it has \(\ell = 3\) levels in this data set.

Tip

How many dummy variables are used to represent the species variable?

Let’s consider modeling the penguin body masses using the species of the penguins.

The assumed multiple linear regression model using species as a predictor is:

\[y = \beta_0 + x_1\beta_1 + x_2\beta_2 + \varepsilon, \]

where \(y\) is a penguin’s body mass (g), \(x_1 = 1\) for a Chinstrap penguin and 0 otherwise, \(x_2=1\) for a Gentoo penguin and 0 otherwise, and \(\varepsilon\) is the independent and normally distributed error term.

Tip

Based on the model above, which species is the baseline / reference category? Why?

Code
# Set seed value for reproducible random number generation
set.seed(1994)

# Creating explicit dummy variables for species
penguins_dummy2 <- penguins |> 
  dplyr::mutate(species_chinstrap = ifelse(species == "Chinstrap", 1, 0),
         species_gentoo = ifelse(species == "Gentoo", 1, 0))

# Displaying data with dummy variables for species
penguins_dummy2 |> 
  dplyr::select(species, species_chinstrap, species_gentoo, body_mass_g, flipper_length_mm) |> 
  dplyr::sample_n(size = 5) |>
  gt()
Table 7: Five randomly selected rows from the Palmer Penguins data displaying dummy variables.
species species_chinstrap species_gentoo body_mass_g flipper_length_mm
Chinstrap 1 0 3775 198
Adelie 0 0 3550 184
Chinstrap 1 0 3525 188
Gentoo 0 1 4375 217
Gentoo 0 1 5700 230
Code
# Fitting linear model with species as predictor
species_model <- lm(body_mass_g ~ species, 
                   data = penguins)

beta_hats <- coefficients(species_model)
Code
# Printing MLR model output
species_model |> 
  tbl_regression(intercept = TRUE) |> 
  add_glance_source_note()
Table 8: Regression model output.
Characteristic Beta 95% CI p-value
(Intercept) 3,701 3,627, 3,775 <0.001
species


    Adelie
    Chinstrap 32 -100, 165 0.6
    Gentoo 1,375 1,265, 1,486 <0.001
Abbreviation: CI = Confidence Interval
R² = 0.670; Adjusted R² = 0.668; Sigma = 462; Statistic = 344; p-value = <0.001; df = 2; Log-likelihood = -2,582; AIC = 5,173; BIC = 5,188; Deviance = 72,443,483; Residual df = 339; No. Obs. = 342

Our estimated regression model using species as a predictor is:

\[\hat{y} = 3701 + 32x_1 + 1375x_2,\]

where \(\hat{y}\) is the estimated penguin body mass (g), \(x_1 = 1\) for a Chinstrap penguin and 0 otherwise, and \(x_2=1\) for a Gentoo penguin and 0 otherwise.

Code
# Group means for plotting line of best fit
body_mass_means <- penguins_dummy |> 
  dplyr::group_by(species) |> 
  dplyr::summarize(Mean = mean(body_mass_g, na.rm = TRUE))

# Plotting group data
penguins_dummy |> 
  ggplot(aes(x = as.integer(as.factor(species)), y = body_mass_g, 
             fill = species)) + 
  geom_point(pch = 21, color = "white") +
  labs(y = "Body mass (g)", x = "", title = "Palmer Penguins Data: Body Mass by species") +
   geom_hline(data = body_mass_means, 
              aes(yintercept = Mean, color = species)) +
  scale_fill_manual(values = penguin_species_colors) +
  scale_color_manual(values = penguin_species_colors) +
  scale_x_continuous(breaks = 0) +
  theme(legend.position = "bottom")

The theoretical regression model using species and flipper length as predictors is:

\[y = \beta_0 + x_1\beta_1 + x_2\beta_2 + x_3\beta_3 + \varepsilon, \]

where \(y\) is a penguin’s body mass (g), \(x_1 = 1\) for a Chinstrap penguin and 0 otherwise, \(x_2=1\) for a Gentoo penguin and 0 otherwise, \(x_3\) is a penguin’s flipper length in mm, and \(\varepsilon\) is the independent and normally distributed error term.

Code
# Fitting linear models
species_flipper_mlr <- lm(body_mass_g ~ species + flipper_length_mm, data = penguins)

beta_hats <- coefficients(species_flipper_mlr)

# Printing MLR model output
species_flipper_mlr |> 
  tbl_regression(intercept = TRUE) |> 
  add_glance_source_note()
Table 9: Regression model output.
Characteristic Beta 95% CI p-value
(Intercept) -4,031 -5,181, -2,882 <0.001
species


    Adelie
    Chinstrap -207 -320, -93 <0.001
    Gentoo 267 79, 454 0.005
flipper_length_mm 41 35, 47 <0.001
Abbreviation: CI = Confidence Interval
R² = 0.783; Adjusted R² = 0.781; Sigma = 376; Statistic = 406; p-value = <0.001; df = 3; Log-likelihood = -2,511; AIC = 5,032; BIC = 5,051; Deviance = 47,666,988; Residual df = 338; No. Obs. = 342

The estimated regression model using species and flipper length as predictors:

\[\hat{y} = -4031 -207x_1 + 267x_2 + 41x_3,\]

where \(\hat{y}\) is the estimated penguin body mass (g), \(x_1 = 1\) for a Chinstrap penguin and 0 otherwise, \(x_2=1\) for a Gentoo penguin and 0 otherwise, and \(x_3\) is a penguin’s flipper length in mm.

This three-predictor model implies separate parallel lines for each group as depicted below.

Code
# Making scatter plot with parallel lines for each group
species_segment_data <- penguins |> 
  dplyr::group_by(species) |> 
  dplyr::summarize(x = min(flipper_length_mm, na.rm = TRUE),
                   xend = max(flipper_length_mm, na.rm = TRUE)) |> 
  ungroup() |> 
  mutate(y = predict(species_flipper_mlr, newdata = data.frame(flipper_length_mm = x, species = species)), 
         yend = predict(species_flipper_mlr, newdata = data.frame(flipper_length_mm = xend, species = species)))

penguins |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, fill = species)) + 
geom_segment(data = species_segment_data, aes(x = x, xend = xend,
                                              y = y, yend = yend,
                                              color = species), 
                              linewidth = 1) +
  geom_point(pch = 21, color = "white") +
  labs(y = "Body mass (g)", x = "Flipper length (mm)", title = "Palmer Penguins Data: Body Mass by Flipper Length") +
  scale_fill_manual(values = penguin_species_colors) + 
  scale_color_manual(values = penguin_species_colors) + 
  theme(legend.position = "bottom")

Tip
  1. How do we interpret the estimated intercept?

  2. How do we interpret the estimated slopes?

  3. Confirm that the predicted body mass (g) of a Chinstrap penguin with a flipper length of 195mm is 3700 grams. Is this extrapolation?

Code
# Obtaining predicted value
predict(species_flipper_mlr, 
        newdata = tibble(species = "Chinstrap",
                         flipper_length_mm = 195))
Tip
  1. If a Chinstrap penguin with a flipper length of 195mm had a body mass of 3900 grams, what is its corresponding residual? What does this mean?

  2. What is the expected body mass (g) of a Chinstrap penguin with a flipper length of 225mm? Is this extrapolation?

Code
# Obtaining predicted value
predict(species_flipper_mlr, 
        newdata = tibble(species = "Chinstrap",
                         flipper_length_mm = 225))
Tip

Does it seem reasonable to constrain the regression lines to be parallel for each group?