Logistic Regression and Classification

DSA 220 - Introduction to Data Science and Analytics

Author

Andrew DiLernia

Learning Objectives

  • State the logistic regression model

  • Check model assumptions & assess model fit

  • Interpret model estimates

  • Obtain and interpret predicted values

  • Select a final model using proper criterion and metrics

Let’s begin by loading some R packages for this activity using the code below. Note: if it is the first time you are using an R package, you may need to install it first using the install.packages() function in the Console pane.

Code
# Load necessary packages
library(tidyverse)
library(ggthemes)
library(gt)
library(corrr)
library(ggfortify)
library(broom)
library(scales)
library(car)
library(GGally)
library(gtsummary)
library(glmnet)
library(glmnetUtils)

Set default theme settings for plots, and load a function to display variance inflation factor (VIF) values using the code below.

Code
# Set ggplot theme for visualizations
theme_set(ggthemes::theme_few())

# Load function for visualizing VIF & GVIF values
source("https://raw.githubusercontent.com/dilernia/DSA220/main/Functions/vif_plot.R")

# Load function for creating empirical logit plot
source("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Functions/empirical_logit_plot.R")

Logistic Regression

Binary logistic regression is a statistical method used to model the relationship between the probability of a binary outcome, \(y\), and \(k\) explanatory or predictor variables, denoted by \(x_1\), \(x_2\), \(\dots\), \(x_k\). In general, the predictor variables can be either quantitative or categorical variables. The logistic regression model is commonly used to estimate the probability or likelihood of an outcome occurring given observed values of select predictor variables.

Some example scenarios where a logistic regression would be useful include:

  1. Describing risk factors associated with someone contracting lung cancer within the next year given their smoking status, gender, age in years, body mass index (BMI), and familial history (e.g., whether or not a first-degree relative has had lung cancer). ⚕️🏥

  2. For credit-risk analysis, estimating the risk of a client defaulting on a loan given their credit score, annual income, and employment status. 💼 💸

  3. For marketing strategies, estimating the likelihood of a customer purchasing a given product given their age, gender, and purchase history. 🏷️💲

  4. In political analytics, forecasting the winner of a United States presidential race for a state using historical outcomes of presidential races, results of political polling, and demographic characteristics about a state.🔵 🔴

  5. In sports analytics, estimating the probability of a team winning a basketball game using the percent of games they have won and the percent of games their opponent has won thus far, how many points they have scored and how many points their opponent has scored on average each game, and other factors such as whether they are playing on their home court or not.🏀

Tip

What are the response and predictor variables for examples 1 through 5, and what are the types of each predictor variable (e.g., categorical or quantitative)?

Logistic Regression Model

A general statement of the logit form of the assumed theoretical logistic regression model is

\[\log \left( \frac{\pi}{1-\pi} \right) = \beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k, \]

where

  • \(\pi = \Pr(y = 1)\) is the probability of a success (\(y = 1\))

  • \(y\) is our binary response variable, taking values 0 or 1

  • \(\log(\cdot)\) is the natural logarithm function

  • \(x_j\) is the \(j^{th}\) predictor for \(j=1, 2, \dots, k\)

  • \(\frac{\pi}{1-\pi}\) is the odds of success

This is equivalent to stating that

\[\pi = \frac{\exp(\beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k)}{1 + \exp(\beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k)} = \frac{e^{\beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k}}{1 + e^{\beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k}}, \]

where \(\exp(x) = e^{x}\) is the exponential function, the inverse of the log function (i.e., \(\exp(\log(x)) = x\)), and \(e \approx 2.718\) is Euler’s number. Note that \(\pi\) in these equations refers to the true probability of success, not the irrational constant that is approximately 3.14. It is also common to express the logistic regression model using \(p\) instead of \(\pi\) to represent the probability that \(y=1\), though \(\pi\) is the more common choice. Also observe that the assumed logistic regression model does not have an error term like linear regression does.
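This equivalence between the logit form and the probability form can be checked numerically in base R, where plogis() converts log odds to a probability (the inverse of the logit). The coefficient and predictor values below are made up purely for illustration.

```r
# Made-up intercept, slope, and predictor value for illustration
beta_0 <- 2.1
beta_1 <- -0.04
x_1 <- 19

# Log odds of success from the logit form of the model
log_odds <- beta_0 + x_1 * beta_1

# Converting log odds to a probability in two equivalent ways
pi_manual <- exp(log_odds) / (1 + exp(log_odds))
pi_plogis <- plogis(log_odds)

# The two calculations agree
all.equal(pi_manual, pi_plogis)
```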

When providing a statement of the theoretical or assumed logistic regression model, the terms of the model are expressed using general notation, but the response variable, what is considered to be a “success”, and the predictor variables should be specified for the given context.

A statement of the estimated logistic regression equation is

\[\log \left( \frac{\hat{\pi}}{1-\hat{\pi}} \right) = \hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 +\dots + x_k \hat{\beta}_k, \]

where \(\hat{\pi}\) is the estimated or predicted probability of success (\(y = 1\)) and \(x_j\) is the \(j^{th}\) predictor for \(j=1, 2, \dots, k\).

As with the theoretical model, this estimated model has an alternative formulation as well:

\[\hat{\pi} = \frac{\exp(\hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 +\dots + x_k \hat{\beta}_k)}{1 + \exp(\hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 +\dots + x_k \hat{\beta}_k)} = \frac{e^{\hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 +\dots + x_k \hat{\beta}_k}}{1 + e^{\hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 +\dots + x_k \hat{\beta}_k}}, \]

which gives the estimated probability of success.

When providing a statement of the estimated logistic regression model, the \(\hat{\beta}\) terms should be replaced with the corresponding numerical estimates from the model output, and the response and predictor variables should be specified for the given context.

Logistic Regression Assumptions

Logistic regression requires a set of conditions or assumptions to be met to obtain reliable estimates and to conduct valid inference related to the model.

  • The response variable is a binary, random outcome.

  • Independent observations: observations should be independent of one another and have been randomly sampled from the population of interest.

  • Linearity: there is a linear relationship between each quantitative predictor and the log odds of success for the response variable.

  • No multicollinearity: predictor variables should not be highly correlated with one another. This is important for properly interpreting the effects of predictors, but multicollinearity does not necessarily hinder predictive accuracy.

The first two assumptions pertain to the nature of the data and how it was sampled. The linearity assumption is commonly checked using empirical logit plots for each predictor which display the relationship between the log odds of success and the values of a predictor variable.

Generalized Variance Inflation Factor

Multicollinearity can be assessed using the generalized variance inflation factor (GVIF) values which are a generalization of the variance inflation factor values1. As a rule of thumb, we will consider a predictor \(x_j\) to have multicollinearity issues if its adjusted GVIF value exceeds 5, where the adjusted GVIF is

\[\text{GVIF}_j^{\frac{1}{2\text{df}}},\]

where \(\text{df}\) is the degrees of freedom associated with \(x_j\).
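As a small sketch, the adjustment can be computed directly in R; the GVIF value below is made up for a categorical predictor with 2 degrees of freedom (e.g., a three-level factor).

```r
# Hypothetical GVIF for a categorical predictor with 2 degrees of freedom
gvif <- 1.4
df <- 2

# Adjusted GVIF: GVIF^(1 / (2 * df))
adjusted_gvif <- gvif^(1 / (2 * df))
adjusted_gvif  # about 1.09, well below the rule-of-thumb cutoff of 5
```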

Interpreting Model Estimates

Interpretations of coefficients of quantitative predictors for logistic regression are multiplicative, as opposed to the additive effects for linear regression.

General interpretation for \(\hat{\beta}_j\): For every 1 unit increase in \(x_j\), we expect the odds of a success (\(y = 1\)) to multiply by \(e^{\hat{\beta}_j}\), holding all other predictors constant.

General interpretation for \(\hat{\beta}_0\): The expected odds of success (\(y=1\)) given that \(x_1\) through \(x_k\) are all 0 is \(e^{\hat{\beta}_0}\).
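To illustrate the multiplicative interpretation, here is a quick sketch using a made-up slope estimate:

```r
# Hypothetical slope estimate for a quantitative predictor
beta_hat <- -0.04

# Multiplicative effect on the odds of success for a 1 unit increase
exp(beta_hat)  # about 0.96

# Since exp(beta_hat) < 1, the odds of success are expected to
# decrease by about 4% for every 1 unit increase in the predictor,
# holding all other predictors constant
```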

Note that when multicollinearity is present, the estimated slopes may be unreliable or inaccurate for describing relationships between the predictors and the response variable.

Titanic Data Set

To demonstrate the utility of the logistic regression model, we will analyze data on passengers of the RMS Titanic, the infamous British passenger liner that sank on its maiden voyage on April 15th, 1912, after colliding with an iceberg in the North Atlantic Ocean. This data set was obtained from Stanford University, and contains information on 887 different passengers.

The RMS Titanic departing Southampton on April 10th, 1912. Image obtained from Wikipedia.

Leonardo DiCaprio and Kate Winslet in the movie Titanic. Image from the New York Times.

Specifically, we will be exploring the relationship between different characteristics of people aboard the Titanic and their likelihood of survival using a logistic regression model. A data dictionary and a few randomly selected rows for the Titanic data set are included below.

Code
# Importing data from course GitHub page
titanic <- read_csv("https://raw.githubusercontent.com/dilernia/STA323/main/Data/titanic.csv")

# Creating data dictionary
data_dictionary <- tibble(Variable = colnames(titanic),
                         Description = c("Survival status (Yes or No)",
                                         "Sex (Male or Female)",
                                         "Passenger class (First, Second, or Third), with First Class typically having the most wealth and Third Class passengers typically being the least wealthy.",
                                         "Name",
                                         "Age (years)",
                                         "Fare (United States dollars, unadjusted for inflation)",
                                         "Number of siblings and spouses aboard the Titanic",
                                         "Number of parents and children aboard the Titanic"))

data_dictionary |> 
  gt()
Table 1: Data dictionary for Titanic data set.
Variable Description
survived Survival status (Yes or No)
sex Sex (Male or Female)
passenger_class Passenger class (First, Second, or Third), with First Class typically having the most wealth and Third Class passengers typically being the least wealthy.
name Name
age Age (years)
fare Fare (United States dollars, unadjusted for inflation)
n_siblings_spouses_aboard Number of siblings and spouses aboard the Titanic
n_parents_children_aboard Number of parents and children aboard the Titanic
Code
# Printing sample of 7 rows from data set
titanic |>  
  dplyr::sample_n(size = 7) |> 
  gt()
Table 2: Seven randomly selected people who were aboard the Titanic.
survived sex passenger_class name age fare n_siblings_spouses_aboard n_parents_children_aboard
Yes Male Third Master. Halim Gonios Moubarek 4 15.2458 1 1
No Male Second Mr. William John Robert Turpin 29 21.0000 1 0
No Female Third Miss. Marija Cacic 30 8.6625 0 0
No Male Third Mr. Joseph Peduzzi 24 8.0500 0 0
No Male Third Mr. Mile Smiljanic 37 8.6625 0 0
No Male Third Mr. John Fredrik Alexander Holm 43 6.4500 0 0
No Male Third Mr. Ingvald Olai Olsen Hagland 28 19.9667 1 0

Exploratory Data Analysis

Let’s start with some exploratory data analysis to familiarize ourselves with the data.

Code
# Creating side-by-side box plots of age by survival status
titanic |> 
  ggplot(aes(x = survived, y = age, fill = survived)) + 
  geom_boxplot() + 
  scale_fill_viridis_d() +
        labs(title = "Titanic passenger ages by survival status",
       x = "Survival status",
       y = "Age (years)",
       caption = "Data source: Stanford University") +
  theme(legend.position = "none")

Code
# Creating clustered bar chart of survival status by passenger class
titanic |>
  dplyr::count(survived, passenger_class, .drop = FALSE) |>
  dplyr::filter(!is.na(survived), !is.na(passenger_class)) |>
  ggplot(aes(x = passenger_class, y = n, fill = survived)) +
    geom_col(position = "dodge", color = "black") +
    scale_fill_viridis_d() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
    labs(title = "Survival status by passenger class",
         x = "Passenger class",
         y = "Number of passengers",
         fill = "Survived",
         caption = "Data source: Stanford University") +
    theme(legend.position = "bottom")

Code
# Creating clustered bar chart of survival status by sex
titanic |>
  dplyr::count(survived, sex, .drop = FALSE) |>
  dplyr::filter(!is.na(survived), !is.na(sex)) |>
  ggplot(aes(x = sex, y = n, fill = survived)) +
    geom_col(position = "dodge", color = "black") +
    scale_fill_viridis_d() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
    labs(title = "Survival status by sex",
         x = "Sex",
         y = "Number of passengers",
         fill = "Survived",
         caption = "Data source: Stanford University") +
    theme(legend.position = "bottom")

Code
# Creating faceted clustered bar chart of survival status by sex and passenger class
titanic |>
  dplyr::count(survived, sex, passenger_class, .drop = FALSE) |>
  dplyr::filter(!is.na(survived), !is.na(sex)) |>
  ggplot(aes(x = sex, y = n, fill = survived)) +
    geom_col(position = "dodge", color = "black") +
    scale_fill_viridis_d() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
    labs(title = "Survival status & sex by passenger class",
         x = "Sex",
         y = "Number of passengers",
         fill = "Survived",
         caption = "Data source: Stanford University") +
    facet_grid(. ~ passenger_class) +
    theme(legend.position = "bottom",
          strip.background.y = element_rect(linetype = "solid", color = "black"))

Code
# Calculating descriptive statistics stratified by survival status
age_stats <- titanic |> 
  group_by(survived) |> 
  summarize(
  N = n(),
  Minimum = min(age, na.rm = TRUE),
  Q1 = quantile(age, na.rm = TRUE, probs = 0.25),
  Median = median(age, na.rm = TRUE),
  Q3 = quantile(age, na.rm = TRUE, probs = 0.75),
  Maximum = max(age, na.rm = TRUE),
  Mean = mean(age, na.rm = TRUE),
  Range = Maximum - Minimum,
  SD = sd(age, na.rm = TRUE),
)

# Printing table of statistics stratified by survival status
age_stats |> 
    gt() |>
    fmt_number(columns = everything(), 
               decimals = 2,
               drop_trailing_zeros = TRUE)
Table 3: Age of passengers by survival status.
survived N Minimum Q1 Median Q3 Maximum Mean Range SD
No 545 1 21 28 39 74 30.14 73 13.9
Yes 342 0.42 19 28 36.75 80 28.41 79.58 14.43
Code
# Calculating descriptive statistics stratified by passenger class
class_stats <- titanic |> 
  group_by(passenger_class) |> 
  summarize(
  N = n(),
  Minimum = min(age, na.rm = TRUE),
  Q1 = quantile(age, na.rm = TRUE, probs = 0.25),
  Median = median(age, na.rm = TRUE),
  Q3 = quantile(age, na.rm = TRUE, probs = 0.75),
  Maximum = max(age, na.rm = TRUE),
  Mean = mean(age, na.rm = TRUE),
  Range = Maximum - Minimum,
  SD = sd(age, na.rm = TRUE),
)

# Printing table of statistics
class_stats |> 
    gt() |>
    fmt_number(columns = everything(), 
               decimals = 2,
               drop_trailing_zeros = TRUE)
Table 4: Age of passengers by class.
passenger_class N Minimum Q1 Median Q3 Maximum Mean Range SD
First 216 0.92 28.75 38.5 48 80 38.79 79.08 14.18
Second 184 0.67 23 29 36.62 70 29.87 69.33 13.76
Third 487 0.42 19 24 31 74 25.19 73.58 12.1
Code
# Creating faceted histograms
titanic |> 
  rename(Class = passenger_class,
         Survived = survived) |> 
  ggplot(aes(x = age,
             fill = Class)) +
  geom_histogram(color = "black") +
  scale_fill_manual(values = c("#009E73", "#56B4E9", "#D55E00")) +
  facet_grid(Class ~ Survived,
             labeller = label_both) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
    labs(x = "Passenger age (years)",
         y = "Number of passengers") +
  theme(legend.position = "none")

Tip
  1. Based on the output above, what can we say about the relationship between passenger class and survival status?

  2. Based on the output above, what can we say about the relationship between sex and survival status?

  3. Based on the output above, what can we say about the relationship between age and survival status?

Fitting a Model

First, let’s fit a multiple logistic regression model for estimating the probability of survival for passengers using their age (age) and passenger class (passenger_class).

Tip

Provide a statement of our assumed / theoretical multiple logistic regression model.

Below is output for the logistic regression model using survival status as our response (\(y = 1\) meaning the person survived) and the age in years and class of the person as predictors.

Code
# Convert survived to a factor prior to fitting model
titanic <- titanic |> 
  dplyr::mutate(survived = factor(survived, levels = c("No", "Yes")))

# Fitting logistic regression model
logistic_model <- glm(survived ~ age + passenger_class,
                      family = "binomial", data = titanic)
Code
# Printing model output
logistic_model |> 
  tbl_regression(intercept = TRUE) |> 
  add_vif() |> 
  add_glance_source_note()
Table 5: Logistic regression output using age and passenger class as predictors of survival.
Characteristic log(OR) 95% CI p-value GVIF Adjusted GVIF1
(Intercept) 2.1 1.5, 2.7 <0.001
age -0.04 -0.05, -0.03 <0.001 1.4 1.2
passenger_class (GVIF = 1.4, Adjusted GVIF = 1.1)
    First (reference level)
    Second -1.0 -1.5, -0.61 <0.001
    Third -2.3 -2.7, -1.9 <0.001
Abbreviations: CI = Confidence Interval, GVIF = Generalized Variance Inflation Factor, OR = Odds Ratio
Null deviance = 1,183; Null df = 886; Log-likelihood = -518; AIC = 1,045; BIC = 1,064; Deviance = 1,037; Residual df = 883; No. Obs. = 887
1 GVIF^[1/(2*df)]
Tip

Provide a statement of the estimated multiple logistic regression model.

Assessing Model Assumptions

Next let’s take a look at the empirical logit plots. Nonlinear patterns in the empirical logit plots for quantitative predictors can imply that functions of predictors would be useful in the model. Empirical logit plots for categorical predictors simply allow us to see subgroup differences in the log odds of success.

Code
# Creating empirical logit plot
empirical_logits <- empirical_logit_plot(logistic_model, nbins = 5)
Code
empirical_logits[[1]]

Code
empirical_logits[[2]]

Tip

Based on the output, state whether or not each assumption for the logistic regression model is met, providing specific evidence to support your conclusions from the output provided.

Interpreting model estimates

Let’s interpret the coefficient estimates for the logistic regression model.

As with linear regression, the estimated intercept, \(\hat{\beta}_0\), is only appropriate to interpret when all predictors simultaneously being 0 is within the range of the values of the observed data set. In general, obtaining predictions for predictor values outside the scope of the observed data set is considered extrapolation, and thus is inappropriate.

Tip
  1. Interpret the estimated slopes in context.

  2. Interpret the estimated intercept in context. Is this sensible in this context?

Tip
  1. What are the estimated odds of survival for a 19-year-old third class passenger?

  2. Is this considered extrapolation?

Code
# Estimated odds of survival for a 19 year old third-class and first-class passenger
prediction_data <- tibble(age = c(19, 19),
                          passenger_class = c("Third", "First"))

# Obtaining estimated log odds of survival
estimated_log_odds <- predict(logistic_model, newdata = prediction_data)

# Obtaining estimated odds of survival
estimated_odds <- exp(estimated_log_odds)

# Obtaining estimated probability of survival
estimated_probabilities <- predict(logistic_model, 
                                   newdata = prediction_data, type = "response")
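The predict() calls above can be double-checked by hand using the rounded coefficient estimates from Table 5; expect small discrepancies from the predict() results since the table estimates are rounded.

```r
# Rounded coefficient estimates from Table 5
b_intercept <- 2.1
b_age <- -0.04
b_third <- -2.3   # third class (first class is the reference level)

# Estimated log odds of survival for a 19-year-old third class passenger
log_odds <- b_intercept + 19 * b_age + b_third   # about -0.96

# Estimated odds and probability of survival
odds <- exp(log_odds)      # about 0.38
prob <- plogis(log_odds)   # about 0.28
```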
Tip
  1. What is the estimated probability of survival for a 19-year-old third class passenger?

  2. What are the estimated odds of survival for a 19-year-old first class passenger?

  3. What is the estimated probability of survival for a 19-year-old first class passenger?

Model Selection

When several predictors are available, we must decide which ones to include in our final model. As with linear regression, we want a model with predictor variables that effectively explain variation in the response variable but do not overfit the data, and we want a data-driven way to distinguish helpful predictors from unhelpful ones when selecting a final model.

Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are numerical measures that describe the goodness-of-fit of a model and are commonly used for model selection. AIC and BIC both penalize for a larger number of variables used in a model while simultaneously rewarding how well a model explains variability in the response variable. In general, models with lower AIC and lower BIC values are more desirable than models with higher values since we can expect them to have relatively better out-of-sample predictive performance.
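As a sketch of how these criteria are computed, AIC and BIC can be obtained by hand from a model's log-likelihood. The values below use the rounded log-likelihood from Table 5, so they differ slightly from the reported AIC of 1,045 and BIC of 1,064.

```r
# Rounded values from the age + passenger class model (Table 5)
log_lik <- -518   # log-likelihood
k <- 4            # number of estimated coefficients
n <- 887          # number of observations

# AIC penalizes each estimated coefficient by 2
aic <- -2 * log_lik + 2 * k        # 1044

# BIC penalizes each estimated coefficient by log(n),
# a harsher penalty than AIC whenever n > 7
bic <- -2 * log_lik + k * log(n)   # about 1063
```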

Another logistic regression model was fit using sex, age, and passenger class as predictors of survival, producing the output below.

Code
# Fitting full logistic regression model
full_model <- glm(survived ~ sex + age + passenger_class,
                  family = "binomial", data = titanic)
Code
# Printing model output
full_model |> 
  tbl_regression(intercept = TRUE) |> 
  add_vif() |> 
  add_glance_source_note()
Table 6: Logistic regression output using sex, age, and passenger class as predictors of survival.
Characteristic log(OR) 95% CI p-value GVIF Adjusted GVIF1
(Intercept) 3.6 2.9, 4.4 <0.001
sex (GVIF = 1.1, Adjusted GVIF = 1.0)
    Female (reference level)
    Male -2.6 -3.0, -2.2 <0.001
age -0.03 -0.05, -0.02 <0.001 1.4 1.2
passenger_class (GVIF = 1.4, Adjusted GVIF = 1.1)
    First (reference level)
    Second -1.2 -1.7, -0.69 <0.001
    Third -2.5 -3.0, -2.0 <0.001
Abbreviations: CI = Confidence Interval, GVIF = Generalized Variance Inflation Factor, OR = Odds Ratio
Null deviance = 1,183; Null df = 886; Log-likelihood = -401; AIC = 812; BIC = 836; Deviance = 802; Residual df = 882; No. Obs. = 887
1 GVIF^[1/(2*df)]
Tip

What is our decision regarding the inclusion of sex as a predictor in our model?

In the movie Titanic, Jack Dawson, played by Leonardo DiCaprio, is a 20-year-old male who is a third-class passenger, and Rose DeWitt-Bukater, played by Kate Winslet, is a 17-year-old female who is a first-class passenger.

Tip
  1. Based on our model, what is the estimated probability of survival for each of Jack and Rose?

  2. Based on these estimated probabilities, what are the estimated odds of survival for each of Jack and Rose?

Code
# Creating data frame for Jack and Rose
jack_rose_data <- tibble(sex = c("Male", "Female"),
                         age = c(20, 17),
                         passenger_class = c("Third", "First"))

# Obtaining estimated probability of survival
jack_rose_probs <- predict(full_model, 
                           newdata = jack_rose_data, type = "response")
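The estimated odds of survival asked about above can be recovered from an estimated probability using odds = p / (1 - p); the probability value below is made up for illustration.

```r
# Converting a made-up estimated probability of success to estimated odds
p_hat <- 0.25
odds_hat <- p_hat / (1 - p_hat)
odds_hat  # 1/3, i.e. 1 success expected for every 3 failures
```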

Footnotes

  1. Fox, J., & Monette, G. (1992). Generalized Collinearity Diagnostics. Journal of the American Statistical Association, 87(417), 178–183. https://doi.org/10.1080/01621459.1992.10475190↩︎