Logistic Regression and Classification

DSA 220 - Introduction to Data Science and Analytics

Author

Andrew DiLernia

Learning Objectives

  • State the logistic regression model

  • Check model assumptions & assess model fit

  • Interpret model estimates

  • Obtain and interpret predicted values

  • Select a final model using proper criterion and metrics

Let’s begin by loading some R packages for this activity using the code below. Note: if it is the first time you are using an R package, you may need to install it first using the install.packages() function in the Console pane.

Code
# Load necessary packages
library(tidyverse)
library(ggthemes)
library(gt)
library(corrr)
library(ggfortify)
library(broom)
library(scales)
library(car)
library(GGally)
library(gtsummary)
library(glmnet)
library(glmnetUtils)

Set default theme settings for plots, and load a function to display variance inflation factor (VIF) values using the code below.

Code
# Set ggplot theme for visualizations
theme_set(ggthemes::theme_few())

# Load function for visualizing VIF & GVIF values
source("https://raw.githubusercontent.com/dilernia/DSA220/main/Functions/vif_plot.R")

# Load function for creating empirical logit plot
source("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Functions/empirical_logit_plot.R")

Logistic Regression

Binary logistic regression is a statistical method used to model the relationship between the probability of a binary outcome, \(y\), and \(k\) explanatory or predictor variables, denoted by \(x_1\), \(x_2\), \(\dots\), \(x_k\). In general, the predictor variables can be either quantitative or categorical variables. The logistic regression model is commonly used to estimate the probability or likelihood of an outcome occurring given observed values of select predictor variables.

Some example scenarios where a logistic regression would be useful include:

  1. Describing risk factors associated with someone contracting lung cancer within the next year given their smoking status, gender, age in years, body mass index (BMI), and familial history (e.g., whether or not a first-degree relative has had lung cancer). ⚕️🏥

  2. For credit-risk analysis, estimating the risk of a client defaulting on a loan given their credit score, annual income, and employment status. 💼 💸

  3. For marketing strategies, estimating the likelihood of a customer purchasing a given product given their age, gender, and purchase history. 🏷️💲

  4. In political analytics, forecasting the winner of a United States presidential race for a state using historical outcomes of presidential races, results of political polling, and demographic characteristics about a state.🔵 🔴

  5. In sports analytics, estimating the probability of a team winning a basketball game using the percent of games they have won and the percent of games their opponent has won thus far, how many points they have scored and how many points their opponent has scored on average each game, and other factors such as whether they are playing on their home court or not.🏀

Tip

What are the response and predictor variables for examples 1 through 5, and what are the types of each predictor variable (e.g., categorical or quantitative)?

Logistic Regression Model

A general statement of the logit form of the assumed theoretical logistic regression model is

\[\log \left( \frac{\pi}{1-\pi} \right) = \beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k, \]

where

  • \(\pi = \Pr(y = 1)\) is the probability of a success (\(y = 1\))

  • \(y\) is our binary response variable, taking values 0 or 1

  • \(\log(\cdot)\) is the natural logarithm function

  • \(x_j\) is the \(j^{th}\) predictor for \(j=1, 2, \dots, k\)

  • \(\frac{\pi}{1-\pi}\) is the odds of success

This is equivalent to stating that

\[\pi = \frac{\exp(\beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k)}{1 + \exp(\beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k)} = \frac{e^{\beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k}}{1 + e^{\beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k}}, \]

where \(\exp(x) = e^{x}\) is the exponential function, the inverse of the log function (i.e., \(\exp(\log(x)) = x\)), and \(e \approx 2.718\) is Euler’s number. Note that \(\pi\) in these equations refers to the true probability of success, not the irrational constant that is approximately 3.14. It is also common to express the logistic regression model using \(p\) instead of \(\pi\) to represent the probability that \(y=1\), though \(\pi\) is the more common choice. Also observe that the assumed logistic regression model does not have an error term like linear regression does.
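This equivalence between the logit form and the probability form can be checked numerically in base R, where plogis() converts log odds to a probability (the inverse of the logit). The coefficient and predictor values below are made up purely for illustration.

```r
# Made-up intercept, slope, and predictor value for illustration
beta_0 <- 2.1
beta_1 <- -0.04
x_1 <- 19

# Log odds of success from the logit form of the model
log_odds <- beta_0 + x_1 * beta_1

# Converting log odds to a probability in two equivalent ways
pi_manual <- exp(log_odds) / (1 + exp(log_odds))
pi_plogis <- plogis(log_odds)

# The two calculations agree
all.equal(pi_manual, pi_plogis)
```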

When providing a statement of the theoretical or assumed logistic regression model, the terms of the model are expressed using general notation, but the response variable, what is considered to be a “success”, and the predictor variables should be specified for the given context.

A statement of the estimated logistic regression equation is

\[\log \left( \frac{\hat{\pi}}{1-\hat{\pi}} \right) = \hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 +\dots + x_k \hat{\beta}_k, \]

where \(\hat{\pi}\) is the estimated or predicted probability of success (\(y = 1\)) and \(x_j\) is the \(j^{th}\) predictor for \(j=1, 2, \dots, k\).

As with the theoretical model, this estimated model has an alternative formulation as well:

\[\hat{\pi} = \frac{\exp(\hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 +\dots + x_k \hat{\beta}_k)}{1 + \exp(\hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 +\dots + x_k \hat{\beta}_k)} = \frac{e^{\hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 +\dots + x_k \hat{\beta}_k}}{1 + e^{\hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 +\dots + x_k \hat{\beta}_k}}, \]

which gives the estimated probability of success.

When providing a statement of the estimated logistic regression model, the \(\hat{\beta}\) terms should be replaced with the corresponding numerical estimates from the model output, and the response and predictor variables should be specified for the given context.

Logistic Regression Assumptions

Logistic regression requires a set of conditions or assumptions to be met to obtain reliable estimates and to conduct valid inference related to the model.

  • The response variable is a binary, random outcome.

  • Independent observations: observations should be independent of one another and have been randomly sampled from the population of interest.

  • Linearity: there is a linear relationship between each quantitative predictor and the log odds of success for the response variable.

  • No multicollinearity: predictor variables should not be highly correlated with one another. This is important for properly interpreting the effects of predictors, but multicollinearity does not necessarily hinder predictive accuracy.

The first two assumptions pertain to the nature of the data and how it was sampled. The linearity assumption is commonly checked using empirical logit plots for each predictor which display the relationship between the log odds of success and the values of a predictor variable.

Generalized Variance Inflation Factor

Multicollinearity can be assessed using the generalized variance inflation factor (GVIF) values which are a generalization of the variance inflation factor values1. As a rule of thumb, we will consider a predictor \(x_j\) to have multicollinearity issues if its adjusted GVIF value exceeds 5, where the adjusted GVIF is

\[\text{GVIF}_j^{\frac{1}{2\text{df}}},\]

where \(\text{df}\) is the degrees of freedom associated with \(x_j\).
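As a small sketch, the adjustment can be computed directly in R; the GVIF value below is made up for a categorical predictor with 2 degrees of freedom (e.g., a three-level factor).

```r
# Hypothetical GVIF for a categorical predictor with 2 degrees of freedom
gvif <- 1.4
df <- 2

# Adjusted GVIF: GVIF^(1 / (2 * df))
adjusted_gvif <- gvif^(1 / (2 * df))
adjusted_gvif  # about 1.09, well below the rule-of-thumb cutoff of 5
```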

Interpreting Model Estimates

Interpretations of coefficients of quantitative predictors for logistic regression are multiplicative, as opposed to the additive effects for linear regression.

General interpretation for \(\hat{\beta}_j\): For every 1 unit increase in \(x_j\), we expect the odds of a success (\(y = 1\)) to multiply by \(e^{\hat{\beta}_j}\), holding all other predictors constant.

General interpretation for \(\hat{\beta}_0\): The expected odds of success (\(y=1\)) given that \(x_1\) through \(x_k\) are all 0 is \(e^{\hat{\beta}_0}\).
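To illustrate the multiplicative interpretation, here is a quick sketch using a made-up slope estimate:

```r
# Hypothetical slope estimate for a quantitative predictor
beta_hat <- -0.04

# Multiplicative effect on the odds of success for a 1 unit increase
exp(beta_hat)  # about 0.96

# Since exp(beta_hat) < 1, the odds of success are expected to
# decrease by about 4% for every 1 unit increase in the predictor,
# holding all other predictors constant
```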

Note that when multicollinearity is present, the estimated slopes may be unreliable or inaccurate for describing relationships between the predictors and the response variable.

Titanic Data Set

To demonstrate the utility of the logistic regression model, we will analyze data on passengers of the RMS Titanic, the infamous British passenger liner that sank on its maiden voyage on April 15th, 1912, after colliding with an iceberg in the North Atlantic Ocean. This data set was obtained from Stanford University, and contains information on 887 different passengers.

The RMS Titanic departing Southampton on April 10th, 1912. Image obtained from Wikipedia.

Leonardo DiCaprio and Kate Winslet in the movie Titanic. Image from the New York Times.

Specifically, we will be exploring the relationship between different characteristics of people aboard the Titanic and their likelihood of survival using a logistic regression model. A data dictionary and a few randomly selected rows for the Titanic data set are included below.

Code
# Importing data from course GitHub page
titanic <- read_csv("https://raw.githubusercontent.com/dilernia/STA323/main/Data/titanic.csv")

# Creating data dictionary
data_dictionary <- tibble(Variable = colnames(titanic),
                         Description = c("Survival status (Yes or No)",
                                         "Sex (Male or Female)",
                                         "Passenger class (First, Second, or Third), with First Class typically having the most wealth and Third Class passengers typically being the least wealthy.",
                                         "Name",
                                         "Age (years)",
                                         "Fare (United States dollars, unadjusted for inflation)",
                                         "Number of siblings and spouses aboard the Titanic",
                                         "Number of parents and children aboard the Titanic"))

data_dictionary |> 
  gt()
Table 1: Data dictionary for Titanic data set.
Variable Description
survived Survival status (Yes or No)
sex Sex (Male or Female)
passenger_class Passenger class (First, Second, or Third), with First Class typically having the most wealth and Third Class passengers typically being the least wealthy.
name Name
age Age (years)
fare Fare (United States dollars, unadjusted for inflation)
n_siblings_spouses_aboard Number of siblings and spouses aboard the Titanic
n_parents_children_aboard Number of parents and children aboard the Titanic
Code
# Printing sample of 7 rows from data set
titanic |>  
  dplyr::sample_n(size = 7) |> 
  gt()
Table 2: Seven randomly selected people who were aboard the Titanic.
survived sex passenger_class name age fare n_siblings_spouses_aboard n_parents_children_aboard
Yes Male Third Master. Halim Gonios Moubarek 4 15.2458 1 1
No Male Second Mr. William John Robert Turpin 29 21.0000 1 0
No Female Third Miss. Marija Cacic 30 8.6625 0 0
No Male Third Mr. Joseph Peduzzi 24 8.0500 0 0
No Male Third Mr. Mile Smiljanic 37 8.6625 0 0
No Male Third Mr. John Fredrik Alexander Holm 43 6.4500 0 0
No Male Third Mr. Ingvald Olai Olsen Hagland 28 19.9667 1 0

Exploratory Data Analysis

Let’s start with some exploratory data analysis to familiarize ourselves with the data.

Code
# Creating side-by-side box plots of age by survival status
titanic |> 
  ggplot(aes(x = survived, y = age, fill = survived)) + 
  geom_boxplot() + 
  scale_fill_viridis_d() +
        labs(title = "Titanic passenger ages by survival status",
       x = "Survival status",
       y = "Age (years)",
       caption = "Data source: Stanford University") +
  theme(legend.position = "none")

Code
# Creating clustered bar chart of survival status by passenger class
titanic |>
  dplyr::count(survived, passenger_class, .drop = FALSE) |>
  dplyr::filter(!is.na(survived), !is.na(passenger_class)) |>
  ggplot(aes(x = passenger_class, y = n, fill = survived)) +
    geom_col(position = "dodge", color = "black") +
    scale_fill_viridis_d() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
    labs(title = "Survival status by passenger class",
         x = "Passenger class",
         y = "Number of passengers",
         fill = "Survived",
         caption = "Data source: Stanford University") +
    theme(legend.position = "bottom")

Code
# Creating clustered bar chart of survival status by sex
titanic |>
  dplyr::count(survived, sex, .drop = FALSE) |>
  dplyr::filter(!is.na(survived), !is.na(sex)) |>
  ggplot(aes(x = sex, y = n, fill = survived)) +
    geom_col(position = "dodge", color = "black") +
    scale_fill_viridis_d() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
    labs(title = "Survival status by sex",
         x = "Sex",
         y = "Number of passengers",
         fill = "Survived",
         caption = "Data source: Stanford University") +
    theme(legend.position = "bottom")

Code
# Creating faceted clustered bar chart of survival status by sex and passenger class
titanic |>
  dplyr::count(survived, sex, passenger_class, .drop = FALSE) |>
  dplyr::filter(!is.na(survived), !is.na(sex)) |>
  ggplot(aes(x = sex, y = n, fill = survived)) +
    geom_col(position = "dodge", color = "black") +
    scale_fill_viridis_d() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
    labs(title = "Survival status & sex by passenger class",
         x = "Sex",
         y = "Number of passengers",
         fill = "Survived",
         caption = "Data source: Stanford University") +
    facet_grid(. ~ passenger_class) +
    theme(legend.position = "bottom",
          strip.background.y = element_rect(linetype = "solid", color = "black"))

Code
# Calculating descriptive statistics stratified by survival status
age_stats <- titanic |> 
  group_by(survived) |> 
  summarize(
  N = n(),
  Minimum = min(age, na.rm = TRUE),
  Q1 = quantile(age, na.rm = TRUE, probs = 0.25),
  Median = median(age, na.rm = TRUE),
  Q3 = quantile(age, na.rm = TRUE, probs = 0.75),
  Maximum = max(age, na.rm = TRUE),
  Mean = mean(age, na.rm = TRUE),
  Range = Maximum - Minimum,
  SD = sd(age, na.rm = TRUE),
)

# Printing table of statistics stratified by survival status
age_stats |> 
    gt() |>
    fmt_number(columns = everything(), 
               decimals = 2,
               drop_trailing_zeros = TRUE)
Table 3: Age of passengers by survival status.
survived N Minimum Q1 Median Q3 Maximum Mean Range SD
No 545 1 21 28 39 74 30.14 73 13.9
Yes 342 0.42 19 28 36.75 80 28.41 79.58 14.43
Code
# Calculating descriptive statistics stratified by passenger class
class_stats <- titanic |> 
  group_by(passenger_class) |> 
  summarize(
  N = n(),
  Minimum = min(age, na.rm = TRUE),
  Q1 = quantile(age, na.rm = TRUE, probs = 0.25),
  Median = median(age, na.rm = TRUE),
  Q3 = quantile(age, na.rm = TRUE, probs = 0.75),
  Maximum = max(age, na.rm = TRUE),
  Mean = mean(age, na.rm = TRUE),
  Range = Maximum - Minimum,
  SD = sd(age, na.rm = TRUE),
)

# Printing table of statistics
class_stats |> 
    gt() |>
    fmt_number(columns = everything(), 
               decimals = 2,
               drop_trailing_zeros = TRUE)
Table 4: Age of passengers by class.
passenger_class N Minimum Q1 Median Q3 Maximum Mean Range SD
First 216 0.92 28.75 38.5 48 80 38.79 79.08 14.18
Second 184 0.67 23 29 36.62 70 29.87 69.33 13.76
Third 487 0.42 19 24 31 74 25.19 73.58 12.1
Code
# Creating faceted histograms
titanic |> 
  rename(Class = passenger_class,
         Survived = survived) |> 
  ggplot(aes(x = age,
             fill = Class)) +
  geom_histogram(color = "black") +
  scale_fill_manual(values = c("#009E73", "#56B4E9", "#D55E00")) +
  facet_grid(Class ~ Survived,
             labeller = label_both) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
    labs(x = "Passenger age (years)",
         y = "Number of passengers") +
  theme(legend.position = "none")

Tip
  1. Based on the output above, what can we say about the relationship between passenger class and survival status?

  2. Based on the output above, what can we say about the relationship between sex and survival status?

  3. Based on the output above, what can we say about the relationship between age and survival status?

Fitting a Model

First, let’s fit a multiple logistic regression model for estimating the probability of survival for passengers using their age (age) and passenger class (passenger_class).

Tip

Provide a statement of our assumed / theoretical multiple logistic regression model.

Below is output for the logistic regression model using survival status as our response (\(y = 1\) meaning the person survived) and the age in years and class of the person as predictors.

Code
# Convert survived to a factor prior to fitting model
titanic <- titanic |> 
  dplyr::mutate(survived = factor(survived, levels = c("No", "Yes")))

# Fitting logistic regression model
logistic_model <- glm(survived ~ age + passenger_class,
                      family = "binomial", data = titanic)
Code
# Printing model output
logistic_model |> 
  tbl_regression(intercept = TRUE) |> 
  add_vif() |> 
  add_glance_source_note()
Table 5: Logistic regression output using age and passenger class as predictors of survival.
Characteristic log(OR) 95% CI p-value GVIF Adjusted GVIF1
(Intercept) 2.1 1.5, 2.7 <0.001
age -0.04 -0.05, -0.03 <0.001 1.4 1.2
passenger_class (GVIF = 1.4, Adjusted GVIF = 1.1)
    First (reference level)
    Second -1.0 -1.5, -0.61 <0.001
    Third -2.3 -2.7, -1.9 <0.001
Abbreviations: CI = Confidence Interval, GVIF = Generalized Variance Inflation Factor, OR = Odds Ratio
Null deviance = 1,183; Null df = 886; Log-likelihood = -518; AIC = 1,045; BIC = 1,064; Deviance = 1,037; Residual df = 883; No. Obs. = 887
1 GVIF^[1/(2*df)]
Tip

Provide a statement of the estimated multiple logistic regression model.

Assessing Model Assumptions

Next let’s take a look at the empirical logit plots. Nonlinear patterns in the empirical logit plots for quantitative predictors can imply that functions of predictors would be useful in the model. Empirical logit plots for categorical predictors simply allow us to see subgroup differences in the log odds of success.

Code
# Creating empirical logit plot
empirical_logits <- empirical_logit_plot(logistic_model, nbins = 5)
Code
empirical_logits[[1]]

Code
empirical_logits[[2]]

Tip

Based on the output, state whether or not each assumption for the logistic regression model is met, providing specific evidence to support your conclusions from the output provided.

Interpreting model estimates

Let’s interpret the coefficient estimates for the logistic regression model.

As with linear regression, the estimated intercept, \(\hat{\beta}_0\), is only appropriate to interpret when all predictors simultaneously being 0 is within the range of the values of the observed data set. In general, obtaining predictions for predictor values outside the scope of the observed data set is considered extrapolation, and thus is inappropriate.

Tip
  1. Interpret the estimated slopes in context.

  2. Interpret the estimated intercept in context. Is this sensible in this context?

Tip
  1. What are the estimated odds of survival for a 19-year-old third class passenger?

  2. Is this considered extrapolation?

Code
# Estimated odds of survival for a 19 year old third-class and first-class passenger
prediction_data <- tibble(age = c(19, 19),
                          passenger_class = c("Third", "First"))

# Obtaining estimated log odds of survival
estimated_log_odds <- predict(logistic_model, newdata = prediction_data)

# Obtaining estimated odds of survival
estimated_odds <- exp(estimated_log_odds)

# Obtaining estimated probability of survival
estimated_probabilities <- predict(logistic_model, 
                                   newdata = prediction_data, type = "response")
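The predict() calls above can be double-checked by hand using the rounded coefficient estimates from Table 5; expect small discrepancies from the predict() results since the table estimates are rounded.

```r
# Rounded coefficient estimates from Table 5
b_intercept <- 2.1
b_age <- -0.04
b_third <- -2.3   # third class (first class is the reference level)

# Estimated log odds of survival for a 19-year-old third class passenger
log_odds <- b_intercept + 19 * b_age + b_third   # about -0.96

# Estimated odds and probability of survival
odds <- exp(log_odds)      # about 0.38
prob <- plogis(log_odds)   # about 0.28
```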
Tip
  1. What is the estimated probability of survival for a 19-year-old third class passenger?

  2. What are the estimated odds of survival for a 19-year-old first class passenger?

  3. What is the estimated probability of survival for a 19-year-old first class passenger?

Model Selection

When several predictors are available, we must decide which ones to include in our final model. As with linear regression, we want a model with predictor variables that effectively explain variation in the response variable but do not overfit the data, and we want a data-driven way to distinguish helpful predictors from unhelpful ones when selecting a final model.

Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are numerical measures that describe the goodness-of-fit of a model and are commonly used for model selection. AIC and BIC both penalize for a larger number of variables used in a model while simultaneously rewarding how well a model explains variability in the response variable. In general, models with lower AIC and lower BIC values are more desirable than models with higher values since we can expect them to have relatively better out-of-sample predictive performance.
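As a sketch of how these criteria are computed, AIC and BIC can be obtained by hand from a model's log-likelihood. The values below use the rounded log-likelihood from Table 5, so they differ slightly from the reported AIC of 1,045 and BIC of 1,064.

```r
# Rounded values from the age + passenger class model (Table 5)
log_lik <- -518   # log-likelihood
k <- 4            # number of estimated coefficients
n <- 887          # number of observations

# AIC penalizes each estimated coefficient by 2
aic <- -2 * log_lik + 2 * k        # 1044

# BIC penalizes each estimated coefficient by log(n),
# a harsher penalty than AIC whenever n > 7
bic <- -2 * log_lik + k * log(n)   # about 1063
```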

Another logistic regression model was fit using sex, age, and passenger class as predictors of survival, producing the output below.

Code
# Fitting full logistic regression model
full_model <- glm(survived ~ sex + age + passenger_class,
                  family = "binomial", data = titanic)
Code
# Printing model output
full_model |> 
  tbl_regression(intercept = TRUE) |> 
  add_vif() |> 
  add_glance_source_note()
Table 6: Logistic regression output using sex, age, and passenger class as predictors of survival.
Characteristic log(OR) 95% CI p-value GVIF Adjusted GVIF1
(Intercept) 3.6 2.9, 4.4 <0.001
sex (GVIF = 1.1, Adjusted GVIF = 1.0)
    Female (reference level)
    Male -2.6 -3.0, -2.2 <0.001
age -0.03 -0.05, -0.02 <0.001 1.4 1.2
passenger_class (GVIF = 1.4, Adjusted GVIF = 1.1)
    First (reference level)
    Second -1.2 -1.7, -0.69 <0.001
    Third -2.5 -3.0, -2.0 <0.001
Abbreviations: CI = Confidence Interval, GVIF = Generalized Variance Inflation Factor, OR = Odds Ratio
Null deviance = 1,183; Null df = 886; Log-likelihood = -401; AIC = 812; BIC = 836; Deviance = 802; Residual df = 882; No. Obs. = 887
1 GVIF^[1/(2*df)]
Tip

What is our decision regarding the inclusion of sex as a predictor in our model?

In the movie Titanic, Jack Dawson, played by Leonardo DiCaprio, is a 20-year-old male who is a third-class passenger, and Rose DeWitt-Bukater, played by Kate Winslet, is a 17-year-old female who is a first-class passenger.

Tip
  1. Based on our model, what is the estimated probability of survival for each of Jack and Rose?

  2. Based on these estimated probabilities, what are the estimated odds of survival for each of Jack and Rose?

Code
# Creating data frame for Jack and Rose
jack_rose_data <- tibble(sex = c("Male", "Female"),
                         age = c(20, 17),
                         passenger_class = c("Third", "First"))

# Obtaining estimated probability of survival
jack_rose_probs <- predict(full_model, 
                           newdata = jack_rose_data, type = "response")
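The estimated odds of survival asked about above can be recovered from an estimated probability using odds = p / (1 - p); the probability value below is made up for illustration.

```r
# Converting a made-up estimated probability of success to estimated odds
p_hat <- 0.25
odds_hat <- p_hat / (1 - p_hat)
odds_hat  # 1/3, i.e. 1 success expected for every 3 failures
```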

Footnotes

  1. Fox, J., & Monette, G. (1992). Generalized Collinearity Diagnostics. Journal of the American Statistical Association, 87(417), 178–183. https://doi.org/10.1080/01621459.1992.10475190↩︎