Logistic Regression and Classification
DSA 220 - Introduction to Data Science and Analytics
Learning Objectives
State the logistic regression model
Check model assumptions & assess model fit
Interpret model estimates
Obtain and interpret predicted values
Select a final model using appropriate criteria and metrics
Let’s begin by loading some R packages for this activity using the code below. Note: if it is the first time you are using an R package, you may need to install it first using the install.packages() function in the Console pane.
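For example, installing the tidyverse package would look like the code below (a one-time step per package, run in the Console); swap in whichever package name needs installing.
Code
# Install a package once before its first use (run in the Console, not in your document)
install.packages("tidyverse")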
Code
# Load necessary packages
library(tidyverse)
library(ggthemes)
library(gt)
library(corrr)
library(ggfortify)
library(broom)
library(scales)
library(car)
library(GGally)
library(gtsummary)
library(glmnet)
library(glmnetUtils)
Set default theme settings for plots, and load functions for displaying variance inflation factor (VIF) values and creating empirical logit plots using the code below.
Code
# Set ggplot theme for visualizations
theme_set(ggthemes::theme_few())
# Load function for visualizing VIF & GVIF values
source("https://raw.githubusercontent.com/dilernia/DSA220/main/Functions/vif_plot.R")
# Load function for creating empirical logit plot
source("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Functions/empirical_logit_plot.R")Logistic Regression
Binary logistic regression is a statistical method used to model the relationship between the probability of a binary outcome, \(y\), and \(k\) explanatory or predictor variables, denoted by \(x_1\), \(x_2\), \(\dots\), \(x_k\). In general, the predictor variables can be either quantitative or categorical variables. The logistic regression model is commonly used to estimate the probability or likelihood of an outcome occurring given observed values of select predictor variables.
Some example scenarios where a logistic regression would be useful include:
a. Describing risk factors associated with someone contracting lung cancer within the next year given their smoking status, gender, age in years, body mass index (BMI), and familial history (e.g., whether or not a first-degree relative has had lung cancer). ⚕️🏥
b. For credit-risk analysis, estimating the risk of a client defaulting on a loan given their credit score, annual income, and employment status. 💼 💸
c. For marketing strategies, estimating the likelihood of a customer purchasing a given product given their age, gender, and purchase history. 🏷️💲
d. In political analytics, forecasting the winner of a United States presidential race in a state using historical outcomes of presidential races, results of political polling, and demographic characteristics of the state. 🔵 🔴
e. In sports analytics, estimating the probability of a team winning a basketball game using the percentage of games that they and their opponent have won so far, the average number of points that each team has scored per game, and other factors such as whether they are playing on their home court. 🏀
What are the response and predictor variables for examples a. through d., and what is the type of each predictor variable (e.g., categorical or quantitative)?
Logistic Regression Model
A general statement of the logit form of the assumed theoretical logistic regression model is
\[\log \left( \frac{\pi}{1-\pi} \right) = \beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k, \]
where
\(\pi = \Pr(y = 1)\) is the probability of a success (\(y = 1\))
\(y\) is our binary response variable, taking the value 0 or 1
\(\log(\cdot)\) is the natural logarithm function
\(x_j\) is the \(j^{th}\) predictor for \(j=1, 2, \dots, k\),
\(\frac{\pi}{1-\pi}\) is the odds of success
This is equivalent to stating that
\[\pi = \frac{\exp(\beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k)}{1 + \exp(\beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k)} = \frac{e^{\beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k}}{1 + e^{\beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k}}, \]
where \(\exp(x) = e^{x}\) is the exponential function, which is the inverse of the log function, i.e., \(\exp(\log(x)) = x\), and \(e \approx 2.718\) is Euler’s number. Note that \(\pi\) in these equations refers to the true probability of success, not the irrational constant that is approximately 3.14. The logistic regression model is sometimes expressed using \(p\) instead of \(\pi\) to represent the probability that \(y=1\), but the use of \(\pi\) is more common. Also observe that, unlike the linear regression model, the assumed logistic regression model does not have an error term.
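To make this equivalence concrete, the short sketch below (a hypothetical illustration using only base R; the probability 0.75 is arbitrary) converts a probability to the odds and log odds scales and then back again using the inverse logit.
Code
# A hypothetical probability of success
pi_val <- 0.75

# Odds and log odds (the logit) of success
odds <- pi_val / (1 - pi_val)   # 3
log_odds <- log(odds)           # about 1.10
qlogis(pi_val)                  # same value, using base R's logit function

# Applying the inverse transformation recovers the original probability
exp(log_odds) / (1 + exp(log_odds))   # 0.75
plogis(log_odds)                      # 0.75, using base R's inverse logit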
When providing a statement of the theoretical or assumed logistic regression model, the terms of the model are expressed using general notation, but the response variable, what is considered to be a “success”, and the predictor variables should be specified for the given context.
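For example, previewing the Titanic passenger data analyzed later in this activity, a hypothetical model for whether a passenger survived (success: \(y = 1\)) using their age and an indicator for being male could be stated as
\[\log \left( \frac{\pi}{1-\pi} \right) = \beta_0 + \text{Age}\,\beta_1 + \text{Male}\,\beta_2, \]
where \(\pi = \Pr(\text{Survived} = 1)\), Age is a passenger’s age in years, and Male equals 1 for male passengers and 0 otherwise.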
A statement of the estimated logistic regression equation is
\[\log \left( \frac{\hat{\pi}}{1-\hat{\pi}} \right) = \hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 +\dots + x_k \hat{\beta}_k, \]
where \(\hat{\pi}\) is the estimated or predicted probability of success (\(y = 1\)) and \(x_j\) the \(j^{th}\) predictor for \(j=1, 2, \dots, k\).
As with the theoretical model, this estimated model has an alternative formulation as well:
\[\hat{\pi} = \frac{\exp(\hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 +\dots + x_k \hat{\beta}_k)}{1 + \exp(\hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 +\dots + x_k \hat{\beta}_k)} = \frac{e^{\hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 +\dots + x_k \hat{\beta}_k}}{1 + e^{\hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 +\dots + x_k \hat{\beta}_k}}, \]
which gives the estimated probability of success.
When providing a statement of the estimated logistic regression model, the \(\hat{\beta}\) terms should be replaced with corresponding numerical estimates from the model output, and the estimated response and predictor variables should be specified for the given context.
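A minimal sketch of obtaining such estimates in R is shown below, using the built-in mtcars data as a stand-in example (its response am is already coded 0/1). The glm() function with family = binomial fits the logistic regression, broom::tidy() displays the estimated coefficients, and predict() with type = "response" returns estimated probabilities of success.
Code
# Fit a logistic regression of transmission type (am: 1 = manual) on weight and horsepower
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Estimated coefficients (the beta-hats) on the log odds scale
broom::tidy(fit)

# Estimated probabilities of success (am = 1) for the first few observations
head(predict(fit, type = "response"))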
Logistic Regression Assumptions
Logistic regression requires a set of conditions or assumptions to be met to obtain reliable estimates and to conduct valid inference related to the model.
Binary response: the response variable is a binary and random outcome.
Independent observations: observations should be independent of one another and have been randomly sampled from the population of interest.
Linearity: there is a linear relationship between each quantitative predictor and the log odds of success for the response variable.
No multicollinearity: predictor variables should not be highly correlated with one another. This is important for properly interpreting the effects of predictors, but multicollinearity does not necessarily hinder predictive accuracy.
The first two assumptions pertain to the nature of the data and how it was sampled. The linearity assumption is commonly checked using empirical logit plots for each quantitative predictor, which display the relationship between the observed log odds of success and the values of the predictor.
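As a rough sketch of what an empirical logit plot displays (the empirical_logit_plot() function sourced earlier produces such plots directly; its arguments are not shown here), the code below bins a quantitative predictor, computes the observed log odds of success within each bin, and plots those values against the bin means, again using mtcars as a stand-in example.
Code
# Empirical logit sketch: bin a quantitative predictor, compute the observed
# log odds of success (am = 1) within each bin, and check for a roughly linear trend
mtcars |>
  mutate(wt_bin = ntile(wt, 5)) |>
  group_by(wt_bin) |>
  summarize(wt_mean = mean(wt),
            # add 0.5 to the counts to avoid taking log(0) in extreme bins
            emp_logit = log((sum(am) + 0.5) / (sum(am == 0) + 0.5))) |>
  ggplot(aes(x = wt_mean, y = emp_logit)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)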
Generalized Variance Inflation Factor
Multicollinearity can be assessed using generalized variance inflation factor (GVIF) values, which are a generalization of the variance inflation factor (VIF) values. As a rule of thumb, we will consider a predictor \(x_j\) to have multicollinearity issues if its adjusted GVIF value exceeds 5, where the adjusted GVIF is
\[\text{GVIF}_j^{\frac{1}{2\text{df}}},\]
where \(\text{df}\) is the degrees of freedom associated with \(x_j\).
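As a brief sketch of checking this in R, car::vif() (the car package is loaded above) returns the GVIF, its degrees of freedom, and the adjusted GVIF for each term of a fitted model whenever some term has more than one degree of freedom; the vif_plot() function sourced earlier is intended to display these values graphically. The mtcars example below is a stand-in, with cyl treated as categorical so that a multi-degree-of-freedom term appears.
Code
# Assess multicollinearity for a logistic regression fit (mtcars as a stand-in example)
fit_gvif <- glm(am ~ mpg + factor(cyl), data = mtcars, family = binomial)

# Columns: GVIF, Df, and the adjusted value GVIF^(1/(2*Df));
# per the rule of thumb above, adjusted values exceeding 5 suggest multicollinearity issues
car::vif(fit_gvif)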
Interpreting Model Estimates
Interpretations of coefficients of quantitative predictors for logistic regression are multiplicative as opposed to the additive effects for linear regression.
General interpretation for \(\hat{\beta}_j\): For every 1 unit increase in \(x_j\), we expect the odds of a success (\(y = 1\)) to multiply by \(e^{\hat{\beta}_j}\), holding all other predictors constant.
General interpretation for \(\hat{\beta}_0\): The expected odds of success (\(y=1\)) given that \(x_1\) through \(x_k\) are all 0 is \(e^{\hat{\beta}_0}\).
Note that when multicollinearity is present, the estimated slopes may be unreliable or inaccurate for describing relationships between the predictors and the response variable.
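Continuing the hypothetical mtcars sketch from above, the estimated coefficients can be placed on this multiplicative odds scale by exponentiating them; broom::tidy() can do this directly with exponentiate = TRUE.
Code
# Multiplicative (odds) interpretations: exp(beta-hat) is the factor by which the odds
# of success are expected to multiply for a 1-unit increase in a predictor,
# holding the other predictors constant
exp(coef(fit))

# Same estimates with confidence intervals, all on the odds scale
broom::tidy(fit, exponentiate = TRUE, conf.int = TRUE)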
Titanic Data Set
To demonstrate the utility of the logistic regression model, we will analyze data on passengers of the RMS Titanic, the infamous British passenger liner that sank on its maiden voyage on April 15th, 1912, after colliding with an iceberg in the North Atlantic Ocean. This data set was obtained from Stanford University, and contains information on 887 different passengers.