R Lab: Logistic Regression

DSA 220 - Introduction to Data Science and Analytics

Author

Andrew DiLernia

Once completed, submit a zipped file (.zip) via Blackboard containing two documents: your .qmd Quarto file and the corresponding rendered HTML document.

In this assignment we will be analyzing data on demographic characteristics of states and state-level results for United States presidential elections from 2008 to 2016.

For this data, we will be concerned with the variables year, state, population_millions, median_income_thousands, majority_female, and winner. A data dictionary for this data set is included below.

Table 1: Data dictionary for elections data set.

Variable	Description
year	Year of the presidential election
state	Name of state
population_millions	State population in millions of people
median_income_thousands	Median household income in thousands of dollars
majority_female	Binary variable indicating whether or not a majority of the state's population is female
winner	Binary variable giving the major party (Democrat or Republican) that won the state

1.

First, add a new code chunk containing the code below to load R packages for this lab.

# Load necessary packages
library(tidyverse)
library(ggthemes)
library(gt)
library(corrr)
library(ggfortify)
library(broom)
library(scales)
library(car)
library(gtsummary)

# Setting default ggplot2 theme
theme_set(ggthemes::theme_few())

# Load function for visualizing VIF & GVIF values
source("https://raw.githubusercontent.com/dilernia/DSA220/main/Functions/vif_plot.R")

# Load function for creating empirical logit plot
source("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Functions/empirical_logit_plot.R")

Then, import data on presidential elections in the United States in 2008, 2012, and 2016 using the code below.

# Importing presidential election data from course GitHub page
elections <- read_csv("https://raw.githubusercontent.com/dilernia/DSA220/main/Data/presidential_census.csv") |>  
  dplyr::mutate(median_income_thousands = median_income / 1000,
                population_millions = population / 1000000,
                winner = factor(winner, levels = c("Republican", "Democrat"))) |> 
  dplyr::select(year, state, population_millions, median_income_thousands, majority_female, winner)

2.

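Fit a logistic regression model with whether or not a state was majority female (majority_female) and median income (median_income_thousands) as the predictor variables and the party that won the state (winner) as the response variable. Because the levels of winner are ordered as Republican then Democrat, the model describes the log odds of a Democrat winning.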

Reproduce the model output exactly as displayed below.

Characteristic	log(OR)	95% CI	p-value	VIF
(Intercept)	-11	-15, -7.2	<0.001
majority_female				1.3
FALSE	—	—
TRUE	1.9	0.87, 3.2	<0.001
median_income_thousands	0.17	0.11, 0.24	<0.001	1.3
Abbreviations: CI = Confidence Interval, OR = Odds Ratio, VIF = Variance Inflation Factor
Null deviance = 208; Null df = 149; Log-likelihood = -77.3; AIC = 161; BIC = 170; Deviance = 155; Residual df = 147; No. Obs. = 150
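
As a rough illustration of the workflow behind a table like this one, the sketch below fits a logistic regression with glm() and, if the needed packages are installed, formats it with gtsummary. It uses simulated toy data (the toy data frame and its coefficients are made up for illustration), so its numbers will not match the output above; swapping in the elections data is left to you.

```r
# Toy data simulated for illustration only; this is NOT the elections data
set.seed(220)
n <- 200
toy <- data.frame(
  majority_female = sample(c(TRUE, FALSE), n, replace = TRUE),
  median_income_thousands = runif(n, min = 35, max = 80)
)

# Simulate the winner from an assumed (made-up) log-odds relationship
log_odds <- -6 + 1 * toy$majority_female + 0.1 * toy$median_income_thousands
toy$winner <- factor(
  ifelse(rbinom(n, size = 1, prob = plogis(log_odds)) == 1, "Democrat", "Republican"),
  levels = c("Republican", "Democrat")
)

# With family = binomial, glm() models the probability of the second
# factor level ("Democrat" here)
fit <- glm(winner ~ majority_female + median_income_thousands,
           data = toy, family = binomial)
print(summary(fit)$coefficients)

# Optional formatting step, run only if the required packages are available
needed <- c("gtsummary", "car", "broom", "broom.helpers")
if (all(sapply(needed, requireNamespace, quietly = TRUE))) {
  tbl <- gtsummary::tbl_regression(fit, intercept = TRUE)  # log(OR) scale by default
  tbl <- gtsummary::add_vif(tbl)                           # adds a VIF column
  tbl <- gtsummary::add_glance_source_note(tbl)            # adds model-level statistics
}
```

Leaving the exponentiate argument of tbl_regression() at its default keeps the estimates on the log odds ratio scale, which is the scale shown in the table above.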

3.

Obtain empirical logit plots for each predictor, specifying the nbins argument to be 4.
Is the linearity assumption met for this logistic regression model? Why or why not?
Is the assumption of no significant multicollinearity met for this model? Why or why not?
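
To make the idea behind an empirical logit plot concrete, here is a minimal base R sketch of the underlying calculation for one quantitative predictor. It uses simulated values (not the elections data), bins at quantiles, and adds 0.5 to the counts to avoid taking the log of zero; these are assumptions for illustration and the course's empirical_logit_plot() function may compute things differently.

```r
# Simulated predictor and binary response, for illustration only
set.seed(220)
n <- 200
income <- runif(n, min = 35, max = 80)
won <- rbinom(n, size = 1, prob = plogis(-10 + 0.17 * income))

# Split the predictor into 4 bins with roughly equal numbers of observations
nbins <- 4
breaks <- quantile(income, probs = seq(0, 1, length.out = nbins + 1))
bins <- cut(income, breaks = breaks, include.lowest = TRUE)

# Empirical logit per bin: log of (successes + 0.5) / (failures + 0.5)
successes <- tapply(won, bins, sum)
trials <- tapply(won, bins, length)
emp_logit <- log((successes + 0.5) / (trials - successes + 0.5))

# Plotting emp_logit against the bin means should look roughly linear
# if the linearity assumption is reasonable
bin_means <- tapply(income, bins, mean)
print(data.frame(bin_mean = bin_means, empirical_logit = emp_logit))
```

If the points for the bins fall close to a straight line, that supports the linearity assumption for that predictor.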

4.

Provide the value of the estimated slope for the majority_female indicator variable and interpret it in context. Be specific with regard to the names of variables whenever possible.
Provide the value of the estimated slope for median_income_thousands and interpret it in context. Be specific with regard to the names of variables whenever possible.
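
Slopes in a logistic regression are on the log odds scale, so exponentiating them gives odds ratios, which are often easier to interpret. The short sketch below does this using the rounded estimates from the output table in part 2 (the vector name log_or is just for illustration); values from your own fitted model will differ slightly because of rounding.

```r
# Rounded log odds ratios copied from the model output in part 2
log_or <- c(majority_femaleTRUE = 1.9, median_income_thousands = 0.17)

# Exponentiate to move from the log odds scale to the odds ratio scale
odds_ratios <- exp(log_or)
print(round(odds_ratios, 2))
```

With these rounded inputs, the odds ratios come out to about 6.7 for majority_female and about 1.19 for a one thousand dollar increase in median income, each holding the other predictor constant.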

5.

Calculate the probability of a Democrat winning a majority-female state with a median income of 50 thousand dollars.

Calculate the odds of a Democrat winning a majority-female state with a median income of 50 thousand dollars.
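
The arithmetic linking log odds, odds, and probability can be sketched as follows, using the rounded coefficients from the output table in part 2. Because the intercept in that table is rounded to two significant figures, the results here are only approximate and can differ noticeably from those based on the full-precision coefficients of your fitted model, so treat this as a check on the method rather than the final answer.

```r
# Rounded coefficients copied from the model output in part 2
b0 <- -11         # intercept
b_female <- 1.9   # majority_female = TRUE
b_income <- 0.17  # median_income_thousands

# Log odds for a majority-female state with a median income of 50 thousand dollars
log_odds <- b0 + b_female * 1 + b_income * 50

# Convert log odds to odds, then odds to a probability
odds <- exp(log_odds)
prob <- odds / (1 + odds)  # equivalent to plogis(log_odds)

print(c(log_odds = log_odds, odds = odds, probability = prob))
```

With a fitted model object, predict() with a newdata data frame and type = "response" returns the probability directly using the unrounded coefficients.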