R Lab: Logistic Regression
DSA 220 - Introduction to Data Science and Analytics

Once completed, submit a zipped file (.zip) via Blackboard containing two documents: a .qmd Quarto file and the corresponding HTML document.

In this assignment we will analyze data on demographic characteristics of states and state-level results for United States presidential elections from 2008 to 2016.

For this data, we will be concerned with the variables year, state, population_millions, median_income_thousands, majority_female, and winner. A data dictionary for this data set is included below.

| Variable | Description |
|---|---|
| year | Year of the presidential election |
| state | Name of state |
| population_millions | State population in millions of people |
| median_income_thousands | Median household income in thousands of dollars |
| majority_female | Binary variable indicating whether or not a state's population is majority female |
| winner | Binary variable giving the major party (Democrat or Republican) that won the state |
1.
First, add a new code chunk containing the code below to load R packages for this lab.
# Load necessary packages
library(tidyverse)
library(ggthemes)
library(gt)
library(corrr)
library(ggfortify)
library(broom)
library(scales)
library(car)
library(gtsummary)
# Setting default ggplot2 theme
theme_set(ggthemes::theme_few())
# Load function for visualizing VIF & GVIF values
source("https://raw.githubusercontent.com/dilernia/DSA220/main/Functions/vif_plot.R")
# Load function for creating empirical logit plot
source("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Functions/empirical_logit_plot.R")

Then, import data on presidential elections in the United States in 2008, 2012, and 2016 using the code below.
# Importing presidential election data from course GitHub page
elections <- read_csv("https://raw.githubusercontent.com/dilernia/DSA220/main/Data/presidential_census.csv") |>
  dplyr::mutate(median_income_thousands = median_income / 1000,
                population_millions = population / 1000000,
                winner = factor(winner, levels = c("Republican", "Democrat"))) |>
  dplyr::select(year, state, population_millions, median_income_thousands, majority_female, winner)

2.
- Fit a logistic regression model with whether or not a state was majority female (majority_female) and median income (median_income_thousands) as the predictor variables and whether a Democrat or Republican won as the response variable.
- Reproduce the model output exactly as displayed below.
| Characteristic | log(OR) | 95% CI | p-value | VIF |
|---|---|---|---|---|
| (Intercept) | -11 | -15, -7.2 | <0.001 | |
| majority_female | | | | 1.3 |
| FALSE | — | — | | |
| TRUE | 1.9 | 0.87, 3.2 | <0.001 | |
| median_income_thousands | 0.17 | 0.11, 0.24 | <0.001 | 1.3 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio, VIF = Variance Inflation Factor

Null deviance = 208; Null df = 149; Log-likelihood = -77.3; AIC = 161; BIC = 170; Deviance = 155; Residual df = 147; No. Obs. = 150
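
The import code sets the levels of winner to c("Republican", "Democrat"). With a factor response and family = binomial, glm() treats the first level as the "failure" and models the log-odds of the remaining level, so the log(OR) values above describe the odds of a Democrat win. The sketch below illustrates that convention on the built-in mtcars data rather than the elections data; the names toy, transmission, and toy_fit are hypothetical and chosen only for this illustration.

```r
# Toy data only: the built-in mtcars data set, not the lab's elections data
toy <- transform(
  mtcars,
  transmission = factor(ifelse(am == 1, "manual", "automatic"),
                        levels = c("automatic", "manual"))
)

# With a factor response, glm() treats the first level ("automatic") as failure,
# so this models the log-odds of the second level ("manual")
toy_fit <- glm(transmission ~ wt, data = toy, family = binomial)

coef(toy_fit)       # estimates on the log-odds scale
exp(coef(toy_fit))  # the same estimates expressed as odds ratios
```

Reversing the order of the levels would flip the sign of every estimate, which is why the level order is fixed before the model is fit.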
3.
- Obtain empirical logit plots for each predictor, specifying the nbins argument to be 4.
- Is the linearity assumption met for this logistic regression model? Why or why not?
- Is the assumption of no significant multicollinearity met for this model? Why or why not?
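
For intuition about what an empirical logit plot summarizes, the sketch below bins a quantitative predictor into 4 groups and computes an adjusted log-odds of the event within each bin. It uses the built-in mtcars data and base R only. It is not the course's empirical_logit_plot() function, whose internals may differ, and the 0.5 adjustment is one common convention assumed here.

```r
# Toy data only: built-in mtcars, not the lab's elections data
nbins <- 4

# Split the predictor into nbins groups of roughly equal size
breaks <- quantile(mtcars$wt, probs = seq(0, 1, length.out = nbins + 1))
bin <- cut(mtcars$wt, breaks = breaks, include.lowest = TRUE)

# Number of events (am == 1) and number of observations in each bin
events <- tapply(mtcars$am, bin, sum)
totals <- tapply(mtcars$am, bin, length)

# Empirical logit, with 0.5 added so bins with 0% or 100% events stay finite
emp_logit <- log((events + 0.5) / (totals - events + 0.5))
bin_mean <- tapply(mtcars$wt, bin, mean)

# A roughly straight-line pattern is consistent with the linearity assumption
plot(bin_mean, emp_logit, xlab = "Mean wt in bin", ylab = "Empirical logit")
```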
4.
- Provide the value of and interpret the estimated slope for the indicator variable for majority_female in context. Be specific with regard to the names of variables whenever possible.
- Provide the value of and interpret the estimated slope for median_income_thousands in context. Be specific with regard to the names of variables whenever possible.
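
As a reminder of the scale on which these slopes live, the model can be written as follows. The notation is an assumption introduced for this note: $p$ is the probability that a Democrat wins the state, $x_1$ equals 1 when majority_female is TRUE and 0 otherwise, and $x_2$ is median_income_thousands.

$$
\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2
$$

Holding the other predictor fixed, a one-unit increase in $x_j$ changes the log-odds by $\beta_j$, which multiplies the odds by $e^{\beta_j}$:

$$
\frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} = e^{\beta_j}
$$

An interpretation can therefore be stated on the log-odds scale using $\beta_j$ or on the odds scale using $e^{\beta_j}$.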
5.
- Calculate the probability of a Democrat winning a majority-female state with a median household income of 50 thousand dollars.
- Calculate the odds of a Democrat winning a majority-female state with a median household income of 50 thousand dollars.
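
Both quantities come from the same linear predictor. In the notation assumed here, $\hat{\beta}_0$ is the estimated intercept, $\hat{\beta}_1$ the estimated slope for majority_female being TRUE, and $\hat{\beta}_2$ the estimated slope for median_income_thousands. For a majority-female state with a median income of 50 thousand dollars:

$$
\hat{\eta} = \hat{\beta}_0 + \hat{\beta}_1 (1) + \hat{\beta}_2 (50)
$$

$$
\widehat{\text{odds}} = e^{\hat{\eta}}, \qquad \hat{p} = \frac{\widehat{\text{odds}}}{1 + \widehat{\text{odds}}} = \frac{1}{1 + e^{-\hat{\eta}}}
$$

The estimates in the displayed model output are rounded, so using the unrounded coefficients from the fitted model object will likely give more accurate values.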