R Lab: Simple Linear Regression

DSA 220 - Introduction to Data Science and Analytics

Author

Andrew DiLernia

Once completed, submit a zipped file (.zip) via Blackboard containing two documents: the .qmd Quarto file and the corresponding rendered HTML document.

1.

First, add a new code chunk containing the code below to load R packages for this lab.

# Load necessary packages
library(tidyverse)
library(gt)
library(corrr)
library(ggthemes)
library(ggfortify)
library(broom)

# Setting default ggplot2 theme
theme_set(ggthemes::theme_few())

For this assignment, we will analyze state-level estimates from the United States American Community Survey (ACS) for the years 2008 through 2023. The ACS is based on an annual sample of approximately 3.5 million addresses and includes data on population, income, poverty rates, and demographic characteristics.

Load the state-level ACS data for 2021 into R using the code below.

# Importing state-level U.S. ACS data for 2021
census_data <- read_csv("https://raw.githubusercontent.com/dilernia/STA418-518/main/Data/census_data_2008-2021.csv") |>
  # Keep 2021 rows whose county_state value contains no comma
  dplyr::filter(!stringr::str_detect(county_state, pattern = ","), year == 2021) |>
  # Express poverty as a percent and median income in thousands of dollars
  dplyr::mutate(perc_poverty = 100 * prop_poverty,
                median_income_thousand = median_income / 1000)

2.

Reproduce the scatter plot below, including a straight line of best fit, showing the percent of people in poverty (perc_poverty) by the median income in thousands of dollars (median_income_thousand) for each state.
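As a starting point, a scatter plot with a least-squares line can be sketched as follows. This is only an illustration on R's built-in mtcars data, not the ACS data; the object name scatter_plot and the axis labels are made up for the example.

```r
library(ggplot2)

# Illustrative data only: the built-in mtcars data set, not the ACS data
scatter_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # method = "lm" draws the least-squares line; se = FALSE hides the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Fuel efficiency by vehicle weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon")

scatter_plot
```

For the lab, the same layers would presumably be applied to census_data with the variables named above.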

3.

Fit a simple linear regression model with median income in thousands of dollars as the predictor variable and percent poverty as the response variable.
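A minimal sketch of fitting a simple linear regression with lm(), again using the built-in mtcars data purely for illustration. The formula places the response on the left of ~ and the predictor on the right; example_model is a hypothetical name.

```r
# Illustrative data only: mtcars, with mpg as the response and wt as the predictor
example_model <- lm(mpg ~ wt, data = mtcars)

# Estimated intercept and slope
coef(example_model)
```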

4.

Reproduce the model output displayed below. Hint: pay careful attention to potential rounding in the table.

term	estimate	std.error	statistic	p.value
(Intercept)	32.03	2.30	13.90	8.257394e-19
median_income_thousand	−0.28	0.03	−8.34	4.984794e-11
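One possible way to build a coefficient table like this is broom::tidy() followed by gt, rounding only some of the columns. The sketch below uses mtcars for illustration, and the names example_model, coef_table, and coef_gt are hypothetical.

```r
library(broom)
library(gt)

# Illustrative model only: mtcars, not the ACS data
example_model <- lm(mpg ~ wt, data = mtcars)

# tidy() returns one row per coefficient with columns
# term, estimate, std.error, statistic, and p.value
coef_table <- broom::tidy(example_model)

# Round selected columns to 2 decimals, leaving p.value as is
coef_gt <- coef_table |>
  gt() |>
  fmt_number(columns = c(estimate, std.error, statistic), decimals = 2)

coef_gt
```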

5.

Provide a statement of the estimated regression equation based on the model output.
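As a reminder of the general form, an estimated simple linear regression equation can be written as below. The symbols are notation assumed for this sketch: $\hat{y}$ is the predicted response, $x$ is the predictor, and $b_0$ and $b_1$ are the estimated intercept and slope taken from the estimate column of the model output.

```latex
\hat{y} = b_0 + b_1 x
```

In a Quarto document, the equation can be typeset by placing it between double dollar signs.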

6.

Provide the value of and interpret the estimated slope in context.

7.

Provide the value of and interpret the estimated intercept in context. Is this appropriate in this context? Why or why not?

8.

Create and display diagnostic plots for the simple linear regression model, including the Residuals vs Fitted plot, the Normal Q-Q plot, a histogram of the residuals, the Cook’s distance plot, and the Residuals vs Leverage plot.
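A sketch of one way to produce these plots, assuming the ggfortify autoplot() method for lm objects and broom::augment() for the residuals. The model is fit to mtcars for illustration only; example_model and model_diagnostics are hypothetical names, and the number of histogram bins is an arbitrary choice.

```r
library(ggplot2)
library(ggfortify)
library(broom)

# Illustrative model only: mtcars, not the ACS data
example_model <- lm(mpg ~ wt, data = mtcars)

# which = 1: Residuals vs Fitted, 2: Normal Q-Q,
# 4: Cook's distance, 5: Residuals vs Leverage
autoplot(example_model, which = c(1, 2, 4, 5), ncol = 2)

# augment() adds per-observation columns such as .resid, .cooksd, and .std.resid
model_diagnostics <- broom::augment(example_model)

# Histogram of the residuals
ggplot(model_diagnostics, aes(x = .resid)) +
  geom_histogram(bins = 10, color = "white") +
  labs(x = "Residual", y = "Count")
```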

9.

State each assumption for fitting a simple linear regression model and check whether it is met, providing specific evidence from the R output obtained to support your conclusions.

10.

Are there any influential points? If so, how many are there? Cite specific evidence to support your conclusion.
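Cook's distance is one common numerical measure of influence. The sketch below computes it for an illustrative mtcars model and flags observations under two widely used rules of thumb (greater than 1, or greater than 4/n). These cutoffs are conventions assumed here, not thresholds stated in this lab, so use whichever criterion was covered in class.

```r
# Illustrative model only: mtcars, not the ACS data
example_model <- lm(mpg ~ wt, data = mtcars)

# Cook's distance for each observation
cooks_d <- cooks.distance(example_model)
n_obs <- nobs(example_model)

# Two common rules of thumb (assumed conventions, not requirements of this lab)
flag_over_1 <- cooks_d[cooks_d > 1]
flag_over_4_n <- cooks_d[cooks_d > 4 / n_obs]

# Observations exceeding the more lenient 4/n cutoff, largest first
sort(flag_over_4_n, decreasing = TRUE)
```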

11.

Are there any outliers? If so, how many are there? Cite specific evidence to support your conclusion.
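Standardized residuals offer one numerical check for outliers. The sketch below uses an illustrative mtcars model and a cutoff of 3 in absolute value. That cutoff is an assumption for this example (2 is also commonly used), so defer to the criterion given in class.

```r
# Illustrative model only: mtcars, not the ACS data
example_model <- lm(mpg ~ wt, data = mtcars)

# Standardized residuals, one per observation
std_resid <- rstandard(example_model)

# Assumed cutoff for illustration: |standardized residual| > 3
outlier_cutoff <- 3
possible_outliers <- std_resid[abs(std_resid) > outlier_cutoff]

# Number of observations flagged under this cutoff
length(possible_outliers)
```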

12.

Regardless of whether the assumptions are met, formally state the hypotheses for testing whether the regression slope is 0.

13.

Provide the test statistic, p-value, and decision for this hypothesis test.
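To see where these quantities come from, the sketch below recomputes the slope's t statistic and two-sided p-value by hand for an illustrative mtcars model and compares them with the values reported by summary(). The names example_model, slope_row, t_stat, and p_value are hypothetical.

```r
# Illustrative model only: mtcars, not the ACS data
example_model <- lm(mpg ~ wt, data = mtcars)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_summary <- summary(example_model)$coefficients
slope_row <- coef_summary["wt", ]

# Test statistic: estimated slope divided by its standard error
t_stat <- slope_row[["Estimate"]] / slope_row[["Std. Error"]]

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df.residual(example_model), lower.tail = FALSE)

c(t_stat = t_stat, p_value = p_value)
```

The hand-computed values should agree with the statistic and p.value columns that broom::tidy() reports for the same model.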

14.

Interpret the results of the hypothesis test in the context of the problem.