R Lab: Multiple Linear Regression

DSA 220 - Introduction to Data Science and Analytics

Author

Andrew DiLernia

Once completed, submit a zipped file (.zip) via Blackboard containing two documents: a .qmd Quarto file and the corresponding HTML document.

In this assignment we will fit and interpret a multiple linear regression model using data collected on hawks. The data consist of measurements, in millimeters, of wing (6), culmen (1), hallux (the killing talon), and tail (9) lengths, where the numbers in parentheses refer to the diagram of avian measurements. The data were collected by students and faculty at Cornell College in Mount Vernon, Iowa, at Lake MacBride near Iowa City, Iowa, between 1992 and 2003. The data set contains measurements on three different species: Cooper’s, red-tailed, and sharp-shinned hawks.

Diagram of avian measurements: (1) beak length measured from tip to skull along the culmen; (2) beak length measured from the tip to the anterior edge of the nares; (3) beak depth; (4) beak width; (5) tarsus length; (6) wing length from carpal joint to wingtip; (7) secondary length from carpal joint to tip of the outermost secondary; (8) Kipp’s distance; (9) tail length. Image obtained from Tobias, J. A. et al. (2022) ‘AVONET: morphological, ecological and geographical data for all birds’, Ecology Letters, 25(3), pp. 581–597. doi: 10.1111/ele.13898.

Depiction from left to right of a Cooper’s hawk, red-tailed hawk, and sharp-shinned hawk. Images obtained from https://en.wikipedia.org/wiki/Cooper%27s_hawk, https://www.wildlife.state.nh.us/wildlife/profiles/red-tailed-hawk.html, and https://en.wikipedia.org/wiki/Sharp-shinned_hawk.

1.

First, add a new code chunk containing the code below to load R packages for this lab.

# Load necessary packages
library(tidyverse)
library(ggthemes)
library(gt)
library(corrr)
library(ggfortify)
library(broom)
library(scales)
library(car)
library(GGally)
library(gtsummary)

# Setting default ggplot2 theme
theme_set(ggthemes::theme_few())

# Load function for visualizing VIF & GVIF values
source("https://raw.githubusercontent.com/dilernia/DSA220/main/Functions/vif_plot.R")

Then, load the hawks data using the code below.

# Importing hawks data set
hawks <- read_csv("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Data/hawks.csv")

2.

Reproduce the scatter plot below showing the weight of the hawks (weight_g) by the tail lengths (tail_mm) with a linear fit.
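As a rough sketch of the general pattern for a scatter plot with a linear fit, the code below uses ggplot2 on simulated data. The data frame `toy` and its columns `length_mm` and `mass_g` are hypothetical stand-ins, not the hawks data, and the labels are placeholders rather than the labels of the target plot.

```r
library(ggplot2)

# Simulated data for illustration only (hypothetical names, not the hawks data)
set.seed(220)
toy <- data.frame(length_mm = runif(80, 120, 260))
toy$mass_g <- -900 + 8 * toy$length_mm + rnorm(80, sd = 120)

# Scatter plot with a least-squares line and no confidence band
p <- ggplot(toy, aes(x = length_mm, y = mass_g)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Mass by length (simulated data)",
       x = "Length (mm)",
       y = "Mass (g)")

print(p)
```

Setting `formula = y ~ x` explicitly avoids the message that `geom_smooth()` otherwise prints about the formula it chose.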

3.

Next, reproduce the scatter plot below showing the weight of the hawks (weight_g) by the wing span (wing_mm). Make sure to update all plot labels accordingly.

4.

Fit a multiple linear regression model with tail length in mm (tail_mm) and wing span in mm (wing_mm) as the predictor variables and weight in grams (weight_g) as the response variable. Reproduce the output exactly as displayed below.

| Characteristic | Beta | 95% CI | p-value | VIF |
|---|---|---|---|---|
| (Intercept) | -860 | -928, -791 | <0.001 | |
| tail_mm | 2.2 | 1.6, 2.9 | <0.001 | 5.3 |
| wing_mm | 3.8 | 3.5, 4.0 | <0.001 | 5.3 |

Abbreviations: CI = Confidence Interval, VIF = Variance Inflation Factor
R² = 0.880; Adjusted R² = 0.880; Sigma = 160; Statistic = 3,273; p-value = <0.001; df = 2; Log-likelihood = -5,826; AIC = 11,661; BIC = 11,680; Deviance = 23,026,265; Residual df = 894; No. Obs. = 897
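
As a minimal sketch of the kind of fit described in this part, the base R code below uses simulated data with hypothetical column names `x1`, `x2`, and `y`. It is not the code that produced the table above. It only illustrates how `lm()` and `confint()` give estimates with 95% confidence intervals, and how a VIF can be computed by hand as 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors.

```r
# Simulated data for illustration only (hypothetical names, not the hawks data)
set.seed(220)
n  <- 200
x1 <- rnorm(n, mean = 200, sd = 30)
x2 <- 1.5 * x1 + rnorm(n, sd = 20)   # deliberately correlated with x1
y  <- -800 + 2 * x1 + 4 * x2 + rnorm(n, sd = 150)
toy <- data.frame(y, x1, x2)

# Fit the multiple linear regression model
fit <- lm(y ~ x1 + x2, data = toy)

# Estimates alongside 95% confidence intervals
est_table <- cbind(estimate = coef(fit), confint(fit, level = 0.95))
print(round(est_table, 2))

# VIF for x1 by hand: regress x1 on the other predictor(s)
r2_x1  <- summary(lm(x1 ~ x2, data = toy))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
print(vif_x1)
```

With exactly two predictors, both VIFs are equal, because each R²_j is the squared correlation between the two predictors. That matches the table above, where tail_mm and wing_mm share the same VIF.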

5.

  1. Obtain diagnostic plots for checking the MLR model assumptions, including, but not limited to, the residuals by fitted values plot, a QQ plot of the residuals, a histogram of the residuals, and a plot showing the VIF values for each predictor.

  2. Formally check each assumption of the MLR model, clearly stating whether or not each assumption is met, citing specific reasons using the diagnostic plots obtained. Regardless of whether or not assumptions are met, we will proceed with the model.

  3. Are there any influential points or outliers present? If so, how many? Give specific evidence to support your answer.
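
The diagnostics listed above can be sketched in base R as follows. The data are simulated with one artificial outlier injected, and the names `toy`, `x1`, `x2`, and `y` are hypothetical. The cutoffs used to flag points (Cook's distance above 4/n, or an absolute standardized residual above 3) are common rules of thumb assumed here for illustration, not criteria stated in this lab.

```r
# Simulated data for illustration only (hypothetical names, not the hawks data)
set.seed(220)
n  <- 100
x1 <- runif(n, 150, 250)
x2 <- runif(n, 200, 400)
y  <- -800 + 2 * x1 + 4 * x2 + rnorm(n, sd = 50)
y[1] <- y[1] + 1000   # inject one artificial outlier
toy <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = toy)

# Base R diagnostic plots: residuals vs fitted, QQ, scale-location, leverage
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# Histogram of the residuals
hist(resid(fit), main = "Histogram of residuals", xlab = "Residual")

# Flag potentially influential or outlying observations (assumed cutoffs)
cooks   <- cooks.distance(fit)
std_res <- rstandard(fit)
flagged <- which(cooks > 4 / n | abs(std_res) > 3)
print(flagged)
```

Flagged points are candidates for a closer look rather than automatic deletions; the plots and the numeric summaries are meant to be read together.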

6.

Provide a statement of the estimated regression equation using tail length (tail_mm) and wing span (wing_mm) as the predictor variables and weight (weight_g) as the response variable.
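
For reference, an estimated regression equation with two predictors is commonly written in the generic form below, where hats denote estimates. The symbols \(x_1\), \(x_2\), and \(y\) are generic placeholders; the numeric estimates and variable names come from the fitted model's output.

\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2
\]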

7.

Provide the value of and interpret the estimated slope for wing span, if appropriate, in context.

8.

Provide the value of and interpret the estimated intercept, if appropriate, in context.

9.

Provide and interpret the value of \(R^2\) in this context.

10.

  1. Obtain a predicted value and 90% prediction interval for a hawk with a wing span of 400 mm and a tail length of 214 mm using the predict() function.

  2. Interpret the prediction interval in context.
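
A minimal sketch of how `predict()` returns a prediction interval for an `lm` fit is shown below. It uses simulated data with hypothetical names `x1`, `x2`, and `y`, so the numbers it prints are unrelated to the hawks model. The key points are that the new data frame must use the same column names as the predictors in the model formula, and that `interval = "prediction"` with `level = 0.90` requests a 90% prediction interval.

```r
# Simulated data for illustration only (hypothetical names, not the hawks data)
set.seed(220)
n  <- 150
x1 <- runif(n, 150, 250)
x2 <- runif(n, 200, 450)
y  <- -800 + 2 * x1 + 4 * x2 + rnorm(n, sd = 100)
toy <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = toy)

# New observation: column names must match the predictors in the formula
new_obs <- data.frame(x1 = 214, x2 = 400)

# Point prediction with a 90% prediction interval (columns: fit, lwr, upr)
pred <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.90)
print(pred)

# For comparison, a 90% confidence interval for the mean response
conf <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.90)
print(conf)
```

The prediction interval is wider than the confidence interval at the same point because it accounts for the variability of an individual observation in addition to the uncertainty in the estimated mean.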

11.

  1. Finally, we will consider several regression models with different combinations of predictors, all with weight in grams (weight_g) as the response variable: one model with just tail length in mm (tail_mm), one with just wing span in mm (wing_mm), the model with both tail length and wing span fit earlier, a quadratic model with tail length, and a quadratic model with wing span. In doing so, reproduce the tables exactly as displayed below.

| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | -1,415 | -1,495, -1,334 | <0.001 |
| tail_mm | 11 | 11, 11 | <0.001 |

Abbreviation: CI = Confidence Interval
R² = 0.765; Adjusted R² = 0.765; Sigma = 224; Statistic = 2,922; p-value = <0.001; df = 1; Log-likelihood = -6,133; AIC = 12,273; BIC = 12,287; Deviance = 44,996,768; Residual df = 896; No. Obs. = 898

| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | -663 | -700, -625 | <0.001 |
| wing_mm | 4.5 | 4.4, 4.7 | <0.001 |

Abbreviation: CI = Confidence Interval
R² = 0.874; Adjusted R² = 0.874; Sigma = 164; Statistic = 6,206; p-value = <0.001; df = 1; Log-likelihood = -5,848; AIC = 11,702; BIC = 11,716; Deviance = 24,153,634; Residual df = 895; No. Obs. = 897

| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | -860 | -928, -791 | <0.001 |
| tail_mm | 2.2 | 1.6, 2.9 | <0.001 |
| wing_mm | 3.8 | 3.5, 4.0 | <0.001 |

Abbreviation: CI = Confidence Interval
R² = 0.880; Adjusted R² = 0.880; Sigma = 160; Statistic = 3,273; p-value = <0.001; df = 2; Log-likelihood = -5,826; AIC = 11,661; BIC = 11,680; Deviance = 23,026,265; Residual df = 894; No. Obs. = 897

| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | -1,071 | -1,516, -625 | <0.001 |
| tail_mm | 7.1 | 2.2, 12 | 0.005 |
| I(tail_mm^2) | 0.01 | 0.00, 0.02 | 0.12 |

Abbreviation: CI = Confidence Interval
R² = 0.766; Adjusted R² = 0.765; Sigma = 224; Statistic = 1,464; p-value = <0.001; df = 2; Log-likelihood = -6,132; AIC = 12,272; BIC = 12,291; Deviance = 44,877,802; Residual df = 895; No. Obs. = 898

| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | -60 | -220, 100 | 0.5 |
| wing_mm | -0.20 | -1.4, 1.0 | 0.7 |
| I(wing_mm^2) | 0.01 | 0.01, 0.01 | <0.001 |

Abbreviation: CI = Confidence Interval
R² = 0.882; Adjusted R² = 0.881; Sigma = 159; Statistic = 3,329; p-value = <0.001; df = 2; Log-likelihood = -5,820; AIC = 11,647; BIC = 11,667; Deviance = 22,683,880; Residual df = 894; No. Obs. = 897

  2. Based on the BIC values among the models considered, which model is most preferred?
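
As a small sketch of how BIC values can be compared across candidate models in base R, the code below fits a linear and a quadratic model to simulated data and tabulates their BIC values with `BIC()`. The names `toy`, `x`, `y`, `fit_linear`, and `fit_quad` are hypothetical, and the data are not the hawks data. Lower BIC indicates the more preferred model among those compared.

```r
# Simulated data for illustration only (hypothetical names, not the hawks data)
set.seed(220)
n <- 200
x <- runif(n, 200, 450)
y <- 0.01 * x^2 + rnorm(n, sd = 100)
toy <- data.frame(x, y)

# Candidate models: linear and quadratic in x
fit_linear <- lm(y ~ x, data = toy)
fit_quad   <- lm(y ~ x + I(x^2), data = toy)

# BIC() accepts several fitted models and returns one row per model
bic_table <- BIC(fit_linear, fit_quad)
print(bic_table)

# Name of the model with the smallest BIC
best <- rownames(bic_table)[which.min(bic_table$BIC)]
print(best)
```

Information criteria are only directly comparable across models fit to the same observations. The tables above report 898 observations for the models using only tail_mm and 897 for the models that include wing_mm, which may be worth keeping in mind when comparing them.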