Simple Linear Regression

DSA 220 - Introduction to Data Science and Analytics

Author

Andrew DiLernia

Learning Objectives

  • Fit a simple linear regression (SLR) model

  • State and check the assumptions of the SLR model

  • Interpret SLR model output

  • Identify influential points and outliers

  • Obtain & interpret predicted values

  • Obtain & interpret prediction intervals

Introduction

Simple linear regression (SLR) is a statistical method used to model the relationship between a single quantitative response variable, denoted by \(y\), and a single explanatory or predictor variable, denoted by \(x\). In this activity, we will focus on the case when the predictor variable \(x\) is a quantitative variable, although it can also be a categorical variable. In this setting, the SLR model relies on an assumption of a linear relationship between \(x\) and \(y\). That is, SLR is most useful when \(y\) tends to change in a straight-line or linear pattern as the value of \(x\) changes.

Linear regression is widely used across many different disciplines. Some examples include:

  a. Exploring monthly stock prices based on an economic indicator such as the gross domestic product (GDP) in U.S. dollars 📈

  b. Analyzing the relationship between patients’ health outcomes, e.g., blood pressure (mmHg), and the dosage level in milligrams of a treatment prescribed to lower blood pressure ⚕️💊

  c. Predicting how much longer a car component will last in days based on the number of miles on the vehicle 🛠🚗

  d. Modeling the relationship between sales and advertising expenses in U.S. dollars 💸

An important concept in linear regression is the distinction between the roles of the response variable and the explanatory or predictor variable.

For SLR, the response variable is the quantitative variable describing the outcome of interest. The explanatory variable is the variable whose relationship with the response variable is being studied. Explanatory variables are also referred to as predictors or predictor variables in regression.

Tip

What are the response and predictor variables for examples a. and b. above?

To continue with this activity, let’s load some R packages, and set a default theme for all visualizations made with ggplot2.

Code
# Loading R packages
library(tidyverse)
library(gt)
library(corrr)
library(ggthemes)
library(datasauRus)
library(ggfortify)
library(broom)
library(scales)
library(ggtext)

# Setting default ggplot2 theme
theme_set(theme_few())

Simple Linear Regression Model

Assumed Model

The assumed or theoretical simple linear regression model is

\[y = \beta_0 + \beta_1 x + \varepsilon, \]

where

  • \(y\) is the response variable

  • \(\beta_0\) is the intercept

  • \(\beta_1\) is the slope

  • \(x\) is the explanatory / predictor variable

  • \(\varepsilon\) is the error term

Collectively, \(\beta_0\) and \(\beta_1\) define the true straight-line relationship between \(x\) and \(y\). Using a sample of data collected on a population of interest, a main goal of SLR is to estimate this linear relationship by obtaining estimates of \(\beta_0\) and \(\beta_1\).

Code
# Simulating data from SLR model
set.seed(1994)

# True parameters
true_intercept_slope <- c(4, 0.6)
n_values <- 15

sample_data <- tibble(x = seq(1, 10, length.out = n_values),
                      y = true_intercept_slope[1] + true_intercept_slope[2] * x + rnorm(n_values, mean = 0, sd = 2))

# Extracting predicted values
sample_data$theoretical_mean <- true_intercept_slope[1] + true_intercept_slope[2] * sample_data$x 

# Create equation string
theoretical_slr_equation <- paste0("y = ", true_intercept_slope[1], " + ", true_intercept_slope[2], "x + ε")

# Formatting the label with HTML/CSS for ggtext
theoretical_slr_equation_html <- str_glue("<span style='color:#56B4E9;'>y</span><span style='color:black;'> = </span><span style='color:#E69F00;'>{true_intercept_slope[1]} + {true_intercept_slope[2]}<span style='color:#56B4E9;'>x</span> </span><span style='color:black;'>+ ε</span>")

# Creating scatter plot to visualize residuals
ggplot(sample_data, aes(x = x, y = y)) +
  geom_segment(aes(xend = x, yend = theoretical_mean),
               linetype = "dotted", color = "black") +
  geom_abline(slope = true_intercept_slope[2], 
              intercept = true_intercept_slope[1], 
              color = "#E69F00", linewidth = 0.8) +
  geom_point(color = "#56B4E9", size = 1.8, alpha = 1) +
  scale_x_continuous(breaks = seq(0, 12, by = 4),
                     limits = c(0, 12)) +
  annotate("richtext",  # richtext geom from ggtext renders the HTML label
           x = quantile(sample_data$x, probs = 0.70),
           y = min(sample_data$y),
           label = theoretical_slr_equation_html,
           size = 6, hjust = 0,
           fill = NA, label.color = NA) +
  labs(title = "Scatter plot with theoretical regression line") +
  theme_few(base_size = 16)

Estimated Model

A statement of the estimated regression equation is

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x, \]

where:

  • \(\hat{y}\) is the predicted response value

  • \(\hat{\beta}_0\) is the estimated intercept

  • \(\hat{\beta}_1\) is the estimated slope

  • \(x\) is the explanatory / predictor variable

When providing a statement of the estimated regression equation, we substitute our numerical estimates into the regression equation for \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

Code
# Fitting SLR model
model_fit <- lm(y ~ x, data = sample_data)

# Extracting predicted values
sample_data$predicted <- predict(model_fit)

# Extract coefficients
intercept_slope <- coef(model_fit)[1:2]

# Create equation string
estimated_slr_equation <- sprintf("ŷ = %.2f + %.2fx", intercept_slope[1], intercept_slope[2])

# Creating scatter plot to visualize residuals
ggplot(sample_data, aes(x = x, y = y)) +
  geom_segment(aes(xend = x, yend = predicted),
               linetype = "dotted", color = "black") +
  geom_smooth(method = "lm", se = FALSE, color = "#009E73",
              linewidth = 0.8) +
  geom_point(color = "#56B4E9", size = 1.8, alpha = 1) +
  scale_x_continuous(breaks = seq(0, 12, by = 4),
                     limits = c(0, 12)) +
  annotate("text", x = quantile(sample_data$x, probs = 0.70), 
           y = min(sample_data$y), 
           label = estimated_slr_equation, 
           color = "#009E73", size = 6, hjust = 0) +
  labs(title = "Scatter plot with estimated regression line") +
  theme_few(base_size = 16)

The estimated intercept, \(\hat{\beta}_0\), can be generally interpreted as:

\[\textit{The expected value of $y$ given that $x=0$ is $\hat{\beta}_0$.}\]

The estimated slope, \(\hat{\beta}_1\), can be generally interpreted as:

\[\textit{For every 1 unit increase in $x$, we expect $y$ to increase / decrease by $|\hat{\beta}_1|$.}\]

Note that \(|\cdot|\) is the absolute value function, and that we use the phrasing “increase” if \(\hat{\beta}_1>0\), and “decrease” if \(\hat{\beta}_1<0\). These are blueprint interpretations for the estimated intercept and slope, but should be tailored to a given context. Specifically, the units of measurement, the response variable, the predictor variable, and the values of the regression estimates should be included when providing these interpretations.

Note that there is no error term in the estimated regression equation. However, empirical residuals, also called prediction errors, can be calculated using the estimated regression model. The residual for the \(i\)th observation is the difference between the observed and predicted value:

\[e_i = y_i - \hat{y}_i\]

The estimates for SLR, \(\hat{\beta}_0\) and \(\hat{\beta}_1\), are obtained using the method of least squares. That is, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are determined using calculus to minimize the sum of squared errors, \(\text{SSE} = \sum\limits_{i=1}^{n}e_i^2\). Note that this is not the same as minimizing the sum of the magnitude of the residuals, which is called least absolute deviations (LAD) regression. Visually, the size of the \(\text{SSE}\) is depicted via the sum of the area of all the squares in the scatter plot below.

Code
# Adding residuals to table
sample_data <- sample_data |> 
  mutate(residual = y - predicted)
  
# Creating scatter plot to visualize residuals
sum_squares_gg <- ggplot(sample_data, aes(x = x, y = y)) +
  geom_segment(aes(xend = x, yend = predicted),
               linetype = "dotted", color = "black") +
  geom_rect(alpha = 0.2, fill = '#D55E00',
            aes(ymin = pmin(y, predicted), 
                ymax = pmax(y, predicted),
                xmin = pmin(x, x - residual),
                xmax = pmax(x, x - residual))) +
  geom_smooth(method = "lm", se = FALSE, color = "#009E73",
              linewidth = 0.8) +
  geom_point(color = "#56B4E9", size = 1.8, alpha = 1) +
  coord_fixed() +
  theme_few(base_size = 16)

sum_squares_gg
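
As a quick check, the least squares estimates can also be computed directly from their closed-form solutions, \(\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\), and compared against lm(). Below is a minimal sketch using the simulated sample_data from above.

Code
# Closed-form least squares estimates; the slope formula simplifies to
# cov(x, y) / var(x) since the (n - 1) denominators cancel
beta1_hat <- cov(sample_data$x, sample_data$y) / var(sample_data$x)
beta0_hat <- mean(sample_data$y) - beta1_hat * mean(sample_data$x)

c(intercept = beta0_hat, slope = beta1_hat)

# These match the estimates from lm()
coef(lm(y ~ x, data = sample_data))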

Tip
  1. For examples a. through d., would we expect the true slope \(\beta_1\) to be negative or positive?

  2. Provide a statement of the assumed simple linear regression model for example d.

  3. Suppose that a sample of data was collected for example d., and an estimated slope was calculated as \(\hat{\beta}_1 = 2\). Interpret this value in the context of the problem.

  4. Further suppose for example d. that an estimated intercept was calculated as \(\hat{\beta}_0 = 5000\). Interpret this value in the context of the problem.

  5. Given estimates of \(\hat{\beta}_0 = 5000\) and \(\hat{\beta}_1 = 2\) for example d., what would we predict for our company’s sales revenue if we invested \(\$500\) in advertising on social media platforms?

Model Assumptions

The reliability of the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\), and of the corresponding p-values, for the simple linear regression model depends on several assumptions. All of the assumptions below must hold to conduct valid inference for the simple linear regression model.

  1. Random sample: observations must be randomly selected from the population of interest.

  2. Normality: the residuals must be normally distributed.

This assumption can be checked using a histogram or quantile-quantile (QQ) plot of the residuals. A fairly bell-shaped distribution in the histogram supports normality, as do points that do not stray far from the diagonal line on the QQ plot.

  3. Homoskedasticity: the residuals must have constant variance / spread across all predicted values.

This assumption can be checked using the residual by predicted value plot. A fairly constant vertical spread / range of the residuals across all predicted values supports homoskedasticity. When the residuals do not have constant variance, this is called heteroskedasticity.

  4. Independence: the residuals must be independent of the predicted values.

This assumption can be checked using the residual by predicted value plot. No apparent pattern in the relationship between the residuals and predicted values supports the independence assumption.

  5. Linearity: there must be a linear relationship between \(x\) and \(y\).

This assumption can be checked using the residual by predicted value plot or a scatter plot between \(x\) and \(y\). Having no systematic or identifiable pattern in the residual by predicted value plot supports the linearity assumption. A linear or straight-line pattern in the scatter plot between \(x\) and \(y\) also supports the linearity assumption.

Residual vs Fitted Value Plot

For the homoskedasticity assumption to be met for regression models, we want to observe a fairly constant vertical spread or range of the residuals across all predicted values.

Figure 1: Residual by predicted value plots showing homoskedastic, heteroskedastic, independent errors and predicted values, and linearly dependent errors and predicted values.
Figure 2: Residual by predicted value plots showing homoskedastic, heteroskedastic, independent errors and predicted values, and quadratically dependent errors and predicted values.
Figure 3: Residual by predicted value plots showing homoskedastic, heteroskedastic, independent errors and predicted values, and a sinusoidal dependence between errors and predicted values.
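
As a brief illustration, a residual by predicted value plot can be made directly from a fitted model object; below is a minimal sketch assuming the model_fit object from the simulated example above.

Code
# Minimal residual by predicted value plot for the simulated model_fit
tibble(predicted = fitted(model_fit),
       residual = resid(model_fit)) |> 
  ggplot(aes(x = predicted, y = residual)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point(color = "#56B4E9") +
  labs(x = "Predicted value", y = "Residual")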

Histograms and Q-Q Plots

To check the normality assumption for linear regression models, we can use a quantile-quantile (Q-Q) plot of the residuals and a histogram of the residuals.

Figure 4: Histograms showing the distributions of normally distributed, bimodal, and peaked residuals.
Figure 5: Quantile-quantile plots showing the empirical quantiles by the theoretical quantiles for normally distributed, bimodal, and peaked residuals.
Figure 6: Histograms showing the distributions of right-skewed, left-skewed, and discrete residuals.
Figure 7: Quantile-quantile plots showing the empirical quantiles by the theoretical quantiles for right-skewed, left-skewed, and discrete residuals.
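
A Q-Q plot of the residuals can likewise be made directly in ggplot2; below is a minimal sketch, again assuming the model_fit object from the simulated example.

Code
# Minimal Q-Q plot of the residuals; points near the line support normality
tibble(residual = resid(model_fit)) |> 
  ggplot(aes(sample = residual)) +
  stat_qq(color = "#56B4E9") +
  stat_qq_line(color = "#009E73")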

Using Correlations with Caution

For two numeric variables, the sample correlation coefficient, \(r\), is a commonly used statistic which describes both the strength and direction of a linear relationship between two variables. Correlations range from -1 to 1, with values closer to -1 indicating a stronger, negative linear relationship (as one variable increases, the other decreases), values closer to 0 indicating a weaker linear relationship, and values closer to 1 indicating a stronger, positive and linear relationship (as one variable increases, the other increases).
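
In R, the sample correlation between two numeric variables can be computed with cor(); for instance, using the simulated sample_data from earlier:

Code
# Sample correlation between x and y in the simulated data
cor(sample_data$x, sample_data$y)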

Correlations are a useful starting point, but should be used with caution as they have a few limitations:

  1. They only measure linear relationships, but many relationships are non-linear.

  2. A strong correlation does not imply causation, e.g., monthly ice cream sales may be strongly correlated with the number of monthly shark attacks, but that does not necessarily mean one causes the other. 🍦🦈

  3. Confounding variables can distort the relationship between variables of interest.

  4. The correlation coefficient is sensitive to outliers.

Let’s explore a couple of data sets that demonstrate some limitations of the correlation coefficient. First, the grid of scatter plots below, called Anscombe’s Quartet, shows different data sets which all have a correlation coefficient of \(r = 0.82\).

Code
# Load Anscombe's Quartet
data(anscombe)

# Reshape the data for plotting
anscombe_long <- anscombe |>
  pivot_longer(cols = everything(),
               names_to = c(".value", "set"),
               names_pattern = "(.)(.)")

# Create a single faceted plot
ggplot(anscombe_long, aes(x = x, y = y)) +
    geom_smooth(method = "lm", se = FALSE, color = "#009E73") +
    geom_point(color = "#56B4E9", size = 2) +
    facet_wrap(~ set, ncol = 2, labeller = labeller(set = function(x) paste("Dataset", x))) +
    labs(title = "Anscombe's Quartet")

Tip
  1. Which, if any, of the limitations described above are evident for Dataset 1?
  2. Which, if any, of the limitations described above are evident for Dataset 2?
  3. Which, if any, of the limitations described above are evident for Dataset 3?
  4. Which, if any, of the limitations described above are evident for Dataset 4?

Another commonly used data set to emphasize the importance of plotting one’s data is the “datasaurus” data set displayed below.

Code
# Plotting datasaurus_dozen data
# Original code from https://cran.r-project.org/web/packages/datasauRus/vignettes/Datasaurus.html
datasaurus_dozen |> 
  dplyr::filter(dataset != "high_lines") |>
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap(~ dataset, ncol = 3) +
  theme(legend.position = "none")

Code
datasaurus_dozen |> 
  dplyr::filter(dataset != "high_lines") |> 
  group_by(dataset) |> 
  summarize(mean_x = mean(x),
            mean_y = mean(y),
            std_dev_x = sd(x),
            std_dev_y = sd(y),
            corr_x_y  = cor(x, y)) |> 
  gt() |>
  fmt_number(columns = everything(), 
             decimals = 2,
             drop_trailing_zeros = TRUE)
Table 1: Summary statistics for the datasaurus data set.
dataset mean_x mean_y std_dev_x std_dev_y corr_x_y
away 54.27 47.83 16.77 26.94 −0.06
bullseye 54.27 47.83 16.77 26.94 −0.07
circle 54.27 47.84 16.76 26.93 −0.07
dino 54.26 47.83 16.77 26.94 −0.06
dots 54.26 47.84 16.77 26.93 −0.06
h_lines 54.26 47.83 16.77 26.94 −0.06
slant_down 54.27 47.84 16.77 26.94 −0.07
slant_up 54.27 47.83 16.77 26.94 −0.07
star 54.27 47.84 16.77 26.93 −0.06
v_lines 54.27 47.84 16.77 26.94 −0.07
wide_lines 54.27 47.83 16.77 26.94 −0.07
x_shape 54.26 47.84 16.77 26.93 −0.07

To further explore concepts related to SLR, let’s consider a real-world data set on the salaries of professional basketball players.

Example: NBA Player Salary Data 🏀 💵

The National Basketball Association (NBA) is a professional basketball league in North America consisting of 30 teams. The NBA is widely considered the premier men’s professional basketball league in the world. In this activity we will analyze data on NBA player salaries for the 2025–2026 season along with performance metrics from the previous season. Salary data were obtained from https://www.basketball-reference.com/contracts/players.html, and player-level performance statistics were sourced from https://www.basketball-reference.com/leagues/NBA_2025_per_game.html.1 This data set includes information on player earnings and fundamental performance metrics such as points scored per game for the 2024-2025 season. Using this data, we will explore potential relationships between player compensation and performance-based metrics.

Let’s load the data for the activity, focusing on players who have already played at least 4 years in the NBA, to avoid rookie contracts, which are systematically lower than standard NBA contracts.

Code
# Importing NBA salary data from GitHub, removing players on rookie contracts, converting salary to millions of USD, and dropping columns
nba_salary_data <- read_csv("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Data/nba_salary_data.csv") |> 
  dplyr::filter(years_played >= 4) |> 
  mutate(salary_2025_2026 = salary_2025_2026 / 1e6) |> 
  dplyr::rename(salary_millions = salary_2025_2026) |> 
  dplyr::select(player:points_per_game, years_played)

Below is a data dictionary for the NBA data, and a preview of some randomly selected rows.

Table 2: Data dictionary for the NBA salaries data set.
Variable Description
player Player's name
team Player's team for the 2025-2026 season
salary_millions Player's salary for the 2025-2026 season (in millions of USD)
position Primary position played by the player (e.g., C, PF, SF, SG, PG)
age Player's age at the start of the 2025-2026 season
points_per_game Average points scored per game during the previous season
years_played Total number of years played in the NBA prior to the 2025-2026 season
Table 3: Several rows and columns from the NBA data set.
player team salary_millions position age points_per_game years_played
Stephen Curry GSW 59.61 PG 36 24.5 16
Joel Embiid PHI 55.22 C 30 23.8 9
Nikola Jokić DEN 55.22 C 29 29.6 10
Kevin Durant HOU 54.71 PF 36 26.6 17
Jayson Tatum BOS 54.13 PF 26 26.8 8
Tip
  1. In Table 3, being as specific as possible, what is the type of each variable?

  2. Which data format would be best for storing all data available on the NBA players (e.g., XML, CSV, or JSON)?

Let’s look at a scatter plot and the correlations for the NBA data:

Code
# Creating a scatter plot
nba_salary_data |> 
  ggplot(aes(x = points_per_game, 
             y = salary_millions)) + 
  geom_smooth(method = "lm", se = FALSE) +
  geom_point() +
  scale_y_continuous(labels = scales::label_currency(prefix = "$"))

Code
# Calculating correlations among the numeric variables
correlations_table <- nba_salary_data |> 
  dplyr::select(where(is.numeric)) |> 
  corrr::correlate(diagonal = 1)

# Printing table of correlations
correlations_table |> 
    gt() |>
    fmt_number(columns = everything(), 
               decimals = 2,
               drop_trailing_zeros = FALSE)
Table 4: Table of pairwise correlations.
term salary_millions age points_per_game years_played
salary_millions 1.00 0.12 0.82 0.24
age 0.12 1.00 −0.03 0.93
points_per_game 0.82 −0.03 1.00 0.11
years_played 0.24 0.93 0.11 1.00
Tip
  1. Describe the relationship between a player’s points scored per game (points_per_game), and their salary in the 2025-2026 season (salary_millions).

  2. Does the sign (being positive or negative) of the correlation coefficient seem sensible in this context? Why or why not?

  3. Does there appear to be a linear relationship between player’s points scored per game and their salary in the 2025-2026 season?

Let’s fit a simple linear regression model to model a player’s salary using their points scored per game.

Tip

Provide a statement of the assumed simple linear regression model for the NBA data.

Fitting the Model

Tip

Fit a simple linear regression model for the NBA data and print the model output using the code below.

Code
# Fitting a simple linear regression model
slr_model <- lm(salary_millions ~ points_per_game,
                data = nba_salary_data)

# Printing model output
slr_model |> 
  broom::tidy() |> 
  gt() |> 
  fmt_number(columns = term:statistic, 
             decimals = 2)
Table 5: Simple linear regression model output.
term estimate std.error statistic p.value
(Intercept) −3.39 1.64 −2.07 3.994515e-02
points_per_game 1.87 0.10 18.60 3.800056e-42
Tip

Provide a statement of the estimated simple linear regression model for the NBA data.

For linear regression, the coefficient of determination, denoted by \(r^2\), describes the proportion of variability in the response variable, \(y\), explained by our model. The \(r^2\) value is equal to the square of the correlation coefficient between the predicted values of a model and the observed response values. For SLR, \(r^2\) is also equal to the square of the correlation coefficient between the explanatory and response variables.

For simple linear regression, we have a single predictor which constitutes our entire model, so \(r^2\) is the proportion of variability in \(y\) explained by \(x\).
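
For reference, \(r^2\) can be extracted from the fitted model in R; a minimal sketch assuming the slr_model object fit above:

Code
# Coefficient of determination from the model summary
broom::glance(slr_model)$r.squared

# For SLR, this equals the squared correlation between x and y
cor(nba_salary_data$points_per_game, nba_salary_data$salary_millions)^2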

Tip

Provide and interpret the value of the coefficient of determination in context.

Diagnostic Plots

Below are diagnostic plots for checking the model assumptions for our analysis of the NBA data set.

Code
# Several diagnostic plots at once
autoplot(slr_model, label.repel = TRUE)

Code
# Cook's Distance plot
autoplot(slr_model, which = 4, label.repel = TRUE)@plots[[1]]

Code
# Making histogram of residuals
tibble(Residual = slr_model$residuals) |> 
    ggplot(aes(x = Residual)) + 
    geom_histogram(aes(y = after_stat(density)), fill = "#56B4E9", color = "white") +
    stat_function(fun = dnorm, args = list(mean = 0, sd = sd(slr_model$residuals)),
                  color = "#009E73") +
    scale_y_continuous(expand = expansion(mult = c(0, .01))) +
    labs(y = "Density") + 
    theme_bw() + 
    theme(panel.grid = element_blank(), 
          text = element_text(face = "bold"))

Tip

Indicate if each assumption for fitting a simple linear regression model is met, providing specific evidence to support your conclusions from the output provided.

Influential Points and Outliers

Influential points are points that have a disproportionately large effect on the regression estimates relative to the rest of the data points. A point can be both an outlier and influential, but not every outlier is an influential point.

Code
set.seed(1994)

# Simulating influential and outlier data for demonstration
influential_data <- as.data.frame(round(mvtnorm::rmvnorm(n = 20, mean = c(3, 3), sigma = matrix(c(1.2, 0.85, 0.85, 1.2), byrow = T, nrow = 2)), 3))

influential_data <- rbind(influential_data, cbind(c(3, 4), c(5, 8)))

influential_data$V1 <- influential_data$V1*4+60

influential_data$V2 <- influential_data$V2*3

influential_data <- influential_data |> 
  rename(coffee_consumed_oz = V2,
         heart_rate_bpm = V1)

outlier_data <- tibble(V1 = c(10, 11, 18, 20, 40, 50, 51, 60, 70, 80, 90, 89, 98, 49),
                       V2 = round(c(2, 2.2, 4, 3.5, 4.1, 6, 5.6, 7.9, 8, 8.3, 10, 10.5, 12, 14))) |> 
  rename(annual_doctors_visits = V2,
         age = V1)

# Fitting models
outA <- lm(annual_doctors_visits ~ age, data = outlier_data)
outB <- lm(annual_doctors_visits ~ age, data = outlier_data[-c(nrow(outlier_data)), ])
infA <- lm(heart_rate_bpm ~ coffee_consumed_oz, data = influential_data)
infB <- lm(heart_rate_bpm ~ coffee_consumed_oz, data = influential_data[-c(nrow(influential_data)), ])
Code
# Plotting data with an influential point
influential_data |> 
  slice(1:(n() - 1)) |> 
  mutate(Data = "Excludes influential point") |> 
  bind_rows(influential_data |> 
              mutate(Data = "Includes influential point")) |> 
  ggplot(aes(x = coffee_consumed_oz, y = heart_rate_bpm, color = Data)) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(color = "black") +
  scale_color_manual(values = c("#56B4E9", "#D55E00")) +
  labs(title = "Example Coffee Data with Influential Point",
       x = "Coffee Consumed (oz)",
       y = "Heart Rate (bpm)",
       color = "") +
  theme(legend.position = "bottom")

Code
# Displaying table with Cook's Distance
influential_data$cooks_distance <- cooks.distance(infA)
influential_data$standardized_residual <- rstandard(infA)

influential_data |> 
  arrange(desc(coffee_consumed_oz)) |> 
  gt() |> 
  fmt_number(columns = everything(), 
             decimals = 2)
Table 6: Cook’s Distance measure for example coffee drinking data.
heart_rate_bpm coffee_consumed_oz cooks_distance standardized_residual
76.00 24.00 2.35 −1.86
81.23 15.39 0.27 1.89
72.00 15.00 0.06 −0.96
74.68 11.94 0.01 0.41
72.93 11.33 0.00 −0.02
75.08 11.21 0.01 0.65
76.17 10.55 0.03 1.09
68.44 10.19 0.03 −1.18
73.55 10.14 0.00 0.37
75.09 9.70 0.02 0.91
73.09 8.71 0.01 0.48
71.19 8.57 0.00 −0.08
67.28 8.25 0.04 −1.21
71.32 8.05 0.00 0.05
72.60 7.81 0.01 0.48
73.10 7.79 0.01 0.64
75.12 7.41 0.05 1.32
64.87 6.59 0.11 −1.68
66.14 6.13 0.06 −1.21
71.91 5.47 0.02 0.68
70.15 5.06 0.00 0.21
63.98 4.56 0.17 −1.64

To identify influential points, a commonly used statistic is Cook’s Distance measure, often referred to as Cook’s D. Cook’s D for the \(j\)th data point quantifies the aggregate change in all fitted or predicted values when the \(j\)th observation is excluded from the model fitting process. We will consider points to be influential if they have a Cook’s D value greater than 0.50, although this is a rule of thumb, and there is no consensus in the field of statistics on the cutoff point.
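
As a sketch of how this check can be carried out for the NBA model (assuming the slr_model object from above, and that no rows were dropped during fitting):

Code
# Flagging points with Cook's D above the 0.50 rule of thumb
nba_salary_data |> 
  mutate(cooks_d = cooks.distance(slr_model)) |> 
  dplyr::filter(cooks_d > 0.50) |> 
  dplyr::select(player, points_per_game, salary_millions, cooks_d)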

Tip

Are there any influential points in the NBA data set? Provide specific evidence to support your answer.

Outliers are points with response values that fall far above or below the value we would expect based on an estimated regression model. That is, outliers have residuals that are large in magnitude relative to the rest of the data set. Outliers may be due to a mistake in the data set, or they may simply reflect variability in the population of interest, but in either case they warrant closer examination.

Code
# Plotting data with an outlier
outlier_data |> 
  slice(1:(n() - 1)) |> 
  mutate(Data = "Excludes outlier") |> 
  bind_rows(outlier_data |> 
              mutate(Data = "Includes outlier")) |> 
  ggplot(aes(x = age, y = annual_doctors_visits, color = Data)) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(color = "black") +
  scale_color_manual(values = c("#56B4E9", "#D55E00")) +
  labs(title = "Data with Outlier",
       x = "Age (years)",
       y = "Doctor Visits per Year") +
  theme(legend.position = "bottom")

Code
# Displaying table with standardized residuals
outlier_data$standardized_residual <- rstandard(outA)
outlier_data$cooks_distance <- cooks.distance(outA)

outlier_data |> 
  arrange(desc(annual_doctors_visits)) |> 
  gt() |> 
  fmt_number(columns = c(standardized_residual, cooks_distance), 
             decimals = 2)
Table 7: Standardized residuals for example doctors visits data.
age annual_doctors_visits standardized_residual cooks_distance
49 14 3.29 0.42
98 12 0.32 0.02
90 10 −0.28 0.01
89 10 −0.24 0.01
60 8 0.13 0.00
70 8 −0.31 0.00
80 8 −0.76 0.04
50 6 −0.34 0.00
51 6 −0.38 0.01
18 4 0.15 0.00
20 4 0.06 0.00
40 4 −0.81 0.03
10 2 −0.45 0.03
11 2 −0.49 0.03

To identify outliers, one can inspect a plot of the standardized residuals. As a rule of thumb, we will consider a point to be an outlier if it has a standardized residual greater than 2 in magnitude.
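
A minimal sketch of this check for the NBA model, again assuming the slr_model object from above:

Code
# Flagging points with standardized residuals greater than 2 in magnitude
nba_salary_data |> 
  mutate(std_residual = rstandard(slr_model)) |> 
  dplyr::filter(abs(std_residual) > 2) |> 
  dplyr::select(player, points_per_game, salary_millions, std_residual)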

Tip

Are there any potential outliers in the NBA data set? Provide specific evidence to support your answer.

Inference of the SLR Model

A common method of inference for the SLR model is a hypothesis test of the regression slope, \(\beta_1\). To test whether \(\beta_1\) is equal to 0 or not, a formal statement of the hypotheses is:

\[\text{H}_0: \beta_1 = 0 \text{ vs. } \text{H}_\text{a}: \beta_1 \ne 0\]

The null hypothesis is that there is not a linear relationship between \(x\) and \(y\), and the alternative hypothesis is that there is. If there is a statistically significant linear relationship between \(x\) and \(y\), the direction of the relationship should be clearly indicated, i.e., positive or negative.
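
The test statistic and p-value for this test appear in the predictor’s row of the tidied model output; for example, they can be pulled out directly as sketched below (assuming the slr_model object from above).

Code
# Extracting the slope's test statistic and p-value
broom::tidy(slr_model) |> 
  dplyr::filter(term == "points_per_game") |> 
  dplyr::select(term, statistic, p.value)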

Tip
  1. Regardless of whether or not assumptions are met, formally state the hypotheses for testing if the regression slope is 0 or not for the NBA data.

  2. Provide the test statistic, p-value, and decision for this hypothesis test.

  3. Interpret the results of the hypothesis test in the context of the problem.

  4. Does our model explain significantly more of the variability in the response than an intercept only model? Why or why not?

Interpreting Model Output

Tip
  1. Regardless of whether or not assumptions are met, provide and interpret the estimated slope in context, rounding to two decimal places.

  2. Provide and interpret the estimated intercept in context, rounding to two decimal places. Is our interpretation sensible in this setting? Why or why not?

Extrapolation is the process of obtaining a predicted value outside the scope of one’s observed data set. Calculating predicted values using explanatory variable values outside the range of the data used for model fitting is one way that extrapolation can occur. In general, extrapolation is considered inappropriate since it can lead to unreliable or misleading results if the underlying relationship does not hold beyond the range of the observed data.
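
For illustration, a predicted value at a chosen predictor value can be obtained with predict(); the sketch below (using the slr_model object from above, with a hypothetical 25 points per game) also prints the observed range of the predictor, which determines whether a prediction constitutes extrapolation.

Code
# Predicted salary (in millions of USD) for a hypothetical 25 points per game
predict(slr_model, newdata = tibble(points_per_game = 25))

# Observed range of the predictor, for judging extrapolation
range(nba_salary_data$points_per_game)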

Tip

Using the model estimates, what would be the expected salary for a new player who had scored 50 points per game in the 2024-2025 season? Would this be a reliable prediction in this context?

Let’s consider obtaining a predicted value for one of, if not the most prolific player in NBA history, LeBron James.

Image from https://upload.wikimedia.org/wikipedia/commons/6/67/Lebron_dunking.jpg
Tip
  1. Based on the output from our overly-simplistic model, what is the expected salary of LeBron James for the 2025-2026 season?

  2. Based on this predicted value, what is the corresponding residual value for LeBron?

  3. Using our overly-simplistic model, is LeBron being paid more or less than expected for the 2025-2026 season? Is this surprising?

Another NBA legend, Kobe Bryant (aka “Black Mamba”), was drafted in 1996, averaged 27.6 points per game in the 2004-2005 season, and subsequently made 15.9 million dollars for the 2005-2006 season.2

Image from https://grantland.com/wp-content/uploads/2013/12/161402738.jpg?w=750
Tip
  1. Based on the output from our overly-simplistic model, what was Kobe’s expected salary for the 2005-2006 season?

  2. Is this an appropriate predicted value for our model? Why or why not?

Describing Uncertainty

We can also obtain predicted values and prediction intervals to describe uncertainty in these predictions.

Prediction Interval

A prediction interval for a new observation at a point \(x_0\) is

\[\hat{y}_0 \pm t_{1-\alpha/2,\,n-2} \times \sqrt{\text{MSE} \times \left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum\limits_{i=1}^n (x_i - \bar{x})^2} \right)},\]

where

  • \(\hat{y}_0\) is the predicted response value given \(x = x_0\).

  • \(n\) is the sample size.

  • \(t_{1-\alpha/2,\,n-2}\) is the multiplier for a \(t\)-distribution with \(n-2\) degrees of freedom for a \(100 \times (1-\alpha)\%\) prediction interval. For example, if \(n=165\) and \(\alpha = 0.05\) corresponding to a \(95\%\) prediction interval, \(t_{1-\alpha/2,\,n-2} = t_{0.975,163} \approx 1.97\).

  • \(\text{MSE} = \sum\limits_{i=1}^{n}\frac{(y_i - \hat{y}_i)^2}{n - 2}\) is the mean squared error

In practice, we will use R or other software to calculate this prediction interval.

The regression standard error, \(s = \sqrt{\text{MSE}}\), describes the typical size of the prediction errors of our model.
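
In R, \(s\) can be pulled directly from the fitted model; a minimal sketch using the slr_model object from above:

Code
# Regression standard error s = sqrt(MSE), reported by summary() as sigma
summary(slr_model)$sigma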

Tip
  1. A 95% prediction interval for the salary of a current NBA player, Louie the Laker (sometimes referred to by his nickname “the other splash brother”), who scored 20 points per game in the 2024-2025 season can be obtained using the R code below. Obtain and interpret this prediction interval in context.

  2. Is obtaining this predicted value considered extrapolation? Why or why not?

Code
# Obtaining predicted value & 95% prediction interval
prediction_data <- tibble(player = "Louie the 'other splash brother' Laker",
                          points_per_game = 20)
  
predict(slr_model, newdata = prediction_data, 
        interval = "prediction", level = 0.95) |> 
  as_tibble() |> 
  gt() |> 
  fmt_number(columns = everything(), 
             decimals = 2)
Table 8: Predicted value and prediction interval from SLR model.
fit lwr upr
33.96 16.14 51.77

Footnotes

  1. Sports Reference LLC. (n.d.). Basketball-Reference.com. Retrieved September 27, 2025, from https://www.basketball-reference.com↩︎

  2. Spotrac. (n.d.). Kobe Bryant contract, salary, and earnings. Retrieved September 27, 2025, from https://www.spotrac.com/nba/player/earnings/_/id/2514/kobe-bryant↩︎