Simple Linear Regression

DSA 220 - Introduction to Data Science and Analytics

Author

Andrew DiLernia

Learning Objectives

  • Fit a simple linear regression (SLR) model

  • State and check the assumptions of the SLR model

  • Interpret SLR model output

  • Identify influential points and outliers

  • Obtain & interpret predicted values

  • Obtain & interpret prediction intervals

Introduction

Simple linear regression (SLR) is a statistical method used to model the relationship between a single quantitative response variable, denoted by \(y\), and a single explanatory or predictor variable, denoted by \(x\). In this activity, we will focus on the case when the predictor variable \(x\) is a quantitative variable, although it can also be a categorical variable. In this setting, the SLR model relies on an assumption of a linear relationship between \(x\) and \(y\). That is, SLR is most useful when \(y\) tends to change in a straight-line or linear pattern as the value of \(x\) changes.

Linear regression is widely used across many different disciplines. Some examples include:

  a. Exploring monthly stock prices based on an economic indicator such as the gross domestic product (GDP) in U.S. dollars 📈

  b. Analyzing the relationship between patients’ health outcomes, e.g., blood pressure (mmHg), and the dosage level in milligrams of a treatment prescribed to lower blood pressure ⚕️💊

  c. Predicting how much longer a car component will last in days based on the number of miles on the vehicle 🛠🚗

  d. Modeling the relationship between sales and advertising expenses in U.S. dollars 💸

An important concept in linear regression is the distinction between the roles of the response variable and the explanatory or predictor variable.

For SLR, the response variable is the quantitative variable describing the outcome of interest. The explanatory variable is the variable whose relationship with the response variable is being studied. Explanatory variables are also referred to as predictors or predictor variables in regression.

Tip

What are the response and predictor variables for examples a. and b. above?

To continue with this activity, let’s load some R packages, and set a default theme for all visualizations made with ggplot2.

Code
# Loading R packages
library(tidyverse)
library(gt)
library(corrr)
library(ggthemes)
library(datasauRus)
library(ggfortify)
library(broom)
library(scales)
library(ggtext)

# Setting default ggplot2 theme
theme_set(theme_few())

Simple Linear Regression Model

Assumed Model

The assumed or theoretical simple linear regression model is

\[y = \beta_0 + \beta_1 x + \varepsilon, \]

where

  • \(y\) is the response variable

  • \(\beta_0\) is the intercept

  • \(\beta_1\) is the slope

  • \(x\) is the explanatory / predictor variable

  • \(\varepsilon\) is the error term

Collectively, \(\beta_0\) and \(\beta_1\) define the true straight-line relationship between \(x\) and \(y\). Using a sample of data collected on a population of interest, a main goal of SLR is to estimate this linear relationship by obtaining estimates of \(\beta_0\) and \(\beta_1\).

Code
# Simulating data from SLR model
set.seed(1994)

# True parameters
true_intercept_slope <- c(4, 0.6)
n_values <- 15

sample_data <- tibble(x = seq(1, 10, length.out = n_values),
                      y = true_intercept_slope[1] + true_intercept_slope[2] * x + rnorm(n_values, mean = 0, sd = 2))

# Extracting predicted values
sample_data$theoretical_mean <- true_intercept_slope[1] + true_intercept_slope[2] * sample_data$x 

# Create equation string
theoretical_slr_equation <- paste0("y = ", true_intercept_slope[1], " + ", true_intercept_slope[2], "x + ε")

# Formatting the label with HTML/CSS for ggtext
theoretical_slr_equation_html <- str_glue("<span style='color:#56B4E9;'>y</span><span style='color:black;'> = </span><span style='color:#E69F00;'>{true_intercept_slope[1]} + {true_intercept_slope[2]}<span style='color:#56B4E9;'>x</span> </span><span style='color:black;'>+ ε</span>")

# Creating scatter plot to visualize residuals
ggplot(sample_data, aes(x = x, y = y)) +
  geom_segment(aes(xend = x, yend = theoretical_mean),
               linetype = "dotted", color = "black") +
  geom_abline(slope = true_intercept_slope[2], 
              intercept = true_intercept_slope[1], 
              color = "#E69F00", linewidth = 0.8) +
  geom_point(color = "#56B4E9", size = 1.8, alpha = 1) +
  scale_x_continuous(breaks = seq(0, 12, by = 4),
                     limits = c(0, 12)) +
  annotate("richtext",  # richtext geom from ggtext renders the HTML label
           x = quantile(sample_data$x, probs = 0.70),
           y = min(sample_data$y),
           label = theoretical_slr_equation_html,
           size = 6, hjust = 0,
           fill = NA, label.color = NA) +
  labs(title = "Scatter plot with theoretical regression line") +
  theme_few(base_size = 16)

Estimated Model

A statement of the estimated regression equation is

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x, \]

where:

  • \(\hat{y}\) is the predicted response value

  • \(\hat{\beta}_0\) is the estimated intercept

  • \(\hat{\beta}_1\) is the estimated slope

  • \(x\) is the explanatory / predictor variable

When providing a statement of the estimated regression equation, we substitute our numerical estimates into the regression equation for \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

Code
# Fitting SLR model
model_fit <- lm(y ~ x, data = sample_data)

# Extracting predicted values
sample_data$predicted <- predict(model_fit)

# Extract coefficients
intercept_slope <- coef(model_fit)[1:2]

# Create equation string
estimated_slr_equation <- sprintf("ŷ = %.2f + %.2fx", intercept_slope[1], intercept_slope[2])

# Creating scatter plot to visualize residuals
ggplot(sample_data, aes(x = x, y = y)) +
  geom_segment(aes(xend = x, yend = predicted),
               linetype = "dotted", color = "black") +
  geom_smooth(method = "lm", se = FALSE, color = "#009E73",
              linewidth = 0.8) +
  geom_point(color = "#56B4E9", size = 1.8, alpha = 1) +
  scale_x_continuous(breaks = seq(0, 12, by = 4),
                     limits = c(0, 12)) +
  annotate("text", x = quantile(sample_data$x, probs = 0.70), 
           y = min(sample_data$y), 
           label = estimated_slr_equation, 
           color = "#009E73", size = 6, hjust = 0) +
  labs(title = "Scatter plot with estimated regression line") +
  theme_few(base_size = 16)

The estimated intercept, \(\hat{\beta}_0\), can be generally interpreted as:

\[\textit{The expected value of $y$ given that $x=0$ is $\hat{\beta}_0$.}\]

The estimated slope, \(\hat{\beta}_1\), can be generally interpreted as:

\[\textit{For every 1 unit increase in $x$, we expect $y$ to increase / decrease by $|\hat{\beta}_1|$.}\]

Note that \(|\cdot|\) is the absolute value function, and that we use the phrasing “increase” if \(\hat{\beta}_1>0\), and “decrease” if \(\hat{\beta}_1<0\). These are blueprint interpretations for the estimated intercept and slope, but should be tailored to a given context. Specifically, the units of measurement, the response variable, the predictor variable, and the values of the regression estimates should be included when providing these interpretations.

Note that there is no error term in the estimated regression equation. However, empirical residuals, also called prediction errors, can be calculated using the estimated regression model. The residual for the \(i\)th observation is the difference between the observed and predicted value:

\[e_i = y_i - \hat{y}_i\]

The estimates for SLR, \(\hat{\beta}_0\) and \(\hat{\beta}_1\), are obtained using the method of least squares. That is, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are determined using calculus to minimize the sum of squared errors, \(\text{SSE} = \sum\limits_{i=1}^{n}e_i^2\). Note that this is not the same as minimizing the sum of the magnitude of the residuals, which is called least absolute deviations (LAD) regression. Visually, the size of the \(\text{SSE}\) is depicted via the sum of the area of all the squares in the scatter plot below.

Code
# Adding residuals to table
sample_data <- sample_data |> 
  mutate(residual = y - predicted)
  
# Creating scatter plot to visualize residuals
sum_squares_gg <- ggplot(sample_data, aes(x = x, y = y)) +
  geom_segment(aes(xend = x, yend = predicted),
               linetype = "dotted", color = "black") +
  geom_rect(alpha = 0.2, fill = '#D55E00',
            aes(ymin = pmin(y, predicted), 
                ymax = pmax(y, predicted),
                xmin = pmin(x, x - residual),
                xmax = pmax(x, x - residual))) +
  geom_smooth(method = "lm", se = FALSE, color = "#009E73",
              linewidth = 0.8) +
  geom_point(color = "#56B4E9", size = 1.8, alpha = 1) +
  coord_fixed() +
  theme_few(base_size = 16)

sum_squares_gg
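
As a quick check, the least squares estimates can also be computed directly from their closed-form solutions, \(\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\), and compared against lm(). Below is a minimal sketch using the simulated sample_data from above.

Code
# Closed-form least squares estimates; the slope formula simplifies to
# cov(x, y) / var(x) since the (n - 1) denominators cancel
beta1_hat <- cov(sample_data$x, sample_data$y) / var(sample_data$x)
beta0_hat <- mean(sample_data$y) - beta1_hat * mean(sample_data$x)

c(intercept = beta0_hat, slope = beta1_hat)

# These match the estimates from lm()
coef(lm(y ~ x, data = sample_data))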

Tip
  1. For examples a. through d., would we expect the true slope \(\beta_1\) to be negative or positive?

  2. Provide a statement of the assumed simple linear regression model for example d.

  3. Suppose that a sample of data was collected for example d., and an estimated slope was calculated as \(\hat{\beta}_1 = 2\). Interpret this value in the context of the problem.

  4. Further suppose for example d. that an estimated intercept was calculated as \(\hat{\beta}_0 = 5000\). Interpret this value in the context of the problem.

  5. Given estimates of \(\hat{\beta}_0 = 5000\) and \(\hat{\beta}_1 = 2\) for example d., what would we predict for our company’s sales revenue if we invested \(\$500\) in advertising on social media platforms?

Model Assumptions

The reliability of the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\), and of the corresponding p-values, for the simple linear regression model depends on several assumptions. All of the assumptions below must hold to conduct valid inference for the simple linear regression model.

  1. Random sample: observations must be randomly selected from the population of interest.

  2. Normality: the residuals must be normally distributed.

This assumption can be checked using a histogram or quantile-quantile (QQ) plot of the residuals. A fairly bell-shaped distribution in the histogram supports normality, as do points that do not stray far from the diagonal line on the QQ plot.

  3. Homoskedasticity: the residuals must have constant variance / spread across all predicted values.

This assumption can be checked using the residual by predicted value plot. A fairly constant vertical spread / range of the residuals across all predicted values supports homoskedasticity. When the residuals do not have constant variance, this is called heteroskedasticity.

  4. Independence: the residuals must be independent of the predicted values.

This assumption can be checked using the residual by predicted value plot. No apparent pattern in the relationship between the residuals and predicted values supports the independence assumption.

  5. Linearity: there must be a linear relationship between \(x\) and \(y\).

This assumption can be checked using the residual by predicted value plot or a scatter plot between \(x\) and \(y\). Having no systematic or identifiable pattern in the residual by predicted value plot supports the linearity assumption. A linear or straight-line pattern in the scatter plot between \(x\) and \(y\) also supports the linearity assumption.

Residual vs Fitted Value Plot

For the homoskedasticity assumption to be met for regression models, we want to observe a fairly constant vertical spread or range of the residuals across all predicted values.

Figure 1: Residual by predicted value plots showing homoskedastic, heteroskedastic, independent errors and predicted values, and linearly dependent errors and predicted values.
Figure 2: Residual by predicted value plots showing homoskedastic, heteroskedastic, independent errors and predicted values, and quadratically dependent errors and predicted values.
Figure 3: Residual by predicted value plots showing homoskedastic, heteroskedastic, independent errors and predicted values, and a sinusoidal dependence between errors and predicted values.
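
As a brief illustration, a residual by predicted value plot can be made directly from a fitted model object; below is a minimal sketch assuming the model_fit object from the simulated example above.

Code
# Minimal residual by predicted value plot for the simulated model_fit
tibble(predicted = fitted(model_fit),
       residual = resid(model_fit)) |> 
  ggplot(aes(x = predicted, y = residual)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point(color = "#56B4E9") +
  labs(x = "Predicted value", y = "Residual")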

Histograms and Q-Q Plots

To check the normality assumption for linear regression models, we can use a quantile-quantile (Q-Q) plot of the residuals and a histogram of the residuals.

Figure 4: Histograms showing the distributions of normally distributed, bimodal, and peaked residuals.
Figure 5: Quantile-quantile plots showing the empirical quantiles by the theoretical quantiles for normally distributed, bimodal, and peaked residuals.
Figure 6: Histograms showing the distributions of right-skewed, left-skewed, and discrete residuals.
Figure 7: Quantile-quantile plots showing the empirical quantiles by the theoretical quantiles for right-skewed, left-skewed, and discrete residuals.
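
A Q-Q plot of the residuals can likewise be made directly in ggplot2; below is a minimal sketch, again assuming the model_fit object from the simulated example.

Code
# Minimal Q-Q plot of the residuals; points near the line support normality
tibble(residual = resid(model_fit)) |> 
  ggplot(aes(sample = residual)) +
  stat_qq(color = "#56B4E9") +
  stat_qq_line(color = "#009E73")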

Using Correlations with Caution

For two numeric variables, the sample correlation coefficient, \(r\), is a commonly used statistic which describes both the strength and direction of a linear relationship between two variables. Correlations range from -1 to 1, with values closer to -1 indicating a stronger, negative linear relationship (as one variable increases, the other decreases), values closer to 0 indicating a weaker linear relationship, and values closer to 1 indicating a stronger, positive and linear relationship (as one variable increases, the other increases).
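
In R, the sample correlation between two numeric variables can be computed with cor(); for instance, using the simulated sample_data from earlier:

Code
# Sample correlation between x and y in the simulated data
cor(sample_data$x, sample_data$y)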

Correlations are a useful starting point, but should be used with caution as they have a few limitations:

  1. They only measure linear relationships, but many relationships are non-linear.

  2. A strong correlation does not imply causation, e.g., monthly ice cream sales may be strongly correlated with the number of monthly shark attacks, but that does not necessarily mean one causes the other. 🍦🦈

  3. Confounding variables can distort the relationship between variables of interest.

  4. The correlation coefficient is sensitive to outliers.

Let’s explore a couple of data sets that demonstrate some limitations of the correlation coefficient. First, the grid of scatter plots below, called Anscombe’s Quartet, shows different data sets which all have a correlation coefficient of \(r = 0.82\).

Code
# Load Anscombe's Quartet
data(anscombe)

# Reshape the data for plotting
anscombe_long <- anscombe |>
  pivot_longer(cols = everything(),
               names_to = c(".value", "set"),
               names_pattern = "(.)(.)")

# Create a single faceted plot
ggplot(anscombe_long, aes(x = x, y = y)) +
    geom_smooth(method = "lm", se = FALSE, color = "#009E73") +
    geom_point(color = "#56B4E9", size = 2) +
    facet_wrap(~ set, ncol = 2, labeller = labeller(set = function(x) paste("Dataset", x))) +
    labs(title = "Anscombe's Quartet")

Tip
  1. Which, if any, of the limitations described above are evident for Dataset 1?
  2. Which, if any, of the limitations described above are evident for Dataset 2?
  3. Which, if any, of the limitations described above are evident for Dataset 3?
  4. Which, if any, of the limitations described above are evident for Dataset 4?

Another commonly used data set to emphasize the importance of plotting one’s data is the “datasaurus” data set displayed below.

Code
# Plotting datasaurus_dozen data
# Original code from https://cran.r-project.org/web/packages/datasauRus/vignettes/Datasaurus.html
datasaurus_dozen |> 
  dplyr::filter(dataset != "high_lines") |>
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap(~ dataset, ncol = 3) +
  theme(legend.position = "none")

Code
datasaurus_dozen |> 
  dplyr::filter(dataset != "high_lines") |> 
  group_by(dataset) |> 
  summarize(mean_x = mean(x),
            mean_y = mean(y),
            std_dev_x = sd(x),
            std_dev_y = sd(y),
            corr_x_y  = cor(x, y)) |> 
  gt() |>
  fmt_number(columns = everything(), 
             decimals = 2,
             drop_trailing_zeros = TRUE)
Table 1: Summary statistics for the datasaurus data set.
dataset mean_x mean_y std_dev_x std_dev_y corr_x_y
away 54.27 47.83 16.77 26.94 −0.06
bullseye 54.27 47.83 16.77 26.94 −0.07
circle 54.27 47.84 16.76 26.93 −0.07
dino 54.26 47.83 16.77 26.94 −0.06
dots 54.26 47.84 16.77 26.93 −0.06
h_lines 54.26 47.83 16.77 26.94 −0.06
slant_down 54.27 47.84 16.77 26.94 −0.07
slant_up 54.27 47.83 16.77 26.94 −0.07
star 54.27 47.84 16.77 26.93 −0.06
v_lines 54.27 47.84 16.77 26.94 −0.07
wide_lines 54.27 47.83 16.77 26.94 −0.07
x_shape 54.26 47.84 16.77 26.93 −0.07

To further explore concepts related to SLR, let’s consider a real-world data set on the salaries of professional basketball players.

Example: NBA Player Salary Data 🏀 💵

The National Basketball Association (NBA) is a professional basketball league in North America consisting of 30 teams. The NBA is widely considered the premier men’s professional basketball league in the world. In this activity we will analyze data on NBA player salaries for the 2025–2026 season along with performance metrics from the previous season. Salary data were obtained from https://www.basketball-reference.com/contracts/players.html, and player-level performance statistics were sourced from https://www.basketball-reference.com/leagues/NBA_2025_per_game.html.1 This data set includes information on player earnings and fundamental performance metrics such as points scored per game for the 2024-2025 season. Using this data, we will explore potential relationships between player compensation and performance-based metrics.

Let’s load the data for the activity, focusing on players who have already played at least 4 years in the NBA, to avoid rookie contracts, which are systematically lower than standard NBA contracts.

Code
# Importing NBA salary data from GitHub, removing players on rookie contracts, converting salary to millions of USD, and dropping columns
nba_salary_data <- read_csv("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Data/nba_salary_data.csv") |> 
  dplyr::filter(years_played >= 4) |> 
  mutate(salary_2025_2026 = salary_2025_2026 / 1e6) |> 
  dplyr::rename(salary_millions = salary_2025_2026) |> 
  dplyr::select(player:points_per_game, years_played)

Below is a data dictionary for the NBA data, and a preview of some randomly selected rows.

Table 2: Data dictionary for the NBA salaries data set.
Variable Description
player Player's name
team Player's team for the 2025-2026 season
salary_millions Player's salary for the 2025-2026 season (in millions of USD)
position Primary position played by the player (e.g., C, PF, SF, SG, PG)
age Player's age at the start of the 2025-2026 season
points_per_game Average points scored per game during the previous season
years_played Total number of years played in the NBA prior to the 2025-2026 season
Table 3: Several rows and columns from the NBA data set.
player team salary_millions position age points_per_game years_played
Stephen Curry GSW 59.61 PG 36 24.5 16
Joel Embiid PHI 55.22 C 30 23.8 9
Nikola Jokić DEN 55.22 C 29 29.6 10
Kevin Durant HOU 54.71 PF 36 26.6 17
Jayson Tatum BOS 54.13 PF 26 26.8 8
Tip
  1. In Table 3, being as specific as possible, what is the type of each variable?

  2. Which data format would be best for storing all data available on the NBA players (e.g., XML, CSV, or JSON)?

Let’s look at a scatter plot and the correlations for the NBA data:

Code
# Creating a scatter plot
nba_salary_data |> 
  ggplot(aes(x = points_per_game, 
             y = salary_millions)) + 
  geom_smooth(method = "lm", se = FALSE) +
  geom_point() +
  scale_y_continuous(labels = scales::label_currency(prefix = "$"))

Code
# Calculating correlations among the numeric variables
correlations_table <- nba_salary_data |> 
  dplyr::select(where(is.numeric)) |> 
  corrr::correlate(diagonal = 1)

# Printing table of correlations
correlations_table |> 
    gt() |>
    fmt_number(columns = everything(), 
               decimals = 2,
               drop_trailing_zeros = FALSE)
Table 4: Table of pairwise correlations.
term salary_millions age points_per_game years_played
salary_millions 1.00 0.12 0.82 0.24
age 0.12 1.00 −0.03 0.93
points_per_game 0.82 −0.03 1.00 0.11
years_played 0.24 0.93 0.11 1.00
Tip
  1. Describe the relationship between a player’s points scored per game (points_per_game), and their salary in the 2025-2026 season (salary_millions).

  2. Does the sign (being positive or negative) of the correlation coefficient seem sensible in this context? Why or why not?

  3. Does there appear to be a linear relationship between player’s points scored per game and their salary in the 2025-2026 season?

Let’s fit a simple linear regression model to model a player’s salary using their points scored per game.

Tip

Provide a statement of the assumed simple linear regression model for the NBA data.

Fitting the Model

Tip

Fit a simple linear regression model for the NBA data and print the model output using the code below.

Code
# Fitting a simple linear regression model
slr_model <- lm(salary_millions ~ points_per_game,
                data = nba_salary_data)

# Printing model output
slr_model |> 
  broom::tidy() |> 
  gt() |> 
  fmt_number(columns = term:statistic, 
             decimals = 2)
Table 5: Simple linear regression model output.
term estimate std.error statistic p.value
(Intercept) −3.39 1.64 −2.07 3.994515e-02
points_per_game 1.87 0.10 18.60 3.800056e-42
Tip

Provide a statement of the estimated simple linear regression model for the NBA data.

For linear regression, the coefficient of determination, denoted by \(r^2\), describes the proportion of variability in the response variable, \(y\), explained by our model. The \(r^2\) value is equal to the square of the correlation coefficient between the predicted values of a model and the observed response values. For SLR, \(r^2\) is also equal to the square of the correlation coefficient between the explanatory and response variables.

For simple linear regression, we have a single predictor which constitutes our entire model, so \(r^2\) is the proportion of variability in \(y\) explained by \(x\).
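
For reference, \(r^2\) can be extracted from the fitted model in R; a minimal sketch assuming the slr_model object fit above:

Code
# Coefficient of determination from the model summary
broom::glance(slr_model)$r.squared

# For SLR, this equals the squared correlation between x and y
cor(nba_salary_data$points_per_game, nba_salary_data$salary_millions)^2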

Tip

Provide and interpret the value of the coefficient of determination in context.

Diagnostic Plots

Below are diagnostic plots for checking the model assumptions for our analysis of the NBA data set.

Code
# Several diagnostic plots at once
autoplot(slr_model, label.repel = TRUE)

Code
# Cook's Distance plot
autoplot(slr_model, which = 4, label.repel = TRUE)@plots[[1]]

Code
# Making histogram of residuals
tibble(Residual = slr_model$residuals) |> 
    ggplot(aes(x = Residual)) + 
    geom_histogram(aes(y = after_stat(density)), fill = "#56B4E9", color = "white") +
    stat_function(fun = dnorm, args = list(mean = 0, sd = sd(slr_model$residuals)),
                  color = "#009E73") +
    scale_y_continuous(expand = expansion(mult = c(0, .01))) +
    labs(y = "Density") + 
    theme_bw() + 
    theme(panel.grid = element_blank(), 
          text = element_text(face = "bold"))

Tip

Indicate if each assumption for fitting a simple linear regression model is met, providing specific evidence to support your conclusions from the output provided.

Influential Points and Outliers

Influential points are points that have a disproportionately large effect on the regression estimates relative to the rest of the data points. A point can be both an outlier and influential, but not every outlier is an influential point.

Code
set.seed(1994)

# Simulating influential and outlier data for demonstration
influential_data <- as.data.frame(round(mvtnorm::rmvnorm(n = 20, mean = c(3, 3), sigma = matrix(c(1.2, 0.85, 0.85, 1.2), byrow = T, nrow = 2)), 3))

influential_data <- rbind(influential_data, cbind(c(3, 4), c(5, 8)))

influential_data$V1 <- influential_data$V1*4+60

influential_data$V2 <- influential_data$V2*3

influential_data <- influential_data |> 
  rename(coffee_consumed_oz = V2,
         heart_rate_bpm = V1)

outlier_data <- tibble(V1 = c(10, 11, 18, 20, 40, 50, 51, 60, 70, 80, 90, 89, 98, 49),
                       V2 = round(c(2, 2.2, 4, 3.5, 4.1, 6, 5.6, 7.9, 8, 8.3, 10, 10.5, 12, 14))) |> 
  rename(annual_doctors_visits = V2,
         age = V1)

# Fitting models
outA <- lm(annual_doctors_visits ~ age, data = outlier_data)
outB <- lm(annual_doctors_visits ~ age, data = outlier_data[-c(nrow(outlier_data)), ])
infA <- lm(heart_rate_bpm ~ coffee_consumed_oz, data = influential_data)
infB <- lm(heart_rate_bpm ~ coffee_consumed_oz, data = influential_data[-c(nrow(influential_data)), ])
Code
# Plotting data with an influential point
influential_data |> 
  slice(1:(n() - 1)) |> 
  mutate(Data = "Excludes influential point") |> 
  bind_rows(influential_data |> 
              mutate(Data = "Includes influential point")) |> 
  ggplot(aes(x = coffee_consumed_oz, y = heart_rate_bpm, color = Data)) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(color = "black") +
  scale_color_manual(values = c("#56B4E9", "#D55E00")) +
  labs(title = "Example Coffee Data with Influential Point",
       x = "Coffee Consumed (oz)",
       y = "Heart Rate (bpm)",
       color = "") +
  theme(legend.position = "bottom")

Code
# Displaying table with Cook's Distance
influential_data$cooks_distance <- cooks.distance(infA)
influential_data$standardized_residual <- rstandard(infA)

influential_data |> 
  arrange(desc(coffee_consumed_oz)) |> 
  gt() |> 
  fmt_number(columns = everything(), 
             decimals = 2)
Table 6: Cook’s Distance measure for example coffee drinking data.
heart_rate_bpm coffee_consumed_oz cooks_distance standardized_residual
76.00 24.00 2.35 −1.86
81.23 15.39 0.27 1.89
72.00 15.00 0.06 −0.96
74.68 11.94 0.01 0.41
72.93 11.33 0.00 −0.02
75.08 11.21 0.01 0.65
76.17 10.55 0.03 1.09
68.44 10.19 0.03 −1.18
73.55 10.14 0.00 0.37
75.09 9.70 0.02 0.91
73.09 8.71 0.01 0.48
71.19 8.57 0.00 −0.08
67.28 8.25 0.04 −1.21
71.32 8.05 0.00 0.05
72.60 7.81 0.01 0.48
73.10 7.79 0.01 0.64
75.12 7.41 0.05 1.32
64.87 6.59 0.11 −1.68
66.14 6.13 0.06 −1.21
71.91 5.47 0.02 0.68
70.15 5.06 0.00 0.21
63.98 4.56 0.17 −1.64

To identify influential points, a commonly used statistic is Cook’s Distance measure, often referred to as Cook’s D. Cook’s D for the \(j\)th data point quantifies the aggregate change in all fitted or predicted values when the \(j\)th observation is excluded from the model fitting process. We will consider points to be influential if they have a Cook’s D value greater than 0.50, although this is a rule of thumb, and there is no consensus in the field of statistics on the cutoff point.
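
As a sketch of how this check can be carried out for the NBA model (assuming the slr_model object from above, and that no rows were dropped during fitting):

Code
# Flagging points with Cook's D above the 0.50 rule of thumb
nba_salary_data |> 
  mutate(cooks_d = cooks.distance(slr_model)) |> 
  dplyr::filter(cooks_d > 0.50) |> 
  dplyr::select(player, points_per_game, salary_millions, cooks_d)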

Tip

Are there any influential points in the NBA data set? Provide specific evidence to support your answer.

Outliers are points with response values that fall far above or below the value we would expect based on an estimated regression model. That is, outliers have residuals that are large in magnitude relative to the rest of the data set. Outliers may be due to a mistake in the data set, or they may simply reflect variability in the population of interest, but in either case they warrant closer examination.

Code
# Plotting data with an outlier
outlier_data |> 
  slice(1:(n() - 1)) |> 
  mutate(Data = "Excludes outlier") |> 
  bind_rows(outlier_data |> 
              mutate(Data = "Includes outlier")) |> 
  ggplot(aes(x = age, y = annual_doctors_visits, color = Data)) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(color = "black") +
  scale_color_manual(values = c("#56B4E9", "#D55E00")) +
  labs(title = "Data with Outlier",
       x = "Age (years)",
       y = "Doctor Visits per Year") +
  theme(legend.position = "bottom")

Code
# Displaying table with standardized residuals
outlier_data$standardized_residual <- rstandard(outA)
outlier_data$cooks_distance <- cooks.distance(outA)

outlier_data |> 
  arrange(desc(annual_doctors_visits)) |> 
  gt() |> 
  fmt_number(columns = c(standardized_residual, cooks_distance), 
             decimals = 2)
Table 7: Standardized residuals for example doctors visits data.
age annual_doctors_visits standardized_residual cooks_distance
49 14 3.29 0.42
98 12 0.32 0.02
90 10 −0.28 0.01
89 10 −0.24 0.01
60 8 0.13 0.00
70 8 −0.31 0.00
80 8 −0.76 0.04
50 6 −0.34 0.00
51 6 −0.38 0.01
18 4 0.15 0.00
20 4 0.06 0.00
40 4 −0.81 0.03
10 2 −0.45 0.03
11 2 −0.49 0.03

To identify outliers, one can inspect a plot of the standardized residuals. As a rule of thumb, we will consider a point to be an outlier if it has a standardized residual greater than 2 in magnitude.
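
A minimal sketch of this check for the NBA model, again assuming the slr_model object from above:

Code
# Flagging points with standardized residuals greater than 2 in magnitude
nba_salary_data |> 
  mutate(std_residual = rstandard(slr_model)) |> 
  dplyr::filter(abs(std_residual) > 2) |> 
  dplyr::select(player, points_per_game, salary_millions, std_residual)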

Tip

Are there any potential outliers in the NBA data set? Provide specific evidence to support your answer.

Inference of the SLR Model

A common method of inference for the SLR model is a hypothesis test of the regression slope, \(\beta_1\). To test whether \(\beta_1\) is equal to 0 or not, a formal statement of the hypotheses is:

\[\text{H}_0: \beta_1 = 0 \text{ vs. } \text{H}_\text{a}: \beta_1 \ne 0\]

The null hypothesis is that there is not a linear relationship between \(x\) and \(y\), and the alternative hypothesis is that there is. If there is a statistically significant linear relationship between \(x\) and \(y\), the direction of the relationship should be clearly indicated, i.e., positive or negative.
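
The test statistic and p-value for this test appear in the predictor’s row of the tidied model output; for example, they can be pulled out directly as sketched below (assuming the slr_model object from above).

Code
# Extracting the slope's test statistic and p-value
broom::tidy(slr_model) |> 
  dplyr::filter(term == "points_per_game") |> 
  dplyr::select(term, statistic, p.value)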

Tip
  1. Regardless of whether or not assumptions are met, formally state the hypotheses for testing if the regression slope is 0 or not for the NBA data.

  2. Provide the test statistic, p-value, and decision for this hypothesis test.

  3. Interpret the results of the hypothesis test in the context of the problem.

  4. Does our model explain significantly more of the variability in the response than an intercept only model? Why or why not?

Interpreting Model Output

Tip
  1. Regardless of whether or not assumptions are met, provide and interpret the estimated slope in context, rounding to two decimal places.

  2. Provide and interpret the estimated intercept in context, rounding to two decimal places. Is our interpretation sensible in this setting? Why or why not?

Extrapolation is the process of obtaining a predicted value outside the scope of one’s observed data set. Calculating predicted values using explanatory variable values outside the range of the data used for model fitting is one way that extrapolation can occur. In general, extrapolation is considered inappropriate since it can lead to unreliable or misleading results if the underlying relationship does not hold beyond the range of the observed data.
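
For illustration, a predicted value at a chosen predictor value can be obtained with predict(); the sketch below (using the slr_model object from above, with a hypothetical 25 points per game) also prints the observed range of the predictor, which determines whether a prediction constitutes extrapolation.

Code
# Predicted salary (in millions of USD) for a hypothetical 25 points per game
predict(slr_model, newdata = tibble(points_per_game = 25))

# Observed range of the predictor, for judging extrapolation
range(nba_salary_data$points_per_game)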

Tip

Using the model estimates, what would be the expected salary for a new player who had scored 50 points per game in the 2024-2025 season? Would this be a reliable prediction in this context?

Let’s consider obtaining a predicted value for one of, if not the most prolific player in NBA history, LeBron James.

Image from https://upload.wikimedia.org/wikipedia/commons/6/67/Lebron_dunking.jpg
Tip
  1. Based on the output from our overly-simplistic model, what is the expected salary of LeBron James for the 2025-2026 season?

  2. Based on this predicted value, what is the corresponding residual value for LeBron?

  3. Using our overly-simplistic model, is LeBron being paid more or less than expected for the 2025-2026 season? Is this surprising?

Another NBA legend, Kobe Bryant (aka “Black Mamba”), was drafted in 1996, averaged 27.6 points per game in the 2004-2005 season, and subsequently made 15.9 million dollars for the 2005-2006 season.2

Image from https://grantland.com/wp-content/uploads/2013/12/161402738.jpg?w=750
Tip
  1. Based on the output from our overly-simplistic model, what was Kobe’s expected salary for the 2005-2006 season?

  2. Is this an appropriate predicted value for our model? Why or why not?

Describing Uncertainty

We can also obtain predicted values and prediction intervals to describe uncertainty in these predictions.

Prediction Interval

A prediction interval for a new observation at a point \(x_0\) is

\[\hat{y}_0 \pm t_{1-\alpha/2,\,n-2} \times \sqrt{\text{MSE} \times \left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum\limits_{i=1}^n (x_i - \bar{x})^2} \right)},\]

where

  • \(\hat{y}_0\) is the predicted response value given \(x = x_0\).

  • \(n\) is the sample size.

  • \(t_{1-\alpha/2,\,n-2}\) is the multiplier for a \(t\)-distribution with \(n-2\) degrees of freedom for a \(100 \times (1-\alpha)\%\) prediction interval. For example, if \(n=165\) and \(\alpha = 0.05\) corresponding to a \(95\%\) prediction interval, \(t_{1-\alpha/2,\,n-2} = t_{0.975,163} \approx 1.97\).

  • \(\text{MSE} = \sum\limits_{i=1}^{n}\frac{(y_i - \hat{y}_i)^2}{n - 2}\) is the mean squared error

In practice, we will use R or other software to calculate this prediction interval.

The regression standard error, \(s = \sqrt{\text{MSE}}\), describes the typical size of the prediction errors of our model.
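
In R, \(s\) can be pulled directly from the fitted model; a minimal sketch using the slr_model object from above:

Code
# Regression standard error s = sqrt(MSE), reported by summary() as sigma
summary(slr_model)$sigma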

Tip
  1. A 95% prediction interval for the salary of a current NBA player, Louie the Laker (sometimes referred to by his nickname “the other splash brother”), who scored 20 points per game in the 2024-2025 season can be obtained using the R code below. Obtain and interpret this prediction interval in context.

  2. Is obtaining this predicted value considered extrapolation? Why or why not?

Code
# Obtaining predicted value & 95% prediction interval
prediction_data <- tibble(player = "Louie the 'other splash brother' Laker",
                          points_per_game = 20)
  
predict(slr_model, newdata = prediction_data, 
        interval = "prediction", level = 0.95) |> 
  as_tibble() |> 
  gt() |> 
  fmt_number(columns = everything(), 
             decimals = 2)
Table 8: Predicted value and prediction interval from SLR model.
fit lwr upr
33.96 16.14 51.77

Footnotes

  1. Sports Reference LLC. (n.d.). Basketball-Reference.com. Retrieved September 27, 2025, from https://www.basketball-reference.com↩︎

  2. Spotrac. (n.d.). Kobe Bryant contract, salary, and earnings. Retrieved September 27, 2025, from https://www.spotrac.com/nba/player/earnings/_/id/2514/kobe-bryant↩︎