Multiple Linear Regression
DSA 220 - Introduction to Data Science and Analytics
Learning Objectives
State the Multiple Linear Regression (MLR) model
Check assumptions of the MLR model
Fit and interpret the results of the MLR model
Obtain and interpret predicted values
Assess the fit of the MLR model
Assess multicollinearity among predictors
Select a final model using proper criterion and metrics
Let’s begin by loading some R packages for this activity using the code below. Note: if it is the first time you are using an R package, you may need to install it first using the install.packages() function in the Console pane.
Code
# Load necessary packages
library(tidyverse)
library(ggthemes)
library(gt)
library(corrr)
library(ggfortify)
library(broom)
library(scales)
library(car)
library(GGally)
library(gtsummary)
Set default theme settings for plots, and load a function to display variance inflation factor (VIF) values using the code below.
Code
# Set ggplot theme for visualizations
theme_set(ggthemes::theme_few())
# Load function for visualizing VIF & GVIF values
source("https://raw.githubusercontent.com/dilernia/DSA220/main/Functions/vif_plot.R")
Multiple Linear Regression (MLR)
Multiple linear regression (MLR) is a statistical method used to model the relationship between a single quantitative response variable, denoted by \(y\), and \(k\) explanatory or predictor variables, denoted by \(x_1\), \(x_2\), \(\dots\), \(x_k\). In general, the predictor variables can be either quantitative or categorical variables, but in this activity we will focus on only quantitative predictors.
Some examples where a MLR model would be useful are:
Predicting the selling price of houses using characteristics such as the size in square feet, number of bedrooms, number of bathrooms, and year it was built. 🏠
Forecasting the total sales of a product for a company using advertising expenses, the price of the product, the previous sales numbers for that product, and the time of year. 📈
Exploring the relationship between employee salaries and factors such as education level, years of experience, job title, and gender. 💼 💸
Constructing a model to determine what a typical blood-pressure should be for a patient using information such as the patient’s age, sex, and height. ⚕️🏥
What are the response and predictor variables for the first two examples above, and what is the type of each predictor variable (e.g., categorical or quantitative)?
MLR Model
A general statement of the assumed theoretical MLR model is
\[y = \beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k + \varepsilon, \]
where \(y\) is our response variable, \(x_j\) the \(j^{th}\) predictor for \(j=1, 2, \dots, k\), and \(\varepsilon\) the error term, assumed to be independent and identically distributed following a normal distribution. When providing a statement of the theoretical or assumed MLR model, the terms of the model are expressed using general notation, but the response and predictor variables should be specified for the given context.
A statement of the estimated regression equation is
\[\hat{y} = \hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 + \dots + x_k \hat{\beta}_k, \]
where \(\hat{y}\) is the expected or predicted response value and \(x_j\) the \(j^{th}\) predictor for \(j=1, 2, \dots, k\). When providing a statement of the estimated MLR model, the \(\hat{\beta}\) terms should be replaced with corresponding numerical estimates from the model output, and the estimated response and predictor variables should be specified for the given context.
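As a small illustration of where the \(\hat{\beta}\) estimates live in R, the coef() function extracts them from a fitted lm object. This sketch uses R's built-in mtcars data, not the activity data:

```r
# Illustrative sketch using R's built-in mtcars data (not the activity data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Estimated coefficients: the intercept, then one slope per predictor
coef(fit)
```

These numerical values are what replace the \(\hat{\beta}\) symbols when stating an estimated regression equation.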
MLR Assumptions
As with simple linear regression, multiple linear regression requires a set of conditions or assumptions to be met to obtain reliable estimates and to conduct valid inference related to the model. Note that assumptions 1 through 5 below are identical to the assumptions for simple linear regression. All of the assumptions below must be true to conduct valid inference of the multiple linear regression model.
1. Random sample: observations must be randomly selected from the population of interest.
2. Normality: the residuals must be normally distributed.
   - This assumption can be checked using a histogram or quantile-quantile (QQ) plot of the residuals. A fairly bell-shaped distribution in the histogram supports normality, as do points that do not stray far from the diagonal line on the QQ plot.
3. Homoskedasticity: the residuals must have constant variance / spread across all predicted values.
   - This assumption can be checked using the residual by predicted value plot. A fairly constant vertical spread / range of the residuals across all predicted values supports homoskedasticity. When the residuals do not have constant variance, this is called heteroskedasticity.
4. Independence: the residuals must be independent of the predicted values.
   - This assumption can be checked using the residual by predicted value plot. No apparent pattern in the relationship between the residuals and predicted values supports the independence assumption.
5. Linearity: there must be a linear relationship between each \(x\) and \(y\).
   - This assumption can be checked using the residual by predicted value plot or a scatter plot between each \(x\) and \(y\). Having no systematic or identifiable pattern in the residual by predicted value plot supports the linearity assumption. A linear or straight-line pattern in the scatter plot between each predictor and \(y\) also supports it.
6. No multicollinearity: predictor variables should not be highly correlated with one another.
We check assumptions 1 through 5 in the same way as we did for SLR: by using the corresponding diagnostic plots. To check for multicollinearity, a common approach is to calculate the variance inflation factor (VIF) for each predictor. The VIF for a predictor, \(x_i\), is based on how much of the variability in \(x_i\) is explained by the other predictors; the more of that variability the other predictors explain, the larger the VIF. As a rule of thumb, we will consider a predictor to have multicollinearity issues if its VIF value is greater than 5.
Specifically, the VIF for a predictor \(x_i\) is:
\[\text{VIF}_i = \frac{1}{1 - r_i^2}\]
where \(r_i^2\) is the coefficient of determination from a regression of \(x_i\) on all other predictors, \(\{x_j \}_{j \ne i}\).
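The formula above can be verified by hand in R. This is a minimal sketch using the built-in mtcars data (illustration only, not the NBA activity data):

```r
# Minimal sketch of the VIF formula using R's built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# r_i^2 from regressing the predictor wt on the other predictor hp
r_sq_wt <- summary(lm(wt ~ hp, data = mtcars))$r.squared

# VIF for wt computed by hand; this matches car::vif(fit)["wt"]
vif_wt <- 1 / (1 - r_sq_wt)
vif_wt
```

With only one other predictor, \(r_i^2\) is simply the squared correlation between the two predictors, which makes the by-hand calculation easy to double-check.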
Interpreting Model Estimates
Interpretations of coefficients for MLR are slightly different than for SLR.
General interpretation for \(\hat{\beta}_j\): For every 1 unit increase in \(x_j\), we expect \(y\) to increase / decrease by \(|\hat{\beta}_j|\) units on average, holding all other predictors constant.
General interpretation for \(\hat{\beta}_0\): The expected value of \(y\) given that \(x_1\) through \(x_k\) are all 0 is \(\hat{\beta}_0\) units.
The “holding all other predictors constant” part of the slope interpretation for MLR is an essential part that cannot be excluded.
When multicollinearity is present, the estimated slopes may be unreliable or inaccurate for describing relationships between the predictors and the response variable. Also note that the estimated intercept, \(\hat{\beta}_0\), is only appropriate to interpret when all predictors being 0 is within the range of the values of the observed data set. That is, 0 must fall between the minimum and maximum value for each predictor variable in the observed data set to yield a reliable estimated intercept term. Otherwise, \(\hat{\beta}_0\) is outside the scope of the observed data set, so interpreting it would be considered extrapolation and thus would be inappropriate.
Example: NBA Player Salary Data 🏀 💵
In this activity, we will revisit the NBA compensation data, considering data on NBA player salaries for the 2025–2026 season along with performance metrics from the previous season. Salary data were obtained from https://www.basketball-reference.com/contracts/players.html, and player-level performance statistics were sourced from https://www.basketball-reference.com/leagues/NBA_2025_per_game.html.1
Let’s load the data for the activity, again focusing on players who have played at least 4 years in the NBA already to avoid rookie contracts which are systematically lower than standard NBA contracts, and only considering players who have a listed position.
Code
# Importing NBA salary data from GitHub, removing players on rookie contracts, converting salary to millions of USD, and dropping columns
nba_salary_data <- read_csv("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Data/nba_salary_data.csv") |>
dplyr::filter(years_played >= 4, !is.na(position)) |>
mutate(salary_2025_2026 = salary_2025_2026 / 1e6) |>
dplyr::rename(salary_millions = salary_2025_2026) |>
dplyr::select(player:position, points_per_game, assists_per_game, rebounds_per_game, steals_per_game, years_played)
Exploratory Data Analysis
Below is a data dictionary for the NBA data, and a preview of some randomly selected rows.
| Variable | Description |
|---|---|
| player | Player's name |
| team | Player's team for the 2025-2026 season |
| salary_millions | Player's salary for the 2025-2026 season (in millions of USD) |
| position | Primary position played by the player (e.g., C, PF, SF, SG, PG) |
| points_per_game | Average points scored per game during the previous season |
| assists_per_game | Average assists per game during the previous season |
| rebounds_per_game | Average rebounds per game during the previous season |
| steals_per_game | Average steals per game during the previous season |
| years_played | Total number of years played in the NBA prior to the 2025-2026 season |
| player | team | salary_millions | position | points_per_game | assists_per_game | rebounds_per_game | steals_per_game | years_played |
|---|---|---|---|---|---|---|---|---|
| Stephen Curry | GSW | 59.61 | PG | 24.5 | 6.0 | 4.5 | 1.1 | 16 |
| Joel Embiid | PHI | 55.22 | C | 23.8 | 4.5 | 8.2 | 0.7 | 9 |
| Nikola Jokić | DEN | 55.22 | C | 29.6 | 10.2 | 12.8 | 1.8 | 10 |
| Kevin Durant | HOU | 54.71 | PF | 26.6 | 4.2 | 6.1 | 0.8 | 17 |
| Jayson Tatum | BOS | 54.13 | PF | 26.8 | 6.0 | 8.7 | 1.1 | 8 |
Code
# Creating a scatter plot
nba_salary_data |>
ggplot(aes(x = points_per_game,
y = salary_millions)) +
geom_smooth(method = "lm", se = FALSE) +
geom_point() +
scale_y_continuous(labels = scales::label_currency(prefix = "$"))
Code
# Calculating correlations
correlations_table <- nba_salary_data |>
corrr::correlate(diagonal = 1)
# Printing table of correlations
correlations_table |>
gt() |>
fmt_number(columns = everything(),
decimals = 2,
drop_trailing_zeros = FALSE)
| term | salary_millions | points_per_game | assists_per_game | rebounds_per_game | steals_per_game | years_played |
|---|---|---|---|---|---|---|
| salary_millions | 1.00 | 0.82 | 0.64 | 0.42 | 0.45 | 0.26 |
| points_per_game | 0.82 | 1.00 | 0.72 | 0.39 | 0.46 | 0.11 |
| assists_per_game | 0.64 | 0.72 | 1.00 | 0.22 | 0.54 | 0.14 |
| rebounds_per_game | 0.42 | 0.39 | 0.22 | 1.00 | 0.20 | 0.12 |
| steals_per_game | 0.45 | 0.46 | 0.54 | 0.20 | 1.00 | 0.07 |
| years_played | 0.26 | 0.11 | 0.14 | 0.12 | 0.07 | 1.00 |
We can also visualize all pairwise correlations using a correlation plot with the rplot() function from the corrr package.
Code
# Create plot of correlation matrix
nba_salary_data |>
corrr::correlate(diagonal = NA) |>
rplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
Lastly, we can create a matrix of all pairwise scatter plots and correlations with the ggpairs() function from the GGally package.
Code
# Create scatter plot matrix
nba_salary_data |>
select(where(is.numeric)) |>
ggpairs() +
theme_few(base_size = 8)
Code
# Calculating descriptive statistics stratified by position
nba_statistics <- nba_salary_data |>
group_by(position) |>
summarize(
N = n(),
Minimum = min(salary_millions, na.rm = TRUE),
Q1 = quantile(salary_millions, na.rm = TRUE, probs = 0.25),
Median = median(salary_millions, na.rm = TRUE),
Q3 = quantile(salary_millions, na.rm = TRUE, probs = 0.75),
Maximum = max(salary_millions, na.rm = TRUE),
Mean = mean(salary_millions, na.rm = TRUE),
Range = Maximum - Minimum,
SD = sd(salary_millions, na.rm = TRUE),
)
# Printing table of statistics stratified by player position
nba_statistics |>
gt() |>
fmt_number(columns = everything(),
decimals = 2,
drop_trailing_zeros = TRUE)
| position | N | Minimum | Q1 | Median | Q3 | Maximum | Mean | Range | SD |
|---|---|---|---|---|---|---|---|---|---|
| C | 37 | 2.35 | 8.33 | 15 | 33.94 | 55.22 | 21.04 | 52.87 | 16.39 |
| PF | 35 | 2.55 | 13.82 | 22.41 | 39.05 | 54.71 | 25.48 | 52.16 | 15.66 |
| PG | 31 | 2.35 | 10.39 | 32.5 | 39.05 | 59.61 | 28.34 | 57.26 | 16.94 |
| SF | 25 | 3.47 | 14.38 | 24.9 | 38.1 | 54.13 | 26.64 | 50.66 | 15.14 |
| SG | 37 | 2.35 | 11.55 | 16.88 | 28.1 | 53.67 | 20.91 | 51.32 | 14.35 |
Based on the scatter plot and correlations, what can we say about the relationship between points scored per game and player salaries?
Based on the scatter plot and correlations, what can we say about the relationship between rebounds per game and player salaries?
Provide a statement of our assumed / theoretical multiple linear regression model using the player salaries for the 2025-2026 season in millions of dollars as our response and the average points scored and average rebounds per game in the 2024-2025 season as our predictors.
Fitting a Model
Let’s fit the corresponding MLR model, and look at some diagnostic plots to see how model assumptions look.
Code
# Fitting two-predictor multiple linear regression model
mlr_model <- lm(salary_millions ~ points_per_game + rebounds_per_game,
data = nba_salary_data)
Code
# Several diagnostic plots at once
autoplot(mlr_model, label.repel = TRUE)
Code
# Cook's Distance plot
autoplot(mlr_model, which = 4, label.repel = TRUE)@plots[[1]]
Code
# Extracting vector of residuals
model_residuals <- mlr_model$residuals
# Making histogram of residuals
tibble(Residual = model_residuals) |>
ggplot(aes(x = Residual)) +
geom_histogram(aes(y = after_stat(density)), fill = "#56B4E9", color = "white") +
stat_function(fun = dnorm, args = list(mean = 0, sd = sd(model_residuals)),
color = "#009E73") +
scale_y_continuous(expand = expansion(mult = c(0, .01))) +
labs(y = "Density") +
theme_bw() +
theme(panel.grid = element_blank(),
text = element_text(face = "bold"))
Code
# Variance inflation values
vif_plot(mlr_model)
Indicate if each assumption for fitting the multiple linear regression model is met, providing specific evidence to support your conclusions from the obtained output.
Interpreting model estimates
Below is output for the model using the player salary as our response and average points scored per game and number of rebounds per game as our predictors.
Code
# Printing model output
mlr_model |>
tbl_regression(intercept = TRUE) |>
add_vif() |>
add_glance_source_note()
| Characteristic | Beta | 95% CI | p-value | VIF |
|---|---|---|---|---|
| (Intercept) | -5.6 | -9.2, -2.0 | 0.003 | |
| points_per_game | 1.8 | 1.6, 2.0 | <0.001 | 1.2 |
| rebounds_per_game | 0.72 | 0.15, 1.3 | 0.014 | 1.2 |
| Abbreviations: CI = Confidence Interval, VIF = Variance Inflation Factor | ||||
| R² = 0.691; Adjusted R² = 0.688; Sigma = 8.84; Statistic = 182; p-value = <0.001; df = 2; Log-likelihood = -592; AIC = 1,192; BIC = 1,205; Deviance = 12,662; Residual df = 162; No. Obs. = 165 | ||||
Using the output in Table 5, provide a statement of the estimated regression equation.
Interpret the estimated slopes in context.
Interpret the estimated intercept in context. Is this sensible in this context?
Interpret the confidence interval for the slope for points per game.
What does the model predict LeBron’s 2025-2026 salary to be? Use the output in Table 5 to confirm the value in Table 6 below. Note: the calculated value may differ from the fitted value in Table 6 due to rounding.
What is the corresponding residual for LeBron?
What does the residual mean in this context?
Code
# Displaying predicted value for LeBron
nba_salary_data |>
augment_columns(x = mlr_model) |>
dplyr::filter(player == "LeBron James") |>
dplyr::select(player:points_per_game, rebounds_per_game, .fitted) |>
gt()
| player | team | salary_millions | position | points_per_game | rebounds_per_game | .fitted |
|---|---|---|---|---|---|---|
| LeBron James | LAL | 52.62715 | SF | 24.4 | 7.8 | 43.05807 |
We could also obtain a prediction interval for a player with LeBron’s statistics using the code below.
Code
# Obtaining predicted value & 95% prediction interval
prediction_data <- tibble(points_per_game = 24.4,
rebounds_per_game = 7.8)
predict(mlr_model, newdata = prediction_data,
interval = "prediction", level = 0.95) |>
as_tibble() |>
gt() |>
fmt_number(columns = everything(),
decimals = 2)
| fit | lwr | upr |
|---|---|---|
| 43.06 | 25.43 | 60.68 |
Functions of Predictors
Let’s reconsider the relationship between player salaries in millions of dollars and rebounds per game.
Code
# Creating scatter plot with a linear fit
nba_salary_data |>
ggplot(aes(x = rebounds_per_game,
y = salary_millions)) +
geom_point() +
scale_y_continuous(labels = scales::label_currency(prefix = "$")) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Linear fit",
y = "NBA player salary (millions USD)",
x = "Average rebounds per game")
The relationship between rebounds per game and player salaries may be better modeled using a quadratic regression model.
In general, predictors in a model, the \(x_j\)’s, can be functions of a variable such as a squared term \(x_2 = x_1^2\).
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \]
\[ = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \varepsilon\]
The example model above includes a quadratic term for \(x_1\). This is useful if there appears to be a quadratic relationship between \(x\) and \(y\).
Code
# Creating scatter plot with quadratic fit
nba_salary_data |>
ggplot(aes(x = rebounds_per_game,
y = salary_millions)) +
geom_point() +
scale_y_continuous(labels = scales::label_currency(prefix = "$")) +
stat_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
labs(title = "Quadratic fit",
y = "NBA player salary (millions USD)",
x = "Average rebounds per game")
Code
# Fitting a quadratic regression model
mlr_model2 <- lm(salary_millions ~ rebounds_per_game + I(rebounds_per_game^2),
data = nba_salary_data)
# Printing quadratic model output
mlr_model2 |>
tbl_regression(intercept = TRUE) |>
add_glance_source_note()
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 3.1 | -7.3, 14 | 0.6 |
| rebounds_per_game | 5.5 | 1.9, 9.0 | 0.003 |
| I(rebounds_per_game^2) | -0.22 | -0.47, 0.04 | 0.10 |
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.189; Adjusted R² = 0.179; Sigma = 14.3; Statistic = 18.9; p-value = <0.001; df = 2; Log-likelihood = -672; AIC = 1,352; BIC = 1,364; Deviance = 33,277; Residual df = 162; No. Obs. = 165 | |||
Based on the quadratic model, what is the estimated player salary for a player who averaged 5 rebounds per game in the 2024-2025 season?
Is this appropriate to interpret in this context? Why or why not?
For a quadratic regression model with a single predictor, the turning point or vertex of the parabola occurs at
\[x_{\text{vertex}} = -\frac{\hat{\beta}_1}{2\hat{\beta}_2}\]
This is a low-point for the expected or predicted value of \(y\) if \(\hat{\beta}_2 > 0\) or a high-point if \(\hat{\beta}_2 < 0\).
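As a quick illustration of the vertex formula, here is a sketch with hypothetical estimates (not the values from the NBA model):

```r
# Hypothetical estimates for illustration only
beta_1 <- 4     # coefficient on x
beta_2 <- -0.5  # coefficient on x^2 (negative, so the vertex is a high-point)

x_vertex <- -beta_1 / (2 * beta_2)
x_vertex  # 4
```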
Using the output in Table 8, confirm that \(x_{\text{vertex}} = 12.5\).
What is the expected value of \(y\) when \(x = x_{\text{vertex}}\)? Is this a high-point or a low-point?
Assessing Individual Predictors
When fitting a MLR model, there are often many candidate predictors available for inclusion in our final model. In general, we want a model with predictor variables that effectively explain variation in the response variable, not predictor variables that merely fit random noise. Moreover, we want to discriminate between these two types of predictors in a data-driven way to help us select a final model.
There are several data-driven methods for conducting model selection in multiple linear regression, but in this activity we will use Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC) which are numerical measures that describe the goodness-of-fit for a selected model based on statistical theory.
AIC and BIC both penalize for a larger number of variables used in a model while simultaneously rewarding how well a model explains variability in the response variable. However, BIC nearly always penalizes more for including more predictors (whenever the sample size \(n > 7\)). Models with lower AIC and BIC values are preferred.
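In R, the AIC() and BIC() functions compare fitted models directly. Here is a brief sketch using the built-in mtcars data for illustration (not the NBA data):

```r
# Sketch: comparing candidate models with AIC() and BIC()
# using R's built-in mtcars data; lower values are preferred
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

AIC(fit1, fit2)
BIC(fit1, fit2)
```

Passing multiple fitted models to AIC() or BIC() returns a small table, making side-by-side comparison of candidate models straightforward.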
Code
# Fitting a linear regression model
rebounds_model <- lm(salary_millions ~ rebounds_per_game,
data = nba_salary_data)
# Printing linear model output
rebounds_model |>
tbl_regression(intercept = TRUE) |>
add_glance_source_note()
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 11 | 5.9, 16 | <0.001 |
| rebounds_per_game | 2.6 | 1.7, 3.4 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.175; Adjusted R² = 0.170; Sigma = 14.4; Statistic = 34.6; p-value = <0.001; df = 1; Log-likelihood = -673; AIC = 1,353; BIC = 1,362; Deviance = 33,841; Residual df = 163; No. Obs. = 165 | |||
If a higher order term such as \(x_1^2\) is included in a regression model, one should also include the lower powers of that predictor \((x_1)\) to maintain interpretability. However, if predictive performance is the only goal, lower order terms can be dropped when they are not considered useful.
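One way to follow this guidance in R is to include both terms explicitly with I(), or to use poly(), which generates the linear and quadratic terms together. A minimal sketch with the built-in mtcars data:

```r
# Sketch: two ways to include a quadratic term while keeping the linear term
# (built-in mtcars data for illustration)
fit_raw  <- lm(mpg ~ hp + I(hp^2), data = mtcars)  # raw polynomial terms
fit_poly <- lm(mpg ~ poly(hp, 2), data = mtcars)   # orthogonal polynomials

# Both parameterizations produce identical fitted values
all.equal(fitted(fit_raw), fitted(fit_poly))
```

The coefficient estimates differ between the two parameterizations, but the fitted values and predictions are the same.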
Assessing Model Fit
When fitting a MLR model, it is important to assess the fit of the model. In addition to AIC and BIC, another commonly used statistic to assess the fit of a regression model is the coefficient of multiple determination (\(r^2\)).
The coefficient of multiple determination \(r^2\) represents the proportion of total variability in the response that is explained by our regression model. If \(r^2 = 0.75\) for a model containing predictors \(x_1\) and \(x_2\), this is interpreted as:
“About \(75\%\) of the variability in \(y\) is explained by \(x_1\) and \(x_2\).”
For SLR, \(r^2\) is equal to the correlation between \(x\) and \(y\) squared. For both SLR and MLR, \(r^2\) is equal to the correlation between the observed and predicted values (\(y_i\)’s and \(\hat{y}_i\)’s) squared.
Unlike AIC and BIC, the \(r^2\) value will never decrease when adding more predictors to a model, so one must be careful of overfitting.
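The fact that \(r^2\) equals the squared correlation between the observed and fitted values can be checked numerically. Here is a quick sketch with the built-in mtcars data (illustration only):

```r
# Sketch: r^2 equals the squared correlation between observed and
# fitted values (built-in mtcars data for illustration)
fit <- lm(mpg ~ wt + hp, data = mtcars)

cor(mtcars$mpg, fitted(fit))^2  # matches summary(fit)$r.squared
```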
A modified version of the coefficient of determination called the adjusted coefficient of determination, \(r^2_{\text{adj}}\), can be used for model selection while avoiding overfitting. In general, a higher \(r^2_{\text{adj}}\) value is preferred for a model. The adjusted \(r^2\) is
\[r^2_{\text{adj}} = 1 - \left( \frac{(1 - r^2)(n - 1)}{n - k - 1} \right),\]
where
\(r^2\) is the coefficient of determination
\(n\) is the number of observations
\(k\) is the number of predictors in the model
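The adjusted \(r^2\) formula can be computed by hand and compared against R's built-in value. This sketch uses the built-in mtcars data (illustration only, not the NBA data):

```r
# Sketch: computing adjusted r-squared by hand (built-in mtcars data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

r_sq <- summary(fit)$r.squared
n <- nrow(mtcars)  # number of observations
k <- 2             # number of predictors in the model

adj_r_sq <- 1 - (1 - r_sq) * (n - 1) / (n - k - 1)
adj_r_sq  # matches summary(fit)$adj.r.squared
```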
The information criterion metrics, AIC and BIC, similarly reward model fit while penalizing the inclusion of additional predictors that bring little value. Thus, when comparing several candidate models, assessing the \(r^2_{\text{adj}}\), AIC, and BIC values is one approach for model selection.
When candidate models have the same values for the selection criterion, a suggested approach is to choose the simpler (more parsimonious) model.
Code
# Fitting single predictor model
mlr_model1 <- lm(salary_millions ~ points_per_game,
data = nba_salary_data)
# Model output for single predictor model
mlr_model1 |>
tbl_regression(intercept = TRUE) |>
add_glance_source_note()
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | -3.4 | -6.6, -0.16 | 0.040 |
| points_per_game | 1.9 | 1.7, 2.1 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.680; Adjusted R² = 0.678; Sigma = 8.98; Statistic = 346; p-value = <0.001; df = 1; Log-likelihood = -595; AIC = 1,197; BIC = 1,206; Deviance = 13,143; Residual df = 163; No. Obs. = 165 | |||
Code
# Fitting two-predictor multiple linear regression model
mlr_model <- lm(salary_millions ~ points_per_game + rebounds_per_game,
data = nba_salary_data)
# Model output for two-predictor model
mlr_model |>
tbl_regression(intercept = TRUE) |>
add_vif() |>
add_glance_source_note()
| Characteristic | Beta | 95% CI | p-value | VIF |
|---|---|---|---|---|
| (Intercept) | -5.6 | -9.2, -2.0 | 0.003 | |
| points_per_game | 1.8 | 1.6, 2.0 | <0.001 | 1.2 |
| rebounds_per_game | 0.72 | 0.15, 1.3 | 0.014 | 1.2 |
| Abbreviations: CI = Confidence Interval, VIF = Variance Inflation Factor | ||||
| R² = 0.691; Adjusted R² = 0.688; Sigma = 8.84; Statistic = 182; p-value = <0.001; df = 2; Log-likelihood = -592; AIC = 1,192; BIC = 1,205; Deviance = 12,662; Residual df = 162; No. Obs. = 165 | ||||
Code
# Fitting a model with points per game and a squared rebounds term
mlr_model2 <- lm(salary_millions ~ points_per_game + I(rebounds_per_game^2),
data = nba_salary_data)
# Model output for quadratic model
mlr_model2 |>
tbl_regression(intercept = TRUE) |>
add_vif() |>
add_glance_source_note()
| Characteristic | Beta | 95% CI | p-value | VIF |
|---|---|---|---|---|
| (Intercept) | -3.8 | -7.0, -0.61 | 0.020 | |
| points_per_game | 1.8 | 1.6, 2.0 | <0.001 | 1.1 |
| I(rebounds_per_game^2) | 0.05 | 0.01, 0.09 | 0.015 | 1.1 |
| Abbreviations: CI = Confidence Interval, VIF = Variance Inflation Factor | ||||
| R² = 0.691; Adjusted R² = 0.687; Sigma = 8.84; Statistic = 181; p-value = <0.001; df = 2; Log-likelihood = -592; AIC = 1,193; BIC = 1,205; Deviance = 12,674; Residual df = 162; No. Obs. = 165 | ||||
Code
# Fitting a model with rebounds per game and a squared points term
mlr_model22 <- lm(salary_millions ~ rebounds_per_game + I(points_per_game^2),
data = nba_salary_data)
# Model output for two predictor quadratic model
mlr_model22 |>
tbl_regression(intercept = TRUE) |>
add_vif() |>
add_glance_source_note()
| Characteristic | Beta | 95% CI | p-value | VIF |
|---|---|---|---|---|
| (Intercept) | 5.2 | 1.9, 8.4 | 0.002 | |
| rebounds_per_game | 0.92 | 0.34, 1.5 | 0.002 | 1.1 |
| I(points_per_game^2) | 0.05 | 0.05, 0.06 | <0.001 | 1.1 |
| Abbreviations: CI = Confidence Interval, VIF = Variance Inflation Factor | ||||
| R² = 0.672; Adjusted R² = 0.668; Sigma = 9.11; Statistic = 166; p-value = <0.001; df = 2; Log-likelihood = -597; AIC = 1,202; BIC = 1,215; Deviance = 13,448; Residual df = 162; No. Obs. = 165 | ||||
Based on the AIC values, which model should we use for making predictions?
Based on the BIC values, which model should we use for making predictions?
Based on the \(r^2_{\text{adj}}\) values, which model should we use for making predictions?
Footnotes
Sports Reference LLC. (n.d.). Basketball-Reference.com. Retrieved September 27, 2025, from https://www.basketball-reference.com↩︎