Multiple Linear Regression
DSA 220 - Introduction to Data Science and Analytics
Learning Objectives
State the Multiple Linear Regression (MLR) model
Check assumptions of the MLR model
Fit and interpret the results of the MLR model
Obtain and interpret predicted values
Assess the fit of the MLR model
Assess multicollinearity among predictors
Select a final model using proper criterion and metrics
Let’s begin by loading some R packages for this activity using the code below. Note: if it is the first time you are using an R package, you may need to install it first using the install.packages() function in the Console pane.
Code
# Load necessary packages
library(tidyverse)
library(ggthemes)
library(gt)
library(corrr)
library(ggfortify)
library(broom)
library(scales)
library(car)
library(GGally)
library(gtsummary)
Set default theme settings for plots, and load a function to display variance inflation factor (VIF) values using the code below.
Code
# Set ggplot theme for visualizations
theme_set(ggthemes::theme_few())
# Load function for visualizing VIF & GVIF values
source("https://raw.githubusercontent.com/dilernia/DSA220/main/Functions/vif_plot.R")
Multiple Linear Regression (MLR)
Multiple linear regression (MLR) is a statistical method used to model the relationship between a single quantitative response variable, denoted by \(y\), and \(k\) explanatory or predictor variables, denoted by \(x_1\), \(x_2\), \(\dots\), \(x_k\). In general, the predictor variables can be either quantitative or categorical variables, but in this activity we will focus on only quantitative predictors.
Some examples where a MLR model would be useful are:
Predicting the selling price of houses using characteristics such as the size in square feet, number of bedrooms, number of bathrooms, and year it was built. 🏠
Forecasting the total sales of a product for a company using advertising expenses, the price of the product, the previous sales numbers for that product, and the time of year. 📈
Exploring the relationship between employee salaries and factors such as education level, years of experience, job title, and gender. 💼 💸
Constructing a model to determine what a typical blood-pressure should be for a patient using information such as the patient’s age, sex, and height. ⚕️🏥
What are the response and predictor variables for the first two examples above, and what is the type of each predictor variable (e.g., categorical or quantitative)?
MLR Model
A general statement of the assumed theoretical MLR model is
\[y = \beta_0 + x_1\beta_1 + x_2\beta_2 +\dots + x_k \beta_k + \varepsilon, \]
where \(y\) is our response variable, \(x_j\) the \(j^{th}\) predictor for \(j=1, 2, \dots, k\), and \(\varepsilon\) the error term, assumed to be independent and identically distributed following a normal distribution. When providing a statement of the theoretical or assumed MLR model, the terms of the model are expressed using general notation, but the response and predictor variables should be specified for the given context.
A statement of the estimated regression equation is
\[\hat{y} = \hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2 + \dots + x_k \hat{\beta}_k, \]
where \(\hat{y}\) is the expected or predicted response value and \(x_j\) the \(j^{th}\) predictor for \(j=1, 2, \dots, k\). When providing a statement of the estimated MLR model, the \(\hat{\beta}\) terms should be replaced with corresponding numerical estimates from the model output, and the estimated response and predictor variables should be specified for the given context.
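As a small illustration of where the \(\hat{\beta}\) estimates live in R, the coef() function extracts them from a fitted lm object. This sketch uses R's built-in mtcars data, not the activity data:

```r
# Illustrative sketch using R's built-in mtcars data (not the activity data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Estimated coefficients: the intercept, then one slope per predictor
coef(fit)
```

These numerical values are what replace the \(\hat{\beta}\) symbols when stating an estimated regression equation.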
MLR Assumptions
As with simple linear regression, multiple linear regression requires a set of conditions or assumptions to be met to obtain reliable estimates and to conduct valid inference related to the model. Note that assumptions 1 through 5 below are identical to the assumptions for simple linear regression. All of the assumptions below must be true to conduct valid inference of the multiple linear regression model.
1. Random sample: observations must be randomly selected from the population of interest.
2. Normality: the residuals must be normally distributed.
   - This assumption can be checked using a histogram or quantile-quantile (QQ) plot of the residuals. A fairly bell-shaped distribution in the histogram supports normality, as do points that do not stray far from the diagonal line on the QQ plot.
3. Homoskedasticity: the residuals must have constant variance / spread across all predicted values.
   - This assumption can be checked using the residual by predicted value plot. A fairly constant vertical spread / range of the residuals across all predicted values supports homoskedasticity. When the residuals do not have constant variance, this is called heteroskedasticity.
4. Independence: the residuals must be independent of the predicted values.
   - This assumption can be checked using the residual by predicted value plot. No apparent pattern in the relationship between the residuals and predicted values supports the independence assumption.
5. Linearity: there must be a linear relationship between each \(x\) and \(y\).
   - This assumption can be checked using the residual by predicted value plot or a scatter plot between each \(x\) and \(y\). Having no systematic or identifiable pattern in the residual by predicted value plot supports the linearity assumption. A linear or straight-line pattern in the scatter plot between each predictor and \(y\) also supports it.
6. No multicollinearity: predictor variables should not be highly correlated with one another.
We check assumptions 1 through 5 in the same way as we did for SLR: by using the corresponding diagnostic plots. To check for multicollinearity, a common approach is to calculate the variance inflation factor (VIF) for each predictor. The VIF for a predictor, \(x_i\), is based on how much of the variability in \(x_i\) is explained by the other predictors; the more of that variability the other predictors explain, the larger the VIF. As a rule of thumb, we will consider a predictor to have multicollinearity issues if its VIF value is greater than 5.
Specifically, the VIF for a predictor \(x_i\) is:
\[\text{VIF}_i = \frac{1}{1 - r_i^2}\]
where \(r_i^2\) is the coefficient of determination from a regression of \(x_i\) on all other predictors, \(\{x_j \}_{j \ne i}\).
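The formula above can be verified by hand in R. This is a minimal sketch using the built-in mtcars data (illustration only, not the NBA activity data):

```r
# Minimal sketch of the VIF formula using R's built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# r_i^2 from regressing the predictor wt on the other predictor hp
r_sq_wt <- summary(lm(wt ~ hp, data = mtcars))$r.squared

# VIF for wt computed by hand; this matches car::vif(fit)["wt"]
vif_wt <- 1 / (1 - r_sq_wt)
vif_wt
```

With only one other predictor, \(r_i^2\) is simply the squared correlation between the two predictors, which makes the by-hand calculation easy to double-check.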
Interpreting Model Estimates
Interpretations of coefficients for MLR are slightly different than for SLR.
General interpretation for \(\hat{\beta}_j\): For every 1 unit increase in \(x_j\), we expect \(y\) to increase / decrease by \(|\hat{\beta}_j|\) units on average, holding all other predictors constant.
General interpretation for \(\hat{\beta}_0\): The expected value of \(y\) given that \(x_1\) through \(x_k\) are all 0 is \(\hat{\beta}_0\) units.
The “holding all other predictors constant” part of the slope interpretation for MLR is an essential part that cannot be excluded.
When multicollinearity is present, the estimated slopes may be unreliable or inaccurate for describing relationships between the predictors and the response variable. Also note that the estimated intercept, \(\hat{\beta}_0\), is only appropriate to interpret when all predictors being 0 is within the range of the values of the observed data set. That is, 0 must fall between the minimum and maximum value for each predictor variable in the observed data set to yield a reliable estimated intercept term. Otherwise, \(\hat{\beta}_0\) is outside the scope of the observed data set, so interpreting it would be considered extrapolation and thus would be inappropriate.
Example: NBA Player Salary Data 🏀 💵
In this activity, we will revisit the NBA compensation data, considering data on NBA player salaries for the 2025–2026 season along with performance metrics from the previous season. Salary data were obtained from https://www.basketball-reference.com/contracts/players.html, and player-level performance statistics were sourced from https://www.basketball-reference.com/leagues/NBA_2025_per_game.html.1
Let’s load the data for the activity, again focusing on players who have played at least 4 years in the NBA already to avoid rookie contracts which are systematically lower than standard NBA contracts, and only considering players who have a listed position.
Code
# Importing NBA salary data from GitHub, removing players on rookie contracts, converting salary to millions of USD, and dropping columns
nba_salary_data <- read_csv("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Data/nba_salary_data.csv") |>
dplyr::filter(years_played >= 4, !is.na(position)) |>
mutate(salary_2025_2026 = salary_2025_2026 / 1e6) |>
dplyr::rename(salary_millions = salary_2025_2026) |>
dplyr::select(player:position, points_per_game, assists_per_game, rebounds_per_game, steals_per_game, years_played)
Exploratory Data Analysis
Below is a data dictionary for the NBA data, and a preview of some randomly selected rows.
| Variable | Description |
|---|---|
| player | Player's name |
| team | Player's team for the 2025-2026 season |
| salary_millions | Player's salary for the 2025-2026 season (in millions of USD) |
| position | Primary position played by the player (e.g., C, PF, SF, SG, PG) |
| points_per_game | Average points scored per game during the previous season |
| assists_per_game | Average assists per game during the previous season |
| rebounds_per_game | Average rebounds per game during the previous season |
| steals_per_game | Average steals per game during the previous season |
| years_played | Total number of years played in the NBA prior to the 2025-2026 season |
| player | team | salary_millions | position | points_per_game | assists_per_game | rebounds_per_game | steals_per_game | years_played |
|---|---|---|---|---|---|---|---|---|
| Stephen Curry | GSW | 59.61 | PG | 24.5 | 6.0 | 4.5 | 1.1 | 16 |
| Joel Embiid | PHI | 55.22 | C | 23.8 | 4.5 | 8.2 | 0.7 | 9 |
| Nikola Jokić | DEN | 55.22 | C | 29.6 | 10.2 | 12.8 | 1.8 | 10 |
| Kevin Durant | HOU | 54.71 | PF | 26.6 | 4.2 | 6.1 | 0.8 | 17 |
| Jayson Tatum | BOS | 54.13 | PF | 26.8 | 6.0 | 8.7 | 1.1 | 8 |
Code
# Creating a scatter plot
nba_salary_data |>
ggplot(aes(x = points_per_game,
y = salary_millions)) +
geom_smooth(method = "lm", se = FALSE) +
geom_point() +
scale_y_continuous(labels = scales::label_currency(prefix = "$"))
Code
# Calculating correlations
correlations_table <- nba_salary_data |>
corrr::correlate(diagonal = 1)
# Printing table of correlations
correlations_table |>
gt() |>
fmt_number(columns = everything(),
decimals = 2,
drop_trailing_zeros = FALSE)
| term | salary_millions | points_per_game | assists_per_game | rebounds_per_game | steals_per_game | years_played |
|---|---|---|---|---|---|---|
| salary_millions | 1.00 | 0.82 | 0.64 | 0.42 | 0.45 | 0.26 |
| points_per_game | 0.82 | 1.00 | 0.72 | 0.39 | 0.46 | 0.11 |
| assists_per_game | 0.64 | 0.72 | 1.00 | 0.22 | 0.54 | 0.14 |
| rebounds_per_game | 0.42 | 0.39 | 0.22 | 1.00 | 0.20 | 0.12 |
| steals_per_game | 0.45 | 0.46 | 0.54 | 0.20 | 1.00 | 0.07 |
| years_played | 0.26 | 0.11 | 0.14 | 0.12 | 0.07 | 1.00 |
We can also visualize all pairwise correlations using a correlation plot with the rplot() function from the corrr package.
Code
# Create plot of correlation matrix
nba_salary_data |>
corrr::correlate(diagonal = NA) |>
rplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
Lastly, we can create a matrix of all pairwise scatter plots and correlations with the ggpairs() function from the GGally package.
Code
# Create scatter plot matrix
nba_salary_data |>
select(where(is.numeric)) |>
ggpairs() +
theme_few(base_size = 8)
Code
# Calculating descriptive statistics stratified by position
nba_statistics <- nba_salary_data |>
group_by(position) |>
summarize(
N = n(),
Minimum = min(salary_millions, na.rm = TRUE),
Q1 = quantile(salary_millions, na.rm = TRUE, probs = 0.25),
Median = median(salary_millions, na.rm = TRUE),
Q3 = quantile(salary_millions, na.rm = TRUE, probs = 0.75),
Maximum = max(salary_millions, na.rm = TRUE),
Mean = mean(salary_millions, na.rm = TRUE),
Range = Maximum - Minimum,
SD = sd(salary_millions, na.rm = TRUE),
)
# Printing table of statistics stratified by player position
nba_statistics |>
gt() |>
fmt_number(columns = everything(),
decimals = 2,
drop_trailing_zeros = TRUE)
| position | N | Minimum | Q1 | Median | Q3 | Maximum | Mean | Range | SD |
|---|---|---|---|---|---|---|---|---|---|
| C | 37 | 2.35 | 8.33 | 15 | 33.94 | 55.22 | 21.04 | 52.87 | 16.39 |
| PF | 35 | 2.55 | 13.82 | 22.41 | 39.05 | 54.71 | 25.48 | 52.16 | 15.66 |
| PG | 31 | 2.35 | 10.39 | 32.5 | 39.05 | 59.61 | 28.34 | 57.26 | 16.94 |
| SF | 25 | 3.47 | 14.38 | 24.9 | 38.1 | 54.13 | 26.64 | 50.66 | 15.14 |
| SG | 37 | 2.35 | 11.55 | 16.88 | 28.1 | 53.67 | 20.91 | 51.32 | 14.35 |
Based on the scatter plot and correlations, what can we say about the relationship between points scored per game and player salaries?
Based on the scatter plot and correlations, what can we say about the relationship between rebounds per game and player salaries?
Provide a statement of our assumed / theoretical multiple linear regression model using the player salaries for the 2025-2026 season in millions of dollars as our response and the average points scored and average rebounds per game in the 2024-2025 season as our predictors.
Fitting a Model
Let’s fit the corresponding MLR model, and look at some diagnostic plots to see how model assumptions look.
Code
# Fitting two-predictor multiple linear regression model
mlr_model <- lm(salary_millions ~ points_per_game + rebounds_per_game,
data = nba_salary_data)
Code
# Several diagnostic plots at once
autoplot(mlr_model, label.repel = TRUE)
Code
# Cook's Distance plot
autoplot(mlr_model, which = 4, label.repel = TRUE)@plots[[1]]
Code
# Extracting vector of residuals
model_residuals <- mlr_model$residuals
# Making histogram of residuals
tibble(Residual = model_residuals) |>
ggplot(aes(x = Residual)) +
geom_histogram(aes(y = after_stat(density)), fill = "#56B4E9", color = "white") +
stat_function(fun = dnorm, args = list(mean = 0, sd = sd(model_residuals)),
color = "#009E73") +
scale_y_continuous(expand = expansion(mult = c(0, .01))) +
labs(y = "Density") +
theme_bw() +
theme(panel.grid = element_blank(),
text = element_text(face = "bold"))
Code
# Variance inflation values
vif_plot(mlr_model)
Indicate if each assumption for fitting the multiple linear regression model is met, providing specific evidence to support your conclusions from the obtained output.
Interpreting model estimates
Below is output for the model using the player salary as our response and average points scored per game and number of rebounds per game as our predictors.
Code
# Printing model output
mlr_model |>
tbl_regression(intercept = TRUE) |>
add_vif() |>
add_glance_source_note()
| Characteristic | Beta | 95% CI | p-value | VIF |
|---|---|---|---|---|
| (Intercept) | -5.6 | -9.2, -2.0 | 0.003 | |
| points_per_game | 1.8 | 1.6, 2.0 | <0.001 | 1.2 |
| rebounds_per_game | 0.72 | 0.15, 1.3 | 0.014 | 1.2 |
| Abbreviations: CI = Confidence Interval, VIF = Variance Inflation Factor | ||||
| R² = 0.691; Adjusted R² = 0.688; Sigma = 8.84; Statistic = 182; p-value = <0.001; df = 2; Log-likelihood = -592; AIC = 1,192; BIC = 1,205; Deviance = 12,662; Residual df = 162; No. Obs. = 165 | ||||
Using the output in Table 5, provide a statement of the estimated regression equation.
Interpret the estimated slopes in context.
Interpret the estimated intercept in context. Is this sensible in this context?
Interpret the confidence interval for the slope for points per game.
What does the model predict LeBron’s 2025-2026 salary to be? Use the output in Table 5 to confirm the value in Table 6 below. Note: the calculated value may differ from the fitted value in Table 6 due to rounding.
What is the corresponding residual for LeBron?
What does the residual mean in this context?
Code
# Displaying predicted value for LeBron
nba_salary_data |>
augment_columns(x = mlr_model) |>
dplyr::filter(player == "LeBron James") |>
dplyr::select(player:points_per_game, rebounds_per_game, .fitted) |>
gt()
| player | team | salary_millions | position | points_per_game | rebounds_per_game | .fitted |
|---|---|---|---|---|---|---|
| LeBron James | LAL | 52.62715 | SF | 24.4 | 7.8 | 43.05807 |
We could also obtain a prediction interval for a player with LeBron’s statistics using the code below.
Code
# Obtaining predicted value & 95% prediction interval
prediction_data <- tibble(points_per_game = 24.4,
rebounds_per_game = 7.8)
predict(mlr_model, newdata = prediction_data,
interval = "prediction", level = 0.95) |>
as_tibble() |>
gt() |>
fmt_number(columns = everything(),
decimals = 2)
| fit | lwr | upr |
|---|---|---|
| 43.06 | 25.43 | 60.68 |
Functions of Predictors
Let’s reconsider the relationship between player salaries in millions of dollars and rebounds per game.
Code
# Creating scatter plot with a linear fit
nba_salary_data |>
ggplot(aes(x = rebounds_per_game,
y = salary_millions)) +
geom_point() +
scale_y_continuous(labels = scales::label_currency(prefix = "$")) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Linear fit",
y = "NBA player salary (millions USD)",
x = "Average rebounds per game")
The relationship between rebounds per game and player salaries may be better modeled using a quadratic regression model.
In general, predictors in a model, the \(x_j\)’s, can be functions of a variable such as a squared term \(x_2 = x_1^2\).
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \]
\[ = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \varepsilon\]
The example model above includes a quadratic term for \(x_1\). This is useful if there appears to be a quadratic relationship between \(x\) and \(y\).
Code
# Creating scatter plot with quadratic fit
nba_salary_data |>
ggplot(aes(x = rebounds_per_game,
y = salary_millions)) +
geom_point() +
scale_y_continuous(labels = scales::label_currency(prefix = "$")) +
stat_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
labs(title = "Quadratic fit",
y = "NBA player salary (millions USD)",
x = "Average rebounds per game")
Code
# Fitting a quadratic regression model
mlr_model2 <- lm(salary_millions ~ rebounds_per_game + I(rebounds_per_game^2),
data = nba_salary_data)
# Printing quadratic model output
mlr_model2 |>
tbl_regression(intercept = TRUE) |>
add_glance_source_note()
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 3.1 | -7.3, 14 | 0.6 |
| rebounds_per_game | 5.5 | 1.9, 9.0 | 0.003 |
| I(rebounds_per_game^2) | -0.22 | -0.47, 0.04 | 0.10 |
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.189; Adjusted R² = 0.179; Sigma = 14.3; Statistic = 18.9; p-value = <0.001; df = 2; Log-likelihood = -672; AIC = 1,352; BIC = 1,364; Deviance = 33,277; Residual df = 162; No. Obs. = 165 | |||
Based on the quadratic model, what is the estimated player salary for a player who averaged 5 rebounds per game in the 2024-2025 season?
Is this appropriate to interpret in this context? Why or why not?
For a quadratic regression model with a single predictor, the turning point or vertex of the parabola occurs at
\[x_{\text{vertex}} = -\frac{\hat{\beta}_1}{2\hat{\beta}_2}\]
This is a low-point for the expected or predicted value of \(y\) if \(\hat{\beta}_2 > 0\) or a high-point if \(\hat{\beta}_2 < 0\).
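As a quick illustration of the vertex formula, here is a sketch with hypothetical estimates (not the values from the NBA model):

```r
# Hypothetical estimates for illustration only
beta_1 <- 4     # coefficient on x
beta_2 <- -0.5  # coefficient on x^2 (negative, so the vertex is a high-point)

x_vertex <- -beta_1 / (2 * beta_2)
x_vertex  # 4
```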
Using the output in Table 8, confirm that \(x_{\text{vertex}} = 12.5\).
What is the expected value of \(y\) when \(x = x_{\text{vertex}}\)? Is this a high-point or a low-point?
Assessing Individual Predictors
When fitting a MLR model, there are often many candidate predictors available for inclusion in our final model. In general, we want a model with predictor variables that effectively explain variation in the response variable, not predictor variables that merely fit random noise. Moreover, we want to discriminate between these two types of predictors in a data-driven way to help us select a final model.
There are several data-driven methods for conducting model selection in multiple linear regression, but in this activity we will use Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC) which are numerical measures that describe the goodness-of-fit for a selected model based on statistical theory.
AIC and BIC both penalize for a larger number of variables used in a model while simultaneously rewarding how well a model explains variability in the response variable. However, BIC nearly always penalizes more for including more predictors (whenever the sample size \(n > 7\)). Models with lower AIC and BIC values are preferred.
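In R, the AIC() and BIC() functions compare fitted models directly. Here is a brief sketch using the built-in mtcars data for illustration (not the NBA data):

```r
# Sketch: comparing candidate models with AIC() and BIC()
# using R's built-in mtcars data; lower values are preferred
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

AIC(fit1, fit2)
BIC(fit1, fit2)
```

Passing multiple fitted models to AIC() or BIC() returns a small table, making side-by-side comparison of candidate models straightforward.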
Code
# Fitting a linear regression model
rebounds_model <- lm(salary_millions ~ rebounds_per_game,
data = nba_salary_data)
# Printing linear model output
rebounds_model |>
tbl_regression(intercept = TRUE) |>
add_glance_source_note()
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 11 | 5.9, 16 | <0.001 |
| rebounds_per_game | 2.6 | 1.7, 3.4 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.175; Adjusted R² = 0.170; Sigma = 14.4; Statistic = 34.6; p-value = <0.001; df = 1; Log-likelihood = -673; AIC = 1,353; BIC = 1,362; Deviance = 33,841; Residual df = 163; No. Obs. = 165 | |||
If a higher order term such as \(x_1^2\) is included in a regression model, one should also include the lower powers of that predictor \((x_1)\) to maintain interpretability. However, if predictive performance is the only goal, lower order terms can be dropped when they are not considered useful.
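One way to follow this guidance in R is to include both terms explicitly with I(), or to use poly(), which generates the linear and quadratic terms together. A minimal sketch with the built-in mtcars data:

```r
# Sketch: two ways to include a quadratic term while keeping the linear term
# (built-in mtcars data for illustration)
fit_raw  <- lm(mpg ~ hp + I(hp^2), data = mtcars)  # raw polynomial terms
fit_poly <- lm(mpg ~ poly(hp, 2), data = mtcars)   # orthogonal polynomials

# Both parameterizations produce identical fitted values
all.equal(fitted(fit_raw), fitted(fit_poly))
```

The coefficient estimates differ between the two parameterizations, but the fitted values and predictions are the same.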
Assessing Model Fit
When fitting a MLR model, it is important to assess the fit of the model. In addition to AIC and BIC, another commonly used statistic to assess the fit of a regression model is the coefficient of multiple determination (\(r^2\)).
The coefficient of multiple determination \(r^2\) represents the proportion of total variability in the response that is explained by our regression model. If \(r^2 = 0.75\) for a model containing predictors \(x_1\) and \(x_2\), this is interpreted as:
“About \(75\%\) of the variability in \(y\) is explained by \(x_1\) and \(x_2\).”
For SLR, \(r^2\) is equal to the correlation between \(x\) and \(y\) squared. For both SLR and MLR, \(r^2\) is equal to the correlation between the observed and predicted values (\(y_i\)’s and \(\hat{y}_i\)’s) squared.
Unlike AIC and BIC, the \(r^2\) value will never decrease when adding more predictors to a model, so one must be careful of overfitting.
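The fact that \(r^2\) equals the squared correlation between the observed and fitted values can be checked numerically. Here is a quick sketch with the built-in mtcars data (illustration only):

```r
# Sketch: r^2 equals the squared correlation between observed and
# fitted values (built-in mtcars data for illustration)
fit <- lm(mpg ~ wt + hp, data = mtcars)

cor(mtcars$mpg, fitted(fit))^2  # matches summary(fit)$r.squared
```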
A modified version of the coefficient of determination called the adjusted coefficient of determination, \(r^2_{\text{adj}}\), can be used for model selection while avoiding overfitting. In general, a higher \(r^2_{\text{adj}}\) value is preferred for a model. The adjusted \(r^2\) is
\[r^2_{\text{adj}} = 1 - \left( \frac{(1 - r^2)(n - 1)}{n - k - 1} \right),\]
where
\(r^2\) is the coefficient of determination
\(n\) is the number of observations
\(k\) is the number of predictors in the model
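The adjusted \(r^2\) formula can be computed by hand and compared against R's built-in value. This sketch uses the built-in mtcars data (illustration only, not the NBA data):

```r
# Sketch: computing adjusted r-squared by hand (built-in mtcars data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

r_sq <- summary(fit)$r.squared
n <- nrow(mtcars)  # number of observations
k <- 2             # number of predictors in the model

adj_r_sq <- 1 - (1 - r_sq) * (n - 1) / (n - k - 1)
adj_r_sq  # matches summary(fit)$adj.r.squared
```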
The information criterion metrics, AIC and BIC, similarly reward model fit while penalizing the inclusion of additional predictors that bring little value. Thus, when comparing several candidate models, assessing the \(r^2_{\text{adj}}\), AIC, and BIC values is one approach for model selection.
When candidate models have the same values for the selection criterion, a suggested approach is to choose the simpler (more parsimonious) model.
Code
# Fitting single predictor model
mlr_model1 <- lm(salary_millions ~ points_per_game,
data = nba_salary_data)
# Model output for single predictor model
mlr_model1 |>
tbl_regression(intercept = TRUE) |>
add_glance_source_note()
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | -3.4 | -6.6, -0.16 | 0.040 |
| points_per_game | 1.9 | 1.7, 2.1 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.680; Adjusted R² = 0.678; Sigma = 8.98; Statistic = 346; p-value = <0.001; df = 1; Log-likelihood = -595; AIC = 1,197; BIC = 1,206; Deviance = 13,143; Residual df = 163; No. Obs. = 165 | |||
Code
# Fitting two-predictor multiple linear regression model
mlr_model <- lm(salary_millions ~ points_per_game + rebounds_per_game,
data = nba_salary_data)
# Model output for two-predictor model
mlr_model |>
tbl_regression(intercept = TRUE) |>
add_vif() |>
add_glance_source_note()
| Characteristic | Beta | 95% CI | p-value | VIF |
|---|---|---|---|---|
| (Intercept) | -5.6 | -9.2, -2.0 | 0.003 | |
| points_per_game | 1.8 | 1.6, 2.0 | <0.001 | 1.2 |
| rebounds_per_game | 0.72 | 0.15, 1.3 | 0.014 | 1.2 |
| Abbreviations: CI = Confidence Interval, VIF = Variance Inflation Factor | ||||
| R² = 0.691; Adjusted R² = 0.688; Sigma = 8.84; Statistic = 182; p-value = <0.001; df = 2; Log-likelihood = -592; AIC = 1,192; BIC = 1,205; Deviance = 12,662; Residual df = 162; No. Obs. = 165 | ||||
Code
# Fitting a model with points per game and a squared rebounds term
mlr_model2 <- lm(salary_millions ~ points_per_game + I(rebounds_per_game^2),
data = nba_salary_data)
# Model output for quadratic model
mlr_model2 |>
tbl_regression(intercept = TRUE) |>
add_vif() |>
add_glance_source_note()
| Characteristic | Beta | 95% CI | p-value | VIF |
|---|---|---|---|---|
| (Intercept) | -3.8 | -7.0, -0.61 | 0.020 | |
| points_per_game | 1.8 | 1.6, 2.0 | <0.001 | 1.1 |
| I(rebounds_per_game^2) | 0.05 | 0.01, 0.09 | 0.015 | 1.1 |
| Abbreviations: CI = Confidence Interval, VIF = Variance Inflation Factor | ||||
| R² = 0.691; Adjusted R² = 0.687; Sigma = 8.84; Statistic = 181; p-value = <0.001; df = 2; Log-likelihood = -592; AIC = 1,193; BIC = 1,205; Deviance = 12,674; Residual df = 162; No. Obs. = 165 | ||||
Code
# Fitting a model with rebounds per game and a squared points term
mlr_model22 <- lm(salary_millions ~ rebounds_per_game + I(points_per_game^2),
data = nba_salary_data)
# Model output for two predictor quadratic model
mlr_model22 |>
tbl_regression(intercept = TRUE) |>
add_vif() |>
add_glance_source_note()
| Characteristic | Beta | 95% CI | p-value | VIF |
|---|---|---|---|---|
| (Intercept) | 5.2 | 1.9, 8.4 | 0.002 | |
| rebounds_per_game | 0.92 | 0.34, 1.5 | 0.002 | 1.1 |
| I(points_per_game^2) | 0.05 | 0.05, 0.06 | <0.001 | 1.1 |
| Abbreviations: CI = Confidence Interval, VIF = Variance Inflation Factor | ||||
| R² = 0.672; Adjusted R² = 0.668; Sigma = 9.11; Statistic = 166; p-value = <0.001; df = 2; Log-likelihood = -597; AIC = 1,202; BIC = 1,215; Deviance = 13,448; Residual df = 162; No. Obs. = 165 | ||||
Based on the AIC values, which model should we use for making predictions?
Based on the BIC values, which model should we use for making predictions?
Based on the \(r^2_{\text{adj}}\) values, which model should we use for making predictions?
Footnotes
Sports Reference LLC. (n.d.). Basketball-Reference.com. Retrieved September 27, 2025, from https://www.basketball-reference.com↩︎