Exploratory Data Analysis

DSA 220 - Introduction to Data Science and Analytics

Author

Andrew DiLernia

Learning Objectives

Summarize key characteristics of data using univariate descriptive statistics
Visualize univariate distributions with appropriate plots
Calculate and interpret correlations
Compare numeric variables across subgroups using summary statistics and visualizations
Create bivariate visualizations to explore relationships
Apply multivariate visualization techniques (color, faceting) to represent three or more variables
Identify potential outliers and summarize patterns of missingness

In this activity we will analyze the Palmer Penguins data set: a data set consisting of measurements collected on 344 penguins from 3 islands in Palmer Archipelago, Antarctica. Specifically, data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research Network. Researchers collected data by capturing and releasing breeding adult penguins across the islands at their nests.

*Artwork by @allison_horst*

Let’s load some R packages to start, and set a default theme for all visualizations made with ggplot2.

Code

# Loading R packages
library(tidyverse)
library(gt)
library(corrr)
library(ggthemes)
library(skimr)

# Setting default ggplot2 theme
theme_set(theme_bw())

The code above sets a default ggplot theme for the entire document. We can update this code to a complete theme of our choice from the ggplot2 package or the ggthemes package.

Then, we can load the data for the activity.

Code

# Importing penguins data from GitHub
penguins <- read_csv("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Data/penguins.csv")

Below are a data dictionary for the Palmer Penguins data, and a preview of some randomly selected rows.

Table 1: Data dictionary for the Palmer Penguins data set.

Variable	Description
species	Species of the penguin
island	Island the penguin was found on
bill_length_mm	Bill length (mm)
bill_depth_mm	Bill depth (mm)
flipper_length_mm	Flipper length (mm)
body_mass_g	Body mass (g)
sex	Sex of the penguin
year	Year data was collected

Table 2: Randomly selected rows from the Palmer Penguins data set.

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Chinstrap	Dream	50.2	18.7	198	3775	female	2009
Adelie	Biscoe	39.7	18.9	184	3550	male	2009
Chinstrap	Dream	45.4	18.7	188	3525	female	2007
Gentoo	Biscoe	46.2	14.1	217	4375	female	2009
Gentoo	Biscoe	50.0	16.3	230	5700	male	2007

Tip

In Table 2, being as specific as possible, what is the type of each variable?
Which data format would be best for storing all data available on the Palmer Penguins (e.g., XML, CSV, or JSON)?
Which sampling technique was used to collect the Palmer Penguins data?
Which, if any, landmark data privacy law(s) would apply to this scenario?

Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) is the process of analyzing data to explore key characteristics, commonly using data visualizations and descriptive statistics.

Conducting EDA is contingent upon one’s data already being cleaned, organized in a tidy format (i.e., variables are columns and observations are rows), and imported into R or another software tool so that it is ready for analysis.

EDA is an essential skill for being a statistician or data scientist, with its main purposes being:

Identifying outliers
Exploring patterns of missing data
Formulating hypotheses for studies based on new data
To generally improve understanding of our data

Two main components of EDA are descriptive statistics and data visualizations.

Descriptive statistics

Descriptive statistics are numerical measures describing our observed data set called our sample. Descriptive statistics are different from inferential statistics which are used to learn about the population of interest.

Data visualization

Data visualization is the process of graphically representing our data. The main goals of data visualization are to:

Efficiently present data in a compact space
Make data sets coherent / understandable to our target audience
Serve a clear purpose
Not distort what the data has to say

For a more detailed set of guidelines regarding best practices of data visualization, see this guide created by people working at the pharmaceutical company Novartis.

Univariate analyses

Univariate descriptive statistics

For a single numeric variable, some commonly used univariate statistics are the mean, median, minimum, maximum, range, and standard deviation (SD).

Let’s look at these statistics for the flipper lengths in millimeters (mm) of the Palmer Penguins.

Code

# Calculating descriptive statistics
quant_1_stats <- penguins |> 
  summarize(
  Minimum = min(flipper_length_mm, na.rm = TRUE),
  Q1 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.25),
  Median = median(flipper_length_mm, na.rm = TRUE),
  Q3 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.75),
  Maximum = max(flipper_length_mm, na.rm = TRUE),
  Mean = mean(flipper_length_mm, na.rm = TRUE),
  Range = Maximum - Minimum,
  SD = sd(flipper_length_mm, na.rm = TRUE),
)

# Printing table of statistics
quant_1_stats |> 
    gt() |>
    fmt_number(columns = everything(), 
               decimals = 2,
               drop_trailing_zeros = TRUE)

Table 3: Table of summary statistics for penguin flipper lengths in mm.

Minimum	Q1	Median	Q3	Maximum	Mean	Range	SD
172	190	197	213	231	200.92	59	14.06

Note that this data set contains missing values, denoted as NA in R, but these were excluded when calculating the sample statistics above. This is how we will often handle missing values moving forward for this course.

Missing data are important to be aware of as their source should be investigated or questioned. Although we won’t be covering methods for handling missing data in this activity, some resources are Flexible Imputation of Missing Data by Stef van Buuren (2018) or the naniar and mice R packages.

Tip

What are the largest and smallest flipper lengths for penguins in this data set?

The mean and median describe the middle or center of a set of values. The mean gives the average value of a variable, while the median is the number such that $50\%$ of values are greater than or equal to this value, and $50\%$ are less than or equal to this value. The mean is sensitive to extreme or outlier observations, while the median is robust (less sensitive) to extreme values / outliers, so the median is typically a more appropriate measure of center when outliers are present.

Tip

Provide the value of the sample mean flipper length for the penguins.
Provide and interpret the value of the sample median flipper length for the penguins.
Provide and interpret the value of the first and third quartiles of the flipper lengths for the penguins.

Other commonly used statistics for numeric variables include the range and standard deviation which provide measures of dispersion or variability. Measures of variability describe how spread out values are for a particular variable.

Tip

Which statistic is more sensitive to outliers: the range or the interquartile range (IQR)?

Frequencies, also known as counts, are useful descriptive statistics for categorical variables. The categorical variables in the Palmer Penguins data are species, island, and sex.

Code

# Printing frequency table
penguins |> 
  dplyr::count(species) |> 
  gt()

Table 4: Table of counts for species.

species	n
Adelie	152
Chinstrap	68
Gentoo	124

Table 5: Table of counts for island.

island	n
Biscoe	168
Dream	124
Torgersen	52

Table 6: Table of counts for sex.

sex	n
female	165
male	168
NA	11

Tip

Which penguin species had the most observations in this data set?
Which island had the least penguins measured on it?
Are there more male or female penguins in this data set?

Notice the NA in the frequency table for the sex variable. This is due to the sex of some penguins being missing.

Univariate visualizations

Data visualization is a key aspect of EDA. Univariate data visualizations are useful for understanding characteristics of a single variable in an efficient manner and also to identify potential outliers or unusual observations. It is important to include units of measurement whenever possible in data visualizations.

For a single numeric variable, two commonly used data visualizations are the box plot and histogram. We make a histogram and a box plot for the penguin flipper lengths below.

Code

# Creating a histogram
penguins |> 
  ggplot(aes(x = flipper_length_mm)) + 
  geom_histogram(color = "white") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
  labs(title = "Distribution of penguin flipper lengths",
       x = "Flipper length (mm)",
       y = "Frequency",
       caption = "Data source: palmerpenguins R package")

Tip

What can we say about the penguin flipper lengths based on the histogram?

Code

# Creating a box plot
penguins |> 
  ggplot(aes(x = flipper_length_mm)) + 
  geom_boxplot() +
  scale_y_discrete(breaks = NULL) +
    labs(title = "Distribution of penguin flipper lengths",
       x = "Flipper length (mm)",
       caption = "Data source: palmerpenguins R package")

Tip

What can we say about the penguin flipper lengths based on the box plot? Are there any outliers present?

For a single categorical variable, a commonly used data visualization is the bar chart. We have a bar chart displaying the number of penguins of each species below.

Code

# Creating vector of colorblind-friendly colors
penguin_colors <- c("Chinstrap" = "#CC79A7", "Gentoo" = "#009E73", "Adelie" = "#E69F00")

It is considered best practice to use colorblind friendly colors whenever possible in data visualizations. Here we are using colorblind friendly colors to differentiate between the penguin species using hex codes. Hex codes are six-digit alphanumeric codes that specify colors in a relatively software agnostic manner. For example, “#009E73” refers to a specific green hue that will be consistent in appearance across both R and Python, whereas “green” is not necessarily the same exact hue for different software tools.

Code

# Creating a bar chart
penguins |> 
  dplyr::count(species, .drop = FALSE) |> 
  mutate(species = fct_reorder(species, n)) |> 
  ggplot(aes(x = species, y = n,
             fill = species)) + 
  geom_col(color = "black") +
  scale_fill_manual(values = penguin_colors) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
    labs(title = "Distribution of penguin species",
       x = "Species",
       y = "Frequency",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "none")

Tip

What can we say about the penguin species based on the bar chart?

Bivariate analyses

Bivariate descriptive statistics

For two numeric variables, the sample correlation coefficient, $r$, is a commonly used statistic which describes both the strength and direction of a linear relationship between two variables. Correlations range from -1 to 1, with values closer to -1 indicating a stronger, negative linear relationship (as one variable increases, the other decreases), values closer to 0 indicating a weaker linear relationship, and values closer to 1 indicating a stronger, positive and linear relationship (as one variable increases, the other increases).

Correlations are a useful starting point, but should be used with caution as they have a few limitations:

They only measure linear relationships, but many relationships are non-linear. ⎰
A strong correlation does not imply causation, e.g., monthly ice cream sales may be strongly correlated with the number of monthly shark attacks, but that does not necessarily mean one causes the other. 🍦🦈
Confounding variables can distort the relationship between variables of interest.
The correlation coefficient is sensitive to outliers.

Tip

What is a confounding variable between ice cream sales and number of shark attacks?

Let’s look at the correlations for the penguins data:

Code

# Calculating correlations
correlation_table <- penguins |> 
  corrr::correlate(diagonal = 1)

# Printing table of correlations
correlation_table |> 
    gt() |>
    fmt_number(columns = everything(), decimals = 3)

Table 7: Table of pairwise correlations.

term	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	year
bill_length_mm	1.000	−0.235	0.656	0.595	0.055
bill_depth_mm	−0.235	1.000	−0.584	−0.472	−0.060
flipper_length_mm	0.656	−0.584	1.000	0.871	0.170
body_mass_g	0.595	−0.472	0.871	1.000	0.042
year	0.055	−0.060	0.170	0.042	1.000

Tip

Which variables have the strongest correlation?
Which variables have the weakest correlation?

For one numeric and one categorical variable, we can look at the same univariate statistics we have previously, but stratified by the groups of the categorical variable. Let’s look at statistics for the flipper lengths of the penguins stratified by species.

Code

# Calculating descriptive statistics
quant_2_stats <- penguins |> 
  group_by(species) |> 
  summarize(
  Minimum = min(flipper_length_mm, na.rm = TRUE),
  Q1 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.25),
  Median = median(flipper_length_mm, na.rm = TRUE),
  Q3 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.75),
  Maximum = max(flipper_length_mm, na.rm = TRUE),
  Mean = mean(flipper_length_mm, na.rm = TRUE),
  Range = Maximum - Minimum,
  SD = sd(flipper_length_mm, na.rm = TRUE))

# Printing table of statistics
quant_2_stats |> 
    gt() |>
    fmt_number(columns = everything(), 
               decimals = 2,
               drop_trailing_zeros = TRUE)

Table 8: Table of groupwise summary statistics.

species	Minimum	Q1	Median	Q3	Maximum	Mean	Range	SD
Adelie	172	186	190	195	210	189.95	38	6.54
Chinstrap	178	191	196	201	212	195.82	34	7.13
Gentoo	203	212	216	221	231	217.19	28	6.48

Tip

Which penguin species typically had the largest flipper lengths?
Which penguin species had the most variability in their flipper lengths?

For two categorical variables, a table of counts is commonly calculated as below.

Code

# Calculating counts for cross-section of species and island
species_island_counts <- penguins |> 
  dplyr::count(species, island)

# Printing frequency table
species_island_counts |> 
  gt()

Table 9: Table of counts for the cross section of species and island.

species	island	n
Adelie	Biscoe	44
Adelie	Dream	56
Adelie	Torgersen	52
Chinstrap	Dream	68
Gentoo	Biscoe	124

Tip

Was any penguin species found on more than one island?
How many penguins were found on Dream island?

Bivariate visualizations

For two numeric variables, the scatter plot is the ideal visualization. Let’s look at the relationship between the penguin flipper lengths and body masses.

Code

# Creating a scatter plot
penguins |> 
  ggplot(aes(x = body_mass_g, y = flipper_length_mm)) + 
  geom_point(pch = 21, color = "white", fill = "black") +
    labs(title = "Penguin flipper lengths by body mass",
       x = "Body mass (g)",
       y = "Flipper length (mm)",
       caption = "Data source: palmerpenguins R package")

We can also add a straight line of best fit to scatter plots as well.

Code

# Creating a scatter plot with a line of best fit
penguins |> 
  ggplot(aes(x = body_mass_g, y = flipper_length_mm)) + 
  geom_point(pch = 21, color = "white", fill = "black") +
  geom_smooth(method = 'lm', se = FALSE) +
      labs(title = "Penguin flipper lengths by body mass",
       x = "Body mass (g)",
       y = "Flipper length (mm)",
       caption = "Data source: palmerpenguins R package")

Tip

What do we observe based on the scatter plot above?

For one numeric and one categorical variable, side-by-side box plots are a useful visualization. We can use a side-by-side box plot to explore the penguin flipper lengths by species as below.

Code

# Creating side-by-side box plots
penguins |> 
  ggplot(aes(x = species, y = flipper_length_mm, fill = species)) + 
  geom_boxplot() + 
  scale_fill_manual(values = penguin_colors) +
        labs(title = "Penguin flipper lengths by species",
       x = "Species",
       y = "Flipper length (mm)",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "none")

Tip

What do we observe based on the side-by-side box plots?

Lastly, for two categorical variables, a clustered bar chart or dumbbell chart are the more commonly used visualizations.

Code

# Creating a clustered bar chart
penguins |> dplyr::count(species, sex, .drop = FALSE) |> 
    dplyr::filter(!is.na(species), !is.na(sex)) |> 
  mutate(sex = fct_reorder(sex, n)) |> 
  ggplot(aes(x = sex, y = n,
             fill = species)) + 
  geom_col(position="dodge", color = "black") +
  scale_fill_manual(values = penguin_colors) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
    labs(title = "Distribution of penguin species by sex",
       x = "Sex",
       y = "Frequency",
       caption = "Data source: palmerpenguins R package",
       fill = "Species")

Code

# Creating dumbbell chart
penguins |> 
  dplyr::count(species, sex, .drop = FALSE) |> 
  dplyr::filter(!is.na(species), !is.na(sex)) |> 
  dplyr::mutate(species_sex = str_c(species, "_", sex)) |> 
  ggplot(aes(x = n, y = sex,
             color = species, fill = species)) + 
  geom_line(aes(group = sex), color = "black") +
    geom_point(pch = 21, color = "black", size = 5) +
  scale_fill_manual(values = penguin_colors) +
      labs(title = "Distribution of penguin species by sex",
           x = "Frequency",
           y = "Sex",
           caption = "Data source: palmerpenguins R package",
       fill = "Species") +
  theme(legend.position = "bottom")

Tip

What can we say about the penguin sex and species based on the clustered bar chart and dumbbell chart?

Multivariate analyses

Descriptive statistics could also be obtained jointly for three or more variables, although this is less commonly done. For example, we could obtain counts for the cross-section of the species, island, and sex variables.

Code

# Calculating counts for cross-section of species, island, and sex
species_island_sex_counts <- penguins |> 
  dplyr::count(species, island, sex, .drop = FALSE)

# Printing frequency table
species_island_sex_counts |> 
  gt()

Table 10: Table of counts for the cross section of species, island, and sex.

species	island	sex	n
Adelie	Biscoe	female	22
Adelie	Biscoe	male	22
Adelie	Dream	female	27
Adelie	Dream	male	28
Adelie	Dream	NA	1
Adelie	Torgersen	female	24
Adelie	Torgersen	male	23
Adelie	Torgersen	NA	5
Chinstrap	Dream	female	34
Chinstrap	Dream	male	34
Gentoo	Biscoe	female	58
Gentoo	Biscoe	male	61
Gentoo	Biscoe	NA	5

With this many variables, however, it is often better to use data visualizations to communicate aspects of the data.

Multivariate visualizations

If we want to jointly visualize three numeric variables, we can use a scatter plot coloring the points based on a third variable.

Let’s look at the relationship between the penguin flipper lengths and body masses while coloring the points based on the penguin species.

Code

# Creating a scatter plot
penguins |> 
  ggplot(aes(x = body_mass_g, y = flipper_length_mm, fill = species)) + 
  geom_point(pch = 21, color = "white") +
    scale_fill_manual(values = penguin_colors) +
    labs(title = "Penguin flipper lengths by body mass",
       x = "Body mass (g)",
       y = "Flipper length (mm)",
       fill = "Species",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "bottom")

Tip

What can we say about the relationship between the penguin flipper lengths (mm) and body masses (g) for each penguin species based on the scatter plot?

Simpson’s Paradox

Sometimes the relationship between variables is the same within different subgroups as it is for the data as a whole. However, relationships in the full data set can be misleading or distorted when a confounding variable or lurking variable is unaccounted for.

Observe the relationship between the bill lengths and bill depths of the penguins in the scatter plot below.

Code

# Creating a scatter plot
penguins |> 
  ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
    labs(title = "Penguin bill length by bill depth",
       x = "Bill depth (mm)",
       y = "Bill length (mm)",
       fill = "Species",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "bottom")

Tip

How can we describe the relationship in the full data set?

Code

# Creating a scatter plot
penguins |> 
  ggplot(aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
    geom_smooth(method = "lm", se = FALSE) +
  geom_point() +
    scale_color_manual(values = penguin_colors) +
    labs(title = "Penguin bill length by bill depth",
       x = "Bill depth (mm)",
       y = "Bill length (mm)",
       fill = "Species",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "bottom")

Tip

How can we describe the relationship for each penguin species?

Simpson’s paradox¹ is a statistical phenomenon where a trend that appears in several different groups of data disappears or even reverses when these groups are combined.

Simpson’s Paradox can arise in many real life data sets, including the Palmer Penguins data.

Faceting

An alternative way to incorporate the information of an additional categorical variable in a plot is to use faceting, breaking up the plot into multiple subplots. Below we showcase the relationship between the penguin flipper lengths (mm) and body masses (g) for each penguin species using faceting.

Code

# Creating a faceted scatter plot
penguins |> 
  ggplot(aes(x = body_mass_g, y = flipper_length_mm, fill = species)) + 
  geom_point(pch = 21, color = "white") +
    scale_fill_manual(values = penguin_colors) +
  facet_grid(species ~ .) +
    labs(title = "Penguin flipper lengths by body mass",
       x = "Body mass (g)",
       y = "Flipper length (mm)",
       fill = "Species",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "bottom",
        strip.background.y = element_rect(linetype = "solid", color = "black"))

For three categorical variables, we could create a faceted dumbbell chart to visualize the distribution of penguins in terms of their species, sex, and which island they were found on (Biscoe, Dream, or Torgersen).

Code

# Creating a faceted dumbbell chart
penguins |> 
  dplyr::count(species, sex, island, .drop = FALSE) |> 
  dplyr::filter(!is.na(species), !is.na(sex)) |> 
  dplyr::mutate(species_sex = str_c(species, "_", sex)) |> 
  ggplot(aes(x = n, y = sex,
             color = species, fill = species)) + 
  geom_line(aes(group = sex), color = "black") +
    geom_point(pch = 21, color = "black", size = 5) +
  scale_fill_manual(values = penguin_colors) +
      labs(title = "Distribution of penguin species and sex by island",
           x = "Frequency",
           y = "Sex",
           caption = "Data source: palmerpenguins R package",
       fill = "Species") +
  facet_grid(island ~ .) +
  theme(legend.position = "bottom",
        strip.background.y = element_rect(linetype = "solid", color = "black"))

Tip

What are some main takeaways from the faceted dumbbell chart?

We could also visualize three numeric variables. Let’s look at the relationship between the penguin flipper lengths and body masses while coloring the points based on the bill lengths (mm).

Code

# Creating a scatter plot
penguins |> 
  ggplot(aes(x = body_mass_g, y = flipper_length_mm, fill = bill_length_mm)) + 
  geom_point(pch = 21, color = "white") +
    labs(title = "Penguin flipper lengths by body mass",
       x = "Body mass (g)",
       y = "Flipper length (mm)",
       fill = "Bill length (mm)",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "bottom")

We can see from the plot above that penguins with smaller flippers and body masses tend to also have smaller bill lengths. However, the plot is not the easiest to glean information from with three variables. Additional dimensions could be added to the plot using other aesthetics like the shape or size of the points, but in general it is best to use two or three variables at most in a single visualization.

Key EDA concepts

Efficient exploration

Careful creation of visualizations and summary statistics is ideal, but when there is a time constraint to analyses faster approaches are needed. Using the skim() function from the skimr, we can efficiently summarize data sets in R.

Code

# Efficiently skimming the data
skim(penguins)

Data summary
Name	penguins
Number of rows	344
Number of columns	8
_______________________
Column type frequency:
character	3
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
species	0	1.00	6	9	3
island	0	1.00	5	9	3
sex	11	0.97	4	6	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bill_length_mm	2	0.99	43.92	5.46	32.1	39.23	44.45	48.5	59.6	▃▇▇▆▁
bill_depth_mm	2	0.99	17.15	1.97	13.1	15.60	17.30	18.7	21.5	▅▅▇▇▂
flipper_length_mm	2	0.99	200.92	14.06	172.0	190.00	197.00	213.0	231.0	▂▇▃▅▂
body_mass_g	2	0.99	4201.75	801.95	2700.0	3550.00	4050.00	4750.0	6300.0	▃▇▆▃▂
year	0	1.00	2008.03	0.82	2007.0	2007.00	2008.00	2009.0	2009.0	▇▁▇▁▇

Tip

What are some notable patterns of missingness, if any?
Are there any extreme outliers?

Fundamental visualizations

When creating data visualizations, it is important that the visualization is appropriate for the types of variables that are used. Below is a summary of some fundamental visualizations.

Lastly, besides Simpson’s Paradox there are other ways that data visualizations can be misleading as well that are important to be mindful of.

Footnotes

Clifford H. Wagner (February 1982). “Simpson’s Paradox in Real Life”. The American Statistician. 36 (1): 46–48. doi:10.2307/2684093. JSTOR 2684093.↩︎

--- title: "Exploratory Data Analysis" subtitle: DSA 220 - Introduction to Data Science and Analytics author: Andrew DiLernia format: html: theme: flatly self-contained: true embed-resources: true code-fold: true code-tools: true toc: true callout-appearance: default execute: echo: true warning: false message: false output: true --- ```{css} #| echo: false /*** Customization of call-outs ***/ /* 1. Initialize a counter for our callouts */ body { counter-reset: tip-counter; } /* 2. Tell the counter to increment for each tip callout */ .callout-tip { counter-increment: tip-counter; } /* 3. Hide the default title text (i.e., "Tip") by collapsing its font size */ .callout-tip .callout-title-container { font-size: 0; } /* 4. Add the custom superscript counter */ .callout-tip .callout-title-container::before { content: "" counter(tip-counter); position: relative; font-size: initial; } /*** Customization of Footnotes / References ***/ .footnotes > h2::before { content: "References "; } .footnotes > h2 { color: transparent; /* Hide original text */ position: relative; } .footnotes > h2::before { color: #333; position: absolute; left: 0; } ``` ```{r} #| include: false # Creating vector of colorblind-friendly colors colorblind_friendly_colors <- c("#E69F00", "#009E73", "#56B4E9", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#999999") ``` ## Learning Objectives - Summarize key characteristics of data using univariate descriptive statistics - Visualize univariate distributions with appropriate plots - Calculate and interpret correlations - Compare numeric variables across subgroups using summary statistics and visualizations - Create bivariate visualizations to explore relationships - Apply multivariate visualization techniques (color, faceting) to represent three or more variables - Identify potential outliers and summarize patterns of missingness In this activity we will analyze the Palmer Penguins data set: a data set consisting of measurements collected on 344 penguins from 3 islands in [Palmer Archipelago, Antarctica](https://www.google.com/maps/place/Palmer+Archipelago/@-64.1463814,-62.1800963,7z/data=!3m1!4b1!4m5!3m4!1s0xbc78dd6dc38c572b:0xe609367aeed33087!8m2!3d-64.1957848!4d-62.0153384). Specifically, data were collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica Long Term Ecological Research Network](https://pallter.marine.rutgers.edu/). Researchers collected data by capturing and releasing breeding adult penguins across the islands at their nests. | | | |-----------------------------------------------|----------------------------------------------| | ![](https://github.com/dilernia/DSA220/blob/main/images/lter_penguins.png?raw=true) | ![](https://github.com/dilernia/DSA220/blob/main/images/culmen_depth.png?raw=true) | : *Artwork by @allison_horst* Let's load some R packages to start, and set a default theme for all visualizations made with `ggplot2`. ```{r} #| message: false #| warning: false # Loading R packages library(tidyverse) library(gt) library(corrr) library(ggthemes) library(skimr) # Setting default ggplot2 theme theme_set(theme_bw()) ``` The code above sets a default ggplot theme for the entire document. We can update this code to a complete theme of our choice from the [`ggplot2` package](https://ggplot2.tidyverse.org/reference/ggtheme.html) or the [`ggthemes` package](https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/). Then, we can load the data for the activity. ```{r} # Importing penguins data from GitHub penguins <- read_csv("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Data/penguins.csv") ``` Below are a data dictionary for the Palmer Penguins data, and a preview of some randomly selected rows. ```{r} #| label: tbl-data-dictionary #| tbl-cap: "Data dictionary for the Palmer Penguins data set." #| echo: false # Creating data dictionary data_dictionary <- tibble(Variable = colnames(penguins), Description = c("Species of the penguin", "Island the penguin was found on", "Bill length (mm)", "Bill depth (mm)", "Flipper length (mm)", "Body mass (g)", "Sex of the penguin", "Year data was collected")) data_dictionary |> gt() ``` ```{r} #| label: tbl-penguins #| tbl-cap: "Randomly selected rows from the Palmer Penguins data set." #| echo: false # Setting seed to consistently get the same "random" rows set.seed(1994) # Printing table penguins |> slice_sample(n = 5) |> gt() ``` ::: {.callout-tip} a. In @tbl-penguins, being as specific as possible, what is the type of each variable? b. Which data format would be best for storing all data available on the Palmer Penguins (e.g., XML, CSV, or JSON)? c. Which sampling technique was used to collect the Palmer Penguins data? d. Which, if any, landmark data privacy law(s) would apply to this scenario? ::: ## Exploratory Data Analysis (EDA) **Exploratory data analysis (EDA)** is the process of analyzing data to explore key characteristics, commonly using data visualizations and descriptive statistics. Conducting EDA is contingent upon one's data already being cleaned, organized in a tidy format (i.e., variables are columns and observations are rows), and imported into R or another software tool so that it is ready for analysis. EDA is an essential skill for being a statistician or data scientist, with its main purposes being: - Identifying outliers - Exploring patterns of missing data - Formulating hypotheses for studies based on new data - To generally improve understanding of our data Two main components of EDA are descriptive statistics and data visualizations. ## Descriptive statistics **Descriptive statistics** are numerical measures describing our observed data set called our *sample*. Descriptive statistics are different from inferential statistics which are used to learn about the *population* of interest. ## Data visualization **Data visualization** is the process of graphically representing our data. The main goals of data visualization are to: - Efficiently present data in a compact space - Make data sets coherent / understandable to our target audience - Serve a clear purpose - Not distort what the data has to say For a more detailed set of guidelines regarding best practices of data visualization, see this [guide](https://drive.google.com/file/d/1BPZcdrFgoj_TjzLVWUImZ92gXCKuaGy6/view?usp=share_link) created by people working at the pharmaceutical company Novartis. ## Univariate analyses ### Univariate descriptive statistics For a single numeric variable, some commonly used univariate statistics are the mean, median, minimum, maximum, range, and standard deviation (SD). Let's look at these statistics for the flipper lengths in millimeters (mm) of the Palmer Penguins. ```{r} #| label: tbl-stats #| tbl-cap: "Table of summary statistics for penguin flipper lengths in mm." # Calculating descriptive statistics quant_1_stats <- penguins |> summarize( Minimum = min(flipper_length_mm, na.rm = TRUE), Q1 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.25), Median = median(flipper_length_mm, na.rm = TRUE), Q3 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.75), Maximum = max(flipper_length_mm, na.rm = TRUE), Mean = mean(flipper_length_mm, na.rm = TRUE), Range = Maximum - Minimum, SD = sd(flipper_length_mm, na.rm = TRUE), ) # Printing table of statistics quant_1_stats |> gt() |> fmt_number(columns = everything(), decimals = 2, drop_trailing_zeros = TRUE) ``` Note that this data set contains missing values, denoted as `NA` in R, but these were excluded when calculating the sample statistics above. This is how we will often handle missing values moving forward for this course. Missing data are important to be aware of as their source should be investigated or questioned. Although we won't be covering methods for handling missing data in this activity, some resources are [Flexible Imputation of Missing Data](https://stefvanbuuren.name/fimd/) by Stef van Buuren (2018) or the [naniar](https://naniar.njtierney.com/) and [mice](https://github.com/amices/mice) R packages. ::: {.callout-tip} What are the largest and smallest flipper lengths for penguins in this data set? ::: The mean and median describe the middle or center of a set of values. The mean gives the average value of a variable, while the median is the number such that $50\%$ of values are greater than or equal to this value, and $50\%$ are less than or equal to this value. The mean is sensitive to extreme or outlier observations, while the median is robust (less sensitive) to extreme values / outliers, so the median is typically a more appropriate measure of center when outliers are present. ::: {.callout-tip} a. Provide the value of the sample mean flipper length for the penguins. b. Provide and interpret the value of the sample median flipper length for the penguins. c. Provide and interpret the value of the first and third quartiles of the flipper lengths for the penguins. ::: Other commonly used statistics for numeric variables include the range and standard deviation which provide measures of dispersion or variability. Measures of variability describe how spread out values are for a particular variable. ::: {.callout-tip} Which statistic is more sensitive to outliers: the range or the interquartile range (IQR)? ::: Frequencies, also known as counts, are useful descriptive statistics for categorical variables. The categorical variables in the Palmer Penguins data are `species`, `island`, and `sex`. ```{r} #| label: tbl-species-counts #| tbl-cap: "Table of counts for species." # Printing frequency table penguins |> dplyr::count(species) |> gt() ``` ```{r} #| label: tbl-island-counts #| tbl-cap: "Table of counts for island." #| echo: false # Printing frequency table penguins |> dplyr::count(island) |> gt() ``` ```{r} #| label: tbl-sex-counts #| tbl-cap: "Table of counts for sex." #| echo: false # Printing frequency table penguins |> dplyr::count(sex) |> gt() ``` ::: {.callout-tip} a. Which penguin species had the most observations in this data set? b. Which island had the least penguins measured on it? c. Are there more male or female penguins in this data set? ::: Notice the NA in the frequency table for the `sex` variable. This is due to the sex of some penguins being missing. ### Univariate visualizations Data visualization is a key aspect of EDA. Univariate data visualizations are useful for understanding characteristics of a single variable in an efficient manner and also to identify potential outliers or unusual observations. It is important to include units of measurement whenever possible in data visualizations. For a single numeric variable, two commonly used data visualizations are the box plot and histogram. We make a histogram and a box plot for the penguin flipper lengths below. ```{r} # Creating a histogram penguins |> ggplot(aes(x = flipper_length_mm)) + geom_histogram(color = "white") + scale_y_continuous(expand = expansion(mult = c(0, 0.10))) + labs(title = "Distribution of penguin flipper lengths", x = "Flipper length (mm)", y = "Frequency", caption = "Data source: palmerpenguins R package") ``` ::: {.callout-tip} What can we say about the penguin flipper lengths based on the histogram? ::: ```{r} # Creating a box plot penguins |> ggplot(aes(x = flipper_length_mm)) + geom_boxplot() + scale_y_discrete(breaks = NULL) + labs(title = "Distribution of penguin flipper lengths", x = "Flipper length (mm)", caption = "Data source: palmerpenguins R package") ``` ::: {.callout-tip} What can we say about the penguin flipper lengths based on the box plot? Are there any outliers present? ::: For a single categorical variable, a commonly used data visualization is the bar chart. We have a bar chart displaying the number of penguins of each species below. ```{r} # Creating vector of colorblind-friendly colors penguin_colors <- c("Chinstrap" = "#CC79A7", "Gentoo" = "#009E73", "Adelie" = "#E69F00") ``` It is considered best practice to use colorblind friendly colors whenever possible in data visualizations. Here we are using colorblind friendly colors to differentiate between the penguin species using hex codes. Hex codes are six-digit alphanumeric codes that specify colors in a relatively software agnostic manner. For example, "#009E73" refers to a specific green hue that will be consistent in appearance across both R and Python, whereas "green" is not necessarily the same exact hue for different software tools. ```{r} # Creating a bar chart penguins |> dplyr::count(species, .drop = FALSE) |> mutate(species = fct_reorder(species, n)) |> ggplot(aes(x = species, y = n, fill = species)) + geom_col(color = "black") + scale_fill_manual(values = penguin_colors) + scale_y_continuous(expand = expansion(mult = c(0, 0.10))) + labs(title = "Distribution of penguin species", x = "Species", y = "Frequency", caption = "Data source: palmerpenguins R package") + theme(legend.position = "none") ``` ::: {.callout-tip} What can we say about the penguin species based on the bar chart? ::: ## Bivariate analyses ### Bivariate descriptive statistics For two numeric variables, the sample correlation coefficient, $r$, is a commonly used statistic which describes both the strength and direction of a linear relationship between two variables. Correlations range from -1 to 1, with values closer to -1 indicating a stronger, negative linear relationship (as one variable increases, the other decreases), values closer to 0 indicating a weaker linear relationship, and values closer to 1 indicating a stronger, positive and linear relationship (as one variable increases, the other increases). Correlations are a useful starting point, but should be used with caution as they have a few limitations: 1. They only measure linear relationships, but many relationships are non-linear. ⎰ 2. A strong correlation does not imply causation, e.g., monthly ice cream sales may be strongly correlated with the number of monthly shark attacks, but that does not necessarily mean one causes the other. 🍦🦈 3. Confounding variables can distort the relationship between variables of interest. 4. The correlation coefficient is sensitive to outliers. ::: {.callout-tip} What is a confounding variable between ice cream sales and number of shark attacks? ::: Let's look at the correlations for the penguins data: ```{r} #| label: tbl-correlations #| tbl-cap: "Table of pairwise correlations." # Calculating correlations correlation_table <- penguins |> corrr::correlate(diagonal = 1) # Printing table of correlations correlation_table |> gt() |> fmt_number(columns = everything(), decimals = 3) ``` ::: {.callout-tip} a. Which variables have the strongest correlation? b. Which variables have the weakest correlation? ::: For one numeric and one categorical variable, we can look at the same univariate statistics we have previously, but stratified by the groups of the categorical variable. Let's look at statistics for the flipper lengths of the penguins stratified by species. ```{r} #| label: tbl-groupwise-stats #| tbl-cap: "Table of groupwise summary statistics." # Calculating descriptive statistics quant_2_stats <- penguins |> group_by(species) |> summarize( Minimum = min(flipper_length_mm, na.rm = TRUE), Q1 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.25), Median = median(flipper_length_mm, na.rm = TRUE), Q3 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.75), Maximum = max(flipper_length_mm, na.rm = TRUE), Mean = mean(flipper_length_mm, na.rm = TRUE), Range = Maximum - Minimum, SD = sd(flipper_length_mm, na.rm = TRUE)) # Printing table of statistics quant_2_stats |> gt() |> fmt_number(columns = everything(), decimals = 2, drop_trailing_zeros = TRUE) ``` ::: {.callout-tip} a. Which penguin species typically had the largest flipper lengths? b. Which penguin species had the most variability in their flipper lengths? ::: For two categorical variables, a table of counts is commonly calculated as below. ```{r} #| label: tbl-two-way-counts #| tbl-cap: "Table of counts for the cross section of species and island." # Calculating counts for cross-section of species and island species_island_counts <- penguins |> dplyr::count(species, island) # Printing frequency table species_island_counts |> gt() ``` ::: {.callout-tip} a. Was any penguin species found on more than one island? b. How many penguins were found on Dream island? ::: ### Bivariate visualizations For two numeric variables, the scatter plot is the ideal visualization. Let's look at the relationship between the penguin flipper lengths and body masses. ```{r} # Creating a scatter plot penguins |> ggplot(aes(x = body_mass_g, y = flipper_length_mm)) + geom_point(pch = 21, color = "white", fill = "black") + labs(title = "Penguin flipper lengths by body mass", x = "Body mass (g)", y = "Flipper length (mm)", caption = "Data source: palmerpenguins R package") ``` We can also add a straight line of best fit to scatter plots as well. ```{r} # Creating a scatter plot with a line of best fit penguins |> ggplot(aes(x = body_mass_g, y = flipper_length_mm)) + geom_point(pch = 21, color = "white", fill = "black") + geom_smooth(method = 'lm', se = FALSE) + labs(title = "Penguin flipper lengths by body mass", x = "Body mass (g)", y = "Flipper length (mm)", caption = "Data source: palmerpenguins R package") ``` ::: {.callout-tip} What do we observe based on the scatter plot above? ::: For one numeric and one categorical variable, side-by-side box plots are a useful visualization. We can use a side-by-side box plot to explore the penguin flipper lengths by species as below. ```{r} # Creating side-by-side box plots penguins |> ggplot(aes(x = species, y = flipper_length_mm, fill = species)) + geom_boxplot() + scale_fill_manual(values = penguin_colors) + labs(title = "Penguin flipper lengths by species", x = "Species", y = "Flipper length (mm)", caption = "Data source: palmerpenguins R package") + theme(legend.position = "none") ``` ::: {.callout-tip} What do we observe based on the side-by-side box plots? ::: Lastly, for two categorical variables, a clustered bar chart or dumbbell chart are the more commonly used visualizations. ```{r} # Creating a clustered bar chart penguins |> dplyr::count(species, sex, .drop = FALSE) |> dplyr::filter(!is.na(species), !is.na(sex)) |> mutate(sex = fct_reorder(sex, n)) |> ggplot(aes(x = sex, y = n, fill = species)) + geom_col(position="dodge", color = "black") + scale_fill_manual(values = penguin_colors) + scale_y_continuous(expand = expansion(mult = c(0, 0.10))) + labs(title = "Distribution of penguin species by sex", x = "Sex", y = "Frequency", caption = "Data source: palmerpenguins R package", fill = "Species") ``` ```{r} # Creating dumbbell chart penguins |> dplyr::count(species, sex, .drop = FALSE) |> dplyr::filter(!is.na(species), !is.na(sex)) |> dplyr::mutate(species_sex = str_c(species, "_", sex)) |> ggplot(aes(x = n, y = sex, color = species, fill = species)) + geom_line(aes(group = sex), color = "black") + geom_point(pch = 21, color = "black", size = 5) + scale_fill_manual(values = penguin_colors) + labs(title = "Distribution of penguin species by sex", x = "Frequency", y = "Sex", caption = "Data source: palmerpenguins R package", fill = "Species") + theme(legend.position = "bottom") ``` ::: {.callout-tip} What can we say about the penguin sex and species based on the clustered bar chart and dumbbell chart? ::: ### Multivariate analyses Descriptive statistics could also be obtained jointly for three or more variables, although this is less commonly done. For example, we could obtain counts for the cross-section of the `species`, `island`, and `sex` variables. ```{r} #| label: tbl-three-way-counts #| tbl-cap: "Table of counts for the cross section of species, island, and sex." # Calculating counts for cross-section of species, island, and sex species_island_sex_counts <- penguins |> dplyr::count(species, island, sex, .drop = FALSE) # Printing frequency table species_island_sex_counts |> gt() ``` With this many variables, however, it is often better to use data visualizations to communicate aspects of the data. ### Multivariate visualizations If we want to jointly visualize three numeric variables, we can use a scatter plot coloring the points based on a third variable. Let's look at the relationship between the penguin flipper lengths and body masses while coloring the points based on the penguin species. ```{r} # Creating a scatter plot penguins |> ggplot(aes(x = body_mass_g, y = flipper_length_mm, fill = species)) + geom_point(pch = 21, color = "white") + scale_fill_manual(values = penguin_colors) + labs(title = "Penguin flipper lengths by body mass", x = "Body mass (g)", y = "Flipper length (mm)", fill = "Species", caption = "Data source: palmerpenguins R package") + theme(legend.position = "bottom") ``` ::: {.callout-tip} What can we say about the relationship between the penguin flipper lengths (mm) and body masses (g) for each penguin species based on the scatter plot? ::: #### Simpson's Paradox Sometimes the relationship between variables is the same within different subgroups as it is for the data as a whole. However, relationships in the full data set can be misleading or distorted when a **confounding variable** or **lurking variable** is unaccounted for. Observe the relationship between the bill lengths and bill depths of the penguins in the scatter plot below. ```{r} # Creating a scatter plot penguins |> ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(title = "Penguin bill length by bill depth", x = "Bill depth (mm)", y = "Bill length (mm)", fill = "Species", caption = "Data source: palmerpenguins R package") + theme(legend.position = "bottom") ``` ::: {.callout-tip} How can we describe the relationship in the full data set? ::: ```{r} # Creating a scatter plot penguins |> ggplot(aes(x = bill_depth_mm, y = bill_length_mm, color = species)) + geom_smooth(method = "lm", se = FALSE) + geom_point() + scale_color_manual(values = penguin_colors) + labs(title = "Penguin bill length by bill depth", x = "Bill depth (mm)", y = "Bill length (mm)", fill = "Species", caption = "Data source: palmerpenguins R package") + theme(legend.position = "bottom") ``` ::: {.callout-tip} How can we describe the relationship for each penguin species? ::: **Simpson's paradox**[^1] is a statistical phenomenon where a trend that appears in several different groups of data disappears or even reverses when these groups are combined. [^1]: Clifford H. Wagner (February 1982). "Simpson's Paradox in Real Life". The American Statistician. 36 (1): 46–48. doi:10.2307/2684093. JSTOR 2684093. [Simpson's Paradox](https://youtu.be/sxYrzzy3cq8?si=jWMoZ8qR_wNUzQQY) can arise in many real life data sets, including the Palmer Penguins data. #### Faceting An alternative way to incorporate the information of an additional categorical variable in a plot is to use *faceting*, breaking up the plot into multiple subplots. Below we showcase the relationship between the penguin flipper lengths (mm) and body masses (g) for each penguin species using faceting. ```{r} # Creating a faceted scatter plot penguins |> ggplot(aes(x = body_mass_g, y = flipper_length_mm, fill = species)) + geom_point(pch = 21, color = "white") + scale_fill_manual(values = penguin_colors) + facet_grid(species ~ .) + labs(title = "Penguin flipper lengths by body mass", x = "Body mass (g)", y = "Flipper length (mm)", fill = "Species", caption = "Data source: palmerpenguins R package") + theme(legend.position = "bottom", strip.background.y = element_rect(linetype = "solid", color = "black")) ``` For three categorical variables, we could create a faceted dumbbell chart to visualize the distribution of penguins in terms of their species, sex, and which island they were found on (Biscoe, Dream, or Torgersen). ```{r} # Creating a faceted dumbbell chart penguins |> dplyr::count(species, sex, island, .drop = FALSE) |> dplyr::filter(!is.na(species), !is.na(sex)) |> dplyr::mutate(species_sex = str_c(species, "_", sex)) |> ggplot(aes(x = n, y = sex, color = species, fill = species)) + geom_line(aes(group = sex), color = "black") + geom_point(pch = 21, color = "black", size = 5) + scale_fill_manual(values = penguin_colors) + labs(title = "Distribution of penguin species and sex by island", x = "Frequency", y = "Sex", caption = "Data source: palmerpenguins R package", fill = "Species") + facet_grid(island ~ .) + theme(legend.position = "bottom", strip.background.y = element_rect(linetype = "solid", color = "black")) ``` ::: {.callout-tip} What are some main takeaways from the faceted dumbbell chart? ::: We could also visualize three numeric variables. Let's look at the relationship between the penguin flipper lengths and body masses while coloring the points based on the bill lengths (mm). ```{r} # Creating a scatter plot penguins |> ggplot(aes(x = body_mass_g, y = flipper_length_mm, fill = bill_length_mm)) + geom_point(pch = 21, color = "white") + labs(title = "Penguin flipper lengths by body mass", x = "Body mass (g)", y = "Flipper length (mm)", fill = "Bill length (mm)", caption = "Data source: palmerpenguins R package") + theme(legend.position = "bottom") ``` We can see from the plot above that penguins with smaller flippers and body masses tend to also have smaller bill lengths. However, the plot is not the easiest to glean information from with three variables. Additional dimensions could be added to the plot using other aesthetics like the shape or size of the points, but in general it is best to use two or three variables at most in a single visualization. # Key EDA concepts ## Efficient exploration Careful creation of visualizations and summary statistics is ideal, but when there is a time constraint to analyses faster approaches are needed. Using the `skim()` function from the `skimr`, we can efficiently summarize data sets in R. ```{r} # Efficiently skimming the data skim(penguins) ``` ::: {.callout-tip} a. What are some notable patterns of missingness, if any? b. Are there any extreme outliers? ::: ## Fundamental visualizations When creating data visualizations, it is important that the visualization is appropriate for the types of variables that are used. Below is a summary of some fundamental visualizations. ```{r} #| include: false #| eval: true plot_list <- list() # Box Plot: One Quantitative Variable plot_list$gg_boxplot <- penguins |> ggplot(aes(y = flipper_length_mm)) + geom_boxplot() + labs(title = "Box plot using geom_boxplot()", subtitle = "One quantitative variable", y = "Flipper length (mm)", x = "") + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) # Histogram: One Quantitative Variable plot_list$gg_histogram <- penguins |> ggplot(aes(x = flipper_length_mm)) + geom_histogram(color = "black") + labs(title = "Histogram using geom_histogram()", subtitle = "One quantitative variable", x = "Flipper length (mm)") + scale_y_continuous(expand = expansion(mult = c(0, 0.1))) # Bar Chart: One Categorical Variable plot_list$gg_barchart <- penguins |> ggplot(aes(x = species, fill = species)) + geom_bar(color = "black") + labs(title = "Bar chart using geom_bar()", subtitle = "One categorical variable") + scale_y_continuous(expand = expansion(mult = c(0, 0.1))) + scale_fill_manual(values = penguin_colors) + guides(fill = "none") # Bar plot: One Quantitative Variable penguins_summary <- penguins |> group_by(species) |> summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE)) plot_list$gg_barplot <- penguins_summary |> ggplot(aes(x = species, y = mean_bill_length, fill = species)) + geom_col(color = "black") + scale_fill_manual(values = penguin_colors) + labs(title = "Bar plot using geom_col()", subtitle = "One quantitative variable and one categorical variable", y = "Mean bill length (mm)") + scale_y_continuous(expand = expansion(mult = c(0, 0.1))) + guides(fill = "none") # Side-by-side box Plot: One Quantitative Variable and One Categorical Variable plot_list$gg_side_boxplot <- penguins |> ggplot(aes(x = species, y = bill_length_mm, fill = species)) + geom_boxplot(color = "black") + scale_fill_manual(values = penguin_colors) + labs(title = "Side-by-side box plots using geom_boxplot()", subtitle = "One quantitative variable and one categorical variable", x = "Species", y = "Bill length (mm)") + guides(fill = "none") # Scatter Plot: Two Quantitative Variables plot_list$gg_scatterplot <- penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point() + labs(title = "Scatter plot using geom_point()", subtitle = "Two quantitative variables", x = "Bill length (mm)", y = "Bill depth (mm)") # Dumbbell chart: Two Categorical Variables plot_list$gg_dumbbell <- penguins |> dplyr::count(species, sex, .drop = FALSE) |> dplyr::filter(!is.na(species), !is.na(sex)) |> dplyr::mutate(species_sex = str_c(species, "_", sex)) |> ggplot(aes(x = n, y = sex, color = species, fill = species)) + geom_line(aes(group = sex), color = "black") + geom_point(pch = 21, color = "black", size = 5) + scale_fill_manual(values = penguin_colors) + labs(title = "Dumbbell chart using geom_point()", subtitle = "Two categorical variables", x = "Frequency", y = "Sex", fill = "Species") + theme(legend.position = "bottom") # Clustered bar chart: Two Categorical Variables plot_list$gg_clustered_bar <- penguins |> dplyr::count(species, sex, .drop = FALSE) |> dplyr::filter(!is.na(species), !is.na(sex)) |> mutate(sex = fct_reorder(sex, n)) |> ggplot(aes(x = sex, y = n, fill = species)) + geom_col(position="dodge", color = "black") + scale_fill_manual(values = penguin_colors) + scale_y_continuous(expand = expansion(mult = c(0, 0.10))) + labs(title = "Clustered bar chart using geom_col()", subtitle = "Two categorical variables", x = "Sex", y = "Frequency", fill = "Species") + theme(legend.position = "bottom") # Saving plots pmap(.l = list(plot = plot_list, name = names(plot_list)), .f = function(plot, name) { ggsave(plot = plot, width = 5, height = 4, filename = paste0(name, ".png")) }) ``` ::: {layout-ncol="2"} ![](gg_boxplot.png) ![](gg_histogram.png) ![](gg_barchart.png) ![](gg_barplot.png) ![](gg_side_boxplot.png) ![](gg_scatterplot.png) ![](gg_dumbbell.png) ![](gg_clustered_bar.png) ::: Lastly, besides Simpson's Paradox there are other ways that [data visualizations can be misleading](https://youtu.be/E91bGT9BjYk?si=snC9nUpId76wj40q) as well that are important to be mindful of.