Exploratory Data Analysis

DSA 220 - Introduction to Data Science and Analytics

Author

Andrew DiLernia

Learning Objectives

  • Summarize key characteristics of data using univariate descriptive statistics

  • Visualize univariate distributions with appropriate plots

  • Calculate and interpret correlations

  • Compare numeric variables across subgroups using summary statistics and visualizations

  • Create bivariate visualizations to explore relationships

  • Apply multivariate visualization techniques (color, faceting) to represent three or more variables

  • Identify potential outliers and summarize patterns of missingness

In this activity we will analyze the Palmer Penguins data set: a data set consisting of measurements collected on 344 penguins from 3 islands in Palmer Archipelago, Antarctica. Specifically, data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research Network. Researchers collected data by capturing and releasing breeding adult penguins across the islands at their nests.

Artwork by @allison_horst

Let’s load some R packages to start, and set a default theme for all visualizations made with ggplot2.

Code
# Loading R packages
library(tidyverse)
library(gt)
library(corrr)
library(ggthemes)
library(skimr)

# Setting default ggplot2 theme
theme_set(theme_bw())

The code above sets a default ggplot theme for the entire document. We can update this code to a complete theme of our choice from the ggplot2 package or the ggthemes package.

Then, we can load the data for the activity.

Code
# Importing penguins data from GitHub
penguins <- read_csv("https://raw.githubusercontent.com/dilernia/DSA220/refs/heads/main/Data/penguins.csv")

Below are a data dictionary for the Palmer Penguins data, and a preview of some randomly selected rows.

Table 1: Data dictionary for the Palmer Penguins data set.
Variable Description
species Species of the penguin
island Island the penguin was found on
bill_length_mm Bill length (mm)
bill_depth_mm Bill depth (mm)
flipper_length_mm Flipper length (mm)
body_mass_g Body mass (g)
sex Sex of the penguin
year Year data was collected
Table 2: Randomly selected rows from the Palmer Penguins data set.
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Chinstrap Dream 50.2 18.7 198 3775 female 2009
Adelie Biscoe 39.7 18.9 184 3550 male 2009
Chinstrap Dream 45.4 18.7 188 3525 female 2007
Gentoo Biscoe 46.2 14.1 217 4375 female 2009
Gentoo Biscoe 50.0 16.3 230 5700 male 2007
Tip
  1. In Table 2, being as specific as possible, what is the type of each variable?

  2. Which data format would be best for storing all data available on the Palmer Penguins (e.g., XML, CSV, or JSON)?

  3. Which sampling technique was used to collect the Palmer Penguins data?

  4. Which, if any, landmark data privacy law(s) would apply to this scenario?

Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) is the process of analyzing data to explore key characteristics, commonly using data visualizations and descriptive statistics.

Conducting EDA is contingent upon one’s data already being cleaned, organized in a tidy format (i.e., variables are columns and observations are rows), and imported into R or another software tool so that it is ready for analysis.

EDA is an essential skill for being a statistician or data scientist, with its main purposes being:

  • Identifying outliers

  • Exploring patterns of missing data

  • Formulating hypotheses for studies based on new data

  • To generally improve understanding of our data

Two main components of EDA are descriptive statistics and data visualizations.

Descriptive statistics

Descriptive statistics are numerical measures describing our observed data set called our sample. Descriptive statistics are different from inferential statistics which are used to learn about the population of interest.

Data visualization

Data visualization is the process of graphically representing our data. The main goals of data visualization are to:

  • Efficiently present data in a compact space

  • Make data sets coherent / understandable to our target audience

  • Serve a clear purpose

  • Not distort what the data has to say

For a more detailed set of guidelines regarding best practices of data visualization, see this guide created by people working at the pharmaceutical company Novartis.

Univariate analyses

Univariate descriptive statistics

For a single numeric variable, some commonly used univariate statistics are the mean, median, minimum, maximum, range, and standard deviation (SD).

Let’s look at these statistics for the flipper lengths in millimeters (mm) of the Palmer Penguins.

Code
# Calculating descriptive statistics
quant_1_stats <- penguins |> 
  summarize(
  Minimum = min(flipper_length_mm, na.rm = TRUE),
  Q1 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.25),
  Median = median(flipper_length_mm, na.rm = TRUE),
  Q3 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.75),
  Maximum = max(flipper_length_mm, na.rm = TRUE),
  Mean = mean(flipper_length_mm, na.rm = TRUE),
  Range = Maximum - Minimum,
  SD = sd(flipper_length_mm, na.rm = TRUE),
)

# Printing table of statistics
quant_1_stats |> 
    gt() |>
    fmt_number(columns = everything(), 
               decimals = 2,
               drop_trailing_zeros = TRUE)
Table 3: Table of summary statistics for penguin flipper lengths in mm.
Minimum Q1 Median Q3 Maximum Mean Range SD
172 190 197 213 231 200.92 59 14.06

Note that this data set contains missing values, denoted as NA in R, but these were excluded when calculating the sample statistics above. This is how we will often handle missing values moving forward for this course.

Missing data are important to be aware of as their source should be investigated or questioned. Although we won’t be covering methods for handling missing data in this activity, some resources are Flexible Imputation of Missing Data by Stef van Buuren (2018) or the naniar and mice R packages.

Tip

What are the largest and smallest flipper lengths for penguins in this data set?

The mean and median describe the middle or center of a set of values. The mean gives the average value of a variable, while the median is the number such that \(50\%\) of values are greater than or equal to this value, and \(50\%\) are less than or equal to this value. The mean is sensitive to extreme or outlier observations, while the median is robust (less sensitive) to extreme values / outliers, so the median is typically a more appropriate measure of center when outliers are present.

Tip
  1. Provide the value of the sample mean flipper length for the penguins.

  2. Provide and interpret the value of the sample median flipper length for the penguins.

  3. Provide and interpret the value of the first and third quartiles of the flipper lengths for the penguins.

Other commonly used statistics for numeric variables include the range and standard deviation which provide measures of dispersion or variability. Measures of variability describe how spread out values are for a particular variable.

Tip

Which statistic is more sensitive to outliers: the range or the interquartile range (IQR)?

Frequencies, also known as counts, are useful descriptive statistics for categorical variables. The categorical variables in the Palmer Penguins data are species, island, and sex.

Code
# Printing frequency table
penguins |> 
  dplyr::count(species) |> 
  gt()
Table 4: Table of counts for species.
species n
Adelie 152
Chinstrap 68
Gentoo 124
Table 5: Table of counts for island.
island n
Biscoe 168
Dream 124
Torgersen 52
Table 6: Table of counts for sex.
sex n
female 165
male 168
NA 11
Tip
  1. Which penguin species had the most observations in this data set?

  2. Which island had the least penguins measured on it?

  3. Are there more male or female penguins in this data set?

Notice the NA in the frequency table for the sex variable. This is due to the sex of some penguins being missing.

Univariate visualizations

Data visualization is a key aspect of EDA. Univariate data visualizations are useful for understanding characteristics of a single variable in an efficient manner and also to identify potential outliers or unusual observations. It is important to include units of measurement whenever possible in data visualizations.

For a single numeric variable, two commonly used data visualizations are the box plot and histogram. We make a histogram and a box plot for the penguin flipper lengths below.

Code
# Creating a histogram
penguins |> 
  ggplot(aes(x = flipper_length_mm)) + 
  geom_histogram(color = "white") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
  labs(title = "Distribution of penguin flipper lengths",
       x = "Flipper length (mm)",
       y = "Frequency",
       caption = "Data source: palmerpenguins R package")

Tip

What can we say about the penguin flipper lengths based on the histogram?

Code
# Creating a box plot
penguins |> 
  ggplot(aes(x = flipper_length_mm)) + 
  geom_boxplot() +
  scale_y_discrete(breaks = NULL) +
    labs(title = "Distribution of penguin flipper lengths",
       x = "Flipper length (mm)",
       caption = "Data source: palmerpenguins R package")

Tip

What can we say about the penguin flipper lengths based on the box plot? Are there any outliers present?

For a single categorical variable, a commonly used data visualization is the bar chart. We have a bar chart displaying the number of penguins of each species below.

Code
# Creating vector of colorblind-friendly colors
penguin_colors <- c("Chinstrap" = "#CC79A7", "Gentoo" = "#009E73", "Adelie" = "#E69F00")

It is considered best practice to use colorblind friendly colors whenever possible in data visualizations. Here we are using colorblind friendly colors to differentiate between the penguin species using hex codes. Hex codes are six-digit alphanumeric codes that specify colors in a relatively software agnostic manner. For example, “#009E73” refers to a specific green hue that will be consistent in appearance across both R and Python, whereas “green” is not necessarily the same exact hue for different software tools.

Code
# Creating a bar chart
penguins |> 
  dplyr::count(species, .drop = FALSE) |> 
  mutate(species = fct_reorder(species, n)) |> 
  ggplot(aes(x = species, y = n,
             fill = species)) + 
  geom_col(color = "black") +
  scale_fill_manual(values = penguin_colors) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
    labs(title = "Distribution of penguin species",
       x = "Species",
       y = "Frequency",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "none")

Tip

What can we say about the penguin species based on the bar chart?

Bivariate analyses

Bivariate descriptive statistics

For two numeric variables, the sample correlation coefficient, \(r\), is a commonly used statistic which describes both the strength and direction of a linear relationship between two variables. Correlations range from -1 to 1, with values closer to -1 indicating a stronger, negative linear relationship (as one variable increases, the other decreases), values closer to 0 indicating a weaker linear relationship, and values closer to 1 indicating a stronger, positive and linear relationship (as one variable increases, the other increases).

Correlations are a useful starting point, but should be used with caution as they have a few limitations:

  1. They only measure linear relationships, but many relationships are non-linear. ⎰

  2. A strong correlation does not imply causation, e.g., monthly ice cream sales may be strongly correlated with the number of monthly shark attacks, but that does not necessarily mean one causes the other. 🍦🦈

  3. Confounding variables can distort the relationship between variables of interest.

  4. The correlation coefficient is sensitive to outliers.

Tip

What is a confounding variable between ice cream sales and number of shark attacks?

Let’s look at the correlations for the penguins data:

Code
# Calculating correlations
correlation_table <- penguins |> 
  corrr::correlate(diagonal = 1)

# Printing table of correlations
correlation_table |> 
    gt() |>
    fmt_number(columns = everything(), decimals = 3)
Table 7: Table of pairwise correlations.
term bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
bill_length_mm 1.000 −0.235 0.656 0.595 0.055
bill_depth_mm −0.235 1.000 −0.584 −0.472 −0.060
flipper_length_mm 0.656 −0.584 1.000 0.871 0.170
body_mass_g 0.595 −0.472 0.871 1.000 0.042
year 0.055 −0.060 0.170 0.042 1.000
Tip
  1. Which variables have the strongest correlation?

  2. Which variables have the weakest correlation?

For one numeric and one categorical variable, we can look at the same univariate statistics we have previously, but stratified by the groups of the categorical variable. Let’s look at statistics for the flipper lengths of the penguins stratified by species.

Code
# Calculating descriptive statistics
quant_2_stats <- penguins |> 
  group_by(species) |> 
  summarize(
  Minimum = min(flipper_length_mm, na.rm = TRUE),
  Q1 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.25),
  Median = median(flipper_length_mm, na.rm = TRUE),
  Q3 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.75),
  Maximum = max(flipper_length_mm, na.rm = TRUE),
  Mean = mean(flipper_length_mm, na.rm = TRUE),
  Range = Maximum - Minimum,
  SD = sd(flipper_length_mm, na.rm = TRUE))

# Printing table of statistics
quant_2_stats |> 
    gt() |>
    fmt_number(columns = everything(), 
               decimals = 2,
               drop_trailing_zeros = TRUE)
Table 8: Table of groupwise summary statistics.
species Minimum Q1 Median Q3 Maximum Mean Range SD
Adelie 172 186 190 195 210 189.95 38 6.54
Chinstrap 178 191 196 201 212 195.82 34 7.13
Gentoo 203 212 216 221 231 217.19 28 6.48
Tip
  1. Which penguin species typically had the largest flipper lengths?

  2. Which penguin species had the most variability in their flipper lengths?

For two categorical variables, a table of counts is commonly calculated as below.

Code
# Calculating counts for cross-section of species and island
species_island_counts <- penguins |> 
  dplyr::count(species, island)

# Printing frequency table
species_island_counts |> 
  gt()
Table 9: Table of counts for the cross section of species and island.
species island n
Adelie Biscoe 44
Adelie Dream 56
Adelie Torgersen 52
Chinstrap Dream 68
Gentoo Biscoe 124
Tip
  1. Was any penguin species found on more than one island?

  2. How many penguins were found on Dream island?

Bivariate visualizations

For two numeric variables, the scatter plot is the ideal visualization. Let’s look at the relationship between the penguin flipper lengths and body masses.

Code
# Creating a scatter plot
penguins |> 
  ggplot(aes(x = body_mass_g, y = flipper_length_mm)) + 
  geom_point(pch = 21, color = "white", fill = "black") +
    labs(title = "Penguin flipper lengths by body mass",
       x = "Body mass (g)",
       y = "Flipper length (mm)",
       caption = "Data source: palmerpenguins R package")

We can also add a straight line of best fit to scatter plots as well.

Code
# Creating a scatter plot with a line of best fit
penguins |> 
  ggplot(aes(x = body_mass_g, y = flipper_length_mm)) + 
  geom_point(pch = 21, color = "white", fill = "black") +
  geom_smooth(method = 'lm', se = FALSE) +
      labs(title = "Penguin flipper lengths by body mass",
       x = "Body mass (g)",
       y = "Flipper length (mm)",
       caption = "Data source: palmerpenguins R package")

Tip

What do we observe based on the scatter plot above?

For one numeric and one categorical variable, side-by-side box plots are a useful visualization. We can use a side-by-side box plot to explore the penguin flipper lengths by species as below.

Code
# Creating side-by-side box plots
penguins |> 
  ggplot(aes(x = species, y = flipper_length_mm, fill = species)) + 
  geom_boxplot() + 
  scale_fill_manual(values = penguin_colors) +
        labs(title = "Penguin flipper lengths by species",
       x = "Species",
       y = "Flipper length (mm)",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "none")

Tip

What do we observe based on the side-by-side box plots?

Lastly, for two categorical variables, a clustered bar chart or dumbbell chart are the more commonly used visualizations.

Code
# Creating a clustered bar chart
penguins |> dplyr::count(species, sex, .drop = FALSE) |> 
    dplyr::filter(!is.na(species), !is.na(sex)) |> 
  mutate(sex = fct_reorder(sex, n)) |> 
  ggplot(aes(x = sex, y = n,
             fill = species)) + 
  geom_col(position="dodge", color = "black") +
  scale_fill_manual(values = penguin_colors) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
    labs(title = "Distribution of penguin species by sex",
       x = "Sex",
       y = "Frequency",
       caption = "Data source: palmerpenguins R package",
       fill = "Species")

Code
# Creating dumbbell chart
penguins |> 
  dplyr::count(species, sex, .drop = FALSE) |> 
  dplyr::filter(!is.na(species), !is.na(sex)) |> 
  dplyr::mutate(species_sex = str_c(species, "_", sex)) |> 
  ggplot(aes(x = n, y = sex,
             color = species, fill = species)) + 
  geom_line(aes(group = sex), color = "black") +
    geom_point(pch = 21, color = "black", size = 5) +
  scale_fill_manual(values = penguin_colors) +
      labs(title = "Distribution of penguin species by sex",
           x = "Frequency",
           y = "Sex",
           caption = "Data source: palmerpenguins R package",
       fill = "Species") +
  theme(legend.position = "bottom")

Tip

What can we say about the penguin sex and species based on the clustered bar chart and dumbbell chart?

Multivariate analyses

Descriptive statistics could also be obtained jointly for three or more variables, although this is less commonly done. For example, we could obtain counts for the cross-section of the species, island, and sex variables.

Code
# Calculating counts for cross-section of species, island, and sex
species_island_sex_counts <- penguins |> 
  dplyr::count(species, island, sex, .drop = FALSE)

# Printing frequency table
species_island_sex_counts |> 
  gt()
Table 10: Table of counts for the cross section of species, island, and sex.
species island sex n
Adelie Biscoe female 22
Adelie Biscoe male 22
Adelie Dream female 27
Adelie Dream male 28
Adelie Dream NA 1
Adelie Torgersen female 24
Adelie Torgersen male 23
Adelie Torgersen NA 5
Chinstrap Dream female 34
Chinstrap Dream male 34
Gentoo Biscoe female 58
Gentoo Biscoe male 61
Gentoo Biscoe NA 5

With this many variables, however, it is often better to use data visualizations to communicate aspects of the data.

Multivariate visualizations

If we want to jointly visualize three numeric variables, we can use a scatter plot coloring the points based on a third variable.

Let’s look at the relationship between the penguin flipper lengths and body masses while coloring the points based on the penguin species.

Code
# Creating a scatter plot
penguins |> 
  ggplot(aes(x = body_mass_g, y = flipper_length_mm, fill = species)) + 
  geom_point(pch = 21, color = "white") +
    scale_fill_manual(values = penguin_colors) +
    labs(title = "Penguin flipper lengths by body mass",
       x = "Body mass (g)",
       y = "Flipper length (mm)",
       fill = "Species",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "bottom")

Tip

What can we say about the relationship between the penguin flipper lengths (mm) and body masses (g) for each penguin species based on the scatter plot?

Simpson’s Paradox

Sometimes the relationship between variables is the same within different subgroups as it is for the data as a whole. However, relationships in the full data set can be misleading or distorted when a confounding variable or lurking variable is unaccounted for.

Observe the relationship between the bill lengths and bill depths of the penguins in the scatter plot below.

Code
# Creating a scatter plot
penguins |> 
  ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
    labs(title = "Penguin bill length by bill depth",
       x = "Bill depth (mm)",
       y = "Bill length (mm)",
       fill = "Species",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "bottom")

Tip

How can we describe the relationship in the full data set?

Code
# Creating a scatter plot
penguins |> 
  ggplot(aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
    geom_smooth(method = "lm", se = FALSE) +
  geom_point() +
    scale_color_manual(values = penguin_colors) +
    labs(title = "Penguin bill length by bill depth",
       x = "Bill depth (mm)",
       y = "Bill length (mm)",
       fill = "Species",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "bottom")

Tip

How can we describe the relationship for each penguin species?

Simpson’s paradox1 is a statistical phenomenon where a trend that appears in several different groups of data disappears or even reverses when these groups are combined.

Simpson’s Paradox can arise in many real life data sets, including the Palmer Penguins data.

Faceting

An alternative way to incorporate the information of an additional categorical variable in a plot is to use faceting, breaking up the plot into multiple subplots. Below we showcase the relationship between the penguin flipper lengths (mm) and body masses (g) for each penguin species using faceting.

Code
# Creating a faceted scatter plot
penguins |> 
  ggplot(aes(x = body_mass_g, y = flipper_length_mm, fill = species)) + 
  geom_point(pch = 21, color = "white") +
    scale_fill_manual(values = penguin_colors) +
  facet_grid(species ~ .) +
    labs(title = "Penguin flipper lengths by body mass",
       x = "Body mass (g)",
       y = "Flipper length (mm)",
       fill = "Species",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "bottom",
        strip.background.y = element_rect(linetype = "solid", color = "black"))

For three categorical variables, we could create a faceted dumbbell chart to visualize the distribution of penguins in terms of their species, sex, and which island they were found on (Biscoe, Dream, or Torgersen).

Code
# Creating a faceted dumbbell chart
penguins |> 
  dplyr::count(species, sex, island, .drop = FALSE) |> 
  dplyr::filter(!is.na(species), !is.na(sex)) |> 
  dplyr::mutate(species_sex = str_c(species, "_", sex)) |> 
  ggplot(aes(x = n, y = sex,
             color = species, fill = species)) + 
  geom_line(aes(group = sex), color = "black") +
    geom_point(pch = 21, color = "black", size = 5) +
  scale_fill_manual(values = penguin_colors) +
      labs(title = "Distribution of penguin species and sex by island",
           x = "Frequency",
           y = "Sex",
           caption = "Data source: palmerpenguins R package",
       fill = "Species") +
  facet_grid(island ~ .) +
  theme(legend.position = "bottom",
        strip.background.y = element_rect(linetype = "solid", color = "black"))

Tip

What are some main takeaways from the faceted dumbbell chart?

We could also visualize three numeric variables. Let’s look at the relationship between the penguin flipper lengths and body masses while coloring the points based on the bill lengths (mm).

Code
# Creating a scatter plot
penguins |> 
  ggplot(aes(x = body_mass_g, y = flipper_length_mm, fill = bill_length_mm)) + 
  geom_point(pch = 21, color = "white") +
    labs(title = "Penguin flipper lengths by body mass",
       x = "Body mass (g)",
       y = "Flipper length (mm)",
       fill = "Bill length (mm)",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "bottom")

We can see from the plot above that penguins with smaller flippers and body masses tend to also have smaller bill lengths. However, the plot is not the easiest to glean information from with three variables. Additional dimensions could be added to the plot using other aesthetics like the shape or size of the points, but in general it is best to use two or three variables at most in a single visualization.

Key EDA concepts

Efficient exploration

Careful creation of visualizations and summary statistics is ideal, but when there is a time constraint to analyses faster approaches are needed. Using the skim() function from the skimr, we can efficiently summarize data sets in R.

Code
# Efficiently skimming the data
skim(penguins)
Data summary
Name penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
character 3
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
species 0 1.00 6 9 0 3 0
island 0 1.00 5 9 0 3 0
sex 11 0.97 4 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇
Tip
  1. What are some notable patterns of missingness, if any?

  2. Are there any extreme outliers?

Fundamental visualizations

When creating data visualizations, it is important that the visualization is appropriate for the types of variables that are used. Below is a summary of some fundamental visualizations.

Lastly, besides Simpson’s Paradox there are other ways that data visualizations can be misleading as well that are important to be mindful of.

Footnotes

  1. Clifford H. Wagner (February 1982). “Simpson’s Paradox in Real Life”. The American Statistician. 36 (1): 46–48. doi:10.2307/2684093. JSTOR 2684093.↩︎