Introduction to Data Science
DSA 220 - Introduction to Data Science and Analytics
Learning Objectives
Understand key data science terminology and the data science cycle
Understand fundamental data structures and file types
What is Data Science?
Data science is a field of study that consists of collecting, managing, and analyzing various types of data to produce meaningful insights.
Historically, data science tasks were more separated across different areas. Commonly, domain experts collected data, computer scientists managed it, and statisticians conducted analyses. In the modern field of data science, the boundaries between these distinct roles have been blurred, often requiring a data scientist to possess a combination of knowledge regarding the collection, management, and analysis of data.
The Data Science Cycle
The data science cycle is the typical pipeline of a data science project: problem definition, data collection, data preparation, data analysis, and data reporting. Data collection and preparation are usually the most time-consuming aspects of the cycle, often constituting about half of the total project time.
```mermaid
flowchart LR
    A(Problem \n Definition) --> B(Data \n Collection) --> C(Data \n Preparation) --> D(Data \n Analysis) --> E(Data \n Reporting)
    style A fill:#E69F00
    style B fill:#009E73
    style C fill:#56B4E9
    style D fill:#CC79A7
    style E fill:#D55E00
```
1. Problem Definition, Data Collection, and Preparation
This initial phase of the data science cycle sets the foundation for the entire project.
Problem Definition: The first step is to establish a clear and precise problem statement, defining the main goals and scope.
Data Collection: This is the process of gathering information on variables of interest either purposefully (e.g., a customer survey) or as a by-product of user activity (e.g., web search histories).
Data Preparation: Also called data processing, this is the step where raw data is cleaned and restructured into a form for analysis or visualization.
2. Data Analysis
Data Analysis: The process of analyzing data to discover meaningful insights.
Analyses can consist of calculating summary statistics for the data or implementing statistical models.
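As a small illustration of summary statistics, here is a sketch using Python's built-in `statistics` module on a made-up sample of exam scores (the values are hypothetical):

```python
import statistics

# A small, made-up sample of exam scores (hypothetical data)
scores = [72, 85, 90, 68, 77, 95, 81]

# Common summary statistics
mean_score = statistics.mean(scores)      # average value
median_score = statistics.median(scores)  # middle value when sorted
stdev_score = statistics.stdev(scores)    # sample standard deviation

print(mean_score, median_score, stdev_score)
```

More sophisticated analyses replace these one-number summaries with statistical models, but the workflow (data in, insight out) is the same.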
3. Data Reporting
The final step of the data science cycle is to effectively communicate the findings to the audience of interest. The style, conciseness, and amount of technical jargon should be tailored to a given audience.
Data visualization, the graphical presentation of data using visual elements such as charts, graphs, and maps, is a key component of data reporting. A main goal of data visualization is to make it easier and faster for the intended audience to understand complex information.
When precision or detailed comparisons are important, numerical summaries or tables are useful as well for effective data reporting.
Data Management
Over time, data has become larger and more complex, so methods for storing and managing data have evolved accordingly.
Data Warehousing: Modern data management systems that store and manage large volumes of data from various sources in a central location (often in the cloud), facilitating efficient retrieval and analysis.
Examples: Google BigQuery, Amazon Redshift, Microsoft Azure Synapse Analytics, and Snowflake
Data and Datasets
Data is defined as any piece of information or reference point that can be analyzed to produce higher-level insights. While often thought of as numbers, data can also be text, images, or any other analyzable content.
A dataset is a collection of data organized for analysis. Individual observations are commonly called items or instances, and the characteristics that describe each item are commonly called attributes or variables.
For example, below is a table of data on actors from the Marvel Avengers movies. Each row or observation in the table represents a single actor, and each column represents an attribute or variable.
| name | got_avengers_tattoo | gender | nationality | height | birthday | number_of_roles |
|---|---|---|---|---|---|---|
| Robert Downey Jr. | TRUE | male | us | 1.74 | 1965-04-04 | 93 |
| Chris Evans | TRUE | male | us | 1.84 | 1981-06-13 | 64 |
| Mark Ruffalo | FALSE | male | us | 1.73 | 1967-11-22 | 81 |
| Chris Hemsworth | TRUE | male | au | 1.90 | 1983-08-11 | 51 |
| Scarlett Johansson | TRUE | female | us | 1.60 | 1984-11-22 | 82 |
| Jeremy Renner | TRUE | male | us | 1.78 | 1971-01-07 | 62 |
How many observations are in the table above?
How many variables are in the table above?
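To make these questions concrete, here is a minimal sketch using the third-party `pandas` library that rebuilds Table 1 as a DataFrame; its `shape` attribute reports (observations, variables):

```python
import pandas as pd

# Rebuild Table 1: one row per actor (observation), one column per variable
actors = pd.DataFrame({
    "name": ["Robert Downey Jr.", "Chris Evans", "Mark Ruffalo",
             "Chris Hemsworth", "Scarlett Johansson", "Jeremy Renner"],
    "got_avengers_tattoo": [True, True, False, True, True, True],
    "gender": ["male", "male", "male", "male", "female", "male"],
    "nationality": ["us", "us", "us", "au", "us", "us"],
    "height": [1.74, 1.84, 1.73, 1.90, 1.60, 1.78],
    "birthday": ["1965-04-04", "1981-06-13", "1967-11-22",
                 "1983-08-11", "1984-11-22", "1971-01-07"],
    "number_of_roles": [93, 64, 81, 51, 82, 62],
})

print(actors.shape)  # (number of observations, number of variables)
```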
Data Types
Data are commonly categorized into two main types: numeric and categorical.
In data analyses, it is important to correctly identify the types of variables we are working with since different models, statistics, and visualizations are appropriate for certain types of variables.
```mermaid
flowchart TD
    A(Variable) --> B(Numeric)
    A --> C(Categorical)
    B --> D(Continuous)
    B --> E(Discrete)
    C --> F(Ordinal)
    C --> G(Nominal)
    style A fill:#43aa8b
    style B fill:#f9c74f
    style C fill:#277da1
    style D fill:#f9c74f
    style E fill:#f9c74f
    style F fill:#277da1
    style G fill:#277da1
```
Numeric Data
Numeric data consists of measurable quantities. Arithmetic operations such as addition, subtraction, and multiplication are meaningful for numeric values. In many domains, numeric data is synonymous with quantitative data.
Continuous values: Can take any value within a range, with unlimited possible precision (e.g., temperature 🌡️ or height 📏).
Discrete values: Take only distinct, countable values, and are commonly constrained to integers (e.g., number of students in a class 🎓).
Categorical Data
Categorical data consist of values or categories commonly from a finite set, which can be words, symbols, or numbers that don’t represent a measurable quantity.
Ordinal: Values or categories have a natural order or rank (e.g., class standing: freshman, sophomore, junior, senior; or ratings of agreement: strongly disagree, somewhat disagree, neutral, somewhat agree, strongly agree).
Nominal: The values have no intrinsic order (e.g., different universities; GVSU, MSU, and SVSU).
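The ordinal/nominal distinction can be made explicit in code. Below is a sketch using the third-party `pandas` library, where declaring an ordered categorical makes order-aware operations such as `min()` meaningful:

```python
import pandas as pd

# Ordinal: class standing has a natural order, which we declare explicitly
standing = pd.Categorical(
    ["junior", "freshman", "senior", "sophomore"],
    categories=["freshman", "sophomore", "junior", "senior"],
    ordered=True,
)
print(standing.min())  # order-aware: the "lowest" standing is freshman

# Nominal: universities have no intrinsic order, so no ordering is declared
schools = pd.Categorical(["GVSU", "MSU", "SVSU"])
```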
In Table 1, what types of variables are name and got_avengers_tattoo?
In Table 1, what types of variables are nationality and height?
Data Formats and Structures
Datasets are organized and stored in various ways.
Data Structures
Structured Data: This data has a predefined model and is organized neatly into a tabular format with rows and columns, with one row for each observation.
Unstructured Data: This data lacks a predefined model and doesn’t fit into a tabular (simple rows and columns) format. Examples include text from Amazon reviews, images, or videos from TikTok.
Is the data in Table 1 structured or unstructured? Why?
Data Formats
Three commonly used plain-text formats for storing structured data are:
CSV (Comma-Separated Values): The simplest format. CSV files store tabular data where each line represents a row or observation and commas separate the attribute or variable values. CSV files are relatively easy to work with; however, they are typically inefficient for storing hierarchical data and require careful quoting to handle special characters (such as commas within values). We can view a raw CSV file for the Marvel data.
JSON (JavaScript Object Notation): Uses human-readable `key: value` pairs to structure data. JSON files are compatible with many programming languages and are a common choice for exchanging data between a user and a server. We can view a raw JSON file for the Marvel data.
XML (Extensible Markup Language): XML files use a hierarchy of `<tags>` to structure data. XML files are commonly used for representing complex, hierarchical datasets and for including metadata. We can view a raw XML file for the Marvel data.
Since CSV, JSON, and XML files are all plain-text files, they can be opened and edited with any plain-text editors (e.g., Notepad, VI Editor, Sublime Text).
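To see the three formats side by side, here is a sketch that expresses the same single observation in each format and parses it with Python's standard library (the snippet values are illustrative):

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same observation expressed in each plain-text format
csv_text = "name,height\nChris Evans,1.84\n"
json_text = '{"name": "Chris Evans", "height": 1.84}'
xml_text = "<actor><name>Chris Evans</name><height>1.84</height></actor>"

row = next(csv.DictReader(io.StringIO(csv_text)))  # CSV: every value is a string
obj = json.loads(json_text)                        # JSON: numbers stay numeric
actor = ET.fromstring(xml_text)                    # XML: values live in tagged elements

print(row["height"], obj["height"], actor.find("height").text)
```

Note that CSV returns the string `"1.84"` while JSON preserves the number `1.84` — one reason richer formats are preferred when data types matter.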
Which data format would be best for storing all data available on the Avengers actors?
Which data format would be best for storing most of the information in a simplified structure?
The Role of Technology in Data Analysis
Technology provides the tools and algorithms to efficiently process and analyze complex datasets, enabling data scientists to produce meaningful insights. The best technology to use depends on one’s target audience and the size and complexity of the data.
Spreadsheet Programs
Spreadsheet programs are useful for data entry, manipulation, and visualization when working with relatively simple, tabular data. They are user-friendly but have limitations with larger datasets and more sophisticated analyses, and they are not ideal for reproducibility.
Excel: A popular and widely used Microsoft application for analyzing tabular data, integrating well with other Office products.
Google Sheets: A cloud-based spreadsheet program that facilitates real-time collaboration and access from any device with an internet connection.
Programming Languages
Programming languages are useful for performing complex tasks or analyses in a reproducible manner, such as advanced data manipulation, modeling, and automation.
Python is a versatile, general-purpose language with a wide collection of libraries (e.g., `NumPy`, `Pandas`, `Matplotlib`) for many different types of analysis. It is open-source, widely used, and well-suited for production environments. Machine learning (`scikit-learn`), deep learning (`TensorFlow`), natural language processing (`NLTK`), and web scraping (`BeautifulSoup`, `Scrapy`) are just some of the strengths of Python.
R is an open-source language developed by statisticians with an emphasis on statistical computing, offering a vast set of options for statistical modeling and data visualization. R is broadly recognized as one of the most effective tools for advanced data visualization, largely due to the `ggplot2` package and its extensions. It is also known for reproducible analyses (R Markdown and the more modern Quarto). R is widely used in academia across many areas and in industry for statistical analyses.
Specialized Languages: Other languages like SQL, Scala, and Julia are used for more specific data-related tasks.
SQL is an essential and well-established tool for querying structured data in relational databases.
Scala, integrated with Apache Spark, is an effective tool for distributed data processing.
Julia offers speed and mathematical elegance for scientific computing and numerical modeling.
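As a small illustration of SQL's role in querying structured data, the sketch below runs a query against an in-memory SQLite database via Python's standard-library `sqlite3` module (the table and values are made up):

```python
import sqlite3

# Create a throwaway in-memory relational database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE actors (name TEXT, height REAL)")
con.executemany(
    "INSERT INTO actors VALUES (?, ?)",
    [("Chris Evans", 1.84), ("Scarlett Johansson", 1.60)],
)

# A declarative SQL query: which actors are taller than 1.8 m?
tall = con.execute("SELECT name FROM actors WHERE height > 1.8").fetchall()
print(tall)  # [('Chris Evans',)]
```

The query states *what* rows are wanted, and the database engine decides *how* to retrieve them — the defining feature of SQL as a declarative language.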
Other Data Analysis and Visualization Tools
These tools are particularly strong in creating effective visualizations.
- Tableau and Power BI: User-friendly applications known for creating effective, interactive visualizations and analysis dashboards. They can also perform simple data analysis tasks.