Introduction to Data Science

DSA 220 - Introduction to Data Science and Analytics

Author

Andrew DiLernia

Learning Objectives

  • Understand key data science terminology and the data science cycle

  • Understand fundamental data structures and file types

What is Data Science?

Data science is a field of study that consists of collecting, managing, and analyzing various types of data to produce meaningful insights.

Historically, data science tasks were more separated across different areas. Commonly, domain experts collected data, computer scientists managed it, and statisticians conducted analyses. In the modern field of data science, the boundaries between these distinct roles have been blurred, often requiring a data scientist to possess a combination of knowledge regarding the collection, management, and analysis of data.

The Data Science Cycle

The data science cycle is the typical pipeline of a data science project, consisting of problem definition, data collection, data preparation, data analysis, and data reporting. Data collection and preparation are usually the most time-consuming aspects of the cycle, often constituting about half of the total project time.

flowchart LR
  A(Problem \n Definition) --> B(Data \n Collection) --> C(Data \n Preparation) --> D(Data \n Analysis) --> E(Data \n Reporting)

  style A fill:#E69F00
  style B fill:#009E73
  style C fill:#56B4E9
  style D fill:#CC79A7
  style E fill:#D55E00

1. Problem Definition, Data Collection, and Preparation

This initial phase of the data science cycle sets the foundation for the entire project.

Problem Definition: The first step is to establish a clear and precise problem statement, defining the main goals and scope.

Data Collection: This is the process of gathering information on variables of interest either purposefully (e.g., a customer survey) or as a by-product of user activity (e.g., web search histories).

Data Preparation: Also called data processing, this is the step where raw data is cleaned and restructured into a form suitable for analysis or visualization.
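
As a minimal sketch of what data preparation can look like in practice, the Python snippet below cleans two raw records that are assumed to have arrived as text. The records, the field names, and the helper `clean_record` are hypothetical and chosen only for illustration; real projects often rely on a library such as pandas for this step.

```python
# Hypothetical raw records, as they might arrive from a survey export.
raw_records = [
    {"name": "  Ana ", "height": "1.74", "member": "TRUE"},
    {"name": "Ben", "height": "", "member": "false"},
]


def clean_record(record):
    """Return a copy of one record with tidy, correctly typed values."""
    height_text = record["height"].strip()
    return {
        # Remove stray spaces around the name.
        "name": record["name"].strip(),
        # Convert text to a number; an empty string becomes a missing value.
        "height": float(height_text) if height_text else None,
        # Convert "TRUE"/"false" text to a real boolean.
        "member": record["member"].strip().lower() == "true",
    }


cleaned = [clean_record(r) for r in raw_records]
print(cleaned)
```

After cleaning, each value has a consistent type, and missing entries are marked explicitly rather than left as empty text.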

2. Data Analysis

Data Analysis: The process of examining data to discover meaningful insights.

Analyses can consist of calculating summary statistics for the data or implementing statistical models.
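
As a small illustration of summary statistics, the sketch below uses Python's built-in `statistics` module on the six `height` values that appear in Table 1 later in these notes. The choice of statistics is only one reasonable option, not a prescribed set.

```python
import statistics

# The six height values listed in Table 1.
heights = [1.74, 1.84, 1.73, 1.90, 1.60, 1.78]

summary = {
    "mean": statistics.mean(heights),
    "median": statistics.median(heights),
    "min": min(heights),
    "max": max(heights),
    "stdev": statistics.stdev(heights),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A statistical model would go a step further, for example by relating one variable to another rather than summarizing a single column.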

3. Data Reporting

The final step of the data science cycle is to effectively communicate the findings to the audience of interest. The style, conciseness, and amount of technical jargon should be tailored to a given audience.

Data visualization, the graphical presentation of data using visual elements such as charts, graphs, and maps, is a key component of data reporting. A main goal of data visualization is to make it easier and faster for the intended audience to understand complex information.

When precision or detailed comparisons are important, numerical summaries or tables are useful as well for effective data reporting.

Data Management

Over time, data has become larger and more complex, so methods for storing and managing data have evolved accordingly.

Data Warehousing: The use of modern data management systems to store and manage large volumes of data from various sources in a central location (commonly the cloud), facilitating efficient retrieval and analysis.

Examples: Google BigQuery, Amazon Redshift, Microsoft Azure Synapse Analytics, and Snowflake

Data and Datasets

Data is defined as any piece of information or reference point that can be analyzed to produce higher-level insights. While often thought of as numbers, data can also be text, images, or any other analyzable content.

A dataset is a collection of data organized for analysis. Individual observations are commonly called items or instances, and the characteristics that describe each item are commonly called attributes or variables.

For example, below is a table of data on actors from the Marvel Avengers movies. Each row or observation in the table represents a single actor, and each column represents an attribute or variable.

Table 1: Data on Marvel’s Avengers actors
name                got_avengers_tattoo  gender  nationality  height  birthday    number_of_roles
Robert Downey Jr.   TRUE                 male    us           1.74    1965-04-04  93
Chris Evans         TRUE                 male    us           1.84    1981-06-13  64
Mark Ruffalo        FALSE                male    us           1.73    1967-11-22  81
Chris Hemsworth     TRUE                 male    au           1.90    1983-08-11  51
Scarlett Johansson  TRUE                 female  us           1.60    1984-11-22  82
Jeremy Renner       TRUE                 male    us           1.78    1971-01-07  62

Tip

How many observations are in the table above?

Tip

How many variables are in the table above?

Data Types

Data are commonly categorized into two main types: numeric and categorical.

In data analyses, it is important to correctly identify the types of variables we are working with since different models, statistics, and visualizations are appropriate for certain types of variables.

flowchart TD
  A(Variable) --> B(Numeric)
  A(Variable) --> C(Categorical)
  B(Numeric) --> D(Continuous)
  B(Numeric) --> E(Discrete)
  C(Categorical) --> F(Ordinal)
  C(Categorical) --> G(Nominal)

  style A fill:#43aa8b
  style B fill:#f9c74f
  style C fill:#277da1
  style D fill:#f9c74f
  style E fill:#f9c74f
  style F fill:#277da1
  style G fill:#277da1

Numeric Data

Numeric data consists of measurable quantities. Arithmetic operations such as addition, subtraction, and multiplication are meaningful for numeric values. In many domains, numeric data is synonymous with quantitative data.

  • Continuous values: Can take any value within a range, with theoretically unlimited precision (e.g., temperature 🌡️ or height 📏).

  • Discrete values: Can take only separate, countable values, and are commonly constrained to integers (e.g., number of students in class 🎓).

Categorical Data

Categorical data consists of values or categories, commonly from a finite set, which can be words, symbols, or numbers that don’t represent a measurable quantity.

  • Ordinal: Values or categories have a natural order or rank (e.g., class standing: freshman, sophomore, junior, senior; or ratings of agreement: strongly disagree, somewhat disagree, neutral, somewhat agree, strongly agree).

  • Nominal: The values have no intrinsic order (e.g., different universities: GVSU, MSU, and SVSU).
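
The sketch below illustrates, under simple assumptions, why the variable type matters: different operations are meaningful for each type. All values are hypothetical, and the variable names are chosen only for this example.

```python
from collections import Counter
from statistics import mean

# Numeric (continuous): arithmetic such as averaging is meaningful.
temperatures = [20.5, 22.1, 19.8]
average_temperature = mean(temperatures)

# Numeric (discrete): counts of students in three hypothetical classes.
class_sizes = [28, 31, 25]
total_students = sum(class_sizes)

# Ordinal: categories have a rank, so we record that order explicitly.
standing_order = ["freshman", "sophomore", "junior", "senior"]
standings = ["junior", "freshman", "senior", "freshman"]
sorted_standings = sorted(standings, key=standing_order.index)
highest_standing = max(standings, key=standing_order.index)

# Nominal: no order exists, so counting categories is the main operation.
universities = ["GVSU", "MSU", "GVSU", "SVSU"]
university_counts = Counter(universities)

print(average_temperature, total_students)
print(sorted_standings, highest_standing)
print(university_counts)
```

Note that sorting the universities alphabetically is possible, but the resulting order carries no meaning, whereas the order of class standings does.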

Tip

In Table 1, what types of variables are name and got_avengers_tattoo?

Tip

In Table 1, what types of variables are nationality and height?

Data Formats and Structures

Datasets are organized and stored in various ways.

Data Structures

  • Structured Data: This data has a predefined model and is organized neatly into a tabular format with rows and columns, with one row for each observation.

  • Unstructured Data: This data lacks a predefined model and doesn’t fit into a tabular (simple rows and columns) format. Examples include text from Amazon reviews, images, or videos from TikTok.

Tip

Is the data in Table 1 structured or unstructured? Why?

Data Formats

Three commonly used plain-text formats for storing structured data are:

  • CSV (Comma-Separated Values): The simplest format, CSV files store tabular data where each line represents a row or observation and commas separate the attribute or variable values. CSV files are relatively easy to work with. However, they are poorly suited to storing hierarchical data, and values containing special characters (such as commas or quotation marks) require careful quoting. We can view a raw CSV file for the Marvel data.

  • JSON (JavaScript Object Notation): Uses human-readable key : value pairs to structure data. JSON files are compatible with many programming languages and are a common choice for exchanging data between a user and a server. We can view a raw JSON file for the Marvel data.

  • XML (Extensible Markup Language): XML files use a hierarchy of <tags> to structure data. XML files are commonly used for representing complex, hierarchical datasets and for including metadata. We can view a raw XML file for the Marvel data.

Since CSV, JSON, and XML files are all plain-text files, they can be opened and edited with any plain-text editor (e.g., Notepad, Vim, Sublime Text).
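
To make the three formats concrete, the sketch below stores part of one row of Table 1 in each format and reads it back using only Python's standard library. The exact layouts shown here (field selection, tag names) are assumptions made for illustration and may differ from the actual Marvel data files.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same record written in three plain-text formats (assumed layouts).
csv_text = "name,nationality,height\nChris Hemsworth,au,1.90\n"
json_text = '{"name": "Chris Hemsworth", "nationality": "au", "height": 1.90}'
xml_text = (
    "<actor>"
    "<name>Chris Hemsworth</name>"
    "<nationality>au</nationality>"
    "<height>1.90</height>"
    "</actor>"
)

# CSV: each line is a row; every value is read back as text.
csv_row = next(csv.DictReader(io.StringIO(csv_text)))

# JSON: key-value pairs; numbers keep their numeric type.
json_row = json.loads(json_text)

# XML: a hierarchy of tags; element text is also read back as text.
root = ET.fromstring(xml_text)
xml_row = {child.tag: child.text for child in root}

for label, row in [("CSV", csv_row), ("JSON", json_row), ("XML", xml_row)]:
    print(label, row["height"], type(row["height"]).__name__)
```

One practical difference this shows is that JSON preserves the distinction between numbers and text, while values read from CSV and XML typically need to be converted after loading.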

Tip

Which data format would be best for storing all data available on the Avengers actors?

Tip

Which data format would be best for storing most of the information in a simplified structure?

The Role of Technology in Data Analysis

Technology provides the tools and algorithms to efficiently process and analyze complex datasets, enabling data scientists to produce meaningful insights. The best technology to use depends on one’s target audience and the size and complexity of the data.

Spreadsheet Programs

Spreadsheet programs are useful for data entry, manipulation, and visualization when working with relatively simple, tabular data. They are user-friendly but have limitations with larger datasets and more sophisticated analyses. They are also not ideal for reproducibility.

  • Excel: A popular and widely used Microsoft application for analyzing tabular data, integrating well with other Office products.

  • Google Sheets: A cloud-based spreadsheet program that facilitates real-time collaboration and access from any device with an internet connection.

Programming Languages

Programming languages are useful for performing complex tasks or analyses in a reproducible manner, such as advanced data manipulation, modeling, and automation.

  • Python is a versatile, general-purpose language with a wide collection of libraries (e.g., NumPy, pandas, Matplotlib) for many different methods of analysis. It is open-source, widely used, and well-suited for production environments. Machine learning (scikit-learn), deep learning (TensorFlow), natural language processing (NLTK), and web scraping (Beautiful Soup, Scrapy) are just some of the strengths of Python.

  • R is an open-source language developed by statisticians with an emphasis on statistical computing and data visualization, offering a vast set of options for both. R is broadly recognized as one of the most effective tools for advanced data visualization, mostly due to the ggplot2 package and its extensions. It is also known for supporting reproducible analyses (through R Markdown and the more modern Quarto). R is widely used in academia across many areas and in industry for statistical analyses.

Specialized Languages: Other languages like SQL, Scala, and Julia are used for more specific data-related tasks.

  • SQL is an essential and well-established tool for querying structured data in relational databases.

  • Scala, integrated with Apache Spark, is an effective tool for distributed data processing.

  • Julia offers speed and mathematical elegance for scientific computing and numerical modeling.
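
As a brief illustration of querying structured data with SQL, the sketch below loads three rows of Table 1 into a temporary in-memory SQLite database using Python's built-in `sqlite3` module. The table name `actors` and the choice of columns are assumptions made for this example.

```python
import sqlite3

# Create a temporary in-memory relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actors (name TEXT, nationality TEXT, height REAL)")
conn.executemany(
    "INSERT INTO actors VALUES (?, ?, ?)",
    [
        ("Robert Downey Jr.", "us", 1.74),
        ("Chris Hemsworth", "au", 1.90),
        ("Scarlett Johansson", "us", 1.60),
    ],
)

# Select the actors with nationality "us", tallest first.
rows = conn.execute(
    "SELECT name, height FROM actors WHERE nationality = ? ORDER BY height DESC",
    ("us",),
).fetchall()
conn.close()

print(rows)
```

The query describes which rows and columns are wanted, and the database decides how to retrieve them, which is part of why SQL scales well to large relational datasets.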

Other Data Analysis and Visualization Tools

These tools are particularly strong in creating effective visualizations.

  • Tableau and Power BI: User-friendly applications known for creating effective, interactive visualizations and analysis dashboards. They can also perform simple data analysis tasks.