Introduction to Data Science
DSA 220 - Introduction to Data Science and Analytics
Learning Objectives
Understand key data science terminology and the data science cycle
Understand fundamental data structures and file types
What is Data Science?
Data science is a field of study that consists of collecting, managing, and analyzing various types of data to produce meaningful insights.
Historically, data science tasks were more separated across different areas. Commonly, domain experts collected data, computer scientists managed it, and statisticians conducted analyses. In the modern field of data science, the boundaries between these distinct roles have been blurred, often requiring a data scientist to possess a combination of knowledge regarding the collection, management, and analysis of data.
The Data Science Cycle
The data science cycle is the typical pipeline of a data science project: problem definition, data collection, data preparation, data analysis, and data reporting. Data collection and preparation are usually the most time-consuming aspects of the cycle, often constituting about half of the total project time.
```mermaid
flowchart LR
    A(Problem \n Definition) --> B(Data \n Collection) --> C(Data \n Preparation) --> D(Data \n Analysis) --> E(Data \n Reporting)
    style A fill:#E69F00
    style B fill:#009E73
    style C fill:#56B4E9
    style D fill:#CC79A7
    style E fill:#D55E00
```
1. Problem Definition, Data Collection, and Preparation
This initial phase of the data science cycle sets the foundation for the entire project.
Problem Definition: The first step is to establish a clear and precise problem statement, defining the main goals and scope.
Data Collection: This is the process of gathering information on variables of interest either purposefully (e.g., a customer survey) or as a by-product of user activity (e.g., web search histories).
Data Preparation: Also called data processing, this is the step where raw data is cleaned and restructured into a form for analysis or visualization.
2. Data Analysis
Data Analysis: The process of analyzing data to discover meaningful insights.
Analyses can consist of calculating summary statistics for the data or implementing statistical models.
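As a small illustration of summary statistics, here is a sketch using Python's built-in `statistics` module on a made-up sample of exam scores (the values are hypothetical):

```python
import statistics

# A small, made-up sample of exam scores (hypothetical data)
scores = [72, 85, 90, 68, 77, 95, 81]

# Common summary statistics
mean_score = statistics.mean(scores)      # average value
median_score = statistics.median(scores)  # middle value when sorted
stdev_score = statistics.stdev(scores)    # sample standard deviation

print(mean_score, median_score, stdev_score)
```

More sophisticated analyses replace these one-number summaries with statistical models, but the workflow (data in, insight out) is the same.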
3. Data Reporting
The final step of the data science cycle is to effectively communicate the findings to the audience of interest. The style, conciseness, and amount of technical jargon should be tailored to a given audience.
Data visualization, the graphical presentation of data using visual elements such as charts, graphs, and maps, is a key component of data reporting. A main goal of data visualization is to make it easier and faster for the intended audience to understand complex information.
When precision or detailed comparisons are important, numerical summaries or tables are useful as well for effective data reporting.
Data Management
Over time, data has become larger and more complex, so methods for storing and managing data have evolved accordingly.
Data Warehousing: Modern data management systems that store and manage large volumes of data from various sources in a central location (often in the cloud), facilitating efficient retrieval and analysis.
Examples: Google BigQuery, Amazon Redshift, Microsoft Azure Synapse Analytics, and Snowflake
Data and Datasets
Data is defined as any piece of information or reference point that can be analyzed to produce higher-level insights. While often thought of as numbers, data can also be text, images, or any other analyzable content.
A dataset is a collection of data organized for analysis. Individual observations are commonly called items or instances, and the characteristics that describe each item are commonly called attributes or variables.
For example, below is a table of data on actors from the Marvel Avengers movies. Each row or observation in the table represents a single actor, and each column represents an attribute or variable.
| name | got_avengers_tattoo | gender | nationality | height | birthday | number_of_roles |
|---|---|---|---|---|---|---|
| Robert Downey Jr. | TRUE | male | us | 1.74 | 1965-04-04 | 93 |
| Chris Evans | TRUE | male | us | 1.84 | 1981-06-13 | 64 |
| Mark Ruffalo | FALSE | male | us | 1.73 | 1967-11-22 | 81 |
| Chris Hemsworth | TRUE | male | au | 1.90 | 1983-08-11 | 51 |
| Scarlett Johansson | TRUE | female | us | 1.60 | 1984-11-22 | 82 |
| Jeremy Renner | TRUE | male | us | 1.78 | 1971-01-07 | 62 |
How many observations are in the table above?
How many variables are in the table above?
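To make these questions concrete, here is a minimal sketch using the third-party `pandas` library that rebuilds Table 1 as a DataFrame; its `shape` attribute reports (observations, variables):

```python
import pandas as pd

# Rebuild Table 1: one row per actor (observation), one column per variable
actors = pd.DataFrame({
    "name": ["Robert Downey Jr.", "Chris Evans", "Mark Ruffalo",
             "Chris Hemsworth", "Scarlett Johansson", "Jeremy Renner"],
    "got_avengers_tattoo": [True, True, False, True, True, True],
    "gender": ["male", "male", "male", "male", "female", "male"],
    "nationality": ["us", "us", "us", "au", "us", "us"],
    "height": [1.74, 1.84, 1.73, 1.90, 1.60, 1.78],
    "birthday": ["1965-04-04", "1981-06-13", "1967-11-22",
                 "1983-08-11", "1984-11-22", "1971-01-07"],
    "number_of_roles": [93, 64, 81, 51, 82, 62],
})

print(actors.shape)  # (number of observations, number of variables)
```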
Data Types
Data are commonly categorized into two main types: numeric and categorical.
In data analyses, it is important to correctly identify the types of variables we are working with since different models, statistics, and visualizations are appropriate for certain types of variables.
```mermaid
flowchart TD
    A(Variable) --> B(Numeric)
    A --> C(Categorical)
    B --> D(Continuous)
    B --> E(Discrete)
    C --> F(Ordinal)
    C --> G(Nominal)
    style A fill:#43aa8b
    style B fill:#f9c74f
    style C fill:#277da1
    style D fill:#f9c74f
    style E fill:#f9c74f
    style F fill:#277da1
    style G fill:#277da1
```
Numeric Data
Numeric data consists of measurable quantities. Arithmetic operations such as addition, subtraction, and multiplication are meaningful for numeric values. In many domains, numeric data is synonymous with quantitative data.
Continuous values: Can take any value within a range, with unlimited possible precision (e.g., temperature 🌡️ or height 📏).
Discrete values: Take only distinct, countable values, and are commonly constrained to integers (e.g., number of students in a class 🎓).
Categorical Data
Categorical data consist of values or categories commonly from a finite set, which can be words, symbols, or numbers that don’t represent a measurable quantity.
Ordinal: Values or categories have a natural order or rank (e.g., class standing: freshman, sophomore, junior, senior; or ratings of agreement: strongly disagree, somewhat disagree, neutral, somewhat agree, strongly agree).
Nominal: The values have no intrinsic order (e.g., different universities; GVSU, MSU, and SVSU).
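The ordinal/nominal distinction can be made explicit in code. Below is a sketch using the third-party `pandas` library, where declaring an ordered categorical makes order-aware operations such as `min()` meaningful:

```python
import pandas as pd

# Ordinal: class standing has a natural order, which we declare explicitly
standing = pd.Categorical(
    ["junior", "freshman", "senior", "sophomore"],
    categories=["freshman", "sophomore", "junior", "senior"],
    ordered=True,
)
print(standing.min())  # order-aware: the "lowest" standing is freshman

# Nominal: universities have no intrinsic order, so no ordering is declared
schools = pd.Categorical(["GVSU", "MSU", "SVSU"])
```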
In Table 1, what types of variables are name and got_avengers_tattoo?
In Table 1, what types of variables are nationality and height?
Data Formats and Structures
Datasets are organized and stored in various ways.
Data Structures
Structured Data: This data has a predefined model and is organized neatly into a tabular format with rows and columns, with one row for each observation.
Unstructured Data: This data lacks a predefined model and doesn’t fit into a tabular (simple rows and columns) format. Examples include text from Amazon reviews, images, or videos from TikTok.
Is the data in Table 1 structured or unstructured? Why?
Data Formats
Three commonly used plain-text formats for storing structured data are:
CSV (Comma-Separated Values): The simplest format. CSV files store tabular data where each line represents a row or observation and commas separate the attribute or variable values. CSV files are relatively easy to work with; however, they are typically inefficient for storing hierarchical data and require careful quoting to handle special characters (such as commas within values). We can view a raw CSV file for the Marvel data.
JSON (JavaScript Object Notation): Uses human-readable `key: value` pairs to structure data. JSON files are compatible with many programming languages and are a common choice for exchanging data between a user and a server. We can view a raw JSON file for the Marvel data.
XML (Extensible Markup Language): XML files use a hierarchy of `<tags>` to structure data. XML files are commonly used for representing complex, hierarchical datasets and for including metadata. We can view a raw XML file for the Marvel data.
Since CSV, JSON, and XML files are all plain-text files, they can be opened and edited with any plain-text editors (e.g., Notepad, VI Editor, Sublime Text).
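To see the three formats side by side, here is a sketch that expresses the same single observation in each format and parses it with Python's standard library (the snippet values are illustrative):

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same observation expressed in each plain-text format
csv_text = "name,height\nChris Evans,1.84\n"
json_text = '{"name": "Chris Evans", "height": 1.84}'
xml_text = "<actor><name>Chris Evans</name><height>1.84</height></actor>"

row = next(csv.DictReader(io.StringIO(csv_text)))  # CSV: every value is a string
obj = json.loads(json_text)                        # JSON: numbers stay numeric
actor = ET.fromstring(xml_text)                    # XML: values live in tagged elements

print(row["height"], obj["height"], actor.find("height").text)
```

Note that CSV returns the string `"1.84"` while JSON preserves the number `1.84` — one reason richer formats are preferred when data types matter.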
Which data format would be best for storing all data available on the Avengers actors?
Which data format would be best for storing most of the information in a simplified structure?
The Role of Technology in Data Analysis
Technology provides the tools and algorithms to efficiently process and analyze complex datasets, enabling data scientists to produce meaningful insights. The best technology to use depends on one’s target audience and the size and complexity of the data.
Spreadsheet Programs
Spreadsheet programs are useful for data entry, manipulation, and visualization when working with relatively simple, tabular data. They are user-friendly but have limitations with larger datasets and more sophisticated analyses, and they are not ideal for reproducibility.
Excel: A popular and widely used Microsoft application for analyzing tabular data, integrating well with other Office products.
Google Sheets: A cloud-based spreadsheet program that facilitates real-time collaboration and access from any device with an internet connection.
Programming Languages
Programming languages are useful for performing complex tasks or analyses in a reproducible manner, such as advanced data manipulation, modeling, and automation.
Python is a versatile, general-purpose language with a wide collection of libraries (e.g., `NumPy`, `Pandas`, `Matplotlib`) for many different types of analysis. It is open-source, widely used, and well-suited for production environments. Machine learning (`scikit-learn`), deep learning (`TensorFlow`), natural language processing (`NLTK`), and web scraping (`BeautifulSoup`, `Scrapy`) are just some of the strengths of Python.
R is an open-source language developed by statisticians with an emphasis on statistical computing, offering a vast set of options for statistical modeling and data visualization. R is broadly recognized as one of the most effective tools for advanced data visualization, largely due to the `ggplot2` package and its extensions. It is also known for reproducible analyses (R Markdown and the more modern Quarto). R is widely used in academia across many areas and in industry for statistical analyses.
Specialized Languages: Other languages like SQL, Scala, and Julia are used for more specific data-related tasks.
SQL is an essential and well-established tool for querying structured data in relational databases.
Scala, integrated with Apache Spark, is an effective tool for distributed data processing.
Julia offers speed and mathematical elegance for scientific computing and numerical modeling.
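As a small illustration of SQL's role in querying structured data, the sketch below runs a query against an in-memory SQLite database via Python's standard-library `sqlite3` module (the table and values are made up):

```python
import sqlite3

# Create a throwaway in-memory relational database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE actors (name TEXT, height REAL)")
con.executemany(
    "INSERT INTO actors VALUES (?, ?)",
    [("Chris Evans", 1.84), ("Scarlett Johansson", 1.60)],
)

# A declarative SQL query: which actors are taller than 1.8 m?
tall = con.execute("SELECT name FROM actors WHERE height > 1.8").fetchall()
print(tall)  # [('Chris Evans',)]
```

The query states *what* rows are wanted, and the database engine decides *how* to retrieve them — the defining feature of SQL as a declarative language.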
Other Data Analysis and Visualization Tools
These tools are particularly strong in creating effective visualizations.
- Tableau and Power BI: User-friendly applications known for creating effective, interactive visualizations and analysis dashboards. They can also perform simple data analysis tasks.