Collecting and Preparing Data
DSA 220 - Introduction to Data Science and Analytics
Learning Objectives
Define data collection and its role in data science
Describe different data collection methods commonly used in data science, such as surveys and experiments
Recognize scenarios where specific data collection methods are most appropriate
Describe the elements of survey design and identify the steps data scientists take to ensure the reliability of survey results
Describe methods for avoiding bias in survey questions
Describe various sampling techniques and the advantages of each
Identify relevant data privacy laws for different scenarios
Data Collection Methods
What is Data Collection?
Data collection is the systematic process of gathering and measuring information on specific phenomena or events. It utilizes statistical tools to capture key attributes and relevant contextual information. This process is important for making sound interpretations and gaining meaningful insights. The environment and geographic location where data is gathered are also important, as they can significantly influence conclusions and decision-making.
Prior to Data Collection
It is important for a data scientist to establish clear project objectives prior to data collection if possible. This involves:
Identifying the research question or problem
Defining the target population and sampling method
Designing the survey or experiment, including questions, response options, and overall structure
Collected data facilitates understanding patterns and trends, making predictions and recommendations, and identifying opportunities or areas for improvement.
Common Data Collection Methods
Depending on the research goals, various methods can be used such as experiments, surveys, observation, focus groups, interviews, and document analyses.
Surveys and experiments are two of the most common methods of data collection. Surveys can be conducted online, by phone, or in person, while experimental research typically requires a controlled environment to ensure the validity and reliability of the data.
Experimental Designs
Conducting a controlled experiment requires a well-designed plan that describes the research objectives, variables, and procedures.
Key Elements of an Experiment
Control Group: A baseline group that does not receive the experimental treatment, used for comparison.
Systematic Measurement: Data is obtained by consistently and accurately measuring specific properties or characteristics.
Ethical Guidelines: Generally, researchers have an obligation to be honest with participants and avoid deception unless necessary and justified to a reasonable degree.
Surveys
Surveys are a fundamental method for collecting data with the objective of understanding the characteristics, opinions, or behaviors of a target population.
Key Elements of a Survey
Sampling Method: The process used to select a subset of individuals from a population. A proper sampling strategy, such as random sampling, is crucial for ensuring the results are generalizable or representative of the target population.
Questionnaire Design: The formulation of questions to be clear, understandable, and unbiased.
Data Confidentiality: Ensuring that respondents’ personal information is protected and that their answers cannot be linked back to them unless they have given explicit consent.
Examples
A/B Testing for Website Conversion 🖱💻
Scenario: A digital marketing team wants to increase the number of users who sign up for their company’s newsletter from the website’s homepage. They hypothesize that a more prominent, green “Sign Up” button 🟢 will be more effective than the current, smaller blue button 🔵.
Methodology: An A/B test is commonly employed in this type of setting. In this design, participants are randomly assigned to different groups.
Group A (Control)🔵: Website visitors are shown the original webpage with the blue button.
Group B (Treatment)🟢: Website visitors are shown the modified webpage with the new green button.
Data Collection: The system collects data by tracking the click-through rate (CTR), the percentage of all visitors who click the button, for each group over a one-week period.
Analysis: After the test period, the CTRs for the groups are compared using appropriate statistical methods such as logistic regression.
| time_stamp | group | clicked_through |
|---|---|---|
| 2025-05-02 08:41:20 | A | 1 |
| 2025-05-01 00:00:03 | B | 0 |
| 2025-05-01 00:26:54 | A | 0 |
| 2025-05-01 00:53:19 | A | 0 |
| 2025-05-01 01:19:59 | B | 0 |
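
As a rough illustration of the analysis step, the sketch below computes the CTR for each group and compares the two rates with a two-proportion z-test. This is a simpler stand-in for the logistic regression mentioned above, not the method the scenario prescribes. The click counts are made up for illustration and are not results from the scenario, and the function names are hypothetical.

```python
import math

def click_through_rate(records, group):
    """Share of visitors in `group` whose clicked_through value is 1."""
    clicks = [clicked for g, clicked in records if g == group]
    return sum(clicks) / len(clicks)

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one CTR."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical (group, clicked_through) pairs, NOT real results:
# 40 of 1,000 visitors click in group A, 65 of 1,000 in group B
records = ([("A", 1)] * 40 + [("A", 0)] * 960
           + [("B", 1)] * 65 + [("B", 0)] * 935)

ctr_a = click_through_rate(records, "A")
ctr_b = click_through_rate(records, "B")
z, p_value = two_proportion_z_test(40, 1000, 65, 1000)
print(f"CTR A = {ctr_a:.3f}, CTR B = {ctr_b:.3f}, z = {z:.2f}, p = {p_value:.4f}")
```

A logistic regression of clicked_through on group would address the same question and would also allow additional explanatory variables, such as time of day.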
What type of data collection method is employed in this scenario?
What is the control group in this scenario, if there is one?
What is the main outcome or attribute of interest in this scenario?
Being as specific as possible, what is the type of each variable in Table 1?
Which data format would be best for storing all data available from this scenario (e.g., XML, CSV, or JSON)?
Clinical Trial for a New Parkinson’s Drug 🔬💊
Scenario: A pharmaceutical company has developed a new drug, “Neuroquil”, designed to slow the progression of motor symptoms in patients in the United States with Parkinson’s disease. They implement a clinical trial to assess the drug’s efficacy and safety.
Methodology: Researchers enroll a large group of participants with Parkinson’s disease and randomly assign them to receive either a monthly dose of Neuroquil or the treatment constituting the current best standard of care (SOC). The SOC is administered in a formulation that is indistinguishable from Neuroquil to ensure both participants and investigators remain blinded and unbiased.
Data Collection: Clinical data are collected from all participants at regular intervals (e.g., at baseline and 6 months), including neurological assessment scores using a standardized scale like the Unified Parkinson’s Disease Rating Scale (UPDRS) to measure motor function, patient-reported logs of symptoms and quality of life, and records of any side effects or adverse events. Higher UPDRS scores indicate greater impairment.
Analysis: At the end of the trial, researchers compare the average improvement in UPDRS scores between the Neuroquil group and the SOC group using appropriate statistical methods such as linear regression. A holistic approach to evaluating the new drug is used, collectively considering all data on efficacy and safety from the trial.
| participant_id | treatment_group | age | updrs_baseline | updrs_6_months | updrs_improvement |
|---|---|---|---|---|---|
| 1 | Neuroquil | 58 | 21 | 15 | 6 |
| 2 | Neuroquil | 58 | 26 | 25 | 1 |
| 3 | Neuroquil | 58 | 10 | 5 | 5 |
| 4 | Neuroquil | 58 | 22 | 21 | 1 |
| 5 | SOC | 57 | 13 | 16 | -3 |
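
The updrs_improvement column can be derived from the two score columns. The sketch below recomputes it for the five rows shown above and compares the group averages. It is only an arithmetic illustration: five rows are far too few for any conclusion, and the variable names are hypothetical.

```python
from statistics import mean

# Rows copied from the table above:
# (participant_id, treatment_group, updrs_baseline, updrs_6_months)
rows = [
    (1, "Neuroquil", 21, 15),
    (2, "Neuroquil", 26, 25),
    (3, "Neuroquil", 10, 5),
    (4, "Neuroquil", 22, 21),
    (5, "SOC", 13, 16),
]

# Higher UPDRS means greater impairment, so improvement = baseline - follow-up
improvement = {}
for _, group, baseline, month_6 in rows:
    improvement.setdefault(group, []).append(baseline - month_6)

mean_improvement = {group: mean(values) for group, values in improvement.items()}
difference = mean_improvement["Neuroquil"] - mean_improvement["SOC"]
print(mean_improvement, difference)
```

A linear regression, as mentioned in the analysis step, estimates the same kind of group difference while also allowing adjustment for variables such as age or baseline score.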
What type of data collection method is employed in this scenario?
What is the control group in this scenario, if there is one?
What is the main outcome or attribute of interest in this scenario?
Being as specific as possible, what is the type of each variable in Table 2?
Which data format would be best for storing all data available from this scenario (e.g., XML, CSV, or JSON)?
Desirable Features for EV Product Design 🔋📊
Scenario: An automaker wants to explore a new electric vehicle’s (EV) design, balancing desirable features against a competitive price. One aim is to understand which features customers in the United States value most, given that there are often tradeoffs, such as between battery range and cost. This type of analysis is called a conjoint analysis, “…a form of statistical analysis that firms use in market research to understand how customers value different components or features of their products or services” 1.
Methodology: To emulate real-world customer decisions, participants are randomly selected from all key segments of the target population, ensuring the final sample is representative of the target population with regard to important characteristics. These individuals are then presented with a series of choices between different, fully formed vehicles, each with a unique combination of attributes.
| Vehicle | Type | Range | Charging | Price |
|---|---|---|---|---|
| Vehicle A | SUV | 300 miles | Standard Charging | $42,000 |
| Vehicle B | Mid-size car | 400 miles | Standard Charging | $45,000 |
| Vehicle C | SUV | 300 miles | Ultra-Fast Charging | $46,000 |
Data Collection: This preference data is collected from participants via an online questionnaire.
Analysis: The part-worth utility for each attribute, a numerical score representing its value to the customer, is calculated. These scores reveal the relative importance of features (e.g., price vs. range) and are used to assess market viability and future development directions of a new vehicle. Logistic regression would be one possible method for calculating part-worth utility.
| participant_id | vehicle | participant_preference | type | range | charging | price |
|---|---|---|---|---|---|---|
| 1 | Vehicle A | 0 | SUV | 300 miles | Standard Charging | $42,000 |
| 1 | Vehicle B | 1 | Mid-size car | 400 miles | Standard Charging | $45,000 |
| 1 | Vehicle C | 0 | SUV | 300 miles | Ultra-Fast Charging | $46,000 |
| 2 | Vehicle A | 0 | SUV | 300 miles | Standard Charging | $42,000 |
| 2 | Vehicle B | 0 | Mid-size car | 400 miles | Standard Charging | $45,000 |
| 2 | Vehicle C | 1 | SUV | 300 miles | Ultra-Fast Charging | $46,000 |
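
Before a model such as logistic regression can be fit, text columns like range and price usually need to be converted to numbers. The sketch below shows one possible preparation step for the rows above. The helper functions and the output column names (is_suv, range_miles, ultra_fast_charging, price_usd) are hypothetical choices made for illustration.

```python
def parse_miles(text):
    """Convert a string such as '300 miles' to the integer 300."""
    return int(text.split()[0])

def parse_price(text):
    """Convert a string such as '$42,000' to the integer 42000."""
    return int(text.replace("$", "").replace(",", ""))

# Rows copied from the table above:
# (participant_id, vehicle, participant_preference, type, range, charging, price)
raw_rows = [
    (1, "Vehicle A", 0, "SUV", "300 miles", "Standard Charging", "$42,000"),
    (1, "Vehicle B", 1, "Mid-size car", "400 miles", "Standard Charging", "$45,000"),
    (1, "Vehicle C", 0, "SUV", "300 miles", "Ultra-Fast Charging", "$46,000"),
    (2, "Vehicle A", 0, "SUV", "300 miles", "Standard Charging", "$42,000"),
    (2, "Vehicle B", 0, "Mid-size car", "400 miles", "Standard Charging", "$45,000"),
    (2, "Vehicle C", 1, "SUV", "300 miles", "Ultra-Fast Charging", "$46,000"),
]

# Numeric and 0/1 indicator columns that a regression model can use directly
prepared = [
    {
        "participant_id": pid,
        "vehicle": vehicle,
        "chosen": preference,
        "is_suv": int(vehicle_type == "SUV"),
        "range_miles": parse_miles(range_text),
        "ultra_fast_charging": int(charging == "Ultra-Fast Charging"),
        "price_usd": parse_price(price_text),
    }
    for pid, vehicle, preference, vehicle_type, range_text, charging, price_text in raw_rows
]

for row in prepared:
    print(row)
```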
What type of data collection method is employed in this scenario?
Note: While the study has experimental elements, its primary goal is to measure and understand existing preferences and values, not to test the effect of an intervention on a behavioral outcome.
What is the control group in this scenario, if there is one?
What is the main outcome or attribute of interest in this scenario?
Being as specific as possible, what is the type of each variable in Table 4?
Which data format would be best for storing all data available from this scenario (e.g., XML, CSV, or JSON)?
What if participants were shown combinations of different features, e.g., one participant is asked about price and range while another participant is asked about heated seats and price?
Survey Design and Implementation
Proper survey design is important to obtain accurate and reliable data. The design process begins with clearly defined research objectives, including a well-defined target population, to facilitate development of relevant and effective questions.
Survey Structure: Surveys should start with simple questions and move to more complex ones gradually. Typically, this reduces the chances of non-response to questions and can improve the accuracy of responses.
Survey Questions
Closed-Ended Questions: Provide predetermined answers, making them easier to quantify and analyze.
Open-Ended Questions: Allow respondents to provide detailed, in-depth answers, offering qualitative insights.
Avoiding Bias: Questions should be neutral to avoid influencing responses.
Avoiding Double-Barreled Questions: A double-barreled question asks about two separate issues in a single question, making it difficult to ascertain which part the participant is responding to.
Why should surveys typically start with closed-ended as opposed to open-ended questions?
Example
Data science is the engine behind recommendation algorithms on platforms like Netflix, Instagram, and YouTube. Consider the survey questions below for Netflix. 📺🍿
Biased Question: “Do you agree that our recommendation algorithm consistently helps you discover new shows you love?”
Why is this question biased?
Neutral Question: “How do you typically discover new shows to watch on our platform? Please select all that apply.”
(Options might include: From the ‘Recommended for You’ section, Searching for specific titles, Browse by genre, Suggestions from friends, etc.)
Why is the neutral question better?
What could be improved about the following survey question from Netflix?
How would you rate the quality of Netflix’s original content and the speed of its streaming service?
Sampling Techniques
Sampling is the process of selecting a subset of a population to make inferences about the population. The most appropriate sampling technique depends on the research objectives, the target population, and the nature of the problem. All sampling images are from https://www.qualtrics.com/.
Potential Errors in Surveys
All sampling techniques are susceptible to the errors and biases below, but to varying degrees.
Sampling Error: The inherent difference between the results from a sample and the actual values of the entire population. It is caused by chance, but larger sample sizes can reduce the expected sampling error.
Sampling Bias: Occurs when the selected sample does not accurately represent the target population, leading to skewed conclusions. This can happen if certain groups are over- or underrepresented. Larger sample sizes do not always alleviate issues with sampling bias.
Measurement Error: Inaccuracies in data that arise during collection, recording, or analysis. These can be random or systematic.
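The difference between sampling error and sampling bias can be seen in a small simulation. The sketch below assumes a made-up population of 100,000 values and uses hypothetical names throughout. It measures how much the sample mean varies across repeated random samples of two sizes, then draws a large sample from a frame that excludes half of the population.

```python
import random
from statistics import mean, stdev

random.seed(220)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 values centered near 50
population = [random.gauss(50, 10) for _ in range(100_000)]
population_mean = mean(population)

def spread_of_sample_means(sample_size, repetitions=500):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [mean(random.sample(population, sample_size))
             for _ in range(repetitions)]
    return stdev(means)

# Sampling error: larger random samples give less variable estimates
spread_small = spread_of_sample_means(25)
spread_large = spread_of_sample_means(400)

# Sampling bias: a frame that only contains values above 50 stays off target
# no matter how many individuals are drawn from it
biased_frame = [x for x in population if x > 50]
biased_mean = mean(random.sample(biased_frame, 5000))

print(f"Spread of sample means: n=25 -> {spread_small:.2f}, n=400 -> {spread_large:.2f}")
print(f"Population mean: {population_mean:.2f}, biased sample mean: {biased_mean:.2f}")
```

In this setup, the spread of the sample means shrinks as the sample size grows, while the biased sample overestimates the population mean even with 5,000 observations.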
There are two main types of sampling: probability sampling and non-probability sampling techniques 2.
Probability Techniques
Probability sampling techniques use randomness to ensure that every individual in the population has a non-zero chance of being selected. Probability samples are typically more costly to obtain, but they tend to be more representative of the target population than samples from non-probability techniques.
Some of the most common probability sampling techniques are below.
| Technique | Illustration |
|---|---|
| Simple Random Sampling: Every individual of the target population has an equal chance of being selected. | |
| Stratified Sampling: The population is divided into subgroups (strata) based on shared characteristics, and a random sample is taken from each stratum. This ensures that distributions of key demographics in the sample are reflective of the target population. | |
| Cluster Sampling: The population is divided into clusters (like cities or schools), and a random sample of entire clusters is selected for the study. | |
| Systematic Sampling: A random starting point is chosen, and then every kth member of the population is selected. Despite being a probability sampling technique, it can be prone to sampling bias if the list has a periodic pattern. For example, if a list of houses is ordered by street and every 10th house is a corner lot, selecting every 10th house would result in a sample of only corner-lot houses. | |
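
The four techniques in the table can be sketched with Python's standard `random` module. The population below, its `region` field, and all function names are hypothetical and chosen only to keep the example small.

```python
import random

random.seed(220)  # fixed seed so the sketch is reproducible

# Hypothetical population of 1,000 people; region repeats in a fixed cycle
regions = ["North", "South", "East", "West"]
population = [{"id": i, "region": regions[i % 4]} for i in range(1000)]

def simple_random_sample(pop, n):
    """Every individual has the same chance of being selected."""
    return random.sample(pop, n)

def stratified_sample(pop, key, fraction):
    """Sample the same fraction at random within every stratum."""
    strata = {}
    for person in pop:
        strata.setdefault(person[key], []).append(person)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, round(len(members) * fraction)))
    return sample

def cluster_sample(pop, key, n_clusters):
    """Pick whole clusters at random and keep everyone in them."""
    clusters = sorted({person[key] for person in pop})
    chosen = set(random.sample(clusters, n_clusters))
    return [person for person in pop if person[key] in chosen]

def systematic_sample(pop, k):
    """Random start within the first k positions, then every k-th member."""
    start = random.randrange(k)
    return pop[start::k]

srs = simple_random_sample(population, 100)
stratified = stratified_sample(population, "region", 0.1)
clustered = cluster_sample(population, "region", 2)
systematic = systematic_sample(population, 10)

print(len(srs), len(stratified), len(clustered), len(systematic))
print(sorted({person["region"] for person in systematic}))
```

Because this list cycles through the four regions in order, taking every 10th person only ever reaches two of the four regions, which is a small instance of the periodic-pattern problem described for systematic sampling.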
Non-Probability Techniques
Non-probability sampling techniques do not rely on random selection, which typically means that some individuals in the population have an unknown or zero chance of being selected. These techniques are more prone to sampling bias. However, non-probability samples are typically less costly to obtain and can yield larger sample sizes than probability-based techniques.
Some of the most common non-probability sampling techniques are below.
| Technique | Illustration |
|---|---|
| Convenience Sampling: Participants are selected based on their availability and ease of access. | |
| Snowball Sampling: Initial participants refer other potential participants, which is useful for reaching hard-to-access populations. | |
| Quota Sampling: Researchers select participants to meet predetermined quotas for certain demographic characteristics, but do not use random sampling. Typically this will yield less sampling bias than convenience samples, but is still prone to sampling bias since participants are not randomly selected. | |
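
Quota sampling can be sketched as accepting volunteers in the order they respond until each quota is filled. The volunteer list, the age groups, and the function name below are hypothetical.

```python
def quota_sample(arrivals, key, quotas):
    """Accept people in arrival order until each group's quota is full."""
    counts = {group: 0 for group in quotas}
    sample = []
    for person in arrivals:
        group = person[key]
        if group in counts and counts[group] < quotas[group]:
            sample.append(person)
            counts[group] += 1
        if counts == quotas:
            break  # every quota is filled, so later volunteers are ignored
    return sample

# Hypothetical stream of volunteers, in the order they responded
arrivals = [{"id": i, "age_group": "18-34" if i % 3 else "35+"}
            for i in range(60)]

sample = quota_sample(arrivals, "age_group", {"18-34": 10, "35+": 10})
print(len(sample), max(person["id"] for person in sample))
```

The sample matches the quotas exactly, but it consists only of the earliest respondents. No random selection is involved, which is why sampling bias remains possible.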
Which sampling technique was used in the A/B testing example in Section 2.4.1?
Which sampling technique was used in the clinical trial example in Section 2.4.2?
Which sampling technique was used in the EV example in Section 2.4.3?
What are the pros and cons of the sampling technique likely employed for each scenario?
Data Privacy Laws
There are several landmark data privacy laws that provide a framework for protecting and properly handling personal information. Note that none of the content below constitutes legal advice.
California Consumer Privacy Act
The California Consumer Privacy Act (CCPA) is a state law that gives California residents more control over their personal information by granting them the rights to know about, delete, and opt out of the sale of their data held by businesses.3
General Data Protection Regulation
The General Data Protection Regulation (GDPR) is a European Union law designed to give individuals control over their personal data by setting unified rules for how organizations collect, process, and protect that data.4
Health Insurance Portability and Accountability Act
For data scientists working with healthcare data in the United States, the Health Insurance Portability and Accountability Act (HIPAA) is an important regulation regarding handling of protected health information.5 A primary goal of HIPAA is to protect patient health information from being disclosed without the patient’s consent or knowledge.
Family Educational Rights and Privacy Act
The Family Educational Rights and Privacy Act (FERPA) is a federal law that affords parents the right to have access to their children’s education records, to seek to have the records amended, and to have some control over the disclosure of these records.6 Often, when a student turns 18 or enrolls in a postsecondary institution like a university, these rights transfer from the parents to the student.
Best Practices
Generally, there are several best practices when handling personal information that should be followed. These include but are not limited to:
Obtaining explicit, informed consent before collecting any personal data.
Providing clear, accessible privacy notices detailing what data is collected, how it’s used, and how long it’s retained.
Implementing processes for data access, correction, and deletion requests.
Conducting periodic privacy impact assessments to identify and mitigate emerging risks.
Which landmark data privacy law(s) would apply to the A/B testing example in Section 2.4.1?
Which landmark data privacy law(s) would apply to the clinical trial example in Section 2.4.2?
Which landmark data privacy law(s) would apply to the EV example in Section 2.4.3?
Footnotes
1. Stobierski, T. (2020, December 18). What Is Conjoint Analysis? A Guide for Marketers. Harvard Business School Online. Retrieved August 5, 2025, from https://online.hbs.edu/blog/post/what-is-conjoint-analysis
2. Webster, W. (2025). Sampling methods, types & techniques. Qualtrics. Retrieved August 7, 2025, from https://www.qualtrics.com/experience-management/research/sampling-methods/
3. State of California Department of Justice. (n.d.). California Consumer Privacy Act (CCPA). Retrieved August 7, 2025, from https://oag.ca.gov/privacy/ccpa
4. Wolford, B. (n.d.). What is GDPR, the EU’s new data protection law? GDPR.eu. Retrieved August 5, 2025, from https://gdpr.eu/what-is-gdpr/
5. U.S. Department of Health & Human Services. (2024, July 19). HIPAA for Professionals. Retrieved August 5, 2025, from https://www.hhs.gov/hipaa/for-professionals/index.html
6. U.S. Department of Education. (n.d.). What is FERPA? Student Privacy Policy Office. Retrieved August 5, 2025, from https://studentprivacy.ed.gov/faq/what-ferpa