Collecting and Preparing Data

DSA 220 - Introduction to Data Science and Analytics

Author

Andrew DiLernia

Learning Objectives

  • Define data collection and its role in data science

  • Describe different data collection methods commonly used in data science, such as surveys and experiments

  • Recognize scenarios where specific data collection methods are most appropriate

  • Describe the elements of survey design and identify the steps data scientists take to ensure the reliability of survey results

  • Describe methods for avoiding bias in survey questions

  • Describe various sampling techniques and the advantages of each

  • Identify relevant data privacy laws for different scenarios

Data Collection Methods

What is Data Collection?

Data collection is the systematic process of gathering and measuring information on specific phenomena or events. It utilizes statistical tools to capture key attributes and relevant contextual information. This process is important for making sound interpretations and gaining meaningful insights. The environment and geographic location where data is gathered are also important, as they can significantly influence conclusions and decision-making.

Prior to Data Collection

It is important for a data scientist to establish clear project objectives prior to data collection if possible. This involves:

  • Identifying the research question or problem

  • Defining the target population and sampling method

  • Designing the survey or experiment, including questions, response options, and overall structure

Collected data facilitates understanding patterns and trends, making predictions and recommendations, and identifying opportunities or areas for improvement.

Common Data Collection Methods

Depending on the research goals, various methods can be used, such as experiments, surveys, observation, focus groups, interviews, and document analysis.

Surveys and experiments are two of the most common methods of data collection. Surveys can be conducted online, by phone, or in person, while experimental research typically requires a controlled environment to ensure the validity and reliability of the data.

Experimental Designs

Conducting a controlled experiment requires a well-designed plan that describes the research objectives, variables, and procedures.

Key Elements of an Experiment

  • Control Group: A baseline group that does not receive the experimental treatment, used for comparison.

  • Systematic Measurement: Data is obtained by consistently and accurately measuring specific properties or characteristics.

  • Ethical Guidelines: Generally, researchers have an obligation to be honest with participants and avoid deception unless necessary and justified to a reasonable degree.

Surveys

Surveys are a fundamental method for collecting data with the objective of understanding the characteristics, opinions, or behaviors of a target population.

Key Elements of a Survey

  • Sampling Method: The process used to select a subset of individuals from a population. A proper sampling strategy, such as random sampling, is crucial for ensuring the results are generalizable or representative of the target population.

  • Questionnaire Design: The formulation of questions to be clear, understandable, and unbiased.

  • Data Confidentiality: Ensuring that respondents’ personal information is protected and that their answers cannot be linked back to them unless they have given explicit consent.

Examples

A/B Testing for Website Conversion 🖱💻

  • Scenario: A digital marketing team wants to increase the number of users who sign up for their company’s newsletter from the website’s homepage. They hypothesize that a more prominent, green “Sign Up” button 🟢 will be more effective than the current, smaller blue button 🔵.

  • Methodology: In this type of setting, an A/B test is commonly employed, in which participants are randomly assigned to different groups.

    • Group A (Control)🔵: Website visitors are shown the original webpage with the blue button.

    • Group B (Treatment)🟢: Website visitors are shown the modified webpage with the new green button.

  • Data Collection: The system collects data by tracking the click-through rate (CTR), the percentage of all visitors who click the button, for each group over a one-week period.

  • Analysis: After the test period, the CTRs for the groups are compared using appropriate statistical methods such as logistic regression.

Table 1: Example user-level data from A/B test
time_stamp            group   clicked_through
2025-05-02 08:41:20   A       1
2025-05-01 00:00:03   B       0
2025-05-01 00:26:54   A       0
2025-05-01 00:53:19   A       0
2025-05-01 01:19:59   B       0
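Assuming user-level data like Table 1, the group comparison described in the analysis step can be sketched with a two-proportion z-test; for a single binary factor, logistic regression (as mentioned above) would lead to the same conclusion. The click and visitor counts below are hypothetical.

```python
import math

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Compare two click-through rates with a two-proportion z-test.

    Returns (z statistic, two-sided p-value) via the normal approximation.
    """
    p_b = clicks_b / n_b
    p_a = clicks_a / n_a
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)      # pooled CTR under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # standard normal CDF: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical one-week totals: 120/2400 clicks (Group A) vs 168/2400 (Group B)
z, p = two_proportion_ztest(120, 2400, 168, 2400)
```

A small p-value here would suggest the difference in CTR between the buttons is unlikely to be due to chance alone.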

Tip
  1. What type of data collection method is employed in this scenario?

  2. What is the control group in this scenario, if there is one?

  3. What is the main outcome or attribute of interest in this scenario?

  4. Being as specific as possible, what is the type of each variable in Table 1?

  5. Which data format would be best for storing all data available from this scenario (e.g., XML, CSV, or JSON)?

Clinical Trial for a New Parkinson’s Drug 🔬💊

  • Scenario: A pharmaceutical company has developed a new drug, “Neuroquil”, designed to slow the progression of motor symptoms in patients in the United States with Parkinson’s disease. They implement a clinical trial to assess the drug’s efficacy and safety.

  • Methodology: Researchers enroll a large group of participants with Parkinson’s disease and randomly assign them to receive either a monthly dose of Neuroquil or the treatment constituting the current best standard of care (SOC). The SOC is administered in a formulation indistinguishable from Neuroquil so that both participants and investigators remain blinded and unbiased.

  • Data Collection: Clinical data are collected from all participants at regular intervals (e.g., at baseline and 6 months), including neurological assessment scores using a standardized scale like the Unified Parkinson’s Disease Rating Scale (UPDRS) to measure motor function, patient-reported logs of symptoms and quality of life, and records of any side effects or adverse events. Higher UPDRS scores indicate greater impairment.

  • Analysis: At the end of the trial, researchers compare the average improvement in UPDRS scores between the Neuroquil group and the SOC group using appropriate statistical methods such as linear regression. A holistic approach to evaluating the new drug is used, collectively considering all data on efficacy and safety from the trial.

Table 2: Example data from clinical trial
participant_id  treatment_group  age  updrs_baseline  updrs_6_months  updrs_improvement
1               Neuroquil        58   21              15              6
2               Neuroquil        58   26              25              1
3               Neuroquil        58   10              5               5
4               Neuroquil        58   22              21              1
5               SOC              57   13              16              -3
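The comparison in the analysis step can be sketched as a two-sample t-test on updrs_improvement, which for a single binary treatment indicator is closely related to the linear regression mentioned above. The improvement values below are hypothetical, not trial results.

```python
import math
import statistics

# Hypothetical updrs_improvement values for each treatment group
neuroquil = [6, 1, 5, 1, 4, 7]
soc = [-3, 2, 0, 1, -1, 2]

def welch_t(x, y):
    """Welch's two-sample t statistic (no equal-variance assumption)."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    var_x, var_y = statistics.variance(x), statistics.variance(y)
    se = math.sqrt(var_x / len(x) + var_y / len(y))
    return (mean_x - mean_y) / se

t_stat = welch_t(neuroquil, soc)   # positive values favor Neuroquil
```

A large positive t statistic would indicate greater average improvement under Neuroquil than under the SOC, though a real trial would also weigh safety data, as the analysis step notes.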

Tip
  1. What type of data collection method is employed in this scenario?

  2. What is the control group in this scenario, if there is one?

  3. What is the main outcome or attribute of interest in this scenario?

  4. Being as specific as possible, what is the type of each variable in Table 2?

  5. Which data format would be best for storing all data available from this scenario (e.g., XML, CSV, or JSON)?

Desirable Features for EV Product Design 🔋📊

  • Scenario: An automaker wants to explore a new electric vehicle’s (EV) design, balancing desirable features against a competitive price. One aim is understanding which features customers in the United States value most, given that there are often tradeoffs, such as between battery range and cost. This type of analysis is called a conjoint analysis, “…a form of statistical analysis that firms use in market research to understand how customers value different components or features of their products or services”.1
  • Methodology: To emulate real-world customer decisions, participants were randomly selected from all key segments of the target population, ensuring the final sample was representative of the target population with regard to important characteristics. These individuals were then presented with a series of choices between different, fully formed vehicles, each with a unique combination of attributes.
Table 3: Example choices presented to a participant
Vehicle    Type          Range      Charging             Price
Vehicle A  SUV           300 miles  Standard Charging    $42,000
Vehicle B  Mid-size car  400 miles  Standard Charging    $45,000
Vehicle C  SUV           300 miles  Ultra-Fast Charging  $46,000
  • Data Collection: This preference data is collected from participants via an online questionnaire.

  • Analysis: The part-worth utility for each attribute, a numerical score representing its value to the customer, is calculated. These scores reveal the relative importance of features (e.g., price vs. range) and are used to assess market viability and future development directions of a new vehicle. Logistic regression would be one possible method for calculating part-worth utility.

Table 4: Example data for conjoint analysis
participant_id  vehicle    participant_preference  type          range      charging             price
1               Vehicle A  0                       SUV           300 miles  Standard Charging    $42,000
1               Vehicle B  1                       Mid-size car  400 miles  Standard Charging    $45,000
1               Vehicle C  0                       SUV           300 miles  Ultra-Fast Charging  $46,000
2               Vehicle A  0                       SUV           300 miles  Standard Charging    $42,000
2               Vehicle B  0                       Mid-size car  400 miles  Standard Charging    $45,000
2               Vehicle C  1                       SUV           300 miles  Ultra-Fast Charging  $46,000
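A minimal sketch of the part-worth idea, assuming the hypothetical dummy coding and tiny dataset below: logistic regression fit by plain gradient descent, with a small L2 penalty to keep estimates finite on so little data. In practice a library such as scikit-learn or statsmodels, and many more respondents, would be used.

```python
import math

# One row per vehicle shown, coded as [midsize, range_400, ultra_fast, price_in_$10k]
X = [
    [0, 0, 0, 4.2], [1, 1, 0, 4.5], [0, 0, 1, 4.6],  # participant 1
    [0, 0, 0, 4.2], [1, 1, 0, 4.5], [0, 0, 1, 4.6],  # participant 2
]
y = [0, 1, 0, 0, 0, 1]  # participant_preference, as in Table 4

w = [0.0] * 4
for _ in range(5000):
    grads = [0.0] * 4
    for xi, yi in zip(X, y):
        # predicted probability of choosing this vehicle
        p = 1 / (1 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
        for j in range(4):
            grads[j] += (p - yi) * xi[j]
    # gradient step with a small L2 (ridge) penalty for stability
    w = [wj - 0.01 * (g + 0.05 * wj) for wj, g in zip(w, grads)]

# w now holds rough part-worth utilities; for instance, a negative price
# coefficient means higher prices reduce the chance a vehicle is chosen
```

With realistic sample sizes, comparing the magnitudes of these coefficients is what reveals the relative importance of features such as price versus range.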
Tip
  1. What type of data collection method is employed in this scenario?

Note: While the study has experimental elements, its primary goal is to measure and understand existing preferences and values, not to test the effect of an intervention on a behavioral outcome.

  2. What is the control group in this scenario, if there is one?

  3. What is the main outcome or attribute of interest in this scenario?

  4. Being as specific as possible, what is the type of each variable in Table 4?

  5. Which data format would be best for storing all data available from this scenario (e.g., XML, CSV, or JSON)?

What if participants were shown combinations of different features, e.g., one participant is asked about price and range while another participant is asked about heated seats and price?

Survey Design and Implementation

Proper survey design is important to obtain accurate and reliable data. The design process begins with clearly defined research objectives, including a well-defined target population, to facilitate development of relevant and effective questions.

  • Survey Structure: Surveys should start with simple questions and move to more complex ones gradually. Typically, this reduces the chances of non-response to questions and can improve the accuracy of responses.

Survey Questions

  • Closed-Ended Questions: Provide predetermined answers, making them easier to quantify and analyze.

  • Open-Ended Questions: Allow respondents to provide detailed, in-depth answers, offering qualitative insights.

  • Avoiding Bias: Questions should be neutral to avoid influencing responses.

  • Avoiding Double-Barreled Questions: A double-barreled question asks about two separate issues in a single question, making it difficult to ascertain which part the participant is responding to.

Tip

Why should surveys typically start with closed-ended as opposed to open-ended questions?

Example

Data science is the engine behind recommendation algorithms on platforms like Netflix, Instagram, and YouTube. Consider the survey questions below for Netflix. 📺🍿

Biased Question: “Do you agree that our recommendation algorithm consistently helps you discover new shows you love?”

Tip

Why is this question biased?

Neutral Question: “How do you typically discover new shows to watch on our platform? Please select all that apply.”

(Options might include: From the ‘Recommended for You’ section, Searching for specific titles, Browsing by genre, Suggestions from friends, etc.)

Tip

Why is the neutral question better?

Tip

What could be improved about the following survey question from Netflix?

How would you rate the quality of Netflix’s original content and the speed of its streaming service?

Sampling Techniques

Sampling is the process of selecting a subset of a population to make inferences about the population. The most appropriate sampling technique depends on the research objectives, the target population, and the nature of the problem. All sampling images are from https://www.qualtrics.com/.

Potential Errors in Surveys

All sampling techniques are susceptible to the errors and biases below, but to varying degrees.

  • Sampling Error: The inherent difference between the results from a sample and the actual values of the entire population. It is caused by chance, but larger sample sizes can reduce the expected sampling error.

  • Sampling Bias: Occurs when the selected sample does not accurately represent the target population, leading to skewed conclusions. This can happen if certain groups are over- or underrepresented. Larger sample sizes do not always alleviate issues with sampling bias.

  • Measurement Error: Inaccuracies in data that arise during collection, recording, or analysis. These can be random or systematic.
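The effect of sample size on sampling error can be seen in a quick simulation. Assuming a hypothetical population in which 40% of individuals hold some attribute, the spread of repeated sample estimates around the truth shrinks as the sample size grows.

```python
import random
import statistics

random.seed(1)

def sample_proportion(n, p=0.4):
    """Proportion of 'successes' observed in a random sample of size n."""
    return sum(random.random() < p for _ in range(n)) / n

# Standard deviation of 200 repeated sample estimates, for each sample size
spreads = {n: statistics.stdev(sample_proportion(n) for _ in range(200))
           for n in (50, 500, 5000)}
# Larger samples -> smaller spread, i.e. smaller expected sampling error
```

Note that this only addresses sampling error; as stated above, a larger sample drawn in a biased way does not fix sampling bias.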

There are two main types of sampling techniques: probability sampling and non-probability sampling.2

Probability Techniques

Probability sampling techniques use randomness to ensure that every individual in the population has a non-zero chance of being selected. Probability samples are typically more costly to obtain, but they tend to be more representative of the target population than samples from non-probability techniques.

Some of the most common probability sampling techniques are below.

  • Simple Random Sampling: Every individual of the target population has an equal chance of being selected. [Illustration: simple random sampling]

  • Stratified Sampling: The population is divided into subgroups (strata) based on shared characteristics, and a random sample is taken from each stratum. This ensures that distributions of key demographics in the sample are reflective of the target population. [Illustration: stratified sampling]

  • Cluster Sampling: The population is divided into clusters (like cities or schools), and a random sample of entire clusters is selected for the study. [Illustration: cluster sampling]

  • Systematic Sampling: A random starting point is chosen, and then every kth member of the population is selected. Despite being a probability sampling technique, it can be prone to sampling bias if the list has a periodic pattern. For example, if a list of houses is ordered by street and every 10th house is a corner lot, selecting every 10th house would result in a sample of only corner-lot houses. [Illustration: systematic sampling]
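The first two techniques can be sketched with the standard library alone. The population and its "urban"/"rural" strata below are made up for illustration.

```python
import random

random.seed(0)
# Hypothetical population: 200 urban and 100 rural individuals
population = [{"id": i, "stratum": "urban" if i % 3 else "rural"}
              for i in range(300)]

# Simple random sampling: every individual is equally likely to be chosen
srs = random.sample(population, 30)

# Stratified sampling: group by stratum, then sample within each group
strata = {}
for person in population:
    strata.setdefault(person["stratum"], []).append(person)

stratified = []
for members in strata.values():
    k = round(30 * len(members) / len(population))  # proportional allocation
    stratified.extend(random.sample(members, k))
# stratified now contains exactly 20 urban and 10 rural individuals
```

The simple random sample only matches the population's urban/rural split on average, while the stratified sample matches it by construction, which is exactly the guarantee described above.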

Non-Probability Techniques

Non-probability sampling techniques do not rely on random selection, meaning that not every individual in the population has a known chance of being selected. Non-probability sampling techniques are more prone to sampling bias. However, they are typically less costly to implement and can yield larger sample sizes than probability-based techniques.

Some of the most common non-probability sampling techniques are below.

  • Convenience Sampling: Participants are selected based on their availability and ease of access. [Illustration: convenience sampling]

  • Snowball Sampling: Initial participants refer other potential participants, which is useful for reaching hard-to-access populations. [Illustration: snowball sampling]

  • Quota Sampling: Researchers select participants to meet predetermined quotas for certain demographic characteristics, but do not use random sampling. Typically this yields less sampling bias than convenience sampling, but it is still prone to sampling bias since participants are not randomly selected. [Illustration: quota sampling]
Tip
  1. Which sampling technique was used in the A/B testing example in Section 2.4.1?

  2. Which sampling technique was used in the clinical trial example in Section 2.4.2?

  3. Which sampling technique was used in the EV example in Section 2.4.3?

Tip

What are the pros and cons of the sampling technique likely employed for each scenario?

Data Privacy Laws

There are several landmark data privacy laws that provide a framework for protecting and properly handling personal information. Note that none of the content below constitutes legal advice.

California Consumer Privacy Act

The California Consumer Privacy Act (CCPA) is a state law that gives California residents more control over their personal information by granting them the rights to know, delete, and opt-out of the sale of their data held by businesses.3

General Data Protection Regulation

The General Data Protection Regulation (GDPR) is a European Union law designed to give individuals control over their personal data by setting unified rules for how organizations collect, process, and protect that data.4

Health Insurance Portability and Accountability Act

For data scientists working with healthcare data in the United States, the Health Insurance Portability and Accountability Act (HIPAA) is an important regulation regarding handling of protected health information.5 A primary goal of HIPAA is to protect patient health information from being disclosed without the patient’s consent or knowledge.

Family Educational Rights and Privacy Act

The Family Educational Rights and Privacy Act (FERPA) is a federal law that affords parents the right to have access to their children’s education records, to seek to have the records amended, and to have some control over the disclosure of these records.6 Often, when a student turns 18 or enrolls in a postsecondary institution like a university, these rights transfer from the parents to the student.

Best Practices

Generally, there are several best practices that should be followed when handling personal information. These include but are not limited to:

  • Obtaining explicit, informed consent before collecting any personal data.

  • Providing clear, accessible privacy notices detailing what data is collected, how it’s used, and how long it’s retained.

  • Implementing processes for data access, correction, and deletion requests.

  • Conducting periodic privacy impact assessments to identify and mitigate emerging risks.

Tip
  1. Which landmark data privacy law(s) would apply to the A/B testing example in Section 2.4.1?

  2. Which landmark data privacy law(s) would apply to the clinical trial example in Section 2.4.2?

  3. Which landmark data privacy law(s) would apply to the EV example in Section 2.4.3?

Footnotes

  1. Stobierski, T. (2020, December 18). What Is Conjoint Analysis? A Guide for Marketers. Harvard Business School Online. Retrieved August 5, 2025, from https://online.hbs.edu/blog/post/what-is-conjoint-analysis

  2. Webster, W. (2025). Sampling methods, types & techniques. Qualtrics. Retrieved August 7, 2025, from https://www.qualtrics.com/experience-management/research/sampling-methods/

  3. State of California Department of Justice. (n.d.). California Consumer Privacy Act (CCPA). Retrieved August 7, 2025, from https://oag.ca.gov/privacy/ccpa

  4. Wolford, B. (n.d.). What is GDPR, the EU’s new data protection law? GDPR.eu. Retrieved August 5, 2025, from https://gdpr.eu/what-is-gdpr/

  5. U.S. Department of Health & Human Services. (2024, July 19). HIPAA for Professionals. Retrieved August 5, 2025, from https://www.hhs.gov/hipaa/for-professionals/index.html

  6. U.S. Department of Education. (n.d.). What is FERPA? Student Privacy Policy Office. Retrieved August 5, 2025, from https://studentprivacy.ed.gov/faq/what-ferpa