class: center, middle, inverse, title-slide

# PPOL564 | Data Science 1 | Foundations <br> Week 13 <br> Data Exploration in the Social Sciences

### Prof. Eric Dunford ◆ Georgetown University ◆ McCourt School of Public Policy ◆ eric.dunford@georgetown.edu
---
layout: true

<div class="slide-footer"><span> PPOL564 | Data Science 1 | Foundations           Week 13 <!-- Week of the Footer Here -->              EDA in the SS <!-- Title of the lecture here --> </span></div>

---

### Exploratory Data Analysis (EDA)

EDA is an **_iterative cycle_**:

![:space 5]

1. **_Generate_** questions about your data.
2. Search for answers by **_visualizing_**, **_transforming_**, and **_modelling_** your data.
3. Use what you learn to **_refine_** your questions and/or generate new questions.

![:space 5]

EDA is not a formal process with a strict set of rules. Feel free to investigate every idea that occurs to you. Some will work and offer valuable insights; others won't.

---

### Weak priors on the DGP

Often we need to elicit answers from data where we know little about the **_data generating process_** (DGP):

- Little to no academic work to guide our expectations.
- A new data type that does not conform cleanly to past operationalizations of well-defined theoretical concepts.
- Data that is entirely foreign to you
  + Data outside your domain of _expertise_ (e.g., you study conflict but need to examine healthcare data).
  + Data outside your domain of _expectation_ (e.g., a company's internal data logs from which you need to draw insight).

---

### Data Exploration Steps

Given new data, how best should one draw novel insights?

--

- **_![:text_color darkorange](Step 0)_**: **_Train/Test Split_**
  - Always the first step.
  - Split the data and never look at, learn from, or examine the test data.
  - We can probe, model, and theorize about the training data _as long as we've held off test data_ to validate our conclusions.

_Minimal code sketches for Steps 0-2 follow the Step 2 slide._

---

### Data Exploration Steps

Given new data, how best should one draw novel insights?

- **_![:text_color darkorange](Step 0)_**: **_Train/Test Split_**
- **_![:text_color darkorange](Step 1)_**: **_Understanding what data you have_**
  - <u>_Data Properties_</u>
    - What is the unit of analysis?
    - What are the data types?
      - Integers, floats, categorical, ordered scales?
    - What sources of variation are there?
    - How is the data distributed?
    - Are there oddities in the data?
      - e.g., housing price data with a lot of 0s.
    - Is the data incomplete? If so, why?
    - Should the data be transformed?
    - Are there ways to back out new variables from the variables you have? ("feature engineering")

---

### Data Exploration Steps

Given new data, how best should one draw novel insights?

- **_![:text_color darkorange](Step 0)_**: **_Train/Test Split_**
- **_![:text_color darkorange](Step 1)_**: **_Understanding what data you have_**
  - <u>_Data Properties_</u>
  - <u>_Data Generation_</u> (questions to ask yourself)
    - How was the data generated?
      - Hand coded vs. machine coded?
      - Self-reported?
      - Convenience sample?
      - Complete? Incomplete? Why?
    - How fit is the data to answer your (or any) inquiry?
    - In what ways could the data be distorted or biased?
    - Are the concepts measured in the data clearly defined?
      - If so, why do you believe that?
      - If not, where does the issue lie?

---

### Data Exploration Steps

Given new data, how best should one draw novel insights?

- **_![:text_color darkorange](Step 0)_**: **_Train/Test Split_**
- **_![:text_color darkorange](Step 1)_**: **_Understanding what data you have_**
- **_![:text_color darkorange](Step 2)_**: **_Analyzing the relationships_**
  - What variables are highly correlated?
  - What is the relationship between the variables?
  - What can bivariate models tell us about the relationship between two (or more) variables?
  - Does a decomposition reveal anything interesting about the variation in the data?
  - Does the data cluster in interesting ways?
  - Can we plot the data spatially or temporally? Do any interesting patterns emerge?
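---

### Sketch: Step 0 in code

A minimal sketch of the train/test split using `pandas` and `scikit-learn`. The file name `survey_data.csv`, the split ratio, and the seed are placeholders, not course-specific choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the raw data (file name is a placeholder).
dat = pd.read_csv("survey_data.csv")

# Carve off the test set immediately and set it aside.
# random_state makes the split reproducible.
train, test = train_test_split(dat, test_size=0.25, random_state=123)

# All exploration below happens on `train`; `test` is only touched
# at the very end, to validate whatever conclusions we reach.
```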
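---

### Sketch: Step 1 in code

A sketch of the first-pass "data properties" questions in `pandas`, run on the training split (`train`) from the Step 0 sketch. The `price` column used in the log transform is a hypothetical example of a right-skewed variable.

```python
import numpy as np

# Unit of analysis and size: rows x columns.
print(train.shape)

# Data types: integers, floats, categorical, ordered scales?
print(train.dtypes)

# How is the data distributed? Any oddities (e.g., a lot of 0s)?
print(train.describe())

# Is the data incomplete? Missing values by column.
print(train.isna().sum())

# Should the data be transformed? e.g., log a right-skewed variable
# ("price" is a hypothetical column name).
train["log_price"] = np.log1p(train["price"])
```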
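---

### Sketch: Step 2 in code

A sketch of Step 2 with `scikit-learn`: a correlation matrix, a principal components decomposition, and a quick k-means clustering on the training split. Restricting to numeric columns and choosing 2 components / 3 clusters are illustrative choices, not recommendations.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Which variables are highly correlated?
num = train.select_dtypes(include="number").dropna()
print(num.corr())

# Does a decomposition reveal anything about the variation in the data?
X = StandardScaler().fit_transform(num)
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)

# Does the data cluster in interesting ways?
clusters = KMeans(n_clusters=3, n_init=10, random_state=123).fit_predict(X)
print(num.assign(cluster=clusters).groupby("cluster").mean())
```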
---

### Data Exploration Steps

Given new data, how best should one draw novel insights?

- **_![:text_color darkorange](Step 0)_**: **_Train/Test Split_**
- **_![:text_color darkorange](Step 1)_**: **_Understanding what data you have_**
- **_![:text_color darkorange](Step 2)_**: **_Analyzing the relationships_**
- **_![:text_color darkorange](Step 3)_**: **_Unpacking the outcome_**
  - Which outcome variable do you care about (or which seems important) in the data?
  - Build a model that best predicts this outcome.
  - Explore the most predictive model:
    - What features/variables mattered most in the prediction task?
    - What are the marginal effects of these most predictive features?
    - Is there any heterogeneity in the predicted values of the model across certain observations?

_A code sketch for this step appears at the end of the deck._

---

### Data Exploration Steps

Given new data, how best should one draw novel insights?

- **_![:text_color darkorange](Step 0)_**: **_Train/Test Split_**
- **_![:text_color darkorange](Step 1)_**: **_Understanding what data you have_**
- **_![:text_color darkorange](Step 2)_**: **_Analyzing the relationships_**
- **_![:text_color darkorange](Step 3)_**: **_Unpacking the outcome_**
- **_![:text_color darkorange](Step 4)_**: **_Theorize_**
  - What are some hypotheses that we might be able to generate from the data?
  - What confounders might be lurking that generate spurious correlations?
  - What are some plausible interventions that we might employ to test a causal relationship?
    - Are those interventions ethical?
    - Are those interventions practical?

---

### Data Exploration Steps

Given new data, how best should one draw novel insights?

- **_![:text_color darkorange](Step 0)_**: **_Train/Test Split_**
- **_![:text_color darkorange](Step 1)_**: **_Understanding what data you have_**
- **_![:text_color darkorange](Step 2)_**: **_Analyzing the relationships_**
- **_![:text_color darkorange](Step 3)_**: **_Unpacking the outcome_**
- **_![:text_color darkorange](Step 4)_**: **_Theorize_**
- **_![:text_color darkorange](Step 5)_**: **_Repeat 1 - 4_**
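---

### Sketch: Step 3 in code

A sketch of Step 3 (referenced on the Step 3 slide) with `scikit-learn`: fit a flexible model to an outcome, then ask which features drove the predictions and how. The column name `outcome` and the choice of a random forest are illustrative assumptions, not the only reasonable ones.

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance, PartialDependenceDisplay

# Outcome and (numeric) features; "outcome" is a placeholder column name.
y = train["outcome"]
X = (train.select_dtypes(include="number")
          .drop(columns=["outcome"], errors="ignore")
          .fillna(0))

# Build a model that best predicts the outcome.
mod = RandomForestRegressor(n_estimators=500, random_state=123).fit(X, y)

# Which features mattered most in the prediction task?
imp = permutation_importance(mod, X, y, n_repeats=10, random_state=123)
ranked = sorted(zip(X.columns, imp.importances_mean), key=lambda t: -t[1])

# Marginal effect of the top predictor (partial dependence plot).
PartialDependenceDisplay.from_estimator(mod, X, [ranked[0][0]])
plt.show()
```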