class: center, middle, inverse, title-slide

# PPOL564 | Data Science I - Foundations
## Reproducibility
### Prof. Eric Dunford ◆ Georgetown University ◆ McCourt School of Public Policy ◆ eric.dunford@georgetown.edu
---
layout: true

<div class="slide-footer"><span> PPOL564 | Data Science I - Foundations           Class 1 <!-- Week of the Footer Here -->              Work Flow and Reproducibility<!-- Title of the lecture here --> </span></div>

---
class: outline

# Plan for Today

- **Introduce the Course**
    - Chat about data science
    - Aims of the course
    - Schedule
    - Other important information
- **Reproducibility**
    + what it means
    + how reproducible practices might save your life
    + how to build code that you and others can replicate and read

---

<br><br><br><br><br><br>

# What is data science?

---

<br>

.center[<img src="figures/data_science.JPG" width="700">]

---

## The Aim of Data Science

### Generate

+ **Valid**
    + <span style="color:#477acc"> scrutiny, discussion & limitations </span>
+ **Unbiased**
    + <span style="color:#477acc"> introspection, diversity & substantive knowledge </span>
+ **Reproducible**
    + <span style="color:#477acc"> data provenance, code transparency & version control </span>
+ **Compelling**
    + <span style="color:#477acc"> interpretable, intuitive & clear </span>

### insights using data to _influence and inform decision-making_.

---

## This Course focuses on

- Developing **programmatic methods and tools** in Python.
    - Understanding data types and programming approaches.
    - Developing a data wrangling toolkit.
    - Becoming versed in scientific computing: converting math to code, leveraging code to understand math.
    - Best Practices
        + clean, well-documented, and readable code
        + version control
- Understanding the **mathematical components** that underpin data analytic and statistical learning approaches.
    - Linear algebra → linear regression & data decomposition.
    - Multivariate calculus → computational optimization
    - Probability → simulation and sampling

---

## What this course is not

<br> <br>

- A full CS course on object-oriented programming.
- Web scraping or data retrieval (DSII)
- A machine learning course: we dabble with important computational concepts but don't delve into implementation (no fitting data or prediction). (DSII)
- A "big data" course (we won't delve into database structures, such as SQL). (Massive Data & Databases Elective)

---

## Course Calendar

<br>

<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Topics </th> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Topics </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Sept. 4 </td> <td style="text-align:left;"> Reproducibility </td> <td style="text-align:left;"> Oct. 23 </td> <td style="text-align:left;"> Matrix Operations and Inversions </td> </tr>
<tr> <td style="text-align:left;"> Sept. 9 </td> <td style="text-align:left;"> Version Control </td> <td style="text-align:left;"> Oct. 28 </td> <td style="text-align:left;"> Linear Regression </td> </tr>
<tr> <td style="text-align:left;"> Sept. 11 </td> <td style="text-align:left;"> Python Notebooks </td> <td style="text-align:left;"> Oct. 30 </td> <td style="text-align:left;"> Eigen Decompositions </td> </tr>
<tr> <td style="text-align:left;"> Sept. 16 </td> <td style="text-align:left;"> Data Types in Python </td> <td style="text-align:left;"> Nov. 4 </td> <td style="text-align:left;"> Decompositions in Practice </td> </tr>
<tr> <td style="text-align:left;"> Sept. 18 </td> <td style="text-align:left;"> Control Sequences, Iteration, and Functions </td> <td style="text-align:left;"> Nov. 6 </td> <td style="text-align:left;"> Differentiation </td> </tr>
<tr> <td style="text-align:left;"> Sept. 23 </td> <td style="text-align:left;"> Comprehensions and Generators </td> <td style="text-align:left;"> Nov. 11 </td> <td style="text-align:left;"> Optimizing Univariate Functions </td> </tr>
<tr> <td style="text-align:left;"> Sept. 25 </td> <td style="text-align:left;"> NumPy </td> <td style="text-align:left;"> Nov. 13 </td> <td style="text-align:left;"> Optimizing Multivariate Functions </td> </tr>
<tr> <td style="text-align:left;"> Sept. 30 </td> <td style="text-align:left;"> Data Wrangling with Pandas (part 1) </td> <td style="text-align:left;"> Nov. 18 </td> <td style="text-align:left;"> Gradient Descent </td> </tr>
<tr> <td style="text-align:left;"> Oct. 2 </td> <td style="text-align:left;"> Data Wrangling with Pandas (part 2) </td> <td style="text-align:left;"> Nov. 20 </td> <td style="text-align:left;"> Constrained Optimization and Regularization </td> </tr>
<tr> <td style="text-align:left;"> Oct. 7 </td> <td style="text-align:left;"> Exploratory Data Analysis </td> <td style="text-align:left;"> Nov. 25 </td> <td style="text-align:left;"> Probability </td> </tr>
<tr> <td style="text-align:left;"> Oct. 9 </td> <td style="text-align:left;"> Vectors </td> <td style="text-align:left;"> Dec. 2 </td> <td style="text-align:left;"> Bayes' Rule & Naive Bayes Algorithm </td> </tr>
<tr> <td style="text-align:left;"> Oct. 16 </td> <td style="text-align:left;"> Trigonometry of Vectors </td> <td style="text-align:left;"> Dec. 4 </td> <td style="text-align:left;"> Simulation and MCMC Sampling </td> </tr>
<tr> <td style="text-align:left;"> Oct. 21 </td> <td style="text-align:left;"> Matrix Transformations </td> <td style="text-align:left;"> Dec. 9 </td> <td style="text-align:left;"> Wrap Up </td> </tr>
</tbody>
</table>

---

## Everything else

- Class Website: http://ericdunford.com/ppol564/
- Readings:
    - Reading list posted on website for class each day.
    - Most readings are open source (can access via link)
    - Some readings are in hard copy and will be posted on Canvas.
- Class Slack Channel for Communication
    - Use it publicly or privately to coordinate and work through issues
- Recitation
    - Mondays from 5 to 5:50 in Car Bar 203.
    - Class on Sept. 23rd will be during recitation.
---
class: newsection

# Reproducibility <br> as a <br> _Practical_ Reality

---

## We focus on things like this...

<!-- <img src="figures/lecture01_ethnic-exposure-boko-haram.png" width="600"> -->

.center[![description of the image](figures/lecture01_gapminder-animation.gif)]

---

## And forget the reality that is this...

.center[<img src="figures/lecture01_messy-files.png" width="800">]

---

## Reproducibility with a capital "R"

**...is fundamental to the scientific method, but it is also a <u> practical reality </u>.**

<br>

- juggling multiple versions of the same file
- collaboration can create conflicts across versions
- projects are picked up and put down → tracing the progression of a project across a spiderweb of files is not always easy (or possible)
- new people enter the fray → getting them up to speed means walking them through the labyrinth, which wastes time and resources.

---

# Generating Reproducible Work

<br>

### 1. Readable
### 2. Portable
### 3. Well-Named
### 4. Repeatable
### 5. Version Control

---

## Readable

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.random.normal(size=100)
y = 1 + 2*x + np.random.normal(size=100)
plt.scatter(x,y)
```

vs.

```python
# Monte Carlo Simulation of a bivariate linear regression
import numpy as np
import matplotlib.pyplot as plt

sample_size = 100 # simulated sample size
indep_var = np.random.normal(size=sample_size) # independent variable
error = np.random.normal(size=sample_size) # simulate error

# generate dependent variable as function of the
# independent variable and some error.
dep_var = 1 + 2*indep_var + error

# plot values
plt.scatter(indep_var,dep_var)
```

???

- Well-commented code and functions
- Well-named objects
- Leverage spacing

To a degree, code (like writing) should be more Hemingway than Faulkner: concise, clear, readable.
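---

## Readable: A Documented Function

The simulation on the previous slide could also be wrapped in a documented function. This is only a minimal sketch: the name `simulate_linear_data` and its parameters are illustrative assumptions, and it swaps in the standard library's `random` module for NumPy so that it runs without extra packages.

```python
import random


def simulate_linear_data(sample_size=100, intercept=1.0, slope=2.0, seed=None):
    """Simulate data from dep_var = intercept + slope * indep_var + error.

    Both the independent variable and the error are drawn from a
    standard normal distribution. Passing the same `seed` reproduces
    the same draws. (Function and argument names are hypothetical.)
    """
    rng = random.Random(seed)  # local generator, so the seed stays contained

    indep_var = [rng.gauss(0, 1) for _ in range(sample_size)]
    error = [rng.gauss(0, 1) for _ in range(sample_size)]

    # dependent variable as a function of the independent variable and error
    dep_var = [intercept + slope * x + e for x, e in zip(indep_var, error)]

    return indep_var, dep_var


# The same seed yields the same simulated data on every run.
indep_var, dep_var = simulate_linear_data(sample_size=100, seed=564)
```

The docstring, descriptive argument names, and spacing carry most of the explanation, so fewer inline comments are needed.
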
---

## Portable

- **Project can easily travel across computers**
    - Python's Virtual Environments (`venv`) and R Projects (`.Rproj`)
- **Scripts avoid "machine" specific designations**
    + Avoid **specific file paths**: `/Users/my-user-name/data-projects/my-project`
    + **Retain software and package versions**.
- **Use text files**
    + Not tied to specific software (unlike, e.g., .docx or .ia files); can be opened on any system
    + Can be easily searched via the command line
    + Easy to track changes via version control

---

# Well-named

- **No spaces!**
    + A space between designations can mean many things
    + spaces are ambiguous for the computer
    + `data analysis 2.py` → `data-analysis-2.py`
- **Names that state the purpose of the file** (no matter how long).
    + `data-analysis-2.py` → `Analysis01_wrangling-census-data-for-visualization_v2.py`

---

# Well-named

- Maintain **designated folders** for different aspects of the project.

```bash
data-project
├── raw-data/       # Where our input data lives
├── output-data/    # Where our manipulated data lives
├── py/             # Where our Python functions live
├── R/              # Where our R functions live
├── figures/        # Where our generated figures live
├── reports/        # Where our text-based reports (.tex/.md/.txt) live
└── analysis/       # Where our analyses live
```

---

## Repeatable

<br><br>

- **Every step of the project can be expressed as code**
- **Automate what you can**
- **Use functions to repeat common tasks**
- **Clearly state all dependencies** (i.e. packages/modules) at the top of every script

---

## Version Control

<br>

- **Retain a record of all changes** made throughout the project's lifespan
- **Easily handle collaboration**:
    + track who did what
    + a uniform method for dealing with conflicting changes
- **Provides room for experimentation and non-linear exploration**
- No more **version file names**!

---

## Toward Reproducibility

<br> <br>

- Throughout the course, we will use **Jupyter Notebooks** to write code and to document our analyses.
- Jupyter Notebooks aren't perfect, but they help embody some of the principles discussed.
- I encourage you to figure out a setup that works for you (I'll post my setup on the class website).
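
---

## Portable and Well-Named: A Sketch

The advice on the Portable and Well-named slides (no spaces in file names, no machine-specific paths, retain software versions) might look like this in practice. This is a rough sketch using only the standard library; the helper names and the `census.csv` file are hypothetical, and the folder name `raw-data` is taken from the project layout shown earlier.

```python
import platform
import re
from pathlib import Path


def clean_file_name(file_name):
    """Replace runs of whitespace in a file name with single hyphens."""
    return re.sub(r"\s+", "-", file_name.strip())


def project_path(project_root, *parts):
    """Build a path inside the project from its root and relative parts."""
    return Path(project_root).joinpath(*parts)


def environment_summary():
    """Return interpreter details worth recording alongside results."""
    return {
        "python_version": platform.python_version(),
        "system": platform.system(),
    }


# The root is resolved at run time (here: the current working directory),
# so no user-specific path such as /Users/my-user-name/... is hard-coded.
project_root = Path.cwd()
raw_data_file = project_path(project_root, "raw-data", "census.csv")  # hypothetical file

print(clean_file_name("data analysis 2.py"))               # data-analysis-2.py
print(raw_data_file.relative_to(project_root).as_posix())  # raw-data/census.csv
print(environment_summary())
```

Package versions can be recorded in the same spirit, for example with `pip freeze > requirements.txt` inside the project's virtual environment.
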