PPOL564 | Data Science I - Foundations Reproducibility

# <font class = "title-panel"> PPOL564 | Data Science I - Foundations </font> <font size=100, face="bold"> Reproducibility</font> <br> <br>
### <font class = "title-footer">  Prof. Eric Dunford  ◆  Georgetown University  ◆  McCourt School of Public Policy  ◆  <a href="mailto:eric.dunford@georgetown.edu" class="email">eric.dunford@georgetown.edu</a></font>

---

<div class="slide-footer"><span> 
PPOL564 | Data Science I - Foundations

&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;

Class 1

&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;

Work Flow and Reproducibility

</span></div>

---

# Plan for Today

- **Introduce the Course**
  - Chat about data science
  - Aims of the course
  - Schedule
  - Other important information

- **Reproducibility**
  - What it means
  - How reproducible practices might save your life
  - How to build code that you and others can replicate and read

---

<br><br><br><br><br><br>
# What is data science?

---

<br>
.center[<img src="figures/data_science.JPG" width="700">]

---

## The Aim of Data Science

### Generate

+ **Valid**
    + <span style="color:#477acc"> scrutiny, discussion & limitations </span>
+ **Unbiased**
    + <span style="color:#477acc"> introspection, diversity & substantive knowledge </span>
+ **Reproducible**
    + <span style="color:#477acc"> data provenance, code transparency & version control </span>
+ **Compelling**
    + <span style="color:#477acc"> interpretable, intuitive & clear </span>
  
### insights using data to _influence and inform decision-making_.

---

## This Course focuses on

- Developing **programmatic methods and tools** in Python.
  - Understanding data types and programming approaches.
  - Developing a data wrangling toolkit.
  - Becoming versed in scientific computing: converting math to code, leveraging code to understand math. 
  - Best Practices
      + clean, well-documented, and readable code
      + version control 
  
- Understanding the **mathematical components** that underpin data analytic and statistical learning approaches. 
  - Linear algebra &rarr; linear regression & data decomposition.
  - Multivariate calculus &rarr; computational optimization
  - Probability &rarr; simulation and sampling

---

## What this course is not

- A full CS course on object-oriented programming.

- Webscraping or data retrieval (DSII)

- A machine learning course (we dabble in important computational concepts but don't delve into implementation: no model fitting or prediction). (DSII)

- A "big data" course (we won't delve into database structures, such as SQL). (Massive Data & Databases Elective)

---

## Course Calendar

<br>

<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Date </th>
   <th style="text-align:left;"> Topics </th>
   <th style="text-align:left;"> Date </th>
   <th style="text-align:left;"> Topics </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Sept. 4 </td>
   <td style="text-align:left;"> Reproducibility </td>
   <td style="text-align:left;"> Oct. 23 </td>
   <td style="text-align:left;"> Matrix Operations and Inversions </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sept. 9 </td>
   <td style="text-align:left;"> Version Control </td>
   <td style="text-align:left;"> Oct. 28 </td>
   <td style="text-align:left;"> Linear Regression </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sept. 11 </td>
   <td style="text-align:left;"> Python Notebooks </td>
   <td style="text-align:left;"> Oct. 30 </td>
   <td style="text-align:left;"> Eigen Decompositions </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sept. 16 </td>
   <td style="text-align:left;"> Data Types in Python </td>
   <td style="text-align:left;"> Nov. 4 </td>
   <td style="text-align:left;"> Decompositions in Practice </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sept. 18 </td>
   <td style="text-align:left;"> Control Sequences, Iteration, and Functions </td>
   <td style="text-align:left;"> Nov. 6 </td>
   <td style="text-align:left;"> Differentiation </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sept. 23 </td>
   <td style="text-align:left;"> Comprehensions and Generators </td>
   <td style="text-align:left;"> Nov. 11 </td>
   <td style="text-align:left;"> Optimizing Univariate Functions </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sept. 25 </td>
   <td style="text-align:left;"> Numpy </td>
   <td style="text-align:left;"> Nov. 13 </td>
   <td style="text-align:left;"> Optimizing Multivariate Functions </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sept. 30 </td>
   <td style="text-align:left;"> Data Wrangling with Pandas (part 1) </td>
   <td style="text-align:left;"> Nov. 18 </td>
   <td style="text-align:left;"> Gradient Descent </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Oct. 2 </td>
   <td style="text-align:left;"> Data Wrangling with Pandas (part 2) </td>
   <td style="text-align:left;"> Nov. 20 </td>
   <td style="text-align:left;"> Constrained Optimization and Regularization </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Oct. 7 </td>
   <td style="text-align:left;"> Exploratory Data Analysis </td>
   <td style="text-align:left;"> Nov. 25 </td>
   <td style="text-align:left;"> Probability </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Oct. 9 </td>
   <td style="text-align:left;"> Vectors </td>
   <td style="text-align:left;"> Dec. 2 </td>
   <td style="text-align:left;"> Bayes' Rule &amp; Naive Bayes Algorithm </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Oct. 16 </td>
   <td style="text-align:left;"> Trigonometry of Vectors </td>
   <td style="text-align:left;"> Dec. 4 </td>
   <td style="text-align:left;"> Simulation and MCMC Sampling </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Oct. 21 </td>
   <td style="text-align:left;"> Matrix Transformations </td>
   <td style="text-align:left;"> Dec. 9 </td>
   <td style="text-align:left;"> Wrap Up </td>
  </tr>
</tbody>
</table>

---

## Everything else

- Class Website: http://ericdunford.com/ppol564/

- Readings:
  
  - A reading list for each class is posted on the website.
  - Most readings are open source (accessible via link).
  - Some readings are in hard copy and will be posted on Canvas.
  
- Class Slack Channel for Communication
  
  - Use it publicly or privately to coordinate and work through issues
  
- Recitation
  
  - Mondays from 5 to 5:50 in Car Barn 203.
  
- Class on Sept. 23rd will be during recitation.

---

# Reproducibility <br> as a <br> _Practical_ Reality

---

## We focus on things like this...

---

## And forget the reality that is this...

---

## Reproducibility with a capital "R"

**...is fundamental to the scientific method, but it is also a <u> practical reality </u>.**

<br>

- juggling multiple versions of the same file

- collaboration can create conflicts across versions

- projects are picked up and put down &rarr; tracing the progression of a project across a spiderweb of files is not always easy (or possible)

- new people enter the fray &rarr; getting them up to speed means walking them through the labyrinth, which wastes time and resources

---

# Generating Reproducible Work

<br>
### 1. Readable

### 2. Portable

### 3. Well-Named

### 4. Repeatable

### 5. Version Control

---

## Readable

```python
import numpy as np
import matplotlib.pyplot as plt
x = np.random.normal(size=100)
y = 1 + 2*x + np.random.normal(size=100)
plt.scatter(x,y)
```

vs.

```python
# Monte Carlo simulation of a bivariate linear regression
import numpy as np
import matplotlib.pyplot as plt

sample_size = 100  # simulated sample size

indep_var = np.random.normal(size=sample_size)  # independent variable

error = np.random.normal(size=sample_size)  # simulated error

# generate the dependent variable as a function of the
# independent variable and some error.
dep_var = 1 + 2*indep_var + error

# plot values
plt.scatter(indep_var, dep_var)
```

???

- Well-commented code and functions
- Well-named objects
- Leverage spacing

To a degree, code (like writing) should be more Hemingway than Faulkner: concise,
clear, readable.

---

## Portable

- **Project can easily travel across computers**
    - Python's virtual environments (`venv`) and R Projects (`.Rproj`)

- **Scripts avoid "machine" specific designations**
    + Avoid **specific file paths**: `/Users/my-user-name/data-projects/my-project`
    + **Record the software and package versions** the project was built with.
      
- **Use text files**
    + Not software dependent (unlike, e.g., `.docx` or `.ai`); can be opened on any system
    + Can be easily searched via the command line
    + Easy to track changes via version control

---

## Well-Named

- **No spaces!**
    + A space between designations can mean many things
    + Spaces are ambiguous for the computer
    + `data analysis 2.py` &rarr; `data-analysis-2.py`
    
- **Names that state the purpose of the file** (no matter how long).
    + `data-analysis-2.py` &rarr; `Analysis01_wrangling-census-data-for-visualization_v2.py`
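
One way to apply the "no spaces" rule in bulk is a small renaming helper. This is only a sketch: `clean_file_name` is a hypothetical function, and it only handles whitespace, not other awkward characters.

```python
import re


def clean_file_name(name):
    """Replace each run of whitespace in a file name with a single hyphen."""
    return re.sub(r"\s+", "-", name.strip())


print(clean_file_name("data analysis 2.py"))  # data-analysis-2.py
```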
 
---

## Well-Named
  
- Maintain **designated folders** for different aspects of the project.

```bash
data-project
├── raw-data/        # Where our input data lives
├── output-data/     # Where our manipulated data lives
├── py/              # Where our Python functions live
├── R/               # Where our R functions live
├── figures/         # Where our generated figures live
├── reports/         # Where our text-based reports (.tex/.md/.txt) live
└── analysis/        # Where our analyses live
```
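
The layout above can itself be generated by code, which keeps every new project consistent. A minimal sketch, assuming the folder names shown above; `create_project_skeleton` is a hypothetical helper, and the demo runs in a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path

# Folder names taken from the layout above.
PROJECT_FOLDERS = ["raw-data", "output-data", "py", "R",
                   "figures", "reports", "analysis"]


def create_project_skeleton(root):
    """Create the project folder and its sub-folders.

    Folders that already exist are left untouched.
    """
    root = Path(root)
    for folder in PROJECT_FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    return root


# Demo in a scratch directory that is deleted afterwards.
with tempfile.TemporaryDirectory() as scratch:
    project = create_project_skeleton(Path(scratch) / "data-project")
    print(sorted(p.name for p in project.iterdir()))
```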

---

## Repeatable

- **Every step of the project can be expressed as code**

- **Automate what you can**

- **Use functions to repeat common tasks**

- **Clearly state all dependencies** (i.e. packages/modules) at the top of every script
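
A small sketch of these points together: dependencies listed at the top, a common task wrapped in a function, and a fixed seed so the "random" result repeats exactly. It uses only the standard library to stay self-contained; the function name and seed value are hypothetical. With NumPy, the same idea would presumably use `np.random.default_rng(seed)`.

```python
# Dependencies (standard library only in this sketch)
import random
import statistics


def simulate_sample_mean(sample_size, seed):
    """Draw `sample_size` standard-normal values and return their mean.

    A local generator seeded with `seed` makes each call repeatable.
    """
    rng = random.Random(seed)
    draws = [rng.gauss(0, 1) for _ in range(sample_size)]
    return statistics.mean(draws)


first_run = simulate_sample_mean(sample_size=100, seed=1234)
second_run = simulate_sample_mean(sample_size=100, seed=1234)
print(first_run == second_run)  # True: same seed, same draws
```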

---

## Version Control

<br>

- **Retain a record of all changes** made throughout the project's lifespan

- **Easily handle collaboration**:
    + track who did what
    + a uniform method for dealing with conflicting changes
    
- **Provides room for experimentation and non-linear exploration**

- No more **versioned file names**!

---
## Toward Reproducibility 
<br>
<br>

- Throughout the course, we will use **Jupyter Notebooks** to write code and to document our analyses.

- Jupyter Notebooks aren't perfect, but they help embody some of the principles discussed.
  
- I encourage you to figure out a setup that works for you (I'll post my setup on the class website).