Introduction to Data Science

PPOL670

Spring 2021
Georgetown University


Course Outline

Course Schedule



Week   Date         Topic                                                 Assignment
1      January 26   Work Flow and Reproducibility
2      February 2   Introduction to Programming in R
3      February 9   Reproducibility in Practice
4      February 16  Data Wrangling in R                                   Problem Set 1 Assigned
5      February 23  Data Visualization                                    Problem Set 1 Due
6      March 2      Web Scraping                                          Problem Set 2 Assigned
7      March 9      Geospatial Data                                       Problem Set 2 Due
8      March 16     Text as Data                                          Problem Set 3 Assigned
9      March 23     Introduction to Statistical Learning                  Problem Set 3 Due
-      March 30     Spring Break; No class
10     April 6      Applications in Supervised Learning (Regression)      Problem Set 4 Assigned
11     April 13     Applications in Supervised Learning (Classification)  Project Proposal Due; Problem Set 4 Due
12     April 20     Interpretable Machine Learning                        Problem Set 5 Assigned
13     April 27     Applications in Unsupervised Learning                 Problem Set 5 Due
14     May 4        Project Presentations
Final  May 18                                                             Final Project Due (9:00 PM)



Virtual Classroom


Virtual Office Hours (Wednesdays 9:00am - 11:00am)


The recurring Zoom link can also be found on Canvas. Please contact the professor or TA through Slack if any of the Zoom links break.






Syllabus



Project


Data science is an applied field, so it is important that you understand how to conduct a complete analysis: collecting data, cleaning and analyzing it, and presenting your findings. To this end, students are required to complete an independent data science project that applies concepts learned throughout the course. The project has three parts: a project proposal, an in-class presentation, and a project report.

More information regarding the final project will be circulated during class in Week 8.


Overview

Project Overview


Presentation Rubric


Final Project Rubric


Examples of Successful Projects






Installation



The following are installation instructions for R and RStudio.



R Software

To install R, download R from CRAN via the following:

To install RStudio, download from the following (scroll to the bottom):

Video walkthroughs:






Contact


Eric Dunford (Professor)

  • Office: Bedroom (formerly 404 Old North)
  • Office Hours: Wednesdays 9:00am to 11:00am (EST) or by appointment
  • Email: eric.dunford@georgetown.edu


Maddie Pickens (Teaching Assistant)


Class

  • Slack: Class communications will primarily take place through Slack. Please follow the invite link to be added to the class Slack channel.






Lecture Materials

Asynchronous lecture materials will go live approximately one week prior to the scheduled synchronous meeting date.



Week 1: Work Flow and Reproducibility



Week 5: Data Visualization



Week 7: Geospatial Data


  • Asynchronous Material

  • Lecture Slides

  • Practice:

  • Discussion:

    • Retrieve country-level and subnational-level shape files for free from diva-gis.org.
    • Explore other popular mapping packages in R:
      • ggmap: ggmap is an R package that makes it easy to retrieve raster map tiles from popular online mapping services like Google Maps and Stamen Maps and plot them using the ggplot2 framework.
      • leaflet: leaflet is one of the most popular open-source JavaScript libraries for interactive maps. It’s used by websites ranging from The New York Times and The Washington Post to GitHub and Flickr, as well as GIS specialists like OpenStreetMap, Mapbox, and CartoDB.
      • tidycensus: tidycensus is an R package that allows users to interface with the US Census Bureau’s decennial Census and five-year American Community Survey APIs and return tidyverse-ready data frames, optionally with simple feature geometry included.
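
As a starting point for the discussion items above, here is a minimal sketch of reading a shapefile and mapping it in R. It assumes the sf and ggplot2 packages are installed; neither is named in the list above, so treat the choice as illustrative rather than as the course's required tooling. To keep the example self-contained, it reads the North Carolina demo shapefile bundled with sf instead of a file downloaded from diva-gis.org.

```r
library(sf)       # assumed: reads shapefiles as simple feature data frames
library(ggplot2)  # assumed: geom_sf() draws simple feature geometries

# Path to a shapefile. This is the demo file shipped with the sf package;
# replace it with the path to a .shp file you downloaded yourself.
shp_path <- system.file("shape/nc.shp", package = "sf")

# Read the shapefile into an sf object (a data frame with a geometry column)
nc <- st_read(shp_path, quiet = TRUE)

# Draw the county polygons, shaded by the AREA column in the demo data
p <- ggplot(nc) +
  geom_sf(aes(fill = AREA)) +
  labs(fill = "Area") +
  theme_minimal()

print(p)
```

Because `nc` behaves like an ordinary data frame, it can be filtered or joined to other tables before plotting.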


Week 8: Text as Data



Week 9: Introduction to Statistical Learning



Week 10: Applications in Supervised Learning (Regression)



Week 11: Applications in Supervised Learning (Classification)



Week 13: Applications in Unsupervised Learning