Instructor

Professor: Eric Dunford, Ph.D.

  • Office: Bedroom (formerly 404 Old North)
  • Office Hours: Wednesdays 9am to 11am (EST)
  • Email: eric.dunford@georgetown.edu
  • Pronouns: he/him

Teaching Assistant: Madeline (Maddie) Pickens

Class Website: www.ericdunford.com/ppol670


Course Description

This course teaches students how to synthesize disparate, possibly unstructured data in order to draw meaningful insights. Topics covered include fundamentals of functional programming in R, literate programming, data wrangling, data visualization, data extraction (via web scraping and APIs), text analysis, and machine learning methods. The course aims to offer students a practical toolkit for data exploration. The objective of the course is to equip students with the skills to incorporate data into their decision-making and analysis. No prior programming experience is assumed, but prior statistics training is required.

Time and location

Classes will be held virtually on Tuesdays from 6:30 pm to 9:00 pm:

  • January 26
  • February 2, 9, 16, 23
  • March 2, 9, 16, 23
  • April 6, 13, 20, 27
  • May 4

Holidays/Breaks/Away (No class):

  • March 30 (Spring Break)

Asynchronous & Synchronous Lectures

The lecture will be broken up into synchronous and asynchronous components.

  • The asynchronous components will cover the main concepts of the lecture. These materials will take the form of embedded videos in class lecture notes on the course website. Students are required to review this content along with the lecture notes and readings prior to the start of class. Asynchronous materials will be made available a week prior to the scheduled lecture date.
  • The synchronous component will take place at the scheduled class time and will involve active coding walkthroughs, breakout group sessions, and questions. The aim of the synchronous class time is to reinforce the concepts covered in the asynchronous lecture materials. Thus, it is imperative that students complete the asynchronous material prior to the start of the synchronous lecture.

Note that this class is scheduled to meet weekly for 2.5 hours. I will do my best to ensure that the asynchronous and synchronous material in combination does not exceed 2.5 hours weekly. Put differently, students will not be required to commit more than 2.5 hours to lecture each week. This does not include readings, homework, and/or coding discussions. Splitting lecture materials into synchronous and asynchronous components is necessary when learning virtually: Zoom fatigue is real, and lectures that exceed 1.5 hours are not effective. When we do meet in person, five-minute breaks will be taken approximately every 40 minutes.

All synchronous lecture material will be recorded and stored on the class Canvas site. Students who are unable to attend the synchronous lecture will be able to review the materials covered in class at a future date.

For students attending class from afar (i.e., in time zones more than 4 hours off Eastern Standard Time), participating in the synchronous lecture component may not be a viable option. Please let the professor know if you’re planning on attending the course from afar. These students will not be required to attend synchronous components of the lecture. It is the student’s responsibility to review all lecture materials and to keep pace with the course.

Virtual Classroom

We will use Zoom (a web-conferencing platform) to hold class each week. Class will meet at its regularly scheduled time each week for synchronous lectures. If you do not have Zoom, please download it prior to the start of class.

A link for the synchronous component of the weekly lecture along with a link for virtual office hours is posted on the course website and Canvas. Students will use this link to access the live Zoom call for lecture.

If the link breaks or does not function properly, please check the #general channel on Slack for information regarding the new link. If there is no message regarding a new link, please contact the professor and/or TA via Slack. All synchronous lecture material will be recorded.

Course Objectives

This course focuses on providing students with an applied knowledge of the R programming environment while placing emphasis on developing a practical data science toolkit that students can implement quickly and efficiently. To this end, the course takes a ‘Tidyverse’ approach to R programming, which provides users an intuitive grammar for data manipulation and visualization. The goal is to establish a practical toolkit for analysis in R without getting too bogged down in the nuts and bolts of functional programming.

  1. Understand the basics of programming in R with emphasis on the “tidy” ecosystem of packages.

  2. Learn how to wrangle (prepare and clean) different types of data.

  3. Learn to identify and visualize important trends and findings.

  4. Learn to extract and process data from unstructured sources, such as the web and/or text.

  5. Learn to use statistical learning approaches to effectively explore and ask questions of data.

  6. Learn how to query online resources to find answers to resolve coding-related errors/inquiries.

Pre-Requisites

  • Required: PPOL501/531 - Statistical Methods for Policy Analysis (or an equivalent course)
  • Preferred: PPOL502/532 - Regression Methods for Policy Analysis (or an equivalent course)

Required Materials

Readings: We will rely primarily on the following texts for this course.

  • Wickham, H., & Grolemund, G. (2016). “R for data science: import, tidy, transform, visualize, and model data”. O’Reilly Media, Inc.

    • In an effort to keep costs as low as possible, we’ll resort to the online presentation of these materials. That said, many students find it useful to have a hard copy of the book materials. I strongly encourage students to purchase this book. It will serve as a valuable reference both during the semester and into the future.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). “An Introduction to Statistical Learning: with Applications in R”. New York: Springer.

  • Additional readings will be posted for each class and can be found on the course website. Most reading material is open source and available via a link on the reading list; otherwise, it can be found on Canvas.

Class Website: A class website (www.ericdunford.com/ppol670) will be used throughout the course and should be checked on a regular basis for lecture materials and required readings.

Class Slack Channel: The class also has a dedicated Slack channel (ppol670-spring-2021). The channel serves as an open forum to discuss, collaborate, pose problems/questions, and offer solutions. Students are encouraged to pose any questions they have there so that the professor and TA can answer them where everyone can see the response. If you’re unfamiliar with Slack, please consult the following start-up tutorial (https://get.slack.help/hc/en-us/articles/218080037-Getting-started-for-new-members). Please follow the invite link to be added to the Slack channel.

Canvas: A Canvas site (http://canvas.georgetown.edu) will be used periodically throughout the course and should be checked on a regular basis. All assignments will be posted on Canvas; they will not be distributed in class or by e-mail. Support for Canvas is available at (202) 687-4949.

Computing: Programming tasks for in-class activities and assignments will be conducted using R. Students are strongly encouraged to use RStudio, which offers an accessible and widely used graphical user interface for programming in R.

NOTE: In-class activities will include programming in R. If you do not have access to a laptop on which you can install R and RStudio, please contact the professor and/or TA for assistance.

Course Requirements

Assignment | Percentage of Grade
Problem sets | 50%
Final Project | 50%

Note that the grades on Canvas are not weighted, and thus, may not accurately reflect a student’s final grade.
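To illustrate why an unweighted Canvas total can differ from the weighted course grade, here is a minimal sketch in R. The scores are hypothetical, and it assumes every item is marked out of 100 and that the final project is treated as a single score; only the 50/50 weights come from the table above.

```r
# Hypothetical scores (out of 100) for the five problem sets and the project.
problem_sets  <- c(90, 85, 80, 95, 100)
final_project <- 80

# Unweighted view: every item counts equally, as raw points might on Canvas.
unweighted <- mean(c(problem_sets, final_project))

# Weighted view: problem sets are worth 50% and the final project 50%.
weighted <- 0.50 * mean(problem_sets) + 0.50 * final_project

unweighted  # roughly 88.3
weighted    # 85
```

In this made-up case, the unweighted average overstates the grade because the single project score carries half of the total weight.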

Problem Sets (50%): Students will be assigned five problem sets. While you are encouraged to discuss the problem sets with your peers and/or consult online resources, the finished product must be your own work. Problem sets are due on the date and time posted on Canvas and must be submitted on Canvas. Late assignments will be penalized a letter grade for every day they are overdue.

All problem sets must be submitted as .html files with clean, readable code chunks using RMarkdown. Along with the .html, students must submit a .zip file containing the .rmd file they used to knit the .html and the data used to complete the assignment. The .rmd file should be completely reproducible and contain no machine-specific information (e.g. a file path). All assignment submissions must adhere to the following guidelines:

    1. All code must run.
    2. Solutions should be readable:
        • Code should be thoroughly commented (the Professor/TA should be able to understand the code’s purpose by reading the comments).
        • Coding solutions should be broken up into individual code chunks, not clumped together into one large code chunk (see examples in class or reach out to the TA/Professor if this is unclear).
    3. Non-coding responses should all be written in Markdown and should contain no grammatical or spelling errors.
    4. All programming solutions should employ concepts learned during the course. Specifically, students must use tidyverse solutions learned in class rather than base R solutions pulled from the internet.
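As a rough sketch of what the contents of one code chunk meeting these guidelines might look like, consider the example below. The folder and file names are hypothetical, the built-in mtcars data stands in for assignment data, and the dplyr package is assumed to be installed. It is meant as an illustration of the style, not as a template you must follow.

```r
library(dplyr)

# Build a relative path rather than hard-coding a machine-specific one
# (e.g. a path that starts with your own user folder). The names below are
# hypothetical and the file is not actually read in this sketch.
data_path <- file.path("data", "cars.csv")

# Stand-in for the assignment data: the built-in mtcars data set,
# with the row names moved into a regular column.
cars <- as_tibble(mtcars, rownames = "model")

# Average fuel efficiency by number of cylinders, highest average first.
mpg_by_cyl <- cars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg), n_cars = n()) %>%
  arrange(desc(avg_mpg))

mpg_by_cyl
```

Each chunk does one job, the comments state the purpose of each step, and nothing in the code depends on where the project folder sits on a particular computer.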

The following schedule lays out when each assignment will be assigned and due.

Assignment | Date Assigned | Date Due
No. 1 | February 16 | February 23
No. 2 | March 2 | March 9
No. 3 | March 16 | March 23
No. 4 | April 6 | April 13
No. 5 | April 20 | April 27

Final Project (50%): Data science is an applied field, and it is therefore important that you understand how to conduct a complete analysis, from collecting data, to cleaning and analyzing it, to presenting your findings. Toward the end of the semester, you will complete an independent data science project, applying concepts learned throughout the course. The project is composed of three parts: a 500-word (2-page) project proposal, an in-class presentation, and a 3000-word (12-page) project report. Due dates and breakdowns for the project are as follows:

Requirement | Due | Length | Percentage
Project Proposal | April 13 | 750 - 1000 words | 5%
Presentation | May 4 | 7 minutes | 10%
Project Report | May 18 | 3000 words | 35%

Details regarding each aspect of the project will be posted on the course website leading up to the first due date (i.e. the Project Proposal). Until then, we will not discuss the project in class. The reason for this is that students need to reach a basic level of data competency before thinking through a project idea. Thus, discussion of the final project and the development of a project proposal will align with the final portion of the class, once we’ve broadly covered most of the fundamental data topics in this course.

Grading

Course grades will be determined according to the following scale:

Letter | Range
A | 95% – 100%
A- | 91% – 94%
B+ | 87% – 90%
B | 84% – 86%
B- | 80% – 83%
C | 70% – 79%
F | < 70%

Managing the Workload: How to Succeed in this Course

  • Come Prepared.

    • Do the readings. Think about the readings on their own terms, but also in terms of how the concepts apply to things you’re interested in.

    • As this class is quite hands-on, it is expected that students bring their computers to class to partake in computational activities. Moreover, students should have all relevant software up and running on their machines.

  • Ask Questions.

    • Formulating a question helps you engage with the material much more deeply. If you have a question, it’s almost certain that others do too; asking a question will not only help yourself, but you will help others. Most importantly, asking questions helps keep the class on track. If there are lots of questions, we’ll slow down and get things figured out. If there are few questions, we’ll charge ahead.
  • Collaborate.

    • Work in groups, but do so wisely. Collaboration is the greatest source of creativity and innovation. Better yet, working with classmates is a great way to learn from each other. Often, classmates will have some way of explaining things that clicks for you, and, more often than not, the act of explaining something to someone else will make things click for you. This only works, though, if you prepare by yourself first. If you show up and wait for classmates to do the work, you can probably muddle through the homeworks, but you’ll have trouble participating in classes and may fall behind as the material we cover cumulates and needs to be understood at each step.

    • Collaboration should not result in verbatim submissions (e.g. no copy cats). As everyone writes code following their own unique logic, identical submissions are unlikely and easily detected. Non-unique code will be penalized.

    • Finally, utilize the class Slack channel to post any questions, insights, coding problems and concerns. The channel will offer an open forum to communicate, collaborate, and collectively problem solve.

  • Start homeworks early.

    • Sometimes the data doesn’t cooperate, or there is an error in your code that will take you a while to figure out and debug. You don’t want to find this out at 11pm the night before the homework is due. Also, the more homework you do, the better you will be able to follow the lectures.
  • Try doing it the hard way.

    • A core factor in the success of a data scientist is being able to explain how an algorithm or analysis was constructed, not just use software. In this class, where possible, build from scratch rather than relying on an overly convenient library. This will allow you to become more creative down the line.

Course Policies

Participation

Participation is required in this course. I define participation as:

  • Attending synchronous lecture components over Zoom.
  • Completing the readings and asynchronous materials prior to the synchronous lecture.
  • Asking questions and participating in class.
  • Keeping your camera on at all times during synchronous lectures.
  • Paying attention to the professor during lecture.
  • Engaging in break-out group discussions when assigned.
  • Responding to questions asked during synchronous sessions.

I reserve the right to deduct points from the final grade of students who are not participating as expected.

Communication

  • For private questions concerning the class, email is the preferred method of communication. All email messages must originate from your Georgetown University email account(s). Please use a professional salutation, proper spelling and grammar, and patience in waiting for a response. The professor reserves the right to not respond to emails that are drafted inappropriately. Please email the professor and the TA directly rather than through the Canvas messaging system. Emails sent through Canvas will be ignored.

  • For general, class-relevant questions, Slack is the preferred method of communication. Please use the general or the relevant channel for these questions.

  • I will respond to all emails/Slack questions within 24 hours of being sent during a weekday. I will not respond to emails/Slack messages sent late Friday (after 5PM) or during the weekend until Monday (9AM). Please plan accordingly if you have questions regarding current or upcoming assignments. Please address the professor and TA by their last name unless stated otherwise.

Electronic Devices

The use of laptops, tablets, or other mobile devices is permitted only for class-related work. Audio and video recording is not allowed unless prior approval is given by the professor. Please mute all electronic devices during class.

Assignments and Late Work

Assignments should be clear, legible, and submitted in the required format. Writing assignments will be graded on the basis of content, logic, analysis, mechanics, organization, and research. Due dates for all assignments will be posted on Canvas and are non-negotiable. Exceptions to this policy will be made only under extremely unusual circumstances and will require valid documentation from the student. Late problem sets will be penalized a letter grade per day.

Proof of Diligent Debugging

When reaching out to the professor or teaching assistant regarding a technical question, error, or issue, you must demonstrate that you made a good faith effort to debug/isolate your problem prior to reaching out. In as concise a way as possible, send a record of what you tried to do along with a reproducible example emulating the error. (See the materials for Week 3 on how to generate a reproducible example using reprex and datapasta.) As software in data science is continually being refined and new approaches continually emerge and change, learning how to frame your question and find a similar solution online is a key tool for success in this domain. If you make a diligent effort beforehand to solve your problem, we will do the same in trying to help you figure out a solution. Note that the professor/TA is a resource of last resort: only come to them after you’ve exhausted all other options.
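As a minimal sketch of what a reproducible example might contain, consider the snippet below. The data and column names are hypothetical, and it assumes the dplyr package is installed. The Week 3 materials remain the reference for how to use reprex and datapasta.

```r
# A small, self-contained example: one library call, a tiny data set built
# inline, and only the lines needed to trigger the problem.
library(dplyr)

# Hypothetical data; the "score" column is stored as text.
dat <- tibble(
  id    = c(1, 2, 3),
  score = c("10", "20", "n/a")
)

# The step that misbehaves: converting text to numbers returns NA
# (with a warning) for the entry that is not a number.
result <- dat %>%
  mutate(score_num = as.numeric(score))

result

# One way to package this for sharing, assuming the reprex package is
# installed: copy the code above, then run reprex::reprex(venue = "slack").
```

Because the data is created inside the example, anyone can run it as-is and see the same behavior, which makes it much quicker to diagnose.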

Use of Class Materials

Increasingly, with the proliferation of certain websites, questions about the ownership of course materials have arisen (and Georgetown is actively working on policies to address these concerns). I consider my syllabus, lectures, handouts, problem sets, and problem set answers to be my intellectual property. I respectfully request that you refrain from sharing my materials in any electronic (or paper) format. You are welcome to save my lectures for your own use, but they should not be posted anywhere. Sharing notes, on an occasional basis, with others in the class is fine as long as they are not posted elsewhere online. Students found in breach of this policy will fail the course.

Academic Resource Center/Disability Support

If you believe you have a disability, then you should contact the Academic Resource Center (arc@georgetown.edu) for further information. The Center is located in the Leavey Center, Suite 335 (202-687-8354). The Academic Resource Center is the campus office responsible for reviewing documentation provided by students with disabilities and for determining reasonable accommodations in accordance with the Americans with Disabilities Act (ADA) and University policies. For more information, go to http://academicsupport.georgetown.edu/disability/.

Important Academic Policies and Academic Integrity

McCourt School students are expected to uphold the academic policies set forth by Georgetown University and the Graduate School of Arts and Sciences. Students should therefore familiarize themselves with all the rules, regulations, and procedures relevant to their pursuit of a Graduate School degree. The policies are located at: http://grad.georgetown.edu/academics/policies/

Provost’s Policy Accommodating Students’ Religious Observances

Georgetown University promotes respect for all religions. Any student who is unable to attend classes or to participate in any examination, presentation, or assignment on a given day because of the observance of a major religious holiday (see below) or related travel shall be excused and provided with the opportunity to make up, without unreasonable burden, any work that has been missed for this reason and shall not in any other way be penalized for the absence or rescheduled work. Students will remain responsible for all assigned work. Students should notify professors in writing at the beginning of the semester of religious observances that conflict with their classes. The Office of the Provost, in consultation with Campus Ministry and the Registrar, will publish, before classes begin for a given term, a list of major religious holidays likely to affect Georgetown students. The Provost and the Main Campus Executive Faculty encourage faculty to accommodate students whose bona fide religious observances in other ways impede normal participation in a course. Students who cannot be accommodated should discuss the matter with an advising dean.

Statement on Sexual Misconduct

Please know that as a faculty member I am committed to supporting survivors of sexual misconduct, including relationship violence, sexual harassment and sexual assault. However, university policy also requires me to report any disclosures about sexual misconduct to the Title IX Coordinator, whose role is to coordinate the University’s response to sexual misconduct.

Georgetown has a number of fully confidential professional resources who can provide support and assistance to survivors of sexual assault and other forms of sexual misconduct. These resources include:

Jen Schweer, MA, LPC
Associate Director
Health Education Services for Sexual Assault Response and Prevention
(202) 687-0323
jls242@georgetown.edu

Erica Shirley
Trauma Specialist
Counseling and Psychiatric Services (CAPS)
(202) 687-6985
els54@georgetown.edu

More information about campus resources and reporting sexual misconduct can be found at http://sexualassault.georgetown.edu.

Course Calendar

Week | Date | Topic | Assignment
1 | January 26 | Work Flow and Reproducibility
2 | February 2 | Introduction to Programming in R
3 | February 9 | Reproducibility in Practice
4 | February 16 | Data Wrangling in R | Problem Set 1 Assigned
5 | February 23 | Data Visualization | Problem Set 1 Due
6 | March 2 | Web Scraping | Problem Set 2 Assigned
7 | March 9 | Geospatial Data | Problem Set 2 Due
8 | March 16 | Text as Data | Problem Set 3 Assigned
9 | March 23 | Introduction to Statistical Learning | Problem Set 3 Due
  | March 30 | Spring Break; No class
10 | April 6 | Applications in Supervised Learning (Regression) | Problem Set 4 Assigned
11 | April 13 | Applications in Supervised Learning (Classification) | Project Proposal Due; Problem Set 4 Due
12 | April 20 | Interpretable Machine Learning | Problem Set 5 Assigned
13 | April 27 | Applications in Unsupervised Learning | Problem Set 5 Due
14 | May 4 | Project Presentations
Final | May 18 | Final Project Due (9:00 PM)

IMPORTANT: This syllabus is subject to change and may be amended throughout the course to reflect any changes deemed necessary by the professor. Any changes will be announced in class or on Slack.