Data Science I: Foundations

PPOL564-01

Fall 2021

Georgetown University


Course Outline

Course Schedule


Week | Date   | Topic                                           | Assignment                                   | Coding Discussion
1    | 1-Sep  | Introductions, Installations, and IDEs          |                                              |
2    | 8-Sep  | Version Control, Workflow, and Reproducibility  |                                              | X
3    | 15-Sep | Object-Oriented Programming in Python           |                                              | X
4    | 22-Sep | Introduction to Algorithms                      | Assignment 1 Assigned                        |
5    | 29-Sep | From Nested Lists to Data Frames                | Assignment 1 Due                             |
6    | 6-Oct  | Approaches to Data Manipulation in Python       |                                              | X
7    | 13-Oct | Data Visualization and Exploration              | Assignment 2 Assigned                        |
8    | 20-Oct | Drawing from (Un-)Structured Data Sources       | Assignment 2 Due; Assignment 3 Assigned      |
9    | 27-Oct | Introduction to Statistical Learning            | Assignment 3 Due                             |
10   | 3-Nov  | Continuous Outcomes and Linear Regression       | Project Proposals Due                        | X
11   | 10-Nov | Probability, Bayes Theorem, and Classification  |                                              | X
12   | 17-Nov | Algorithmic Approaches to Supervised Learning   |                                              | X
13   | 24-Nov | Interpretable Machine Learning                  | Assignment 4 Assigned                        |
14   | 1-Dec  | Project Presentations                           | Assignment 4 Due                             |






Virtual Classroom


Virtual Zoom Classroom (If we need to meet virtually)


Virtual Zoom Office Hours (Mondays/Wednesdays 9am - 10am)


The recurring Zoom link can also be found on Canvas. If the link breaks or does not function properly, please check the #general channel on Slack for information about the new link. If there is no message about a new link, please contact the professor and/or TA via Slack. All synchronous lecture material will be recorded.






Syllabus



Readings



List of all required reading.






Installations


Throughout the semester, the instructor will use the commandline and several different IDEs when coding in Python or using Git. The following lists that software and provides guidance on installation. If you run into issues, please reach out to the Teaching Assistant for assistance.


Commandline


At times, we’ll use a Unix-based commandline. The commandline will figure into our discussion of using git and of running Python programs. If you use a Mac or a Linux operating system, then a functioning commandline comes with your operating system. For Apple machines, this is the Terminal.

For Windows (specifically Windows 10), you can enable the Linux Bash shell. The following offers a tutorial on how to do this. If you’re using a version of Windows that pre-dates version 10, then Git Bash is a program that will allow you to use git commands from your Windows machine.

Finally, you’ll notice that my terminal will have a slightly different look than the one on your machine. This is because I’m using “Oh My Zsh”, which is open-source software that allows me to customize my commandline. The above link offers everything you’d need to install Oh My Zsh on your machine.




Python3


We’ll use Python3 throughout this course. Below are instructions for downloading Python3 using a commandline package manager (Homebrew for Mac, Chocolatey for Windows).



An alternative way to install Python3 is to download an Anaconda distribution. The instructor will use pip rather than conda in the instructions for downloading Python modules. These are simply two ways of downloading and managing open-source software packages. Choose whichever works best for you.
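
Whichever installation route you choose, it can help to confirm that the interpreter you are running is actually Python3. The following is a minimal sketch, not part of the official installation instructions; the function name python_version_summary is a hypothetical helper introduced here for illustration.

```python
import sys


def python_version_summary():
    """Return the running interpreter's version and whether it is Python 3.

    Hypothetical helper for illustration only.
    """
    major, minor, micro = sys.version_info[:3]
    return {
        "version": f"{major}.{minor}.{micro}",
        "is_python3": major == 3,
    }


summary = python_version_summary()
print("Python version:", summary["version"])
if not summary["is_python3"]:
    print("This interpreter is not Python3; check your installation.")
```

Save this to a file and run it with the same command you plan to use for coursework (for example, python3), so that you are testing the interpreter you will actually use.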




Jupyter Notebook


Once you have Python3 on your computer, you can install Jupyter Notebook. If you downloaded Python3 using Anaconda, then Jupyter Notebook comes with the distribution and requires no further installation on your part. If you installed Python3 using Homebrew/Chocolatey, you can install Jupyter Notebook by running the following code from your commandline.

pip install notebook

You can then activate a Jupyter Notebook from the commandline by typing:

jupyter notebook

If you’ve installed Python using Anaconda, the distribution provides a click-able icon to fire up a Jupyter Notebook. The advantage of using the commandline, however, is that you can set the working directory prior to firing up a notebook. This will allow you to work within a specific project folder more easily.
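
To see which working directory a notebook actually started in, something like the following could be run in the first cell. This is a rough sketch under the assumption that you launched the notebook from your project folder; describe_working_directory is a hypothetical helper name, not part of Jupyter.

```python
from pathlib import Path


def describe_working_directory(max_items=10):
    """Return the current working directory and up to max_items entry names.

    Hypothetical helper for illustration only.
    """
    cwd = Path.cwd()
    names = sorted(entry.name for entry in cwd.iterdir())
    return cwd, names[:max_items]


cwd, names = describe_working_directory()
print("Working directory:", cwd)
for name in names:
    print("  ", name)
```

If the directory printed is not your project folder, close the notebook, change into the project folder on the commandline, and start the notebook again from there.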




Atom + hydrogen


Atom is a hack-able text editor built by GitHub. The following are instructions on how to install Atom on your machine.

Atom allows you to install open-source packages that provide additional functionality. The following packages will help you as you use Atom to program in Python. Of these, Hydrogen is the most important. It’ll allow you to use a Jupyter kernel from within Atom to evaluate code.

Hydrogen@2.16.3
Zen@0.18.0
advanced-open-file@0.16.8
atom-beautify@0.33.4
atom-clock@0.1.18
atom-html-preview@0.2.6
atom-language-r@1.4.8
atom-material-syntax@1.0.8
atom-material-syntax-light@0.4.6
atom-material-ui@2.1.3
atom-path-intellisense@1.2.2
atom-python-virtualenv@1.0.4
atom-todoist@2.0.0
auto-update-packages@1.0.1
autocomplete-R@0.6.0
autocomplete-latex-cite@0.3.5
autocomplete-modules@2.3.0
autocomplete-python@1.17.0
autocomplete-sql@0.5.0
browser-plus@0.0.98
color-picker@2.3.0
data-explorer@0.7.0
docblock-python@0.19.1
file-icons@2.1.47
fix-indent-on-paste@0.1.1
fold-comments@0.6.0
git-log@0.4.1
hey-pane@1.2.0
hydrogen-cell-separator@0.4.1
indent-guide-improved@1.4.13
jupyter-notebook@0.0.10
kite@0.206.0
language-latex@1.2.0
language-weave@0.7.2
latex@0.50.2
latex-tree@0.5.0
latexer@0.3.0
minimap@4.39.14
oceanic-next@1.0.0
pdf-view@0.73.0
platformio-ide-terminal@2.10.1
project-manager@3.3.8
python-indent@1.2.6
quick-query-sqlite@0.4.1
reindent@1.5.0
scroll-through-time@0.3.1
simple-drag-drop-text@0.5.0
symbols-tree-view@0.14.0
todo-show@2.3.2
typewriter@0.8.0
wordcount@3.2.0

To install any one of these packages from the commandline, type:

# apm == "Atom package manager"
apm install <package-name>
# For example
apm install Hydrogen@2.16.3

There is also a dedicated package manager built into Atom which you can use to download and install new packages. Open Atom then Settings > Install and type the package name.

Troubleshooting Hydrogen/Atom Setup

Several students have had issues getting Hydrogen to run properly on their machines. In particular, after following the installation instructions for Atom and Hydrogen, many people find that when they try to run Python code, they either (1) receive an error message stating that "no kernel for language Python found" (or something similar), or (2) are able to connect to a Python kernel, but nothing happens when they try to run code (they may or may not receive an associated error message).

If you encounter this issue, we suggest trying the following solutions in order until one of the solutions works. If you have tried all three possible solutions and are still not able to properly run Python code in Hydrogen/Atom, please contact the teaching assistants (either by Slack, email, or setting up a Calendly appointment).


Solution 1

Open the command line and run the following two commands:

python3 -m pip install ipykernel
python3 -m ipykernel install --user

Then restart Atom and try running Python code.
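
Before restarting Atom, you may want to confirm that the modules are visible to the interpreter you ran the commands with. The sketch below is one possible check and makes no claim about how Hydrogen itself locates kernels; missing_modules is a hypothetical helper, and the module names checked are simply the ones installed earlier on this page.

```python
import importlib.util
import sys


def missing_modules(module_names):
    """Return the names in module_names that this interpreter cannot import.

    Hypothetical helper for illustration only; use top-level module names.
    """
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


missing = missing_modules(["ipykernel", "notebook"])
print("Interpreter:", sys.executable)
if missing:
    print("Not installed for this interpreter:", ", ".join(missing))
else:
    print("ipykernel and notebook are both importable.")
```

Run it with python3, the same command used in Solution 1. If a module is reported as not installed, the pip command probably targeted a different Python installation than the one you are running.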


Solution 2

Uninstall Hydrogen from Atom: open Atom, click "Install a Package", search for Hydrogen in the search bar, and click "Uninstall". Once Hydrogen has finished uninstalling, search for Hydrogen again and hit "Install". Once Hydrogen has finished re-installing, restart Atom and try running Python code.


Solution 3

Add the following paths to your list of environment variables using the command line. Note that the exact file paths will need to be adjusted slightly depending on your machine and operating system.

C:\Anaconda3
C:\Anaconda3\Scripts
C:\Anaconda3\Library\bin

Once these have been added to the list of environment variables, restart Atom and try running Python code.
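
To check whether the directories are already on your PATH, a short script along the following lines may help. It is only a sketch: missing_from_path is a hypothetical helper, and the three directories are the example paths listed above, which you would replace with the locations on your own machine.

```python
import os


def missing_from_path(required_dirs, path_value=None):
    """Return the directories in required_dirs that are absent from PATH.

    Hypothetical helper for illustration only. path_value defaults to the
    PATH environment variable of the current process.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")

    def normalize(directory):
        # Normalize separators, trailing slashes, and (on Windows) case.
        return os.path.normcase(os.path.normpath(directory))

    entries = {normalize(e) for e in path_value.split(os.pathsep) if e}
    return [d for d in required_dirs if normalize(d) not in entries]


# Example paths from the list above; adjust them for your machine.
required = [
    r"C:\Anaconda3",
    r"C:\Anaconda3\Scripts",
    r"C:\Anaconda3\Library\bin",
]
for directory in missing_from_path(required):
    print("Not on PATH:", directory)
```

Run the script from a freshly opened commandline window, since an already open window may not reflect a change you have just made to the environment variables.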




RStudio + reticulate


In your classes that focus on R, RStudio will be your main IDE. However, RStudio isn’t just for R. It can handle a number of different languages. We can use Python in RStudio using the reticulate package. We’ll talk about some of the advantages of doing this in class, but for now, let’s cover installation.

To install RStudio, download from the following link (make sure to scroll all the way to the bottom).

reticulate is an R package that allows one to run a Python REPL in the R console. In addition, it allows one to read in and use Python code, and pass data between R and Python. The following provides instructions on installing reticulate.

Note: If you have multiple versions of Python on your computer, reticulate can get confused with regard to which version it is referencing. The following article covers these issues. The best way to resolve this issue is by creating a .Rprofile file that sends instructions regarding the specific version of Python you wish to use.

“By setting the value of the RETICULATE_PYTHON environment variable to a Python binary. Note that if you set this environment variable, then the specified version of Python will always be used (i.e. this is prescriptive rather than advisory). To set the value of RETICULATE_PYTHON, insert Sys.setenv(RETICULATE_PYTHON = PATH) into your project’s .Rprofile, where PATH is your preferred Python binary.”
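
To find the path of the Python binary you want reticulate to use, you can ask that Python for its own location. The sketch below prints a line in the form quoted above; rprofile_line is a hypothetical helper, and the output assumes the interpreter you run the script with is the one you want RStudio to use.

```python
import sys
from pathlib import Path


def rprofile_line(python_path=None):
    """Build a Sys.setenv(...) line for .Rprofile from a Python binary path.

    Hypothetical helper for illustration only. Defaults to the interpreter
    that is running this script.
    """
    if python_path is None:
        python_path = sys.executable
    # On Windows, as_posix() yields forward slashes, which avoids having to
    # escape backslashes inside the R string.
    posix_path = Path(python_path).as_posix()
    return f'Sys.setenv(RETICULATE_PYTHON = "{posix_path}")'


print(rprofile_line())
```

Copy the printed line into your project’s .Rprofile and restart the R session so that it takes effect.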




Other Software


Here is an overview of other text editors that are popular for programming in Python but that you won’t see featured in this course. Note that I’m agnostic about what you use to learn Python, and some find that different setups work better for them. If one of these setups works better for you, I encourage you to use it (and tell me how it went)!





Project


Data science is an applied field, so it is important that you understand how to conduct a complete analysis, from collecting data, to cleaning and analyzing it, to presenting your findings. Toward this end, students are required to complete an independent data science project, applying concepts learned throughout the course. The project is composed of three parts: a 2-page project proposal, an in-class presentation, and a 12-page project report.

More information regarding the final project will be circulated during class in Week 8.


Final Project Overview


Presentation Rubric


Final Report Rubric







Contact


Eric Dunford (Professor)

  • Office: 404 Old North
  • Office Hours: Mondays & Wednesdays 9am to 10am (Office hours will be held virtually via Zoom. See Virtual Classroom tab for link)
  • Email: eric.dunford@georgetown.edu


Maddie Pickens (Teaching Assistant)


Chandler Dawson (Teaching Assistant)


Slack

The best way to reach the TA/Professor is via the class Slack channel (PPOL-564-Fall-2021). Please click on the Class Slack Channel Invite to join the class work-space.






Lecture Materials

Asynchronous lecture materials will go live approximately one week prior to the scheduled synchronous meeting date.



Week 1: Choosing your Poison
Introductions, Installations, and IDEs



Week 2: Time Travel and Other Necessities
Version Control, Workflow, and Reproducibility



Week 3: Learning Parseltongue
Object-Oriented Programming in Python



Week 4: On Time and Space
Introduction to Algorithms



Week 5: Long Live the Data Frame
From Nested Lists to Data Frames



Week 6: Modern Snake Charming
Approaches to Data Manipulation in Python



Week 7: Interrogation Techniques
Data Visualization and Exploration



Week 9: The Signal and the Noise
Introduction to Statistical Learning



Week 10: Casting Shadows in \(N\)-Dimensions
Continuous Outcomes and Linear Regression



Week 11: Hot Dog, Not Hot Dog
Probability, Bayes Theorem, and Classification



Week 12: Trees and Neighbors
Algorithmic Approaches to Supervised Learning



Week 13: Peeking inside the Black Box
Interpretable Machine Learning