In the Asynchronous Lecture
In the Synchronous Lecture
If you have any questions while watching the pre-recorded material, be sure to write them down and to bring them up during the synchronous portion of the lecture.
Breakout: Practice Calculating Bayesian Problems
Building a Naive Bayes Classifier (.ipynb)
The following tabs contain pre-recorded lecture materials for class this week. Please review these materials prior to the synchronous lecture.
Total time: Approx. 1 hour and 13 minutes
The following notebook delves into the basic elements of class construction and provides a simple example of writing our own class.
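If it has been a while since you have written a Python class, the minimal sketch below (a toy example, not taken from the notebook) previews the pieces we will rely on: a constructor (__init__), instance attributes stored on self, and a method that uses them.

# Toy example of a Python class (illustrative only)
class Circle:
    '''Stores a radius and computes the circle's area.'''

    def __init__(self, radius=1.0):
        # Attribute set when the object is instantiated
        self.radius = radius

    def area(self):
        # Method that uses the stored attribute
        return 3.14159 * self.radius ** 2

# Usage
c = Circle(radius=2.0)
c.area()  # 12.56636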
As was the case last week, I’ll again draw on videos produced by Grant Sanderson. These videos will help give you a visual intuition for key ideas in probability.
In addition, I’ve written up a notebook on probability that provides a more traditional (but less intuitive) presentation of these concepts.
Please review both. If you’re already comfortable with probability theory, that’s great; feel free to skip these materials.
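If you would like to connect the videos to code, the short simulation below (my own illustrative example, not part of the assigned materials) shows how an empirical frequency converges toward a probability you can compute analytically.

import numpy as np

# Probability of rolling at least one six in four rolls of a fair die
rng = np.random.default_rng(123)
rolls = rng.integers(1, 7, size=(100_000, 4))  # 100,000 simulated sets of four rolls
empirical = (rolls == 6).any(axis=1).mean()    # share of sets containing a six

analytic = 1 - (5/6)**4                        # exact answer, roughly 0.5177
print(round(empirical, 3), round(analytic, 3))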
As was the case last week, I’ll again draw on videos produced by Grant Sanderson. These videos will help give you a visual intuition for the key ideas. We’ll build on these ideas when constructing a classifier during the synchronous lecture. If you’re already comfortable with Bayes’ theorem, that’s great; feel free to skip these materials.
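As a quick refresher before the breakout, here is a small worked example of Bayes’ theorem in Python. The testing-scenario numbers are made up purely for illustration.

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Made-up numbers for a test of a rare condition
p_disease = 0.01            # P(disease): prior
p_pos_given_disease = 0.95  # P(positive | disease)
p_pos_given_healthy = 0.05  # P(positive | no disease)

# Total probability of a positive test (law of total probability)
p_pos = (p_pos_given_disease * p_disease +
         p_pos_given_healthy * (1 - p_disease))

# Posterior: P(disease | positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # roughly 0.161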
These exercises are designed to help you reinforce your grasp of the concepts covered in the asynchronous lecture material.
The following questions ask you to construct a class that streamlines some of the data processing tasks learned in Week 9. Use the gapminder data to test your class methods.
from gapminder import gapminder
dat = gapminder
Write a class called MyML that takes in a Pandas DataFrame as input upon instantiation. The constructor should also take an argument called outcome that provides the column name of the outcome variable. Instantiating the object should break the provided data up into an outcome object (y) and a features matrix (X). These should be stored as separate objects in the class instance (self).
class MyML:
    '''
    Class to streamline the processing of data for
    a machine learning task.
    '''

    def __init__(self, data=None, outcome=""):
        # Feature matrix: every column except the outcome
        self.X = data.drop(columns=outcome)
        # Outcome variable, kept as a DataFrame
        self.y = data[[outcome]]

# Run
obj = MyML(data=dat, outcome="lifeExp")
obj.X.head()
Add a method to your MyML class called .split() that splits your data into training and test datasets and stores the splits in the object instance. .split() should take the arguments prop=.25, which dictates the proportion of data that should be held out as test data, and seed=123, which sets the seed for the random state.
from sklearn.model_selection import train_test_split

class MyML:
    '''
    Class to streamline the processing of data for
    a machine learning task.
    '''

    def __init__(self, data=None, outcome=""):
        self.X = data.drop(columns=outcome)
        self.y = data[[outcome]]

    def split(self, prop=.25, seed=123):
        '''
        Method splits the data into training and test datasets.
        '''
        # Split the data
        splits = train_test_split(self.X, self.y,
                                  test_size=prop,
                                  random_state=seed)

        # Store the split data as instance attributes
        self.train_x = splits[0]
        self.test_x = splits[1]
        self.train_y = splits[2]
        self.test_y = splits[3]

# Run
obj = MyML(data=dat, outcome="lifeExp")
obj.split()
obj.train_x.head()
Add a method to your MyML class called .describe_training() which generates summary statistics for the training data. Summary statistics should be rounded to the second decimal place.
from sklearn.model_selection import train_test_split

class MyML:
    '''
    Class to streamline the processing of data for
    a machine learning task.
    '''

    def __init__(self, data=None, outcome=""):
        self.X = data.drop(columns=outcome)
        self.y = data[[outcome]]

    def split(self, prop=.25, seed=123):
        '''
        Method splits the data into training and test datasets.
        '''
        # Split the data
        splits = train_test_split(self.X, self.y,
                                  test_size=prop,
                                  random_state=seed)

        # Store the split data as instance attributes
        self.train_x = splits[0]
        self.test_x = splits[1]
        self.train_y = splits[2]
        self.test_y = splits[3]

    def describe_training(self):
        '''
        Method summarizes the training data.
        '''
        return self.train_x.describe().round(2)

# Run
obj = MyML(data=dat, outcome="lifeExp")
obj.split()
obj.describe_training()
The following materials were generated for students enrolled in PPOL564. Please do not distribute without permission.
ed769@georgetown.edu | www.ericdunford.com