Learning Objectives


In the Asynchronous Lecture


In the Synchronous Lecture


If you have any questions while watching the pre-recorded material, be sure to write them down and to bring them up during the synchronous portion of the lecture.




Synchronous Materials





Asynchronous Materials


The following tabs contain pre-recorded lecture materials for class this week. Please review these materials prior to the synchronous lecture.

Total time: Approx. 1 hour and 13 minutes






Writing Classes


Code from the video

The following notebook delves into the basic elements of class construction and provides a simple example of writing our own class.


Writing Classes
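Before diving into the notebook, here is a minimal sketch of the pattern it covers: a class with a constructor (`__init__`), an instance attribute, and a method. (The `Dog` class below is purely illustrative, not taken from the video.)

```python
class Dog:
    '''A minimal class with one attribute and one method.'''

    def __init__(self, name):
        self.name = name  # instance attribute set at instantiation

    def speak(self):
        '''Return a greeting that uses the instance attribute.'''
        return f"{self.name} says woof!"

# Instantiate the class and call its method
d = Dog("Rex")
print(d.speak())  # Rex says woof!
```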



Classification

On Probability


As was the case last week, I'll again draw from videos produced by Grant Sanderson. These videos will help give you a visual intuition for key ideas in probability.

In addition, I’ve written up a notebook on probability that provides a more traditional (but less intuitive) presentation of these concepts.


Probability Overview (Using Python)


Please review both. If you’re already comfortable with probability theory, that’s great; feel free to skip these materials.







Binomial Distribution




Probability Densities
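If you'd like a numeric companion to these videos, the two quantities they visualize can be computed directly with Python's standard library. (The specific numbers below are illustrative; this is a quick sketch, not part of the lecture materials.)

```python
import math

def binom_pmf(k, n, p):
    '''P(exactly k successes in n independent Bernoulli(p) trials).'''
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x, mu=0.0, sigma=1.0):
    '''Density of a Normal(mu, sigma) at x (a density, not a probability).'''
    return math.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * math.sqrt(2 * math.pi))

# P(exactly 3 heads in 10 fair coin flips)
print(round(binom_pmf(3, 10, 0.5), 4))  # 0.1172

# Height of the standard normal density at its peak
print(round(normal_pdf(0), 4))          # 0.3989
```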




Bayes' Theorem


As was the case last week, I'll again draw from videos produced by Grant Sanderson. These videos will help give you a visual intuition for key ideas, which we'll build on when constructing a classifier during the synchronous lecture. If you're already comfortable with Bayes' theorem, that's great; feel free to skip these materials.






Bayes' Theorem




Bayes' Theorem Proof
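As a tiny numeric sketch of the rule the videos cover, here is Bayes' theorem, P(A|B) = P(B|A)P(A) / P(B), applied to a disease-testing example. (All rates below are invented for illustration.)

```python
# Hypothetical rates for a diagnostic-test example
p_disease = 0.01             # prior: P(disease)
p_pos_given_disease = 0.95   # likelihood: P(positive | disease)
p_pos_given_healthy = 0.05   # false-positive rate: P(positive | healthy)

# Total probability of a positive test: P(positive)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior via Bayes' theorem: P(disease | positive)
posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 3))  # 0.161
```

Note how a fairly accurate test still yields a modest posterior when the prior is small, which is exactly the intuition the videos build.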




Practice


These exercises are designed to help you reinforce your grasp of the concepts covered in the asynchronous lecture material.


The following questions ask you to construct a class that streamlines some of the data processing tasks learned in Week 9. Use the gapminder data to test your class method.

from gapminder import gapminder
dat = gapminder



Question 1


Write a class called MyML that takes a Pandas DataFrame as input upon instantiation. The constructor should also take an argument called outcome that provides the column name of the outcome variable. Instantiating the object should break the provided data up into an outcome object (y) and a features matrix (X). These should be stored as separate attributes on the class instance (self).



Answer

class MyML:
    '''
    Class to streamline the processing of data for
    a machine learning task.
    '''

    def __init__(self, data=None, outcome=""):
        # Separate the features from the outcome column
        self.X = data.drop(columns=outcome)
        self.y = data[[outcome]]

# Run
obj = MyML(data=dat, outcome="lifeExp")
obj.X.head()

Question 2


Add a method to your MyML class called .split() that splits your data into training and test datasets and stores the splits on the object instance. .split() should take two arguments: prop=.25, which dictates the proportion of data that should be held out as test data, and seed=123, which sets the seed on the random state.



Answer

from sklearn.model_selection import train_test_split

class MyML:
    '''
    Class to streamline the processing of data for
    a machine learning task.
    '''

    def __init__(self, data=None, outcome=""):
        self.X = data.drop(columns=outcome)
        self.y = data[[outcome]]

    def split(self, prop=.25, seed=123):
        '''
        Method splits the data into training and test datasets.
        '''

        # Split the data
        splits = train_test_split(self.X, self.y,
                                  test_size=prop,
                                  random_state=seed)

        # Store each split as an attribute on the instance
        self.train_x = splits[0]
        self.test_x = splits[1]
        self.train_y = splits[2]
        self.test_y = splits[3]

# Run
obj = MyML(data=dat, outcome="lifeExp")
obj.split()
obj.train_x.head()

Question 3


Add a method to your MyML class called .describe_training() which generates summary statistics for the training data. Summary statistics should be rounded to two decimal places.



Answer

from sklearn.model_selection import train_test_split

class MyML:
    '''
    Class to streamline the processing of data for
    a machine learning task.
    '''

    def __init__(self, data=None, outcome=""):
        self.X = data.drop(columns=outcome)
        self.y = data[[outcome]]

    def split(self, prop=.25, seed=123):
        '''
        Method splits the data into training and test datasets.
        '''

        # Split the data
        splits = train_test_split(self.X, self.y,
                                  test_size=prop,
                                  random_state=seed)

        # Store each split as an attribute on the instance
        self.train_x = splits[0]
        self.test_x = splits[1]
        self.train_y = splits[2]
        self.test_y = splits[3]

    def describe_training(self):
        '''
        Method summarizes the training data.
        '''
        return self.train_x.describe().round(2)

# Run
obj = MyML(data=dat, outcome="lifeExp")
obj.split()
obj.describe_training()
 

The following materials were generated for students enrolled in PPOL564. Please do not distribute without permission.

ed769@georgetown.edu | www.ericdunford.com