In the Asynchronous Lecture
In the Synchronous Lecture
If you have any questions while watching the pre-recorded material, be sure to write them down and to bring them up during the synchronous portion of the lecture.
Breakout: Practice Calculating Bayesian Problems
Building a Naive Bayes Classifier (.ipynb)
The following tabs contain pre-recorded lecture materials for class this week. Please review these materials prior to the synchronous lecture.
Total time: Approx. 1 hour and 13 minutes
The following notebook delves into the basic elements of class construction and provides a simple example of writing our own class.
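If it has been a while since you have written a Python class, the minimal sketch below (a toy example, not taken from the notebook) previews the pieces we will rely on: a constructor (__init__), instance attributes stored on self, and a method that uses them.

# Toy example of a Python class (illustrative only)
class Circle:
    '''Stores a radius and computes the circle's area.'''

    def __init__(self, radius=1.0):
        # Attribute set when the object is instantiated
        self.radius = radius

    def area(self):
        # Method that uses the stored attribute
        return 3.14159 * self.radius ** 2

# Usage
c = Circle(radius=2.0)
c.area()  # 12.56636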
As was the case last week, I’ll again draw on videos produced by Grant Sanderson. These videos will help give you a visual intuition for key ideas in probability.
In addition, I’ve written up a notebook on probability that provides a more traditional (but less intuitive) presentation of these concepts.
Please review both. If you’re already comfortable with probability theory, that’s great; feel free to skip these materials.
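If you would like to connect the videos to code, the short simulation below (my own illustrative example, not part of the assigned materials) shows how an empirical frequency converges toward a probability you can compute analytically.

import numpy as np

# Probability of rolling at least one six in four rolls of a fair die
rng = np.random.default_rng(123)
rolls = rng.integers(1, 7, size=(100_000, 4))  # 100,000 simulated sets of four rolls
empirical = (rolls == 6).any(axis=1).mean()    # share of sets containing a six

analytic = 1 - (5/6)**4                        # exact answer, roughly 0.5177
print(round(empirical, 3), round(analytic, 3))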
As was the case last week, I’ll again draw on videos produced by Grant Sanderson. These videos will help give you a visual intuition for the key ideas. We’ll build on these ideas when constructing a classifier during the synchronous lecture. If you’re already comfortable with Bayes’ theorem, that’s great; feel free to skip these materials.
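As a quick refresher before the breakout, here is a small worked example of Bayes’ theorem in Python. The testing-scenario numbers are made up purely for illustration.

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Made-up numbers for a test of a rare condition
p_disease = 0.01            # P(disease): prior
p_pos_given_disease = 0.95  # P(positive | disease)
p_pos_given_healthy = 0.05  # P(positive | no disease)

# Total probability of a positive test (law of total probability)
p_pos = (p_pos_given_disease * p_disease +
         p_pos_given_healthy * (1 - p_disease))

# Posterior: P(disease | positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # roughly 0.161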
These exercises are designed to help you reinforce your grasp of the concepts covered in the asynchronous lecture material.
The following questions ask you to construct a class that streamlines some of the data processing tasks learned in Week 9. Use the gapminder data to test your class methods.
from gapminder import gapminder
dat = gapminder
Write a class called MyML that takes in a Pandas DataFrame as input upon instantiation. The constructor should also take an argument called outcome that provides the column name of the outcome variable. Instantiating the object should break the provided data up into an outcome object (y) and a features matrix (X). These should be stored as separate objects in the class instance (self).
class MyML:
    '''
    Class to streamline the processing of data for
    a machine learning task.
    '''

    def __init__(self, data=None, outcome=""):
        # Feature matrix: every column except the outcome
        self.X = data.drop(columns=outcome)
        # Outcome variable, kept as a DataFrame
        self.y = data[[outcome]]

# Run
obj = MyML(data=dat, outcome="lifeExp")
obj.X.head()
Add a method to your MyML class called .split() that splits your data into training and test datasets and stores the splits in the object instance. .split() should take the arguments prop=.25, which dictates the proportion of data that should be held out as test data, and seed=123, which sets the seed for the random state.
from sklearn.model_selection import train_test_split

class MyML:
    '''
    Class to streamline the processing of data for
    a machine learning task.
    '''

    def __init__(self, data=None, outcome=""):
        self.X = data.drop(columns=outcome)
        self.y = data[[outcome]]

    def split(self, prop=.25, seed=123):
        '''
        Method splits the data into training and test datasets.
        '''
        # Split the data
        splits = train_test_split(self.X, self.y,
                                  test_size=prop,
                                  random_state=seed)

        # Store the split data as instance attributes
        self.train_x = splits[0]
        self.test_x = splits[1]
        self.train_y = splits[2]
        self.test_y = splits[3]

# Run
obj = MyML(data=dat, outcome="lifeExp")
obj.split()
obj.train_x.head()
Add a method to your MyML class called .describe_training() which generates summary statistics for the training data. Summary statistics should be rounded to the second decimal place.
from sklearn.model_selection import train_test_split

class MyML:
    '''
    Class to streamline the processing of data for
    a machine learning task.
    '''

    def __init__(self, data=None, outcome=""):
        self.X = data.drop(columns=outcome)
        self.y = data[[outcome]]

    def split(self, prop=.25, seed=123):
        '''
        Method splits the data into training and test datasets.
        '''
        # Split the data
        splits = train_test_split(self.X, self.y,
                                  test_size=prop,
                                  random_state=seed)

        # Store the split data as instance attributes
        self.train_x = splits[0]
        self.test_x = splits[1]
        self.train_y = splits[2]
        self.test_y = splits[3]

    def describe_training(self):
        '''
        Method summarizes the training data.
        '''
        return self.train_x.describe().round(2)

# Run
obj = MyML(data=dat, outcome="lifeExp")
obj.split()
obj.describe_training()
The following materials were generated for students enrolled in PPOL564. Please do not distribute without permission.
ed769@georgetown.edu | www.ericdunford.com