In the Asynchronous Lecture
Importing and exporting data with pandas, along with data type conversions.
Manipulating a pandas DataFrame.
In the Synchronous Lecture
If you have any questions while watching the pre-recorded material, be sure to write them down and bring them up during the synchronous portion of the lecture.
Lecture Slides on tidy data, joining methods, and reshaping relational data structures.
Quick discussion on method chaining
Practice: manipulating conflict event data.
The following tabs contain pre-recorded lecture materials for class this week. Please review these materials prior to the synchronous lecture.
Total time: Approx. 1 hour and 26 minutes
pandas | dfply \(^*\) | dplyr \(^\dagger\) | Description
---|---|---|---
.filter() | select() | select() | select column variables/index
.drop() | drop() / select() | select() | drop selected column variables/index
.rename() | rename() | rename() | rename column variables/index
.query() | mask() | filter() | row-wise subset of a data frame by the values of a column variable/index
.assign() | mutate() | mutate() | create a new variable on the existing data frame
.sort_values() | arrange() | arrange() | arrange all data values along a specified (set of) column variable(s)/indices
.groupby() | group_by() | group_by() | index data frame by a specific (set of) column variable(s)/index value(s)
.agg() | summarize() | summarize() | aggregate data by specific function rules
.pivot_table() | spread() | pivot_wider() | cast the data from a “long” to a “wide” format
pd.melt() | gather() | pivot_longer() | cast the data from a “wide” to a “long” format
() | >> | %>% | piping, fluid programming, or passing one function’s output to the next
\(*\) dfply offers an alternative framework for data manipulation in Python, one that mirrors the popular tidyverse functionality in R. dfply’s functionality will be outlined in tandem with the implementations in pandas.
\(\dagger\) The dplyr & tidyr R implementations are not demonstrated in this class; however, a full overview can be found here. The functions are presented in the table to serve as a key for maintaining the same framework when switching between languages.
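As a preview of the piping row in the table, the same pipeline reads naturally as a pandas method chain when wrapped in parentheses. A minimal sketch on a hand-built toy frame (the values are illustrative; the column names simply mirror the gapminder variables):

```python
import pandas as pd

# Toy data mirroring the gapminder variables (illustrative values)
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year":    [2000, 2005, 2000, 2005],
    "lifeExp": [60.0, 62.0, 70.0, 71.0],
})

# pandas: chain methods by wrapping the pipeline in parentheses,
# so each step reads top-to-bottom like a dfply >> or dplyr %>% chain
out = (toy
       .query("year == 2005")                      # keep only 2005 rows
       .filter(["country", "lifeExp"])             # keep two columns
       .sort_values("lifeExp", ascending=False))   # order descending

print(out)
```

The parentheses are purely syntactic: they let each method call sit on its own line without backslash continuations.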
Download the gapminder.csv dataset used in the asynchronous videos.
import pandas as pd

# On all the conversions you can do: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

# %% -----------------------------------------
################# IMPORTING DATA #################
dat = pd.read_csv("gapminder.csv")
dat

# Customization
pd.read_csv("gapminder.csv",
            sep = ",",                                # Separator in the data
            index_col = "country",                    # Set a variable to the index
            usecols = ["country","continent","year"], # Only request specific columns
            nrows = 3,                                # Only read in n-rows of the data
            na_values = "nan",                        # Values to treat as missing
            parse_dates = True,                       # Parse all date features as datetime
            low_memory = True)                        # Read the file in chunks for lower memory use (useful on large data)

# %% -----------------------------------------
# EXPORTING DATA
dat2 = dat.head()

# Export as csv without index
dat2.to_csv("example.csv", index=False)

# Export as .dta (Stata data type)
dat2.to_stata("example.dta")

# Export as excel
dat2.to_excel("example.xlsx")

# %% -----------------------------------------
# DATA TYPE CONVERSIONS

# data ==> string
print(dat2.to_string())

# data ==> dict
dat2.to_dict()

# data ==> numpy array
dat2.values

# data ==> list
dat2.values.tolist()
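The to_dict() conversion above supports several orientations beyond the default {column -> {row index -> value}} layout. A quick sketch of the ones most often useful, on a small illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({"country": ["A", "B"], "pop": [10, 20]})

# Default: {column -> {row index -> value}}
d_cols = df.to_dict()

# "records": a list with one dict per row (handy for JSON-style output)
d_recs = df.to_dict(orient="records")
# [{'country': 'A', 'pop': 10}, {'country': 'B', 'pop': 20}]

# "list": {column -> list of values}
d_list = df.to_dict(orient="list")
# {'country': ['A', 'B'], 'pop': [10, 20]}
```

Pick the orientation that matches how the downstream code will iterate over the data.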
import pandas as pd
from dfply import * # Alternative "tidyverse" data manipulation library

# Read in data
dat = pd.read_csv("gapminder.csv")

# Small sample
dat2 = dat.sample(5, random_state=123)
dat2

# %% ------------------------------------------------
##################### SELECTING #####################

# Using the index
dat2.iloc[:3,[0,1]]
dat2.loc[:,['country','year']]

# Filter method to "filter out" columns
dat2.filter(["country","year"])

# dfply approach
dat2 >> select(X.country,X.year)

# Selecting ranges of variables
dat2.loc[:,"continent":"pop"]

# dfply approach
dat >> select(columns_between(X.year,X.gdpPercap))

# Selecting using regular expressions
dat2.filter(regex="p")

# dfply approach
dat2 >> select(contains("p"))

# %% ------------------------------------------------
##################### DROPPING VARIABLES #####################
dat2.drop(columns=["year","lifeExp"])

# dfply approach
dat2 >> drop(X.year,X.lifeExp)
dat2 >> select(~X.year,~X.lifeExp) # alt approach

# %% ------------------------------------------------
##################### REORDERING #####################

# Re-arrange by requesting columns in a specific order
dat2.filter(["year","continent","country"])

# DANGER! Only ask for a variable once.
dat2.filter(["year","year","continent","country"])

# dfply approach
dat2 >> select(X.year,X.continent,X.country)

# list(dat2) === colnames(dat2) # in R

# Re-arranging without dropping
col_names = list(dat2)
order = ["year","country"]
for i in col_names:
    if i not in order:
        order.append(i)
order
dat2.filter(order)

# dfply approach
dat2 >> select(X.year,X.country,everything())

# %% ------------------------------------------------
##################### MUTATING #####################
# Generating new variables.

# (1) Pandas: create a new variable by specifying and assigning a new column index location
dat2.loc[:,"lifeExp_new"] = dat2.lifeExp*100
dat2

# (2) Pandas: create a new variable by using the assign() method
dat2 = dat2.assign(lifeExp_new2 = dat2.lifeExp/10)
dat2

# (3) Pandas: create a new variable using eval()
# Note that eval() supports an array of computations but
# not all (e.g. self-defined/third-party functions)
dat2 = dat2.eval("lifeExp_new3 = sqrt(lifeExp)")
dat2

# dfply: create a new variable by using the mutate() method.
dat2 = dat2 >> mutate(lifeExp_new4 = X.lifeExp - X.lifeExp.mean())
dat2

# %% ------------------------------------------------
##################### RENAMING #####################
dat2.rename(columns={"country":"country_name","lifeExp":"LE"})

# dfply approach
dat2 >> rename(country_name = X.country, LE = X.lifeExp)
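One wrinkle worth knowing about assign(): passing a callable (rather than a precomputed Series) lets the new column refer to the frame as it exists at that point in a chain, including columns created earlier in the same assign() call. A minimal sketch on illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"lifeExp": [50.0, 60.0, 70.0]})

# Each lambda receives the intermediate frame, so half_le can be
# built from lifeExp_new, which was created in the same assign() call
out = df.assign(
    lifeExp_new = lambda d: d.lifeExp * 100,
    half_le     = lambda d: d.lifeExp_new / 2,
)

print(out)
```

This makes assign() safe to use mid-chain, where the frame has no name you could reference directly.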
import pandas as pd
from dfply import * # Alternative "tidyverse" data manipulation library

# Read in data
dat = pd.read_csv("gapminder.csv")
dat

# %% ------------------------------------------
################## SUBSETTING ##################
# A.k.a. filtering

# Recall we can convert a numpy vector to a boolean vector
# using a conditional
dat.lifeExp < 25

# Just as we subsetted a numpy array, we can subset a DataFrame
dat.loc[dat.lifeExp < 25]

# Pandas: filter by a specific variable value using the query() method
dat.query("lifeExp < 25")
# query() is similar to eval() in how it operates

# dfply: filter by a specific variable value using the mask() method
dat >> mask(X.lifeExp < 25)

# Subset by distinct entry
dat.drop_duplicates("continent") # first values for each row are returned
dat.filter(["continent"]).drop_duplicates("continent")

# dfply: drop duplicative entries for a specific variable
dat >> distinct(X.continent)

# Subset by slicing
# Pandas: slice the row entries using the row index
dat.iloc[200:203,:]

# dfply: slice the row entries using the row index
dat >> row_slice([200,201,202])

# Subset by sampling
# Pandas: randomly sample N number of rows from the data
dat.sample(5)

# dfply: randomly sample N number of rows from the data
dat >> sample(3)

# %% ------------------------------------------
################## Arranging ##################

# Pandas: sort values by a column variable (ascending)
dat.sort_values('country').head(3)

# Pandas: sort values by a column variable (descending)
dat.sort_values('country', ascending=False).head(3)

# Pandas: sort values by more than one column variable
dat.sort_values(['country','year'], ascending=False).head(3)

# dfply: sort values by a column variable (ascending)
dat >> arrange(X.country) >> head(3)

# dfply: sort values by a column variable (descending)
dat >> arrange(desc(X.country)) >> head(3)

# dfply: sort values by more than one column variable
dat >> arrange(desc(X.country),X.year) >> head(3)

# %% ------------------------------------------
################## Summarizing ################

# Built-in aggregation functions
dat.lifeExp.mean()
dat.lifeExp.median()
dat.lifeExp.mode()
dat.lifeExp.sum()
dat.lifeExp.size

# Using the .agg() method to summarize across variables
dat[['lifeExp','pop','gdpPercap']].agg("mean")

# Define your own function: need to ensure you go from "many" to "one"
def my_func(x):
    if x.mean() > 100:
        return 1
    else:
        return 0

# Implement
dat[['lifeExp','pop','gdpPercap']].agg(my_func)

# dfply approach
dat >> summarize(lifeExp_mean = X.lifeExp.mean(),
                 lifeExp_std = X.lifeExp.std())
import pandas as pd
import numpy as np
from dfply import * # Alternative "tidyverse" data manipulation library

# Read in data
dat = pd.read_csv("gapminder.csv")
dat.continent.drop_duplicates()

# %% -----------------------------------------
################## GROUPING ##################

# Pandas: group by a column's entries.
# Generates an iterable where each group is broken up into a tuple (group, data).
# We can iterate across the tuple positions.
g = dat.groupby(["continent"])
g

for i in g:
    print(i)

# Under the hood, how we can think about it
for i in g:
    print("This is the Key:", i[0])
    print("This is the data:\n", i[1].head(2), end="\n\n\n")

# In practice.
dat.groupby(["continent"]).head(2)

# dfply: group by a column's entries.
dat >> group_by(X.continent) >> head(2)

# With dfply, the group_by() will persist,
# so we need to ungroup() if we wish to turn off the key.
d = dat >> group_by(X.continent)
d >> head(2)
d >> ungroup() >> head(2)

# %% -----------------------------------------
################## Groupby + Summarize ##################

# Group by a single variable, and average across the numeric data values
dat.groupby(["continent"]).mean()

# Group, Select, and Summarize
dat.groupby(['continent'])[['lifeExp','gdpPercap']].mean()

# Use the aggregate function to generate many variables
dat.groupby(['continent'])[['lifeExp','gdpPercap']].agg(["mean","std","median"])

# Have some control over which aggregation functions are applied where
dat.groupby(['continent']).agg({"lifeExp":"mean","gdpPercap":"median"})

# Group by more than one variable --- resulting in a multi-index
dat.groupby(['continent','country'])[['lifeExp','gdpPercap']].mean()

# dfply way of doing this
dat >> group_by(X.country) >> summarize(lifeExp_mean = X.lifeExp.mean(),
                                        lifeExp_std = X.lifeExp.std())

# For summarization across a range of variables
dat >> group_by(X.continent) >> summarize_each([np.mean, np.std], X.lifeExp, X.gdpPercap)

# %% -----------------------------------------
################## Groupby + TRANSFORM ##################

# Other times we want to implement data manipulations by some grouping variable but
# retain the structure of the original data. Put differently, our aim is not to aggregate
# but to perform some operation across specific groups. For example, we might want to
# group-mean center our variables as a way of removing between-group variation.

# Pandas: groupby() + transform()
def center(x):
    '''Center a variable around its mean'''
    return x - x.mean()

dat.groupby('country')[["lifeExp","pop"]].transform(center).head(10)

# Likewise, apply() offers identical functionality. The only requirement of
# apply() is that the output must be a pandas.DataFrame, a pandas.Series, or a scalar.

# Pandas: groupby() + apply()
dat.groupby('country')[["lifeExp","pop"]].apply(center).head(10)

# dfply: group_by() + mutate()
d = dat >> group_by(X.country) >> mutate(lifeExp_centered = center(X.lifeExp))
d.head(10)

d = dat >> group_by(X.country) >> mutate(lifeExp_centered = center(X.lifeExp),
                                         le_ave = X.lifeExp.mean())
d.head(10)

# Return only the manipulated value
d = dat >> group_by(X.country) >> transmute(lifeExp_centered = center(X.lifeExp))
d.head(10)
These exercises are designed to help you reinforce your grasp of the concepts covered in the asynchronous lecture material.
Using the gapminder.csv dataset from the asynchronous lecture, please answer the following questions.
How many countries are in the data?
import pandas as pd
from dfply import *

dat = pd.read_csv("gapminder.csv")

# IN PANDAS
dat.country.drop_duplicates().size

# IN DFPLY
dat >> distinct(X.country) >> summarize(N = n(X.country))
What’s the average gdpPercap in the Americas prior to 1980?
# IN PANDAS
dat.query('continent == "Americas" and year < 1980').gdpPercap.mean()

# IN DFPLY
dat >> mask(X.continent == "Americas", X.year < 1980) >> summarize(gdpPercap = X.gdpPercap.mean())
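Note that query(), as used in the answer above, can also reference Python variables with the @ prefix, which keeps values like the cutoff year out of the string literal. A minimal sketch on toy data shaped like gapminder (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"continent": ["Americas", "Americas", "Asia"],
                   "year":      [1972, 1987, 1972],
                   "gdpPercap": [100.0, 200.0, 300.0]})

# @cutoff pulls the Python variable into the query expression
cutoff = 1980
out = df.query('continent == "Americas" and year < @cutoff')
print(out.gdpPercap.mean())  # 100.0
```

This is handy when the threshold comes from earlier computation rather than a hard-coded literal.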
Which country has the lowest average pop (population) in the data?
Hint: pop is an unfortunate name for a variable. Python can get confused with the .pop() method, which pops off a data column, so you’ll have to be careful regarding how you access the variable. That, or just rename the variable so there is no opportunity for confusion.
# IN PANDAS
dat.groupby(["country"])['pop'].mean().reset_index().query('pop == pop.min()')

# IN DFPLY
dat >> group_by(X.country) >> summarize(ave_pop = X['pop'].mean()) >> ungroup() >> mask(X.ave_pop == X.ave_pop.min())
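For the first and third questions above, pandas also offers shortcuts worth knowing: nunique() counts distinct values directly, and idxmin() returns the index label of the minimum. A minimal sketch on toy data shaped like gapminder (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"country": ["A", "A", "B", "B"],
                   "pop":     [10, 20, 5, 7]})

# Distinct number of countries, without drop_duplicates()
n_countries = df["country"].nunique()

# Country with the lowest average population: after the groupby mean,
# the country names are the index, so idxmin() returns the label
lowest = df.groupby("country")["pop"].mean().idxmin()

print(n_countries, lowest)  # 2 B
```

Both avoid the reset_index()/query() round trip used in the answer above.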
The following materials were generated for students enrolled in PPOL564. Please do not distribute without permission.
ed769@georgetown.edu | www.ericdunford.com