PPOL564 - Data Science I: Foundations

Chaining Methods

In [2]:
import pandas as pd
from dfply import *

Chaining

Operations:

  • Chain together data manipulations in a single operational sequence.

These are powerful operations in isolation, but when combined, what results is a streamlined way to convert raw data into a useful data construct. We can do this because pandas was built fully utilizing a fluid programming framework (that is, the class always returns itself after every call).

This is easier to show in practice. Let's consider different ways to perform the same sequence of operations.

Let's do the following: Aggregate the data to the country level for all countries in Asia

Method 1: sequentially overwrite the object

In [71]:
dat = pd.read_csv("gapminder.csv") # Read the data in

dat = dat.query("continent == 'Asia'")
dat = dat.filter(['country','lifeExp','gdpPercap','population'])
dat = dat.groupby(['country'])
dat = dat.agg(mean)
dat = dat.reset_index()

dat.round(1).head()
Out[71]:
country lifeExp gdpPercap population
0 Afghanistan 37.5 802.7 15823715.4
1 Bahrain 65.6 18077.7 373913.2
2 Bangladesh 49.8 817.6 90755395.3
3 Cambodia 47.9 675.4 8510430.7
4 China 61.8 1488.3 958160051.9

Method 2: Method chaining horizontally

In [72]:
dat = pd.read_csv("gapminder.csv")

dat = dat.query("continent == 'Asia'").filter(['country','lifeExp','gdpPercap','population']).groupby(['country']).agg(mean).reset_index()

dat.round(1).head()
Out[72]:
country lifeExp gdpPercap population
0 Afghanistan 37.5 802.7 15823715.4
1 Bahrain 65.6 18077.7 373913.2
2 Bangladesh 49.8 817.6 90755395.3
3 Cambodia 47.9 675.4 8510430.7
4 China 61.8 1488.3 958160051.9

Method 3: Method Chaining vertically

In [73]:
dat = pd.read_csv("gapminder.csv")

# (A) use the back slash
dat = dat.\
query("continent == 'Asia'").\
filter(['country','lifeExp','gdpPercap','population']).\
groupby(['country']).\
agg(mean).\
reset_index()

dat.round(1).head()
Out[73]:
country lifeExp gdpPercap population
0 Afghanistan 37.5 802.7 15823715.4
1 Bahrain 65.6 18077.7 373913.2
2 Bangladesh 49.8 817.6 90755395.3
3 Cambodia 47.9 675.4 8510430.7
4 China 61.8 1488.3 958160051.9
In [74]:
dat = pd.read_csv("gapminder.csv")

# (B) house in parentheses
dat = (dat
       .query("continent == 'Asia'")
       .filter(['country','lifeExp','gdpPercap','population'])
       .groupby(['country'])
       .agg(mean)
       .reset_index())

dat.round(1).head()
Out[74]:
country lifeExp gdpPercap population
0 Afghanistan 37.5 802.7 15823715.4
1 Bahrain 65.6 18077.7 373913.2
2 Bangladesh 49.8 817.6 90755395.3
3 Cambodia 47.9 675.4 8510430.7
4 China 61.8 1488.3 958160051.9

piping with dfply

We've already seen this in play throughout the lecture notes.

In [108]:
dat = pd.read_csv("gapminder.csv")

dat = (dat >> 
       mask(X.continent == "Asia") >> 
       select(X.country,columns_between(X.lifeExp,X.gdpPercap)) >> 
       group_by(X.country) >>
       summarize_each([np.mean],X.lifeExp,X.gdpPercap,X.population))

dat.round(1).head()
Out[108]:
country lifeExp_mean gdpPercap_mean population_mean
0 Afghanistan 37.5 802.7 15823715.4
1 Bahrain 65.6 18077.7 373913.2
2 Bangladesh 49.8 817.6 90755395.3
3 Cambodia 47.9 675.4 8510430.7
4 China 61.8 1488.3 958160051.9

Miscellaneous

Views

Note that similar to numpy, we can manipulate a "view" of a pandas.DataFrame.

In [76]:
# Create a fake data frame
D = pd.DataFrame(dict(A=np.arange(5),
                     B=np.arange(5)*-1))
D
Out[76]:
A B
0 0 0
1 1 -1
2 2 -2
3 3 -3
4 4 -4
In [77]:
# Create and Manipulate a subset (view)
b = D.iloc[:2,:2]
b.iloc[:2,:2] = 8 
b
Out[77]:
A B
0 8 8
1 8 8
In [78]:
# Changes carry over to the original object
D
Out[78]:
A B
0 8 8
1 8 8
2 2 -2
3 3 -3
4 4 -4

Counting number of observations (by group)

In [79]:
dat = pd.read_csv("gapminder.csv")

# Pandas approach

# Country-Years
a = \
(dat
 .filter(["country"])
 .groupby(['country'])
 .size() # We can count the number of observations using the size method
 .reset_index()
 .rename(columns={0:"n_country_years"})
) 


# Countries by continent
b = \
(dat
 .filter(["continent","country"])
 .drop_duplicates()
 .groupby(['continent'])
 .size() # We can count the number of observations using the size method
 .reset_index()
 .rename(columns={0:"n_countries"})
 .head(5)
) 


# total countries  
c = \
(dat
 .filter(["country"])
 .drop_duplicates()
 .shape[0]
)


# Print
display(a.head(3))
display(b)
print(f'There are {c} total countries in the data.')
country n_country_years
0 Afghanistan 12
1 Albania 12
2 Algeria 12
continent n_countries
0 Africa 52
1 Americas 25
2 Asia 33
3 Europe 30
4 Oceania 2
There are 142 total countries in the data.
In [80]:
# dfply approach

# Country-Years
aa = \
(dat >> 
group_by(X.country) >> 
summarize(n_country_years = n(X.year)) >> 
head(3))

# Countries by continent
bb = \
(dat >> 
distinct(X.continent,X.country) >>
group_by(X.continent) >> 
summarize(n_countries = n(X.country)) >> 
head(3))

# total countries  
cc = dat >> summarize(n_total_countries = n_distinct(X.country))

# Print
display(aa.head(3))
display(bb)
display(cc)
country n_country_years
0 Afghanistan 12
1 Albania 12
2 Algeria 12
continent n_countries
0 Africa 52
1 Americas 25
2 Asia 33
3 Europe 30
4 Oceania 2
n_total_countries
0 142

Categories to dummies

In statistics and machine learning, we often need to convert a categorical variable into a dummy feature set (i.e. when the variable is "on" it takes the value of 1, 0 otherwise). In statistics, we'll use this type of conversion to generate fixed effects.

pandas makes this type of manipulation easy to do. with the

In [81]:
d = pd.DataFrame(dict(country = ["Nigeria","Nigeria","United States","United States","Russia","Russia"]))
d
Out[81]:
country
0 Nigeria
1 Nigeria
2 United States
3 United States
4 Russia
5 Russia
In [82]:
dummies = pd.get_dummies(d.country)
pd.concat([d,dummies],sort=False,axis=1)
Out[82]:
country Nigeria Russia United States
0 Nigeria 1 0 0
1 Nigeria 1 0 0
2 United States 0 0 1
3 United States 0 0 1
4 Russia 0 1 0
5 Russia 0 1 0

Merging with dfply

For the sake of completeness, let's demonstrate what joining looks like with dfply. Like dplyr (which the module is based off of), dfply uses SQl syntax to keep track of the type of merge we are performing.

In [83]:
# Same fake data construct as the prior lecture

data_A = pd.DataFrame(dict(country = ["Nigeria","England","Botswana"],
                           var1 = [4,3,6]))
data_B = pd.DataFrame(dict(country = ["Nigeria","United States","Botswana"],
                           var2 = ["low","high","medium"]))
display(data_A)
display(data_B)
country var1
0 Nigeria 4
1 England 3
2 Botswana 6
country var2
0 Nigeria low
1 United States high
2 Botswana medium
In [84]:
# Left Join
data_A >> left_join(data_B, by='country')
Out[84]:
country var1 var2
0 Nigeria 4 low
1 England 3 NaN
2 Botswana 6 medium
In [85]:
# right Join
data_A >> right_join(data_B, by='country')
Out[85]:
country var1 var2
0 Nigeria 4.0 low
1 Botswana 6.0 medium
2 United States NaN high
In [86]:
# Inner Join
data_A >> inner_join(data_B, by='country')
Out[86]:
country var1 var2
0 Nigeria 4 low
1 Botswana 6 medium
In [87]:
# Full Join
data_A >> full_join(data_B, by='country')
Out[87]:
country var1 var2
0 Nigeria 4.0 low
1 England 3.0 NaN
2 Botswana 6.0 medium
3 United States NaN high
In [88]:
# Or Outer Join (same as Full Join)
data_A >> outer_join(data_B, by='country')
Out[88]:
country var1 var2
0 Nigeria 4.0 low
1 England 3.0 NaN
2 Botswana 6.0 medium
3 United States NaN high
In [89]:
# Anti Join
data_A >> anti_join(data_B, by='country')
Out[89]:
country var1
1 England 3
In [90]:
# Bind Rows
data_A >> bind_rows(data_B)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/dfply/join.py:279: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  df = pd.concat([df, other], join=join, ignore_index=ignore_index, axis=0)
Out[90]:
country var1 var2
0 Nigeria 4.0 NaN
1 England 3.0 NaN
2 Botswana 6.0 NaN
0 Nigeria NaN low
1 United States NaN high
2 Botswana NaN medium
In [91]:
# Bind columns
data_A >> bind_cols(data_B)
Out[91]:
country var1 country var2
0 Nigeria 4 Nigeria low
1 England 3 United States high
2 Botswana 6 Botswana medium