class: center, middle, inverse, title-slide #
PPOL564 | Data Science 1 | Foundations
Week 6
Joining and Reshaping
###
Prof. Eric Dunford ◆ Georgetown University ◆ McCourt School of Public Policy ◆
eric.dunford@georgetown.edu
--- layout: true <div class="slide-footer"><span> PPOL564 | Data Science 1 | Foundations           Week 6 <!-- Week of the Footer Here -->              Joining and Reshaping <!-- Title of the lecture here --> </span></div> --- class:newsection # Tidy Data --- The same data can be represented in many different ways... ![:space 15] .center[ <table> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> year </th> <th style="text-align:center;"> cases </th> <th style="text-align:center;"> population </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 745 </td> <td style="text-align:center;"> 19987071 </td> </tr> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 2666 </td> <td style="text-align:center;"> 20595360 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 37737 </td> <td style="text-align:center;"> 172006362 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 80488 </td> <td style="text-align:center;"> 174504898 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 212258 </td> <td style="text-align:center;"> 1272915272 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 213766 </td> <td style="text-align:center;"> 1280428583 </td> </tr> </tbody> </table> ] --- <br><br> .center[ <table> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> year </th> <th style="text-align:center;"> type </th> <th style="text-align:center;"> count </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> cases </td> <td style="text-align:center;"> 745 </td> </tr> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> population </td> <td style="text-align:center;"> 19987071 </td> </tr> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> cases </td> <td style="text-align:center;"> 2666 </td> </tr> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> population </td> <td style="text-align:center;"> 20595360 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> cases </td> <td style="text-align:center;"> 37737 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> population </td> <td style="text-align:center;"> 172006362 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> cases </td> <td style="text-align:center;"> 80488 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> population </td> <td style="text-align:center;"> 174504898 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> cases </td> <td style="text-align:center;"> 212258 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> population </td> <td style="text-align:center;"> 1272915272 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> cases </td> <td style="text-align:center;"> 213766 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> population </td> <td style="text-align:center;"> 1280428583 </td> </tr> </tbody> </table> ] --- <br><br><br> .center[ <table> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> year </th> <th style="text-align:center;"> rate </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 745/19987071 </td> </tr> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 2666/20595360 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 37737/172006362 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 80488/174504898 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 212258/1272915272 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 213766/1280428583 </td> </tr> </tbody> </table> ] --- <br><br> .center[ <table> <caption>Cases</caption> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> 1999 </th> <th style="text-align:center;"> 2000 </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 745 </td> <td style="text-align:center;"> 2666 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 37737 </td> <td style="text-align:center;"> 80488 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 212258 </td> <td style="text-align:center;"> 213766 </td> </tr> </tbody> </table> ] <br> .center[ <table> <caption>Population</caption> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> 1999 </th> <th style="text-align:center;"> 2000 </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 19987071 </td> <td style="text-align:center;"> 20595360 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 172006362 </td> <td style="text-align:center;"> 174504898 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 1272915272 </td> <td style="text-align:center;"> 1280428583 </td> </tr> </tbody> </table> ] --- There are three interrelated rules which make a dataset **tidy**: <br><br> 1. **Each variable must have its own column.** 2. **Each observation must have its own row.** 3. **Each value must have its own cell.** .center[<img src="Figures/tidy-data.png">] --- ## Tidy! <br><br> .center[ <table> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> year </th> <th style="text-align:center;"> cases </th> <th style="text-align:center;"> population </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 745 </td> <td style="text-align:center;"> 19987071 </td> </tr> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 2666 </td> <td style="text-align:center;"> 20595360 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 37737 </td> <td style="text-align:center;"> 172006362 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 80488 </td> <td style="text-align:center;"> 174504898 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 212258 </td> <td style="text-align:center;"> 1272915272 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 213766 </td> <td style="text-align:center;"> 1280428583 </td> </tr> </tbody> </table> ] --- class:newsection # Joining Data --- ![:space 10] Consider the following two example datasets... ```python data_A ``` ``` ## country Var1 ## 0 Nigeria 4 ## 1 England 3 ## 2 Botswana 6 ``` ```python data_B ``` ``` ## country Var2 ## 0 Nigeria Low ## 1 United States High ## 2 Botswana Medium ``` --- ## Left Join .center[<img src="Figures/left_join.png", height=100px>] .center[<img src="Figures/left-join.gif", height=400px>] --- ## Left Join .center[<img src="Figures/left_join.png", height=100px>] ![:space 10] ```python data_A.merge(data_B,how="left",on="country") ``` ``` ## country Var1 Var2 ## 0 Nigeria 4 Low ## 1 England 3 NaN ## 2 Botswana 6 Medium ``` --- ## Right Join .center[<img src="Figures/right_join.png", height=100px>] .center[<img src="Figures/right-join.gif", height=400px>] --- ## Right Join .center[<img src="Figures/right_join.png", height=100px>] ![:space 10] ```python data_A.merge(data_B,how="right",on="country") ``` ``` ## country Var1 Var2 ## 0 Nigeria 4.0 Low ## 1 Botswana 6.0 Medium ## 2 United States NaN High ``` --- ## Inner Join .center[<img src="Figures/inner_join.png", height=100px>] .center[<img src="Figures/inner-join.gif", height=400px>] --- ## Inner Join .center[<img src="Figures/inner_join.png", height=100px>] ![:space 10] ```python data_A.merge(data_B,how="inner",on="country") ``` ``` ## country Var1 Var2 ## 0 Nigeria 4 Low ## 1 Botswana 6 Medium ``` --- ## Full Join .center[<img src="Figures/full_join.png", height=100px>] .center[<img src="Figures/full-join.gif", height=400px>] --- ## Full Join <center> <img src="Figures/full_join.png", height=100px> </center> ![:space 10] ```python data_A.merge(data_B,how="outer",on="country") ``` ``` ## country Var1 Var2 ## 0 Nigeria 4.0 Low ## 1 England 3.0 NaN ## 2 Botswana 6.0 Medium ## 3 United States NaN High ``` --- ## Anti Join .center[<img src="Figures/anti_join_left.png", height=100px>] .center[<img src="Figures/anti-join.gif", height=400px>] --- ## Anti Join .center[<img src="Figures/anti_join_left.png", height=100px>] ![:space 5] ```python m = pd.merge(left=data_A, right=data_B, how='left',on="country", indicator=True) print(m) ``` ``` ## country Var1 Var2 _merge ## 0 Nigeria 4 Low both ## 1 England 3 NaN left_only ## 2 Botswana 6 Medium both ``` ```python m.loc[m._merge=="left_only",:].drop(columns="_merge") ``` ``` ## country Var1 Var2 ## 1 England 3 NaN ``` --- ## Row Bind <center> <img src="Figures/rbind.png"> </center> ![:space 5] ```python pd.concat([data_A,data_B],sort=False) ``` ``` ## country Var1 Var2 ## 0 Nigeria 4.0 NaN ## 1 England 3.0 NaN ## 2 Botswana 6.0 NaN ## 0 Nigeria NaN Low ## 1 United States NaN High ## 2 Botswana NaN Medium ``` --- ## Column Bind <center> <img src="Figures/cbind.png"> </center> ![:space 5] ```python pd.concat([data_A,data_B],axis=1,sort=False) ``` ``` ## country Var1 country Var2 ## 0 Nigeria 4 Nigeria Low ## 1 England 3 United States High ## 2 Botswana 6 Botswana Medium ``` --- ## Disparate column names ![:space 10] Sometimes the naming conventions of two datasets don't perfectly align. When this happens, we can specify how data merges onto one another more explicitly using the `by=` argument. Moreover, we can merge on **_more_ than one dimension** by specifying all relevant column names. --- ![:space 10] Once again, consider the following example data.. ```python data_A ``` ``` ## cname var1 ## 0 Russia 1 ## 1 Nigeria 2 ## 2 USA 3 ## 3 Australia 4 ``` ```python data_B ``` ``` ## country var2 ## 0 Belgium -1.0 ## 1 USA 0.3 ## 2 Nigeria 2.2 ## 3 Botswana 1.7 ``` --- ![:space 20] ```python pd.merge(left = data_A, right = data_B, how = "outer", # The type of join left_on = "cname", # The left column naming convention right_on="country") # The right column naming convention ``` ``` ## cname var1 country var2 ## 0 Russia 1.0 NaN NaN ## 1 Nigeria 2.0 Nigeria 2.2 ## 2 USA 3.0 USA 0.3 ## 3 Australia 4.0 NaN NaN ## 4 NaN NaN Belgium -1.0 ## 5 NaN NaN Botswana 1.7 ``` --- class:newsection # Reshaping Data --- ![:space 10] Often, we need to alter the structure of a `data.frame` from a **wide format**... ![:space 10] .center[ <table> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> 1992 </th> <th style="text-align:center;"> 1993 </th> <th style="text-align:center;"> 1994 </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Nigeria </td> <td style="text-align:center;"> 9.72 </td> <td style="text-align:center;"> 10.06 </td> <td style="text-align:center;"> 9.66 </td> </tr> <tr> <td style="text-align:center;"> Iran </td> <td style="text-align:center;"> 9.88 </td> <td style="text-align:center;"> 10.86 </td> <td style="text-align:center;"> 9.78 </td> </tr> <tr> <td style="text-align:center;"> Cambodia </td> <td style="text-align:center;"> 10.78 </td> <td style="text-align:center;"> 10.23 </td> <td style="text-align:center;"> 10.61 </td> </tr> <tr> <td style="text-align:center;"> Australia </td> <td style="text-align:center;"> 10.04 </td> <td style="text-align:center;"> 9.37 </td> <td style="text-align:center;"> 10.18 </td> </tr> </tbody> </table> ] --- ...into a **long format** .center[ <table> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> year </th> <th style="text-align:center;"> var </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Nigeria </td> <td style="text-align:center;"> 1992 </td> <td style="text-align:center;"> 9.72 </td> </tr> <tr> <td style="text-align:center;"> Nigeria </td> <td style="text-align:center;"> 1993 </td> <td style="text-align:center;"> 10.06 </td> </tr> <tr> <td style="text-align:center;"> Nigeria </td> <td style="text-align:center;"> 1994 </td> <td style="text-align:center;"> 9.66 </td> </tr> <tr> <td style="text-align:center;"> Iran </td> <td style="text-align:center;"> 1992 </td> <td style="text-align:center;"> 9.88 </td> </tr> <tr> <td style="text-align:center;"> Iran </td> <td style="text-align:center;"> 1993 </td> <td style="text-align:center;"> 10.86 </td> </tr> <tr> <td style="text-align:center;"> Iran </td> <td style="text-align:center;"> 1994 </td> <td style="text-align:center;"> 9.78 </td> </tr> <tr> <td style="text-align:center;"> Cambodia </td> <td style="text-align:center;"> 1992 </td> <td style="text-align:center;"> 10.78 </td> </tr> <tr> <td style="text-align:center;"> Cambodia </td> <td style="text-align:center;"> 1993 </td> <td style="text-align:center;"> 10.23 </td> </tr> <tr> <td style="text-align:center;"> Cambodia </td> <td style="text-align:center;"> 1994 </td> <td style="text-align:center;"> 10.61 </td> </tr> <tr> <td style="text-align:center;"> Australia </td> <td style="text-align:center;"> 1992 </td> <td style="text-align:center;"> 10.04 </td> </tr> <tr> <td style="text-align:center;"> Australia </td> <td style="text-align:center;"> 1993 </td> <td style="text-align:center;"> 9.37 </td> </tr> <tr> <td style="text-align:center;"> Australia </td> <td style="text-align:center;"> 1994 </td> <td style="text-align:center;"> 10.18 </td> </tr> </tbody> </table> ] --- ## From wide-to-long ![:space 15] ![:center_img 100](Figures/gather.png) --- ### From wide-to-long ```python dat ``` ``` ## country 1992 1993 1994 ## 0 Nigeria 9.72 10.06 9.66 ## 1 Iran 9.88 10.86 9.78 ## 2 Cambodia 10.78 10.23 10.61 ## 3 Australia 10.04 9.37 10.18 ``` -- ```python pd.melt(dat,id_vars=['country']) ``` ``` ## country variable value ## 0 Nigeria 1992 9.72 ## 1 Iran 1992 9.88 ## 2 Cambodia 1992 10.78 ## 3 Australia 1992 10.04 ## 4 Nigeria 1993 10.06 ## 5 Iran 1993 10.86 ## 6 Cambodia 1993 10.23 ## 7 Australia 1993 9.37 ## 8 Nigeria 1994 9.66 ## 9 Iran 1994 9.78 ## 10 Cambodia 1994 10.61 ## 11 Australia 1994 10.18 ``` --- ## From long-to-wide ![:space 10] .center[<img src ="Figures/spread.png">] --- ## From long-to-wide ![:space 10] ```python dat2 ``` ``` ## country year ln_gdppc ## 0 Nigeria 1992 9.72 ## 1 Cambodia 1992 10.78 ## 2 Australia 1992 10.04 ## 3 Nigeria 1993 10.06 ## 4 Iran 1993 10.86 ## 5 Australia 1993 9.37 ## 6 Nigeria 1994 9.66 ## 7 Iran 1994 9.78 ## 8 Cambodia 1994 10.61 ## 9 Australia 1994 10.18 ``` --- ## From long-to-wide Main arguments: - `columns`: name of the variable that will be spread out into columns. - `values`: values that will populate the cells of each column - `index`: variables to set as the index ![:space 3] ```python dat2.pivot_table(values='ln_gdppc', columns='country', index='year') ``` ``` ## country Nigeria Iran Cambodia Australia ## year ## 1992 9.72 NaN 10.78 10.04 ## 1993 10.06 10.86 NaN 9.37 ## 1994 9.66 9.78 10.61 10.18 ``` --- ## From long-to-wide Main arguments: - `columns`: name of the variable that will be spread out into columns. - `values`: values that will populate the cells of each column - `index`: variables to set as the index - `fill_value`: fill in missing values ```python dat2.pivot_table(values='ln_gdppc',columns='country', index='year', fill_value=-99) ``` ``` ## country Nigeria Iran Cambodia Australia ## year ## 1992 9.72 -99.00 10.78 10.04 ## 1993 10.06 10.86 -99.00 9.37 ## 1994 9.66 9.78 10.61 10.18 ```