Get to know the data through visualization
Each of the following questions are targeted at making sure we understand our data better. A great way to get a “feel” for a dataset is to visualize it. Answer each of the below questions with a (publishable-quality) picture.
1. How is lifeExp
, pop
and gdpPercap
variables distributed?
Two ways to think about this: one is to plot each figure individually.
# Life Expectancy
gapminder %>%
ggplot(aes(lifeExp)) +
geom_histogram(bins=30) +
theme_bw()
# Population
gapminder %>%
ggplot(aes(pop)) +
geom_histogram(bins=30) +
theme_bw()
# GDP
gapminder %>%
ggplot(aes(gdpPercap)) +
geom_histogram(bins=30) +
theme_bw()
The better way is to plot all the functions at once using the pivot_longer()
function from last time with facet_wrap()
gapminder %>%
pivot_longer(cols=c(lifeExp,pop,gdpPercap)) %>%
ggplot(aes(value)) +
geom_histogram(bins=30) +
facet_wrap(~name,scales="free") +
theme_bw()
We can quickly see that there is large right skews in both gdpPercap
and pop
. Let’s transform these variables and re-plot.
gapminder %>%
mutate(ln_pop = log(pop),
ln_gdppc = log(gdpPercap)) %>%
pivot_longer(cols=c(lifeExp,ln_pop,ln_gdppc)) %>%
ggplot(aes(value)) +
geom_histogram(bins=30) +
facet_wrap(~name,scales="free") +
theme_bw()
Now let’s make things look professional!
gapminder %>%
mutate(ln_pop = log(pop),
ln_gdppc = log(gdpPercap)) %>%
pivot_longer(cols=c(lifeExp,ln_pop,ln_gdppc)) %>%
mutate(name = case_when(
name == "lifeExp" ~ "Life Expectancy",
name == "ln_gdppc" ~ "Log GDP Per Capita",
name == "ln_pop" ~ "Log Population"
)) %>%
ggplot(aes(value,fill=name)) +
geom_histogram(bins=30,color="white",alpha=.5,show.legend = F) +
facet_wrap(~name,scales="free_x") +
labs(caption="Source: gapminder.org") +
scale_fill_economist() +
theme_fivethirtyeight() +
theme(text=element_text(family="serif",face="bold",size=16))
2. What’s the relationship between economic development and life expectancy? Is the relationship the same for all continents?
gapminder %>%
mutate(ln_pop = log(pop),
ln_gdppc = log(gdpPercap)) %>%
ggplot(aes(ln_gdppc,lifeExp)) +
geom_point() +
geom_smooth(method = "lm",se=F) # Let's fit a line to the data.
Useful but there are a number of small aesthetic adjustments we could make to really help use distinguish between what is going on.
gapminder %>%
mutate(ln_pop = log(pop),
ln_gdppc = log(gdpPercap)) %>%
ggplot(aes(ln_gdppc,lifeExp)) +
geom_point(alpha=.4,color="grey30") +
geom_smooth(method = "loess",se=F,color="darkred",size=1.5) +
theme_minimal()
Is the trend the same by continent?
gapminder %>%
mutate(ln_pop = log(pop),
ln_gdppc = log(gdpPercap)) %>%
ggplot(aes(ln_gdppc,lifeExp,color=continent)) +
geom_point(alpha=.4,) +
geom_smooth(method = "loess",se=F,size=1.5) +
theme_minimal()
Generally speaking, it appears so, but it’s difficult to hone in on any one continent. There is just a lot going on. Let’s consider two alternative ways of presenting this same data.
Way 1: separate plots using facet_
gapminder %>%
mutate(ln_pop = log(pop),
ln_gdppc = log(gdpPercap)) %>%
ggplot(aes(ln_gdppc,lifeExp,color=continent)) +
geom_point(alpha=.3,) +
geom_smooth(method = "loess",se=F,size=1.5) +
facet_wrap(~continent,nrow=1) +
labs(x="Log GDP Per Capita",y = "Life Expectancy",color="") +
scale_color_gdocs() +
theme_minimal() +
theme(legend.position = "bottom",
text = element_text(size=14,family="serif",face="bold"))
Way 2: Plot the trends but not the individual data points
gapminder %>%
mutate(ln_pop = log(pop),
ln_gdppc = log(gdpPercap)) %>%
ggplot(aes(ln_gdppc,lifeExp,color=continent)) +
geom_smooth(method = "loess",se=F,size=1.5) +
labs(x="Log GDP Per Capita",y = "Life Expectancy",color="") +
scale_color_gdocs() +
theme_minimal() +
theme(legend.position = "bottom",
text = element_text(size=14,family="serif",face="bold"))
3. Which countries in Africa have the lowest levels of life expectancy?
Two ways we could go about answering a question like this. The first might just be an ordered barplot. Here we might rephrase the question as: “Which countries in Africa have the lowest levels of life expectancy on average?”
gapminder %>%
filter(continent == "Africa") %>%
group_by(country) %>%
summarize(lifeExp = mean(lifeExp),.groups="drop") %>%
ggplot(aes(lifeExp,country)) +
geom_col()
Nice, but ordering the factor fields would go a long way. Doing so is easy using the tidy forcats
package (which is part of the tidyverse), and why we’re at it, lets’ add a little polish.
gapminder %>%
filter(continent == "Africa") %>%
group_by(country) %>%
summarize(lifeExp = mean(lifeExp),.groups="drop") %>%
ggplot(aes(lifeExp,fct_reorder(country,desc(lifeExp)),fill=lifeExp)) +
geom_col(show.legend = F) +
scale_fill_gradient2_tableau() +
labs(x="Life Expectancy",y="",
title = "Average Life Expectancy in Africa",
subtitle = "1952 - 2007",
caption = "Source gapminder.org") +
theme_hc() +
theme(text=element_text(family = "serif",face="bold",size=14))
Another way we could approach this is to lay everything out spatially. ggplot with the maps
package provides a useful way to extract map data on the fly.
map_data("world") %>%
ggplot(aes(x=long,y=lat,group=group)) +
geom_polygon()
Our focus is the African continent, so we’ll just focus on that portion of the data using the data wrangling principals.
# Simplify the map data
world <-
map_data("world") %>%
select(long,lat,group,country=region) %>%
# Again standardize the country names
mutate(country = countrycode::countrycode(country,"country.name","country.name")) %>%
mutate(country = ifelse(country == "South Sudan","Sudan",country))
# subset the relevant African countries in the data.
africa <-
gapminder %>%
filter(continent == "Africa") %>%
group_by(country) %>%
summarize(lifeExp = mean(lifeExp),.groups="drop") %>%
mutate(country = countrycode::countrycode(country,"country.name","country.name")) %>%
inner_join(world,by="country")
Let’s plot the map.
africa %>%
ggplot(aes(x=long,y=lat,group=group)) +
geom_polygon()
Now let’s fill in the fields on the map using the average life expectancy values.
africa %>%
ggplot(aes(x=long,y=lat,group=group,fill=lifeExp)) +
geom_polygon(color="white",size=.25) +
scale_fill_gradient2_tableau() +
theme_map() +
labs(fill="Life Expectancy",
title = "Average Life Expectancy in Africa",
subtitle = "1952 - 2007",
caption = "Source gapminder.org") +
theme(text=element_text(family = "serif",face="bold",size=14))
What if we wanted to look at how these spatial patterns shifted over time? Not a problem, we just need to tweak the data and plot code a little bit.
# DON'T aggregate the lifeExp variable this time.
gapminder %>%
filter(continent == "Africa") %>%
mutate(country = countrycode::countrycode(country,"country.name","country.name")) %>%
inner_join(world,by="country") %>%
ggplot(aes(x=long,y=lat,group=group,fill=lifeExp)) +
geom_polygon(color="white",size=.25) +
scale_fill_gradient2_tableau() +
theme_map() +
labs(fill="Life Expectancy",
title = "Average Life Expectancy in Africa",
subtitle = "1952 - 2007",
caption = "Source gapminder.org") +
facet_wrap(~year) +
theme(text=element_text(family = "serif",face="bold",size=18),
legend.position = "bottom")
---
title: "PPOL 670 | Week 5 | Walthrough (Answers)"
subtitle: | 
  | Data Visualization Application - Gapminder Data 
output: 
  html_notebook:
    theme: united
    toc: true
    toc_float: true
    toc_depth: 5
---

<br><br>

```{r setup, include= F}
knitr::opts_chunk$set(error=F,warning = F,comment=F)
```



```{r dependencies}
#install.packages("tidyverse")
require(tidyverse) # The tidyverse package covered last time 

# install.packages("ggthemes")
require(ggthemes) # for great visualization colors and themes. 

# install.packages("maps")
require(maps) # for some maps data

# Gapminder data (for example)
#install.packages("gapminder")
require(gapminder)
```

# Data 

The [`Gapminder` dataset](https://www.gapminder.org/) is a famous dataset used by Hans Rosling to visualize development outcomes. The data covers 1952 to 2007 in five year intervals and measures life expectancy, population, and GDP Per Capita. 

```{r}
gapminder %>% head()
```


# Get to know the data through visualization

Each of the following questions are targeted at making sure we understand our data better. A great way to get a "feel" for a dataset is to visualize it. Answer each of the below questions with a (publishable-quality) picture. 

#### 1. How is `lifeExp`, `pop` and `gdpPercap` variables distributed?

Two ways to think about this: one is to plot each figure individually. 
```{r,fig.align="center",fig.width=7,fig.height=4}
# Life Expectancy
gapminder %>% 
  ggplot(aes(lifeExp)) +
  geom_histogram(bins=30) +
  theme_bw() 

# Population
gapminder %>% 
  ggplot(aes(pop)) +
  geom_histogram(bins=30) +
  theme_bw() 

# GDP
gapminder %>% 
  ggplot(aes(gdpPercap)) +
  geom_histogram(bins=30) +
  theme_bw() 
```


The better way is to plot all the functions at once using the `pivot_longer()` function from last time with `facet_wrap()`

```{r,fig.align="center",fig.width=10,fig.height=4}
gapminder %>% 
  pivot_longer(cols=c(lifeExp,pop,gdpPercap)) %>% 
  ggplot(aes(value)) +
  geom_histogram(bins=30) +
  facet_wrap(~name,scales="free") +
  theme_bw() 
```
We can quickly see that there is large right skews in both `gdpPercap` and `pop`. Let's transform these variables and re-plot.

```{r,fig.align="center",fig.width=10,fig.height=4}
gapminder %>% 
  mutate(ln_pop = log(pop),
         ln_gdppc =  log(gdpPercap)) %>% 
  pivot_longer(cols=c(lifeExp,ln_pop,ln_gdppc)) %>% 
  ggplot(aes(value)) +
  geom_histogram(bins=30) +
  facet_wrap(~name,scales="free") +
  theme_bw() 
```

Now let's make things look professional!

```{r,fig.align="center",fig.width=10,fig.height=4}
gapminder %>% 
  mutate(ln_pop = log(pop),
         ln_gdppc =  log(gdpPercap)) %>% 
  pivot_longer(cols=c(lifeExp,ln_pop,ln_gdppc)) %>% 
  mutate(name = case_when(
    name == "lifeExp" ~ "Life Expectancy",
    name == "ln_gdppc" ~ "Log GDP Per Capita",
    name == "ln_pop" ~ "Log Population"
  )) %>% 
  ggplot(aes(value,fill=name)) +
  geom_histogram(bins=30,color="white",alpha=.5,show.legend = F) +
  facet_wrap(~name,scales="free_x") +
  labs(caption="Source: gapminder.org") +
  scale_fill_economist() +
  theme_fivethirtyeight() +
  theme(text=element_text(family="serif",face="bold",size=16))
```



#### 2. What's the relationship between economic development and life expectancy? Is the relationship the same for all continents?

```{r,fig.align="center",fig.width=10,fig.height=4}
gapminder %>% 
  mutate(ln_pop = log(pop),
         ln_gdppc =  log(gdpPercap)) %>% 
  ggplot(aes(ln_gdppc,lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm",se=F) # Let's fit a line to the data.
```

Useful but there are a number of small aesthetic adjustments we could make to really help use distinguish between what is going on. 

```{r,fig.align="center",fig.width=10,fig.height=4}
gapminder %>% 
  mutate(ln_pop = log(pop),
         ln_gdppc =  log(gdpPercap)) %>% 
  ggplot(aes(ln_gdppc,lifeExp)) +
  geom_point(alpha=.4,color="grey30") +
  geom_smooth(method = "loess",se=F,color="darkred",size=1.5) +
  theme_minimal()
```

Is the trend the same by continent?

```{r,fig.align="center",fig.width=10,fig.height=4}
gapminder %>% 
  mutate(ln_pop = log(pop),
         ln_gdppc =  log(gdpPercap)) %>% 
  ggplot(aes(ln_gdppc,lifeExp,color=continent)) +
  geom_point(alpha=.4,) +
  geom_smooth(method = "loess",se=F,size=1.5) +
  theme_minimal()
```

Generally speaking, it appears so, but it's difficult to hone in on any one continent. There is just a lot going on. Let's consider two alternative ways of presenting this same data. 

Way 1: separate plots using `facet_`

```{r,fig.align="center",fig.width=10,fig.height=4}
gapminder %>% 
  mutate(ln_pop = log(pop),
         ln_gdppc =  log(gdpPercap)) %>% 
  ggplot(aes(ln_gdppc,lifeExp,color=continent)) +
  geom_point(alpha=.3,) +
  geom_smooth(method = "loess",se=F,size=1.5) +
  facet_wrap(~continent,nrow=1) +
  labs(x="Log GDP Per Capita",y = "Life Expectancy",color="") +
  scale_color_gdocs() +
  theme_minimal() +
  theme(legend.position = "bottom",
        text = element_text(size=14,family="serif",face="bold"))
```

Way 2: Plot the trends but not the individual data points

```{r,fig.align="center",fig.width=7,fig.height=5}
gapminder %>% 
  mutate(ln_pop = log(pop),
         ln_gdppc =  log(gdpPercap)) %>% 
  ggplot(aes(ln_gdppc,lifeExp,color=continent)) +
  geom_smooth(method = "loess",se=F,size=1.5) +
  labs(x="Log GDP Per Capita",y = "Life Expectancy",color="") +
  scale_color_gdocs() +
  theme_minimal() +
  theme(legend.position = "bottom",
        text = element_text(size=14,family="serif",face="bold"))
```

#### 3. Which countries in Africa have the lowest levels of life expectancy?

Two ways we could go about answering a question like this. The first might just be an ordered barplot. Here we might rephrase the question as: "Which countries in Africa have the lowest levels of life expectancy _on average_?"

```{r,fig.align="center",fig.width=7,fig.height=7}
gapminder %>% 
  filter(continent == "Africa") %>% 
  group_by(country) %>% 
  summarize(lifeExp = mean(lifeExp),.groups="drop") %>% 
  ggplot(aes(lifeExp,country)) +
  geom_col() 
```

Nice, but ordering the factor fields would go a long way. Doing so is easy using the tidy `forcats` package (which is part of the tidyverse), and why we're at it, lets' add a little polish. 

```{r,fig.align="center",fig.width=7,fig.height=7.5}
gapminder %>% 
  filter(continent == "Africa") %>% 
  group_by(country) %>% 
  summarize(lifeExp = mean(lifeExp),.groups="drop") %>% 
  ggplot(aes(lifeExp,fct_reorder(country,desc(lifeExp)),fill=lifeExp)) +
  geom_col(show.legend = F) +
  scale_fill_gradient2_tableau() +
  labs(x="Life Expectancy",y="",
       title = "Average Life Expectancy in Africa",
       subtitle = "1952 - 2007",
       caption = "Source gapminder.org") +
  theme_hc() +
  theme(text=element_text(family = "serif",face="bold",size=14))
```

Another way we could approach this is to lay everything out spatially. ggplot with the `maps` package provides a useful way to extract map data on the fly. 

```{r,fig.align="center",fig.width=11,fig.height=7}
map_data("world") %>%
  ggplot(aes(x=long,y=lat,group=group)) +
  geom_polygon()
```

Our focus is the African continent, so we'll just focus on that portion of the data using the data wrangling principals. 

```{r}

# Simplify the map data 
world <- 
  map_data("world") %>% 
  select(long,lat,group,country=region) %>% 
  
  # Again standardize the country names
  mutate(country = countrycode::countrycode(country,"country.name","country.name")) %>% 
  mutate(country = ifelse(country == "South Sudan","Sudan",country))

# subset the relevant African countries in the data.
africa <- 
  gapminder %>% 
  filter(continent == "Africa") %>% 
  group_by(country) %>% 
  summarize(lifeExp = mean(lifeExp),.groups="drop")  %>% 
  mutate(country = countrycode::countrycode(country,"country.name","country.name")) %>% 
  inner_join(world,by="country")
```

Let's plot the map. 

```{r,fig.align="center",fig.width=6,fig.height=7}
africa %>% 
  ggplot(aes(x=long,y=lat,group=group)) +
  geom_polygon()
```


Now let's fill in the fields on the map using the average life expectancy values.

```{r,fig.align="center",fig.width=6,fig.height=7}
africa %>% 
  ggplot(aes(x=long,y=lat,group=group,fill=lifeExp)) +
  geom_polygon(color="white",size=.25) +
  scale_fill_gradient2_tableau() +
  theme_map() +
  labs(fill="Life Expectancy",
       title = "Average Life Expectancy in Africa",
       subtitle = "1952 - 2007",
       caption = "Source gapminder.org") +
  theme(text=element_text(family = "serif",face="bold",size=14))
```

What if we wanted to look at how these spatial patterns shifted over time? Not a problem, we just need to tweak the data and plot code a little bit. 

```{r,fig.align="center",fig.width=15,fig.height=15}
# DON'T aggregate the lifeExp variable this time. 
gapminder %>% 
  filter(continent == "Africa") %>% 
  mutate(country = countrycode::countrycode(country,"country.name","country.name")) %>% 
  inner_join(world,by="country") %>% 
  ggplot(aes(x=long,y=lat,group=group,fill=lifeExp)) +
  geom_polygon(color="white",size=.25) +
  scale_fill_gradient2_tableau() +
  theme_map() +
  labs(fill="Life Expectancy",
       title = "Average Life Expectancy in Africa",
       subtitle = "1952 - 2007",
       caption = "Source gapminder.org") +
  facet_wrap(~year) +
  theme(text=element_text(family = "serif",face="bold",size=18),
        legend.position = "bottom")
```




