---
title: "PPOL 670 | Week 5 | Walkthrough (Answers)"
subtitle: | 
  | Data Visualization Application - Gapminder Data 
output: 
  html_notebook:
    theme: united
    toc: true
    toc_float: true
    toc_depth: 5
---

<br><br>

```{r setup, include= F}
knitr::opts_chunk$set(error=F,warning = F,comment=F)
```



```{r dependencies}
# install.packages("tidyverse")
library(tidyverse) # the tidyverse packages covered last time

# install.packages("ggthemes")
library(ggthemes) # extra color scales and themes

# install.packages("maps")
library(maps) # map outlines used by map_data()

# install.packages("gapminder")
library(gapminder) # the Gapminder data

# install.packages("countrycode")
# countrycode is called below as countrycode::countrycode(), so it must be
# installed but does not need to be attached.
```

# Data 

The [`Gapminder` dataset](https://www.gapminder.org/) is a famous dataset used by Hans Rosling to visualize development outcomes. The data cover 1952 to 2007 in five-year intervals and measure life expectancy, population, and GDP per capita. 

```{r}
gapminder %>% head()
```
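
As a quick sanity check on the coverage described above, the sketch below lists the distinct years and counts rows, countries, and years. It assumes only that the `gapminder` package is installed; the object name `years` is introduced here for illustration.

```{r}
library(gapminder)

# Distinct years in the panel, sorted
years <- sort(unique(gapminder$year))
years

# Size of the panel: rows, countries, and years
c(rows      = nrow(gapminder),
  countries = length(unique(gapminder$country)),
  years     = length(years))
```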


# Get to know the data through visualization

Each of the following questions is aimed at helping us understand our data better. A great way to get a "feel" for a dataset is to visualize it. Answer each of the questions below with a (publishable-quality) picture. 

#### 1. How are the `lifeExp`, `pop`, and `gdpPercap` variables distributed?

There are two ways to approach this. One is to plot each variable individually. 
```{r,fig.align="center",fig.width=7,fig.height=4}
# Life Expectancy
gapminder %>% 
  ggplot(aes(lifeExp)) +
  geom_histogram(bins=30) +
  theme_bw() 

# Population
gapminder %>% 
  ggplot(aes(pop)) +
  geom_histogram(bins=30) +
  theme_bw() 

# GDP
gapminder %>% 
  ggplot(aes(gdpPercap)) +
  geom_histogram(bins=30) +
  theme_bw() 
```


The better way is to plot all the variables at once, using the `pivot_longer()` function from last time together with `facet_wrap()`.
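
As a reminder of what the reshaping step does, here is a minimal sketch on a hypothetical toy tibble (the names `toy`, `id`, `a`, and `b` are made up for illustration). With its defaults, `pivot_longer()` stacks the selected columns into a `name` column and a `value` column, which is why the plots in this section map `value` to the x-axis and facet on `name`.

```{r}
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per id, one column per variable
toy <- tibble(id = 1:2, a = c(10, 20), b = c(0.1, 0.2))

# Stack columns a and b; the defaults create the columns `name` and `value`
toy_long <- pivot_longer(toy, cols = c(a, b))
toy_long
```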

```{r,fig.align="center",fig.width=10,fig.height=4}
gapminder %>% 
  pivot_longer(cols=c(lifeExp,pop,gdpPercap)) %>% 
  ggplot(aes(value)) +
  geom_histogram(bins=30) +
  facet_wrap(~name,scales="free") +
  theme_bw() 
```
We can quickly see that there are large right skews in both `gdpPercap` and `pop`. Let's log-transform these variables and re-plot.

```{r,fig.align="center",fig.width=10,fig.height=4}
gapminder %>% 
  mutate(ln_pop = log(pop),
         ln_gdppc =  log(gdpPercap)) %>% 
  pivot_longer(cols=c(lifeExp,ln_pop,ln_gdppc)) %>% 
  ggplot(aes(value)) +
  geom_histogram(bins=30) +
  facet_wrap(~name,scales="free") +
  theme_bw() 
```
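
The skew can also be checked numerically. The sketch below compares the mean and median of each variable before and after taking logs; a mean well above the median is a rough (not definitive) sign of right skew. The object name `skew_check` and the column names inside it are introduced here for illustration.

```{r}
library(dplyr)
library(gapminder)

# Mean vs. median on the raw and the logged scale
skew_check <- gapminder %>%
  summarize(
    mean_pop        = mean(pop),
    median_pop      = median(pop),
    mean_gdppc      = mean(gdpPercap),
    median_gdppc    = median(gdpPercap),
    mean_ln_pop     = mean(log(pop)),
    median_ln_pop   = median(log(pop)),
    mean_ln_gdppc   = mean(log(gdpPercap)),
    median_ln_gdppc = median(log(gdpPercap))
  )

# Transpose for easier reading
t(skew_check)
```

On the raw scale the means sit well above the medians; after logging, the two should be much closer together.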

Now let's make things look professional!

```{r,fig.align="center",fig.width=10,fig.height=4}
gapminder %>% 
  mutate(ln_pop = log(pop),
         ln_gdppc =  log(gdpPercap)) %>% 
  pivot_longer(cols=c(lifeExp,ln_pop,ln_gdppc)) %>% 
  mutate(name = case_when(
    name == "lifeExp" ~ "Life Expectancy",
    name == "ln_gdppc" ~ "Log GDP Per Capita",
    name == "ln_pop" ~ "Log Population"
  )) %>% 
  ggplot(aes(value,fill=name)) +
  geom_histogram(bins=30,color="white",alpha=.5,show.legend = F) +
  facet_wrap(~name,scales="free_x") +
  labs(caption="Source: gapminder.org") +
  scale_fill_economist() +
  theme_fivethirtyeight() +
  theme(text=element_text(family="serif",face="bold",size=16))
```



#### 2. What's the relationship between economic development and life expectancy? Is the relationship the same for all continents?

```{r,fig.align="center",fig.width=10,fig.height=4}
gapminder %>% 
  mutate(ln_pop = log(pop),
         ln_gdppc =  log(gdpPercap)) %>% 
  ggplot(aes(ln_gdppc,lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm",se=F) # Let's fit a line to the data.
```

Useful, but a few small aesthetic adjustments would help us see what is going on. We also swap the linear fit for a loess smoother to allow for a curved relationship. 

```{r,fig.align="center",fig.width=10,fig.height=4}
gapminder %>% 
  mutate(ln_pop = log(pop),
         ln_gdppc =  log(gdpPercap)) %>% 
  ggplot(aes(ln_gdppc,lifeExp)) +
  geom_point(alpha=.4,color="grey30") +
  geom_smooth(method = "loess",se=F,color="darkred",size=1.5) +
  theme_minimal()
```

Is the trend the same by continent?

```{r,fig.align="center",fig.width=10,fig.height=4}
gapminder %>% 
  mutate(ln_pop = log(pop),
         ln_gdppc =  log(gdpPercap)) %>% 
  ggplot(aes(ln_gdppc,lifeExp,color=continent)) +
  geom_point(alpha=.4) +
  geom_smooth(method = "loess",se=F,size=1.5) +
  theme_minimal()
```

Generally speaking, it appears so, but it's difficult to home in on any one continent because there is a lot going on in a single panel. Let's consider two alternative ways of presenting the same data. 

Way 1: Separate plots using `facet_wrap()`

```{r,fig.align="center",fig.width=10,fig.height=4}
gapminder %>% 
  mutate(ln_pop = log(pop),
         ln_gdppc =  log(gdpPercap)) %>% 
  ggplot(aes(ln_gdppc,lifeExp,color=continent)) +
  geom_point(alpha=.3) +
  geom_smooth(method = "loess",se=F,size=1.5) +
  facet_wrap(~continent,nrow=1) +
  labs(x="Log GDP Per Capita",y = "Life Expectancy",color="") +
  scale_color_gdocs() +
  theme_minimal() +
  theme(legend.position = "bottom",
        text = element_text(size=14,family="serif",face="bold"))
```

Way 2: Plot the trends but not the individual data points

```{r,fig.align="center",fig.width=7,fig.height=5}
gapminder %>% 
  mutate(ln_pop = log(pop),
         ln_gdppc =  log(gdpPercap)) %>% 
  ggplot(aes(ln_gdppc,lifeExp,color=continent)) +
  geom_smooth(method = "loess",se=F,size=1.5) +
  labs(x="Log GDP Per Capita",y = "Life Expectancy",color="") +
  scale_color_gdocs() +
  theme_minimal() +
  theme(legend.position = "bottom",
        text = element_text(size=14,family="serif",face="bold"))
```
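
The visual comparison across continents can be complemented with a rough numeric summary. The sketch below computes, for each continent, the correlation between log GDP per capita and life expectancy and the slope of a simple linear fit (calculated as covariance over variance). This is a linear summary, so it will not capture the curvature that the loess lines show; the object name `by_continent` is introduced here for illustration.

```{r}
library(dplyr)
library(gapminder)

by_continent <- gapminder %>%
  mutate(ln_gdppc = log(gdpPercap)) %>%
  group_by(continent) %>%
  summarize(
    n           = n(),
    correlation = cor(ln_gdppc, lifeExp),
    # OLS slope of lifeExp on ln_gdppc: cov(x, y) / var(x)
    slope       = cov(ln_gdppc, lifeExp) / var(ln_gdppc),
    .groups     = "drop"
  )

by_continent
```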

#### 3. Which countries in Africa have the lowest levels of life expectancy?

There are two ways we could go about answering a question like this. The first is an ordered bar plot. Here we might rephrase the question as: "Which countries in Africa have the lowest levels of life expectancy _on average_?"

```{r,fig.align="center",fig.width=7,fig.height=7}
gapminder %>% 
  filter(continent == "Africa") %>% 
  group_by(country) %>% 
  summarize(lifeExp = mean(lifeExp),.groups="drop") %>% 
  ggplot(aes(lifeExp,country)) +
  geom_col() 
```

Nice, but ordering the factor levels would go a long way. Doing so is easy using the `forcats` package (which is part of the tidyverse), and while we're at it, let's add a little polish. 

```{r,fig.align="center",fig.width=7,fig.height=7.5}
gapminder %>% 
  filter(continent == "Africa") %>% 
  group_by(country) %>% 
  summarize(lifeExp = mean(lifeExp),.groups="drop") %>% 
  ggplot(aes(lifeExp,fct_reorder(country,desc(lifeExp)),fill=lifeExp)) +
  geom_col(show.legend = F) +
  scale_fill_gradient2_tableau() +
  labs(x="Life Expectancy",y="",
       title = "Average Life Expectancy in Africa",
       subtitle = "1952 - 2007",
       caption = "Source: gapminder.org") +
  theme_hc() +
  theme(text=element_text(family = "serif",face="bold",size=14))
```
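
To see what `fct_reorder()` is doing in the plot above, here is a minimal sketch on hypothetical toy data (the labels and values are made up). By default the levels are sorted by ascending value; negating the values reverses the order, which is the effect `desc()` has in the plot. Since ggplot2 draws the first level of a discrete y-axis at the bottom, the descending order puts the lowest values at the top of the chart.

```{r}
library(forcats)

# Hypothetical toy data: three labels with made-up values
labels <- c("a", "b", "c")
values <- c(3, 1, 2)

# Default: levels ordered by ascending value
asc <- fct_reorder(labels, values)
levels(asc)

# Negated values: levels ordered by descending value
dsc <- fct_reorder(labels, -values)
levels(dsc)
```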

Another way we could approach this is to lay everything out spatially. `ggplot2`, together with the `maps` package, provides a convenient way to pull map data on the fly via `map_data()`. 

```{r,fig.align="center",fig.width=11,fig.height=7}
map_data("world") %>%
  ggplot(aes(x=long,y=lat,group=group)) +
  geom_polygon()
```

Our focus is the African continent, so we'll keep only that portion of the data using our data wrangling principles. 

```{r}

# Simplify the map data 
world <- 
  map_data("world") %>% 
  select(long,lat,group,country=region) %>% 
  
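  # Again standardize the country names so they match across data sources
  mutate(country = countrycode::countrycode(country,"country.name","country.name")) %>% 
  # Gapminder ends in 2007, before South Sudan became independent (2011),
  # so fold its polygons into Sudan
  mutate(country = ifelse(country == "South Sudan","Sudan",country))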

# subset the relevant African countries in the data.
africa <- 
  gapminder %>% 
  filter(continent == "Africa") %>% 
  group_by(country) %>% 
  summarize(lifeExp = mean(lifeExp),.groups="drop")  %>% 
  mutate(country = countrycode::countrycode(country,"country.name","country.name")) %>% 
  inner_join(world,by="country")
```
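
Because `inner_join()` silently drops rows that have no match, it is worth checking which names fail to line up. The sketch below joins the raw names, without the `countrycode` step, to show why the standardization is needed: `anti_join()` returns the African countries in Gapminder whose names have no exact counterpart among the map regions. The object names `map_names`, `africa_names`, and `unmatched` are introduced here for illustration, and `map_data()` assumes the `maps` package is installed.

```{r}
library(dplyr)
library(ggplot2)   # provides map_data(), which relies on the maps package
library(gapminder)

# Region names as they appear in the map data
map_names <- map_data("world") %>%
  distinct(country = region)

# African country names as they appear in Gapminder
africa_names <- gapminder %>%
  filter(continent == "Africa") %>%
  distinct(country) %>%
  mutate(country = as.character(country))

# Gapminder names with no exact match in the map data
unmatched <- anti_join(africa_names, map_names, by = "country")
unmatched
```

Running the same `anti_join()` after the `countrycode` step is a simple way to confirm how many mismatches remain.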

Let's plot the map. 

```{r,fig.align="center",fig.width=6,fig.height=7}
africa %>% 
  ggplot(aes(x=long,y=lat,group=group)) +
  geom_polygon()
```


Now let's fill in the countries on the map using the average life expectancy values.

```{r,fig.align="center",fig.width=6,fig.height=7}
africa %>% 
  ggplot(aes(x=long,y=lat,group=group,fill=lifeExp)) +
  geom_polygon(color="white",size=.25) +
  scale_fill_gradient2_tableau() +
  theme_map() +
  labs(fill="Life Expectancy",
       title = "Average Life Expectancy in Africa",
       subtitle = "1952 - 2007",
       caption = "Source: gapminder.org") +
  theme(text=element_text(family = "serif",face="bold",size=14))
```

What if we wanted to look at how these spatial patterns shifted over time? Not a problem: we just need to tweak the data and plot code a little. 

```{r,fig.align="center",fig.width=15,fig.height=15}
# DON'T aggregate the lifeExp variable this time. 
gapminder %>% 
  filter(continent == "Africa") %>% 
  mutate(country = countrycode::countrycode(country,"country.name","country.name")) %>% 
  inner_join(world,by="country") %>% 
  ggplot(aes(x=long,y=lat,group=group,fill=lifeExp)) +
  geom_polygon(color="white",size=.25) +
  scale_fill_gradient2_tableau() +
  theme_map() +
  labs(fill="Life Expectancy",
       title = "Life Expectancy in Africa by Year",
       subtitle = "1952 - 2007",
       caption = "Source: gapminder.org") +
  facet_wrap(~year) +
  theme(text=element_text(family = "serif",face="bold",size=18),
        legend.position = "bottom")
```




