In this notebook, we’ll hone our data manipulation skills by examining conflict event data generated by the Armed Conflict Location & Event Data Project (ACLED). The aim is to practice some of the tidyverse functions covered in the lecture.
As always, we need to make sure we have the tidyverse packages installed and loaded.
# install.packages("tidyverse")
require(tidyverse)
ACLED is a “disaggregated data collection, analysis, and crisis mapping project. ACLED collects the dates, actors, locations, fatalities, and modalities of all reported political violence and protest events across Africa, South Asia, Southeast Asia, the Middle East, Central Asia and the Caucasus, Latin America and the Caribbean, and Southeastern and Eastern Europe and the Balkans.” For this exercise, we’ll focus just on the data pertaining to Africa. For more information regarding these data, please consult the ACLED methodology.
acled <- read_csv("https://raw.githubusercontent.com/edunford/enhance_and_advance_R/master/lectures/wrangling/walkthroughs/acled_africa.csv")
Parsed with column specification:
cols(
.default = col_character(),
data_id = col_double(),
iso = col_double(),
event_id_no_cnty = col_double(),
year = col_double(),
time_precision = col_double(),
inter1 = col_double(),
inter2 = col_double(),
interaction = col_double(),
latitude = col_double(),
longitude = col_double(),
geo_precision = col_double(),
fatalities = col_double(),
timestamp = col_double()
)
See spec(...) for full column specifications.
head(acled)
To understand a dataset, one needs to ask a lot of questions of it. To get a better feel for the ACLED data, let’s explore the following questions:
acled %>%
group_by(country) %>%
summarize(min_year = min(year),
max_year = max(year))
`summarise()` ungrouping output (override with `.groups` argument)
acled %>%
count(country, sort = TRUE)
acled %>%
count(year, sort = TRUE)
acled %>%
group_by(event_type) %>%
summarize(n_events = n()) %>%
arrange(desc(n_events))
`summarise()` ungrouping output (override with `.groups` argument)
How many sub-event types are there for each event type?
acled %>%
group_by(event_type) %>%
summarize(n_subtypes = n_distinct(sub_event_type)) %>%
arrange(desc(n_subtypes))
`summarise()` ungrouping output (override with `.groups` argument)
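As an aside, `count()` is shorthand for the `group_by()` + `summarize(n = n())` + `arrange()` pattern used above. Here is a minimal sketch on a made-up tibble; the `toy` data and its values are illustrative assumptions, not taken from ACLED.

```r
library(dplyr)

# Hypothetical toy data standing in for the event-level ACLED records
toy <- tibble(
  event_type     = c("Battles", "Battles", "Protests", "Battles", "Riots"),
  sub_event_type = c("Armed clash", "Armed clash", "Peaceful protest",
                     "Government regains territory", "Mob violence")
)

# Long form: group, count the rows in each group, then sort
long_form <- toy %>%
  group_by(event_type) %>%
  summarize(n = n()) %>%
  arrange(desc(n))

# Short form: count() groups, counts, and sorts in one call
short_form <- toy %>% count(event_type, sort = TRUE)

# Number of distinct sub-types within each event type
n_sub <- toy %>%
  group_by(event_type) %>%
  summarize(n_subtypes = n_distinct(sub_event_type))
```

Both forms should give the same counts; the long form is mainly useful when you want several summaries at once.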
Let’s say we wanted to estimate the impact of political violence on economic growth. Specifically, we want to know how the number of fatalities resulting from political instability affects the economic growth of a country. Construct a data set that would allow you to answer this question and use OLS to estimate the relationship.
First, I’ve provided data from the World Bank that captures the economic growth rate by country-year.
# install.packages("wbstats") # API for downloading WB data
require(wbstats)
# Download the data
gdppp_growth <- wb(country = "countries_only",indicator = "NY.GDP.PCAP.KD.ZG") %>%
mutate(date=as.numeric(date))
# Look at the top of the data
head(gdppp_growth)
The unit of analysis of the ACLED data is the location-day. We need to aggregate these data to the country-year level. Relying only on tidyverse functions, generate a new data frame that has the following fields:
country (unit)
year (time)
fatalities = total number of fatalities in a given country-year
ln_fatalities = natural log of the number of fatalities (main IV)
n_events = total number of events that took place in a given country-year
ln_n_events = natural log of the number of events (control)
acled_country_year <-
acled %>%
group_by(country,year) %>%
summarize(fatalities = sum(fatalities),
ln_fatalities = log(fatalities + 1),
n_events = n(),
ln_n_events = log(n_events + 1)) %>%
ungroup()
`summarise()` regrouping output by 'country' (override with `.groups` argument)
acled_country_year %>% sample_n(5)
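Two details of the aggregation are worth a closer look. The "regrouping output by 'country'" message says the summarized result is still grouped by `country`, which is why the pipeline ends with an ungroup step. The `+ 1` inside the log keeps country-years with zero fatalities from becoming `-Inf`. The sketch below shows both points on a made-up `events` tibble (hypothetical data), using `.groups = "drop"` and `log1p()` as equivalent alternatives; it assumes dplyr 1.0.0 or later for the `.groups` argument.

```r
library(dplyr)

# Hypothetical event-level data: one row per event
events <- tibble(
  country    = c("A", "A", "A", "B", "B"),
  year       = c(2000, 2000, 2001, 2000, 2000),
  fatalities = c(0, 5, 0, 2, 3)
)

country_year <- events %>%
  group_by(country, year) %>%
  summarize(
    fatalities    = sum(fatalities),
    ln_fatalities = log1p(fatalities),  # same as log(fatalities + 1)
    n_events      = n(),
    ln_n_events   = log1p(n_events),
    .groups       = "drop"              # return an ungrouped tibble
  )

# Without the + 1, a zero-fatality country-year would be log(0) = -Inf
log(0)
```

Country "A" in 2001 has zero fatalities, so its `ln_fatalities` is 0 rather than `-Inf`, and the row stays usable in a regression.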
Now that we’ve converted our event-level data into country-year-level data, let’s merge the growth data onto the conflict data.
A few things to keep in mind:
# Reduce the World Bank growth data down to the columns we need.
gdppp_growth2 <- gdppp_growth %>% select(country, year = date, growth = value)
# join
dat <- left_join(acled_country_year,gdppp_growth2,by=c("country","year"))
# look at the data
dat %>% sample_n(5)
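To see what the join is doing, here is a small sketch on made-up data (the `conflict` and `growth_raw` tibbles are hypothetical). `left_join()` keeps every row of the left table; rows with no match on both `country` and `year` get `NA` in the columns coming from the right table, and rows that appear only in the right table are dropped.

```r
library(dplyr)

# Hypothetical country-year conflict data (left table)
conflict <- tibble(
  country    = c("A", "A", "B"),
  year       = c(2000, 2001, 2000),
  fatalities = c(5, 0, 5)
)

# Hypothetical growth data using the raw column names `date` and `value`
growth_raw <- tibble(
  country = c("A", "A", "C"),
  date    = c(2000, 2001, 2000),
  value   = c(1.5, -0.3, 2.0)
)

# select() can rename while it subsets, so the key columns line up
growth <- growth_raw %>% select(country, year = date, growth = value)

# Country "B" has no growth record, so its growth value is NA;
# country "C" exists only in the right table and does not appear
merged <- left_join(conflict, growth, by = c("country", "year"))
```

Counting the `NA` values in the joined column is therefore a quick way to spot keys that failed to match.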
Are we missing any countries? That is, did any countries fail to come through when merging?
dat %>%
group_by(country) %>%
summarize(n_missing = sum(is.na(growth)),
total = n(),
prop_missing = n_missing/total) %>%
arrange(desc(n_missing))
`summarise()` ungrouping output (override with `.groups` argument)
Oh no! Looks like we missed a few. Let’s explore why. As we can see below, it looks like the WB has different naming conventions for some of the countries.
gdppp_growth2 %>%
filter(str_detect(country,"Congo")) %>%
distinct(country)
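A more systematic way to find mismatched names is `anti_join()`, which returns the rows of the left table that have no match in the right table. The sketch below uses two made-up name lists; the spellings are illustrative assumptions, not pulled from the actual data.

```r
library(dplyr)

# Hypothetical country names as they might appear in each source
conflict_names <- tibble(
  country = c("Democratic Republic of Congo", "Kenya", "Ivory Coast")
)
wb_names <- tibble(
  country = c("Congo, Dem. Rep.", "Kenya", "Cote d'Ivoire")
)

# Rows in conflict_names with no exact match in wb_names
unmatched <- anti_join(conflict_names, wb_names, by = "country")
```

Here `unmatched` holds the two names that are spelled differently across the sources, while "Kenya" matches and is filtered out.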
Not a problem! Let’s standardize the naming conventions using the countrycode package.
# install.packages("countrycode")
require(countrycode)
# Standardize the country names for the WB data
gdppp_growth3 <-
gdppp_growth2 %>%
mutate(country = countrycode(country,"country.name","country.name"))
# Standardize the country names for the ACLED data
acled_country_year2 <-
acled_country_year %>%
mutate(country = countrycode(country,"country.name","country.name"))
# join
dat <- left_join(acled_country_year2,gdppp_growth3,by=c("country","year"))
# look at the data
dat %>% sample_n(5)
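To make the standardization step concrete, the sketch below passes a few alternative spellings through `countrycode()`. The spellings are illustrative assumptions. Converting from `"country.name"` to `"country.name"` maps each recognized spelling to a single canonical English name, so variants of the same country end up identical; converting to `"iso3c"` is an alternative that gives a compact code to join on.

```r
library(countrycode)

# Hypothetical variants of two different countries
spellings <- c("Congo, Dem. Rep.", "Democratic Republic of Congo",
               "Congo, Rep.", "Republic of Congo")

# Map every variant to one canonical English country name
standardized <- countrycode(spellings,
                            origin = "country.name",
                            destination = "country.name")

# Alternative: map to ISO 3-letter codes
iso3 <- countrycode(spellings,
                    origin = "country.name",
                    destination = "iso3c")

# Variants of the same country should now agree with each other
standardized[1] == standardized[2]
standardized[3] == standardized[4]
```

Names that `countrycode()` cannot match come back as `NA` with a warning, so it is worth checking for new `NA` values after this step.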
Let’s check again to see if there are any issues.
dat %>%
group_by(country) %>%
summarize(n_missing = sum(is.na(growth)),
total = n(),
prop_missing = n_missing/total) %>%
arrange(desc(n_missing))
`summarise()` ungrouping output (override with `.groups` argument)
We’re still missing data for Somalia. Why is this? It looks like the WB doesn’t have growth data for Somalia after 1990.
gdppp_growth3 %>%
filter(country == "Somalia") %>%
summary()
country year growth
Length:30 Min. :1961 Min. :-20.97435
Class :character 1st Qu.:1968 1st Qu.: -6.81114
Mode :character Median :1976 Median : 0.02207
Mean :1976 Mean : -0.74568
3rd Qu.:1983 3rd Qu.: 3.17685
Max. :1990 Max. : 21.78373
Let’s now run our (very basic) analysis!
dat %>%
lm(growth ~ ln_fatalities, data = .) %>%
summary(.)
Call:
lm(formula = growth ~ ln_fatalities, data = .)
Residuals:
Min 1Q Median 3Q Max
-62.938 -1.862 0.003 1.940 137.591
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.77997 0.40920 6.794 1.88e-11 ***
ln_fatalities -0.25760 0.09809 -2.626 0.00876 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.123 on 996 degrees of freedom
(105 observations deleted due to missingness)
Multiple R-squared: 0.006878, Adjusted R-squared: 0.00588
F-statistic: 6.897 on 1 and 996 DF, p-value: 0.008764
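A rough way to read this coefficient: the outcome is a growth rate in percent and the regressor is `log(fatalities + 1)`, so a proportional change in `fatalities + 1` shifts predicted growth by the coefficient times the log of that proportion. The sketch below does the arithmetic with the estimate copied from the output above. It describes an association in this simple model, not a causal effect.

```r
# Estimate on ln_fatalities, copied from the regression output above
beta <- -0.25760

# Predicted change in growth (percentage points) for a 10% increase
# in (fatalities + 1)
effect_10pct <- beta * log(1.10)

# Predicted change in growth for a doubling of (fatalities + 1)
effect_double <- beta * log(2)

round(c(ten_percent = effect_10pct, doubling = effect_double), 4)
```

Even a doubling corresponds to a change of well under one percentage point, which fits with the very low R-squared reported above.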
Now let’s control for the number of events in each country-year.
dat %>%
lm(growth ~ ln_fatalities + ln_n_events, data = .) %>%
summary(.)
Call:
lm(formula = growth ~ ln_fatalities + ln_n_events, data = .)
Residuals:
Min 1Q Median 3Q Max
-62.857 -1.874 0.002 2.013 137.433
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.48722 0.58441 5.967 3.35e-09 ***
ln_fatalities -0.04803 0.15785 -0.304 0.7610
ln_n_events -0.39617 0.23392 -1.694 0.0907 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.116 on 995 degrees of freedom
(105 observations deleted due to missingness)
Multiple R-squared: 0.009732, Adjusted R-squared: 0.007742
F-statistic: 4.889 on 2 and 995 DF, p-value: 0.007709