Overview

There are more streamlined ways to extract data from external sources. Let’s explore some R packages that wrap useful APIs to simplify data extraction and acquisition.



World Bank Development Indicators Data


What is it?

World Development Indicators (WDI) is the primary World Bank collection of development indicators, compiled from officially recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.


Package

  • wbstats: Programmatic Access to Data and Statistics from the World Bank API


Installation

install.packages("wbstats")
library(wbstats)


Usage

Search for specific measured concepts.

# Look up indicators related to GDP.
gdp_ind <- wb_search(pattern = "gdp") # wbsearch() is deprecated as of wbstats 1.0.0
gdp_ind

Download data for a specific indicator.

# Download GDP (current US$) for 2000-2005.
# Note: wb() still runs in wbstats >= 1.0.0 but is deprecated in favor of wb_data().
wb_dat <- wb(indicator = "NY.GDP.MKTP.CD", # Use the indicator id
             country = "countries_only",   # Avoid regional designations
             startdate = 2000, enddate = 2005)

Look at the data

dplyr::glimpse(wb_dat)
## Rows: 1,216
## Columns: 7
## $ iso3c       <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "AFG", "AFG", "…
## $ date        <chr> "2005", "2004", "2003", "2002", "2001", "2000", "2005", "…
## $ value       <dbl> 2330726257, 2228491620, 2021229050, 1941340782, 192011173…
## $ indicatorID <chr> "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "NY…
## $ indicator   <chr> "GDP (current US$)", "GDP (current US$)", "GDP (current U…
## $ iso2c       <chr> "AW", "AW", "AW", "AW", "AW", "AW", "AF", "AF", "AF", "AF…
## $ country     <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Af…
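
Since `wb()` was deprecated together with `wbsearch()` in wbstats 1.0.0, the same download can be sketched with `wb_data()`. This is a minimal sketch, not the code used to produce the output above: `fetch_gdp` is a hypothetical helper name, the call needs an internet connection, and the returned columns are named differently from the `wb()` output (for example `indicator_id` rather than `indicatorID`), so downstream code may need adjusting.

```r
# Hypothetical wrapper around wbstats::wb_data() (wbstats >= 1.0.0).
# Nothing is downloaded until the function is called.
fetch_gdp <- function(start = 2000, end = 2005) {
  wbstats::wb_data(
    indicator   = "NY.GDP.MKTP.CD",  # GDP (current US$)
    country     = "countries_only",  # skip regional aggregates
    start_date  = start,
    end_date    = end,
    return_wide = FALSE              # long format with a single `value` column
  )
}

# wb_dat_new <- fetch_gdp()  # uncomment to run the download
```

Setting `return_wide = FALSE` keeps one row per country-year with a `value` column, which is closest to the layout shown above.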

Plot the smoothed trend in log GDP across years.

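library(dplyr)   # %>% and mutate()
library(ggplot2) # ggplot(), aes(), geom_smooth()

wb_dat %>%
  mutate(date = as.numeric(date), # dates arrive as character strings
         ln_gdp = log(value)) %>%
  ggplot(aes(date, ln_gdp)) +
  geom_smooth(method = "loess", color = "steelblue", fill = "steelblue") +
  ggthemes::theme_economist()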
## `geom_smooth()` using formula 'y ~ x'
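
The loess curve is a smoothed estimate rather than a literal year-by-year mean. If the exact yearly means of log GDP are wanted, they can be computed directly. The sketch below uses base R on a toy data frame shaped like the download above; the `ABW` values are copied from the output shown earlier, while the `ZZZ` rows are made up purely for illustration.

```r
# Toy panel with the same columns used in the plot (iso3c, date, value).
# ABW rows come from the glimpse() output; ZZZ rows are invented.
toy <- data.frame(
  iso3c = c("ABW", "ABW", "ZZZ", "ZZZ"),
  date  = c("2005", "2004", "2005", "2004"),
  value = c(2330726257, 2228491620, 5e9, NA),
  stringsAsFactors = FALSE
)

toy$date   <- as.numeric(toy$date) # convert character years to numbers
toy$ln_gdp <- log(toy$value)

# Mean of log GDP per year. The formula interface of aggregate()
# drops rows with a missing value before averaging.
avg_ln_gdp <- aggregate(ln_gdp ~ date, data = toy, FUN = mean)
avg_ln_gdp
```

Applying the same `aggregate()` call to `wb_dat` (after the two conversions) would give one mean per year from 2000 to 2005.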


Reddit Data


What is it?

“Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, and images, which are then voted up or down by other members.” - Wikipedia


Package

  • rreddit: R package for collecting Reddit data (development version on GitHub, mkearney/rreddit)

Installation

if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

# Not published on CRAN; install from GitHub
remotes::install_github("mkearney/rreddit")


Usage

Download the data from a specific subreddit.

## Request up to 10,000 of the most recent posts made to /r/dataisbeautiful.
## Posts arrive in batches of 100; this run returned 1,000 in total.
d <- rreddit::get_r_reddit(subreddit = "dataisbeautiful",
                           n = 10000)
## ✔ #1: collected 100 posts
## ✔ #2: collected 100 posts
## ✔ #3: collected 100 posts
## ✔ #4: collected 100 posts
## ✔ #5: collected 100 posts
## ✔ #6: collected 100 posts
## ✔ #7: collected 100 posts
## ✔ #8: collected 100 posts
## ✔ #9: collected 100 posts
## ✔ #10: collected 100 posts

We get a lot of fields back: 60 columns for each of the 1,000 posts.

dim(d)
## [1] 1000   60

Let’s look at the data.

d

Let’s plot the number of posts per day over time.

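d %>%
  transmute(created_at = as.Date(created_utc)) %>% # collapse timestamps to days
  count(created_at) %>%                            # posts per day
  ggplot(aes(created_at, n)) +
  geom_line(linewidth = 1, color = "grey30") +     # `size` for lines is deprecated in ggplot2 >= 3.4.0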
  ggthemes::theme_wsj()
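
One thing to keep in mind with daily counts: days without any posts do not appear in the counted data, so a line plot simply connects the neighbouring days. The base R sketch below shows one way to count posts per day and fill such gaps with zeros. The timestamps are made up and only stand in for the `created_utc` column.

```r
# Toy timestamps standing in for created_utc (values are invented).
created_utc <- as.POSIXct(
  c("2020-03-01 09:15:00", "2020-03-01 18:40:00",
    "2020-03-02 07:05:00", "2020-03-04 23:59:00"),
  tz = "UTC"
)

# Collapse each timestamp to its calendar day (UTC) and count posts per day.
post_day <- as.Date(created_utc, tz = "UTC")
daily <- as.data.frame(table(created_at = post_day), stringsAsFactors = FALSE)
daily$created_at <- as.Date(daily$created_at)
names(daily)[2] <- "n"

# Merge against the full sequence of days so missing days show up as zeros.
all_days <- data.frame(created_at = seq(min(post_day), max(post_day), by = "day"))
daily_full <- merge(all_days, daily, all.x = TRUE)
daily_full$n[is.na(daily_full$n)] <- 0L
daily_full
```

Here 2020-03-03 has no posts in the toy data and ends up with a count of zero instead of being skipped.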


Limitations

  • The package is a prototype and is still experimental and buggy.
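
Because the package is experimental, it can help to guard the download so that a failed request does not stop the rest of a script. The following is a minimal sketch of that idea using `tryCatch()`. `safe_fetch` and `always_fails` are hypothetical names that are not part of rreddit, and the failing function only simulates an API error so the example runs offline.

```r
# Hypothetical helper: run any fetch function and return NULL
# (with a message) instead of raising an error.
safe_fetch <- function(fetch_fun, ...) {
  tryCatch(
    fetch_fun(...),
    error = function(e) {
      message("Download failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for a download that fails, to show the behaviour offline.
always_fails <- function(...) stop("simulated API error")
res <- safe_fetch(always_fails, subreddit = "dataisbeautiful", n = 100)
is.null(res)
```

With rreddit installed, the same wrapper could be used as `safe_fetch(rreddit::get_r_reddit, subreddit = "dataisbeautiful", n = 100)`, followed by a check for `NULL` before plotting.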



Other Packages


  • rtweet: R client for accessing Twitter’s REST and stream APIs.
    • Requires an authorization key to download data, which can take a while to obtain.
  • WikipediR: A wrapper for the MediaWiki API, aimed particularly at the Wikimedia ‘production’ wikis, such as Wikipedia. It can be used to retrieve page text, information about users or the history of pages, and elements of the category tree.
  • Rfacebook: Provides an interface to the Facebook API.
    • This package hasn’t been updated since 2017. A lot has changed since then, so it may no longer work.
 

The following materials were generated for students enrolled in PPOL670. Please do not distribute without permission.

ed769@georgetown.edu | www.ericdunford.com