Overview

There are more streamlined ways to extract data from external sources than collecting it by hand. Let’s explore some R packages that wrap useful APIs to simplify data extraction and acquisition.

Google Trends Data

What is it?

“Google Trends is a website by Google that analyzes the popularity of top search queries in Google Search across various regions and languages. The website uses graphs to compare the search volume of different queries over time.” - Wikipedia

Package(s)

gtrendsR: An interface for retrieving and displaying the information returned online by Google Trends is provided. Trends (number of hits) over the time as well as geographic representation of the results can be displayed.
trendyy: a tidy wrapper to the gtrendsR package.

Installation

install.packages("gtrendsR")
install.packages("trendyy")

require(trendyy)

## Loading required package: trendyy

Usage

# Specify the search terms you're interested in. 
my_search_terms <- c("Biden","trump")

# Download the term activity across a specific period using the Google Trends API
trends <- trendy(search_terms = my_search_terms,
                 from = "2020-09-01",to="2020-09-30")

Clean and examine the data.

# Trends sends you back a lot of data. Use trendyy's helper functions to get
# what we want: search term interest over time.
dat_trends <- get_interest(trends)
dat_trends
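Before plotting, it can help to check the type of the hits column. Depending on the query, it may come back as character rather than numeric, with very low interest reported as "<1". The sketch below uses a made-up vector as a stand-in for that column; `clean_hits()` is a hypothetical helper (not part of trendyy or gtrendsR), and recoding "<1" as 0.5 is an assumption, not a rule.

```r
# Toy stand-in for the `hits` column; these values are made up.
hits_raw <- c("100", "<1", "37", "0")

# Hypothetical helper: coerce hits to numeric.
# Assumption: "<1" is recoded as 0.5 (recoding it as 0 is another option).
clean_hits <- function(x) {
  x <- as.character(x)
  x[x == "<1"] <- "0.5"
  as.numeric(x)
}

hits_clean <- clean_hits(hits_raw)
hits_clean
```

On real output, the same helper could be applied with something like `dat_trends$hits <- clean_hits(dat_trends$hits)`, assuming the column is named `hits` as in the plot that follows.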

Visualize.

# Note: tidyverse is loaded. 
dat_trends %>%
  ggplot(aes(date,hits,color=keyword)) +
  geom_line() +
  geom_point() +
  ggthemes::scale_color_fivethirtyeight() +
  ggthemes::theme_fivethirtyeight()

We can target specific geographic regions with the geo = argument. To find the code for a given country or region, look at the countries data set that ships with gtrendsR.

geo_codes <- as_tibble(gtrendsR::countries)
geo_codes
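The lookup table is long, so searching it by name is usually quicker than scrolling. The sketch below shows one way to do that on a small toy table. The column names (`country_code`, `sub_code`, `name`) and the upper-case names are assumptions about how `gtrendsR::countries` is laid out, and `find_geo()` is a hypothetical helper.

```r
# Toy stand-in for gtrendsR::countries; rows and column names are assumptions.
geo_codes_toy <- data.frame(
  country_code = c("US", "US", "GB", "DE"),
  sub_code     = c(NA, "US-CA", NA, NA),
  name         = c("UNITED STATES", "CALIFORNIA", "UNITED KINGDOM", "GERMANY"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive search on the name column.
find_geo <- function(codes, pattern) {
  codes[grepl(pattern, codes$name, ignore.case = TRUE), , drop = FALSE]
}

matches <- find_geo(geo_codes_toy, "united")
matches
```

The value in `country_code` (or `sub_code` for a sub-national region) is what would be passed to `geo =`.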

# Extract trends just for the US. Also we can draw out larger time windows
trends_us <- trendy(search_terms = my_search_terms,
                    from = "2004-01-01",to="2020-09-30",
                    geo="US")

# Draw out relevant content
dat_trends_us <- get_interest(trends_us)
dat_trends_us

Visualize again.

# Visualize
dat_trends_us %>%
  ggplot(aes(date,hits,color=keyword)) +
  geom_line() +
  ggthemes::scale_color_fivethirtyeight() +
  ggthemes::theme_fivethirtyeight()

Limitations

Hits are normalized, so values are relative to the terms searched rather than absolute counts. We can’t back out search volume.
We can only search for 5 keywords at a time.
We can only request 1000 queries a day.
Choosing the correct keywords can be difficult. Language matters.
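The first two limitations can be illustrated with a short sketch. It assumes a simplified version of the normalization (scale everything to the largest value in the query, then round to 0-100), which is not Google's exact procedure. All counts are made up, and `normalize_hits()` and `batch_terms()` are hypothetical helpers.

```r
# Simplified normalization: scale to the largest value in the query.
normalize_hits <- function(counts) {
  round(100 * counts / max(counts))
}

# Two hypothetical sets of raw counts that differ by a factor of 1,000.
raw_small <- c(biden = 20, trump = 50)
raw_large <- raw_small * 1000

# Both produce the same normalized values, so volume cannot be recovered.
norm_small <- normalize_hits(raw_small)
norm_large <- normalize_hits(raw_large)

# Split a longer keyword list into batches of at most 5 terms.
batch_terms <- function(terms, size = 5) {
  split(terms, ceiling(seq_along(terms) / size))
}

batches <- batch_terms(letters[1:12])
lengths(batches)
```

Under this simplified view, each batch is normalized separately, so hits from different batches are not directly comparable. Repeating one reference term in every batch is one way to put them on a common scale.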

World Bank Development Indicators Data

What is it?

World Development Indicators (WDI) is the primary World Bank collection of development indicators, compiled from officially recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.

Package

wbstats: Programmatic Access to Data and Statistics from the World Bank API

Installation

install.packages("wbstats")

require(wbstats)

## Loading required package: wbstats

Usage

Search for specific measured concepts.

# Look up indicators related to GDP.
# Note: wbsearch() is deprecated as of wbstats 1.0.0; use wb_search() instead.
gdp_ind <- wb_search(pattern = "gdp")

gdp_ind

Download data for a specific indicator.

# Download info on the GDP (current US$)
wb_dat <- wb(indicator = "NY.GDP.MKTP.CD", # Use the indicator id
             country = "countries_only", # Avoid regional designations
             startdate = 2000, enddate=2005)

Look at the data

glimpse(wb_dat)

## Rows: 1,216
## Columns: 7
## $ iso3c       <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "AFG", "AFG", "…
## $ date        <chr> "2005", "2004", "2003", "2002", "2001", "2000", "2005", "…
## $ value       <dbl> 2330726257, 2228491620, 2021229050, 1941340782, 192011173…
## $ indicatorID <chr> "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "NY…
## $ indicator   <chr> "GDP (current US$)", "GDP (current US$)", "GDP (current U…
## $ iso2c       <chr> "AW", "AW", "AW", "AW", "AW", "AW", "AF", "AF", "AF", "AF…
## $ country     <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Af…

Plot the average log GDP by year.

wb_dat %>%
  mutate(date = as.numeric(date),
         ln_gdp = log(value)) %>% 
  ggplot(aes(date,ln_gdp)) +
  geom_smooth(method="loess",color="steelblue",fill="steelblue") +
  ggthemes::theme_economist()

## `geom_smooth()` using formula 'y ~ x'
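The plot above fits a loess smoother through all country-year observations instead of computing the yearly average directly. As a cross-check, the mean log GDP per year can be computed explicitly. The sketch below does this in base R on a toy data frame that mimics the columns shown in the glimpse() output; the countries and values are made up.

```r
# Toy stand-in for wb_dat; the countries and values are made up.
wb_toy <- data.frame(
  iso3c = c("AAA", "BBB", "AAA", "BBB"),
  date  = c("2000", "2000", "2001", "2001"),  # years stored as character
  value = c(1e9, 1e11, 2e9, 2e11),
  stringsAsFactors = FALSE
)

# Convert the year to numeric and take the log of GDP.
wb_toy$year   <- as.numeric(wb_toy$date)
wb_toy$ln_gdp <- log(wb_toy$value)

# Average log GDP across countries within each year.
avg_ln_gdp <- aggregate(ln_gdp ~ year, data = wb_toy, FUN = mean)
avg_ln_gdp
```

The formula interface of `aggregate()` drops rows with missing values by default, which matters for real World Bank data, where not every country reports every year.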

Reddit Data

What is it?

“Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, and images, which are then voted up or down by other members.” - Wikipedia

Package

rreddit

Installation

if (!requireNamespace("remotes")) {
  install.packages("remotes")
}

# Not published on CRAN, must download from Github
remotes::install_github("mkearney/rreddit")

Usage

Download the data from a specific subreddit.

## Request up to 10,000 of the most recent posts made to /r/dataisbeautiful.
## Posts are collected in batches of 100; this run returned 1,000.
d <- rreddit::get_r_reddit(subreddit = "dataisbeautiful", 
                           n = 10000)

## ✔ #1: collected 100 posts
## ✔ #2: collected 100 posts
## ✔ #3: collected 100 posts
## ✔ #4: collected 100 posts
## ✔ #5: collected 100 posts
## ✔ #6: collected 100 posts
## ✔ #7: collected 100 posts
## ✔ #8: collected 100 posts
## ✔ #9: collected 100 posts
## ✔ #10: collected 100 posts

We get a lot of fields back.

dim(d)

## [1] 1000   60

Let’s look at the data.

Let’s plot the posting over time.

d %>% 
  transmute(created_at=as.Date(created_utc)) %>% 
  count(created_at) %>% 
  ggplot(aes(created_at,n)) +
  geom_line(size=1,color="grey30") +
  ggthemes::theme_wsj()
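The counting step in that pipeline can be checked by hand on a few timestamps. The sketch below assumes `created_utc` arrives as UNIX epoch seconds. If the package already returns a date-time, the conversion through `as.POSIXct()` can be skipped. The three timestamps are made up.

```r
# Made-up timestamps in epoch seconds: two on one day, one on the next.
created_utc <- c(1600000000, 1600003600, 1600086400)

# Convert epoch seconds to calendar dates in UTC.
created_at <- as.Date(
  as.POSIXct(created_utc, origin = "1970-01-01", tz = "UTC"),
  tz = "UTC"
)

# Count posts per day (base R equivalent of count(created_at)).
posts_per_day <- as.data.frame(table(created_at), stringsAsFactors = FALSE)
names(posts_per_day) <- c("created_at", "n")
posts_per_day$created_at <- as.Date(posts_per_day$created_at)
posts_per_day
```

Fixing the time zone to UTC keeps posts made near midnight from shifting to a different day depending on the machine's locale.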

Limitations

The package is a prototype and is still experimental/buggy.

Other Packages

rtweet: R client for accessing Twitter’s REST and stream APIs.
- You need an authorization key to download data, which can take a while to get.
WikipediR: A wrapper for the MediaWiki API, aimed particularly at the Wikimedia ‘production’ wikis, such as Wikipedia. It can be used to retrieve page text, information about users or the history of pages, and elements of the category tree.
Rfacebook: Provides an interface to the Facebook API.
- This package hasn’t been updated since 2017. A lot has changed since then. May not work.

The following materials were generated for students enrolled in PPOL670. Please do not distribute without permission.

ed769@georgetown.edu | www.ericdunford.com


Data Acquisition Using `R` Package APIs

PPOL 670 | Introduction to Data Science

Professor Eric Dunford (ed769@georgetown.edu)
McCourt School of Public Policy, Georgetown University
