R Package APIs
There are more streamlined ways to extract data from external sources. Let’s explore some R
packages that provide wrappers for useful APIs to streamline data extraction and acquisition.
What is Google Trends? “Google Trends is a website by Google that analyzes the popularity of top search queries in Google Search across various regions and languages. The website uses graphs to compare the search volume of different queries over time.” - Wikipedia
gtrendsR: An interface for retrieving and displaying the information returned online by Google Trends. Trends (number of hits) over time, as well as a geographic representation of the results, can be displayed.
trendyy: A tidy wrapper for the gtrendsR package.
## Loading required package: trendyy
# Specify the search terms you're interested in.
my_search_terms <- c("Biden", "trump")
# Download the term activity across a specific period using the Google Trends API
trends <- trendy(search_terms = my_search_terms,
                 from = "2020-09-01", to = "2020-09-30")
Clean and examine the data.
# Trends sends you back a lot of data. Use trendyy's helper functions to get
# what we want: search term interest over time.
dat_trends <- get_interest(trends)
dat_trends
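One thing worth checking at this step: Google Trends reports very low search volume as the string "<1", so the hits column can come back as character rather than numeric. The sketch below shows one way to coerce it. Note that clean_hits() is a hypothetical helper written for this example (it is not part of trendyy or gtrendsR), and mapping "<1" to 0.5 is an arbitrary choice.

```r
# Hypothetical helper (not part of trendyy or gtrendsR): coerce a `hits`
# column to numeric, treating the "<1" marker as 0.5 (an arbitrary choice).
clean_hits <- function(hits) {
  hits <- as.character(hits)
  hits[hits %in% "<1"] <- "0.5"
  as.numeric(hits)
}

# Made-up values, for illustration only
clean_hits(c("12", "<1", "100"))

# Applied to the data from above (assumes dplyr is loaded):
# dat_trends <- dat_trends %>% mutate(hits = clean_hits(hits))
```

If hits is already numeric, the helper simply returns the same values as doubles, so it is safe to apply either way.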
Visualize.
# Note: tidyverse is loaded.
dat_trends %>%
  ggplot(aes(date, hits, color = keyword)) +
  geom_line() +
  geom_point() +
  ggthemes::scale_color_fivethirtyeight() +
  ggthemes::theme_fivethirtyeight()
We can target specific geographic regions with the geo = argument. To find out which geographic code corresponds to a specific region, we need to look at the gtrendsR package.
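One way to look up a code is to search the countries lookup table that ships with gtrendsR. This is a minimal sketch that assumes the table has the columns country_code, sub_code, and name; check str(countries) in your installed version before relying on them.

```r
library(gtrendsR)

# `countries` is a lookup table shipped with gtrendsR (assumed columns:
# country_code, sub_code, name).
data("countries", package = "gtrendsR")

# Find candidate codes whose name mentions "united states"
us_codes <- countries[grepl("united states", countries$name, ignore.case = TRUE), ]
head(us_codes)
```

The value in country_code (or sub_code, for a sub-national region) is what gets passed to geo.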
# Extract trends just for the US. We can also request larger time windows.
trends_us <- trendy(search_terms = my_search_terms,
                    from = "2004-01-01", to = "2020-09-30",
                    geo = "US")
# Draw out relevant content
dat_trends_us <- get_interest(trends_us)
dat_trends_us
Visualize again.
# Visualize
dat_trends_us %>%
  ggplot(aes(date, hits, color = keyword)) +
  geom_line() +
  ggthemes::scale_color_fivethirtyeight() +
  ggthemes::theme_fivethirtyeight()
World Development Indicators (WDI) is the primary World Bank collection of development indicators, compiled from officially recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.
Search for specific measured concepts.
## Warning: `wbsearch()` is deprecated as of wbstats 1.0.0.
## Please use `wb_search()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
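The warning above points to wb_search() as the replacement for wbsearch(). The call that produced the warning is not shown, so the search term below ("GDP") is only an assumption for illustration. By default wb_search() looks through the indicator list cached inside the wbstats package, so this sketch should not need a download.

```r
library(wbstats)

# Search indicator IDs, names, and descriptions in the cached indicator list
# that ships with wbstats. The pattern is treated as a regular expression.
gdp_indicators <- wb_search(pattern = "GDP")
head(gdp_indicators)
```

The indicator_id column of the result holds the codes (for example "NY.GDP.MKTP.CD") that the download functions expect.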
Download data for a specific indicator.
# Download info on the GDP (current US$)
wb_dat <- wb(indicator = "NY.GDP.MKTP.CD",  # Use the indicator id
             country = "countries_only",    # Avoid regional designations
             startdate = 2000, enddate = 2005)
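Like wbsearch(), wb() belongs to the older wbstats interface. As a rough sketch, the same request in the wbstats 1.0.0 interface might look like the following. The object name wb_dat_new is made up for this example, and return_wide = FALSE is chosen here so that the result keeps a single value column, as in the older format.

```r
library(wbstats)

# Same request with the wbstats >= 1.0.0 interface. Note the renamed date
# arguments (start_date/end_date instead of startdate/enddate).
wb_dat_new <- wb_data(indicator   = "NY.GDP.MKTP.CD",
                      country     = "countries_only",
                      start_date  = 2000,
                      end_date    = 2005,
                      return_wide = FALSE)  # long format with a `value` column
```

The column names differ somewhat from the wb() output (for instance indicator_id rather than indicatorID), so downstream code may need small adjustments.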
Look at the data.
## Rows: 1,216
## Columns: 7
## $ iso3c <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "AFG", "AFG", "…
## $ date <chr> "2005", "2004", "2003", "2002", "2001", "2000", "2005", "…
## $ value <dbl> 2330726257, 2228491620, 2021229050, 1941340782, 192011173…
## $ indicatorID <chr> "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "NY…
## $ indicator <chr> "GDP (current US$)", "GDP (current US$)", "GDP (current U…
## $ iso2c <chr> "AW", "AW", "AW", "AW", "AW", "AW", "AF", "AF", "AF", "AF…
## $ country <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Af…
Plot the trend in log GDP by year, using a loess smooth across countries.
wb_dat %>%
  mutate(date = as.numeric(date),
         ln_gdp = log(value)) %>%
  ggplot(aes(date, ln_gdp)) +
  geom_smooth(method = "loess", color = "steelblue", fill = "steelblue") +
  ggthemes::theme_economist()
## `geom_smooth()` using formula 'y ~ x'
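The loess curve is a smoothed trend rather than a literal per-year mean. If we want the yearly averages themselves, one option is to compute them explicitly before plotting. The sketch below uses a hypothetical helper, mean_log_gdp_by_year(), and a tiny made-up data frame shaped like wb_dat (character date, numeric value); the numbers are not World Bank data.

```r
library(dplyr)

# Hypothetical helper: average log GDP per year for a data frame shaped like
# wb_dat (character `date`, numeric `value`). Missing and non-positive values
# are dropped before taking logs.
mean_log_gdp_by_year <- function(dat) {
  dat %>%
    filter(!is.na(value), value > 0) %>%
    mutate(year = as.numeric(date),
           ln_gdp = log(value)) %>%
    group_by(year) %>%
    summarise(mean_ln_gdp = mean(ln_gdp),
              n_obs = n(),
              .groups = "drop")
}

# Made-up toy values, for illustration only (not World Bank data)
toy <- data.frame(
  date  = c("2000", "2000", "2001", "2001"),
  value = c(exp(1), exp(3), exp(2), exp(4))
)
mean_log_gdp_by_year(toy)
```

Calling mean_log_gdp_by_year(wb_dat) should give one row per year, which can then be passed to ggplot() with geom_line() or geom_point().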
“Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, and images, which are then voted up or down by other members.” - Wikipedia
if (!requireNamespace("remotes")) {
  install.packages("remotes")
}
# Not published on CRAN, must be installed from GitHub
remotes::install_github("mkearney/rreddit")
Download the data from a specific subreddit.
## request up to 10,000 of the most recent posts made to /r/dataisbeautiful
## (the run shown below returned 1,000, in batches of 100)
d <- rreddit::get_r_reddit(subreddit = "dataisbeautiful",
                           n = 10000)
## ✔ #1: collected 100 posts
## ✔ #2: collected 100 posts
## ✔ #3: collected 100 posts
## ✔ #4: collected 100 posts
## ✔ #5: collected 100 posts
## ✔ #6: collected 100 posts
## ✔ #7: collected 100 posts
## ✔ #8: collected 100 posts
## ✔ #9: collected 100 posts
## ✔ #10: collected 100 posts
We get a lot of fields back.
## [1] 1000 60
Let’s look at the data.
Let’s plot the number of posts per day over time.
d %>%
  transmute(created_at = as.Date(created_utc)) %>%
  count(created_at) %>%
  ggplot(aes(created_at, n)) +
  geom_line(size = 1, color = "grey30") +
  ggthemes::theme_wsj()
The following materials were generated for students enrolled in PPOL670. Please do not distribute without permission.
ed769@georgetown.edu | www.ericdunford.com