In the Asynchronous Lecture
- R packages to streamline downloading data from the web.
- Date class objects in R using lubridate.

In the Synchronous Lecture
- If you have any questions while watching the pre-recorded material, be sure to write them down and to bring them up during the synchronous portion of the lecture.
The following tabs contain pre-recorded lecture materials for class this week. Please review these materials prior to the synchronous lecture.
Total time: Approx. 50 minutes.
Note that we'll be using loops this week. Loops were covered in Week 3, so if you're a bit shaky on writing them, please revisit that content.
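As a quick refresher, here is a minimal for loop that visits each element of a vector in turn (the vector name and its values are just an illustration):
# Print each element of a character vector in turn
fruits <- c("apple", "banana", "cherry")
for (fruit in fruits) {
  print(fruit)
}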
# Functions allow us to house code in a single, reusable operation.

# Example function that adds two values, each defaulting to 5.
add <- function(x = 5, y = 5){
  total <- x + y
  return(total)
}

# We can specify values for our arguments, altering the output as defined in
# the function.
add(x = 1000, y = 4000)

# When we call the function without providing values for our arguments, the
# default values are used.
add()
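We can also supply just one argument and let the other fall back to its default:
# Only x is supplied here, so y keeps its default value of 5
add(x = 1000)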
library(tidyverse)
library(rvest)

# Let's scrape content from the following URL, which links to a BBC story.
url <- "https://www.bbc.com/news/blogs-trending-54121992"
# Download the website
raw_website <- read_html(url)
# Let's get the headline
headline <-
  raw_website %>%
  html_nodes(xpath = '//*[@id="comp-blog-story-content"]/h2/span') %>%
  html_text()

# Date
date <-
  raw_website %>%
  html_nodes(xpath = '//*[@id="comp-blog-story-content"]/div[2]/div/div/div[1]/ul/li/div') %>%
  html_text()

# Content
content <-
  raw_website %>%
  html_nodes(xpath = '//*[@id="comp-blog-story-content"]/div[3]/p') %>%
  html_text() %>%
  paste0(., collapse = " ")
# View each data object independently
headline
date
content

# Combine the pieces into a single tibble
dat <- tibble(headline, date, content)
glimpse(dat)
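As an aside, html_nodes() also accepts CSS selectors in place of XPath. Here is a sketch of the same headline extraction using a CSS selector; the selector is an assumption based on the XPath above, so verify it against the live page:
# Equivalent headline extraction using a CSS selector rather than XPath
# (selector assumed from the page structure implied by the XPath above)
headline_css <-
  raw_website %>%
  html_nodes(css = '#comp-blog-story-content h2 span') %>%
  html_text()
headline_css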
# Build a BBC scraper function
bbc_scraper <- function(url){

  # Download the website
  raw_website <- read_html(url)

  # Get the headline
  headline <-
    raw_website %>%
    html_nodes(xpath = '//*[@id="comp-blog-story-content"]/h2/span') %>%
    html_text()

  # Date
  date <-
    raw_website %>%
    html_nodes(xpath = '//*[@id="comp-blog-story-content"]/div[2]/div/div/div[1]/ul/li/div') %>%
    html_text()

  # Content
  content <-
    raw_website %>%
    html_nodes(xpath = '//*[@id="comp-blog-story-content"]/div[3]/p') %>%
    html_text() %>%
    paste0(., collapse = " ")

  # Combine as data
  dat <- tibble(headline, date, content)
  return(dat)
}
# Run scraper
bbc_scraper("https://www.bbc.com/news/blogs-trending-53997203")
# Loop over urls to generate a data frame of news stories
urls <- c(
  "https://www.bbc.com/news/blogs-trending-54121992",
  "https://www.bbc.com/news/blogs-trending-53997203",
  "https://www.bbc.com/news/blogs-trending-53948820"
)

# Write the loop
news_stories <- NULL
for (i in seq_along(urls)) {
  news_stories <- bind_rows(news_stories, bbc_scraper(urls[i]))
}
# Look at content
news_stories
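As an aside, the loop above can be written more compactly with purrr's map_dfr(), which applies a function to each element and row-binds the results (purrr loads with the tidyverse; the object name here is just an illustration):
# Apply the scraper to each url and stack the resulting tibbles
news_stories_alt <- map_dfr(urls, bbc_scraper)
news_stories_alt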
These exercises are designed to help you reinforce your grasp of the concepts covered in the asynchronous lecture material.
For the following question, let’s use this Wikipedia page to practice some of the webscraping concepts covered in the asynchronous lecture.
Scrape the article title from the Wikipedia article.
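If you need a starting point, here is a sketch. The URL below is a placeholder for whichever article the exercise links to, and the #firstHeading selector is an assumption based on Wikipedia's usual page layout, so confirm it with your browser's inspector:
# Download a Wikipedia article and pull its title
wiki_url <- "https://en.wikipedia.org/wiki/Web_scraping"  # placeholder url
wiki_raw <- read_html(wiki_url)
wiki_title <-
  wiki_raw %>%
  html_nodes(css = "#firstHeading") %>%
  html_text()
wiki_title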
The following materials were generated for students enrolled in PPOL670. Please do not distribute without permission.
ed769@georgetown.edu | www.ericdunford.com