Learning Objectives


In the Asynchronous Lecture


In the Synchronous Lecture


If you have any questions while watching the pre-recorded material, be sure to write them down and bring them up during the synchronous portion of the lecture.




Asynchronous Materials


The following tabs contain pre-recorded lecture materials for class this week. Please review these materials prior to the synchronous lecture.

Total time: Approx. 50 minutes.


Note that this week we’ll be using loops. We covered loops in Week 3. If you’re a bit shaky on writing loops, please revisit that content.
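As a quick refresher, the basic for loop pattern looks like this (a minimal sketch using a made-up vector):

```r
# A made-up vector to iterate over
fruits <- c("apple", "banana", "cherry")

# Visit each element by index and print it
for (i in seq_along(fruits)) {
  print(fruits[i])
}
```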


_




Writing Functions

Code from the video

# Functions allow us to house code in a single operation

# Example function that adds two values.
add <- function(x=5,y=5){
  total <- x + y
  return(total)
}

# We can specify values for our arguments, altering the output as defined in
# the function
add(x=1000,y=4000)
 
# When we call the function but don't provide values for our arguments, the
# default values are used.
add()
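Arguments can also be matched by position, or supplied only in part with the defaults filling in the rest. A small sketch reusing the same add() definition from above:

```r
# Same function as above, with default values for both arguments
add <- function(x=5,y=5){
  total <- x + y
  return(total)
}

# Positional matching: 1000 is taken as x, 4000 as y
add(1000, 4000)  # 5000

# Supply only x; y falls back to its default of 5
add(x = 1)  # 6
```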



Websites

Scraping Content

Code from the video

require(tidyverse)
require(rvest)

# Let's scrape content from the following URL, which links to a BBC story.
url <-  "https://www.bbc.com/news/blogs-trending-54121992"

# Download the website
raw_website <- read_html(url)


# Let's get the headline 
headline <- 
  raw_website %>% 
  html_nodes(xpath = '//*[@id="comp-blog-story-content"]/h2/span') %>% 
  html_text()

# Date
date <- 
  raw_website %>% 
  html_nodes(xpath = '//*[@id="comp-blog-story-content"]/div[2]/div/div/div[1]/ul/li/div') %>% 
  html_text()


# Content 
content <- 
  raw_website %>% 
  html_nodes(xpath = '//*[@id="comp-blog-story-content"]/div[3]/p') %>% 
  html_text() %>% 
  paste0(.,collapse = " ")

# Each independent data object 
headline
date
content

dat <- tibble(headline,date,content)
glimpse(dat)



Building a Scraper

Code from the video

# Build a bbc scraper
bbc_scraper <- function(url){
  # Download the website
  raw_website <- read_html(url)
  
  # Let's get the headline 
  headline <- 
    raw_website %>% 
    html_nodes(xpath = '//*[@id="comp-blog-story-content"]/h2/span') %>% 
    html_text()
  
  # Date
  date <- 
    raw_website %>% 
    html_nodes(xpath = '//*[@id="comp-blog-story-content"]/div[2]/div/div/div[1]/ul/li/div') %>% 
    html_text()
  
  
  # Content 
  content <- 
    raw_website %>% 
    html_nodes(xpath = '//*[@id="comp-blog-story-content"]/div[3]/p') %>% 
    html_text() %>% 
    paste0(.,collapse = " ")
  
  
  # Combine into a single data frame
  dat <- tibble(headline,date,content)
  return(dat)
}

# Run scraper
bbc_scraper("https://www.bbc.com/news/blogs-trending-53997203")


# Loop over urls to generate a data frame of news stories 

urls <- c(
  "https://www.bbc.com/news/blogs-trending-54121992",
  "https://www.bbc.com/news/blogs-trending-53997203",
  "https://www.bbc.com/news/blogs-trending-53948820"
)

# Write loop
news_stories <- c()
for (i in seq_along(urls)) {
  news_stories <- bind_rows(news_stories, bbc_scraper(urls[i]))
}

# Look at content
news_stories
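As an aside, the grow-and-bind loop above can be written more compactly with purrr::map_dfr(), which applies a function to each element of a vector and row-binds the resulting data frames. A self-contained sketch using a toy stand-in function (bbc_scraper() would slot in the same way, but the toy version avoids hitting the network):

```r
library(purrr)
library(tibble)

# Toy stand-in for bbc_scraper(): returns a one-row tibble per input url
toy_scraper <- function(url){
  tibble(url = url, n_chars = nchar(url))
}

toy_urls <- c("https://example.com/a", "https://example.com/bb")

# Equivalent to the for loop + bind_rows(), in one call
toy_stories <- map_dfr(toy_urls, toy_scraper)
toy_stories
```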



Practice


These exercises are designed to help you reinforce your grasp of the concepts covered in the asynchronous lecture material.


For the following question, let’s use this Wikipedia page to practice some of the webscraping concepts covered in the asynchronous lecture.

require(rvest)
require(tidyverse)
wiki_url <- "https://en.wikipedia.org/wiki/Machine_learning"


_

Question 1


Download the website of the Wikipedia article.


_

Answer

site <- read_html(wiki_url)

Question 2


Scrape the article title from the Wikipedia article.


_

Answer

article_title <- 
  site %>% 
  html_nodes(xpath = '//*[@id="firstHeading"]') %>% 
  html_text()
article_title

Question 3


Scrape the article content from the Wikipedia article. Make sure the content is collapsed into a single character string.


_

Answer

article_content <- 
  site %>% 
  html_nodes(xpath = '//*[@id="mw-content-text"]/div[1]/p') %>% 
  html_text() %>% 
  paste0(.,collapse = " ")
article_content
 

The following materials were generated for students enrolled in PPOL670. Please do not distribute without permission.

ed769@georgetown.edu | www.ericdunford.com