Learning Objectives


In the Asynchronous Lecture


In the Synchronous Lecture


If you have any questions while watching the pre-recorded material, be sure to write them down and bring them up during the synchronous portion of the lecture.




Asynchronous Materials


The following tabs contain pre-recorded lecture materials for class this week. Please review these materials prior to the synchronous lecture.

Total time: Approx. 50 minutes.


Note that this week we’ll be using loops. We covered loops in Week 3. If you’re a bit shaky on writing loops, please revisit that content.
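As a quick refresher, the basic for loop pattern looks like this (a minimal sketch using a made-up vector):

```r
# A made-up vector to iterate over
fruits <- c("apple", "banana", "cherry")

# Visit each element by index and print it
for (i in seq_along(fruits)) {
  print(fruits[i])
}
```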


_




Writing Functions

Code from the video

# Functions allow us to house code in a single operation

# Example function that adds two values.
add <- function(x=5,y=5){
  total <- x + y
  return(total)
}

# We can specify values for our arguments, altering the output as defined in
# the function
add(x=1000,y=4000)
 
# When we call the function but don't provide values for our arguments, the
# default values are used.
add()
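Arguments can also be matched by position, or supplied only in part with the defaults filling in the rest. A small sketch reusing the same add() definition from above:

```r
# Same function as above, with default values for both arguments
add <- function(x=5,y=5){
  total <- x + y
  return(total)
}

# Positional matching: 1000 is taken as x, 4000 as y
add(1000, 4000)  # 5000

# Supply only x; y falls back to its default of 5
add(x = 1)  # 6
```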



Websites

Scraping Content

Code from the video

require(tidyverse)
require(rvest)

# Let's scrape content from the following URL, which links to a BBC story.
url <-  "https://www.bbc.com/news/blogs-trending-54121992"

# Download the website
raw_website <- read_html(url)


# Let's get the headline 
headline <- 
  raw_website %>% 
  html_nodes(xpath = '//*[@id="comp-blog-story-content"]/h2/span') %>% 
  html_text()

# Date
date <- 
  raw_website %>% 
  html_nodes(xpath = '//*[@id="comp-blog-story-content"]/div[2]/div/div/div[1]/ul/li/div') %>% 
  html_text()


# Content 
content <- 
  raw_website %>% 
  html_nodes(xpath = '//*[@id="comp-blog-story-content"]/div[3]/p') %>% 
  html_text() %>% 
  paste0(.,collapse = " ")

# Each independent data object 
headline
date
content

dat <- tibble(headline,date,content)
glimpse(dat)



Building a Scraper

Code from the video

# Build a bbc scraper
bbc_scraper <- function(url){
  # Download the website
  raw_website <- read_html(url)
  
  # Let's get the headline 
  headline <- 
    raw_website %>% 
    html_nodes(xpath = '//*[@id="comp-blog-story-content"]/h2/span') %>% 
    html_text()
  
  # Date
  date <- 
    raw_website %>% 
    html_nodes(xpath = '//*[@id="comp-blog-story-content"]/div[2]/div/div/div[1]/ul/li/div') %>% 
    html_text()
  
  
  # Content 
  content <- 
    raw_website %>% 
    html_nodes(xpath = '//*[@id="comp-blog-story-content"]/div[3]/p') %>% 
    html_text() %>% 
    paste0(.,collapse = " ")
  
  
  # Combine into a single data frame
  dat <- tibble(headline,date,content)
  return(dat)
}

# Run scraper
bbc_scraper("https://www.bbc.com/news/blogs-trending-53997203")


# Loop over urls to generate a data frame of news stories 

urls <- c(
  "https://www.bbc.com/news/blogs-trending-54121992",
  "https://www.bbc.com/news/blogs-trending-53997203",
  "https://www.bbc.com/news/blogs-trending-53948820"
)

# Write loop
news_stories <- c()
for (i in seq_along(urls)) {
  news_stories <- bind_rows(news_stories, bbc_scraper(urls[i]))
}

# Look at content
news_stories
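As an aside, the grow-and-bind loop above can be written more compactly with purrr::map_dfr(), which applies a function to each element of a vector and row-binds the resulting data frames. A self-contained sketch using a toy stand-in function (bbc_scraper() would slot in the same way, but the toy version avoids hitting the network):

```r
library(purrr)
library(tibble)

# Toy stand-in for bbc_scraper(): returns a one-row tibble per input url
toy_scraper <- function(url){
  tibble(url = url, n_chars = nchar(url))
}

toy_urls <- c("https://example.com/a", "https://example.com/bb")

# Equivalent to the for loop + bind_rows(), in one call
toy_stories <- map_dfr(toy_urls, toy_scraper)
toy_stories
```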



Practice


These exercises are designed to help you reinforce your grasp of the concepts covered in the asynchronous lecture material.


For the following question, let’s use this Wikipedia page to practice some of the webscraping concepts covered in the asynchronous lecture.

require(rvest)
require(tidyverse)
wiki_url <- "https://en.wikipedia.org/wiki/Machine_learning"


_

Question 1


Download the website of the Wikipedia article.


_

Answer

site <- read_html(wiki_url)

Question 2


Scrape the article title from the Wikipedia article.


_

Answer

article_title <- 
  site %>% 
  html_nodes(xpath = '//*[@id="firstHeading"]') %>% 
  html_text()
article_title

Question 3


Scrape the article content from the Wikipedia article. Make sure the content is collapsed into a single character string.


_

Answer

article_content <- 
  site %>% 
  html_nodes(xpath = '//*[@id="mw-content-text"]/div[1]/p') %>% 
  html_text() %>% 
  paste0(.,collapse = " ")
article_content
 

The following materials were generated for students enrolled in PPOL670. Please do not distribute without permission.

ed769@georgetown.edu | www.ericdunford.com