class: center, middle, inverse, title-slide

# PPOL670 | Introduction to Data Science for Public Policy

## Week 8

## Text as Data

### Prof. Eric Dunford ◆ Georgetown University ◆ McCourt School of Public Policy ◆ eric.dunford@georgetown.edu
---
layout: true

<div class="slide-footer"><span> PPOL670 | Introduction to Data Science for Public Policy           Week 8 <!-- Week of the Footer Here -->              Text-as-Data <!-- Title of the lecture here --> </span></div>

---
class: outline

# Outline for Today

![:space 10]

- **String Manipulation**

- **Text as "tidy" data**

- **Sentiment Analysis**

- **Topic Models**

---
class: newsection

# Strings

---

### String Manipulation in `R`

.pull-left[
![:center_img 50](Figures/stringr-log.png)
]

.pull-right[

```r
require(stringr) # or require(tidyverse)
```
]

--

![:space 30]

.center[

`str_<prefix>()`


```r
str_c("a","b")
```

```
## [1] "ab"
```


```r
str_detect("There is a cat in the street", pattern = "cat")
```

```
## [1] TRUE
```
]

---

![:space 20]


```r
text = "There were 5 cats!"
text
```

```
## [1] "There were 5 cats!"
```

--

![:space 10]


```r
str_view(text, "cats")
```
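---

![:space 20]

The `str_<prefix>()` naming convention extends across the whole family. A quick sketch of a few more `stringr` functions applied to `text` (outputs shown as comments):


```r
str_length(text)     # 18 -- number of characters in the string
str_sub(text, 1, 5)  # "There" -- subset a string by position
str_split(text, " ") # list: "There" "were" "5" "cats!" -- split on a pattern
```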
---

### Regular Expressions (regex)

.center[

| Regex | Description |
|:-----:|------------------------------|
| `+` | match 1 or more of the previous character |
| `*` | match 0 or more of the previous character |
| `?` | the preceding item is optional (i.e., match 0 or 1 of the previous character) |
| `[ ]` | match 1 of the set of things inside the bracket |
| `\\w` | match a "word" character (i.e., letters and numbers) |
| `\\d` | match digits |
| `\\s` | match a space character |
| `\\t` | match a "tab" character |
| `\\n` | match a "newline" character |
| `^` | the "beginning edge" of a string |
| `$` | the "ending edge" of a string |
| `{n}` | the preceding character is matched n times |

]

---

![:space 30]


```r
str_view(string = text, pattern = "\\d")
```
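![:space 5]

A small sketch exercising a few more rows of the table on the same `text` (outputs shown as comments):


```r
str_detect(text, "^There") # TRUE -- string begins with "There"
str_detect(text, "!$")     # TRUE -- string ends with "!"
str_detect(text, "cats?")  # TRUE -- the "s" is optional
str_extract(text, "[0-9]") # "5" -- match one character from the set
```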
---

![:space 30]


```r
str_view_all(string = text, pattern = "\\s")
```
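![:space 5]

Beyond viewing matches, `str_count()` tallies them; a small sketch (`text` contains three spaces):


```r
str_count(text, "\\s") # 3 -- one match per space character
```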
---

![:space 30]


```r
str_view_all(string = text, pattern = "\\d+\\s+\\w+")
```
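![:space 5]

The same composite pattern works with the other `str_` verbs; for instance, we can pull the match out as a value (a small sketch):


```r
str_extract(text, "\\d+\\s+\\w+") # "5 cats"
```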
---

### String Editing

![:space 5]


```r
str_replace(string = text, pattern = "cats", replacement = "dogs")
```

```
## [1] "There were 5 dogs!"
```

--

![:space 5]


```r
str_remove(string = text, pattern = "[:punct:]")
```

```
## [1] "There were 5 cats"
```

--

![:space 5]


```r
str_extract(text, pattern = "\\d")
```

```
## [1] "5"
```

---

### Locating text


```r
texts <- c("The man drank 5 beers.",
           "Obama was president.",
           "I think we should walk 2 blocks.")
```

![:space 5]


```r
str_detect(texts, pattern = "\\d")
```

```
## [1]  TRUE FALSE  TRUE
```

![:space 5]


```r
str_which(texts, pattern = "\\d")
```

```
## [1] 1 3
```

---

### Insert data in a string

![:space 5]


```r
x <- 10
str_c("The value is ", x, "%")
```

```
## [1] "The value is 10%"
```

--

![:space 3]


```r
x <- 10
str_glue("The value is {x}%")
```

```
## The value is 10%
```

--

![:space 3]


```r
x <- 10
str_glue("The value is {x + 5}%")
```

```
## The value is 15%
```

---

### Capitalization


```r
text2 <- "TeXt MininG iN r"
```


```r
str_to_lower(text2)
```

```
## [1] "text mining in r"
```


```r
str_to_upper(text2)
```

```
## [1] "TEXT MINING IN R"
```


```r
str_to_title(text2)
```

```
## [1] "Text Mining In R"
```


```r
str_to_sentence(text2)
```

```
## [1] "Text mining in r"
```

---
class: newsection

# Tidy Text

---

![:space 20]

.pull-left[
![:center_img 85](Figures/tidytext-logo.png)
]

.pull-right[
<br><br><br>

```r
require(tidytext)
```
]

---

![:space 5]

.pull-left[
![:center_img 75](Figures/tidy-text-cover.png)
]

.pull-right[

- ![:text_color steelblue](`Tidy` data principles)

- ![:text_color orangered](Plays well with the existing data manipulation and visualization toolkit)

- ![:text_color forestgreen](Streamlined integration with other text mining libraries) that require the data to be organized differently

]

--

![:space 65]

![:center_img 100](Figures/tidyflow-ch-1.png)

---

![:space 5]

.pull-left[
![:center_img 75](Figures/tidy-text-cover.png)
]

.pull-right[

**Tidy Text Data**

- Each variable is a column

- Each observation is a row

- Each type of observational unit is a table

The tidy text format is a **table with one token per row**.

]

![:space 65]

![:center_img 100](Figures/tidyflow-ch-1.png)

---

### Tokenization

![:space 5]


```r
text
```

```
## [1] "US opposition politicians and aid agencies have questioned a decision by President Donald Trump to cut off aid to three Central American states --- or so the story reports!"
```
```r
text_data <- tibble(text = text)
text_data
```

```
## # A tibble: 1 x 1
##   text                                                                         
##   <chr>                                                                        
## 1 US opposition politicians and aid agencies have questioned a decision by Pres…
```

---

### Tokenization (words)


```r
text_data %>% 
  unnest_tokens(word,text,token = "words") # "words" is the default
```

```
## # A tibble: 28 x 1
##    word       
##    <chr>      
##  1 us         
##  2 opposition 
##  3 politicians
##  4 and        
##  5 aid        
##  6 agencies   
##  7 have       
##  8 questioned 
##  9 a          
## 10 decision   
## # … with 18 more rows
```

---

### Tokenization (characters)


```r
text_data %>% 
  unnest_tokens(word,text,token = "characters")
```

```
## # A tibble: 140 x 1
##    word 
##    <chr>
##  1 u    
##  2 s    
##  3 o    
##  4 p    
##  5 p    
##  6 o    
##  7 s    
##  8 i    
##  9 t    
## 10 i    
## # … with 130 more rows
```

---

### Tokenization (ngrams)


```r
text_data %>% 
  unnest_tokens(word,text,token = "ngrams",n=2)
```

```
## # A tibble: 27 x 1
##    word                   
##    <chr>                  
##  1 us opposition          
##  2 opposition politicians 
##  3 politicians and        
##  4 and aid                
##  5 aid agencies           
##  6 agencies have          
##  7 have questioned        
##  8 questioned a           
##  9 a decision             
## 10 decision by            
## # … with 17 more rows
```

---

### Tokenization (ngrams)


```r
text_data %>% 
  unnest_tokens(word,text,token = "ngrams",n=3)
```

```
## # A tibble: 26 x 1
##    word                      
##    <chr>                     
##  1 us opposition politicians 
##  2 opposition politicians and
##  3 politicians and aid       
##  4 and aid agencies          
##  5 aid agencies have         
##  6 agencies have questioned  
##  7 have questioned a         
##  8 questioned a decision     
##  9 a decision by             
## 10 decision by president     
## # … with 16 more rows
```

---

### Tokenization (tweets)


```r
tibble(text = "Hey @professor, this assignment doesn't make sense") %>% 
  unnest_tokens(word,text,token = "tweets") %>% head(3)
```

```
## # A tibble: 3 x 1
##   word      
##   <chr>     
## 1 hey       
## 2 @professor
## 3 this
```


```r
tibble(text = "Hey @professor, this assignment doesn't make sense") %>% 
  unnest_tokens(word,text,token = "words") %>% head(3)
```

```
## # A tibble: 3 x 1
##   word     
##   <chr>    
## 1 hey      
## 2 professor
## 3 this
```

---

### From words to numbers (a.k.a. counting)

Number of times a word appears in the text.


```r
text_data %>% 
  unnest_tokens(word,text) %>% 
  count(word, sort = TRUE)
```

```
## # A tibble: 26 x 2
##    word         n
##    <chr>    <int>
##  1 aid          2
##  2 to           2
##  3 a            1
##  4 agencies     1
##  5 american     1
##  6 and          1
##  7 by           1
##  8 central      1
##  9 cut          1
## 10 decision     1
## # … with 16 more rows
```

---

### Stopwords

Some words are so common that they carry little to no _unique_ information and are usually removed. `tidytext` comes with a database of common stop words, which we can leverage to drop these low-information words.


```r
set.seed(11)
stop_words %>% sample_n(10)
```

```
## # A tibble: 10 x 2
##    word       lexicon 
##    <chr>      <chr>   
##  1 anyone     SMART   
##  2 about      snowball
##  3 large      onix    
##  4 everywhere SMART   
##  5 worked     onix    
##  6 i          snowball
##  7 say        SMART   
##  8 i'll       SMART   
##  9 been       SMART   
## 10 twice      SMART
```

---

### Drop Stopwords


```r
text_data %>% 
  unnest_tokens(word,text) %>% 
* anti_join(stop_words) %>% 
  count(word,sort = T)
```

```
## # A tibble: 14 x 2
##    word            n
##    <chr>       <int>
##  1 aid             2
##  2 agencies        1
##  3 american        1
##  4 central         1
##  5 cut             1
##  6 decision        1
##  7 donald          1
##  8 opposition      1
##  9 politicians     1
## 10 president       1
## 11 questioned      1
## 12 reports         1
## 13 story           1
## 14 trump           1
```

---

### Stemming

Often words are fundamentally the same but appear different because of their tense or form.
```r
txt = "cleaned cleaning cleaner beauty beautiful killing killed killer"

tibble(text = txt) %>% 
  unnest_tokens(word,text) %>% 
  count(word)
```

```
## # A tibble: 8 x 2
##   word          n
##   <chr>     <int>
## 1 beautiful     1
## 2 beauty        1
## 3 cleaned       1
## 4 cleaner       1
## 5 cleaning      1
## 6 killed        1
## 7 killer        1
## 8 killing       1
```

---

### Stemming

Stemming allows us to reduce a word down to its fundamental root. ( _Note: Need to install the `SnowballC` package_ )


```r
txt = "cleaned cleaning cleaner beauty beautiful killing killed killer"

tibble(text = txt) %>% 
  unnest_tokens(word,text) %>% 
* mutate(word = SnowballC::wordStem(word)) %>% 
  count(word)
```

```
## # A tibble: 5 x 2
##   word        n
##   <chr>   <int>
## 1 beauti      2
## 2 clean       2
## 3 cleaner     1
## 4 kill        2
## 5 killer      1
```

---

### Stemming


```r
text_data %>% 
  unnest_tokens(word,text) %>% 
  anti_join(stop_words) %>% 
* mutate(word = SnowballC::wordStem(word)) %>% 
  count(word,sort = T)
```

```
## # A tibble: 14 x 2
##    word           n
##    <chr>      <int>
##  1 aid            2
##  2 agenc          1
##  3 american       1
##  4 central        1
##  5 cut            1
##  6 decis          1
##  7 donald         1
##  8 opposit        1
##  9 politician     1
## 10 presid         1
## 11 question       1
## 12 report         1
## 13 stori          1
## 14 trump          1
```

---

### Example


```r
require(rvest)

urls <- c("https://www.bbc.com/news/election-us-2020-54437852",
          "https://www.bbc.com/news/world-us-canada-54441986",
          "https://www.bbc.com/news/election-us-2020-54423497")

# Recall the BBC scraper we built last class:
news_data <- c()
for(i in 1:length(urls)){
  draw <- bbc_scraper(urls[i])
  news_data <- bind_rows(news_data,draw)
}
```

---

### Example


```r
news_data <- 
  news_data %>% 
  mutate(story_id = row_number()) # Create an id for the document

glimpse(news_data)
```

```
## Rows: 3
## Columns: 4
## $ headline <chr> "Trump Covid: Biden warns there is 'a lot to be concerned ab…
## $ date     <chr> "6 October 2020", "6 October 2020", "6 October 2020"
## $ story    <chr> "Democratic presidential nominee Joe Biden has criticised US…
## $ story_id <int> 1, 2, 3
```

---


```r
text_data <- 
  news_data %>% 
  group_by(story_id) %>% 
* unnest_tokens(word,story) %>% 
  ungroup()

text_data
```

```
## # A tibble: 2,894 x 4
##    headline                                       date        story_id word     
##    <chr>                                          <chr>          <int> <chr>    
##  1 Trump Covid: Biden warns there is 'a lot to b… 6 October …        1 democrat…
##  2 Trump Covid: Biden warns there is 'a lot to b… 6 October …        1 presiden…
##  3 Trump Covid: Biden warns there is 'a lot to b… 6 October …        1 nominee  
##  4 Trump Covid: Biden warns there is 'a lot to b… 6 October …        1 joe      
##  5 Trump Covid: Biden warns there is 'a lot to b… 6 October …        1 biden    
##  6 Trump Covid: Biden warns there is 'a lot to b… 6 October …        1 has      
##  7 Trump Covid: Biden warns there is 'a lot to b… 6 October …        1 criticis…
##  8 Trump Covid: Biden warns there is 'a lot to b… 6 October …        1 us       
##  9 Trump Covid: Biden warns there is 'a lot to b… 6 October …        1 president
## 10 Trump Covid: Biden warns there is 'a lot to b… 6 October …        1 donald   
## # … with 2,884 more rows
```

---


```r
# Term Frequency 
text_data %>% 
  group_by(story_id) %>% 
  count(word,sort=T)
```

```
## # A tibble: 1,310 x 3
## # Groups:   story_id [3]
##    story_id word      n
##       <int> <chr> <int>
##  1        1 the      74
##  2        3 the      55
##  3        1 to       52
##  4        1 a        45
##  5        1 he       31
##  6        2 the      31
##  7        1 and      29
##  8        3 and      25
##  9        1 of       22
## 10        1 on       22
## # … with 1,300 more rows
```

---


```r
### Dropping stopwords
text_data <- 
  text_data %>% 
* anti_join(stop_words)

text_data %>% 
  group_by(story_id) %>% 
  count(word,sort = T)
```

```
## # A tibble: 874 x 3
## # Groups:   story_id [3]
##    story_id word             n
##       <int> <chr>        <int>
##  1        1 trump           21
##  2        1 president       17
##  3        3 presidential    12
##  4        3 debate          11
##  5        1 house           10
##  6        1 white           10
##  7        1 coronavirus      9
##  8        1 people           9
##  9        2 positive         8
## 10        2 president        8
## # … with 864 more rows
```

---

Let's drop words that have digits in them... using regular expressions and the `stringr` package.


```r
# Further Cleaning
text_data %>% 
  filter(str_detect(word,"\\d")) %>% 
  select(story_id,word)
```

```
## # A tibble: 32 x 2
##    story_id word 
##       <int> <chr>
##  1        1 19   
##  2        1 3    
##  3        1 19   
##  4        1 19   
##  5        1 10   
##  6        1 19   
##  7        1 24   
##  8        1 7    
##  9        1 74   
## 10        1 12   
## # … with 22 more rows
```

---

Let's drop words that have digits in them... using regular expressions and the `stringr` package.


```r
# Further Cleaning
text_data <- 
  text_data %>% 
  filter(!str_detect(word,"\\d"))

text_data %>% 
  select(story_id,word)
```

```
## # A tibble: 1,262 x 2
##    story_id word        
##       <int> <chr>       
##  1        1 democratic  
##  2        1 presidential
##  3        1 nominee     
##  4        1 joe         
##  5        1 biden       
##  6        1 criticised  
##  7        1 president   
##  8        1 donald      
##  9        1 trump       
## 10        1 downplaying 
## # … with 1,252 more rows
```

---


```r
# Stemming
text_data <- 
  text_data %>% 
  mutate(word = SnowballC::wordStem(word))

# Now count for real
text_data_cnts <- 
  text_data %>% 
  group_by(story_id,headline) %>% 
  count(word,sort=T) %>% 
  ungroup()

text_data_cnts
```

```
## # A tibble: 776 x 4
##    story_id headline                                              word         n
##       <int> <chr>                                                 <chr>    <int>
##  1        1 Trump Covid: Biden warns there is 'a lot to be conce… trump       21
##  2        1 Trump Covid: Biden warns there is 'a lot to be conce… presid      17
##  3        3 Kamala Harris v Mike Pence: Why this vice-president … debat       16
##  4        3 Kamala Harris v Mike Pence: Why this vice-president … preside…    12
##  5        1 Trump Covid: Biden warns there is 'a lot to be conce… hous        10
##  6        1 Trump Covid: Biden warns there is 'a lot to be conce… white       10
##  7        1 Trump Covid: Biden warns there is 'a lot to be conce… coronav…     9
##  8        1 Trump Covid: Biden warns there is 'a lot to be conce… peopl        9
##  9        2 Covid: US military leaders quarantine after official… chief        8
## 10        2 Covid: US military leaders quarantine after official… posit        8
## # … with 766 more rows
```

---

### Term Frequency-Inverse Document Frequency (tf-idf)

Measures how important a word is to a document relative to the whole collection of documents. Mainly, we want to _up-weight_ words that appear in only a few of the documents, and _down-weight_ words that are used often by all the documents.

<br>

`$$idf(term) = ln(\frac{n_{documents}}{n_{documents~containing~term}})$$`

`$$tf(term) = \frac{n_{word}}{n_{document}}$$`

`$$tf\_idf(term) = tf(term)*idf(term)$$`

![:space 5]

It's in the words particular to an author or document that the real information lies! Note that a word appearing in all three stories gets `\(idf = ln(3/3) = 0\)`, so its tf-idf is zero no matter how frequently it is used.

---


```r
text_data_cnts2 <- 
  text_data_cnts %>% 
  bind_tf_idf(word, story_id, n)

text_data_cnts2 %>% select(n,tf,idf,tf_idf)
```

```
## # A tibble: 776 x 4
##        n     tf   idf  tf_idf
##    <int>  <dbl> <dbl>   <dbl>
##  1    21 0.0339 0     0      
##  2    17 0.0275 0     0      
##  3    16 0.0455 0     0      
##  4    12 0.0341 0.405 0.0138 
##  5    10 0.0162 0.405 0.00655
##  6    10 0.0162 0.405 0.00655
##  7     9 0.0145 0.405 0.00590
##  8     9 0.0145 0     0      
##  9     8 0.0275 0.405 0.0111 
## 10     8 0.0275 0     0      
## # … with 766 more rows
```

---

### Visualize!

![:space 10]


```r
text_data_cnts2 %>%
  group_by(story_id) %>% 
  top_n(5, tf_idf) %>% 
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf,fill=headline)) +
  geom_col(show.legend = F) +
  xlab(NULL) +
  coord_flip() + 
  facet_wrap(~headline,ncol=1,scales="free") +
  theme(text=element_text(size=30))
```

---

### Visualize!
<img src="lecture-week-08-text-as-data-ppol670_files/figure-html/unnamed-chunk-50-1.png" style="display: block; margin: auto;" />

---
class: newsection

# Sentiment

---

![:space 25]

![:center_img](Figures/tidyflow-ch-2.png)

---

### Sentiment Dictionaries

![:space 15]

--

.center[

| Dictionary Name | Source |
|:---------------:|--------|
| `nrc` | http://saifmohammad.com/WebPages/lexicons.html |
| `AFINN` | http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010 |
| `bing` | https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html |
| `loughran` | https://sraf.nd.edu/ |

]

---

### Sentiment Dictionaries

.pull-left[

```r
get_sentiments("afinn")
```

```
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows
```
]

.pull-right[

```r
get_sentiments("bing")
```

```
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows
```
]

---

### Sentiment to text


```r
sent_dict <- get_sentiments("afinn")

sent_text <- 
  text_data %>% 
  inner_join(sent_dict) %>% 
  ungroup()

sent_text
```

```
## # A tibble: 56 x 5
##    headline                                    date        story_id word   value
##    <chr>                                       <chr>          <int> <chr>  <dbl>
##  1 Trump Covid: Biden warns there is 'a lot t… 6 October …        1 fear      -2
##  2 Trump Covid: Biden warns there is 'a lot t… 6 October …        1 glad       3
##  3 Trump Covid: Biden warns there is 'a lot t… 6 October …        1 hope       2
##  4 Trump Covid: Biden warns there is 'a lot t… 6 October …        1 matter     1
##  5 Trump Covid: Biden warns there is 'a lot t… 6 October …        1 death     -2
##  6 Trump Covid: Biden warns there is 'a lot t… 6 October …        1 save       2
##  7 Trump Covid: Biden warns there is 'a lot t… 6 October …        1 threat    -2
##  8 Trump Covid: Biden warns there is 'a lot t… 6 October …        1 prote…     1
##  9 Trump Covid: Biden warns there is 'a lot t… 6 October …        1 danger    -2
## 10 Trump Covid: Biden warns there is 'a lot t… 6 October …        1 prote…     1
## # … with 46 more rows
```

---


```r
sent_text %>% 
  distinct(word,value) %>% 
* mutate(word = fct_reorder(word,value)) %>% 
  ggplot(aes(word, value)) +
  geom_col(show.legend = FALSE,aes(fill=value)) +
  scale_fill_viridis_c() +
  coord_flip() +
  theme(text=element_text(size=20))
```

<img src="lecture-week-08-text-as-data-ppol670_files/figure-html/unnamed-chunk-55-1.png" style="display: block; margin: auto;" />

---


```r
text_data %>% 
  ungroup %>% 
  inner_join(get_sentiments("bing"),by = "word") %>% 
  distinct(word,sentiment) %>% 
  mutate(word = fct_reorder(word,sentiment=="positive")) %>% 
  ggplot(aes(word, sentiment,label=word,color=sentiment)) +
  geom_text(size=3,show.legend = FALSE) +
  coord_flip() +
  scale_color_manual(values=c("darkred","steelblue")) +
  theme_minimal() +
  theme(text=element_text(size=20),axis.text.y = element_blank())
```

<img src="lecture-week-08-text-as-data-ppol670_files/figure-html/unnamed-chunk-56-1.png" style="display: block; margin: auto;" />

---

### Example: Inaugural Speeches


```r
inaug_dat <- read_csv("Data/inaug_speeches.csv")
inaug_dat
```

```
## # A tibble: 58 x 5
##       X1 Name      `Inaugural Address` Date        text                         
##    <dbl> <chr>     <chr>               <chr>       <chr>                        
##  1     4 George W… First Inaugural Ad… Thursday, … "Fellow-Citizens of the Sena…
##  2     5 George W… Second Inaugural A… Monday, Ma… "Fellow Citizens: \xa0\xa0I…
##  3     6 John Ada… Inaugural Address   Saturday, … "\xa0\xa0WHEN it was first p…
##  4     7 Thomas J… First Inaugural Ad… Wednesday,… "Friends and Fellow-Citizens…
##  5     8 Thomas J… Second Inaugural A… Monday, Ma… "\xa0\xa0PROCEEDING, fellow-…
##  6     9 James Ma… First Inaugural Ad… Saturday, … "\xa0\xa0UNWILLING to depart…
##  7    10 James Ma… Second Inaugural A… Thursday, … "\xa0\xa0ABOUT to add the so…
##  8    11 James Mo… First Inaugural Ad… Tuesday, M… "\xa0\xa0I SHOULD be destitu…
##  9    12 James Mo… Second Inaugural A… Monday, Ma… "Fellow-Citizens: \xa0\xa0I…
## 10    13 John Qui… Inaugural Address   Friday, Ma… "\xa0\xa0IN compliance with …
## # … with 48 more rows
```

---

### Example: Inaugural Speeches


```r
obama <- 
  inaug_dat %>% 
  filter(Name == "Barack Obama") %>% 
  select(address = `Inaugural Address`,text)

obama
```

```
## # A tibble: 2 x 2
##   address              text                                                     
##   <chr>                <chr>                                                    
## 1 First Inaugural Add… "My fellow citizens: \xa0\xa0I stand here today humbl…
## 2 Second Inaugural Ad… "Vice President Biden, Mr. Chief Justice, Members of the…
```

---

### Example: Inaugural Speeches


```r
obama_txt <- 
  obama %>% 
  unnest_tokens(word,text) %>% 
  anti_join(stop_words) %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(address) %>% 
  mutate(index = row_number()) %>% 
  ungroup()

obama_txt
```

```
## # A tibble: 300 x 4
##    address                 word     value index
##    <chr>                   <chr>    <dbl> <int>
##  1 First Inaugural Address grateful     3     1
##  2 First Inaugural Address trust        1     2
##  3 First Inaugural Address peace        2     3
##  4 First Inaugural Address vision       1     4
##  5 First Inaugural Address faithful     3     5
##  6 First Inaugural Address true         2     6
##  7 First Inaugural Address crisis      -3     7
##  8 First Inaugural Address war         -2     8
##  9 First Inaugural Address reaching     1     9
## 10 First Inaugural Address violence    -3    10
## # … with 290 more rows
```

---

### Example: Inaugural Speeches


```r
obama_txt %>% 
  ggplot(aes(index,value,fill=address)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~address,ncol = 1, scales = "free_y") +
  theme_minimal() + theme(text = element_text(size=20))
```

<img src="lecture-week-08-text-as-data-ppol670_files/figure-html/unnamed-chunk-60-1.png" style="display: block; margin: auto;" />

---

### Example: Inaugural Speeches

![:space 15]


```r
obama_txt %>% 
  group_by(address) %>% 
  summarize(proportion_words_positive = sum(value>0)/n())
```

```
## # A tibble: 2 x 2
##   address                  proportion_words_positive
##   <chr>                                        <dbl>
## 1 First Inaugural Address                      0.589
## 2 Second Inaugural Address                     0.657
```

---

### Example: Inaugural Speeches

![:space 10]


```r
obama_txt %>% 
  group_by(word,address) %>% 
  summarize(n = n(),score=max(value)) %>% 
  filter(n>1) %>% 
  ggplot(aes(label=word,size=n,color=score)) +
* ggwordcloud::geom_text_wordcloud_area() +
  scale_color_gradient(low="darkred",high="steelblue") +
  scale_size_area(max_size = 15) +
  facet_wrap(~address,scales="free") +
  theme(text = element_text(size=20))
```

---

### Example: Inaugural Speeches

<img src="lecture-week-08-text-as-data-ppol670_files/figure-html/unnamed-chunk-63-1.png" style="display: block; margin: auto;" />

---
class: newsection

# Topic Models

---

![:space 10]

![:center_img](Figures/tidyflow-ch-6.png)

---

### Latent Dirichlet Allocation (LDA)

- **Every document is a mixture of topics**

  - We could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.”

  - We could say “Document 1 is 90% about politics and 10% about entertainment, while Document 2 is 30% about politics and 70% about entertainment.”
- **Every topic is a mixture of words**

  - The most common words in the "politics" topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”.

  - Words can be shared between topics; a word like “budget” might appear in both equally.

---

### Latent Dirichlet Allocation (LDA)

![:space 10]

![:center_img](Figures/IntroToLDA.png)

---


```r
library(topicmodels)
data("AssociatedPress")
AssociatedPress
```

```
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)
```

--

We can easily convert a document-term matrix (see reading) back to a tidy text format with `tidy()`


```r
AssociatedPress %>% tidy()
```

```
## # A tibble: 302,031 x 3
##    document term       count
##       <int> <chr>      <dbl>
##  1        1 adding         1
##  2        1 adult          2
##  3        1 ago            1
##  4        1 alcohol        1
##  5        1 allegedly      1
##  6        1 allen          1
##  7        1 apparently     2
##  8        1 appeared       1
##  9        1 arrested       1
## 10        1 assault        1
## # … with 302,021 more rows
```

---

### LDA


```r
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))
ap_lda
```

```
## A LDA_VEM topic model with 2 topics.
```

![:space 15]

- Like the other clustering methods that we encountered, `k` is arbitrary. Here we set `\(k = 2\)`

- Running an LDA is _easy_ (though it can be computationally expensive)

- The challenge lies in **interpreting** the topic output.

---

- Extract information regarding topic assignment using the `tidy()` function from the `tidytext` package

- Parameters of interest
  - "beta" → Term to Topic
  - "gamma" → Document to Topic


```r
ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
```

```
## # A tibble: 20,946 x 3
##    topic term           beta
##    <int> <chr>         <dbl>
##  1     1 aaron      1.69e-12
##  2     2 aaron      3.90e- 5
##  3     1 abandon    2.65e- 5
##  4     2 abandon    3.99e- 5
##  5     1 abandoned  1.39e- 4
##  6     2 abandoned  5.88e- 5
##  7     1 abandoning 2.45e-33
##  8     2 abandoning 2.34e- 5
##  9     1 abbott     2.13e- 6
## 10     2 abbott     2.97e- 5
## # … with 20,936 more rows
```

---


```r
ap_top_terms <- 
  ap_topics %>%
  group_by(topic) %>% # Group by the topics
  # Grab the top 10 words most
  # associated with the topic
  top_n(10, beta) %>% 
  ungroup() %>% # Ungroup
  arrange(topic, -beta) # Arrange

ap_top_terms
```

```
## # A tibble: 20 x 3
##    topic term          beta
##    <int> <chr>        <dbl>
##  1     1 percent    0.00981
##  2     1 million    0.00684
##  3     1 new        0.00594
##  4     1 year       0.00575
##  5     1 billion    0.00427
##  6     1 last       0.00368
##  7     1 two        0.00360
##  8     1 company    0.00348
##  9     1 people     0.00345
## 10     1 market     0.00333
## 11     2 i          0.00705
## 12     2 president  0.00489
## 13     2 government 0.00452
## 14     2 people     0.00407
## 15     2 soviet     0.00372
## 16     2 new        0.00370
## 17     2 bush       0.00370
## 18     2 two        0.00361
## 19     2 years      0.00339
## 20     2 states     0.00320
```

---

### Deciphering the Topics


```r
ap_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  theme(text=element_text(size=16))
```

<img src="lecture-week-08-text-as-data-ppol670_files/figure-html/unnamed-chunk-69-1.png" style="display: block; margin: auto;" />

---

### Documents to Topics


```r
ap_documents <- 
  tidy(ap_lda, matrix = "gamma") %>% 
  arrange(document,gamma)

ap_documents
```

```
## # A tibble: 4,492 x 3
##    document topic gamma
##       <int> <int> <dbl>
##  1        1     1 0.248
##  2        1     2 0.752
##  3        2     1 0.362
##  4        2     2 0.638
##  5        3     2 0.473
##  6        3     1 0.527
##  7        4     1 0.357
##  8        4     2 0.643
##  9        5     1 0.181
## 10        5     2 0.819
## # … with 4,482 more rows
```

---

### Documents to Topics

Document #6 is highly associated with the "politics" topic.


```r
ap_documents %>% filter(document==6)
```

```
## # A tibble: 2 x 3
##   document topic    gamma
##      <int> <int>    <dbl>
## 1        6     1 0.000588
## 2        6     2 0.999
```


```r
tidy(AssociatedPress) %>%
  filter(document == 6) %>%
  arrange(desc(count))
```

```
## # A tibble: 287 x 3
##    document term           count
##       <int> <chr>          <dbl>
##  1        6 noriega           16
##  2        6 panama            12
##  3        6 jackson            6
##  4        6 powell             6
##  5        6 administration     5
##  6        6 economic           5
##  7        6 general            5
##  8        6 i                  5
##  9        6 panamanian         5
## 10        6 american           4
## # … with 277 more rows
```
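---

### Documents to Topics

As a closing sketch (assuming the `ap_documents` object from above), we could assign each document to its single most likely topic and tally how the corpus splits:


```r
ap_documents %>% 
  group_by(document) %>% 
  top_n(1, gamma) %>%  # keep each document's highest-gamma topic
  ungroup() %>% 
  count(topic)         # number of documents assigned to each topic
```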