Data

# Load the tidyverse (readr, dplyr, stringr) used below
library(tidyverse)

# Read the data in
inaug_dat <- read_csv("inaug_speeches.csv")

# Clean it 
dat <- 
  inaug_dat %>% 
  transmute(president = str_to_lower(Name) %>% str_replace_all(.," ","_"),
            address = case_when(
                str_detect(`Inaugural Address`,"First") ~ "first",
                str_detect(`Inaugural Address`,"Second") ~ "second",
                str_detect(`Inaugural Address`,"Third") ~ "third",
                str_detect(`Inaugural Address`,"Fourth") ~ "fourth",
                TRUE ~ "first"),  # default when no ordinal is found
            date = as.Date(Date,"%A, %B %d, %Y"),
            year = lubridate::year(date),
            length = str_count(text),  # speech length in characters
            text = text)

# Adjust for one problematic date
dat[dat$president=="bill_clinton" & dat$address=="second",]$date <- as.Date("1997-01-20")
dat[dat$president=="bill_clinton" & dat$address=="second",]$year <- 1997

head(dat)



Questions



(1) Do speeches get longer over time?



(2) Who was the most/least verbose?



(3) Convert text to tidy format.

  • Tokenize using words as the fundamental unit.
  • Remove all stop words.
  • Remove all digits.
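The three steps above can be sketched with the tidytext package. This is a minimal sketch, not a definitive answer: the `toy` tibble is hypothetical data standing in for `dat`, and using tidytext's bundled `stop_words` lexicon is an assumption about which stop word list is intended.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical stand-in for `dat`, with the same column names
toy <- tibble(
  president = c("a_president", "b_president"),
  address   = c("first", "first"),
  text      = c("We the people, in 1789, shall endure.",
                "The nation will prosper in 2 years and beyond.")
)

tidy_toy <-
  toy %>%
  unnest_tokens(word, text) %>%            # one lowercased word per row
  anti_join(stop_words, by = "word") %>%   # drop words in tidytext's stop word lexicon
  filter(!str_detect(word, "\\d"))         # drop any token that contains a digit

tidy_toy
```

Applying the same pipeline to `dat` instead of `toy` should give one row per remaining word, with `president` and `address` carried along.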



(4) What are the 30 most frequent words used across all inaugural speeches? Please present this information as a bar plot.
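One possible approach is to count words, keep the most frequent ones, and draw horizontal bars ordered by count. The sketch below assumes a tidy table with one word per row; `tokens` is hypothetical data, not the real speeches.

```r
library(dplyr)
library(ggplot2)

# Hypothetical tidy tokens (one word per row)
tokens <- tibble(
  word = c("nation", "people", "nation", "freedom", "people", "nation")
)

top_words <-
  tokens %>%
  count(word, sort = TRUE) %>%                 # frequency of each word
  slice_max(n, n = 30, with_ties = FALSE)      # keep at most the 30 most frequent

# Bars ordered from most to least frequent
p <- ggplot(top_words, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Count", y = NULL)

p
```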



(5) Remove the top 30 most common words from the inaugural speech data.

Treat these as stop words that are particular to inaugural speeches (i.e. every president uses these words).
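Treating the most common words as a custom stop word list amounts to an anti-join against the tokens. A minimal sketch, assuming a tidy one-word-per-row table; `tokens` is hypothetical, and the toy keeps only the top 2 words where the real data would use 30.

```r
library(dplyr)

# Hypothetical tidy tokens
tokens <- tibble(
  president = c("a", "a", "a", "b", "b", "b", "b"),
  word      = c("nation", "people", "liberty", "nation", "people", "nation", "commerce")
)

# Custom stop word list: the most frequent words overall
# (n = 2 for this toy example; use n = 30 on the real data)
common_words <-
  tokens %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 2, with_ties = FALSE)

# Remove every occurrence of those words
trimmed <- tokens %>% anti_join(common_words, by = "word")

trimmed
```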



(6) What are the top five words that are most unique to each president’s inaugural speech in their first term? Please present this information as a faceted bar graph.
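One common way to read "most unique" is tf-idf, which down-weights words that appear in every president's speech. That interpretation is an assumption, and the `counts` table below is hypothetical data standing in for per-president word counts from first addresses.

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Hypothetical word counts per president (first addresses only)
counts <- tibble(
  president = c("a_president", "a_president", "b_president", "b_president"),
  word      = c("union", "liberty", "union", "commerce"),
  n         = c(3, 2, 4, 1)
)

top_tfidf <-
  counts %>%
  bind_tf_idf(word, president, n) %>%              # adds tf, idf, and tf_idf columns
  group_by(president) %>%
  slice_max(tf_idf, n = 5, with_ties = FALSE) %>%  # up to five words per president
  ungroup()

# One panel per president
p <- ggplot(top_tfidf, aes(x = tf_idf, y = reorder(word, tf_idf))) +
  geom_col() +
  facet_wrap(~ president, scales = "free_y") +
  labs(x = "tf-idf", y = NULL)

p
```

In this toy example "union" is used by both presidents, so its idf, and therefore its tf-idf, is zero.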



(7) On average, which president’s inaugural speech is most “positive”? Which is most “negative”?
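A simple way to score positivity is to match tokens against a sentiment lexicon and compare positive and negative counts. The sketch below assumes the Bing lexicon via `get_sentiments("bing")` and a net score of (positive - negative) / (positive + negative); both choices are assumptions, and `tokens` is hypothetical data.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical tidy tokens
tokens <- tibble(
  president = c("a", "a", "a", "b", "b", "b"),
  word      = c("good", "great", "bad", "fear", "bad", "good")
)

scores <-
  tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%   # keep only words in the lexicon
  count(president, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = (positive - negative) / (positive + negative)) %>%
  arrange(desc(net))                                    # most positive first

scores
```

With several speeches per president, averaging `net` by president (or grouping by `address` as well) would give the per-speech comparison.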



(8) Are presidents elected to a second term more positive, on average, in their second inaugural speech than in their first? Plot this information as a graph of your choosing.

Note: Don’t consider FDR’s third and fourth inaugural speeches.



(9) Run a topic model on the inaugural speeches, setting k (the number of topics you’re looking for) to 5. Try to interpret the output.
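A topic model can be sketched with latent Dirichlet allocation from the topicmodels package, which is one possible choice rather than the required one. The `counts` table is hypothetical, and the toy uses k = 2 because it has only three tiny documents; on the real speeches k would be 5.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical word counts per speech
counts <- tibble(
  document = c("s1", "s1", "s1", "s2", "s2", "s2", "s3", "s3"),
  word     = c("war", "peace", "army", "trade", "tax", "commerce", "war", "trade"),
  n        = c(4, 2, 3, 5, 2, 3, 1, 1)
)

# Document-term matrix, one row per speech
dtm <- counts %>% cast_dtm(document, word, n)

# Fit LDA (k = 2 for the toy data; set k = 5 for the assignment)
lda_fit <- LDA(dtm, k = 2, control = list(seed = 1234))

# Per-topic word probabilities; the highest-beta terms characterize each topic
topics <- tidy(lda_fit, matrix = "beta")

top_terms <-
  topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 3, with_ties = FALSE) %>%
  ungroup()

top_terms
```

Reading the top terms of each topic side by side is the usual starting point for interpreting what a topic is about.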