9 min read

Making a Word Cloud Gift from GroupMe Data Using R

Introduction

This year my friends took part in a Secret Santa mug exchange. Determined to not be outdone, I decided to create a completely personalized mug, featuring a word cloud of all the words my giftee wrote in a GroupMe chat during our internship together. I’ll dive into how to do thi using the GroupMe API, but if you use a different method to get your words you can still follow steps 2 and 3 to create and export your word cloud for whatever personalized item you want to put it on.

Step 1: Get the Data

As usual, the first step is to retrieve all the relevant data. My friends all used GroupMe to communicate so I’ll be using their API to get our group messages. In the GroupMe API, you can view messages from anyone in a group you’re in. However, I’ll only be including messages I’ve written in this post to preserve all my friends’ privacy.

First we have to include the libraries in R.

Libraries

library(rvest) # parse html
library(stringr) #string functions
library(jsonlite) # parse json
library(tidyverse) # make R fun
library(lubridate) # parse dates
library(tidytext) # text functions
library(httr) # api requests

API Key

Before we can get any data from the GroupMe API, we’ll need to get an API key. Log into your GroupMe account at dev.groupme.com. In the menu bar at the top right you will see an access token menu item.

GroupMe Access Token Menu

GroupMe Access Token Menu

Click the menu item to get your key and save it as a variable in your R project.

groupme.api.key <- "iamanapikey"

Query the Data

The GroupMe API separates messaging calls from group calls. You can specify the group(s) you want messages from using group IDs, which will allow you to use the messaging API to get messages from the group(s).

First, I’ll get a list of all my groups.

connection.url <- 'https://api.groupme.com/v3/groups'
groups.query.response <- GET(connection.url,
                      query = list(
                        token = groupme.api.key
                      ))
parsed.group.response <- content(groups.query.response, "parsed")

You can explore the parsed.group.response to find the group you need if you’re only looking for one group’s messages. Instead, I’m going to get every message from all my groups using a list of the group IDs.

group.ids <- lapply(parsed.group.response$response, 
                    function(x){ x$group})

Test Getting Messages

Before trying to get every message from every group, we should test out the API with a single group.

Get your group ID.

test.group.id <- group.ids[1]

Query a group for messages.

#/groups/:group_id/messages

connection.url <- paste0("https://api.groupme.com/v3/groups/",
                         test.group.id,
                         "/messages")
query.response <- GET(connection.url,
                      query = list(
                        token = groupme.api.key
                      ))
parsed.data <- content(query.response, "parsed")

I’ll use lapply to get the fields I want out of the message.

messages <- lapply(
    parsed.data$response$messages, 
    function(x){
      as.vector(x[c("sender_id", 
                    "name",
                    "text")])
      }
    )

View the messages to see messages people wrote in that group.

head(messages)

When converting to a vector we lose our null values, which makes it hard to get the messages into a dataframe format and create a word cloud. To get around that issue I am using a function to convert any null items to NA.

get.non.null <- function(item){
  if(is.null(item)){
    return(NA)
  }
  else{
    return(item)
  }
}

Now we can convert our sample into a dataframe using the get.non.null function.

message.df <- data.frame("id"= c(NA),
                         "name" = c(NA),
                         "text" = c(NA),
                         stringsAsFactors = FALSE)
for(message in messages){
  message.vector <- c(get.non.null(message$sender_id[1]),
                      get.non.null(message$name[1]),
                      get.non.null(message$text[1]))
  # print(message.vector)
  message.df <- rbind(message.df,message.vector)
}
summary(message.df)

The GroupMe API also limits how many messages you can query at a time (default is 20). To get all of the messages from a single group we have to query multiple times.

 while(query.response$status_code == 200){
    parsed.data <- content(query.response, "parsed")
    messages <- lapply(parsed.data$response$messages,
                       function(x){x[c("id", 
                                       "sender_id", 
                                       "name", 
                                       "text")]})
    for(message in messages){
          message.vector <- c(get.non.null(message$sender_id[1]),
                              get.non.null(message$name[1]), 
                              get.non.null(message$text[1]))
    # print(message.vector)
    message.df <- rbind(message.df,message.vector)
    }
    last.message.id <- message.df$id[length(message.df$id)]
    #query again starting at the last message
    query.response <- GET(connection.url,
                      query = list(
                        token = groupme.api.key,
                        before_id = last.message.id
                      ))

  }
  #success if 304, otherwise exited on failed query
  print(paste("exited with code:", query.response$status_code))
summary(message.df)

Make a Function

Combine everything into a function so we can repeat for every group.

get.messages <-  function(group_id){
  #Connect to API for messages in the group
  connection.url <- paste0("https://api.groupme.com/v3/groups/",
                             group_id,
                             "/messages")
  #Define what we want out of a group message
  message.df <- data.frame("group_id" = c(NA), 
                           "message_id" = c(NA), 
                           "sender_id"= c(NA), 
                           "name" = c(NA), 
                           "text" = c(NA), 
                           stringsAsFactors = FALSE)
  
  #Query first 100 messages of a group
  query.response <- GET(connection.url,
                        query = list(
                          token = groupme.api.key,
                          limit = 100
                        ))
  
  #Continue querying for messages until we receive an API error
  while(query.response$status_code == 200){
    parsed.data <- content(query.response, "parsed")
    #Parse message fields
    messages <- lapply(parsed.data$response$messages,
                       function(x){x[c("id", 
                                       "sender_id", 
                                       "name", 
                                       "text")]})
    #Convert into a dataframe
    for(message in messages){
      message.vector <- c(group_id, 
                          get.non.null(message$id[1]), 
                          get.non.null(message$sender_id[1]), 
                          get.non.null(message$name[1]), 
                          get.non.null(message$text[1]))
      # print(message.vector)
      message.df <- rbind(message.df,message.vector)
    }
    #Get latest message ID
    last.message.id <- message.df$message_id[length(message.df$message_id)]
    #Query for more messages
    query.response <- GET(connection.url,
                        query = list(
                          token = groupme.api.key,
                          limit = 100,
                          before_id = last.message.id
                        ))
  }
  #success if 304, otherwise failed
  print(paste("exited with code:", query.response$status_code))
  return(message.df)
}

It’s time to actually retrieve all the data we need from GroupMe. I’m both combining all the data and saving each group to its own file.

#Create our final dataframe
all.messages.df <- data.frame("group_id"= c(NA),
                              "message_id" = c(NA),
                              "sender_id"= c(NA), 
                              "name" = c(NA), 
                              "text" = c(NA), 
                              stringsAsFactors = FALSE)

#Run our function on every group
for(id in group.ids){
  print(paste("group", id))
  #Check how long it takes to get all messages
  time <- system.time(messages.df <- get.messages(id))
  print(time)
  #save as individual rds file
  saveRDS(messages.df,
          file = paste0("../data/group_messages/",id,".rds"))
  all.messages.df <- rbind(all.messages.df, messages.df)
}
#save final dataframe
saveRDS(all.messages.df, file = "../data/all_messages.rds")
summary(all.messages.df)
##    group_id          message_id         sender_id        
##  Length:5376        Length:5376        Length:5376       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##      name               text          
##  Length:5376        Length:5376       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character

Step 2: Create a Word Cloud

Once all the message data has been collected, getting it into a format suitable for a word cloud is pretty simple with tidytext and wordcloud2.

The most common word cloud is built from word frequency, where the largest words are those that occur the most.

First, we’ll need to filter out the messages we don’t care about. We only want my messages, so I need to find my ID. An easy way to find it is with string matching.

Save your sender ID for filtering.

sender.id <- "myid"

Be sure to filter messages with certain words or phrases early, because it can be harder to filter once they are split into words. For GroupMe, I’m filtering out any media messages or links, which show as plain text through the API.

user.messages <- all.messages.df %>%
  filter(sender_id == sender.id )%>%
  filter(!str_detect(text, ".com"))
library(wordcloud2)

words <- user.messages %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  rename(count = n)
## Joining, by = "word"
wordcloud2(words, shape="square")

You can tell by my word cloud that I like to play Settlers of Catan and D&D with my friends.

Most Relevant Words

For this gift I wanted to stray from word frequency. Since we have a group of many people messaging, I can compare the words my giftee used with what everyone else said using Term Frequency, Inverse Document Frequency (TF-IDF).

This time we don’t start by filtering a single person’s messages, instead only removing bad messages such as links containing “.com”.

filtered.messages <- all.messages.df %>%
  filter(!str_detect(text, ".com"))

TF-IDF is created on all messages, and grouped by each person with the final function bind_tf_idf(word, sender_id, n).

group.words <- filtered.messages %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>%
    count(word, sender_id, sort = TRUE)
## Joining, by = "word"
group.words.tfidf <- group.words %>% 
  bind_tf_idf(word, sender_id, n)

With wordcloud2 we can easily use the tf_idf numbers in place of word frequency. The package automatically determines how to size the words and doesn’t require whole numbers. We also won’t need to filter stop words with tf_idf, although you still can if you want to. TF-IDF adjusts for usage across everyone. Stop words, which everyone uses in the group, will have lower values. This is particularly helpful for me as my name (Will), is a stop word. We only need to filter a single person at this final step and remake the word cloud.

my.tfidf <- group.words.tfidf %>% 
  filter(sender_id == sender.id)

wordcloud.df <- my.tfidf %>%
  select(word, tf_idf)

wordcloud2(wordcloud.df)

Step 3: Download the Word Cloud

This step is a little tricky and you may want to just look into using the package wordcloud instead of wordcloud2. The wordcloud package will create a jpeg, which is easy to export, while wordcloud2 is made in HTML and is harder to get an image. However, I like how Wordcloud2 looks better, and it allows you to create word clouds in shapes or letters so I used a couple workarounds.

From HTML, there is a simple JavaScript trick that allows you to download as an image.

document.getElementById("canvas").toDataURL()

This will get a link to an image you can download. However, this command seems to crash rstudio’s viewer. To get around that, I made a Shiny app with the same word cloud code. Having a Shiny app will allow you to view in Chrome as well as edit the size of the word cloud canvas easily. I also added an input to control the size of words, which allows you to quickly experiment and get the word cloud that you like. Below is the full Shiny app code.

# to save run following in js
#document.getElementById("canvas").toDataURL()

library(shiny)
library(wordcloud2)
library(tidyverse)
library(tidytext)
library(stringr)

#Define id and size of words
n <- 1
sender.id <- "myid"

#Get message data
all.message.dfs <-  readRDS("../all_messages.rds")

#Filter out links
filtered.messages <- all.message.dfs %>%
  filter(!str_detect(text, ".com"))

#Get word counts
all.words <- filtered.messages %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sender_id, sort = TRUE)

#Create TFIDF
all.words.tfidf <- all.words %>% bind_tf_idf(word, sender_id, n)

#Filter TFIDF
user.tfidf <- all.words.tfidf %>% 
  filter(sender_id == sender.id)

#format for wordcloud
wordcloud.text <- user.tfidf %>%
  filter(!is.na(word)) %>%
  select(word, tf_idf)

# Size the wordcloud
ui <- bootstrapPage(
  numericInput('size', 'Size of wordcloud', n),
  wordcloud2Output('wordcloud2', width = "1048px", height = "400px")
)

# Define server logic required to draw the wordcloud
server <- function(input, output) {
  output$wordcloud2 <- renderWordcloud2({
    wordcloud2(wordcloud.text, size=input$size)
  })
}

# Run the application 
shinyApp(ui = ui, server = server)

This blog is a static website so I can’t embed the Shiny app, but here is what it should look like once you reference your own message data.

Now all you have to do is inspect the page in your web browser and write the command to get a URL in the JavaScript console tab.

document.getElementById("canvas").toDataURL()

Finally, we got a word cloud image with R. The Shiny page is a bit of an annoying workaround and I’ve been having issues with wordcloud2 correctly doing image masking. R might not be the best tool to get a word cloud after getting word frequency or TF-IDF, but it is doable.

I used Zazzle to create my mug. Zazzle was able to infer that all white space was supposed to be transparent and the resulting product looks pretty great.

Have fun making your word cloud merchandise!