Association Rules in Web History

Introduction

I have been curious about how I spend my time online and decided to investigate what activities I tend to do together in a single session. Association Rules allows us to discover the associations between these activities.

Association Rules

Association Rules is an algorithm to find patterns in events that tend to occur together. One common example is where a supermarket used this algorithm to discover that people who buy diapers also tend to buy beer, later reasoning that there are a lot of new dads shopping. With this information you could decide to put diapers and beer close together to simplify the shopping experience. Or you could place diapers at the other end of the store to make new dads walk by temptations of grills, sandals, and football merchandise.

Definition: Frequent Itemsets

An itemset is a collection of one or more items. We start with a dataframe of itemsets, one per transaction:

Transaction ID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke

The support count is the number of transactions that contain an itemset, e.g. the support count of {Milk,Bread,Diaper} = 2.

The support is the fraction of all transactions that contain an itemset. e.g. The support of {Milk,Bread,Diaper} = 2/5

The first parameter we need to manually set in Association Rules is a minimum support. A frequent itemset is any itemset that has a support greater than or equal to the minimum support. Within the Association Rules algorithm we will filter to only use frequent itemsets to save computation time.
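To make these definitions concrete, here is a minimal base R sketch that recomputes the numbers above from the five toy transactions. The names toy.transactions, support.count, support, and min.support are hypothetical helpers for illustration and are not part of any package; the 0.4 threshold is an arbitrary choice.

```r
# Toy transactions from the table above, one character vector per transaction
toy.transactions <- list(
  c("Bread", "Milk"),
  c("Bread", "Diaper", "Beer", "Eggs"),
  c("Milk", "Diaper", "Beer", "Coke"),
  c("Bread", "Milk", "Diaper", "Beer"),
  c("Bread", "Milk", "Diaper", "Coke")
)

# Hypothetical helper: number of transactions that contain every item in the itemset
support.count <- function(itemset, transactions) {
  sum(vapply(transactions, function(tx) all(itemset %in% tx), logical(1)))
}

# Hypothetical helper: fraction of all transactions that contain the itemset
support <- function(itemset, transactions) {
  support.count(itemset, transactions) / length(transactions)
}

support.count(c("Milk", "Bread", "Diaper"), toy.transactions)  # 2
support(c("Milk", "Bread", "Diaper"), toy.transactions)        # 0.4

# Assumed minimum support of 0.4: the itemset is frequent if it meets the threshold
min.support <- 0.4
support(c("Milk", "Bread", "Diaper"), toy.transactions) >= min.support  # TRUE
```

Lowering min.support lets more itemsets through, which means more candidate rules and more computation.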

Definition: Association Rule

An Association Rule takes the form of X -> Y.

For example, {Milk,Diaper} -> {Beer} means that transactions where someone bought milk and diapers tend to also contain beer.

Support

Support has the same meaning as it does for an itemset, but the itemset you count is the union of both sides of the implication.

e.g. support for {Milk,Diaper} -> {Beer} is support_count({Milk,Diaper,Beer})/count(transactions) = 2/5

The second parameter is a confidence threshold.

Confidence

Confidence is a measure of how often items in Y also appear in transactions that contain X.

c = support_count({Milk,Diaper,Beer})/support_count({Milk,Diaper}) = 2/3

A higher confidence means that Y appears in a larger share of the transactions that contain X. Note that you can have the same itemset, but depending on how the rule is formed, confidence may change. For example:

{Milk} -> {Diaper,Beer} has a support of 0.4 and a confidence of 0.5. {Milk,Beer} -> {Diaper} has a support of 0.4, but a confidence of 1.0

So if milk is bought there is a 50% chance of also buying diapers and beer. However, if milk and beer are bought, then there is a 100% chance of purchasing diapers.
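The confidence figures above can be checked with a short base R sketch. It assumes the same five toy transactions; support.count and confidence are hypothetical helper names used only for illustration.

```r
# Toy transactions from the supermarket table
toy.transactions <- list(
  c("Bread", "Milk"),
  c("Bread", "Diaper", "Beer", "Eggs"),
  c("Milk", "Diaper", "Beer", "Coke"),
  c("Bread", "Milk", "Diaper", "Beer"),
  c("Bread", "Milk", "Diaper", "Coke")
)

# Hypothetical helper: number of transactions that contain every item in the itemset
support.count <- function(itemset, transactions) {
  sum(vapply(transactions, function(tx) all(itemset %in% tx), logical(1)))
}

# Hypothetical helper: confidence of the rule lhs -> rhs,
# i.e. support count of both sides divided by support count of the left side
confidence <- function(lhs, rhs, transactions) {
  support.count(c(lhs, rhs), transactions) / support.count(lhs, transactions)
}

confidence(c("Milk", "Diaper"), "Beer", toy.transactions)   # 2/3
confidence("Milk", c("Diaper", "Beer"), toy.transactions)   # 0.5
confidence(c("Milk", "Beer"), "Diaper", toy.transactions)   # 1
```

All three rules are built from the same itemset {Milk,Diaper,Beer}, so they share a support of 0.4; only the denominator changes.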

The last main metric often observed in association rules is lift.

Lift

Confidence tells you how likely the right side is to occur given the left side. However, that isn’t always enough, because it ignores how common the right side is overall. Consider the following example of drinkers of coffee and tea. 100 people were asked if they drink coffee and tea. To relate to Association Rules you can think of this as 100 restaurant orders where people had tea or coffee.

         | Coffee | Not(Coffee) | Total
Tea      | 15     | 5           | 20
Not(Tea) | 75     | 5           | 80
Total    | 90     | 10          | 100

The confidence of someone drinking coffee given that they drink tea is 15/20 or 75%, which is pretty high. That suggests the rule {Tea} -> {Coffee}. However, the probability that someone drinks coffee regardless of tea is 90/100 or 90%, and among people who don’t drink tea it is 75/80 or about 94%. Therefore, if someone drinks tea they are less likely to drink coffee than someone who doesn’t drink tea. With this knowledge {Tea} -> {Coffee} doesn’t make as much sense as {} -> {Coffee}.

{} denotes the empty set, indicating that every transaction implies the right-hand itemset.

Lift fixes this by accounting for the overall probability that someone drinks coffee. It is the confidence of the rule divided by the support of the right-hand side.

In this example lift = 0.75/0.9 = 0.83. Since it is less than one, drinking coffee is negatively associated with drinking tea.

Using lift we can filter for rules where the resulting item is more likely to occur given the input itemset.
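As a quick check of the arithmetic, here is a small base R sketch that computes the lift of {Tea} -> {Coffee} from the counts in the table. The variable names are hypothetical and chosen only for this illustration.

```r
# Counts taken from the coffee and tea table above
n.total  <- 100  # all orders
n.tea    <- 20   # orders with tea
n.coffee <- 90   # orders with coffee
n.both   <- 15   # orders with both tea and coffee

# Confidence of {Tea} -> {Coffee}: share of tea orders that also have coffee
confidence.tea.coffee <- n.both / n.tea      # 0.75

# Support of the right-hand side: share of all orders that have coffee
support.coffee <- n.coffee / n.total         # 0.9

# Lift: confidence of the rule divided by the support of the right-hand side
lift.tea.coffee <- confidence.tea.coffee / support.coffee
round(lift.tea.coffee, 2)  # 0.83

# A lift above 1 would indicate a positive association; here it is below 1
lift.tea.coffee > 1  # FALSE
```

A lift of exactly 1 would mean tea and coffee are independent, so knowing the left side tells you nothing about the right side.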

Format the Data

The data I’ll be using is from RescueTime. I went through how to retrieve this data in my previous post How to Download and Analyze Your RescueTime Data in R. I’m only going to look at categories of websites. However, you can use the same method to look at specific websites. You can even try this with Google History from Takeout, but it would take more formatting to categorize websites from raw URLs or titles.

library(lubridate)
rescue.documents <- read.csv("../data/rescuetime-activity-history.csv", stringsAsFactors = FALSE, header=FALSE)
names(rescue.documents) <-  c("time", "activity","title", "category","domain", "duration")
rescue.documents$time <- mdy_hm(rescue.documents$time)
str(rescue.documents)
## 'data.frame':    241 obs. of  6 variables:
##  $ time    : POSIXct, format: "2018-09-14 01:00:00" "2018-09-14 01:00:00" ...
##  $ activity: chr  "rstudio" "newtab" "docs.google.com/#document" "docs.google.com/#document" ...
##  $ title   : chr  "C:/Users/Will/Desktop/Personal Projects/PersonalSite - master - RStudio" "New Tab - Google Chrome" "Google Docs - Google Chrome" "Untitled document - Google Docs - Google Chrome" ...
##  $ category: chr  "Software Development" "Utilities" "Design & Composition" "Design & Composition" ...
##  $ domain  : chr  "Data Modeling & Analysis" "Browsers" "Writing" "Writing" ...
##  $ duration: int  306 18 11 10 3 3327 114 43 26 15 ...

As always, there is a bit of cleaning to do before we can run the algorithm.

Create Sessions

Start out by including tidyverse.

library(tidyverse)

In order for Association Rules to work, we need to define what counts as a transaction or web session. Whether you get web history from Google or RescueTime, there isn’t a strict definition of what counts as a session. I’m defining a session as continuous web usage with breaks no longer than 60 minutes.

The following function creates sessions from a timestamp column.

make.sessions <- function(time.vec, time.sep = 30, time.unit = "mins"){
  #create a vector of previous times (dplyr::lag shifts values down by one)
  lag.vec <- dplyr::lag(time.vec)
  #find the difference in time between consecutive web activity
  difftime.vec <- difftime(time.vec, lag.vec, units = time.unit)
  #stop if any difference is negative, i.e. the input is not sorted by time
  stopifnot(difftime.vec[!is.na(difftime.vec)] >= 0)
  #set all points where there is a new session to 1 and 0 otherwise
  new.sessions <- as.numeric(difftime.vec > as.difftime(time.sep, units = time.unit))
  #the first row has no previous time, so it stays in the first session
  new.sessions[is.na(new.sessions)] <- 0
  #use a cumulative sum to turn the new-session flags into session ids
  cumsum(new.sessions)
}
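To see the idea on something small, here is a base R sketch of the same gap-based sessionizing on five made-up timestamps. It is a simplified stand-in for the function above rather than a replacement; toy.times, gaps, threshold, and toy.session.id are hypothetical names, and the timestamps are invented for illustration.

```r
# Hypothetical timestamps, already sorted in ascending order
toy.times <- as.POSIXct(c(
  "2018-09-14 01:00:00",
  "2018-09-14 01:20:00",
  "2018-09-14 03:00:00",
  "2018-09-14 03:30:00",
  "2018-09-14 06:00:00"
), tz = "UTC")

# Gap in minutes between each event and the one before it
gaps <- as.numeric(diff(toy.times), units = "mins")

# A gap longer than the threshold starts a new session.
# The leading 0 keeps the first event in session 0.
threshold <- 60
toy.session.id <- cumsum(c(0, gaps > threshold))

toy.session.id  # 0 0 1 1 2
```

The gaps here are 20, 100, 30, and 150 minutes, so the second and fourth gaps open new sessions.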

Run the function on RescueTime data and view the results. Make sure your data is sorted by time before using this function.

#maximum gap, in minutes, allowed between activities in the same session
session.break.time <- 60
rescue.documents <- rescue.documents %>% arrange(time)
rescue.documents$session.id <- make.sessions(rescue.documents$time, time.sep = session.break.time)

Format Session Data for Transactions

Within a single session there are likely duplicate websites and categories. Since Association Rules is based on set manipulation, we don’t need those duplicates and can simplify our dataframe by removing them.

rescue.sessions.unique.categories <-  rescue.documents %>% group_by(category, session.id) %>% dplyr::summarise(count = n())

Let’s look at the dataframe we’ll coerce into a transaction.

summary(rescue.sessions.unique.categories)
##    category           session.id       count       
##  Length:16878       Min.   :   0   Min.   :  1.00  
##  Class :character   1st Qu.: 605   1st Qu.:  2.00  
##  Mode  :character   Median :1430   Median :  5.00  
##                     Mean   :1355   Mean   : 12.83  
##                     3rd Qu.:2041   3rd Qu.: 15.00  
##                     Max.   :2650   Max.   :895.00

The arules library uses a specific transaction data format. To convert the dataframe into transactions, first collect each session’s categories into a list with one vector of items per session, then use the as function to coerce that list into a transactions object.

library(arules)
categories.in.session <- aggregate(category ~ session.id, data = rescue.sessions.unique.categories, c)
category.transactions <- as(categories.in.session$category, "transactions")
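If it helps to see the intermediate shape, the following base R sketch builds the same kind of per-session list from a tiny made-up dataframe. It uses unique and split instead of the dplyr and aggregate calls above, purely as an illustration; toy.visits, toy.unique, and toy.items are hypothetical names and the rows are invented.

```r
# Hypothetical session/category pairs, with a duplicate category in session 0
toy.visits <- data.frame(
  session.id = c(0, 0, 0, 1, 1, 2),
  category   = c("Business", "Shopping", "Business",
                 "Software Development", "Business", "Shopping"),
  stringsAsFactors = FALSE
)

# Drop repeated (session, category) rows, since itemsets ignore duplicates
toy.unique <- unique(toy.visits)

# One character vector of categories per session, named by session id
toy.items <- split(toy.unique$category, toy.unique$session.id)
toy.items
```

A list shaped like toy.items, with one vector of items per element, is the kind of input that can be coerced with as(..., "transactions").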

Run Association Rules

category.rules <- apriori(category.transactions, parameter = list(supp=0.1, maxlen=3))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target   ext
##       3  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 265 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[11 item(s), 2651 transaction(s)] done [0.00s].
## sorting and recoding items ... [11 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [296 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

You can look through rules with the inspect function.

inspect(head(sort(category.rules, by="lift"), n=10))
##      lhs                             rhs                      support confidence     lift count
## [1]  {Business,                                                                                
##       Design & Composition}       => {Software Development} 0.2097322  0.8273810 2.206627   556
## [2]  {Communication & Scheduling,                                                              
##       Design & Composition}       => {Software Development} 0.2304791  0.8028909 2.141312   611
## [3]  {Design & Composition,                                                                    
##       Social Networking}          => {Software Development} 0.1942663  0.8009331 2.136090   515
## [4]  {Design & Composition,                                                                    
##       Software Development}       => {Business}             0.2097322  0.8348348 1.990240   556
## [5]  {Social Networking,                                                                       
##       Software Development}       => {Business}             0.2195398  0.8151261 1.943255   582
## [6]  {Communication & Scheduling,                                                              
##       Software Development}       => {Business}             0.2542437  0.8120482 1.935917   674
## [7]  {Communication & Scheduling,                                                              
##       Design & Composition}       => {Business}             0.2297246  0.8002628 1.907821   609
## [8]  {Shopping,                                                                                
##       Software Development}       => {Uncategorized}        0.1086382  0.9664430 1.603279   288
## [9]  {Design & Composition,                                                                    
##       Shopping}                   => {Uncategorized}        0.1029800  0.9479167 1.572545   273
## [10] {Business,                                                                                
##       Shopping}                   => {Uncategorized}        0.1256130  0.9433428 1.564957   333

You can also coerce them into a dataframe if you prefer.

df_rules <- as(sort(category.rules, by="lift"),"data.frame")
head(df_rules)
##                                                                           rules
## 103                   {Business,Design & Composition} => {Software Development}
## 108 {Communication & Scheduling,Design & Composition} => {Software Development}
## 104          {Design & Composition,Social Networking} => {Software Development}
## 102                   {Design & Composition,Software Development} => {Business}
## 155                      {Social Networking,Software Development} => {Business}
## 159             {Communication & Scheduling,Software Development} => {Business}
##       support confidence     lift count
## 103 0.2097322  0.8273810 2.206627   556
## 108 0.2304791  0.8028909 2.141312   611
## 104 0.1942663  0.8009331 2.136090   515
## 102 0.2097322  0.8348348 1.990240   556
## 155 0.2195398  0.8151261 1.943255   582
## 159 0.2542437  0.8120482 1.935917   674
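To tie this back to the supermarket example, here is a hedged end-to-end sketch that runs apriori on the five toy transactions and keeps only rules with a lift above 1. It assumes the arules package is installed; the object names and the thresholds (supp = 0.4, conf = 0.8, minlen = 2) are arbitrary choices for the toy data, not recommendations.

```r
library(arules)

# Toy transactions from the supermarket table
toy.list <- list(
  c("Bread", "Milk"),
  c("Bread", "Diaper", "Beer", "Eggs"),
  c("Milk", "Diaper", "Beer", "Coke"),
  c("Bread", "Milk", "Diaper", "Beer"),
  c("Bread", "Milk", "Diaper", "Coke")
)

# Coerce the list of item vectors into a transactions object
toy.trans <- as(toy.list, "transactions")

# Mine rules; minlen = 2 skips rules with an empty left-hand side,
# and verbose = FALSE silences the progress log
toy.rules <- apriori(toy.trans,
                     parameter = list(supp = 0.4, conf = 0.8, minlen = 2),
                     control = list(verbose = FALSE))

# Keep only rules whose right-hand side is more likely given the left-hand side
positive.rules <- subset(toy.rules, lift > 1)

inspect(sort(positive.rules, by = "lift"))
```

With only five transactions the resulting rules are not meaningful on their own, but the same subset call works on category.rules.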

Visualize

You can also visualize Association Rules with the arulesViz package. When I use Association Rules, it’s generally very exploratory. This package makes exploring rules very easy and even offers interactive modes.

library(arulesViz)

The default plot helps identify the distribution of rules over confidence, support, and lift.

plot(category.rules)

You can also use the interactive plot to view the actual rules behind each data point.

plot(category.rules, engine="htmlwidget")
## Warning: package 'bindrcpp' was built under R version 3.4.4

The paracoord (parallel coordinates) plot shows how items flow from the left-hand side of each rule into its right-hand side.

plot(category.rules, method="paracoord",  control=list(alpha=.5, reorder=TRUE))

My favorite is to simply view all the rules in an interactive table. This allows you to sort and search for rules dynamically.

inspectDT(category.rules)

This interactive mode is especially useful with a large number of rules, which are hard to visualize. I recommend trying it if you decide to mine rules on the websites visited rather than on categories of websites.

There we have it. You can run Association Rules any time it makes sense to think in terms of transactions where multiple things may be done together. Next time I’ll have to see what arises when I switch to sequential pattern analysis.