Predicting Song Completion on Spotify

Introduction

I love Spotify’s curated music. I can usually find something I want to listen to. Yet, even with my lists of favorites and my Discover Weekly playlist I still come across plenty of songs that I’m not feeling and have to skip. I found myself wondering what factors determine whether I skip a song or finish it, and set out to use my Spotify data to predict when I’m going to skip a song.

Include the Libraries

I’ll do the analysis in R with the following packages.

library(tidyverse) # make R fun
library(randomForest) # create Random Forest ML Model
library(jsonlite) # parse spotify data
library(httr) # request data from spotify API
library(lubridate) # reformat dates
library(rlist) # manipulate spotify data

Getting the Data

You can download the last 90 days of your listening history through the Spotify website. Here’s a tutorial to help you find the menu to download. I’m not sure why they only provide 90 days, and unfortunately I don’t know any other way to get this data. You can try scrobbling to Last.fm, but that won’t tell you how long you listened to each song.

Once you have downloaded your data, you can find your listening history in the StreamingHistory.json file.

streaming_history <- read_json("../data/StreamingHistory.json",
                             simplifyVector = TRUE,
                             flatten = TRUE)

str(streaming_history)

## 'data.frame':    3654 obs. of  4 variables:
##  $ endTime   : chr  "2018-11-19 17:45" "2018-11-19 17:48" "2018-11-19 17:52" "2018-11-19 17:58" ...
##  $ artistName: chr  "Maribou State" "Will Stratton" "Elder Island" "Shanic" ...
##  $ trackName : chr  "Beginner's Luck" "Hemet Pine Singer" "Welcome State" "It All Changed So Fast" ...
##  $ msPlayed  : int  268027 191735 270476 342857 188664 182783 148155 163813 225973 226976 ...

We can use the Spotify search API to retrieve the track ID and other feature data from Spotify. For the auth token, you can use the token generated within the API testing interface on Spotify’s documentation website. It expires fairly quickly, but nothing we’re doing here requires querying for too long.

auth_code = paste("Bearer", "[Your-Code-Here]")
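
If you would rather not paste the token into the script itself, one option is to read it from an environment variable. This is only a minimal sketch: the variable name `SPOTIFY_TOKEN` and the helper `make_auth_code()` are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical helper: build the Authorization header value from an
# environment variable instead of hard-coding the token in the script.
make_auth_code <- function(env_var = "SPOTIFY_TOKEN") {
  token <- Sys.getenv(env_var, unset = "")
  if (!nzchar(token)) {
    stop("Set ", env_var, " to a valid Spotify access token first.")
  }
  paste("Bearer", token)
}

# Example with a placeholder value (not a real token).
Sys.setenv(SPOTIFY_TOKEN = "example-token")
example_auth_code <- make_auth_code()
```

The result has the same shape as `auth_code` above, so it could be passed to `add_headers(Authorization = ...)` in the same way.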

This function will search Spotify for a song and get the metadata that we’re interested in. Duration will be used to determine whether a song was skipped, popularity can be used as a feature, and additional IDs can be used to get more metadata for additional features.

get_track_metadata <- function(artist_name, track_name){
  #Search Tracks
  # track_name <- "4AM in London"
  # artist_name <- "Benjamin Francis Leftwich"
  search <- paste("track:", track_name, "artist:", artist_name)
  
  query_response <- GET("https://api.spotify.com/v1/search",
                        add_headers(Authorization = auth_code),
                        query = list(
                          q = search,
                          type = "track",
                          market="US"
                        )
  )
  #get the json response
  parsed_query_response <- content(query_response, "parsed")
  track_id <- tryCatch(parsed_query_response$tracks$items[[1]]$id, 
                       error=function(e) NA)
  track_duration <- tryCatch(parsed_query_response$tracks$items[[1]]$duration_ms, 
                             error=function(e) NA)
  track_popularity <- tryCatch(parsed_query_response$tracks$items[[1]]$popularity, 
                               error=function(e) NA)
  track_album_id <- tryCatch(parsed_query_response$tracks$items[[1]]$album$id, 
                             error=function(e) NA)
  track_artist_id <- tryCatch(parsed_query_response$tracks$items[[1]]$album$artists[[1]]$id, 
                              error=function(e) NA)
  
  data.frame(track_id = track_id,
             track_album_id = track_album_id,
             track_artist_id = track_artist_id,
             track_duration = track_duration,
             track_popularity = track_popularity
  )
}
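
To see why the `tryCatch()` calls in the function end up as `NA` when a search finds nothing, it can help to run the same pattern on mocked responses. The lists below are illustrative stand-ins for the parsed JSON, not real API output, and `first_track_id()` is a hypothetical helper.

```r
# Mocked shapes of a parsed search response (illustrative only).
response_with_hit <- list(tracks = list(items = list(
  list(id = "track-123", duration_ms = 200000L, popularity = 42L)
)))
response_without_hit <- list(tracks = list(items = list()))
response_with_error <- list(error = list(status = 401L))

# Same pattern as in get_track_metadata(): indexing a missing first item
# raises "subscript out of bounds", which tryCatch() turns into NA.
first_track_id <- function(parsed) {
  tryCatch(parsed$tracks$items[[1]]$id, error = function(e) NA)
}

hit_id   <- first_track_id(response_with_hit)     # "track-123"
empty_id <- first_track_id(response_without_hit)  # NA
error_id <- first_track_id(response_with_error)   # NA as well
```

Note that an error response, such as the one an expired token would produce, yields the same `NA` as an empty search. Checking `httr::status_code(query_response)` before parsing is one way to tell the two cases apart.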

Loop through every record and get the track metadata.

track_metadata <- data.frame(track_id = c(),
                             track_album_id = c(),
                             track_artist_id = c(),
                             track_duration = c(),
                             track_popularity = c()
  )
#nrow(streaming_history)
for (row in 1:nrow(streaming_history)) {
  # print(streaming_history[row,])
  track_name <- streaming_history[row,]$trackName
  artist_name  <- streaming_history[row,]$artistName
  # print(paste("artist name: ", artist_name, "track name: ", track_name))
  current_track_metadata <- get_track_metadata(artist_name, track_name)
  track_metadata <- rbind(track_metadata, current_track_metadata)
}
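
If the history contains repeat plays, one way to cut down on API calls is to look up each distinct artist/track pair once and then map the results back onto every play. The sketch below runs offline: `fake_lookup()` is a hypothetical stand-in for `get_track_metadata()`, and `lookup_unique_tracks()` is a name made up for this example.

```r
# Stand-in for get_track_metadata() so the sketch runs without the API.
fake_lookup <- function(artist_name, track_name) {
  data.frame(track_id = paste(artist_name, track_name, sep = "|"),
             stringsAsFactors = FALSE)
}

lookup_unique_tracks <- function(history, lookup = fake_lookup) {
  # One key per artist/track pair; keep the first row of each distinct pair.
  keys <- paste(history$artistName, history$trackName, sep = "\r")
  first_rows <- !duplicated(keys)
  unique_tracks <- history[first_rows, c("artistName", "trackName")]

  # Query each distinct pair once and bind the results in a single step.
  results <- lapply(seq_len(nrow(unique_tracks)), function(i) {
    lookup(unique_tracks$artistName[i], unique_tracks$trackName[i])
  })
  metadata <- do.call(rbind, results)

  # Expand back to one row per play, in the original order.
  expanded <- metadata[match(keys, keys[first_rows]), , drop = FALSE]
  rownames(expanded) <- NULL
  expanded
}

toy_history <- data.frame(
  artistName = c("A", "B", "A"),
  trackName  = c("x", "y", "x"),
  stringsAsFactors = FALSE
)
toy_metadata <- lookup_unique_tracks(toy_history)
```

Because the result has one row per play in the original order, it can be combined with the streaming history in the same way as the output of the loop.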

Combine the metadata with the streaming data to get all the info about each track.

track_data <- cbind(streaming_history, track_metadata)

We do have some missing values from when the search didn’t return any results.

sum(is.na(track_data$track_id))

## [1] 227

We need to use the full track length for a song to determine if it was skipped. I don’t want to count all songs that have a shorter playtime than total duration; if I listened to most of the song and only skipped the ending, I still listened to the song. Instead, I’m going to set a threshold to determine whether I skipped the song.

skip_threshold = 0.9

Create the skipped label.

track_data <- track_data %>% 
  mutate(
    skipped = (track_duration - msPlayed > (1-skip_threshold) * track_duration))

summary(track_data$skipped)

##    Mode   FALSE    TRUE    NA's 
## logical    2510     917     227
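
As a quick sanity check of the rule, here is the same comparison on three made-up plays of a 200-second track. The numbers are invented for illustration, and the sketch uses base R rather than `mutate()` so that it runs on its own.

```r
skip_threshold <- 0.9

# Three invented plays: finished, stopped shortly before the end, abandoned early.
toy_tracks <- data.frame(
  track_duration = c(200000, 200000, 200000),
  msPlayed       = c(200000, 185000, 60000)
)

# Same rule as above: skipped when more than 10% of the track was left unplayed.
toy_tracks$skipped <- with(
  toy_tracks,
  track_duration - msPlayed > (1 - skip_threshold) * track_duration
)

# Mathematically equivalent form: less than 90% of the track was played.
toy_tracks$skipped_alt <- with(
  toy_tracks,
  msPlayed / track_duration < skip_threshold
)

toy_tracks$skipped  # FALSE FALSE TRUE
```

Only the third play counts as a skip; stopping 15 seconds before the end of a 200-second song still counts as listening to it.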

Now it’s time to think of what other features could be valuable. Thinking about how I use Spotify and skip songs can help determine the features that might make a good prediction.

I know I tend to skip many songs in a row when searching for the music I want to listen to. I’ll create a feature called skip_streak to represent how many songs, before the current song, were skipped in a row.

skipped_vector <- track_data$skipped
streak_vector <- vector(mode = "numeric", length = length(skipped_vector))
skip_streak <- 0
for(i in 1:length(skipped_vector)){
  streak_vector[i] <- skip_streak
  if(!is.na(skipped_vector[i])){
    if(skipped_vector[i]){
      skip_streak <- skip_streak + 1
    } else {
      skip_streak <- 0
    }
  }
}

Add the skip streak feature to the track data.

track_data$skip_streak <- streak_vector
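
The same logic can be wrapped in a small function and checked on a toy vector. `compute_skip_streak()` is a hypothetical name used here for illustration; the behavior is meant to match the loop above, including leaving the streak unchanged for `NA` labels.

```r
# Number of consecutive skips immediately before each track.
# NA labels (tracks without metadata) leave the running streak unchanged.
compute_skip_streak <- function(skipped) {
  streak <- numeric(length(skipped))
  running <- 0
  for (i in seq_along(skipped)) {
    streak[i] <- running  # record the streak *before* looking at this track
    if (!is.na(skipped[i])) {
      running <- if (skipped[i]) running + 1 else 0
    }
  }
  streak
}

toy_streak <- compute_skip_streak(
  c(TRUE, TRUE, FALSE, NA, TRUE, TRUE, TRUE, FALSE)
)
toy_streak  # 0 1 2 0 0 1 2 3
```

Because the value for each track is recorded before that track's own label is examined, the feature only uses information that is available before the song starts playing.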

To get an idea of how skipping previous songs affects the likelihood of skipping the current song, we can plot the skip rate with ggplot.

track_data %>% 
  group_by(skip_streak) %>% 
  summarise(percent_skipped = sum(skipped, na.rm=TRUE)/n()) %>% 
  ggplot(aes(x = skip_streak)) +
  geom_bar(aes(weight=percent_skipped)) + 
  labs(title = "Skip Rate by Number of Songs Skipped Beforehand",
       x = "Songs Skipped", 
       y = "Percent of Songs Skipped")

We can see that skipping two or three songs before the current song greatly increases the likelihood that the current song will be skipped. This will be a good feature to use.

I’m also curious about what time of day I tend to skip songs.

track_data$end_time_formatted <- ymd_hm(track_data$endTime)
track_data$hour <- hour(track_data$end_time_formatted)
track_data$weekday <- wday(track_data$end_time_formatted)

track_data %>%
  group_by(hour) %>%
  summarise(percent_skipped = sum(skipped, na.rm=TRUE)/n()) %>% 
  ggplot(aes(x = hour)) +
  geom_bar(aes(weight=percent_skipped))

I see a spike after lunch and during the night, while early morning and early work hours appear below average.
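
When reading the `hour` and `weekday` columns, it helps to know how they are encoded. lubridate’s `wday()` numbers the days 1 to 7 starting on Sunday by default, and `ymd_hm()` parses timestamps as UTC unless a time zone is given. The base R sketch below cross-checks both values for the first timestamp in the data; it is an illustration, not a replacement for the lubridate code.

```r
# Parse the first timestamp from the streaming history in UTC.
first_play <- as.POSIXlt("2018-11-19 17:45",
                         format = "%Y-%m-%d %H:%M", tz = "UTC")

# POSIXlt stores the hour directly and counts weekdays from 0 = Sunday,
# so adding 1 matches lubridate's default of 1 = Sunday.
play_hour    <- first_play$hour      # 17
play_weekday <- first_play$wday + 1  # 2, i.e. Monday
```

Spotify’s description of the export appears to give `endTime` in UTC. If that is the case, the hours in the plot are UTC hours, and converting with something like lubridate’s `with_tz()` before extracting the hour would line them up with local time.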

We can also use the Spotify API to get features like “danceability” and “instrumentalness” from the song audio.

get_track_analysis_features <- function(track_id){

  query_response <- GET(paste0("https://api.spotify.com/v1/audio-features/",track_id),
                        add_headers(Authorization = auth_code)
  )
  #get the json response
  parsed_query_response <- tryCatch(content(query_response, "parsed"), error=function(e) NA) 
  # print(parsed_query_response)
  as.data.frame(parsed_query_response)
}
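
Calling this endpoint once per play means several thousand requests. As far as I know, Spotify’s API has also offered a batch form of the audio features endpoint that accepts a comma-separated `ids` parameter with up to 100 track IDs. Treat that limit and the batch endpoint itself as assumptions to verify against the current API reference. The sketch below only prepares the batches and makes no network calls; `batch_track_ids()` is a hypothetical helper.

```r
# Split track IDs into batches of at most `batch_size`,
# dropping NA values and duplicates first.
batch_track_ids <- function(track_ids, batch_size = 100) {
  ids <- unique(track_ids[!is.na(track_ids)])
  if (length(ids) == 0) return(list())
  unname(split(ids, ceiling(seq_along(ids) / batch_size)))
}

# One comma-separated string per request.
ids_params <- vapply(
  batch_track_ids(c("a", "b", NA, "a", "c"), batch_size = 2),
  paste, character(1), collapse = ","
)
ids_params  # "a,b" "c"
```

Each string could then be sent as `query = list(ids = ...)` in a `GET()` call like the one above, and the returned features matched back to the plays by `id`.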

Use a sample query to get the column names.

test_track_features <- get_track_analysis_features("0aNyZsNQZykSvoiwrjhBFB")
track_analysis_features_names <- names(test_track_features)

Query audio analysis features for every track ID we were able to find.

#create an empty dataframe with the right names
track_analysis_features <- data.frame(matrix(ncol = 18, nrow = 0))
colnames(track_analysis_features) <-  track_analysis_features_names

for (row in 1:nrow(track_data)) {
  track_id <- track_data[row,]$track_id
  
  #get track features
  current_track_analysis_features <- get_track_analysis_features(track_id)
  #create an NA dataframe on failure
  if(ncol(current_track_analysis_features) != 18){
    current_track_analysis_features <- data.frame(matrix(ncol = 18, nrow = 1))
    colnames(current_track_analysis_features) <- track_analysis_features_names
  }
  track_analysis_features <- rbind(track_analysis_features, 
                                   current_track_analysis_features)
}

str(track_analysis_features)

## 'data.frame':    3654 obs. of  18 variables:
##  $ danceability    : num  NA 0.358 0.666 0.377 0.534 0.713 0.785 NA 0.518 0.497 ...
##  $ energy          : num  NA 0.232 0.432 0.00868 0.445 0.433 0.362 NA 0.585 0.523 ...
##  $ key             : int  NA 6 6 0 7 9 9 NA 8 7 ...
##  $ loudness        : num  NA -13.8 -8.79 -32.79 -9.62 ...
##  $ mode            : int  NA 0 1 1 0 0 1 NA 1 1 ...
##  $ speechiness     : num  NA 0.038 0.0323 0.0487 0.0351 0.0959 0.0611 NA 0.344 0.0322 ...
##  $ acousticness    : num  NA 0.956 0.334 0.993 0.0483 0.0659 0.697 NA 0.347 0.881 ...
##  $ instrumentalness: num  NA 8.96e-01 2.89e-02 8.92e-01 7.25e-05 8.94e-01 3.34e-01 NA 9.08e-01 5.30e-01 ...
##  $ liveness        : num  NA 0.106 0.11 0.0696 0.116 0.0992 0.0502 NA 0.111 0.212 ...
##  $ valence         : num  NA 0.0742 0.297 0.0461 0.0828 0.468 0.513 NA 0.487 0.149 ...
##  $ tempo           : num  NA 121.9 96 86.1 156 ...
##  $ type            : chr  NA "audio_features" "audio_features" "audio_features" ...
##  $ id              : chr  NA "0aNyZsNQZykSvoiwrjhBFB" "7JJinw17P46NEPXimuRkvt" "4GbJgSLRK5XUkSJmDoeTz5" ...
##  $ uri             : chr  NA "spotify:track:0aNyZsNQZykSvoiwrjhBFB" "spotify:track:7JJinw17P46NEPXimuRkvt" "spotify:track:4GbJgSLRK5XUkSJmDoeTz5" ...
##  $ track_href      : chr  NA "https://api.spotify.com/v1/tracks/0aNyZsNQZykSvoiwrjhBFB" "https://api.spotify.com/v1/tracks/7JJinw17P46NEPXimuRkvt" "https://api.spotify.com/v1/tracks/4GbJgSLRK5XUkSJmDoeTz5" ...
##  $ analysis_url    : chr  NA "https://api.spotify.com/v1/audio-analysis/0aNyZsNQZykSvoiwrjhBFB" "https://api.spotify.com/v1/audio-analysis/7JJinw17P46NEPXimuRkvt" "https://api.spotify.com/v1/audio-analysis/4GbJgSLRK5XUkSJmDoeTz5" ...
##  $ duration_ms     : int  NA 229853 270476 342857 188664 182784 148156 NA 225973 226977 ...
##  $ time_signature  : int  NA 4 4 3 4 3 4 NA 5 3 ...

Combine track data and analysis features for the full set of features.

track_data_features_full <- cbind(track_data, track_analysis_features)

str(track_data_features_full)

## 'data.frame':    3654 obs. of  32 variables:
##  $ endTime           : chr  "2018-11-19 17:45" "2018-11-19 17:48" "2018-11-19 17:52" "2018-11-19 17:58" ...
##  $ artistName        : chr  "Maribou State" "Will Stratton" "Elder Island" "Shanic" ...
##  $ trackName         : chr  "Beginner's Luck" "Hemet Pine Singer" "Welcome State" "It All Changed So Fast" ...
##  $ msPlayed          : int  268027 191735 270476 342857 188664 182783 148155 163813 225973 226976 ...
##  $ track_id          : chr  NA "0aNyZsNQZykSvoiwrjhBFB" "7JJinw17P46NEPXimuRkvt" "4GbJgSLRK5XUkSJmDoeTz5" ...
##  $ track_album_id    : chr  NA "7Fitd8mCCAxXbbE5d6jfoE" "7MRtAyLQCuhaWSNtkb7Jqa" "4IWnOazPfcc5bXTHjZGwtO" ...
##  $ track_artist_id   : chr  NA "0LyfQWJT6nXafLPZqxe9Of" "3EnbnmqrrvApHJs6FMvYik" "4iwYvUGOkgKYbCF0rcmJay" ...
##  $ track_duration    : int  NA 229853 270476 342857 188664 182783 148155 NA 225973 226976 ...
##  $ track_popularity  : int  NA 61 59 33 52 33 50 NA 55 31 ...
##  $ skipped           : logi  NA TRUE FALSE FALSE FALSE FALSE ...
##  $ skip_streak       : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ end_time_formatted: POSIXct, format: "2018-11-19 17:45:00" "2018-11-19 17:48:00" ...
##  $ hour              : int  17 17 17 17 18 18 18 18 18 18 ...
##  $ weekday           : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ danceability      : num  NA 0.358 0.666 0.377 0.534 0.713 0.785 NA 0.518 0.497 ...
##  $ energy            : num  NA 0.232 0.432 0.00868 0.445 0.433 0.362 NA 0.585 0.523 ...
##  $ key               : int  NA 6 6 0 7 9 9 NA 8 7 ...
##  $ loudness          : num  NA -13.8 -8.79 -32.79 -9.62 ...
##  $ mode              : int  NA 0 1 1 0 0 1 NA 1 1 ...
##  $ speechiness       : num  NA 0.038 0.0323 0.0487 0.0351 0.0959 0.0611 NA 0.344 0.0322 ...
##  $ acousticness      : num  NA 0.956 0.334 0.993 0.0483 0.0659 0.697 NA 0.347 0.881 ...
##  $ instrumentalness  : num  NA 8.96e-01 2.89e-02 8.92e-01 7.25e-05 8.94e-01 3.34e-01 NA 9.08e-01 5.30e-01 ...
##  $ liveness          : num  NA 0.106 0.11 0.0696 0.116 0.0992 0.0502 NA 0.111 0.212 ...
##  $ valence           : num  NA 0.0742 0.297 0.0461 0.0828 0.468 0.513 NA 0.487 0.149 ...
##  $ tempo             : num  NA 121.9 96 86.1 156 ...
##  $ type              : chr  NA "audio_features" "audio_features" "audio_features" ...
##  $ id                : chr  NA "0aNyZsNQZykSvoiwrjhBFB" "7JJinw17P46NEPXimuRkvt" "4GbJgSLRK5XUkSJmDoeTz5" ...
##  $ uri               : chr  NA "spotify:track:0aNyZsNQZykSvoiwrjhBFB" "spotify:track:7JJinw17P46NEPXimuRkvt" "spotify:track:4GbJgSLRK5XUkSJmDoeTz5" ...
##  $ track_href        : chr  NA "https://api.spotify.com/v1/tracks/0aNyZsNQZykSvoiwrjhBFB" "https://api.spotify.com/v1/tracks/7JJinw17P46NEPXimuRkvt" "https://api.spotify.com/v1/tracks/4GbJgSLRK5XUkSJmDoeTz5" ...
##  $ analysis_url      : chr  NA "https://api.spotify.com/v1/audio-analysis/0aNyZsNQZykSvoiwrjhBFB" "https://api.spotify.com/v1/audio-analysis/7JJinw17P46NEPXimuRkvt" "https://api.spotify.com/v1/audio-analysis/4GbJgSLRK5XUkSJmDoeTz5" ...
##  $ duration_ms       : int  NA 229853 270476 342857 188664 182784 148156 NA 225973 226977 ...
##  $ time_signature    : int  NA 4 4 3 4 3 4 NA 5 3 ...

With that, I think we have enough features. We don’t need all the columns to train the model so we can select only the columns we need.

track_data_features <- track_data_features_full %>%
  select(
    skipped,
    skip_streak,
    hour,
    weekday,
    track_popularity,
    danceability,
    energy,
    key,
    loudness,
    mode,
    speechiness,
    acousticness,
    instrumentalness,
    liveness,
    valence,
    tempo
  ) %>% 
  na.omit()
#convert to factor for random forest classification
track_data_features$skipped <- as.factor(track_data_features$skipped)

Random Forest

It’s time to create the prediction model. First, we have to separate the data into training and validation sets.

set.seed(100)
train <- sample(nrow(track_data_features), 0.7*nrow(track_data_features), replace = FALSE)
train_set <- track_data_features[train,]
valid_set <- track_data_features[-train,]
summary(train_set)

##   skipped      skip_streak          hour         weekday     
##  FALSE:1407   Min.   : 0.000   Min.   : 0.0   Min.   :1.000  
##  TRUE : 520   1st Qu.: 0.000   1st Qu.:11.0   1st Qu.:2.000  
##               Median : 0.000   Median :16.0   Median :4.000  
##               Mean   : 1.162   Mean   :13.9   Mean   :3.919  
##               3rd Qu.: 1.000   3rd Qu.:20.0   3rd Qu.:6.000  
##               Max.   :33.000   Max.   :23.0   Max.   :7.000  
##  track_popularity  danceability        energy              key        
##  Min.   :  0.00   Min.   :0.0000   Min.   :0.000175   Min.   : 0.000  
##  1st Qu.: 46.00   1st Qu.:0.3260   1st Qu.:0.215500   1st Qu.: 1.000  
##  Median : 55.00   Median :0.4900   Median :0.422000   Median : 4.000  
##  Mean   : 53.48   Mean   :0.4795   Mean   :0.440038   Mean   : 4.166  
##  3rd Qu.: 59.00   3rd Qu.:0.6600   3rd Qu.:0.659000   3rd Qu.: 7.000  
##  Max.   :100.00   Max.   :0.9430   Max.   :1.000000   Max.   :11.000  
##     loudness            mode         speechiness       acousticness   
##  Min.   :-46.448   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:-21.152   1st Qu.:0.0000   1st Qu.:0.03600   1st Qu.:0.1190  
##  Median :-12.925   Median :1.0000   Median :0.04460   Median :0.3330  
##  Mean   :-16.127   Mean   :0.6772   Mean   :0.09325   Mean   :0.4278  
##  3rd Qu.: -8.860   3rd Qu.:1.0000   3rd Qu.:0.08505   3rd Qu.:0.7615  
##  Max.   : -1.584   Max.   :1.0000   Max.   :0.72200   Max.   :0.9960  
##  instrumentalness     liveness          valence            tempo       
##  Min.   :0.00000   Min.   :0.02190   Min.   :0.00000   Min.   :  0.00  
##  1st Qu.:0.00524   1st Qu.:0.09855   1st Qu.:0.07105   1st Qu.: 84.66  
##  Median :0.59700   Median :0.13500   Median :0.24400   Median :112.22  
##  Mean   :0.50442   Mean   :0.19532   Mean   :0.32452   Mean   :111.26  
##  3rd Qu.:0.89900   3rd Qu.:0.22050   3rd Qu.:0.52300   3rd Qu.:130.01  
##  Max.   :1.00000   Max.   :0.98300   Max.   :0.98000   Max.   :209.77

summary(valid_set)

##   skipped     skip_streak          hour          weekday     
##  FALSE:609   Min.   : 0.000   Min.   : 0.00   Min.   :1.000  
##  TRUE :218   1st Qu.: 0.000   1st Qu.:12.00   1st Qu.:3.000  
##              Median : 0.000   Median :16.00   Median :4.000  
##              Mean   : 1.225   Mean   :14.49   Mean   :4.144  
##              3rd Qu.: 1.000   3rd Qu.:20.00   3rd Qu.:6.000  
##              Max.   :34.000   Max.   :23.00   Max.   :7.000  
##  track_popularity  danceability        energy              key        
##  Min.   : 4.00    Min.   :0.0536   Min.   :0.000224   Min.   : 0.000  
##  1st Qu.:46.50    1st Qu.:0.3650   1st Qu.:0.218000   1st Qu.: 1.000  
##  Median :55.00    Median :0.4900   Median :0.412000   Median : 3.000  
##  Mean   :53.03    Mean   :0.4839   Mean   :0.446852   Mean   : 4.053  
##  3rd Qu.:59.00    3rd Qu.:0.6470   3rd Qu.:0.681500   3rd Qu.: 7.000  
##  Max.   :96.00    Max.   :0.9420   Max.   :1.000000   Max.   :11.000  
##     loudness            mode         speechiness       acousticness      
##  Min.   :-46.448   Min.   :0.0000   Min.   :0.02240   Min.   :0.0000034  
##  1st Qu.:-21.152   1st Qu.:0.0000   1st Qu.:0.03715   1st Qu.:0.1190000  
##  Median :-12.778   Median :1.0000   Median :0.04460   Median :0.3400000  
##  Mean   :-15.977   Mean   :0.6759   Mean   :0.09703   Mean   :0.4222262  
##  3rd Qu.: -8.320   3rd Qu.:1.0000   3rd Qu.:0.09070   3rd Qu.:0.7795000  
##  Max.   : -2.749   Max.   :1.0000   Max.   :0.72200   Max.   :0.9960000  
##  instrumentalness      liveness         valence            tempo       
##  Min.   :0.000000   Min.   :0.0244   Min.   :0.00001   Min.   : 46.95  
##  1st Qu.:0.003755   1st Qu.:0.0995   1st Qu.:0.08140   1st Qu.: 86.99  
##  Median :0.597000   Median :0.1330   Median :0.26800   Median :115.03  
##  Mean   :0.495254   Mean   :0.1897   Mean   :0.32827   Mean   :112.85  
##  3rd Qu.:0.893000   3rd Qu.:0.2130   3rd Qu.:0.52650   3rd Qu.:132.87  
##  Max.   :0.985000   Max.   :0.9770   Max.   :0.97000   Max.   :209.77

I’m just going to use the defaults for this model. If you want better predictions, you can tune parameters such as the number of trees or the number of variables tried at each split, for example with a grid search.

model <- randomForest(skipped ~ ., data = train_set, importance = TRUE)

model

## 
## Call:
##  randomForest(formula = skipped ~ ., data = train_set, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 18.37%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE  1261  146   0.1037669
## TRUE    208  312   0.4000000
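
For anyone who wants to go beyond the defaults, a small search over `mtry` (the number of variables tried at each split) can reuse the out-of-bag error shown above. This is a minimal sketch on the built-in `iris` data so that it runs on its own; `oob_error_by_mtry()` is a hypothetical helper, and the candidate values are arbitrary.

```r
library(randomForest)

# Fit one forest per candidate `mtry` and record the final out-of-bag error.
oob_error_by_mtry <- function(formula, data, mtry_values, ntree = 200) {
  errors <- vapply(mtry_values, function(m) {
    fit <- randomForest(formula, data = data, mtry = m, ntree = ntree)
    # err.rate has one row per tree; the last row is the full forest.
    fit$err.rate[ntree, "OOB"]
  }, numeric(1))
  data.frame(mtry = mtry_values, oob_error = errors)
}

# Toy run on a built-in dataset.
set.seed(100)
mtry_results <- oob_error_by_mtry(Species ~ ., data = iris,
                                  mtry_values = 1:4, ntree = 200)
mtry_results
```

On the Spotify data the call would look like `oob_error_by_mtry(skipped ~ ., train_set, mtry_values = 2:6)`. The randomForest package also ships a `tuneRF()` function that performs a similar search.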

One main reason I chose Random Forest was so I could get a clear view of the importance of each feature. It looks like skip streak is the best indicator by far.

importance(model)

##                      FALSE       TRUE MeanDecreaseAccuracy
## skip_streak      64.133308 72.4336116            82.919092
## hour             14.188931  8.8973681            16.730238
## weekday           9.918859  1.9784049             9.806679
## track_popularity 16.959472  4.0607791            17.598914
## danceability     13.535021 -1.0126104            12.151877
## energy           12.781509 10.2480297            16.862217
## key               6.443682  0.5880843             6.408745
## loudness         14.199997 11.8339001            18.386072
## mode              1.588636  0.7190802             1.798527
## speechiness       7.508636  7.4175950            10.124761
## acousticness      7.819603  5.0608135            10.453423
## instrumentalness 11.583550 12.0180676            15.658565
## liveness         10.215735 -1.2243800             9.088325
## valence          15.288138 -2.3483400            14.334362
## tempo             7.533527  5.7996491             9.323382
##                  MeanDecreaseGini
## skip_streak            193.773820
## hour                    43.547560
## weekday                 27.618062
## track_popularity        40.195048
## danceability            40.712076
## energy                  47.766131
## key                     23.635846
## loudness                53.806499
## mode                     5.357875
## speechiness             38.896387
## acousticness            39.428905
## instrumentalness        50.586842
## liveness                36.065278
## valence                 39.418151
## tempo                   38.867707

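Overall, the prediction is pretty good: about 80% of the tracks in the validation set are classified correctly (665 of 827), although roughly 4 in 10 of the songs I actually skipped are still predicted as finished.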

predictions <- predict(model, newdata = valid_set)
table(predictions, valid_set$skipped)

##            
## predictions FALSE TRUE
##       FALSE   531   84
##       TRUE     78  134
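
The table can be turned into a few summary numbers. The sketch below re-enters the counts from the table above by hand and computes accuracy, precision and recall for the "skipped" class, plus the accuracy of always guessing "not skipped" as a baseline.

```r
# Validation confusion matrix from above:
# rows are predictions, columns are the true labels.
confusion <- matrix(
  c(531, 78, 84, 134),
  nrow = 2,
  dimnames = list(predicted = c("FALSE", "TRUE"),
                  actual    = c("FALSE", "TRUE"))
)

accuracy  <- sum(diag(confusion)) / sum(confusion)                 # 665 / 827
precision <- confusion["TRUE", "TRUE"] / sum(confusion["TRUE", ])  # 134 / 212
recall    <- confusion["TRUE", "TRUE"] / sum(confusion[, "TRUE"])  # 134 / 218
baseline  <- sum(confusion[, "FALSE"]) / sum(confusion)            # 609 / 827

round(c(accuracy = accuracy, precision = precision,
        recall = recall, baseline = baseline), 3)
# accuracy 0.804, precision 0.632, recall 0.615, baseline 0.736
```

So the model beats the always-"not skipped" baseline by about seven percentage points, and it catches a little over 60% of the songs that were actually skipped.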

There are a lot more features for tracks that Spotify exposes. Adding in more of these features could improve the performance of our predictions. It might also be interesting to do regression instead of classification and predict exactly how long you will listen to a song. There are a lot of interesting projects you can try with Spotify data, and I hope this tutorial helps get you started on your own.