Last.fm Music Recommender
The purpose of a music recommender is two-fold: to create better user experiences and to increase the commercial value of streaming platforms. In this project, we explore collaborative filtering based recommendation. Specifically we had built two recommenders using matrix factorization and autoencoder, and compare their performances.
Type: Data Science + Machine Learning
Team: Zhihao Fang, Siyu Guo, Vincent Mai
Role:
- Assisted in gathering user data from the music website Last. fm.
- Assisted in gathering data preprocessing
- Trained and analyzed end-to-end recommenders built using deep autoencoder.
Date: 2019
Problem Statement¶
Online music streaming services had become the dominant medium for users to access music. The purpose of a music recommender is two-fold: to create better user experiences and to increase the commercial value of streaming platforms.
For the term project we are interested in exploring collaborative filtering recomendation which typically consists of user-based and content-based appraoches. We will focused on user-based approach for it is the lighter approach that only takes into account users' past ratings of the songs. We will build two recommenders using a traditional approach of matrix factorization and a modern deep learning approach of deep autoencoder respectively. Subsequently, we will compare their performances in predicting users' preferences on songs masked as "unseen" and conduct analysis in relation to the nature of the data.
Related Work¶
While matrix factorization method with SVD had been the traditonally popular in collaboratie filtering based recommender systems, recent effort have increasing migrated toward deep learning technique. Kuchaiev and Ginsburg
https://pdfs.semanticscholar.org/838e/9ddc41b6badb9bf1f64cd7e2a18f8adda057.pdf
Data Collection¶
We will mainly use Last.fm api to collect our data, where we can retrieve a set of users and their N most frequently played songs in a most recent window of time User.getTopTracks
. The caveat of using this method is that we'll need to pass in list of usernames as argument, which is not readily available for privacy reason. To get around this challenge, we will use the search phrase site:last.fm/user/
on popular search engine website and scrape usernames. Our experiments had showed Bing by far returned the most number of usernames. Importantly, as a good privacy practice, make sure that usernames are hashed before publishing any parts of the data harvested.
import time
import re
from bs4 import BeautifulSoup
import urllib.request
import numpy as np
import progressbar
import pylast
import os
import operator
import scipy.sparse as sp
import matplotlib.pyplot as plt
from collections import OrderedDict
API_KEY = "389968e3566746e89cbb4750271c3aa1"
API_SECRET = "a96e6333143c6f686fd045a41acf1f4e"
PATH_USERNAMES = "dataset/usernames.npy"
PATH_SONGS = "dataset/songs.npy"
PATH_RATINGS = "dataset/ratings.npy"
PATH_MATRIX = "dataset/matrix.npz"
PATH_RATING_DICT = "dataset/rating_dict.npy"
PATH_MODEL = "model/ae.model"
def get_username_from_url(request):
res = set()
with urllib.request.urlopen(request) as urlObject:
html = urlObject.read()
soup = BeautifulSoup(html, "html.parser")
results = soup.findAll('li', { "class" : "b_algo" })
for result in results:
result = str(result.find('a').get('href'))
username = re.search(r'https://[^.]+.last.fm/user/([^/]+)', result)
if username is not None:
res.add(username.group(1))
else:
print('failed to fetch username in', result)
return res
def bing_search():
print('crawling usernames from Bing')
url = "https://www.bing.com/search?q=site%3alast.fm%2fuser%2f&qs=ds&first={0}&FORM=PERE"
headers = {'User-Agent':'Chro/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0'}
res = set()
start = 1
address = url.format(start)
request = urllib.request.Request(address, None, headers)
res = res.union(get_username_from_url(request))
start = 15
total = 1000
with progressbar.ProgressBar(max_value=total) as bar:
while start < total:
address = url.format(start)
request = urllib.request.Request(address, None, headers)
res = res.union(get_username_from_url(request))
start += 10
bar.update(min(start, total))
time.sleep(1)
print('crawled %d usernames from bing search'%len(res))
return res
In addtion to crawling Bing to scrape usernames user.get_friends()
to find user's friends to ensure we have big enough of a dataset. With a "depth=1" search, we are able to get 9682 users.
network = pylast.LastFMNetwork(api_key=API_KEY, api_secret=API_SECRET)
def crawl_user_friends(usernames, depth=1, max_friends=0):
usernames = usernames[:]
for i in range(depth):
print('fetching list of usernames with a depth of %d'%depth)
assert len(usernames) == i+1
friends = set()
# traverse each user in last depth
with progressbar.ProgressBar(max_value=len(usernames[i])) as bar:
for j, username in enumerate(usernames[i]):
user = network.get_user(username)
try:
n = 0
for friend in user.get_friends():
recorded = False
for prev in usernames: # check if friend is already recorded in previous depths
if friend.get_name() in prev:
recorded = True
if not recorded:
friends.add(friend.get_name())
n += 1
if max_friends and n >= max_friends:
break
except:
pass
finally:
bar.update(j)
usernames.append(friends)
return usernames
def get_usernames():
# TODO refer to cache funtions in previous hw
path = os.path.join(os.getcwd(), PATH_USERNAMES)
if os.path.isfile(path):
return np.load(path, allow_pickle=True)
dataset = crawl_user_friends([bing_search()], depth=1, max_friends=20)
dataset = np.array(list(set.union(*dataset)), dtype=object)
np.save(path, dataset)
return dataset
USERNAMES = get_usernames()
crawling usernames from Bing
100% (1000 of 1000) |####################| Elapsed Time: 0:02:11 Time: 0:02:11
crawled 768 usernames from bing search fetching list of usernames with a depth of 1
100% (768 of 768) |######################| Elapsed Time: 0:42:58 Time: 0:42:58
USERNAMES = get_usernames()
During the process of scrape username te found out There is a limitation in the original method of pylast._collect_nodes
import xml.dom
def _collect_nodes(limit, sender, method_name, cacheable, params=None):
"""
Returns a sequence of dom.Node objects about as close to limit as possible
"""
if not params:
params = sender._get_params()
nodes = []
page = 1
end_of_pages = False
while not end_of_pages:
params["page"] = str(page)
tries = 1
while True:
try:
doc = sender._request(method_name, cacheable, params)
break # success
except Exception as e:
if tries >= 3:
raise e
# Wait and try again
time.sleep(1)
tries += 1
doc = pylast.cleanup_nodes(doc)
# break if there are no child nodes
if not doc.documentElement.childNodes:
break
main = doc.documentElement.childNodes[0]
if main.hasAttribute("totalPages"):
total_pages = pylast._number(main.getAttribute("totalPages"))
elif main.hasAttribute("totalpages"):
total_pages = pylast._number(main.getAttribute("totalpages"))
else:
raise Exception("No total pages attribute")
for node in main.childNodes:
if not node.nodeType == xml.dom.Node.TEXT_NODE:
nodes.append(node)
if page >= total_pages:
end_of_pages = True
page += 1
return nodes
# overide pylast's method
pylast._collect_nodes = _collect_nodes
Now we will use Last.fm's api method network.get_user(username)
to find out whether this username is valid. Then we will is use Last.fm's api method user.get_top_tracks
to collected all the play records assciated with this valid user in one month.
With 9682 requests, we are able to collect 6360 valid users with their play records.
We stored these records in a dictionary:
{'user_1':{ 'Album','Artist' : 'playcount'},
...
{'user_N':{ 'Album','Artist' : 'playcount'}}
def get_dictionary(usernames=USERNAMES):
# check if there exists rating matrix
path_rating_dict = os.path.join(os.getcwd(), PATH_RATING_DICT)
if os.path.isfile(path_rating_dict):
return np.load(path_rating_dict, allow_pickle=True).item()
d = {} # {username: {(song, artist): playcount}}
with progressbar.ProgressBar(max_value=len(usernames)) as bar:
for user_index, username in enumerate(usernames):
try:
user = network.get_user(username)
tracks = user.get_top_tracks(period="1month", limit=1000, cacheable=True) # todo test cacheable
if username not in d and len(tracks):
d[username] = {}
for item in tracks:
track, play_count = item
key =(track.get_artist().get_name(), track.get_name())
d[username][key] = play_count
except Exception as e:
pass
finally:
# with open(path_rating_dict, 'w') as f:
np.save(path_rating_dict, d)
return d
Explore data¶
Now that we have successfully generated a disctionary of user-songs mapping, let's visualize it to better understand the underlying nature of the data. We have successful gather song ratings from 9,682 users, with a combined number of 1,780,000 songs. We observe an exponential distribution in both user activies
user_play_dict = get_dictionary()
print('Number of Valid Users: %d'% len(user_play_dict))
print('Number of Valid Songs: %d'% sum(map(len, user_play_dict.values())))
100% (9682 of 9682) |####################| Elapsed Time: 9:50:11 Time: 9:50:11
user_play_dict = get_dictionary()
print('Number of Valid Users: %d'% len(user_play_dict))
print('Number of Valid Songs: %d'% sum(map(len, user_play_dict.values())))
Number of Valid Users: 6360 Number of Valid Songs: 4847252
Data Preprocessing¶
Given our observation, we have manually set up parameters to purge the outliers in the long tail of the exponential distribution, that is, users who had only listened to a few songs
But we still face the question of how to convert user's playcounts into a rating for a given song. To contraint the wide range of playcounts, we will use the logarithmic scale instead of the linear scale. Subsequently we have normalized the log playcount for each users to generate a relative user rating from
def normalize_ratings(play_count):
if not len(play_count):
return play_count
play_count = np.log(np.array(play_count, dtype=np.float))
# play_count = np.sqrt(np.log(np.array(play_count, dtype=np.float)))
max_count = np.max(play_count)
if max_count > 0:
play_count /= max_count
play_count = play_count*0.9 + 0.1
return play_count.tolist()
def purge_rating_dict(user_play_dict, min_songs, min_users):
"""
1. remove users with only a few played songs (<min_songs)
2. remove songs played by only a few users (<min_users)
3. normalize playcounts
4. return user_index_list, song_index_list, user_list, song_list, rating_list
"""
# List for build up Matrix
user_indices = []
song_indices = []
rating_list = []
user_list = []
song_list = []
# remove users with only a few played songs (<min_songs)
remove_user_list = []
print('user num:', len(user_play_dict), end='')
for user in user_play_dict:
if len(user_play_dict[user]) < min_songs:
remove_user_list.append(user)
for key in remove_user_list:
if key in user_play_dict:
del user_play_dict[key]
print('->', len(user_play_dict), end='n')
# remove songs played by only a few users (<min_users)
song_dict = {}
remove_songs = set()
for keys, values in user_play_dict.items():
for song in values:
song_dict[song] = song_dict.get(song, 0) + 1
print('song num :', len(song_dict), end='')
for key in song_dict:
if song_dict[key] < min_users:
remove_songs.add(key)
for key in remove_songs:
if key in song_dict:
del song_dict[key]
print('->', len(song_dict), end='n')
d = {}
for name in user_play_dict.keys():
ratings = []
_remove_songs = []
_remove_users = []
for k, v in user_play_dict[name].items():
if k in remove_songs:
_remove_songs.append(k)
continue
if k not in d:
song_indices.append(k)
d[k] = len(d)
user_list.append(len(user_indices))
song_list.append(d[k])
ratings.append(v)
if len(ratings):
rating_list.extend(normalize_ratings(ratings))
user_indices.append(name)
else:
_remove_users.append(name)
for s in _remove_songs:
if s in user_play_dict[name]:
del user_play_dict[name][s]
for u in _remove_users:
if u in user_play_dict:
del user_play_dict[u]
return rating_list, user_list, song_list, user_indices, song_indices
path_rating_dict = os.path.join(os.getcwd(), PATH_RATING_DICT)
user_play_dict = np.load(path_rating_dict, allow_pickle=True).item()
rating_list, user_list, song_list, user_indices, song_indices = purge_rating_dict(user_play_dict, min_songs=10, min_users=50)
user num: 6360-> 6246 song num : 1780000-> 8496
def processVisual(dt):
song_dict = {}
user_dict = {}
remove_songs = set()
for key, values in dt.items():
user_dict[key] = len(values)
for song in values:
song_dict[song] = song_dict.get(song, 0) + 1
return song_dict, user_dict
song_dict, user_dict = processVisual(user_play_dict)
# Music Listened By Number of Users
def plot_Users_Music(song_dict):
song_list = []
song_list = sorted(song_dict.items(), key=operator.itemgetter(1),reverse=True)
x = []
for i in song_list:
x.append(i[1])
fig, ax = plt.subplots()
ax.set(xlabel='Song', ylabel='#listeners',
title='#Listener per Song')
ax.grid()
ax.plot(np.arange(len(x))+1, x, label='')
fig.savefig("test.png")
plt.show()
plot_Users_Music(song_dict)
# About User Activities in last Month
def plot_Users_Activ(user_dict):
user_list = []
user_list = sorted(user_dict.items(), key=operator.itemgetter(1),reverse=True)
x = []
for i in user_list:
x.append(i[1])
fig, ax = plt.subplots()
ax.set(xlabel='User', ylabel='Different Music Listened/User (m/u)',
title='About User Activties in Last one month')
ax.grid()
ax.plot(np.arange(len(x))+1, x, label='')
fig.savefig("test.png")
plt.show()
plot_Users_Activ(user_dict)
# Normalization Of User Playcount
def plot_User_Normalization(user):
picked_user = user
picked_user_track = sorted(user_play_dict[picked_user].items(), key=operator.itemgetter(1),reverse=True)
play_count = []
for i in picked_user_track:
play_count.append(i[1])
# call normalization
play_count_2 = normalize_ratings(play_count)
play_count = np.array(play_count)
max_count = np.max(play_count)
min_count = np.min(play_count)
if max_count > 0:
play_count = (play_count - min_count)/(max_count-min_count)
# play_count.tolist()
fig, ax = plt.subplots()
ax.set(xlabel='Music', ylabel='Playcount',
title='Normalization of User's Playcount')
ax.grid()
ax.plot(np.arange(len(play_count))+1, play_count, label='Linear Normalization')
ax.plot(np.arange(len(play_count_2))+1, play_count_2, label='Log Normalization')
fig.savefig("test.png")
plt.legend(title='')
plt.show()
random_user = list(user_play_dict)[21]
plot_User_Normalization(random_user)
Data Preparation¶
Now that We have completed data cleaning and preprocessing, it's time the prepare the data for the recommender training pipelines. We will first combine the lists of users IDs, song IDs and user ratings into a user-song matrix, with each entries being the user's rating
from sklearn.model_selection import train_test_split
def split_dataset(data, row, col, val_size=0.2, test_size=0.1, random_state=0xDECAFE):
train_userid, test_userid, train_songid, test_songid, train_rating, test_rating = train_test_split(row, col, data, test_size=0.1, random_state=random_state)
train_userid, val_userid, train_songid, val_songid, train_rating, val_rating = train_test_split(train_userid, train_songid, train_rating, test_size=0.2, random_state=random_state)
# Split matrix
matrix_train = sp.coo_matrix((train_rating, (train_userid, train_songid)), shape=(len(user_indices), len(song_indices)))
matrix_val = sp.coo_matrix((val_rating, (val_userid, val_songid)), shape=(len(user_indices), len(song_indices)))
matrix_test = sp.coo_matrix((test_rating, (test_userid,
test_songid)), shape=(len(user_indices), len(song_indices)))
return matrix_train, matrix_val, matrix_test
matrix_train, matrix_val, matrix_test = split_dataset(rating_list, user_list, song_list)
Building Recommenders¶
Before we start building our recommender models, it's important to get an overview of what they do. Recall that we have an extremely sparse user-song rating matrix. In user-based collaborative filtering, the recommender algorithm learns a representation of the underlying data distribution and "densifies" the sparse matrix in the process. This is true for both matrix factorization and deep autoencoder, which essentially peform dimension reduction of the data, similar to the function of PCA. For instance, the "k" parameter determining the later-feature dimension in matrix factorization is the equavilent to choosing the top k eigenvectors and eigenvalues in PCA. Similarly, autoencoder distills the dimensions of the input data by forcing it through a bottle-neck layer.
With this in mind, we are ready to build our recommenders.
Matrix Factorization Pipeline¶
In collaborative filtering, we take the user-song rating matrix and attempts to factorize it into user-latent features and song-latent features. By viewing rating as the product of both user and song properties, we can predict a user's ratings on song they have not yet listened to.
Given the sparsely filled rating matrix Scipy.linalg.svds
to factorize the matrix and mean square error for evaluating the reconstruction loss.
# cite: error computation taken from 688 hw4_cf
import scipy.linalg as la
import time
def error(X, U, V):
""" Compute the mean error of the observed ratings in X and their estimated values.
args:
X : np.array[num_users, num_movies] -- the ratings matrix
U : np.array[num_users, num_features] -- a matrix of features for each user
V : np.array[num_movies,num_features] -- a matrix of features for each movie
return: float -- the mean squared error between non-zero entries of X and the ratings
predicted by U and V; as this is an error and not a loss function, you do not need to include the
regularizing terms.
"""
return np.sum(np.square(np.multiply(np.sign(X), (X - np.dot(U, V)))))/np.sum(X!=0)
After experimentation, we have chosen k=300 which gives us both a relatively low training
svd_u, svd_s, svd_vt = sp.linalg.svds(matrix_train, k=300)
# train error
print("Train Error: ", error(matrix_train.todense(), svd_u@np.diag(svd_s), svd_vt))
# # test error
print("Val Error: ", error(matrix_val.todense(), svd_u@np.diag(svd_s), svd_vt))
Train Error: 0.04620375153263186 Val Error: 0.08332249067412383
Autoencoder Pipeline¶
A deep autoencoder is a type of multi-layer perceptron
import os
import time
import math
import torch
import numpy as np
import torch.nn as nn
import scipy.sparse as sp
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
cuda = torch.cuda.is_available()
device = torch.device("cuda" if cuda else "cpu")
class RatingDataLoader(DataLoader):
def __init__(self, dataset, batch_size, shuffle=True):
# super(RatingDataLoader, self).__init__(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
self.dataset = dataset
self.batch_size = batch_size
# dict{usrid: matrix_row_indices}
self.usr_dict = self.get_usr_dict(dataset.row)
self.usr_list = sorted(self.usr_dict.keys())
self.num_row = dataset.get_shape()[0]
self.num_col = dataset.get_shape()[1]
self.shuffle = shuffle
self.length = math.ceil(len(self.usr_list)/self.batch_size)
def __len__(self):
return self.length
def __iter__(self):
usr_list = self.usr_list.copy()
# random
if self.shuffle:
np.random.shuffle(usr_list)
while len(usr_list) > 0:
indices_row = []
indices_col = []
for i in range(self.batch_size):
if len(usr_list) == 0:
break
idx = self.usr_dict[usr_list.pop()]
indices_col.extend(idx)
indices_row.extend([i] * len(idx))
row = np.array(indices_row, dtype=np.int32)
col = self.dataset.col[indices_col]
data = torch.FloatTensor(self.dataset.data[indices_col])
row_col = torch.LongTensor(np.stack((row, col)))
yield torch.sparse.FloatTensor(row_col, data, torch.Size([indices_row[-1]+1, self.num_col])).to_dense()
def get_usr_dict(self, row_list):
d = {}
for i, u in enumerate(row_list):
if u in d:
d[u].append(i)
else:
d[u] = [i]
return d
class Autoencoder(nn.Module):
"""
symetrical encoder-decoder layers
"""
def __init__(self, layer_sizes, dropout_prob):
super().__init__()
self.layer_sizes = layer_sizes
self.dropout_prob = dropout_prob
self.encoder = nn.Sequential(*self.getLayers("encoder"))
self.decoder = nn.Sequential(*self.getLayers("decoder"))
def getLayers(self, type):
layers = []
if type == "encoder":
layers.append(nn.Dropout(self.dropout_prob))
_layer_sizes = self.layer_sizes
elif type == "decoder":
_layer_sizes = self.layer_sizes[: : -1]
for i in range(len(_layer_sizes)-1):
layers.append(nn.Linear(_layer_sizes[i], _layer_sizes[i+1]))
layers.append(nn.Tanh())
return layers
def forward(self, x):
x = self.encoder(x)
x = self.decoder(x)
return x
def train_epoch(model, train_loader, criterion, optimizer):
model.train() # set to training mode
model.to(device)
running_loss = 0.0
start_time = time.time()
for batch_idx, data in enumerate(train_loader):
optimizer.zero_grad() # clear gradient for each batch
data = data.to(device)
# forward
train_out = model(data)
loss = criterion(train_out, data)
running_loss += loss.item()
# backward
loss.backward()
optimizer.step()
end_time = time.time()
running_loss /= len(train_loader)
print("Reconstruction Loss: %f Time: %fs" % (running_loss, end_time - start_time))
return running_loss
def eval_epoch(model, eval_loader, criterion):
with torch.no_grad():
model.eval()
model.to(device)
running_loss = 0.0
total_pred = 0
correct_pred = 0
for batch_idx, data in enumerate(eval_loader):
data = data.to(device)
eval_out = model(data)
total_pred = data.size(0)
correct_pred += (eval_out == data).sum().item()
loss = criterion(eval_out, data).detach()
running_loss += loss.item()
running_loss /= len(eval_loader)
print("Validation Loss: %f" % running_loss)
return running_loss
A typical loss function for an autoencoder is mean square error
class zero_masked_MSELoss(nn.Module):
def __init__(self):
super().__init__()
def forward(self, input, target):
diff = input - target
mask = torch.where(target != 0, torch.ones_like(target), target)
return torch.sum(diff ** 2 * mask) / torch.sum(mask)
Below is the end-to-end pipeline for the autoencoder training. We will add an additional dropout layer
def train_autoencoder(model, path):
# load data
print("loading data")
batch_size = 256
# matrix_combined = (matrix_train + matrix_val).tocoo()
train_loader = RatingDataLoader(matrix_train, batch_size, shuffle=True)
test_loader = RatingDataLoader(matrix_val, batch_size, shuffle=False)
criterion = zero_masked_MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# training loop
n_epochs = 250
print("training autoencoder for %d epochs:" % n_epochs)
train_loss = []
val_loss = []
with progressbar.ProgressBar(max_value=n_epochs) as bar:
for i in range(n_epochs):
# print("Epoch: %d/%d"%((i+1),n_epochs))
t_loss = train_epoch(model, train_loader, criterion, optimizer)
v_loss = eval_epoch(model, test_loader, criterion)
train_loss.append(t_loss)
val_loss.append(v_loss)
bar.update(i)
print("train_loss=%ftval_loss=%f"%(train_loss[-1], val_loss[-1]))
print("done")
torch.save({
'model_state_dict': model.state_dict(),
'train_loss': train_loss,
'val_loss': val_loss,
}, path)
return train_loss, val_loss
def plot_loss(train_loss, val_loss):
train_loss = np.array(train_loss)
val_loss = np.array(val_loss)
fig, ax = plt.subplots()
ax.set(xlabel='EPOCH', ylabel='LOSS',title='LOSS')
ax.grid()
ax.plot(np.arange(len(train_loss))+1, train_loss, label='train loss')
ax.plot(np.arange(len(val_loss))+1, val_loss, label='val loss')
plt.legend(title='')
plt.show()
def get_recommender_AE():
recommender_AE = Autoencoder([matrix_train.get_shape()[1], 1000, 1000, 400], dropout_prob=0.5)
print(recommender_AE)
path_model = os.path.join(os.getcwd(), PATH_MODEL)
if os.path.isfile(path_model):
checkpoint = torch.load(path_model)
recommender_AE.load_state_dict(checkpoint['model_state_dict'])
train_loss = checkpoint['train_loss']
val_loss = checkpoint['val_loss']
else:
train_loss, val_loss = train_autoencoder(recommender_AE, path_model)
plot_loss(train_loss, val_loss)
return recommender_AE.to(device)
recommender_AE = get_recommender_AE()
Autoencoder( (encoder): Sequential( (0): Dropout(p=0.5, inplace=False) (1): Linear(in_features=8496, out_features=1000, bias=True) (2): Tanh() (3): Linear(in_features=1000, out_features=1000, bias=True) (4): Tanh() (5): Linear(in_features=1000, out_features=400, bias=True) (6): Tanh() ) (decoder): Sequential( (0): Linear(in_features=400, out_features=1000, bias=True) (1): Tanh() (2): Linear(in_features=1000, out_features=1000, bias=True) (3): Tanh() (4): Linear(in_features=1000, out_features=8496, bias=True) (5): Tanh() ) )
By closely monitoring the training-validation loss, we are able to tune the model to prevent overfitting.
Make recommendation¶
After training both recommenders, it is time to examine their performance by their rating predictions. For matrix factorization, we obtain a dense user-song rating matrix by multiplying the user-latent and song-latent matrixes. For the autoencoder predicition, we will pass the training data into the foward loop and obtain a dense matrix. Subsequently, we will compare the predicted matrix from the two recommender against the unseen rating in the test set.
Matrix Factorization Prediction¶
# loading test data
test_data = matrix_test.toarray()
pred_MF = (svd_u@np.diag(svd_s))@svd_vt
Autoencoder Prediction¶
train_data = matrix_train.toarray()
tensor = torch.from_numpy(train_data).float()
tensor = tensor.to(device)
with torch.no_grad():
recommender_AE.eval()
pred_AE = recommender_AE(tensor)
pred_AE = pred_AE.to("cpu").detach().numpy()
Analysis¶
After each recommender systems had made their prediction, we can compare it with a user's actual rating in the test data, which the recommender had not seen during training. Here we visualize in a row, the user's actual rating and the predictions made by both matrix factorization recommender and autoencoder recommender. To facilitate understanding, we have mapped the domain of the rating
from matplotlib import colors
import matplotlib.pyplot as plt
def plotMatrix(lst, user_id):
temp = [i for i in lst]
lst = temp
fig, ax = plt.subplots(1, 1)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
fig.set_size_inches(5, 2)
intersection_matrix = np.array(lst)
ax.imshow(intersection_matrix, cmap=plt.cm.Blues, aspect='auto', interpolation='nearest')
ax.set_yticks(np.arange(-0.5, 3, 1), minor=True);
ax.grid(which='minor', color='w', linestyle='-', linewidth=6)
labels = ['Groud Truth', 'Matrix Factorization Prediction', 'AutoEncoder Prediction']
ax.set_yticks(np.arange(len(labels)))
ax.set_yticklabels(labels)
ax.set_xticklabels([])
ax.set_title("UserID: %s" % user_id)
def predictionCompare(user_id):
rating_threshold = 0
actual = test_data[user_id]
pred_MF_usr = pred_MF[user_id]
pred_AE_usr = pred_AE[user_id]
idx = np.where(actual > rating_threshold)
# sort prediction by sorted test_data
_actual = sorted(actual[idx], reverse=True)
_pred_MF_usr = sortby(actual[idx], pred_MF_usr[idx])
_pred_AE_usr = sortby(actual[idx], pred_AE_usr[idx])
plotMatrix([_actual, _pred_MF_usr, _pred_AE_usr], user_id)
predictionCompare(6)
predictionCompare(13)
predictionCompare(27)
predictionCompare(35)
The above graphic visualizes the comparison between the actual and the predicted ratings for 5 different users. We observe that the performance of the recommenders are in fact, far lower than the expected performance during validation. Evidently, both matrix factorization and deep autoencoder had made an average prediction that is far lower than the actual rating
"""
returns an MSE between the input matrix and the target matrix
The mask is defined by a tuple of rating threadsholds
"""
def threshold_MSE(input, target, threshold):
diff = input - target
start, end = threshold
mask = np.where((target>start)&(target<end),1,0)
return np.sum(diff ** 2 * mask) / np.sum(mask)
thresholds = [(0.1, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1.0)]
threshold_MSE_MF = []
threshold_MSE_AE = []
for t in thresholds:
threshold_MSE_MF.append(threshold_MSE(pred_MF, test_data, t))
threshold_MSE_AE.append(threshold_MSE(pred_AE, test_data, t))
def plotDoubleBar(lst1, lst2):
labels = ['[0.1 - 0.2]', '[0.2 - 0.4]', '[0.4 - 0.6]', '[0.6 - 0.8]', '[0.8 - 1.0]']
x = np.arange(len(labels)) # the label locations
width = 0.35 # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, lst1, width, label='Matrix Factorization')
rects2 = ax.bar(x + width/2, lst2, width, label='AutoEncoder')
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('MSE')
ax.set_title('MSE at different ***')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
fig.tight_layout()
plt.show()
Factorization_means = threshold_MSE_MF
AutoEncoder_means = threshold_MSE_AE
plotDoubleBar(Factorization_means, AutoEncoder_means)
Unsurprisingly, the result matches with our expectation. We see that both recommenders had made far smaller of a mistake when predicting reatings close from 0.1 to 0.4, (a.k.a songs not prefered by the user) while the mean square error gradually increases toward songs most liked by the users. This is, at least partially, in agreement with our hypothesis that the paucity of data in the songs with most liked
default_threshold = (0, 1)
test_error_MF = threshold_MSE(pred_MF, test_data, default_threshold)
test_error_AE = threshold_MSE(pred_AE, test_data, default_threshold)
print("MF Test Error: ", test_error_MF)
print("AE Test Error: ", test_error_AE)
MF Test Error: 0.0744864442121783 AE Test Error: 0.06296188026915885
Examine recommendation:¶
Finally, we have used the recommender system to yeild a set of songs unseen in the user's training, val and test data. To our satisfaction, we have observed that the artists recommended did appeared serveral times in the user's play history. Thus, validating our recommender.
combined_matrix = (matrix_train+matrix_val+matrix_test).toarray()
uid = np.random.choice(user_list, 10)
def sortby(X, Y):
return [y for x, y in sorted(zip(X, Y), reverse=True, key=lambda x: x[0])]
def recommend(uid, k=10, combined_matrix=combined_matrix):
print('user id:', uid)
tensor = torch.from_numpy(combined_matrix[uid]).float().to(device)
with torch.no_grad():
recommender_AE.eval()
pred_rating = recommender_AE(tensor).to('cpu').tolist()
idx = sortby(pred_rating, list(range(len(pred_rating))))
pred_artist = set()
pred_songs = []
for i in range(k):
pred_artist.add(song_indices[idx[i]][0])
pred_songs.append([pred_rating[idx[i]], song_indices[idx[i]], 0])
for song, playcount in user_play_dict[user_indices[uid]].items():
for i in range(len(pred_songs)):
if song[0] == pred_songs[i][1][0]:
pred_songs[i][2] += 1
print("ratingtsongtttttartisttttartist listened")
for p in pred_songs:
artist = p[1][0] if len(p[1][0]) <= 20 else p[1][0][:17]+'...'
song = p[1][1] if len(p[1][1]) <= 30 else p[1][1][:27]+'...'
print("%.3ft%-35st%-25st%3d"%(p[0], song, artist, p[2]))
print('n')
for i in uid:
recommend(i)
user id: 1874 rating song artist artist listened 0.672 It Was in Me Avril Lavigne 1 0.589 Owner of a Lonely Heart Yes 0 0.563 Falling Trevor Daniel 0 0.561 Yesterday - Remastered 2009 The Beatles 0 0.553 Radio Friendly Unit Shifter Nirvana 0 0.543 Jewels Of The Sky: Inscription Matana Roberts 0 0.542 Another One Bites The Dust ... Queen 0 0.537 Just Wait Til Next Year John Maus 0 0.530 Guilty Party The National 1 0.519 I Apologise If You Feel Som... Bring Me the Horizon 0 user id: 2266 rating song artist artist listened 0.663 Bosses Hang, Pt. I Godspeed You! Bla... 0 0.502 Almost Had To Start A Fight... Parquet Courts 0 0.491 Batuka Madonna 4 0.480 Pump It Black Eyed Peas 0 0.475 Daddy Issues Demi Lovato 0 0.457 Did You Know? Villagers 0 0.453 The Spirit of Radio Rush 0 0.452 The Trip Still Corners 0 0.448 Peroxide ECCO2k 0 0.448 Uprising Muse 0 user id: 2132 rating song artist artist listened 0.565 Immortal Marina 1 0.564 Lighthouse Grouper 0 0.557 ILY2 Charli XCX 19 0.544 Wouldn't Leave Kanye West 0 0.543 Personal Jesus - single ver... Depeche Mode 0 0.538 Good Morning Kanye West 0 0.535 Timebends Deerhunter 0 0.531 I Fall Apart Post Malone 3 0.529 Jewels Of The Sky: Inscription Matana Roberts 0 0.528 Subterranean Homesick Blues Bob Dylan 0 user id: 2491 rating song artist artist listened 0.646 Imitation of Life R.E.M. 0 0.621 Dsco Sweet Trip 0 0.563 Dark Ballet Madonna 0 0.514 Welcome Home (Sanitarium) Metallica 0 0.514 To Your Love Fiona Apple 12 0.509 Venom Little Simz 0 0.509 Adore You Miley Cyrus 0 0.504 Luxury Azealia Banks 0 0.500 Visiting Hours Kero Kero Bonito 0 0.500 Falling Trevor Daniel 0 user id: 2006 rating song artist artist listened 0.656 Owner of a Lonely Heart Yes 0 0.654 Yesterday - Remastered 2009 The Beatles 0 0.522 Surf (feat. Gunna) Young Thug 0 0.517 Platz eins Lindemann 0 0.501 No I in Threesome Interpol 2 0.499 Better Now Post Malone 0 0.492 Lady Marmalade - From "Moul... Christina Aguilera 0 0.486 Video Girl FKA twigs 0 0.486 Top (feat. Pi'erre Bourne) Playboi Carti 0 0.486 Rock and Roll Led Zeppelin 0 user id: 4190 rating song artist artist listened 0.654 Dsco Sweet Trip 0 0.578 Top (feat. Pi'erre Bourne) Playboi Carti 0 0.571 Immortal Marina 0 0.570 Piel Arca 0 0.568 Trains Porcupine Tree 0 0.553 Useless Phrases King Princess 0 0.547 Criminal Britney Spears 3 0.544 Cathedrals of Heaven Swans 0 0.520 Falling Trevor Daniel 0 0.516 Twist Thom Yorke 0 user id: 468 rating song artist artist listened 0.680 The Spirit of Radio Rush 0 0.631 This Life Vampire Weekend 2 0.599 Dsco Sweet Trip 0 0.585 Videotape Radiohead 0 0.577 Boom Clap Charli XCX 0 0.568 Bosses Hang, Pt. I Godspeed You! Bla... 0 0.547 JUNKY BROCKHAMPTON 0 0.543 Call It What You Want Foster the People 0 0.537 What a Time (feat. Niall Ho... Julia Michaels 0 0.535 Start Me Up - Remastered The Rolling Stones 0 user id: 2996 rating song artist artist listened 0.747 Falling Trevor Daniel 0 0.623 Top (feat. Pi'erre Bourne) Playboi Carti 28 0.614 Dsco Sweet Trip 0 0.588 You Did It Yourself Arthur Russell 0 0.571 Ribs Lorde 9 0.568 Pagan Fears Mayhem 0 0.557 Sissy That Walk RuPaul 0 0.555 Joy to the World Mariah Carey 0 0.552 Boom Clap Charli XCX 0 0.541 Funeral Fog Mayhem 0 user id: 587 rating song artist artist listened 0.639 Near DT, MI black midi 0 0.576 JESUS BROCKHAMPTON 0 0.575 Another One Bites The Dust ... Queen 0 0.556 Lighthouse Grouper 0 0.541 Midnight Coldplay 0 0.539 This Life Vampire Weekend 1 0.535 U (Man Like) Bon Iver 0 0.528 I Know Places Taylor Swift 27 0.525 Batuka Madonna 0 0.523 55 Year Old Daughter Giant Swan 0 user id: 3516 rating song artist artist listened 0.654 Funeral Fog Mayhem 0 0.633 Platz eins Lindemann 0 0.596 Losing My Religion R.E.M. 0 0.595 Docking The Pod Duster 0 0.581 All by Myself Céline Dion 0 0.574 Dust in the Wind Kansas 0 0.561 You Can't Always Get What Y... The Rolling Stones 0 0.561 People Are Strange The Doors 0 0.539 Start Me Up - Remastered The Rolling Stones 0 0.530 Savage Nomad Danny Brown 0
Conclusion¶
The result of our studies had shown the limitation of user-based collaborative filtering. Dispite the obvious advantages of a user-based approach: no need for specific domain knowledge of song
Future Work¶
The fact that modern deep learning technique had only shown marginal advantage over traditional matrix factorization had has prompted us to think of ways to imporve existing collaborative filtering techniques.
For the autoencoder recommender, one way to mitigate the network from becoming biased toward extremely sparse and heavily skewed data is to regularize the loss during training. For instance, we might multiply the existing loss with a frequency indicator such that we penalize less on frequently appeared, low-rating songs and more on songs that rarely occurs but heavily favored by the users.
Alternatively, we are eager to compare user-based collaborative filtering with content-based approaches even though the later has a more complex input features, for user similarity becomes a more elusive concept when the product or services to recommend becomes increasingly complex