We evaluated the following KNN-inspired algorithms from the Surprise recommender library:
The similarity metrics we used were:
The baseline metrics we used were:
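Since the specific similarity and baseline options are not reproduced here, the sketch below only illustrates how such settings are wired up in Surprise; the similarity name, k, and baseline method shown are placeholders, not necessarily the ones we used.

from surprise import Dataset, KNNBasic, KNNWithZScore, KNNBaseline
from surprise.model_selection import cross_validate

# Assumption: MovieLens 100k is the ratings dataset used throughout
data = Dataset.load_builtin('ml-100k')

# Placeholder options; the actual similarity/baseline settings may differ
sim_options = {'name': 'pearson_baseline', 'user_based': True}
bsl_options = {'method': 'als', 'n_epochs': 10}

for algo in (KNNBasic(k=40, sim_options=sim_options),
             KNNWithZScore(k=40, sim_options=sim_options),
             KNNBaseline(k=40, sim_options=sim_options, bsl_options=bsl_options)):
    cross_validate(algo, data, measures=['rmse', 'mae'], cv=5, verbose=True)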
plot_knn_stats()
plot_knn_comp()
algorithm | eval_rmse | eval_mae | eval_rsquared
---|---|---|---
KNNBasic | 0.983554 | 0.781742 | 0.221840
KNNWithZScore | 0.920839 | 0.716667 | 0.317914
KNNBaseline | 0.907554 | 0.712153 | 0.337452
plot_knn_times()
algorithm | test_time | train_time
---|---|---
KNNBasic | 110.802776 | 1.452880
KNNWithZScore | 122.908290 | 1.551636
KNNBaseline | 151.489925 | 1.457129
We evaluated the following matrix factorization algorithms from the Surprise recommender library:
The above algorithms were compared using k-fold cross-validation. We also used k-fold cross-validation to determine the best parameters (epochs, factors) for each algorithm, according to root mean squared error (RMSE).
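As a rough illustration of this tuning step, the sketch below uses Surprise's GridSearchCV for SVD; the parameter ranges are illustrative, not the exact grid we searched.

from surprise import SVD, Dataset
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin('ml-100k')  # assumed dataset

# Illustrative grid; the actual epoch/factor ranges searched may differ
param_grid = {'n_factors': [50, 100, 150], 'n_epochs': [10, 20, 30]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=5)
gs.fit(data)

print(gs.best_params['rmse'], gs.best_score['rmse'])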
plot_fac_stats()
plot_fac_comp()
algorithm | eval_rmse | eval_mae | eval_rsquared
---|---|---|---
svd | 0.917724 | 0.722339 | 0.322520
nmf | 0.912606 | 0.708599 | 0.330056
svdpp | 0.970618 | 0.750084 | 0.242176
plot_fac_time()
algorithm | test_time | train_time
---|---|---
svd | 17.087900 | 3.757889
nmf | 15.089663 | 34.455554
svdpp | 14.988622 | 4.341382
Co-occurrence matrix: an item-to-item relationship, represented by an $m \times m$ matrix $C$ whose $(i,j)^{th}$ entry is the number of times item $i$ is seen together with item $j$.
Jaccard index: intersection over union, which prevents over-representation of generally popular movies. Taking $A$ and $B$ to be the sets of users who have seen items $i$ and $j$:
$J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{C_{i,j}}{C_{i,i} + C_{j,j} - C_{i,j}} = S_{i,j}$
Affinity matrix: a matrix $A$ that captures the strength of the relationship between each user and each item, with entry $A_{i,j}$ for the $i^{th}$ user and the $j^{th}$ item.
Recommendation scores are calculated by multiplying the affinity matrix by the item similarity matrix, $R = AS$, where $R$ is an $n \times m$ matrix with a row for each user and a column for each movie. Higher scores imply stronger recommendations; however, they are not on the same scale as the original ratings.
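As a toy worked example of these definitions (made-up numbers, not the library's implementation):

import numpy as np

# Co-occurrence matrix C (m x m): counts of items seen together
C = np.array([[4., 2., 0.],
              [2., 5., 1.],
              [0., 1., 3.]])

# Jaccard similarity S_ij = C_ij / (C_ii + C_jj - C_ij)
diag = np.diag(C)
S = C / (diag[:, None] + diag[None, :] - C)

# Affinity matrix A (n_users x m_items): user-item interaction strengths
A = np.array([[5., 0., 3.],
              [0., 4., 0.]])

# Recommendation scores R = A S (n_users x m_items)
R = A @ S
print(R)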
import pickle

# Ground-truth ratings for the sample user (user 222)
filename = 'gt_user_SAR'
infile = open(filename, 'rb')
data = pickle.load(infile)
infile.close()
data
(index) | userID | itemID | rating | timestamp
---|---|---|---|---
9528 | 222 | 7 | 5.0 | 877563168
18002 | 222 | 127 | 5.0 | 881059039
11350 | 222 | 196 | 5.0 | 878183110
13380 | 222 | 204 | 5.0 | 878182370
93356 | 222 | 64 | 5.0 | 878183136
40380 | 222 | 328 | 5.0 | 877562772
127 | 222 | 750 | 5.0 | 883815120
3433 | 222 | 69 | 5.0 | 878182338
34954 | 222 | 28 | 5.0 | 878182370
24586 | 222 | 228 | 5.0 | 878181869
# SAR's top-10 predictions for the same user
filename = 'p_user_SAR'
infile = open(filename, 'rb')
data = pickle.load(infile)
infile.close()
data
(index) | userID | itemID | prediction
---|---|---|---
0 | 222 | 161 | 3.367728
1 | 222 | 82 | 3.362009
2 | 222 | 204 | 3.304764
3 | 222 | 69 | 3.249401
4 | 222 | 195 | 3.216889
5 | 222 | 423 | 3.200151
6 | 222 | 28 | 3.158334
7 | 222 | 228 | 3.127052
8 | 222 | 196 | 3.119067
9 | 222 | 550 | 3.099642
# Ground-truth ratings merged with the model's predictions for user 222
filename = 'm_user_SAR'
infile = open(filename, 'rb')
data = pickle.load(infile)
infile.close()
data
(index) | userID | itemID | rating | timestamp | prediction
---|---|---|---|---|---
0 | 222 | 7 | 5.0 | 877563168 | NaN
1 | 222 | 127 | 5.0 | 881059039 | NaN
2 | 222 | 196 | 5.0 | 878183110 | 3.119067
3 | 222 | 204 | 5.0 | 878182370 | 3.304764
4 | 222 | 64 | 5.0 | 878183136 | NaN
5 | 222 | 328 | 5.0 | 877562772 | NaN
6 | 222 | 750 | 5.0 | 883815120 | NaN
7 | 222 | 69 | 5.0 | 878182338 | 3.249401
8 | 222 | 28 | 5.0 | 878182370 | 3.158334
9 | 222 | 228 | 5.0 | 878181869 | 3.127052
For user 222, we can look at the ground-truth ratings and the recommendations made by the model. As we can see, the recommendation scores are on a different scale; however, we can still produce a top-K recommendation by taking the K highest scores, and then match against the ground truth to see how successful the model is at predicting the best movies. Five of the movies the user rated highly (196, 204, 69, 28, 228) appear in the model's top-10 predictions.
import pickle
filename = 'sar_eval'
infile = open(filename,'rb')
data = pickle.load(infile)
infile.close()
print("Model:\t",
"Top K:\t%d" % data[0],
"MAP:\t%f" % data[1],
"NDCG:\t%f" % data[2],
"Precision@K:\t%f" % data[3],
"Recall@K:\t%f" % data[4], sep='\n')
Model:
Top K:	10
MAP:	0.110591
NDCG:	0.382461
Precision@K:	0.330753
Recall@K:	0.176385
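SAR and these ranking metrics are available in Microsoft's recommenders library; assuming that is what produced the numbers above, a hedged sketch of the evaluation call is below (test_df and top_k_df are placeholder names for the held-out ratings and the model's top-K predictions, and the import path varies by version: older releases expose the same functions under reco_utils.evaluation.python_evaluation).

from recommenders.evaluation.python_evaluation import (
    map_at_k, ndcg_at_k, precision_at_k, recall_at_k)

TOP_K = 10
kwargs = dict(col_user='userID', col_item='itemID', col_prediction='prediction', k=TOP_K)

eval_map = map_at_k(test_df, top_k_df, **kwargs)              # MAP
eval_ndcg = ndcg_at_k(test_df, top_k_df, **kwargs)            # NDCG
eval_precision = precision_at_k(test_df, top_k_df, **kwargs)  # Precision@K
eval_recall = recall_at_k(test_df, top_k_df, **kwargs)        # Recall@K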
We make our recommendations via a supervised learning technique in which a rating for every movie is learned for every user. Since we are doing collaborative filtering, we learn only from users and the movies they have already watched to generate ratings for every movie! We can do this with a neural network, since we can feed users and movies in as separate inputs and combine them in the middle. Additionally, I (Mohan) used Colab and its GPU for faster training!
We mainly utilize the ratings dataset, which simply holds a rating for every user (userId) and movie (movieId) pair the user has watched previously. We use Keras to build our neural networks!
I split our dataset into train and test with a 90-10 split, used the train set to fit our models, and used the test set to evaluate them and compare the metrics with the other methods. Additionally, the user and movie columns need to be separated, since our neural network learns ratings from users and movies as separate inputs; both are put into arrays and fed into the network.
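A minimal sketch of this preparation step, assuming the standard MovieLens column names (userId, movieId, rating), a hypothetical 'ratings.csv' path, and IDs already encoded as contiguous integers:

import pandas as pd
from sklearn.model_selection import train_test_split

ratings = pd.read_csv('ratings.csv')  # hypothetical path to the ratings data

# 90-10 train/test split
train, test = train_test_split(ratings, test_size=0.1, random_state=42)

# The network takes users and movies as two separate inputs
X_train_array = [train['userId'].values, train['movieId'].values]
X_test_array = [test['userId'].values, test['movieId'].values]
y_train = train['rating'].values
y_test = test['rating'].values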
Conceptually, embedding layers are a mapping of categorical variables to vectors of continuous numbers. We make embeddings of movies and users to represent each movie and each user. The embeddings start out random; backpropagation then learns representations for both, capturing the essential qualities of the users and movies that contribute to the given ratings. This model is simple but powerful. Each input is mapped to the specific index of its corresponding embedding layer, and the two embeddings are merged with a dot product, which we can do because they are the same size. We use mean squared error as our loss function and the popular and very effective Adam optimizer (rather than SGD) to optimize the network weights. Additionally, we track root mean squared error, mean absolute error, and explained variance as metrics to evaluate the network later. We do not use a validation set here, since we evaluate on the held-out test set at the end instead. Explained variance is not a default metric in Keras, so a custom rsquared function was written; all three are passed as separate metrics. We also add regularization to the embeddings to make sure the model doesn't overfit!
from keras import backend as K

# Explained variance / R2, implemented with Keras backend ops
def rsquared(y_true, y_pred):
    SS_res = K.sum(K.square(y_true - y_pred))
    SS_tot = K.sum(K.square(y_true - K.mean(y_true)))
    return 1 - SS_res / (SS_tot + K.epsilon())
# (standalone Keras imports; use the tensorflow.keras equivalents if needed)
from keras.layers import Input, Embedding, Reshape, Dot
from keras.models import Model
from keras.optimizers import Adam
from keras.regularizers import l2

# Model 1: user and movie embeddings merged with a dot product
def Rec1(n_users, n_movies, n_factors):
    user = Input(shape=(1,))
    u = Embedding(n_users, n_factors, embeddings_initializer='he_normal',
                  embeddings_regularizer=l2(1e-6))(user)
    u = Reshape((n_factors,))(u)

    movie = Input(shape=(1,))
    m = Embedding(n_movies, n_factors, embeddings_initializer='he_normal',
                  embeddings_regularizer=l2(1e-6))(movie)
    m = Reshape((n_factors,))(m)

    x = Dot(axes=1)([u, m])
    model1 = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model1.compile(loss='mean_squared_error', optimizer=opt,
                   metrics=['mse', 'mae', rsquared])
    return model1
We fit using a batch size of 64. The batch size is basically how much data is used to update the weights each time the error is calculated. A higher batch size helps training time a lot, since the error is computed less often, which can be computationally expensive; however, because the error is calculated less often, the accuracy of the model may not be as high. We also use 5 epochs, which is the number of times the whole dataset is seen. Note that increasing the number of epochs would, in theory, keep driving the training loss down; however, we need to be careful about overfitting. I made sure 5 was a good number by running with a validation split and checking that the validation loss had not started to increase by epoch 5 for any of the models, and it had not!
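The epoch check described above might look something like this (an illustrative sketch, not the exact code used):

# Train a throwaway copy of the model with a validation split and watch val_loss;
# it should still be flat or decreasing at epoch 5.
check_model = Rec1(n_users, n_movies, n_factors)
history = check_model.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5,
                          validation_split=0.1, verbose=0)
print(history.history['val_loss'])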
import numpy as np

model1 = Rec1(n_users, n_movies, n_factors)
model1.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5, verbose=0)
evals1 = model1.evaluate(X_test_array, y_test, verbose=0)
evals1[1] = np.sqrt(evals1[1])  # convert the MSE metric to RMSE
listOfStr = ['Final Loss (MSE)', 'RMSE', 'MAE', 'R2']
tup = tuple(zip(listOfStr, evals1))
dictOfWords = {i[0]: i[1] for i in tup}
dictOfWords
{'Final Loss (MSE)': 0.8751297973632812, 'MAE': 0.7291919128417969, 'R2': 0.26184510326385496, 'RMSE': 0.9308358680487956}
This network is a modification of the original, with two new additions. One is an embedding bias term, which adds more flexibility to the embedding learning; for example, in linear regression it makes sense to learn y = mx + b, not just y = mx. Also, we feed the dot-product result into a sigmoid function, which adds a non-linearity to the output! I tried incorporating one or the other, but only when I combined both did I get the best evaluation metrics.
# Reusable embedding block: embed an integer id and flatten it to a vector of n_factors
class EmbeddingLayer:
    def __init__(self, n_items, n_factors):
        self.n_items = n_items
        self.n_factors = n_factors

    def __call__(self, x):
        x = Embedding(self.n_items, self.n_factors, embeddings_initializer='he_normal',
                      embeddings_regularizer=l2(1e-6))(x)
        x = Reshape((self.n_factors,))(x)
        return x
from keras.layers import Add, Activation, Lambda

# Model 2: dot product plus per-user and per-movie bias terms, passed through a
# sigmoid and rescaled to the rating range
def RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors)(user)
    ub = EmbeddingLayer(n_users, 1)(user)

    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors)(movie)
    mb = EmbeddingLayer(n_movies, 1)(movie)

    x = Dot(axes=1)([u, m])
    x = Add()([x, ub, mb])
    x = Activation('sigmoid')(x)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt,
                  metrics=['mse', 'mae', rsquared])
    return model
model2 = RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating)
model2.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5, verbose=0)
evals2 = model2.evaluate(X_test_array,y_test, verbose = 0)
evals2[1] = np.sqrt(evals2[1])
tup = tuple(zip(listOfStr, evals2))
{ i[0] : i[1] for i in tup }
{'Final Loss (MSE)': 0.8117879865646362, 'MAE': 0.7069411417007446, 'R2': 0.3144750452041626, 'RMSE': 0.8983010084414732}
This one is a bit different: the merging method is simply the concatenation of both embeddings, and we can then add dense layers to the network to learn from. This one is true deep learning! After experimenting with different numbers of layers and layer sizes, I found two dense layers of size 10 and 1 (the final one with a sigmoid activation) to work best.
from keras.layers import Dense, Concatenate

# Deep model: merge the embeddings (concatenate or add) and learn with dense layers
def RecNet(n_users, n_movies, n_factors_users, n_factors_movies,
           min_rating, max_rating, conc_or_dot):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors_users)(user)

    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors_movies)(movie)

    x = conc_or_dot([u, m])
    x = Dense(10, activation='relu', kernel_initializer='he_normal')(x)
    x = Dense(1, activation='sigmoid', kernel_initializer='he_normal')(x)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt,
                  metrics=['mse', 'mae', rsquared])
    return model
However, I found that the embedding sizes of the users and movies greatly affected model performance! To optimize this, I performed a manual grid search over embedding sizes of [50, 75, 100, 150, 200] for both users and movies. The grid search gave optimal embedding sizes of 100 for users and 50 for movies, winning on 3 out of the 4 objectives. Adding more layers decreased overall accuracy, probably because there were too many weights to learn and the model overfit. I think that with the bigger dataset, adding more layers would help with evaluation!
# Manual grid search over user/movie embedding sizes (5 x 5 = 25 fits)
l = []
for i in [50, 75, 100, 150, 200]:
    for j in [50, 75, 100, 150, 200]:
        model5 = RecNet(n_users, n_movies, i, j, min_rating, max_rating, Concatenate())
        model5.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5, verbose=0)
        eva = model5.evaluate(X_test_array, y_test, verbose=0)
        l.append(((i, j), eva))
print('Done')
model5 = RecNet(n_users, n_movies, 100,50, min_rating, max_rating, Concatenate())
model5.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5,verbose=0)
evals5 = model5.evaluate(X_test_array, y_test, verbose = 0)
evals5[1] = np.sqrt(evals5[1])
tup = tuple(zip(listOfStr, evals5))
{ i[0] : i[1] for i in tup }
{'Final Loss (MSE)': 0.8572985005378723, 'MAE': 0.7273193671226501, 'R2': 0.27431239557266235, 'RMSE': 0.9251387700815097}
This model is very similar to the previous one: instead of concatenating the embeddings, it adds them. Nevertheless, it has results similar to the concat version.
model6 = RecNet(n_users, n_movies, 50, 50, min_rating, max_rating, Add())
model6.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5, verbose=0)
evals6 = model6.evaluate(X_test_array, y_test, verbose = 0)
evals6[1] = np.sqrt(evals6[1])
tup = tuple(zip(listOfStr, evals6))
{ i[0] : i[1] for i in tup }
{'Final Loss (MSE)': 0.8694782653331756, 'MAE': 0.7308314328193665, 'R2': 0.2643206853866577, 'RMSE': 0.9318947968148698}
plot()
Recommendations for model 1: Embedding w/ Dot
rec1()
(index) | movieId | title | genres
---|---|---|---
277 | 318 | Shawshank Redemption, The (1994) | Crime|Drama
57 | 64 | Two if by Sea (1996) | Comedy|Romance
272 | 313 | Swan Princess, The (1994) | Animation|Children
234 | 272 | Madness of King George, The (1994) | Comedy|Drama
353 | 408 | 8 Seconds (1994) | Drama
Recommendations for model 2: Embedding w/ Dot, Bias & Activation
rec2()
(index) | movieId | title | genres
---|---|---|---
272 | 313 | Swan Princess, The (1994) | Animation|Children
146 | 174 | Jury Duty (1995) | Comedy
117 | 144 | Brothers McMullen, The (1995) | Comedy
144 | 172 | Johnny Mnemonic (1995) | Action|Sci-Fi|Thriller
183 | 215 | Before Sunrise (1995) | Drama|Romance
Recommendations for model 5: Concat Deep Net
rec3()
(index) | movieId | title | genres
---|---|---|---
277 | 318 | Shawshank Redemption, The (1994) | Crime|Drama
1231 | 1639 | Chasing Amy (1997) | Comedy|Drama|Romance
353 | 408 | 8 Seconds (1994) | Drama
161 | 190 | Safe (1995) | Thriller
To verify the results, we cross-validate across the models. K-fold cross-validation is used with k = 10. To reduce training time, we increased the batch size and used 4 epochs instead. The results we got above are confirmed! The results are plotted below.
plot2()
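A rough sketch of this check, using model 1 as an example (assumed, not our exact code; `ratings` is a placeholder name for the ratings dataframe and the larger batch size is illustrative):

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=42)
rmses = []
for train_idx, val_idx in kf.split(ratings):
    tr, va = ratings.iloc[train_idx], ratings.iloc[val_idx]
    model = Rec1(n_users, n_movies, n_factors)
    model.fit(x=[tr['userId'].values, tr['movieId'].values], y=tr['rating'].values,
              batch_size=256, epochs=4, verbose=0)
    loss, mse, mae, r2 = model.evaluate([va['userId'].values, va['movieId'].values],
                                        va['rating'].values, verbose=0)
    rmses.append(np.sqrt(mse))  # RMSE for this fold
print(np.mean(rmses))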
We evaluated each model using root mean squared error (RMSE), mean absolute error (MAE), and explained variance (R2):
The Keras version of the Bias & Activation neural network had the best values for RMSE and MAE, and the FastAI version had the best R2 value. We measured the execution times for both algorithms, but that is not a consistent way to compare them, since run time can vary from one computer to another; still, we can conclude that the Bias & Activation neural network methods gave the best performance in terms of accuracy and execution time. The unsupervised methods also worked pretty well! Behind the bias and activation network, the KNN baseline was the second-best method, with SVD and NMF trailing behind it.
In conclusion, we determined the best algorithm across all the different approaches using various evaluation metrics. We used both supervised and unsupervised methods within collaborative filtering. Furthermore, the algorithm we selected as the best would require a lot of maintenance, since each time a new movie is rated the model would need to be retrained. In the future, to solve this problem, we hope to explore content-based or hybrid recommendation systems. We also want to explore other datasets, such as music or advertising.