Movie Recommendation Project - Team 10 4641

Introduction (Motivation)

  • Streaming platforms like Netflix and YouTube host enormous catalogs of content and collect a lot of information about their users, creating an opportunity to serve each user content tailored to their interests.
  • In this project we survey several collaborative filtering methods for generating recommendations and evaluate which is most effective.

Method

  • Within recommendation systems, there are two main types: Content-Based and Collaborative Filtering. We use the latter.
  • Collaborative Filtering makes recommendations by utilizing what other users have watched.
    • We only need the ratings dataset, as it records everything each user has watched before.
    • The methods we look at include unsupervised techniques (Matrix Factorization, SVD, K-Nearest Neighbors) as well as supervised techniques using Neural Networks (built with Keras and FastAI).
    • We use common evaluation metrics including Root Mean Square Error, Mean Absolute Error, and Explained Variance. We also compare training and testing times.
    • We use Cross Validation and Grid Search to tune our techniques; a minimal sketch of this shared workflow follows this list.
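
As a quick illustration, here is a minimal sketch of this shared workflow, assuming the Surprise library and its built-in MovieLens 100k loader (both of which appear later in this report):

from surprise import Dataset, KNNBasic
from surprise.model_selection import cross_validate

# Load the built-in MovieLens 100k ratings and cross-validate a first model;
# cross_validate reports fit and test times alongside RMSE/MAE
data = Dataset.load_builtin('ml-100k')
results = cross_validate(KNNBasic(), data, measures=['RMSE', 'MAE'], cv=5, verbose=True)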

Dataset

  • MovieLens 100k (ratings dataset); a pandas loading sketch follows the field list below
    • User id: unique identifier of the user who made the rating (non-negative integer)
    • Item id: unique identifier of the movie the user rated (non-negative integer)
    • Rating: number from 1-5 representing the user's opinion of a given movie (non-negative integer)
    • Timestamp: Unix seconds representing the time the rating was made (non-negative integer)
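
For reference, a minimal loading sketch with pandas; the MovieLens 100k ratings live in a tab-separated u.data file with no header row (path assumed):

import pandas as pd

# u.data columns: user id, item id, rating, timestamp (tab-separated, no header)
ratings = pd.read_csv('ml-100k/u.data', sep='\t',
                      names=['user_id', 'item_id', 'rating', 'timestamp'])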

KNN Inspired Methods

We evaluated the following KNN-inspired algorithms from the Surprise recommender library:

  • KNNBasic
  • KNNWithZScore
  • KNNBaseline

The similarity metrics we used were:

  • cosine
  • pearson baseline

The baseline estimation methods we used were (configuration sketch after this list):

  • alternating least squares
  • stochastic gradient descent
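
A sketch of how these choices fit together in Surprise: sim_options selects the similarity metric and bsl_options how baselines are estimated (the specific parameter values here are illustrative):

from surprise import KNNBaseline

# Pearson-baseline similarity with baselines estimated via ALS
sim_options = {'name': 'pearson_baseline', 'user_based': True}
bsl_options = {'method': 'als', 'n_epochs': 10}
algo = KNNBaseline(k=40, sim_options=sim_options, bsl_options=bsl_options)
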
In [6]:
plot_knn_stats()
In [8]:
plot_knn_comp()
algorithm      eval_rmse  eval_mae  eval_rsquared
KNNBasic        0.983554  0.781742       0.221840
KNNWithZScore   0.920839  0.716667       0.317914
KNNBaseline     0.907554  0.712153       0.337452
In [10]:
plot_knn_times()
algorithm       test_time  train_time
KNNBasic       110.802776    1.452880
KNNWithZScore  122.908290    1.551636
KNNBaseline    151.489925    1.457129

Matrix Factorization Methods

We evaluated the following matrix factorization algorithms from the Surprise recommender library:

  • Singular Value Decomposition
  • Non-Negative Matrix Factorization
  • SVD++

The above algorithms were compared using k-fold cross validation. We also used k-fold cross validation to determine the best parameters (epochs, factors) for each algorithm, according to root mean squared error (RMSE).
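
A sketch of that parameter search using Surprise's GridSearchCV (the exact grids shown here are illustrative):

from surprise import SVD
from surprise.model_selection import GridSearchCV

# Search epochs and factor counts, scoring each combination by cross-validated RMSE
param_grid = {'n_epochs': [10, 20, 30], 'n_factors': [50, 100, 150]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=5)
gs.fit(data)

print(gs.best_score['rmse'], gs.best_params['rmse'])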

In [13]:
plot_fac_stats()
In [15]:
plot_fac_comp()
algorithm  eval_rmse  eval_mae  eval_rsquared
svd         0.917724  0.722339       0.322520
nmf         0.912606  0.708599       0.330056
svdpp       0.970618  0.750084       0.242176
In [17]:
plot_fac_time()
algorithm  test_time  train_time
svd        17.087900    3.757889
nmf        15.089663   34.455554
svdpp      14.988622    4.341382

Simple Algorithm for Recommendation (SAR)

Overview

Utilizes two matrices:

Co-occurrence Matrix (C) $\xrightarrow{\text{Rescaling}}$ Item Similarity Matrix (S)

Affinity Matrix (A)

Solves for a Recommendation Matrix

$R = AS$

Co-occurrence Matrix

Captures item-to-item relationships.

Represented by an $m \times m$ matrix, $C$, where the $(i,j)^{th}$ entry is the number of times item $i$ is seen together with item $j$.

Rescaling

Jaccard index: Intersection over the Union

Prevents over-representation of generally popular movies

$J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{C_{i,j}}{C_{i,i} + C_{j,j} - C_{i,j}} = S_{i,j}$

Affinity Matrix

A matrix, $A$, whose entry $A_{i,j}$ captures the strength of the relationship between the $i^{th}$ user and the $j^{th}$ item.

Recommendation Matrix

Calculated by multiplying the affinity matrix by the item similarity matrix, $R = AS$, where $R$ is an $n \times m$ matrix with one row per user and one column per movie.

Higher scores indicate stronger recommendations; note, however, that they are not on the same scale as the original ratings. A toy sketch of the whole pipeline follows.
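
A toy NumPy sketch of the pipeline, assuming binary affinities for simplicity (the real SAR implementation can also weight affinities by rating and recency):

import numpy as np

# 3 users x 4 items; A[u, i] = 1 if user u interacted with item i
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 1]], dtype=float)

C = A.T @ A                            # co-occurrence matrix (4 x 4)
d = np.diag(C)
S = C / (d[:, None] + d[None, :] - C)  # Jaccard rescaling -> item similarity
R = A @ S                              # recommendation scores (3 users x 4 items)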

In [3]:
filename = 'gt_user_SAR'

# Ground-truth ratings for a single user, saved earlier as a pickle
with open(filename, 'rb') as infile:
    data = pickle.load(infile)

data
Out[3]:
       userID  itemID  rating  timestamp
9528      222       7     5.0  877563168
18002     222     127     5.0  881059039
11350     222     196     5.0  878183110
13380     222     204     5.0  878182370
93356     222      64     5.0  878183136
40380     222     328     5.0  877562772
127       222     750     5.0  883815120
3433      222      69     5.0  878182338
34954     222      28     5.0  878182370
24586     222     228     5.0  878181869
In [4]:
filename = 'p_user_SAR'

# The model's predictions for the same user
with open(filename, 'rb') as infile:
    data = pickle.load(infile)

data
Out[4]:
   userID  itemID  prediction
0     222     161    3.367728
1     222      82    3.362009
2     222     204    3.304764
3     222      69    3.249401
4     222     195    3.216889
5     222     423    3.200151
6     222      28    3.158334
7     222     228    3.127052
8     222     196    3.119067
9     222     550    3.099642
In [5]:
filename = 'm_user_SAR'

# Ground truth merged with predictions for side-by-side comparison
with open(filename, 'rb') as infile:
    data = pickle.load(infile)

data
Out[5]:
   userID  itemID  rating  timestamp  prediction
0     222       7     5.0  877563168         NaN
1     222     127     5.0  881059039         NaN
2     222     196     5.0  878183110    3.119067
3     222     204     5.0  878182370    3.304764
4     222      64     5.0  878183136         NaN
5     222     328     5.0  877562772         NaN
6     222     750     5.0  883815120         NaN
7     222      69     5.0  878182338    3.249401
8     222      28     5.0  878182370    3.158334
9     222     228     5.0  878181869    3.127052

Conclusion

For user 222, we can compare the ground-truth ratings with the recommendations made by the model. The recommendation scores are on a different scale, but we can still produce a top-K recommendation list by taking the K highest-scoring movies and matching them against the ground truth to see how well the model surfaces the user's favorite movies. Of the ten movies this user rated 5.0, five (196, 204, 69, 28, 228) appear in the top 10 predicted results.

In [2]:
import pickle

filename = 'sar_eval'

# Ranking-metric results saved from the SAR evaluation run
with open(filename, 'rb') as infile:
    data = pickle.load(infile)

print("Model:\t",
      "Top K:\t%d" % data[0],
      "MAP:\t%f" % data[1],
      "NDCG:\t%f" % data[2],
      "Precision@K:\t%f" % data[3],
      "Recall@K:\t%f" % data[4], sep='\n')
Model:	
Top K:	10
MAP:	0.110591
NDCG:	0.382461
Precision@K:	0.330753
Recall@K:	0.176385
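
For reference, a sketch of how such ranking metrics can be computed with the Microsoft Recommenders evaluation utilities. The module path is an assumption (it was reco_utils in older releases), and test / top_k stand for the held-out ratings and the model's top-k predictions:

from recommenders.evaluation.python_evaluation import (
    map_at_k, ndcg_at_k, precision_at_k, recall_at_k)

k = 10
kwargs = dict(col_user='userID', col_item='itemID',
              col_rating='rating', col_prediction='prediction', k=k)
eval_map = map_at_k(test, top_k, **kwargs)        # mean average precision
eval_ndcg = ndcg_at_k(test, top_k, **kwargs)      # normalized discounted cumulative gain
eval_precision = precision_at_k(test, top_k, **kwargs)
eval_recall = recall_at_k(test, top_k, **kwargs)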

Quick training and prediction time

Training = 0.847 seconds

Prediction = 0.301 seconds

Movie Recommendations using Neural Networks

We generate recommendations via a supervised learning technique in which a rating for every movie is learned for every user. Since we are doing collaborative filtering, we learn only from users and the movies they have already watched to generate ratings for every movie! A neural network suits this well because we can feed users and movies in as separate inputs and combine them in the middle. Additionally, I (Mohan) used Colab and its GPU for faster training time!

We mainly utilize the ratings dataset, which records, for every user (userId), the movies (movieId) they have watched and the ratings they gave. We utilize Keras for building our neural networks!

I split the dataset into train and test sets with a 90-10 split, using the train set to build each model and the test set to evaluate it, so we can compare evaluation metrics with the other methods. We also need to separate the user and movie columns, since the network learns ratings from users and movies as distinct inputs: we put each into its own array and feed both into the network, as sketched below.
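
A sketch of that preprocessing, assuming pandas column names userId/movieId/rating. The ids are also re-encoded to contiguous 0-based indices, which the embedding layers below require:

from sklearn.model_selection import train_test_split

# Re-encode ids as contiguous 0-based indices for the embedding layers
ratings['user'] = ratings['userId'].astype('category').cat.codes
ratings['movie'] = ratings['movieId'].astype('category').cat.codes
n_users, n_movies = ratings['user'].nunique(), ratings['movie'].nunique()

train, test = train_test_split(ratings, test_size=0.1, random_state=42)

# Users and movies are fed to the network as two separate input arrays
X_train_array = [train['user'].values, train['movie'].values]
X_test_array = [test['user'].values, test['movie'].values]
y_train, y_test = train['rating'].values, test['rating'].values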

Model 1

The general format of these networks is two embedding layers, one for users and one for movies, which are then merged in some way to produce a predicted rating on the 1-5 scale.

Conceptually, embedding layers map categorical variables to vectors of continuous numbers. We create embeddings to represent each movie and each user. Initially random, they are learned through backpropagation, capturing the essential qualities of the users/movies that drive the observed ratings. This model is simple but powerful: each input is mapped to the corresponding index in its embedding layer, and the two embeddings are merged with a dot product (possible since they are the same size), producing a single predicted rating.

We use mean squared error as our loss function and the popular, very effective Adam optimizer (rather than SGD) to optimize the network weights. Additionally, we track Root Mean Square Error, Mean Absolute Error, and explained variance as metrics for later evaluation; explained variance is not a built-in Keras metric, so an rsquared function was written by hand. All three are passed as separate metrics. We are not using a validation set here, since we evaluate on the held-out test set at the end. Finally, I add regularization to the embeddings to make sure the model doesn't overfit!

In [0]:
# Imports assumed from earlier in the notebook (standalone Keras API)
from keras import backend as K
from keras.layers import Input, Embedding, Reshape, Dot
from keras.models import Model
from keras.optimizers import Adam
from keras.regularizers import l2

def rsquared(y_true, y_pred):
    # Explained variance (R^2): 1 - SS_res / SS_tot
    SS_res = K.sum(K.square(y_true - y_pred))
    SS_tot = K.sum(K.square(y_true - K.mean(y_true)))
    return 1 - SS_res / (SS_tot + K.epsilon())

def Rec1(n_users, n_movies, n_factors):
    # User embedding: maps each user id to an n_factors-dim vector
    user = Input(shape=(1,))
    u = Embedding(n_users, n_factors, embeddings_initializer='he_normal',
                  embeddings_regularizer=l2(1e-6))(user)
    u = Reshape((n_factors,))(u)

    # Movie embedding: same size, so the two can be merged with a dot product
    movie = Input(shape=(1,))
    m = Embedding(n_movies, n_factors, embeddings_initializer='he_normal',
                  embeddings_regularizer=l2(1e-6))(movie)
    m = Reshape((n_factors,))(m)

    # Merge: dot product of user and movie embeddings -> predicted rating
    x = Dot(axes=1)([u, m])

    model1 = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model1.compile(loss='mean_squared_error', optimizer=opt,
                   metrics=['mse', 'mae', rsquared])
    return model1

We fit using a batch size of 64. The batch size is how many samples are used for each weight update. A higher batch size helps training time a lot, since the error (and gradient) is computed less often, which is computationally expensive; however, because weights are updated less often, model accuracy can suffer. We also use 5 epochs, which is the number of times the whole dataset is passed through. Increasing the number of epochs would keep lowering the training loss, but we need to be careful about overfitting. I made sure 5 was a good number by running with a validation split and checking that the validation loss did not start increasing by epoch 5 for any of the models (sketch below).
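
A sketch of that epoch sanity check, as a throwaway run separate from the final training below:

# Fit a throwaway copy with a validation split and watch val_loss by epoch;
# if val_loss is still flat or falling at epoch 5, 5 epochs is safe
check = Rec1(n_users, n_movies, n_factors)
history = check.fit(x=X_train_array, y=y_train, batch_size=64, epochs=10,
                    validation_split=0.1, verbose=0)
print(history.history['val_loss'])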

In [105]:
model1 = Rec1(n_users, n_movies, n_factors)
model1.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5, verbose=0)
evals1 = model1.evaluate(X_test_array, y_test, verbose=0)
evals1[1] = np.sqrt(evals1[1])  # convert MSE metric to RMSE
listOfStr = ['Final Loss (MSE)', 'RMSE', 'MAE', 'R2']
dict(zip(listOfStr, evals1))
Out[105]:
{'Final Loss (MSE)': 0.8751297973632812,
 'MAE': 0.7291919128417969,
 'R2': 0.26184510326385496,
 'RMSE': 0.9308358680487956}

Neural Network 2

This network is a modification of the original, with two new additions. One is an embedding bias term, added to give the embedding learning more flexibility: by analogy with linear regression, it makes sense to learn y = mx + b rather than y = mx. The other is feeding the dot product result through a sigmoid function, which introduces a non-linearity into the output! I tried incorporating one or the other, but only when I combined both did I get the best evaluation metrics.

In [0]:
# Additional layers assumed imported: from keras.layers import Add, Activation, Lambda

class EmbeddingLayer:
    """Reusable embedding + reshape block."""
    def __init__(self, n_items, n_factors):
        self.n_items = n_items
        self.n_factors = n_factors

    def __call__(self, x):
        x = Embedding(self.n_items, self.n_factors, embeddings_initializer='he_normal',
                      embeddings_regularizer=l2(1e-6))(x)
        x = Reshape((self.n_factors,))(x)
        return x

def RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors)(user)
    ub = EmbeddingLayer(n_users, 1)(user)    # per-user bias term

    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors)(movie)
    mb = EmbeddingLayer(n_movies, 1)(movie)  # per-movie bias term

    x = Dot(axes=1)([u, m])
    x = Add()([x, ub, mb])
    x = Activation('sigmoid')(x)
    # Scale the sigmoid output (0, 1) to the rating range
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt,
                  metrics=['mse', 'mae', rsquared])
    return model
In [106]:
model2 = RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating)
model2.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5, verbose=0)
evals2 = model2.evaluate(X_test_array, y_test, verbose=0)
evals2[1] = np.sqrt(evals2[1])  # convert MSE metric to RMSE
dict(zip(listOfStr, evals2))
Out[106]:
{'Final Loss (MSE)': 0.8117879865646362,
 'MAE': 0.7069411417007446,
 'R2': 0.3144750452041626,
 'RMSE': 0.8983010084414732}

Neural Network 3

This one is a bit different: the merging method is simply the concatenation of the two embeddings, after which we can add dense layers for the network to learn from. This is the deep learning variant! After experimenting with different numbers and sizes of layers, I found two dense layers of sizes 10 (ReLU) and 1 (with sigmoid activation) to be optimal.

In [0]:
# Additional layers assumed imported: from keras.layers import Dense, Concatenate

def RecNet(n_users, n_movies, n_factors_users, n_factors_movies,
           min_rating, max_rating, conc_or_dot):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors_users)(user)

    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors_movies)(movie)

    # The merge layer is injected (Concatenate() or Add()); element-wise
    # merges like Add require matching embedding sizes
    x = conc_or_dot([u, m])

    # Small dense head on top of the merged embeddings
    x = Dense(10, activation='relu', kernel_initializer='he_normal')(x)
    x = Dense(1, activation='sigmoid', kernel_initializer='he_normal')(x)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt,
                  metrics=['mse', 'mae', rsquared])
    return model

Grid Search

However, I found that the embedding sizes for users and movies greatly affected model performance! To optimize this, I performed a manual grid search over embedding sizes of [50, 75, 100, 150, 200] for both users and movies. The search gave optimal embedding sizes of 100 for users and 50 for movies, winning on 3 of the 4 objectives. Adding more layers decreased overall accuracy, probably because there were too many weights to learn, leading to overfitting. I think with the bigger dataset, adding more layers would help with evaluation!

In [0]:
l = []
for i in [50, 75, 100, 150, 200]:
    for j in [50, 75, 100, 150, 200]:
        model5 = RecNet(n_users, n_movies, i, j, min_rating, max_rating, Concatenate())
        model5.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5, verbose=0)
        eva = model5.evaluate(X_test_array, y_test, verbose=0)
        l.append(((i, j), eva))
print('Done')
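
The loop above only collects results, so a small follow-up step (reconstructed here, since the selection cell was not preserved) picks the winner:

# evaluate() returns [loss, mse, mae, rsquared]; pick the sizes with lowest test MSE
best_sizes, best_eval = min(l, key=lambda entry: entry[1][1])
print('best (user, movie) embedding sizes:', best_sizes)
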
In [101]:
model5 = RecNet(n_users, n_movies, 100, 50, min_rating, max_rating, Concatenate())
model5.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5, verbose=0)
evals5 = model5.evaluate(X_test_array, y_test, verbose=0)
evals5[1] = np.sqrt(evals5[1])  # convert MSE metric to RMSE
dict(zip(listOfStr, evals5))
Out[101]:
{'Final Loss (MSE)': 0.8572985005378723,
 'MAE': 0.7273193671226501,
 'R2': 0.27431239557266235,
 'RMSE': 0.9251387700815097}

This model is very similar to the previous one: instead of concatenating the embeddings, it adds them. Nevertheless, it produces results similar to the concat version.

In [102]:
model6 = RecNet(n_users, n_movies, 50, 50, min_rating, max_rating, Add())
model6.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5, verbose=0)
evals6 = model6.evaluate(X_test_array, y_test, verbose=0)
evals6[1] = np.sqrt(evals6[1])  # convert MSE metric to RMSE
dict(zip(listOfStr, evals6))
Out[102]:
{'Final Loss (MSE)': 0.8694782653331756,
 'MAE': 0.7308314328193665,
 'R2': 0.2643206853866577,
 'RMSE': 0.9318947968148698}

After all models were run, we graph the evaluation metrics for each one. Across all models, the non-deep-learning method using embedding vectors with bias terms and a sigmoid activation gave the best results. With a larger dataset, the deep learning option might give better results.

In [108]:
plot()

Now we give example recommendations from each model, using user 101. We find the movies with the highest predicted ratings among those the user has not yet watched: the neural network gives every user a predicted rating for every movie in the dataset, which is the beauty of this approach! A sketch of the procedure appears below.
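
A sketch of that procedure for one model; the rec1()/rec2()/rec3() helpers below wrap something along these lines (plus a title lookup against a movies table), continuing with the encoded ratings frame from earlier:

import numpy as np

user = 101
seen = set(ratings.loc[ratings['user'] == user, 'movie'])
candidates = np.array([m for m in range(n_movies) if m not in seen])

# Predict a rating for every unwatched movie and keep the 5 best
preds = model1.predict([np.full(len(candidates), user), candidates]).ravel()
top5 = candidates[np.argsort(preds)[::-1][:5]]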

Recommendations for model 1: Embedding w/ Dot

In [67]:
rec1()
Out[67]:
     movieId  title                               genres
277      318  Shawshank Redemption, The (1994)    Crime|Drama
57        64  Two if by Sea (1996)                Comedy|Romance
272      313  Swan Princess, The (1994)           Animation|Children
234      272  Madness of King George, The (1994)  Comedy|Drama
353      408  8 Seconds (1994)                    Drama

Recommendations for model 2: Embedding w/ Bias & Sigmoid

In [70]:
rec2()
Out[70]:
     movieId  title                          genres
272      313  Swan Princess, The (1994)      Animation|Children
146      174  Jury Duty (1995)               Comedy
117      144  Brothers McMullen, The (1995)  Comedy
144      172  Johnny Mnemonic (1995)         Action|Sci-Fi|Thriller
183      215  Before Sunrise (1995)          Drama|Romance

Recommendations for model 5: Concat Deep Net

In [73]:
rec3()
Out[73]:
      movieId  title                             genres
277       318  Shawshank Redemption, The (1994)  Crime|Drama
1231     1639  Chasing Amy (1997)                Comedy|Drama|Romance
353       408  8 Seconds (1994)                  Drama
161       190  Safe (1995)                       Thriller

To verify the results, we run cross validation across the models, using k-fold cross validation with k = 10. To reduce training time, we increased the batch size and used 4 epochs instead. The cross-validation results confirm what we found above, and are plotted below.
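
A sketch of that verification loop, reconstructed under the same assumptions as above (the exact batch size is illustrative):

import numpy as np
from sklearn.model_selection import KFold

users = ratings['user'].values
movies = ratings['movie'].values
y = ratings['rating'].values

rmses = []
for tr, va in KFold(n_splits=10, shuffle=True, random_state=42).split(y):
    # Fresh model per fold; larger batch and fewer epochs to keep CV fast
    m = RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating)
    m.fit([users[tr], movies[tr]], y[tr], batch_size=256, epochs=4, verbose=0)
    loss, mse, mae, r2 = m.evaluate([users[va], movies[va]], y[va], verbose=0)
    rmses.append(np.sqrt(mse))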

In [120]:
plot2()
In [0]:
 

Neural collaborative filtering using FastAI

Method to find optimal value for learning rate

Cross Validation in order to optimize the number of epochs

Training process

Top 5 predictions for a user
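
The original cells for this section did not survive the export, so here is a rough sketch under stated assumptions: the fastai v2 collaborative-filtering API and the same ratings DataFrame as above:

from fastai.collab import CollabDataLoaders, collab_learner

dls = CollabDataLoaders.from_df(ratings[['userId', 'movieId', 'rating']],
                                user_name='userId', item_name='movieId',
                                rating_name='rating', bs=64)
learn = collab_learner(dls, n_factors=50, y_range=(0.5, 5.5))

learn.lr_find()               # loss vs. learning rate curve to pick a good value
learn.fit_one_cycle(5, 5e-3)  # one-cycle training for 5 epochs

# Top 5 predictions for user 101: score candidate pairs, keep the highest
candidates = ratings[['userId', 'movieId']].drop_duplicates('movieId').assign(userId=101)
preds, _ = learn.get_preds(dl=dls.test_dl(candidates))
top5 = candidates.assign(pred=preds.numpy().ravel()).nlargest(5, 'pred')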

Some metrics

Results

We evaluated each model using Root Mean Squared Error, Mean Absolute Error, and Explained Variance.

The Keras version of the Bias & Activation neural network had the best RMSE and MAE values, while the FastAI version had the best R-squared value. We measured execution times for both, although this is not a fully consistent basis for comparison since run times vary from one computer to another; even so, we can conclude that the Bias & Activation neural network methods gave the best performance in terms of both accuracy and execution time. The unsupervised methods also worked pretty well! Behind the bias-and-activation network, KNNBaseline was the second-best method, with SVD and NMF trailing behind it.

Conclusion and Future Steps

In conclusion, we compared all the different algorithms using various evaluation metrics, covering both supervised and unsupervised methods within collaborative filtering, and determined the best one. Furthermore, we observed that the algorithm we selected as the best would require a lot of maintenance: each time a new movie is rated, the model would need to be retrained. In the future, to address this problem, we hope to explore content-based or hybrid recommendation systems. We also want to explore other datasets, such as music or advertising.