Movie Recommendation Project - Team 10 4641

Introduction (Motivation)

  • Streaming platforms like Netflix and YouTube host enormous catalogs of content and collect a lot of information about their users, creating an opportunity to serve each user content tailored to their interests.
  • In this project we survey several collaborative filtering methods for generating recommendations and evaluate which is most effective.

Method

  • Within recommendation systems, there are two main types: Content-Based and Collaborative Filtering. We use the latter.
  • Collaborative Filtering makes recommendations by utilizing what other users have watched.
    • We only need the ratings dataset, as it records everything each user has watched before.
    • The methods we look at include unsupervised techniques (Matrix Factorization, SVD, K-Nearest Neighbors) as well as supervised techniques using Neural Networks (built with Keras and FastAI).
    • We use common evaluation metrics including Root Mean Square Error, Mean Absolute Error, and Explained Variance. We also compare training and testing times.
    • We use Cross Validation and Grid Search to tune our techniques; a minimal sketch of this shared workflow follows this list.
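
As a quick illustration, here is a minimal sketch of this shared workflow, assuming the Surprise library and its built-in MovieLens 100k loader (both of which appear later in this report):

from surprise import Dataset, KNNBasic
from surprise.model_selection import cross_validate

# Load the built-in MovieLens 100k ratings and cross-validate a first model;
# cross_validate reports fit and test times alongside RMSE/MAE
data = Dataset.load_builtin('ml-100k')
results = cross_validate(KNNBasic(), data, measures=['RMSE', 'MAE'], cv=5, verbose=True)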

Dataset

  • MovieLens 100k (ratings dataset); a pandas loading sketch follows the field list below
    • User id: unique identifier of the user who made the rating (non-negative integer)
    • Item id: unique identifier of the movie the user rated (non-negative integer)
    • Rating: number from 1-5 representing the user's opinion of a given movie (non-negative integer)
    • Timestamp: Unix seconds representing the time the rating was made (non-negative integer)
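
For reference, a minimal loading sketch with pandas; the MovieLens 100k ratings live in a tab-separated u.data file with no header row (path assumed):

import pandas as pd

# u.data columns: user id, item id, rating, timestamp (tab-separated, no header)
ratings = pd.read_csv('ml-100k/u.data', sep='\t',
                      names=['user_id', 'item_id', 'rating', 'timestamp'])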

KNN Inspired Methods

We evaluated the following KNN-inspired algorithms from the Surprise recommender library:

  • KNNBasic
  • KNNWithZScore
  • KNNBaseline

The similarity metrics we used were:

  • cosine
  • pearson baseline

The baseline estimation methods we used were (configuration sketch after this list):

  • alternating least squares
  • stochastic gradient descent
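
A sketch of how these choices fit together in Surprise: sim_options selects the similarity metric and bsl_options how baselines are estimated (the specific parameter values here are illustrative):

from surprise import KNNBaseline

# Pearson-baseline similarity with baselines estimated via ALS
sim_options = {'name': 'pearson_baseline', 'user_based': True}
bsl_options = {'method': 'als', 'n_epochs': 10}
algo = KNNBaseline(k=40, sim_options=sim_options, bsl_options=bsl_options)
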
In [6]:
plot_knn_stats()
In [8]:
plot_knn_comp()
algorithm      eval_rmse  eval_mae  eval_rsquared
KNNBasic        0.983554  0.781742       0.221840
KNNWithZScore   0.920839  0.716667       0.317914
KNNBaseline     0.907554  0.712153       0.337452
In [10]:
plot_knn_times()
algorithm       test_time  train_time
KNNBasic       110.802776    1.452880
KNNWithZScore  122.908290    1.551636
KNNBaseline    151.489925    1.457129

Matrix Factorization Methods

We evaluated the following matrix factorization algorithms from the Surprise recommender library:

  • Singular Value Decomposition
  • Non-Negative Matrix Factorization
  • SVD++

The above algorithms were compared using k-fold cross validation. We also used k-fold cross validation to determine the best parameters (epochs, factors) for each algorithm, according to root mean squared error (RMSE).
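
A sketch of that parameter search using Surprise's GridSearchCV (the exact grids shown here are illustrative):

from surprise import SVD
from surprise.model_selection import GridSearchCV

# Search epochs and factor counts, scoring each combination by cross-validated RMSE
param_grid = {'n_epochs': [10, 20, 30], 'n_factors': [50, 100, 150]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=5)
gs.fit(data)

print(gs.best_score['rmse'], gs.best_params['rmse'])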

In [13]:
plot_fac_stats()
In [15]:
plot_fac_comp()
algorithm  eval_rmse  eval_mae  eval_rsquared
svd         0.917724  0.722339       0.322520
nmf         0.912606  0.708599       0.330056
svdpp       0.970618  0.750084       0.242176
In [17]:
plot_fac_time()
algorithm  test_time  train_time
svd        17.087900    3.757889
nmf        15.089663   34.455554
svdpp      14.988622    4.341382

Simple Algorithm for Recommendation (SAR)

Overview

Utilizes two matrices:

Co-occurrence Matrix (C) $\xrightarrow{\text{Rescaling}}$ Item Similarity Matrix (S)

Affinity Matrix (A)

Solves for a Recommendation Matrix

$R = AS$

Co-occurrence Matrix

Captures item-to-item relationships.

Represented by an $m \times m$ matrix, $C$, where the $(i,j)^{th}$ entry is the number of times item $i$ is seen together with item $j$.

Rescaling

Jaccard index: Intersection over the Union

Prevents over-representation of generally popular movies

$J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{C_{i,j}}{C_{i,i} + C_{j,j} - C_{i,j}} = S_{i,j}$

Affinity Matrix

A matrix, $A$, whose entry $A_{i,j}$ captures the strength of the relationship between the $i^{th}$ user and the $j^{th}$ item.

Recommendation Matrix

Calculated by multiplying the affinity matrix by the item similarity matrix, $R = AS$, where $R$ is an $n \times m$ matrix with one row per user and one column per movie.

Higher scores indicate stronger recommendations; note, however, that they are not on the same scale as the original ratings. A toy sketch of the whole pipeline follows.
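
A toy NumPy sketch of the pipeline, assuming binary affinities for simplicity (the real SAR implementation can also weight affinities by rating and recency):

import numpy as np

# 3 users x 4 items; A[u, i] = 1 if user u interacted with item i
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 1]], dtype=float)

C = A.T @ A                            # co-occurrence matrix (4 x 4)
d = np.diag(C)
S = C / (d[:, None] + d[None, :] - C)  # Jaccard rescaling -> item similarity
R = A @ S                              # recommendation scores (3 users x 4 items)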

In [3]:
filename = 'gt_user_SAR'

# Ground-truth ratings for a single user, saved earlier as a pickle
with open(filename, 'rb') as infile:
    data = pickle.load(infile)

data
Out[3]:
       userID  itemID  rating  timestamp
9528      222       7     5.0  877563168
18002     222     127     5.0  881059039
11350     222     196     5.0  878183110
13380     222     204     5.0  878182370
93356     222      64     5.0  878183136
40380     222     328     5.0  877562772
127       222     750     5.0  883815120
3433      222      69     5.0  878182338
34954     222      28     5.0  878182370
24586     222     228     5.0  878181869
In [4]:
filename = 'p_user_SAR'

# The model's predictions for the same user
with open(filename, 'rb') as infile:
    data = pickle.load(infile)

data
Out[4]:
   userID  itemID  prediction
0     222     161    3.367728
1     222      82    3.362009
2     222     204    3.304764
3     222      69    3.249401
4     222     195    3.216889
5     222     423    3.200151
6     222      28    3.158334
7     222     228    3.127052
8     222     196    3.119067
9     222     550    3.099642
In [5]:
filename = 'm_user_SAR'

# Ground truth merged with predictions for side-by-side comparison
with open(filename, 'rb') as infile:
    data = pickle.load(infile)

data
Out[5]:
   userID  itemID  rating  timestamp  prediction
0     222       7     5.0  877563168         NaN
1     222     127     5.0  881059039         NaN
2     222     196     5.0  878183110    3.119067
3     222     204     5.0  878182370    3.304764
4     222      64     5.0  878183136         NaN
5     222     328     5.0  877562772         NaN
6     222     750     5.0  883815120         NaN
7     222      69     5.0  878182338    3.249401
8     222      28     5.0  878182370    3.158334
9     222     228     5.0  878181869    3.127052

Conclusion

For user 222, we can compare the ground-truth ratings with the recommendations made by the model. The recommendation scores are on a different scale, but we can still produce a top-K recommendation list by taking the K highest-scoring movies and matching them against the ground truth to see how well the model surfaces the user's favorite movies. Of the ten movies this user rated 5.0, five (196, 204, 69, 28, 228) appear in the top 10 predicted results.

In [2]:
import pickle

filename = 'sar_eval'

# Ranking-metric results saved from the SAR evaluation run
with open(filename, 'rb') as infile:
    data = pickle.load(infile)

print("Model:\t",
      "Top K:\t%d" % data[0],
      "MAP:\t%f" % data[1],
      "NDCG:\t%f" % data[2],
      "Precision@K:\t%f" % data[3],
      "Recall@K:\t%f" % data[4], sep='\n')
Model:	
Top K:	10
MAP:	0.110591
NDCG:	0.382461
Precision@K:	0.330753
Recall@K:	0.176385
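
For reference, a sketch of how such ranking metrics can be computed with the Microsoft Recommenders evaluation utilities. The module path is an assumption (it was reco_utils in older releases), and test / top_k stand for the held-out ratings and the model's top-k predictions:

from recommenders.evaluation.python_evaluation import (
    map_at_k, ndcg_at_k, precision_at_k, recall_at_k)

k = 10
kwargs = dict(col_user='userID', col_item='itemID',
              col_rating='rating', col_prediction='prediction', k=k)
eval_map = map_at_k(test, top_k, **kwargs)        # mean average precision
eval_ndcg = ndcg_at_k(test, top_k, **kwargs)      # normalized discounted cumulative gain
eval_precision = precision_at_k(test, top_k, **kwargs)
eval_recall = recall_at_k(test, top_k, **kwargs)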

Quick training and prediction time

Training = 0.847 seconds

Prediction = 0.301 seconds

Movie Recommendations using Neural Networks

We generate recommendations via a supervised learning technique in which a rating for every movie is learned for every user. Since we are doing collaborative filtering, we learn only from users and the movies they have already watched to generate ratings for every movie! A neural network suits this well because we can feed users and movies in as separate inputs and combine them in the middle. Additionally, I (Mohan) used Colab and its GPU for faster training time!

We mainly utilize the ratings dataset, which records, for every user (userId), the movies (movieId) they have watched and the ratings they gave. We utilize Keras for building our neural networks!

I split the dataset into train and test sets with a 90-10 split, using the train set to build each model and the test set to evaluate it, so we can compare evaluation metrics with the other methods. We also need to separate the user and movie columns, since the network learns ratings from users and movies as distinct inputs: we put each into its own array and feed both into the network, as sketched below.
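
A sketch of that preprocessing, assuming pandas column names userId/movieId/rating. The ids are also re-encoded to contiguous 0-based indices, which the embedding layers below require:

from sklearn.model_selection import train_test_split

# Re-encode ids as contiguous 0-based indices for the embedding layers
ratings['user'] = ratings['userId'].astype('category').cat.codes
ratings['movie'] = ratings['movieId'].astype('category').cat.codes
n_users, n_movies = ratings['user'].nunique(), ratings['movie'].nunique()

train, test = train_test_split(ratings, test_size=0.1, random_state=42)

# Users and movies are fed to the network as two separate input arrays
X_train_array = [train['user'].values, train['movie'].values]
X_test_array = [test['user'].values, test['movie'].values]
y_train, y_test = train['rating'].values, test['rating'].values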

Model 1

The general format of these networks is two embedding layers, one for users and one for movies, which are then merged in some way to produce a predicted rating on the 1-5 scale.

Conceptually, embedding layers map categorical variables to vectors of continuous numbers. We create embeddings to represent each movie and each user. Initially random, they are learned through backpropagation, capturing the essential qualities of the users/movies that drive the observed ratings. This model is simple but powerful: each input is mapped to the corresponding index in its embedding layer, and the two embeddings are merged with a dot product (possible since they are the same size), producing a single predicted rating.

We use mean squared error as our loss function and the popular, very effective Adam optimizer (rather than SGD) to optimize the network weights. Additionally, we track Root Mean Square Error, Mean Absolute Error, and explained variance as metrics for later evaluation; explained variance is not a built-in Keras metric, so an rsquared function was written by hand. All three are passed as separate metrics. We are not using a validation set here, since we evaluate on the held-out test set at the end. Finally, I add regularization to the embeddings to make sure the model doesn't overfit!

In [0]:
# Imports assumed from earlier in the notebook (standalone Keras API)
from keras import backend as K
from keras.layers import Input, Embedding, Reshape, Dot
from keras.models import Model
from keras.optimizers import Adam
from keras.regularizers import l2

def rsquared(y_true, y_pred):
    # Explained variance (R^2): 1 - SS_res / SS_tot
    SS_res = K.sum(K.square(y_true - y_pred))
    SS_tot = K.sum(K.square(y_true - K.mean(y_true)))
    return 1 - SS_res / (SS_tot + K.epsilon())

def Rec1(n_users, n_movies, n_factors):
    # User embedding: maps each user id to an n_factors-dim vector
    user = Input(shape=(1,))
    u = Embedding(n_users, n_factors, embeddings_initializer='he_normal',
                  embeddings_regularizer=l2(1e-6))(user)
    u = Reshape((n_factors,))(u)

    # Movie embedding: same size, so the two can be merged with a dot product
    movie = Input(shape=(1,))
    m = Embedding(n_movies, n_factors, embeddings_initializer='he_normal',
                  embeddings_regularizer=l2(1e-6))(movie)
    m = Reshape((n_factors,))(m)

    # Merge: dot product of user and movie embeddings -> predicted rating
    x = Dot(axes=1)([u, m])

    model1 = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model1.compile(loss='mean_squared_error', optimizer=opt,
                   metrics=['mse', 'mae', rsquared])
    return model1

We fit using a batch size of 64. The batch size is how many samples are used for each weight update. A higher batch size helps training time a lot, since the error (and gradient) is computed less often, which is computationally expensive; however, because weights are updated less often, model accuracy can suffer. We also use 5 epochs, which is the number of times the whole dataset is passed through. Increasing the number of epochs would keep lowering the training loss, but we need to be careful about overfitting. I made sure 5 was a good number by running with a validation split and checking that the validation loss did not start increasing by epoch 5 for any of the models (sketch below).
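
A sketch of that epoch sanity check, as a throwaway run separate from the final training below:

# Fit a throwaway copy with a validation split and watch val_loss by epoch;
# if val_loss is still flat or falling at epoch 5, 5 epochs is safe
check = Rec1(n_users, n_movies, n_factors)
history = check.fit(x=X_train_array, y=y_train, batch_size=64, epochs=10,
                    validation_split=0.1, verbose=0)
print(history.history['val_loss'])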

In [105]:
model1 = Rec1(n_users, n_movies, n_factors)
model1.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5, verbose=0)
evals1 = model1.evaluate(X_test_array, y_test, verbose=0)
evals1[1] = np.sqrt(evals1[1])  # convert MSE metric to RMSE
listOfStr = ['Final Loss (MSE)', 'RMSE', 'MAE', 'R2']
dict(zip(listOfStr, evals1))
Out[105]:
{'Final Loss (MSE)': 0.8751297973632812,
 'MAE': 0.7291919128417969,
 'R2': 0.26184510326385496,
 'RMSE': 0.9308358680487956}

Neural Network 2

This network is a modification of the original, with two new additions. One is an embedding bias term, added to give the embedding learning more flexibility: by analogy with linear regression, it makes sense to learn y = mx + b rather than y = mx. The other is feeding the dot product result through a sigmoid function, which introduces a non-linearity into the output! I tried incorporating one or the other, but only when I combined both did I get the best evaluation metrics.

In [0]:
# Additional layers assumed imported: from keras.layers import Add, Activation, Lambda

class EmbeddingLayer:
    """Reusable embedding + reshape block."""
    def __init__(self, n_items, n_factors):
        self.n_items = n_items
        self.n_factors = n_factors

    def __call__(self, x):
        x = Embedding(self.n_items, self.n_factors, embeddings_initializer='he_normal',
                      embeddings_regularizer=l2(1e-6))(x)
        x = Reshape((self.n_factors,))(x)
        return x

def RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors)(user)
    ub = EmbeddingLayer(n_users, 1)(user)    # per-user bias term

    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors)(movie)
    mb = EmbeddingLayer(n_movies, 1)(movie)  # per-movie bias term

    x = Dot(axes=1)([u, m])
    x = Add()([x, ub, mb])
    x = Activation('sigmoid')(x)
    # Scale the sigmoid output (0, 1) to the rating range
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt,
                  metrics=['mse', 'mae', rsquared])
    return model
In [106]:
model2 = RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating)
model2.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5, verbose=0)
evals2 = model2.evaluate(X_test_array, y_test, verbose=0)
evals2[1] = np.sqrt(evals2[1])  # convert MSE metric to RMSE
dict(zip(listOfStr, evals2))
Out[106]:
{'Final Loss (MSE)': 0.8117879865646362,
 'MAE': 0.7069411417007446,
 'R2': 0.3144750452041626,
 'RMSE': 0.8983010084414732}

Neural Network 3

This one is a bit different: the merging method is simply the concatenation of the two embeddings, after which we can add dense layers for the network to learn from. This is the deep learning variant! After experimenting with different numbers and sizes of layers, I found two dense layers of sizes 10 (ReLU) and 1 (with sigmoid activation) to be optimal.

In [0]:
# Additional layers assumed imported: from keras.layers import Dense, Concatenate

def RecNet(n_users, n_movies, n_factors_users, n_factors_movies,
           min_rating, max_rating, conc_or_dot):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors_users)(user)

    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors_movies)(movie)

    # The merge layer is injected (Concatenate() or Add()); element-wise
    # merges like Add require matching embedding sizes
    x = conc_or_dot([u, m])

    # Small dense head on top of the merged embeddings
    x = Dense(10, activation='relu', kernel_initializer='he_normal')(x)
    x = Dense(1, activation='sigmoid', kernel_initializer='he_normal')(x)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt,
                  metrics=['mse', 'mae', rsquared])
    return model

Grid Search

However, I found that the embedding sizes for users and movies greatly affected model performance! To optimize this, I performed a manual grid search over embedding sizes of [50, 75, 100, 150, 200] for both users and movies. The search gave optimal embedding sizes of 100 for users and 50 for movies, winning on 3 of the 4 objectives. Adding more layers decreased overall accuracy, probably because there were too many weights to learn, leading to overfitting. I think with the bigger dataset, adding more layers would help with evaluation!

In [0]:
l = []
for i in [50, 75, 100, 150, 200]:
    for j in [50, 75, 100, 150, 200]:
        model5 = RecNet(n_users, n_movies, i, j, min_rating, max_rating, Concatenate())
        model5.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5, verbose=0)
        eva = model5.evaluate(X_test_array, y_test, verbose=0)
        l.append(((i, j), eva))
print('Done')
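
The loop above only collects results, so a small follow-up step (reconstructed here, since the selection cell was not preserved) picks the winner:

# evaluate() returns [loss, mse, mae, rsquared]; pick the sizes with lowest test MSE
best_sizes, best_eval = min(l, key=lambda entry: entry[1][1])
print('best (user, movie) embedding sizes:', best_sizes)
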
In [101]:
model5 = RecNet(n_users, n_movies, 100, 50, min_rating, max_rating, Concatenate())
model5.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5, verbose=0)
evals5 = model5.evaluate(X_test_array, y_test, verbose=0)
evals5[1] = np.sqrt(evals5[1])  # convert MSE metric to RMSE
dict(zip(listOfStr, evals5))
Out[101]:
{'Final Loss (MSE)': 0.8572985005378723,
 'MAE': 0.7273193671226501,
 'R2': 0.27431239557266235,
 'RMSE': 0.9251387700815097}

This model is very similar to the previous one: instead of concatenating the embeddings, it adds them. Nevertheless, it produces results similar to the concat version.

In [102]:
model6 = RecNet(n_users, n_movies, 50, 50, min_rating, max_rating, Add())
model6.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5, verbose=0)
evals6 = model6.evaluate(X_test_array, y_test, verbose=0)
evals6[1] = np.sqrt(evals6[1])  # convert MSE metric to RMSE
dict(zip(listOfStr, evals6))
Out[102]:
{'Final Loss (MSE)': 0.8694782653331756,
 'MAE': 0.7308314328193665,
 'R2': 0.2643206853866577,
 'RMSE': 0.9318947968148698}

After all models were run, we graph the evaluation metrics for each one. Across all models, the non-deep-learning method using embedding vectors with bias terms and a sigmoid activation gave the best results. With a larger dataset, the deep learning option might give better results.

In [108]:
plot()

Now we give example recommendations from each model, using user 101. We find the movies with the highest predicted ratings among those the user has not yet watched: the neural network gives every user a predicted rating for every movie in the dataset, which is the beauty of this approach! A sketch of the procedure appears below.
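
A sketch of that procedure for one model; the rec1()/rec2()/rec3() helpers below wrap something along these lines (plus a title lookup against a movies table), continuing with the encoded ratings frame from earlier:

import numpy as np

user = 101
seen = set(ratings.loc[ratings['user'] == user, 'movie'])
candidates = np.array([m for m in range(n_movies) if m not in seen])

# Predict a rating for every unwatched movie and keep the 5 best
preds = model1.predict([np.full(len(candidates), user), candidates]).ravel()
top5 = candidates[np.argsort(preds)[::-1][:5]]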

Recommendations for model 1: Embedding w/ Dot

In [67]:
rec1()
Out[67]:
     movieId  title                               genres
277      318  Shawshank Redemption, The (1994)    Crime|Drama
57        64  Two if by Sea (1996)                Comedy|Romance
272      313  Swan Princess, The (1994)           Animation|Children
234      272  Madness of King George, The (1994)  Comedy|Drama
353      408  8 Seconds (1994)                    Drama

Recommendations for model 2: Embedding w/ Bias & Sigmoid

In [70]:
rec2()
Out[70]:
     movieId  title                          genres
272      313  Swan Princess, The (1994)      Animation|Children
146      174  Jury Duty (1995)               Comedy
117      144  Brothers McMullen, The (1995)  Comedy
144      172  Johnny Mnemonic (1995)         Action|Sci-Fi|Thriller
183      215  Before Sunrise (1995)          Drama|Romance

Recommendations for model 5: Concat Deep Net

In [73]:
rec3()
Out[73]:
      movieId  title                             genres
277       318  Shawshank Redemption, The (1994)  Crime|Drama
1231     1639  Chasing Amy (1997)                Comedy|Drama|Romance
353       408  8 Seconds (1994)                  Drama
161       190  Safe (1995)                       Thriller

To verify the results, we run cross validation across the models, using k-fold cross validation with k = 10. To reduce training time, we increased the batch size and used 4 epochs instead. The cross-validation results confirm what we found above, and are plotted below.
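
A sketch of that verification loop, reconstructed under the same assumptions as above (the exact batch size is illustrative):

import numpy as np
from sklearn.model_selection import KFold

users = ratings['user'].values
movies = ratings['movie'].values
y = ratings['rating'].values

rmses = []
for tr, va in KFold(n_splits=10, shuffle=True, random_state=42).split(y):
    # Fresh model per fold; larger batch and fewer epochs to keep CV fast
    m = RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating)
    m.fit([users[tr], movies[tr]], y[tr], batch_size=256, epochs=4, verbose=0)
    loss, mse, mae, r2 = m.evaluate([users[va], movies[va]], y[va], verbose=0)
    rmses.append(np.sqrt(mse))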

In [120]:
plot2()
In [0]:
 

Neural collaborative filtering using FastAI

Method to find optimal value for learning rate

Cross Validation in order to optimize the number of epochs

Training process

Top 5 predictions for a user
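
The original cells for this section did not survive the export, so here is a rough sketch under stated assumptions: the fastai v2 collaborative-filtering API and the same ratings DataFrame as above:

from fastai.collab import CollabDataLoaders, collab_learner

dls = CollabDataLoaders.from_df(ratings[['userId', 'movieId', 'rating']],
                                user_name='userId', item_name='movieId',
                                rating_name='rating', bs=64)
learn = collab_learner(dls, n_factors=50, y_range=(0.5, 5.5))

learn.lr_find()               # loss vs. learning rate curve to pick a good value
learn.fit_one_cycle(5, 5e-3)  # one-cycle training for 5 epochs

# Top 5 predictions for user 101: score candidate pairs, keep the highest
candidates = ratings[['userId', 'movieId']].drop_duplicates('movieId').assign(userId=101)
preds, _ = learn.get_preds(dl=dls.test_dl(candidates))
top5 = candidates.assign(pred=preds.numpy().ravel()).nlargest(5, 'pred')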

Some metrics

Results

We evaluated each model using Root Mean Squared Error, Mean Absolute Error, and Explained Variance.

The Keras version of the Bias & Activation neural network had the best RMSE and MAE values, while the FastAI version had the best R-squared value. We measured execution times for both, although this is not a fully consistent basis for comparison since run times vary from one computer to another; even so, we can conclude that the Bias & Activation neural network methods gave the best performance in terms of both accuracy and execution time. The unsupervised methods also worked pretty well! Behind the bias-and-activation network, KNNBaseline was the second-best method, with SVD and NMF trailing behind it.

Conclusion and Future Steps

In conclusion, we compared all the different algorithms using various evaluation metrics, covering both supervised and unsupervised methods within collaborative filtering, and determined the best one. Furthermore, we observed that the algorithm we selected as the best would require a lot of maintenance: each time a new movie is rated, the model would need to be retrained. In the future, to address this problem, we hope to explore content-based or hybrid recommendation systems. We also want to explore other datasets, such as music or advertising.