Given a list of movies users have watched and their ratings, how do you suggest the next movie a user should watch? This is the famous
Netflix Prize, which ended with a million dollars being awarded to the winners.
Today the data for the Netflix problem is still available on Kaggle, and the goal of this project was to use PyTorch to build a collaborative filtering network to “solve” the problem.
Along the way we applied embeddings, a learning rate finder, and triangular learning rates to the problem.
There is nothing fancy about the model built for this task. It uses an embedding of size 100 for both the users and the items. These embeddings are passed through two linear layers, each with dropout and batch normalization applied. Finally the model predicts a single number (the rating), and we used mean squared error as our loss.
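A minimal sketch of an architecture like the one described; the hidden-layer widths, dropout rate, and class name are our own assumptions, not the project's exact values:

```python
import torch
import torch.nn as nn

class CollabFilterNet(nn.Module):
    """Embeds users and items, concatenates, and regresses a rating."""

    def __init__(self, n_users, n_items, emb_size=100, hidden=256, p_drop=0.3):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, emb_size)
        self.item_emb = nn.Embedding(n_items, emb_size)
        self.layers = nn.Sequential(
            nn.Linear(2 * emb_size, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, hidden // 2),
            nn.BatchNorm1d(hidden // 2),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden // 2, 1),  # predict a single rating
        )

    def forward(self, users, items):
        # Look up both embeddings and concatenate along the feature axis.
        x = torch.cat([self.user_emb(users), self.item_emb(items)], dim=1)
        return self.layers(x).squeeze(1)
```

Training against the true ratings with `nn.MSELoss()` then gives the mean-squared-error objective described above.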
The cool stuff in this project happened during training.
Learning Rate Finder
Our first bit of training optimization was the learning rate finder. The goal of this step was to find the “best” or “max” learning rate for the model.
To do this we trained the model across a range of increasing learning rates, recording the loss for each batch along the way. We then plotted loss against learning rate, and chose a value just before the point where the loss begins to increase.
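The sweep itself can be sketched as a list of exponentially increasing rates, one per batch; the function name and the bounds here are our own assumptions:

```python
def finder_rates(num_batches, lr_min=1e-7, lr_max=1.0):
    """One exponentially increasing learning rate per batch."""
    factor = (lr_max / lr_min) ** (1 / (num_batches - 1))
    return [lr_min * factor ** i for i in range(num_batches)]

# During the finder run each batch uses the next rate in this list and we
# record (rate, loss); the plot of loss vs. rate is then inspected by hand.
```

An exponential (rather than linear) sweep is the usual choice because candidate learning rates span several orders of magnitude.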
Once we had found this learning rate we built a slanted triangular learning rate (STLR) schedule.
Slanted Learning Rate
While this max learning rate is good, we didn’t want to train with the same rate for the whole training process. By varying the learning rate we hoped the model would learn more quickly and reach a lower loss.
To build this varied learning rate we started from the max learning rate found above, then applied the slanted triangular learning rate formula from the ULMFiT paper to build a schedule. The code to do this is below:
```python
import math

def stlr(num_epochs, train_size, eta_max=0.01, cut_frac=0.1, ratio=32):
    '''Slanted triangular learning rates from the ULMFiT paper,
    see https://arxiv.org/abs/1801.06146'''
    training_iterations = num_epochs * train_size
    cut = math.floor(training_iterations * cut_frac)  # iteration where the rate peaks
    lr = [None] * training_iterations
    for t in range(training_iterations):
        if t < cut:
            p = t / cut  # linear warm-up to the peak
        else:
            p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # linear decay
        lr[t] = eta_max * (1 + p * (ratio - 1)) / ratio
    return lr
```
With the list of learning rates this returned, we were able to pass one in for each of the batches in the training process.
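One way to apply such a per-batch schedule is to overwrite the optimizer's learning rate before each step; the helper name and the loop variables below are placeholders, not the project's actual code:

```python
import torch

def set_lr(optimizer, lr):
    """Overwrite the learning rate of every parameter group."""
    for group in optimizer.param_groups:
        group["lr"] = lr

# Hypothetical training loop: `lrs` is the schedule list, consumed one
# entry per batch across all epochs.
# for epoch in range(num_epochs):
#     for i, (users, items, ratings) in enumerate(loader):
#         set_lr(optimizer, lrs[epoch * len(loader) + i])
#         loss = loss_fn(model(users, items), ratings)
#         optimizer.zero_grad()
#         loss.backward()
#         optimizer.step()
```

Mutating `param_groups` directly is the same mechanism PyTorch's built-in schedulers use under the hood.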
In the end the model performed well, but it was too simple to capture enough of the nuance in the data to make the predictions exciting. We ended with a loss of around 1.07, comparable to the top Kaggle kernel (standard collaborative filtering) at 0.98.
We also found that predicting into the future (validation) is hard because of the cold start problem. Our validation loss remained higher than our training loss, but it did not show overfitting.
The full code for this project can be found on GitHub.