The goal of this project was to transfer the style from one image to another without making the content unrecognizable. To do this we first picked two different images: one called the content image and one called the style image.

Then we had to construct two distance functions: one for the style and one for the content. This takes some fairly intense math, but a good explanation is provided in the original paper.1

The output image was then created by finding an image that has a low distance to both the content image and the style image. These distances were computed from the feature maps of multiple convolutional layers of the neural network.
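To make these two distances concrete, here is a minimal sketch in PyTorch. The content distance is commonly a mean-squared error between feature maps, and the style distance a mean-squared error between their Gram matrices; the feature tensors below are random stand-ins for real network activations.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    # feat: (batch, channels, height, width) feature map from a conv layer
    b, c, h, w = feat.shape
    flat = feat.view(b * c, h * w)            # flatten the spatial dimensions
    return flat @ flat.t() / (b * c * h * w)  # normalized channel correlations

# Random stand-ins for real feature maps, just for illustration
target_feat = torch.randn(1, 64, 32, 32)
input_feat = torch.randn(1, 64, 32, 32)

content_loss = F.mse_loss(input_feat, target_feat)  # content distance
style_loss = F.mse_loss(gram_matrix(input_feat),
                        gram_matrix(target_feat))   # style distance
```

The Gram matrix throws away spatial layout and keeps only which channels fire together, which is why it captures "style" rather than content.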

Because no single image starts out close to both, we used gradient descent to approach the representation we wanted. To do this, the optimizer stepped along the sum of the gradients from the two losses.
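Summing the gradients falls out of a single backward pass on a weighted sum of the two losses. This sketch uses stand-in losses and made-up weights; in practice the content/style weights are hyperparameters you tune.

```python
import torch

img = torch.randn(1, 3, 64, 64, requires_grad=True)  # image being optimized
target = torch.zeros_like(img)                       # stand-in target

# Stand-ins for the real content and style distances
content_loss = ((img - target) ** 2).mean()
style_loss = (img ** 2).mean()

# Hypothetical weights; the content/style trade-off is tuned here
total_loss = 1.0 * content_loss + 1e-2 * style_loss
total_loss.backward()  # one backward pass accumulates both gradients on img
```

After `backward()`, `img.grad` holds the combined gradient that the optimizer steps along.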

[Images: A sample of images with the style transfer applied to the Golden Gate Bridge]

Applying Style Transfer

A great set of starter code for this can be found in the PyTorch tutorials.2 It also contains a clear explanation of the content and style loss implementations.

The general approach to style transfer is this:

  1. Read in two images and resize them to the same size
  2. Load in the pretrained network
  3. Combine the two losses for each ‘important’ layer of the network
  4. Run gradient descent

There are two specific implementation details of the optimizer I would like to focus on.


The first is the choice of optimizer for neural style transfer. The authors of the original paper suggest using the LBFGS optimization algorithm. So let's dive into what makes this different from the more common Adam optimizer.

First, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is a famous quasi-Newton optimization method. Rather than stepping along the raw gradient, it computes its search direction from an N x N approximation of the inverse Hessian matrix.3

The problem with this algorithm is the memory required to store the N x N approximation, which grows quadratically with the number of variables, especially in our case where losses from multiple layers feed into the optimizer.
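A quick back-of-the-envelope calculation shows why this matters. Here the variables are the pixels of the output image; the image size is an illustrative assumption.

```python
# Memory cost of the full N x N inverse-Hessian approximation,
# assuming we optimize the pixels of a 512 x 512 RGB image
n = 512 * 512 * 3                     # number of variables
bytes_full = n * n * 4                # N x N matrix of float32 values
print(f"{bytes_full / 1e12:.1f} TB")  # on the order of terabytes
```

Roughly 2.5 TB for a single matrix, which is why plain BFGS is a non-starter here.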

This is why the authors suggest using LBFGS, or limited-memory BFGS. Instead of storing the N x N approximation, LBFGS stores a few vectors that implicitly represent the inverse Hessian. This is why it is well suited for optimization problems with a large number of variables, like we have in style transfer.

The second implementation detail is handling the fact that LBFGS re-evaluates the loss and gradient multiple times per step. Because of this we have to pass a closure method to our optimization step.

That way, each time PyTorch needs to recalculate the gradient, it can rerun the whole function. To do this we define the closure inside the running loop for the model, and pass it in to the optimizer as seen below:

while step < num_steps:                   # running loop
    def closure():
        optimizer.zero_grad()             # LBFGS re-evaluates, so clear old grads
        loss = content_loss + style_loss  # optimizer calculation
        loss.backward()
        return loss
    optimizer.step(closure)               # LBFGS may call closure several times
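Here is the same closure pattern as a complete, runnable toy example. A random image is pulled toward an all-zero target with a plain MSE loss standing in for the combined content and style losses.

```python
import torch
import torch.nn.functional as F

target = torch.zeros(1, 3, 64, 64)                   # stand-in "target" image
img = torch.randn(1, 3, 64, 64, requires_grad=True)  # image being optimized
optimizer = torch.optim.LBFGS([img])

for step in range(5):                      # running loop
    def closure():
        optimizer.zero_grad()              # reset grads before re-evaluation
        loss = F.mse_loss(img, target)     # stand-in for content + style loss
        loss.backward()
        return loss
    optimizer.step(closure)                # LBFGS calls closure as needed

print(F.mse_loss(img, target).item())      # loss shrinks toward zero
```

The same loop structure carries over to real style transfer; only the loss computation inside the closure changes.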

Style transfer is a pretty cool application of deep learning, but it does have many nuances to understand. While it is a small niche of deep learning, I think it is worth learning just for the cool pictures.

As a part of this project I used both VGG and SqueezeNet to do style transfer. I recommend you try VGG and some other pretrained network, because it will give you experience working with different types of layers and model architectures.