Pytorch is all the rage these days. It is quickly gaining momentum in the deep learning community. Its success can be attributed to three main things:

  1. the ease of writing Pytorch code
  2. the ability to interactively debug your models
  3. its speed.

In this article we are going to focus on the third, since you get the first two for free.

Because Pytorch is written to be “pythonic”, there are many different ways to implement a solution. Of course, not all of these solutions are optimal. If you’re not careful, you can end up writing some pretty slow code. In this post I hope to highlight some of the tricks I picked up writing Pytorch, so you can get the most out of the library.

The GPU – CPU Transfer

The first cool thing about Pytorch is how easy it is to move computations to a GPU or CPU. But with great power comes great responsibility. You should try to minimize these transfers, because copying data between CPU and GPU memory is a very expensive step. Here is an example of how it is done.

First, to pass data or models between the two you can use:

model.cuda() # moves the model's parameters to the GPU
model.cpu() # moves the model's parameters back to the CPU

But this is not necessarily the best way to do things, because a hard-coded .cuda() call will break your code when you move from a machine with a GPU to one without. Instead, you can tell Pytorch to use the GPU if it is available and fall back to the CPU otherwise.

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Then when passing data around you can call .to(device) on each tensor or model, which is what the training loop later in this post does.
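A minimal sketch of that pattern, assuming a small stand-in model and a random batch (both hypothetical, not taken from the Netflix project). It relies only on the .to(device) method, which exists on both tensors and modules:

```python
import torch
import torch.nn as nn

# pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# hypothetical stand-in model and batch, for illustration only
model = nn.Linear(4, 1)
batch = torch.randn(8, 4)

# nn.Module.to moves the parameters in place and returns the module
model = model.to(device)

# Tensor.to returns a tensor on the target device, so keep the result
batch = batch.to(device)

output = model(batch)
print(output.shape, output.device)
```

The same script runs unchanged on a laptop without a GPU and on a GPU machine, which is the point of building the device object once and reusing it.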

Checking if a GPU is available is a great trick, because it allows you to write code that responds to the machine it is running on. For example, if you are doing Neural Transfer, the image size can affect the speed of your code. Because your optimization will take longer on a CPU than on a GPU, you can reduce the time by shrinking the image if all that is available is a CPU.

imsize = 512 if torch.cuda.is_available() else 128  # use small size if no GPU


The next cool thing Pytorch has to offer is the availability of datasets and dataloaders. These are two tools that Pytorch gives you to format and work with your data so that your computations will be fast. Anytime you are working with a new dataset you should write each of these for it.

To make use of a dataloader, first we need a dataset. In Pytorch, Dataset (torch.utils.data.Dataset) is an abstract class that you subclass to describe how your data is loaded and indexed. This is where the bulk of the code will be written.

To implement a dataset we need to write three methods: __init__, __len__, and __getitem__. Below is an example of a dataset I constructed for the Netflix data set.

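import pandas as pd
import torch
from torch.utils import data

class NetflixDataset(data.Dataset):

    def __init__(self, csv, transform=None):
        # read and cast everything once, up front
        df = pd.read_csv(csv)
        self.length = len(df)
        self.y = torch.tensor(df['rating'].values, dtype=torch.float32)
        self.x = torch.tensor(df.drop('rating', axis=1).values, dtype=torch.long)
        self.transform = transform

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        sample = {'x': self.x[index], 'y': self.y[index]}
        # apply the optional transform, if one was given
        if self.transform is not None:
            sample = self.transform(sample)
        return sample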

The three main pieces of advice I have for writing datasets are:

1. Do all your casting and pandas work in the __init__ method.

These are the slow parts of your code. You only want to do them once, and ideally you will not use them outside of __init__.

2. Always have optional transforms.

This is helpful mostly when dealing with pictures, but text can be augmented as well, for example by translating it. Applying the transform in __getitem__ is important because it keeps the transformed data together with the original on the same side of the train/val split.

Side Note: if you are using pictures, be sure to load them with PIL, because many of the built-in transforms in the torchvision package operate on PIL images.

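3. Finally, in your __getitem__ method, always use a dictionary.

This makes it easier to know what you are getting later in your code, because you look values up by name instead of by position. The default collate function in DataLoader batches dictionaries key by key, so it costs nothing extra compared with returning a list.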
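The three pieces of advice can be sketched together as follows. This is a toy example under stated assumptions: ToyRatingsDataset, its hard-coded rows, and the scale_rating transform are hypothetical stand-ins, not the Netflix code.

```python
import torch
from torch.utils.data import Dataset


class ToyRatingsDataset(Dataset):
    """Hypothetical in-memory dataset: (user, item) ids and a rating."""

    def __init__(self, rows, ratings, transform=None):
        # tip 1: do all casting once, up front
        self.x = torch.tensor(rows, dtype=torch.long)
        self.y = torch.tensor(ratings, dtype=torch.float32)
        # tip 2: the transform is optional
        self.transform = transform

    def __len__(self):
        return len(self.y)

    def __getitem__(self, index):
        # tip 3: return a dictionary so fields are looked up by name
        sample = {"x": self.x[index], "y": self.y[index]}
        if self.transform is not None:
            sample = self.transform(sample)
        return sample


def scale_rating(sample):
    # hypothetical transform: map ratings from the 1-5 range to 0-1
    return {"x": sample["x"], "y": (sample["y"] - 1.0) / 4.0}


ds = ToyRatingsDataset([[0, 10], [1, 11], [2, 12]], [1.0, 3.0, 5.0],
                       transform=scale_rating)
print(len(ds), ds[1])
```

Because the transform runs inside __getitem__, the stored tensors stay untouched and the same dataset class can be reused with or without it.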

Data Loaders

Now that we have a dataset, turning it into a dataloader is trivial. The only other thing we have to pick is the batch size. In general, the bigger the batch, the faster each epoch runs (fewer optimization steps, each doing more work in parallel), but you have to be careful that the batch fits in memory.

Side Note: To check how much memory each batch takes up on your GPU, you can use nvidia-smi from the command line.
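As a rough complement to nvidia-smi, the size of the raw batch tensors can be estimated from their shape and dtype. The sketch below assumes, hypothetically, two int64 id columns per row and one float32 rating, in line with the dtypes used by NetflixDataset. It ignores the memory taken by the model, activations, and gradients, which is often the larger share.

```python
import torch

batch_size = 100000

# hypothetical batch: two int64 id columns and one float32 rating per row
x = torch.zeros(batch_size, 2, dtype=torch.long)
y = torch.zeros(batch_size, dtype=torch.float32)


def tensor_bytes(t):
    # bytes per element times number of elements
    return t.element_size() * t.nelement()


total_bytes = tensor_bytes(x) + tensor_bytes(y)
print(f"{total_bytes / 1e6:.1f} MB per batch of inputs and targets")
```

Under these assumptions the inputs take 100000 x 2 x 8 = 1.6 MB and the targets 100000 x 4 = 0.4 MB, so about 2 MB per batch before the model does any work.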

Once you have decided on your batch size, you can initialize both the dataset and data loader as follows:

from torch.utils.data import DataLoader

# build datasets
# NetflixDataset passes its argument to pd.read_csv, so each argument must be
# a path to a CSV file (the *_csv variable names are assumed here)
train_ds = NetflixDataset(train_csv)
valid_ds = NetflixDataset(valid_csv)
test_ds = NetflixDataset(test_csv)

# build data loaders
batch_size = 100000
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True, num_workers=4)

# for validation and test we use shuffle=False
valid_dl = DataLoader(valid_ds, batch_size=batch_size, shuffle=False, num_workers=4)
test_dl = DataLoader(test_ds, batch_size=batch_size, shuffle=False, num_workers=4)

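As seen in the comments, we only want to shuffle the training data. Shuffling keeps the order in which the data was recorded from leaking into the batches the model trains on. The validation and test sets are only evaluated, so their order does not matter.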
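To see what a data loader actually yields, here is a small sketch. It assumes a hypothetical TinyDataset of ten rows and uses num_workers=0 so that it runs anywhere; the names are illustrative and not from the original notebook.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TinyDataset(Dataset):
    """Hypothetical dataset of ten rows that returns dictionaries."""

    def __init__(self):
        self.x = torch.arange(20).reshape(10, 2)   # two id columns
        self.y = torch.arange(10, dtype=torch.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, index):
        return {"x": self.x[index], "y": self.y[index]}


ds = TinyDataset()

# shuffle=False keeps the rows in their original order
eval_dl = DataLoader(ds, batch_size=4, shuffle=False, num_workers=0)
eval_batches = list(eval_dl)

# shuffle=True draws a new random order on every pass over the loader
train_dl = DataLoader(ds, batch_size=4, shuffle=True, num_workers=0)
train_batches = list(train_dl)

# the default collate function stacks each dictionary field separately
for batch in eval_batches:
    print(batch["x"].shape, batch["y"])
```

With ten rows and a batch size of four, the loader yields batches of four, four, and two rows. Each batch is still a dictionary, with the fields stacked along a new first dimension.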

Multi-GPU Speed

If you happen to have multiple GPUs to work with, Pytorch will let you parallelize the batches across them. This is also trivial to set up, and can be added to your code even if you don’t have multiple GPUs.

import torch
import torch.nn as nn

# device just like before
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# build the model
model = CollabFNet(num_users, num_items, emb_size=100)

# if you have more than one GPU, parallelize the model
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)

# move the model to the device; DataParallel replicates it onto
# the other GPUs on each forward pass
model.to(device)

Note: This only splits each batch across the GPUs (data parallelism). Every GPU still holds a full copy of the model.

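When I wrote this post, I had not found support in Pytorch for distributed training of models. Newer releases ship the torch.distributed package and DistributedDataParallel, so check the current documentation before relying on this statement.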

Train Loop Optimization

The last thing I want to show you is how to set up your training loop so that it will be fast. Here is an example of a simple training loop I used for the Netflix data.

# for each epoch
for i in range(epochs):
    # set model to train and initialize aggregation variables
    model.train()
    total, sum_loss = 0, 0

    # for each batch
    for sample in train_dl:

        # get the optimizer (allows for changing learning rates)
        # idx is not defined in this excerpt; it indexes the learning-rate schedule lrs
        optim = get_optimizer(model, lr=lrs[idx], wd=0.00001)

        # put each of the batch objects on the device
        x = sample['x'].to(device)
        y = sample['y'].unsqueeze(1).to(device)
        # put x through the model and calculate metrics
        # ...

In this loop the most important thing is that we put the data on the device once per batch. We don’t want to keep moving it back and forth, because that is time intensive.
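For completeness, here is a self-contained sketch of a loop with the same shape. Everything in it is an assumption made for illustration: the random data, the nn.Linear model, the SGD optimizer, and the MSE loss stand in for the collaborative filtering model and the get_optimizer helper, which are not shown in this post. It also creates the optimizer once, outside the loop, to keep the example short.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# hypothetical data: the target is the sum of three random features
features = torch.randn(256, 3)
targets = features.sum(dim=1, keepdim=True)
train_dl = DataLoader(TensorDataset(features, targets), batch_size=64, shuffle=True)

model = nn.Linear(3, 1).to(device)   # move the model once, before the loop
optim = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

epoch_losses = []
for epoch in range(5):
    model.train()
    total, sum_loss = 0, 0.0
    for x, y in train_dl:
        # move each batch to the device exactly once
        x, y = x.to(device), y.to(device)

        optim.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optim.step()

        # .item() copies a single scalar back to the CPU for bookkeeping
        total += y.size(0)
        sum_loss += loss.item() * y.size(0)
    epoch_losses.append(sum_loss / total)

print(epoch_losses)
```

The only device transfers inside the loop are the two .to(device) calls at the top of each batch and the single scalar read by .item(); everything else stays on the device.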


All of these are tips and tricks I discovered while trying to build a simple Collaborative Filtering Neural Network. This project was a part of the Deep Learning with Pytorch class I took during my Master’s at USF. To see all of the code, you can check out the notebook on Github and a write-up about it in my projects.

Applying these tricks made my code easier to read, and reduced each epoch’s runtime from 4 hours to 4 minutes. I hope they help you write faster Pytorch code too!

