Use PyTorch’s DataLoader with Variable-Length Sequences for LSTM/GRU

When I first started using PyTorch to implement recurrent neural networks (RNNs), I ran into a small issue when trying to use DataLoader in conjunction with variable-length sequences. What I specifically wanted was to automate the process of distributing training data among multiple graphics cards. Even though there are numerous examples online showing how to do the actual padding, I couldn’t find any concrete example of using DataLoader in conjunction with padding, and my months-old question on their forum is still unanswered!

The standard way of working with inputs of variable lengths is to pad all the sequences with zeros so that their lengths equal the length of the longest sequence. This padding is done with the pad_sequence function. PyTorch’s RNN modules (LSTM, GRU, etc.) can then consume these padded sequences, once they are packed into a PackedSequence, and intelligently skip the zero padding during computation.
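
As a minimal sketch of what pad_sequence does (the tensor sizes here are arbitrary, purely for illustration):

import torch
from torch.nn.utils.rnn import pad_sequence

# Three sequences of different lengths, each with 2 features per time step
a = torch.randn(5, 2)
b = torch.randn(3, 2)
c = torch.randn(1, 2)

padded = pad_sequence([a, b, c], batch_first=True)
print(padded.shape)  # torch.Size([3, 5, 2]) -- every sequence zero-padded to length 5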

If the goal is to train with mini-batches, one needs to pad the sequences in each batch. In other words, given a mini-batch of size N, if the length of the longest sequence is L, every sequence shorter than L needs to be padded with zeros so that its length becomes L. Moreover, the sequences in the batch must be sorted in descending order of length, which is what pack_padded_sequence expects by default (see the short sketch below).
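
As a quick illustration of that sorting requirement (the shapes below are made up; note also that recent PyTorch versions accept enforce_sorted=False if you would rather not sort manually):

import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Already sorted longest-first, as pack_padded_sequence expects by default
seqs = [torch.randn(6, 10), torch.randn(4, 10), torch.randn(2, 10)]
lengths = torch.tensor([6, 4, 2])

padded = pad_sequence(seqs, batch_first=True)                     # shape (3, 6, 10)
packed = pack_padded_sequence(padded, lengths, batch_first=True)  # requires descending lengths

# In recent PyTorch versions, enforce_sorted=False skips the manual sort
packed2 = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)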

To do proper padding with DataLoader, we can use the collate_fn argument to specify a class that performs the collation operation, which in our case is zero padding. The following is a minimal example of a collation class that does the padding we need:

import torch

class PadSequence:
    def __call__(self, batch):
        # Each element in "batch" is a tuple (data, label)
        # Sort the batch in descending order of sequence length
        sorted_batch = sorted(batch, key=lambda x: x[0].shape[0], reverse=True)
        # Get each sequence and pad it
        sequences = [x[0] for x in sorted_batch]
        sequences_padded = torch.nn.utils.rnn.pad_sequence(sequences, batch_first=True)
        # Also store the length of each sequence;
        # this is later needed in order to unpad the sequences
        lengths = torch.LongTensor([len(x) for x in sequences])
        # Don't forget to grab the labels of the *sorted* batch
        labels = torch.LongTensor([x[1] for x in sorted_batch])
        return sequences_padded, lengths, labels

Note the importance of batch_first=True in my code above. By default, DataLoader assumes that the first dimension of the data is the batch dimension, whereas PyTorch’s RNN modules, by default, put the batch in the second dimension (which I absolutely hate). Fortunately, this behavior can be changed for both the RNN modules and the DataLoader, and I personally always prefer to have the batch be the first dimension of the data.
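
To see the difference, here is a small shape check (the layer sizes are arbitrary):

import torch
import torch.nn as nn

x = torch.randn(4, 7, 10)                   # (batch=4, seq_len=7, features=10)

gru_bf = nn.GRU(10, 20, batch_first=True)
out, _ = gru_bf(x)
print(out.shape)                            # torch.Size([4, 7, 20]) -- batch stays first

gru_default = nn.GRU(10, 20)                # default layout is (seq_len, batch, features)
out2, _ = gru_default(x.transpose(0, 1))
print(out2.shape)                           # torch.Size([7, 4, 20]) -- batch is second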

With my code above, a DataLoader instance is created as follows:

torch.utils.data.DataLoader(dataset=dataset,
                            ... more arguments ...,
                            collate_fn=PadSequence())
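
For completeness, here is a fuller sketch. ToyDataset below is a made-up dataset (not from the original setup) whose only job is to return (sequence, label) tuples; PadSequence is the collation class defined earlier:

import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    # Hypothetical dataset: each item is a (sequence, label) tuple with a random length
    def __init__(self, n_items=100, n_features=10):
        self.items = [
            (torch.randn(int(torch.randint(2, 8, (1,))), n_features), i % 2)
            for i in range(n_items)
        ]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]

loader = DataLoader(ToyDataset(), batch_size=4, shuffle=True,
                    collate_fn=PadSequence())

for sequences_padded, lengths, labels in loader:
    print(sequences_padded.shape, lengths, labels)
    break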

The last remaining step here is to pass each batch to the RNN module during training/inference. This can be done by using the pack_padded_sequence function as follows:

import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence as PACK

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(10, 20, 2, batch_first=True)  # Note that "batch_first" is set to "True"

    def forward(self, batch):
        x, x_lengths, _ = batch
        # Pack the padded batch so the GRU skips the zero padding
        x_pack = PACK(x, x_lengths, batch_first=True)
        output, hidden = self.gru(x_pack)
        return output, hidden
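
And if you later need to unpad the output (this is where the stored lengths from the collation class come in), pad_packed_sequence converts the packed output back into a regular padded tensor. A standalone sketch with made-up sizes:

import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

gru = torch.nn.GRU(10, 20, 2, batch_first=True)

# Two zero-padded sequences with true lengths 5 and 3 (sorted in descending order)
x = torch.zeros(2, 5, 10)
x[0] = torch.randn(5, 10)
x[1, :3] = torch.randn(3, 10)
lengths = torch.tensor([5, 3])

packed = pack_padded_sequence(x, lengths, batch_first=True)
output_packed, hidden = gru(packed)

# "Unpad": convert the packed output back to a (batch, max_len, hidden) tensor
output, out_lengths = pad_packed_sequence(output_packed, batch_first=True)
print(output.shape, out_lengths)  # torch.Size([2, 5, 20]) tensor([5, 3])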

 

7 comments


    • Sanjay Bharath on April 26, 2019 at 4:50 AM

    Hello, I'm getting an error like this:
    Traceback (most recent call last):
    File "/home/sanjay/anaconda3/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
    obj = _ForkingPickler.dumps(obj)
    File "/home/sanjay/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
    AttributeError: Can't pickle local object 'PadSequence.__call__..'

    • Andrés Lou on February 19, 2020 at 9:03 AM

    Thanks a lot for the help, Mehran!

    Happy to help!

    • Soo J Park on April 21, 2020 at 4:07 PM

    Thank you very much, it really helps.
    Would you mind providing an unpadding example as well?

    • Anupam Yadav on October 9, 2021 at 1:32 AM

    def forward(self, inputs, input_lengths, state):
        # inputs is of shape batch_size, num_steps (sequence length, i.e. the length of the
        # longest text sequence). Each row of inputs is a 1d LongTensor array of length
        # num_steps containing word indices. Using the embedding layer we want to convert
        # each word index to its corresponding word vector of dimension emb_dim
        batch_size = inputs.size(0)
        num_steps = inputs.size(1)
        # embeds is of shape batch_size * num_steps * emb_dim and is the input to the lstm layer
        embeds = self.emb_layer(inputs)
        # pack_padded_sequence before feeding into LSTM. This is required so pytorch knows
        # which elements of the sequence are padded ones and ignores them in computation.
        # This step is done only after the embedding step
        embeds_pack = pack_padded_sequence(embeds, input_lengths, batch_first=True)
        # lstm_out is of shape batch_size * num_steps * hidden_size and contains the output
        # features (h_t) from the last layer of the LSTM for each t
        # h_n is of shape num_layers * batch_size * hidden_size and contains the final hidden
        # state for each element in the batch, i.e. the hidden state at t_end
        # same for c_n as h_n except that it is the final cell state
        lstm_out_pack, (h_n, c_n) = self.lstm_layer(embeds_pack)
        # unpack the output
        lstm_out, lstm_out_len = pad_packed_sequence(lstm_out_pack, batch_first=True)
        # tensor flattening works only if the tensor is contiguous
        # https://discuss.pytorch.org/t/contigious-vs-non-contigious-tensor/30107/2
        # flatten lstm_out from 3d to 2d with shape ((batch_size * num_steps), hidden_dim)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        # regularize lstm output by applying dropout
        out = self.dropout(lstm_out)
        # The output Y of the fully connected rnn layer has the shape of
        # (`num_steps` * `batch_size`, `num_hiddens`). This Y is then fed as input to the
        # output fully connected linear layer which produces the prediction in the output shape of
        # (`num_steps` * `batch_size`, `output_dim`).
        output = self.linear(out)
        # reshape output to batch_size, num_steps, output_dim
        output = output.view(batch_size, -1, self.output_dim)
        # reshape output again to batch_size, output_dim. The last element of the middle dimension,
        # i.e. num_steps, is taken, i.e. for each item in the batch the output is the hidden state
        # from the last layer of the LSTM for t = t_end
        output = output[:, -1, :]
        output = self.act(output)
        return output, (h_n, c_n)

    • Neepa Biswas on November 1, 2021 at 6:25 PM

    Thank you. But I am getting an error: TypeError: expected Tensor as element 0 in argument 0, but got csr_matrix.
    Can you please help with this?

    • Zahra on February 14, 2023 at 2:50 PM

    Hi, I have a multivariate time-series dataset with different sequence lengths, and I am trying to train my model for a conditioned autoregressive problem. I used a method similar to yours, but the results don’t look good. My guess is that maybe my dataset is not normalized. Do you have any suggestions for how to normalize this type of data with variable lengths?
