Fast (Differentiable) Soft DTW for PyTorch using CUDA

Dynamic time warping (DTW) is a dynamic programming algorithm which aims to find the dissimilarity between two time-series. This algorithm was originally applied towards speech recognition.

In ICML 2017, Marco Cuturi and Mathieu Blondel proposed a differentiable formulation of this algorithm that’s very helpful in optimization problems involving temporal sequences. They call this differentiable formulation soft DTW and I’ve been using this algorithm quite extensively for a project that I’m working on.

My primary deep learning framework is PyTorch, and although multiple implementations exist already (e.g. this or this), they were a bit slow for my use-case, and couldn’t do as many experiments as I wanted due to speed constraints. Considering that soft DTW is very similar to the original DTW, and many efficient implementations exist for it already, I set out to come up with my own implementation that was faster than the existing ones. Naturally, a CUDA implementation was the first thing that I thought of.

One obvious approach of parallelizing DTW computations is to compute multiple DTW(x, y) queries in parallel, but that’s a very obvious approach ūüėÄ Interestingly, there already exist related work in the literature that demonstrate the parallelization of the seemingly sequential work flow of dynamic programming. Without going too much into the details, the basic idea is to parallelize the dynamic programming computation across multiple threads, by processing the diagonals of the cost matrix in parallel, via multiple threads. This, coupled by computing multiple DTW(x, y) queries in parallel yields some serious speed ups.

Today, I decided to publicly release my implementation, in the hope of helping others who may be interested in using this algorithm in their PyTorch projects. The project is available at GitHub.

Masked Tensor Operations in PyTorch

As far as I know, PyTorch does not inherently have masked tensor operations (such as those available in

The other day, I needed to do some aggregation operations on a tensor while ignoring the masked elements in the operations. Specifically, I needed to do a mean() along a specific dimension, but ignore the masked elements. Fortunately, it’s easy enough to implement these operations manually. Let’s implement the mean() operation.

Let’s say you have a matrix a, and a bool mask m (with the same shape as a) and you want to compute a.mean(dim=1) but only on elements that are not masked. Here’s a small function that does this for you:

def masked_mean(tensor, mask, dim):
    masked = torch.mul(tensor, mask)  # Apply the mask using an element-wise multiply
    return masked.sum(dim=dim) / mask.sum(dim=dim)  # Find the average!

We can implement a similar function for finding (say) max() along a specific dimension:

def masked_max(tensor, mask, dim):
    masked = torch.mul(tensor, mask)
    neg_inf = torch.zeros_like(tensor)
    neg_inf[~mask] = -math.inf  # Place the smallest values possible in masked positions
    return (masked + neg_inf).max(dim=dim)[0]

Simple, no? ūüôā

Specify Target File in git cherry-pick

Cherry-picking with git could easily become a nightmare if some files are renamed or the directory structure of the project is changed. In these cases, it is often helpful to cherry-pick the problematic files individually.

Fortunately, this can be done in a few easy steps:

  1. Create a patch file for that individual file: git show [commit hash] -- path/to/old/file > patchfile
  2. Manually edit the newly created patchfile and replace all occurrences of the old path with the new path.
  3. Apply the patch via git apply --3way patchfile

The --3way argument is very important and would tell git to do a 3-way comparison. After step 3, if git is unable to figure out how to merge the changes, it will add conflict markers to the resulting file and would allow one to manually resolve the conflicts.

The procedure above can be repeated for every individual file.

Use PyTorch’s DataLoader with Variable Length Sequences for LSTM/GRU

When I first started using PyTorch to implement recurrent neural networks (RNN), I faced a small issue when I was trying to use DataLoader in conjunction with variable-length sequences. What I specifically wanted to do was to automate the process of distributing training data among multiple graphics cards. Even though there are numerous examples online talking about how to do the actual padding, I couldn’t find any concrete example of using DataLoader in conjunction with padding, and my many-months old question on their forum is still left unanswered!!

The standard way of working with inputs of variable lengths is to pad all the sequences with zeros to make their lengths equal to the length of the largest sequence. This padding is done with the pad_sequence function. PyTorch’s RNN (LSTM, GRU, etc) modules are capable of working with inputs of a padded sequence type and intelligently ignore the zero paddings in the sequence.

If the goal is to train with mini-batches, one needs to pad the sequences in each batch. In other words, given a mini-batch of size N, if the length of the largest sequence is L, one needs to pad every sequence with a length of smaller than L with zeros and make their lengths equal to L. Moreover, it is important that the sequences in the batch are in the descending order.

To do proper padding with DataLoader, we can use the collate_fn argument to specify a class that performs the collation operation, which in our case is zero padding. The following is a minimal example of a collation class that does the padding we need:

import torch
import numpy as np

class PadSequence:
    def __call__(self, batch):
		# Let's assume that each element in "batch" is a tuple (data, label).
        # Sort the batch in the descending order
        sorted_batch = sorted(batch, key=lambda x: x[0].shape[0], reverse=True)
		# Get each sequence and pad it
        sequences = [x[0] for x in sorted_batch]
        sequences_padded = torch.nn.utils.rnn.pad_sequence(sequences, batch_first=True)
		# Also need to store the length of each sequence
		# This is later needed in order to unpad the sequences
		lengths = torch.LongTensor([len(x) for x in sequences])
		# Don't forget to grab the labels of the *sorted* batch
        labels = torch.LongTensor(map(lambda x: x[1], sorted_batch))
        return sequences_padded, lengths, labels

Note the importance of batch_first=True in my code above. By default, DataLoader assumes that the first dimension of the data is the batch number. Whereas, PyTorch’s RNN modules, by default, put batch in the second dimension (which I absolutely hate). Fortunately, this behavior can be changed for both the RNN modules and the DataLoader. I¬†personally always prefer to have the batch be the first dimension of the data.

With my code above, DataLoader instance is created as follows:,
                            ... more arguments ...,

The last remaining step here is to pass each batch to the RNN module during training/inference. This can be done by using the pack_padded_sequence function as follows:

from torch.nn.utils.rnn import pack_padded_sequence as PACK

class MyModel(nn.Module):
    def __init__():
        self.gru = nn.GRU(10, 20, 2, batch_first=True)  # Note that "batch_first" is set to "True"

    def forward(self, batch):
        x, x_lengths, _ = batch
        x_pack = PACK(x, x_lengths, batch_first=True)
        output, hidden = self.gru(x_pack)


OpenGL not Found, “cannot find -lGL”, and Other Issues with NVIDIA Drivers

Under Ubuntu,  sometimes when the NVIDIA drivers are installed, CMake cannot properly find OpenGL (specifically As a result, linking issues may occur.

The easiest fix for this is to first create a symlink of nvidia-current to nvidia-xxx where xxx is the version of the current NVIDIA driver installed (e.g. nvidia-387) . Then create a symlink from /usr/lib/ to /usr/lib/nvidia-current/

The command would be:

sudo ln -sfn /usr/lib/nvidia-xxx /usr/lib/nvidia-current
sudo ln -sfn /usr/lib/nvidia-current/ /usr/lib/

Note that having nvidia-current symlink is generally a good idea as it can prevent other linker errors when it comes to finding NVIDIA-specific libraries. Also, remember to update this symlink whenever NVIDIA drivers are also updated.

Regional and Locale Settings Affects Parsing Decimal Strings in C#

Yesterday, someone who was trying to use Jackknife, our gesture recognition library, reported that he couldn’t reproduce our results with our C# code. Naturally, we started looking into the issue.

Strangely enough, the code worked fine for me (both Debug and Release configurations). I then built the code under multiple versions of Visual Studio, tried both Windows 8 and Windows 10, and even built the code on Mono under Ubuntu. Everything was working as expected. He even sent me his compiled binaries so that I could try his build, yet to my surprise I was still getting correct outputs on all my machines.

Since our C++ code was working fine for him, I knew the issue was connected to .NET somehow. While trying to figure out what was causing this mind-boggling issue, I noticed that he was in Europe and I got this crazy idea that his operating system’s locale settings were interfering with something somewhere. As a last resort, I changed the regional settings on my computer to his region and voila! I was able to reproduce his problem!!

It turned out that in his region, decimal numbers are separated by commas, rather than periods, which is common in the US. Consequently, in that region the decimal number 123.45 is written as 123,45.¬†In our tool, we were manually parsing CSV files. Since we hadn’t thought of issues like this, we were naively calling double.Parse()¬†on strings containing decimals. Therefore, .NET was using the operating system’s locale settings to parse the string, giving completely different results than the ones we expected.

One easy hack to fix this issue was to force the locale of the current thread to “en-US” locale. The code excerpt below does exactly that:

System.Threading.Thread.CurrentThread.CurrentCulture = new System.Globalization.CultureInfo("en-US", false);

In our C++ code, however, we were using std::stod which does not care about locale settings, so it was working fine.

Orthonormalize a Rotation Matrix

If you use a 3×3 R matrix to store the result of the multiplication of a series of rotation transformations, it could be the case that sometimes you end up with a matrix that is not orthonormal (i.e. det(R) != 1 and R.R' != eye).

In such cases, you need to re-orthonormalize the rotation matrix, which can be done in either of the two following ways:

  1. Use the SVD decomposition as follows (MATLAB syntax):
    [u s vt] = svd(R);
    R = u * vt';
  2. Alternatively, you can express the rotation matrix as a quaternion, normalize the quaternion, then convert the quaternion back to the rotation matrix form. In MATLAB, you could do this:
    R = quat2rotm(quatnormalize(rotm2quat(R)));

    Note that the above syntax requires MATLAB’s robotics toolbox.

Visual Studio Setup Blocked or Can’t Uninstall

For some reason a Visual Studio 2015 Community Edition installation I had on one of my machines was experiencing problems: solution files were not loading properly, and what’s worse is that installation instance had gone totally missing from “Uninstall a Program” in Windows Control Panel. Also, every time I tried running the setup file vs_community.exe from the ISO image, I would get an error saying “The product version that you are trying to set up is earlier than the version already installed on this computer”.

What I ended up doing was to run the setup file with the two following switches:

vs_community.exe /uninstall /force

Luckily, it went forward with uninstalling the instance I had on my machine.

Practical Kinect Stereo Calibration for the Highest Accuracy

I’ve been meaning to write up this post for a while, but I’ve been putting it off ūüôā

The release of the Kinect sensor by Microsoft spawned a plethora of research in robotics, computer vision and many other fields. Many of these attempts involved using Kinect for purposes other that what it was originally meant for! That pretty much involves anything other than gesture recognition and skeleton tracking.

Even though Kinect is a very capable device, many people (including us!) don’t get the fact that Kinect was¬†simply not designed for being used as an all-purpose depth and RGB camera. This observation is further bolstered by the fact that the Kinect was not released with a public API, and the¬†official SDK which was released much later was missing a lot of functionality critical¬†to various computer vision applications. For instance, the SDK does not provide a built-in functionality for getting color camera and depth camera calibration values, extrinsics and distortion models. Rather conveniently, the SDK designers simply chose to provide the users¬†with simple functionality, like mapping pixels from world coordinates to RGB coordinates (but no way to do a backproject an image or depth point to ray). Needless to say that in many practical computer vision applications, one needs to constantly re-calibrate the camera in order to minimize the error caused by the mis-calibration of the camera.

Nevertheless, the choice by the SDK designers is understandable: there are proprietary algorithms and methods that are implemented in the Kinect software layer and it may not always be possible to give public access to them. Also, 3rd party open source libraries (such as libfreenect) have tried to reverse-engineer many of the innerworkings of the Kinect sensor and supply the end user with a set of needed functionalities.

With all this, let’s get started! (If you are impatient, feel free to jump to the list of tips I have compiled for you at the end of this post). I have also included PDFs containing checkerboards suitable for printing on large sheets, PVC or aluminum dibond.

Continue reading

Computing the Distance Between a 3D Point and a Pl√ľcker Line

In order to solve an optimization problem with the goal of reducing the distance between a bunch of 3D points and lines, I was looking for the correct way of finding the distance between 3D points and a Plucker line representation.

The Plucker line \(L\) passing through two lines \(A\) and \(B\) is defined as \(L = AB^T РBA^T\) (for more details refer to [1]). After a lot of looking, I found that there is a simple method for finding this distance in [2]. A direct quote from the paper:


A Plucker line \(L = (n, m)\) is described by a unit vector \(n\) and a
moment \(m\). This line representation allows to conveniently determine
the distance of a 3D point \(X\) to the line

$$d(X, L) = ||X \times n – m||_2$$

where \(\times\) denotes a cross product.


[1] Hartley, Richard, and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.

[2] Brox, Thomas, et al. “Combined region and motion-based 3D tracking of rigid and articulated objects.” IEEE Transactions on Pattern Analysis and Machine Intelligence 32.3 (2010): 402-415.