ArrayFire is Now Open Source!

To my surprise, the CUDA library ArrayFire is now open source and licensed under BSD 3-Clause License which means that commercial use is permitted!

ArrayFire is a production-oriented library which greatly reduces CUDA application development time. The repository is hosted on GitHub and is located here.

Tutorial: Use CUDA and C++11 Code in MATLAB

As it turns out, incorporating CUDA code in MATLAB can be easily done! 🙂

MATLAB provides functionality for loading arbitrary dynamic libraries and invoking their functions. This makes it especially easy to invoke C/C++ code from a MATLAB program. This functionality is provided through the so-called MEX functions.


MEX functions are created with the mex command in MATLAB. Essentially, mex takes a C/C++ source file as input, invokes the default C/C++ compiler installed on the operating system (GCC or CL), and creates a MEX binary (a mexa64 file on 64-bit Linux) which can be used like any other MATLAB function.

The C/C++ file that is passed to mex must have the following included in it:

#include "mex.h" // The mex header containing the necessary interop definitions

/**
 * The "gateway" function which is the entry point for the MATLAB
 * function call (will be executed when the mex file is invoked in
 * MATLAB)
 */
void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    // Your code goes here
}


The arguments passed from MATLAB are accessible through the prhs parameter (an array of pointers to the right-hand-side arguments, i.e. the inputs). Any output that the gateway function generates is returned through the plhs parameter (an array of pointers to the left-hand-side arguments, i.e. the outputs). The number of arguments passed to the gateway function is stored in nrhs, and the number of outputs that the MATLAB code expects from the gateway function is stored in nlhs. From this point on, I refer to the file containing the above code as the mex gateway file, and to the mexFunction above as the gateway function.


CImg and NVIDIA’s NPP Interop

Apparently, NPP relies on the pixel order of its input arrays (they need to be interleaved). If you are planning on using CImg with NPP, be sure to check this post out before attempting to do so. Failing to permute CImg image axes will result in wrong filtered values for color images.

CImg does not store pixels in the interleaved format


Took me hours before I found and read the documentation.

CImg stores pixels in a planar format (RRRR…..GGGG…..BBBB). For most tasks in CUDA, it’s much better to store the pixel values in the interleaved format (RGBRGBRGB……).

In order to do that, just call the permute_axes method of the CImg object:

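CImg<unsigned char> image("image.jpg");
image.permute_axes("cxyz"); // channel axis first, i.e. interleaved (axis string assumed; verify against the CImg documentation)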


After the permutation, the width, height, depth and spectrum values that CImg reports will all change. To permute back (for displaying or saving), do this:

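CImg<unsigned char> result(dataPointer, spectrum, width, height, depth, false);
result.permute_axes("yzcx"); // restores the planar x, y, z, c order (axis string assumed; verify against the CImg documentation)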


Here spectrum, width, height and depth are the values saved before any permutation was applied to the axes. This undoes the change, and you can now safely save or display the image.
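To make the difference between the two layouts concrete, here is a minimal, library-free sketch of the index arithmetic involved. It does not use CImg or NPP; planarToInterleaved and interleavedToPlanar are hypothetical helper names I made up for illustration, and the sketch assumes a single 2D image of 8-bit samples.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of CImg or NPP): converts a planar buffer
// (RRRR...GGGG...BBBB) into an interleaved one (RGBRGBRGB...).
std::vector<unsigned char> planarToInterleaved(const std::vector<unsigned char>& planar,
                                               std::size_t width,
                                               std::size_t height,
                                               std::size_t channels)
{
    assert(planar.size() == width * height * channels);
    const std::size_t pixels = width * height;
    std::vector<unsigned char> interleaved(planar.size());
    for (std::size_t p = 0; p < pixels; ++p)
        for (std::size_t c = 0; c < channels; ++c)
            // Planar: channel c occupies its own block of `pixels` samples.
            // Interleaved: the samples of pixel p sit next to each other.
            interleaved[p * channels + c] = planar[c * pixels + p];
    return interleaved;
}

// Hypothetical inverse of the helper above.
std::vector<unsigned char> interleavedToPlanar(const std::vector<unsigned char>& interleaved,
                                               std::size_t width,
                                               std::size_t height,
                                               std::size_t channels)
{
    assert(interleaved.size() == width * height * channels);
    const std::size_t pixels = width * height;
    std::vector<unsigned char> planar(interleaved.size());
    for (std::size_t p = 0; p < pixels; ++p)
        for (std::size_t c = 0; c < channels; ++c)
            planar[c * pixels + p] = interleaved[p * channels + c];
    return planar;
}
```

For a 2x2 RGB image, calling planarToInterleaved(buffer, 2, 2, 3) moves the first red, green and blue samples into positions 0, 1 and 2 of the result. In practice permute_axes does this work for you; the sketch is only meant to show what changes in memory.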

CImg instance from interleaved array (bitmap):

Now imagine you want to initialize a CImg object with an interleaved bitmap (say an OpenGL texture or what have you). In this case, you need to know the width and height of the image as well as the number of components. Also assume the image is 2D, so the last size argument is 1. To create a CImg object from this array you can do the following (imageArray is the bitmap pointer):

cimg_library::CImg<unsigned char> result(imageArray, numChannels, imageWidth, imageHeight, 1, true);


Error: “incorrect inclusion of a cudart header file”

If you receive this error while compiling a CUDA program, it means that you have included a CUDA header file containing CUDA-specific qualifiers (such as __device__) in a *.cpp file.

CUDA header files with such qualifiers should ONLY be included in *.cu files.

This happened to me when I had #include <common_functions.h> in my *.cpp file. Note that having this in a header file that is itself included by a *.cpp file will also result in the same error.
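For your own headers that need to be shared between *.cu and *.cpp files, one common workaround (not something this error message suggests, and not a fix for including common_functions.h itself) is to hide the qualifiers behind a macro that is only expanded when nvcc is compiling. The sketch below assumes a hypothetical header with a hypothetical HOST_DEVICE macro and mixValues function; it relies on the __CUDACC__ macro, which nvcc defines when compiling CUDA source files.

```cpp
// shared_math.h (hypothetical header included from both .cu and .cpp files)
#ifndef SHARED_MATH_H
#define SHARED_MATH_H

// nvcc defines __CUDACC__ when compiling CUDA sources, so the qualifiers
// only appear when the compiler actually understands them.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical example function: linear interpolation between a and b.
// Callable from host code everywhere, and from device code under nvcc.
HOST_DEVICE inline float mixValues(float a, float b, float t)
{
    return a + t * (b - a);
}

#endif // SHARED_MATH_H
```

With this guard, a plain C++ compiler sees an ordinary inline function, so the header no longer drags CUDA-specific qualifiers into *.cpp files.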

Enable C++11 Support for CUDA Compiler (NVCC) – CUDA 6.5+

To enable support for C++11 in nvcc just add the switch -std=c++11 to nvcc.

If you are using Nsight Eclipse, right click on your project, go to Properties > Build > Settings > Tool Settings > NVCC Compiler and in the “Command line prompt” section add -std=c++11

C++11 code should now compile successfully with nvcc. Nsight’s C++ indexer will also work fine.

Automount NTFS Partitions with All Permissions

Some things really need to be burned onto the inside of my skull, since I forget them ALL the time. This is especially true of Linux commands for trivial tasks. Automounting NTFS partitions with execution permission in Linux is one of those things for me. Here’s how to do it in Linux Mint (or probably any other Debian-based Linux distro).

1) Find the UUID of your partition by running

$ blkid


2) Add the following line to the file /etc/fstab

UUID=<xxxxx> /media/[whatever] ntfs rw,auto,users,exec,nls=utf8,umask=000,gid=46,uid=1000    0   0


3) Run the following command to verify everything is working fine

$ sudo mount -a


You can verify the uid for your user by running

$ id


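Note the option umask=000. The umask lists the permission bits to remove, so a value of 000 removes nothing: every file gets read, write and execute permission for all users.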
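If the umask arithmetic is not obvious, the small sketch below spells it out. It is only an illustration of the bit masking, under the assumption that the mount option clears the masked bits from a full 0777 mode; effectiveMode and modeString are hypothetical helper names, not part of any system API.

```cpp
#include <string>

// Hypothetical helper: permission bits that remain after applying a umask.
// Every bit set in the mask is removed from the full mode 0777.
constexpr unsigned effectiveMode(unsigned mask)
{
    return 0777u & ~mask;
}

// Hypothetical helper: renders the nine permission bits the way `ls -l` does.
std::string modeString(unsigned mode)
{
    const char symbols[] = "rwxrwxrwx";
    std::string out(9, '-');
    for (int i = 0; i < 9; ++i)
        if (mode & (1u << (8 - i)))   // bit 8 is owner-read, bit 0 is other-execute
            out[i] = symbols[i];
    return out;
}
```

Here effectiveMode(0000) gives 0777, which modeString renders as rwxrwxrwx, while a more conservative mask such as 0022 gives 0755, i.e. rwxr-xr-x.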


NPP’s Convolution with Border Control Only Partially Implemented

One thing I discovered yesterday is that the image convolution filters with border control in NPP (such as nppiFilterBorder_8u) are only partially implemented! This family of functions is claimed to provide border control for the convolution, thus serving as a robust alternative to the regular image convolution functions in NPP (such as nppiFilter_8u). The catch is that the border control only partially works.

The documentation on these functions is scarce. These functions expect an argument of type NppiBorderType to define their border treatment. Possible options are:

NPP_BORDER_NONE: no border treatment
NPP_BORDER_CONSTANT: (probably) assume constant values at out of bounds pixels
NPP_BORDER_REPLICATE: replicate edge pixels and use them as values for out of bounds pixels
NPP_BORDER_WRAP: round-robin treatment of borders


My experiments showed that the only working option is NPP_BORDER_REPLICATE. Any other option results in the NppStatus error code -9999 (NPP_NOT_SUPPORTED_MODE_ERROR), for which, again, I have not found any documentation.

Seeing as the performance of the border-controlled convolutions is inferior to that of the box filter function (using large mask sizes), my assumption is that NPP_BORDER_REPLICATE uses the nppiCopyConstBorder_8u function to implement its border control.

If behaviors other than replication are desired, one option is to implement the border control manually.
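As a rough idea of what manual border control involves, here is a CPU-side sketch for a single-channel 8-bit image. It is not NPP code and makes no claim about how NPP implements its borders: Border, fetchPixel and convolveWithBorder are hypothetical names, the kernel is assumed to be anchored at its center, and my reading of "wrap" as periodic indexing is an assumption. The same coordinate handling could be moved into a CUDA kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical border modes, named after the NppiBorderType options.
enum class Border { Constant, Replicate, Wrap };

// Reads pixel (x, y) of a single-channel image, resolving out-of-bounds
// coordinates according to the requested border mode.
unsigned char fetchPixel(const std::vector<unsigned char>& img,
                         int width, int height, int x, int y,
                         Border mode, unsigned char constant)
{
    const bool inside = x >= 0 && x < width && y >= 0 && y < height;
    if (!inside) {
        if (mode == Border::Constant)
            return constant;                          // fixed value outside the image
        if (mode == Border::Replicate) {
            x = std::max(0, std::min(x, width - 1));  // clamp to the nearest edge pixel
            y = std::max(0, std::min(y, height - 1));
        } else {
            x = ((x % width) + width) % width;        // periodic (wrap-around) indexing
            y = ((y % height) + height) % height;
        }
    }
    return img[static_cast<std::size_t>(y) * width + x];
}

// Convolves img with a kw x kh kernel anchored at its center (an assumption).
std::vector<float> convolveWithBorder(const std::vector<unsigned char>& img,
                                      int width, int height,
                                      const std::vector<float>& kernel,
                                      int kw, int kh,
                                      Border mode, unsigned char constant = 0)
{
    assert(img.size() == static_cast<std::size_t>(width) * height);
    assert(kernel.size() == static_cast<std::size_t>(kw) * kh);
    std::vector<float> out(img.size());
    const int ax = kw / 2, ay = kh / 2;               // kernel anchor
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int ky = 0; ky < kh; ++ky)
                for (int kx = 0; kx < kw; ++kx)
                    // Convolution flips the kernel, hence the minus signs.
                    sum += kernel[static_cast<std::size_t>(ky) * kw + kx] *
                           fetchPixel(img, width, height,
                                      x + ax - kx, y + ay - ky, mode, constant);
            out[static_cast<std::size_t>(y) * width + x] = sum;
        }
    return out;
}
```

For example, convolving the one-row image {10, 20, 30} with the kernel {1, 1, 1} gives {30, 60, 50} with a constant border of 0, {40, 60, 80} with replication, and {60, 60, 60} with wrapping.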



NPP’s Box Filter (nppiFilterBox) is Broken

Surprisingly, the box filter function (nppiFilterBox_8u) that ships with CUDA as part of the NPP library is broken! It is the same function that is used in the “Box Filter with NPP” sample.

If you import this sample from the CUDA SDK and try it with masks of size 13 and above, the filter produces garbage output (tested with CUDA 6.5). At this point, I have no idea why this is happening or why such a simple filter would not work for larger mask sizes. An alternative would be to use the convolution filters (such as nppiFilter_8u).


EDIT (12/5/2014): I reported this bug to NVIDIA, and today I received an email indicating that the bug has been fixed and that the fix will be available in the next version of the CUDA toolkit.

Blog Created

Seeing how often programmers struggle with the same issue twice, I decided to start this blog. I will note here the problems I encounter while coding, so that when I, or other programmers, encounter them again, the solution is already available somewhere.

I will note the issues that required more than a simple Google search to solve.

Never get stuck on the same issue twice! 🙂