Surprisingly, the box filter function (nppiFilterBox_8u) that is shipped with CUDA as a part of the NPP library is broken! It is the same function that is used in the “Box Filter with NPP” sample.
If you import this sample from the CUDA SDK and try it with masks of size 13 an above, the filter produces garbage output (tested with CUDA 6.5). At this point, I have no idea why this is happening or why such simple filter may not work for larger mask sizes. An alternative would be to use the convolution filters (such as nppiFilter_8u).
EDIT (12/5/2014): I reported this bug to NVIDIA and today I received an email indicating that this bugs was now fixed and the fixed version will be available in the next version of the CUDA toolkit.