Can anyone provide sample code demonstrating the use of 16 bit floating point in cuda?

There are a few things to note up-front: refer to the half-precision intrinsics. Many of these intrinsics are only supported in device code; however, in recent CUDA versions, most of the conversion intrinsics are supported in both host and device code. (@njuffa has also created a set of host-usable conversion functions here.) Therefore, … Read more
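
Below is a minimal sketch of the kind of usage the answer describes: a kernel that adds two __half arrays with the intrinsics from cuda_fp16.h. It assumes a GPU of compute capability 5.3 or newer (needed for half arithmetic such as __hadd) and a CUDA version whose conversion intrinsics (__float2half / __half2float) work in host code; the names and sizes are mine, not from the original answer.

#include <cstdio>
#include <cuda_fp16.h>

__global__ void half_add(const __half *a, const __half *b, __half *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = __hadd(a[i], b[i]);      // device-only half-precision add intrinsic
}

int main()
{
    const int n = 4;
    __half ha[n], hb[n], hc[n];
    for (int i = 0; i < n; i++) {
        ha[i] = __float2half(1.5f * i);        // host-side conversion intrinsic
        hb[i] = __float2half(0.25f * i);
    }
    __half *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(__half));
    cudaMalloc(&db, n * sizeof(__half));
    cudaMalloc(&dc, n * sizeof(__half));
    cudaMemcpy(da, ha, n * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(__half), cudaMemcpyHostToDevice);
    half_add<<<1, n>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, n * sizeof(__half), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
        printf("%f\n", __half2float(hc[i]));   // convert back to float for printing
    return 0;
}

Compile with an architecture flag that supports half arithmetic, e.g. nvcc -arch=sm_53 or newer.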

What kind of variables consume registers in CUDA?

The register allocation in PTX is completely irrelevant to the final register consumption of the kernel. PTX is only an intermediate representation of the final machine code and uses static single assignment form, meaning that each register in PTX is assigned to only once. A piece of PTX with hundreds of registers can compile into a … Read more
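
As an illustration of why the PTX count is misleading, here is a small sketch (my own, not from the answer) that asks the runtime for the register count actually chosen by the machine-code compiler via cudaFuncGetAttributes; the same number is reported at build time by nvcc -Xptxas -v.

#include <cstdio>

__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, saxpy);       // query the compiled (SASS) kernel
    printf("registers per thread: %d\n", attr.numRegs);
    return 0;
}

The PTX for a kernel can show many more virtual registers than attr.numRegs reports, because the real allocation happens when PTX is compiled to machine code.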

Copy an object to device?

Yes, you can copy an object to the device for use on the device. When the object has embedded pointers to dynamically allocated regions, the process requires some extra steps. See my answer here for a discussion of what is involved. That answer also has a few code samples linked to it. Also, in … Read more
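
Here is a minimal sketch of the general pattern with a made-up struct (MyObj, not from the linked answer): the embedded pointer has to be redirected to a device allocation before the object is usable in device code.

#include <cstdio>

struct MyObj {
    int    len;
    float *data;            // embedded pointer to dynamically allocated storage
};

__global__ void use_obj(MyObj o)
{
    int i = threadIdx.x;
    if (i < o.len) o.data[i] *= 2.0f;
}

int main()
{
    const int len = 8;
    float hbuf[len];
    for (int i = 0; i < len; i++) hbuf[i] = (float)i;
    MyObj h = {len, hbuf};

    // 1. allocate device storage for the embedded buffer and copy its contents
    float *dbuf;
    cudaMalloc(&dbuf, len * sizeof(float));
    cudaMemcpy(dbuf, h.data, len * sizeof(float), cudaMemcpyHostToDevice);

    // 2. make a copy of the object whose pointer refers to the device buffer
    MyObj d = h;
    d.data = dbuf;

    // the object is small here, so it is simply passed by value to the kernel;
    // a larger object could instead be cudaMemcpy'd into a device allocation
    use_obj<<<1, len>>>(d);

    cudaMemcpy(hbuf, dbuf, len * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < len; i++) printf("%f\n", hbuf[i]);
    return 0;
}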

How do I select which GPU to run a job on?

The problem was caused by not setting the CUDA_VISIBLE_DEVICES variable within the shell correctly. To specify CUDA device 1, for example, you would set CUDA_VISIBLE_DEVICES using export CUDA_VISIBLE_DEVICES=1 or CUDA_VISIBLE_DEVICES=1 ./cuda_executable. The former sets the variable for the life of the current shell, the latter only for the lifespan of that particular executable invocation. … Read more
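
One way to confirm the masking is to enumerate what the runtime can see; a short sketch of my own (not part of the answer):

#include <cstdio>

int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("visible devices: %d\n", n);
    for (int i = 0; i < n; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("  device %d: %s\n", i, prop.name);
    }
    return 0;
}

Run as CUDA_VISIBLE_DEVICES=1 ./a.out and the program reports a single device, renumbered as device 0, because the runtime only enumerates the devices the variable exposes.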

How can I check the progress of matrix multiplication?

Here is code that demonstrates how to check progress from a matrix multiply kernel: #include <stdio.h> #include <stdlib.h> #include <time.h> #define TIME_INC 100000000 #define INCS 10 #define USE_PROGRESS 1 #define MAT_DIMX 4000 #define MAT_DIMY MAT_DIMX #define cudaCheckErrors(msg) \ do { \ cudaError_t __err = cudaGetLastError(); \ if (__err != cudaSuccess) { \ fprintf(stderr, "Fatal … Read more
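
The full code is in the linked answer; the sketch below shows just the underlying technique with my own names and sizes: the kernel bumps a progress counter in mapped (zero-copy) pinned host memory, and the host polls that counter while the kernel runs asynchronously. It assumes a Linux host (for usleep) and a GPU that supports mapped host memory.

#include <cstdio>
#include <unistd.h>

__global__ void busy_kernel(volatile int *progress, int steps)
{
    for (int s = 0; s < steps; s++) {
        // busy-wait on the device clock to stand in for real work
        long long start = clock64();
        while (clock64() - start < 100000000LL) { }
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            *progress = s + 1;          // progress value the host will read
            __threadfence_system();     // make the write visible to the host
        }
    }
}

int main()
{
    const int steps = 10;
    volatile int *h_progress;
    int *d_progress;
    cudaSetDeviceFlags(cudaDeviceMapHost);                        // enable mapped host memory
    cudaHostAlloc((void **)&h_progress, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_progress, (int *)h_progress, 0);  // device view of the same allocation
    *h_progress = 0;

    busy_kernel<<<1, 256>>>(d_progress, steps);                   // launch is asynchronous

    while (*h_progress < steps) {                                 // poll while the kernel runs
        printf("progress: %d/%d\n", *h_progress, steps);
        usleep(100000);
    }
    cudaDeviceSynchronize();
    printf("done\n");
    return 0;
}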

How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?

The necessary instructions are contained in the documentation for the MPS service. You’ll note that those instructions don’t really depend on or call out MPI, so there really isn’t anything MPI-specific about them. Here’s a walkthrough/example. Read section 2.3 of the above-linked documentation for various requirements and restrictions. I recommend using CUDA 7, 7.5, or … Read more