Can anyone provide sample code demonstrating the use of 16 bit floating point in cuda?

There are a few things to note up-front: refer to the half-precision intrinsics. Many of these intrinsics are only supported in device code; however, in recent CUDA versions, most of the conversion intrinsics are supported in both host and device code. (@njuffa has also created a set of host-usable conversion functions here.) Therefore, … Read more
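
Below is a minimal sketch of the kind of usage the answer describes: a kernel that adds two __half arrays with the intrinsics from cuda_fp16.h. It assumes a GPU of compute capability 5.3 or newer (needed for half arithmetic such as __hadd) and a CUDA version whose conversion intrinsics (__float2half / __half2float) work in host code; the names and sizes are mine, not from the original answer.

#include <cstdio>
#include <cuda_fp16.h>

__global__ void half_add(const __half *a, const __half *b, __half *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = __hadd(a[i], b[i]);      // device-only half-precision add intrinsic
}

int main()
{
    const int n = 4;
    __half ha[n], hb[n], hc[n];
    for (int i = 0; i < n; i++) {
        ha[i] = __float2half(1.5f * i);        // host-side conversion intrinsic
        hb[i] = __float2half(0.25f * i);
    }
    __half *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(__half));
    cudaMalloc(&db, n * sizeof(__half));
    cudaMalloc(&dc, n * sizeof(__half));
    cudaMemcpy(da, ha, n * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(__half), cudaMemcpyHostToDevice);
    half_add<<<1, n>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, n * sizeof(__half), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
        printf("%f\n", __half2float(hc[i]));   // convert back to float for printing
    return 0;
}

Compile with an architecture flag that supports half arithmetic, e.g. nvcc -arch=sm_53 or newer.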

What kind of variables consume registers in CUDA?

The register allocation in PTX is completely irrelevant to the final register consumption of the kernel. PTX is only an intermediate representation of the final machine code and uses static single assignment form, meaning that each register in PTX is assigned to only once. A piece of PTX with hundreds of registers can compile into a … Read more
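
As an illustration of why the PTX count is misleading, here is a small sketch (my own, not from the answer) that asks the runtime for the register count actually chosen by the machine-code compiler via cudaFuncGetAttributes; the same number is reported at build time by nvcc -Xptxas -v.

#include <cstdio>

__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, saxpy);       // query the compiled (SASS) kernel
    printf("registers per thread: %d\n", attr.numRegs);
    return 0;
}

The PTX for a kernel can show many more virtual registers than attr.numRegs reports, because the real allocation happens when PTX is compiled to machine code.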

Copy an object to device?

Yes, you can copy an object to the device for use on the device. When the object has embedded pointers to dynamically allocated regions, the process requires some extra steps. See my answer here for a discussion of what is involved. That answer also has a few code samples linked to it. Also, in … Read more
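
Here is a minimal sketch of the general pattern with a made-up struct (MyObj, not from the linked answer): the embedded pointer has to be redirected to a device allocation before the object is usable in device code.

#include <cstdio>

struct MyObj {
    int    len;
    float *data;            // embedded pointer to dynamically allocated storage
};

__global__ void use_obj(MyObj o)
{
    int i = threadIdx.x;
    if (i < o.len) o.data[i] *= 2.0f;
}

int main()
{
    const int len = 8;
    float hbuf[len];
    for (int i = 0; i < len; i++) hbuf[i] = (float)i;
    MyObj h = {len, hbuf};

    // 1. allocate device storage for the embedded buffer and copy its contents
    float *dbuf;
    cudaMalloc(&dbuf, len * sizeof(float));
    cudaMemcpy(dbuf, h.data, len * sizeof(float), cudaMemcpyHostToDevice);

    // 2. make a copy of the object whose pointer refers to the device buffer
    MyObj d = h;
    d.data = dbuf;

    // the object is small here, so it is simply passed by value to the kernel;
    // a larger object could instead be cudaMemcpy'd into a device allocation
    use_obj<<<1, len>>>(d);

    cudaMemcpy(hbuf, dbuf, len * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < len; i++) printf("%f\n", hbuf[i]);
    return 0;
}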

How do I select which GPU to run a job on?

The problem was caused by not setting the CUDA_VISIBLE_DEVICES variable within the shell correctly. To specify CUDA device 1, for example, you would set CUDA_VISIBLE_DEVICES using export CUDA_VISIBLE_DEVICES=1 or CUDA_VISIBLE_DEVICES=1 ./cuda_executable. The former sets the variable for the life of the current shell, the latter only for the lifespan of that particular executable invocation. … Read more
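
One way to confirm the masking is to enumerate what the runtime can see; a short sketch of my own (not part of the answer):

#include <cstdio>

int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("visible devices: %d\n", n);
    for (int i = 0; i < n; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("  device %d: %s\n", i, prop.name);
    }
    return 0;
}

Run as CUDA_VISIBLE_DEVICES=1 ./a.out and the program reports a single device, renumbered as device 0, because the runtime only enumerates the devices the variable exposes.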

How can I check the progress of matrix multiplication?

Here is code that demonstrates how to check progress from a matrix multiply kernel: #include <stdio.h> #include <stdlib.h> #include <time.h> #define TIME_INC 100000000 #define INCS 10 #define USE_PROGRESS 1 #define MAT_DIMX 4000 #define MAT_DIMY MAT_DIMX #define cudaCheckErrors(msg) \ do { \ cudaError_t __err = cudaGetLastError(); \ if (__err != cudaSuccess) { \ fprintf(stderr, "Fatal … Read more
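
The full code is in the linked answer; the sketch below shows just the underlying technique with my own names and sizes: the kernel bumps a progress counter in mapped (zero-copy) pinned host memory, and the host polls that counter while the kernel runs asynchronously. It assumes a Linux host (for usleep) and a GPU that supports mapped host memory.

#include <cstdio>
#include <unistd.h>

__global__ void busy_kernel(volatile int *progress, int steps)
{
    for (int s = 0; s < steps; s++) {
        // busy-wait on the device clock to stand in for real work
        long long start = clock64();
        while (clock64() - start < 100000000LL) { }
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            *progress = s + 1;          // progress value the host will read
            __threadfence_system();     // make the write visible to the host
        }
    }
}

int main()
{
    const int steps = 10;
    volatile int *h_progress;
    int *d_progress;
    cudaSetDeviceFlags(cudaDeviceMapHost);                        // enable mapped host memory
    cudaHostAlloc((void **)&h_progress, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_progress, (int *)h_progress, 0);  // device view of the same allocation
    *h_progress = 0;

    busy_kernel<<<1, 256>>>(d_progress, steps);                   // launch is asynchronous

    while (*h_progress < steps) {                                 // poll while the kernel runs
        printf("progress: %d/%d\n", *h_progress, steps);
        usleep(100000);
    }
    cudaDeviceSynchronize();
    printf("done\n");
    return 0;
}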

How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?

The necessary instructions are contained in the documentation for the MPS service. You’ll note that those instructions don’t really depend on or call out MPI, so there really isn’t anything MPI-specific about them. Here’s a walkthrough/example. Read section 2.3 of the above-linked documentation for various requirements and restrictions. I recommend using CUDA 7, 7.5, or … Read more