How do I use the NVIDIA Multi-Process Service (MPS) to run multiple non-MPI CUDA applications?

The necessary instructions are contained in the documentation for the MPS service. You’ll note that those instructions don’t really depend on or call out MPI, so there really isn’t anything MPI-specific about them. Here’s a walkthrough/example. Read section 2.3 of the above-linked documentation for various requirements and restrictions. I recommend using CUDA 7, 7.5, or … Read more
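For concreteness, here is a minimal sketch of an MPS client (the kernel name and sizes are mine). The key point is that the application needs nothing MPS-specific: per the MPS documentation, you put the GPU in EXCLUSIVE_PROCESS mode and start the control daemon from a shell (nvidia-smi -i 0 -c EXCLUSIVE_PROCESS, then nvidia-cuda-mps-control -d), after which several independent processes running ordinary CUDA code like this share the GPU:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A deliberately ordinary kernel: under MPS, kernels from several
// independent processes like this one can execute concurrently.
__global__ void spin(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)   // busy work, to create overlap
            data[i] = data[i] * 0.999f + 0.001f;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    spin<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();   // configuration errors would surface here
    printf("done\n");
    cudaFree(d);
    return 0;
}
```

Launch two or more instances of the binary in the background; with the MPS daemon running, their kernels can overlap on a single GPU instead of being serialized.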

How are 2D / 3D CUDA blocks divided into warps?

Threads are numbered in order within a block so that threadIdx.x varies the fastest, threadIdx.y the second fastest, and threadIdx.z the slowest. This is functionally the same as column-major ordering in multidimensional arrays. Warps are sequentially constructed from threads in this ordering. So the calculation for a 2D block is unsigned int … Read more
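As a sketch of that calculation (the variable names are mine, but the formula follows directly from the ordering just described), a kernel can recover its linear thread index and warp index like this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void show_warp_ids()
{
    // Linearize the 3D thread index: x varies fastest, then y, then z.
    unsigned int tid = threadIdx.x
                     + threadIdx.y * blockDim.x
                     + threadIdx.z * blockDim.x * blockDim.y;
    // Consecutive groups of warpSize (32) linear indices form one warp.
    unsigned int warp_id = tid / warpSize;
    unsigned int lane_id = tid % warpSize;
    if (lane_id == 0)   // report once per warp to keep output short
        printf("linear thread %u starts warp %u\n", tid, warp_id);
}

int main()
{
    dim3 block(8, 8, 2);   // 128 threads per block = 4 warps
    show_warp_ids<<<1, block>>>();
    cudaDeviceSynchronize();
    return 0;
}
```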

CUDA apps time out & fail after several seconds – how to work around this?

I’m not a CUDA expert; I’ve been developing with the AMD Stream SDK, which AFAIK is roughly comparable. You can disable the Windows watchdog timer, but that is strongly discouraged, for reasons that should be obvious. To disable it, use regedit to create a REG_DWORD value named DisableBugCheck under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display and set it to 1. You … Read more
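A workaround that avoids registry edits entirely (my suggestion; it is not visible in the truncated answer above) is to split long-running work into many short kernel launches, so that no single launch outlives the watchdog limit. A minimal sketch with a hypothetical kernel:

```cuda
#include <cuda_runtime.h>

// Hypothetical per-element work, sliced into many small passes.
__global__ void step(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 1.0001f + 1.0f;   // one slice of the total work
}

int main()
{
    const int n = 1 << 22;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // Many short launches instead of one long one: each kernel returns
    // to the driver quickly, so the display watchdog never trips.
    for (int pass = 0; pass < 1000; ++pass)
        step<<<(n + 255) / 256, 256>>>(d, n);

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```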

CUDA: How many concurrent threads in total?

The GTX 580 can have 16 * 48 concurrent warps (32 threads each) resident at a time. That is 16 multiprocessors (SMs) * 48 resident warps per SM * 32 threads per warp = 24,576 threads. Don’t confuse concurrency and throughput. The number above is the maximum number of threads whose resources can be stored … Read more
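Rather than hard-coding those numbers, you can query them per device; the arithmetic below mirrors the 16 * 48 * 32 calculation above, using the standard cudaGetDeviceProperties API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Maximum number of threads that can be resident on the GPU at once.
    int resident = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;

    printf("%s: %d SMs * %d resident threads/SM = %d concurrent threads\n",
           prop.name, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor, resident);
    printf("warp size %d -> %d resident warps\n",
           prop.warpSize, resident / prop.warpSize);
    return 0;
}
```

On a GTX 580 this reports 16 SMs * 1536 resident threads/SM = 24,576 threads, matching the figure above.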

Why are NVIDIA Pascal GPUs slow at running CUDA kernels when using cudaMallocManaged?

Under CUDA 8 with Pascal GPUs, managed memory data migration under a unified memory (UM) regime will generally occur differently than on previous architectures, and you are experiencing the effects of this. (Also see the note at the end about the updated CUDA 9 behavior on Windows.) With previous architectures (e.g. Maxwell), managed allocations used by a … Read more
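The usual remedy on Pascal is to migrate the managed data to the GPU before the kernel launch, so the kernel doesn’t pay per-page demand-faulting costs. A minimal sketch using cudaMemPrefetchAsync (the kernel and sizes are placeholders):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 24;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // pages now resident on the CPU

    int dev = 0;
    cudaGetDevice(&dev);
    // Move the pages to the GPU up front; without this, a Pascal GPU
    // demand-faults them in during the kernel, which is what slows it down.
    cudaMemPrefetchAsync(data, n * sizeof(float), dev);

    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```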

How do CUDA blocks/warps/threads map onto CUDA cores?

Two of the best references are the NVIDIA Fermi Compute Architecture Whitepaper and GF104 reviews. I’ll try to answer each of your questions. The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to a … Read more
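You can watch the block-to-SM distribution happen from inside a kernel. A sketch of my own (not from the truncated answer) that reads the %smid special register via inline PTX to report which SM each block was assigned to:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns the ID of the streaming multiprocessor the caller is running on.
__device__ unsigned int smid()
{
    unsigned int id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void where_am_i()
{
    if (threadIdx.x == 0)   // one report per thread block
        printf("block %d -> SM %u\n", blockIdx.x, smid());
}

int main()
{
    where_am_i<<<8, 64>>>();   // eight blocks for the distributor to place
    cudaDeviceSynchronize();
    return 0;
}
```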