cuda – Page 2 – Make Me Engineer

Which Compute Capability is supported by which CUDA versions?

April 21, 2023 by Tarik

CUDA Version     Min CC   Deprecated CC   Default CC   Max CC
5.5 (and prior)  1.0      N/A             1.0          ?
6.0              1.0      1.0             1.0          ?
6.5              1.1      1.x             2.0          ?
7.x              2.0      N/A             2.0          ?
8.0              2.0      2.x             2.0          6.2
9.x              3.0      N/A             3.0          7.0
10.x             3.0 *    N/A             3.0          7.5
11.x             3.5 †    3.x             5.2          11.0:8.0, 11.1:8.6, … Read more

nvidia-smi Volatile GPU-Utilization explanation?

November 26, 2022 by Tarik

It is a sampled measurement over a time period. For a given time period, it reports what percentage of time one or more GPU kernel(s) was active (i.e. running). It doesn’t tell you anything about how many SMs were used, or how “busy” the code was, or what it was doing exactly, or in what … Read more

What is a bank conflict? (Doing Cuda/OpenCL programming)

November 26, 2022 by Tarik

For NVIDIA (and AMD, for that matter) GPUs, the local memory (called shared memory in CUDA) is divided into memory banks. Each bank can only address one dataset at a time, so if several threads of a half-warp try to load or store data in the same bank, the accesses have to be serialized (this is a bank conflict). For GT200 GPUs there are 16 banks … Read more
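To make the serialization concrete, here is a minimal host-side sketch of how a strided access pattern maps onto banks. It assumes 16 banks, each one 32-bit word wide, with consecutive words falling into consecutive banks, and it models a 16-thread half-warp. The names `bank_of` and `conflict_degree` are hypothetical helpers for this illustration, not part of any CUDA or OpenCL API.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Assumptions for this sketch: 16 banks, each 4 bytes (one 32-bit word) wide,
// and a half-warp of 16 threads. Real hardware parameters vary by GPU.
constexpr std::size_t kNumBanks = 16;
constexpr std::size_t kBankWidthBytes = 4;
constexpr std::size_t kHalfWarpSize = 16;

// Bank holding the 32-bit word at the given byte offset into shared memory.
std::size_t bank_of(std::size_t byte_offset) {
    return (byte_offset / kBankWidthBytes) % kNumBanks;
}

// Largest number of half-warp threads that land in one bank when thread t
// accesses the 32-bit word at index t * stride_words.
// 1 means conflict-free; n means an n-way conflict (n serialized accesses).
std::size_t conflict_degree(std::size_t stride_words) {
    assert(stride_words > 0);  // stride 0 (every thread, same word) is not modeled
    std::array<std::size_t, kNumBanks> hits{};  // zero-initialized counters
    for (std::size_t t = 0; t < kHalfWarpSize; ++t) {
        ++hits[bank_of(t * stride_words * kBankWidthBytes)];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, a stride of 1 word gives every thread its own bank, a stride of 2 makes pairs of threads share a bank, and a stride of 16 sends all 16 threads to bank 0. Odd strides stay conflict-free because they share no common factor with the bank count.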

How to get the CUDA version?

November 24, 2022 by Tarik

As Jared mentions in a comment, from the command line: nvcc --version (or /usr/local/cuda/bin/nvcc --version) gives the CUDA compiler version (which matches the toolkit version). From application code, you can query the runtime API version with cudaRuntimeGetVersion() or the driver API version with cudaDriverGetVersion(). As Daniel points out, deviceQuery is an SDK sample app that … Read more
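Both cudaRuntimeGetVersion() and cudaDriverGetVersion() report the version as a single integer encoded as 1000 * major + 10 * minor (for example, 11020 for CUDA 11.2). The sketch below decodes such a value without needing the CUDA toolkit to compile. `CudaVersion`, `decode_cuda_version`, and `version_string` are hypothetical names chosen for this illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper type for this sketch; not part of the CUDA API.
struct CudaVersion {
    int major_part;
    int minor_part;
};

// Decode an integer of the form 1000 * major + 10 * minor, which is the
// encoding used by cudaRuntimeGetVersion() and cudaDriverGetVersion().
CudaVersion decode_cuda_version(int encoded) {
    assert(encoded >= 0);
    return CudaVersion{encoded / 1000, (encoded % 1000) / 10};
}

// Format the decoded version as "major.minor".
std::string version_string(const CudaVersion& v) {
    return std::to_string(v.major_part) + "." + std::to_string(v.minor_part);
}
```

In a real program the encoded value would come from a call such as cudaRuntimeGetVersion(&value), and passing it through `decode_cuda_version` and `version_string` yields a readable string like "11.2".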

What can I do against ‘CUDA driver version is insufficient for CUDA runtime version’?

November 22, 2022 by Tarik

Update your NVIDIA driver. At the moment you have a driver that only supports CUDA 6 or lower, and you are trying to use the CUDA 7.0 toolkit with it.
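The condition behind this error can be sketched as a comparison of two version integers: the highest CUDA version the installed driver supports (as reported by cudaDriverGetVersion()) and the runtime version the application was built against (as reported by cudaRuntimeGetVersion()). The function name `driver_is_sufficient` is a hypothetical helper for illustration, and the sketch ignores any compatibility mechanisms newer toolkits may offer.

```cpp
#include <cassert>

// Both arguments use the 1000 * major + 10 * minor encoding returned by
// cudaDriverGetVersion() and cudaRuntimeGetVersion().
// Returns true when the driver supports at least the runtime's CUDA version.
bool driver_is_sufficient(int driver_version, int runtime_version) {
    assert(driver_version >= 0 && runtime_version >= 0);
    return driver_version >= runtime_version;
}
```

With a driver that supports CUDA 6.0 (6000) and a 7.0 runtime (7000), the check fails, which matches the situation described in the answer.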

How are 2D / 3D CUDA blocks divided into warps?

November 22, 2022 by Tarik

Threads are numbered in order within blocks so that threadIdx.x varies the fastest, threadIdx.y the second fastest, and threadIdx.z the slowest. This is functionally the same as column-major ordering in multidimensional arrays. Warps are constructed sequentially from threads in this ordering. So the calculation for a 2D block is unsigned int … Read more
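The ordering described here can be sketched on the host side as follows. This is not the original answer's code (which is cut off in the excerpt) but a minimal illustration that assumes a warp size of 32. The function names are hypothetical, and the parameters stand in for threadIdx.x/y/z and blockDim.x/y.

```cpp
#include <cassert>

// Assumption for this sketch: 32 threads per warp.
constexpr unsigned kWarpSize = 32;

// Linear index of a thread within its block: x varies fastest, then y, then z.
// tx, ty, tz stand in for threadIdx.{x,y,z}; dim_x, dim_y for blockDim.{x,y}.
unsigned linear_thread_id(unsigned tx, unsigned ty, unsigned tz,
                          unsigned dim_x, unsigned dim_y) {
    assert(dim_x > 0 && dim_y > 0);
    assert(tx < dim_x && ty < dim_y);
    return tx + ty * dim_x + tz * dim_x * dim_y;
}

// Index of the warp (within the block) that contains the thread.
unsigned warp_id(unsigned linear_id) { return linear_id / kWarpSize; }

// Position of the thread inside its warp (0..31).
unsigned lane_id(unsigned linear_id) { return linear_id % kWarpSize; }
```

For a 16 x 16 block, rows y = 0 and y = 1 together fill warp 0, so thread (0, 2) is the first lane of warp 1. Inside a kernel the same arithmetic would use threadIdx and blockDim directly.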

CUDA apps time out & fail after several seconds – how to work around this?

November 20, 2022 by Tarik

I’m not a CUDA expert; I’ve been developing with the AMD Stream SDK, which AFAIK is roughly comparable. You can disable the Windows watchdog timer, but that is strongly discouraged, for reasons that should be obvious. To disable it, you need to use regedit to edit HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display\DisableBugCheck, create a REG_DWORD and set it to 1. You … Read more

What is the purpose of using multiple “arch” flags in Nvidia’s NVCC compiler?

November 10, 2022 by Tarik

Roughly speaking, the code compilation flow goes like this: CUDA C/C++ device code source --> PTX --> SASS. The virtual architecture (e.g. compute_20, whatever is specified by -arch compute…) determines what type of PTX code will be generated. The additional switches (e.g. -code sm_21) determine what type of SASS code will be generated. SASS is … Read more

CUDA: How many concurrent threads in total?

November 10, 2022 by Tarik

The GTX 580 can have 16 * 48 concurrent warps (32 threads each) running at a time. That is 16 multiprocessors (SMs) * 48 resident warps per SM * 32 threads per warp = 24,576 threads. Don’t confuse concurrency and throughput. The number above is the maximum number of threads whose resources can be stored … Read more
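The arithmetic above can be written as a small helper. The figures (16 SMs, 48 resident warps per SM, 32 threads per warp) are the ones quoted for the GTX 580, and the function name `max_resident_threads` is a hypothetical choice for this sketch.

```cpp
#include <cassert>

// Maximum number of threads that can be resident on the device at once:
// SM count * resident warps per SM * threads per warp.
unsigned max_resident_threads(unsigned sm_count,
                              unsigned resident_warps_per_sm,
                              unsigned threads_per_warp) {
    assert(threads_per_warp > 0);
    return sm_count * resident_warps_per_sm * threads_per_warp;
}
```

Calling `max_resident_threads(16, 48, 32)` reproduces the 24,576 figure. As the answer notes, this is a residency limit, not a measure of how many threads execute in the same clock cycle.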

In CUDA, what is memory coalescing, and how is it achieved?

November 8, 2022 by Tarik

It’s likely that this information applies only to compute capability 1.x, or CUDA 2.0. More recent architectures and CUDA 3.0 have more sophisticated global memory access, and in fact “coalesced global loads” are not even profiled for these chips. Also, this logic can be applied to shared memory to avoid bank conflicts. A coalesced memory … Read more