Re: NAMD3 multiGPU: invalid device function error

From: David Hardy (dhardy_at_ks.uiuc.edu)
Date: Mon Feb 22 2021 - 16:26:01 CST

Hi Lorenzo,

The version of NAMD that you are running does require NVLink for multi-GPU support (see http://www.ks.uiuc.edu/Research/namd/alpha/3.0alpha/). Our most recent improvements to multi-GPU support no longer require NVLink; however, scaling performance suffers without it. The next release (alpha 9) will include this new multi-GPU support. Until we get new builds posted, you will need to build from the GitLab "devel" branch to try it out. You can get access to the NAMD GitLab repo by following the posted directions (https://gitlab.com/tcbgUIUC/namd).
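
If it is unclear whether the GPUs on a node are linked by NVLink, one way to check is the connectivity matrix printed by `nvidia-smi topo -m`, where NVLink connections appear as entries of the form NV followed by a link count (for example NV1 or NV2). The following is a minimal Python sketch for reading that matrix, not part of NAMD. The function name `nvlink_pairs` is a hypothetical choice for illustration, and the parsing assumes the usual whitespace-separated layout with a header row of GPU column labels.

```python
import re


def nvlink_pairs(topo_text):
    """Return the set of (i, j) GPU index pairs, i < j, whose entry in the
    `nvidia-smi topo -m` matrix is NV<n> (an NVLink connection).

    Assumption: the first non-blank line is the header row naming the GPU
    columns, and each GPU row starts with its own label (GPU0, GPU1, ...).
    """
    lines = [ln for ln in topo_text.splitlines() if ln.strip()]
    if not lines:
        return set()

    # Column order is taken from the header row; non-GPU columns are ignored.
    gpu_cols = [int(n) for n in re.findall(r"GPU(\d+)", lines[0])]

    pairs = set()
    for ln in lines[1:]:
        fields = ln.split()
        m = re.search(r"GPU(\d+)", fields[0])
        if m is None:
            continue  # legend lines, NIC rows, and so on
        row = int(m.group(1))
        # The first len(gpu_cols) entries after the label are the GPU columns.
        for col, entry in zip(gpu_cols, fields[1:]):
            if re.fullmatch(r"NV\d+", entry) and row < col:
                pairs.add((row, col))
    return pairs


# Possible usage on a GPU node (requires nvidia-smi on the PATH):
#   import subprocess
#   out = subprocess.run(["nvidia-smi", "topo", "-m"],
#                        capture_output=True, text=True).stdout
#   print(nvlink_pairs(out) or "no NVLink connections reported")
```

An empty result means that the matrix reports no NVLink connection between any pair of GPUs, as would be expected for GPUs that communicate only over PCIe.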

Best regards,
Dave

--
David J. Hardy, Ph.D.
Beckman Institute
University of Illinois at Urbana-Champaign
405 N. Mathews Ave., Urbana, IL 61801
dhardy_at_ks.uiuc.edu, http://www.ks.uiuc.edu/~dhardy/

> On Feb 19, 2021, at 9:24 PM, Lorenzo Casalino <lcasalino_at_ucsd.edu> wrote:
> 
> Hello,
> 
> I am trying to use the multiGPU version of NAMD3 (NAMD_3.0alpha7_Linux-x86_64-multicore-CUDA-MultiGPU-SingleNode) to run plain MD on 2 GPUs on a single node on a local cluster using the following command:
> 
> namd3 +p 2 +setcpuaffinity +idlepoll +devices 0,1 input.conf > input.log
> 
> I added the following keywords to my configuration file:
> - CUDASOAintegrate on
> - margin 4
> 
> From the log file, it looks like the 2 GPUs are seen and activated:
> Info: Built with CUDA version 10010
> Pe 1 physical rank 1 binding to CUDA device 1 on tscc-gpu-5-0.sdsc.edu: 'GeForce RTX 3090'  Mem: 24268MB  Rev: 8.6  PCI: 0:24:0
> Pe 0 physical rank 0 binding to CUDA device 0 on tscc-gpu-5-0.sdsc.edu: 'GeForce RTX 3090'  Mem: 24268MB  Rev: 8.6  PCI: 0:1:0
> 
> The startup phase finishes smoothly, and then, when the actual MD simulation starts, the following error is generated:
> 
> Info: Finished startup at 34.7205 s, 0 MB of memory in use
> 
> TCL: Running for 100000 steps
> FATAL ERROR: CUDA error cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, hgi, hgi, d_nHG, natoms, notZero(), stream) in file src/SequencerCUDAKernel.cu, function buildRattleLists, line 4461
> on Pe 1 (tscc-gpu-5-0.sdsc.edu device 1 pci 0:24:0): invalid device function
> 
> A single node of the cluster has 32 CPUs and 8 GPUs (GeForce RTX 3090).
> Note that the GPUs are NOT connected by NVLink.
> Finally, this is the PBS argument I use to add the GPUs to the environment: #PBS -l nodes=1:ppn=8:gpus=2:gpu3090
> 
> I was not able to work this error out. Is it possible that without NVLink I cannot use the multiGPU version?
> Any help or advice on this issue would be greatly appreciated.
> 
> Thank you.
> 
> Best regards,
> Lorenzo

This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:10 CST