Re: The HIP version of NAMD gets wrong results when computing on more than one node

From: Josh Vermaas (joshua.vermaas_at_gmail.com)
Date: Fri Jul 03 2020 - 05:54:33 CDT

Hi Zhang,

The configurations I tested before getting distracted by COVID research
were multicore builds and netlrts builds that split a single node
(networking wasn't working properly on our test setups). This was also in
the era of ROCm 3.3, and I see this morning that those old binaries don't
work with 3.5, so I'm still working to reproduce your result. Two things
I'd try in the interim:

1. Compile with clang. In my own testing, things work better when I use
clang (which really aliases to AMD's LLVM compiler) than with gcc.
2. Try the netlrts backend just as a sanity check. In my experience, ucx is
far from bulletproof, and this would help isolate whether it is a
HIP-specific issue or a ucx issue.
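
One rough way to compare runs (netlrts versus ucx, or one node versus two)
is to pull the total energy trend out of each log. Below is a minimal
Python sketch, assuming the usual NAMD log layout in which an ETITLE: line
names the columns and ENERGY: lines carry the values. The function names
are hypothetical helpers, not part of NAMD:

```python
def total_energies(log_text):
    """Return (timestep, total_energy) pairs parsed from NAMD log text.

    Assumes an 'ETITLE:' header naming the columns appears before the
    'ENERGY:' lines, and that one column is named exactly 'TOTAL'.
    """
    column = None
    samples = []
    for line in log_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ETITLE:" and column is None:
            # Exact match, so 'TOTAL3' is not picked up by mistake.
            column = fields.index("TOTAL")
        elif fields[0] == "ENERGY:" and column is not None:
            samples.append((int(fields[1]), float(fields[column])))
    return samples


def drift_per_step(samples):
    """Least-squares slope of total energy versus timestep."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two ENERGY samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_e = sum(e for _, e in samples) / n
    num = sum((t - mean_t) * (e - mean_e) for t, e in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


def drift_from_file(path):
    """Convenience wrapper: read a log file and return its energy slope."""
    with open(path) as handle:
        return drift_per_step(total_energies(handle.read()))
```

Running drift_from_file on the logs from two otherwise identical runs gives
one number per run to compare. A slope that is clearly larger for the
multi-node or ucx run than for the single-node or netlrts run would point
at the same steady energy increase described in the report.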

-Josh

On Fri, Jul 3, 2020 at 3:59 AM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn> wrote:

> Hello,
>
>
> I noticed that there is a HIP version of NAMD in the NAMD gerrit
> repository. I tried it with the apoa1 and stmv benchmarks. The results on
> a single node with multiple GPUs seem right, but when using more than one
> node, the total energy keeps increasing, and sometimes the computation
> even crashes because atoms are moving too fast. I used the
> ucx-linux-x86_64-ompipmix-smp build of charm-6.10.1. Could anyone give
> me some hints about this problem?
>
>
> Sincerely,
>
> Zhang
>
