Re: how to properly end NAMD replica job on slurm batch system

From: René Hafner TUK (hamburge_at_physik.uni-kl.de)
Date: Wed Mar 24 2021 - 09:57:26 CDT

Hi Josh,

     I use NAMD 2.14.

     When I forced a crash with 2 replicas, both replica logfiles ended
with an error/end message,

     whereas with 4 replicas *at least one* logfile had no error/end
message written at the end.

     Therefore I guess there is still one zombie process left.

     I want to test this by adding "top -b > file.txt" to my submission
script after the line "charmrun namd2...", but I need to wait until a
suitable node becomes available again.
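
One caveat with placing "top -b > file.txt" after the charmrun line: the
shell only reaches it once charmrun has returned, and without "-n" top
keeps running until it is killed. A minimal sketch of an alternative,
assuming a POSIX shell and that ps is available on the node: take process
snapshots in the background while the job runs and once more shortly after
it ends. The names run_with_snapshots and snapshots.txt and the "sleep 2"
stand-in are illustrative assumptions, not anything NAMD or Slurm provides.

```shell
#!/bin/sh
# Sketch: log process snapshots while a job command runs (names are illustrative).

# run_with_snapshots LOGFILE INTERVAL COMMAND [ARGS...]
run_with_snapshots() {
    logfile=$1
    interval=$2
    shift 2

    # Background loop: append a timestamped list of this user's processes.
    (
        while true; do
            date
            ps -u "$(id -un)" -o pid,ppid,stat,etime,args
            sleep "$interval"
        done
    ) >> "$logfile" 2>&1 &
    monitor_pid=$!

    # Run the actual job in the foreground and remember its exit status.
    "$@"
    job_status=$?

    # Allow one more snapshot after the job ended, then stop the monitor.
    sleep "$interval"
    kill "$monitor_pid" 2>/dev/null || true
    wait "$monitor_pid" 2>/dev/null || true
    return "$job_status"
}

# Stand-in for the real job; replace "sleep 2" with the charmrun namd2 line.
run_with_snapshots snapshots.txt 1 sleep 2
```

Processes that still show up in the last snapshot after the job command has
returned would be candidates for the leftover replica.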

Kind regards

René

On 3/24/2021 3:40 PM, Vermaas, Josh wrote:
>
> Hi Rene,
>
> Is this 2.13 or 2.14? I seem to recall that 2.13 (or maybe it was
> 2.12?) *didn’t* kill the other replicas when one replica received a
> termination signal, and so you might legitimately be running into an
> issue where there are zombie namd processes roaming around on slurm.
>
> I typically do not do anything special to clean up after a job
> crashes, since it is supposed to take itself down cleanly.
>
> -Josh
>
> *From: *<owner-namd-l_at_ks.uiuc.edu> on behalf of René Hafner TUK
> <hamburge_at_physik.uni-kl.de>
> *Reply-To: *"namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>, René Hafner TUK
> <hamburge_at_physik.uni-kl.de>
> *Date: *Wednesday, March 24, 2021 at 9:22 AM
> *To: *"namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>
> *Subject: *namd-l: how to properly end NAMD replica job on slurm batch
> system
>
> Dear NAMD Maintainers,
>
> I work on a cluster with the SLURM batch system.
>
> I am currently testing replica simulations and experience the issue
> that when the replica simulation ends with an error, or I cancel the
> job via scancel (since I am only testing...), the node gets "closed"
> with the error "*kill task failed*". (It then takes intervention by
> the cluster admins to reopen/reboot the node, but that's local policy,
> I guess.)
>
> Have you ever experienced this?
>
> Is there a way to safely end the replica runs even when an error occurs?
>
> Do I have to collect process IDs to kill the replica runs myself before
> the submission script (containing the call to charmrun namd2...) ends?
>
> Kind regards
> René
> --
> --
> Dipl.-Phys. René Hafner
> TU Kaiserslautern
> Germany
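
On the question in the quoted message about collecting process IDs: a
minimal sketch, assuming a POSIX shell and that the batch shell itself
receives the termination signal when the job is cancelled (whether it does
depends on the Slurm configuration and on how scancel is invoked). It only
covers the launcher process started by the script; replica processes that
the launcher spawns, possibly on other nodes, are not tracked here.
stop_pid, run_and_clean and the "sleep 1" stand-in are hypothetical names
for illustration, not part of NAMD or Slurm.

```shell
#!/bin/sh
# Sketch: keep the launcher's PID and stop it explicitly (names are illustrative).

# stop_pid PID [GRACE]: send SIGTERM, wait up to GRACE seconds, then send SIGKILL.
stop_pid() {
    target=$1
    grace=${2:-5}
    kill -TERM "$target" 2>/dev/null || return 0   # nothing to signal
    while [ "$grace" -gt 0 ] && kill -0 "$target" 2>/dev/null; do
        sleep 1
        grace=$((grace - 1))
    done
    kill -KILL "$target" 2>/dev/null || true        # no effect if it already exited
    wait "$target" 2>/dev/null || true              # reap it if it is our child
    return 0
}

# run_and_clean COMMAND [ARGS...]: run COMMAND, stopping it if this shell is signalled.
run_and_clean() {
    "$@" &
    job_pid=$!
    trap 'stop_pid "$job_pid"; exit 143' TERM INT
    wait "$job_pid"
    job_status=$?
    trap - TERM INT
    return "$job_status"
}

# Stand-in for the real job; replace "sleep 1" with the charmrun namd2 line.
run_and_clean sleep 1
```

The exit status of the wrapped command is passed through, so a crash of the
job is still visible to the batch system.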

-- 
--
Dipl.-Phys. René Hafner
TU Kaiserslautern
Germany
