Re: Job stopped without any error message (probably a load balancing issue?)

From: Natalia Ostrowska (n.ostrowska_at_cent.uw.edu.pl)
Date: Fri Oct 15 2021 - 04:18:40 CDT

I think that if you managed to run a few steps, there has to be something
in the output other than load balancing lines. Try something like
grep -v LDB file.out | grep -v MAX, and so on, to cut those lines out;
then it may be easier to search.
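
If chaining grep gets awkward, the same filtering can be sketched in Python. This is only an illustration: the function name and the sample lines are mine, and the "Warning:" line is made up to stand in for whatever real message may be buried in the log.

```python
def strip_load_balancer_lines(lines):
    """Return the log lines that are not load-balancer chatter."""
    # Prefixes of the lines that dominate the pasted output.
    noisy_prefixes = ("LDB:", "Info: useSync")
    return [line for line in lines
            if not line.lstrip().startswith(noisy_prefixes)]


sample = [
    "LDB: ============= START OF LOAD BALANCING ============== 47128",
    "LDB: Reverting to original mapping",
    "Info: useSync: 0 useProxySync: 0",
    "Warning: hypothetical message that would otherwise be buried",
    "LDB: =============== DONE WITH MIGRATION ================ 47128",
]
kept = strip_load_balancer_lines(sample)
print(kept)
```

On a real log you would pass the open file instead of the list, for example strip_load_balancer_lines(open("file.out")).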

The combination of Amber files, a CG model and NAMD may be very hard to
debug - and pretty much impossible for me (I use NAMD + CHARMM
exclusively), so I'm going to leave this job to others.

But again, a .conf file and a description of the CG model would be
extremely helpful for anyone trying to help: at least the size of the
pseudo-atoms and the maximum lengths of the bonds.
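
A rough way to get the maximum bond length is sketched below. It assumes you have already extracted the bead coordinates (unwrapped, so that no bond crosses a periodic boundary) and the bonded index pairs from your own files; the function name and the toy numbers are hypothetical and are not taken from SIRAH.

```python
import math


def longest_bond(coords, bonds):
    """Return (length, (i, j)) for the longest bond.

    coords: list of (x, y, z) tuples, one per pseudo-atom.
    bonds:  list of (i, j) index pairs into coords.
    """
    # Tuples compare by their first element, so max() picks the largest length.
    return max((math.dist(coords[i], coords[j]), (i, j)) for i, j in bonds)


# Toy data only: three beads on a bent chain.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
toy_bonds = [(0, 1), (1, 2)]
length, pair = longest_bond(toy_coords, toy_bonds)
print(length, pair)
```

Running this over a few trajectory frames, rather than only the starting structure, would show whether any bond stretches far beyond what the model intends.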

In the meantime, you can do a quick test with the 1-4 exclusion turned off
completely (in your conf file). I know it sounds irrelevant, but this is
the place where sizes and lengths in your model can matter.

Another important question: does the simulation crash during the
thermalization / equilibration steps, just after them, or later in the
production part?
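
To answer that, it helps to know the last step NAMD actually reported before it stopped. NAMD prints the timestep as the first number on each ENERGY: line, so a small sketch like the one below can pull it out of the log. The function name and the sample values are invented for illustration; they are not from the log in this thread.

```python
def last_energy_step(lines):
    """Return the timestep of the last ENERGY: line, or None if there is none."""
    last = None
    for line in lines:
        fields = line.split()
        # ETITLE: lines carry the text "TS" here, so isdigit() skips them.
        if len(fields) >= 2 and fields[0] == "ENERGY:" and fields[1].isdigit():
            last = int(fields[1])
    return last


# Made-up excerpt with the same layout as a NAMD log.
sample_log = [
    "ETITLE:      TS           BOND          ANGLE",
    "ENERGY:       0        10.0000        20.0000",
    "ENERGY:   47000        11.0000        21.0000",
    "LDB: =============== DONE WITH MIGRATION ================ 47140.7",
]
last_step = last_energy_step(sample_log)
print(last_step)
```

Comparing that number with the number of equilibration steps set in the conf file tells you which phase the job was in when it stopped.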

Regards,
Natalia

On Fri, 15 Oct 2021, 10:10 Haohao Fu, <fhh2626_at_gmail.com> wrote:

> Thanks a lot for your help.
> There is nothing but repeats of "START OF LOAD BALANCING...", like,
> LDB: ============= START OF LOAD BALANCING ============== 47128
> LDB: TIME 47128 LOAD: AVG 0.045555 MAX 0.0602885 PROXIES: TOTAL 113 MAXPE
> 31 MAXPATCH 2 None MEM: 0 MB
> LDB: TIME 47128 LOAD: AVG 0.045555 MAX 0.0602885 PROXIES: TOTAL 113 MAXPE
> 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
> LDB: TIME 47128 LOAD: AVG 0.045555 MAX 0.0602885 PROXIES: TOTAL 113 MAXPE
> 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
> LDB: Reverting to original mapping
> LDB: ============== END OF LOAD BALANCING =============== 47128
> Info: useSync: 0 useProxySync: 0
> LDB: =============== DONE WITH MIGRATION ================ 47128
> LDB: ============= START OF LOAD BALANCING ============== 47130.5
> LDB: ============== END OF LOAD BALANCING =============== 47130.5
> Info: useSync: 0 useProxySync: 0
> LDB: =============== DONE WITH MIGRATION ================ 47130.5
> LDB: ============= START OF LOAD BALANCING ============== 47130.6
> LDB: TIME 47130.6 LOAD: AVG 0.0455513 MAX 0.0602717 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 None MEM: 0 MB
> LDB: TIME 47130.6 LOAD: AVG 0.0455513 MAX 0.0602717 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
> LDB: TIME 47130.6 LOAD: AVG 0.0455513 MAX 0.0602717 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
> LDB: Reverting to original mapping
> LDB: ============== END OF LOAD BALANCING =============== 47130.6
> Info: useSync: 0 useProxySync: 0
> LDB: =============== DONE WITH MIGRATION ================ 47130.6
> LDB: ============= START OF LOAD BALANCING ============== 47133
> LDB: ============== END OF LOAD BALANCING =============== 47133
> Info: useSync: 0 useProxySync: 0
> LDB: =============== DONE WITH MIGRATION ================ 47133
> LDB: ============= START OF LOAD BALANCING ============== 47133.1
> LDB: TIME 47133.1 LOAD: AVG 0.04589 MAX 0.0604653 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 None MEM: 0 MB
> LDB: TIME 47133.1 LOAD: AVG 0.04589 MAX 0.0604653 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
> LDB: TIME 47133.1 LOAD: AVG 0.04589 MAX 0.0604653 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
> LDB: ============== END OF LOAD BALANCING =============== 47133.1
> Info: useSync: 0 useProxySync: 0
> LDB: =============== DONE WITH MIGRATION ================ 47133.1
> LDB: ============= START OF LOAD BALANCING ============== 47135.6
> LDB: ============== END OF LOAD BALANCING =============== 47135.6
> Info: useSync: 0 useProxySync: 0
> LDB: =============== DONE WITH MIGRATION ================ 47135.6
> LDB: ============= START OF LOAD BALANCING ============== 47135.6
> LDB: TIME 47135.6 LOAD: AVG 0.04596 MAX 0.0609632 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 None MEM: 0 MB
> LDB: TIME 47135.6 LOAD: AVG 0.04596 MAX 0.0609632 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
> LDB: TIME 47135.6 LOAD: AVG 0.04596 MAX 0.0609632 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
> LDB: ============== END OF LOAD BALANCING =============== 47135.6
> Info: useSync: 0 useProxySync: 0
> LDB: =============== DONE WITH MIGRATION ================ 47135.6
> LDB: ============= START OF LOAD BALANCING ============== 47138.1
> LDB: ============== END OF LOAD BALANCING =============== 47138.1
> Info: useSync: 0 useProxySync: 0
> LDB: =============== DONE WITH MIGRATION ================ 47138.1
> LDB: ============= START OF LOAD BALANCING ============== 47138.2
> LDB: TIME 47138.2 LOAD: AVG 0.0456313 MAX 0.0603476 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 None MEM: 0 MB
> LDB: TIME 47138.2 LOAD: AVG 0.0456313 MAX 0.0603476 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
> LDB: TIME 47138.2 LOAD: AVG 0.0456313 MAX 0.0603476 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
> LDB: ============== END OF LOAD BALANCING =============== 47138.2
> Info: useSync: 0 useProxySync: 0
> LDB: =============== DONE WITH MIGRATION ================ 47138.2
> LDB: ============= START OF LOAD BALANCING ============== 47140.6
> LDB: ============== END OF LOAD BALANCING =============== 47140.6
> Info: useSync: 0 useProxySync: 0
> LDB: =============== DONE WITH MIGRATION ================ 47140.6
> LDB: ============= START OF LOAD BALANCING ============== 47140.7
> LDB: TIME 47140.7 LOAD: AVG 0.0455321 MAX 0.0602817 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 None MEM: 0 MB
> LDB: TIME 47140.7 LOAD: AVG 0.0455321 MAX 0.0602817 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
> LDB: TIME 47140.7 LOAD: AVG 0.0455321 MAX 0.0602817 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
> LDB: ============== END OF LOAD BALANCING =============== 47140.7
> Info: useSync: 0 useProxySync: 0
> LDB: =============== DONE WITH MIGRATION ================ 47140.7
>
> The input files are Amber-formatted. The force field has the same functional
> form as the Amber FF. I used the latest patch from the NAMD Gitlab to make
> sure that Amber-formatted files are read correctly. I used
> fullelectfrequency, nonbondedfreq and stepspercycle of 1 and a margin of
> 10 to avoid possible problems caused by fluctuations of the box. I
> checked everything that you mentioned and succeeded in running the simulation
> using the same parm7/pdb files through OpenMM. Hence, I suspect that
> something goes wrong during load balancing and migration.
>
> Best,
> Haohao
>
> On Fri, Oct 15, 2021 at 2:35 PM, Natalia Ostrowska <n.ostrowska_at_cent.uw.edu.pl> wrote:
>
>> Hi, I think you need to attach / paste a longer portion of your .out file,
>> going back to the steps that still ran correctly - otherwise no one will be
>> able to help here. Maybe also the conf file and a couple of words on the model.
>>
>> I have run CG simulations with NAMD, and I can tell you there is a 99%
>> chance your errors are caused by how the system is parameterized, even
>> errors that look like server problems - it could be pseudo-atom size, the
>> distance between them, atom parameters, all sorts of things. Also have a
>> closer look at what the trajectory looks like, in VMD maybe. Check whether
>> the system is behaving 'normally' or whether anything strange is happening,
>> like aggregation, vacuum bubbles etc.
>>
>>
>>
>> On Fri, 15 Oct 2021, 07:16 Haohao Fu, <fhh2626_at_gmail.com> wrote:
>>
>>> Hi,
>>>
>>> My system is modeled by the SIRAH CG force field. If I run the job using
>>> multiple CPU cores + 1 GPU, the simulation will stop after some time
>>> (usually 50000-5000000 steps) without any error message. The only weird
>>> thing is that messages like
>>>
>>> LDB: ============= START OF LOAD BALANCING ============== 12285.4
>>> LDB: TIME 12285.4 LOAD: AVG 0.0416106 MAX 0.0480318 PROXIES: TOTAL 113
>>> MAXPE 31 MAXPATCH 2 None MEM: 0 MB
>>> LDB: TIME 12285.4 LOAD: AVG 0.0416106 MAX 0.0480318 PROXIES: TOTAL 113
>>> MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
>>> LDB: TIME 12285.4 LOAD: AVG 0.0416106 MAX 0.0480318 PROXIES: TOTAL 113
>>> MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
>>> LDB: Reverting to original mapping
>>> LDB: ============== END OF LOAD BALANCING =============== 12285.4
>>> Info: useSync: 0 useProxySync: 0
>>> LDB: =============== DONE WITH MIGRATION ================ 12285.4
>>>
>>> are much more frequent than in a normal simulation. The last line
>>> of the log file of a terminated job is always
>>> LDB: =============== DONE WITH MIGRATION ================ *****.*.
>>>
>>> If I run the job using 1 CPU core + 1 GPU, the simulation will not stop,
>>> but messages like
>>> LDB: ============= START OF LOAD BALANCING ============== 36312
>>> LDB: ============== END OF LOAD BALANCING =============== 36312
>>> LDB: =============== DONE WITH MIGRATION ================ 36312
>>> are still super frequent.
>>>
>>> I suspect that the issue is due to a problem in the load balancing
>>> process, but how can I address this issue?
>>>
>>> Thanks for your help!
>>> Haohao
>>>
>>>
>>>
