
Verify and Report Exit Codes and States for All Steps

Currently, only the allocation exit code is reported. Use the exit code of each step to report how many steps of each job failed.

IMPORTANT: The allocation exit code reported by sacct (the one shown when the -X flag is used) normally corresponds to the exit code of the last command in the submitted batch script, which is not necessarily a job step. As a result, even if srun (or mpirun) fails, a later command that completes successfully (e.g., time) causes the exit code to be reported as 0:0 and the job state to be marked as COMPLETED rather than FAILED.

Here’s an example of a job with two steps: the first step completed successfully, but the second one failed. After the last srun, however, another serial command was executed and returned an exit code of 0, so the overall exit code is reported as 0:0:

$ sacct -Xj 10382893 -o jobid%20,state,exitcode,allocnodes,alloccpus
               JobID      State ExitCode AllocNodes  AllocCPUS 
-------------------- ---------- -------- ---------- ---------- 
            10382893  COMPLETED      0:0        128       4096 

$ sacct -j 10382893 -o jobid%20,state,exitcode,allocnodes,alloccpus
               JobID      State ExitCode AllocNodes  AllocCPUS 
-------------------- ---------- -------- ---------- ---------- 
            10382893  COMPLETED      0:0        128       4096 
      10382893.batch  COMPLETED      0:0          1         32 
     10382893.extern  COMPLETED      0:0        128       4096 
          10382893.0  COMPLETED      0:0          1         32 
          10382893.1     FAILED      1:0        128       4096
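
The per-step check requested here could be sketched as follows. This is a minimal illustration and not an existing implementation: the function name `count_failed_steps`, the `FAILURE_STATES` set, and the rule for what counts as a failed step are all assumptions. It expects pipe-delimited text as produced by `sacct -P -n -o jobid,state,exitcode`, and the sample input is the output above converted by hand to that format.

```python
# Assumption: which Slurm states count as a failure; adjust to local policy.
FAILURE_STATES = {"FAILED", "TIMEOUT", "OUT_OF_MEMORY", "NODE_FAIL"}


def count_failed_steps(sacct_text):
    """Map each job ID to its number of failed steps.

    Expects lines of the form "jobid|state|exitcode", i.e. the output of
    `sacct -P -n -o jobid,state,exitcode`.
    """
    failed = {}
    for line in sacct_text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_field, state, exit_code = line.split("|")[:3]
        # "10382893.1" -> job "10382893", step "1"; no dot means allocation record.
        job_id, dot, _step = job_field.partition(".")
        failed.setdefault(job_id, 0)
        if not dot:
            # The allocation line (what -X shows) is not a step: skip it.
            continue
        # States may carry a suffix (e.g. "CANCELLED by <uid>"); keep the first word.
        state_word = state.split()[0] if state else ""
        # Assumed rule: a step failed if its state is a failure state
        # or its exit code is anything other than "0:0".
        if state_word in FAILURE_STATES or exit_code != "0:0":
            failed[job_id] += 1
    return failed


# Hand-converted from the sacct output of job 10382893 shown above.
sample = """\
10382893|COMPLETED|0:0
10382893.batch|COMPLETED|0:0
10382893.extern|COMPLETED|0:0
10382893.0|COMPLETED|0:0
10382893.1|FAILED|1:0
"""

print(count_failed_steps(sample))  # {'10382893': 1}
```

With this rule the .batch and .extern records are counted like any other step. Whether they should be excluded from the report is a design decision left open here.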