Verify and Report Exit Codes and States for All Steps
Currently, only the allocation exit code is reported. Use the exit code of each step to report how many steps of each job failed.
IMPORTANT: Normally, the allocation exit code reported by sacct (shown when the -X flag is used) corresponds to the exit code of the last command in the submitted batch script, which is not necessarily a job step.
As a result, even if srun (or mpirun) fails, if another command follows and completes successfully (e.g., time), the exit code will be reported as 0:0, and the job state will be marked as COMPLETED rather than FAILED.
Here’s an example of a job with two steps: the first step completed successfully, but the second one failed. However, after the last srun, another serial command was executed, which returned an exit code of 0, leading to the overall exit code being reported as 0:
$ sacct -Xj 10382893 -o jobid%20,state,exitcode,allocnodes,alloccpus
               JobID      State ExitCode AllocNodes  AllocCPUS
-------------------- ---------- -------- ---------- ----------
            10382893  COMPLETED      0:0        128       4096

$ sacct -j 10382893 -o jobid%20,state,exitcode,allocnodes,alloccpus
               JobID      State ExitCode AllocNodes  AllocCPUS
-------------------- ---------- -------- ---------- ----------
            10382893  COMPLETED      0:0        128       4096
      10382893.batch  COMPLETED      0:0          1         32
     10382893.extern  COMPLETED      0:0        128       4096
          10382893.0  COMPLETED      0:0          1         32
          10382893.1     FAILED      1:0        128       4096
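
The per-step counting requested above could look roughly like the following Python sketch. It is only an illustration, not an existing implementation. It assumes the input is pipe-separated text as produced by `sacct -n -P -j <jobid> -o jobid,state,exitcode`. The function name `count_failed_steps`, the set of states treated as failures, and the choice to skip the `.batch` and `.extern` steps are all assumptions made for this example. The sample data is the job shown above, retyped in that pipe-separated form.

```python
# Step states counted as failures. This list is an assumption for illustration;
# adjust it to the reporting policy that is actually wanted.
FAILED_STATES = {"FAILED", "TIMEOUT", "OUT_OF_MEMORY", "NODE_FAIL", "CANCELLED"}

# Steps that Slurm creates on its own and that are skipped here (assumption).
SKIPPED_STEPS = {"batch", "extern"}


def count_failed_steps(sacct_text):
    """Return {job_id: {"steps": n, "failed": m}} from parsable sacct output.

    Expects lines shaped like `jobid|state|exitcode`.
    """
    summary = {}
    for line in sacct_text.splitlines():
        fields = line.strip().split("|")
        if len(fields) < 3:
            continue  # blank or malformed line
        jobid, state, exitcode = fields[:3]
        if "." not in jobid:
            continue  # allocation record, not a step
        job, step = jobid.split(".", 1)
        if step in SKIPPED_STEPS:
            continue
        counts = summary.setdefault(job, {"steps": 0, "failed": 0})
        counts["steps"] += 1
        # The state may carry a suffix such as "CANCELLED by <uid>";
        # only the first word is compared.
        words = state.split()
        state_name = words[0] if words else ""
        if state_name in FAILED_STATES or exitcode != "0:0":
            counts["failed"] += 1
    return summary


# The example job from above, in `sacct -n -P` form.
sample = """10382893|COMPLETED|0:0
10382893.batch|COMPLETED|0:0
10382893.extern|COMPLETED|0:0
10382893.0|COMPLETED|0:0
10382893.1|FAILED|1:0"""

for job, counts in count_failed_steps(sample).items():
    print(f"{job}: {counts['failed']} of {counts['steps']} steps failed")
```

For the sample job, this reports one failed step out of two, even though the allocation itself is marked COMPLETED with exit code 0:0. A step is counted as failed if either its state or its exit code indicates a problem, so that a non-zero exit code is not hidden by a state that looks successful.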