troycomi / reportseff

Tabular seff

feat: handle large argument lists better

troycomi opened this issue · comments

With several thousand Slurm output files, the argument list passed to sacct can exceed the OS limit. Need a way to invoke sacct without a giant argument list.

slurm_out  ❯❯❯ ls | wc -l
14704
slurm_out  ❯❯❯ reportseff
Traceback (most recent call last):
  File "/tigress/tcomi/.conda/mybase/bin/reportseff", line 5, in <module>
    main()
  File "/tigress/tcomi/.conda/mybase/lib/python3.6/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/tigress/tcomi/.conda/mybase/lib/python3.6/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/tigress/tcomi/.conda/mybase/lib/python3.6/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tigress/tcomi/.conda/mybase/lib/python3.6/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/tcomi/projects/reportseff/src/reportseff/console.py", line 97, in main
    output, entries = get_jobs(args)
  File "/home/tcomi/projects/reportseff/src/reportseff/console.py", line 143, in get_jobs
    db_output = get_db_output(inquirer, renderer, job_collection, args.debug)
  File "/home/tcomi/projects/reportseff/src/reportseff/console.py", line 209, in get_db_output
    renderer.query_columns, job_collection.get_jobs(), debug_cmd
  File "/home/tcomi/projects/reportseff/src/reportseff/db_inquirer.py", line 176, in get_db_output
    shell=False,
  File "/tigress/tcomi/.conda/mybase/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/tigress/tcomi/.conda/mybase/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/tigress/tcomi/.conda/mybase/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 7] Argument list too long: 'sacct'

I think there are actually two problems here:

  1. When monitoring status, it is better to truncate the query to something like 100 jobs. The issue is determining which jobs to query before calling sacct. For job numbers or filenames, we can sort first and then pass a limited set of jobs to the db_inquirer. As a benefit, the smaller sacct call should return faster.

  2. When performing an operation, e.g. moving all completed slurm outputs, we need the full reportseff output regardless of length. This would require either writing the job list to a temporary file and invoking sacct with process substitution, or calling sacct multiple times with subsets of jobs. A suitable batch size will need to be determined.
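The second point (multiple sacct calls over subsets of jobs) could be sketched roughly as below. This is not the reportseff implementation, just an illustration: the batch size is an arbitrary assumption, and the sacct format columns are placeholders.

```python
import subprocess

# Illustrative default; a suitable value still needs to be determined.
BATCH_SIZE = 1000


def batched(job_ids, batch_size):
    """Yield successive fixed-size slices of the job list."""
    for start in range(0, len(job_ids), batch_size):
        yield job_ids[start:start + batch_size]


def run_sacct_batched(job_ids, batch_size=BATCH_SIZE):
    """Call sacct once per batch so no single argument list exceeds ARG_MAX.

    Concatenates the parsable, headerless output of each invocation.
    """
    lines = []
    for batch in batched(job_ids, batch_size):
        result = subprocess.run(
            ["sacct", "-P", "-n",
             "--jobs", ",".join(batch),
             "--format", "JobID,State,Elapsed"],  # placeholder columns
            stdout=subprocess.PIPE,
            encoding="utf-8",
            check=True,
        )
        lines.extend(result.stdout.splitlines())
    return lines
```

Since each batch keeps the command line well under the kernel's argument-list limit, the Errno 7 failure above can no longer occur regardless of how many output files are present.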

Overall, we need to add a job limit option that controls how many jobs get passed to sacct. If unset, fall back to the alternative (batched) method of calling sacct.
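For the job limit option, the truncation itself could look something like this sketch. It assumes plain numeric job IDs (no array-task suffixes) and an arbitrary default limit of 100; neither is settled.

```python
def newest_jobs(job_ids, limit=100):
    """Keep only the `limit` highest-numbered jobs.

    Slurm job IDs increase over time, so sorting numerically and taking
    the tail approximates "most recent". Assumes plain numeric IDs;
    array jobs like 123_4 would need extra parsing.
    """
    return sorted(job_ids, key=int)[-limit:]
```

The truncated list would then be handed to the db_inquirer, keeping the sacct call both short and fast.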