feat: handle large argument lists better
troycomi opened this issue
With several thousand slurm output files, the argument list passed to sacct can get too long. We need a way to invoke sacct without a giant argument list.
```
slurm_out ❯❯❯ ls | wc -l
14704
slurm_out ❯❯❯ reportseff
Traceback (most recent call last):
  File "/tigress/tcomi/.conda/mybase/bin/reportseff", line 5, in <module>
    main()
  File "/tigress/tcomi/.conda/mybase/lib/python3.6/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/tigress/tcomi/.conda/mybase/lib/python3.6/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/tigress/tcomi/.conda/mybase/lib/python3.6/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tigress/tcomi/.conda/mybase/lib/python3.6/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/tcomi/projects/reportseff/src/reportseff/console.py", line 97, in main
    output, entries = get_jobs(args)
  File "/home/tcomi/projects/reportseff/src/reportseff/console.py", line 143, in get_jobs
    db_output = get_db_output(inquirer, renderer, job_collection, args.debug)
  File "/home/tcomi/projects/reportseff/src/reportseff/console.py", line 209, in get_db_output
    renderer.query_columns, job_collection.get_jobs(), debug_cmd
  File "/home/tcomi/projects/reportseff/src/reportseff/db_inquirer.py", line 176, in get_db_output
    shell=False,
  File "/tigress/tcomi/.conda/mybase/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/tigress/tcomi/.conda/mybase/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/tigress/tcomi/.conda/mybase/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 7] Argument list too long: 'sacct'
```
I think there are actually two problems here.
- When monitoring status, it is better to truncate the query to something like 100 jobs. The issue is how to determine which jobs to query before calling sacct. For job numbers or filenames, we can sort first and pass a limited set of jobs to the db_inquirer. As a benefit, the smaller sacct call should return faster.
- When performing an operation, e.g. moving all completed slurm outputs, we need the reportseff output regardless of length. That requires either writing a temporary file and invoking sacct with process substitution, or calling sacct multiple times with subsets of the jobs. A suitable batch size will need to be determined.
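The second (batched) option could be sketched like this. Everything here is hypothetical, not current reportseff code: `chunked` and `sacct_batched` are made-up names, and the batch size of 1000 is a placeholder that would need to be tuned or derived from `ARG_MAX`:

```python
import subprocess
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield successive fixed-size slices of items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sacct_batched(jobs: List[str], columns: str, batch_size: int = 1000) -> str:
    """Call sacct once per batch of jobs and concatenate the raw output.

    batch_size is an assumed placeholder; a real implementation would
    pick a value known to stay under the OS argument-length limit.
    """
    pieces = []
    for batch in chunked(jobs, batch_size):
        result = subprocess.run(
            ["sacct", "-P", "-n", "--format", columns,
             "--jobs", ",".join(batch)],
            stdout=subprocess.PIPE,
            encoding="utf-8",
            check=True,
            shell=False,
        )
        pieces.append(result.stdout)
    return "".join(pieces)
```

Since `-n` suppresses the header on every call, the concatenated batches parse the same as a single sacct invocation would.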
Overall, we need to add a job limit option controlling what gets passed to sacct. If the limit is unset, fall back to the alternative method of calling sacct.
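For the job limit itself, one possible approach (all names and the filename pattern below are assumptions for illustration) is to extract job IDs from the slurm output filenames, sort numerically, and keep only the newest N before handing them to the db_inquirer:

```python
import re
from typing import List

# Assumed filename pattern for outputs like "slurm-12345.out"
# or array tasks like "slurm-12345_7.out".
JOB_ID_RE = re.compile(r"slurm-(\d+)(?:_\d+)?\.out$")


def limit_jobs(filenames: List[str], limit: int) -> List[str]:
    """Return the job IDs of the `limit` most recent jobs.

    Assumes higher job numbers are more recent, which holds unless
    the slurm job ID counter has wrapped around.
    """
    ids = []
    for name in filenames:
        match = JOB_ID_RE.search(name)
        if match:
            ids.append(int(match.group(1)))
    return [str(job) for job in sorted(ids)[-limit:]]
```

Sorting file modification times instead of job numbers would also work and survives counter wraparound, at the cost of an extra stat per file.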