vatlab / sos

SoS workflow system for daily data analysis

Home Page: http://vatlab.github.io/sos-docs

Question: automatically zap input files?

gaow opened this issue · comments

commented

[Also from my colleague] Consider a workflow that looks like this (the code will not work; it only shows the idea):

[1]
input: for_each = ['files_sra'], group_by = 1, concurrent = True
output: files_sra
download: 
  {url}/{_files_sra}

[2]
output: files_bam
run:
 sratools-dump ${_input} ${_output}

The problem is that the data is huge and barely fits on disk. So ideally we need to zap the output of step 1 immediately after step 2 is done. Is there a built-in mechanism for it?

commented

We discussed before if there should be a programmatic way to zap a file...

Would something like

[1]
input: for_each = ['files_sra'], group_by = 1, concurrent = True
output: files_sra
download: 
  {url}/{_files_sra}

[2]
output: files_bam
run:
 sratools-dump ${_input} ${_output}

zap_file(_input)

work?

commented

Ah sorry I do not recall the previous discussion on it ... I think what you proposed will work, but should we make it another task option instead?

commented

A function form allows you to zap a file at any time (not necessarily after everything is done), in a task or a regular step, and it is not limited to _input (e.g. you can zap a dynamically downloaded file).

zap_file could be shortened to zap, provided that is not too common a name.

commented

I see ... Okay, or sos_zap, to distinguish it from actions (as if zap were the name of a language), and to match sos_run? And if you make it more general, should it also take step names as input when used as a separate step, not just files? I'm just throwing out half-baked thoughts ...

commented

As finally implemented, zap() is now added to path, paths, file_target and sos_targets so that you can do path('file').zap(), _input.zap(), step_input.zap() etc. This should be enough so I did not implement a sos_zap() function.
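
To make the semantics concrete, here is a minimal pure-Python sketch of what zapping a file amounts to: the file itself is removed and a small placeholder recording its size and checksum is left behind. This is an illustration under stated assumptions, not SoS's actual implementation; the helper name `zap_sketch`, the `.zapped` placeholder suffix, and the tab-separated layout of the placeholder are all assumptions made for the example.

```python
import hashlib
import os
import tempfile


def zap_sketch(filename, chunk_size=1 << 20):
    """Replace `filename` with a placeholder holding its name, size and MD5.

    Hypothetical helper for illustration only; it is not part of SoS.
    """
    md5 = hashlib.md5()
    size = 0
    # Hash in chunks so that very large files never have to fit in memory.
    with open(filename, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            size += len(chunk)
    placeholder = filename + ".zapped"  # assumed naming convention
    with open(placeholder, "w") as out:
        out.write(f"{os.path.basename(filename)}\t{size}\t{md5.hexdigest()}\n")
    os.remove(filename)  # free the disk space
    return placeholder


# Demo in a throwaway directory.
workdir = tempfile.mkdtemp()
demo_file = os.path.join(workdir, "1.txt")
with open(demo_file, "wb") as f:
    f.write(b"asd\n")

demo_placeholder = zap_sketch(demo_file)
```

Keeping a checksum in the placeholder is what would let a workflow system still recognise the zapped file as a known input later, even though its content is gone.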

commented

There is a minor issue with this new feature: a confusing warning message is displayed at the end of execution:

[1]
input: '1.txt'
output: "2.txt" 
run:
  echo asd>2.txt
_input.zap()

The output contains a "WARNING" line:

INFO: Executing default_1: 
INFO: input:   [file_target('1.txt')]
echo asd>2.txt
WARNING: Failed to create signature: input target 1.txt does not exist
INFO: output:   [file_target('2.txt')]
INFO: Workflow default (ID=46c4bb0edea3b1b5) is executed successfully.

commented

It is still not completely fixed. If I remove 2.txt and re-run, sos should complain that there is no real input file and stop running... but it went ahead anyway, and it only succeeded because your example happens to be independent of the input.
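
The behaviour expected here can be sketched as a small decision function: a zapped input is acceptable only while the output it produced still exists; once the output is gone, the step cannot be re-run and should fail. This is a hedged illustration of the rule described in this comment, not SoS's signature logic; `check_step`, its return values, and the `.zapped` suffix are hypothetical.

```python
import os
import tempfile


def check_step(input_file, output_file):
    """Return 'run' or 'skip', or raise if the step cannot proceed.

    Hypothetical helper for illustration only; it is not part of SoS.
    """
    has_input = os.path.isfile(input_file)
    has_placeholder = os.path.isfile(input_file + ".zapped")  # assumed suffix
    has_output = os.path.isfile(output_file)
    if has_input:
        # A real system would also compare signatures and might skip here.
        return "run"
    if has_placeholder and has_output:
        # Input was zapped, but the result it produced is still available.
        return "skip"
    raise FileNotFoundError(
        f"input {input_file} does not exist and output {output_file} "
        "cannot be regenerated"
    )


# Demo: a zapped 1.txt next to an existing 2.txt, then 2.txt is removed.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "1.txt")
dst = os.path.join(workdir, "2.txt")
open(src + ".zapped", "w").close()
open(dst, "w").close()

first = check_step(src, dst)

os.remove(dst)
try:
    check_step(src, dst)
    second = "no error"
except FileNotFoundError:
    second = "error"
```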