XENON1T / cax

Simple data management tool

Purging Processed Data

pdeperio opened this issue · comments

Need a nice method for purging a set of processed data of a given pax version.

Perhaps just modify https://github.com/XENON1T/cax/blob/master/cax/tasks/clear.py#L108 to use the pax version, as in other modules.

@lucrlom Since you now have an idea how processed data purging works (with the external script), can you look into implementing this in cax?

Ciao Patrick,
Sorry, I completely forgot about this issue.
My plan is as follows:

I want to add two new parameters to the JSON config file:
"purge_processed": "True" (default is False)
"pax_version": "v6.x.x"
These two parameters will be handled in config.py: https://github.com/XENON1T/cax/blob/master/cax/config.py#L83
although it is not yet clear to me how this information should be passed along; I only have a rough idea.

Then add a condition here: https://github.com/XENON1T/cax/blob/master/cax/tasks/clear.py#L114
for the case where "purge_processed" is true and the version we want to remove has been set.
What do you think?

How can I check whether a variable has been defined in the code?
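
One possible answer, as a minimal sketch: if the options are read from the JSON config into a Python dict, an unset option is simply a missing key, so `dict.get` with a default is enough and no exception handling is needed. The helper name `get_purge_settings` is hypothetical and not part of cax; the key names follow the proposal above.

```python
def get_purge_settings(host_config):
    """Return (purge_processed, pax_version) for one host entry.

    host_config is assumed to be a dict loaded from the JSON config file.
    This helper is illustrative only; it does not exist in cax.
    """
    # dict.get returns the default when the key is absent, so nothing is raised
    raw_flag = host_config.get("purge_processed", "False")
    # The proposal stores the flag as a string, so accept "True"/"true" and bool True
    purge_processed = str(raw_flag).lower() == "true"
    # None signals that no version was set
    pax_version = host_config.get("pax_version")
    return purge_processed, pax_version


if __name__ == "__main__":
    print(get_purge_settings({"name": "midway-login1"}))
    print(get_purge_settings({"purge_processed": "True", "pax_version": "v6.4.0"}))
```

With this shape, the condition in clear.py could be written as "the flag is true and the version is not None".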

First version:
https://github.com/XENON1T/cax/blob/cax_purge_processed/cax/tasks/clear.py#L114
New entries for the config file cax_purge.json:
[
  {
    "name": "midway-login1",
    "method": "scp",
    "hostname": "midway-login1.rcc.uchicago.edu",
    "dir_raw": "/project/lgrandi/xenon1t/raw",
    "dir_processed": "/project/lgrandi/xenon1t/processed",
    "upload_options": [],
    "username": "tunnell",
    "download_options": [],
    "task_list": ["RetryStalledTransfer", "BufferPurger"],
    "purge": 25,
    "purge_type": "processed",   <<<<-----
    "pax_version": "v6.4.0"      <<<<-----
  }
]

I think you should leave BufferPurger as it was, since its purpose is automatic buffer purging (i.e. removing raw data from the xe1t-datamanager buffer once enough copies are available). So please implement this in a new class, like:

class PurgeProcessed(checksum.CompareChecksums):

You may also grab the pax version from the environment as it's done elsewhere in the code, e.g.:

'v%s' % pax.__version__ == data_doc['pax_version']

instead of making a new cax.json option.

The class can also be programmed to just operate on processed files, like how BufferPurger only operates on raw files, so you shouldn't need to make a new option for that either.
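
The selection such a class would need can be sketched roughly as below. This is only an illustration under assumptions: the run document is taken to be a dict with a "data" list whose entries carry "type", "host" and "pax_version" fields, and the function name `find_processed_to_purge` is hypothetical. It is written as a standalone function so it runs without cax or pax installed.

```python
def find_processed_to_purge(run_doc, host, pax_version):
    """Return the data entries that are processed files on `host`
    produced with `pax_version`.

    The run_doc layout (a "data" list of dicts with "type", "host" and
    "pax_version" keys) is an assumption made for this sketch.
    """
    matches = []
    for data_doc in run_doc.get("data", []):
        if data_doc.get("type") != "processed":
            continue  # raw data is left to BufferPurger
        if data_doc.get("host") != host:
            continue  # only purge copies on the host being cleaned
        if data_doc.get("pax_version") != pax_version:
            continue  # keep processed data from other pax versions
        matches.append(data_doc)
    return matches
```

Inside a task class, the comparison value could come from the environment, as in the `'v%s' % pax.__version__` expression quoted above.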

OK, but with that last approach you cannot choose which version to delete: you can only delete the version you are currently operating with, not the older ones.

So the constraint on the time should remain, right?

What do you mean by "constraint on the time"?

As with the standard processing task, the version is determined by the environment, which you set up with source activate pax_v#.#.#. But you're right that conda env list is now missing older pax versions, so you may decide to keep the pax_version option; just rename it so it's not confused with processing, e.g. pax_version_purge.
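
A small sketch of how the two sources of the version could be combined, assuming the renamed option `pax_version_purge` lives in the host entry of the config. The function name `version_to_purge` is hypothetical; the environment version is passed in as a plain string (what `pax.__version__` would provide) so the example runs without pax installed.

```python
def version_to_purge(host_config, environment_version=None):
    """Pick the pax version whose processed data should be purged.

    An explicit "pax_version_purge" option wins; otherwise fall back to
    the version of the active environment, formatted as 'v#.#.#'.
    Returns None if neither is available.
    """
    configured = host_config.get("pax_version_purge")
    if configured:
        return configured
    if environment_version is not None:
        # Same formatting as the 'v%s' % pax.__version__ comparison in cax
        return "v%s" % environment_version
    return None
```

This keeps the option optional: older versions that no longer have a conda environment can still be targeted, while the default behavior follows the active environment.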

OK.
By the constraint on time I mean how many days have to pass before the files are purged, as for the raw data. In this case, if we decide to remove the processed files, we don't care how many days have passed, so I think we can remove this constraint.

Yes, you can basically write a new class from scratch, since it should be much simpler than for raw data.

I had several problems setting up the cax environment on midway to test the new class.
I finally found the problem and fixed it.
I ran the test and the code works properly.
I can merge it if you agree.

What was the problem and fix?

Please make a pull request (don't merge directly).

Fixed in #85