yimuchen / XRDRetrieve

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

XRDRetrieve

XRDRetrieve is a package aimed to help in the retrieval of crab jobs, assuming you only have knowledge of how the crab configuration file is setup. The input options are:

  • -h, --help: Display options
  • -s, --site: Site of where the output is stored, this is NOT the same as the site you put in your crab file, but rather the url of the site, (T2_CH_CERN would correspond to eoscms.cern.ch)
  • -d, --dirfln: The directory you put in your crab configuration file.
  • -p, --primary: The primary tag of the data you are processing, for instance /SingleMuon/Run2016B-Promptreco-v2/MINIAOD would be SingleMuon
  • -c, --crabjob: The requestName entry in you crab configuration
  • -o, --output: Where to store the retrieved files. For the a given input /some/path, the root files would be stored in /some/path/<crabjob>
  • -r, --refresh: Boolean flag to reprocess log file

In the case the file retrieval process fails, of if there is a update from a re-ran job, the fore-mentioned log file keeps track of which files have are up to date and do not require downloading. The --refresh flag, tells the program to regenerate a log file by crawling the filesystem of the remote server again. Regardless of whether this flag is placed, on the first run there will always be a crawl.

Requirements

Please note that any xrdfs commands should be able to run without user-interaction. If you need to setup certificate proxies, please do it before running the scripts.

Design mentalities

After given a base storage directory, a request name, crab will store the output root files in this directory:

/<base path>/<primary tag>/crab_<request name>/<timestamp>/0000/<yourfile>.root

The time stamp is a unique marker for when the crab file was submitted. When generating our local log file or the first time, we crawl the remote file system to get the following information

  • The root files under the directory with the latest time stamp
  • The size and modification time of each root file

When determining which files should be downloaded, the program checks if:

  • If the local version of the file exists. If not then download
  • If the local version of the file has exactly the same size as remote counter part.

With the refresh option, we also check our existing log file with one freshly generated:

  • If the new log file contains a file not listed in the original, download said file
  • If a file's time stamp or size is different between the two log files, download said file

About

License:MIT License


Languages

Language:Python 100.0%