scddl (pronounced scuttle) downloads data sets for scientific computing.
-
integrity checks
Data sets that provide file integrity information, e.g. MD5 checksums, are rigorously checked.
-
strict versioning
Data sets that are not inherently versioned are tagged with the download date, which makes reproducible research possible. To enforce strict versioning, no link to the latest version is provided.
As a consequence, existing files are never overwritten. If files were updated in place, running jobs could produce inconsistent results.
-
centralized storage location
Especially on scientific computing platforms, the data sets are intended to be downloaded to globally accessible storage locations. Users and groups then do not have to maintain their own copies, which would strain their file system quotas. New users can also start working immediately instead of having to download their data sets first.
-
improved file system performance
Another advantage of centralized storage is that the file system can cache the data sets more effectively. This can improve I/O performance, especially when many users access the same data set concurrently. Your mileage may vary, depending on the caching capabilities of the file system and on the data set usage patterns.
-
periodic, automatic updates
The download tools can be run as cron jobs or systemd timers. This way, you can easily create periodic, automated updates of data sets.
-
logging to syslog
When specified, the download tools send their output to syslog with their script name as the tag, e.g. the tool ncbidl.sh uses ncbidl as its tag. You can then search for these tags, e.g.:
journalctl -t ncbidl
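To illustrate the integrity checks and strict versioning listed above, here is a minimal sketch. It is not the actual scddl implementation: the function names make_version_dir and verify_md5 and the directory layout base/name/date are hypothetical, and it assumes GNU coreutils (md5sum, date) are available.

```bash
# Hypothetical helper: create a date-tagged directory for a data set.
# Fails if the directory already exists, so existing files are never overwritten.
make_version_dir() {
    local base="$1" name="$2"
    local version target
    version="$(date +%Y-%m-%d)"
    target="${base}/${name}/${version}"
    mkdir -p "${base}/${name}" || return 1
    # Plain mkdir (without -p) fails if the target is already present.
    if ! mkdir "${target}" 2>/dev/null; then
        echo "error: ${target} already exists, refusing to overwrite" >&2
        return 1
    fi
    echo "${target}"
}

# Hypothetical helper: verify the files listed in an MD5 checksum file.
# md5sum --check returns non-zero if any listed file is missing or corrupt.
verify_md5() {
    local dir="$1" checksum_file="$2"
    (cd "${dir}" && md5sum --check --quiet "${checksum_file}")
}
```

A download script along these lines would call make_version_dir first, fetch the files into the returned directory, and then run verify_md5 on the checksum file shipped with the data set before reporting success.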
Source data sets are downloaded directly off the internet.
Derived data sets are built from source data sets. They automatically download their sources if these are not yet available.
- diamond:
  - diamonddb.sh: builds a diamond database from NCBI sources using the makedb sub-command
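The behaviour described above, where a derived data set fetches its sources on demand, might look roughly like the following sketch. The function ensure_source is hypothetical and only illustrates the idea; how the real tools locate and download their sources is not shown here.

```bash
# Hypothetical helper: make sure a source data set exists before a derived
# data set is built. The first argument is the expected source directory,
# the remaining arguments are the download command to run if it is missing.
ensure_source() {
    local source_dir="$1"
    shift
    if [ -d "${source_dir}" ]; then
        echo "source found: ${source_dir}"
        return 0
    fi
    echo "source missing, downloading: ${source_dir}"
    "$@"
}
```

A build script could call ensure_source with the expected source directory followed by the matching download command, and start building the derived database only if the call returns successfully.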
Each tool provides online help via the --help command line argument, e.g.:
bash ncbidl.sh --help
The download tools can also be used as cron jobs, e.g.:
@monthly time bash /path/to/ncbidl.sh /data/db blast/db/nr
@monthly time bash /path/to/ncbidl.sh /data/db blast/db/nt