aws-archiver

Command-line preservation archiving tool for S3

Purpose

The aws-archiver tool is intended to facilitate deposit of assets to Amazon S3 storage while ensuring end-to-end asset fixity and the creation of auditable deposit records.

Installation

The recommended way to install the tool for system-wide access is via pip:

$ git clone https://www.github.com/umd-lib/aws-archiver
$ cd aws-archiver && pip install -e .

Usage

To see the list of available subcommands, run:

$ archiver --help

For help with a particular subcommand, run:

$ archiver <SUBCOMMAND> --help

where <SUBCOMMAND> is the name of the subcommand. For example, for the "deposit" subcommand:

$ archiver deposit --help

"deposit" subcommand

Usage: archiver deposit [-h] -b BUCKET [-c CHUNK] [-l LOGS] [-n NAME] [-p PROFILE] [-r ROOT] [-s STORAGE] [-t THREADS] (-m MAPFILE | -a ASSET) [--dry-run]

Deposit a batch of resources to S3

options:
  -h, --help            show this help message and exit
  -b BUCKET, --bucket BUCKET
                        S3 bucket to deposit files into
  -c CHUNK, --chunk CHUNK
                        Chunk size for multipart uploads
  -l LOGS, --logs LOGS  Location to store log files
  -n NAME, --name NAME  Batch identifier or name
  -p PROFILE, --profile PROFILE
                        AWS authorization profile
  -r ROOT, --root ROOT  Root dir of files being archived
  -s STORAGE, --storage STORAGE
                        S3 storage class
  -t THREADS, --threads THREADS
                        Maximum number of concurrent threads
  -m MAPFILE, --mapfile MAPFILE
                        Archive assets in inventory file
  -a ASSET, --asset ASSET
                        Archive a single asset
  --dry-run             Perform a "dry run" without actually contacting AWS.

The "deposit" subcommand is used to deposit either a single asset (using the "-a/--asset" argument) or multiple assets in a single batch (using the "-m/--mapfile" argument).

For historical reasons, a "dep" alias is provided for the "deposit" subcommand.
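For example, the following invocations deposit a single asset and perform a dry run over a mapfile, respectively (bucket, batch, and file names are illustrative):

$ archiver deposit -b my-bucket -n my_batch -a /data/assets/scpa-001.tif
$ archiver deposit -b my-bucket -n my_batch -m mapfile.txt --dry-run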

Batch manifest file

The "--mapfile" argument uses files in one of three different batch manifest formats:

  • md5sum manifest files
  • patsy-db manifest files
  • inventory manifest files

md5sum manifest files

A text file listing one asset per line, in the form <md5 hash> <whitespace> <absolute local path>. This is the same line format as the output of the Unix md5sum utility. As a convenience, a script that generates such a manifest from a directory of files is included in this repository's bin directory.
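For example, a two-asset md5sum manifest might look like this (hashes and paths are illustrative):

0cc175b9c0f1b6a831c399e269772661  /data/assets/scpa-001.tif
92eb5ffee6ae2fec3ad71c777531578f  /data/assets/scpa-002.tif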

To create a batch manifest with the included script, do:

$ ./bin/make_mapfile.sh path/to/asset/dir mapfile.txt
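If the helper script is not available, the Unix md5sum utility can produce a manifest in the same format; note that the directory should be given as an absolute path so that the manifest records absolute local paths:

$ find /absolute/path/to/asset/dir -type f -exec md5sum {} + > mapfile.txt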

patsy-db manifest files

A CSV file listing one asset per line, in the form

<md5 hash>,<absolute local path>,<relative path>
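For example (hash and paths are illustrative):

0cc175b9c0f1b6a831c399e269772661,/data/assets/scpa-001.tif,Archive092/scpa-001.tif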

See the "patsy-db" documentation for information about creating the manifest file.

inventory manifest files

A CSV file listing one asset per line, as generated by the "inventory" command of the "preserve" tool.

See the "preserve" documentation (https://github.com/umd-lib/preserve) for more information about creating the manifest file.

Note: The "BATCH" field in the first row of the manifest file is used as the batch name, overriding any "-n/--name" argument given on the command line.

AWS credentials

AWS credentials are required for making deposits. This tool uses the boto3 library, which manages authorization via AWS authentication profiles stored in ~/.aws/credentials. To select the profile to use with a batch, pass the -p PROFILE option; if left unspecified, the tool uses the default profile. The chosen profile must have write permission for the bucket specified in the -b BUCKET option.
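A credentials file defining a named profile alongside the default might look like this (key values elided; the profile name is illustrative):

[default]
aws_access_key_id = AKIA...
aws_secret_access_key = ...

[archiver-profile]
aws_access_key_id = AKIA...
aws_secret_access_key = ...

The named profile would then be selected with -p archiver-profile.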

Default option values

The following arguments, although listed above as optional, are required for a deposit; they take these default values when not specified:

Option              Default
'-c', '--chunk'     '4GB'
'-l', '--logs'      'logs'
'-n', '--name'      'test_batch'
'-p', '--profile'   'default'
'-r', '--root'      '.'
'-s', '--storage'   'DEEP_ARCHIVE'
'-t', '--threads'   10
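With these defaults, a minimal invocation such as

$ archiver deposit -b my-bucket -m mapfile.txt

is equivalent to spelling out every option (bucket and mapfile names are illustrative):

$ archiver deposit -b my-bucket -m mapfile.txt -c 4GB -l logs -n test_batch -p default -r . -s DEEP_ARCHIVE -t 10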

"batch-deposit" subcommand

usage: archiver batch-deposit [-h] -f BATCHES_FILE [-p PROFILE]

options:
  -h, --help            show this help message and exit
  -f BATCHES_FILE, --batches-file BATCHES_FILE
                        YAML file containing the paths to the manifests of individual batches.
  -p PROFILE, --profile PROFILE
                        AWS authorization profile

Enables depositing multiple batches specified in a single YAML batches file. The format of the YAML file is:

batches_dir: <Fully-qualified path to the directory containing the batches>
batches:
    - path: <Subdirectory, relative to batches_dir, containing the batch's manifest file>
      bucket: <The AWS bucket to store the assets in>
      asset_root: <The asset root for the batch>

For example:

batches_dir: /libr/archives/logs/libdc/load1
batches:
    - path: Archive000Football1
      bucket: libdc-archivebucket-17lowbw7m2av1
      asset_root: /libr/archives/footballfilmsexport/FootballFilmMpeg2_07272011/2010-07-12/Mpeg2QCd
    - path: Archive000Football2
      bucket: libdc-archivebucket-17lowbw7m2av1
      asset_root: /libr/archives/footballfilmsexport/FootballFilmMpeg2_07272011/2010-08-20/Maryland_mpg2_master/Maryland_mpg2_Batch1
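Given a batches file like the one above saved as, say, batches.yml, the batches would be deposited with:

$ archiver batch-deposit -f batches.yml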

Restoring from AWS Deep Glacier

You can restore files from AWS Deep Glacier using the scripts bin/requestfilesfromdeepglacier.sh and bin/copyfromawstolocal.sh.

  1. Install the AWS CLI. One option for installation is brew install awscli.

  2. Configure your region and credentials following the instructions in the AWS CLI Reference, General Options and Credentials.

  3. Create a CSV input file listing the files to restore, with the three columns bucketname, filelocation, fileserverlocation and no header row. The scripts will prompt for the name of this input file. Example file contents:

libdc-archivebucket-foobarxyz,Archive092/scpa-062057-0018.tif,./restore_directory

  4. Request the restoration from Deep Glacier to an S3 bucket using bin/requestfilesfromdeepglacier.sh. The restoration may take up to 48 hours to complete.

  5. Copy the files from the S3 bucket to the local file system using bin/copyfromawstolocal.sh.
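For reference, the equivalent operations can also be performed directly with the AWS CLI. Using the example CSV row above, a restore request followed by a copy would look roughly like this (the restore tier and duration shown are illustrative choices):

$ aws s3api restore-object \
    --bucket libdc-archivebucket-foobarxyz \
    --key Archive092/scpa-062057-0018.tif \
    --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}}'
$ aws s3 cp s3://libdc-archivebucket-foobarxyz/Archive092/scpa-062057-0018.tif ./restore_directory/

The copy will only succeed once the restoration has completed.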

Development Setup

See docs/DevelopmentSetup.md.

Conformance Tests

Manual tests to verify the application's conformance to actual AWS behavior are specified in docs/ConformanceTests.md.

License

MIT License