ingest

Shared internal tooling for pathogen data ingest. Used by our individual pathogen repos which produce Nextstrain builds. Expected to be vendored by each pathogen repo using git subrepo.

Some tools may only live here temporarily before finding a permanent home in augur curate or Nextstrain CLI. Others may happily live out their days here.

Vendoring

Nextstrain-maintained pathogen repos will use git subrepo to vendor ingest scripts. (See the discussion of this decision in #3.)

For a list of Nextstrain repos that are currently using this method, use this GitHub code search.

If you don't already have git subrepo installed, follow the git subrepo installation instructions. Then add the latest ingest scripts to the pathogen repo by running:

git subrepo clone https://github.com/nextstrain/ingest ingest/vendored

Any future updates of ingest scripts can be pulled in with:

git subrepo pull ingest/vendored

If you run into merge conflicts and would like to pull in a fresh copy of the latest ingest scripts, pull with the --force flag:

git subrepo pull ingest/vendored --force

Warning: Beware of rebasing or dropping the parent commit of a git subrepo update.

git subrepo relies on metadata in the ingest/vendored/.gitrepo file, which includes the hash of the parent commit in the pathogen repo. If this hash no longer exists in the commit history, future git subrepo pull commands will fail.

If you run into an error similar to the following:

$ git subrepo pull ingest/vendored
git-subrepo: Command failed: 'git branch subrepo/ingest/vendored '.
fatal: not a valid object name: ''

Check the parent commit hash in the ingest/vendored/.gitrepo file and make sure the commit exists in the commit history. Update to the appropriate parent commit hash if needed.
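
The check described above can be sketched as follows. This is a minimal sketch, assuming the .gitrepo file is in Git's config format with a subrepo.parent key and that the commands run from the root of the pathogen repo. The function names are hypothetical and not part of this repo.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of this repo.

# Print the parent commit hash recorded in a .gitrepo file.
# Assumes the file is in Git config format with a [subrepo] section.
subrepo_parent() {
    git config --file "$1" subrepo.parent
}

# Succeed only if the recorded parent commit is an ancestor of HEAD,
# i.e. it still exists in the history of the current branch.
subrepo_parent_in_history() {
    parent=$(subrepo_parent "$1") || return 1
    git merge-base --is-ancestor "$parent" HEAD
}

# Example (run from the root of the pathogen repo):
#   subrepo_parent_in_history ingest/vendored/.gitrepo \
#     || echo "parent commit is missing from history" >&2
```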

History

Much of this tooling originated in ncov-ingest and was passaged through mpox's ingest/. It subsequently proliferated from mpox to other pathogen repos (rsv, zika, dengue, hepatitisB, forecasts-ncov), primarily through copying. To counter that proliferation, this repo was made.

Elsewhere

The creation of this repo, in both the abstract and the concrete, and the general approach to "ingest" have been discussed in various internal places, including:

https://github.com/nextstrain/private/issues/59
@joverlee521's workflows document
5 July 2023 Slack thread
6 July 2023 team meeting
…many others

Scripts

Scripts for supporting ingest workflow automation that don’t really belong in any of our existing tools.

notify-on-diff - Send Slack message with diff of a local file and an S3 object
notify-on-job-fail - Send Slack message with details about failed workflow job on GitHub Actions and/or AWS Batch
notify-on-job-start - Send Slack message with details about workflow job on GitHub Actions and/or AWS Batch
notify-on-record-change - Send Slack message with details about line count changes for a file compared to an S3 object's metadata recordcount. If the S3 object's metadata does not have recordcount, the script attempts to download the S3 object and count lines locally, which is only supported for xz-compressed S3 objects.
notify-slack - Send message or file to Slack
s3-object-exists - Used to prevent 404 errors during S3 file comparisons in the notify-* scripts
trigger - Triggers downstream GitHub Actions via the GitHub API using repository_dispatch events.
trigger-on-new-data - Triggers downstream GitHub Actions if the provided upload-to-s3 outputs do not contain the identical_file_message. A hacky way to ensure that we only trigger downstream phylogenetic builds if the S3 objects have been updated.
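
A repository_dispatch event is created with a single POST request to the GitHub REST API. The following is a minimal sketch of such a request, not the trigger script itself. The function name, its arguments, and the GITHUB_TOKEN variable are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: send a repository_dispatch event to <owner>/<repo>.
# Assumes GITHUB_TOKEN holds a token allowed to dispatch events to that repo.
# The event type is inserted into the JSON body as-is, so it must not
# contain double quotes or backslashes.
dispatch_event() {
    repo=$1        # e.g. "owner/name"
    event_type=$2  # event name the downstream workflow listens for
    curl --fail --silent --show-error \
        --request POST \
        --header "Accept: application/vnd.github+json" \
        --header "Authorization: Bearer ${GITHUB_TOKEN}" \
        --data "{\"event_type\": \"${event_type}\"}" \
        "https://api.github.com/repos/${repo}/dispatches"
}

# Example:
#   GITHUB_TOKEN=... dispatch_event owner/name rebuild
```

The downstream workflow only reacts if it is configured to run on repository_dispatch events of the same type.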

NCBI interaction scripts that are useful for fetching public metadata and sequences.

fetch-from-ncbi-entrez - Fetch metadata and nucleotide sequences from NCBI Entrez and output to a GenBank file. Useful for pathogens with metadata and annotations in custom fields that are not part of the standard NCBI Datasets outputs.

Historically, some pathogen repos used the undocumented NCBI Virus API through fetch-from-ncbi-virus to fetch data. However, we've opted to drop the NCBI Virus scripts due to #18.

Potential Nextstrain CLI scripts

sha256sum - Used to check if files are identical in upload-to-s3 and download-from-s3 scripts.
cloudfront-invalidate - CloudFront invalidation is already supported in the nextstrain remote command for S3 files. This exists as a separate script to support CloudFront invalidation when using the upload-to-s3 script.
upload-to-s3 - Upload file to AWS S3 bucket with compression based on file extension in S3 URL. Skips upload if the local file's hash is identical to the S3 object's metadata sha256sum. Adds the following user-defined metadata to the uploaded S3 object:
- sha256sum - hash of the file generated by sha256sum
- recordcount - the line count of the file
download-from-s3 - Download file from AWS S3 bucket with decompression based on file extension in S3 URL. Skips download if the local file already exists and has a hash identical to the S3 object's metadata sha256sum.
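
The two metadata values described above can be computed locally to see when a transfer would be skipped. This is a minimal sketch rather than the scripts' actual implementation, assuming a sha256sum command is available on PATH. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Hex SHA-256 digest of a file (the kind of value stored as sha256sum).
file_sha256() {
    sha256sum < "$1" | awk '{print $1}'
}

# Number of lines in a file (the kind of value stored as recordcount).
file_recordcount() {
    wc -l < "$1" | tr -d ' '
}

# Succeed if the local file's hash equals a previously recorded hash,
# which is the condition under which a transfer could be skipped.
is_identical() {
    [ "$(file_sha256 "$1")" = "$2" ]
}

# Example:
#   is_identical metadata.tsv "$recorded_hash" && echo "unchanged, skipping"
```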

Potential augur curate scripts

apply-geolocation-rules - Applies user curated geolocation rules to NDJSON records
merge-user-metadata - Merges user annotations with NDJSON records
transform-authors - Abbreviates full author lists to '<first author> et al.'
transform-field-names - Rename fields of NDJSON records
transform-genbank-location - Parses location field with the expected pattern "<country_value>[:<region>][, <locality>]" based on GenBank's country field
transform-strain-names - Ordered search for strain names across several fields.

Software requirements

Some scripts may require Bash ≥4. If you are running these scripts on macOS, the built-in Bash (/bin/bash) does not meet this requirement. You can install Homebrew's Bash, which is more up to date.
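
To see whether a particular Bash meets this requirement, you can ask it for its major version. This is a minimal sketch; the function names are hypothetical.

```shell
#!/bin/sh
# Print the major version of a Bash executable.
# Defaults to the first bash found on PATH when no path is given.
# BASH_VERSINFO is an array set by Bash; element 0 is the major version.
bash_major_version() {
    "${1:-bash}" -c 'echo "${BASH_VERSINFO[0]}"'
}

# Succeed if the given Bash is version 4 or newer.
bash_is_at_least_4() {
    [ "$(bash_major_version "$1")" -ge 4 ]
}

# Example:
#   bash_is_at_least_4 /bin/bash || echo "/bin/bash is too old" >&2
```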

Testing

Most scripts are untested within this repo, relying on "testing in production". That is the only practical testing option for some scripts such as the ones interacting with S3 and Slack.

For more locally testable scripts, Cram-style functional tests live in tests and are run as part of CI. To run these locally:

Download Cram: pip install cram
Run the tests: cram tests/
