nextstrain / ingest

Shared internal tooling for pathogen data ingest. Used by our pathogen build repos.

ingest

Shared internal tooling for pathogen data ingest. Used by our individual pathogen repos which produce Nextstrain builds. Expected to be vendored by each pathogen repo using git subrepo.

Some tools may only live here temporarily before finding a permanent home in augur curate or Nextstrain CLI. Others may happily live out their days here.

Vendoring

Nextstrain-maintained pathogen repos use git subrepo to vendor ingest scripts. (See the discussion of this decision in #3.)

For a list of Nextstrain repos that are currently using this method, use this GitHub code search.

If you don't already have git subrepo installed, follow the git subrepo installation instructions. Then add the latest ingest scripts to the pathogen repo by running:

git subrepo clone https://github.com/nextstrain/ingest ingest/vendored

Any future updates of ingest scripts can be pulled in with:

git subrepo pull ingest/vendored

If you run into merge conflicts and would like to pull in a fresh copy of the latest ingest scripts, pull with the --force flag:

git subrepo pull ingest/vendored --force

Warning: Beware of rebasing or dropping the parent commit of a git subrepo update.

git subrepo relies on metadata in the ingest/vendored/.gitrepo file, which includes the hash of the parent commit in the pathogen repo. If this hash no longer exists in the commit history, future git subrepo pull commands will fail.

If you run into an error similar to the following:

$ git subrepo pull ingest/vendored
git-subrepo: Command failed: 'git branch subrepo/ingest/vendored '.
fatal: not a valid object name: ''

Check the parent commit hash in the ingest/vendored/.gitrepo file and make sure the commit exists in the commit history. Update to the appropriate parent commit hash if needed.
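
The check described above can be scripted. The following is a minimal sketch, not part of this repo: it assumes the .gitrepo file uses the usual `key = value` layout with a `parent` entry, and it treats "exists in the commit history" as "is an ancestor of HEAD". The function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Optional


def read_gitrepo_parent(gitrepo_text: str) -> Optional[str]:
    """Return the `parent` value from the text of a .gitrepo file, if any."""
    for raw_line in gitrepo_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and section headers such as [subrepo].
        if not line or line.startswith((";", "#", "[")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "parent":
            return value.strip() or None
    return None


def is_in_history(commit: str, repo_dir: str = ".") -> bool:
    """True if `commit` is an ancestor of HEAD in the repo at `repo_dir`.

    Assumption: "in the commit history" is interpreted as reachable from HEAD.
    """
    result = subprocess.run(
        ["git", "-C", repo_dir, "merge-base", "--is-ancestor", commit, "HEAD"],
        capture_output=True,
    )
    return result.returncode == 0


def check_vendored_parent(
    gitrepo_path: str = "ingest/vendored/.gitrepo", repo_dir: str = "."
) -> str:
    """Describe whether the recorded parent commit is still in the history."""
    parent = read_gitrepo_parent(Path(gitrepo_path).read_text())
    if parent is None:
        return "no parent commit recorded"
    if is_in_history(parent, repo_dir):
        return f"parent {parent} found in history"
    return f"parent {parent} not found in history"
```

Calling `check_vendored_parent()` from the root of a pathogen repo reports whether the recorded parent still needs to be updated by hand.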

History

Much of this tooling originated in ncov-ingest and was passaged thru mpox's ingest/. It subsequently proliferated from mpox to other pathogen repos (rsv, zika, dengue, hepatitisB, forecasts-ncov) primarily thru copying. To counter that proliferation, this repo was made.

Elsewhere

The creation of this repo, in both the abstract and concrete, and the general approach to "ingest" has been discussed in various internal places, including:

Scripts

Scripts for supporting ingest workflow automation that don’t really belong in any of our existing tools.

  • notify-on-diff - Send Slack message with diff of a local file and an S3 object
  • notify-on-job-fail - Send Slack message with details about failed workflow job on GitHub Actions and/or AWS Batch
  • notify-on-job-start - Send Slack message with details about workflow job on GitHub Actions and/or AWS Batch
  • notify-on-record-change - Send Slack message with details about line count changes for a file compared to an S3 object's metadata recordcount. If the S3 object's metadata does not have recordcount, the script attempts to download the S3 object and count lines locally, which only works for xz-compressed S3 objects.
  • notify-slack - Send message or file to Slack
  • s3-object-exists - Used to prevent 404 errors during S3 file comparisons in the notify-* scripts
  • trigger - Triggers downstream GitHub Actions via the GitHub API using repository_dispatch events.
  • trigger-on-new-data - Triggers downstream GitHub Actions if the provided upload-to-s3 outputs do not contain the identical_file_message. A hacky way to ensure that we only trigger downstream phylogenetic builds if the S3 objects have been updated.
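
To illustrate what a repository_dispatch trigger involves, here is a minimal sketch of the underlying GitHub API request. It is not the trigger script itself: the function name, the example repo, and the event type are hypothetical, and the request is only built, not sent.

```python
import json
import urllib.request


def build_dispatch_request(
    repo: str, event_type: str, token: str
) -> urllib.request.Request:
    """Build (but do not send) a repository_dispatch request.

    `repo` is "owner/name"; `event_type` is whatever the downstream
    workflow listens for (a hypothetical value in the usage note).
    """
    body = json.dumps({"event_type": event_type}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{repo}/dispatches",
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

A request built with `build_dispatch_request("example-org/example-repo", "rebuild", token)` could be sent with `urllib.request.urlopen(...)`; GitHub answers a successful dispatch with an empty 204 response. The token needs permission to write to the target repository.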

NCBI interaction scripts that are useful for fetching public metadata and sequences.

  • fetch-from-ncbi-entrez - Fetch metadata and nucleotide sequences from NCBI Entrez and output to a GenBank file. Useful for pathogens with metadata and annotations in custom fields that are not part of the standard NCBI Datasets outputs.

Historically, some pathogen repos used the undocumented NCBI Virus API through fetch-from-ncbi-virus to fetch data. However, we've opted to drop the NCBI Virus scripts due to #18.

Potential Nextstrain CLI scripts

  • sha256sum - Used to check if files are identical in upload-to-s3 and download-from-s3 scripts.
  • cloudfront-invalidate - CloudFront invalidation is already supported in the nextstrain remote command for S3 files. This exists as a separate script to support CloudFront invalidation when using the upload-to-s3 script.
  • upload-to-s3 - Upload file to AWS S3 bucket with compression based on file extension in S3 URL. Skips upload if the local file's hash is identical to the S3 object's metadata sha256sum. Adds the following user-defined metadata to the uploaded S3 object:
    • sha256sum - hash of the file generated by sha256sum
    • recordcount - the line count of the file
  • download-from-s3 - Download file from AWS S3 bucket with decompression based on file extension in S3 URL. Skips download if the local file already exists and has a hash identical to the S3 object's metadata sha256sum.
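
The skip-if-identical behavior of upload-to-s3 and download-from-s3 can be sketched as follows. This is an illustration under assumptions, not the scripts' actual code: the function names are hypothetical, the S3 object's metadata is passed in as a plain dict rather than fetched from AWS, and recordcount is computed by counting newline characters.

```python
import hashlib


def file_metadata(path: str) -> dict:
    """Compute the sha256sum and recordcount values for a local file."""
    digest = hashlib.sha256()
    newline_count = 0
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
            newline_count += chunk.count(b"\n")
    return {"sha256sum": digest.hexdigest(), "recordcount": str(newline_count)}


def needs_transfer(local_path: str, remote_metadata: dict) -> bool:
    """Return False when the remote sha256sum matches the local file's hash.

    `remote_metadata` stands in for the S3 object's user-defined metadata;
    a missing sha256sum entry always results in a transfer.
    """
    local = file_metadata(local_path)
    return remote_metadata.get("sha256sum") != local["sha256sum"]
```

Comparing hashes stored in object metadata avoids downloading the object just to find out whether it changed, which is also what lets notify-on-record-change read recordcount without fetching the file.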

Potential augur curate scripts

Software requirements

Some scripts may require Bash ≥4. If you are running these scripts on macOS, the builtin Bash (/bin/bash) does not meet this requirement. You can install Homebrew's Bash, which is more up to date.

Testing

Most scripts are untested within this repo, relying on "testing in production". That is the only practical testing option for some scripts such as the ones interacting with S3 and Slack.

For scripts that can be tested locally, Cram-style functional tests live in tests and are run as part of CI. To run these locally:

  1. Download Cram: pip install cram
  2. Run the tests: cram tests/
