nextstrain / ncov

Nextstrain build for novel coronavirus SARS-CoV-2

Home Page:https://nextstrain.org/ncov

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Why use unaligned sequences as "aligned" input instead of the actually aligned sequences?

tsibley opened this issue · comments

Why do we do this (also in our other profile configs):

# Note: unaligned sequences are provided as "aligned" sequences to avoid an initial full-DB alignment
# as we re-align everything after subsampling.
inputs:
- name: open
metadata: "s3://nextstrain-data/files/ncov/open/metadata.tsv.gz"
aligned: "s3://nextstrain-data/files/ncov/open/sequences.fasta.xz"
skip_sanitize_metadata: true

instead of something more self-explanatory like this?

-# Note: unaligned sequences are provided as "aligned" sequences to avoid an initial full-DB alignment
-# as we re-align everything after subsampling.
 inputs:
   - name: open
     metadata: "s3://nextstrain-data/files/ncov/open/metadata.tsv.gz"
-    aligned: "s3://nextstrain-data/files/ncov/open/sequences.fasta.xz"
+    aligned: "s3://nextstrain-data/files/ncov/open/aligned.fasta.xz"
     skip_sanitize_metadata: true

My memory is that this trick preceded the existence of aligned.fasta.xz. I don't see any reason why not to use that input, but @corneliusroemer may know specifics.

I looked at the git blame and looks like at least the comment was changed in Dec 2021. Here's the meeting notes from just prior to that commit - seems this was part of moving alignment (cached) to ingest and reorganizing ncov so that many things happened after subsampling (?): https://docs.google.com/document/d/1rV5S8_zA92MFj7l-6LpxkGjQCi9JUtPPAn6ccjlqvo8/edit#heading=h.7nuap5p45ey7

In this PR (linked in above meeting notes) @huddlej writes:

modifies the Nextstrain profiles to use aligned inputs for sequences even though they are not aligned (to skip right to subsampling)

Maybe he remembers a bit more about why this approach?

I don't think I was involved at all with this ever :) But I had a look nonetheless.

The workflow behaves weirdly. When I run the dev-docs recommended command: snakemake -pf all --profile nextstrain_profiles/nextstrain-open -n the workflow assumes that the data is already there - but it definitely isn't, I don't have any data. What's going?

This is the first rule, it assumes there already is data!
image

The config value that is used here:

input_file = config.get("inputs", {}).get(origin_wildcard, {}).get(stage, "")

It's a great example why we shouldn't make complicated workflows. Even such a simple question gets hard to answer and people who wrote that code 2yrs ago can't make sense of it themselves.

Richard once wrote ncov-simple, I feel that it would solve a lot of the problems if we switched over to the simple version without all the complicating templating. That would make maintaining going forward so much simpler. Now that the hot pandemic phase is over it would make sense to simplify things.

@jameshadfield @emmahodcroft Ah, thanks for the context. On a quick examination, it does look like ncov-ingest's aligned.fasta.xz was not quite available when "aligned: "s3://nextstrain-data/files/ncov/open/sequences.fasta.xz" was added. I think we could switch it now, yeah? A minor thing, but I think clearer?

@corneliusroemer Hmm, maybe I can clarify.

The workflow behaves weirdly. […] the workflow assumes that the data is already there - but it definitely isn't, I don't have any data. What's going?

Snakemake will download the file as needed automatically. It's assuming it's there for you because you're in dry run mode, and it knows it can automatically download it during a real run.

The config value that is used here:

input_file = config.get("inputs", {}).get(origin_wildcard, {}).get(stage, "")

It's a great example why we shouldn't make complicated workflows. Even such a simple question gets hard to answer and people who wrote that code 2yrs ago can't make sense of it themselves.

I know what these config vars are and how they're used by the workflow. I know how that code works. That's not what I'm asking about.

My question was why we're passing off unaligned sequences as the aligned sequences input, instead of using aligned sequences as the aligned sequences input.

Richard once wrote ncov-simple, I feel that it would solve a lot of the problems if we switched over to the simple version without all the complicating templating. That would make maintaining going forward so much simpler. Now that the hot pandemic phase is over it would make sense to simplify things.

I understand your desire for reduced complexity, but I think this is a rushed, short-sighted conclusion that throws the baby out with the bathwater. There are dozens of groups external to us using this workflow who benefit from its encapsulated complexity. Using a simpler workflow would externalize that complexity back onto them.

I think we could switch it now, yeah? A minor thing, but I think clearer?

Yeah! Thanks for doublechecking the history; it doesn't sound like past-me to use "sequences" when "aligned" was available, but it makes sense that "aligned" wasn't actually available yet. Better to switch, now that it is.