nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement

Home Page: https://clades.nextstrain.org

[v3] error attaching nodes to tree

jameshadfield opened this issue

@ivan-aksamentov I'm sure you're aware of these, so feel free to close if it's already on your radar to fix.

When run on HBV data, including the provided test sequences, Nextclade (v3, built from 3653a04) raises the following error:

./target/release/nextclade run -j 1 --input-dataset test_datasets/hbv/files \
  --output-all <dir> --output-basename v3 \
  test_datasets/hbv/files/sequences.fasta

Error: 
   0: When attaching the new node for query sequence 'OP153999.1 |Hepatitis B virus isolate HBV_OBI_UFD1193, complete genome' to the tree
   1: Parent node is expected, but not found. This is an internal error. Please report it to developers

Location:
   packages_rs/nextclade/src/tree/tree_builder.rs:191

You can see this in the web UI too.

I wondered whether this was due to the dummy tree.json used in that (test) dataset. I created a proper tree (will share in Slack) and swapped it into the dataset. This resulted in a different error:

Error: 
   0: When attaching the new node for query sequence 'OP255998.1 |Hepatitis B virus isolate HBV_OBI_SPb274, complete genome' to the tree
   1: When splitting mutations between query sequence and the child node 'NODE_0001450'
   2: When splitting private nucleotide substitutions
   3: Found mutations with the same position, but different reference letters: C2740A and T2740C. This is an internal error. Please report it to developers

Location:
   packages_rs/nextclade/src/tree/split_muts.rs:118
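
For readers unfamiliar with this message, the inconsistency it reports can be illustrated with a small sketch. This is not Nextclade's implementation (that lives in the Rust file named above). `parse_sub` and `find_ref_conflicts` are hypothetical helpers, and the sketch assumes substitutions are written as reference letter, 1-based position, query letter, as in the error text.

```python
import re
from collections import defaultdict

# Assumed notation: <ref letter><position><query letter>, e.g. "C2740A".
SUB_RE = re.compile(r"^([A-Za-z-])(\d+)([A-Za-z-])$")


def parse_sub(text):
    """Split a substitution string into (ref, pos, qry). Hypothetical helper."""
    m = SUB_RE.match(text)
    if m is None:
        raise ValueError(f"not a substitution: {text!r}")
    ref, pos, qry = m.groups()
    return ref.upper(), int(pos), qry.upper()


def find_ref_conflicts(subs):
    """Return {position: [ref letters]} for positions whose substitutions
    disagree on the reference letter. Hypothetical helper."""
    refs_by_pos = defaultdict(set)
    for s in subs:
        ref, pos, _ = parse_sub(s)
        refs_by_pos[pos].add(ref)
    return {pos: sorted(refs) for pos, refs in refs_by_pos.items() if len(refs) > 1}


# The two substitutions from the error message, plus an unrelated one.
print(find_ref_conflicts(["C2740A", "T2740C", "G100A"]))
# → {2740: ['C', 'T']}
```

Two substitutions at the same position should share one reference letter, so a non-empty result here suggests that the two mutation sets were computed against different sequences.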

In both cases, the alignments, translations and metadata are written out before the program exits with code 1.

@jameshadfield Thanks! I was not aware of these, so this is very valuable.

The first error is reproducible. We will check it with Richard.

Regarding the second error, could you please share the tree (or, better, the full dataset) and the sequence in question, so that we can reproduce it and trace the execution?

P.S. This may or may not be related, and may or may not be useful for your work on HBV. Back when I was working on the genome annotation branch, I modified the genemap.gff of this dataset: test_datasets/hbv/files/genemap.gff#L6-L16, so that it has proper gene entries as separate lines. In the original, they don't exist and genes are only marked as "gene" attributes on CDSes, which is against the GFF3 spec. I haven't changed boundaries or anything else (at least I did not mean to). It is important that all three (reference sequence, reference tree and gene map) correspond to each other precisely. Any discrepancies can cause errors or just random pink elephants.
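
A rough way to spot the gene map situation described above is to list CDS rows that do not point at a gene row. The sketch below is only an illustration: `parse_attrs` and `cds_without_gene` are hypothetical helpers, the sample rows are made up rather than taken from the HBV dataset, and it assumes each CDS references its gene directly through `Parent` (real annotations may place an mRNA in between, and what Nextclade itself accepts is not checked here).

```python
def parse_attrs(field):
    """Parse a GFF3 attributes column ("key=value;key=value") into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def cds_without_gene(gff_text):
    """Return labels of CDS rows whose Parent is missing or is not a gene ID.
    Hypothetical helper; assumes CDS rows reference genes directly."""
    rows = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature row
        rows.append((cols[2], parse_attrs(cols[8])))

    gene_ids = {a["ID"] for t, a in rows if t == "gene" and "ID" in a}
    orphans = []
    for t, a in rows:
        if t != "CDS":
            continue
        parent = a.get("Parent", "")
        parents = parent.split(",") if parent else []
        if not any(p in gene_ids for p in parents):
            orphans.append(a.get("gene") or a.get("Name") or a.get("ID") or "<unnamed>")
    return orphans


# Made-up example: gene X has its own row, gene Y exists only as a CDS attribute.
EXAMPLE_GFF = (
    "##gff-version 3\n"
    "ref\t.\tgene\t1\t90\t.\t+\t.\tID=gene-X;Name=X\n"
    "ref\t.\tCDS\t1\t90\t.\t+\t0\tID=cds-X;Parent=gene-X;gene=X\n"
    "ref\t.\tCDS\t100\t190\t.\t+\t0\tID=cds-Y;gene=Y\n"
)
print(cds_without_gene(EXAMPLE_GFF))
# → ['Y']
```

This only checks the gene map against itself. Agreement between the gene map, the reference sequence and the reference tree would need separate checks.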

I merged #1208, which should solve the first error.

The second part should be addressed in #1211