ID collisions in FASTA output

Question

ID collisions in FASTA output

maitai opened this issue 6 years ago · comments

Expected Behavior

Unique ID for every entry in the resulting FASTA file.

Current Behavior

Non-uniqueness of incrementing number at the end of the FASTA header ID results in collisions.
There seem to be two types of collisions:

numbers used twice (rounding .5?)
scientific notation (formatter?)

Example (sequence data lines removed):

>NXL_FL_2_Virus.fasta_0.5
>NXL_FL_2_Virus.fasta_1.5
>NXL_FL_2_Virus.fasta_2.5
>NXL_FL_2_Virus.fasta_3.5
[...]
>NXL_FL_2_Virus.fasta_99997.5
>NXL_FL_2_Virus.fasta_99998.5
>NXL_FL_2_Virus.fasta_99999.5
>NXL_FL_2_Virus.fasta_100000
>NXL_FL_2_Virus.fasta_100002
>NXL_FL_2_Virus.fasta_100002
>NXL_FL_2_Virus.fasta_100004
>NXL_FL_2_Virus.fasta_100004
>NXL_FL_2_Virus.fasta_100006
>NXL_FL_2_Virus.fasta_100006
[...]
>NXL_FL_2_Virus.fasta_999996
>NXL_FL_2_Virus.fasta_999996
>NXL_FL_2_Virus.fasta_999998
>NXL_FL_2_Virus.fasta_999998
>NXL_FL_2_Virus.fasta_1e+06
>NXL_FL_2_Virus.fasta_1e+06
>NXL_FL_2_Virus.fasta_1e+06
>NXL_FL_2_Virus.fasta_1e+06
>NXL_FL_2_Virus.fasta_1e+06
>NXL_FL_2_Virus.fasta_1e+06
>NXL_FL_2_Virus.fasta_1.00001e+06
>NXL_FL_2_Virus.fasta_1.00001e+06
>NXL_FL_2_Virus.fasta_1.00001e+06
>NXL_FL_2_Virus.fasta_1.00001e+06
[...]

Steps to Reproduce (for bugs)

@martin-steinegger ran PLASS for me (thanks again!) and provided me with the output FASTA files.

Plass Output (for bugs)

see above

Context

Your Environment

See above.

Martin Steinegger · Answer 1 · Tue Aug 07 2018 01:12:56 GMT+0800 (China Standard Time)

Thank you for the report. This should be fixed now.

mhibberd · Answer 2 · Wed Mar 27 2019 06:12:01 GMT+0800 (China Standard Time)

Sorry to re-open - this appears to still be an issue, in combination with #6

My base command is roughly:

plass assemble \
 forward_reads.fq \
 reverse_reads.fq \
 plass_assembly.faa \
 tmp \
 --translation-table 11

In processing some output protein assemblies from plass, derived from ~12.6 million paired-end reads, I'm finding semi-duplicated headers in the output. In the primary plass output:

>A00325:41:HCC5VDSXX:3:2402:3179:3464 [Orf: 7685735, 1, 226, 1, 1, 1]
*MDISRMVVPIRVSADAGRVTGEVFFAEFQAKCLCLFHGQAVVGCISGVKADDILVAFDI
TMLGVLTILAVCQQTGRCKREIAALKGVEQVGFPQLRSALFIQKLLSSERIMLVNEVRFD
GGVVRVFITIRLFKSLATTSNCMSQICGLWLNNAAALATEAICLSVAIQFTQNFLGAEPE
CSQVCKFVKEFRPDLLGVRTAGIEVPCKLVEVSALLAALSKQGGNSVKSFFSAACNDDCF
LDLDVVDGTADKSGEGEIQEFASEFQ*
>A00325:41:HCC5VDSXX:3:2402:3179:3464 [Orf: 7685735, 2, 227, -1, 1, 1]
*MTKTNVQPIPSNSQHHDTPNNKTHVIKFRVTAEEKASPELTCKLLNLSLSTFIRRAIHN
VKIEKTVIVAGGGEETLNAVSTLLAQCSKQGGNLNQLARHFNSGGADTEQIRAKLLDELA
DLTAFRLSAEKVLGELYGNAQAYRLSRKGSCIVQP*

I'm curious about the naming conventions. First of all, when I grep the IDs of these fasta lines (they look like read headers) in my input sequencing data, I find something similar, but not exactly the same (headers from R1 and R2 files):

@A00325:41:HCC5VDSXX:3:2402:3179:34648 1:N:0:AGGCAGAA+ATTGTCAT
@A00325:41:HCC5VDSXX:3:2402:3179:34648 2:N:0:AGGCAGAA+ATTGTCAT

I chose to reopen this particular issue, as it seems that the additional, bracketed information in the header lines could potentially resolve this collision. The duplication survives length filtering and clustering, suggesting (as does the eye) that these proteins aren't necessarily closely related to each other.

Any help? Thanks!

Martin Steinegger · Answer 3 · Wed Mar 27 2019 10:46:26 GMT+0800 (China Standard Time)

@mhibberd both protein started with protein frames from A00325:41:HCC5VDSXX:3:2402:3179:3464. The first comes from the reverse frame 1 226 and the second is from 2 227. Is the name clash a problem for you?

mhibberd · Answer 4 · Wed Mar 27 2019 11:22:48 GMT+0800 (China Standard Time)

@martin-steinegger a couple of things bother me: 1) the name of the protein(s) that start from the read specified don't actually match the name of the read itself (see the reference to the fastq headers above - the protein headers are missing the final '8' in the read header). 2) yeah, the naming clash is an issue when I'm trying to work with the sequences themselves (homology/similarity searching, clustering, etc.). I could easily put some arbitrary "_1" and "_2" business on the sequences, but figured I'd put it out there in case your group had a better suggestion. I also just wanted to make sure I understood what the header lines meant before I modified the PLASS output.

Martin Steinegger · Answer 5 · Thu Mar 28 2019 00:49:52 GMT+0800 (China Standard Time)

Oh thank you for pointing out the missing '8'. I have not seen this. This is indeed a bug and I will fix it. But I will also think about a better naming scheme. You suggestion make sense to add the _1 and _2.
I do like the output of metaspades that has a uniq naming, including sequence length and coverage information.

mhibberd · Answer 6 · Thu Mar 28 2019 04:37:45 GMT+0800 (China Standard Time)

@martin-steinegger one more related item - I think this naming issue may be affecting how mmseqs deals with the output sequences from plass. I attempted to use 'easy-cluster' to remove related protein sequences in the output of plass filtered to >100 AAs per sequence:

mmseqs easy-cluster plass_output_l100.faa plass_output_l100_i90c90_cluster.faa tmp --cov-mode 1 -c 0.9 --min-seq-id 0.9

In the output rep seqs file and cluster-indicating tsv I see two sequences that share 99% identity but have the same fasta header name (but different "Orf" designations) both retained. Happy to provide more data to replicate if it's helpful.