Non-UTF-8 characters in your database cause parsing errors
carden24 opened this issue
I ran into problems parsing DIAMOND alignments created with the latest version of SUPER-FOCUS (SUPER-FOCUS 0.34, Apr 2, 2019):
Generating output... [31.191s]
Traceback (most recent call last):
  File "superfocus_v2.py", line 602, in <module>
    main()
  File "superfocus_v2.py", line 568, in main
    del_alignments)
  File "superfocus_v2.py", line 177, in parse_alignments
    for row in alignment_reader:
  File "/home/erick/edge/edge_v1.5/thirdParty/Anaconda2/envs/superfocus/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 277: invalid start byte
The problem is that one of the sequence headers in the database FASTA files contains a byte that is not valid UTF-8.
I found them using this command:
grep -axv '.*' file.txt
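For anyone without GNU grep at hand, roughly the same check can be sketched in Python. This is only a minimal sketch: the function name find_non_utf8_lines is made up for illustration and is not part of SUPER-FOCUS. It reads the file as bytes and reports every line that fails to decode as UTF-8.

```python
def find_non_utf8_lines(path):
    """Return (line_number, byte_offset, raw_line) for each line that is not valid UTF-8.

    Hypothetical helper for illustration; not part of SUPER-FOCUS.
    """
    bad_lines = []
    # Open in binary mode so that Python does not decode anything implicitly.
    with open(path, "rb") as handle:
        for line_number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode("utf-8")
            except UnicodeDecodeError as error:
                # error.start is the offset of the first undecodable byte within this line.
                bad_lines.append((line_number, error.start, raw_line.rstrip(b"\r\n")))
    return bad_lines
```

Calling find_non_utf8_lines("100_clusters.fasta") and printing each tuple with repr() makes the offending byte visible as an escape such as \x96.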
The culprit is this sequence:
fig|419947.9.peg.1104__1009__Mycobacterial_MmpL5_membrane_protein_cluster__Rv0678__MarR_family_transcriptional_regulator_associated_with_MmpL5MmpS5_efflux_system
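It looks fine at first glance, but if you display the non-printing characters, there is a stray byte between "MmpL5" and "MmpS5". It shows up below as M-^V, which is how cat -v style notation renders byte 0x96, the same byte reported in the traceback. In Windows-1252 that byte is an en dash, so the name was possibly written with one originally. The trailing $ marks the end of the line.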
grep -axv '.*' 100_clusters.fasta
>fig|419947.9.peg.1104__1009__Mycobacterial_MmpL5_membrane_protein_cluster__Rv0678__MarR_family_transcriptional_regulator_associated_with_MmpL5M-^VMmpS5_efflux_system$
I found this problem in the 100_clusters.fasta file.
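A database-side repair might be sketched as follows. This is an assumption-laden sketch, not the maintainers' method: the function name replace_invalid_utf8 and the choice of "-" as the replacement are mine, and the aligner database would presumably need to be rebuilt from the cleaned file afterwards.

```python
def replace_invalid_utf8(src_path, dst_path, placeholder="-"):
    """Copy src_path to dst_path, replacing undecodable bytes with placeholder.

    Hypothetical helper for illustration. Returns the number of lines changed.
    """
    changed = 0
    # newline="" keeps the original line endings untouched when writing.
    with open(src_path, "rb") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for raw_line in src:
            try:
                text = raw_line.decode("utf-8")
            except UnicodeDecodeError:
                # Undecodable bytes become U+FFFD, which is then swapped
                # for the placeholder (an assumption: "-" is acceptable in headers).
                text = raw_line.decode("utf-8", errors="replace")
                text = text.replace("\ufffd", placeholder)
                changed += 1
            dst.write(text)
    return changed
```

Note that this changes the header text, so any file that refers to the old name would have to be updated to match.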
This issue can be worked around by adding encoding='ISO-8859-1' to the open() call in the parse_alignments function of do_alignment.py. Ideally, you should try to fix the database issue first.
Before:
with open(alignment) as alignment_file:
After:
with open(alignment, encoding='ISO-8859-1') as alignment_file:
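To illustrate why the change helps, here is a small self-contained sketch. It assumes the alignment file is tab-separated and read with the csv module; the function name read_alignment_rows is hypothetical and does not mirror the actual SUPER-FOCUS code. ISO-8859-1 maps every possible byte value to a character, so decoding can never raise UnicodeDecodeError, whereas UTF-8 rejects a lone 0x96 byte.

```python
import csv


def read_alignment_rows(path, encoding="ISO-8859-1"):
    """Read a tab-separated alignment file into a list of rows.

    Hypothetical helper for illustration. With ISO-8859-1 every byte
    decodes to some character, so the read cannot fail on stray bytes.
    """
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, encoding=encoding, newline="") as alignment_file:
        return list(csv.reader(alignment_file, delimiter="\t"))
```

One caveat with this approach: the stray byte is kept in the subject name as the character U+0096, so any lookup of that name against another file only matches if both sides were decoded the same way.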
Thanks, @carden24. I will add the change into the next release.
Surprisingly, other SUPER-FOCUS users have formatted the same database file, and this is the first time I have seen this error.
Best
Fixed - Thanks
It is a very unusual error indeed. You will only see it if you get a hit against that subject in the database. I do not know whether it only shows up with my versions of Python 3 (3.6.10) or csv (1.0). Thanks for the quick fix. Feel free to close the issue.
gotcha! thanks again.