Non-UTF-8 characters in your database cause parsing errors

Question

carden24 opened this issue 4 years ago

I ran into problems parsing DIAMOND alignments created with the latest version of SUPER-FOCUS (SUPER-FOCUS 0.34, Apr 2, 2019):

Generating output...  [31.191s]
Traceback (most recent call last):
  File "superfocus_v2.py", line 602, in <module>
    main()
  File "superfocus_v2.py", line 568, in main
    del_alignments)
  File "superfocus_v2.py", line 177, in parse_alignments
    for row in alignment_reader:
  File "/home/erick/edge/edge_v1.5/thirdParty/Anaconda2/envs/superfocus/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 277: invalid start byte

The issue is that one of the sequences in your database FASTA files contains non-UTF-8 characters.

I found them using this command:

grep -axv '.*' file.txt

The culprit is this sequence:

fig|419947.9.peg.1104__1009__Mycobacterial_MmpL5_membrane_protein_cluster__Rv0678__MarR_family_transcriptional_regulator_associated_with_MmpL5MmpS5_efflux_system

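It looks fine at first glance, but if you display the non-printing characters it contains an odd byte, shown as M-^V. In cat -v style notation that is byte 0x96 (the M- prefix means the high bit is set; a bare ^V would be SYN, synchronous idle), which matches the byte in the traceback. The trailing $ marks the end of the line.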

grep -axv '.*' 100_clusters.fasta

>fig|419947.9.peg.1104__1009__Mycobacterial_MmpL5_membrane_protein_cluster__Rv0678__MarR_family_transcriptional_regulator_associated_with_MmpL5M-^VMmpS5_efflux_system$

I found this problem in the 100_clusters.fasta file.
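
For anyone who prefers to do the same check from Python, here is a minimal sketch of the idea behind the grep command: read the file as raw bytes and report every line that fails to decode as UTF-8. The function names find_non_utf8_lines and scan_file are hypothetical and are not part of SUPER-FOCUS.

```python
def find_non_utf8_lines(lines):
    """Return (line_number, byte_offset, raw_line) for each line that is not valid UTF-8.

    `lines` is any iterable of bytes objects, e.g. a file opened in binary mode.
    """
    bad = []
    for number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first offending byte within the line.
            bad.append((number, err.start, raw))
    return bad


def scan_file(path):
    # Open in binary mode so Python does not try to decode the bytes itself.
    with open(path, "rb") as handle:
        return find_non_utf8_lines(handle)
```

Calling scan_file("100_clusters.fasta") should list the offending header together with the position of the bad byte, assuming the file is in the current directory.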

This issue can be worked around by adding the argument encoding='ISO-8859-1' to the open() call in the parse_alignments function of do_alignment.py. Ideally, you should try to fix the database itself first.

Before:
with open(alignment) as alignment_file:

After:
with open(alignment, encoding='ISO-8859-1') as alignment_file:
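
To illustrate why this change avoids the crash, here is a small self-contained sketch. ISO-8859-1 maps every possible byte to a code point, so decoding can never fail, whereas UTF-8 rejects 0x96 as a start byte. The sample row, the tab delimiter, and the helper read_rows are assumptions made for the demonstration and are not taken from the SUPER-FOCUS source. Passing errors='replace' with UTF-8 is shown as an alternative that is also crash-free but replaces the bad byte with U+FFFD.

```python
import csv
import io

# Hypothetical alignment row containing the offending byte 0x96 in the subject name.
raw = b"read_1\tfig|1.1.peg.1__MmpL5\x96MmpS5\t98.5\n"


def read_rows(data, **open_kwargs):
    """Parse tab-separated bytes the way a text-mode open() would decode them."""
    text = io.TextIOWrapper(io.BytesIO(data), newline="", **open_kwargs)
    return list(csv.reader(text, delimiter="\t"))


# Strict UTF-8 decoding fails on byte 0x96, as in the traceback.
try:
    read_rows(raw, encoding="utf-8")
except UnicodeDecodeError as err:
    print("utf-8 failed:", err.reason)

# ISO-8859-1 accepts every byte, so the row is parsed without error.
latin1_rows = read_rows(raw, encoding="ISO-8859-1")

# Alternative: keep UTF-8 but substitute undecodable bytes with U+FFFD.
replaced_rows = read_rows(raw, encoding="utf-8", errors="replace")
```

Note that with ISO-8859-1 the subject name contains the control character U+0096 rather than whatever character the database author intended, so downstream name matching may still be affected until the database itself is cleaned.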

Geni Silva · Answer 1 · Fri Jul 03 2020 06:03:38 GMT+0800 (China Standard Time)

Thanks, @carden24. I will add the change into the next release.

Surprisingly, other SUPER-FOCUS users have formatted the same database file, and this is the first time I have seen this error.

Best

Geni Silva · Answer 2 · Fri Jul 03 2020 06:08:26 GMT+0800 (China Standard Time)

Fixed - Thanks

Erick Cardenas · Answer 3 · Sat Jul 04 2020 00:02:31 GMT+0800 (China Standard Time)

It is a very unusual error indeed. You will only see it if you have a hit against that subject in the database. I do not know whether it only shows up with my versions of Python 3 (3.6.10) or csv (1.0). Thanks for the quick fix. Feel free to close the issue.

Geni Silva · Answer 4 · Sat Jul 04 2020 00:03:27 GMT+0800 (China Standard Time)

gotcha! thanks again.