UnicodeDecodeError in Windows
Salingo opened this issue · comments
Hello,
Thanks very much for this useful tool!
I recently ran into a small issue when running the tool:
C:\>arxiv_latex_cleaner ./paper
Traceback (most recent call last):
File "c:\programdata\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\programdata\anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\ProgramData\Anaconda3\Scripts\arxiv_latex_cleaner.exe\__main__.py", line 4, in <module>
File "c:\programdata\anaconda3\lib\site-packages\arxiv_latex_cleaner\__main__.py", line 87, in <module>
run_arxiv_cleaner(ARGS)
File "c:\programdata\anaconda3\lib\site-packages\arxiv_latex_cleaner\arxiv_latex_cleaner.py", line 330, in run_arxiv_cleaner
splits['tex_in_root'] + splits['tex_not_in_root'], parameters)
File "c:\programdata\anaconda3\lib\site-packages\arxiv_latex_cleaner\arxiv_latex_cleaner.py", line 138, in _read_all_tex_contents
os.path.join(parameters['input_folder'], fn))
File "c:\programdata\anaconda3\lib\site-packages\arxiv_latex_cleaner\arxiv_latex_cleaner.py", line 131, in _read_file_content
return fp.readlines()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x99 in position 5722: illegal multibyte sequence
The likely cause is that Python falls back to the system's default encoding when none is given, and on my Windows installation that default is 'gbk'. Changing with open(filename, 'r') as fp
to with open(filename, 'r', encoding='utf-8') as fp
may solve the issue.
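As a minimal sketch of the suggested change: the helper name `read_lines` below is hypothetical and is not the tool's actual `_read_file_content`. It only illustrates passing `encoding='utf-8'` explicitly so the result no longer depends on the locale default. Note that `encoding` is only accepted in text mode ('r'); combining it with binary mode ('rb') raises a ValueError.

```python
import os
import tempfile


def read_lines(filename, encoding="utf-8"):
    """Read a text file with an explicit encoding (hypothetical helper).

    Without the encoding argument, open() uses the locale's preferred
    encoding, which is what produced the 'gbk' error in the traceback.
    """
    with open(filename, "r", encoding=encoding) as fp:
        return fp.readlines()


# Small demonstration with a temporary UTF-8 file containing non-ASCII text.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.tex")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("na\u00efve \u2019quoted\u2019 text\n")
    print(read_lines(path))
```

This assumes the .tex files really are UTF-8; a file in another encoding will still fail, only with a 'utf-8' codec error instead.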
Comment: I have tested this on the source code, and the problem is solved after adding encoding='utf-8'.
Thanks @Salingo
I'll add the fix to the code.
@jponttuset Thanks!
Reopening as a reminder for myself :)
Done
I'm seeing a similar error on Ubuntu 22.04:
Traceback (most recent call last):
File "/home/tommy/.local/bin/arxiv_latex_cleaner", line 5, in <module>
from arxiv_latex_cleaner.__main__ import __main__
File "/home/tommy/.local/lib/python3.10/site-packages/arxiv_latex_cleaner/__main__.py", line 184, in <module>
run_arxiv_cleaner(final_args)
File "/home/tommy/.local/lib/python3.10/site-packages/arxiv_latex_cleaner/arxiv_latex_cleaner.py", line 523, in run_arxiv_cleaner
tex_contents = _read_all_tex_contents(
File "/home/tommy/.local/lib/python3.10/site-packages/arxiv_latex_cleaner/arxiv_latex_cleaner.py", line 213, in _read_all_tex_contents
contents[fn] = _read_file_content(
File "/home/tommy/.local/lib/python3.10/site-packages/arxiv_latex_cleaner/arxiv_latex_cleaner.py", line 205, in _read_file_content
lines = fp.readlines()
File "/usr/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 2645: invalid continuation byte
EDIT: In my case, I had an old file from a template that was encoded as iso-8859-1, which I determined with this Python snippet:
import magic  # libmagic Python bindings
# filename is the path of the file to inspect
with open(filename, 'rb') as f:
    blob = f.read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob)  # "utf-8", "us-ascii", etc.
print(filename, encoding)
(per this answer on Stack Overflow).
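For a stray non-UTF-8 file like the one above, one possible workaround is to try UTF-8 first and fall back to iso-8859-1. This is only a sketch using the standard library. The function name `decode_with_fallback` and the choice of fallback encodings are assumptions, not part of arxiv_latex_cleaner.

```python
def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-1")):
    """Decode bytes with the first encoding that works (hypothetical helper).

    Returns a (text, encoding_used) tuple. iso-8859-1 maps every byte to a
    character, so as a last resort it never fails, but it may yield wrong
    characters if the file is actually in some other encoding.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Byte 0xd2 followed by an ASCII letter is invalid UTF-8, so the helper
# falls back to iso-8859-1 here.
text, used = decode_with_fallback(b"a\xd2b")
print(repr(text), used)
```

Once the encoding is known, re-saving the file as UTF-8 avoids the error without any change to the tool.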