google-research / arxiv-latex-cleaner

arXiv LaTeX Cleaner: Easily clean the LaTeX code of your paper to submit to arXiv

UnicodeDecodeError in Windows

Salingo opened this issue

Hello,

Thanks very much for this useful tool!

I recently ran into a small issue when running the tool:

C:\>arxiv_latex_cleaner ./paper
Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\programdata\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\Scripts\arxiv_latex_cleaner.exe\__main__.py", line 4, in <module>
  File "c:\programdata\anaconda3\lib\site-packages\arxiv_latex_cleaner\__main__.py", line 87, in <module>
    run_arxiv_cleaner(ARGS)
  File "c:\programdata\anaconda3\lib\site-packages\arxiv_latex_cleaner\arxiv_latex_cleaner.py", line 330, in run_arxiv_cleaner
    splits['tex_in_root'] + splits['tex_not_in_root'], parameters)
  File "c:\programdata\anaconda3\lib\site-packages\arxiv_latex_cleaner\arxiv_latex_cleaner.py", line 138, in _read_all_tex_contents
    os.path.join(parameters['input_folder'], fn))
  File "c:\programdata\anaconda3\lib\site-packages\arxiv_latex_cleaner\arxiv_latex_cleaner.py", line 131, in _read_file_content
    return fp.readlines()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x99 in position 5722: illegal multibyte sequence

The cause is probably that the default text encoding on this Windows system is 'gbk', so open() tries to decode the UTF-8 file as GBK. Changing with open(filename, 'r') as fp to with open(filename, 'r', encoding='utf-8') as fp may solve the issue.

Comment: I have tested this on the source code, and the problem is solved after adding encoding='utf-8'.
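
For illustration, the change described above can be sketched as a small reader that passes the encoding explicitly instead of relying on the locale default. This is a minimal sketch, not the project's actual code: the helper name read_file_content, its encoding parameter, and the temporary main.tex file are assumptions made for the example.

```python
import os
import tempfile


def read_file_content(filename, encoding="utf-8"):
    """Read a text file as lines, decoding with an explicit encoding.

    Hypothetical helper: passing `encoding` means the result no longer
    depends on the platform's default text encoding.
    """
    with open(filename, "r", encoding=encoding) as fp:
        return fp.readlines()


# Demo: write a UTF-8 file containing a non-ASCII character, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "main.tex")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("It\u2019s a test\n")  # U+2019 encodes to bytes E2 80 99
    lines = read_file_content(path)

print(lines)
```

Note that open() only accepts the encoding argument in text mode; combining it with 'rb' raises a ValueError.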

Thanks @Salingo
I'll add the fix to the code.

Reopening as a reminder for myself :)

I'm seeing a similar error on Ubuntu 22.04:

Traceback (most recent call last):
  File "/home/tommy/.local/bin/arxiv_latex_cleaner", line 5, in <module>
    from arxiv_latex_cleaner.__main__ import __main__
  File "/home/tommy/.local/lib/python3.10/site-packages/arxiv_latex_cleaner/__main__.py", line 184, in <module>
    run_arxiv_cleaner(final_args)
  File "/home/tommy/.local/lib/python3.10/site-packages/arxiv_latex_cleaner/arxiv_latex_cleaner.py", line 523, in run_arxiv_cleaner
    tex_contents = _read_all_tex_contents(
  File "/home/tommy/.local/lib/python3.10/site-packages/arxiv_latex_cleaner/arxiv_latex_cleaner.py", line 213, in _read_all_tex_contents
    contents[fn] = _read_file_content(
  File "/home/tommy/.local/lib/python3.10/site-packages/arxiv_latex_cleaner/arxiv_latex_cleaner.py", line 205, in _read_file_content
    lines = fp.readlines()
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 2645: invalid continuation byte

EDIT: In my case, the culprit was an old file from a template that was encoded as iso-8859-1, which I determined with this Python snippet:

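import magic  # libmagic bindings that provide magic.open()

filename = "main.tex"  # placeholder: path of the file to inspect
with open(filename, 'rb') as fp:
    blob = fp.read()  # raw bytes, no decoding
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob)  # "utf-8" "us-ascii" etc
print(filename, encoding)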

(per this answer on stackoverflow).
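
When a stray file like this is not UTF-8, one possible workaround is to try a short list of candidate encodings in order. The following is only a sketch under that assumption: the function name decode_with_fallback and the candidate list are hypothetical, and this is not something the tool is known to do.

```python
def decode_with_fallback(data, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first encoding that decodes `data`.

    Hypothetical helper. iso-8859-1 maps every byte to a character, so it
    never fails; keep it last, since it will also "succeed" on files that
    are really in some other single-byte encoding.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Byte 0xd2 followed by an ASCII letter is invalid UTF-8 but valid iso-8859-1.
text, used = decode_with_fallback(b"Se\xd2or")
print(used, text)
```

Converting the offending file to UTF-8 once, rather than guessing at read time, avoids silently mis-decoding text.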