UnicodeDecodeError in stdout/err thread
r4lv opened this issue · comments
Dear amoffat,
First, thank you for such an amazing project!
Lately, I've been using sh to generate LaTeX documents with latexmk
, and sh gets stuck with an UnicodeDecodeError
— I think this is related to _tee
, and some strange stdout output. I'm using sh like this:
cmd = sh.latexmk("-pdf", blabla, _out=sys.stdout, _err_to_out=True, _tee="out", _bg=True, _bg_exc=True)
try:
cmd.wait()
except KeyboardInterrupt:
cmd.kill()
except sh.ErrorReturnCode as e:
# process e.stdout, ...
and it never finishes the wait()
. This is the error I obtain:
Underfull \hbox (badness 10000) in paragraph at lines 2--2
Overfull \hbox (0.9766pt too wide) in paragraph at lines 9--10
Exception in thread STDOUT/ERR thread for pid 5456:
Traceback (most recent call last):
File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 892, in run
self._target(*self._args, **self._kwargs)
File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 1637, in wrap
fn(*rgs, **kwargs)
File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 2585, in output_thread
done = f.read()
File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 3049, in read
self.write_chunk(chunk)
File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 3026, in write_chunk
self.should_quit = self.process_chunk(chunk)
File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 2928, in process
handler.write(encode(chunk))
File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 2918, in encode
def encode(chunk): return chunk.decode(handler.encoding) # noqa: E731
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 22: invalid start byte
(I'm getting the same error on python 3.7)
I "solved" that problem by adding an errors="ignore"
to the chunk.decode
in line 2918, but I'm not sure if there are any side-effects:
0xf6
seems to be the german ö / o with diaeresis
, I don't know why utf-8
couldn't decode that — would you have an idea?
Hi @r4lv. If you run chardet.detect on your input, what encoding does it say it is?
Thank you for the idea!
The input file is {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
, and the log file LaTeX produces (which should match the stdout) is {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
— something's fishy there. I tried adding _encoding="latin1"
and/or _decode_errors="ignore"
to my call to sh, but without success. Even when passing _encoding
, the handler.encoding
from line 2917 still defaults to utf-8
. Is this due to the _tee
, _out
, etc which I pass too?
Hmm, sh should respect your _encoding
value. I'm having a hard time seeing how it could default to utf-8
when you pass it in explicitly.
This is where it gets its default value, which would be your system encoding (but you are overriding it)
The encoding value is passed into two StreamReader
objects (these read from stdout and stderr)
here
and here
The StreamReaders create a chunk processor, using that passed-in encoding directly, here:
A "file chunk consumer" is created because your output (stdout) has a write
method on it, here:
In the file chunk consumer, the encoding is picked up off of the StreamReader handler here:
So if you are specifying "latin1", I am not seeing how "utf8" makes it to line 2918. I don't see anywhere where the "encoding" parameter is overwritten by anything. 🤔
Can you capture your stdout and attach it here? That would make it easier to reproduce the error and narrow down the problem.
We'll need a reproducible example to be able to debug this. I'm closing the issue, but feel free to reopen if you get back to this.
I hit this recently too. It seems that when the descriptor passed to _out
is something like StringIO
instead of BytesIO
, sh
will attempt to decode the output:
>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=sys.stdout, _tee=True)
pub rsa4096 2018-05-15 [SC]
D74F 9F14 6E7F 7557 8358 3158 B343 B2BA 293E 5174
Exception in thread STDOUT/ERR thread for pid 1573984:
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.10/site-packages/sh.py", line 1641, in wrap
fn(*rgs, **kwargs)
File "/usr/lib/python3.10/site-packages/sh.py", line 2593, in output_thread
done = f.read()
File "/usr/lib/python3.10/site-packages/sh.py", line 3057, in read
self.write_chunk(chunk)
File "/usr/lib/python3.10/site-packages/sh.py", line 3034, in write_chunk
self.should_quit = self.process_chunk(chunk)
File "/usr/lib/python3.10/site-packages/sh.py", line 2936, in process
handler.write(encode(chunk))
File "/usr/lib/python3.10/site-packages/sh.py", line 2926, in encode
def encode(chunk): return chunk.decode(handler.encoding) # noqa: E731
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 33: invalid start byte
pub rsa4096 2018-05-15 [SC]
D74F 9F14 6E7F 7557 8358 3158 B343 B2BA 293E 5174
>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=StringIO(), _tee=True)
Exception in thread STDOUT/ERR thread for pid 1576798:
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.10/site-packages/sh.py", line 1641, in wrap
fn(*rgs, **kwargs)
File "/usr/lib/python3.10/site-packages/sh.py", line 2593, in output_thread
done = f.read()
File "/usr/lib/python3.10/site-packages/sh.py", line 3057, in read
self.write_chunk(chunk)
File "/usr/lib/python3.10/site-packages/sh.py", line 3034, in write_chunk
self.should_quit = self.process_chunk(chunk)
File "/usr/lib/python3.10/site-packages/sh.py", line 2977, in process
handler.write(chunk.decode(encoding, decode_errors))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 33: invalid start byte
pub rsa4096 2018-05-15 [SC]
D74F 9F14 6E7F 7557 8358 3158 B343 B2BA 293E 5174
>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=BytesIO(), _tee=True)
b'pub rsa4096 2018-05-15 [SC]\n D74F 9F14 6E7F 7557 8358 3158 B343 B2BA 293E 5174\nuid [ unknown] Marcus M\xfcller (GNU Radio Maintainer) <mmueller@gnuradio.org>\nuid [ unknown] Marcus M\xfcller <marcus@hostalia.de>\n\n'
With _decode_errors="ignore"
set, it starts to work for StringIO
, but still not for sys.stdout
:
>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=StringIO(), _tee=True, _decode_errors="ignore")
pub rsa4096 2018-05-15 [SC]
D74F 9F14 6E7F 7557 8358 3158 B343 B2BA 293E 5174
uid [ unknown] Marcus Mller (GNU Radio Maintainer) <mmueller@gnuradio.org>
uid [ unknown] Marcus Mller <marcus@hostalia.de>
>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=sys.stdout, _tee=True, _decode_errors="ignore")
pub rsa4096 2018-05-15 [SC]
D74F 9F14 6E7F 7557 8358 3158 B343 B2BA 293E 5174
Exception in thread STDOUT/ERR thread for pid 1579387:
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.10/site-packages/sh.py", line 1641, in wrap
fn(*rgs, **kwargs)
File "/usr/lib/python3.10/site-packages/sh.py", line 2593, in output_thread
done = f.read()
File "/usr/lib/python3.10/site-packages/sh.py", line 3057, in read
self.write_chunk(chunk)
File "/usr/lib/python3.10/site-packages/sh.py", line 3034, in write_chunk
self.should_quit = self.process_chunk(chunk)
File "/usr/lib/python3.10/site-packages/sh.py", line 2936, in process
handler.write(encode(chunk))
File "/usr/lib/python3.10/site-packages/sh.py", line 2926, in encode
def encode(chunk): return chunk.decode(handler.encoding) # noqa: E731
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 33: invalid start byte
pub rsa4096 2018-05-15 [SC]
D74F 9F14 6E7F 7557 8358 3158 B343 B2BA 293E 5174
There must be something more affecting this. I cannot reproduce the above on OS X and Python 3.10.1:
>>> import sh, sys
>>> from io import StringIO
>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "E6B0152CE5614F6680EEFDC25D11C3B7A2E59699", _out=StringIO(), _tee=True, _decode_errors="ignore")
pub ed25519 2022-10-19 [SC] [expires: 2024-10-18]
E6B0 152C E561 4F66 80EE FDC2 5D11 C3B7 A2E5 9699
uid [ ultim. ] John Doe <john@example.com>
sub cv25519 2022-10-19 [E] [expires: 2024-10-18]
>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "E6B0152CE5614F6680EEFDC25D11C3B7A2E59699", _out=sys.stdout, _tee=True, _decode_errors="ignore")
pub ed25519 2022-10-19 [SC] [expires: 2024-10-18]
E6B0 152C E561 4F66 80EE FDC2 5D11 C3B7 A2E5 9699
uid [ ultim. ] John Doe <john@example.com>
sub cv25519 2022-10-19 [E] [expires: 2024-10-18]
pub ed25519 2022-10-19 [SC] [expires: 2024-10-18]
E6B0 152C E561 4F66 80EE FDC2 5D11 C3B7 A2E5 9699
uid [ ultim. ] John Doe <john@example.com>
sub cv25519 2022-10-19 [E] [expires: 2024-10-18]
The output of gpg
in your example are all valid ASCII characters, which won't trigger this problem.
The example I provided actually contains a special character:
$ gpg --fingerprint D74F9F146E7F755783583158B343B2BA293E5174
pub rsa4096 2018-05-15 [SC]
D74F 9F14 6E7F 7557 8358 3158 B343 B2BA 293E 5174
uid [ unknown] Marcus Müller (GNU Radio Maintainer) <mmueller@gnuradio.org>
uid [ unknown] Marcus Müller <marcus@hostalia.de>