amoffat / sh

Python process launching

Home Page:https://sh.readthedocs.io/en/latest/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

UnicodeDecodeError in stdout/err thread

r4lv opened this issue · comments

commented

Dear amoffat,

First, thank you for such an amazing project!

Lately, I've been using sh to generate LaTeX documents with latexmk, and sh gets stuck with an UnicodeDecodeError — I think this is related to _tee, and some strange stdout output. I'm using sh like this:

cmd = sh.latexmk("-pdf", blabla, _out=sys.stdout, _err_to_out=True, _tee="out", _bg=True, _bg_exc=True)
try:
    cmd.wait()
except KeyboardInterrupt:
    cmd.kill()
except sh.ErrorReturnCode as e:
    # process e.stdout, ...

and it never finishes the wait(). This is the error I obtain:

Underfull \hbox (badness 10000) in paragraph at lines 2--2
Overfull \hbox (0.9766pt too wide) in paragraph at lines 9--10
Exception in thread STDOUT/ERR thread for pid 5456:
Traceback (most recent call last):
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 1637, in wrap
    fn(*rgs, **kwargs)
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 2585, in output_thread
    done = f.read()
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 3049, in read
    self.write_chunk(chunk)
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 3026, in write_chunk
    self.should_quit = self.process_chunk(chunk)
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 2928, in process
    handler.write(encode(chunk))
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 2918, in encode
    def encode(chunk): return chunk.decode(handler.encoding)  # noqa: E731
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 22: invalid start byte

(I'm getting the same error on python 3.7)

I "solved" that problem by adding an errors="ignore" to the chunk.decode in line 2918, but I'm not sure if there are any side-effects:

sh/sh.py

Lines 2916 to 2920 in 12420ed

def get_file_chunk_consumer(handler):
if getattr(handler, "encoding", None):
def encode(chunk): return chunk.decode(handler.encoding) # noqa: E731
else:
def encode(chunk): return chunk # noqa: E731

0xf6 seems to be the german ö / o with diaeresis, I don't know why utf-8 couldn't decode that — would you have an idea?

Hi @r4lv. If you run chardet.detect on your input, what encoding does it say it is?

commented

Thank you for the idea!
The input file is {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}, and the log file LaTeX produces (which should match the stdout) is {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''} — something's fishy there. I tried adding _encoding="latin1" and/or _decode_errors="ignore" to my call to sh, but without success. Even when passing _encoding, the handler.encoding from line 2917 still defaults to utf-8. Is this due to the _tee, _out, etc which I pass too?

Hmm, sh should respect your _encoding value. I'm having a hard time seeing how it could default to utf-8 when you pass it in explicitly.

This is where it gets its default value, which would be your system encoding (but you are overriding it)

sh/sh.py

Line 1220 in 12420ed

"encoding": DEFAULT_ENCODING,

The encoding value is passed into two StreamReader objects (these read from stdout and stderr)
here

sh/sh.py

Lines 2206 to 2212 in 12420ed

self._stdout_stream = StreamReader(
self.log.get_child("streamreader", "stdout"),
self._stdout_parent_fd, stdout, self._stdout,
ca["out_bufsize"], ca["encoding"],
ca["decode_errors"], stdout_pipe,
save_data=save_stdout
)

and here

sh/sh.py

Lines 2233 to 2238 in 12420ed

self._stderr_stream = StreamReader(
Logger("streamreader"),
self._stderr_parent_fd, stderr, self._stderr,
ca["err_bufsize"], ca["encoding"], ca["decode_errors"],
stderr_pipe, save_data=save_stderr
)

The StreamReaders create a chunk processor, using that passed-in encoding directly, here:

sh/sh.py

Lines 2999 to 3000 in 12420ed

self.process_chunk, self.finish_chunk_processor = \
determine_how_to_feed_output(handler, encoding, decode_errors)

A "file chunk consumer" is created because your output (stdout) has a write method on it, here:

sh/sh.py

Lines 2896 to 2897 in 12420ed

elif hasattr(handler, "write"):
process, finish = get_file_chunk_consumer(handler)

In the file chunk consumer, the encoding is picked up off of the StreamReader handler here:

sh/sh.py

Line 2918 in 12420ed

def encode(chunk): return chunk.decode(handler.encoding) # noqa: E731

So if you are specifying "latin1", I am not seeing how "utf8" makes it to line 2918. I don't see anywhere where the "encoding" parameter is overwritten by anything. 🤔

Can you capture your stdout and attach it here? That would make it easier to reproduce the error and narrow down the problem.

We'll need a reproducible example to be able to debug this. I'm closing the issue, but feel free to reopen if you get back to this.

I hit this recently too. It seems that when the descriptor passed to _out is something like StringIO instead of BytesIO, sh will attempt to decode the output:

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=sys.stdout, _tee=True)
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174
Exception in thread STDOUT/ERR thread for pid 1573984:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 1641, in wrap
    fn(*rgs, **kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2593, in output_thread
    done = f.read()
  File "/usr/lib/python3.10/site-packages/sh.py", line 3057, in read
    self.write_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 3034, in write_chunk
    self.should_quit = self.process_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2936, in process
    handler.write(encode(chunk))
  File "/usr/lib/python3.10/site-packages/sh.py", line 2926, in encode
    def encode(chunk): return chunk.decode(handler.encoding)  # noqa: E731
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 33: invalid start byte
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=StringIO(), _tee=True)
Exception in thread STDOUT/ERR thread for pid 1576798:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 1641, in wrap
    fn(*rgs, **kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2593, in output_thread
    done = f.read()
  File "/usr/lib/python3.10/site-packages/sh.py", line 3057, in read
    self.write_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 3034, in write_chunk
    self.should_quit = self.process_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2977, in process
    handler.write(chunk.decode(encoding, decode_errors))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 33: invalid start byte
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=BytesIO(), _tee=True)
b'pub   rsa4096 2018-05-15 [SC]\n      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174\nuid           [ unknown] Marcus M\xfcller (GNU Radio Maintainer) <mmueller@gnuradio.org>\nuid           [ unknown] Marcus M\xfcller <marcus@hostalia.de>\n\n'

With _decode_errors="ignore" set, it starts to work for StringIO, but still not for sys.stdout:

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=StringIO(), _tee=True, _decode_errors="ignore")
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174
uid           [ unknown] Marcus Mller (GNU Radio Maintainer) <mmueller@gnuradio.org>
uid           [ unknown] Marcus Mller <marcus@hostalia.de>

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=sys.stdout, _tee=True, _decode_errors="ignore")
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174
Exception in thread STDOUT/ERR thread for pid 1579387:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 1641, in wrap
    fn(*rgs, **kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2593, in output_thread
    done = f.read()
  File "/usr/lib/python3.10/site-packages/sh.py", line 3057, in read
    self.write_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 3034, in write_chunk
    self.should_quit = self.process_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2936, in process
    handler.write(encode(chunk))
  File "/usr/lib/python3.10/site-packages/sh.py", line 2926, in encode
    def encode(chunk): return chunk.decode(handler.encoding)  # noqa: E731
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 33: invalid start byte
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174

There must be something more affecting this. I cannot reproduce the above on OS X and Python 3.10.1:

>>> import sh, sys
>>> from io import StringIO
>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "E6B0152CE5614F6680EEFDC25D11C3B7A2E59699", _out=StringIO(), _tee=True, _decode_errors="ignore")
pub   ed25519 2022-10-19 [SC] [expires: 2024-10-18]
      E6B0 152C E561 4F66 80EE  FDC2 5D11 C3B7 A2E5 9699
uid           [ ultim. ] John Doe <john@example.com>
sub   cv25519 2022-10-19 [E] [expires: 2024-10-18]

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "E6B0152CE5614F6680EEFDC25D11C3B7A2E59699", _out=sys.stdout, _tee=True, _decode_errors="ignore")
pub   ed25519 2022-10-19 [SC] [expires: 2024-10-18]
      E6B0 152C E561 4F66 80EE  FDC2 5D11 C3B7 A2E5 9699
uid           [ ultim. ] John Doe <john@example.com>
sub   cv25519 2022-10-19 [E] [expires: 2024-10-18]

pub   ed25519 2022-10-19 [SC] [expires: 2024-10-18]
      E6B0 152C E561 4F66 80EE  FDC2 5D11 C3B7 A2E5 9699
uid           [ ultim. ] John Doe <john@example.com>
sub   cv25519 2022-10-19 [E] [expires: 2024-10-18]

The output of gpg in your example are all valid ASCII characters, which won't trigger this problem.

The example I provided actually contains a special character:

$ gpg --fingerprint D74F9F146E7F755783583158B343B2BA293E5174
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174
uid           [ unknown] Marcus Müller (GNU Radio Maintainer) <mmueller@gnuradio.org>
uid           [ unknown] Marcus Müller <marcus@hostalia.de>