[PyQt] UnicodeDecodeError with output from Windows OS command

Fri Dec 1 12:18:43 GMT 2017

On Fri, Dec 01, 2017 at 11:11:14AM +0000, J Barchan wrote:
> [Among the places I have been asking to try to resolve this, you have been
> the most "on the ball" and provided responses which actually show an
> understanding of the issue.  So I'd be ever so grateful if you wouldn't
> mind taking your time to look through what I'm writing below.  I know it's
> a bit long, but I hope you'll see I'm just laying out a very logical
> argument of what's going on here.  There are 3 "Parts".  Plus a "thank you"
> at the end....]

Sure! As mentioned in my email signature, I prefer emails that are too long
over ones that are too short ;-)

FWIW that understanding mostly came from reading the articles (or watching the
talk) I linked earlier :)

> *FIRST PART:*
> 
> You suggest:
> 
> What you probably want in this case is:
> >
> >    bytearray.decode(sys.getfilesystemencoding())
> >
> 
> 
> Now, from https://docs.python.org/3/library/sys.html "*PEP 529 -- Change
> Windows filesystem encoding to UTF-8*":
> 
> >
> >    - On Mac OS X, the encoding is 'utf-8'.
> >    - On Unix, the encoding is the locale encoding.
> >    - On Windows, the encoding may be 'utf-8' or 'mbcs', depending on user
> >    configuration.
> >
> And from the https://www.python.org/dev/peps/pep-0529/ referenced from
> there:
> 
> > This PEP proposes changing the default filesystem encoding on Windows to
> > utf-8
> >
> and:
> 
> > Currently the default filesystem encoding is 'mbcs', which is a
> > meta-encoder that uses the active code page.
> >
> > The use of utf-8 will not be configurable, except for the provision of a
> > "legacy mode" flag to revert to the previous behaviour.
> > Update sys.getfilesystemencoding
> > <https://www.python.org/dev/peps/pep-0529/#id7>
> >
> > Remove the default value for Py_FileSystemDefaultEncoding and set it in
> > initfsencoding() to utf-8, or if the legacy-mode switch is enabled to
> > mbcs.
> >
> So, (remembering that I cannot test this under Windows...) I understand
> that sys.getfilesystemencoding() is going to return utf-8 there, and we
> already know that the problem is this will be wrong when it encounters the
> 0x9c character, since that's what I'm reporting....

Yes, thinking about it again, I was wrong there - sorry! If you can't force
robocopy to output UTF-8, the best (and probably correct) guess is
locale.getpreferredencoding():
https://docs.python.org/3/library/locale.html#locale.getpreferredencoding

The documentation for subprocess mentions that it's using
locale.getpreferredencoding(False) if no encoding was given. I'm still not sure
what that argument (do_setlocale) means exactly, but you might want to just do
the same.
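
A minimal sketch of "doing the same" as subprocess, assuming the raw bytes come
from QByteArray.data() or a similar source. The helper name decode_tool_output
is made up for illustration. As far as the locale docs describe it, passing
False only skips the temporary setlocale() call that is otherwise used to look
up the user's preference:

```python
import locale
from typing import Optional


def decode_tool_output(raw: bytes, encoding: Optional[str] = None,
                       errors: str = "strict") -> str:
    """Decode raw bytes from an external tool (hypothetical helper).

    If no encoding is given, fall back to the locale's preferred encoding,
    which is what the subprocess docs describe for text mode.
    """
    if encoding is None:
        # do_setlocale=False: do not temporarily call setlocale() here.
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding, errors=errors)


# Plain ASCII decodes the same way under any common locale encoding.
print(decode_tool_output(b"100% done"))
```

Whether the locale encoding really matches what robocopy writes is still a
guess; the explicit encoding argument is there so a known code page can be
passed in instead.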

> *SECOND PART:*
> [...]
> 
> For example, I see from http://doc.qt.io/qt-5/qtextcodec.html#details :
> 
> >    QByteArray encodedString = "...";
> >    QTextCodec *codec = QTextCodec::codecForName("KOI8-R");
> >    QString string = codec->toUnicode(encodedString);
> >
> > and:
> 
> > QTextCodec *QTextCodec::codecForLocale()
> >
> > Returns a pointer to the codec most suitable for this locale.
> >
> > On Windows, the codec will be based on a system locale. On Unix systems,
> > the codec might fall back to using the *iconv* library if no builtin
> > codec for the locale can be found.
> >
> > Note that in these cases the codec's name will be "System".
> >
> So is *that* the "correct" approach to use, do you think?

I think that essentially does the same thing I explained above.

> *THIRD PART:*
> 
> Elsewhere I have been told that the content of the error I see:
> 
> Unhandled Exception:
> >
> > 'utf-8' codec can't decode byte 0x9c in position 32: invalid start byte
> >
> > <class 'UnicodeDecodeError'>
> > File "C:\HJinn\widgets\messageboxes.py", line 289, in
> > processReadyReadStandardOutput
> > output = output.data().decode('utf-8')
> >
> 
> is specifically coming from *Python*.  (The QByteArray.decode() is a
> Python-only/PyQt function, not present in Qt C++, is that correct??)

There's no QByteArray.decode() - but QByteArray.data() (which you call here)
gives you a Python 'bytes' object, and that has a .decode() method.
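
To show that the exception comes from plain Python bytes and not from Qt, here
is a small sketch that reproduces it without PyQt. The payload is made up; only
the 0x9c byte matters:

```python
# Stand-in for what QByteArray.data() returns: an ordinary bytes object.
raw = b"price: \x9c5"

error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    # The message has the same shape as in the traceback quoted above.
    print(exc)
    # start is the offset of the offending byte within raw.
    print(exc.start, exc.reason)
```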

> If that's true, are you sure that the whole issue is nothing to do with
> specifically Python's handling of the 0x9c character, as per the dedicated
> stackoverflow post for specifically that character:
> https://stackoverflow.com/questions/12468179/unicodedecodeerror-utf8-codec-cant-decode-byte-0x9c
> ?

Yes, it has nothing to do with that character. You'll find similar posts for
any commonly used character - say an "ä" (a with umlaut):
https://stackoverflow.com/q/33295733/2085149

> But again you may say it has nothing to do with Python *per se*, it's just
> people noticing it transitioning from Python 2 to Python 3 behaviour....

Right. You'll also find the same kind of errors for different programming
languages, as long as they decide to raise an exception instead of silently
replacing those characters with potentially nobody noticing ;-)

Or for databases: https://dba.stackexchange.com/q/4777
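
A short sketch of that difference in Python, using an "ä" encoded as cp1252 as
a made-up example: the default error handler raises, while errors="replace"
quietly substitutes U+FFFD and carries on:

```python
# "Bär" as a Windows-1252 program would write it: b'B\xe4r'
raw = "B\u00e4r".encode("cp1252")

# Default (strict) handling: the bad byte raises an exception.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace": no exception, but the original character is lost.
replaced = raw.decode("utf-8", errors="replace")

print(strict_failed, repr(replaced))
```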

> I see there various suggested solutions of decode('cp1252') (that must be
> "Code Page", and be equivalent to windows_1252, ah ha!)

Correct. If Windows used UTF-8 everywhere like almost everything does nowadays,
this would be easier ;-)
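
A quick check of that equivalence with the standard codecs module, plus an
illustration of why the choice of code page matters. cp850 is only picked here
as an example of another Windows code page, not as a claim about what robocopy
actually uses:

```python
import codecs

# Both names resolve to the same codec in Python.
same_codec = codecs.lookup("cp1252").name == codecs.lookup("windows_1252").name

# The same byte means different things in different code pages.
as_cp1252 = b"\x9c".decode("cp1252")
as_cp850 = b"\x9c".decode("cp850")

print(same_codec, as_cp1252, as_cp850)
```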

> and decode('unicode_escape') ....

It looks like that exists to decode Python string literals with backslash
escapes, like if you have a literal "\x9c". I didn't know about it, and it's
probably not useful unless you want to programmatically read/write Python source
files.
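
A small sketch of what that codec seems to be for, with a made-up input: the
four literal characters backslash, x, 9, c are turned into a single code point,
which is a different job from decoding a raw 0x9c byte:

```python
# Four bytes of text: a backslash followed by "x9c" (not the byte 0x9c).
literal = rb"\x9c"

# unicode_escape interprets the backslash escape like Python source would.
decoded = literal.decode("unicode_escape")

print(len(literal), len(decoded), repr(decoded))
```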

Florian

-- 
https://www.qutebrowser.org  | me at the-compiler.org (Mail/XMPP)
   GPG: 916E B0C8 FD55 A072  | https://the-compiler.org/pubkey.asc
         I love long mails!  | https://email.is-not-s.ms/