[PyQt] UnicodeDecodeError with output from Windows OS command

Fri Dec 1 11:11:14 GMT 2017

On 30 November 2017 at 22:59, Florian Bruhin <me at the-compiler.org> wrote:

> (I assume you accidentally removed the list from the reply, so I re-added
> it)
>
> On Thu, Nov 30, 2017 at 10:28:58PM +0000, J Barchan wrote:
> > Did I mention that I have learnt now what output is causing the error?
> > It turns out it's if robocopy --- which simply reports filenames as it
> > goes --- encounters a filename with a "£" (UK pound sterling) character
> > in its name.  It would happen just as much if the command were, say,
> > "dir" instead of "robocopy".  That pound character is a single byte of
> > 0x9c in the output, which decode('utf-8') is barfing on.
>
> Right, because it is not output encoded as utf-8, so utf-8 is the wrong
> encoding to decode it with ;-)
>
> > In extensive investigations, all I came across was
> > https://riverbankcomputing.com/pipermail/pyqt/2010-January/025564.html:
> >
> > > >> > really often I have this kind of code in my application when it
> > > >> > comes to converting a QByteArray to a string under Python 3.1.
> > > >> >
> > > >> > s = bytes(QByteArray).decode()
> > > >> >
> > > >> > In Python 2 one could use
> > > >> >
> > > >> > s = unicode(QByteArray)
> > > >> >
> > > >> > to get the same result. Did I miss something or could QByteArray
> > > >> > get a decode() method to make it similar to a Python3 bytes or
> > > >> > bytearray type?
> > > >>
> > > >> The Python3 way to do it is...
> > > >>
> > > >> s = str(QByteArray, encoding='ascii')
> > > >>
> > > >> ...or whatever encoding is used.
> > > >>
> > > >> It would be possible to change things so that...
> > > >>
> > > >> s = str(QByteArray)
> > > >>
> > > >> ...automatically uses the default encoding. However that would then
> > > >> make it inconsistent with the behaviour of...
> > > >>
> > > >> s = str(bytes)
> > > >>
> > > >> ...and I'm not sure that that is a good idea.
> >
> > See, that guy is saying in Python 3 "*...or whatever encoding is used*.".
> > But in Python 2 he says it "*automatically uses the default encoding*".
>
> That person seems to be talking about ascii encoding only, which is the
> most simple case.
>
> > *I just want the Python 2 behaviour: what was the "default encoding"
> > used there, which meant I didn't have to specify one explicitly?*  I
> > want the Python 3 equivalent.
>
> It's picking ascii and erroring out if that doesn't work, which is
> basically what you're seeing ;-)
>
> https://docs.python.org/2/howto/unicode.html#the-unicode-type
>
>     The first argument is converted to Unicode using the specified
>     encoding; if you leave off the encoding argument, the ASCII encoding
>     is used for the conversion, so characters greater than 127 will be
>     treated as errors: [...]
>
> > Did it just use bytearray.decode('utf-8', 'replace') like you say for the
> > C++?
>
> No, the equivalent of bytearray.decode('ascii', 'strict').
>
> What you probably want in this case is:
>
>    bytearray.decode(sys.getfilesystemencoding())
>
> Assuming that robocopy doesn't somehow re-encode the filenames it gets.
>
> Florian
>
> --
> https://www.qutebrowser.org  | me at the-compiler.org (Mail/XMPP)
>    GPG: 916E B0C8 FD55 A072  | https://the-compiler.org/pubkey.asc
>          I love long mails!  | https://email.is-not-s.ms/
>

Firstly, I admit I get lost/confused when replying in my Gmail client.
I've seen some people here reply to me only, reply to pyqt only, or reply
to some combination.  Anyway, this time I'm using "Reply All"....

Secondly, thanks to your suggestions, at the end of the day I can decode
via "latin-1" or "windows_1252" and it works in my customer's case here in
the UK (I do not have access to the Windows machine running the code, or
to their filenames; I develop only under Linux!), so I am no longer "stuck".

However, I would still like to do this in the "best"/"most robust"/"most
portable"/"correct" way if I can, plus I like to understand things
properly.  Assuming you are still interested & helpful (!?), I still have a
couple of things to pursue with you.

[Among the places I have been asking to try to resolve this, you have been
the most "on the ball" and provided responses which actually show an
understanding of the issue.  So I'd be ever so grateful if you wouldn't
mind taking your time to look through what I'm writing below.  I know it's
a bit long, but I hope you'll see I'm just laying out a very logical
argument of what's going on here.  There are 3 "Parts".  Plus a "thank you"
at the end....]

*FIRST PART:*

You suggest:

> What you probably want in this case is:
>
>    bytearray.decode(sys.getfilesystemencoding())

Now, from https://docs.python.org/3/library/sys.html "*PEP 529 -- Change
Windows filesystem encoding to UTF-8*":

>    - On Mac OS X, the encoding is 'utf-8'.
>    - On Unix, the encoding is the locale encoding.
>    - On Windows, the encoding may be 'utf-8' or 'mbcs', depending on user
>      configuration.

And from the https://www.python.org/dev/peps/pep-0529/ referenced from
there:

> This PEP proposes changing the default filesystem encoding on Windows to
> utf-8
>
and:

> Currently the default filesystem encoding is 'mbcs', which is a
> meta-encoder that uses the active code page.
>
> The use of utf-8 will not be configurable, except for the provision of a
> "legacy mode" flag to revert to the previous behaviour.
> Update sys.getfilesystemencoding
> <https://www.python.org/dev/peps/pep-0529/#id7>
>
> Remove the default value for Py_FileSystemDefaultEncoding and set it in
> initfsencoding() to utf-8, or if the legacy-mode switch is enabled to
> mbcs.
>
So (remembering that I cannot test this under Windows) I understand that
sys.getfilesystemencoding() is going to return 'utf-8' there, and we
already know that utf-8 is the wrong choice when the output contains the
0x9c byte, since that is exactly the error I'm reporting....
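
Since I cannot run anything on the Windows machine myself, here is a minimal sketch of what could be run there to see which encodings Python reports. The helper name report_encodings is just an illustration, and I show no expected output because the values depend on the machine and its configuration.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python reports on this machine."""
    return {
        # Encoding Python uses for file names (normally 'utf-8' on Windows
        # from Python 3.6, per PEP 529, unless legacy mode is enabled).
        "filesystem": sys.getfilesystemencoding(),
        # Encoding the locale prefers for text data.
        "locale": locale.getpreferredencoding(False),
        # Encoding of this process's own stdout, if it has one.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, encoding in report_encodings().items():
        print(f"{name}: {encoding}")
```

None of these three is guaranteed to match what an arbitrary child process writes to its pipe; the sketch only shows what Python itself believes.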

*SECOND PART:*

You, and others, keep talking about the behaviour of the particular
robocopy command I happen to get the problem with.  I don't see that it is
the cause of, or even relevant to, the issue.  My code is for generic
issuing of an unknown OS command and reading its output.

So, let's get rid of robocopy completely!  Certainly here in the UK, on my
keyboard/with my UK Windows I can go into a Command Prompt and type:

> echo £

and I get £ as output.  And if I examine the output it has a single byte of
0x9c for the £ character.
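
This much can be checked on Linux too, because Python ships the codec tables. The sketch below decodes that single byte under a few candidate encodings. That cp850 or cp437 is the console (OEM) code page on a UK or US Windows machine is my assumption here, not something confirmed on the customer's system.

```python
POUND_BYTE = b"\x9c"

# Candidate single-byte encodings; which one a given console uses is an assumption.
CANDIDATES = ["cp850", "cp437", "cp1252", "latin-1"]

decoded = {name: POUND_BYTE.decode(name) for name in CANDIDATES}

for name, text in decoded.items():
    # Show the resulting character and its Unicode code point.
    print(f"{name:8} -> {text!r} (U+{ord(text):04X})")
```

Only the two OEM code pages give back the pound sign.  cp1252 and latin-1 decode the byte without raising, but to a different character, so "no exception" does not by itself mean the text is right.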

So, let's say that --- or some program which outputs the same --- is my
sub-process program.  *Note that now it has nothing to do with Windows
filenames!!*   (Nor anything to do with robocopy.)  The choice of anything
to do with sys.getfilesystemencoding() now seems "inappropriate" to me, as
the issue has nothing to do with file system file names.  Which is how it
always seemed to me.

[BTW: I notice
https://stackoverflow.com/questions/15530635/why-is-sys-getdefaultencoding-different-from-sys-stdout-encoding-and-how-does.
While that is Python2, it may be relevant to this discussion.  Though
please don't get hung up on it!]

My "spawner" app cannot possibly know anything about which program is being
run, nor should it.

I wish to display the output much as, say, it would appear in the Command
Prompt console.  I just want the output in a scrollable Qt window (QTextEdit)
for the user to examine.  Now, there are many programs out there which do
this sort of thing.  For example, Qt Creator itself allows the spawning of
arbitrary OS commands and displays the output in some scrollable control,
as does, say, Visual Studio.  Neither has any knowledge of what the
sub-process might be doing in the way of text encoding, yet they manage
fine.  I want the same!  So how do they do it?!

My *thought* is: When I install Windows here, I am asked what "locale" I am
in, I pick "UK", and at that point (as I understand it) the system is set
so that my keyboard input and my whatever output (certainly for the
console) is set to something UK-ish, which includes that £ character.  Your
mileage may vary, e.g. if you're in the US, or Germany with all its funny
characters.  I'm thinking/wondering whether there is something from there I
can get at which is what I should be using if I have to pass an explicit
parameter to decode().

For example, I see from http://doc.qt.io/qt-5/qtextcodec.html#details :

> QByteArray encodedString = "...";
> QTextCodec *codec = QTextCodec::codecForName("KOI8-R");
> QString string = codec->toUnicode(encodedString);
>
and:

> QTextCodec *QTextCodec::codecForLocale()
>
> Returns a pointer to the codec most suitable for this locale.
>
> On Windows, the codec will be based on a system locale. On Unix systems,
> the codec will might fall back to using the *iconv* library if no builtin
> codec for the locale can be found.
>
> Note that in these cases the codec's name will be "System".
>
So is *that* the "correct" approach to use, do you think?
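
As a Python-side counterpart to that idea, here is a rough sketch.  It assumes that console programs on Windows write in the OEM code page reported by the Win32 call GetOEMCP(), and that elsewhere the locale's preferred encoding is a fair guess; a child process is free to emit anything else.  The names guess_console_encoding and decode_output are hypothetical helpers, not part of PyQt or Python.

```python
import codecs
import locale
import sys


def guess_console_encoding():
    """Best-effort guess at the encoding of a console program's output."""
    if sys.platform == "win32":
        import ctypes  # imported here so the sketch also loads elsewhere

        # GetOEMCP() returns the OEM code page number, e.g. 850.
        name = f"cp{ctypes.windll.kernel32.GetOEMCP()}"
    else:
        name = locale.getpreferredencoding(False)
    try:
        codecs.lookup(name)  # make sure Python actually ships this codec
    except LookupError:
        name = "utf-8"
    return name


def decode_output(raw, encoding=None):
    """Decode child-process bytes for display, never raising on bad bytes."""
    chosen = encoding or guess_console_encoding()
    # bytes() accepts a Python bytes object as well as a QByteArray.
    return bytes(raw).decode(chosen, errors="replace")
```

With errors="replace" a wrong guess shows up as odd characters in the QTextEdit rather than as an unhandled exception, which seems the lesser evil for a display-only window.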

*THIRD PART:*

Elsewhere I have been told that the content of the error I see:

> Unhandled Exception:
>
> 'utf-8' codec can't decode byte 0x9c in position 32: invalid start byte
>
> <class 'UnicodeDecodeError'>
> File "C:\HJinn\widgets\messageboxes.py", line 289, in processReadyReadStandardOutput
> output = output.data().decode('utf-8')
>

is specifically coming from *Python*.  (The QByteArray.decode() is a
Python-only/PyQt function, not present in Qt C++, is that correct??)  If
that's true, are you sure that the whole issue is nothing to do with
Python's handling of the 0x9c character specifically, as per the
Stack Overflow post dedicated to that character:
https://stackoverflow.com/questions/12468179/unicodedecodeerror-utf8-codec-cant-decode-byte-0x9c
?

But again you may say it has nothing to do with Python *per se*, it's just
people noticing it transitioning from Python 2 to Python 3 behaviour....

I see there various suggested solutions of decode('cp1252') (that must be
"Code Page", and be equivalent to windows_1252, ah ha!) and
decode('unicode_escape') ....
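
To see whether Qt is involved at all, the same failure can be reproduced with a plain Python bytes object and no PyQt import, as in the sketch below.  It also tries the 'replace' error handler and the unicode_escape codec on the same input.  The sample byte string is made up for illustration.

```python
raw = b"price \x9c5"  # made-up sample containing the offending byte

try:
    raw.decode("utf-8")
    error_reason = None
except UnicodeDecodeError as exc:
    # Same exception class and reason as in the traceback above.
    error_reason = exc.reason

# 'replace' substitutes U+FFFD for undecodable bytes instead of raising.
lossy = raw.decode("utf-8", errors="replace")

# unicode_escape does not raise, but maps the byte to U+009C like latin-1 does.
escaped = raw.decode("unicode_escape")

print(error_reason)
print(repr(lossy))
print(repr(escaped))
```

So the exception comes from Python's own bytes.decode(), and unicode_escape only hides the problem: it yields a control character rather than the pound sign.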

*Well, if you've read this far I'm really grateful, and would really
appreciate any suggestions.  If you're not fed up with me/have lost the
will to live... ;-)*