collederas - 3 months ago
Python Question

Cannot understand the 32-bit encoding of the "Python" string

I am reading the Unicode HOWTO in the Python docs to start to really understand Unicode. In the Encodings section, it shows the string "Python" represented as an array of 32-bit integers.

I don't understand why each character has so many 00s. For example, the character "P" is represented by 0x50 (which I understand, being the hex equivalent of the ASCII ordinal 80). But then it is followed by three pairs of 00s. What are they? How should I read this representation?

Answer

An array of 32-bit integers consists of, well, 32-bit integers.

A byte is 8 bits, so each 32-bit integer, and therefore each character in this representation, necessarily takes up 4 bytes.

As a 32-bit integer, the code point of "P" is 0x00000050, which is stored as four bytes. You could order them 0x50 0x00 0x00 0x00 (least significant byte first, most significant last: "little endian") or 0x00 0x00 0x00 0x50 (most significant byte first, least significant last: "big endian"). Different CPUs make different choices for the order, as the paragraph you link to notes.
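If it helps to see the two byte orders side by side, here is a minimal sketch using the standard `struct` module. The variable names and the `show` helper are mine, purely for illustration; the HOWTO does not build the bytes this way.

```python
import struct

text = "Python"
code_points = [ord(ch) for ch in text]  # ord("P") == 0x50 == 80

# "I" is an unsigned 32-bit integer; "<" means little-endian, ">" big-endian.
fmt_count = len(code_points)
little = struct.pack(f"<{fmt_count}I", *code_points)
big = struct.pack(f">{fmt_count}I", *code_points)

def show(data):
    """Hypothetical helper: format bytes as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in data)

print(show(little[:4]))  # the four bytes for "P", little-endian
print(show(big[:4]))     # the four bytes for "P", big-endian

# Python's UTF-32 codecs with an explicit byte order give the same bytes.
assert little == text.encode("utf-32-le")
assert big == text.encode("utf-32-be")
```

The first line printed is `50 00 00 00` and the second is `00 00 00 50`: the same number, 0x00000050, with the bytes laid out in opposite orders.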

If you think this is impractical: that paragraph is trying to explain exactly why it is, and why another encoding is typically preferred.
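To put a number on the waste, here is a rough sketch that compares the 32-bit form with UTF-8. Picking UTF-8 as the alternative is my assumption for the comparison, and the variable names are illustrative.

```python
text = "Python"

# An explicit byte order ("-le") means no byte-order mark is prepended.
utf32 = text.encode("utf-32-le")
utf8 = text.encode("utf-8")

print(len(utf32))      # 24 bytes: 6 characters x 4 bytes each
print(len(utf8))       # 6 bytes: one byte per ASCII character
print(utf32.count(0))  # 18 of the 24 bytes are zero padding
```

For plain ASCII text like this, three quarters of the 32-bit representation is zeros, which is where all the 00s you are asking about come from.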

Instead of starting at that article, you could read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", which manages to live up to its title pretty well.