TIL the Python standard library struct module defaults to interpreting bytestrings using the native endianness of your machine.
Which means that this code:
import struct

def decode_matchinfo(buf):
    # buf is a bytestring of unsigned integers, each 4 bytes long
    return struct.unpack("I" * (len(buf) // 4), buf)
Behaves differently on big-endian vs. little-endian systems.
I found this out thanks to this bug report against my sqlite-fts4 library.
SQLite doesn't change the binary format depending on the endianness of the system, which means that my function here works correctly on little-endian systems but does the wrong thing on big-endian systems.
Update: I was entirely wrong about this. SQLite DOES change the format based on the endianness of the system. My bug fix was incorrect; see this issue comment for details.
On little-endian systems:
>>> buf = b'\x01\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00'
>>> decode_matchinfo(buf)
(1, 2, 2, 2)
But on big-endian systems:
>>> buf = b'\x01\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00'
>>> decode_matchinfo(buf)
(16777216, 33554432, 33554432, 33554432)
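Those big numbers make sense once you look at the raw bytes: read big-endian, the \x01 byte lands in the most significant position, so 1 becomes 0x01000000. A quick sketch using int.from_bytes to confirm the arithmetic:

>>> int.from_bytes(b'\x01\x00\x00\x00', 'little')
1
>>> int.from_bytes(b'\x01\x00\x00\x00', 'big')
16777216
>>> 0x01000000
16777216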
The fix is to prefix the format string with a character specifying the byte order that should be used; see Byte Order, Size, and Alignment in the Python documentation.
>>> struct.unpack("<IIII", buf)
(1, 2, 2, 2)
>>> struct.unpack(">IIII", buf)
(16777216, 33554432, 33554432, 33554432)
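A related detail worth knowing: any explicit byte-order prefix also switches struct to standard sizes with no alignment padding, while the default native mode may pad fields. A small sketch (the exact native size depends on your platform; 8 is typical on x86 and ARM):

>>> import struct
>>> struct.calcsize("@BI")  # native mode: pads the B so the I is aligned
8
>>> struct.calcsize("=BI")  # standard mode: no padding
5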
So the fix for my bug was to rewrite the function to look like this:
def decode_matchinfo(buf):
    # buf is a bytestring of unsigned integers, each 4 bytes long
    return struct.unpack("<" + ("I" * (len(buf) // 4)), buf)
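As an aside, struct format strings also support a repeat count, so a sketch of an equivalent way to write that function (not what the library itself uses) is:

def decode_matchinfo(buf):
    # "<4I" means four little-endian unsigned 32-bit integers
    return struct.unpack("<{}I".format(len(buf) // 4), buf)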
Bonus: How to tell which endianness your system has
Turns out Python can tell you whether your system is big-endian or little-endian like this:
>>> from sys import byteorder
>>> byteorder
'little'
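As a rough cross-check, packing with a prefix-free format should match int.to_bytes using sys.byteorder, assuming a platform where C's unsigned int is 4 bytes (which is nearly universal):

>>> import struct, sys
>>> struct.pack("I", 1) == (1).to_bytes(4, sys.byteorder)
True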