David - 3 months ago

Obtaining strings from BitStruct in Python's construct module

I am using the Python construct module to parse some binary data, but I cannot get strings out of it the way I expected.

Note that in the simplified example below I could use unpack or even just a slice, but the real data I am parsing does not align neatly to byte boundaries.

Some example code:

from construct import BitStruct, BitField, Padding, String

# Variant 1: read the 16 bits after "bar" as a single integer
struct = BitStruct("foo",
  BitField("bar", 8),
  BitField("baz", 16),
  Padding(4),
  BitField("bat", 4)
)

# Variant 2: try to read the same 16 bits as a string
struct2 = BitStruct("foo",
  BitField("bar", 8),
  String("baz", 16),
  Padding(4),
  BitField("bat", 4)
)

data = "\x01AB\xCD"

print struct.parse(data)
print struct2.parse(data)


This prints the output:

Container:
    bar = 1
    baz = 16706
    bat = 13
Container:
    bar = 1
    baz = '\x00\x01\x00\x00\x00\x00\x00\x01\x00\x01\x00\x00\x00\x00\x01\x00'
    bat = 13


I was expecting that String would give me back 'AB' as an actual string. However, it returns the bits of those two bytes instead, as one \x00 or \x01 character per bit.

How can I persuade construct to return me the actual ASCII string?

Answer

I solved this by creating an Adapter. The original ASCII values are parsed into a list of integers which can then be converted into a string representation.

It's not the most elegant way, but because BitStruct operates only on bit values, it seems to be the easiest workaround. An improved version would handle characters of other widths (e.g. 7-bit ASCII).
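To illustrate what "operating only on bit values" means here, the following is a minimal plain-Python sketch that imitates the behaviour without using construct at all. The helpers bytes_to_bits and bits_to_int are hypothetical names made up for this illustration, and construct's real internals may differ; the sketch only reproduces the values seen in the question's output.

```python
def bytes_to_bits(data):
    """Expand each byte into eight 0/1 values, most significant bit first."""
    return [(byte >> shift) & 1
            for byte in bytearray(data)
            for shift in range(7, -1, -1)]


def bits_to_int(bits):
    """Fold a sequence of 0/1 values back into one integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


bits = bytes_to_bits(b"\x01AB\xCD")

# The 16 bits that follow the 8-bit "bar" field.
baz_bits = bits[8:24]

# Roughly what String("baz", 16) sees inside a bit stream:
# sixteen one-bit "characters", each stored as \x00 or \x01.
raw = bytearray(baz_bits)

# What Array(2, Octet(...)) plus the adapter achieves:
# group the bits into octets, then turn each octet into a character.
octets = [bits_to_int(baz_bits[i:i + 8]) for i in range(0, 16, 8)]
text = "".join(chr(o) for o in octets)

print(octets)
print(text)
```

Here raw holds the same sixteen \x00/\x01 values that String returned in the question, while octets holds the two byte values 65 and 66 that the adapter joins into 'AB'.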

from binascii import hexlify
from construct import BitStruct, BitField, Padding, Array, Octet, Adapter

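class BitStringAdapter(Adapter):
  # Building: turn the string into a list of byte values for the Array of Octets
  def _encode(self, obj, context):
    return list(ord(b) for b in obj)
  # Parsing: join the list of byte values back into a string
  def _decode(self, obj, context):
    return "".join(chr(b) for b in obj)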

struct = BitStruct("foo",
  BitField("bar", 8),
  BitStringAdapter(Array(2, Octet("baz"))),
  Padding(4),
  BitField("bat", 4)
)

data = "\x01AB\xCD"

out = struct.parse(data)
print out
print hexlify(struct.build(out))

This outputs:

Container:
    bar = 1
    baz = 'AB'
    bat = 13
0141420d

This is correct: the high nibble (C) of the last byte is discarded because it is marked as padding, so it is rebuilt as 0, which is fine.
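
For the 7-bit ASCII case mentioned earlier, one possible approach, sketched here in plain Python rather than as a construct Adapter, is to treat the buffer as one large integer and shift out fixed-width characters. The function unpack_chars and its parameters are hypothetical names for this illustration, and the sketch assumes the characters are packed most significant bit first.

```python
def unpack_chars(data, bit_offset, count, width=7):
    """Read `count` characters of `width` bits each, starting
    `bit_offset` bits into `data` (most significant bit first)."""
    total_bits = len(data) * 8

    # Treat the whole buffer as one big-endian integer.
    value = 0
    for byte in bytearray(data):
        value = (value << 8) | byte

    mask = (1 << width) - 1
    chars = []
    for i in range(count):
        # Number of bits remaining to the right of this character.
        shift = total_bits - bit_offset - (i + 1) * width
        if shift < 0:
            raise ValueError("not enough bits in data")
        chars.append(chr((value >> shift) & mask))
    return "".join(chars)


# The 8-bit case from the question: two characters after the first byte.
print(unpack_chars(b"\x01AB\xCD", 8, 2, width=8))

# A 7-bit case: 'H' (1001000) and 'i' (1101001) packed back to back,
# followed by two zero bits of padding, gives the bytes 0x91 0xA4.
print(unpack_chars(b"\x91\xa4", 0, 2))
```

This only covers parsing. Building would need the reverse operation, and wrapping either direction in an Adapter is left out because it depends on the construct version in use.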