bebop bebop - 6 months ago 34
Python Question

More efficient way to make unicode escape codes

I am using Python to automatically generate qsf files for Qualtrics online surveys. The qsf file requires unicode characters to be escaped using the \u+hex convention: 'слово' = '\u0441\u043b\u043e\u0432\u043e'. Currently, I am achieving this with the following expression:

'слово'.encode('ascii','backslashreplace').decode('ascii')


The output is exactly what I need, but since this is a two-step process, I wondered if there is a more efficient way to get the same result.
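As a quick sanity check, the two-step expression above can be verified like this (the round-trip back through the standard unicode_escape codec is just for illustration):

```python
s = 'слово'

# Step 1: encode to ASCII bytes, replacing non-ASCII chars with \uXXXX escapes
# Step 2: decode back to str so the result is usable as text
escaped = s.encode('ascii', 'backslashreplace').decode('ascii')
print(escaped)  # \u0441\u043b\u043e\u0432\u043e

# Round-trip: interpreting the escapes recovers the original string
assert escaped.encode('ascii').decode('unicode_escape') == s
```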

Answer

If you open your output file in 'wb' mode, then it accepts byte strings rather than unicode arguments:

s = 'слово'
with open('data.txt','wb') as f:
    f.write(s.encode('unicode_escape'))
    f.write(b'\n')  # add a line feed

This seems to do what you want:

$ cat data.txt
\u0441\u043b\u043e\u0432\u043e

and it avoids both the decode step and any encoding or newline translation that happens when writing unicode to a text stream.
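One caveat worth noting: the two codecs are not always interchangeable. unicode_escape also escapes backslashes and ASCII control characters, while the backslashreplace handler leaves all ASCII bytes untouched, so the outputs differ whenever the input contains a literal backslash or newline. A small sketch of the difference:

```python
s = 'a\\b\n'  # contains a literal backslash and a newline

# unicode_escape escapes the backslash and the newline as well
print(s.encode('unicode_escape'))            # b'a\\\\b\\n'

# backslashreplace only touches non-ASCII characters; ASCII passes through
print(s.encode('ascii', 'backslashreplace')) # b'a\\b\n'

# For purely non-ASCII text like the example word, the two agree
assert ('слово'.encode('unicode_escape')
        == 'слово'.encode('ascii', 'backslashreplace'))
```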


Updated to use encode('unicode_escape') as per the suggestion of @J.F.Sebastian.

%timeit reports that it is quite a bit faster than encode('ascii', 'backslashreplace'):

In [18]: f = open('data.txt', 'wb')

In [19]: %timeit f.write(s.encode('unicode_escape'))
The slowest run took 224.43 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.55 µs per loop

In [20]: %timeit f.write(s.encode('ascii','backslashreplace'))
The slowest run took 9.13 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.37 µs per loop

In [21]: f.close()

Curiously, the slowest-to-fastest spread that timeit reports for encode('unicode_escape') is much larger than for encode('ascii', 'backslashreplace') even though its per-loop time is faster, so be sure to test both in your environment.
