Max Max - 11 months ago 87
C Question

Fastest way to expose C strings to numpy?

I'm working on converting some old text logs to a usable format in Python. The files are huge, so I'm writing my own C extensions to run through the files as quickly as possible and parse out the relevant fields with regular expressions. My ultimate goal is to export these fields into numpy arrays of strings.

I know it's possible to create the numpy array as a PyObject in C and then call SetItem on each element, but I'm looking to optimize as much as I can. Can I use something like memcpy or PyBuffer_FromMemory to read the C strings into a numpy string array directly? I know numpy arrays are internally similar to C arrays, but do I have to ensure the numpy array will be contiguously allocated?

I intend to use the numpy arrays to build columns in Pandas for statistical analysis. As I understand it, Pandas uses numpy arrays to store columns in a dataframe, so I won't have a large overhead going from numpy into Pandas. I'd like to avoid cython if possible.

Answer Source

To give a sense of how an array of strings is stored, I'll make one, and view it in several ways:

In [654]: np.array(['one','two','three','four'],dtype='S5')
Out[654]: 
array([b'one', b'two', b'three', b'four'], 
      dtype='|S5')
In [655]: x=np.array(['one','two','three','four'],dtype='S5')
In [656]: x.tostring()
Out[656]: b'one\x00\x00two\x00\x00threefour\x00'
In [657]: x.view(np.uint8)
Out[657]: 
array([111, 110, 101,   0,   0, 116, 119, 111,   0,   0, 116, 104, 114,
       101, 101, 102, 111, 117, 114,   0], dtype=uint8)

So its data buffer consists of 20 bytes (4 elements of 5 bytes each, hence S5). For strings shorter than 5 bytes, the remaining bytes are filled with (or left as) 0.
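The layout described above can be checked from Python. This is a minimal sketch using only standard ndarray attributes; the variable names are just for illustration.

```python
import numpy as np

x = np.array(['one', 'two', 'three', 'four'], dtype='S5')

# Each element occupies exactly 5 bytes, whatever the string length.
print(x.itemsize)                # bytes per element
print(x.nbytes)                  # total size of the data buffer
print(x.strides)                 # step in bytes from one element to the next
print(x.flags['C_CONTIGUOUS'])   # whether the buffer is one contiguous block

# Raw bytes of the first element: 'one' followed by NUL padding.
first_raw = x.tobytes()[:x.itemsize]
print(first_raw)
```

A freshly created array like this one is contiguous, so code that fills it from C can treat the data buffer as a plain block of itemsize-wide slots.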

Yes, there are C functions for creating new arrays of a given size and dtype, and functions for copying blocks of data into those arrays. Look at the C-API section of the numpy documentation, or look at some of the numpy code in its GitHub repository.

Regarding the pandas transfer, beware that pandas readily changes the dtype of its columns. For example, if you put None or nan in a column, it is likely to change the column to object dtype.

Object arrays store pointers in the databuffer.

In [658]: y=np.array(['one','two','three','four'],dtype=object)
In [659]: y
Out[659]: array(['one', 'two', 'three', 'four'], dtype=object)
In [660]: y.tostring()
Out[660]: b'\xe0\x0f\xc5\xb5\xa0\xfah\xb5\x80\x0b\x8c\xb4\xc09\x8b\xb4'

If I interpret that right, the data buffer has 16 bytes: four 4-byte pointers (so this is a 32-bit build; on a 64-bit build each pointer would be 8 bytes). The strings are stored elsewhere in memory as regular Python strings (in this case unicode strings, since this is Py3).
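The pointer interpretation can be checked without decoding the raw bytes. The sketch below compares the array's itemsize with the platform pointer size (taken from struct.calcsize('P')) and confirms that the array refers to the original Python objects; the names words and pointer_size are only illustrative.

```python
import struct
import numpy as np

words = ['one', 'two', 'three', 'four']
y = np.array(words, dtype=object)

pointer_size = struct.calcsize('P')   # size of a C pointer on this build
print(y.itemsize, pointer_size)       # one pointer per element
print(y.nbytes)                       # 4 pointers in total

# The array holds references to the original Python objects, not copies.
print(y[0] is words[0])
```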


fromstring and frombuffer let me recreate an array from a buffer:

In [696]: x=np.array(['one','two','three','four'],dtype='S5')
In [697]: xs=x.tostring()
In [698]: np.fromstring(xs,'S5')
Out[698]: 
array([b'one', b'two', b'three', b'four'], 
      dtype='|S5')
In [700]: np.frombuffer(xs,'S5')
Out[700]: 
array([b'one', b'two', b'three', b'four'], 
      dtype='|S5')

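frombuffer works without copying the buffer; the new array is a view on the existing memory (read-only when the source is an immutable bytes object). fromstring, by contrast, makes a copy.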
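To see that frombuffer shares memory with its source rather than copying it, the sketch below builds an array on top of a mutable bytearray and then changes one byte of the bytearray. The buffer contents mirror the S5 example above; the variable names are illustrative.

```python
import numpy as np

buf = bytearray(b'one\x00\x00two\x00\x00threefour\x00')
a = np.frombuffer(buf, dtype='S5')

buf[0] = ord('O')    # modify the underlying buffer in place
print(a[0])          # the array sees the change, so no copy was made

# With an immutable bytes source the resulting array is read-only.
ro = np.frombuffer(bytes(buf), dtype='S5')
print(ro.flags.writeable)
```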

However, if there are multiple strings in different parts of memory, then building an array from them will require copying them into one contiguous buffer.
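That copying step can be sketched in Python as follows. pack_fixed_width is a hypothetical helper, not a numpy function: it copies each field into its own fixed-width slot of a single bytearray and then wraps that buffer with frombuffer. It assumes the fields are bytes objects and silently truncates any field longer than width. A C extension would do the same thing with memcpy into the array's data buffer.

```python
import numpy as np

def pack_fixed_width(fields, width):
    """Copy separate bytes objects into one NUL-padded contiguous buffer."""
    buf = bytearray(len(fields) * width)      # zero-filled, so padding is free
    for i, field in enumerate(fields):
        chunk = field[:width]                 # truncate over-long fields
        start = i * width
        buf[start:start + len(chunk)] = chunk # same-length slice copy
    # Wrap the buffer without a further copy.
    return np.frombuffer(buf, dtype=np.dtype(f'S{width}'))

arr = pack_fixed_width([b'one', b'two', b'three', b'four'], 5)
print(arr)
print(arr.tobytes())
```

The one copy per field is unavoidable here, but it happens once, into memory that the array then uses directly.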