acdr - 1 month ago
Python Question

Efficient way to make numpy object arrays intern strings

Consider numpy arrays of the object dtype. I can shove anything I want in there.

A common use case for me is to put strings in them. However, for very large arrays, this may use up a lot of memory, depending on how the array is constructed. For example, if you assign a long string (e.g. "1234567890123456789012345678901234567890") to a variable, and then assign that variable to each element in the array, everything is fine:

arr = np.zeros((100000,), dtype=object)
arr[:] = "1234567890123456789012345678901234567890"

The interpreter now has one large string in memory, and an array full of pointers to this one object.
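A quick way to check this is to compare object identities with `is` (a minimal sketch; assumes NumPy is installed):

```python
import numpy as np

arr = np.zeros((100000,), dtype=object)
arr[:] = "1234567890123456789012345678901234567890"

# Every element is a pointer to the same Python string object.
print(arr[0] is arr[99999])  # True
```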

However, we can also do it wrong:

arr2 = np.zeros((100000,), dtype=object)
for idx in range(100000):
    arr2[idx] = str(1234567890123456789012345678901234567890)

Now, the interpreter has a hundred thousand copies of my long string in memory. Not so great.
(Naturally, in the above example, generating a new string each time is contrived - in real life, imagine reading a string from each line in a file.)
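You can confirm the duplication the same way, by comparing identities - the elements compare equal but are distinct objects:

```python
import numpy as np

arr2 = np.zeros((100000,), dtype=object)
for idx in range(100000):
    # Each str() call builds a brand-new string object.
    arr2[idx] = str(1234567890123456789012345678901234567890)

print(arr2[0] == arr2[1])  # True: equal values
print(arr2[0] is arr2[1])  # False: 100000 separate copies in memory
```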

What I want to do is, instead of blindly assigning each new string to its element, first check whether it has already been seen, and if it has, reuse the same object as the previous entry rather than the new one.

Something like:

arr = np.zeros((100000,), dtype=object)
seen = []
for idx, string in enumerate(file):  # Length of file is exactly 100000
    if string in seen:
        arr[idx] = seen[seen.index(string)]
    else:
        seen.append(string)
        arr[idx] = string

(Apologies for not posting fully running code. Hopefully you get the idea.)

Unfortunately this requires a large number of superfluous operations on the list. I can't figure out how to make it work with sets either.



Here's one way to do it, using a dictionary whose values are equal to its keys:

seen = {}
for idx, string in enumerate(file):
    arr[idx] = seen.setdefault(string, string)
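Put together into a runnable sketch (the `lines` list here is a stand-in for reading from a file):

```python
import numpy as np

# Stand-in for file input: 100000 equal but distinct string objects.
lines = [str(1234567890123456789012345678901234567890) for _ in range(100000)]

arr = np.zeros((len(lines),), dtype=object)
seen = {}
for idx, string in enumerate(lines):
    # setdefault stores the string on first sight and returns the stored
    # object on every repeat, so duplicates all point at one shared string.
    arr[idx] = seen.setdefault(string, string)

print(arr[0] is arr[99999])  # True: one string object, 100000 pointers
```

An alternative worth knowing is `sys.intern`, which deduplicates through the interpreter's own intern table, though it only works on `str` values rather than arbitrary objects.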