piRSquared piRSquared - 26 days ago 6
Python Question

Efficiently convert an uneven list of lists to a minimal containing array padded with nan

Consider the list of lists l:

l = [[1, 2, 3], [1, 2]]

If I convert this to an np.array, I'll get a one-dimensional object array with [1, 2, 3] in the first position and [1, 2] in the second position.

print(np.array(l))

[[1, 2, 3] [1, 2]]
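
A side note on reproducing the output above: on recent NumPy versions (1.24 and later) the plain np.array(l) call raises a ValueError for ragged input instead of silently building an object array. A minimal sketch, assuming you want to see the same one-dimensional object array, is to request dtype=object explicitly:

```python
import numpy as np

l = [[1, 2, 3], [1, 2]]

# dtype=object is passed explicitly here (an addition to the original call);
# recent NumPy versions refuse ragged input without it.
arr = np.array(l, dtype=object)

print(arr.shape)     # (2,)
print(arr.dtype)     # object
print(type(arr[0]))  # <class 'list'>
```

Each element is still a plain Python list, which is why no numeric padding happens here.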


I want this instead:

print(np.array([[1, 2, 3], [1, 2, np.nan]]))

[[  1.   2.   3.]
 [  1.   2.  nan]]





I can do this with a loop, but we all know how unpopular loops are.

def box_pir(l):
    lengths = [len(item) for item in l]
    shape = (len(l), max(lengths))
    a = np.full(shape, np.nan)
    for i, r in enumerate(l):
        a[i, :lengths[i]] = r
    return a

print(box_pir(l))

[[  1.   2.   3.]
 [  1.   2.  nan]]





How do I do this in a fast, vectorized way?




timing

[two timing plots; images not preserved]

setup functions

%%cython
import numpy as np

def box_pir_cython(l):
    lengths = [len(item) for item in l]
    shape = (len(l), max(lengths))
    a = np.full(shape, np.nan)
    for i, r in enumerate(l):
        a[i, :lengths[i]] = r
    return a





from itertools import zip_longest

import numpy as np
import pandas as pd

def box_divikar(v):
    lens = np.array([len(item) for item in v])
    mask = lens[:, None] > np.arange(lens.max())
    out = np.full(mask.shape, np.nan)
    out[mask] = np.concatenate(v)
    return out

def box_hpaulj(LoL):
    return np.array(list(zip_longest(*LoL, fillvalue=np.nan))).T

def box_simon(LoL):
    max_len = len(max(LoL, key=len))
    return np.array([x + [np.nan] * (max_len - len(x)) for x in LoL])

def box_dawg(LoL):
    cols = len(max(LoL, key=len))
    rows = len(LoL)
    AoA = np.empty((rows, cols))
    AoA.fill(np.nan)
    for idx in range(rows):
        AoA[idx, 0:len(LoL[idx])] = LoL[idx]
    return AoA

def box_pir(l):
    lengths = [len(item) for item in l]
    shape = (len(l), max(lengths))
    a = np.full(shape, np.nan)
    for i, r in enumerate(l):
        a[i, :lengths[i]] = r
    return a

def box_pandas(l):
    return pd.DataFrame(l).values
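
For anyone who wants to rerun a comparison along these lines, here is a rough sketch of a harness. It is not the original benchmark: the data (1000 rows with random lengths from 1 to 20), the repeat count, and the names box_loop and box_mask are made up for illustration. The two functions follow the same ideas as box_pir and box_divikar above.

```python
import timeit

import numpy as np

def box_loop(l):
    # Loop-based fill, same idea as box_pir
    lengths = [len(item) for item in l]
    a = np.full((len(l), max(lengths)), np.nan)
    for i, r in enumerate(l):
        a[i, :lengths[i]] = r
    return a

def box_mask(v):
    # Broadcasting plus boolean indexing, same idea as box_divikar
    lens = np.array([len(item) for item in v])
    mask = lens[:, None] > np.arange(lens.max())
    out = np.full(mask.shape, np.nan)
    out[mask] = np.concatenate(v)
    return out

# Hypothetical test data: 1000 rows, each with 1 to 20 integers
rng = np.random.default_rng(0)
data = [list(range(n)) for n in rng.integers(1, 21, size=1000)]

# Sanity check first: both approaches should produce the same padded array
# (assert_array_equal treats NaNs in matching positions as equal)
np.testing.assert_array_equal(box_loop(data), box_mask(data))

repeats = 20
for f in (box_loop, box_mask):
    total = timeit.timeit(lambda: f(data), number=repeats)
    print(f"{f.__name__}: {total / repeats * 1e3:.3f} ms per call")
```

The numbers depend heavily on the row count and on how uneven the row lengths are, so it is worth trying a few shapes before drawing conclusions.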

Answer

This seems to be a close variant of this question, where the padding was with zeros instead of NaNs. Interesting approaches were posted there, along with mine based on broadcasting and boolean indexing. So I would just modify one line from my post there to solve this case, like so:

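    import numpy as np

    def boolean_indexing(v):
        # Length of each sub-list
        lens = np.array([len(item) for item in v])
        # True where a column index falls inside that row's length
        mask = lens[:, None] > np.arange(lens.max())
        # Start from an all-NaN array, then fill the valid slots
        out = np.full(mask.shape, np.nan)
        out[mask] = np.concatenate(v)
        return out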

Sample run -

In [17]: l
Out[17]: [[1, 2, 3], [1, 2], [3, 8, 9, 7, 3]]

In [18]: boolean_indexing(l)
Out[18]: 
array([[  1.,   2.,   3.,  nan,  nan],
       [  1.,   2.,  nan,  nan,  nan],
       [  3.,   8.,   9.,   7.,   3.]])
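
To make the masking step easier to follow, here is a small sketch that prints the intermediate arrays for the sample input above. The variable names cols and flat are only for illustration and are not part of the function.

```python
import numpy as np

v = [[1, 2, 3], [1, 2], [3, 8, 9, 7, 3]]

lens = np.array([len(item) for item in v])  # [3 2 5]
cols = np.arange(lens.max())                # [0 1 2 3 4]

# A (3, 1) column compared against a (5,) row broadcasts to a (3, 5) grid:
# mask[i, j] is True when column j lies inside row i's length.
mask = lens[:, None] > cols
print(mask.astype(int))
# [[1 1 1 0 0]
#  [1 1 0 0 0]
#  [1 1 1 1 1]]

# Boolean assignment fills the True slots in row-major order, which is
# the same order in which np.concatenate lays out the values.
flat = np.concatenate(v)                    # [1 2 3 1 2 3 8 9 7 3]
out = np.full(mask.shape, np.nan)
out[mask] = flat
print(out)
```

The number of True entries in mask equals the total number of values, which is why the single assignment lines up without any explicit loop over rows.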

I have posted a few runtime results there for all the approaches on that Q&A, which could be useful.