Brandon J - 1 year ago
Python Question

NumPy Arrays: Extracting preferentially ordered values from an array with NaNs without padding?

Suppose I have an array of shape (M, N) where each of the N columns holds the data recordings of one of N different machines. Let's also imagine each of the M rows represents a unique "timestamp" at which data was recorded for all N machines.

The array is structured so that row 0 corresponds to the very first "timestamp" (t0) and the last row, M - 1 (tm), represents the most recent "timestamp" recording.

Let's call this array "AX." AX[0] would yield the recorded data for N machines at the very 1st "timestamp". AX[-1] would be the most recent recordings.

Here is my array:

>>> import numpy as np
>>> AX = np.random.randn(3, 5)

array([[ 0.53826804, -0.9450442 , -0.10279278, 0.47251871, 0.32050493],
[-0.97573464, -0.42359652, -0.00223274, 0.7364234 , 0.83810714],
[-0.07626913, 0.85246932, -0.13736392, -1.39977431, -1.39882156]])

Now imagine something went wrong and data wasn't captured consistently for every machine at every "timestamp". To create an example of what the output might look like, I followed the example linked below to insert NaNs at random positions in the array:

Create sample numpy array with randomly placed NaNs

>>> AX.ravel()[np.random.choice(AX.size, 9, replace=False)] = np.nan

array([[ 0.53826804, -0.9450442 , nan, 0.47251871, nan],
[ nan, nan, nan, 0.7364234 , 0.83810714],
[-0.07626913, nan, nan, nan, nan]])

Let's assume that I need to provide the most recent values of the recorded data. Ideally this would be as easy as referencing AX[-1]. In this particular case, I would hardly have any data since everything got screwed up.


array([-0.07626913, nan, nan, nan, nan])


I realize any data is better than nothing, so I would like to use the most recent value recorded for each machine. In this particular scenario, the best I could do is provide an array with the values:

[-0.07626913, -0.9450442, 0.7364234, 0.83810714]

Notice column 2 of AX had no usable data, so I just skipped its output.

I do not find np.arrays to be very intuitive and as I read through the documentation, I am overwhelmed by the amount of specialized functions and transforms.

My initial idea was to filter out all of the NaNs into a new array (AY) and then take the last row, AY[-1] (assuming this would retain its important row-based ordering). Then I realized that this would produce an array with a strange, ragged shape (I'm just using integer values here for convenience instead of AX's values):


Assuming that is even possible to create, taking the last "row"(?) would yield [6,5,3] and would totally mess everything up. Padding an array with values is also bad because the most recent values would be pads for 4 out of 5 data points in the most recent "timestamp" row.
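
As a rough illustration of why plain filtering does not work here: indexing a 2-D array with a boolean mask does not produce a ragged array at all, it returns a flat 1-D array. The sketch below reuses the NaN-filled AX values from above; the name flat is only illustrative.

```python
import numpy as np

nan = np.nan
AX = np.array([[ 0.53826804, -0.9450442 , nan, 0.47251871, nan],
               [ nan, nan, nan, 0.7364234 , 0.83810714],
               [-0.07626913, nan, nan, nan, nan]])

# Boolean-mask indexing flattens the result in row-major order,
# so the (row, column) position of each surviving value is lost.
flat = AX[~np.isnan(AX)]
print(flat.shape)  # (6,)
print(flat)
```

The six surviving values come back in reading order, with nothing left to say which machine or timestamp each one came from.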

Is there a way to achieve what I want in a fairly painless manner while still using the np.array structure and avoiding dataframes and panels?


Answer Source

This is the kind of question that can generate many interesting answers. Someone will probably come up with a better way than this, but to get things started, here's one possibility:

In [99]: AX
array([[ 0.53826804, -0.9450442 ,         nan,  0.47251871,         nan],
       [        nan,         nan,         nan,  0.7364234 ,  0.83810714],
       [-0.07626913,         nan,         nan,         nan,         nan]])

np.isfinite(AX) is a boolean array that is True where AX is not nan (and not infinite, but I assume that case is not relevant). For a boolean array B, B.argmax(axis=0) gives the indices of the first True value in each column. To get the indices of the last True value, reverse the array, take the argmax, and then subtract the result from the number of rows minus 1; that is, B.shape[0]-1 - B[::-1].argmax(axis=0). In this case, B is np.isfinite(AX), so we have:

In [100]: k = AX.shape[0] - 1 - np.isfinite(AX)[::-1].argmax(axis=0)
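
The reverse-then-argmax trick is easier to check on a tiny boolean array. This is only a sketch; the mask B below is made up for illustration and is not derived from AX.

```python
import numpy as np

# Hypothetical boolean mask: True marks a usable reading.
B = np.array([[True,  True,  False],
              [False, True,  False],
              [True,  False, False]])

# Index of the first True in each column.
first_true = B.argmax(axis=0)

# Index of the last True in each column: find the first True in the
# row-reversed array, then map that position back to the original rows.
last_true = B.shape[0] - 1 - B[::-1].argmax(axis=0)

print(first_true)  # [0 0 0]
print(last_true)   # [2 1 2]
```

Note the third column, which has no True at all: argmax returns 0 for it, so first_true reports row 0 and last_true reports the last row. Those indices are placeholders, not real hits.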

k contains the row indices where the final values occur. There is one for each column, so the corresponding column indices are simply np.arange(AX.shape[1]).

In [101]: last_vals = AX[k, np.arange(AX.shape[1])]

last_vals is the one-dimensional array of the last non-nan values in each column, unless a column is all nan, in which case the value in last_vals is also nan:

In [102]: last_vals
Out[102]: array([-0.07626913, -0.9450442 ,         nan,  0.7364234 ,  0.83810714])
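
The lookup AX[k, np.arange(AX.shape[1])] relies on NumPy pairing two index arrays element by element. A minimal sketch with made-up integers (the names A, rows, cols and picked are only illustrative):

```python
import numpy as np

A = np.array([[10, 11, 12],
              [20, 21, 22],
              [30, 31, 32]])

rows = np.array([2, 0, 1])      # one (made-up) row index per column
cols = np.arange(A.shape[1])    # column indices 0, 1, 2

# Picks A[2, 0], A[0, 1] and A[1, 2]: one element per (row, column) pair.
picked = A[rows, cols]
print(picked)  # [30 11 22]
```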

To eliminate the nan values in last_vals, you can index it with np.isfinite(last_vals):

In [103]: last_vals[np.isfinite(last_vals)]
Out[103]: array([-0.07626913, -0.9450442 ,  0.7364234 ,  0.83810714])
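
Putting the steps together, something like the following should work as a self-contained script. The function name last_valid_per_column and the keep mask are my own additions for illustration; reporting the surviving column indices is also an extra, in case you need to know which machine each value belongs to.

```python
import numpy as np

def last_valid_per_column(a):
    """Return the last finite value in each column (nan if the column has none)."""
    a = np.asarray(a, dtype=float)
    valid = np.isfinite(a)
    # Row index of the last finite entry in each column.
    k = a.shape[0] - 1 - valid[::-1].argmax(axis=0)
    return a[k, np.arange(a.shape[1])]

nan = np.nan
AX = np.array([[ 0.53826804, -0.9450442 , nan, 0.47251871, nan],
               [ nan, nan, nan, 0.7364234 , 0.83810714],
               [-0.07626913, nan, nan, nan, nan]])

last_vals = last_valid_per_column(AX)
keep = np.isfinite(last_vals)       # False for columns with no data at all

print(last_vals[keep])              # the four usable values
print(np.flatnonzero(keep))         # [0 1 3 4], columns that had data
```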