Brandon J - 1 year ago 85

Python Question

Suppose I have an array (M,N) where the values in each "column", N, represent data recordings of N different machines. Let's also imagine each "row", M, represents a unique "timestamp" where data was recorded for all of the N machines.

The array (M,N) is structured so that row 0 corresponds to the very first "timestamp" (t0) and the last row (tm) represents the most recent "timestamp" recording.

Let's call this array "AX." AX[0] would yield the recorded data for N machines at the very 1st "timestamp". AX[-1] would be the most recent recordings.

Here is my array:

`>>> AX = np.random.randn(3, 5)`

array([[ 0.53826804, -0.9450442 , -0.10279278, 0.47251871, 0.32050493],

[-0.97573464, -0.42359652, -0.00223274, 0.7364234 , 0.83810714],

[-0.07626913, 0.85246932, -0.13736392, -1.39977431, -1.39882156]])

Now imagine something went wrong and data wasn't captured consistently for every machine at every "timestamp". To create an example of what the output might look like, I followed the example linked below to insert NaNs at random positions in the array:

Create sample numpy array with randomly placed NaNs

`>>> AX.ravel()[np.random.choice(AX.size, 9, replace=False)] = np.nan`

array([[ 0.53826804, -0.9450442 , nan, 0.47251871, nan],

[ nan, nan, nan, 0.7364234 , 0.83810714],

[-0.07626913, nan, nan, nan, nan]])

Let's assume that I need to provide the most recent values of the recorded data. Ideally this would be as easy as referencing AX[-1]. In this particular case, I would hardly have any data since everything got screwed up.

`>>> AX[-1]`

array([-0.07626913, nan, nan, nan, nan])

I realize any data is better than nothing, so I would like to use the most recent non-NaN value from each column:

`[-0.07626913, -0.9450442, 0.7364234, 0.83810714]`

Notice column 2 of AX had no usable data, so I just skipped its output.

I do not find np.arrays very intuitive, and as I read through the documentation, I am overwhelmed by the number of specialized functions and transforms.

My initial idea was to filter out all of the NaNs into a new array (AY), and then take the last row AY[-1] (assuming this would retain its important row-based ordering). Then I realized that this would make an array with a strange shape like this (I'm just using integer values here for convenience instead of AX's values):

`[1,2,3],`

[4,5],

[6]

Assuming that is even possible to create, taking the last "row"(?) would yield [6,5,3] and would totally mess everything up. Padding an array with values is also bad because the most recent values would be pads for 4 out of 5 data points in the most recent "timestamp" row.
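To make the concern above concrete, here is a small sketch of what NumPy actually does when you filter with a boolean mask. It reuses the `AX` values from the question; the name `AY` follows the question, and the literal array is typed in by hand for reproducibility. NumPy does not build a ragged array here; it returns a flat 1-D array, so the column structure is lost either way.

```python
import numpy as np

nan = np.nan
AX = np.array([
    [ 0.53826804, -0.9450442 ,         nan,  0.47251871,         nan],
    [        nan,         nan,         nan,  0.7364234 ,  0.83810714],
    [-0.07626913,         nan,         nan,         nan,         nan],
])

# Boolean-mask indexing on a 2-D array returns a 1-D array of the
# selected elements in row-major order, not a ragged 2-D array.
AY = AX[~np.isnan(AX)]

print(AY.shape)  # (6,)
print(AY[-1])    # the last non-NaN element overall, not one per column
```

So `AY[-1]` is a single number, which is why a per-column approach is needed instead.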

Is there a way to achieve what I want in a fairly painless manner while still using the np.array structure and avoiding dataframes and panels?

Thanks!


Answer Source

This is the kind of question that can generate many interesting answers. Someone will probably come up with a better way than this, but to get things started, here's one possibility:

```
In [99]: AX
Out[99]:
array([[ 0.53826804, -0.9450442 , nan, 0.47251871, nan],
[ nan, nan, nan, 0.7364234 , 0.83810714],
[-0.07626913, nan, nan, nan, nan]])
```

`np.isfinite(AX)` is a boolean array that is True where `AX` is not nan (and not infinite, but I assume that case is not relevant). For a boolean array `B`, `B.argmax(axis=0)` gives the indices of the *first* True value in each column. To get the indices of the *last* True value, reverse the array, take the argmax, and then subtract the result from the number of rows minus 1; that is, `B.shape[0]-1 - B[::-1].argmax(axis=0)`. In this case, `B` is `np.isfinite(AX)`, so we have:

```
In [100]: k = AX.shape[0] - 1 - np.isfinite(AX)[::-1].argmax(axis=0)
```
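If the reversed-argmax trick is hard to picture, a tiny boolean array may help. This is only an illustrative sketch; the array `B` and the names `first_true` and `last_true` are made up for the example. Note what happens in the all-False column: `argmax` returns 0, so the "last True" index falls back to the last row.

```python
import numpy as np

B = np.array([
    [True,  False, False],
    [False, True,  False],
    [True,  False, False],
])

# argmax on booleans returns the index of the first True in each column
# (or 0 if the column contains no True at all).
first_true = B.argmax(axis=0)

# Reverse the rows, find the first True there, and map that position
# back to an index in the original row order.
last_true = B.shape[0] - 1 - B[::-1].argmax(axis=0)

print(first_true)  # [0 1 0]
print(last_true)   # [2 1 2]
```

The third column has no True values, so its `last_true` entry (2) is not meaningful on its own. In the `AX` case this is harmless, because that entry points at a nan.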

`k` contains the row indices where the final values occur. There is one for each column, so the corresponding column indices are simply `np.arange(AX.shape[1])`.

```
In [101]: last_vals = AX[k, np.arange(AX.shape[1])]
```
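The indexing step above pairs up two integer arrays element by element. As a rough illustration (the array `A` and the names `rows`, `cols` and `picked` are invented for this sketch), indexing with `A[rows, cols]` selects `A[rows[i], cols[i]]` for each `i`:

```python
import numpy as np

A = np.array([
    [10, 11, 12],
    [20, 21, 22],
    [30, 31, 32],
])

rows = np.array([2, 0, 1])     # one row index per column
cols = np.arange(A.shape[1])   # [0, 1, 2]

# Picks A[2, 0], A[0, 1] and A[1, 2], one element per column.
picked = A[rows, cols]
print(picked)  # [30 11 22]
```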

`last_vals` is the one-dimensional array of the last non-nan values in each column, unless a column is all nan, in which case the value in `last_vals` is also nan:

```
In [102]: last_vals
Out[102]: array([-0.07626913, -0.9450442 , nan, 0.7364234 , 0.83810714])
```

To eliminate the nan values in `last_vals`, you can index it with `np.isfinite(last_vals)`:

```
In [103]: last_vals[np.isfinite(last_vals)]
Out[103]: array([-0.07626913, -0.9450442 , 0.7364234 , 0.83810714])
```
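Putting the steps together, here is one possible way to wrap this in a function. It is a sketch, not the only way to do it: the function name `last_valid_per_column` and the extra `has_data` mask are my own additions, not part of the steps above. The mask is there because dropping the empty column also drops the information about which machine each remaining value belongs to.

```python
import numpy as np

def last_valid_per_column(a):
    """Return (values, has_data) for a 2-D array.

    values[j]   : last finite entry in column j (not meaningful if none exists)
    has_data[j] : True if column j contains at least one finite entry
    """
    a = np.asarray(a, dtype=float)
    valid = np.isfinite(a)
    # Row index of the last finite entry in each column.
    k = a.shape[0] - 1 - valid[::-1].argmax(axis=0)
    values = a[k, np.arange(a.shape[1])]
    has_data = valid.any(axis=0)
    return values, has_data

nan = np.nan
AX = np.array([
    [ 0.53826804, -0.9450442 ,         nan,  0.47251871,         nan],
    [        nan,         nan,         nan,  0.7364234 ,  0.83810714],
    [-0.07626913,         nan,         nan,         nan,         nan],
])

values, has_data = last_valid_per_column(AX)
print(values[has_data])          # the four usable readings
print(np.flatnonzero(has_data))  # [0 1 3 4], the columns they came from
```

Using `has_data` rather than `np.isfinite(values)` to filter gives the same result here; it just makes the "this column never had data" case explicit.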