Eran Moshe - 1 month ago 3
Python Question

# Python: filter a 2D array by chunks of data

``````
import numpy as np

data = np.array([
[20,  0,  5,  1],
[20,  0,  5,  1],
[20,  0,  5,  0],
[20,  1,  5,  0],
[20,  1,  5,  0],
[20,  2,  5,  1],
[20,  3,  5,  0],
[20,  3,  5,  0],
[20,  3,  5,  1],
[20,  4,  5,  0],
[20,  4,  5,  0],
[20,  4,  5,  0]
])
``````

I have the following 2D array. Let's call the columns `a, b, c, d`, in the above order, where column `b` acts like an `id`. I want to delete every row whose `b` group (all rows with the same id) does not contain at least one `1` in column `d`, so that after filtering I get the following result:

``````
[[20  0  5  1]
[20  0  5  1]
[20  0  5  0]
[20  2  5  1]
[20  3  5  0]
[20  3  5  0]
[20  3  5  1]]
``````

All rows with `b = 1` and `b = 4` have been deleted from the data.

To sum up, because I see answers that don't fit: we look at chunks of data grouped by the `b` column. If a complete chunk does not have even one occurrence of the number 1 in column `d`, we delete all the rows of that `b` value. In the example above, the chunks with `b = 1` and `b = 4` ("id" = 1 and "id" = 4) have 0 occurrences of the number 1 in column `d`, which is why they get deleted from the data.
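To pin the rule down, here is a minimal loop-based sketch of the filtering described above. It is only meant as a readable reference, not as an efficient solution; the names `ids_with_one`, `keep` and `filtered` are made up for this illustration.

```python
import numpy as np

data = np.array([
    [20, 0, 5, 1], [20, 0, 5, 1], [20, 0, 5, 0],
    [20, 1, 5, 0], [20, 1, 5, 0],
    [20, 2, 5, 1],
    [20, 3, 5, 0], [20, 3, 5, 0], [20, 3, 5, 1],
    [20, 4, 5, 0], [20, 4, 5, 0], [20, 4, 5, 0],
])

# Collect every id (column b) that has at least one row with d == 1
ids_with_one = {int(row[1]) for row in data if row[3] == 1}

# Keep a row only if its id is in that set
keep = np.array([int(row[1]) in ids_with_one for row in data])
filtered = data[keep]
print(filtered)
```

With the data above, `ids_with_one` contains 0, 2 and 3, so the rows with `b = 1` and `b = 4` are dropped.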

Generic approach: Here's an approach using `np.unique` and `np.bincount` that handles the generic case -

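``````
# Map each id in column b to a dense integer label 0..k-1
unq, tags = np.unique(data[:, 1], return_inverse=True)
# Count the rows with d == 1 per label and keep labels with at least one
goodIDs = np.flatnonzero(np.bincount(tags, data[:, 3] == 1) >= 1)
out = data[np.isin(tags, goodIDs)]
``````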

Sample run -

``````
In [15]: data
Out[15]:
array([[20, 10,  5,  1],
[20, 73,  5,  0],
[20, 73,  5,  1],
[20, 31,  5,  0],
[20, 10,  5,  1],
[20, 10,  5,  0],
[20, 42,  5,  1],
[20, 54,  5,  0],
[20, 73,  5,  0],
[20, 54,  5,  0],
[20, 54,  5,  0],
[20, 31,  5,  0]])

In [16]: out
Out[16]:
array([[20, 10,  5,  1],
[20, 73,  5,  0],
[20, 73,  5,  1],
[20, 10,  5,  1],
[20, 10,  5,  0],
[20, 42,  5,  1],
[20, 73,  5,  0]])
``````
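To see what each step of the generic approach produces, the sketch below reruns it on the sample data from the run above and keeps the intermediate arrays. The variable name `counts` is introduced here only for illustration; the rest follows the snippet, with `np.isin` used for the final membership test.

```python
import numpy as np

data = np.array([
    [20, 10, 5, 1], [20, 73, 5, 0], [20, 73, 5, 1], [20, 31, 5, 0],
    [20, 10, 5, 1], [20, 10, 5, 0], [20, 42, 5, 1], [20, 54, 5, 0],
    [20, 73, 5, 0], [20, 54, 5, 0], [20, 54, 5, 0], [20, 31, 5, 0],
])

# Sorted unique ids, plus the index of each row's id within that sorted list
unq, tags = np.unique(data[:, 1], return_inverse=True)
# unq  -> [10, 31, 42, 54, 73]
# tags -> [0, 4, 4, 1, 0, 0, 2, 3, 4, 3, 3, 1]

# Weighted bincount: for each label, sum the boolean "d == 1" flags
counts = np.bincount(tags, data[:, 3] == 1)
# counts -> [2., 0., 1., 0., 1.]

# Labels whose group contains at least one 1 in column d
goodIDs = np.flatnonzero(counts >= 1)
# goodIDs -> [0, 2, 4], i.e. ids 10, 42 and 73

out = data[np.isin(tags, goodIDs)]
print(out)
```

The weights passed to `np.bincount` are cast to floats, which is why `counts` comes back as a float array even though it holds whole numbers.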

Specific case approach: If the second column is always sorted and holds sequential numbers starting from `0`, we can skip the `np.unique` step and use a simplified version (strictly, `np.bincount` only needs the ids to be non-negative integers), like so -

``````
# Column b already holds small non-negative integers, so it can index the bins directly
goodIDs = np.flatnonzero(np.bincount(data[:, 1], data[:, 3] == 1) >= 1)
out = data[np.isin(data[:, 1], goodIDs)]
``````

Sample run -

``````
In [44]: data
Out[44]:
array([[20,  0,  5,  1],
[20,  0,  5,  1],
[20,  0,  5,  0],
[20,  1,  5,  0],
[20,  1,  5,  0],
[20,  2,  5,  1],
[20,  3,  5,  0],
[20,  3,  5,  0],
[20,  3,  5,  1],
[20,  4,  5,  0],
[20,  4,  5,  0],
[20,  4,  5,  0]])

In [45]: out
Out[45]:
array([[20,  0,  5,  1],
[20,  0,  5,  1],
[20,  0,  5,  0],
[20,  2,  5,  1],
[20,  3,  5,  0],
[20,  3,  5,  0],
[20,  3,  5,  1]])
``````
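As a rough check on when the simplified version is safe, the sketch below wraps both variants in functions and compares them. The function names `filter_generic` and `filter_simple` and the small test arrays are hypothetical, made up for this comparison. The assumption being illustrated is that `np.bincount` needs non-negative integer ids: scattered ids still work (at the cost of a longer count array), while negative ids raise a `ValueError`.

```python
import numpy as np

def filter_generic(data):
    # Relabel ids to 0..k-1 first, so any integer id values are accepted
    _, tags = np.unique(data[:, 1], return_inverse=True)
    good = np.flatnonzero(np.bincount(tags, data[:, 3] == 1) >= 1)
    return data[np.isin(tags, good)]

def filter_simple(data):
    # Use the ids themselves as bin indices (must be non-negative integers)
    ids = data[:, 1]
    good = np.flatnonzero(np.bincount(ids, data[:, 3] == 1) >= 1)
    return data[np.isin(ids, good)]

# Unsorted, non-sequential but non-negative ids: both variants agree
scattered = np.array([
    [20, 10, 5, 1], [20, 73, 5, 0], [20, 31, 5, 0],
    [20, 73, 5, 1], [20, 31, 5, 0],
])
same = np.array_equal(filter_generic(scattered), filter_simple(scattered))

# Negative ids: only the generic variant works
negative = np.array([[20, -1, 5, 1], [20, 2, 5, 0]])
try:
    filter_simple(negative)
    simple_failed = False
except ValueError:
    simple_failed = True
generic_result = filter_generic(negative)
```

Here the simplified variant allocates 74 bins for ids up to 73, so for very large or sparse ids the `np.unique` relabeling is likely the better choice.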

Also, if `data[:,3]` only ever contains ones and zeros, we can just use `data[:,3]` in place of `data[:,3]==1` in the code listed above.
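A small sketch of why that shortcut depends on the column being strictly 0/1. The arrays `ids`, `d_binary` and `d_other` are invented for this illustration: with binary values the two weightings give the same counts, but any other value (here a `2`) is summed into the count and can make a group without a single `1` look valid.

```python
import numpy as np

ids = np.array([0, 0, 1, 1, 2])

# Column d holds only zeros and ones: raw values and the boolean mask agree
d_binary = np.array([1, 0, 0, 0, 1])
raw_binary = np.bincount(ids, d_binary)          # [1., 0., 1.]
mask_binary = np.bincount(ids, d_binary == 1)    # [1., 0., 1.]

# Column d holds another value: id 1 has no 1 at all, yet its raw sum is 2
d_other = np.array([1, 0, 2, 0, 1])
raw_other = np.bincount(ids, d_other)            # [1., 2., 1.]
mask_other = np.bincount(ids, d_other == 1)      # [1., 0., 1.]

print(np.flatnonzero(raw_other >= 1))   # would keep id 1 by mistake
print(np.flatnonzero(mask_other >= 1))  # keeps only ids 0 and 2
```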

Source (Stackoverflow)