Brocolli Rob Brocolli Rob - 2 months ago 14
Python Question

CSV Filtering for numpy

I've got this rather large CSV file, and I only care about the rows that contain the word "Slave"...Some rows that contain Slave are not exactly the same as others, but they all contain the word "Slave".

I want to throw away all the other rows and then work on the data that's left over in the other column.

Here's the catch: The other column isn't clean either...It always looks like this, though:

digit (text)

so, for instance:

7 (medium)


12 (strong)


I want to grab the first 1 or 2 (depending on if there is 1 or 2 digits, of course) and then plot them in a histogram with numpy and matplotlib/pyplot.

I've got two problems:

This code:

import csv
x=csv.reader(open('sample.csv', 'rt'), delimiter=',')
x=list(x)


Does OK, but now I have to address things like
x[1][1]
...This will show

Slave (0x00-02-5b-00-a5-a5) (#1)


But, something like
x[:][1]
shows

['6 (medium)', 'Slave (0x00-02-5b-00-a5-a5) (#1)']


Which, is not what I expect...I would expect it to just print the second column.

Anyway, if I can get past that, the next issue will be that the surviving column will have some character filtering to do (I guess) to just keep the digits and remove the alpha characters...However, I'm dreading on how to do just that and also be able to stuff in into some numpy friendly data structure.

Any thoughts on how to proceed? Here's a sample of the data I'm working with:

6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
7 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
7 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
7 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
7 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
5 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
5 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
7 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
7 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
4 (weak),Slave (0x00-02-5b-00-a5-a5) (#1)
6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
7 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
7 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
5 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
7 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
5 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
4 (weak),Slave (0x00-02-5b-00-a5-a5) (#1)
6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
13 (strong),Master (0x00-25-52-f5-a6-f1) (#1)
7 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
10 (strong),Master (0x00-25-52-f5-a6-f1) (#1)
6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
11 (strong),Master (0x00-25-52-f5-a6-f1) (#1)
6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
8 (medium),Master (0x00-25-52-f5-a6-f1) (#1)
7 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
13 (strong),Master (0x00-25-52-f5-a6-f1) (#1)
12 (strong),Master (0x00-25-52-f5-a6-f1) (#1)
7 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
10 (strong),Master (0x00-25-52-f5-a6-f1) (#1)
5 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
11 (strong),Master (0x00-25-52-f5-a6-f1) (#1)
4 (weak),Slave (0x00-02-5b-00-a5-a5) (#1)
13 (strong),Master (0x00-25-52-f5-a6-f1) (#1)
5 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
11 (strong),Master (0x00-25-52-f5-a6-f1) (#1)
6 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
11 (strong),Master (0x00-25-52-f5-a6-f1) (#1)
13 (strong),Master (0x00-25-52-f5-a6-f1) (#1)
7 (medium),Slave (0x00-02-5b-00-a5-a5) (#1)
10 (strong),Master (0x00-25-52-f5-a6-f1) (#1)


Thanks

Answer

As a csv, each line contains 2 columns, one before and one after the comma. x[1][1] is the second column from the second list, since python uses 0-based indices. And x[:][1] is equivalent to x[1], so no surprise there.

I suggest filtering and keeping the very first column that contains your numbers:

firstcol_filt = [int(str.split(k[0])[0]) for k in x if 'Slave' in k[1]]

then you can transform this list of lists into a numpy array if you wish,

firstcol_arr = np.array(firstcol_filt)

Due to its shape, this will be a 1d array which you can use in a histogram.


Just to elaborate: x from your CSV is a list of lists. The list comprehension loops over x, k is each row of the CSV, so k is a two-element list. If 'Slave' is in the second element, we split the first element at whitespace and transform its first part into an integer.

The list comprehension can be expanded into an equivalent loop like so:

firstcol_filt = []
for k in x:
    if 'Slave' in k[1]:
        firstcol_filt.append(int(str.split(k[0])[0]))

Since you asked in a comment, here's the filtering step and the splitting step separately, for clarity:

filtered_rows = [k for k in x if 'Slave' in k[1]]
firstcol_filt = [int(str.split(k[0])[0]) for k in filtered_rows]
Comments