JohnJ - 2 months ago
Python Question

fast sorting of a csv file?

I am somewhat new to Python, and am trying to learn it primarily for data analysis. I have a CSV file which looks like this (ID,category):

67512367,0
67567,5
89789789,5
...
...
876289347,10
638, 10
...
...
98723489,20
3828909, 20
...
... (going up to)
78789789, 200
978789, 200


What I'd like to do is to group the IDs by their category. So, at the end I would like something which looks like this:

list_5 = [67567, 89789789, .., ]
list_10 = [876289347, 638, ...]
list_200 = [78789789, 978789, ...]


The problem is that the CSV file has around 1.5 million entries. So, at the moment I load them using izip and csv.reader like so:

from itertools import izip
data = izip(csv.reader(open("data.csv", "rb")))


and I have the entries in
data
which I am able to iterate using simply:

for i in data:
    print i
    #print i[0][0] # for IDs
    #print i[0][1] # for category
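(As an aside, the izip wrapper is what makes each row a 1-tuple, forcing the i[0][...] double indexing; a plain csv.reader yields each row directly as a list of strings. A minimal sketch in Python 3 syntax, using a hypothetical in-memory sample in place of the real file:)

```python
import csv
from io import StringIO

# hypothetical in-memory sample standing in for data.csv
sample = "67567,5\n876289347,10\n"

# a plain csv.reader yields each row as a list of strings,
# so no izip wrapper (and no i[0][...] indexing) is needed
rows = list(csv.reader(StringIO(sample)))
for row in rows:
    print(row[0], row[1])  # ID, category
```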


Now, I know that I can use an if/elif construct to check whether i[0][1] == 5 and then append i[0][0] (the ID) to a list, but this seems very slow - and my list is large.

I was wondering if there was any other elegant way (maybe something using itertools?) to bucketize the IDs based on the value of the second column (the category).
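(For reference, one standard-library way to do such bucketing is collections.defaultdict; a minimal sketch in Python 3 syntax, with an in-memory sample in place of the real file:)

```python
import csv
from collections import defaultdict
from io import StringIO

# hypothetical sample mirroring the question's data
sample = "67512367,0\n67567,5\n89789789,5\n876289347,10\n638,10\n"

buckets = defaultdict(list)
for row in csv.reader(StringIO(sample)):
    # row[0] is the ID, row[1] is the category
    buckets[int(row[1])].append(int(row[0]))

print(dict(buckets))
# {0: [67512367], 5: [67567, 89789789], 10: [876289347, 638]}
```

This makes a single pass over the file, so it stays linear even with 1.5 million rows.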

DSM
Answer

Since you say you're "trying to learn python for data analysis primarily", you should definitely look at pandas so you can have a better toolset to play with. (That's not to say that it's not useful to know how to build pandas-like tools from scratch, of course. But in my experience even with pandas you have more than enough chances to exercise your Python skills, and it's more interesting to figure out how to do something real than it is to figure out how to reimplement basic functionality.)

You could read your file into a DataFrame (like an Excel sheet) using read_csv:

>>> import pandas as pd
>>> df = pd.read_csv("group.csv", names=["ID", "category"])
>>> df
          ID  category
0   67512367         0
1      67567         5
2   89789789         5
3  876289347        10
4        638        10
5   98723489        20
6    3828909        20
7   78789789       200
8     978789       200

And then build a dictionary of the categories-to-ids using groupby:

>>> {k: v.tolist() for k,v in df.groupby("category")["ID"]}
{0: [67512367], 200: [78789789, 978789], 10: [876289347, 638], 20: [98723489, 3828909], 5: [67567, 89789789]}

That said, you can also do a lot of operations on groupby objects directly (like computing statistics), so to be honest I don't often need an actual list of the IDs. More frequently I just want to "do this operation on every group", but YMMV.
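(For instance, to count the IDs per category without materializing the lists - a small sketch using a made-up frame mirroring the example data:)

```python
import pandas as pd

# hypothetical frame mirroring the example data above
df = pd.DataFrame({"ID": [67567, 89789789, 876289347, 638],
                   "category": [5, 5, 10, 10]})

# a statistic per group, computed directly on the groupby object
counts = df.groupby("category")["ID"].count()
print(counts[5])  # 2 IDs in category 5
```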
