NutshellProgrammer - 8 months ago 38

Python Question

I'm reading a large CSV file (500,000 rows) and adding every row to a dict.

One example row is:

`6AH8,F,B,0,60541765,60541765,90.52,1`

Indexes 4 and 5 are timestamps in milliseconds-after-midnight format:

`60541765`, `60541765`

I want to iterate through every row, add each distinct timestamp to a list, and replace the timestamp in the row with its index in that list.

i.e:

`timeList = [60541765, 20531765, ..., 80542765]`

so the row will be :

`6AH8,F,B,0,0,0,90.52,1`

Is there a better way to do this than storing them in a list?

if not:

How do I iterate through the rows and replace indexes 4 and 5 the fastest way? Right now it takes more than 15 minutes.

I'm doing it like this at the moment:

```
timeStampList = []

def putTimeStampsInList(inputDict):
    for values in inputDict.values():
        timestamp1 = values[4]
        if values[4] not in timeStampList:
            timeStampList.append(values[4])
```

-----------------------

This is an assignment where I'm supposed to use compression to make a 19 MB file smaller, without using any third-party or framework-provided compression libraries. So I can't just take an existing Huffman or LZ77 implementation and copy it.

I already have a solution to minimize:

`index 1 - 6AH8`

`index 2 and 3 - F,B`

My issue is the timestamps, which I can't minimize properly in a time-saving way.

Answer

Your issue is likely that checking whether a number is in a list in Python is an O(n) operation, and it is performed once for every row in your large dataset, giving an O(n^2) algorithm overall, which is enormous on 500,000 entries.
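The gap between list and set membership is easy to see directly. This is a small illustrative benchmark (not from the original post) comparing `in` on a list, which scans element by element, against `in` on a set, which hashes the value and checks one bucket:

```python
import timeit

data = list(range(100_000))
as_list = data
as_set = set(data)

# Worst case for the list: the value we look up is the last element,
# so every check scans all 100,000 entries (O(n)).
t_list = timeit.timeit(lambda: 99_999 in as_list, number=100)

# The set hashes the value and jumps straight to it (average O(1)).
t_set = timeit.timeit(lambda: 99_999 in as_set, number=100)

print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

On any recent CPython the set lookup is orders of magnitude faster, and the gap grows with the size of the data.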

My suggestion would be to spend O(n) extra space to save on time: a set gives average-case O(1) membership checks, which brings the whole pass down to O(n) on typical data.

```
timeStampList = []
timeStampSet = set()

def putTimeStampsInList(inputDict):
    for values in inputDict.values():
        timestamp1 = values[4]
        if values[4] not in timeStampSet:
            timeStampList.append(values[4])
            timeStampSet.add(values[4])
```

Now checking for membership is a constant-time operation. Rather than cycling through your gigantic list every single time it needs to check whether something is in the list, your code can just quickly check the set you're building alongside it. This should speed up your algorithm significantly.

Once you're done creating the list, you don't need the set anymore, so it won't affect the compression size.
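To close the loop on the original goal (replacing fields 4 and 5 with list indices), here is a minimal sketch of one way to do it. It assumes each row is a list of strings as in your example; the dict maps each timestamp straight to its index, so both the membership check and the index lookup are O(1), and there is no separate pass to find positions afterwards:

```python
def compress_timestamps(rows):
    """Replace fields 4 and 5 of every row with indices into a
    shared timestamp list, built in first-seen order."""
    time_list = []    # unique timestamps, in order of first appearance
    time_index = {}   # timestamp -> its position in time_list
    for row in rows:
        for i in (4, 5):
            ts = row[i]
            if ts not in time_index:        # O(1) dict lookup
                time_index[ts] = len(time_list)
                time_list.append(ts)
            row[i] = time_index[ts]         # replace value with index
    return time_list

rows = [["6AH8", "F", "B", "0", "60541765", "60541765", "90.52", "1"]]
timestamps = compress_timestamps(rows)
# rows[0] is now ['6AH8', 'F', 'B', '0', 0, 0, '90.52', '1']
```

The dict subsumes the set here: since you need the index anyway, storing `timestamp -> index` does both jobs, and only `time_list` needs to be written into the compressed output.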