NutshellProgrammer - 4 months ago
Python Question

Handling large integers in a for loop in Python, optimized

I'm reading a large CSV file (500,000 rows) and adding every row to a dict.
One example row is:

6AH8,F,B,0,60541765,60541765,90.52,1


Index 4 (60541765) and index 5 (60541765) are the same in this case, but that is not always so. These integers are timestamps that indicate time in milliseconds-after-midnight format.

I want to iterate through every row, add each large number to a list, and replace the number in the row with its index in that list.

i.e:

timeList = [60541765, 20531765, ..., 80542765]


so the row will be:
6AH8,F,B,0,0,0,90.52,1


WHY? - because the same times sometimes occur more than once in the file.

My question is:

Is there a better way to do this than storing them in a list?

if not:

How do I iterate through the rows and replace indices 4 and 5 the fastest way? Right now it takes more than 15 minutes.

I'm doing it like this at the moment:

timeStampList = []
def putTimeStampsInList(inputDict):
    for values in inputDict.values():
        timestamp1 = values[4]

        if values[4] not in timeStampList:
            timeStampList.append(values[4])


----------------------- Additional information -----------------------

This is an assignment where I'm supposed to use compression to make a 19 MB file smaller, without using any 3rd-party or framework-provided compression libraries in my solution. So I can't copy an existing Huffman or LZ77 implementation.

I already have a solution to minimize

index 0 - 6AH8

indices 1 and 2 - F,B


My issue is the timestamps, which I can't minimize properly in a time-saving way.

Answer

Your issue is likely that checking whether a number is in a list in Python is an O(n) operation, and it is performed for every row in your large dataset, giving an O(n^2) algorithm overall, which is enormous on 500,000 entries.
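To see the gap concretely, here is a quick illustration (my own, not part of the original answer) comparing a single membership test against a list and against a set built from the same 100,000 elements:

```python
import timeit

data = list(range(100_000))
as_list = data
as_set = set(data)

# Membership test for an element near the end: the list scans
# linearly, while the set does a constant-time hash lookup.
t_list = timeit.timeit(lambda: 99_999 in as_list, number=100)
t_set = timeit.timeit(lambda: 99_999 in as_set, number=100)
```

On any machine, `t_list` comes out orders of magnitude larger than `t_set`, and in your code this scan runs once per row, which is where the 15 minutes go.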

My suggestion would be to spend O(n) extra space to save on time: keeping a set alongside the list makes each membership test O(1) on average, so the whole pass becomes O(n) on typical data.

timeStampList = []
timeStampSet = set()
def putTimeStampsInList(inputDict):
    for values in inputDict.values():
        timestamp1 = values[4]

        if values[4] not in timeStampSet:
            timeStampList.append(values[4])
            timeStampSet.add(values[4])

Now checking for membership is a constant-time operation on average, so rather than scanning your gigantic list every single time it needs to check whether something is already there, your code can do a quick set lookup. This should speed up your algorithm significantly.

Once you're done creating the list, you don't need the set anymore, so it won't affect the compressed size.
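Going one step further (my own variation, not part of the answer above): a dict mapping each timestamp to its list index gives you both the O(1) membership test and the replacement index in a single lookup, so you can rewrite columns 4 and 5 in the same pass. The function name `compress_timestamps` and the row layout are illustrative:

```python
def compress_timestamps(rows):
    """Replace columns 4 and 5 of each row with indices into a
    shared timestamp list, built in a single O(n) pass."""
    time_list = []   # unique timestamps, in first-seen order
    index_of = {}    # timestamp -> its position in time_list
    for row in rows:
        for col in (4, 5):
            ts = row[col]
            if ts not in index_of:         # O(1) average-case dict lookup
                index_of[ts] = len(time_list)
                time_list.append(ts)
            row[col] = index_of[ts]        # store the small index instead
    return time_list

rows = [
    ["6AH8", "F", "B", "0", 60541765, 60541765, "90.52", "1"],
    ["6AH9", "F", "B", "0", 20531765, 60541765, "90.52", "1"],
]
times = compress_timestamps(rows)
# rows[0] is now ["6AH8", "F", "B", "0", 0, 0, "90.52", "1"]
```

The dict replaces the set entirely: membership and index retrieval are the same lookup, which matters here because the whole point is writing the index back into the row.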
