I'm reading a large CSV file (500 000 rows) and adding every row to a dict.
One example row is:
6AH8,F,B,0,60541765,60541765,90.52,1
The timestamp fields in that row are 60541765 and 60541765, and I want to collect the timestamps from all rows into one list (without duplicates), something like:
timeList = [60541765, 20531765, ..., 80542765]
Another example row (with 0 in the timestamp fields) is:
6AH8,F,B,0,0,0,90.52,1
    timeStampList = []

    def putTimeStampsInList(inputDict):
        for values in inputDict.values():
            # values[4] is the first timestamp field of the row
            timestamp1 = values[4]
            if values[4] not in timeStampList:
                timeStampList.append(values[4])
For reference, the first fields in each row are:
index 1 - 6AH8
index 2 and 3 - F,B
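For context, the dict might be built roughly like this (the file name, the csv module usage, and the choice of key are assumptions for illustration, not the exact code):

    import csv

    # Rough sketch of the setup described above; the key here is just the row
    # number, which is an assumption.
    inputDict = {}
    with open("data.csv", newline="") as f:
        for rowNumber, row in enumerate(csv.reader(f)):
            # row looks like ['6AH8', 'F', 'B', '0', '60541765', '60541765', '90.52', '1']
            inputDict[rowNumber] = row

    putTimeStampsInList(inputDict)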
Your issue is likely that checking whether a number is in a list in Python is an O(n) operation, and it has to be performed for every row in your large dataset, so the whole algorithm is O(n^2), which is enormous for 500,000 entries.
My suggestion would be to trade a bit of extra space, O(n), for time, which brings the whole thing down to an average of O(n) on typical data:
    timeStampList = []
    timeStampSet = set()

    def putTimeStampsInList(inputDict):
        for values in inputDict.values():
            timestamp1 = values[4]
            if values[4] not in timeStampSet:
                timeStampList.append(values[4])
                timeStampSet.add(values[4])
Now checking for membership is a constant-time operation: rather than cycling through your gigantic list every single time it needs to check whether something is already there, your code can just check the set you're building alongside it. This should speed up your algorithm significantly.
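To see the difference, here is a small self-contained timing sketch (the sizes and numbers are illustrative, not taken from your data):

    import timeit

    # Compare a membership test on a list vs. a set holding the same 100,000 integers.
    setup = "data = list(range(100_000)); dataSet = set(data)"
    listTime = timeit.timeit("99_999 in data", setup=setup, number=100)
    setTime = timeit.timeit("99_999 in dataSet", setup=setup, number=100)
    print(f"list membership: {listTime:.4f} s")
    print(f"set membership:  {setTime:.6f} s")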
Once you're done creating the list, you don't need the set anymore, so you can throw it away; the extra memory is only used while the list is being built.
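For example, a quick run with a tiny made-up inputDict (the values follow the row layout from your question; the keys are just row numbers, and this assumes the definitions above):

    inputDict = {
        0: ['6AH8', 'F', 'B', '0', '60541765', '60541765', '90.52', '1'],
        1: ['6AH8', 'F', 'B', '0', '60541765', '60541765', '90.52', '1'],
        2: ['6AH9', 'F', 'B', '0', '20531765', '20531765', '88.10', '1'],
    }

    putTimeStampsInList(inputDict)
    print(timeStampList)  # ['60541765', '20531765'] - the duplicate was skipped
    timeStampSet.clear()  # the set can be discarded once the list is built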