Khris - 24 days ago
Python Question

Huge slowdown with string.count() when reading certain data with certain characters

I need to read in a large csv data file which is riddled with stray newline characters and generally quite chaotic. So instead of pandas I parse it manually; however, I'm running into a strange slowdown which seems to depend on the characters that appear in the file.

While trying to recreate the problem by randomly generating a similar-looking csv file, I figured that the problem may lie in the count function.

Consider this example, which creates a large file of chaotic random data, reads it back in, and then uses count to reassemble the pieces into complete rows of columnar data.

Note that in the first run of the file I only use string.ascii_letters for the random data; for the second run I'm using characters from string.printable.

import os
import random as rd
import string
import time

# Function to create random data in a specific pattern with separator ";":
def createRandomString(num,io,fullLength):
    lineFull = ''
    nl = True
    randstr = ''.join(rd.choice(string.ascii_letters) for _ in range(7))
    for i in range(num):
        if i == 0:
            line = 'Start;'
        else:
            line = ''
        bb = rd.choice([True,True,False])
        if bb:
            line = line+'\"\";'
        else:
            if rd.random() < 0.999:
                line = line+randstr
            else:
                line = line+rd.randint(10,100)*randstr
        if nl and i != num-1:
            line = line+';\n'
            nl = False
        elif rd.random() < 0.04 and i != num-1:
            line = line+';\n'
            if rd.random() < 0.01:
                add = rd.randint(1,10)*'\n'
                line = line+add
        else:
            line = line+';'
        lineFull = lineFull+line
    return lineFull+'\n'

# Create file with random data:
outputFolder = "C:\\DataDir\\Output\\"
numberOfCols = 38
fullLength = 10000
testLines = [createRandomString(numberOfCols,i,fullLength) for i in range(fullLength)]
with open(outputFolder+"TestFile.txt",'w') as tf:
    tf.writelines(testLines)

# Read in file:
with open(outputFolder+"TestFile.txt",'r') as ff:
    lines = []
    for line in ff.readlines():
        lines.append(unicode(line.rstrip('\n')))

# Restore columns by counting the separator:
linesT = ''
lines2 = []
time0 = time.time()
for i in range(len(lines)):
    linesT = linesT + lines[i]
    count = linesT.count(';')
    if count == numberOfCols:
        lines2.append(linesT)
        linesT = ''
    if i%1000 == 0:
        print time.time()-time0
        time0 = time.time()
print time.time()-time0


The print statements output this:

0.0
0.0019998550415
0.00100016593933
0.000999927520752
0.000999927520752
0.000999927520752
0.000999927520752
0.00100016593933
0.0019998550415
0.000999927520752
0.00100016593933
0.0019998550415
0.00100016593933
0.000999927520752
0.00200009346008
0.000999927520752
0.000999927520752
0.00200009346008
0.000999927520752
0.000999927520752
0.00200009346008
0.000999927520752
0.00100016593933
0.000999927520752
0.00200009346008
0.000999927520752


Consistently fast performance.

Now I change the third line in createRandomString to

randstr = ''.join(rd.choice(string.printable) for _ in range(7))

and my output becomes this:

0.0
0.0759999752045
0.273000001907
0.519999980927
0.716000080109
0.919999837875
1.11500000954
1.25199985504
1.51200008392
1.72199988365
1.8820002079
2.07999992371
2.21499991417
2.37400007248
2.64800000191
2.81900000572
3.04500007629
3.20299983025
3.55500006676
3.6930000782
3.79499983788
4.13900017738
4.19899988174
4.58700013161
4.81799983978
4.92000007629
5.2009999752
5.40199995041
5.48399996758
5.70299983025
5.92300009727
6.01099991798
6.44200015068
6.58999991417
3.99399995804


Not only is the performance very slow, it also keeps getting slower over time.

The only difference lies in the range of characters which are written into the random data.

The full set of characters that appear in my real data is this:

charSet = [' ','"','&',"'",'(',')','*','+',',','-','.','/','0','1','2','3','4','5','6',
'7','8','9',':',';','<','=','>','A','B','C','D','E','F','G','H','I','J','K',
'L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','\\','_','`','a',
'b','d','e','g','h','i','l','m','n','o','r','s','t','x']


Let's do some benchmarking on the count function:

import random as rd
import string

rd.seed()

def Test0():
    randstr = ''.join(rd.choice(string.digits) for _ in range(10000))
    randstr.count('7')

def Test1():
    randstr = ''.join(rd.choice(string.ascii_letters) for _ in range(10000))
    randstr.count('a')

def Test2():
    randstr = ''.join(rd.choice(string.printable) for _ in range(10000))
    randstr.count(';')

def Test3():
    randstr = ''.join(rd.choice(charSet) for _ in range(10000))
    randstr.count(';')


I'm testing only digits, only letters, printable, and the charset from my data.

Results of %timeit:

%timeit(Test0())
100 loops, best of 3: 9.27 ms per loop
%timeit(Test1())
100 loops, best of 3: 9.12 ms per loop
%timeit(Test2())
100 loops, best of 3: 9.94 ms per loop
%timeit(Test3())
100 loops, best of 3: 8.31 ms per loop


The performance is consistent and doesn't suggest any problem with count for certain character sets.

I also tested whether concatenating strings with + causes a slowdown, but this wasn't the case either.

Can anyone explain this or give me some hints?

EDIT: Using Python 2.7.12

EDIT 2: In my original data the following is happening:

The file has around 550000 lines which are often broken by random newline characters, yet each record is always defined by exactly 38 ";" delimiters. Until roughly line 300000 the performance is fast; from that line on it suddenly starts getting slower and slower. I'm investigating this further now with the new clues.

Answer

The problem is in count(';').

string.printable contains ';' while string.ascii_letters doesn't.
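This is easy to verify in an interpreter:

```python
import string

# string.printable includes punctuation, so the separator is in it;
# string.ascii_letters is letters only.
print(';' in string.printable)      # True
print(';' in string.ascii_letters)  # False
```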

Then as the length of linesT grows, the execution time grows as well:

0.000236988067627
0.0460968017578
0.145275115967
0.271568059921
0.435608148575
0.575787067413
0.750104904175
0.899538993835
1.08505797386
1.24447107315
1.34459710121
1.45430088043
1.63317894936
1.90502595901
1.92841100693
2.07722711563
2.16924905777
2.30753016472
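The cost model behind these numbers: str.count scans the entire string on every call, so repeatedly counting a buffer that never gets flushed does quadratic total work. A minimal sketch (the sizes here are made up for illustration):

```python
# Simulate a buffer that is never reset: each iteration appends 10
# characters and then re-counts the whole buffer from scratch.
buf = ''
chars_scanned = 0
for _ in range(1000):
    buf += 'x' * 10
    buf.count(';')              # O(len(buf)) scan, result is always 0 here
    chars_scanned += len(buf)   # work done by this one count() call

# 10 + 20 + ... + 10000 characters scanned in total, i.e. ~5 million
# scans for a buffer that only ever reaches 10000 characters.
print(chars_scanned)  # 5005000
```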

Faulty runtime behaviour

In particular this code is problematic with string.printable:

numberOfCols = 38

if count == numberOfCols:
    lines2.append(linesT)
    linesT = ''

Since randstr itself can contain ';', there is a chance that the running count jumps from 37 straight past 38 (e.g. to 39) just before linesT would be flushed. The equality test then never fires, and linesT grows indefinitely.
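A tiny, hypothetical illustration of the skip, using 3 columns instead of 38:

```python
numberOfCols = 3
# The last piece contains an extra ';' inside the data itself,
# so the running count jumps from 2 straight to 4.
pieces = ['a;', 'b;', 'x;y;']
linesT = ''
flushed = False
for piece in pieces:
    linesT = linesT + piece
    if linesT.count(';') == numberOfCols:  # counts seen: 1, 2, 4
        flushed = True
        linesT = ''
print(flushed)  # False: the equality test never fires, linesT keeps growing
```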

You can observe this behaviour by keeping the initial character set as string.ascii_letters and changing your count to count('a').

To fix the problem with printable you can modify your code like this:

if count >= numberOfCols:

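A minimal sketch of the relaxed test on the same kind of toy data (3 columns, with an extra ';' hiding inside a field), using >= so that an exact match still flushes:

```python
numberOfCols = 3
pieces = ['a;', 'b;', 'x;y;']   # the last piece carries an extra ';'
lines2 = []
linesT = ''
for piece in pieces:
    linesT = linesT + piece
    if linesT.count(';') >= numberOfCols:  # fires on 3 *or more* separators
        lines2.append(linesT)
        linesT = ''
print(lines2)  # ['a;b;x;y;']
```

The over-full row still needs cleanup downstream, but the buffer no longer grows without bound, so the quadratic slowdown disappears.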
Then we go back to the expected runtime behaviour:

0.000234842300415
0.00233697891235
0.00247097015381
0.00217199325562
0.00262403488159
0.00262403488159
0.0023078918457
0.0024049282074
0.00231409072876
0.00233006477356
0.00214791297913
0.0028760433197
0.00241804122925
0.00250506401062
0.00254893302917
0.00266218185425
0.00236296653748
0.00201988220215
0.00245118141174
0.00206398963928
0.00219988822937
0.00230193138123
0.00205302238464
0.00230097770691
0.00248003005981
0.00204801559448