Niche.P - 11 months ago
Python Question

Improving the speed of writing to a txt file in Python

I am generating a txt file based on the TF-IDF calculation for each word.

I am using this code to write the file:

w_writer = open("tf_idf_vectors_stops_2.txt", "w")
for x in xrange(0, len(listPatient)):
    patientId = listPatient[x]  # current patient ID
    for words in tdDict_final[patientId]:
        w_writer.write(patientId + "," + str(multiListTokens.index(words[0])) + "," + str(words[2]))

listPatient is a list of sorted patient IDs.

listPatient = ['001', '002', '003', '004']

tdDict_final is a dictionary with a patient ID as each key and a list of (word, ':', value) tuples as each value.

In the code I use words[0] for the word and words[2] for its value, because words[1] is always ":". The format of tdDict_final looks like this:

{'001': [('dog', ':', '0.2534879'), ('cat', ':', '0.0133487')],
 '002': [('floor', ':', '0.047589'), ('board', ':', '0.099345')],
 '003': [('key', ':', '0.04993')],
 '004': [('thanks', ':', '0.01479')]}

tdDict_final contains all the patients in listPatient.

multiListTokens is a list of all the distinct vocabulary items (tokens) found in tdDict_final.

The problem is that my code above is extremely slow when writing the file out.

Is there any way I can improve the efficiency of writing to a txt file using the code above?

Thank you very much

Answer Source
with open("tf_idf_vectors_stops_2.txt", "w") as w_writer:
    for patientId in listPatient:
        for words in tdDict_final[patientId]:
            w_writer.write("%s,%s,%s\n" % (patientId, multiListTokens.index(words[0]), words[2]))

1st | You should use a with statement instead of opening the file and then closing it manually. The with statement works with Python context managers: it opens the file as w_writer and closes it automatically when the block ends, even if an exception is raised inside it.
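To make the first point concrete, here is a minimal sketch comparing manual closing with the with statement. The temporary path and the sample lines are assumptions made up for the demo; they are not from the question.

```python
import os
import tempfile

# Hypothetical output path in a temporary directory, used only for this demo.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Manual version: close() must be called explicitly, so try/finally is
# needed to make sure it still runs if write() raises.
manual = open(path, "w")
try:
    manual.write("001,0,0.2534879\n")
finally:
    manual.close()

# with-statement version: the file is closed when the block exits.
with open(path, "a") as w_writer:
    w_writer.write("001,1,0.0133487\n")

# Both file objects are closed at this point.
print(manual.closed, w_writer.closed)
```

Both forms leave the file closed, but the with version needs no explicit cleanup code.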

2nd | There is no need for xrange here, because the only use of x is to look up patientId (patientId = listPatient[x]). You can iterate over listPatient directly and take each patientId straight from the loop.
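A small sketch of the second point, using the listPatient sample from the question. The names by_index and direct are made up for the comparison, and range stands in for xrange so the snippet also runs on Python 3.

```python
listPatient = ['001', '002', '003', '004']

# Index-based loop, as in the question (range stands in for xrange here).
by_index = []
for x in range(0, len(listPatient)):
    patientId = listPatient[x]
    by_index.append(patientId)

# Direct iteration: the loop variable is already the patient ID.
direct = []
for patientId in listPatient:
    direct.append(patientId)

print(by_index == direct)  # True
```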

3rd | Building a string with repeated + creates a temporary string at each step, which can be slow in Python. It is usually more efficient to use the join method or a format string with placeholders (as I have). Also, you should not call write twice, as you can include the "\n" in the first write call.
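The formatting point can be sketched as below, under some assumptions. The order of multiListTokens, the helper build_lines, and the temporary output path are all made up for illustration. The sketch also swaps multiListTokens.index(...) for a precomputed dictionary, which is an addition that the answer above does not mention: list.index scans the list from the start on every call, so with a large vocabulary it is likely to cost more than the string concatenation does.

```python
import os
import tempfile

# Sample data following the question's description.
listPatient = ['001', '002', '003', '004']
tdDict_final = {
    '001': [('dog', ':', '0.2534879'), ('cat', ':', '0.0133487')],
    '002': [('floor', ':', '0.047589'), ('board', ':', '0.099345')],
    '003': [('key', ':', '0.04993')],
    '004': [('thanks', ':', '0.01479')],
}
# Assumed token order; the question does not show this list.
multiListTokens = ['dog', 'cat', 'floor', 'board', 'key', 'thanks']

# Assumption (not in the answer): build the token -> position map once.
# This matches list.index() only if the tokens are distinct, as the
# question says they are.
token_index = {token: i for i, token in enumerate(multiListTokens)}


def build_lines(patients, td_dict, index_of):
    """Yield one 'patientId,tokenIndex,value' line per (patient, word) pair."""
    for patientId in patients:
        for words in td_dict[patientId]:
            yield "%s,%d,%s\n" % (patientId, index_of[words[0]], words[2])


# Hypothetical output location so the demo does not touch the working directory.
path = os.path.join(tempfile.mkdtemp(), "tf_idf_vectors_stops_2.txt")
with open(path, "w") as w_writer:
    # writelines consumes the generator; each line already ends with "\n".
    w_writer.writelines(build_lines(listPatient, tdDict_final, token_index))
```

Whether this helps in practice depends on the size of the vocabulary, so it is worth timing both versions on the real data.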