user2539738 - 2 months ago
Python Question

Python, multithreading: is it safe to use pandas "to_csv" on a common file?

I've got some code that works pretty nicely. Basically it's a while-loop that goes through a list of dates, finds files on my HDD that correspond to those dates, does some calculations with those files, and then appends the output to a "results.csv" file using the command

my_df.to_csv("results.csv",mode = 'a')


I'm wondering if it's safe to create a new thread for each date, and run the body of the while loop for several dates at a time?

MY CODE:

import datetime, time, os
import sys
import threading
import helperPY  # a Python file containing the logic I need

class myThread(threading.Thread):
    def __init__(self, threadID, name, counter, sn, m_date):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.counter = counter
        self.sn = sn
        self.m_date = m_date

    def run(self):
        print("Starting " + self.name)
        # Use the values stored on this instance, not module-level names
        m_runThis(self.sn, self.m_date)
        print("Exiting " + self.name)

def m_runThis(sn, m_date):
    helperPY.helpFn(sn, m_date)  # this is where "my_df.to_csv()" is called

sn = 'XXXXXX'
today = datetime.datetime(2016, 9, 22)
yesterday = datetime.datetime(2016, 6, 13)

threadList = []
i_threadlist = 0
while today > yesterday:
    # One thread per date, named after the date it processes
    threadList.append(myThread(i_threadlist, str(today), i_threadlist, sn, today))
    threadList[i_threadlist].start()
    i_threadlist = i_threadlist + 1
    today = today - datetime.timedelta(1)

Answer

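Writing to the same file from multiple threads is not safe: appends from different threads may interleave and corrupt rows. But you can create a lock to protect that one operation while letting the rest run in parallel. The code that calls to_csv (inside helperPY) isn't shown, but you could create the lock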

csv_output_lock = threading.Lock()

and pass it to helperPY.helpFn. When you get to the operation, do

with csv_output_lock:
    my_df.to_csv("results.csv",mode = 'a')

You get parallelism for other operations - subject to the GIL of course - but the file access is protected.
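
For illustration, here is a minimal end-to-end sketch of that pattern. Since helperPY isn't shown, help_fn below is a hypothetical stand-in for helperPY.helpFn, and it appends a row with the standard csv module rather than pandas so the example runs on its own. The extra out_path and lock parameters and the run_all helper are assumptions made for this sketch. In your code, the body of the `with lock:` block would be the to_csv call.

```python
import csv
import datetime
import os
import tempfile
import threading

# One lock shared by every worker; it guards only the file append.
csv_output_lock = threading.Lock()

def help_fn(sn, m_date, out_path, lock):
    """Hypothetical stand-in for helperPY.helpFn."""
    # Placeholder for the per-date calculations; this part runs unlocked.
    row = [sn, m_date.strftime("%Y-%m-%d"), m_date.toordinal() % 7]
    # Only one thread at a time may append to the shared file.
    with lock:
        # With pandas this would be: my_df.to_csv(out_path, mode='a')
        with open(out_path, "a", newline="") as f:
            csv.writer(f).writerow(row)

def run_all(sn, start, days, out_path):
    """Start one thread per date, then wait for all of them to finish."""
    threads = []
    for offset in range(days):
        m_date = start - datetime.timedelta(days=offset)
        t = threading.Thread(target=help_fn,
                             args=(sn, m_date, out_path, csv_output_lock))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

# Write to a temporary directory so the sketch leaves no stray files behind.
out_path = os.path.join(tempfile.mkdtemp(), "results.csv")
run_all("XXXXXX", datetime.datetime(2016, 9, 22), 20, out_path)

with open(out_path, newline="") as f:
    rows = list(csv.reader(f))
```

Each row is written whole, but the order of rows in the file depends on which thread acquires the lock first, so don't rely on the output being sorted by date. Joining the threads before reading the file makes sure every append has finished.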
