user2539738 - 1 year ago
Python Question

Python, multithreading: is it safe to use pandas "to_csv" on a common file?

I've got some code that works pretty nicely. Basically it's a while-loop that goes through a list of dates, finds the files on my HDD that correspond to those dates, does some calculations with those files, and then appends the output to a "results.csv" file using the command

my_df.to_csv("results.csv",mode = 'a')

I'm wondering if it's safe to create a new thread for each date, and call the stuff in the while loop on several dates at a time?


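import datetime, time, os
import sys
import threading
import helperPY  # a python file containing the logic I need

class myThread(threading.Thread):
    def __init__(self, threadID, name, counter, sn, m_date):
        # The base class must be initialised before attributes such as name are set
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.counter = counter
        self.sn = sn
        self.m_date = m_date

    def run(self):
        print("Starting " + self.name)
        m_runThis(self.sn, self.m_date)
        print("Exiting " + self.name)

def m_runThis(sn, m_date):
    helperPY.helpFn(sn, m_date)  # this is where the "my_df.to_csv()" is called

sn = 'XXXXXX'
today = datetime.datetime(2016, 9, 22)  #

threadList = []
i_threadlist = 0
num_days = 5  # placeholder: the loop header and its bound are missing from the post
while i_threadlist < num_days:
    # One thread per date, walking backwards one day at a time
    threadList.append(myThread(i_threadlist, str(today), i_threadlist, sn, today))
    i_threadlist = i_threadlist + 1
    today = today - datetime.timedelta(1)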

Answer

Appending to the same file from multiple threads is not safe, because writes from different threads can interleave. But you can create a lock to protect that one operation while letting the rest run in parallel. The code that calls to_csv isn't shown, but you could create the lock

csv_output_lock = threading.Lock()

and pass it to helperPY.helpFn. When you get to the operation, do

with csv_output_lock:
    my_df.to_csv("results.csv",mode = 'a')

You get parallelism for other operations - subject to the GIL of course - but the file access is protected.
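
To make the pattern concrete, here is a minimal, self-contained sketch of several threads appending to one file under a shared lock. It is only an illustration under stated assumptions: helperPY isn't shown, so the standard-library csv writer stands in for my_df.to_csv("results.csv", mode='a'), and append_rows, worker and run_demo are hypothetical names, not part of the original code.

```python
import csv
import os
import tempfile
import threading

# One lock shared by every worker thread; it guards the append to the common file.
csv_output_lock = threading.Lock()

def append_rows(path, rows, lock):
    """Append rows to a CSV file while holding the lock.

    Stand-in for: with lock: my_df.to_csv(path, mode='a')
    """
    with lock:
        with open(path, "a", newline="") as f:
            csv.writer(f).writerows(rows)

def worker(path, day, lock):
    # Hypothetical per-date calculation; it runs without holding the lock.
    rows = [[day, i, i * i] for i in range(100)]
    # Only the file append is serialized.
    append_rows(path, rows, lock)

def run_demo(n_threads=8):
    """Start one thread per 'date', wait for all of them, return the rows written."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        threads = [
            threading.Thread(target=worker, args=(path, day, csv_output_lock))
            for day in range(n_threads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()  # wait until every append has finished before reading
        with open(path, newline="") as f:
            return list(csv.reader(f))
    finally:
        os.remove(path)

if __name__ == "__main__":
    print(len(run_demo()))
```

Because each thread writes its whole batch while holding the lock, the rows for one date end up contiguous in the file, although the order of the dates depends on thread scheduling.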