T. Carson T. Carson - 1 month ago 6
Python Question

Many-record upload to postgres

I have a series of .csv files with some data, and I want a Python script to open them all, do some preprocessing, and upload the processed data to my postgres database.

I have it mostly complete, but my upload step isn't working. I'm sure it's something simple that I'm missing, but I just can't find it. I'd appreciate any help you can provide.

Here's the code:

import psycopg2
import sys
from os import listdir
from os.path import isfile, join
import csv
import re
import io

try:
con = db_connect("dbname = '[redacted]' user = '[redacted]' password = '[redacted]' host = '[redacted]'")
except:
print("Can't connect to database.")
sys.exit(1)
cur = con.cursor()

upload_file = io.StringIO()

file_list = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in file_list:
id_match = re.search(r'.*-(\d+)\.csv', file)
if id_match:
id = id_match.group(1)
file_name = format(id_match.group())
with open(mypath+file_name) as fh:
id_reader = csv.reader(fh)
next(id_reader, None) # Skip the header row
for row in id_reader:
[stuff goes here to get desired values from file]
if upload_file.getvalue() != '': upload_file.write('\n')
upload_file.write('{0}\t{1}\t{2}'.format(id, [val1], [val2]))

print(upload_file.getvalue()) # prints output that looks like I expect it to
# with thousands of rows that seem to have the right values in the right fields

cur.copy_from(upload_file, '[my_table]', sep='\t', columns=('id', 'col_1', 'col_2'))
con.commit()

if con:
con.close()


This runs without error, but a select query in psql still shows no records in the table. What am I missing?

Edit: I ended up giving up and writing it to a temporary file, and then uploading the file. This worked without any trouble...I'd obviously rather not have the temporary file though, so I'm happy to have suggestions if someone sees the problem.

Answer

When you write to an io.StringIO (or any other file) object, the file pointer remains at the position of the last character written. So, when you do

f = io.StringIO()
f.write('1\t2\t3\n')
s = f.readline()

the file pointer stays at the end of the file and s contains an empty string.


To read (not getvalue) the contents, you must reposition the file pointer to the beginning, e.g. use seek(0)

upload_file.seek(0)
cur.copy_from(upload_file, '[my_table]', columns = ('id', 'col_1', 'col_2'))

This allows copy_from to read from the beginning and import all the lines in your upload_file.


Don't forget, that you read and keep all the files in your memory, which might work for a single small import, but may become a problem when doing large imports or multiple imports in parallel.