Pythoner Pythoner - 1 month ago 24
Python Question

Import only unique records into PostgreSQL

I am trying to import CSV file into PostgreSQL using Python. I want to tell Python to only import the records/rows of data in CSV file which do not contain the duplicates (only the first unique builder record). I have attached the code I am using to find the duplicates in the CSV file. I am new to programming so please pardon my inexperience.

import csv
import psycopg2

database = psycopg2.connect (database = "***", user="***")

cursor = database.cursor()
delete = """Drop table if exists "Real".Noduplicates"""
print (delete)

mydata = cursor.execute(delete)

cursor.execute("""Create Table "Real".Noduplicates
(UserName varchar(55),
LastUpdate timestamp,
Week date,
Builder varchar(55),
Traffic integer
);""")

print "Table created successfully"

csv_data = csv.reader(file('SampleData2.csv'))

already_seen = set()

next(csv_data)

for row in csv_data:
builder = row[3]
if builder in already_seen:
print('{} is a duplicate builder'.format(builder))
else:
print('{} is a new builder'.format(builder))
already_seen.add(builder)

for row in csv_data:

cursor.execute("""INSERT INTO "Real".Noduplicates (UserName, LastUpdate, Week, Builder, Traffic)"""\
"""VALUES (%s,%s,%s,%s,%s)""",
row)

cursor.close()
database.commit()
database.close()

print "CSV Imported"

Answer

Instead, import all rows to a temp table using COPY (much faster) and then INSERT a distinct set into the target table from there, using SELECT DISTINCT ... or SELECT DISTINCT ON (builder) or use aggregate functions). Finding dupes is a forte of any RDBMS.

Related code examples:

Or, while inserting rows one by one, the new UPSERT in Postgres 9.5 and a UNIQUE index on all 5 columns is all you need:

But note the special role of NULL values (which are never considered equal in SQL):

If the column builder is all you need to identify dupes, just define that column UNIQUE:

...
builder varchar(55) UNIQUE NOT NULL,
...

and use a simple UPSERT:

INSERT INTO "Real".noduplicates (userName, lastUpdate, week, builder, Traffic)
VALUES (%s,%s,%s,%s,%s)
ON CONFLICT ON (builder) DO NOTHING;

Better don't use CaMelCase names in Postgres btw.: