user2628641 user2628641 - 22 days ago 6
Python Question

psycopg2: copy_expert csv file containing e+ integer number

First, let me show how to reproduce the problem. It's simple setup.

here is my simple test csv file

a;b
1;1
0;2
1.5637e+07;3


the test csv file is generated by pandas dataframe's to_csv method

import csv, pandas as pd

d = {'a' : pd.Series([1, 0.0, 1.5637e+07]),'b' : pd.Series([1, 2, 3])}

df = pd.DataFrame(d)

df_csv = df.to_csv('test.csv', encoding="utf-8", index=False,
sep=';', quotechar='"', float_format="%g", quoting=csv.QUOTE_MINIMAL)


then, let's create a test table in db

create table testtb (
a integer,
b integer
)


Finally, I tried to export the csv file into that table, some set up code here

def parse_column(header):
column_str = ",".join(header[0].split(";"))
return column_str

conn = psycopg2.connect("dbname=test1 " +
" user=postgres" +
" host=localhost" +
" password=password")

cur = conn.cursor()

with open('test.csv', "r") as f:
reader = csv.reader(f)
header = next(reader)
f.seek(0)

copy_sql = """
COPY {}.{}({}) FROM STDIN CSV HEADER DELIMITER ';'
""".format("public", "simplew", parse_column(header))
cur.copy_expert(sql=copy_sql, file=f)

conn.commit()


The problem is, now I get

invalid input syntax for integer: "1.5637e+07"


There are few constraints for solution here


  1. I can't use to_sql method of dataframe, I am doing bulk copying here, to_sql is too slow..

  2. because of No.1, I have to follow the procedure I laid out above, to_csv, then copy the csv to DB, which means I can't use any code to format "1.5637e+07", except setting float_format in the to_csv method..

  3. about the float_format, I need to get rid of trailing zero of float number, so I set it to %g currently


Answer

1.5637e+07 isn't an integer. It's a float

>>> 1.5637e+07
15637000.0
>>> type(1.5637e+07)
<class 'float'>

If you want an integer, just do 15637*10**3 or simply 15637000. That will be an integer.

pd.Series([1, 0.0, 15637000])

Note: doing int(1.5637e+07) would also work here, but would fail with loss of precision for highest 10th powers: (int(1.5637e+22) = 15637000000000000786432) so better stick to integer exponentiation, unlike C integers, python integers can handle unlimited values, well memory-limited at most)