I have a Pandas DataFrame (called `df`), which I would like to upload to a MySQL database.
The dataframe has columns [A, B, C] and the table in the database has columns [ID, A, B, C]. The `ID` column in the database is the auto-incrementing primary key.
I can upload the dataframe to the database using the `to_sql` command. However, this does not give me any information about the values that the database assigned to the `ID` column of the incoming data. The only way I have of getting this information is by querying the database using the values for columns A, B and C:

select ID, A, B, C from `table_name`
where (A, B, C) in ((x1, y1, z1), (x2, y2, z2), ...)
However, this query takes a very long time when I am inserting a lot of data.
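To make the cost concrete, the lookup above can be built like this (a minimal sketch with made-up numeric data; `table_name` and the column names follow the question, and real code should use parameter binding rather than string formatting):

```python
import pandas as pd

# Toy stand-in for the dataframe that was just uploaded
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# Build the (A, B, C) value list for the IN clause, one tuple per row
rows = df.itertuples(index=False, name=None)
tuples_sql = ', '.join('({}, {}, {})'.format(*t) for t in rows)

query = ('select ID, A, B, C from `table_name` '
         'where (A, B, C) in ({})'.format(tuples_sql))
```

With many rows, the IN list grows linearly and the database has to match every tuple, which is why this round trip becomes slow.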
Is there a simpler and quicker way of getting the values that the database assigned to the `ID` column of the incoming data?
I can assign the `ID` column myself, as per user3364098's answer below. However, my job is part of a pipeline that is run in parallel. If I assign the `ID` column myself, there is a chance that I may assign the same `ID` values to different dataframes that are uploaded at the same time. This is why I would like to relegate the `ID` assignment task to the database.
I ended up assigning the `ID` column myself and issuing a write lock on the table while uploading the data, in order to guarantee that no other process uploads data with the same `ID` values. Basically:
engine.execute('lock tables `table_name` write')
max_id_query = 'select max(ID) from `table_name`'
max_id = int(pd.read_sql_query(max_id_query, engine).iloc[0, 0])
df['ID'] = range(max_id + 1, max_id + len(df) + 1)
df.to_sql('table_name', engine, if_exists='append', index=False)
engine.execute('unlock tables')  # release the lock once the upload is done
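For completeness, here is a minimal runnable sketch of the same max-ID-plus-range idea against an in-memory SQLite database (`lock tables` is MySQL-specific, so the locking step is omitted here; the table and column names follow the question):

```python
import sqlite3
import pandas as pd

# Stand-in for the MySQL table, pre-populated with one existing row (ID = 1)
conn = sqlite3.connect(':memory:')
conn.execute('create table `table_name` '
             '(ID integer primary key, A integer, B integer, C integer)')
conn.execute('insert into `table_name` (ID, A, B, C) values (1, 0, 0, 0)')
conn.commit()

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# Read the current maximum ID; MAX() returns NULL (NaN) on an empty table
m = pd.read_sql_query('select max(ID) as m from `table_name`', conn)['m'].iloc[0]
max_id = 0 if pd.isna(m) else int(m)

# Assign consecutive IDs starting just above the current maximum, then append
df['ID'] = range(max_id + 1, max_id + len(df) + 1)
df.to_sql('table_name', conn, if_exists='append', index=False)
```

On MySQL, the lock/unlock statements shown above would wrap this block; note that `lock tables` is connection-scoped, so the lock, the read, and the upload must all run on the same database connection.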