Blutch Blutch - 3 months ago 31
Python Question

python pandas dtypes detection from sql

I am quite troubled by the behaviour of Pandas DataFrame about Dtype detection.

I use 'read_sql_query' to retrieve data from a database to build a DataFrame, and then dump it into a csv file.

I don't need any transformation. Just dump it into a csv file and change date fields in the form : '%d/%m/%Y'

However :

self.dataframe.to_csv(self.fic,
index=False,
header=False,
sep='|',
mode='a',
encoding='utf-8',
line_terminator='\n',
date_format='%d/%m/%Y
)


Would miss to transforme/format some date fields...

I tried to do it another way :

l = list(self.dataframe.select_dtypes(include=['datetime64']).columns)
for i in l:
self.dataframe[i] = self.dataframe[i].dt.strftime('%d/%m/%Y')


I was about to be satisfied, but some more tests showed a weird behaviour :

if my sql request only selects two nuplets :

requete = 'select * from DOMMAGE_INTERET where doi_id in (176433, 181564)'


Everything works, whatever formating in the csv or in the DataFrame.

It detects date fields properly :

df.dtypes
doi_id int64
aff_id int64
pdo_id int64
doi_date_decision datetime64[ns]
doi_date_mod datetime64[ns]
doi_montant float64
doi_reste_a_payer object
doi_appliquer_taux int64
doi_date_update datetime64[ns]
afg_id int64
dtype: object


But when using a different selection :

requete = 'select * from DOMMAGE_INTERET where rownum < 100'


It miss again. And actually, fields types are detected differently :

doi_id int64
aff_id int64
pdo_id int64
doi_date_decision object
doi_date_mod datetime64[ns]
doi_montant float64
doi_reste_a_payer object
doi_appliquer_taux int64
doi_date_update datetime64[ns]
afg_id int64
dtype: object


As you can see : 'doi_date_decision' type does change depending of the request selection but, of course, this is the same set of data !!!

Isn't it weird?

Do you have a explanation for this behaviour?

Answer

Your to-csv operation does not convert all specified date fields because as you mention, not all datetime columns are read in as datetime format but show as string (object dtype) in current dataframe. This is the unfortunate side effect of reading from external sources as the imported system --this includes Python, SAS, Stata, R, Excel, etc.-- attempts to define columns usually by first few rows unless otherwise explicitly defined.

Fortunately, pandas's read_sql_query() maintains a parameter for parse_dates. So consider defining the dates during the read in operation as this argument takes a list or dictionary:

df = read_sql_query('select * from DOMMAGE_INTERET where rownum < 100', engine, 
                    parse_dates = ['doi_date_decision', 'doi_date_mod', 'doi_date_update'])

Alternatively, convert with pd.to_datetime() just after reading in and before to_csv:

df['doi_date_decision'] = pd.to_datetime(df['doi_date_decision'])

And most RDMS maintains datetime in YYYY-MM-DD HH:MM:SS format, aligning to pandas format.