ukdatahub - 23 days ago
Python Question

How to handle date variable in machine learning data pre-processing

I have a dataset that contains, among other variables, the timestamp of each transaction in the format 26-09-2017 15:29:32. I need to find possible correlations with sales and to predict sales (let's say with logistic regression). My questions are:


  1. How should I handle the date format? Should I convert it to a single number (as Excel does automatically)? Should I split it into more variables such as day, month, year, hour, minutes, and seconds? Any other suggestions?

  2. What if I would like to add a distinct week number per year? Should I add a variable like 342017 (week 34 of year 2017)?

  3. Should I do the same as in question 2 for the quarter of the year?

  4. Is there any book or guide that would help with data preprocessing techniques?




#  Datetime             Gender  Purchase
1  23/09/2015 00:00:00  0       1
2  23/09/2015 01:00:00  1       0
3  25/09/2015 02:00:00  1       0
4  27/09/2015 03:00:00  1       1
5  28/09/2015 04:00:00  0       0


Answer

Some random thoughts:

Dates are a good source for feature engineering. I don't think there is a single method for using dates in a model. Business-user expertise would help a lot: are there observed trends that can be coded into the data?

Possible features include:

  • weekends vs weekdays
  • business hours and time of day
  • seasons
  • week of year number
  • month
  • year
  • beginning/end of month (pay days)
  • quarter
  • days to/from an action event (distance)
  • missing or incomplete data
  • etc.

All of this depends on the data set, and most won't apply.
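To make a few of the features above concrete, here is a minimal sketch using only Python's standard `datetime` module on the timestamps from the question. The function name `date_features`, the 09:00-17:00 business-hours window, and the "first three days of the month" pay-day window are assumptions made up for illustration, not anything taken from your data; adjust them to whatever your business users tell you.

```python
from datetime import datetime

# Sample timestamps copied from the question (day/month/year layout).
RAW = [
    "23/09/2015 00:00:00",
    "23/09/2015 01:00:00",
    "25/09/2015 02:00:00",
    "27/09/2015 03:00:00",
    "28/09/2015 04:00:00",
]


def date_features(text, fmt="%d/%m/%Y %H:%M:%S"):
    """Turn one timestamp string into a dict of simple calendar features.

    Hypothetical helper for illustration only.
    """
    ts = datetime.strptime(text, fmt)
    iso_year, iso_week, _ = ts.isocalendar()
    weekday = ts.weekday()  # Monday = 0 ... Sunday = 6
    return {
        "year": ts.year,
        "month": ts.month,
        "quarter": (ts.month - 1) // 3 + 1,
        # ISO year and week are kept as two separate columns.
        "iso_year": iso_year,
        "iso_week": iso_week,
        "day_of_week": weekday,
        "is_weekend": int(weekday >= 5),
        "hour": ts.hour,
        # Assumed business hours: Monday to Friday, 09:00 to 17:00.
        "is_business_hours": int(weekday < 5 and 9 <= ts.hour < 17),
        # Assumed pay-day window: first three days of the month.
        "is_month_start": int(ts.day <= 3),
    }


features = [date_features(t) for t in RAW]
for row in features:
    print(row)
```

On question 2, one option (a suggestion, not a rule) is to keep the year and the week number as two separate columns as above, rather than fusing them into a single number like 342017, since a linear model would otherwise treat that fused value as a magnitude. If you load the data with pandas, the `.dt` accessor on a datetime column exposes similar fields.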

Some links:

http://appliedpredictivemodeling.com/blog/2015/7/28/feature-engineering-versus-feature-extraction

https://www.salford-systems.com/blog/dan-steinberg/using-dates-in-data-mining-models

http://trevorstephens.com/kaggle-titanic-tutorial/r-part-4-feature-engineering/