fmark - 6 months ago
Python Question

Method for guessing type of data currently represented as strings

I'm currently parsing CSV tables and need to discover the "data types" of the columns. I don't know the exact format of the values. Obviously, everything that the CSV parser outputs is a string. The data types I am currently interested in are:

  1. integer

  2. floating point

  3. date

  4. boolean

  5. string

My current thoughts are to test a sample of rows (maybe several hundred?) in order to determine the types of data present through pattern matching.

I am particularly concerned about the date data type. Is there a Python module for parsing common date idioms (obviously I will not be able to detect them all)?

What about integers and floats?


Dateutil comes to mind for parsing dates.
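If you would rather stay within the standard library, a rough sketch of the same idea is to try a handful of candidate formats with `datetime.strptime`. This is not dateutil's approach, just a simpler substitute; the helper name `parse_date` and the format list are assumptions for illustration, and dateutil's parser recognises many more idioms.

```python
from datetime import datetime

# Hypothetical list of formats to try; extend it to match your data.
CANDIDATE_FORMATS = (
    "%Y-%m-%d",
    "%Y-%m-%d %H:%M:%S",
    "%d/%m/%Y",
    "%m/%d/%Y",
)

def parse_date(text):
    """Return a datetime if text matches a candidate format, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    return None
```

Note that an ambiguous value such as "01/02/2009" matches whichever format comes first in the list, so the order encodes a guess about the data's locale.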

For integers and floats, you could always try a cast in a try/except block:

>>> f = "2.5"
>>> i = "9"
>>> ci = int(i)
>>> ci
9
>>> cf = float(f)
>>> cf
2.5
>>> g = "dsa"
>>> cg = float(g)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: 'dsa'
>>> try:
...     cg = float(g)
... except ValueError:
...     print("g is not a float")
...
g is not a float
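
Extending the cast idea to a whole column could look something like the sketch below. The function names, the set of boolean spellings, and the rule that a mix of integers and floats counts as a float column are all assumptions for illustration, not an established method. Date detection is left out here; it could be slotted in before the fallback to string.

```python
# Hypothetical set of spellings treated as booleans; adjust for your data.
BOOLEAN_WORDS = ("true", "false", "yes", "no")

def guess_type(value):
    """Guess the type of one string: 'boolean', 'integer', 'float' or 'string'."""
    text = value.strip()
    if text.lower() in BOOLEAN_WORDS:
        return "boolean"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)  # also accepts "nan", "inf" and "1e5"
        return "float"
    except ValueError:
        return "string"

def guess_column_type(values):
    """Guess a single type for a column from a sample of its string values."""
    # Empty cells carry no type information, so skip them.
    kinds = {guess_type(v) for v in values if v.strip()}
    if not kinds:
        return "string"
    if len(kinds) == 1:
        return kinds.pop()
    if kinds == {"integer", "float"}:
        return "float"  # integers are representable as floats
    return "string"  # any other mixture falls back to string
```

For example, `guess_column_type(["9", "2.5", ""])` returns `"float"`, while `guess_column_type(["9", "dsa"])` returns `"string"`. Checking integers before floats matters, since every integer string would also pass the `float` cast.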