Andrew - 4 months ago
Python Question

using pandas read_csv with missing data

I am attempting to read a csv file where some rows may be missing chunks of data.

This seems to cause a problem for the pandas read_csv function when you specify the dtype. It appears that, in order to convert from str to whatever the dtype specifies, pandas just tries to cast the value directly. Therefore, if a field is missing, the cast fails and things break down.

An MWE follows (this MWE uses StringIO in place of a true file; however, the issue also happens with a real file):

import pandas as pd
import numpy as np
import io

datfile = io.StringIO("12 23 43| | 37| 12.23| 71.3\n12 23 55|X| | | 72.3")

names = ['id', 'flag', 'number', 'data', 'data2']
dtypes = [np.str, np.str, np.int, np.float, np.float]

dform = {name: dtypes[ind] for ind, name in enumerate(names)}

colconverters = {0: lambda s: s.strip(), 1: lambda s: s.strip()}

df = pd.read_table(datfile, sep='|', dtype=dform, converters=colconverters, header=None,
index_col=0, names=names, na_values=' ')
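For what it's worth, here is a workaround sketch that avoids the problem entirely (my own illustration, not part of the original question): parse the incomplete column as float first, since NaN is itself a float, and only afterwards convert to pandas' nullable `Int64` dtype (available in pandas >= 0.24), which can hold missing values:

```python
import io

import numpy as np
import pandas as pd

datfile = io.StringIO("12 23 43| | 37| 12.23| 71.3\n12 23 55|X| | | 72.3")

names = ['id', 'flag', 'number', 'data', 'data2']

# Read the integer-like column as float: NaN is a float, so the
# cast succeeds even when the field is empty.
df = pd.read_csv(datfile, sep='|', header=None, index_col=0, names=names,
                 converters={0: str.strip, 1: str.strip},
                 dtype={'number': np.float64, 'data': np.float64,
                        'data2': np.float64},
                 na_values=' ')

# Convert afterwards to the nullable integer dtype, which stores
# missing entries as pd.NA instead of failing like plain int64.
df['number'] = df['number'].astype('Int64')
```

This keeps the column logically integer-valued without patching pandas itself.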

The error I get when I run this is

Traceback (most recent call last):
File "pandas/parser.pyx", line 1084, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:12580)
TypeError: Cannot cast array from dtype('O') to dtype('int64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/aliounis/Repos/stellarpy/source/", line 15, in <module>
index_col=0, names=names, na_values=' ')
File "/usr/local/lib/python3.5/site-packages/pandas/io/", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.5/site-packages/pandas/io/", line 325, in _read
File "/usr/local/lib/python3.5/site-packages/pandas/io/", line 815, in read
ret =
File "/usr/local/lib/python3.5/site-packages/pandas/io/", line 1314, in read
data =
File "pandas/parser.pyx", line 805, in (pandas/parser.c:8748)
File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
File "pandas/parser.pyx", line 904, in pandas.parser.TextReader._read_rows (pandas/parser.c:10022)
File "pandas/parser.pyx", line 1011, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:11397)
File "pandas/parser.pyx", line 1090, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:12656)
ValueError: invalid literal for int() with base 10: ' '

Is there some way I can fix this? I looked through the documentation but didn't see anything that directly addresses it. Or is this just a bug that needs to be reported to pandas?


So, as Merlin pointed out, the main problem is that NaNs can't be ints, which is probably why pandas behaves this way to begin with. Unfortunately I didn't have a choice, so I had to make some changes to the pandas source code myself. I ended up having to change lines 1087-1096 of the file parser.pyx to

        na_count_old = na_count
        for ind, row in enumerate(col_res):
            k = kh_get_str(na_hashset, row.strip().encode())
            if k != na_hashset.n_buckets:
                # the value is in the na list: replace it with a float NaN
                col_res[ind] = np.nan
                na_count += 1
            else:
                # otherwise cast the single value to the requested dtype
                col_res[ind] = np.array(col_res[ind]).astype(col_dtype).item(0)

        if na_count_old == na_count:

            # float -> int conversions can fail the above
            # even with no nans
            col_res_orig = col_res
            col_res = col_res.astype(col_dtype)
            if (col_res != col_res_orig).any():
                raise ValueError("cannot safely convert passed user dtype of "
                                 "{col_dtype} for {col_res} dtyped data in "
                                 "column {column}".format(col_dtype=col_dtype,
                                                          col_res=col_res_orig.dtype.name,
                                                          column=column))
which essentially goes through each element of a column and checks whether it is contained in the na list (note that we have to strip the values so that multi-space fields are recognized as being in the na list). If it is, that element is set to a float np.nan. If it is not in the na list, it is cast to the original dtype specified for that column (which means the column will end up holding multiple dtypes).
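The per-element logic described above can be sketched in plain Python (names like `na_values` and `col_dtype` here stand in for the Cython variables; this is an illustration of the idea, not the actual patch):

```python
import numpy as np

def convert_column(values, col_dtype, na_values):
    """Mimic the patched loop: NA strings become float NaN,
    everything else is cast to the requested dtype."""
    result = []
    na_count = 0
    for raw in values:
        if raw.strip() in na_values:   # strip so '  ' matches the NA token
            result.append(np.nan)      # NaN is a float, not an int
            na_count += 1
        else:
            # cast the single value to the requested dtype
            result.append(np.array(raw.strip()).astype(col_dtype).item())
    return result, na_count

col, n_missing = convert_column([' 37', ' ', ' 12'], np.int64, {''})
# col holds a mix of ints and a float NaN; n_missing counts the NA fields
```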

While this isn't a perfect fix (and is likely slow), it works for my needs, and maybe someone else with a similar problem will find it useful.