S.AMEEN - 4 months ago
Python Question

Error while reading Boston data from UCI website using pandas

I need help reading this file directly from its URL.

import pandas

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
data = pandas.read_csv(url, sep=',', header=None)


I tried sep=',', sep=';' and sep='\t', but the data was read like this:
[image: screenshot of the result]

but with

data = pandas.read_csv(url, sep=' ', header = None)


I received an error:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7988)()
pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)()
pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:8970)()
pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)()
pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:22649)()
CParserError: Error tokenizing data. C error: Expected 30 fields in line 2, saw 31


A similar question may already have been asked, but the accepted answer there does not help me.

Any help reading this file from the URL provided would be appreciated.

BTW, I know there is Boston = load_boston() to read this data, but when I load it with this function, the 'MEDV' attribute is not included with the dataset.

Answer

The file uses multiple spaces as a delimiter, which is why it does not work when you use a single space as the delimiter (sep=' ').
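To illustrate the point, here is a minimal sketch using only the standard library rather than pandas itself. The sample line is an assumption modeled on the first few values of the data and is not fetched from the URL. It shows how treating every single space as a delimiter produces empty fields, while splitting on runs of whitespace does not.

```python
import re

# Illustrative line with runs of spaces between values
# (an assumed sample, not downloaded from the real file).
line = " 0.00632  18.00   2.310  0"

# Treating each single space as a delimiter: every extra space
# produces an empty field, so the field count is inflated.
single = line.split(" ")

# Treating any run of whitespace as one delimiter: only the values remain.
multi = re.split(r"\s+", line.strip())

print(len(single), single)
print(len(multi), multi)
```

The inflated and inconsistent field counts per line are the kind of mismatch that the "Expected 30 fields in line 2, saw 31" error reports.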

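You can do it using sep=r'\s+' (a regex that matches one or more whitespace characters; the raw string avoids an invalid-escape warning in newer Python versions):

In [171]: data = pd.read_csv(url, sep=r'\s+', header = None)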

In [172]: data.shape
Out[172]: (506, 14)

In [173]: data.head()
Out[173]:
        0     1     2   3      4      5     6       7   8      9     10      11    12    13
0  0.00632  18.0  2.31   0  0.538  6.575  65.2  4.0900   1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07   0  0.469  6.421  78.9  4.9671   2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07   0  0.469  7.185  61.1  4.9671   2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18   0  0.458  6.998  45.8  6.0622   3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18   0  0.458  7.147  54.2  6.0622   3  222.0  18.7  396.90  5.33  36.2

Or using delim_whitespace=True (note that this parameter is deprecated as of pandas 2.2 in favor of sep='\s+'):

In [174]: data = pd.read_csv(url, delim_whitespace=True, header = None)

In [175]: data.shape
Out[175]: (506, 14)

In [176]: data.head()
Out[176]:
        0     1     2   3      4      5     6       7   8      9     10      11    12    13
0  0.00632  18.0  2.31   0  0.538  6.575  65.2  4.0900   1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07   0  0.469  6.421  78.9  4.9671   2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07   0  0.469  7.185  61.1  4.9671   2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18   0  0.458  6.998  45.8  6.0622   3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18   0  0.458  7.147  54.2  6.0622   3  222.0  18.7  396.90  5.33  36.2
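
Since the question mentions the 'MEDV' attribute, the following sketch shows one way to attach column names while reading. It rests on some assumptions: the column names come from the dataset's housing.names description and do not appear in this thread, read_housing is a hypothetical helper, and the two in-memory sample rows stand in for the remote file so the example runs offline.

```python
import io

import pandas as pd

# Assumed column names, taken from the dataset's housing.names description
# (they are not part of the data file itself).
COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# Two sample rows standing in for the remote file (assumed layout).
SAMPLE = (
    " 0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900"
    "   1  296.0  15.30 396.90   4.98  24.00\n"
    " 0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671"
    "   2  242.0  17.80 396.90   9.14  21.60\n"
)


def read_housing(source):
    """Hypothetical helper: read whitespace-delimited housing data
    from a URL, a path, or a file-like object."""
    return pd.read_csv(source, sep=r"\s+", header=None, names=COLUMNS)


df = read_housing(io.StringIO(SAMPLE))
print(df[["RM", "MEDV"]])
```

Passing the url from the question to read_housing instead of the StringIO object should read the full file in the same way.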