A Coday - 1 year ago 72

Python Question

I've been at this problem for a few days and tried a couple of different approaches, but I can't seem to get it quite right. This is the a simplified version of my csv data file: testme.csv

`"Name", "City", "State", "1996-04", "1996-05", "1996-06", "1996-07"`

"Bob", "Portland", "OR", 100000, 120000, 140000, 160000

"Sally", "Eugene", "OR", 50000, 75000, 80000, 90000

"Peter", "San Francisco", "CA", , , 440000, 500000

I want to extract the numbers as 2D array which I wish to do some math upon. However I've got these text fields before that I need to ignore. Also, some rows won't have numbers for all of the columns, but once the numbers start they are continuous to the right (i.e. some rows have empty items for the first several columns), and this can be different per row. Additionally, the real data set has hundreds of rows and dozens of columns

This is some of what I've tried:

`import csv`

import numpy as np

filename = "testme.csv"

ifile = open(filename, 'r')

header1 = ifile.readline()

reader = csv.reader(ifile)

A = np.array([]).reshape(0, 4)

for row in reader:

print row

print row[3:]

A = np.vstack([A, row[3:]])

print A

And then I get this:

`['Bob', ' "Portland"', ' "OR"', ' 100000', ' 120000', ' 140000', ' 160000']`

[' 100000', ' 120000', ' 140000', ' 160000']

['Sally', ' "Eugene"', ' "OR"', ' 50000', ' 75000', ' 80000', ' 90000']

[' 50000', ' 75000', ' 80000', ' 90000']

['Peter', ' "San Francisco"', ' "CA"', ' ', ' ', ' 440000', ' 500000']

[' ', ' ', ' 440000', ' 500000']

[[' 100000' ' 120000' ' 140000' ' 160000']

[' 50000' ' 75000' ' 80000' ' 90000']

[' ' ' ' ' 440000' ' 500000']]

I'm close, but the elements are all now literal strings. Is there an easier way to do this and get numbers instead or do I know go through this and convert each element to numbers? And the empty elements I could benefit from clamping them to zero.

Thank you for advice and help ahead of time!

Update (8/1/16)

I did go with the genfromtxt method as that matched what I needed a lot. Here is the result for posterity and others

`import csv`

import numpy as np

NumIgnoreFirstCols = 3

filename = "testme2.csv"

ifile = open(filename, 'r')

reader = csv.reader(ifile)

header1 = next(reader)

numcols = len(header1)

#Find number of cols for usecol in genfromtxt

print("numcols", numcols)

ifile.close()

print(range(NumIgnoreFirstCols, numcols))

aMatrix = np.genfromtxt(filename, skip_header=1, delimiter=',', usecols=range(NumIgnoreFirstCols,numcols), dtype=int)

print aMatrix

normalizedMatrix = np.where(aMatrix<0, 0, aMatrix)

print(normalizedMatrix)

minValue = np.amin(normalizedMatrix)

maxValue = np.amax(normalizedMatrix)

print (minValue, maxValue)

Thanks again for all the help

Answer Source

With your sample, `numpy`

`genfromtxt`

works:

```
In [166]: np.genfromtxt('stack38627130.csv',names=True,delimiter=',',dtype=None)
Out[166]:
array([(b'"Bob"', b' "Portland"', b' "OR"', 100000, 120000, 140000, 160000),
(b'"Sally"', b' "Eugene"', b' "OR"', 50000, 75000, 80000, 90000),
(b'"Peter"', b' "San Francisco"', b' "CA"', -1, -1, 440000, 500000)],
dtype=[('Name', 'S7'), ('City', 'S16'), ('State', 'S5'), ('199604', '<i4'), ('199605', '<i4'), ('199606', '<i4'), ('199607', '<i4')])
```

This is a 1d structured array; you access columns by field name (here derived from your header line)

```
In [167]: data=_
In [168]: data['199604']
Out[168]: array([100000, 50000, -1])
In [169]: data['199607']
Out[169]: array([160000, 90000, 500000])
```

The missing fields are filled with `-1`

. I think that can be changed.

There are other parameters for setting field names if you don't like the deduced ones.

Read can be restricted to the numeric columns; different fill depending on the `dtype`

.

```
In [171]: np.genfromtxt('stack38627130.csv',skip_header=1,delimiter=',',usecols=
...: [3,4,5,6])
Out[171]:
array([[ 100000., 120000., 140000., 160000.],
[ 50000., 75000., 80000., 90000.],
[ nan, nan, 440000., 500000.]])
In [172]: np.genfromtxt('stack38627130.csv',skip_header=1,delimiter=',',usecols=
...: [3,4,5,6],dtype=int)
Out[172]:
array([[100000, 120000, 140000, 160000],
[ 50000, 75000, 80000, 90000],
[ -1, -1, 440000, 500000]])
```

Now we get a 2d array.

I believe `pandas`

handles missing fields better, but as long as those fields are marked with the delimiter, `genfromtxt`

shouldn't have problems.

`genfromtxt`

roughly does:

```
result = []
for row in reader:
data = row[3:]
data = [float(x) for x in data]
result.append(data)
result = np.array(result)
```

`np.array`

can do the float conversion if all the strings convert properly; it does not handle the empty ones or `nan`

. Generally collecting a list of values is better than repeated `vstack`

(or concatenates).