A Coday - 4 months ago
Python Question

Trying to extract portion of csv to numpy array

I've been at this problem for a few days and tried a couple of different approaches, but I can't seem to get it quite right. This is a simplified version of my csv data file, testme.csv:

"Name", "City", "State", "1996-04", "1996-05", "1996-06", "1996-07"
"Bob", "Portland", "OR", 100000, 120000, 140000, 160000
"Sally", "Eugene", "OR", 50000, 75000, 80000, 90000
"Peter", "San Francisco", "CA", , , 440000, 500000


I want to extract the numbers as a 2D array so I can do some math on them, but I need to ignore the text fields that come before them. Also, some rows won't have numbers for all of the columns; once the numbers start they are continuous to the right (i.e. some rows have empty items for the first several numeric columns), and this can differ per row. Additionally, the real data set has hundreds of rows and dozens of columns.

This is some of what I've tried:

import csv
import numpy as np

filename = "testme.csv"

ifile = open(filename, 'r')
header1 = ifile.readline()

reader = csv.reader(ifile)
A = np.array([]).reshape(0, 4)

for row in reader:
    print row
    print row[3:]
    A = np.vstack([A, row[3:]])

print A


And then I get this:

['Bob', ' "Portland"', ' "OR"', ' 100000', ' 120000', ' 140000', ' 160000']
[' 100000', ' 120000', ' 140000', ' 160000']
['Sally', ' "Eugene"', ' "OR"', ' 50000', ' 75000', ' 80000', ' 90000']
[' 50000', ' 75000', ' 80000', ' 90000']
['Peter', ' "San Francisco"', ' "CA"', ' ', ' ', ' 440000', ' 500000']
[' ', ' ', ' 440000', ' 500000']
[[' 100000' ' 120000' ' 140000' ' 160000']
 [' 50000' ' 75000' ' 80000' ' 90000']
 [' ' ' ' ' 440000' ' 500000']]


I'm close, but the elements are all literal strings. Is there an easier way to do this and get numbers instead, or do I now go through and convert each element to a number? I would also benefit from clamping the empty elements to zero.

Thank you for advice and help ahead of time!

Aaron



Update (8/1/16)
I went with the genfromtxt method, as it matched what I needed closely. Here is the result for posterity and for others:

import csv
import numpy as np

NumIgnoreFirstCols = 3
filename = "testme2.csv"

# Read the header with csv to find the number of columns for usecols in genfromtxt
ifile = open(filename, 'r')
reader = csv.reader(ifile)
header1 = next(reader)
numcols = len(header1)
print("numcols", numcols)
ifile.close()

print(range(NumIgnoreFirstCols, numcols))

aMatrix = np.genfromtxt(filename, skip_header=1, delimiter=',',
                        usecols=range(NumIgnoreFirstCols, numcols), dtype=int)
print(aMatrix)

# Clamp the -1 fill values (missing fields) to zero
normalizedMatrix = np.where(aMatrix < 0, 0, aMatrix)
print(normalizedMatrix)

minValue = np.amin(normalizedMatrix)
maxValue = np.amax(normalizedMatrix)
print(minValue, maxValue)


Thanks again for all the help

Answer

With your sample, numpy genfromtxt works:

In [166]: np.genfromtxt('stack38627130.csv',names=True,delimiter=',',dtype=None)

Out[166]: 
array([(b'"Bob"', b' "Portland"', b' "OR"', 100000, 120000, 140000, 160000),
       (b'"Sally"', b' "Eugene"', b' "OR"', 50000, 75000, 80000, 90000),
       (b'"Peter"', b' "San Francisco"', b' "CA"', -1, -1, 440000, 500000)], 
      dtype=[('Name', 'S7'), ('City', 'S16'), ('State', 'S5'), ('199604', '<i4'), ('199605', '<i4'), ('199606', '<i4'), ('199607', '<i4')])

This is a 1d structured array; you access columns by field name (here derived from your header line):

In [167]: data=_
In [168]: data['199604']
Out[168]: array([100000,  50000,     -1])
In [169]: data['199607']
Out[169]: array([160000,  90000, 500000])

The missing fields are filled with -1; that fill can be changed with the filling_values parameter.
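For example, something along these lines should fill them with 0 instead (a sketch I haven't run against this exact file):

# sketch: fill missing fields with 0 rather than the default -1
data0 = np.genfromtxt('stack38627130.csv', names=True, delimiter=',',
                      dtype=None, filling_values=0)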

There are other parameters for setting field names if you don't like the deduced ones.
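For instance, names can take an explicit list (a sketch; these field names are just made up):

# sketch: skip the file's own header and supply field names explicitly
cols = ['name', 'city', 'state', 'm04', 'm05', 'm06', 'm07']
data = np.genfromtxt('stack38627130.csv', skip_header=1, delimiter=',',
                     dtype=None, names=cols)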

The read can also be restricted to the numeric columns; the fill differs depending on the dtype (nan for float, -1 for int):

In [171]: np.genfromtxt('stack38627130.csv',skip_header=1,delimiter=',',usecols=
     ...: [3,4,5,6])
Out[171]: 
array([[ 100000.,  120000.,  140000.,  160000.],
       [  50000.,   75000.,   80000.,   90000.],
       [     nan,      nan,  440000.,  500000.]])
In [172]: np.genfromtxt('stack38627130.csv',skip_header=1,delimiter=',',usecols=
     ...: [3,4,5,6],dtype=int)
Out[172]: 
array([[100000, 120000, 140000, 160000],
       [ 50000,  75000,  80000,  90000],
       [    -1,     -1, 440000, 500000]])

Now we get a 2d array.
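Since you want the empty fields clamped to zero, the fill values can be replaced after the read (a quick sketch):

# sketch: turn the -1 fills (int case) into zeros
intMatrix = np.genfromtxt('stack38627130.csv', skip_header=1, delimiter=',',
                          usecols=[3, 4, 5, 6], dtype=int)
clamped = np.where(intMatrix < 0, 0, intMatrix)
# float case: np.nan_to_num(floatMatrix) replaces the nan fills with 0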

I believe pandas handles missing fields better, but as long as those fields are marked with the delimiter, genfromtxt shouldn't have problems.

genfromtxt roughly does:

# rough sketch: reader is a csv.reader over the data lines (header already skipped)
result = []
for row in reader:
    data = row[3:]                  # drop the leading text columns
    # a fill value stands in for empty fields; a plain float() would choke on them
    data = [float(x) if x.strip() else np.nan for x in data]
    result.append(data)
result = np.array(result)

np.array can do the float conversion itself if all the strings convert properly, but it does not handle the empty ones or nan. Generally, collecting values in a list and converting once at the end is better than repeated vstack (or concatenate) calls.
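For instance, with no empty fields the string values convert directly (a small sketch):

# sketch: np.array converts clean numeric strings in one step
rows = [[' 100000', ' 120000'], [' 50000', ' 75000']]
np.array(rows, dtype=float)   # an empty string '' here would raise a ValueError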