Long Thai - 1 year ago 105

Python Question

I'm writing a method to create an array from a data file. The method looks like this:

    import numpy

    def readDataFile(fileName):
        try:
            with open(fileName, 'r') as inputs:
                data = None
                for line in inputs:
                    line = line.strip()
                    items = line.split('\t')
                    if data == None:
                        data = numpy.array(items[0:len(items)])
                    else:
                        data = numpy.vstack((data, items[0:len(items)]))
            return numpy.array(data)
        except IOError as ioerr:
            print 'IOError: ', ioerr
            return None

My data file contains lines of numbers, with the numbers on each line separated by tabs, e.g.:

    1 2 3
    4 5 6
    7 8 9

And I expect to receive an array as follows:

    array([[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]])

However, the result contains a `dtype`:

    array([[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]], dtype='|S9')

Because of this, I cannot perform some operations on the result. For example, if I try to find the max value for each line using `result.max(0)`, I get:

    TypeError: cannot perform reduce with flexible type.

So, can anyone tell me what's wrong with my code and how to fix it? Thanks a lot.

Answer Source

The easiest fix is to use numpy's loadtxt:

```
data = numpy.loadtxt(fileName, dtype='float')
```
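As a quick check of this approach, here is a minimal sketch that writes the sample data from the question to a temporary file and loads it back. The temporary file and the variable names are assumptions made for illustration. `delimiter='\t'` is passed explicitly here, although `loadtxt` splits on any whitespace by default.

```python
import os
import tempfile

import numpy

# Write the question's sample data to a temporary tab-separated file
# (the file exists only for this illustration).
with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False) as tmp:
    tmp.write('1\t2\t3\n4\t5\t6\n7\t8\t9\n')
    file_name = tmp.name

try:
    data = numpy.loadtxt(file_name, dtype='float', delimiter='\t')
finally:
    os.remove(file_name)

# The array is numeric, so reductions such as max() work.
column_max = data.max(0)  # maximum of each column
row_max = data.max(1)     # maximum of each line (row)
```

Note that `max(0)` reduces along the first axis and so returns one maximum per column; `max(1)` returns one maximum per line.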

Just FYI, using `numpy.vstack` inside a loop is a bad idea. If you decide not to use `loadtxt`, you can replace your loop with the following, which fixes the dtype issue and eliminates the `numpy.vstack` calls.

```
# Split each line into fields, then convert everything to float in one step
data = [row.strip().split('\t') for row in inputs]
data = numpy.array(data, dtype='float')
```
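Put together, the reading function from the question might look roughly like the sketch below. The name `read_data_file` and the choice to skip blank lines are assumptions, not part of the original code, and `print` is written as a function so the sketch runs on Python 3.

```python
import numpy


def read_data_file(file_name):
    """Return a 2-D float array from a tab-separated file, or None on I/O error."""
    try:
        with open(file_name, 'r') as inputs:
            # Strip the trailing newline before splitting; skip blank lines.
            rows = [line.strip().split('\t') for line in inputs if line.strip()]
    except IOError as ioerr:
        print('IOError:', ioerr)
        return None
    # Convert once, at the end; dtype='float' parses the strings as numbers.
    return numpy.array(rows, dtype='float')
```

Because the conversion to `float` happens in a single `numpy.array` call, the result has a numeric dtype and reductions such as `result.max(0)` no longer raise the `TypeError` from the question.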

Update

Every time `vstack` is called, it allocates a new array and copies the contents of the old arrays into it. Each copy is roughly O(n), where n is the size of the array, so a loop that runs n times costs O(n**2) overall, which gets slow quickly. If you know the final size of the array ahead of time, it's better to create the array outside the loop and fill it in place. If you don't know the final size, append the rows to a list inside the loop and call `vstack` once at the end. For example:

```
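import numpy as np

# Final size known: preallocate once and fill the rows in place
myArray = np.zeros((10, 3))
for i in range(len(myArray)):
    myArray[i] = [i, i+1, i+2]

# or, final size unknown: collect rows in a list and stack once at the end
myArray = []
for i in range(10):
    myArray.append(np.array([i, i+1, i+2]))
myArray = np.vstack(myArray)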
```
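To see the difference for yourself, a rough timing sketch along these lines may help. The function names `stack_in_loop` and `stack_once` and the size `n = 2000` are arbitrary choices for illustration; both functions build the same array, and only the way it is assembled differs.

```python
import timeit

import numpy as np


def stack_in_loop(n):
    """Grow the array with vstack on every iteration (copies each time)."""
    data = np.array([[0, 1, 2]])
    for i in range(1, n):
        data = np.vstack((data, [i, i + 1, i + 2]))
    return data


def stack_once(n):
    """Collect rows in a list and call vstack a single time."""
    rows = [[i, i + 1, i + 2] for i in range(n)]
    return np.vstack(rows)


if __name__ == '__main__':
    n = 2000  # arbitrary size chosen for illustration
    for func in (stack_in_loop, stack_once):
        seconds = timeit.timeit(lambda: func(n), number=3)
        print(func.__name__, round(seconds, 4))
```

Exact timings depend on the machine and the NumPy version, but the gap between the two should widen as `n` grows, since the first version copies the whole array on every iteration.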