cerremony - 8 months ago 41

Python Question

I have the following code to gather arrays (stored as pickled documents in a Mongo collection) and stack them into a 2D numpy matrix:

```
import pickle

import numpy as np
from pymongo import MongoClient

db = MongoClient()['db']
total = db.command('collStats', 'collection')['count']
collection = db['collection']
X, Y = np.array([]), np.array([])
# Sample the whole collection in random order
pipe = [{'$sample': {'size': total}}]
cursor = collection.aggregate(pipe, allowDiskUse=True)
for i in range(1000):
    try:
        xy = pickle.loads(cursor.next()['array'])
        if not np.size(X):
            X, Y = xy[0:-1], xy[-1]
        else:
            # Stack the new row onto everything read so far
            X, Y = np.vstack((X, xy[0:-1])), np.vstack((Y, xy[-1]))
    except StopIteration:
        break
```

This is extremely slow. How can I optimize it, for example by reading multiple documents at once from Mongo? Any other optimization tips would be useful as well.

Answer

Best way I found to improve performance:

- Convert each item to a list
- Append that list to a list of lists
- Convert to a 2D numpy array

So, for my case, like this:

```
# Seed the lists with the first document (cursor is the one from the question)
xy = pickle.loads(cursor.next()['array']).tolist()
X, Y = [xy[0:-1]], [[xy[-1]]]
for i in range(1000):
    try:
        xy = pickle.loads(cursor.next()['array']).tolist()
    except StopIteration:
        break
    X.append(xy[0:-1])
    Y.append([xy[-1]])
# Convert to numpy once, after the loop
X, Y = np.array(X), np.array(Y)
```

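This improved my performance by a factor of 50, presumably because `np.vstack` copies the whole accumulated array on every iteration, while appending to a list does not.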
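For anyone who wants to try the pattern without a Mongo instance, here is a minimal self-contained sketch. The helper name `stack_pickled` and the synthetic `fake_docs` list are my own placeholders, not part of the original code; the only assumption is that each document has an `'array'` field holding a pickled 1D numpy array, as in the question. It keeps the numpy slices instead of calling `.tolist()`, which is a small variation on the steps above.

```python
import pickle
from itertools import islice

import numpy as np


def stack_pickled(docs, limit=1000):
    """Unpickle up to `limit` documents and split each into features and label."""
    rows, labels = [], []
    for doc in islice(docs, limit):
        xy = pickle.loads(doc['array'])
        rows.append(xy[:-1])    # everything except the last element
        labels.append(xy[-1:])  # last element, kept as a length-1 slice
    # Build each array once at the end instead of re-stacking per document
    return np.array(rows), np.array(labels)


# Synthetic stand-in for the Mongo cursor: 5 documents of 4 values each
fake_docs = [{'array': pickle.dumps(np.arange(4.0) + i)} for i in range(5)]
X, Y = stack_pickled(fake_docs)
print(X.shape, Y.shape)  # (5, 3) (5, 1)
```

Since `islice` works on any iterable, the same function should accept the `cursor` from the question in place of `fake_docs`, and it removes the need for the `try`/`except StopIteration` around `cursor.next()`.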