
Retrieving multiple mongo documents from cursor.next() for better performance

I have the following code to gather arrays (each stored as a pickled blob in a document of a Mongo collection) and stack them into a 2D numpy matrix:

import pickle

import numpy as np
from pymongo import MongoClient

db = MongoClient()['db']
total = db.command('collStats', 'collection')['count']
collection = db['collection']

X, Y = np.array([]), np.array([])
pipe = [{'$sample': {'size': total}}]
cursor = collection.aggregate(pipe, allowDiskUse=True)

for i in range(1000):
    try:
        xy = pickle.loads(cursor.next()['array'])
        if not np.size(X):
            X, Y = xy[0:-1], xy[-1]
        else:
            X, Y = np.vstack((X, xy[0:-1])), np.vstack((Y, xy[-1]))
    except StopIteration:
        break


This is extremely slow. How can I optimize it, for example by reading multiple documents at once from Mongo? Any other optimization tips would be welcome as well.

Answer

The best way I found to improve performance:

  1. Convert each item to a list
  2. Append that list to a list of lists
  3. Convert to a 2D numpy array

So, for my case, like this:

# Seed X and Y with the first document so the lists are never empty
xy = pickle.loads(cursor.next()['array']).tolist()
X, Y = [xy[0:-1]], [[xy[-1]]]

for i in range(1000):
    try:
        xy = pickle.loads(cursor.next()['array']).tolist()
    except StopIteration:
        break
    X.append(xy[0:-1])
    Y.append([xy[-1]])

X, Y = np.array(X), np.array(Y)

This improved my performance by a factor of 50. Repeated np.vstack calls copy the entire accumulated array on every iteration, so the cost grows quadratically with the number of documents; appending to plain Python lists and converting to a numpy array once at the end keeps the work linear.
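
Regarding the other part of the question (reading multiple documents at once from Mongo): iterating a PyMongo cursor already fetches documents from the server in batches, and the batch size can be raised to reduce network round trips. Below is a minimal sketch of the same list-accumulation idea written as a single pass over the cursor; the batchSize value and the 1000-document cap are illustrative, and the db/collection names are simply taken from the question.

import itertools
import pickle

import numpy as np
from pymongo import MongoClient

db = MongoClient()['db']
collection = db['collection']
total = db.command('collStats', 'collection')['count']

# $sample the whole collection, as in the question; batchSize (illustrative)
# controls how many documents the server returns per network round trip.
pipe = [{'$sample': {'size': total}}]
cursor = collection.aggregate(pipe, allowDiskUse=True, batchSize=1000)

X_rows, Y_rows = [], []
# Iterating the cursor pulls documents in batches behind the scenes,
# so there is no need for cursor.next() and StopIteration handling.
for doc in itertools.islice(cursor, 1000):
    xy = pickle.loads(doc['array']).tolist()
    X_rows.append(xy[:-1])
    Y_rows.append([xy[-1]])

# Single conversion to 2D numpy arrays at the end.
X, Y = np.array(X_rows), np.array(Y_rows)

Dropping the cursor.next()/StopIteration pattern in favour of plain iteration also avoids one exception-handled call per document.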