cerremony cerremony - 1 year ago 78
Python Question

Retrieving multiple mongo documents from cursor.next() for better performance

I have the following code to gather arrays (stored as pickled documents in a Mongo collection) and stack them into a 2D numpy matrix:

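import pickle
import numpy as np
from pymongo import MongoClient

db = MongoClient()['db']
total = db.command('collStats', 'collection')['count']
collection = db['collection']

X, Y = np.array([]), np.array([])
# $sample with size == total returns the whole collection in random order
pipe = [{'$sample': {'size': total}}]
cursor = collection.aggregate(pipe, allowDiskUse=True)

for i in range(1000):
    try:
        xy = pickle.loads(cursor.next()['array'])
        if not np.size(X):
            # first document: initialise X (features) and Y (last element)
            X, Y = xy[0:-1], xy[-1]
        else:
            # every vstack call allocates a new array and copies the old one
            X, Y = np.vstack((X, xy[0:-1])), np.vstack((Y, xy[-1]))
    except StopIteration:
        # cursor exhausted before 1000 documents were read
        break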

This is extremely slow. How can I optimize it, for example by reading multiple documents at once from Mongo? Any other optimization tips would be useful as well.

Answer Source

Best way I found to improve performance:

  1. Convert each item to a list
  2. Append that list to a list of lists
  3. Convert to a 2D numpy array

So, for my case, like this:

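# first document seeds the two lists of lists
xy = pickle.loads(cursor.next()['array']).tolist()
X, Y = [xy[0:-1]], [[xy[-1]]]

for i in range(1000):
    try:
        xy = pickle.loads(cursor.next()['array']).tolist()
        X.append(xy[0:-1])
        Y.append([xy[-1]])
    except StopIteration:
        # cursor exhausted
        break

# convert once at the end instead of stacking on every iteration
X, Y = np.array(X), np.array(Y)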

This improved my performance by a factor of 50.
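
The same pattern can be tried without a database. The sketch below is only an illustration and not part of the code above: `fake_cursor` and `load_xy` are hypothetical names, and a generator of dicts stands in for the Mongo cursor. It assumes every pickled array is one-dimensional, has the same length, and stores the target as its last element.

```python
import pickle
from itertools import islice

import numpy as np


def fake_cursor(n_docs, n_values, seed=0):
    """Hypothetical stand-in for a Mongo cursor: yields dicts holding a pickled 1D array."""
    rng = np.random.default_rng(seed)
    for _ in range(n_docs):
        yield {'array': pickle.dumps(rng.random(n_values))}


def load_xy(cursor, limit):
    """Read up to `limit` documents and split each array into features and target."""
    X, Y = [], []
    # islice stops at `limit` or when the cursor runs out, so no try/except is needed.
    for doc in islice(cursor, limit):
        xy = pickle.loads(doc['array']).tolist()
        X.append(xy[:-1])    # all values but the last are features
        Y.append([xy[-1]])   # the last value is the target, kept as a one-element row
    # One conversion at the end avoids re-copying the growing matrix on every step.
    return np.array(X), np.array(Y)


X, Y = load_xy(fake_cursor(n_docs=200, n_values=5), limit=1000)
print(X.shape, Y.shape)  # (200, 4) (200, 1)
```

Since `load_xy` only iterates over its argument, it should also accept the cursor returned by `collection.aggregate(...)`, though that has not been checked here against a live collection.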
