Leandro Santos - 2 months ago
Python Question

Python2.7 - How to read a specific field in a MongoDB collection

I have some data stored in my collection, here is an example obtained through the shell. (Please ignore the language of the text.)

"_id" : ObjectId("581ab1811d41c814004f4d16"),
"created_time" : "2016-11-02T19:48:41+0000",
"message" : "Acabaram de assaltar o carro de um colega nosso em Itabaiana\nno zangue, ele é de Aracaju e foi passear em Itabaiana.Gol G6 prata 2013 placa OER-5474.\n",
"id" : "400728540046889_1107668596019543"

In this case I need to get only the text contained in the "message" field, since I need to run several operations on those texts. The process would be: read the text of the "message" field from every document in my collection, run the operations, and then write each result back to its original document, alongside its other attributes. My code so far:

# -*- coding: utf-8 -*-
import preprocessing
import pymongo
import json
from pymongo import MongoClient
from unicodedata import normalize
from preprocessing import PreProcessing

if __name__ == '__main__':
    try:
        client = MongoClient('localhost:27017')
        collection = client.facebook.dadosColetados1
        dbmessage = collection.find()
        for text in dbmessage:
            print text
    except Exception, e:
        print str(e)

I have not managed to pass the "message" attribute to find(), and when I run the code as shown, the text comes back with escaped characters instead of readable UTF-8, like:

e7\xf5es institucionais para uma seguran\xe7a p\xfablica mais integrada em todo o Estado.\n\nO secret\xe1rio destacou a import\xe2ncia da manuten\xe7\xe3o do di\xe1logo entres as institui\xe7\xf5es.

What would be the best approach to this situation?


You can query the database with a projection so that only the value stored under the "message" key is returned, then collect the messages in a list.

import pymongo

client = pymongo.MongoClient('localhost:27017')
db = client['db_name']

query = {'message': {'$exists': 1}}
projection = {'_id': 0, 'message': 1}

db_messages = db['collection_name'].find(query, projection)

message_list = []
for message in db_messages:
    # the projection leaves only the 'message' key in each document
    for key, value in message.iteritems():
        message_list.append(value)

Now "message_list" will contain all of the messages from your collection and you can perform any operation on your data:

message_list = [u'message1', u'message2', u'message3', etc.]
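
Because the projection leaves a single key in each document, the inner loop can also be replaced by a direct lookup. The sketch below is only an illustration: it uses a plain list of dicts as a stand-in for the cursor (an assumption, so that it runs without a database), and extract_messages is a hypothetical helper name. On the escaped output from the question: in Python 2, printing a whole dict shows the repr of each unicode value (\xe7 and so on), while printing the string itself normally shows the accented characters, provided the terminal encoding supports them.

```python
# -*- coding: utf-8 -*-


def extract_messages(documents):
    """Return the 'message' value of every document that has one."""
    return [doc['message'] for doc in documents if 'message' in doc]


# Stand-in for db['collection_name'].find(query, projection);
# a real cursor yields one dict per document in the same way.
sample_documents = [
    {'message': u'seguran\xe7a p\xfablica'},
    {'message': u'manuten\xe7\xe3o do di\xe1logo'},
]

# Each element is a unicode string, ready for further processing.
message_list = extract_messages(sample_documents)
```

With a real cursor the call would be extract_messages(db_messages).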

EDIT: If you want to keep the _id associated with its message, you can do the following (probably not the best way, but it works)...

In the projection, set the _id key to 1 ('_id': 1) and change the code above to:

message_list, id_list = [], []
for document in db_messages:
    for key, value in document.iteritems():
        if key == 'message':
            message_list.append(value)
        elif key == '_id':
            id_list.append(value)

pair_up = zip(id_list, message_list)

# let's say I want to keep only the last 2 letters/numbers of each message
updated_pair_up = []
for i in pair_up:
    updated_pair_up.append((i[0], i[1][-2:]))

So pair_up will look something like this:

pair_up = [(id1, 'message1'), (id2, 'message2'), etc]

And updated_pair_up will look like this:

updated_pair_up = [(id1, 'e1'), (id2, 'e2'), etc]
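
The question also asks how to put each processed text back in its document. One possible way, sketched here under assumptions, is to call update_one with a $set on the "message" field for every (_id, message) pair. write_back is a hypothetical helper name, and InMemoryCollection is a made-up stand-in that imitates only the one method used, so the example runs without a MongoDB server.

```python
def write_back(collection, pairs):
    """Store each updated message under the _id it was read with.

    `collection` is assumed to behave like a pymongo Collection, i.e. to
    offer update_one(filter, update). Returns the number of pairs processed.
    """
    count = 0
    for doc_id, new_message in pairs:
        # $set replaces only the 'message' field; other fields are untouched.
        collection.update_one({'_id': doc_id},
                              {'$set': {'message': new_message}})
        count += 1
    return count


class InMemoryCollection(object):
    """Hypothetical stand-in for a collection, for demonstration only."""

    def __init__(self, documents):
        self.documents = dict((doc['_id'], doc) for doc in documents)

    def update_one(self, filter_doc, update_doc):
        doc = self.documents.get(filter_doc['_id'])
        if doc is not None:
            doc.update(update_doc['$set'])


demo = InMemoryCollection([
    {'_id': 1, 'message': u'message1', 'id': u'a_1'},
    {'_id': 2, 'message': u'message2', 'id': u'a_2'},
])
processed = write_back(demo, [(1, u'e1'), (2, u'e2')])
```

With a real connection the call would be write_back(db['collection_name'], updated_pair_up), assuming a pymongo version that provides update_one (3.0 or later).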