user7070824 user7070824 - 1 month ago 8
Python Question

export data to csv from mongodb by using python

I am having problems with export to csv by using python script. some array data need to be exported to CSV from Mongodb, but the following script did not export properly because three subfield data are dumped into a column. I want to separate three fields(order, text,answerid) under answers field into three different columns in a CSV.

the sample of Mongodb:

"answers": [
{
"order": 0,
"text": {
"en": "Yes"
},
"answerId": "527d65de7563dd0fb98fa28c"
},
{
"order": 1,
"text": {
"en": "No"
},
"answerId": "527d65de7563dd0fb98fa28b"
}
]


the python script:

import csv
cursor = db.questions.find ({},{'_id':1, 'answers.order':1, 'answers.text':1, 'answers.answerId':1})
cursor = list(cursor)
with open('answer_2.csv', 'w') as outfile:

fields = ['_id','answers.order', 'answers.text', 'answers.answerid']
write = csv.DictWriter(outfile, fieldnames=fields)
write.writeheader()
for x in cursor:
for y, v in x.iteritems():
if y == 'answers'
print (y, v)
write.writerow(v)
write.writerow(x)

Answer

So... The problem is that the csv writer doesn't understand the concept of "subdictionaries" as mongo returns it.

If I understood correctly, when you query Mongo, you get a dictionary like this:

{
   "_id": "a hex ID that correspond with the record that contains several answers",
   "answers": [ ... a list with a bunch of dicts in it... ]
}

So when the csv.DictWriter tries to write that, it is only writing one dictionary (the topmost). It doesn't know (or cares) that answers is a list that contains dictionaries whose values need to be written in columns as well (accessing fields in dictionaries using the dot notation such as answers.order is only understood by Mongo, not by the csv writer)

What I understand you should do is "walk" the list of answers and create one dictionary out of each record (each dictionary) in that list. Once you have a list of "flattened" dictionaries you can pass those and write them in your csv file:

cursor = client.stack_overflow.stack_039.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})

# Step 1: Create the list of dictionaries (one dictionary per entry in the `answers` list)
flattened_records = []
for answers_record in cursor:
    answers_record_id = answers_record['_id']
    for answer_record in answers_record['answers']:
        flattened_record = {
            '_id': answers_record_id,
            'answers.order': answer_record['order'],
            'answers.text': answer_record['text'],
            'answers.answerId': answer_record['answerId']
        }
        flattened_records.append(flattened_record)

# Step 2: Iterate through the list of flattened records and write them to the csv file
with open('stack_039.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for flattened_record in flattened_records:
        write.writerow(flattened_record)

Whatch for the use of plurals. answers_record is different than answer_record

That creates a file like this:

$ cat ./stack_039.csv
_id,answers.order,answers.text,answers.answerId
580f9aa82de54705a2520833,0,{u'en': u'Yes'},527d65de7563dd0fb98fa28c
580f9aa82de54705a2520833,1,{u'en': u'No'},527d65de7563dd0fb98fa28b

EDIT:

Your query (the one that makes cursor = db.questions.find ({},{'_id':1, 'answers.order':1, 'answers.text':1, 'answers.answerId':1})) will return all the entries in the questions collection. If this collection is very large, you might want to use the cursor as an iterator.

As you might have already realized, the first for loop in my code above puts all the records in a list (the flattened_records list). You can do lazy loading by iterating through the cursor (instead of loading all the items in memory, fetch one, do something with it, get the next, do something with it...).

It's slightly slower, but more memory efficient.

cursor = client.stack_overflow.stack_039.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})

with open('stack_039.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for answers_record in cursor:  # Here we are using 'cursor' as an iterator
        answers_record_id = answers_record['_id']
        for answer_record in answers_record['answers']:
            flattened_record = {
                '_id': answers_record_id,
                'answers.order': answer_record['order'],
                'answers.text': answer_record['text'],
                'answers.answerId': answer_record['answerId']
            }
            write.writerow(flattened_record)

It will produce the same .csv file as shown above.