user1438162 - 3 months ago
Python Question

Quickly count the number of objects in bson document

I'd like to count the number of documents stored in a MongoDB BSON file without having to import the file into the database via mongorestore.

The best I've been able to come up with in python is

import bson

with open('./archive.bson', 'rb') as bson_doc:
    total = sum(1 for _ in bson.decode_file_iter(bson_doc))
print(total)


This works in theory, but it is slow in practice when the documents in the BSON file are large. Does anyone have a quicker way to count the documents in a BSON file without doing a full decode?

I am currently using Python 2.7 and pymongo.
https://api.mongodb.com/python/current/api/bson/index.html

Answer

I don't have a file at hand to try, but I believe there's a way, if you parse the data by hand.

The source for bson.decode_file_iter (sans the docstring) goes like this:

_UNPACK_INT = struct.Struct("<i").unpack

def decode_file_iter(file_obj, codec_options=DEFAULT_CODEC_OPTIONS):
    while True:
        # Read size of next object.
        size_data = file_obj.read(4)
        if len(size_data) == 0:
            break  # Finished with file normaly.
        elif len(size_data) != 4:
            raise InvalidBSON("cut off in middle of objsize")
        obj_size = _UNPACK_INT(size_data)[0] - 4
        elements = size_data + file_obj.read(obj_size)
        yield _bson_to_dict(elements, codec_options)

I presume the time-consuming operation is the _bson_to_dict call, and you don't need it just to count documents.

So all you need to do is read the file: get the int32 value that holds the next document's size (the size includes those 4 bytes), skip the rest of that document, and count how many documents you encounter along the way.
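To make the framing concrete, here is a small sketch of reading the length prefix. The two byte strings are hand-built for illustration (an empty document and {"a": 1} with an int32 value), and read_size_prefix is a hypothetical helper name, not part of pymongo.

```python
import struct

# Hand-built BSON documents (illustrative data, not from a real dump).
EMPTY_DOC = b"\x05\x00\x00\x00\x00"  # {}
INT_DOC = b"\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"  # {"a": 1}

def read_size_prefix(buf, offset=0):
    """Return the total size of the document starting at offset.

    The prefix is a little-endian int32 that counts itself, so the next
    document starts at offset + size.
    """
    return struct.unpack_from("<i", buf, offset)[0]

stream = EMPTY_DOC + INT_DOC
first = read_size_prefix(stream)          # size of the first document
second = read_size_prefix(stream, first)  # size of the document after it
print(first, second)
```

The two sizes add up to the length of the whole stream, which is why hopping from prefix to prefix is enough to count documents.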

So I believe this function should do the trick:

import struct
import os
from bson.errors import InvalidBSON

# Build the unpacker once instead of on every loop iteration.
_UNPACK_INT = struct.Struct("<i").unpack

def count_file_documents(file_obj):
    """Count the documents in a BSON file opened in binary mode."""
    cnt = 0
    while True:
        # Read the int32 size prefix of the next document.
        size_data = file_obj.read(4)
        if len(size_data) == 0:
            break  # Finished with file normally.
        elif len(size_data) != 4:
            raise InvalidBSON("cut off in middle of objsize")
        # The size includes the 4 prefix bytes that were just read.
        obj_size = _UNPACK_INT(size_data)[0] - 4
        # Skip the rest of the document without decoding it.
        file_obj.seek(obj_size, os.SEEK_CUR)
        cnt += 1
    return cnt

(I haven't tested the code, though. Don't have MongoDB at hand.)
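For a quick sanity check without MongoDB or pymongo, the same idea can be tried on in-memory data. The following is only a sketch under some assumptions: count_bson_documents is a hypothetical name, it raises ValueError instead of InvalidBSON so that only the standard library is needed, and the test data is two hand-built documents ({} and {"a": 1}). It also adds a bounds check, because seeking past the end of a file does not fail on its own, so a truncated last document would otherwise still be counted.

```python
import io
import os
import struct

def count_bson_documents(file_obj):
    """Count length-prefixed BSON documents in a seekable binary file."""
    # Find the end of the file so a truncated last document can be detected.
    start = file_obj.tell()
    file_obj.seek(0, os.SEEK_END)
    end = file_obj.tell()
    file_obj.seek(start)

    count = 0
    pos = start
    while pos < end:
        size_data = file_obj.read(4)
        if len(size_data) != 4:
            raise ValueError("cut off in middle of size prefix")
        size = struct.unpack("<i", size_data)[0]
        # The smallest valid document is 5 bytes: the prefix plus a 0x00 terminator.
        if size < 5 or pos + size > end:
            raise ValueError("bad or truncated document at offset %d" % pos)
        # Jump straight to the next prefix without decoding anything.
        pos += size
        file_obj.seek(pos)
        count += 1
    return count

# Two hand-built documents: {} and {"a": 1} (int32 value).
data = b"\x05\x00\x00\x00\x00" + b"\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"
print(count_bson_documents(io.BytesIO(data)))
```

With a real dump, the call would be count_bson_documents(open('./archive.bson', 'rb')), reusing the path from the question.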