I'd like to calculate the number of documents stored in a MongoDB BSON file without having to import the file into the database via mongorestore.
The best I've been able to come up with in Python is:
import bson
with open('./archive.bson', 'rb') as bson_doc:
    total = sum(1 for _ in bson.decode_file_iter(bson_doc))
I don't have a file at hand to try this on, but I believe there's a way if you parse the data by hand.
The source for bson.decode_file_iter (sans the docstring) goes like this:
_UNPACK_INT = struct.Struct("<i").unpack

def decode_file_iter(file_obj, codec_options=DEFAULT_CODEC_OPTIONS):
    while True:
        # Read size of next object.
        size_data = file_obj.read(4)
        if len(size_data) == 0:
            break  # Finished with file normally.
        elif len(size_data) != 4:
            raise InvalidBSON("cut off in middle of objsize")
        obj_size = _UNPACK_INT(size_data)[0] - 4
        elements = size_data + file_obj.read(obj_size)
        yield _bson_to_dict(elements, codec_options)
I presume the time-consuming operation is the _bson_to_dict call, and you don't need it just to count documents.
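So all you need to do is read the file: get the int32 value that holds the next document's size, then skip over that document without decoding it. Count how many documents you encounter while doing this. Note that the size value includes the four bytes of the size field itself, so after reading it you skip size - 4 bytes.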
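To make the length prefix concrete, here is a small sketch using only the standard library. The two byte strings are hand-built BSON documents (an empty document and `{"a": 1}`), assumed for illustration rather than taken from a real dump. Each one starts with a little-endian int32 giving the document's total size.

```python
import struct

# Hand-built BSON documents (illustrative, not produced by mongodump).
empty_doc = b"\x05\x00\x00\x00\x00"                          # {}
a_is_one = b"\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"  # {"a": 1}

data = empty_doc + a_is_one
offset = 0
sizes = []
while offset < len(data):
    # The first 4 bytes of each document hold its total length.
    (size,) = struct.unpack_from("<i", data, offset)
    sizes.append(size)
    offset += size  # jump straight to the start of the next document

print(sizes)  # [5, 12]
```

The number of entries in `sizes` is the document count; nothing inside the documents is ever decoded.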
So I believe this function should do the trick (open the file in binary mode, 'rb', before passing it in):
import os
import struct

from bson.errors import InvalidBSON


def count_file_documents(file_obj):
    """Count how many documents the provided BSON file contains."""
    cnt = 0
    while True:
        # Read size of next object.
        size_data = file_obj.read(4)
        if len(size_data) == 0:
            break  # Finished with file normally.
        elif len(size_data) != 4:
            raise InvalidBSON("cut off in middle of objsize")
        # unpack() returns a 1-tuple; the size includes the 4 prefix bytes.
        obj_size = struct.unpack("<i", size_data)[0] - 4
        # Skip the next obj_size bytes.
        file_obj.seek(obj_size, os.SEEK_CUR)
        cnt += 1
    return cnt
(I haven't tested the code, though. Don't have MongoDB at hand.)
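One caveat with the skip-by-seek approach: seeking past the end of a file does not raise an error, so a dump whose last document is cut off would still be counted. Below is a hedged variant that checks for this. It is a sketch under my own assumptions: the name `count_bson_documents` is made up, it raises `ValueError` instead of `InvalidBSON` so that it runs without PyMongo installed, and it is exercised only on hand-built bytes, not on a real dump.

```python
import io
import os
import struct


def count_bson_documents(file_obj):
    """Count documents in a seekable binary stream of concatenated BSON."""
    # Find the end of the stream so a truncated last document is detected.
    start = file_obj.tell()
    end = file_obj.seek(0, os.SEEK_END)
    file_obj.seek(start)

    count = 0
    while True:
        size_data = file_obj.read(4)
        if not size_data:
            return count  # clean end of file
        if len(size_data) != 4:
            raise ValueError("cut off in the middle of a size prefix")
        (doc_size,) = struct.unpack("<i", size_data)
        # The smallest valid BSON document (an empty one) is 5 bytes long.
        if doc_size < 5:
            raise ValueError("invalid document size: %d" % doc_size)
        # The prefix counts itself, so skip the remaining doc_size - 4 bytes.
        next_pos = file_obj.seek(doc_size - 4, os.SEEK_CUR)
        if next_pos > end:
            raise ValueError("last document is truncated")
        count += 1


# Two hand-built documents: {} and {"a": 1}.
sample = b"\x05\x00\x00\x00\x00" + b"\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"
print(count_bson_documents(io.BytesIO(sample)))  # 2
```

For a real file you would pass something like `open('./archive.bson', 'rb')` instead of the `io.BytesIO` object.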