I'd like to calculate the number of documents stored in a MongoDB BSON file without having to import the file into the database via mongorestore.
The best I've been able to come up with in Python is:
import bson

# Decode every document in the file and count them.
with open('./archive.bson', 'rb') as bson_doc:
    it = bson.decode_file_iter(bson_doc)
    total = sum(1 for _ in it)
print(total)
I don't have a file at hand to try, but I believe there's a way, if you parse the data by hand.
The source for bson.decode_file_iter (sans the docstring) goes like this:
_UNPACK_INT = struct.Struct("<i").unpack

def decode_file_iter(file_obj, codec_options=DEFAULT_CODEC_OPTIONS):
    while True:
        # Read size of next object.
        size_data = file_obj.read(4)
        if len(size_data) == 0:
            break  # Finished with file normally.
        elif len(size_data) != 4:
            raise InvalidBSON("cut off in middle of objsize")
        obj_size = _UNPACK_INT(size_data)[0] - 4
        elements = size_data + file_obj.read(obj_size)
        yield _bson_to_dict(elements, codec_options)
I presume the time-consuming operation is the _bson_to_dict call, and you don't need it.
So all you need to do is read the file: get the int32 value holding the next document's size, then skip over the rest of that document without decoding it. Count how many documents you encounter while doing this.
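To illustrate the layout this relies on, here is a minimal sketch that needs only the standard library. It builds two tiny BSON documents by hand (the names empty_doc and doc_a are made up for this example) and walks over them using nothing but the leading little-endian int32, which holds the total size of each document including the 4 size bytes themselves:

```python
import struct

# An empty BSON document: int32 total size (5) followed by the 0x00 terminator.
empty_doc = struct.pack("<i", 5) + b"\x00"

# {"a": 1} encoded by hand: element type 0x10 (int32), key "a" as a C string,
# the int32 value, and the document's 0x00 terminator.
body = b"\x10" + b"a\x00" + struct.pack("<i", 1) + b"\x00"
doc_a = struct.pack("<i", 4 + len(body)) + body

# Walk the concatenated documents by hopping from size prefix to size prefix.
data = empty_doc + doc_a
offset = 0
sizes = []
while offset < len(data):
    (size,) = struct.unpack_from("<i", data, offset)
    sizes.append(size)
    offset += size

print(sizes)  # [5, 12]
```

The number of hops is the document count; none of the document bodies are ever decoded.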
I believe this function should do the trick:
import struct
import os
from bson.errors import InvalidBSON

def count_file_documents(file_obj):
    """Count how many documents the provided BSON file contains."""
    cnt = 0
    while True:
        # Read size of next object.
        size_data = file_obj.read(4)
        if len(size_data) == 0:
            break  # Finished with file normally.
        elif len(size_data) != 4:
            raise InvalidBSON("cut off in middle of objsize")
        # The size prefix includes its own 4 bytes, which are already consumed.
        obj_size = struct.unpack("<i", size_data)[0] - 4
        # Skip the next obj_size bytes
        file_obj.seek(obj_size, os.SEEK_CUR)
        cnt += 1
    return cnt
(I haven't tested the code, though. Don't have MongoDB at hand.)
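One caveat with the function above: seek() does not fail when it moves past the end of the file, so a truncated last document would still be counted, and a corrupt size of 0 would make the loop spin forever. Below is a sketch of a stricter variant under a few assumptions: the name count_bson_documents_strict is made up, the file object is seekable, and it raises a plain ValueError rather than InvalidBSON so that it runs without pymongo installed.

```python
import io
import os
import struct


def count_bson_documents_strict(file_obj):
    """Count BSON documents, raising ValueError on truncated or corrupt input."""
    # Find the end of the file so every skip can be bounds-checked.
    start = file_obj.tell()
    end = file_obj.seek(0, os.SEEK_END)
    file_obj.seek(start)

    count = 0
    while True:
        size_data = file_obj.read(4)
        if not size_data:
            break  # Clean end of file.
        if len(size_data) != 4:
            raise ValueError("cut off in middle of objsize")
        (doc_size,) = struct.unpack("<i", size_data)
        # The smallest valid document is 5 bytes: int32 size plus 0x00 terminator.
        if doc_size < 5:
            raise ValueError("invalid document size: %d" % doc_size)
        # seek() returns the new absolute position, which must not pass the end.
        if file_obj.seek(doc_size - 4, os.SEEK_CUR) > end:
            raise ValueError("last document is truncated")
        count += 1
    return count


# Two hand-built documents for a quick check: {} and {"a": 1}.
empty_doc = struct.pack("<i", 5) + b"\x00"
body = b"\x10a\x00" + struct.pack("<i", 1) + b"\x00"
doc_a = struct.pack("<i", 4 + len(body)) + body

print(count_bson_documents_strict(io.BytesIO(empty_doc + doc_a)))  # 2
```

With a real dump you would pass a file opened in binary mode, e.g. open('./archive.bson', 'rb'), instead of the io.BytesIO object.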