Duc Phan - 6 months ago
Node.js Question

MongoDB query comments along with user information

I am creating an application with Node.js and the native MongoDB driver (not Mongoose). I have a problem that has given me a headache for a few days; please suggest an approach!
I have a mongodb design like this

post {
  _id: ObjectId(...),
  picture: 'some_url',
  comments: [
    {
      _id: ObjectId(...),
      user_id: ObjectId('123456'),
      body: "some content"
    },
    {
      _id: ObjectId(...),
      user_id: ObjectId('...'),
      body: "other content"
    }
  ]
}

user {
  _id: ObjectId('123456'),
  name: 'some name',       --> changeable at any time
  username: 'some_name',   --> changeable at any time
  picture: 'url_link'      --> changeable at any time
}


I want to query the post along with all the user information, so the result should look like this:

[{
  _id: ObjectId(...),
  picture: 'some_url',
  comments: [
    {
      _id: ObjectId(...),
      user_id: ObjectId('123456'),
      user_data: {
        _id: ObjectId('123456'),
        name: 'some name',
        username: 'some_name',
        picture: 'url_link'
      },
      body: "some content"
    },
    {
      _id: ObjectId(...),
      user_id: ObjectId('...'),
      body: "other content"
    }
  ]
}]


I tried using a loop to manually fetch the user data and add it to each comment, but it proved difficult and beyond my coding skill :(

If anybody has a suggestion, I would really appreciate it.

P/s: I am trying another approach in which I would embed all the user data into each comment, and whenever users update their username, name, or picture, the application would update it in all their comments as well.

Answer

The problem(s)

As written before, over-embedding brings several problems:

Problem 1: BSON size limit

As of this writing, BSON documents are limited to 16 MB. If that limit is reached, MongoDB throws an exception: you simply could not add more comments, and in the worst case you could not even change the (user)name or the picture if the change increased the size of the document.
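To make the limit concrete, here is a back-of-envelope estimate. The ~300 bytes per embedded comment is an assumption, not a measurement; real sizes depend on field names and comment length:

```javascript
// Rough estimate of how many embedded comments fit into one post document,
// assuming ~300 bytes of BSON per comment (field names, ObjectIds, short body).
const BSON_LIMIT = 16 * 1024 * 1024;   // 16 MB document size limit
const approxCommentBytes = 300;        // assumption, varies with comment length
const maxComments = Math.floor(BSON_LIMIT / approxCommentBytes);
console.log(maxComments);              // on the order of tens of thousands
```

A popular post can easily collect that many comments, so the limit is not merely theoretical.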

Problem 2: Query limitations and performance

It is not easily possible to query or sort the comments array under certain conditions. Some things would require a rather costly aggregation, others rather complicated statements.

While one could argue that once the queries are in place this isn't much of a problem, I beg to differ. The more complicated a query is, the harder it is to optimize, both for the developer and for MongoDB's query optimizer. I have had the best results with simplifying data models and queries, speeding up responses by a factor of 100 in one instance.

When scaling, the resources needed for complicated and/or costly queries can add up to whole machines compared to a simpler data model and the corresponding queries.

Problem 3: Maintainability

Last but not least, you may well run into problems maintaining your code. As a simple rule of thumb:

The more complicated your code becomes, the harder it is to maintain. The harder code is to maintain, the more time maintenance takes. The more time maintenance takes, the more expensive it gets.

Conclusion: Complicated code is expensive.

In this context, "expensive" both refers to money (for professional projects) and time (for hobby projects).

(My!) Solution

It is pretty easy: simplify your data model. Consequently, your queries will become less complicated and (hopefully) faster.

Step 1: Identify your use cases

That's going to be a wild guess for me, but the important thing here is to show you the general method. I'd define your use cases as follows:

  1. For a given post, users should be able to comment
  2. For a given post, show the author and the comments, along with the commenters' and author's usernames and pictures
  3. For a given user, it should be easily possible to change the name, username and picture

Step 2: Model your data accordingly

Users

First of all, we have a straightforward user model

{
  _id: new ObjectId(),
  name: "Joe Average",
  username: "HotGrrrl96",
  picture: "some_link"
}

Nothing new here, added just for completeness.

Posts

{
  _id: new ObjectId(),
  title: "A post",
  content: "Interesting stuff",
  picture: "some_link",
  created: new ISODate(),
  author: {
    username: "HotGrrrl96",
    picture: "some_link"
  }
}

And that's about it for a post. There are two things to note here. First, we store the author data we immediately need when displaying a post, since this saves us a query for a very common, if not ubiquitous, use case. Why don't we store the comments and commenters' data accordingly? Because of the 16 MB size limit, we avoid keeping an ever-growing array inside a single document. Instead, each comment becomes its own document that holds a reference back to its post:

Comments

{
  _id: new ObjectId(),
  post: someObjectId,
  created: new ISODate(),
  commenter: {
    username: "FooBar",
    picture: "some_link"
  },
  comment: "Awesome!"
}

As with posts, each comment carries all the data necessary for displaying it.

The queries

What we have achieved now is that we have circumvented the BSON size limit, and we don't need to load the user documents in order to display posts and comments, which should save us a lot of queries. But let's come back to the use cases and some more queries.

Adding a comment

That's totally straightforward now: insert a new document into the comments collection.
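For illustration, the comment document could be built like this. This is a minimal sketch; `makeComment` is a hypothetical helper, and the field names follow the comment model above:

```javascript
// Hypothetical helper that builds a comment document for db.comments.insert().
// postId is the _id of the post; commenter supplies the denormalized user data.
function makeComment(postId, commenter, text) {
  return {
    post: postId,
    created: new Date(),
    commenter: {
      username: commenter.username,
      picture: commenter.picture
    },
    comment: text
  };
}

// In the shell: db.comments.insert(makeComment(post._id, user, "Awesome!"))
```

Note that only the username and picture are copied; everything else stays in the users collection.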

Getting all or some comments for a given post

For all comments

db.comments.find({post:objectIdOfPost})

For the 3 latest comments

db.comments.find({post:objectIdOfPost}).sort({created:-1}).limit(3)

So for displaying a post and all (or some) of its comments, including the usernames and pictures, we are at two queries. That's more than you needed before, but we have circumvented the size limit, and you can have a practically unlimited number of comments for every post. But let's get to something more realistic.

Getting the latest 5 posts and their latest 3 comments

This is a two step process. However, with proper indexing (will come back to that later) this still should be fast (and hence resource saving):

var posts = db.posts.find().sort({created: -1}).limit(5);
posts.forEach(
  function(post) {
    doSomethingWith(post);
    var comments = db.comments.find({post: post._id}).sort({created: -1}).limit(3);
    doSomethingElseWith(comments);
  }
);

Get all posts of a given user sorted from newest to oldest and their comments

var posts = db.posts.find({"author.username": "HotGrrrl96"},{_id:1}).sort({"created":-1});
var postIds = [];
posts.forEach(
  function(post){
    postIds.push(post._id);
  }
)
var comments = db.comments.find({post: {$in: postIds}}).sort({post:1, created:-1});

Note that we have only two queries here. Although you need to "manually" associate each post with its comments, that is pretty straightforward.
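The in-memory association can be sketched like this, assuming `posts` and `comments` are plain arrays (e.g. obtained via `toArray()`); `attachComments` is a hypothetical helper:

```javascript
// Hypothetical helper: attach each comment to its post via the `post` reference.
// Since comments were sorted by {post: 1, created: -1}, per-post order is kept.
function attachComments(posts, comments) {
  var byId = {};
  posts.forEach(function(post) {
    post.comments = [];
    byId[post._id] = post;
  });
  comments.forEach(function(comment) {
    var post = byId[comment.post];
    if (post) post.comments.push(comment);
  });
  return posts;
}
```

This is a plain hash join: one pass over the posts, one pass over the comments.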

Change a username

This presumably is a rarely executed use case. However, it isn't very complicated with this data model.

First, we change the user document

db.users.update(
  { username: "HotGrrrl96"},
  {
    $set: { username: "Joe Cool"},
    $push: {oldUsernames: "HotGrrrl96" }
  },
  {
    writeConcern: {w: "majority"}
  }
);

We push the old username to a dedicated array. This is a safety measure in case something goes wrong with the following operations. Furthermore, we set the write concern to a rather high level to make sure the data is durable.

db.posts.update(
  { "author.username": "HotGrrrl96"},
  { $set:{ "author.username": "Joe Cool"} },
  {
    multi:true,
    writeConcern: {w:"majority"}
  }
)

Nothing special here. The update statement for the comments looks pretty much the same. While those updates take some time, they are rarely executed.
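To make the denormalized rename explicit, here is a sketch; the server-side statement is shown as a comment, and `renameCommenter` is a hypothetical in-memory equivalent of what the multi-update does on the server:

```javascript
// The server-side statement mirrors the posts update:
//   db.comments.update(
//     { "commenter.username": "HotGrrrl96" },
//     { $set: { "commenter.username": "Joe Cool" } },
//     { multi: true, writeConcern: { w: "majority" } }
//   )
// Hypothetical in-memory equivalent of that multi-update:
function renameCommenter(comments, oldName, newName) {
  comments.forEach(function(c) {
    if (c.commenter.username === oldName) {
      c.commenter.username = newName;
    }
  });
  return comments;
}
```

Every comment written under the old name is touched exactly once; comments by other users are left alone.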

The indices

As a rule of thumb, MongoDB can generally use only one index per query. While this is not entirely true, since there are index intersections, it is easy to work with. Another point is that the leading fields of a compound index can be used independently. So an easy approach to index optimization is to find the query that uses the most fields in index-eligible operations and create a compound index over them. Note that the order of the fields in the index matters. So, let's go ahead.

Posts

db.posts.createIndex({"author.username":1,"created":-1})

Comments

db.comments.createIndex({"post":1, "created":-1})

Conclusion

A fully embedded document per post admittedly is the fastest way of loading the post and its comments. However, it does not scale well, and due to the possibly complex queries necessary to deal with it, this performance advantage may be reduced or even eliminated.

With the above solution, you trade some speed (if any) for basically unlimited scalability and a much more straightforward way of dealing with the data.

Hth.
