Glimpses of Daniel's world

M101JS Week 4

On August 12th the first M101JS (MongoDB for Node.js developers) course started. Previously I completed both the M101J (MongoDB for Java developers) and M102 (MongoDB for DBAs) courses available through 10gen's education site.

When I finish this course I will have refreshed my MongoDB knowledge and gained some experience with Node.js in the form of a blog application. I decided to document my progress in (at least) weekly summaries. This will be part four of seven. You can read part three here.

For this series of blog posts I will structure each post into an introduction (you just passed it) and two sections. The first section, Lectures, will summarize what I learned from the videos and related examples. In the second section, Homework, I will mention what I learned from practicing the homework assignments and anything I might have done extra because of it.

Lectures

This week's topic is performance. After you have made your initial schema design, you will need to optimize its performance. If you don't optimize from the start, then once your data grows beyond a certain point users will complain that the application is getting slow.

Although database administrators are the ones most often associated with improving performance, a good developer knows how to spot points of improvement himself. A good developer also knows how to write well-performing software.

It's all about indexes

The majority of the week was about indexes. Indexes are the key to making queries fast. MongoDB uses them to find and sort in collections. Finding is meant in the broadest sense of the word here, because to update or remove documents the database needs to find them as well.

The mantra for engineers using MongoDB is: “model based on your usage pattern”. This applies to creating indexes just as much as to schema design. You shouldn't waste time and storage space on indexes you rarely use. Once you add an index, every insert or update triggers MongoDB to update the indexes as well. If there are many indexes to update, then your modification of the collection takes a little longer.

The order of the keys in an index is an important choice. Let's assume you want to index people based on gender, age and length. You choose to index males first, then from young to old, and finally from short to tall. You can do this in the mongo shell with the following command.

[code language="javascript"]

db.people.ensureIndex({ 'gender': -1, 'age': 1, 'length': 1 });

[/code]

This has the following consequences. You will be able to use this index for queries like these:

- find all men
- find all men of age 25
- find all men of age 25 with a certain length
- find all men taller than 5'3" *

However, you won't be able to use the index to find everybody taller than 5'3" or everybody older than 21. Your query needs to include at least the left-most part of an index before MongoDB will consider using that particular index. Now, about the * I put on the query to find all men taller than 5'3": that query can use the gender part of the index, but from there on it needs to scan all men for those taller than 5'3".
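To make the left-most prefix rule concrete, here is a small plain-JavaScript sketch. This is my own illustration, not MongoDB code; `usablePrefixLength` is a made-up helper that reports how many leading index keys a query can actually use.

```javascript
// Illustration of the left-most prefix rule, not MongoDB internals.
// Returns how many leading keys of the index a query can use; 0 means
// MongoDB will not consider this index at all.
function usablePrefixLength(indexKeys, queryKeys) {
  let n = 0;
  while (n < indexKeys.length && queryKeys.includes(indexKeys[n])) {
    n += 1;
  }
  return n;
}

// The { gender: -1, age: 1, length: 1 } index from above:
usablePrefixLength(['gender', 'age', 'length'], ['gender', 'age']); // whole prefix usable
usablePrefixLength(['gender', 'age', 'length'], ['length']);        // index not considered
```

Note how a query on gender and length yields a prefix of one: the gender part of the index is used, and the rest is a scan, which is exactly the * case above.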

Indexes are also used for sorting. The same rules as for searching apply. One important aspect in sorting is that MongoDB considers indexes if the sorting part of the query matches the index's sort order or is the reverse of it. An example will explain it better.

Let's assume an index where you sort people on ascending age and ascending height.

[code language="javascript"]

db.people.ensureIndex({ 'age': 1, 'height': 1 });

[/code]

Once you put an index on multiple keys, MongoDB builds it up from left to right. In this case there will be an index structure ascending on age, and inside each age segment another structure ascending on height. When MongoDB searches for indexes to use, it will find the index on age and see it is ascending. If the query asks for a descending sort, MongoDB may still consider the same index, but will flip the order of the pattern it uses to walk the index. This means that in the following cases the index could be used:

- sort ascending on age and ascending on height
- sort descending on age and descending on height (the exact reverse)
- sort ascending on age only
- sort descending on age only

But in the next cases it can't:

- sort ascending on age and descending on height, or the other way around
- sort on height only, because height is not the left-most part of the index
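The rule can be sketched in plain JavaScript. This is my own illustration, not the real query planner; `sortCanUseIndex` is a made-up name, and it ignores the case where a sort is combined with an equality match on a prefix.

```javascript
// Illustration, not the real query planner: a sort can use a compound
// index only when its keys form a left-most prefix of the index and the
// directions all match the index, or are all the exact reverse of it.
function sortCanUseIndex(indexSpec, sortSpec) {
  const indexKeys = Object.keys(indexSpec);
  const sortKeys = Object.keys(sortSpec);
  if (sortKeys.length === 0 || sortKeys.length > indexKeys.length) {
    return false;
  }
  const same = sortKeys.every((k, i) => k === indexKeys[i] && sortSpec[k] === indexSpec[k]);
  const reversed = sortKeys.every((k, i) => k === indexKeys[i] && sortSpec[k] === -indexSpec[k]);
  return same || reversed;
}

// With the { age: 1, height: 1 } index from above:
sortCanUseIndex({ age: 1, height: 1 }, { age: -1, height: -1 }); // exact reverse, usable
sortCanUseIndex({ age: 1, height: 1 }, { age: 1, height: -1 });  // mixed directions, not usable
```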

Multi-keys

This week the concept of multi-keys, briefly introduced in week 3, was explained more elaborately. In essence it's the indexing of the values in an array, for at most one key in the index. That means a blog post, which has a list of tags, can be indexed with multiple references to that post, one for each tag in the list. But if the same post also has a list of categories, you have to be careful: MongoDB won't allow a multi-key index where more than one key of the same document holds an array. For each document either the tags or the categories can be an array, but not both.
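A small plain-JavaScript sketch of that restriction (again my own illustration, not MongoDB code; `violatesMultikeyRule` is a made-up name):

```javascript
// Illustration of the restriction: among the keys of one index, at most
// one may hold an array in any given document.
function violatesMultikeyRule(doc, indexKeys) {
  return indexKeys.filter((key) => Array.isArray(doc[key])).length > 1;
}

// One array among the indexed keys is fine:
violatesMultikeyRule({ tags: ['mongodb', 'node'], categories: 'tech' }, ['tags', 'categories']);
// Two arrays among the indexed keys would make the insert fail:
violatesMultikeyRule({ tags: ['mongodb'], categories: ['tech', 'db'] }, ['tags', 'categories']);
```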

Unique and sparse indexes

Each value in an index can have multiple references to the collection. This means a value can be duplicated across several documents in the same collection. There are cases where you would like a certain key to have unique values. If it's only one key, you might consider using the value for the _id key. The _id key has unique values, but the name isn't very friendly, and you might already be using that field for another purpose. Let's consider the people collection again, where we let the _id key be a registration ID and we want a key for the unique code on somebody's ID card. Then you create a unique index on that key. Any duplicate entry on insert will give an error. If there are already duplicates in the collection, you won't be able to create the index without removing the duplicates. This can be done while creating the index, but is generally not a good idea.

I chose to make the ID card code unique, but some people might not have an ID card. If I tried to apply the index to the collection, it might fail because all those people without ID cards would be considered to have the value null. All documents are indexed, even the ones without the key; those are shoved into a big bucket where the value for that key is null. If you want to index only the documents that do have the key, you can use sparse indexes. There is a slight danger in using a sparse index: if it is used for a query or sort in any way, then any document not in the index will be filtered out of the results.
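In the mongo shell this looks roughly like the following; `card_code` is a field name I made up for the ID card code.

```javascript
// Unique index: a second insert with the same card_code raises an error.
db.people.ensureIndex({ 'card_code': 1 }, { unique: true });

// Alternatively, unique and sparse: documents without a card_code stay out
// of the index, so several missing values no longer collide on null.
db.people.ensureIndex({ 'card_code': 1 }, { unique: true, sparse: true });
```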

Hinting

By default MongoDB decides which index to use, picking the one that returns results the fastest. Hinting is a way to force the use of a specific index, or none at all. You could apply a hint when you use find operators that won't make optimal use of the indexes. These operators need to scan (a large part of) the collection, even if indexes are used. Examples are: $gt, $lt, $ne, $nin and $regex (when your regular expression doesn't describe the start of the text).
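For example, in the mongo shell (field names taken from the earlier people examples):

```javascript
// Force the query to use a specific index:
db.people.find({ 'age': { '$gt': 21 } }).hint({ 'age': 1, 'height': 1 });

// Force a full collection scan, using no index at all:
db.people.find({ 'age': { '$ne': 30 } }).hint({ '$natural': 1 });
```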

Geospatial indexes

There was also some mention of geospatial indexes. I used them in a blog before, and it was a nice feature in its infancy stages back then (about a year ago). The main problem I had when trying to make a proximity query using Google Maps was that the order of the x and y coordinates in MongoDB is different from the way Google Maps returns latitude and longitude; I always had to flip them for the query to give the right result. The other difficult part was using the spherical model, which is more realistic when calculating actual distances: you need to give the distance in radians, and you need to use runCommand, which is less convenient when processing the result.
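From memory, a query of that era looked roughly like this; the collection name and coordinates are made up for the example.

```javascript
// A 2d index on a location field; note that MongoDB expects [longitude,
// latitude], the reverse of what Google Maps hands you.
db.places.ensureIndex({ 'loc': '2d' });

// Spherical proximity search via runCommand; maxDistance is in radians,
// so divide the distance in kilometers by the Earth's radius (~6371 km).
db.runCommand({
  geoNear: 'places',
  near: [4.89, 52.37],    // [lng, lat]
  spherical: true,
  maxDistance: 10 / 6371  // 10 km expressed in radians
});
```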

Tweaking your performance

There are several options to tweak your performance in MongoDB. By default slow queries, the ones taking longer than 100 ms inside MongoDB, are logged. You can use the explain command at the end of a query, which will tell you some information about the query plan. There is also an option to turn on profiling of queries, with three levels: off, slow queries only, and all queries. The profiled queries are written to a capped collection, which you can query to figure out what went slow and why (each entry is a document similar to what explain returns).
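In the mongo shell these options look roughly like this:

```javascript
// Profile only slow operations; the threshold here is 100 ms.
db.setProfilingLevel(1, 100);

// Profiled operations end up in the capped collection system.profile.
db.system.profile.find({ 'millis': { '$gt': 100 } }).sort({ 'ts': -1 });

// explain() shows the plan MongoDB picked for a single query.
db.people.find({ 'age': { '$gt': 21 } }).explain();
```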

Mongotop and mongostat

These two programs will tell you a little bit more about the performance of your application. The first, mongotop, displays a snapshot at intervals of where most of the time is spent for each collection, so you can determine whether reads or writes are slow. You use it in addition to the profiling. The second program, mongostat, has one column that is usually looked at first: the one telling you how many misses there are on the indexes. If it is anything above zero, your indexes are too big and don't fit in memory. However, a zero might also indicate that no indexes are being used at all. Together with the profiling data and the logs of slow queries, you can determine whether to be smarter about indexing or scale up the memory.
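Both are started from the command line, assuming a mongod is running locally:

```shell
# Show, every 5 seconds, where each collection spends its read/write time.
mongotop 5

# Print server-wide statistics once per second; watch the index miss column.
mongostat
```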

Homework

The fourth week's assignments consisted of two multiple choice questions, an assignment to put indexes on the collection to make the blog application fast, and a final assignment to do a little analysis of profile data.

On multiple choice questions it's important to read the questions and answers properly. In homework assignments you get feedback on your attempts, so after the first wrong attempt you have two more chances to get it right. The final exam doesn't have this luxury. This doesn't help with any exam anxiety, I can tell you that much...

The blog performance assignment isn't hard at all. You just need to create the proper indexes and remember that you have to design them to fit the data usage. Once finished, a validation script will tell you how well you did. At a certain level of performance you are told the magic key, but teased to find the ultimate answer. If you find the perfect answer from the start, you will of course not see the teasing.

In general this week was just a refresher, again, and most time was spent on writing this blog. I truly believe that having to write it down made me think about it more, increasing my knowledge on the subject.

I am now halfway through the course and must say that there haven't been many Node.js-related topics. Most of the assignments and lectures touch on MongoDB theory and usage of the shell. The Node.js driver has many similarities to the mongo shell, with the main exception of using callbacks. Perhaps the real Node.js challenges will come in the last two weeks, with the final exam being the toughest because there is no feedback and only your last answer counts.

In week five the lectures will cover the topic of the aggregation framework. I think it's mostly used to generate reports because most of the time a properly designed schema means you don't need to aggregate data in regular queries. However, scheduled tasks could be used to aggregate.