Querying and indexing a shard cluster

From the application’s perspective, there’s no difference between querying a sharded cluster and querying a single mongod. In both cases, the query interface and the process of iterating over the result set are the same. But behind the scenes, things are different, and it’s worthwhile to understand exactly what’s going on.

12.5.1 Query routing

Imagine you’re querying a sharded cluster. How many shards does mongos need to contact to return a proper query response? If you give it some thought, you’ll see that it depends on whether the shard key is present in the query selector that we pass to find and similar operations. Remember that the config servers (and thus mongos) maintain a mapping of shard key ranges to shards. These mappings are none other than the chunks we examined earlier in the chapter. If a query includes the shard key, then mongos can quickly consult the chunk data to determine exactly which shard contains the query’s result set. This is called a targeted query.

But if the shard key isn’t part of the query, mongos has no way to narrow down the shards, so it must send the query to all shards to fulfill it completely. This is known as a global or scatter/gather query.
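To make the routing decision concrete, here is a minimal sketch in plain JavaScript. It is not how mongos is implemented; the chunk ranges, the shardsToContact function, and the single-field username shard key are all assumptions made for illustration, and only exact-match selectors are handled.

```javascript
// Hypothetical chunk map: each chunk covers [min, max) of the shard key
// and lives on one shard. Real chunk metadata lives on the config servers.
const chunks = [
  { min: "",  max: "h",      shard: "shard-a" },
  { min: "h", max: "p",      shard: "shard-b" },
  { min: "p", max: "\uffff", shard: "shard-a" },
];

// Return the shards a router would have to contact for a query selector.
// Assumes a single-field shard key and exact-match selectors only.
function shardsToContact(selector, shardKey = "username") {
  const allShards = [...new Set(chunks.map((c) => c.shard))];
  const value = selector[shardKey];
  if (typeof value !== "string") {
    // Shard key absent: scatter/gather to every shard.
    return allShards;
  }
  // Shard key present: find the chunk whose range contains the value.
  const chunk = chunks.find((c) => c.min <= value && value < c.max);
  return chunk ? [chunk.shard] : allShards;
}

console.log(shardsToContact({ username: "Abbott" }));   // targeted: one shard
console.log(shardsToContact({ filename: "sheet-1" }));  // global: every shard
```

The point of the sketch is that the chunk map can only be consulted when the selector carries a shard key value; without one, every shard is a candidate.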

The diagram in figure 12.5 illustrates both query types.

Figure 12.5 shows a cluster with two shards, two mongos routers, and two application servers. The shard key for this cluster is {username: 1, _id: 1}. We’ll discuss how to choose a good shard key in section 12.6.


356 CHAPTER 12 Scaling your system with sharding

To the left of the figure, you can see a targeted query that includes the username field in its query selector. In this case, the mongos router can use the value of the username field to route the query directly to the correct shard.

To the right of the figure, you can see a global or scatter/gather query that doesn’t include any part of the shard key in its query selector. In this case, the mongos router must broadcast the query to both shards.

The effect of query targeting on performance cannot be overstated. If all your queries are global, that means each shard must respond to every single query against your cluster. In contrast, if all your queries are targeted, each shard only needs to handle on average the total number of requests divided by the total number of shards. The implications for scalability are clear.
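The arithmetic behind that claim can be sketched in a few lines of JavaScript. The queriesPerShard function and the numbers passed to it are hypothetical, and the sketch assumes that targeted queries spread evenly across shards, which a real workload may not do.

```javascript
// Average number of queries each shard handles, under the simplifying
// assumption that targeted queries spread evenly across shards.
function queriesPerShard(totalQueries, shardCount, targetedFraction) {
  const targeted = totalQueries * targetedFraction;
  const global = totalQueries - targeted;
  // Every shard sees every global query, but only its share of targeted ones.
  return global + targeted / shardCount;
}

console.log(queriesPerShard(10000, 4, 0)); // all global: 10000 per shard
console.log(queriesPerShard(10000, 4, 1)); // all targeted: 2500 per shard
```

With all-global queries, adding shards does nothing to reduce the per-shard query load; with all-targeted queries, the load per shard falls as shards are added.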

But targeting isn’t the only thing that affects performance in a sharded cluster. As you’ll see in the next section, everything you’ve learned about the performance of an unsharded deployment still applies to a sharded cluster, but to each shard individually.

12.5.2 Indexing in a sharded cluster

No matter how well-targeted your queries are, they must eventually run on at least one shard. This means that if your shards are slow to respond to queries, your cluster will be slow as well.

Figure 12.5 Targeted and global queries against a shard cluster. [Figure: two applications, each connected to a mongos router in front of Shard-a and Shard-b (both replica sets). The targeted query find({username:"Abbott"}), whose query selector has the shard key, queries only the shard with the chunk containing documents with “Abbott” as the shard key. The global query find({filename:"sheet-1"}), whose query selector lacks the shard key, cannot be isolated using chunk information, so it is sent to all shards.]


As in an unsharded deployment, indexing is an important part of optimizing performance. There are only a few key points to keep in mind about indexing that are specific to a sharded cluster:

■ Each shard maintains its own indexes. When you declare an index on a sharded collection, each shard builds a separate index for its portion of the collection. For example, when you issue the db.spreadsheets.createIndex() command while connected to a mongos router, each shard processes the index creation command individually.

■ It follows that the sharded collections on each shard should have the same indexes. If this ever isn’t the case, you’ll see inconsistent query performance.

■ Sharded collections permit unique indexes on the _id field and on the shard key only. Unique indexes are prohibited elsewhere because enforcing them would require inter-shard communication, which is against the fundamental design of sharding in MongoDB.
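The last point is easier to see with a toy model. The following JavaScript sketch is only an illustration, not MongoDB code: the email field, the two-way split on username, and the insert function are all hypothetical. Each shard checks uniqueness against its own index only, as it would have to without inter-shard communication.

```javascript
// Each shard keeps its own index, modeled here as a Set of seen values.
// The shard names and the "email" field are hypothetical.
const shards = {
  "shard-a": { emails: new Set() },
  "shard-b": { emails: new Set() },
};

// Simplified routing on a "username" shard key: a-m to shard-a, rest to shard-b.
function shardFor(username) {
  return username.toLowerCase() < "n" ? "shard-a" : "shard-b";
}

// Insert with a uniqueness check that consults only the local shard's index.
function insert(doc) {
  const shard = shards[shardFor(doc.username)];
  if (shard.emails.has(doc.email)) {
    return false; // duplicate caught by the local index
  }
  shard.emails.add(doc.email);
  return true;
}

const first = insert({ username: "Abbott", email: "x@example.com" });
const second = insert({ username: "Wallace", email: "x@example.com" });
const third = insert({ username: "Adams", email: "x@example.com" });
console.log(first, second, third); // true true false
```

The second insert succeeds even though the email value already exists, because it lands on a different shard whose index has never seen it. Only the third insert, routed to the same shard as the first, is rejected. Uniqueness on the shard key avoids this problem because every document with a given shard key value is routed to the same shard.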

Once you understand how queries are routed and how indexing works, you should be in a good position to write smart queries and indexes for your sharded cluster. Most of the advice on indexing and query optimization from chapter 8 will apply.

In the next section, we’ll cover the powerful explain() tool, which you can use to see exactly what path is taken by a query against your cluster.

12.5.3 The explain() tool in a sharded cluster

The explain() tool is your primary way to troubleshoot and optimize queries. It can show you exactly how your query would be executed, including whether it can be targeted and whether it can use an index. The following listing shows an example of what this output might look like.

Listing 12.1 Index and query to return latest documents updated by a user

mongos> db.spreadsheets.createIndex({username:1, updated_at:-1})
{
  "raw" : {
    "shard-a/localhost:30000,localhost:30001" : {
      "createdCollectionAutomatically" : false,
      "numIndexesBefore" : 3,
      "numIndexesAfter" : 4,
      "ok" : 1
    },
    "shard-b/localhost:30100,localhost:30101" : {
      "createdCollectionAutomatically" : false,
      "numIndexesBefore" : 3,
      "numIndexesAfter" : 4,
      "ok" : 1
    }
  },
  "ok" : 1
}


mongos> db.spreadsheets.find({username: "Wallace"}).sort({updated_at: -1}).explain()
{
  "clusteredType" : "ParallelSort",
  "shards" : {
    "shard-b/localhost:30100,localhost:30101" : [
      {
        "cursor" : "BtreeCursor username_1_updated_at_-1",
        "isMultiKey" : false,
        "n" : 100,
        "nscannedObjects" : 100,
        "nscanned" : 100,
        "nscannedObjectsAllPlans" : 200,
        "nscannedAllPlans" : 200,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 1,
        "nChunkSkips" : 0,
        "millis" : 3,
        "indexBounds" : {
          "username" : [ [ "Wallace", "Wallace" ] ],
          "updated_at" : [ [ { "$maxElement" : 1 }, { "$minElement" : 1 } ] ]
        },
        "server" : "localhost:30100",
        "filterSet" : false
      }
    ]
  },
  "cursor" : "BtreeCursor username_1_updated_at_-1",
  "n" : 100,
  "nChunkSkips" : 0,
  "nYields" : 1,
  "nscanned" : 100,
  "nscannedAllPlans" : 200,
  "nscannedObjects" : 100,
  "nscannedObjectsAllPlans" : 200,
  "millisShardTotal" : 3,
  "millisShardAvg" : 3,
  "numQueries" : 1,
  "numShards" : 1,
  "indexBounds" : {
    "username" : [ [ "Wallace", "Wallace" ] ],
    "updated_at" : [ [ { "$maxElement" : 1 }, { "$minElement" : 1 } ] ]
  },
  "millis" : 4
}

(B) Number of shards this query was sent to
(c) Index on updated_at and username used to fetch documents

You can see from this explain() plan that the query was sent to only one shard (B) and that, when it ran on that shard, it used the index we created to satisfy the sort more efficiently (c). Note that this explain() output is from MongoDB v2.6 and earlier; the format changed in version 3.0 and later. Chapter 8 contains output from the explain() command when used on a MongoDB v3.0 server. Consult the documentation at https://docs.mongodb.org/manual/reference/method/cursor.explain/ for your specific version if you see any fields you don’t understand.
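If you read many of these plans, it can help to pull out just the fields discussed here. The following JavaScript sketch works on an abbreviated copy of the v2.6-style output from listing 12.1. The summarize function and its totalShards parameter are hypothetical helpers, not part of MongoDB, and the field names will not match the output of version 3.0 and later.

```javascript
// Abbreviated copy of the explain() document from listing 12.1 (v2.6 format).
const plan = {
  clusteredType: "ParallelSort",
  cursor: "BtreeCursor username_1_updated_at_-1",
  n: 100,
  nscanned: 100,
  numShards: 1,
  millis: 4,
};

// Summarize the fields discussed in the text. totalShards is supplied by
// the caller; it is not part of the explain() output.
function summarize(plan, totalShards) {
  return {
    // Treat the query as targeted if it reached fewer than all shards.
    targeted: plan.numShards < totalShards,
    // In this output format, an index scan shows up as a BtreeCursor.
    usesIndex: plan.cursor.startsWith("BtreeCursor"),
    // Index entries scanned per document returned; 1 means no wasted scanning.
    scannedPerResult: plan.n > 0 ? plan.nscanned / plan.n : null,
  };
}

console.log(summarize(plan, 2));
```

For the plan in the listing, run against the two-shard cluster from this chapter, the summary reports a targeted query that uses an index and scans one index entry per returned document.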

12.5.4 Aggregation in a sharded cluster

It’s worth noting that the aggregation framework also benefits from sharding. The analysis of an aggregation is a bit more complicated than a single query, and may change between versions as new optimizations are introduced. Fortunately, the aggregation framework also has an explain() option that you can use to see details about how your query would perform. As a basic rule of thumb, the number of shards that an aggregation operation needs to contact is dependent on the data that the operation needs as input to complete. For example, if you’re counting every document in your entire database, you’ll need to query all shards, but if you’re only counting a small range of documents, you may not need to query every shard. Consult the current documentation at https://docs.mongodb.org/manual/reference/method/db.collection.aggregate/ for more details.
