Distributing data in a sharded cluster

Before discussing the different ways to shard, let’s look at how data is grouped and organized in MongoDB. This matters for sharding because it shows the different boundaries on which we can partition our data.

To illustrate this, we’ll use the Google Docs–like application we’ll build later in the chapter. Figure 12.2 shows how the data for such an application would be structured in MongoDB.

Looking at the figure from the innermost box moving outward, you can see there are four different levels of granularity in MongoDB: document, chunk, collection, and database.

These four levels of granularity represent the units of data in MongoDB:

■ Document—The smallest unit of data in MongoDB. A document represents a single object in the system and can’t be divided further. You can compare this to a row in a relational database. Note that we consider a document and all its fields to be a single atomic unit. In the innermost box in figure 12.2, you can see a document with a username field with a value of "hawkins".

■ Chunk—A group of documents clustered by values on a field. A chunk is a concept that exists only in sharded setups. This is a logical grouping of documents based on their values for a field or set of fields, known as a shard key. We’ll cover the shard key when we go into more detail about chunks later in this section, and then again in section 12.6. The chunk shown in figure 12.2 contains all the documents that have the field username with values between "bakkum" and "verch".

■ Collection—A named grouping of documents within a database. To allow users to separate a database into logical groupings that make sense for the application, MongoDB provides the concept of a collection. This is nothing more than a named grouping of documents, and it must be explicitly specified by the application to run any queries. In figure 12.2, the collection name is spreadsheets. This collection name essentially identifies a subgroup within the cloud-docs database, which we’ll discuss next.

■ Database—Contains collections of documents. This is the top-level named grouping in the system. Because a database contains collections of documents, a collection must also be specified to perform any operations on the documents themselves. In figure 12.2, the database name is cloud-docs. To run any queries, the collection must also be specified (spreadsheets in our example). The combination of database name and collection name is unique throughout the system and is commonly referred to as the namespace. It’s usually represented by concatenating the database name and the collection name, separated by a period character. For the example shown in figure 12.2, that would look like cloud-docs.spreadsheets.

[Figure 12.2: nested boxes, from outermost to innermost: database cloud-docs; collection spreadsheets; chunk of all documents with username between "bakkum" and "verch"; document {"username": "hawkins", …}]

Figure 12.2 Levels of granularity available in a sharded MongoDB deployment

340 CHAPTER 12 Scaling your system with sharding

Databases and collections were covered in section 4.3, and are present in unsharded deployments as well, so the only unfamiliar grouping here should be the chunk.
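As a small illustration of the namespace convention described above, the following sketch builds and splits a namespace string. The helper names make_namespace and split_namespace are hypothetical and are not part of any MongoDB driver, and the split assumes that the database name itself contains no period.

```python
from typing import Tuple


def make_namespace(database: str, collection: str) -> str:
    """Join a database name and a collection name with a period."""
    return f"{database}.{collection}"


def split_namespace(namespace: str) -> Tuple[str, str]:
    """Split a namespace at its first period.

    Assumption: the database name contains no period, so everything
    after the first period is treated as the collection name.
    """
    database, _, collection = namespace.partition(".")
    return database, collection


# The example from figure 12.2.
print(make_namespace("cloud-docs", "spreadsheets"))
print(split_namespace("cloud-docs.spreadsheets"))
```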

12.3.1 Ways data can be distributed in a sharded cluster

Now you know the different ways in which data is logically grouped in MongoDB. The next questions are, how does this interact with sharding? On which of these groupings can we partition our data? The quick answer to these questions is that data can be distributed in a sharded cluster on two of these four groupings:

■ On the level of an entire database, where each database along with all its collections is put on its own shard.

■ On the level of partitions or chunks of a collection, where the documents within a collection itself are divided up and spread out over multiple shards, based on values of a field or set of fields called the shard key in the documents.

You may wonder why MongoDB partitions on chunks rather than on individual documents. Individual documents might seem like the most logical grouping because a document is the smallest possible unit. But we have to not only partition the data, we also have to be able to find it again. If we partitioned on a document level (for example, by allowing each spreadsheet in our Google Docs–like application to be moved around independently), we’d need to store metadata on the config servers to keep track of every single document. In a system with small documents, half of your data may end up being metadata on the config servers just keeping track of where your actual data is stored.
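To make the metadata argument concrete, here is a rough back-of-the-envelope sketch. All of the numbers (the document count and the number of documents per chunk) are hypothetical values chosen only for illustration; they are not MongoDB defaults, and routing_entries is a made-up helper name.

```python
import math


def routing_entries(num_documents: int, docs_per_group: int) -> int:
    """Location records needed when documents are tracked in groups of docs_per_group."""
    return math.ceil(num_documents / docs_per_group)


NUM_DOCUMENTS = 10_000_000  # hypothetical collection size
DOCS_PER_CHUNK = 50_000     # hypothetical average number of documents per chunk

# Tracking every document individually: one record per document.
per_document = routing_entries(NUM_DOCUMENTS, 1)

# Tracking chunks: one record per range of documents.
per_chunk = routing_entries(NUM_DOCUMENTS, DOCS_PER_CHUNK)

print(per_document, per_chunk)
```

Under these assumed numbers, tracking chunks needs a few hundred records where tracking documents needs millions, which is the trade-off the paragraph above describes.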

Granularity jump from database to partition of collection

You may also wonder why there’s a jump in granularity from an entire database to a partition of a collection. Why isn’t there an intermediate step where we can distribute on the level of whole collections, without partitioning the collections themselves?

The real answer to this question is that it’s completely theoretically possible. It just hasn’t been implemented yet. Fortunately, because of the relationship between databases and collections, there’s an easy workaround. If you’re in a situation where you have different collections (say, files.spreadsheets and files.powerpoints) that you want to be put on separate servers, you can store them in separate databases. For example, you could store spreadsheets in files_spreadsheets.spreadsheets and PowerPoint files in files_powerpoints.powerpoints. Because files_spreadsheets and files_powerpoints are two separate databases, they’ll be distributed, and so will the collections.

In the next two sections, we’ll cover each of the supported distribution methods. First we’ll discuss distributing entire databases, and then we’ll move on to the more common and useful method of distributing chunks within a collection.

12.3.2 Distributing databases to shards

As you create new databases in a sharded cluster, each database is assigned to a different shard. If you do nothing else, a database and all its collections will live forever on the shard where they were created. The databases themselves don’t even need to be sharded.

Because the name of a database is specified by the application, you can think of this as a kind of manual partitioning. MongoDB has nothing to do with how well-partitioned your data is. To see why this is manual, consider using this method to shard the spreadsheets collection in our documents example. To shard this two ways using database distribution, you’d have to make two databases (say, files1 and files2) and evenly divide the data between the files1.spreadsheets and the files2.spreadsheets collections. It’s completely up to you to decide which spreadsheet goes in which collection and come up with a scheme to query the appropriate database to find them later.

This is a difficult problem, which is why we don’t recommend this approach for this type of application.

When is the database distribution method really useful? One example of a real application for database distribution is MongoDB as a service. In one implementation of this model, customers can pay for access to a single MongoDB database. On the backend, each database is created in a sharded cluster. This means that if each client uses roughly the same amount of data, the distribution of the data will be optimal due to the distribution of the databases throughout the cluster.

12.3.3 Sharding within collections

Now, we’ll review the more powerful form of MongoDB sharding: sharding an individual collection. This is what the phrase automatic sharding refers to, because this is the form of sharding in which MongoDB itself makes all the partitioning decisions, without any direct intervention from the application.

To allow for partitioning of an individual collection, MongoDB defines the idea of a chunk, which as you saw earlier is a logical grouping of documents, based on the values of a predetermined field or set of fields called a shard key. It’s the user’s responsibility to choose the shard key, and we’ll cover how to do this in section 12.8.

For example, consider the following document from a spreadsheet management application:

{
  _id: ObjectId("4d6e9b89b600c2c196442c21"),
  filename: "spreadsheet-1",
  updated_at: ISODate("2011-03-02T19:22:54.845Z"),
  username: "banks",
  data: "raw document data"
}

If all the documents in our collection have this format, we can, for example, choose a shard key of the _id field and the username field. MongoDB will then use that information in each document to determine what chunk the document belongs to.

How does MongoDB make this determination? At its core, MongoDB’s sharding is range-based; this means that each “chunk” represents a range of shard keys. When MongoDB looks at a document to determine what chunk it belongs to, it first extracts the values for the shard key and then finds the chunk whose shard key range contains the given shard key values.
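The two steps described above (extract the shard key values, then find the chunk whose range contains them) can be sketched as follows. This is a simplified illustration, not MongoDB’s implementation: the Chunk class, the function names, and the sample ranges are hypothetical, and Python’s tuple and string ordering stands in for MongoDB’s own comparison rules. The sketch assumes each range includes its lower bound and excludes its upper bound.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk record: a half-open range [min_key, max_key) of shard key values."""
    min_key: Tuple
    max_key: Tuple
    shard: str

    def contains(self, key: Tuple) -> bool:
        return self.min_key <= key < self.max_key


def extract_shard_key(document: dict, fields: Sequence[str]) -> Tuple:
    """Step 1: pull the shard key values out of the document, in field order."""
    return tuple(document[field] for field in fields)


def find_chunk(chunks: Sequence[Chunk], key: Tuple) -> Optional[Chunk]:
    """Step 2: return the chunk whose range contains the key (linear scan for clarity)."""
    for chunk in chunks:
        if chunk.contains(key):
            return chunk
    return None


# Illustrative data only: two chunks over a single-field shard key, username.
chunks = [
    Chunk(("a",), ("m",), "A"),
    Chunk(("m",), ("z",), "B"),
]
document = {"filename": "spreadsheet-1", "username": "banks"}
key = extract_shard_key(document, ["username"])
print(key, find_chunk(chunks, key).shard)
```

Passing more than one field name to extract_shard_key yields a multi-value tuple, which is how a compound shard key would be compared in this sketch.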

To see a concrete example of what this looks like, imagine that we chose a shard key of username for this spreadsheets collection, and we have two shards, “A” and “B.” Our chunk distribution may look something like table 12.1.

Looking at the table, it becomes a bit clearer what purpose chunks serve in a sharded cluster. If we gave you a document with a username field of "Babbage", you’d immediately know that it should be on shard A, just by looking at the table. In fact, if we gave you any document that had a username field, which in this case is our shard key, you’d be able to use table 12.1 to determine which chunk the document belonged to, and from there determine which shard it should be sent to. We’ll look into this process in much more detail in sections 12.5 and 12.6.

Table 12.1 Chunks and shards

Start     End       Shard
-∞        Abbot     B
Abbot     Dayton    A
Dayton    Harris    B
Harris    Norris    A
Norris    ∞         B
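The lookup that table 12.1 makes possible can be sketched in a few lines. This is only an illustration of range-based routing using the table’s boundaries. It assumes that each chunk includes its start value and excludes its end value, and that Python’s string ordering is an acceptable stand-in for MongoDB’s comparison rules. The function name shard_for is hypothetical.

```python
from bisect import bisect_right

# Chunk boundaries from table 12.1. The first chunk has no lower bound and
# the last chunk has no upper bound, so only the four interior split points
# need to be stored.
SPLIT_POINTS = ["Abbot", "Dayton", "Harris", "Norris"]
SHARDS = ["B", "A", "B", "A", "B"]  # one shard per chunk, in range order


def shard_for(username: str) -> str:
    """Return the shard holding the chunk whose range contains username."""
    # bisect_right counts the split points <= username, which is the index
    # of the chunk [start, end) that the value falls into.
    return SHARDS[bisect_right(SPLIT_POINTS, username)]


print(shard_for("Babbage"))
```

A binary search over sorted split points keeps the lookup cheap even with many chunks, which is one reason a range table like this is practical as routing metadata.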
