Data Store Shards in Google App Engine?

The picture on the right shows the unique id numbers of some of the shortened URLs in the ur.ly database, sorted by creation date. Those unique ids are automatically generated by Google App Engine’s data store. Surprised by what you see?

Most databases make it easy to generate auto-incrementing id numbers as keys for database access, so at first glance it’s surprising that the ids generated by GAE are not in order. They aren’t random either: there are some interesting patterns. On reflection, this shouldn’t surprise us – what we’re seeing is one way that the data store makes scaling possible.

It looks like the data store is partitioned or sharded so that different groups or sets of items live in different databases. Ids 28-33 live in one place while 14-18 live in another. Each shard is responsible for generating its own unique ids, and the range of ids a given shard can generate is somehow limited so ids from different servers won’t collide (see the auto_increment_increment and auto_increment_offset variables in MySQL for something similar). I also assume that ids are distributed (think memcached’s consistent hashing) so finding the correct shard for an id is quick. If ur.ly ever gets really busy, it’ll be interesting to look for evidence of a larger number of shards, perhaps dynamically allocated in response to need.
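
To make the guess above concrete, here is a minimal sketch of one scheme that would produce this kind of pattern: each shard leases a contiguous block of ids from a shared allocator and hands them out locally. This is only an illustration of the idea, not how GAE actually works; the names `BlockAllocator`, `Shard`, and `simulate`, and the block size of 5, are all made up for the example.

```python
class BlockAllocator:
    """Hypothetical coordinator that hands out non-overlapping id blocks."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.next_start = 1

    def lease(self):
        # Reserve the next contiguous block; no two leases ever overlap.
        start = self.next_start
        self.next_start += self.block_size
        return range(start, start + self.block_size)


class Shard:
    """Hypothetical shard that generates ids only from its leased block."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block = iter(())  # empty until the first id is requested

    def next_id(self):
        try:
            return next(self.block)
        except StopIteration:
            # Block exhausted (or never leased): fetch a fresh one.
            self.block = iter(self.allocator.lease())
            return next(self.block)


def simulate(write_order, block_size=5):
    """Return the ids assigned, in creation order, for a sequence of writes.

    `write_order` lists which shard (by name) handles each write.
    """
    allocator = BlockAllocator(block_size)
    shards = {}
    ids = []
    for name in write_order:
        if name not in shards:
            shards[name] = Shard(allocator)
        ids.append(shards[name].next_id())
    return ids


if __name__ == "__main__":
    # Writes alternate unevenly between two shards, "a" and "b".
    print(simulate(["a", "b", "a", "a", "b", "b", "a"]))
    # prints [1, 6, 2, 3, 7, 8, 4]
```

Sorted by creation time, the ids come out in interleaved runs rather than in order, yet they never collide because the blocks are disjoint. MySQL's auto_increment_increment and auto_increment_offset get the same no-collision guarantee differently, by striding (server 1 gets 1, 3, 5, ... and server 2 gets 2, 4, 6, ...) rather than by leasing blocks.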
