Data Store Shards in Google App Engine?

The picture on the right shows the unique id numbers of some of the shortened URLs in the ur.ly database, sorted by created date. Those unique ids are automatically generated by Google App Engine’s data store. Surprised by what you see?

Most databases make it easy to generate auto-incrementing id numbers as keys for database access. At first glance, it’s surprising that the ids generated by GAE are not in order. They aren’t random, but there are some interesting patterns. This shouldn’t surprise us – what we’re seeing is one way that the data store makes scaling possible.

It looks like the data store is partitioned or sharded so that different groups or sets of items live in different databases. Ids 28-33 live in one place while 14-18 live in another. Each shard is responsible for generating its own unique ids, and the range of ids a given shard can generate is somehow limited so ids from different servers won’t collide (see the auto_increment_increment and auto_increment_offset variables in MySQL for something similar). I also assume that ids are distributed (think memcached’s consistent hashing) so finding the correct shard for an id is quick. If ur.ly ever gets really busy, it’ll be interesting to look for evidence of a larger number of shards, perhaps dynamically allocated in response to need.

ur.ly – Dang Short Urls Powered by Google App Engine

Google App Engine let’s you build web apps that run on Google’s infrastructure. What’s the best way to get familiar with a new framework like this? Build something — preferably something simple and useful, and that’s what I set out to do.

I’ve played around in the URL-shortening space pioneered by TinyURL and understand the problem domain well. I’ve built a couple of these in PHP and Ruby (merb) and registered the ur.ly domain in September 2007 but never released anything. When GAE launched, memories of Steve Rubel’s Could a Billion TinyURL’s Go 404? post (hat tip to Dave Winer) echoed in my brain. Why not build a URL-shortener on GAE and let Google worry about scalability?

So I built ur.ly, a simple, super-scalable (thanks GAE), fast (yeah memcached) URL-shortening application. There’s an API to make it easy to use and it’s open source so you can play with the code or run your own.

Feedback anyone?