02 April 2015

ID Generation Strategies

The theories

Historically, common practice has been to let the database decide the next ID for a new entity. The database was THE place where consistency was maintained, and therefore the best source of truth for which ID is available next.

Things can go wonky from there when you get into systems with eventual consistency between writes and reads. As a result, I have been investigating other ID generation strategies.

One that has become popular is to use UUIDs for entity IDs. Clients can generate UUIDs at will and they are reasonably guaranteed to be unique. Initially, I thought I would switch to this, but further research brought up two problems. First, platforms differ in how well they can generate UUIDs, HTML5 in particular. More importantly, UUIDs (a.k.a. GUIDs) are hard to remember, to communicate, and to type. Even if they aren't exposed to the users, *I* still have to deal with them when debugging, querying, etc.

So the next idea was to use an ID issuance service, where the client first requests a new ID. This can be retried as necessary until one is obtained. Once obtained, the client makes requests against the ID. The downside here is the risk of unused IDs due to transient failure. I could also imagine misbehaving retries creating swathes of unused IDs.

Then you can diverge into ID request tracking -- clients send a UUID, then poll another resource by UUID to watch for the ID to be generated... or at the very least the UUID can enable retries on creating the same ID. Or you can use a HiLo type algorithm, where the client itself reserves a block of IDs ahead of time and uses them until it runs out. Again, the minor drawbacks are non-sequential IDs (across different clients) and (blocks of) unused IDs.
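To make the HiLo idea concrete, here is a minimal sketch in Python. It is not taken from any particular library; HiServer, HiLoGenerator and block_size are names I made up for illustration, and the in-memory "server" stands in for whatever shared store would really hand out the high values.

```python
import threading


class HiServer:
    """Hypothetical stand-in for the shared store that hands out 'hi' values."""

    def __init__(self):
        self._next_hi = 0
        self._lock = threading.Lock()

    def reserve_block(self):
        # One round trip reserves an entire block of IDs for one client.
        with self._lock:
            hi = self._next_hi
            self._next_hi += 1
            return hi


class HiLoGenerator:
    """Client-side generator: each ID is hi * block_size + lo."""

    def __init__(self, server, block_size=100):
        self._server = server
        self._block_size = block_size
        self._hi = None
        self._lo = block_size  # forces a reservation on first use

    def next_id(self):
        if self._lo >= self._block_size:
            # Current block is exhausted (or never reserved): get a new one.
            self._hi = self._server.reserve_block()
            self._lo = 0
        new_id = self._hi * self._block_size + self._lo
        self._lo += 1
        return new_id
```

With two clients and a block size of 3, client A gets 0, 1, 2 and then jumps to 6, because client B reserved 3 through 5 in the meantime. If B stops after using only 3, then 4 and 5 are never used. That is the non-sequential and unused-block drawback in miniature.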

An Implementation

My first attempt at ID generation used the conventional model: create the ID when the entity is created, based on the next available ID from the database, then return that ID to the caller (e.g. in a Location header, for REST-oriented folks). The client also sends me a GUID (or UUID) with the request so I can trap retries (... or can I?). Ultimately the next important decision point was:

What do I do if a client sends me the same Request UUID more than once? To my horror, the answer is predictably: "That depends." If the client initially failed to receive a response and is retrying the entity creation (and assuming the server can query generated IDs by UUID), then the correct response would be to give them back the same previously generated ID for that UUID. However, if it's months later and, due to a bad pseudo-RNG, the same UUID is generated for a fresh request, then the correct response would be to error. Or perhaps better, I could still return the previously generated ID and depend on the POST itself to fail because the entity had already been created. This is essentially a concurrency error where I expected entity version -1 (not created), but the actual version was >= 0.
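Here is a rough Python sketch of that last option: always hand back the previously issued ID for a repeated request UUID, and let the creation step fail on a version mismatch. EntityService, issue_id, create and ConcurrencyError are hypothetical names, and the dictionaries stand in for a real store. Treat it as an illustration of the idea, not as how my code actually looks.

```python
import uuid


class ConcurrencyError(Exception):
    """Raised when the actual entity version differs from the expected one."""


class EntityService:
    def __init__(self):
        self._next_id = 1
        self._ids_by_request = {}  # request UUID -> issued entity ID
        self._versions = {}        # entity ID -> current version

    def issue_id(self, request_id):
        # A repeated request UUID gets the previously issued ID back.
        if request_id not in self._ids_by_request:
            self._ids_by_request[request_id] = self._next_id
            self._next_id += 1
        return self._ids_by_request[request_id]

    def create(self, entity_id, expected_version=-1):
        # Version -1 means "not created yet".
        actual = self._versions.get(entity_id, -1)
        if actual != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, actual {actual}"
            )
        self._versions[entity_id] = 0


svc = EntityService()
request_id = str(uuid.uuid4())
entity_id = svc.issue_id(request_id)
svc.create(entity_id)                    # first attempt succeeds
assert svc.issue_id(request_id) == entity_id  # a retry sees the same ID
```

A genuine retry after a lost response gets the same ID back and can carry on. A colliding UUID months later also gets the old ID back, but its create call raises ConcurrencyError because the entity is already at version 0.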

Implementing ID generation tracking in order to allow at-most-once creation semantics has its own challenges. Tracking means storing the request ID/entity ID pairs and then either querying or caching this store. Querying can be a problem depending on your ID store. I'm using EventStore so as to avoid adding another database into the mix, and querying each time is essentially like a table scan -- not ideal. Caching is a brand of musical chairs that can work, but can get complicated if "done right". The sticking point on caching for me is expiring least-recently-used entity ID sets (each entity has its own incrementing set of IDs) so that memory usage doesn't grow unbounded. Maybe in 5 years every RequestID/EntityID pair fits in memory just fine, but maybe it doesn't! Cache-miss loading performance is also likely to become an issue over time.
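For a feel of the caching side, here is a small Python sketch of a bounded least-recently-used cache. It is simplified: it evicts individual RequestID/EntityID pairs rather than whole per-entity ID sets, and RequestIdCache and load_from_store are made-up names, with load_from_store standing in for the slow query against the real store.

```python
from collections import OrderedDict


class RequestIdCache:
    """Bounded request-ID -> entity-ID map with least-recently-used eviction."""

    def __init__(self, capacity, load_from_store):
        self._capacity = capacity
        # Hypothetical slow lookup; returns None for an unknown request ID.
        self._load = load_from_store
        self._entries = OrderedDict()

    def get(self, request_id):
        if request_id in self._entries:
            self._entries.move_to_end(request_id)  # mark as most recently used
            return self._entries[request_id]
        entity_id = self._load(request_id)         # cache miss: the expensive path
        if entity_id is not None:
            self.put(request_id, entity_id)
        return entity_id

    def put(self, request_id, entity_id):
        self._entries[request_id] = entity_id
        self._entries.move_to_end(request_id)
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)      # drop the least recently used
```

Memory stays bounded by capacity, but every eviction turns a later lookup for that request ID into a trip back to the store, which is exactly the cache-miss cost I'm worried about.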

Current Conclusions

Should I try to guarantee at-most-once creation? Well, the record-keeping requirements on the back-end certainly make me question that!

For a separate ID issuance service, is issuing an ID that never gets used really so bad (let's say due to a transient network failure and retry)? Having gaps in ID issuance appears to be psychologically damaging to certain personality types, and it *is* good to be conscientious about things "falling through the cracks". What can we do to sate that? I suppose a report could suffice to satisfy that curiosity. Maybe even a follow-up procedure to officially tombstone issued-but-unused IDs after a certain time period, if it's important that every one be accounted for.
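The report and tombstone procedure could be as simple as the following Python sketch. Everything here is hypothetical (IssuanceLog, unused_report, tombstone_stale, the max_age cutoff), and the in-memory collections stand in for real storage. The caller passes the current time in, so the behavior is easy to check.

```python
from datetime import datetime, timedelta


class IssuanceLog:
    def __init__(self):
        self._issued_at = {}     # entity ID -> time the ID was issued
        self._used = set()       # IDs that an entity was actually created with
        self._tombstoned = set() # IDs officially written off
        self._next_id = 1

    def issue(self, now):
        new_id = self._next_id
        self._next_id += 1
        self._issued_at[new_id] = now
        return new_id

    def mark_used(self, entity_id):
        self._used.add(entity_id)

    def unused_report(self, now, max_age):
        """IDs issued more than max_age ago that were never used or tombstoned."""
        return sorted(
            entity_id
            for entity_id, issued in self._issued_at.items()
            if entity_id not in self._used
            and entity_id not in self._tombstoned
            and now - issued > max_age
        )

    def tombstone_stale(self, now, max_age):
        # Write off everything the report would list, and return what was written off.
        stale = self.unused_report(now, max_age)
        self._tombstoned.update(stale)
        return stale
```

Run on a schedule, unused_report answers "what fell through the cracks?", and tombstone_stale closes the books on those IDs so they stop showing up.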

That actually wouldn't be procedurally much different from allowing a client to double-post the same customer, causing two different customer IDs with the same data, and then having to administratively go back and delete one. Although the issued-but-unused-ID method leaves your data in potentially better shape in the interim.

For the ID issuance service, there is still an itch that I want to scratch: misbehaving clients (let's say due to a bug) requesting lots of IDs. You could just say "Who cares if 10 million IDs were issued but unused before we fixed the problem?" and go on about your day. But in principle, I'm unlikely to say that. And I haven't thought through the options enough to settle on a good remedy. A throttle is the first answer that comes to mind, but that's terribly boring and makes me think that I really need to look at the problem differently to illuminate better options.

Oh, and by the way...

And finally, I want to say that exploring ID generation has brought up a very important shift in my thinking about IDs -- that is, an ID is metadata about an object rather than part of its content. When you look at most existing code, you see the entity ID being part of the entity itself. But ID issuance begs to differ, because the ID must be known before the entity can even be created. The client-generated GUID (which is also ID issuance, but from the client instead of the server) also takes this tack. In fact, the SQL database itself takes this tack, in that it has to know what the next available ID is before it inserts the data. But our code has been so tightly integrated with SQL implementation details up to now that it was taken as a given that the ID needs to be part of the data itself. When really, that is just what SQL requires - the ID being part of the data row.

Functionally, the data itself doesn't usually give a flying rip about its own ID number when actually doing work. The important part is that the infrastructure knows about identity and can appropriately provision work and load data when given an ID. NOTE: In the relational DB world, an entity may care about the ID number of a *separate* entity insofar as it needs to ask the infrastructure to load that other entity's data.
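In code, the separation might look something like this Python sketch. Customer, Identified and Repository are illustrative names only: the entity carries no ID field, and the infrastructure pairs an ID with the entity when it loads or saves.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Customer:
    # No ID field: the entity holds only its own content.
    name: str
    email: str


@dataclass(frozen=True)
class Identified(Generic[T]):
    # Hypothetical envelope: the infrastructure attaches identity as metadata.
    id: int
    entity: T


class Repository(Generic[T]):
    """In-memory stand-in for the infrastructure that knows about identity."""

    def __init__(self):
        self._items = {}

    def save(self, entity_id, entity):
        self._items[entity_id] = entity

    def load(self, entity_id):
        return Identified(entity_id, self._items[entity_id])


repo = Repository()
repo.save(42, Customer("Ada", "ada@example.com"))
loaded = repo.load(42)
assert loaded.id == 42
assert loaded.entity.name == "Ada"
```

The Customer never needs to know it is number 42. Only the repository and the envelope do.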

Anyway, this realization has affected how I model entities (e.g. in DDD, no more aggregate IDs on the aggregates themselves), and so I thought it was worth mentioning.
