Unique IDs

If you have dealt with any form data & database, you have had the need to assign unique IDs to data saved. Many a time these IDs are internal, but sometimes IDs need to be used in other ways by your users (e.g. to be emailed or pasted into some form etc). And the common two technique devs use for assigning an ID are either:

  1. a sequential number - an auto incrementing ID
  2. a random number generator - like UUID v4

However they both have at least one downside. The downside of a sequential number is that (obviously) the IDs are too predictable. It is too easy to run some scraper iterating over your URLs. And the downside of uuid v4 is that, well, it’s long and ugly.

For internal IDs, that users don’t need to deal with, both auto incrementing ID or uuid v4 seems fine. But when they got to paste it, email it or speak it over a phone or communicate it in any way, then it would be nicer to have shorter IDs. I have this use-case now, and I am thinking.. are any another alternatives where the IDs generated are shorter?


What does Stack overflow say?

A stack overflow answer gives me the math behind calculating the probability of generating a colliding ID. The math is called the “Birthday problem”

Given n number of IDs to generate and c number of characters of b character set (like hexadecimal), the probability of collision of IDs, approximately is:

p (approximate) = (n^2) / (2 * b^c)

(Academic note: the equation approximation assumes that p is less than 0.5)

Cool! So let’s decide on n, b and a “safe” probability of collision.

n - the number of IDs we are going to generate. I am taking this number to be as 10 million. My data is not in the “big data” space so this is big enough.

b - character set. uuid v4 uses hex. But it’s better if we can use a larger character set, since that would make the ID shorter. For my requirement, I need to be able to use it in a URL part (and also be printable on paper). So I am going for upper case alphanumeric characters. Which gives us 36 characters. I would take a tip from Base58 encoding and avoid visually similar characters (namely 0,O,I and L), but.. for this post let me stick to upper case alphanumeric characters.

p - and finally a “safe” margin for collisions. I have run simulation tests on various p values, and figured out 0.001 is sufficient to ensure IDs are unique, given that you are using a reasonably good pseudo random number generator. I am using node.js Math.random() function.

We can now compute how many characters that is required:

0.001 = (10^7)^2 / (2 * 36^c)

And the answer is approximately 11 character length.

So for 10m items, a 11 character alphanumeric ID seems sufficient to prevent ID collisions.


Sortability

The random IDs I have suggested so far are not sortable by creation date. Being able to sort the IDs, also makes sequential access (from disk) for databases faster. So some databases like MongoDB use a timestamp prefix and random suffix to generate their IDs. This would keep the ID pretty random looking, and yet sortable by string prefix.

One such convention would be = “YYSSSSSRRRR” where YY = year in base 36, “zero” padded to 2 character length SSSSS = seconds from start of year in base 36, “zero” padded to 5 character length, RRRR = 4 random base 36 characters, giving total of 11 character length like the math I calculated before

Additional reference on this topic:

That’s all folks. You can use the math and adjust the character length for your use case. The point of this post was to come up with an educated guess on how many characters you really need.


Implementation with JS

If I go for the 32 glyph character set, the character length I need is still 11 characters. So I am going to implement that.

const characterSet = '123456789ABCDEFGHJKMNPQRSTUVWXYZ';
function generateUniqueId() {
  return 'x'
    .repeat(11)
    .replace(/x/g, () => characterSet[Math.trunc(Math.random() * 32)]);
}

generateUniqueId() // e.g. YMSATXVFM94AG


Blog Post Revision 2021

In the initial version of this blog post:

  1. I used p value in percentage. Thats the way I was thinking at that point in time. But in mathematic literature, everyone use p as a value between 0 and 1. Also saying “0.1%” is no more intuitive than saying “0.001”, so I changed equation back to value between 0 and 1.

  2. I tried to make the IDs “unguessable” to brute force attacks over an HTTP end-point. I took a arbitrary number to multiply the p value with, to try and protect from such attacks, but in reality a brute force attack’s success totally depends on how many requests the attacker can make to the end-point. I don’t want to get deeper into that topic for now. So the value of p is now based on n, the max number of IDs that one intended to generate, which I used in my collision simulation tests as well.



Munawwar Firoz

Software Developer | Web Dev