Unique IDs

If you have dealt with any form data & database, you have had the need to assign unique IDs to data saved. Many a time these IDs are internal, but sometimes IDs need to be used in other ways by your users (e.g. to be emailed or pasted into some form etc). And the common two technique devs use for assigning an ID are either:

  1. a sequential number - an auto incrementing ID
  2. a random number generator - like UUID v4

However they both have at least one downside. The downside of a sequential number is that (obviously) the IDs are too predictable. It is too easy to run some scraper iterating over your URLs. And the downside of uuid v4 is that, well, it’s long and ugly.

For internal IDs, that users don’t need to deal with, both auto incrementing ID or uuid v4 seems fine. But when they got to paste it, email it or speak it over a phone or communicate it in any way, then it would be nicer to have shorter IDs. I have this use-case now, and I am thinking.. are any another alternatives where the IDs generated are shorter?

What does Stack overflow say?

A stack overflow answer gives me the math behind calculating the probability of generating a colliding ID. The math is called the “Birthday problem”

Given n number of IDs to generate and c number of characters of b character set (like hexadecimal), the probability of collision of IDs (in %)

p (approximate) = (n^2) / (2 * b^c) * 100

Cool! So let’s decide on n, b and a “safe” probability of collision.

n - the number of IDs we are going to generate. I am taking this number to be as 10 million. My data is not in the “big data” space so this is big enough.

b - character set. uuid v4 uses hex. But it’s better if we can use a larger character set, since that would make the ID shorter. For my requirement, I need to be able to use it in a URL part (and also be printable on paper). So I am going for upper case alphanumeric characters. Which gives us 36 characters. I would take a tip from Base58 encoding and avoid visually similar characters (namely 0,O,I and L), but.. for this post let me stick to upper case alphanumeric characters.

p - and finally a “safe” margin for collisions - I am not sure what would be a “safe” probability. I’d like it to be harder for scrapers to guess the IDs as well. So I am going to go for 0.0001%.

We can now compute how many characters that is required:

0.0001 = (10^7)^2 / (2 * 36^c) * 100

And the answer is approximately 13 character length.

So for 10m items, a 13 character alphanumeric ID seems sufficient to prevent ID collisions and be non-guessable.

Sortability

The random IDs I have suggested so far are not sortable by creation date. Being able to sort the IDs, also makes sequential access (from disk) for databases faster. So some databases like MongoDB use a timestamp prefix and random suffix to generate their IDs. This would keep the ID pretty random looking, and yet sortable by string prefix.

One such convention would be = “YYSSSSSRRRRRR” where YY = year in base 36, “zero” padded to 2 character length SSSSS = seconds from start of year in base 36, “zero” padded to 5 character length, RRRRRR = 6 random base 36 characters, giving total of 13 character length like the math I calculated before

Additional reference on this topic - eager.io/blog/how-long-does-an-id-need-to-be/

That’s all folks. You can use the math and adjust the character length for your use case. The point of this post was to come up with an educated guess on how many characters you really need.


Implementation with JS

If I go for the 32 glyph character set, the character length I need is still 13 characters. So I am going to implement that.

const characterSet = '123456789ABCDEFGHJKMNPQRSTUVWXYZ';
function generateUniqueId() {
  return 'x'
    .repeat(13)
    .replace(/x/g, () => characterSet[Math.trunc(Math.random() * 32)]);
}

generateUniqueId() // e.g. YMSATXVFM94AG

Munawwar Firoz

Software Developer, Full Stack, Web