Programming: PHP, Modern Systems, and GUIDs or UUIDs
Posted by Greg Bulmash in Web Programming, tags: PHPSo, I'm working on a project where I want to assign unique IDs to some elements in a list. I did some research on GUIDs (Globally Unique Identifiers) and UUIDs (Universally Unique Identifiers), but the whole thing seemed a bit too overwrought for me.
See, my IDs only have to be unique in a list. That list will probably have, at most, 200 members. I have a lot less worry about collisions (generating and assigning the same ID to two separate items, thus causing confusion). There are a few GUID/UUID generator scripts out there that do all sorts of gymnastics to generate a 32-40 character ID so unique, there's supposedly no chance of it colliding with another ID of that length... ever.
The math bears that out to some extent. A 32-character ID using just A-Z and 0-9 (a.k.a. Base 36) has over 63 quindecillion, or 63 times a trillion times a trillion times a trillion times a trillion different IDs. Put another way, a 70 kilogram (approx 154 pounds) person has approximately 7 octillion atoms in their body, which is more atoms than there are stars in the universe at current estimates. 63.3 quindecillion is enough to give every atom in every person on Earth over a trillion unique identifiers... per atom... and that's counting the fat people too.
Still, there's a part of me that thinks that a 32-character code (36 with dashes) is just too dang long. I should program a function to generate a 6 character Base 36 code, because while that raises chances of collision from 1 in gazillions to one in 2 billion, it saves 30 characters worth of disk space and computational gymnastics. Since I only need the IDs to be locally unique to one file or one database table, I should program the quick function to generate the code (pick a random number between 1 and 2 billion, then use base_convert(picked_num, 10, 36) to change the random number to a Base 36 code).
But, I do realize that some of that mindset, while good because we should always seek the most efficient way to do things, also gives some chance of collision, requiring I write and use collision check/prevention code elsewhere in the script, so it gets balanced out. Additionally, that mindset is a bit of an old-timer mindset. I'm old enough to remember being excited when the price of hard drives dropped under $1 per megabyte, letting me get a 250 megabyte drive for $245 back in the 90s. Yesterday I saw a 1.5 terabyte external drive at Costco for $139.99. At that cost, 250 megabytes of that drive costs under 2.45 cents. Put simply, I was paying 10,000 times more per megabyte 15 years ago than I'm paying today.
Okay, since we're using an ASCII-safe character set and storing with UTF-8 encoding, it means the 36-character code (32 letters and numbers + 4 dashes) is going to require 30 more characters of storage. To be fair, we'll triple it for overhead. To use up $1 worth of extra disk space on that 1.5 terabyte drive, I'd need to store over 99 million IDs. In my project, I'm figuring a seriously heavy user might generate a few thousand over the course of a year. I don't really have to be sparing to control storage costs.
What about the computational cost, then? Efficiency is important. But, then again, how much does processing horsepower cost? Using the gigaflops (billions of floating-point operations per second) measure applied to super computers... Back in 1997, computing power cost approximately $30,000 per gigaflop in a system sporting two 16-processor Beowulf clusters with Pentium Pro microprocessors. Now, a desktop sporting a Core 2 Quad 8200 ($149.99 at Newegg when I last checked) delivers 37.28 gigaflops at a cost of $4.02 per gigaflop.
I decided to try a few methods for generating unique ID's. I tried a class that generates a 32-character UUID/GUID posted by Marius Karthaus in the comments of the uniqid() function documentation on php.net, the uniqid() function which generates a 13-character ID, then I tried picking a random number between one and a large number and converting it to base 36 using 2,000,000,000 and 9,000,000,000,000,000,000 (two billion and 9 quintillion). I topped out at 9 quintillion because when I went up to 90 quintillion, the result for the mt_rand() function was always 1. I generated 50 sets of 5,000 IDs for each and got the following average execution times per set (running on my local development environment - a MacBook Pro):
UUID: 0.16870986938477 seconds
Random 2 Billion: 0.021249299049377 seconds
Random 9 Quintillion: 0.024264307022095 seconds
UNIQID: 0.11121699333191 seconds
Computationally, if you factor the time to create versus the odds of collision, the UUID costs less than twice uniqid()'s method, but gives you a few gazillion times as many possible IDs (3.7 x 10^29 times as many). You'll cut processing time by a factor of close to 8 if you go with the one-in-two-billion method, and close to a factor of 6 if you go with the one-in-nine-quintillion method. Of course, even with the most computationally expensive method, generating more unique IDs than I expect any user to need in a year uses less than 2/10ths of a second.
So, if a REALLY heavy user completes enough tasks that they need 5,000 unique IDs, those IDs (assuming that data and overhead amount to 108 bytes) will use up 1/198th of a penny worth of disk space and 1/18,000th of an hour worth of computing time.
I still feel compelled through years of conditioning to try to squeeze every last bit of efficiency out of my code, but the fact that I spent hours researching this, testing it, and determining the relative efficiencies of the various methods means I probably cost myself more than all the computing and disk space savings combined among everybody who ever uses the script I'm working on.


Entries (RSS)