Posts Tagged “PHP”

So, I'm working on a project where I want to assign unique IDs to some elements in a list. I did some research on GUIDs (Globally Unique Identifiers) and UUIDs (Universally Unique Identifiers), but the whole thing seemed a bit too overwrought for me.

See, my IDs only have to be unique in a list. That list will probably have, at most, 200 members. I have a lot less worry about collisions (generating and assigning the same ID to two separate items, thus causing confusion). There are a few GUID/UUID generator scripts out there that do all sorts of gymnastics to generate a 32-40 character ID so unique, there's supposedly no chance of it colliding with another ID of that length... ever.

The math bears that out to some extent. A 32-character ID using just A-Z and 0-9 (a.k.a. Base 36) has over 63 quindecillion, or 63 times a trillion times a trillion times a trillion times a trillion different IDs. Put another way, a 70 kilogram (approx 154 pounds) person has approximately 7 octillion atoms in their body, which is more atoms than there are stars in the universe at current estimates. 63.3 quindecillion is enough to give every atom in every person on Earth over a trillion unique identifiers... per atom... and that's counting the fat people too.

Still, there's a part of me that thinks that a 32-character code (36 with dashes) is just too dang long. I should program a function to generate a 6 character Base 36 code, because while that raises chances of collision from 1 in gazillions to one in 2 billion, it saves 30 characters worth of disk space and computational gymnastics. Since I only need the IDs to be locally unique to one file or one database table, I should program the quick function to generate the code (pick a random number between 1 and 2 billion, then use base_convert(picked_num, 10, 36) to change the random number to a Base 36 code).

But, I do realize that some of that mindset, while good because we should always seek the most efficient way to do things, also gives some chance of collision, requiring I write and use collision check/prevention code elsewhere in the script, so it gets balanced out. Additionally, that mindset is a bit of an old-timer mindset. I'm old enough to remember being excited when the price of hard drives dropped under $1 per megabyte, letting me get a 250 megabyte drive for $245 back in the 90s. Yesterday I saw a 1.5 terabyte external drive at Costco for $139.99. At that cost, 250 megabytes of that drive costs under 2.45 cents. Put simply, I was paying 10,000 times more per megabyte 15 years ago than I'm paying today.

Okay, since we're using an ASCII-safe character set and storing with UTF-8 encoding, it means the 36-character code (32 letters and numbers + 4 dashes) is going to require 30 more characters of storage. To be fair, we'll triple it for overhead. To use up $1 worth of extra disk space on that 1.5 terabyte drive, I'd need to store over 99 million IDs. In my project, I'm figuring a seriously heavy user might generate a few thousand over the course of a year. I don't really have to be sparing to control storage costs.

What about the computational cost, then? Efficiency is important. But, then again, how much does processing horsepower cost? Using the gigaflops (billions of floating-point operations per second) measure applied to super computers... Back in 1997, computing power cost approximately $30,000 per gigaflop in a system sporting two 16-processor Beowulf clusters with Pentium Pro microprocessors. Now, a desktop sporting a Core 2 Quad 8200 ($149.99 at Newegg when I last checked) delivers 37.28 gigaflops at a cost of $4.02 per gigaflop.

I decided to try a few methods for generating unique ID's. I tried a class that generates a 32-character UUID/GUID posted by Marius Karthaus in the comments of the uniqid() function documentation on php.net, the uniqid() function which generates a 13-character ID, then I tried picking a random number between one and a large number and converting it to base 36 using 2,000,000,000 and 9,000,000,000,000,000,000 (two billion and 9 quintillion). I topped out at 9 quintillion because when I went up to 90 quintillion, the result for the mt_rand() function was always 1. I generated 50 sets of 5,000 IDs for each and got the following average execution times per set (running on my local development environment - a MacBook Pro):

UUID: 0.16870986938477 seconds
Random 2 Billion: 0.021249299049377 seconds
Random 9 Quintillion: 0.024264307022095 seconds
UNIQID: 0.11121699333191 seconds

Computationally, if you factor the time to create versus the odds of collision, the UUID costs less than twice uniqid()'s method, but gives you a few gazillion times as many possible IDs (3.7 x 10^29 times as many). You'll cut processing time by a factor of close to 8 if you go with the one-in-two-billion method, and close to a factor of 6 if you go with the one-in-nine-quintillion method. Of course, even with the most computationally expensive method, generating more unique IDs than I expect any user to need in a year uses less than 2/10ths of a second.

So, if a REALLY heavy user completes enough tasks that they need 5,000 unique IDs, those IDs (assuming that data and overhead amount to 108 bytes) will use up 1/198th of a penny worth of disk space and 1/18,000th of an hour worth of computing time.

I still feel compelled through years of conditioning to try to squeeze every last bit of efficiency out of my code, but the fact that I spent hours researching this, testing it, and determining the relative efficiencies of the various methods means I probably cost myself more than all the computing and disk space savings combined among everybody who ever uses the script I'm working on.

  • Share/Bookmark

Comments No Comments »

So, I had this problem. I have Marc Liyanage's PHP 5.2.4 package installed on my MacBook Pro and use it for local dev tasks. It's supposed to have JSON installed, phpinfo says it has JSON installed, but when I ran a bit of code that used the json_decode() function, I got the "call to undefined function" error.

Read the rest of this entry »

  • Share/Bookmark

Comments No Comments »

So, I was working on a project that generates a form based on a selection of items from a database, and the form could get pretty long. For performance testing I timed the execution of the script and found some very odd results. If the number of items on the list was 16 or thereabouts, the execution was about 5-7 one-thousands of a second. If the number of the items on the list was 50 (the maximum displayed on a page), the execution time was around 100 times longer.

I was stumped, how did increasing the number of items in the list by a factor of around 3 increase the execution time by a factor of 100?

I initially thought it was the database query. If the list was 16 items, that was the maximum number returned for the query. If the list was 50 items, it might have been the first 50 out of 300, which required extra sorting and organizing. But the database queries were timing out at close to the same speed.

I checked the processing of the data. Perhaps there was some bottleneck in my code that I could isolate. But as I broke the script down into smaller and smaller pieces, I found that all the parts that created the dynamically generated page were working fine up until I echoed the string that sent the compiled page to the user...

echo $page;

It was that line that took a couple of thousandths of a second when there were 16 questions, but took two or three hundred times longer when you tripled the length of $page. In fact, the tipping point was really between 19 and 20 questions. At 19 questions, it was executing in around 0.0085 seconds. At 20 questions, the execution time jumped to 0.27 seconds. Adding just one question caused the script to take 30 times longer to execute.

I Googled the topic and found that this is a well-known weakness of PHP. And, as a matter of fact, output in general is a weakness of PHP, because even knowing about this, people have long argued about whether it's faster to use "print" or "echo".

There were a number of suggestions on how to compensate, such as ob_start() and ob_flush to use output buffering with more fine grained control, but that wasn't doing the trick for me. Another suggested breaking the output string into chunks and outputting them individually in sequence. That also didn't work... at least not with his value of 8192 (8 kilobytes) as a chunk size. So I tried larger chunks (48 kilobytes, since that seemed to be the upper limit before slowing occurred) and smaller chunks (256 bytes, getting real small), but those weren't working. I was about ready to give up when I decided on one more try. I tried 2048 (2 kilobytes) and things finally sped up, and it seemed the best speed came from a 1-kilobyte chunk size.

So, if you find a PHP script is running really slow, the culprit might be something as simple as "echo $string" and the only way to fix it is to performance tune the output of that string. Amazing, huh?

  • Share/Bookmark

Comments 3 Comments »

This evening I got a comment on my post about how to detect when cell phones access your web site. It was a question about the way I went about the PHP code for matching a bunch of small text snippets against the User Agent string.

Basically, I put all the text snippets in a large array and then used a foreach loop to match each one individually against the User Agent string. A visitor to the page asked why I hadn't done a very large regular expression, essentially including all those snippets with "or" symbols between them. So instead of "match 1, match 2, match 3," I'd get "match 1 or 2 or 3". She asked: "Is this more efficient processing, or more organized for you, or avoids some possible error getting thrown?"

Well, the initial answer was: "I just didn't think of doing a giant regex. I like arrays and my mind went in that direction."

But that only answers why I did it, not whether it's better. And while this is such a small piece of code that you can execute it 10,000 times in under 1.7 seconds, so it's not going to increase your server load significantly, I realized that on a very high traffic web site, cutting execution time even on such a small piece of code could make a difference. So I tested it.

I ran both methods (the foreach and the array as a giant regex) against a string of 190-195 characters in 10,000 iteration batches, and moved the bit of matching text around the string.

I got approximately the same execution time for the foreach method no matter where the matching text was in the string. But with the giant regex, the execution times varied by huge amounts. The farther the matching text was from the beginning of the string, the slower the giant regex worked. When the matching text was in the first 10 characters, the giant regex was 6-7 times faster than the foreach method. If it was around the 30th character, around 4 times faster.

But when the matching text was somewhere around the 150th character, the giant regex took 50-60% longer than the foreach method. When there was no match whatsoever, the giant regex took over twice as long.

I then tried increasing the length of the string (with no matches in it). The performance disparity grew deeper and deeper. At 760 characters, the giant regex was taking around 4.25 times longer than the foreach method. At 1520 characters (around 250 words), the giant regex had slipped to 5.4 times longer than the foreach method. Yet if there was a match in the first 10 characters of the 1520-character string, the giant regex took no longer than if they were in the first ten characters of a 190-character string.

Now because User Agent strings generally aren't too long and you don't have to go too deep to get a match, in this specific case, using the giant regex would probably be more computationally efficient. And if you were running this snippet millions of times a day, you could see tangible gains from optimizing it. But as a rule of thumb, the foreach method is going to be the better general purpose choice.

So, Jane, I hope that answered your question.

  • Share/Bookmark

Comments 1 Comment »

Get an angel for your site An Angel Watches Over This Site