Simple Threading With PHP and MySQL
Posted by Greg Bulmash in Techno Thoughts, Web Programming, tags: mysql, PHPWhile working on some projects, I wanted to develop a comment engine, so people could comment on posts. I know, I could just use a blog, but a blog wouldn't have fit these applications.
Additionally, I wanted comment threading and to be able to sort the hierarchy simply, using a minimum of RAM and CPU cycles.
For those of you who don't know what threading is, it's organizing articles/comments/entries in a hierarchy based on their relationship to a prior article/comment/post. For example:
|- Comment 1
|---- Comment 4 (reply to comment 1)
|------- Comment 8 (reply to comment 4)
|---- Comment 6 (reply to comment 1)
|- Comment 2
|---- Comment 3 (reply to comment 2)
|------- Comment 7 (reply to comment 3)
|---------- Comment 9 (reply to comment 7)
|- Comment 5
Now, so long as you know the parent (Comment four is the parent of Comment eight), you can recurse the comments and create a hierarchical array. But it's inefficient both on the PHP back end (recursion is laborious), and in terms of the SQL query that may have to pull way more comments than you want to display.
So, while it's a somewhat simple solution in terms of storing the comments, it's a memory and processor hog in terms of getting them out and displaying them. I was sure there had to be something more efficient.
I read about presort order traversal, recursive processes, and all sorts of stuff that made my head ache. Then, I *finally* saw a solution that was pure genius. Creating a "lineage string" for each comment.
Lineage is the parentage of the comment plus the comment's own unique id, using an autoincrement to create unique IDs. Parentage is the lineage of the parent. Let's create a super-simple table and show how this works.
CREATE TABLE `comments` (
`uid` INT( 4 ) UNSIGNED ZEROFILL NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`parentage` VARCHAR( 220 ),
`content` TEXT NOT NULL ,
`post_id` VARCHAR( 10 )
) ENGINE = MYISAM ;
Here's the threaded list above with the parentage strings you'd store in parentheses. For each comment, if it wasn't a reply to another comment, it would have no parentage. If it was a reply to a comment, it would have that comment's parentage with that comment's own ID appended to the parentage.
|- Comment 1 ()
|---- Comment 4 (-0001)
|------- Comment 8 (-0001-0004)
|---- Comment 6 (-0001)
|- Comment 2 ()
|---- Comment 3 (-0002)
|------- Comment 7 (-0002-0003)
|---------- Comment 9 (-0002-0003-0007)
|- Comment 5 ()
Now, if you just sorted by parentage, your list would not really be threaded. It would sort as follows.
Comment 1 ()
Comment 2 ()
Comment 5 ()
Comment 4 (-0001)
Comment 6 (-0001)
Comment 8 (-0001-0004)
Comment 3 (-0002)
Comment 7 (-0002-0003)
Comment 9 (-0002-0003-0007)
For this to sort properly, you need the "lineage" (parentage with the comments own uid appended). Then you get...
|Comment 1 (-0001)
|--Comment 4 (-0001-0004)
|----Comment 8 (-0001-0004-0008)
|--Comment 6 (-0001-0006)
|Comment 2 (-0002)
|--Comment 3 (-0002-0003)
|----Comment 7 (-0002-0003-0007)
|------Comment 9 (-0002-0003-0007-0009)
|Comment 5 (-0005)
Now they sort properly.
Storing the full lineage string as an entry in the table itself is difficult and creates other problems, because you have to know the comment's ID when you store it. But the lineage is just the parentage and uid separated with a dash. So combine them during the SELECT.
SELECT *, CONCAT(parentage, '-', uid) as lineage FROM `trees` WHERE
post_id = "post50" ORDER BY lineage ASC LIMIT 0,5
You get the following selection (lineage in parentheses)...
Comment 1 (-0001)
Comment 4 (-0001-0004)
Comment 8 (-0001-0004-0008)
Comment 6 (-0001-0006)
Comment 2 (-0002)
When you want to show the next 5, you merely adjust the limit.
Spooling them out would be a simple matter of counting the dashes in the lineage string to determine the indentation level.
Now, of course, if a record was deleted and the autoincrement values reindexed, your whole table goes to Hell. So you have to be careful that doesn't happen.
And, with a 4-digit zerofill, as used above, you're limited to 9999 comments in the database before things break. But I just used 4-digit to keep the examples clean.
An important thing to understand with this method is that it only works for one threaded sorting method, oldest to newest. You can sort in other ways unthreaded, but it only threads and indents correctly in oldest to newest.


Entries (RSS)
I, too, wanted to create a threaded bbs style comment forum for my web pages. I wanted to keep individual threads in order, but I had the additional requirement that the most recently updated threads would bubble to the top. I think that active threads should be shown first and inactive threads drop down as they are ignored.
I use a thread string to do the sorting and I keep a last-comment-date on the root record. The string was an interesting problem. You show it as 4 digit decimal with a dash to separate. I went a step further and made a "base 94" conversion routine that makes the id a much smaller string and gives me a larger total message base. Why 94 ? I wanted to use printable ascii characters below 127 and use another printable character as a separator. I got 94.
I put the thing on a bunch of my sites and I have about 6000 messages. It seems to be working fine except for some badly designed web pages.
I works with a javascript snippet or can be embedded on the page with a php script. It uses the current page is the index so you can put one on every page and have a different set of threads.
You can see one the pages at: http://www.harptab.com/
I wanted to make it a "product" but I'll probably never get around to writing the extra management pages.
Keith
Do you have a working example of this code? I agree if the lineage string worked it would be pure genius, but I have yet to see the concept work practically.
For starters, if you use the zero fill option and sort via ASC, -0001-0004-0008 is a smaller number than -0001-0004 and therefore posts *above* its parent. One would think that sorting DESC would fix this problem, it does, but it doesn't fix what happens regardless of zerofill or not... the lineage sort breaks at at the 10's place value. 1-2 is greater than 1-11 and all 10+ digit posts will list the same way a directory lists
1.file --> .10.file --> 2.file first, second and last respectively.
Urgh... as far as the query goes, it is short and sweet which makes the lineage string too good to be true. I would love to see this concept in action, and would appreciate it if you'd point me to where there is a working example.
I've tried it and it simply doesn't work for me, but I'm hoping this doesn't mean that it doesn't work at all. Thanks in advance!
Just be warned that this technique will suffer performance issues. The query
SELECT *, CONCAT(parentage, '-', uid) as lineage FROM `trees` WHEREpost_id = "post50" ORDER BY lineage ASC LIMIT 0,5
cannot use an index for the sorting, so the entire table must be read to create the values for "lineage" to sort them, even though only 5 results are returned.