comment_id |
body |
parent_id |
1 | JungleDragon rocks |
|
2 | No it doesn't! |
1 |
3 | Oh yes it does! |
2 |
4 | The economy sucks |
|
5 | True | 4 |
So far so good? Yes, but the real problems start when we want to query that table for useful stuff. Imagine the following tasks:
- Get the full comment tree sorted in the correct order. Hint: you cannot use date sorting, since replies can be made anywhere and must be positioned below their parent at all times.
- For each comment in the tree, we want to know how deep it is (how many levels deep), so we can use that for indenting the comment below its parent.
- For each comment we want to know how many replies there are, no matter how deep in the tree.
As said, this is a classic problem, and luckily there are existing solutions and patterns for this. This article explains four different approaches, I highly recommend it. In summary, these are the four methods:
- Recursive. The approach we already discussed. This approach is inefficient and does not scale.
- Stack. This one is not much different from the recursive approach. The idea is that you build a stack of hierarchical data by looping through the rows, typically inside a stored procedure. Once you have the stack built, it is very easy to use and query, however, it still requires a lot of database connections.
- Modified Preorder Tree Traversal. This surely is the most sophisticated, but also most complex, of the methods to work with hierarchical data. I encourage you to read the full description to really understand it. The principle of this method is that we assign smart numbers (left and right) to each leaf in the tree, these numbers indicate the relative position of the node towards other nodes. With a tree preordered like this, querying the hierarchical data becomes very fast and easy. There is a major downfall though: as soon as you insert or change something in the tree, it has to be rebuild. This will lead to many update queries, and these are expensive. The larger the tree, the larger the time of inserting a new node.
- Flat table model. Surely we know that if we could preprocess things like the level of the comment (how deep is it) and the rank (sort order), querying would become much easier. The author also admits that this does shift the performance problem to the writes: upon inserting a comment, all others ranks have to be recalculated. Anyways, the idea of this model is that you store essential attributes of the tree, instead of calculating them each time we retrieve the tree. Kind of like a tree cache.
The Flat Table Model done right
Let's just get to it right away. Here's my(heavily inspired) solution:
comment_id |
body |
parent_id |
lineage | deep |
1 | JungleDragon rocks |
1 |
0 | |
2 | No it doesn't! |
1 |
1-2 |
1 |
3 | Oh yes it does! |
2 | 1-2-3 |
2 |
4 | The economy sucks |
4 |
0 | |
5 | True | 4 | 4-5 | 1 |
Let's go into a little bit more detail concerning the insertion of a new comment. Here's how it works:
- in our save routine of a new comment, we check whether it came from a parent (is parent_id set?)
- if so, we do a query to retrieve the parent row
- next, we calculate the lineage and deepness of the new comment. For lineage, we use the parent's lineage. For deepness, we use the parent's deepness + 1.
- we now insert our new comment. next, we use the returned comment id of the new comment, add it to the lineage of the newly created comment, and save it again (update).
Querying this structure is even better. It is super easy to get all comments in the right order, get the indentation, and even the replies per comment, no matter how many levels of nesting we have. This query kind of combines these three tasks:
SELECT c.id, c.user_id, c.body, c.deep, c.lineage, c.parent_id, (SELECT COUNT(*) FROM comment where comment.lineage LIKE (CONCAT(c.lineage,'%')) AND comment.lineage!=c.lineage) as replies
FROM comment as c
order by c.lineage
Careful readers will see that we are using a subquery to count the replies for each comment. This part does have some performance overhead, but not much. If you do not need to show the number of replies to each comment, leave it our for even faster results. Or, you could take this even further by storing the number of replies in the database. The downfall of that approach is that you will need to recursively update all parents of the current comment row. That's not as bad as it sounds, it is not likely that you will have more than a handful of nesting levels for a single comment, in fact, most will have none, one or two.
Referential integrity
What about referential integrity? Since we have flattened the lineage and deep columns, we have lost some of that. No worries though, we are keeping our referential parent_id -> id foreign key. We do have referential integrity at the foundation. Should anything go wrong with the lineage or deep columns, then we can easily use the parent_id -> id relationship to "rebuild" those values.
Usage
The advanced flat model is optimized for a comment system, it should not be used everywhere. Particularly it is suitable for wide trees (little nesting), not deep trees. The reason for this limitation is the lineage column, which flattens the hierarchy in a single column. When there are too many levels of nesting (let's say over 20), a different model may be more suitable.
Yet another way
Throughout my online research, I found yet another way to work with hierarchical data. This approach is again a flat model approach, but this time the author reserves one column per nesting level in the table to store the deepness. The problem is that this results in a hardcoded nesting level limit. The other problem is that it is patented. Go figure. That's like patenting breathing.
Conclusion
There are multiple approaches towards storing and retrieving hierarchical data in (my)SQL, but for comment systems and other wide tree hierarchies, I hope you like my tweaked flat model approach. It has no recursion, not even loops, fast retrieves, the least writes, and a solid foundation of referential integrity. Credits go out to the author of the article and the commenter that suggested the lineage method. I only extended those ideas a little bit.
In closing, I like to end with a small tip. In many online examples, I see people looping through a depth counter to append things like spaces for indentation. This is not needed, almost every language has a function for this, here it is in PHP:
str_repeat(" ",$row['deep']);
...where the first param is the indentation string, and the second param is the number of indentations.
Enjoy!
Implementations: Survey system - questionnaires dependency, forums..tx for a good writing ;)
ReplyDeleteKakicode: Hierarchical Data Using Mysql >>>>> Download Now
ReplyDelete>>>>> Download Full
Kakicode: Hierarchical Data Using Mysql >>>>> Download LINK
>>>>> Download Now
Kakicode: Hierarchical Data Using Mysql >>>>> Download Full
>>>>> Download LINK