Storing hierarchical data in a relational database is a classic problem.
Relational databases are not really designed for this purpose, because
they store data in flat rows. Storage is actually the least of our
problems. Imagine a commenting system with an unlimited level of nesting
in replies. This is easy to store in a relational database:
comment_id
|
body
|
parent_id
|
1 |
JungleDragon rocks
|
|
2 |
No it doesn't!
|
1
|
3 |
Oh yes it does!
|
2 |
4 |
The economy sucks
|
|
5 |
True |
4 |
For the sake of simplicity, I have omitted other comment fields such as
author, date, etc. The table shows that we can easily store unlimited
levels of nesting by simply linking to the parent of the current
comment. Comments that have no parent are the root comments, the others
are leafs. If we would apply a foreign key to the parent_id column and
link it to the id column we even get full referential integrity.
So far so good? Yes, but the real problems start when we want to query that table for useful stuff. Imagine the following tasks:
- Get the full comment tree sorted in the correct order. Hint: you
cannot use date sorting, since replies can be made anywhere and must be
positioned below their parent at all times.
- For each comment in the tree, we want to know how deep it is (how
many levels deep), so we can use that for indenting the comment below
its parent.
- For each comment we want to know how many replies there are, no matter how deep in the tree.
None of the above tasks can be queried efficiently using the table we
started with. The problem is that since we do not know how many levels
deep the hierarchy is, we need to make recursive calls, leading to many
queries to the database, high memory usage in our scripts and generally a
system that does not scale.
As said, this is a classic problem, and luckily there are existing solutions and patterns for this.
This article explains four different approaches, I highly recommend it. In summary, these are the four methods:
- Recursive. The approach we already discussed. This approach is inefficient and does not scale.
- Stack. This one is not much different from the recursive
approach. The idea is that you build a stack of hierarchical data by
looping through the rows, typically inside a stored procedure. Once you
have the stack built, it is very easy to use and query, however, it
still requires a lot of database connections.
- Modified Preorder Tree Traversal. This surely is the most
sophisticated, but also most complex, of the methods to work with
hierarchical data. I encourage you to read the full description to
really understand it. The principle of this method is that we assign
smart numbers (left and right) to each leaf in the tree, these numbers
indicate the relative position of the node towards other nodes. With a
tree preordered like this, querying the hierarchical data becomes very
fast and easy. There is a major downfall though: as soon as you insert
or change something in the tree, it has to be rebuild. This will lead to
many update queries, and these are expensive. The larger the tree, the
larger the time of inserting a new node.
- Flat table model. Surely we know that if we could preprocess
things like the level of the comment (how deep is it) and the rank
(sort order), querying would become much easier. The author also admits
that this does shift the performance problem to the writes: upon
inserting a comment, all others ranks have to be recalculated. Anyways,
the idea of this model is that you store essential attributes of the tree, instead of calculating them each time we retrieve the tree. Kind of like a tree cache.
Based on the options presented so far, it seems the flat table model
looks promising, yet it does have a major limitation: a very expensive
write operation. Luckily, one of the commenters of the article posted a
way around this. I have taken his idea as a basis, and extended it
somewhat.
The Flat Table Model done right
Let's just get to it right away. Here's my(heavily inspired) solution:
comment_id
|
body
|
parent_id
|
lineage |
deep |
1 |
JungleDragon rocks
|
|
1
|
0 |
2 |
No it doesn't!
|
1
|
1-2
|
1
|
3 |
Oh yes it does!
|
2 |
1-2-3
|
2 |
4 |
The economy sucks
|
|
4
|
0 |
5 |
True |
4 |
4-5 |
1 |
Instead of using a rank, we are using a lineage. They have the same
purpose (correctly ordering the comments), it's just that a lineage has
one major advantage: it never needs to be updated! As you can see, the
lineage column displays the hierarchy ids in a flat column (from highest
parent to lowest child). This field is calculated upon insertion, no
other inserts need to take place. The deep column simply sets the
nesting level of the comment, which is handy when we need this value for
indentation later on. By storing it, we require no recursive queries at
all.
Let's go into a little bit more detail concerning the insertion of a new comment. Here's how it works:
- in our save routine of a new comment, we check whether it came from a parent (is parent_id set?)
- if so, we do a query to retrieve the parent row
- next, we calculate the lineage and deepness of the new comment. For
lineage, we use the parent's lineage. For deepness, we use the parent's
deepness + 1.
- we now insert our new comment. next, we use the returned comment id
of the new comment, add it to the lineage of the newly created comment,
and save it again (update).
All in all, we are retrieving a row, inserting one, and updating one.
Considering the other highly inefficient methods and knowing that most
people will read comments and not write them, this is pretty awesome. We
have no recursion, not even a regular loop.
Querying this structure is even better. It is super easy to get all
comments in the right order, get the indentation, and even the replies
per comment, no matter how many levels of nesting we have. This query
kind of combines these three tasks:
SELECT c.id, c.user_id, c.body, c.deep, c.lineage, c.parent_id,
(SELECT COUNT(*) FROM comment where comment.lineage LIKE
(CONCAT(c.lineage,'%')) AND comment.lineage!=c.lineage) as replies
FROM comment as c
order by c.lineage
Careful readers will see that we are using a subquery to count the
replies for each comment. This part does have some performance overhead,
but not much. If you do not need to show the number of replies to
each
comment, leave it our for even faster results. Or, you could take this
even further by storing the number of replies in the database. The
downfall of that approach is that you will need to recursively update
all parents of the current comment row. That's not as bad as it sounds,
it is not likely that you will have more than a handful of nesting
levels for a single comment, in fact, most will have none, one or two.
Referential integrity
What about referential integrity? Since we have flattened the lineage
and deep columns, we have lost some of that. No worries though, we are
keeping our referential parent_id -> id foreign key. We do have
referential integrity at the foundation. Should anything go wrong with
the lineage or deep columns, then we can easily use the parent_id ->
id relationship to "rebuild" those values.
Usage
The advanced flat model is optimized for a comment system, it should not
be used everywhere. Particularly it is suitable for wide trees (little
nesting), not deep trees. The reason for this limitation is the lineage
column, which flattens the hierarchy in a single column. When there are
too many levels of nesting (let's say over 20), a different model may be
more suitable.
Yet another way
Throughout my online research, I found
yet another way to work with hierarchical data.
This approach is again a flat model approach, but this time the author
reserves one column per nesting level in the table to store the
deepness. The problem is that this results in a hardcoded nesting level
limit. The other problem is that it is patented. Go figure. That's like
patenting breathing.
Conclusion
There are multiple approaches towards storing and retrieving
hierarchical data in (my)SQL, but for comment systems and other wide
tree hierarchies, I hope you like my tweaked flat model approach. It has
no recursion, not even loops, fast retrieves, the least writes, and a
solid foundation of referential integrity. Credits go out to the author
of the article and the commenter that suggested the lineage method. I
only extended those ideas a little bit.
In closing, I like to end with a small tip. In many online examples, I
see people looping through a depth counter to append things like spaces
for indentation. This is not needed, almost every language has a
function for this, here it is in PHP:
str_repeat(" ",$row['deep']);
...where the first param is the indentation string, and the second param is the number of indentations.
Enjoy!