开发者

MySQL - Best method to handle this hierarchical data?

开发者 https://www.devze.com 2023-01-05 05:27 出处:网络
This is a followup to: MySQL - Is it possible to get all sub-items in a hierarchy? I have an arbitrary-depth adjacency list model table (I am at the point that I can convert it into a nested set mode

This is a followup to:

MySQL - Is it possible to get all sub-items in a hierarchy?

I have an arbitrary-depth adjacency list model table (I am at the point that I can convert it into a nested set model.

I read the MySQL data on how to use a nested set model, though it seemed to get increasingly complex and very complex to do basic functions such as inserting, updating 开发者_运维技巧and deleting.

Another blog showing how to use a trigger system with the adjacency list model to keep a table of ancestors that relates each object to its ancestors.


Right now I need to be able to return a list of all children of a given node, to change or delete them. This hierarchical structure won't be changing all the time once created, but there will be a mass amount of the hierarchical structures.

The three methods I see are:

  1. Created a Stored Procedure which would do a recursive query that returns all children.

  2. Convert to Nested Set Model which would require to get into the complexities and possibly create a stored procedure to add, edit and delete in that.

  3. Create the Ancestor Table described above on insert/delete triggers to handle all of the data.

If there are other methods I'm not exploring, please let me know and I'll update this list.


Quassnoi has run some performance tests on the nested sets model and the adjacency list model and documented the results and recommendations in his blog post Adjacency list vs. nested sets: MySQL. The executive summary is:

  • Nested sets is faster for fetching all child nodes or all parent nodes.
  • Nested sets is a bad idea if you frequently need to update the table.

Here is the conclusion from his article:

In MySQL, the nested sets model should be preferred if the updates to the hierarhical structure are infrequent and it is affordable to lock the table for the duration of an update (which can take minutes on a long table).

This implies creating the table using MyISAM storage engine, creating the bounding box of a GEOMETRY type as described above, indexing it with a SPATIAL index and persisting the level in the table.

If the updates to the table are frequent or it is inaffordable to lock the table for a long period of time implied by an update, then the adjacency list model should be used to store the hierarchical data.

This requires creating a function to query the table.

The rest of the article shows how to define the table, implement the queries and gives performance measurements. The use of the spatial index is a clever idea to improve the performance of the nested set model that might be new to you.


If you're also considering approaches without MySQL then you might want to look at PostgreSQL which is another free and open-source database. PostgreSQL supports recursive queries in the form of recursive common table expressions which make querying heirarchical data easier than in MySQL and also give better performance. Quassnoi has also written an article Adjacency list vs. nested sets: PostgreSQL that shows the details.

While we are talking about looking at other approaches, Oracle's database is also worth a mention. Oracle also have a custom extension CONNECT BY which make querying heirarchical data very easy and fast. Quassnoi's article Adjacency list vs. nested sets: Oracle again covers the performance details. The query you need to get all children is extremely simple in this case:

SELECT *
FROM yourtable
START WITH id = 42
CONNECT BY parent = PRIOR id


I would always go with the Nested Set for shear simplicity and convienience. I always suggest this article. It shows excelent the queries that are needed for the work with such hierachrchical data. The only disadvantage I see here is that it can get slower with inserting/updateing new records when the hierachry reached a certain level of complexity, but the reading is faster than many other solutions I hae seen.

Just to give you an example from the article above:

SELECT t1.name AS lev1, t2.name as lev2, t3.name as lev3, t4.name as lev4
FROM category AS t1
LEFT JOIN category AS t2 ON t2.parent = t1.category_id
LEFT JOIN category AS t3 ON t3.parent = t2.category_id
LEFT JOIN category AS t4 ON t4.parent = t3.category_id
WHERE t1.name = 'ELECTRONICS';

+-------------+----------------------+--------------+-------+
| lev1        | lev2                 | lev3         | lev4  |
+-------------+----------------------+--------------+-------+
| ELECTRONICS | TELEVISIONS          | TUBE         | NULL  |
| ELECTRONICS | TELEVISIONS          | LCD          | NULL  |
| ELECTRONICS | TELEVISIONS          | PLASMA       | NULL  |
| ELECTRONICS | PORTABLE ELECTRONICS | MP3 PLAYERS  | FLASH |
| ELECTRONICS | PORTABLE ELECTRONICS | CD PLAYERS   | NULL  |
| ELECTRONICS | PORTABLE ELECTRONICS | 2 WAY RADIOS | NULL  |
+-------------+----------------------+--------------+-------+
6 rows in set (0.00 sec)

SQL wise, I don't think it can get any prettier and simpler ;)

I have no idea to the Stored Procedure way. But since it involces recursion (in your case), I don't know if it will be fast with many levels in the hierarchy. I assume you can give it a try.


Maybe you should consider using document-oriented database like MongoDB. It could make your life a lot easier.


When dealing with hierarchical data sets I find it best to approach it with caching in mind. One of the main benefits to this way of dealing with this issue this way is it doesn't require de-normalizing you database into something that might be harder to mutate.

Since memory heaps' (memcache,redis,etc) lookups are much faster than SQL for simple id -> data resolutions, I would use them to cache a list of the ids of direct children for each node. This way you can get decent performance via a recursive algorithm to build a complete list for any node.

To add/delete a new node, you will only need to invalidate its' direct parent cache O(1).

If that's not fast enough, you can add another layer of cache to a list of all child of a node at each node. In order for this to work with a decently mutable dataset, you should record the cache performance (ratio of fresh/cached hits) of each node and set a tolerance level for when to store the cache. This also can be stored in a memory heap since it's non-vital data.

If you use this more advanced caching model, you will need to note these complete children node lists will need to be invalidated when any of it's children are changed O(log n).

Once you have your list of children id's you can use SQL's WHERE id IN( id1, id2, .... ) syntax to query for what you want.


I once had to store a complex hierarchical arbitrary-depth bill-of-material system in a SQL-like database manager that wasn't really up to the task, and it ended up forcing messy and tricky indicies, data definitions, queries, etc. After restarting from scratch, using the db manager to provide only an API for record reads and writes on simple indexed keys, and doing all of the actual input/manipulation/reporting in external code, the final result was quicker to implement, easier to understand, and simpler to maintain and enhance. The most complex query needed was essentially SELECT A FROM B.

So, instead of embedding logic and operations inside the restrictions of MySQL, consider banging out code to do what you want, and relying on MySQL only for the lowest-level gets/puts.

0

精彩评论

暂无评论...
验证码 换一张
取 消